Usage#

Via Command Line Interface (CLI)#

!ekorpkit

CLI example to build a corpus#

ekorpkit --config-dir /workspace/projects/ekorpkit-book/config  \
    project=esgml \
    dir.workspace=/workspace \
    verbose=false \
    print_config=false \
    num_workers=1 \
    cmd=fetch_builtin_corpus \
    +corpus/builtin=_dummy_fomc_minutes \
    corpus.builtin.io.force.summarize=true \
    corpus.builtin.io.force.preprocess=true \
    corpus.builtin.io.force.build=false \
    corpus.builtin.io.force.download=false
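
The command above passes its settings as dotted `key=value` overrides. As a rough mental model only, the sketch below folds such overrides into a nested dictionary. It is not ekorpkit's actual parser, which may treat types, the `+` prefix, and group names such as `corpus/builtin` differently. The helper names `parse_value` and `apply_overrides` are hypothetical.

```python
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Convert a raw override string to bool or int, else keep it as str."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(raw)
    except ValueError:
        return raw


def apply_overrides(overrides: List[str]) -> Dict[str, Any]:
    """Fold dotted key=value overrides into a nested dict (illustration only)."""
    config: Dict[str, Any] = {}
    for item in overrides:
        key, _, raw = item.partition("=")
        key = key.lstrip("+")  # assumption: "+" marks an added key; ignored here
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate levels on demand
        node[leaf] = parse_value(raw)
    return config


overrides = [
    "project=esgml",
    "num_workers=1",
    "corpus.builtin.io.force.summarize=true",
    "corpus.builtin.io.force.download=false",
]
config = apply_overrides(overrides)
print(config["corpus"]["builtin"]["io"]["force"])
```

Read this way, the four `corpus.builtin.io.force.*` flags in the command all set entries in the same nested `force` section.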

CLI Help#

To see the configuration options available from the CLI, run:

!ekorpkit --help
!ekorpkit --info defaults

Via Python#
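
The Python examples in this section only work if the `ekorpkit` package is importable in the active environment. A minimal check, using only the standard library, might look like the sketch below. The helper name `is_installed` is hypothetical, and the suggestion to install with `pip install ekorpkit` assumes the distribution name matches the import name.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package) is not None


if not is_installed("ekorpkit"):
    # Assumption: the package is published under the same name as the module.
    print("ekorpkit is not installed; try `pip install ekorpkit` first")
```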

Compose an ekorpkit config#

from ekorpkit import eKonf
cfg = eKonf.compose()
print('Config type:', type(cfg))
eKonf.pprint(cfg)

Instantiating objects with an ekorpkit config#
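
The examples in this section follow a compose-then-instantiate pattern: a config describes which class to build and with which parameters, and `eKonf.instantiate` turns it into an object. The sketch below illustrates that general idea with plain Python. It assumes a `_target_` key holding a dotted import path, a convention borrowed from Hydra-style configs. Whether ekorpkit configs use exactly this key is an assumption, and the function here is not ekorpkit's implementation.

```python
import importlib
from typing import Any, Dict


def instantiate(cfg: Dict[str, Any]) -> Any:
    """Build an object from a dict: '_target_' names the class, the rest are kwargs."""
    params = dict(cfg)  # copy so the caller's config is left untouched
    module_path, _, attr = params.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attr)
    return target(**params)


# A stand-in config that targets a standard-library class.
cfg = {"_target_": "datetime.timedelta", "hours": 1, "minutes": 30}
delta = instantiate(cfg)
print(delta.total_seconds())
```

Keeping the class path in the config means a different tokenizer or normalizer can be swapped in by changing configuration rather than code.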

compose a config for the nltk class#

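from ekorpkit import eKonf

# Select the nltk option from the preprocessor/tokenizer config group
config_group = 'preprocessor/tokenizer=nltk'
cfg = eKonf.compose(config_group=config_group)
eKonf.pprint(cfg)
# Build the tokenizer object described by the config
nltk = eKonf.instantiate(cfg)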
text = "I shall reemphasize some of those thoughts today in the context of legislative proposals that are now before the current Congress."
nltk.tokenize(text)
nltk.nouns(text)

compose a config for the mecab class#

config_group='preprocessor/tokenizer=mecab'
cfg = eKonf.compose(config_group=config_group)
eKonf.pprint(cfg)

instantiate a mecab config and tokenize a text#

mecab = eKonf.instantiate(cfg)
# Sample sentence that mixes Hangul and Hanja
text = 'IMF가 推定한 우리나라의 GDP갭률은 今年에도 소폭의 마이너스(−)를 持續하고 있다.'
mecab.tokenize(text)

compose and instantiate a formal_ko config for the normalizer class#

config_group = 'preprocessor/normalizer=formal_ko'
cfg_norm = eKonf.compose(config_group=config_group)
norm = eKonf.instantiate(cfg_norm)
# The instantiated normalizer is callable on a string
norm(text)

instantiate a mecab config with the above normalizer config#

config_group = 'preprocessor/tokenizer=mecab'
cfg = eKonf.compose(config_group=config_group)
# Attach the normalizer config composed above to the tokenizer config
cfg.normalize = cfg_norm
mecab = eKonf.instantiate(cfg)
mecab.tokenize(text)
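
Assigning `cfg_norm` to `cfg.normalize` nests one component's config inside another, presumably so that the tokenizer normalizes text before splitting it. The sketch below shows that composition pattern with a stand-in tokenizer and Unicode NFKC normalization. The `WhitespaceTokenizer` class and `nfkc` function are hypothetical, and neither reproduces what the mecab tokenizer or the `formal_ko` normalizer actually do.

```python
import unicodedata
from typing import Callable, List, Optional


class WhitespaceTokenizer:
    """Stand-in tokenizer that optionally normalizes text before splitting."""

    def __init__(self, normalize: Optional[Callable[[str], str]] = None):
        self.normalize = normalize

    def tokenize(self, text: str) -> List[str]:
        if self.normalize is not None:
            text = self.normalize(text)  # normalization runs first
        return text.split()


def nfkc(text: str) -> str:
    """Fold compatibility characters such as full-width letters to plain forms."""
    return unicodedata.normalize("NFKC", text)


sample = "ＧＤＰ　ｇａｐ"  # full-width letters separated by an ideographic space
plain = WhitespaceTokenizer()
with_norm = WhitespaceTokenizer(normalize=nfkc)
print(plain.tokenize(sample))
print(with_norm.tokenize(sample))
```

The two tokenizers share the same splitting logic. Only the injected normalizer changes the tokens that come out.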