Getting started with ekorpkit#


Introduction#

ekorpkit stands for eKonomic Research Python Toolkit. Despite the name, it is not limited to economic research: it is a Python library for natural language processing and machine learning, designed to support Korean as well as English.

ekorpkit provides a flexible interface for NLP and ML research pipelines such as extraction, transformation, tokenization, training, and visualization. Its powerful configuration composition is backed by Hydra.

Getting familiar with Hydra and OmegaConf before you start is highly recommended.

Key features#

Easy Configuration#

  • You can compose your configuration dynamically, which makes it easy to tailor the configuration to each research project.

  • You can override everything from the command line, which makes experimentation fast, and removes the need to maintain multiple similar configuration files.

  • With the help of the eKonf class, it is also easy to compose configurations in a Jupyter notebook environment.
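To make the idea of composition concrete, here is a minimal sketch that uses only the Python standard library. It layers a per-experiment fragment over a set of defaults. It is not how ekorpkit or Hydra implement composition; `deep_merge` and all keys in the example are hypothetical names chosen for illustration.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which `override` wins over `base`, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Otherwise the override simply replaces the base value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults and a per-experiment fragment (not real ekorpkit configs).
defaults = {"tokenizer": {"name": "simple", "lowercase": False}, "num_workers": 1}
experiment = {"tokenizer": {"lowercase": True}, "num_workers": 4}

config = deep_merge(defaults, experiment)
print(config)
```

Only the keys named in the fragment change; everything else is inherited from the defaults, which is what removes the need for several near-identical configuration files.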

No Boilerplate#

  • eKorpkit lets you focus on the problem at hand instead of spending time on boilerplate code such as command-line flags, configuration file loading, and logging.

Workflows#

  • A workflow is a configurable automated process that will run one or more jobs.

  • You can divide your research into several unit jobs (tasks), then combine those jobs into one workflow.

  • You can have multiple workflows, each of which can perform a different set of tasks.
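The workflow idea described above can be sketched in plain Python as a list of jobs that each take and return a shared state. This is an illustration of the concept only, not ekorpkit's workflow API; `run_workflow`, the job functions, and the workflow names are all hypothetical.

```python
def fetch(state: dict) -> dict:
    # Hypothetical job: stand in for downloading raw documents.
    return {**state, "docs": ["First  document.", "Second   document."]}


def normalize(state: dict) -> dict:
    # Hypothetical job: collapse repeated whitespace in each document.
    return {**state, "docs": [" ".join(doc.split()) for doc in state["docs"]]}


def count_words(state: dict) -> dict:
    # Hypothetical job: add a simple corpus summary.
    return {**state, "num_words": sum(len(doc.split()) for doc in state["docs"])}


def run_workflow(jobs, state=None) -> dict:
    """Run the jobs in order, passing the state returned by one job to the next."""
    state = dict(state or {})
    for job in jobs:
        state = job(state)
    return state


# Several workflows can reuse the same unit jobs in different combinations.
workflows = {
    "build": [fetch, normalize],
    "summarize": [fetch, normalize, count_words],
}

result = run_workflow(workflows["summarize"])
print(result)
```

Keeping each job small and side-effect free is a design choice of this sketch: it makes jobs easy to reorder, reuse across workflows, and share individually.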

Sharable and Reproducible#

  • With eKorpkit, you can easily share your datasets and models.

  • Sharing configs along with datasets and models makes your research reproducible.

  • You can share individual unit jobs or an entire workflow.

Pluggable Architecture#

  • eKorpkit has a pluggable architecture, so you can combine it with your own implementations.

Installation#

To use ekorpkit, you need to install it first. The recommended way is to use pip.

Install the latest version of ekorpkit by running:

pip install -U ekorpkit

To install all extra dependencies, run:

pip install "ekorpkit[all]"

To install every extra dependency exhaustively (not recommended), run:

pip install "ekorpkit[exhaustive]"

To install or upgrade the pre-release version of ekorpkit, run:

pip install -U --pre ekorpkit

Extra dependencies#

To list the extra dependency sets:

from ekorpkit import eKonf
eKonf.dependencies()
['tokenize',
 'all',
 'mecab',
 'dataset',
 'tokenize-en',
 'topic',
 'visualize',
 'parser',
 'wiki',
 'fomc',
 'edgar',
 'transformers',
 'model',
 'automl',
 'cached-path',
 'google',
 'ddbackend',
 'fetch',
 'doc',
 'disco',
 'art',
 'dalle-mini',
 'label',
 'exhaustive']

To see the libraries in a given dependency set:

eKonf.dependencies("tokenize")
{'emoji<2.0',
 'fugashi',
 'mecab-ko-dic',
 'mecab-python3',
 'nltk',
 'pynori',
 'pysbd',
 'sacremoses',
 'soynlp'}
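Given a set of requirement strings like the one above, you may want to know which packages are still missing from your environment. The following is a small standard-library sketch, not part of ekorpkit; `requirement_name` and `missing_packages` are hypothetical helpers, and the check covers only whether a package is installed, not whether its version satisfies a constraint such as `<2.0`.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """Extract the distribution name from a requirement string such as 'emoji<2.0'."""
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement.strip())
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return match.group(0)


def missing_packages(requirements) -> list:
    """Return the names in `requirements` that are not installed in this environment."""
    missing = []
    for requirement in sorted(requirements):
        name = requirement_name(requirement)
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# The set printed above by eKonf.dependencies("tokenize").
tokenize_deps = {
    "emoji<2.0", "fugashi", "mecab-ko-dic", "mecab-python3", "nltk",
    "pynori", "pysbd", "sacremoses", "soynlp",
}
print(missing_packages(tokenize_deps))
```

The printed list depends on your environment; an empty list means every package in the set is already installed.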

Usage#

Via Command Line Interface (CLI)#

The CLI is useful when you already have a configuration file and need only minor changes to parameter values.

!ekorpkit
name        : ekorpkit
author      : Young Joon Lee
description : eKorpkit provides a flexible interface for NLP and ML research pipelines such as extraction, transformation, tokenization, training, and visualization.
website     : https://entelecheia.github.io/ekorpkit-book/
version     : 0.1.38+23.g8fb9921.dirty

Execute `ekorpkit --help` to see what eKorpkit provides

CLI example to build a corpus#

ekorpkit \
    verbose=false \
    print_config=false \
    num_workers=1 \
    cmd=fetch_builtin_corpus \
    +corpus/builtin=_dummy_fomc_minutes \
    corpus.builtin.io.force.summarize=true \
    corpus.builtin.io.force.preprocess=true \
    corpus.builtin.io.force.build=false \
    corpus.builtin.io.force.download=false
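The arguments in the command above follow Hydra's override syntax: `key=value` sets a (possibly nested) value, a key containing `/` selects an option from a config group, and a leading `+` adds something that is not already in the defaults. The sketch below shows how such strings map onto a nested structure. It is a simplified illustration, not Hydra's parser; `parse_value` and `parse_overrides` are hypothetical helpers that handle only booleans, integers, and plain strings.

```python
def parse_value(raw: str):
    """Convert an override value to bool or int when possible, else keep the string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(raw)
    except ValueError:
        return raw


def parse_overrides(args):
    """Split overrides into config-group selections and nested value overrides."""
    groups, values = {}, {}
    for arg in args:
        key, _, raw = arg.partition("=")
        key = key.lstrip("+")  # a leading '+' marks an addition in Hydra's syntax
        if "/" in key:
            groups[key] = raw  # e.g. corpus/builtin=_dummy_fomc_minutes
            continue
        # Walk (and create) the nested dicts named by the dotted key.
        *parents, leaf = key.split(".")
        node = values
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return groups, values


args = [
    "verbose=false",
    "num_workers=1",
    "cmd=fetch_builtin_corpus",
    "+corpus/builtin=_dummy_fomc_minutes",
    "corpus.builtin.io.force.summarize=true",
    "corpus.builtin.io.force.build=false",
]
groups, values = parse_overrides(args)
print(groups)
print(values)
```

Reading the command this way shows that the four `corpus.builtin.io.force.*` arguments all set flags under one nested `force` section of the selected built-in corpus config.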

CLI Help#

ekorpkit --help

Via Python#

You can also use ekorpkit in a more Pythonic way. The eKonf class provides various helper functions; with it, you can compose a config and instantiate a class from that configuration.

from ekorpkit import eKonf
cfg = eKonf.compose()
print('Config type:', type(cfg))
eKonf.print(cfg)
Config type: <class 'omegaconf.dictconfig.DictConfig'>
{'_config_': 'about.app',
 '_target_': 'ekorpkit.cli.cmd',
 'about': {'app': {'_target_': 'ekorpkit.cli.about',
                   'author': 'Young Joon Lee',
                   'description': 'eKorpkit provides a flexible interface for '
                                  'NLP and ML research pipelines such as '
                                  'extraction, transformation, tokenization, '
                                  'training, and visualization.',
                   'name': 'ekorpkit',
                   'version': '0.1.38+23.g8fb9921.dirty',
                   'website': 'https://entelecheia.github.io/ekorpkit-book/'}},
 'app_name': 'ekorpkit',
 'corpus': {'_target_': 'ekorpkit.datasets.corpus.Corpus',
            'auto': {'load': True, 'merge': False},
            'column_info': {'_target_': 'ekorpkit.info.column.CorpusInfo',
                            'columns': {'id': 'id',
                                        'merge_meta_on': 'id',
                                        'text': 'text',
                                        'timestamp': None},
                            'data': {'id': 'int', 'text': 'str'},
                            'datetime': {'columns': None,
                                         'format': None,
                                         'rcParams': None},
                            'meta': None,
                            'segment_separator': '\\n\\n',
                            'sentence_separator': '\\n',
                            'timestamp': {'format': None,
                                          'key': None,
                                          'rcParams': None}},
            'data_dir': '/workspace/data/datasets/corpus',
            'filetype': None,
            'force': {'build': False},
            'metadata_dir': None,
            'name': None,
            'path': {'cache': {'cache_dir': '/workspace/.cache',
                               'extract_archive': True,
                               'force_extract': False,
                               'path': None,
                               'return_parent_dir': True,
                               'uri': None,
                               'verbose': False},
                     'cached_path': None,
                     'columns': None,
                     'concat_data': False,
                     'data_columns': None,
                     'data_dir': '/workspace/data/datasets/corpus',
                     'data_file': None,
                     'filetype': None,
                     'name': None,
                     'output_dir': None,
                     'output_file': None,
                     'root': '/workspace/data',
                     'suffix': None,
                     'verbose': False},
            'use_name_as_subdir': True,
            'verbose': False},
 'dataset': {'_target_': 'ekorpkit.datasets.dataset.Dataset',
             'auto': {'build': False, 'load': True},
             'column_info': {'_target_': 'ekorpkit.info.column.DatasetInfo',
                             'columns': {'id': 'id', 'text': 'text'},
                             'data': {'id': 'int', 'text': 'str'},
                             'datetime': {'columns': None,
                                          'format': None,
                                          'rcParams': None}},
             'data_dir': '/workspace/data/datasets/dataset',
             'filetype': '.parquet',
             'force': {'build': False},
             'info': {'_target_': 'ekorpkit.info.stat.SummaryInfo',
                      'aggregate_info': {'num_examples': 'num_examples',
                                         'size_in_bytes': 'num_bytes'},
                      'data_dir': '/workspace/data/datasets/dataset',
                      'info_file': 'info-None.yaml',
                      'info_list': ['name',
                                    'fullname',
                                    'domain',
                                    'task',
                                    'lang',
                                    'description',
                                    'license',
                                    'homepage',
                                    'version',
                                    'num_examples',
                                    'size_in_bytes',
                                    'size_in_human_bytes',
                                    'data_files_modified',
                                    'info_updated',
                                    'data_files',
                                    'column_info'],
                      'key_columns': None,
                      'modified_info': {'data_files_modified': 'data_file'},
                      'name': None,
                      'stats': {'_func_': {'len_bytes': {'_partial_': True,
                                                         '_target_': 'ekorpkit.utils.func.len_bytes'}},
                                '_partial_': True,
                                '_target_': 'ekorpkit.info.stat.summary_stats',
                                'agg_funcs': {'num_bytes': ['count',
                                                            'sum',
                                                            'median',
                                                            'max',
                                                            'min']},
                                'convert_to_humanbytes': {'num_bytes': 'human_bytes'},
                                'key_columns': None,
                                'num_columns': {'num_bytes': 'len_bytes'},
                                'num_workers': 1,
                                'rename_columns': {'num_bytes_count': 'num_examples',
                                                   'num_bytes_sum': 'num_bytes'},
                                'text_keys': 'text'},
                      'update_files_info': {'data_files': 'data_file',
                                            'meta_files': 'meta_file'},
                      'update_info': ['fullname',
                                      'lang',
                                      'domain',
                                      'task',
                                      'description',
                                      'license',
                                      'homepage',
                                      'version'],
                      'verbose': False},
             'name': None,
             'path': {'cache': {'cache_dir': '/workspace/.cache',
                                'extract_archive': True,
                                'force_extract': False,
                                'path': None,
                                'return_parent_dir': True,
                                'uri': None,
                                'verbose': False},
                      'cached_path': None,
                      'columns': None,
                      'concat_data': False,
                      'data_columns': None,
                      'data_dir': '/workspace/data/datasets/dataset',
                      'data_file': None,
                      'filetype': '.parquet',
                      'name': None,
                      'output_dir': None,
                      'output_file': None,
                      'root': '/workspace/data',
                      'suffix': None,
                      'verbose': False},
             'use_name_as_subdir': True,
             'verbose': False},
 'debug_mode': False,
 'dir': {'archive': '/workspace/data/archive',
         'cache': '/workspace/.cache',
         'corpus': '/workspace/data/datasets/corpus',
         'data': '/workspace/data',
         'dataset': '/workspace/data/datasets',
         'ekorpkit': '/workspace/projects/ekorpkit/ekorpkit',
         'home': '/root',
         'log': '/workspace/projects/ekorpkit-book/logs',
         'model': '/workspace/data/models',
         'output': '/workspace/projects/ekorpkit-book/outputs',
         'project': '/workspace/projects/ekorpkit-book',
         'resource': '/workspace/projects/ekorpkit/ekorpkit/resources',
         'runtime': '/workspace/projects/ekorpkit-book/ekorpkit-book/docs/lectures/deep_nlp',
         'tmp': '/workspace/.tmp',
         'workspace': '/workspace'},
 'env': {'batcher': {'backend': 'joblib',
                     'minibatch_size': 1000,
                     'procs': '230',
                     'task_num_cpus': 1,
                     'task_num_gpus': 0,
                     'verbose': 10},
         'dask': {'n_workers': '230'},
         'distributed_framework': {'backend': 'joblib',
                                   'initialize': True,
                                   'num_workers': '230'},
         'dotenv': None,
         'dotenv_path': '/workspace/projects/ekorpkit-book/ekorpkit-book/docs/lectures/deep_nlp/.env',
         'os': {'CACHED_PATH_CACHE_ROOT': '/workspace/.cache/cached_path',
                'KMP_DUPLICATE_LIB_OK': 'TRUE'},
         'ray': {'num_cpus': '230'}},
 'ignore_warnings': True,
 'model': {'data_dir': '/workspace/projects/ekorpkit-book/data',
           'name': 'ekorpkit-book',
           'num_workers': '230',
           'output_dir': '/workspace/projects/ekorpkit-book/outputs/ekorpkit-book',
           'verbose': False},
 'name': 'ekorpkit-book',
 'num_workers': '230',
 'preprocessor': {'normalizer': {'_target_': 'ekorpkit.preprocessors.normalizer.Normalizer',
                                 'ftfy': {'decode_inconsistent_utf8': True,
                                          'fix_c1_controls': True,
                                          'fix_character_width': True,
                                          'fix_encoding': True,
                                          'fix_latin_ligatures': True,
                                          'fix_line_breaks': True,
                                          'fix_surrogates': True,
                                          'max_decode_length': 1000000,
                                          'normalization': 'NFKC',
                                          'remove_control_chars': True,
                                          'remove_terminal_escapes': True,
                                          'replace_lossy_sequences': True,
                                          'restore_byte_a0': True,
                                          'uncurl_quotes': True,
                                          'unescape_html': True},
                                 'hanja2hangle': False,
                                 'num_repeats': 2,
                                 'spaces': {'collapse_whitespaces': True,
                                            'fix_whitespaces': True,
                                            'num_spaces_for_tab': 4,
                                            'replace_tabs': True,
                                            'strip': True},
                                 'special_characters': {'fix_ellipsis': True,
                                                        'fix_emoticons': False,
                                                        'fix_hyphens': True,
                                                        'fix_slashes': True,
                                                        'fix_tildes': True,
                                                        'regular_parentheses_only': False,
                                                        'single_quotes_only': False}},
                  'segmenter': {'chunk': {'_func_': {'len_bytes': {'_partial_': True,
                                                                   '_target_': 'ekorpkit.utils.func.len_bytes'},
                                                     'len_words': {'_partial_': True,
                                                                   '_target_': 'ekorpkit.utils.func.len_words'}},
                                          'chunk_overlap': False,
                                          'chunk_size': 300,
                                          'len_func': 'len_bytes'},
                                'filter_language': {'detection_level': 'segment',
                                                    'filter': False,
                                                    'languages_to_keep': ['en',
                                                                          'ko'],
                                                    'min_language_probability': 0.8},
                                'filter_programming_language': False,
                                'filter_sentence_length': {'filter': False,
                                                           'min_length': 10,
                                                           'min_num_words': 3},
                                'merge': {'broken_lines_threshold': 0.4,
                                          'empty_lines_threshold': 0.6,
                                          'merge_level': 'segment',
                                          'merge_lines': False},
                                'print_args': False,
                                'return_as_list': False,
                                'separators': {'in_segment': '\\n\\n',
                                               'in_sentence': '\\n',
                                               'out_segment': '\\n\\n',
                                               'out_sentence': '\\n'},
                                'split': {'keep_segment': True,
                                          'max_recover_length': 30000,
                                          'max_recover_step': 0},
                                'verbose': True},
                  'tokenizer': {'_target_': 'ekorpkit.preprocessors.tokenizer.SimpleTokenizer',
                                'extract': {'noun_postags': ['NNG',
                                                             'NNP',
                                                             'XSN',
                                                             'SL',
                                                             'XR',
                                                             'NNB',
                                                             'NR'],
                                            'postag_delim': '/',
                                            'postag_length': None,
                                            'postags': None,
                                            'stop_postags': ['SP',
                                                             'SF',
                                                             'SE',
                                                             'SSO',
                                                             'SSC',
                                                             'SC',
                                                             'SY',
                                                             'SH'],
                                            'strip_pos': True},
                                'normalize': None,
                                'return_as_list': False,
                                'stopwords': {'_target_': 'ekorpkit.preprocessors.stopwords.Stopwords',
                                              'lowercase': True,
                                              'name': 'stopwords',
                                              'nltk_stopwords': None,
                                              'stopwords': None,
                                              'stopwords_path': None,
                                              'verbose': False},
                                'stopwords_path': None,
                                'tagset': None,
                                'tokenize': {'flatten': True,
                                             'include_whitespace_token': True,
                                             'lowercase': False,
                                             'postag_delim': '/',
                                             'postag_length': None,
                                             'punct_postags': ['SF',
                                                               'SP',
                                                               'SSO',
                                                               'SSC',
                                                               'SY'],
                                             'strip_pos': False,
                                             'tokenize_each_word': False,
                                             'userdic_path': None,
                                             'wordpieces_prefix': '##'},
                                'tokenize_article': {'sentence_separator': '\\n'},
                                'verbose': False}},
 'print_config': False,
 'print_resolved_config': True,
 'project': {'name': 'ekorpkit-book'},
 'verbose': False}
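Many entries in the output above carry a `_target_` key, and some also carry `_partial_`. In Hydra's convention, `_target_` is the dotted import path of the callable to construct and `_partial_: True` asks for a `functools.partial` rather than an immediate call. The sketch below illustrates that mechanism with the standard library only, assuming eKonf.instantiate follows the same convention. It is not ekorpkit's or Hydra's code: `locate` and `instantiate` here are simplified, hypothetical stand-ins that do not recurse into nested configs, and the targets are standard-library callables chosen so the example runs anywhere.

```python
import importlib
from functools import partial


def locate(path: str):
    """Import and return the object named by a dotted path like 'package.module.Name'."""
    module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def instantiate(cfg: dict):
    """Call (or partially apply) the callable named by cfg['_target_'] with the other keys."""
    kwargs = {k: v for k, v in cfg.items() if k not in ("_target_", "_partial_")}
    target = locate(cfg["_target_"])
    if cfg.get("_partial_"):
        return partial(target, **kwargs)  # defer the call, keeping kwargs bound
    return target(**kwargs)


# A plain object: equivalent to fractions.Fraction(numerator=3, denominator=4).
three_quarters = instantiate(
    {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
)

# A partial: equivalent to functools.partial(int, base=2).
parse_binary = instantiate({"_target_": "builtins.int", "_partial_": True, "base": 2})

print(three_quarters, parse_binary("101"))
```

Because the class is named by a string, swapping in your own implementation only requires pointing `_target_` at a different import path.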

Examples#

config_group = 'preprocessor/tokenizer=nltk'
cfg = eKonf.compose(config_group=config_group)
eKonf.print(cfg)
nltk = eKonf.instantiate(cfg)
{'_target_': 'ekorpkit.preprocessors.tokenizer.NLTKTokenizer',
 'extract': {'noun_postags': ['NN', 'NNP', 'NNS', 'NNPS'],
             'postag_delim': '/',
             'postag_length': None,
             'postags': None,
             'stop_postags': ['.'],
             'strip_pos': True},
 'nltk': {'lemmatize': False,
          'lemmatizer': {'_target_': 'nltk.stem.WordNetLemmatizer'},
          'stem': True,
          'stemmer': {'_target_': 'nltk.stem.PorterStemmer'}},
 'normalize': None,
 'return_as_list': False,
 'stopwords': {'_target_': 'ekorpkit.preprocessors.stopwords.Stopwords',
               'lowercase': True,
               'name': 'stopwords',
               'nltk_stopwords': None,
               'stopwords': None,
               'stopwords_path': None,
               'verbose': False},
 'stopwords_path': None,
 'tagset': None,
 'tokenize': {'flatten': True,
              'include_whitespace_token': True,
              'lowercase': False,
              'postag_delim': '/',
              'postag_length': None,
              'punct_postags': ['SF', 'SP', 'SSO', 'SSC', 'SY'],
              'strip_pos': False,
              'tokenize_each_word': False,
              'userdic_path': None,
              'wordpieces_prefix': '##'},
 'tokenize_article': {'sentence_separator': '\\n'},
 'verbose': False}
text = "I shall reemphasize some of those thoughts today in the context of legislative proposals that are now before the current Congress."
nltk.tokenize(text)
['i/PRP',
 'shall/MD',
 'reemphas/VB',
 'some/DT',
 'of/IN',
 'those/DT',
 'thought/NNS',
 'today/NN',
 'in/IN',
 'the/DT',
 'context/NN',
 'of/IN',
 'legisl/JJ',
 'propos/NNS',
 'that/WDT',
 'are/VBP',
 'now/RB',
 'befor/IN',
 'the/DT',
 'current/JJ',
 'congress/NNP',
 './.']
nltk.nouns(text)
['thought', 'today', 'context', 'propos', 'congress']
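The tokens above have the form `word/TAG`, joined by the `postag_delim` from the config, and the noun list matches the tokens whose tag appears in `noun_postags`. The following sketch reproduces that filtering on the tokenizer output shown above. It is an illustration of the idea, not the NLTKTokenizer implementation; `extract_nouns` is a hypothetical helper.

```python
def extract_nouns(tokens, noun_postags, delim="/"):
    """Keep the word part of every 'word/TAG' token whose tag is a noun tag."""
    nouns = []
    for token in tokens:
        # Split on the last delimiter so that './.' yields word '.' and tag '.'.
        word, _, tag = token.rpartition(delim)
        if word and tag in noun_postags:
            nouns.append(word)
    return nouns


# Output of nltk.tokenize(text) as shown above.
tokens = [
    "i/PRP", "shall/MD", "reemphas/VB", "some/DT", "of/IN", "those/DT",
    "thought/NNS", "today/NN", "in/IN", "the/DT", "context/NN", "of/IN",
    "legisl/JJ", "propos/NNS", "that/WDT", "are/VBP", "now/RB", "befor/IN",
    "the/DT", "current/JJ", "congress/NNP", "./.",
]

# 'noun_postags' from the 'extract' section of the config above.
noun_postags = ["NN", "NNP", "NNS", "NNPS"]

nouns = extract_nouns(tokens, noun_postags)
print(nouns)  # ['thought', 'today', 'context', 'propos', 'congress']
```

The words are stemmed ("propos", "reemphas") because the config shown above sets `stem` to True with the Porter stemmer.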

To use the mecab tokenizer:

cfg = eKonf.compose('preprocessor/tokenizer=mecab')
mecab = eKonf.instantiate(cfg)
text = 'IMF가 推定한 우리나라의 GDP갭률은 今年에도 소폭의 마이너스(−)를 持續하고 있다.'
mecab.tokenize(text)
['IMF/SL',
 '가/JKS',
 '/SP',
 '推定/NNG',
 '한/XSA+ETM',
 '/SP',
 '우리나라/NNG',
 '의/JKG',
 '/SP',
 'GDP/SL',
 '갭/NNG',
 '률/XSN',
 '은/JX',
 '/SP',
 '今年/NNG',
 '에/JKB',
 '도/JX',
 '/SP',
 '소폭/NNG',
 '의/JKG',
 '/SP',
 '마이너스/NNG',
 '(/SSO',
 '−)/SY',
 '를/JKO',
 '/SP',
 '持續/NNG',
 '하/XSV',
 '고/EC',
 '/SP',
 '있/VX',
 '다/EF',
 './SF']

To normalize formal Korean text:

cfg_norm = eKonf.compose('preprocessor/normalizer=formal_ko')
norm = eKonf.instantiate(cfg_norm)
norm(text)
'IMF가 추정한 우리나라의 GDP갭률은 금년에도 소폭의 마이너스(-)를 지속하고 있다.'
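Comparing the input with the output above, the normalizer replaced the hanja words with their hangul readings and the minus sign with an ASCII hyphen. The toy sketch below mimics that effect with a lookup table built only from this one input and output pair. It is not how the formal_ko normalizer works internally; `REPLACEMENTS` and `normalize_toy` are hypothetical names, and a real normalizer needs a far more complete mapping.

```python
# Toy replacement table derived only from the example above (an assumption, not ekorpkit data).
REPLACEMENTS = {
    "推定": "추정",
    "今年": "금년",
    "持續": "지속",
    "\u2212": "-",  # MINUS SIGN -> ASCII hyphen-minus
}


def normalize_toy(text: str, replacements=REPLACEMENTS) -> str:
    """Apply each replacement in turn; a stand-in for a real normalizer."""
    for source, target in replacements.items():
        text = text.replace(source, target)
    return text


text = "IMF가 推定한 우리나라의 GDP갭률은 今年에도 소폭의 마이너스(\u2212)를 持續하고 있다."
normalized = normalize_toy(text)
print(normalized)
```

Normalizing before tokenizing matters because the tokenizer then sees a single, consistent spelling for each word.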

To instantiate a mecab tokenizer that uses the normalizer config above:

cfg = eKonf.compose("preprocessor/tokenizer=mecab")
cfg.normalize = cfg_norm
mecab = eKonf.instantiate(cfg)
mecab.tokenize(text)
['IMF/SL',
 '가/JKS',
 '/SP',
 '추정/NNG',
 '한/XSA+ETM',
 '/SP',
 '우리나라/NNG',
 '의/JKG',
 '/SP',
 'GDP/SL',
 '갭/NNG',
 '률/XSN',
 '은/JX',
 '/SP',
 '금년/NNG',
 '에/JKB',
 '도/JX',
 '/SP',
 '소폭/NNG',
 '의/JKG',
 '/SP',
 '마이너스/NNG',
 '(/SSO',
 '-)/SY',
 '를/JKO',
 '/SP',
 '지속/NNG',
 '하/XSV',
 '고/EC',
 '/SP',
 '있/VX',
 '다/EF',
 './SF']

Text to image example#

eKonf.setLogger("WARNING")
cfg = eKonf.compose("model/disco")
disco = eKonf.instantiate(cfg)
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
Loading model from: /opt/conda/lib/python3.8/site-packages/lpips/weights/v0.1/vgg.pth
results = disco.imagine(
    text_prompts="steampunk vegetable market, cute, pixar, octane render, epic composition, wide angle",
    n_samples=1,
    show_collage=False,
    width_height=[768, 512],
)
1 samples generated to /workspace/projects/ekorpkit-book/disco-imagen/outputs/disco-diffusion/TimeToDisco
text prompts: steampunk vegetable market, cute, pixar, octane render, epic composition, wide angle
sample image paths:
/workspace/projects/ekorpkit-book/disco-imagen/outputs/disco-diffusion/TimeToDisco/TimeToDisco(2)_0000.png