Getting started with ekorpkit#
Introduction#
ekorpkit is an acronym for eKonomic Research Python Toolkit. Despite the name, it is not limited to economic research: it is a Python library for natural language processing and machine learning, designed in particular to support Korean as well as English.
ekorpkit provides a flexible interface for NLP and ML research pipelines such as extraction, transformation, tokenization, training, and visualization. Its powerful configuration composition is backed by Hydra.
It is highly recommended to become familiar with Hydra and OmegaConf.
Key features#
Easy Configuration#
You can compose your configuration dynamically, making it easy to arrive at the right configuration for each research project.
You can override everything from the command line, which makes experimentation fast, and removes the need to maintain multiple similar configuration files.
With the help of the eKonf class, it is also easy to compose configurations in a Jupyter notebook environment.
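To illustrate what a dotted command-line override like `corpus.builtin.io.force.build=false` does conceptually, here is a minimal sketch in plain Python. Hydra and OmegaConf implement this (and much more) for you; the `apply_override` function below is purely illustrative and is not part of ekorpkit's API.

```python
# Illustration only: how a Hydra-style dotted override conceptually
# updates a nested config. Hydra/OmegaConf handle this in practice.

def apply_override(config: dict, override: str) -> dict:
    """Apply a single 'dotted.key=value' override to a nested dict."""
    dotted_key, _, raw_value = override.partition("=")
    # Interpret a few common literals the way OmegaConf would.
    value = {"true": True, "false": False, "null": None}.get(raw_value, raw_value)
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config

config = {"corpus": {"builtin": {"io": {"force": {"build": True}}}}}
apply_override(config, "corpus.builtin.io.force.build=false")
print(config["corpus"]["builtin"]["io"]["force"]["build"])  # False
```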
No Boilerplate#
eKorpkit lets you focus on the problem at hand instead of spending time on boilerplate code like command line flags, loading configuration files, logging etc.
Workflows#
A workflow is a configurable automated process that will run one or more jobs.
You can divide your research into several unit jobs (tasks), then combine those jobs into one workflow.
You can have multiple workflows, each of which can perform a different set of tasks.
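The task-and-workflow idea above can be sketched as ordinary Python: a workflow is an ordered list of unit jobs, each of which hands its result to the next. The task names here are made up for illustration and are not ekorpkit's actual API.

```python
# Hypothetical sketch (not ekorpkit's API): a workflow as an ordered
# list of unit jobs (tasks), run in sequence over shared state.

def fetch(state):
    return state + ["fetched"]

def preprocess(state):
    return state + ["preprocessed"]

def train(state):
    return state + ["trained"]

def run_workflow(tasks, state):
    """Run each task in order, threading the state through."""
    for task in tasks:
        state = task(state)
    return state

# Two workflows can reuse the same unit jobs in different combinations.
build_workflow = [fetch, preprocess]
full_workflow = [fetch, preprocess, train]
print(run_workflow(full_workflow, []))  # ['fetched', 'preprocessed', 'trained']
```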
Pluggable Architecture#
eKorpkit has a pluggable architecture, enabling it to combine with your own implementation.
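The plug-in point is the Hydra-style `_target_` key that appears throughout the configs below: because `_target_` is just a dotted import path, it can point at any importable class, including your own implementation. The following is a simplified stand-in for Hydra/eKonf instantiation, assuming only the `_target_` convention, not ekorpkit's actual code.

```python
# Minimal sketch of "_target_"-style instantiation: import the dotted
# path and call it with the remaining config keys as keyword arguments.
import importlib

def instantiate(config: dict):
    """Instantiate the class named by '_target_' with the other keys."""
    module_path, _, class_name = config["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    kwargs = {k: v for k, v in config.items() if k != "_target_"}
    return cls(**kwargs)

# Swap in your own class by changing "_target_"; stdlib Counter is used
# here only as a stand-in for a user-supplied implementation.
obj = instantiate({"_target_": "collections.Counter", "a": 2, "b": 1})
print(obj)  # Counter({'a': 2, 'b': 1})
```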
Installation#
To use ekorpkit, you need to install it first. The recommended way is to use pip.
Install the latest version of ekorpkit by running:
pip install -U ekorpkit
To install all extra dependencies:
pip install ekorpkit[all]
To install every extra dependency exhaustively (not recommended):
pip install ekorpkit[exhaustive]
To install or upgrade the pre-release version of ekorpkit, run:
pip install -U --pre ekorpkit
Extra dependencies#
To list the extra dependency sets:
from ekorpkit import eKonf
eKonf.dependencies()
['tokenize',
'all',
'mecab',
'dataset',
'tokenize-en',
'topic',
'visualize',
'parser',
'wiki',
'fomc',
'edgar',
'model',
'transformers',
'automl',
'cached-path',
'google',
'ddbackend',
'fetch',
'doc',
'disco',
'art',
'dalle-mini',
'label',
'exhaustive']
To see the list of libraries in each dependency set:
eKonf.dependencies("tokenize")
{'emoji<2.0',
'fugashi',
'mecab-ko-dic',
'mecab-python3',
'nltk',
'pynori',
'pysbd',
'sacremoses',
'soynlp'}
Usage#
Via Command Line Interface (CLI)#
The CLI is useful when you already have a configuration file and only need to modify a few parameter values.
!ekorpkit
[2022-08-26 08:52:39,439][ekorpkit.base][INFO] - Loaded .env from /workspace/projects/ekorpkit-book/config/.env
[2022-08-26 08:52:39,443][ekorpkit.base][INFO] - setting environment variable CACHED_PATH_CACHE_ROOT to /workspace/.cache/cached_path
[2022-08-26 08:52:39,443][ekorpkit.base][INFO] - setting environment variable KMP_DUPLICATE_LIB_OK to TRUE
name : ekorpkit
author : Young Joon Lee
description : eKorpkit provides a flexible interface for NLP and ML research pipelines such as extraction, transformation, tokenization, training, and visualization.
website : https://entelecheia.github.io/ekorpkit-book/
version : 0.1.38+20.g61522ea
Execute `ekorpkit --help` to see what eKorpkit provides
CLI example to build a corpus#
ekorpkit --config-dir /workspace/projects/ekorpkit-book/config \
    project=esgml \
    dir.workspace=/workspace \
    verbose=false \
    print_config=false \
    num_workers=1 \
    cmd=fetch_builtin_corpus \
    +corpus/builtin=_dummy_fomc_minutes \
    corpus.builtin.io.force.summarize=true \
    corpus.builtin.io.force.preprocess=true \
    corpus.builtin.io.force.build=false \
    corpus.builtin.io.force.download=false
CLI Help#
ekorpkit --help
Via Python#
You can use ekorpkit in more Pythonic ways. The eKonf class provides various helper functions: with it, you can compose a config and instantiate a class from that configuration.
from ekorpkit import eKonf
cfg = eKonf.compose()
print('Config type:', type(cfg))
eKonf.print(cfg)
Config type: <class 'omegaconf.dictconfig.DictConfig'>
{'_config_': 'about.app',
'_target_': 'ekorpkit.cli.cmd',
'about': {'app': {'_target_': 'ekorpkit.cli.about',
'author': 'Young Joon Lee',
'description': 'eKorpkit provides a flexible interface for '
'NLP and ML research pipelines such as '
'extraction, transformation, tokenization, '
'training, and visualization.',
'name': 'ekorpkit',
'version': '0.1.38+20.g61522ea',
'website': 'https://entelecheia.github.io/ekorpkit-book/'}},
'app_name': 'ekorpkit',
'corpus': {'_target_': 'ekorpkit.datasets.corpus.Corpus',
'auto': {'load': True, 'merge': False},
'column_info': {'_target_': 'ekorpkit.info.column.CorpusInfo',
'columns': {'id': 'id',
'merge_meta_on': 'id',
'text': 'text',
'timestamp': None},
'data': {'id': 'int', 'text': 'str'},
'datetime': {'columns': None,
'format': None,
'rcParams': None},
'meta': None,
'segment_separator': '\\n\\n',
'sentence_separator': '\\n',
'timestamp': {'format': None,
'key': None,
'rcParams': None}},
'data_dir': '/workspace/data/datasets/corpus',
'filetype': None,
'force': {'build': False},
'metadata_dir': None,
'name': None,
'path': {'cache': {'cache_dir': '/workspace/.cache',
'extract_archive': True,
'force_extract': False,
'path': None,
'return_parent_dir': True,
'uri': None,
'verbose': False},
'cached_path': None,
'columns': None,
'concat_data': False,
'data_columns': None,
'data_dir': '/workspace/data/datasets/corpus',
'data_file': None,
'filetype': None,
'name': None,
'output_dir': None,
'output_file': None,
'root': '/workspace/data',
'suffix': None,
'verbose': False},
'use_name_as_subdir': True,
'verbose': False},
'dataset': {'_target_': 'ekorpkit.datasets.dataset.Dataset',
'auto': {'build': False, 'load': True},
'column_info': {'_target_': 'ekorpkit.info.column.DatasetInfo',
'columns': {'id': 'id', 'text': 'text'},
'data': {'id': 'int', 'text': 'str'},
'datetime': {'columns': None,
'format': None,
'rcParams': None}},
'data_dir': '/workspace/data/datasets/dataset',
'filetype': '.parquet',
'force': {'build': False},
'info': {'_target_': 'ekorpkit.info.stat.SummaryInfo',
'aggregate_info': {'num_examples': 'num_examples',
'size_in_bytes': 'num_bytes'},
'data_dir': '/workspace/data/datasets/dataset',
'info_file': 'info-None.yaml',
'info_list': ['name',
'fullname',
'domain',
'task',
'lang',
'description',
'license',
'homepage',
'version',
'num_examples',
'size_in_bytes',
'size_in_human_bytes',
'data_files_modified',
'info_updated',
'data_files',
'column_info'],
'key_columns': None,
'modified_info': {'data_files_modified': 'data_file'},
'name': None,
'stats': {'_func_': {'len_bytes': {'_partial_': True,
'_target_': 'ekorpkit.utils.func.len_bytes'}},
'_partial_': True,
'_target_': 'ekorpkit.info.stat.summary_stats',
'agg_funcs': {'num_bytes': ['count',
'sum',
'median',
'max',
'min']},
'convert_to_humanbytes': {'num_bytes': 'human_bytes'},
'key_columns': None,
'num_columns': {'num_bytes': 'len_bytes'},
'num_workers': 1,
'rename_columns': {'num_bytes_count': 'num_examples',
'num_bytes_sum': 'num_bytes'},
'text_keys': 'text'},
'update_files_info': {'data_files': 'data_file',
'meta_files': 'meta_file'},
'update_info': ['fullname',
'lang',
'domain',
'task',
'description',
'license',
'homepage',
'version'],
'verbose': False},
'name': None,
'path': {'cache': {'cache_dir': '/workspace/.cache',
'extract_archive': True,
'force_extract': False,
'path': None,
'return_parent_dir': True,
'uri': None,
'verbose': False},
'cached_path': None,
'columns': None,
'concat_data': False,
'data_columns': None,
'data_dir': '/workspace/data/datasets/dataset',
'data_file': None,
'filetype': '.parquet',
'name': None,
'output_dir': None,
'output_file': None,
'root': '/workspace/data',
'suffix': None,
'verbose': False},
'use_name_as_subdir': True,
'verbose': False},
'debug_mode': False,
'dir': {'archive': '/workspace/data/archive',
'cache': '/workspace/.cache',
'corpus': '/workspace/data/datasets/corpus',
'data': '/workspace/data',
'dataset': '/workspace/data/datasets',
'ekorpkit': '/workspace/projects/ekorpkit/ekorpkit',
'home': '/root',
'log': '/workspace/projects/ekorpkit-book/logs',
'model': '/workspace/data/models',
'output': '/workspace/projects/ekorpkit-book/outputs',
'project': '/workspace/projects/ekorpkit-book',
'resource': '/workspace/projects/ekorpkit/ekorpkit/resources',
'runtime': '/workspace/projects/ekorpkit-book/ekorpkit-book/docs/lectures/deep_nlp',
'tmp': '/workspace/.tmp',
'workspace': '/workspace'},
'env': {'batcher': {'backend': 'joblib',
'minibatch_size': 1000,
'procs': '230',
'task_num_cpus': 1,
'task_num_gpus': 0,
'verbose': 10},
'dask': {'n_workers': '230'},
'distributed_framework': {'backend': 'joblib',
'initialize': True,
'num_workers': '230'},
'dotenv': None,
'dotenv_path': '/workspace/projects/ekorpkit-book/ekorpkit-book/docs/lectures/deep_nlp/.env',
'os': {'CACHED_PATH_CACHE_ROOT': '/workspace/.cache/cached_path',
'KMP_DUPLICATE_LIB_OK': 'TRUE'},
'ray': {'num_cpus': '230'}},
'ignore_warnings': True,
'model': {'data_dir': '/workspace/projects/ekorpkit-book/data',
'name': 'ekorpkit-book',
'num_workers': '230',
'output_dir': '/workspace/projects/ekorpkit-book/outputs/ekorpkit-book',
'verbose': False},
'name': 'ekorpkit-book',
'num_workers': '230',
'preprocessor': {'normalizer': {'_target_': 'ekorpkit.preprocessors.normalizer.Normalizer',
'ftfy': {'decode_inconsistent_utf8': True,
'fix_c1_controls': True,
'fix_character_width': True,
'fix_encoding': True,
'fix_latin_ligatures': True,
'fix_line_breaks': True,
'fix_surrogates': True,
'max_decode_length': 1000000,
'normalization': 'NFKC',
'remove_control_chars': True,
'remove_terminal_escapes': True,
'replace_lossy_sequences': True,
'restore_byte_a0': True,
'uncurl_quotes': True,
'unescape_html': True},
'hanja2hangle': False,
'num_repeats': 2,
'spaces': {'collapse_whitespaces': True,
'fix_whitespaces': True,
'num_spaces_for_tab': 4,
'replace_tabs': True,
'strip': True},
'special_characters': {'fix_ellipsis': True,
'fix_emoticons': False,
'fix_hyphens': True,
'fix_slashes': True,
'fix_tildes': True,
'regular_parentheses_only': False,
'single_quotes_only': False}},
'segmenter': {'chunk': {'_func_': {'len_bytes': {'_partial_': True,
'_target_': 'ekorpkit.utils.func.len_bytes'},
'len_words': {'_partial_': True,
'_target_': 'ekorpkit.utils.func.len_words'}},
'chunk_overlap': False,
'chunk_size': 300,
'len_func': 'len_bytes'},
'filter_language': {'detection_level': 'segment',
'filter': False,
'languages_to_keep': ['en',
'ko'],
'min_language_probability': 0.8},
'filter_programming_language': False,
'filter_sentence_length': {'filter': False,
'min_length': 10,
'min_num_words': 3},
'merge': {'broken_lines_threshold': 0.4,
'empty_lines_threshold': 0.6,
'merge_level': 'segment',
'merge_lines': False},
'print_args': False,
'return_as_list': False,
'separators': {'in_segment': '\\n\\n',
'in_sentence': '\\n',
'out_segment': '\\n\\n',
'out_sentence': '\\n'},
'split': {'keep_segment': True,
'max_recover_length': 30000,
'max_recover_step': 0},
'verbose': True},
'tokenizer': {'_target_': 'ekorpkit.preprocessors.tokenizer.SimpleTokenizer',
'extract': {'noun_postags': ['NNG',
'NNP',
'XSN',
'SL',
'XR',
'NNB',
'NR'],
'postag_delim': '/',
'postag_length': None,
'postags': None,
'stop_postags': ['SP',
'SF',
'SE',
'SSO',
'SSC',
'SC',
'SY',
'SH'],
'strip_pos': True},
'normalize': None,
'return_as_list': False,
'stopwords': {'_target_': 'ekorpkit.preprocessors.stopwords.Stopwords',
'lowercase': True,
'name': 'stopwords',
'nltk_stopwords': None,
'stopwords': None,
'stopwords_path': None,
'verbose': False},
'stopwords_path': None,
'tagset': None,
'tokenize': {'flatten': True,
'include_whitespace_token': True,
'lowercase': False,
'postag_delim': '/',
'postag_length': None,
'punct_postags': ['SF',
'SP',
'SSO',
'SSC',
'SY'],
'strip_pos': False,
'tokenize_each_word': False,
'userdic_path': None,
'wordpieces_prefix': '##'},
'tokenize_article': {'sentence_separator': '\\n'},
'verbose': False}},
'print_config': False,
'print_resolved_config': True,
'project': {'name': 'ekorpkit-book'},
'verbose': False}
Examples#
config_group = 'preprocessor/tokenizer=nltk'
cfg = eKonf.compose(config_group=config_group)
eKonf.pprint(cfg)
nltk = eKonf.instantiate(cfg)
INFO:ekorpkit.base:setting environment variable CACHED_PATH_CACHE_ROOT to /workspace/.cache/cached_path
INFO:ekorpkit.base:setting environment variable KMP_DUPLICATE_LIB_OK to TRUE
{'_target_': 'ekorpkit.preprocessors.tokenizer.NLTKTokenizer',
'extract': {'noun_postags': ['NN', 'NNP', 'NNS', 'NNPS'],
'postag_delim': '/',
'postag_length': None,
'postags': None,
'stop_postags': ['.'],
'strip_pos': True},
'nltk': {'lemmatize': False,
'lemmatizer': {'_target_': 'nltk.stem.WordNetLemmatizer'},
'stem': True,
'stemmer': {'_target_': 'nltk.stem.PorterStemmer'}},
'normalize': None,
'return_as_list': False,
'stopwords': {'_target_': 'ekorpkit.preprocessors.stopwords.Stopwords',
'lowercase': True,
'name': 'stopwords',
'nltk_stopwords': None,
'stopwords': None,
'stopwords_path': None,
'verbose': False},
'stopwords_path': None,
'tagset': None,
'tokenize': {'flatten': True,
'include_whitespace_token': True,
'lowercase': False,
'postag_delim': '/',
'postag_length': None,
'punct_postags': ['SF', 'SP', 'SSO', 'SSC', 'SY'],
'strip_pos': False,
'tokenize_each_word': False,
'userdic_path': None,
'wordpieces_prefix': '##'},
'tokenize_article': {'sentence_separator': '\\n'},
'verbose': False}
INFO:ekorpkit.preprocessors.tokenizer:instantiating ekorpkit.preprocessors.stopwords.Stopwords...
text = "I shall reemphasize some of those thoughts today in the context of legislative proposals that are now before the current Congress."
nltk.tokenize(text)
['i/PRP',
'shall/MD',
'reemphas/VB',
'some/DT',
'of/IN',
'those/DT',
'thought/NNS',
'today/NN',
'in/IN',
'the/DT',
'context/NN',
'of/IN',
'legisl/JJ',
'propos/NNS',
'that/WDT',
'are/VBP',
'now/RB',
'befor/IN',
'the/DT',
'current/JJ',
'congress/NNP',
'./.']
nltk.nouns(text)
['thought', 'today', 'context', 'propos', 'congress']
To use the mecab tokenizer:
cfg = eKonf.compose('preprocessor/tokenizer=mecab')
mecab = eKonf.instantiate(cfg)
text = 'IMF가 推定한 우리나라의 GDP갭률은 今年에도 소폭의 마이너스(−)를 持續하고 있다.'
mecab.tokenize(text)
INFO:ekorpkit.preprocessors.tokenizer:Initializing mecab with {'userdic_path': None, 'backend': 'mecab-python3', 'verbose': False}...
INFO:ekorpkit.preprocessors.tokenizer:instantiating ekorpkit.preprocessors.stopwords.Stopwords...
['IMF/SL',
'가/JKS',
'/SP',
'推定/NNG',
'한/XSA+ETM',
'/SP',
'우리나라/NNG',
'의/JKG',
'/SP',
'GDP/SL',
'갭/NNG',
'률/XSN',
'은/JX',
'/SP',
'今年/NNG',
'에/JKB',
'도/JX',
'/SP',
'소폭/NNG',
'의/JKG',
'/SP',
'마이너스/NNG',
'(/SSO',
'−)/SY',
'를/JKO',
'/SP',
'持續/NNG',
'하/XSV',
'고/EC',
'/SP',
'있/VX',
'다/EF',
'./SF']
To normalize formal Korean text:
cfg_norm = eKonf.compose('preprocessor/normalizer=formal_ko')
norm = eKonf.instantiate(cfg_norm)
norm(text)
'IMF가 추정한 우리나라의 GDP갭률은 금년에도 소폭의 마이너스(-)를 지속하고 있다.'
To instantiate a mecab tokenizer with the above normalizer config:
cfg = eKonf.compose("preprocessor/tokenizer=mecab")
cfg.normalize = cfg_norm
mecab = eKonf.instantiate(cfg)
mecab.tokenize(text)
INFO:ekorpkit.preprocessors.tokenizer:Initializing mecab with {'userdic_path': None, 'backend': 'mecab-python3', 'verbose': False}...
INFO:ekorpkit.preprocessors.tokenizer:instantiating ekorpkit.preprocessors.stopwords.Stopwords...
['IMF/SL',
'가/JKS',
'/SP',
'추정/NNG',
'한/XSA+ETM',
'/SP',
'우리나라/NNG',
'의/JKG',
'/SP',
'GDP/SL',
'갭/NNG',
'률/XSN',
'은/JX',
'/SP',
'금년/NNG',
'에/JKB',
'도/JX',
'/SP',
'소폭/NNG',
'의/JKG',
'/SP',
'마이너스/NNG',
'(/SSO',
'-)/SY',
'를/JKO',
'/SP',
'지속/NNG',
'하/XSV',
'고/EC',
'/SP',
'있/VX',
'다/EF',
'./SF']
Text to image example#
eKonf.setLogger("WARNING")
cfg = eKonf.compose("model/disco")
disco = eKonf.instantiate(cfg)
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
Loading model from: /opt/conda/lib/python3.8/site-packages/lpips/weights/v0.1/vgg.pth
results = disco.imagine(
text_prompts="steampunk vegetable market, cute, pixar, octane render, epic composition, wide angle",
n_samples=1, show_collage=False, width_height = [768, 512]
)
INFO:ekorpkit.base:shell type: ZMQInteractiveShell
1 samples generated to /workspace/projects/ekorpkit-book/disco-imagen/outputs/disco-diffusion/TimeToDisco
text prompts: steampunk vegetable market, cute, pixar, octane render, epic composition, wide angle
sample image paths:
/workspace/projects/ekorpkit-book/disco-imagen/outputs/disco-diffusion/TimeToDisco/TimeToDisco(0)_0000.png
{'image_filepaths': ['/workspace/projects/ekorpkit-book/disco-imagen/outputs/disco-diffusion/TimeToDisco/TimeToDisco(0)_0000.png'],
'config_file': 'TimeToDisco(0)_settings.yaml',
'config': {'animation_mode': 'None',
'batch_name': 'TimeToDisco',
'display_rate': 20,
'n_samples': 1,
'batch_size': 1,
'resume_run': False,
'run_to_resume': 'latest',
'resume_from_frame': 'latest',
'retain_overwritten_frames': True,
'show_collage': False,
'diffusion_sampling_mode': 'ddim',
'use_secondary_model': True,
'steps': 250,
'width_height': [768, 512],
'clip_guidance_scale': 5000,
'tv_scale': 0,
'range_scale': 150,
'sat_scale': 0,
'cutn_batches': 4,
'skip_augs': False,
'init_image': 'None',
'init_scale': 1000,
'skip_steps': 10,
'frames_scale': 1500,
'frames_skip_steps': '60%',
'video_init_steps': 100,
'video_init_clip_guidance_scale': 1000,
'video_init_tv_scale': 0.1,
'video_init_range_scale': 150,
'video_init_sat_scale': 300,
'video_init_cutn_batches': 4,
'video_init_skip_steps': 50,
'video_init_file': 'init.mp4',
'video_init_path': '/workspace/projects/ekorpkit-book/disco-imagen/init_images/init.mp4',
'extract_nth_frame': 2,
'persistent_frame_output_in_batch_folder': True,
'video_init_seed_continuity': False,
'video_init_flow_warp': True,
'video_init_flow_blend': 0.999,
'video_init_check_consistency': False,
'video_init_blend_mode': 'optical flow',
'video_init_frames_scale': 15000,
'video_init_frames_skip_steps': '70%',
'force_flow_generation': False,
'key_frames': True,
'max_frames': 1,
'interp_spline': 'Linear',
'angle': '0:(0)',
'zoom': '0: (1), 10: (1.05)',
'translation_x': '0: (0)',
'translation_y': '0: (0)',
'translation_z': '0: (10.0)',
'rotation_3d_x': '0: (0)',
'rotation_3d_y': '0: (0)',
'rotation_3d_z': '0: (0)',
'midas_depth_model': 'dpt_large',
'midas_weight': 0.3,
'near_plane': 200,
'far_plane': 10000,
'fov': 40,
'padding_mode': 'border',
'sampling_mode': 'bicubic',
'turbo_mode': False,
'turbo_steps': '3',
'turbo_preroll': 10,
'vr_mode': False,
'vr_eye_angle': 0.5,
'vr_ipd': 5.0,
'intermediate_saves': [],
'steps_per_checkpoint': None,
'intermediates_in_subfolder': True,
'perlin_init': False,
'perlin_mode': 'mixed',
'set_seed': 'random_seed',
'eta': 0.8,
'clamp_grad': True,
'clamp_max': 0.05,
'randomize_class': True,
'clip_denoised': False,
'fuzzy_prompt': False,
'rand_mag': 0.05,
'cut_overview': '[12]*400+[4]*600',
'cut_innercut': '[4]*400+[12]*600',
'cut_ic_pow': 1,
'cut_icgray_p': '[0.2]*400+[0]*600',
'use_vertical_symmetry': False,
'use_horizontal_symmetry': False,
'transformation_percent': [0.09],
'video_output': {'skip_video_for_run_all': False,
'blend': 0.5,
'video_init_check_consistency': False,
'init_frame': 1,
'last_frame': 'final_frame',
'fps': 12,
'view_video_in_cell': False},
'text_prompts': {0: ['steampunk vegetable market, cute, pixar, octane render, epic composition, wide angle']},
'image_prompts': None,
'batch_num': 0,
'stop_on_next_loop': False,
'side_x': 768,
'side_y': 512,
'calc_frames_skip_steps': 150,
'start_frame': 0,
'start_sample': 0,
'seed': 69504742}}