# Lab 1: Preparing Wikipedia Corpora

![](../figs/deep_nlp/lab/corpus.png)

## Prepare the Environment

```python
%pip install -U --pre ekorpkit[dataset,wiki]
```

In [1]:
%config InlineBackend.figure_format='retina'
%load_ext autotime
%load_ext autoreload
%autoreload 2

from ekorpkit import eKonf

eKonf.setLogger("INFO")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
project_dir = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
print("project_dir:", project_dir)

INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Setting EKORPKIT_WORKSPACE_ROOT to /content/drive/MyDrive/workspace
INFO:ekorpkit.base:Setting EKORPKIT_PROJECT to ekorpkit-book
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env


version: 0.1.40.post0.dev18
is colab? False
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
time: 1.36 s (started: 2022-11-14 01:22:37 +00:00)


## Build corpora with the ekorpkit configs

### Wikipedia Dump

- The first step is to download the Wikipedia dump, a collection of all Wikipedia articles in XML format.
- The English dump is available at https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
- This file is about 20.3 GB in size.
- For other languages, replace `en` in both occurrences of `enwiki` in the URL with the language code of your choice (for example, `kowiki` for Korean).
- For the full list of language codes, see https://meta.wikimedia.org/wiki/List_of_Wikipedias.
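
The URL pattern described above can be captured in a small helper. This is a minimal sketch: `dump_url` is a hypothetical name and not part of ekorpkit, and it assumes that other language editions follow the same `<lang>wiki` naming as the English dump. Language codes that contain hyphens may use a different name on the dump server, so check the dump index for those.

```python
def dump_url(lang: str) -> str:
    """Return the URL of the latest pages-articles dump for a language code.

    Hypothetical helper for illustration only. It assumes the `<lang>wiki`
    naming seen in the English dump URL also applies to other editions.
    """
    wiki = f"{lang}wiki"
    return (
        f"https://dumps.wikimedia.org/{wiki}/latest/"
        f"{wiki}-latest-pages-articles.xml.bz2"
    )


print(dump_url("en"))
print(dump_url("ko"))
```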


#### Fetch the dump and extract it to the `data` directory

```python
from ekorpkit.io.fetch.loader.wiki import Wiki

# Fetch the Korean ("ko") dump and extract it into the `data` directory
wiki = Wiki(lang="ko", output_dir="data")
wiki.download_dump()
wiki.extract_wiki()
```

To download the dump for another language, pass its language code as `lang`, as the example above does with `ko` for Korean.


### Build the Korean Wikipedia Corpus


In [3]:
wiki_cfg = eKonf.compose("io/fetcher=wiki")
wiki_cfg.lang = "ko"
wiki_cfg.name = "kowiki"
wiki_cfg.output_dir = f"{wiki_cfg.dump.dump_dir}/extracted"  # extract into a subdirectory of the dump directory
wiki_cfg.autoload = False
wiki_cfg.force_download = False
wiki_cfg.num_workers = 50  # number of parallel extraction processes
wiki_cfg.verbose = True

eKonf.print(wiki_cfg)

{'_name_': 'fetcher',
 '_target_': 'ekorpkit.io.fetch.loader.wiki.Wiki',
 'auto': {'load': False},
 'autoload': False,
 'compress': False,
 'dump': {'_target_': 'web_download',
          'dump_dir': '/content/drive/MyDrive/workspace/.cache/corpus/kowiki',
          'dump_file': 'kowiki.xml.bz2',
          'url': 'https://dumps.wikimedia.org/kowiki/latest/kowiki-latest-pages-articles.xml.bz2'},
 'extract': {'_target_': 'extract_wiki'},
 'force': {'download': False},
 'force_download': False,
 'lang': 'ko',
 'limit': -1,
 'name': 'kowiki',
 'num_workers': 50,
 'output_dir': '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted',
 'output_file': None,
 'path': {'cached_path': None,
          'columns': None,
          'concat_data': False,
          'data_columns': None,
          'data_dir': '/content/drive/MyDrive/workspace/data/kowiki',
          'data_file': None,
          'filetype': '',
          'name': 'kowiki',
          'output': {'base_dir': '/content/drive/MyDrive/

In [8]:
from ekorpkit.io.fetch.loader.wiki import Wiki

args = eKonf.to_dict(wiki_cfg)
wiki = Wiki(**args)
wiki.download_dump()
wiki.extract_wiki()

[kowiki.xml.bz2] download kowiki.xml.bz2: 0.00B [00:00, ?B/s]

INFO: Preprocessing '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/kowiki.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Preprocessed 1700000 pages
INFO: Loaded 60850 templates in 214.3s
INFO: Starting page extraction from /content/drive/MyDrive/workspace/.cache/corpus/kowiki/kowiki.xml.bz2.
INFO: Using 50 extract processes.
INFO: Extracted 100000 articles (2809.9 art/s)
INFO: Extracted 200000 articles (4194.9 art/s)
INFO: Extracted 300000 art

Extracted kowiki from dump file /content/drive/MyDrive/workspace/.cache/corpus/kowiki/kowiki.xml.bz2


INFO: Finished 50-process extraction of 1339049 articles in 247.2s (5417.6 art/s)


#### Build the Corpus by Parsing the Wikipedia Dump

- The next step is to parse the extracted Wikipedia dump and build the corpus.
- The extracted dump is stored in JSON Lines format: each line is a single JSON object that holds one article.
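
To make the format concrete, the sketch below parses a JSON Lines stream with the standard library only. The helper name `iter_articles` and the two sample records are made up for illustration. The keys (`id`, `revid`, `url`, `title`, `text`) match those in the extracted files shown later in this lab.

```python
import io
import json
from typing import Iterator, TextIO


def iter_articles(lines: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records with the same keys as the extracted files.
# The second title is written with \u escapes, as in the raw dump output.
sample = io.StringIO(
    '{"id": "1", "revid": "10", "url": "https://example.org/1", '
    '"title": "A", "text": "First article."}\n'
    '{"id": "2", "revid": "20", "url": "https://example.org/2", '
    '"title": "\\ud55c\\uae00", "text": ""}\n'
)
articles = list(iter_articles(sample))
print(len(articles), articles[1]["title"])
```

For a real extracted file, pass `open(path, encoding="utf-8")` instead of the in-memory sample. `json.loads` decodes the `\uXXXX` escapes, so titles and text come back as readable Unicode strings.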


In [13]:
# get the list of extracted files

files = eKonf.get_filepaths("**/*", wiki_cfg.output_dir)
print(f"Number of files: {len(files)}")
files[:10]


INFO:ekorpkit.io.file:Processing [1628] files from ['**/*']


Number of files: 1628


['/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_23',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_67',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_05',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_20',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_40',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_92',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_42',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_14',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_28',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_24']

Print the beginning of the first extracted file to check its format.


In [11]:
print(eKonf.read(files[0], mode="r", encoding="utf-8", head=200))


{"id": "634327", "revid": "414775", "url": "https://ko.wikipedia.org/wiki?curid=634327", "title": "\uc131\uc774\uc131", "text": "\uc131\uc774\uc131(\u6210\u4ee5\u6027, 1595\ub144(\uc120\uc870 28\ub144


In [7]:
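cfg = eKonf.compose("corpus/builtin=kowiki")
cfg.io.fetcher = wiki_cfg  # reuse the fetcher config defined above
cfg.io.loader.data_dir = wiki_cfg.output_dir  # load from the extracted files
cfg.verbose = True
cfg.num_workers = 50
# eKonf.print(cfg.io)
db = eKonf.instantiate(cfg)  # instantiates the DatasetBuilder, which builds the corpus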

INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
INFO:ekorpkit.utils.notebook:shell type: ZMQInteractiveShell
INFO:ekorpkit.base:setting environment variable CACHED_PATH_CACHE_ROOT to /content/drive/MyDrive/workspace/.cache/cached_path
INFO:ekorpkit.base:setting environment variable KMP_DUPLICATE_LIB_OK to TRUE
INFO:ekorpkit.base:instantiating ekorpkit.datasets.build.DatasetBuilder...
INFO:ekorpkit.base:instantiating ekorpkit.io.fetch.loader.wiki.Wiki...
INFO:ekorpkit.base:instantiating ekorpkit.info.stat.SummaryInfo...
INFO:ekorpkit.info.stat:Loading info file: /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/info-kowiki.yaml
INFO:ekorpkit.base:instantiating ekorpkit.io.load.data.load_data...
INFO:ekorpkit.io.file:Processing [1628] files from ['**/*']
INFO:ekorpkit.io.load.data:Starting multiprocessing with 50 processes at load_data


{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'description': '위키백과, 우리 모두의 백과사전',
 'fullname': 'Korean Wikipedia Corpus',
 'homepage': 'https://ko.wikipedia.org',
 'lang': 'ko',
 'license': 'CC Attribution / Share-Alike 3.0',
 'name': 'kowiki',
 'version': '1.0.0'}


::load_data():   0%|          | 0/1628 [00:00<?, ?it/s]

{'curid': '634327', 'url': 'https://ko.wikipedia.org/wiki?curid=634327', 'title': '성이성', 'text': "성이성(成以性, 1595년(선조 28년) ∼ 1664년(현종 5년))은 조선 후기의 문신이자 유학자, 청백리이다. 자(字)는 여습(汝習)이고 호는 계서(溪西)이다. 본관은 창녕(昌寧). 춘향전의 실제 주인공으로 춘향전의 주인공인 몽룡은 원래 성몽룡이었다. 남원부사와 승정원승지를 지낸 성안의의 아들이다.\n강직한 간관이자 청백리이다. 그의 직계 후손들은 춘향전에 나온 '금준미주 천인혈'이 그가 실제로 지은 한시라고 주장한다. 호서 암행어사와 호남 암행어사로 활동, 감찰하며 부패 수령들을 봉고파직시켰다. 이것 역시 춘향전의 소재가 된다. 학맥으로는 김굉필의 손제자이자 그의 학맥을 계승한 강복성(康復誠)의 문인이다. 경상북도 출신.\n생애.\n생애 초반.\n출생과 가계.\n성이성은 경상북도 봉화군 물야면 가평리 태생으로 아버지는 창녕 성씨로 승정원승지와 군수를 지낸 성안의(成安義)이고, 어머니는 예안 김씨로 증(贈) 호조 참판에 추증(追贈)된 김계선의 딸이다.\n그는 어려서부터 그는 학업에 열중하여 13세때 그가 쓴 글을 우연히 정경세(鄭經世)가 보게 되었다. 정경세는 그의 글을 읽고 장차 크게 될 인물이라 하였다.\n수학과 남원 생활.\n어려서부터 공부를 게을리하지 않고 학문에 더욱 증진하여 조경남의 문하에서 수학하다가 뒤에 강복성(康復誠)의 문인이 되었다. 강복성은 사림의 학통인 길재-김숙자-김종직-김굉필(金宏弼)-조광조-이연경(李延慶)의 학통을 계승한 학자였다.\n1607년(선조 40) 남원부사로 부임한 아버지 성안의를 따라 갔다가 그곳에서 만난 기생과의 일화가 후일 춘향전의 주 뼈대가 되었다. 그러나 아버지 성안의가 참의로 발령되면서 기생 춘향과는 이별하게 된다. 이때 시중에는 성이성과 춘향을 소재로 한 춘향전이 희극과 인형극, 만담 등으로 확산되었는데, 양반가의 자제의 스캔들이라 하여

INFO:ekorpkit.datasets.build: >> elapsed time to load and parse data: 0:00:10.173167
INFO:ekorpkit.datasets.build:
Transforming dataframe with pipeline: ['reset_index', 'save_metadata']
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('reset_index', 'reset_index'), ('save_metadata', 'save_metadata')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function reset_index at 0x7fce771d3ca0>)
INFO:ekorpkit.pipelines.pipe:Resetting index: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.reset_index'}, 'index_column_name': 'id', 'drop_index': False, 'verbose': True}
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_metadata at 0x7fce771d35e0>)
INFO:ekorpkit.pipelines.pipe:Saving metadata: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.save_metadata'}, 'path': {'root': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'name': 'ekorpkit-book', 'cached_path': None, 'filetype': None, 'verbose': True, 'data_dir': '/co

    curid                                         url title  \
0  634327  https://ko.wikipedia.org/wiki?curid=634327   성이성   
1  634328  https://ko.wikipedia.org/wiki?curid=634328    누타   
2  634329  https://ko.wikipedia.org/wiki?curid=634329  공중그네   
3  634331  https://ko.wikipedia.org/wiki?curid=634331   성몽룡   
4  634332  https://ko.wikipedia.org/wiki?curid=634332    계서   

                                                text  split filename  
0  성이성(成以性, 1595년(선조 28년) ∼ 1664년(현종 5년))은 조선 후기의...  train  wiki_23  
1  누타(ぬた)는 잘게 썬 생선이나 조개를 파, 채소, 미역과 함께 초된장으로 무친 요...  train  wiki_23  
2                         공중그네(空中-)는 서커스의 기술 중 하나이다.  train  wiki_23  
3                                                     train  wiki_23  
4                                                     train  wiki_23  
(1339048, 6)
   id   curid                                         url title  \
0   0  634327  https://ko.wikipedia.org/wiki?curid=634327   성이성   
1   1  634328  https://ko.wikipedia.org/wiki?cur

INFO:ekorpkit.io.file: >> elapsed time to save data: 0:00:05.303806
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/kowiki-train.parquet
INFO:ekorpkit.io.file: >> elapsed time to save data: 0:01:05.529620
INFO:ekorpkit.info.stat:Initializing statistics for split: train with stats: {'name': 'train', 'dataset_name': 'kowiki', 'data_file': 'kowiki-train.parquet', 'meta_file': 'meta-kowiki-train.parquet'}
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 1339048 len(args): 5


apply len_bytes to num_bytes:   0%|          | 0/1340 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 1339048 len(args): 5


apply len_sents to num_sents:   0%|          | 0/1340 [00:00<?, ?it/s]

INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics before processing: 0:00:32.387224
INFO:ekorpkit.info.stat: >> updated splits: {'train': {'name': 'train', 'dataset_name': 'kowiki', 'data_file': 'kowiki-train.parquet', 'meta_file': 'meta-kowiki-train.parquet', 'num_docs_before_processing': 1339048, 'num_bytes_before_processing': 801994255, 'num_sents': 3829874}}
INFO:ekorpkit.datasets.build:
Processing dataframe with pipeline: ['normalize', 'segment', 'filter_length', 'drop_duplicates', 'save_samples']
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('normalize', 'normalize'), ('segment', 'segment'), ('filter_length', 'filter_length'), ('drop_duplicates', 'drop_duplicates'), ('save_samples', 'save_samples')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function normalize at 0x7fce771d3ee0>)
INFO:ekorpkit.pipelines.pipe:instantiating normalizer
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: jobl

Normalizing column: text:   0%|          | 0/1340 [00:00<?, ?it/s]

INFO:ekorpkit.pipelines.pipe: >> elapsed time to normalize: 0:00:18.489442
INFO:ekorpkit.base:Applying pipe: functools.partial(<function segment at 0x7fce771d30d0>)
INFO:ekorpkit.pipelines.pipe:instantiating segmenter
INFO:ekorpkit.base:instantiating ekorpkit.preprocessors.segmenter.KSSSegmenter...
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 1339048 len(args): 5


Splitting column: text:   0%|          | 0/1340 [00:00<?, ?it/s]

INFO:ekorpkit.pipelines.pipe: >> elapsed time to segment: 0:26:43.238374
INFO:ekorpkit.base:Applying pipe: functools.partial(<function filter_length at 0x7fce771d3280>, len_bytes={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, len_words={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'})
INFO:ekorpkit.pipelines.pipe:Filtering by length: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.filter_length', 'len_bytes': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, 'len_words': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'}}, 'apply_to': 'text', 'min_length': 30, 'max_length': None, 'len_func': 'len_bytes', 'len_column': 'num_bytes', 'add_len_column': True, 'verbose': True, 'use_batcher': True}
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 1339048 len(args): 

Calculating length:   0%|          | 0/1340 [00:00<?, ?it/s]

INFO:ekorpkit.pipelines.pipe:removed 736936 of 1339048 documents with length < 30
INFO:ekorpkit.pipelines.pipe: >> elapsed time to filter length: 0:00:03.079006
INFO:ekorpkit.base:Applying pipe: functools.partial(<function drop_duplicates at 0x7fce771d34c0>)
INFO:ekorpkit.pipelines.pipe:Dropping duplicates: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.drop_duplicates'}, 'apply_to': 'text', 'verbose': True}
INFO:ekorpkit.pipelines.pipe:601641 documents after dropping 471 duplicates from [['text']]
INFO:ekorpkit.pipelines.pipe: >> elapsed time to drop duplicates: 0:00:01.811704
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_samples at 0x7fce771d3790>)
INFO:ekorpkit.pipelines.pipe:Saving samples: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.save_samples'}, 'path': {'root': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'name': 'ekorpkit-book', 'cached_path': None, 'filetype': '', 'verbose': True, 'data_dir': '/conte

----------------------------------------------------------------------------------------------------

text: 
《그랜드 점프》(, )는 슈에이샤가 발행하는 일본의 소년 만화 잡지이다.

----------------------------------------------------------------------------------------------------
text: 
레이크파크()는 다음과 같은 뜻이 있다.

----------------------------------------------------------------------------------------------------


apply len_bytes to num_bytes:   0%|          | 0/602 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 601641 len(args): 5


apply len_wospc to num_bytes_wospc:   0%|          | 0/602 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 601641 len(args): 5


apply len_words to num_words:   0%|          | 0/602 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 601641 len(args): 5


apply len_sents to num_sents:   0%|          | 0/602 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 601641 len(args): 5


apply len_segments to num_segments:   0%|          | 0/602 [00:00<?, ?it/s]

INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:07.233255
INFO:ekorpkit.info.stat:Saving updated info file: /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/info-kowiki.yaml
INFO:ekorpkit.datasets.build:
Corpus [kowiki] is built to [/content/drive/MyDrive/workspace/data/datasets/corpus/kowiki] from [/content/drive/MyDrive/workspace/data/archive/datasets/source/kowiki]


{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'kowiki-train.parquet'},
 'data_files_modified': '2022-10-29 06:30:41',
 'description': '위키백과, 우리 모두의 백과사전',
 'fullname': 'Korean Wikipedia Corpus',
 'homepage': 'https://ko.wikipedia.org',
 'info_updated': '2022-10-29 06:58:28',
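
The log above shows that the `filter_length` pipe removed documents shorter than 30 (with `len_func` set to `len_bytes`) and that `drop_duplicates` then removed repeated texts. In the Korean run, 1,339,048 documents minus 736,936 short ones leaves 602,112, and dropping 471 duplicates gives the 601,641 documents reported. The sketch below reproduces the two steps in plain Python. It is not ekorpkit's implementation. It assumes that `len_bytes` means the UTF-8 encoded length and that the first occurrence of a duplicate is kept.

```python
from typing import Iterable, List


def len_bytes(text: str) -> int:
    """Length of the text in bytes (assumed here to mean UTF-8 bytes)."""
    return len(text.encode("utf-8"))


def filter_length(texts: Iterable[str], min_length: int = 30) -> List[str]:
    """Keep only texts whose byte length is at least `min_length`."""
    return [t for t in texts if len_bytes(t) >= min_length]


def drop_duplicates(texts: Iterable[str]) -> List[str]:
    """Drop repeated texts, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for t in texts:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


docs = [
    "",  # empty article body
    "short",  # 5 bytes, below the threshold
    "공중그네는 서커스의 기술 중 하나이다.",  # each Hangul syllable is 3 bytes in UTF-8
    "A sentence that is long enough to keep.",
    "A sentence that is long enough to keep.",  # exact duplicate
]
kept = drop_duplicates(filter_length(docs))
print(len(docs), "->", len(kept))
```

Measuring length in bytes rather than characters means a Korean text passes the threshold with fewer characters than an English one.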


### Build the English Wikipedia Corpus


In [8]:
cfg = eKonf.compose("corpus/builtin=enwiki")
cfg.verbose = True
cfg.num_workers = 50
db = eKonf.instantiate(cfg)

INFO:ekorpkit.base:instantiating ekorpkit.datasets.build.DatasetBuilder...
INFO:ekorpkit.base:instantiating ekorpkit.io.fetch.loader.wiki.Wiki...


[enwiki.xml.bz2] download enwiki.xml.bz2: 0.00B [00:00, ?B/s]

INFO: Preprocessing '/content/drive/MyDrive/workspace/.cache/corpus/enwiki/enwiki.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Preprocessed 1700000 pages
INFO: Preprocessed 1800000 pages
INFO: Preprocessed 1900000 pages
INFO: Preprocessed 2000000 pages
INFO: Preprocessed 2100000 pages
INFO: Preprocessed 2200000 pages
INFO: Preprocessed 2300000 pages
INFO: Preprocessed 2400000 pages
INFO: Preprocessed 2500000 pages
INFO: Preprocessed 2600000 pages
IN

Extracted enwiki from dump file /content/drive/MyDrive/workspace/.cache/corpus/enwiki/enwiki.xml.bz2
{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'description': 'Wikipedia',
 'fullname': 'English Wikipedia Corpus',
 'homepage': 'https://en.wikipedia.org',
 'lang': 'en',
 'license': 'CC Attribution 

INFO:ekorpkit.io.file:Processing [17144] files from ['**/*']
INFO:ekorpkit.io.load.data:Starting multiprocessing with 50 processes at load_data


::load_data():   0%|          | 0/17144 [00:00<?, ?it/s]

{'curid': '40754509', 'url': 'https://en.wikipedia.org/wiki?curid=40754509', 'title': 'Endocannabinoid transporter', 'text': 'The endocannabinoid transporters (eCBTs) are transport proteins for the endocannabinoids. Most neurotransmitters are water-soluble and require transmembrane proteins to transport them across the cell membrane. The endocannabinoids (anandamide, AEA, and 2-arachidonoylglycerol, 2-AG) on the other hand, are non-charged lipids that readily cross lipid membranes. However, since the endocannabinoids are water immiscible, protein transporters have been described that act as carriers to solubilize and transport the endocannabinoids through the aqueous cytoplasm. These include the heat shock proteins (Hsp70s) and fatty acid-binding proteins for anandamide (FABPs). FABPs such as FABP1, FABP3, FABP5, and FABP7 have been shown to bind endocannabinoids. FABP inhibitors attenuate the breakdown of anandamide by the enzyme fatty acid amide hydrolase (FAAH) in cell culture. One 

INFO:ekorpkit.datasets.build: >> elapsed time to load and parse data: 0:01:36.349757
INFO:ekorpkit.datasets.build:
Transforming dataframe with pipeline: ['reset_index', 'save_metadata']
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('reset_index', 'reset_index'), ('save_metadata', 'save_metadata')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function reset_index at 0x7fce771d3ca0>)
INFO:ekorpkit.pipelines.pipe:Resetting index: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.reset_index'}, 'index_column_name': 'id', 'drop_index': False, 'verbose': True}


      curid                                           url  \
0  40754509  https://en.wikipedia.org/wiki?curid=40754509   
1  40754512  https://en.wikipedia.org/wiki?curid=40754512   
2  40754531  https://en.wikipedia.org/wiki?curid=40754531   
3  40754542  https://en.wikipedia.org/wiki?curid=40754542   
4  40754545  https://en.wikipedia.org/wiki?curid=40754545   

                         title  \
0  Endocannabinoid transporter   
1              Buddy McClinton   
2                   Power Lock   
3                  Mike Ballou   
4          Philip M. Kleinfeld   

                                                text  split filename  
0  The endocannabinoid transporters (eCBTs) are t...  train  wiki_23  
1  Buddy McClinton was a defensive back for the A...  train  wiki_23  
2                                                     train  wiki_23  
3  Mikell Randolph Ballou (born September 11, 194...  train  wiki_23  
4  Philip M. Kleinfeld (June 19, 1894 – January 1...  train  wiki_23  
(1

INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_metadata at 0x7fce771d35e0>)
INFO:ekorpkit.pipelines.pipe:Saving metadata: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.save_metadata'}, 'path': {'root': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'name': 'ekorpkit-book', 'cached_path': None, 'filetype': None, 'verbose': True, 'data_dir': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'data_file': None, 'concat_data': False, 'data_columns': None, 'columns': None, 'output_dir': None, 'output_file': None, 'suffix': None, 'output': {'filename': 'meta-enwiki-train.parquet', 'base_dir': '/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki', 'filetype': '.parquet', 'suffix': None, 'filepath': '/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/meta-enwiki-train.parquet', 'columns': None, 'verbose': True}}, 'filepath': None, 'filetype': None, 'column_info': {'columns': {'id': 'id', 'text': 'text', 'merge_meta_on': '

   id     curid                                           url  \
0   0  40754509  https://en.wikipedia.org/wiki?curid=40754509   
1   1  40754512  https://en.wikipedia.org/wiki?curid=40754512   
2   2  40754531  https://en.wikipedia.org/wiki?curid=40754531   
3   3  40754542  https://en.wikipedia.org/wiki?curid=40754542   
4   4  40754545  https://en.wikipedia.org/wiki?curid=40754545   

                         title  \
0  Endocannabinoid transporter   
1              Buddy McClinton   
2                   Power Lock   
3                  Mike Ballou   
4          Philip M. Kleinfeld   

                                                text  split filename  
0  The endocannabinoid transporters (eCBTs) are t...  train  wiki_23  
1  Buddy McClinton was a defensive back for the A...  train  wiki_23  
2                                                     train  wiki_23  
3  Mikell Randolph Ballou (born September 11, 194...  train  wiki_23  
4  Philip M. Kleinfeld (June 19, 1894 – January 1

INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/meta-enwiki-train.parquet
INFO:ekorpkit.io.file: >> elapsed time to save data: 0:01:02.858828
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/enwiki-train.parquet
INFO:ekorpkit.io.file: >> elapsed time to save data: 0:19:07.049930
INFO:ekorpkit.info.stat:Initializing statistics for split: train with stats: {'name': 'train', 'dataset_name': 'enwiki', 'data_file': 'enwiki-train.parquet', 'meta_file': 'meta-enwiki-train.parquet'}
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 16699988 len(args): 5


apply len_bytes to num_bytes:   0%|          | 0/16700 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 16699988 len(args): 5


apply len_sents to num_sents:   0%|          | 0/16700 [00:00<?, ?it/s]

INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics before processing: 0:01:52.110589
INFO:ekorpkit.info.stat: >> updated splits: {'train': {'name': 'train', 'dataset_name': 'enwiki', 'data_file': 'enwiki-train.parquet', 'meta_file': 'meta-enwiki-train.parquet', 'num_docs_before_processing': 16699988, 'num_bytes_before_processing': 15342187701, 'num_sents': 68973662}}
INFO:ekorpkit.datasets.build:
Processing dataframe with pipeline: ['normalize', 'segment', 'filter_length', 'drop_duplicates', 'save_samples']
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('normalize', 'normalize'), ('segment', 'segment'), ('filter_length', 'filter_length'), ('drop_duplicates', 'drop_duplicates'), ('save_samples', 'save_samples')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function normalize at 0x7fce771d3ee0>)
INFO:ekorpkit.pipelines.pipe:instantiating normalizer
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: 

Normalizing column: text:   0%|          | 0/16700 [00:00<?, ?it/s]

INFO:ekorpkit.pipelines.pipe: >> elapsed time to normalize: 0:02:10.739667
INFO:ekorpkit.base:Applying pipe: functools.partial(<function segment at 0x7fce771d30d0>)
INFO:ekorpkit.pipelines.pipe:instantiating segmenter
INFO:ekorpkit.base:instantiating ekorpkit.preprocessors.segmenter.PySBDSegmenter...
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 16699988 len(args): 5


Splitting column: text:   0%|          | 0/16700 [00:02<?, ?it/s]

INFO:ekorpkit.pipelines.pipe: >> elapsed time to segment: 0:41:02.925846
INFO:ekorpkit.base:Applying pipe: functools.partial(<function filter_length at 0x7fce771d3280>, len_bytes={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, len_words={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'})
INFO:ekorpkit.pipelines.pipe:Filtering by length: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.filter_length', 'len_bytes': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, 'len_words': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'}}, 'apply_to': 'text', 'min_length': 30, 'max_length': None, 'len_func': 'len_bytes', 'len_column': 'num_bytes', 'add_len_column': True, 'verbose': True, 'use_batcher': True}
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 16699988 len(args):

Calculating length:   0%|          | 0/16700 [00:00<?, ?it/s]

INFO:ekorpkit.pipelines.pipe:removed 10366199 of 16699988 documents with length < 30
INFO:ekorpkit.pipelines.pipe: >> elapsed time to filter length: 0:00:31.301093
INFO:ekorpkit.base:Applying pipe: functools.partial(<function drop_duplicates at 0x7fce771d34c0>)
INFO:ekorpkit.pipelines.pipe:Dropping duplicates: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.drop_duplicates'}, 'apply_to': 'text', 'verbose': True}
INFO:ekorpkit.pipelines.pipe:6327718 documents after dropping 6071 duplicates from [['text']]
INFO:ekorpkit.pipelines.pipe: >> elapsed time to drop duplicates: 0:00:44.304687
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_samples at 0x7fce771d3790>)
INFO:ekorpkit.pipelines.pipe:Saving samples: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.save_samples'}, 'path': {'root': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'name': 'ekorpkit-book', 'cached_path': None, 'filetype': '', 'verbose': True, 'data_dir': '/

----------------------------------------------------------------------------------------------------

text: 
Novska railway station () is a railway station on the Novska-Tovarnik railway in Novska, Croatia. 
There are three lines connecting Novska to Jasenovac, Okučani, and Lipovljani. 
The railway station consists of 18 railway tracks.

----------------------------------------------------------------------------------------------------
text: 
Zoltán Friedmanszky (22 October 1934 - 31 March 2022) was a Hungarian footballer who played as a forward. 
He was a member of the Hungary national team at the 1958 FIFA World Cup. 
However, he was never capped for the national team. 
He also played for Ferencváros.

----------------------------------------------------------------------------------------------------


INFO:ekorpkit.info.stat:Calculating statistics for split: train
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 6327718 len(args): 5


apply len_bytes to num_bytes:   0%|          | 0/6328 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 6327718 len(args): 5


apply len_wospc to num_bytes_wospc:   0%|          | 0/6328 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 6327718 len(args): 5


apply len_words to num_words:   0%|          | 0/6328 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 6327718 len(args): 5


apply len_sents to num_sents:   0%|          | 0/6328 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 6327718 len(args): 5


apply len_segments to num_segments:   0%|          | 0/6328 [00:00<?, ?it/s]

INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:01:53.710827
INFO:ekorpkit.info.stat:Saving updated info file: /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/info-enwiki.yaml
INFO:ekorpkit.datasets.build:
Corpus [enwiki] is built to [/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki] from [/content/drive/MyDrive/workspace/data/archive/datasets/source/enwiki]


{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'enwiki-train.parquet'},
 'data_files_modified': '2022-10-29 11:04:23',
 'description': 'Wikipedia',
 'fullname': 'English Wikipedia Corpus',
 'homepage': 'https://en.wikipedia.org',
 'info_updated': '2022-10-29 11:52:41',
 'lang'
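
The log above reports two reduction steps: the length filter removed 10,366,199 of 16,699,988 documents, and deduplication dropped 6,071 more, leaving 6,327,718. The following is a minimal pandas sketch of those two steps on toy data, not ekorpkit's implementation. It assumes that `len_bytes` measures the UTF-8 encoded length of the text; the names `utf8_len` and `filter_and_dedup` are hypothetical.

```python
import pandas as pd


def utf8_len(text: str) -> int:
    """Length of a string in UTF-8 bytes (assumed meaning of `len_bytes`)."""
    return len(text.encode("utf-8"))


def filter_and_dedup(df: pd.DataFrame, min_length: int = 30) -> pd.DataFrame:
    """Drop documents shorter than `min_length` bytes, then exact duplicate texts."""
    num_bytes = df["text"].fillna("").map(utf8_len)
    kept = df[num_bytes >= min_length]
    return kept.drop_duplicates(subset="text").reset_index(drop=True)


# Toy data: one empty page, one short page, and one duplicated article.
docs = pd.DataFrame(
    {
        "id": [0, 1, 2, 3],
        "text": [
            "",
            "Too short.",
            "Novska railway station is a railway station in Croatia.",
            "Novska railway station is a railway station in Croatia.",
        ],
    }
)
cleaned = filter_and_dedup(docs)
print(len(cleaned))  # 1 document remains

# The counts in the log are mutually consistent.
print(16_699_988 - 10_366_199 - 6_071)  # 6327718
```

Filtering before deduplicating means the many empty or near-empty pages are removed by length rather than being counted as duplicates of each other.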

### Build a Wikipedia Corpus for Other Languages


In [6]:
cfg = eKonf.compose("corpus/builtin=wiki")
cfg.lang = "bn"
cfg.verbose = True
cfg.num_workers = 50

db = eKonf.instantiate(cfg)

INFO:ekorpkit.base:instantiating ekorpkit.datasets.build.DatasetBuilder...
INFO:ekorpkit.base:instantiating ekorpkit.io.fetch.loader.wiki.Wiki...
INFO:ekorpkit.base:instantiating ekorpkit.info.stat.SummaryInfo...
INFO:ekorpkit.info.stat:Loading info file: /content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/info-bnwiki.yaml
INFO:ekorpkit.datasets.build:/content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/bnwiki-train.parquet already exists
INFO:ekorpkit.io.file:Processing [1] files from ['bnwiki-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/bnwiki-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/bnwiki-train.parquet


{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'bnwiki-train.parquet'},
 'data_files_modified': '2022-10-31 02:06:51',
 'description': 'Wikipedia',
 'fullname': 'Wikipedia Corpus (bn)',
 'homepage': 'https://bn.wikipedia.org',
 'info_updated': '2022-10-31 02:07:26',
 'lang': '

INFO:ekorpkit.io.file: >> elapsed time to load data: 0:00:03.919305
INFO:ekorpkit.datasets.build:
Processing dataframe with pipeline: ['normalize', 'filter_length', 'drop_duplicates', 'save_samples']
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('normalize', 'normalize'), ('filter_length', 'filter_length'), ('drop_duplicates', 'drop_duplicates'), ('save_samples', 'save_samples')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function normalize at 0x7fc6497bc700>)
INFO:ekorpkit.pipelines.pipe:instantiating normalizer
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 348546 len(args): 5


Normalizing column: text:   0%|          | 0/349 [00:00<?, ?it/s]

INFO:ekorpkit.pipelines.pipe: >> elapsed time to normalize: 0:00:20.737039
INFO:ekorpkit.base:Applying pipe: functools.partial(<function filter_length at 0x7fc655201f70>, len_bytes={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, len_words={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'})
INFO:ekorpkit.pipelines.pipe:Filtering by length: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.filter_length', 'len_bytes': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, 'len_words': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'}}, 'apply_to': 'text', 'min_length': 30, 'max_length': None, 'len_func': 'len_bytes', 'len_column': 'num_bytes', 'add_len_column': True, 'verbose': True, 'use_batcher': True}
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 348546 len(args)

Calculating length:   0%|          | 0/349 [00:00<?, ?it/s]

INFO:ekorpkit.pipelines.pipe:removed 220674 of 348546 documents with length < 30
INFO:ekorpkit.pipelines.pipe: >> elapsed time to filter length: 0:00:01.750412
INFO:ekorpkit.base:Applying pipe: functools.partial(<function drop_duplicates at 0x7fc655201dc0>)
INFO:ekorpkit.pipelines.pipe:Dropping duplicates: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.drop_duplicates'}, 'apply_to': 'text', 'verbose': True}
INFO:ekorpkit.pipelines.pipe:127833 documents after dropping 39 duplicates from [['text']]
INFO:ekorpkit.pipelines.pipe: >> elapsed time to drop duplicates: 0:00:01.292660
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_samples at 0x7fc655201ee0>)
INFO:ekorpkit.pipelines.pipe:Saving samples: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.save_samples'}, 'path': {'root': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/ekorpkit-book', 'name': 'ekorpkit-book', 'cached_path': None, 'filetype': '', 'verbose': True, 'da

----------------------------------------------------------------------------------------------------

text: 
উম্মে হানি বিনতে আবি তালিব (আরবি فاختة بنت أبي طالب) হযরত মুহাম্মাদ সাঃ এর চাচাত বোন ছিলেন। উম্মে হানি আবু তালিবের কন্যা ছিলেন। তিনি একজন হাদিস বর্ণনাকারী সাহাবা ছিলেন।
নাম ও বংশ পরিচয়.
উম্মে হানি বিনতে আবি তালিব এর আসল ছিলো ফাখিতা মতান্তরে হিন্দ। তার পিতার নাম আবু তালিব ইবনে আবদুল মুত্তালিব ও মাতার নাম ছিলো ফাতিমা বিনতে আসাদ। তিনি জাফর,আকিল ও আলীর সহোদরা ছিলেন।
উম্মে হানির কয়েকজন সন্তানের নাম হলো:
ইসলাম পূর্ব সময়.
তার বাল্যকালের কথা তেমন কিছু জানা যায় না। তবে মহানবী হযরত মুহাম্মাদ সাঃ এর নবুওয়াত প্রাপ্তির পূর্বে চাচা আবু তালিবের নিকট উম্মে হানির বিয়ের প্রস্তাব পাঠান। একই সময়হুবায়রা ইবনে আমর ইবনে আয়িয আল মাখযুমিও উম্মে হানিকে বিয়ে করতে চান। আবু তালিব হুবায়রার প্রস্তাব গ্রহণ করে উম্মে হানিকে তার সাথে বিয়ে দেন। এবং হযরত মুহাম্মাদ সাঃ কে বললেন, "ভাতিজা! আমরা তার সাথে বৈবাহিক সম্পর্ক করেছি। সম্মানীয়দের সমকক্ষ সম্মানীয়রাই হয়ে থাকে।" 
মক্কা বিজয়ের দিন ও ইসলাম গ্রহণ.
ইম

apply len_bytes to num_bytes:   0%|          | 0/230 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 556
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 556  procs: 230  input_split: False  merge_output: True  len(data): 127833 len(args): 5


apply len_wospc to num_bytes_wospc:   0%|          | 0/230 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 556
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 556  procs: 230  input_split: False  merge_output: True  len(data): 127833 len(args): 5


apply len_words to num_words:   0%|          | 0/230 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 556
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 556  procs: 230  input_split: False  merge_output: True  len(data): 127833 len(args): 5


apply len_sents to num_sents:   0%|          | 0/230 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 556
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 556  procs: 230  input_split: False  merge_output: True  len(data): 127833 len(args): 5


apply len_segments to num_segments:   0%|          | 0/230 [00:00<?, ?it/s]

INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:05.268952
INFO:ekorpkit.info.stat:Saving updated info file: /content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/info-bnwiki.yaml
INFO:ekorpkit.datasets.build:
Corpus [bnwiki] is built to [/content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki] from [/content/drive/MyDrive/workspace/data/archive/datasets/source/bnwiki]


{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'bnwiki-train.parquet'},
 'data_files_modified': '2022-11-04 06:36:13',
 'description': 'Wikipedia',
 'fullname': 'Wikipedia Corpus (bn)',
 'homepage': 'https://bn.wikipedia.org',
 'info_updated': '2022-11-04 08:17:09',
 'lang': '
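
The progress bars above name five per-document statistics: `num_bytes`, `num_bytes_wospc`, `num_words`, `num_sents`, and `num_segments`. As a rough sketch only, the function below shows one plausible way to compute such values with the separators listed in the corpus info (`\n` for sentences, `\n\n` for segments). The definitions are assumptions, since ekorpkit's `len_*` functions are not shown here, and `document_stats` is a hypothetical name.

```python
def document_stats(text, sentence_separator="\n", segment_separator="\n\n"):
    """Compute simple size statistics for one document (assumed definitions)."""
    segments = [s for s in text.split(segment_separator) if s.strip()]
    sentences = [s for s in text.split(sentence_separator) if s.strip()]
    return {
        # UTF-8 byte length of the whole text
        "num_bytes": len(text.encode("utf-8")),
        # byte length after removing all whitespace
        "num_bytes_wospc": len("".join(text.split()).encode("utf-8")),
        # whitespace-delimited tokens
        "num_words": len(text.split()),
        # non-empty lines
        "num_sents": len(sentences),
        # non-empty blocks separated by a blank line
        "num_segments": len(segments),
    }


sample = "First sentence.\nSecond sentence.\n\nNew segment."
stats = document_stats(sample)
print(stats)
```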

### Build a Corpus Using the CLI

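For large builds that use multiple worker processes, it is preferable to run the build from the command-line interface (CLI) rather than from a notebook. The example below builds the Bengali (`bn`) corpus with `num_workers=1`; increase this value to use more processes.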

```bash
ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    verbose=true \
    num_workers=1 \
    run=corpus.builtin \
    corpus/builtin=wiki \
    corpus.builtin.lang="bn" \
    corpus.builtin.io.force.summarize=false \
    corpus.builtin.io.force.preprocess=false \
    corpus.builtin.io.force.build=false \
    corpus.builtin.io.force.download=false
```    
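
To repeat the same build for several languages, the command can be assembled programmatically. The sketch below only constructs and prints the argument list, reusing the overrides from the command above; `build_wiki_command` is a hypothetical helper, not part of ekorpkit.

```python
import shlex


def build_wiki_command(
    lang,
    num_workers=1,
    workspace="/content/drive/MyDrive/workspace",
    project="ekorpkit-book",
):
    """Return the ekorpkit CLI call from this section as an argument list."""
    return [
        "ekorpkit",
        f"project.name={project}",
        f"dir.workspace={workspace}",
        "verbose=true",
        f"num_workers={num_workers}",
        "run=corpus.builtin",
        "corpus/builtin=wiki",
        f"corpus.builtin.lang={lang}",
        "corpus.builtin.io.force.summarize=false",
        "corpus.builtin.io.force.preprocess=false",
        "corpus.builtin.io.force.build=false",
        "corpus.builtin.io.force.download=false",
    ]


# Print one shell-ready command per language.
for lang in ["ko", "bn"]:
    print(shlex.join(build_wiki_command(lang)))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided the `ekorpkit` command is installed and on the `PATH`.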

## Load the Corpus


In [39]:
cfg = eKonf.compose("corpus")
cfg.name = "enwiki"
enwiki = eKonf.instantiate(cfg)
print(enwiki)

INFO:ekorpkit.datasets.base:Loaded info file: /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/info-enwiki.yaml
INFO:ekorpkit.io.file:Processing [1] files from ['enwiki-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/enwiki-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/enwiki-train.parquet
INFO:ekorpkit.info.column:index: index, index of data: None, columns: ['id', 'text', 'split', 'filename'], id: ['id']
INFO:ekorpkit.info.column:Adding id [split] to ['id']
INFO:ekorpkit.info.column:Added id [split], now ['id', 'split']
INFO:ekorpkit.info.column:Added a column [split] with value [train]
INFO:ekorpkit.io.file:Processing [1] files from ['meta-enwiki-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/meta-enwiki-train.parquet']
INFO:ekorpkit.io.file:Load

Corpus : enwiki
time: 1min 48s (started: 2022-11-04 02:24:13 +00:00)


In [40]:
eKonf.print(enwiki.INFO)


{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'enwiki-train.parquet'},
 'data_files_modified': '2022-10-29 11:04:23',
 'description': 'Wikipedia',
 'fullname': 'English Wikipedia Corpus',
 'homepage': 'https://en.wikipedia.org',
 'info_updated': '2022-10-29 11:52:41',
 'lang'

### Sample and Save the Corpus


In [42]:
enwiki.data.info()

INFO:ekorpkit.io.file:Concatenating 1 dataframes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16699988 entries, 0 to 16699987
Data columns (total 4 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   id        int64 
 1   text      object
 2   split     object
 3   filename  object
dtypes: int64(1), object(3)
memory usage: 509.6+ MB
time: 674 ms (started: 2022-11-04 02:26:53 +00:00)
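
The `+` in `memory usage: 509.6+ MB` means that pandas counted only the 8-byte references held by the `object` columns, not the strings they point to: 4 columns × 8 bytes × 16,699,988 rows is about 509.6 MiB. A minimal sketch on a toy frame (not the corpus itself) of how to request the deep measurement:

```python
import pandas as pd

# Toy frame with an object-dtype text column, mirroring the corpus layout.
texts = pd.Series(["Wikipedia article text " * 20] * 1000, dtype=object)
df = pd.DataFrame({"id": range(1000), "text": texts})

shallow = int(df.memory_usage(deep=False).sum())  # references only
deep = int(df.memory_usage(deep=True).sum())  # includes the string contents
print(f"shallow: {shallow:,} bytes, deep: {deep:,} bytes")

# The same deep measurement in the familiar info() report.
df.info(memory_usage="deep")
```

On a corpus of this size the deep measurement has to visit every string, so it can take noticeably longer than the default report.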


In [44]:
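# Keep a random 5% sample of the training split; pass random_state=<int> to sample() for a reproducible draw.
enwiki.splits["train"] = enwiki.splits["train"].sample(frac=0.05)
enwiki.save_as("enwiki_sampled")
# DataFrame.info() prints its report and returns None, so print() also emits a trailing "None".
print(enwiki.data.info())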

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 834999 len(args): 5


apply len_bytes to num_bytes:   0%|          | 0/835 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 834999 len(args): 5


apply len_wospc to num_bytes_wospc:   0%|          | 0/835 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 834999 len(args): 5


apply len_words to num_words:   0%|          | 0/835 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 834999 len(args): 5


apply len_sents to num_sents:   0%|          | 0/835 [00:00<?, ?it/s]

INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 834999 len(args): 5


apply len_segments to num_segments:   0%|          | 0/835 [00:00<?, ?it/s]

INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:04:32.445424
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki_sampled/enwiki_sampled-train.parquet
INFO:ekorpkit.io.file:Concatenating 1 dataframes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 834999 entries, 0 to 834998
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        834999 non-null  int64 
 1   text      834999 non-null  object
 2   split     834999 non-null  object
 3   filename  834999 non-null  object
dtypes: int64(1), object(3)
memory usage: 25.5+ MB
None
time: 5min 34s (started: 2022-11-04 02:27:31 +00:00)
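
The sample size follows from the fraction: 5% of 16,699,988 rows rounds to 834,999, the count shown in the log. Without a seed, each run draws a different sample. The sketch below uses a toy frame to show how pandas' `random_state` makes the draw repeatable; the seed value 42 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"id": range(1000), "text": [f"doc {i}" for i in range(1000)]})

# The same seed yields the same rows in the same order.
sample_a = df.sample(frac=0.05, random_state=42)
sample_b = df.sample(frac=0.05, random_state=42)
print(sample_a.equals(sample_b))  # True

# The sampled frame keeps the original index labels; reset them if a
# contiguous 0..n-1 index is wanted.
sample_a = sample_a.reset_index(drop=True)
print(len(sample_a))  # 50
```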


## Load Corpora

You can load several corpora at once and merge them into a single corpus.


In [2]:
cfg = eKonf.compose("corpus=corpora")
cfg.name = ["enwiki_sampled", "kowiki", "bnwiki"]
# cfg.data_dir = '../data'
cfg.auto.load = True
crps = eKonf.instantiate(cfg)
print(crps)

INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
INFO:ekorpkit.utils.notebook:shell type: ZMQInteractiveShell
INFO:ekorpkit.base:setting environment variable CACHED_PATH_CACHE_ROOT to /content/drive/MyDrive/workspace/.cache/cached_path
INFO:ekorpkit.base:setting environment variable KMP_DUPLICATE_LIB_OK to TRUE
INFO:ekorpkit.datasets.corpora:processing enwiki_sampled
INFO:ekorpkit.io.file:Processing [1] files from ['enwiki_sampled-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki_sampled/enwiki_sampled-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki_sampled/enwiki_sampled-train.parquet
INFO:ekorpkit.info.column:index: index, index of data: index, columns: ['id', 'text', 'split', 'filename'], id: ['id']
INFO:ekorpkit.info.column:Adding id [split] to ['id']
INFO:ekorpkit.info.column:Added id [split], now ['id', 

Corpora
----------
enwiki_sampled
kowiki
bnwiki

time: 18.8 s (started: 2022-11-04 03:20:27 +00:00)


### Checking the Corpus Information


In [3]:
eKonf.print(crps["kowiki"].INFO)


{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'kowiki-train.parquet'},
 'data_files_modified': '2022-10-29 06:30:41',
 'description': '위키백과, 우리 모두의 백과사전',
 'fullname': 'Korean Wikipedia Corpus',
 'homepage': 'https://ko.wikipedia.org',
 'info_updated': '2022-10-29 06:58:28',


### Checking the Corpus Data


In [4]:
print(crps["kowiki"].data.text[0])

INFO:ekorpkit.io.file:Concatenating 1 dataframes


성이성(成以性, 1595년(선조 28년) ∼ 1664년(현종 5년))은 조선 후기의 문신이자 유학자, 청백리이다. 자(字)는 여습(汝習)이고 호는 계서(溪西)이다. 본관은 창녕(昌寧). 춘향전의 실제 주인공으로 춘향전의 주인공인 몽룡은 원래 성몽룡이었다. 남원부사와 승정원승지를 지낸 성안의의 아들이다.
강직한 간관이자 청백리이다. 그의 직계 후손들은 춘향전에 나온 '금준미주 천인혈'이 그가 실제로 지은 한시라고 주장한다. 호서 암행어사와 호남 암행어사로 활동, 감찰하며 부패 수령들을 봉고파직시켰다. 이것 역시 춘향전의 소재가 된다. 학맥으로는 김굉필의 손제자이자 그의 학맥을 계승한 강복성(康復誠)의 문인이다. 경상북도 출신.
생애.
생애 초반.
출생과 가계.
성이성은 경상북도 봉화군 물야면 가평리 태생으로 아버지는 창녕 성씨로 승정원승지와 군수를 지낸 성안의(成安義)이고, 어머니는 예안 김씨로 증(贈) 호조 참판에 추증(追贈)된 김계선의 딸이다.
그는 어려서부터 그는 학업에 열중하여 13세때 그가 쓴 글을 우연히 정경세(鄭經世)가 보게 되었다. 정경세는 그의 글을 읽고 장차 크게 될 인물이라 하였다.
수학과 남원 생활.
어려서부터 공부를 게을리하지 않고 학문에 더욱 증진하여 조경남의 문하에서 수학하다가 뒤에 강복성(康復誠)의 문인이 되었다. 강복성은 사림의 학통인 길재-김숙자-김종직-김굉필(金宏弼)-조광조-이연경(李延慶)의 학통을 계승한 학자였다.
1607년(선조 40) 남원부사로 부임한 아버지 성안의를 따라 갔다가 그곳에서 만난 기생과의 일화가 후일 춘향전의 주 뼈대가 되었다. 그러나 아버지 성안의가 참의로 발령되면서 기생 춘향과는 이별하게 된다. 이때 시중에는 성이성과 춘향을 소재로 한 춘향전이 희극과 인형극, 만담 등으로 확산되었는데, 양반가의 자제의 스캔들이라 하여 조선조정에서 관을 시켜서 금지하게 되자 성몽룡을 이몽룡으로 바꾸고, 성씨(姓氏)가 없던 기생인 춘향에게 성씨 성을 붙여서 시연하게 된다.
1616년(광해군 8년) 그는 사마시 양시에 합격했는

In [5]:
print(crps["bnwiki"].data.text[0])

INFO:ekorpkit.io.file:Concatenating 1 dataframes


শ্যামাদাস মুখোপাধ্যায় (২২ জুন ১৮৬৬ - ৮ মে ১৯৩৭) ছিলেন একজন ভারতীয় বাঙালি গণিতবিদ। তিনি ইউক্লিডিয় জ্যামিতির মুখোপাধ্যায়ের উপপাদ্য এবং চতুর্শীর্ষ উপপাদ্য (Four-vertex theorem) উপস্থাপনের জন্য পরিচিত। তিনি ভারতের প্রথম গণিতবিদ হিসেবে ডক্টরেট ডিগ্রী অর্জন করেন।
জন্ম ও শিক্ষাজীবন.
শ্যামাদাস মুখোপাধ্যায় ১৮৬৬ খ্রিষ্টাব্দের ২২ জুন পশ্চিমবঙ্গের হুগলি জেলার হরিপাল ব্লকে জন্মগ্রহণ করেন। তার বাবা বাবু গঙ্গা কান্ত মুখোপাধ্যায় রাজ্য বিচার বিভাগে নিযুক্ত ছিলেন। চাকরি সূত্রে তাকে বিভিন্ন স্থানে স্থানান্তরিত করা হওয়ায় শ্যামাদাস মুখোপাধ্যায়কে বিভিন্ন সময়ে বিভিন্ন শিক্ষা প্রতিষ্ঠানে শিক্ষা গ্রহণ করতে হয়। তিনি হুগলি কলেজ থেকে স্নাতক হন। তিনি ১৮৯০ খ্রিষ্টাব্দে কলকাতার প্রেসিডেন্সি কলেজ থেকে গণিত বিষয়ে এমএ ডিগ্রি অর্জন করেন। তিনি ১৯০৯ খ্রিষ্টাব্দে তার গাণিতিক তত্ত্বালোচনা "অন দ্যা ইনফিনিটেসিমাল এনালিসিস অফ এন আর্ক"-এর জন্য কলকাতা বিশ্ববিদ্যালয় থেকে গ্রিফিত পুরস্কার পান। তিনি ১৯১০ খ্রিষ্টাব্দে তার নিজস্ব ডিফারেনশিয়াল জ্যামিতির উপরে কলকাতা বিশ্ববিদ্যালয় থেকে পিএইচডি ডিগ্রি লাভ করেন। তার থিসিসের

### Concatenating Corpora


In [6]:
crps.concat_corpora()


INFO:ekorpkit.io.file:Concatenating 1 dataframes
INFO:ekorpkit.info.column:Adding id [corpus] to ['id']
INFO:ekorpkit.info.column:Added id [corpus], now ['id', 'corpus']
INFO:ekorpkit.info.column:Added a column [corpus] with value [enwiki_sampled]
INFO:ekorpkit.io.file:Concatenating 1 dataframes
INFO:ekorpkit.info.column:Added a column [corpus] with value [kowiki]
INFO:ekorpkit.info.column:Added a column [corpus] with value [kowiki]
INFO:ekorpkit.io.file:Concatenating 1 dataframes
INFO:ekorpkit.info.column:Added a column [corpus] with value [bnwiki]
INFO:ekorpkit.info.column:Added a column [corpus] with value [bnwiki]


time: 296 ms (started: 2022-11-04 03:20:47 +00:00)
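
The log shows a `corpus` column being added to each frame before the frames are combined. In plain pandas, a comparable operation could look like the sketch below. It is an illustration with toy data and placeholder texts, not the code behind `concat_corpora()`.

```python
import pandas as pd

# Toy stand-ins for the loaded corpora (placeholder texts).
corpora = {
    "enwiki_sampled": pd.DataFrame({"id": [10, 11], "text": ["english doc a", "english doc b"]}),
    "kowiki": pd.DataFrame({"id": [0], "text": ["korean doc"]}),
    "bnwiki": pd.DataFrame({"id": [0], "text": ["bengali doc"]}),
}

# Tag each frame with its corpus name, then stack them with a fresh index.
merged = pd.concat(
    [df.assign(corpus=name) for name, df in corpora.items()],
    ignore_index=True,
)
print(merged)
```

Because `id` values repeat across corpora (both toy corpora above start at 0), the pair (`id`, `corpus`) rather than `id` alone identifies a document after concatenation.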


In [7]:
crps.data


| | id | text | split | filename | corpus |
|---|---|---|---|---|---|
| 0 | 4915400 | | train | wiki_92 | enwiki_sampled |
| 1 | 7644961 | Anaissini is a tribe of click beetles in the f... | train | wiki_49 | enwiki_sampled |
| 2 | 6658552 | The Vicky Metcalf Award for Literature for You... | train | wiki_24 | enwiki_sampled |
| 3 | 16385169 | Shri Shivabalayogi Maharaj (24 January 1935 – ... | train | wiki_36 | enwiki_sampled |
| 4 | 11081255 | Eylex Films Pvt is a chain of multiplex and si... | train | wiki_94 | enwiki_sampled |
| ... | ... | ... | ... | ... | ... |
| 2522588 | 348541 | মোহাম্মদ সেলিম (জন্ম: ১৫ অক্টোবর, ১৯৮১) খুলনার... | train | wiki_34 | bnwiki |
| 2522589 | 348542 | | train | wiki_34 | bnwiki |
| 2522590 | 348543 | দেশি কামিলা (বৈজ্ঞানিক নাম: "Congresox talabon... | train | wiki_34 | bnwiki |
| 2522591 | 348544 | | train | wiki_34 | bnwiki |


time: 13.4 ms (started: 2022-11-04 03:20:47 +00:00)
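
Several rows in the preview above have an empty `text` value. The loaded frames appear to be the unfiltered parquet files, so it may be worth checking for empty documents before further processing. A minimal sketch on a toy frame that mirrors the preview, with placeholder text for the Bengali row:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "id": [4915400, 7644961, 348541, 348542],
        "text": ["", "Anaissini is a tribe of click beetles in the f...", "(Bengali article text)", ""],
        "corpus": ["enwiki_sampled", "enwiki_sampled", "bnwiki", "bnwiki"],
    }
)

# Flag rows whose text is missing or contains only whitespace.
is_empty = data["text"].fillna("").str.strip().eq("")

# Count the empty documents per corpus, then drop them.
empty_per_corpus = is_empty.groupby(data["corpus"]).sum()
print(empty_per_corpus)
non_empty = data[~is_empty].reset_index(drop=True)
print(len(non_empty))  # 2
```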


In [8]:
crps.metadata


| | id | curid | url | title | split | corpus |
|---|---|---|---|---|---|---|
| 0 | 0 | 634327 | https://ko.wikipedia.org/wiki?curid=634327 | 성이성 | train | kowiki |
| 1 | 1 | 634328 | https://ko.wikipedia.org/wiki?curid=634328 | 누타 | train | kowiki |
| 2 | 2 | 634329 | https://ko.wikipedia.org/wiki?curid=634329 | 공중그네 | train | kowiki |
| 3 | 3 | 634331 | https://ko.wikipedia.org/wiki?curid=634331 | 성몽룡 | train | kowiki |
| 4 | 4 | 634332 | https://ko.wikipedia.org/wiki?curid=634332 | 계서 | train | kowiki |
| ... | ... | ... | ... | ... | ... | ... |
| 1687589 | 348541 | 554487 | https://bn.wikipedia.org/wiki?curid=554487 | মোহাম্মদ সেলিম | train | bnwiki |
| 1687590 | 348542 | 554493 | https://bn.wikipedia.org/wiki?curid=554493 | Mohammad Salim | train | bnwiki |
| 1687591 | 348543 | 554495 | https://bn.wikipedia.org/wiki?curid=554495 | দেশি কামিলা | train | bnwiki |
| 1687592 | 348544 | 554501 | https://bn.wikipedia.org/wiki?curid=554501 | Congresox talabonoides | train | bnwiki |


time: 7.47 ms (started: 2022-11-04 03:20:47 +00:00)
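
The metadata preview starts with `kowiki` rows, and its last index shown (1,687,592) is 834,999 lower than the last index shown for the data (2,522,591), which is exactly the size of the English sample. This suggests, though the output does not confirm it, that `enwiki_sampled` has no metadata rows. The sketch below, using toy frames and the column names from the previews, shows one way to attach titles to documents with a left join so that documents without metadata are kept.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "id": [0, 1, 7],
        "text": ["korean doc a", "korean doc b", "english doc"],  # placeholder texts
        "split": "train",
        "corpus": ["kowiki", "kowiki", "enwiki_sampled"],
    }
)
metadata = pd.DataFrame(
    {
        "id": [0, 1],
        "curid": ["634327", "634328"],
        "title": ["성이성", "누타"],
        "split": "train",
        "corpus": ["kowiki", "kowiki"],
    }
)

# Join on the full key; validate= raises if a metadata key is not unique.
merged = data.merge(
    metadata,
    on=["id", "split", "corpus"],
    how="left",
    validate="many_to_one",
)
print(merged[["id", "corpus", "title"]])
```

Rows without a metadata match keep their text and receive missing values in `curid` and `title`.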


### Save the Concatenated Corpus

In [9]:
eKonf.save_data(crps.data, "wiki_corpus.parquet", project_dir + "/data")

INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/wiki_corpus.parquet


time: 2min 56s (started: 2022-11-04 03:20:47 +00:00)
