Lab 1: Preparing Wikipedia Corpora#

Prepare the Environment#

%pip install -U --pre ekorpkit[dataset,wiki]
%config InlineBackend.figure_format='retina'
%load_ext autotime
%load_ext autoreload
%autoreload 2

from ekorpkit import eKonf

eKonf.setLogger("INFO")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
project_dir = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
print("project_dir:", project_dir)
INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Setting EKORPKIT_WORKSPACE_ROOT to /content/drive/MyDrive/workspace
INFO:ekorpkit.base:Setting EKORPKIT_PROJECT to ekorpkit-book
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
version: 0.1.40.post0.dev18
is colab? False
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
time: 1.36 s (started: 2022-11-14 01:22:37 +00:00)

Build corpora with the ekorpkit configs#

Wikipedia Dump#

Fetch the dump and extract it to the data directory#

from ekorpkit.io.fetch.loader.wiki import Wiki

wiki = Wiki(lang="ko", output_dir="data")
wiki.download_dump()
wiki.extract_wiki()

Following the steps above, you can download and extract dumps for other languages as well.
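For example, the same two calls fetch and extract the English dump; only the lang argument changes (the English dump is much larger, so the download and extraction take correspondingly longer):

from ekorpkit.io.fetch.loader.wiki import Wiki

# Same API as above, only the language code changes.
# The English dump is far larger than the Korean one.
wiki_en = Wiki(lang="en", output_dir="data")
wiki_en.download_dump()
wiki_en.extract_wiki()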

Build the Korean Wikipedia Corpus#

wiki_cfg = eKonf.compose("io/fetcher=wiki")
wiki_cfg.lang = "ko"
wiki_cfg.name = "kowiki"
wiki_cfg.output_dir = f"{wiki_cfg.dump.dump_dir}/extracted"
wiki_cfg.autoload = False
wiki_cfg.force_download = False
wiki_cfg.num_workers = 50
wiki_cfg.verbose = True

eKonf.print(wiki_cfg)
{'_name_': 'fetcher',
 '_target_': 'ekorpkit.io.fetch.loader.wiki.Wiki',
 'auto': {'load': False},
 'autoload': False,
 'compress': False,
 'dump': {'_target_': 'web_download',
          'dump_dir': '/content/drive/MyDrive/workspace/.cache/corpus/kowiki',
          'dump_file': 'kowiki.xml.bz2',
          'url': 'https://dumps.wikimedia.org/kowiki/latest/kowiki-latest-pages-articles.xml.bz2'},
 'extract': {'_target_': 'extract_wiki'},
 'force': {'download': False},
 'force_download': False,
 'lang': 'ko',
 'limit': -1,
 'name': 'kowiki',
 'num_workers': 50,
 'output_dir': '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted',
 'output_file': None,
 'path': {'cached_path': None,
          'columns': None,
          'concat_data': False,
          'data_columns': None,
          'data_dir': '/content/drive/MyDrive/workspace/data/kowiki',
          'data_file': None,
          'filetype': '',
          'name': 'kowiki',
          'output': {'base_dir': '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted',
                     'columns': None,
                     'file': None,
                     'filename': None,
                     'filepath': '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted',
                     'filetype': '',
                     'suffix': None},
          'output_dir': '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted',
          'output_file': None,
          'root': '/content/drive/MyDrive/workspace/data/kowiki',
          'suffix': None,
          'verbose': True},
 'verbose': True}
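The printed config also exposes the download target under dump.url. To make the build reproducible, you could pin a dated snapshot instead of latest by overriding that entry before instantiating; the sketch below is only illustrative, and the date is a hypothetical placeholder you would replace with a snapshot actually listed on dumps.wikimedia.org.

# Hypothetical override: pin a dated dump snapshot instead of "latest".
# The date below is a placeholder; use one that exists on dumps.wikimedia.org.
wiki_cfg.dump.url = (
    "https://dumps.wikimedia.org/kowiki/20221001/kowiki-20221001-pages-articles.xml.bz2"
)
wiki_cfg.dump.dump_file = "kowiki-20221001.xml.bz2"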
from ekorpkit.io.fetch.loader.wiki import Wiki

args = eKonf.to_dict(wiki_cfg)
wiki = Wiki(**args)
wiki.download_dump()
wiki.extract_wiki()
INFO: Preprocessing '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/kowiki.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Preprocessed 1700000 pages
INFO: Loaded 60850 templates in 214.3s
INFO: Starting page extraction from /content/drive/MyDrive/workspace/.cache/corpus/kowiki/kowiki.xml.bz2.
INFO: Using 50 extract processes.
INFO: Extracted 100000 articles (2809.9 art/s)
INFO: Extracted 200000 articles (4194.9 art/s)
INFO: Extracted 300000 articles (4598.9 art/s)
INFO: Extracted 400000 articles (5019.5 art/s)
INFO: Extracted 500000 articles (5461.3 art/s)
INFO: Extracted 600000 articles (5120.8 art/s)
INFO: Extracted 700000 articles (5254.5 art/s)
INFO: Extracted 800000 articles (6065.9 art/s)
INFO: Extracted 900000 articles (19874.6 art/s)
INFO: Extracted 1000000 articles (10843.3 art/s)
INFO: Extracted 1100000 articles (5674.7 art/s)
INFO: Extracted 1200000 articles (5990.6 art/s)
INFO: Extracted 1300000 articles (5678.9 art/s)
Extracted kowiki from dump file /content/drive/MyDrive/workspace/.cache/corpus/kowiki/kowiki.xml.bz2
INFO: Finished 50-process extraction of 1339049 articles in 247.2s (5417.6 art/s)

Build the Corpus by Parsing the Wikipedia Dump#

  • The next step is to parse the Wikipedia dump and build the corpus.

  • The extracted Wikipedia dump is in JSON Lines format (line-delimited JSON): each line is one article record with id, url, title, and text fields.

# get the list of extracted files

files = eKonf.get_filepaths("**/*", wiki_cfg.output_dir)
print(f"Number of files: {len(files)}")
files[:10]
INFO:ekorpkit.io.file:Processing [1628] files from ['**/*']
Number of files: 1628
['/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_23',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_67',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_05',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_20',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_40',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_92',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_42',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_14',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_28',
 '/content/drive/MyDrive/workspace/.cache/corpus/kowiki/extracted/AH/wiki_24']

Check the beginning of one of the extracted files.

print(eKonf.read(files[0], mode="r", encoding="utf-8", head=200))
{"id": "634327", "revid": "414775", "url": "https://ko.wikipedia.org/wiki?curid=634327", "title": "\uc131\uc774\uc131", "text": "\uc131\uc774\uc131(\u6210\u4ee5\u6027, 1595\ub144(\uc120\uc870 28\ub144
cfg = eKonf.compose("corpus/builtin=kowiki")
cfg.io.fetcher = wiki_cfg
cfg.io.loader.data_dir = wiki_cfg.output_dir
cfg.verbose = True
cfg.num_workers = 50
# eKonf.print(cfg.io)
db = eKonf.instantiate(cfg)
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
INFO:ekorpkit.utils.notebook:shell type: ZMQInteractiveShell
INFO:ekorpkit.base:setting environment variable CACHED_PATH_CACHE_ROOT to /content/drive/MyDrive/workspace/.cache/cached_path
INFO:ekorpkit.base:setting environment variable KMP_DUPLICATE_LIB_OK to TRUE
INFO:ekorpkit.base:instantiating ekorpkit.datasets.build.DatasetBuilder...
INFO:ekorpkit.base:instantiating ekorpkit.io.fetch.loader.wiki.Wiki...
INFO:ekorpkit.base:instantiating ekorpkit.info.stat.SummaryInfo...
INFO:ekorpkit.info.stat:Loading info file: /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/info-kowiki.yaml
INFO:ekorpkit.base:instantiating ekorpkit.io.load.data.load_data...
INFO:ekorpkit.io.file:Processing [1628] files from ['**/*']
INFO:ekorpkit.io.load.data:Starting multiprocessing with 50 processes at load_data
{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'description': '위키백과, 우리 모두의 백과사전',
 'fullname': 'Korean Wikipedia Corpus',
 'homepage': 'https://ko.wikipedia.org',
 'lang': 'ko',
 'license': 'CC Attribution / Share-Alike 3.0',
 'name': 'kowiki',
 'version': '1.0.0'}
{'curid': '634327', 'url': 'https://ko.wikipedia.org/wiki?curid=634327', 'title': '성이성', 'text': "성이성(成以性, 1595년(선조 28년) ∼ 1664년(현종 5년))은 조선 후기의 문신이자 유학자, 청백리이다. 자(字)는 여습(汝習)이고 호는 계서(溪西)이다. 본관은 창녕(昌寧). 춘향전의 실제 주인공으로 춘향전의 주인공인 몽룡은 원래 성몽룡이었다. 남원부사와 승정원승지를 지낸 성안의의 아들이다.\n강직한 간관이자 청백리이다. 그의 직계 후손들은 춘향전에 나온 '금준미주 천인혈'이 그가 실제로 지은 한시라고 주장한다. 호서 암행어사와 호남 암행어사로 활동, 감찰하며 부패 수령들을 봉고파직시켰다. 이것 역시 춘향전의 소재가 된다. 학맥으로는 김굉필의 손제자이자 그의 학맥을 계승한 강복성(康復誠)의 문인이다. 경상북도 출신.\n생애.\n생애 초반.\n출생과 가계.\n성이성은 경상북도 봉화군 물야면 가평리 태생으로 아버지는 창녕 성씨로 승정원승지와 군수를 지낸 성안의(成安義)이고, 어머니는 예안 김씨로 증(贈) 호조 참판에 추증(追贈)된 김계선의 딸이다.\n그는 어려서부터 그는 학업에 열중하여 13세때 그가 쓴 글을 우연히 정경세(鄭經世)가 보게 되었다. 정경세는 그의 글을 읽고 장차 크게 될 인물이라 하였다.\n수학과 남원 생활.\n어려서부터 공부를 게을리하지 않고 학문에 더욱 증진하여 조경남의 문하에서 수학하다가 뒤에 강복성(康復誠)의 문인이 되었다. 강복성은 사림의 학통인 길재-김숙자-김종직-김굉필(金宏弼)-조광조-이연경(李延慶)의 학통을 계승한 학자였다.\n1607년(선조 40) 남원부사로 부임한 아버지 성안의를 따라 갔다가 그곳에서 만난 기생과의 일화가 후일 춘향전의 주 뼈대가 되었다. 그러나 아버지 성안의가 참의로 발령되면서 기생 춘향과는 이별하게 된다. 이때 시중에는 성이성과 춘향을 소재로 한 춘향전이 희극과 인형극, 만담 등으로 확산되었는데, 양반가의 자제의 스캔들이라 하여 조선조정에서 관을 시켜서 금지하게 되자 성몽룡을 이몽룡으로 바꾸고, 성씨(姓氏)가 없던 기생인 춘향에게 성씨 성을 붙여서 시연하게 된다.\n1616년(광해군 8년) 그는 사마시 양시에 합격했는데 생원시에 합격하여 생원(生員)이 되고, 그 해에 다시 진사시에 합격하여 진사(進士)가 되었다. 그러나 광해군 때의 난세에는 벼슬길에 나아가지 않았다.\n관료 생활.\n과거 급제와 관료생활.\n1627년 (인조5년)에 식년 문과에 병과로 급제하였다.\n1635년(인조 13) 성이성은 사간원 정언이 되고 홍문관 부수찬·부교리를 거쳐 1636년 사헌부지평이 되었다. 1637년(인조 15) 호서(湖西) 지방의 암행어사로 파견되었다가 돌아왔다. 그해 성이성은 사간원 헌납이 되어 공신이며 서인당의 고관인 윤방(尹昉)·김류(金류)·심기원(沈器遠)·김자점(金自點) 등을 탄핵하여 왕을 잘못된 길로 인도했다며 오국불충(誤國不忠)의 죄를 논하기도 했다.\n암행어사 활동.\n1639년(인조 17) 호남(湖南) 지방 암행어사에 임명되어 5년간 호남 지역을 순찰하고 1644년(인조 22) 되돌아왔가. 그 뒤 1647년(인조 25) 다시 호남 암행어사 로 파견되었다.\n그러나 호남 암행어사로 부임했을 때 신분을 노출시키고 마는데 성이성은 암행을 하고 다니다가 1647년 11월 25일 순천에서 실수로 부득이 자신의 신분을 드러내고 이후에는 한양으로 돌아오게 되는데, 그는 일기에 돌아오는 길이던 12월 1일 남원에 들렀다고 적고 있다.\n생애 후반.\n1648년 여름 성이성은 전라도 담양군수로 부임해 장마철 강둑이 범람해 피해가 큰 것을 보고 2년에 걸쳐 제방을 쌓고 그 위에 나무를 심었다. 현재 관방제림으로 불리는 숲이 그가 남긴 치세의 흔적이다. 푸조나무 느티나무 팽나무 등 184그루가 전한다. 본디 제방에 나무를 심으면 나무가 바람에 흔들릴 때 제방에 해롭다 하여 심지 않았는데 성이성은 비바람에 강한 토종나무를 골라 심음으로써 이같은 상식을 엎었다. 담양군은 현재 관방제림에 산책로를 조성하여 관광객들의 발길을 모으고 있다.\n외직으로는 진주부사 · 강계부사 등 네 고을을 다스렸는데, 진주부사로 재직할 때는 서인 출신으로 경상도 암행어사로 파견된 민정중(閔鼎重)이 조사하여 그의 선치(善治)를 보고하여 특별히 왕으로부터 표리(表裏, 옷감)를 받았고, 강계부사 때에는 여진족 등의 약탈과 흉년 등으로 어려움에 처한 부민들에게 삼세(蔘稅)를 모두 면제해주어 백성들이 기뻐하였으며 부처가 환생하여 돌아왔다며 '생불' 또는 '관서활불'(關西活佛)이라며 칭송하였다. 1664년(현종 15)에 향년 70세를 일기로 사망했다.\n사후.\n고향인 봉화군 물야면 가평 1리에는 성이성을 추모하는 사당인 계서당이 건립되었다. 사후인 1695년(숙종 21) 청렴함을 인정받아 조정으로부터 청백리로 선출되었다. 증 통정대부 부제학에 추증되었다. 저서로는 <계서유고>가 있다.\n귀신 문제 해결.\n전라도 지역에 귀신이 자주 출몰한다는 곳이 있었다. 그 곳은 상인이나 과거 시험을 보러 가던 선비들이 여러 번 변을 당했는데, 성이성이 이 문제를 해결하였다 한다. 호남 암행어사가 돼서 호남 지역의 귀신이 많이 나오는 곳을 찾아가 억울함을 달래주고 문제를 해결하였고 이것 역시 입에서 입으로 전해져 설화가 되었다.\n부패 관리 파직.\n충청도 암행어사 시절 지방관리의 잘못을 발견하고 어떻게 처리했는가를 인조에게 보고한 '서계'가 남아있었는데 KBS 방송국이 이를 취재하였다.를 보면 세금을 과다징수한 진천현감과 생일날 과다한 잔치를 벌이고 국법을 어긴 석성현감을 적발하여 파직시켰다는 기록이 있다.\n춘향전.\n춘향전.\n그는 아버지인 남원부사 성안의가 부임할 때, 아버지의 임지를 따라 남원에서 생활하다 우연히 남원 기생 춘향을 만나게 되었다. 그러나 춘향과는 이루어지지 못했고, 이는 바로 춘향전의 모티브가 되었다. 뒤에 호남 암행어사로 부임했다가 신분을 노출하고 되돌아갈 때 남원에 들렀다. 늙은기생 여진이 찾아와 만났는데, 그는 춘향을 찾았다 한다. 그러나 그는 춘향을 만날 수 없었다.\n‘서리와 함께 난간에 앉으니 눈빛이 뜰에 하얗게 깔려있고 대나무숲이 희었다. 나는 소년시절의 일을 생각하여 밤늦도록 잠들지 못했다.’\n춘향전은 판소리, 연극, 소설의 소재가 되었으나 양반가 자제의 스캔들이라 하여, 조정에서는 양반가의 위신을 떨어뜨린다며 춘향전의 상영을 금지하였다. 할 수 없이 민중들은 성몽룡 대신 이몽룡으로 성을 바꾸어서 연극과 판소리, 소설, 구전 등으로 전하였다.\n금준미주 천인혈.\n춘향전에 나오는 금준미주 천인혈은 성이성이 지은 시 중의 하나였다. 성이성이 춘향전에 나오는 성몽룡처럼 변사또를 응징한 남원 출두 기록은 실록이나 문집에는 없다. 그러나 춘향전에 나오는 잔치연에서 이몽룡이 변학도를 질타하면서 읊은 금준미주 천인혈 로 시작되는 시조는 성이성이 짓고, 읊었다. 이는 성이성의 4대손 성섭의 저서 <교와문고>와 그의 스승 조경남이 쓴 <난중잡록>에 그의 작품으로 기록되어 있다.\n호남 암행어사가 되었을때에 호남 12고을 군수, 현감들이 잔치를 베풀었다. 이때 성이성은 암행어사가 걸인의 행색을 하고서 연회장에 나타났다. 
호남의 12고을의 군수, 현감들은 그를 조롱하며 '그대가 시를 지으면 종일토록 놀고 짓지 못하면 가라.'고 했고, 그는 즉석에서 금준미주 천인혈 을 짓는다. 이어 전라도내 6명의 부패한 수령들을 봉고파직시킨다. 석성현감이 생일날 과다한 잔치를 벌인 것은 춘향전에 등장하는 변사또의 모티브가 되었다.", 'split': 'train', 'filename': 'wiki_23'}
INFO:ekorpkit.datasets.build: >> elapsed time to load and parse data: 0:00:10.173167
INFO:ekorpkit.datasets.build:
Transforming dataframe with pipeline: ['reset_index', 'save_metadata']
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('reset_index', 'reset_index'), ('save_metadata', 'save_metadata')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function reset_index at 0x7fce771d3ca0>)
INFO:ekorpkit.pipelines.pipe:Resetting index: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.reset_index'}, 'index_column_name': 'id', 'drop_index': False, 'verbose': True}
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_metadata at 0x7fce771d35e0>)
INFO:ekorpkit.pipelines.pipe:Saving metadata: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.save_metadata'}, 'path': {'root': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'name': 'ekorpkit-book', 'cached_path': None, 'filetype': None, 'verbose': True, 'data_dir': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'data_file': None, 'concat_data': False, 'data_columns': None, 'columns': None, 'output_dir': None, 'output_file': None, 'suffix': None, 'output': {'filename': 'meta-kowiki-train.parquet', 'base_dir': '/content/drive/MyDrive/workspace/data/datasets/corpus/kowiki', 'filetype': '.parquet', 'suffix': None, 'filepath': '/content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/meta-kowiki-train.parquet', 'columns': None, 'verbose': True}}, 'filepath': None, 'filetype': None, 'column_info': {'columns': {'id': 'id', 'text': 'text', 'merge_meta_on': 'id', 'timestamp': None}, 'datetime': {'columns': None, 'format': None, 'rcParams': None}, 'timestamp': {'key': None, 'format': None, 'rcParams': None}, 'data': {'id': 'int', 'text': 'str'}, 'meta': {'id': 'int', 'curid': 'str', 'url': 'str', 'title': 'str'}, 'segment_separator': '\\n\\n', 'sentence_separator': '\\n'}, 'split_name': 'train', 'verbose': True}
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/meta-kowiki-train.parquet
    curid                                         url title  \
0  634327  https://ko.wikipedia.org/wiki?curid=634327   성이성   
1  634328  https://ko.wikipedia.org/wiki?curid=634328    누타   
2  634329  https://ko.wikipedia.org/wiki?curid=634329  공중그네   
3  634331  https://ko.wikipedia.org/wiki?curid=634331   성몽룡   
4  634332  https://ko.wikipedia.org/wiki?curid=634332    계서   

                                                text  split filename  
0  성이성(成以性, 1595년(선조 28년) ∼ 1664년(현종 5년))은 조선 후기의...  train  wiki_23  
1  누타(ぬた)는 잘게 썬 생선이나 조개를 파, 채소, 미역과 함께 초된장으로 무친 요...  train  wiki_23  
2                         공중그네(空中-)는 서커스의 기술 중 하나이다.  train  wiki_23  
3                                                     train  wiki_23  
4                                                     train  wiki_23  
(1339048, 6)
   id   curid                                         url title  \
0   0  634327  https://ko.wikipedia.org/wiki?curid=634327   성이성   
1   1  634328  https://ko.wikipedia.org/wiki?curid=634328    누타   
2   2  634329  https://ko.wikipedia.org/wiki?curid=634329  공중그네   
3   3  634331  https://ko.wikipedia.org/wiki?curid=634331   성몽룡   
4   4  634332  https://ko.wikipedia.org/wiki?curid=634332    계서   

                                                text  split filename  
0  성이성(成以性, 1595년(선조 28년) ∼ 1664년(현종 5년))은 조선 후기의...  train  wiki_23  
1  누타(ぬた)는 잘게 썬 생선이나 조개를 파, 채소, 미역과 함께 초된장으로 무친 요...  train  wiki_23  
2                         공중그네(空中-)는 서커스의 기술 중 하나이다.  train  wiki_23  
3                                                     train  wiki_23  
4                                                     train  wiki_23  
INFO:ekorpkit.io.file: >> elapsed time to save data: 0:00:05.303806
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/kowiki-train.parquet
INFO:ekorpkit.io.file: >> elapsed time to save data: 0:01:05.529620
INFO:ekorpkit.info.stat:Initializing statistics for split: train with stats: {'name': 'train', 'dataset_name': 'kowiki', 'data_file': 'kowiki-train.parquet', 'meta_file': 'meta-kowiki-train.parquet'}
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 1339048 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 1339048 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics before processing: 0:00:32.387224
INFO:ekorpkit.info.stat: >> updated splits: {'train': {'name': 'train', 'dataset_name': 'kowiki', 'data_file': 'kowiki-train.parquet', 'meta_file': 'meta-kowiki-train.parquet', 'num_docs_before_processing': 1339048, 'num_bytes_before_processing': 801994255, 'num_sents': 3829874}}
INFO:ekorpkit.datasets.build:
Processing dataframe with pipeline: ['normalize', 'segment', 'filter_length', 'drop_duplicates', 'save_samples']
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('normalize', 'normalize'), ('segment', 'segment'), ('filter_length', 'filter_length'), ('drop_duplicates', 'drop_duplicates'), ('save_samples', 'save_samples')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function normalize at 0x7fce771d3ee0>)
INFO:ekorpkit.pipelines.pipe:instantiating normalizer
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 1339048 len(args): 5
INFO:ekorpkit.pipelines.pipe: >> elapsed time to normalize: 0:00:18.489442
INFO:ekorpkit.base:Applying pipe: functools.partial(<function segment at 0x7fce771d30d0>)
INFO:ekorpkit.pipelines.pipe:instantiating segmenter
INFO:ekorpkit.base:instantiating ekorpkit.preprocessors.segmenter.KSSSegmenter...
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 1339048 len(args): 5
INFO:ekorpkit.pipelines.pipe: >> elapsed time to segment: 0:26:43.238374
INFO:ekorpkit.base:Applying pipe: functools.partial(<function filter_length at 0x7fce771d3280>, len_bytes={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, len_words={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'})
INFO:ekorpkit.pipelines.pipe:Filtering by length: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.filter_length', 'len_bytes': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, 'len_words': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'}}, 'apply_to': 'text', 'min_length': 30, 'max_length': None, 'len_func': 'len_bytes', 'len_column': 'num_bytes', 'add_len_column': True, 'verbose': True, 'use_batcher': True}
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 1339048 len(args): 5
INFO:ekorpkit.pipelines.pipe:removed 736936 of 1339048 documents with length < 30
INFO:ekorpkit.pipelines.pipe: >> elapsed time to filter length: 0:00:03.079006
INFO:ekorpkit.base:Applying pipe: functools.partial(<function drop_duplicates at 0x7fce771d34c0>)
INFO:ekorpkit.pipelines.pipe:Dropping duplicates: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.drop_duplicates'}, 'apply_to': 'text', 'verbose': True}
INFO:ekorpkit.pipelines.pipe:601641 documents after dropping 471 duplicates from [['text']]
INFO:ekorpkit.pipelines.pipe: >> elapsed time to drop duplicates: 0:00:01.811704
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_samples at 0x7fce771d3790>)
INFO:ekorpkit.pipelines.pipe:Saving samples: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.save_samples'}, 'path': {'root': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'name': 'ekorpkit-book', 'cached_path': None, 'filetype': '', 'verbose': True, 'data_dir': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'data_file': None, 'concat_data': False, 'data_columns': None, 'columns': None, 'output_dir': None, 'output_file': None, 'suffix': None, 'output': {'filename': 'sample-kowiki-train.txt', 'base_dir': '/content/drive/MyDrive/workspace/data/datasets/corpus/kowiki', 'filetype': '.txt', 'suffix': None, 'filepath': '/content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/sample-kowiki-train.txt', 'columns': None, 'verbose': True}}, 'apply_to': 'text', 'num_samples_to_save': 2, 'output_file': None, 'sample_length_to_print': 1000, 'verbose': True}
INFO:ekorpkit.pipelines.pipe:Saved 2 samples to /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/sample-kowiki-train.txt
INFO:ekorpkit.info.stat:Calculating statistics for split: train
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 601641 len(args): 5
----------------------------------------------------------------------------------------------------

text: 
《그랜드 점프》(, )는 슈에이샤가 발행하는 일본의 소년 만화 잡지이다.

----------------------------------------------------------------------------------------------------
text: 
레이크파크()는 다음과 같은 뜻이 있다.

----------------------------------------------------------------------------------------------------
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 601641 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 601641 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 601641 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 601641 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:07.233255
INFO:ekorpkit.info.stat:Saving updated info file: /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/info-kowiki.yaml
INFO:ekorpkit.datasets.build:
Corpus [kowiki] is built to [/content/drive/MyDrive/workspace/data/datasets/corpus/kowiki] from [/content/drive/MyDrive/workspace/data/archive/datasets/source/kowiki]
{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'kowiki-train.parquet'},
 'data_files_modified': '2022-10-29 06:30:41',
 'description': '위키백과, 우리 모두의 백과사전',
 'fullname': 'Korean Wikipedia Corpus',
 'homepage': 'https://ko.wikipedia.org',
 'info_updated': '2022-10-29 06:58:28',
 'lang': 'ko',
 'license': 'CC Attribution / Share-Alike 3.0',
 'meta_files': {'train': 'meta-kowiki-train.parquet'},
 'meta_files_modified': '2022-10-29 06:29:35',
 'name': 'kowiki',
 'num_bytes_before_processing': 801994255,
 'num_docs': 601641,
 'num_docs_before_processing': 1339048,
 'num_segments': 601727,
 'num_sents': 6064076,
 'num_words': 74965324,
 'size_in_bytes': 800672700,
 'size_in_human_bytes': '763.58 MiB',
 'splits': {'train': {'data_file': 'kowiki-train.parquet',
                      'dataset_name': 'kowiki',
                      'human_bytes': '763.58 MiB',
                      'human_bytes_wospc': '692.66 MiB',
                      'meta_file': 'meta-kowiki-train.parquet',
                      'name': 'train',
                      'num_bytes': 800672700,
                      'num_bytes_before_processing': 801994255,
                      'num_bytes_max': 417935,
                      'num_bytes_median': 348.0,
                      'num_bytes_min': 30,
                      'num_bytes_wospc': 726308931,
                      'num_docs': 601641,
                      'num_docs_before_processing': 1339048,
                      'num_segments': 601727,
                      'num_segments_median': 1.0,
                      'num_sents': 6064076,
                      'num_sents_median': 3.0,
                      'num_words': 74965324,
                      'num_words_max': 35733,
                      'num_words_median': 32.0,
                      'num_words_min': 1}},
 'version': '1.0.0'}
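After the build finishes, the corpus and its metadata are ordinary parquet files at the locations shown in the info dictionary above, so they can be sanity-checked with pandas directly (a minimal sketch, assuming the workspace_dir set at the top of this notebook, the default output layout, and a parquet engine such as pyarrow):

import pandas as pd

# Read the built corpus and its metadata back from the default output location.
corpus_dir = f"{workspace_dir}/data/datasets/corpus/kowiki"
data = pd.read_parquet(f"{corpus_dir}/kowiki-train.parquet")
meta = pd.read_parquet(f"{corpus_dir}/meta-kowiki-train.parquet")

print(data.shape, meta.shape)
print(data.head(3))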

Build the English Wikipedia Corpus#

cfg = eKonf.compose("corpus/builtin=enwiki")
cfg.verbose = True
cfg.num_workers = 50
db = eKonf.instantiate(cfg)
INFO:ekorpkit.base:instantiating ekorpkit.datasets.build.DatasetBuilder...
INFO:ekorpkit.base:instantiating ekorpkit.io.fetch.loader.wiki.Wiki...
INFO: Preprocessing '/content/drive/MyDrive/workspace/.cache/corpus/enwiki/enwiki.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
...
INFO: Preprocessed 22300000 pages
INFO: Preprocessed 22400000 pages
INFO: Loaded 743398 templates in 4128.0s
INFO: Starting page extraction from /content/drive/MyDrive/workspace/.cache/corpus/enwiki/enwiki.xml.bz2.
INFO: Using 255 extract processes.
INFO: Extracted 100000 articles (871.5 art/s)
INFO: Extracted 200000 articles (1301.9 art/s)
...
INFO: Extracted 16500000 articles (4203.8 art/s)
INFO: Extracted 16600000 articles (3595.3 art/s)
INFO: Finished 255-process extraction of 16699989 articles in 4520.2s (3694.5 art/s)
INFO:ekorpkit.base:instantiating ekorpkit.info.stat.SummaryInfo...
INFO:ekorpkit.info.stat:Loading info file: /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/info-enwiki.yaml
INFO:ekorpkit.base:instantiating ekorpkit.io.load.data.load_data...
Extracted enwiki from dump file /content/drive/MyDrive/workspace/.cache/corpus/enwiki/enwiki.xml.bz2
{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'description': 'Wikipedia',
 'fullname': 'English Wikipedia Corpus',
 'homepage': 'https://en.wikipedia.org',
 'lang': 'en',
 'license': 'CC Attribution / Share-Alike 3.0',
 'name': 'enwiki',
 'version': '1.0.0'}
INFO:ekorpkit.io.file:Processing [17144] files from ['**/*']
INFO:ekorpkit.io.load.data:Starting multiprocessing with 50 processes at load_data
{'curid': '40754509', 'url': 'https://en.wikipedia.org/wiki?curid=40754509', 'title': 'Endocannabinoid transporter', 'text': 'The endocannabinoid transporters (eCBTs) are transport proteins for the endocannabinoids. Most neurotransmitters are water-soluble and require transmembrane proteins to transport them across the cell membrane. The endocannabinoids (anandamide, AEA, and 2-arachidonoylglycerol, 2-AG) on the other hand, are non-charged lipids that readily cross lipid membranes. However, since the endocannabinoids are water immiscible, protein transporters have been described that act as carriers to solubilize and transport the endocannabinoids through the aqueous cytoplasm. These include the heat shock proteins (Hsp70s) and fatty acid-binding proteins for anandamide (FABPs). FABPs such as FABP1, FABP3, FABP5, and FABP7 have been shown to bind endocannabinoids. FABP inhibitors attenuate the breakdown of anandamide by the enzyme fatty acid amide hydrolase (FAAH) in cell culture. One of these inhibitors (SB-FI-26), isolated from a virtual library of a million compounds, belongs to a class of compounds (named the "truxilloids\') that act as an anti-nociceptive agent with mild anti-inflammatory activity in mice. These truxillic acids and their derivatives have been known to have anti-inflammatory and anti-nociceptive effects in mice and are active components of a Chinese herbal medicine ((−)-Incarvillateine Incarvillea sinensis) used to treat rheumatism and pain in human. The blockade of anandamide transport may, at least in part, be the mechanism through which these compounds exert their anti-nociceptive effects.\nStudies have found the involvement of cholesterol in membrane uptake and transport of anandamide. Cholesterol stimulates both the insertion of anandamide into synthetic lipid monolayers and bilayers, and its transport across bilayer membranes, suggest that besides putative anandamide protein-transporters, cholesterol could be an important component of the anandamide transport machinery, and as cholesterol-dependent modulation of CB1 cannabinoid receptors in nerve cells. The catalytic efficiency (i.e., the ratio between maximal velocity and Michaelis–Menten constant) of the AEA membrane transporter (AMT) is almost doubled compared with control cells, demonstrate that, among the proteins of the “endocannabinoid system,” only CB1 and AMT critically depend on membrane cholesterol content, an observation that may have important implications for the role of CB1 in protecting nerve cells against (endo)cannabinoid-induced apoptosis. This can be a reason, why the use of drugs to lower cholesterol is tied to a higher depression risk, and the correlation between levels and increased death rates from suicide and other violent causes.\nActivation of CB1 enhances AMT activity through increased nitric oxide synthase (NOS) activity and subsequent increase of NO production, whereas AMT activity instead is reduced by activation of the CB2 cannabinoid receptor, which inhibits NOS and NO release, also suggesting the distribution of these receptors may drive AEA directional transport through the blood-brain barrier and other endothelial cells.\nAs reviewed in 2016; "Many of the AMT (EMT) proposals have fallen by the wayside." To date a transmembrane protein transporter has not been identified. ', 'split': 'train', 'filename': 'wiki_23'}
INFO:ekorpkit.datasets.build: >> elapsed time to load and parse data: 0:01:36.349757
INFO:ekorpkit.datasets.build:
Transforming dataframe with pipeline: ['reset_index', 'save_metadata']
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('reset_index', 'reset_index'), ('save_metadata', 'save_metadata')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function reset_index at 0x7fce771d3ca0>)
INFO:ekorpkit.pipelines.pipe:Resetting index: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.reset_index'}, 'index_column_name': 'id', 'drop_index': False, 'verbose': True}
      curid                                           url  \
0  40754509  https://en.wikipedia.org/wiki?curid=40754509   
1  40754512  https://en.wikipedia.org/wiki?curid=40754512   
2  40754531  https://en.wikipedia.org/wiki?curid=40754531   
3  40754542  https://en.wikipedia.org/wiki?curid=40754542   
4  40754545  https://en.wikipedia.org/wiki?curid=40754545   

                         title  \
0  Endocannabinoid transporter   
1              Buddy McClinton   
2                   Power Lock   
3                  Mike Ballou   
4          Philip M. Kleinfeld   

                                                text  split filename  
0  The endocannabinoid transporters (eCBTs) are t...  train  wiki_23  
1  Buddy McClinton was a defensive back for the A...  train  wiki_23  
2                                                     train  wiki_23  
3  Mikell Randolph Ballou (born September 11, 194...  train  wiki_23  
4  Philip M. Kleinfeld (June 19, 1894 – January 1...  train  wiki_23  
(16699988, 6)
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_metadata at 0x7fce771d35e0>)
INFO:ekorpkit.pipelines.pipe:Saving metadata: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.save_metadata'}, 'path': {'root': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'name': 'ekorpkit-book', 'cached_path': None, 'filetype': None, 'verbose': True, 'data_dir': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'data_file': None, 'concat_data': False, 'data_columns': None, 'columns': None, 'output_dir': None, 'output_file': None, 'suffix': None, 'output': {'filename': 'meta-enwiki-train.parquet', 'base_dir': '/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki', 'filetype': '.parquet', 'suffix': None, 'filepath': '/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/meta-enwiki-train.parquet', 'columns': None, 'verbose': True}}, 'filepath': None, 'filetype': None, 'column_info': {'columns': {'id': 'id', 'text': 'text', 'merge_meta_on': 'id', 'timestamp': None}, 'datetime': {'columns': None, 'format': None, 'rcParams': None}, 'timestamp': {'key': None, 'format': None, 'rcParams': None}, 'data': {'id': 'int', 'text': 'str'}, 'meta': {'id': 'int', 'curid': 'str', 'url': 'str', 'title': 'str'}, 'segment_separator': '\\n\\n', 'sentence_separator': '\\n'}, 'split_name': 'train', 'verbose': True}
   id     curid                                           url  \
0   0  40754509  https://en.wikipedia.org/wiki?curid=40754509   
1   1  40754512  https://en.wikipedia.org/wiki?curid=40754512   
2   2  40754531  https://en.wikipedia.org/wiki?curid=40754531   
3   3  40754542  https://en.wikipedia.org/wiki?curid=40754542   
4   4  40754545  https://en.wikipedia.org/wiki?curid=40754545   

                         title  \
0  Endocannabinoid transporter   
1              Buddy McClinton   
2                   Power Lock   
3                  Mike Ballou   
4          Philip M. Kleinfeld   

                                                text  split filename  
0  The endocannabinoid transporters (eCBTs) are t...  train  wiki_23  
1  Buddy McClinton was a defensive back for the A...  train  wiki_23  
2                                                     train  wiki_23  
3  Mikell Randolph Ballou (born September 11, 194...  train  wiki_23  
4  Philip M. Kleinfeld (June 19, 1894 – January 1...  train  wiki_23  
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/meta-enwiki-train.parquet
INFO:ekorpkit.io.file: >> elapsed time to save data: 0:01:02.858828
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/enwiki-train.parquet
INFO:ekorpkit.io.file: >> elapsed time to save data: 0:19:07.049930
INFO:ekorpkit.info.stat:Initializing statistics for split: train with stats: {'name': 'train', 'dataset_name': 'enwiki', 'data_file': 'enwiki-train.parquet', 'meta_file': 'meta-enwiki-train.parquet'}
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 16699988 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 16699988 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics before processing: 0:01:52.110589
INFO:ekorpkit.info.stat: >> updated splits: {'train': {'name': 'train', 'dataset_name': 'enwiki', 'data_file': 'enwiki-train.parquet', 'meta_file': 'meta-enwiki-train.parquet', 'num_docs_before_processing': 16699988, 'num_bytes_before_processing': 15342187701, 'num_sents': 68973662}}
INFO:ekorpkit.datasets.build:
Processing dataframe with pipeline: ['normalize', 'segment', 'filter_length', 'drop_duplicates', 'save_samples']
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('normalize', 'normalize'), ('segment', 'segment'), ('filter_length', 'filter_length'), ('drop_duplicates', 'drop_duplicates'), ('save_samples', 'save_samples')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function normalize at 0x7fce771d3ee0>)
INFO:ekorpkit.pipelines.pipe:instantiating normalizer
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 16699988 len(args): 5
INFO:ekorpkit.pipelines.pipe: >> elapsed time to normalize: 0:02:10.739667
INFO:ekorpkit.base:Applying pipe: functools.partial(<function segment at 0x7fce771d30d0>)
INFO:ekorpkit.pipelines.pipe:instantiating segmenter
INFO:ekorpkit.base:instantiating ekorpkit.preprocessors.segmenter.PySBDSegmenter...
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 16699988 len(args): 5
INFO:ekorpkit.pipelines.pipe: >> elapsed time to segment: 0:41:02.925846
INFO:ekorpkit.base:Applying pipe: functools.partial(<function filter_length at 0x7fce771d3280>, len_bytes={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, len_words={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'})
INFO:ekorpkit.pipelines.pipe:Filtering by length: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.filter_length', 'len_bytes': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, 'len_words': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'}}, 'apply_to': 'text', 'min_length': 30, 'max_length': None, 'len_func': 'len_bytes', 'len_column': 'num_bytes', 'add_len_column': True, 'verbose': True, 'use_batcher': True}
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 16699988 len(args): 5
INFO:ekorpkit.pipelines.pipe:removed 10366199 of 16699988 documents with length < 30
INFO:ekorpkit.pipelines.pipe: >> elapsed time to filter length: 0:00:31.301093
INFO:ekorpkit.base:Applying pipe: functools.partial(<function drop_duplicates at 0x7fce771d34c0>)
INFO:ekorpkit.pipelines.pipe:Dropping duplicates: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.drop_duplicates'}, 'apply_to': 'text', 'verbose': True}
INFO:ekorpkit.pipelines.pipe:6327718 documents after dropping 6071 duplicates from [['text']]
INFO:ekorpkit.pipelines.pipe: >> elapsed time to drop duplicates: 0:00:44.304687
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_samples at 0x7fce771d3790>)
INFO:ekorpkit.pipelines.pipe:Saving samples: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.save_samples'}, 'path': {'root': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'name': 'ekorpkit-book', 'cached_path': None, 'filetype': '', 'verbose': True, 'data_dir': '/content/drive/MyDrive/workspace/data/ekorpkit-book', 'data_file': None, 'concat_data': False, 'data_columns': None, 'columns': None, 'output_dir': None, 'output_file': None, 'suffix': None, 'output': {'filename': 'sample-enwiki-train.txt', 'base_dir': '/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki', 'filetype': '.txt', 'suffix': None, 'filepath': '/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/sample-enwiki-train.txt', 'columns': None, 'verbose': True}}, 'apply_to': 'text', 'num_samples_to_save': 2, 'output_file': None, 'sample_length_to_print': 1000, 'verbose': True}
INFO:ekorpkit.pipelines.pipe:Saved 2 samples to /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/sample-enwiki-train.txt
----------------------------------------------------------------------------------------------------

text: 
Novska railway station () is a railway station on the Novska-Tovarnik railway in Novska, Croatia. 
There are three lines connecting Novska to Jasenovac, Okučani, and Lipovljani. 
The railway station consists of 18 railway tracks.

----------------------------------------------------------------------------------------------------
text: 
Zoltán Friedmanszky (22 October 1934 - 31 March 2022) was a Hungarian footballer who played as a forward. 
He was a member of the Hungary national team at the 1958 FIFA World Cup. 
However, he was never capped for the national team. 
He also played for Ferencváros.

----------------------------------------------------------------------------------------------------
INFO:ekorpkit.info.stat:Calculating statistics for split: train
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 6327718 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 6327718 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 6327718 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 6327718 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 50  input_split: False  merge_output: True  len(data): 6327718 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:01:53.710827
INFO:ekorpkit.info.stat:Saving updated info file: /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/info-enwiki.yaml
INFO:ekorpkit.datasets.build:
Corpus [enwiki] is built to [/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki] from [/content/drive/MyDrive/workspace/data/archive/datasets/source/enwiki]
{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'enwiki-train.parquet'},
 'data_files_modified': '2022-10-29 11:04:23',
 'description': 'Wikipedia',
 'fullname': 'English Wikipedia Corpus',
 'homepage': 'https://en.wikipedia.org',
 'info_updated': '2022-10-29 11:52:41',
 'lang': 'en',
 'license': 'CC Attribution / Share-Alike 3.0',
 'meta_files': {'train': 'meta-enwiki-train.parquet'},
 'meta_files_modified': '2022-10-29 10:45:11',
 'name': 'enwiki',
 'num_bytes_before_processing': 15342187701,
 'num_docs': 6327718,
 'num_docs_before_processing': 16699988,
 'num_segments': 6329379,
 'num_sents': 133373574,
 'num_words': 2482445427,
 'size_in_bytes': 15381978510,
 'size_in_human_bytes': '14.33 GiB',
 'splits': {'train': {'data_file': 'enwiki-train.parquet',
                      'dataset_name': 'enwiki',
                      'human_bytes': '14.33 GiB',
                      'human_bytes_wospc': '11.95 GiB',
                      'meta_file': 'meta-enwiki-train.parquet',
                      'name': 'train',
                      'num_bytes': 15381978510,
                      'num_bytes_before_processing': 15342187701,
                      'num_bytes_max': 381498,
                      'num_bytes_median': 951.0,
                      'num_bytes_min': 30,
                      'num_bytes_wospc': 12831216681,
                      'num_docs': 6327718,
                      'num_docs_before_processing': 16699988,
                      'num_segments': 6329379,
                      'num_segments_median': 1.0,
                      'num_sents': 133373574,
                      'num_sents_median': 10.0,
                      'num_words': 2482445427,
                      'num_words_max': 65724,
                      'num_words_median': 154.0,
                      'num_words_min': 1}},
 'version': '1.0.0'}

Build a Wikipedia Corpus for Other Languages#

cfg = eKonf.compose("corpus/builtin=wiki")
cfg.lang = "bn"
cfg.verbose = True
cfg.num_workers = 50

db = eKonf.instantiate(cfg)
INFO:ekorpkit.base:instantiating ekorpkit.datasets.build.DatasetBuilder...
INFO:ekorpkit.base:instantiating ekorpkit.io.fetch.loader.wiki.Wiki...
INFO:ekorpkit.base:instantiating ekorpkit.info.stat.SummaryInfo...
INFO:ekorpkit.info.stat:Loading info file: /content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/info-bnwiki.yaml
INFO:ekorpkit.datasets.build:/content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/bnwiki-train.parquet already exists
INFO:ekorpkit.io.file:Processing [1] files from ['bnwiki-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/bnwiki-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/bnwiki-train.parquet
{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'bnwiki-train.parquet'},
 'data_files_modified': '2022-10-31 02:06:51',
 'description': 'Wikipedia',
 'fullname': 'Wikipedia Corpus (bn)',
 'homepage': 'https://bn.wikipedia.org',
 'info_updated': '2022-10-31 02:07:26',
 'lang': 'bn',
 'license': 'CC Attribution / Share-Alike 3.0',
 'meta_files': {'train': 'meta-bnwiki-train.parquet'},
 'meta_files_modified': '2022-10-31 02:05:58',
 'name': 'bnwiki',
 'num_bytes_before_processing': 669133776,
 'num_docs': 127833,
 'num_docs_before_processing': 348546,
 'num_segments': 127870,
 'num_sents': 1074085,
 'num_words': 36516749,
 'size_in_bytes': 668725092,
 'size_in_human_bytes': '637.75 MiB',
 'splits': {'train': {'data_file': 'bnwiki-train.parquet',
                      'dataset_name': 'bnwiki',
                      'human_bytes': '637.75 MiB',
                      'human_bytes_wospc': '602.97 MiB',
                      'meta_file': 'meta-bnwiki-train.parquet',
                      'name': 'train',
                      'num_bytes': 668725092,
                      'num_bytes_before_processing': 669133776,
                      'num_bytes_max': 354024,
                      'num_bytes_median': 2621.0,
                      'num_bytes_min': 30,
                      'num_bytes_wospc': 632254760,
                      'num_docs': 127833,
                      'num_docs_before_processing': 348546,
                      'num_segments': 127870,
                      'num_segments_median': 1.0,
                      'num_sents': 1074085,
                      'num_sents_median': 5.0,
                      'num_words': 36516749,
                      'num_words_max': 18938,
                      'num_words_median': 142.0,
                      'num_words_min': 1}},
 'version': '1.0.0'}
INFO:ekorpkit.io.file: >> elapsed time to load data: 0:00:03.919305
INFO:ekorpkit.datasets.build:
Processing dataframe with pipeline: ['normalize', 'filter_length', 'drop_duplicates', 'save_samples']
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('normalize', 'normalize'), ('filter_length', 'filter_length'), ('drop_duplicates', 'drop_duplicates'), ('save_samples', 'save_samples')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function normalize at 0x7fc6497bc700>)
INFO:ekorpkit.pipelines.pipe:instantiating normalizer
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 348546 len(args): 5
INFO:ekorpkit.pipelines.pipe: >> elapsed time to normalize: 0:00:20.737039
INFO:ekorpkit.base:Applying pipe: functools.partial(<function filter_length at 0x7fc655201f70>, len_bytes={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, len_words={'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'})
INFO:ekorpkit.pipelines.pipe:Filtering by length: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.filter_length', 'len_bytes': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_bytes'}, 'len_words': {'_partial_': True, '_target_': 'ekorpkit.utils.func.len_words'}}, 'apply_to': 'text', 'min_length': 30, 'max_length': None, 'len_func': 'len_bytes', 'len_column': 'num_bytes', 'add_len_column': True, 'verbose': True, 'use_batcher': True}
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 348546 len(args): 5
INFO:ekorpkit.pipelines.pipe:removed 220674 of 348546 documents with length < 30
INFO:ekorpkit.pipelines.pipe: >> elapsed time to filter length: 0:00:01.750412
INFO:ekorpkit.base:Applying pipe: functools.partial(<function drop_duplicates at 0x7fc655201dc0>)
INFO:ekorpkit.pipelines.pipe:Dropping duplicates: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.drop_duplicates'}, 'apply_to': 'text', 'verbose': True}
INFO:ekorpkit.pipelines.pipe:127833 documents after dropping 39 duplicates from [['text']]
INFO:ekorpkit.pipelines.pipe: >> elapsed time to drop duplicates: 0:00:01.292660
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_samples at 0x7fc655201ee0>)
INFO:ekorpkit.pipelines.pipe:Saving samples: {'_func_': {'_partial_': True, '_target_': 'ekorpkit.pipelines.pipe.save_samples'}, 'path': {'root': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/ekorpkit-book', 'name': 'ekorpkit-book', 'cached_path': None, 'filetype': '', 'verbose': True, 'data_dir': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/ekorpkit-book/data', 'data_file': None, 'concat_data': False, 'data_columns': None, 'columns': None, 'output_dir': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/ekorpkit-book/outputs', 'output_file': None, 'suffix': None, 'output': {'filename': 'sample-bnwiki-train.txt', 'base_dir': '/content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki', 'filetype': '.txt', 'suffix': None, 'filepath': '/content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/sample-bnwiki-train.txt', 'columns': None, 'verbose': True}}, 'apply_to': 'text', 'num_samples_to_save': 2, 'output_file': None, 'sample_length_to_print': 1000, 'verbose': True}
INFO:ekorpkit.pipelines.pipe:Saved 2 samples to /content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/sample-bnwiki-train.txt
INFO:ekorpkit.info.stat:Calculating statistics for split: train
INFO:ekorpkit.base:Using batcher with minibatch size: 556
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 556  procs: 230  input_split: False  merge_output: True  len(data): 127833 len(args): 5
----------------------------------------------------------------------------------------------------

text: 
উম্মে হানি বিনতে আবি তালিব (আরবি فاختة بنت أبي طالب) হযরত মুহাম্মাদ সাঃ এর চাচাত বোন ছিলেন। উম্মে হানি আবু তালিবের কন্যা ছিলেন। তিনি একজন হাদিস বর্ণনাকারী সাহাবা ছিলেন।
নাম ও বংশ পরিচয়.
উম্মে হানি বিনতে আবি তালিব এর আসল ছিলো ফাখিতা মতান্তরে হিন্দ। তার পিতার নাম আবু তালিব ইবনে আবদুল মুত্তালিব ও মাতার নাম ছিলো ফাতিমা বিনতে আসাদ। তিনি জাফর,আকিল ও আলীর সহোদরা ছিলেন।
উম্মে হানির কয়েকজন সন্তানের নাম হলো:
ইসলাম পূর্ব সময়.
তার বাল্যকালের কথা তেমন কিছু জানা যায় না। তবে মহানবী হযরত মুহাম্মাদ সাঃ এর নবুওয়াত প্রাপ্তির পূর্বে চাচা আবু তালিবের নিকট উম্মে হানির বিয়ের প্রস্তাব পাঠান। একই সময়হুবায়রা ইবনে আমর ইবনে আয়িয আল মাখযুমিও উম্মে হানিকে বিয়ে করতে চান। আবু তালিব হুবায়রার প্রস্তাব গ্রহণ করে উম্মে হানিকে তার সাথে বিয়ে দেন। এবং হযরত মুহাম্মাদ সাঃ কে বললেন, "ভাতিজা! আমরা তার সাথে বৈবাহিক সম্পর্ক করেছি। সম্মানীয়দের সমকক্ষ সম্মানীয়রাই হয়ে থাকে।" 
মক্কা বিজয়ের দিন ও ইসলাম গ্রহণ.
ইমাম আয যাহাবি বলেছেন, উম্মে হানি মক্কা বিজয়ের দিন ইসলাম গ্রহণ করেছেন। মক্কা বিজয়ের দিন উম্মে হানির ইসলামের গ...

----------------------------------------------------------------------------------------------------
text: 
কুনাহান্ধু ( দিভেহি : ކުނަހަންދޫ) লামু প্রবালপ্রাচীরের একটি জন অধ্যুষিত দ্বীপ।
ভূগোল.
দেশের রাজধানী মালে থেকে দ্বীপটি দক্ষিণে অবস্থিত।

----------------------------------------------------------------------------------------------------
INFO:ekorpkit.base:Using batcher with minibatch size: 556
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 556  procs: 230  input_split: False  merge_output: True  len(data): 127833 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 556
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 556  procs: 230  input_split: False  merge_output: True  len(data): 127833 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 556
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 556  procs: 230  input_split: False  merge_output: True  len(data): 127833 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 556
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 556  procs: 230  input_split: False  merge_output: True  len(data): 127833 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:05.268952
INFO:ekorpkit.info.stat:Saving updated info file: /content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/info-bnwiki.yaml
INFO:ekorpkit.datasets.build:
Corpus [bnwiki] is built to [/content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki] from [/content/drive/MyDrive/workspace/data/archive/datasets/source/bnwiki]
{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'bnwiki-train.parquet'},
 'data_files_modified': '2022-11-04 06:36:13',
 'description': 'Wikipedia',
 'fullname': 'Wikipedia Corpus (bn)',
 'homepage': 'https://bn.wikipedia.org',
 'info_updated': '2022-11-04 08:17:09',
 'lang': 'bn',
 'license': 'CC Attribution / Share-Alike 3.0',
 'meta_files': {'train': 'meta-bnwiki-train.parquet'},
 'meta_files_modified': '2022-11-04 06:36:13',
 'name': 'bnwiki',
 'num_bytes_before_processing': 669133776,
 'num_docs': 127833,
 'num_docs_before_processing': 348546,
 'num_segments': 127870,
 'num_sents': 1074085,
 'num_words': 36516749,
 'size_in_bytes': 668725092,
 'size_in_human_bytes': '637.75 MiB',
 'splits': {'train': {'data_file': 'bnwiki-train.parquet',
                      'dataset_name': 'bnwiki',
                      'human_bytes': '637.75 MiB',
                      'human_bytes_wospc': '602.97 MiB',
                      'meta_file': 'meta-bnwiki-train.parquet',
                      'name': 'train',
                      'num_bytes': 668725092,
                      'num_bytes_before_processing': 669133776,
                      'num_bytes_max': 354024,
                      'num_bytes_median': 2621.0,
                      'num_bytes_min': 30,
                      'num_bytes_wospc': 632254760,
                      'num_docs': 127833,
                      'num_docs_before_processing': 348546,
                      'num_segments': 127870,
                      'num_segments_median': 1.0,
                      'num_sents': 1074085,
                      'num_sents_median': 5.0,
                      'num_words': 36516749,
                      'num_words_max': 18938,
                      'num_words_median': 142.0,
                      'num_words_min': 1}},
 'version': '1.0.0'}
time: 41.5 s (started: 2022-11-04 08:16:28 +00:00)

Build a Corpus using CLI#

To build corpora more efficiently with multiple processors, it is preferable to use the CLI (command-line interface). The command below reproduces the build above; raise num_workers to spread the work across more processes.

ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    verbose=true \
    num_workers=1 \
    run=corpus.builtin \
    corpus/builtin=wiki \
    corpus.builtin.lang="bn" \
    corpus.builtin.io.force.summarize=false \
    corpus.builtin.io.force.preprocess=false \
    corpus.builtin.io.force.build=false \
    corpus.builtin.io.force.download=false

Load the Corpus#

cfg = eKonf.compose("corpus")
cfg.name = "enwiki"
enwiki = eKonf.instantiate(cfg)
print(enwiki)
INFO:ekorpkit.datasets.base:Loaded info file: /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/info-enwiki.yaml
INFO:ekorpkit.io.file:Processing [1] files from ['enwiki-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/enwiki-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/enwiki-train.parquet
INFO:ekorpkit.info.column:index: index, index of data: None, columns: ['id', 'text', 'split', 'filename'], id: ['id']
INFO:ekorpkit.info.column:Adding id [split] to ['id']
INFO:ekorpkit.info.column:Added id [split], now ['id', 'split']
INFO:ekorpkit.info.column:Added a column [split] with value [train]
INFO:ekorpkit.io.file:Processing [1] files from ['meta-enwiki-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/meta-enwiki-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki/meta-enwiki-train.parquet
INFO:ekorpkit.info.column:Added a column [split] with value [train]
INFO:ekorpkit.info.column:No timestamp key found
Corpus : enwiki
time: 1min 48s (started: 2022-11-04 02:24:13 +00:00)
eKonf.print(enwiki.INFO)
{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'enwiki-train.parquet'},
 'data_files_modified': '2022-10-29 11:04:23',
 'description': 'Wikipedia',
 'fullname': 'English Wikipedia Corpus',
 'homepage': 'https://en.wikipedia.org',
 'info_updated': '2022-10-29 11:52:41',
 'lang': 'en',
 'license': 'CC Attribution / Share-Alike 3.0',
 'meta_files': {'train': 'meta-enwiki-train.parquet'},
 'meta_files_modified': '2022-10-29 10:45:11',
 'name': 'enwiki',
 'num_bytes_before_processing': 15342187701,
 'num_docs': 6327718,
 'num_docs_before_processing': 16699988,
 'num_segments': 6329379,
 'num_sents': 133373574,
 'num_words': 2482445427,
 'size_in_bytes': 15381978510,
 'size_in_human_bytes': '14.33 GiB',
 'splits': {'train': {'data_file': 'enwiki-train.parquet',
                      'dataset_name': 'enwiki',
                      'human_bytes': '14.33 GiB',
                      'human_bytes_wospc': '11.95 GiB',
                      'meta_file': 'meta-enwiki-train.parquet',
                      'name': 'train',
                      'num_bytes': 15381978510,
                      'num_bytes_before_processing': 15342187701,
                      'num_bytes_max': 381498,
                      'num_bytes_median': 951.0,
                      'num_bytes_min': 30,
                      'num_bytes_wospc': 12831216681,
                      'num_docs': 6327718,
                      'num_docs_before_processing': 16699988,
                      'num_segments': 6329379,
                      'num_segments_median': 1.0,
                      'num_sents': 133373574,
                      'num_sents_median': 10.0,
                      'num_words': 2482445427,
                      'num_words_max': 65724,
                      'num_words_median': 154.0,
                      'num_words_min': 1}},
 'version': '1.0.0'}
time: 9.73 ms (started: 2022-11-04 02:26:02 +00:00)

Sample and Save the Corpus#

enwiki.data.info()
INFO:ekorpkit.io.file:Concatenating 1 dataframes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16699988 entries, 0 to 16699987
Data columns (total 4 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   id        int64 
 1   text      object
 2   split     object
 3   filename  object
dtypes: int64(1), object(3)
memory usage: 509.6+ MB
time: 674 ms (started: 2022-11-04 02:26:53 +00:00)
enwiki.splits["train"] = enwiki.splits["train"].sample(frac=0.05)
enwiki.save_as("enwiki_sampled")
print(enwiki.data.info())
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 834999 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 834999 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 834999 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 834999 len(args): 5
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 230  input_split: False  merge_output: True  len(data): 834999 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:04:32.445424
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki_sampled/enwiki_sampled-train.parquet
INFO:ekorpkit.io.file:Concatenating 1 dataframes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 834999 entries, 0 to 834998
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        834999 non-null  int64 
 1   text      834999 non-null  object
 2   split     834999 non-null  object
 3   filename  834999 non-null  object
dtypes: int64(1), object(3)
memory usage: 25.5+ MB
None
time: 5min 34s (started: 2022-11-04 02:27:31 +00:00)
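The 5% sample above is drawn at random, so re-running the cell will select a different set of documents. If you need a reproducible sample, fix the random seed; this is a minimal variant of the cell above, assuming the split behaves like a plain pandas DataFrame (the seed value 123 is arbitrary):

# fix random_state so the same 5% of documents is selected on every run
enwiki.splits["train"] = enwiki.splits["train"].sample(frac=0.05, random_state=123)
enwiki.save_as("enwiki_sampled")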

Load Corpora#

You can load several corpora at once and merge them into a single corpus.

cfg = eKonf.compose("corpus=corpora")
cfg.name = ["enwiki_sampled", "kowiki", "bnwiki"]
# cfg.data_dir = '../data'
cfg.auto.load = True
crps = eKonf.instantiate(cfg)
print(crps)
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
INFO:ekorpkit.utils.notebook:shell type: ZMQInteractiveShell
INFO:ekorpkit.base:setting environment variable CACHED_PATH_CACHE_ROOT to /content/drive/MyDrive/workspace/.cache/cached_path
INFO:ekorpkit.base:setting environment variable KMP_DUPLICATE_LIB_OK to TRUE
INFO:ekorpkit.datasets.corpora:processing enwiki_sampled
INFO:ekorpkit.io.file:Processing [1] files from ['enwiki_sampled-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/enwiki_sampled/enwiki_sampled-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/enwiki_sampled/enwiki_sampled-train.parquet
INFO:ekorpkit.info.column:index: index, index of data: index, columns: ['id', 'text', 'split', 'filename'], id: ['id']
INFO:ekorpkit.info.column:Adding id [split] to ['id']
INFO:ekorpkit.info.column:Added id [split], now ['id', 'split']
INFO:ekorpkit.info.column:Added a column [split] with value [train]
WARNING:ekorpkit.datasets.base:File enwiki_sampled-dev.parquet not found.
WARNING:ekorpkit.datasets.base:File enwiki_sampled-test.parquet not found.
INFO:ekorpkit.datasets.corpus:No metadata files found
INFO:ekorpkit.info.column:No timestamp key found
INFO:ekorpkit.datasets.corpora:processing kowiki
INFO:ekorpkit.datasets.base:Loaded info file: /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/info-kowiki.yaml
INFO:ekorpkit.io.file:Processing [1] files from ['kowiki-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/kowiki-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/kowiki-train.parquet
INFO:ekorpkit.info.column:index: index, index of data: None, columns: ['id', 'text', 'split', 'filename'], id: ['id']
INFO:ekorpkit.info.column:Adding id [split] to ['id']
INFO:ekorpkit.info.column:Added id [split], now ['id', 'split']
INFO:ekorpkit.info.column:Added a column [split] with value [train]
INFO:ekorpkit.io.file:Processing [1] files from ['meta-kowiki-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/meta-kowiki-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/kowiki/meta-kowiki-train.parquet
INFO:ekorpkit.info.column:Added a column [split] with value [train]
INFO:ekorpkit.info.column:No timestamp key found
INFO:ekorpkit.datasets.corpora:processing bnwiki
INFO:ekorpkit.datasets.base:Loaded info file: /content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/info-bnwiki.yaml
INFO:ekorpkit.io.file:Processing [1] files from ['bnwiki-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/bnwiki-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/bnwiki-train.parquet
INFO:ekorpkit.info.column:index: index, index of data: None, columns: ['id', 'text', 'split', 'filename'], id: ['id']
INFO:ekorpkit.info.column:Adding id [split] to ['id']
INFO:ekorpkit.info.column:Added id [split], now ['id', 'split']
INFO:ekorpkit.info.column:Added a column [split] with value [train]
INFO:ekorpkit.io.file:Processing [1] files from ['meta-bnwiki-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/meta-bnwiki-train.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/data/datasets/corpus/bnwiki/meta-bnwiki-train.parquet
INFO:ekorpkit.info.column:Added a column [split] with value [train]
INFO:ekorpkit.info.column:No timestamp key found
INFO:ekorpkit.datasets.corpora:>>> Elapsed time: 0:00:17.370715 <<< 
Corpora
----------
enwiki_sampled
kowiki
bnwiki

time: 18.8 s (started: 2022-11-04 03:20:27 +00:00)

Checking the Corpus Information#

eKonf.print(crps["kowiki"].INFO)
{'category': 'formal',
 'column_info': {'columns': {'id': 'id',
                             'merge_meta_on': 'id',
                             'text': 'text',
                             'timestamp': None},
                 'data': {'id': 'int', 'text': 'str'},
                 'datetime': {'columns': None,
                              'format': None,
                              'rcParams': None},
                 'meta': {'curid': 'str',
                          'id': 'int',
                          'title': 'str',
                          'url': 'str'},
                 'segment_separator': '\\n\\n',
                 'sentence_separator': '\\n',
                 'timestamp': {'format': None, 'key': None, 'rcParams': None}},
 'data_files': {'train': 'kowiki-train.parquet'},
 'data_files_modified': '2022-10-29 06:30:41',
 'description': '위키백과, 우리 모두의 백과사전',
 'fullname': 'Korean Wikipedia Corpus',
 'homepage': 'https://ko.wikipedia.org',
 'info_updated': '2022-10-29 06:58:28',
 'lang': 'ko',
 'license': 'CC Attribution / Share-Alike 3.0',
 'meta_files': {'train': 'meta-kowiki-train.parquet'},
 'meta_files_modified': '2022-10-29 06:29:35',
 'name': 'kowiki',
 'num_bytes_before_processing': 801994255,
 'num_docs': 601641,
 'num_docs_before_processing': 1339048,
 'num_segments': 601727,
 'num_sents': 6064076,
 'num_words': 74965324,
 'size_in_bytes': 800672700,
 'size_in_human_bytes': '763.58 MiB',
 'splits': {'train': {'data_file': 'kowiki-train.parquet',
                      'dataset_name': 'kowiki',
                      'human_bytes': '763.58 MiB',
                      'human_bytes_wospc': '692.66 MiB',
                      'meta_file': 'meta-kowiki-train.parquet',
                      'name': 'train',
                      'num_bytes': 800672700,
                      'num_bytes_before_processing': 801994255,
                      'num_bytes_max': 417935,
                      'num_bytes_median': 348.0,
                      'num_bytes_min': 30,
                      'num_bytes_wospc': 726308931,
                      'num_docs': 601641,
                      'num_docs_before_processing': 1339048,
                      'num_segments': 601727,
                      'num_segments_median': 1.0,
                      'num_sents': 6064076,
                      'num_sents_median': 3.0,
                      'num_words': 74965324,
                      'num_words_max': 35733,
                      'num_words_median': 32.0,
                      'num_words_min': 1}},
 'version': '1.0.0'}
time: 8.71 ms (started: 2022-11-04 03:20:46 +00:00)

Checking the Corpus Data#

print(crps["kowiki"].data.text[0])
INFO:ekorpkit.io.file:Concatenating 1 dataframes
성이성(成以性, 1595년(선조 28년) ∼ 1664년(현종 5년))은 조선 후기의 문신이자 유학자, 청백리이다. 자(字)는 여습(汝習)이고 호는 계서(溪西)이다. 본관은 창녕(昌寧). 춘향전의 실제 주인공으로 춘향전의 주인공인 몽룡은 원래 성몽룡이었다. 남원부사와 승정원승지를 지낸 성안의의 아들이다.
강직한 간관이자 청백리이다. 그의 직계 후손들은 춘향전에 나온 '금준미주 천인혈'이 그가 실제로 지은 한시라고 주장한다. 호서 암행어사와 호남 암행어사로 활동, 감찰하며 부패 수령들을 봉고파직시켰다. 이것 역시 춘향전의 소재가 된다. 학맥으로는 김굉필의 손제자이자 그의 학맥을 계승한 강복성(康復誠)의 문인이다. 경상북도 출신.
생애.
생애 초반.
출생과 가계.
성이성은 경상북도 봉화군 물야면 가평리 태생으로 아버지는 창녕 성씨로 승정원승지와 군수를 지낸 성안의(成安義)이고, 어머니는 예안 김씨로 증(贈) 호조 참판에 추증(追贈)된 김계선의 딸이다.
그는 어려서부터 그는 학업에 열중하여 13세때 그가 쓴 글을 우연히 정경세(鄭經世)가 보게 되었다. 정경세는 그의 글을 읽고 장차 크게 될 인물이라 하였다.
수학과 남원 생활.
어려서부터 공부를 게을리하지 않고 학문에 더욱 증진하여 조경남의 문하에서 수학하다가 뒤에 강복성(康復誠)의 문인이 되었다. 강복성은 사림의 학통인 길재-김숙자-김종직-김굉필(金宏弼)-조광조-이연경(李延慶)의 학통을 계승한 학자였다.
1607년(선조 40) 남원부사로 부임한 아버지 성안의를 따라 갔다가 그곳에서 만난 기생과의 일화가 후일 춘향전의 주 뼈대가 되었다. 그러나 아버지 성안의가 참의로 발령되면서 기생 춘향과는 이별하게 된다. 이때 시중에는 성이성과 춘향을 소재로 한 춘향전이 희극과 인형극, 만담 등으로 확산되었는데, 양반가의 자제의 스캔들이라 하여 조선조정에서 관을 시켜서 금지하게 되자 성몽룡을 이몽룡으로 바꾸고, 성씨(姓氏)가 없던 기생인 춘향에게 성씨 성을 붙여서 시연하게 된다.
1616년(광해군 8년) 그는 사마시 양시에 합격했는데 생원시에 합격하여 생원(生員)이 되고, 그 해에 다시 진사시에 합격하여 진사(進士)가 되었다. 그러나 광해군 때의 난세에는 벼슬길에 나아가지 않았다.
관료 생활.
과거 급제와 관료생활.
1627년 (인조5년)에 식년 문과에 병과로 급제하였다.
1635년(인조 13) 성이성은 사간원 정언이 되고 홍문관 부수찬·부교리를 거쳐 1636년 사헌부지평이 되었다. 1637년(인조 15) 호서(湖西) 지방의 암행어사로 파견되었다가 돌아왔다. 그해 성이성은 사간원 헌납이 되어 공신이며 서인당의 고관인 윤방(尹昉)·김류(金류)·심기원(沈器遠)·김자점(金自點) 등을 탄핵하여 왕을 잘못된 길로 인도했다며 오국불충(誤國不忠)의 죄를 논하기도 했다.
암행어사 활동.
1639년(인조 17) 호남(湖南) 지방 암행어사에 임명되어 5년간 호남 지역을 순찰하고 1644년(인조 22) 되돌아왔가. 그 뒤 1647년(인조 25) 다시 호남 암행어사 로 파견되었다.
그러나 호남 암행어사로 부임했을 때 신분을 노출시키고 마는데 성이성은 암행을 하고 다니다가 1647년 11월 25일 순천에서 실수로 부득이 자신의 신분을 드러내고 이후에는 한양으로 돌아오게 되는데, 그는 일기에 돌아오는 길이던 12월 1일 남원에 들렀다고 적고 있다.
생애 후반.
1648년 여름 성이성은 전라도 담양군수로 부임해 장마철 강둑이 범람해 피해가 큰 것을 보고 2년에 걸쳐 제방을 쌓고 그 위에 나무를 심었다. 현재 관방제림으로 불리는 숲이 그가 남긴 치세의 흔적이다. 푸조나무 느티나무 팽나무 등 184그루가 전한다. 본디 제방에 나무를 심으면 나무가 바람에 흔들릴 때 제방에 해롭다 하여 심지 않았는데 성이성은 비바람에 강한 토종나무를 골라 심음으로써 이같은 상식을 엎었다. 담양군은 현재 관방제림에 산책로를 조성하여 관광객들의 발길을 모으고 있다.
외직으로는 진주부사 · 강계부사 등 네 고을을 다스렸는데, 진주부사로 재직할 때는 서인 출신으로 경상도 암행어사로 파견된 민정중(閔鼎重)이 조사하여 그의 선치(善治)를 보고하여 특별히 왕으로부터 표리(表裏, 옷감)를 받았고, 강계부사 때에는 여진족 등의 약탈과 흉년 등으로 어려움에 처한 부민들에게 삼세(蔘稅)를 모두 면제해주어 백성들이 기뻐하였으며 부처가 환생하여 돌아왔다며 '생불' 또는 '관서활불'(關西活佛)이라며 칭송하였다. 1664년(현종 15)에 향년 70세를 일기로 사망했다.
사후.
고향인 봉화군 물야면 가평 1리에는 성이성을 추모하는 사당인 계서당이 건립되었다. 사후인 1695년(숙종 21) 청렴함을 인정받아 조정으로부터 청백리로 선출되었다. 증 통정대부 부제학에 추증되었다. 저서로는 &lt;계서유고&gt;가 있다.
귀신 문제 해결.
전라도 지역에 귀신이 자주 출몰한다는 곳이 있었다. 그 곳은 상인이나 과거 시험을 보러 가던 선비들이 여러 번 변을 당했는데, 성이성이 이 문제를 해결하였다 한다. 호남 암행어사가 돼서 호남 지역의 귀신이 많이 나오는 곳을 찾아가 억울함을 달래주고 문제를 해결하였고 이것 역시 입에서 입으로 전해져 설화가 되었다.
부패 관리 파직.
충청도 암행어사 시절 지방관리의 잘못을 발견하고 어떻게 처리했는가를 인조에게 보고한 '서계'가 남아있었는데 KBS 방송국이 이를 취재하였다.를 보면 세금을 과다징수한 진천현감과 생일날 과다한 잔치를 벌이고 국법을 어긴 석성현감을 적발하여 파직시켰다는 기록이 있다.
춘향전.
춘향전.
그는 아버지인 남원부사 성안의가 부임할 때, 아버지의 임지를 따라 남원에서 생활하다 우연히 남원 기생 춘향을 만나게 되었다. 그러나 춘향과는 이루어지지 못했고, 이는 바로 춘향전의 모티브가 되었다. 뒤에 호남 암행어사로 부임했다가 신분을 노출하고 되돌아갈 때 남원에 들렀다. 늙은기생 여진이 찾아와 만났는데, 그는 춘향을 찾았다 한다. 그러나 그는 춘향을 만날 수 없었다.
‘서리와 함께 난간에 앉으니 눈빛이 뜰에 하얗게 깔려있고 대나무숲이 희었다. 나는 소년시절의 일을 생각하여 밤늦도록 잠들지 못했다.’
춘향전은 판소리, 연극, 소설의 소재가 되었으나 양반가 자제의 스캔들이라 하여, 조정에서는 양반가의 위신을 떨어뜨린다며 춘향전의 상영을 금지하였다. 할 수 없이 민중들은 성몽룡 대신 이몽룡으로 성을 바꾸어서 연극과 판소리, 소설, 구전 등으로 전하였다.
금준미주 천인혈.
춘향전에 나오는 금준미주 천인혈은 성이성이 지은 시 중의 하나였다. 성이성이 춘향전에 나오는 성몽룡처럼 변사또를 응징한 남원 출두 기록은 실록이나 문집에는 없다. 그러나 춘향전에 나오는 잔치연에서 이몽룡이 변학도를 질타하면서 읊은 금준미주 천인혈 로 시작되는 시조는 성이성이 짓고, 읊었다. 이는 성이성의 4대손 성섭의 저서 &lt;교와문고&gt;와 그의 스승 조경남이 쓴 &lt;난중잡록&gt;에 그의 작품으로 기록되어 있다.
호남 암행어사가 되었을때에 호남 12고을 군수, 현감들이 잔치를 베풀었다. 이때 성이성은 암행어사가 걸인의 행색을 하고서 연회장에 나타났다. 호남의 12고을의 군수, 현감들은 그를 조롱하며 '그대가 시를 지으면 종일토록 놀고 짓지 못하면 가라.'고 했고, 그는 즉석에서 금준미주 천인혈 을 짓는다. 이어 전라도내 6명의 부패한 수령들을 봉고파직시킨다. 석성현감이 생일날 과다한 잔치를 벌인 것은 춘향전에 등장하는 변사또의 모티브가 되었다.
time: 51.7 ms (started: 2022-11-04 03:20:46 +00:00)
print(crps["bnwiki"].data.text[0])
INFO:ekorpkit.io.file:Concatenating 1 dataframes
শ্যামাদাস মুখোপাধ্যায় (২২ জুন ১৮৬৬ - ৮ মে ১৯৩৭) ছিলেন একজন ভারতীয় বাঙালি গণিতবিদ। তিনি ইউক্লিডিয় জ্যামিতির মুখোপাধ্যায়ের উপপাদ্য এবং চতুর্শীর্ষ উপপাদ্য (Four-vertex theorem) উপস্থাপনের জন্য পরিচিত। তিনি ভারতের প্রথম গণিতবিদ হিসেবে ডক্টরেট ডিগ্রী অর্জন করেন।
জন্ম ও শিক্ষাজীবন.
শ্যামাদাস মুখোপাধ্যায় ১৮৬৬ খ্রিষ্টাব্দের ২২ জুন পশ্চিমবঙ্গের হুগলি জেলার হরিপাল ব্লকে জন্মগ্রহণ করেন। তার বাবা বাবু গঙ্গা কান্ত মুখোপাধ্যায় রাজ্য বিচার বিভাগে নিযুক্ত ছিলেন। চাকরি সূত্রে তাকে বিভিন্ন স্থানে স্থানান্তরিত করা হওয়ায় শ্যামাদাস মুখোপাধ্যায়কে বিভিন্ন সময়ে বিভিন্ন শিক্ষা প্রতিষ্ঠানে শিক্ষা গ্রহণ করতে হয়। তিনি হুগলি কলেজ থেকে স্নাতক হন। তিনি ১৮৯০ খ্রিষ্টাব্দে কলকাতার প্রেসিডেন্সি কলেজ থেকে গণিত বিষয়ে এমএ ডিগ্রি অর্জন করেন। তিনি ১৯০৯ খ্রিষ্টাব্দে তার গাণিতিক তত্ত্বালোচনা "অন দ্যা ইনফিনিটেসিমাল এনালিসিস অফ এন আর্ক"-এর জন্য কলকাতা বিশ্ববিদ্যালয় থেকে গ্রিফিত পুরস্কার পান। তিনি ১৯১০ খ্রিষ্টাব্দে তার নিজস্ব ডিফারেনশিয়াল জ্যামিতির উপরে কলকাতা বিশ্ববিদ্যালয় থেকে পিএইচডি ডিগ্রি লাভ করেন। তার থিসিসের নাম ছিল "Parametric Coefficients in the Differential Geometry of Curves in an N-space"।
কর্মজীবন.
শ্যামাদাস মুখোপাধ্যায় কলকাতায় বঙ্গবাসী কলেজে কিছু বছর কাজ করার পর বেথুন কলেজে যোগদান করেন, যেখানে তিনি গণিত ছাড়াও ইংরেজি সাহিত্য ও দর্শনশাস্ত্রে শিক্ষা দিতেন। ১৯০৪ খ্রিষ্টাব্দে তাকে প্রেসিডেন্সি কলেজে স্থানান্তর করা হয়। সেখানে তিনি ১৯১২ খ্রিষ্টাব্দ পর্যন্ত আট সেখানে শিক্ষকতা করেন। ১৯১২খ্রিষ্টাব্দে কলকাতা বিশ্ববিদ্যালয়ের তৎকালীন উপাচার্য স্যার আশুতোষ মুখার্জী, শ্যামাদাস মুখোপাধ্যায়কে কলকাতা বিশ্ববিদ্যালয়ে নতুন বিশুদ্ধ গণিত বিভাগে যোগদানের জন্য আমন্ত্রণ জানান। শ্যামাদাস মুখোপাধ্যায় সেই আমন্ত্রণে সেখানে যোগ দেন। ১৯৩২ খ্রিষ্টাব্দে তিনি কলকাতা গাণিতিক সমিতির সভাপতি নির্বাচিত হন। তিনি আমৃত্যু এই পদে ছিলেন। ১৯৩৭ খ্রিষ্টাব্দের ৮ মে হৃদরোগের আক্রান্ত হয়ে তিনি মারা যান।
গবেষণা.
শ্যামাদাস মুখোপাধ্যায় সম্ভবত স্নাতককালে তার হুগলি কলেজের শিক্ষক উইলিয়াম বুথের জ্যামিতি বিষয়ক গবেষণা দ্বারা অনুপ্রাণিত হয়েছিলেন। ডক্টর মুখোপাধ্যায়ের গবেষণা মূলত অ-ইউক্লিডিয়ান জ্যামিতি, ডিফারেনশিয়াল জ্যামিতি এবং চতুর্মাত্রিক স্থানের স্টেরিওস্কোপিক উপস্থাপনার জন্য গুরুত্বপূর্ণ ছিল।
তিনি দুটি গুরুত্বপূর্ণ উপপাদ্য উপস্থাপন করেন। প্রথম উপপাদ্যটি হল- “"the minimum number of cyclic points on a convex oval is 4"’’ (একটি উত্তল ডিম্বাকৃতিতে সাইক্লিক পয়েন্টের সর্বনিম্ন সংখ্যা হল ৪) এবং দ্বিতীয়টি হল- “"the minimum number of sextactic points on a convex oval is 6"” (একটি উত্তল ডিম্বাকৃতিতে সেক্সট্যাকটিক পয়েন্টের সর্বনিম্ন সংখ্যা হল ৬)। এই দুটি উপপাদ্য প্রথম প্রকাশিত হয় ১৯০৯ খ্রিস্টাব্দে কলকাতা গাণিতিক সমিতির বুলেটিনে। কিন্তু সেই সময় এই দুটি গুরুত্বপূর্ণ তত্ত্বকে গুরুত্ব দেওয়া হয়নি। শুধুমাত্র বিশিষ্ট ফরাসি গণিতবিদ জাক আদামার ডক্টর মুখোপাধ্যায়ের গবেষণার গুরুত্ব উপলব্ধি করে কোলেজ দ্য ফ্রঁসে তত্ত্ব দুটির কথা উল্লেখ করেন। অনেক বছর পরে, এই তত্ত্ব দুটি ইউরোপে পুনরায় আবিষ্কৃত হয়। জার্মান জ্যামিতিবেত্তা Wilhelm Blaschke শ্যামাদাস মুখোপাধ্যায়কে প্রথম উপপাদ্যটির প্রথম প্রমাণের জন্য উপযুক্ত সম্মান দিয়েছেন। জ্যামিতির আধুনিক সাহিত্যে এই তত্ত্বটি এখন অপ্রত্যাশিতভাবে উদ্ধৃত করা হয়েছে "মুখোপাধ্যায়ের চতুর্শীর্ষ উপপাদ্য" নামে।
পরবর্তীতে শ্যামাদাস মুখোপাধ্যায় এই দুটি উপপাদ্যের সাধারণীকরণ করেন। প্রথম উপপাদ্যের সাধারণ বক্তব্যটি হল “"If a circle C intersects an oval V in 2n points (n 2) then there exists at least 2n cyclic points in order on V, of alternately contrary signs, provided the oval has continuity of order 3"”। দ্বিতীয় উপপাদ্যের সাধারণ বক্তব্যটি হল “"If a conic C intersects an oval V in 2n points (n&gt; or = 2), then there exist at least 2n sextactic points in order on V, which are alternatively positive and negative, provided V has continuity of order 5"”। এইভাবে তিনি আগের তত্ত্ব দুটিকে আরো শক্তভাবে প্রতিষ্ঠা করেন।
time: 10.3 ms (started: 2022-11-04 03:20:46 +00:00)

Concatenating Corpora#

crps.concat_corpora()
INFO:ekorpkit.io.file:Concatenating 1 dataframes
INFO:ekorpkit.info.column:Adding id [corpus] to ['id']
INFO:ekorpkit.info.column:Added id [corpus], now ['id', 'corpus']
INFO:ekorpkit.info.column:Added a column [corpus] with value [enwiki_sampled]
INFO:ekorpkit.io.file:Concatenating 1 dataframes
INFO:ekorpkit.info.column:Added a column [corpus] with value [kowiki]
INFO:ekorpkit.info.column:Added a column [corpus] with value [kowiki]
INFO:ekorpkit.io.file:Concatenating 1 dataframes
INFO:ekorpkit.info.column:Added a column [corpus] with value [bnwiki]
INFO:ekorpkit.info.column:Added a column [corpus] with value [bnwiki]
time: 296 ms (started: 2022-11-04 03:20:47 +00:00)
crps.data
              id                                               text  split filename          corpus
0        4915400                                                     train  wiki_92  enwiki_sampled
1        7644961  Anaissini is a tribe of click beetles in the f...  train  wiki_49  enwiki_sampled
2        6658552  The Vicky Metcalf Award for Literature for You...  train  wiki_24  enwiki_sampled
3       16385169  Shri Shivabalayogi Maharaj (24 January 1935 – ...  train  wiki_36  enwiki_sampled
4       11081255  Eylex Films Pvt is a chain of multiplex and si...  train  wiki_94  enwiki_sampled
...          ...                                                ...    ...      ...             ...
2522588   348541  মোহাম্মদ সেলিম (জন্ম: ১৫ অক্টোবর, ১৯৮১) খুলনার...  train  wiki_34          bnwiki
2522589   348542                                                     train  wiki_34          bnwiki
2522590   348543  দেশি কামিলা (বৈজ্ঞানিক নাম: "Congresox talabon...  train  wiki_34          bnwiki
2522591   348544                                                     train  wiki_34          bnwiki
2522592   348545  বাথাইল নদী বাংলাদেশের উত্তর-পূর্বাঞ্চলের কিশোর...  train  wiki_34          bnwiki

2522593 rows × 5 columns

time: 13.4 ms (started: 2022-11-04 03:20:47 +00:00)
crps.metadata
             id   curid                                         url                   title  split  corpus
0             0  634327  https://ko.wikipedia.org/wiki?curid=634327                     성이성  train  kowiki
1             1  634328  https://ko.wikipedia.org/wiki?curid=634328                      누타  train  kowiki
2             2  634329  https://ko.wikipedia.org/wiki?curid=634329                    공중그네  train  kowiki
3             3  634331  https://ko.wikipedia.org/wiki?curid=634331                     성몽룡  train  kowiki
4             4  634332  https://ko.wikipedia.org/wiki?curid=634332                      계서  train  kowiki
...         ...     ...                                         ...                     ...    ...     ...
1687589  348541  554487  https://bn.wikipedia.org/wiki?curid=554487          মোহাম্মদ সেলিম  train  bnwiki
1687590  348542  554493  https://bn.wikipedia.org/wiki?curid=554493          Mohammad Salim  train  bnwiki
1687591  348543  554495  https://bn.wikipedia.org/wiki?curid=554495             দেশি কামিলা  train  bnwiki
1687592  348544  554501  https://bn.wikipedia.org/wiki?curid=554501  Congresox talabonoides  train  bnwiki
1687593  348545  554504  https://bn.wikipedia.org/wiki?curid=554504              বাথাইল নদী  train  bnwiki

1687594 rows × 6 columns

time: 7.47 ms (started: 2022-11-04 03:20:47 +00:00)
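The data frame carries only the document text, while titles and URLs live in the metadata frame. If you want them side by side, you can join the two frames yourself; a minimal pandas sketch assuming both are plain DataFrames with the columns shown above (the join keys extend merge_meta_on with the split and corpus identifiers, since id is only unique within a corpus):

# left join keeps every document; enwiki_sampled has no metadata, so its titles/urls stay empty
merged = crps.data.merge(crps.metadata, on=["id", "split", "corpus"], how="left")
print(merged[["corpus", "id", "title", "url"]].head())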

Save the Concatenated Corpus#

eKonf.save_data(crps.data, "wiki_corpus.parquet", project_dir + "/data")
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/wiki_corpus.parquet
time: 2min 56s (started: 2022-11-04 03:20:47 +00:00)
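To confirm the saved file is usable, you can read it back; a minimal sketch that loads the parquet file with pandas directly rather than an ekorpkit helper (the path mirrors the one printed in the log above):

import pandas as pd

# reload the concatenated corpus and check its row count and columns
wiki_corpus = pd.read_parquet(project_dir + "/data/wiki_corpus.parquet")
wiki_corpus.info()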