Lab 3: Training Tokenizers#

Prepare the environment#

%pip install --pre ekorpkit[tokenize]
from ekorpkit import eKonf

eKonf.setLogger("INFO")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
ws = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
print("project_dir:", ws.project_dir)
ws.envs.dict()
INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Set environment variable EKORPKIT_WORKSPACE_ROOT=/content/drive/MyDrive/workspace
INFO:ekorpkit.base:Set environment variable EKORPKIT_PROJECT_DIR=/content/drive/MyDrive/workspace/projects/ekorpkit-book
version: 0.1.40.post0.dev51
is colab? False
INFO:root:compose config with overrides: ['+project=default']
INFO:ekorpkit.base:There are no arguments to initilize a config, using default config.
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
{'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
 'EKORPKIT_WORKSPACE_ROOT': '/content/drive/MyDrive/workspace',
 'EKORPKIT_PROJECT': 'ekorpkit-book',
 'EKORPKIT_PROJECT_DIR': '/content/drive/MyDrive/workspace/projects/ekorpkit-book',
 'EKORPKIT_DATA_DIR': None,
 'EKORPKIT_LOG_LEVEL': 'INFO',
 'NUM_WORKERS': 230,
 'KMP_DUPLICATE_LIB_OK': 'TRUE',
 'CUDA_DEVICE_ORDER': None,
 'CUDA_VISIBLE_DEVICES': None,
 'WANDB_PROJECT': None,
 'WANDB_DISABLED': None,
 'LABEL_STUDIO_SERVER': 'http://ekorpkit-labelstudio:8080',
 'CACHED_PATH_CACHE_ROOT': None}
time: 936 ms (started: 2022-12-13 04:56:08 +00:00)

Load the saved corpora#

data = eKonf.load_data("enko_filtered.parquet", ws.project_dir / "data")
data.head()
INFO:ekorpkit.io.file:Processing [1] files from ['enko_filtered.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/projects/ekorpkit-book/data/enko_filtered.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/enko_filtered.parquet
id text split filename corpus num_chars num_words num_sents avg_num_chars avg_num_words
1 7644961 Anaissini is a tribe of click beetles in the f... train wiki_49 enwiki_sampled 63 11 1 5.727273 11.000000
2 6658552 The Vicky Metcalf Award for Literature for You... train wiki_24 enwiki_sampled 479 82 5 5.841463 16.400000
4 11081255 Eylex Films Pvt is a chain of multiplex and si... train wiki_94 enwiki_sampled 1161 181 12 6.414365 15.083333
8 4706486 Željko Zečević (; born 21 October 1963) is a S... train wiki_02 enwiki_sampled 1151 201 15 5.726368 13.400000
12 2170359 Gilberto Nascimento Silva (born 9 June 1956) i... train wiki_57 enwiki_sampled 685 105 9 6.523810 11.666667
time: 3.17 s (started: 2022-12-13 00:32:31 +00:00)
text_column = "text"

text_en = (
    data[data.corpus == "enwiki_sampled"][text_column].sample(1).values[0].split("\n")
)
text_ko = data[data.corpus == "kowiki"][text_column].sample(1).values[0].split("\n")

print(text_en)
print(text_ko)
['The Rijeka Thermal Power Station (, TE Rijeka, also known as "TE Urinj") is an oil-fired power station east of Rijeka at Kostrena, Croatia. It was built between 1974 and 1978 and it has one generation unit with capacity of 320\xa0MW. The height of the boiler house including its rooftop flue gas stack is .', 'Turbine for the power station was supplied by Ansaldo Energia. Ansaldo Energia was also awarded engineering, procurement and construction contract. Boilers were supplied by Waagner-Biro.', 'The power station is owned and operated by Hrvatska elektroprivreda. Its annual production varies, averaging 917 GWh, but only 141 GWh in 2011. It is expected to undergo decommissioning in 2020, but it is doubtful that it will remain operational until then because of its pollution problem. , Rijeka Thermal Power Station is offline, ready to resume generation within 160 hours of notice. On 18 October 2022, it was unofficially reported that HEP plans to restart the operation of the power plant in order to cover the losses incurred during the energy crisis.']
['런던의 2012년 하계 올림픽 펜싱 여자 플뢰레 단체전은 8월 2일에 엑셀 박람회관에서 진행되었다.', '토너먼트 형식.', '9팀이 여자 플뢰레 단체전에 참가하였다. 본선에 출전하는 팀들은 FIE 랭킹에 따라 대진이 정해졌다. 영국은 개최국 자격으로 모든 종목을 선택 여부에 따라 참전할 수 있는 자격이 주어졌다. 영국은 이 토너먼트에 참가하여 첫경기에서 9번 시드의 이집트를 상대하여 승리하며 8강전에서 나머지 7개의 참가팀들과 만났다. 8강전에서 패한 팀들도 두 경기를 더 치러 5위에서 8위까지의 순위를 결정한다. 반대로, 8강전에서 승리한 팀들은 준결승전에서 서로와 맞붙는다. 준결승전에서 승리한 두 팀은 금메달 결정전으로, 패한 팀들은 동메달 결정전으로 이동한다.', '단체전은 먼저 45투셰를 기록하거나, 정규시간이 지나고 더 많은 투셰를 기록한 팀이 승리한다.']
time: 145 ms (started: 2022-11-14 02:09:16 +00:00)

Convert the pandas dataframe to a Hugging Face dataset#

from datasets import Dataset

raw_dataset = Dataset.from_pandas(data[[text_column]])
raw_dataset
Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 603719
})
time: 1.22 s (started: 2022-11-14 02:09:19 +00:00)
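The __index_level_0__ feature is just the pandas index carried over by Dataset.from_pandas. If you do not need it, you can drop it; this is an optional step, not part of the original lab:

# Optional: drop the leftover pandas index column created by from_pandas()
if "__index_level_0__" in raw_dataset.column_names:
    raw_dataset = raw_dataset.remove_columns(["__index_level_0__"])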

Shuffle the dataset#

# shuffle the dataset

raw_dataset = raw_dataset.shuffle(seed=42)
time: 116 ms (started: 2022-11-14 02:09:21 +00:00)

Split the dataset into sentences for training#

The sentencepiece module ships with a Python training API that expects its training data as a text file with one sentence per line. We will use the sent_tokenize function from the nltk package to split the text into sentences. sent_tokenize is a wrapper around the pre-trained punkt sentence tokenizer, which was trained on the Penn Treebank corpus, a collection of Wall Street Journal articles. The punkt tokenizer is a good choice for plain English text, but it may not be the best choice for other languages.
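As a quick illustration, sent_tokenize simply splits a paragraph into sentence strings. The example below reuses a snippet of the English sample shown earlier; the expected output is shown as comments:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # sent_tokenize needs the punkt model

sample = (
    "The power station is owned and operated by Hrvatska elektroprivreda. "
    "Its annual production varies, averaging 917 GWh."
)
print(sent_tokenize(sample))
# ['The power station is owned and operated by Hrvatska elektroprivreda.',
#  'Its annual production varies, averaging 917 GWh.']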

import nltk
from nltk.tokenize import sent_tokenize
from ekorpkit.tokenizers.trainers.spm import export_sentence_chunk_files

nltk.download("punkt")
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
time: 1.13 s (started: 2022-11-14 02:09:23 +00:00)
project_dir = str(ws.project_dir)  # plain string path used for concatenation below
output_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"

export_sentence_chunk_files(
    raw_dataset,
    output_dir=output_dir,
    chunk_size=10000,
    text_column=text_column,
    sent_tokenize=sent_tokenize,
)
INFO:ekorpkit.tokenizers.trainers.spm:Writing sentence chunks to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_chunk
time: 1min 29s (started: 2022-11-14 02:09:26 +00:00)

Sample sentences and combine them into a single file#

If your dataset is too large, you can sample a subset of the sentence chunk files for training; the sample function from the random module can be used to pick the subset.

You can use the sample_and_combine function to sample a subset of the sentence chunk files and combine them into a single file.
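A hand-rolled equivalent with random.sample would look roughly like this (a hypothetical sketch, not the ekorpkit implementation):

import random
from pathlib import Path


def sample_and_combine_manually(input_dir, output_file, sample_size=50, seed=42):
    """Sample sentence chunk files and concatenate them into a single text file."""
    files = sorted(Path(input_dir).glob("*.txt"))
    random.seed(seed)
    sampled = random.sample(files, min(sample_size, len(files)))
    with open(output_file, "w", encoding="utf-8") as out:
        for file in sampled:
            out.write(file.read_text(encoding="utf-8"))
    return output_file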

from ekorpkit.tokenizers.trainers.spm import sample_and_combine

input_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"
output_dir = project_dir + "/data/tokenizers/enko_filtered_samples"

sampled_file = sample_and_combine(
    input_dir=input_dir, output_dir=output_dir, sample_size=50
)
INFO:ekorpkit.tokenizers.trainers.spm:sampled files: ['sent_chunk_0009.txt', 'sent_chunk_0019.txt', 'sent_chunk_0060.txt', 'sent_chunk_0038.txt', 'sent_chunk_0000.txt', 'sent_chunk_0042.txt', 'sent_chunk_0025.txt', 'sent_chunk_0053.txt', 'sent_chunk_0035.txt', 'sent_chunk_0033.txt', 'sent_chunk_0008.txt', 'sent_chunk_0023.txt', 'sent_chunk_0004.txt', 'sent_chunk_0024.txt', 'sent_chunk_0013.txt', 'sent_chunk_0003.txt', 'sent_chunk_0017.txt', 'sent_chunk_0051.txt', 'sent_chunk_0027.txt', 'sent_chunk_0058.txt', 'sent_chunk_0012.txt', 'sent_chunk_0029.txt', 'sent_chunk_0015.txt', 'sent_chunk_0044.txt', 'sent_chunk_0057.txt', 'sent_chunk_0020.txt', 'sent_chunk_0052.txt', 'sent_chunk_0059.txt', 'sent_chunk_0005.txt', 'sent_chunk_0011.txt', 'sent_chunk_0031.txt', 'sent_chunk_0030.txt', 'sent_chunk_0001.txt', 'sent_chunk_0056.txt', 'sent_chunk_0047.txt', 'sent_chunk_0055.txt', 'sent_chunk_0007.txt', 'sent_chunk_0032.txt', 'sent_chunk_0018.txt', 'sent_chunk_0014.txt', 'sent_chunk_0016.txt', 'sent_chunk_0054.txt', 'sent_chunk_0028.txt', 'sent_chunk_0021.txt', 'sent_chunk_0006.txt', 'sent_chunk_0040.txt', 'sent_chunk_0049.txt', 'sent_chunk_0043.txt', 'sent_chunk_0037.txt', 'sent_chunk_0022.txt']
INFO:ekorpkit.tokenizers.trainers.spm:number of lines sampled: 2,998,856
INFO:ekorpkit.tokenizers.trainers.spm:saved sampled sentences to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_samples/sampled_sentences.txt
time: 20.9 s (started: 2022-11-13 05:07:06 +00:00)

Train tokenizers with Hugging Face Tokenizers#

Hugging Face’s Tokenizers provides a wide range of tokenizers, including BPE, WordPiece, Unigram, SentencePiece, and ByteLevel. We will use the BPE and Unigram tokenizers in this lab.

Import the libraries and prepare functions#

from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram, WordLevel
from tokenizers.trainers import BpeTrainer, UnigramTrainer, WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
from ekorpkit.tokenizers.trainers.spm import batch_chunks


unk_token = "<UNK>"  # token for unknown words
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>", "[MASK]"]  # special tokens


def prepare_tokenizer_trainer(algo):
    """
    Prepares the tokenizer and trainer with unknown & special tokens.
    """
    if algo == "BPE":
        tokenizer = Tokenizer(BPE(unk_token=unk_token))
        trainer = BpeTrainer(special_tokens=spl_tokens)
    elif algo == "UNI":
        tokenizer = Tokenizer(Unigram())
        trainer = UnigramTrainer(unk_token=unk_token, special_tokens=spl_tokens)
    else:
        tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
        trainer = WordLevelTrainer(special_tokens=spl_tokens)

    normalizer = normalizers.Sequence([NFD(), StripAccents()])
    tokenizer.normalizer = normalizer
    tokenizer.pre_tokenizer = Whitespace()

    return tokenizer, trainer
time: 2.35 ms (started: 2022-11-12 06:51:08 +00:00)
def train_tokenizer(algo="BPE"):
    """
    Takes the files and trains the tokenizer.
    """
    save_path = f"{project_dir}/tokenizers/{algo}_tokenizer.json"
    tokenizer, trainer = prepare_tokenizer_trainer(algo)
    tokenizer.train_from_iterator(
        batch_chunks(raw_dataset, batch_size=1000, text_column=text_column),
        trainer=trainer,
    )
    tokenizer.save(save_path)
    tokenizer = Tokenizer.from_file(save_path)
    return tokenizer
time: 20.2 ms (started: 2022-11-13 03:09:51 +00:00)
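If you have already exported and sampled the sentence files above, you can also train from text files on disk instead of the in-memory iterator; a minimal sketch (this is not the path used for the timings below, and the output file name is hypothetical):

# Alternative: Tokenizer.train() accepts a list of text files, one sentence per line.
tokenizer, trainer = prepare_tokenizer_trainer("BPE")
tokenizer.train([str(sampled_file)], trainer=trainer)
tokenizer.save(f"{project_dir}/tokenizers/BPE_tokenizer_from_file.json")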

Train BPE tokenizer#

model_path = train_tokenizer("BPE")

time: 15h 30min 58s (started: 2022-11-11 06:36:01 +00:00)

Training the BPE tokenizer on the 2,998,856 sentences took 15 hours and 31 minutes. The tokenizer was saved in the {project_dir}/tokenizers directory.

To train more efficiently with multiple processors, it is preferable to use the CLI (command-line interface).

ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer \
    tokenizer.name=enkowiki \
    tokenizer.dataset.train_file=enko_filtered.parquet \
    tokenizer.model.model_type=bpe \
    tokenizer.model.vocab_size=30000 \
    tokenizer.model.character_coverage=0.9995 \
    tokenizer.training.use_sample=false \
    tokenizer.auto=train

took 22m 36.6s

With 256 processors, it took 22 minutes and 37 seconds to train the tokenizer on the 2,998,856 sentences.

from tokenizers import Tokenizer

tokenizer_path = (
    f"{project_dir}/tokenizers/hf/enko_wiki/enko_wiki_bpe_huggingface_vocab_30000.json"
)

bpe_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {bpe_tokenizer.get_vocab_size()}")
print(bpe_tokenizer.encode(text_en[0]).tokens)
print(bpe_tokenizer.encode(text_ko[0]).tokens)
Vocab size: 30000
['This', 'article', 'describ', 'es', 'the', 'process', 'by', 'which', 'the', 'ter', 'rit', 'or', 'ial', 'ext', 'ent', 'of', 'Moroc', 'co', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
['크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
time: 67.4 ms (started: 2022-11-13 08:47:47 +00:00)

Train Unigram tokenizer#

model_path = train_tokenizer("UNI")

For a very large corpus, training a Unigram tokenizer in the notebook may take a long time. It is recommended to use the following CLI command instead.

ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer \
    tokenizer.name=enkowiki \
    tokenizer.dataset.train_file=enko_filtered.parquet \
    tokenizer.model.model_type=unigram \
    tokenizer.model.vocab_size=30000 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

took 10m 10.7s

With 256 processors, it took 10 minutes and 11 seconds to train the tokenizer on the 2,998,856 sentences.

from tokenizers import Tokenizer

tokenizer_path = (
    f"{project_dir}/tokenizers/hf/enko_wiki/enko_wiki_unigram_huggingface_vocab_30000.json"
)

unigram_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {bpe_tokenizer.get_vocab_size()}")
print(unigram_tokenizer.encode(text_en[0]).tokens)
print(unigram_tokenizer.encode(text_ko[0]).tokens)
Vocab size: 30000
['This', 'article', 'describe', 's', 'the', 'process', 'by', 'which', 'the', 'territor', 'ial', 'ex', 't', 'ent', 'of', 'Morocc', 'o', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
['크리', '클레이', '드', '(', 'C', 'rick', 'la', 'de', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구', '이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
time: 82 ms (started: 2022-11-13 08:47:51 +00:00)

Train tokenizers with Google SentencePiece (SPM)#

Install SentencePiece#

pip install sentencepiece

Train SentencePiece models#

You can use the train_spm function to train a SentencePiece model. It takes the following arguments (a rough SentencePiece-level equivalent is sketched after the list):

  • model_prefix: The prefix of the model file. The model file will be saved as {model_prefix}_{model_type}_vocab_{vocab_size}.model.

  • input: The input file for training.

  • output_dir: The directory to save the model file.

  • vocab_size: The vocabulary size.

  • model_type: The model type. It can be unigram (default), bpe, char, or word.

  • character_coverage: The character coverage. It is only used for unigram and bpe model types. The default value is 1.0.

  • num_threads: The number of threads to use for training. The default value is 1. The max value is 128.

  • train_extremely_large_corpus: Whether to train an extremely large corpus. The default value is False.
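Under the hood these options map onto SentencePiece's own training API; a direct call would look roughly like this (a sketch of the equivalent spm invocation, not the ekorpkit implementation; the model_prefix shown is hypothetical):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input=str(sampled_file),          # plain text, one sentence per line
    model_prefix="enko_wiki_unigram_vocab_30000",
    vocab_size=30000,
    model_type="unigram",             # or "bpe", "char", "word"
    character_coverage=0.9995,
    num_threads=128,
)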

Train Unigram model#

from ekorpkit.tokenizers.trainers.spm import train_spm

uni_model_path = train_spm(
    model_prefix="enko_wiki",
    input=sampled_file,
    output_dir=project_dir + "/tokenizers/spm",
    model_type="unigram",
    vocab_size=30000,
    character_coverage=0.9995,
    num_threads=128,
)

time: 23min 6s (started: 2022-11-12 08:03:08 +00:00)

It took 23 minutes to train a unigram model with a vocabulary size of 30,000. The model file was saved in the {project_dir}/tokenizers/spm directory.

ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer.spm \
    tokenizer.name=enkowiki \
    tokenizer.dataset.train_file=enko_filtered.parquet \
    tokenizer.model.model_type=unigram \
    tokenizer.model.vocab_size=30000 \
    tokenizer.model.character_coverage=0.9995 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

took 4m 2.4s

With 128 processors, it took 4 minutes and 2 seconds to train the tokenizer on the 2,998,856 sentences.

Load the trained model#

import sentencepiece as spm

model_file = "tokenizers/spm/enko_wiki_unigram_vocab_30000.model"
model_file = project_dir + "/" + model_file
uni_spm = spm.SentencePieceProcessor(model_file=model_file)
print(f"Vocab size: {uni_spm.vocab_size()}")
print(uni_spm.encode(text_en[0], out_type=str))
print(uni_spm.encode(text_ko[0], out_type=str))
Vocab size: 30000
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁', 'ter', 'ri', 'torial', '▁ex', 't', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
['▁크리', '클', '레이드', '(', 'C', 'rick', 'lade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁', '템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구', '이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
time: 67.4 ms (started: 2022-11-13 08:48:00 +00:00)

Train BPE model#

ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer.spm \
    tokenizer.name=enkowiki \
    tokenizer.dataset.train_file=enko_filtered.parquet \
    tokenizer.model.model_type=bpe \
    tokenizer.model.vocab_size=30000 \
    tokenizer.model.character_coverage=0.9995 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

took 8m 41.8s

With 128 processors, it took 8 minutes and 42 seconds to train the tokenizer on the 2,998,856 sentences.

Load the trained model#

import sentencepiece as spm

model_file = "tokenizers/spm/enko_wiki_bpe_vocab_30000.model"
model_file = project_dir + "/" + model_file
bpe_spm = spm.SentencePieceProcessor(model_file=model_file)
print(f"Vocab size: {uni_spm.vocab_size()}")
print(bpe_spm.encode(text_en[0], out_type=str))
print(bpe_spm.encode(text_ko[0], out_type=str))
Vocab size: 30000
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁ter', 'rit', 'orial', '▁ext', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
['▁크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
time: 34.1 ms (started: 2022-11-13 08:48:11 +00:00)

Compare the Tokenizers#

Load the tokenizers#

tokenizers = {
    "BPE": bpe_tokenizer,
    "UNI": unigram_tokenizer,
    "UNI_SPM": uni_spm,
    "BPE_SPM": bpe_spm,
}


def tokenize(tokenizer, text):
    """
    Tokenizes the text using the tokenizer.
    """
    if isinstance(tokenizer, spm.SentencePieceProcessor):
        return tokenizer.encode(text, out_type=str)
    return tokenizer.encode(text).tokens
time: 20.6 ms (started: 2022-11-13 08:48:24 +00:00)

Analyze the output of the tokenizers#

texts = [text_en[0], text_ko[0]]
tokens = {name: [] for name in tokenizers.keys()}


# tokenize the texts with the tokenizers
for text in texts:
    for name, tokenizer in tokenizers.items():
        print(f"Tokenizer: {name}")
        tokens[name].append(tokenize(tokenizer, text))
        print(tokens[name][-1])
        print("-" * 50)
Tokenizer: BPE
['This', 'article', 'describ', 'es', 'the', 'process', 'by', 'which', 'the', 'ter', 'rit', 'or', 'ial', 'ext', 'ent', 'of', 'Moroc', 'co', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
--------------------------------------------------
Tokenizer: UNI
['This', 'article', 'describe', 's', 'the', 'process', 'by', 'which', 'the', 'territor', 'ial', 'ex', 't', 'ent', 'of', 'Morocc', 'o', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
--------------------------------------------------
Tokenizer: UNI_SPM
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁', 'ter', 'ri', 'torial', '▁ex', 't', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
--------------------------------------------------
Tokenizer: BPE_SPM
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁ter', 'rit', 'orial', '▁ext', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
--------------------------------------------------
Tokenizer: BPE
['크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
--------------------------------------------------
Tokenizer: UNI
['크리', '클레이', '드', '(', 'C', 'rick', 'la', 'de', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구', '이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
--------------------------------------------------
Tokenizer: UNI_SPM
['▁크리', '클', '레이드', '(', 'C', 'rick', 'lade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁', '템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구', '이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
--------------------------------------------------
Tokenizer: BPE_SPM
['▁크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
--------------------------------------------------
time: 20.8 ms (started: 2022-11-13 08:48:25 +00:00)

Compare the Tokens#

import pandas as pd


def compare_tokens(tokenizers, tokens, sample_num=0):

    max_len = max(len(tokens[name][sample_num]) for name in tokenizers.keys())
    diffs = {
        name: max_len - len(tokens[name][sample_num]) for name in tokenizers.keys()
    }

    padded_tokens = {
        name: tokens[name][sample_num] + [""] * diffs[name]
        for name in tokenizers.keys()
    }

    df = pd.DataFrame(padded_tokens)
    return df
time: 20.5 ms (started: 2022-11-13 08:48:30 +00:00)
compare_tokens(tokenizers, tokens, sample_num=0)
   | BPE | UNI | UNI_SPM | BPE_SPM
0  | This | This | ▁This | ▁This
1  | article | article | ▁article | ▁article
2  | describ | describe | ▁describes | ▁describes
3  | es | s | ▁the | ▁the
4  | the | the | ▁process | ▁process
5  | process | process | ▁by | ▁by
6  | by | by | ▁which | ▁which
7  | which | which | ▁the | ▁the
8  | the | the | ▁ | ▁ter
9  | ter | territor | ter | rit
10 | rit | ial | ri | orial
11 | or | ex | torial | ▁ext
12 | ial | t | ▁ex | ent
13 | ext | ent | t | ▁of
14 | ent | of | ent | ▁Morocco
15 | of | Morocc | ▁of | ▁came
16 | Moroc | o | ▁Morocco | ▁to
17 | co | came | ▁came | ▁be
18 | came | to | ▁to | ▁as
19 | to | be | ▁be | ▁it
20 | be | as | ▁as | ▁is
21 | as | it | ▁it | ▁now
22 | it | is | ▁is | .
23 | is | now | ▁now |
24 | now | . | . |
25 | . |  |  |
time: 26.6 ms (started: 2022-11-13 08:48:31 +00:00)
compare_tokens(tokenizers, tokens, sample_num=1)
   | BPE | UNI | UNI_SPM | BPE_SPM
0  | 크리 | 크리 | ▁크리 | ▁크리
1  | 클 | 클레이 | 클 | 클
2  | 레이드 | 드 | 레이드 | 레이드
3  | ( | ( | ( | (
4  | C | C | C | C
5  | rick | rick | rick | rick
6  | l | la | lade | l
7  | ade | de | ) | ade
8  | ) | ) | 는 | )
9  | 는 | 는 | ▁잉글랜드의 | 는
10 | 잉글랜드의 | 잉글랜드의 | ▁노스 | ▁잉글랜드의
11 | 노스 | 노스 | ▁윌 | ▁노스
12 | 윌 | 윌 | 트 | ▁윌
13 | 트 | 트 | 셔 | 트
14 | 셔 | 셔 | 에 | 셔
15 | 에 | 에 | ▁위치한 | 에
16 | 위치한 | 위치한 | ▁ | ▁위치한
17 | 템 | 템 | 템 | ▁템
18 | 스 | 스 | 스 | 스
19 | 강의 | 강의 | ▁강의 | ▁강의
20 | 타운 | 타운 | ▁타운 | ▁타운
21 | 이자 | 이자 | 이자 | 이자
22 | 지방 | 지방 | ▁지방 | ▁지방
23 | 행정 | 행정 | ▁행정 | ▁행정
24 | 구이다 | 구 | 구 | 구이다
25 | . | 이다 | 이다 | .
26 | 스 | . | . | ▁스
27 | 윈 | 스 | ▁스 | 윈
28 | 던 | 윈 | 윈 | 던
29 | 과 | 던 | 던 | 과
30 | 시 | 과 | 과 | ▁시
31 | 런 | 시 | ▁시 | 런
32 | 세 | 런 | 런 | 세
33 | 스터 | 세 | 세 | 스터
34 | 의 | 스터 | 스터 | 의
35 | 중간에 | 의 | 의 | ▁중간에
36 | 위치해 | 중간에 | ▁중간에 | ▁위치해
37 | 있다 | 위치해 | ▁위치해 | ▁있다
38 | . | 있다 | ▁있다 | .
39 |  | . | . |
time: 26.1 ms (started: 2022-11-13 08:48:32 +00:00)

Usage of Tokenizer Trainer#

data = eKonf.load_data("wiki_filtered.parquet", project_dir + "/data")
bnwiki_filtered = data[data.corpus == "bnwiki"]

data_file = "bnwiki_filtered.parquet"
cfg = eKonf.compose("tokenizer=trainer")
cfg.name = "bnwiki"
cfg.batch.num_workers = 128
cfg.dataset.train_file = data_file
cfg.training.use_sample = False
cfg.model.model_type = "bpe"
cfg.model.vocab_size = 30000
cfg.auto = "train"

eKonf.save_data(bnwiki_filtered, data_file, cfg.path.data_dir)
INFO:ekorpkit.io.file:Processing [1] files from ['wiki_filtered.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/projects/ekorpkit-book/data/wiki_filtered.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/wiki_filtered.parquet
INFO:root:compose config with overrides: ['tokenizer=trainer']
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/data/bnwiki_filtered.parquet
time: 16.5 s (started: 2022-11-25 02:14:06 +00:00)

BPE Model by Hugging Face#

ekorpkit \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer \
    tokenizer.name=bnwiki \
    tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
    tokenizer.model.model_type=bpe \
    tokenizer.model.vocab_size=30000 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

Unigram Model by Hugging Face#

ekorpkit \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer \
    tokenizer.name=bnwiki \
    tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
    tokenizer.model.model_type=unigram \
    tokenizer.model.vocab_size=30000 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

BPE Model by SentencePiece#

ekorpkit \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer.spm \
    tokenizer.name=bnwiki \
    tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
    tokenizer.model.model_type=bpe \
    tokenizer.model.vocab_size=30000 \
    tokenizer.model.character_coverage=0.9995 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

Unigram Model by SentencePiece#

ekorpkit \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer.spm \
    tokenizer.name=bnwiki \
    tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
    tokenizer.model.model_type=unigram \
    tokenizer.model.vocab_size=30000 \
    tokenizer.model.character_coverage=0.9995 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

Compare the Tokenizers#

from ekorpkit.tokenizers.trainer import TokenizerTrainer, compare_tokens


def load_tokenizer(trainer_type, model_type, name="kowiki"):
    cfg_group = "tokenizer=trainer"
    if trainer_type == "spm":
        cfg_group += ".spm"
    cfg = eKonf.compose(cfg_group)
    cfg.name = name
    cfg.model.model_type = model_type
    trainer = TokenizerTrainer(**cfg)
    
    return trainer.tokenizer_obj
time: 16.2 ms (started: 2022-11-29 10:48:09 +00:00)
tokenizers = {}

for trainer_type in ['spm', 'huggingface']:
    for model_type in ['bpe', 'unigram']:
        name = f"{model_type}_{trainer_type}"
        print(name)
        tokenizers[name] = load_tokenizer(trainer_type, model_type, "bnwiki")
INFO:root:compose config with overrides: ['tokenizer=trainer.spm']
bpe_spm
INFO:ekorpkit.base:No method defined to call
INFO:root:compose config with overrides: ['tokenizer=trainer.spm']
unigram_spm
INFO:ekorpkit.base:No method defined to call
INFO:root:compose config with overrides: ['tokenizer=trainer']
bpe_huggingface
INFO:ekorpkit.base:No method defined to call
INFO:root:compose config with overrides: ['tokenizer=trainer']
unigram_huggingface
INFO:ekorpkit.base:No method defined to call
time: 2.36 s (started: 2022-11-29 10:48:14 +00:00)
cfg_group = "tokenizer=trainer"
cfg = eKonf.compose(cfg_group)
cfg.name = "bnwiki"
trainer = TokenizerTrainer(**cfg)

sentences = []
files = trainer.input_files
for file in files:
    with open(file, "r") as f:
        sentences.extend(f.readlines())

print("number of sentences:", len(sentences))
print(sentences[0].strip())
INFO:root:compose config with overrides: ['tokenizer=trainer']
INFO:ekorpkit.base:No method defined to call
number of sentences: 357833
জোড্ডা পূর্ব বাংলাদেশের কুমিল্লা জেলার অন্তর্গত নাঙ্গলকোট উপজেলার একটি ইউনিয়ন।
time: 980 ms (started: 2022-11-29 10:48:22 +00:00)
compare_tokens(tokenizers, sentences)
Text: মোহাম্মদ রফিকুল ইসলাম ২০২০ সালের ১ সেপ্টেম্বর গাজীপুরে অবস্থিত আন্তর্জাতিক প্রযুক্তি বিশ্ববিদ্যালয়, ইসলামিক ইউনিভার্সিটি অব টেকনোলজির (আইইউটি) উপাচার্য হিসেবে যোগদান করেন।
bpe_spm unigram_spm bpe_huggingface unigram_huggingface
0 ▁মোহাম্মদ ▁মোহাম্মদ মহমমদ
1 ▁রফিকুল ▁রফিকুল রফকল মহমমদ
2 ▁ইসলাম ▁ইসলাম ইসলম
3 ▁২০২০ ▁২০২০ ২০২০ রফকল
4 ▁সালের ▁সালের সলর
5 ▁১ ▁১ ইসলম
6 ▁সেপ্টেম্বর ▁সেপ্টেম্বর সপটমবর
7 ▁গাজীপুর ▁গাজীপুর গজপর ২০২০
8 অবসথত
9 ▁অবস্থিত ▁অবস্থিত আনতরজতক সলর
10 ▁আন্তর্জাতিক ▁আন্তর্জাতিক পরযকত
11 ▁প্রযুক্তি ▁প্রযুক্তি বশববদযলয
12 ▁বিশ্ববিদ্যালয় ▁বিশ্ববিদ্যালয় ,
13 , , ইসলমক সপটমবর
14 ▁ইসলামিক ▁ইসলামিক ইউনভরসট
15 ▁ইউনিভার্সিটি ▁ইউনিভার্সিটি অব গজপর
16 ▁অব ▁অব টকনলজর
17 ▁টেকনোলজির ▁টেকনোলজির ( অবসথত
18 ▁( ▁( আই
19 আই আইইউ ইউট আনতরজতক
20 ইউ টি )
21 টি ) উপচরয পরযকত
22 ) ▁উপাচার্য হসব
23 ▁উপাচার্য ▁হিসেবে যগদন বশববদযলয
24 ▁হিসেবে ▁যোগদান করন ,
25 ▁যোগদান ▁করেন
26 ▁করেন ইসলমক
27
28 ইউনভরসট
29
30 অব
31
32 টকনলজ
33
34
35 (
36 আইইউ
37
38 )
39
40 উপচরয
41
42 হসব
43
44 যগদন
45
46 করন
47
time: 25 ms (started: 2022-11-29 10:48:30 +00:00)

Usage of Tokenizer Trainer#

%config InlineBackend.figure_format='retina'
%load_ext autotime
%load_ext autoreload
%autoreload 2

from ekorpkit import eKonf

eKonf.setLogger("INFO")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
project_dir = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
print("project_dir:", project_dir)
INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Overwriting EKORPKIT_WORKSPACE_ROOT=/workspace with /content/drive/MyDrive/workspace
INFO:ekorpkit.base:Setting EKORPKIT_WORKSPACE_ROOT to /content/drive/MyDrive/workspace
INFO:ekorpkit.base:Overwriting EKORPKIT_PROJECT=ekorpkit-book with ekorpkit-book
INFO:ekorpkit.base:Setting EKORPKIT_PROJECT to ekorpkit-book
INFO:ekorpkit.base:Overwriting EKORPKIT_PROJECT_DIR=/workspace/projects/ekorpkit-book with /content/drive/MyDrive/workspace/projects/ekorpkit-book
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
version: 0.1.40.post0.dev37
is colab? False
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
time: 964 ms (started: 2022-11-30 06:49:42 +00:00)
data_file ="data/enko_filtered.parquet"

cfg = eKonf.compose("tokenizer=trainer")
cfg.name = "enkowiki"
cfg.batch.num_workers = 128
cfg.dataset.train_file = data_file
cfg._train_.use_sample = False
cfg.model.model_type = "bpe"
cfg.model.vocab_size = 30000
# cfg.auto = "train"

print(cfg.path.data_dir)
# eKonf.copy(f"{project_dir}/{data_file}", cfg.path.data_dir)
INFO:root:compose config with overrides: ['tokenizer=trainer']
/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/data
time: 506 ms (started: 2022-11-30 06:49:43 +00:00)
from ekorpkit.tokenizers.trainer import TokenizerTrainer

cfg.model.model_type = "unigram"
trainer = TokenizerTrainer(**cfg)
INFO:root:compose config with overrides: ['tokenizer=trainer']
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
INFO:ekorpkit.base:Setting WANDB_PROJECT=ekorpkit-book
INFO:ekorpkit.config:Using existing path: /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling
INFO:ekorpkit.base:No method defined to call
time: 2.06 s (started: 2022-11-30 06:49:45 +00:00)
trainer.tokenize(text="test")
['▁', 'test']
time: 64.5 ms (started: 2022-11-30 06:49:48 +00:00)
trainer.save_config()
INFO:ekorpkit.config:Saving config to /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/enkowiki/configs/enkowiki(11)_config.yaml
INFO:ekorpkit.config:Saving config to /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/enkowiki/configs/enkowiki(11)_config.json
'enkowiki(11)_config.yaml'
time: 267 ms (started: 2022-11-30 06:39:45 +00:00)
ekorpkit print_config=false \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer \
    tokenizer.name=enkowiki \
    tokenizer.dataset.train_file=data/enko_filtered.parquet \
    tokenizer.model.model_type=unigram \
    tokenizer.model.vocab_size=30000 \
    tokenizer._train_.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train