Lab 3: Training Tokenizers#

Prepare the environment#

%pip install --pre ekorpkit[tokenize]
from ekorpkit import eKonf

eKonf.setLogger("INFO")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
ws = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
print("project_dir:", ws.project_dir)
ws.envs.dict()
INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Set environment variable EKORPKIT_WORKSPACE_ROOT=/content/drive/MyDrive/workspace
INFO:ekorpkit.base:Set environment variable EKORPKIT_PROJECT_DIR=/content/drive/MyDrive/workspace/projects/ekorpkit-book
version: 0.1.40.post0.dev51
is colab? False
INFO:root:compose config with overrides: ['+project=default']
INFO:ekorpkit.base:There are no arguments to initilize a config, using default config.
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
{'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
 'EKORPKIT_WORKSPACE_ROOT': '/content/drive/MyDrive/workspace',
 'EKORPKIT_PROJECT': 'ekorpkit-book',
 'EKORPKIT_PROJECT_DIR': '/content/drive/MyDrive/workspace/projects/ekorpkit-book',
 'EKORPKIT_DATA_DIR': None,
 'EKORPKIT_LOG_LEVEL': 'INFO',
 'NUM_WORKERS': 230,
 'KMP_DUPLICATE_LIB_OK': 'TRUE',
 'CUDA_DEVICE_ORDER': None,
 'CUDA_VISIBLE_DEVICES': None,
 'WANDB_PROJECT': None,
 'WANDB_DISABLED': None,
 'LABEL_STUDIO_SERVER': 'http://ekorpkit-labelstudio:8080',
 'CACHED_PATH_CACHE_ROOT': None}
time: 936 ms (started: 2022-12-13 04:56:08 +00:00)

Load the saved corpora#

data = eKonf.load_data("enko_filtered.parquet", ws.project_dir / "data")
data.head()
INFO:ekorpkit.io.file:Processing [1] files from ['enko_filtered.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/projects/ekorpkit-book/data/enko_filtered.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/enko_filtered.parquet
id text split filename corpus num_chars num_words num_sents avg_num_chars avg_num_words
1 7644961 Anaissini is a tribe of click beetles in the f... train wiki_49 enwiki_sampled 63 11 1 5.727273 11.000000
2 6658552 The Vicky Metcalf Award for Literature for You... train wiki_24 enwiki_sampled 479 82 5 5.841463 16.400000
4 11081255 Eylex Films Pvt is a chain of multiplex and si... train wiki_94 enwiki_sampled 1161 181 12 6.414365 15.083333
8 4706486 Željko Zečević (; born 21 October 1963) is a S... train wiki_02 enwiki_sampled 1151 201 15 5.726368 13.400000
12 2170359 Gilberto Nascimento Silva (born 9 June 1956) i... train wiki_57 enwiki_sampled 685 105 9 6.523810 11.666667
time: 3.17 s (started: 2022-12-13 00:32:31 +00:00)
text_column = "text"

text_en = (
    data[data.corpus == "enwiki_sampled"][text_column].sample(1).values[0].split("\n")
)
text_ko = data[data.corpus == "kowiki"][text_column].sample(1).values[0].split("\n")

print(text_en)
print(text_ko)
['The Rijeka Thermal Power Station (, TE Rijeka, also known as "TE Urinj") is an oil-fired power station east of Rijeka at Kostrena, Croatia. It was built between 1974 and 1978 and it has one generation unit with capacity of 320\xa0MW. The height of the boiler house including its rooftop flue gas stack is .', 'Turbine for the power station was supplied by Ansaldo Energia. Ansaldo Energia was also awarded engineering, procurement and construction contract. Boilers were supplied by Waagner-Biro.', 'The power station is owned and operated by Hrvatska elektroprivreda. Its annual production varies, averaging 917 GWh, but only 141 GWh in 2011. It is expected to undergo decommissioning in 2020, but it is doubtful that it will remain operational until then because of its pollution problem. , Rijeka Thermal Power Station is offline, ready to resume generation within 160 hours of notice. On 18 October 2022, it was unofficially reported that HEP plans to restart the operation of the power plant in order to cover the losses incurred during the energy crisis.']
['런던의 2012년 하계 올림픽 펜싱 여자 플뢰레 단체전은 8월 2일에 엑셀 박람회관에서 진행되었다.', '토너먼트 형식.', '9팀이 여자 플뢰레 단체전에 참가하였다. 본선에 출전하는 팀들은 FIE 랭킹에 따라 대진이 정해졌다. 영국은 개최국 자격으로 모든 종목을 선택 여부에 따라 참전할 수 있는 자격이 주어졌다. 영국은 이 토너먼트에 참가하여 첫경기에서 9번 시드의 이집트를 상대하여 승리하며 8강전에서 나머지 7개의 참가팀들과 만났다. 8강전에서 패한 팀들도 두 경기를 더 치러 5위에서 8위까지의 순위를 결정한다. 반대로, 8강전에서 승리한 팀들은 준결승전에서 서로와 맞붙는다. 준결승전에서 승리한 두 팀은 금메달 결정전으로, 패한 팀들은 동메달 결정전으로 이동한다.', '단체전은 먼저 45투셰를 기록하거나, 정규시간이 지나고 더 많은 투셰를 기록한 팀이 승리한다.']
time: 145 ms (started: 2022-11-14 02:09:16 +00:00)

Convert the pandas dataframe to a Hugging Face dataset#

from datasets import Dataset

raw_dataset = Dataset.from_pandas(data[[text_column]])
raw_dataset
Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 603719
})
time: 1.22 s (started: 2022-11-14 02:09:19 +00:00)
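The __index_level_0__ feature is just the pandas index carried over by Dataset.from_pandas. If you do not need it, you can drop it; this is an optional step, not part of the original lab:

# Optional: drop the leftover pandas index column created by from_pandas()
if "__index_level_0__" in raw_dataset.column_names:
    raw_dataset = raw_dataset.remove_columns(["__index_level_0__"])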

Shuffle the dataset#

# shuffle the dataset

raw_dataset = raw_dataset.shuffle(seed=42)
time: 116 ms (started: 2022-11-14 02:09:21 +00:00)

Split the dataset into sentences for training#

The sentencepiece module ships with a Python training API that expects its training data as a text file with one sentence per line. We will use the sent_tokenize function from the nltk package to split the text into sentences. sent_tokenize is a wrapper around the pre-trained punkt sentence tokenizer, which was trained on the Penn Treebank corpus, a collection of Wall Street Journal articles. The punkt tokenizer is a good choice for plain English text, but it may not be the best choice for other languages.
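As a quick illustration, sent_tokenize simply splits a paragraph into sentence strings. The example below reuses a snippet of the English sample shown earlier; the expected output is shown as comments:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # sent_tokenize needs the punkt model

sample = (
    "The power station is owned and operated by Hrvatska elektroprivreda. "
    "Its annual production varies, averaging 917 GWh."
)
print(sent_tokenize(sample))
# ['The power station is owned and operated by Hrvatska elektroprivreda.',
#  'Its annual production varies, averaging 917 GWh.']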

import nltk
from nltk.tokenize import sent_tokenize
from ekorpkit.tokenizers.trainers.spm import export_sentence_chunk_files

nltk.download("punkt")
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
time: 1.13 s (started: 2022-11-14 02:09:23 +00:00)
project_dir = str(ws.project_dir)  # plain string path used for concatenation below
output_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"

export_sentence_chunk_files(
    raw_dataset,
    output_dir=output_dir,
    chunk_size=10000,
    text_column=text_column,
    sent_tokenize=sent_tokenize,
)
INFO:ekorpkit.tokenizers.trainers.spm:Writing sentence chunks to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_chunk
time: 1min 29s (started: 2022-11-14 02:09:26 +00:00)

Sample sentences and combine them into a single file#

If your dataset is too large, you can sample a subset of the sentence chunk files for training; the sample function from the random module can be used to pick the subset.

You can use the sample_and_combine function to sample a subset of the sentence chunk files and combine them into a single file.
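A hand-rolled equivalent with random.sample would look roughly like this (a hypothetical sketch, not the ekorpkit implementation):

import random
from pathlib import Path


def sample_and_combine_manually(input_dir, output_file, sample_size=50, seed=42):
    """Sample sentence chunk files and concatenate them into a single text file."""
    files = sorted(Path(input_dir).glob("*.txt"))
    random.seed(seed)
    sampled = random.sample(files, min(sample_size, len(files)))
    with open(output_file, "w", encoding="utf-8") as out:
        for file in sampled:
            out.write(file.read_text(encoding="utf-8"))
    return output_file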

from ekorpkit.tokenizers.trainers.spm import sample_and_combine

input_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"
output_dir = project_dir + "/data/tokenizers/enko_filtered_samples"

sampled_file = sample_and_combine(
    input_dir=input_dir, output_dir=output_dir, sample_size=50
)
INFO:ekorpkit.tokenizers.trainers.spm:sampled files: ['sent_chunk_0009.txt', 'sent_chunk_0019.txt', 'sent_chunk_0060.txt', 'sent_chunk_0038.txt', 'sent_chunk_0000.txt', 'sent_chunk_0042.txt', 'sent_chunk_0025.txt', 'sent_chunk_0053.txt', 'sent_chunk_0035.txt', 'sent_chunk_0033.txt', 'sent_chunk_0008.txt', 'sent_chunk_0023.txt', 'sent_chunk_0004.txt', 'sent_chunk_0024.txt', 'sent_chunk_0013.txt', 'sent_chunk_0003.txt', 'sent_chunk_0017.txt', 'sent_chunk_0051.txt', 'sent_chunk_0027.txt', 'sent_chunk_0058.txt', 'sent_chunk_0012.txt', 'sent_chunk_0029.txt', 'sent_chunk_0015.txt', 'sent_chunk_0044.txt', 'sent_chunk_0057.txt', 'sent_chunk_0020.txt', 'sent_chunk_0052.txt', 'sent_chunk_0059.txt', 'sent_chunk_0005.txt', 'sent_chunk_0011.txt', 'sent_chunk_0031.txt', 'sent_chunk_0030.txt', 'sent_chunk_0001.txt', 'sent_chunk_0056.txt', 'sent_chunk_0047.txt', 'sent_chunk_0055.txt', 'sent_chunk_0007.txt', 'sent_chunk_0032.txt', 'sent_chunk_0018.txt', 'sent_chunk_0014.txt', 'sent_chunk_0016.txt', 'sent_chunk_0054.txt', 'sent_chunk_0028.txt', 'sent_chunk_0021.txt', 'sent_chunk_0006.txt', 'sent_chunk_0040.txt', 'sent_chunk_0049.txt', 'sent_chunk_0043.txt', 'sent_chunk_0037.txt', 'sent_chunk_0022.txt']
INFO:ekorpkit.tokenizers.trainers.spm:number of lines sampled: 2,998,856
INFO:ekorpkit.tokenizers.trainers.spm:saved sampled sentences to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_samples/sampled_sentences.txt
time: 20.9 s (started: 2022-11-13 05:07:06 +00:00)

Train tokenizers with Hugging Face Tokenizers#

Hugging Face’s Tokenizers provides a wide range of tokenizers, including BPE, WordPiece, Unigram, SentencePiece, and ByteLevel. We will use the BPE and Unigram tokenizers in this lab.

Import the libraries and prepare functions#

from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram, WordLevel
from tokenizers.trainers import BpeTrainer, UnigramTrainer, WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
from ekorpkit.tokenizers.trainers.spm import batch_chunks


unk_token = "<UNK>"  # token for unknown words
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>", "[MASK]"]  # special tokens


def prepare_tokenizer_trainer(algo):
    """
    Prepares the tokenizer and trainer with unknown & special tokens.
    """
    if algo == "BPE":
        tokenizer = Tokenizer(BPE(unk_token=unk_token))
        trainer = BpeTrainer(special_tokens=spl_tokens)
    elif algo == "UNI":
        tokenizer = Tokenizer(Unigram())
        trainer = UnigramTrainer(unk_token=unk_token, special_tokens=spl_tokens)
    else:
        tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
        trainer = WordLevelTrainer(special_tokens=spl_tokens)

    normalizer = normalizers.Sequence([NFD(), StripAccents()])
    tokenizer.normalizer = normalizer
    tokenizer.pre_tokenizer = Whitespace()

    return tokenizer, trainer
time: 2.35 ms (started: 2022-11-12 06:51:08 +00:00)
def train_tokenizer(algo="BPE"):
    """
    Takes the files and trains the tokenizer.
    """
    save_path = f"{project_dir}/tokenizers/{algo}_tokenizer.json"
    tokenizer, trainer = prepare_tokenizer_trainer(algo)
    tokenizer.train_from_iterator(
        batch_chunks(raw_dataset, batch_size=1000, text_column=text_column),
        trainer=trainer,
    )
    tokenizer.save(save_path)
    tokenizer = Tokenizer.from_file(save_path)
    return tokenizer
time: 20.2 ms (started: 2022-11-13 03:09:51 +00:00)
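If you have already exported and sampled the sentence files above, you can also train from text files on disk instead of the in-memory iterator; a minimal sketch (this is not the path used for the timings below, and the output file name is hypothetical):

# Alternative: Tokenizer.train() accepts a list of text files, one sentence per line.
tokenizer, trainer = prepare_tokenizer_trainer("BPE")
tokenizer.train([str(sampled_file)], trainer=trainer)
tokenizer.save(f"{project_dir}/tokenizers/BPE_tokenizer_from_file.json")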

Train BPE tokenizer#

model_path = train_tokenizer("BPE")

time: 15h 30min 58s (started: 2022-11-11 06:36:01 +00:00)

Training the BPE tokenizer on the 2,998,856 sentences took 15 hours and 31 minutes. The tokenizer was saved in the {project_dir}/tokenizers directory.

To train more efficiently with multiple processors, it is preferable to use the CLI (command-line interface).

ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer \
    tokenizer.name=enkowiki \
    tokenizer.dataset.train_file=enko_filtered.parquet \
    tokenizer.model.model_type=bpe \
    tokenizer.model.vocab_size=30000 \
    tokenizer.model.character_coverage=0.9995 \
    tokenizer.training.use_sample=false \
    tokenizer.auto=train

took 22m 36.6s

With 256 processors, it took 22 minutes and 37 seconds to train the tokenizer on the 2,998,856 sentences.

from tokenizers import Tokenizer

tokenizer_path = (
    f"{project_dir}/tokenizers/hf/enko_wiki/enko_wiki_bpe_huggingface_vocab_30000.json"
)

bpe_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {bpe_tokenizer.get_vocab_size()}")
print(bpe_tokenizer.encode(text_en[0]).tokens)
print(bpe_tokenizer.encode(text_ko[0]).tokens)
Vocab size: 30000
['This', 'article', 'describ', 'es', 'the', 'process', 'by', 'which', 'the', 'ter', 'rit', 'or', 'ial', 'ext', 'ent', 'of', 'Moroc', 'co', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
['크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
time: 67.4 ms (started: 2022-11-13 08:47:47 +00:00)

Train Unigram tokenizer#

model_path = train_tokenizer("UNI")

For a very large corpus, training a Unigram tokenizer in the notebook may take a long time. It is recommended to use the following CLI command instead.

ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer \
    tokenizer.name=enkowiki \
    tokenizer.dataset.train_file=enko_filtered.parquet \
    tokenizer.model.model_type=unigram \
    tokenizer.model.vocab_size=30000 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

took 10m 10.7s

With 256 processors, it took 10 minutes and 11 seconds to train the tokenizer on the 2,998,856 sentences.

from tokenizers import Tokenizer

tokenizer_path = (
    f"{project_dir}/tokenizers/hf/enko_wiki/enko_wiki_unigram_huggingface_vocab_30000.json"
)

unigram_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {bpe_tokenizer.get_vocab_size()}")
print(unigram_tokenizer.encode(text_en[0]).tokens)
print(unigram_tokenizer.encode(text_ko[0]).tokens)
Vocab size: 30000
['This', 'article', 'describe', 's', 'the', 'process', 'by', 'which', 'the', 'territor', 'ial', 'ex', 't', 'ent', 'of', 'Morocc', 'o', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
['크리', '클레이', '드', '(', 'C', 'rick', 'la', 'de', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구', '이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
time: 82 ms (started: 2022-11-13 08:47:51 +00:00)

Train tokenizers with Google SentencePiece (SPM)#

Install SentencePiece#

pip install sentencepiece

Train SentencePiece models#

You can use the train_spm function to train a SentencePiece model. It takes the following arguments (a rough SentencePiece-level equivalent is sketched after the list):

  • model_prefix: The prefix of the model file. The model file will be saved as {model_prefix}_{model_type}_vocab_{vocab_size}.model.

  • input: The input file for training.

  • output_dir: The directory to save the model file.

  • vocab_size: The vocabulary size.

  • model_type: The model type. It can be unigram (default), bpe, char, or word.

  • character_coverage: The character coverage. It is only used for unigram and bpe model types. The default value is 1.0.

  • num_threads: The number of threads to use for training. The default value is 1. The max value is 128.

  • train_extremely_large_corpus: Whether to train an extremely large corpus. The default value is False.
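Under the hood these options map onto SentencePiece's own training API; a direct call would look roughly like this (a sketch of the equivalent spm invocation, not the ekorpkit implementation; the model_prefix shown is hypothetical):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input=str(sampled_file),          # plain text, one sentence per line
    model_prefix="enko_wiki_unigram_vocab_30000",
    vocab_size=30000,
    model_type="unigram",             # or "bpe", "char", "word"
    character_coverage=0.9995,
    num_threads=128,
)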

Train Unigram model#

from ekorpkit.tokenizers.trainers.spm import train_spm

uni_model_path = train_spm(
    model_prefix="enko_wiki",
    input=sampled_file,
    output_dir=project_dir + "/tokenizers/spm",
    model_type="unigram",
    vocab_size=30000,
    character_coverage=0.9995,
    num_threads=128,
)

time: 23min 6s (started: 2022-11-12 08:03:08 +00:00)

It took 23 minutes to train a unigram model with a vocabulary size of 30,000. The model file was saved in the {project_dir}/tokenizers/spm directory.

ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer.spm \
    tokenizer.name=enkowiki \
    tokenizer.dataset.train_file=enko_filtered.parquet \
    tokenizer.model.model_type=unigram \
    tokenizer.model.vocab_size=30000 \
    tokenizer.model.character_coverage=0.9995 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

took 4m 2.4s

With 128 processors, it took 4 minutes and 2 seconds to train the tokenizer on the 2,998,856 sentences.

Load the trained model#

import sentencepiece as spm

model_file = "tokenizers/spm/enko_wiki_unigram_vocab_30000.model"
model_file = project_dir + "/" + model_file
uni_spm = spm.SentencePieceProcessor(model_file=model_file)
print(f"Vocab size: {uni_spm.vocab_size()}")
print(uni_spm.encode(text_en[0], out_type=str))
print(uni_spm.encode(text_ko[0], out_type=str))
Vocab size: 30000
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁', 'ter', 'ri', 'torial', '▁ex', 't', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
['▁크리', '클', '레이드', '(', 'C', 'rick', 'lade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁', '템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구', '이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
time: 67.4 ms (started: 2022-11-13 08:48:00 +00:00)

Train BPE model#

ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer.spm \
    tokenizer.name=enkowiki \
    tokenizer.dataset.train_file=enko_filtered.parquet \
    tokenizer.model.model_type=bpe \
    tokenizer.model.vocab_size=30000 \
    tokenizer.model.character_coverage=0.9995 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

took 8m 41.8s

With 128 processors, it took 8 minutes and 42 seconds to train the tokenizer on the 2,998,856 sentences.

Load the trained model#

import sentencepiece as spm

model_file = "tokenizers/spm/enko_wiki_bpe_vocab_30000.model"
model_file = project_dir + "/" + model_file
bpe_spm = spm.SentencePieceProcessor(model_file=model_file)
print(f"Vocab size: {uni_spm.vocab_size()}")
print(bpe_spm.encode(text_en[0], out_type=str))
print(bpe_spm.encode(text_ko[0], out_type=str))
Vocab size: 30000
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁ter', 'rit', 'orial', '▁ext', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
['▁크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
time: 34.1 ms (started: 2022-11-13 08:48:11 +00:00)

Compare the Tokenizers#

Load the tokenizers#

tokenizers = {
    "BPE": bpe_tokenizer,
    "UNI": unigram_tokenizer,
    "UNI_SPM": uni_spm,
    "BPE_SPM": bpe_spm,
}


def tokenize(tokenizer, text):
    """
    Tokenizes the text using the tokenizer.
    """
    if isinstance(tokenizer, spm.SentencePieceProcessor):
        return tokenizer.encode(text, out_type=str)
    return tokenizer.encode(text).tokens
time: 20.6 ms (started: 2022-11-13 08:48:24 +00:00)

Analyze the output of the tokenizers#

texts = [text_en[0], text_ko[0]]
tokens = {name: [] for name in tokenizers.keys()}


# tokenize the texts with the tokenizers
for text in texts:
    for name, tokenizer in tokenizers.items():
        print(f"Tokenizer: {name}")
        tokens[name].append(tokenize(tokenizer, text))
        print(tokens[name][-1])
        print("-" * 50)
Tokenizer: BPE
['This', 'article', 'describ', 'es', 'the', 'process', 'by', 'which', 'the', 'ter', 'rit', 'or', 'ial', 'ext', 'ent', 'of', 'Moroc', 'co', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
--------------------------------------------------
Tokenizer: UNI
['This', 'article', 'describe', 's', 'the', 'process', 'by', 'which', 'the', 'territor', 'ial', 'ex', 't', 'ent', 'of', 'Morocc', 'o', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
--------------------------------------------------
Tokenizer: UNI_SPM
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁', 'ter', 'ri', 'torial', '▁ex', 't', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
--------------------------------------------------
Tokenizer: BPE_SPM
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁ter', 'rit', 'orial', '▁ext', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
--------------------------------------------------
Tokenizer: BPE
['크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
--------------------------------------------------
Tokenizer: UNI
['크리', '클레이', '드', '(', 'C', 'rick', 'la', 'de', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구', '이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
--------------------------------------------------
Tokenizer: UNI_SPM
['▁크리', '클', '레이드', '(', 'C', 'rick', 'lade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁', '템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구', '이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
--------------------------------------------------
Tokenizer: BPE_SPM
['▁크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
--------------------------------------------------
time: 20.8 ms (started: 2022-11-13 08:48:25 +00:00)

Compare the Tokens#

import pandas as pd


def compare_tokens(tokenizers, tokens, sample_num=0):

    max_len = max(len(tokens[name][sample_num]) for name in tokenizers.keys())
    diffs = {
        name: max_len - len(tokens[name][sample_num]) for name in tokenizers.keys()
    }

    padded_tokens = {
        name: tokens[name][sample_num] + [""] * diffs[name]
        for name in tokenizers.keys()
    }

    df = pd.DataFrame(padded_tokens)
    return df
time: 20.5 ms (started: 2022-11-13 08:48:30 +00:00)
compare_tokens(tokenizers, tokens, sample_num=0)
   | BPE | UNI | UNI_SPM | BPE_SPM
0  | This | This | ▁This | ▁This
1  | article | article | ▁article | ▁article
2  | describ | describe | ▁describes | ▁describes
3  | es | s | ▁the | ▁the
4  | the | the | ▁process | ▁process
5  | process | process | ▁by | ▁by
6  | by | by | ▁which | ▁which
7  | which | which | ▁the | ▁the
8  | the | the | ▁ | ▁ter
9  | ter | territor | ter | rit
10 | rit | ial | ri | orial
11 | or | ex | torial | ▁ext
12 | ial | t | ▁ex | ent
13 | ext | ent | t | ▁of
14 | ent | of | ent | ▁Morocco
15 | of | Morocc | ▁of | ▁came
16 | Moroc | o | ▁Morocco | ▁to
17 | co | came | ▁came | ▁be
18 | came | to | ▁to | ▁as
19 | to | be | ▁be | ▁it
20 | be | as | ▁as | ▁is
21 | as | it | ▁it | ▁now
22 | it | is | ▁is | .
23 | is | now | ▁now |
24 | now | . | . |
25 | . |  |  |
time: 26.6 ms (started: 2022-11-13 08:48:31 +00:00)
compare_tokens(tokenizers, tokens, sample_num=1)
   | BPE | UNI | UNI_SPM | BPE_SPM
0  | 크리 | 크리 | ▁크리 | ▁크리
1  | 클 | 클레이 | 클 | 클
2  | 레이드 | 드 | 레이드 | 레이드
3  | ( | ( | ( | (
4  | C | C | C | C
5  | rick | rick | rick | rick
6  | l | la | lade | l
7  | ade | de | ) | ade
8  | ) | ) | 는 | )
9  | 는 | 는 | ▁잉글랜드의 | 는
10 | 잉글랜드의 | 잉글랜드의 | ▁노스 | ▁잉글랜드의
11 | 노스 | 노스 | ▁윌 | ▁노스
12 | 윌 | 윌 | 트 | ▁윌
13 | 트 | 트 | 셔 | 트
14 | 셔 | 셔 | 에 | 셔
15 | 에 | 에 | ▁위치한 | 에
16 | 위치한 | 위치한 | ▁ | ▁위치한
17 | 템 | 템 | 템 | ▁템
18 | 스 | 스 | 스 | 스
19 | 강의 | 강의 | ▁강의 | ▁강의
20 | 타운 | 타운 | ▁타운 | ▁타운
21 | 이자 | 이자 | 이자 | 이자
22 | 지방 | 지방 | ▁지방 | ▁지방
23 | 행정 | 행정 | ▁행정 | ▁행정
24 | 구이다 | 구 | 구 | 구이다
25 | . | 이다 | 이다 | .
26 | 스 | . | . | ▁스
27 | 윈 | 스 | ▁스 | 윈
28 | 던 | 윈 | 윈 | 던
29 | 과 | 던 | 던 | 과
30 | 시 | 과 | 과 | ▁시
31 | 런 | 시 | ▁시 | 런
32 | 세 | 런 | 런 | 세
33 | 스터 | 세 | 세 | 스터
34 | 의 | 스터 | 스터 | 의
35 | 중간에 | 의 | 의 | ▁중간에
36 | 위치해 | 중간에 | ▁중간에 | ▁위치해
37 | 있다 | 위치해 | ▁위치해 | ▁있다
38 | . | 있다 | ▁있다 | .
39 |  | . | . |
time: 26.1 ms (started: 2022-11-13 08:48:32 +00:00)

Usage of Tokenizer Trainer#

data = eKonf.load_data("wiki_filtered.parquet", project_dir + "/data")
bnwiki_filtered = data[data.corpus == "bnwiki"]

data_file = "bnwiki_filtered.parquet"
cfg = eKonf.compose("tokenizer=trainer")
cfg.name = "bnwiki"
cfg.batch.num_workers = 128
cfg.dataset.train_file = data_file
cfg.training.use_sample = False
cfg.model.model_type = "bpe"
cfg.model.vocab_size = 30000
cfg.auto = "train"

eKonf.save_data(bnwiki_filtered, data_file, cfg.path.data_dir)
INFO:ekorpkit.io.file:Processing [1] files from ['wiki_filtered.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/projects/ekorpkit-book/data/wiki_filtered.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/wiki_filtered.parquet
INFO:root:compose config with overrides: ['tokenizer=trainer']
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/data/bnwiki_filtered.parquet
time: 16.5 s (started: 2022-11-25 02:14:06 +00:00)

BPE Model by Hugging Face#

ekorpkit \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer \
    tokenizer.name=bnwiki \
    tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
    tokenizer.model.model_type=bpe \
    tokenizer.model.vocab_size=30000 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

Unigram Model by Hugging Face#

ekorpkit \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer \
    tokenizer.name=bnwiki \
    tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
    tokenizer.model.model_type=unigram \
    tokenizer.model.vocab_size=30000 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

BPE Model by SentencePiece#

ekorpkit \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer.spm \
    tokenizer.name=bnwiki \
    tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
    tokenizer.model.model_type=bpe \
    tokenizer.model.vocab_size=30000 \
    tokenizer.model.character_coverage=0.9995 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

Unigram Model by SentencePiece#

ekorpkit \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer.spm \
    tokenizer.name=bnwiki \
    tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
    tokenizer.model.model_type=unigram \
    tokenizer.model.vocab_size=30000 \
    tokenizer.model.character_coverage=0.9995 \
    tokenizer.training.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train

Compare the Tokenizers#

from ekorpkit.tokenizers.trainer import TokenizerTrainer, compare_tokens


def load_tokenizer(trainer_type, model_type, name="kowiki"):
    cfg_group = "tokenizer=trainer"
    if trainer_type == "spm":
        cfg_group += ".spm"
    cfg = eKonf.compose(cfg_group)
    cfg.name = name
    cfg.model.model_type = model_type
    trainer = TokenizerTrainer(**cfg)
    
    return trainer.tokenizer_obj
time: 16.2 ms (started: 2022-11-29 10:48:09 +00:00)
tokenizers = {}

for trainer_type in ['spm', 'huggingface']:
    for model_type in ['bpe', 'unigram']:
        name = f"{model_type}_{trainer_type}"
        print(name)
        tokenizers[name] = load_tokenizer(trainer_type, model_type, "bnwiki")
INFO:root:compose config with overrides: ['tokenizer=trainer.spm']
bpe_spm
INFO:ekorpkit.base:No method defined to call
INFO:root:compose config with overrides: ['tokenizer=trainer.spm']
unigram_spm
INFO:ekorpkit.base:No method defined to call
INFO:root:compose config with overrides: ['tokenizer=trainer']
bpe_huggingface
INFO:ekorpkit.base:No method defined to call
INFO:root:compose config with overrides: ['tokenizer=trainer']
unigram_huggingface
INFO:ekorpkit.base:No method defined to call
time: 2.36 s (started: 2022-11-29 10:48:14 +00:00)
cfg_group = "tokenizer=trainer"
cfg = eKonf.compose(cfg_group)
cfg.name = "bnwiki"
trainer = TokenizerTrainer(**cfg)

sentences = []
files = trainer.input_files
for file in files:
    with open(file, "r") as f:
        sentences.extend(f.readlines())

print("number of sentences:", len(sentences))
print(sentences[0].strip())
INFO:root:compose config with overrides: ['tokenizer=trainer']
INFO:ekorpkit.base:No method defined to call
number of sentences: 357833
জোড্ডা পূর্ব বাংলাদেশের কুমিল্লা জেলার অন্তর্গত নাঙ্গলকোট উপজেলার একটি ইউনিয়ন।
time: 980 ms (started: 2022-11-29 10:48:22 +00:00)
compare_tokens(tokenizers, sentences)
Text: মোহাম্মদ রফিকুল ইসলাম ২০২০ সালের ১ সেপ্টেম্বর গাজীপুরে অবস্থিত আন্তর্জাতিক প্রযুক্তি বিশ্ববিদ্যালয়, ইসলামিক ইউনিভার্সিটি অব টেকনোলজির (আইইউটি) উপাচার্য হিসেবে যোগদান করেন।
bpe_spm unigram_spm bpe_huggingface unigram_huggingface
0 ▁মোহাম্মদ ▁মোহাম্মদ মহমমদ
1 ▁রফিকুল ▁রফিকুল রফকল মহমমদ
2 ▁ইসলাম ▁ইসলাম ইসলম
3 ▁২০২০ ▁২০২০ ২০২০ রফকল
4 ▁সালের ▁সালের সলর
5 ▁১ ▁১ ইসলম
6 ▁সেপ্টেম্বর ▁সেপ্টেম্বর সপটমবর
7 ▁গাজীপুর ▁গাজীপুর গজপর ২০২০
8 অবসথত
9 ▁অবস্থিত ▁অবস্থিত আনতরজতক সলর
10 ▁আন্তর্জাতিক ▁আন্তর্জাতিক পরযকত
11 ▁প্রযুক্তি ▁প্রযুক্তি বশববদযলয
12 ▁বিশ্ববিদ্যালয় ▁বিশ্ববিদ্যালয় ,
13 , , ইসলমক সপটমবর
14 ▁ইসলামিক ▁ইসলামিক ইউনভরসট
15 ▁ইউনিভার্সিটি ▁ইউনিভার্সিটি অব গজপর
16 ▁অব ▁অব টকনলজর
17 ▁টেকনোলজির ▁টেকনোলজির ( অবসথত
18 ▁( ▁( আই
19 আই আইইউ ইউট আনতরজতক
20 ইউ টি )
21 টি ) উপচরয পরযকত
22 ) ▁উপাচার্য হসব
23 ▁উপাচার্য ▁হিসেবে যগদন বশববদযলয
24 ▁হিসেবে ▁যোগদান করন ,
25 ▁যোগদান ▁করেন
26 ▁করেন ইসলমক
27
28 ইউনভরসট
29
30 অব
31
32 টকনলজ
33
34
35 (
36 আইইউ
37
38 )
39
40 উপচরয
41
42 হসব
43
44 যগদন
45
46 করন
47
time: 25 ms (started: 2022-11-29 10:48:30 +00:00)

Usage of Tokenizer Trainer#

%config InlineBackend.figure_format='retina'
%load_ext autotime
%load_ext autoreload
%autoreload 2

from ekorpkit import eKonf

eKonf.setLogger("INFO")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
project_dir = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
print("project_dir:", project_dir)
INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Overwriting EKORPKIT_WORKSPACE_ROOT=/workspace with /content/drive/MyDrive/workspace
INFO:ekorpkit.base:Setting EKORPKIT_WORKSPACE_ROOT to /content/drive/MyDrive/workspace
INFO:ekorpkit.base:Overwriting EKORPKIT_PROJECT=ekorpkit-book with ekorpkit-book
INFO:ekorpkit.base:Setting EKORPKIT_PROJECT to ekorpkit-book
INFO:ekorpkit.base:Overwriting EKORPKIT_PROJECT_DIR=/workspace/projects/ekorpkit-book with /content/drive/MyDrive/workspace/projects/ekorpkit-book
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
version: 0.1.40.post0.dev37
is colab? False
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
time: 964 ms (started: 2022-11-30 06:49:42 +00:00)
data_file ="data/enko_filtered.parquet"

cfg = eKonf.compose("tokenizer=trainer")
cfg.name = "enkowiki"
cfg.batch.num_workers = 128
cfg.dataset.train_file = data_file
cfg._train_.use_sample = False
cfg.model.model_type = "bpe"
cfg.model.vocab_size = 30000
# cfg.auto = "train"

print(cfg.path.data_dir)
# eKonf.copy(f"{project_dir}/{data_file}", cfg.path.data_dir)
INFO:root:compose config with overrides: ['tokenizer=trainer']
/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/data
time: 506 ms (started: 2022-11-30 06:49:43 +00:00)
from ekorpkit.tokenizers.trainer import TokenizerTrainer

cfg.model.model_type = "unigram"
trainer = TokenizerTrainer(**cfg)
INFO:root:compose config with overrides: ['tokenizer=trainer']
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
INFO:ekorpkit.base:Setting WANDB_PROJECT=ekorpkit-book
INFO:ekorpkit.config:Using existing path: /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling
INFO:ekorpkit.base:No method defined to call
time: 2.06 s (started: 2022-11-30 06:49:45 +00:00)
trainer.tokenize(text="test")
['▁', 'test']
time: 64.5 ms (started: 2022-11-30 06:49:48 +00:00)
trainer.save_config()
INFO:ekorpkit.config:Saving config to /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/enkowiki/configs/enkowiki(11)_config.yaml
INFO:ekorpkit.config:Saving config to /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/enkowiki/configs/enkowiki(11)_config.json
'enkowiki(11)_config.yaml'
time: 267 ms (started: 2022-11-30 06:39:45 +00:00)
ekorpkit print_config=false \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=tokenizer \
    tokenizer=trainer \
    tokenizer.name=enkowiki \
    tokenizer.dataset.train_file=data/enko_filtered.parquet \
    tokenizer.model.model_type=unigram \
    tokenizer.model.vocab_size=30000 \
    tokenizer._train_.use_sample=false \
    tokenizer.batch.num_workers=128 \
    tokenizer.auto=train