Lab 3: Training Tokenizers#
Prepare the environment#
%pip install --pre ekorpkit[tokenize]
from ekorpkit import eKonf
eKonf.setLogger("INFO")
print("version:", eKonf.__version__)
is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
ws = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
project_dir = str(ws.project_dir)  # later cells build paths from this as a plain string
print("project_dir:", project_dir)
ws.envs.dict()
INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Set environment variable EKORPKIT_WORKSPACE_ROOT=/content/drive/MyDrive/workspace
INFO:ekorpkit.base:Set environment variable EKORPKIT_PROJECT_DIR=/content/drive/MyDrive/workspace/projects/ekorpkit-book
version: 0.1.40.post0.dev51
is colab? False
INFO:root:compose config with overrides: ['+project=default']
INFO:ekorpkit.base:There are no arguments to initilize a config, using default config.
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
{'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
'EKORPKIT_WORKSPACE_ROOT': '/content/drive/MyDrive/workspace',
'EKORPKIT_PROJECT': 'ekorpkit-book',
'EKORPKIT_PROJECT_DIR': '/content/drive/MyDrive/workspace/projects/ekorpkit-book',
'EKORPKIT_DATA_DIR': None,
'EKORPKIT_LOG_LEVEL': 'INFO',
'NUM_WORKERS': 230,
'KMP_DUPLICATE_LIB_OK': 'TRUE',
'CUDA_DEVICE_ORDER': None,
'CUDA_VISIBLE_DEVICES': None,
'WANDB_PROJECT': None,
'WANDB_DISABLED': None,
'LABEL_STUDIO_SERVER': 'http://ekorpkit-labelstudio:8080',
'CACHED_PATH_CACHE_ROOT': None}
time: 936 ms (started: 2022-12-13 04:56:08 +00:00)
Load the saved corpora#
data = eKonf.load_data("enko_filtered.parquet", ws.project_dir / "data")
data.head()
INFO:ekorpkit.io.file:Processing [1] files from ['enko_filtered.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/projects/ekorpkit-book/data/enko_filtered.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/enko_filtered.parquet
| | id | text | split | filename | corpus | num_chars | num_words | num_sents | avg_num_chars | avg_num_words |
---|---|---|---|---|---|---|---|---|---|---
1 | 7644961 | Anaissini is a tribe of click beetles in the f... | train | wiki_49 | enwiki_sampled | 63 | 11 | 1 | 5.727273 | 11.000000 |
2 | 6658552 | The Vicky Metcalf Award for Literature for You... | train | wiki_24 | enwiki_sampled | 479 | 82 | 5 | 5.841463 | 16.400000 |
4 | 11081255 | Eylex Films Pvt is a chain of multiplex and si... | train | wiki_94 | enwiki_sampled | 1161 | 181 | 12 | 6.414365 | 15.083333 |
8 | 4706486 | Željko Zečević (; born 21 October 1963) is a S... | train | wiki_02 | enwiki_sampled | 1151 | 201 | 15 | 5.726368 | 13.400000 |
12 | 2170359 | Gilberto Nascimento Silva (born 9 June 1956) i... | train | wiki_57 | enwiki_sampled | 685 | 105 | 9 | 6.523810 | 11.666667 |
time: 3.17 s (started: 2022-12-13 00:32:31 +00:00)
text_column = "text"
text_en = (
data[data.corpus == "enwiki_sampled"][text_column].sample(1).values[0].split("\n")
)
text_ko = data[data.corpus == "kowiki"][text_column].sample(1).values[0].split("\n")
print(text_en)
print(text_ko)
['The Rijeka Thermal Power Station (, TE Rijeka, also known as "TE Urinj") is an oil-fired power station east of Rijeka at Kostrena, Croatia. It was built between 1974 and 1978 and it has one generation unit with capacity of 320\xa0MW. The height of the boiler house including its rooftop flue gas stack is .', 'Turbine for the power station was supplied by Ansaldo Energia. Ansaldo Energia was also awarded engineering, procurement and construction contract. Boilers were supplied by Waagner-Biro.', 'The power station is owned and operated by Hrvatska elektroprivreda. Its annual production varies, averaging 917 GWh, but only 141 GWh in 2011. It is expected to undergo decommissioning in 2020, but it is doubtful that it will remain operational until then because of its pollution problem. , Rijeka Thermal Power Station is offline, ready to resume generation within 160 hours of notice. On 18 October 2022, it was unofficially reported that HEP plans to restart the operation of the power plant in order to cover the losses incurred during the energy crisis.']
['런던의 2012년 하계 올림픽 펜싱 여자 플뢰레 단체전은 8월 2일에 엑셀 박람회관에서 진행되었다.', '토너먼트 형식.', '9팀이 여자 플뢰레 단체전에 참가하였다. 본선에 출전하는 팀들은 FIE 랭킹에 따라 대진이 정해졌다. 영국은 개최국 자격으로 모든 종목을 선택 여부에 따라 참전할 수 있는 자격이 주어졌다. 영국은 이 토너먼트에 참가하여 첫경기에서 9번 시드의 이집트를 상대하여 승리하며 8강전에서 나머지 7개의 참가팀들과 만났다. 8강전에서 패한 팀들도 두 경기를 더 치러 5위에서 8위까지의 순위를 결정한다. 반대로, 8강전에서 승리한 팀들은 준결승전에서 서로와 맞붙는다. 준결승전에서 승리한 두 팀은 금메달 결정전으로, 패한 팀들은 동메달 결정전으로 이동한다.', '단체전은 먼저 45투셰를 기록하거나, 정규시간이 지나고 더 많은 투셰를 기록한 팀이 승리한다.']
time: 145 ms (started: 2022-11-14 02:09:16 +00:00)
Convert the pandas DataFrame to a Hugging Face Dataset#
from datasets import Dataset
raw_dataset = Dataset.from_pandas(data[[text_column]])
raw_dataset
Dataset({
features: ['text', '__index_level_0__'],
num_rows: 603719
})
time: 1.22 s (started: 2022-11-14 02:09:19 +00:00)
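The extra `__index_level_0__` feature is just the pandas index carried over by `Dataset.from_pandas`. It is harmless here, but if you want to keep only the text column, the `preserve_index=False` argument drops it (a side note, not required for the rest of the lab):

```python
from datasets import Dataset

# Build the dataset without carrying over the pandas index column.
raw_dataset = Dataset.from_pandas(data[[text_column]], preserve_index=False)
```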
Shuffle the dataset#
# shuffle the dataset
raw_dataset = raw_dataset.shuffle(seed=42)
time: 116 ms (started: 2022-11-14 02:09:21 +00:00)
Split the dataset into sentences for training#
The sentencepiece module comes with a Python training API that expects its training data in a text file, one sentence per line. We will use the `sent_tokenize` function from the `nltk` package to split the text into sentences. `sent_tokenize` is a wrapper around the `punkt` tokenizer, a pre-trained sentence tokenizer. The English `punkt` model is trained on the Penn Treebank corpus, a collection of Wall Street Journal articles, so it is a good choice for plain English text but may not be the best choice for other languages.
import nltk
from nltk.tokenize import sent_tokenize
from ekorpkit.tokenizers.trainers.spm import export_sentence_chunk_files
nltk.download("punkt")
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
time: 1.13 s (started: 2022-11-14 02:09:23 +00:00)
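With `punkt` downloaded, a quick check of `sent_tokenize` on the English paragraph sampled earlier shows the one-sentence-per-line format we want for training (this assumes `text_en` from the cell above is still in scope):

```python
from nltk.tokenize import sent_tokenize

# Split the first sampled English paragraph into sentences,
# printing one sentence per line.
for sent in sent_tokenize(text_en[0]):
    print(sent)
```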
output_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"
export_sentence_chunk_files(
raw_dataset,
output_dir=output_dir,
chunk_size=10000,
text_column=text_column,
sent_tokenize=sent_tokenize,
)
INFO:ekorpkit.tokenizers.trainers.spm:Writing sentence chunks to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_chunk
time: 1min 29s (started: 2022-11-14 02:09:26 +00:00)
Sample sentences and combine them into a single file#
If your dataset is too large, you can sample a subset of the sentence chunk files for training. The `sample` function from Python's `random` module can be used to pick a random subset of the files. The `sample_and_combine` function samples a subset of sentence files and combines them into a single file, as sketched below.
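As a rough illustration, such a sampling step could be written with `random.sample` along the following lines (a minimal sketch, assuming plain-text chunk files like those written above; the actual `sample_and_combine` in ekorpkit may differ in its details):

```python
import os
import random

def sample_and_combine_sketch(input_dir, output_dir, sample_size, seed=42):
    """Sample chunk files and concatenate them into one training file."""
    # Pick a random subset of the sentence-chunk files.
    files = sorted(f for f in os.listdir(input_dir) if f.endswith(".txt"))
    random.seed(seed)
    sampled = random.sample(files, min(sample_size, len(files)))

    # Concatenate the sampled chunks into a single text file.
    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, "sampled_sentences.txt")
    with open(output_file, "w", encoding="utf-8") as out:
        for name in sampled:
            with open(os.path.join(input_dir, name), encoding="utf-8") as f:
                out.write(f.read())
    return output_file
```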
from ekorpkit.tokenizers.trainers.spm import sample_and_combine
input_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"
output_dir = project_dir + "/data/tokenizers/enko_filtered_samples"
sampled_file = sample_and_combine(
input_dir=input_dir, output_dir=output_dir, sample_size=50
)
INFO:ekorpkit.tokenizers.trainers.spm:sampled files: ['sent_chunk_0009.txt', 'sent_chunk_0019.txt', 'sent_chunk_0060.txt', 'sent_chunk_0038.txt', 'sent_chunk_0000.txt', 'sent_chunk_0042.txt', 'sent_chunk_0025.txt', 'sent_chunk_0053.txt', 'sent_chunk_0035.txt', 'sent_chunk_0033.txt', 'sent_chunk_0008.txt', 'sent_chunk_0023.txt', 'sent_chunk_0004.txt', 'sent_chunk_0024.txt', 'sent_chunk_0013.txt', 'sent_chunk_0003.txt', 'sent_chunk_0017.txt', 'sent_chunk_0051.txt', 'sent_chunk_0027.txt', 'sent_chunk_0058.txt', 'sent_chunk_0012.txt', 'sent_chunk_0029.txt', 'sent_chunk_0015.txt', 'sent_chunk_0044.txt', 'sent_chunk_0057.txt', 'sent_chunk_0020.txt', 'sent_chunk_0052.txt', 'sent_chunk_0059.txt', 'sent_chunk_0005.txt', 'sent_chunk_0011.txt', 'sent_chunk_0031.txt', 'sent_chunk_0030.txt', 'sent_chunk_0001.txt', 'sent_chunk_0056.txt', 'sent_chunk_0047.txt', 'sent_chunk_0055.txt', 'sent_chunk_0007.txt', 'sent_chunk_0032.txt', 'sent_chunk_0018.txt', 'sent_chunk_0014.txt', 'sent_chunk_0016.txt', 'sent_chunk_0054.txt', 'sent_chunk_0028.txt', 'sent_chunk_0021.txt', 'sent_chunk_0006.txt', 'sent_chunk_0040.txt', 'sent_chunk_0049.txt', 'sent_chunk_0043.txt', 'sent_chunk_0037.txt', 'sent_chunk_0022.txt']
INFO:ekorpkit.tokenizers.trainers.spm:number of lines sampled: 2,998,856
INFO:ekorpkit.tokenizers.trainers.spm:saved sampled sentences to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_samples/sampled_sentences.txt
time: 20.9 s (started: 2022-11-13 05:07:06 +00:00)
Train tokenizers with Hugging Face Tokenizers#
Hugging Face's Tokenizers library provides a wide range of tokenizer components, including BPE, WordPiece, Unigram, WordLevel, SentencePiece-style, and byte-level tokenizers. We will train BPE and Unigram tokenizers in this lab.
Import the libraries and prepare functions#
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram, WordLevel
from tokenizers.trainers import BpeTrainer, UnigramTrainer, WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
from ekorpkit.tokenizers.trainers.spm import batch_chunks
unk_token = "<UNK>" # token for unknown words
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>", "[MASK]"] # special tokens
def prepare_tokenizer_trainer(algo):
    """
    Prepares the tokenizer and trainer with unknown & special tokens.
    """
    if algo == "BPE":
        tokenizer = Tokenizer(BPE(unk_token=unk_token))
        trainer = BpeTrainer(special_tokens=spl_tokens)
    elif algo == "UNI":
        tokenizer = Tokenizer(Unigram())
        trainer = UnigramTrainer(unk_token=unk_token, special_tokens=spl_tokens)
    else:
        tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
        trainer = WordLevelTrainer(special_tokens=spl_tokens)
    normalizer = normalizers.Sequence([NFD(), StripAccents()])
    tokenizer.normalizer = normalizer
    tokenizer.pre_tokenizer = Whitespace()
    return tokenizer, trainer
time: 2.35 ms (started: 2022-11-12 06:51:08 +00:00)
def train_tokenizer(algo="BPE"):
    """
    Trains a tokenizer of the given type on batches of the raw dataset,
    saves it, and returns the trained tokenizer.
    """
    save_path = f"{project_dir}/tokenizers/{algo}_tokenizer.json"
    tokenizer, trainer = prepare_tokenizer_trainer(algo)
    tokenizer.train_from_iterator(
        batch_chunks(raw_dataset, batch_size=1000, text_column=text_column),
        trainer=trainer,
    )
    tokenizer.save(save_path)
    tokenizer = Tokenizer.from_file(save_path)
    return tokenizer
time: 20.2 ms (started: 2022-11-13 03:09:51 +00:00)
Train BPE tokenizer#
model_path = train_tokenizer("BPE")
time: 15h 30min 58s (started: 2022-11-11 06:36:01 +00:00)
Training the BPE tokenizer took 15 hours and 30 minutes for the 2,998,856 sentences. The tokenizer was saved in the `{project_dir}/tokenizers` directory.
To train more efficiently with multiple processors, it is preferable to use the CLI (command line interface) tools.
ekorpkit \
project.name=ekorpkit-book \
dir.workspace=/content/drive/MyDrive/workspace \
run=tokenizer \
tokenizer=trainer \
tokenizer.name=enkowiki \
tokenizer.dataset.train_file=enko_filtered.parquet \
tokenizer.model.model_type=bpe \
tokenizer.model.vocab_size=30000 \
tokenizer.model.character_coverage=0.9995 \
tokenizer.training.use_sample=false \
tokenizer.auto=train
took 22m 36.6s
With 256 processors, training the tokenizer on the 2,998,856 sentences took 22 minutes and 37 seconds.
from tokenizers import Tokenizer
tokenizer_path = (
f"{project_dir}/tokenizers/hf/enko_wiki/enko_wiki_bpe_huggingface_vocab_30000.json"
)
bpe_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {bpe_tokenizer.get_vocab_size()}")
print(bpe_tokenizer.encode(text_en[0]).tokens)
print(bpe_tokenizer.encode(text_ko[0]).tokens)
Vocab size: 30000
['This', 'article', 'describ', 'es', 'the', 'process', 'by', 'which', 'the', 'ter', 'rit', 'or', 'ial', 'ext', 'ent', 'of', 'Moroc', 'co', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
['크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
time: 67.4 ms (started: 2022-11-13 08:47:47 +00:00)
Train Unigram tokenizer#
model_path = train_tokenizer("UNI")
For a very large corpus, training a Unigram tokenizer with the Python API may take a long time. It is recommended to use the following CLI command instead.
ekorpkit \
project.name=ekorpkit-book \
dir.workspace=/content/drive/MyDrive/workspace \
run=tokenizer \
tokenizer=trainer \
tokenizer.name=enkowiki \
tokenizer.dataset.train_file=enko_filtered.parquet \
tokenizer.model.model_type=unigram \
tokenizer.model.vocab_size=30000 \
tokenizer.training.use_sample=false \
tokenizer.batch.num_workers=128 \
tokenizer.auto=train
took 10m 10.7s
With 256 processors, training the tokenizer on the 2,998,856 sentences took 10 minutes and 11 seconds.
from tokenizers import Tokenizer
tokenizer_path = (
f"{project_dir}/tokenizers/hf/enko_wiki/enko_wiki_unigram_huggingface_vocab_30000.json"
)
unigram_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {bpe_tokenizer.get_vocab_size()}")
print(unigram_tokenizer.encode(text_en[0]).tokens)
print(unigram_tokenizer.encode(text_ko[0]).tokens)
Vocab size: 30000
['This', 'article', 'describe', 's', 'the', 'process', 'by', 'which', 'the', 'territor', 'ial', 'ex', 't', 'ent', 'of', 'Morocc', 'o', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
['크리', '클레이', '드', '(', 'C', 'rick', 'la', 'de', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구', '이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
time: 82 ms (started: 2022-11-13 08:47:51 +00:00)
Train tokenizers with Google SentencePiece (SPM)#
Install SentencePiece#
pip install sentencepiece
Train SentencePiece models#
You can use the `train_spm` function to train a SentencePiece model. The `train_spm` function takes the following arguments (a rough sketch of how they map onto the SentencePiece training API follows the list):

- `model_prefix`: The prefix of the model file. The model file will be saved as `{model_prefix}_{model_type}_vocab_{vocab_size}.model`.
- `input`: The input file for training.
- `output_dir`: The directory to save the model file.
- `vocab_size`: The vocabulary size.
- `model_type`: The model type. It can be `unigram` (default), `bpe`, `char`, or `word`.
- `character_coverage`: The character coverage. It is only used for the `unigram` and `bpe` model types. The default value is `1.0`.
- `num_threads`: The number of threads to use for training. The default value is `1`. The maximum value is `128`.
- `train_extremely_large_corpus`: Whether to train on an extremely large corpus. The default value is `False`.
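Under the hood, these options map onto Google SentencePiece's own training API. The following is a minimal sketch of what such a helper could look like using `spm.SentencePieceTrainer.train` (an illustration only; the actual `train_spm` implementation in ekorpkit may differ):

```python
import os
import sentencepiece as spm

def train_spm_sketch(
    model_prefix,
    input,
    output_dir,
    vocab_size=30000,
    model_type="unigram",
    character_coverage=1.0,
    num_threads=1,
    train_extremely_large_corpus=False,
):
    """Train a SentencePiece model and return the path to the .model file."""
    os.makedirs(output_dir, exist_ok=True)
    prefix = os.path.join(output_dir, f"{model_prefix}_{model_type}_vocab_{vocab_size}")
    spm.SentencePieceTrainer.train(
        input=input,                 # text file with one sentence per line
        model_prefix=prefix,         # writes {prefix}.model and {prefix}.vocab
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=character_coverage,
        num_threads=num_threads,
        train_extremely_large_corpus=train_extremely_large_corpus,
    )
    return prefix + ".model"
```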
Train Unigram model#
from ekorpkit.tokenizers.trainers.spm import train_spm
uni_model_path = train_spm(
model_prefix="enko_wiki",
input=sampled_file,
output_dir=project_dir + "/tokenizers/spm",
model_type="unigram",
vocab_size=30000,
character_coverage=0.9995,
num_threads=128,
)
time: 23min 6s (started: 2022-11-12 08:03:08 +00:00)
It took 23 minutes to train a unigram model with a vocabulary size of 30,000. The model file was saved in the `{project_dir}/tokenizers/spm` directory.
ekorpkit \
project.name=ekorpkit-book \
dir.workspace=/content/drive/MyDrive/workspace \
run=tokenizer \
tokenizer=trainer.spm \
tokenizer.name=enkowiki \
tokenizer.dataset.train_file=enko_filtered.parquet \
tokenizer.model.model_type=unigram \
tokenizer.model.vocab_size=30000 \
tokenizer.model.character_coverage=0.9995 \
tokenizer.training.use_sample=false \
tokenizer.batch.num_workers=128 \
tokenizer.auto=train
took 4m 2.4s
With 128 processors, training the tokenizer on the 2,998,856 sentences took 4 minutes and 2 seconds.
Load the trained model#
import sentencepiece as spm
model_file = "tokenizers/spm/enko_wiki_unigram_vocab_30000.model"
model_file = project_dir + "/" + model_file
uni_spm = spm.SentencePieceProcessor(model_file=model_file)
print(f"Vocab size: {uni_spm.vocab_size()}")
print(uni_spm.encode(text_en[0], out_type=str))
print(uni_spm.encode(text_ko[0], out_type=str))
Vocab size: 30000
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁', 'ter', 'ri', 'torial', '▁ex', 't', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
['▁크리', '클', '레이드', '(', 'C', 'rick', 'lade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁', '템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구', '이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
time: 67.4 ms (started: 2022-11-13 08:48:00 +00:00)
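Since SentencePiece encodes whitespace explicitly (the ▁ marker), you can also round-trip through integer ids and decode back to text, up to the model's default normalization (a quick, optional check using the `uni_spm` processor loaded above):

```python
# Encode to integer ids, then decode back to a string.
ids = uni_spm.encode(text_ko[0], out_type=int)
print(ids[:10])
print(uni_spm.decode(ids))
```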
Train BPE model#
ekorpkit \
project.name=ekorpkit-book \
dir.workspace=/content/drive/MyDrive/workspace \
run=tokenizer \
tokenizer=trainer.spm \
tokenizer.name=enkowiki \
tokenizer.dataset.train_file=enko_filtered.parquet \
tokenizer.model.model_type=bpe \
tokenizer.model.vocab_size=30000 \
tokenizer.model.character_coverage=0.9995 \
tokenizer.training.use_sample=false \
tokenizer.batch.num_workers=128 \
tokenizer.auto=train
took 8m 41.8s
With 128 processors, training the tokenizer on the 2,998,856 sentences took 8 minutes and 42 seconds.
Load the trained model#
import sentencepiece as spm
model_file = "tokenizers/spm/enko_wiki_bpe_vocab_30000.model"
model_file = project_dir + "/" + model_file
bpe_spm = spm.SentencePieceProcessor(model_file=model_file)
print(f"Vocab size: {uni_spm.vocab_size()}")
print(bpe_spm.encode(text_en[0], out_type=str))
print(bpe_spm.encode(text_ko[0], out_type=str))
Vocab size: 30000
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁ter', 'rit', 'orial', '▁ext', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
['▁크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
time: 34.1 ms (started: 2022-11-13 08:48:11 +00:00)
Compare the Tokenizers#
Load the tokenizers#
tokenizers = {
"BPE": bpe_tokenizer,
"UNI": unigram_tokenizer,
"UNI_SPM": uni_spm,
"BPE_SPM": bpe_spm,
}
def tokenize(tokenizer, text):
    """
    Tokenizes the text using the tokenizer.
    """
    if isinstance(tokenizer, spm.SentencePieceProcessor):
        return tokenizer.encode(text, out_type=str)
    return tokenizer.encode(text).tokens
time: 20.6 ms (started: 2022-11-13 08:48:24 +00:00)
Analyze the output of the tokenizers#
texts = [text_en[0], text_ko[0]]
tokens = {name: [] for name in tokenizers.keys()}
# tokenize the texts with the tokenizers
for text in texts:
    for name, tokenizer in tokenizers.items():
        print(f"Tokenizer: {name}")
        tokens[name].append(tokenize(tokenizer, text))
        print(tokens[name][-1])
        print("-" * 50)
Tokenizer: BPE
['This', 'article', 'describ', 'es', 'the', 'process', 'by', 'which', 'the', 'ter', 'rit', 'or', 'ial', 'ext', 'ent', 'of', 'Moroc', 'co', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
--------------------------------------------------
Tokenizer: UNI
['This', 'article', 'describe', 's', 'the', 'process', 'by', 'which', 'the', 'territor', 'ial', 'ex', 't', 'ent', 'of', 'Morocc', 'o', 'came', 'to', 'be', 'as', 'it', 'is', 'now', '.']
--------------------------------------------------
Tokenizer: UNI_SPM
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁', 'ter', 'ri', 'torial', '▁ex', 't', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
--------------------------------------------------
Tokenizer: BPE_SPM
['▁This', '▁article', '▁describes', '▁the', '▁process', '▁by', '▁which', '▁the', '▁ter', 'rit', 'orial', '▁ext', 'ent', '▁of', '▁Morocco', '▁came', '▁to', '▁be', '▁as', '▁it', '▁is', '▁now', '.']
--------------------------------------------------
Tokenizer: BPE
['크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
--------------------------------------------------
Tokenizer: UNI
['크리', '클레이', '드', '(', 'C', 'rick', 'la', 'de', ')', '는', '잉글랜드의', '노스', '윌', '트', '셔', '에', '위치한', '템', '스', '강의', '타운', '이자', '지방', '행정', '구', '이다', '.', '스', '윈', '던', '과', '시', '런', '세', '스터', '의', '중간에', '위치해', '있다', '.']
--------------------------------------------------
Tokenizer: UNI_SPM
['▁크리', '클', '레이드', '(', 'C', 'rick', 'lade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁', '템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구', '이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
--------------------------------------------------
Tokenizer: BPE_SPM
['▁크리', '클', '레이드', '(', 'C', 'rick', 'l', 'ade', ')', '는', '▁잉글랜드의', '▁노스', '▁윌', '트', '셔', '에', '▁위치한', '▁템', '스', '▁강의', '▁타운', '이자', '▁지방', '▁행정', '구이다', '.', '▁스', '윈', '던', '과', '▁시', '런', '세', '스터', '의', '▁중간에', '▁위치해', '▁있다', '.']
--------------------------------------------------
time: 20.8 ms (started: 2022-11-13 08:48:25 +00:00)
Compare the Tokens#
import pandas as pd
def compare_tokens(tokenizers, tokens, sample_num=0):
    # Pad the shorter token lists with empty strings so that all
    # columns have the same length in the comparison DataFrame.
    max_len = max(len(tokens[name][sample_num]) for name in tokenizers.keys())
    diffs = {
        name: max_len - len(tokens[name][sample_num]) for name in tokenizers.keys()
    }
    padded_tokens = {
        name: tokens[name][sample_num] + [""] * diffs[name]
        for name in tokenizers.keys()
    }
    df = pd.DataFrame(padded_tokens)
    return df
time: 20.5 ms (started: 2022-11-13 08:48:30 +00:00)
compare_tokens(tokenizers, tokens, sample_num=0)
| | BPE | UNI | UNI_SPM | BPE_SPM |
---|---|---|---|---
0 | This | This | ▁This | ▁This |
1 | article | article | ▁article | ▁article |
2 | describ | describe | ▁describes | ▁describes |
3 | es | s | ▁the | ▁the |
4 | the | the | ▁process | ▁process |
5 | process | process | ▁by | ▁by |
6 | by | by | ▁which | ▁which |
7 | which | which | ▁the | ▁the |
8 | the | the | ▁ | ▁ter |
9 | ter | territor | ter | rit |
10 | rit | ial | ri | orial |
11 | or | ex | torial | ▁ext |
12 | ial | t | ▁ex | ent |
13 | ext | ent | t | ▁of |
14 | ent | of | ent | ▁Morocco |
15 | of | Morocc | ▁of | ▁came |
16 | Moroc | o | ▁Morocco | ▁to |
17 | co | came | ▁came | ▁be |
18 | came | to | ▁to | ▁as |
19 | to | be | ▁be | ▁it |
20 | be | as | ▁as | ▁is |
21 | as | it | ▁it | ▁now |
22 | it | is | ▁is | . |
23 | is | now | ▁now | |
24 | now | . | . | |
25 | . |
time: 26.6 ms (started: 2022-11-13 08:48:31 +00:00)
compare_tokens(tokenizers, tokens, sample_num=1)
| | BPE | UNI | UNI_SPM | BPE_SPM |
---|---|---|---|---
0 | 크리 | 크리 | ▁크리 | ▁크리 |
1 | 클 | 클레이 | 클 | 클 |
2 | 레이드 | 드 | 레이드 | 레이드 |
3 | ( | ( | ( | ( |
4 | C | C | C | C |
5 | rick | rick | rick | rick |
6 | l | la | lade | l |
7 | ade | de | ) | ade |
8 | ) | ) | 는 | ) |
9 | 는 | 는 | ▁잉글랜드의 | 는 |
10 | 잉글랜드의 | 잉글랜드의 | ▁노스 | ▁잉글랜드의 |
11 | 노스 | 노스 | ▁윌 | ▁노스 |
12 | 윌 | 윌 | 트 | ▁윌 |
13 | 트 | 트 | 셔 | 트 |
14 | 셔 | 셔 | 에 | 셔 |
15 | 에 | 에 | ▁위치한 | 에 |
16 | 위치한 | 위치한 | ▁ | ▁위치한 |
17 | 템 | 템 | 템 | ▁템 |
18 | 스 | 스 | 스 | 스 |
19 | 강의 | 강의 | ▁강의 | ▁강의 |
20 | 타운 | 타운 | ▁타운 | ▁타운 |
21 | 이자 | 이자 | 이자 | 이자 |
22 | 지방 | 지방 | ▁지방 | ▁지방 |
23 | 행정 | 행정 | ▁행정 | ▁행정 |
24 | 구이다 | 구 | 구 | 구이다 |
25 | . | 이다 | 이다 | . |
26 | 스 | . | . | ▁스 |
27 | 윈 | 스 | ▁스 | 윈 |
28 | 던 | 윈 | 윈 | 던 |
29 | 과 | 던 | 던 | 과 |
30 | 시 | 과 | 과 | ▁시 |
31 | 런 | 시 | ▁시 | 런 |
32 | 세 | 런 | 런 | 세 |
33 | 스터 | 세 | 세 | 스터 |
34 | 의 | 스터 | 스터 | 의 |
35 | 중간에 | 의 | 의 | ▁중간에 |
36 | 위치해 | 중간에 | ▁중간에 | ▁위치해 |
37 | 있다 | 위치해 | ▁위치해 | ▁있다 |
38 | . | 있다 | ▁있다 | . |
39 | . | . |
time: 26.1 ms (started: 2022-11-13 08:48:32 +00:00)
Usage of Tokenizer Trainer#
data = eKonf.load_data("wiki_filtered.parquet", project_dir + "/data")
bnwiki_filtered = data[data.corpus == "bnwiki"]
data_file = "bnwiki_filtered.parquet"
cfg = eKonf.compose("tokenizer=trainer")
cfg.name = "bnwiki"
cfg.batch.num_workers = 128
cfg.dataset.train_file = data_file
cfg.training.use_sample = False
cfg.model.model_type = "bpe"
cfg.model.vocab_size = 30000
cfg.auto = "train"
eKonf.save_data(bnwiki_filtered, data_file, cfg.path.data_dir)
INFO:ekorpkit.io.file:Processing [1] files from ['wiki_filtered.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/projects/ekorpkit-book/data/wiki_filtered.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/wiki_filtered.parquet
INFO:root:compose config with overrides: ['tokenizer=trainer']
INFO:ekorpkit.io.file:Saving dataframe to /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/data/bnwiki_filtered.parquet
time: 16.5 s (started: 2022-11-25 02:14:06 +00:00)
BPE Model by Hugging Face#
ekorpkit \
project_name=ekorpkit-book \
workspace_dir=/content/drive/MyDrive/workspace \
run=tokenizer \
tokenizer=trainer \
tokenizer.name=bnwiki \
tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
tokenizer.model.model_type=bpe \
tokenizer.model.vocab_size=30000 \
tokenizer.training.use_sample=false \
tokenizer.batch.num_workers=128 \
tokenizer.auto=train
Unigram Model by Hugging Face#
ekorpkit \
project_name=ekorpkit-book \
workspace_dir=/content/drive/MyDrive/workspace \
run=tokenizer \
tokenizer=trainer \
tokenizer.name=bnwiki \
tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
tokenizer.model.model_type=unigram \
tokenizer.model.vocab_size=30000 \
tokenizer.training.use_sample=false \
tokenizer.batch.num_workers=128 \
tokenizer.auto=train
BPE Model by SentencePiece#
ekorpkit \
project_name=ekorpkit-book \
workspace_dir=/content/drive/MyDrive/workspace \
run=tokenizer \
tokenizer=trainer.spm \
tokenizer.name=bnwiki \
tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
tokenizer.model.model_type=bpe \
tokenizer.model.vocab_size=30000 \
tokenizer.model.character_coverage=0.9995 \
tokenizer.training.use_sample=false \
tokenizer.batch.num_workers=128 \
tokenizer.auto=train
Unigram Model by SentencePiece#
ekorpkit \
project_name=ekorpkit-book \
workspace_dir=/content/drive/MyDrive/workspace \
run=tokenizer \
tokenizer=trainer.spm \
tokenizer.name=bnwiki \
tokenizer.dataset.train_file=data/bnwiki_filtered.parquet \
tokenizer.model.model_type=unigram \
tokenizer.model.vocab_size=30000 \
tokenizer.model.character_coverage=0.9995 \
tokenizer.training.use_sample=false \
tokenizer.batch.num_workers=128 \
tokenizer.auto=train
Compare the Tokenizers#
from ekorpkit.tokenizers.trainer import TokenizerTrainer, compare_tokens
def load_tokenizer(trainer_type, model_type, name="kowiki"):
    cfg_group = "tokenizer=trainer"
    if trainer_type == "spm":
        cfg_group += ".spm"
    cfg = eKonf.compose(cfg_group)
    cfg.name = name
    cfg.model.model_type = model_type
    trainer = TokenizerTrainer(**cfg)
    return trainer.tokenizer_obj
time: 16.2 ms (started: 2022-11-29 10:48:09 +00:00)
tokenizers = {}
for trainer_type in ["spm", "huggingface"]:
    for model_type in ["bpe", "unigram"]:
        name = f"{model_type}_{trainer_type}"
        print(name)
        tokenizers[name] = load_tokenizer(trainer_type, model_type, "bnwiki")
INFO:root:compose config with overrides: ['tokenizer=trainer.spm']
bpe_spm
INFO:ekorpkit.base:No method defined to call
INFO:root:compose config with overrides: ['tokenizer=trainer.spm']
unigram_spm
INFO:ekorpkit.base:No method defined to call
INFO:root:compose config with overrides: ['tokenizer=trainer']
bpe_huggingface
INFO:ekorpkit.base:No method defined to call
INFO:root:compose config with overrides: ['tokenizer=trainer']
unigram_huggingface
INFO:ekorpkit.base:No method defined to call
time: 2.36 s (started: 2022-11-29 10:48:14 +00:00)
cfg_group = "tokenizer=trainer"
cfg = eKonf.compose(cfg_group)
cfg.name = "bnwiki"
trainer = TokenizerTrainer(**cfg)
sentences = []
files = trainer.input_files
for file in files:
    with open(file, "r") as f:
        sentences.extend(f.readlines())
print("number of sentences:", len(sentences))
print(sentences[0].strip())
INFO:root:compose config with overrides: ['tokenizer=trainer']
INFO:ekorpkit.base:No method defined to call
number of sentences: 357833
জোড্ডা পূর্ব বাংলাদেশের কুমিল্লা জেলার অন্তর্গত নাঙ্গলকোট উপজেলার একটি ইউনিয়ন।
time: 980 ms (started: 2022-11-29 10:48:22 +00:00)
compare_tokens(tokenizers, sentences)
Text: মোহাম্মদ রফিকুল ইসলাম ২০২০ সালের ১ সেপ্টেম্বর গাজীপুরে অবস্থিত আন্তর্জাতিক প্রযুক্তি বিশ্ববিদ্যালয়, ইসলামিক ইউনিভার্সিটি অব টেকনোলজির (আইইউটি) উপাচার্য হিসেবে যোগদান করেন।
| | bpe_spm | unigram_spm | bpe_huggingface | unigram_huggingface |
---|---|---|---|---
0 | ▁মোহাম্মদ | ▁মোহাম্মদ | মহমমদ | ▁ |
1 | ▁রফিকুল | ▁রফিকুল | রফকল | মহমমদ |
2 | ▁ইসলাম | ▁ইসলাম | ইসলম | ▁ |
3 | ▁২০২০ | ▁২০২০ | ২০২০ | রফকল |
4 | ▁সালের | ▁সালের | সলর | ▁ |
5 | ▁১ | ▁১ | ১ | ইসলম |
6 | ▁সেপ্টেম্বর | ▁সেপ্টেম্বর | সপটমবর | ▁ |
7 | ▁গাজীপুর | ▁গাজীপুর | গজপর | ২০২০ |
8 | ে | ে | অবসথত | ▁ |
9 | ▁অবস্থিত | ▁অবস্থিত | আনতরজতক | সলর |
10 | ▁আন্তর্জাতিক | ▁আন্তর্জাতিক | পরযকত | ▁ |
11 | ▁প্রযুক্তি | ▁প্রযুক্তি | বশববদযলয | ১ |
12 | ▁বিশ্ববিদ্যালয় | ▁বিশ্ববিদ্যালয় | , | ▁ |
13 | , | , | ইসলমক | সপটমবর |
14 | ▁ইসলামিক | ▁ইসলামিক | ইউনভরসট | ▁ |
15 | ▁ইউনিভার্সিটি | ▁ইউনিভার্সিটি | অব | গজপর |
16 | ▁অব | ▁অব | টকনলজর | ▁ |
17 | ▁টেকনোলজির | ▁টেকনোলজির | ( | অবসথত |
18 | ▁( | ▁( | আই | ▁ |
19 | আই | আইইউ | ইউট | আনতরজতক |
20 | ইউ | টি | ) | ▁ |
21 | টি | ) | উপচরয | পরযকত |
22 | ) | ▁উপাচার্য | হসব | ▁ |
23 | ▁উপাচার্য | ▁হিসেবে | যগদন | বশববদযলয |
24 | ▁হিসেবে | ▁যোগদান | করন | , |
25 | ▁যোগদান | ▁করেন | । | ▁ |
26 | ▁করেন | । | ইসলমক | |
27 | । | ▁ | ||
28 | ইউনভরসট | |||
29 | ▁ | |||
30 | অব | |||
31 | ▁ | |||
32 | টকনলজ | |||
33 | র | |||
34 | ▁ | |||
35 | ( | |||
36 | আইইউ | |||
37 | ট | |||
38 | ) | |||
39 | ▁ | |||
40 | উপচরয | |||
41 | ▁ | |||
42 | হসব | |||
43 | ▁ | |||
44 | যগদন | |||
45 | ▁ | |||
46 | করন | |||
47 | । |
time: 25 ms (started: 2022-11-29 10:48:30 +00:00)
Usage of Tokenizer Trainer#
%config InlineBackend.figure_format='retina'
%load_ext autotime
%load_ext autoreload
%autoreload 2
from ekorpkit import eKonf
eKonf.setLogger("INFO")
print("version:", eKonf.__version__)
is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
project_dir = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
print("project_dir:", project_dir)
INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Overwriting EKORPKIT_WORKSPACE_ROOT=/workspace with /content/drive/MyDrive/workspace
INFO:ekorpkit.base:Setting EKORPKIT_WORKSPACE_ROOT to /content/drive/MyDrive/workspace
INFO:ekorpkit.base:Overwriting EKORPKIT_PROJECT=ekorpkit-book with ekorpkit-book
INFO:ekorpkit.base:Setting EKORPKIT_PROJECT to ekorpkit-book
INFO:ekorpkit.base:Overwriting EKORPKIT_PROJECT_DIR=/workspace/projects/ekorpkit-book with /content/drive/MyDrive/workspace/projects/ekorpkit-book
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
version: 0.1.40.post0.dev37
is colab? False
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
time: 964 ms (started: 2022-11-30 06:49:42 +00:00)
data_file ="data/enko_filtered.parquet"
cfg = eKonf.compose("tokenizer=trainer")
cfg.name = "enkowiki"
cfg.batch.num_workers = 128
cfg.dataset.train_file = data_file
cfg._train_.use_sample = False
cfg.model.model_type = "bpe"
cfg.model.vocab_size = 30000
# cfg.auto = "train"
print(cfg.path.data_dir)
# eKonf.copy(f"{project_dir}/{data_file}", cfg.path.data_dir)
INFO:root:compose config with overrides: ['tokenizer=trainer']
/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/data
time: 506 ms (started: 2022-11-30 06:49:43 +00:00)
from ekorpkit.tokenizers.trainer import TokenizerTrainer
cfg.model.model_type = "unigram"
trainer = TokenizerTrainer(**cfg)
INFO:root:compose config with overrides: ['tokenizer=trainer']
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
INFO:ekorpkit.base:Setting WANDB_PROJECT=ekorpkit-book
INFO:ekorpkit.config:Using existing path: /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling
INFO:ekorpkit.base:No method defined to call
time: 2.06 s (started: 2022-11-30 06:49:45 +00:00)
trainer.tokenize(text="test")
['▁', 'test']
time: 64.5 ms (started: 2022-11-30 06:49:48 +00:00)
trainer.save_config()
INFO:ekorpkit.config:Saving config to /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/enkowiki/configs/enkowiki(11)_config.yaml
INFO:ekorpkit.config:Saving config to /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/enkowiki/configs/enkowiki(11)_config.json
'enkowiki(11)_config.yaml'
time: 267 ms (started: 2022-11-30 06:39:45 +00:00)
ekorpkit print_config=false \
project_name=ekorpkit-book \
workspace_dir=/content/drive/MyDrive/workspace \
run=tokenizer \
tokenizer=trainer \
tokenizer.name=enkowiki \
tokenizer.dataset.train_file=data/enko_filtered.parquet \
tokenizer.model.model_type=unigram \
tokenizer.model.vocab_size=30000 \
tokenizer._train_.use_sample=false \
tokenizer.batch.num_workers=128 \
tokenizer.auto=train