Lab 4: Pretraining Language Models#

Now that we have our dataset and tokenizer, in this lab we will train a language model from scratch on a large corpus of text.

%pip install --pre ekorpkit[model]
from ekorpkit import eKonf

eKonf.setLogger("WARNING")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
ws = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
project_dir = str(ws.project_dir)
print("project_dir:", project_dir)
ws.envs.dict()
version: 0.1.40.post0.dev47
is colab? False
INFO:ekorpkit.base:There are no arguments to initilize a config, using default config.
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
{'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
 'EKORPKIT_WORKSPACE_ROOT': '/content/drive/MyDrive/workspace',
 'EKORPKIT_PROJECT': 'ekorpkit-book',
 'EKORPKIT_PROJECT_DIR': '/content/drive/MyDrive/workspace/projects/ekorpkit-book',
 'EKORPKIT_DATA_DIR': None,
 'EKORPKIT_LOG_LEVEL': 'WARNING',
 'NUM_WORKERS': 230,
 'KMP_DUPLICATE_LIB_OK': 'TRUE',
 'CUDA_DEVICE_ORDER': None,
 'CUDA_VISIBLE_DEVICES': None,
 'WANDB_PROJECT': None,
 'WANDB_DISABLED': None}
time: 500 ms (started: 2022-12-06 23:34:52 +00:00)
from huggingface_hub import notebook_login

notebook_login()
time: 70.7 ms (started: 2022-11-21 00:59:58 +00:00)
import os
from huggingface_hub import HfApi
from huggingface_hub import HfFolder

token = HfFolder.get_token()
if token is None:
    token = os.environ.get("HF_USER_ACCESS_TOKEN")

if token is None:
    raise ValueError("Please login to huggingface_hub")

user_id = HfApi().whoami(token)["name"]

print(f"user id '{user_id}' will be used during this lab")
user id 'entelecheia' will be used during this lab
time: 893 ms (started: 2022-11-21 01:06:46 +00:00)

Unicode Normalization#

One little thing to note is that we will need to normalize our text before training our language model. This is because the same character can be represented in different ways. For example, the character “é” can be represented as “e” followed by a combining accent character, or as a single character.
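
A quick check shows that the two representations of “é” are different strings in Python until they are normalized to the same form:

import unicodedata

composed = "\u00e9"      # "é" as a single precomposed character
decomposed = "e\u0301"   # "e" followed by a combining acute accent

print(composed == decomposed)                                # False, although both render as "é"
print(unicodedata.normalize("NFC", decomposed) == composed)  # True after normalization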

TL;DR#

Apply NFKC normalization to your text before training your language model.

Unicode Normalization Forms#

There are four normalization forms:

  • NFC: Normalization Form Canonical Composition

  • NFD: Normalization Form Canonical Decomposition

  • NFKC: Normalization Form Compatibility Composition

  • NFKD: Normalization Form Compatibility Decomposition

In these names, “C” stands for Composition, “D” for Decomposition, and “K” for Compatibility. The canonical forms (NFC and NFD) are the most commonly used. The compatibility forms (NFKC and NFKD) are used when you need to convert characters to their compatibility representation; for example, they convert the single-character ligature “ﬁ” into the two characters “fi”.

There are two main differences among the forms:

  • Whether the length can change: the composed forms (NFC and NFKC) produce a string of the same length or shorter, while the decomposed forms (NFD and NFKD) may produce a longer string.

  • Whether the characters can change: the canonical forms (NFC and NFD) only re-encode the text into canonically equivalent sequences, so it still reads the same, while the compatibility forms (NFKC and NFKD) may replace characters with their compatibility equivalents, so the resulting text can differ from the original.
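
The ﬁ ligature illustrates both points: the canonical forms leave it untouched, while the compatibility forms expand it into the two letters “fi”, changing both the length and the content of the string.

import unicodedata

ligature = "\ufb01"  # the single-character ligature "ﬁ"
for form in ["NFC", "NFD", "NFKC", "NFKD"]:
    ntext = unicodedata.normalize(form, ligature)
    print(f"{form}: {ntext}, {len(ntext)}")
# NFC and NFD keep the ligature (length 1); NFKC and NFKD expand it to "fi" (length 2)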

Unicode Normalization in Python#

In Python, you can use the unicodedata module to normalize your text. The unicodedata.normalize function takes two arguments:

  • form: The normalization form to use. This can be one of the following: NFC, NFD, NFKC, NFKD.

  • unistr: The string to normalize.

import unicodedata

text = "abcABC123가나다…"
print(f"Original: {text}, {len(text)}")
for form in ["NFC", "NFD", "NFKC", "NFKD"]:
    ntext = unicodedata.normalize(form, text)
    print(f"{form}: {ntext}, {len(ntext)}")
Original: abcABC123가나다…, 13
NFC: abcABC123가나다…, 13
NFD: abcABC123가나다…, 16
NFKC: abcABC123가나다..., 15
NFKD: abcABC123가나다..., 18
time: 23.8 ms (started: 2022-11-19 10:16:57 +00:00)

BERT Pretraining#

In this lab, we will train a BERT-like model using masked-language modeling, one of the two pretraining tasks used in the original BERT paper.

What is BERT?#

BERT is a large-scale language model that was pre-trained on English Wikipedia and the BooksCorpus using a masked-language modeling objective. The model was then fine-tuned on a variety of downstream tasks, including question answering, natural language inference, and sentiment analysis. BERT was the first large-scale language model to be pre-trained with a deep bidirectional architecture, and it outperformed previous language models on a variety of tasks.

BERT was originally pre-trained for 1 million steps with a global batch size of 256 sequences.

“We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.”
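
A quick back-of-the-envelope check of the numbers in the quote (treating words and tokens interchangeably, as the quote does):

tokens_per_step = 256 * 512                 # 131,072 tokens, the "128,000" (128 * 1024) in the quote
total_tokens = 1_000_000 * tokens_per_step  # tokens seen over 1M steps
print(total_tokens / 3.3e9)                 # ~39.7 passes over the 3.3-billion-word corpus, i.e. ~40 epochs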

For more information, see the lecture notes on BERT.

Masked-Language Modeling (MLM)#

Masked-language modeling is a pretraining task where we mask some of the input tokens and train the model to predict the original value of the masked tokens. For example, if we have the sentence “The dog ate the apple”, we can mask the word “ate” and train the model to predict the original value of the masked token. The model will then learn to predict the original value of the masked tokens based on the context of the sentence.

Example:

Input: “The dog [MASK] the apple”
Target for the masked position: “ate”
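
Below is a minimal sketch of the masking step. The actual procedure used by BERT (and by the Hugging Face data collator we set up later) additionally replaces some selected positions with random tokens or leaves them unchanged (the 80/10/10 rule), but the core idea is the same:

import random

tokens = ["The", "dog", "ate", "the", "apple"]
mlm_probability = 0.15

masked = list(tokens)
labels = [None] * len(tokens)  # the model is only asked to predict the selected positions
for i, token in enumerate(tokens):
    if random.random() < mlm_probability:
        masked[i] = "[MASK]"
        labels[i] = token      # the original token becomes the prediction target

print(masked)
print(labels)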

Preprocessing the Dataset#

Before training our language model, we need to preprocess our dataset. We will use our tokenizer to tokenize our dataset and then convert the tokens to their IDs. If we have a sentence that is longer than the maximum sequence length, we will truncate the sentence. If the sentence is shorter than the maximum sequence length, we will pad the sentence with the padding token.

Unlike the original BERT paper, we will not use the WordPiece tokenization algorithm. Instead, we will use the unigram tokenization algorithm.

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
from tokenizers.processors import BertProcessing

tokenizer_path = "tokenizers/enko_wiki/enko_wiki_unigram_huggingface_vocab_30000.json"
tokenizer_path = project_dir + "/" + tokenizer_path
context_length = 512

unigram_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {unigram_tokenizer.get_vocab_size()}")
unigram_tokenizer.post_processor = BertProcessing(
    ("</s>", unigram_tokenizer.token_to_id("</s>")),
    ("<s>", unigram_tokenizer.token_to_id("<s>")),
)

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=unigram_tokenizer,
    truncation=True,
    max_length=context_length,
    return_length=True,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="right",
)

print(f"is_fast: {tokenizer.is_fast}")
print(f"Vocab size: {tokenizer.vocab_size}")
print(tokenizer("Hello, my dog is cute"))
tokenizer.save_pretrained(project_dir + "/tokenizers/enko_wiki")
Vocab size: 30000
is_fast: True
Vocab size: 30000
{'input_ids': [1, 8, 14690, 10, 8, 968, 8, 6871, 8, 42, 8, 2777, 72, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
('/content/drive/MyDrive/workspace/projects/ekorpkit-book/tokenizers/enko_wiki/tokenizer_config.json',
 '/content/drive/MyDrive/workspace/projects/ekorpkit-book/tokenizers/enko_wiki/special_tokens_map.json',
 '/content/drive/MyDrive/workspace/projects/ekorpkit-book/tokenizers/enko_wiki/tokenizer.json')
time: 67 ms (started: 2022-11-19 10:55:20 +00:00)
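
To see the truncation and padding behavior described above, here is a quick check with the tokenizer we just built, using an artificially small max_length of 8 for illustration:

short = tokenizer("Hi", padding="max_length", max_length=8)
print(short["input_ids"])       # padded with the <pad> id up to length 8
print(short["attention_mask"])  # 0s mark the padded positions

long = tokenizer("Hello, my dog is cute", truncation=True, max_length=8)
print(len(long["input_ids"]))   # truncated to 8 ids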

Load the tokenizer#

from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained(
    project_dir + "/tokenizers/enko_wiki"
)
print(f"is_fast: {tokenizer.is_fast}")
is_fast: True
time: 59.6 ms (started: 2022-11-21 01:03:21 +00:00)
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
{'input_ids': [1, 8, 14690, 10, 8, 235, 8, 202, 8, 15219, 489, 2, 8, 37, 8, 235, 8, 15219, 8, 11241, 8, 80, 8, 65, 9, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
time: 20.2 ms (started: 2022-11-21 01:07:07 +00:00)

Load the dataset#

from datasets import load_dataset

data_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"

dataset = load_dataset("text", data_dir=data_dir, split="train")
dataset
WARNING:datasets.builder:Using custom data configuration default-f34802e795f4ed05
WARNING:datasets.builder:Reusing dataset text (/workspace/data/tbts/.cache/huggingface/datasets/text/default-f34802e795f4ed05/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad)
Dataset({
    features: ['text'],
    num_rows: 3618972
})
time: 1.02 s (started: 2022-11-19 09:39:39 +00:00)

Tokenize the dataset#

text_column = "text"


def tokenize(element):
    outputs = tokenizer(
        element[text_column],
        truncation=True,
        max_length=context_length,
        return_special_tokens_mask=True,
    )
    return outputs
time: 15.3 ms (started: 2022-11-19 09:55:12 +00:00)
num_proc = 20

# preprocess dataset
tokenized_dataset = dataset.map(
    tokenize, batched=True, remove_columns=[text_column], num_proc=num_proc
)
tokenized_dataset.features
from itertools import chain

# Main data processing function that concatenates all texts from our dataset and generates
# chunks of context_length tokens.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad the last chunk instead if the model supported it.
    # Customize this part to your needs.
    if total_length >= context_length:
        total_length = (total_length // context_length) * context_length
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + context_length] for i in range(0, total_length, context_length)]
        for k, t in concatenated_examples.items()
    }
    return result


tokenized_dataset = tokenized_dataset.map(group_texts, batched=True, num_proc=num_proc)

# shuffle dataset
tokenized_dataset = tokenized_dataset.shuffle(seed=1234)

print(f"the dataset contains in total {len(tokenized_dataset)*context_length} tokens")
# the dataset contains in total 137,816,832 tokens
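
To make the chunking logic concrete, here is a toy version of what group_texts does, using a hypothetical chunk length of 4 instead of 512:

from itertools import chain

toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]]}
chunk_length = 4

concatenated = list(chain(*toy_batch["input_ids"]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
total_length = (len(concatenated) // chunk_length) * chunk_length
chunks = [concatenated[i : i + chunk_length] for i in range(0, total_length, chunk_length)]
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]] -- the leftover token 9 is dropped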

We have 137,816,832 tokens in our dataset. For reference, the original BERT was trained on a corpus of about 3.3 billion words, and GPT-3 was trained on roughly 300 billion tokens.

Initializing a New Model#

We will initialize a new model using the bert-base-uncased configuration. We will then save the configuration to a file so that we can use it later when we load the model.

from transformers import AutoTokenizer, AutoConfig, AutoModelForMaskedLM

tk_path = project_dir + "/tokenizers/enko_wiki"

# Load the unigram tokenizer we trained on the English-Korean Wikipedia corpus
tokenizer = AutoTokenizer.from_pretrained(tk_path)

# Configuration
config_kwargs = {
    "vocab_size": len(tokenizer),
    "pad_token_id": tokenizer.pad_token_id,
    "mask_token_id": tokenizer.mask_token_id,
    "cls_token_id": tokenizer.cls_token_id,
    "sep_token_id": tokenizer.sep_token_id,
    "unk_token_id": tokenizer.unk_token_id,
}

# Initialize a new model from the bert-base-uncased configuration
config = AutoConfig.from_pretrained("bert-base-uncased", **config_kwargs)
model = AutoModelForMaskedLM.from_config(config)

model_path = project_dir + "/models/enko_wiki_bert_base_uncased"
model.save_pretrained(model_path)
time: 3.4 s (started: 2022-11-19 11:29:40 +00:00)

Our model has 109.1 million parameters, just like the original BERT-base model.

from transformers import BertForMaskedLM

model = BertForMaskedLM(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"BERT size: {model_size/1000**2:.1f}M parameters")
BERT size: 109.1M parameters
time: 2.07 s (started: 2022-11-19 11:23:15 +00:00)

Set up a DataCollator#

Before we can start training, we need to set up a data collator that will be used to collate the batches of data. We will use the DataCollatorForLanguageModeling data collator, which will take care of masking the tokens and padding the sequences.

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(
    tokenizer, mlm=True, mlm_probability=0.15
)
2022-11-21 01:24:31.069901: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
time: 2.59 s (started: 2022-11-21 01:24:30 +00:00)

Load the Tokenized Dataset#

Our dataset is already tokenized and chunked; it contains 268,366 examples.

from datasets import Dataset

dataset_dir = project_dir + "/data/tokenized_datasets/enko_filtered"

tokenized_dataset = Dataset.load_from_disk(dataset_dir)
tokenized_dataset
Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
    num_rows: 268366
})
time: 269 ms (started: 2022-11-21 01:24:33 +00:00)

Check the output of the first batch of data.

out = data_collator([tokenized_dataset[i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")
input_ids shape: torch.Size([5, 512])
token_type_ids shape: torch.Size([5, 512])
attention_mask shape: torch.Size([5, 512])
labels shape: torch.Size([5, 512])
time: 70 ms (started: 2022-11-21 01:24:35 +00:00)
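
The collator also builds the labels tensor: the original ids are kept at the positions selected for prediction, and every other position is set to -100 so that it is ignored by the loss. Roughly 15% of each 512-token sequence should therefore be selected, which we can sanity-check on the batch above:

masked_positions = (out["labels"][0] != -100).sum().item()
total_positions = out["labels"].shape[1]
print(f"{masked_positions} of {total_positions} positions selected ({masked_positions / total_positions:.1%})")
# should be around 15%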

Training the Model#

We will configure the training arguments and then set up a trainer to train our model.

Configure the Training Arguments#

from transformers import Trainer, TrainingArguments


args = TrainingArguments(
    output_dir=model_path,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)
Using cuda_amp half precision backend
time: 2.43 s (started: 2022-11-19 11:34:48 +00:00)
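
As a sanity check on these arguments, the effective batch size per optimizer step is per_device_train_batch_size × gradient_accumulation_steps × the number of devices. Assuming a single GPU (an assumption; adjust num_devices to your setup), this reproduces the global batch size of 256 sequences used by the original BERT:

per_device_batch_size = 32
gradient_accumulation_steps = 8
num_devices = 1  # assumption; multiply by the actual GPU count on a multi-GPU machine

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = len(tokenized_dataset) // effective_batch_size
print(effective_batch_size, steps_per_epoch)  # 256 sequences per step, ~1048 steps per epoch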

Train the Model#

from accelerate import Accelerator

accelerator = Accelerator()
acc_state = {str(k): str(v) for k, v in accelerator.state.__dict__.items()}
device = accelerator.device

print(f"device: {device}")

trainer = accelerator.prepare(trainer)

trainer.train()

trainer.save_model(model_path)
device: cuda
time: 47.9 ms (started: 2022-11-19 11:34:51 +00:00)

It took about 6 hours and 33 minutes to train our model for 40 epochs.

Testing the Model#

We will load our model and test it on a few examples.

from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM

model_path = project_dir + "/models/enko_wiki_bert_base_uncased"

model = AutoModelForMaskedLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
[INFO|configuration_utils.py:652] 2022-11-22 19:14:02,688 >> loading configuration file /content/drive/MyDrive/workspace/projects/ekorpkit-book/models/enko_wiki_bert_base_uncased/config.json
[INFO|configuration_utils.py:706] 2022-11-22 19:14:02,690 >> Model config BertConfig {
  "_name_or_path": "/content/drive/MyDrive/workspace/projects/ekorpkit-book/models/enko_wiki_bert_base_uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 4,
  "position_embedding_type": "absolute",
  "sep_token_id": 6,
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30000
}

[INFO|modeling_utils.py:2155] 2022-11-22 19:14:02,691 >> loading weights file /content/drive/MyDrive/workspace/projects/ekorpkit-book/models/enko_wiki_bert_base_uncased/pytorch_model.bin
[INFO|modeling_utils.py:2608] 2022-11-22 19:14:04,867 >> All model checkpoint weights were used when initializing BertForMaskedLM.

[INFO|modeling_utils.py:2616] 2022-11-22 19:14:04,873 >> All the weights of BertForMaskedLM were initialized from the model checkpoint at /content/drive/MyDrive/workspace/projects/ekorpkit-book/models/enko_wiki_bert_base_uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
[INFO|tokenization_utils_base.py:1773] 2022-11-22 19:14:04,888 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:1773] 2022-11-22 19:14:04,888 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1773] 2022-11-22 19:14:04,889 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1773] 2022-11-22 19:14:04,889 >> loading file tokenizer_config.json
time: 2.3 s (started: 2022-11-22 19:14:02 +00:00)
# perform predictions
# Roughly: "It is the first role-playing game <mask> for a mass audience." (e.g. 만든 "made", 제작한 "produced")
example = "처음으로 대중적으로 <mask> 롤플레잉 게임이다."
for prediction in fill_mask(example):
    print(prediction)
{'score': 0.1881457269191742, 'token': 1183, 'token_str': '만든', 'sequence': '▁ 처음으로 ▁ 대중적 으로 ▁ 만든 ▁ 롤플레잉 ▁ 게임 이다.'}
{'score': 0.14176464080810547, 'token': 3567, 'token_str': '제작한', 'sequence': '▁ 처음으로 ▁ 대중적 으로 ▁ 제작한 ▁ 롤플레잉 ▁ 게임 이다.'}
{'score': 0.04182405769824982, 'token': 15022, 'token_str': '롤플레잉', 'sequence': '▁ 처음으로 ▁ 대중적 으로 ▁ 롤플레잉 ▁ 롤플레잉 ▁ 게임 이다.'}
{'score': 0.035137832164764404, 'token': 3225, 'token_str': '개발한', 'sequence': '▁ 처음으로 ▁ 대중적 으로 ▁ 개발한 ▁ 롤플레잉 ▁ 게임 이다.'}
{'score': 0.033710286021232605, 'token': 3171, 'token_str': '아시안', 'sequence': '▁ 처음으로 ▁ 대중적 으로 ▁ 아시안 ▁ 롤플레잉 ▁ 게임 이다.'}
time: 223 ms (started: 2022-11-22 19:14:05 +00:00)

Usage of MLM Trainer#

%config InlineBackend.figure_format='retina'
%load_ext autotime

from ekorpkit import eKonf

eKonf.setLogger("WARNING")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
project_dir = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
print("project_dir:", project_dir)
eKonf.os.envs.dict()
The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
version: 0.1.40.post0.dev46
is colab? False
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
{'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
 'EKORPKIT_WORKSPACE_ROOT': '/content/drive/MyDrive/workspace',
 'EKORPKIT_PROJECT': 'ekorpkit-book',
 'EKORPKIT_PROJECT_DIR': '/content/drive/MyDrive/workspace/projects/ekorpkit-book',
 'EKORPKIT_DATA_DIR': None,
 'EKORPKIT_LOG_LEVEL': 'WARNING',
 'NUM_WORKERS': 230,
 'KMP_DUPLICATE_LIB_OK': 'TRUE',
 'CUDA_DEVICE_ORDER': None,
 'CUDA_VISIBLE_DEVICES': None,
 'WANDB_PROJECT': None,
 'WANDB_DISABLED': None}
time: 10.1 ms (started: 2022-12-05 10:58:05 +00:00)
from ekorpkit.tasks.nlp import MlmTrainer

train_dir = "outputs/enkowiki/sentence_chunks"
pretrained_tokenizer_file = "enkowiki_unigram_huggingface_vocab_30000.json"

cfg = eKonf.compose("task/nlp/lm=mlm")
cfg.name = "enkowiki"
cfg.model.config_name = "bert-base-uncased"
cfg.tokenizer.pretrained_tokenizer_file = pretrained_tokenizer_file
cfg.dataset.train_file = train_dir
cfg.dataset.max_seq_length = 512
cfg.use_accelerator = True
cfg.training.num_train_epochs = 10
cfg.training.eval_steps = 500
cfg.training.warmup_steps = 100
# eKonf.print(cfg)
trainer = MlmTrainer(**cfg)
WARNING:ekorpkit.models.transformer.trainers.base:Process rank: -1, device: cuda:0, n_gpu: 8, distributed training: False, 16-bits training: True
INFO:ekorpkit.base:No method defined to call
time: 2.35 s (started: 2022-12-05 10:59:56 +00:00)
# trainer.train()
trainer.save_config()
'enkowiki(19)_config.yaml'
time: 260 ms (started: 2022-12-05 11:00:06 +00:00)
ekorpkit print_config=false \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=transformer \
    transformer=mlm.trainer \
    transformer.name=enkowiki \
    transformer.model.config_name=bert-base-uncased \
    transformer.tokenizer.pretrained_tokenizer_file=enkowiki_unigram_huggingface_vocab_30000.json \
    transformer.dataset.train_file=outputs/enkowiki/sentence_chunks \
    transformer.dataset.max_seq_length=512 \
    transformer.dataset.num_workers=8 \
    transformer.use_accelerator=true \
    transformer.training.num_train_epochs=40 \
    transformer.training.eval_steps=500 \
    transformer.training.warmup_steps=100 \
    transformer.auto=train
# perform predictions
example = "처음으로 대중적으로 <mask> 롤플레잉 게임이다."
for prediction in trainer.fill_mask(example):
    print(prediction)
[INFO|configuration_utils.py:652] 2022-11-26 11:57:21,702 >> loading configuration file /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/models/enkowiki/enkowiki-bert-base-uncased/config.json
[INFO|configuration_utils.py:706] 2022-11-26 11:57:21,703 >> Model config BertConfig {
  "_name_or_path": "/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/models/enkowiki/enkowiki-bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "cls_token_id": 5,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "mask_token_id": 3,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 4,
  "position_embedding_type": "absolute",
  "sep_token_id": 6,
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "unk_token_id": 7,
  "use_cache": true,
  "vocab_size": 30000
}

[INFO|modeling_utils.py:2155] 2022-11-26 11:57:21,708 >> loading weights file /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/models/enkowiki/enkowiki-bert-base-uncased/pytorch_model.bin
[INFO|modeling_utils.py:2608] 2022-11-26 11:57:23,604 >> All model checkpoint weights were used when initializing BertForMaskedLM.

[INFO|modeling_utils.py:2616] 2022-11-26 11:57:23,610 >> All the weights of BertForMaskedLM were initialized from the model checkpoint at /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/models/enkowiki/enkowiki-bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
[INFO|tokenization_utils_base.py:1773] 2022-11-26 11:57:23,662 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:1773] 2022-11-26 11:57:23,663 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1773] 2022-11-26 11:57:23,663 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1773] 2022-11-26 11:57:23,663 >> loading file tokenizer_config.json
{'score': 0.3371969759464264, 'token': 8, 'token_str': '▁', 'sequence': '▁ 처음으로 ▁ 대중적 으로 ▁ ▁ ▁ 롤플레잉 ▁ 게임 이다.'}
{'score': 0.028631633147597313, 'token': 9, 'token_str': '.', 'sequence': '▁ 처음으로 ▁ 대중적 으로 ▁. ▁ 롤플레잉 ▁ 게임 이다.'}
{'score': 0.019517863169312477, 'token': 10, 'token_str': ',', 'sequence': '▁ 처음으로 ▁ 대중적 으로 ▁, ▁ 롤플레잉 ▁ 게임 이다.'}
{'score': 0.009688169695436954, 'token': 11, 'token_str': 'the', 'sequence': '▁ 처음으로 ▁ 대중적 으로 ▁ the ▁ 롤플레잉 ▁ 게임 이다.'}
{'score': 0.008737098425626755, 'token': 13, 'token_str': ')', 'sequence': '▁ 처음으로 ▁ 대중적 으로 ▁ ) ▁ 롤플레잉 ▁ 게임 이다.'}
time: 2.16 s (started: 2022-11-26 11:57:21 +00:00)
trainer.show_config()
INFO:ekorpkit.config:Using existing path: /content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling
INFO:ekorpkit.config:Merging config with args: {}
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
{'_target_': 'ekorpkit.models.transformer.trainers.MlmTrainer',
 'auto': {},
 'batch': {'batch_name': 'enkowiki',
           'batch_num': None,
           'num_workers': 1,
           'random_seed': True,
           'resume_latest': False,
           'resume_run': False,
           'run_to_resume': 'latest',
           'seed': None,
           'verbose': False},
 'dataset': {'cache_dir': '/content/drive/MyDrive/workspace/.cache',
             'data_dir': None,
             'dataset_config_name': None,
             'dataset_name': None,
             'download_mode': None,
             'filename_extension': None,
             'line_by_line': False,
             'max_eval_samples': None,
             'max_seq_length': 512,
             'max_train_samples': None,
             'mlm_probability': 0.15,
             'num_workers': None,
             'overwrite_cache': True,
             'pad_to_max_length': False,
             'seed': None,
             'shuffle': False,
             'text_column_name': None,
             'train_file': 'outputs/enkowiki/sentence_chunks',
             'use_auth_token': False,
             'validation_file': None,
             'validation_split_percentage': 5},
 'model': {'cache_dir': None,
           'config_name': 'bert-base-uncased',
           'config_overrides': None,
           'model_dir': None,
           'model_name': None,
           'model_name_or_path': None,
           'model_revision': 'main',
           'model_type': None,
           'tokenizer_name': None,
           'use_auth_token': False,
           'use_fast_tokenizer': True},
 'name': 'enkowiki',
 'path': {'batch_dir': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/enkowiki',
          'batch_name': 'enkowiki',
          'cache_dir': '/content/drive/MyDrive/workspace/.cache',
          'data_dir': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/data',
          'library_dir': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/libs',
          'model_dir': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/models',
          'output_dir': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs',
          'root': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling',
          'task_name': 'language-modeling',
          'tmp_dir': '/content/drive/MyDrive/workspace/.tmp',
          'verbose': False},
 'project': {'path': {'archive': '/content/drive/MyDrive/workspace/data/archive',
                      'cache': '/content/drive/MyDrive/workspace/.cache',
                      'corpus': '/content/drive/MyDrive/workspace/data/datasets/corpus',
                      'data': '/content/drive/MyDrive/workspace/data',
                      'dataset': '/content/drive/MyDrive/workspace/data/datasets',
                      'ekorpkit': '/workspace/projects/ekorpkit/ekorpkit',
                      'home': '/root',
                      'library': '/content/drive/MyDrive/workspace/data/libs',
                      'log': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/logs',
                      'model': '/content/drive/MyDrive/workspace/data/models',
                      'output': '/content/drive/MyDrive/workspace/projects/ekorpkit-book/outputs',
                      'project': '/content/drive/MyDrive/workspace/projects/ekorpkit-book',
                      'resource': '/workspace/projects/ekorpkit/ekorpkit/resources',
                      'runtime': '/workspace/projects/ekorpkit-book/ekorpkit-book/docs/lectures/deep_nlp',
                      'tmp': '/content/drive/MyDrive/workspace/.tmp',
                      'workspace': '/content/drive/MyDrive/workspace'},
             'project_dir': '/content/drive/MyDrive/workspace/projects/ekorpkit-book',
             'project_name': 'ekorpkit-book',
             'task_name': 'language-modeling',
             'workspace_dir': '/content/drive/MyDrive/workspace'},
 'secret': {'hf_user_access_token': None, 'wandb_api_key': None},
 'tokenizer': {'bos_token': '<s>',
               'cls_token': '<cls>',
               'eos_token': '</s>',
               'mask_token': '<mask>',
               'model_dir': None,
               'model_max_length': None,
               'model_type': None,
               'name': 'enkowiki_unigram_huggingface_vocab_30000.json',
               'pad_token': '<pad>',
               'padding_side': 'right',
               'path': None,
               'return_length': True,
               'sep_token': '<sep>',
               'truncation': True,
               'unk_token': '<unk>'},
 'training': {'do_eval': True,
              'do_train': True,
              'eval_steps': 500,
              'evaluation_strategy': 'steps',
              'fp16': True,
              'gradient_accumulation_steps': 8,
              'learning_rate': 0.0005,
              'logging_steps': 1000,
              'lr_scheduler_type': 'cosine',
              'num_train_epochs': 10,
              'output_dir': None,
              'overwrite_output_dir': True,
              'per_device_eval_batch_size': 32,
              'per_device_train_batch_size': 32,
              'push_to_hub': False,
              'report_to': 'wandb',
              'run_name': 'enkowiki',
              'save_steps': 5000,
              'warmup_steps': 100,
              'weight_decay': 0.1},
 'use_accelerator': True,
 'verbose': False}
time: 314 ms (started: 2022-11-29 10:36:06 +00:00)

Training bnwiki_bert with the MLM Trainer#

from ekorpkit.tokenizers.trainer import TokenizerTrainer


cfg = eKonf.compose("tokenizer=trainer")
cfg.name = "bnwiki"
cfg.training.use_sample = False
cfg.model.model_type = "unigram"
cfg.model.vocab_size = 30000

tk_trainer = TokenizerTrainer(**cfg)
print(tk_trainer.model_path)
print(tk_trainer.sample_filepath)
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
INFO:ekorpkit.base:No method defined to call
/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/tokenizers/bnwiki/bnwiki_unigram_huggingface_vocab_30000.json
/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/bnwiki/sentence_sample/sentence_chunks_sampled.txt
time: 2.76 s (started: 2022-11-24 11:54:52 +00:00)
from ekorpkit.models.transformer.trainers import MlmTrainer

# data_file = "bnwiki_filtered.parquet"

cfg = eKonf.compose("model/transformer=mlm.trainer")
cfg.name = "bnwiki"
cfg.model.config_name = "bert-base-uncased"
cfg.tokenizer.pretrained_tokenizer_file = str(tk_trainer.model_path)
cfg.dataset.train_file = str(tk_trainer.sample_filepath)
cfg.dataset.max_seq_length = 512
cfg.dataset.num_workers = 8
cfg.use_accelerator = True
cfg.training.num_train_epochs = 50
cfg.training.eval_steps = 100
cfg.training.warmup_steps = 50

trainer = MlmTrainer(**cfg)
2022-11-24 11:54:58.186442: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
WARNING:ekorpkit.models.transformer.trainers.mlm:Process rank: -1, device: cuda:0, n_gpu: 8, distributed training: False, 16-bits training: True
INFO:ekorpkit.models.transformer.trainers.mlm:Training/evaluation parameters TrainingArguments(
_n_gpu=8,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=100,
evaluation_strategy=steps,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=8,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0005,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=passive,
log_level_replica=passive,
log_on_each_node=True,
logging_dir=None,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1000,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=50,
optim=adamw_hf,
output_dir=/content/drive/MyDrive/workspace/projects/ekorpkit-book/language-modeling/outputs/models/bnwiki/bnwiki-bert-base-uncased,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=bnwiki,
save_on_each_node=False,
save_steps=5000,
save_strategy=steps,
save_total_limit=None,
seed=1185224509,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=50,
weight_decay=0.1,
xpu_backend=None,
)
INFO:ekorpkit.base:No method defined to call
time: 5.29 s (started: 2022-11-24 11:54:57 +00:00)
# trainer.train()
ekorpkit print_config=true \
    project_name=ekorpkit-book \
    workspace_dir=/content/drive/MyDrive/workspace \
    run=transformer \
    transformer=mlm.trainer \
    transformer.name=bnwiki \
    transformer.model.config_name=bert-base-uncased \
    transformer.tokenizer.pretrained_tokenizer_file=bnwiki_unigram_huggingface_vocab_30000.json \
    transformer.dataset.train_file=data/bnwiki_filtered.parquet \
    transformer.dataset.max_seq_length=512 \
    transformer.dataset.num_workers=8 \
    transformer.use_accelerator=true \
    transformer.training.num_train_epochs=50 \
    transformer.training.eval_steps=100 \
    transformer.training.warmup_steps=50 \
    transformer.auto=train
# perform predictions
# Roughly: "The foundation stone of this mosque <mask> in June 2015." (expected fill: হয়েছিলো, "was laid")
example = "এই মসজিদটির ভিত্তিপ্রস্তর <mask> ২০১৫ সালের জুন মাসে।"  # হয়েছিলো
for prediction in trainer.fill_mask(example):
    print(prediction)
{'score': 0.05522403120994568, 'token': 8, 'token_str': '▁', 'sequence': '▁এই ▁মসজদট র ▁ভততপরসতর ▁ ▁ ▁ ২০১৫ ▁সলর ▁জন ▁মস ।'}
{'score': 0.02941860817372799, 'token': 9, 'token_str': 'র', 'sequence': '▁এই ▁মসজদট র ▁ভততপরসতর ▁ র ▁ ২০১৫ ▁সলর ▁জন ▁মস ।'}
{'score': 0.016348714008927345, 'token': 10, 'token_str': ',', 'sequence': '▁এই ▁মসজদট র ▁ভততপরসতর ▁, ▁ ২০১৫ ▁সলর ▁জন ▁মস ।'}
{'score': 0.012751172296702862, 'token': 11, 'token_str': '▁এব', 'sequence': '▁এই ▁মসজদট র ▁ভততপরসতর ▁ ▁এব ▁ ২০১৫ ▁সলর ▁জন ▁মস ।'}
{'score': 0.01006506010890007, 'token': 13, 'token_str': '.', 'sequence': '▁এই ▁মসজদট র ▁ভততপরসতর ▁. ▁ ২০১৫ ▁সলর ▁জন ▁মস ।'}
time: 339 ms (started: 2022-11-24 23:45:48 +00:00)

References#