BERT: Bidirectional Encoder Representations from Transformers#


  • The year 2018 marked a turning point for the field of Natural Language Processing (NLP).

  • The BERT paper [Devlin et al., 2018] introduced a new language representation model that outperformed all previous models on a wide range of NLP tasks.

  • BERT is a deep bidirectional transformer model that is pre-trained on a large corpus of unlabeled text.

  • The model is trained to predict masked words in a sentence (masked language modeling) and to predict whether one sentence follows another (next-sentence prediction).

  • The pre-trained model can then be fine-tuned on a variety of downstream NLP tasks with state-of-the-art results.

BERT builds on two key ideas:

BERT is pre-trained on a large corpus of unlabeled text. Its weights are learned by predicting masked words in a sentence and by predicting whether one sentence follows another.

BERT is a deep bidirectional transformer model. It is a multi-headed beast with 12 (24) layers, 12 (16) attention heads per layer, and 110 (340) million parameters, where the numbers in parentheses refer to the Large variant. Since model weights are not shared across layers, the total number of distinct attention mechanisms (heads) is 12 (24) x 12 (16) = 144 (384).

The Architecture of BERT#

The core component of BERT is the attention mechanism.

Attention is a way for a model to assign weights to different parts of the input based on their importance to some task.

For the sentence The dog from down the street ran up to me and ___, to complete the sentence, a model may give more attention to the word dog than to the word street, because knowing that the subject of the sentence is a dog is more important than knowing where the dog is from.

Attention Mechanism#

The attention mechanism is quite simple. It is a function that takes two inputs: a query and a key. The query is the part of the input that we want to focus on; the key is the part of the input that we compare the query to. The output of the attention mechanism is a weighted sum of values, where the weights are computed by a function of the query and the key.

Suppose you have some sequence of words \(X\), where each element \(x_i\) is a vector (referred to as a value) of dimension \(d\).

In the following example, \(X\) is a sequence of 3 words, each represented by a 4-dimensional vector.

Attention is simply a function that takes \(X\) as input and returns another sequence \(Y\) of the same length as \(X\), composed of vectors of the same dimension as the vectors in \(X\).

Each vector in \(Y\) is computed by taking a weighted average of the vectors in \(X\).

That is, attention is just a weighted average of the values of the vectors in \(X\). The weights show how much the model attends to each vector in \(X\) when computing the output vector.
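
As a minimal sketch in code (the numbers are arbitrary; the only requirement is that each row of weights sums to 1):

import numpy as np

# X: a sequence of 3 words, each represented by a 4-dimensional vector
X = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.5, 1.5, 0.0, 2.0],
              [2.0, 1.0, 1.0, 0.0]])

# attention weights: one row per output vector; each row sums to 1
W = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])

Y = W @ X          # each vector in Y is a weighted average of the vectors in X
print(Y.shape)     # (3, 4): same length and dimensionality as X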

Attending to language#

How does attention work in the context of language?

Suppose we have a sentence the dog ran. We can represent this sentence as a sequence of vectors, where each vector is a word embedding.

A word embedding is a vector representation of a word: a vector of real numbers in which each element captures some aspect of the word’s meaning.

Those aspects can be semantic, syntactic, or even phonetic. For example, the first element of the vector might capture part of the word’s semantic meaning, the second part of its syntactic role, and the third part of its phonetic form.

In practice, word embeddings are not interpretable. We don’t know what each element of the vector represents. But we can use them to compute the meaning of a sentence.

For example, we can perform arithmetic operations on word embeddings: the vector for the word cat can be approximated by adding the vectors of the words kitten and dog and subtracting the vector of the word puppy.

\[\text{cat} - \text{kitten} \approx \text{dog} - \text{puppy}\]
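
One way to see this kind of arithmetic in practice is with pre-trained static word vectors, for example GloVe vectors loaded through gensim (the specific model name below is an assumption; any pre-trained word-vector set behaves similarly):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads pre-trained GloVe vectors on first use
# kitten + dog - puppy should land near cat
print(vectors.most_similar(positive=["kitten", "dog"], negative=["puppy"], topn=3))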

Attention is also a form of arithmetic operation (a weighted average). Therefore, you can apply attention to word embeddings to compute the meaning of a sentence.

For example, the embedding of the word dog in \(Y\) is computed by taking a weighted average of the embeddings of the words the, dog, and ran in \(X\) with weights of 0.2, 0.7, and 0.1 respectively.
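
In code, with made-up embeddings for the three words, that weighted average is simply:

import numpy as np

the, dog, ran = np.random.randn(3, 4)       # toy 4-dimensional embeddings for "the", "dog", "ran"
y_dog = 0.2 * the + 0.7 * dog + 0.1 * ran   # contextualized embedding of "dog" in Y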

How does this process help the model to understand the meaning of a sentence?

To fully understand the meaning of a sentence, we need to know the meaning of each word in the sentence. But we can’t just look at each word in the sentence in isolation. We need to know the context of the word.

The attention mechanism allows the model to focus on the words that are most important to the meaning of the sentence.

Multi-head attention#

BERT learns multiple attention mechanisms, called heads, which operate in parallel on the same input. The outputs of the heads are concatenated and fed into a feed-forward neural network.

Multi-head attention enables the model to learn broader types of relationships between words in a sentence than a single attention mechanism.

BERT also stacks multiple layers of multi-head attention. Each layer of multi-head attention takes the output of the previous layer as input. The output of the last layer of multi-head attention is fed into a feed-forward neural network.

Through this architecture, BERT is able to learn very rich representations of language as it gets to the deeper layers of the network.

Because the attention heads do not share parameters, each head learns a different type of relationship between words in a sentence.
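
A compact sketch of one multi-head attention layer, with dimensions matching BERT-Base (this is illustrative and omits details such as residual connections and layer normalization):

import torch
from torch import nn
import torch.nn.functional as F

hidden, heads = 768, 12
head_dim = hidden // heads                           # 64 dimensions per head

x = torch.randn(1, 10, hidden)                       # a batch containing one 10-token sequence

W_q, W_k, W_v = (nn.Linear(hidden, hidden) for _ in range(3))
ffn = nn.Sequential(nn.Linear(hidden, 3072), nn.GELU(), nn.Linear(3072, hidden))

def split_heads(t):                                  # (1, 10, 768) -> (1, 12, 10, 64)
    return t.view(1, -1, heads, head_dim).transpose(1, 2)

q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # one (10, 10) attention map per head
attn = F.softmax(scores, dim=-1)
context = (attn @ v).transpose(1, 2).reshape(1, -1, hidden)  # concatenate the 12 heads
out = ffn(context)                                   # feed the concatenated output into a feed-forward network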

Visualizing Attention#

# %pip install bertviz
%config InlineBackend.figure_format='retina'

from bertviz import model_view, head_view
from bertviz.neuron_view import show
from transformers import AutoTokenizer, AutoModel, utils

utils.logging.set_verbosity_error()  # Suppress standard warnings

model_version = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_version)
model = AutoModel.from_pretrained(model_version, output_attentions=True)
sentence_a = "I went to the store."
sentence_b = "At the store, I bought fresh strawberries."

inputs = tokenizer.encode(
    sentence_a,
    sentence_b,  # encode the two sentences as a pair: [CLS] A [SEP] B [SEP]
    return_tensors="pt",
)
outputs = model(inputs)
attention = outputs[-1]
tokens = tokenizer.convert_ids_to_tokens(inputs[0])

Each cell in the model view shows the attention pattern of a single head. The attention pattern of each head is different. The attention patterns are specific to the input sentence.

# model_view(attention, tokens)

Deconstructing Attention#

Let’s take a closer look at how attention works.

Now that we know that attention is a weighted average of the values of the vectors in X, let’s look at how the weights are computed.

The weights are computed from two additional vectors associated with each word: a query vector, which represents the word that is paying attention, and a key vector, which represents the word being attended to.

These vectors can be thought of as a type of word embedding like the value vectors we saw earlier, but constructed specifically for determining the similarity between words.

The similarity between two words is computed by taking the dot product of the query and key vectors of those words.

To convert the dot products into probabilities, we apply a softmax function over them, which normalizes the scores so that they are positive and sum to 1.

The softmax values on the right represent the final weights of the attention mechanism.
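
As a toy example with made-up numbers for the sentence the dog ran, the weights for the query of dog could be computed like this:

import numpy as np

q_dog = np.array([0.5, 1.0, -0.5, 0.2])          # query vector for "dog" (made-up numbers)
keys = np.array([[0.1, 0.2, 0.0, 0.1],           # key vector for "the"
                 [0.6, 0.9, -0.4, 0.3],          # key vector for "dog"
                 [-0.2, 0.1, 0.3, 0.0]])         # key vector for "ran"

scores = keys @ q_dog                            # dot product of the query with each key
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: positive and sums to 1
print(weights.round(2))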

Where do these query and key vectors come from?#

We now understand that attention weights are computed from query and key vectors. But where do these vectors come from?

The query and key vectors are computed from the value vectors: each is obtained by passing the value vectors through its own learned linear layer.
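
In PyTorch terms, and following the simplified picture above, that amounts to something like this sketch (768 is the hidden size of BERT-Base; the layer names are illustrative):

import torch
from torch import nn

d = 768                           # hidden size of BERT-Base
values = torch.randn(6, d)        # value vectors for a 6-token sequence

to_query = nn.Linear(d, d)        # learned projection that produces query vectors
to_key = nn.Linear(d, d)          # learned projection that produces key vectors

queries = to_query(values)        # (6, 768)
keys = to_key(values)             # (6, 768)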

We can visualize how attention weights are computed from query and key vectors using the neuron view.

This view traces the computation of attention from the selected word on the left to the complete sequence of words on the right. Positive values are colored blue and negative values orange, with color intensity representing magnitude.

Let’s go through the columns in the neuron view one at a time.

Query q: the query vector q encodes the word on the left that is paying attention, i.e. the one that is “querying” the other words. In the example above, the query vector for “on” (the selected word) is highlighted.

Key k: the key vector k encodes the word on the right to which attention is being paid. The key vector and the query vector together determine a compatibility score between the two words.

q×k (elementwise): the elementwise product between the query vector of the selected word and each of the key vectors. This is a precursor to the dot product (the sum of the elementwise product) and is included for visualization purposes because it shows how individual elements in the query and key vectors contribute to the dot product.

q·k: the scaled dot product (see above) of the selected query vector and each of the key vectors. This is the unnormalized attention score.

Softmax: the softmax of the scaled dot product. This normalizes the attention scores to be positive and sum to one.
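
Putting the five columns together, a minimal sketch with toy vectors (random numbers, not BERT’s actual weights):

import torch
import torch.nn.functional as F

d = 64                                    # dimension of each query/key vector (64 per head in BERT-Base)
q = torch.randn(d)                        # Query q: query vector of the selected word
K = torch.randn(8, d)                     # Key k: key vectors of the 8 words being attended to

elementwise = q * K                       # q×k (elementwise): per-neuron contributions, one row per word
scores = elementwise.sum(dim=1) / d**0.5  # q·k: scaled dot product, the unnormalized attention score
weights = F.softmax(scores, dim=0)        # Softmax: positive weights that sum to one
print(weights)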

Explaining BERT’s attention patterns#

Let’s explore the attention patterns of various layers of BERT (the BERT-Base, uncased version).

Sentence A: I went to the store.

Sentence B: At the store, I bought fresh strawberries.

BERT uses WordPiece tokenization and inserts special classifier ([CLS]) and separator ([SEP]) tokens, so the actual input sequence is:

[CLS] I went to the store . [SEP] At the store , I bought fresh straw ##berries . [SEP]
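
We can check this with the tokenizer loaded earlier (the expected output is shown in the comments):

print(tokenizer.tokenize(sentence_a))
# ['i', 'went', 'to', 'the', 'store', '.']
print(tokenizer.tokenize(sentence_b))
# ['at', 'the', 'store', ',', 'i', 'bought', 'fresh', 'straw', '##berries', '.']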

from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show

model_type = 'bert'
do_lower_case = True
neuron_model = BertModel.from_pretrained(model_version, output_attentions=True)
neuron_tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)
show(neuron_model, model_type, neuron_tokenizer, sentence_a, sentence_b, layer=2, head=0)
inputs = tokenizer.encode(
    sentence_a,
    sentence_b,  # same sentence-pair encoding as before
    return_tensors="pt",
)
outputs = model(inputs)
attention = outputs[-1]
tokens = tokenizer.convert_ids_to_tokens(inputs[0])

What does BERT actually learn?#

Delimiter-focused attention patterns#

This pattern serves as a kind of “no-op”; an attention head focuses on the [SEP] tokens when it can’t find anything else to focus on.

How exactly is BERT able to fixate on the [SEP] tokens? The answer lies in the query and key vectors.

In the Key column, the key vectors for the two occurrences of [SEP] carry a distinctive signature: they both have a small number of active neurons with strongly positive (blue) or negative (orange) values, and a larger number of neurons with values close to zero (light blue/orange or white):

The query vectors tend to match the [SEP] key vectors along those active neurons, resulting in high values for the elementwise product q×k, as in this example:

The query vectors for the other words follow a similar pattern: they match the [SEP] key vector along the same set of neurons. Thus it seems that BERT has designated a small set of neurons as “[SEP]-matching neurons,” and the query vectors are assigned values that match the [SEP] key vectors at these positions.

Select layer 6, head 4. In this pattern, attention is directed to the delimiter tokens, [CLS] and [SEP].

  • For example, in this head most of the attention from every token is directed to [SEP].

head_view(attention, tokens, layer=6, heads=[4])
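
We can quantify this pattern directly from the attention tensor computed earlier; each element of attention has shape (batch, heads, seq_len, seq_len):

layer, head = 6, 4
attn = attention[layer][0, head]                            # (seq_len, seq_len) attention matrix for this head
sep_positions = [i for i, t in enumerate(tokens) if t == "[SEP]"]
sep_mass = attn[:, sep_positions].sum(dim=1).mean().item()  # average attention mass directed to [SEP]
print(f"layer {layer}, head {head}: {sep_mass:.0%} of the attention goes to [SEP]")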

Bag of Words attention pattern#

In this pattern, attention is divided fairly evenly across all words in the same sentence:

BERT is essentially computing a bag-of-words embedding by taking an (almost) unweighted average of the word embeddings in the same sentence.

How does BERT finesse the queries and keys to form this attention pattern? Let’s again turn to the neuron view:

In the \(q×k\) column, we see a clear pattern: a small number of neurons (2–4) dominate the calculation of the attention scores. When the query and key vectors are in the same sentence (the first sentence, in this case), the product shows high values (blue) at these neurons. When the query and key vectors are in different sentences, the product is strongly negative (orange) at these same positions, as in this example:

When query and key are both from sentence 1, they tend to have values with the same sign along the active neurons, resulting in a positive product. When the query is from sentence 1, and the key is from sentence 2, the same neurons tend to have values with opposite signs, resulting in a negative product.

How does BERT know the concept of “sentence”, especially in the first layer of the network before higher-level abstractions are formed? As mentioned earlier, BERT accepts special [SEP] tokens that mark sentence boundaries. Additionally, BERT incorporates sentence-level embeddings that are added to the input layer. The information encoded in these sentence embeddings flows to downstream variables, i.e. queries and keys, and enables them to acquire sentence-specific values.
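
We can see this signal in the tokenizer output: the token_type_ids mark which sentence each token belongs to, and BERT adds a learned segment embedding for each value to the input:

encoded = tokenizer(sentence_a, sentence_b)
print(encoded["token_type_ids"])
# 0 for [CLS], sentence A, and the first [SEP]; 1 for sentence B and the final [SEP]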

Select layer 0, head 0.

head_view(attention, tokens, layer=0, heads=[0])

Next-word attention pattern#

It makes sense that the model would focus on the next word, because adjacent words are often the most relevant for understanding a word’s meaning in context. Traditional n-gram language models are based on this same intuition.

Let’s check out the neuron view for this pattern:

We see that the product of the query vector for “the” and the key vector for “store” (the next word) is strongly positive across most neurons. For tokens other than the next token, the key-query product contains some combination of positive and negative values. The result is a high attention score between “the” and “store”.

For this attention pattern, a large number of neurons figure into the attention score, and these neurons differ depending on the token position, as illustrated here:

This behavior differs from the delimiter-focused and the sentence-focused attention patterns, in which a small, fixed set of neurons determine the attention scores. For those two patterns, only a few neurons are required because the patterns are so simple, and there is little variation in the words that receive attention. In contrast, the next-word attention pattern needs to track which of the 512 words receives attention from a given position, i.e., which is the next word. To do so it needs to generate queries and keys such that each query vector matches with a unique key vector from the 512 possibilities. This would be difficult to accomplish using a small subset of neurons.

How is BERT able to generate these position-aware queries and keys?

  • The answer lies in BERT’s position embeddings, which are added to the word embeddings at the input layer.

  • BERT learns a separate position embedding for each position in the sequence, and adds these to the word embeddings.

  • This position information flows to downstream variables, i.e. queries and keys, and enables them to acquire position-specific values.
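
These learned position embeddings are part of the model we loaded above:

print(model.embeddings.position_embeddings)
# Embedding(512, 768): one learned 768-dimensional vector for each of the 512 positions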

Select layer 2, head 0. (The selected head is indicated by the highlighted square in the color bar at the top.) Most of the attention at a particular position is directed to the next token in the sequence.

  • If you do not select any token, the visualization shows the attention pattern for all tokens in the sequence.

  • If you select a token, the visualization shows the attention pattern for the selected token.

  • If you select the token i, virtually all of its attention is directed to the next token, went.

  • The [SEP] token disrupts the next-token attention pattern, as most of the attention from [SEP] is directed to [CLS] (the first token in the sequence) rather than the next token.

  • This pattern, attention to the next token, appears to work primarily within a sentence.

  • This pattern is related to the idea of a recurrent neural network (RNN) that is trained to predict the next word in a sequence.

head_view(attention, tokens, layer=2, heads=[0])
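
As a quick check with the attention tensor: for this head, the most-attended position for most tokens should be the next position in the sequence.

layer, head = 2, 0
attn = attention[layer][0, head]             # (seq_len, seq_len)
top = attn.argmax(dim=1)                     # most-attended position for each token
hits = sum(int(top[i] == i + 1) for i in range(len(tokens) - 1))
print(f"{hits} of {len(tokens) - 1} tokens attend most strongly to the next token")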

Attention to previous word#

Select layer 6, head 11. In this pattern, much of the attention is directed to the previous token in the sequence.

  • For example, most of the attention from went is directed to the previous token i.

  • The pattern is not as distinct as the next-token pattern, but it is still present.

  • Some attention is also dispersed to other tokens in the sequence, especially to the [SEP] token.

  • This pattern is also related to the idea of an RNN, in this case the forward direction of an RNN.

head_view(attention, tokens, layer=6, heads=[11])

Attention to other words predictive of word#

Select layer 2, head 1. In this pattern, attention seems to be directed to other words that are predictive of the source word, excluding the source word itself.

  • For example, most of the attention for straw is directed to ##berries, and most of the attention from ##berries is focused on straw.

head_view(attention, tokens, layer=2, heads=[1])

References#

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.