Vector Semantics and Representation#

Vector Semantics and Word Embeddings#

  • Lexical semantics is the study of the meaning of words

  • Distributional hypothesis: words that occur in similar contexts have similar meanings

  • Sparse vectors: one-hot encoding or bag-of-words

  • Dense vectors: word embeddings

What do words mean, and how do we represent that?#

cassoulet

Do we want to represent that …

  • “cassoulet” is a French dish?

  • “cassoulet” contains meat and beans?

  • “cassoulet” is a stew?

bar

Do we want to represent that …

  • “bar” is a place where you can drink alcohol?

  • “bar” is a long rod?

  • “bar” is to prevent something from moving?

About words, we can say that …

  • Concepts or word senses have a complex many-to-many relationship with words

  • Words have relations with each other

    • Synonyms: “bar” and “pub”

    • Antonyms: “bar” and “open”

    • Similarity: “bar” and “club”

    • Relatedness: “bar” and “restaurant”

    • Superordinate: “bar” and “place”

    • Subordinate: “bar” and “pub”

    • Connotation: “bar” and “prison”

Different approaches to lexical semantics#

NLP draws on two different approaches to lexical semantics:

  • Lexical semantics:

    • The study of the meaning of words

    • The lexicographic tradition aims to capture the information represented in lexical entries in dictionaries

  • Distributional semantics:

    • The study of the meaning of words based on their distributional properties in large corpora

    • The distributional hypothesis: words that occur in similar contexts have similar meanings

Lexical semantics#

  • Uses resources such as lexicons, thesauri, ontologies etc. that capture explicit knowledge about word meanings.

  • Assumes that words have discrete word senses that can be represented in a lexicon.

    • bank 1 = a financial institution

    • bank 2 = a river bank

  • May capture explicit knowledge about word meanings, but is limited in its ability to capture the meaning of words that are not in the lexicon.

    • dog is a canine (lexicon)

    • cars have wheels (lexicon)

Distributional semantics#

  • Uses large corpora of raw text to learn the meaning of words from the contexts in which they occur.

  • Maps words to vector representations that capture the distributional properties of the words in the corpus.

  • Uses neural networks to learn the dense vector representations of words, word embeddings, from large corpora of raw text.

  • If each word is mapped to a single vector, this ignores the fact that words can have multiple meanings or parts of speech.

How do we represent words to capture word similarities?#

  • As atomic symbols

    • in a traditional n-gram language model

    • explicit features in a machine learning model

    • this is equivalent to very high-dimensional one-hot vectors:

      • aardvark = [1,0,…,0], bear = [0,1,…,0], …, zebra = [0,0,…,1]

      • height and tall are as different as aardvark and zebra

  • As very high-dimensional sparse vectors

    • to capture the distributional properties of words

  • As low-dimensional dense vectors

    • word embeddings
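To make the contrast concrete, here is a minimal sketch in Python (the toy vocabulary and the three-dimensional “dense” vectors are invented for illustration): with one-hot vectors, every pair of distinct words is equally dissimilar, while dense embeddings can place *height* and *tall* close together.

```python
import numpy as np

# Toy vocabulary; the position of each word defines its one-hot dimension.
vocab = ["aardvark", "height", "tall", "zebra"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(x, y):
    # Cosine similarity: dot product divided by the product of vector lengths.
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Distinct one-hot vectors are always orthogonal: similarity 0 for every pair.
print(cosine(one_hot["height"], one_hot["tall"]))      # 0.0
print(cosine(one_hot["aardvark"], one_hot["zebra"]))   # 0.0

# Invented 3-dimensional dense vectors: related words can be close together.
dense = {
    "height":   np.array([0.9, 0.1, 0.2]),
    "tall":     np.array([0.8, 0.2, 0.1]),
    "aardvark": np.array([0.1, 0.9, 0.7]),
}
print(cosine(dense["height"], dense["tall"]))       # close to 1
print(cosine(dense["height"], dense["aardvark"]))   # much smaller
```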

What should word representations capture?#

  • Vector representations of words were originally used to capture lexical semantics so that words with similar meanings would be represented by vectors that are close together in vector space.

  • These representations may also capture some morphological and syntactic information about words. (part of speech, inflections, stems, etc.)

The Distributional Hypothesis#

Zellig Harris (1954):

  • Words that occur in similar contexts have similar meanings.

  • oculist and eye doctor occur in almost the same contexts

  • If A and B have almost the same environment, then A and B are synonymous.

John Firth (1957):

  • You shall know a word by the company it keeps.

The contexts in which words occur tell us a lot about the meaning of words.

Words that occur in similar contexts have similar meanings.

Why do we care about word contexts?#

What is tezgüino?

  • A bottle of tezgüino is on the table.

  • Everybody likes tezgüino.

  • Tezgüino makes you drunk.

  • We make tezgüino out of corn.

We don’t know what tezgüino is, but we can guess that it is a drink because we understand these sentences.

If we have the following sentences:

  • A bottle of wine is on the table.

  • There is a beer bottle on the table.


  • Beer makes you drunk.

  • We make bourbon out of corn.

  • Everybody likes chocolate.

  • Everybody likes babies.

Could we guess that tezgüino is a drink like wine or beer?

However, there are also red herrings:

  • Everybody likes babies

  • Everybody likes chocolate

Two ways NLP uses context for semantics#

Distributional similarity (vector-space semantics):

  • Assume that words that occur in similar contexts have similar meanings.

  • Use the set of all contexts in which a word occurs to measure the similarity between words.

Word sense disambiguation:

  • Assume that if a word has multiple meanings, then it will occur in different contexts for each meaning.

  • Use the context of a particular occurrence of a word to identify the sense of the word in that context.

Distributional Similarity#

Basic idea#

  • Measure the semantic similarity of words by measuring the similarity of the contexts in which they occur

How?#

  • Represent words as sparse vectors such that:

    • each vector element (dimension) represents a different context

    • the value of each element is the frequency with which the word occurs in that context, capturing how strongly the word is associated with it

  • Compute the semantic similarity of words by measuring the similarity of their context vectors

Distributional similarities represent each word \(w\) as a vector \(v_w\) of context counts:

\[v_w = (v_{w,1} , \ldots , v_{w,N} ) \in R^N\]

in a vector space \(R^N\) where \(N\) is the number of contexts.

  • each dimension \(i\) represents a different context \(c_i\)

  • each element \(v_{w,i}\) captures how strongly \(w\) is associated with context \(c_i\)

  • \(v_{w,i}\) is the co-occurrence count of \(w\) and \(c_i\)
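As a toy instance of this definition (the contexts and counts below are invented for illustration), each word becomes a point in \(R^N\) with \(N = 3\):

```python
# Hypothetical contexts c_1, c_2, c_3 and invented co-occurrence counts.
contexts = ["drink", "table", "corn"]

# v_{w,i} = co-occurrence count of word w with context c_i,
# so each word is a vector in R^N with N = len(contexts) = 3.
v = {
    "wine":      [7, 3, 0],
    "tezguino":  [5, 2, 4],
    "chocolate": [0, 1, 0],
}

for word, vec in v.items():
    print(word, dict(zip(contexts, vec)))
```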

The Information Retrieval perspective: The Term-Document Matrix#

In information retrieval, we search a collection of \(N\) documents for \(M\) terms:

  • We can represent each word in the vocabulary \(V\) as an \(N\)-dimensional vector \(v_w\) where \(v_{w,i}\) is the frequency of the word \(w\) in document \(i\).

  • Conversely, we can represent each document as an \(M\)-dimensional vector \(v_d\) where \(v_{d,j}\) is the frequency of the term \(t_j\) in document \(d\).

Finding the most relevant documents for a query \(q\) is equivalent to finding the most similar documents to the query vector \(v_q\).

  • Queries are also documents, so we can use the same vector representation for queries and documents.

  • Use the similarity of the query vector \(v_q\) to the document vectors \(v_d\) to rank the documents.

  • Documents are similar to queries if they have similar terms.

Term-Document Matrix#

A term-document matrix is a 2D matrix:

  • each row represents a term in the vocabulary

  • each column represents a document

  • each element is the frequency of the term in the document

  • Each column vector = a document

    • Each entry = the frequency of the term in the document

  • Each row vector = a term

    • Each entry = the frequency of the term in the document

Two documents are similar if their vectors are similar.

Two words are similar if their vectors are similar.

For information retrieval, the term-document matrix is useful because it allows us to represent documents as vectors and to compute the similarity between documents in terms of the words they contain, or the similarity between terms in terms of the documents they occur in.
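As a sketch of this idea (documents and counts invented for illustration), the code below builds a tiny term-document matrix and ranks the documents against a query by the cosine similarity of their column vectors:

```python
import numpy as np

# Rows = terms, columns = documents (invented counts).
terms = ["battle", "soldier", "fool", "wit"]
docs = ["doc1", "doc2", "doc3"]
M = np.array([
    [7, 1, 0],   # battle
    [2, 2, 0],   # soldier
    [0, 1, 5],   # fool
    [0, 1, 4],   # wit
], dtype=float)

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# A query is treated as a (very short) document over the same vocabulary.
query = np.array([1.0, 1.0, 0.0, 0.0])   # "battle soldier"

# Rank documents by the similarity of their column vectors to the query vector.
scores = {d: cosine(M[:, j], query) for j, d in enumerate(docs)}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Comparing row vectors instead (e.g. `M[0, :]` and `M[1, :]`) gives term-term similarity in terms of the documents the terms occur in.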

We can adapt this idea to implement a model of the distributional hypothesis if we treat each context as a column in the matrix and each word as a row.

What is a context?#

There are many ways to define a context:

Contexts defined by nearby words:

  • How often does the word \(w_i\) occur within a window of \(k\) words of the word \(w_j\)?

  • Or, how often do the words \(w_i\) and \(w_j\) occur in the same document or sentence?

  • This yields fairly broad thematic similarities between words.

Contexts defined by grammatical relations:

  • How often does the word \(w_i\) occur as the subject of the word \(w_j\)?

  • This requires a grammatical parser to identify the grammatical relations between words.

  • This yields more fine-grained similarities between words.

Using nearby words as contexts#

  1. Define a fixed vocabulary of \(N\) context words \(c_1 , \ldots , c_N\)

  • Context words should occur frequently enough in the corpus that you can get reliable counts.

  • However, you should ignore very frequent words (stopwords) like the and a because they are not very informative.

  2. Define what nearby means:

  • For example, we can define a window of \(k\) words on either side of the word \(w_j\).

  3. Count the number of times each context word \(c_i\) occurs within a window of \(k\) words of the word \(w_j\). (A sketch of steps 1–3 in code follows this list.)

  4. Define how to transform the co-occurrence counts into a vector representation of the word \(w_j\).

  • For example, we can use the (positive) PMI of the word \(w_j\) and the context word \(c_i\).

  5. Compute the similarity between words by measuring the similarity of their context vectors.

  • For example, we can use the cosine similarity of the context vectors.
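A minimal sketch of steps 1–3 (producing raw-frequency vectors, i.e. the simplest choice for step 4) on a tiny made-up corpus; the context vocabulary, window size, and sentences are all invented for illustration:

```python
from collections import defaultdict

# Step 1: a fixed vocabulary of context words (toy choice).
context_vocab = ["bottle", "table", "drunk", "corn", "likes"]

# A tiny, already tokenized and lowercased corpus (invented).
corpus = [
    "a bottle of tezguino is on the table".split(),
    "tezguino makes you drunk".split(),
    "we make tezguino out of corn".split(),
    "a bottle of wine is on the table".split(),
    "wine makes you drunk".split(),
]

# Step 2: "nearby" = within a window of k words on either side of the target.
k = 3

# Step 3: count how often each context word occurs in the window of each target word.
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for i, target in enumerate(sentence):
        window = sentence[max(0, i - k):i] + sentence[i + 1:i + 1 + k]
        for c in window:
            if c in context_vocab:
                counts[target][c] += 1

# Raw-frequency context vectors for two target words.
for w in ["tezguino", "wine"]:
    print(w, [counts[w][c] for c in context_vocab])
```

The two resulting vectors already look similar, which is exactly the point of the tezgüino example above.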

Word-Word Matrix#

Resulting word-word matrix:

  • \(f(w, c)\) = how often word \(w\) appears in context \(c\):

  • information appeared six times in the context of data

Defining co-occurrences:#

  • Within a fixed window: \(c_i\) occurs within \(\pm n\) words of \(w\)

  • Within the same sentence: requires sentence boundaries

  • By grammatical relations: \(c_i\) occurs as a subject/object/modifier/… of verb \(w\) (requires parsing - and separate features for each relation)

Representing co-occurrences:#

  • \(f_i\) as binary features (1,0): \(w\) does/does not occur with \(c_i\)

  • \(f_i\) as frequencies: \(w\) occurs \(n\) times with \(c_i\)

  • \(f_i\) as probabilities: e.g. \(f_i\) is the probability that \(c_i\) is the subject of \(w\).
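The three options above can all be derived from the same raw count matrix; a minimal sketch on invented counts (the probability version here is a simple row normalization, i.e. an estimate of \(P(c_i \mid w)\)):

```python
import numpy as np

# Invented co-occurrence counts f(w, c): rows = words, columns = contexts.
f = np.array([
    [3.0, 0.0, 1.0],
    [0.0, 5.0, 2.0],
])

binary = (f > 0).astype(int)                      # does w occur with c_i at all?
frequency = f                                     # how often w occurs with c_i
probability = f / f.sum(axis=1, keepdims=True)    # rows sum to 1: P(c_i | w)

print(binary)
print(probability)
```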

Getting co-occurrence counts#

Co-occurrence as a binary feature:

  • Does word \(w\) ever appear in the context \(c\)? (1 = yes/0 = no)

Co-occurrence as a frequency count:

  • How often does word \(w\) appear in the context \(c\)? (0,1,2,… times)

Counts vs PMI#

Sometimes, low co-occurrence counts are very informative, and high co-occurrence counts are not:

  • Any word is going to have relatively high co-occurrence counts with very common contexts (e.g. the, with, a), but this won’t tell us much about what that word means.

  • We need to identify when co-occurrence counts are higher than we would expect by chance.

We can use pointwise mutual information (PMI) values instead of raw frequency counts:

\[ PMI(w,c) = \log \frac{p(w,c)}{p(w)p(c)} \]

Computing PMI of \(w\) and \(c\):#

Using a fixed window of \(\pm k\) words#

  • \(N\): How many tokens does the corpus contain?

  • \(f(w) \le N\): How often does \(w\) occur?

  • \(f(w, c) \le f(w)\): How often does \(w\) occur with \(c\) in its window?

  • \(f(c) = \sum_w f(w, c)\): How many tokens have \(c\) in their window?

\[p(w) = \frac{f(w)}{N}, \quad p(c) = \frac{f(c)}{N}, \quad p(w,c) = \frac{f(w,c)}{N}\]
\[PMI(w,c) = \log \frac{p(w,c)}{p(w)p(c)}\]
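A sketch of this computation on an invented count matrix. Note a simplification: probabilities here are estimated from the co-occurrence matrix itself, i.e. \(N\) is taken to be the total number of (word, context) co-occurrence events rather than the corpus token count; the log base is arbitrary (natural log below) and does not affect which values are positive or negative.

```python
import numpy as np

# Invented co-occurrence counts f(w, c) within a +-k window:
# rows = target words, columns = context words.
words = ["digital", "information"]
contexts = ["computer", "data", "sugar"]
f_wc = np.array([
    [4.0, 1.0, 1.0],
    [3.0, 6.0, 0.0],
])

N = f_wc.sum()                                # total co-occurrence events
p_wc = f_wc / N                               # p(w, c)
p_w = f_wc.sum(axis=1, keepdims=True) / N     # p(w)
p_c = f_wc.sum(axis=0, keepdims=True) / N     # p(c)

with np.errstate(divide="ignore"):            # log(0) -> -inf for zero counts
    pmi = np.log(p_wc / (p_w * p_c))

print(np.round(pmi, 2))
```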

Positive Pointwise Mutual Information#

PMI is negative when words co-occur less than expected by chance.

  • This is unreliable without huge corpora:

  • With \(P(w_1) \approx P(w_2) \approx 10^{-6}\), we can’t estimate whether \(P(w_1, w_2)\) is significantly different from \(10^{-12}\)

We often just use positive PMI values, and replace all negative PMI values with 0:

Positive Pointwise Mutual Information (PPMI):

\[ \text{PPMI}(w, c) = \begin{cases} \text{PMI}(w, c) & \text{if } \text{PMI}(w, c) > 0 \\ 0 & \text{if } \text{PMI}(w, c) \le 0 \end{cases} \]
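Continuing in the same vein, the clipping step is a single elementwise maximum (toy PMI values invented; \(-\infty\) stands for a zero co-occurrence count):

```python
import numpy as np

# A toy PMI matrix (invented values; -inf marks a zero co-occurrence count).
pmi = np.array([
    [ 0.36, -1.03,  0.92],
    [-0.25,  0.36, -np.inf],
])

# PPMI: keep positive PMI values, replace everything else with 0.
ppmi = np.maximum(pmi, 0.0)
print(ppmi)
```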

PMI and smoothing#

PMI is biased towards infrequent events:

  • If \(P(w, c) = P(w) = P(c)\), then \(\text{PMI}(w, c) = \log (\frac{1}{P(w)})\)

  • So \(\text{PMI}(w, c)\) is larger for rare words \(w\) with low \(P(w)\).

Simple remedy: Add-k smoothing of \(P(w, c), P(w), P(c)\) pushes all PMI values towards zero.

  • Add-k smoothing affects low-probability events more, and will therefore reduce the bias of PMI towards infrequent events. (Pantel & Turney 2010)
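A small numeric sketch (invented counts, same simplified probability estimates as in the earlier PMI sketch) of how add-k smoothing sharply reduces the inflated PMI of a rare event while barely changing a frequent one:

```python
import numpy as np

def pmi_table(f_wc, k=0.0):
    """PMI from a co-occurrence count matrix, with optional add-k smoothing."""
    f = f_wc + k                                  # add-k smoothing of the counts
    N = f.sum()
    p_wc = f / N
    p_w = f.sum(axis=1, keepdims=True) / N
    p_c = f.sum(axis=0, keepdims=True) / N
    with np.errstate(divide="ignore"):            # log(0) -> -inf for zero counts
        return np.log(p_wc / (p_w * p_c))

# Invented counts: a rare (word, context) pair seen once vs. a frequent pair.
f_wc = np.array([
    [1.0, 0.0],     # rare word, rare context: a single co-occurrence
    [0.0, 99.0],    # frequent word, frequent context
])

print(np.round(pmi_table(f_wc, k=0.0), 2))   # rare pair gets a huge PMI (~4.6)
print(np.round(pmi_table(f_wc, k=1.0), 2))   # add-1: the rare pair drops sharply,
                                             # the frequent pair barely moves
```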

Dot product as similarity#

If the vectors consist of simple binary features (0,1), we can use the dot product as similarity metric:

\[ sim_{dot-prod}(\vec{x},\vec{y}) = \vec{x}\cdot\vec{y} = \sum_{i=1}^{N} x_i \times y_i \]

The dot product is a bad metric if the vector elements are arbitrary features: it prefers long vectors.

  • If one \(x_i\) is very large (and \(y_i\) nonzero), \(sim(x, y)\) gets very large

  • If the number of nonzero \(x_i\) and \(y_i\) is very large, \(sim(x, y)\) gets very large.

  • Both can happen with frequent words.

\[ \text{length of }\vec{x}: |\vec{x}|=\sqrt{\sum_{i=1}^{N}x_i^2} \]

Vector similarity: Cosine#

One way to define the similarity of two vectors is to use the cosine of their angle.

The cosine of two vectors is their dot product, divided by the product of their lengths:

\[ sim_{cos}(\vec{x},\vec{y})=\frac{\sum_{i=1}^{N} x_i \times y_i}{\sqrt{\sum_{i=1}^{N}x_i^2}\sqrt{\sum_{i=1}^{N}y_i^2}} = \frac{\vec{x}\cdot\vec{y}}{|\vec{x}||\vec{y}|} \]

\(sim_{cos}(\vec{x}, \vec{y}) = 1\): \(\vec{x}\) and \(\vec{y}\) point in the same direction

\(sim_{cos}(\vec{x}, \vec{y}) = 0\): \(\vec{x}\) and \(\vec{y}\) are orthogonal

\(sim_{cos}(\vec{x}, \vec{y}) = -1\): \(\vec{x}\) and \(\vec{y}\) point in opposite directions
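A minimal cosine implementation on invented count vectors; note that scaling a vector by a constant does not change the cosine, which is exactly what makes it preferable to the raw dot product here:

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Invented context-count vectors. "large" is 10x "digital" in every dimension,
# so their dot product is large, but their cosine is exactly 1 (same direction).
digital = np.array([2.0, 5.0, 0.0])
large   = np.array([20.0, 50.0, 0.0])
apricot = np.array([0.0, 1.0, 3.0])

print(cosine(digital, large))     # 1.0
print(cosine(digital, apricot))   # ~0.29: mostly different contexts
```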

Visualizing cosines#