Vector Representations and Embeddings
From Chapter 6: Vector Semantics and Embeddings in Speech and Language Processing, Dan Jurafsky and James H. Martin, 3rd ed.
- Words that occur in similar contexts tend to have similar meanings.
- This link between similarity in how words are distributed and similarity in their meanings is called the distributional hypothesis.
- Vector semantics instantiates the above linguistic hypothesis by learning representations of the meaning of words, called embeddings, directly from their distributions in texts.
- These representations are an example of representation learning, automatically learning useful representations of the input texts.
- Finding such self-supervised ways to learn representations of the input instead of manually creating them with feature engineering is an important focus of NLP research.
Lexical Semantics
- A model of word meaning should allow us to draw useful inferences that will help us solve meaning-related tasks.
- Helpful inferences can come in terms of words with similar meanings, antonyms, positive/negative connotations, etc.
- The linguistic study of word meanings is called lexical semantics.
- Lemma/citation form: the dictionary form of a word, i.e., sing is the lemma for sing, sang, sung.
- Lemma can be polysemous, or having multiple word senses, i.e., mouse can be either a rodent or a computer accessory.
- If one word has a sense that is (nearly) identical to a sense of another word, we say that the two senses of these two words are synonyms.
- Two words are also synonymous if substituting one for another doesn't change the truth conditions of the sentence. In this case, these two words have the same propositional meaning.
- While most words have few (if any) true synonyms, they have many similar words.
- The shift from synonymy to similarity is the shift from relations between word senses to relations between words.
- The notion of similarity is very useful in larger semantic tasks.
- Other than similarity, two words can also be connected via word relatedness; i.e., coffee is not similar to cup, but they are related to each other by co-participating in daily life.
- A common kind of relatedness between words is if they belong to the same semantic field, a set of words that cover a semantic domain and bear structured relations with each other.
- Connotation (affective meaning): The aspects of a word's meaning that are related to a writer or reader's emotions, sentiment, opinions, or evaluations.
- Positive or negative evaluation expressed through language is called sentiment.
Vector Semantics
- The current best model that effectively deals with word meaning is the vector semantics model.
- The meaning of a word may change, so instead of defining a word in a fixed logical language, vector semantics defines it by its usage contexts, drawn from how people actually speak and understand the word.
- Vector semantic models are practical, as they can be learned directly from text without complex labeling/supervision.
- Vector semantics combines two intuitions:
- Distributionalist intuition: defining a word by counting what other words occur in its environment
- Vector intuition: defining the meaning of a word as a vector, a list of numbers, a point in N-dimensional space
- Vectors for representing words are generally called embeddings
- Because the word is embedded in a particular vector space.
- Words with similar meanings tend to be close to each other in the vector space.
Words and Vectors
- Vector or distributional models of meaning are generally based on a co-occurrence matrix, a way of representing how often words co-occur.
- A term-document matrix records how often each vocabulary word occurs in each document of a corpus.
- Two similar documents tend to have similar words, and thus tend to have similar frequency column vectors.
- Information retrieval is the task of finding a document in a collection that best matches a particular query.
- Word-word (term-term/term-context) matrix: A record of the frequency that two words co-occur within the same context in some training corpus.
- The context can be as large as a document, or more commonly a certain window around the word itself (i.e., within n words to the left and right).
- Therefore, similar words tend to have similar vector representations in the matrix.
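Below is a minimal sketch of how such window-based word-word co-occurrence counts could be collected; the toy text, window size, and function name are illustrative assumptions, not part of the chapter.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each (word, context-word) pair co-occurs
    within `window` words to the left and right of the word."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

# toy usage
tokens = "sugar a sliced lemon a tablespoonful of apricot jam".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts[("apricot", "jam")])   # 1
```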
Cosine for measuring similarity
- To define similarity between two target words, we need a function that takes their two vectors and returns a measure of vector similarity.
- So far, the most common similarity metric is the cosine of the angle between the vectors.
- The measure is primarily concerned with the angle between the vectors, since the length of a word vector depends on the word's frequency (words with high frequencies have longer vectors).
- The cosine similarity metric between two word vectors $\mathbf{v}$ and $\mathbf{w}$ can be computed as:
  $$\text{cosine}(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v}\cdot\mathbf{w}}{|\mathbf{v}|\,|\mathbf{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$
- For raw frequency vectors, the cosine similarity ranges from 0 to 1.
- It is 1 for vectors pointing in the same direction and 0 for orthogonal vectors.
- Since raw frequency values are non-negative, the cosine of these vectors cannot go below 0.
- Note that the similarity between two words is calculated as some function of the dot product between their two vectors.
- The cosine is a normalized dot product.
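A minimal NumPy sketch of the cosine computation above; the toy vectors are illustrative.

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: the dot product normalized by the two vector lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# toy count vectors
v = np.array([4.0, 2.0, 0.0])
w = np.array([8.0, 4.0, 0.0])   # same direction as v
u = np.array([0.0, 0.0, 3.0])   # orthogonal to v
print(cosine(v, w))             # 1.0
print(cosine(v, u))             # 0.0
```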
TF-IDF Weighting
- There are words that frequently co-occur with many others but give no useful information about the words around them (e.g., the, it, or they).
- The tf-idf algorithm therefore weights each word by its frequency in the document (tf) times the inverse of the number of documents in the collection in which it occurs (idf).
- The tf-idf weighted value of word $t$ in document $d$ is $w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$, where
- $\mathrm{tf}_{t,d}$ is the term frequency, squashed by a log (to reduce the weight of frequently occurring words), calculated as $\mathrm{tf}_{t,d} = \log_{10}(\mathrm{count}(t,d) + 1)$, and
- $\mathrm{idf}_t$ is the inverse document frequency, calculated as $\mathrm{idf}_t = \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right)$, where
- $N$ is the total number of documents in the collection, and
- $\mathrm{df}_t$ is the number of documents in which term $t$ occurs.
- The model computes the similarity between two words by taking the cosine of their tf-idf vectors; in short, this is called the tf-idf model (see the weighting sketch at the end of this section).
- One common use for a tf-idf model is to compute word similarity, where each word is compared to several others to reveal the most similar words.
- The model can also be used to decide if two documents are similar, by taking the vectors of all the words in the document and computing the centroid of all those vectors.
- The resulting vector used for comparison is called the document vector.
- Pointwise mutual information (PMI) is a measure of how often two events occur together compared with what we would expect if they were independent.
- It compares how often two words co-occur in a corpus with how often we would expect them to co-occur by chance.
- The PMI between a target word $w$ and a context word $c$ is defined as $\mathrm{PMI}(w, c) = \log_2 \dfrac{P(w,c)}{P(w)P(c)}$.
- Since PMI values can be negative (and negative values are unreliable unless we have an enormous corpus), Positive PMI (PPMI) is used instead: $\mathrm{PPMI}(w, c) = \max\!\left(\log_2 \dfrac{P(w,c)}{P(w)P(c)},\, 0\right)$.
- PMI has a problem of being biased toward infrequent events (rare words tend to have higher values).
- One way to reduce this bias is to change the computation of $P(c)$, raising the probability of the context word to the power of $\alpha$ (commonly $\alpha = 0.75$): $P_\alpha(c) = \dfrac{\mathrm{count}(c)^\alpha}{\sum_{c'} \mathrm{count}(c')^\alpha}$, giving $\mathrm{PPMI}_\alpha(w, c) = \max\!\left(\log_2 \dfrac{P(w,c)}{P(w)P_\alpha(c)},\, 0\right)$.
- Another solution is Laplace smoothing.
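Below is a minimal sketch of both weighting schemes, tf-idf over a term-document count matrix and smoothed PPMI over a word-word co-occurrence matrix, using plain NumPy; the function names and matrix shapes are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def tf_idf(term_doc):
    """tf-idf weighting of a |V| x |D| term-document count matrix."""
    tf = np.log10(term_doc + 1)                      # log-squashed term frequency
    df = np.count_nonzero(term_doc, axis=1)          # documents containing each term
    idf = np.log10(term_doc.shape[1] / df)           # inverse document frequency
    return tf * idf[:, None]

def ppmi(cooc, alpha=0.75):
    """Smoothed positive PMI of a |V| x |V| word-word co-occurrence matrix."""
    total = cooc.sum()
    p_wc = cooc / total                              # joint probabilities P(w, c)
    p_w = cooc.sum(axis=1) / total                   # P(w)
    p_c = cooc.sum(axis=0) ** alpha                  # count(c)^alpha ...
    p_c = p_c / p_c.sum()                            # ... normalized to P_alpha(c)
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_wc / (p_w[:, None] * p_c[None, :]))
    return np.maximum(pmi, 0.0)                      # clamp negative values to 0
```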
Word2vec
- The previous methods generate and rely on long, sparse vectors. An alternative is to use short (50-1,000 dimensions) and dense (mostly non-zero values) vectors.
- Dense vectors tend to work better than sparse vectors in every NLP task, likely because the smaller feature space gives the downstream ML model fewer weights to learn and less opportunity to overfit.
- Dense vectors may also do a better job of capturing synonymy than sparse vectors (e.g., automobile and car occupy distinct dimensions of a sparse representation, so the model may fail to capture the relationship between them).
- One method for learning short, dense vectors is skip-gram with negative sampling (SGNS), one of the two algorithms in the word2vec package.
- Word2vec methods are fast, easy to train, and widely available online.
- The intuition of word2vec: instead of counting how often a word w occurs near another word such as catnip, the algorithm trains a binary classifier on the task of predicting whether w is likely to show up near catnip.
- The task itself is irrelevant, but the learned classifier weights are taken as embeddings.
- This makes the classifier incredibly simple to train, since running text can serve as implicitly supervised training data (a word that actually occurs near catnip acts as the gold 'correct answer' to the prediction question).
- Word2vec is much simpler than a neural language model:
- Word2vec simplifies the task, making it binary classification as opposed to prediction, and
- Word2vec simplifies the architecture, training a simple logistic regression classifier rather than a complex, multilayer neural net.
- The intuition of skip-gram is:
- Treat the target word and a neighboring context word as positive examples.
- Randomly sample other words in the lexicon to get negative samples.
- Use logistic regression to train a classifier to distinguish these two cases.
- Use the regression weights as the embeddings.
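A minimal sketch of steps 1-2 above, turning running text into labeled training pairs; the toy text, window size, and uniform noise sampling are illustrative simplifications (the weighted unigram distribution actually used for noise words is covered in a later section).

```python
import random

def sgns_training_pairs(tokens, vocab, window=2, k=2, seed=0):
    """Yield (target, context, label) triples: label 1 for words inside
    the context window, label 0 for k randomly sampled noise words."""
    rng = random.Random(seed)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            yield target, tokens[j], 1               # positive example
            for _ in range(k):
                noise = rng.choice(vocab)
                while noise == target:               # noise word must not be the target
                    noise = rng.choice(vocab)
                yield target, noise, 0               # negative example

# toy usage
tokens = "the cat sat on the mat".split()
pairs = list(sgns_training_pairs(tokens, sorted(set(tokens))))
```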
The Classifier
- The probability that $c$ is a real context word of target word $w$ is written $P(+\mid w, c)$, and is calculated from the similarity between the embeddings of $w$ and $c$.
- Intuition: a word is likely to occur near the target if its embedding is similar to the target's embedding.
- Remember that two vectors are similar if they have a high dot product (cosine is just a normalized dot product).
- In other words, $P(+\mid w, c) = \sigma(\mathbf{c}\cdot\mathbf{w}) = \dfrac{1}{1 + \exp(-\mathbf{c}\cdot\mathbf{w})}$, with $P(-\mid w, c) = 1 - P(+\mid w, c)$.
- The classifying probability is therefore the similarity measure squashed into a probability by the sigmoid function.
- Since we also have to take into account the multiple context words $c_{1:L}$ in the window, we calculate the probability with an independence assumption: $P(+\mid w, c_{1:L}) = \prod_{i=1}^{L} \sigma(\mathbf{c}_i\cdot\mathbf{w})$.
- We could thus calculate the probability if only we had embeddings for each target word and context word in the vocabulary.
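A minimal sketch of the classifier probability defined above; the embedding dimensionality and random toy vectors are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(w_vec, c_vecs):
    """P(+ | w, c_1..c_L): product over context words of the sigmoid
    of the dot product between target and context embeddings."""
    return float(np.prod([sigmoid(np.dot(w_vec, c)) for c in c_vecs]))

# toy usage with random 50-dimensional embeddings
rng = np.random.default_rng(0)
w = rng.normal(size=50)
contexts = rng.normal(size=(4, 50))    # L = 4 context words
print(p_positive(w, contexts))
```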
Learning skip-gram embeddings
- Word2vec learns the embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word $w$.
- The words within the context window of the target word are used to create the positive samples for the classifier.
- Skip-gram also uses more negative examples than positive examples, with the ratio set by a parameter $k$.
- For each training instance $(w, c_{pos})$ we create $k$ negative samples, each consisting of the target word $w$ plus a 'noise word'. A noise word is a random word from the lexicon which is not the target word $w$.
- The noise words are chosen according to their weighted unigram frequency $P_\alpha(w)$, where $\alpha$ is a weight.
- The weighted unigram frequency is calculated as $P_\alpha(w) = \dfrac{\mathrm{count}(w)^\alpha}{\sum_{w'} \mathrm{count}(w')^\alpha}$.
- Setting $\alpha = 0.75$ gives better performance since it gives rare noise words slightly higher probability: for rare $w$, $P_\alpha(w) > P(w)$.
- Given the set of positive and negative training instances, and an initial set of embeddings, the goal of the algorithm is to adjust those embeddings such that we
- Maximize the similarity of the target word, context word pairs drawn from the positive examples, and
- Minimize the similarity of the pairs drawn from the negative examples.
- We can express the above objective formally over the whole training set as maximizing $\sum_{(w,c) \in +} \log P(+\mid w, c) + \sum_{(w,c) \in -} \log P(-\mid w, c)$.
- If we look at one word/context pair $(w, c_{pos})$ together with its $k$ noise words $c_{neg_1}, \dots, c_{neg_k}$, the learning objective is to minimize the loss $L_{CE} = -\left[\log \sigma(\mathbf{c}_{pos}\cdot\mathbf{w}) + \sum_{i=1}^{k} \log \sigma(-\mathbf{c}_{neg_i}\cdot\mathbf{w})\right]$.
- We want to maximize the dot product of the word with the actual context words, and minimize the dot products of the word with k negative sampled neighbor words.
- We can then use stochastic gradient descent to train toward this objective (a sketch of one training step appears at the end of this section).
- Note that the skip-gram model actually learns two separate embeddings for each word $w$: the target embedding $\mathbf{w}$ and the context embedding $\mathbf{c}$.
- These embeddings are stored in two matrices, the target matrix $W$ and the context matrix $C$. The figure below demonstrates the intuition behind the learning task for the embeddings encoded in these two matrices.
- Just as in logistic regression, the algorithm starts with randomly initialized $W$ and $C$ matrices, and then walks through the corpus using gradient descent to move $W$ and $C$ so as to maximize the objective.
- $W$ and $C$ function as the parameters that logistic regression is tuning.
- Once the embeddings are learned, we will have two embeddings for each word $i$: $\mathbf{w}_i$ and $\mathbf{c}_i$.
- We can choose to throw away $C$ and keep $W$, meaning each word $i$ is represented by the vector $\mathbf{w}_i$.
- We can also add the two embeddings together, using $\mathbf{w}_i + \mathbf{c}_i$ as the new $d$-dimensional embedding, or concatenate them into a $2d$-dimensional embedding.
*(Figure: the target matrix $W$ and the context matrix $C$ learned by skip-gram; not reproduced here.)*
- The context window size affects the performance of skip-gram embeddings, so it is often tuned on a devset.
- One difference from count-based methods is that for skip-grams, the larger the window size the more computation the algorithm requires for training.
- Since more neighboring words must be predicted.
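A minimal sketch of one stochastic gradient step on the loss above, updating the target matrix $W$ and context matrix $C$ in place; the learning rate, dimensionality, and toy indices are illustrative assumptions rather than the reference word2vec implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, w_id, pos_id, neg_ids, lr=0.025):
    """One SGD step on L = -[log s(c_pos.w) + sum_i log s(-c_neg_i.w)],
    where s is the sigmoid; updates rows of W and C in place."""
    w, c_pos, c_negs = W[w_id], C[pos_id], C[neg_ids]
    g_pos = sigmoid(c_pos @ w) - 1.0            # gradient scale for the positive pair
    g_negs = sigmoid(c_negs @ w)                # gradient scales for the noise pairs
    grad_w = g_pos * c_pos + g_negs @ c_negs    # dL/dw
    C[pos_id] -= lr * g_pos * w                 # dL/dc_pos
    C[neg_ids] -= lr * np.outer(g_negs, w)      # dL/dc_neg_i
    W[w_id] -= lr * grad_w

# toy usage: vocabulary of 6 words, 50-dimensional embeddings
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 50))
C = rng.normal(scale=0.1, size=(6, 50))
sgns_step(W, C, w_id=0, pos_id=1, neg_ids=np.array([3, 4]))

# after training, word i can be represented by W[i], by W[i] + C[i],
# or by the concatenation of W[i] and C[i]
```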
Semantic properties of embeddings
- Vector semantic models have a number of parameters.
- Context window size is a parameter relevant to both sparse tf-idf vectors and dense word2vec vectors.
- This window is typically 1-10 words on each side of the target word (for a total context of 3-20 words).
- The choice depends on the goals of the representation.
- Shorter context windows tend to lead to representations that are a bit more syntactic (since the information is coming from immediately nearby words).
- Long context windows tend to lead to words that are topically related to the target word, but not similar.
- It is often useful to distinguish two kinds of similarity or association between words.
- Two words have first-order co-occurrence (syntagmatic association) if they are typically near each other (e.g., wrote is a first-order associate of book or poem).
- Two words have second-order co-occurrence (paradigmatic association) if they have similar neighbors (e.g., wrote is a second-order associate of words like said or remarked).
- Analogy: Another semantic property of embeddings is their ability to capture relational meanings.
- The offsets between vector embeddings can capture some analogical relations between words.
- For example, the result of the expression vector('king') - vector('man') + vector('woman') is a vector close to vector('queen') (see the sketch at the end of this section).
- Embeddings and Historical Semantics: Embeddings can also be a useful tool for studying how meaning changes over time, by computing multiple embedding spaces, each from texts written in a particular time period (see the figure in the textbook).
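A minimal sketch of the parallelogram (offset) method for the analogy task mentioned above; the embedding matrix, vocabulary, and helper names are illustrative assumptions, and the input words are excluded from the answer candidates as is common practice.

```python
import numpy as np

def analogy(a, b, c, embeddings, vocab):
    """Solve 'a is to b as c is to ?' via the offset b - a + c,
    returning the word whose embedding has the highest cosine with it."""
    idx = {word: i for i, word in enumerate(vocab)}
    query = embeddings[idx[b]] - embeddings[idx[a]] + embeddings[idx[c]]
    sims = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    )
    for i in np.argsort(-sims):                  # best candidates first
        if vocab[i] not in {a, b, c}:            # skip the input words
            return vocab[i]

# usage (assuming `embeddings` and `vocab` were learned elsewhere):
# analogy("man", "king", "woman", embeddings, vocab)   # ideally -> "queen"
```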

Bias and Embeddings
- In addition to their ability to learn word meaning from text, embeddings can also reproduce the implicit biases and stereotypes that were latent in the text.
- Recall that embeddings model analogical relations; 'queen' as the closest word to 'king' - 'man' + 'woman' implies the analogy man-woman::king-queen.
- Embedding analogies also exhibit gender stereotypes.
- For example, the closest occupation to 'computer programmer' - 'man' + 'woman' is 'homemaker', according to word2vec embeddings trained on news text.
- Another example suggests the analogy 'father' is to 'doctor' as 'mother' is to 'nurse'.
- This means algorithms that use embeddings as part of a search for potential programmers or doctors may incorrectly downweight documents with women's names.
- Embeddings also encode the implicit associations that are a property of human reasoning.
- Findings from real-world implicit association studies have been replicated using GloVe vectors and cosine similarity.
- For example, African-American names had a higher GloVe cosine with unpleasant words.
- Thus, any embedding-aware algorithm that made use of word sentiment could lead to bias.
- Historical embeddings are also being used to measure biases in the past.
- Recent research focuses on ways to try to remove these kinds of biases (debiasing).
- For example, developing a transformation of the embedding space that removes gender stereotypes but preserves definitional gender.
- Debiasing may reduce bias in embeddings but does not eliminate it, and this remains an open problem.
Evaluating Vector Models
- Extrinsic evaluation on NLP tasks is the most important evaluation metric for vector models.
- Evaluate vector models by adding them as features into any NLP task and seeing whether this improves performance over some other model.
- Intrinsic evaluation is important nevertheless; the most common approach is to test the vector model's performance on word similarity.
- This involves computing the correlation between an algorithm's word similarity scores and word similarity ratings assigned by humans (a sketch of this computation appears at the end of this section).
- WordSim-353 is a set of ratings from 0 to 10 for 353 noun pairs (e.g., (plane, car) has an average rating of 5.77).
- SimLex-999 quantifies similarity between words rather than relatedness, and includes both concrete and abstract adjective, noun, and verb pairs.
- The TOEFL dataset is a set of questions, each asking for the correct synonym of a target word among 4 choices.
- All of these datasets present words without context.
- Intrinsic similarity tasks that include context are slightly more realistic.
- The semantic textual similarity task evaluates the performance of sentence-level similarity algorithms, consisting of a set of pairs of sentences, each pair with human-labeled similarity scores.
- Another task used to evaluate is an analogy task.
- The problems are of the form 'a is to b as c is to d', where the system is given a, b, and c and must find d.
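A minimal sketch of the word-similarity intrinsic evaluation: score each rated pair with the model's cosine and correlate the scores with the human ratings using Spearman's rank correlation (SciPy is an assumed dependency; the data layout is an illustrative assumption).

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_word_similarity(rated_pairs, embeddings, vocab_index):
    """rated_pairs: iterable of (word1, word2, human_score) triples.
    Returns the Spearman correlation between model cosines and human ratings."""
    model_scores, human_scores = [], []
    for w1, w2, score in rated_pairs:
        v, w = embeddings[vocab_index[w1]], embeddings[vocab_index[w2]]
        model_scores.append(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))
        human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```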
GloVe
Global Vectors for Word Representation (GloVe) method, from GloVe Word Embeddings, Pennington et al., 2014.
- GloVe, or Global Vectors, is a model that directly captures the global corpus statistics.
- In comparison, the word2vec model captures only the local context window.
- GloVe combines the advantages of global count-based methods (such as tf-idf / LSA) and local context-window methods (such as skip-gram word2vec).
- The model leverages statistical information by training only on the non-zero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.
Notation
- $X$: the word-word co-occurrence count matrix
- $X_{ij}$: the number of times word $j$ occurs in the context of word $i$
- $X_i = \sum_k X_{ik}$: the number of times any word appears in the context of word $i$
- $b_i$, $\tilde{b}_j$: scalar biases for the target word $i$ and the context word $j$
- $w$, $\tilde{w}$: word vectors and separate context word vectors, respectively
- $V$: the size of the vocabulary
Method
- A weighted least squares regression model is used for this task. Introducing a weighting function $f(X_{ij})$ into the cost function gives us the model:
  $$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$
- The weighting function should be non-decreasing so that rare co-occurrences are not overweighted, and should be relatively small for large input values so that frequent co-occurrences are not overweighted.
- This is to prevent the algorithm from learning only extremely common word pairs.
- A class of functions that works well, and is therefore chosen for the algorithm, is
  $$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise,} \end{cases}$$
  with the paper fixing $x_{\max} = 100$ and $\alpha = 3/4$.
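A minimal sketch of the weighting function and cost above; the parameter defaults follow the paper ($x_{\max} = 100$, $\alpha = 3/4$), while the function names and the plain loop over non-zero entries are illustrative rather than an optimized implementation.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): (x / x_max)^alpha for x < x_max, capped at 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_cost(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares cost J, summed over the non-zero entries of X."""
    cost = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        cost += glove_weight(X[i, j]) * diff ** 2
    return cost
```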