Word2Vec

We previously learned about using co-occurrence counts and latent semantic analysis to create count-based word vectors. Here we learn a method based on neural networks that is conceptually similar but has many technical advantages, not least that it can be trained on much larger corpora. This method is prediction-based in the sense that it learns to predict a word from its context (the continuous bag-of-words model) or the context from a word (the skip-gram model).
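
To make the two prediction setups concrete, the sketch below builds the (input, target) training pairs each model would use from a single sentence. It is purely illustrative: the example sentence and window size are arbitrary assumptions, not taken from the papers.

```python
# Illustrative sketch of how CBOW and skip-gram frame the prediction task.
# The sentence and window size are arbitrary assumptions for this example.

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # number of context words taken on each side of the center word

cbow_pairs = []      # CBOW: (context words) -> center word
skipgram_pairs = []  # skip-gram: center word -> one context word at a time

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))
    skipgram_pairs.extend((center, c) for c in context)

print(cbow_pairs[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```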

You should read the following introductory material after we have gone over the main ideas in class.

References

Original Research Articles

  • The paper that introduced the continuous bag-of-words and skip-gram models is Efficient Estimation of Word Representations in Vector Space (2013).

    • "The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary."
    • "words can have multiple degrees of similarity"
    • "most of the complexity is caused by the non-linear hidden layer in the model. While this is what makes neural networks so attractive, we decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can possibly be trained on much more data efficiently."
    • The results were tested on a word-analogy task of the form Athens is to Greece as Oslo is to ?.

      To pass the test, the vector Oslo - Athens + Greece must be closer to Norway than to any other word vector; a small sketch of this evaluation follows the reference list. This idea had appeared earlier in Linguistic Regularities in Continuous Space Word Representations.
  • word2vec was introduced in Distributed Representations of Words and Phrases and their Compositionality (2013), which refines the skip-gram model from the previous paper. This was the first time that word embeddings were computed for millions of words by training on billions of words. The paper also shows that word2vec produces representations with a linear structure that makes precise analogical reasoning possible. See McCormick's Word2Vec Tutorial Part 2 for a summary of the novel techniques introduced in this paper.

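The analogy test mentioned above can be made concrete in a few lines of code. The sketch below is a hedged illustration only: it assumes a dictionary `vectors` mapping words to pre-trained embedding arrays, which is not part of either paper's published code, and it uses cosine similarity to pick the nearest word.

```python
# Hedged sketch of the analogy test: vec(Oslo) - vec(Athens) + vec(Greece) should be
# closer to vec(Norway) than to any other word vector. The `vectors` dict of
# pre-trained embeddings is an assumption made for illustration.

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, vectors):
    """Return the word whose embedding is closest to vec(a) - vec(b) + vec(c)."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Usage (assuming `vectors` holds trained word2vec embeddings as numpy arrays):
# analogy("Oslo", "Athens", "Greece", vectors)  # expected: "Norway"
```
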
Secondary Literature

Code