# Note for Word2vec paper (Mikolov et al., Efficient Estimation of Word Representations in Vector Space, 2013)
* Previous work on word vectors mostly used Latent Semantic Analysis (LSA)
* This paper uses **neural networks** to learn these vectors, which was the main breakthrough at the time; the neural models produce significantly better word vectors than LSA at much lower computational cost.
* Mikolov also uses **hierarchical softmax**, representing the vocabulary as a **Huffman binary tree**, which speeds up the output-layer computation (see the Huffman sketch after this list).
* Training uses mini-batch asynchronous gradient descent with **Adagrad** (adaptive per-parameter learning rates) on the DistBelief framework; a sketch of the update follows this list.
* The work was later extended in a [NIPS 2013](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) paper with several improvements to the original Skip-gram model. There, a simplified variant of Noise Contrastive Estimation (NCE), **Negative Sampling**, is used to train the Skip-gram model (a sketch follows this list).
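Hierarchical softmax replaces the flat softmax over the vocabulary with a walk down a binary tree; word2vec builds a Huffman tree so frequent words sit near the root. Below is a minimal, self-contained sketch of how such codes can be derived (toy word counts and a standard heapq construction, not the paper's actual C code):

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build Huffman binary codes from a {word: count} dict.
    Frequent words get short codes, so the hierarchical-softmax
    path from root to leaf is short for common words."""
    counter = itertools.count()  # tie-breaker so heapq never compares dicts
    heap = [(count, next(counter), {word: ""}) for word, count in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in left.items()}
        merged.update({w: "1" + code for w, code in right.items()})
        heapq.heappush(heap, (c1 + c2, next(counter), merged))
    return heap[0][2]

# Toy counts (made up): "the" gets the shortest code.
print(huffman_codes({"the": 100, "cat": 20, "sat": 15, "on": 30, "mat": 5}))
```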
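For reference, a bare numpy sketch of the generic Adagrad update mentioned above; the learning rate and shapes are arbitrary, and none of the DistBelief / mini-batch machinery is shown:

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.025, eps=1e-8):
    """One Adagrad update: each parameter gets its own effective learning
    rate, which shrinks for dimensions with large accumulated gradients."""
    accum += grads ** 2                              # running sum of squared gradients
    params -= lr * grads / (np.sqrt(accum) + eps)
    return params, accum

# Toy usage: a 3-dimensional parameter vector and one gradient step.
w, acc = np.zeros(3), np.zeros(3)
g = np.array([0.1, -0.2, 0.05])
w, acc = adagrad_step(w, g, acc)
print(w)
```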
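A small numpy sketch of the negative-sampling objective from the follow-up paper: raise sigma(u_o . v_c) for the observed (center, context) pair and lower it for k sampled noise words. The vectors and the number of negative samples here are toy placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_center, u_context, u_negatives):
    """Skip-gram negative-sampling loss for one (center, context) pair:
    -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)."""
    pos = np.log(sigmoid(u_context @ v_center))
    neg = np.sum(np.log(sigmoid(-u_negatives @ v_center)))
    return -(pos + neg)

# Toy vectors: one positive context word and 5 negative samples in 50 dims.
rng = np.random.default_rng(0)
v_c = rng.normal(size=50)
u_o = rng.normal(size=50)
u_neg = rng.normal(size=(5, 50))
print(neg_sampling_loss(v_c, u_o, u_neg))
```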
Base architectures (prior models used for comparison):
* **Feed-forward neural network language model (NNLM):** input, projection, hidden, and output layers (see the sketch after this list)
* **Recurrent neural net language model (RNNLM):** input, hidden, and output layers (no projection layer)
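To make the NNLM layer ordering concrete, here is a rough numpy sketch of a forward pass (context word ids, a shared projection matrix, a tanh hidden layer, a softmax output); the sizes are invented and details such as direct input-to-output connections are omitted:

```python
import numpy as np

V, D, H, N = 10_000, 100, 500, 4        # vocab, embedding, hidden sizes, context length
C = np.random.randn(V, D) * 0.01        # projection (embedding) matrix, shared across positions
W_h = np.random.randn(N * D, H) * 0.01  # projection -> hidden
W_o = np.random.randn(H, V) * 0.01      # hidden -> output

def nnlm_forward(context_ids):
    """Predict a distribution over the next word from N previous word ids."""
    x = C[context_ids].reshape(-1)       # concatenate the N projected context words
    h = np.tanh(x @ W_h)                 # non-linear hidden layer
    logits = h @ W_o
    p = np.exp(logits - logits.max())
    return p / p.sum()                   # softmax over the full vocabulary

print(nnlm_forward([12, 7, 345, 9]).shape)  # (10000,)
```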
Proposed models:
* **Continuous Bag-of-Words (CBOW)** model
Predicts ONE word (the center word) from the surrounding context words. The input (context) vectors are averaged, so word order does not influence the projection. The non-linear hidden layer is removed and the projection layer is shared for all words; see the sketch after this list.
* Continuous **Skip-gram** model
Predicts the surrounding (multiple) context words from the word in the middle; see the sketch after this list.
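A short sketch contrasting the two proposed models under toy assumptions (made-up vocabulary size and vectors, window of 2): CBOW averages the context vectors into a single projection, while Skip-gram expands each center word into several (center, context) training pairs:

```python
import numpy as np

def cbow_hidden(context_ids, embeddings):
    """CBOW projection: average the context vectors, so word order is lost."""
    return embeddings[context_ids].mean(axis=0)

def skipgram_pairs(token_ids, window=2):
    """Skip-gram training pairs: each center word predicts its neighbours."""
    pairs = []
    for i, center in enumerate(token_ids):
        for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs

emb = np.random.randn(10, 8)                 # toy 10-word vocab, 8-dim vectors
print(cbow_hidden([1, 2, 4, 5], emb).shape)  # (8,) averaged context vector
print(skipgram_pairs([0, 1, 2, 3, 4], window=2)[:6])
```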
Questions:
* How is the training data prepared?