# Note for Word2vec paper (Mikolov et al., Efficient Estimation of Word Representations in Vector Space, 2013)
* Previous work on word vectors mostly used Latent Semantic Analysis (LSA)
* This paper uses **neural networks** to learn these vectors, which was the main breakthrough at the time; the neural models produce significantly better word vectors than LSA at much lower computational cost.
* Mikolov also uses **hierarchical softmax**, representing the vocabulary as a **Huffman binary tree**, which speeds up the output-layer computation (see the Huffman sketch after this list).
* Training uses mini-batch asynchronous gradient descent with **Adagrad** (adaptive per-parameter learning rates) on the DistBelief framework; a sketch of the update follows this list.
* The work was later extended in a [NIPS 2013](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) paper with several improvements to the original Skip-gram model. There, a simplified variant of Noise Contrastive Estimation (NCE), **Negative Sampling**, is used to train the Skip-gram model (a sketch follows this list).
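Hierarchical softmax replaces the flat softmax over the vocabulary with a walk down a binary tree; word2vec builds a Huffman tree so frequent words sit near the root. Below is a minimal, self-contained sketch of how such codes can be derived (toy word counts and a standard heapq construction, not the paper's actual C code):

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build Huffman binary codes from a {word: count} dict.
    Frequent words get short codes, so the hierarchical-softmax
    path from root to leaf is short for common words."""
    counter = itertools.count()  # tie-breaker so heapq never compares dicts
    heap = [(count, next(counter), {word: ""}) for word, count in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in left.items()}
        merged.update({w: "1" + code for w, code in right.items()})
        heapq.heappush(heap, (c1 + c2, next(counter), merged))
    return heap[0][2]

# Toy counts (made up): "the" gets the shortest code.
print(huffman_codes({"the": 100, "cat": 20, "sat": 15, "on": 30, "mat": 5}))
```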
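For reference, a bare numpy sketch of the generic Adagrad update mentioned above; the learning rate and shapes are arbitrary, and none of the DistBelief / mini-batch machinery is shown:

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.025, eps=1e-8):
    """One Adagrad update: each parameter gets its own effective learning
    rate, which shrinks for dimensions with large accumulated gradients."""
    accum += grads ** 2                              # running sum of squared gradients
    params -= lr * grads / (np.sqrt(accum) + eps)
    return params, accum

# Toy usage: a 3-dimensional parameter vector and one gradient step.
w, acc = np.zeros(3), np.zeros(3)
g = np.array([0.1, -0.2, 0.05])
w, acc = adagrad_step(w, g, acc)
print(w)
```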
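A small numpy sketch of the negative-sampling objective from the follow-up paper: raise sigma(u_o . v_c) for the observed (center, context) pair and lower it for k sampled noise words. The vectors and the number of negative samples here are toy placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_center, u_context, u_negatives):
    """Skip-gram negative-sampling loss for one (center, context) pair:
    -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)."""
    pos = np.log(sigmoid(u_context @ v_center))
    neg = np.sum(np.log(sigmoid(-u_negatives @ v_center)))
    return -(pos + neg)

# Toy vectors: one positive context word and 5 negative samples in 50 dims.
rng = np.random.default_rng(0)
v_c = rng.normal(size=50)
u_o = rng.normal(size=50)
u_neg = rng.normal(size=(5, 50))
print(neg_sampling_loss(v_c, u_o, u_neg))
```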
Base architectures (prior models used for comparison):
* **Feed-forward neural network language model (NNLM):** input, projection, hidden, and output layers (see the sketch after this list)
* **Recurrent neural net language model (RNNLM):** input, hidden, and output layers (no projection layer)
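To make the NNLM layer ordering concrete, here is a rough numpy sketch of a forward pass (context word ids, a shared projection matrix, a tanh hidden layer, a softmax output); the sizes are invented and details such as direct input-to-output connections are omitted:

```python
import numpy as np

V, D, H, N = 10_000, 100, 500, 4        # vocab, embedding, hidden sizes, context length
C = np.random.randn(V, D) * 0.01        # projection (embedding) matrix, shared across positions
W_h = np.random.randn(N * D, H) * 0.01  # projection -> hidden
W_o = np.random.randn(H, V) * 0.01      # hidden -> output

def nnlm_forward(context_ids):
    """Predict a distribution over the next word from N previous word ids."""
    x = C[context_ids].reshape(-1)       # concatenate the N projected context words
    h = np.tanh(x @ W_h)                 # non-linear hidden layer
    logits = h @ W_o
    p = np.exp(logits - logits.max())
    return p / p.sum()                   # softmax over the full vocabulary

print(nnlm_forward([12, 7, 345, 9]).shape)  # (10000,)
```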
Proposed models:
* **Continuous Bag-of-Words (CBOW)** model
Predicts ONE word (the center word) from the surrounding context words. The input (context) vectors are averaged, so word order does not influence the projection. The non-linear hidden layer is removed and the projection layer is shared for all words; see the sketch after this list.
* Continuous **Skip-gram** model
Predicts the surrounding (multiple) context words from the word in the middle; see the sketch after this list.
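A short sketch contrasting the two proposed models under toy assumptions (made-up vocabulary size and vectors, window of 2): CBOW averages the context vectors into a single projection, while Skip-gram expands each center word into several (center, context) training pairs:

```python
import numpy as np

def cbow_hidden(context_ids, embeddings):
    """CBOW projection: average the context vectors, so word order is lost."""
    return embeddings[context_ids].mean(axis=0)

def skipgram_pairs(token_ids, window=2):
    """Skip-gram training pairs: each center word predicts its neighbours."""
    pairs = []
    for i, center in enumerate(token_ids):
        for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs

emb = np.random.randn(10, 8)                 # toy 10-word vocab, 8-dim vectors
print(cbow_hidden([1, 2, 4, 5], emb).shape)  # (8,) averaged context vector
print(skipgram_pairs([0, 1, 2, 3, 4], window=2)[:6])
```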
Questions:
* How is the training data prepared?