---
tags: deeplearning, cs224n
---
# CS224N (2019) Lecture 1 Further Readings
## Gradient for Skip-Gram Model (p30)
For a center word $c$ and a context word $o$, how do we calculate $\frac{\partial{}}{\partial{v_{c}}}\log{p(o|c)}$?
$\frac{\partial{}}{\partial{v_{c}}}\log{p(o|c)}$
$= \frac{\partial{}}{\partial{v_{c}}}\log{\frac{exp(u_{o}^{T}v_{c})}{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})}}$
$= \frac{\partial{}}{\partial{v_{c}}}\log{exp(u_{o}^{T}v_{c})} - \frac{\partial{}}{\partial{v_{c}}}\log{{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})}}$
(Rule for logarithm division)
$= \frac{\partial{}}{\partial{v_{c}}}u_{o}^{T}v_{c} - \frac{\partial{}}{\partial{v_{c}}}\log{{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})}}$
(Logarithm and exponential cancel out)
$= u_{o} - \frac{\partial{}}{\partial{v_{c}}}\log{{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})}}$
(Useful basics in p28)
$= u_{o} - \frac{1}{{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})}}\frac{\partial{}}{\partial{v_{c}}}{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})}$
(Chain rule: $\frac{\partial{y}}{\partial{v_{c}}} = \frac{\partial{y}}{\partial{z}}\frac{\partial{z}}{\partial{v_{c}}}, y = \log{z}, z = {\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})}, \frac{\partial{y}}{\partial{z}} = \frac{\partial{}}{\partial{z}}\log{z} = \frac{1}{z}$)
$= u_{o} - \frac{1}{{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})}}{\sum_{w=1}^{V}\frac{\partial{}}{\partial{v_{c}}}exp(u_{w}^{T}v_{c})}$
$= u_{o} - \frac{1}{{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})}}{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})\frac{\partial{}}{\partial{v_{c}}}u_{w}^{T}v_{c}}$
(Chain rule: $\frac{\partial{y}}{\partial{v_{c}}} = \frac{\partial{y}}{\partial{z}}\frac{\partial{z}}{\partial{v_{c}}}, y = \exp(z), z = u_{w}^{T}v_{c}, \frac{\partial{y}}{\partial{z}} = \frac{\partial{}}{\partial{z}}\exp(z) = exp(z)$)
$= u_{o} - \frac{1}{{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})}}{\sum_{w=1}^{V}exp(u_{w}^{T}v_{c})u_{w}}$
(Useful basics in p28)
$= u_{o} - \sum_{w=1}^{V}\frac{exp(u_{w}^{T}v_{c})}{{\sum_{x=1}^{V}exp(u_{x}^{T}v_{c})}}u_{w}$
(Rearrange; the denominator's summation index is renamed to $x$ so it does not clash with the outer sum over $w$)
$= u_{o} - \sum_{w=1}^{V}p(w|c)u_{w}$
(From the definition: $p(w|c) = \frac{exp(u_{w}^{T}v_{c})}{{\sum_{x=1}^{V}exp(u_{x}^{T}v_{c})}}$)
**Conclusion**
The gradient is the observed representation of the context word minus what the model thinks the context should look like. What the model expects the context to look like is the average of the representations of all words in the vocabulary, weighted by their predicted probabilities $p(w|c)$.
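As a sanity check, here is a minimal NumPy sketch of this result (the vocabulary size, matrix `U`, vector `v_c`, and index `o` are illustrative, not from the lecture); it compares the analytic gradient $u_{o} - \sum_{w}p(w|c)u_{w}$ with a numerical estimate:
```python
import numpy as np

np.random.seed(0)
V, d = 10, 5                      # vocabulary size, embedding dimension
U = np.random.randn(V, d)         # output vectors u_w, one row per word
v_c = np.random.randn(d)          # input vector of the center word c
o = 3                             # index of the observed context word

# softmax over the scores u_w^T v_c gives p(w|c) for every w
scores = U @ v_c
p = np.exp(scores - scores.max())
p /= p.sum()

# analytic gradient: u_o minus the probability-weighted average of all u_w
grad = U[o] - p @ U

# numerical check of d/dv_c log p(o|c) via central differences
def log_p_o_given_c(v):
    s = U @ v
    return s[o] - np.log(np.sum(np.exp(s)))

eps = 1e-6
num_grad = np.array([
    (log_p_o_given_c(v_c + eps * np.eye(d)[i]) -
     log_p_o_given_c(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(grad, num_grad, atol=1e-5))   # should print True
```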
## Continuous Bag of Words (CBOW) Model (p32)
*Source: [cs224 lecture 1 note](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf)*
Predicting a center word from the surrounding context.
For each word, we want to learn 2 vectors:
- $v$: (input vector) when the word is in the context
- $u$: (output vector) when the word is in the center
**Notation**

**Model Architecture**

**Forward Feed**

**Loss Function**
Use cross entropy $H(\hat{y}, y)$ as our loss function, where $\hat{y}$ is the model's predicted distribution over the vocabulary and $y$ is the true distribution:
$H(\hat{y}, y) = -\sum_{j=1}^{|V|}y_{j}\log(\hat{y}_{j})$
Because $y$ is a one-hot vector, where $y_{c} = 1$ and all other entries are 0, the loss can be simplified as:
$H(\hat{y}, y) = -y_{c}\log(\hat{y}_{c}) = -\log(\hat{y}_{c})$
So, for center word $c$ with averaged context vector $\hat{v}$, what we want to minimize is:
$-\log{P(w_{c}|w_{c-m},\dots,w_{c-1},w_{c+1},\dots,w_{c+m})} = -u_{c}^{T}\hat{v} + \log{\sum_{j=1}^{|V|}exp(u_{j}^{T}\hat{v})}$
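A minimal NumPy sketch of this forward pass and loss, assuming the context vectors are simply averaged (the matrices and indices here are made up for illustration):
```python
import numpy as np

np.random.seed(0)
V_size, d = 10, 5                     # vocabulary size, embedding dimension
V_in = np.random.randn(d, V_size)     # input vectors v_w (one column per word)
U_out = np.random.randn(V_size, d)    # output vectors u_w (one row per word)

context = [1, 2, 4, 7]                # indices of the surrounding words
c = 3                                 # index of the center word to predict

# average the input vectors of the context words to get v_hat
v_hat = V_in[:, context].mean(axis=1)

# scores and softmax give the predicted distribution y_hat over the vocabulary
z = U_out @ v_hat
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()

# cross entropy with a one-hot y reduces to -log y_hat[c]
loss = -np.log(y_hat[c])
print(loss)
```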

## Evaluation of Word2Vec
*Source: [word2vec paper](http://arxiv.org/pdf/1301.3781.pdf)*
The paper defines a comprehensive test set that contains five types of semantic questions and nine types of syntactic questions.
Example table:

The questions in each category were created in two steps:
1. a list of similar word pairs was created manually
2. a large list of questions was formed by connecting two word pairs
A question is assumed to be correctly answered only if the closest word to the vector computed for the question (see the sketch below) is exactly the same as the correct word in the question; synonyms are thus counted as mistakes.
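The paper answers a question such as "big is to biggest as small is to ?" with vector arithmetic: compute `X = vector("biggest") - vector("big") + vector("small")` and return the word whose vector is closest to `X` by cosine similarity. A minimal sketch of that check, using hypothetical toy vectors:
```python
import numpy as np

def answer_analogy(a, b, c, vectors):
    """Return the word closest to vector(b) - vector(a) + vector(c),
    excluding the three query words (an exact match is required, so
    synonyms still count as mistakes)."""
    x = vectors[b] - vectors[a] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, v in vectors.items():
        if word in (a, b, c):
            continue
        sim = np.dot(x, v) / (np.linalg.norm(x) * np.linalg.norm(v))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# hypothetical toy vectors just to exercise the function
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in
           ["big", "biggest", "small", "smallest", "france", "paris"]}
print(answer_analogy("big", "biggest", "small", vectors))
```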
Example evaluation result:

## Hierarchical Softmax
*Source: [cs224 lecture 1 note](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf)*
Hierarchical softmax is an efficient way to compute the probability of a word given another word, without normalizing over the entire vocabulary.
The method uses a binary tree to represent all the words in the vocabulary:
- Each leaf of the tree is a word, and there is a unique path from root to leaf
- No output representation for words
- Each node of the tree (except the root and the leaves) is associated with a vector that the model is going to learn

The probability of a word $w$ given a word $w_{i}$, $P(w|w_{i})$, is equal to the probability of a random walk starting at the root and ending in the leaf node corresponding to $w$.
Example: ([source](https://www.quora.com/What-is-hierarchical-softmax))

$$P(\text{I'm}|C) = 0.57 \times 0.68 \times 0.72 = 0.279072$$
The computing cost is only $O(\log{|V|})$, which is more efficient than the original softmax's $O(|V|)$.
**Notation**
- $L(w)$: the number of nodes in the path from the root to the leaf $w$
- $n(w, i)$: the i-th node on this path with associated vector $v_{n(w,i)}$
- $ch(n)$: the left child node for inner node $n$
**Formula**
$P(w|w_{i}) = \prod_{j=1}^{L(w)-1}\sigma([n(w, j+1) = ch(n(w, j))] \cdot v_{n(w,j)}^{T}v_{w_{i}})$
where $[x] = 1$ if $x$ is true and $-1$ otherwise, and $\sigma$ is the sigmoid function.
There are some properties:
1. For any value of $x$: $\sigma(x) + \sigma(-x) = 1$, so at each inner node the probabilities of going to the left and right children sum to 1
2. For any word $w_{i}$: $\sum_{w=1}^{|V|}P(w|w_{i}) = 1$, just as in the original softmax
Example:


**Speedup**
Use a binary Huffman tree, which assigns frequent words shorter paths in the tree.
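To make the path product concrete, here is a minimal sketch of the computation, assuming the inner-node vectors along the root-to-leaf path and the direction of each step (+1 when the next node is the designated child $ch(n)$, -1 otherwise) are already known; all names and values are illustrative:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(path_vectors, turns, v_wi):
    """P(w | w_i) as a product of sigmoids along the root-to-leaf path.

    path_vectors : vectors of the inner nodes n(w, 1), ..., n(w, L(w)-1)
    turns        : +1 where the next node is the designated child ch(n), else -1
    v_wi         : input vector of the word w_i
    """
    prob = 1.0
    for v_node, sign in zip(path_vectors, turns):
        prob *= sigmoid(sign * np.dot(v_node, v_wi))
    return prob

# toy example: a word whose leaf is reached by going left, left, right
rng = np.random.default_rng(0)
d = 5
path_vectors = [rng.normal(size=d) for _ in range(3)]
v_wi = rng.normal(size=d)
print(hierarchical_softmax_prob(path_vectors, [+1, +1, -1], v_wi))
```
Each probability costs one sigmoid per inner node on the path, which is where the $O(\log{|V|})$ cost comes from.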
## Word2Vec Online Demo
https://turbomaze.github.io/word2vecjson/