# Word2Vec (Skip-gram) [Notes]
## Idea
* We have a large corpus of text
* Every word in a fixed vocabulary is represented by a dense vector
* Go through each position $t$ in the text, which has a center word $c$ and surrounding context words $o$ within a fixed window size.
* Use the similarity of the word vectors for $c$ and $o$ to calculate the probability of $o$ given $c$ (or vice versa)
* Keep adjusting the word vectors to maximize this probability
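The idea above can be sketched as a routine that walks each position and collects (center, context) training pairs. This is a minimal illustration; the function name, the toy token list, and the window size are all assumptions for the sketch.

```python
# Sketch: build (center, context) pairs from a tokenized corpus.
# `window` is the number of context words taken on each side.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
# "quick" pairs with its neighbors "the" and "brown", and so on
```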
## Architecture
We will be using a simple feedforward neural network with one hidden layer of dimensions `[d,1]`, where `d` is the dimensionality of the word embeddings.
![](https://i.imgur.com/61HCpGo.png)
### A few notes
* As the Skip-gram architecture models word similarity with dot products in a Euclidean space, no non-linear activation function is applied to the hidden layer; a non-linearity would defeat this purpose.
* In the final layer, the SoftMax function will be applied to $Z$ to get $S$
* The SoftMax function is defined as follows
$\text{SoftMax}(z_{i}) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
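The definition above can be implemented directly; the only practical addition in this sketch is subtracting the maximum before exponentiating, a standard trick to avoid overflow that leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) is safe because SoftMax is invariant
    # to adding a constant to every component of z.
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

The outputs are positive and sum to 1, so they can be read as a probability distribution over the vocabulary.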
## Forward Propagation
1. We have the input $X$, a one-hot vector of dimension `[V,1]`, where `V` is the vocabulary size.
2. Take the dot product with the embedding matrix $W$ (dimension `[d,V]`) to get $A = W \cdot X$
3. Take the dot product of the output matrix $U$ (dimension `[V,d]`) with $A$ to get $Z = U \cdot A$
4. Apply SoftMax function to $Z$ to get $S$.
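The four steps can be sketched in a few lines of NumPy. The sizes `V = 5`, `d = 3` and the small random initialization are illustrative assumptions, not part of the notes.

```python
import numpy as np

V, d = 5, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((d, V)) * 0.01  # embedding matrix, [d, V]
U = rng.standard_normal((V, d)) * 0.01  # output matrix, [V, d]

X = np.zeros((V, 1)); X[2] = 1.0        # step 1: one-hot input, [V, 1]
A = W @ X                               # step 2: hidden layer, [d, 1]
Z = U @ A                               # step 3: scores, [V, 1]
e = np.exp(Z - Z.max())
S = e / e.sum()                         # step 4: SoftMax output, [V, 1]
```

Because $X$ is one-hot, $A = W \cdot X$ simply selects one column of $W$: the embedding of the center word.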
## Compute Loss
Cross Entropy Loss
$L(y,s) = -\sum_{i}{y_i \log(s_i)}$
Cost Function
$J(W,U) = \frac{1}{m} \sum L(y,s)$, where $m$ is the number of training examples.
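A quick sketch of the loss, with a small epsilon added inside the log to guard against $\log(0)$ (an implementation detail, not part of the formula):

```python
import numpy as np

def cross_entropy(y, s, eps=1e-12):
    # y is the one-hot target, s is the SoftMax output.
    return -np.sum(y * np.log(s + eps))

# Sanity check: with a uniform prediction over V = 4 classes,
# the loss is log(4), since only the target term survives.
y = np.array([0.0, 1.0, 0.0, 0.0])
s = np.full(4, 0.25)
```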
## Gradient Change
### With respect to $U$
$\frac{\partial J}{\partial U} = \frac{\partial J}{\partial Z} \frac{\partial Z}{\partial U}$
$= (S-Y) \cdot (A^T)$
`[V,d]` = `[V,1]` $\cdot$ `[1,d]`
### With respect to $W$
$\frac{\partial J}{\partial W} = \frac{\partial J}{\partial Z} \frac{\partial Z}{\partial A} \frac{\partial A}{\partial W}$
$= (U^T \cdot (S-Y)) \cdot X^T$
`[d,V]` = (`[d,V]` $\cdot$ `[V,1]`) $\cdot$ `[1,V]`
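The two gradients can be verified numerically with a single example. The sizes `V = 4`, `d = 2`, the random initialization, and the chosen one-hot indices are illustrative assumptions.

```python
import numpy as np

V, d = 4, 2
rng = np.random.default_rng(1)
W = rng.standard_normal((d, V))         # embedding matrix, [d, V]
U = rng.standard_normal((V, d))         # output matrix, [V, d]
X = np.zeros((V, 1)); X[0] = 1.0        # one-hot center word
Y = np.zeros((V, 1)); Y[3] = 1.0        # one-hot context word

# Forward pass (as in the previous section)
A = W @ X
Z = U @ A
e = np.exp(Z - Z.max())
S = e / e.sum()

dZ = S - Y                              # dJ/dZ, shape [V, 1]
dU = dZ @ A.T                           # dJ/dU, shape [V, d], matches U
dW = (U.T @ dZ) @ X.T                   # dJ/dW, shape [d, V], matches W
```

Note that the components of $S - Y$ sum to zero, since both $S$ and $Y$ sum to one.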
## Optimization
Update the weights $W$ and $U$ using the gradients calculated above; here $\alpha$ is the learning rate.
$W = W - \alpha \frac{\partial J}{\partial W}$
$U = U - \alpha \frac{\partial J}{\partial U}$
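Putting everything together, one full training step repeats the forward pass, loss, gradients, and the update rule above. This end-to-end sketch trains on a single (center, context) pair; `V = 4`, `d = 2`, `alpha = 0.5`, and the initialization scale are illustrative choices.

```python
import numpy as np

V, d, alpha = 4, 2, 0.5
rng = np.random.default_rng(0)
W = rng.standard_normal((d, V)) * 0.1   # embedding matrix, [d, V]
U = rng.standard_normal((V, d)) * 0.1   # output matrix, [V, d]
X = np.zeros((V, 1)); X[0] = 1.0        # one-hot center word
Y = np.zeros((V, 1)); Y[3] = 1.0        # one-hot context word

losses = []
for step in range(100):
    A = W @ X                            # forward propagation
    Z = U @ A
    e = np.exp(Z - Z.max())
    S = e / e.sum()
    losses.append(float(-np.sum(Y * np.log(S + 1e-12))))
    dZ = S - Y                           # gradients from the section above
    dU = dZ @ A.T
    dW = (U.T @ dZ) @ X.T
    U -= alpha * dU                      # U = U - alpha * dJ/dU
    W -= alpha * dW                      # W = W - alpha * dJ/dW
```

Both gradients are computed before either matrix is updated, so the $W$ update uses the same $U$ that produced the forward pass.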