# Word2Vec (Skip-gram) [Notes]

## Idea

* We have a large corpus of text
* Every word in a fixed vocabulary is represented by a dense vector
* Go through each position $t$ in the text, which has a center word $c$ and surrounding context words $o$, within the given window size
* Use the similarity of the word vectors for $c$ and $o$ to calculate the probability of $o$ given $c$ (or vice versa)
* Keep adjusting the word vectors to maximize this probability

## Architecture

We will be using a simple feedforward neural network with one hidden layer of dimension `[d,1]`, where `d` is the dimension of the word embeddings.

![](https://i.imgur.com/61HCpGo.png)

### A few notes

* Since the Skip-gram architecture models word similarity in a Euclidean space, no non-linear activation function is applied to the hidden layer; the hidden layer is simply a linear projection (the embedding lookup), and a non-linearity would defeat the purpose.
* In the final layer, the SoftMax function is applied to $Z$ to get $S$
* The SoftMax function is defined as follows:

$\text{SoftMax}(z_{i}) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$

## Forward Propagation

1. We have the input $X$, a one-hot vector of dimension `[V,1]`, where `V` is the vocabulary size.
2. Multiply by the matrix $W$ (dimension `[d,V]`) to get $A = WX$
3. Multiply $A$ by $U$ (dimension `[V,d]`) to get $Z = UA$
4. Apply the SoftMax function to $Z$ to get $S$.

## Compute Loss

Cross-entropy loss:

$L(y,s) = -\sum_{i}{y_i \log(s_i)}$

Cost function (averaged over $m$ training examples):

$J(W,U) = \frac{1}{m} \sum L(y,s)$

## Gradient Change

### With respect to $U$

$\frac{\partial J}{\partial U} = \frac{\partial J}{\partial Z} \frac{\partial Z}{\partial U} = (S-Y) \cdot A^T$

`[V,d]` = `[V,1]` $\cdot$ `[1,d]`

### With respect to $W$

$\frac{\partial J}{\partial W} = \frac{\partial J}{\partial Z} \frac{\partial Z}{\partial A} \frac{\partial A}{\partial W} = ((S-Y)^T \cdot U)^T \cdot X^T$

`[d,V]` = `[d,1]` $\cdot$ `[1,V]`

## Optimization

Update the weights $W$ and $U$ using the gradients calculated above. Here $\alpha$ is the learning rate.
$W = W - \alpha \frac{\partial J}{\partial W}$

$U = U - \alpha \frac{\partial J}{\partial U}$
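The whole pipeline above — forward propagation, cross-entropy loss, the two gradients, and the weight update — can be sketched in NumPy. This is a minimal illustration of a single training step on one (center, context) pair; the vocabulary size, embedding dimension, word indices, and initialization scale are arbitrary toy choices, not values from the notes.

```python
import numpy as np

# Toy sizes (assumptions for illustration only)
V, d = 10, 4                              # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, V))    # input-to-hidden weights  [d, V]
U = rng.normal(scale=0.1, size=(V, d))    # hidden-to-output weights [V, d]
alpha = 0.05                              # learning rate

def softmax(z):
    e = np.exp(z - z.max())               # shift by max for numerical stability
    return e / e.sum()

# One-hot center word X and context word Y (arbitrary toy indices)
x = np.zeros((V, 1)); x[3] = 1.0          # [V, 1]
y = np.zeros((V, 1)); y[7] = 1.0          # [V, 1]

# Forward propagation
A = W @ x               # hidden layer = embedding of the center word  [d, 1]
Z = U @ A               # scores over the vocabulary                   [V, 1]
S = softmax(Z)          # predicted context-word distribution          [V, 1]

# Cross-entropy loss  L = -sum_i y_i * log(s_i)
loss = -float(np.sum(y * np.log(S)))

# Backward propagation (the gradients derived above)
dZ = S - y              # dJ/dZ             [V, 1]
dU = dZ @ A.T           # dJ/dU = (S-Y) A^T [V, d]
dW = (U.T @ dZ) @ x.T   # dJ/dW             [d, V]

# Gradient-descent update
U -= alpha * dU
W -= alpha * dW
```

Because $X$ is one-hot, `W @ x` simply selects one column of $W$, so in practice the matrix product is implemented as an embedding lookup; the explicit multiplication here mirrors the derivation in the notes.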