
Word2Vec (Skip-gram) [Notes]

Idea

  • We have a large corpus of text
  • Every word in a fixed vocabulary is represented by a dense vector
  • Go through each position $t$ in the text, which has a center word $c$ and surrounding context words $o$, within the given window size.
  • Use the similarity of the word vectors for $c$ and $o$ to calculate the probability of $o$ given $c$ (or vice versa), as formalized below.
  • Keep adjusting the word vectors to maximize this probability.
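
These bullets correspond to the standard Skip-gram objective. As a reference sketch (using $T$ for the corpus length, $n$ for the window size, and $\theta$ for all word vectors; these symbols are not used elsewhere in these notes), maximizing the probability of the context words is equivalent to minimizing the average negative log-likelihood:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le j \le n,\ j \neq 0} \log P(w_{t+j} \mid w_t)$$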

Architecture

We will be using a simple feedforward neural network with one hidden layer of dimensions $[d, 1]$, where $d$ is the dimension of the word embeddings.
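
As a minimal sketch of the two trainable matrices this implies (numpy, with illustrative values for $V$ and $d$ that are not given in these notes, and assuming a one-hot center-word input of shape $[V, 1]$):

```python
import numpy as np

V = 10_000   # vocabulary size (illustrative value)
d = 300      # embedding dimension (illustrative value)

rng = np.random.default_rng(0)

# W projects the one-hot input [V, 1] to the hidden layer A [d, 1];
# its columns serve as the "input" word embeddings.
W = rng.normal(scale=0.01, size=(d, V))

# U projects the hidden layer [d, 1] to the scores Z [V, 1];
# its rows serve as the "output" word embeddings.
U = rng.normal(scale=0.01, size=(V, d))
```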


A few notes

  • As the Skip-gram architecture tries to model a classifier in a Euclidean space, a non-linear activation function for the hidden layer is not required; adding one would defeat the purpose.
  • In the final layer, the SoftMax function will be applied to $Z$ to get $S$.
  • The SoftMax function is defined as follows (a small numpy version appears after this list):
    $$\text{SoftMax}(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
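
A direct numpy translation of this definition might look like the sketch below. The max-subtraction is a standard numerical-stability trick, not something stated in the notes above; it does not change the result.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """SoftMax(z_i) = exp(z_i) / sum_j exp(z_j)."""
    z = z - z.max()      # stability shift; leaves the output unchanged
    e = np.exp(z)
    return e / e.sum()

s = softmax(np.array([2.0, 1.0, 0.1]))
print(s, s.sum())        # probabilities that sum to 1
```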

Forward Propagation

  1. We have the input $X$, which is of dimension $[V, 1]$, where $V$ is the vocabulary size.
  2. Take the dot product with matrix $W$ to get $A$.
  3. Take the dot product of $A$ and $U$ to get $Z$.
  4. Apply the SoftMax function to $Z$ to get $S$ (a full forward-pass sketch follows this list).
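
Putting the four steps together, a minimal forward pass could look like this sketch. It assumes the shapes from the Architecture section and a one-hot center-word input (implied by the $[V, 1]$ dimension); the sizes are tiny illustrative values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d = 8, 4                            # tiny illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))            # [d, V]
U = rng.normal(size=(V, d))            # [V, d]

X = np.zeros((V, 1)); X[3, 0] = 1.0    # step 1: one-hot center word, [V, 1]

A = W @ X                              # step 2: hidden layer, [d, 1]
Z = U @ A                              # step 3: scores, [V, 1]
S = softmax(Z)                         # step 4: probabilities, [V, 1]
```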

Compute Loss

Cross Entropy Loss

$$L(y, s) = -\sum_i y_i \log(s_i)$$
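
Since $y$ is a one-hot vector for the true context word, the sum collapses to a single term, $-\log(s_c)$. A tiny numpy check with made-up numbers:

```python
import numpy as np

S = np.array([[0.1], [0.7], [0.2]])   # predicted probabilities
Y = np.array([[0.0], [1.0], [0.0]])   # one-hot true context word

loss = -np.sum(Y * np.log(S))         # cross-entropy loss
print(loss, -np.log(0.7))             # both equal ~0.357
```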

Cost Function

$$J(W, U) = \frac{1}{m} \sum L(y, s)$$

where $m$ is the number of training examples.

Gradient Change

With respect to $U$:

$$\frac{\partial J}{\partial U} = \frac{\partial J}{\partial Z} \cdot \frac{\partial Z}{\partial U} = (S - Y) \cdot A^T$$

Shapes: $[V, d] = [V, 1] \cdot [1, d]$

With respect to $W$:

$$\frac{\partial J}{\partial W} = \frac{\partial J}{\partial Z} \cdot \frac{\partial Z}{\partial A} \cdot \frac{\partial A}{\partial W} = \left((S - Y)^T \cdot U\right)^T \otimes X$$

Shapes: $[d, V] = ([1, V] \cdot [V, d])^T \otimes [V, 1]$
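
Both gradients reduce to a couple of numpy operations. Below is a minimal, self-contained sketch under the same shape assumptions as the forward-pass example (the outer product $\otimes X$ is written as `@ X.T`); it is an illustration, not the notes' own code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))              # [d, V]
U = rng.normal(size=(V, d))              # [V, d]

X = np.zeros((V, 1)); X[3, 0] = 1.0      # one-hot center word
Y = np.zeros((V, 1)); Y[5, 0] = 1.0      # one-hot context word

# forward pass
A = W @ X                                # [d, 1]
Z = U @ A                                # [V, 1]
S = softmax(Z)                           # [V, 1]

# backward pass
dZ = S - Y                               # [V, 1]
dU = dZ @ A.T                            # [V, d] = [V, 1] @ [1, d]
dW = (dZ.T @ U).T @ X.T                  # [d, V] = [d, 1] @ [1, V]
```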

Optimization

Updating weights

Update $W$ and $U$ using the gradients calculated above. Here $\alpha$ is the learning rate.

$$W = W - \alpha \frac{\partial J}{\partial W}$$

$$U = U - \alpha \frac{\partial J}{\partial U}$$
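
Continuing the gradient sketch above (reusing `W`, `U`, `dW`, and `dU` from it), the update step is plain gradient descent; the learning rate here is an illustrative value, not one given in the notes.

```python
alpha = 0.05          # learning rate (illustrative value)

W -= alpha * dW       # W has shape [d, V]
U -= alpha * dU       # U has shape [V, d]
```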