
Neural Network Math, LSTM Background

tags: Internship

Goal:

  • Understand math behind neural networks
  • Study conceptual components of LSTM

Resources:
Towards Data Science Page
Adventures in Machine Learning
Machine Learning

Math of Optimizing Neural Networks

Example cost function for a pair:

  • The 1/2 coefficient just simplifies the calculation when we differentiate later on.

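Since the equation image above is missing, here is a standard per-pair squared-error cost consistent with the note about the 1/2 coefficient (a reconstruction, with $y$ the target and $\hat{y}$ the prediction):

$$J = \frac{1}{2}\left(y - \hat{y}\right)^2$$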

Mean squared error:

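A reconstruction of the missing mean-squared-error image, averaging the per-pair cost over all $m$ training samples (keeping the 1/2 convention):

$$J = \frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$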

Gradient descent:

  • Same idea as the basic gradient-descent update rule (sketched below)
  • $J$ is the loss function
  • The gradient is made up of the partial derivatives of $J$ with respect to each weight

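The standard gradient-descent update, sketched here in place of the missing image ($\alpha$ is the learning rate):

$$w_{\text{new}} = w_{\text{old}} - \alpha\,\frac{\partial J}{\partial w}$$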

Calculating Derivative of Loss Function

Find the change in loss $J$ caused by a weight $w$ using the chain rule (sketched below):

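The chain-rule decomposition implied by the three sub-sections below (a reconstruction of the missing image):

$$\frac{\partial J}{\partial w} = \frac{\partial J}{\partial h}\cdot\frac{\partial h}{\partial z}\cdot\frac{\partial z}{\partial w}$$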

Solving for dz/dw:

  • Previously established that $z$ is the vectorized (weighted) input to the nodes in a layer; see the sketch below

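Assuming the tutorial's usual definition $z^{(l+1)} = W^{(l)}h^{(l)} + b^{(l)}$, the derivative of $z$ with respect to a single weight is just the activation feeding into that weight:

$$\frac{\partial z_i^{(l+1)}}{\partial w_{ij}^{(l)}} = h_j^{(l)}$$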

Solving for dh/dz:

  • Previously established that $h$ is the output of the activation function on the node, AKA the result of $f(z)$; see the sketch below

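Since $h = f(z)$, this derivative is simply the derivative of the activation function; for a sigmoid activation it has the convenient closed form below (sketched in place of the missing image):

$$\frac{\partial h}{\partial z} = f'(z), \qquad f'(z) = f(z)\bigl(1 - f(z)\bigr) \;\text{ for }\; f(z) = \frac{1}{1 + e^{-z}}$$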

Solving for dJ/dh:

  • This also requires the chain rule to solve (see below)

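For the per-pair cost $J = \frac{1}{2}(y - h)^2$ at the output layer, applying the chain rule to the squared term gives (a reconstruction of the missing image):

$$\frac{\partial J}{\partial h} = -(y - h)$$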

Final derivative of loss function:

  • Given the three partial derivatives solved above:

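Multiplying the three pieces together, for a weight feeding the output layer (a reconstruction consistent with the steps above; $h_{\text{in}}$ is the activation feeding into the weight, and the first two factors are what the backpropagation section will call $\delta$):

$$\frac{\partial J}{\partial w} = -(y - h)\,f'(z)\,h_{\text{in}}$$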

Propagating through Hidden Layers

Connecting weights between different layers:

  • Weights connecting to output layer:
    • Simple, since we can directly compare the output with the training data
  • Harder to calculate how the loss changes with the outputs of hidden layers
    • No direct reference to compare the outputs against
    • Only connected to the loss function via other nodes and the mediating weights


For hidden layer calculations, use backpropagation:

  • Propagate this term (see the sketch after this list) through the layers:

    • This term is the network's way of connecting to the loss function
    • Nodes in hidden layers contribute to this term via their weights
    • If there is only one node in the output layer, there is a simple generalized term (see the sketch after this list)
  • If there are multiple nodes in the output layer, we can generalize further (see the sketch after this list)

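This propagated term is conventionally written as $\delta$; a reconstruction of the missing equations in that standard notation (the exact symbols used in the original images are an assumption):

$$\delta_i^{(n_l)} = -(y_i - h_i^{(n_l)})\,f'(z_i^{(n_l)}) \quad \text{(output layer)}$$

$$\delta_j^{(l)} = \Bigl(\sum_i w_{ij}^{(l)}\,\delta_i^{(l+1)}\Bigr)\,f'(z_j^{(l)}) \quad \text{(hidden layers, summing over the nodes } i \text{ that node } j \text{ feeds into)}$$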

Conclusion

Given all the work above, we can finally calculate the change in the loss function with respect to the weights:


Following similar steps, we can also calculate the change in the loss function with respect to the bias:

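A reconstruction of the two results, using the $\delta$ notation from the backpropagation section above:

$$\frac{\partial J}{\partial w_{ij}^{(l)}} = h_j^{(l)}\,\delta_i^{(l+1)}, \qquad \frac{\partial J}{\partial b_i^{(l)}} = \delta_i^{(l+1)}$$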

Given the above derivations, we can finally perform gradient descent:


Vectorization of backpropagation:

  • Does NOT use matrix multiplication
  • Dot symbol is element-by-element multiplication, AKA Hadamard product

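A standard vectorized form consistent with the note above, with $\circ$ the Hadamard product (reconstructed in place of the missing image):

$$\delta^{(l)} = \bigl((W^{(l)})^{T}\,\delta^{(l+1)}\bigr) \circ f'(z^{(l)})$$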

For reference, the Hadamard product works as follows:

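For example (element-by-element):

$$\begin{bmatrix}1 & 2\\ 3 & 4\end{bmatrix} \circ \begin{bmatrix}5 & 6\\ 7 & 8\end{bmatrix} = \begin{bmatrix}5 & 12\\ 21 & 32\end{bmatrix}$$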

Integrating Gradient Descent

(Reminder) Overall cost function:


(Reminder) Gradient descent calculation (element-by-element and vectorized):


Sum up all the partial derivatives from the individual cost-function calculations (for both $W$ and $b$):

  • Perform this operation at each iteration.

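A sketch of the accumulation step, written to match the algorithm in the next section (a reconstruction of the missing image):

$$\Delta W^{(l)} := \Delta W^{(l)} + \frac{\partial J(x^{(i)}, y^{(i)})}{\partial W^{(l)}}, \qquad \Delta b^{(l)} := \Delta b^{(l)} + \frac{\partial J(x^{(i)}, y^{(i)})}{\partial b^{(l)}}$$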

Once you've iterated through all samples, update weight parameters:


Final Backpropagation Algorithm

  • Randomly initialize the weights $W^{(l)}$ for each layer
  • While iterations < iteration limit:
    • Set $\Delta W$ and $\Delta b$ to zero
    • For each sample $i = 1$ to $m$:
      • Perform a feed-forward pass through all $n_l$ layers
      • Store the activation function outputs $h^{(l)}$
      • Calculate $\delta^{(n_l)}$ for the output layer
      • Use backpropagation to calculate the $\delta^{(l)}$ values for layers 2 to $n_l - 1$
      • Update $\Delta W^{(l)}$ and $\Delta b^{(l)}$ for each layer
    • Perform gradient descent using:

$$W^{(l)} = W^{(l)} - \alpha\left[\frac{1}{m}\Delta W^{(l)}\right]$$
$$b^{(l)} = b^{(l)} - \alpha\left[\frac{1}{m}\Delta b^{(l)}\right]$$

Repeat gradient descent until satisfied with minimization.
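As a concrete illustration of the algorithm above, here is a minimal NumPy sketch (my own, not the tutorial's code) for a single hidden layer with sigmoid activations and squared-error loss; the layer sizes, learning rate, iteration limit, and toy data are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 samples, 3 features (toy data)
Y = rng.normal(size=(8, 2))   # 8 samples, 2 targets

# Randomly initialize weights for each layer
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
alpha, m = 0.1, X.shape[0]

for _ in range(1000):                                  # while iterations < iteration limit
    dW1, db1 = np.zeros_like(W1), np.zeros_like(b1)    # set delta-W, delta-b to zero
    dW2, db2 = np.zeros_like(W2), np.zeros_like(b2)
    for x, y in zip(X, Y):                             # for each sample
        # Feed-forward pass, storing pre-activations z and activations h
        z1 = W1 @ x + b1;  h1 = sigmoid(z1)
        z2 = W2 @ h1 + b2; h2 = sigmoid(z2)
        # delta for the output layer, then backpropagate to the hidden layer
        delta2 = -(y - h2) * sigmoid_prime(z2)
        delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
        # Accumulate the weight and bias gradients for each layer
        dW2 += np.outer(delta2, h1); db2 += delta2
        dW1 += np.outer(delta1, x);  db1 += delta1
    # Gradient descent step: W = W - alpha * (1/m) * delta-W, same for b
    W2 -= alpha * dW2 / m; b2 -= alpha * db2 / m
    W1 -= alpha * dW1 / m; b1 -= alpha * db1 / m
```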

Implementation of NN in Python

GOAL: Identify value of hand-written digits.

I will skip this part of the tutorial for now.

LSTM Tutorial

Source: https://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/

Continued from last note.

Problem with Vanilla RNN

  • Recurrent neural networks
    • Each time step shares same set of weights
    • Very flexible configurations to describe inputs vs. outputs
      • Many-to-many
      • Many-to-one
      • One-to-many

  • Impracticality of vanilla RNNs
    • Vanishing gradient problem
      • Can't handle long-term memory
      • More time steps mean more backpropagated gradients get multiplied together, so they either explode or vanish

Demonstration of Vanishing Gradient Problem

Representation of RNN:

$$h_t = \sigma(Ux_t + Vh_{t-1})$$

  • $U$: weight matrix connecting the inputs
  • $V$: weight matrix connecting the recurrent outputs

Three time steps in (before taking the softmax of the $h_t$ output), this unrolls to:

$$h_t = \sigma\bigl(Ux_t + V\sigma(Ux_{t-1} + V\sigma(Ux_{t-2}))\bigr)$$

Gradient of error with respect to weight matrix U:

$$\frac{\partial E_3}{\partial U} = \frac{\partial E_3}{\partial \text{out}_3}\,\frac{\partial \text{out}_3}{\partial h_3}\,\frac{\partial h_3}{\partial h_2}\,\frac{\partial h_2}{\partial h_1}\,\frac{\partial h_1}{\partial U}$$

Each gradient requires calculation of gradient of sigmoid function:

  • When the sigmoid output is close to either 0 or 1, its gradient becomes very small
  • Eventually leads to vanishing gradient

If the gradient becomes very small, it does very little to adjust the weights toward new values.
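One supporting detail: the sigmoid derivative is bounded,

$$\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4},$$

so each $\partial h_k / \partial h_{k-1}$ factor in the chain above carries a sigmoid derivative of at most 1/4, which (unless the recurrent weights compensate) drives the product toward zero.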

Components of LSTM Network

  • LSTM cell is designed to prevent vanishing gradient problem
    • Reduces repeated multiplication of gradients that are less than one
    • Creates internal memory state that is just added to processed input
    • Forget gate: controls time dependence and effects of past inputs
      • Determines what to remember/forget

Structure of LSTM Cell

  • Data flows left to right
  • $x_t$: current input
  • $h_{t-1}$: previous cell output
  • Concatenate $x_t$ and $h_{t-1}$ together and feed the result onto the top "data rail"

Input Gate

Scale input to range (-1,1) using tanh activation function:

$$g = \tanh(b^g + x_tU^g + h_{t-1}V^g)$$

  • $U^g$: weights for the input
  • $V^g$: weights for the previous cell output
  • $b^g$: input bias
  • The superscript $g$ denotes the input-stage weights and bias values (as opposed to those of the input gate, forget gate, output gate, etc.)

Multiply tanh-squashed input by output of the input gate:

  • Element-wise multiplication
  • Input gate: hidden layer of sigmoid activated nodes
    • Input: weighted $x_t$ and $h_{t-1}$
    • Output: between 0 and 1
    • "Switches" input on or off via multiplication

Input gate:

$$i = \sigma(b^i + x_tU^i + h_{t-1}V^i)$$

Output of input stage of LSTM cell:

$$g \circ i$$

  • $\circ$ denotes element-wise multiplication

Input gate output i acts as weights for squashed input g.

Internal State and Forget Gate

  • Introduces a new variable $s_t$: the internal state of the LSTM cell
    • Delay $s_t$ by one time step
    • Add it to the $g \circ i$ input
    • Provides an internal recurrence loop to learn relationships between inputs separated by time
  • Forget gate: sigmoid-activated set of nodes
    • Element-wise multiplied by $s_{t-1}$ to determine what to remember/forget
    • Acts as weights for the internal state
    • $f = \sigma(b^f + x_tU^f + h_{t-1}V^f)$
  • Element-wise product of the previous state and the forget gate: $s_{t-1} \circ f$
    • Then added to the input
      • NOT multiplied or mixed with it
      • A vanilla RNN would typically mix it via weights and a sigmoid activation function
      • ADDING reduces the risk of vanishing gradients
  • Output of this stage: $s_t = s_{t-1} \circ f + g \circ i$

Output Gate

  • Two parts:
    • tanh squashing function
    • Output sigmoid gating function
  • The output sigmoid gating function is multiplied by the squashed state $s_t$ to determine which values are output
    • $o = \sigma(b^o + x_tU^o + h_{t-1}V^o)$
  • Final output of the cell: $h_t = \tanh(s_t) \circ o$
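To tie the gates together, here is a minimal NumPy sketch (my own illustration, not the tutorial's code) of one forward step through an LSTM cell; the vector sizes and parameter names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, params):
    # Each stage/gate has its own U (input weights), V (recurrent weights) and bias b
    g = np.tanh(params["bg"] + x_t @ params["Ug"] + h_prev @ params["Vg"])  # squashed input
    i = sigmoid(params["bi"] + x_t @ params["Ui"] + h_prev @ params["Vi"])  # input gate
    f = sigmoid(params["bf"] + x_t @ params["Uf"] + h_prev @ params["Vf"])  # forget gate
    o = sigmoid(params["bo"] + x_t @ params["Uo"] + h_prev @ params["Vo"])  # output gate
    s_t = s_prev * f + g * i          # internal state: ADD the gated input, don't mix it
    h_t = np.tanh(s_t) * o            # cell output
    return h_t, s_t

# Example dimensions: 650-length input vector, 100 hidden units (assumed sizes)
rng = np.random.default_rng(0)
n_in, n_hid = 650, 100
params = {}
for gate in "gifo":
    params[f"U{gate}"] = rng.normal(scale=0.1, size=(n_in, n_hid))
    params[f"V{gate}"] = rng.normal(scale=0.1, size=(n_hid, n_hid))
    params[f"b{gate}"] = np.zeros(n_hid)
params["bf"] += 1.0                   # larger forget-gate bias so f starts near 1

h, s = np.zeros(n_hid), np.zeros(n_hid)
x = rng.normal(size=n_in)
h, s = lstm_step(x, h, s, params)
```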

Reducing Vanishing Gradient Problem

In an LSTM cell, the recurrence of the internal state uses addition:

$$s_t = s_{t-1} \circ f + g \circ i$$

Taking the partial derivative:

$$\frac{\partial s_t}{\partial s_{t-1}} = f$$

  • The recurrent gradient then becomes just a multiplication by $f$
  • If the output of $f = 1$, there is no gradient decay
  • In the beginning, the bias of the sigmoid in $f$ is typically made larger so that $f$ starts out close to 1
  • This allows all past input states to be remembered in the beginning
  • Later, the forget gate will reduce the memory

Data Dimensions inside LSTM Cell

  • Each input is a vector
    • Example: "cat" is a 650-length vector
  • Each input row has a certain number of vectors
    • Example: one input row has 35 input vectors, so input of each row is 35x650
  • Batches of data processed at once via multi-dimensional tensors
    • Example: batch size of 20, so training input data is 20x35x650
  • "Time-major" format: arranged (35 x 20 x 650)