# <center><i class="fa fa-edit"></i> Neural Network Math, LSTM Background </center>

###### tags: `Internship`

:::info
**Goal:**
- [x] Understand math behind neural networks
- [x] Study conceptual components of LSTM

**Resources:**
[Towards Data Science Page](https://towardsdatascience.com/understanding-lstm-and-its-quick-implementation-in-keras-for-sentiment-analysis-af410fd85b47)
[Adventures in Machine Learning](https://adventuresinmachinelearning.com/neural-networks-tutorial/#first-attempt-feed-forward)
[Machine Learning](https://hackmd.io/@Derni/HJQkjlnIP)
:::

## Math of Optimizing Neural Networks

*Example cost function for a pair:*
- The 1/2 coefficient just makes the calculation easier when we differentiate later on.

![](https://i.imgur.com/h3Iv3tk.png)

*Mean squared error:*

![](https://i.imgur.com/C1wMr1N.png)

*Gradient descent:*
- Same idea as ![](https://i.imgur.com/EaRxABc.png)
- J is the loss function
- Partial derivatives form the gradient

![](https://i.imgur.com/ZzX9gmy.png)

### Calculating Derivative of Loss Function

*Find the change in loss J caused by a weight w using the chain rule:*

![](https://i.imgur.com/PwwDC0G.png)

*Solving for dz/dw:*
- Previously established that z is the vectorized weighted input to the nodes in a layer, such that

![](https://i.imgur.com/mcN2rab.png)
![](https://i.imgur.com/evkdw4g.png)

*Solving for dh/dz:*
- Previously established that h is the output of the activation on the node, AKA the result of f(z), such that

![](https://i.imgur.com/3gG6tSa.png)
![](https://i.imgur.com/n10W8ot.png)

*Solving for dJ/dh:*
- This also requires the chain rule to solve

![](https://i.imgur.com/dC7F0UB.png)

*Final derivative of loss function:*
- Given ![](https://i.imgur.com/mvGEmi9.png)

![](https://i.imgur.com/7KAwDfy.png)
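The chain-rule product above can be sanity-checked numerically. Below is a minimal sketch for a single node, assuming a squared-error pair cost J = 1/2 (y - h)^2 with z = wx + b and a sigmoid activation h = f(z); the specific numbers are made up for illustration and are not from the tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single-node setup (illustrative values, not from the tutorial)
x, y = 1.5, 0.7          # input and target
w, b = 0.8, 0.1          # weight and bias

# Forward pass: z = wx + b, h = f(z), J = 1/2 (y - h)^2
z = w * x + b
h = sigmoid(z)
J = 0.5 * (y - h) ** 2

# Chain rule: dJ/dw = dJ/dh * dh/dz * dz/dw
dJ_dh = -(y - h)                        # derivative of 1/2 (y - h)^2 w.r.t. h
dh_dz = sigmoid(z) * (1 - sigmoid(z))   # sigmoid derivative f'(z) = f(z)(1 - f(z))
dz_dw = x                               # z = wx + b, so dz/dw = x
grad_analytic = dJ_dh * dh_dz * dz_dw

# Numerical check with a small finite difference
eps = 1e-6
J_plus = 0.5 * (y - sigmoid((w + eps) * x + b)) ** 2
grad_numeric = (J_plus - J) / eps

print(grad_analytic, grad_numeric)      # the two values should agree closely
```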
### Propagating through Hidden Layers

Connecting weights between different layers:
- Weights connecting to the output layer: ![](https://i.imgur.com/McpbHzI.png)
    - Simple, since we can directly compare the output with the training data
- Harder to calculate loss changes for the outputs of hidden layers
    - No direct reference to compare the outputs with
    - Only connected to the loss function via other nodes and mediating weights

![](https://i.imgur.com/3VU24qo.png)

For hidden layer calculations, use backpropagation:
- Propagate this term through the layers: ![](https://i.imgur.com/mvGEmi9.png)
    - This term is the network's way of connecting to the loss function.
    - Nodes in hidden layers contribute to this term via their weights
- If there is only one output layer node, the generalized term is: ![](https://i.imgur.com/DUI2nxq.png)
- If there are multiple nodes in the output layer, then we can generalize to: ![](https://i.imgur.com/mn5fRTc.png)

![](https://i.imgur.com/vktXsXl.png)

:::success
**Conclusion**

Given all the work above, we can finally calculate the change of the loss function by a change in the weights:

![](https://i.imgur.com/7HPDXha.png)

Following similar steps, we can also calculate the change of the loss function by a change in the bias:

![](https://i.imgur.com/tt5MTws.png)

Given the above derivations, we can finally calculate **gradient descent**:

![](https://i.imgur.com/LfkC6Y0.png)
:::

Vectorization of backpropagation:
- Does NOT use matrix multiplication
- The dot symbol is element-by-element multiplication, AKA the Hadamard product

![](https://i.imgur.com/eM1dePn.png)

:::warning
For reference, the Hadamard product works as such:

![](https://i.imgur.com/GSxsVFA.png)
:::

### Integrating Gradient Descent

(Reminder) Overall cost function:

![](https://i.imgur.com/Hx85YYF.png)

(Reminder) Gradient descent calculation (element-by-element and vectorized):

![](https://i.imgur.com/ovFrLYM.png)

Sum up all partial derivatives of the individual cost function calculations (both W and b):
- Perform this operation at each iteration.

![](https://i.imgur.com/fzx1zJc.png)

Once you've iterated through all samples, update the weight parameters:

![](https://i.imgur.com/e2Mawso.png)

### Final Backpropagation Algorithm

- Randomly initialize weights for each layer $W^{(l)}$
- While iterations < iteration limit:
    - Set $\Delta W$ and $\Delta b$ to zero
    - For each sample $1$ to $m$:
        - Perform a feed-forward pass through all $n_l$ layers
        - Store the activation function outputs $h^{(l)}$
        - Calculate $\delta^{(n_l)}$ for the output layer
        - Use backpropagation to calculate $\delta^{(l)}$ values for layers 2 to $n_l - 1$
        - Update $\Delta W^{(l)}$ and $\Delta b^{(l)}$ for each layer
    - Perform gradient descent using:

$$
W^{(l)} = W^{(l)} - \alpha \left[\frac{1}{m} \Delta W^{(l)} \right]
$$

$$
b^{(l)} = b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right]
$$

Repeat gradient descent until satisfied with the minimization.
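To make the loop above concrete, here is a minimal NumPy sketch of the algorithm for a small fully connected network with sigmoid activations and the squared-error cost. The layer sizes, variable names (`sizes`, `dW`, `db`, etc.), and data handling are illustrative assumptions, not the tutorial's own code.

```python
import numpy as np

def f(z):                      # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):                # sigmoid derivative f'(z) = f(z)(1 - f(z))
    return f(z) * (1.0 - f(z))

def train(X, Y, sizes=(4, 3, 2), alpha=0.5, iterations=1000):
    """X: (m, n_in) inputs, Y: (m, n_out) targets. `sizes` gives nodes per layer."""
    rng = np.random.default_rng(0)
    L = len(sizes)
    # Randomly initialize weights and biases for each layer
    W = {l: rng.normal(scale=0.5, size=(sizes[l], sizes[l - 1])) for l in range(1, L)}
    b = {l: np.zeros((sizes[l], 1)) for l in range(1, L)}
    m = X.shape[0]

    for _ in range(iterations):
        dW = {l: np.zeros_like(W[l]) for l in W}      # accumulated Delta W
        db = {l: np.zeros_like(b[l]) for l in b}      # accumulated Delta b
        for x, y in zip(X, Y):
            # Feed-forward pass, storing z and h for every layer
            h = {0: x.reshape(-1, 1)}
            z = {}
            for l in range(1, L):
                z[l] = W[l] @ h[l - 1] + b[l]
                h[l] = f(z[l])
            # Output-layer delta, then backpropagate through the hidden layers
            delta = {L - 1: -(y.reshape(-1, 1) - h[L - 1]) * f_prime(z[L - 1])}
            for l in range(L - 2, 0, -1):
                delta[l] = (W[l + 1].T @ delta[l + 1]) * f_prime(z[l])
            # Accumulate gradients for this sample
            for l in range(1, L):
                dW[l] += delta[l] @ h[l - 1].T
                db[l] += delta[l]
        # Gradient-descent update once per pass over all m samples
        for l in range(1, L):
            W[l] -= alpha * (dW[l] / m)
            b[l] -= alpha * (db[l] / m)
    return W, b
```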
### Implementation of NN in Python

GOAL: Identify the value of hand-written digits.

:::warning
I will skip this part of the tutorial for now.
:::

## LSTM Tutorial

Source: https://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/

:::warning
Continued from last note.
:::

### Problem with Vanilla RNN

- Recurrent neural networks
    - Each time step shares the same set of weights
    - Very flexible configurations to describe inputs vs. outputs
        - Many-to-many
        - Many-to-one
        - One-to-many

![](https://i.imgur.com/Yy3bT7u.png)
![](https://i.imgur.com/WJIi05Q.png)

- Impracticality of vanilla RNNs
    - Vanishing gradient problem
    - Can't handle long-term memory
    - More time steps means more backpropagation gradients multiplied together, which accumulate to either explode or vanish

:::success
**Demonstration of Vanishing Gradient Problem**

Representation of RNN: $\textbf{h}_t = \sigma (\textbf{Ux}_t + \textbf{Vh}_{t-1})$
- U: weight matrix connecting inputs
- V: weight matrix connecting recurrent outputs

Unrolled three time steps back (the output at each step is a softmax of $\textbf{h}_t$):

$$
\textbf{h}_t = \sigma (\textbf{Ux}_t + \textbf{V}\sigma(\textbf{Ux}_{t-1} + \textbf{V}\sigma(\textbf{Ux}_{t-2})))
$$

Gradient of the error with respect to weight matrix U:

$$
\frac{\partial E_3}{\partial U} = \frac{\partial E_3}{\partial out_3}\frac{\partial out_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial U}
$$

Each gradient requires calculation of the gradient of the sigmoid function:
- The sigmoid function may output close to either 0 or 1, so its gradient becomes very small
- Multiplying these small gradients together eventually leads to a vanishing gradient

![](https://i.imgur.com/g21e1Rw.png)

If the gradient becomes very small, then it does very little in adjusting the weights to new values.
:::

### Components of LSTM Network

- LSTM cell is designed to prevent the vanishing gradient problem
    - Reduces repeated multiplication of gradients less than one
    - Creates an internal memory state that is just *added* to the processed input
- Forget gate: controls time dependence and the effects of past inputs
    - Determines what to remember/forget

#### Structure of LSTM Cell

![](https://i.imgur.com/R1c6sRN.png)

- Data flows left to right
- $x_t$: current input
- $h_{t-1}$: previous cell output
- Concatenate $x_t$ and $h_{t-1}$ together and feed the result into the top "data rail"

#### Input Gate

*Scale input to range (-1,1) using the tanh activation function:*

$$
g = \tanh(b^g + x_tU^g + h_{t-1}V^g)
$$

- $U^g$: weights for the input
- $V^g$: weights for the previous cell output
- $b^g$: input bias
- The superscript $g$ denotes the input weights and bias values (as opposed to those of the input gate, forget gate, output gate, etc.)

*Multiply the tanh-squashed input by the output of the input gate:*
- Element-wise multiplication
- Input gate: hidden layer of sigmoid-activated nodes
    - Input: weighted $x_t$ and $h_{t-1}$
    - Output: between 0 and 1
    - "Switches" the input on or off via multiplication

Input gate: $i = \sigma(b^i + x_tU^i + h_{t-1}V^i)$

Output of the input stage of the LSTM cell: $g \circ i$
- $\circ$ denotes element-wise multiplication

:::warning
Input gate output *i* acts as weights for squashed input *g*.
:::
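Putting the input-stage equations together, a minimal NumPy sketch of $g$, $i$, and $g \circ i$ might look like the following; the vector sizes and random weights are illustrative assumptions, not values from the tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: input vectors of length 4, cell/hidden size 3
n_in, n_hidden = 4, 3
rng = np.random.default_rng(0)

x_t = rng.normal(size=(1, n_in))         # current input
h_prev = rng.normal(size=(1, n_hidden))  # previous cell output h_{t-1}

# Input weights/bias (superscript g) and input-gate weights/bias (superscript i)
U_g, V_g, b_g = rng.normal(size=(n_in, n_hidden)), rng.normal(size=(n_hidden, n_hidden)), np.zeros(n_hidden)
U_i, V_i, b_i = rng.normal(size=(n_in, n_hidden)), rng.normal(size=(n_hidden, n_hidden)), np.zeros(n_hidden)

g = np.tanh(b_g + x_t @ U_g + h_prev @ V_g)   # squashed input, range (-1, 1)
i = sigmoid(b_i + x_t @ U_i + h_prev @ V_i)   # input gate, range (0, 1)

input_stage = g * i    # element-wise (Hadamard) product, the g ∘ i of the notes
```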
#### Internal State and Forget Gate

- Introduces a new variable $s_t$: the inner state of the LSTM cell
    - Delay $s_t$ by one time step
    - Add it to the $g \circ i$ input
    - Provides an internal recurrence loop to learn relationships between inputs separated by time
- Forget gate: sigmoid-activated set of nodes
    - Element-wise multiplied by $s_{t-1}$ to determine remember/forget
    - Acts as weights for the internal state
    - $f = \sigma(b^f + x_tU^f + h_{t-1}V^f)$
- Element-wise product of the previous state and the forget gate: $s_{t-1} \circ f$
    - Then *added* to the input
        - NOT multiplied or mixed with it
        - A vanilla RNN would typically mix it in via weights and a sigmoid activation function
        - ADDING reduces the risk of vanishing gradients
- Output of this stage: $s_t = s_{t-1} \circ f + g \circ i$

#### Output Gate

- Two parts:
    - tanh squashing function
    - Output sigmoid gating function
- Output sigmoid gating function: multiplied by the squashed state $s_t$ to determine which values are output
    - $o = \sigma(b^o + x_tU^o + h_{t-1}V^o)$
- Final output of the cell: $h_t = \tanh(s_t) \circ o$

### Reducing Vanishing Gradient Problem

In an LSTM cell, the recurrence of the internal state uses addition:

$$
s_t = s_{t-1} \circ f + g \circ i
$$

Taking the partial derivative:

$$
\frac{\partial s_t}{\partial s_{t-1}} = f
$$

- The recurrent gradient becomes just an element-wise multiplication by $f$
    - If the output of $f = 1$, then there is no gradient decay
- In the beginning, the bias of the sigmoid in $f$ is typically made large so that $f$ starts out near 1
    - Allows all past input states to be remembered at the start of training
    - Later, the forget gate will learn to reduce the memory

### Data Dimensions inside LSTM Cell

- Each input is a vector
    - Example: "cat" is a 650-length vector
- Each input row has a certain number of vectors
    - Example: one input row has 35 input vectors, so the input for each row is 35x650
- Batches of data are processed at once via multi-dimensional tensors
    - Example: batch size of 20, so the training input data is 20x35x650
    - "Time-major" format: arranged as *(35 x 20 x 650)*

![](https://i.imgur.com/Oi89Sct.png)
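Tying the gate equations and the data dimensions together, here is a minimal NumPy sketch of one LSTM cell step applied to a time-major batch. The hidden size, weight initialization, and the `lstm_step` function name are illustrative assumptions rather than the tutorial's TensorFlow implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, params):
    """One LSTM cell step for a whole batch: x_t is (batch, n_in),
    h_prev and s_prev are (batch, n_hidden)."""
    U, V, b = params["U"], params["V"], params["b"]
    g = np.tanh(b["g"] + x_t @ U["g"] + h_prev @ V["g"])   # squashed input
    i = sigmoid(b["i"] + x_t @ U["i"] + h_prev @ V["i"])   # input gate
    f = sigmoid(b["f"] + x_t @ U["f"] + h_prev @ V["f"])   # forget gate
    o = sigmoid(b["o"] + x_t @ U["o"] + h_prev @ V["o"])   # output gate
    s_t = s_prev * f + g * i            # internal state: forget old, ADD new input
    h_t = np.tanh(s_t) * o              # cell output
    return h_t, s_t

# Illustrative dimensions following the example above (hidden size is an assumption)
batch, steps, n_in, n_hidden = 20, 35, 650, 650
rng = np.random.default_rng(0)
params = {
    "U": {k: rng.normal(scale=0.1, size=(n_in, n_hidden)) for k in "gifo"},
    "V": {k: rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for k in "gifo"},
    "b": {k: np.zeros(n_hidden) for k in "gifo"},
}

X = rng.normal(size=(steps, batch, n_in))     # "time-major" input: 35 x 20 x 650
h = np.zeros((batch, n_hidden))
s = np.zeros((batch, n_hidden))
for t in range(steps):                        # unroll over the 35 time steps
    h, s = lstm_step(X[t], h, s, params)
print(h.shape)                                # (20, 650): one output per batch row
```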