Neural Network Math, LSTM Background
Math of Optimizing Neural Networks
Example cost function for a single training pair $(x, y)$:
$$J(w, b, x, y) = \frac{1}{2}\left\|y - h^{(n_l)}(x)\right\|^2$$
- The 1/2 coefficient just simplifies the derivative when we differentiate later on.
Mean squared error:
$$J(w, b) = \frac{1}{m}\sum_{z=1}^{m}\frac{1}{2}\left\|y^{(z)} - h^{(n_l)}\left(x^{(z)}\right)\right\|^2$$
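As a quick sanity check, a minimal NumPy sketch of this cost (function name and sample values are my own, not from the tutorial):
```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean squared error over m training pairs, in the 1/2 * ||y - h||^2 form."""
    m = y_true.shape[0]
    return np.sum(0.5 * np.linalg.norm(y_true - y_pred, axis=1) ** 2) / m

# Example: 3 samples, 2 output nodes each
y_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.9]])
print(mse_cost(y_true, y_pred))
```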
Gradient descent:
- Same idea as the basic single-variable gradient descent covered earlier: repeatedly step in the direction of the negative gradient
- $J$ is the loss function
- Partial derivatives of $J$ with respect to each weight give the gradient:
$$w_{ij}^{(l)} = w_{ij}^{(l)} - \alpha\,\frac{\partial J}{\partial w_{ij}^{(l)}}$$
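A minimal NumPy sketch of this update step (shapes and the fake gradient are just for illustration):
```python
import numpy as np

alpha = 0.1                      # learning rate
W = np.random.randn(3, 4)        # some layer's weights
dJ_dW = np.random.randn(3, 4)    # pretend this is the computed gradient dJ/dW

W = W - alpha * dJ_dW            # step downhill along the negative gradient
```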
Calculating Derivative of Loss Function
Find the change of loss $J$ caused by a weight $w$ using the chain rule:
$$\frac{\partial J}{\partial w} = \frac{\partial J}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial w}$$
Solving for dz/dw:
- Previously established that $z$ is the vectorization of the weighted inputs to the nodes in a layer, such that
$$z^{(l+1)} = W^{(l)} h^{(l)} + b^{(l)}$$
- Since a single weight $w_{ij}^{(l)}$ multiplies only the activation $h_j^{(l)}$, the derivative is just that activation:
$$\frac{\partial z_i^{(l+1)}}{\partial w_{ij}^{(l)}} = h_j^{(l)}$$
Solving for dh/dz:
- Previously established that $h$ is the output of the activation on the node, AKA the result of $f(z)$, such that
$$h^{(l)} = f\left(z^{(l)}\right)$$
- Therefore:
$$\frac{\partial h}{\partial z} = f'(z)$$
Solving for dJ/dh:
- This also requires chain rule to solve
$$\frac{\partial J}{\partial h} = \frac{\partial}{\partial h}\left[\frac{1}{2}\left(y - h\right)^2\right] = -(y - h)$$
Final derivative of loss function:
- Given the definition
$$\delta = -(y - h)\, f'(z)$$
- the derivative of the loss with respect to a weight feeding the output layer becomes
$$\frac{\partial J}{\partial w_{ij}^{(l)}} = \frac{\partial J}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial w} = h_j^{(l)}\,\delta_i^{(l+1)}$$
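A small NumPy sketch of this output-layer calculation, assuming a sigmoid activation (the shapes and values are made up):
```python
import numpy as np

def f(z):                 # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):           # derivative of the sigmoid
    return f(z) * (1.0 - f(z))

z_out = np.array([0.3, -0.1])        # z at the output layer
h_out = f(z_out)                     # network output h^(nl)
h_prev = np.array([0.5, 0.2, 0.9])   # activations of the previous layer h^(l)
y = np.array([1.0, 0.0])             # training targets

delta_out = -(y - h_out) * f_prime(z_out)   # delta = -(y - h) * f'(z)
dJ_dW = np.outer(delta_out, h_prev)         # dJ/dw_ij = h_j * delta_i
dJ_db = delta_out                           # dJ/db_i = delta_i
```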
Propagating through Hidden Layers
Connecting weights between different layers:
- Weights connecting to the output layer are simple, since we can directly compare the output with the training data
- Harder to calculate how the loss changes with the outputs of hidden layers
- No direct reference to compare their outputs with
- Only connected to the loss function via other nodes and the mediating weights
For hidden layer calculations, use backpropagation:
- $\delta_i^{(n_l)} = -(y_i - h_i^{(n_l)})\, f'(z_i^{(n_l)})$
Propagate this term through the layers:
- This term is the network's way of connecting to the loss function.
- Nodes in hidden layers contribute to this term via their weight
- If there is only one output layer node, the generalized term for node $j$ in layer $l$ is:
$$\delta_j^{(l)} = \delta_1^{(l+1)}\, w_{1j}^{(l)}\, f'\left(z_j^{(l)}\right)$$
- If there are multiple nodes in the output layer, then we can generalize to:
$$\delta_j^{(l)} = \left(\sum_{i=1}^{s_{l+1}} w_{ij}^{(l)}\,\delta_i^{(l+1)}\right) f'\left(z_j^{(l)}\right)$$
- where $s_{l+1}$ is the number of nodes in layer $l+1$
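A minimal NumPy sketch of propagating $\delta$ back through one hidden layer (sigmoid activation assumed; shapes are illustrative):
```python
import numpy as np

def f_prime(z):                       # derivative of the sigmoid activation
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Illustrative shapes: layer l has 3 nodes, layer l+1 has 2 nodes
W_l = np.random.randn(2, 3)           # weights from layer l to layer l+1
z_l = np.random.randn(3)              # pre-activation values at layer l
delta_next = np.random.randn(2)       # delta already computed for layer l+1

# delta_j^(l) = (sum_i w_ij * delta_i^(l+1)) * f'(z_j^(l))
delta_l = (W_l.T @ delta_next) * f_prime(z_l)
```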
Conclusion
Given all the work above, we can finally calculate the change of the loss function with respect to the weights:
$$\frac{\partial J}{\partial W^{(l)}} = \delta^{(l+1)}\left(h^{(l)}\right)^T$$
Following similar steps, we can also calculate the change of the loss function with respect to the bias:
$$\frac{\partial J}{\partial b^{(l)}} = \delta^{(l+1)}$$
Given the above derivations, we can finally perform gradient descent:
$$W^{(l)} = W^{(l)} - \alpha\,\frac{\partial J}{\partial W^{(l)}}, \qquad b^{(l)} = b^{(l)} - \alpha\,\frac{\partial J}{\partial b^{(l)}}$$
Vectorization of backpropagation:
- Does NOT use matrix multiplication for the $\circ$ operation
- The $\circ$ (dot) symbol is element-by-element multiplication, AKA the Hadamard product
$$\delta^{(n_l)} = -\left(y - h^{(n_l)}\right) \circ f'\left(z^{(n_l)}\right)$$
$$\delta^{(l)} = \left(\left(W^{(l)}\right)^T \delta^{(l+1)}\right) \circ f'\left(z^{(l)}\right)$$
For reference, Hadamard product works as such:
$$\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \circ \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} a_1 b_1 \\ a_2 b_2 \end{bmatrix}$$
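In NumPy this is just the `*` operator on arrays (values below are arbitrary):
```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a * b)            # element-wise (Hadamard): [ 4. 10. 18.]
print(np.dot(a, b))     # for contrast, the dot product: 32.0
```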
Integrating Gradient Descent
(Reminder) Overall cost function:
$$J(W, b) = \frac{1}{m}\sum_{z=1}^{m}\frac{1}{2}\left\|y^{(z)} - h^{(n_l)}\left(x^{(z)}\right)\right\|^2$$
(Reminder) Gradient descent calculation (element-by-element and vectorized):
$$w_{ij}^{(l)} = w_{ij}^{(l)} - \alpha\,\frac{\partial J(W,b)}{\partial w_{ij}^{(l)}}, \qquad W^{(l)} = W^{(l)} - \alpha\,\frac{\partial J(W,b)}{\partial W^{(l)}}$$
Sum up all partial derivatives of individual cost function calculations (both W and b):
- Perform this operation at each iteration.
$$\Delta W^{(l)} = \Delta W^{(l)} + \delta^{(l+1)}\left(h^{(l)}\right)^T, \qquad \Delta b^{(l)} = \Delta b^{(l)} + \delta^{(l+1)}$$
Once you've iterated through all samples, update weight parameters:
$$W^{(l)} = W^{(l)} - \alpha\left[\frac{1}{m}\Delta W^{(l)}\right], \qquad b^{(l)} = b^{(l)} - \alpha\left[\frac{1}{m}\Delta b^{(l)}\right]$$
Final Backpropagation Algorithm
- Randomly initialize weights for each layer
- While iterations < iteration limit:
- Set $\Delta W^{(l)}$ and $\Delta b^{(l)}$ to zero for each layer
- For samples 1 to $m$:
- Perform feed forward pass through all layers.
- Store activation function outputs $h^{(l)}$
- Calculate $\delta^{(n_l)}$ for the output layer
- Use backpropagation to calculate the $\delta^{(l)}$ values for layers 2 to $n_l - 1$
- Update $\Delta W^{(l)}$ and $\Delta b^{(l)}$ for each layer
- Perform gradient descent using: $W^{(l)} = W^{(l)} - \alpha\left[\frac{1}{m}\Delta W^{(l)}\right]$ and $b^{(l)} = b^{(l)} - \alpha\left[\frac{1}{m}\Delta b^{(l)}\right]$
Repeat gradient descent until satisfied with the minimization.
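To make the algorithm concrete, here is a hedged NumPy sketch of the whole loop for a small sigmoid network (layer sizes, learning rate, and names like `train` are my own illustrative choices, not the tutorial's code):
```python
import numpy as np

def f(z):                                  # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):                            # sigmoid derivative
    return f(z) * (1.0 - f(z))

def train(X, Y, sizes=(3, 4, 2), alpha=0.5, iterations=1000):
    # Randomly initialise weights/biases for each layer
    W = [np.random.randn(sizes[l + 1], sizes[l]) for l in range(len(sizes) - 1)]
    b = [np.random.randn(sizes[l + 1]) for l in range(len(sizes) - 1)]
    m = X.shape[0]

    for _ in range(iterations):
        # Set the gradient accumulators to zero
        dW = [np.zeros_like(w) for w in W]
        db = [np.zeros_like(bb) for bb in b]

        for x, y in zip(X, Y):
            # Feed-forward pass, storing z and h for every layer
            h, zs = [x], []
            for w, bb in zip(W, b):
                zs.append(w @ h[-1] + bb)
                h.append(f(zs[-1]))

            # Output-layer delta, then backpropagate through the hidden layers
            delta = -(y - h[-1]) * f_prime(zs[-1])
            for l in range(len(W) - 1, -1, -1):
                dW[l] += np.outer(delta, h[l])
                db[l] += delta
                if l > 0:
                    delta = (W[l].T @ delta) * f_prime(zs[l - 1])

        # Gradient descent step using the averaged accumulated gradients
        for l in range(len(W)):
            W[l] -= alpha * (dW[l] / m)
            b[l] -= alpha * (db[l] / m)

    return W, b
```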
Implementation of NN in Python
GOAL: Identify value of hand-written digits.
I will skip this part of the tutorial for now.
LSTM Tutorial
Source: https://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/
Continued from last note.
Problem with Vanilla RNN
- Recurrent neural networks
- Each time step shares same set of weights
- Very flexible configurations to describe inputs vs. outputs
- Many-to-many
- Many-to-one
- One-to-many


- Impracticality of vanilla RNNs
- Vanishing gradient problem
- Can't handle long-term memory
- More time steps = more backpropagated gradient terms multiplied together, which either explode or vanish
Demonstration of Vanishing Gradient Problem
Representation of the RNN:
$$h_t = \sigma\left(U x_t + V h_{t-1}\right)$$
- $U$: weight matrix connecting the inputs
- $V$: weight matrix connecting the recurrent outputs
Three time steps in, the output is produced by a softmax over $h_3$.
Gradient of the error $E_3$ with respect to weight matrix $U$ requires backpropagation through time:
$$\frac{\partial E_3}{\partial U} = \sum_{k=0}^{3}\frac{\partial E_3}{\partial h_3}\,\frac{\partial h_3}{\partial h_k}\,\frac{\partial h_k}{\partial U}, \qquad \frac{\partial h_3}{\partial h_k} = \prod_{j=k+1}^{3}\frac{\partial h_j}{\partial h_{j-1}}$$
Each of these factors requires calculating the gradient of the sigmoid function:
- The sigmoid's gradient $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$ is very small whenever its output is close to either 0 or 1
- Multiplying many such small terms eventually leads to a vanishing gradient

If the gradient becomes very small, then it does very little to adjust the weights toward better values.
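A tiny numerical illustration of this effect (the pre-activation value and number of steps are made up): multiplying many sigmoid derivatives together quickly drives the gradient toward zero.
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)       # at most 0.25, much smaller when saturated

z = 3.0                        # a moderately saturated pre-activation
grad = 1.0
for step in range(1, 11):
    grad *= sigmoid_grad(z)    # one factor per time step backpropagated through
    print(step, grad)          # shrinks toward zero as steps accumulate
```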
Components of LSTM Network
- LSTM cell is designed to prevent vanishing gradient problem
- Reduces repeated multiplication of gradients that are less than one
- Creates internal memory state that is just added to processed input
- Forget gate: controls time dependence and effects of past inputs
- Determines what to remember/forget
Structure of LSTM Cell

- Data flows left to right
- $x_t$: current input
- $h_{t-1}$: previous cell output
- Concatenate $x_t$ and $h_{t-1}$ together and enter the result into the top "data rail"
Scale the input to the range (-1, 1) using the tanh activation function:
$$g = \tanh\left(b^g + x_t U^g + h_{t-1} V^g\right)$$
- $U^g$: weights for the input
- $V^g$: weights for the previous cell output
- $b^g$: input bias
- The superscript $g$ denotes the input weights and bias values (as opposed to the input gate, forget gate, output gate, etc.)
Multiply tanh-squashed input by output of the input gate:
- Element-wise multiplication
- Input gate: hidden layer of sigmoid activated nodes
- Input: the weighted $x_t$ and $h_{t-1}$
- Output: between 0 and 1
- "Switches" input on or off via multiplication
Input gate:
$$i = \sigma\left(b^i + x_t U^i + h_{t-1} V^i\right)$$
Output of the input stage of the LSTM cell:
$$g \circ i$$
- $\circ$ denotes element-wise multiplication
The input gate output $i$ acts as weights for the squashed input $g$.
Internal State and Forget Gate
- Introduces a new variable $s_t$: the inner state of the LSTM cell
- Delayed by one time step ($s_{t-1}$)
- Added to the input $g \circ i$
- Provides an internal recurrence loop to learn relationships between inputs separated by time
- Forget gate: sigmoid-activated set of nodes, $f = \sigma\left(b^f + x_t U^f + h_{t-1} V^f\right)$
- Element-wise multiplied by $s_{t-1}$ to determine what to remember/forget
- Acts as weights for the internal state
- Element-wise product of previous state and forget gate: $s_{t-1} \circ f$
- Then added to the input $g \circ i$
- NOT multiplied or mixed with it
- A vanilla RNN would typically mix it via weights and a sigmoid activation function
- ADDING reduces the risk of vanishing gradients
- Output of this stage: $s_t = s_{t-1} \circ f + g \circ i$
Output Gate
- Two parts:
- tanh squashing function
- Output sigmoid gating function
- Output sigmoid gating function: $o = \sigma\left(b^o + x_t U^o + h_{t-1} V^o\right)$, multiplied by the squashed state to determine which values are output
- Final output of the cell: $h_t = \tanh(s_t) \circ o$
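A minimal NumPy sketch of a single LSTM cell forward step using the equations above (weight shapes, the 0.1 initialization scale, and all names are my own illustrative choices):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, params):
    """One forward step of the LSTM cell described above."""
    Ug, Vg, bg, Ui, Vi, bi, Uf, Vf, bf, Uo, Vo, bo = params

    g = np.tanh(bg + x_t @ Ug + h_prev @ Vg)     # squashed input
    i = sigmoid(bi + x_t @ Ui + h_prev @ Vi)     # input gate
    f = sigmoid(bf + x_t @ Uf + h_prev @ Vf)     # forget gate
    o = sigmoid(bo + x_t @ Uo + h_prev @ Vo)     # output gate

    s_t = s_prev * f + g * i                     # internal state: add, don't mix
    h_t = np.tanh(s_t) * o                       # cell output
    return h_t, s_t

# Illustrative sizes: 650-length input vector, 100 hidden units
n_in, n_hid = 650, 100
params = tuple(
    np.random.randn(*shape) * 0.1
    for shape in [(n_in, n_hid), (n_hid, n_hid), (n_hid,)] * 4
)
h, s = np.zeros(n_hid), np.zeros(n_hid)
h, s = lstm_step(np.random.randn(n_in), h, s, params)
```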
Reducing Vanishing Gradient Problem
In an LSTM cell, the recurrence of the internal state uses addition:
$$s_t = s_{t-1} \circ f + g \circ i$$
Taking the partial derivative:
$$\frac{\partial s_t}{\partial s_{t-1}} = f$$
- The recurrent gradient becomes just an element-wise multiplication by the forget gate output $f$
- If the output of $f$ is 1, then there is no gradient decay
- In the beginning, the bias of the sigmoid in $f$ is typically made larger so that $f$ starts out close to 1
- Allows all past input states to be remembered in the beginning
- Later forget gate will reduce memory
Data Dimensions inside LSTM Cell
- Each input is a vector
- Example: "cat" is a 650-length vector
- Each input row has a certain number of vectors
- Example: one input row has 35 input vectors, so input of each row is 35x650
- Batches of data processed at once via multi-dimensional tensors
- Example: batch size of 20, so training input data is 20x35x650
- "Time-major" format: arranged (35 x 20 x 650)
