# <center><i class="fa fa-edit"></i> Neural Network Math, LSTM Background </center>

###### tags: `Internship`

:::info
**Goal:**
- [x] Understand math behind neural networks
- [x] Study conceptual components of LSTM

**Resources:**
[Towards Data Science Page](https://towardsdatascience.com/understanding-lstm-and-its-quick-implementation-in-keras-for-sentiment-analysis-af410fd85b47)
[Adventures in Machine Learning](https://adventuresinmachinelearning.com/neural-networks-tutorial/#first-attempt-feed-forward)
[Machine Learning](https://hackmd.io/@Derni/HJQkjlnIP)
:::

## Math of Optimizing Neural Networks

*Example cost function for a pair:*
- The 1/2 coefficient just makes the calculation easier when we differentiate later on.

![](https://i.imgur.com/h3Iv3tk.png)

*Mean squared error:*

![](https://i.imgur.com/C1wMr1N.png)

*Gradient descent:*
- Same idea as ![](https://i.imgur.com/EaRxABc.png)
- J is the loss function
- Partial derivatives form the gradient

![](https://i.imgur.com/ZzX9gmy.png)

### Calculating Derivative of Loss Function

*Find the change in loss J caused by a weight w using the chain rule:*

![](https://i.imgur.com/PwwDC0G.png)

*Solving for dz/dw:*
- Previously established that z is the vectorized weighted input to the nodes in a layer, such that

![](https://i.imgur.com/mcN2rab.png)
![](https://i.imgur.com/evkdw4g.png)

*Solving for dh/dz:*
- Previously established that h is the output of the activation on the node, AKA the result of f(z), such that

![](https://i.imgur.com/3gG6tSa.png)
![](https://i.imgur.com/n10W8ot.png)

*Solving for dJ/dh:*
- This also requires the chain rule to solve

![](https://i.imgur.com/dC7F0UB.png)

*Final derivative of loss function:*
- Given ![](https://i.imgur.com/mvGEmi9.png)

![](https://i.imgur.com/7KAwDfy.png)
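The chain-rule product above can be sanity-checked numerically. Below is a minimal sketch for a single node, assuming a squared-error pair cost J = 1/2 (y - h)^2 with z = wx + b and a sigmoid activation h = f(z); the specific numbers are made up for illustration and are not from the tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single-node setup (illustrative values, not from the tutorial)
x, y = 1.5, 0.7          # input and target
w, b = 0.8, 0.1          # weight and bias

# Forward pass: z = wx + b, h = f(z), J = 1/2 (y - h)^2
z = w * x + b
h = sigmoid(z)
J = 0.5 * (y - h) ** 2

# Chain rule: dJ/dw = dJ/dh * dh/dz * dz/dw
dJ_dh = -(y - h)                        # derivative of 1/2 (y - h)^2 w.r.t. h
dh_dz = sigmoid(z) * (1 - sigmoid(z))   # sigmoid derivative f'(z) = f(z)(1 - f(z))
dz_dw = x                               # z = wx + b, so dz/dw = x
grad_analytic = dJ_dh * dh_dz * dz_dw

# Numerical check with a small finite difference
eps = 1e-6
J_plus = 0.5 * (y - sigmoid((w + eps) * x + b)) ** 2
grad_numeric = (J_plus - J) / eps

print(grad_analytic, grad_numeric)      # the two values should agree closely
```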
### Propagating through Hidden Layers

Connecting weights between different layers:
- Weights connecting to the output layer: ![](https://i.imgur.com/McpbHzI.png)
    - Simple, since we can directly compare the output with the training data
- Harder to calculate loss changes for the outputs of hidden layers
    - No direct reference to compare the outputs with
    - Only connected to the loss function via other nodes and mediating weights

![](https://i.imgur.com/3VU24qo.png)

For hidden layer calculations, use backpropagation:
- Propagate this term through the layers: ![](https://i.imgur.com/mvGEmi9.png)
    - This term is the network's way of connecting to the loss function.
    - Nodes in hidden layers contribute to this term via their weights
- If there is only one output layer node, the generalized term is: ![](https://i.imgur.com/DUI2nxq.png)
- If there are multiple nodes in the output layer, then we can generalize to: ![](https://i.imgur.com/mn5fRTc.png)

![](https://i.imgur.com/vktXsXl.png)

:::success
**Conclusion**

Given all the work above, we can finally calculate the change of the loss function by a change in the weights:

![](https://i.imgur.com/7HPDXha.png)

Following similar steps, we can also calculate the change of the loss function by a change in the bias:

![](https://i.imgur.com/tt5MTws.png)

Given the above derivations, we can finally calculate **gradient descent**:

![](https://i.imgur.com/LfkC6Y0.png)
:::

Vectorization of backpropagation:
- Does NOT use matrix multiplication
- The dot symbol is element-by-element multiplication, AKA the Hadamard product

![](https://i.imgur.com/eM1dePn.png)

:::warning
For reference, the Hadamard product works as such:

![](https://i.imgur.com/GSxsVFA.png)
:::

### Integrating Gradient Descent

(Reminder) Overall cost function:

![](https://i.imgur.com/Hx85YYF.png)

(Reminder) Gradient descent calculation (element-by-element and vectorized):

![](https://i.imgur.com/ovFrLYM.png)

Sum up all partial derivatives of the individual cost function calculations (both W and b):
- Perform this operation at each iteration.

![](https://i.imgur.com/fzx1zJc.png)

Once you've iterated through all samples, update the weight parameters:

![](https://i.imgur.com/e2Mawso.png)

### Final Backpropagation Algorithm

- Randomly initialize weights for each layer $W^{(l)}$
- While iterations < iteration limit:
    - Set $\Delta W$ and $\Delta b$ to zero
    - For each sample $1$ to $m$:
        - Perform a feed-forward pass through all $n_l$ layers
        - Store the activation function outputs $h^{(l)}$
        - Calculate $\delta^{(n_l)}$ for the output layer
        - Use backpropagation to calculate $\delta^{(l)}$ values for layers 2 to $n_l - 1$
        - Update $\Delta W^{(l)}$ and $\Delta b^{(l)}$ for each layer
    - Perform gradient descent using:

$$
W^{(l)} = W^{(l)} - \alpha \left[\frac{1}{m} \Delta W^{(l)} \right]
$$

$$
b^{(l)} = b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right]
$$

Repeat gradient descent until satisfied with the minimization.
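To make the loop above concrete, here is a minimal NumPy sketch of the algorithm for a small fully connected network with sigmoid activations and the squared-error cost. The layer sizes, variable names (`sizes`, `dW`, `db`, etc.), and data handling are illustrative assumptions, not the tutorial's own code.

```python
import numpy as np

def f(z):                      # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):                # sigmoid derivative f'(z) = f(z)(1 - f(z))
    return f(z) * (1.0 - f(z))

def train(X, Y, sizes=(4, 3, 2), alpha=0.5, iterations=1000):
    """X: (m, n_in) inputs, Y: (m, n_out) targets. `sizes` gives nodes per layer."""
    rng = np.random.default_rng(0)
    L = len(sizes)
    # Randomly initialize weights and biases for each layer
    W = {l: rng.normal(scale=0.5, size=(sizes[l], sizes[l - 1])) for l in range(1, L)}
    b = {l: np.zeros((sizes[l], 1)) for l in range(1, L)}
    m = X.shape[0]

    for _ in range(iterations):
        dW = {l: np.zeros_like(W[l]) for l in W}      # accumulated Delta W
        db = {l: np.zeros_like(b[l]) for l in b}      # accumulated Delta b
        for x, y in zip(X, Y):
            # Feed-forward pass, storing z and h for every layer
            h = {0: x.reshape(-1, 1)}
            z = {}
            for l in range(1, L):
                z[l] = W[l] @ h[l - 1] + b[l]
                h[l] = f(z[l])
            # Output-layer delta, then backpropagate through the hidden layers
            delta = {L - 1: -(y.reshape(-1, 1) - h[L - 1]) * f_prime(z[L - 1])}
            for l in range(L - 2, 0, -1):
                delta[l] = (W[l + 1].T @ delta[l + 1]) * f_prime(z[l])
            # Accumulate gradients for this sample
            for l in range(1, L):
                dW[l] += delta[l] @ h[l - 1].T
                db[l] += delta[l]
        # Gradient-descent update once per pass over all m samples
        for l in range(1, L):
            W[l] -= alpha * (dW[l] / m)
            b[l] -= alpha * (db[l] / m)
    return W, b
```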
### Implementation of NN in Python

GOAL: Identify the value of hand-written digits.

:::warning
I will skip this part of the tutorial for now.
:::

## LSTM Tutorial

Source: https://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/

:::warning
Continued from last note.
:::

### Problem with Vanilla RNN

- Recurrent neural networks
    - Each time step shares the same set of weights
    - Very flexible configurations to describe inputs vs. outputs
        - Many-to-many
        - Many-to-one
        - One-to-many

![](https://i.imgur.com/Yy3bT7u.png)
![](https://i.imgur.com/WJIi05Q.png)

- Impracticality of vanilla RNNs
    - Vanishing gradient problem
    - Can't handle long-term memory
    - More time steps means more backpropagation gradients multiplied together, which accumulate to either explode or vanish

:::success
**Demonstration of Vanishing Gradient Problem**

Representation of RNN: $\textbf{h}_t = \sigma (\textbf{Ux}_t + \textbf{Vh}_{t-1})$
- U: weight matrix connecting inputs
- V: weight matrix connecting recurrent outputs

Unrolled three time steps back (the output at each step is a softmax of $\textbf{h}_t$):

$$
\textbf{h}_t = \sigma (\textbf{Ux}_t + \textbf{V}\sigma(\textbf{Ux}_{t-1} + \textbf{V}\sigma(\textbf{Ux}_{t-2})))
$$

Gradient of the error with respect to weight matrix U:

$$
\frac{\partial E_3}{\partial U} = \frac{\partial E_3}{\partial out_3}\frac{\partial out_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial U}
$$

Each gradient requires calculation of the gradient of the sigmoid function:
- The sigmoid function may output close to either 0 or 1, so its gradient becomes very small
- Multiplying these small gradients together eventually leads to a vanishing gradient

![](https://i.imgur.com/g21e1Rw.png)

If the gradient becomes very small, then it does very little in adjusting the weights to new values.
:::

### Components of LSTM Network

- LSTM cell is designed to prevent the vanishing gradient problem
    - Reduces repeated multiplication of gradients less than one
    - Creates an internal memory state that is just *added* to the processed input
- Forget gate: controls time dependence and the effects of past inputs
    - Determines what to remember/forget

#### Structure of LSTM Cell

![](https://i.imgur.com/R1c6sRN.png)

- Data flows left to right
- $x_t$: current input
- $h_{t-1}$: previous cell output
- Concatenate $x_t$ and $h_{t-1}$ together and feed the result into the top "data rail"

#### Input Gate

*Scale input to range (-1,1) using the tanh activation function:*

$$
g = \tanh(b^g + x_tU^g + h_{t-1}V^g)
$$

- $U^g$: weights for the input
- $V^g$: weights for the previous cell output
- $b^g$: input bias
- The superscript $g$ denotes the input weights and bias values (as opposed to those of the input gate, forget gate, output gate, etc.)

*Multiply the tanh-squashed input by the output of the input gate:*
- Element-wise multiplication
- Input gate: hidden layer of sigmoid-activated nodes
    - Input: weighted $x_t$ and $h_{t-1}$
    - Output: between 0 and 1
    - "Switches" the input on or off via multiplication

Input gate: $i = \sigma(b^i + x_tU^i + h_{t-1}V^i)$

Output of the input stage of the LSTM cell: $g \circ i$
- $\circ$ denotes element-wise multiplication

:::warning
Input gate output *i* acts as weights for squashed input *g*.
:::
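Putting the input-stage equations together, a minimal NumPy sketch of $g$, $i$, and $g \circ i$ might look like the following; the vector sizes and random weights are illustrative assumptions, not values from the tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: input vectors of length 4, cell/hidden size 3
n_in, n_hidden = 4, 3
rng = np.random.default_rng(0)

x_t = rng.normal(size=(1, n_in))         # current input
h_prev = rng.normal(size=(1, n_hidden))  # previous cell output h_{t-1}

# Input weights/bias (superscript g) and input-gate weights/bias (superscript i)
U_g, V_g, b_g = rng.normal(size=(n_in, n_hidden)), rng.normal(size=(n_hidden, n_hidden)), np.zeros(n_hidden)
U_i, V_i, b_i = rng.normal(size=(n_in, n_hidden)), rng.normal(size=(n_hidden, n_hidden)), np.zeros(n_hidden)

g = np.tanh(b_g + x_t @ U_g + h_prev @ V_g)   # squashed input, range (-1, 1)
i = sigmoid(b_i + x_t @ U_i + h_prev @ V_i)   # input gate, range (0, 1)

input_stage = g * i    # element-wise (Hadamard) product, the g ∘ i of the notes
```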
#### Internal State and Forget Gate

- Introduces a new variable $s_t$: the inner state of the LSTM cell
    - Delay $s_t$ by one time step
    - Add it to the $g \circ i$ input
    - Provides an internal recurrence loop to learn relationships between inputs separated by time
- Forget gate: sigmoid-activated set of nodes
    - Element-wise multiplied by $s_{t-1}$ to determine remember/forget
    - Acts as weights for the internal state
    - $f = \sigma(b^f + x_tU^f + h_{t-1}V^f)$
- Element-wise product of the previous state and the forget gate: $s_{t-1} \circ f$
    - Then *added* to the input
        - NOT multiplied or mixed with it
        - A vanilla RNN would typically mix it in via weights and a sigmoid activation function
        - ADDING reduces the risk of vanishing gradients
- Output of this stage: $s_t = s_{t-1} \circ f + g \circ i$

#### Output Gate

- Two parts:
    - tanh squashing function
    - Output sigmoid gating function
- Output sigmoid gating function: multiplied by the squashed state $s_t$ to determine which values are output
    - $o = \sigma(b^o + x_tU^o + h_{t-1}V^o)$
- Final output of the cell: $h_t = \tanh(s_t) \circ o$

### Reducing Vanishing Gradient Problem

In an LSTM cell, the recurrence of the internal state uses addition:

$$
s_t = s_{t-1} \circ f + g \circ i
$$

Taking the partial derivative:

$$
\frac{\partial s_t}{\partial s_{t-1}} = f
$$

- The recurrent gradient becomes just an element-wise multiplication by $f$
    - If the output of $f = 1$, then there is no gradient decay
- In the beginning, the bias of the sigmoid in $f$ is typically made large so that $f$ starts out near 1
    - Allows all past input states to be remembered at the start of training
    - Later, the forget gate will learn to reduce the memory

### Data Dimensions inside LSTM Cell

- Each input is a vector
    - Example: "cat" is a 650-length vector
- Each input row has a certain number of vectors
    - Example: one input row has 35 input vectors, so the input for each row is 35x650
- Batches of data are processed at once via multi-dimensional tensors
    - Example: batch size of 20, so the training input data is 20x35x650
    - "Time-major" format: arranged as *(35 x 20 x 650)*

![](https://i.imgur.com/Oi89Sct.png)
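Tying the gate equations and the data dimensions together, here is a minimal NumPy sketch of one LSTM cell step applied to a time-major batch. The hidden size, weight initialization, and the `lstm_step` function name are illustrative assumptions rather than the tutorial's TensorFlow implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, params):
    """One LSTM cell step for a whole batch: x_t is (batch, n_in),
    h_prev and s_prev are (batch, n_hidden)."""
    U, V, b = params["U"], params["V"], params["b"]
    g = np.tanh(b["g"] + x_t @ U["g"] + h_prev @ V["g"])   # squashed input
    i = sigmoid(b["i"] + x_t @ U["i"] + h_prev @ V["i"])   # input gate
    f = sigmoid(b["f"] + x_t @ U["f"] + h_prev @ V["f"])   # forget gate
    o = sigmoid(b["o"] + x_t @ U["o"] + h_prev @ V["o"])   # output gate
    s_t = s_prev * f + g * i            # internal state: forget old, ADD new input
    h_t = np.tanh(s_t) * o              # cell output
    return h_t, s_t

# Illustrative dimensions following the example above (hidden size is an assumption)
batch, steps, n_in, n_hidden = 20, 35, 650, 650
rng = np.random.default_rng(0)
params = {
    "U": {k: rng.normal(scale=0.1, size=(n_in, n_hidden)) for k in "gifo"},
    "V": {k: rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for k in "gifo"},
    "b": {k: np.zeros(n_hidden) for k in "gifo"},
}

X = rng.normal(size=(steps, batch, n_in))     # "time-major" input: 35 x 20 x 650
h = np.zeros((batch, n_hidden))
s = np.zeros((batch, n_hidden))
for t in range(steps):                        # unroll over the 35 time steps
    h, s = lstm_step(X[t], h, s, params)
print(h.shape)                                # (20, 650): one output per batch row
```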