# <center><i class="fa fa-edit"></i> Neural Network Math, LSTM Background </center>
###### tags: `Internship`
:::info
**Goal:**
- [x] Understand math behind neural networks
- [x] Study conceptual components of LSTM
**Resources:**
[Towards Data Science Page](https://towardsdatascience.com/understanding-lstm-and-its-quick-implementation-in-keras-for-sentiment-analysis-af410fd85b47)
[Adventures in Machine Learning](https://adventuresinmachinelearning.com/neural-networks-tutorial/#first-attempt-feed-forward)
[Machine Learning](https://hackmd.io/@Derni/HJQkjlnIP)
:::
## Math of Optimizing Neural Networks
*Example cost function for a single training pair:*
$$
J(w, b, x, y) = \frac{1}{2}\left\|y - h^{(n_l)}(x)\right\|^2
$$
- The 1/2 coefficient just makes the algebra cleaner when we take derivatives later on.

*Mean squared error (the pair cost averaged over all $m$ samples):*
$$
J(w, b) = \frac{1}{m}\sum_{z=1}^{m}\frac{1}{2}\left\|y^{(z)} - h^{(n_l)}(x^{(z)})\right\|^2
$$

*Gradient descent*:
- Same idea as minimizing any differentiable function: repeatedly step the parameters downhill along the gradient
- $J$ is the loss function
- Partial derivatives give the gradient (a quick numeric sketch follows below)
$$
w_{new} = w_{old} - \alpha \frac{\partial J}{\partial w}, \qquad b_{new} = b_{old} - \alpha \frac{\partial J}{\partial b}
$$
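
A minimal Python sketch of this update rule on the single-pair squared-error cost above; the data point, starting weight, and learning rate are made-up values for illustration.

```python
# Gradient descent on J(w) = 1/2 * (y - w*x)^2 for a single made-up training pair
x, y = 2.0, 6.0           # illustrative input/target pair
w, alpha = 0.0, 0.1       # initial weight and learning rate

for step in range(25):
    pred = w * x
    grad = -(y - pred) * x    # dJ/dw, thanks to the 1/2 coefficient
    w = w - alpha * grad      # step opposite to the gradient
print(w)                      # converges toward y / x = 3.0
```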

### Calculating Derivative of Loss Function
*Find the change of loss $J$ caused by a weight $w$ through the chain rule:*
$$
\frac{\partial J}{\partial w} = \frac{\partial J}{\partial h}\frac{\partial h}{\partial z}\frac{\partial z}{\partial w}
$$

*Solving for $\partial z / \partial w$:*
- Previously established that $z$ is the weighted input to a node (vectorized over a layer as $z = Wx + b$), so for a single weight $\frac{\partial z}{\partial w} = x$

*Solving for $\partial h / \partial z$:*
- Previously established that $h$ is the output of the activation on the node, AKA the result of $f(z)$, so $\frac{\partial h}{\partial z} = f'(z)$

*Solving for $\partial J / \partial h$:*
- This also requires the chain rule to solve: with $J = \frac{1}{2}(y - h)^2$, we get $\frac{\partial J}{\partial h} = -(y - h)$

*Final derivative of loss function:*
- Given the three pieces above (checked numerically below):

$$
\frac{\partial J}{\partial w} = -(y - h)\, f'(z)\, x
$$
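
To make the chain rule concrete, here is a small Python check (assuming a sigmoid activation and the made-up scalar values below) comparing the analytic derivative with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up scalar example: one input, one weight, one bias, squared-error loss
x, y, w, b = 1.5, 1.0, 0.8, 0.1

def loss(w):
    z = w * x + b              # z = wx + b
    h = sigmoid(z)             # h = f(z)
    return 0.5 * (y - h) ** 2  # J = 1/2 (y - h)^2

# Analytic gradient from the chain rule: dJ/dw = dJ/dh * dh/dz * dz/dw
z = w * x + b
h = sigmoid(z)
analytic = -(y - h) * h * (1 - h) * x

# Finite-difference estimate for comparison
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(analytic, numeric)       # the two values should agree closely
```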

### Propagating through Hidden Layers
Connecting weights between different layers:
- Weights connecting to the output layer are straightforward to update
    - Simple since we can directly compare the output with the training data
- Harder to calculate loss changes in the outputs of hidden layers
    - No direct reference to compare their outputs with
    - Only connected to the loss function via other nodes and mediating weights

For hidden layer calculations, use backpropagation:
- Propagate an error term $\delta$ back through the layers (the hidden-layer recursion is written out after this list)
- This term is the network's way of connecting each node to the loss function.
- Nodes in hidden layers contribute to this term via their weights
- If there is only one output layer node, the term is $\delta^{(n_l)} = -(y - h^{(n_l)})\, f'(z^{(n_l)})$
- If there are multiple nodes in the output layer, then we can generalize to $\delta_j^{(n_l)} = -(y_j - h_j^{(n_l)})\, f'(z_j^{(n_l)})$
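
For completeness, the hidden-layer recursion written out explicitly (my indexing: $w_{ji}^{(l)}$ is the weight connecting node $i$ in layer $l$ to node $j$ in layer $l+1$; the tutorial's notation may differ slightly):
$$
\delta_i^{(l)} = \left(\sum_{j=1}^{s_{l+1}} w_{ji}^{(l)}\, \delta_j^{(l+1)}\right) f'\!\left(z_i^{(l)}\right)
$$
Each hidden node collects the $\delta$ values of the nodes it feeds into, weighted by the connecting weights, and scales the sum by its own activation gradient.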

:::success
**Conclusion**
Given all the work above, we can finally calculate the change of the loss function caused by a change in the weights:
$$
\frac{\partial J}{\partial w_{ij}^{(l)}} = h_j^{(l)}\, \delta_i^{(l+1)}
$$
Following similar steps, we can also calculate the change of the loss function caused by a change in the bias:
$$
\frac{\partial J}{\partial b_i^{(l)}} = \delta_i^{(l+1)}
$$
Given the above derivations, we can finally calculate **gradient descent**:
$$
w_{ij}^{(l)} = w_{ij}^{(l)} - \alpha\, h_j^{(l)}\, \delta_i^{(l+1)}, \qquad b_i^{(l)} = b_i^{(l)} - \alpha\, \delta_i^{(l+1)}
$$

:::
Vectorization of backpropagation:
$$
\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \circ f'(z^{(l)}), \qquad \frac{\partial J}{\partial W^{(l)}} = \delta^{(l+1)}\, (h^{(l)})^T
$$
- The $\circ$ symbol does NOT mean matrix multiplication
- It denotes element-by-element multiplication, AKA the Hadamard product

:::warning
For reference, the Hadamard product works as such (see the NumPy example below):
$$
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \circ \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = \begin{bmatrix} 1 \times 4 \\ 2 \times 5 \\ 3 \times 6 \end{bmatrix} = \begin{bmatrix} 4 \\ 10 \\ 18 \end{bmatrix}
$$
:::
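
In NumPy (used here just to illustrate the distinction), the `*` operator is exactly this element-wise product, while `@` is the matrix/dot product:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a * b)   # Hadamard (element-wise) product: [ 4. 10. 18.]
print(a @ b)   # dot product, shown for contrast: 32.0
```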
### Integrating Gradient Descent
(Reminder) Overall cost function:
$$
J(W, b) = \frac{1}{m}\sum_{z=1}^{m}\frac{1}{2}\left\|y^{(z)} - h^{(n_l)}(x^{(z)})\right\|^2
$$

(Reminder) Gradient descent calculation (element-by-element and vectorized):
$$
w_{ij}^{(l)} = w_{ij}^{(l)} - \alpha \frac{\partial J}{\partial w_{ij}^{(l)}}, \qquad W^{(l)} = W^{(l)} - \alpha \frac{\partial J}{\partial W^{(l)}}
$$

Sum up the partial derivatives of the individual cost function calculations (for both $W$ and $b$):
$$
\Delta W^{(l)} = \Delta W^{(l)} + \frac{\partial J(x^{(z)}, y^{(z)})}{\partial W^{(l)}}, \qquad \Delta b^{(l)} = \Delta b^{(l)} + \frac{\partial J(x^{(z)}, y^{(z)})}{\partial b^{(l)}}
$$
- Perform this accumulation for every sample in each iteration.

Once you've iterated through all samples, update the weight parameters with the averaged gradients:
$$
W^{(l)} = W^{(l)} - \alpha\left[\frac{1}{m}\Delta W^{(l)}\right], \qquad b^{(l)} = b^{(l)} - \alpha\left[\frac{1}{m}\Delta b^{(l)}\right]
$$

### Final Backpropagation Algorithm
- Randomly initialize weights for each layer $W^{(l)}$
- While iterations < iteration limit:
    - Set $\Delta W$ and $\Delta b$ to zero
    - For each of the $m$ samples:
        - Perform a feed-forward pass through all $n_l$ layers
        - Store the activation function outputs $h^{(l)}$
        - Calculate $\delta^{(n_l)}$ for the output layer
        - Use backpropagation to calculate the $\delta^{(l)}$ values for layers 2 to $n_l - 1$
        - Update $\Delta W^{(l)}$ and $\Delta b^{(l)}$ for each layer
    - Perform gradient descent using:
$$
W^{(l)} = W^{(l)} - \alpha \left[\frac{1}{m} \Delta W^{(l)} \right]
$$
$$
b^{(l)} = b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right]
$$
Repeat gradient descent until satisfied with minimization.
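
Below is a minimal NumPy sketch of this loop for a single hidden layer with sigmoid activations; the toy data, network sizes, learning rate, and iteration count are made up for illustration and are not from the tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)                  # element-wise derivative of the activation

# Made-up toy data: m samples with 3 features and a single 0/1 target
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = (X.sum(axis=1, keepdims=True) > 1.5).astype(float)
m = X.shape[0]

# Randomly initialize weights and biases for a 3-4-1 network
W1 = rng.normal(scale=0.5, size=(4, 3)); b1 = np.zeros((4, 1))
W2 = rng.normal(scale=0.5, size=(1, 4)); b2 = np.zeros((1, 1))
alpha = 0.5

for it in range(500):                           # while iterations < limit
    dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)   # set deltaW, deltab to zero
    dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
    for i in range(m):                          # for each sample
        x = X[i].reshape(-1, 1); t = y[i].reshape(-1, 1)
        # feed-forward pass, storing z and h for every layer
        z1 = W1 @ x + b1; h1 = sigmoid(z1)
        z2 = W2 @ h1 + b2; h2 = sigmoid(z2)
        # delta for the output layer
        delta2 = -(t - h2) * sigmoid_prime(z2)
        # backpropagate delta to the hidden layer (Hadamard product with f'(z))
        delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
        # accumulate the gradients
        dW2 += delta2 @ h1.T; db2 += delta2
        dW1 += delta1 @ x.T;  db1 += delta1
    # gradient descent step with the averaged gradients
    W2 -= alpha * dW2 / m; b2 -= alpha * db2 / m
    W1 -= alpha * dW1 / m; b1 -= alpha * db1 / m
```

The per-sample inner loop mirrors the pseudocode above; in practice the feed-forward and backward passes would be vectorized over the whole batch.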
### Implementation of NN in Python
GOAL: Identify value of hand-written digits.
:::warning
I will skip this part of the tutorial for now.
:::
## LSTM Tutorial
Source: https://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/
:::warning
Continued from last note.
:::
### Problem with Vanilla RNN
- Recurrent neural networks
    - Each time step shares the same set of weights
    - Very flexible configurations to describe inputs vs. outputs
        - Many-to-many
        - Many-to-one
        - One-to-many


- Impracticality of vanilla RNNs
    - Vanishing gradient problem
    - Can't handle long-term memory
    - More time steps means more gradient factors accumulate during backpropagation, so gradients either explode or vanish
:::success
**Demonstration of Vanishing Gradient Problem**
Representation of RNN: $\textbf{h}_t = \sigma (\textbf{Ux}_t + \textbf{Vh}_{t-1})$
- U: weight matrix connecting inputs
- V: weight matrix connecting recurrent outputs
Unrolling three time steps (before the softmax is applied to the $\textbf{h}_t$ output):
$$
\textbf{h}_t = \sigma (\textbf{Ux}_t + \textbf{V}\sigma(\textbf{Ux}_{t-1} + \textbf{V}\sigma(\textbf{Ux}_{t-2})))
$$
Gradient of error with respect to weight matrix U:
$$
\frac{\partial E_3}{\partial U} = \frac{\partial E_3}{\partial out_3}\frac{\partial out_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial U}
$$
Each gradient in the chain requires the gradient of the sigmoid function:
- The sigmoid saturates close to either 0 or 1, where its gradient is very small
- Multiplying many of these small factors eventually leads to a vanishing gradient

If the gradient becomes very small, it does very little to adjust the weights toward new values.
:::
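
A quick NumPy illustration of the shrinking product; this only tracks the repeated sigmoid-derivative factors (it ignores the $\textbf{V}$ matrix) and uses made-up pre-activation values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid derivative is at most 0.25 (at z = 0) and tiny when |z| is large,
# so a product of many such factors shrinks toward zero.
rng = np.random.default_rng(1)
grad = 1.0
for t in range(20):                         # pretend we backpropagate through 20 time steps
    z = rng.normal()                        # made-up pre-activation at this step
    grad *= sigmoid(z) * (1 - sigmoid(z))   # one sigmoid-derivative factor per step
    print(f"step {t:2d}: accumulated gradient factor = {grad:.3e}")
```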
### Components of LSTM Network
- LSTM cell is designed to prevent the vanishing gradient problem
    - Reduces repeated multiplication of gradients that are less than one
    - Creates an internal memory state that is just *added* to the processed input
- Forget gate: controls time dependence and the effect of past inputs
    - Determines what to remember/forget
#### Structure of LSTM Cell

- Data flows left to right
- $x_t$: current input
- $h_{t-1}$: previous cell output
- Concatenate $x_t$ and $h_{t-1}$ together and feed them into the top "data rail"
#### Input Gate
*Scale input to range (-1,1) using tanh activation function:*
$$
g = tanh(b^g + x_tU^g + h_{t-1}V^g)
$$
- $U^g$: weight for input
- $V^g$: weight for the previous cell output
- $b^g$: input bias
- The superscript $g$ marks the input-stage weights and bias values (as opposed to those of the input gate, forget gate, output gate, etc.)
*Multiply tanh-squashed input by output of the input gate:*
- Element-wise multiplication
- Input gate: hidden layer of sigmoid activated nodes
- Input: weighted $x_t$ and $h_{t-1}$
- Output: between 0 and 1
- "Switches" input on or off via multiplication
Input gate: $i = \sigma(b^i + x_tU^i + h_{t-1}V^i)$
Output of input stage of LSTM cell: $g \circ i$
- $\circ$ denotes element-wise multiplication
:::warning
Input gate output *i* acts as weights for squashed input *g*.
:::
#### Internal State and Forget Gate
- Introduces a new variable $s_t$: the inner state of the LSTM cell
    - Delay $s_t$ by one time-step
    - Add it to the $g \circ i$ input
    - Provides an internal recurrence loop to learn relationships between inputs separated by time
- Forget gate: sigmoid-activated set of nodes
    - Element-wise multiplied by $s_{t-1}$ to determine what to remember/forget
    - Acts as weights for the internal state
    - $f = \sigma(b^f + x_tU^f + h_{t-1}V^f)$
- Element-wise product of previous state and forget gate: $s_{t-1} \circ f$
    - Then *added* to the input
        - NOT multiplied or mixed with it
        - A vanilla RNN would typically mix it via weights and a sigmoid activation function
    - ADDING reduces the risk of vanishing gradients
- Output of this stage: $s_t = s_{t-1} \circ f + g \circ i$
#### Output Gate
- Two parts:
    - tanh squashing function
    - Output sigmoid gating function
- Output sigmoid gating function: multiplied by the squashed state $s_t$ to determine which values are output
    - $o = \sigma(b^o + x_tU^o + h_{t-1}V^o)$
- Final output of cell: $h_t = tanh(s_t) \circ o$ (a NumPy sketch of the full cell follows this list)
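
Putting the gate equations together, here is a NumPy sketch of a single LSTM cell forward step; the sizes (650-length input, 128 hidden units), the random weights, and the `lstm_cell` helper name are made-up illustrative choices, not an API from the tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, s_prev, p):
    """One forward step using the gate equations above.
    p holds the weights U*, V* and biases b* for g, i, f and o."""
    g = np.tanh(p["bg"] + x_t @ p["Ug"] + h_prev @ p["Vg"])   # squashed input
    i = sigmoid(p["bi"] + x_t @ p["Ui"] + h_prev @ p["Vi"])   # input gate
    f = sigmoid(p["bf"] + x_t @ p["Uf"] + h_prev @ p["Vf"])   # forget gate
    o = sigmoid(p["bo"] + x_t @ p["Uo"] + h_prev @ p["Vo"])   # output gate
    s_t = s_prev * f + g * i          # internal state: ADDED, not mixed in
    h_t = np.tanh(s_t) * o            # final cell output
    return h_t, s_t

# Made-up sizes: input vectors of length 650, hidden/state size 128
rng = np.random.default_rng(0)
n_in, n_hid = 650, 128
p = {}
for name in ("g", "i", "f", "o"):
    p["U" + name] = rng.normal(scale=0.1, size=(n_in, n_hid))
    p["V" + name] = rng.normal(scale=0.1, size=(n_hid, n_hid))
    p["b" + name] = np.zeros(n_hid)
p["bf"] += 1.0    # larger forget-gate bias so f starts close to 1

x_t = rng.random(n_in)
h, s = np.zeros(n_hid), np.zeros(n_hid)
h, s = lstm_cell(x_t, h, s, p)
print(h.shape, s.shape)   # (128,) (128,)
```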
### Reducing Vanishing Gradient Problem
In an LSTM cell, the recurrence of the internal state uses addition:
$$
s_t = s_{t-1} \circ f + g \circ i
$$
Taking the partial derivative:
$$
\frac{\partial s_t}{\partial s_{t-1}} = f
$$
- The state-to-state gradient becomes just a multiplication by $f$ (written out over several time steps below)
- If the output of $f = 1$, then there is no gradient decay
- In the beginning, the bias of the sigmoid in $f$ is typically made larger so $f$ starts close to 1
    - Allows all past input states to be remembered at the start of training
    - Later the forget gate will reduce what is remembered
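
Written out over several time steps (considering only the direct state-to-state path, so the dependence of $f$, $g$, and $i$ on $h_{t-1}$ is ignored), the accumulated gradient is a product of forget-gate activations rather than sigmoid derivatives:
$$
\frac{\partial s_T}{\partial s_t} = \prod_{k=t+1}^{T} \frac{\partial s_k}{\partial s_{k-1}} = \prod_{k=t+1}^{T} f_k
$$
If each $f_k \approx 1$, this product stays close to 1, so the gradient does not decay over long spans.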
### Data Dimensions inside LSTM Cell
- Each input is a vector
    - Example: "cat" is a 650-length vector
- Each input row has a certain number of vectors
    - Example: one input row has 35 input vectors, so the input of each row is 35 x 650
- Batches of data are processed at once via multi-dimensional tensors
    - Example: batch size of 20, so the training input data is 20 x 35 x 650
- "Time-major" format: arranged as *(35 x 20 x 650)* (shapes illustrated below)
