# Deep Learning Lab 6: RNN
## Overview (0:00 - 14:00) - Zhengyuan Ding
RNN is one type of architectures that we can use to deal with sequences of data. What is a sequence? From the CNN lesson we see that a signal can be 1D,2D or 3D based on the domain. The domain is what you map from to go to. Handling sequential data is basically dealing with 1D data since the domain are just temporal input Xs. Nevertheless, you can also use RNN to deal with 2D data, where you have double directions.
#### 1. Vanilla vs. Recurrent NN

On the left is a typical vanilla neural network diagram with three layers. "Vanilla" is an american term meaning plain. The pink bubble is the input vector x, in the center is the hidden layer in green, and the final blue layer is the output. Using an example from digital electronics on the right, this is like a combinational logic, where the current output only depends on the current input.

In constrast to a vanilla neural network, in recurrent neural network the current output depends not only on the current input but also the state of the system, shown on the left of the image above. This is like a sequential logic in digital electronic, where the output also depends on a "flip-flop"( a basic memory unit in digital electronics).
Therefore the main difference here is that the outoput of a vanilla neural network only depends on the current input, while the one of RNN depends also on the state of the system.

Yann's diagram adds these shapes between neurons to represent the mapping between one tensor and another(one vector to another). For example, in the diagram above the input vector x will map through this additional item to the hidden representations h. This item is actually affine transformation i.e. rotation plus distortion. Then through another transformation we get from the hidden layer to the final output. Similarly, in the RNN diagram you can have the same additional items between neurons.

#### 2. 4 Types of RNN Architectures and Examples

First case is vector to sequence as shown in the diagram above.
The input is one bubble and then there will be evolutions of the internal state of the system annotated as these green bubbles. As the state of the system evolves, at every time step there will be one speicific output.
An example of this type of architecture is shown below. The input is one of these images while the output will be a sequence of words representing the english descriptions of the input image. To explain using the diagram above, each blue bubble here can be an index in a dictionay and tells which word to pick in an indexed dictionary.For instance if the output is the sentence "This is a yellow school bus". You first get an index and look it up to be "a" and then get another one to pick up word "yellow" and so on. Some of the results of this network is shown below.For example, in the first column the description regarding the last picture is "A herd of elephants walking across a dry grass field.", which is very well refined. Then in the second column the first image outputs "Two dogs play in the grass.", while it's actually three. In the last column are the more wrong examples such as "A yellow school bus parked in a parking lot." In general, these results show that this network can be failing or performs well sometimes. This is the case that is from one input vector, which is the representation of an image, to a sequence of symbols, which are for example characters or words making up the english sentences. This kind of architecture is called autoregressive network. An autoregressive network is a network which is outputing an output given that you feed as input the previous output.


Second usage is sequence to a final vector. This network keeps feeding a sequence of symbols and only at the end it gives a final output. An application of this can be using the network to interpret Python. For example, the input are these lines of Python program.
```python
j = 8584
for x in range(8):
j += 920
b=(1500+j)
print((b+7567))
```
Then network will be able to output the correct solution of this program. Another more complicated program like this:
```python
i = 8827
c = (i-5347)
print((c+8704) if
2641<8500 else 5308)
```
Then the output should be 12184. These two examples display that you can train a neural network to do this kind of operation. We just need to feed a sequence of symbols and enforce the final output to be a specific value.

Next is sequence to vector to sequence. This architecture used to be the standard way of performing language translation. You start with a sequence of symbols here shown in pink. Then everything gets condensed into this final h, which represents a concept. For instance, we can have a sentence as input and squeeze it temproraly into a vector, which is representing the meaning and message that to send across. Then after getting this meaning in whatever representation, the network unrolls it back into a different language. For example "Today I'm very happy" in a sequence of words in English can be translated into Italian or Chinese. In general, the network gets some kind of encoding as inputs and turns them to a compressed representation. Finaly it gets the decoding given the same compressed version. Also in recent time we have seen networks like Transformers, which we will cover in the next lesson. This type of architecture used to be the state of the art two years ago.
If you do a PCA over the latent space, you will have the words grouped by semantics like shown in this graph.

If we zoom in, we will see that the in the same location there are all the months, like January and November.

If you focus on a different region, you get words like "a few days ago " the next few months etc.

From these examples, we see that different location will have some specific common meaning.
This graph below showcases how how by training this kind of network will pick up on some semantics features. For exmaple in this case you can see there is a vector connecting man to woman and another between king and queen, which means woman minus man is going to be equal to queen minus king. You will get the same distance in this embeddings space applied to cases like male-female. Another example will be walking to walked and swimming to swam. You can always apply this kind of specific linear transofmation going from one word to another or from country to capital.


The last case is sequence to sequence. In this network, as you start feeding inside input the network starts getting output. An example of this type of architecture is T9, if you remeber using Nokia phone, you will get text suggestions as you are typing things through. Another example is speech to captions. One of the cool example is this RNN-writer. When you start typing "the rings of saturn glittered while", it suggests the following "two men looked at each other". This network was trained on some sci-fi novels so that you can just type somthing and let it make suggestions to help you write a book. One more example is shown below. You input the top prompt and then this network will try to complete the rest.

## Back Propagation through time (14:00 - 23:50) - Biao Huang
#### 1. Model architecture
In order to perform training on RNN, backpropagation through time (BPTT) is supposed to be used. The model architechture of RNN is given in figure below. The left design uses loop representation while the right figure unfolds the loop into a row over time.

Hidden representations are stated as
$\begin{aligned}
\begin{cases}
h[t]&= g(W_{h}\begin{bmatrix}
x[t] \\
h[t-1]
\end{bmatrix}
+b_h) \\
h[0]&\dot=\ \boldsymbol{0},\ W_h\dot=\left[ W_{hx} W_{hh}\right] \\
\hat{y}[t]&= g(W_yh[t]+b_y)
\end{cases}
\end{aligned}$
The first equation indicates a non-linear function applied on rotation of a stack version of input where previous configuration of hidden layer is appended. At the beginning, $h[0]$ is set 0. To simplify the equation, $W_h$ can be written as two separate matrices, $\left[ W_{hx}\ W_{hh}\right]$, thus sometimes the transformation can be stated as
$$W_{hx}\cdot x[t]+W_{hh}\cdot h[t-1]$$which corresponds to the stack representation of input.
y[t] is calculated at the final rotation and then we can use the chain rule to backpropagate the error to previous time step.
#### 2. Batch-Ification in Language Modeling
When dealing with a sequence of symbols, we can batchify the text into different sizes. For example, when dealing with sequence shown in the following figure, batch-ification can be applied first, where time domain is preserved vertically. In this case, batch size is set to 4.

If BPTT period $T$ is set to 3, the first input $x[1:T]$ and output $y[1:T]$ for RNN is determined as
$\begin{aligned}
x[1:T] &= \begin{bmatrix}
a & g & m & s \\
b & h & n & t \\
c & i & o & u \\
\end{bmatrix} \\
y[1:T] &= \begin{bmatrix}
b & h & n & t \\
c & i & o & u \\
d & j & p & v
\end{bmatrix}
\end{aligned}$
When performing RNN on the first batch, firstly, we feed $x[1] = [a\ g\ m\ s]$ into RNN and force the output to be $y[1] = [b\ h\ n\ t]$. The hidden representation $h[1]$ will be sent forward into next time step to help RNN predict $y[2]$ from $x[2]$. After sending $h[T-1]$ to the final set of $x[T]$ and $y[T]$, we cut gradient propagation process for both $h[T]$ and $h[0]$ so that gradients will not propagate infinitely(.detach() in Pytorch). The whole process is shown in figure below.

## Vanishing and Exploding Gradient (23:50 - 31:00) - Biao Huang
#### 1. Problem

The figure above is a typical RNN architecture. In order to perform rotation over previous steps in RNN, we use matrices, which can be regarded as horizontal arrows in model above. Since the matrices can change the size of outputs, if the determinant we select is larger than 1, the gradient will Inflate through time and cause gradient explosion. Relatively speaking, if the eigenvalue we select is small across 0, the propagation process will shrink gradients and finally leads to gradient vanishing.
In typical RNN, gradients will be propagate through all the possible arrows, which provides the gradients a large chance to vanish or explode. For example, the gradient at time 1 is large, which is indicated by the bright color. When it goes through one rotation, the gradient shrinks a lot and at time 3, it gets killed.
#### 2. Solution
An ideal to prevent gradients from exploding or vanishing is to skip connections. To fulfill this, multiply networks can be used.

In the case above, we split original network into 4 networks. Take the first network for instance. It takes in value from input at time 1 and send output to the first intermediate state in Hidden Layer. The state has 3 other networks where the $\circ$s allows the gradients to pass while the $-$s blocks propagation. Such technique is called gated recurrent network.
LSTM is one prevalent gated RNN and is introduced in detail in following sections.
## Long Short-Term Memory (31:00 - 39:00) - Lin Jiang
#### 1. Model Architecture

Above are the equations for recurrent neural networks. Below are equations expressing a LSTM. The input gate is highlighted by yellow boxes, which will be an affine transformation. This input transformation will be multiplying $c[t]$, which is our candidate gate.

Don’t forget gate is multiplying the previous value of cell memory $c[t-1]$. Total cell value $c[t]$ is don’t forget gate plus input gate. Final hidden representation is element-wise multiplication between output gate $o[t]$ and hyperbolic tangent version of the cell $c[t]$, such that things are bounded. Finally, candidate gate $\tilde{c[t]}$ is simply a recurrent net. So we have a $o[t]$ to modulate the output, a $f[t]$ to modulate the don’t forget gate, and a $i[t]$ to modulate the input gate. All these interactions between memory and gates are multiplicative interactions. $i[t]$, $f[t]$ and $o[t]$ are all sigmoids, going from zero to one. Hence, when multiplying by zero, you have a closed mouth. When multiplying by one, you have an open mouth.
How do we turn off the output? Let’s say we have a purple internal representation $th$ and put a zero in the output gate. Then the output will be zero multiplying by something, and we get a zero. If we put a one in the output gate, we will get the same value as the purple representation.



Similarly, we can control the memory. For example, we can reset it by having $f[t]$ and $i[t]$ to be zeros. After multiplication and summation, we have a zero inside the memory. Otherwise, we can keep the memory, by still zeroing out the internal representation $th$ but keep a one in $f[t]$. Hence, the sum gets $c[t-1]$ and keeps sending it out. Finally, we can write such that we can get a one in the input gate, the multiplication gets purple, then set a zero in the don’t forget gate so it actually forget.



#### 2. Step by Step Walk Through
## Notebook Examples (39:00 - End) - Lin Jiang
**Sequence Classification**
The goal is to classify sequences. Elements and targets are represented locally (input vectors with only one non-zero bit). The sequence starts with an B, ends with a E (the “trigger symbol”), and otherwise consists of randomly chosen symbols from the set `{a, b, c, d}` except for two elements at positions t1 and t2 that are either X or Y. For the DifficultyLevel.HARD case, the sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, and t2 is randomly chosen between 50 and 60. There are 4 sequence classes Q, R, S, and U, which depend on the temporal order of X and Y. The rules are: `X, X -> Q`; `X, Y -> R`; `Y, X -> S`; `Y, Y -> U`.
1). Dataset Exploration
The return type from a data generator is a tuple with length 2. The first item in the tuple is the batch of sequences with shape (32, 9, 8). This is the data going to be fed into the network. There are eight different symbols in each row (X, Y, a,b, c, d, B, E). Each row is a one-hot vector. Sequence of rows represents sequence of symbols. The first all-zero row is padding. We use padding when the length of sequence is shorter than the maximum length in the batch.

The second item in the tuple is the corresponding batch of class labels with shape (32, 4), since we have 4 classes (Q, R, S, and U). The first sequence is: BbXcXcbE. Then its decoded class label is $[1, 0, 0, 0]$, corresponding to Q.
2). Defining the Model and Training
Let’s create a simple recurrent network, a LSTM and train for 10 epoches. In the training loop, we should always look for five steps: 1) perform the forward pass of the model, 2) compute the loss, 3) zero the gradient cache, 4) backpropagate to compute the partial derivative of loss with regard to parameters, 5) step in the opposite direction of gradient.


With easy level of difficulty, RNN gets 50% accuracy while LSTM gets 100% after 10 epoches. But LSTM has four times more weights than RNN and has two hidden layers, so it is not a fair comparison. After 100 epoches, RNN also gets 100% accuracy, but it takes longer than LSTM.


If we increase the difficulty of the training part (using longer sequences), we will see the RNN fails while LSTM keeps working.


*The above visualization is drawing the value of hidden state over time in LSTM. We will send the inputs through a hyperbolic tangent, such that if the input is below -2.5, it will be mapped to -1, and if it is above `2.5`, it will be mapped to 1. So in this case, we can see the specific hidden layer picked on X (fifth row in the picture) and then it became red until we got the other X. So, the fifth hidden unit of the cell is triggered by observing the X and goes quiet after seeing the other X. This allows us to recognize the class of sequence.
**Signal Echoing**
Echoing signal n steps is an example of synchronized many-to-many task.For instance, the 1st input sequence is "1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 1 1 ...", and the 1st target sequence is "0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 ...". In this case, the output is three steps after. So we need a short-time working memory to keep the information. Whereas in the language model, it says something that hasn't already been said.
Before, we send the whole sequence to the network, and force the final target to be something. In this case, we need to cut the long sequence into little chunks. While feeding a new chunk, we need to keep track of the hidden state, and send it as input to the internal state when adding the next new chunk. In LSTM, you can keep the memory for a long time as long as you have enough capacity. In RNN, after you reach some kind of length, it starts to forget about what happened in the past.
](https://)