# Recurrent Networks
### Ferenc Huszár (fh277)
DeepNN Lecture 8
---
## Different from what we've seen before:
* different input type (sequences)
* different network building blocks
  * multiplicative interactions
  * gating
  * skip connections
* different objective
  * maximum likelihood
  * generative modelling
---
## Modelling sequences
* input to the network: $x_1, x_2, \ldots, x_T$
* sequences of different lengths
* sometimes an 'EOS' (end-of-sequence) symbol
* sequence classification (e.g. text classification)
* sequence generation (e.g. language generation)
* sequence-to-sequence (e.g. translation)
---
### Recurrent Neural Network
![](https://i.imgur.com/UJYrL7I.png)
---
### RNN: Unrolled through time
![](https://i.imgur.com/YnkgS5P.png)
---
### RNN: different uses
![](https://i.imgur.com/WGl90lv.jpg)
figure from [Andrej Karpathy's blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
---
### Generating sequences
Goal: model the distribution of sequences
$$
p(x_{1:T}) = p(x_1, \ldots, x_T)
$$
Idea: model it one-step-at-a-time:
$$
p(x_{1:T}) = p(x_T\vert x_{1:T-1}) p(x_{T-1} \vert x_{1:T-2}) \cdots p(x_1)
$$
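Taking logs turns this product into a sum of per-step terms, which is exactly what maximum-likelihood training (next slides) maximises:
$$
\log p(x_{1:T}) = \sum_{t=1}^{T} \log p(x_t \mid x_{1:t-1})
$$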
---
### Modeling sequence distributions
![](https://i.imgur.com/WfPwnjZ.png)
---
### Training: maximum likelihood
![](https://i.imgur.com/Z8sLsQI.png)
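The slide shows the idea as a diagram; as a rough sketch (the model, sizes and data below are illustrative assumptions, not the lecture's code), maximum likelihood amounts to a next-character cross-entropy loss with teacher forcing:

```python
# Hedged sketch: maximum-likelihood training of a character-level RNN
# with teacher forcing, in PyTorch. All names and sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h=None):
        e = self.embed(x)            # (batch, time, hidden)
        y, h = self.rnn(e, h)        # hidden state at every time step
        return self.out(y), h        # logits over the next character

vocab = 65
model = CharRNN(vocab)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randint(0, vocab, (8, 100))   # toy batch of character indices
inputs, targets = x[:, :-1], x[:, 1:]   # predict x_t from x_{1:t-1}

logits, _ = model(inputs)
# average negative log-likelihood of the next character at each position
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()
```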
---
### Sampling sequences
![](https://i.imgur.com/c9WcaD0.png)
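Continuing the sketch from the previous slide (it reuses the hypothetical `CharRNN` model defined there), sampling is ancestral: draw one character, feed it back in, repeat.

```python
# Hedged sketch of ancestral sampling, reusing `model` from the training sketch.
import torch

@torch.no_grad()
def sample(model, first_char, length=200):
    h = None
    x = torch.tensor([[first_char]])           # shape (batch=1, time=1)
    out = [first_char]
    for _ in range(length):
        logits, h = model(x, h)                # h summarises x_{1:t-1}
        probs = torch.softmax(logits[:, -1], dim=-1)
        x = torch.multinomial(probs, 1)        # draw x_t ~ p(x_t | x_{1:t-1})
        out.append(x.item())
    return out
```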
---
### Char-RNN: Shakespeare
from [Andrej Karpathy's 2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
![](https://i.imgur.com/cN25jUL.png)
---
### Char-RNN: Wikipedia
from [Andrej Karpathy's 2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
![](https://i.imgur.com/Nr0UjtR.png)
---
### Char-RNN: Wikipedia
from [Andrej Karpathy's 2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
![](https://i.imgur.com/R91pDeJ.png)
---
### Char-RNN example: random XML
from [Andrej Karpathy's 2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
![](https://i.imgur.com/H3b3QjC.png)
---
### Char-RNN example: LaTeX
from [Andrej Karpathy's 2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
![](https://i.imgur.com/GgXRG4n.jpg)
---
### But it was not that easy
* vanilla RNNs forget too quickly
* vanishing gradients problem
* exploding gradients problem
---
### Vanishing/exploding gradients problem
Vanilla RNN:
$$
\mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b_h})
$$
$$
\hat{y} = \phi(W_y \mathbf{h}_{T} + \mathbf{b}_y)
$$
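A direct transcription of these two equations (sizes and the choices $\sigma = \tanh$, $\phi = \text{identity}$ are illustrative assumptions):

```python
# Hedged sketch of the vanilla RNN recurrence above; shapes are made up.
import torch

d_h, d_x, d_y = 64, 32, 10
W_h = 0.1 * torch.randn(d_h, d_h)
W_x = 0.1 * torch.randn(d_h, d_x)
W_y = 0.1 * torch.randn(d_y, d_h)
b_h, b_y = torch.zeros(d_h), torch.zeros(d_y)

def step(h, x):
    # h_{t+1} = sigma(W_h h_t + W_x x_t + b_h), here with sigma = tanh
    return torch.tanh(W_h @ h + W_x @ x + b_h)

h = torch.zeros(d_h)
for x in torch.randn(20, d_x):      # a length-20 toy input sequence
    h = step(h, x)
y_hat = W_y @ h + b_y               # y_hat = phi(W_y h_T + b_y), phi = identity
```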
---
### The gradients of the loss are
\begin{align}
\frac{\partial \hat{L}}{\partial \mathbf{h}_t} &= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} \frac{\partial \mathbf{h}_{s+1}}{\partial \mathbf{h}_s} \\
&= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \left( \prod_{s=t}^{T-1} D_s \right) W_h^{T-t},
\end{align}
where
* $D_s = \operatorname{diag} \left[\sigma'(W_h \mathbf{h}_s + W_x \mathbf{x}_s + \mathbf{b}_h)\right]$
* if $\sigma$ is ReLU, $\sigma'(z) \in \{0, 1\}$
---
### The norm of the gradient is upper bounded
\begin{align}
\left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_t}\right\| &\leq \left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_T}\right\| \left\|W_h\right\|^{T-t} \prod_{s=t}^{T-1} \left\|D_s\right\|,
\end{align}
* the norm of $D_s$ is at most 1 (for ReLU, each diagonal entry is 0 or 1)
* so the $\left\|W_h\right\|^{T-t}$ factor dominates: gradients vanish if $\|W_h\| < 1$ and can explode if $\|W_h\| > 1$ (see the toy example below)
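A toy numerical check (not from the slides) of how the $\|W_h\|^{T-t}$ factor behaves: with $D_s = I$, backpropagating through 50 steps scales the gradient by $\|W_h\|^{50}$, which either vanishes or explodes.

```python
# Toy illustration of vanishing/exploding gradients; W_h is a scaled identity
# so its norm is exactly `scale`, and D_s is taken to be the identity.
import torch

T = 50
g = torch.randn(64)                     # stand-in for dL/dh_T
for scale in [0.9, 1.1]:
    W_h = scale * torch.eye(64)
    v = g.clone()
    for _ in range(T):
        v = W_h.T @ v                   # one step of backpropagation through time
    print(scale, v.norm().item())       # ~0.9^50 * ||g||  vs  ~1.1^50 * ||g||
```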
---
![](https://i.imgur.com/DVFyskJ.png)
---
### More typical solution: gating
Vanilla RNN:
$$
\mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b_h})
$$
Gated Recurrent Unit:
\begin{align}
\mathbf{h}_{t+1} &= \mathbf{z}_t \odot \mathbf{h}_t + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t \\
\tilde{\mathbf{h}}_t &= \phi\left(W\mathbf{x}_t + U(\mathbf{r}_t \odot \mathbf{h}_t)\right)\\
\mathbf{r}_t &= \sigma(W_r\mathbf{x}_t + U_r\mathbf{h}_t)\\
\mathbf{z}_t &= \sigma(W_z\mathbf{x}_t + U_z\mathbf{h}_t)\\
\end{align}
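A line-by-line transcription of the GRU update above (dimensions are made up, biases are omitted as on the slide, and $\phi = \tanh$, $\sigma =$ logistic sigmoid are the usual choices):

```python
# Hedged sketch of one GRU step, matching the equations above.
import torch

d_x, d_h = 32, 64
W,   U   = 0.1 * torch.randn(d_h, d_x), 0.1 * torch.randn(d_h, d_h)
W_r, U_r = 0.1 * torch.randn(d_h, d_x), 0.1 * torch.randn(d_h, d_h)
W_z, U_z = 0.1 * torch.randn(d_h, d_x), 0.1 * torch.randn(d_h, d_h)

def gru_step(h, x):
    r = torch.sigmoid(W_r @ x + U_r @ h)          # reset gate
    z = torch.sigmoid(W_z @ x + U_z @ h)          # update gate
    h_tilde = torch.tanh(W @ x + U @ (r * h))     # candidate state
    return z * h + (1 - z) * h_tilde              # gated combination (shortcut)

h = torch.zeros(d_h)
for x in torch.randn(10, d_x):
    h = gru_step(h, x)
```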
---
## GRU diagram
![](https://i.imgur.com/TrhwIcC.png)
---
### LSTM: Long Short-Term Memory
* by Hochreiter and Schmidhuber (1997)
* improved/tweaked several times since
* more gates to control behaviour
* 2009: Alex Graves' LSTM wins the ICDAR connected handwriting recognition competition
* 2013: sets a new record on the TIMIT speech recognition benchmark
* 2014: GRU proposed as a simplified alternative (Cho et al.)
* 2016: used in production neural machine translation (e.g. Google's GNMT)
---
### RNNs for images
![](https://karpathy.github.io/assets/rnn/house_read.gif)
([Ba et al, 2014](https://arxiv.org/abs/1412.7755))
---
### RNNs for images
![](https://karpathy.github.io/assets/rnn/house_generate.gif)
([Gregor et al, 2015](https://arxiv.org/abs/1502.04623))
---
### RNNs for painting
![](https://i.imgur.com/DhbBAl2.png)
([Mellor et al, 2019](https://learning-to-paint.github.io/))
---
### RNNs for painting
![](https://i.imgur.com/KKg33WR.jpg)
---
### Spatial LSTMs
![](https://i.imgur.com/4fOP3FR.png)
([Theis et al, 2015](https://arxiv.org/pdf/1506.03478.pdf))
---
### Spatial LSTMs generating textures
![](https://i.imgur.com/uLYyB3l.jpg)
---
### Seq2Seq: sequence-to-sequence
![](https://i.imgur.com/Ki8xpvY.png)
([Sutskever et al, 2014](https://arxiv.org/pdf/1409.3215.pdf))
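A rough sketch of the encoder-decoder idea (not Sutskever et al.'s exact model; vocabulary and state sizes are invented): the encoder compresses the whole source sequence into its final hidden state, which initialises the decoder.

```python
# Hedged seq2seq sketch: the encoder's last hidden state conditions the decoder.
import torch
import torch.nn as nn

V_src, V_tgt, d_h = 100, 90, 256          # made-up vocabulary and state sizes
embed_src = nn.Embedding(V_src, d_h)
embed_tgt = nn.Embedding(V_tgt, d_h)
encoder = nn.GRU(d_h, d_h, batch_first=True)
decoder = nn.GRU(d_h, d_h, batch_first=True)
readout = nn.Linear(d_h, V_tgt)

src = torch.randint(0, V_src, (4, 12))    # toy batch of source sentences
tgt = torch.randint(0, V_tgt, (4, 9))     # toy batch of target sentences

_, h = encoder(embed_src(src))            # h: a single "thought vector" per sentence
out, _ = decoder(embed_tgt(tgt), h)       # decoder sees the source only through h
logits = readout(out)                     # next-word predictions
```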
---
### Seq2Seq: neural machine translation
![](https://i.imgur.com/WrZg5r4.png)
---
### Show and Tell: "Image2Seq"
![](https://i.imgur.com/hyUtUjl.png)
([Vinyals et al, 2015](https://arxiv.org/pdf/1411.4555.pdf))
---
### Show and Tell: "Image2Seq"
![](https://i.imgur.com/MSU5mIw.jpg)
([Vinyals et al, 2015](https://arxiv.org/pdf/1411.4555.pdf))
---
### Sentence to parse tree: "Seq2Tree"
![](https://i.imgur.com/ywwmSCK.png)
([Vinyals et al, 2014](https://arxiv.org/abs/1412.7449))
---
### General algorithms as Seq2Seq
travelling salesman
![](https://i.imgur.com/B8jsaMt.png)
([Vinyals et al, 2015](https://arxiv.org/abs/1506.03134))
---
### General algorithms as Seq2Seq
convex hull and triangulation
![](https://i.imgur.com/mTQhCTi.png)
---
### Pointer networks
![](https://i.imgur.com/JhFpOkZ.png)
---
### Revisiting the basic idea
![](https://i.imgur.com/Ki8xpvY.png)
"Asking the network too much"
---
### Attention layer
![](https://i.imgur.com/nskRYts.png)
---
### Attention layer
Attention weights:
$$
\alpha_{t,s} = \frac{e^{\mathbf{e}^T_t \mathbf{d}_s}}{\sum_u e^{\mathbf{e}^T_u \mathbf{d}_s}}
$$
Context vector:
$$
\mathbf{c}_s = \sum_{t=1}^T \alpha_{t,s} \mathbf{e}_t
$$
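A minimal sketch of these two formulas for a single decoder step $s$ (encoder states and the decoder state are random stand-ins):

```python
# Hedged sketch: dot-product attention weights and the context vector.
import torch

T, d = 7, 64
E = torch.randn(T, d)                    # encoder states e_1, ..., e_T
d_s = torch.randn(d)                     # current decoder state

scores = E @ d_s                         # e_t^T d_s for every t
alpha = torch.softmax(scores, dim=0)     # attention weights alpha_{t,s}
c_s = alpha @ E                          # context vector: sum_t alpha_{t,s} e_t
```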
---
### Attention layer visualised
![](https://i.imgur.com/MVt50yl.png =500x)
---
![](https://i.imgur.com/uNwTRux.png)
---
### To engage with this material at home
Try the [char-RNN Exercise](https://github.com/udacity/deep-learning-v2-pytorch/blob/master/recurrent-neural-networks/char-rnn/Character_Level_RNN_Exercise.ipynb) from Udacity.
---
### Side note: dealing with depth
![](https://i.imgur.com/sTaW6fT.png)
---
### Side note: dealing with depth
![](https://i.imgur.com/2oCXEIh.png)
---
### Side note: dealing with depth
![](https://i.imgur.com/w8BmEfS.png =260x)
---
### Deep Residual Networks (ResNets)
![](https://i.imgur.com/hJK6Rx4.png)
---
### Deep Residual Networks (ResNets)
![](https://i.imgur.com/wjBWNn9.png)
---
### ResNets
* allow for much deeper networks (e.g. 101 or 152 layers)
* performance increases with depth
* set new records on benchmarks (ImageNet, COCO)
* used almost everywhere now (see the sketch below)
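A minimal sketch of the residual idea (a generic block, not the exact ResNet architecture): each block adds a learned correction $F(x)$ on top of an identity shortcut.

```python
# Hedged sketch of a residual block: output = relu(x + F(x)).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # the skip connection

y = ResidualBlock(16)(torch.randn(1, 16, 32, 32))
```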
---
### ResNets behave like ensembles
![](https://i.imgur.com/LNPB4e8.png)
from ([Veit et al, 2016](https://arxiv.org/pdf/1605.06431.pdf))
---
### DenseNets
![](https://i.imgur.com/Eyyx1uK.png)
---
### DenseNets
![](https://i.imgur.com/a5dQUl8.png)
---
### Back to RNNs
* like ResNets, LSTMs and GRUs create "shortcuts"
* these allow information to skip processing
* data-dependent gating
* data-dependent shortcuts
---
## Different from what we had before:
* different input type (sequences)
* different network building blocks
  * multiplicative interactions
  * gating
  * skip connections
* different objective
  * maximum likelihood
  * generative modelling