# Recurrent Networks
### Ferenc Huszár (fh277)
DeepNN Lecture 8
---
## Different from what we had before:
* different input type (sequences)
* different network building blocks
  * multiplicative interactions
  * gating
  * skip connections
* different objective
  * maximum likelihood
  * generative modelling
---
## Modelling sequences
* input to the network: $x_1, x_2, \ldots, x_T$
  * sequences of different length
  * sometimes 'EOS' symbol
* sequence classification (e.g. text classification; a minimal example is sketched below)
* sequence generation (e.g. language generation)
* sequence-to-sequence (e.g. translation)
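
A minimal PyTorch sketch of the first pattern, classifying from the final hidden state; the sizes and names below are placeholders, not values from the lecture:

```python
import torch.nn as nn

# Minimal sketch: classify a sequence from the RNN's final hidden state.
# All sizes below are placeholders, not values from the lecture.
class SequenceClassifier(nn.Module):
    def __init__(self, vocab_size=128, hidden_size=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
        self.classify = nn.Linear(hidden_size, num_classes)

    def forward(self, tokens):                  # tokens: LongTensor (batch, T)
        _, h_T = self.rnn(self.embed(tokens))   # h_T: (1, batch, hidden_size)
        return self.classify(h_T[-1])           # class logits from the last hidden state
```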
---
### Recurrent Neural Network

---
### RNN: Unrolled through time

---
### RNN: different uses

figure from [Andrej Karpathy's blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
---
### Generating sequences
Goal: model the distribution of sequences
$$
p(x_{1:T}) = p(x_1, \ldots, x_T)
$$
Idea: model it one-step-at-a-time:
$$
p(x_{1:T}) = p(x_T\vert x_{1:T-1}) p(x_{T-1} \vert x_{1:T-2}) \cdots p(x_1)
$$
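
This factorisation maps directly onto a loop over time: keep a hidden state summarising $x_{1:t-1}$, read off the distribution of the next symbol, then feed $x_t$ back in. A sketch assuming a hypothetical character-level set-up with a `torch.nn.RNNCell` (`cell`) and a linear `readout` layer:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the one-step-at-a-time factorisation for an assumed
# character-level model: `cell` is a torch.nn.RNNCell and `readout` a
# torch.nn.Linear mapping hidden states to vocabulary logits.
def sequence_log_prob(cell, readout, x):
    """log p(x_1, ..., x_T) = sum_t log p(x_t | x_{1:t-1}); x: LongTensor of token ids."""
    vocab = readout.out_features
    h = torch.zeros(1, cell.hidden_size)                 # h_0: empty history
    log_prob = torch.zeros(())
    for t in range(len(x)):
        logits = readout(h)                              # parameters of p(x_t | x_{1:t-1})
        log_prob = log_prob + F.log_softmax(logits, dim=-1)[0, x[t]]
        h = cell(F.one_hot(x[t:t+1], vocab).float(), h)  # condition on x_t for the next step
    return log_prob
```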
---
### Modeling sequence distributions

---
### Training: maximum likelihood
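Maximising the log-likelihood amounts to next-symbol prediction with a cross-entropy loss (teacher forcing). A sketch, assuming a hypothetical `model` that maps a sequence of token ids to next-step logits:

```python
import torch.nn.functional as F

# Sketch of maximum-likelihood training via next-symbol prediction (teacher
# forcing). `model` is a hypothetical network mapping token ids (T-1,) to
# next-step logits of shape (T-1, vocab).
def nll(model, x):
    inputs, targets = x[:-1], x[1:]           # predict x_{t+1} from x_{1:t}
    logits = model(inputs)
    return F.cross_entropy(logits, targets)   # = -mean_t log p(x_{t+1} | x_{1:t})

# training step (sketch): nll(model, x).backward(); optimiser.step(); optimiser.zero_grad()
```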

---
### Sampling sequences
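Sampling proceeds ancestrally: draw $x_t \sim p(x_t \vert x_{1:t-1})$, feed it back into the network, repeat. A sketch reusing the assumed `cell` and `readout` from the earlier log-probability example:

```python
import torch
import torch.nn.functional as F

# Ancestral sampling sketch, reusing the assumed `cell` (RNNCell) and
# `readout` (Linear) from the log-probability sketch.
def sample_sequence(cell, readout, length):
    vocab = readout.out_features
    h = torch.zeros(1, cell.hidden_size)
    tokens = []
    for _ in range(length):
        probs = F.softmax(readout(h), dim=-1)            # p(x_t | x_{1:t-1})
        x_t = torch.multinomial(probs, 1).item()         # draw the next symbol
        tokens.append(x_t)
        h = cell(F.one_hot(torch.tensor([x_t]), vocab).float(), h)
    return tokens
```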

---
### Char-RNN: Shakespeare
from [Andrej Karpathy's 2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

---
### Char-RNN: Wikipedia
from [Andrej Karpathy's 2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

---
### Char-RNN: Wikipedia
from [Andrej Karpathy's 2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

---
### Char-RNN example: random XML
from [Andrej Karpathy's 2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

---
### Char-RNN example: LaTeX
from [Andrej Karpathy's 2015 blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

---
### But, it was not that easy
* vanilla RNNs forget too quickly
* vanishing gradients problem
* exploding gradients problem
* colab illustration
---
### Vanishing gradient problem

---
### Vanishing/exploding gradients problem
Vanilla RNN:
$$
\mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b_h})
$$
$$
\hat{y} = \phi(W_y \mathbf{h}_{T} + \mathbf{b}_y)
$$
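
Written as code, the recurrence is a loop that reuses the same weights at every step. A sketch with $\sigma$ taken to be the logistic sigmoid and $\phi$ a softmax:

```python
import torch

# The vanilla RNN above as a loop (a sketch: sigma taken to be the logistic
# sigmoid and phi a softmax; the same W_h, W_x, b_h are reused at every step).
def vanilla_rnn(xs, W_h, W_x, b_h, W_y, b_y):
    h = torch.zeros(W_h.shape[0])                        # h_0
    for x_t in xs:                                       # xs: tensor of shape (T, D)
        h = torch.sigmoid(W_h @ h + W_x @ x_t + b_h)     # h_{t+1} = sigma(W_h h_t + W_x x_t + b_h)
    return torch.softmax(W_y @ h + b_y, dim=-1)          # y_hat = phi(W_y h_T + b_y)
```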
---
### The gradients of the loss are
\begin{align}
\frac{\partial \hat{L}}{\partial \mathbf{h}_t} &= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} \frac{\partial \mathbf{h}_{s+1}}{\partial \mathbf{h}_s} \\
&= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} D_s W_h,
\end{align}
where
* $D_s = \operatorname{diag} \left[\sigma'(W_h \mathbf{h}_s + W_x \mathbf{x}_s + \mathbf{b}_h)\right]$
* if $\sigma$ is ReLU, $\sigma'(z) \in \{0, 1\}$
---
### The norm of the gradient is upper bounded
\begin{align}
\left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_t}\right\| &\leq \left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_T}\right\| \prod_{s=t}^{T-1} \left\|D_s\right\| \left\|W_h\right\|^{T-t},
\end{align}
* the norm of $D_s$ is at most 1 (for ReLU)
* if $\|W_h\| < 1$, the bound shrinks exponentially in $T-t$: gradients vanish
* a large $\|W_h\|$ can instead make gradients explode (see the illustration below)
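
A small numerical illustration (not from the slides): with $W_h$ a scaled orthogonal matrix and the $D_s$ factors ignored, the backpropagated gradient norm scales exactly like $\|W_h\|^{T-t}$:

```python
import numpy as np

# Illustration: backpropagating through 100 steps multiplies the gradient by
# W_h^T at every step, so its norm scales like ||W_h||^100.
rng = np.random.default_rng(0)
grad = rng.normal(size=64)                        # stands in for dL/dh_T
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))    # random orthogonal matrix

for scale in [0.9, 1.0, 1.1]:
    W_h = scale * Q                               # spectral norm ||W_h|| = scale
    g = grad.copy()
    for _ in range(100):                          # T - t = 100 steps; D_s factors ignored
        g = W_h.T @ g
    print(scale, np.linalg.norm(g))               # = scale**100 * ||grad||: vanishes / stays / explodes
```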
---
### Unitary Evolution RNNs
Idea: constrain $W_h$ to be unitary, so that $\|W_h\| = 1$ and multiplying by it preserves the norm of the hidden state (and of the backpropagated gradient).

---
### Unitary Evolution RNNs
Compose weight matrix out of simple unitary transforms:
$$
W_h = D_3R_2\mathcal{F}^{-1}D_2\Pi R_1\mathcal{F}D_1
$$
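
A NumPy sketch of how the factors compose; the phases, reflection vectors and permutation below are random placeholders rather than learned parameters. The point is only that a product of unitary factors is itself unitary:

```python
import numpy as np

# Sketch of the uRNN parameterisation: each factor is unitary, so their
# product W_h is unitary and preserves the norm of h.
rng = np.random.default_rng(0)
n = 8

def diag_unitary(theta):            # D: diagonal matrix of unit-modulus entries
    return np.diag(np.exp(1j * theta))

def reflection(v):                  # R: complex Householder reflection
    v = v / np.linalg.norm(v)
    return np.eye(n) - 2.0 * np.outer(v, v.conj())

F = np.fft.fft(np.eye(n)) / np.sqrt(n)        # unitary DFT matrix
Pi = np.eye(n)[rng.permutation(n)]            # fixed random permutation

D1, D2, D3 = (diag_unitary(rng.uniform(-np.pi, np.pi, n)) for _ in range(3))
R1, R2 = (reflection(rng.normal(size=n) + 1j * rng.normal(size=n)) for _ in range(2))

W_h = D3 @ R2 @ np.conj(F).T @ D2 @ Pi @ R1 @ F @ D1
print(np.allclose(W_h.conj().T @ W_h, np.eye(n)))   # True: W_h is unitary
```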
---
### More typical solution: gating
Vanilla RNN:
$$
\mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b_h})
$$
Gated Recurrent Unit:
\begin{align}
\mathbf{h}_{t+1} &= \mathbf{z}_t \odot \mathbf{h}_t + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t \\
\tilde{\mathbf{h}}_t &= \phi\left(W\mathbf{x}_t + U(\mathbf{r}_t \odot \mathbf{h}_t)\right)\\
\mathbf{r}_t &= \sigma(W_r\mathbf{x}_t + U_r\mathbf{h}_t)\\
\mathbf{z}_t &= \sigma(W_z\mathbf{x}_t + U_z\mathbf{h}_t)
\end{align}
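
One GRU step written to mirror the equations, with $\phi = \tanh$ and $\sigma$ the logistic sigmoid (a sketch; weight matrices are passed in explicitly to match the notation):

```python
import torch

# One GRU step, mirroring the equations above (phi = tanh, sigma = logistic sigmoid).
def gru_step(x_t, h_t, W, U, W_r, U_r, W_z, U_z):
    r_t = torch.sigmoid(W_r @ x_t + U_r @ h_t)        # reset gate
    z_t = torch.sigmoid(W_z @ x_t + U_z @ h_t)        # update gate
    h_tilde = torch.tanh(W @ x_t + U @ (r_t * h_t))   # candidate state
    return z_t * h_t + (1 - z_t) * h_tilde            # h_{t+1}
```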
---
## GRU diagram

---
### LSTM: Long Short-Term Memory
* introduced by Hochreiter and Schmidhuber (1997)
* improved/tweaked several times since
* more gates to control behaviour (one step is sketched below)
* 2009: Alex Graves's LSTM wins the ICDAR connected handwriting recognition competition
* 2013: sets a new record on a natural speech recognition benchmark
* 2014: GRU proposed (a simplified LSTM)
* 2016: used in Google's neural machine translation system
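
One step of a standard LSTM cell, as a sketch of the common formulation rather than any specific variant; the gates control a separate cell state $c_t$ that acts as an additive shortcut through time:

```python
import torch

# One step of a standard LSTM (sketch; biases folded into b). The three
# sigmoid gates control a separate cell state c_t.
def lstm_step(x_t, h_t, c_t, W, U, b):
    H = h_t.shape[0]
    gates = W @ x_t + U @ h_t + b                    # stacked pre-activations, shape (4H,)
    i, f, o = (torch.sigmoid(gates[k*H:(k+1)*H]) for k in range(3))
    g = torch.tanh(gates[3*H:])                      # candidate cell update
    c_next = f * c_t + i * g                         # gated additive 'shortcut' through time
    h_next = o * torch.tanh(c_next)
    return h_next, c_next
```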
---
### Side note: dealing with depth

---
### Side note: dealing with depth

---
### Side note: dealing with depth

---
### Deep Residual Networks (ResNets)

---
### Deep Residual Networks (ResNets)

---
### ResNets
* allow much deeper networks (e.g. 101 or 152 layers)
* performance increases with depth
* set new records on benchmarks (ImageNet, COCO)
* used almost everywhere now (a minimal residual block is sketched below)
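
A minimal residual block sketch in PyTorch; the identity shortcut is what lets information and gradients bypass the convolutional branch:

```python
import torch.nn as nn

# A minimal residual block (sketch): output = ReLU(x + F(x)).
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.body(x))   # identity shortcut + residual branch
```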
---
### ResNets behave like ensembles

from [Veit et al., 2016](https://arxiv.org/pdf/1605.06431.pdf)
---
### DenseNets

---
### DenseNets

---
### Back to RNNs
* like ResNets, LSTMs and GRUs create "shortcuts"
* these allow information to skip processing
* data-dependent gating
* data-dependent shortcuts
---
## Different from what we had before:
* different input type (sequences)
* different network building blocks
  * multiplicative interactions
  * gating
  * skip connections
* different objective
  * maximum likelihood
  * generative modelling
---
### RNN: different uses

figure from [Andrej Karpathy's blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
---
### To engage with this material at home
Try the [char-RNN Exercise](https://github.com/udacity/deep-learning-v2-pytorch/blob/master/recurrent-neural-networks/char-rnn/Character_Level_RNN_Exercise.ipynb) from Udacity.
{"metaMigratedAt":"2023-06-15T19:51:26.262Z","metaMigratedFrom":"YAML","title":"DeepNN Lecture 8 Slides","breaks":true,"description":"Lecture slides on recurrent neural networks, its variants like uRNNs, LSTMs. Touching on deep feed-forward networks like ResNets","contributors":"[{\"id\":\"e558be3b-4a2d-4524-8a66-38ec9fea8715\",\"add\":7287,\"del\":730}]"}