# Recurrent Networks continued

### Ferenc Huszár (fh277)

DeepNN Lecture 9

---

### RNN: Recap

![](https://i.imgur.com/YnkgS5P.png)

---

### The state update rule: naive

$$
\mathbf{h}_{t+1} = \phi(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b}_h)
$$

---

### The state update rule: GRU

\begin{align}
\mathbf{h}_{t+1} &= \mathbf{z}_t \odot \mathbf{h}_t + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t \\
\tilde{\mathbf{h}}_t &= \phi\left(W\mathbf{x}_t + U(\mathbf{r}_t \odot \mathbf{h}_t)\right)\\
\mathbf{r}_t &= \sigma(W_r\mathbf{x}_t + U_r\mathbf{h}_t)\\
\mathbf{z}_t &= \sigma(W_z\mathbf{x}_t + U_z\mathbf{h}_t)
\end{align}

---

### Implementing branching logic

...in code:

```
if r:
    return 5
else:
    return 3
```

...in algebra:

```
return r*5 + (1-r)*3
```

---

### The state update rule: GRU

\begin{align}
\mathbf{h}_{t+1} &= \mathbf{z}_t \odot \mathbf{h}_t + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t \\
\tilde{\mathbf{h}}_t &= \phi\left(W\mathbf{x}_t + U(\mathbf{r}_t \odot \mathbf{h}_t)\right)\\
\mathbf{r}_t &= \sigma(W_r\mathbf{x}_t + U_r\mathbf{h}_t)\\
\mathbf{z}_t &= \sigma(W_z\mathbf{x}_t + U_z\mathbf{h}_t)
\end{align}

---

### Side note: dealing with depth

![](https://i.imgur.com/n72rvhO.png)

---

### Side note: dealing with depth

![](https://i.imgur.com/w8BmEfS.png =260x)

---

### Very deep networks are hard to train

* exploding/vanishing gradients
* their performance degrades with depth
* VGG19: a 19-layer ConvNet

---

### Deep Residual Networks (ResNets)

![](https://i.imgur.com/wjBWNn9.png)

---

### Deep Residual Networks (ResNets)

![](https://i.imgur.com/hJK6Rx4.png)

---

### ResNets

* allow for much deeper networks (101, 152 layers)
* performance increases with depth
* set new records on benchmarks (ImageNet, COCO)
* used almost everywhere now

---

### ResNets behave like ensembles

![](https://i.imgur.com/LNPB4e8.png)

from ([Veit et al, 2016](https://arxiv.org/pdf/1605.06431.pdf))

---

### DenseNets

![](https://i.imgur.com/4aTzmR7.png)

---

### Back to RNNs

* like ResNets, LSTMs create "shortcuts"
* allowing information to skip processing
* data-dependent gating
* data-dependent shortcuts

---

### Visualising RNN behaviours

See this [distill post](https://distill.pub/2019/memorization-in-rnns/)

---

### RNN: different uses

![](https://i.imgur.com/WGl90lv.jpg)

figure from [Andrej Karpathy's blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

---

### RNNs for images

![](https://karpathy.github.io/assets/rnn/house_read.gif)

([Ba et al, 2014](https://arxiv.org/abs/1412.7755))

---

### RNNs for images

![](https://karpathy.github.io/assets/rnn/house_generate.gif)

([Gregor et al, 2015](https://arxiv.org/abs/1502.04623))

---

### RNNs for painting

![](https://i.imgur.com/DhbBAl2.png)

([Mellor et al, 2019](https://learning-to-paint.github.io/))

---

### RNNs for painting

![](https://i.imgur.com/KKg33WR.jpg)

---

### Spatial LSTMs

![](https://i.imgur.com/4fOP3FR.png)

([Theis et al, 2015](https://arxiv.org/pdf/1506.03478.pdf))

---

### Spatial LSTMs generating textures

![](https://i.imgur.com/uLYyB3l.jpg)

---

### Seq2Seq: sequence-to-sequence

![](https://i.imgur.com/Ki8xpvY.png)

([Sutskever et al, 2014](https://arxiv.org/pdf/1409.3215.pdf))

---

### Seq2Seq: neural machine translation

![](https://i.imgur.com/WrZg5r4.png)

---

### Show and Tell: "Image2Seq"

![](https://i.imgur.com/hyUtUjl.png)

([Vinyals et al, 2015](https://arxiv.org/pdf/1411.4555.pdf))

---

### Show and Tell: "Image2Seq"

![](https://i.imgur.com/MSU5mIw.jpg)

([Vinyals et al, 2015](https://arxiv.org/pdf/1411.4555.pdf))

---
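### "Image2Seq": a minimal sketch

A minimal sketch of the "Image2Seq" idea above: a CNN encodes the image into a feature vector, which initialises an RNN decoder that predicts the caption token by token (with teacher forcing). Module choices and sizes here are illustrative assumptions, not the configuration from Vinyals et al.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Illustrative CNN-encoder / GRU-decoder captioner (toy sizes)."""
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256):
        super().__init__()
        # toy CNN encoder (a real system would use a pretrained ConvNet)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # the image features become the decoder's initial hidden state
        h0 = self.encoder(images).unsqueeze(0)    # (1, B, hidden_dim)
        emb = self.embed(captions)                # (B, T, embed_dim)
        states, _ = self.decoder(emb, h0)         # (B, T, hidden_dim)
        return self.out(states)                   # logits over next tokens

model = TinyCaptioner()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])
```

---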
### Sentence to parse tree: "Seq2Tree"

![](https://i.imgur.com/ywwmSCK.png)

([Vinyals et al, 2014](https://arxiv.org/abs/1412.7449))

---

### General algorithms as Seq2Seq

travelling salesman

![](https://i.imgur.com/B8jsaMt.png)

([Vinyals et al, 2015](https://arxiv.org/abs/1506.03134))

---

### General algorithms as Seq2Seq

convex hull and triangulation

![](https://i.imgur.com/mTQhCTi.png)

---

### Pointer networks

![](https://i.imgur.com/JhFpOkZ.png)

---

### Revisiting the basic idea

![](https://i.imgur.com/Ki8xpvY.png)

"Asking the network too much"

---

### Attention layer

![](https://i.imgur.com/nskRYts.png)

---

### Attention layer

Attention weights:

$$
\alpha_{t,s} = \frac{e^{\mathbf{e}^T_t \mathbf{d}_s}}{\sum_u e^{\mathbf{e}^T_u \mathbf{d}_s}}
$$

Context vector:

$$
\mathbf{c}_s = \sum_{t=1}^T \alpha_{t,s} \mathbf{e}_t
$$

---

### Attention layer visualised

![](https://i.imgur.com/MVt50yl.png =500x)

---

### Language Transformers and Transfer Learning

![](https://i.imgur.com/sVtIGCN.gif)

---

### Zero-Shot Transfer

* train as a language model: predict the next token
* use prompts that encode the task

---

### To engage with this material at home

Try the [char-RNN Exercise](https://github.com/udacity/deep-learning-v2-pytorch/blob/master/recurrent-neural-networks/char-rnn/Character_Level_RNN_Exercise.ipynb) from Udacity.

---

* neural machine translation (historical note)
* image captioning: encoder is a CNN, decoder is an RNN
* forgetting problem revisited
* asking the network too much
* allowing the decoder to look back at encoder states
* pointer networks
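---

### Attention layer: a minimal sketch

A minimal sketch of the attention weights $\alpha_{t,s}$ and context vector $\mathbf{c}_s$ defined earlier, for a stack of encoder states $\mathbf{e}_1, \dots, \mathbf{e}_T$ and a single decoder state $\mathbf{d}_s$. Shapes and names are illustrative.

```python
import torch

def attention_context(encoder_states, decoder_state):
    """encoder_states: (T, d), decoder_state: (d,) -> context vector (d,).

    Dot-product attention as on the slides:
    alpha_{t,s} = softmax_t(e_t^T d_s),  c_s = sum_t alpha_{t,s} e_t
    """
    scores = encoder_states @ decoder_state   # (T,)  e_t^T d_s
    alphas = torch.softmax(scores, dim=0)     # attention weights, sum to 1
    context = alphas @ encoder_states         # (d,)  weighted sum of e_t
    return context, alphas

# toy usage: 5 encoder states of dimension 8
e = torch.randn(5, 8)
d = torch.randn(8)
c, a = attention_context(e, d)
print(a.sum().item(), c.shape)  # ~1.0, torch.Size([8])
```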
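---

### GRU update: a minimal sketch

As a companion to the char-RNN exercise, a minimal sketch of the GRU update rule from earlier in the lecture, with the gates written out explicitly (biases omitted, as on the slide). Weight shapes and names are illustrative.

```python
import torch

def gru_step(x_t, h_t, params):
    """One GRU state update, following the slides:
    r_t = sigmoid(W_r x_t + U_r h_t)        reset gate
    z_t = sigmoid(W_z x_t + U_z h_t)        update gate
    h~_t = tanh(W x_t + U (r_t * h_t))      candidate state
    h_{t+1} = z_t * h_t + (1 - z_t) * h~_t
    """
    W, U, W_r, U_r, W_z, U_z = params
    r = torch.sigmoid(x_t @ W_r + h_t @ U_r)
    z = torch.sigmoid(x_t @ W_z + h_t @ U_z)
    h_tilde = torch.tanh(x_t @ W + (r * h_t) @ U)
    return z * h_t + (1 - z) * h_tilde

# toy usage: input dimension 4, hidden dimension 3
d_in, d_h = 4, 3
params = [torch.randn(d_in, d_h), torch.randn(d_h, d_h),
          torch.randn(d_in, d_h), torch.randn(d_h, d_h),
          torch.randn(d_in, d_h), torch.randn(d_h, d_h)]
h = torch.zeros(d_h)
for x in torch.randn(6, d_in):   # a length-6 input sequence
    h = gru_step(x, h, params)
print(h)
```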