Recurrent Networks
Ferenc Huszár (fh277)
DeepNN Lecture 8
Different from what we've seen before:
different input type (sequences)
different network building blocks
multiplicative interactions
gating
skip connections
different objective
maximum likelihood
generative modelling
Modelling sequences
input to the network: \(x_1, x_2, \ldots, x_T\)
sequences of different length
sometimes an explicit 'EOS' (end-of-sequence) symbol
sequence classification (e.g. text classification)
sequence generation (e.g. language generation)
sequence-to-sequence (e.g. translation)
Recurrent Neural Network
RNN: Unrolled through time
Generating sequences
Goal: model the distribution of sequences
\[
p(x_{1:T}) = p(x_1, \ldots, x_T)
\]
Idea: model it one-step-at-a-time:
\[
p(x_{1:T}) = p(x_T\vert x_{1:T-1}) p(x_{T-1} \vert x_{1:T-2}) \cdots p(x_1)
\]
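A minimal sketch of this factorisation, with a hypothetical `next_step_probs` standing in for an RNN's predictive distribution; the training objective (maximum likelihood) is just the sum of these log conditionals:

```python
import numpy as np

# Hypothetical stand-in for an RNN's predictive distribution
# p(x_t | x_{1:t-1}) over a small vocabulary.
def next_step_probs(prefix, vocab_size=4):
    rng = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    logits = rng.normal(size=vocab_size)
    return np.exp(logits) / np.exp(logits).sum()

# log p(x_{1:T}) = sum_t log p(x_t | x_{1:t-1});
# maximising this over training sequences is maximum likelihood.
def sequence_log_prob(x):
    return sum(np.log(next_step_probs(x[:t])[x[t]]) for t in range(len(x)))

print(sequence_log_prob([0, 2, 1, 3]))
```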
Modelling sequence distributions
Training: maximum likelihood
Sampling sequences
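Ancestral sampling follows the same factorisation: draw each symbol from the model's prediction and feed it back in. A sketch, again with a hypothetical stand-in for the RNN's predictive distribution and an assumed EOS symbol:

```python
import numpy as np

EOS = 0  # assumed end-of-sequence symbol

# Hypothetical stand-in for the RNN's predictive distribution p(x_t | x_{1:t-1}).
def next_step_probs(prefix, vocab_size=4):
    rng = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    logits = rng.normal(size=vocab_size)
    return np.exp(logits) / np.exp(logits).sum()

# Ancestral sampling: draw x_1, feed it back, draw x_2 given x_1, ...
def sample_sequence(max_len=20, seed=0):
    rng = np.random.default_rng(seed)
    x = []
    for _ in range(max_len):
        p = next_step_probs(x)
        x.append(int(rng.choice(len(p), p=p)))
        if x[-1] == EOS:
            break
    return x

print(sample_sequence())
```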
But it was not that easy:
vanilla RNNs forget too quickly
vanishing gradients problem
exploding gradients problem
Vanishing/exploding gradients problem
Vanilla RNN:
\[
\mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b}_h)
\]
\[
\hat{y} = \phi(W_y \mathbf{h}_{T} + \mathbf{b}_y)
\]
The gradients of the loss are
\begin{align}
\frac{\partial \hat{L}}{\partial \mathbf{h}_t} &= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} \frac{\partial \mathbf{h}_{s+1}}{\partial \mathbf{h}_s} \\
&= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} \left( D_s W_h \right),
\end{align}
where
\(D_s = \operatorname{diag} \left[\sigma'(W_h \mathbf{h}_s + W_x \mathbf{x}_s + \mathbf{b}_h)\right]\)
if \(\sigma\) is ReLU, \(\sigma'(z) \in \{0, 1\}\)
The norm of the gradient is upper bounded
\begin{align}
\left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_t}\right\| &\leq \left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_T}\right\| \left\|W_h\right\|^{T-t} \prod_{s=t}^{T-1} \left\|D_s\right\|,
\end{align}
the norm of \(D_s\) is at most 1 (for ReLU)
if \(\|W_h\| < 1\), gradients vanish exponentially in \(T-t\); a large \(\|W_h\|\) can make them explode (see the numerical sketch below)
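A numerical sketch of this effect (NumPy, ReLU nonlinearity, randomly drawn weights as an assumption): roll the recurrence forward, accumulate the product of Jacobians \(D_s W_h\), and watch how rescaling \(W_h\) changes the gradient norm.

```python
import numpy as np

# Numerical sketch of the bound above (assumed random weights): roll a
# ReLU RNN forward, accumulate the product of Jacobians D_s W_h, and see
# how rescaling W_h changes the gradient norm over T steps.
def jacobian_product_norm(scale, T=50, d=16, seed=0):
    rng = np.random.default_rng(seed)
    W_h = scale * rng.normal(size=(d, d)) / np.sqrt(d)
    W_x = rng.normal(size=(d, d)) / np.sqrt(d)
    h = np.zeros(d)
    J = np.eye(d)                              # will hold prod_s D_s W_h
    for _ in range(T):
        pre = W_h @ h + W_x @ rng.normal(size=d)
        h = np.maximum(pre, 0.0)               # h_{s+1} = ReLU(pre-activation)
        D = np.diag((pre > 0).astype(float))   # D_s = diag[sigma'(...)]
        J = D @ W_h @ J                        # one more Jacobian factor
    return np.linalg.norm(J)

for scale in [0.5, 1.0, 2.0]:
    print(scale, jacobian_product_norm(scale))
```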
More typical solution: gating
Vanilla RNN:
\[
\mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b}_h)
\]
Gated Recurrent Unit:
\begin{align}
\mathbf{h}_{t+1} &= \mathbf{z}_t \odot \mathbf{h}_t + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t \\
\tilde{\mathbf{h}}_t &= \phi\left(W\mathbf{x}_t + U(\mathbf{r}_t \odot \mathbf{h}_t)\right)\\
\mathbf{r}_t &= \sigma(W_r\mathbf{x}_t + U_r\mathbf{h}_t)\\
\mathbf{z}_t &= \sigma(W_z\mathbf{x}_t + U_z\mathbf{h}_t)
\end{align}
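A sketch of a single GRU step following these equations (NumPy, \(\phi = \tanh\), biases omitted as above; weight shapes are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step following the equations above (phi = tanh, biases omitted).
def gru_step(x_t, h_t, W, U, W_r, U_r, W_z, U_z):
    r_t = sigmoid(W_r @ x_t + U_r @ h_t)            # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_t)            # update gate
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_t))    # candidate state
    return z_t * h_t + (1.0 - z_t) * h_tilde        # h_{t+1}

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
W, W_r, W_z = (rng.normal(size=(d_h, d_x)) for _ in range(3))
U, U_r, U_z = (rng.normal(size=(d_h, d_h)) for _ in range(3))
print(gru_step(rng.normal(size=d_x), np.zeros(d_h), W, U, W_r, U_r, W_z, U_z))
```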
GRU diagram
LSTM: Long Short-Term Memory
by Hochreiter and Schmidhuber (1997)
improved/tweaked several times since
more gates to control behaviour (see the sketch below)
2009: Alex Graves' LSTM wins the ICDAR connected handwriting recognition competition
2013: sets a new record on a natural speech dataset
2014: GRU proposed (a simplified LSTM variant)
2016: widely used in neural machine translation
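A sketch of one step of a common LSTM formulation (not necessarily the exact variant in any of the works above): input, forget and output gates control how the cell state is written, retained and read.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One step of a common LSTM formulation (a sketch, not any specific paper's
# exact variant): input (i), forget (f) and output (o) gates control how the
# cell state c is written, retained and read out into the hidden state h.
def lstm_step(x_t, h_t, c_t, W, U, b):
    pre = W @ x_t + U @ h_t + b          # W: (4d_h, d_x), U: (4d_h, d_h)
    i, f, o, g = np.split(pre, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c_next = f * c_t + i * np.tanh(g)    # gated cell-state update
    h_next = o * np.tanh(c_next)         # gated read-out
    return h_next, c_next

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h),
                 rng.normal(size=(4 * d_h, d_x)), rng.normal(size=(4 * d_h, d_h)),
                 np.zeros(4 * d_h))
print(h, c)
```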
RNNs for painting
Spatial LSTMs generating textures
Seq2Seq: neural machine translation
General algorithms as Seq2Seq
convex hull and triangulation
Pointer networks
Revisiting the basic idea
"Asking the network too much"
Attention layer
Attention weights:
\[
\alpha_{t,s} = \frac{e^{\mathbf{e}^T_t \mathbf{d}_s}}{\sum_u e^{\mathbf{e}^T_u \mathbf{d}_s}}
\]
Context vector:
\[
\mathbf{c}_s = \sum_{t=1}^T \alpha_{t,s} \mathbf{e}_t
\]
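A sketch of these two equations with dot-product scores, assuming encoder states \(\mathbf{e}_1, \ldots, \mathbf{e}_T\) and a decoder state \(\mathbf{d}_s\) of matching dimension:

```python
import numpy as np

# Attention as in the equations above: softmax over dot-product scores
# e_t^T d_s gives weights alpha_{t,s}; the context vector c_s is the
# alpha-weighted sum of encoder states.
def attention(E, d_s):
    # E: (T, dim) encoder states e_1..e_T; d_s: (dim,) decoder state
    scores = E @ d_s
    scores -= scores.max()                         # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights
    c_s = alpha @ E                                # context vector
    return alpha, c_s

rng = np.random.default_rng(0)
E, d_s = rng.normal(size=(5, 8)), rng.normal(size=8)
alpha, c_s = attention(E, d_s)
print(alpha.round(2), c_s.round(2))
```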
Attention layer visualised
Side note: dealing with depth
Deep Residual Networks (ResNets)
ResNets
allow for much deeper networks (101, 152 layers)
performance increases with depth
new records on benchmarks (ImageNet, COCO)
used almost everywhere now
DenseNets
Back to RNNs
like ResNets, LSTMs and GRUs create "shortcuts"
these shortcuts allow information to skip processing
data-dependent gating
data-dependent shortcuts (contrasted in the sketch below)
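A toy sketch contrasting the two kinds of shortcut (the function `f` stands in for an arbitrary learned transformation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# f stands in for an arbitrary learned transformation.
def f(x, W):
    return np.tanh(W @ x)

# ResNet-style shortcut: the identity path is always fully open.
def residual_block(x, W):
    return x + f(x, W)

# GRU/LSTM-style shortcut: a data-dependent gate z decides how much of the
# old state passes through unchanged.
def gated_block(x, W, W_z):
    z = sigmoid(W_z @ x)
    return z * x + (1.0 - z) * f(x, W)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W, W_z = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(residual_block(x, W))
print(gated_block(x, W, W_z))
```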
Different from what we had before:
different input type (sequences)
different network building blocks
multiplicative interactions
gating
skip connections
different objective
maximum likelihood
generative modelling