---
title: DeepNN Notes on Recurrent Neural Networks
tags: DeepNN, Teaching, Lecture Notes
description: Lecture notes on Recurrent Neural Networks
---
# Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are neural network models for processing sequential data. In general I will denote an input sequence to an RNN as $(x_1, x_2, \ldots, x_T)$. The individual symbols $x_t$ within can be of any type we can process with neural networks: discrete (e.g. characters from a vocabulary), vector-valued, images, etc. Crucially, an RNN can handle input sequences of arbitrary length.
The typical illustrations of RNNs are below:

The left-hand figure shows a neural network between a green input node and a light-red output node, which includes recurrent connections, the red loops. The way to read this is that the layers with recurrent connections have memory, so as new data is passed to its input, the value of these layers depends not just on the layer input, but also on the previous value of the layer's activations.
A much clearer illustration of an RNN is the one on the right-hand side, which is unrolled over time. Time (or the index within the input sequence $t$) goes from left to right. The grey rectangles $\mathbf{h}_{l,t}$ show the value of the hidden state at layer $l$ of the network at time $t$. These are generally vector-valued.
The arrows show that the value of $\mathbf{h}_{1,t}$ depends on the current input $x_t$ as well as the previous value of the hidden state at the first layer $\mathbf{h}_{1,t-1}$. Layers higher up take as input the hidden state at the previous layer.
Importantly, the way the state of a layer depends on the previous state and the current input does not change over time: the state update happens the same way in each timestep. This is why an RNN can process arbitrarily long sequences with a constant number of parameters.
## Update equation
An important detail in the RNN architecture is the functional form of how the state update happens, and how it is parametrised.
### Vanilla RNN
The most basic versions of the RNN (also called Elman network) use an update equation that is directly inspired by the computation used in multilayer perceptrons:
$$
\mathbf{h}_{t} = \phi(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}_h)
$$
The hidden state and the input are first linearly combined, and then an elementwise nonlinearity $\phi$, such as the ReLU, is used.
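As a minimal sketch (not part of the original notes), the update above can be written as a loop over timesteps. The tensor names `W_h`, `W_x` and `b_h` mirror the symbols in the equation, and `torch.tanh` stands in for $\phi$:
```python
import torch

def vanilla_rnn(xs, W_h, W_x, b_h, h0):
    """Apply h_t = tanh(W_h h_{t-1} + W_x x_t + b_h) along a sequence xs of shape (T, input_size)."""
    h = h0
    states = []
    for x_t in xs:
        h = torch.tanh(W_h @ h + W_x @ x_t + b_h)
        states.append(h)
    return torch.stack(states)  # (T, hidden_size)

# The same three parameter tensors are reused at every timestep,
# which is why the sequence length T can be arbitrary.
hidden, inputs = 4, 3
W_h, W_x, b_h = torch.randn(hidden, hidden), torch.randn(hidden, inputs), torch.zeros(hidden)
states = vanilla_rnn(torch.randn(10, inputs), W_h, W_x, b_h, torch.zeros(hidden))
```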
### Forgetting and vanishing gradients
A key problem with this vanilla update rule is that it doesn't allow the network to "remember" information over long time periods. This can be seen by investigating what effect the input at time $t$ has on the hidden state at time $T>t$. It turns out that when this simple update rule is applied recursively, the effect of $x_t$ on $\mathbf{h}_{T}$ diminishes exponentially quickly as $T-t$ grows.
This simple code snippet visualises this exponentially diminishing effect in a randomly initialized RNN:
```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# A randomly initialised 3-layer vanilla RNN, fed a sequence of 100 scalar inputs.
rnn = nn.RNN(input_size=1, hidden_size=20, num_layers=3)
input = torch.randn(100, 1, 1)  # (sequence length, batch, input size)
input.requires_grad = True
output, hT = rnn(input)

# Backpropagate from one component of the final hidden state to all inputs,
# then plot the squared gradient with respect to each x_t on a log scale.
hT[2, 0, 0].backward()
plt.semilogy(input.grad[:, 0, 0].detach().numpy()**2, '.')
```
The code above should yield results like below:

You can see that as the difference between $t$ and $T$ grows (from right to left), the magnitude of the gradient decreases exponentially.
A similar pattern affects the gradients of the loss function as well: as the loss gradient is backpropagated through time from future outputs to past inputs, its magnitude may get exponentially small (vanishing gradients) or in some cases exponentially large (exploding gradients).
### Unitary evolution RNNs
One way to fix the exploding/vanishing gradient problem is to constrain the weight matrices $W_h$. Here, for illustration, we are going to follow the reasoning of ([Arjovsky et al, 2015](https://arxiv.org/abs/1511.06464)).
Consider a situation where the empirical loss depends on the final hidden state of the RNN $\mathbf{h}_T$. Applying the chain rule, the gradient of the empirical loss $\hat{L}$ with respect to an intermediate hidden state $\mathbf{h}_t$ becomes:
\begin{align}
\frac{\partial \hat{L}}{\partial \mathbf{h}_t} &= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} \frac{\partial \mathbf{h}_{s+1}}{\partial \mathbf{h}_s} \\
&= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} D_{s+1} W_h,
\end{align}
where $D_s = \operatorname{diag} \left[\phi'(W_h \mathbf{h}_{s-1} + W_x \mathbf{x}_s + \mathbf{b}_h)\right]$ is a diagonal matrix containing the derivatives of the nonlinearity $\phi$ at timestep $s$. In case of ReLU activations, these derivatives are either $0$ or $1$.
We can see that this expression for the gradient contains $T-t$ factors of the state transition weight matrix $W_h$; if $\phi$ were the identity, the product would reduce to $W_h^{T-t}$, the weight matrix raised to the $(T-t)$-th power.
The eigenvalues of the matrix $W_h$ therefore have a large influence on how the gradient behaves. Generally, a matrix whose eigenvalues are all above $1$ in absolute value increases the norm of the vectors it multiplies. Conversely, a matrix whose eigenvalues are all smaller than $1$ in absolute value decreases the norm of vectors. If all eigenvalues have absolute value exactly $1$, as is the case for a unitary matrix, the norm of a vector remains unchanged when multiplied by it.
In the above expression for the gradient, the matrix $W_h$ appears raised to the $T-t$ power. If $W_h$ has eigenvalues smaller than 1, then $W^{T-t}_h$ will have eigenvalues that are vanishingly small for large $T-t$. This means that the norm of the gradient becomes vanishingly small, too. And the opposite happens when eigenvalues are above 1: the gradient explodes.
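A minimal numerical illustration of this effect (the matrices here are hypothetical examples, not trained RNN weights): repeated multiplication shrinks or inflates a vector's norm depending on the eigenvalues, while an orthogonal, norm-preserving matrix leaves it unchanged.
```python
import torch

torch.manual_seed(0)
v = torch.randn(20)

W_shrink = 0.9 * torch.eye(20)                     # all eigenvalues 0.9 < 1
W_grow = 1.1 * torch.eye(20)                       # all eigenvalues 1.1 > 1
W_orth, _ = torch.linalg.qr(torch.randn(20, 20))   # orthogonal: eigenvalues of absolute value 1

for name, W in [("eigenvalues 0.9", W_shrink),
                ("eigenvalues 1.1", W_grow),
                ("orthogonal", W_orth)]:
    norm = torch.linalg.norm(torch.linalg.matrix_power(W, 50) @ v)
    print(f"{name}: ||W^50 v|| = {norm:.3e}")      # shrinks by ~0.9^50, grows by ~1.1^50, or stays equal to ||v||
```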
To avoid the exploding/vanishing gradient problem, it therefore makes sense to use weight matrices which are unitary. However, even if one initializes $W_h$ to a unitary matrix, its eigenvalues might change during training, eventually leading to problems. This motivated [Arjovsky et al (2015)](https://arxiv.org/abs/1511.06464) to restrict the parameter $W_h$ to unitary matrices by construction. They do this by parametrising $W_h$ as a product of simple unitary matrices (reflections, rotations, permutations, Fourier transforms).
The resulting RNN, called the unitary evolution RNN or uRNN, overcomes the exploding/vanishing gradients problem and achieves similar results to the most popular RNN architecture, the LSTM. That said, it has not seen widespread use in practice. Instead, practical applications of RNNs almost exclusively use RNN cells based on gated activations, which we will discuss in the next section. We included the uRNN here because the reasoning that led to its development is useful for gaining intuition about the problems with vanilla RNN update rules.
### Gated RNN units
All neural network architectures we have discussed so far in the lectures use the same formula for layers: a linear/affine transformation followed by an elementwise nonlinearity. Even ConvNets follow this structure, with the exception that the linear operation is a convolution, which is a very restricted special case of a linear transformation. Now we will introduce a new operation that will extend this basic vocabulary and allow us to build better neural network architectures: pointwise multiplication between activations, which we will denote by $\odot$.
To illustrate how this operation is used, let me introduce the state update equations for Gated Recurrent Units ([Cho et al, 2014](https://arxiv.org/pdf/1406.1078.pdf)):
\begin{align}
\mathbf{r}_t &= \sigma(W_r\mathbf{x}_{t} + U_r\mathbf{h}_{t-1})\\
\mathbf{z}_t &= \sigma(W_z\mathbf{x}_{t} + U_z\mathbf{h}_{t-1})\\
\tilde{\mathbf{h}}_t &= \phi\left(W\mathbf{x}_{t} + U(\mathbf{r}_t \odot \mathbf{h}_{t-1})\right)\\
\mathbf{h}_{t} &= \mathbf{z}_t \odot \mathbf{h}_{t-1} + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_{t} \\
\end{align}
Let me explain the components:
* The first two lines define two so-called *gates*. The first, $\mathbf{r}_t$, is called the reset gate; the second, $\mathbf{z}_t$, is called the update gate.
* The gates use a sigmoid activation $\sigma$, a continuous nonlinearity whose output lies between $0$ and $1$.
* $\tilde{\mathbf{h}}_t$ is the proposed new value of the hidden state. The reset gate $\mathbf{r}_t$ controls to what degree the previous value of the hidden state $\mathbf{h}_{t-1}$ is taken into account in this update. Components of the hidden state where $\mathbf{r}_t$ is $0$ will be *reset* to a completely new value, ignoring the previous value of the hidden state. Notice that for this update we may use a different, non-sigmoidal nonlinearity $\phi$, such as a ReLU.
* The update gate $\mathbf{z}_t$ controls whether the state is updated to the new proposed value $\tilde{\mathbf{h}}_t$, or is left unchanged and set to the previous value $\mathbf{h}_{t-1}$.
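To make these equations concrete, here is a minimal sketch of a single GRU update step, written directly from the formulas above (parameter names mirror the symbols, bias terms are omitted as in the equations, and in practice one would simply use `torch.nn.GRU`):
```python
import torch

def gru_step(x_t, h_prev, W_r, U_r, W_z, U_z, W, U):
    """One GRU state update, following the equations above."""
    r = torch.sigmoid(W_r @ x_t + U_r @ h_prev)        # reset gate
    z = torch.sigmoid(W_z @ x_t + U_z @ h_prev)        # update gate
    h_tilde = torch.tanh(W @ x_t + U @ (r * h_prev))   # proposed new hidden state
    return z * h_prev + (1 - z) * h_tilde              # interpolate between old and proposed state
```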
Gating is a neural network's way of implementing conditional logic. Consider the following logic implemented in an if statement:
```python
def gate(r):
    if r:
        return 5
    else:
        return 3
```
where `r` is a boolean variable. In most programming languages, boolean variables are implemented as 0s and 1s and can be used as numbers. So, equivalently, we could write the above code as:
```python
def gate(r):
    return r * 5 + (1 - r) * 3
```
This is what's going on in the GRU unit, at each component of the hidden state vector. A key difference is that in the network the update and reset gates are not binary, but instead can take continuous values between $0$ and $1$. This is important so that the network remains differentiable with respect to parameters of the reset and update gates.
#### Why does this work?
A GRU unit overcomes the main limitation of vanilla RNNs by being able to retain the same hidden state for longer. Using the update gate, the GRU cell can decide to propagate hidden states over several timesteps unchanged. Equally, it has the ability to forget, and to reset hidden states to completely new values via the reset gate. This mitigates the vanishing/exploding gradients problem, too.
### Long Short-Term Memory
The GRU unit was introduced as a simplified version of the Long Short-Term Memory (LSTM) unit. The LSTM is perhaps the most widely used RNN variant. Compared to the GRU it has several more gates, and more parameters as a result. I won't cover the details here, because there are excellent materials on the internet that do, such as this [illustrated blog post](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21).
Historically, the LSTM is very significant. Initially introduced by [Hochreiter and Schmidhuber, (1997)](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.676.4320&rep=rep1&type=pdf), it played a very important role in several breakthroughs in speech processing, machine translation, and many other areas. The [Wikipedia page on LSTMs](https://en.wikipedia.org/wiki/Long_short-term_memory) lists a timeline of some of the most notable achievements.
## Applications
LSTMs and other RNNs have been applied very creatively in several domains. Perhaps the most natural applications are in modelling text, which fits the sequence-processing abilities of RNNs particularly well.
### RNN as a generative model
One common application of an RNN is as a generative model for sequences. A generative model defines a probability distribution over some domain, in the case of RNNs, over sequences. The RNN uses the chain rule of probability distributions to decompose this problem into predicting the next element in the sequence given the previous ones:
$$
p(x_1, x_2, \ldots, x_T) = p(x_T\vert x_{1:T-1})p(x_{T-1}\vert x_{1:T-2})\cdots p(x_2\vert x_1)p(x_1)
$$
An RNN can be naturally used to model a sequence this way. At each timestep it digests a new character from the sequence, and outputs a prediction of the following character given the ones already seen, as illustrated in the figure below:

Once you build a model like that, you can use it to sample sequences by feeding the randomly sampled next character back into the network as input:

To train a generative RNN, we typically use the maximum likelihood criterion, which evaluates the model's probabilistic predictions of the next character on the actual next character, and sums up the log probabilities across the whole sequence. This is shown in the following diagram:

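Below is a hedged sketch of how such a generative model can be trained and sampled from at the character level; the vocabulary size, hidden dimension and the choice of `nn.GRU` are illustrative assumptions rather than details from the notes:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_size = 65, 128                    # assumed sizes, for illustration only

embed = nn.Embedding(vocab_size, hidden_size)
rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
readout = nn.Linear(hidden_size, vocab_size)

def negative_log_likelihood(tokens):
    """Maximum likelihood training loss: predict token t+1 from tokens up to t,
    and sum the log probabilities over the whole sequence."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    hidden_states, _ = rnn(embed(inputs))
    logits = readout(hidden_states)
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1), reduction="sum")

@torch.no_grad()
def sample(first_token, length):
    """Sample a sequence by feeding each sampled token back in as the next input."""
    token = torch.tensor([[first_token]])
    h, out = None, [first_token]
    for _ in range(length):
        state, h = rnn(embed(token), h)
        probs = F.softmax(readout(state[:, -1]), dim=-1)
        token = torch.multinomial(probs, 1)
        out.append(token.item())
    return out
```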
### char-RNNs
One can use the generative RNN to model sentences or text on a character-by-character basis. This works surprisingly well. A very well-written blog post by Andrej Karpathy, [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/), illustrates how easily such models learn to mimic statistical regularities of language.
I also recommend this nice article and tool that illustrates [memorization in character-based RNNs](https://distill.pub/2019/memorization-in-rnns/). Here you will be able to see what a character RNN pays attention to when making predictions about the next word in a sentence.
Not all RNN-based models of text work on a character-by-character basis. One can model text as a sequence of words or a sequence of other tokens. RNNs for text are also often combined with word embeddings such as word2vec.
### RNNs for images
Although more naturally suited for modelling text or other sequence data, RNNs have also been used to model images in different contexts.
For example, [Mellor et al, (2019)](https://learning-to-paint.github.io/) used recurrent networks to model a sequence of painting operations which then create an image, as shown in the examples below:

This network was trained to reproduce images of faces, and can create "paintings" in different styles:

This nice [blog post](https://learning-to-paint.github.io/) gives more examples and explains how this method works.
Another way RNNs have been extended to model images is the spatial LSTM by ([Theis et al, 2015](https://arxiv.org/pdf/1506.03478.pdf)). Here, the RNN architecture for modelling sequences is extended to model a grid of pixels, as illustrated in the figure below:

### Seq2Seq
A common way of using RNNs is to map between sequences, which is often called Seq2Seq or sequence-to-sequence ([Sutskever et al, 2014](https://arxiv.org/pdf/1409.3215.pdf)). The canonical example of this is machine translation where one maps, say, a French sentence to an English one. In the Seq2Seq framework one uses the RNN in the following way:

Here, an input sequence `ABC` is first ingested by the RNN, producing a hidden state that captures its meaning. Then, the RNN switches to a generative mode to generate the output sequence, in this case `WXYZ`. In generative mode, the RNN works the way explained above in the "RNN as a generative model" section.
The part of the RNN which ingests the input sequence is often called the encoder RNN, while the part which generates the output is called the decoder. It is possible for the decoder to have a different architecture, or different parameters, than the encoder. The hidden state that connects the encoder and the decoder is sometimes called the *thought vector* or *bottleneck*.
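A rough sketch of this encoder/decoder split is below; the `nn.GRU` modules, embedding sizes and vocabulary sizes are illustrative assumptions, not the exact architecture of Sutskever et al:
```python
import torch
import torch.nn as nn

hidden_size, src_vocab, tgt_vocab = 128, 1000, 1000     # assumed sizes

src_embed = nn.Embedding(src_vocab, hidden_size)
tgt_embed = nn.Embedding(tgt_vocab, hidden_size)
encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)  # separate parameters from the encoder
readout = nn.Linear(hidden_size, tgt_vocab)

def seq2seq_logits(src_tokens, tgt_tokens):
    """Encode the source sequence into a 'thought vector', then decode the target conditioned on it."""
    _, thought_vector = encoder(src_embed(src_tokens))   # final hidden state summarises the input
    decoder_states, _ = decoder(tgt_embed(tgt_tokens), thought_vector)
    return readout(decoder_states)                       # next-token predictions on the target side
```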
#### Machine translation
The flagship use-case for Seq2Seq is machine translation. Shortly after the first demonstrations of machine translation with this architecture, [Google announced](https://research.google/pubs/pub45610/) that they had a new version of Google Translate powered by Seq2Seq neural machine translation. To make this work, a few more modelling ideas (and a whole lot of engineering) were needed on top of the basic Seq2Seq idea. One of those ideas is to use a bidirectional LSTM (biLSTM) encoder. A biLSTM is an extension of LSTMs which processes information both forward and backward through the sequence (a short example is shown below). The second idea, which is going to be the topic of the next lecture, is attention.
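In PyTorch, a bidirectional LSTM is available via the `bidirectional` flag of `nn.LSTM`; a small (hypothetical) example shows that the output concatenates the forward and backward hidden states, doubling the feature dimension:
```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2,
                 batch_first=True, bidirectional=True)
x = torch.randn(8, 30, 64)        # (batch, sequence length, features)
outputs, (h_n, c_n) = bilstm(x)
print(outputs.shape)              # torch.Size([8, 30, 256]): forward and backward states concatenated
```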
### Image2Seq
The modularity of deep learning means that both the encoder and the decoder networks can be swapped out to something other than an RNN. In Image2Seq
([Vinyals et al, 2015](https://arxiv.org/pdf/1411.4555.pdf)) the encoder is swapped out for a convnet which processes an image.

This way, the network can learn to map images to text captions:

### Pointer Networks and Neural Turing Machines
To demonstrate that RNNs can do more than memorise and regurgitate the text they were trained on, researchers started training RNNs to reason about other types of objects. Much like a Turing machine with its input and output tapes, a sequence-to-sequence neural network can implement algorithms.
[Vinyals et al, (2015)](https://arxiv.org/abs/1506.03134) trained sequence-to-sequence networks to solve problems such as the travelling salesman problem, finding the convex hull of a set of points, or finding valid triangulations of a set of points. The goal here was not to improve on the performance of existing and known algorithms for solving these problems, but to demonstrate that LSTMs, trained with gradient descent, are able to learn behaviour like this.

In a similar vein, [Graves et al, (2014)](https://arxiv.org/abs/1410.5401) introduced Neural Turing Machines, recurrent networks capable of learning common algorithms such as sorting. This idea was later generalized to [Differentiable Neural Computers](https://deepmind.com/blog/article/differentiable-neural-computers), which were [published in Nature](https://www.nature.com/articles/nature20101).
## Related Topics
In the lecture I talked about Residual Networks, or ResNets for short ([He et al, 2015](https://arxiv.org/abs/1512.03385)). This was in order to illustrate that some of the problems with vanilla RNNs also arise when dealing with very deep neural networks, and that the architecture matters greatly in addressing these issues. I showed a figure from ([Veit et al, 2016](https://arxiv.org/abs/1605.06431)) about interpreting a ResNet as an ensemble of many neural networks of varying depth. Related to this idea are highway networks ([Srivastava et al, 2015](https://arxiv.org/abs/1505.00387)).
## If you want to play with RNNs
To play with RNNs, I recommend you follow the character RNN exercise by Udacity:
1. locate the [ipython notebook](https://github.com/udacity/deep-learning-v2-pytorch/blob/master/recurrent-neural-networks/char-rnn/Character_Level_RNN_Exercise.ipynb) on the github repository (the material in this repository is made available under MIT license)
1. you can load this notebook directly into Google Colab from GitHub by going to [http://colab.research.google.com/github](http://colab.research.google.com/github) and entering the URL above. (There is also a handy [Chrome browser plugin](https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo) for opening Jupyter notebooks on GitHub directly in Colab.)
1. once you open it in Colab, you have to download the dataset to the machine running the notebook. This step is not part of the notebook, which assumes the file is available locally. To download it, copy the following two lines into a cell at the beginning of your notebook:
```
!mkdir -p data
!wget -c https://github.com/udacity/deep-learning-v2-pytorch/raw/master/recurrent-neural-networks/char-rnn/data/anna.txt -O data/anna.txt
```