# Importance of Masking in Generative Modeling of Sequences
When learning to generate sequences of symbols $x_1, x_2, \ldots, x_T$, we often do so by defining a probabilistic generative model, i.e. a probability distribution over sequences $p(x_1, \ldots, x_T)$, constructed using the chain rule of probabilities:
$$
p(x_1, \ldots, x_T) = p(x_1) p(x_2\vert x_1) p(x_3\vert x_1, x_2) \cdots p(x_T\vert x_1, \ldots, x_{T-1})
$$
This makes computational sense because each of the conditional distributions above is easier to model: it is a distribution over a single symbol $x_t$. So even though the entire sequence $x_1, \ldots, x_T$ can take combinatorially many values, each component distribution $p(x_t\vert x_1, \ldots, x_{t-1})$ is only a distribution over a relatively small number of options, which can be modelled easily.
A generative model that defines the joint distribution as a product of conditional distributions in this way is often called autoregressive.
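To make this concrete, here is a minimal sketch of how the chain rule turns a sequence probability into a sum of per-symbol log-probabilities. `next_symbol_probs` is a hypothetical stand-in for any model of the conditionals, and the uniform toy model is purely illustrative:

```python
import torch

# A minimal sketch of the chain rule factorization. `next_symbol_probs` is a
# hypothetical stand-in for any autoregressive model: given the prefix
# x_1..x_{t-1}, it returns a distribution over the next symbol.
def sequence_log_prob(next_symbol_probs, sequence):
    log_prob = 0.0
    for t, x_t in enumerate(sequence):
        probs = next_symbol_probs(sequence[:t])   # p(x_t | x_1, ..., x_{t-1})
        log_prob += torch.log(probs[x_t])         # accumulate log p(x_t | prefix)
    return log_prob

# Toy example: a uniform "model" over a vocabulary of 5 symbols.
vocab_size = 5
uniform_model = lambda prefix: torch.full((vocab_size,), 1.0 / vocab_size)
print(sequence_log_prob(uniform_model, [0, 3, 1]))  # = 3 * log(1/5)
```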
### Modelling sequences: RNN
When we model sequences, we want a single neural network to model all of these component distributions. The most straightforward models for sequences are recurrent neural networks (see my [lecture notes](https://hackmd.io/@fhuszar/Hy2wJST-d)). The diagram below shows the computational graph of an RNN (unrolled in time).

In each timestep, the RNN is given as input the current symbol in the sequence (bottom, green nodes), it updates the hidden state activations (grey cells) based on their previous values and the new input, and then it outputs a probability distribution over the next symbol (top, orange nodes). Notice there are no arrows pointing from right to left, ensuring that the distribution we output at time $t$ can only depend on symbols ingested up until time $t$.
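As a minimal sketch (layer sizes are illustrative, and the GRU is just one choice of recurrent cell), an autoregressive RNN language model might look like this:

```python
import torch
import torch.nn as nn

# A minimal sketch of an autoregressive RNN language model (sizes are
# illustrative). At each step the GRU updates its hidden state from the
# previous state and the current input, then a linear head produces logits
# for the next symbol; no information flows backwards from future steps.
vocab_size, embed_dim, hidden_dim = 100, 32, 64

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)

x = torch.randint(vocab_size, (1, 10))   # a batch of one sequence, length 10
hidden_states, _ = rnn(embed(x))         # hidden_states[:, t] depends only on x[:, :t+1]
logits = head(hidden_states)             # logits[:, t] parameterizes p(x_{t+1} | x_1..x_t)
```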
### Modelling sequences: CNN
One can also model sequences with convolutional neural networks. A good example is WaveNet. The computational graph for WaveNets looks like this:

Here, we use dilated convolutions (hence the binary tree-like computational graph) and so-called causal convolutions, which means that the convolution kernels are shifted so that each output node's value only depends on input nodes that are earlier in time (to the left). Importantly, there are again no arrows pointing from right to left.
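A minimal sketch of a causal, dilated convolution, assuming the common trick of padding the input only on the left (channel counts and dilation are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A minimal sketch of a causal, dilated 1-D convolution (WaveNet-style).
# Padding only on the left shifts the kernel so that the output at time t
# depends only on inputs at times <= t.
class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))   # pad on the left only
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, ... yields the binary tree-like
# receptive field described above.
layer = CausalConv1d(channels=16, kernel_size=2, dilation=4)
out = layer(torch.randn(1, 16, 100))       # out[..., t] depends only on inputs up to t
```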
### Modelling sequences: fully connected
If we want to model sequences in this way using a fully connected architecture, we'd be in trouble. Let's consider the fully connected "autoencoder" architecture shown below:

At the bottom we see the sequence of three coordinates; at the top, the network outputs predictions. However, the prediction for the first coordinate now depends on all the other time points, including $x_2$ and $x_3$. Thus, this architecture can't be used to model the conditional distributions needed for the chain rule.
This can be fixed by masking. A simple approach is to mask the weights: essentially removing all weights that go from right to left, and keeping only those that respect the temporal ordering.
Below is the approach called MADE (Masked Autoencoder for Distribution Estimation).

For each node (hidden unit) in the autoencoder, we assign a time index, randomly in the case of MADE. We then apply a mask that sets to zero all weights that go from a later time index to an earlier one (from a higher number to a lower one). This ensures that there is an ordering of variables under which the outputs of the network can be interpreted as conditional distributions in a valid chain rule decomposition.
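A minimal sketch of MADE-style weight masking, with illustrative sizes and the random index assignment described above:

```python
import torch
import torch.nn as nn

# A minimal sketch of MADE-style weight masking (sizes and the random index
# assignment are illustrative). Each unit gets a time index; a weight is kept
# only if it respects the ordering, so output d can only depend on inputs 1..d-1.
D, H = 4, 8                               # sequence length, number of hidden units

m_in = torch.arange(1, D + 1)             # input unit d carries index d
m_hid = torch.randint(1, D, (H,))         # hidden units get random indices in 1..D-1
m_out = torch.arange(1, D + 1)            # output d parameterizes p(x_d | x_<d)

# mask[i, j] = 1 keeps the weight from unit j (previous layer) to unit i.
mask_in_to_hid = (m_hid[:, None] >= m_in[None, :]).float()
mask_hid_to_out = (m_out[:, None] > m_hid[None, :]).float()  # strict: no self-dependence

lin1, lin2 = nn.Linear(D, H), nn.Linear(H, D)

def made_forward(x):
    h = torch.relu((lin1.weight * mask_in_to_hid) @ x + lin1.bias)
    return (lin2.weight * mask_hid_to_out) @ h + lin2.bias   # logits for each conditional

out = made_forward(torch.randn(D))
```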
### Modelling sequences: Masked dot-product attention
Transformers, with attention over a sequence, are a lot like fully connected networks. A normal self-attention layer is permutation-invariant: it does not care about the order of symbols in a sequence. If we want to use a Transformer to model $p(x_t\vert x_1, \ldots, x_{t-1})$, we have to ensure that the Transformer's $t^\text{th}$ output can only depend on symbols up to time $t$. In practice, this is achieved by masking the attention matrix so that it is triangular (lower triangular under the usual convention where rows index queries and columns index keys). This means that when calculating the output of the $t^\text{th}$ unit, the attention weights for any symbol after time $t$ are set to $0$. This is equivalent to cutting the edges that go from right to left in a fully connected architecture.
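A minimal sketch of causally masked dot-product attention (single head, no batching, illustrative sizes):

```python
import torch
import torch.nn.functional as F

# A minimal sketch of causally masked dot-product attention. Scores for
# key positions later than the query position are set to -inf before the
# softmax, so their attention weights become exactly 0.
T, d = 6, 16                              # sequence length, head dimension
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

scores = q @ k.T / d ** 0.5               # (T, T): rows = queries, columns = keys
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal, float('-inf'))
weights = F.softmax(scores, dim=-1)       # triangular: row t attends only to keys <= t
output = weights @ v                      # output[t] depends only on x_1..x_t
```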
Importantly, this masking is only needed to allow a Transformer to express an autoregressive probabilistic model, as used in maximum likelihood language modelling. Such a model can be used to generate sequences or to compress data. However, if the Transformer is used for other tasks, such as translation, masking is not needed in every attention layer: the encoder's attention over the source sentence, for example, can see the whole input.
Note that there are alternative approaches to language modelling that do not require specifying a correct probabilistic model over sequences. One of them is *masked language modelling*, which is used to train BERT. The name is confusing in this context, because masking here refers to something else: the network receives as input a sentence from which some words are dropped and replaced by a MASK symbol. The network then reconstructs the entire sentence, trying to figure out what the dropped words were. For this task, the network doesn't need masked attention layers; the prediction of each dropped word is allowed to depend on context both before and after it.
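A minimal sketch of how such masked inputs and targets could be constructed (the MASK token id, the masking rate, and the `-100` ignore index are illustrative assumptions):

```python
import torch

# A minimal sketch of the masked-language-modelling objective (token ids,
# the MASK id, and the 15% masking rate are illustrative). Some positions
# are replaced by MASK in the input; the loss is computed only at those
# positions, and the model may use context on both sides to fill them in.
MASK_ID, mask_prob = 103, 0.15
tokens = torch.randint(1000, 2000, (1, 12))      # a batch of one "sentence"

is_masked = torch.rand(tokens.shape) < mask_prob
inputs = tokens.masked_fill(is_masked, MASK_ID)  # corrupted input sentence
targets = tokens.masked_fill(~is_masked, -100)   # -100: ignored by cross_entropy

# Any bidirectional encoder (e.g. BERT) maps `inputs` to per-position logits;
# the loss compares those logits to `targets` with ignore_index=-100.
```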