# How can I make my models pay attention? - Attention mechanism in DL

In recent years, a growing interest has arisen around _attention_ and how to implement this mechanism for several architectures and tasks. Not only in transformers: nowadays it is one of the driving forces behind state-of-the-art performance in a variety of other deep learning models, e.g. Graph Attention Networks (GAT) or Recurrent Models of Visual Attention (RAM). But what is _attention_ and how does it work?

## Heuristic idea

The _attention_ mechanism is the one that dynamically highlights and uses the __salient__ parts of the information, in a manner similar to how the human brain would do it. For instance, in sequence to sequence models, it is built around the so-called _context_. The hidden states of the encoder, which can be understood as a "memory" containing a sequence of facts, are not all relevant, even though all of them are passed to the decoder. The attention mechanism "exploits" the content of this memory, the _context_, before each inference step, being able to pay attention to the content of one memory element (or a few of them, each with a different weight).

## Attention in _sequence to sequence_ models

Let's see how attention would be implemented in sequence to sequence models.

### Architecture

This kind of model is based on:

- An __encoder__ step, during which a sequence of items (of any elemental type: strings, numbers...) is processed and used to generate a fixed-size vector called the ``context``, conceivable as a 'memory' of potentially relevant information. The ``context`` vector is nothing other than the hidden states of the recurrent encoder model; however, __attention__ can be paid to it.
- A __decoder__ step, during which the ``context`` vector and the input sequence are used to decode and infer the target sequence.

![Sequence to sequence model applied to text translation](https://i.imgur.com/8EVO4P5.png)

It is during the decoder step that __attention__ models 'exploit' the ``context``, weighting its relevant terms. These weights are nothing but relevance 'probabilities'; at least, they are obtained by applying a softmax transformation after the attention scores have been assigned. More details on this part can be found below.

### How is attention scored?

Attention is measured according to an arbitrary score. There are different variations, but all of them need:

- The decoder output (or, at the first decoding step, the last hidden state of the encoder) $h$: a vector with shape (``rnn_units``, ``n_timesteps_out``).
- The encoder's hidden states $\overline{h}$: an array with shape (``rnn_units``, ``n_timesteps_in``).

Here ``rnn_units`` is the number of recurrent units of the model, and ``n_timesteps_in`` (resp. ``n_timesteps_out``) is the number of elements processed at once in the encoding (resp. decoding) step. Each of these techniques then returns an attention weight array with shape (``n_timesteps_in``, ``n_timesteps_out``), filled with the 'relevance probability' of each element of the input sequence with respect to the elements currently being handled in the decoding step (``n_timesteps_out`` elements at a time).
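To make these shapes concrete, here is a minimal NumPy sketch; the dimensions are arbitrary, and a plain dot-product score stands in as a placeholder for the score functions introduced below:

```python
import numpy as np

rnn_units, n_timesteps_in, n_timesteps_out = 64, 10, 5   # illustrative sizes

h = np.random.randn(rnn_units, n_timesteps_out)      # decoder outputs
h_bar = np.random.randn(rnn_units, n_timesteps_in)   # encoder hidden states

# score[s, t] compares input element s with decoder element t
score = h_bar.T @ h                                  # (n_timesteps_in, n_timesteps_out)

# softmax over the input-timestep axis turns scores into 'relevance probabilities'
alpha = np.exp(score) / np.exp(score).sum(axis=0, keepdims=True)

print(alpha.shape)        # (10, 5)
print(alpha.sum(axis=0))  # every column sums to 1
```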
Specifically, let us index the second axis of the decoder output $h$ with $t$ (so $t\in \{1, \ldots, T\}$, with $T$ equal to ``n_timesteps_out``), writing $h_t$, and the second axis of the encoder hidden states with $s$ (so $s\in \{1, \ldots, S\}$, with $S$ equal to ``n_timesteps_in``), writing $\overline{h}_s$. Then the context vector will be:

$$ c_t = \sum_{s=1}^{S} \alpha_{ts} \overline{h}_s $$

(one for each decoder output element $t$, i.e. ``n_timesteps_out`` of them per step), where the attention weights $\alpha_{ts}$ are computed as probabilities. That is, applying a softmax after evaluating the score function $\mathrm{score}(h_t, \overline{h}_s)$:

$$ \alpha_{ts} = \frac{e^{\mathrm{score}(h_t, \overline{h}_s)}}{\sum_{s'=1}^{S} e^{\mathrm{score}(h_t, \overline{h}_{s'})}} $$

This score function can be implemented in different ways. For instance, we have the well-known Luong attention scores:

- Luong **Dot** attention: $\mathrm{score}(h_t, \overline{h}_s) \equiv h_t^T \overline{h}_s$
- Luong **General** attention: $\mathrm{score}(h_t, \overline{h}_s) \equiv h_t^T W_a \overline{h}_s$

where $W_a$ is a trainable matrix.

### Visualizing the attention

Depending on the problem and the task, the elements may have different weight distributions. For instance, when translating from English to Spanish:

- "A beautiful dog" -> "Un perro bonito"

If ``n_timesteps_out = 1``, the attention weight corresponding to "dog" would be almost 0 in the first decoding step, while it would be almost one in the second one (not the third!). Plotting the attention weights as a matrix, with one row per input timestep, would then yield an almost-diagonal pattern. This is a common visualization of the attention mechanism. Below, the attention weights are shown for a seq-to-seq model using Bahdanau's score on a number-translation task:

![Bahdanau attention weights](https://i.imgur.com/bVcCUgn.png)

## Summary

The _attention_ mechanism is a workaround for the sequence to sequence models' bottleneck that arises from the fixed-size ``context`` vector. This vector is conceivable as the 'memory' of the encoder, which contains potentially relevant 'facts' for the decoding step. However, not all of them are relevant; so, instead of increasing its length even more (making the model much more complex and prone to over-fitting), the **attention** mechanism comes into action, weighting the most relevant ones similarly to what a human brain would do.

In this [notebook](https://www.kaggle.com/code/gerardcastro/dl-3rd-assignment) I implement a sequence-to-sequence model with several different attention scores.
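For reference, here is a minimal NumPy sketch of the two Luong scores and of the context-vector computation described above. Variable names and dimensions are illustrative, and this is a simplified sketch rather than the notebook's implementation:

```python
import numpy as np

rnn_units, n_timesteps_in, n_timesteps_out = 64, 10, 5   # illustrative sizes
h = np.random.randn(rnn_units, n_timesteps_out)      # decoder outputs h_t (columns)
h_bar = np.random.randn(rnn_units, n_timesteps_in)   # encoder states h_bar_s (columns)
W_a = np.random.randn(rnn_units, rnn_units)          # trainable in a real model

def luong_dot_score(h, h_bar):
    # score[s, t] = h_t^T h_bar_s
    return h_bar.T @ h

def luong_general_score(h, h_bar, W_a):
    # score[s, t] = h_t^T W_a h_bar_s
    return (W_a @ h_bar).T @ h

def attention_context(score, h_bar):
    # softmax over the input-timestep axis s, then c_t = sum_s alpha_ts * h_bar_s
    alpha = np.exp(score - score.max(axis=0))
    alpha /= alpha.sum(axis=0, keepdims=True)
    context = h_bar @ alpha                   # shape (rnn_units, n_timesteps_out)
    return context, alpha

context, alpha = attention_context(luong_general_score(h, h_bar, W_a), h_bar)
print(context.shape, alpha.shape)             # (64, 5) (10, 5)
```

Each column of ``alpha`` is the weight distribution over the input elements for one decoder output element, which is exactly what the almost-diagonal visualizations above display.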