# Attention explained in an easy way

## Introduction and Motivation

A common model for translating sentences from one language to another is the so-called Seq2Seq (i.e. Sequence-to-Sequence) model. Seq2Seq uses an encoder-decoder architecture: the encoder compresses the input, for example a sentence in English, into an output vector which we call a context vector. This context vector is then used as the input for the decoder, which produces the translated sentence, for example in German. For the encoder and the decoder, usually either an RNN or an LSTM is used.

However, one major issue with this architecture is the model's inability to handle longer inputs well. This is due to the fact that Seq2Seq models "forget" the earlier part of a sentence once it has been fully processed. To solve this problem, Bahdanau et al. introduced the Attention mechanism in the 2014 paper ["Neural Machine Translation by Jointly Learning to Align and Translate"](https://arxiv.org/abs/1409.0473). Shortly after, in 2017, the famous paper ["Attention is All You Need"](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) was published, which revolutionized the Deep Learning community with the Transformer architecture built on Attention. In this blog post, we will explain the idea and mechanism behind Attention.

## The idea

Now that we have talked about this "Attention" term, what is the fuss about and what does it actually do? In order to understand the idea of the Attention mechanism, let's quickly recap how a Seq2Seq model traditionally works:

![](https://i.imgur.com/62CEFpT.png)*Source: Smerity - Peeking into the neural network architecture used for Google's Neural Machine Translation, 2016*

In the above figure, we can see both the encoder and the decoder of the model. The encoder processes the input sequentially and ends up producing one single vector in which all the information is stored, and this vector is then passed to the decoder. However, an RNN usually produces an encoder output at each step or token of the sentence, and these outputs are simply disregarded in Seq2Seq models. For smaller sequences this works pretty well, but with larger inputs it becomes an increasingly big bottleneck.

The idea behind Attention is to work with all of the encoder outputs instead of just the last one. Intuitively, **attention** is paid to each word, and especially to the ones that are most meaningful.

## The attention mechanism using Bahdanau Attention

![](https://i.imgur.com/WyUMVpG.png)*Source: Bahdanau et al. - Neural Machine Translation by Jointly Learning to Align and Translate, 2014*

The above figure shows the mechanism behind Bahdanau Attention (also called Additive Attention). The Attention weights $\alpha$ determine how much attention should be paid to each input word: essentially, $\alpha_{t,i}$ is the attention that the output $y_t$ should pay to the annotation $h_i$. These weights can be learned by a small Feed-Forward Neural Network, or computed by just using the dot product, as done by other Attention mechanisms. We can then use the Attention weights to compute the context vector, which is the weighted sum over all the annotations with the probabilities $\alpha_{t,i}$. This way, we let the decoder selectively decide which words of the input it should pay the most attention to, and it does not have to rely on a single fixed-length context vector from the encoder.
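To make this weighted-sum idea concrete, here is a minimal NumPy sketch. The shapes and numbers are made up purely for illustration and are not taken from the paper:

```
import numpy as np

# Toy example: 4 encoder outputs ("annotations"), each a 3-dimensional vector.
encoder_outputs = np.array([
    [0.1, 0.3, 0.2],   # h_1
    [0.5, 0.1, 0.4],   # h_2
    [0.2, 0.7, 0.1],   # h_3
    [0.9, 0.2, 0.3],   # h_4
])

# Attention weights for one decoder step: they sum to 1 and tell the decoder
# how much each encoder output matters for producing the current output word.
alpha = np.array([0.05, 0.10, 0.70, 0.15])

# The context vector is simply the weighted sum of all encoder outputs.
context_vector = (alpha[:, None] * encoder_outputs).sum(axis=0)
print(context_vector)  # one 3-dimensional vector, dominated by h_3
```

Because $h_3$ received the largest weight, the resulting context vector is mostly influenced by that word, which is exactly the selective focus we want the decoder to have.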
Now let us quickly have a look at a step-by-step guide of how Bahdanau Attention works, and then we will implement the Bahdanau Attention layer in Keras.

1. **Generate the encoder hidden states.** In the figure above, the encoder hidden states are denoted by $h_t$. They can be generated by variants of RNNs such as bi-directional RNNs, LSTMs, GRUs or similar models.
2. **Calculate the alignment vectors.** The alignment vectors are calculated with the Bahdanau score function: $$\operatorname{score}\left(\boldsymbol{h}_{t}, \overline{\boldsymbol{h}}_{s}\right)=\boldsymbol{v}_{a}^{\top} \tanh \left(\boldsymbol{W}_{1} \boldsymbol{h}_{t}+\boldsymbol{W}_{2} \overline{\boldsymbol{h}}_{s}\right)$$ Here, $h_t$ denotes each of the encoder hidden states and $\overline{\boldsymbol{h}}_{s}$ the previous decoder hidden state. As we can see, we have three weights, $W_1$, $W_2$ and $v_a$, that we need to train. The alignment vector tells us how well each output word corresponds to the whole input sequence. To transform these scores into probabilities, we simply apply the softmax function to the vector, which gives us the Attention weights $\alpha$.
3. **Calculate the context vectors.** As described before, we now just need to multiply the Attention weights with the encoder hidden states and sum the results.
4. **Concatenate the context vector.** Finally, we concatenate the context vector with the previous decoder hidden state. The decoder uses this to compute the next word or token.

Steps 1 to 4 are repeated until an end-of-sentence token is produced.

We can implement this Attention layer in Python with Keras with the following code:

```
import tensorflow as tf


class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query:  previous decoder hidden state, shape (batch, units)
        # values: encoder hidden states, shape (batch, seq_len, units)

        # Bahdanau score function (Step 2)
        score = self.V(tf.nn.tanh(
            self.W1(tf.expand_dims(query, 1)) + self.W2(values)))

        # Applying the softmax function to the alignment scores
        # gives us the attention weights
        attention_weights = tf.nn.softmax(score, axis=1)

        # Calculating the context vector (Step 3)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)

        return context_vector, attention_weights
```

After we have computed the Attention weights, we can visualize them with an Attention matrix between the input and the output. This is extremely useful, as it can show us where our model performs well and where it does not. It is also a fun way of showcasing the mechanism of Attention.

![](https://i.imgur.com/CHztJqz.png)*Source: Bahdanau et al. - Neural Machine Translation by Jointly Learning to Align and Translate, 2014*

## Other Attention mechanisms

The cool thing about Attention mechanisms is that once we have built the architecture, it is pretty easy to swap one Attention mechanism for another. To achieve this, we simply need to change the alignment score function, while the rest stays the same. An overview of other Attention mechanisms can be seen below:

![](https://i.imgur.com/xnsDEbN.png)*Source: Lilian Weng - Attention? Attention!, 2018*
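As a small illustration of how little has to change, the sketch below swaps the additive score from Step 2 for a multiplicative (dot-product) score in the style of Luong et al. The class name and shape assumptions are my own, not code from any of the papers above; everything except the line computing `score` is identical to the `BahdanauAttention` layer:

```
import tensorflow as tf


class DotProductAttention(tf.keras.layers.Layer):
    """Same interface as the BahdanauAttention layer above,
    but with a dot-product alignment score and no trainable weights."""

    def call(self, query, values):
        # query:  previous decoder hidden state, shape (batch, units)
        # values: encoder hidden states, shape (batch, seq_len, units)

        # Dot-product score between the decoder state and every encoder state
        score = tf.matmul(values, tf.expand_dims(query, axis=-1))  # (batch, seq_len, 1)

        # From here on, everything is the same as in the Bahdanau layer
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights
```

Note that the plain dot-product score only works if the encoder and decoder hidden states have the same dimensionality; the "general" variant from the overview above inserts a trainable weight matrix between them to lift that restriction.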
## Conclusion

In this blog post, we have explained the idea behind Attention, gone into detail about how the Attention mechanism works, and shown how it can be implemented in Keras.

Overall, we can say that Attention is a tremendously important principle in Deep Learning, and modern language models would probably be unthinkable without it, as it laid the foundation for Transformers, BERT and many more groundbreaking models and ideas. In the next posts, we will dive a bit deeper into how Attention is used in modern applications.