# Attention in Deep Learning

#### Author: Thorsten Kalb, written in 2022

## Prerequisites

To understand this blog entry, you need a basic understanding of recurrent neural networks and encoder-decoder models. Some knowledge of classifiers in machine learning will facilitate the interpretation of attention.

## Background: Sequence to Sequence models

Sequence to sequence (seq2seq) models are common in deep learning, for example in machine translation tasks. The model takes a sequence as input and produces a different sequence as output. In machine translation, for example, it takes a sentence (a sequence of words) in one language as input and returns another sequence of words, the translated sentence, as output. In many applications, the full input sequence needs to be read before one can start producing the output sequence.

#### Example: Machine translation from German to English

In German subordinate clauses, the verb comes at the very end of the clause. In English, this is not the case. Thus, to translate a German sentence to English, we may need to wait until the final word before we can start translating (and before we can understand the sentence!).

German sentence: <span style="color:green">Ich habe keinen Hunger, weil ich</span> ein Brot <span style="color:red">gegessen habe</span>.

English equivalent: <span style="color:green">I am not hungry, because I</span> <span style="color:red">ate</span> a sandwich.

#### Implementation in Deep Learning

A seq2seq model consists of an encoder and a decoder. The encoder is a recurrent neural network (RNN), which takes a sequence as input and produces one output at each time step, hence also a sequence. It also produces a hidden state (h), which is fed to the same RNN at the next time step (i.e. to the next RNN unit). The last hidden state of the encoder is called the context, as it contains information about the full input sequence. The context is passed as the initial hidden state to the decoder, which is another RNN. The output of the decoder is the output sequence, for instance the translated sentence.

![](https://i.imgur.com/5TnqxUM.png)

## The problem: Poor performance for long sequences

The implementation described above works well for short sequences, but longer sequences cannot be processed with good performance. It turns out that the context vector passed from the encoder to the decoder is the bottleneck: it has to hold the information of all time steps in a single fixed-size vector, so for a long sequence each individual time step can contribute only little to it. This makes it difficult for the decoder to decode long sequences.
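To make this bottleneck concrete, here is a minimal sketch of such a plain encoder-decoder. It assumes PyTorch; the class names and hyper-parameters are illustrative choices, not taken from any particular paper or library.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                     # src: (batch, src_len) token ids
        outputs, h = self.rnn(self.embed(src))  # outputs: one vector per time step
        return outputs, h                       # h: last hidden state = the context

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, context):            # context: (1, batch, hidden_size)
        outputs, _ = self.rnn(self.embed(tgt), context)
        return self.out(outputs)                # logits for each target position
```

Everything the decoder learns about the source sentence has to pass through the single `context` tensor, which is exactly the bottleneck that the attention mechanism below removes.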
## The solution: Attention mechanisms

To solve this problem, a mechanism called attention has been proposed by Bahdanau et al. (2014) and Luong et al. (2015).

### What is attention from a cognitive perspective?

Every day, we are flooded with information. Imagine waiting in a car at a red traffic light. Pedestrians with a dog are crossing the street. There is an ad for a restaurant. The car next to you plays your favourite song. You are hungry, because you did not eat that sandwich. It is warm outside, the sun is shining. The dog is a German shepherd puppy and plays with the leash. Most of this information is irrelevant for the task at hand: When do you start driving? The two important features here are the pedestrians on the road and the red traffic light. So how do we filter relevant from irrelevant information? We pay attention! We classify the information of the red traffic light and of the persons on the road as important for the task, while we classify the rest of the information as not important.

For the translation task from German to English above, we would first pay attention to the green part of the German sentence, then to the red verb at the end and finally to the black part. These examples illustrate that attention is task-dependent and time-dependent. Since many seq2seq models are trained on one specific task, the time dependency is the important feature here.

### Implementations in Deep Learning

With the cognitive perspective of attention as a probabilistic importance classifier, the implementation is straightforward: instead of passing only the last hidden state as the context vector, the hidden states of all time steps are passed to the decoder, and an importance classifier is added on top. At each decoding time step, the importance classifier calculates a relative importance for each context vector; this relative importance is sometimes called the attention weight. The new context vector for the decoding step is then defined as the weighted sum of all context vectors, each weighted by its relative importance. Finally, the decoder uses this new context vector at the given time step to produce an output.

#### Example: Machine translation from German to English

Recall the translation problem.

German sentence: <span style="color:green">Ich habe keinen Hunger, weil ich</span> ein Brot <span style="color:red">gegessen habe</span>.

English equivalent: <span style="color:green">I am not hungry, because I</span> <span style="color:red">ate</span> a sandwich.

For the green part of the (German) input sequence, the decoder simply pays attention to the words at the same position. This means that for producing the output *I*, the decoder pays attention to *Ich*, and so on; these are the literal translations of the words. When the decoder has produced the second *I*, it will not pay attention to *ein Brot*, but rather to *gegessen habe*, because at this time step of decoding (translating), the verb is the important part.

#### Mathematical implementation

A score, the importance, is calculated for each context vector by a classifier that takes the context vector (an encoder hidden state) and the current decoder state as inputs. From these scores, the softmax is taken. This transforms them into relative importances and allows for a probabilistic interpretation: e.g. the first context vector has 90% relative importance, the second 7%, the third 3%, and the rest of the context vectors are not important. The weighted sum of these context vectors is passed to the decoder. The attention mechanism is trained jointly with the rest of the encoder-decoder model.

For the calculation of the score, Luong et al. proposed a kernel nearest neighbour mechanism, i.e. a bilinear similarity between the encoder hidden state and the decoder state:

$$ \text{score} = h^T W d $$

In a simplified version, the matrix $W$ is the identity matrix and the attention mechanism does not have any trainable parameters. Bahdanau et al. proposed a small multi-layer perceptron:

$$ \text{score} = v^T \tanh(Uh + Wd) $$

Here, $U$ and $W$ are trainable matrices and $v$ is a trainable vector. $h$ is a hidden state vector of the encoder RNN and $d$ is the decoder's hidden state at the current decoding step. The attention weights are then calculated as the softmax of these scores.

![](https://i.imgur.com/Ut7Qyyc.png)
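The sketch below puts these pieces together in plain NumPy: both scoring functions, the softmax and the weighted sum. The shapes, variable names and random placeholder parameters are illustrative assumptions, not code from either paper; in a real model $W$, $U$ and $v$ would be learned jointly with the encoder and decoder.

```python
import numpy as np

def softmax(x):
    """Turn raw scores into relative importances that sum to one."""
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_score(h, d, W):
    """Bilinear score of Luong et al.: score = h^T W d."""
    return h @ W @ d

def bahdanau_score(h, d, U, W, v):
    """Additive score of Bahdanau et al.: score = v^T tanh(U h + W d)."""
    return v @ np.tanh(U @ h + W @ d)

rng = np.random.default_rng(0)
hidden = 4
H = rng.standard_normal((6, hidden))  # six encoder hidden states, one per input word
d = rng.standard_normal(hidden)       # decoder state at the current decoding step
W = np.eye(hidden)                    # identity: the simplified, parameter-free variant

scores = np.array([luong_score(h, d, W) for h in H])
weights = softmax(scores)             # attention weights (cf. the 90% / 7% / 3% example)
context = weights @ H                 # weighted sum of the encoder hidden states

# Same pipeline with the additive (Bahdanau) score, using random stand-ins
# for the trainable parameters U, W and v:
U = rng.standard_normal((hidden, hidden))
W2 = rng.standard_normal((hidden, hidden))
v = rng.standard_normal(hidden)
scores_add = np.array([bahdanau_score(h, d, U, W2, v) for h in H])
context_add = softmax(scores_add) @ H
```

The resulting `context` vector is what the decoder consumes at this time step; at the next decoding step, new scores and weights are computed, so the decoder can shift its attention along the input sequence.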
## Summary

**Attention mechanisms can be seen as a probabilistic importance classifier on top of the decoder**. The classifier scores the importance of all context vectors; according to these importance scores, the context vectors are merged (as a weighted sum) and given to the decoder. Attention mechanisms are trained jointly with the encoder-decoder model. Proposed implementations of such a probabilistic classifier include nearest neighbours with a linear kernel (Luong et al., 2015) and a small multi-layer perceptron (Bahdanau et al., 2014). This could probably be extended to any other probabilistic classifier.

## Credits

This blog was written by Thorsten Kalb, inspired by the Deep Learning lecture of Jordi Vitria and the [blog](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/) of Jay Alammar.