# Attention
In psychology, attention is the cognitive process of selectively focusing on one or a few things while ignoring others.
In deep learning, attention is one of the most influential ideas in the last decade. It is an attempt to implement the definition of attention through the use of neural networks, i.e. selectively focusing on some relevant things while ignoring others in deep neural networks.
Today we are going to see how this mechanism works.
## Seq2seq models
Attention was originally introduced to address the main shortcoming of seq2seq models. Below we see how these work.

The architecture of Seq2Seq models consists of two components (a minimal code sketch follows the list):
* An **encoder**, built from RNNs, that takes an input sequence and compresses it into a single hidden state vector, called the context. The items of the sequence may be words, numbers or any other kind of data.
* A **decoder** that uses the context vector to generate an output sequence: after every prediction, it uses the previous hidden state to predict the next item of the sequence. For example, the decoder may generate a translation starting from a sequence of words.
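
To make this concrete, here is a minimal PyTorch sketch of a vanilla Seq2Seq encoder and decoder (class and parameter names are illustrative, not from the papers). Note that the decoder receives only the encoder's final hidden state as its context:

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                         # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embedding(src))
        return hidden                               # the single "context" vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden):               # token: (batch, 1)
        output, hidden = self.rnn(self.embedding(token), hidden)
        return self.out(output.squeeze(1)), hidden  # logits for the next word
```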
Drawback: The output sequence relies heavily on the context defined by the final hidden state of the encoder, which makes it challenging for the model to deal with long sentences. With a long sequence, there is a high probability that the context from the start of the input has been lost by the end of it.
Solution: The papers of Bahdanau et al., 2014 and Luong et al., 2015 introduced a technique called “Attention”, which allows the model to focus on different parts of the input sequence at every step of the output sequence, so that the context is preserved from beginning to end.
## Types of Attention
There are two major types of Attention: **Bahdanau Attention** and **Luong Attention**.
While the underlying principles of Attention are the same in both, their differences lie mainly in their architectures and computations.
### Bahdanau Attention

The first type of Attention, commonly referred to as Additive Attention, came from a paper by Dzmitry Bahdanau. The paper aimed to improve the sequence-to-sequence model in machine translation by aligning the decoder with the relevant parts of the input sentence through Attention.
The entire step-by-step process of applying Attention in Bahdanau’s paper is as follows:

#### 1. Producing the Encoder Hidden States
Instead of using only the hidden state at the final time step, we’ll be carrying forward all the hidden states produced by the encoder to the next step.
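In a framework such as PyTorch this simply means keeping the full per-step output of the RNN rather than only its final hidden state (a minimal sketch with illustrative shapes):

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
src = torch.randn(8, 10, 32)   # (batch, src_len, features)

# `outputs` keeps the hidden state at *every* time step: (8, 10, 64).
# `hidden` is only the final one: (1, 8, 64). Attention carries the
# full `outputs` tensor forward instead of just `hidden`.
outputs, hidden = rnn(src)
```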
#### 2-3. Calculating Alignment Scores
After obtaining all of our encoder outputs, we have to calculate the alignment score for each of them with respect to the decoder input and hidden state at that time step.
**The alignment score is the essence of the Attention mechanism**, as it quantifies the amount of “Attention” the decoder will place on each of the encoder outputs when producing the next output.
The alignment scores for Bahdanau Attention are calculated using the hidden state produced by the decoder in the previous time step, $s_{t-1}$, and the encoder outputs $h_i$, with the following equation:

$$
e_{t,i} = v_a^\top \tanh\left(W_a s_{t-1} + U_a h_i\right)
$$

The decoder hidden state and the encoder outputs are each passed through their own Linear layer, with their own individual trainable weights ($W_a$ and $U_a$ above).
Thereafter, they are added together before being passed through a tanh activation function; the decoder hidden state is added to each encoder output.
Lastly, the resulting vectors are multiplied by a trainable vector $v_a$, obtaining a final alignment score vector which holds one score for each encoder output.
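A minimal PyTorch sketch of this scoring step (the layer names `W_dec`, `W_enc` and `v` are our own labels for $W_a$, $U_a$ and $v_a$; shapes are illustrative):

```python
import torch
import torch.nn as nn

batch, src_len, hidden_size = 8, 10, 64
decoder_hidden  = torch.randn(batch, hidden_size)           # s_{t-1}
encoder_outputs = torch.randn(batch, src_len, hidden_size)  # h_1 .. h_T

W_dec = nn.Linear(hidden_size, hidden_size, bias=False)  # weights for s_{t-1}
W_enc = nn.Linear(hidden_size, hidden_size, bias=False)  # weights for each h_i
v     = nn.Linear(hidden_size, 1, bias=False)            # trainable vector v_a

# Add the projected decoder state to every projected encoder output,
# squash with tanh, then reduce to one score per source position.
energy = torch.tanh(W_dec(decoder_hidden).unsqueeze(1) + W_enc(encoder_outputs))
scores = v(energy).squeeze(-1)                           # (batch, src_len)
```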
#### 4. Softmaxing the Alignment Scores
After generating the alignment scores vector in the previous step, we can then apply a softmax on this vector to obtain the attention weights.
#### 5. Calculating the Context Vector
The encoder hidden states and their respective attention weights are now multiplied, and the weighted hidden states are summed up to form the context vector.
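Continuing the sketch above, steps 4 and 5 amount to a softmax followed by a weighted sum:

```python
import torch
import torch.nn.functional as F

scores = torch.randn(8, 10)              # one score per source position
encoder_outputs = torch.randn(8, 10, 64)

attn_weights = F.softmax(scores, dim=1)  # step 4: weights sum to 1
# Step 5: weighted sum of encoder hidden states -> context vector (8, 64)
context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)
```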
#### 6. Decoding the Output
The context vector is concatenated with the previous decoder output and fed into the Decoder RNN for that time step along with the previous decoder hidden state to produce a new output. The final output for the time step is obtained by passing the new hidden state through a Linear layer, which acts as a classifier to give the probability scores of the next predicted word.
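Putting step 6 together, one decoder time step might look like this (a hedged sketch; shapes and names are illustrative):

```python
import torch
import torch.nn as nn

batch, hidden_size, vocab_size = 8, 64, 1000
context     = torch.randn(batch, hidden_size)     # from step 5
prev_embed  = torch.randn(batch, hidden_size)     # embedded previous output word
prev_hidden = torch.randn(1, batch, hidden_size)  # previous decoder hidden state

rnn        = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)
classifier = nn.Linear(hidden_size, vocab_size)

# Concatenate the context vector with the previous output's embedding,
# run one decoder RNN step, then classify over the vocabulary.
rnn_input = torch.cat([prev_embed, context], dim=1).unsqueeze(1)
output, hidden = rnn(rnn_input, prev_hidden)
logits = classifier(output.squeeze(1))            # (batch, vocab_size)
```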
**The process (steps 2-5) repeats itself for each time step of the decoder until an end-of-sentence (`<EOS>`) token is produced or the output exceeds the specified maximum length.**
### Luong Attention

Luong Attention is often referred to as Multiplicative Attention and was built on top of the Attention mechanism proposed by Bahdanau. The main difference between Luong Attention and Bahdanau Attention is the way the alignment score is calculated.
In Luong Attention, there are three different ways in which the alignment scoring function is defined; all three are summarised in the equations and the code sketch after this list:
* Dot
The first one is the dot scoring function. This is the simplest of the functions; to produce the alignment score we only need to take the hidden states of the encoder and multiply them by the hidden state of the decoder.

* General
The second type is called general and is similar to the dot function, except that a weight matrix is added into the equation as well.

* Concat
The last function is somewhat similar to the way alignment scores are calculated in Bahdanau Attention, in that the decoder hidden state is combined with the encoder hidden states.

However, the difference lies in the fact that the decoder hidden state and the encoder hidden states are first concatenated before being passed through a Linear layer. This means that the decoder hidden state and encoder hidden states do not each have their own individual weight matrix, but share a single one, unlike in Bahdanau Attention. After being passed through the Linear layer, the output goes through a tanh activation function and is then multiplied by a trainable weight vector to produce the alignment score.
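In the notation of Luong et al. (2015), where $h_t$ is the current decoder hidden state and $\bar{h}_s$ is an encoder hidden state, the three scoring functions can be written as:

$$
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{dot} \\
h_t^\top W_a \bar{h}_s & \text{general} \\
v_a^\top \tanh\left(W_a [h_t ; \bar{h}_s]\right) & \text{concat}
\end{cases}
$$

And a minimal PyTorch sketch of all three (shapes and layer names are our own illustrative choices):

```python
import torch
import torch.nn as nn

batch, src_len, hidden_size = 8, 10, 64
decoder_hidden  = torch.randn(batch, hidden_size)           # h_t
encoder_outputs = torch.randn(batch, src_len, hidden_size)  # h_bar_s

# dot: plain inner product between the decoder state and each encoder state
dot = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)

# general: insert a trainable weight matrix W_a before the inner product
W_a = nn.Linear(hidden_size, hidden_size, bias=False)
general = torch.bmm(encoder_outputs, W_a(decoder_hidden).unsqueeze(2)).squeeze(2)

# concat: concatenate the states, apply one *shared* Linear layer and tanh,
# then reduce each position to a scalar with the trainable vector v_a
W_cat = nn.Linear(2 * hidden_size, hidden_size, bias=False)
v_a   = nn.Linear(hidden_size, 1, bias=False)
expanded = decoder_hidden.unsqueeze(1).expand(-1, src_len, -1)
concat = v_a(torch.tanh(W_cat(torch.cat([expanded, encoder_outputs], dim=2))))
concat = concat.squeeze(-1)       # all three scores have shape (batch, src_len)
```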
## Conclusions
The Attention mechanism has revolutionised the world of NLP models and is now a standard component. This is because it enables the model to “remember” all the words in the input and to focus on specific words when formulating a response.
Deep Learning models are generally considered black boxes, meaning that they cannot explain their outputs. Attention, however, is one of the successful methods that help make a model interpretable, explaining why it does what it does.
The main disadvantage of the Attention mechanism, at least in these recurrent models, is that it is very time consuming and hard to parallelise.