# A Brief Explanation About Attention Mechanism in Deep Learning

## Sequence to Sequence Models

First of all, in order to put ourselves in context, we have to talk about Sequence to Sequence (Seq2Seq) models. Seq2Seq models are a special class of Recurrent Neural Network architectures designed to handle sequence data. They are commonly used for tasks such as machine translation, text summarization, and language modeling.

The basic architecture of a Seq2Seq model consists of two parts: an encoder and a decoder (both usually LSTM models). The encoder processes the input sequence and encodes it into a fixed-length context vector. The decoder then uses this context vector to generate the output sequence.

![](https://i.imgur.com/lEUjtwV.jpg)

Basic Seq2Seq models work well for short sentences, but when we have to process very lengthy sequences we run into problems: these models can only capture a limited amount of context from the input sequence, the gradients may vanish as they are passed back through the many layers of the model, and, especially for long sequences, they can be computationally expensive to train and use. A solution to this problem is to add an attention mechanism, which yields a more robust model that also works with lengthy sentences.

## Introduction to Attention

Attention is a mechanism used in deep learning that allows a model to focus on specific parts of an input when making predictions or decisions. This is particularly useful when we have a sequence-to-sequence recurrent model that takes a sequence of items (words, letters, image frames, etc.) and outputs another sequence of items, as in natural language processing tasks where a model might need to understand the meaning of a sentence or paragraph by focusing on certain words or phrases. The attention mechanism is a good solution when we have to deal with very long sentences because it allows the model to focus only on the relevant parts of the sequence.
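To make the encoder's role concrete, here is a minimal NumPy sketch of a plain RNN encoder compressing a variable-length sequence into a single fixed-length context vector. The function name, weight matrices, and toy dimensions are all illustrative assumptions, not part of any specific library:

```python
import numpy as np

def rnn_encoder(inputs, W_x, W_h):
    """Encode a sequence into one fixed-length context vector (illustrative)."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:                      # one step per input item
        h = np.tanh(W_x @ x + W_h @ h)   # simple (Elman) RNN update
    return h                              # last hidden state = context vector

# toy dimensions: 4-dim input items, 3-dim hidden state
rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 4)) * 0.1
W_h = rng.normal(size=(3, 3)) * 0.1
sequence = [rng.normal(size=4) for _ in range(5)]

context = rnn_encoder(sequence, W_x, W_h)
print(context.shape)  # (3,) -- same size regardless of sequence length
```

Note that no matter how long `sequence` is, the context vector keeps the same fixed size, which is exactly the bottleneck attention was introduced to relieve.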
## How Does Attention Work?

At a high level, attention works by assigning different weights to different parts of an input. These weights determine how much "attention" the model will give to each part of the input when making a prediction. The model then uses these weighted inputs to generate an output.

In general, to implement attention we will consider RNN layers (LSTM, GRU, ...) and construct an encoder-decoder (Seq2Seq) model with some particularities:

* The encoder passes all of its hidden states to the decoder, not just the last one.
* The decoder builds the attention weights by giving each encoder hidden state a **score**, computed against the current decoder RNN hidden state. Note that at the first time step, since we still have no decoder hidden state, we use the last hidden state of the encoder instead.
* The softmax function is applied to these scores in order to amplify hidden states with high scores.
* The attention weights are multiplied with the encoder hidden states and summed in order to obtain the context vector.

![](https://i.imgur.com/C6cnBou.png)

* Finally, this context vector is the one we pass through a feedforward neural network in order to obtain the output of one time step.
* This process is repeated at each time step in order to obtain the output sequence.

## How Can We Compute the Score?

The last thing we have to cover is how the score for each hidden state is computed. Besides the encoder hidden states, the computation of the score involves the attention decoder RNN output vector and can also involve some weight matrices. These weights are learned by the model during training and determine how much importance or "attention" the model will give to each part of the input when making a prediction. The implementation of the attention score can vary depending on the specific architecture and task at hand, but there are some common elements that are typically included.
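The score-softmax-weighted-sum pipeline described above can be sketched in a few lines of NumPy. This is a toy, self-contained version using a dot-product score; the function names and dimensions are assumptions for illustration only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def attention_step(encoder_states, decoder_state):
    """One attention step: score every encoder hidden state against the
    current decoder state, normalize with softmax, build the context vector."""
    scores = encoder_states @ decoder_state   # one score per encoder state
    weights = softmax(scores)                 # amplify high-scoring states
    context = weights @ encoder_states        # weighted sum of hidden states
    return context, weights

rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(5, 3))  # 5 time steps, hidden size 3
decoder_state = rng.normal(size=3)        # current decoder hidden state

context, weights = attention_step(encoder_states, decoder_state)
print(context.shape)  # (3,)
```

The weights are non-negative and sum to one, so the context vector is a convex combination of the encoder hidden states: the model "looks at" the whole input sequence, but mostly at the high-scoring positions.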
Luong dot, Luong multiplicative, and Bahdanau attention are all variations of the attention mechanism, and each one uses a different set of equations to compute the attention weights.

**Luong dot attention** uses the dot product to compute the attention weights. Each score is the dot product of a key vector, which represents one part of the input (an encoder hidden state), and the query vector, which represents the current state of the model (the decoder RNN output).

![](https://i.imgur.com/0R9qF09.png)

**Luong multiplicative attention** uses a general form of the dot product to compute the attention weights: the score combines the query vector, the key vectors, and a set of learnable parameters. This allows the model to learn to attend to different parts of the input in a more flexible and fine-grained way.

![](https://i.imgur.com/qwzdOIj.png)

**Bahdanau attention** uses a different set of equations to compute the attention weights compared to the other mechanisms. The weights are computed using a combination of the query vector (the decoder RNN output), the key vectors (all encoder hidden states), and the value vectors, which also represent the input.

![](https://i.imgur.com/u1eMMBc.png)

Overall, the differences between these attention mechanisms lie in the equations used to compute the attention weights. Each variation has its own strengths and weaknesses, and may be more or less effective depending on the specific task at hand.
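The three scoring functions can be compared side by side in a short NumPy sketch. The matrix names (`W`, `W1`, `W2`, `v`) and the toy dimensions are illustrative assumptions; in a real model these would be learned parameters:

```python
import numpy as np

hidden = 3
rng = np.random.default_rng(2)
q = rng.normal(size=hidden)       # query: decoder RNN output
K = rng.normal(size=(5, hidden))  # keys: 5 encoder hidden states

# Luong dot: score(q, k) = q . k
dot_scores = K @ q

# Luong multiplicative ("general"): score(q, k) = q^T W k, with learnable W
W = rng.normal(size=(hidden, hidden))
general_scores = K @ (W @ q)

# Bahdanau (additive): score(q, k) = v^T tanh(W1 q + W2 k)
W1 = rng.normal(size=(hidden, hidden))
W2 = rng.normal(size=(hidden, hidden))
v = rng.normal(size=hidden)
additive_scores = np.tanh(q @ W1.T + K @ W2.T) @ v

print(dot_scores.shape, general_scores.shape, additive_scores.shape)
```

All three produce one score per encoder hidden state; they only differ in how much learnable structure sits between the query and the keys (none, one matrix, or a small feedforward layer).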