# Attention mechanisms in Deep Learning
Sequence-to-sequence (seq2seq) models have become very popular in Deep Learning, especially for Natural Language Processing (NLP) tasks. They have allowed us to build powerful Machine Translation systems that translate text from one language to another.
One of the main reasons behind their success is a mechanism called **Attention**. In this article, we will briefly talk about classic sequence-to-sequence models and their problems. Then, we will see what Attention is and will discuss the different Attention mechanisms that have been proposed.
## Overview of sequence-to-sequence models
<figure>
<img
src="https://i.imgur.com/OsEbo0E.png"
alt="Sequence-to-sequence model"
>
<figcaption style="text-align: center;">
An example of a sequence-to-sequence model.
</figcaption>
</figure>
Sequence-to-sequence models are made up of an **encoder** and a **decoder**. The encoder processes the input sequence and generates a vector containing its information, called the **context**. The decoder takes the context vector produced by the encoder and uses it to generate the output sequence. These kinds of models are usually implemented using RNNs or some variation of them (LSTMs, GRUs, etc.).
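As a rough illustration, a minimal encoder-decoder pair could look like the sketch below. The `Encoder`/`Decoder` names, the vocabulary size, the embedding size and the number of GRU units are all arbitrary assumptions for the example, not part of any reference implementation:
```python=
import tensorflow as tf

# Hypothetical sizes, chosen only for illustration
vocab_size, embedding_dim, units = 5000, 64, 128

class Encoder(tf.keras.Model):
    def __init__(self):
        super(Encoder, self).__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_state=True gives us the final hidden state, which acts as the context
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)

    def call(self, x):
        # x shape == (batch_size, input_length)
        outputs, state = self.gru(self.embedding(x))
        return outputs, state  # `state` is the context vector

class Decoder(tf.keras.Model):
    def __init__(self):
        super(Decoder, self).__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, x, state):
        # The decoder is initialized with the context produced by the encoder
        outputs, state = self.gru(self.embedding(x), initial_state=state)
        return self.fc(outputs), state
```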
<div style="display: block; text-align: center;">
<figure>
<img
src="https://i.imgur.com/hCKl7z4.png"
alt="Translation with sequence-to-sequence model"
>
<figcaption style="text-align: center;">
Example of a sequence-to-sequence model processing an input sequence written in French and generating the corresponding output sequence in English.
</figcaption>
</figure>
</div>
However, these kinds of models present a severe limitation: because the context contains mainly information from the last step of the encoder, a lot of the earlier information is lost. This is especially troublesome when dealing with long sequences.
## What is Attention?
In order to deal with this problem, some authors proposed a mechanism called "Attention". The idea behind it is to allow the model to focus on the relevant parts of the input sequence as needed. This is done by computing, at each step of the decoding stage, a weighted combination of all of the encoder's hidden states, producing a context vector that contains the relevant information from the encoding process. This context is then combined with the decoder's current hidden state to produce the output. An example of how Attention works for a given input sequence can be seen in the image below.
<div style="display: block; text-align: center;">
<figure>
<img
src="https://i.imgur.com/6YB1HqX.png"
alt="Attention weights example"
>
<figcaption style="text-align: center;">
Example of attention weights for a translation task. The input is a sentence in French, and the output is the translated sentence in English. We can see that for each output word, some elements of the input are more relevant than others (they have larger weights, and thus, the cells are brighter).
</figcaption>
</figure>
</div>
Given a sequence of encoder outputs $\bar{h}_s$ that are being attended to (also called the *keys* and *values*), and the decoder state $h_t$ that attends to the sequence (also called the *query*), the decoder performs the following operations at each step (a toy sketch of these steps is shown right after the list):
1. Compute the alignment scores between $h_t$ and each $\bar{h}_s$ using some scoring function (we will see some of them later on).
2. Compute the attention weights by applying a softmax over the scores: $a_{ts} = \text{softmax}(score(h_t, \bar{h}_s))$
3. Generate the context vector as the weighted sum of the encoder outputs: $c_t = \sum_s a_{ts} \bar{h}_s$
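The following sketch goes through these three steps on toy tensors, using a plain dot product as the scoring function; the batch size, sequence length and hidden size are arbitrary assumptions:
```python=
import tensorflow as tf

batch_size, max_length, hidden_size = 2, 5, 8
h_s = tf.random.normal((batch_size, max_length, hidden_size))  # encoder outputs (keys/values)
h_t = tf.random.normal((batch_size, hidden_size))               # current decoder state (query)

# 1. Alignment scores (here, a simple dot product between h_t and each h_s)
score = tf.einsum('bh,bsh->bs', h_t, h_s)     # (batch_size, max_length)
# 2. Attention weights: softmax over the input positions
a_ts = tf.nn.softmax(score, axis=-1)          # (batch_size, max_length)
# 3. Context vector: weighted sum of the encoder outputs
c_t = tf.einsum('bs,bsh->bh', a_ts, h_s)      # (batch_size, hidden_size)
```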
## Attention mechanisms
As mentioned before, there are several methods to compute the alignment scores. Let's take a look at some of the most important ones.
### Bahdanau Attention
This method was proposed by Bahdanau et al. in the 2014 paper that introduced the concept of Attention. The score is computed as follows:
$$score(h_t, \bar{h}_s) = v^T \tanh(W_1 h_t + W_2 \bar{h}_s)$$
where $W_1$ and $W_2$ are weight matrices applied to the current decoder state and the encoder outputs, respectively, and $v$ is a weight vector.
An implementation using `Tensorflow` and `Keras` is shown below:
```python=
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # values shape == (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
```
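For instance, the layer can be called with a decoder state and the encoder outputs; the tensor shapes below are just an illustrative assumption:
```python=
attention_layer = BahdanauAttention(units=10)
query = tf.random.normal((64, 16))        # decoder state: (batch_size, hidden_size)
values = tf.random.normal((64, 20, 16))   # encoder outputs: (batch_size, max_length, hidden_size)
context_vector, attention_weights = attention_layer(query, values)
print(context_vector.shape)     # (64, 16)
print(attention_weights.shape)  # (64, 20, 1)
```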
### Luong Dot Attention
In 2015, Luong et al. published another paper on Attention that introduced several new Attention mechanisms. One of them is Luong Dot Attention, whose score is computed as follows:
$$score(h_t, \bar{h}_s) = h_t^T \bar{h}_s$$
Notice that in this case there are no trainable parameters: the score depends only on $h_t$ and $\bar{h}_s$.
An implementation of this Attention mechanism can be seen below:
```python=
class LuongDotAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(LuongDotAttention, self).__init__()

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # values shape == (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)
        values_transposed = tf.transpose(values, perm=[0, 2, 1])

        # Luong dot-product score
        # score shape == (batch_size, max_length, 1)
        score = tf.transpose(tf.matmul(query_with_time_axis,
                                       values_transposed), perm=[0, 2, 1])

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
```
### Luong General Attention
Finally, Luong proposed another technique in the same paper, which is a generalization of the previous one. The score is computed as follows:
$$score(h_t, \bar{h}_s) = h_t^T W \bar{h}_s$$
where $W$ is a weight matrix. As before, an implementation of this Attention mechanism is shown below:
```python=
class LuongGeneralAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(LuongGeneralAttention, self).__init__()
        self.W = tf.keras.layers.Dense(units)

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # values shape == (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)
        # Project the encoder outputs with W before the dot product
        values_transposed = tf.transpose(self.W(values), perm=[0, 2, 1])

        # score shape == (batch_size, max_length, 1)
        score = tf.transpose(tf.matmul(query_with_time_axis, values_transposed),
                             perm=[0, 2, 1])
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
```
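Since all three layers expose the same `call(query, values)` interface, they can be swapped inside the same decoder. A quick smoke test could look like the sketch below; the shapes are an illustrative assumption, and note that the dot and general variants require the projected encoder outputs and the query to have compatible sizes (here, `units` equals the query's hidden size):
```python=
hidden_size = 16
query = tf.random.normal((64, hidden_size))
values = tf.random.normal((64, 20, hidden_size))

for layer in (BahdanauAttention(units=10),
              LuongDotAttention(),
              LuongGeneralAttention(units=hidden_size)):
    context_vector, attention_weights = layer(query, values)
    print(type(layer).__name__, context_vector.shape, attention_weights.shape)
```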
## Conclusions
Attention has been a breakthrough in Deep Learning, especially for NLP tasks. It has proven to be a very powerful technique that allows us to obtain exceptional results. It is also the fundamental building block of **Transformers**, which are the current state-of-the-art models used in NLP. So, having a basic understanding of how it works can be very useful when studying more complex models.