# The Attention Mechanism in Deep Learning
## Introduction
Attention is one of the most influential ideas in the Deep Learning community. Even though this mechanism is now used in various problems such as image captioning, it was initially designed in the context of Neural Machine Translation using Seq2Seq models. In this blog, we will present the reader with a general overview of this interesting topic.
The question is: What’s wrong with seq2seq models?
The seq2seq models are normally composed of an encoder-decoder architecture, where the encoder processes the input sequence and encodes/compresses/summarizes the information into a context vector (also called the “thought vector”) of a fixed length. This representation is expected to be a good summary of the entire input sequence. The decoder is then initialized with this context vector, from which it starts generating the transformed output.
A critical and apparent disadvantage of this fixed-length context vector design is the incapability of the system to remember longer sequences. Often it has forgotten the earlier parts of the sequence by the time it has processed the entire sequence. In other words, the problem that sequence-to-sequence models usually have is that they are not able to accurately process long input sequences, since only the last hidden state of the encoder RNN is used as the context vector for the decoder. The attention mechanism was born to resolve this problem.
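The following is a minimal sketch of such an attention-free encoder; the GRU layer, vocabulary size and dimensions are purely illustrative choices, not part of any particular model. It only highlights that a single fixed-length vector is all that reaches the decoder:
```
import tensorflow as tf

# Sketch of a plain (attention-free) seq2seq encoder: the whole input sequence
# is compressed into a single fixed-length context vector, which is then used
# to initialize the decoder. All sizes are illustrative.
vocab_size, embedding_dim, units = 10000, 256, 512

encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32")   # (batch, src_len)
x = tf.keras.layers.Embedding(vocab_size, embedding_dim)(encoder_inputs)

# return_state=True exposes the final hidden state: the "thought vector"
encoder_outputs, context_vector = tf.keras.layers.GRU(
    units, return_sequences=True, return_state=True)(x)

# Without attention, only `context_vector` (shape (batch, units)) is passed on;
# the per-step encoder outputs in `encoder_outputs` are simply discarded.
```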
## The idea of "attention": Different attention mechanisms
The idea behind the attention mechanism is to allow the decoder to utilize the most relevant parts of the input sequence in a flexible way, through a weighted combination of all of the encoded input vectors, with the most relevant vectors receiving the highest weights. It is accomplished by creating a mapping between each time step of the decoder output and all the encoder hidden states. This means that for each output that the decoder produces, it has access to the entire input sequence and can selectively pick out specific elements from that sequence to produce the output.
As human beings, we are quickly able to understand these mappings between different parts of the input sequence and corresponding parts of the output sequence. However, it is not that straightforward for an artificial neural network to automatically detect these mappings and "learn" them through Gradient Descent and Back-propagation.
Hence, the name **attention** serves as a fitting term to describe these kinds of algorithms and mechanisms.
The general attention mechanism has three main components: the queries Q, the keys K, and the values V. The general attention mechanism performs the following computations:
* Computation of the alignment score value: $score = Q \cdot K$
* Generation of the attention weights: $a = \mathrm{softmax}(score)$
* Generation of the context vector: $c = \sum a \cdot V$
*General scheme of a seq2seq attention model*
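To make these three steps concrete, here is a minimal sketch in TensorFlow (the tensors and dimensions are purely illustrative) of computing a context vector for a single query:
```
import tensorflow as tf

# Toy illustration of the three computations above (dimensions are arbitrary).
# Q: one decoder query; K/V: the encoder hidden states acting as keys and values.
Q = tf.random.normal((1, 8))        # (1, d)        single query
K = tf.random.normal((5, 8))        # (seq_len, d)  keys
V = tf.random.normal((5, 8))        # (seq_len, d)  values

score = tf.matmul(Q, K, transpose_b=True)   # alignment scores, shape (1, seq_len)
a = tf.nn.softmax(score, axis=-1)           # attention weights, sum to 1
c = tf.matmul(a, V)                         # context vector, shape (1, d)
```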
There are two main types of attention mechanisms that we will discuss in the following. They differ mainly in their architectures and in how the alignment score is computed.
### Bahdanau Attention
This method aims to improve the sequence-to-sequence model by aligning the decoder with the relevant parts of the input sentence and implementing Attention. The entire step-by-step process of applying Bahdanau Attention is the following:
1. Encoder produces hidden states of each element in the input sequence.
2. Alignment scores are calculated between the previous decoder hidden state and each of the encoder’s hidden states.
3. Alignment scores for each encoder hidden state are combined and represented in a single vector and subsequently softmaxed.
4. The encoder hidden states and their respective alignment scores are multiplied to form the context vector.
5. The context vector is concatenated with the previous decoder output and fed into the Decoder RNN for that time step along with the previous decoder hidden state to produce a new output.
6. The process (steps 2-5) repeats itself for each time step of the decoder until an end-of-sequence token is produced or the output exceeds the specified maximum length.
*General scheme of the Bahdanau attention model*
The alignment score formula is the following:
$$ score = V \cdot \tanh(W_1 H_{decoder} + W_2 H_{encoder}) , $$
where $W_1$, $W_2$ and $V$ are learned weight matrices (they correspond to `W1`, `W2` and `V` in the implementation below).
Finally, its implementation in Python and TensorFlow looks as follows:
```
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)  # applied to the decoder state (query)
        self.W2 = tf.keras.layers.Dense(units)  # applied to the encoder states (values)
        self.V = tf.keras.layers.Dense(1)       # reduces the tanh output to a scalar score

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # values shape == (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)

        # Bahdanau (additive) score, shape == (batch_size, max_length, 1)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
```
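As a quick sanity check, the layer can be called with random tensors; the shapes below are purely illustrative:
```
# Illustrative shape check with random tensors.
batch_size, max_length, hidden_size, units = 4, 10, 16, 32

attention = BahdanauAttention(units)
query = tf.random.normal((batch_size, hidden_size))               # previous decoder state
values = tf.random.normal((batch_size, max_length, hidden_size))  # encoder hidden states

context_vector, attention_weights = attention(query, values)
print(context_vector.shape)     # (4, 16)
print(attention_weights.shape)  # (4, 10, 1)
```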
### Luong Attention
Compared to Bahdanau Attention, Luong Attention has a different general structure of the Attention Decoder, as the context vector is only utilised after the RNN has produced the output for that time step. The entire step-by-step process of applying Luong Attention is the following:
1. Encoder produces hidden states of each element in the input sequence.
2. Decoder RNN - the previous decoder hidden state and decoder output are passed through the Decoder RNN to generate a new hidden state for that time step.
3. Using the new decoder hidden state and the encoder hidden states, alignment scores are calculated.
4. The alignment scores for each encoder hidden state are combined and represented in a single vector and subsequently softmaxed.
5. The encoder hidden states and their respective alignment scores are multiplied to form the context vector.
6. Producing the Final Output - the context vector is concatenated with the decoder hidden state generated in step 2 and passed through a fully connected layer to produce a new output.
7. The process (steps 2-6) repeats itself for each time step of the decoder until an end-of-sequence token is produced or the output exceeds the specified maximum length.
*General scheme of the Luong attention model*
Its alignment score formula (and, thus, its implementation) can take two forms:
#### Luong Attention Dot version
The alignment score formula is the following:
$$ score = H_{encoder} \cdot H_{decoder} . $$
Its implementation in Python and TensorFlow looks as follows:
```
class LuongDotAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(LuongDotAttention, self).__init__()

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # values shape == (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)
        values_transposed = tf.transpose(values, perm=[0, 2, 1])

        # Luong dot-product score, shape == (batch_size, max_length, 1)
        score = tf.transpose(tf.matmul(query_with_time_axis,
                                       values_transposed), perm=[0, 2, 1])

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
```
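Note that the dot version has no trainable parameters: the score is a plain dot product, which also means (as the `tf.matmul` above requires) that the decoder hidden state and the encoder hidden states must have the same dimensionality. The general version below relaxes this constraint by inserting a learned weight matrix.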
#### Luong Attention General version
The alignment score formula is the following:
$$ score = H_{decoder} \cdot (W H_{encoder}) , $$
where $W$ is a learned weight matrix applied to the encoder hidden states (it corresponds to `W` in the implementation below).
Its implementation in Python and TensorFlow looks as follows:
```
class LuongGeneralAttention(tf.keras.layers.Layer):
    def __init__(self, size):
        super(LuongGeneralAttention, self).__init__()
        self.W = tf.keras.layers.Dense(size)  # learned transformation of the encoder states

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # values shape == (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)
        values_transposed = tf.transpose(self.W(values), perm=[0, 2, 1])

        # Luong general score, shape == (batch_size, max_length, 1)
        score = tf.transpose(tf.matmul(query_with_time_axis,
                                       values_transposed), perm=[0, 2, 1])

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
```
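As a quick, purely illustrative shape check (random tensors, arbitrary dimensions), both Luong variants accept the same inputs and produce the same output shapes as the Bahdanau layer above:
```
# Illustrative shape check: both Luong variants are drop-in replacements for
# BahdanauAttention as far as input and output shapes are concerned.
batch_size, max_length, hidden_size = 4, 10, 16

query = tf.random.normal((batch_size, hidden_size))               # current decoder state
values = tf.random.normal((batch_size, max_length, hidden_size))  # encoder hidden states

for attention in (LuongDotAttention(), LuongGeneralAttention(hidden_size)):
    context_vector, attention_weights = attention(query, values)
    print(context_vector.shape, attention_weights.shape)  # (4, 16) (4, 10, 1)
```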