# Attention, a brief explanation

One of the main problems of RNNs is that, as they iterate over the input sequence, the network loses memory and pays less attention to tokens that are far away; this is commonly known as the **bottleneck problem**. For this reason, it is not a good idea to use a plain RNN for problems with long sentences. This problem can be addressed with an attention architecture. Concretely, in this work we will talk about the following attention functions:

- Luong Dot: [Effective Approaches to Attention-based Neural Machine Translation, 2015](https://arxiv.org/abs/1508.04025)
- Luong General: [Effective Approaches to Attention-based Neural Machine Translation, 2015](https://arxiv.org/abs/1508.04025)
- Bahdanau: [Neural Machine Translation by Jointly Learning to Align and Translate, 2014](https://arxiv.org/abs/1409.0473)

The architecture used in this project is an RNN combined with the attention mechanism.

<center>
<img src="https://i.imgur.com/DvfAfdk.png" width="600" height="400" />
<p style="color:black;font-size:12px;"> Figure 1: RNN with attention schema</p>
</center>

## How Does Attention Work?

In this section we focus on attention in RNNs.

* Notation:
  * $h_s$: all the hidden states of the `Encoder`
  * $h_t$: hidden state of the `Decoder` from the previous time step
  * $C_t$: context vector
  * $W_a$: trainable weight matrix
  * `query`: the previous `Decoder` output
  * `values`: all the hidden states produced by the `Encoder`

In order to solve the bottleneck problem, we must use all the hidden states of the `Encoder` to generate the `Decoder` output. The `Encoder` works as in a plain encoder-decoder model (such as seq2seq), but the `Decoder` performs some extra steps before producing its output:

- First, we initialize the `Decoder` states with the last states of the `Encoder`, as usual.
- At each time step, we use all the hidden states ($h_s$) of the `Encoder` and the previous `Decoder` output to calculate a **score** for each hidden state.
- We apply a Softmax function to the **scores**. This amplifies hidden states with high scores and drowns out hidden states with low scores.
- To calculate the **context vector $C_t$**, we multiply the hidden states of the `Encoder` by their scores and sum the results.
- Finally, we concatenate the context vector with the previous `Decoder` output to create the input for the `Decoder` at the next time step.

The key to understanding the attention mechanism is that each word has a relation with every other word, and attention can capture these relations while keeping all the information at every step, because this architecture uses all the hidden states instead of only the last one, as the seq2seq model does. If the attention function has a trainable weight matrix ($W_a$), back-propagation is used to learn the best values for it.

<center>
<img src="https://i.imgur.com/E6eBy9v.png" width="400" height="400" />
<p style="color:black;font-size:12px;"> Figure 2: heatplot attention weight</p>
</center>

This heatplot shows an example of the attention weight matrix. The task here is reversing a sequence using the Bahdanau attention function. Notice that the anti-diagonal is close to 1, showing that each output position attends to the same number at the mirrored position in the input.

### Implementing Attention

* Attention weights:

$$
\alpha_{ts} = \frac{e^{score(h_t, \hat{h}_s)}}{\sum^S_{s'=1}{e^{score(h_t, \hat{h}_{s'})}}}
$$

Notice that the **scores** are passed through a **Softmax** function so that they are rescaled to [0, 1]: the hidden states with the highest scores get weights close to 1, and vice versa.
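As a quick illustration of this rescaling (a minimal sketch with made-up score values, not part of the original notebook), `tf.nn.softmax` turns a vector of raw scores into weights in [0, 1] that sum to 1:

```{python}
import tensorflow as tf

# Hypothetical raw scores for a source sentence of 4 tokens (batch of 1).
scores = tf.constant([[2.0, 0.5, -1.0, 0.1]])

# Softmax rescales the scores to [0, 1] so that the highest raw score
# ends up with the largest attention weight.
attention_weights = tf.nn.softmax(scores, axis=-1)

print(attention_weights.numpy())         # approx. [[0.70 0.16 0.04 0.11]]
print(tf.reduce_sum(attention_weights))  # the weights sum to 1
```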
* Context vector:

$$
c_t=\sum_s \alpha_{ts}\hat{h}_s
$$

To get the context vector $C_t$, we multiply the hidden states of the encoder by the attention weights $\alpha_{ts}$ and sum the results.

1. **Luong Dot**

This function was presented in the paper [Effective Approaches to Attention-based Neural Machine Translation, 2015](https://arxiv.org/abs/1508.04025). It is the simplest attention function: just a matrix multiplication of the transposed previous decoder hidden state $h_t$ with all the encoder hidden states $\hat{h}_s$. Its alignment score function is calculated as:

$$score(h_t, \hat{h}_s)=h_t^T\hat{h}_s$$

```{python}
import tensorflow as tf


class LuongDotAttention(tf.keras.layers.Layer):
  def __init__(self):
    super(LuongDotAttention, self).__init__()

  def call(self, query, values):
    query_with_time_axis = tf.expand_dims(query, 1)
    values_transposed = tf.transpose(values, perm=[0, 2, 1])

    # Luong dot-product score: h_t^T h_s for every encoder state
    score = tf.transpose(tf.matmul(query_with_time_axis,
                                   values_transposed), perm=[0, 2, 1])

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
```

From: [notebook](https://colab.research.google.com/github/JanLeyva/DeepLearning/blob/main/Assignment3_2021.ipynb#scrollTo=Mr06uR9YTUvq)

2. **Luong General (multiplicative)**

The Luong General attention function is defined by:

$$score(h_t,\hat{h}_s)= h_t^T W_a\hat{h}_s$$

This function was also presented in [Effective Approaches to Attention-based Neural Machine Translation, 2015](https://arxiv.org/abs/1508.04025), which introduces both the Luong Dot and the Luong General attention. Luong General adds a trainable matrix $W_a$ in order to get better performance.

```{python}
class LuongGeneralAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(LuongGeneralAttention, self).__init__()
    # W_a of the general score; `units` should match the decoder hidden size
    self.W1 = tf.keras.layers.Dense(units)

  def call(self, query, values):
    query_with_time_axis = tf.expand_dims(query, 1)

    # Luong general score: h_t^T W_a h_s.
    # W1 is applied over the feature axis of the encoder states.
    keys_transposed = tf.transpose(self.W1(values), perm=[0, 2, 1])
    score = tf.transpose(tf.matmul(query_with_time_axis,
                                   keys_transposed), perm=[0, 2, 1])

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
```

From: [notebook](https://colab.research.google.com/github/JanLeyva/DeepLearning/blob/main/Assignment3_2021.ipynb#scrollTo=Mr06uR9YTUvq)
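As a quick sanity check (a minimal sketch, not part of the original notebook; the batch size, sequence length and hidden size below are arbitrary), both Luong layers can be called on random tensors to verify the output shapes:

```{python}
import tensorflow as tf

batch_size, max_length, hidden_size = 2, 7, 16

# Toy tensors: `values` plays the role of all encoder hidden states h_s,
# `query` the decoder hidden state from the previous time step h_t.
values = tf.random.normal((batch_size, max_length, hidden_size))
query = tf.random.normal((batch_size, hidden_size))

context, weights = LuongDotAttention()(query, values)
print(context.shape, weights.shape)   # (2, 16) (2, 7, 1)

context, weights = LuongGeneralAttention(units=hidden_size)(query, values)
print(context.shape, weights.shape)   # (2, 16) (2, 7, 1)
```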
3. **Bahdanau**

The Bahdanau attention function is defined by:

$$score(h_t, \hat{h}_s)= v^T_a \tanh (W_a [h_t;\hat{h}_s])$$

This function was presented in the paper [Neural Machine Translation by Jointly Learning to Align and Translate, 2014](https://arxiv.org/abs/1409.0473), the paper that introduced the attention mechanism for neural machine translation. The function includes two trainable weight matrices ($W_{a1}, W_{a2}$, equivalent to a single $W_a$ applied to the concatenation above), a weight vector $v_a$, and a $\tanh$ that scales values to [-1, 1].

```{python}
class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    query_with_time_axis = tf.expand_dims(query, 1)

    # Bahdanau (additive) score: v_a^T tanh(W1 h_t + W2 h_s)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
```

From: [notebook](https://colab.research.google.com/github/JanLeyva/DeepLearning/blob/main/Assignment3_2021.ipynb#scrollTo=Mr06uR9YTUvq)

---

## Attention Is All You Need

We cannot talk about attention without mentioning one of the most influential papers of recent years, **[Attention Is All You Need, 2017](https://arxiv.org/abs/1706.03762)**, by researchers at Google. This paper introduces the **Transformer**, an architecture that achieves strong performance without any recurrence, relying only on the **attention mechanism** and the architecture shown below:

![Figure 3: The Transformer - model architecture. Source: Attention Is All You Need, 2017](https://i.imgur.com/SQRn9lZ.png)

<center>
<p style="color:black;font-size:12px;"> Figure 3: Transformer - model architecture</p>
</center>

This architecture adds a positional encoding to the input embedding so that the order of the tokens is not lost. The idea behind it is to use sinusoidal waves as a continuous analogue of a binary encoding of position. This kind of architecture is nowadays the most widely used in NLP tasks, and increasingly in computer vision. Models such as BERT, GPT-2 and GPT-3 use this architecture.

This article is based on this [notebook](https://colab.research.google.com/github/JanLeyva/DeepLearning/blob/main/Assignment3_2021.ipynb#scrollTo=Mr06uR9YTUvq), which implements the three attention functions described above.

# References

* [[Jordi Vitrià, Departament de Matemàtiques i Informàtica de la UB, 2021]](https://github.com/DeepLearningUB/DeepLearningUB.github.io) Attention and Context-based Embeddings.
* [[Bahdanau et al. 2015]](https://arxiv.org/abs/1409.0473) D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
* [[Luong et al. 2015]](https://arxiv.org/abs/1508.04025) Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation.
* [Tensorflow Attention tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention)
* [Murat Karakaya Akademi](https://colab.research.google.com/github/kmkarakaya/ML_tutorials/blob/master/seq2seq_Part_F_Encoder_Decoder_with_Bahdanau_%26_Luong_Attention_Mechanism.ipynb#scrollTo=N-RyyRhTQ2XC) Encoder Decoder with Bahdanau & Luong Attention.
* [[Vaswani et al. 2017]](https://arxiv.org/abs/1706.03762) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need.

- Figures 1 and 3 are from the course material of Jordi Vitrià, Departament de Matemàtiques i Informàtica de la UB, Deep Learning, 2021.