# How does attention work?

###### tags: `Attention`, `Deep Learning`, `Seq2Seq`, `Translation`, `RNN`

## Introduction

If we give a huge dataset to a model to learn from, some important parts of the data may be ignored. Paying attention to the important information is necessary and can improve performance. This can be achieved by adding an extra feature to the model: Attention.

Attention is one of the most influential ideas in deep learning in recent years. It is an input-processing technique for neural networks that allows the network to focus on specific aspects of a complex input, one at a time, until the entire input has been processed. Although this mechanism is now used in a number of problems such as image captioning, it was initially designed in the context of neural machine translation with Seq2Seq models.

The aim of this method is to divide complicated tasks into smaller areas of attention that are processed sequentially, similar to how the human mind solves a new problem by breaking it down into simpler tasks and solving them one by one.

## Seq2Seq Learning

Sequence-to-sequence learning (Seq2Seq) is a family of machine learning approaches used for language processing. It converts one sequence (the source) into another sequence (the target) by working, generally speaking, in this way:

![](https://i.imgur.com/TiIYH1u.png)

The elements of the sequence $x_1, x_2, \dots, x_n$ can be almost anything: text representations, pixels, or even whole images in the case of videos.

As we can see, Seq2Seq models are composed of an encoder-decoder architecture, where the encoder processes the input sequence and compresses/summarises the information into an embedding, a context vector of fixed length. This representation is expected to be a good summary of the entire input stream. The decoder is then initialized with this context vector, from which it starts to generate the transformed output.

A critical and apparent disadvantage of this fixed-length context vector design is the system's inability to remember longer sequences: the first parts of the sequence are often forgotten once the entire sequence has been processed, and the model mainly remembers the parts it has just seen. In the case of RNNs this happens due to the vanishing gradient problem, as detailed in Bengio et al. (1993) [1]. This is where the Attention mechanism appears; it was born to solve this problem.
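To make the bottleneck concrete, here is a minimal sketch of the encoder-decoder setup described above: the encoder reads the whole input and squeezes it into one fixed-length context vector, and the decoder only ever sees the input through that vector. The vanilla-RNN update, the dimensions, and the random weights are illustrative assumptions, not a faithful implementation of any particular paper or library.

```python
# Minimal sketch of the Seq2Seq fixed-length bottleneck (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16          # input embedding size, hidden state size

# Encoder parameters (vanilla RNN cell): h_i = tanh(W_x x_i + W_h h_{i-1})
W_x = rng.normal(scale=0.1, size=(d_hid, d_in))
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid))

def encode(xs):
    """Run the encoder over the whole sequence and return only the last
    hidden state -- this single vector is the fixed-length context."""
    h = np.zeros(d_hid)
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
    return h                              # everything about xs is crammed in here

# Decoder parameters: s_t = tanh(U_y y_{t-1} + U_s s_{t-1})
U_y = rng.normal(scale=0.1, size=(d_hid, d_in))
U_s = rng.normal(scale=0.1, size=(d_hid, d_hid))

def decode(context, y_prev, steps=3):
    """The decoder sees the source only through `context`; with long inputs,
    information about the first tokens is easily lost in this bottleneck."""
    s, outputs = context, []
    for _ in range(steps):
        s = np.tanh(U_y @ y_prev + U_s @ s)
        outputs.append(s)
    return outputs

source = [rng.normal(size=d_in) for _ in range(20)]   # a 20-step input sequence
context = encode(source)
states = decode(context, y_prev=np.zeros(d_in))
print(context.shape, len(states))                      # (16,) 3
```

However long the source sequence is, `context` keeps the same size, which is exactly the limitation that attention addresses in the next sections.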
## The main idea behind Attention

So, how can we give more importance to some parts of the input sequence than to others? Let's look at an example: say we want to predict the next word in a sentence, and its context is located a few words back.

:::info
“Although he is from France, having spent his childhood in England, he prefers to speak English.”
:::

In this sentence, if we want to predict the word “English”, the words “childhood” and “England” should be given more weight, while “France”, although it is also a country name, should largely be ignored. So, is there a way to keep all the relevant information in the input sentence intact while creating the context vector?

Bahdanau et al. (2014) [2] presented a simple but elegant idea: not only can all input words be taken into account in the context vector, but a relative importance can also be given to each of them. Thus, each time the proposed model generates an output word, it searches for a set of positions in the encoder's hidden states where the most relevant information is available. This is the idea behind Attention.

## Understanding Attention

As discussed in the previous section, the encoder compresses the sequential input into a context vector. We can introduce an attention mechanism to create a shortcut between the entire input and the context vector, where the weights of the shortcut connections can change for each output. Thanks to this connection, the context vector has access to the entire input, and the problem of forgetting long sequences is solved to some extent.

Using the attention mechanism in a network, a context vector has access to the following information:

- Encoder hidden states
- Decoder hidden states
- Alignment between source and target

![](https://i.imgur.com/agXz52l.png)

The image above is a representation of the Seq2Seq model with an additive attention mechanism embedded in it. Let's introduce the attention mechanism mathematically to make it clearer.

Say the network receives an input $x$ of length $n$ and produces an output $y$ of length $m$:

$$x = (x_1, x_2, \dots, x_n) \\ y = (y_1, y_2, \dots, y_m)$$

The encoder is a bidirectional RNN with a forward hidden state and a backward hidden state; the encoder state $h_i$ is the concatenation of the two. In the decoder network, the hidden state is

$$s_t = f(s_{t-1}, y_{t-1}, c_t).$$

For the output word at position $t$, the context vector $c_t$ is a weighted sum of the hidden states of the input sequence:

$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i \quad ; \text{ context vector for output } y_t$$

$$\alpha_{t,i} = \frac{e^{\,score(s_{t-1},\, h_i)}}{\sum_{i'=1}^{n} e^{\,score(s_{t-1},\, h_{i'})}} \quad ; \text{ softmax of some predefined score.}$$

Here the sum of the hidden states is weighted by the alignment scores: $\alpha_{t,i}$ are the weights that define how much of each source hidden state should be taken into consideration for each output. There are several types of alignment scoring functions, differing mainly in how the decoder state and the encoder states are combined. Below are some of the most popular ones:

| Name | Score function |
| -------- | -------- |
| Content-based attention | $score(s_t, h_i) = \cos[s_t, h_i]$ |
| Additive | $score(s_t, h_i) = v_a^T \tanh(W_a[s_t; h_i])$ |
| General | $score(s_t, h_i) = s_t^T W_a h_i$ |
| Dot-product | $score(s_t, h_i) = s_t^T h_i$ |
| Scaled dot-product | $score(s_t, h_i) = \frac{s_t^T h_i}{\sqrt{n}}$ |

(where $W_a$ and $v_a$ are trainable parameters of the attention layer, and $n$ in the scaled dot-product score is the dimension of the hidden states.)

In addition, we can classify attention mechanisms in the following ways:

- Self-Attention
- Soft-Attention
- Hard-Attention

To read more about these different mechanisms and dive deeper into the world of attention, I recommend reading this [blog](https://towardsdatascience.com/attention-in-neural-networks-e66920838742). A small numerical sketch of the context-vector computation is given below.
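The sketch below works through the formulas above: given encoder hidden states $h_1, \dots, h_n$ and the previous decoder state $s_{t-1}$, it computes alignment scores, turns them into weights $\alpha_{t,i}$ with a softmax, and forms the context vector $c_t$. The dimensions, the random states, and the choice of showing the dot-product and additive scores are illustrative assumptions, not a reimplementation of any specific model.

```python
# Computing attention weights and the context vector for one decoder step.
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 4                              # source length, hidden-state size
H = rng.normal(size=(n, d))              # encoder hidden states h_1..h_n
s_prev = rng.normal(size=d)              # previous decoder state s_{t-1}

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Dot-product score: score(s, h_i) = s^T h_i
scores_dot = H @ s_prev

# Additive (Bahdanau-style) score: score(s, h_i) = v_a^T tanh(W_a [s; h_i])
W_a = rng.normal(scale=0.1, size=(d, 2 * d))
v_a = rng.normal(scale=0.1, size=d)
scores_add = np.array(
    [v_a @ np.tanh(W_a @ np.concatenate([s_prev, h])) for h in H]
)

for name, scores in [("dot-product", scores_dot), ("additive", scores_add)]:
    alpha = softmax(scores)              # alpha_{t,i}, sums to 1 over the source
    c_t = alpha @ H                      # c_t = sum_i alpha_{t,i} h_i
    print(name, "weights:", np.round(alpha, 3), "context:", np.round(c_t, 3))
```

In a full model, $c_t$ would then feed into the decoder update $s_t = f(s_{t-1}, y_{t-1}, c_t)$, and the parameters $W_a$ and $v_a$ would be learned jointly with the rest of the network rather than drawn at random.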
## Final Words

Attention is a general mechanism that introduces, from another point of view, the notion of memory: the memory is stored in the attention weights over time and tells us where to look.

We have approached this concept from a general and rather theoretical point of view, going through different kinds of attention and getting an overall idea of the usefulness of this tool, which has been a breakthrough in the field of deep learning.

## References

[1] Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. IEEE Press, San Francisco.

[2] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.