# Attention
Attention is one of the most prominent ideas in the Deep Learning community. By now, you have surely heard about it. But how does Attention work? What's the main idea behind it? What is it used for? In this post, we give a general overview of how Attention works and the intuition behind it.
## Seq2Seq models
Although nowadays Attention is used for image captioning and many other problems, it was originally designed for machine translation in the context of seq2seq models.
Seq2seq models attempt to transform an input sequence, often referred to as the source, into a new sequence, the target. Both sequences can be of variable length.
These models usually have an encoder-decoder architecture, which is composed of:
- An **encoder**: takes the source sequence and compresses the information into a fixed-length vector representation, referred to as the context vector or embedding. Its output is expected to be a good summary of the meaning of the whole input sequence.
- A **decoder**: takes the context vector as input and produces the target sequence.
The main inconvenience of these models is that they try to summarize the entire source sequence into a single fixed-length context vector. As you can imagine, this limits the model's ability to remember long input sentences.
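To make that bottleneck concrete, here is a minimal sketch of such an encoder-decoder in PyTorch. It assumes GRU layers and teacher forcing on the target tokens, and the class names are purely illustrative; the point is that the decoder only ever sees the encoder's final hidden state.

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                   # src: (batch, src_len)
        _, last_hidden = self.rnn(self.embed(src))
        return last_hidden                    # (1, batch, hidden_dim): the fixed-length context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, context):          # tgt: (batch, tgt_len)
        # `context` is all the decoder knows about the source sequence
        outputs, _ = self.rnn(self.embed(tgt), context)
        return self.out(outputs)              # (batch, tgt_len, vocab_size)
```

However long the source sentence is, everything the decoder can use has to fit inside that single `context` tensor.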
## The intuition behind Attention
Attention is somewhat motivated by how we pay attention to different areas of an image, or how we correlate words in a sentence, when we try to give them meaning.

Take for example the image above. We can tell it is a dog by focusing on the central part of the image, where the dog's face is found, and recognizing the pointy ears and its snout. However, if that part of the image gets covered, the picture loses its real meaning and it becomes hard to tell that the image is of a dog.
The same intuition can be applied when working with sentences.

When contextualizing sentences, some words are more closely related to others and add more meaning to them. For example, 'red' does not add much information to 'found', but it is important that 'sock' is related to 'found' and 'red' is related to 'sock'.
## How does Attention work?
Let's denote the input sequence $x=[x_1\,, x_2\,, \dots \,,x_n]$, of length $n$, and the target sequence $y=[y_1\,, y_2\,, \dots \,,y_m]$, of length $m$.
As in seq2seq models, Attention also uses an encoder-decoder architecture. The main difference is that the information the encoder passes to the decoder is no longer a single fixed-length context vector. Instead, the encoder passes all of its hidden states, so the decoder receives much more information.
The decoder has a hidden state $s_t = f(s_{t-1}, y_{t-1}, c_t)$ for the word at position $t$, $t = 1\,, \dots \,, m$, where $c_t$ is the context vector for the output $y_t$. It is computed as a sum of the hidden states of the input sequence, weighted by a score:
$c_t = \sum_{i=1}^n \alpha_{t,i}h_i$
$\alpha_{t,i} = \frac{\exp(\text{score}(s_{t-1}, h_i))}{\sum_{j=1}^n \exp(\text{score}(s_{t-1}, h_j))}$
From these "weighted" encoder hidden states, the decoder builds a new representation, the context vector $c_t$, which it uses to produce the output.
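As a sketch of this computation in PyTorch, treating the score function as a black box for the moment (function and variable names here are illustrative, not taken from the papers):

```python
import torch
import torch.nn.functional as F

def attention_context(s_prev, encoder_states, score_fn):
    # s_prev:         (batch, hidden)          -- previous decoder state s_{t-1}
    # encoder_states: (batch, src_len, hidden) -- encoder hidden states h_1 ... h_n
    # score_fn:       callable returning raw scores of shape (batch, src_len)
    scores = score_fn(s_prev, encoder_states)             # score(s_{t-1}, h_i)
    alpha = F.softmax(scores, dim=-1)                     # alpha_{t,i}, sums to 1 over i
    context = torch.bmm(alpha.unsqueeze(1), encoder_states).squeeze(1)  # c_t = sum_i alpha_{t,i} h_i
    return context, alpha
```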
We are just missing one piece of this explanation: the concept of the score. The score indicates how well the input at position $i$ and the output at position $t$ match. Different score formulas can be used, and they may lead to different results. Bahdanau and Luong each proposed different score functions.
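As an illustration, here are sketches of two such score functions: a dot-product score in the spirit of Luong et al. and an additive score in the spirit of Bahdanau et al. The dimensions and parameter names are assumptions made for the example; either one could be passed as the `score_fn` argument in the sketch above.

```python
import torch
import torch.nn as nn

def dot_score(s_prev, encoder_states):
    # score(s_{t-1}, h_i) = s_{t-1}^T h_i  (assumes matching dimensions)
    return torch.bmm(encoder_states, s_prev.unsqueeze(-1)).squeeze(-1)   # (batch, src_len)

class AdditiveScore(nn.Module):
    # score(s_{t-1}, h_i) = v^T tanh(W_s s_{t-1} + W_h h_i)
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, encoder_states):
        # broadcast the decoder state across every source position
        energy = torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(encoder_states))
        return self.v(energy).squeeze(-1)                                 # (batch, src_len)
```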
### References
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." ICLR 2015.
- Thang Luong, Hieu Pham, and Christopher D. Manning. "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015.