# Attention in sequence-to-sequence recurrent models
In [sequence-to-sequence models](https://towardsdatascience.com/sequence-to-sequence-using-encoder-decoder-15e579c10a94) we want to convert a sequence from one domain (for example, English) into a sequence in another domain (German).
This is done with the two principal elements of the model: the encoder and the decoder. On the one hand, the encoder compresses the input sequence into a context vector. On the other hand, the decoder takes that context and produces the output sequence (in this case, the German translation).
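As a rough sketch in equations (the notation here is assumed, not taken from the original figures): the encoder RNN $f$ reads the input tokens $x_1, \dots, x_T$ into hidden states, the context is simply the final hidden state, and the decoder RNN $g$ unrolls the output from it:

$$h_t = f(x_t, h_{t-1}), \qquad c = h_T, \qquad s_t = g(y_{t-1}, s_{t-1}, c)$$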

*Figure: the sequence-to-sequence encoder-decoder architecture ([Sequence-to-Sequence](https://towardsdatascience.com/sequence-to-sequence-using-encoder-decoder-15e579c10a94))*
But that single context vector in between can be a problem for long sequences: it becomes a bottleneck and prevents the model from training properly.
To solve this long-sequence problem, [Bahdanau](https://arxiv.org/abs/1409.0473) and [Luong](https://arxiv.org/abs/1508.04025) proposed a technique called attention.
## How attention works
Attention allows the model, at each step of the output sequence, to focus on the most relevant parts of the input sequence. To make this possible, the encoder no longer passes only its last hidden state as context; instead, it passes every hidden state it produced.
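A minimal sketch of that change, using a toy NumPy RNN (all names and shapes here are illustrative assumptions, not from the original post):

```python
import numpy as np

def encode(inputs, Wx, Wh, b):
    """Toy RNN encoder that keeps EVERY hidden state for the decoder.
    inputs: (T, input_dim); Wx: (input_dim, hidden_dim); Wh: (hidden_dim, hidden_dim)."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(x @ Wx + h @ Wh + b)  # standard RNN update
        states.append(h)
    return np.stack(states)  # (T, hidden_dim): all hidden states, not just the last
```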

The encoder works in the same way as before, but the decoder implements a different process. Now, at each decoder step, an attention score is assigned to each encoder hidden state, computed from all the encoder hidden states and the last decoder hidden state.
There are different methods to calculate that score (Luong's dot product, Luong's general/multiplicative score, Bahdanau's additive score). But all of them share the idea of measuring the similarity between the decoder hidden state and the encoder hidden states, so that the encoder states most similar to the decoder one get high scores, and the least similar get low scores.
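As a sketch, these score functions might look like this in NumPy (here `s` is the last decoder hidden state with shape `(hidden_dim,)`, `H` stacks the encoder hidden states with shape `(T, hidden_dim)`, and the matrices `Wa`, `W1`, `W2` and vector `v` are assumed learned parameters; the names are illustrative):

```python
import numpy as np

def score_dot(s, H):
    # Luong's dot score: raw dot product between the decoder state and each encoder state.
    return H @ s                         # (T,)

def score_general(s, H, Wa):
    # Luong's general (multiplicative) score: dot product through a learned matrix.
    return H @ (Wa @ s)                  # (T,)

def score_additive(s, H, W1, W2, v):
    # Bahdanau's additive score: a small feedforward net over both states.
    return np.tanh(H @ W1 + W2 @ s) @ v  # (T,)
```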

Once the scores are calculated, they are passed through a softmax, producing the attention weights:
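In symbols (with the notation assumed above), the weight given to the $i$-th encoder hidden state at decoder step $t$ is:

$$\alpha_{t,i} = \frac{\exp\big(\mathrm{score}(s_{t-1}, h_i)\big)}{\sum_{j=1}^{T} \exp\big(\mathrm{score}(s_{t-1}, h_j)\big)}$$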

Each encoder hidden state is then multiplied by its attention weight (amplifying hidden states with high weights and drowning out hidden states with low weights), and the context vector is built by summing these weighted states:
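That is, the context vector at step $t$ is the attention-weighted sum of the encoder hidden states:

$$c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i$$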

That context vector is concatenated with the last decoder hidden state, and this new vector is given to the decoder unit in order to produce the next hidden state. That output is also fed to a feedforward neural network, which predicts the word for this step.
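Putting the whole step together, a toy decoder step with dot-product attention might look like this. This is a sketch under assumed shapes and parameter names, following the wiring described above; real implementations differ in the details (for example, Bahdanau feeds the context into the RNN cell, while Luong combines it after):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(s_prev, H, W_cell, b_cell, W_out, b_out):
    """One decoder step with dot-product attention.
    s_prev: last decoder hidden state (hidden_dim,)
    H: all encoder hidden states (T, hidden_dim)"""
    weights = softmax(H @ s_prev)                 # score + softmax -> attention weights
    context = weights @ H                         # weighted sum -> context vector
    combined = np.concatenate([context, s_prev])  # concat context with decoder state
    s_new = np.tanh(W_cell @ combined + b_cell)   # decoder unit -> next hidden state
    word_probs = softmax(W_out @ s_new + b_out)   # feedforward net -> word distribution
    return s_new, word_probs
```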