## Attention
**Attention** is a deep learning mechanism that allows the network to focus on different parts of the input. Nowadays, it is often applied in **sequence to sequence** (seq2seq) models. This technique is especially useful when processing long sequences, for example, a paragraph of text to be translated. The network learns to shift its attention around and capture dependencies between the input and output sequences.
To understand in depth how attention works, it is worth recalling the architecture of the vanilla encoder-decoder RNN.

Firstly, the **encoding** RNN processes the input sequence $x_1, x_2, ..., x_n$ and obtains a series of hidden states $h_1, ..., h_n$, which are connected sequentially. Once the entire sequence is processed, the encoding network generates a vector summarizing all its relevant information: the context vector $c$. In the first step of the **decoding**, the network receives as input a start token $y_0$ and the context vector, which is shared among all decoding hidden states. Finally, the decoder produces the output sequence $y_1, y_2, ..., y_p$.
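The encoding pass above can be sketched in a few lines of numpy. This is a minimal illustration, not a trainable implementation: the weight matrices `Wx` and `Wh` are random placeholders, and the context vector is simply the final hidden state $h_n$.

```python
import numpy as np

# Hypothetical dimensions and random weights, for illustration only.
np.random.seed(0)
input_dim, hidden_dim, seq_len = 4, 8, 5
Wx = np.random.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights
Wh = np.random.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights

def encode(xs):
    """Vanilla RNN encoder: return all hidden states and the context vector."""
    h = np.zeros(hidden_dim)
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)  # h_i depends on x_i and h_{i-1}
        states.append(h)
    # The context vector c summarizes the whole sequence as the last state h_n.
    return np.stack(states), states[-1]

xs = np.random.randn(seq_len, input_dim)
H, c = encode(xs)  # H has shape (seq_len, hidden_dim); c equals H[-1]
```

A decoder would then initialize its recurrence from `c` and the start token, which makes the bottleneck discussed next easy to see: everything the decoder knows about the input is squeezed into this single vector.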
The main problem with this approach is that the context vector represents a **bottleneck**: it must 1) summarize all the information learned during encoding, and 2) remain the same for all decoding hidden states. Attention allows the network to construct a different context vector at each decoding step, based on the encoding hidden states.
The new architecture, including attention, is the following:

The encoder remains the same, while the decoder adds the attention mechanism to compute a sequence of context vectors $c_1, c_2, ..., c_p$. For the purposes of visualization, only the first attention graph is shown. Firstly, we must define a **scoring** function, $f$, that takes as input one encoding hidden state and one decoding hidden state. There are several valid functions for this purpose, such as the ones proposed by Luong et al. and Bahdanau et al., and they may or may not include learnable weights. We denote the score of $h_i$ and $s_t$ as $e_{it}=f(h_i, s_t)$. Then, the scores are normalized with a **softmax** over the encoder positions to obtain the attention weights, $a_{it}=\mathrm{softmax}_i(e_{it})$, and the context vector is the weighted sum of the encoding hidden states, $c_t=\sum_i a_{it} h_i$. The network has thus predicted by itself how much weight to put on each of the hidden states of the encoder. Finally, at each timestep the decoder produces an output based on the current context vector, the previous output token and the previous hidden state.
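The scoring, softmax, and weighted-sum steps can be sketched as follows. This sketch uses the simple dot-product score $e_{it} = h_i \cdot s_t$ (one of the options in Luong et al., with no learnable weights); `H` and `s_t` are random placeholders standing in for the encoder states and the current decoder state.

```python
import numpy as np

np.random.seed(1)
hidden_dim, n = 8, 5
H = np.random.randn(n, hidden_dim)   # encoder hidden states h_1, ..., h_n
s_t = np.random.randn(hidden_dim)    # current decoder hidden state s_t

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = H @ s_t                     # e_it = f(h_i, s_t), dot-product score
alphas = softmax(scores)             # attention weights a_it, sum to 1
c_t = alphas @ H                     # context vector: sum_i a_it * h_i
```

Repeating this for each decoder state $s_1, ..., s_p$ yields the full sequence of context vectors $c_1, ..., c_p$, so each output step gets its own view of the input.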
In this new mechanism, the decoder learns to focus on different parts of the encoder. As a result, attention succeeds in capturing dependencies between long input and output sequences.