# Deep Learning. Assignment 3
The purpose of this blog is to explain the most relevant concepts needed to understand Deep Learning assignment three, focusing on the concept of attention. Moreover, we will evaluate the obtained results in order to draw some conclusions. Note that the latter part can also be found in the delivered notebook.
## General framework
A sequence-to-sequence recurrent model is a class of recurrent neural network that is typically used to solve complex language problems like machine translation, question answering, text summarization, etc.
The architecture of a sequence-to-sequence recurrent model is the following:
+ An **encoder** processes the input sequence and compiles the information it captures into a fixed-size vector called the context.
+ After processing the entire input sequence, the encoder sends the context over to the **decoder**, which begins producing the output sequence item by item.

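To make this framework concrete, below is a minimal sketch of such an encoder/decoder pair, assuming a GRU-based PyTorch implementation; the class and variable names are illustrative and do not correspond to the assignment's actual code.
```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                    # src: (batch, src_len)
        outputs, hidden = self.gru(self.embedding(src))
        # `hidden` is the fixed-size context vector handed to the decoder;
        # `outputs` (all hidden states) is what attention will exploit later.
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, hidden):     # produces one output item per call
        output, hidden = self.gru(self.embedding(prev_token), hidden)
        return self.out(output.squeeze(1)), hidden
```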
The problem with this structure is that the context vector has a fixed size. Therefore, the context vector turned out to be a bottleneck for the model, especially when dealing with long input sentences. A solution to work with long sentences was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced the concept of **Attention** to avoid the mentioned problem.
## Attention
Attention is a technique which tries to resemble the concept of cognitive attention, i.e., it tries to mimic the behaviour of our brain in the sense of selecting and discarding information. This technique allows our architecture to focus on the important parts of the input in order to solve the desired task. For example, when looking at an image, our brain first tries to get the most relevant information from textures, colours, salient parts, etc. Our eyes try to find these salient features of the image instead of analysing it in detail. This is attention!
Similarly, with sentences, our brain tries to focus on important words instead of the whole phrase. For example, when we hear or read the word "eating", we expect to encounter a food word very soon.

---
Let us see how this concept actually works. In the plain sequence-to-sequence recurrent model, only the final state of the encoder (the context vector) is passed to the decoder. If we implement attention instead, the encoder passes much more data to the decoder: all of its hidden states. Moreover, in order to focus on the parts of the input that are relevant to the current decoding time step, an attention decoder performs the following extra steps (a code sketch of these steps follows the list):
+ Give each hidden state (provided by the encoder) a **score**.
+ Multiply each hidden state by its softmaxed score.
\begin{align*}
\alpha_{ts} = \frac{e^{score(h_t,\bar{h}_s)}}{\sum_{s'=1}^S e^{score(h_t,\bar{h}_{s'})}}
\end{align*}
+ Build a new representation (the context vector) from these “weighted” encoder hidden states.
\begin{align*}
c_t=\sum_{s=1}^{S}\alpha_{ts}\bar{h}_s
\end{align*}
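A minimal sketch of these three steps in PyTorch, assuming a simple dot-product score (the one used by the ready-made Luong Dot Attention); the tensor shapes and names are illustrative:
```python
import torch
import torch.nn.functional as F

def attention_context(decoder_hidden, encoder_outputs):
    # decoder_hidden:  (batch, hidden)           -> current decoder state h_t
    # encoder_outputs: (batch, src_len, hidden)  -> all encoder states \bar{h}_s
    # 1. Score every encoder hidden state against the decoder state.
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
    # 2. Softmax the scores to obtain the attention weights alpha_ts.
    alphas = F.softmax(scores, dim=1)
    # 3. Context vector c_t = weighted sum of the encoder hidden states.
    context = torch.bmm(alphas.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, alphas
```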
The question now is: how do we calculate that score?
## Score
The answer to the last question was given in [Bahdanau et al., 2014](https://arxiv.org/abs/1409.0473) and [Luong et al., 2015](https://arxiv.org/abs/1508.04025), two papers that propose and justify possible formulas for the score. The formulas are the following:
\begin{align*}
\begin{cases}
score(h_t,\bar{h}_s)=h_t^T W_a\bar{h}_s\\
score(h_t,\bar{h}_s)=v_a^T\tanh(W_a[h_t;\bar{h}_s])
\end{cases}
\end{align*}
The first one corresponds to the Luong multiplicative (general) score; notice that it only uses a single weight matrix $W_a$. The second one corresponds to the Bahdanau (additive) score; notice that it applies a weight matrix to the concatenation of the two hidden states and then scores the result with the learned vector $v_a$.
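The two formulas translate almost line by line into code. The sketch below assumes PyTorch, with `W_a` and `v_a` as learnable linear layers without bias; shapes and names are again illustrative, not the assignment's exact implementation.
```python
import torch
import torch.nn as nn

hidden_size = 256
W_a_general = nn.Linear(hidden_size, hidden_size, bias=False)      # Luong general
W_a_concat  = nn.Linear(2 * hidden_size, hidden_size, bias=False)  # Bahdanau
v_a         = nn.Linear(hidden_size, 1, bias=False)

def luong_general_score(h_t, h_s):
    # score(h_t, h_s) = h_t^T W_a h_s,  h_t: (batch, hidden), h_s: (batch, src_len, hidden)
    return torch.bmm(W_a_general(h_s), h_t.unsqueeze(2)).squeeze(2)        # (batch, src_len)

def bahdanau_score(h_t, h_s):
    # score(h_t, h_s) = v_a^T tanh(W_a [h_t ; h_s])
    h_t = h_t.unsqueeze(1).expand(-1, h_s.size(1), -1)                     # (batch, src_len, hidden)
    return v_a(torch.tanh(W_a_concat(torch.cat((h_t, h_s), dim=2)))).squeeze(2)
```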
## Evaluation of the obtained results
**This part is also in the notebook!!**

---
The implementation of the Bahdanau Attention and the Luong General Attention was quite simple, since we only had to modify the ready-made implementation of the Luong Dot Attention. The key modifications are the following (see the sketch after this list):
+ Initialization of some extra weights: depending on the case, a single weight matrix, or a weight matrix together with the vector $v_a$.
+ Changing the attention score to the appropriate one in each case (see the concrete formulas above).
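For illustration, the first modification could look roughly as follows (hypothetical module names; only the `__init__` is shown, since the score change is the one sketched in the previous section):
```python
import torch.nn as nn

class LuongGeneralAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # Luong general: a single weight matrix W_a
        self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)

class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # Bahdanau: a weight matrix applied to [h_t ; h_s] plus the vector v_a
        self.W_a = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        self.v_a = nn.Linear(hidden_size, 1, bias=False)
```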
Let us compare the methods. Notice that the given example (**Luong Dot Attention**) was not performing very well due to the small number of epochs and the patience used. Therefore, we increased the number of epochs to 200 and the patience to 20, and the performance went up to 98.730% accuracy in training and 97.765% in test; early stopping ended the training after 137 epochs. We continued with the **Bahdanau Attention**, which, with the same number of epochs and patience, reached 99.984% accuracy in training and 99.955% in test. In this case early stopping was never triggered and all 200 epochs were run, but with a really good result. To finish, we implemented the **Luong General Attention**, which, with the same number of epochs and patience, did not work as well as the others: it reached 79.596% accuracy in training and 75.860% in test. As in the first case, early stopping ended the training earlier, after 197 epochs, but with a worse performance. It seems that this method needs more epochs and more patience to obtain a better result. We can say that this last method (**Luong General Attention**) is by far the worst.
To finish, let us compare and explain the attention weight visualizations. First of all, notice that the diagonal obtained is rotated due to the order of the axes. Observe that the first two attention weight matrices are really close to a diagonal one (the desired output), while the last one is not (its performance was not as good as the first two!!).