##### $\textbf{Attention please!}$

Attention is used in machine translation tasks to improve the performance of the encoder-decoder model. Before attention, RNN-based architectures used to work very well, especially with LSTM components, but their limitation was the length of the sequence. For very long sequences the encoded sequence representation (let's call it $\textit{z}$) is unable to compress the information from the initial parts of the sequence as well as from the last parts; colloquially, $\textit{z}$ 'forgets' what was at the beginning. Because of that, the model focuses more on the end of the sequence, which is usually not optimal for a sequence task. This is known as $\textit{the bottleneck problem}$.

Now we get to the point where attention comes to the rescue! The main idea of the attention mechanism is that the vector $\textit{z}$ (also known as the context vector) should have access to all parts of the input sequence instead of just the last one. So we can observe that attention can be viewed as a form of memory.

Let's describe the mechanism of attention. For each hidden state $h_1, h_2, \dots, h_S$ we compute a scalar:
$$ e_{ij} = score(y_{i-1}, h_j) $$
where $y_{i-1}$ is the previous decoder state (with $i$ as the prediction step) and $score$ is a particular function. Next, we use a softmax function to compute the weights:
$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{S} \exp(e_{ik})} $$
and after that we can finally obtain our beloved vector $\textit{z}$:
$$ z_i = \sum_{s=1}^{S} \alpha_{is} h_s $$

There are many different score functions. The most well-known ones are:
1. Luong Dot Attention: $y_{i-1}^T h$
2. Luong Multiplicative Attention: $y_{i-1}^T W_a h$
3. Bahdanau Attention: $v_a^T \tanh(W_a [h; y_{i-1}])$

where $W_a$ is a matrix of learnable weights and $v_a$ is a learnable vector. A small code sketch of these computations follows below.
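
To make the three steps concrete, here is a minimal NumPy sketch: it computes the scores $e_{ij}$ for each score function above, turns them into weights $\alpha_{ij}$ with a softmax, and forms the context vector $z_i$ as the weighted sum of the encoder hidden states. The shapes, helper names (`dot_score`, `multiplicative_score`, `bahdanau_score`, `attention_context`) and the randomly initialized $W_a$, $v_a$ are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the score vector.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def dot_score(y_prev, H):
    # Luong dot attention: e_j = y_{i-1}^T h_j for every encoder state h_j.
    return H @ y_prev

def multiplicative_score(y_prev, H, W_a):
    # Luong multiplicative attention: e_j = y_{i-1}^T W_a h_j.
    return H @ W_a.T @ y_prev

def bahdanau_score(y_prev, H, W_a, v_a):
    # Bahdanau (additive) attention: e_j = v_a^T tanh(W_a [h_j ; y_{i-1}]).
    concat = np.concatenate([H, np.tile(y_prev, (H.shape[0], 1))], axis=1)  # (S, 2d)
    return np.tanh(concat @ W_a.T) @ v_a

def attention_context(y_prev, H, score_fn, **params):
    # e_{ij} -> softmax -> weighted sum of encoder states = context vector z_i.
    e = score_fn(y_prev, H, **params)   # (S,) scores
    alpha = softmax(e)                  # (S,) attention weights, sum to 1
    z = alpha @ H                       # (d,) context vector
    return z, alpha

# Toy example: S = 4 encoder states of dimension d = 3 (all values are random placeholders).
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))        # encoder hidden states h_1..h_S
y_prev = rng.normal(size=3)        # previous decoder state y_{i-1}
W_mul = rng.normal(size=(3, 3))    # learnable matrix for multiplicative attention
W_add = rng.normal(size=(3, 6))    # learnable matrix for additive attention (acts on [h ; y])
v_a = rng.normal(size=3)           # learnable vector for additive attention

z, alpha = attention_context(y_prev, H, dot_score)
print("dot-attention weights:", alpha, "context:", z)

z, alpha = attention_context(y_prev, H, multiplicative_score, W_a=W_mul)
print("multiplicative weights:", alpha)

z, alpha = attention_context(y_prev, H, bahdanau_score, W_a=W_add, v_a=v_a)
print("Bahdanau weights:", alpha)
```

In a real encoder-decoder model the weight matrices would of course be trained jointly with the rest of the network rather than sampled at random; the sketch only illustrates how the score, softmax, and weighted-sum steps fit together.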