# Notes on "[Neural Machine Translation By Jointly Learning To Align And Translate](https://arxiv.org/abs/1409.0473)"
###### tags: `nlp` `machine-translation` `supervised`
#### Authors
[Rishika Bhagwatkar](https://github.com/rishika2110)
[Khurshed Fitter](https://github.com/GlazeDonuts)
## Brief Outline
In this paper, they conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
## Introduction
* Most of the proposed neural machine translation models belong to a family of encoder–decoders. The only path of information transfer between the encoder and the decoder is a fixed-length vector, a.k.a. the encoding.
* The whole encoder–decoder system is jointly trained to maximize the probability of a correct translation given a source sentence.
* A potential issue with this approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus.
* In order to address this issue, they introduce an extension to the encoder–decoder model which learns to align and translate jointly.
* Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated.
* The model then predicts a target word based on the context vectors associated with these source positions and all the previously generated target words.
## Background
Neural machine translation approaches typically consist of two components: the first encodes a source sentence $\textbf{x}$ and the second decodes it into a target sentence $\textbf{y}$. Despite being a relatively new approach, neural machine translation has already shown promising results.
#### RNN Encoder-Decoder
* In the Encoder–Decoder framework, an encoder reads the input sentence, a sequence of vectors $\textbf{x} = (x_{1}, \dots, x_{T_x})$, into a vector $c$. For an RNN,
$$
\begin{aligned}
h_t = f(x_t, h_{t-1})
\end{aligned}
$$
and,
$$
\begin{aligned}
c = q(\{h_1, \dots, h_{T_x}\})
\end{aligned}
$$
where, $h_t$ is the hidden state at time $t$, and $c$ is the vector generated from the sequence of hidden states. $f$ and $q$ are some nonlinear functions.
* The decoder defines a probability over the translation $\textbf{y}$ by decomposing the joint probability into the ordered conditionals:
$$
\begin{aligned}
p(\textbf{y}) = \prod_{t=1}^{T} p(y_t | \{y_1,..., y_{t-1}\}, c)
\end{aligned}
$$
where $\textbf{y} = (y_1, \dots, y_{T_y})$.
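As a concrete (toy) illustration of this framework, here is a minimal NumPy sketch in which $f$ is a plain tanh recurrence, $q$ simply returns the last hidden state, and the decoder's per-step distribution is left as an abstract `step_fn`; all names and shapes are illustrative, not the paper's code.
```python
import numpy as np

def encode(x_seq, W_xh, W_hh, b_h):
    """h_t = f(x_t, h_{t-1}); here f is a single tanh layer.
    Returns the full sequence of hidden states (h_1, ..., h_Tx)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:                          # x_t: embedding of source word t
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return states

def context(states):
    """c = q({h_1, ..., h_Tx}); one common choice for q is the last hidden state."""
    return states[-1]

def decoder_log_prob(y_seq, c, init_state, step_fn):
    """log p(y) = sum_t log p(y_t | y_<t, c).
    step_fn(state, prev_token, c) -> (probs over target vocab, next state);
    the recurrent state is what lets the model condition on all of y_<t."""
    log_p, state, prev = 0.0, init_state, None
    for y_t in y_seq:
        probs, state = step_fn(state, prev, c)
        log_p += np.log(probs[y_t])
        prev = y_t
    return log_p
```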
## Learning to Align and Translate
The new architecture consists of a bidirectional RNN as the encoder and a decoder that emulates searching through a source sentence while decoding a translation.
### Encoder
* The annotation of each word is learnt to summarize not only the preceding words, which is all a vanilla (unidirectional) RNN captures, but also the following words.
* Therefore, they used a bidirectional RNN (BiRNN).
* A BiRNN consists of two RNNs:
* Forward RNN$(\overrightarrow{f})$: It reads the input sequence in forward direction $(x_1,..., x_{T_x})$ and calculates a sequence of forward hidden states $(\overrightarrow{h_1},...,\overrightarrow{h_{T_x}})$
* Backward RNN$(\overleftarrow{f})$: It reads the input sequence in reverse order $(x_{T_x},..., x_1)$ and calculates a sequence of backward hidden states $(\overleftarrow{h_{T_x}},..., \overleftarrow{h_1})$
* The annotation of each word is obtained by concatenating its forward and backward hidden states along the feature (hidden-state) dimension, i.e., for a word $x_i$ its annotation is $h_i = [\overrightarrow{h_i};\overleftarrow{h_i}]$.
* In this way, the annotation $h_j$ contains summaries of both the preceding and the following words.
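A minimal sketch of how such annotations could be computed, assuming generic `forward_step` / `backward_step` cell functions (the paper uses gated hidden units; any recurrence works for illustration):
```python
import numpy as np

def birnn_annotations(x_seq, forward_step, backward_step, hidden_size):
    """Compute annotations h_j = [h_fwd_j ; h_bwd_j] for every source word.
    forward_step / backward_step: functions (x_t, h_prev) -> h_t."""
    T = len(x_seq)

    h_fwd, fwd = np.zeros(hidden_size), []
    for t in range(T):                         # read x_1 ... x_Tx
        h_fwd = forward_step(x_seq[t], h_fwd)
        fwd.append(h_fwd)

    h_bwd, bwd = np.zeros(hidden_size), [None] * T
    for t in reversed(range(T)):               # read x_Tx ... x_1
        h_bwd = backward_step(x_seq[t], h_bwd)
        bwd[t] = h_bwd

    # Concatenate along the feature (hidden) dimension, not the time axis.
    return [np.concatenate([fwd[t], bwd[t]]) for t in range(T)]
```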
### Decoder
They define each conditional probability as:
$$
\begin{aligned}
p(y_i|y_1,...., y_{i-1}, \textbf{x} ) = g(y_{i-1}, s_i, c_i)
\end{aligned}
$$
where, $s_i$ is the RNN hidden state for time $i$ and is computed by
$$
\begin{aligned}
s_i = f(s_{i-1}, y_{i-1}, c_i)
\end{aligned}
$$
The context vector $c_i$ depends on the sequence of *annotations* $(h_1, \dots, h_{T_x})$ to which an encoder maps the input sequence. It is computed as a weighted sum of the *annotations*:
$$
\begin{aligned}
c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j
\end{aligned}
$$
The weight $\alpha_{ij}$ of each *annotation* $h_j$ is computed by
$$
\begin{aligned}
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}
\end{aligned}
$$
where
$$
\begin{aligned}
e_{ij} = a(s_{i-1}, h_j)
\end{aligned}
$$
is an alignment model which scores how well the inputs around position $j$ and the output at position $i$ match. They parametrize the alignment model $a$ as a feedforward neural network which is jointly trained with all the other components of the proposed system.
The probability $\alpha_{ij}$, or its associated energy $e_{ij}$, reflects the importance of the annotation $h_j$ with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ and generating $y_i$.
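Putting the above together, one attention step can be sketched in a few lines of NumPy. The single-hidden-layer form used below, $e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$, follows the feedforward alignment model given in the paper's appendix; parameter names and shapes here are illustrative.
```python
import numpy as np

def attention_context(s_prev, annotations, W_a, U_a, v_a):
    """One decoding step of the (soft-)attention described above:
        e_ij  = v_a^T tanh(W_a s_{i-1} + U_a h_j)
        alpha = softmax_j(e_ij)
        c_i   = sum_j alpha_ij * h_j
    """
    H = np.stack(annotations)                    # (T_x, 2n) annotation matrix
    e = np.tanh(H @ U_a.T + W_a @ s_prev) @ v_a  # alignment scores, shape (T_x,)
    e = e - e.max()                              # for numerical stability
    alphas = np.exp(e) / np.exp(e).sum()         # attention weights alpha_ij
    c_i = alphas @ H                             # expected (weighted) annotation
    return c_i, alphas
```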
## Experiment Details
### Dataset
* The dataset is the bilingual, parallel corpus of the WMT '14 English-to-French translation task.
* They reduced the size of the corpus from $850\text{M}$ words to $348\text{M}$ words.
* They kept a shortlist of the $30{,}000$ most frequently occurring words in each language and mapped every other word to an $[\text{UNK}]$ token (see the sketch after this list).
* No other pre-processing, such as lowercasing or stemming, was applied to the data.
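A hypothetical helper illustrating the shortlist-plus-$[\text{UNK}]$ mapping described above (not the authors' actual preprocessing scripts):
```python
from collections import Counter

def build_shortlist(tokenized_sentences, shortlist_size=30000):
    """Keep the `shortlist_size` most frequent tokens of a tokenized corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(shortlist_size)}

def map_to_shortlist(sentence_tokens, shortlist, unk="[UNK]"):
    """Replace every out-of-shortlist token with the [UNK] token."""
    return [tok if tok in shortlist else unk for tok in sentence_tokens]
```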
### Models
* They trained two types of models:
* RNNencdec: Simple RNN Encoder-Decoder
* RNNsearch: The proposed model
* These models were trained twice: first with sentences of length up to 30 words (RNNencdec-30, RNNsearch-30) and then with sentences of length up to 50 words (RNNencdec-50, RNNsearch-50).
* Both models have 1000 hidden units.
* They trained the models with minibatch stochastic gradient descent (SGD) together with Adadelta.
* Beam search was used to find the translation that approximately maximizes the conditional probability (a generic sketch follows this list).
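A generic beam-search sketch, assuming the decoder is exposed as a stepwise `step_fn(state, prev_token)`; in the actual model each step would also consume the attention context $c_i$, which is folded into `state` here. All names are illustrative.
```python
import math

def beam_search(step_fn, init_state, bos, eos, beam_size=5, max_len=50):
    """Generic beam search over a stepwise decoder.
    step_fn(state, prev_token) -> (dict token -> prob, next_state).
    Keeps the `beam_size` best partial hypotheses by total log-probability."""
    beams = [(0.0, [bos], init_state)]            # (log-prob, tokens, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens, state in beams:
            probs, next_state = step_fn(state, tokens[-1])
            for tok, p in probs.items():
                cand = (score + math.log(p), tokens + [tok], next_state)
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_size]
    finished.extend(beams)                        # fall back to unfinished beams
    return max(finished, key=lambda b: b[0])[1]   # best-scoring token sequence
```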
## Results
### Quantitative Results
The BLEU scores of both models, computed on all sentences and on sentences without any unknown words in themselves or in their reference translations, are given below:
$$\begin{array}{c|c|c}
\text { Model } & \text { All } & \text { No UNK } \\
\hline \hline \text { RNNencdec-30 } & 13.93 & 24.19 \\
\text { RNNsearch-30 } & 21.50 & 31.44 \\
\hline \text { RNNencdec-50 } & 17.82 & 26.71 \\
\text { RNNsearch-50 } & 26.75 & 34.16 \\
\hline \text { RNNsearch-50 }^{\star} & 28.45 & 36.15 \\
\hline \text { Moses } & 33.30 & 35.63
\end{array}$$
* Moses: The conventional phrase-based Statistical Machine Translation system.
* $\text { RNNsearch-50 }^{\star}$ was trained until the performance on the development set stopped improving.
It is clearly evident that the proposed RNNsearch outperformed RNNencdec.
### Qualitative Results
* Visualising the *annotation* weights $\alpha_{ij}$ gives us a way to inspect the (soft-)alignment between the words in a generated translation and those in the source sentence.
* Each row of the matrix in the plot shows which words of the source sentence the model paid attention to while generating the corresponding target word.
* Each pixel shows the weight $\alpha_{ij}$ of the annotation of the $j$-th source word for the $i$-th target word (a plotting sketch follows this list).
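A hypothetical matplotlib helper for producing such alignment heatmaps from a matrix of $\alpha_{ij}$ weights:
```python
import matplotlib.pyplot as plt
import numpy as np

def plot_alignment(alphas, source_words, target_words):
    """Plot the alpha_ij matrix as a grayscale heatmap: rows are generated
    target words, columns are source words."""
    alphas = np.asarray(alphas)          # shape (len(target_words), len(source_words))
    fig, ax = plt.subplots()
    ax.imshow(alphas, cmap="gray", aspect="auto")
    ax.set_xticks(range(len(source_words)))
    ax.set_xticklabels(source_words, rotation=90)
    ax.set_yticks(range(len(target_words)))
    ax.set_yticklabels(target_words)
    ax.set_xlabel("source sentence")
    ax.set_ylabel("generated translation")
    plt.tight_layout()
    plt.show()
```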
A few of the results obtained were:
![](https://i.imgur.com/yS5SFaT.png)
Here, we can see that the model correctly translates the phrase [European Economic Area] into [zone économique européenne]. The model was also able to correctly align [zone] with [Area], jumping over the two words ([European] and [Economic]), and then looked one word back at a time to complete the whole phrase [zone économique européenne].
![](https://i.imgur.com/VZxGrQ3.png)
Here, the strength of soft-alignment, as opposed to hard-alignment, is evident. Consider the source phrase [the man], which was translated into [l’ homme]. Any hard alignment would map [the] to [l’] and [man] to [homme]. This is not helpful for translation, as one must consider the word following [the] to determine whether it should be translated into [le], [la], [les] or [l’].
## Conclusion
* The basic encoder–decoder was modified by letting a model (soft-)search for a set of input words, or their annotations computed by an encoder, when generating each target word.
* The proposed model had a major positive impact on the ability of the neural machine translation system to yield good results on longer sentences.
* It outperforms the conventional encoder–decoder model (RNNencdec) significantly, regardless of sentence length, and is much more robust to the length of a source sentence.