# Notes on "[Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)"

###### tags: `nlp` `machine-translation` `supervised`

#### Authors

[Rishika Bhagwatkar](https://github.com/rishika2110)
[Khurshed Fitter](https://github.com/GlazeDonuts)

## Brief Outline

* They use a multilayered Long Short-Term Memory (LSTM) network to map the input sequence to a vector of fixed dimensionality, and then another deep LSTM to decode the target sequence from that vector. Below is a table summarising the approach:

| Parameter                      | Value                    |
| ------------------------------ |:------------------------:|
| Dataset                        | WMT'14 English to French |
| Number of Sentences            | 12M                      |
| Number of French Words         | 348M                     |
| Number of English Words        | 304M                     |
| Recurrent Unit                 | LSTM                     |
| Number of Layers               | 4                        |
| Number of Cells in a Layer     | 1000                     |
| Word Embedding Dimension       | 1000                     |
| Source Vocab Size              | 160k                     |
| Target Vocab Size              | 80k                      |
| Sentence Representation Size   | 8000                     |
| BLEU Score                     | 34.81                    |

## Introduction

* Despite their flexibility and power, DNNs (Deep Neural Networks) can only be applied to problems whose inputs and targets can be sensibly encoded as vectors of fixed dimensionality. This is a significant limitation, since many important problems are best expressed as sequences whose lengths are not known a priori.
* For example, speech recognition and machine translation are sequential problems.
* Recurrent neural networks (here, LSTMs) are used because they can map a variable-length sequence to a fixed-dimensional vector representation.
* A multilayer LSTM is used for both the encoder and the decoder. The decoder is given the encoder's final state as its initial context, so it is conditioned on the input sequence.
* LSTMs are chosen for their ability to learn from data with long-range temporal dependencies.
* The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work. Doing so introduces many short-term dependencies between the source and the target sentence, which makes the optimization problem easier.

## Model

* The simplest strategy for general sequence learning is to map the input sequence to a fixed-size vector using one RNN, and then to map that vector to the target sequence with another RNN. While an RNN can in principle carry all the relevant information, it is difficult to train because of the resulting long-term dependencies. The Long Short-Term Memory (LSTM), however, is known to learn problems with long-range temporal dependencies, so it should perform better than a vanilla RNN.
* Used two different LSTMs, one for the input sequence and another for the output sequence, because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously (see the sketch after this list).
* Found that deep LSTMs significantly outperformed shallow LSTMs, so chose an LSTM with four layers.
* Found it extremely valuable to reverse the order of the words of the input sentence. This makes it easy for SGD to "establish communication" between the input and the output.
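Below is a minimal sketch of this encoder-decoder setup, assuming PyTorch. The class, the variable names, and the default hyperparameters (taken from the table above) are illustrative, padding/masking is omitted, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn


class Seq2Seq(nn.Module):
    """Two separate deep LSTMs: one encodes the (reversed) source, one decodes the target."""

    def __init__(self, src_vocab=160_000, tgt_vocab=80_000,
                 emb_dim=1000, hidden_dim=1000, num_layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, tgt_vocab)  # logits over the target vocabulary

    def forward(self, src_ids, tgt_ids):
        # Reverse each source sentence, as the paper recommends
        # (a naive flip that ignores padding, for simplicity).
        src_rev = torch.flip(src_ids, dims=[1])
        # The decoder sees the source only through the encoder's final (h, c) state.
        _, state = self.encoder(self.src_emb(src_rev))
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.proj(dec_out)  # shape: (batch, tgt_len, tgt_vocab)
```

During training the decoder is fed the gold target prefix (teacher forcing), and the output logits go into a cross-entropy loss, which corresponds to maximizing the log probability of the correct translation described in the Experiments section below.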
## Experiments

* Trained the model by maximizing the log probability of a correct translation $T$ given the source sentence $S$. The training objective is
$$
\begin{aligned}
\frac{1}{|\mathcal{S}|}\sum_{(T,S) \in \mathcal{S}} \log p(T \mid S)
\end{aligned}
$$
where $\mathcal{S}$ is the training set.
* Produced translations by finding the most likely translation according to the LSTM:
$$
\begin{aligned}
\hat{T} = \arg \max_{T} p(T \mid S)
\end{aligned}
$$
* Searched for the most likely translation with a simple left-to-right beam search decoder, which maintains a small number $B$ of partial hypotheses, where a partial hypothesis is a prefix of some translation. A beam of size 2 already provides most of the benefits of beam search (a decoder sketch is given at the end of these notes).
* Feeding the reversed source sentence to the LSTM reduced the test perplexity from 5.8 to 4.7 and increased the test BLEU score of the decoded translations from 25.9 to 30.6.

### Training Details

* Initialized all of the LSTM's parameters from a uniform distribution between -0.08 and 0.08.
* Trained with stochastic gradient descent, starting with a learning rate of 0.7 and halving it every half epoch after the first 5 epochs.
* Trained the model for a total of 7.5 epochs.
* Used batches of 128 sequences.
* Although LSTMs tend not to suffer from the vanishing gradient problem, they can have exploding gradients, so a hard constraint was enforced on the norm of the gradient by scaling it whenever its norm exceeded a threshold. For each training batch, they computed
$$
\begin{aligned}
s = \|g\|_2
\end{aligned}
$$
where $g$ is the gradient divided by 128. If $s > 5$, they set $g = \frac{5g}{s}$ (a sketch of this rescaling is also given at the end of these notes).
* The dataset has many short sentences (20-30 words) and a few long ones (more than 100 words), so minibatches were formed from sentences of roughly the same length. This yielded a 2x speedup.

## Conclusion

* The work shows that a large deep LSTM with a limited vocabulary, which makes almost no assumptions about problem structure, can outperform a standard SMT-based system with an unlimited vocabulary on a large-scale MT task.
* A large part of the improvement came from reversing the words in the source sentences.
* The LSTM also performed well on long sentences, which was not expected given its limited memory.
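To complement the Experiments section above, here is a minimal sketch of a left-to-right beam search of the kind described there. The `log_probs_fn(prefix)` interface is an assumption made for illustration (it would wrap one decoding pass of the trained model and return next-word log probabilities); this is not the authors' decoder.

```python
from typing import Callable, List, Sequence, Tuple


def beam_search(log_probs_fn: Callable[[List[int]], Sequence[float]],
                bos_id: int, eos_id: int,
                beam_size: int = 2, max_len: int = 80) -> List[int]:
    """Left-to-right beam search keeping `beam_size` partial hypotheses (prefixes)."""
    # Each hypothesis is (log probability of the prefix, token ids so far).
    beams: List[Tuple[float, List[int]]] = [(0.0, [bos_id])]
    complete: List[Tuple[float, List[int]]] = []

    for _ in range(max_len):
        candidates: List[Tuple[float, List[int]]] = []
        for score, prefix in beams:
            # Extend every partial hypothesis with every possible next word.
            for tok, lp in enumerate(log_probs_fn(prefix)):
                candidates.append((score + lp, prefix + [tok]))
        # Keep only the B most likely extended hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, prefix in candidates[:beam_size]:
            if prefix[-1] == eos_id:
                complete.append((score, prefix))  # "<EOS>" reached: hypothesis is done
            else:
                beams.append((score, prefix))
        if not beams:  # every surviving hypothesis has ended
            break

    best = max(complete + beams, key=lambda c: c[0])
    return best[1]
```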
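Similarly, a minimal sketch (again assuming PyTorch, with an illustrative helper name) of the gradient-norm constraint described in Training Details; it would be called after `loss.backward()` and before the optimizer step.

```python
import torch


def rescale_gradients(parameters, threshold: float = 5.0) -> None:
    """Scale the gradient so that its global L2 norm does not exceed `threshold`."""
    grads = [p.grad for p in parameters if p.grad is not None]
    # s = ||g||_2, computed jointly over all parameters.
    s = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if s > threshold:
        # If s > 5, set g = 5g / s.
        for g in grads:
            g.mul_(threshold / s)
```

PyTorch's built-in `torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)` applies essentially the same rescaling.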