# Notes on "[Transfer Learning for Low-Resource Neural Machine Translation](https://arxiv.org/pdf/1604.02201.pdf)"
Author(s): Thanmay Jayakumar
## Brief Outline
#### Problem
* The encoder-decoder framework for neural machine translation (NMT) has been shown effective in large data scenarios
* **But it is much less effective for low-resource languages** - neural methods are data-hungry and learn poorly from low-count events.
#### Key Idea
* **Transfer Learning:** Transfer learning uses knowledge from a learned task to improve the performance on a related task, typically reducing the amount of required training data (Torrey and Shavlik, 2009; Pan and Yang, 2010).
* **This is the first study to apply transfer learning to neural machine translation.**
* First train a high-resource language pair (the parent model)
* Then transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain training.
## Model
### Approach
* We first train an NMT model on a dataset where there is a large amount of bilingual data (e.g., French-English), which we call the parent model.
* Next, we initialize an NMT model with the already-trained parent model.
* This new model is then trained on a dataset with very little bilingual data (e.g., Uzbek-English), which we call the child model.
* **Note:** _This means that the low-resource NMT model does not start with random weights, but with the weights already learned by the parent model (sketched in the code below)._
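A minimal sketch of this parent-to-child initialization, assuming a PyTorch-style encoder-decoder. The `Seq2Seq` class, layer count, and file names below are illustrative and not taken from the paper; only the sizes (hidden 1000, source vocabulary 30K, target vocabulary 15K) follow the training parameters listed later in these notes.

```python
import torch
import torch.nn as nn

# Toy encoder-decoder standing in for the paper's LSTM NMT model (illustrative).
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=30_000, tgt_vocab=15_000, hidden=1000):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)   # source-side block
        self.encoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)   # target-side block
        self.decoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

# 1) Train the parent on the high-resource pair (e.g., French-English), then save it.
parent = Seq2Seq()
# ... training loop on French-English data would go here ...
torch.save(parent.state_dict(), "parent_fr_en.pt")

# 2) Initialize the child from the parent instead of from random weights.
#    Parameter shapes must match, so the child reuses the same vocabulary sizes.
child = Seq2Seq()
child.load_state_dict(torch.load("parent_fr_en.pt"))

# 3) Continue training the child on the low-resource pair (e.g., Uzbek-English);
#    the copied parameters are the starting point and, when some blocks are
#    frozen, also a constraint (see the Ablation Analysis section).
```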
### Architecture
|  |
| -------- |
| Figure: The NMT model architecture, showing six blocks of parameters, in addition to source/target words and predictions. During transfer learning, we expect the source-language related blocks to change more than the target-language related blocks. |
### Training Parameters
* Minibatch size: 128
* Hidden state size: 1000
* Target vocabulary size: 15K
* Source vocabulary size: 30K
| Parameter | Parent model | Child model |
| ------------------- | ------------ | ----------- |
| Dropout probability | 0.2 | 0.5 |
| Learning rate | 0.5 | 0.5 |
| Decay rate | 0.9 | 0.9 |
| Epochs | 5 | 100 |
* Initial parameter range: [-0.08, +0.08]
* Re-scale the gradient when the gradient norm is greater than 5
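A hedged sketch of how these settings could be wired into a PyTorch training loop. The optimizer choice (plain SGD) and the per-epoch decay schedule are assumptions layered on top of the values listed above.

```python
import torch
import torch.nn as nn

def configure_training(model, is_child=False):
    """Apply the hyper-parameters listed above (optimizer choice is assumed)."""
    if not is_child:
        # The parent starts from a uniform initialization in [-0.08, +0.08];
        # the child instead starts from the parent's transferred weights.
        for p in model.parameters():
            nn.init.uniform_(p, -0.08, 0.08)

    # Dropout probability: 0.2 for the parent model, 0.5 for the child model.
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = 0.5 if is_child else 0.2

    # Learning rate 0.5, decayed by a factor of 0.9 (per epoch, assumed).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
    return optimizer, scheduler

def training_step(model, loss, optimizer):
    """One update with gradient re-scaling when the gradient norm exceeds 5."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
```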
## Experiments
Translating Hausa, Turkish, Uzbek and Urdu into English with the help of a French-English parent model.
| Language | Train size | Test size |
| ----------- | ---------- | --------- |
| Hausa | 1.0m | 11.3K |
| Turkish | 1.4m | 11.6K |
| Uzbek | 1.8m | 11.5K |
| Urdu | 0.2m | 11.4K |
### Transfer Results
The `BLEU` score results for the transfer learning method applied to the four languages are shown below.
|  |
| -------- |
| Table: The ‘NMT’ column shows results without transfer, and the ‘Xfer’ column shows results with transfer. The ‘Final’ column shows `BLEU` after we ensemble 8 models and use unknown word replacement.|
### Re-scoring Results
* The authors also use the transfer NMT model to re-score output n-best lists (n = 1000) from the SBMT system.
* Re-scoring with the transfer NMT model gives larger gains than re-scoring with a neural language model or with an NMT system trained without transfer (a re-scoring sketch follows the table below).
|  |
| -------- |
| Table: The transfer method applied to re-scoring output n-best lists from the SBMT system. Additionally, the ‘LM’ column shows the results when an RNN LM was trained on the large English corpus and used to re-score the n-best list.|
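A minimal sketch of the re-scoring step, assuming each SBMT candidate arrives with a model score and that a scoring function for the transfer NMT model is available. The function names and the interpolation weight are illustrative, not from the paper.

```python
def rescore_nbest(source, nbest, nmt_log_prob, weight=1.0):
    """Re-rank an n-best list (n = 1000 in the paper) by combining the SBMT
    score with the transfer NMT model's score for each candidate.

    `nbest` is a list of (candidate, sbmt_score) pairs; `nmt_log_prob(src, cand)`
    is assumed to return log P(cand | src) under the transfer NMT model.
    """
    rescored = [
        (cand, sbmt_score + weight * nmt_log_prob(source, cand))
        for cand, sbmt_score in nbest
    ]
    # Best combined score first; the top element becomes the new 1-best output.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```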
## Analysis
### Different Parent Languages
* In the above experiments French-English is used as the parent language pair.
* Here, the authors experiment with different parent languages.
|  |
| -------- |
| Table: Data used for a low-resource Spanish-English task. Sizes are numbers of word tokens on the English side of the bitext.|
|  |
| -------- |
| Table: For a low-resource Spanish-English task, we experiment with several choices of parent model: none, French-English, and German-English. French-English may be best because French and Spanish are similar.|
* Overall, we can see that the choice of parent language makes a difference in BLEU score, so in the Hausa, Turkish, Uzbek, and Urdu experiments, a parent language more closely related to the child than French might improve results.
### Effects of having Similar Parent Language
* We look at a best-case scenario in which the parent language is as similar as possible to the child language.
* Here the authors devise a synthetic child language (called French’), which is exactly like French except that its vocabulary is shuffled randomly (e.g., “internationale” becomes “pomme”).
* This language, which looks unintelligible to human eyes, nevertheless has the same distributional and relational properties as actual French.
* That is, the word that, prior to vocabulary reassignment, was ‘roi’ (king) is likely to share distributional characteristics, and hence embedding similarity, with the word that, prior to reassignment, was ‘reine’ (queen).
* Such a language should be the ideal parent model.
|  |
| -------- |
| Table: A better match between parent and child languages should improve transfer results. We devised a child language called French’. We observe that French transfer learning helps French’ (13.3→20.0) more than it helps Uzbek (10.7→15.0).|
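A sketch of how a French’-style corpus could be generated: one fixed random bijection over the vocabulary types, applied to every sentence, so all distributional structure is preserved. Whitespace tokenization here is a simplification.

```python
import random

def make_french_prime(sentences, seed=0):
    """Build a 'French'' corpus: the same text with every vocabulary type
    renamed by one fixed random bijection (e.g., every 'internationale'
    becomes 'pomme'), so co-occurrence statistics are preserved exactly."""
    vocab = sorted({tok for sent in sentences for tok in sent.split()})
    shuffled = vocab[:]
    random.Random(seed).shuffle(shuffled)
    mapping = dict(zip(vocab, shuffled))  # type-level permutation, used everywhere
    return [" ".join(mapping[tok] for tok in sent.split()) for sent in sentences]

# The mapping is consistent across sentences, so repeated words stay repeated.
corpus = ["le roi et la reine", "le roi parle"]
print(make_french_prime(corpus))
```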
### Ablation Analysis
* Analyze what happens when different parts of the model are fixed, in order to see what yields optimal performance.

* The optimum setting is likely to be language- and corpus-dependent.
* For parent-child models with closely related languages, we expect that freezing, or strongly regularizing, more components of the model gives better results (a freezing sketch follows this list).
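A hedged sketch of one such ablation setting, reusing the toy `Seq2Seq` model and checkpoint from the Approach sketch: the target-side blocks stay fixed while the source-side blocks keep training. Which blocks to freeze (or regularize) is exactly what the ablation varies.

```python
import torch

def freeze_block(model, prefix):
    """Fix ('freeze') every parameter whose name starts with `prefix`,
    e.g. 'tgt_emb' for target embeddings or 'decoder' for the target RNN."""
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad = False

# Child initialized from the parent, as in the Approach sketch.
child = Seq2Seq()
child.load_state_dict(torch.load("parent_fr_en.pt"))
freeze_block(child, "tgt_emb")   # keep the parent's target embeddings fixed
freeze_block(child, "decoder")   # keep the parent's target-side RNN fixed
trainable = [p for p in child.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.5)
```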
### Learning Curves
The following are the learning curves for Uzbek-English:

### Different Parent Models
|  |
| -------- |
| Table: Transfer with parent models trained only on English data. The child data is the Uzbek-English corpus from Table 3. The English-English parent learns to copy English sentences, and the EngPerm-English parent learns to un-permute scrambled English sentences. The LM is a 2-layer LSTM RNN language model trained on the English corpus.|
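For illustration, a small sketch of how the ‘EngPerm’ parent data could be built: each source sentence is the English target with its word order randomly scrambled, while the English-English parent simply pairs every sentence with itself.

```python
import random

def make_engperm_pairs(english_sentences, seed=0):
    """Build (scrambled English, original English) training pairs for an
    'EngPerm-English' style parent: the model must learn to un-permute
    the source word order to recover the target sentence."""
    rng = random.Random(seed)
    pairs = []
    for sent in english_sentences:
        tokens = sent.split()
        permuted = tokens[:]
        rng.shuffle(permuted)                    # scramble order within the sentence
        pairs.append((" ".join(permuted), sent))
    return pairs

# The English-English parent is simply [(sent, sent) for sent in english_sentences].
```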
## Conclusion
* Overall the described transfer method improves NMT scores on low-resource languages by a large margin and allows the transfer NMT system to come close to the performance of a very strong SBMT system, even exceeding its performance on Hausa-English.
* In addition, the authors consistently and significantly improve state-of-the-art SBMT systems on low-resource languages when the transfer NMT system is used for re-scoring.
* The experiments suggest that there is still room for improvement in selecting parent languages that are more similar to child languages, provided data for such parents can be found.