---
title: 'Notes on "[Transfer Learning for Low-Resource Neural Machine Translation](https://arxiv.org/pdf/1604.02201.pdf)"'

---

# Notes on "[Transfer Learning for Low-Resource Neural Machine Translation](https://arxiv.org/pdf/1604.02201.pdf)"

Author(s): Thanmay Jayakumar

## Brief Outline

#### Problem 
* The encoder-decoder framework for neural machine translation (NMT) has been shown to be effective in large-data scenarios.
* **But it is much less effective for low-resource languages** - neural methods are data-hungry and learn poorly from low-count events.

#### Key Idea
* **Transfer Learning:** Transfer learning uses knowledge from a learned task to improve the performance on a related task, typically reducing the amount of required training data (Torrey and Shavlik, 2009; Pan and Yang, 2010). 
* **This is the first study to apply transfer learning to neural machine translation.**
    * First train a high-resource language pair (the parent model)
    * Then transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain training.

## Model

### Approach
* We first train an NMT model on a dataset where there is a large amount of bilingual data (e.g., French-English), which we call the parent model. 
* Next, we initialize a new NMT model with the parameters of the already-trained parent model.
* This new model, which we call the child model, is then trained on a dataset with very little bilingual data (e.g., Uzbek-English).
* **Note:** _This means that the low-data NMT model will not start with random weights, but with the weights from the parent model._
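To make the initialize-then-fine-tune step concrete, here is a minimal PyTorch sketch. The `Seq2Seq` module and all names are illustrative stand-ins, not the paper's actual implementation; sizes follow the "Training Parameters" section below.

```python
# Minimal sketch of parent -> child transfer in PyTorch. The Seq2Seq
# module is a generic encoder-decoder stand-in, not the paper's
# actual architecture.
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=30_000, tgt_vocab=15_000, hidden=1000):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)   # source-side block
        self.encoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)   # target-side block
        self.decoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)          # softmax projection

# 1) Train the parent on the high-resource pair (e.g., French-English).
parent = Seq2Seq()
# ... parent training on the large bitext omitted ...

# 2) Initialize the child (e.g., Uzbek-English) from the parent's weights
#    rather than from random values. The English (target) side carries over
#    directly; the source-side blocks merely serve as a starting point for
#    the new source language.
child = Seq2Seq()
child.load_state_dict(parent.state_dict())

# 3) Continue training `child` on the small bilingual corpus.
```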

### Architecture

| ![](https://i.imgur.com/XOHtgkP.png) |
| -------- |
| Figure: The NMT model architecture, showing six blocks of parameters, in addition to source/target words and predictions. During transfer learning, we expect the source-language related blocks to change more than the target-language related blocks.   |

### Training Parameters
* Minibatch size: 128
* Hidden state size: 1000
* Target vocabulary size: 15K
* Source vocabulary size: 30K
* Per-model settings:

    | Parameter           | Parent model | Child model |
    | ------------------- | ------------ | ----------- |
    | Dropout probability | 0.2          | 0.5         |
    | Learning rate       | 0.5          | 0.5         |
    | Decay rate          | 0.9          | 0.9         |
    | Epochs              | 5            | 100         |
* Initial parameter range : [-0.08, +0.08]
* Re-scale the gradient when the gradient norm is greater than 5
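A hedged sketch of how these settings could translate into a PyTorch training loop; the plain-SGD optimizer and per-epoch decay are assumptions inferred from the listed learning/decay rates, not details confirmed by the notes.

```python
import torch

model = Seq2Seq()  # from the sketch above

# Initial parameter range [-0.08, +0.08]
for p in model.parameters():
    torch.nn.init.uniform_(p, -0.08, 0.08)

# Learning rate 0.5 with multiplicative decay 0.9 (assumed per epoch)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(100):            # child: 100 epochs (parent: 5)
    for batch in []:                # placeholder: minibatches of size 128
        loss = model_loss(batch)    # hypothetical forward + cross-entropy
        optimizer.zero_grad()
        loss.backward()
        # Re-scale the gradient when its norm exceeds 5
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
    scheduler.step()
```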

## Experiments
Translating Hausa, Turkish, Uzbek, and Urdu into English with the help of a French-English parent model.

| Language | Train size (English tokens) | Test size (English tokens) |
| -------- | --------------------------- | -------------------------- |
| Hausa    | 1.0m                        | 11.3K                      |
| Turkish  | 1.4m                        | 11.6K                      |
| Uzbek    | 1.8m                        | 11.5K                      |
| Urdu     | 0.2m                        | 11.4K                      |



### Transfer Results
The `BLEU` score results for the transfer learning method applied to the four languages are shown below.

| ![](https://i.imgur.com/7YUpwRl.png) | 
| -------- | 
| Table: The ‘NMT’ column shows results without transfer, and the ‘Xfer’ column shows results with transfer. The ‘Final’ column shows `BLEU` after we ensemble 8 models and use unknown word replacement.|

### Re-scoring Results
* The authors also use the NMT model with transfer learning to re-score output n-best lists (n = 1000) from the SBMT system.
* Transfer NMT models give larger gains than re-scoring with a neural language model or with an NMT system that does not use transfer (see the sketch after the table below).

| ![](https://i.imgur.com/zOXWqUi.png) | 
| -------- | 
| Table: The transfer method applied to re-scoring output n-best lists from the SBMT system. Additionally, the ‘LM’ column shows the results when an RNN LM was trained on the large English corpus and used to re-score the n-best list.|
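A minimal sketch of n-best re-scoring under one common assumption: linearly interpolating the SBMT model score with the NMT log-probability. The interpolation weight and the `nmt_logprob` hook are illustrative, not the paper's exact recipe.

```python
def rescore(nbest, nmt_logprob, weight=0.5):
    """Pick the best hypothesis from an n-best list (n = 1000 here).

    nbest: list of (hypothesis, sbmt_score) pairs for one source sentence.
    nmt_logprob: callable scoring a hypothesis with the transfer NMT model.
    weight: assumed interpolation weight between the two scores.
    """
    return max(nbest, key=lambda item: item[1] + weight * nmt_logprob(item[0]))

# Example usage with a dummy scorer standing in for the NMT model:
# best_hyp, _ = rescore(nbest_list, nmt_logprob=lambda h: -len(h))
```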

## Analysis

### Different Parent Languages

* In the above experiments French-English is used as the parent language pair. 
* Here, the authors experiment with different parent languages.

| ![](https://i.imgur.com/Zl7CgDG.png) | 
| -------- | 
| Table: Data used for a low-resource Spanish-English task. Sizes are numbers of word tokens on the English side of the bitext.|

| ![](https://i.imgur.com/4WI9W4i.png) | 
| -------- | 
| Table: For a low-resource Spanish-English task, we experiment with several choices of parent model: none, French-English, and German-English. French-English may be best because French and Spanish are similar.|

* Overall, the choice of parent language can make a difference in the `BLEU` score, so in the Hausa, Turkish, Uzbek, and Urdu experiments, a better-matched parent language than French might improve results.

### Effects of having Similar Parent Language
* We look at a best-case scenario in which the parent language is as similar as possible to the child language.
* Here the authors devise a synthetic child language (called French’) 
    * which is exactly like French, except that its vocabulary is shuffled randomly (e.g., “internationale” is now “pomme”, etc.).
    * This language, which looks unintelligible to human eyes, nevertheless has the same distributional and relational properties as actual French, 
    * i.e., the word that, prior to vocabulary reassignment, was ‘roi’ (king) is likely to share distributional characteristics, and hence embedding similarity, with the word that, prior to reassignment, was ‘reine’ (queen).
    * Such a language should be the ideal parent model.
| ![](https://i.imgur.com/l8U4G6W.png) | 
| -------- | 
| Table: A better match between parent and child languages should improve transfer results. We devised a child language called French’. We observe that French transfer learning helps French’ (13.3→20.0) more than it helps Uzbek (10.7→15.0).|
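Constructing French’ amounts to applying one fixed random permutation to the French vocabulary consistently across the whole corpus. A small token-level sketch of that construction; any subtleties of the paper's actual setup are assumed away.

```python
# Build a synthetic French' corpus by consistently renaming every
# vocabulary item according to a single random permutation.
import random

def make_french_prime(corpus):
    """corpus: list of tokenized sentences (lists of strings)."""
    vocab = sorted({tok for sent in corpus for tok in sent})
    shuffled = vocab[:]
    random.shuffle(shuffled)
    mapping = dict(zip(vocab, shuffled))   # e.g., 'internationale' -> 'pomme'
    return [[mapping[tok] for tok in sent] for sent in corpus]

# The permuted corpus keeps all distributional statistics of French, so a
# French-English parent is an (almost) ideal match for French'-English.
```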
    
### Ablation Analysis
* The authors analyze what happens when different parts of the model are held fixed during child training, in order to see what yields optimal performance.
![](https://i.imgur.com/yNS3MeV.png)
* The optimum setting is likely to be language- and corpus-dependent.
* For parent-child models with closely related languages, we expect freezing, or strongly regularizing, more components of the model to give better results (a sketch of the freezing idiom follows).
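In PyTorch, fixing a block of parameters during child training is a matter of disabling its gradients. A sketch using the toy `Seq2Seq` blocks from the earlier sketch; which blocks to freeze is exactly what the ablation varies.

```python
import torch

def freeze(*modules):
    """Exclude the given modules' parameters from gradient updates."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad = False

child = Seq2Seq()
# Example setting: keep the target-language side fixed and let the
# source-side blocks adapt to the new language.
freeze(child.tgt_emb, child.decoder, child.out)

# Only the still-trainable parameters go to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in child.parameters() if p.requires_grad), lr=0.5)
```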

### Learning Curves

The following are the learning curves for Uzbek-English:
![](https://i.imgur.com/VzQUq3i.png)

### Different Parent Models
| ![](https://i.imgur.com/kP2fapp.png) | 
| -------- | 
| Table: Transfer with parent models trained only on English data. The child data is the Uzbek-English corpus from Table 3. The English-English parent learns to copy English sentences, and the EngPerm-English learns to un-permute scrambled English sentences. The LM is a 2-layer LSTM RNN language model trained on the English corpus.|

## Conclusion

* Overall, the described transfer method improves NMT scores on low-resource languages by a large margin, allowing the transfer NMT system to come close to the performance of a very strong SBMT system, and even to exceed it on Hausa-English.
* In addition, the authors consistently and significantly improve state-of-the-art SBMT systems on low-resource languages when the transfer NMT system is used for re-scoring. 
* The experiments suggest that there is still room for improvement in selecting parent languages that are more similar to child languages, provided data for such parents can be found.


