# Notes on "[Transfer Learning for Low-Resource Neural Machine Translation](https://arxiv.org/pdf/1604.02201.pdf)"

Author(s): Thanmay Jayakumar

## Brief Outline

#### Problem

* The encoder-decoder framework for neural machine translation (NMT) has been shown to be effective in large-data scenarios.
* **But it is much less effective for low-resource languages** - neural methods are data-hungry and learn poorly from low-count events.

#### Key Idea

* **Transfer Learning:** Transfer learning uses knowledge from a learned task to improve performance on a related task, typically reducing the amount of required training data (Torrey and Shavlik, 2009; Pan and Yang, 2010).
* **This is the first study to apply transfer learning to neural machine translation.**
* First, train a model on a high-resource language pair (the parent model).
* Then, transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain training.

## Model

### Approach

* We first train an NMT model on a dataset with a large amount of bilingual data (e.g., French-English), which we call the parent model.
* Next, we initialize a second NMT model with the already-trained parent model.
* This new model is then trained on a dataset with very little bilingual data (e.g., Uzbek-English), which we call the child model.
* **Note:** _This means that the low-data NMT model does not start with random weights, but with the weights of the parent model_ (a minimal sketch of this initialization follows below).
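The sketch below illustrates this parent-child initialization in PyTorch. It is an illustrative stand-in rather than the authors' implementation: the `Seq2Seq` class and its attribute names are invented here, and the paper's actual model is a multi-layer LSTM encoder-decoder with attention and dropout.

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal stand-in for the paper's LSTM encoder-decoder (attention omitted)."""
    def __init__(self, src_vocab, tgt_vocab, hidden=1000):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_emb(src))
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec_out)

# 1) Train the parent model on the high-resource pair (e.g., French-English).
parent = Seq2Seq(src_vocab=30_000, tgt_vocab=15_000)
# ... parent training loop goes here ...

# 2) Initialize the child with the parent's weights instead of random values.
#    Parent and child share the same architecture and vocabulary sizes, so every
#    parameter tensor has an identical shape and can be copied directly.
child = Seq2Seq(src_vocab=30_000, tgt_vocab=15_000)
child.load_state_dict(parent.state_dict())

# 3) Optionally freeze (constrain) some parameter blocks during child training;
#    which blocks are worth fixing is explored in the ablation analysis below.
for p in child.tgt_emb.parameters():
    p.requires_grad = False

# 4) Continue training `child` on the low-resource pair (e.g., Uzbek-English),
#    using the child hyperparameters from the table below (dropout 0.5, 100 epochs).
```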
### Architecture

| ![](https://i.imgur.com/XOHtgkP.png) |
| -------- |
| Figure: The NMT model architecture, showing six blocks of parameters, in addition to source/target words and predictions. During transfer learning, we expect the source-language-related blocks to change more than the target-language-related blocks. |

### Training Parameters

* Minibatch size: 128
* Hidden state size: 1000
* Target vocabulary size: 15K
* Source vocabulary size: 30K
* Initial parameter range: [-0.08, +0.08]
* Re-scale the gradient whenever the gradient norm is greater than 5

| Parameter | Parent model | Child model |
| ------------------- | ------------ | ----------- |
| Dropout probability | 0.2 | 0.5 |
| Learning rate | 0.5 | 0.5 |
| Decay rate | 0.9 | 0.9 |
| Epochs | 5 | 100 |

## Experiments

Translating Hausa, Turkish, Uzbek, and Urdu into English with the help of a French-English parent model.

| Language | Train size | Test size |
| ----------- | ---------- | --------- |
| Hausa | 1.0m | 11.3K |
| Turkish | 1.4m | 11.6K |
| Uzbek | 1.8m | 11.5K |
| Urdu | 0.2m | 11.4K |

### Transfer Results

The `BLEU` score results for the transfer learning method applied to the four languages are shown below.

| ![](https://i.imgur.com/7YUpwRl.png) |
| -------- |
| Table: The ‘NMT’ column shows results without transfer, and the ‘Xfer’ column shows results with transfer. The ‘Final’ column shows `BLEU` after we ensemble 8 models and use unknown-word replacement. |

### Re-scoring Results

* The authors also use the NMT model with transfer learning to re-score output n-best lists (n = 1000) from the SBMT system.
* Transfer NMT models give the highest gains, compared with re-scoring using a neural language model or an NMT system that does not use transfer (a minimal re-scoring sketch follows the table below).

| ![](https://i.imgur.com/zOXWqUi.png) |
| -------- |
| Table: The transfer method applied to re-scoring output n-best lists from the SBMT system. Additionally, the ‘LM’ column shows the results when an RNN LM was trained on the large English corpus and used to re-score the n-best list. |
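A rough sketch of what this n-best re-scoring might look like, assuming a seq2seq model such as the `Seq2Seq` stand-in sketched earlier (whose forward returns per-position logits over the target vocabulary). The score combination here, a weighted sum of the SBMT score and the NMT log-probability with a dev-tuned weight, is an illustrative assumption; these notes do not describe the paper's exact combination.

```python
import torch
import torch.nn.functional as F

def nmt_log_prob(model, src_ids, tgt_ids):
    """Log P(target | source) under an encoder-decoder model whose forward
    returns per-position logits over the target vocabulary.
    `tgt_ids` is expected to start with BOS and end with EOS."""
    with torch.no_grad():
        logits = model(src_ids.unsqueeze(0), tgt_ids[:-1].unsqueeze(0))
        log_probs = F.log_softmax(logits, dim=-1)[0]              # (len-1, vocab)
        token_scores = log_probs.gather(1, tgt_ids[1:].unsqueeze(1))
        return token_scores.sum().item()

def rescore_nbest(model, src_ids, nbest, nmt_weight=1.0):
    """Re-rank an SBMT n-best list (n = 1000 in the paper).
    `nbest` is a list of (hypothesis_token_ids, sbmt_score) pairs;
    `nmt_weight` is an interpolation weight one would tune on a dev set."""
    def combined(hyp, sbmt_score):
        return sbmt_score + nmt_weight * nmt_log_prob(model, src_ids, hyp)
    return max(nbest, key=lambda pair: combined(*pair))[0]
```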
## Analysis

### Different Parent Languages

* In the experiments above, French-English is used as the parent language pair.
* Here, the authors experiment with different parent languages.

| ![](https://i.imgur.com/Zl7CgDG.png) |
| -------- |
| Table: Data used for a low-resource Spanish-English task. Sizes are numbers of word tokens on the English side of the bitext. |

| ![](https://i.imgur.com/4WI9W4i.png) |
| -------- |
| Table: For a low-resource Spanish-English task, we experiment with several choices of parent model: none, French-English, and German-English. French-English may be best because French and Spanish are similar. |

* Overall, we can see that the choice of parent language can make a difference in the BLEU score, so in the Hausa, Turkish, Uzbek, and Urdu experiments, a parent language more suitable than French might improve results.

### Effects of Having a Similar Parent Language

* We look at a best-case scenario in which the parent language is as similar as possible to the child language.
* Here the authors devise a synthetic child language (called French’), which is exactly like French except that its vocabulary is shuffled randomly (e.g., “internationale” is now “pomme”, etc.).
* This language, which looks unintelligible to human eyes, nevertheless has the same distributional and relational properties as actual French:
  * i.e., the word that, prior to vocabulary reassignment, was ‘roi’ (king) is likely to share distributional characteristics, and hence embedding similarity, with the word that, prior to reassignment, was ‘reine’ (queen).
* A French parent should therefore be close to ideal for this child language.

| ![](https://i.imgur.com/l8U4G6W.png) |
| -------- |
| Table: A better match between parent and child languages should improve transfer results. We devised a child language called French’. We observe that French transfer learning helps French’ (13.3→20.0) more than it helps Uzbek (10.7→15.0). |

### Ablation Analysis

* Analyze what happens when different parts of the model are fixed, in order to see what yields optimal performance.

![](https://i.imgur.com/yNS3MeV.png)

* The optimum setting is likely to be language- and corpus-dependent.
* For parent-child models with closely related languages, we expect freezing, or strongly regularizing, more components of the model to give better results.

### Learning Curves

The following are the learning curves for Uzbek-English.

![](https://i.imgur.com/VzQUq3i.png)

### Different Parent Models

| ![](https://i.imgur.com/kP2fapp.png) |
| -------- |
| Table: Transfer with parent models trained only on English data. The child data is the Uzbek-English corpus from Table 3. The English-English parent learns to copy English sentences, and the EngPerm-English parent learns to un-permute scrambled English sentences. The LM is a 2-layer LSTM RNN language model trained on the English corpus. |

## Conclusion

* Overall, the described transfer method improves NMT scores on low-resource languages by a large margin and allows the transfer NMT system to come close to the performance of a very strong SBMT system, even exceeding it on Hausa-English.
* In addition, the authors consistently and significantly improve state-of-the-art SBMT systems on low-resource languages when the transfer NMT system is used for re-scoring.
* The experiments suggest that there is still room for improvement in selecting parent languages that are more similar to child languages, provided data for such parents can be found.