# Notes on "[Transfer Learning for Low-Resource Neural Machine Translation](https://arxiv.org/pdf/1604.02201.pdf)"

Author(s): Thanmay Jayakumar

## Brief Outline

#### Problem

* The encoder-decoder framework for neural machine translation (NMT) has been shown to be effective in large-data scenarios.
* **But it is much less effective for low-resource languages.**
    - Neural methods are data-hungry and learn poorly from low-count events.

#### Key Idea

* **Transfer Learning:** Transfer learning uses knowledge from a learned task to improve performance on a related task, typically reducing the amount of required training data (Torrey and Shavlik, 2009; Pan and Yang, 2010).
* **According to the authors, this is the first study to apply transfer learning to neural machine translation.**
* First train a high-resource language pair (the parent model).
* Then transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain training.

## Model

### Approach

* We first train an NMT model on a dataset with a large amount of bilingual data (e.g., French-English); we call this the parent model.
* Next, we initialize a new NMT model with the already-trained parent model.
* This new model is then trained on a dataset with very little bilingual data (e.g., Uzbek-English); we call this the child model.
* **Note:** _This means that the low-data NMT model will not start with random weights, but with the weights from the parent model._

### Architecture

| ![](https://i.imgur.com/XOHtgkP.png) |
| -------- |
| Figure: The NMT model architecture, showing six blocks of parameters, in addition to source/target words and predictions. During transfer learning, we expect the source-language related blocks to change more than the target-language related blocks. |
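
The parent-to-child initialization described under Approach can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the block names, `new_model`, `fake_train`, and `init_child_from_parent` are hypothetical, the models are tiny lists of floats, and `fake_train` merely perturbs weights as a stand-in for real NMT training.

```python
import copy
import random

# Hypothetical names for the six parameter blocks (the notes do not list them).
BLOCK_NAMES = (
    "source_embeddings", "source_rnn", "attention",
    "target_rnn", "target_input_embeddings", "target_output_embeddings",
)


def new_model(rng, weights_per_block=3, scale=0.08):
    """Baseline initialization: small random weights in every block."""
    return {
        name: [rng.uniform(-scale, scale) for _ in range(weights_per_block)]
        for name in BLOCK_NAMES
    }


def fake_train(model, rng, step=0.01):
    """Placeholder for NMT training: perturbs each weight in place."""
    for weights in model.values():
        for i in range(len(weights)):
            weights[i] += rng.uniform(-step, step)
    return model


def init_child_from_parent(parent):
    """The child starts from a copy of the parent's weights, not random ones."""
    return copy.deepcopy(parent)


rng = random.Random(0)
parent = fake_train(new_model(rng), rng)   # 1. train on the high-resource pair
child = init_child_from_parent(parent)     # 2. initialize the child from the parent
child_at_start = copy.deepcopy(child)      # snapshot, for inspection only
fake_train(child, rng)                     # 3. continue on the low-resource pair
```

The deep copy keeps the parent intact, so the same parent could seed several child models (for example, one per low-resource language).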

### Training Parameters

* Minibatch size: 128
* Hidden state size: 1000
* Target vocabulary size: 15K
* Source vocabulary size: 30K

| Parameter           | Parent model | Child model |
| ------------------- | ------------ | ----------- |
| Dropout probability | 0.2          | 0.5         |
| Learning rate       | 0.5          | 0.5         |
| Decay rate          | 0.9          | 0.9         |
| Epochs              | 5            | 100         |

* Initial parameter range: [-0.08, +0.08]
* Re-scale the gradient when the gradient norm is greater than 5

## Experiments

Translating Hausa, Turkish, Uzbek and Urdu into English with the help of a French-English parent model.

| Language    | Train size | Test size |
| ----------- | ---------- | --------- |
| Hausa       | 1.0m       | 11.3K     |
| Turkish     | 1.4m       | 11.6K     |
| Uzbek       | 1.8m       | 11.5K     |
| Urdu        | 0.2m       | 11.4K     |

### Transfer Results

The `BLEU` score results for the transfer learning method applied to the four languages are shown below.

| ![](https://i.imgur.com/7YUpwRl.png) |
| -------- |
| Table: The ‘NMT’ column shows results without transfer, and the ‘Xfer’ column shows results with transfer. The ‘Final’ column shows `BLEU` after we ensemble 8 models and use unknown word replacement. |

### Re-scoring Results

* The authors also use the NMT model with transfer learning to re-score output n-best lists (n = 1000) from a syntax-based machine translation (SBMT) system.
* Re-scoring with transfer NMT models gives larger gains than re-scoring with a neural language model or with an NMT system that does not use transfer.

| ![](https://i.imgur.com/zOXWqUi.png) |
| -------- |
| Table: The transfer method applied to re-scoring output n-best lists from the SBMT system. Additionally, the ‘LM’ column shows the results when an RNN LM was trained on the large English corpus and used to re-score the n-best list. |

## Analysis

### Different Parent Languages

* In the above experiments, French-English is used as the parent language pair.
* Here, the authors experiment with different parent languages.

| ![](https://i.imgur.com/Zl7CgDG.png) |
| -------- |
| Table: Data used for a low-resource Spanish-English task. Sizes are numbers of word tokens on the English side of the bitext. |

| ![](https://i.imgur.com/4WI9W4i.png) |
| -------- |
| Table: For a low-resource Spanish-English task, we experiment with several choices of parent model: none, French-English, and German-English. French-English may be best because French and Spanish are similar. |

* Overall, the choice of parent language can make a difference in the BLEU score, so in the Hausa, Turkish, Uzbek, and Urdu experiments, a better-suited parent language than French might improve results.

### Effects of having Similar Parent Language

* We look at a best-case scenario in which the parent language is as similar as possible to the child language.
* Here the authors devise a synthetic child language (called French’) which is exactly like French, except that its vocabulary is shuffled randomly (e.g., “internationale” is now “pomme”, etc.).
* This language, which looks unintelligible to human eyes, nevertheless has the same distributional and relational properties as actual French.
    * That is, the word that, prior to vocabulary reassignment, was ‘roi’ (king) is likely to share distributional characteristics, and hence embedding similarity, with the word that, prior to reassignment, was ‘reine’ (queen).
* For such a child language, French should be the ideal parent model.

| ![](https://i.imgur.com/l8U4G6W.png) |
| -------- |
| Table: A better match between parent and child languages should improve transfer results. We devised a child language called French’. We observe that French transfer learning helps French’ (13.3→20.0) more than it helps Uzbek (10.7→15.0). |

### Ablation Analysis

* Analyze what happens when different parts of the model are fixed, in order to see what yields optimal performance.
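
Fixing a part of the model, as described above, can be thought of as skipping that block's update while the child trains. The following is a minimal sketch in plain Python, assuming a toy SGD step and hypothetical block names (`source_embeddings`, `target_output_embeddings`); it is not the authors' training code, and the learning rate of 0.5 is simply taken from the training parameters in these notes.

```python
def sgd_step(params, grads, frozen, lr=0.5):
    """Apply one plain SGD update, leaving blocks named in `frozen` untouched."""
    for name, weights in params.items():
        if name in frozen:
            continue  # a frozen block keeps the values inherited from the parent
        for i, g in enumerate(grads[name]):
            weights[i] -= lr * g
    return params


# Toy child model initialized from a parent; block names are hypothetical.
params = {
    "source_embeddings": [0.10, -0.20],
    "target_output_embeddings": [0.30, 0.40],
}
grads = {
    "source_embeddings": [0.20, 0.20],
    "target_output_embeddings": [0.20, 0.20],
}

# Freeze the target-side block and update only the source-side block.
sgd_step(params, grads, frozen={"target_output_embeddings"})
print(params)
```

Passing a different `frozen` set for each run is one simple way to enumerate ablation settings of this kind.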

![](https://i.imgur.com/yNS3MeV.png)

* The optimum setting is likely to be language- and corpus-dependent.
* For parent-child models with closely related languages we expect freezing, or strongly regularizing, more components of the model to give better results.

### Learning Curves

The following are the learning curves for Uzbek-English:

![](https://i.imgur.com/VzQUq3i.png)

### Different Parent Models

| ![](https://i.imgur.com/kP2fapp.png) |
| -------- |
| Table: Transfer with parent models trained only on English data. The child data is the Uzbek-English corpus from Table 3. The English-English parent learns to copy English sentences, and the EngPerm-English learns to un-permute scrambled English sentences. The LM is a 2-layer LSTM RNN language model trained on the English corpus. |

## Conclusion

* Overall the described transfer method improves NMT scores on low-resource languages by a large margin and allows the transfer NMT system to come close to the performance of a very strong SBMT system, even exceeding its performance on Hausa-English.
* In addition, the authors consistently and significantly improve state-of-the-art SBMT systems on low-resource languages when the transfer NMT system is used for re-scoring.
* The experiments suggest that there is still room for improvement in selecting parent languages that are more similar to child languages, provided data for such parents can be found.
