# Notes on "[Rapid Adaptation of Neural Machine Translation to New Languages](https://arxiv.org/pdf/1808.04189.pdf)"

Authors: Graham Neubig, Junjie Hu

## Brief Outline

#### Problem

* NMT requires large amounts of training data, and creating high-quality systems for low-resource languages (LRLs) is a difficult challenge where research efforts have just begun.
* **The time it takes to create such a system** is also a challenge: in a crisis situation, time is of the essence, and systems that require days or weeks of training will not be desirable or even feasible.
* _How can we create MT systems for new language pairs as accurately as possible, and as quickly as possible?_

#### Key Idea

* Propose NMT methods at the intersection of cross-lingual transfer learning (Zoph et al., 2016) and multilingual training (Johnson et al., 2016).
* Propose a novel method of similar-language regularization (SLR), where training data from a second, similar language is used to help prevent over-fitting to the small LRL dataset.

## Model

### Approach

* We have a source LRL of interest, and we want to translate into English.
* Train a seed model on a large number of languages, then fine-tune the model to improve its performance on the language of interest.
* All the adaptation methods are based on first training on larger data including other languages, then fine-tuning the model to be specifically tailored to the LRL.

### Training Paradigms

#### Multilingual Modeling Methods

##### Single-source modeling (“Sing.”)

* Uses only parallel data between the LRL of interest and English.
* The resulting model will be most highly tailored to the final test language pair, but the method also has the obvious disadvantage that training data is very sparse.

##### Bi-source modeling (“Bi”)

* Inspired by Johnson et al. (2016).
* Trains an MT system with two source languages:
  * one LRL that we would like to translate from,
  * and a second, highly related high-resource language (HRL): the helper source language.
* The advantage of this method is that it allows the LRL to learn from a highly similar helper, potentially increasing accuracy.

##### All-source modeling (“All”)

* Creates a universal model on all of the languages that we have at our disposal.
* This paradigm allows us to train a single model that has wide coverage of the vocabulary and syntax of a large number of languages.
* Has the drawback that a single model must be able to express information about all the languages in the training set within its limited parameter budget.
* Thus, it is reasonable to expect that this model may achieve worse accuracy than a model created specifically to handle a particular source language.

#### Adaptation to New Languages

##### Adaptation by Fine-tuning

* **Seed Model Variety:** Zoph et al. (2016) performed experiments taking a bilingual system trained on a different language (e.g. French) and adapting it to a new LRL (e.g. Uzbek). We can also take a universal model and adapt it to the new language.
* **Warm vs. Cold Start:**
  * Intuitively, we expect warm-start training to perform better, as having access to the LRL of interest during the training of the original model will ensure that it can handle the LRL to some extent.
  * However, the cold-start scenario is also of interest: we may want to spend large amounts of time training a strong model, then quickly adapt it to a new language that we have never seen before in our training data, as data becomes available.
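To make the adaptation-by-fine-tuning recipe concrete, below is a minimal PyTorch-style sketch of the cold-start case: a seed model trained on other source languages is loaded and training simply continues on the small LRL-to-English corpus. This is an illustrative assumption, not the paper's implementation: the paper uses attentional encoder-decoders built in xnmt, while `ToyNMT`, the checkpoint path, and the hyperparameters here are stand-ins that only exist so the fine-tuning mechanics can run.

```python
import torch
import torch.nn as nn

# Stand-in for the paper's attentional encoder-decoder (bi-directional LSTM encoder,
# 128-dim embeddings, 512-dim hidden states). It is NOT a usable translation model;
# it only exists so the fine-tuning loop below runs end to end.
class ToyNMT(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=128, hid_dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.enc = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, src_ids):
        hidden, _ = self.enc(self.emb(src_ids))
        return self.out(hidden)  # (batch, length, vocab) "logits"

def fine_tune(seed_ckpt, lrl_batches, lr=1e-4, steps=100):
    """Cold-start adaptation: load a seed model trained on other languages and
    continue training on the small LRL->eng corpus. The warm-start case uses the
    same loop; the only difference is whether the seed model already saw LRL data.
    A real cold-start system also has to handle LRL vocabulary not covered by the
    seed model (e.g. via a shared vocabulary), which this sketch ignores."""
    model = ToyNMT()
    model.load_state_dict(torch.load(seed_ckpt))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _, (src, tgt) in zip(range(steps), lrl_batches):
        optimizer.zero_grad()
        logits = model(src)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
        loss.backward()
        optimizer.step()
    return model

if __name__ == "__main__":
    # Fake seed checkpoint and fake LRL minibatches, just to exercise the loop.
    torch.save(ToyNMT().state_dict(), "seed_universal.pt")
    fake_batches = [(torch.randint(0, 1000, (8, 10)), torch.randint(0, 1000, (8, 10)))
                    for _ in range(5)]
    fine_tune("seed_universal.pt", fake_batches, steps=5)
```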
##### Similar-Language Regularization

One problem with adapting to a small amount of data in the target language is that it is very easy for the model to over-fit to the small training set. To alleviate this problem, we propose a method of similar-language regularization: while training to adapt to the language of interest, we also add some data from another similar HRL that has sufficient resources, to help prevent over-fitting. We do this in two ways:

* **Corpus Concatenation:** Simply concatenate the data from the two corpora, so that we have a small amount of data in the LRL and a large amount of data in the similar HRL.
* **Balanced Sampling:** Every time we select a minibatch for training, we sample it either from the LRL or from the HRL according to a fixed ratio. We try different sampling strategies, with 1-to-1, 1-to-2, and 1-to-4 ratios for the LRL and HRL respectively.

## Experiments

* The authors perform experiments on the 58-language-to-English TED corpus (Qi et al., 2018).
* Like Qi et al. (2018), they experiment with Azerbaijani (aze), Belarusian (bel), and Galician (glg) to English, and additionally add Slovak (slk), a slightly higher-resourced language, for contrast.
* These languages are all paired with a similar HRL: Turkish (tur), Russian (rus), Portuguese (por), and Czech (ces) respectively.
* Models are implemented using xnmt (Neubig et al., 2018).
* The model is an attentional neural machine translation model (Bahdanau et al., 2015) with:
  * bi-directional LSTM encoders,
  * 128-dimensional word embeddings,
  * 512-dimensional hidden states,
  * a standard LSTM-based decoder.

### Does Multilingual Training Help?

* **Warm-start setting:**
  * Yes: gains of 7-13 BLEU points are obtained by going from single-source to bi-source or all-source training, corroborating previous work (Gu et al., 2018).
* **Cold-start setting:**
  * Even systems with no data in the target language achieve nontrivial accuracies, up to 15.5 BLEU on glg-eng.
  * Interestingly, in the cold-start scenario the All− model bests the Bi− model, indicating that massively multilingual training is more useful in this setting.
  * In contrast, the unsupervised NMT model struggles, achieving a BLEU score of around 0 for all language pairs. This is because unsupervised NMT requires high-quality monolingual embeddings from the same distribution, which can be trained easily for English but are not available in the low-resource languages considered here.

### Does Adaptation Work?

* Regardless of the original model and method for adaptation, adaptation is helpful, particularly (and unsurprisingly) in the cold-start case.
* When adapting directly to only the target language (“→Sing.”), adapting from the massively multilingual model performs better, indicating that information about all input languages is better than information about just a single language.
* Comparing with the proposed similar-language regularization (“→Bi”), this helps significantly over adapting directly to the LRL, particularly in the cold-start case, where we can observe gains of up to 1.7 BLEU points.
* Finally, in their data setting, corpus concatenation outperforms balanced sampling in both the cold-start and warm-start scenarios (both schemes are sketched below).
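To make the two similar-language-regularization schemes concrete, here is a minimal, self-contained Python sketch of how the training data would be mixed. The toy aze/tur corpora, the function names, and the batch size are illustrative assumptions, not the paper's xnmt implementation.

```python
import random

def concat_corpus(lrl_pairs, hrl_pairs):
    """Corpus concatenation: train on one combined set containing the small
    LRL corpus plus the full similar-HRL corpus."""
    return lrl_pairs + hrl_pairs

def balanced_batches(lrl_pairs, hrl_pairs, batch_size=32, ratio=(1, 1), seed=0):
    """Balanced sampling: each minibatch is drawn entirely from either the LRL
    or the HRL corpus, with the corpus chosen at a fixed LRL:HRL ratio
    (the paper tries 1:1, 1:2, and 1:4)."""
    rng = random.Random(seed)
    p_lrl = ratio[0] / (ratio[0] + ratio[1])
    while True:  # yields an endless stream of minibatches
        corpus = lrl_pairs if rng.random() < p_lrl else hrl_pairs
        yield rng.sample(corpus, min(batch_size, len(corpus)))

if __name__ == "__main__":
    # Toy (source, target) pairs standing in for aze-eng (LRL) and tur-eng (HRL).
    aze = [(f"aze sentence {i}", f"english {i}") for i in range(100)]
    tur = [(f"tur sentence {i}", f"english {i}") for i in range(10000)]
    batches = balanced_batches(aze, tur, batch_size=4, ratio=(1, 4))
    for _ in range(3):
        print(next(batches)[0])
```

Since the paper finds that plain corpus concatenation works at least as well as balanced sampling in its data setting, the simpler option is a reasonable default.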
### How Can We Adapt Most Efficiently?

* We can see that in all cases the cold-start models (All− →) either outperform or are comparable in final accuracy to the from-scratch single-source and bi-source models.
* In addition, all of the adapted models converge faster than the bi-source models trained from scratch, indicating that adapting from seed models is a good strategy for rapid construction of MT systems in new languages.
* Comparing the cold-start adaptation strategies, in general the higher the density of target-language training data, the faster training converges to a solution, but the worse the final solution is.
* This suggests that there is a speed/accuracy trade-off in the amount of similar-language regularization we apply during fine-tuning.

## Conclusion

* This paper examined methods to rapidly adapt MT systems to new languages by fine-tuning.
* In both warm-start and cold-start scenarios, the best results were obtained by adapting a pre-trained universal model to the low-resource language while regularizing with similar languages.