# Notes on "[Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data](https://arxiv.org/pdf/2105.15071.pdf)"
Authors: Wei-Jen Ko, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Naman Goyal, Francisco Guzman, Pascale Fung, Philipp Koehn, Mona Diab
## Brief Outline
#### Problem
* The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages
#### Key Idea
* Fortunately, some low-resource languages are linguistically related or similar to high-resource languages
* These related languages may share many lexical or syntactic structures
* This method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation.
## Approach
*Figure: Illustration of the training tasks for translating from English into a low-resource language (LRL) and from an LRL to English.*
### English to Low-Resource
* **Translation:** The first task is translation from English into the high-resource language (HRL) which is trained using readily available high-resource parallel data. This task aims to transfer high-resource translation knowledge to aid in translating into the low-resource language.
* **Denoising Autoencoding:** For this task, we leverage monolingual text by introducing noise into each sentence, feeding the noised sentence into the encoder, and training the model to generate the original sentence. The noise is similar to that of Lample et al. (2018a) and consists of random shuffling and masking of words: the shuffling is a random permutation in which each word shifts at most 3 positions from its original place, and each word is masked with a uniform probability of 0.1 (a minimal sketch of this noise function follows this task list). This task aims to learn a feature space for the languages, so that the encoder and decoder learn to map between that feature space and sentences.
* **Backtranslation:** For this task, we train on English to low-resource backtranslation data. The aim of this task is to capture a language-modeling effect in the low-resource language.
* **Adversarial Training:** The final task aims to make the encoder output language-agnostic features. The representation is made agnostic across the noised high-resource and low-resource languages as well as English. Ideally, the encoder output should contain the semantic information of the sentence and little to no language-specific information. This way, any knowledge learned from the English to high-resource parallel data can be directly applied to generating the low-resource language by simply switching the language token during inference, without capturing spurious correlations (Gu et al., 2019a). A sketch of one way to implement this objective also follows this list.
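
The noise function used for denoising autoencoding can be sketched in a few lines of Python. This is a minimal illustration under the parameters stated above (a local shuffle with a maximum displacement of 3 positions and a word-masking probability of 0.1); the function name and mask token are placeholders, not the authors' implementation.

```python
import random

def add_noise(tokens, max_shift=3, mask_prob=0.1, mask_token="<mask>"):
    """Noise a tokenized sentence for denoising autoencoding:
    locally shuffle words (each word moves at most `max_shift` positions)
    and mask each word independently with probability `mask_prob`."""
    # Local shuffle (Lample et al., 2018a style): perturb each index by a
    # random offset in [0, max_shift] and sort by the perturbed index.
    keys = [i + random.uniform(0, max_shift) for i in range(len(tokens))]
    shuffled = [tok for _, tok in sorted(zip(keys, tokens), key=lambda p: p[0])]
    # Random masking.
    return [mask_token if random.random() < mask_prob else tok for tok in shuffled]

# The model is trained to reconstruct the original sentence from its noised version.
print(add_noise("the cat sat on the mat".split()))
```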
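One standard way to realize the adversarial objective (used in both translation directions) is a language classifier (critic) on top of the encoder combined with a gradient reversal layer, so that the encoder is pushed to produce representations from which the input language cannot be recovered. The PyTorch sketch below illustrates that idea under the assumption of encoder states shaped (batch, length, hidden); the class and parameter names are hypothetical and the authors' exact training schedule may differ.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass,
    so training the critic to classify languages pushes the encoder to hide them."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class LanguageCritic(nn.Module):
    """Predicts the input language from mean-pooled encoder states."""
    def __init__(self, hidden_dim, num_languages):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_languages),
        )

    def forward(self, encoder_states, lamb=1.0):
        # encoder_states: (batch, length, hidden_dim) -> pooled: (batch, hidden_dim)
        pooled = encoder_states.mean(dim=1)
        return self.classifier(GradReverse.apply(pooled, lamb))

# During training (sketch): a cross-entropy loss on the critic's language prediction
# is added to the task losses; the reversed gradient pushes the encoder toward
# language-agnostic representations.
```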
### Low-Resource to English
* **Translation:** We train high-resource to English translation on parallel data with the goal of adapting this knowledge to translate low-resource sentences.
* **Backtranslation:** We train on low-resource to English back-translation data, mirroring the back-translation task in the other direction (a sketch of how such synthetic pairs are generated follows this list).
* **Adversarial Training:** We feed sentences from the monolingual corpora of the high- and low-resource languages into the encoder, and the encoder output is trained so that its input language cannot be distinguished by a critic. The goal is to encode the low-resource data into a shared space with the high-resource language, so that the decoder trained on the translation task can be used directly. No noise was added to the input, since we did not observe an improvement from it.
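
Both directions rely on back-translation data produced with the current model. The snippet below sketches how such synthetic pairs could be generated in the usual back-translation setup; `model.translate` is a hypothetical helper wrapping beam search with the appropriate target-language token, not an API from the paper.

```python
def make_back_translation_pairs(model, target_monolingual, src_lang, tgt_lang):
    """Build synthetic parallel data for training a src_lang -> tgt_lang system:
    translate monolingual tgt_lang sentences back into src_lang, then pair the
    synthetic source with the original target sentence."""
    pairs = []
    for sent in target_monolingual:
        synthetic_src = model.translate(sent, src_lang=tgt_lang, tgt_lang=src_lang)
        pairs.append((synthetic_src, sent))  # (synthetic source, real target)
    return pairs

# For the LRL -> English direction: translate monolingual English into the LRL
# with the current English -> LRL model, then train LRL -> English on the result.
# Repeating this in both directions gives the iterative back-translation loop.
```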
## Experiments
### Dataset
The authors experiment on three groups of languages. In each group, there is a large quantity of parallel training data for one language (the high-resource language) and no parallel data for the related languages, simulating a low-resource scenario.
The three groupings are:
1. Iberian languages, where we treat Spanish as the high-resource language and Portuguese and Catalan as related lower-resource languages.
2. Indic languages, where we treat Hindi as the high-resource language, and Marathi, Nepali, and Urdu as lower-resource related languages.
3. Arabic, where we treat Modern Standard Arabic (MSA) as the high-resource language, and Egyptian and Levantine Arabic dialects as low-resource.
* Among the languages, the relationship between Urdu and Hindi is a special setting; while the two languages are mutually intelligible as spoken languages, they are written using different scripts.
*Table: The sources and size of the datasets we use for each language. The HRLs are used for training and the LRLs are used for testing.*
* For monolingual data, they randomly sample 1M sentences for each language from the CC-100 corpus (Conneau et al., 2020; Wenzek et al., 2020).
* For quality control, they filter out sentences if more than 40% of characters in the sentence do not belong to the alphabet set of the language.
* Due to quality and memory constraints, they only use sentences between 30 and 200 characters long (a minimal sketch of both filters follows this list).
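
A minimal sketch of the two filters described above (the 40% out-of-alphabet threshold and the 30-200 character length bound); the alphabet set in the usage example is an illustrative assumption, not the one used in the paper.

```python
def keep_sentence(sentence, alphabet, min_len=30, max_len=200, max_foreign_ratio=0.4):
    """Keep a monolingual sentence only if it is 30-200 characters long and at
    most 40% of its characters fall outside the language's alphabet set."""
    if not (min_len <= len(sentence) <= max_len):
        return False
    foreign = sum(1 for ch in sentence if ch not in alphabet)
    return foreign / len(sentence) <= max_foreign_ratio

# Usage example with a hypothetical alphabet set for Spanish.
spanish_alphabet = set(
    "abcdefghijklmnopqrstuvwxyzáéíóúüñ"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÓÚÜÑ"
    " .,;:!?¿¡'\"-0123456789"
)
print(keep_sentence("Los datos monolingües se filtran antes del entrenamiento.", spanish_alphabet))
```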
### Results
* **English to LRL:**
*Table: BLEU score of the first iteration on the English to low-resource direction. Both the adversarial (Adv) and back-translation (BT) components contribute to improving the results. The fine-tuning step is omitted for Urdu, as decoding is already restricted to a script set different from that of the related high-resource language.*
* **LRL to English:**
*Table: BLEU score of the first iteration on the LRL to English direction. Both the adversarial (Adv) and back-translation (BT) components contribute to improving the results.*
## Conclusion
* The authors present NMT-Adapt, a novel approach for neural machine translation of low-resource languages which assumes zero parallel data or bilingual lexicon in the low-resource language.
* Utilizing parallel data in a similar high-resource language as well as monolingual data in the low-resource language, they apply unsupervised adaptation to facilitate translation to and from the low-resource language.
* Their approach combines several tasks including adversarial training, denoising language modeling, and iterative back translation to facilitate the adaptation.
* Experiments demonstrate that this combination is more effective than any task on its own and generalizes across many different languages.