# Notes on "[Neural machine translation for low-resource languages without parallel corpora](https://link.springer.com/article/10.1007/s10590-017-9203-5)"
Author(s): Thanmay Jayakumar
## Brief Outline
#### Problem
* Parallel data is entirely absent for a large number of language pairs, which severely limits the quality of machine translation.
* A large number of the world's languages are considered low-resource languages (LRLs) because they lack linguistic resources, e.g. grammars, POS taggers, and corpora.
#### Key Idea
* The paper deals with cases of LRLs for which there is no readily available parallel data between the low-resource language and any other language, **but there may be ample training data between a closely-related high-resource language (HRL) and a third language**.
* The authors take advantage of the similarities between the HRL and the LRL in order to transform the HRL data into data similar to the LRL using transliteration.
* Even though a language might lack parallel resources, it is possible that monolingual data is available or can be collected from online news and other accurate web sources.
## Approach
The goal of this paper is to develop a language-independent method that enables MT for low-resource languages and can be applied easily, for example in an emergency situation. The method is divided into two steps:
* **Transliteration HRL → LRL:**
    * Based on the observation that many closely-related languages have high vocabulary overlap and almost completely regular orthographic or morphological differences, the authors bootstrap existing parallel resources between the HRL and the third language and use transliteration to transform the HRL side of the parallel data so that it resembles the LRL.
    * Since the language scenario does not assume any readily available parallel text between the LRL and any other language, they automatically extract transliteration pairs from Wikipedia article titles and explore an optional small bilingual glossary of the most frequent terms.
* **Back-translation of monolingual LRL data:**
    * Although a language might lack parallel texts, it is often the case that accurate monolingual online resources, such as newspapers and web data from official sources, are available.
    * Unlike phrase-based statistical MT, where monolingual data forms the language model, in neural machine translation (NMT) monolingual data is not so easily integrated without changes to the network architecture.
    * For this reason, the authors use models trained on the transliterated HRL data to back-translate the monolingual LRL data into the third language and train the final models on the resulting data (sketched below).
The assumption, explored in this work, is that accurate data is more effective when used on the target side.
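As a concrete picture of the back-translation step, here is a minimal, self-contained sketch. `PretendModel`, `build_synthetic_corpus`, and the example sentences are made-up placeholders rather than the paper's actual system; the point is only which side of the synthetic corpus the genuine LRL text ends up on.

```python
# Hypothetical stand-in for the intermediate model trained on the
# transliterated-HRL parallel data (here translating LRL -> EN).
class PretendModel:
    def translate(self, sentence: str) -> str:
        return "<synthetic EN for: " + sentence + ">"  # placeholder output

def build_synthetic_corpus(monolingual_lrl, lrl_to_en):
    """Back-translate genuine monolingual LRL sentences so that the
    accurate data ends up on the *target* side of the new corpus."""
    return [(lrl_to_en.translate(s), s) for s in monolingual_lrl]

corpus = build_synthetic_corpus(["першы сказ", "другі сказ"], PretendModel())
# Each pair is (synthetic EN source, genuine LRL target); the final
# EN -> LRL model is then trained on this synthetic parallel corpus.
```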
## Architecture
*Figure: Experimental work-flow for the NMT system*
## Experiments
### Baseline System
First, the authors trained models on the related HRL data without any transformation. Then, they applied simple word substitution HRL → LRL using the glossary to establish whether a minimum transformation would affect the quality of the output, and in which way. They evaluated with both an HRL–EN and an LRL–EN test set in order to determine whether the transformed data is more ‘HRL-like’ or ‘LRL-like’.
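As a toy illustration of this minimum transformation, the sketch below substitutes words through a tiny HRL → LRL glossary; the RU → BE entries are illustrative examples, not the paper's actual glossary.

```python
# Minimal word-substitution baseline: replace HRL words found in a small
# HRL -> LRL glossary and leave everything else untouched.
glossary = {"что": "што", "он": "ён"}  # illustrative RU -> BE entries

def substitute(sentence: str) -> str:
    return " ".join(glossary.get(word, word) for word in sentence.split())

print(substitute("он знает что делать"))  # -> "ён знает што делать"
```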
*Table: Results for the baseline systems RU ↔ EN and BE<sub>ru</sub> ↔ EN after training for 7 epochs*
### Low-resource languages: Belarusian ↔ English using Russian
*Table: Results for systems BE ↔ EN after training for 7 epochs*
### Pseudo-low-resource languages: Spanish ↔ English using Italian
*Table: Results for systems ES ↔ EN after training for 7 epochs*
## Analysis
### Transliteration
The authors found that the best results were achieved when training a sequence-to-sequence LSTM model with a
bidirectional encoder and global attention after splitting the input into characters.
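A minimal PyTorch sketch of such a model is given below. The hyperparameters, vocabulary sizes, and the dot-product form of the global attention are assumptions chosen for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CharTransliterator(nn.Module):
    """Seq2seq sketch: bidirectional LSTM encoder plus an LSTM decoder
    with Luong-style global (dot-product) attention, over characters."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hid=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(emb, 2 * hid, batch_first=True)
        self.out = nn.Linear(4 * hid, tgt_vocab)

    def forward(self, src, tgt):
        enc_out, (h, c) = self.encoder(self.src_emb(src))       # (B, S, 2H)
        # Concatenate the two directions' final states to seed the decoder.
        h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)       # (1, B, 2H)
        c0 = torch.cat([c[0], c[1]], dim=-1).unsqueeze(0)
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h0, c0))  # (B, T, 2H)
        # Global attention: every decoder step attends over all encoder states.
        scores = dec_out @ enc_out.transpose(1, 2)              # (B, T, S)
        context = scores.softmax(dim=-1) @ enc_out              # (B, T, 2H)
        return self.out(torch.cat([dec_out, context], dim=-1))  # (B, T, V)

# The input is split into characters: "мова" -> ['м','о','в','а'] -> IDs.
model = CharTransliterator(src_vocab=60, tgt_vocab=60)
src = torch.randint(0, 60, (2, 12))   # batch of 2 source character sequences
tgt = torch.randint(0, 60, (2, 10))   # teacher-forced target characters
logits = model(src, tgt)              # (2, 10, 60) scores over characters
```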
*Figure: Example of a Russian sentence transliterated into Belarusian and the original Belarusian sentence*
*Figure: Example of an Italian sentence transliterated into Spanish with the authors' method and with the method proposed by Currey et al. (2016), alongside the original Spanish sentence*
*Table: Scores for different data selection options for transliteration IT → ES, in comparison with the rule-based transliteration by Currey et al. (2016)*
* The glossary might not have yielded any improvement for transliterating the titles, but it proved useful in the case of sentences.
* The best scores for transliterating sentences according to the two character-based metrics were achieved when pairs without spelling differences were preserved and the glossary was concatenated 10× with the training data (see the sketch after this list), even though BLEU and Word Error Rate (WER) favour preserving only pairs with spelling differences.
* This effect could simply have been caused by the existence of significantly more training data. The results suggest that for closely-related languages, simple methods, such as transliteration, might be sufficient for achieving quality output.
* Additionally, the authors report results after applying the transliteration method proposed by Currey et al. (2016) on the same test sets. For both test sets, the rule-based method scores significantly lower on all metrics.
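The data selection choice above can be sketched as follows, assuming `title_pairs` holds (HRL, LRL) pairs mined from interlanguage-linked Wikipedia article titles; the pairs and the glossary entry are illustrative placeholders.

```python
# Split mined title pairs by whether the two spellings actually differ.
title_pairs = [("Минск", "Мінск"), ("Москва", "Масква"), ("Парк", "Парк")]

differing = [(hrl, lrl) for hrl, lrl in title_pairs if hrl != lrl]
identical = [(hrl, lrl) for hrl, lrl in title_pairs if hrl == lrl]

# Option A: train only on `differing`; Option B: keep everything and
# oversample the glossary by concatenating it 10x with the training data.
glossary = [("что", "што")]                        # illustrative entry
train_data = differing + identical + 10 * glossary
```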
### Effect of parallel data
The authors explore whether the presence of a very small amount of parallel data between the low-resource language and the third language can significantly enhance the quality of the output, and what the most effective method of exploiting that parallel data is.
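One of the methods compared below is transfer learning: initialise from the model trained on the synthetic corpus, then continue training on the small genuine parallel corpus. Here is a minimal sketch with a stand-in module, a placeholder checkpoint path, and a placeholder loss, none of which come from the paper.

```python
import torch
import torch.nn as nn

# Stand-in for the NMT network trained on the synthetic (back-translated) data.
pretrained = nn.Linear(8, 8)
torch.save(pretrained.state_dict(), "synthetic_pretrained.pt")  # placeholder path

# Transfer learning: initialise a fresh model of the same architecture from the
# synthetic-data checkpoint, then fine-tune on the genuine parallel corpus.
model = nn.Linear(8, 8)
model.load_state_dict(torch.load("synthetic_pretrained.pt"))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small LR for fine-tuning

for src, tgt in [(torch.randn(4, 8), torch.randn(4, 8))]:  # toy genuine pairs
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(src), tgt)  # placeholder objective
    loss.backward()
    optimizer.step()
```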
*Table: `BLEU` scores for different parallel corpus (EN → ES) sizes and methods*
*Table: `BLEU` scores for different parallel corpus (EN → ES) sizes and numbers of epochs when using transfer learning*
*Table: Size of parallel data (EN → ES) required to achieve the best model's score*
This clearly shows that the proposed method can be effective even in extreme low-resource cases.
## Conclusion
* Enabling MT for low-resource languages poses several challenges due to the lack of parallel resources available for training.
* In this paper, the authors present a language-independent method that enables MT for low-resource languages for which no parallel data is available between the LRL and any other language.
* They take advantage of the similarities between a closely-related HRL and the LRL in order to transform the HRL data into data similar to the LRL.
* For this purpose, they train neural transliteration models with transliteration pairs extracted from Wikipedia article titles and a glossary of the 200 most frequent words in the HRL corpus.
* Then, they automatically back-translate monolingual LRL data with the models trained on the transliterated HRL data and use the resulting parallel corpus to train their final models.
* The method achieves significant improvements in MT quality, especially when original, accurate data is used on the target side.
* Additional experiments in the pseudo-LRL scenario revealed that there is potential to achieve quality MT if sufficient transliteration pairs are available, as well as high-quality monolingual corpora and accurate test sets for evaluation.
* Finally, they showed that the existence of parallel corpora can lead to further improvements for sizes larger than 1,000 sentences when combined with the transfer-learning method.
* In general, the proposed method could be used to contribute to spreading vital information in disaster and emergency scenarios.