# Notes on "[Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages](https://aclanthology.org/2021.wat-1.21.pdf)"

Author(s): Thanmay Jayakumar

## Brief Outline

#### Problem
* Dravidian languages are notoriously **difficult to translate** with state-of-the-art neural models.
* This is because these languages are **morphologically very rich** and also **low-resourced**.

#### Key-Idea
* The authors focus on **subword segmentation** and evaluate **Linguistically Motivated Vocabulary Reduction (LMVR)** against the more commonly used **SentencePiece (SP)** for translating from English into four Dravidian languages.

## Introduction
* Dravidian languages are an important language family spoken by about 250 million people, primarily in Southern India and Sri Lanka (Steever, 2019).
* Kannada (KN), Malayalam (MA), Tamil (TA) and Telugu (TE) are the four most widely spoken Dravidian languages, with approximately 47, 34, 71 and 79 million native speakers, respectively.
* Together, they account for 93% of all Dravidian language speakers. Kannada, Malayalam and Tamil are classified as South Dravidian languages, while Telugu belongs to the South-Central Dravidian branch.
* All four languages are SOV (Subject-Object-Verb) languages with free word order. **They are highly agglutinative and inflectionally rich languages.**
* Additionally, each language has its own writing system (see the table below).

| ![](https://i.imgur.com/lnpfdih.png) |
| -------- |
| Table: Example sentence in English along with its translation and transliteration in the four Dravidian languages. |

## Experiments
* Given the encouraging results reported by Ataman et al. (2017) for the agglutinative Turkish language using LMVR, the authors hypothesise that translation into Dravidian languages may also benefit from a linguistically motivated segmenter.

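
To make the idea of subword segmentation more concrete, here is a minimal sketch of greedy longest-match segmentation against a fixed subword vocabulary. It is neither SentencePiece nor LMVR, and it is not taken from the paper. The function `segment`, both vocabularies, and the romanised Tamil word `vitukalil` (diacritics dropped, roughly "in the houses") are assumptions chosen purely for illustration. The point is only that the same agglutinative word can be split along morpheme boundaries or across them, depending on the vocabulary the segmenter has.

```python
def segment(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; fall back to a single character
        # when no vocabulary entry matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Hypothetical vocabularies (assumptions, not from the paper).
# morph_vocab follows morpheme boundaries: vitu "house" + kal (plural) + il (locative).
morph_vocab = {"vitu", "kal", "il"}
# freq_vocab stands in for pieces that ignore morpheme boundaries.
freq_vocab = {"vi", "tuka", "lil"}

word = "vitukalil"
morph_pieces = segment(word, morph_vocab)
freq_pieces = segment(word, freq_vocab)
print(morph_pieces)
print(freq_pieces)
```

Both segmentations concatenate back to the original word; they differ only in where the boundaries fall, which is the property the comparison between a linguistically motivated segmenter and a purely data-driven one is concerned with.
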
### Training Corpora

| ![](https://i.imgur.com/7ipXo2G.png) |
| -------- |
| Table: Approximate sizes (in thousands) of the parallel training corpora. |