# Notes on "[Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages](https://aclanthology.org/2021.wat-1.21.pdf)"
Author(s): Thanmay Jayakumar
## Brief Outline
#### Problem
* State-of-the-art neural models find Dravidian languages notoriously **difficult to translate**.
* This difficulty arises because these languages are both **morphologically very rich** and **low-resourced**.
#### Key Idea
* In this paper, the authors focus on **subword segmentation** and evaluate **Linguistically Motivated Vocabulary Reduction (LMVR)** against the more commonly used **SentencePiece (SP)** for the task of translating from English into four different Dravidian languages.
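
To make "subword segmentation" concrete, here is a minimal sketch of the general idea: splitting a word into smaller units drawn from a fixed vocabulary. It is neither SentencePiece nor LMVR; it is a toy greedy longest-match segmenter, and the function name `segment`, the vocabulary `toy_vocab` and the English example word are all assumptions chosen purely for illustration.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Illustrative only: real segmenters such as SentencePiece or LMVR learn
    their vocabulary from data and use statistical models, not this rule.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a piece matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matched: emit a single character
            # so that the loop always advances.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical vocabulary and word, not taken from the paper.
toy_vocab = {"un", "break", "able", "s"}
print(segment("unbreakables", toy_vocab))
```

In a morphologically rich language a single word can carry several suffixes, so the choice of where to cut it determines how many distinct units the translation model has to learn.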
## Introduction
* Dravidian languages are an important language family, spoken by about 250 million people located primarily in Southern India and Sri Lanka (Steever, 2019).
* Kannada (KN), Malayalam (MA), Tamil (TA) and Telugu (TE) are the four most spoken Dravidian languages with approximately 47, 34, 71 and 79 million native speakers, respectively.
* Together, they account for 93% of all Dravidian language speakers. Kannada, Malayalam and Tamil are classified as South Dravidian languages, while Telugu belongs to the South-Central Dravidian branch.
* All four languages are SOV (Subject-Object-Verb) languages with free word order. **They are highly agglutinative and inflectionally rich languages.**
* Additionally, each language has its own writing system (see the table below).
|  |
| -------- |
| Table: Example sentence in English along with its translation and transliteration in the four Dravidian languages. |
## Experiments
* Given the encouraging results reported by Ataman et al. (2017) for the agglutinative Turkish language using LMVR, the authors hypothesise that translation into Dravidian languages may also benefit from a linguistically motivated segmenter.
### Training Corpora
|  |
| -------- |
| Table: Approximate sizes (in thousands) of the parallel training corpora |