Thanmay Jayakumar

@Thanmay

Joined on Jun 18, 2021

  • Authors: Rahul Aralikatte, Neelamadhav Gantayat, Naveen Panwar. Brief Outline: Sandhi splitting is the process of splitting a given compound word into its constituent morphemes. Problem: Existing Sandhi splitting systems incorporate pre-defined splitting rules, but they have low accuracy because the same compound word can often be broken down in multiple ways, each yielding a syntactically correct split. Key Idea: … (a toy illustration of this split ambiguity follows below)
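To make the ambiguity concrete, here is a minimal, purely illustrative Python sketch of reverse-sandhi rule application. The rule set is a drastic simplification (two rules, romanized transliteration), not the paper's actual rules; it only shows why a rule-based splitter must choose among several syntactically plausible splits.

```python
# Toy reverse-sandhi rules: a surface character at a junction can arise from
# several underlying morpheme boundaries. Simplified for illustration only.
REVERSE_RULES = {
    "ā": [("a", "a"), ("ā", "a"), ("a", "ā")],  # savarṇa-dīrgha: a/ā + a/ā -> ā
    "e": [("a", "i"), ("a", "ī")],              # guṇa: a + i/ī -> e
}

def candidate_splits(word: str):
    """Enumerate every split licensed by the reverse rules."""
    splits = []
    for i, ch in enumerate(word):
        for left_tail, right_head in REVERSE_RULES.get(ch, []):
            left = word[:i] + left_tail
            right = right_head + word[i + 1:]
            if left and right:
                splits.append((left, right))
    return splits

# Both splits are rule-consistent, but only gaṇa + īśa is semantically right;
# choosing among such candidates is the problem the paper addresses.
print(candidate_splits("gaṇeśa"))  # [('gaṇa', 'iśa'), ('gaṇa', 'īśa')]
```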
  • Author(s): Thanmay Jayakumar. Brief Outline: Problem: The encoder-decoder framework for neural machine translation (NMT) has been shown to be effective in large-data scenarios, but it is much less effective for low-resource languages: neural methods are data-hungry and learn poorly from low-count events. Key Idea: … (a minimal encoder-decoder sketch follows below)
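For context on the data-hunger claim: in the encoder-decoder framework the encoder compresses the source sentence into hidden states from which the decoder generates the target, and every parameter must be estimated from parallel examples. A minimal PyTorch sketch with toy sizes and random data (not any specific paper's model) is below.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder compresses the source into a
    final hidden state, which conditions the decoder. Sizes are toy values."""
    def __init__(self, src_vocab=100, tgt_vocab=100, dim=32):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_emb(src_ids))       # h: final source state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_out)                         # next-token logits

model = TinySeq2Seq()
src = torch.randint(0, 100, (2, 7))  # batch of 2 random "source sentences"
tgt = torch.randint(0, 100, (2, 5))  # shifted target tokens
print(model(src, tgt).shape)         # torch.Size([2, 5, 100])
```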
  • Author(s): Thanmay Jayakumar. Brief Outline: Problem: Parallel data is entirely absent for a large number of language pairs, which can severely degrade machine translation quality. A large number of the world's other languages are considered low-resource languages (LRLs) because they lack linguistic resources, e.g. grammars, POS taggers, and corpora. Key Idea: …
  • Author(s): Thanmay Jayakumar. Brief Outline: Problem: Dravidian languages are notoriously difficult for state-of-the-art neural models to translate. This stems from the fact that these languages are morphologically very rich as well as low-resourced. Key Idea: … (a subword-segmentation sketch, one common mitigation for rich morphology, follows below)
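A standard mitigation for morphological richness in low-resource MT (not necessarily the summarized paper's method) is subword segmentation, so that rare inflected forms decompose into reusable pieces. A minimal sketch using the Hugging Face tokenizers library follows; the tiny romanized Kannada-like corpus is hypothetical.

```python
# Train a toy BPE vocabulary so inflected forms share stem subwords.
# Generic illustration only; the corpus below is a hypothetical placeholder.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = [
    "maneyalli iddane",   # hypothetical romanized inflections of 'mane' (house)
    "maneyinda bandanu",
    "manege hogidanu",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=60, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# Different case forms can now decompose into a shared stem plus suffix pieces.
print(tokenizer.encode("maneyalli").tokens)
```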
  • Machine Translation (MT)
    - [x] Sequence to Sequence Learning with Neural Networks [arXiv]
    - [x] Neural Machine Translation by Jointly Learning to Align and Translate [arXiv]
    - [x] Attention Is All You Need [arXiv]
    Low-resource MT tasks: a list of papers that I found interesting while exploring the task of tackling machine translation in low-resource settings, in descending order of the year published. [Google Slides] (A minimal sketch of the attention computation these papers build on follows below.)
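Since the three checked papers trace the path from plain seq2seq through attention to the Transformer, here is a minimal NumPy sketch of the scaled dot-product attention defined in "Attention Is All You Need", Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. Shapes and data are arbitrary placeholders.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```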
  • Authors: Graham Neubig, Junjie Hu. Brief Outline: Problem: NMT systems require large amounts of training data, and creating high-quality systems for low-resource languages (LRLs) is a difficult challenge on which research efforts have only just begun. A second challenge is the time it takes to create such a system: in a crisis situation, time is of the essence, and systems that require days or weeks of training will not be desirable or even feasible. How can we create MT systems for new language pairs as accurately, and as quickly, as possible? Key Idea: … (a generic warm-start fine-tuning sketch follows below)
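The preview cuts off before the paper's key idea, so as a generic illustration of why warm-starting saves time, here is a sketch of continuing training from an existing multilingual checkpoint rather than from random initialization. The model name and the one-pair "corpus" are placeholders, not the paper's setup.

```python
# Warm-start fine-tuning: load a pretrained multilingual seq2seq model and
# continue training on a tiny (hypothetical) LRL parallel corpus, instead of
# training from scratch for days or weeks.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "google/mt5-small"  # placeholder; any multilingual seq2seq checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)  # warm start, not random init

pairs = [("source sentence", "target sentence")]     # hypothetical LRL data
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for src, tgt in pairs:
    batch = tokenizer(src, return_tensors="pt")
    labels = tokenizer(text_target=tgt, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss        # standard cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```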
  • Authors: Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra, Pratyush Kumar. Brief Outline: In this paper, the authors present IndicBART, a multilingual sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. Unlike existing pre-trained models, IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. They evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Their experiments on NMT for 12 language pairs and on extreme summarization for 7 languages using multilingual fine-tuning show that IndicBART is competitive with or better than mBART50 despite containing significantly fewer parameters. (A sketch of script unification via aligned Unicode blocks follows below.)
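One concrete way to exploit orthographic similarity is to map the different Indic scripts onto a single script: the major Indic Unicode blocks are mutually aligned 128-codepoint ranges, so a character can be moved to Devanagari by a fixed offset. The sketch below shows this idea under that assumption; it is a simplification of IndicBART-style preprocessing, not the exact pipeline.

```python
# Minimal script unification: shift each codepoint from its source block
# into the Devanagari block. Block bases are from the Unicode standard;
# treating every codepoint in the range as mappable is a simplification.
BLOCK_BASE = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}

def to_devanagari(text: str, src_script: str) -> str:
    base = BLOCK_BASE[src_script]
    out = []
    for ch in text:
        cp = ord(ch)
        if base <= cp < base + 0x80:      # character sits inside the source block
            out.append(chr(cp - base + BLOCK_BASE["devanagari"]))
        else:
            out.append(ch)                # spaces, punctuation, digits, etc.
    return "".join(out)

print(to_devanagari("ಕನ್ನಡ", "kannada"))  # 'कन्नड': Kannada text in Devanagari codepoints
```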
  • Authors: Wei-Jen Ko, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Naman Goyal, Francisco Guzman, Pascale Fung, Philipp Koehn, Mona Diab. Brief Outline: Problem: The scarcity of parallel data is a major obstacle to training high-quality machine translation systems for low-resource languages. Key Idea: Fortunately, some low-resource languages are linguistically related or similar to high-resource languages …
  • Authors: Rahul Aralikatte, Miryam de Lhoneux, Anoop Kunchukuttan, Anders Søgaard. Brief Outline: This work introduces Itihāsa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics, viz. The Rāmāyana and The Mahābhārata. The main motivation behind this work is to provide an impetus for the Indic NLP community to build better translation systems for Sanskrit. The authors benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset. Introduction: … (a sketch of the kind of corpus-level BLEU evaluation used for such benchmarks follows below)
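Benchmarking translation models on a corpus like this typically means scoring model outputs against reference translations with corpus-level BLEU. A minimal sketch using sacrebleu follows; the hypothesis and reference strings are placeholders, not actual Itihāsa data or the paper's scores.

```python
# Generic corpus-level BLEU evaluation for an MT test set.
import sacrebleu

hypotheses = ["the king went to the forest"]          # model outputs (hypothetical)
references = [["the king departed for the forest"]]   # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```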
  • Problem: Despite the success of training multilingual NMT systems, there are a couple of challenges in leveraging them for zero-resource languages. Lexical-level sharing: conventionally, a multilingual NMT model has a vocabulary that represents the union of the vocabularies of all source languages. The words of the different languages therefore do not truly share the same embedding space, since each word has its own representation. This poses no problem for languages with sufficiently large amounts of data, yet it is a major limitation for extremely low-resource languages, since most of their vocabulary items will not have enough training examples, if any, to yield reliably trained models. Sentence-level sharing: it is also crucial for low-resource languages to share source-sentence representations with other similar languages. For example, if a language shares syntactic order with another language, it should be feasible for the low-resource language to share such a representation with that high-resource language. It is also important to utilize monolingual data to learn such representations, since a low- or zero-resource language may have only monolingual resources. Key Idea: Utilize a multilingual neural translation system to share lexical- and sentence-level representations across multiple source languages into one target language. (A minimal sketch of lexical-level sharing follows below.)
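One way to realize lexical-level sharing, sketched below in NumPy, is to represent every source word as a softmax-weighted mixture over a small inventory of "universal" embeddings shared by all languages, so that rare words in a low-resource language borrow statistical strength from related high-resource words. The sizes, the random queries, and the mixture mechanism are an illustrative simplification, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_universal, dim = 16, 8                  # toy shared inventory and dimension
U = rng.normal(size=(n_universal, dim))   # universal embeddings shared by all languages

def shared_lexical_rep(query_emb: np.ndarray) -> np.ndarray:
    """Mix universal embeddings by softmax similarity to a language-specific
    query vector, so every language expresses words in one shared space."""
    scores = U @ query_emb / np.sqrt(dim)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ U                    # convex combination in the shared space

# Two words from different languages with similar queries land near each
# other in the shared space, even if one language has almost no data.
q_hrl = rng.normal(size=dim)                  # a high-resource word's query (toy)
q_lrl = q_hrl + 0.05 * rng.normal(size=dim)   # a related low-resource word (toy)
a, b = shared_lexical_rep(q_hrl), shared_lexical_rep(q_lrl)
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # high cosine similarity
```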