# Notes on "[IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages](https://arxiv.org/pdf/2109.02903.pdf)"
Authors: Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan,
Ratish Puduppully, Mitesh M. Khapra, Pratyush Kumar
## Brief Outline
* In this paper the authors present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English.
* Unlike existing pre-trained models, IndicBART exploits the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages.
* They evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization.
* Their experiments on NMT for 12 language pairs and extreme summarization for 7 languages using multilingual fine-tuning show that IndicBART is competitive with or better than mBART50 despite containing significantly fewer parameters.
## Model
### Dataset and Tokenization
* The IndicBART model is trained on the IndicCorp dataset (Kakwani et al., 2020) which contains 11 Indic languages and English. The Indic languages are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu. The corpora statistics are given in Table 1.
*Table 1: Monolingual corpora statistics (in millions).*
* The model is trained on a total of approximately 450 million sentences and 9 billion tokens. All the Indic-language data is represented in a single script, i.e., Devanagari, by converting it with the IndicNLP library (Kunchukuttan, 2020); a transliteration sketch follows this list.
* The model is trained at the sentence-level, unlike the mBART50 model which is trained on contiguous fixed-size text chunks potentially spanning multiple sentences.
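As a rough illustration of the script-unification step mentioned above, the sketch below maps a Bengali sentence into Devanagari with the IndicNLP library's transliterator. The example sentence is purely illustrative (not taken from IndicCorp), and this is a sketch of the idea rather than the authors' actual preprocessing pipeline.

```python
# Minimal sketch of script unification: map Indic-script text into Devanagari
# using the IndicNLP library (Kunchukuttan, 2020). Illustrative only; this is
# not the authors' preprocessing code.
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

bengali_sentence = "আমি একটি বই পড়ছি"  # "I am reading a book" (illustrative)

# Transliterate from Bengali ("bn") into Devanagari, the script used for Hindi ("hi").
devanagari = UnicodeIndicTransliterator.transliterate(bengali_sentence, "bn", "hi")
print(devanagari)
```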
### Model details
* 35% of the words in each sentence are masked, with span lengths randomly sampled from a Poisson distribution (λ = 3.5); see the masking sketch after this list.
* The model uses 6 encoder and 6 decoder layers, with hidden and filter sizes of 1024 and 4096, respectively, and 16 attention heads.
* They use dropouts of 0.1 and label smoothing of 0.1.
* The Adam optimizer is used with a maximum learning rate of 0.001 and a weight decay of 0.00001.
* Linear learning-rate warmup over 16,000 steps is used, followed by decay.
* The model is trained with the YANMTT toolkit (Dabre and Sumita, 2021), which is based on the mBART implementation in the HuggingFace Transformers library; a rough configuration sketch also follows this list.
* A batch size of 4096 tokens is used, and the model is trained for 750,000 iterations on 48 NVIDIA V100 GPUs, which corresponds to roughly 2 epochs and took around 5 days.
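The masking sketch referenced above: a rough, illustrative re-implementation of BART-style text infilling with Poisson-sampled span lengths, operating on whitespace-split words. The `[MASK]` token and the whitespace tokenization are placeholders; this is not the authors' training code.

```python
# Illustrative sketch of the denoising objective: mask ~35% of the words in a
# sentence, with span lengths drawn from Poisson(lambda = 3.5) and each span
# replaced by a single mask token (BART-style text infilling). Kept deliberately
# simple: it may occasionally re-mask an already-masked position.
import random
import numpy as np

def infill(words, mask_ratio=0.35, poisson_lambda=3.5, mask_token="[MASK]"):
    words = list(words)
    num_to_mask = int(round(len(words) * mask_ratio))
    masked = 0
    while masked < num_to_mask:
        span = max(1, int(np.random.poisson(poisson_lambda)))
        span = min(span, num_to_mask - masked, len(words))
        start = random.randrange(0, len(words) - span + 1)
        words[start:start + span] = [mask_token]  # whole span -> one mask token
        masked += span
    return words

print(" ".join(infill("this is a simple example sentence for span masking".split())))
```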
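And the configuration sketch: the hyperparameters listed above, expressed with the HuggingFace mBART classes that YANMTT builds on. The vocabulary size is a placeholder assumption, and label smoothing (applied in the loss) is omitted; treat this as a sketch, not the authors' exact setup.

```python
# Rough sketch of the reported architecture and optimization setup using the
# HuggingFace mBART classes. vocab_size is a placeholder, not the paper's value;
# label smoothing of 0.1 would be applied in the loss and is omitted here.
import torch
from transformers import (
    MBartConfig,
    MBartForConditionalGeneration,
    get_linear_schedule_with_warmup,
)

config = MBartConfig(
    vocab_size=64_000,            # placeholder assumption
    encoder_layers=6,
    decoder_layers=6,
    d_model=1024,                 # hidden size
    encoder_ffn_dim=4096,         # filter size
    decoder_ffn_dim=4096,
    encoder_attention_heads=16,
    decoder_attention_heads=16,
    dropout=0.1,
)
model = MBartForConditionalGeneration(config)

# Adam with max LR 0.001 and weight decay 1e-5; linear warmup over 16k steps
# followed by decay across the 750k training iterations.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=16_000, num_training_steps=750_000
)

print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```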
## Experiment: NMT
The authors compare IndicBART with mBART50, which is the most directly comparable model. They study performance in: (1) low-resource scenarios, (2) in-domain and general-domain settings, and (3) multilingual training settings.
### Training datasets
* For a low-resource setting, they use the PMI subset (Haddow and Kirefu, 2020) of the WAT 2021 MultiIndicMT (Nakazawa et al., 2021) training set for fine-tuning. This represents an extremely low-resource parallel-corpus setting in which IndicBART is expected to be the most helpful.
* They also experiment with extending the PMI data with the CVIT-PIB (henceforth PIB) data (Siripragada et al., 2020), which is similar in domain to the PMI data.
* They also use the large, general-domain Samanantar corpus (Ramesh et al., 2021) to assess how well pre-trained models fine-tuned on small corpora (PMI, PIB) generalize. Note that the PMI and PIB data are included in the Samanantar data.
*Table: Statistics of parallel corpora (#sentences).*
### Models trained
*Note: Unless explicitly mentioned, the models are assumed to be trained/fine-tuned using the PMI training data.*
1. **Bi:** Bilingual models trained from scratch.
2. **MB50+Bi:** Bilingual models fine-tuned from mBART50. Since mBART50 is not explicitly trained on Kannada, Punjabi, and Oriya, the scripts for these languages are mapped to Devanagari before fine-tuning; results for them are italicized in the paper's tables.
3. **IB w/o SM+Bi:** Bilingual models fine-tuned on the IndicBART model trained without script unification.
4. **IB+Bi:** Bilingual models fine-tuned on the IndicBART model trained with script unification.
5. **M2O/O2M:** Many-to-one or one-to-many models trained from scratch.
6. **IB w/o SM+O2M/M2O:** Many-to-one or one-to-many models fine-tuned on the IndicBART model trained without script unification.
7. **IB+O2M/M2O:** Many-to-one or one-to-many models fine-tuned on the IndicBART model trained with script unification. The authors treat this model's results as their main result.
8. **IB+CORPUS:** In addition to the ones above, variations of model 7 are trained using different training corpora. (A minimal checkpoint-usage sketch follows this list.)
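The checkpoint-usage sketch referenced in item 8: loading a publicly released IndicBART checkpoint and generating a Hindi-to-English translation with HuggingFace Transformers. The checkpoint name `ai4bharat/IndicBART`, the `<2xx>` language-tag input format, and the example sentence are assumptions based on the public release rather than details from the paper, so treat this as a rough usage sketch, not the authors' exact pipeline.

```python
# Rough sketch of loading a released IndicBART checkpoint and translating a
# Hindi sentence to English. The checkpoint name and the "<2xx>" tag format are
# assumptions based on the public release, not details reported in the paper.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART")

# Assumed input format: "<sentence> </s> <2src>"; decoding starts from "<2tgt>".
text = "मैं एक किताब पढ़ रहा हूँ </s> <2hi>"  # "I am reading a book" (illustrative)
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_length=32,
    num_beams=4,
    decoder_start_token_id=tokenizer.convert_tokens_to_ids("<2en>"),
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```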
### Result
#### Main Results
*Note: Scores are reported on the WAT 2021 test set.*
*Table: Comparison of IndicBART with other models.*
* To summarize, IndicBART fine-tuned on all language pairs is significantly better than the bilingual and multilingual baselines. It is also competitive with or better than mBART50 fine-tuned on a single language pair.
* IndicBART is only about one third the size of mBART50, showing that compact models focusing on significantly fewer languages can be advantageous.
*Table: Ablation studies on the impact of multilingualism and script unification on the downstream performance of IndicBART.*
* Comparing script-unified models against original-script models, independent of multilingual or bilingual training, the former are clearly better than the latter, which suggests that script unification enables similar languages to benefit more from each other.
* Likewise, multilingual training is clearly better than bilingual training, independent of script unification. Ultimately, the combination of script unification and multilingual training tends to give the best results.
* The case of Kannada, Punjabi and Oriya illustrates the utility of script unification.
#### Impact Of Corpora Size and Domain
*Table: Ablation study of the impact of fine-tuning corpus size (PMI and its combination with PIB), compared against a model trained from scratch as well as one fine-tuned on a general-domain corpus (Samanantar).*
* The results indicate that fine-tuning IndicBART is most beneficial when evaluating on a specific domain for which only small, domain-specific training corpora are available.
* On the other hand, when working on a specific domain, it is better to use large domain-specific corpora if they are available. The authors expect that targeted, domain-specific training for the individual sub-domains of the FLORES dataset may help improve the utility of IndicBART for general-domain translation.
#### Evaluating On Unseen Languages
*Table: Evaluation of Nepali and Sinhala to English translation, which IndicBART has not seen during pre-training.*
* The baselines, trained using the unified-script IndicBART vocabulary, may seem weaker than what is reported in previous work, but note that the vocabulary was not actually trained on Nepali and Sinhala.
* Regardless, fine-tuning leads to substantial improvements in translation quality, which indicates the utility of IndicBART even for unseen languages. The authors expect additional pre-training on monolingual corpora for these languages to have an even larger impact on fine-tuning performance.
## Conclusion
* The authors present IndicBART, a multilingual, pre-trained sequence-to-sequence model for Indic languages to support development of NLG applications for Indic languages.
* IndicBART supports 11 Indic languages and English, and utilizes the orthographic similarity of Indic scripts to enable better crosslingual transfer.
* Their experiments on multilingual fine-tuning of IndicBART for NMT and extreme summarization show that the resulting models are competitive with or better than those obtained using existing large models such as mBART50.
* Due to the use of orthographic similarity, the model can be used to build translation models for languages like Sinhala and Nepali that are not included in pre-training.
* They show that script unification has a strong positive impact on translation and summarization.
* They also show that, thanks to its script-independent nature, IndicBART can readily be used to enable translation for languages such as Sinhala and Nepali that it has not been explicitly trained on.