# Notes on "[Itihāsa: A large-scale corpus for Sanskrit to English translation](https://arxiv.org/pdf/2106.03269.pdf)"
Authors: Rahul Aralikatte, Miryam de Lhoneux, Anoop Kunchukuttan, Anders Søgaard
## Brief Outline
* This work introduces Itihāsa, a large-scale translation dataset containing 93,000 pairs of Sanskrit *shlokas* and their English translations. The *shlokas* are extracted from the two Indian epics, viz. The Rāmāyana and The Mahābhārata.
* The main motivation behind this work is to provide an impetus for the Indic NLP community to build better translation systems for Sanskrit.
* The authors benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
## Introduction
* Only two authors have attempted to translate the unabridged versions of both The Rāmāyana and The Mahābhārata to English: Manmatha Nath Dutt in the 1890s and Bibek Debroy in the 2010s.
* M. N. Dutt was a prolific translator whose works are now in the public domain. These works are published in a shloka-wise format, as shown in Fig. 1, which makes it easy to automatically align *shlokas* with their translations.
Figure 1: An introductory *shloka* from The Rāmāyana. The four parts with eight syllables each are highlighted with different shades of gray.
* Though many of M. N. Dutt's works are freely available, the authors choose to extract data from The Rāmāyana (Vālmiki and Dutt, 1891) and The Mahābhārata (Dwaipāyana and Dutt, 1895), mainly due to their size and popularity.
* To the best of the authors' knowledge, this is the largest Sanskrit-English translation dataset released in the public domain.
* The authors also train and evaluate standard translation systems on this dataset: in both translation directions, Moses serves as the SMT baseline and Transformer-based seq2seq models as the NMT baselines.
## Motivation
* Since The Rāmāyana and The Mahābhārata are so pervasive in Indian culture and have been translated into all major Indian languages, there is a possibility of creating an n-way parallel corpus with Sanskrit as the pivot language.
* Because Sanskrit is a morphologically rich, agglutinative, and highly inflected language, complex concepts can be expressed in compact forms by combining individual words through *Sandhi* (euphonic merging) and *Samasa* (compounding); for example, *rāma* and *ayana* ("journey") fuse into *Rāmāyaṇa*. This also enables a speaker to potentially create an infinite number of unique words in Sanskrit.
* Essentially, a parallel corpus allows us to apply a plethora of transfer learning techniques to improve NLP tools for Sanskrit.
## Data preparation
* The translated works of The Ramayana and The Mahabharata were published in four and nine volumes respectively.
* All volumes have a standard two-column format as shown in Fig. 2.
Figure 2: Pre-processing pipeline: (i) invert the colour scheme of the PDF and dilate every detectable edge, (ii) find the indices of the longest vertical and horizontal lines on the page, and (iii) split the original PDF along the detected separator lines.
* Each page has a header with the chapter name and page number separated from the main text by a horizontal line.
* The two columns of text are separated by a vertical line.
* The process of data preparation can be divided into (i) automatic OCR extraction, and (ii) manual inspection for alignment errors.
### Automatic Extraction
* The OCR systems the authors experimented with performed poorly on the digitized documents due to their two-column format.
* They often failed to recognize line breaks, which resulted in the concatenation of text from different columns.
* To mitigate this issue, the authors use an edge detector to find the longest horizontal and vertical lines and, using the indices of the detected lines, split the original page horizontally and vertically to remove the header and separate the columns, as sketched below.
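A minimal sketch of this splitting step, assuming OpenCV on grayscale page scans; the kernel sizes and binarization choices are illustrative guesses, not the paper's actual values:

```python
import cv2
import numpy as np

def split_page(img_path):
    """Split a two-column scanned page along its separator lines."""
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    # Invert so ink becomes foreground, then binarize (Otsu threshold).
    binary = cv2.threshold(cv2.bitwise_not(img), 0, 255,
                           cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    # Directional morphological openings keep only long horizontal
    # (header rule) and vertical (column separator) strokes.
    h_kern = cv2.getStructuringElement(cv2.MORPH_RECT, (img.shape[1] // 2, 1))
    v_kern = cv2.getStructuringElement(cv2.MORPH_RECT, (1, img.shape[0] // 2))
    h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kern)
    v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kern)
    # The longest line's index = the row/column with the most foreground pixels.
    header_y = int(np.argmax(h_lines.sum(axis=1)))
    sep_x = int(np.argmax(v_lines.sum(axis=0)))
    # Drop the header, then split the body into the two columns.
    body = img[header_y + 1:, :]
    return body[:, :sep_x], body[:, sep_x + 1:]
```

Each column image can then be passed to the OCR engine separately, avoiding cross-column concatenation.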
### Manual Inspection
* An important limitation of the OCR system is that it sometimes wrongly treats large gaps between words as line breaks and moves the rest of the line's text to the end of the paragraph, which leaves translations misaligned with their *shlokas*.
* Therefore, the output of all 13 volumes was manually inspected and such misalignments were corrected.
* This was a time-consuming process; the first author inspected the output manually over the course of one year.
## Corpus
* In total, the authors extract 19,371 translation pairs from 642 chapters of The Ramayana and 73,659 translation pairs from 2,110 chapters of The Mahabharata.
Table 1: Size of the training, development, and test sets.
* It should be noted that these numbers do not correspond to the number of shlokas because, in the original volumes, shlokas are sometimes split and often combined to make the English translations flow better.
* Due to Sanskrit's agglutinative nature, the dataset is asymmetric: the number of words required to convey the same information differs widely between the two languages, which results in a very large Sanskrit vocabulary with many unique tokens (see Fig. 3). A rough way to measure this asymmetry is sketched after the figure.
Figure 3: Comparison of vocabulary sizes.
## Experiments
* The authors train one SMT and five NMT systems in both directions and report (i) character n-gram F-score, (ii) token accuracy, (iii) BLEU, and (iv) Translation Edit Rate (TER) scores in Tab. 2.
* For SMT, they use Moses; for NMT, sequence-to-sequence (seq2seq) Transformers. The seq2seq models are trained from scratch, using standard BERT architectures for both the encoder and decoder (B2B).
* These `Tiny`, `Mini`, `Small`, `Medium`, and `Base` models have 2/128, 4/256, 4/512, 8/512, and 12/768 layers/dimensions respectively.
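A sketch of how such a BERT2BERT model could be assembled with the Hugging Face `transformers` API; the vocabulary size and head counts are assumptions, not the paper's reported hyper-parameters:

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Layers / hidden dimensions from the list above; head counts follow the
# usual hidden_size / 64 convention of the released small-BERT checkpoints.
SIZES = {"tiny": (2, 128), "mini": (4, 256), "small": (4, 512),
         "medium": (8, 512), "base": (12, 768)}

def build_b2b(size: str, vocab_size: int = 32_000) -> EncoderDecoderModel:
    """Create an untrained (randomly initialized) BERT2BERT seq2seq model."""
    layers, dim = SIZES[size]
    def bert_cfg(**kw):
        return BertConfig(vocab_size=vocab_size, hidden_size=dim,
                          num_hidden_layers=layers,
                          num_attention_heads=dim // 64,
                          intermediate_size=4 * dim, **kw)
    cfg = EncoderDecoderConfig.from_encoder_decoder_configs(
        bert_cfg(),                                            # encoder
        bert_cfg(is_decoder=True, add_cross_attention=True))   # decoder
    return EncoderDecoderModel(config=cfg)

model = build_b2b("tiny")  # trained from scratch, as in the paper
```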
### Results
Table 2: Character F1, token accuracy, BLEU, and TER scores.
## Discussion
* All models perform poorly, with low token accuracy and high TER.
* While the English-to-Sanskrit (E2S) models get better with size, this pattern is not clearly seen in the Sanskrit-to-English (S2E) models. Surprisingly, the token accuracy of S2E models progressively decreases as their size increases.
* Also, Moses has the best TER among S2E models, which suggests that the seq2seq models have not been able to learn even simple co-occurrences between source and target tokens. This leads the authors to hypothesize that the Sanskrit encoders produce sub-optimal representations.
### Extensions to improve the quality of representations
* Add a *Sandhi-splitting* step to the tokenization pipeline, thereby decreasing the Sanskrit vocabulary size (a minimal sketch follows this list).
* Initialize the encoders with a pre-trained language model.
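A minimal sketch of the first extension, assuming SentencePiece for subword learning; `split_sandhi` is a hypothetical stand-in for any off-the-shelf Sanskrit sandhi splitter, not something the paper provides:

```python
import sentencepiece as spm

def split_sandhi(line: str) -> str:
    """Hypothetical stand-in for a real sandhi splitter; returns its
    input unchanged here so the sketch runs end-to-end."""
    return line

# Undo sandhi/samasa before learning subwords, so fewer distinct surface
# forms reach the tokenizer and the learned vocabulary shrinks.
with open("train.sn") as fin, open("train.split.sn", "w") as fout:
    for line in fin:
        fout.write(split_sandhi(line.strip()) + "\n")

spm.SentencePieceTrainer.train(
    input="train.split.sn", model_prefix="sn_split", vocab_size=16_000)
```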
## Conclusion
* In this work, the authors introduce Itihāsa, a large-scale dataset containing more than 93,000 pairs of Sanskrit shlokas and their English translations from The Rāmāyana and The Mahābhārata.
* First, they detail the extraction process which includes an automated OCR phase and a manual alignment phase.
* Next, they analyze the dataset to give an intuition of its asymmetric nature and to showcase its complexities.
* Lastly, they show that state-of-the-art translation models perform poorly on the dataset, demonstrating the need for further work in this area.