# Notes on "[Sanskrit Sandhi Splitting using *seq2(seq)²*](https://arxiv.org/pdf/1801.00428.pdf)"
Authors: Rahul Aralikatte, Neelamadhav Gantayat, Naveen Panwar
## Brief Outline
Sandhi splitting is the process of splitting a given compound word into its constituent morphemes.
#### Problem
* Existing Sandhi splitting systems rely on pre-defined splitting rules, but their accuracy is low because the same compound word can often be broken down in multiple ways that are all syntactically valid.
#### Key Idea
* The authors propose a novel deep learning architecture called the Double Decoder RNN (DD-RNN), which (i) predicts the location(s) of the split(s), and (ii) predicts the constituent words (thereby learning the Sandhi splitting rules).
## Model
* The critical part of learning to split compound words is to correctly identify the location(s) of the split(s).
* Therefore, two decoders are attached to the bi-directional encoder: (i) a location decoder, which learns to predict the split locations, and (ii) a character decoder, which generates the split words.
*Figure: The bi-directional encoder and decoders with attention*
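As a rough illustration of the data flow, the sketch below wires a shared bi-directional encoding to two output heads in NumPy. This is *not* the paper's implementation: the dimensions are toy values, the plain tanh RNN stands in for the paper's recurrent cells, attention is omitted, and the character decoder is reduced to a per-position readout rather than an autoregressive decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; the paper's actual sizes differ).
VOCAB, EMB, HID = 30, 8, 16

def rnn_pass(x, W, U, b, reverse=False):
    """Simple tanh RNN over a sequence of embeddings x of shape (T, EMB)."""
    h, out = np.zeros(HID), []
    steps = reversed(range(len(x))) if reverse else range(len(x))
    for t in steps:
        h = np.tanh(x[t] @ W + h @ U + b)
        out.append(h)
    if reverse:
        out.reverse()          # re-align backward states with time order
    return np.stack(out)       # (T, HID)

# Randomly initialised toy parameters.
emb = rng.normal(size=(VOCAB, EMB))
Wf, Uf, bf = rng.normal(size=(EMB, HID)), rng.normal(size=(HID, HID)), np.zeros(HID)
Wb, Ub, bb = rng.normal(size=(EMB, HID)), rng.normal(size=(HID, HID)), np.zeros(HID)
W_loc = rng.normal(size=(2 * HID, 1))       # location head: split / no-split
W_chr = rng.normal(size=(2 * HID, VOCAB))   # character head: output character

word = rng.integers(0, VOCAB, size=9)       # stand-in for a 9-character word
x = emb[word]                               # (9, EMB)

# Bi-directional encoding: concatenate forward and backward states.
H = np.concatenate([rnn_pass(x, Wf, Uf, bf),
                    rnn_pass(x, Wb, Ub, bb, reverse=True)], axis=1)  # (9, 2*HID)

# Location decoder: per-position split probability (a binary vector after thresholding).
split_probs = 1 / (1 + np.exp(-(H @ W_loc).ravel()))   # (9,)

# Character decoder (simplified): per-step distribution over output characters.
char_logits = H @ W_chr                                 # (9, VOCAB)
print(split_probs.shape, char_logits.shape)
```

The point of the sketch is only that both heads read the *same* encoder states, which is what lets the location signal learned in Phase 1 guide the character generation in Phase 2.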
### Phase 1
* Only the location decoder is trained and the character decoder is frozen.
* The character embeddings are learned from scratch in this phase along with the attention weights and other parameters.
* Here, the model learns to identify the split locations.
* For example, given the embeddings of the compound word *protsāhaḥ* as input, the location decoder learns to output the position at which the word splits (see the Example section below).
### Phase 2
* The location decoder is frozen and the character decoder is trained.
* The encoder and attention weights are allowed to be fine-tuned.
* This decoder learns the underlying rules of Sandhi splitting.
* Since the attention layer is already pre-trained to identify potential split locations in the previous phase, the character decoder can use this context and learn to split the words more accurately.
* For example, for the same input word *protsāhaḥ*, the character decoder learns to generate the constituent words (see the Example section below).
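The two-phase schedule can be summarised as alternating which parameter groups are frozen. The group names below are hypothetical labels, not identifiers from the authors' code; the sketch only encodes the freezing rules stated above.

```python
# Hypothetical parameter groups for the DD-RNN (names are illustrative).
params = {
    "encoder": ["embeddings", "enc_fwd", "enc_bwd"],
    "attention": ["attn_weights"],
    "location_decoder": ["loc_rnn", "loc_out"],
    "character_decoder": ["chr_rnn", "chr_out"],
}

def trainable(phase):
    """Return the parameter groups updated in each phase (the rest are frozen)."""
    if phase == 1:
        # Phase 1: train everything except the character decoder.
        frozen = {"character_decoder"}
    else:
        # Phase 2: freeze the location decoder; encoder and attention fine-tune.
        frozen = {"location_decoder"}
    return [g for g in params if g not in frozen]

print(trainable(1))
print(trainable(2))
```

In a real framework this would correspond to toggling gradient updates (e.g. an optimizer's parameter list) per group between the two phases.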
### Example
For example, if the input word is *protsāhaḥ*, the two phases behave as follows:
#### Phase 1
* The location decoder will generate a binary vector `[0, 0, 1, 0, 0, 0, 0, 0, 0]` which indicates that the split occurs between the third and fourth characters.
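The location decoder's training target is straightforward to construct: a binary vector with a `1` at the character position where the split occurs. A minimal helper (my own, not from the paper) reproduces the vector above:

```python
def location_target(word, split_after):
    """Binary vector marking the character after which the split occurs (1-indexed)."""
    return [1 if i == split_after else 0 for i in range(1, len(word) + 1)]

# 'protsāhaḥ' has 9 characters; the split falls at the 3rd ('o').
print(location_target("protsāhaḥ", 3))
# → [0, 0, 1, 0, 0, 0, 0, 0, 0]
```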
#### Phase 2
* The character decoder will generate `[p, r, a, +, u, t, s, ā, h, a, ḥ]` as the output, i.e. *pra + utsāhaḥ*.
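Reading the constituent words off the character decoder's output is then just a matter of splitting the sequence on the `+` separator token, e.g. with a small helper like this (illustrative, not from the paper):

```python
def constituents(chars):
    """Join a character sequence into words, splitting on the '+' separator token."""
    words, cur = [], []
    for c in chars:
        if c == "+":
            words.append("".join(cur))
            cur = []
        else:
            cur.append(c)
    words.append("".join(cur))
    return words

out = ["p", "r", "a", "+", "u", "t", "s", "ā", "h", "a", "ḥ"]
print(constituents(out))  # → ['pra', 'utsāhaḥ']
```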
## Dataset
* The *UoH* corpus, created at the University of Hyderabad, contains 113,913 words and their splits. This dataset is noisy, with typing errors and incorrect splits.
* The more recent SandhiKosh corpus (Bhardwaj et al., 2018) is a set of 13,930 annotated splits.
* The authors combine these datasets and heuristically prune them, finally obtaining 71,747 words and their splits.
* The pruning treats a data point as valid only if the compound word and its splits are present in a standard Sanskrit dictionary (Monier-Williams, 1970).
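The pruning criterion amounts to a dictionary-membership filter over each data point. A minimal sketch, with a tiny in-memory `lexicon` set standing in for the Monier-Williams dictionary:

```python
# Stand-in lexicon; the authors use the Monier-Williams Sanskrit dictionary.
lexicon = {"protsāhaḥ", "pra", "utsāhaḥ", "rāmaḥ"}

def is_valid(compound, splits, dictionary):
    """Keep a data point only if the compound and every split word are in the dictionary."""
    return compound in dictionary and all(w in dictionary for w in splits)

data = [
    ("protsāhaḥ", ["pra", "utsāhaḥ"]),  # all forms in the lexicon -> kept
    ("xyzaḥ", ["xy", "zaḥ"]),           # unknown forms -> pruned
]
pruned = [d for d in data if is_valid(*d, lexicon)]
print(pruned)  # → [('protsāhaḥ', ['pra', 'utsāhaḥ'])]
```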
## Tools
* There exist multiple Sandhi splitters in the open domain such as (i) JNU splitter (Sachin, 2007), (ii) UoH splitter (Kumar et al., 2010) and (iii) INRIA sanskrit reader companion (Huet, 2003) (Goyal and Huet, 2013).
* *Note: None of these approaches address the fundamental problem of identifying the location of the split before applying the rules; doing so significantly reduces the number of applicable rules and hence yields more accurate splits.*
## Evaluation and Results
*Figure: Top-1 split prediction accuracy comparison of different publicly available tools with DD-RNN*

*Figure: Split prediction accuracy comparison of different publicly available tools (Top-10) with DD-RNN (Top-1)*
*Table: Location and split prediction accuracy of all the tools and models under comparison*

*Figure: Split prediction accuracy comparison of different variations of RNN on words of different lengths*
## Conclusion
* In this research, the authors propose a novel double decoder RNN architecture with attention for Sanskrit Sandhi splitting.
* A deep bi-directional encoder is used to encode the character sequence of a Sanskrit word.
* Using this encoded context vector, a location decoder is first used to learn the location(s) of the split(s).
* Then the character decoder is used to generate the split words.
* The performance of the proposed approach was evaluated on the benchmark dataset against other publicly available tools, standard RNN architectures, and prior work that tackles similar problems in other languages.
#### Future work:
Tackle the harder Samāsa problem, which requires semantic information about a word in addition to its characters' context.