# Final project blogpost 2

## Description of project

For my final project, I will be contributing to a project by a friend of mine who is pursuing a CS master's at Brown. The broad goal of my friend's project, as I understand it, is to create a neural net that, given a *non*-English word, returns an English ordinary-language definition of that word. My friend has finished the first iteration of an English-definition-generating model; my main task is to adapt it to work on cross-lingual / multilingual embeddings, and then to replicate his experiments.

My project can thus be thought of as an Application project: I am trying to apply both this text generation model and cross-lingual embeddings to the task of producing English definitions of non-English words.

## What I've done

I've

* skimmed the original definition modeling paper, Noraset et al.'s "Definition Modeling: Learning to define word embeddings in natural language";
* looked at the literature on cross-lingual embeddings, in particular Sebastian Ruder's blog post on cross-lingual embeddings (https://ruder.io/cross-lingual-embeddings/), and skimmed parts of Conneau et al.'s "Unsupervised Cross-lingual Representation Learning at Scale". I plan to try using their pretrained XLM-R model for the cross-lingual embeddings;
* skimmed part of my friend's code base;
* skimmed some of the other cross-lingual embeddings papers.

Things to think about as I work on this:

* We'll want to try a reasonable variety of languages if time permits, since the performance of cross-lingual embeddings apparently varies depending on how similar the language is to English (see Gerz et al.'s "On the relation between linguistic typology and (limitations of) multilingual language modeling").
* It might make sense to try passing cross-lingual embeddings into other definition models --- other than my friend's, that is --- if time permits.
This would allow us to control for the quality of the underlying definition model when assessing how good the resulting definitions are.

## Next steps

* Learn PyTorch / work through a couple of PyTorch tutorials.
* Look at my friend's codebase more carefully.
* Figure out which models he's currently using.
* Try passing in cross-lingual embeddings from the XLM-R model.
* Read up on SentencePiece tokenization and related tokenization approaches.

# References

Thanapon Noraset, Chen Liang, Larry Birnbaum, and Doug Downey. 2016. Definition Modeling: Learning to define word embeddings in natural language.

Conneau et al. 2020. Unsupervised Cross-lingual Representation Learning at Scale.

Ruder, S. A survey of cross-lingual embedding models. https://ruder.io/cross-lingual-embeddings/

Gerz, D., Vulić, I., Ponti, E., Reichart, R., & Korhonen, A. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, 316-327. https://doi.org/10.17863/CAM.30216
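## Appendix: a first look at XLM-R embeddings

Since one of the next steps is to try passing in XLM-R embeddings, here is a minimal sketch of how one might pull a single word's embedding out of the pretrained model, assuming the Hugging Face `transformers` library. The model name `xlm-roberta-base`, the example word, and the mean-pooling step are my own illustrative assumptions, not necessarily what the project will end up using:

```python
# Sketch: extracting a cross-lingual word embedding from pretrained XLM-R.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

word = "Hund"  # German for "dog"; any non-English word works here

# XLM-R uses a SentencePiece tokenizer, so a single word may be split
# into several subword pieces.
inputs = tokenizer(word, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Drop the <s> and </s> special tokens, then mean-pool the remaining
# subword vectors into a single 768-dimensional word embedding.
subword_vectors = outputs.last_hidden_state[0, 1:-1]
embedding = subword_vectors.mean(dim=0)
print(embedding.shape)  # torch.Size([768])
```

Calling `tokenizer.tokenize(word)` shows the SentencePiece pieces directly, which should be useful background for the tokenization reading above.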