# Final project blogpost 2
## Description of project
For my final project, I will be contributing to a project by a friend of mine who is pursuing a CS master's at Brown. The broad goal of my friend's project, as I understand it, is to create a neural net that, given a *non*-English word, returns an English ordinary-language definition of that word.
My friend has finished the first iteration of an English-definition-generating model; my main task is to adapt it to work with cross-lingual embeddings and then to replicate his experiments. My project can thus be thought of as an Application project: I am trying to apply both this text-generation model and cross-lingual embeddings to the task of producing English definitions of non-English words.
## What I've done
I've
* skimmed the original definition modelling paper, Noraset et al.'s "Definition Modeling: Learning to define word embeddings in natural language".
* looked at the literature on cross-lingual embeddings, in particular Sebastian Ruder's blog post on cross-lingual embeddings (https://ruder.io/cross-lingual-embeddings/), and skimmed parts of Conneau et al.'s "Unsupervised Cross-lingual Representation Learning at Scale". I plan to try using their pretrained XLM-R model for the cross-lingual embeddings.
* skimmed part of my friend's code base.
* skimmed some of the other cross-lingual embeddings papers.
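To make the XLM-R plan concrete, here is a minimal sketch of pulling a cross-lingual word vector out of the pretrained model. This assumes the Hugging Face `transformers` library and mean-pooling over subword tokens; my friend's actual pipeline may obtain embeddings differently.

```python
# Sketch: extract a cross-lingual word vector from pretrained XLM-R.
# Assumes the Hugging Face transformers library (not necessarily what
# my friend's codebase uses) and mean-pools over subword pieces.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def word_embedding(word: str) -> torch.Tensor:
    """Mean-pool the last hidden states of the word's subword tokens."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Drop the <s> and </s> special tokens, then average the rest.
    hidden = outputs.last_hidden_state[0, 1:-1]
    return hidden.mean(dim=0)

vec = word_embedding("Hund")  # German for "dog"
print(vec.shape)  # torch.Size([768])
```

A vector like this could then be fed to the definition model in place of the monolingual English embedding it currently expects.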
Things to think about as I work on this:
* We'll want to try using a reasonable variety of languages if time permits, since the performance of cross-lingual embeddings apparently varies depending on how similar the language is to English (see Gerz et al.'s "On the relation between linguistic typology and (limitations of) multilingual language modeling").
* It might make sense to try passing cross-lingual embeddings into other definition models (not just my friend's) if time permits. This would let us control for the quality of the underlying definition model when assessing how good the resulting definitions are.
## Next steps
* Learn PyTorch / work through a couple of PyTorch tutorials.
* Look at my friend's codebase more carefully.
* Figure out what models he's currently using.
* Try passing in cross-lingual embeddings from the XLM-R model.
* Read up on SentencePiece tokenization and related tokenization approaches.
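On the last step: the reason tokenization matters here is that XLM-R's SentencePiece tokenizer can split a single input word into several subword pieces, so "the embedding of a word" is not a single lookup. A quick way to see this (again assuming the Hugging Face `transformers` library):

```python
# Sketch: inspect how XLM-R's SentencePiece tokenizer splits a word
# into subword pieces. The leading "▁" marks the start of a word.
# Assumes the Hugging Face transformers library.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
pieces = tokenizer.tokenize("Donaudampfschifffahrt")
print(pieces)
```

A rare compound like this German word gets broken into multiple pieces, which is why the embedding-extraction step needs some pooling strategy over subwords.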
## References
Thanapon Noraset, Chen Liang, Larry Birnbaum, and Doug Downey. 2016. Definition Modeling: Learning to define word embeddings in natural language.
Alexis Conneau et al. 2020. Unsupervised Cross-lingual Representation Learning at Scale.
Sebastian Ruder. A survey of cross-lingual embedding models. https://ruder.io/cross-lingual-embeddings/
Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of EMNLP 2018, pages 316-327. https://doi.org/10.17863/CAM.30216