# Adding more data

## (Existing parallel words)

The same approach you already used, but from other sources (e.g., MSWC from MLCommons): a large set of words across ~50 languages.

Highest quality (because it's human-labeled):

- PanLex
- SIL dictionaries
- MSWC
- etc.

## (Inferred parallel words)

Use multilingual language models to do the following. Medium quality (automatic tokenization and labeling of words introduces noise).

Find a multilingual language model:

- Input: text (in one of many languages)
- Output: a vector representation of that text ([......])

For each language supported by the model (steps 1 and 3 are sketched at the end of this note):

1. Download raw text data in the language (e.g., from an OPUS corpus on Hugging Face)
2. Tokenize it to get words
3. Use the multilingual word embeddings to compare those words to the words in our semdom dataset (in English)

For example:

- Say we find the word "चंद्रमा" ("moon") in Hindi
- We use the multilingual LM to convert "चंद्रमा" into a vector [.....]
- We then compare [.....] against the vector representations of each word in our English semdom dataset
- The similarity between the vectors lets us assign a semantic domain

## (Translated parallel words)

Use MT models to translate words. Variable quality.

- Here's an API for M2M100-1.2B: https://hf.space/gradioiframe/jason9693/m2m-100/api (a local alternative is sketched below)
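A minimal sketch of step 1 above. The dataset name (`Helsinki-NLP/opus-100`) and the `en-hi` config are assumptions, not from this note; any OPUS corpus that includes the target language would work the same way.

```python
# Sketch of step 1: pull raw text in one language from an OPUS corpus on
# Hugging Face. Dataset name and config are assumptions, not from the note.
from datasets import load_dataset

ds = load_dataset("Helsinki-NLP/opus-100", "en-hi", split="train")

# Each example holds a {"en": ..., "hi": ...} pair; keep only the Hindi side.
hindi_sentences = [ex["translation"]["hi"] for ex in ds.select(range(100))]
print(hindi_sentences[0])
```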
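A sketch of steps 2–3 and the "चंद्रमा" example. The model choice (LaBSE), the naive regex tokenizer, the toy `SEMDOM_WORDS` dict standing in for the real English semdom dataset, and the 0.6 similarity threshold are all assumptions for illustration.

```python
# Sketch of the embedding-comparison pipeline: tokenize raw text, embed each
# word with a multilingual model, and assign the semantic domain of the most
# similar English semdom word. Model, tokenizer, data, and threshold are
# illustrative assumptions, not the note's actual choices.
import re
from sentence_transformers import SentenceTransformer, util

# Hypothetical stand-in for the English semdom dataset: word -> semantic domain.
SEMDOM_WORDS = {
    "moon": "1.1.1.1 Moon",
    "sun": "1.1.1 Sun",
    "water": "1.3 Water",
    "eat": "5.2.2 Eat",
}

model = SentenceTransformer("sentence-transformers/LaBSE")  # assumed model choice

# Embed the English semdom words once, up front.
english_words = list(SEMDOM_WORDS)
english_vecs = model.encode(english_words, convert_to_tensor=True)

def assign_semdom(raw_text: str, threshold: float = 0.6) -> dict:
    """Map each word in raw_text to the semdom of its most similar English
    word, keeping only matches that clear the similarity threshold."""
    # Very rough tokenization; a real pipeline needs a language-aware tokenizer.
    words = sorted(set(re.findall(r"\w+", raw_text)))
    word_vecs = model.encode(words, convert_to_tensor=True)
    sims = util.cos_sim(word_vecs, english_vecs)  # (num_words, num_english)
    results = {}
    for i, word in enumerate(words):
        best = int(sims[i].argmax())
        if float(sims[i][best]) >= threshold:
            match = english_words[best]
            results[word] = (match, SEMDOM_WORDS[match])
    return results

# The worked example from the note: Hindi "चंद्रमा" ("moon") should land near
# English "moon" in the shared embedding space and inherit its domain.
print(assign_semdom("चंद्रमा"))
```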
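The linked space wraps M2M100-1.2B behind a Gradio API. As an alternative that avoids depending on that space's request format, here is a sketch of running the same checkpoint locally with Hugging Face `transformers` (assumes the `facebook/m2m100_1.2B` checkpoint and enough memory to load it):

```python
# Sketch of the MT route: translate a word into English with M2M100-1.2B,
# run locally instead of through the hosted Gradio space.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")

def translate_word(word: str, src_lang: str, tgt_lang: str = "en") -> str:
    """Translate a single word from src_lang into tgt_lang."""
    tokenizer.src_lang = src_lang
    encoded = tokenizer(word, return_tensors="pt")
    # Force the decoder to start in the target language.
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate_word("चंद्रमा", src_lang="hi"))  # expected: "moon" (or similar)
```

The translated English word can then be matched directly against the English semdom dataset, trading the embedding-similarity step for whatever quality the MT model delivers.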