Bilingual Dictionaries

# Bilingual Dictionaries The dataset consists of bilingual dictionaries. This dictionaries are used for cross-lingual word embeddings for Turkic languages. The original GitHub Repository of dataset can be reached by following [this link]( https://github.com/elmurod1202/crosLingWordEmbTurk). ## Dataset Details There are 7 dictionaries where first group of them is presented in 'already-existing' folder. Turkish-English dictionary is obtained from [MUSE](https://github.com/facebookresearch/MUSE) and Uzbek-English dictionary is obtained from The Uzbek Glossary. The second group consists of other 5 dictionaries gathered via Google Translate. Sizes of dictionaries are given in the table below. | Dictionay | Size (in words) | |-------|-------------| | Turkish-English (tr-en) | 9350 | | Uzbek-English (uz-en) | 7958 | | Azeri-English (az-en) | 7422 | | Kazakh-English (kk-en)| 8454| | Kyrgyz-English (ky-en)| 7974| ### Samples The raw data consists of dictionaries where for each row, first word is the word in Turkic language, and second is corresponding meaning in English. ``` yönetmenler directors yoksunluk deprivation pantolon trousers kutsallık holiness ``` ### Splits No split sizes are indicated for the first group of dictionaries. Indicated train/test split sizes (in words) for the second group of dictionaries are given in the table below. | Dictionay | Train | Test | |-------|-------------| ---| | Turkish-English (tr-en) | 9350 | 500| | Uzbek-English (uz-en) | 7958 | 500| | Azeri-English (az-en) | 7422 | 500| | Kazakh-English (kk-en)| 8454|500 | | Kyrgyz-English (ky-en)| 7974| 500| ## Dataset Creation ### Curation Rationale The dataset is motivated by the desire to advance Cross-lingual Word Embeddings studies for Turkic languages. The dictionaries are gathered to be used to determine which word embeddings perform better. ### Data Source and Annotations The first group of dictionaries are obtained from existing resources, MUSE, The Uzbek Glossary, and The Leneshmid Dictionary. Second group is obtained by using Google Translate API. In the second group, most frequent 30,000 English words are gathered by SketchEngine. Then words are reverse-translated to English and only the ones that are translated back to inital word are kept. ### Quality Second group dictionaries are gathered to increase the quality of the first group datasets and to get dictionaries for other target languages. After the 30,000 most frequent English words, results are merged with bilingual dictionaries gathered from [1000 Most Common Words](https://1000mostcommonwords.com/1000-most-common-english-words/) website. Finally, a cleaning process was undertaken to remove the duplications and translations that contains more than one token. Further information can be reached by [this paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.499.pdf). ## Additional Information ### Version This documentation is written based on the dataset version with commitment with id [f3d49bc](https://github.com/elmurod1202/crosLingWordEmbTurk/commit/f3d49bc4530156e04e3958076f669809c5db0c1c). ### Dataset Curators Published by Elmurod Kuriyozov, Doval Yerai, and Carlos Gomed-Rodriguez. ### Citation Information Please cite the following paper if you found this dataset useful: Elmurod Kuriyozov, Doval Yerai, and Carlos Gomed-Rodriguez. “Cross-Lingual Word Embeddings for Turkic Languages" Proceedings of the 12th Language Resources and Evaluation Conference, 2020. ``` @inproceedings{kuriyozov2020cross, title={Cross-Lingual Word Embeddings for Turkic Languages}, author={Kuriyozov, Elmurod and Doval, Yerai and G{\'o}mez-Rodr{\'\i}guez, Carlos}, booktitle={Proceedings of the 12th Language Resources and Evaluation Conference}, pages={4054--4062}, year={2020} } ```