# Turkish WordNet KeNet

This dataset is a comprehensive wordnet for Turkish. KeNet includes 77,330 synsets; it has intralingual semantic relations and is linked to the Princeton WordNet (PWN) through interlingual relations. The original paper is available [here](https://aclanthology.org/2021.gwc-1.19.pdf), and the original repository is [TurkishWordNet](https://github.com/StarlangSoftware/TurkishWordNet).

## Dataset Details

An exemplary set of synsets from KeNet is given in Table 1 below. The four most frequent parts of speech in KeNet are noun, verb, adjective and adverb; the excerpt shows a noun and an adjective. For each example, the first column shows the ID of the synset. The characters separated from the ID by "-" give the POS of the synset (n for noun, v for verb, a for adjective, adv for adverb). The second column lists the synset members; members listed in the same synset are synonyms. The third column gives the definition, and the fourth column gives an example sentence (if any) that includes one of the synset members.

| Synset ID | Synset Members | Definition | Example Sentence |
| -------- | -------- | -------- | --- |
| TUR10-0000030-n | su ab ab "water" | Hidrojenle oksijenden oluşan, oda sıcaklıgında sıvı durumunda bulunan, renksiz, kokusuz, tatsız madde | |
| TUR10-0000220-a | abajurlu "with lampshade" | Abajuru olan | Üstünde lacivert abajurlu, parlak bir madenden lamba. |

### Samples

The structure of a sample synset is as follows:

```
<SYNSET>
    <ID>TUR10-0038510</ID>
    <LITERAL>anne<SENSE>2</SENSE>
    </LITERAL>
    <POS>n</POS>
    <DEF>...</DEF>
    <EXAMPLE>...</EXAMPLE>
</SYNSET>
```

Each entry in the dictionary is enclosed by ```<SYNSET>``` and ```</SYNSET>``` tags. Synset members are represented as literals together with their sense numbers. ```<ID>``` holds the unique identifier given to the synset. The ```<POS>``` and ```<DEF>``` tags denote part of speech and definition, respectively.
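
The entry layout above can be read with Python's standard `xml.etree.ElementTree` module. The following is a minimal sketch, assuming entries follow exactly the sample structure shown; the function name `parse_synset` and the returned dictionary layout are hypothetical and not part of the original repository, and the elided `...` values are kept as they appear in the sample.

```python
import xml.etree.ElementTree as ET

# Sample entry copied from the structure shown above; "..." values are elided in the source.
SAMPLE = """
<SYNSET>
    <ID>TUR10-0038510</ID>
    <LITERAL>anne<SENSE>2</SENSE>
    </LITERAL>
    <POS>n</POS>
    <DEF>...</DEF>
    <EXAMPLE>...</EXAMPLE>
</SYNSET>
"""


def parse_synset(xml_text):
    """Parse one <SYNSET> element into a plain dict (hypothetical helper)."""
    node = ET.fromstring(xml_text.strip())
    members = []
    for literal in node.findall("LITERAL"):
        # The literal's own text precedes its <SENSE> child element.
        word = (literal.text or "").strip()
        sense = literal.findtext("SENSE")
        members.append((word, int(sense) if sense is not None else None))
    return {
        "ID": node.findtext("ID"),
        "members": members,
        "POS": node.findtext("POS"),
        "DEF": node.findtext("DEF"),
        # findtext returns None when a synset has no <EXAMPLE> tag.
        "EXAMPLE": node.findtext("EXAMPLE"),
    }


synset = parse_synset(SAMPLE)
print(synset["ID"], synset["POS"], synset["members"])
```

Returning the members as `(literal, sense)` pairs keeps each literal tied to its sense number, which matters because the same word form can appear in several synsets with different senses.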
The ```<EXAMPLE>``` tag gives a sample sentence for the synset.

### Fields

The fields of each instance are listed below.

| field | dtype |
|----------|------------|
| ID | string |
| LITERAL | string |
| SENSE | integer |
| POS | char |
| DEF | string |
| EXAMPLE | string |

### Splits

The dataset has no splits in the machine learning sense. Instead, the following table shows the part-of-speech distribution in the dataset.

|Part of Speech|# of Synsets|
|---|---|
|Nouns|44,074|
|Verbs|17,791|
|Adjectives|12,416|
|Adverbs|2,550|
|Interjections|342|
|Pronouns|68|
|Conjunctions|60|
|Postpositions|29|
|Total|77,330|

## Dataset Creation

### Curation Rationale

With the advancement of natural language processing studies, a need for machine-readable dictionaries has arisen (Miller, 1995). In an attempt to answer that need, wordnets, which store lexicographic information in a format that is adaptable to modern computing, have emerged. This study aims to build the most comprehensive Turkish wordnet compared to previous work.

### Data Source

The 2011 print of the Contemporary Dictionary of Turkish (CDT), published by the Turkish Language Institute (TLI) (Ehsani et al., 2018).

### Annotations

The annotation process is explained in the original paper as follows:

> By convention, CDT marks synonyms by using commas such that synonyms of a word are given after its definition with a separation of comma. To decide on true synonyms that must occur in the same synsets, we sliced the definitions at commas and listed the comma-separated lemmas and the rest of the definitions as candidates of synonyms. Then, those lists were displayed for linguistically-informed human annotators who decided on the synonymy relation between the lemmas and the definitions. 49,774 pairs were annotated at the end of this phase. Although some of them were included as separate entries in CDT, passivized and causativized forms of verbs were deleted from KeNet as they share the same root with their active forms.
>
> Although the vast majority of the synsets were constructed during this process, there was a need for follow-up procedures to improve the organization of the current synsets. Since the main problem encountered in synset construction was the semantic relatedness of the synset members, two other procedures were followed in order to control the synonymy relations within the synsets: the merge process and the split process.
>
> #### Merge Process
>
> In the merge process, different synsets that should be grouped together were identified and grouped as a single synset. Three things were crucial while merging the synsets: (i) having a single and unique definition for each synset, (ii) having true synonyms as synset members in each synset and (iii) having a representative first synset member in each synset. Firstly, the synsets that were created by combining the synset members with identical senses had as many definitions as the number of synset members in them since the definitions were also merged while merging the synset members. The definitions of the merged synsets were initially combined with a pipe symbol in between them. A new definition for each merged synset was written so that each synset had a single and unique definition that covers the meaning of all its synset members. None of the synset members of a synset appeared in its definition. In this process, new definitions for 10,612 number of synsets were written by the human annotators. Secondly, some synsets were found to include unrelated synset members. Therefore, another goal of the merge process was to include only the synset members that were synonyms. 1,144 number of synsets with unrelated synset members that had been identified in other parts of the work were transferred to the split process.
>
> #### Split Process
>
> In the split process, the synsets that included synset members with different senses were split and separate synsets were created for each group of related synset members.
> In order to fix this problem, we created a pool where we collected all the synsets that had unrelated synset members. We displayed these synsets on Google Sheets. Linguistically-informed human annotators then split these wrongly-merged synsets and wrote new definitions for the newly-created ones.
>
> Currently, there are 77,330 synsets, 109,049 synset members and 80,956 distinct synset members in KeNet. The POS categories that are included are nouns, verbs, adjectives, adverbs, interjections, pronouns, postpositions and conjunctions.

Details on how the semantic relations were built, including hypernymy, derivational relatedness, domain topic, part holonymy, antonymy, instance hypernymy, member holonymy, substance holonymy and interlingual relations, are discussed in the paper.

## Additional Information

### Version

This dataset is taken from the original repository at commit id ```b582406```, retrieved on 16 Oct 2022.

### Dataset Curators

Özge Bakay, Özlem Ergelen, Elif Sarmış, Selin Yıldırım, Atilla Kocabalcıoğlu, Bilge Nas Arıcan, Merve Özçelik, Ezgi Sanıyar, Oğuzhan Kuyrukçu, Begüm Avar, Olcay Taner Yıldız

### Citation Information

Please cite the following paper if you find this dataset useful:

Özge Bakay, Özlem Ergelen, Elif Sarmış, Selin Yıldırım, Bilge Nas Arıcan, Atilla Kocabalcıoğlu, Merve Özçelik, Ezgi Sanıyar, Oğuzhan Kuyrukçu, Begüm Avar, and Olcay Taner Yıldız. 2021. Turkish WordNet KeNet. In Proceedings of the 11th Global Wordnet Conference, pages 166–174, University of South Africa (UNISA). Global Wordnet Association.

```
@inproceedings{bakay21,
  title={{T}urkish {W}ord{N}et {K}e{N}et},
  year={2021},
  author={O. Bakay and O. Ergelen and E. Sarmis and S. Yildirim and A. Kocabalcioglu and B. N. Arican and M. Ozcelik and E. Saniyar and O. Kuyrukcu and B. Avar and O. T. Y{\i}ld{\i}z},
  booktitle={Proceedings of GWC 2021}
}
```

Uploaded and documented by Arda Goktogan: `ardagoktogan gmail com`.