# TRopBank: Turkish PropBank V2.0

In this paper, we present and explain TRopBank “Turkish PropBank v2.0”. PropBank is a hand-annotated corpus of propositions which is used to obtain the predicate-argument information of a language. Predicate-argument information of a language can help understand semantic roles of arguments. “Turkish PropBank v2.0”, unlike PropBank v1.0, has a much more extensive list of Turkish verbs, with 17,673 verbs in total.

The original paper is available [here](https://aclanthology.org/2020.lrec-1.336/), and the original repository is [TurkishPropBank](https://github.com/StarlangSoftware/TurkishPropBank).

## Dataset Details

For TRopBank, a total of 17,691 verbs were annotated. As the data suggests, unaccusative verbs that require a patient or theme in the ARG1 column constitute roughly 15.1% of all the annotated verbs (see the table below). Based on the data, it can be inferred that Turkish has an evident preference for verbs that require an ARG0 over ones that require an ARG1 as their subject. Moreover, a significant portion of Turkish verbs, 47.9% to be exact, have the transitive framework, so Turkish displays an observable preference regarding transitivity.

Furthermore, by having predicates that do not require any arguments, Turkish diverges from the majority of the languages whose PropBanks are reviewed in Section 2 of the paper. Even though predicates without arguments (idiomatic structures) make up less than 1% of the total, the existence of such a divergence is significant.

To sum up, TRopBank provides unprecedented data on the overall tendencies of Turkish verbs within the framework of transitivity and on the portion of idiomatic expressions. As a result, we can infer that TRopBank helps unveil the properties of the argument structure of Turkish verbs with regard to theoretical linguistics, in addition to being a valuable asset for NLP solutions.

| Statistic | Value | Percentage (%) |
| -------- | -------- | -------- |
| Verbs with no ARG0 | 3023 | 17 |
| Verbs with no ARG1 | 4486 | 25.3 |
| Verbs with no ARG2 | 15803 | 89.3 |
| Verbs with no ARG0 but ARG1 | 2681 | 15.1 |
| ARG0 | 14668 | 49.3 |
| ARG1 | 13126 | 35.8 |
| ARG2 | 1888 | 6.3 |
| ARG3 | 78 | 0.26 |
| ARG4 | 1 | 0.003 |
| pag | 14579 | 48.9 |
| ppt | 10665 | 44.1 |
| dir | 1431 | 4.8 |
| gol | 800 | 2.6 |
| loc | 814 | 2.7 |
| src | 604 | 2 |
| com | 481 | 1.6 |
| tmp | 156 | 0.5 |
| ext | 13 | 0.04 |
| Unaccusatives | 2681 | 15.1 |
| Verbs with no arguments | 79 | 0.44 |
| Entries without a sample sentence | 9941 | 56.1 |
| Intransitive verbs | 4180 | 23.5 |
| Transitive verbs | 8521 | 47.9 |
| Ditransitive verbs | 3043 | 17.2 |
| Total number of annotated entries | 17691 | |
| Total number of arguments | 32755 | |
| Average number of arguments | 1.682 | |

### Samples

A single sample from the dataset is shown below:

```
<FRAMESET id="TUR10-0000290">
    <ARG name="ARG0" function="pag">yere çöktüren kişi</ARG>
    <ARG name="ARG1" function="ppt">çöktürülen hayvan</ARG>
</FRAMESET>
```

`ID` is the ID of the synset in the KeNet dataset, and each `ARG` element is an argument of the verb. In the sample, `name` holds the argument label (such as `ARG0`) and `function` holds its function tag (such as `pag`).

### Fields

Each instance has the following fields:

| field | dtype |
|----------|------------|
| ID | string |
| name | string |
| function | string |
| value | string |

## Dataset Creation

### Curation Rationale

The paper explains the motivation behind creating this dataset as follows:

> With PropBank, our aim is to provide this indispensable contextual information through annotating the argument structure of each verb. Thus it is evident that PropBank’s function is indispensable for processing and properly interpreting Turkish. In addition, PropBank enhances numerous NLP applications (e.g. machine translation, information extraction, question answering and information retrieval) by adding a semantic layer to the syntax, which takes the whole structure one step closer to human language.

### Data Source

Synsets are taken from KeNet, which uses the 2011 print edition of the Contemporary Dictionary of Turkish (CDT), published by the Turkish Language Institute (TLI), as its data source.

### Annotations

The paper describes the main part of the annotation process as follows:

> Before starting the annotation process, the first step was sifting through the data in the Turkish wordnet KeNet (Ehsani et al., 2018; Bakay et al., 2019a; Bakay et al., 2019b; Ozcelik et al., 2019) since the corpus had to be tidied up considerably. Many of the entries were either included accidentally, or were decided to be redundant. Certain nouns that were included in the list due to their morphological resemblance to verbs, such as tokmak “mallet”, were excluded. Adjectival phrases were also excluded.
>
> The second stage of the cleanup process was the removal of rule governed verbal derivations. As mentioned previously, these were mainly passive, causative and helping verb constructions. This stage presented a minor challenge: detecting a passive or causative suffix on the verb is not enough to remove it. The verb has to have a base form that can stand on its own and the base has to share its definition with the derived form. Verbs like yürümek “to walk” and yürütmek “to make sb/sth walk” fit this definition, thus yürütmek was removed from the data set. As such, many entries had to be checked from the dictionary manually. Deciding whether an entry was a passive/causative structure that needed to be removed was not easy, and intuition had to be relied on in many cases.
>
> After the redundant verbs were removed from the data set, verbs and their definitions were reviewed. Meanings of the verbs constituted the units, thus verbs were listed for each definition and merged if synonymous. And finally, sample sentences were added for each entry in the data set. Some of these sample sentences were taken from a Turkish corpus, some were created by the annotators.

For the complete annotation process, please refer to the [original paper](https://aclanthology.org/2020.lrec-1.336/).

## Additional Information

### Version

This dataset is taken from the original repository at commit `c7799a8`, on 16 Oct 2022.

### Dataset Curators

Neslihan Kara, Deniz Baran Aslan, Büşra Marşan, Özge Bakay, Koray Ak, Olcay Taner Yıldız

### Citation Information

Please cite the following paper if you find this dataset useful:

Neslihan Kara, Deniz Baran Aslan, Büşra Marşan, Özge Bakay, Koray Ak, and Olcay Taner Yıldız. 2020. TRopBank: Turkish PropBank V2.0. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2763–2772, Marseille, France. European Language Resources Association.

```
@inproceedings{kara-etal-2020-tropbank,
    title = "{TR}op{B}ank: {T}urkish {P}rop{B}ank V2.0",
    author = {Kara, Neslihan and Aslan, Deniz Baran and Mar{\c{s}}an, B{\"u}{\c{s}}ra and Bakay, {\"O}zge and Ak, Koray and Y{\i}ld{\i}z, Olcay Taner},
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.336",
    pages = "2763--2772",
    abstract = "In this paper, we present and explain TRopBank {``}Turkish PropBank v2.0{''}. PropBank is a hand-annotated corpus of propositions which is used to obtain the predicate-argument information of a language. Predicate-argument information of a language can help understand semantic roles of arguments. {``}Turkish PropBank v2.0{''}, unlike PropBank v1.0, has a much more extensive list of Turkish verbs, with 17.673 verbs in total.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
```

Uploaded and documented by Arda Goktogan: `ardagoktogan gmail com`.
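
As a usage note, a `FRAMESET` record like the one shown under Samples can be flattened into the four fields listed under Fields (`ID`, `name`, `function`, `value`). The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The helper name `frameset_to_rows` is hypothetical, and the code assumes every record has the same shape as the sample; the actual file layout in the repository may differ (for example, records may be wrapped in an enclosing root element).

```python
import xml.etree.ElementTree as ET

# The sample record from the Samples section, copied verbatim.
SAMPLE = """<FRAMESET id="TUR10-0000290">
    <ARG name="ARG0" function="pag">yere çöktüren kişi</ARG>
    <ARG name="ARG1" function="ppt">çöktürülen hayvan</ARG>
</FRAMESET>"""


def frameset_to_rows(xml_text):
    """Hypothetical helper: turn one FRAMESET element into a list of rows.

    Each row carries the synset ID plus one argument's name, function,
    and text value. Assumes the record looks like the sample above.
    """
    root = ET.fromstring(xml_text)
    synset_id = root.get("id")
    return [
        {
            "ID": synset_id,
            "name": arg.get("name"),
            "function": arg.get("function"),
            "value": (arg.text or "").strip(),
        }
        for arg in root.iter("ARG")
    ]


rows = frameset_to_rows(SAMPLE)
for row in rows:
    print(row)
```

Producing one row per argument keeps the synset ID repeated on every row, which makes it straightforward to group or filter by `function` tag when reproducing counts like those in the statistics table.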