    Created by ali safaya on Oct 25, 2021
TDD
Documenting datasets for TDD project

TRopBank: Turkish PropBank V2.0

In this paper, we present and explain TRopBank, “Turkish PropBank v2.0”. PropBank is a hand-annotated corpus of propositions that is used to obtain the predicate-argument information of a language, which in turn helps in understanding the semantic roles of arguments. Unlike PropBank v1.0, “Turkish PropBank v2.0” has a much more extensive list of Turkish verbs, with 17,673 verbs in total. The original paper can be found here, and the original repository is TurkishPropBank.

Dataset Details

For TRopBank, a total of 17,691 verbs were annotated. As the data suggests, unaccusative verbs that require a patient or theme in the ARG1 column constitute roughly 15.1% of all the annotated verbs (see Table). Based on the data, it can be inferred that Turkish has an evident preference for verbs that require an ARG0 over ones that require an ARG1 as their subject. Moreover, a significant portion of Turkish verbs, 47.9% to be exact, have the transitive framework, so Turkish displays an observable preference for transitivity.

Furthermore, by having predicates that do not require any arguments, Turkish diverges from the majority of the languages whose PropBanks are reviewed in Section 2 of the paper. Even though predicates without arguments (idiomatic structures) make up less than 1% of the total, the existence of such a divergence is significant.
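To make the shares quoted above more concrete, the following minimal sketch converts them into approximate verb counts. It assumes the percentages are taken over the 17,691 annotated verbs; the variable and function names are illustrative and do not come from the TRopBank repository.

```python
# Total and shares are the figures quoted in the text above.
TOTAL_ANNOTATED_VERBS = 17_691
SHARES = {
    "unaccusative": 0.151,        # verbs requiring a patient/theme as ARG1
    "transitive": 0.479,          # verbs with the transitive framework
    "no_arguments_upper": 0.01,   # "less than 1%", so this is an upper bound
}


def approximate_count(total: int, share: float) -> int:
    """Convert a reported share into a rounded verb count."""
    return round(total * share)


approx_counts = {
    name: approximate_count(TOTAL_ANNOTATED_VERBS, share)
    for name, share in SHARES.items()
}

for name, count in approx_counts.items():
    print(f"{name}: about {count} verbs")
```

These counts are only as precise as the rounded percentages; the exact figures are in the table of the original paper.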

Oct 16, 2022
HistNet

In our study, we present an extensive polarity dictionary for Turkish, a resource that dictionary-based sentiment analysis studies have long needed. Our primary objective is to provide a more refined and extensive polarity dictionary than the previous SentiTurkNet. To do so, we have used a different network from the one in the referenced study. We identified approximately 76,825 synsets from KeNet, which were then manually labeled as positive, negative or neutral by three native speakers of Turkish. Subsequently, the positive and negative words were further labeled as strong or weak based on their degree of positivity or negativity. The original paper can be found here, and the original repository is TurkishHistNet.

Dataset Details

In this study, we identified approximately 76,825 synsets from KeNet. All of these synsets were manually labeled as positive, negative or neutral by three native speakers of Turkish. Following the first labeling, a second labeling process was conducted for the words that were labeled as positive or negative in the first round. To be more specific, these words were re-labeled as strong or weak based on the degree of their positivity or negativity. The following table shows the number of synsets belonging to each category:

Polarity Level
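The two-round labeling scheme described above (a polarity first, then a strength for non-neutral entries only) could be represented roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical and do not reflect the file format of the TurkishHistNet repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Polarity(Enum):
    """First-round label."""
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


class Strength(Enum):
    """Second-round label, given only to positive and negative entries."""
    STRONG = "strong"
    WEAK = "weak"


@dataclass(frozen=True)
class PolarityLabel:
    polarity: Polarity
    strength: Optional[Strength] = None

    def __post_init__(self) -> None:
        # Neutral entries were not re-labeled in the second round.
        if self.polarity is Polarity.NEUTRAL and self.strength is not None:
            raise ValueError("neutral entries carry no strength label")
        # Positive and negative entries all received a strength label.
        if self.polarity is not Polarity.NEUTRAL and self.strength is None:
            raise ValueError("positive/negative entries need a strength label")


example = PolarityLabel(Polarity.POSITIVE, Strength.WEAK)
neutral = PolarityLabel(Polarity.NEUTRAL)
```

Making the strength optional and validating it against the polarity keeps the "neutral has no degree" rule in one place instead of relying on callers to remember it.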

Oct 16, 2022
Turkish WordNet KeNet

This dataset is a comprehensive wordnet for Turkish. KeNet includes 77,330 synsets; it has intralingual semantic relations and is also linked to Princeton WordNet (PWN) through interlingual relations. The original paper can be found here, and the original repository is TurkishWordNet.

Dataset Details

An exemplary set of synsets from KeNet is given in Table 1. In this table, examples of the four most frequent parts of speech in KeNet are listed, i.e., noun, adjective, verb and adverb, respectively. For each of these examples, the first column shows the ID of the synset. The characters that are separated from the ID with "-" give the POS of the synset (n for noun, v for verb, a for adjective, adv for adverb). The second column lists the synset members; members listed in the same synset are synonyms. The third column gives the definition and, lastly, the fourth column presents an example sentence (if there is one) that includes one of the synset members.

Synset ID Synset Members Definition
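The POS convention described above can be illustrated with a small helper. This is a minimal sketch that assumes the POS tag is the last "-"-separated field of the ID as shown in Table 1; the ID used in the example is a made-up placeholder, not a real KeNet synset ID, and the function name is hypothetical.

```python
# Tag-to-name mapping taken from the description above.
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective", "adv": "adverb"}


def synset_pos(synset_id: str) -> str:
    """Return the part of speech encoded after the last '-' of a synset ID."""
    _, separator, tag = synset_id.rpartition("-")
    if not separator or tag not in POS_NAMES:
        raise ValueError(f"no recognizable POS tag in {synset_id!r}")
    return POS_NAMES[tag]


# Placeholder ID for illustration only.
print(synset_pos("EXAMPLE-0000001-adv"))  # prints "adverb"
```

Splitting from the right keeps the helper correct even if the ID itself contains further hyphens before the tag.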

Oct 16, 2022
Turkish FrameNet

Introduced in 1997, FrameNet (Lowe, 1997; Baker et al., 1998; Fillmore and Atkins, 1998; Johnson et al., 2001) has been developed by the International Computer Science Institute in Berkeley, California. It is a growing computational lexicography project that offers in-depth semantic information on English words and predicates. Based on the theory of Frame Semantics by Fillmore (Fillmore and others, 1976; Fillmore, 2006), FrameNet offers semantic information on predicate-argument structure in a way that is loosely similar to wordnet (Kilgarriff and Fellbaum, 2000).

In FrameNet, predicates and related lemmas are categorized under frames. The notion of frame is described in Frame Semantics as a schematic representation of an event, state or relationship. These packets of semantic information, called frames, consist of individual lemmas (also known as Lexical Units) and frame elements (such as the agent, theme, instrument, duration, manner, direction etc.). Frame elements can be described as semantic roles that are related to the frame. Lexical Units, or lemmas, are linked to a frame through a single sense. For instance, the lemma “roast” can mean to criticise harshly or to cook by exposing to dry heat. With its latter meaning, “roast” belongs to the Apply Heat frame.

With this version of Turkish FrameNet, we aimed to capture at least a considerable majority of the most frequent predicates, thus offering a valuable and practical resource from day one. Because Turkish is a low-resource language, it was important to ensure that FrameNet had enough coverage to be incorporated into NLP solutions as soon as it was released to the public. The original paper can be found here, and the original repository is TurkishFrameNet.

Dataset Details

In this study, a total of 139 frames in 8 domains were created. 16 of these frames were created specifically for Turkish, while the remaining 123 were translated from English FrameNet. These frames include a total of 2769 synsets (see Table). As we used the repositories of Turkish WordNet and PropBank, the Lexical Units are made of wordnet synsets; thus, some LUs contain more than one predicate. In total, 4080 predicates were annotated into their respective frames. Sample sentences for all of them were marked up with the specific roles they contain.
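The relationship between frames, frame elements and synset-based Lexical Units described above can be sketched as a small data model. All class names, field names and example values below are hypothetical illustrations; they are not the schema of the TurkishFrameNet repository, and the synset ID and second predicate are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexicalUnit:
    """A Lexical Unit backed by one wordnet synset, so it may hold several predicates."""
    synset_id: str
    predicates: List[str]


@dataclass
class Frame:
    """A frame groups frame elements (semantic roles) and Lexical Units."""
    name: str
    frame_elements: List[str]
    lexical_units: List[LexicalUnit] = field(default_factory=list)

    def predicate_count(self) -> int:
        # Predicates are counted individually, which is why the number of
        # predicates can exceed the number of synsets.
        return sum(len(lu.predicates) for lu in self.lexical_units)


# Illustrative values only: the English lemma "roast" comes from the text above,
# the rest are placeholders.
apply_heat = Frame(
    name="Apply Heat",
    frame_elements=["agent", "theme", "instrument", "duration", "manner"],
    lexical_units=[LexicalUnit("PLACEHOLDER-SYNSET-1", ["roast", "placeholder_synonym"])],
)
print(apply_heat.predicate_count())  # prints 2
```

Counting predicates per Lexical Unit in this way mirrors how 2769 synsets can account for 4080 annotated predicates.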

Oct 16, 2022