TDD

@data-tdd

Documenting datasets for TDD project

Public team


Joined on Jul 18, 2021

  • In this paper, we present and explain TRopBank, "Turkish PropBank v2.0". PropBank is a hand-annotated corpus of propositions used to obtain the predicate-argument information of a language, which in turn helps in understanding the semantic roles of arguments. Unlike PropBank v1.0, "Turkish PropBank v2.0" has a much more extensive list of Turkish verbs, with 17,673 verbs in total. The original paper can be found here, and you can access the original repository, TurkishPropBank. Dataset Details For TRopBank, a total of 17,691 verbs were annotated. As the data suggests, unaccusative verbs, which require a patient or theme in the ARG1 column, constitute roughly 15.1% of all the annotated verbs (see Table). Based on this, it can be inferred that Turkish has an evident preference for verbs that require an ARG0 over ones that require an ARG1 as their subject. Moreover, a significant portion of Turkish verbs, 47.9% to be exact, follow the transitive framework, so Turkish displays an observable preference regarding transitivity. Furthermore, by having predicates that do not require any arguments, Turkish diverges from the majority of the languages whose PropBanks are reviewed in Section 2 of the paper. Even though predicates without arguments (idiomatic structures) make up less than 1% of the total, the existence of such a divergence is significant.
  • In our study, we present an extensive polarity dictionary for Turkish, a resource that dictionary-based sentiment analysis studies have long been lacking. Our primary objective is to provide a more refined and extensive polarity dictionary than the previous SentiTurkNet. In doing so, we resorted to a different network from the referenced study: we identified approximately 76,825 synsets from KeNet, which were then manually labeled as positive, negative, or neutral by three native speakers of Turkish. Subsequently, a second labeling pass marked the positive and negative words as strong or weak based on their degree of positivity or negativity. The original paper can be found here, and you can access the original repository, TurkishHistNet. Dataset Details In this study, we identified approximately 76,825 synsets from KeNet. All of these synsets were manually labeled as positive, negative, or neutral by three native speakers of Turkish. Following this first labeling, a second labeling pass was conducted for the words labeled as positive or negative in the first round: these words were re-labeled as strong or weak based on the degree of their positivity or negativity. The following table shows the number of synsets belonging to each category: Polarity Level
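A two-level polarity dictionary of this shape lends itself to simple lexicon-based scoring. The sketch below is an illustration only: the entries and weights are invented word-level examples, while the real resource is synset-based and its file format is not shown here.

```python
# Hypothetical two-level polarity lexicon: (polarity, degree) per word.
POLARITY = {
    "harika": ("positive", "strong"),   # "wonderful"
    "iyi": ("positive", "weak"),        # "good"
    "kötü": ("negative", "weak"),       # "bad"
    "berbat": ("negative", "strong"),   # "terrible"
}

# Assumed weighting: strong entries count double.
WEIGHTS = {"strong": 2, "weak": 1}

def score(tokens):
    """Sum signed polarity weights over the tokens; the sign gives the label."""
    total = 0
    for tok in tokens:
        if tok in POLARITY:
            polarity, degree = POLARITY[tok]
            sign = 1 if polarity == "positive" else -1
            total += sign * WEIGHTS[degree]
    return total

print(score(["bu", "film", "harika"]))  # positive overall
print(score(["berbat", "bir", "gün"]))  # negative overall
```

A real system would first map tokens to KeNet synsets (e.g., via word sense disambiguation) before looking up polarity.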
  • This dataset is a comprehensive wordnet for Turkish. KeNet includes 77,330 synsets; it has intralingual semantic relations and is linked to PWN through interlingual relations. The original paper can be found here, and you can access the original repository, TurkishWordNet. Dataset Details An exemplary set of synsets from KeNet is given in Table 1. This table lists examples of the four most frequent parts of speech in KeNet: noun, adjective, verb, and adverb. For each of these examples, the first column shows the ID of the synset; the characters separated by "-" from the ID give the POS of the synset (n for noun, v for verb, a for adjective, adv for adverb). The second column lists the synset members; members listed in the same synset are synonyms. The third column gives the definition, and the fourth column presents an example sentence (if there is one) containing one of the synset members. Synset ID Synset Members Definition
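The ID convention described above (the POS code after the final "-") can be decoded mechanically. A minimal sketch, with invented synset IDs that follow the stated suffix convention:

```python
# Map the POS suffix codes from the ID convention to spelled-out names.
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective", "adv": "adverb"}

def synset_pos(synset_id):
    """Return the spelled-out POS encoded after the final '-' of the ID."""
    suffix = synset_id.rsplit("-", 1)[1]
    return POS_NAMES[suffix]

# Example IDs are hypothetical; only the suffix convention is from the text.
print(synset_pos("TUR10-0000010-n"))
print(synset_pos("TUR10-0000011-adv"))
```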
  • Introduced in 1997, FrameNet (Lowe, 1997; Baker et al., 1998; Fillmore and Atkins, 1998; Johnson et al., 2001) has been developed by the International Computer Science Institute in Berkeley, California. It is a growing computational lexicography project that offers in-depth semantic information on English words and predicates. Based on the theory of Frame Semantics by Fillmore (Fillmore and others, 1976; Fillmore, 2006), FrameNet offers semantic information on predicate-argument structure in a way that is loosely similar to a wordnet (Kilgarriff and Fellbaum, 2000). In FrameNet, predicates and related lemmas are categorized under frames. The notion of a frame is thoroughly described in Frame Semantics as a schematic representation of an event, state, or relationship. These semantic information packets, called frames, consist of individual lemmas (also known as Lexical Units) and frame elements (such as the agent, theme, instrument, duration, manner, direction, etc.). Frame elements can be described as semantic roles related to the frame. Lexical Units, or lemmas, are linked to a frame through a single sense. For instance, the lemma "roast" can mean to criticise harshly or to cook by exposing to dry heat. With the latter meaning, "roast" belongs to the Apply Heat frame. With this version of Turkish FrameNet, we aimed to capture at least a considerable majority of the most frequent predicates, thus offering a valuable and practical resource from day one. Because Turkish is a low-resource language, it was important to ensure that FrameNet had enough coverage to be incorporated into NLP solutions as soon as it is released to the public. The original paper can be found here, and you can access the original repository, TurkishFrameNet. Dataset Details In this study, a total of 139 frames in 8 domains were created. 16 of these frames were created specifically for Turkish, while the remaining 123 were translated from English FrameNet. These frames include a total of 2,769 synsets (see Table). As we used the repositories of Turkish WordNet and PropBank, the Lexical Units are made of wordnet synsets; thus, some LUs contain more than one predicate. The total number of predicates annotated in this study is 4,080. In other words, 4,080 predicates were assigned to their respective frames, and sample sentences for all of them were marked up for the specific roles they contain.
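The frame / Lexical Unit / frame element organization described above can be modelled as a small nested record. Apart from the frame name, which appears in the text, everything below (synset IDs, lemmas, element names) is an invented example, not actual Turkish FrameNet content.

```python
# One frame groups frame elements and Lexical Units; because LUs here
# are wordnet synsets, a single LU may carry several predicates.
frame = {
    "name": "Apply_Heat",
    "frame_elements": ["Agent", "Food", "Heating_instrument"],
    "lexical_units": [
        {"synset_id": "TUR10-0000123-v", "lemmas": ["kızartmak"]},
        {"synset_id": "TUR10-0000456-v", "lemmas": ["pişirmek", "haşlamak"]},
    ],
}

def predicates_in_frame(f):
    """Flatten the LU synsets into the list of predicates they contain."""
    return [lemma for lu in f["lexical_units"] for lemma in lu["lemmas"]]

print(predicates_in_frame(frame))
```

This mirrors the count logic in the text: 2,769 synsets (LUs) expand to 4,080 annotated predicates once multi-member synsets are flattened.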
  • The corpus consists of pairs of sentences with semantic similarity scores based on human judgments, allowing experimentation with both paraphrase identification (PI) and semantic similarity. The data collection and scoring methodology is described, and the corpus and first PI experiments are reported. Their approach to PI is novel in using 'lean knowledge' methods (i.e., not using manually created knowledge bases or processing tools based on them). Dataset Details The TuPC creation strategy combines methodologies from the pre-built corpora MSRPC [30] and TPC [34]. Sentences were automatically extracted and matched from daily news sites, and the candidate pairs were then manually annotated by examining their context. Unlike MSRPC, but like TPC, candidate pairs were scored according to semantic similarity. Label Description 0 UNRELATED On different topics
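A scored pair corpus like this is typically consumed as (sentence1, sentence2, label) records. A minimal sketch with invented sentences, assuming only the label semantics shown above (0 = unrelated, higher scores = more similar):

```python
# Invented example records; the real TuPC labels and file format may differ.
pairs = [
    ("Okullar yarın açılıyor.", "Yarın okullar açılacak.", 3),
    ("Dolar yükseldi.", "Maç berabere bitti.", 0),
]

def split_by_label(records, threshold=1):
    """Split records into paraphrase-like vs. unrelated by score threshold."""
    paraphrase_like = [r for r in records if r[2] >= threshold]
    unrelated = [r for r in records if r[2] < threshold]
    return paraphrase_like, unrelated

similar, unrelated = split_by_label(pairs)
print(len(similar), len(unrelated))
```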
  • https://github.com/olcaytaner/TurkishAnnotatedCorpus-15 https://github.com/olcaytaner/TurkishAnnotatedTreeBank-15 This dataset presents the first multilayer annotated corpus for Turkish, which is a low-resource agglutinative language. Our dataset consists of 9,600 sentences translated from the Penn Treebank corpus. The annotated layers contain syntactic and semantic information, including morphological disambiguation of words, named entity annotation, shallow parse, sense annotation, and semantic role label annotation. The original paper can be accessed here. Dataset Details The corpus currently contains 9,600 sentences, whose original English counterparts are taken from Penn Treebank. Annotated layers include morphological disambiguation of words, named entities, shallow parse, senses, and semantic role labels.
  • The dataset consists of bilingual dictionaries. These dictionaries are used for cross-lingual word embeddings for Turkic languages. The original GitHub repository of the dataset can be reached by following this link. Dataset Details There are 7 dictionaries; the first group is presented in the 'already-existing' folder. The Turkish-English dictionary is obtained from MUSE and the Uzbek-English dictionary from The Uzbek Glossary. The second group consists of the other 5 dictionaries, gathered via Google Translate. The sizes of the dictionaries are given in the table below. Dictionary Size (in words)
  • The dataset is a collection of analogical reasoning task pairs prepared by O. Gungor and E. Yildiz. It contains 4 sets of pairs with different types of affixes, along with word vectors obtained with the skip-gram algorithm in their studies. The GitHub repository of the dataset and the paper can be found via the given links. Dataset Details Each pair contains two words in their simple versions followed by their affixed forms. The dataset contains 4 subsets obtained by different pairings according to their affixes. File name Description Number of pairs noun_inflection_quads.txt
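Affix quads like these are normally evaluated with the vector-offset analogy test: predict the affixed form of a query word from a known (base, affixed) pair. A toy sketch, with invented 2-d vectors standing in for real skip-gram embeddings:

```python
import math

# Toy embeddings; in the real task these come from the provided
# skip-gram vectors. The plural offset here is roughly [0, 1].
vectors = {
    "ev": [1.0, 0.0],
    "evler": [1.0, 1.0],
    "araba": [3.0, 0.0],
    "arabalar": [3.0, 1.0],
}

def nearest(query_vec, exclude):
    """Return the vocabulary word closest to query_vec, skipping `exclude`."""
    def dist(w):
        return math.dist(vectors[w], query_vec)
    return min((w for w in vectors if w not in exclude), key=dist)

# Analogy: evler - ev + araba should land near "arabalar".
a, b, c = "ev", "evler", "araba"
target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
print(nearest(target, exclude={a, b, c}))  # → arabalar
```

Real evaluations use cosine similarity over high-dimensional vectors, but the offset arithmetic is the same.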
  • Test dataset is a collection of lorem ipsums. It was collected from websites like loremipsum.io. The main goal of creating this dataset is for it to be used for filling text boxes. Dataset Details This dataset consists of 100K samples of lorem ipsums, each annotated with an integer label in [0, 1, 2]. Label
  • 140 Poem dataset contains 140 Turkish poems from 7 different authors, each with 20 poems. The dataset is generated by the Kemik Natural Language Processing Group. Dataset Details The dataset consists of 140 singly authored documents written by 7 different authors, with 20 different texts written by each author. The average length of the texts is 109 words. Samples A sample instance is presented below. Example:
  • 2500 Column Writings dataset contains 2500 Turkish column writings from 50 different authors, each with 50 column writings. The dataset is generated by the Kemik Natural Language Processing Group. Dataset Details The dataset consists of 2500 singly authored documents written by 50 different authors, with 50 different texts written by each author. The average length of the texts is 398 words. Samples A sample instance is presented below. Example:
  • 630 Column Writings dataset contains 630 Turkish column writings from 18 different authors, each with 35 column writings. The dataset is generated by the Kemik Natural Language Processing Group. Dataset Details The dataset consists of 630 singly authored documents written by 18 different authors, with 35 different texts written by each author. The texts are drawn from 3 different classes (politics, popular interest, and sports) so that the dataset can also be used to determine the genre of a document. The same dataset is composed of 4 female and 14 male authors, so it can likewise be used to determine the gender of the author. The average length of the texts is 398 words. Samples A sample instance is presented below. Example:
  • 90 Column Writings dataset contains 90 Turkish column writings from 9 different authors, each with 10 column writings. The dataset is generated by the Kemik Natural Language Processing Group. Dataset Details This dataset was created for the Text2arf text presentation library, presented in Ender Can et al. The average length of the texts is 466 words. Samples A sample instance is presented below. Example:
  • This dataset contains 910 Turkish column writings from 69 different authors. The dataset is generated by the Kemik Natural Language Processing Group. Dataset Details This is a classification dataset consisting of 910 samples. The average length of the texts is 465 words. Each sample belongs to one of the following authors: Abbas Güçlü, Cem Suer, Fatih Altaylı, İsmet Berkan, Nuray Mert
  • 270 Column Writings dataset contains 270 Turkish column writings from 18 different authors, each with 15 column writings. The dataset is generated by the Kemik Natural Language Processing Group. Dataset Details The selected texts comprise essays on politics, magazine topics, and medicine. The average length of the texts is 456 words. Samples A sample instance is presented below. Example:
  • A rule-based method for determining sentence boundaries has been developed for Turkish news texts. By including direct quotations, which had not been addressed in this problem before, punctuation ambiguities at the end of a sentence are eliminated by means of a single regular expression. GitHub repo of the dataset: https://github.com/ideateknoloji/SplitExp Dataset Details This dataset has been created using quotes that frequently occur in Turkish news texts. More than one case was evaluated, and a matcher was created over the samples that fit each case. Cases of end-of-sentence markers: case percentage
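As an illustration of the idea (not the actual expression from the SplitExp repository), a single lookaround-based regex can avoid splitting on the terminator inside a direct quotation, because that terminator is followed by a closing quote and then lowercase text:

```python
import re

# Split after ., ! or ? followed by whitespace, but only when the next
# character starts a new sentence (uppercase letter or an opening quote).
# A period inside a quotation ("... geliyorum." dedi.) is not followed by
# an uppercase letter, so no split happens there.
SENT_END = re.compile(r'(?<=[.!?])\s+(?=[A-ZİĞÜŞÖÇ"])')

def split_sentences(text):
    return SENT_END.split(text)

text = 'Bakan "Yarın geliyorum." dedi. Toplantı ertelendi.'
for s in split_sentences(text):
    print(s)
```

This toy pattern misses many real cases (abbreviations, numbers, ellipses); the point is only that lookarounds let one expression handle quotation-final punctuation.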
  • This Turkish-English parallel corpus was created for phrase-based statistical machine translation from English into Turkish. The dataset consists only of the Turkish part of the parallel corpus. Because large-scale parallel text resources are lacking, this parallel corpus was created by augmenting the limited training data with content words and phrases. The parallel data consists mainly of documents on international relations and legal documents from sources such as the Turkish Ministry of Foreign Affairs, the EU, etc. Dataset Details This dataset consists only of the Turkish part of the corpus. The Turkish data is represented with its morphological segmentation, but instead of surface morphemes, lexical morphemes are used to abstract away word-internal details and conflate statistics for seemingly different suffixes. The Turkish side of the data consists of 56,609 sentences. These sentences include 13,767 unique words. The data contains 102,998 morphological tags, 18,237 of which are unique, along with 18,006 unique roots and 231 unique suffixes. Samples
  • This dataset is a collection of newspaper documents. It can be used for Information Retrieval studies. It contains documents, queries, and query relevance judgments. You can reach the paper and GitHub repositories of this dataset via the given links. Dataset Details There are 408,305 documents gathered from the news articles of the Turkish newspaper Milliyet from the years 2001-2005. The average number of words per document before stop-word elimination is 234. There are 72 ad hoc queries. To determine the relevant documents, the pooling concept is used: the assessors evaluated the documents in the pool, and the rest of the documents, those not in the pool, were assumed to be irrelevant. The relevance judgments contain the pool documents for each query, with the result of the assessors' relevance assessment in the last column (0: irrelevant, 1: relevant). Detailed information about the pooling and TREC approach can be reached here. Samples <DOC> <DOCNO> 50000 </DOCNO>
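Documents in this kind of collection follow the TREC SGML-like layout shown in the sample and can be parsed with a couple of regular expressions. A minimal sketch; only the DOC and DOCNO fields appear in the sample, so any other fields in the real Milliyet collection would need their own patterns:

```python
import re

# Match whole documents and the DOCNO field inside them, per the sample.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(\S+)\s*</DOCNO>")

def parse_docs(stream):
    """Yield (docno, body) for each <DOC>...</DOC> block in the stream."""
    for body in DOC_RE.findall(stream):
        m = DOCNO_RE.search(body)
        yield (m.group(1) if m else None, body)

# Toy input modelled on the sample above; the body text is invented.
raw = "<DOC> <DOCNO> 50000 </DOCNO> Örnek haber metni. </DOC>"
for docno, _ in parse_docs(raw):
    print(docno)  # → 50000
```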
  • Semantic Textual Similarity (STS) benchmark Turkish (STSb-TR) dataset is the machine-translated version of the English STS benchmark dataset, produced using the Google Cloud Translation API. No human corrections have been made to the translations. The official website for the STSb Turkish dataset can be found at STSb-TR. Dataset Details The dataset consists of 8,628 machine-translated Turkish sentence pairs. In this dataset, each sentence pair was annotated by crowdsourcing and assigned a semantic similarity score. Scores range from 0 (no semantic similarity) to 5 (semantically equivalent). Samples An example from the STSb-TR dataset is given below. Example:
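Systems evaluated on an STS benchmark are conventionally scored by the Pearson correlation between predicted similarity scores and the gold 0-5 annotations. A self-contained sketch; the score lists below are invented for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented gold (0-5) annotations and system predictions.
gold = [0.0, 2.5, 5.0, 3.8]
pred = [0.2, 2.0, 4.7, 3.5]
print(round(pearson(gold, pred), 3))
```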
  • The Exams (Eχαμs) dataset presents a new benchmark for multilingual and cross-lingual question answering. Dataset Details This dataset contains 24,143 questions with their choices in total, from 16 languages. The questions are high-quality high school exam questions from subjects such as Natural Sciences, Social Sciences, and the Arts. Cross-lingual samples are obtained by gathering parallel examinations, since some exams in the dataset were offered in several languages in some countries. There are 9,857 parallel question pairs across seven languages. The structure of the questions is as given in the samples: the stem of the question is given as a string, and the choices are a list with labels such as "A" and "B". The correct answer is given in 'answerKey', and 'info' contains the grade, subject, and language information. You can reach the ACL paper and GitHub repositories of this dataset via the given links.
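Based on the structure described above (stem, labelled choices, 'answerKey', 'info'), one question can be modelled as a small record. The content below is invented for illustration, not an actual Exams item:

```python
# Hypothetical question record following the described field layout.
question = {
    "stem": "Su molekülü hangi atomlardan oluşur?",
    "choices": [
        {"label": "A", "text": "Hidrojen ve oksijen"},
        {"label": "B", "text": "Karbon ve oksijen"},
        {"label": "C", "text": "Azot ve hidrojen"},
    ],
    "answerKey": "A",
    "info": {"grade": 8, "subject": "Natural Sciences", "language": "Turkish"},
}

def correct_choice(q):
    """Return the text of the choice whose label matches answerKey."""
    return next(c["text"] for c in q["choices"] if c["label"] == q["answerKey"])

print(correct_choice(question))  # → Hidrojen ve oksijen
```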