The 630 Column Writings dataset contains 630 Turkish column writings by 18 different authors, with 35 column writings per author. The dataset was generated by the Kemik Natural Language Processing Group.
All 630 documents are singly authored. The texts were drawn from three genres (politics, popular interest and sports) so that the dataset can also be used to determine the genre of a document. Likewise, the authors comprise 4 women and 14 men so that the dataset can be used to determine the gender of an author. The average text length is 398 words.
A sample instance is presented below.
Example:
İnteraktif günler Bugün yılın ilk pazartesisi. Yine interaktif günlerimizden birindeyiz. Doktor Mehmet Uhri hüzünlü bir yazı yollamış, benim hoşuma gitti, sizinle paylaşmak istiyorum. Ben ev taşırken sizi Mehmet Uhri ile baş başa bırakıyorum. HAVALARIN sürekli kapalı gittiği günlerdeydik. Kış bitmiyor, bahar bir türlü kendini göstermiyordu. Karamsarlık ve iç sıkıntısı sanki havayla birlikte insanların yüreğine de çöküyordu. O gün öğleden sonra güneş sıcak yüzünü gösterir gibi oldu. Hastane ortamından kaçma isteğiyle, işlerimi toparlayıp yakınımızdaki parka yöneldim. Boş banklardan birine oturup koltuğumun altındaki gazetenin sayfalarını çevirmeye başladım. Yaşlıca bir bey, izin isteyerek, bankın diğer ucuna oturdu. Cebinden çıkardığı ekmeği ufalayarak sağa sola atmaya başladı. Serçelerin, coşkuyla sunulan ekmeği ufalama çabaları o kadar güzeldi ki, ürkütmemek için kafamı gazeteme gömdüm. Göz ucuyla da bakıyorum. Bir süre sonra adamın kuşlara bir şeyler söylediğini, daha doğrusu konuşmaya çabaladığını fark edince ilgisiz kalamadım. Mırıl mırıl bir şeyler anlatıyordu. Cebimdeki bisküvilerden birini ufalayıp ben de kuşların ziyafetine katkıda bulunmak istedim. Adam, ellerimi tutarak engel oldu. - Onlar şekerli bisküvi değil mi? - Evet. - Şekerli bisküvi verme kuşlara! - Niçin? Onlara zarar mı verir? - Anlatması uzun sürer şimdi. Kuşlara iyilik yapmak istiyorsan, şekerli bisküvi verme o kadar... *** Şaşırmıştım. Sert, hatta biraz kaba bir üslupla söylenen bu sözler merakımı uyandırmıştı. - Minicik kuşlara zararlıysa, bizler de mi yemesek bu bisküvileri acaba? diyecek oldum. Baştan aşağıya dikkatlice süzdükten sonra beni, dedi ki: - Şehirde doğmuş büyümüş birine benziyorsun. Sen yiyebilirsin. Sana zarar vermez! Çattık dedim içimden. Adam biraz kaçık diye düşünmeye başlamıştım ki: - Beyim dedi. Ben köyde büyüdüm. Şehirden hep uzak durdum. Ne zaman ki, torunum dünyaya geldi, onun hatırına kışları şehre, torunumun yanına gelmeye başladım. Ama şehirden nefret ediyorum. 
Alışamadım. Biraz güneş çıktığında hemen kendimi parka atıyorum. Şu ileride, salıncakta sallanan kırmızılı kız da benim torunum... - Allah bağışlasın. Kaç yaşında? - Dört. Seneye yuvaya gidecek inşallah. O zaman, ben de onun başını beklemekten kurtulup, kaçacağım bu şehirden... - Nedir sizi bu kadar rahatsız eden? Neden kaçıyorsunuz? Burada her şey var! - Tam da bu yüzden kaçmak istiyorum ya! Şu kuşlara bir bak hele. Ekmek kırıntılarıyla karınlarını doyururlar. Onlara şekerli bisküvi verirsen, daha da severek yerler. Ne var ki, bisküvinin tadını alan kuşlar kuru ekmeğe bakmamaya başlar. Sonra da aç kalırlar. Dahası, şekerli bisküvi iştahlarını açar. Doysalar bile, yemeğe devam ederler. Çatlayıncaya kadar yerler. İşte o yüzden engel oldum onlara bisküvi vermene... - Ben tam olarak anlayamadım sizi! - İnsanlar da böyle. Şehirde her şeyden bol bol var. Şehre ve modern hayata alışan bu kuşlar gibi oluyor. Ne yese doymuyor! Şehir bozuyor insanları. Ben de bu şehir insanları gibi olmadan bir önce köye dönmek istiyorum... Hiç sesimi çıkarmadım. - Bilir misin, diye sürdürdü konuşmasını. Çiçeğe ihtiyacından fazla su verirsen, boğulduğunu anlamadan yaşar ama yavaş yavaş kökleri çürür, şehir insanları da böyle... Derin bir iç çekti. Cebinde kalan son ekmek kırıntılarını da serptikten sonra ayağa kalktı, kaygılı gözlere salıncakta sallanan torununa baktı ve... - Şehirliye anlatması zor! dedi. Sonra da yürüdü gitti...
Each file contains one column writing, and column writings by the same author are stored in the same directory. However, the actual author names, as well as the gender and genre information mentioned in the related paper, are missing.
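The directory layout described above can be read with a few lines of Python. This is a minimal sketch, not a loader shipped with the dataset: the function name `load_columns`, the `root/<author_dir>/<file>` layout and the UTF-8 encoding are all assumptions, and the original files may well use a different encoding (for example a Turkish code page).

```python
from pathlib import Path


def load_columns(root, encoding="utf-8"):
    """Return (text, author_label) pairs from a root/<author_dir>/<file> layout.

    Hypothetical helper: the layout and the default encoding are assumptions.
    """
    samples = []
    root = Path(root)
    # Sort directories and files so the order is deterministic.
    for author_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for text_file in sorted(p for p in author_dir.iterdir() if p.is_file()):
            text = text_file.read_text(encoding=encoding)
            # The directory name is the only author label available,
            # since the real author names are not distributed.
            samples.append((text, author_dir.name))
    return samples
```

Because real names, gender and genre are missing, the directory name is the only label this sketch can return.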
No split is provided by the dataset creators.
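Since no official split exists, users have to create their own. The sketch below shows one possible approach, a per-author (stratified) hold-out using only the standard library. The function name `stratified_split`, the 20% test fraction and the fixed seed are illustrative assumptions, not a recommendation from the dataset creators.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (text, author) pairs so each author keeps the same test share.

    Hypothetical helper; test_fraction and seed are illustrative defaults.
    """
    by_author = defaultdict(list)
    for text, author in samples:
        by_author[author].append(text)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    train, test = [], []
    for author in sorted(by_author):
        texts = list(by_author[author])
        rng.shuffle(texts)
        n_test = round(len(texts) * test_fraction)
        test.extend((t, author) for t in texts[:n_test])
        train.extend((t, author) for t in texts[n_test:])
    return train, test
```

With 35 texts per author and `test_fraction=0.2`, this holds out 7 texts per author and leaves 28 for training.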
The main goal of this dataset is authorship attribution, that is, classifying texts by their authors.
The dataset creators gathered the column writings from online news sources between 2005 and 2009.
All of the articles presented have already been published publicly. Although some of them may contain personal information, all of the information presented was published within a legal framework.
This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures.
The data included here are from the news. Some of the presented articles may have been disclaimed.
Published by M. Fatih Amasyalı and Banu Diri
@article{amasyali2006automatic,
author = {Amasyali, MF and Diri, B},
title = {Automatic Turkish Text Categorization in Terms of Author, Genre and Gender},
journal = {11th International Conference on Applications of Natural Language to Information Systems-NLDB},
year = {2006},
volume = {LNCS Volume 3999}
}
In this paper, we present and explain TRopBank, “Turkish PropBank v2.0”. PropBank is a hand-annotated corpus of propositions which is used to obtain the predicate-argument information of a language. Predicate-argument information of a language can help in understanding the semantic roles of arguments. “Turkish PropBank v2.0”, unlike PropBank v1.0, has a much more extensive list of Turkish verbs, with 17,673 verbs in total. The original paper can be found here, and the original repository is TurkishPropBank.
Dataset Details
For TRopBank, a total of 17,691 verbs were annotated. As the data suggests, unaccusative verbs that require a patient or theme in the ARG1 column constitute roughly 15.1% of all the annotated verbs (see Table). Based on the data, it can be inferred that Turkish has an evident preference for verbs that require an ARG0 over ones that require an ARG1 as their subject. Moreover, a significant portion of Turkish verbs, 47.9% to be exact, have the transitive framework, so Turkish displays an observable preference regarding transitivity. Furthermore, by having predicates that do not require any arguments, Turkish diverges from the majority of the languages whose PropBanks are reviewed in Section 2 of the paper. Even though predicates without arguments (idiomatic structures) make up less than 1% of the total, the existence of such a divergence is significant.
Oct 16, 2022
In our study, we present an extensive polarity dictionary for Turkish, something that dictionary-based sentiment analysis studies have long needed. Our primary objective is to provide a more refined and extensive polarity dictionary than the previous SentiTurkNet. In doing so, we have resorted to a different network from the referenced study. The original paper can be found here, and the original repository is TurkishHistNet.
Dataset Details
In this study, we have identified approximately 76,825 synsets from KeNet. All of these synsets were then manually labeled as positive, negative or neutral by three native speakers of Turkish. Following the first labeling, a second labeling round was conducted for the words labeled as positive or negative in the first round: these words were re-labeled as strong or weak based on the degree of their positivity or negativity. The following table shows the number of synsets belonging to each category:
Polarity Level
Oct 16, 2022
This dataset is a comprehensive wordnet for Turkish. KeNet includes 77,330 synsets; it has intralingual semantic relations and is also linked to PWN through interlingual relations. The original paper can be found here, and the original repository is TurkishWordNet.
Dataset Details
An exemplary set of synsets from KeNet is given in Table 1. In this table, examples of the four most frequent parts of speech in KeNet are listed, i.e., noun, adjective, verb and adverb, respectively. For each of these examples, the first column shows the ID of the synset. The characters that are separated from the ID with "-" give the POS of the synset (n for noun, v for verb, a for adjective, adv for adverb). The second column lists the synset members; members listed in the same synset are synonyms. The third column gives the definition and, lastly, the fourth column presents an example sentence (if there is one) that includes one of the synset members.
Synset ID Synset Members Definition
Oct 16, 2022
Introduced in 1997, FrameNet (Lowe, 1997; Baker et al., 1998; Fillmore and Atkins, 1998; Johnson et al., 2001) has been developed by the International Computer Science Institute in Berkeley, California. It is a growing computational lexicography project that offers in-depth semantic information on English words and predicates. Based on the theory of Frame Semantics by Fillmore (Fillmore and others, 1976; Fillmore, 2006), FrameNet offers semantic information on predicate-argument structure in a way that is loosely similar to wordnet (Kilgarriff and Fellbaum, 2000).
In FrameNet, predicates and related lemmas are categorized under frames. The notion of a frame is described in Frame Semantics as a schematic representation of an event, state or relationship. These packets of semantic information, called frames, consist of individual lemmas (also known as Lexical Units) and frame elements (such as the agent, theme, instrument, duration, manner, direction etc.). Frame elements can be described as semantic roles that are related to the frame. Lexical Units, or lemmas, are linked to a frame through a single sense. For instance, the lemma "roast" can mean to criticise harshly or to cook by exposing to dry heat. With its latter meaning, "roast" belongs to the Apply Heat frame.
With this version of Turkish FrameNet, we aimed to capture at least a considerable majority of the most frequent predicates, thus offering a valuable and practical resource from day one. Because Turkish is a low-resource language, it was important to ensure that FrameNet had enough coverage to be incorporated into NLP solutions as soon as it was released to the public. The original paper can be found here, and the original repository is TurkishFrameNet.
Dataset Details
In this study, a total of 139 frames in 8 domains were created. Of these, 16 frames were created specifically for Turkish, while the remaining 123 were translated from English FrameNet. These frames include a total of 2769 synsets (see Table). As we used the repositories of Turkish WordNet and PropBank, the Lexical Units are made up of wordnet synsets, so some LUs contain more than one predicate. The total number of predicates annotated in this study is 4080; in other words, 4080 predicates were annotated into their respective frames. Sample sentences for all of them were marked up for the specific roles they contain.
Oct 16, 2022