75 News - HackMD

# 75 News 75 News dataset contains 15 Turkish news articles from 5 different topics: economics, magazine, health, politics and sports. Dataset is genereated by [Kemik Natural Language Processing Group](http://www.kemik.yildiz.edu.tr/). The main goal for this dataset is text classification. However, as result of the structure of the given news, it may also be used for sentence tokenization. ## Dataset Details This dataset consists of 15 texts, each of which annotated with a string label denoting the type. | Label | Description | |-------|-------------| | ekonomi | economics | | magazin | maganize | | sağlık | health | | siyasi | politics | | sport | sports | ### Samples A sample instance is presented below. Example: ``` { 'topic': magazin, 'news': George Clooney CIA ajanı oluyor. GEORGE Clooney, yeni filminde Suriye'de görev yapan eski CIA ajan'ı Robert Bauer'i canlandıracak. İnternetteki "imdb" sitesinin haberine göre, Clooney, Bauer'in "See No Evil: The True Story of a Foot Soldier in the CIA's War on Terror" adlı otobiyografisinden beyazperdeye aktarılacak filmde başrol gösterecek. Filmin yapımcılığını George Clooney ile Steven Soderbergh'in ortak olduğu Section 8 adlı yapım şirketi üstlenecek.' } ``` ### Fields Explain the fields of the instances. | field | dtype | |----------|------------| | news | string | | topic | integer | | file_id | integer | ### Splits No train/validation/test split is provided by the authors. ## Dataset Creation ### Curation Rationale This dataset is motivated by the lack of annotated data on Turkish text classification. ### Data Source The authors gathered the news from [Milliyet](milliyet.com.tr). ### Annotations Each news article is human-annotated. ### Quality We observed that this dataset does not contain any duplications. Each article is edited by professional editors. The annotations are taken from the publisher news agency. ### Personal and Senstive Information All the news articles presented are already published to the public. Even though some personal information might be presented in the magazine articles, all of the present information is in a legal framework. ## Considerations ### Social Impact of Dataset This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures. ### Discussion of Biases The data included here are from news. Some of the presented articles may have been disclaimed. ### Dataset Curators Published by M. Fatih AMASYALI. ### Citation Information ``` @article{amasyali2004otomatik, title={Otomatik haber metinleri sınıflandırma}, author={Amasyalı, MF and Yıldırım, T}, journal={SIU 2004}, pages={224--226}, year={2004} } ```