This dataset contains 910 Turkish column articles from 69 different authors. The dataset was generated by the Kemik Natural Language Processing Group.
This is a classification dataset consisting of 910 samples. The average text length is 465 words. Each sample belongs to one of the following authors:
A sample instance is presented below.
Example:
Each file contains one column article, and articles by the same author are stored in the same directory.
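The directory layout described above can be read with a short loader. This is a minimal sketch, not an official loader: the function name `load_columns`, the `root` argument, and the default `utf-8` encoding are assumptions, since the card does not state where the data is extracted or how the files are encoded. Undecodable bytes are replaced rather than raising an error.

```python
from pathlib import Path


def load_columns(root, encoding="utf-8"):
    """Return (texts, labels) from a root directory with one subdirectory per author.

    `root` and `encoding` are assumptions: the dataset card does not state
    the extraction path or the file encoding.
    """
    texts, labels = [], []
    # Each subdirectory of `root` is treated as one author.
    for author_dir in sorted(Path(root).iterdir()):
        if not author_dir.is_dir():
            continue
        # Each file inside an author directory is one column article.
        for path in sorted(author_dir.iterdir()):
            if path.is_file():
                texts.append(path.read_text(encoding=encoding, errors="replace"))
                labels.append(author_dir.name)
    return texts, labels
```

The directory name is used as the author label, so the labels are only as clean as the directory names in the extracted archive.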
No split information is provided by the authors.
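Because no official split exists, users have to create their own. One possible approach is a per-author (stratified) split, sketched below with the standard library only. The function name `stratified_split`, the 20% test fraction, and the fixed seed are illustrative assumptions, not part of the dataset. The input is any list of author labels, one per sample.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train and test sets, author by author.

    `test_fraction` and `seed` are illustrative defaults, not values
    recommended by the dataset authors.
    """
    # Group sample indices by author label.
    by_author = defaultdict(list)
    for idx, author in enumerate(labels):
        by_author[author].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for author in sorted(by_author):
        indices = by_author[author]
        rng.shuffle(indices)
        if len(indices) > 1:
            # At least one test sample, but always leave one for training.
            n_test = max(1, round(len(indices) * test_fraction))
            n_test = min(n_test, len(indices) - 1)
        else:
            # A single-sample author stays entirely in the training set.
            n_test = 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

Splitting per author keeps every author with at least two samples represented on both sides, which matters when each class has only a handful of texts.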
The main goal of this dataset is to classify texts by author. Its value lies in having many classes with few samples per class (about 13 per author on average).
The data were gathered from internet news sources between 2005 and 2009.
All of the news articles included have already been published publicly. Even though some personal information may appear in the magazine articles, all of the information presented falls within a legal framework.
This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures.
The data included here are from the news. Some of the presented articles may have been disclaimed.
Published by M.Fatih AMASYALI, Başak BOZKURT, Cansu ŞEN, Ömer YILDIRMAZ, Furkan KAMACI, Murat YASDI, Muhammet Ali AYAS, Okay GÜNGÖR, Erben ŞAMİLOĞLU, Pınar ÖZVEREN, Recep YAŞAR, Hayri Uğur KOLTUK, Muhammet Said AYDEMİR, Abdulaziz FAKİRULLAHOĞLU, Sadık ÖZKARACA, Erman ÇATI, Mehmet İkbal KAYA, Uğur ERGÜL, Özkan YALÇIN, Berker NAROL, Guychgeldi ATAYEV
There is no paper associated with this dataset; it was created by the Kemik Natural Language Processing Group.