Bilkent Information Retrieval Collection

# Bilkent Information Retrieval Collection

This dataset is a collection of newspaper documents that can be used for information retrieval studies. It contains documents, queries, and query relevance judgments. The [paper](http://repository.bilkent.edu.tr/bitstream/handle/11693/23211/Information%20retrieval%20on%20turkish%20texts.pdf?sequence=1&isAllowed=y) and the [GitHub repository](https://github.com/BilkentInformationRetrievalGroup/MilliyetCollectionTREC) for this dataset are available via the given links.

## Dataset Details

There are 408,305 documents gathered from the news articles of the Turkish newspaper *Milliyet* from the years 2001-2005. The average number of words per document before stop-word elimination is 234. There are 72 ad hoc queries.

To determine the relevant documents for each query, the pooling approach is used. The assessors evaluated the documents in the pool; the remaining documents, those not in the pool, are assumed to be irrelevant. The query relevance file contains the pool documents for each query, with the assessors' relevance judgment in the last column (0: irrelevant, 1: relevant). Detailed information about pooling and the TREC approach is available [here](https://trec.nist.gov/presentations/TREC9/intro/sld018.htm).

### Samples

```
<DOC>
<DOCNO> 50000 </DOCNO>
<SOURCE> Milliyet v.01 </SOURCE>
<URL> www.milliyet.com.tr/2001/11/01/son/soneko02.html </URL>
<DATE> 2001/11/01/ </DATE>
<TIME> </TIME>
<AUTHOR> </AUTHOR>
<HEADLINE> Kapalıçarşı'da döviz fiyatları güne kaç liradan başladı.. Tıklayın </HEADLINE>
<TEXT> İstanbul serbest piyasada dolar 1 milyon 595 bin lira, mark ise 736 bin lira satış fiyatıyla güne başladı. Kapalıçarşı'da dolar 1 milyon 585 bin liradan alınıp 1 milyon 595 bin liradan satılıyor. 728 bin liradan alınan markın satış fiyatı ise 736 bin lira olarak belirlendi. Serbest piyasada dünkü kapanışta dolar 1 milyon 602 bin, mark ise 739 bin lira olmuştu.
</TEXT>
</DOC>
```

### Fields

| field | dtype |
|----------|------------|
| DOCNO | integer |
| SOURCE | string |
| URL | string |
| DATE | string |
| AUTHOR | string |
| HEADLINE | string |
| TEXT | string |

## Dataset Creation

### Curation Rationale

The dataset is motivated by the desire to advance information retrieval studies in the Turkish language.

### Data Source

The authors gathered the documents from columns and news articles of the Turkish newspaper *Milliyet* published between 2001 and 2005.

### Annotations

The documents were collected automatically. The query relevance assessments were made by experienced web users. The assessors were not required to have expertise in the topics they picked.

### Personal and Sensitive Information

All documents were published in the newspaper and are therefore not expected to contain any personal or sensitive information.

## Considerations

### Social Impact of Dataset

This dataset is part of an effort to encourage information retrieval research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures.

### Discussion of Biases

The data included here come from writers of the Turkish newspaper *Milliyet*. Some of these documents may be opinion columns that express a biased point of view.

## Additional Information

### Dataset Curators

Published by Fazli Can, Seyit Kocberber, Erman Balcik, Cihan Kaynak, H. Cagdas Ocalan, and Onur M. Vursavas.

### Citation Information

If you use this dataset, please cite the following paper:

Can, F., Kocberber, S., Balcik, E., Kaynak, C., Ocalan, H. C., & Vursavas, O. M. (2008). Information retrieval on Turkish texts. Journal of the American Society for Information Science and Technology, 59(3), 407-421.
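
### Parsing Example

The TREC-style `<DOC>` record shown in the Samples section can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the names `parse_docs`, `DOC_RE`, and `FIELD_RE` are hypothetical, the embedded sample is an abridged copy of the record above, and the sketch assumes that tags are uppercase and not nested inside a field. The character encoding of the distributed files is not stated here, so reading from disk would need an explicit `encoding=` choice.

```python
import re
from typing import Dict, Iterator

# Matches one <DOC> ... </DOC> record; DOTALL lets '.' span line breaks.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
# Matches a field such as <DOCNO> 50000 </DOCNO>; the backreference \1
# requires the closing tag to carry the same name as the opening tag.
FIELD_RE = re.compile(r"<([A-Z]+)>(.*?)</\1>", re.DOTALL)


def parse_docs(raw: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <DOC> record, mapping tag name to stripped text."""
    for doc_match in DOC_RE.finditer(raw):
        body = doc_match.group(1)
        yield {tag: value.strip() for tag, value in FIELD_RE.findall(body)}


# Abridged copy of the sample record (TEXT shortened for illustration).
sample = """<DOC>
<DOCNO> 50000 </DOCNO>
<SOURCE> Milliyet v.01 </SOURCE>
<URL> www.milliyet.com.tr/2001/11/01/son/soneko02.html </URL>
<DATE> 2001/11/01/ </DATE>
<TIME> </TIME>
<AUTHOR> </AUTHOR>
<HEADLINE> Kapalıçarşı'da döviz fiyatları güne kaç liradan başladı.. Tıklayın </HEADLINE>
<TEXT> İstanbul serbest piyasada dolar 1 milyon 595 bin lira, mark ise 736 bin lira satış fiyatıyla güne başladı.
</TEXT>
</DOC>"""

docs = list(parse_docs(sample))
print(docs[0]["DOCNO"], docs[0]["DATE"])
```

Empty fields such as `TIME` and `AUTHOR` come back as empty strings, and `DOCNO` is returned as text, so callers who want the integer type listed in the Fields table would convert it with `int(...)`.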