TS Abstract Corpus

# TS Abstract Corpus Abstracts of 6234 papers, categorized by scientific discipline. Test dataset is a collection of categorized and annotated abstracts from 6234 papers from various diciplines [Abstract Corpus](https://tscorpus.com/corpora/ts-abstract-corpus/). The main goal for this dataset is to investigate the patterns used in scientific writing, the lexicon variety by different diciplines. ## Dataset Details This dataset consists of 1,048,132 tokens with part-of-speech and morphological tagging. The data is in five columns respectively represents the surface form of the word, the part-of-speech tag, morphological analysis, the lemma of the word and the correct form of the surface word if any spelling mistake is found. ### Samples | Surface Form | PosTag | Morpohological Analysis | Lemma | Correct | |--------------|--------|-------------------------|-------|----------| | antik | Adj | Adj | antik | antik | | tiyatronun | Noun | Noun+A3sg+Pnon+Gen |tiyatro|tiyatronun| | arkasında | Noun | Noun+A3sg+P3sg+Loc | arka | arkasında| | yer | Noun | Noun+A3sg+Pnon+Nom | yer | yer | | alir | YY | NoMorph |NoLemma| alır | ### Data Source The data used in this corpus is a subsection harvested from the following study: Özturk, S., Sankur, B., Gungör, T., Yilmaz, M. B., Köroǧlu, B., Aǧin, O., ... & Ahat, M. (2014, April). Turkish labeled text corpus. In 2014 22nd Signal Processing and Communications Applications Conference (SIU) (pp. 1395-1398). IEEE. ### Quality The part-of-speech tagging accuracy is ~82%.