# TS Abstract Corpus
Abstracts of 6234 papers, categorized by scientific discipline.
This dataset is a collection of categorized and annotated abstracts from 6234 papers from various disciplines ([Abstract Corpus](https://tscorpus.com/corpora/ts-abstract-corpus/)). Its main goal is to support the study of the patterns used in scientific writing and of how the lexicon varies across disciplines.
## Dataset Details
This dataset consists of 1,048,132 tokens with part-of-speech and morphological tagging. The data is arranged in five columns, which give, in order: the surface form of the word, its part-of-speech tag, its morphological analysis, its lemma, and the correct form of the surface word if a spelling mistake was found.
### Samples
| Surface Form | PosTag | Morphological Analysis | Lemma | Correct |
|--------------|--------|-------------------------|-------|----------|
| antik | Adj | Adj | antik | antik |
| tiyatronun | Noun | Noun+A3sg+Pnon+Gen |tiyatro|tiyatronun|
| arkasında | Noun | Noun+A3sg+P3sg+Loc | arka | arkasında|
| yer | Noun | Noun+A3sg+Pnon+Nom | yer | yer |
| alir | YY | NoMorph |NoLemma| alır |
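The five-column layout shown in the samples can be read with a few lines of Python. This is a minimal sketch, not the corpus's official loader: the tab separator, the `Token` class, and the `parse_row` helper are assumptions made for illustration, since the on-disk file format is not specified here. It splits the morphological analysis on `+` and flags rows where the surface form differs from the corrected form.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One corpus row (hypothetical container; field names follow the table headers)."""
    surface: str
    pos: str
    morph: str
    lemma: str
    correct: str

    @property
    def morph_tags(self) -> List[str]:
        # "Noun+A3sg+Pnon+Gen" -> ["Noun", "A3sg", "Pnon", "Gen"]
        return self.morph.split("+")

    @property
    def is_misspelled(self) -> bool:
        # A row counts as misspelled when the corrected form differs from the surface form.
        return self.surface != self.correct


def parse_row(line: str, sep: str = "\t") -> Token:
    """Parse one line into a Token. The tab separator is an assumption."""
    fields = [field.strip() for field in line.split(sep)]
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    return Token(*fields)


# The sample rows from the table above, written as tab-separated lines.
rows = [
    "antik\tAdj\tAdj\tantik\tantik",
    "tiyatronun\tNoun\tNoun+A3sg+Pnon+Gen\ttiyatro\ttiyatronun",
    "arkasında\tNoun\tNoun+A3sg+P3sg+Loc\tarka\tarkasında",
    "yer\tNoun\tNoun+A3sg+Pnon+Nom\tyer\tyer",
    "alir\tYY\tNoMorph\tNoLemma\talır",
]

tokens = [parse_row(row) for row in rows]
misspelled = [(t.surface, t.correct) for t in tokens if t.is_misspelled]
```

With the sample rows, `misspelled` holds the single pair `("alir", "alır")`, and `tokens[1].morph_tags` gives the individual morphological tags for `tiyatronun`. If the actual files use a different delimiter, pass it through the `sep` argument.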
### Data Source
The data used in this corpus is a subset drawn from the following study:
Özturk, S., Sankur, B., Gungör, T., Yilmaz, M. B., Köroǧlu, B., Aǧin, O., ... & Ahat, M. (2014, April). Turkish labeled text corpus. In 2014 22nd Signal Processing and Communications Applications Conference (SIU) (pp. 1395-1398). IEEE.
### Quality
The part-of-speech tagging accuracy is ~82%.