# TS Corpus Word List
The "TS Corpus Word List" is a list of unique words extracted from the corpora released under the [TS Corpus](https://tscorpus.com/) project.
## Dataset Details
The word list includes about 3.3 million unique words (3,280,440).
Each word is on its own line.
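
Since the list is one word per line, it can be loaded with a few lines of Python. The sketch below is only an illustration: the function name `load_word_list` and the file name in the usage comment are hypothetical, and UTF-8 encoding is an assumption rather than something stated in this card.

```python
from pathlib import Path


def load_word_list(path):
    """Read a one-word-per-line file into a set, skipping blank lines.

    Assumes the file is UTF-8 encoded (an assumption, not confirmed here).
    """
    words = set()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            word = line.strip()
            if word:
                words.add(word)
    return words


# Usage (hypothetical file name):
#   words = load_word_list("ts_corpus_word_list.txt")
#   print(len(words))
#   print("kitap" in words)
```

A set is used here because membership checks are the most likely use of a word list; a plain list would work as well if the original order matters.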
## Dataset Creation
The source data is preprocessed to eliminate misspelled, foreign and invalid words.
### Curation Rationale
The dataset is motivated by the desire to list the words actively used in Turkish.
### Data Source
The words in the dataset were harvested from a relatively large collection of texts: the corpora released under the TS Corpus project, raw texts collected from e-books and other digitally printed materials, and internet crawls. The whole source contains over 250 billion words.
### Quality
93.2% of the dataset (3,055,001 words) consists of words that are validated by various spell-checkers. The remaining 6.8% consists of words that are either
- used in spoken language or
- not successfully processed by the spell-checkers.
### Social Impact of Dataset
Considering the size of the data and its coverage across domains and genres, the list is a useful source of the words actively used in Turkish.
### Other Known Limitations
The dataset is not annotated with the categories described in the "Quality" section above. This annotation is planned for future versions of the dataset.
### Dataset Curators
Published by [Taner Sezer](https://tanersezer.com/)
### Citation Information
Please cite the following papers if you find this dataset useful:
- Sezer, T. (2017). TS Corpus Project: An online Turkish Dictionary and TS DIY Corpus. European Journal of Language and Literature, 9(1), 18-24.
- Sezer, T., & Sezer, B. (2013). TS Corpus: Herkes İçin Türkçe Derlem. Proceedings of the 27th National Linguistics Conference, May 3-4, 2013. Antalya, Kemer: Hacettepe University, English Linguistics Department, pp. 217-225.