# TS Corpus Word List

The "TS Corpus Word List" is a list of unique words extracted from the corpora released under the [TS Corpus](https://tscorpus.com/) project.

## Dataset Details

The word list contains about 3.28 million unique words (3,280,440). Each word is given on its own line.

## Dataset Creation

The source data was preprocessed to eliminate misspelled, foreign, and invalid words.

### Curation Rationale

The dataset is motivated by the desire to list the words actively used in Turkish.

### Data Source

The words in the dataset were harvested from a relatively large collection of texts: the corpora released under the TS Corpus project, raw texts collected from e-books and other digitally printed materials, and internet crawls. The combined source material exceeds 250 billion words.

### Quality

93.2% of the dataset (3,055,001 words) consists of words validated by various spell-checkers. The remaining 6.8% consists of:

- words used in spoken language, or
- words that the spell-checkers failed to process.

### Social Impact of Dataset

Considering the size of the data and its coverage across domains and genres, the list represents a useful source of the words actively used in Turkish.

### Other Known Limitations

The dataset is not annotated according to the categories described in the "Quality" section above, so entries are not marked as spell-checker validated, spoken-language, or unprocessed. This annotation is planned for future versions of the dataset.

### Dataset Curators

Published by [Taner Sezer](https://tanersezer.com/)

### Citation Information

Please cite the following papers if you find this dataset useful:

Sezer, T. (2017). TS Corpus Project: An online Turkish Dictionary and TS DIY Corpus. European Journal of Language and Literature, 9(1), 18-24.

Sezer, T., & Sezer, B. (2013). TS Corpus: Herkes İçin Türkçe Derlem. Proceedings of the 27th National Linguistics Conference, 3-4 May 2013, Antalya, Kemer. Hacettepe University, English Linguistics Department, pp. 217-225.
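
Because the list stores one word per line (see "Dataset Details"), it can be read into a set for fast membership checks. The following is a minimal sketch rather than an official loader: the function names `parse_words` and `load_word_list` are hypothetical, and the UTF-8 encoding is an assumption that is not stated in this card.

```python
from pathlib import Path
from typing import Iterable, Set


def parse_words(lines: Iterable[str]) -> Set[str]:
    """Collect one word per line, skipping blank lines."""
    words = set()
    for line in lines:
        word = line.strip()  # drop the trailing newline and stray spaces
        if word:
            words.add(word)
    return words


def load_word_list(path: str, encoding: str = "utf-8") -> Set[str]:
    """Read a one-word-per-line file into a set.

    The UTF-8 default is an assumption; change it if the file differs.
    """
    with Path(path).open(encoding=encoding) as handle:
        return parse_words(handle)
```

With a local copy of the file (the name `ts_corpus_word_list.txt` is a placeholder), `words = load_word_list("ts_corpus_word_list.txt")` followed by `"kitap" in words` would test whether a given form appears in the list. A set is used here because lookups stay fast even with several million entries.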