# Creating text-based datasets
Assuming we don't start with an existing primary source or a dataset we would like to explore, there are many ways to create digital text resources and text-based datasets:
- scanning / OCR
- API
- web scraping
- text mining
## APIs: Application Programming Interfaces
A software interface that defines the types of data retrieval calls or requests a user can make, and the conventions that authenticated users need to follow when making them.
### Example: NYPL API
http://api.repo.nypl.org/
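
As a rough illustration of what "calls" and "conventions" mean in practice, the sketch below builds (but does not send) an authenticated search request. The endpoint path `/api/v2/items/search`, the `q` and `per_page` parameters, and the `Token token="..."` header format are assumptions that should be checked against the NYPL API documentation; `YOUR_API_TOKEN` is a placeholder for the token issued when you register.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL for the API's second version; verify in the official docs.
BASE_URL = "http://api.repo.nypl.org/api/v2"


def build_search_request(query, token, per_page=10):
    """Return a Request object for a keyword search, without sending it."""
    # urlencode escapes spaces and special characters in the query string.
    params = urlencode({"q": query, "per_page": per_page})
    url = f"{BASE_URL}/items/search?{params}"
    # Assumed convention: the token travels in an Authorization header.
    headers = {"Authorization": f'Token token="{token}"'}
    return Request(url, headers=headers)


request = build_search_request("walt whitman", token="YOUR_API_TOKEN")
print(request.full_url)
print(request.get_header("Authorization"))
```

To actually retrieve data, the request could be passed to `urllib.request.urlopen` and the response parsed with the `json` module, once a real token is in place.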

## Web scraping
A series of tasks, usually part of a larger workflow, that creates an original dataset by copying selected elements and blocks of information from websites and web-based resources, with the help of software tools and automated processes, into structured file formats that are ready for computational analysis.
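
The definition above can be sketched with Python's standard library alone: parse a page, copy out selected elements, and write them to a structured format. The HTML snippet, the `LinkCollector` class, and the column names are all hypothetical; a real project would fetch live pages (respecting the site's terms of use) and would often rely on a dedicated scraping library instead.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<ul>
  <li class="item"><a href="/texts/1">First pamphlet</a></li>
  <li class="item"><a href="/texts/2">Second pamphlet</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect the text and href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.rows.append({"title": title, "url": self._href})
            self._href = None


def scrape_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.rows


def rows_to_csv(rows):
    """Serialize the scraped rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape_links(SAMPLE_HTML)
print(rows_to_csv(rows))
```

The resulting CSV has one row per link, which is the kind of structured file that can be loaded directly into a spreadsheet or an analysis tool.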
## Text mining
The automatic extraction of information from different textual resources with the purpose of revealing patterns, dimensions and relations through computational analysis.
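
One of the simplest forms of such extraction is counting which words occur most often in a text. The sketch below is only a toy example: the sample sentence, the tiny stopword list, and the `word_frequencies` helper are made up for illustration, and real projects typically use larger stopword lists and more careful tokenization.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list.
STOPWORDS = {"the", "of", "and", "a", "an", "to", "in", "is"}


def word_frequencies(text, top_n=3):
    """Return the top_n most frequent non-stopword tokens with their counts."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(top_n)


sample = (
    "The archive of the city is an archive of letters, "
    "and letters describe the city."
)
print(word_frequencies(sample))
```

Even a count like this can reveal a pattern (here, the recurring words "archive", "city", and "letters") that would then be explored with more advanced methods.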
### Example: Overview
[Overview](https://www.overviewdocs.com/) is a document-mining platform originally developed with investigative journalism in mind.
## Data sources and data discovery tools
### Google Books Ngram Viewer
https://books.google.com/ngrams

### HathiTrust Digital Library Bookworm
https://bookworm.htrc.illinois.edu/develop/

### JSTOR's Text Analyzer
https://www.jstor.org/analyze/

### Internet Archive's Ebooks and Texts
https://archive.org/details/texts
### Project Gutenberg
https://www.gutenberg.org/
### Specialized data sources and online corpora
- [A multilingual dataset of novels](https://txtlab.org/2016/01/txtlab450-a-data-set-of-multilingual-novels-for-teaching-and-research/)
- [The Situationist International Text Library](http://library.nothingness.org/articles/SI/)
- [German Political Speeches Corpus](https://zenodo.org/record/3611246)
- [English Corpora: a collection of widely used online corpora](https://www.english-corpora.org/)
- [Eighteenth Century Collections Online](https://www.gale.com/primary-sources/eighteenth-century-collections-online)
- [The Early Modern OCR project](https://emop.tamu.edu/)