Creating text-based datasets
Assuming we don't start with an existing primary source or a dataset we would like to explore, there are many ways to create digital text resources and text-based datasets:
- scanning / OCR
- API
- web scraping
- text mining
APIs: Application Programming Interfaces
A computer interface that defines the types of data retrieval calls or requests that users can make and the conventions that authenticated users need to follow.
Example: NYPL API
http://api.repo.nypl.org/
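As a rough illustration of "calls" and "conventions", the sketch below builds (but does not send) a search request against the NYPL API. The `/api/v1/items/search` path, the `q` and `per_page` parameters, and the `Token token="..."` header format are assumptions to check against the NYPL API documentation. `YOUR_TOKEN` is a placeholder for the key issued when you register.

```python
import json
import urllib.parse
import urllib.request

# Assumed base path; confirm against the NYPL API documentation.
BASE_URL = "http://api.repo.nypl.org/api/v1"


def build_search_request(query, token, per_page=10):
    """Build an authenticated search request without sending it."""
    params = urllib.parse.urlencode({"q": query, "per_page": per_page})
    url = f"{BASE_URL}/items/search?{params}"
    # Assumed convention: the token travels in the Authorization header.
    headers = {"Authorization": f'Token token="{token}"'}
    return urllib.request.Request(url, headers=headers)


def fetch_json(request, timeout=30):
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_search_request("new york maps", "YOUR_TOKEN")
print(request.full_url)
# With a valid token, the results could then be retrieved with:
# results = fetch_json(request)
```

Separating the request-building step from the network call makes it easy to inspect the URL and headers before spending any of the API's rate limit.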
Web scraping
A series of tasks, usually part of a larger workflow, that create an original dataset by using software tools and automated processes to copy selected elements and blocks of information from websites and web-based resources into structured file formats that are ready for computational analysis.
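A minimal sketch of that workflow, using only Python's standard library: it parses a small, made-up HTML snippet (`SAMPLE_HTML`, invented for illustration), copies out each link's text and address, and writes them to CSV. In practice the HTML would be downloaded from a real page, and a dedicated scraping library may be more convenient.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content, standing in for a downloaded web page.
SAMPLE_HTML = """
<ul>
  <li><a href="/texts/moby-dick">Moby Dick</a></li>
  <li><a href="/texts/walden">Walden</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collect the text and href of every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.rows.append({"text": title, "url": self._href})
            self._href = None


def scrape_links(html):
    """Return one row per link found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Serialize the rows into CSV text, a structured format ready for analysis."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape_links(SAMPLE_HTML)
print(to_csv(rows))
```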
Text mining
The automatic extraction of information from different textual resources with the purpose of revealing patterns, dimensions and relations through computational analysis.
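One of the simplest forms of text mining is counting which terms recur across a set of documents. The sketch below is only an illustration: the two sample sentences and the tiny stopword list are made up, and real projects typically use fuller stopword lists and more careful tokenization.

```python
import re
from collections import Counter

# Small illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(documents):
    """Count non-stopword terms across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(t for t in tokenize(document) if t not in STOPWORDS)
    return counts


# Made-up sample documents.
documents = [
    "The whale was white.",
    "A whale and the sea.",
]
frequencies = term_frequencies(documents)
print(frequencies.most_common(3))
```

Counts like these are the raw material for the patterns and relations mentioned above, for example comparing how often a term appears in different authors or periods.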
Example: Overview
Overview is a document-mining platform originally developed with investigative journalism in mind.
Google's Ngram Viewer
https://books.google.com/ngrams
HathiTrust Digital Library Bookworm
https://bookworm.htrc.illinois.edu/develop/
JSTOR's Text Analyzer
https://www.jstor.org/analyze/
Internet Archive's Ebooks and Texts
https://archive.org/details/texts
Project Gutenberg
https://www.gutenberg.org/
Specialized data sources and online corpora