Creating text-based datasets

When we don't start with an existing primary source or a dataset we would like to explore, there are several ways to create digital text resources and text-based datasets:

  • scanning / OCR
  • API
  • web scraping
  • text mining

APIs: Application Programming Interfaces

A computer interface that defines the types of data retrieval calls or requests that a user can make, and the conventions that authenticated users need to follow when making them.

Example: NYPL API

http://api.repo.nypl.org/
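
To make the idea of "calls" and "conventions" concrete, here is a minimal sketch of how an authenticated request might be assembled with Python's standard library. Only the host name comes from the example above; the `/api/v1/items/search` path, the `q` parameter, the token header format, and the `build_search_request` helper are assumptions for illustration and should be checked against the API's own documentation. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the host is the one listed above; the path, the "q" parameter
# and the header format are illustrative and must be checked against the docs.
BASE_URL = "http://api.repo.nypl.org/api/v1/items/search"


def build_search_request(query: str, token: str) -> Request:
    """Build (but do not send) an authenticated GET request."""
    # The convention for *what* we ask: a query string appended to the URL.
    url = f"{BASE_URL}?{urlencode({'q': query})}"
    # The convention for *who* is asking: a token passed in a header.
    headers = {"Authorization": f'Token token="{token}"'}
    return Request(url, headers=headers)


request = build_search_request("street maps", token="YOUR_TOKEN")
print(request.get_method(), request.full_url)
```

Sending the request would typically be done with `urllib.request.urlopen(request)` and parsing the response with `json.load`, once a real token is in place.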

Web scraping

A series of tasks, usually part of a larger workflow, that creates an original dataset by using software tools and automated processes to copy selected elements and blocks of information from websites and web-based resources into structured file formats that are ready for computational analysis.
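
As a rough sketch of that workflow, the example below copies selected elements (link text and link targets) out of an HTML fragment and writes them to a structured format (CSV). The `SAMPLE_HTML` fragment, the `LinkExtractor` class, and the helper functions are hypothetical names made up for illustration; a real scraper would first download the page and would need to respect the site's terms of use.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<ul>
  <li class="book"><a href="/texts/1">Moby Dick</a></li>
  <li class="book"><a href="/texts/2">Walden</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collect the href and visible text of every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.rows.append({"title": title, "url": self._href})
            self._href = None


def scrape_links(html: str) -> list[dict]:
    """Parse an HTML string and return one row per link."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.rows


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape_links(SAMPLE_HTML)
print(rows_to_csv(rows))
```

The standard-library parser keeps the sketch dependency-free; third-party libraries are often more convenient for messy real-world pages.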

Text mining

The automatic extraction of information from different textual resources with the purpose of revealing patterns, dimensions and relations through computational analysis.
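
One of the simplest forms of this kind of extraction is counting terms per document. The sketch below is only an illustration under stated assumptions: the two-document `DOCUMENTS` corpus and the short `STOP_WORDS` list are invented for the example, and real projects usually rely on fuller tokenizers and stop-word lists.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; in practice these would be files or scraped texts.
DOCUMENTS = {
    "doc1": "The whale surfaced. The crew watched the whale.",
    "doc2": "The pond was still, and the woods were quiet.",
}

# Small illustrative stop-word list (an assumption, not a standard one).
STOP_WORDS = {"the", "and", "was", "were"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(text: str) -> Counter:
    """Count every token that is not a stop word."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


for name, text in DOCUMENTS.items():
    print(name, term_frequencies(text).most_common(3))
```

Comparing the most frequent terms across documents is a first, crude way of surfacing the patterns and relations mentioned above.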

Example: Overview

Overview is a document-mining platform originally developed with investigative journalism in mind.

Data sources and data discovery tools

Google's Ngram Viewer

https://books.google.com/ngrams

HathiTrust Digital Library Bookworm

https://bookworm.htrc.illinois.edu/develop/

JSTOR's Text Analyzer

https://www.jstor.org/analyze/

Internet Archive's Ebooks and Texts

https://archive.org/details/texts

Project Gutenberg

https://www.gutenberg.org/

Specialized data sources and online corpora