
Historical OCR Text Quality Analysis and Post-correction

Instructor: Sindhu Kutty
Sponsors: Dr. John Dillon, Dr. Dan Hepp (ProQuest)

Motivation

Predict the quality of scanned OCR output for historical texts and improve it using machine learning methods.

Contribution

  • Create a parallel dataset that pairs scanned-OCR New York Times news passages with the corresponding human-generated clean texts;
  • Develop a synthetic dataset that injects simulated OCR-specific mistakes into human-generated clean data;
  • Provide an unsupervised predictor of scanned-OCR text cleanliness, trained on basic language features;
  • Fine-tune the T5-base model on the end-to-end, sentence-level OCR correction task.
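
To make the synthetic-dataset idea concrete, here is a minimal sketch of how OCR-like noise might be injected into clean sentences to produce (noisy, clean) training pairs. It is an illustration only, not the project's actual pipeline: the `CONFUSIONS` table, the function names `add_ocr_noise` and `make_pairs`, and the `rate` parameter are all assumptions made for this example.

```python
import random

# Hypothetical confusion table of visually similar characters.
# Illustrative only; a real table would be estimated from aligned OCR data.
CONFUSIONS = {
    "l": ["1", "I"],
    "o": ["0"],
    "e": ["c"],
    "m": ["rn"],
    "h": ["b"],
    "s": ["5"],
}

def add_ocr_noise(text, rate=0.1, seed=None):
    """Return a copy of `text` with simulated OCR errors.

    Each character is selected for corruption with probability `rate`.
    A selected character is replaced by a look-alike if one is listed,
    dropped if it is a space (merging two words), and kept otherwise.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)                          # not selected: keep as-is
        elif ch in CONFUSIONS:
            out.append(rng.choice(CONFUSIONS[ch]))  # substitute a look-alike
        elif ch == " ":
            continue                                # drop the space
        else:
            out.append(ch)                          # no rule for this character
    return "".join(out)

def make_pairs(sentences, rate=0.1, seed=0):
    """Build (noisy, clean) pairs, e.g. as input/target for a seq2seq model."""
    rng = random.Random(seed)
    return [
        (add_ocr_noise(s, rate, seed=rng.randrange(2**32)), s)
        for s in sentences
    ]

if __name__ == "__main__":
    for noisy, clean in make_pairs(["the old mill stood on the hill"], rate=0.2):
        print(noisy, "<-", clean)
```

Pairs produced this way have the same shape as the parallel dataset (noisy source, clean target), so under this assumption they could be mixed with real scanned-OCR pairs when fine-tuning a sentence-level correction model.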

Shared Slides

If you are interested in detailed information, please email the author :)

tags: Research Project OCR Post-Correction