Historical OCR Text Quality Analysis and Post-correction

Instructor: Sindhu Kutty;
Sponsors: Dr. John Dillon, Dr. Dan Hepp (ProQuest).
Motivation
Predict and improve the quality of scanned OCR output for historical texts using machine learning methods.
Contribution
- Create a parallel dataset pairing scanned-OCR New York Times news passages with their corresponding human-generated clean texts;
- Develop a synthetic dataset that uses human-generated clean data to simulate OCR-specific mistakes;
- Provide an unsupervised scanned-OCR text cleanliness predictor trained with basic language features;
- Fine-tune a T5-base model on the end-to-end, sentence-level OCR correction task.
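To illustrate the synthetic-dataset idea, here is a minimal sketch of how OCR-style noise could be injected into clean text to produce (noisy, clean) training pairs. It is not the project's actual error model: the `CONFUSIONS` table, the three error operations, and the names `add_ocr_noise` and `make_parallel_pairs` are all assumptions made for this example.

```python
import random

# Hypothetical look-alike substitutions, chosen for illustration only.
# A real error model would be estimated from aligned OCR/clean data.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "I"],
    "o": ["0", "c"],
    "m": ["rn"],
    "h": ["b"],
    "s": ["5"],
}


def add_ocr_noise(text, error_rate=0.1, seed=None):
    """Return a copy of `text` with simulated OCR errors.

    Each character is corrupted with probability `error_rate` by one of:
    a look-alike substitution, a deletion, or a spurious inserted space.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= error_rate:
            out.append(ch)  # keep the character unchanged
            continue
        op = rng.choice(["substitute", "delete", "insert_space"])
        if op == "substitute" and ch in CONFUSIONS:
            out.append(rng.choice(CONFUSIONS[ch]))
        elif op == "delete" and not ch.isspace():
            pass  # drop the character
        elif op == "insert_space":
            out.append(ch + " ")  # split a word, as OCR often does
        else:
            out.append(ch)  # no applicable corruption; keep as-is
    return "".join(out)


def make_parallel_pairs(clean_sentences, error_rate=0.1, seed=0):
    """Build (noisy, clean) sentence pairs for a correction model."""
    rng = random.Random(seed)
    return [
        (add_ocr_noise(s, error_rate, rng.randrange(2**32)), s)
        for s in clean_sentences
    ]


if __name__ == "__main__":
    sentences = ["The old mill stood by the river for a hundred years."]
    for noisy, clean in make_parallel_pairs(sentences, error_rate=0.15):
        print("noisy:", noisy)
        print("clean:", clean)
```

Pairs of this shape could then serve as source/target inputs for a sequence-to-sequence model such as T5, with the noisy sentence as input and the clean sentence as the target.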
Shared Slides
If you are interested in more detailed information, please email the author :)
tags: Research Project
OCR Post-Correction