# Historical OCR Text Quality Analysis and Post-correction

![](https://i.imgur.com/1XGSAQw.png)

Instructor: Sindhu Kutty; Sponsor: Dr. John Dillon, Dr. Dan Hepp (ProQuest).

Motivation
---
Predict and improve the quality of scanned-OCR historical texts using machine learning methods.

Contribution
---
- Create a parallel dataset pairing scanned-OCR New York Times news passages with the corresponding human-generated clean texts;
- Develop a synthetic dataset that uses human-generated clean data to simulate OCR-specific mistakes;
- Provide an unsupervised scanned-OCR text cleanliness predictor trained with basic language features;
- Fine-tune the T5-base model on the end-to-end, sentence-level OCR correction task.

Shared Slides
---
- [Shared Slides](https://docs.google.com/presentation/d/1NZffeTILI0LmeixYr9SBmj_U7Ek-mJ5Q/edit?usp=sharing&ouid=111081196382689176167&rtpof=true&sd=true)

If you are interested in detailed information, please email the author :)

###### tags: `Research Project` `OCR Post-Correction`
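
To make the synthetic-dataset contribution more concrete, here is a minimal sketch of how clean text might be corrupted with OCR-like errors to produce (noisy, clean) pairs. This is not the project's actual method, which this note does not describe. The confusion table `CONFUSIONS`, the functions `add_ocr_noise` and `make_parallel_pairs`, and the `error_rate` parameter are all hypothetical names chosen for illustration.

```python
import random

# Hypothetical confusion table (an assumption, not taken from the project):
# each clean character maps to strings an OCR engine might plausibly emit.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "o": ["0", "c"],
    "m": ["rn"],
    "h": ["b"],
    "s": ["5"],
}


def add_ocr_noise(text, error_rate=0.1, seed=None):
    """Return a copy of `text` with simulated OCR errors.

    Each character is corrupted independently with probability `error_rate`.
    Characters in CONFUSIONS are swapped for a look-alike, spaces are dropped
    (merging adjacent words), and all other characters are left unchanged.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < error_rate:
            if ch in CONFUSIONS:
                out.append(rng.choice(CONFUSIONS[ch]))
                continue
            if ch == " ":
                continue  # drop the space, simulating merged words
        out.append(ch)
    return "".join(out)


def make_parallel_pairs(sentences, error_rate=0.1, seed=0):
    """Build (noisy, clean) pairs, using a different noise seed per sentence."""
    rng = random.Random(seed)
    return [
        (add_ocr_noise(s, error_rate, seed=rng.randrange(2**32)), s)
        for s in sentences
    ]
```

Pairs produced this way have the same (noisy input, clean target) shape that a sequence-to-sequence correction model would train on. A realistic confusion table would more likely be estimated from aligned scanned-OCR and clean passages than written by hand.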