# Historical OCR Text Quality Analysis and Post-correction
![](https://i.imgur.com/1XGSAQw.png)
Instructor: Sindhu Kutty;
Sponsors: Dr. John Dillon, Dr. Dan Hepp (ProQuest).
Motivation
---
Predict the quality of scanned-OCR output from historical texts, and improve it, using machine learning methods.
Contribution
---
- Create a parallel dataset that pairs scanned-OCR New York Times news passages with their corresponding human-generated clean texts;
- Develop a synthetic dataset that uses the human-generated clean data to simulate OCR-specific mistakes;
- Provide an unsupervised predictor of scanned-OCR text cleanliness, trained on basic language features;
- Fine-tune the T5-base model on the end-to-end, sentence-level OCR correction task.
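
To make the synthetic-dataset idea concrete, here is a minimal sketch of one way to inject OCR-like mistakes into clean text. It is not the project's actual error model: the confusion table `OCR_CONFUSIONS`, the `error_rate` parameter, and the helper names `add_ocr_noise` and `make_parallel_pairs` are all hypothetical and chosen only for illustration. Each pair is returned as `(noisy, clean)`, which is the input/target shape a sequence-to-sequence corrector such as T5 would typically consume.

```python
import random

# Hypothetical confusion table: visually similar characters that an OCR
# engine might swap. Illustrative assumptions only, not measured data.
OCR_CONFUSIONS = {
    "l": ["1", "I"],
    "O": ["0"],
    "e": ["c"],
    "m": ["rn"],
    "h": ["b"],
    "S": ["5"],
}


def add_ocr_noise(text, error_rate=0.1, seed=None):
    """Return a copy of `text` with simulated OCR errors.

    Each character is corrupted with probability `error_rate`:
    characters in the confusion table are substituted, spaces are
    dropped (merging adjacent words), and anything else is kept.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < error_rate:
            if ch in OCR_CONFUSIONS:
                out.append(rng.choice(OCR_CONFUSIONS[ch]))
            elif ch == " ":
                continue  # drop the space, merging two words
            else:
                out.append(ch)  # no known confusion: keep as-is
        else:
            out.append(ch)
    return "".join(out)


def make_parallel_pairs(clean_sentences, error_rate=0.1, seed=0):
    """Build (noisy, clean) sentence pairs from clean sentences."""
    rng = random.Random(seed)
    return [
        (add_ocr_noise(s, error_rate, rng.randrange(2**32)), s)
        for s in clean_sentences
    ]


if __name__ == "__main__":
    for noisy, clean in make_parallel_pairs(["The old mill stood still."], 0.3):
        print(noisy, "<-", clean)
```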
Shared Slides
---
- [Shared Slides](https://docs.google.com/presentation/d/1NZffeTILI0LmeixYr9SBmj_U7Ek-mJ5Q/edit?usp=sharing&ouid=111081196382689176167&rtpof=true&sd=true)
If you are interested in detailed information, please email the author :)
###### tags: `Research Project` `OCR Post-Correction`