okralact
a multi-engine Open Source OCR training system
Rui Dong, Konstantin Baierer, Clemens Neudecker
Slides: https://hackmd.io/@kba/SyiQKUCUH
OCRopus
kraken
calamari
tesseract
OCRopus
@tmbdev since 2007 (ocropy: 2010)
Standout features:
often used, lots of documentation and wrappers (e.g. Ocrocis)
CLI tools: ocropus-nlbin, ocropus-gtedit etc.
kraken
@mittagessen since 2015
Standout features:
Clean codebase, documentation and interfaces
CLI tools: kraken binarize, ketos transcribe etc.
calamari
@CHWick since 2018
Standout features:
Clean codebase, documentation and interfaces
n-fold training and voting
tesseract
@theraysmith since 1985 (LSTM: 2016)
Standout features:
Huge user and developer base
Elaborate maintenance, testing, model generation etc.
Basic Approach
Produce line-based Ground Truth (GT)
Feed into training engine:
line text and image from GT
neural network structure
language model
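The first input, line text and image from GT, can be sketched as a small loader. This is a minimal illustration, not okralact's implementation: the function name `iter_line_gt` and the pairing of `<name>.png` with `<name>.gt.txt` are assumptions made for the example.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_line_gt(gt_dir: Path, image_suffix: str = ".png",
                 text_suffix: str = ".gt.txt") -> Iterator[Tuple[Path, str]]:
    """Yield (line image path, transcription) pairs found in gt_dir.

    The suffix convention is an assumption for this sketch.
    """
    for text_path in sorted(gt_dir.glob(f"*{text_suffix}")):
        stem = text_path.name[: -len(text_suffix)]
        image_path = gt_dir / f"{stem}{image_suffix}"
        if not image_path.exists():
            # Skip transcriptions that have no matching line image.
            continue
        yield image_path, text_path.read_text(encoding="utf-8").strip()
```

Transcriptions without an image are skipped silently here; a real pipeline would more likely report them.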
Different Conventions (examples)
File name and folder structure schemes
Neural network setup
Training stop conditions
Log format
Case in Point: tesstrain (formerly known as ocrd-train)
Started in 2018 as a Makefile to make tesseract 4.0 training manageable
Grown to include lots of optimizations
Embraced by tesseract maintainers as of August 2019
Why?
Who knows how en-default.pyrnn.gz was trained?
Synthetic data? Real line images? Both?
How would en-default.pyrnn.gz perform compared to a calamari model trained on the same data?
Engines are evolving, so models must too
Ground Truth
OCR-D has been developing GT specifications
Training has different requirements:
line-based
data integrity is crucial
metadata to track provenance
=> Engine-agnostic container format for line GT
Training
Many parameters equivalent across engines => Common API
Details matter => JSON Schema
Track progress and performance uniformly
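A rough sketch of the idea of a common API with schema validation follows. Everything named here is hypothetical: the schema, the parameter names and the engines `engine_a` and `engine_b` are placeholders, not okralact's actual schemas or any real engine's options. The validator covers only the few JSON Schema keywords the example uses.

```python
from typing import Any, Dict, List

# Hypothetical common schema, written with JSON Schema keywords.
COMMON_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["engine", "learning_rate"],
    "additionalProperties": False,
    "properties": {
        "engine": {"type": "string", "enum": ["engine_a", "engine_b"]},
        "learning_rate": {"type": "number", "exclusiveMinimum": 0},
        "max_epochs": {"type": "integer", "minimum": 1},
    },
}

# Hypothetical mapping from common names to engine-specific names.
ENGINE_PARAM_NAMES = {
    "engine_a": {"learning_rate": "lrate", "max_epochs": "epochs"},
    "engine_b": {"learning_rate": "learning_rate", "max_epochs": "num_epochs"},
}

_TYPES = {"string": str, "number": (int, float), "integer": int}


def validate(config: Dict[str, Any]) -> List[str]:
    """Return a list of error messages; an empty list means valid."""
    errors = []
    for key in COMMON_SCHEMA["required"]:
        if key not in config:
            errors.append(f"missing required parameter: {key}")
    for key, value in config.items():
        rule = COMMON_SCHEMA["properties"].get(key)
        if rule is None:
            errors.append(f"unknown parameter: {key}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _TYPES[rule["type"]]):
            errors.append(f"{key}: expected {rule['type']}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{key}: must be one of {rule['enum']}")
        if "minimum" in rule and value < rule["minimum"]:
            errors.append(f"{key}: must be >= {rule['minimum']}")
        if "exclusiveMinimum" in rule and value <= rule["exclusiveMinimum"]:
            errors.append(f"{key}: must be > {rule['exclusiveMinimum']}")
    return errors


def translate(config: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a common config and rename its keys for the chosen engine."""
    errors = validate(config)
    if errors:
        raise ValueError("; ".join(errors))
    names = ENGINE_PARAM_NAMES[config["engine"]]
    return {names[key]: value for key, value in config.items() if key != "engine"}
```

For instance, `translate({"engine": "engine_a", "learning_rate": 0.001, "max_epochs": 10})` would hand the first placeholder engine `lrate` and `epochs`, while the caller only ever sees the common names.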
Models
Uniform container format based on BagIt
Include provenance and bibliographical metadata
Previous work in kraken to build upon
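To make the BagIt idea concrete, here is a minimal sketch that wraps a single model file in a bag using only the Python standard library. It follows the basic layout of the BagIt specification (bag declaration, `data/` payload directory, checksum manifest, optional `bag-info.txt`). The function name `make_model_bag` and the metadata labels a caller passes in are assumptions for illustration, not okralact's actual model format.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict


def make_model_bag(model_file: Path, bag_dir: Path, info: Dict[str, str]) -> Path:
    """Copy model_file into a new BagIt bag at bag_dir and return bag_dir."""
    payload_dir = bag_dir / "data"
    payload_dir.mkdir(parents=True)
    payload = payload_dir / model_file.name
    shutil.copyfile(model_file, payload)

    # Bag declaration: version and tag file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path>" line per payload file.
    digest = hashlib.sha256(payload.read_bytes()).hexdigest()
    (bag_dir / "manifest-sha256.txt").write_text(
        f"{digest}  data/{model_file.name}\n", encoding="utf-8"
    )

    # Optional metadata as "Label: value" lines, e.g. provenance details.
    (bag_dir / "bag-info.txt").write_text(
        "".join(f"{label}: {value}\n" for label, value in info.items()),
        encoding="utf-8",
    )
    return bag_dir
```

The checksum manifest lets a recipient verify that the model file is intact, and `bag-info.txt` is a natural place for provenance and bibliographical metadata.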
Caveat
okralact (the software) is fully working, but still a prototype
What we want to push with okralact
engine-spanning conventions
harmonized network structure specification
better documentation and provenance of trained models
Tech stack
Redis Queues for training and evaluation backed by Python workers
Different JSON Schemas to model Common API and engine-specifics
Python/Flask for a simple web interface
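The queue-and-worker pattern behind this stack can be illustrated in-process. In the sketch below, the standard library's `queue.Queue` and a thread stand in for the Redis queue and a separate worker process, so this is not how okralact is wired up. The job fields and `run_training` are hypothetical placeholders.

```python
import queue
import threading
from typing import Dict, List


def run_training(job: Dict[str, str]) -> str:
    # Placeholder for handing the job to an actual OCR engine.
    return f"trained {job['engine']} model for job {job['id']}"


def worker(jobs: "queue.Queue", results: Dict[str, str]) -> None:
    """Take jobs off the queue until the None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            results[job["id"]] = run_training(job)
        finally:
            jobs.task_done()


def process(job_list: List[Dict[str, str]]) -> Dict[str, str]:
    """Enqueue all jobs, let one worker drain the queue, return the results."""
    jobs: "queue.Queue" = queue.Queue()
    results: Dict[str, str] = {}
    thread = threading.Thread(target=worker, args=(jobs, results))
    thread.start()
    for job in job_list:
        jobs.put(job)
    jobs.put(None)  # sentinel: no more work
    jobs.join()
    thread.join()
    return results
```

The web interface only needs to enqueue jobs and read results, which keeps long-running training out of the request cycle.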
Engine-agnostic training interfaces:
possible
necessary
in everybody's interest
Let's do this!