<style>
.kb-hi {
color: red;
background-color: #00000088;
}
.dark-bg {
background-color: #00000088 !important;
}
.light-bg {
background-color: #ffffff88 !important;
}
</style>
# okralact
### a multi-engine Open Source OCR training system
Rui Dong, Konstantin Baierer, Clemens Neudecker
<img src="https://avatars0.githubusercontent.com/u/26362587?s=200&v=4"/>
Slides: https://hackmd.io/@kba/SyiQKUCUH
---
<!-- .slide: data-background="https://raw.githubusercontent.com/kba/2019-icdar/master/ajw6qfdhxt5e4ltqqm4c.jpg" -->
<div style="font-family: monospace; white-space: pre; font-size: 2.4em">
<strong class="kb-hi">O</strong>cropus
<strong class="kb-hi">kra</strong>ken
ca<strong class="kb-hi">la</strong>mari
tessera<strong class="kb-hi">ct</strong>
</div>
----
### OCRopus
![](https://github.com/kba/2019-icdar/raw/master/687474703a2f2f6d61646d2e64666b692e64652f5f6d656469612f6f63726f7075732e706e67)
* [@tmbdev](https://github.com/tmbdev) since 2007 (ocropy: 2010)
* Standout features:
  * Widely used, lots of documentation and wrappers (e.g. Ocrocis)
  * CLI tools: `ocropus-nlbin`, `ocropus-gtedit` etc.
----
### kraken
![](https://raw.githubusercontent.com/kba/2019-icdar/master/1*3Trbay07WLA99_czXxn_yA.jpeg?token=AACCXVYDRIMEQMRIPFGF6525RLUU2)
* [@mittagessen](https://github.com/mittagessen) since 2015
* Standout features:
  * Clean codebase, documentation and interfaces
  * CLI tools: `kraken binarize`, `ketos transcribe` etc.
----
### calamari
<p>
<img class="light-bg" src="https://avatars1.githubusercontent.com/u/40763252?v=4" height="200"/>
</p>
* [@CHWick](https://github.com/chwick) since 2018
* Standout features:
  * Clean codebase, documentation and interfaces
  * n-fold training and voting
----
### tesseract
<p>
<img src="https://user-images.githubusercontent.com/33478216/39959054-aa7b6138-5614-11e8-9961-25d137dcb43b.jpg" height="200"/>
</p>
* [@theraysmith](https://github.com/theraysmith) since 1985 (LSTM: 2016)
* Standout features:
  * Huge user and developer base
  * Elaborate maintenance, testing, model generation etc.
---
<!-- .slide: data-background="https://petapixel.com/assets/uploads/2015/04/octopus.jpg" -->
<!-- .slide: class="dark-bg" -->
# Training
----
## Basic Approach
* Produce line-based Ground Truth (GT)
* Feed into training engine:
  * line text and image from GT
  * neural network structure
  * language model
----
## Different Conventions (examples)
* File name and folder structure schemes
* Neural network setup
* Training stop conditions
* Log format
----
## Case in Point: [tesstrain](https://github.com/tesseract-ocr/tesstrain) <small>(formerly known as ocrd-train)</small>
* Started in 2018 as a Makefile to make tesseract 4.0 training manageable
* Has since grown to include many optimizations
* Embraced by tesseract maintainers as of August 2019
---
<!-- .slide: data-background="https://s3-us-west-1.amazonaws.com/scifindr/articles/images/cephalopods/cephalopods-plate_franz-anthony.jpg" -->
# Standardize!
----
## Why?
* Who knows how [`en-default.pyrnn.gz`](http://www.tmbdev.net/en-default.pyrnn.gz) was trained?
  * Synthetic data? Real line images? Both?
* How would `en-default.pyrnn.gz` perform compared to Calamari trained on the same data?
* Engines are evolving :tada: so models must too
----
## Ground Truth
* OCR-D has been developing GT specifications
* Training has different requirements:
  * line-based
  * data integrity is crucial
  * metadata to track provenance
* => Engine-agnostic container format for line GT
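
----

The integrity requirement above can be illustrated with a small check over a folder of line GT. This is only a sketch, not okralact's container format: it assumes a hypothetical layout in which every line image `<name>.png` sits next to a transcription `<name>.gt.txt`, and the function name `collect_line_gt` is made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def collect_line_gt(gt_dir, image_suffix=".png", text_suffix=".gt.txt"):
    """Pair line images with transcriptions and report integrity problems."""
    gt_dir = Path(gt_dir)
    records, problems = [], []
    for image in sorted(gt_dir.glob(f"*{image_suffix}")):
        text_file = image.with_name(image.name[: -len(image_suffix)] + text_suffix)
        if not text_file.exists():
            problems.append(f"{image.name}: no transcription")
            continue
        text = text_file.read_text(encoding="utf-8").strip()
        if not text:
            problems.append(f"{text_file.name}: empty transcription")
            continue
        records.append({
            "image": image.name,
            "text": text,
            # The checksum lets later steps detect modified line images.
            "sha256": hashlib.sha256(image.read_bytes()).hexdigest(),
        })
    # Transcriptions without a line image are reported as well.
    for text_file in sorted(gt_dir.glob(f"*{text_suffix}")):
        image = text_file.with_name(text_file.name[: -len(text_suffix)] + image_suffix)
        if not image.exists():
            problems.append(f"{text_file.name}: no line image")
    return records, problems


# Demo with placeholder bytes instead of real images.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "0001.png").write_bytes(b"placeholder image bytes")
    (folder / "0001.gt.txt").write_text("A transcribed line\n", encoding="utf-8")
    (folder / "0002.png").write_bytes(b"image without transcription")
    records, problems = collect_line_gt(folder)
    print(len(records), "usable line(s);", "problems:", problems)
```

Provenance metadata (source, transcriber, guidelines used) would be attached to the returned records in the same way as the checksum.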
----
## Training
* Many parameters equivalent across engines => Common API
* Details matter => JSON Schema
* Track progress and performance uniformly
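
----

One way to picture the common API idea: validate an engine-neutral configuration, then translate it into engine-specific parameter names. Everything in this sketch is an assumption made for illustration: the engines `engine_a` and `engine_b`, all parameter names, and the hand-rolled validator, which covers only a few JSON Schema keywords (a real setup would use a full JSON Schema validator and okralact's own schemas).

```python
# Hypothetical common schema, restricted to a small subset of JSON Schema keywords.
COMMON_SCHEMA = {
    "type": "object",
    "required": ["engine", "learning_rate", "max_epochs"],
    "additionalProperties": False,
    "properties": {
        "engine": {"type": "string", "enum": ["engine_a", "engine_b"]},
        "learning_rate": {"type": "number", "exclusiveMinimum": 0},
        "max_epochs": {"type": "integer", "minimum": 1},
    },
}

# Hypothetical mapping from common names to engine-specific names.
PARAMETER_NAMES = {
    "engine_a": {"learning_rate": "lrate", "max_epochs": "epochs"},
    "engine_b": {"learning_rate": "learning-rate", "max_epochs": "max-iterations"},
}

TYPES = {"string": str, "number": (int, float), "integer": int}


def validate(config, schema=COMMON_SCHEMA):
    """Return a list of error messages; an empty list means the config is valid."""
    errors = [f"missing required parameter: {key}"
              for key in schema["required"] if key not in config]
    for key, value in config.items():
        rule = schema["properties"].get(key)
        if rule is None:
            errors.append(f"unknown parameter: {key}")
            continue
        # bool is a subclass of int in Python, but not a JSON number.
        if isinstance(value, bool) or not isinstance(value, TYPES[rule["type"]]):
            errors.append(f"{key}: expected {rule['type']}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{key}: must be one of {rule['enum']}")
        if "minimum" in rule and value < rule["minimum"]:
            errors.append(f"{key}: must be >= {rule['minimum']}")
        if "exclusiveMinimum" in rule and value <= rule["exclusiveMinimum"]:
            errors.append(f"{key}: must be > {rule['exclusiveMinimum']}")
    return errors


def translate(config):
    """Map a valid common config onto the parameter names of its engine."""
    errors = validate(config)
    if errors:
        raise ValueError("; ".join(errors))
    names = PARAMETER_NAMES[config["engine"]]
    return {names[key]: value for key, value in config.items() if key != "engine"}


common = {"engine": "engine_a", "learning_rate": 0.001, "max_epochs": 50}
print(translate(common))
```

Keeping the mapping in data rather than in code means a new engine only needs a new entry in the table and, where details differ, its own schema.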
----
## Evaluation
* Decouple evaluation from training
* Test snapshots of models by predicting GT
* Currently character error rate (CER) and word error rate (WER)
* Synergies with [qurator-spk/dinglehopper](https://github.com/qurator-spk/dinglehopper), [eddieantonio/ocreval](https://github.com/eddieantonio/ocreval) and [impactcentre/ocrevalUAtion](https://github.com/impactcentre/ocrevalUAtion)
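
----

For reference, both metrics can be computed from the Levenshtein distance between GT and prediction. The following is a minimal sketch under simple assumptions (no Unicode normalization, whitespace tokenization for words, errors summed over all lines); it is not okralact's evaluation code, and the tools listed above handle normalization and alignment in more detail.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of tokens)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(pairs, tokenize):
    """Total edit distance divided by total GT length over (gt, prediction) pairs."""
    errors = 0
    length = 0
    for gt, prediction in pairs:
        ref, hyp = tokenize(gt), tokenize(prediction)
        errors += levenshtein(ref, hyp)
        length += len(ref)
    return errors / length if length else 0.0


def cer(pairs):
    """Character error rate: edit distance counted on characters."""
    return error_rate(pairs, list)


def wer(pairs):
    """Word error rate: edit distance counted on whitespace-separated words."""
    return error_rate(pairs, str.split)


pairs = [("the quick fox", "the quick f0x"), ("jumps over", "jumps ovr")]
print(f"CER: {cer(pairs):.3f}  WER: {wer(pairs):.3f}")
# prints "CER: 0.087  WER: 0.400"
```

Because the functions only need (GT, prediction) pairs, they can be run on any model snapshot without touching the training code.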
----
## Models
* Uniform container format based on BagIt
* Include provenance and bibliographical metadata
* Previous work in kraken to build upon
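
----

A BagIt bag is a directory with a `bagit.txt` declaration, a `data/` payload folder and a checksum manifest. The sketch below writes and verifies a minimal bag using only the standard library. The payload file `provenance.json`, the metadata keys and both function names are assumptions for illustration, not the okralact model format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_model_bag(bag_dir, model_file, metadata):
    """Write a minimal BagIt bag: payload in data/, declaration, manifest, bag-info."""
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)

    # Payload: the model file plus a (hypothetical) provenance record.
    payload = {
        Path(model_file).name: Path(model_file).read_bytes(),
        "provenance.json": json.dumps(metadata, indent=2).encode("utf-8"),
    }
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}\n")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8")
    (bag_dir / "manifest-sha256.txt").write_text("".join(manifest_lines), encoding="utf-8")
    (bag_dir / "bag-info.txt").write_text(
        f"External-Description: {metadata['description']}\n", encoding="utf-8")
    return bag_dir


def verify_bag(bag_dir):
    """Recompute every payload checksum listed in the manifest."""
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, path = line.split(maxsplit=1)
        if hashlib.sha256((bag_dir / path).read_bytes()).hexdigest() != digest:
            return False
    return True


# Demo with placeholder bytes instead of a real model.
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "example.model"
    model.write_bytes(b"placeholder model bytes")
    bag = write_model_bag(Path(tmp) / "bag", model,
                          {"description": "demo model", "engine": "kraken"})
    print("bag is valid:", verify_bag(bag))
```

The manifest makes silent corruption of a shared model detectable, and the tag files give provenance and bibliographical metadata a fixed place.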
---
<!-- .slide: data-background="https://github.com/Doreenruirui/okralact/raw/master/docs/Framework.png" -->
<!-- .slide: class="dark-bg" -->
# Architecture
----
## Caveat
* okralact, the software, is fully working but still a prototype
* What we want to push with okralact:
  * engine-spanning conventions
  * harmonized network structure specification
  * better documentation and provenance of trained models
----
## Tech stack
* Redis Queues for training and evaluation backed by Python workers
* Different JSON Schemas to model Common API and engine-specifics
* Python/Flask for a simple web interface
----
<!-- .slide: data-background="https://raw.githubusercontent.com/kba/2019-icdar/master/okralact-manual.png" -->
---
### Engine-agnostic training interfaces:
possible

necessary

in everybody's interest
<p>
<img src="https://i.kym-cdn.com/entries/icons/thumb/000/001/987/fyeah.jpg?1269221733" height="200"/>
</p>
#### Let's do this!
---
### Thank you!
<img src="https://raw.githubusercontent.com/kba/2019-icdar/master/end-man.png" height="200"/>
* Okralact: https://github.com/OCR-D/okralact
* OCR-D GitHub Org: https://github.com/OCR-D
* OCR-D Specs and Documentation: https://ocr-d.github.io
* Chat with us: https://gitter.im/OCR-D/Lobby
Talk to [@kba](https://github.com/kba), [@cneud](https://github.com/cneud), [@seuretm](https://github.com/seuretm) at HIP/ICDAR!