Linked with GitHub kba/2019-icdar/slides.md

okralact

a multi-engine Open Source OCR training system

Rui Dong, Konstantin Baierer, Clemens Neudecker

Slides: https://hackmd.io/@kba/SyiQKUCUH


OCRopus, kraken, calamari, tesseract

OCRopus

  • @tmbdev since 2007 (ocropy: 2010)
  • Standout features:
    • often used, lots of documentation and wrappers (e.g. Ocrocis)
    • CLI tools: ocropus-nlbin, ocropus-gtedit etc.

kraken

  • @mittagessen since 2015
  • Standout features:
    • Clean codebase, documentation and interfaces
    • CLI tools: kraken binarize, ketos transcribe etc.

calamari

  • @CHWick since 2018
  • Standout features:
    • Clean codebase, documentation and interfaces
    • n-fold training and voting

tesseract

  • @theraysmith since 1985 (LSTM: 2016)
  • Standout features:
    • Huge user and developer base
    • Elaborate maintenance, testing, model generation etc.

Training


Basic Approach

  • Produce line-based Ground Truth (GT)
  • Feed into training engine:
    • line text and image from GT
    • neural network structure
    • language model
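Collecting line text and images from GT, the first input listed above, might look roughly like this. This is a minimal sketch: the `.png` / `.gt.txt` naming scheme and the `collect_line_pairs` helper are assumptions for illustration, not the convention of any particular engine.

```python
from pathlib import Path
import tempfile


def collect_line_pairs(gt_dir, image_ext=".png", text_ext=".gt.txt"):
    """Return sorted (image_path, transcription) pairs for one GT directory.

    Line images without a matching, non-empty transcription are skipped.
    """
    pairs = []
    for image_path in sorted(Path(gt_dir).glob(f"*{image_ext}")):
        # Assumed convention: line0001.png <-> line0001.gt.txt
        stem = image_path.name[: -len(image_ext)]
        text_path = image_path.with_name(stem + text_ext)
        if not text_path.is_file():
            continue
        text = text_path.read_text(encoding="utf-8").strip()
        if text:
            pairs.append((image_path, text))
    return pairs


# Usage with throwaway files (the "images" are empty placeholders).
with tempfile.TemporaryDirectory() as tmp:
    gt = Path(tmp)
    (gt / "line0001.png").write_bytes(b"")
    (gt / "line0001.gt.txt").write_text("Hello world\n", encoding="utf-8")
    (gt / "line0002.png").write_bytes(b"")  # no transcription, so skipped
    pairs = collect_line_pairs(gt)
    texts = [text for _, text in pairs]
```

The network structure and language model would be passed to the engine separately; how that happens differs per engine.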

Different Conventions (examples)

  • File name and folder structure schemes
  • Neural network setup
  • Training stop conditions
  • Log format

Case in Point: tesstrain (formerly known as ocrd-train)

  • Started in 2018 as Makefile to make tesseract 4.0 training manageable
  • Grown to include lots of optimizations
  • Embraced by tesseract maintainers as of August 2019

Standardize!


Why?

  • Who knows how en-default.pyrnn.gz was trained?
  • Synthetic data? Real line images? Both?
  • How would en-default.pyrnn.gz perform compared to Calamari trained on the same data?
  • Engines are evolving, so models must too

Ground Truth

  • OCR-D has been developing GT specifications
  • Training has different requirements:
    • line-based
    • data integrity is crucial
    • metadata to track provenance
  • => Engine-agnostic container format for line GT
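To make the integrity and provenance requirements concrete, here is a hedged sketch of a manifest for line GT. The manifest layout, the field names and the helpers `build_manifest` and `changed_lines` are all hypothetical; they are not the container format proposed in this talk.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(lines, provenance):
    """lines: dict mapping line id -> (image_bytes, transcription)."""
    entries = {}
    for line_id, (image_bytes, text) in sorted(lines.items()):
        entries[line_id] = {
            "image_sha256": sha256_hex(image_bytes),
            "text_sha256": sha256_hex(text.encode("utf-8")),
        }
    return {"provenance": provenance, "lines": entries}


def changed_lines(manifest, lines):
    """Return ids of lines that are missing or no longer match the manifest."""
    bad = []
    for line_id, entry in manifest["lines"].items():
        if line_id not in lines:
            bad.append(line_id)
            continue
        image_bytes, text = lines[line_id]
        if (sha256_hex(image_bytes) != entry["image_sha256"]
                or sha256_hex(text.encode("utf-8")) != entry["text_sha256"]):
            bad.append(line_id)
    return bad


# Usage: placeholder bytes stand in for a real line image.
lines = {"line0001": (b"fake-image-bytes", "Hello world")}
manifest = build_manifest(lines, {"source": "hypothetical scan", "transcriber": "example"})
tampered = {"line0001": (b"fake-image-bytes", "Hello w0rld")}
serialized = json.dumps(manifest, indent=2)  # could be stored next to the GT
```

Checksums catch silent edits to either the image or the transcription, and the provenance block travels with the data.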

Training

  • Many parameters equivalent across engines => Common API
  • Details matter => JSON Schema
  • Track progress and performance uniformly
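A Common API described by JSON Schema could be sketched as below. The property names (`learning_rate`, `max_epochs`) are illustrative assumptions, not okralact's actual schema, and the hand-rolled `validate` function only checks `required`, `enum` and `type`; a real setup would use a full JSON Schema validator.

```python
import json

# Hypothetical schema fragment for parameters shared across engines.
COMMON_SCHEMA = {
    "type": "object",
    "required": ["engine", "learning_rate", "max_epochs"],
    "properties": {
        "engine": {"enum": ["ocropus", "kraken", "calamari", "tesseract"]},
        "learning_rate": {"type": "number"},
        "max_epochs": {"type": "integer"},
    },
}

JSON_TYPES = {"number": (int, float), "integer": int, "string": str}


def validate(config, schema):
    """Return a list of error strings; an empty list means the config passed."""
    errors = []
    for key in schema.get("required", []):
        if key not in config:
            errors.append(f"missing: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key not in config:
            continue
        value = config[key]
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"invalid value for {key}: {value!r}")
        if "type" in rule:
            # bool is a subclass of int in Python but is not a JSON number
            if isinstance(value, bool) or not isinstance(value, JSON_TYPES[rule["type"]]):
                errors.append(f"wrong type for {key}")
    return errors


good = json.loads('{"engine": "kraken", "learning_rate": 0.001, "max_epochs": 50}')
bad = {"engine": "unknown-engine", "max_epochs": "many"}
good_errors = validate(good, COMMON_SCHEMA)
bad_errors = validate(bad, COMMON_SCHEMA)
```

Engine-specific details would live in separate schemas layered on top of the common one.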

Evaluation


Models

  • Uniform container format based on BagIt
  • Include provenance and bibliographical metadata
  • Previous work in kraken to build upon
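For readers unfamiliar with BagIt, the following sketch writes a minimal bag: a `bagit.txt` declaration, a payload under `data/`, a SHA-256 payload manifest, and metadata in `bag-info.txt`. The `write_bag` helper, the file name `model.bin` and the metadata values are assumptions for illustration; the actual model container layout is not specified here.

```python
import hashlib
import tempfile
from pathlib import Path


def write_bag(bag_dir, payload, info):
    """Write a minimal BagIt bag.

    payload: dict of flat file name -> bytes, stored under data/
    info:    dict of metadata label -> value, stored in bag-info.txt
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}\n")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text("".join(manifest_lines), encoding="utf-8")
    (bag / "bag-info.txt").write_text(
        "".join(f"{label}: {value}\n" for label, value in info.items()),
        encoding="utf-8",
    )
    return bag


# Usage: placeholder bytes stand in for real model weights.
with tempfile.TemporaryDirectory() as tmp:
    bag = write_bag(
        Path(tmp) / "model-bag",
        {"model.bin": b"fake-model-weights"},
        {
            "Source-Organization": "Example Library",
            "External-Description": "hypothetical model trained on example line GT",
        },
    )
    bag_files = sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file())
    manifest_text = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
```

Provenance and bibliographical metadata would go into `bag-info.txt` or additional tag files alongside it.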

Architecture


Caveat

  • okralact, the software, is fully working but still a prototype
  • What we want to push with okralact:
    • engine-spanning conventions
    • harmonized network structure specification
    • better documentation and provenance of trained models

Tech stack

  • Redis Queues for training and evaluation backed by Python workers
  • Different JSON Schemas to model Common API and engine-specifics
  • Python/Flask for a simple web interface
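The queue-and-worker pattern behind the first bullet can be illustrated without a Redis server. The sketch below deliberately swaps in Python's in-process `queue.Queue` and a single thread as a stand-in for Redis-backed queues and separate worker processes; `fake_train` and the job fields are hypothetical.

```python
import queue
import threading


def fake_train(job):
    # Placeholder for invoking an OCR engine's training; returns a fake result.
    return {"job_id": job["job_id"], "engine": job["engine"], "status": "finished"}


def worker(jobs, results):
    """Process jobs in FIFO order until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results.append(fake_train(job))


jobs = queue.Queue()
results = []
thread = threading.Thread(target=worker, args=(jobs, results))
thread.start()

# The web interface would enqueue jobs like these.
for job_id, engine in enumerate(["kraken", "calamari"], start=1):
    jobs.put({"job_id": job_id, "engine": engine})
jobs.put(None)  # tell the worker to stop
thread.join()
```

Decoupling submission from execution this way lets long-running training and evaluation jobs proceed without blocking the web interface.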


Engine-agnostic training interfaces:

possible, necessary, in everybody's interest

Let's do this!


Thank you!

Talk to @kba, @cneud, @seuretm at HIP/ICDAR!
