
Integration of Kitodo and OCR-D for Productive Mass-Digitisation

OCR-D Phase 3 Kick-Off

Robert Sachunsky

July 29, 2021


Implementation Project Kitodo / OCR-D

  • Participants
    • Sächsische Landesbibliothek –
      Staats- und Universitätsbibliothek Dresden (SLUB)
    • Universitätsbibliothek der TU Braunschweig (UBBS)
    • Universitätsbibliothek Mannheim (UBMA)
  • Volume: 8 person-years
  • Duration: 2 years
  • Start: October 2021


Prior Work

  • SLUB staff: OCR-D coordination / module project (2018/19)
  • UBMA: OCR-D module project (2018/19)
  • SLUB, UBMA: OCR-D pilot libraries (2019/20)
  • SLUB: Kitodo development (since 2012)
  • UBBS: Kitodo testing, documentation, migration
  • all: experienced OCR-D and Kitodo users

Premises

  • Kitodo: Workflow Management System for libraries

    • Open-source, community-driven
    • Modules:
      • Kitodo.Production (digitisation workflows)
      • Kitodo.Presentation (DFG viewer etc.)
    • OCR: only via commercial plugins (black box, license costs)
  • OCR-D: operative single-workstation command-line prototype

    • no network interfaces for distribution/scaling yet
    • no error recovery or dynamic workflow execution yet
    • no result quality estimation or runtime evaluation yet
    • no assisted/automatic workflow configuration yet

Goals (OCR-D)

  1. Implement OCR-D as a Web-based distributed system

    • easily scalable
    • easily deployable
  2. Develop quality-based workflow optimisation for OCR-D

    • use heuristics and models for quality estimation of interim results
      during preprocessing, segmentation and recognition
    • weight interim result quality relative to contribution to overall result (follow-up steps, other pages)
    • when insufficient, switch to alternative configuration for segment/page/document automatically, or abort computation
    • offer a set of empirically optimised, dynamically quality-controlled workflow configurations for various materials
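
The quality-controlled execution described in goal 2 can be illustrated with a minimal sketch. It is not part of OCR-D or Kitodo: the names `Step`, `run_workflow`, `WorkflowAborted` and `toy_quality` are hypothetical, the data is a plain string instead of page images, and the weighting of interim quality is reduced to one threshold per step. Each step tries its configurations in order, keeps the first interim result whose estimated quality reaches the threshold, and aborts if none does.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class WorkflowAborted(Exception):
    """Raised when no configuration of a step reaches its quality threshold."""


@dataclass
class Step:
    """One workflow step with alternative configurations (hypothetical model)."""
    name: str
    # Ordered alternatives: (label, function producing an interim result).
    configs: Sequence[Tuple[str, Callable[[str], str]]]
    # Quality estimator for interim results, returning a score in [0, 1].
    estimate: Callable[[str], float]
    # Minimum acceptable score; stands in for the weighting of this step.
    threshold: float


def run_workflow(data: str, steps: Sequence[Step]) -> Tuple[str, List[str]]:
    """Run all steps, switching to the next configuration when quality is too low."""
    chosen: List[str] = []
    for step in steps:
        for label, process in step.configs:
            result = process(data)
            if step.estimate(result) >= step.threshold:
                data = result
                chosen.append(f"{step.name}:{label}")
                break
        else:
            # No alternative was good enough: abort the computation.
            raise WorkflowAborted(step.name)
    return data, chosen


def toy_quality(text: str) -> float:
    """Toy estimator: share of characters that are letters or whitespace."""
    if not text:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in text) / len(text)


# Toy step: the "fast" configuration corrupts every "e", "careful" does not.
demo_step = Step(
    name="recognition",
    configs=[
        ("fast", lambda s: s.replace("e", "#")),
        ("careful", lambda s: s),
    ],
    estimate=toy_quality,
    threshold=0.95,
)

if __name__ == "__main__":
    print(run_workflow("hello world", [demo_step]))
```

In this toy run the "fast" result falls below the threshold, so the "careful" configuration is selected. A real implementation would apply such switches per segment, page or document and use estimators suited to each processing stage.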

Goals (Kitodo)

  1. Implement OCR-D as an OCR module in Kitodo.Production

    • import images, metadata and structure data
    • track and visualise result progress/quality
    • error handling, versioning
    • export, validate and ingest results
    • edit and manage workflow configurations
  2. Extend Kitodo.Presentation and DFG Viewer

    • user evaluation of results, versioning
    • user prioritisation of (re-)OCR tasks (On-Demand-OCR)

Goals (further)

  • close collaboration with OCR-D coordination project

  • cooperation with other OCR-D implementation/module projects

  • Kitodo community workshops (disseminate and query requirements)

  • Kitodo community OCR-D service (test operation)


System Architecture

  • Kitodo with OCR-D "backend" as distributed system
  • strong integration of data and process management
  • generic/agnostic on both sides
  [Figure: architecture]

Project Plan

  • AP1 (SLUB): coordination and communication
  • AP2 (UBBS): management of Kitodo community
  • AP3 (SLUB): detailed technical specification
  • AP4 (SLUB): OCR-D server implementation
  • AP5 (SLUB): concept for automatic process control
  • AP6 (SLUB): implement automatic process control
  • AP7 (SLUB): develop quality estimation metrics & models
  • AP8 (UBBS): set up OCR-D service for Kitodo community
  • AP9 (SLUB): integrate OCR-D into Kitodo.Production
  • AP10 (UBMA): integrate OCR-D into Kitodo.Presentation
  • AP11 (UBMA): data storage, ingest and versioning
  • AP12 (SLUB): run and test DFG Viewer with OCR on Demand
  • AP13 (UBBS): evaluation and documentation with Kitodo community

Synergies and Interfaces

  • extending the processor CLI for error handling/signalling and parallelism
  • molding the final OCR-D Web API
  • providing reference implementations for server components
  • providing reference implementations for module containers
  • defining an evaluator CLI (analogous to processor CLI)
  • generalising the OCR-D workflow format for evaluators and switches
  • running & evaluating workflow experiments systematically
  • compiling optimised workflow configurations
  • defining quality metrics for workflow steps

Q & A

Thank you!
