OCR(-D) and Kitodo

Michael Lütgen, Zeutschel GmbH
Elisabeth Engl, HAB Wolfenbüttel
Konstantin Baierer, Staatsbibliothek zu Berlin

https://hackmd.io/@kba/S1peIVxhH

Kitodo user meeting (Anwendertreffen), 2019-11-19

Zeutschel OCR Cloud

OCR-D in a nutshell

DFG-funded project focused on full-text digitization of VD library material
=> Development of software for the whole OCR process
Coordination project (HAB, KIT, BBAW, SBB)
8 module projects implementing specific functionality

Pilot libraries

Functionality testing until January 2020
Recognition rate and performance are secondary at the moment
Partners:
- SLUB Dresden
- UB Mannheim
- SUB Göttingen
- UB Heidelberg
- ULB Halle
- ULB Darmstadt

Specs and Documentation

https://ocr-d.github.io

Use of file formats: METS, PAGE, OCRD-ZIP
Command line interface
Actionable tool description
Docker conventions
Ground Truth Transcription Guidelines
Installation and implementation instructions
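
To make the role of METS as the container format more concrete, here is a minimal sketch that lists the file groups of a METS document and the file locations in each group. It uses only the Python standard library. The sample document, the group names (OCR-D-IMG, OCR-D-OCR) and the file paths are made up for illustration, and the helper name list_file_groups is hypothetical; this is not the OCR-D library API.

```python
import xml.etree.ElementTree as ET

# Namespaces of METS and XLink (standard URIs).
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Made-up sample: one image file group and one OCR result file group.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="OCR-D-IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-OCR">
      <mets:file ID="OCR-D-OCR_0001" MIMETYPE="application/vnd.prima.page+xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-OCR/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_file_groups(mets_xml: str) -> dict:
    """Map each fileGrp USE attribute to the list of file locations in it."""
    root = ET.fromstring(mets_xml)
    href_attr = "{%s}href" % NS["xlink"]
    groups = {}
    for grp in root.findall("mets:fileSec/mets:fileGrp", NS):
        locations = [
            flocat.get(href_attr)
            for flocat in grp.findall("mets:file/mets:FLocat", NS)
        ]
        groups[grp.get("USE")] = locations
    return groups


file_groups = list_file_groups(SAMPLE_METS)
for use, locations in file_groups.items():
    print(use, locations)
```

Each processing step reads one file group and writes another, which is why the file group names show up again as arguments on the command line.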

Module projects

https://kba.cloud/ocrd-kwalitee

OCR-D Deployment

From source: make install in the root directory of the module project (MP) repository
From PyPI: pip install ocrd_tesserocr
"Slim Docker": docker pull ocrd/tesserocr
"Fat Docker": https://github.com/stweil/ocrd_all
Taverna Workflow Editor: https://github.com/OCR-D/taverna_workflow
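
Whichever installation route is chosen, a processor is called the same way. The sketch below only builds the command lines (it does not run them) for a native installation and for the "slim" Docker image. The flags -m, -I and -O follow the OCR-D command line interface spec as I understand it. The helper names, the file group names, the workspace path and the /data mount point are assumptions for illustration, as is the idea that the image runs the processor as a plain command.

```python
import shlex


def processor_command(executable, input_grp, output_grp, mets="mets.xml"):
    """Command line for one processor call on a METS workspace."""
    return [executable, "-m", mets, "-I", input_grp, "-O", output_grp]


def docker_command(image, inner_command, workdir):
    """Wrap a processor call in 'docker run'.

    The workspace directory is mounted at /data (an assumed mount point)
    and used as the working directory inside the container.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",
        "-w", "/data",
        image,
        *inner_command,
    ]


# Native installation (from source or PyPI).
native = processor_command("ocrd-tesserocr-recognize", "OCR-D-IMG", "OCR-D-OCR")

# Same call inside the "slim" Docker image; the workspace path is a placeholder.
in_docker = docker_command("ocrd/tesserocr", native, "/path/to/workspace")

print(shlex.join(native))
print(shlex.join(in_docker))
```

Keeping the commands as argument lists means they can be passed to subprocess.run unchanged once the paths are real.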

Other software in the OCR-D ecosystem

Training:

Evaluation: qurator-spk/dinglehopper

Deployment:
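
On the evaluation entry above: a common measure for comparing OCR output against ground truth is the character error rate (CER), the edit distance between the two texts divided by the length of the ground truth. The sketch below is a plain textbook version and not dinglehopper's implementation; it does no Unicode normalization, and the function names and sample strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_text: str) -> float:
    """Edit distance normalized by the ground truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)


# Made-up example: one wrong character in a seven-character word.
rate = character_error_rate("Fraktur", "Fraktnr")
print(f"CER: {rate:.3f}")
```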

Why ABBYY

Okay segmentation
Fast
Good recognition of modern types

Why NOT ABBYY

Closed Source
Volume-based licensing
Little support for historical types
No training possible
Fraktur: expensive, with very little ongoing development
Black Box: Few possibilities to influence recognition

Why OCR-D?

Free Software
Massive progress in neural networks in recent years
=> Excellent quality, esp. with fine-tuned models
Modular: OCR as a process of configurable steps
=> alternatives for all steps
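
The idea of OCR as a chain of configurable steps with interchangeable implementations can be sketched as follows. This is a toy model, not the OCR-D API: the stage names, the implementation names ("engine-a", "engine-b") and the helpers are all hypothetical, and each step only records that it ran instead of processing an image.

```python
from typing import Callable, Dict, List, Tuple

Page = dict
Step = Callable[[Page], Page]


def make_step(label: str) -> Step:
    """Create a placeholder step that appends its label to the page history."""
    def step(page: Page) -> Page:
        return {**page, "history": page["history"] + [label]}
    return step


# For every stage there is more than one implementation to choose from.
ALTERNATIVES: Dict[str, Dict[str, Step]] = {
    stage: {name: make_step(f"{stage}/{name}") for name in ("engine-a", "engine-b")}
    for stage in ("binarization", "segmentation", "recognition")
}


def run_workflow(page: Page, config: List[Tuple[str, str]]) -> Page:
    """Apply the configured implementation of each stage in order."""
    for stage, choice in config:
        page = ALTERNATIVES[stage][choice](page)
    return page


# A workflow is just configuration: swapping an engine changes one entry.
workflow = [
    ("binarization", "engine-a"),
    ("segmentation", "engine-b"),
    ("recognition", "engine-a"),
]
result = run_workflow({"id": "page-0001", "history": []}, workflow)
print(result["history"])
```

Because the workflow is data rather than code, the same mechanism could be driven from a config file or a user interface.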

OCR in Kitodo: Status Quo

According to https://github.com/kitodo/kitodo-production/wiki/OCR

OCR web service of GBV/VZG (?)
Intranda Taskmanager
Zeutschel zedOCR

Offline OCR and reimport

Considerations

Configurability: In Kitodo.Production UI? Config files? Black box?
Technical Integration: Long-running and potentially resource-intensive processes
Deployment: Locally? Remote but in-house? Commercial service provider? Verbund-level service?
Tradeoff: Quality of recognition vs speed and manual effort

Discussion I

How much of a priority is OCR for your institution?
What use cases do you have for OCR beyond search and providing research data?
How important is OCR for older types or less-supported languages at your institution?

Discussion II

How could OCR improve Kitodo.Production?
- Generate table of contents
- Article/Issue separation
OCR training as part of Kitodo?
How do you integrate OCR into Kitodo now?
Once OCR-D is ready, what is the added value of having OCR done by an external provider?

Discussion III

Which file formats do you want OCR results in?
- PAGE (currently supported by OCR-D)
- ALTO
- hOCR
- TEI
- WebAnnotation (for IIIF)


Links