<style>
.kb-hi {
color: red;
background-color: #00000088;
}
.dark-bg {
background-color: #00000088 !important;
}
.light-bg {
background-color: #ffffff88 !important;
}
</style>
# OCR(-D) and Kitodo
Michael Lütgen, Zeutschel GmbH
Elisabeth Engl, HAB Wolfenbüttel
Konstantin Baierer, Staatsbibliothek zu Berlin
https://hackmd.io/@kba/S1peIVxhH
Kitodo-Anwendertreffen 2019-11-19
---
## Zeutschel OCR Cloud
---
## OCR-D in a nutshell
- DFG-funded project focused on full-text digitization of VD library material
=> Development of software for the whole OCR process
- Coordination project (HAB, KIT, BBAW, SBB)
- 8 module projects implementing specific functionality
---
## Pilot libraries
- Functionality testing until January 2020
- Recognition rate and performance are secondary at the moment
- Partners:
- SLUB Dresden
- UB Mannheim
- SUB Göttingen
- UB Heidelberg
- ULB Halle
- ULB Darmstadt
---
## Specs and Documentation
https://ocr-d.github.io
* File format usage: METS, PAGE, OCRD-ZIP
* Command line interface
* Actionable tool description
* Docker conventions
* Ground Truth Transcription Guidelines
* Installation and implementation instructions
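
To make the METS convention more concrete, here is a minimal sketch of how the file groups of a workspace could be listed with Python's standard library. The sample document is a hypothetical, heavily trimmed METS file written for this illustration, and the group names `OCR-D-IMG` and `OCR-D-OCR` are assumptions, not taken from a real workspace.

```python
import xml.etree.ElementTree as ET

# Namespaces used by METS and by the xlink:href attribute of FLocat.
NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, trimmed METS document (illustration only, not schema-complete).
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-OCR">
      <mets:file ID="OCR_0001" MIMETYPE="application/vnd.prima.page+xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-OCR/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_file_groups(mets_xml):
    """Map each fileGrp USE attribute to the hrefs of the files it contains."""
    root = ET.fromstring(mets_xml)
    groups = {}
    for grp in root.iterfind(".//mets:fileGrp", NS):
        hrefs = [
            flocat.get(XLINK_HREF)
            for flocat in grp.iterfind("mets:file/mets:FLocat", NS)
        ]
        groups[grp.get("USE")] = hrefs
    return groups


for use, hrefs in list_file_groups(SAMPLE_METS).items():
    print(use, hrefs)
```

Each processing step reads one file group and writes another, so the METS file doubles as a record of what has been done to the images so far.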
---
## Module projects
https://kba.cloud/ocrd-kwalitee
---
## OCR-D Deployment
* From source: `make install` in the root directory of the module project (MP) repository
* From PyPI: `pip install ocrd_tesserocr`
* "Slim Docker": `docker pull ocrd/tesserocr`
* "Fat Docker": https://github.com/stweil/ocrd_all
* Taverna Workflow Editor: https://github.com/OCR-D/taverna_workflow
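
Once a processor is installed by any of these routes, it is called through the common command line interface. The following sketch only builds such a call as an argument list and does not run it. The helper `build_processor_command` is hypothetical, and the file group names and the `mets.xml` path are placeholders chosen for illustration; a real call may need further options such as processor parameters.

```python
import shlex


def build_processor_command(processor, mets_path, input_grp, output_grp):
    """Build the argument list for one processor call (hypothetical helper).

    Uses the long forms of the -m, -I and -O options of the OCR-D CLI.
    """
    return [
        processor,
        "--mets", mets_path,
        "--input-file-grp", input_grp,
        "--output-file-grp", output_grp,
    ]


# Placeholder file group names; the processor comes from ocrd_tesserocr.
cmd = build_processor_command(
    "ocrd-tesserocr-recognize", "mets.xml", "OCR-D-SEG-LINE", "OCR-D-OCR-TESS"
)

# Print a shell-safe version for inspection instead of executing it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the call as a list makes it straightforward to hand over to `subprocess.run` later without shell quoting issues.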
---
## Other software of the OCR-D ecosystem
Training:
* [tesseract-ocr/tesstrain](https://github.com/tesseract-ocr/tesstrain)
* [OCR-D/okralact](https://github.com/OCR-D/okralact)
Evaluation: [qurator-spk/dinglehopper](https://github.com/qurator-spk/dinglehopper)
Deployment:
* [stweil/ocrd_all](https://github.com/stweil/ocrd_all)
* [bertsky/workflow-configuration](https://github.com/bertsky/workflow-configuration)
---
## Why ABBYY
* Okay segmentation
* Fast
* Good recognition of modern types
---
## Why **NOT** ABBYY
* Closed Source
* Volume-based licensing
* Little support for historical types
* No training possible
* Fraktur: expensive and very little development
* Black Box: Few possibilities to influence recognition
---
## Why OCR-D?
* Free Software
* Massive progress in neural networks in recent years
=> Excellent quality, esp. with fine-tuned models
* Modular: OCR as a process of configurable steps
=> alternatives for all steps
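
The idea of configurable, exchangeable steps can be sketched as a list of processors in which each step reads the file group written by the previous one. This is an illustration under assumptions, not an OCR-D API: `chain_steps` is a hypothetical helper, the `STEP-nn` group names are invented, and the processor names merely stand in for whatever is installed locally.

```python
def chain_steps(processors, first_input="OCR-D-IMG"):
    """Return (processor, input_grp, output_grp) triples for a linear workflow.

    Hypothetical helper: the output group of each step becomes the
    input group of the next one.
    """
    steps = []
    current_input = first_input
    for index, processor in enumerate(processors, start=1):
        output_grp = f"STEP-{index:02d}"  # invented naming scheme
        steps.append((processor, current_input, output_grp))
        current_input = output_grp
    return steps


workflow = chain_steps([
    "ocrd-olena-binarize",
    "ocrd-tesserocr-segment-region",
    "ocrd-tesserocr-segment-line",
    "ocrd-tesserocr-recognize",
])

# Swapping one step for an alternative leaves the rest of the chain untouched.
alternative = chain_steps([
    "ocrd-cis-ocropy-binarize",
    "ocrd-tesserocr-segment-region",
    "ocrd-tesserocr-segment-line",
    "ocrd-tesserocr-recognize",
])

for processor, input_grp, output_grp in workflow:
    print(f"{processor}: {input_grp} -> {output_grp}")
```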
---
## OCR in Kitodo: Status Quo
According to https://github.com/kitodo/kitodo-production/wiki/OCR
* OCR web service of GBV/VZG (?)
* Intranda Taskmanager
* Zeutschel zedOCR
**OR**
* Offline OCR and reimport
---
## Considerations
- Configurability: In Kitodo.Production UI? Config files? Black box?
- Technical Integration: Long-running and potentially resource-intensive processes
- Deployment: Locally? Remote but in-house? Commercial service provider? Verbund-level service?
- Tradeoff: Quality of recognition vs speed and manual effort
---
## Discussion I
- How much of a priority is OCR for your institution?
- What use cases do you have for OCR beyond search and providing research data?
- How important is OCR for older types or less-supported languages at your institution?
----
## Discussion II
- How could OCR improve Kitodo.Production?
- Generate table of contents
- Article/Issue separation
- OCR *training* as part of Kitodo?
- How do you integrate OCR into Kitodo now?
- Once OCR-D is ready, what is the added value of having OCR done by an external provider?
----
## Discussion III
- Which file formats do you want OCR results in?
- PAGE (currently supported by OCR-D)
- ALTO
- hOCR
- TEI
- WebAnnotation (for IIIF)
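
As a point of reference for this discussion, the sketch below shows how line-level text could be pulled out of a PAGE result with Python's standard library, for example as a first step towards converting it to another format. The sample is a hypothetical fragment written for this illustration; it omits coordinates and other required elements and is therefore not schema-valid.

```python
import xml.etree.ElementTree as ET

# PAGE content namespace (2019-07-15 schema version).
PAGE_NS = {
    "pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
}

# Hypothetical, trimmed PAGE fragment (illustration only, no Coords).
SAMPLE_PAGE = """<pc:PcGts
    xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="0001.tif" imageWidth="100" imageHeight="100">
    <pc:TextRegion id="r1">
      <pc:TextLine id="r1_l1">
        <pc:TextEquiv><pc:Unicode>First line</pc:Unicode></pc:TextEquiv>
      </pc:TextLine>
      <pc:TextLine id="r1_l2">
        <pc:TextEquiv><pc:Unicode>Second line</pc:Unicode></pc:TextEquiv>
      </pc:TextLine>
    </pc:TextRegion>
  </pc:Page>
</pc:PcGts>
"""


def line_texts(page_xml):
    """Return the text of every TextLine, in document order."""
    root = ET.fromstring(page_xml)
    # Only TextEquiv elements that are direct children of a TextLine are
    # read, so word-level or region-level text is not duplicated.
    path = ".//pc:TextLine/pc:TextEquiv/pc:Unicode"
    return [node.text or "" for node in root.iterfind(path, PAGE_NS)]


print("\n".join(line_texts(SAMPLE_PAGE)))
```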
---
## Links
* [OCR-D.github.io](https://ocr-d.github.io)
* [OCR-D @ Github](https://github.com/OCR-D)
* [OCR-D.de](https://ocr-d.de)
* [DHd AG OCR](https://gitter.im/ag-ocr/community) (website under construction)
* [ocrd-kwalitee](https://kba.cloud/ocrd-kwalitee)