<style> .kb-hi { color: red; background-color: #00000088; } .dark-bg { background-color: #00000088 !important; } .light-bg { background-color: #ffffff88 !important; } </style>

# OCR(-D) and Kitodo

Michael Lütgen, Zeutschel GmbH

Elisabeth Engl, HAB Wolfenbüttel

Konstantin Baierer, Staatsbibliothek zu Berlin

https://hackmd.io/@kba/S1peIVxhH

Kitodo user meeting (Kitodo-Anwendertreffen), 2019-11-19

---

## Zeutschel OCR Cloud

---

## OCR-D in a nutshell

- DFG-funded project focused on full-text digitization of VD library material => development of software for the whole OCR process
- Coordination project (HAB, KIT, BBAW, SBB)
- 8 module projects implementing specific functionality

---

## Pilot libraries

- Functionality test until January 2020
- Recognition rate and performance are secondary at the moment
- Partners:
  - SLUB Dresden
  - UB Mannheim
  - SUB Göttingen
  - UB Heidelberg
  - ULB Halle
  - ULB Darmstadt

---

## Specs and Documentation

https://ocr-d.github.io

* File format usage: METS, PAGE, OCRD-ZIP
* Command line interface
* Actionable tool description
* Docker conventions
* Ground Truth transcription guidelines
* Installation and implementation instructions

---

## Module projects

https://kba.cloud/ocrd-kwalitee

---

## OCR-D Deployment

* From source: `make install` in the root directory of the module project (MP) repository
* From PyPI: `pip install ocrd_tesserocr`
* "Slim Docker": `docker pull ocrd/tesserocr`
* "Fat Docker": https://github.com/stweil/ocrd_all
* Taverna Workflow Editor: https://github.com/OCR-D/taverna_workflow

---

## Other software of the OCR-D ecosphere

Training:

* [tesseract-ocr/tesstrain](https://github.com/tesseract-ocr/tesstrain)
* [OCR-D/okralact](https://github.com/OCR-D/okralact)

Evaluation:

* [qurator-spk/dinglehopper](https://github.com/qurator-spk/dinglehopper)

Deployment:

* [stweil/ocrd_all](https://github.com/stweil/ocrd_all)
* [bertsky/workflow-configuration](https://github.com/bertsky/workflow-configuration)

---

## Why ABBYY

* Okay segmentation
* Fast
* Good recognition of modern types

---

## Why **NOT** ABBYY

* Closed source
* Volume-based licensing
* Little support for historical types
* No training possible
* Fraktur: expensive and very little development
* Black box: few possibilities to influence recognition

---

## Why OCR-D?

* Free software
* Massive progress in neural networks in recent years => excellent quality, especially with fine-tuned models
* Modular: OCR as a process of configurable steps => alternatives for all steps

---

## OCR in Kitodo: Status Quo

According to https://github.com/kitodo/kitodo-production/wiki/OCR:

* OCR web service of GBV/ZGV (?)
* Intranda Taskmanager
* Zeutschel zedOCR

**OR**

* Offline OCR and reimport

---

## Considerations

- Configurability: In the Kitodo.Production UI? Config files? Black box?
- Technical integration: Long-running and potentially resource-intensive processes
- Deployment: Locally? Remote but in-house? Commercial service provider? Verbund-level service?
- Tradeoff: Quality of recognition vs. speed and manual effort

---

## Discussion I

- How much of a priority is OCR for your institution?
- What use cases do you have for OCR beyond search and providing research data?
- How important is OCR for older types or less-supported languages at your institution?

----

## Discussion II

- How could OCR improve Kitodo.Production?
  - Generate table of contents
  - Article/issue separation
  - OCR *training* as part of Kitodo?
- How do you integrate OCR into Kitodo now?
- Once OCR-D is ready, what is the added value of having OCR done by an external provider?

----

## Discussion III

- Which file formats do you want OCR results in?
  - PAGE (currently supported by OCR-D)
  - ALTO
  - hOCR
  - TEI
  - WebAnnotation (for IIIF)

---

## Links

* [OCR-D.github.io](https://ocr-d.github.io)
* [OCR-D @ Github](https://github.com/OCR-D)
* [OCR-D.de](https://ocr-d.de)
* [DHd AG OCR](https://gitter.im/ag-ocr/community) (website under construction)
* [ocrd-kwalitee](https://kba.cloud/ocrd-kwalitee)

---
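
## Appendix: modular workflow sketch

The "OCR as a process of configurable steps" idea from the "Why OCR-D?" slide can be illustrated with a small Python sketch. Everything here is hypothetical and is not the OCR-D API: `Page`, `make_step`, `REGISTRY`, `run_workflow` and the names `engine-a`/`engine-b` are made up for illustration, and the steps only record their own label instead of processing images. It merely shows how each stage can have interchangeable implementations that a configuration selects.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a page; real workflows exchange METS/PAGE files."""
    image: str
    history: List[str] = field(default_factory=list)


# A step takes a page and returns a new, processed page.
Step = Callable[[Page], Page]


def make_step(label: str) -> Step:
    """Build a placeholder step that only appends its label to the page history."""
    def step(page: Page) -> Page:
        return Page(page.image, page.history + [label])
    return step


# Every stage offers two interchangeable (made-up) implementations.
REGISTRY: Dict[str, Dict[str, Step]] = {
    stage: {impl: make_step(f"{stage}:{impl}") for impl in ("engine-a", "engine-b")}
    for stage in ("binarize", "segment", "recognize")
}


def run_workflow(page: Page, config: List[Tuple[str, str]]) -> Page:
    """Apply the configured (stage, implementation) pairs in order."""
    for stage, impl in config:
        page = REGISTRY[stage][impl](page)
    return page


if __name__ == "__main__":
    # Swapping an implementation means changing one entry in the configuration.
    config = [("binarize", "engine-a"), ("segment", "engine-b"), ("recognize", "engine-a")]
    print(run_workflow(Page("page_0001.tif"), config).history)
```

Keeping the step choice in data rather than in code is one possible answer to the configurability question raised under "Considerations": the same list could come from a config file or from a user interface.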