
---
tags: ocr-d
---

# OCR-D experiences and recommendations with regard to the composition of OCR-D processors into workflows

As part of development and testing, we made a number of observations that can help with selecting OCR-D processors and composing them into more complex workflows, and that we therefore deem useful to share with potential implementers.

Nevertheless, it is important to state that these observations stem from experimental use only and have not been verified through systematic testing and evaluation. Determining the optimal workflow for a particular type of content will therefore require more extensive testing with different material types and from different content holders (e.g. pilot libraries).

Experiences and recommendations by the OCR-D coordination project are set in *italics* in the following.

Last but not least: **Your mileage may vary!**

---

### A) OCR-D processors provided by module projects

These are the processors that have been created by the official, i.e. DFG-funded, OCR-D [module projects](https://ocr-d.github.io/projects):

1. [ocrd_typegroups_classifier](https://hackmd.io/vNRvknjzSE2JrXWMvFY3og?view#1-seuretmocrd_typegroups_classifier)
2. [LAYoutERkennung](https://hackmd.io/vNRvknjzSE2JrXWMvFY3og?view#2-mjenckelLAYoutERkennung)
3. [segmentation-runner](https://hackmd.io/vNRvknjzSE2JrXWMvFY3og?view#3-ocr-d-modul-2-segmentierungsegmentation-runner)
4. [ocrd_tesserocr](https://hackmd.io/vNRvknjzSE2JrXWMvFY3og?view#4-OCR-Docrd_tesserocr)
5. [ocrd_cis](https://hackmd.io/vNRvknjzSE2JrXWMvFY3og?view#5-cisocrgroupocrd_cis)
6. [cor-asv-ann](https://hackmd.io/vNRvknjzSE2JrXWMvFY3og?view#6a-ASVLeipzigcor-asv-ann) | [cor-asv-fst](https://hackmd.io/vNRvknjzSE2JrXWMvFY3og?view#6b-ASVLeipzigcor-asv-fst)
7. [okralact](https://hackmd.io/vNRvknjzSE2JrXWMvFY3og?view#7-Doreenruiruiokralact)
8. [OLA-HD-IMPL](https://hackmd.io/vNRvknjzSE2JrXWMvFY3og?view#8-OLA-HD-IMPL)

#### 1. seuretm/ocrd_typegroups_classifier

A typegroups classifier

URL: https://github.com/seuretm/ocrd_typegroups_classifier

Processors provided:

* `ocrd-typegroups-classifier`: detect the dominant font type at page level

*Type/font information can be used, for example, to select an appropriate OCR model. Note, however, that while past OCR engines required separate models for e.g. Antiqua and Fraktur, current models no longer seem to require this distinction, i.e. a single trained model can recognize both Antiqua & Fraktur in one run without compromising recognition quality.*

#### 2. mjenckel/LAYoutERkennung

*The current version does not comply with OCR-D interface specifications!*

Tools for preprocessing scanned images for OCR

URL: https://github.com/mjenckel/OCR-D-LAYoutERkennung

Processors provided:

* `ocrd-anybaseocr-crop`: crop image borders (e.g. from scanning) **Use only this**
* `ocrd-anybaseocr-binarize`: convert image to only black/white pixels
* `ocrd-anybaseocr-deskew`: rotate image to correct skew
* `ocrd-anybaseocr-dewarp`: straighten text lines
* `ocrd-anybaseocr-tiseg`: text/image segmentation
* `ocrd-anybaseocr-textline`: textline segmentation
* `ocrd-anybaseocr-layout-analysis`: page layout analysis
* `ocrd-anybaseocr-block-segmentation`: region segmentation

#### 3. ocr-d-modul-2-segmentierung/segmentation-runner

*The current version does not comply with OCR-D interface specifications!*

A page segmentation algorithm based on a Fully Convolutional Network

URL: https://github.com/ocr-d-modul-2-segmentierung/ocrd-pixelclassifier-segmentation

Processors provided:

* `ocropus-gpageseg-with-coords`
* `ocrd-pc-seg-process`
* `ocrd-pc-seg-single`

#### 4. OCR-D/ocrd_tesserocr

ocrd_tesserocr packages the popular Open Source OCR engine Tesseract with an OCR-D interface, and exposes several of its methods via API.

Processors provided:

* `ocrd-tesserocr-binarize`: convert image to only black/white pixels
* `ocrd-tesserocr-crop`: crop image borders (e.g. from scanning)
* `ocrd-tesserocr-deskew`: rotate image to correct skew
* `ocrd-tesserocr-recognize`: text recognition
* `ocrd-tesserocr-segment-line`: textline segmentation
* `ocrd-tesserocr-segment-region`: region segmentation
* `ocrd-tesserocr-segment-word`: word segmentation

ocrd_tesserocr requires Tesseract version 4.1.0 or later. For Ubuntu versions before 19.10, this version of Tesseract is not part of the distribution, but it can be installed either via PPA or by compiling from source.

*OCR-D has not (yet) trained any OCR models. The "standard" models shipped by Tesseract can be used. Thankfully though, [@stweil](https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR) has been training Tesseract models for German Fraktur & Antiqua that provide far better results than the standard models, and we therefore strongly recommend using these models for testing instead. The current best model can be downloaded from https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/.*

#### 5. cisocrgroup/ocrd_cis

#### 6. ASVLeipzig/cor-asv-ann | ASVLeipzig/cor-asv-fst

#### 7. Doreenruirui/okralact

#### 8. OLA-HD-IMPL

---

### B) Alternative OCR-D processors

In addition to the official processors developed by the DFG-funded module projects, a number of useful third-party processors and developments of our own are provided via the OCR-D coordination project. Please note that we provide these alternative processors for convenience only and currently cannot commit to fully supporting or maintaining them in the long run!

...

#### X. ocrd_segment

* `ocrd-segment-via-model` :TODO:
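
---

As a practical aside on the Tesseract version requirement stated for ocrd_tesserocr in section A) 4., the following is a minimal sketch of how an installation could be checked before running a workflow. It assumes that `tesseract --version` prints a first line of the form `tesseract 4.1.1`; the helper names (`parse_tesseract_version`, `meets_minimum`, `installed_tesseract_ok`) are hypothetical and not part of OCR-D or Tesseract.

```python
import re
import shutil
import subprocess

# Minimum Tesseract version stated for ocrd_tesserocr in section A) 4.
MINIMUM = (4, 1, 0)


def parse_tesseract_version(text):
    """Extract (major, minor, patch) from `tesseract --version` output.

    Assumption: the output contains a token like "tesseract 4.1.1" or
    "tesseract v5.3.0"; pre-release suffixes such as "-beta.1" are ignored.
    """
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no Tesseract version found in: %r" % text)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def meets_minimum(text, minimum=MINIMUM):
    """Return True if the version found in `text` is at least `minimum`."""
    return parse_tesseract_version(text) >= minimum


def installed_tesseract_ok():
    """Check the Tesseract found on PATH; return None if none is installed."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    # Merge stderr into stdout, since some builds print the version there.
    result = subprocess.run(
        [exe, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return meets_minimum(result.stdout)


if __name__ == "__main__":
    status = installed_tesseract_ok()
    if status is None:
        print("tesseract not found on PATH")
    elif status:
        print("tesseract is recent enough for ocrd_tesserocr")
    else:
        print("tesseract is older than 4.1.0; consider a PPA or a source build")
```

Note that the comparison only looks at the numeric components, so a pre-release of 4.1.0 would pass the check; a stricter check would need to inspect the suffix as well.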