<style> /* reduce from default 48px: */ .reveal { font-size: 24px; text-align: left; } .reveal .slides { text-align: left; } /* change from default gray-on-black: */ .hljs { color: #005; background: #fff; } /* prevent invisible fragments from occupying space: */ .fragment.visible:not(.current-fragment) { display: none; height:0px; line-height: 0px; font-size: 0px; } /* increase font size in diagrams: */ .label { font-size: 24px; font-weight: bold; } /* increase maximum width of code blocks: */ .reveal pre code { max-width: 1000px; max-height: 1000px; } /* remove black border from images: */ .reveal section img { border: 0; } .reveal pre.mermaid { width: 100% !important; } .reveal svg { max-height: 600px; } .reveal .scaled-flowchart-td pre.mermaid { width: 100% !important; /* why? float: left; */ } .reveal .scaled-flowchart-td svg { max-width: 100% !important; } .reveal .scaled-flowchart-td svg g.node, .reveal .scaled-flowchart-td svg g.label, .reveal .scaled-flowchart-td svg foreignObject { width: 100% !important; } .reveal .scaled-flowchart-td p { clear:both; } .reveal .centered { text-align: center } .reveal .width75 { max-width: 75%; } </style> # Flexible OCR workflows with OCR-D <!-- .element: class="centered width75" --> Robert Sachunsky, Kay-Michael Würzner <!-- .element: class="centered width75" --> --- ## Contents - Workspaces - Processors - Integration - Workflow - Wishlist --- ## Workspaces - *Physical* representation of a METS file - Directory with subdirectories for each file group - Each subdirectory contains the listed files - Adding and removing files explicitly via `ocrd workspace` command - Adding files implicitly via `ocrd process` command - Using the output file group parameter `-O` - Cloning remote “workspaces” (i.e. METS files) - Access to millions of digitized books! ```shell ocrd workspace clone \ https://digital.slub-dresden.de/data/kitodo/gottgott_38213401X/gottgott_38213401X_mets.xml . 
```

---

## Processors

- Representation of OCR-related operations as *processors*
  - Operate within an OCR-D workspace
  - Input and output definition via METS file groups
  - Parameter specification via JSON
- Invocation via
  - Individual CLI (e.g. `ocrd-anybaseocr-crop`)
  - Meta-processor `ocrd process` (concatenated + validated invocation of multiple processors within a single call)
- Existing workflow-fit processors:
  - Ocropy-based ...
  - Tesseract-based ...
  - Olena-based binarization ...
  - Module Projects ...

----

### Ocropy-based processors

- Many OCR-related operations available
  - Not cleanly wrapped
  - Not separately available
- Creating processors for
  - Binarization (on page/region/line level)
  - Deskewing (on page/region level)
  - Dewarping (on line level)
  - *Clipping* (on region/line level)
  - *Resegmentation* (on line level)
- Ultimate goal: Ocropy as an API
  - Code clean-up and improvement ...
  - New operations ...
  - Restructuring into `ocrolib` and processors ...

----

### Ocropy-based processors

#### Code clean-up and improvement

- move common Ocropy functions into `common` (from CLIs, from `OLD/`), add new ones:
  - `PIL.Image` vs `np.ndarray` conversions: **`array2pil`**, **`pil2array`** (integer vs normed float)
  - plausibility checks: **`check_page`**, **`check_region`**, **`check_line`** (but mix absolute and relative bounds, and make DPI-zoomable)
  - `nlbin` deskewing: **`estimate_skew_angle`**, **`estimate_skew`** (but resize when rotating, with minimum on variance drop)
  - `nlbin` binarization: **`estimate_local_whitelevel`**, **`estimate_thresholds`**, **`binarize`** (but keep exact pixel size, catch NaN)
  - remove connected components only contained in the margins: **`borderclean`**

----

### Ocropy-based processors

#### Code clean-up and improvement

- dissect and improve segmentation:
  - **`remove_hlines`**: add height threshold, reduce default width threshold
  - **`compute_separators_morph`**: reduce thresholds (because black colseps can be discontinuous), use only
    connected components fully inside zone
  - **`compute_gradmaps`**: reduce boxmap minsize (for chopped lines at margins), reduce horizontal blur (to avoid joining lines via as-/descenders)
  - **`compute_line_seeds`**: more robust top/bottom projection rules and no horizontal blur (to avoid joining lines via as-/descenders)
  - **`hmerge_line_seeds`**: new way to ensure horizontal label consistency
  - **`compute_segmentation`**:
    - `fullpage` switch (regions do not have hlines and colseps)
    - `zoom` parameter (thresholds must be DPI-relative)
    - use twice the estimated `scale` (blackletter has huge capitals and dense as-/descenders – avoid splitting lines)
    - before spreading line seeds, assign unlabelled connected components to their majority seed (instead of splitting)

----

### Ocropy-based processors

#### New operation: *Resegmentation*

observation 1:
~ Ocropy dewarping and recognition are very sensitive to connected components intruding from neighbouring lines (e.g. ascenders and descenders)

observation 2:
~ GT line segmentation is very coarse (only bounding boxes, large overlap)

idea:
~ use Ocropy line segmentation to improve GT line segmentation via label majority rule, then annotate the _shrunken polygon_

----

### Ocropy-based processors

#### New operation: *Resegmentation*

<div>

![](https://user-images.githubusercontent.com/38561704/60338418-73978f00-9995-11e9-8fd3-4c149e2df266.png)
![](https://user-images.githubusercontent.com/38561704/60338419-74302580-9995-11e9-9fce-a1eb1fe44e2f.png)
![](https://user-images.githubusercontent.com/38561704/60338421-74302580-9995-11e9-9997-0165ef75833b.png)
![](https://user-images.githubusercontent.com/38561704/60338422-74302580-9995-11e9-9a47-452557fec061.png)
![](https://user-images.githubusercontent.com/38561704/60624848-d940ac80-9dd5-11e9-95a9-2074c2e0182a.png)

</div>
<!-- .element: class="fragment" data-fragment-index="0" -->

![](https://user-images.githubusercontent.com/38561704/61275373-5a684e00-a79d-11e9-82f7-ae40ad8d02ae.png) <!--
.element: class="fragment" data-fragment-index="1" --> ---- ### Ocropy-based processors #### New operation: *Resegmentation* ![](https://user-images.githubusercontent.com/38561704/61275456-8daadd00-a79d-11e9-91d9-5365c047403a.png) <!-- .element: class="fragment" data-fragment-index="0" --> ![](https://user-images.githubusercontent.com/38561704/61275578-cc409780-a79d-11e9-8845-762f5df66488.png) <!-- .element: class="fragment" data-fragment-index="1" --> ![](https://user-images.githubusercontent.com/38561704/61275670-f7c38200-a79d-11e9-9427-5d4cbe7733d6.png) <!-- .element: class="fragment" data-fragment-index="2" --> ![](https://user-images.githubusercontent.com/38561704/61275855-57ba2880-a79e-11e9-91be-f7c531ac8d15.png) <!-- .element: class="fragment" data-fragment-index="3" --> ![](https://user-images.githubusercontent.com/38561704/61275459-8daadd00-a79d-11e9-940e-650d15331273.png) <!-- .element: class="fragment" data-fragment-index="4" --> ![](https://user-images.githubusercontent.com/38561704/61275580-cc409780-a79d-11e9-819b-f8e009c70b93.png) <!-- .element: class="fragment" data-fragment-index="5" --> ![](https://user-images.githubusercontent.com/38561704/61275671-f7c38200-a79d-11e9-9daa-386ca7b72109.png) <!-- .element: class="fragment" data-fragment-index="6" --> ![](https://user-images.githubusercontent.com/38561704/61275857-57ba2880-a79e-11e9-922c-b25122c3dd13.png) <!-- .element: class="fragment" data-fragment-index="7" --> ![](https://user-images.githubusercontent.com/38561704/61275460-8daadd00-a79d-11e9-87be-d6e8a193d38a.png) <!-- .element: class="fragment" data-fragment-index="8" --> ![](https://user-images.githubusercontent.com/38561704/61275581-ccd92e00-a79d-11e9-9be6-7d3260512ad3.png) <!-- .element: class="fragment" data-fragment-index="9" --> ![](https://user-images.githubusercontent.com/38561704/61275672-f7c38200-a79d-11e9-9d65-7d8c7b74e88f.png) <!-- .element: class="fragment" data-fragment-index="10" --> 
![](https://user-images.githubusercontent.com/38561704/61275859-5852bf00-a79e-11e9-90d7-ad5abc3cc1c4.png) <!-- .element: class="fragment" data-fragment-index="11" --> ![](https://user-images.githubusercontent.com/38561704/61275461-8daadd00-a79d-11e9-8b89-b67a4d6dab14.png) <!-- .element: class="fragment" data-fragment-index="12" --> ![](https://user-images.githubusercontent.com/38561704/61275582-ccd92e00-a79d-11e9-8a40-95c270433154.png) <!-- .element: class="fragment" data-fragment-index="13" --> ![](https://user-images.githubusercontent.com/38561704/61275674-f7c38200-a79d-11e9-8e1a-de54e1428fb3.png) <!-- .element: class="fragment" data-fragment-index="14" --> ![](https://user-images.githubusercontent.com/38561704/61275861-5852bf00-a79e-11e9-9a45-aba95d8bc09b.png) <!-- .element: class="fragment" data-fragment-index="15" --> ![](https://user-images.githubusercontent.com/38561704/61275462-8e437380-a79d-11e9-9a9d-1178a7860a49.png) <!-- .element: class="fragment" data-fragment-index="16" --> ![](https://user-images.githubusercontent.com/38561704/61275583-ccd92e00-a79d-11e9-80f6-aadeb5f7bbc7.png) <!-- .element: class="fragment" data-fragment-index="17" --> ![](https://user-images.githubusercontent.com/38561704/61275676-f85c1880-a79d-11e9-9f29-03ee2c95f213.png) <!-- .element: class="fragment" data-fragment-index="18" --> ![](https://user-images.githubusercontent.com/38561704/61275863-5852bf00-a79e-11e9-85e5-7f9e3c04e5bb.png) <!-- .element: class="fragment" data-fragment-index="19" --> ---- ### Ocropy-based processors #### New operation: *Resegmentation* ![](https://i.imgur.com/rTyX1xP.png) ---- ### Ocropy-based processors #### New operation: *Clipping* observation 1: ~ on GT, both regions and lines often overlap with their neighbouring regions and lines – not just in the background, but within connected components idea 1: ~ remove connected components that are not fully contained in the segment but in a neighbour observation 2: ~ many frequent cases of this will create 
interior islands or non-contiguous polygons (not allowed in PAGE-XML, usually not supported by implementations)

idea 2:
~ do not remove by shrinking the polygon, but by _clipping_ to the background colour

note:
- can be used to suppress graphics or separators within or across a region or line
- can be used as an alternative to resegmentation (on the line level)
- cannot be applied if the segment already has `AlternativeImage` or `@orientation` (segments and neighbours become incommensurable)
- runs best after binarization

----

### Ocropy-based processors

#### New operation: *Clipping*

![](https://user-images.githubusercontent.com/38561704/61275129-c7c7af00-a79c-11e9-8752-dc5051773d1f.png)
![](https://user-images.githubusercontent.com/38561704/61276541-c64bb600-a79f-11e9-857a-30310270b5fe.png) <!-- .element: class="fragment" data-fragment-index="1" -->
![](https://user-images.githubusercontent.com/38561704/61276557-d06db480-a79f-11e9-92da-abbe206e4e37.png) <!-- .element: class="fragment" data-fragment-index="2" -->

----

### Ocropy-based processors

#### New operation: *Clipping*

- problem: regions overlapping for no good reason, e.g. from Tesseract

![](https://i.imgur.com/001fg8v.jpg =350x)

----

### Ocropy-based processors

#### Planned restructuring

- `tmbdev/ocropy`, forked under `OCR-D/ocropy`:
  - move non-UI functions from CLIs into `ocrolib`
  - package `ocrolib` under PyPI _ocrolib_
  - package CLIs under PyPI _ocropus_
  - add our `ocrolib` changes (one by one):
    - Python 3 port
    - additional `ocrolib.common` functions
    - improvements in segmentation
  - try to get upstream approval
- our `OCR-D/ocrd_ocropus`:
  - only for OCR-D wrappers
  - base on new `ocrolib`

→ **Ocropy as API is under way!**

----

### Tesseract-based processors

- Many OCR-related operations available
  - Mostly available via API
  - Exposed to Python via `tesserocr`
- Often more robust segmentation and text recognition than Ocropy
- Creating Processors for
  - Binarization (on region/line level)
  - *Cropping* ...
  - Segmentation (on page/region/line level) ...
  - Deskewing (on page/region level) ...
  - Text recognition (on line/word/glyph level)

----

### Tesseract-based processors

#### New processor: Poor-man's cropping

- Cropping not implemented as separate function in Tesseract
- Idea: minimal rectangle around all detected regions as `Border`
- Side effect: quality improvement upon repeated region detection within `Border`
- Problems:
  - Facing pages
  - Empty or sparsely filled pages
  - Robustness (i.e. works well for most pages but really badly for some)

----

### Tesseract-based processors

#### Basal segment classification

- Distinction of different region types in Tesseract
  - Text
  - Image
  - Separator
  - Table
  - ...
- Mapped to (coarser) PAGE classification
- Switches:
  - `crop_polygons` (rectangles or polygons?)
  - `find_tables` (tables as tables?)

----

### Tesseract-based processors

#### Basal segment classification

![](https://i.imgur.com/zfvx5iy.jpg =700x)

----

### Tesseract-based processors

#### Deskewing and orientation

- 2 distinct APIs in Tesseract:

| | `DetectOrientationScript()` | `AnalyseLayout()` + `Orientation()` |
| - | --- | --- |
| confidence | yes | no |
| orientation | yes | yes |
| script | yes | no |
| deskewing | no | yes |
| textline order | no | yes |
| reading direction | no | yes |

→ can yield **contradictory results**!

----

### Tesseract-based processors

#### Deskewing and orientation

- resolve conflicts:
  1. accept _script_ results from OSD (if very confident)
  2. accept _orientation_ results from OSD (if very confident)
  3. ignore _orientation_ results from AL (but warn if contradictory)
  4. apply _deskewing_ results from AL (adding angle to orientation)
  5. apply _order/direction_ results from AL
- available on _page_ and _region_ level

----

### Olena-based binarization

- Binarization is still relevant!
  - No RGB-processing recognition engine(s) available
  - High influence on OLR and OCR results
- Multiple binarization implementations in Olena
  - *Kim*, *Niblack*, *Wolf* ... → expose as parameter
  - Only as CLIs → `ocrd bashlib` as last resort
  - PAGE processing (but only on page level)
  - `AlternativeImage` support (as case study in `xmlstarlet`)

----

### Olena-based binarization

| Original | Tesseract |
|:---------:|:-----------:|
|![](https://i.imgur.com/MbUwqjK.png)|![](https://i.imgur.com/NYBcm0g.png)|

| Ocropy | Olena-Wolf |
|:------------:|:--------------:|
|![](https://i.imgur.com/rQluIkV.png)|![](https://i.imgur.com/QkjuOw0.png)|

----

### Module Project-based processors

- Not very many workflow-fit
  - Interface compatibility
  - Result delivery
- Insufficient documentation
  - Short `README`s
  - Missing integration examples
  - Missing training facilities
- Low visibility

----

### Module Project-based processors

#### `ocrd-anybaseocr-crop`

- Interface-compatible page border detection
  - Collaborative effort between module project and coordination project
  - Extremely important preprocessing step (due to DFG requirements on digitization)
  - Not yet implemented for other `anybaseocr`-based processors
- Very, very promising results
  - Input: image
  - Output: `Border` element with coordinates
  - Not yet `AlternativeImage`-sensitive

----

### Module Project-based processors

#### `ocrd-anybaseocr-crop`: example

| Tesseract | DFKI |
|:-----------------------------------:|:-----------------------------------:|
| ![](https://i.imgur.com/k3VbSyj.png) | ![](https://i.imgur.com/6lsluzE.png) |

---

## Integration

- APIs for PAGE and METS
  - Delivery of results in a comfortable and interoperable way
  - No need to directly modify XML files
- Use PAGE for
  - Page-, region-, line-, word- and glyph-level results
  - Descriptive and binary (`AlternativeImage`) results
- Use METS for
  - Document-level results
- Specific aspects:
  - DPI relativity ...
  - Description vs. image ...
(`AlternativeImage`) - Multiple inputs or outputs ... - Logging as a result ... ---- ### DPI relativity - Most OLR operations are sensitive to image resolution: e.g. minimum/maximum segment size in pixels - Some tools are DPI-aware e.g. Tesseract: ``` Warning: Invalid resolution 0 dpi. Using 70 instead. ``` ```python tessapi.SetVariable('user_defined_dpi', str(200)) ``` - Some implementations expect a certain value e.g. Ocropy: 300 DPI - We (usually) know DPI from metadata/tags: ```python info = OcrdExif(pil_image) if info.resolution != 1: # tag available dpi = info.resolution if info.resolutionUnit == 'cm': dpi = round(dpi * 2.54) ... ``` ---- ### DPI relativity - OLR processors that are DPI-aware already: _pass_ DPI from **`OcrdExif`** e.g. Tesseract wrappers - OLR processors that are not DPI-aware yet: _modify_ to **zoom**: 1. determine factor between expected and actual DPI 2. multiply all relevant constants in the code e.g. Ocropy wrappers ---- ### Description vs. image - PAGE: hierarchy of elements, _each_ with descriptive and binary content - **original** `/PcGts/Page/@imageFilename` - **derived** `//AlternativeImage/@filename` - Preprocessing steps: either 1. producing derived images _obligatory_: binarization, despeckling, dewarping 2. just describing the operation (images _optional_): cropping/segmentation (`Coords/@points`), deskewing (`@orientation`) operation must be _applied_ at some point – preferably when descending to a lower hierarchy level (i.e. during segmentation) - Consumers: must - _respect_ `AlternativeImage` if present on their hierarchy level of interest - _generate_ an image from the parent otherwise (which again could have `AlternativeImage`) – by: - **cutting** from `Coords/@points` - **rotating** by `(Page|TextRegion)/@orientation` ---- ### Description vs. image #### AlternativeImage problems and solutions - bounding box rectangles are too coarse (esp. 
in the presence of skew) → **polygon** coordinates must always be preserved fully (using polygon masking instead of simple cutting) - **coordinates are absolute** (they reference the _original_ image) → before cutting from the parent image, coordinates must be converted _to relative_ → before adding new child elements, coordinates must be converted _from relative_ → offsets must always be passed down the hierarchy - not all image operations retain pixel positions: 1. deskewing → annotate `@orientation`, _but_: - rotation generally increases the binary image at the margins → compensate by **additional offset** (half the increase in size) - rotation applies around the center of the image, not the origin → coordinates must be **translated to center**, rotated passive, then translated back 2. dewarping → use `Grid`? 3. rescaling → extend PAGE-XML with `@scale`? ---- ### Description vs. image #### Summary (slightly more abstract) - coordinates must be **reproducible**: > Annotation must be sufficient to calculate pixel positions in `AlternativeImage` from those in `@imageFilename` > (e.g. to cut parents) or vice versa (e.g. to add children). - image preprocessing steps which alter the coordinate system must describe their transform appropriately: - linear coordinate transformations (translation/offset, rotation/angle, scale) can be made exact (up to rounding) - non-linear transformations (dewarping) are inexact ... ---- ### Description vs. image #### Remaining issues - Strictness of **`@comments`** classification: - multiple `AlternativeImage` entries → rely on `@comments`, or always append/choose last? - if `AlternativeImage` _and_ `@orientation` → rely on `@comments`, or expect the image to be deskewed already? - if `AlternativeImage` _and_ `Border` → rely on `@comments`, or expect the image to be cropped already? - Reproducibility: - if `AlternativeImage` is larger than `Coords/@points` rectangle → due to rotation (offset) or rescaling (zoom) or both! 
    → introduce `@scale` or prohibit rescaling altogether?
  - (page/line-level) dewarping → use `Grid`?

----

### Description vs. image

#### Implementation in `OCR-D/core`

##### High-level API:

- image/offset recursion:
  - **`Workspace.image_from_page`**: on the page level (original or derived)
  - **`Workspace.image_from_segment`**: all levels below page (derived)

<div>

what it does:
- get last `AlternativeImage` or generate from parent (including:
  - conversion of coordinates to parent-relative, which involves offset correction and possibly coordinate rotation,
  - possibly image rotation)
- return the image and its absolute bounding box (compensating for resizing by additional offset)
- (`image_from_page` only:) also return `OcrdExif` instance for original

</div>
<!-- .element: class="fragment fade-in-then-fade-out" data-fragment-index="1" -->

<div>

how to use:
- not recursive itself – needs to be called recursively, passing down results:

```python
from ocrd_modelfactory import page_from_file
...
page_id = input_file.pageId or input_file.ID # for logging
pcgts = page_from_file(workspace.download_file(input_file))
page = pcgts.get_Page()
page_image, page_xywh, page_image_info = workspace.image_from_page(
    page, page_id)
...
for region in page.get_TextRegion():
    region_image, region_xywh = workspace.image_from_segment(
        region, page_image, page_xywh)
    ...
    for line in region.get_TextLine():
        line_image, line_xywh = workspace.image_from_segment(
            line, region_image, region_xywh)
        ...
```

</div>
<!-- .element: class="fragment" data-fragment-index="2" -->

----

### Description vs. image

#### Implementation in `OCR-D/core`

##### High-level API:

- add image to METS: **`Workspace.save_image_file`**

<div>

what it does:
- export image file from `PIL.Image` object
- make file path from fileGrp, ID and format
- reference the file in METS via `Workspace.add_file`
- return file path

</div>
<!-- .element: class="fragment fade-in-then-fade-out" data-fragment-index="1" -->

<div>

how to use:
- needs to know image fileGrp, unique ID:

```python
...
file_id = input_file.ID.replace(self.input_file_grp, 'OCR-D-IMG-DEWARP')
...
file_path = workspace.save_image_file(image,
    file_id + '_' + region.id + '_' + line.id,
    page_id=input_file.pageId,
    file_grp='OCR-D-IMG-DEWARP')
line.add_AlternativeImage(AlternativeImageType(
    filename=file_path, comments=comments + ',' + 'dewarped'))
```

</div>
<!-- .element: class="fragment" data-fragment-index="2" -->

----

### Description vs. image

#### Implementation in `OCR-D/core`

##### Low-level API:

- convert from absolute coordinates to relative: **`ocrd_utils.coordinates_of_segment`**

what it does:
- get the points of the element's polygon outline
- shift all points by the offset (top-left corner) of the parent towards origin
- (in case the parent was rotated:) rotate all points with the center of the image as pivot

how to use:

```python
line_polygon = coordinates_of_segment(line, region_image, region_xywh)
line_polygon = resegment(line_polygon, region_labels, region_image_bin, line.id)
line_polygon = coordinates_for_segment(line_polygon, region_image, region_xywh)
line.get_Coords().points = points_from_polygon(line_polygon)
```

----

### Description vs. image

#### Implementation in `OCR-D/core`

##### Low-level API:

- convert from relative coordinates to absolute: **`ocrd_utils.coordinates_for_segment`**

what it does:
- (in case the parent was rotated:) rotate all points with the center of the image as pivot in opposite direction
- shift all points by the offset (top-left corner) of the parent away from origin

how to use:

```python
...
for word_no, word in enumerate(iterate_level(tessapi.GetIterator(), RIL.WORD)):
    word_id = '%s_word%04d' % (line.id, word_no)
    bbox = word.BoundingBox(RIL.WORD)
    points = points_from_polygon(coordinates_for_segment(
        polygon_from_x0y0x1y1(bbox),
        None, # image not needed if element cannot have angle
        line_xywh))
    word = WordType(id=word_id, Coords=CoordsType(points))
    line.add_Word(word)
```

----

### Description vs. image

#### Implementation in `OCR-D/core`

##### Low-level API:

- only coordinate rotation (as `numpy.ndarray`): **`ocrd_utils.rotate_coordinates`**
- mask away exterior to background: **`image_from_polygon`**
- background-agnostic replacement for `PIL.Image.crop`: **`ocrd_utils.crop_image`**
- ...

----

### Description vs.
image #### Early adopters - core Python implementation already used by: - [Tesseract processors](https://github.com/OCR-D/ocrd_tesserocr) `ocrd-tesserocr-*` - [Ocropy processors](https://github.com/cisocrgroup/cis-ocrd-py/) `ocrd-cis-ocropy-*` - core Bash implementation WIP used by: - [Olena binarization](https://github.com/OCR-D/ocrd_olena) `ocrd-olena-binarize` ---- ### Multiple inputs or outputs - specified by [comma-separated list](https://ocr-d.github.io/cli#command-line-interface-cli) on CLI: ```shell $ ocrd-olena-binarize -I OCR-D-GT-SEG-LINE -O OCR-D-SEG-PAGE,OCR-D-IMG-BIN $ ocrd-cor-asv-ann-evaluate -I OCR-D-GT-SEG-LINE,OCR-D-OCR-TESS,OCR-D-COR-ASV-ANN ``` ---- ### Multiple inputs or outputs #### Use-case for multi-valued input: alignment - `TextLine/TextEquiv/Unicode` alignment via - global sequence alignment (see [ocrd-cis-align](https://github.com/cisocrgroup/cis-ocrd-py/blob/51702097e0e4ea023a06d131769eaa0de81dcdd4/ocrd_cis/align/aligner.py#L26) or [ocrd-cor-asv-ann-evaluate](https://github.com/ASVLeipzig/cor-asv-ann/blob/a460bd5a95bd6fa092c40259f99355c3af02f01b/ocrd_cor_asv_ann/wrapper/evaluate.py#L51)) - neural attention mechanism (cf. Dong&Smith 2018 _Multi-input attention_) - resegmentation - possibly: text alignment informed by segmentation (coordinates) ---- ### Multiple inputs or outputs #### Use-case for multi-valued output: PAGE and image - image preprocessing produces PAGE with `AlternativeImage` references → generated images must also be added to METS - bad solution: fixed fileGrp `OCR-D-IMG-BIN`, `OCR-D-IMG-DESKEW`, `OCR-D-IMG-DEWARP` etc. - good approach: - use `output_file_grp` second position, - fallback to default if not given (e.g. [ocrd-tesserocr-binarize](https://github.com/OCR-D/ocrd_tesserocr/blob/ca2530d0f4ffd23ca5bfe7380f1b1089af36f6b6/ocrd_tesserocr/binarize.py#L59) or ocrd-olena-binarize) ---- ### Logging as a result - not all operations have a natural (PAGE/image) output file group: e.g. 
OLR/OCR evaluation, model training - some need to aggregate over multiple pages (or even workspaces): e.g. CER/WER --- ## Workflow - Goals: - flexibility and complexity of **configurations**: processors and parameters as building blocks - efficiency and robustness of **engines**: parallel/distributed computation and validation of inputs/outputs - Aspects: - Running ... - Configuration ... - Processors ... - Measurements ... - Best Practices ... ---- ### Running - with individual CLIs combined in a custom bash script: - with `ocrd process` as engine: - with Taverna? - with Kitodo? ---- ### Configuration ```graphviz digraph G { node[shape=box]; compound=true; page_segmentation[label="Region segmentation"]; line_segmentation[label="Line segmentation"]; text_optimization[label="Text optimization"]; subgraph cluster_preprocessing_page { label = "Page preprocessing"; binarization_page[label="Binarization"]; cropping[label="Cropping"]; deskewing_page[label="Deskewing"]; despeckling_page[label="Despeckling"]; dewarping_page[label="Dewarping"]; binarization_page -> cropping -> deskewing_page -> despeckling_page -> dewarping_page {rank=same; binarization_page, cropping, deskewing_page, despeckling_page, dewarping_page} } subgraph cluster_preprocessing_segment { label = "Region preprocessing"; deskewing_segment[label="Deskewing"]; despeckling_segment[label="Despeckling"]; binarization_segment[label="Binarization"]; binarization_segment -> despeckling_segment -> deskewing_segment {rank=same; deskewing_segment, despeckling_segment, binarization_segment} } subgraph cluster_preprocessing_line { label = "Line preprocessing"; binarization_line[label="Binarization"]; dewarping_line[label="Dewarping"]; binarization_line -> dewarping_line {rank=same; binarization_line, dewarping_line} } subgraph cluster_ocr_line { label = "Text recognition"; ocr_one[label="OCR 1"]; ocr_two[label="OCR 2"]; ocr_n[label="OCR n"]; ocr_one -> ocr_two -> ocr_n[style=dotted,dir=none] {rank=same; ocr_one, 
ocr_two, ocr_n} } binarization_page -> page_segmentation[label="Pages",ltail=cluster_preprocessing_page] page_segmentation -> binarization_segment[label="Regions",lhead=cluster_preprocessing_segment] binarization_segment -> line_segmentation[label="Regions",ltail=cluster_preprocessing_segment] line_segmentation -> binarization_line[label="Lines",lhead=cluster_preprocessing_line] binarization_line -> ocr_one[label="Line images", ltail=cluster_preprocessing_line, lhead=cluster_ocr_line] ocr_one -> text_optimization[label="Line strings",ltail=cluster_ocr_line] } ``` ---- ### Available processors | Processor | Status | Note | | -------------------------- | -------- | ----------- | | *Binarization* | | | | `ocrd-olena-binarize` | ✓ | | | `ocrd-anybaseocr-binarize` | ✗ | Interface | | `ocrd-cis-ocropy-binarize` | ✓ | | | `ocrd-kraken-binarize` | ✗ | Invocation | | `ocrd-tesserocr-binarize` | ✓ | | | *Despeckling* | | | | `ocrd-cis-ocropy-denoise` | ✓ | | ---- ### Available processors | Processor | Status | Note | | -------------------------- | -------- | ----------- | | *Cropping* | | | | `ocrd-anybaseocr-crop` | ✓ | | | `ocrd-kraken-crop` | ✗ | Interface | | `ocrd-tesserocr-crop` | ✓ | | | *Deskewing* | | | | `ocrd-anybaseocr-deskew` | ✗ | Interface | | `ocrd-cis-ocropy-deskew` | ✓ | | | `ocrd-tesserocr-deskew` | ✓ | | | *Dewarping* | | | | `ocrd-anybaseocr-dewarp` | ✗ | Interface | | `ocrd-cis-ocropy-dewarp` | ✓ | | ---- ### Available processors | Processor | Status | Note | | ------------------------------- | -------- | ----------- | | *Region Segmentation* | | | | `ocrd-tesserocr-segment-region` | ✓ | | | *Clipping/Resegmentation* | | | | `ocrd-cis-ocropy-clip` | ✓ | | | `ocrd-cis-ocropy-resegment` | ✓ | | | `ocrd-segment-repair` | ✓ | | | *Line Segmentation* | | | | `ocrd-ocropy-segment` | ✗ | Invocation | | `ocrd-kraken-segment` | ✗ | Invocation | | `ocrd-tesserocr-segment-line` | ✓ | | ---- ### Available processors | Processor | Status | Note | | 
------------------------------- | -------- | -------- | | *Font identification* | | | | `ocrd-typegroups-classifier` | ✓ | | | *Text recognition* | | | | `ocrd-cis-ocropy-recognize` | ✓ | | | `ocrd-tesserocr-recognize` | ✓ | | | `ocrd-calamari-recognize` | ✓ | | ---- ### Available processors | Processor | Status | Note | | ------------------------------- | -------- | --------- | | *OCR alignment* | | | | `ocrd-cis-align` | ✓ | | | *Text optimization* | | | | `ocrd-cor-asv-ann-process` | ✓ | | | `ocrd-cor-asv-fst-process` | ✓ | | | `ocrd-cis-profile` | ✓ | | | `ocrd-cis-postcorrection` | ✗ | Interface | | `ocrd-keraslm-rate` | ✓ | | | *OCR evaluation* | | | | `ocrd-keraslm-rate` | ✓ | | | `ocrd-cor-asv-ann-evaluate` | ✓ | | | `ocrd-dinglehopper` | ✓ | | ---- ### From image to regions: commands ```shell # # create workspace from existing METS ocrd workspace clone \ https://digital.slub-dresden.de/data/kitodo/gottgott_38213401X/gottgott_38213401X_mets.xml . # # crop with anybaseocr ocrd-anybaseocr-crop -I ORIGINAL -O CROPPED -m mets.xml # # binarize on page level ocrd-cis-ocropy-binarize -I CROPPED -O BIN -p <(echo '{"level-of-operation": "page"}') -m mets.xml # # deskew on page level ocrd-cis-ocropy-deskew -I BIN -O DESKEWED -p <(echo '{"level-of-operation": "page"}') -m mets.xml # # segment into regions ocrd-tesserocr-segment-region -I DESKEWED -O REGIONS -m mets.xml ``` ---- ### From image to regions: example ![](https://i.imgur.com/ND1Qzcu.png) ---- ### From regions to lines: commands ```shell # # clip regions ocrd-cis-ocropy-clip -I REGIONS -O CLIPPED -p <(echo '{"level-of-operation": "region"}') -m mets.xml # # binarize on region level ocrd-cis-ocropy-binarize -I CLIPPED -O RBIN -p <(echo '{"level-of-operation": "region"}') -m mets.xml # # deskew on region level ocrd-cis-ocropy-deskew -I RBIN -O RDESKEWED -p <(echo '{"level-of-operation": "region"}') -m mets.xml # # segment into lines ocrd-tesserocr-segment-line -I RDESKEWED -O LINES -m mets.xml ``` ---- ### 
From regions to lines: example

----

### From lines to text: commands

```shell
#
# clip lines
ocrd-cis-ocropy-clip -I LINES -O LCLIPPED -p <(echo '{"level-of-operation": "line"}') -m mets.xml
#
# binarize on line level
ocrd-cis-ocropy-binarize -I LCLIPPED -O LBIN -p <(echo '{"level-of-operation": "line"}') -m mets.xml
#
# dewarp on line level
ocrd-cis-ocropy-dewarp -I LBIN -O DEWARPED -p <(echo '{"level-of-operation": "line"}') -m mets.xml
#
# recognize text
ocrd-tesserocr-recognize -I DEWARPED -O TEXT -p <(echo '{"model": "frk+deu+Fraktur+Latin"}') -m mets.xml
```

----

### From lines to text: example

----

### Observations

- Decent progress on the image preprocessing stage
- Severe problems with recognition of (complex) page structures
- Acceptable text quality (i.e. on par with, and sometimes even better than, ABBYY FineReader)
- Running time per book (roughly 2 h) needs improvement (?)

----

### Measurements on current GT

- Tesseract vs. Ocropy OCR, different models
- Tesseract vs. Ocropy{nlbin} vs. Olena{Kim/Wolf/Sauvola} binarization
- binarization on page level vs. region level
- impact of various preprocessors (deskewing, dewarping, clipping, resegmentation)
- resegmentation vs. clipping on line level
- segmentation accuracy of GT itself
- deskewing on page level vs. region level
- _dewarping on page level vs.
line level_ (not covered here)

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
clip["CLIP{BLOCK}"]
deskew["DESKEW{BLOCK}:tesserocr"]
reseg[RESEGMENT]
dew[DEWARP:ocropy]
ocr[OCR:*]
bin --> clip
clip --> deskew
deskew --> reseg
reseg --> dew
dew --> ocr
```

| OCR | CER[%] |
| --- | ------ |
| OCRO{fraktur} | 23.7 |
| OCRO{fraktur(jze)} | 28.4 |
| TESS{Fraktur} | 12.2 |
| TESS{frk} | 11.9 |
| TESS{frk+deu} | 11.5 |

→ layout/preprocessing still not good enough

→ Ocropy suffers from _Frakturwechsel_ (nearly no coverage of Antiqua/Arabic numerals, no way to mix models)

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:ocropy"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 23.2 | (-0.5 for s/tesserocr/ocropy/) |
| OCRO{fraktur(jze)} | 28.0 | (-0.4 for s/tesserocr/ocropy/) |
| TESS{Fraktur} | 12.1 | (-0.1 for s/tesserocr/ocropy/) |
| TESS{frk} | 11.9 | (+-0 for s/tesserocr/ocropy/) |
| TESS{frk+deu} | 11.4 | (-0.1 for s/tesserocr/ocropy/) |

→ Ocropy deskews slightly better than Tesseract (the latter being more conservative)

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 24.4 | (+1.2 for s//-DESKEW/) |
| OCRO{fraktur(jze)} | 29.3 | (+1.3 for s//-DESKEW/) |
| TESS{Fraktur} | 12.8 | (+0.7 for s//-DESKEW/) |
| TESS{frk} | 12.6 | (+0.7 for s//-DESKEW/) |
| TESS{frk+deu} | 12.2 | (+0.8 for s//-DESKEW/) |

→ Deskewing helps, but not that much (i.e.
either GT images already have little skew, or dewarping can compensate)

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 52.6 | (+28.2 for s//-DEWARP/) |
| OCRO{fraktur(jze)} | 61.4 | (+32.1 for s//-DEWARP/) |
| TESS{Fraktur} | 13.3 | (+0.5 for s//-DEWARP/) |
| TESS{frk} | 13.2 | (+0.6 for s//-DEWARP/) |
| TESS{frk+deu} | 12.8 | (+0.6 for s//-DEWARP/) |

→ Ocropy is very sensitive to warped images, Tesseract is nearly immune

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 53.2 | (+0.6 for s/-DEWARP/-DEWARP-DESKEW/) |
| OCRO{fraktur(jze)} | 63.0 | (+1.6 for s/-DEWARP/-DEWARP-DESKEW/) |
| TESS{Fraktur} | 13.5 | (+0.2 for s/-DEWARP/-DEWARP-DESKEW/) |
| TESS{frk} | 13.3 | (+0.1 for s/-DEWARP/-DEWARP-DESKEW/) |
| TESS{frk+deu} | 12.9 | (+0.1 for s/-DEWARP/-DEWARP-DESKEW/) |

→ Deskewing apparently cannot replace dewarping, i.e. either deskewing is too bad, or dewarping cannot compensate for missing deskewing in the first place.

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 56.8 | (+3.6 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
| OCRO{fraktur(jze)} | 66.0 | (+3.0 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
| TESS{Fraktur} | 13.6 | (+0.1 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
| TESS{frk} | 13.7 | (+0.4 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
| TESS{frk+deu} | 13.3 | (+0.4 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |

→ Resegmentation primarily helps Ocropy, i.e.
Tesseract is much less sensitive to invading as-/descenders from neighbouring lines

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> deskew
deskew["DESKEW{PAGE}:tesserocr"]
deskew --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 24.6 | (+0.9 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
| OCRO{fraktur(jze)} | 29.7 | (+1.3 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
| TESS{Fraktur} | 13.4 | (+1.2 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
| TESS{frk} | 13.2 | (+1.3 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
| TESS{frk+deu} | 12.7 | (+1.2 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |

→ Deskewing (on average) works slightly better on the region level than on the page level

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> den
den["DENOISE{PAGE}:ocropy"]
den --> deskew
deskew["DESKEW{PAGE}:ocropy"]
deskew --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 19.1 | (-5.5 for s//DENOISE{PAGE}/) |
| OCRO{fraktur(jze)} | 23.5 | (-6.2 for s//DENOISE{PAGE}/) |
| TESS{Fraktur} | 11.7 | (-1.7 for s//DENOISE{PAGE}/) |
| TESS{frk} | 11.8 | (-1.4 for s//DENOISE{PAGE}/) |
| TESS{frk+deu} | 11.2 | (-1.5 for s//DENOISE{PAGE}/) |

→ Denoising can improve Wolf binarization, esp.
for Ocropy

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 23.9 | (+0.2 for s//-CLIP{BLOCK}/) |
| OCRO{fraktur(jze)} | 28.6 | (+0.2 for s//-CLIP{BLOCK}/) |
| TESS{Fraktur} | 12.3 | (+0.1 for s//-CLIP{BLOCK}/) |
| TESS{frk} | 12.3 | (+0.4 for s//-CLIP{BLOCK}/) |
| TESS{frk+deu} | 11.9 | (+0.4 for s//-CLIP{BLOCK}/) |

→ Clipping on the region level gives only minimal improvement (if resegmentation is used)

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 31.2 | (+7.5 for s//-RESEG/) |
| OCRO{fraktur(jze)} | 35.1 | (+6.7 for s//-RESEG/) |
| TESS{Fraktur} | 13.2 | (+1.0 for s//-RESEG/) |
| TESS{frk} | 12.7 | (+0.8 for s//-RESEG/) |
| TESS{frk+deu} | 12.4 | (+0.9 for s//-RESEG/) |

→ Resegmentation primarily helps Ocropy, i.e.
Tesseract is much less sensitive to invading as-/descenders from neighbouring lines

→ Resegmentation needs deskewing

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> clip2
clip2["CLIP{LINE}"]
clip2 --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 24.1 | (+0.4 for s/RESEG/CLIP{LINE}/) |
| OCRO{fraktur(jze)} | 28.9 | (+0.5 for s/RESEG/CLIP{LINE}/) |
| TESS{Fraktur} | 12.9 | (+0.7 for s/RESEG/CLIP{LINE}/) |
| TESS{frk} | 12.7 | (+0.8 for s/RESEG/CLIP{LINE}/) |
| TESS{frk+deu} | 12.4 | (+0.9 for s/RESEG/CLIP{LINE}/) |

→ Clipping on the line level cannot quite replace resegmentation; with Tesseract, it does not help at all

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{kim}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 24.9 | (+1.2 for s/wolf/kim/) |
| OCRO{fraktur(jze)} | 29.6 | (+1.2 for s/wolf/kim/) |
| TESS{Fraktur} | 14.0 | (+1.8 for s/wolf/kim/) |
| TESS{frk} | 14.2 | (+2.3 for s/wolf/kim/) |
| TESS{frk+deu} | 13.8 | (+2.3 for s/wolf/kim/) |

→ Kim is noticeably worse than Wolf (on average)

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{sauvola}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 23.0 | (-0.7 for s/wolf/sauvola/) |
| OCRO{fraktur(jze)} | 27.9 | (-0.5 for s/wolf/sauvola/) |
| TESS{Fraktur} | 12.0 | (-0.2 for s/wolf/sauvola/) |
| TESS{frk} | 11.8 | (-0.1 for s/wolf/sauvola/) |
| TESS{frk+deu} | 11.5 | (+-0 for s/wolf/sauvola/) |

→ Basic Sauvola is slightly better than Wolf (on
average)

----

```mermaid
graph LR
bin["BIN{PAGE}:olena{sauvola-ms-split}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 22.8 | (-0.9 for s/wolf/sauvola-ms-split/) |
| OCRO{fraktur(jze)} | 27.6 | (-0.8 for s/wolf/sauvola-ms-split/) |
| TESS{Fraktur} | 11.6 | (-0.6 for s/wolf/sauvola-ms-split/) |
| TESS{frk} | 11.5 | (-0.4 for s/wolf/sauvola-ms-split/) |
| TESS{frk+deu} | 11.1 | (-0.4 for s/wolf/sauvola-ms-split/) |

→ This Sauvola variant is even better (on average)

----

```mermaid
graph LR
bin["BIN{PAGE}:ocropy{nlbin}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 36.4 | (+12.7 for s/wolf/ocropy/) |
| OCRO{fraktur(jze)} | 41.1 | (+12.7 for s/wolf/ocropy/) |
| TESS{Fraktur} | 15.7 | (+3.5 for s/wolf/ocropy/) |
| TESS{frk} | 14.9 | (+3.0 for s/wolf/ocropy/) |
| TESS{frk+deu} | 14.8 | (+3.3 for s/wolf/ocropy/) |

→ Ocropy{nlbin} is really bad! (but what about `perc` / `range` / `threshold` / `lo` / `hi` parameters?)

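
----

The CER[%] columns in these tables are character error rates. As a reading aid, here is a minimal Python sketch, assuming CER is the Levenshtein distance between OCR output and ground truth divided by the ground-truth length. The helper names are hypothetical, and the evaluation tool actually used for these measurements may normalize or align text differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer_percent(ground_truth: str, ocr_text: str) -> float:
    """Character error rate in percent, relative to the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 100.0 * levenshtein(ground_truth, ocr_text) / len(ground_truth)


if __name__ == "__main__":
    # one substitution in seven characters
    print(f"{cer_percent('Fraktur', 'Fractur'):.1f}")
```

Under this definition a CER above 100 % is possible when the OCR output contains many spurious characters.
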
----

```mermaid
graph LR
clip["CLIP{BLOCK}"]
clip --> bin
bin["BIN{BLOCK}:ocropy{nlbin}"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 23.0 | (-13.4 for s/BIN{PAGE}/BIN{BLOCK}/) |
| OCRO{fraktur(jze)} | 27.4 | (-13.7 for s/BIN{PAGE}/BIN{BLOCK}/) |
| TESS{Fraktur} | 11.5 | (-4.2 for s/BIN{PAGE}/BIN{BLOCK}/) |
| TESS{frk} | 11.4 | (-3.5 for s/BIN{PAGE}/BIN{BLOCK}/) |
| TESS{frk+deu} | 11.2 | (-3.6 for s/BIN{PAGE}/BIN{BLOCK}/) |

→ Binarization on the region level is superior to the page level (if using Ocropy{nlbin}!)

note:
- clipping is neutral here – it does ad-hoc binarization (with Ocropy{nlbin})
- to be conclusive, the Olena variants should be run like this as well (which requires bashlib access to `AlternativeImage` on the region level)

----

```mermaid
graph LR
clip["CLIP{BLOCK}"]
clip --> bin
bin["BIN{BLOCK}:tesserocr"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 55.8 | (+32.8 for s/ocropy{nlbin}/tesserocr/) |
| OCRO{fraktur(jze)} | 63.5 | (+36.1 for s/ocropy{nlbin}/tesserocr/) |
| TESS{Fraktur} | 15.0 | (+3.5 for s/ocropy{nlbin}/tesserocr/) |
| TESS{frk} | 15.3 | (+3.9 for s/ocropy{nlbin}/tesserocr/) |
| TESS{frk+deu} | 14.8 | (+3.6 for s/ocropy{nlbin}/tesserocr/) |

→ Ocropy cannot cope with Tesseract binarization.

→ But even Tesseract prefers Ocropy binarization!

note:
- to be conclusive, Tesseract binarization should be run on the page level for comparison, but this is impossible with its C API

----

```mermaid
graph LR
bin["BIN{BLOCK}:ocropy{nlbin}"]
bin --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 55.4 | (+32.4 for s//-CLIP-DESKEW-RESEG-DEWARP/)^1^ |
| OCRO{fraktur(jze)} | 64.6 | (+37.2 for s//-CLIP-DESKEW-RESEG-DEWARP/)^1^ |
| TESS{Fraktur} | 12.9 | (+1.4 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk} | 12.9 | (+1.5 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk+deu} | 12.7 | (+1.5 for s//-CLIP-DESKEW-RESEG-DEWARP/) |

→ Without the extra preprocessors, Ocropy is lost on GT.

^1^: "naïve" configuration of Ocropy

----

```mermaid
graph LR
bin["BIN{BLOCK}:tesserocr"]
bin --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 66.4 | (+10.4 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| OCRO{fraktur(jze)} | 72.4 | (+8.9 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{Fraktur} | 16.2 | (+1.2 for s//-CLIP-DESKEW-RESEG-DEWARP/)^2^ |
| TESS{frk} | 16.9 | (+1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/)^2^ |
| TESS{frk+deu} | 16.4 | (+1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/)^2^ |

→ Without the extra processors, Tesseract still works.

^2^: "naïve" configuration of Tesseract (but how does this compare to the CLI?)

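
----

The comparison columns are plain CER differences between a configuration and its reference configuration. A small Python sketch for recomputing them, e.g. as a sanity check when transcribing the tables; the helper name is hypothetical, and the values are copied from the Wolf baseline and the sauvola-ms-split tables above.

```python
# CER[%] per OCR model, copied from the tables in this deck
BASELINE_WOLF = {
    "OCRO{fraktur}": 23.7,
    "OCRO{fraktur(jze)}": 28.4,
    "TESS{Fraktur}": 12.2,
    "TESS{frk}": 11.9,
    "TESS{frk+deu}": 11.5,
}
SAUVOLA_MS_SPLIT = {
    "OCRO{fraktur}": 22.8,
    "OCRO{fraktur(jze)}": 27.6,
    "TESS{Fraktur}": 11.6,
    "TESS{frk}": 11.5,
    "TESS{frk+deu}": 11.1,
}


def cer_deltas(reference: dict, variant: dict) -> dict:
    """Variant minus reference CER per OCR model, rounded to one decimal."""
    return {
        model: round(variant[model] - reference[model], 1)
        for model in reference
    }


if __name__ == "__main__":
    for model, delta in cer_deltas(BASELINE_WOLF, SAUVOLA_MS_SPLIT).items():
        print(f"{model}: {delta:+.1f}")
```

Negative values mean the variant has a lower CER than its reference.
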
----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 56.8 | (+33.1 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| OCRO{fraktur(jze)} | 66.0 | (+37.6 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{Fraktur} | 13.6 | (+1.4 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk} | 13.5 | (+1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk+deu} | 13.3 | (+1.8 for s//-CLIP-DESKEW-RESEG-DEWARP/) |

→ Again, Ocropy requires the extra preprocessors, while Tesseract is surprisingly insensitive to invading neighbouring regions/lines and to skewed and warped images.

----

### Measurements on GT4HistOCR

- directory / file name suffix structure instead of METS/PAGE
- pre-built (fixed) preprocessing with less effort (no polygons, no clipping / resegmentation / dewarping)
- automatic segmentation by alignment instead of manual segmentation
- representative?
- large enough for training (OCR, COR)

----

```mermaid
graph LR
bin["BIN{PAGE}:ocropy{nlbin}"]
deskew["DESKEW{PAGE}:ocropy{nlbin}"]
lseg["SEG{LINE}:ocropy{gpageseg}"]
bin --> deskew
deskew --> lseg
lseg --> ocr
ocr[OCR:*]
style bin fill:#ccc
style deskew fill:#ccc
style lseg fill:#ccc
```

<!-- .slide svg: width="80px"; height="10px"; -->

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 8.33 | (-14.5 for s/OCR-D/4HistOCR/) |
| OCRO{fraktur(jze)} | 6.27 | (-21.3 for s/OCR-D/4HistOCR/) |
| TESS{Fraktur} | 8.33 | (-3.3 for s/OCR-D/4HistOCR/) |
| TESS{frk} | ? | (- ? for s/OCR-D/4HistOCR/) |
| TESS{frk+deu} | ? | (- ? for s/OCR-D/4HistOCR/) |

→ Much better than best measured OCR-D configuration!

----

### Striving for best-practice configurations

- Increasing number of processors
  - DFKI preprocessing and layout analysis
  - Würzburg layout analysis
  - Leipzig and Munich post-correction
- Increasing number of models
  - Trainable processors?
  - Erlangen font detection + Leipzig training tools
  - OCR-D GT
- Increasing number of possible configurations
  - Recommendations for users are needed!
  - (Tools and workflows for text-based evaluation)
  - Tools and workflows for layout evaluation

---

## Wishlist

- [ ] improve quality/consistency of existing processors (`AlternativeImage`, DPI, packaging, logging, documentation)
- [ ] OCR-D wrapper for PRImA Layout Evaluation (profiles)
- [ ] OCR-D wrapper for ScanTailor
- [ ] OCR-D wrapper for [page-level dewarping with Leptonica](https://tpgit.github.io/UnOfficialLeptDocs/leptonica/dewarping.html)