Kay-Michael Würzner
---
id: ocrd_workflows_2019
title: Flexible OCR workflows with OCR-D
tags: OCR, OLR, OCR-D
description: Slides for the OCR-D developer workshop 2019
slideOptions:
  theme: white
  slideNumber: true
---

<style>
/* reduce from default 48px: */
.reveal {
  font-size: 24px;
  text-align: left;
}
.reveal .slides {
  text-align: left;
}
/* change from default gray-on-black: */
.hljs {
  color: #005;
  background: #fff;
}
/* prevent invisible fragments from occupying space: */
.fragment.visible:not(.current-fragment) {
  display: none;
  height:0px;
  line-height: 0px;
  font-size: 0px;
}
/* increase font size in diagrams: */
.label {
  font-size: 24px;
  font-weight: bold;
}
/* increase maximum width of code blocks: */
.reveal pre code {
  max-width: 1000px;
  max-height: 1000px;
}
/* remove black border from images: */
.reveal section img {
  border: 0;
}
.reveal pre.mermaid {
  width: 100% !important;
}
.reveal svg {
  max-height: 600px;
}
.reveal .scaled-flowchart-td pre.mermaid {
  width: 100% !important;
  /* why? float: left; */
}
.reveal .scaled-flowchart-td svg {
  max-width: 100% !important;
}
.reveal .scaled-flowchart-td svg g.node,
.reveal .scaled-flowchart-td svg g.label,
.reveal .scaled-flowchart-td svg foreignObject {
  width: 100% !important;
}
.reveal .scaled-flowchart-td p {
  clear:both;
}
.reveal .centered {
  text-align: center
}
.reveal .width75 {
  max-width: 75%;
}
</style>

# Flexible OCR workflows with OCR-D <!-- .element: class="centered width75" -->

Robert Sachunsky, Kay-Michael Würzner <!-- .element: class="centered width75" -->

---

## Contents

- Workspaces
- Processors
- Integration
- Workflow
- Wishlist

---

## Workspaces

- *Physical* representation of a METS file
  - Directory with subdirectories for each file group
  - Each subdirectory contains the listed files
- Adding and removing files explicitly via `ocrd workspace` command
- Adding files implicitly via `ocrd process` command
  - Using the output file group parameter `-O`
- Cloning remote “workspaces” (i.e.
  METS files)
  - Access to millions of digitized books!

```shell
ocrd workspace clone \
  https://digital.slub-dresden.de/data/kitodo/gottgott_38213401X/gottgott_38213401X_mets.xml .
```

---

## Processors

- Representation of OCR-related operations as *processors*
- Operate within an OCR-D workspace
  - Input and output definition via METS file groups
  - Parameter specification via JSON
- Invocation via
  - Individual CLI (e.g. `ocrd-anybaseocr-crop`)
  - Meta-processor `ocrd process` (concatenated + validated invocation of multiple processors within a single call)
- Existing workflow-fit processors:
  - Ocropy-based ...
  - Tesseract-based ...
  - Olena-based binarization ...
  - Module Projects ...

----

### Ocropy-based processors

- Many OCR-related operations available
  - Not cleanly wrapped
  - Not separately available
- Creating processors for
  - Binarization (on page/region/line level)
  - Deskewing (on page/region level)
  - Dewarping (on line level)
  - *Clipping* (on region/line level)
  - *Resegmentation* (on line level)
- Ultimate goal: Ocropy as an API
  - Code clean-up and improvement ...
  - New operations ...
  - Restructuring into `ocrolib` and processors ...
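
----

### Processors

#### Invocation via individual CLI: a sketch

The bullets on the Processors slide (file groups for input and output, parameters as JSON, one CLI per processor) can be illustrated with a small helper that assembles such a call. This is only a sketch: the helper name `build_processor_call` is hypothetical, and the flags `-I`, `-O`, `-p` and `-m` are taken from the command examples elsewhere in this deck, not from a reference of the OCR-D CLI.

```python
import json
import tempfile


def build_processor_call(executable, input_grp, output_grp, params=None, mets="mets.xml"):
    """Return (argv, parameter_file) for one processor invocation.

    Hypothetical helper: it only assembles the argument list and does not run anything.
    """
    argv = [executable, "-I", input_grp, "-O", output_grp, "-m", mets]
    param_path = None
    if params:
        # the parameters are handed over as a JSON file
        with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
            json.dump(params, handle)
            param_path = handle.name
        argv += ["-p", param_path]
    return argv, param_path


argv, param_path = build_processor_call(
    "ocrd-cis-ocropy-binarize", "CROPPED", "BIN",
    params={"level-of-operation": "page"})
print(" ".join(argv))
```

The resulting list could be passed to `subprocess.run(argv, check=True)` inside a workspace directory, assuming the processor is installed.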
----

### Ocropy-based processors

#### Code clean-up and improvement

- move common Ocropy functions into `common` (from CLIs, from `OLD/`), add new ones:
  - `PIL.Image` vs `np.ndarray` conversions: **`array2pil`**, **`pil2array`** (integer vs normed float)
  - plausibility checks: **`check_page`**, **`check_region`**, **`check_line`** (but mix absolute and relative bounds, and make DPI-zoomable)
  - `nlbin` deskewing: **`estimate_skew_angle`**, **`estimate_skew`** (but resize when rotating, with minimum on variance drop)
  - `nlbin` binarization: **`estimate_local_whitelevel`**, **`estimate_thresholds`**, **`binarize`** (but keep exact pixel size, catch NaN)
  - remove connected components only contained in the margins: **`borderclean`**

----

### Ocropy-based processors

#### Code clean-up and improvement

- dissect and improve segmentation:
  - **`remove_hlines`**: add height threshold, reduce default width threshold
  - **`compute_separators_morph`**: reduce thresholds (because black colseps can be discontinuous), use only connected components fully inside zone
  - **`compute_gradmaps`**: reduce boxmap minsize (for chopped lines at margins), reduce horizontal blur (to avoid joining lines via as-/descenders)
  - **`compute_line_seeds`**: more robust top/bottom projection rules and no horizontal blur (to avoid joining lines via as-/descenders)
  - **`hmerge_line_seeds`**: new way to ensure horizontal label consistency
  - **`compute_segmentation`**:
    - `fullpage` switch (regions do not have hlines and colseps)
    - `zoom` parameter (thresholds must be DPI-relative)
    - use twice the estimated `scale` (blackletter has huge capitals and dense as-/descenders – avoid splitting lines)
    - before spreading line seeds, assign unlabelled connected components to their majority seed (instead of splitting)

----

### Ocropy-based processors

#### New operation: *Resegmentation*

observation 1:
~ Ocropy dewarping and recognition are very sensitive to connected components intruding from neighbouring lines (e.g.
ascenders and descenders)

observation 2:
~ GT line segmentation is very coarse (only bounding boxes, large overlap)

idea:
~ use Ocropy line segmentation to improve GT line segmentation via label majority rule, then annotate _shrunk polygon_

----

### Ocropy-based processors

#### New operation: *Resegmentation*

<div>

![](https://user-images.githubusercontent.com/38561704/60338418-73978f00-9995-11e9-8fd3-4c149e2df266.png)
![](https://user-images.githubusercontent.com/38561704/60338419-74302580-9995-11e9-9fce-a1eb1fe44e2f.png)
![](https://user-images.githubusercontent.com/38561704/60338421-74302580-9995-11e9-9997-0165ef75833b.png)
![](https://user-images.githubusercontent.com/38561704/60338422-74302580-9995-11e9-9a47-452557fec061.png)
![](https://user-images.githubusercontent.com/38561704/60624848-d940ac80-9dd5-11e9-95a9-2074c2e0182a.png)

</div> <!-- .element: class="fragment" data-fragment-index="0" -->

![](https://user-images.githubusercontent.com/38561704/61275373-5a684e00-a79d-11e9-82f7-ae40ad8d02ae.png) <!-- .element: class="fragment" data-fragment-index="1" -->

----

### Ocropy-based processors

#### New operation: *Resegmentation*

![](https://user-images.githubusercontent.com/38561704/61275456-8daadd00-a79d-11e9-91d9-5365c047403a.png) <!-- .element: class="fragment" data-fragment-index="0" -->
![](https://user-images.githubusercontent.com/38561704/61275578-cc409780-a79d-11e9-8845-762f5df66488.png) <!-- .element: class="fragment" data-fragment-index="1" -->
![](https://user-images.githubusercontent.com/38561704/61275670-f7c38200-a79d-11e9-9427-5d4cbe7733d6.png) <!-- .element: class="fragment" data-fragment-index="2" -->
![](https://user-images.githubusercontent.com/38561704/61275855-57ba2880-a79e-11e9-91be-f7c531ac8d15.png) <!-- .element: class="fragment" data-fragment-index="3" -->
![](https://user-images.githubusercontent.com/38561704/61275459-8daadd00-a79d-11e9-940e-650d15331273.png) <!-- .element: class="fragment" data-fragment-index="4" -->
![](https://user-images.githubusercontent.com/38561704/61275580-cc409780-a79d-11e9-819b-f8e009c70b93.png) <!-- .element: class="fragment" data-fragment-index="5" -->
![](https://user-images.githubusercontent.com/38561704/61275671-f7c38200-a79d-11e9-9daa-386ca7b72109.png) <!-- .element: class="fragment" data-fragment-index="6" -->
![](https://user-images.githubusercontent.com/38561704/61275857-57ba2880-a79e-11e9-922c-b25122c3dd13.png) <!-- .element: class="fragment" data-fragment-index="7" -->
![](https://user-images.githubusercontent.com/38561704/61275460-8daadd00-a79d-11e9-87be-d6e8a193d38a.png) <!-- .element: class="fragment" data-fragment-index="8" -->
![](https://user-images.githubusercontent.com/38561704/61275581-ccd92e00-a79d-11e9-9be6-7d3260512ad3.png) <!-- .element: class="fragment" data-fragment-index="9" -->
![](https://user-images.githubusercontent.com/38561704/61275672-f7c38200-a79d-11e9-9d65-7d8c7b74e88f.png) <!-- .element: class="fragment" data-fragment-index="10" -->
![](https://user-images.githubusercontent.com/38561704/61275859-5852bf00-a79e-11e9-90d7-ad5abc3cc1c4.png) <!-- .element: class="fragment" data-fragment-index="11" -->
![](https://user-images.githubusercontent.com/38561704/61275461-8daadd00-a79d-11e9-8b89-b67a4d6dab14.png) <!-- .element: class="fragment" data-fragment-index="12" -->
![](https://user-images.githubusercontent.com/38561704/61275582-ccd92e00-a79d-11e9-8a40-95c270433154.png) <!-- .element: class="fragment" data-fragment-index="13" -->
![](https://user-images.githubusercontent.com/38561704/61275674-f7c38200-a79d-11e9-8e1a-de54e1428fb3.png) <!-- .element: class="fragment" data-fragment-index="14" -->
![](https://user-images.githubusercontent.com/38561704/61275861-5852bf00-a79e-11e9-9a45-aba95d8bc09b.png) <!-- .element: class="fragment" data-fragment-index="15" -->
![](https://user-images.githubusercontent.com/38561704/61275462-8e437380-a79d-11e9-9a9d-1178a7860a49.png) <!-- .element: class="fragment" data-fragment-index="16" -->
![](https://user-images.githubusercontent.com/38561704/61275583-ccd92e00-a79d-11e9-80f6-aadeb5f7bbc7.png) <!-- .element: class="fragment" data-fragment-index="17" -->
![](https://user-images.githubusercontent.com/38561704/61275676-f85c1880-a79d-11e9-9f29-03ee2c95f213.png) <!-- .element: class="fragment" data-fragment-index="18" -->
![](https://user-images.githubusercontent.com/38561704/61275863-5852bf00-a79e-11e9-85e5-7f9e3c04e5bb.png) <!-- .element: class="fragment" data-fragment-index="19" -->

----

### Ocropy-based processors

#### New operation: *Resegmentation*

![](https://i.imgur.com/rTyX1xP.png)

----

### Ocropy-based processors

#### New operation: *Clipping*

observation 1:
~ on GT, both regions and lines often overlap with their neighbouring regions and lines – not just in the background, but within connected components

idea 1:
~ remove connected components that are not fully contained in the segment but in a neighbour

observation 2:
~ many frequent cases of this will create interior islands or non-contiguous polygons (not allowed in PAGE-XML, usually not supported by implementations)

idea 2:
~ do not remove by shrinking the polygon, but by _clipping_ to the background colour

note:

- can be used to suppress graphics or separators within or across a region or line
- can be used as an alternative to resegmentation (on the line level)
- cannot be applied if the segment already has `AlternativeImage` or `@orientation` (segments and neighbours become incommensurable)
- runs best after binarization

----

### Ocropy-based processors

#### New operation: *Clipping*

![](https://user-images.githubusercontent.com/38561704/61275129-c7c7af00-a79c-11e9-8752-dc5051773d1f.png)
![](https://user-images.githubusercontent.com/38561704/61276541-c64bb600-a79f-11e9-857a-30310270b5fe.png) <!-- .element: class="fragment" data-fragment-index="1" -->
![](https://user-images.githubusercontent.com/38561704/61276557-d06db480-a79f-11e9-92da-abbe206e4e37.png) <!-- .element: class="fragment"
data-fragment-index="2" -->

----

### Ocropy-based processors

#### New operation: *Clipping*

- problem: regions overlapping for no good reason, e.g. from Tesseract

![](https://i.imgur.com/001fg8v.jpg =350x)

----

### Ocropy-based processors

#### Planned restructuring

- `tmbdev/ocropy`, forked under `OCR-D/ocropy`:
  - move non-UI functions from CLIs into `ocrolib`
  - package `ocrolib` under PyPI _ocrolib_
  - package CLIs under PyPI _ocropus_
  - add our `ocrolib` changes (one by one):
    - Python 3 port
    - additional `ocrolib.common` functions
    - improvements in segmentation
  - try to get upstream approval
- our `OCR-D/ocrd_ocropus`:
  - only for OCR-D wrappers
  - base on new `ocrolib`

→ **Ocropy as API is under way!**

----

### Tesseract-based processors

- Many OCR-related operations available
  - Mostly available via API
  - Exposed to Python via `tesserocr`
- Often more robust segmentation and text recognition than Ocropy
- Creating processors for
  - Binarization (on region/line level)
  - *Cropping* ...
  - Segmentation (on page/region/line level) ...
  - Deskewing (on page/region level) ...
  - Text recognition (on line/word/glyph level)

----

### Tesseract-based processors

#### New processor: Poor-man's cropping

- Cropping not implemented as separate function in Tesseract
- Idea: minimal rectangle around all detected regions as `Border`
- Side effect: quality improvement upon repeated region detection within `Border`
- Problems:
  - Facing pages
  - Empty or sparsely filled pages
  - Robustness (i.e. works well for most pages but really badly for some)

----

### Tesseract-based processors

#### Basal segment classification

- Distinction of different region types in Tesseract
  - Text
  - Image
  - Separator
  - Table
  - ...
- Mapped to (coarser) PAGE classification
- Switches:
  - `crop_polygons` (rectangles or polygons?)
  - `find_tables` (tables as tables?)
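
----

### Tesseract-based processors

#### Poor-man's cropping: a sketch

The cropping idea from the earlier slide (the minimal rectangle around all detected regions becomes the `Border`) boils down to a union of bounding boxes. The following is a minimal sketch, not the actual processor code: `border_from_regions` is a hypothetical name, boxes are assumed to be `(x0, y0, x1, y1)` tuples as a layout analysis might return them, and the sample coordinates are made up.

```python
def border_from_regions(boxes):
    """Smallest axis-aligned rectangle enclosing all region boxes.

    Each box is assumed to be an (x0, y0, x1, y1) tuple in page pixel coordinates.
    """
    boxes = list(boxes)
    if not boxes:
        # empty page: no regions detected, so there is nothing to crop to
        return None
    x0 = min(box[0] for box in boxes)
    y0 = min(box[1] for box in boxes)
    x1 = max(box[2] for box in boxes)
    y1 = max(box[3] for box in boxes)
    return (x0, y0, x1, y1)


# made-up region boxes for illustration
regions = [(120, 80, 900, 400), (110, 420, 880, 1300), (130, 1320, 905, 1500)]
border = border_from_regions(regions)
print(border)  # (110, 80, 905, 1500)
```

The `None` case mirrors the "empty or sparsely filled pages" problem: a single stray region on an otherwise empty page would shrink the border to that region, which is why the approach is not robust for every page.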
----

### Tesseract-based processors

#### Basal segment classification

![](https://i.imgur.com/zfvx5iy.jpg =700x)

----

### Tesseract-based processors

#### Deskewing and orientation

- 2 distinct APIs in Tesseract:

| | `DetectOrientationScript()` | `AnalyseLayout()` + `Orientation()` |
| - | --- | --- |
| confidence | yes | no |
| orientation | yes | yes |
| script | yes | no |
| deskewing | no | yes |
| textline order | no | yes |
| reading direction | no | yes |

→ can yield **contradictory results**!

----

### Tesseract-based processors

#### Deskewing and orientation

- resolve conflicts:
  1. accept _script_ results from OSD (if very confident)
  2. accept _orientation_ results from OSD (if very confident)
  3. ignore _orientation_ results from AL (but warn if contradictory)
  4. apply _deskewing_ results from AL (adding angle to orientation)
  5. apply _order/direction_ results from AL
- available on _page_ and _region_ level

----

### Olena-based binarization

- Binarization is still relevant!
  - No RGB-processing recognition engine(s) available
  - High influence on OLR and OCR results
- Multiple binarization implementations in Olena
  - *Kim*, *Niblack*, *Wolf* ...
    → expose as parameter
- Only as CLIs → `ocrd bashlib` as last resort
- PAGE processing (but only on page level)
- `AlternativeImage` support (as case study in `xmlstarlet`)

----

### Olena-based binarization

| Original | Tesseract |
|:---------:|:-----------:|
|![](https://i.imgur.com/MbUwqjK.png)|![](https://i.imgur.com/NYBcm0g.png)|

| Ocropy | Olena-Wolf |
|:------------:|:--------------:|
|![](https://i.imgur.com/rQluIkV.png)|![](https://i.imgur.com/QkjuOw0.png)|

----

### Module Project-based processors

- Not very many are workflow-fit
  - Interface compatibility
  - Result delivery
- Insufficient documentation
  - Short `README`s
  - Missing integration examples
  - Missing training facilities
- Low visibility

----

### Module Project-based processors

#### `ocrd-anybaseocr-crop`

- Interface-compatible page border detection
  - Collaborative effort between module project and coordination project
  - Extremely important preprocessing step (due to DFG requirements on digitization)
  - Not yet implemented for other `anybaseocr`-based processors
- Very, very promising results
- Input: image
- Output: `Border` element with coordinates
- Not yet `AlternativeImage`-sensitive

----

### Module Project-based processors

#### `ocrd-anybaseocr-crop`: example

| Tesseract | DFKI |
|:-----------------------------------:|:-----------------------------------:|
| ![](https://i.imgur.com/k3VbSyj.png) | ![](https://i.imgur.com/6lsluzE.png)|

---

## Integration

- APIs for PAGE and METS
  - Delivery of results in a comfortable and interoperable way
  - No need to directly modify XML files
- Use PAGE for
  - Page-, region-, line-, word- and glyph-level results
  - Descriptive and binary (`AlternativeImage`) results
- Use METS for
  - Document-level results
- Specific aspects:
  - DPI relativity ...
  - Description vs. image ... (`AlternativeImage`)
  - Multiple inputs or outputs ...
  - Logging as a result ...

----

### DPI relativity

- Most OLR operations are sensitive to image resolution: e.g.
  minimum/maximum segment size in pixels
- Some tools are DPI-aware, e.g. Tesseract:

```
Warning: Invalid resolution 0 dpi. Using 70 instead.
```

```python
tessapi.SetVariable('user_defined_dpi', str(200))
```

- Some implementations expect a certain value, e.g. Ocropy: 300 DPI
- We (usually) know DPI from metadata/tags:

```python
info = OcrdExif(pil_image)
if info.resolution != 1: # tag available
    dpi = info.resolution
    if info.resolutionUnit == 'cm':
        dpi = round(dpi * 2.54)
    ...
```

----

### DPI relativity

- OLR processors that are DPI-aware already: _pass_ DPI from **`OcrdExif`**
  (e.g. Tesseract wrappers)
- OLR processors that are not DPI-aware yet: _modify_ to **zoom**
  (e.g. Ocropy wrappers):
  1. determine factor between expected and actual DPI
  2. multiply all relevant constants in the code

----

### Description vs. image

- PAGE: hierarchy of elements, _each_ with descriptive and binary content
  - **original** `/PcGts/Page/@imageFilename`
  - **derived** `//AlternativeImage/@filename`
- Preprocessing steps: either
  1. producing derived images _obligatory_: binarization, despeckling, dewarping
  2. just describing the operation (images _optional_): cropping/segmentation (`Coords/@points`), deskewing (`@orientation`);
     operation must be _applied_ at some point – preferably when descending to a lower hierarchy level (i.e. during segmentation)
- Consumers: must
  - _respect_ `AlternativeImage` if present on their hierarchy level of interest
  - _generate_ an image from the parent otherwise (which again could have `AlternativeImage`) – by:
    - **cutting** from `Coords/@points`
    - **rotating** by `(Page|TextRegion)/@orientation`

----

### Description vs. image

#### AlternativeImage problems and solutions

- bounding box rectangles are too coarse (esp.
  in the presence of skew)
  → **polygon** coordinates must always be preserved fully (using polygon masking instead of simple cutting)
- **coordinates are absolute** (they reference the _original_ image)
  → before cutting from the parent image, coordinates must be converted _to relative_
  → before adding new child elements, coordinates must be converted _from relative_
  → offsets must always be passed down the hierarchy
- not all image operations retain pixel positions:
  1. deskewing → annotate `@orientation`, _but_:
     - rotation generally enlarges the binary image at the margins
       → compensate by **additional offset** (half the increase in size)
     - rotation applies around the center of the image, not the origin
       → coordinates must be **translated to center**, rotated passively, then translated back
  2. dewarping → use `Grid`?
  3. rescaling → extend PAGE-XML with `@scale`?

----

### Description vs. image

#### Summary (slightly more abstract)

- coordinates must be **reproducible**:
  > Annotation must be sufficient to calculate pixel positions in `AlternativeImage` from those in `@imageFilename`
  > (e.g. to cut parents) or vice versa (e.g. to add children).
- image preprocessing steps which alter the coordinate system must describe their transform appropriately:
  - linear coordinate transformations (translation/offset, rotation/angle, scale) can be made exact (up to rounding)
  - non-linear transformations (dewarping) are inexact ...

----

### Description vs. image

#### Remaining issues

- Strictness of **`@comments`** classification:
  - multiple `AlternativeImage` entries
    → rely on `@comments`, or always append/choose last?
  - if `AlternativeImage` _and_ `@orientation`
    → rely on `@comments`, or expect the image to be deskewed already?
  - if `AlternativeImage` _and_ `Border`
    → rely on `@comments`, or expect the image to be cropped already?
- Reproducibility:
  - if `AlternativeImage` is larger than `Coords/@points` rectangle
    → due to rotation (offset) or rescaling (zoom) or both!
    → introduce `@scale` or prohibit rescaling altogether?
  - (page/line-level) dewarping → use `Grid`?

----

### Description vs. image

#### Implementation in `OCR-D/core`

##### High-level API:

- image/offset recursion:
  - **`Workspace.image_from_page`**: on the page level (original or derived)
  - **`Workspace.image_from_segment`**: all levels below page (derived)

<div>

what it does:

- get last `AlternativeImage` or generate from parent (including:
  - conversion of coordinates to parent-relative, which involves offset correction and possibly coordinate rotation,
  - possibly image rotation)
- return the image and its absolute bounding box (compensating for resizing by additional offset)
- (`image_from_page` only:) also return `OcrdExif` instance for original

</div> <!-- .element: class="fragment fade-in-then-fade-out" data-fragment-index="1" -->

<div>

how to use:

- not recursive itself – needs to be called recursively, passing down results:

```python
from ocrd_modelfactory import page_from_file
...
page_id = input_file.pageId or input_file.ID # for logging
pcgts = page_from_file(workspace.download_file(input_file))
page = pcgts.get_Page()
page_image, page_xywh, page_image_info = workspace.image_from_page(
    page, page_id)
...
for region in page.get_TextRegion():
    region_image, region_xywh = workspace.image_from_segment(
        region, page_image, page_xywh)
    ...
    for line in region.get_TextLine():
        line_image, line_xywh = workspace.image_from_segment(
            line, region_image, region_xywh)
        ...
```

</div> <!-- .element: class="fragment" data-fragment-index="2" -->

----

### Description vs. image

#### Implementation in `OCR-D/core`

##### High-level API:

- add image to METS: **`Workspace.save_image_file`**

<div>

what it does:

- export image file from `PIL.Image` object
- make file path from fileGrp, ID and format
- reference the file in METS via `Workspace.add_file`
- return file path

</div> <!-- .element: class="fragment fade-in-then-fade-out" data-fragment-index="1" -->

<div>

how to use:

- needs to know image fileGrp, unique ID:

```python
...
file_id = input_file.ID.replace(self.input_file_grp, 'OCR-D-IMG-DEWARP')
...
file_path = workspace.save_image_file(image,
    file_id + '_' + region.id + '_' + line.id,
    page_id=input_file.pageId,
    file_grp='OCR-D-IMG-DEWARP')
line.add_AlternativeImage(AlternativeImageType(
    filename=file_path, comments=comments + ',' + 'dewarped'))
```

</div> <!-- .element: class="fragment" data-fragment-index="2" -->

----

### Description vs. image

#### Implementation in `OCR-D/core`

##### Low-level API:

- convert from absolute coordinates to relative: **`ocrd_utils.coordinates_of_segment`**

what it does:

- get the points of the element's polygon outline
- shift all points by the offset (top-left corner) of the parent towards origin
- (in case the parent was rotated:) rotate all points with the center of the image as pivot

how to use:

```python
line_polygon = coordinates_of_segment(line, region_image, region_xywh)
line_polygon = resegment(line_polygon, region_labels, region_image_bin, line.id)
line_polygon = coordinates_for_segment(line_polygon, region_image, region_xywh)
line.get_Coords().points = points_from_polygon(line_polygon)
```

----

### Description vs. image

#### Implementation in `OCR-D/core`

##### Low-level API:

- convert from relative coordinates to absolute: **`ocrd_utils.coordinates_for_segment`**

what it does:

- (in case the parent was rotated:) rotate all points with the center of the image as pivot in opposite direction
- shift all points by the offset (top-left corner) of the parent away from origin

how to use:

```python
...
for word_no, word in enumerate(iterate_level(tessapi.GetIterator(), RIL.WORD)):
    word_id = '%s_word%04d' % (line.id, word_no)
    bbox = word.BoundingBox(RIL.WORD)
    points = points_from_polygon(coordinates_for_segment(
        polygon_from_x0y0x1y1(bbox),
        None, # image not needed if element cannot have angle
        line_xywh))
    word = WordType(id=word_id, Coords=CoordsType(points))
    line.add_Word(word)
```

----

### Description vs. image

#### Implementation in `OCR-D/core`

##### Low-level API:

- only coordinate rotation (as `numpy.ndarray`): **`ocrd_utils.rotate_coordinates`**
- mask away exterior to background: **`image_from_polygon`**
- background-agnostic replacement for `PIL.Image.crop`: **`ocrd_utils.crop_image`**
- ...

----

### Description vs. image

#### Early adopters

- core Python implementation already used by:
  - [Tesseract processors](https://github.com/OCR-D/ocrd_tesserocr) `ocrd-tesserocr-*`
  - [Ocropy processors](https://github.com/cisocrgroup/cis-ocrd-py/) `ocrd-cis-ocropy-*`
- core Bash implementation (WIP) used by:
  - [Olena binarization](https://github.com/OCR-D/ocrd_olena) `ocrd-olena-binarize`

----

### Multiple inputs or outputs

- specified by [comma-separated list](https://ocr-d.github.io/cli#command-line-interface-cli) on CLI:

```shell
$ ocrd-olena-binarize -I OCR-D-GT-SEG-LINE -O OCR-D-SEG-PAGE,OCR-D-IMG-BIN
$ ocrd-cor-asv-ann-evaluate -I OCR-D-GT-SEG-LINE,OCR-D-OCR-TESS,OCR-D-COR-ASV-ANN
```

----

### Multiple inputs or outputs

#### Use-case for multi-valued input: alignment

- `TextLine/TextEquiv/Unicode` alignment via
  - global sequence alignment (see [ocrd-cis-align](https://github.com/cisocrgroup/cis-ocrd-py/blob/51702097e0e4ea023a06d131769eaa0de81dcdd4/ocrd_cis/align/aligner.py#L26) or [ocrd-cor-asv-ann-evaluate](https://github.com/ASVLeipzig/cor-asv-ann/blob/a460bd5a95bd6fa092c40259f99355c3af02f01b/ocrd_cor_asv_ann/wrapper/evaluate.py#L51))
  - neural attention mechanism (cf. Dong&Smith 2018 _Multi-input attention_)
- resegmentation
- possibly: text alignment informed by segmentation (coordinates)

----

### Multiple inputs or outputs

#### Use-case for multi-valued output: PAGE and image

- image preprocessing produces PAGE with `AlternativeImage` references
  → generated images must also be added to METS
- bad solution: fixed fileGrp `OCR-D-IMG-BIN`, `OCR-D-IMG-DESKEW`, `OCR-D-IMG-DEWARP` etc.
- good approach:
  - use the second position of `output_file_grp`,
  - fall back to a default if not given (e.g. [ocrd-tesserocr-binarize](https://github.com/OCR-D/ocrd_tesserocr/blob/ca2530d0f4ffd23ca5bfe7380f1b1089af36f6b6/ocrd_tesserocr/binarize.py#L59) or ocrd-olena-binarize)

----

### Logging as a result

- not all operations have a natural (PAGE/image) output file group: e.g.
  OLR/OCR evaluation, model training
- some need to aggregate over multiple pages (or even workspaces): e.g. CER/WER

---

## Workflow

- Goals:
  - flexibility and complexity of **configurations**: processors and parameters as building blocks
  - efficiency and robustness of **engines**: parallel/distributed computation and validation of inputs/outputs
- Aspects:
  - Running ...
  - Configuration ...
  - Processors ...
  - Measurements ...
  - Best Practices ...

----

### Running

- with individual CLIs combined in a custom bash script:
- with `ocrd process` as engine:
- with Taverna?
- with Kitodo?

----

### Configuration

```graphviz
digraph G {
  node[shape=box];
  compound=true;
  page_segmentation[label="Region segmentation"];
  line_segmentation[label="Line segmentation"];
  text_optimization[label="Text optimization"];
  subgraph cluster_preprocessing_page {
    label = "Page preprocessing";
    binarization_page[label="Binarization"];
    cropping[label="Cropping"];
    deskewing_page[label="Deskewing"];
    despeckling_page[label="Despeckling"];
    dewarping_page[label="Dewarping"];
    binarization_page -> cropping -> deskewing_page -> despeckling_page -> dewarping_page
    {rank=same; binarization_page, cropping, deskewing_page, despeckling_page, dewarping_page}
  }
  subgraph cluster_preprocessing_segment {
    label = "Region preprocessing";
    deskewing_segment[label="Deskewing"];
    despeckling_segment[label="Despeckling"];
    binarization_segment[label="Binarization"];
    binarization_segment -> despeckling_segment -> deskewing_segment
    {rank=same; deskewing_segment, despeckling_segment, binarization_segment}
  }
  subgraph cluster_preprocessing_line {
    label = "Line preprocessing";
    binarization_line[label="Binarization"];
    dewarping_line[label="Dewarping"];
    binarization_line -> dewarping_line
    {rank=same; binarization_line, dewarping_line}
  }
  subgraph cluster_ocr_line {
    label = "Text recognition";
    ocr_one[label="OCR 1"];
    ocr_two[label="OCR 2"];
    ocr_n[label="OCR n"];
    ocr_one -> ocr_two -> ocr_n[style=dotted,dir=none]
    {rank=same; ocr_one, ocr_two, ocr_n}
  }
  binarization_page -> page_segmentation[label="Pages",ltail=cluster_preprocessing_page]
  page_segmentation -> binarization_segment[label="Regions",lhead=cluster_preprocessing_segment]
  binarization_segment -> line_segmentation[label="Regions",ltail=cluster_preprocessing_segment]
  line_segmentation -> binarization_line[label="Lines",lhead=cluster_preprocessing_line]
  binarization_line -> ocr_one[label="Line images", ltail=cluster_preprocessing_line, lhead=cluster_ocr_line]
  ocr_one -> text_optimization[label="Line strings",ltail=cluster_ocr_line]
}
```

----

### Available processors

| Processor                  | Status   | Note        |
| -------------------------- | -------- | ----------- |
| *Binarization*             |          |             |
| `ocrd-olena-binarize`      | ✓        |             |
| `ocrd-anybaseocr-binarize` | ✗        | Interface   |
| `ocrd-cis-ocropy-binarize` | ✓        |             |
| `ocrd-kraken-binarize`     | ✗        | Invocation  |
| `ocrd-tesserocr-binarize`  | ✓        |             |
| *Despeckling*              |          |             |
| `ocrd-cis-ocropy-denoise`  | ✓        |             |

----

### Available processors

| Processor                  | Status   | Note        |
| -------------------------- | -------- | ----------- |
| *Cropping*                 |          |             |
| `ocrd-anybaseocr-crop`     | ✓        |             |
| `ocrd-kraken-crop`         | ✗        | Interface   |
| `ocrd-tesserocr-crop`      | ✓        |             |
| *Deskewing*                |          |             |
| `ocrd-anybaseocr-deskew`   | ✗        | Interface   |
| `ocrd-cis-ocropy-deskew`   | ✓        |             |
| `ocrd-tesserocr-deskew`    | ✓        |             |
| *Dewarping*                |          |             |
| `ocrd-anybaseocr-dewarp`   | ✗        | Interface   |
| `ocrd-cis-ocropy-dewarp`   | ✓        |             |

----

### Available processors

| Processor                       | Status   | Note        |
| ------------------------------- | -------- | ----------- |
| *Region Segmentation*           |          |             |
| `ocrd-tesserocr-segment-region` | ✓        |             |
| *Clipping/Resegmentation*       |          |             |
| `ocrd-cis-ocropy-clip`          | ✓        |             |
| `ocrd-cis-ocropy-resegment`     | ✓        |             |
| `ocrd-segment-repair`           | ✓        |             |
| *Line Segmentation*             |          |             |
| `ocrd-ocropy-segment`           | ✗        | Invocation  |
| `ocrd-kraken-segment`           | ✗        | Invocation  |
| `ocrd-tesserocr-segment-line`   | ✓        |             |

----

### Available processors

| Processor                       | Status   | Note     |
| ------------------------------- | -------- | -------- |
| *Font identification*           |          |          |
| `ocrd-typegroups-classifier`    | ✓        |          |
| *Text recognition*              |          |          |
| `ocrd-cis-ocropy-recognize`     | ✓        |          |
| `ocrd-tesserocr-recognize`      | ✓        |          |
| `ocrd-calamari-recognize`       | ✓        |          |

----

### Available processors

| Processor                       | Status   | Note      |
| ------------------------------- | -------- | --------- |
| *OCR alignment*                 |          |           |
| `ocrd-cis-align`                | ✓        |           |
| *Text optimization*             |          |           |
| `ocrd-cor-asv-ann-process`      | ✓        |           |
| `ocrd-cor-asv-fst-process`      | ✓        |           |
| `ocrd-cis-profile`              | ✓        |           |
| `ocrd-cis-postcorrection`       | ✗        | Interface |
| `ocrd-keraslm-rate`             | ✓        |           |
| *OCR evaluation*                |          |           |
| `ocrd-keraslm-rate`             | ✓        |           |
| `ocrd-cor-asv-ann-evaluate`     | ✓        |           |
| `ocrd-dinglehopper`             | ✓        |           |

----

### From image to regions: commands

```shell
# create workspace from existing METS
ocrd workspace clone \
  https://digital.slub-dresden.de/data/kitodo/gottgott_38213401X/gottgott_38213401X_mets.xml .
# crop with anybaseocr
ocrd-anybaseocr-crop -I ORIGINAL -O CROPPED -m mets.xml
# binarize on page level
ocrd-cis-ocropy-binarize -I CROPPED -O BIN -p <(echo '{"level-of-operation": "page"}') -m mets.xml
# deskew on page level
ocrd-cis-ocropy-deskew -I BIN -O DESKEWED -p <(echo '{"level-of-operation": "page"}') -m mets.xml
# segment into regions
ocrd-tesserocr-segment-region -I DESKEWED -O REGIONS -m mets.xml
```

----

### From image to regions: example

![](https://i.imgur.com/ND1Qzcu.png)

----

### From regions to lines: commands

```shell
# clip regions
ocrd-cis-ocropy-clip -I REGIONS -O CLIPPED -p <(echo '{"level-of-operation": "region"}') -m mets.xml
# binarize on region level
ocrd-cis-ocropy-binarize -I CLIPPED -O RBIN -p <(echo '{"level-of-operation": "region"}') -m mets.xml
# deskew on region level
ocrd-cis-ocropy-deskew -I RBIN -O RDESKEWED -p <(echo '{"level-of-operation": "region"}') -m mets.xml
# segment into lines
ocrd-tesserocr-segment-line -I RDESKEWED -O LINES -m mets.xml
```

----

###
From regions to lines: example ---- ### From lines to text: commands ```shell # # clip lines ocrd-cis-ocropy-clip -I LINES -O LCLIPPED -p <(echo '{"level-of-operation": "line"}') -m mets.xml # # binarize on line level ocrd-cis-ocropy-binarize -I LCLIPPED -O LBIN -p <(echo '{"level-of-operation": "line"}') -m mets.xml # # dewarp on line level ocrd-cis-ocropy-dewarp -I LBIN -O DEWARPED -p <(echo '{"level-of-operation": "line"}') -m mets.xml # # recognize text ocrd-tesserocr-recognize -I DEWARPED -O TEXT -p <(echo '{"model": "frk+deu+Fraktur+Latin"}') -m mets.xml ``` ---- ### From lines to text: example ---- ### Observations - Decent progress on the image preprocessing stage - Severe problems with recognition of (complex) page structures - Acceptable text quality (i.e on par, sometimes even better, than ABBYY FineReader) - Running time (roughly 2 h) per book needs improvement (?) ---- ### Measurements on current GT - Tesseract vs. Ocropy OCR, different models - Tesseract vs. Ocropy{nlbin} vs. Olena{Kim/Wolf/Sauvola} binarization - binarization on page level vs. region level - impact of various preprocessors (deskewing, dewarping, clipping, resegmentation) - resegmentation vs. clipping on line level - segmentation accuracy of GT itself - deskewing on page level vs. region level - _dewarping on page level vs. 
line level_ (not covered here) ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] clip["CLIP{BLOCK}"] deskew["DESKEW{BLOCK}:tesserocr"] reseg[RESEGMENT] dew[DEWARP:ocropy] ocr[OCR:*] bin --> clip clip --> deskew deskew --> reseg reseg --> dew dew --> ocr ``` | OCR | CER[%] | | --- | ------ | | OCRO{fraktur} | 23.7 | | OCRO{fraktur(jze)} | 28.4 | | TESS{Fraktur} | 12.2 | | TESS{frk} | 11.9 | | TESS{frk+deu} | 11.5 | → layout/preprocessing still not good enough → Ocropy suffers from _Frakturwechsel_ (nearly no coverage of Antiqua/Arabic numerals, no way to mix models) ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] bin --> clip clip["CLIP{BLOCK}"] clip --> deskew deskew["DESKEW{BLOCK}:ocropy"] deskew --> reseg reseg["RESEG"] reseg --> dewarp dewarp["DEWARP:ocropy"] dewarp --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 23.2 | (-0.5 for s/tesserocr/ocropy/) | | OCRO{fraktur(jze)} | 28.0 | (-0.4 for s/tesserocr/ocropy/) | | TESS{Fraktur} | 12.1 | (-0.1 for s/tesserocr/ocropy/) | | TESS{frk} | 11.9 | (+-0 for s/tesserocr/ocropy/) | | TESS{frk+deu} | 11.4 | (-0.1 for s/tesserocr/ocropy/) | → Ocropy deskews slightly better than Tesseract (the latter being more conservative) ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] bin --> clip clip["CLIP{BLOCK}"] clip --> reseg reseg["RESEG"] reseg --> dewarp dewarp["DEWARP:ocropy"] dewarp --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 24.4 | (+1.2 for s//-DESKEW/) | | OCRO{fraktur(jze)} | 29.3 | (+1.3 for s//-DESKEW/) | | TESS{Fraktur} | 12.8 | (+0.7 for s//-DESKEW/) | | TESS{frk} | 12.6 | (+0.7 for s//-DESKEW/) | | TESS{frk+deu} | 12.2 | (+0.8 for s//-DESKEW/) | → Deskewing helps, but not that much (i.e. 
either GT images already have little skew, or dewarping can compensate) ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] bin --> clip clip["CLIP{BLOCK}"] clip --> deskew deskew["DESKEW{BLOCK}:tesserocr"] deskew --> reseg reseg["RESEG"] reseg --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 52.6 | (+28.2 for s//-DEWARP/) | | OCRO{fraktur(jze)} | 61.4 | (+32.1 for s//-DEWARP/) | | TESS{Fraktur} | 13.3 | (+ 0.5 for s//-DEWARP/) | | TESS{frk} | 13.2 | (+ 0.6 for s//-DEWARP/) | | TESS{frk+deu} | 12.8 | (+ 0.6 for s//-DEWARP/) | → Ocropy is very sensitive against warped images, Tesseract nearly immune ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] bin --> clip clip["CLIP{BLOCK}"] clip --> reseg reseg["RESEG"] reseg --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 53.2 | (+0.6 for s/-DEWARP/-DEWARP-DESKEW/) | | OCRO{fraktur(jze)} | 63.0 | (+1.6 for s/-DEWARP/-DEWARP-DESKEW/) | | TESS{Fraktur} | 13.5 | (+0.2 for s/-DEWARP/-DEWARP-DESKEW/) | | TESS{frk} | 13.3 | (+0.1 for s/-DEWARP/-DEWARP-DESKEW/) | | TESS{frk+deu} | 12.9 | (+0.1 for s/-DEWARP/-DEWARP-DESKEW/) | → Deskewing appearently cannot replace dewarping, i.e. either deskewing is too bad, or dewarping cannot compensate missing deskewing in the first place. ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] bin --> clip clip["CLIP{BLOCK}"] clip --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 56.8 | (+3.6 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) | | OCRO{fraktur(jze)} | 66.0 | (+3.0 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) | | TESS{Fraktur} | 13.6 | (+0.1 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) | | TESS{frk} | 13.7 | (+0.4 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) | | TESS{frk+deu} | 13.3 | (+0.4 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) | → Resegmentation primarily helps Ocropy, i.e. 
Tesseract is much less sensitive to invading as-/descenders from neighbouring lines ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] bin --> deskew deskew["DESKEW{PAGE}:tesserocr"] deskew --> clip clip["CLIP{BLOCK}"] clip --> reseg reseg["RESEG"] reseg --> dewarp dewarp["DEWARP:ocropy"] dewarp --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 24.6 | (+0.9 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) | | OCRO{fraktur(jze)} | 29.7 | (+1.3 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) | | TESS{Fraktur} | 13.4 | (+1.2 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) | | TESS{frk} | 13.2 | (+2.2 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) | | TESS{frk+deu} | 12.7 | (+1.2 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) | → Deskewing (on average) works slightly better on the region level than on the page level ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] bin --> den den["DENOISE{PAGE}:ocropy"] den --> deskew deskew["DESKEW{PAGE}:ocropy"] deskew --> clip clip["CLIP{BLOCK}"] clip --> reseg reseg["RESEG"] reseg --> dewarp dewarp["DEWARP:ocropy"] dewarp --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 19.1 | (-5.5 for s//DENOISE{PAGE}/) | | OCRO{fraktur(jze)} | 23.5 | (-6.2 for s//DENOISE{PAGE}/) | | TESS{Fraktur} | 11.7 | (-1.7 for s//DENOISE{PAGE}/) | | TESS{frk} | 11.8 | (-1.4 for s//DENOISE{PAGE}/) | | TESS{frk+deu} | 11.2 | (-1.5 for s//DENOISE{PAGE}/) | → Denoising can improve Wolf binarization, esp. 
for Ocropy ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] bin --> deskew deskew["DESKEW{BLOCK}:tesserocr"] deskew --> reseg reseg["RESEG"] reseg --> dewarp dewarp["DEWARP:ocropy"] dewarp --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 23.9 | (+0.2 for s//-CLIP{BLOCK}/) | | OCRO{fraktur(jze)} | 28.6 | (+0.2 for s//-CLIP{BLOCK}/) | | TESS{Fraktur} | 12.3 | (+0.1 for s//-CLIP{BLOCK}/) | | TESS{frk} | 12.3 | (+0.4 for s//-CLIP{BLOCK}/) | | TESS{frk+deu} | 11.9 | (+0.4 for s//-CLIP{BLOCK}/) | → Clipping on the region level gives very minimal improvement (if resegmentation is used) ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] bin --> clip clip["CLIP{BLOCK}"] clip --> deskew deskew["DESKEW{BLOCK}:tesserocr"] deskew --> dewarp dewarp["DEWARP:ocropy"] dewarp --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 31.2 | (+7.5 for s//-RESEG/) | | OCRO{fraktur(jze)} | 35.1 | (+6.7 for s//-RESEG/) | | TESS{Fraktur} | 13.2 | (+1.0 for s//-RESEG/) | | TESS{frk} | 12.7 | (+0.8 for s//-RESEG/) | | TESS{frk+deu} | 12.4 | (+0.9 for s//-RESEG/) | → Resegmentation primarily helps Ocropy, i.e. 
Tesseract is much less sensitive to invading as-/descenders from neighbouring lines → Resegmentation needs deskewing ---- ```mermaid graph LR bin["BIN{PAGE}:olena{wolf}"] bin --> clip clip["CLIP{BLOCK}"] clip --> deskew deskew["DESKEW{BLOCK}:tesserocr"] deskew --> clip2 clip2["CLIP{LINE}"] clip2 --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 24.1 | (+0.4 for s/RESEG/CLIP{LINE}/) | | OCRO{fraktur(jze)} | 28.9 | (+0.5 for s/RESEG/CLIP{LINE}/) | | TESS{Fraktur} | 12.9 | (+0.7 for s/RESEG/CLIP{LINE}/) | | TESS{frk} | 12.7 | (+0.8 for s/RESEG/CLIP{LINE}/) | | TESS{frk+deu} | 12.4 | (+0.9 for s/RESEG/CLIP{LINE}/) | → Clipping on the line level cannot quite replace resegmentation, with Tesseract it does not help at all ---- ```mermaid graph LR bin["BIN{PAGE}:olena{kim}"] bin --> clip clip["CLIP{BLOCK}"] clip --> deskew deskew["DESKEW{BLOCK}:tesserocr"] deskew --> reseg reseg["RESEG"] reseg --> dewarp dewarp["DEWARP:ocropy"] dewarp --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 24.9 | (+1.2 for s/wolf/kim/) | | OCRO{fraktur(jze)} | 29.6 | (+1.2 for s/wolf/kim/) | | TESS{Fraktur} | 14.0 | (+1.8 for s/wolf/kim/) | | TESS{frk} | 14.2 | (+2.3 for s/wolf/kim/) | | TESS{frk+deu} | 13.8 | (+2.3 for s/wolf/kim/) | → Kim is noticeably worse than Wolf (on average) ---- ```mermaid graph LR bin["BIN{PAGE}:olena{sauvola}"] bin --> clip clip["CLIP{BLOCK}"] clip --> deskew deskew["DESKEW{BLOCK}:tesserocr"] deskew --> reseg reseg["RESEG"] reseg --> dewarp dewarp["DEWARP:ocropy"] dewarp --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 23.0 | (-0.7 for s/wolf/sauvola/) | | OCRO{fraktur(jze)} | 27.9 | (-0.5 for s/wolf/sauvola/) | | TESS{Fraktur} | 12.0 | (-0.2 for s/wolf/sauvola/) | | TESS{frk} | 11.8 | (-0.1 for s/wolf/sauvola/) | | TESS{frk+deu} | 11.5 | (+-0 for s/wolf/sauvola/) | → Basic Sauvola is slightly better than Wolf (on 
average) ---- ```mermaid graph LR bin["BIN{PAGE}:olena{sauvola-ms-split}"] bin --> clip clip["CLIP{BLOCK}"] clip --> deskew deskew["DESKEW{BLOCK}:tesserocr"] deskew --> reseg reseg["RESEG"] reseg --> dewarp dewarp["DEWARP:ocropy"] dewarp --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 22.8 | (-0.9 for s/wolf/sauvola-ms-split/) | | OCRO{fraktur(jze)} | 27.6 | (-0.8 for s/wolf/sauvola-ms-split/) | | TESS{Fraktur} | 11.6 | (-0.6 for s/wolf/sauvola-ms-split/) | | TESS{frk} | 11.5 | (-0.4 for s/wolf/sauvola-ms-split/) | | TESS{frk+deu} | 11.1 | (-0.4 for s/wolf/sauvola-ms-split/) | → This Sauvola variant is even better (on average) ---- ```mermaid graph LR bin["BIN{PAGE}:ocropy{nlbin}"] bin --> clip clip["CLIP{BLOCK}"] clip --> deskew deskew["DESKEW{BLOCK}:tesserocr"] deskew --> reseg reseg["RESEG"] reseg --> dewarp dewarp["DEWARP:ocropy"] dewarp --> ocr ocr[OCR:*] ``` | OCR | CER[%] | comparison | | --- | ------ | ---------- | | OCRO{fraktur} | 36.4 | (+12.7 for s/wolf/ocropy/) | | OCRO{fraktur(jze)} | 41.1 | (+12.7 for s/wolf/ocropy/) | | TESS{Fraktur} | 15.7 | (+ 3.5 for s/wolf/ocropy/) | | TESS{frk} | 14.9 | (+ 3.0 for s/wolf/ocropy/) | | TESS{frk+deu} | 14.8 | (+ 3.3 for s/wolf/ocropy/) | → Ocropy{nlbin} is really bad! (but what about `perc` / `range` / `threshold` / `lo` / `hi` parameters?) 
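----

The comparison column in these tables is the difference in CER (in percentage points) between a variant configuration and a reference configuration. As a minimal sketch, the following reproduces the `s/wolf/ocropy/` column from the CER values shown on the slides. The names `cer_delta` and `format_delta` are illustrative helpers, not part of any OCR-D tool.

```python
# CER[%] values copied from the slides: the reference pipeline with
# BIN{PAGE}:olena{wolf} and the same pipeline with BIN{PAGE}:ocropy{nlbin}.
WOLF = {
    "OCRO{fraktur}": 23.7,
    "OCRO{fraktur(jze)}": 28.4,
    "TESS{Fraktur}": 12.2,
    "TESS{frk}": 11.9,
    "TESS{frk+deu}": 11.5,
}
NLBIN = {
    "OCRO{fraktur}": 36.4,
    "OCRO{fraktur(jze)}": 41.1,
    "TESS{Fraktur}": 15.7,
    "TESS{frk}": 14.9,
    "TESS{frk+deu}": 14.8,
}


def cer_delta(variant, reference):
    """Per-model CER difference in percentage points (hypothetical helper)."""
    return {
        model: round(variant[model] - reference[model], 1)
        for model in reference
    }


def format_delta(delta):
    """Render a delta the way the comparison column does, e.g. +12.7 or -0.5."""
    return "±0" if delta == 0 else f"{delta:+.1f}"


if __name__ == "__main__":
    for model, delta in cer_delta(NLBIN, WOLF).items():
        print(f"{model:20s} {format_delta(delta)} for s/wolf/ocropy/")
```

A positive delta means the variant is worse than the reference, a negative delta means it is better.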
----

```mermaid
graph LR
clip["CLIP{BLOCK}"]
clip --> bin
bin["BIN{BLOCK}:ocropy{nlbin}"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 23.0 | (-13.4 for s/BIN{PAGE}/BIN{BLOCK}/) |
| OCRO{fraktur(jze)} | 27.4 | (-13.7 for s/BIN{PAGE}/BIN{BLOCK}/) |
| TESS{Fraktur} | 11.5 | (-4.2 for s/BIN{PAGE}/BIN{BLOCK}/) |
| TESS{frk} | 11.4 | (-3.5 for s/BIN{PAGE}/BIN{BLOCK}/) |
| TESS{frk+deu} | 11.2 | (-3.6 for s/BIN{PAGE}/BIN{BLOCK}/) |

→ Binarization on the region level is superior to the page level (if using Ocropy{nlbin}!)

note:
- clipping is neutral here – it does ad-hoc binarization (with Ocropy{nlbin})
- to be conclusive, the Olena variants should be run like this as well (which requires bashlib access to `AlternativeImage` on the region level)

----

```mermaid
graph LR
clip["CLIP{BLOCK}"]
clip --> bin
bin["BIN{BLOCK}:tesserocr"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 55.8 | (+32.8 for s/ocropy{nlbin}/tesserocr/) |
| OCRO{fraktur(jze)} | 63.5 | (+36.1 for s/ocropy{nlbin}/tesserocr/) |
| TESS{Fraktur} | 15.0 | (+3.5 for s/ocropy{nlbin}/tesserocr/) |
| TESS{frk} | 15.3 | (+3.9 for s/ocropy{nlbin}/tesserocr/) |
| TESS{frk+deu} | 14.8 | (+3.6 for s/ocropy{nlbin}/tesserocr/) |

→ Ocropy cannot cope with Tesseract binarization.

→ But even Tesseract prefers Ocropy binarization!

note:
- to be conclusive, Tesseract binarization should be run on the page level for comparison, but this is impossible with its C API

----

```mermaid
graph LR
bin["BIN{BLOCK}:ocropy{nlbin}"]
bin --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 55.4 | (+32.4 for s//-CLIP-DESKEW-RESEG-DEWARP/)^1^ |
| OCRO{fraktur(jze)} | 64.6 | (+37.2 for s//-CLIP-DESKEW-RESEG-DEWARP/)^1^ |
| TESS{Fraktur} | 12.9 | (+1.4 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk} | 12.9 | (+1.5 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk+deu} | 12.7 | (+1.5 for s//-CLIP-DESKEW-RESEG-DEWARP/) |

→ Without the extra preprocessors, Ocropy is lost on GT.

^1^: "naïve" configuration of Ocropy

----

```mermaid
graph LR
bin["BIN{BLOCK}:tesserocr"]
bin --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 66.4 | (+10.4 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| OCRO{fraktur(jze)} | 72.4 | (+8.9 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{Fraktur} | 16.2 | (+1.2 for s//-CLIP-DESKEW-RESEG-DEWARP/)^2^ |
| TESS{frk} | 16.9 | (+1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/)^2^ |
| TESS{frk+deu} | 16.4 | (+1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/)^2^ |

→ Without the extra preprocessors, Tesseract still works.

^2^: "naïve" configuration of Tesseract (but how does this compare to the CLI?)
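----

The CER[%] figures in these tables are character error rates. As a rough sketch of what such a figure measures, the following assumes plain Levenshtein distance over characters, normalized by the ground-truth length and aggregated over all lines (rather than averaged per line). The evaluation processors listed earlier may align or normalize text differently, so this is not a description of how the numbers on the slides were produced; the function names and the sample strings are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (char_a != char_b),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer_percent(pairs) -> float:
    """Aggregate CER in percent over (ground_truth, ocr) line pairs.

    Errors and ground-truth characters are summed over all lines first,
    so long lines weigh more than short ones.
    """
    pairs = list(pairs)
    errors = sum(levenshtein(gt, ocr) for gt, ocr in pairs)
    chars = sum(len(gt) for gt, _ in pairs)
    if chars == 0:
        raise ValueError("ground truth contains no characters")
    return 100.0 * errors / chars


if __name__ == "__main__":
    # Hypothetical line pairs, not taken from the GT used in the slides.
    lines = [("Fraktur", "Frattur"), ("Zeile", "Zeile")]
    print(f"CER: {cer_percent(lines):.1f}%")
```

Summing before dividing is one way to handle the aggregation over multiple pages mentioned under "Logging as a result"; a per-line or per-page mean would give a different number.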
----

```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> ocr
ocr[OCR:*]
```

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 56.8 | (+33.1 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| OCRO{fraktur(jze)} | 66.0 | (+37.6 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{Fraktur} | 13.6 | (+1.4 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk} | 13.5 | (+1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk+deu} | 13.3 | (+1.8 for s//-CLIP-DESKEW-RESEG-DEWARP/) |

→ Again, Ocropy requires the extra preprocessors, while Tesseract is surprisingly insensitive to invading neighbouring regions/lines and to skewed and warped images.

----

### Measurements on GT4HistOCR

- directory / file name suffix structure instead of METS/PAGE
- pre-built (fixed) preprocessing with less effort (no polygons, no clipping / resegmentation / dewarping)
- automatic segmentation by alignment instead of manual segmentation
- representative?
- large enough for training (OCR, COR)

----

```mermaid
graph LR
bin["BIN{PAGE}:ocropy{nlbin}"]
deskew["DESKEW{PAGE}:ocropy{nlbin}"]
lseg["SEG{LINE}:ocropy{gpageseg}"]
bin --> deskew
deskew --> lseg
lseg --> ocr
ocr[OCR:*]
style bin fill:#ccc
style deskew fill:#ccc
style lseg fill:#ccc
```

<!-- .slide svg: width="80px"; height="10px"; -->

| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 8.33 | (-14.5 for s/OCR-D/4HistOCR/) |
| OCRO{fraktur(jze)} | 6.27 | (-21.3 for s/OCR-D/4HistOCR/) |
| TESS{Fraktur} | 8.33 | (-3.3 for s/OCR-D/4HistOCR/) |
| TESS{frk} | ? | (- ? for s/OCR-D/4HistOCR/) |
| TESS{frk+deu} | ? | (- ? for s/OCR-D/4HistOCR/) |

→ Much better than best measured OCR-D configuration!

----

### Striving for best-practice configurations

- Increasing number of processors
  - DFKI preprocessing and layout analysis
  - Würzburg layout analysis
  - Leipzig and Munich post correction
- Increasing number of models
  - Trainable processors?
  - Erlangen font detection + Leipzig training tools
  - OCR-D GT
- Increasing number of possible configurations
  - Recommendations for users are needed!
  - (Tools and workflows for text-based evaluation)
  - Tools and workflows for layout evaluation

---

## Wishlist

- [ ] improve quality/consistency of existing processors (`AlternativeImage`, DPI, packaging, logging, documentation)
- [ ] OCR-D wrapper for PRImA Layout Evaluation (profiles)
- [ ] OCR-D wrapper for ScanTailor
- [ ] OCR-D wrapper for [page-level dewarping with Leptonica](https://tpgit.github.io/UnOfficialLeptDocs/leptonica/dewarping.html)
