<style>
/* reduce from default 48px: */
.reveal {
font-size: 24px;
text-align: left;
}
.reveal .slides {
text-align: left;
}
/* change from default gray-on-black: */
.hljs {
color: #005;
background: #fff;
}
/* prevent invisible fragments from occupying space: */
.fragment.visible:not(.current-fragment) {
display: none;
height:0px;
line-height: 0px;
font-size: 0px;
}
/* increase font size in diagrams: */
.label {
font-size: 24px;
font-weight: bold;
}
/* increase maximum width of code blocks: */
.reveal pre code {
max-width: 1000px;
max-height: 1000px;
}
/* remove black border from images: */
.reveal section img {
border: 0;
}
.reveal pre.mermaid {
width: 100% !important;
}
.reveal svg {
max-height: 600px;
}
.reveal .scaled-flowchart-td pre.mermaid {
width: 100% !important;
/* why? float: left; */
}
.reveal .scaled-flowchart-td svg {
max-width: 100% !important;
}
.reveal .scaled-flowchart-td svg g.node,
.reveal .scaled-flowchart-td svg g.label,
.reveal .scaled-flowchart-td svg foreignObject {
width: 100% !important;
}
.reveal .scaled-flowchart-td p {
clear:both;
}
.reveal .centered {
text-align: center
}
.reveal .width75 {
max-width: 75%;
}
</style>
# Flexible OCR workflows with OCR-D <!-- .element: class="centered width75" -->
Robert Sachunsky, Kay-Michael Würzner <!-- .element: class="centered width75" -->
---
## Contents
- Workspaces
- Processors
- Integration
- Workflow
- Wishlist
---
## Workspaces
- *Physical* representation of a METS file
- Directory with subdirectories for each file group
- Each subdirectory contains the listed files
- Adding and removing files explicitly via `ocrd workspace` command
- Adding files implicitly via `ocrd process` command
- Using the output file group parameter `-O`
- Cloning remote “workspaces” (i.e. METS files)
- Access to millions of digitized books!
```shell
ocrd workspace clone \
https://digital.slub-dresden.de/data/kitodo/gottgott_38213401X/gottgott_38213401X_mets.xml .
```
---
## Processors
- Representation of OCR-related operations as *processors*
- Operate within an OCR-D workspace
- Input and output definition via METS file groups
- Parameter specification via JSON
- Invocation via
- Individual CLI (e.g. `ocrd-anybaseocr-crop`)
- Meta-processor `ocrd process`
(concatenated + validated invocation of multiple processors within a single call)
- Existing workflow-fit processors:
- Ocropy-based ...
- Tesseract-based ...
- Olena-based binarization ...
- Module Projects ...
----
### Ocropy-based processors
- Many OCR-related operations available
- Not cleanly wrapped
- Not separately available
- Creating processors for
- Binarization (on page/region/line level)
- Deskewing (on page/region level)
- Dewarping (on line level)
- *Clipping* (on region/line level)
- *Resegmentation* (on line level)
- Ultimate goal: Ocropy as an API
- Code clean-up and improvement ...
- New operations ...
- Restructuring into `ocrolib` and processors ...
----
### Ocropy-based processors
#### Code clean-up and improvement
- move common Ocropy functions into `common`
(from CLIs, from `OLD/`), add new ones:
- `PIL.Image` vs `np.ndarray` conversions: **`array2pil`**, **`pil2array`**
(integer vs normed float)
- plausibility checks: **`check_page`**, **`check_region`**, **`check_line`**
(but mix absolute and relative bounds, and make DPI-zoomable)
- `nlbin` deskewing: **`estimate_skew_angle`**, **`estimate_skew`**
(but resize when rotating, with minimum on variance drop)
- `nlbin` binarization: **`estimate_local_whitelevel`**, **`estimate_thresholds`**, **`binarize`**
(but keep exact pixel size, catch NaN)
- remove connected components only contained in the margins: **`borderclean`**
----
### Ocropy-based processors
#### Code clean-up and improvement
- dissect and improve segmentation:
- **`remove_hlines`**: add height threshold, reduce default width threshold
- **`compute_separators_morph`**: reduce thresholds (because black colseps can be discontinuous), use only connected components fully inside zone
- **`compute_gradmaps`**: reduce boxmap minsize (for chopped lines at margins), reduce horizontal blur (to avoid joining lines via as-/descenders)
- **`compute_line_seeds`**: more robust top/bottom projection rules and no horizontal blur (to avoid joining lines via as-/descenders)
- **`hmerge_line_seeds`**: new way to ensure horizontal label consistency
- **`compute_segmentation`**:
- `fullpage` switch (regions do not have hlines and colseps)
- `zoom` parameter (thresholds must be DPI-relative)
- use twice the estimated `scale` (blackletter has huge capitals and dense as-/descenders – avoid splitting lines)
- before spreading line seeds, assign unlabelled connected components to their majority seed (instead of splitting)
----
### Ocropy-based processors
#### New operation: *Resegmentation*
observation 1:
~ Ocropy dewarping and recognition is very sensitive to connected components intruding from neighbouring lines (e.g. ascenders and descenders)
observation 2:
~ GT line segmentation is very coarse (only bounding boxes, large overlap)
idea:
~ use Ocropy line segmentation to improve GT line segmentation via label majority rule, then annotate _shrunken polygon_
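The label majority rule can be illustrated with a minimal sketch. It assumes the Ocropy line segmentation is available as a 2D label grid (0 = background) and the GT line as a coarse bounding box; `majority_label` is a hypothetical helper, not a function of Ocropy or OCR-D.

```python
from collections import Counter

def majority_label(labels, box):
    """Return the most frequent non-zero label inside box (x0, y0, x1, y1), or 0."""
    x0, y0, x1, y1 = box
    counts = Counter(
        labels[y][x]
        for y in range(y0, y1)
        for x in range(x0, x1)
        if labels[y][x] != 0)
    return counts.most_common(1)[0][0] if counts else 0

# toy label image: line 1 on top, line 2 below,
# with a descender of line 1 reaching into row 2
labels = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
]
# coarse GT box of the second line also covers the descender row
gt_box = (0, 2, 4, 5)
winner = majority_label(labels, gt_box)
# pixels carrying the winning label are the basis for the shrunken polygon
mask = [[labels[y][x] == winner for x in range(4)] for y in range(5)]
```

The intruding descender keeps label 1 and therefore falls outside the mask of the second line.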
----
### Ocropy-based processors
#### New operation: *Resegmentation*
<div>
![](https://user-images.githubusercontent.com/38561704/60338418-73978f00-9995-11e9-8fd3-4c149e2df266.png)
![](https://user-images.githubusercontent.com/38561704/60338419-74302580-9995-11e9-9fce-a1eb1fe44e2f.png)
![](https://user-images.githubusercontent.com/38561704/60338421-74302580-9995-11e9-9997-0165ef75833b.png)
![](https://user-images.githubusercontent.com/38561704/60338422-74302580-9995-11e9-9a47-452557fec061.png)
![](https://user-images.githubusercontent.com/38561704/60624848-d940ac80-9dd5-11e9-95a9-2074c2e0182a.png)
</div>
<!-- .element: class="fragment" data-fragment-index="0" -->
![](https://user-images.githubusercontent.com/38561704/61275373-5a684e00-a79d-11e9-82f7-ae40ad8d02ae.png)
<!-- .element: class="fragment" data-fragment-index="1" -->
----
### Ocropy-based processors
#### New operation: *Resegmentation*
![](https://user-images.githubusercontent.com/38561704/61275456-8daadd00-a79d-11e9-91d9-5365c047403a.png)
<!-- .element: class="fragment" data-fragment-index="0" -->
![](https://user-images.githubusercontent.com/38561704/61275578-cc409780-a79d-11e9-8845-762f5df66488.png)
<!-- .element: class="fragment" data-fragment-index="1" -->
![](https://user-images.githubusercontent.com/38561704/61275670-f7c38200-a79d-11e9-9427-5d4cbe7733d6.png)
<!-- .element: class="fragment" data-fragment-index="2" -->
![](https://user-images.githubusercontent.com/38561704/61275855-57ba2880-a79e-11e9-91be-f7c531ac8d15.png)
<!-- .element: class="fragment" data-fragment-index="3" -->
![](https://user-images.githubusercontent.com/38561704/61275459-8daadd00-a79d-11e9-940e-650d15331273.png)
<!-- .element: class="fragment" data-fragment-index="4" -->
![](https://user-images.githubusercontent.com/38561704/61275580-cc409780-a79d-11e9-819b-f8e009c70b93.png)
<!-- .element: class="fragment" data-fragment-index="5" -->
![](https://user-images.githubusercontent.com/38561704/61275671-f7c38200-a79d-11e9-9daa-386ca7b72109.png)
<!-- .element: class="fragment" data-fragment-index="6" -->
![](https://user-images.githubusercontent.com/38561704/61275857-57ba2880-a79e-11e9-922c-b25122c3dd13.png)
<!-- .element: class="fragment" data-fragment-index="7" -->
![](https://user-images.githubusercontent.com/38561704/61275460-8daadd00-a79d-11e9-87be-d6e8a193d38a.png)
<!-- .element: class="fragment" data-fragment-index="8" -->
![](https://user-images.githubusercontent.com/38561704/61275581-ccd92e00-a79d-11e9-9be6-7d3260512ad3.png)
<!-- .element: class="fragment" data-fragment-index="9" -->
![](https://user-images.githubusercontent.com/38561704/61275672-f7c38200-a79d-11e9-9d65-7d8c7b74e88f.png)
<!-- .element: class="fragment" data-fragment-index="10" -->
![](https://user-images.githubusercontent.com/38561704/61275859-5852bf00-a79e-11e9-90d7-ad5abc3cc1c4.png)
<!-- .element: class="fragment" data-fragment-index="11" -->
![](https://user-images.githubusercontent.com/38561704/61275461-8daadd00-a79d-11e9-8b89-b67a4d6dab14.png)
<!-- .element: class="fragment" data-fragment-index="12" -->
![](https://user-images.githubusercontent.com/38561704/61275582-ccd92e00-a79d-11e9-8a40-95c270433154.png)
<!-- .element: class="fragment" data-fragment-index="13" -->
![](https://user-images.githubusercontent.com/38561704/61275674-f7c38200-a79d-11e9-8e1a-de54e1428fb3.png)
<!-- .element: class="fragment" data-fragment-index="14" -->
![](https://user-images.githubusercontent.com/38561704/61275861-5852bf00-a79e-11e9-9a45-aba95d8bc09b.png)
<!-- .element: class="fragment" data-fragment-index="15" -->
![](https://user-images.githubusercontent.com/38561704/61275462-8e437380-a79d-11e9-9a9d-1178a7860a49.png)
<!-- .element: class="fragment" data-fragment-index="16" -->
![](https://user-images.githubusercontent.com/38561704/61275583-ccd92e00-a79d-11e9-80f6-aadeb5f7bbc7.png)
<!-- .element: class="fragment" data-fragment-index="17" -->
![](https://user-images.githubusercontent.com/38561704/61275676-f85c1880-a79d-11e9-9f29-03ee2c95f213.png)
<!-- .element: class="fragment" data-fragment-index="18" -->
![](https://user-images.githubusercontent.com/38561704/61275863-5852bf00-a79e-11e9-85e5-7f9e3c04e5bb.png)
<!-- .element: class="fragment" data-fragment-index="19" -->
----
### Ocropy-based processors
#### New operation: *Resegmentation*
![](https://i.imgur.com/rTyX1xP.png)
----
### Ocropy-based processors
#### New operation: *Clipping*
observation 1:
~ on GT, both regions and lines often overlap with their neighbouring regions and lines – not just in the background, but within connected components
idea 1:
~ remove connected components that are not fully contained in the segment but in a neighbour
observation 2:
~ many frequent cases of this will create interior islands or non-contiguous polygons (not allowed in PAGE-XML, usually not supported by implementations)
idea 2:
~ do not remove by shrinking the polygon, but by _clipping_ to the background colour
note:
- can be used to suppress graphics or separators within or across a region or line
- can be used as an alternative to resegmentation (on the line level)
- cannot be applied if the segment already has `AlternativeImage` or `@orientation` (segments and neighbours become incommensurable)
- runs best after binarization
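A minimal sketch of clipping on a binarized image, under simplifying assumptions: segments are axis-aligned boxes instead of polygons, foreground is 1, background is 0, and connectivity is 4-neighbourhood. All function names are hypothetical and only illustrate the idea; this is not the code of `ocrd-cis-ocropy-clip`.

```python
from collections import deque

def components(image):
    """Yield 4-connected foreground components as sets of (y, x)."""
    seen = set()
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if not value or (y, x) in seen:
                continue
            comp, queue = set(), deque([(y, x)])
            seen.add((y, x))
            while queue:
                cy, cx = queue.popleft()
                comp.add((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < len(image) and 0 <= nx < len(image[0])
                            and image[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            yield comp

def in_box(point, box):
    """True if pixel (y, x) lies inside box (x0, y0, x1, y1)."""
    (y, x), (x0, y0, x1, y1) = point, box
    return x0 <= x < x1 and y0 <= y < y1

def clip(image, segment, neighbour, background=0):
    """Copy of image as seen by segment, with intruding components blanked."""
    result = [list(row) for row in image]
    for comp in components(image):
        inside_segment = all(in_box(p, segment) for p in comp)
        inside_neighbour = all(in_box(p, neighbour) for p in comp)
        if not inside_segment and inside_neighbour:
            for y, x in comp:
                if in_box((y, x), segment):
                    result[y][x] = background  # clip, do not reshape the outline
    return result

image = [
    [1, 1, 0, 0],  # line A
    [0, 0, 0, 1],  # ascender of line B reaching into the box of A
    [0, 1, 1, 1],  # line B
]
segment = (0, 0, 4, 2)    # box of line A (rows 0-1)
neighbour = (0, 1, 4, 3)  # overlapping box of line B (rows 1-2)
clipped = clip(image, segment, neighbour)
```

The outline of line A stays untouched; only its derived image changes, which is what would be stored as `AlternativeImage`.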
----
### Ocropy-based processors
#### New operation: *Clipping*
![](https://user-images.githubusercontent.com/38561704/61275129-c7c7af00-a79c-11e9-8752-dc5051773d1f.png)
![](https://user-images.githubusercontent.com/38561704/61276541-c64bb600-a79f-11e9-857a-30310270b5fe.png)
<!-- .element: class="fragment" data-fragment-index="1" -->
![](https://user-images.githubusercontent.com/38561704/61276557-d06db480-a79f-11e9-92da-abbe206e4e37.png)
<!-- .element: class="fragment" data-fragment-index="2" -->
----
### Ocropy-based processors
#### New operation: *Clipping*
- problem: regions overlapping for no good reason
e.g. from Tesseract
![](https://i.imgur.com/001fg8v.jpg =350x)
----
### Ocropy-based processors
#### Planned restructuring
- `tmbdev/ocropy`, forked under `OCR-D/ocropy`:
- move non-UI functions from CLIs into `ocrolib`
- package `ocrolib` under PyPI _ocrolib_
- package CLIs under PyPI _ocropus_
- add our `ocrolib` changes (one by one):
- Python 3 port
- additional `ocrolib.common` functions
- improvements in segmentation
- try to get upstream approval
- our `OCR-D/ocrd_ocropus`:
- only for OCR-D wrappers
- base on new `ocrolib`
→ **Ocropy as API is under way!**
----
### Tesseract-based processors
- Many OCR-related operations available
- Mostly available via API
- Exposed to Python via `tesserocr`
- Often more robust segmentation and text recognition than Ocropy
- Creating Processors for
- Binarization (on region/line level)
- *Cropping* ...
- Segmentation (on page/region/line level) ...
- Deskewing (on page/region level) ...
- Text recognition (on line/word/glyph level)
----
### Tesseract-based processors
#### New processor: Poor-man's cropping
- Cropping not implemented as separate function in Tesseract
- Idea: minimal rectangle around all detected regions as `Border`
- Side effect: quality improvement upon repeated region detection within `Border`
- Problems:
- Facing pages
- Empty or sparsely filled pages
- Robustness (i.e. works well for most but really badly for some pages)
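The cropping idea boils down to a bounding-box union. A minimal sketch, assuming the detected regions are already available as `(x0, y0, x1, y1)` boxes (the helper name and the sample values are made up for illustration):

```python
def border_from_regions(boxes):
    """Minimal rectangle (x0, y0, x1, y1) enclosing all region boxes, or None."""
    if not boxes:
        return None  # empty page: no regions, so no Border can be derived
    x0s, y0s, x1s, y1s = zip(*boxes)
    return min(x0s), min(y0s), max(x1s), max(y1s)

regions = [(120, 80, 900, 300), (110, 320, 910, 1400), (400, 1430, 600, 1460)]
border = border_from_regions(regions)
```

A single spurious region near the page edge is enough to blow up the rectangle, which is one way the robustness problem shows up.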
----
### Tesseract-based processors
#### Basal segment classification
- Distinction of different region types in Tesseract
- Text
- Image
- Separator
- Table
- ...
- Mapped to (coarser) PAGE classification
- Switches:
- `crop_polygons` (rectangles or polygons?)
- `find_tables` (tables as tables?)
----
### Tesseract-based processors
#### Basal segment classification
![](https://i.imgur.com/zfvx5iy.jpg =700x)
----
### Tesseract-based processors
#### Deskewing and orientation
- 2 distinct APIs in Tesseract:
| | `DetectOrientationScript()` | `AnalyseLayout()` + `Orientation()` |
| - | --- | --- |
| confidence | yes | no |
| orientation | yes | yes |
| script | yes | no |
| deskewing | no | yes |
| textline order | no | yes |
| reading direction | no | yes |
→ can yield **contradictory results**!
----
### Tesseract-based processors
#### Deskewing and orientation
- resolve conflicts:
1. accept _script_ results from OSD
(if very confident)
2. accept _orientation_ results from OSD
(if very confident)
3. ignore _orientation_ results from AL
(but warn if contradictory)
4. apply _deskewing_ results from AL
(adding angle to orientation)
5. apply _order/direction_ results from AL
- available on _page_ and _region_ level
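The five rules can be sketched as a small decision function. This is an illustration only: the dictionary keys, the confidence threshold and the fallback to orientation 0 when OSD is not confident are assumptions, and sign conventions for angles are ignored.

```python
import logging

def resolve(osd, al, min_conf=10.0):
    """Combine OSD and layout analysis (AL) results.

    osd: dict with 'script', 'script_conf', 'orientation', 'orientation_conf'
    al:  dict with 'orientation', 'deskew_angle', 'textline_order',
         'writing_direction'
    """
    result = {}
    if osd['script_conf'] >= min_conf:               # rule 1
        result['script'] = osd['script']
    orientation = 0                                  # assumed fallback
    if osd['orientation_conf'] >= min_conf:          # rule 2
        orientation = osd['orientation']
        if al['orientation'] != orientation:         # rule 3: warn only
            logging.warning("orientation conflict: OSD=%s, AL=%s",
                            orientation, al['orientation'])
    result['angle'] = orientation + al['deskew_angle']        # rule 4
    result['textline_order'] = al['textline_order']           # rule 5
    result['writing_direction'] = al['writing_direction']
    return result

osd = {'script': 'Fraktur', 'script_conf': 25.0,
       'orientation': 90, 'orientation_conf': 18.0}
al = {'orientation': 0, 'deskew_angle': 1.5,
      'textline_order': 'top-to-bottom',
      'writing_direction': 'left-to-right'}
result = resolve(osd, al)
```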
----
### Olena-based binarization
- Binarization is still relevant!
- No RGB-processing recognition engine(s) available
- High influence on OLR and OCR results
- Multiple binarization implementations in Olena
- *Kim*, *Niblack*, *Wolf* ...
→ expose as parameter
- Only as CLIs
→ `ocrd bashlib` as last resort
- PAGE processing (but only on page level)
- `AlternativeImage` support (as case study in `xmlstarlet`)
----
### Olena-based binarization
| Original | Tesseract |
:---------:|:-----------:
|![](https://i.imgur.com/MbUwqjK.png)|![](https://i.imgur.com/NYBcm0g.png)|
| Ocropy | Olena-Wolf |
:------------:|:--------------:
![](https://i.imgur.com/rQluIkV.png)|![](https://i.imgur.com/QkjuOw0.png)|
----
### Module Project-based processors
- Few are workflow-fit
- Interface compatibility
- Result delivery
- Insufficient documentation
- Short `README`s
- Missing integration examples
- Missing training facilities
- Low visibility
----
### Module Project-based processors
#### `ocrd-anybaseocr-crop`
- Interface-compatible page border detection
- Collaborative effort between module project and coordination project
- Extremely important preprocessing step (due to DFG requirements on digitization)
- Not yet implemented for other `anybaseocr`-based processors
- Very, very promising results
- Input: image
- Output: `Border` element with coordinates
- Not yet `AlternativeImage`-sensitive
----
### Module Project-based processors
#### `ocrd-anybaseocr-crop`: example
| Tesseract | DFKI |
:-----------------------------------:|:-----------------------------------:
![](https://i.imgur.com/k3VbSyj.png) | ![](https://i.imgur.com/6lsluzE.png)|
---
## Integration
- APIs for PAGE and METS
- Delivery of results in a comfortable and interoperable way
- No need to directly modify XML files
- Use PAGE for
- Page-, region-, line-, word- and glyph-level results
- Descriptive and binary (`AlternativeImage`) results
- Use METS for
- Document-level results
- Specific aspects:
- DPI relativity ...
- Description vs. image ...
(`AlternativeImage`)
- Multiple inputs or outputs ...
- Logging as a result ...
----
### DPI relativity
- Most OLR operations are sensitive to image resolution:
e.g. minimum/maximum segment size in pixels
- Some tools are DPI-aware
e.g. Tesseract:
```
Warning: Invalid resolution 0 dpi. Using 70 instead.
```
```python
tessapi.SetVariable('user_defined_dpi', str(200))
```
- Some implementations expect a certain value
e.g. Ocropy: 300 DPI
- We (usually) know DPI from metadata/tags:
```python
info = OcrdExif(pil_image)
if info.resolution != 1:
    # tag available
    dpi = info.resolution
    if info.resolutionUnit == 'cm':
        dpi = round(dpi * 2.54)
...
```
----
### DPI relativity
- OLR processors that are DPI-aware already: _pass_ DPI from **`OcrdExif`**
e.g. Tesseract wrappers
- OLR processors that are not DPI-aware yet: _modify_ to **zoom**:
1. determine factor between expected and actual DPI
2. multiply all relevant constants in the code
e.g. Ocropy wrappers
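The two zoom steps can be sketched as follows. The expected value of 300 DPI is the one named for Ocropy; the constants `20` and `10` and the helper names are hypothetical.

```python
EXPECTED_DPI = 300  # resolution the pixel thresholds were tuned for

def zoom_factor(actual_dpi, expected_dpi=EXPECTED_DPI):
    """Step 1: factor between actual and expected resolution."""
    return actual_dpi / expected_dpi

def scaled(constant_px, zoom):
    """Step 2: scale a pixel constant, never dropping below one pixel."""
    return max(1, round(constant_px * zoom))

zoom = zoom_factor(600)
min_line_height = scaled(20, zoom)  # hypothetical threshold, tuned at 300 DPI
max_sep_width = scaled(10, zoom)    # hypothetical threshold, tuned at 300 DPI
```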
----
### Description vs. image
- PAGE: hierarchy of elements, _each_ with descriptive and binary content
- **original** `/PcGts/Page/@imageFilename`
- **derived** `//AlternativeImage/@filename`
- Preprocessing steps: either
1. producing derived images _obligatory_:
binarization, despeckling, dewarping
2. just describing the operation (images _optional_):
cropping/segmentation (`Coords/@points`), deskewing (`@orientation`)
operation must be _applied_ at some point – preferably when descending to a lower hierarchy level (i.e. during segmentation)
- Consumers: must
- _respect_ `AlternativeImage` if present on their hierarchy level of interest
- _generate_ an image from the parent otherwise (which again could have `AlternativeImage`) – by:
- **cutting** from `Coords/@points`
- **rotating** by `(Page|TextRegion)/@orientation`
----
### Description vs. image
#### AlternativeImage problems and solutions
- bounding box rectangles are too coarse (esp. in the presence of skew)
→ **polygon** coordinates must always be preserved fully (using polygon masking instead of simple cutting)
- **coordinates are absolute** (they reference the _original_ image)
→ before cutting from the parent image, coordinates must be converted _to relative_
→ before adding new child elements, coordinates must be converted _from relative_
→ offsets must always be passed down the hierarchy
- not all image operations retain pixel positions:
1. deskewing → annotate `@orientation`, _but_:
- rotation generally increases the binary image at the margins
→ compensate by **additional offset** (half the increase in size)
- rotation applies around the center of the image, not the origin
→ coordinates must be **translated to center**, rotated passively, then translated back
2. dewarping → use `Grid`?
3. rescaling → extend PAGE-XML with `@scale`?
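The deskewing case can be made concrete with a small sketch. It assumes counter-clockwise rotation by `angle` degrees with an expanded canvas (as with `PIL.Image.rotate(angle, expand=True)`) and image coordinates with the y axis pointing down; `rotate_point` is a hypothetical helper, not the `OCR-D/core` implementation.

```python
import math

def rotate_point(point, angle, size):
    """Map (x, y) from an image of size (w, h) into the rotated, expanded image."""
    x, y = point
    w, h = size
    a = math.radians(angle)
    cos, sin = math.cos(a), math.sin(a)
    # size of the expanded canvas after rotation
    new_w = abs(w * cos) + abs(h * sin)
    new_h = abs(w * sin) + abs(h * cos)
    # translate to the center of the original image
    dx, dy = x - w / 2, y - h / 2
    # rotate counter-clockwise (y axis points down)
    rx = dx * cos + dy * sin
    ry = -dx * sin + dy * cos
    # translate back, now to the center of the larger canvas:
    # this adds half the increase in size as additional offset
    return rx + new_w / 2, ry + new_h / 2

# right-middle edge of a 100x50 image ends up at the top-middle after 90 degrees
x, y = rotate_point((100, 25), 90, (100, 50))
# the center stays the center
cx, cy = rotate_point((50, 25), 90, (100, 50))
```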
----
### Description vs. image
#### Summary (slightly more abstract)
- coordinates must be **reproducible**:
> Annotation must be sufficient to calculate pixel positions in `AlternativeImage` from those in `@imageFilename`
> (e.g. to cut parents) or vice versa (e.g. to add children).
- image preprocessing steps which alter the coordinate system must describe their transform appropriately:
- linear coordinate transformations (translation/offset, rotation/angle, scale) can be made exact (up to rounding)
- non-linear transformations (dewarping) are inexact ...
----
### Description vs. image
#### Remaining issues
- Strictness of **`@comments`** classification:
- multiple `AlternativeImage` entries
→ rely on `@comments`, or always append/choose last?
- if `AlternativeImage` _and_ `@orientation`
→ rely on `@comments`, or expect the image to be deskewed already?
- if `AlternativeImage` _and_ `Border`
→ rely on `@comments`, or expect the image to be cropped already?
- Reproducibility:
- if `AlternativeImage` is larger than `Coords/@points` rectangle
→ due to rotation (offset) or rescaling (zoom) or both!
→ introduce `@scale` or prohibit rescaling altogether?
- (page/line-level) dewarping
→ use `Grid`?
----
### Description vs. image
#### Implementation in `OCR-D/core`
##### High-level API:
- image/offset recursion:
- **`Workspace.image_from_page`**: on the page level (original or derived)
- **`Workspace.image_from_segment`**: all levels below page (derived)
<div>
what it does:
- get last `AlternativeImage` or generate from parent (including:
- conversion of coordinates to parent-relative, which involves offset correction and possibly coordinate rotation,
- possibly image rotation)
- return the image and its absolute bounding box (compensating for resizing by additional offset)
- (`image_from_page` only:) also return `OcrdExif` instance for original
</div>
<!-- .element: class="fragment fade-in-then-fade-out" data-fragment-index="1" -->
<div>
how to use:
- not recursive itself – needs to be called recursively, passing down results:
```python
from ocrd_modelfactory import page_from_file
...
page_id = input_file.pageId or input_file.ID # for logging
pcgts = page_from_file(workspace.download_file(input_file))
page = pcgts.get_Page()
page_image, page_xywh, page_image_info = workspace.image_from_page(
    page, page_id)
...
for region in page.get_TextRegion():
    region_image, region_xywh = workspace.image_from_segment(
        region, page_image, page_xywh)
    ...
    for line in region.get_TextLine():
        line_image, line_xywh = workspace.image_from_segment(
            line, region_image, region_xywh)
        ...
```
</div>
<!-- .element: class="fragment" data-fragment-index="2" -->
----
### Description vs. image
#### Implementation in `OCR-D/core`
##### High-level API:
- add image to METS:
**`Workspace.save_image_file`**
<div>
what it does:
- export image file from `PIL.Image` object
- make file path from fileGrp, ID and format
- reference the file in METS via `Workspace.add_file`
- return file path
</div>
<!-- .element: class="fragment fade-in-then-fade-out" data-fragment-index="1" -->
<div>
how to use:
- needs to know image fileGrp, unique ID:
```python
...
file_id = input_file.ID.replace(self.input_file_grp,
                                'OCR-D-IMG-DEWARP')
...
file_path = workspace.save_image_file(image,
                                      file_id + '_' + region.id + '_' + line.id,
                                      page_id=input_file.pageId,
                                      file_grp='OCR-D-IMG-DEWARP')
line.add_AlternativeImage(AlternativeImageType(
    filename=file_path,
    comments=comments + ',' + 'dewarped'))
```
</div>
<!-- .element: class="fragment" data-fragment-index="2" -->
----
### Description vs. image
#### Implementation in `OCR-D/core`
##### Low-level API:
- convert from absolute coordinates to relative:
**`ocrd_utils.coordinates_of_segment`**
what it does:
- get the points of the element's polygon outline
- shift all points by the offset (top-left corner) of the parent towards origin
- (in case the parent was rotated:) rotate all points with the center of the image as pivot
how to use:
```python
line_polygon = coordinates_of_segment(line, region_image, region_xywh)
line_polygon = resegment(line_polygon, region_labels, region_image_bin, line.id)
line_polygon = coordinates_for_segment(line_polygon, region_image, region_xywh)
line.get_Coords().points = points_from_polygon(line_polygon)
```
----
### Description vs. image
#### Implementation in `OCR-D/core`
##### Low-level API:
- convert from relative coordinates to absolute:
**`ocrd_utils.coordinates_for_segment`**
what it does:
- (in case the parent was rotated:) rotate all points with the center of the image as pivot in opposite direction
- shift all points by the offset (top-left corner) of the parent away from origin
how to use:
```python
...
for word_no, word in enumerate(iterate_level(tessapi.GetIterator(), RIL.WORD)):
    word_id = '%s_word%04d' % (line.id, word_no)
    bbox = word.BoundingBox(RIL.WORD)
    points = points_from_polygon(coordinates_for_segment(
        polygon_from_x0y0x1y1(bbox),
        None, # image not needed if element cannot have angle
        line_xywh))
    word = WordType(id=word_id, Coords=CoordsType(points))
    line.add_Word(word)
```
----
### Description vs. image
#### Implementation in `OCR-D/core`
##### Low-level API:
- only coordinate rotation (as `numpy.ndarray`):
**`ocrd_utils.rotate_coordinates`**
- mask away exterior to background:
**`image_from_polygon`**
- background-agnostic replacement for `PIL.Image.crop`:
**`ocrd_utils.crop_image`**
- ...
----
### Description vs. image
#### Early adopters
- core Python implementation already used by:
- [Tesseract processors](https://github.com/OCR-D/ocrd_tesserocr) `ocrd-tesserocr-*`
- [Ocropy processors](https://github.com/cisocrgroup/cis-ocrd-py/) `ocrd-cis-ocropy-*`
- core Bash implementation WIP used by:
- [Olena binarization](https://github.com/OCR-D/ocrd_olena) `ocrd-olena-binarize`
----
### Multiple inputs or outputs
- specified by [comma-separated list](https://ocr-d.github.io/cli#command-line-interface-cli) on CLI:
```shell
$ ocrd-olena-binarize -I OCR-D-GT-SEG-LINE -O OCR-D-SEG-PAGE,OCR-D-IMG-BIN
$ ocrd-cor-asv-ann-evaluate -I OCR-D-GT-SEG-LINE,OCR-D-OCR-TESS,OCR-D-COR-ASV-ANN
```
----
### Multiple inputs or outputs
#### Use-case for multi-valued input: alignment
- `TextLine/TextEquiv/Unicode` alignment via
- global sequence alignment
(see [ocrd-cis-align](https://github.com/cisocrgroup/cis-ocrd-py/blob/51702097e0e4ea023a06d131769eaa0de81dcdd4/ocrd_cis/align/aligner.py#L26) or [ocrd-cor-asv-ann-evaluate](https://github.com/ASVLeipzig/cor-asv-ann/blob/a460bd5a95bd6fa092c40259f99355c3af02f01b/ocrd_cor_asv_ann/wrapper/evaluate.py#L51))
- neural attention mechanism
(cf. Dong&Smith 2018 _Multi-input attention_)
- resegmentation
- possibly: text alignment informed by segmentation (coordinates)
----
### Multiple inputs or outputs
#### Use-case for multi-valued output: PAGE and image
- image preprocessing produces PAGE with `AlternativeImage` references
→ generated images must also be added to METS
- bad solution:
fixed fileGrp `OCR-D-IMG-BIN`, `OCR-D-IMG-DESKEW`, `OCR-D-IMG-DEWARP` etc.
- good approach:
- use `output_file_grp` second position,
- fallback to default if not given
(e.g. [ocrd-tesserocr-binarize](https://github.com/OCR-D/ocrd_tesserocr/blob/ca2530d0f4ffd23ca5bfe7380f1b1089af36f6b6/ocrd_tesserocr/binarize.py#L59) or ocrd-olena-binarize)
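A minimal sketch of the "second position with fallback" approach, assuming the processor receives the raw comma-separated `-O` value as a string; the helper name and the chosen default are illustrative only.

```python
def split_output_groups(output_file_grp, default_image_grp='OCR-D-IMG-BIN'):
    """Split a comma-separated -O value into (PAGE fileGrp, image fileGrp)."""
    groups = output_file_grp.split(',')
    page_grp = groups[0]
    # second position names the image fileGrp; fall back if it is missing
    image_grp = groups[1] if len(groups) > 1 else default_image_grp
    return page_grp, image_grp

explicit = split_output_groups('OCR-D-SEG-PAGE,MY-IMAGES')
fallback = split_output_groups('OCR-D-SEG-PAGE')
```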
----
### Logging as a result
- not all operations have a natural (PAGE/image) output file group:
e.g. OLR/OCR evaluation, model training
- some need to aggregate over multiple pages (or even workspaces):
e.g. CER/WER
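One possible way to aggregate CER over pages is sketched below. It assumes pooling (total edits divided by total GT length) rather than averaging per-page rates; that choice, the helper names and the sample strings are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def aggregate_cer(pairs):
    """CER over all (ocr, gt) pairs: total edits / total GT length."""
    edits = sum(edit_distance(ocr, gt) for ocr, gt in pairs)
    length = sum(len(gt) for _, gt in pairs)
    return edits / length if length else 0.0

pages = [("Gott", "Gott"), ("fey mit uns", "sey mit uns")]
cer = aggregate_cer(pages)
```

Since such a number belongs to no single page, it has no natural PAGE or image file to live in, so writing it to the log is the pragmatic choice.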
---
## Workflow
- Goals:
- flexibility and complexity of **configurations**:
processors and parameters as building blocks
- efficiency and robustness of **engines**:
parallel/distributed computation and validation of inputs/outputs
- Aspects:
- Running ...
- Configuration ...
- Processors ...
- Measurements ...
- Best Practices ...
----
### Running
- with individual CLIs combined in a custom bash script:
- with `ocrd process` as engine:
- with Taverna?
- with Kitodo?
----
### Configuration
```graphviz
digraph G {
node[shape=box];
compound=true;
page_segmentation[label="Region segmentation"];
line_segmentation[label="Line segmentation"];
text_optimization[label="Text optimization"];
subgraph cluster_preprocessing_page {
label = "Page preprocessing";
binarization_page[label="Binarization"];
cropping[label="Cropping"];
deskewing_page[label="Deskewing"];
despeckling_page[label="Despeckling"];
dewarping_page[label="Dewarping"];
binarization_page -> cropping -> deskewing_page -> despeckling_page -> dewarping_page
{rank=same; binarization_page, cropping, deskewing_page, despeckling_page, dewarping_page}
}
subgraph cluster_preprocessing_segment {
label = "Region preprocessing";
deskewing_segment[label="Deskewing"];
despeckling_segment[label="Despeckling"];
binarization_segment[label="Binarization"];
binarization_segment -> despeckling_segment -> deskewing_segment
{rank=same; deskewing_segment, despeckling_segment, binarization_segment}
}
subgraph cluster_preprocessing_line {
label = "Line preprocessing";
binarization_line[label="Binarization"];
dewarping_line[label="Dewarping"];
binarization_line -> dewarping_line
{rank=same; binarization_line, dewarping_line}
}
subgraph cluster_ocr_line {
label = "Text recognition";
ocr_one[label="OCR 1"];
ocr_two[label="OCR 2"];
ocr_n[label="OCR n"];
ocr_one -> ocr_two -> ocr_n[style=dotted,dir=none]
{rank=same; ocr_one, ocr_two, ocr_n}
}
binarization_page -> page_segmentation[label="Pages",ltail=cluster_preprocessing_page]
page_segmentation -> binarization_segment[label="Regions",lhead=cluster_preprocessing_segment]
binarization_segment -> line_segmentation[label="Regions",ltail=cluster_preprocessing_segment]
line_segmentation -> binarization_line[label="Lines",lhead=cluster_preprocessing_line]
binarization_line -> ocr_one[label="Line images", ltail=cluster_preprocessing_line, lhead=cluster_ocr_line]
ocr_one -> text_optimization[label="Line strings",ltail=cluster_ocr_line]
}
```
----
### Available processors
| Processor | Status | Note |
| -------------------------- | -------- | ----------- |
| *Binarization* | | |
| `ocrd-olena-binarize` | ✓ | |
| `ocrd-anybaseocr-binarize` | ✗ | Interface |
| `ocrd-cis-ocropy-binarize` | ✓ | |
| `ocrd-kraken-binarize` | ✗ | Invocation |
| `ocrd-tesserocr-binarize` | ✓ | |
| *Despeckling* | | |
| `ocrd-cis-ocropy-denoise` | ✓ | |
----
### Available processors
| Processor | Status | Note |
| -------------------------- | -------- | ----------- |
| *Cropping* | | |
| `ocrd-anybaseocr-crop` | ✓ | |
| `ocrd-kraken-crop` | ✗ | Interface |
| `ocrd-tesserocr-crop` | ✓ | |
| *Deskewing* | | |
| `ocrd-anybaseocr-deskew` | ✗ | Interface |
| `ocrd-cis-ocropy-deskew` | ✓ | |
| `ocrd-tesserocr-deskew` | ✓ | |
| *Dewarping* | | |
| `ocrd-anybaseocr-dewarp` | ✗ | Interface |
| `ocrd-cis-ocropy-dewarp` | ✓ | |
----
### Available processors
| Processor | Status | Note |
| ------------------------------- | -------- | ----------- |
| *Region Segmentation* | | |
| `ocrd-tesserocr-segment-region` | ✓ | |
| *Clipping/Resegmentation* | | |
| `ocrd-cis-ocropy-clip` | ✓ | |
| `ocrd-cis-ocropy-resegment` | ✓ | |
| `ocrd-segment-repair` | ✓ | |
| *Line Segmentation* | | |
| `ocrd-ocropy-segment` | ✗ | Invocation |
| `ocrd-kraken-segment` | ✗ | Invocation |
| `ocrd-tesserocr-segment-line` | ✓ | |
----
### Available processors
| Processor | Status | Note |
| ------------------------------- | -------- | -------- |
| *Font identification* | | |
| `ocrd-typegroups-classifier` | ✓ | |
| *Text recognition* | | |
| `ocrd-cis-ocropy-recognize` | ✓ | |
| `ocrd-tesserocr-recognize` | ✓ | |
| `ocrd-calamari-recognize` | ✓ | |
----
### Available processors
| Processor | Status | Note |
| ------------------------------- | -------- | --------- |
| *OCR alignment* | | |
| `ocrd-cis-align` | ✓ | |
| *Text optimization* | | |
| `ocrd-cor-asv-ann-process` | ✓ | |
| `ocrd-cor-asv-fst-process` | ✓ | |
| `ocrd-cis-profile` | ✓ | |
| `ocrd-cis-postcorrection` | ✗ | Interface |
| `ocrd-keraslm-rate` | ✓ | |
| *OCR evaluation* | | |
| `ocrd-keraslm-rate` | ✓ | |
| `ocrd-cor-asv-ann-evaluate` | ✓ | |
| `ocrd-dinglehopper` | ✓ | |
----
### From image to regions: commands
```shell
#
# create workspace from existing METS
ocrd workspace clone \
https://digital.slub-dresden.de/data/kitodo/gottgott_38213401X/gottgott_38213401X_mets.xml .
#
# crop with anybaseocr
ocrd-anybaseocr-crop -I ORIGINAL -O CROPPED -m mets.xml
#
# binarize on page level
ocrd-cis-ocropy-binarize -I CROPPED -O BIN -p <(echo '{"level-of-operation": "page"}') -m mets.xml
#
# deskew on page level
ocrd-cis-ocropy-deskew -I BIN -O DESKEWED -p <(echo '{"level-of-operation": "page"}') -m mets.xml
#
# segment into regions
ocrd-tesserocr-segment-region -I DESKEWED -O REGIONS -m mets.xml
```
----
### From image to regions: example
![](https://i.imgur.com/ND1Qzcu.png)
----
### From regions to lines: commands
```shell
#
# clip regions
ocrd-cis-ocropy-clip -I REGIONS -O CLIPPED -p <(echo '{"level-of-operation": "region"}') -m mets.xml
#
# binarize on region level
ocrd-cis-ocropy-binarize -I CLIPPED -O RBIN -p <(echo '{"level-of-operation": "region"}') -m mets.xml
#
# deskew on region level
ocrd-cis-ocropy-deskew -I RBIN -O RDESKEWED -p <(echo '{"level-of-operation": "region"}') -m mets.xml
#
# segment into lines
ocrd-tesserocr-segment-line -I RDESKEWED -O LINES -m mets.xml
```
----
### From regions to lines: example
----
### From lines to text: commands
```shell
#
# clip lines
ocrd-cis-ocropy-clip -I LINES -O LCLIPPED -p <(echo '{"level-of-operation": "line"}') -m mets.xml
#
# binarize on line level
ocrd-cis-ocropy-binarize -I LCLIPPED -O LBIN -p <(echo '{"level-of-operation": "line"}') -m mets.xml
#
# dewarp on line level
ocrd-cis-ocropy-dewarp -I LBIN -O DEWARPED -p <(echo '{"level-of-operation": "line"}') -m mets.xml
#
# recognize text
ocrd-tesserocr-recognize -I DEWARPED -O TEXT -p <(echo '{"model": "frk+deu+Fraktur+Latin"}') -m mets.xml
```
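The three line-level steps above pass the same parameter JSON. Since `-p` receives a path in those commands, the JSON can also be written to a file once and reused. A sketch under that assumption; `write_level_params` is a hypothetical helper, and the guard only runs the processors where they and a `mets.xml` are present:

```shell
# write_level_params LEVEL FILE: store the level-of-operation parameter as JSON.
# (hypothetical helper, not part of OCR-D)
write_level_params() {
    printf '{"level-of-operation": "%s"}\n' "$1" > "$2"
}

params=$(mktemp)
write_level_params line "$params"

# Run the line-level steps only if the processors and a workspace exist.
if command -v ocrd-cis-ocropy-clip >/dev/null 2>&1 && [ -f mets.xml ]; then
    ocrd-cis-ocropy-clip     -I LINES    -O LCLIPPED -p "$params" -m mets.xml
    ocrd-cis-ocropy-binarize -I LCLIPPED -O LBIN     -p "$params" -m mets.xml
    ocrd-cis-ocropy-dewarp   -I LBIN     -O DEWARPED -p "$params" -m mets.xml
else
    echo "OCR-D processors or mets.xml not found; parameter file contains:"
    cat "$params"
fi
rm -f "$params"
```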
----
### From lines to text: example
----
### Observations
- Decent progress on the image preprocessing stage
- Severe problems with recognition of (complex) page structures
- Acceptable text quality (i.e. on par with, sometimes even better than, ABBYY FineReader)
- Running time per book (roughly 2 h) needs improvement (?)
----
### Measurements on current GT
- Tesseract vs. Ocropy OCR, different models
- Tesseract vs. Ocropy{nlbin} vs. Olena{Kim/Wolf/Sauvola} binarization
- binarization on page level vs. region level
- impact of various preprocessors (deskewing, dewarping, clipping, resegmentation)
- resegmentation vs. clipping on line level
- segmentation accuracy of GT itself
- deskewing on page level vs. region level
- _dewarping on page level vs. line level_ (not covered here)
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
clip["CLIP{BLOCK}"]
deskew["DESKEW{BLOCK}:tesserocr"]
reseg[RESEGMENT]
dew[DEWARP:ocropy]
ocr[OCR:*]
bin --> clip
clip --> deskew
deskew --> reseg
reseg --> dew
dew --> ocr
```
| OCR | CER[%] |
| --- | ------ |
| OCRO{fraktur} | 23.7 |
| OCRO{fraktur(jze)} | 28.4 |
| TESS{Fraktur} | 12.2 |
| TESS{frk} | 11.9 |
| TESS{frk+deu} | 11.5 |
→ layout/preprocessing still not good enough
→ Ocropy suffers from _Frakturwechsel_ (nearly no coverage of Antiqua/Arabic numerals, no way to mix models)
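CER in these tables is the character error rate. Assuming the usual definition (edit distance between OCR output and ground truth, divided by the ground-truth length), the metric can be sketched as below. This is an illustration only: `cer` is a hypothetical helper that compares bytes, not Unicode characters, whereas real evaluation tools typically also normalize the text first.

```shell
# cer REFERENCE HYPOTHESIS: print the character error rate in percent.
# Byte-level Levenshtein distance; backslashes in the arguments are not
# supported because awk -v interprets escape sequences.
cer() {
    LC_ALL=C awk -v ref="$1" -v hyp="$2" 'BEGIN {
        n = length(ref); m = length(hyp)
        if (n == 0) { print "undefined"; exit 1 }
        for (j = 0; j <= m; j++) prev[j] = j
        for (i = 1; i <= n; i++) {
            cur[0] = i
            for (j = 1; j <= m; j++) {
                cost = (substr(ref, i, 1) == substr(hyp, j, 1)) ? 0 : 1
                best = prev[j - 1] + cost                          # match or substitution
                if (prev[j] + 1 < best) best = prev[j] + 1         # deletion
                if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1   # insertion
                cur[j] = best
            }
            for (j = 0; j <= m; j++) prev[j] = cur[j]
        }
        printf "%.1f\n", 100 * prev[m] / n
    }'
}

cer "Fraktur" "Frakcur"   # one substitution in seven characters: prints 14.3
```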
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:ocropy"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 23.2 | (-0.5 for s/tesserocr/ocropy/) |
| OCRO{fraktur(jze)} | 28.0 | (-0.4 for s/tesserocr/ocropy/) |
| TESS{Fraktur} | 12.1 | (-0.1 for s/tesserocr/ocropy/) |
| TESS{frk} | 11.9 | (+-0 for s/tesserocr/ocropy/) |
| TESS{frk+deu} | 11.4 | (-0.1 for s/tesserocr/ocropy/) |
→ Ocropy deskews slightly better than Tesseract (the latter being more conservative)
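The comparison column is the difference, in percentage points, between the CER of this configuration and that of a reference configuration. A small sketch that recomputes it from the baseline table and the table on this slide; `cer_deltas` is a hypothetical helper name:

```shell
# cer_deltas: read "MODEL REFERENCE_CER VARIANT_CER" lines from stdin and
# print the signed difference (variant minus reference) per model.
cer_deltas() {
    LC_ALL=C awk '{ printf "%s %+.1f\n", $1, $3 - $2 }'
}

# Columns: model, CER with tesserocr deskewing, CER with ocropy deskewing
cer_deltas <<'EOF'
OCRO{fraktur} 23.7 23.2
OCRO{fraktur(jze)} 28.4 28.0
TESS{Fraktur} 12.2 12.1
TESS{frk} 11.9 11.9
TESS{frk+deu} 11.5 11.4
EOF
```

Negative values mean the variant configuration has the lower error rate.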
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 24.4 | (+1.2 for s//-DESKEW/) |
| OCRO{fraktur(jze)} | 29.3 | (+1.3 for s//-DESKEW/) |
| TESS{Fraktur} | 12.8 | (+0.7 for s//-DESKEW/) |
| TESS{frk} | 12.6 | (+0.7 for s//-DESKEW/) |
| TESS{frk+deu} | 12.2 | (+0.8 for s//-DESKEW/) |
→ Deskewing helps, but not that much (i.e. either GT images already have little skew, or dewarping can compensate)
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 52.6 | (+28.2 for s//-DEWARP/) |
| OCRO{fraktur(jze)} | 61.4 | (+32.1 for s//-DEWARP/) |
| TESS{Fraktur} | 13.3 | (+ 0.5 for s//-DEWARP/) |
| TESS{frk} | 13.2 | (+ 0.6 for s//-DEWARP/) |
| TESS{frk+deu} | 12.8 | (+ 0.6 for s//-DEWARP/) |
→ Ocropy is very sensitive to warped images; Tesseract is nearly immune
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 53.2 | (+0.6 for s/-DEWARP/-DEWARP-DESKEW/) |
| OCRO{fraktur(jze)} | 63.0 | (+1.6 for s/-DEWARP/-DEWARP-DESKEW/) |
| TESS{Fraktur} | 13.5 | (+0.2 for s/-DEWARP/-DEWARP-DESKEW/) |
| TESS{frk} | 13.3 | (+0.1 for s/-DEWARP/-DEWARP-DESKEW/) |
| TESS{frk+deu} | 12.9 | (+0.1 for s/-DEWARP/-DEWARP-DESKEW/) |
→ Deskewing apparently cannot replace dewarping, i.e. either deskewing is too bad, or dewarping cannot compensate for missing deskewing in the first place.
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 56.8 | (+3.6 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
| OCRO{fraktur(jze)} | 66.0 | (+3.0 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
| TESS{Fraktur} | 13.6 | (+0.1 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
| TESS{frk} | 13.7 | (+0.4 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
| TESS{frk+deu} | 13.3 | (+0.4 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
→ Resegmentation primarily helps Ocropy, i.e. Tesseract is much less sensitive to invading as-/descenders from neighbouring lines
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> deskew
deskew["DESKEW{PAGE}:tesserocr"]
deskew --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 24.6 | (+0.9 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
| OCRO{fraktur(jze)} | 29.7 | (+1.3 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
| TESS{Fraktur} | 13.4 | (+1.2 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
| TESS{frk} | 13.2 | (+1.3 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
| TESS{frk+deu} | 12.7 | (+1.2 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
→ Deskewing (on average) works slightly better on the region level than on the page level
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> den
den["DENOISE{PAGE}:ocropy"]
den --> deskew
deskew["DESKEW{PAGE}:ocropy"]
deskew --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 19.1 | (-5.5 for s//DENOISE{PAGE}/) |
| OCRO{fraktur(jze)} | 23.5 | (-6.2 for s//DENOISE{PAGE}/) |
| TESS{Fraktur} | 11.7 | (-1.7 for s//DENOISE{PAGE}/) |
| TESS{frk} | 11.8 | (-1.4 for s//DENOISE{PAGE}/) |
| TESS{frk+deu} | 11.2 | (-1.5 for s//DENOISE{PAGE}/) |
→ Denoising can improve Wolf binarization, esp. for Ocropy
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 23.9 | (+0.2 for s//-CLIP{BLOCK}/) |
| OCRO{fraktur(jze)} | 28.6 | (+0.2 for s//-CLIP{BLOCK}/) |
| TESS{Fraktur} | 12.3 | (+0.1 for s//-CLIP{BLOCK}/) |
| TESS{frk} | 12.3 | (+0.4 for s//-CLIP{BLOCK}/) |
| TESS{frk+deu} | 11.9 | (+0.4 for s//-CLIP{BLOCK}/) |
→ Clipping on the region level gives only a minimal improvement (if resegmentation is used)
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 31.2 | (+7.5 for s//-RESEG/) |
| OCRO{fraktur(jze)} | 35.1 | (+6.7 for s//-RESEG/) |
| TESS{Fraktur} | 13.2 | (+1.0 for s//-RESEG/) |
| TESS{frk} | 12.7 | (+0.8 for s//-RESEG/) |
| TESS{frk+deu} | 12.4 | (+0.9 for s//-RESEG/) |
→ Resegmentation primarily helps Ocropy, i.e. Tesseract is much less sensitive to invading as-/descenders from neighbouring lines
→ Resegmentation needs deskewing
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> clip2
clip2["CLIP{LINE}"]
clip2 --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 24.1 | (+0.4 for s/RESEG/CLIP{LINE}/) |
| OCRO{fraktur(jze)} | 28.9 | (+0.5 for s/RESEG/CLIP{LINE}/) |
| TESS{Fraktur} | 12.9 | (+0.7 for s/RESEG/CLIP{LINE}/) |
| TESS{frk} | 12.7 | (+0.8 for s/RESEG/CLIP{LINE}/) |
| TESS{frk+deu} | 12.4 | (+0.9 for s/RESEG/CLIP{LINE}/) |
→ Clipping on the line level cannot quite replace resegmentation; with Tesseract it does not help at all
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{kim}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 24.9 | (+1.2 for s/wolf/kim/) |
| OCRO{fraktur(jze)} | 29.6 | (+1.2 for s/wolf/kim/) |
| TESS{Fraktur} | 14.0 | (+1.8 for s/wolf/kim/) |
| TESS{frk} | 14.2 | (+2.3 for s/wolf/kim/) |
| TESS{frk+deu} | 13.8 | (+2.3 for s/wolf/kim/) |
→ Kim is noticeably worse than Wolf (on average)
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{sauvola}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 23.0 | (-0.7 for s/wolf/sauvola/) |
| OCRO{fraktur(jze)} | 27.9 | (-0.5 for s/wolf/sauvola/) |
| TESS{Fraktur} | 12.0 | (-0.2 for s/wolf/sauvola/) |
| TESS{frk} | 11.8 | (-0.1 for s/wolf/sauvola/) |
| TESS{frk+deu} | 11.5 | (+-0 for s/wolf/sauvola/) |
→ Basic Sauvola is slightly better than Wolf (on average)
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{sauvola-ms-split}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 22.8 | (-0.9 for s/wolf/sauvola-ms-split/) |
| OCRO{fraktur(jze)} | 27.6 | (-0.8 for s/wolf/sauvola-ms-split/) |
| TESS{Fraktur} | 11.6 | (-0.6 for s/wolf/sauvola-ms-split/) |
| TESS{frk} | 11.5 | (-0.4 for s/wolf/sauvola-ms-split/) |
| TESS{frk+deu} | 11.1 | (-0.4 for s/wolf/sauvola-ms-split/) |
→ This Sauvola variant is even better (on average)
----
```mermaid
graph LR
bin["BIN{PAGE}:ocropy{nlbin}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 36.4 | (+12.7 for s/wolf/ocropy/) |
| OCRO{fraktur(jze)} | 41.1 | (+12.7 for s/wolf/ocropy/) |
| TESS{Fraktur} | 15.7 | (+ 3.5 for s/wolf/ocropy/) |
| TESS{frk} | 14.9 | (+ 3.0 for s/wolf/ocropy/) |
| TESS{frk+deu} | 14.8 | (+ 3.3 for s/wolf/ocropy/) |
→ Ocropy{nlbin} is really bad! (but what about `perc` / `range` / `threshold` / `lo` / `hi` parameters?)
----
```mermaid
graph LR
clip["CLIP{BLOCK}"]
clip --> bin
bin["BIN{BLOCK}:ocropy{nlbin}"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 23.0 | (-13.4 for s/BIN{PAGE}/BIN{BLOCK}/) |
| OCRO{fraktur(jze)} | 27.4 | (-13.7 for s/BIN{PAGE}/BIN{BLOCK}/) |
| TESS{Fraktur} | 11.5 | (- 4.2 for s/BIN{PAGE}/BIN{BLOCK}/) |
| TESS{frk} | 11.4 | (- 3.5 for s/BIN{PAGE}/BIN{BLOCK}/) |
| TESS{frk+deu} | 11.2 | (- 3.6 for s/BIN{PAGE}/BIN{BLOCK}/) |
→ Binarization on the region level is superior to the page level (if using Ocropy{nlbin}!)
note:
- clipping is neutral here – it does ad-hoc binarization (with Ocropy{nlbin})
- to be conclusive, the Olena variants should be run like this as well (which requires bashlib access to `AlternativeImage` on the region level)
----
```mermaid
graph LR
clip["CLIP{BLOCK}"]
clip --> bin
bin["BIN{BLOCK}:tesserocr"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 55.8 | (+32.8 for s/ocropy{nlbin}/tesserocr/) |
| OCRO{fraktur(jze)} | 63.5 | (+36.1 for s/ocropy{nlbin}/tesserocr/) |
| TESS{Fraktur} | 15.0 | (+ 3.5 for s/ocropy{nlbin}/tesserocr/) |
| TESS{frk} | 15.3 | (+ 3.9 for s/ocropy{nlbin}/tesserocr/) |
| TESS{frk+deu} | 14.8 | (+ 3.6 for s/ocropy{nlbin}/tesserocr/) |
→ Ocropy cannot cope with Tesseract binarization.
→ But even Tesseract prefers Ocropy binarization!
note:
- to be conclusive, Tesseract binarization should be run on the page level for comparison, but this is impossible with its CAPI
----
```mermaid
graph LR
bin["BIN{BLOCK}:ocropy{nlbin}"]
bin --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 55.4 | (+32.4 for s//-CLIP-DESKEW-RESEG-DEWARP/)^1^ |
| OCRO{fraktur(jze)} | 64.6 | (+37.2 for s//-CLIP-DESKEW-RESEG-DEWARP/)^1^ |
| TESS{Fraktur} | 12.9 | (+ 1.4 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk} | 12.9 | (+ 1.5 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk+deu} | 12.7 | (+ 1.5 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
→ Without the extra preprocessors, Ocropy is lost on GT.
^1^: "naïve" configuration of Ocropy
----
```mermaid
graph LR
bin["BIN{BLOCK}:tesserocr"]
bin --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 66.4 | (+10.6 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| OCRO{fraktur(jze)} | 72.4 | (+ 8.9 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{Fraktur} | 16.2 | (+ 1.2 for s//-CLIP-DESKEW-RESEG-DEWARP/)^2^ |
| TESS{frk} | 16.9 | (+ 1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/)^2^ |
| TESS{frk+deu} | 16.4 | (+ 1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/)^2^ |
→ Without the extra processors, Tesseract still works.
^2^: "naïve" configuration of Tesseract (but how does this compare to the CLI?)
----
```mermaid
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> ocr
ocr[OCR:*]
```
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 56.8 | (+33.1 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| OCRO{fraktur(jze)} | 66.0 | (+37.6 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{Fraktur} | 13.6 | (+ 1.4 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk} | 13.5 | (+ 1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
| TESS{frk+deu} | 13.3 | (+ 1.8 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
→ Again, Ocropy requires the extra preprocessors, while Tesseract is surprisingly insensitive to invading neighbouring regions/lines and to skewed and warped images.
----
### Measurements on GT4HistOCR
- directory / file name suffix structure instead of METS/PAGE
- pre-built (fixed) preprocessing with less effort
(no polygons, no clipping / resegmentation / dewarping)
- automatic segmentation by alignment instead of manual segmentation
- representative?
- large enough for training (OCR, COR)
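Because this data comes as plain directories with file name suffixes rather than METS/PAGE, pairing line images with their transcriptions is a matter of matching file stems. A sketch, assuming line images named `STEM.*.png` next to transcriptions named `STEM.gt.txt` (the exact suffixes are an assumption and should be checked against the corpus); `list_pairs` is a hypothetical helper:

```shell
# list_pairs DIR: print "IMAGE<TAB>TRANSCRIPTION" for every line image in DIR
# that has a matching transcription; warn on stderr otherwise.
# Assumes that stems contain no dots. Runs in a subshell to keep variables local.
list_pairs() (
    for image in "$1"/*.png; do
        [ -e "$image" ] || continue      # directory without PNG files
        base=${image##*/}                # file name without directory
        stem=${base%%.*}                 # drop everything after the first dot
        text=$1/$stem.gt.txt
        if [ -f "$text" ]; then
            printf '%s\t%s\n' "$image" "$text"
        else
            printf 'missing transcription for %s\n' "$image" >&2
        fi
    done
)

# Small demonstration on a throwaway directory
demo=$(mktemp -d)
touch "$demo/00001.bin.png" "$demo/00001.gt.txt" "$demo/00002.bin.png"
list_pairs "$demo"                       # one pair, one warning for 00002
rm -r "$demo"
```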
----
```mermaid
graph LR
bin["BIN{PAGE}:ocropy{nlbin}"]
deskew["DESKEW{PAGE}:ocropy{nlbin}"]
lseg["SEG{LINE}:ocropy{gpageseg}"]
bin --> deskew
deskew --> lseg
lseg --> ocr
ocr[OCR:*]
style bin fill:#ccc
style deskew fill:#ccc
style lseg fill:#ccc
```
<!-- .slide svg: width="80px"; height="10px"; -->
| OCR | CER[%] | comparison |
| --- | ------ | ---------- |
| OCRO{fraktur} | 8.33 | (-14.5 for s/OCR-D/4HistOCR/) |
| OCRO{fraktur(jze)} | 6.27 | (-21.3 for s/OCR-D/4HistOCR/) |
| TESS{Fraktur} | 8.33 | (-3.3 for s/OCR-D/4HistOCR/) |
| TESS{frk} | ? | (- ? for s/OCR-D/4HistOCR/) |
| TESS{frk+deu} | ? | (- ? for s/OCR-D/4HistOCR/) |
→ Much better than the best measured OCR-D configuration!
----
### Striving for best-practice configurations
- Increasing number of processors
- DFKI preprocessing and layout analysis
- Würzburg layout analysis
- Leipzig and Munich post correction
- Increasing number of models
- Trainable processors?
- Erlangen font detection + Leipzig training tools
- OCR-D GT
- Increasing number of possible configurations
- Recommendations for users are needed!
- (Tools and workflows for text-based evaluation)
- Tools and workflows for layout evaluation
---
## Wishlist
- [ ] improve quality/consistency of existing processors
(`AlternativeImage`, DPI, packaging, logging, documentation)
- [ ] OCR-D wrapper for PRImA Layout Evaluation (profiles)
- [ ] OCR-D wrapper for ScanTailor
- [ ] OCR-D wrapper for [page-level dewarping with Leptonica](https://tpgit.github.io/UnOfficialLeptDocs/leptonica/dewarping.html)
<!-- Slides for the OCR-D developer workshop 2019 -->