OCR Workflows with OCR-D

Architecture

Workspaces

physical representation of a METS-XML file
- one directory per document
- mets:fileGrp as subdirectories
- mets:file as files (relative local paths / URLs)
managed much like a repository

ocrd workspace clone "https://digital.slub-dresden.de/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:de:slub-dresden:db:id-50169238X"
ocrd workspace find --file-grp ORIGINAL --page-id PHYS_0001 --download
ocrd workspace list-page
ocrd workspace list-group
ocrd workspace add ...

the workspace is the object that processing operates on
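
The mapping between METS entries and workspace files can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the OCR-D API; the sample METS fragment, the example.org URL, the file IDs, and the helper names `list_files` and `is_remote` are assumptions made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Namespaces used by METS and by the xlink:href attribute of mets:FLocat.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical, heavily simplified METS fragment (assumption for illustration).
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="ORIGINAL">
      <mets:file ID="FILE_0001_ORIGINAL" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/images/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="BIN">
      <mets:file ID="FILE_0001_BIN" MIMETYPE="image/png">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="BIN/FILE_0001_BIN.png"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_files(mets_xml):
    """Return {fileGrp USE: [href, ...]} for every mets:fileGrp."""
    root = ET.fromstring(mets_xml.strip())
    result = {}
    for grp in root.iterfind(".//mets:fileGrp", NS):
        hrefs = [
            flocat.get("{http://www.w3.org/1999/xlink}href")
            for flocat in grp.iterfind("mets:file/mets:FLocat", NS)
        ]
        result[grp.get("USE")] = hrefs
    return result


def is_remote(href):
    """True for URLs that still need downloading, False for local paths."""
    return href.startswith(("http://", "https://"))


for use, hrefs in list_files(SAMPLE_METS).items():
    for href in hrefs:
        print(use, "remote" if is_remote(href) else "local", href)
```

In this sample the ORIGINAL group still points at a URL, while the BIN group points at a relative path below the workspace directory.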

Representation

METS-XML: document structure, metadata
PAGE-XML: page structure (hierarchical); describes segments with (polygon) coordinates, attributes, text, image references
Images: original and derived images
Processing = stepwise enrichment (annotation)
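
To make the hierarchical page structure concrete, the following sketch walks a tiny PAGE-XML fragment and collects each text line with its polygon and text. It uses only the Python standard library; the sample document, its IDs and coordinates, and the helper names `parse_points` and `describe_lines` are assumptions for illustration, and the fragment is simplified rather than schema-validated.

```python
import xml.etree.ElementTree as ET

# PAGE content namespace (2019-07-15 version of the schema).
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, simplified PAGE-XML fragment: one region containing one line.
SAMPLE_PAGE = """
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="BIN/FILE_0001_BIN.png" imageWidth="1000" imageHeight="1500">
    <pc:TextRegion id="r1">
      <pc:Coords points="100,100 900,100 900,400 100,400"/>
      <pc:TextLine id="r1_l1">
        <pc:Coords points="100,100 900,100 900,200 100,200"/>
        <pc:TextEquiv><pc:Unicode>Example line</pc:Unicode></pc:TextEquiv>
      </pc:TextLine>
    </pc:TextRegion>
  </pc:Page>
</pc:PcGts>
"""


def parse_points(points):
    """Turn a PAGE 'points' string ("x,y x,y ...") into a list of (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def describe_lines(page_xml):
    """Return (region id, line id, polygon, text) for every text line."""
    root = ET.fromstring(page_xml.strip())
    lines = []
    for region in root.iterfind(".//pc:TextRegion", PAGE_NS):
        for line in region.iterfind("pc:TextLine", PAGE_NS):
            coords = line.find("pc:Coords", PAGE_NS)
            text = line.findtext(
                "pc:TextEquiv/pc:Unicode", default="", namespaces=PAGE_NS
            )
            lines.append(
                (region.get("id"), line.get("id"),
                 parse_points(coords.get("points")), text)
            )
    return lines


for region_id, line_id, polygon, text in describe_lines(SAMPLE_PAGE):
    print(region_id, line_id, polygon, repr(text))
```

Each processing step adds to this tree: segmentation adds regions and lines with coordinates, recognition adds the text.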

Processors

implementation of an abstract operation in an OCR workflow
operates on a workspace (input/output fileGrps)
parameter specification (JSON Schema)
uniform CLI

ocrd-olena-binarize -I MAX -O BIN -P impl sauvola-ms-split
ocrd-tesserocr-segment -I BIN -O SEG -P find_tables false
ocrd-cis-ocropy-dewarp -I SEG -O DEW -P smoothness 0.1
ocrd-calamari-recognize -I DEW -O OCR -P model "gt4histocr/*.ckpt.json"
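
Because every processor shares the same command-line shape, a chain like the one above can be generated from plain data. The sketch below only builds and prints the commands; it does not run them. The helper name `processor_command` and the `STEPS` list are assumptions for illustration, and passing all parameters as one JSON string via `-p` (instead of repeated `-P name value` pairs) is a choice made here for brevity.

```python
import json
import shlex


def processor_command(tool, input_grp, output_grp, params=None):
    """Build the argument list for one processor call.

    Parameters are passed as a single JSON object via -p.
    """
    argv = [tool, "-I", input_grp, "-O", output_grp]
    if params:
        argv += ["-p", json.dumps(params)]
    return argv


# The four steps from the example above, expressed as data.
STEPS = [
    ("ocrd-olena-binarize", "MAX", "BIN", {"impl": "sauvola-ms-split"}),
    ("ocrd-tesserocr-segment", "BIN", "SEG", {"find_tables": False}),
    ("ocrd-cis-ocropy-dewarp", "SEG", "DEW", {"smoothness": 0.1}),
    ("ocrd-calamari-recognize", "DEW", "OCR", {"model": "gt4histocr/*.ckpt.json"}),
]

for step in STEPS:
    # shlex.join quotes the JSON so the line can be pasted into a shell.
    print(shlex.join(processor_command(*step)))
```

Note how the output fileGrp of each step is the input fileGrp of the next; that chaining is what turns individual processors into a workflow.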

currently more than 60 processors available
- binarization / denoising / despeckling
- cropping / deskewing / dewarping
- segmentation / clipping / alignment
- text recognition / alignment / post-correction
- evaluation / conversion / archiving

Workflow

configuration of the sequence of processors and their parameters
shell syntax, Makefile syntax

MAX:
	ocrd workspace find -G $@ --download

BIN: MAX
BIN: TOOL = ocrd-olena-binarize
BIN: PARAMS = "impl": "wolf" # sauvola-ms-split

SEG: BIN
SEG: TOOL = ocrd-tesserocr-segment
SEG: PARAMS = "find_tables": false, "shrink_polygons": true

DEW: SEG
DEW: TOOL = ocrd-cis-ocropy-dewarp
DEW: PARAMS = "smoothness": 0.1

OCR: DEW
OCR: TOOL = ocrd-calamari-recognize
OCR: PARAMS = "model": "gt4histocr/*.ckpt.json"

invoke on one or more workspaces and process them (in parallel):

make -j -f config.mk all
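
The Makefile rules above declare, for each output fileGrp, which fileGrp it depends on and which tool produces it. As a rough illustration of what that dependency declaration buys, the sketch below resolves the same chain into an execution order with a topological sort. It is not how the Makefile-based workflow runner is implemented; the `DEPENDS` and `TOOLS` tables and the function `execution_order` are assumptions that merely mirror the rules shown above (requires Python 3.9+ for `graphlib`).

```python
from graphlib import TopologicalSorter

# Target fileGrp -> fileGrps it depends on (mirrors "BIN: MAX" etc.).
DEPENDS = {
    "MAX": [],
    "BIN": ["MAX"],
    "SEG": ["BIN"],
    "DEW": ["SEG"],
    "OCR": ["DEW"],
}

# Target fileGrp -> tool that produces it (mirrors the TOOL assignments).
# MAX has no processor; it is fetched with `ocrd workspace find ... --download`.
TOOLS = {
    "BIN": "ocrd-olena-binarize",
    "SEG": "ocrd-tesserocr-segment",
    "DEW": "ocrd-cis-ocropy-dewarp",
    "OCR": "ocrd-calamari-recognize",
}


def execution_order(depends):
    """Return targets ordered so that every dependency comes first."""
    return list(TopologicalSorter(depends).static_order())


for target in execution_order(DEPENDS):
    tool = TOOLS.get(target, "(download)")
    print(f"{target}: {tool}")
```

As with any make target, a fileGrp that already exists does not need to be rebuilt, and `-j` lets make run independent jobs concurrently.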

Overview

Project
- Goals
Architecture
- Workspaces
- Representation
- Processors
- Workflow
Entry points
- Installation
- Usage
- Development