Clemens Neudecker & Konstantin Baierer
Staatsbibliothek zu Berlin (SBB)
OCR-D final workshop, DFG
12 February 2020 | Wissenschaftszentrum Bonn
–> Enrichment of METS image digitizations with OCR data and metadata
–> Early involvement of users and developers
–> Transparency throughout the entire development
–> Traceability, reproducibility
ocrd_all: core, spec, and all module projects
make all to install OCR-D completely
With Taverna, workflows can be run with OCR-D processors. During execution, stdout and stderr are logged and stored together with the results in the research data repository.
Outlook (KIT, work in progress):
Cross-evaluation of different workflows
The OCR-D framework contains all components needed to run a complete workflow.
It consists of several Docker containers.
$ ocrd workspace clone \
https://raw.githubusercontent.com/OCR-D/assets/master/data/communist_manifesto/data/mets.xml \
/tmp/workspace1
$ find /tmp/workspace1
/tmp/workspace1/mets.xml
$ ocrd workspace clone \
--download \
https://raw.githubusercontent.com/OCR-D/assets/master/data/communist_manifesto/data/mets.xml \
/tmp/workspace2
$ find /tmp/workspace2
/tmp/workspace2/mets.xml
/tmp/workspace2/OCR-D-IMG
/tmp/workspace2/OCR-D-IMG/OCR-D-IMG_0015.png
$ ocrd workspace -d /tmp/workspace1 validate
<report valid="true">
<notice>Image OCR-D-IMG_0015: xResolution (1 pixels per inches) is suspiciously low</notice>
<notice>Image OCR-D-IMG_0015: yResolution (1 pixels per inches) is suspiciously low</notice>
</report>
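In a script, the report above can be checked mechanically. The following is a minimal sketch, assuming the report always opens with a <report valid="..."> element as in the output shown; the helper name report_is_valid is hypothetical and not part of the ocrd CLI.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the validation report read from stdin
# declares valid="true" on its <report> element.
report_is_valid() {
    grep -q '<report valid="true"'
}

# Sample report shortened from the output above; notices do not affect validity.
sample_report='<report valid="true">
<notice>Image OCR-D-IMG_0015: xResolution (1 pixels per inches) is suspiciously low</notice>
</report>'

if printf '%s\n' "$sample_report" | report_is_valid; then
    echo "workspace is valid"
else
    echo "workspace has errors"
fi
```

With OCR-D installed, the same helper could be fed from the real command, for example: ocrd workspace -d /tmp/workspace1 validate | report_is_valid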
$ ocrd workspace init /tmp/workspace-new
12:59:19.479 INFO ocrd.resolver - Writing METS to /tmp/workspace-new/mets.xml
12:59:19.480 INFO ocrd.workspace - Saving mets '/tmp/workspace-new/mets.xml'
/tmp/workspace-new
$ ocrd workspace add \
--file-grp DEFAULT \
--file-id page1_img \
--mimetype image/tiff \
--page-id page1 \
page1.tiff
@@ -9,3 +9,6 @@
<mets:div TYPE="physSequence">
- </mets:div>
+ <mets:div TYPE="page" ID="page1">
+ <mets:fptr FILEID="page1_img"/>
+ </mets:div>
+ </mets:div>
</mets:structMap>
@@ -22,3 +25,8 @@
<mets:fileSec>
- </mets:fileSec>
+ <mets:fileGrp USE="DEFAULT">
+ <mets:file MIMETYPE="image/tiff" ID="page1_img">
+ <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="page1.tiff"/>
+ </mets:file>
+ </mets:fileGrp>
+ </mets:fileSec>
</mets:mets>
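To register several images, the same call can be generated once per file. The sketch below only prints the commands (a dry run) and uses the flags from the example above; the function name print_add_commands and the scheme of numbering pages by argument order are assumptions made for illustration. File names containing spaces are not quoted in the printed commands.

```shell
#!/bin/sh
# Print one "ocrd workspace add" command per image file given as argument.
# IDs follow the pattern of the example above (page1 / page1_img, ...);
# deriving the page number from the argument order is an assumption.
print_add_commands() {
    n=1
    for img in "$@"; do
        printf 'ocrd workspace add --file-grp DEFAULT --file-id page%d_img --mimetype image/tiff --page-id page%d %s\n' \
            "$n" "$n" "$img"
        n=$((n + 1))
    done
}

print_add_commands page1.tiff page2.tiff
```

After reviewing the printed commands, they could be executed inside the workspace directory by piping the output to sh.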
---
$ ocrd workspace --help
Commands:
add Add a file LOCAL_FILENAME to METS in a workspace.
backup Backing and restoring workspaces - dev edition
clone Create a workspace from a METS_URL and return the directory...
find Find files.
get-id Get METS id if any
init Create a workspace with an empty METS file in DIRECTORY.
list-group List fileGrp USE attributes
list-page List page IDs
prune-files Removes mets:files that point to non-existing local files
remove Delete file by ID from mets.xml
remove-group Delete a file group
set-id Set METS ID.
validate Validate a workspace
$ ocrd zip bag -d /tmp/workspace1 --id 'https://ocr-d.de/gt/123'
13:05:27.502 INFO ocrd.workspace_bagger - Bagging /tmp/workspace1 to /tmp/workspace1.ocrd.zip (temp dir /tmp/ocrd-bagit-zcyaar_5)
13:05:27.502 INFO ocrd.workspace_bagger - Resolving OCR-D-IMG/OCR-D-IMG_0015.png (partial)
13:05:27.502 INFO ocrd.workspace_bagger - Resolved OCR-D-IMG/OCR-D-IMG_0015.png
13:05:27.805 INFO ocrd.workspace_bagger - Created bag at /tmp/workspace1.ocrd.zip
$ ocrd zip validate /tmp/workspace1.ocrd.zip
13:08:04.640 INFO bagit - Verifying checksum for file /tmp/ocrd-bagit-_9ydkekl/data/mets.xml
13:08:04.641 INFO bagit - Verifying checksum for file /tmp/ocrd-bagit-_9ydkekl/data/OCR-D-IMG/OCR-D-IMG_0015.png
13:08:04.664 INFO bagit - Verifying checksum for file /tmp/ocrd-bagit-_9ydkekl/manifest-sha512.txt
13:08:04.664 INFO bagit - Verifying checksum for file /tmp/ocrd-bagit-_9ydkekl/bagit.txt
13:08:04.664 INFO bagit - Verifying checksum for file /tmp/ocrd-bagit-_9ydkekl/bag-info.txt
OK
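The checksum verification logged above can be illustrated by hand. This is a rough sketch on a toy bag, not a complete OCRD-ZIP: it assumes GNU sha512sum is available and that the manifest uses the "checksum path" line format that sha512sum itself writes. What ocrd zip validate does beyond the checksum comparison is not shown.

```shell
#!/bin/sh
# Build a toy bag (NOT a complete OCRD-ZIP) with a single payload file.
bag=$(mktemp -d)
mkdir -p "$bag/data"
printf '<mets/>\n' > "$bag/data/mets.xml"

# Write a manifest with one "<checksum>  <path>" line per payload file.
(cd "$bag" && sha512sum data/mets.xml > manifest-sha512.txt)

# Re-compute the checksums and compare them with the manifest.
if (cd "$bag" && sha512sum -c --quiet manifest-sha512.txt); then
    echo "OK"
else
    echo "checksum mismatch"
fi
```

Changing data/mets.xml after the manifest has been written makes the check fail, which is the kind of corruption the validation step is meant to catch.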
$ ocrd-cis-ocropy-binarize --help
Usage: ocrd-cis-ocropy-binarize [OPTIONS]
Binarize (and optionally deskew/despeckle) pages / regions / lines with ocropy
Options:
-V, --version Show version
-l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
Log level
-J, --dump-json Dump tool description as JSON and exit
-p, --parameter TEXT Parameters, either JSON string or path
JSON file
-g, --page-id TEXT ID(s) of the pages to process
-O, --output-file-grp TEXT File group(s) used as output.
-I, --input-file-grp TEXT File group(s) used as input.
-w, --working-dir TEXT Working Directory
-m, --mets TEXT METS to process
-h, --help This help message
Parameters:
"method" [string - ocropy] binarization method to use (only ocropy
will include deskewing) Possible values: ["none", "global",
"otsu", "gauss-otsu", "ocropy"]
"grayscale" [boolean - False] for the ocropy method, produce
grayscale-normalized instead of thresholded image
"maxskew" [number - 0.0] modulus of maximum skewing angle to detect
(larger will be slower, 0 will deactivate deskewing)
"noise_maxsize" [number - 0] maximum pixel number for connected
components to regard as noise (0 will deactivate denoising)
"level-of-operation" [string - page] PAGE XML hierarchy level
granularity to annotate images for Possible values: ["page",
"region", "line"]
Default Wiring:
['OCR-D-IMG', 'OCR-D-SEG-BLOCK', 'OCR-D-SEG-LINE'] -> ['OCR-D-IMG-BIN', 'OCR-D-SEG-BLOCK', 'OCR-D-SEG-LINE']
$ wget 'https://github.com/OCR-D/assets/raw/master/data/kant_aufklaerung_1784/data/OCR-D-GT-PAGE/PAGE_0017_PAGE.xml'
$ ocrd validate page \
--check-coords \
--check-baseline \
--page-textequiv-consistency strict \
PAGE_0017_PAGE.xml
$ ocrd validate
Usage: ocrd validate [OPTIONS] COMMAND [ARGS]...
All the validation in one CLI
Options:
--help Show this message and exit.
Commands:
page Validate PAGE against OCR-D conventions
parameters Validate PARAM_JSON against parameter definition of
EXECUTABLE...
tasks Validate a sequence of tasks passable to 'ocrd process'
tool-json Validate OCRD_TOOL as an ocrd-tool.json file.
$ ocrd process --mets /path/to/mets.xml \
'binarize -I MAX -O BIN' \
'segment -I BIN -O SEG' \
'recognize -I SEG -O OCR'
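In the chain above, each task reads the file group written by the previous one. The sketch below checks that wiring for simple task strings. It is a simplified illustration, not the logic of ocrd validate tasks: the function name check_chain is hypothetical, and it handles a single group per -I/-O only (no comma-separated lists).

```shell
#!/bin/sh
# check_chain INITIAL_GROUP TASK...
# Succeeds if every task's -I group is the initial group or the -O group
# of an earlier task. Assumes one group per option and no spaces in values.
check_chain() {
    available=" $1 "
    shift
    for task in "$@"; do
        in_grp=""
        out_grp=""
        prev=""
        set -f                      # no globbing while splitting the task
        for word in $task; do
            case "$prev" in
                -I) in_grp=$word ;;
                -O) out_grp=$word ;;
            esac
            prev=$word
        done
        set +f
        case "$available" in
            *" $in_grp "*) ;;
            *) echo "task '$task': input group '$in_grp' is not available" >&2
               return 1 ;;
        esac
        available="$available$out_grp "
    done
}

check_chain MAX \
    'binarize -I MAX -O BIN' \
    'segment -I BIN -O SEG' \
    'recognize -I SEG -O OCR' && echo "chain is consistent"
```

Here MAX stands for a file group that already exists in the METS file before the workflow starts.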
$ cp -r ~monorepo/assets/data/kant_aufklaerung_1784/data /tmp/ws1
$ cd /tmp/ws1
$ ocrd process \
'cis-ocropy-binarize -I OCR-D-IMG -O BIN,BIN-IMG' \
'anybaseocr-crop -I BIN -O CROP,CROP-IMG' \
'cis-ocropy-deskew -I CROP -O DESKEW' \
'tesserocr-segment-region -I DESKEW -O PAGE-REGION' \
'tesserocr-segment-line -I PAGE-REGION -O PAGE-LINE' \
'tesserocr-recognize -I PAGE-LINE -O OCR-TESS -p "{\"model\": \"GT4HistOCR_0.913_233896_953200\"}"'
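The escaped quotes in the last task can be avoided by keeping the parameters in a file, since the processor help shown earlier describes -p as accepting either a JSON string or a path to a JSON file. The following sketch calls the processor directly rather than through ocrd process; the file location is an arbitrary choice, and the executable name ocrd-tesserocr-recognize assumes the usual ocrd- prefix for the task name used above.

```shell
#!/bin/sh
# Write the recognition parameters to a file instead of escaping quotes inline.
# The file name and location are arbitrary choices for this sketch.
params_file="${TMPDIR:-/tmp}/tesserocr-params.json"
cat > "$params_file" <<'EOF'
{"model": "GT4HistOCR_0.913_233896_953200"}
EOF

# Run the processor only if it is installed; otherwise just show the command.
if command -v ocrd-tesserocr-recognize >/dev/null 2>&1; then
    ocrd-tesserocr-recognize -I PAGE-LINE -O OCR-TESS -p "$params_file"
else
    echo "would run: ocrd-tesserocr-recognize -I PAGE-LINE -O OCR-TESS -p $params_file"
fi
```

The processor call expects to be run inside a workspace directory such as /tmp/ws1, where mets.xml and the PAGE-LINE file group exist.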
Scalable Methods of Text and Structure Recognition for the Full-Text Digitization of Historical Prints, Part 1.B: Image Optimization (DFKI Kaiserslautern)
Goals: identification, development, and integration of suitable algorithms for
ocrd-anybaseocr-binarize \
--mets /path/to/mets.xml \
--input-file-grp MAX \
--output-file-grp BIN-ANYBASE \
-p '{"gray": false}'
Scalable Methods of Text and Structure Recognition for the Full-Text Digitization of Historical Prints, Part 2: Layout Analysis - DFKI Kaiserslautern
Goals:
ocrd-anybaseocr-block-segmentation \
-I BIN \
-O SEG-BLOCK-ANYBASEOCR \
-p '{"operation_level": "page"}'
ocrd-anybaseocr-block-segmentation \
-I BIN \
-O SEG-LINE-ANYBASEOCR \
-p '{"operation_level": "region"}'
Development of a semi-automatic open source tool for layout analysis and region extraction and region classification (LAREX) of early prints (Universität Würzburg)
Goals:
NN/FST - Unsupervised OCR-Postcorrection based on Neural Networks and Finite-state Transducers (Universität Leipzig)
Goals:
Optimized use of OCR methods – Tesseract as a component of the OCR-D workflow (Universitätsbibliothek Mannheim)
Goals:
Tesseract Fraktur -> GT4HistOCR
Automated postcorrection of OCRed historical printings with integrated optional interactive postcorrection (Universität München)
Goals:
$ ocrd-cis-ocropy-binarize -I OCR-D-IMG -O BIN
Development of a Repository for OCR Models and an Automatic Font Recognition tool in OCR-D (Universitäten Erlangen, Mainz, Leipzig)
Goals:
ocrd-train -> tesstrain
tesstrain