Robert Sachunsky, Kay-Michael Würzner
- `ocrd workspace` command
- `ocrd process` command
- `-O`
ocrd workspace clone \
https://digital.slub-dresden.de/data/kitodo/gottgott_38213401X/gottgott_38213401X_mets.xml .
- `ocrd-anybaseocr-crop`
- `ocrd process` (concatenated + validated invocation of multiple processors within a single call)
- `ocrolib` and processors …
- `common` (from CLIs, from `OLD/`), add new ones:
- `PIL.Image` vs `np.ndarray` conversions: `array2pil`, `pil2array` (integer vs normed float)
- `check_page`, `check_region`, `check_line` (but mix absolute and relative bounds, and make DPI-zoomable)
- `nlbin` deskewing: `estimate_skew_angle`, `estimate_skew` (but resize when rotating, with minimum on variance drop)
- `nlbin` binarization: `estimate_local_whitelevel`, `estimate_thresholds`, `binarize` (but keep exact pixel size, catch NaN)
- `borderclean`
- `remove_hlines`: add height threshold, reduce default width threshold
- `compute_separators_morph`: reduce thresholds (because black colseps can be discontinuous), use only connected components fully inside zone
- `compute_gradmaps`: reduce boxmap minsize (for chopped lines at margins), reduce horizontal blur (to avoid joining lines via as-/descenders)
- `compute_line_seeds`: more robust top/bottom projection rules and no horizontal blur (to avoid joining lines via as-/descenders)
- `hmerge_line_seeds`: new way to ensure horizontal label consistency
- `compute_segmentation`:
  - `fullpage` switch (regions do not have hlines and colseps)
  - `zoom` parameter (thresholds must be DPI-relative)
  - `scale` (blackletter has huge capitals and dense as-/descenders, so avoid splitting lines)
- `tmbdev/ocropy`, forked under `OCR-D/ocropy`:
- `ocrolib`
- `ocrolib` under PyPI `ocrolib`
- `ocrolib` changes (one by one):
- `ocrolib.common` functions
- `OCR-D/ocrd_ocropus`:
- `ocrolib`

→ Ocropy as API is under way!
- `tesserocr`
- `Border`
- `Border`
- `crop_polygons` (rectangles or polygons?)
- `find_tables` (tables as tables?)

2 distinct APIs in Tesseract:
Aspect | DetectOrientationScript() | AnalyseLayout() + Orientation() |
---|---|---|
confidence | yes | no |
orientation | yes | yes |
script | yes | no |
deskewing | no | yes |
textline order | no | yes |
reading direction | no | yes |
→ can yield contradictory results!
resolve conflicts:
available on page and region level
- `ocrd bashlib` as last resort
- `AlternativeImage` support (as case study in `xmlstarlet`)

[Figure: image comparison of Original, Tesseract, Ocropy and Olena-Wolf results; images not preserved]
- `README`s
- `ocrd-anybaseocr-crop`
- `anybaseocr`-based processors
- `Border` element with coordinates
- `AlternativeImage`-sensitive
- `ocrd-anybaseocr-crop`: example

[Figure: cropping example comparing Tesseract and DFKI results; images not preserved]
APIs for PAGE and METS
Use PAGE for
`AlternativeImage`) results
Use METS for
Specific aspects:
`AlternativeImage`)
Warning: Invalid resolution 0 dpi. Using 70 instead.
tessapi.SetVariable('user_defined_dpi', str(200))
info = OcrdExif(pil_image)
if info.resolution != 1:
    # tag available
    dpi = info.resolution
    if info.resolutionUnit == 'cm':
        # convert pixels per centimetre to pixels per inch
        dpi = round(dpi * 2.54)
    ...
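The unit handling above can be isolated into a small helper. The following is a minimal sketch, not part of OCR-D/core: the function name `resolution_to_dpi`, the treatment of values of 1 or below as "no tag available", and the fallback of 300 DPI are all assumptions made for illustration.

```python
def resolution_to_dpi(resolution, unit, fallback_dpi=300):
    """Return an integer DPI value, or fallback_dpi if no usable tag exists.

    Hypothetical helper: `resolution` and `unit` stand for the values that
    the snippet above reads from info.resolution and info.resolutionUnit.
    """
    # Assumption: 0, 1 or None means that the image carries no real tag.
    if resolution is None or resolution <= 1:
        return fallback_dpi
    if unit == 'cm':
        # pixels per centimetre -> pixels per inch
        return round(resolution * 2.54)
    return round(resolution)


if __name__ == '__main__':
    print(resolution_to_dpi(118, 'cm'))      # a typical pixels-per-cm value
    print(resolution_to_dpi(1, 'inches'))    # no tag: falls back
```

The result could then be passed on as a string, as in the `user_defined_dpi` call shown earlier.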
OLR processors that are DPI-aware already: pass DPI from OcrdExif
e.g. Tesseract wrappers
OLR processors that are not DPI-aware yet: modify to zoom:
e.g. Ocropy wrappers
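For processors that are not DPI-aware, "modify to zoom" can be read as scaling every pixel-based threshold by the ratio of the actual to an assumed reference resolution. This is only a sketch of that idea: the helper names and the reference of 300 DPI are assumptions, not the Ocropy wrappers' actual code.

```python
def zoom_for_dpi(dpi, reference_dpi=300):
    """Zoom factor relative to the resolution the thresholds were tuned for.

    Assumption: thresholds were tuned for `reference_dpi` (300 here).
    """
    if dpi <= 0:
        raise ValueError("DPI must be positive, got %r" % dpi)
    return dpi / reference_dpi


def scale_threshold(pixels_at_reference, zoom, minimum=1):
    """Scale a pixel threshold by the zoom factor, never below `minimum`."""
    return max(minimum, int(round(pixels_at_reference * zoom)))


if __name__ == '__main__':
    zoom = zoom_for_dpi(600)
    # a 10-pixel threshold at the reference DPI becomes 20 pixels at 600 DPI
    print(zoom, scale_threshold(10, zoom))
```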
/PcGts/Page/@imageFilename
//AlternativeImage/@filename
producing derived images obligatory:
binarization, despeckling, dewarping
just describing the operation (images optional):
cropping/segmentation (`Coords/@points`), deskewing (`@orientation`)
operation must be applied at some point – preferably when descending to a lower hierarchy level (i.e. during segmentation)
`AlternativeImage` if present on their hierarchy level of interest
`AlternativeImage`) – by:
`Coords/@points`
`(Page|TextRegion)/@orientation`
`@orientation`, but:
`Grid`?
`@scale`?

Annotation must be sufficient to calculate pixel positions in `AlternativeImage` from those in `@imageFilename` (e.g. to cut parents) or vice versa (e.g. to add children).
`@comments` classification:
- `AlternativeImage` entries
- `@comments`, or always append/choose last?
- `AlternativeImage` and `@orientation`
- `@comments`, or expect the image to be deskewed already?
- `AlternativeImage` and `Border`
- `@comments`, or expect the image to be cropped already?
- `AlternativeImage` is larger than `Coords/@points` rectangle
- `@scale` or prohibit rescaling altogether?
- `Grid`?

`OCR-D/core`
image/offset recursion:
- `Workspace.image_from_page`: on the page level (original or derived)
- `Workspace.image_from_segment`: all levels below page (derived)

what it does:
- `AlternativeImage` or generate from parent (including:
- (`image_from_page` only:) also return `OcrdExif` instance for original

how to use:
from ocrd_modelfactory import page_from_file
...
page_id = input_file.pageId or input_file.ID  # for logging
pcgts = page_from_file(workspace.download_file(input_file))
page = pcgts.get_Page()
# page level: image, its offset, and metadata of the original
page_image, page_xywh, page_image_info = workspace.image_from_page(
    page, page_id)
...
for region in page.get_TextRegion():
    # region level: derived from the page image and its offset
    region_image, region_xywh = workspace.image_from_segment(
        region, page_image, page_xywh)
    ...
    for line in region.get_TextLine():
        # line level: derived from the region image and its offset
        line_image, line_xywh = workspace.image_from_segment(
            line, region_image, region_xywh)
        ...
OCR-D/core
add image to METS:
Workspace.save_image_file
what it does:
`PIL.Image` object
`Workspace.add_file`
how to use:
...
file_id = input_file.ID.replace(self.input_file_grp,
                                'OCR-D-IMG-DEWARP')
...
file_path = workspace.save_image_file(image,
                                      file_id + '_' + region.id + '_' + line.id,
                                      page_id=input_file.pageId,
                                      file_grp='OCR-D-IMG-DEWARP')
# reference the saved image from the line and extend its @comments
line.add_AlternativeImage(AlternativeImageType(
    filename=file_path,
    comments=comments + ',' + 'dewarped'))
OCR-D/core
convert from absolute coordinates to relative:
ocrd_utils.coordinates_of_segment
what it does:
how to use:
line_polygon = coordinates_of_segment(line, region_image, region_xywh)
line_polygon = resegment(line_polygon, region_labels, region_image_bin, line.id)
line_polygon = coordinates_for_segment(line_polygon, region_image, region_xywh)
line.get_Coords().points = points_from_polygon(line_polygon)
OCR-D/core
convert from relative coordinates to absolute:
ocrd_utils.coordinates_for_segment
what it does:
how to use:
...
for word_no, word in enumerate(iterate_level(tessapi.GetIterator(), RIL.WORD)):
    word_id = '%s_word%04d' % (line.id, word_no)
    bbox = word.BoundingBox(RIL.WORD)
    points = points_from_polygon(coordinates_for_segment(
        polygon_from_x0y0x1y1(bbox),
        None,  # image not needed if element cannot have angle
        line_xywh))
    word = WordType(id=word_id, Coords=CoordsType(points))
    line.add_Word(word)
OCR-D/core
`numpy.ndarray`):
`ocrd_utils.rotate_coordinates`
`image_from_polygon`
`PIL.Image.crop`:
`ocrd_utils.crop_image`
ocrd-tesserocr-*
ocrd-cis-ocropy-*
ocrd-olena-binarize
specified by comma-separated list on CLI:
$ ocrd-olena-binarize -I OCR-D-GT-SEG-LINE -O OCR-D-SEG-PAGE,OCR-D-IMG-BIN
$ ocrd-cor-asv-ann-evaluate -I OCR-D-GT-SEG-LINE,OCR-D-OCR-TESS,OCR-D-COR-ASV-ANN
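A processor that receives such a comma-separated option has to split it before use. The following sketch shows one way to do that; the helper names are hypothetical, and the convention that the first output group takes PAGE files and the second takes images (falling back to the first if only one is given) is an assumption for illustration.

```python
def split_file_groups(option):
    """Split a comma-separated -I/-O value into a list of file group names."""
    groups = [group.strip() for group in option.split(',')]
    if not all(groups):
        raise ValueError("empty file group in %r" % option)
    return groups


def page_and_image_group(output_option):
    """Return (PAGE group, image group) for an -O value.

    Assumption: the first position is for PAGE output, the second for
    derived images; with a single group, both share it.
    """
    groups = split_file_groups(output_option)
    page_grp = groups[0]
    image_grp = groups[1] if len(groups) > 1 else groups[0]
    return page_grp, image_grp


if __name__ == '__main__':
    print(page_and_image_group('OCR-D-SEG-PAGE,OCR-D-IMG-BIN'))
    print(split_file_groups('OCR-D-GT-SEG-LINE,OCR-D-OCR-TESS,OCR-D-COR-ASV-ANN'))
```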
TextLine/TextEquiv/Unicode
alignment via
`AlternativeImage` references
→ generated images must also be added to METS
`OCR-D-IMG-BIN`, `OCR-D-IMG-DESKEW`, `OCR-D-IMG-DEWARP` etc.
`output_file_grp` second position,
`ocrd process` as engine:
digraph G {
node[shape=box];
compound=true;
page_segmentation[label="Region segmentation"];
line_segmentation[label="Line segmentation"];
text_optimization[label="Text optimization"];
subgraph cluster_preprocessing_page {
label = "Page preprocessing";
binarization_page[label="Binarization"];
cropping[label="Cropping"];
deskewing_page[label="Deskewing"];
despeckling_page[label="Despeckling"];
dewarping_page[label="Dewarping"];
binarization_page -> cropping -> deskewing_page -> despeckling_page -> dewarping_page
{rank=same; binarization_page, cropping, deskewing_page, despeckling_page, dewarping_page}
}
subgraph cluster_preprocessing_segment {
label = "Region preprocessing";
deskewing_segment[label="Deskewing"];
despeckling_segment[label="Despeckling"];
binarization_segment[label="Binarization"];
binarization_segment -> despeckling_segment -> deskewing_segment
{rank=same; deskewing_segment, despeckling_segment, binarization_segment}
}
subgraph cluster_preprocessing_line {
label = "Line preprocessing";
binarization_line[label="Binarization"];
dewarping_line[label="Dewarping"];
binarization_line -> dewarping_line
{rank=same; binarization_line, dewarping_line}
}
subgraph cluster_ocr_line {
label = "Text recognition";
ocr_one[label="OCR 1"];
ocr_two[label="OCR 2"];
ocr_n[label="OCR n"];
ocr_one -> ocr_two -> ocr_n[style=dotted,dir=none]
{rank=same; ocr_one, ocr_two, ocr_n}
}
binarization_page -> page_segmentation[label="Pages",ltail=cluster_preprocessing_page]
page_segmentation -> binarization_segment[label="Regions",lhead=cluster_preprocessing_segment]
binarization_segment -> line_segmentation[label="Regions",ltail=cluster_preprocessing_segment]
line_segmentation -> binarization_line[label="Lines",lhead=cluster_preprocessing_line]
binarization_line -> ocr_one[label="Line images", ltail=cluster_preprocessing_line, lhead=cluster_ocr_line]
ocr_one -> text_optimization[label="Line strings",ltail=cluster_ocr_line]
}
Processor | Status | Note |
---|---|---|
Binarization | | |
`ocrd-olena-binarize` | ✓ | |
`ocrd-anybaseocr-binarize` | ✗ | Interface |
`ocrd-cis-ocropy-binarize` | ✓ | |
`ocrd-kraken-binarize` | ✗ | Invocation |
`ocrd-tesserocr-binarize` | ✓ | |
Despeckling | | |
`ocrd-cis-ocropy-denoise` | ✓ | |

Processor | Status | Note |
---|---|---|
Cropping | | |
`ocrd-anybaseocr-crop` | ✓ | |
`ocrd-kraken-crop` | ✗ | Interface |
`ocrd-tesserocr-crop` | ✓ | |
Deskewing | | |
`ocrd-anybaseocr-deskew` | ✗ | Interface |
`ocrd-cis-ocropy-deskew` | ✓ | |
`ocrd-tesserocr-deskew` | ✓ | |
Dewarping | | |
`ocrd-anybaseocr-dewarp` | ✗ | Interface |
`ocrd-cis-ocropy-dewarp` | ✓ | |

Processor | Status | Note |
---|---|---|
Region Segmentation | | |
`ocrd-tesserocr-segment-region` | ✓ | |
Clipping/Resegmentation | | |
`ocrd-cis-ocropy-clip` | ✓ | |
`ocrd-cis-ocropy-resegment` | ✓ | |
`ocrd-segment-repair` | ✓ | |
Line Segmentation | | |
`ocrd-ocropy-segment` | ✗ | Invocation |
`ocrd-kraken-segment` | ✗ | Invocation |
`ocrd-tesserocr-segment-line` | ✓ | |

Processor | Status | Note |
---|---|---|
Font identification | | |
`ocrd-typegroups-classifier` | ✓ | |
Text recognition | | |
`ocrd-cis-ocropy-recognize` | ✓ | |
`ocrd-tesserocr-recognize` | ✓ | |
`ocrd-calamari-recognize` | ✓ | |

Processor | Status | Note |
---|---|---|
OCR alignment | | |
`ocrd-cis-align` | ✓ | |
Text optimization | | |
`ocrd-cor-asv-ann-process` | ✓ | |
`ocrd-cor-asv-fst-process` | ✓ | |
`ocrd-cis-profile` | ✓ | |
`ocrd-cis-postcorrection` | ✗ | Interface |
`ocrd-keraslm-rate` | ✓ | |
OCR evaluation | | |
`ocrd-keraslm-rate` | ✓ | |
`ocrd-cor-asv-ann-evaluate` | ✓ | |
`ocrd-dinglehopper` | ✓ | |

#
# create workspace from existing METS
ocrd workspace clone \
https://digital.slub-dresden.de/data/kitodo/gottgott_38213401X/gottgott_38213401X_mets.xml .
#
# crop with anybaseocr
ocrd-anybaseocr-crop -I ORIGINAL -O CROPPED -m mets.xml
#
# binarize on page level
ocrd-cis-ocropy-binarize -I CROPPED -O BIN -p <(echo '{"level-of-operation": "page"}') -m mets.xml
#
# deskew on page level
ocrd-cis-ocropy-deskew -I BIN -O DESKEWED -p <(echo '{"level-of-operation": "page"}') -m mets.xml
#
# segment into regions
ocrd-tesserocr-segment-region -I DESKEWED -O REGIONS -m mets.xml
#
# clip regions
ocrd-cis-ocropy-clip -I REGIONS -O CLIPPED -p <(echo '{"level-of-operation": "region"}') -m mets.xml
#
# binarize on region level
ocrd-cis-ocropy-binarize -I CLIPPED -O RBIN -p <(echo '{"level-of-operation": "region"}') -m mets.xml
#
# deskew on region level
ocrd-cis-ocropy-deskew -I RBIN -O RDESKEWED -p <(echo '{"level-of-operation": "region"}') -m mets.xml
#
# segment into lines
ocrd-tesserocr-segment-line -I RDESKEWED -O LINES -m mets.xml
#
# clip lines
ocrd-cis-ocropy-clip -I LINES -O LCLIPPED -p <(echo '{"level-of-operation": "line"}') -m mets.xml
#
# binarize on line level
ocrd-cis-ocropy-binarize -I LCLIPPED -O LBIN -p <(echo '{"level-of-operation": "line"}') -m mets.xml
#
# dewarp on line level
ocrd-cis-ocropy-dewarp -I LBIN -O DEWARPED -p <(echo '{"level-of-operation": "line"}') -m mets.xml
#
# recognize text
ocrd-tesserocr-recognize -I DEWARPED -O TEXT -p <(echo '{"model": "frk+deu+Fraktur+Latin"}') -m mets.xml
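A chain like the one above only works if every step reads a file group that already exists and writes one that does not. That is the kind of check a "concatenated + validated invocation" can run before anything executes. The sketch below illustrates the idea only; `validate_chain` is a hypothetical helper, not the implementation of `ocrd process`.

```python
def validate_chain(steps, initial_groups):
    """Check input/output file groups of a processor chain before running it.

    steps: list of (processor, input groups, output groups) tuples.
    Returns the set of file groups available after the last step.
    """
    available = set(initial_groups)
    for processor, inputs, outputs in steps:
        missing = [grp for grp in inputs if grp not in available]
        if missing:
            raise ValueError("%s: input group(s) %s do not exist yet"
                             % (processor, missing))
        clashes = [grp for grp in outputs if grp in available]
        if clashes:
            raise ValueError("%s: output group(s) %s already exist"
                             % (processor, clashes))
        available.update(outputs)
    return available


if __name__ == '__main__':
    # first steps of the script above
    steps = [
        ('ocrd-anybaseocr-crop', ['ORIGINAL'], ['CROPPED']),
        ('ocrd-cis-ocropy-binarize', ['CROPPED'], ['BIN']),
        ('ocrd-cis-ocropy-deskew', ['BIN'], ['DESKEWED']),
        ('ocrd-tesserocr-segment-region', ['DESKEWED'], ['REGIONS']),
    ]
    print(sorted(validate_chain(steps, ['ORIGINAL'])))
```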
graph LR
bin["BIN{PAGE}:olena{wolf}"]
clip["CLIP{BLOCK}"]
deskew["DESKEW{BLOCK}:tesserocr"]
reseg[RESEGMENT]
dew[DEWARP:ocropy]
ocr[OCR:*]
bin --> clip
clip --> deskew
deskew --> reseg
reseg --> dew
dew --> ocr
OCR | CER[%] |
---|---|
OCRO{fraktur} | 23.7 |
OCRO{fraktur(jze)} | 28.4 |
TESS{Fraktur} | 12.2 |
TESS{frk} | 11.9 |
TESS{frk+deu} | 11.5 |
→ layout/preprocessing still not good enough
→ Ocropy suffers from Fraktur/Antiqua font switching (Frakturwechsel): nearly no coverage of Antiqua or Arabic numerals, and no way to mix models
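For reading the CER[%] columns: the character error rate is commonly defined as the edit distance between OCR output and ground truth, divided by the ground truth length. The slides do not state how the figures were normalized or aggregated, so the following is only the textbook definition, with hypothetical function names.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return prev[-1]


def cer_percent(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


if __name__ == '__main__':
    # one substituted character out of four
    print(cer_percent("abcd", "abxd"))
```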
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:ocropy"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 23.2 | (-0.5 for s/tesserocr/ocropy/) |
OCRO{fraktur(jze)} | 28.0 | (-0.4 for s/tesserocr/ocropy/) |
TESS{Fraktur} | 12.1 | (-0.1 for s/tesserocr/ocropy/) |
TESS{frk} | 11.9 | (±0 for s/tesserocr/ocropy/) |
TESS{frk+deu} | 11.4 | (-0.1 for s/tesserocr/ocropy/) |
→ Ocropy deskews slightly better than Tesseract (the latter being more conservative)
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 24.4 | (+1.2 for s//-DESKEW/) |
OCRO{fraktur(jze)} | 29.3 | (+1.3 for s//-DESKEW/) |
TESS{Fraktur} | 12.8 | (+0.7 for s//-DESKEW/) |
TESS{frk} | 12.6 | (+0.7 for s//-DESKEW/) |
TESS{frk+deu} | 12.2 | (+0.8 for s//-DESKEW/) |
→ Deskewing helps, but not that much (i.e. either GT images already have little skew, or dewarping can compensate)
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 52.6 | (+28.2 for s//-DEWARP/) |
OCRO{fraktur(jze)} | 61.4 | (+32.1 for s//-DEWARP/) |
TESS{Fraktur} | 13.3 | (+ 0.5 for s//-DEWARP/) |
TESS{frk} | 13.2 | (+ 0.6 for s//-DEWARP/) |
TESS{frk+deu} | 12.8 | (+ 0.6 for s//-DEWARP/) |
→ Ocropy is very sensitive to warped images; Tesseract is nearly immune
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 53.2 | (+0.6 for s/-DEWARP/-DEWARP-DESKEW/) |
OCRO{fraktur(jze)} | 63.0 | (+1.6 for s/-DEWARP/-DEWARP-DESKEW/) |
TESS{Fraktur} | 13.5 | (+0.2 for s/-DEWARP/-DEWARP-DESKEW/) |
TESS{frk} | 13.3 | (+0.1 for s/-DEWARP/-DEWARP-DESKEW/) |
TESS{frk+deu} | 12.9 | (+0.1 for s/-DEWARP/-DEWARP-DESKEW/) |
→ Deskewing apparently cannot replace dewarping, i.e. either deskewing is too bad, or dewarping cannot compensate for missing deskewing in the first place.
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 56.8 | (+3.6 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
OCRO{fraktur(jze)} | 66.0 | (+3.0 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
TESS{Fraktur} | 13.6 | (+0.1 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
TESS{frk} | 13.7 | (+0.4 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
TESS{frk+deu} | 13.3 | (+0.4 for s/-DEWARP-DESKEW/-DEWARP-DESKEW-RESEG/) |
→ Resegmentation primarily helps Ocropy, i.e. Tesseract is much less sensitive to invading as-/descenders from neighbouring lines
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> deskew
deskew["DESKEW{PAGE}:tesserocr"]
deskew --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 24.6 | (+0.9 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
OCRO{fraktur(jze)} | 29.7 | (+1.3 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
TESS{Fraktur} | 13.4 | (+1.2 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
TESS{frk} | 13.2 | (+1.3 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
TESS{frk+deu} | 12.7 | (+1.2 for s/DESKEW{BLOCK}/DESKEW{PAGE}/) |
→ Deskewing (on average) works slightly better on the region level than on the page level
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> den
den["DENOISE{PAGE}:ocropy"]
den --> deskew
deskew["DESKEW{PAGE}:ocropy"]
deskew --> clip
clip["CLIP{BLOCK}"]
clip --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 19.1 | (-5.5 for s//DENOISE{PAGE}/) |
OCRO{fraktur(jze)} | 23.5 | (-6.2 for s//DENOISE{PAGE}/) |
TESS{Fraktur} | 11.7 | (-1.7 for s//DENOISE{PAGE}/) |
TESS{frk} | 11.8 | (-1.4 for s//DENOISE{PAGE}/) |
TESS{frk+deu} | 11.2 | (-1.5 for s//DENOISE{PAGE}/) |
→ Denoising can improve Wolf binarization, esp. for Ocropy
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 23.9 | (+0.2 for s//-CLIP{BLOCK}/) |
OCRO{fraktur(jze)} | 28.6 | (+0.2 for s//-CLIP{BLOCK}/) |
TESS{Fraktur} | 12.3 | (+0.1 for s//-CLIP{BLOCK}/) |
TESS{frk} | 12.3 | (+0.4 for s//-CLIP{BLOCK}/) |
TESS{frk+deu} | 11.9 | (+0.4 for s//-CLIP{BLOCK}/) |
→ Clipping on the region level gives only minimal improvement (if resegmentation is used)
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 31.2 | (+7.5 for s//-RESEG/) |
OCRO{fraktur(jze)} | 35.1 | (+6.7 for s//-RESEG/) |
TESS{Fraktur} | 13.2 | (+1.0 for s//-RESEG/) |
TESS{frk} | 12.7 | (+0.8 for s//-RESEG/) |
TESS{frk+deu} | 12.4 | (+0.9 for s//-RESEG/) |
→ Resegmentation primarily helps Ocropy, i.e. Tesseract is much less sensitive to invading as-/descenders from neighbouring lines
→ Resegmentation needs deskewing
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> clip2
clip2["CLIP{LINE}"]
clip2 --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 24.1 | (+0.4 for s/RESEG/CLIP{LINE}/) |
OCRO{fraktur(jze)} | 28.9 | (+0.5 for s/RESEG/CLIP{LINE}/) |
TESS{Fraktur} | 12.9 | (+0.7 for s/RESEG/CLIP{LINE}/) |
TESS{frk} | 12.7 | (+0.8 for s/RESEG/CLIP{LINE}/) |
TESS{frk+deu} | 12.4 | (+0.9 for s/RESEG/CLIP{LINE}/) |
→ Clipping on the line level cannot quite replace resegmentation; with Tesseract it does not help at all
graph LR
bin["BIN{PAGE}:olena{kim}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 24.9 | (+1.2 for s/wolf/kim/) |
OCRO{fraktur(jze)} | 29.6 | (+1.2 for s/wolf/kim/) |
TESS{Fraktur} | 14.0 | (+1.8 for s/wolf/kim/) |
TESS{frk} | 14.2 | (+2.3 for s/wolf/kim/) |
TESS{frk+deu} | 13.8 | (+2.3 for s/wolf/kim/) |
→ Kim is noticeably worse than Wolf (on average)
graph LR
bin["BIN{PAGE}:olena{sauvola}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 23.0 | (-0.7 for s/wolf/sauvola/) |
OCRO{fraktur(jze)} | 27.9 | (-0.5 for s/wolf/sauvola/) |
TESS{Fraktur} | 12.0 | (-0.2 for s/wolf/sauvola/) |
TESS{frk} | 11.8 | (-0.1 for s/wolf/sauvola/) |
TESS{frk+deu} | 11.5 | (±0 for s/wolf/sauvola/) |
→ Basic Sauvola is slightly better than Wolf (on average)
graph LR
bin["BIN{PAGE}:olena{sauvola-ms-split}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 22.8 | (-0.9 for s/wolf/sauvola-ms-split/) |
OCRO{fraktur(jze)} | 27.6 | (-0.8 for s/wolf/sauvola-ms-split/) |
TESS{Fraktur} | 11.6 | (-0.6 for s/wolf/sauvola-ms-split/) |
TESS{frk} | 11.5 | (-0.4 for s/wolf/sauvola-ms-split/) |
TESS{frk+deu} | 11.1 | (-0.4 for s/wolf/sauvola-ms-split/) |
→ This Sauvola variant is even better (on average)
graph LR
bin["BIN{PAGE}:ocropy{nlbin}"]
bin --> clip
clip["CLIP{BLOCK}"]
clip --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 36.4 | (+12.7 for s/wolf/ocropy/) |
OCRO{fraktur(jze)} | 41.1 | (+12.7 for s/wolf/ocropy/) |
TESS{Fraktur} | 15.7 | (+ 3.5 for s/wolf/ocropy/) |
TESS{frk} | 14.9 | (+ 3.0 for s/wolf/ocropy/) |
TESS{frk+deu} | 14.8 | (+ 3.3 for s/wolf/ocropy/) |
→ Ocropy{nlbin} is really bad! (but what about `perc` / `range` / `threshold` / `lo` / `hi` parameters?)
graph LR
clip["CLIP{BLOCK}"]
clip --> bin
bin["BIN{BLOCK}:ocropy{nlbin}"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 23.0 | (-13.4 for s/BIN{PAGE}/BIN{BLOCK}/) |
OCRO{fraktur(jze)} | 27.4 | (-13.7 for s/BIN{PAGE}/BIN{BLOCK}/) |
TESS{Fraktur} | 11.5 | (- 4.2 for s/BIN{PAGE}/BIN{BLOCK}/) |
TESS{frk} | 11.4 | (- 3.5 for s/BIN{PAGE}/BIN{BLOCK}/) |
TESS{frk+deu} | 11.2 | (- 3.6 for s/BIN{PAGE}/BIN{BLOCK}/) |
→ Binarization on the region level is superior to the page level (if using Ocropy{nlbin}!)
graph LR
clip["CLIP{BLOCK}"]
clip --> bin
bin["BIN{BLOCK}:tesserocr"]
bin --> deskew
deskew["DESKEW{BLOCK}:tesserocr"]
deskew --> reseg
reseg["RESEG"]
reseg --> dewarp
dewarp["DEWARP:ocropy"]
dewarp --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 55.8 | (+32.8 for s/ocropy{nlbin}/tesserocr/) |
OCRO{fraktur(jze)} | 63.5 | (+36.1 for s/ocropy{nlbin}/tesserocr/) |
TESS{Fraktur} | 15.0 | (+ 3.5 for s/ocropy{nlbin}/tesserocr/) |
TESS{frk} | 15.3 | (+ 3.9 for s/ocropy{nlbin}/tesserocr/) |
TESS{frk+deu} | 14.8 | (+ 3.6 for s/ocropy{nlbin}/tesserocr/) |
→ Ocropy cannot cope with Tesseract binarization.
→ But even Tesseract prefers Ocropy binarization!
graph LR
bin["BIN{BLOCK}:ocropy{nlbin}"]
bin --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 55.4 | (+32.4 for s//-CLIP-DESKEW-RESEG-DEWARP/)1 |
OCRO{fraktur(jze)} | 64.6 | (+37.2 for s//-CLIP-DESKEW-RESEG-DEWARP/)1 |
TESS{Fraktur} | 12.9 | (+ 1.4 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
TESS{frk} | 12.9 | (+ 1.5 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
TESS{frk+deu} | 12.7 | (+ 1.5 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
→ Without the extra preprocessors, Ocropy is lost on GT.
1: "naïve" configuration of Ocropy
graph LR
bin["BIN{BLOCK}:tesserocr"]
bin --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 66.4 | (+10.6 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
OCRO{fraktur(jze)} | 72.4 | (+ 8.9 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
TESS{Fraktur} | 16.2 | (+ 1.2 for s//-CLIP-DESKEW-RESEG-DEWARP/)2 |
TESS{frk} | 16.9 | (+ 1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/)2 |
TESS{frk+deu} | 16.4 | (+ 1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/)2 |
→ Without the extra processors, Tesseract still works.
2: "naïve" configuration of Tesseract (but how does this compare to the CLI?)
graph LR
bin["BIN{PAGE}:olena{wolf}"]
bin --> ocr
ocr[OCR:*]
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 56.8 | (+33.1 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
OCRO{fraktur(jze)} | 66.0 | (+37.6 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
TESS{Fraktur} | 13.6 | (+ 1.4 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
TESS{frk} | 13.5 | (+ 1.6 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
TESS{frk+deu} | 13.3 | (+ 1.8 for s//-CLIP-DESKEW-RESEG-DEWARP/) |
→ Again, Ocropy requires the extra preprocessors, while Tesseract is surprisingly insensitive to invading neighbouring regions/lines and to skewed and warped images.
graph LR
bin["BIN{PAGE}:ocropy{nlbin}"]
deskew["DESKEW{PAGE}:ocropy{nlbin}"]
lseg["SEG{LINE}:ocropy{gpageseg}"]
bin --> deskew
deskew --> lseg
lseg --> ocr
ocr[OCR:*]
style bin fill:#ccc
style deskew fill:#ccc
style lseg fill:#ccc
OCR | CER[%] | comparison |
---|---|---|
OCRO{fraktur} | 8.33 | (-14.5 for s/OCR-D/4HistOCR/) |
OCRO{fraktur(jze)} | 6.27 | (-21.3 for s/OCR-D/4HistOCR/) |
TESS{Fraktur} | 8.33 | (-3.3 for s/OCR-D/4HistOCR/) |
TESS{frk} | ? | (- ? for s/OCR-D/4HistOCR/) |
TESS{frk+deu} | ? | (- ? for s/OCR-D/4HistOCR/) |
→ Much better than best measured OCR-D configuration!
`AlternativeImage`, DPI, packaging, logging, documentation)