
GT4HistOCR model evaluation

Since GT4HistOCR was published, various OCR models for several engines have been trained on all of the data or on subsets of it. This is an account of how well these models fare:

  1. Tesseract models trained at UB Mannheim, published 2019-2020
  2. Calamari models trained by the Qurator team, published 2019-2020
  3. Calamari models trained by ZPD / Uni Würzburg (TODO)

The original corpus had various flaws, which have been semi-automatically corrected by UB Mannheim. Also, there have been multiple full training runs by both teams.

Methodology

  1. OCR Prediction

    • Tesseract: uses the standalone multiprocessing CLI tesserocr-batch in single-line mode (PSM 13)

      find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 tesseract-batch -Q4 -l GT4HistOCR_2000000 -x .gt4histocr_old
      find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 tesseract-batch -Q4 -l GT4HistOCR -x .gt4histocr
      
    • Calamari: uses the standalone CLI calamari-predict

      find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 calamari-predict --batch_size 8 -j 4 --extension .gt4histocr_cala.txt --extended_prediction_data --extended_prediction_data_format pred --checkpoint ~/.local/share/ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/\* --files
      
  2. OCR Evaluation

    • String alignment (Needleman-Wunsch), distance calculation (unweighted Levenshtein on graphemes, i.e. after joining combining characters), rate calculation (the denominator is the length of the alignment path, not the GT length or the maximum length) and aggregation (micro-averaging, with parallel/subsample aggregation of the standard deviation after Chan et al. 1979) are all done by cor-asv-ann-compare

      cor-asv-ann-compare -o compare-gt-gt4histocr.json -n Levenshtein -c 20 -F <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.[nb][ri][mn].png/.gt.txt/) <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.png/.gt4histocr.txt/)
      cor-asv-ann-compare -o compare-gt-gt4histocr_cala.json -n Levenshtein -c 20 -F <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.[nb][ri][mn].png/.gt.txt/) <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.[nb][ri][mn].png/.gt4histocr_cala.txt/)
      
    • For generalization, we should really only look at unseen validation data (or at least the test data that was used to select the checkpoint). But for Calamari, which votes over 5 models from cross-fold training, no such split is applicable. So we currently test on the full dataset in both cases. (In the future, both engines should at least be trained and tested in a similar way.)
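The metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual cor-asv-ann-compare implementation: graphemes are formed by joining combining characters, CER is the edit count divided by the length of the optimal alignment path, and per-batch statistics are merged pairwise after Chan et al. 1979.

```python
import unicodedata

def graphemes(s):
    """Split into graphemes by joining combining marks onto their base character."""
    out = []
    for ch in unicodedata.normalize("NFC", s):
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def cer(gt, ocr):
    """Unweighted Levenshtein CER; the denominator is the alignment path length."""
    a, b = graphemes(gt), graphemes(ocr)
    # dp[i][j] = (edit distance, length of an optimal alignment path);
    # tuple-min picks the lower distance, breaking ties by shorter path
    dp = [[(0, 0)] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = (i, i)
    for j in range(1, len(b) + 1):
        dp[0][j] = (j, j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                (dp[i - 1][j - 1][0] + sub, dp[i - 1][j - 1][1] + 1),  # (mis)match
                (dp[i - 1][j][0] + 1, dp[i - 1][j][1] + 1),            # deletion
                (dp[i][j - 1][0] + 1, dp[i][j - 1][1] + 1),            # insertion
            )
    dist, pathlen = dp[-1][-1]
    return dist / pathlen if pathlen else 0.0

def merge_stats(n1, mean1, m2_1, n2, mean2, m2_2):
    """Chan et al. 1979: merge two (count, mean, sum of squared deviations) triples."""
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return n, mean, m2  # stddev = sqrt(m2 / n)
```

Using the alignment path length as denominator keeps the rate symmetric between insertions and deletions; merge_stats allows aggregating the per-line statistics in parallel without a second pass over the data.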

Notation

We will refer to the corpus before the UB Mannheim corrections as the old data and to the corrected corpus as the new data; likewise, models from the earlier training runs are called old models and those from the later runs new models.

Results

Tesseract: old model, old data

  • full dataset: 0.7%, with no I/J errors but many whitespace-at-EOL errors
    ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.007±0.046 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(1844, ('a', 'aͤ')), (1143, ('a', 'ä')), (942, ('f', 'ſ')), (803, ('ñ', 'ñ')), (782, ('o', 'ö')), (743, ('u', 'ü')), (735, ('t', 'r')), (666, ('b', 'h')), (591, ('ſ', 'f')), (586, ('u', 'n')), (480, ('c', 'e')), (442, ('n', 'u')), (431, ('o', 'oͤ')), (391, (' .', '.')), (339, ('.', ',')), (314, ('m', 'n')), (294, ('e', 'c')), (276, ('u', 'uͤ')), (259, ('ſ', ' ſ')), (259, ('r', 't'))], 13816319)
    
  • test dataset: 0.7%, same as above
    ocrd_cor_asv_ann.scripts.compare -  3133 lines 0.007±0.044 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(15, ('a', 'ä')), (15, ('o', 'ö')), (15, ('a', 'aͤ')), (14, ('b', 'h')), (14, ('f', 'ſ')), (13, ('ñ', 'ñ')), (9, ('u', 'ü')), (8, ('t', 'r')), (8, ('c', 'e')), (7, ('n', 'u')), (6, ('ſ', 'f')), (6, ('e', ' e')), (5, ('uͤ', 'ü')), (5, ('t', 't᷑')), (4, (' ', ', ')), (4, ('z', 'ʒ')), (4, (' .', '.')), (4, ('v', ' v')), (4, ('i', 'ĩ')), (4, ('r', 'x'))], 138055)
    
  • test dataset, dta19 subcorpus: 0.5%
  • test dataset, other subcorpora: 1.3%

Tesseract: old model, new data

  • full dataset: 0.9%, with many J/I errors
    ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.009±0.043 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(17800, ('J', 'I')), (6043, ('r', 'ꝛ')), (1843, ('a', 'aͤ')), (1144, ('a', 'ä')), (803, ('ñ', 'ñ')), (782, ('o', 'ö')), (749, ('f', 'ſ')), (741, ('u', 'ü')), (727, ('t', 'r')), (658, ('b', 'h')), (587, ('u', 'n')), (509, ('ſ', 'f')), (459, ('c', 'e')), (444, ('n', 'u')), (431, ('o', 'oͤ')), (384, (' .', '.')), (340, ('.', ',')), (314, ('m', 'n')), (277, ('u', 'uͤ')), (263, ('\n', ' —\n'))], 14129030)
    
  • test dataset:

Tesseract: new model, old data

  • full dataset: 1.3%, with many I/J and whitespace-at-EOL errors
    ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.013±0.059 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(17506, ('I', 'J')), (3921, ('t', 'r')), (1998, ('a', 'aͤ')), (1279, (',', '.')), (1193, ('a', 'ä')), (1108, ('z', 'ʒ')), (1047, ('n', 'u')), (1036, ('f', 'ſ')), (994, ('e', 'c')), (946, ('t', 'c')), (882, ('o', 'ö')), (863, ('u', 'ü')), (826, ('ſ', 'f')), (800, ('u', 'n')), (798, ('t', 'e')), (721, ('n', 'm')), (629, ('i', ' i')), (569, (' ', '. ')), (557, ('ſ', ' ſ')), (530, (' ', '.'))], 13803140)
    
  • full dataset, dta19 subcorpus: 0.9%
  • full dataset, other subcorpora: 2.5% (but no I/J confusion there)
  • test dataset: 1.4%, same as above
    ocrd_cor_asv_ann.scripts.compare -  3133 lines 0.014±0.058 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(174, ('I', 'J')), (45, ('t', 'r')), (18, ('a', 'ä')), (18, ('n', 'u')), (18, ('a', 'aͤ')), (16, ('z', 'ʒ')), (15, ('o', 'ö')), (14, (',', '.')), (11, ('e', 'c')), (10, ('ſ', 'f')), (10, ('f', 'ſ')), (10, ('i', 'r')), (9, ('g', ' g')), (9, ('u', 'ü')), (8, ('i', 'u')), (8, ('a', ' a')), (8, ('t', 'c')), (8, ('r', 'x')), (7, ('i', ' i')), (7, ('t', ' t'))], 138037)
    
  • test dataset, dta19 subcorpus: 1.2%
  • test dataset, other subcorpora: 2.2%

Tesseract: new model, new data

  • full dataset: 1.1%, with many r/ꝛ errors (presumably because the model was trained before the respective corrections)
    ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.011±0.049 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(5737, ('r', 'ꝛ')), (3862, ('t', 'r')), (1997, ('a', 'aͤ')), (1284, (',', '.')), (1194, ('a', 'ä')), (1108, ('z', 'ʒ')), (1047, ('n', 'u')), (957, ('t', 'c')), (947, ('e', 'c')), (882, ('o', 'ö')), (862, ('u', 'ü')), (846, ('f', 'ſ')), (802, ('u', 'n')), (798, ('t', 'e')), (753, ('ſ', 'f')), (721, ('n', 'm')), (630, ('i', ' i')), (557, ('ſ', ' ſ')), (554, (' ', '. ')), (552, (' ', '.'))], 14115872)
    
  • test dataset:

Calamari: old model, old data

TODO

Calamari: new model, old data

  • full dataset: 0.8%, partly attributable to over-aggressive normalization during training (calamari-train --data.pre_proc.processors.6.replacement_groups defaults to extended, which among other things rewrites double quotes to apostrophes and commas; for GT level 2, only spaces+roman_digits+ligatures-consonantal would be appropriate)
    ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.008±0.038 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion / /dev/fd/63 vs /dev/fd/62: ([(11618, (',', '„')), (11125, ("'", '“')), (4527, ("' ", ' ')), (1818, ('a', 'aͤ')), (1292, (',J', 'J')), (1235, (',D', 'D')), (1164, ('a', 'ä')), (1124, (',W', 'W')), (966, (',S', 'S')), (840, ('o', 'ö')), (768, ('u', 'ü')), (695, ("',", ',')), (676, (',A', 'A')), (673, (',N', 'N')), (589, (',E', 'E')), (520, ('f', 'ſ')), (508, ('ſ', 'f')), (452, ('o', 'oͤ')), (395, (',U', 'U')), (393, (',M', 'M'))], 13824358)
    
  • When evaluating with stronger normalization (including Calamari's quotes regularizer): 0.4% – quoting errors are gone
    ocrd_cor_asv_ann.scripts.compare - 302920 lines 0.004±0.032 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(1818, ('a', 'aͤ')), (1164, ('a', 'ä')), (840, ('o', 'ö')), (768, ('u', 'ü')), (452, ('o', 'oͤ')), (418, ('f', 'ſ')), (413, ('ſ', 'f')), (329, ('u', 'uͤ')), (306, ('n', 'u')), (239, ('t', 'r')), (237, ('u', 'n')), (226, ('.', ' .')), (215, ('e', 'é')), (208, (' ', 'e')), (198, (' .', '.')), (194, ('e', 'c')), (186, ('.', ',')), (184, ('a', 'g')), (176, ('s', 'ſ')), (176, (' ', '. '))], 13285521)
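The effect of such a quote regularizer can be illustrated with a small sketch. The mapping below is hypothetical and chosen for illustration only, not Calamari's actual replacement tables: if both GT and OCR text are passed through it before comparison, confusions like (',', '„') and ("'", '“') no longer count as errors.

```python
# Hypothetical quote regularizer for evaluation (illustration only, not
# Calamari's actual replacement_groups): collapse typographic quote
# variants to ASCII on both the GT and the OCR side before comparison.
QUOTE_MAP = str.maketrans({
    "„": '"', "“": '"', "”": '"',   # double quotes
    "‚": "'", "‘": "'", "’": "'",   # single quotes
})

def regularize_quotes(s: str) -> str:
    s = s.replace(",,", '"').replace("''", '"')  # digraph stand-ins for quotes
    return s.translate(QUOTE_MAP)
```

Applying this consistently to both sides is what turns the 0.8% above into 0.4%: the remaining confusions are genuine recognition errors, not transcription conventions.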
    

Calamari: new model, new data

  • full dataset: 1.1%, as above but additionally with many J/I errors
    ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.011±0.043 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(16879, ('J', 'I')), (11619, (',', '„')), (11125, ("'", '“')), (6138, ('r', 'ꝛ')), (4527, ("' ", ' ')), (1817, ('a', 'aͤ')), (1235, (',D', 'D')), (1164, ('a', 'ä')), (1124, (',W', 'W')), (966, (',S', 'S')), (906, (',', 'I')), (841, ('o', 'ö')), (769, ('u', 'ü')), (695, ("',", ',')), (676, (',A', 'A')), (673, (',N', 'N')), (647, ('Jc', 'c')), (589, (',E', 'E')), (449, ('o', 'oͤ')), (432, ('ſ', 'f'))], 13823990)
    
  • When evaluating with stronger normalization (including Calamari's quotes regularizer): 0.6%, as above but additionally with many J/I errors
    ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.006±0.038 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(17776, ('J', 'I')), (6138, ('r', 'ꝛ')), (1817, ('a', 'aͤ')), (1164, ('a', 'ä')), (841, ('o', 'ö')), (769, ('u', 'ü')), (449, ('o', 'oͤ')), (432, ('ſ', 'f')), (337, ('f', 'ſ')), (329, ('u', 'uͤ')), (312, ('n', 'u')), (263, ('t', 'r')), (244, ('u', 'n')), (228, ('.', ' .')), (222, ('e', 'é')), (212, ('s', 'ſ')), (210, ('.', ',')), (209, ('r', 't')), (207, (' ', 'e')), (207, (' ', '. '))], 13846676)
    

Calamari: ZPD model, old data

  • full dataset: 0.8%, like the new model above but without normalization errors (the ambiguity around ⸗/— and ⸗/- arises from inconsistencies between the subcorpora themselves)
    ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.008±0.037 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(15343, ('⸗', '—')), (1982, ('a', 'aͤ')), (1517, ('a', 'ä')), (1255, ('ſ', 'f')), (1242, ('.', ' .')), (1152, ('n', 'u')), (1138, ('o', 'ö')), (1118, ('e', 'c')), (1117, (' ', '“ ')), (880, ('u', 'ü')), (797, ('o', 'oͤ')), (703, ('⸗', '-')), (698, ('l', 'k')), (652, ('t', 'k')), (616, ('r', 'x')), (598, ('n', 'm')), (536, ('u', 'uͤ')), (504, ('b', 'h')), (495, (' ', "' ")), (493, (',', '„'))], 13811028)
    

Calamari: ZPD model, new data

Discussion

Transcription/normalization/regularization rules are extremely important in this high-accuracy regime.