# GT4HistOCR model evaluation

Since [GT4HistOCR](https://zenodo.org/record/1344132) was published, various OCR models have been trained on all or parts of the data for several engines. This is an account of how well these models fare:

1. Tesseract models [trained at UB Mannheim](https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR) – [published 2019-2020](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/GT4HistOCR/)
2. Calamari models [trained by Qurator team](https://github.com/qurator-spk/train-calamari-gt4histocr) – [published 2019-2020](https://qurator-data.de/calamari-models/GT4HistOCR/)
3. Calamari models trained by ZPD / Uni Würzburg ... TODO

The original corpus had various flaws, which have been semi-automatically [corrected by UB Mannheim](https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR). Also, there have been multiple full training runs by both teams.

## Methodology

1. OCR Prediction
    - Tesseract: uses multiprocessing single-line (PSM13) standalone CLI [tesserocr-batch](https://github.com/ASVLeipzig/cor-asv-ann-data-processing)

          find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 tesseract-batch -Q4 -l GT4HistOCR_2000000 -x .gt4histocr_old
          find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 tesseract-batch -Q4 -l GT4HistOCR -x .gt4histocr

    - Calamari: uses standalone CLI `calamari-predict`

          find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 calamari-predict --batch_size 8 -j 4 --extension .gt4histocr_cala.txt --extended_prediction_data --extended_prediction_data_format pred --checkpoint ~/.local/share/ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/\* --files

2. OCR Evaluation
    - String alignment (Needleman-Wunsch), distance calculation (unweighted Levenshtein on graphemes, i.e. after joining combining characters), rate calculation (the denominator is the length of the alignment path, not the GT length or the maximum length) and aggregation (micro-averaging, with parallel/subsample aggregation of the stddev following Chan et al. 1979) are done by [cor-asv-ann-compare](https://github.com/ASVLeipzig/cor-asv-ann/blob/master/ocrd_cor_asv_ann/scripts/compare.py)

          cor-asv-ann-compare -o compare-gt-gt4histocr.json -n Levenshtein -c 20 -F <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed 's/\.[nb][ri][mn]\.png$/.gt.txt/') <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed 's/\.png$/.gt4histocr.txt/')
          cor-asv-ann-compare -o compare-gt-gt4histocr_cala.json -n Levenshtein -c 20 -F <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed 's/\.[nb][ri][mn]\.png$/.gt.txt/') <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed 's/\.[nb][ri][mn]\.png$/.gt4histocr_cala.txt/')

    - To assess generalization, we should _actually_ only look at unseen validation data (or at least the test data used to select the checkpoint). But for Calamari, which uses a voting scheme over 5 models from cross-fold training, no such split is applicable. So we currently test on the full dataset in both cases. (In the future, both engines should at least be trained and tested in a similar way.)

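
The rate definition and aggregation described above can be illustrated with a minimal Python sketch. It is not the code of `cor-asv-ann-compare`: the names `graphemes`, `align_cost`, `Stats` and `line_stats` are hypothetical, preferring the shortest path among equally cheap alignments is an assumption, and weighting each line by its alignment length is assumed from the micro-averaging mentioned above.

```python
import unicodedata
from dataclasses import dataclass


def graphemes(text):
    """Attach combining characters to their preceding base character."""
    clusters = []
    for char in text:
        if clusters and unicodedata.combining(char):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


def align_cost(ocr, gt):
    """Return (edit distance, alignment path length) under unit costs."""
    a, b = graphemes(ocr), graphemes(gt)
    # each cell holds (distance, path length) of the best alignment so far
    prev = [(j, j) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [(i, i)]
        for j in range(1, len(b) + 1):
            diag, up, left = prev[j - 1], prev[j], curr[j - 1]
            curr.append(min(
                (diag[0] + (a[i - 1] != b[j - 1]), diag[1] + 1),  # match or substitution
                (up[0] + 1, up[1] + 1),      # grapheme only in OCR
                (left[0] + 1, left[1] + 1),  # grapheme only in GT
            ))  # assumption: ties are broken towards the shorter path
        prev = curr
    return prev[-1]


@dataclass
class Stats:
    length: int = 0    # total alignment path length
    mean: float = 0.0  # micro-averaged error rate
    m2: float = 0.0    # length-weighted sum of squared deviations

    def merge(self, other):
        """Combine two partial aggregates (pairwise update after Chan et al.)."""
        total = self.length + other.length
        if total == 0:
            return Stats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.length / total
        m2 = self.m2 + other.m2 + delta ** 2 * self.length * other.length / total
        return Stats(total, mean, m2)

    @property
    def stddev(self):
        return (self.m2 / self.length) ** 0.5 if self.length else 0.0


def line_stats(ocr, gt):
    """Statistics of a single line pair: rate = distance / path length."""
    dist, length = align_cost(ocr, gt)
    return Stats(length, dist / length if length else 0.0, 0.0)


# invented line pairs, for illustration only
total = Stats()
for ocr, gt in [("abc", "abd"), ("hello", "hello")]:
    total = total.merge(line_stats(ocr, gt))
print(f"{total.mean:.3f}±{total.stddev:.3f} CER")
```

Because `merge` only needs the three numbers of each partial result, lines can be scored in parallel and combined in any order. With the alignment path as denominator, a line pair like `"abc"` vs. `"ab"` yields 1/3 rather than the 1/2 that dividing by the GT length would give.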
### Notation

We will refer to the
- [GT4HistOCR_2000000 Tesseract model](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/GT4HistOCR_2000000.traineddata) from 2019-08-22 as the **old model**
- [GT4HistOCR Tesseract model](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/GT4HistOCR/tessdata_best/GT4HistOCR.traineddata) from 2020-02-11 as the **new model**
- [GT4HistOCR Calamari v0.3 model](https://qurator-data.de/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/) from 2019-07-22 as the **old model**
- [GT4HistOCR Calamari v1.0 model](https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/) from 2019-12-11 as the **new model**
- **ZPD model**
- [original corpus](https://zenodo.org/record/1344132) as the **old data**
- [corrected corpus](https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR) as the **new data**

## Results

### Tesseract: old model, old data

* full dataset: **0.7%** with no `I/J` error but lots of whitespace at EOL errors
```
ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.007±0.046 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(1844, ('a', 'aͤ')), (1143, ('a', 'ä')), (942, ('f', 'ſ')), (803, ('ñ', 'ñ')), (782, ('o', 'ö')), (743, ('u', 'ü')), (735, ('t', 'r')), (666, ('b', 'h')), (591, ('ſ', 'f')), (586, ('u', 'n')), (480, ('c', 'e')), (442, ('n', 'u')), (431, ('o', 'oͤ')), (391, (' .', '.')), (339, ('.', ',')), (314, ('m', 'n')), (294, ('e', 'c')), (276, ('u', 'uͤ')), (259, ('ſ', ' ſ')), (259, ('r', 't'))], 13816319)
```
* test dataset: **0.7%**, same as above
```
ocrd_cor_asv_ann.scripts.compare - 3133 lines 0.007±0.044 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(15, ('a', 'ä')), (15, ('o', 'ö')), (15, ('a', 'aͤ')), (14, ('b', 'h')), (14, ('f', 'ſ')), (13, ('ñ', 'ñ')), (9, ('u', 'ü')), (8, ('t', 'r')), (8, ('c', 'e')), (7, ('n', 'u')), (6, ('ſ', 'f')), (6, ('e', ' e')), (5, ('uͤ', 'ü')), (5, ('t', 't᷑')), (4, (' ', ', ')), (4, ('z', 'ʒ')), (4, (' .', '.')), (4, ('v', 
' v')), (4, ('i', 'ĩ')), (4, ('r', 'x'))], 138055)
```
* test dataset, `dta19` subcorpus: **0.5%**
* test dataset, other subcorpora: **1.3%**

### Tesseract: old model, new data

* full dataset: **0.9%** with lots of `J/I` errors
```
ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.009±0.043 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(17800, ('J', 'I')), (6043, ('r', 'ꝛ')), (1843, ('a', 'aͤ')), (1144, ('a', 'ä')), (803, ('ñ', 'ñ')), (782, ('o', 'ö')), (749, ('f', 'ſ')), (741, ('u', 'ü')), (727, ('t', 'r')), (658, ('b', 'h')), (587, ('u', 'n')), (509, ('ſ', 'f')), (459, ('c', 'e')), (444, ('n', 'u')), (431, ('o', 'oͤ')), (384, (' .', '.')), (340, ('.', ',')), (314, ('m', 'n')), (277, ('u', 'uͤ')), (263, ('\n', ' —\n'))], 14129030)
```
* test dataset:

### Tesseract: new model, old data

* full dataset: **1.3%** with lots of `I/J` and whitespace at EOL errors
```
ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.013±0.059 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(17506, ('I', 'J')), (3921, ('t', 'r')), (1998, ('a', 'aͤ')), (1279, (',', '.')), (1193, ('a', 'ä')), (1108, ('z', 'ʒ')), (1047, ('n', 'u')), (1036, ('f', 'ſ')), (994, ('e', 'c')), (946, ('t', 'c')), (882, ('o', 'ö')), (863, ('u', 'ü')), (826, ('ſ', 'f')), (800, ('u', 'n')), (798, ('t', 'e')), (721, ('n', 'm')), (629, ('i', ' i')), (569, (' ', '. 
')), (557, ('ſ', ' ſ')), (530, (' ', '.'))], 13803140)
```
* full dataset, `dta19` subcorpus: **0.9%**
* full dataset, other subcorpora: **2.5%** (but no `I/J` confusion there)
* test dataset: **1.4%**, same as above
```
ocrd_cor_asv_ann.scripts.compare - 3133 lines 0.014±0.058 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(174, ('I', 'J')), (45, ('t', 'r')), (18, ('a', 'ä')), (18, ('n', 'u')), (18, ('a', 'aͤ')), (16, ('z', 'ʒ')), (15, ('o', 'ö')), (14, (',', '.')), (11, ('e', 'c')), (10, ('ſ', 'f')), (10, ('f', 'ſ')), (10, ('i', 'r')), (9, ('g', ' g')), (9, ('u', 'ü')), (8, ('i', 'u')), (8, ('a', ' a')), (8, ('t', 'c')), (8, ('r', 'x')), (7, ('i', ' i')), (7, ('t', ' t'))], 138037)
```
* test dataset, `dta19` subcorpus: **1.2%**
* test dataset, other subcorpora: **2.2%**

### Tesseract: new model, new data

* full dataset: **1.1%** with lots of `r/ꝛ` errors (because the model was trained _before_ the [respective](https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR/-/commit/37b9dd0a09ca7d3a558859a05bd923bcb55d4eff) [corrections](https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR/-/commit/685df2d6854340fa0f55a06aacb34c1c6e27b549)?)
```
ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.011±0.049 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(5737, ('r', 'ꝛ')), (3862, ('t', 'r')), (1997, ('a', 'aͤ')), (1284, (',', '.')), (1194, ('a', 'ä')), (1108, ('z', 'ʒ')), (1047, ('n', 'u')), (957, ('t', 'c')), (947, ('e', 'c')), (882, ('o', 'ö')), (862, ('u', 'ü')), (846, ('f', 'ſ')), (802, ('u', 'n')), (798, ('t', 'e')), (753, ('ſ', 'f')), (721, ('n', 'm')), (630, ('i', ' i')), (557, ('ſ', ' ſ')), (554, (' ', '. 
')), (552, (' ', '.'))], 14115872)
```
* test dataset:

### Calamari: old model, old data

TODO

### Calamari: new model, old data

* full dataset: **0.8%**, partly attributable to [overaggressive normalization](https://github.com/Calamari-OCR/calamari/blob/b606e63b6ee394f68e01ba8ace2bac8db984ba46/calamari_ocr/ocr/dataset/textprocessors/text_regularizer.py#L44-L50) during training (`calamari-train --data.pre_proc.processors.6.replacement_groups` defaults to `extended`, which among other things rewrites double quotes to apostrophes and commas; for [GT level 2](https://ocr-d.de/en/gt-guidelines/trans/transkription.html), only `spaces+roman_digits+ligatures-consonantal` would be appropriate)
```
ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.008±0.038 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion / /dev/fd/63 vs /dev/fd/62: ([(11618, (',', '„')), (11125, ("'", '“')), (4527, ("' ", ' ')), (1818, ('a', 'aͤ')), (1292, (',J', 'J')), (1235, (',D', 'D')), (1164, ('a', 'ä')), (1124, (',W', 'W')), (966, (',S', 'S')), (840, ('o', 'ö')), (768, ('u', 'ü')), (695, ("',", ',')), (676, (',A', 'A')), (673, (',N', 'N')), (589, (',E', 'E')), (520, ('f', 'ſ')), (508, ('ſ', 'f')), (452, ('o', 'oͤ')), (395, (',U', 'U')), (393, (',M', 'M'))], 13824358)
```
* When evaluating with stronger normalization (including Calamari's `quotes` regularizer): **0.4%** – quoting errors are gone
```
ocrd_cor_asv_ann.scripts.compare - 302920 lines 0.004±0.032 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(1818, ('a', 'aͤ')), (1164, ('a', 'ä')), (840, ('o', 'ö')), (768, ('u', 'ü')), (452, ('o', 'oͤ')), (418, ('f', 'ſ')), (413, ('ſ', 'f')), (329, ('u', 'uͤ')), (306, ('n', 'u')), (239, ('t', 'r')), (237, ('u', 'n')), (226, ('.', ' .')), (215, ('e', 'é')), (208, (' ', 'e')), (198, (' .', '.')), (194, ('e', 'c')), (186, ('.', ',')), (184, ('a', 'g')), (176, ('s', 'ſ')), (176, (' ', '. 
'))], 13285521)
```

### Calamari: new model, new data

* full dataset: **1.1%** as above, but in addition lots of `J/I` errors
```
ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.011±0.043 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(16879, ('J', 'I')), (11619, (',', '„')), (11125, ("'", '“')), (6138, ('r', 'ꝛ')), (4527, ("' ", ' ')), (1817, ('a', 'aͤ')), (1235, (',D', 'D')), (1164, ('a', 'ä')), (1124, (',W', 'W')), (966, (',S', 'S')), (906, (',', 'I')), (841, ('o', 'ö')), (769, ('u', 'ü')), (695, ("',", ',')), (676, (',A', 'A')), (673, (',N', 'N')), (647, ('Jc', 'c')), (589, (',E', 'E')), (449, ('o', 'oͤ')), (432, ('ſ', 'f'))], 13823990)
```
* When evaluating with stronger normalization (including Calamari's `quotes` regularizer): **0.6%** as above, but in addition lots of `J/I` errors
```
ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.006±0.038 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(17776, ('J', 'I')), (6138, ('r', 'ꝛ')), (1817, ('a', 'aͤ')), (1164, ('a', 'ä')), (841, ('o', 'ö')), (769, ('u', 'ü')), (449, ('o', 'oͤ')), (432, ('ſ', 'f')), (337, ('f', 'ſ')), (329, ('u', 'uͤ')), (312, ('n', 'u')), (263, ('t', 'r')), (244, ('u', 'n')), (228, ('.', ' .')), (222, ('e', 'é')), (212, ('s', 'ſ')), (210, ('.', ',')), (209, ('r', 't')), (207, (' ', 'e')), (207, (' ', '. 
'))], 13846676)
```

### Calamari: ZPD model, old data

* full dataset: **0.8%**, like new model above, but without normalization errors (the ambiguity around `⸗/—` and `⸗/-` arises from inconsistencies between the subcorpora themselves)
```
ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.008±0.037 CER
ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(15343, ('⸗', '—')), (1982, ('a', 'aͤ')), (1517, ('a', 'ä')), (1255, ('ſ', 'f')), (1242, ('.', ' .')), (1152, ('n', 'u')), (1138, ('o', 'ö')), (1118, ('e', 'c')), (1117, (' ', '“ ')), (880, ('u', 'ü')), (797, ('o', 'oͤ')), (703, ('⸗', '-')), (698, ('l', 'k')), (652, ('t', 'k')), (616, ('r', 'x')), (598, ('n', 'm')), (536, ('u', 'uͤ')), (504, ('b', 'h')), (495, (' ', "' ")), (493, (',', '„'))], 13811028)
```

### Calamari: ZPD model, new data

## Discussion

Transcription/normalization/regularization rules are extremely important in this high-accuracy regime. ...
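
How much such rules matter for the comparison can be illustrated with a small Python sketch. The replacement table below is hypothetical: it is chosen to resemble the quote confusions seen in the Calamari results, it is not Calamari's actual `extended` replacement group, and the sample strings are invented.

```python
# Hypothetical replacement table, for illustration only
# (not Calamari's actual rule set)
QUOTE_RULES = str.maketrans({
    "„": ",,",
    "“": "''",
    "”": "''",
    "‚": ",",
    "‘": "'",
    "’": "'",
})


def regularize_quotes(text):
    """Rewrite typographic quotes to ASCII commas and apostrophes."""
    return text.translate(QUOTE_RULES)


# invented example: GT keeps typographic quotes, OCR emits the regularized form
gt = "„Ja“, ſagte er."
ocr = ",,Ja'', ſagte er."

print(gt == ocr)                                        # strings differ as transcribed
print(regularize_quotes(gt) == regularize_quotes(ocr))  # identical after regularization
```

Applying the same rules to both sides before scoring removes these differences from the error count, which is the effect seen in the "stronger normalization" runs. Whether that is desirable depends on the transcription level the model is supposed to reproduce.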