
GT4HistOCR model evaluation

Since GT4HistOCR was published, various OCR models for several engines have been trained on all of the data or on subsets of it. This is an account of how well these models fare:

  1. Tesseract models trained at UB Mannheim, published 2019-2020
  2. Calamari models trained by the Qurator team, published 2019-2020
  3. Calamari models trained by ZPD / Uni Würzburg (TODO)

The original corpus had various flaws, which have been semi-automatically corrected by UB Mannheim. Also, there have been multiple full training runs by both teams.

Methodology

  1. OCR Prediction

    • Tesseract: uses the standalone multiprocessing CLI tesserocr-batch in single-line mode (PSM 13)

      find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 tesseract-batch -Q4 -l GT4HistOCR_2000000 -x .gt4histocr_old
      find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 tesseract-batch -Q4 -l GT4HistOCR -x .gt4histocr
      
    • Calamari: uses the standalone CLI calamari-predict

      find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 calamari-predict --batch_size 8 -j 4 --extension .gt4histocr_cala.txt --extended_prediction_data --extended_prediction_data_format pred --checkpoint ~/.local/share/ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/\* --files
      
  2. OCR Evaluation

    • String alignment (Needleman-Wunsch), distance calculation (unweighted Levenshtein on graphemes, i.e. after joining combining characters), rate calculation (the denominator is the length of the alignment path, not the GT length or the maximum length) and aggregation (micro-averaging, with parallel/subsample aggregation of the standard deviation after Chan et al. 1979) are all done by cor-asv-ann-compare

      cor-asv-ann-compare -o compare-gt-gt4histocr.json -n Levenshtein -c 20 -F <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.[nb][ri][mn].png/.gt.txt/) <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.png/.gt4histocr.txt/)
      cor-asv-ann-compare -o compare-gt-gt4histocr_cala.json -n Levenshtein -c 20 -F <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.[nb][ri][mn].png/.gt.txt/) <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.[nb][ri][mn].png/.gt4histocr_cala.txt/)
      
    • For generalization, we should really only look at unseen validation data (or at least the test data that was used to select the checkpoint). But for Calamari, which votes over 5 models from cross-fold training, no such split is applicable. So we currently test on the full dataset in both cases. (In the future, both engines should at least be trained and tested in a similar way.)
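The metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual cor-asv-ann-compare implementation: graphemes are formed by joining combining characters, CER is the edit count divided by the length of the optimal alignment path, and per-batch statistics are merged pairwise after Chan et al. 1979.

```python
import unicodedata

def graphemes(s):
    """Split into graphemes by joining combining marks onto their base character."""
    out = []
    for ch in unicodedata.normalize("NFC", s):
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def cer(gt, ocr):
    """Unweighted Levenshtein CER; the denominator is the alignment path length."""
    a, b = graphemes(gt), graphemes(ocr)
    # dp[i][j] = (edit distance, length of an optimal alignment path);
    # tuple-min picks the lower distance, breaking ties by shorter path
    dp = [[(0, 0)] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = (i, i)
    for j in range(1, len(b) + 1):
        dp[0][j] = (j, j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                (dp[i - 1][j - 1][0] + sub, dp[i - 1][j - 1][1] + 1),  # (mis)match
                (dp[i - 1][j][0] + 1, dp[i - 1][j][1] + 1),            # deletion
                (dp[i][j - 1][0] + 1, dp[i][j - 1][1] + 1),            # insertion
            )
    dist, pathlen = dp[-1][-1]
    return dist / pathlen if pathlen else 0.0

def merge_stats(n1, mean1, m2_1, n2, mean2, m2_2):
    """Chan et al. 1979: merge two (count, mean, sum of squared deviations) triples."""
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return n, mean, m2  # stddev = sqrt(m2 / n)
```

Using the alignment path length as denominator keeps the rate symmetric between insertions and deletions; merge_stats allows aggregating the per-line statistics in parallel without a second pass over the data.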

Notation

We will refer to the corpus before the UB Mannheim corrections as the old data and to the corrected corpus as the new data; likewise, models from the earlier training runs are called old models and those from the later runs new models.

Results

Tesseract: old model, old data

  • full dataset: 0.7%, with no I/J errors but many whitespace-at-EOL errors
    ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.007±0.046 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(1844, ('a', 'aͤ')), (1143, ('a', 'ä')), (942, ('f', 'ſ')), (803, ('ñ', 'ñ')), (782, ('o', 'ö')), (743, ('u', 'ü')), (735, ('t', 'r')), (666, ('b', 'h')), (591, ('ſ', 'f')), (586, ('u', 'n')), (480, ('c', 'e')), (442, ('n', 'u')), (431, ('o', 'oͤ')), (391, (' .', '.')), (339, ('.', ',')), (314, ('m', 'n')), (294, ('e', 'c')), (276, ('u', 'uͤ')), (259, ('ſ', ' ſ')), (259, ('r', 't'))], 13816319)
    
  • test dataset: 0.7%, same as above
    ocrd_cor_asv_ann.scripts.compare -  3133 lines 0.007±0.044 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(15, ('a', 'ä')), (15, ('o', 'ö')), (15, ('a', 'aͤ')), (14, ('b', 'h')), (14, ('f', 'ſ')), (13, ('ñ', 'ñ')), (9, ('u', 'ü')), (8, ('t', 'r')), (8, ('c', 'e')), (7, ('n', 'u')), (6, ('ſ', 'f')), (6, ('e', ' e')), (5, ('uͤ', 'ü')), (5, ('t', 't᷑')), (4, (' ', ', ')), (4, ('z', 'ʒ')), (4, (' .', '.')), (4, ('v', ' v')), (4, ('i', 'ĩ')), (4, ('r', 'x'))], 138055)
    
  • test dataset, dta19 subcorpus: 0.5%
  • test dataset, other subcorpora: 1.3%

Tesseract: old model, new data

  • full dataset: 0.9%, with many J/I errors
    ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.009±0.043 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(17800, ('J', 'I')), (6043, ('r', 'ꝛ')), (1843, ('a', 'aͤ')), (1144, ('a', 'ä')), (803, ('ñ', 'ñ')), (782, ('o', 'ö')), (749, ('f', 'ſ')), (741, ('u', 'ü')), (727, ('t', 'r')), (658, ('b', 'h')), (587, ('u', 'n')), (509, ('ſ', 'f')), (459, ('c', 'e')), (444, ('n', 'u')), (431, ('o', 'oͤ')), (384, (' .', '.')), (340, ('.', ',')), (314, ('m', 'n')), (277, ('u', 'uͤ')), (263, ('\n', ' —\n'))], 14129030)
    
  • test dataset:

Tesseract: new model, old data

  • full dataset: 1.3%, with many I/J and whitespace-at-EOL errors
    ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.013±0.059 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(17506, ('I', 'J')), (3921, ('t', 'r')), (1998, ('a', 'aͤ')), (1279, (',', '.')), (1193, ('a', 'ä')), (1108, ('z', 'ʒ')), (1047, ('n', 'u')), (1036, ('f', 'ſ')), (994, ('e', 'c')), (946, ('t', 'c')), (882, ('o', 'ö')), (863, ('u', 'ü')), (826, ('ſ', 'f')), (800, ('u', 'n')), (798, ('t', 'e')), (721, ('n', 'm')), (629, ('i', ' i')), (569, (' ', '. ')), (557, ('ſ', ' ſ')), (530, (' ', '.'))], 13803140)
    
  • full dataset, dta19 subcorpus: 0.9%
  • full dataset, other subcorpora: 2.5% (but no I/J confusion there)
  • test dataset: 1.4%, same as above
    ocrd_cor_asv_ann.scripts.compare -  3133 lines 0.014±0.058 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(174, ('I', 'J')), (45, ('t', 'r')), (18, ('a', 'ä')), (18, ('n', 'u')), (18, ('a', 'aͤ')), (16, ('z', 'ʒ')), (15, ('o', 'ö')), (14, (',', '.')), (11, ('e', 'c')), (10, ('ſ', 'f')), (10, ('f', 'ſ')), (10, ('i', 'r')), (9, ('g', ' g')), (9, ('u', 'ü')), (8, ('i', 'u')), (8, ('a', ' a')), (8, ('t', 'c')), (8, ('r', 'x')), (7, ('i', ' i')), (7, ('t', ' t'))], 138037)
    
  • test dataset, dta19 subcorpus: 1.2%
  • test dataset, other subcorpora: 2.2%

Tesseract: new model, new data

  • full dataset: 1.1%, with many r/ꝛ errors (presumably because the model was trained before the respective corrections)
    ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.011±0.049 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(5737, ('r', 'ꝛ')), (3862, ('t', 'r')), (1997, ('a', 'aͤ')), (1284, (',', '.')), (1194, ('a', 'ä')), (1108, ('z', 'ʒ')), (1047, ('n', 'u')), (957, ('t', 'c')), (947, ('e', 'c')), (882, ('o', 'ö')), (862, ('u', 'ü')), (846, ('f', 'ſ')), (802, ('u', 'n')), (798, ('t', 'e')), (753, ('ſ', 'f')), (721, ('n', 'm')), (630, ('i', ' i')), (557, ('ſ', ' ſ')), (554, (' ', '. ')), (552, (' ', '.'))], 14115872)
    
  • test dataset:

Calamari: old model, old data

TODO

Calamari: new model, old data

  • full dataset: 0.8%, partly attributable to over-aggressive normalization during training (calamari-train --data.pre_proc.processors.6.replacement_groups defaults to extended, which among other things rewrites double quotes to apostrophes and commas; for GT level 2, only spaces+roman_digits+ligatures-consonantal would be appropriate)
    ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.008±0.038 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion / /dev/fd/63 vs /dev/fd/62: ([(11618, (',', '„')), (11125, ("'", '“')), (4527, ("' ", ' ')), (1818, ('a', 'aͤ')), (1292, (',J', 'J')), (1235, (',D', 'D')), (1164, ('a', 'ä')), (1124, (',W', 'W')), (966, (',S', 'S')), (840, ('o', 'ö')), (768, ('u', 'ü')), (695, ("',", ',')), (676, (',A', 'A')), (673, (',N', 'N')), (589, (',E', 'E')), (520, ('f', 'ſ')), (508, ('ſ', 'f')), (452, ('o', 'oͤ')), (395, (',U', 'U')), (393, (',M', 'M'))], 13824358)
    
  • When evaluating with stronger normalization (including Calamari's quotes regularizer): 0.4% – quoting errors are gone
    ocrd_cor_asv_ann.scripts.compare - 302920 lines 0.004±0.032 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(1818, ('a', 'aͤ')), (1164, ('a', 'ä')), (840, ('o', 'ö')), (768, ('u', 'ü')), (452, ('o', 'oͤ')), (418, ('f', 'ſ')), (413, ('ſ', 'f')), (329, ('u', 'uͤ')), (306, ('n', 'u')), (239, ('t', 'r')), (237, ('u', 'n')), (226, ('.', ' .')), (215, ('e', 'é')), (208, (' ', 'e')), (198, (' .', '.')), (194, ('e', 'c')), (186, ('.', ',')), (184, ('a', 'g')), (176, ('s', 'ſ')), (176, (' ', '. '))], 13285521)
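The effect of such a quote regularizer can be illustrated with a small sketch. The mapping below is hypothetical and chosen for illustration only, not Calamari's actual replacement tables: if both GT and OCR text are passed through it before comparison, confusions like (',', '„') and ("'", '“') no longer count as errors.

```python
# Hypothetical quote regularizer for evaluation (illustration only, not
# Calamari's actual replacement_groups): collapse typographic quote
# variants to ASCII on both the GT and the OCR side before comparison.
QUOTE_MAP = str.maketrans({
    "„": '"', "“": '"', "”": '"',   # double quotes
    "‚": "'", "‘": "'", "’": "'",   # single quotes
})

def regularize_quotes(s: str) -> str:
    s = s.replace(",,", '"').replace("''", '"')  # digraph stand-ins for quotes
    return s.translate(QUOTE_MAP)
```

Applying this consistently to both sides is what turns the 0.8% above into 0.4%: the remaining confusions are genuine recognition errors, not transcription conventions.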
    

Calamari: new model, new data

  • full dataset: 1.1%, as above but additionally with many J/I errors
    ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.011±0.043 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(16879, ('J', 'I')), (11619, (',', '„')), (11125, ("'", '“')), (6138, ('r', 'ꝛ')), (4527, ("' ", ' ')), (1817, ('a', 'aͤ')), (1235, (',D', 'D')), (1164, ('a', 'ä')), (1124, (',W', 'W')), (966, (',S', 'S')), (906, (',', 'I')), (841, ('o', 'ö')), (769, ('u', 'ü')), (695, ("',", ',')), (676, (',A', 'A')), (673, (',N', 'N')), (647, ('Jc', 'c')), (589, (',E', 'E')), (449, ('o', 'oͤ')), (432, ('ſ', 'f'))], 13823990)
    
  • When evaluating with stronger normalization (including Calamari's quotes regularizer): 0.6%, as above but additionally with many J/I errors
    ocrd_cor_asv_ann.scripts.compare - 313204 lines 0.006±0.038 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(17776, ('J', 'I')), (6138, ('r', 'ꝛ')), (1817, ('a', 'aͤ')), (1164, ('a', 'ä')), (841, ('o', 'ö')), (769, ('u', 'ü')), (449, ('o', 'oͤ')), (432, ('ſ', 'f')), (337, ('f', 'ſ')), (329, ('u', 'uͤ')), (312, ('n', 'u')), (263, ('t', 'r')), (244, ('u', 'n')), (228, ('.', ' .')), (222, ('e', 'é')), (212, ('s', 'ſ')), (210, ('.', ',')), (209, ('r', 't')), (207, (' ', 'e')), (207, (' ', '. '))], 13846676)
    

Calamari: ZPD model, old data

  • full dataset: 0.8%, like the new model above but without normalization errors (the ambiguity around ⸗/— and ⸗/- arises from inconsistencies between the subcorpora themselves)
    ocrd_cor_asv_ann.scripts.compare - 313208 lines 0.008±0.037 CER
    ocrd_cor_asv_ann.scripts.compare - most frequent confusion: ([(15343, ('⸗', '—')), (1982, ('a', 'aͤ')), (1517, ('a', 'ä')), (1255, ('ſ', 'f')), (1242, ('.', ' .')), (1152, ('n', 'u')), (1138, ('o', 'ö')), (1118, ('e', 'c')), (1117, (' ', '“ ')), (880, ('u', 'ü')), (797, ('o', 'oͤ')), (703, ('⸗', '-')), (698, ('l', 'k')), (652, ('t', 'k')), (616, ('r', 'x')), (598, ('n', 'm')), (536, ('u', 'uͤ')), (504, ('b', 'h')), (495, (' ', "' ")), (493, (',', '„'))], 13811028)
    

Calamari: ZPD model, new data

Discussion

Transcription/normalization/regularization rules are extremely important in this high-accuracy regime.