Since GT4HistOCR was published, various OCR models have been trained on the full dataset or subsets of it, for various engines. This is an account of how well these models fare.
The original corpus had various flaws, which have since been semi-automatically corrected by UB Mannheim. There have also been multiple full training runs by both teams.
## OCR Prediction
Tesseract: uses the standalone multiprocessing single-line (PSM 13) CLI `tesserocr-batch`:
```sh
find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 tesseract-batch -Q4 -l GT4HistOCR_2000000 -x .gt4histocr_old
find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 tesseract-batch -Q4 -l GT4HistOCR -x .gt4histocr
```
Calamari: uses the standalone CLI `calamari-predict`:
```sh
find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | parallel -u -X -j1 calamari-predict --batch_size 8 -j 4 --extension .gt4histocr_cala.txt --extended_prediction_data --extended_prediction_data_format pred --checkpoint ~/.local/share/ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0/\* --files
```
## OCR Evaluation
String alignment (Needleman–Wunsch), distance calculation (unweighted Levenshtein on graphemes, i.e. after joining combining characters), rate calculation (the denominator is the length of the alignment path, not the GT length or the maximum length) and aggregation (micro-averaging, plus parallel/subsample aggregation of the standard deviation following Chan et al. 1979) are all done by `cor-asv-ann-compare`:
```sh
cor-asv-ann-compare -o compare-gt-gt4histocr.json -n Levenshtein -c 20 -F <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.[nb][ri][mn].png/.gt.txt/) <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.png/.gt4histocr.txt/)
cor-asv-ann-compare -o compare-gt-gt4histocr_cala.json -n Levenshtein -c 20 -F <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.[nb][ri][mn].png/.gt.txt/) <(find GT4HistOCR -type f "(" -name "*.nrm.png" -or -name "*.bin.png" ")" | sed s/.[nb][ri][mn].png/.gt4histocr_cala.txt/)
```
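The metric above can be sketched in a few lines of Python. This is a minimal illustration, not the actual `cor-asv-ann-compare` implementation: grapheme segmentation here only attaches combining characters to their base (full UAX #29 clustering is more involved), and the error rate divides by the number of alignment columns rather than the GT length.

```python
import unicodedata

def graphemes(s):
    """Split a string into grapheme-like clusters by attaching
    combining characters to the preceding base character."""
    s = unicodedata.normalize("NFC", s)
    out = []
    for ch in s:
        if unicodedata.combining(ch) and out:
            out[-1] += ch
        else:
            out.append(ch)
    return out

def align_cer(gt, ocr):
    """Unweighted Levenshtein on graphemes; the rate's denominator is
    the length of the alignment path, not the GT length."""
    a, b = graphemes(gt), graphemes(ocr)
    n, m = len(a), len(b)
    # standard edit-distance DP table (unit costs)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    # backtrace to count alignment columns (matches + subs + ins + dels)
    i, j, columns = n, m, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1):
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
        columns += 1
    dist = d[n][m]
    return dist, (dist / columns if columns else 0.0)
```

For example, `align_cer("Jn", "In")` yields distance 1 over 2 alignment columns, i.e. a rate of 0.5; the alignment-path denominator keeps rates bounded by 1 even when the OCR output is longer than the GT.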
For generalization we should really only look at unseen validation data (or at least the test data that was used to select the checkpoint). But for Calamari, which uses a voting scheme over 5 models from cross-fold training, no single split is applicable. So we currently test on the full dataset in both cases. (We should at least train/test both engines in a similar way in the future.)
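The parallel/subsample aggregation of the standard deviation mentioned above (Chan et al. 1979) can be sketched as follows. This is a minimal illustration, not the actual `cor-asv-ann-compare` code: per-subsample triples `(count, mean, M2)` (with `M2` the sum of squared deviations from the mean) are combined pairwise without revisiting the data.

```python
def aggregate(values):
    """Single-pass-style aggregate of one subsample: (count, mean, M2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return (n, mean, m2)

def combine(agg_a, agg_b):
    """Chan et al. (1979) pairwise combination of two aggregates."""
    n_a, mean_a, m2_a = agg_a
    n_b, mean_b, m2_b = agg_b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # cross term accounts for the shift between the two subsample means
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)
```

The combined `M2 / n` equals the population variance of the concatenated samples, so per-file (or per-worker) statistics can be merged in any order.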
We will refer to the …

- …: … (`I/J` errors, but lots of whitespace-at-EOL errors); dta19 subcorpus: 0.5% (`J/I` errors)
- …: … (`I/J` and whitespace-at-EOL errors); dta19 subcorpus: 0.9% (`I/J` confusion there)
- …: dta19 subcorpus: 1.2% (`r/ꝛ` errors; because the model was trained before the respective corrections?)
## TODO
- … (`calamari-train --data.pre_proc.processors.6.replacement_groups` defaults to `extended`, which among others rewrites double quotes to apostrophes and commas; for GT level 2, only `spaces+roman_digits+ligatures-consonantal` would be appropriate)
- … (… quotes regularizer): 0.4% – quoting errors are gone, `J/I` errors …
- … (… quotes regularizer): 0.6% – as above, but in addition lots of `J/I` errors
- … `⸗/—` and `⸗/-` … (arises from inconsistencies between the subcorpora themselves)
Transcription/normalization/regularization rules are extremely important in this high-accuracy regime.
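To illustrate the point, regularizing both GT and OCR output the same way before comparison removes such pseudo-errors (quote styles, `⸗` vs. `-`) from the measured rate. The mapping table below is a hypothetical sketch, not an actual GT4HistOCR transcription guideline:

```python
# Hypothetical regularization table: fold typographic quote variants to
# straight quotes, and the double oblique hyphen (U+2E17) to a plain hyphen.
QUOTES = str.maketrans({
    "\u201e": '"', "\u201c": '"', "\u201d": '"',  # low-9 / curly double quotes
    "\u00bb": '"', "\u00ab": '"',                 # guillemets
    "\u201a": "'", "\u2018": "'", "\u2019": "'",  # single quotes
})

def regularize(s):
    """Apply the (hypothetical) regularization rules to one line."""
    return s.translate(QUOTES).replace("\u2e17", "-")

gt = "\u201eWort\u2e17Trennung\u201c"   # „Wort⸗Trennung“
ocr = '"Wort-Trennung"'
# after regularization, GT and OCR agree – the quote/hyphen confusions vanish
mismatches = sum(g != o for g, o in zip(regularize(gt), regularize(ocr)))
```

Whether such folding is applied to the GT, to the model's output alphabet, or only inside the evaluation tool is exactly the kind of convention that must be kept consistent across subcorpora and training runs.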
…