<style> /* reduce from default 48px: */ .reveal { font-size: 24px; text-align: left; } .reveal .slides { text-align: left; } /* change from default gray-on-black: */ .hljs { color: #005; background: #fff; } /* prevent invisible fragments from occupying space: */ .fragment.visible:not(.current-fragment) { display: none; height:0px; line-height: 0px; font-size: 0px; } /* increase font size in diagrams: */ .label { font-size: 24px; font-weight: bold; } /* increase maximum width of code blocks: */ .reveal pre code { max-width: 1000px; max-height: 1000px; } /* remove black border from images: */ .reveal section img { border: 0; } .reveal h3 { text-transform: none; } .reveal pre.mermaid { width: 100% !important; } .reveal svg { max-height: 600px; } .reveal .scaled-flowchart-td pre.mermaid { width: 100% !important; /* why? float: left; */ } .reveal .scaled-flowchart-td svg { max-width: 100% !important; } .reveal .scaled-flowchart-td svg g.node, .reveal .scaled-flowchart-td svg g.label, .reveal .scaled-flowchart-td svg foreignObject { width: 100% !important; } .reveal .scaled-flowchart-td p { clear:both; } .reveal .centered { text-align: center } .reveal .width75 { max-width: 75%; } </style>

# Robust and Performant Methods for Layout Analysis in OCR-D

## SLUB Contribution (Working Meeting)

_Robert Sachunsky_    ![slub-logo](https://www.slub-dresden.de/typo3conf/ext/slub_template/Resources/Public/Images/slublogo.svg =200x)

26.11.2024 : https://hackmd.io/@bertsky/ocrd-layout-meeting

---

## Status

1. Progress in OCR-D (general)
2. Progress in OCR-D (efficient GPU pipelining)
3. Progress with Detectron2

---

## 1 Progress in OCR-D (general)

- core: many [bug fixes](https://github.com/OCR-D/core/pulls?q=is%3Apr+is%3Aclosed+author%3Abertsky)
- core: API v3 [driven forward](https://github.com/OCR-D/core/pull/1240), esp.
(95%)
  - error handling (`skip|abort|overwrite|copy`) and timeouts at processor/page level
  - parallelization via ~~multithreading~~ multiprocessing
  - simplification for processors
- core: logging [fixed/reworked](https://github.com/OCR-D/core/pull/1288) (95%)

---

## 1 Progress in OCR-D (general)

- ocrd_all: build for fat container [updated](https://github.com/OCR-D/ocrd_all/pull/436)

  ![circleci-ocrdall](https://hackmd.io/_uploads/SJ0Lb_zmJx.png)
- ocrd_all: retrofitting of CI+CD in all modules for slim containers

---

## 1 Progress in OCR-D (general)

- ocrd_kraken: [adaptation to Kraken v5](https://github.com/OCR-D/ocrd_kraken/pull/43), incremental segmentation
- textract2page: [complete layout recursion](https://github.com/slub/textract2page/pull/23)
- adaptation to TF 2.[4..15]:
  - ocrd_keraslm (75%)
  - [(ocrd-fork-)tfaip](https://github.com/Planet-AI-GmbH/tfaip/compare/master...bertsky:tfaip:pypi-fork) (100%)
  - [calamari](https://github.com/Calamari-OCR/calamari/releases/tag/v2.3.0) (100%)

---

## 1 Progress in OCR-D (general)

- adaptation to core v3:
  - [ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr/pull/216)
  - [ocrd_kraken](https://github.com/OCR-D/ocrd_kraken/pull/44)
  - [ocrd_segment](https://github.com/OCR-D/ocrd_segment/pull/69)
  - [ocrd_cis](https://github.com/bertsky/ocrd_cis/pull/5)
  - [ocrd_calamari](https://github.com/OCR-D/ocrd_calamari/pull/117) (also: TF 2.15 and Calamari 2.x)

---

## 2 Progress in OCR-D (efficient GPU pipelining)

- ~~role model~~: [tfaip](https://tfaip.readthedocs.io/), `tfx.serving`
- exemplified by:
  - ocrd_detectron2 with [`AsyncPredictor`](https://github.com/bertsky/ocrd_detectron2/compare/master...predict-async)
  - ocrd_calamari with [multiprocessing](https://github.com/OCR-D/ocrd_calamari/pull/118/) ...

---

## 2 Progress in OCR-D (efficient GPU pipelining)

- ocrd_calamari with **C1, no MT / MP, arbitrary batch size**: peaky, low utilization
to avoid OOM

![ocrd-calamari1](https://hackmd.io/_uploads/Skh34Of7Jx.png)

---

## 2 Progress in OCR-D (efficient GPU pipelining)

- ocrd_calamari with **C1, no MT / MP, batch bucketing**: peaky

![ocrd-calamari1-bb](https://hackmd.io/_uploads/Hy5pNuzmyg.png)

---

## 2 Progress in OCR-D (efficient GPU pipelining)

- ocrd_calamari with **C1, multithreading, batch bucketing**: peaky

![ocrd-calamari1-bb-mt](https://hackmd.io/_uploads/By5C4OMQ1x.png)

---

## 2 Progress in OCR-D (efficient GPU pipelining)

- ocrd_calamari with **C2, multithreading, `predict_pipeline`, batch bucketing**: peaky

![ocrd-calamari2-bb-mt-predict-pipeline](https://hackmd.io/_uploads/B1I1HuzXkl.png)

---

## 2 Progress in OCR-D (efficient GPU pipelining)

- ocrd_calamari with **C2, multiprocessing, `predict_pipeline`, batch bucketing**: less peaky

![ocrd-calamari2-bb-mp-predict-pipeline](https://hackmd.io/_uploads/ryMgSufmkl.png)

---

## 2 Progress in OCR-D (efficient GPU pipelining)

- ocrd_calamari with **C2, multiprocessing, `predict_on_batch`, batch bucketing**: even less peaky

![ocrd-calamari2-bb-mp-predict-onbatch](https://hackmd.io/_uploads/r1BZBdzmkl.png)

---

## 2 Progress in OCR-D (efficient GPU pipelining)

- ocrd_calamari with **C2, multiprocessing, custom pipeline, `mp.Queue`-based generator**: smooth!

![ocrd-calamari2-mp-predict-queue-generator](https://hackmd.io/_uploads/rk5GBuMX1l.png)

---

## 3 Progress with Detectron2

- improved quality of PAGE decoding (NMS)
- only now: access to the HPC cluster :worried:
- experiments on fine-tuning, mostly on DocLayNet (COCO)
  - extremely many hyperparameters, many possible variants
  - initially pure Mask R-CNN (instance segmentation); panoptic requires additional data (conversion)
  - only COCO evaluation (mAP)

---

## 3 Progress with Detectron2

- experiments on fine-tuning, mostly
on DocLayNet (COCO)

![all](https://hackmd.io/_uploads/ry_YzPGQkx.png)

| | |
| --- | --- |
| ![test](https://hackmd.io/_uploads/SJRJR_zQJx.png) | ![test](https://hackmd.io/_uploads/HyUFUtGmkx.png) |

---

## Toolkit: Segmentation

- [ocrd_segment](https://github.com/OCR-D/ocrd_segment): OLR tools
  - format conversion (segment masks, COCO, PAGE)
  - robust polygon processing ("Shapely frontend")
  - repair: (semantic) PAGE validation and correction
  - repair/"plausibilize": resolve segment conflicts ("hierarchical NMS")
  - repair/"sanitize": hull contour of the foreground pixels
  - project: hull contour of the child segments
- together with [ocrd-cis-ocropy-resegment](https://github.com/bertsky/ocrd_cis/blob/fix-alpha-shape/ocrd_cis/ocropy/resegment.py): layout post-processing
- many manually optimized OCR-D workflows (e.g. for newspapers)
- comparison with [AWS Textract](https://github.com/slub/textract2page)

---

## Toolkit: Evaluation

- [discussion of metrics](https://github.com/OCR-D/ocrd_segment/wiki/SegmentationEvaluation)...
- [ocrd-segment-evaluate | page-segment-evaluate](https://github.com/OCR-D/ocrd_segment/blob/master/ocrd_segment/evaluate.py):
  - efficient IoU computation: `pycocotools.cocoeval`
  - matching, metrics, aggregation: custom code, because
    - the alignment in pycocotools is [inadequate](https://github.com/cocodataset/cocoapi/issues/564)
    - n:m instead of only 1:1
    - also FN/FP (i.e. recall/precision)
    - also instance-level instead of only pixel-level metrics
    - also measures for over-/undersegmentation
    - micro-averaging, relative measures
  - options: lines/regions, with/without classes, foreground/everything
  - **not** yet: _allowable_ split / merge (PRImA)
- [PRImA-Layout-Eval](https://github.com/PRImA-Research-Lab/prima-layout-eval): _partial_ sources, documentation, C. Clausner has agreed to help

---

## Discussion: Structure GT

- DTA/1000 has quality problems (and is too small)...
- repair workflows...

| | | |
| --- | --- | --- |
|![](https://files.gitter.im/ocrd-segment/community/dKdZ/raw_varnhagen_rahel03_1834_OCR-D-IMG-CROP_0019.pred.png =300x)|![](https://files.gitter.im/ocrd-segment/community/PYdG/raw_siemens_abhandlungen_1881_OCR-D-IMG-CROP_0013.pred.png =300x)|![](https://files.gitter.im/ocrd-segment/community/zXkn/raw_siebold_suesswasserfische_1863_OCR-D-IMG-CROP_0011.pred.png =300x)|

---

## Discussion: Structure GT

- PubLayNet, DocLayNet, TableBank, DocBank, ReadingBank etc. → too homogeneous/modern

| | | |
| --- | --- | --- |
|![](https://files.gitter.im/ocrd-segment/community/ngYN/PMC2999828_00004.pred.png =300x)|![](https://files.gitter.im/ocrd-segment/community/WlP6/PMC3270436_00001.pred.png =300x)|![](https://user-content.gitter-static.net/a24035ee255f2476eb538f1a36dcdae6cdf78cab/68747470733a2f2f66696c65732e6769747465722e696d2f6f6372642d7365676d656e742f636f6d6d756e6974792f6341314b2f7468756d622f504d43333031343637365f30303030312e707265642e706e67 =300x)|

---

## Ideas SLUB (1)

- ocrd-segment: [template-based analysis](https://github.com/OCR-D/ocrd_segment/wiki/TemplateDrivenSegmentation), [notebook by @hnesk](https://github.com/hnesk/ocr-experiments)
- own Detectron2 model(s) for regions (Mask R-CNN Panoptic; possibly specialized models)
- own Kraken model for lines (but handwriting _and_ print in all variants; only at region level, so that it stays robust and modular)

---

## Ideas SLUB (2)

- extend ocrd-segment-evaluate → work on GT with OCR workflows
- tools for phenomenology and exploration → dynamic quality estimation without GT
  - binarization: CC statistics
  - layout: ?

---

## Ideas SLUB (3)

- [trainable reading order](https://github.com/lquirosd/Order_Relation_Operator)?
- wrapper for [Origami](https://github.com/bertsky/ocrd_origami)?
- wrapper for [Pero-OCR](https://github.com/DCGM/pero-ocr)?
- experiments with [Laypa](https://github.com/stefanklut/laypa)?
- [article separation](https://github.com/CITlabRostock/citlab-article-separation-new)?
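
---

## Sketch: CC statistics for binarization

The idea of estimating binarization quality without GT via connected-component (CC) statistics could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names `component_sizes` and `cc_statistics`, the use of 8-connectivity, the `speckle_size` threshold, and the notion that a high share of tiny components hints at noise. It works on a plain list-of-lists binary image (1 = foreground) and is not taken from any OCR-D processor.

```python
from collections import deque
from statistics import median


def component_sizes(image):
    """Return the pixel counts of all 8-connected foreground components.

    `image` is a list of rows; truthy values count as foreground.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    sizes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # breadth-first flood fill starting from an unvisited foreground pixel
            seen[y][x] = True
            queue = deque([(y, x)])
            size = 0
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            sizes.append(size)
    return sizes


def cc_statistics(image, speckle_size=2):
    """Summarize component sizes; `speckle_size` is a hypothetical noise threshold."""
    sizes = component_sizes(image)
    if not sizes:
        return {"count": 0, "median_size": 0, "speckle_ratio": 0.0}
    speckles = sum(1 for size in sizes if size <= speckle_size)
    return {
        "count": len(sizes),
        "median_size": median(sizes),
        "speckle_ratio": speckles / len(sizes),
    }


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0, 1],
        [1, 1, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 0, 1, 1, 1],
    ]
    print(cc_statistics(page))
```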

---

## Goals

* problem classes in the VD
  - identify (features, delimitation)
  - quantify (frequency, difficulty)
  - prioritize
* ground-truth data
  - for training, for evaluation: a common reference for experiments
  - check, prepare, harmonize, create
  - creation with our own tools is easier and more consistent
* OLR models and tools
  - develop further, optimize, combine
* OLR evaluation
  - provide methods and metrics
  - apply and analyze
* integration into OCR-D
  - modular, efficient, robust processors
  - implications for specification and workflows

---

## Planning

|Work package|SBB|SLUB|ZPD|
|---|---|---|---|
|1: Project management|2|1|1|
|2: Requirements analysis|3|6|2|
|3: Data provision|3|1|9|
|4: Development|12|8|4|
|5: Evaluation|2|6|2|
|6: Integration|2|2|0|

<br/>

- prioritize WP 2, since the partners depend on it
- bring in WP 4 as early as possible, its share increases step by step
- WP 2 gradually transitions into WP 5
- WP 6 runs concurrently, but will probably take more effort than hoped (legacy of Phase III)
- OCR-D → SLUB ?
- GPU → SBB ?

---

## Planning

![aps](https://hackmd.io/_uploads/B1RVU3lJC.png)

- development in a spiral model, with a train-eval loop as early as possible?
- monthly hackathons on individual tools to foster a development culture?
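
---

## Sketch: n:m segment matching

The n:m matching, the FN/FP counts and the over-/undersegmentation measures named for the evaluation toolkit can be illustrated with a minimal sketch. It is not the code of ocrd-segment-evaluate. It assumes axis-aligned boxes `(x0, y0, x1, y1)` instead of polygons, and the names `match_segments` and `min_overlap` as well as the matching rule (a pair matches if the intersection covers at least `min_overlap` of either box) are hypothetical choices for illustration.

```python
from collections import defaultdict


def area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); degenerate boxes give 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection(a, b):
    """Area of the overlap between two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def iou(a, b):
    """Intersection over union of two boxes."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def match_segments(gt_boxes, pred_boxes, min_overlap=0.5):
    """Match GT and predicted boxes n:m instead of forcing a 1:1 assignment."""
    gt_to_pred = defaultdict(list)
    pred_to_gt = defaultdict(list)
    pairs = {}
    for i, gt in enumerate(gt_boxes):
        for j, pred in enumerate(pred_boxes):
            inter = intersection(gt, pred)
            if not inter:
                continue
            # share of the GT box covered, and share of the predicted box covered
            if max(inter / area(gt), inter / area(pred)) >= min_overlap:
                gt_to_pred[i].append(j)
                pred_to_gt[j].append(i)
                pairs[(i, j)] = iou(gt, pred)
    return {
        "pairs": pairs,  # IoU of every matched (gt, pred) pair
        "false_negatives": [i for i in range(len(gt_boxes)) if i not in gt_to_pred],
        "false_positives": [j for j in range(len(pred_boxes)) if j not in pred_to_gt],
        # one GT segment split into several predictions
        "oversegmented": [i for i, js in gt_to_pred.items() if len(js) > 1],
        # one prediction merging several GT segments
        "undersegmented": [j for j, gts in pred_to_gt.items() if len(gts) > 1],
    }


if __name__ == "__main__":
    gt = [(0, 0, 100, 50), (0, 60, 100, 100), (0, 200, 50, 250)]
    pred = [(0, 0, 50, 50), (50, 0, 100, 50), (0, 60, 100, 100), (300, 300, 350, 350)]
    print(match_segments(gt, pred))
```

With a strict 1:1 alignment, a GT region that is split into two predictions would yield at most one match; the n:m bookkeeping above keeps both pairs, so the split can be reported as oversegmentation.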