Final presentation: OCR-D

# Final presentation: OCR-D

Robert Sachunsky, Janek Schleicher, Kay‑Michael Würzner

mentored by: Uwe Schmidt

---

# Contents

- What we are trying to achieve
- What did we achieve
- How do we go on from here

---

# What we are trying to achieve

----

## Segmentation of book pages

Hierarchical polygons (mainly boxes)

<img src="https://i.imgur.com/FQzHGNg.png" width="330" />

----

## Segmentation of book pages

Hierarchical polygons (mainly boxes)

<img src="https://i.imgur.com/12R1ecV.png" width="290" />

----

## Segmentation of book pages

Hierarchical polygons (mainly boxes)

<img src="https://i.imgur.com/DNxyPOh.png" width="310" />

----

## Segmentation of book pages

Hierarchical polygons (mainly boxes)

<img src="https://i.imgur.com/zYuGYZA.png" width="390" />

----

## Segmentation of book pages

Hierarchical polygons (mainly boxes)

<img src="https://i.imgur.com/RCEDV9f.png" width="390" />

----

## Segmentation of book pages

Hierarchical polygons (mainly boxes)

<img src="https://i.imgur.com/kZ9QtUZ.png" width="320" />

----

## Segmentation of book pages

Hierarchical polygons (mainly boxes)

<img src="https://i.imgur.com/4xezeUp.png" width="350" />

----

## Classification of segments

- Mixed classes (semantic vs. appearance)
- Text: footnote, marginalia, catchword ...
- Graphics: handwritten annotations, diagrams, drawings ...
- Separators, Math (containing text), Tables (containing separators and text), Noise

----

## Baseline

- Heuristic layout analysis by Tesseract
- Only bounding boxes
- Large overlaps
- Inadequate for historic documents
- Inflexible for complex layouts
- Pixel accuracy: $\approx$ 88% ($\approx$ 82% without background!)
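----

## Baseline: pixel accuracy

A minimal sketch of how the two baseline figures could be computed, assuming that "without background" means pixels labelled background in the ground truth are left out of the count. The function name `pixel_accuracy`, the label ids and the toy label maps are illustrative and not taken from the project code.

```python
def pixel_accuracy(pred, truth, ignore=None):
    """Fraction of pixels whose predicted label equals the ground-truth label.

    Pixels whose ground-truth label equals `ignore` are skipped entirely.
    """
    hits = 0
    total = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if t == ignore:
                continue
            total += 1
            hits += int(p == t)
    return hits / total if total else float("nan")


# Hypothetical label ids, for illustration only.
BACKGROUND, TEXT, GRAPHICS = 0, 1, 2

# Toy 3x4 label maps: ground truth and a prediction with two wrong pixels.
truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 2, 2, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 2, 0, 0],
]

acc_all = pixel_accuracy(pred, truth)                     # 10 of 12 pixels
acc_fg = pixel_accuracy(pred, truth, ignore=BACKGROUND)   # 2 of 4 pixels
print(f"all pixels: {acc_all:.2f}, without background: {acc_fg:.2f}")
```

Excluding background removes the most frequent and easiest class, which is consistent with the second figure being the lower one.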
----

## Baseline

<table>
<tr>
<td><img src="https://i.imgur.com/FQzHGNg.png"/></td>
<td><img src="https://i.imgur.com/BfRGLs4.png"/></td>
</tr>
</table>

----

## Training a neural network

- First attempt with fastai
- No preprocessing
- Masking all different types of text segments
- Pixel classifier with U-Net model
- Pixel accuracy $\approx$ 80%

![](https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/u-net-architecture.png =x200)

----

Initially, no text region type distinction

![](https://i.imgur.com/kqsSMuL.png?1)

----

Further training struggled with text region distinction

Example 1

<table>
<tr>
<td><img src="https://i.imgur.com/4rZ5Fct.png"/></td>
<td><img src="https://i.imgur.com/HtyDGKj.png"/></td>
</tr>
</table>

----

Example 2

<table>
<tr>
<td><img src="https://i.imgur.com/M80oFn9.png"/></td>
<td><img src="https://i.imgur.com/WeZJR5R.png"/></td>
</tr>
</table>

----

Example 3

<table>
<tr>
<td><img src="https://i.imgur.com/0snzLEP.png"/></td>
<td><img src="https://i.imgur.com/XuWxUTb.png"/></td>
</tr>
</table>

----

Example 4

<table>
<tr>
<td><img src="https://i.imgur.com/BBSbFA0.png"/></td>
<td><img src="https://i.imgur.com/tObT5Uz.png"/></td>
</tr>
</table>

---

# What did we achieve

----

## Introspection

- Clear definition of the problem
  - Page segmentation as a preprocessing step for OCR
  - Page segmentation as relevant datum itself
- Reduction to 7 (mostly) appearance-based classes
  - Text, Graphics, Table, Math, Separators, Noise, **Background**

----

## Initial model

Pixel classifier (U-Net): text annotations too loose

![](https://i.imgur.com/FmLuD3G.png)

----

## Improved GT

Shrinking regions to OCRopy line segmentations

<img src="https://i.imgur.com/rnY3bsM.png" width="300" />

----

## Improved GT

Sharper segmentation

![](https://i.imgur.com/VWxFZIl.png)

----

## Additional input: binarization

Disable loss in between letters in text regions

![](https://i.imgur.com/p44Iap0.png)

----

## Additional input: binarization

Very tight “regions”
![](https://i.imgur.com/E69FDV6.png)

----

## Text-region boundary

Additional segment to focus on separation

![](https://i.imgur.com/vkwuFNp.png)

----

## Introspection 2

- Promising results from pixel classifier
  - Classification works (0.94 pixel accuracy)
  - Grouping of pixels not good enough
- Classification scheme works
  - Problems with regions containing text
- Severe issues in GT
  - Missing regions (graphics and noise)
  - **Lack of consistency**

----

## Alternative route

- Use prediction of regions (ideally bounding boxes)
- Proof of concept with [*StarDist*](https://github.com/mpicbg-csbd/stardist)
- Star-convex object detection

![](https://raw.githubusercontent.com/mpicbg-csbd/stardist/master/images/overview_2d.png)

----

## StarDist model

Region detection model can separate regions!

![](https://i.imgur.com/0C93IP3.png)

----

## Introspection 3

- Pixel classifier good at classifying pixels
- Region detection good at separating regions

<img src="https://imgur.com/TmWf99V.png" width="290" height="490"/>

----

## Pixel classifier + StarDist

Regions currently not classified as a whole

![](https://i.imgur.com/TOwUYTx.png)

---

## How do we go on from here

- Improve GT
  - Fix errors and inconsistencies
  - Add more pages with more varied layout
- Train a dedicated box prediction model
- Numbers don't matter!
- Implement a useful means of evaluation

---

## More examples

![](https://i.imgur.com/hwT2NJ0.png)

----

![](https://i.imgur.com/zw08lSE.jpg)

----

![](https://i.imgur.com/ctNC8F4.jpg)

----

![](https://i.imgur.com/jrmSop1.jpg)

----

![](https://i.imgur.com/31eMbex.png)

----

![](https://i.imgur.com/YIz5bbO.jpg)

----

![](https://i.imgur.com/xyBxPV8.jpg)

----

![](https://i.imgur.com/ioMiPOS.jpg)

----

![](https://i.imgur.com/09tVVwS.jpg)

----

![](https://i.imgur.com/lncdZHk.jpg)

----

![](https://i.imgur.com/jeLkgPq.jpg)

----

![](https://i.imgur.com/7xAp08R.jpg)

----

![](https://i.imgur.com/NhRYZzd.jpg)

----

![](https://i.imgur.com/kOuTGRN.jpg)

----

![](https://i.imgur.com/tjMquL6.jpg)

----

![](https://i.imgur.com/a1iNtaz.png)

----

![](https://i.imgur.com/vwTZQwR.png)

----

![](https://i.imgur.com/ca1700M.jpg)

----

![](https://i.imgur.com/WHzqJPM.jpg)

----

![](https://i.imgur.com/Imbrl9h.jpg)

----

![](https://i.imgur.com/rWbiIHP.jpg)

----

![](https://i.imgur.com/ubEmj3T.jpg)

----

![](https://i.imgur.com/3FLlqwl.jpg)

---

## Many thanks

for your attention and to the organizers and mentors.

It was a great week!