# Final presentation: OCR-D
Robert Sachunsky, Janek Schleicher, Kay‑Michael Würzner
mentored by: Uwe Schmidt
---
# Contents
- What we are trying to achieve
- What did we achieve
- How do we go on from here
---
# What we are trying to achieve
----
## Segmentation of book pages
Hierarchical polygons (mainly boxes)
<img src="https://i.imgur.com/FQzHGNg.png" width="330" />
----
## Segmentation of book pages
Hierarchical polygons (mainly boxes)
<img src="https://i.imgur.com/12R1ecV.png" width="290" />
----
## Segmentation of book pages
Hierarchical polygons (mainly boxes)
<img src="https://i.imgur.com/DNxyPOh.png" width="310" />
----
## Segmentation of book pages
Hierarchical polygons (mainly boxes)
<img src="https://i.imgur.com/zYuGYZA.png" width="390" />
----
## Segmentation of book pages
Hierarchical polygons (mainly boxes)
<img src="https://i.imgur.com/RCEDV9f.png" width="390" />
----
## Segmentation of book pages
Hierarchical polygons (mainly boxes)
<img src="https://i.imgur.com/kZ9QtUZ.png" width="320" />
----
## Segmentation of book pages
Hierarchical polygons (mainly boxes)
<img src="https://i.imgur.com/4xezeUp.png" width="350" />
----
## Classification of segments
- Mixed classes (semantic vs. appearance)
- Text: footnote, marginalia, catchword ...
- Graphics: handwritten annotations, diagrams, drawings ...
- Separators, Math (containing text), Tables (containing separators and text), Noise
----
## Baseline
- Heuristic layout analysis by Tesseract
- Only bounding boxes
- Large overlaps
- Inadequate for historic documents
- Inflexible for complex layouts
- Pixel accuracy:
  - $\approx$ 88%
  - $\approx$ 82% (without background!)
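
The two figures can be read as the same metric computed over different sets of pixels. A minimal sketch, assuming that "without background" means that pixels labelled background in the ground truth are excluded; the function name, the label indices, and the toy label maps are hypothetical and not taken from the project code.

```python
def pixel_accuracy(truth, predicted, ignore_label=None):
    """Fraction of pixels whose predicted label equals the ground-truth label.

    Pixels whose ground-truth label equals `ignore_label` are left out.
    """
    correct = 0
    counted = 0
    for truth_row, predicted_row in zip(truth, predicted):
        for t, p in zip(truth_row, predicted_row):
            if t == ignore_label:
                continue  # excluded from both numerator and denominator
            counted += 1
            correct += (t == p)
    return correct / counted if counted else 0.0


# Hypothetical 3x4 label maps: 0 = background, 1 = text, 2 = graphics
truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 2, 2, 0],
]
predicted = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 2, 1, 0],
]

overall = pixel_accuracy(truth, predicted)                          # 10 of 12 pixels
foreground_only = pixel_accuracy(truth, predicted, ignore_label=0)  # 2 of 4 pixels
```

Because background usually dominates a page, it is typically easy to predict, and excluding it tends to lower the score, as in the toy example.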
----
## Baseline
<table>
<tr>
<td><img src="https://i.imgur.com/FQzHGNg.png"/></td>
<td><img src="https://i.imgur.com/BfRGLs4.png"/></td></tr>
</table>
----
## Training a neural network
- First attempt with fastai
- No preprocessing
- Masking all the different types of text segments
- Pixel classifier with U-Net model
- Pixel accuracy $\approx$ 80%
![](https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/u-net-architecture.png =x200)
----
Initially, no text region type distinction
![](https://i.imgur.com/kqsSMuL.png?1)
----
Further training struggled with text region distinction
Example 1
<table>
<tr><td><img src="https://i.imgur.com/4rZ5Fct.png"/></td>
<td><img src="https://i.imgur.com/HtyDGKj.png"/></td></tr>
</table>
----
Example 2
<table>
<tr><td><img src="https://i.imgur.com/M80oFn9.png"/></td>
<td><img src="https://i.imgur.com/WeZJR5R.png"/></td></tr>
</table>
----
Example 3
<table>
<tr><td><img src="https://i.imgur.com/0snzLEP.png"/></td>
<td><img src="https://i.imgur.com/XuWxUTb.png"/></td></tr>
</table>
----
Example 4
<table>
<tr><td><img src="https://i.imgur.com/BBSbFA0.png"/></td>
<td><img src="https://i.imgur.com/tObT5Uz.png"/></td></tr>
</table>
---
# What did we achieve
----
## Introspection
- Clear definition of the problem
- Page segmentation as a preprocessing step for OCR
- Page segmentation as relevant data in its own right
- Reduction to 7 (mostly) appearance-based classes
- Text, Graphics, Table, Math, Separators, Noise, **Background**
----
## Initial model
Pixel classifier (U-Net): text annotations too loose
![](https://i.imgur.com/FmLuD3G.png)
----
## Improved GT
Shrinking regions to OCRopy line segmentations
<img src="https://i.imgur.com/rnY3bsM.png" width="300" />
----
## Improved GT
Sharper segmentation
![](https://i.imgur.com/VWxFZIl.png)
----
## Additional input: binarization
Disable loss in between letters in text regions
![](https://i.imgur.com/p44Iap0.png)
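
A minimal sketch of what disabling the loss between letters could look like, assuming a per-pixel cross-entropy loss and a binarized image that marks ink pixels. The class index `TEXT`, the function name, and the toy values are hypothetical; the actual training code may have done this differently.

```python
import math

TEXT = 1  # hypothetical class index for text regions


def masked_cross_entropy(probabilities, truth, ink):
    """Mean cross-entropy over the pixels that are allowed to contribute.

    probabilities[y][x] is a list of per-class probabilities,
    truth[y][x] is the ground-truth class index,
    ink[y][x] is True where the binarized image is foreground.
    Pixels inside text regions that are not ink are skipped.
    """
    total = 0.0
    counted = 0
    for prob_row, truth_row, ink_row in zip(probabilities, truth, ink):
        for probs, label, is_ink in zip(prob_row, truth_row, ink_row):
            if label == TEXT and not is_ink:
                continue  # gap between letters: contributes no loss
            total += -math.log(probs[label])
            counted += 1
    return total / counted if counted else 0.0


# One row of three pixels: text ink, text gap, background
probabilities = [[[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]]
truth = [[1, 1, 0]]
ink = [[True, False, False]]

loss = masked_cross_entropy(probabilities, truth, ink)
```

The second pixel is predicted badly, but it lies in a gap inside a text region and therefore does not affect the loss.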
----
## Additional input: binarization
Very tight “regions”
![](https://i.imgur.com/E69FDV6.png)
----
## Text-region boundary
Additional segment to focus on separation
![](https://i.imgur.com/vkwuFNp.png)
----
## Introspection 2
- Promising results from pixel classifier
- Classification works (0.94 pixel accuracy)
- Grouping of pixels not good enough
- Classification scheme works
- Problems with regions containing text
- Severe issues in GT
- Missing regions (graphics and noise)
- **Lack of consistency**
----
## Alternative route
- Use prediction of regions (ideally bounding boxes)
- Proof of concept with [*StarDist*](https://github.com/mpicbg-csbd/stardist)
- Star-convex object detection
![](https://raw.githubusercontent.com/mpicbg-csbd/stardist/master/images/overview_2d.png)
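
A star-convex object can be described by a centre point and the distances to its boundary along a fixed set of rays. The sketch below illustrates that representation and how a bounding box follows from it. It does not use the StarDist API; the function names, the ray ordering (evenly spaced, starting at angle 0), and the example values are assumptions.

```python
import math


def star_polygon(center, distances):
    """Vertices of a star-convex polygon.

    distances[k] is the distance from `center` to the boundary along
    ray k; rays are spaced evenly over the full circle, starting at angle 0.
    """
    cx, cy = center
    n = len(distances)
    vertices = []
    for k, d in enumerate(distances):
        angle = 2 * math.pi * k / n
        vertices.append((cx + d * math.cos(angle), cy + d * math.sin(angle)))
    return vertices


def bounding_box(vertices):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around the vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical region: 4 rays from the centre (10, 20)
vertices = star_polygon((10.0, 20.0), [5.0, 2.0, 5.0, 2.0])
box = bounding_box(vertices)  # approximately (5, 18, 15, 22)
```

Since most page regions are close to rectangles, reducing each predicted polygon to its bounding box loses little here.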
----
## StarDist model
Region detection model can separate regions!
![](https://i.imgur.com/0C93IP3.png)
----
## Introspection 3
- Pixel classifier good at classifying pixels
- Region detection good at separating regions
<img src="https://imgur.com/TmWf99V.png" width="290" height="490"/>
----
## Pixel classifier + StarDist
Regions currently not classified as a whole
![](https://i.imgur.com/TOwUYTx.png)
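
The slides do not say how regions would be classified as a whole. One conceivable option, shown here purely as an illustration, is a majority vote over the pixel classes inside each detected region. The function name and the toy maps are hypothetical.

```python
from collections import Counter


def classify_regions(region_ids, pixel_classes, no_region=0):
    """Assign each region the most frequent pixel class inside it.

    region_ids[y][x] is the detected region of a pixel (`no_region` if none),
    pixel_classes[y][x] is the class predicted by the pixel classifier.
    Ties go to the class that was encountered first.
    """
    votes = {}
    for id_row, class_row in zip(region_ids, pixel_classes):
        for region, cls in zip(id_row, class_row):
            if region == no_region:
                continue
            votes.setdefault(region, Counter())[cls] += 1
    return {region: counts.most_common(1)[0][0]
            for region, counts in votes.items()}


region_ids = [
    [1, 1, 0, 2],
    [1, 1, 0, 2],
]
pixel_classes = [
    ["text", "text", "background", "graphics"],
    ["text", "graphics", "background", "graphics"],
]

region_classes = classify_regions(region_ids, pixel_classes)
```

In the toy example, region 1 becomes text despite one stray graphics pixel, and region 2 becomes graphics.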
---
## How do we go on from here
- Improve GT
- Fix errors and inconsistencies
- Add more pages with more varied layout
- Train a dedicated box prediction model
- Numbers don't matter!
- Implement means of useful evaluation
---
## More examples
![](https://i.imgur.com/hwT2NJ0.png)
----
![](https://i.imgur.com/zw08lSE.jpg)
----
![](https://i.imgur.com/ctNC8F4.jpg)
----
![](https://i.imgur.com/jrmSop1.jpg)
----
![](https://i.imgur.com/31eMbex.png)
----
![](https://i.imgur.com/YIz5bbO.jpg)
----
![](https://i.imgur.com/xyBxPV8.jpg)
----
![](https://i.imgur.com/ioMiPOS.jpg)
----
![](https://i.imgur.com/09tVVwS.jpg)
----
![](https://i.imgur.com/lncdZHk.jpg)
----
![](https://i.imgur.com/jeLkgPq.jpg)
----
![](https://i.imgur.com/7xAp08R.jpg)
----
![](https://i.imgur.com/NhRYZzd.jpg)
----
![](https://i.imgur.com/kOuTGRN.jpg)
----
![](https://i.imgur.com/tjMquL6.jpg)
----
![](https://i.imgur.com/a1iNtaz.png)
----
![](https://i.imgur.com/vwTZQwR.png)
----
![](https://i.imgur.com/ca1700M.jpg)
----
![](https://i.imgur.com/WHzqJPM.jpg)
----
![](https://i.imgur.com/Imbrl9h.jpg)
----
![](https://i.imgur.com/rWbiIHP.jpg)
----
![](https://i.imgur.com/ubEmj3T.jpg)
----
![](https://i.imgur.com/3FLlqwl.jpg)
---
## Many thanks for your attention
and to the organizers and mentors.
It was a great week!