# Talk at DHd 2020, AG OCR
## Project
**Goals:**
Evaluation and implementation of transformers/converters from PAGE to common formats (PDF, TEI, ALTO)
**Implementation:**
**TEI:**
Fork: https://github.com/tboenig/page2tei of https://github.com/dariok/page2tei
**Conversion:**
based on XSLT
Special features and, to some extent, further development:
1. Distinguishes the origin of the PAGE file (Transkribus, OCR-D)
2. The specific mets:fileGrp is passed as a parameter
3. Uses the PAGE classification for layout structure, given in the `type` attribute of the TextRegion element (e.g. ``<TextRegion type="heading">``)
- heading
- caption
- header
- footer
- catch-word
- signature-mark
- marginalia
- footnote
- footnote-continued
- other
Example: conversion of the running header:
```xml
<xsl:when test="@type = 'header'">
  <fw type="header" place="top" facs="#facs_{$numCurr}_{@id}">
    <xsl:apply-templates select="p:TextLine | pc:TextLine"/>
  </fw>
</xsl:when>
```
4. Creation of DIV containers in the TEI file
The headings marked with ``<TextRegion type="heading">`` in the PAGE file are used for this.
The PAGE format has no notion of nesting/hierarchy such as chapter, subchapter, ..., so no such hierarchy can appear in the TEI file either. Instead, all DIV containers are created on the same hierarchy level.
5. Still to do:
- The table conversion needs to be extended into a transformation that conforms to the PAGE schema
- routine maintenance
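The flat DIV creation described in item 4 can be illustrated outside of XSLT. The following Python sketch is not taken from page2tei; the names `Region`, `Div` and `group_into_divs` are hypothetical. It assumes the regions arrive in reading order and that every `heading` region opens a new container on the same level.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    region_type: str  # value of TextRegion/@type, e.g. "heading" or "paragraph"
    text: str


@dataclass
class Div:
    head: Optional[str]  # heading text, or None for content before the first heading
    body: List[str] = field(default_factory=list)


def group_into_divs(regions: List[Region]) -> List[Div]:
    """Open a new flat div at every heading; never nest divs."""
    divs: List[Div] = []
    for region in regions:
        if region.region_type == "heading":
            divs.append(Div(head=region.text))
        else:
            if not divs:
                # Content before the first heading gets a div without a head.
                divs.append(Div(head=None))
            divs[-1].body.append(region.text)
    return divs
```

Because PAGE carries no level information, the sketch never inspects the heading text to guess a depth; two consecutive headings simply yield two sibling containers.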
**PAGE2ALTO:**
Prerequisite:
## Feature-Vergleich PAGE-ALTO
| Description | PAGE | ALTO implementation | ALTO (3.0 and later), proposal from https://altoxml.github.io/documentation/use-cases/tags/ALTO_tags_usecases.html |
| -------------- | -------------- | -------------- | -------------- |
| **Pagetype** | | | |
| Covers | ``<Page type="front-cover">`` | ``<StructureTag TYPE="Functional" LABEL="front-cover"/>`` | ``<StructureTag TYPE="Functional" LABEL="Cover"/>`` |
| Title pages | ``<Page type="title">`` | ``<StructureTag TYPE="Functional" LABEL="title"/>`` | ``<StructureTag TYPE="Functional" LABEL="TitlePage"/>`` |
| Frontmatter | ``<Page type="content">`` | | Text |
| Tables of content | ``<Page type="table-of-contents">`` | ``<StructureTag TYPE="Functional" LABEL="table-of-contents"/>`` | ``<StructureTag TYPE="Functional" LABEL="TOC"/>`` |
| Body matter | ``<Page type="content">`` | ``<StructureTag TYPE="Structural" LABEL="BodyMatter"/>`` | ``<StructureTag TYPE="Structural" LABEL="BodyMatter"/>`` |
| Backmatter | ``<Page type="content">`` | ``<StructureTag TYPE="Functional" LABEL="LOI"/>`` | ``<StructureTag TYPE="Functional" LABEL="LOI"/>`` |
| appendix | ``<Page type="content">`` | ``<StructureTag TYPE="Functional" LABEL="Appendix"/>`` | ``<StructureTag TYPE="Functional" LABEL="Appendix"/>`` |
| tables list | ``<Page type="content">`` | ``<StructureTag TYPE="Functional" LABEL="LOT"/>`` | ``<StructureTag TYPE="Functional" LABEL="LOT"/>`` |
| conclusion | ``<Page type="content">`` | ``<StructureTag TYPE="Functional" LABEL="Conclusion"/>`` | ``<StructureTag TYPE="Functional" LABEL="Conclusion"/>`` |
| glossary | ``<Page type="content">`` | ``<StructureTag TYPE="Functional" LABEL="Glossary"/>`` | ``<StructureTag TYPE="Functional" LABEL="Glossary"/>`` |
| bibliography | ``<Page type="content">`` | ``<StructureTag TYPE="Functional" LABEL="Bibliography"/>`` | ``<StructureTag TYPE="Functional" LABEL="Bibliography"/>`` |
| Index | ``<Page type="index">`` | ``<StructureTag TYPE="Functional" LABEL="index"/>`` | ``<StructureTag TYPE="Functional" LABEL="Index"/>`` |
| **Textregion** | | | |
| Running titles | ``<TextRegion type="header">`` | ``<StructureTag TYPE="Functional" LABEL="header"/>`` | ``<StructureTag TYPE="Functional" LABEL="RunningTitle"/>`` |
| chapters | ``<TextRegion type="heading">`` | ``<StructureTag TYPE="Structural" LABEL="heading" DESCRIPTION="I"/> `` | ``<StructureTag TYPE="Structural" LABEL="Chapter" DESCRIPTION="I"/> `` |
| parts | ``<TextRegion type="heading">`` | ``<StructureTag TYPE="Structural" LABEL="heading" DESCRIPTION="I"/>`` | ``<StructureTag TYPE="Structural" LABEL="Part" DESCRIPTION="I"/>`` |
| Titles | Text | ``<StructureTag TYPE="Structural" LABEL="FullTitle"/>`` | ``<StructureTag TYPE="Structural" LABEL="FullTitle"/>`` |
| subtitles | Text | ``<StructureTag TYPE="Structural" LABEL="Title1"/>`` | ``<StructureTag TYPE="Structural" LABEL="Title1"/>`` |
| Footnote references | Text | ``<StructureTag TYPE="Functional" LABEL="FootnoteReference" DESCRIPTION="1"/>`` | ``<StructureTag TYPE="Functional" LABEL="FootnoteReference" DESCRIPTION="1"/>`` |
| Footnotes | ``<TextRegion type="footnote">`` | ``<StructureTag TYPE="Functional" LABEL="footnote" DESCRIPTION="1"/>`` | ``<StructureTag TYPE="Functional" LABEL="Footnote" DESCRIPTION="1"/>`` |
| References to footnote | Text | ``<StructureTag TYPE="Reference" LABEL="ReferenceToFootnote" DESCRIPTION="1"/>`` | ``<StructureTag TYPE="Reference" LABEL="ReferenceToFootnote" DESCRIPTION="1"/>`` |
| Marginalias | ``<TextRegion type="marginalia">`` | ``<StructureTag TYPE="Functional" LABEL="marginalia"/>`` | ``<StructureTag TYPE="Functional" LABEL="Marginalia"/>`` |
| Figure captions | ``<TextRegion type="caption">`` | ``<StructureTag TYPE="Functional" LABEL="FigureCaption"/>`` | ``<StructureTag TYPE="Functional" LABEL="FigureCaption"/>`` |
| Figure references | Text | ``<StructureTag TYPE="Functional" LABEL="FigureReference" DESCRIPTION="9"/>`` | ``<StructureTag TYPE="Functional" LABEL="FigureReference" DESCRIPTION="9"/>`` |
| Table captions | ``<TextRegion type="caption">`` | ``<StructureTag TYPE="Functional" LABEL="TableCaption"/>`` | ``<StructureTag TYPE="Functional" LABEL="TableCaption"/>`` |
| Table references | Text | ``<StructureTag TYPE="Functional" LABEL="TableReference" DESCRIPTION="1"/>`` | ``<StructureTag TYPE="Functional" LABEL="TableReference" DESCRIPTION="1"/>`` |
| References to table | Text | ``<StructureTag TYPE="Reference" LABEL="ReferenceToTable" DESCRIPTION="1"/>`` | |
| Page numbers | ``<TextRegion type="page-number">`` | ``<StructureTag TYPE="Functional" LABEL="page-number" DESCRIPTION="937"/>`` | ``<StructureTag TYPE="Functional" LABEL="PageNumber" DESCRIPTION="937"/>`` |
| Reference to page | Text | ``<StructureTag TYPE="Reference" LABEL="ReferenceToPage" DESCRIPTION="8"/>`` | ``<StructureTag TYPE="Reference" LABEL="ReferenceToPage" DESCRIPTION="8"/>`` |
| Unordered lists | ``<TextRegion type="list-label">`` | ``<StructureTag TYPE="Functional" LABEL="UL"/>`` | ``<StructureTag TYPE="Functional" LABEL="UL"/>`` |
| Ordered lists | ``<TextRegion type="list-label">`` | ``<StructureTag TYPE="Functional" LABEL="OL"/>`` | ``<StructureTag TYPE="Functional" LABEL="OL"/>`` |
https://github.com/maxnth/page-alto-ressources
https://github.com/PRImA-Research-Lab/prima-page-converter
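A few rows of the table above can be expressed as a lookup. The sketch below is only an illustration in Python, not code from the converter: `PROPOSED_LABELS` and `structure_tag` are hypothetical names, and only the four text region types whose mapping is unambiguous in the table are included. Types without an entry fall back to the PAGE value itself, which is what the "ALTO implementation" column does.

```python
import xml.etree.ElementTree as ET

# PAGE TextRegion type -> label from the ALTO proposal column of the table.
# "heading" and "caption" are left out on purpose: the table maps each of them
# to two different labels (Chapter/Part, FigureCaption/TableCaption).
PROPOSED_LABELS = {
    "header": "RunningTitle",
    "footnote": "Footnote",
    "marginalia": "Marginalia",
    "page-number": "PageNumber",
}


def structure_tag(page_type: str, tag_id: str, use_proposed_labels: bool = True) -> ET.Element:
    """Build a StructureTag element for one PAGE text region type."""
    if use_proposed_labels:
        label = PROPOSED_LABELS.get(page_type, page_type)
    else:
        label = page_type  # keep the PAGE value unchanged
    return ET.Element(
        "StructureTag",
        {"ID": tag_id, "TYPE": "Functional", "LABEL": label},
    )


if __name__ == "__main__":
    print(ET.tostring(structure_tag("header", "tag1"), encoding="unicode"))
```

Keeping the PAGE value as the fallback means no region type is silently dropped, at the cost of labels that do not follow the ALTO proposal.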
## Prior work
- [Prima PAGE Converter](https://github.com/PRImA-Research-Lab/prima-page-converter)
- Transkribus converter: https://github.com/tboenig/TranskribusCore/tree/master/src/main/resources/xslt
## Goal
- As far as possible, all information in the PAGE file should be carried over into the ALTO file.
## Implementation
### Code base
Contributing to the development of [Prima PAGE Converter](https://github.com/PRImA-Research-Lab/prima-page-converter)
### Feature Branches
#### [feature/AdvancedTags](https://github.com/maxnth/prima-core-libs/tree/feature/AdvancedTags)
Experimental branch adding the LayoutTags proposed by [ALTO](https://altoxml.github.io/documentation/use-cases/tags/ALTO_tags_usecases.html)
Example:
```xml
…
<Tags>
  <LayoutTag ID="tag1" LABEL="Image"/>
  <LayoutTag ID="tag2" LABEL="Table"/>
  <LayoutTag ID="tag3" LABEL="Map"/>
</Tags>
…
<Illustration HEIGHT="36" HPOS="719" ID="r_0_1" TAGREFS="tag1" TYPE="ImageRegion"
              VPOS="427" WIDTH="52">
  <Shape>
    <Polygon POINTS="719,427 770,427 770,462 719,462"/>
  </Shape>
</Illustration>
<Illustration HEIGHT="50" HPOS="719" ID="r_0_2" TAGREFS="tag1" TYPE="ImageRegion"
              VPOS="427" WIDTH="52">
  <Shape>
    <Polygon POINTS="719,427 770,427 770,462 719,462"/>
  </Shape>
</Illustration>
…
<ComposedBlock HEIGHT="86" HPOS="25" ID="r_200" TAGREFS="tag2" TYPE="TableRegion"
               VPOS="475" WIDTH="376">
  <Shape>
    <Polygon POINTS="25,475 25,560 400,560 400,475"/>
  </Shape>
  <TextBlock HEIGHT="86" HPOS="25" ID="r_200_1" VPOS="475" WIDTH="376">
  …
</ComposedBlock>
…
<Illustration HEIGHT="80" HPOS="719" ID="r_1_1" TAGREFS="tag3" TYPE="MapRegion"
              VPOS="427" WIDTH="52">
  <Shape>
    <Polygon POINTS="719,427 770,427 770,462 719,462"/>
  </Shape>
</Illustration>
…
```
The tag values still need to be defined.
#### [feature/ComposedBlockType](https://github.com/maxnth/prima-core-libs/tree/feature/ComposedBlockType)
Adds the PAGE XML region type to created ComposedBlockType-Elements (akin to IllustrationBlock-Elements)
Example:
```xml
<ComposedBlock HEIGHT="86" HPOS="25" ID="r_200" TAGREFS="tag2" TYPE="TableRegion"
               VPOS="475" WIDTH="376">
  <Shape>
    <Polygon POINTS="25,475 25,560 400,560 400,475"/>
  </Shape>
  <TextBlock HEIGHT="86" HPOS="25" ID="r_200_1" VPOS="475" WIDTH="376">
  …
</ComposedBlock>
```
#### [feature/TextRegionTypeTags](https://github.com/maxnth/prima-core-libs/tree/feature/TextRegionTypeTags)
Adds OtherTag-Elements for PAGE XML text region types
Example:
```xml
…
<Tags>
  <OtherTag DESCRIPTION="PAGE XML text region type" ID="tag1" LABEL="page-number"/>
  <OtherTag DESCRIPTION="PAGE XML text region type" ID="tag2" LABEL="header"/>
</Tags>
…
<TextBlock HEIGHT="36" HPOS="719" ID="r_1_1" TAGREFS="tag1" VPOS="427" WIDTH="52">
  <Shape>
    <Polygon POINTS="719,427 770,427 770,462 719,462"/>
  </Shape>
  <TextLine HEIGHT="34" HPOS="720" ID="tl_1_1" LANG="de" STYLEREFS="ts1"
            VPOS="428" WIDTH="50">
    <Shape>
      <Polygon POINTS="720,428 769,428 769,461 720,461"/>
    </Shape>
    <String CONTENT="26" HEIGHT="34" HPOS="720" ID="w_w1aab1b1b2b1b1ab1"
            LANG="de" STYLEREFS="ts2" VPOS="428" WIDTH="50">
      <Shape>
        <Polygon POINTS="720,428 769,428 769,461 720,461"/>
      </Shape>
    </String>
  </TextLine>
</TextBlock>
…
<TextBlock HEIGHT="33" HPOS="1044" ID="r_2_1" IDNEXT="r_4_1" TAGREFS="tag2"
           VPOS="442" WIDTH="228">
  <Shape>
    <Polygon POINTS="1044,442 1271,442 1271,474 1044,474"/>
  </Shape>
  <TextLine HEIGHT="31" HPOS="1045" ID="tl_2" LANG="de" STYLEREFS="ts3" VPOS="443"
            WIDTH="226">
    <Shape>
      <Polygon POINTS="1045,443 1270,443 1270,473 1045,473"/>
    </Shape>
    <String CONTENT="II." HEIGHT="26" HPOS="1045" ID="w_w1aab1b3b2b1b1ab1"
            LANG="de" STYLEREFS="ts3" VPOS="448" WIDTH="40">
      <Shape>
        <Polygon POINTS="1045,448 1084,448 1084,473 1045,473"/>
      </Shape>
    </String>
    <String CONTENT="Abschnitt." HEIGHT="30" HPOS="1103"
            ID="w_w1aab1b3b2b1b1ab9" LANG="de" STYLEREFS="ts3" VPOS="443"
            WIDTH="168">
      <Shape>
        <Polygon POINTS="1103,443 1270,443 1270,472 1103,472"/>
      </Shape>
    </String>
  </TextLine>
</TextBlock>
…
```
cf. the GT guidelines on TextRegion: https://ocr-d.github.io/en/gt-guidelines/trans/lytextregion.html
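The tagging scheme in the example above needs one `OtherTag` per distinct region type plus a way to look up its ID for `TAGREFS`. A minimal Python sketch of that bookkeeping follows; it is not the branch's Java code, `build_other_tags` is a hypothetical name, and the `tagN` numbering is an assumption modelled on the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Tuple


def build_other_tags(region_types: Iterable[str]) -> Tuple[ET.Element, Dict[str, str]]:
    """Create a Tags element with one OtherTag per distinct region type.

    Returns the element and a dict mapping region type -> tag ID, which can
    then be written into the TAGREFS attribute of the matching TextBlock.
    """
    tags = ET.Element("Tags")
    tag_ids: Dict[str, str] = {}
    for region_type in region_types:
        if region_type in tag_ids:
            continue  # each type is declared only once
        tag_id = f"tag{len(tag_ids) + 1}"
        tag_ids[region_type] = tag_id
        ET.SubElement(
            tags,
            "OtherTag",
            {
                "DESCRIPTION": "PAGE XML text region type",
                "ID": tag_id,
                "LABEL": region_type,
            },
        )
    return tags, tag_ids
```

Declaring each type once and referencing it by ID keeps the `Tags` section small even when a page contains many regions of the same type.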
#### [feature/fontColor](https://github.com/maxnth/prima-core-libs/tree/feature/fontColor)
Add support for converting textColour attributes to ALTO FONTCOLOR attributes
Example:
PAGEXML
```xml
…
<TextStyle textColour="red" fontFamily="Times New Roman" fontSize="8.5"/>
…
```
ALTO
```xml
…
<Styles>
…
<TextStyle FONTCOLOR="ff0000" FONTFAMILY="Times New Roman" FONTSIZE="8.5" ID="ts3"/>
…
```
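The conversion shown above turns a PAGE colour name into a hexadecimal value. A hedged Python sketch of such a lookup is given here; `COLOUR_TO_HEX` and `to_alto_fontcolor` are hypothetical names, and only the `red` entry is confirmed by the example. The other entries are assumptions and may differ from what the branch emits.

```python
from typing import Optional

COLOUR_TO_HEX = {
    "red": "ff0000",    # taken from the example above
    "black": "000000",  # assumption
    "white": "ffffff",  # assumption
    "blue": "0000ff",   # assumption
}


def to_alto_fontcolor(text_colour: Optional[str]) -> Optional[str]:
    """Return the FONTCOLOR value for a PAGE textColour, or None if unknown."""
    if text_colour is None:
        return None
    return COLOUR_TO_HEX.get(text_colour.strip().lower())
```

Returning `None` for an unmapped name lets the caller omit the FONTCOLOR attribute instead of writing a guessed value.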
#### [feature/marginWOprintspace](https://github.com/maxnth/prima-core-libs/tree/feature/marginWOprintspace)
Implements ALTO margin elements (TopMargin, BottomMargin, LeftMargin, RightMargin) for PAGE XML files without an explicitly specified PrintSpace element. Builds on top of [feature/margins](https://github.com/maxnth/prima-core-libs/tree/feature/margins).
The PrintSpace is derived from the polygon around all regions on the page.
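As a simplified illustration of deriving a PrintSpace from the regions, the Python sketch below computes the axis-aligned bounding box of all region polygons. This is an assumption: the branch may use a tighter polygon than a rectangle. The function names are hypothetical; the inclusive width and height (max - min + 1) follow the coordinate examples in this document.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def parse_points(points: str) -> List[Point]:
    """Parse a POINTS attribute such as '719,427 770,427 770,462 719,462'."""
    parsed = []
    for pair in points.split():
        x, y = pair.split(",")
        parsed.append((int(x), int(y)))
    return parsed


def bounding_box(polygons: List[List[Point]]) -> Tuple[int, int, int, int]:
    """Return (HPOS, VPOS, WIDTH, HEIGHT) of the box enclosing all polygons."""
    xs = [x for polygon in polygons for x, _ in polygon]
    ys = [y for polygon in polygons for _, y in polygon]
    if not xs:
        raise ValueError("at least one polygon point is required")
    hpos, vpos = min(xs), min(ys)
    # Inclusive pixel coordinates: a box from 719 to 770 is 52 pixels wide.
    return hpos, vpos, max(xs) - hpos + 1, max(ys) - vpos + 1
```

The resulting box can then be treated like an explicitly specified PrintSpace when the margins are computed.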
#### [feature/margins](https://github.com/maxnth/prima-core-libs/tree/feature/margins)
Implements ALTO margin elements (TopMargin, BottomMargin, LeftMargin, RightMargin) for PAGE XML files with an explicitly specified PrintSpace element. Everything outside the PrintSpace is considered part of the margins.
Example:
```xml
…
<Page HEIGHT="2686" ID="p0" PAGECLASS="content" PHYSICAL_IMG_NR="0" WIDTH="1700">
  <TopMargin HEIGHT="376" HPOS="0" ID="TopMarginTypeID0" VPOS="0" WIDTH="1700"/>
  <LeftMargin HEIGHT="1772" HPOS="0" ID="LeftMarginTypeID0" VPOS="376" WIDTH="630"/>
  <RightMargin HEIGHT="1772" HPOS="1699" ID="RightMarginTypeID0" VPOS="376" WIDTH="3"/>
  <BottomMargin HEIGHT="538" HPOS="0" ID="BottomMarginTypeID0" VPOS="2148" WIDTH="1700"/>
  <PrintSpace HEIGHT="1772" HPOS="630" ID="PageSpaceTypeID0" VPOS="376" WIDTH="1071">
    <Shape>
      <Polygon POINTS="630,376 1700,376 1700,2147 630,2147"/>
    </Shape>
    …
```
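The geometry behind the example can be sketched in a few lines of Python. This is an illustration under assumptions, not the branch's implementation: `Box` and `margins` are hypothetical names, and the layout (top and bottom margins span the full page width, left and right margins span the PrintSpace height) is read off the example. With the example's numbers the sketch reproduces the top, left and bottom margins; the right margin in the example (HPOS 1699, WIDTH 3) evidently follows a rule that this sketch does not model.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Box:
    hpos: int
    vpos: int
    width: int
    height: int


def margins(page_width: int, page_height: int, print_space: Box) -> Dict[str, Box]:
    """Split the area outside the PrintSpace into four margin boxes."""
    right_start = print_space.hpos + print_space.width
    bottom_start = print_space.vpos + print_space.height
    return {
        "TopMargin": Box(0, 0, page_width, print_space.vpos),
        "LeftMargin": Box(0, print_space.vpos, print_space.hpos, print_space.height),
        # Widths and heights are clamped so a PrintSpace touching the page
        # edge yields an empty margin instead of a negative size.
        "RightMargin": Box(
            right_start,
            print_space.vpos,
            max(0, page_width - right_start),
            print_space.height,
        ),
        "BottomMargin": Box(
            0, bottom_start, page_width, max(0, page_height - bottom_start)
        ),
    }
```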
#### [stringMaps](https://github.com/maxnth/prima-core-libs/tree/stringMaps)
Small code refactoring, replacing somewhat verbose if blocks (containing duplicates) with maps.
### Merged
https://github.com/PRImA-Research-Lab/prima-core-libs/pull/3 (newline normalization)
### Differences from the Transkribus implementation

https://github.com/Transkribus/TranskribusCore/tree/master/src/main/resources/xslt
The two implementations differ in:
1. ALTO version
Transkribus: alto-v2.0.xsd - PRIMA: alto-4-1.xsd
2. Styles
Transkribus: not carried over - PRIMA: carried over
3. Structure
Transkribus: not carried over - PRIMA: carried over (see table)
## ocrd_fileformat
https://github.com/OCR-D/ocrd_fileformat
https://github.com/UB-Mannheim/ocr-fileformat
Transformation scenarios: https://github.com/UB-Mannheim/ocr-fileformat#supported-transformations
| From ╲ To | hOCR | ALTO | PAGEXML |
| ---: | --- | --- | --- |
| hOCR | = | ✓ | ✓ |
| ALTO | ✓ | = | ✓ |
| PAGEXML | ✓ | ✓ | = |
| FineReader | ✓ | - | ✓ |
| Google Cloud Vision | ✓ | - | ✓ |
| TEI | ✓ | - | - |
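
The matrix above can be encoded as a small lookup, which also makes it easy to check whether a missing direct conversion could be reached through an intermediate format. The Python sketch below is purely illustrative: `SUPPORTED` and `find_route` are hypothetical names and not part of ocr-fileformat, and the table says nothing about whether a chained conversion preserves all information.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Source format -> target formats marked with a check mark in the table.
SUPPORTED: Dict[str, Tuple[str, ...]] = {
    "hOCR": ("ALTO", "PAGEXML"),
    "ALTO": ("hOCR", "PAGEXML"),
    "PAGEXML": ("hOCR", "ALTO"),
    "FineReader": ("hOCR", "PAGEXML"),
    "Google Cloud Vision": ("hOCR", "PAGEXML"),
    "TEI": ("hOCR",),
}


def find_route(source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of listed conversions."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in SUPPORTED.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the target is not reachable from the source
```

A direct conversion yields a two-element path, an indirect one a longer path, and `None` signals that no chain of listed transformations reaches the target.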
## Example of reuse: ALTO2PAGE @ SLUB
Helpful when existing holdings that already have OCR results are reprocessed with OCR-D. An existing, satisfactory layout segmentation can be reused and given new text. In one concrete case, the tool is already being used to reprocess the "Börsenblatt für den deutschen Buchhandel". The region segmentation by Abbyy Cloud is acceptable (and unfortunately there is no alternative, given OCR-D), while line and text recognition are abysmal.
Example:
https://digital.slub-dresden.de/werkansicht/dlf/229034/
https://www.zfb.uni-muenchen.de/forschungundnachwuchs/forschung/f_boebl/index.html