Labelling OCR Ground Truth for Usage in Repositories

<style>
h1, h2, h3, h4, p { text-align: left; }
p < img { text-align: center; }
/* The animation code */
@keyframes example { 0% {color:red; left:0px; top:0px;} 25% {color:yellow; left:200px; top:0px;} 50% {color:blue; left:200px; top:200px;} 75% {color:green; left:0px; top:200px;} 100% {color:red; left:0px; top:0px;} }
@keyframes example2 { /*from {color: yellow;} to {color: red;} 50% {opacity: 3;}*/ 100% {color:red; left:0px; top:0px;} 75% {color:yellow; left:200px; top:0px;} 50% {color:blue; left:200px; top:200px;} 25% {color:green; left:0px; top:200px;} 0% {color:red; left:0px; top:0px;} }
div.big{ font-size: 140%; text-align: center; }
span.klein{ font-size: 60%; }
span.sklein{ font-size: 40%; }
span.b1 { font-size: 110%; color: red; animation-name: example; animation-duration: 2s; animation-iteration-count: 13; }
span.b2 { font-size: 110%; color: yellow; animation-name: example2; animation-duration: 2s; animation-iteration-count: 13; }
div.klein{ font-size: 60%; }
div.space ul li { margin-top: 2em; font-size: 60% !important; }
div.sklein{ font-size: 30%; line-height: 1em; }
div.head_sklein{ font-size: 40%; }
div.absolute { position: fixed; width: 100%; bottom: 1px; background: rgba(55,55,55,0.8) !important; }
ol.footnotes-list { font-size: 10pt; }
table.klein { font-size: 60%; }
img { }
</style>

# <div style="text-shadow: 3px 2px black;">Labelling OCR Ground Truth for Usage in Repositories <span style="text-shadow: 3px 2px black;">Konstantin Baierer, Matthias Boenig, Berlin State Library – Prussian Cultural Heritage and Berlin-Brandenburg Academy of Sciences and Humanities</span></div>

---

# Content

* Definition of Ground Truth
* Metadata for Ground Truth
* OCR-D Repository
* DEMO

---

# <div style="background: rgba(55,55,55,0.8) !important;">Ground Truth</div>

<div style="background: rgba(55,55,55,0.8) !important; font-size: 70%; text-align: left;">
GT in this context means the text of the document in the form of digital transcriptions, e.g.
characters, paragraphs as well as structural features such as headings, footnotes, etc. In addition to these text-centered features, special features of the document both on a technical (e.g. preprocessing of the image) and on a physical level (e.g. aging, artifacts) must also be included in the feature documentation.</div>

---

# Metadata for Ground Truth

* bibliographic
* structural
* physical
* technical information

----

# Bibliographic and structural metadata

<div class="klein">

* such as place of publication, language, date etc. :+1: are best described using MODS (Metadata Object Description Schema) as part of the METS format (Metadata Encoding and Transmission Standard),
* the physical and logical structure of a digital object, such as the image and transcription files. :+1: the very flexible METS container format can **embed** or **reference** this information

</div>

----

# <div style="background: rgba(55,55,55,0.8) !important; font-size: 70%; text-align: left;">Physical and technical information</div>

<div class="klein" style="background: rgba(55,55,55,0.8) !important; font-size: 70%; text-align: left;">

* a distinction is made between extrinsic and intrinsic characteristics
* **Extrinsic features** were added to the object *during creation* (e.g. faint printing quality, ghosting/bleeding, the hand of a scan operator or information loss by preprocessing steps like binarization)
* **Intrinsic features** are based on the *innate* properties of the document (e.g.
language(s), font(s), print space or annotations)

</div>

----

# <div style="background: rgba(55,55,55,0.8) !important; font-size: 70%; text-align: left;">Ontology and Framework for Semantic Labelling of Document Data <br><span class="sklein">https://github.com/PRImA-Research-Lab/semantic-labelling</span></div>

<div class="klein" style="background: rgba(55,55,55,0.8) !important; font-size: 70%; text-align: left;">

* a category scheme for the description of intrinsic and extrinsic properties of digital objects
* e.g. use in Europeana Newspapers Project Dataset (https://www.primaresearch.org/repository/index/ENP)

</div>

----

## Intrinsic characteristics of a document page

* corresponding with OCR-D-GTM record property

<table class="klein">
<tr><th>page</th><th>OCR-D-GTM record property</th></tr>
<tr><td>font family</td><td>data-attributes/document-related/visual/text/font/typeface/cluster/bastarda<br>data-attributes/document-related/visual/text/font/multi-font/typefaces</td></tr>
<tr><td>font size</td><td>data-attributes/document-related/visual/text/font/multi-font/font-sizes</td></tr>
<tr><td>stamp</td><td>condition/wear/additions/informative/stamps</td></tr>
</table>

----

## Extrinsic characteristics of a document page

* corresponding with OCR-D-GTM record property

<table class="klein">
<tr><th>page</th><th>OCR-D-GTM record property</th></tr>
<tr><td>faint printing</td><td>condition/production-related/document-faults/faint-chars</td></tr>
<tr><td>low contrast</td><td>condition/production-related/document-characteristics/low-contrast<br>condition/acquisition/method-flaws/imaging/low-contrast</td></tr>
<tr><td>skewing</td><td>condition/acquisition/geometric/skew/global<br>condition/acquisition/geometric/skew</td></tr>
<tr><td>scan operator's fingers visible</td><td>condition/acquisition/content-or-background/included-objects/fingers</td></tr>
</table>

----

# <div style="font-size: 70%; text-align: left;" class="absolute">OCR-D-GTM METS metadata profile</div>

<img
src="https://i.imgur.com/017qR7m.jpg" width="300%" height="300%" style="vertical-align:middle; background:none; border:none; box-shadow:3px 3px 5px rgba(0, 0, 0, 0.5);"/>

---

# OCR-D Repository

<div class="space">

🕍⛪🕌 The **Ground Truth (GT) repository** contains all the ground truth metadata. This data repository is **publicly** available.

🔭 The **research data repository** may contain the results of all steps during document analysis. At a minimum, it contains the final results of every processed document together with its full provenance. The research data repository does **not need to be publicly available**.

</div>

---

## OCR-D Ground Truth Repository (OCR-D-GT-REP)

<div class="klein" style="text-align:left;">

**Software Requirements**

<table style="margin-top:2em;">
<tr>
<td style="border-style: solid; border-width: 5px;">

* Java 8 or higher
* PostgreSQL 9.1 or higher
* ArangoDB 3.3 or higher
* Repository
* Repository Authentication Service

</td><td style="font-size:30pt; vertical-align:middle;">⬌</td><td style="border-style: solid; border-width: 5px; vertical-align:middle;">Docker</td></tr>
</table>

**Supported Format**

Container format: BagIt Version 0.97+
Profile: https://ocr-d.github.io/bagit-profile.json
Protocol: HTTP (REST)

<img src="https://github.githubassets.com/images/spinners/octocat-spinner-128.gif" width="5%" height="5%" style="vertical-align:middle; background:none; border:none; box-shadow:3px 3px 5px rgba(0, 0, 0, 0.5);"> https://github.com/OCR-D/repository_metastore

</div>

---

# <div style="background: rgba(55,55,55,0.8) !important; font-size: 70%; text-align: left;">Ingest and exchange format: OCRD-ZIP</div>

<ol style="background: rgba(55,55,55,0.8) !important; font-size: 70%; text-align: left; padding: 40px; font-size: 60%">
<li>Upload BagIt-Container using REST<br> <span class="klein">curl -u ingest:PASSWORD -v -F BagItContainer http://localhost:8080/api/v1/metastore/bagit</span></li>
<div class="big"><span class="b1">⭣🙆</span><span class="b2">⭡</span><span
class="b1">⭣🙆</span><span class="b2">⭡</span><span class="b1">⭣🙆</span><span class="b2">⭡</span><span class="b1">⭣🙆</span><span class="b2">⭡</span><span class="b1">⭣🙆</span><span class="b2">⭡</span></div>
<li>Unzip container</li>
<li>Validate container</li>
<li>Extract metadata
<ul>
<li>METS Header (ppn, title)</li>
<li>METS file</li>
<li>PROV XML (if available)</li>
<li>GT metadata (if available)</li>
</ul></li>
<li>Index metadata (Elasticsearch / Kibana)</li>
</ol>

---

# <div class="head_sklein">DEMO</div>

<div class="sklein">

* List all Documents
  * The list shows all ingested documents with their ‘resourceID’, ‘Link for Download’, ‘Referenced Files’, ‘Metadata’, and ‘Semantic Labeling’
    https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit
    (Upload is only available for authorized users)
* List all Files inside Document
  * All files referenced inside the mets.xml are listed here.
    https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/f15fb8c8-3842-4314-9a44-5e8b472d7bfc/files
* List Metadata
  * List metadata of the document (e.g. title, author, year, identifier, languages, classifications).
    https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/f15fb8c8-3842-4314-9a44-5e8b472d7bfc/metadata
* List Ground Truth Metadata
  * List all semantic labels.
    https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/f15fb8c8-3842-4314-9a44-5e8b472d7bfc/groundtruth
* Search for Semantic Label
  * Search for documents with uneven illumination.
    https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/labeling?label=condition/acquisition/method-flaws/imaging/uneven-illumination
* Search
  * Search for Documents containing two Semantic Labels at once
    https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/labeling?label=condition/acquisition/method-flaws/imaging/uneven-illumination,condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding
  * Search for Documents with Classification ‘Fachtext’
    https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/classification?class=Fachtext
* Download single file
  * Download/view a single file (TIFF).
    https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/f15fb8c8-3842-4314-9a44-5e8b472d7bfc/data/bagit/data/OCR-D-IMG/OCR-D-IMG_0001

</div>

---

# 🙇 Thank you 🙇

ocr-d.de
ocr-d.github.io
ocr-d.github.io/docs
github.com/OCR-D
gitter.im/OCR-D/Lobby
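
----

The search URLs from the DEMO slide can also be assembled in a script. The following is a minimal sketch, not part of the repository software: the helper names `label_search_url`, `classification_search_url` and `fetch` are hypothetical, the base URL is simply the one shown on the slide (the server may no longer be reachable), and the response format is not specified here, so `fetch` only returns the raw text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL as shown on the DEMO slide (assumption: still valid for your setup).
BASE_URL = "https://ocr-d-repo.scc.kit.edu/api/v1/metastore"


def label_search_url(labels, base_url=BASE_URL):
    """Build a search URL for documents carrying all given semantic labels."""
    # As on the slide, several labels are joined with a comma;
    # "/" and "," are left unescaped so the URL matches the slide's form.
    query = urlencode({"label": ",".join(labels)}, safe="/,")
    return f"{base_url}/mets/labeling?{query}"


def classification_search_url(name, base_url=BASE_URL):
    """Build a search URL for documents with the given classification."""
    return f"{base_url}/mets/classification?{urlencode({'class': name})}"


def fetch(url, timeout=10):
    """Hypothetical helper: return the raw response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only builds and prints the URL; no network access happens here.
print(label_search_url([
    "condition/acquisition/method-flaws/imaging/uneven-illumination",
]))
```

Calling `fetch(label_search_url([...]))` would issue the actual request; this is kept separate so that the URLs can be checked without a running repository.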