
# Additional fields to capture from alto2txt for the first run at newspaper ingest pipeline

A ticked todo means we have agreed on a decision: [y] indicates we will add the field, [n] means we skip it for now.

## Certain

There are a few fields which we almost certainly want, which are to do with the provenance of the newspapers:

### Source note

```xml
<mods:relatedItem type="original">
  <mods:note type="source note">British Library Heritage Made Digital Newspapers</mods:note>
  <mods:location>
    <mods:physicalLocation authority="marcorg">Uk</mods:physicalLocation>
    <mods:physicalLocation>British Library</mods:physicalLocation>
  </mods:location>
</mods:relatedItem>
```

Including this information in the metadata will help establish the provenance of the newspapers in the DB. It also means that derived/sampled datasets include sufficient metadata about provenance to remove copyright worries for external users of the data.

- [x] Capture decision [y]: just the text contents of `<mods:note type="source note">British Library Heritage Made Digital Newspapers</mods:note>`; see below for the BLN data type.

(for later) In BL Newspaper format the relevant provenance is probably:

```xml
<dc:Rights>Copyright © The British Library Board</dc:Rights>
```

## Maybe 🤷‍♂️

### Processing information

At the moment the following is captured from the METS:

```xml
<mets:name>CCS docWorks/METAe Version 7.0-1</mets:name>
```

Some additional processing info is available in the ALTO:

```xml
<processingSoftware>
  <softwareCreator>Nuance Communications, Inc.</softwareCreator>
  <softwareName>OmniPage</softwareName>
  <softwareVersion>20</softwareVersion>
</processingSoftware>
```

- [x] Capture decision [n]

### Language

```xml
<mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
```

- [x] Capture decision [y]

### Page Number

Page number of the article in the physical newspaper, to be extracted via the ALTO file path. For the BNA from FMP this probably means adding the `fileloc2` variable (i.e. the ALTO filename) to the alto2txt output:

![](https://i.imgur.com/C2fsfkR.jpg)

Note that this requires changes to each of the `extract_text_[...].xslt` files (for the different METS versions).

See: https://github.com/alan-turing-institute/Living-with-Machines/issues/2423

## Do you want to get into this stuff in this doc?

https://github.com/alan-turing-institute/Living-with-Machines/issues/2336#issuecomment-740566981

### Source collection, where we can be explicit, probably passed as parameter

```
e.g. JISC, FMP, BNA, LwM Sponsored ([needs name](https://github.com/alan-turing-institute/Living-with-Machines/issues/2416))
```

- [ ] Capture decision [y/n]
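
To make the fields marked [y] concrete, here is a minimal Python sketch of pulling the source note text and the language code out of a MODS fragment, with the source collection passed in as a parameter. This is not how alto2txt itself does it (the real extraction lives in the `extract_text_[...].xslt` files); it only illustrates which values would end up in the output. The function name `extract_extra_fields`, the output key names, and the wrapping `<mods:mods>` and `<mods:language>` elements in the sample are assumptions made for illustration. The namespace URI is the standard MODS v3 one, which the fragments in this doc omit.

```python
import xml.etree.ElementTree as ET

# Standard MODS v3 namespace; the snippets in this doc leave the declaration out.
NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical sample assembled from the fragments in this doc.
# The <mods:mods> and <mods:language> wrappers are assumed, not copied from a real file.
SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:language>
    <mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
  </mods:language>
  <mods:relatedItem type="original">
    <mods:note type="source note">British Library Heritage Made Digital Newspapers</mods:note>
    <mods:location>
      <mods:physicalLocation authority="marcorg">Uk</mods:physicalLocation>
      <mods:physicalLocation>British Library</mods:physicalLocation>
    </mods:location>
  </mods:relatedItem>
</mods:mods>
"""


def _text_or_none(element):
    """Return stripped element text, or None if the element or its text is missing."""
    if element is None or element.text is None:
        return None
    return element.text.strip()


def extract_extra_fields(mods_xml, source_collection):
    """Collect the agreed [y] fields plus an explicitly supplied collection label."""
    root = ET.fromstring(mods_xml)
    # Only the text contents of the source note are kept, per the decision above.
    note = root.find(
        ".//mods:relatedItem[@type='original']/mods:note[@type='source note']", NS
    )
    language = root.find(".//mods:languageTerm[@type='code']", NS)
    return {
        "source_note": _text_or_none(note),
        "language": _text_or_none(language),
        # Not read from the XML: passed in by the caller, e.g. "JISC", "FMP", "BNA".
        "source_collection": source_collection,
    }


if __name__ == "__main__":
    print(extract_extra_fields(SAMPLE, "JISC"))
```

Missing elements come back as `None` rather than raising, so a file without a source note or language term would still produce a record, with the gap visible downstream.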