# Additional fields to capture from alto2txt for the first run of the newspaper ingest pipeline
A ticked todo means we have agreed on a decision: [y] indicates we will add the field, [n] means we skip it for now.
## Certain
There are a few fields that we almost certainly want, all to do with the provenance of the newspapers:
### Source note
```xml
<mods:relatedItem type="original">
<mods:note type="source note">British Library Heritage Made Digital Newspapers</mods:note>
<mods:location>
<mods:physicalLocation authority="marcorg">Uk</mods:physicalLocation>
<mods:physicalLocation>British Library</mods:physicalLocation>
</mods:location>
</mods:relatedItem>
```
Including this information in the metadata helps establish the provenance of each newspaper in the DB. It also means that derived or sampled datasets carry enough provenance metadata to reduce copyright worries for external users of the data.
- [x] Capture decision [y]: just the text contents of `<mods:note type="source note">British Library Heritage Made Digital Newspapers</mods:note>`; see below for the BLN data type.
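
As a rough illustration of what capturing only the note text could look like, here is a minimal Python sketch using the standard library. It is not how alto2txt does it (alto2txt uses XSLT); the function name `extract_source_note` and the wrapped `SAMPLE` string are hypothetical, and the sketch assumes the `mods` prefix is bound to the standard MODS v3 namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the mods prefix is bound to the standard MODS v3 namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# The snippet from above, with a namespace declaration added so it parses on its own.
SAMPLE = """<mods:relatedItem xmlns:mods="http://www.loc.gov/mods/v3" type="original">
  <mods:note type="source note">British Library Heritage Made Digital Newspapers</mods:note>
  <mods:location>
    <mods:physicalLocation authority="marcorg">Uk</mods:physicalLocation>
    <mods:physicalLocation>British Library</mods:physicalLocation>
  </mods:location>
</mods:relatedItem>"""


def extract_source_note(xml_text):
    """Return the text of the first source note, or None if there is none."""
    root = ET.fromstring(xml_text)
    note = root.find(".//mods:note[@type='source note']", MODS_NS)
    if note is None or note.text is None:
        return None
    return note.text.strip()


print(extract_source_note(SAMPLE))
```

Returning `None` when the note is missing keeps the field optional, which may suit collections that do not carry a source note.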
(for later)
In the BL Newspaper (BLN) format the relevant provenance field is probably:
```xml
<dc:Rights>Copyright © The British Library Board</dc:Rights>
```
## Maybe 🤷♂️
### Processing information
At the moment the following is captured from the METS:
```xml
<mets:name>CCS docWorks/METAe Version 7.0-1</mets:name>
```
Some additional processing info is available in the ALTO:
```xml
<processingSoftware>
<softwareCreator>Nuance Communications, Inc.</softwareCreator>
<softwareName>OmniPage</softwareName>
<softwareVersion>20</softwareVersion>
</processingSoftware>
```
- [x] Capture decision [n]
### Language
```xml
<mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
```
- [x] Capture decision [y]
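
For illustration only, a minimal Python sketch of reading the language code from a MODS fragment. The function name `extract_language_codes` and the wrapping `mods:language` element in `SAMPLE` are assumptions made so the example is self-contained; the actual extraction would live in the alto2txt XSLT.

```python
import xml.etree.ElementTree as ET

# Assumption: standard MODS v3 namespace.
MODS_URI = "http://www.loc.gov/mods/v3"

# Hypothetical wrapper element around the languageTerm shown above.
SAMPLE = (
    '<mods:language xmlns:mods="http://www.loc.gov/mods/v3">'
    '<mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>'
    "</mods:language>"
)


def extract_language_codes(xml_text):
    """Return all languageTerm values whose type attribute is 'code'."""
    root = ET.fromstring(xml_text)
    terms = root.iter("{%s}languageTerm" % MODS_URI)
    return [t.text.strip() for t in terms if t.get("type") == "code" and t.text]


print(extract_language_codes(SAMPLE))
```

A list is returned rather than a single value because a record could, in principle, declare more than one language term.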
### Page Number
Page number of the article in the physical newspaper, to be extracted from the ALTO file path.
For the BNA from FMP this probably means adding the `fileloc2` variable (i.e. ALTO filename) to the alto2txt output.

Note that this requires changes to each of the `extract_text_[...].xslt` files (one per METS version).
See: https://github.com/alan-turing-institute/Living-with-Machines/issues/2423
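
Once the ALTO filename is in the output, the page number could be derived downstream. The sketch below is hypothetical: it assumes the filename stem ends in an underscore followed by a zero-padded page number (for example `..._0003.xml`), which should be checked against the real data before relying on it. The function name and example paths are made up for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumption: the ALTO filename stem ends with "_<digits>", the digits being the page number.
PAGE_PATTERN = re.compile(r"_(\d+)$")


def page_number_from_alto_path(path):
    """Return the trailing page number in an ALTO file path, or None if absent."""
    stem = PurePosixPath(path).stem  # filename without directory or extension
    match = PAGE_PATTERN.search(stem)
    return int(match.group(1)) if match else None


# Hypothetical paths, for illustration only.
print(page_number_from_alto_path("0002647/1898/0107/0002647_18980107_0003.xml"))
print(page_number_from_alto_path("0002647/1898/0107/0002647_18980107_mets.xml"))
```

Files that do not match the assumed pattern, such as a METS file, yield `None` rather than raising, so the caller can decide how to handle them.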
## Do you want to get into this stuff in this doc?
https://github.com/alan-turing-institute/Living-with-Machines/issues/2336#issuecomment-740566981
### Source collection, where we can be explicit, probably passed as parameter
```
e.g. JISC, FMP, BNA, LwM Sponsored ([needs name](https://github.com/alan-turing-institute/Living-with-Machines/issues/2416))
```
- [ ] Capture decision [y/n]