
# InvenioRDM OCFL Case Study

## Roadmap

- InvenioRDM started its first OCFL Sprint on 15/11, and on 19/11 produced an OCFL test output manually.
- OCFL Sprint-2 is designing how Invenio will aggregate repository records and generate an OCFL root automatically, based on the Sprint-1 output and using Simeon Warner's validator from https://github.com/zimeon/ocfl-py
- Sprint-2, which started on 6/12, is also drafting an OCFL community extension proposal to support JSON Schemata, and is producing a case study documenting the Sprint-1 implementation.

## InvenioRDM OCFL Case Study

### Sprint-1 Progress

An InvenioRDM instance was constructed at https://ocfl.dev.data-futures.org/ using records from multiple existing *hasdai* Invenio repositories. The extracted metadata and content were then written into an OCFL root using the Brown ocfl-java-http client (https://github.com/Brown-University-Library/ocfl-java-http); the result is available at https://github.com/data-futures/ocfl-test-data.

Using ocfl-java-http allowed rapid generation of a compliant OCFL root and allowed us to identify a work plan for automating InvenioRDM snapshots into a preservable OCFL structure.

Records will initially be structured with all Invenio versions of a record mapping to a single OCFL object. This avoids duplicating data across versions of a record; see https://github.com/OCFL/spec/issues/363

Inventories, checksums and boilerplate were all generated by Brown's Java client. We employ SHA-512: the native Invenio MD5 checksum is verified when a file is retrieved from RDM, and the SHA-512 digest is then calculated and verified by the OCFL API.

### Next Steps and Issues for Sprint-2

1. To ensure an internally consistent OCFL representation of a repository's content, we believe that the schemata in use must be preserved in the OCFL. We will propose a community extension detailing this use case.
2. The JSON metadata exported from RDM requires some customization to make it more suitable for long-term preservation. For example, the API returns link URLs that are unlikely to be valid in a future epoch and are therefore inappropriate for long-term preservation. Further testing is required to ensure that a complete set of metadata is preserved.
3. Fields containing values from controlled vocabularies (CVs) are included in InvenioRDM's JSON export in both reference and dereferenced forms. This avoids the need to export the CVs explicitly for now, although explicit export will be considered in future development. Example of the 'languages' field as exported:

   ```
   "languages": [
     {
       "id": "eng",
       "title": {
         "en": "English"
       }
     }
   ],
   ```
4. Since the main focus is long-term preservation of repository contents, record ownership information is preserved as part of the exported metadata, but it is not necessarily expected to contribute to access control policies when such an archive is reused.
5. OCFL exports will include all published as well as unpublished records (with that status preserved), together with any access restriction information. However, it is expected that policies enacted by the administrators using the archive will determine actual access control at that time.
6. We may not yet have extracted all the necessary data from RDM. During Sprint-2, CERN will determine what else should be extracted.
7. Ownership of records, mentioned in another context in item 4 above, is challenging to persist effectively. It is only really meaningful if structured user information is exported as well. Providing an ORCID for the owner, where available, would at least allow records to be grouped or re-claimed in another epoch on another platform.
8. To be complete, we would need to find a way to include all controlled vocabularies (again, these should be privileged in the OCFL). It might be unwieldy to simply write a JSON Schema that includes enums for all possible values of language, resource_type, role, etc.
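
The checksum handling described under Sprint-1 Progress (verify the MD5 recorded by Invenio, then calculate a SHA-512 digest for OCFL) can be sketched in Python as follows. This is only an illustration, not the code used in the sprint, where Brown's Java client generated the checksums. The function name `verify_md5_then_sha512` is hypothetical, and accepting an optional `md5:` prefix on the recorded checksum is an assumption about how the value is supplied.

```python
import hashlib
from pathlib import Path


def verify_md5_then_sha512(path: Path, expected_md5: str,
                           chunk_size: int = 1 << 20) -> str:
    """Check a file against its recorded MD5 and return its SHA-512 hex digest.

    Hypothetical helper: `expected_md5` may be a bare hex string or carry an
    "md5:" prefix (an assumption about the input format).
    """
    expected = expected_md5.split(":", 1)[-1].lower()
    md5 = hashlib.md5()
    sha512 = hashlib.sha512()
    with path.open("rb") as handle:
        # Read the file once, feeding both hashes, so large files are not read twice.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha512.update(chunk)
    if md5.hexdigest() != expected:
        raise ValueError(f"MD5 mismatch for {path}")
    return sha512.hexdigest()
```

The returned digest is the value that would be recorded in the OCFL inventory; a mismatch on the MD5 stops the file from being written at all.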
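
The single-object approach described under Sprint-1 Progress relies on the way an OCFL inventory separates stored content (the `manifest`) from the logical state of each version. The sketch below builds an in-memory inventory for several versions of one record and stores each distinct file only once. It is a simplified illustration under stated assumptions: `build_inventory` and its input format are hypothetical, it targets the OCFL 1.0 inventory layout, and it does not write files, sidecar digests or the storage root.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, List


def build_inventory(object_id: str, versions: List[Dict[str, bytes]]) -> dict:
    """Build a minimal OCFL-style inventory (hypothetical helper).

    `versions` is a list with one dict per record version, mapping a
    logical file path to the file's bytes.
    """
    if not versions:
        raise ValueError("an OCFL object needs at least one version")
    manifest: Dict[str, List[str]] = {}
    inventory_versions: Dict[str, dict] = {}
    for number, files in enumerate(versions, start=1):
        name = f"v{number}"
        state: Dict[str, List[str]] = {}
        for logical_path, data in sorted(files.items()):
            digest = hashlib.sha512(data).hexdigest()
            # Content is added to the manifest only the first time it is seen;
            # later versions refer to the same digest without a new copy.
            if digest not in manifest:
                manifest[digest] = [f"{name}/content/{logical_path}"]
            state.setdefault(digest, []).append(logical_path)
        inventory_versions[name] = {
            "created": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "state": state,
        }
    return {
        "id": object_id,
        "type": "https://ocfl.io/1.0/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": f"v{len(versions)}",
        "manifest": manifest,
        "versions": inventory_versions,
    }
```

With two versions that share an unchanged data file, the manifest holds one content path for that file under `v1`, while the state of both versions refers to it.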
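
The metadata customization raised in item 2 could start with something as simple as dropping link URLs before a record is written to the OCFL root. The sketch below is an assumption-laden illustration, not the sprint's implementation: the helper `strip_volatile` is hypothetical, the key name `links` is assumed to be where the API places its URLs, and the sample record is invented for the example.

```python
import json
from typing import Any, FrozenSet

# Assumption: link URLs appear under keys named "links" at any nesting depth.
VOLATILE_KEYS: FrozenSet[str] = frozenset({"links"})


def strip_volatile(value: Any, drop: FrozenSet[str] = VOLATILE_KEYS) -> Any:
    """Return a copy of a JSON-like structure without the keys in `drop`."""
    if isinstance(value, dict):
        return {key: strip_volatile(item, drop)
                for key, item in value.items() if key not in drop}
    if isinstance(value, list):
        return [strip_volatile(item, drop) for item in value]
    return value


if __name__ == "__main__":
    # Invented sample record, for illustration only.
    record = {
        "id": "abc-123",
        "links": {"self": "https://repository.example.org/api/records/abc-123"},
        "metadata": {"languages": [{"id": "eng", "title": {"en": "English"}}]},
    }
    print(json.dumps(strip_volatile(record), indent=2))
```

The function returns a new structure and leaves the input untouched, so the original export can still be kept alongside the cleaned copy while testing that no required metadata is lost.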
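
To make the concern in item 8 concrete, the sketch below generates JSON Schema `enum` definitions from controlled vocabulary identifiers. It is a hypothetical illustration only: the function `vocabulary_definitions`, the `<name>_id` naming and the sample values are assumptions (only `eng` appears in the export shown above), and JSON Schema draft-07 is chosen arbitrarily for the example.

```python
import json
from typing import Dict, List


def vocabulary_definitions(vocabularies: Dict[str, List[str]]) -> dict:
    """Build a schema fragment with one enum definition per vocabulary.

    Hypothetical helper: `vocabularies` maps a vocabulary name to the list
    of identifiers it allows.
    """
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "definitions": {
            f"{name}_id": {"type": "string", "enum": sorted(set(ids))}
            for name, ids in vocabularies.items()
        },
    }


if __name__ == "__main__":
    # Illustrative values only; a real vocabulary would be far longer.
    schema = vocabulary_definitions({"languages": ["eng", "fra", "eng"]})
    print(json.dumps(schema, indent=2))
```

Each enum grows by one entry per vocabulary term, so a full language vocabulary alone would add thousands of values to the schema, which is the unwieldiness the item anticipates.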