# Prep for call with Laurie/Phil
- Any open data set requires a **chain of custody** so it can be used for research in the future.
- Use Data Rescue Events to **build local networks** of librarians and technologists working together to set up a sustainable effort.
- **Metadata must be considered upfront**, before any archiving takes place. By creating **high quality metadata** we will enable high quality archiving.
- Collaboration should take place on **open platforms that allow anyone to contribute**.
- Embody **open source and open science values** and focus on lowering barriers to contribute.
- Librarians will ultimately be in control of the process, and set the standard for metadata quality as **they are expert data and metadata curators**.
- Encourage **lots of copies** of data housed in lots of institutions.
- **Minimize maintenance bottleneck** for organizers, e.g. libraries should be able to move forward without being blocked by infrastructure.
- Enable many groups to produce **high quality metadata** in parallel.
- Meaningfully **engage large numbers of volunteers** at events.
- Encourage **local, long-term community building**, overlapping libraries and DataRefuge groups.
- **Empower librarians to assume curatorial roles supported by DataRefuge community**.
- Libraries can host what data they can, and coordinate with others to **ensure good coverage**.
## Proposed Pilot Workflow
Note: this workflow is meant to test a relatively simple set of instructions, with the aim of informing future, more robust workflows.
## Goal of pilot: Get each library to adopt ~100 datasets and produce metadata + backups
Metadata workflow for libraries pilot:
- Dat can slice up metadata from Data.gov/IA/Climate Mirror/Archivers.space etc. to produce '100-dataset slices' for people to adopt
- Metadata will come from Max's "list of lists" https://github.com/datproject/svalbard
- Create a spreadsheet of agencies and departments.
- Start with 881 known federal departments
- Segment metadata by these departments (using the TLD in the domain or data.gov organization metadata; see the slicing sketch after this list)
- Filter departments by ones that have metadata
- Spreadsheet now has a list of departments ready for adoption
- Note: one department may still have thousands of datasets, so there may be multiple slices
- Libraries that want to participate will be assigned a slice
- Dat can "slice on demand" (like a deli) and send them a metadata file
- Libraries update spreadsheet to reflect their adoption of a slice
- Libraries improve metadata quality (could use DataRefuge events to help)
- Libraries back up datasets associated with metadata (see the capture sketch after this list)
- Can host it themselves or on DataRefuge.org
- Libraries create a GitHub Pull Request in the DataRefuge GitHub repository
- Fill out Pull Request to add a data.json file
- We can write a bot that validates these data.json files (e.g. https://github.com/jlord/patchwork/pull/17067; see the validator sketch after this list)
- Comments on the PR will give places for contributors to add qualitative information
- data.json should include (see the example record after this list):
- sha256 hash of backed up data file
- url of mirror(s)
- original url
- time of capture (when it was mirrored/downloaded)
- response headers and status codes of capture
- organizational metadata ('organization' field)
- dataset descriptive metadata ('description', 'maintainer', 'agency')
- Archivers.space ID
- Any other existing ID, e.g. EPA ID
- update periodicity
- Finally, once metadata is validated and merged on GitHub, it can be published to DataRefuge.org for discovery and use
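
The deli-style slicing above can be a small script: group a harvested catalog by publishing department and cut it into ~100-dataset chunks. Below is a minimal sketch in Python, assuming a DCAT-style catalog file (like data.gov's /data.json) with a top-level "dataset" array; the field names, file paths, and naive filename handling are illustrative assumptions, not a settled format. Grouping by the TLD of the original URL would work the same way.

```python
# Sketch: cut a harvested catalog into ~100-dataset slices per department.
# Assumes a DCAT-style catalog (like data.gov's /data.json) with a top-level
# "dataset" array; field names here are assumptions, not a fixed schema.
import json
from collections import defaultdict

SLICE_SIZE = 100  # target number of datasets per adoptable slice

def slice_catalog(catalog_path):
    with open(catalog_path) as f:
        catalog = json.load(f)

    # Group datasets by publishing organization, falling back to "unknown".
    by_dept = defaultdict(list)
    for ds in catalog.get("dataset", []):
        dept = ds.get("publisher", {}).get("name", "unknown")
        by_dept[dept].append(ds)

    # One department may still have thousands of datasets, so a department
    # can yield multiple slices (matching the note in the workflow above).
    for dept, datasets in sorted(by_dept.items()):
        for i in range(0, len(datasets), SLICE_SIZE):
            yield dept, i // SLICE_SIZE, datasets[i:i + SLICE_SIZE]

if __name__ == "__main__":
    for dept, n, chunk in slice_catalog("data_gov_catalog.json"):
        out = f"{dept.replace(' ', '_').replace('/', '_')}-slice-{n:03d}.json"
        with open(out, "w") as f:
            json.dump({"dataset": chunk}, f, indent=2)
        print(f"{out}: {len(chunk)} datasets")
```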
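
When a library backs up a dataset file, it can record most of the capture fields listed above (hash, original URL, time of capture, status, headers) in a single pass. A sketch using only the Python standard library; the URL and output path are hypothetical, and the record keys mirror the illustrative example record below.

```python
# Sketch: mirror one dataset file and record the capture fields listed above.
# URL, output path, and record keys are illustrative; stdlib only.
import hashlib
import json
from datetime import datetime, timezone
from urllib.request import urlopen

def capture(url, out_path):
    sha = hashlib.sha256()
    with urlopen(url) as resp, open(out_path, "wb") as out:
        status = resp.status
        headers = dict(resp.getheaders())
        # Stream in chunks so large datasets don't have to fit in memory.
        for chunk in iter(lambda: resp.read(64 * 1024), b""):
            sha.update(chunk)
            out.write(chunk)
    return {
        "originalUrl": url,
        "sha256": sha.hexdigest(),
        "capturedAt": datetime.now(timezone.utc).isoformat(),
        "capture": {"status": status, "headers": headers},
    }

if __name__ == "__main__":
    record = capture("https://www.example.gov/files/dataset.csv", "dataset.csv")
    print(json.dumps(record, indent=2))
```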
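
For concreteness, here is what one entry in a submitted data.json could look like, covering the fields above. Every field name and value is a placeholder to be settled during the pilot, not a fixed schema:

```json
{
  "originalUrl": "https://www.example.gov/files/dataset.csv",
  "mirrors": ["https://datarefuge.example.org/mirrors/dataset.csv"],
  "sha256": "<sha256 hash of the backed-up data file>",
  "capturedAt": "2017-02-04T18:30:00Z",
  "capture": {
    "status": 200,
    "headers": { "Content-Type": "text/csv" }
  },
  "organization": "<organizational metadata>",
  "description": "<dataset descriptive metadata>",
  "maintainer": "<dataset maintainer>",
  "agency": "<publishing agency>",
  "archiversSpaceId": "<Archivers.space ID>",
  "externalIds": { "epa": "<any other existing ID, e.g. an EPA ID>" },
  "updatePeriodicity": "annual"
}
```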
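
The PR bot's core validation could then be a short script run against each data.json record added in a Pull Request; a sketch assuming the illustrative keys from the example record above.

```python
# Sketch of the PR bot's core check, assuming the illustrative keys from the
# example record above (not a settled schema).
import json
import re
import sys

REQUIRED = ["originalUrl", "mirrors", "sha256", "capturedAt",
            "organization", "description"]
SHA256_RE = re.compile(r"^[0-9a-f]{64}$")

def validate(record):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing required field: {k}" for k in REQUIRED if k not in record]
    if "sha256" in record and not SHA256_RE.match(str(record["sha256"])):
        problems.append("sha256 is not a 64-character lowercase hex digest")
    if "mirrors" in record and not record["mirrors"]:
        problems.append("mirrors list is empty")
    return problems

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        problems = validate(json.load(f))
    for p in problems:
        print(f"INVALID: {p}")
    sys.exit(1 if problems else 0)
```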
## Future Features Post-Pilot
- GitHub only for the first pilot (for simplicity); other metadata submission workflows can be considered later
- Data.gov-style data.json metadata harvesting endpoints (see the harvesting sketch below)
- CKAN direct publishing (CKAN has no equivalent of Pull Requests, so it is harder to scale out collaboration)
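
As a rough sketch of the harvesting idea, a post-pilot job could periodically fetch each participating library's /data.json and merge the dataset arrays into one combined catalog; the endpoint URLs here are hypothetical.

```python
# Sketch: harvest DCAT-style /data.json catalogs from participating libraries
# and merge them into one combined catalog. Endpoint URLs are hypothetical.
import json
from urllib.request import urlopen

ENDPOINTS = [
    "https://library-a.example.edu/data.json",
    "https://library-b.example.edu/data.json",
]

def harvest(endpoints):
    merged = []
    for url in endpoints:
        with urlopen(url) as resp:
            catalog = json.load(resp)
        merged.extend(catalog.get("dataset", []))
    return {"dataset": merged}

if __name__ == "__main__":
    print(json.dumps(harvest(ENDPOINTS), indent=2))
```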