autoscale: true
slidenumbers: true

## **Dryad2Dataverse** - Introduction

---

![right fit filtered](logo.jpg)

## CRKN Conference 2022

- Paul Lesack and Eugene Barsky
- University of British Columbia
- [research.data@ubc.ca](mailto:research.data@ubc.ca)

---

## What is it?

- A tool to **harmonize** data collections between **Dryad** and **Dataverse**
- A stand-alone application
- A Python library

^ - We have developed a suite of tools to facilitate the translation of data from Dryad to Dataverse, from start to finish
- Both a downloadable piece of software and (hopefully) a development tool
- Hopefully easy for the community to use

---

## Why did we develop it?

- [**UBC Dataverse Collection**](https://borealisdata.ca/dataverse/ubc) @Borealis is our prime repository
- There is a lot of UBC-authored data in Dryad
- Wanted to harmonize and digitally preserve data studies
- Data sets can be consolidated in one place: Borealis Dataverse

^ - UBC's "primary" research data repository is a Dataverse repository.
- Other UBC users have deposited their data sets into Dryad; there is a large UBC contingent of data there: over 500 studies
- The collection is split between two (or more, really) repositories: 50% Dryad, 50% Dataverse. That's not ideal.
- Scholars Portal Dataverse is already aligned with Geodisy for national-level geospatial searching
- SP Dataverse is connected to UBC's Summon instance

---

## Imagining the software

- Simple
- Modular
- System-neutral
- Could be scheduled
- Runs from command line

^ - Simple enough to be used by people with little or, ideally, no programming experience
- Modular: not all components should be required
- System-neutral: no requirement for server overhead
- Ideally, a piece of software that runs from the command line with basic information supplied by an end user
- Should be able to be scheduled

---

## Technical overview

- **API** to **API**
- A database for persistence and control

^ * Dryad and Dataverse both have relatively well-documented Application Programming Interfaces (APIs), so it makes sense to use a programmatic approach to transfer the data.
* The software sits between the two APIs and transfers data from one to the other.
* A tiny database monitors changes.

---

## Steps

- Create a **metadata crosswalk**. Most important step.
- Analyse Dryad's UBC datasets
- Native **Dataverse JSON** as import

^ * Arguably the most important part of the whole project is mapping the output of Dryad to the input of Dataverse.
* Fortunately, the UBC Library Research Commons has a lot of experience with that from numerous migrations and its work with Scholars Portal.
* Will it work? We have to analyze the datasets.
    * Are there too many large files that won't fit into Dataverse due to size limitations?
    * Can things be translated to Dataverse's complex native JSON?
* Finally, start programming.

---

## Building Parts

- serializer
- transfer
- monitor

^ * The resulting Python library (which you can download) is the basis of the application. It has three primary components working in sequence:
    * a translator module (serializer),
    * an upload module (transfer),
    * and a monitor (cleverly called monitor).
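
^ As a rough illustration of the three-part sequence above (translate, upload, monitor), here is a minimal sketch. It is not the library's actual code: `serialize`, `Monitor`, `run_once` and the record fields are hypothetical stand-ins, the metadata crosswalk is cut down to title and authors, and the upload step is simply a callable supplied by the caller.

```python
import sqlite3
from typing import Callable, Iterable


def serialize(dryad_record: dict) -> dict:
    """Crosswalk a simplified Dryad-style record into Dataverse-style native JSON.

    Assumption: only title and authors are mapped; a real crosswalk covers far more.
    """
    def field(type_name, value, type_class="primitive", multiple=False):
        return {"typeName": type_name, "typeClass": type_class,
                "multiple": multiple, "value": value}

    authors = [
        {"authorName": field("authorName", f"{a['lastName']}, {a['firstName']}")}
        for a in dryad_record.get("authors", [])
    ]
    fields = [
        field("title", dryad_record["title"]),
        field("author", authors, type_class="compound", multiple=True),
    ]
    return {"datasetVersion": {"metadataBlocks": {"citation": {"fields": fields}}}}


class Monitor:
    """Tiny SQLite tracker: remembers the last-seen modification date per study."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS studies "
            "(doi TEXT PRIMARY KEY, last_modified TEXT)"
        )

    def needs_transfer(self, record: dict) -> bool:
        row = self.conn.execute(
            "SELECT last_modified FROM studies WHERE doi = ?",
            (record["identifier"],),
        ).fetchone()
        # New study, or the modification date differs from the stored one
        return row is None or row[0] != record["lastModificationDate"]

    def record(self, record: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO studies VALUES (?, ?)",
            (record["identifier"], record["lastModificationDate"]),
        )
        self.conn.commit()


def run_once(records: Iterable[dict], monitor: Monitor,
             upload: Callable[[dict], None]) -> list:
    """One self-contained crawl: serialize, upload and log each new or changed study."""
    sent = []
    for rec in records:
        if monitor.needs_transfer(rec):
            upload(serialize(rec))   # transfer step (hypothetical callable)
            monitor.record(rec)      # monitor step
            sent.append(rec["identifier"])
    return sent
```

^ Passing a list's `append` method (or `print`) as `upload` makes it possible to try the sequence without touching a real Dataverse installation.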
^ All of that is great if you want to do your own programming, which you don't.

---

## Features

- Command line program
- Using **RORs** as institutional input
- Self-contained
- Auto saves database on every run and creates time-stamped backups

^ * We built a command line program to convert, upload and monitor Dryad studies
* Takes an institutional ROR as an input, plus a number of other well-known inputs, like the location of the Dataverse installation and an API key
* Requires zero knowledge of Python (with the possible exception of installation)
* With the binary versions, even that is not required
* Completely self-contained
* Each run of the software creates a timestamped database backup so that disasters can be averted.

---

## Features

- Can be run at any interval as each run is a self-contained crawl
- Can send email status messages to multiple recipients

^ * Options to skip problematic studies, such as those that are too large to be uploaded to a Dataverse installation
* Doesn't require any particular platform or any special privileges except the ability to write to local storage. Runs on Linux, PC, Mac, your cell phone, etc.

---

## See it in action

- **End result**
    - [https://borealisdata.ca/dataverse/UBC_DRYAD](https://borealisdata.ca/dataverse/UBC_DRYAD)
- **Code**
    - [github.com/ubc-library-rc/dryad2dataverse](https://github.com/ubc-library-rc/dryad2dataverse)
- **Documentation**
    - [ubc-library-rc.github.io/dryad2dataverse/](https://ubc-library-rc.github.io/dryad2dataverse/)

---

## Contact us

- **Paul Lesack** and **Eugene Barsky**
- UBC Library, Research Commons
- [research.data@ubc.ca](mailto:research.data@ubc.ca)