Tale Format design
=============================
2019-1-30
=========
Craig, Tommy, Kacper
* Tale format
* fetch.txt:
* We're not going to support it, since WT registers data (e.g., Globus) that standard BagIt clients can't fetch
* Adopting bagit-ro as much as possible
* Adding the Datasets array + parent_dataset reference in the context and in the aggregated items.
* Still need to figure out what to do with the narrative and environment (xxx attribute)
* LICENSES?
* Raises the question of Code v Data
* Also question of user's specified license and the license supported by the provider
* Next steps:
* Clean up repo, READMEs, etc.
* Write sample Python code to read and write both Zip and BagIt formats (see the manifest sketch after this list)
* Proof-of-concept export/import
* Publishing
* Refactor publishing to provider model -- ideally/eventually the Plugin-model
* manifest.json would be the same across publishing providers
* List of files in the workspace would be the same across providers
* But each provider needs to do something different -- i.e., call their own API
* Common Tale metadata is the same, but will be transformed into EML, DC, or whatever
* Publication destination (Member nodes) as an API endpoint instead of an array in Javascript
* License options as a provider API endpoint instead of the array in Javascript
* Other issues:
* Handling multiple authors, categories, citation info, etc.
* See also
* https://tools.ietf.org/html/draft-kunze-bagit-17#section-2.2.4
* https://researchobject.github.io/specifications/bundle/context.json
* https://github.com/ResearchObject/bagit-ro/
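
A minimal sketch of what the planned Python read/write code could look like for the manifest with a Datasets array. Only the Datasets array and the parent_dataset reference come from the discussion above; the other property names, identifiers, and paths are placeholders, and the context URL is the bundle context linked under "See also".

```python
import json

# Hypothetical manifest shape; values are illustrative, not a settled vocabulary.
manifest = {
    "@context": [
        "https://researchobject.github.io/specifications/bundle/context.json",
        {"Datasets": {"@type": "@id"}, "parent_dataset": {"@type": "@id"}},
    ],
    "Datasets": [
        {"@id": "doi:10.5063/F11C1V42", "@type": "Dataset", "name": "Example external dataset"}
    ],
    "aggregates": [
        {"uri": "../data/workspace/Dey_replication.ipynb"},
        {"uri": "https://cn.dataone.org/cn/v2/resolve/<pid>", "parent_dataset": "doi:10.5063/F11C1V42"},
    ],
}

# In the bag this would live under the metadata/ tag directory.
with open("manifest.json", "w") as fp:
    json.dump(manifest, fp, indent=2)

# Reading it back: external files are the aggregated items that carry a parent_dataset.
with open("manifest.json") as fp:
    m = json.load(fp)
external = [a for a in m["aggregates"] if "parent_dataset" in a]
print(f"{len(external)} aggregated item(s) reference a registered dataset")
```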
2019-1-29
=========
Craig, Tommy
* https://github.com/whole-tale/tale_serialization_formats
* fetch.txt
* WT supports data that can't be represented in fetch.txt (e.g., Globus)
* So either we put the registered data in both fetch.txt and the metadata or we ignore fetch.txt.
* Which metadata format would you recommend?
* The resource-map approach -- RO
* Q. Why CreativeWork? For the filename. Why do we need filename when we have the full path? So we don't need to parse the URI
* Q. Why not adopt the bagit-ro format as-is? We don't need certain things -- like author on every file.
* Q. Where's the dataset section?
* Q. What about entrypoints?
* Q. What about environment?
* Following the r2d model, the environment is defined by the workspace, and therefore just files in the payload.
* Optionally, we could serialize environment files as a separate tag directory
* Q. Where do we put the top-level README (ala CO "Reproducing.md")? Can we have a tag file that's not in a directory?
* Put it under /data
* Put it in the root and describe it in the tag manifest
* Wouldn't be a tag file (doesn't contain metadata)
* Maybe put README (or other docs) in a tag directory called "readme" or "reproducing" or "start-here"
* tag-manifest-md5.txt would contain
* hash start-here/readme.md
* Q. Is data=workspace or are we really putting /data/workspace?
* With tag directories, CW thinks data=workspace
* Q. Is a user expected to open a Bagged tale?
* (If not, why are we doing this...)
* We're not using BagIt to exchange between systems.
* We're not using BagIt strictly as an export format, because it's hard to understand.
* CW: We're using BagIt because PIs are recommending it -- to put a "metadata" or "standards" face on WT.
* In reality, most users would Export > Zip
* Q. Zip example - what would this look like without BagIt? (see the export sketch after this list)
* `<tale-id>`
* /workspace
* my stuff
* /metadata
* manifest.json <- bagit-ro manifest,
* provenance.json
* README.md
* Q. Why are we using bagit-ro?
* Because PI team suggests bagit
* bagit-ro covers lots of what we're doing -- emerging/proposed format that could evolve into a standard.
* If all providers (WT, CO, Gigantum, Binder) all exported bagged-ro objects, then we have something useful.
* This is where BagIt has some utility -- but today it seems premature.
* Q. Should we be using a Resource Map?
* There are inconsistencies between local_data and remote_data examples?
* bagit-ro doesn't use it, why would we.
* Look into replacing schema references with dc, pav, etc?
* Next steps:
* Tighten up README
* Complete example that covers majority cases:
* Combination of local and external data --
* Including dataset information (DOI)
* Globus and DataONE
* Propose whether to use fetch or not, acknowledging limitations
* With environment -- as Dockerfile
* Top-level "README"
* run-tale.sh - run the container + mount the external data
* "What am I looking at" readme file.
* BagIt v Zip
* Two different export formats
* Present to PI team
* Tommy
* Polish remote and local examples
* Create /example
* /Globus
* /DataONE?
* /Local
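
A sketch of what writing the plain-Zip (non-BagIt) export layout discussed above could look like. The function name and arguments are made up for illustration; only the directory layout (`<tale-id>/workspace`, `<tale-id>/metadata/manifest.json`, `README.md`) comes from the notes.

```python
import json
import zipfile

def export_tale_zip(tale_id, workspace_files, manifest, readme_text, out_path):
    """Write the plain-Zip layout: <tale-id>/workspace/..., <tale-id>/metadata/manifest.json,
    <tale-id>/README.md. workspace_files maps relative workspace paths to local paths on disk."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel_path, local_path in workspace_files.items():
            zf.write(local_path, f"{tale_id}/workspace/{rel_path}")
        zf.writestr(f"{tale_id}/metadata/manifest.json", json.dumps(manifest, indent=2))
        zf.writestr(f"{tale_id}/README.md", readme_text)

# Example usage (paths and IDs are made up):
# export_tale_zip("5c3e0000tale", {"Dey_replication.ipynb": "/tmp/Dey_replication.ipynb"},
#                 {"@id": "..."}, "What am I looking at ...", "tale.zip")
```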
2019-1-10
=========
Kacper, Craig, + Tommy
* Discussion of BagIt, BDBags, and JSON-LD
* Reviewed five examples:
* Bagit RO: https://github.com/ResearchObject/bagit-ro
* QDR: https://github.com/QualitativeDataRepository/dataverse/tree/IQSS/4706-DPN_Submission_of_archival_copies/src/test/resources
* DataONE (dataset only): https://knb.ecoinformatics.org/view/doi:10.5063/F11C1V42
* Logan's Tales in Whole Tale.
* The Bagit-ro and QDR examples do basically the same thing. There's basic Bagit structure + a metadata directory containing some additional information that's not part of the payload. We came up with the following example expanding on Logan's simpler tale:
* /logans-bagit-tale
* bag-info.txt (required by bagit)
* bagit.txt (required by bagit)
* data/ (Workspace)
* apt.txt
* dey_element_data.csv
* Dey_replication.ipynb
* dey_training_set.csv
* requirements.txt
* README.md
* manifest-sha512.txt (required by bagit)
* metadata/ (custom tag directory)
* external_data.txt (map of external data DOI/URI)
* tale.jsonld (our main metadata file, modeled after bagit-ro)
* Reproducing.md (top-level readme for user to make sense of things)
* run.sh (top-level script user can run)
* Pros and Cons of BagIt
* Pro
* Standards-based
* Gives us an "archival" and "metadata" face
* It's just a zip file
* Con
* For export, the researcher doesn't care that it's a BagIt
* Complicated for the researcher to understand, if they look at it
* Standard but not familiar
* Lots of features we may not care about
* Can't use fetch because, for example, the URI could be a Globus identifier.
* pid-mapping and fetch.txt do not really cover our external data needs.
* WT would need to implement PIDs
* For the average user "data" (payload) will be as confusing as home/jovyan/work/data
* Doesn't reflect the actual Tale structure. We've made a big deal about the UI and in-container structures matching, but for some reason it's OK for the export to be something totally different.
* BagIt is for archive and exchange -- we won't be archiving.
* We're basically using it for download
* Confusion between WT Bag and D1 Bag (these are not the same thing).
* If the Tale format is BagIt, what does it mean to publish to DataONE, Dataverse
* Export: Girder > Export > WT BagIt
* Publish: Girder > Publish (provider agnostic) > Transform > Publish
* Import: WT Bagit > Import > Girder
* Import from DataONE: D1 BagIt > Transform > WT BagIt > Import > Girder
* So, this means that Export and Import will be the only things that talk to Girder directly (and Publish, insofar as it is a generic operation). The produced "Bag" will be the WT bag; it is up to the provider-specific code to transform/adapt it.
* Discussion w/ Tommy about BagIt as transfer format:
* "I'd just discard the bagit information; Copy everything out of data and metadata, upload them to DataONE "
* Q. Craig: Why Bag? If we only use it
* Next steps:
* Create an example repo similar to https://github.com/ResearchObject/bagit-ro with a set of complete examples that explore different scenarios. Focus on real-world Tales (e.g., Logan, https://github.com/bocinsky/guedesbocinsky2018) and make sure we hit all of the cases (external data DOI, Globus, etc).
* The bagit-ro metadata format, as far as JSON-LD goes, seems like the best option for us (aggregates, etc). I'm not sure how it relates to the OAI-ORE map and why they went down this route.
* Jim Myers, involved with QDR/Dataverse, would be available to talk to us as well.
* Design/document how the Tale format relates to both export and publish, assuming multiple publishing providers with different weird constraints (file hierarchy support, requirements for specific metadata).
* Publishing API design for a "new" provider (e.g., Dataverse -- as 1) zipfile or 2) flattened); see the provider sketch at the end of this section
* Can we minimize "touchpoints" between the provider-specific code and the WT API.
* Publishing was done with D1 in mind
* https://github.com/whole-tale/gwvolman/blob/dataone_publishing/gwvolman/publish.py
* See data import providers
* Publish scenarios
* Publish to Dataverse:
* 1) Publish the zipfile
* Create a zipfile with the contents of the tale
* 2) Stream individual files
* tale.jsonld created
* Dataverse specific files (xml)
* Stream each tale file + tale.jsonld + dataverse.xml to dataverse
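
A sketch of the provider model discussed above (Export > Transform > Publish), assuming the common manifest.json and file list stay provider-agnostic and each provider only supplies the metadata transform plus its own API calls. Class and method names are hypothetical, and no real Dataverse or DataONE endpoints are shown.

```python
from abc import ABC, abstractmethod

class PublishProvider(ABC):
    """Hypothetical provider interface: the WT side hands every provider the same
    common Tale metadata and file list; provider-specific code does the rest."""

    @abstractmethod
    def transform_metadata(self, manifest: dict) -> bytes:
        """Turn the common Tale metadata into EML, an Atom entry, or whatever the repo needs."""

    @abstractmethod
    def publish(self, files, metadata_doc: bytes) -> str:
        """Push the files plus the transformed metadata; return the new PID/DOI."""

    def licenses(self):
        """Provider-specific license options (instead of a hard-coded array in JavaScript)."""
        return []

class DataversePublisher(PublishProvider):
    def transform_metadata(self, manifest):
        # Sketch only: build an Atom entry (or dataverse.xml) from the common manifest.
        return b"<entry>...</entry>"

    def publish(self, files, metadata_doc):
        # Scenario 1: upload a single zipfile; Scenario 2: stream each file + tale.jsonld.
        # Actual API calls deliberately omitted.
        return "doi:10.7910/DVN/XXXXX"

def publish_tale(manifest, files, provider: PublishProvider):
    doc = provider.transform_metadata(manifest)
    return provider.publish(files, doc)
```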
2019-1-8
========
Tommy, Craig
* Follow-up to PI call
* Recommendation to explore JSONLD and BagIt formats
* Review of Tommy's document:
* https://docs.google.com/document/d/1dQg4J6SN6QFDyJI99LcAV_9fdHju3VMK7kcZzbdJoA4/edit
* Author
* @id is the ORCID; we don't have it until we publish
* Author should be authors?
* system_entrypoint v user_entrypoint (see the sketch after this list)
* system_entrypoint is the file from which the Docker environment is built
* user_entrypoint is the file that the user says should be run
* https://schema.org/EntryPoint
* Questions about whether this is a dataset or softwaresourcecode and whether schema.org can be used to describe the contents of the package.
* environment:
* Is a Dockerfile source code?
* CreativeWork because of icon for environment
* hasPart -- not commonly used
* Idea: files in a workspace are part of the overall Tale
* No clear property for isPublic
* Other thoughts:
* Could use file/item endpoints as URIs
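
A rough sketch of how the author and entrypoint pieces discussed above might look in tale.jsonld, written here as a Python dict. Only the ORCID @id, system_entrypoint, user_entrypoint, and schema.org EntryPoint come from the notes; every other property name and value is a placeholder.

```python
# Hypothetical shape only; not a settled schema.
tale_metadata = {
    "authors": [
        {
            # @id is the ORCID; may be absent until the Tale is published
            "@id": "https://orcid.org/0000-0000-0000-0000",
            "@type": "schema:Person",
            "schema:givenName": "Jane",
            "schema:familyName": "Doe",
        }
    ],
    # file from which the Docker environment is built (e.g., a repo2docker-style config)
    "system_entrypoint": {"@type": "schema:EntryPoint", "schema:name": "requirements.txt"},
    # file the user says should be run
    "user_entrypoint": {"@type": "schema:EntryPoint", "schema:name": "Dey_replication.ipynb"},
}
```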
2018-12-06 Tale Format design
=============================
Tommy, Craig
* Tommy
* Our hands are tied in some regards when publishing to a third party
* Conform to Dataverse Atom file
* Conform to DataONE ORE
* Not completely related to the serialization format
* Craig
* CO supports exporting, but not re-importing
* Tommy
* Visit https://dev.nceas.ucsb.edu/view/urn:uuid:df7ccaed-85e1-474f-b5cc-39e160808dee
* Download All
* Open folder
* Navigate to data
* Open file that starts with `0077302e`
* View the contents to see how prov looks
* Prov statements are made with the `prov` namespace
* Prov was extended with `provone`
* Upload the prov information the same way as the metadata: follow repository specifics
* http://guides.dataverse.org/en/latest/user/dataset-management.html#data-provenance
* ACTION:
* Use provR, RDataTracker, NoWorkFlow, recordr, CamFlow to generate prov traces.
* Record/document the output format. Is it JSON, JSON-LD, XML, etc.?
* Document how a normal user would create provenance information on DataONE and Dataverse and provide examples
* ProvR
* https://github.com/ProvTools/provR
* Can output prov information to a file with `prov.json()`
* recordr
* https://github.com/NCEAS/recordr
* Stores information in sqlite database
* User can create prov, publish the prov+data to DataONE
* NoWorkFlow
* https://github.com/gems-uff/noworkflow
* Stores information in sqlite database
* User can export as prolog facts
* CamFlow
2018-12-05 Tale Format design
=============================
Tommy, Kacper, Craig
* Tommy
* Looked at Dataverse main metadata file/DC vocab terms covering licenses (type/file), name, author, etc.
* Still shooting for an RDF representation
* Review of notes:
* Distinction between export and publishing
* TT: When we do publish, we will have some sort of metadata document
* Craig:
* [12/04 Dataverse community call](https://docs.google.com/document/d/1EXKqARyxdPWW7yg3TPi9-QvXME44mwcyXkogxRGoFXk/edit) covered publishing Git repos and file hierarchy support
* Long-standing issue to support Zenodo-like Github publication
* Gigantum, Binder, WT talking about publishing to them
* Phase 1 recommendation is to publish Git repos as zipfiles
* Gigantum is publishing Git repos to dataverse as zipfiles to preserve hierarchy because Dataverse does not support file hierarchy.
* Re-iterate the Odum CoRE2 case, which is adding a Dockerfile to an existing Dataverse collection/dataset as a way to provide the environment
* Discussion
* TT: DC relates to the document we need to create for the DVN SWORD API
* http://guides.dataverse.org/en/latest/api/sword.html#create-a-dataset-with-an-atom-entry
* KK: They support adding dataset as zip -- zip is unpacked, zipped-zip is published as a zipfile
* CW: RDF is for internal structure which gets translated to each publisher?
* TT: may not have a 1:1 mapping
* CW:
* Publish metadata to DV/DataONE in their terms/format (i.e., their metadata) for discovery, etc.
* Facilitates discovery -- maximizes the value of integration with their system
* Publishing "tale.rdf" or "tale.yaml" allows us to know what we're dealing with -- serialize/deserialization export/import
* Spectrum of "tales" that might be in Dataverse/DataONE
* Created outside of WT
* Data + Code and no environment
* Dockerfile + Data + code
* Binder-like structure
* We would use the DataONE/DV metadata during tale import
* Tale -- tale.rdf/etc
* Can tales be run elsewhere?
* What we have that is different is data handling and mounting
* Q. What does DataONE publish format look like today?
* Data package has 4? objects:
* EML document: high-level overview of what's in the package, used by Metacat UI
* Datafile(s): mydata.txt
* ORE/RDF: describes the relationships between files. Package doesn't exist without this file.
* Metadata document for every file
* Discussion of DataONE API:
* https://releases.dataone.org/online/api-documentation-v2.0/apis/index.html
* Q. Why don't we publish a zipfile to DataONE?
* TT to follow-up -- quality/stigma, etc.
* Q. Publish to Zenodo via GitHub is always a zipfile?
* Yes
* Q. Do we need to be able to handle zipped repos?
* Yes, if we want to be able to run Binders, Gigantums
* Taking for granted that we publish a "tale.yml/rdf" file in the package to each repo
* Primary purpose is for us to read it back in.
* Q. Why go through the process of RDF-izing it if we're the only ones using it?
* Why deserialize RDF to put things into Girder in JSON?
* Cost/value?
* RDF is expensive to create -- making sure that things are mapping
* URI space + resolution -- hosting a service to handle concepts
* YAML is cheap and easy but non-standard
* schema.org
* DC is non-controversial / low hanging
* Mapping "tale.yaml" into DC/Schema.org space is super easy
* Provenance
* Q. How is provenance stored in DataONE?
* Always added to ORE document
* RDF/XML
* Q. Is provenance related to the file or all files in the dataset
* Provenance defined between two objects in a dataset
* https://www.dataone.org/best-practices/provenance
* Q. In Dataverse?
* Provenance file or free-text prov description?
* http://guides.dataverse.org/en/latest/api/native-api.html?highlight=provenance#provenance
* JSON format
* http://guides.dataverse.org/en/latest/user/dataset-management.html#data-provenance
* Q. Do we really need provenance information included in the "tale.yml"
* Q. What does WT do about provenance info?
* Use case: Diffing prov between runs
* Gist:
* Dublin Core/schema.org
* Provenance
* External data
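
To illustrate the "mapping into DC/Schema.org is super easy" point, a minimal sketch; the internal field names ("title", "authors", ...) are assumptions, not the actual tale.yaml schema.

```python
# Map internal Tale fields onto Dublin Core terms (illustrative field names only).
DC_MAP = {
    "title": "dc:title",
    "authors": "dc:creator",
    "description": "dc:description",
    "license": "dc:rights",
    "created": "dc:date",
}

def to_dublin_core(tale: dict) -> dict:
    dc = {}
    for field, term in DC_MAP.items():
        if field in tale:
            dc[term] = tale[field]
    return dc

print(to_dublin_core({"title": "Example Tale", "authors": ["Jane Doe"], "license": "CC-BY-4.0"}))
```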
2018-11-28 Tale Format design
=============================
Tommy, Kacper, Craig
Tale format discussion:
Review of Tommy's draft:
* Proposal:
* Tales described in RDF and serialized via JSON-LD
* Q. CW: Are we pushing a JSON-LD file to DataONE/Dataverse?
* Toss it into a "Bagit" zip
* BagIt
* Archival
* Directory structure w/ manifest (MD5s), bagit.txt with version/character encoding
* Every Bagit has a data folder
* DataONE uses Bagit when downloading packages
* Q. CW Are we talking about publishing BagIts to dataone?
* TT: This would be used on Export
* Discussion of use cases:
* Export from WT to laptop
* Publish to DataONE
* EML
* Write prov information to package resource map
* Publish to Dataverse
* [SWORD + XML](http://guides.dataverse.org/en/latest/api/sword.html)
* [Native API](http://guides.dataverse.org/en/latest/api/native-api.html)
* [Write prov information to a file](http://guides.dataverse.org/en/latest/api/native-api.html#id65)
* Interoperability with Odum CoRe2
* There are real "tales" in the wild that do not conform to a strict specification
* https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/29911
* "Tale by convention"
* Interoperability with Binders/Capsules/O2R ERC
* Comparing Binder to Tale
* Publish a Git repo to a remote repository ala Zenodo
* Get a list of files and directories from remote repo via repo2docker
* Git repo = filesystem/directory/files
* Environment description: apt.txt, etc
* binder.yml: repo2docker version etc.
* Capsules
* zip file with "uuid name.version"
* Predefined structure
* code, environment (Dockerfile)
* YML format, simpler than ours
* Kacper:
* We need to decide if it's declarative or imperative
* imperative: repo2docker. Binder specification is in code.
* declarative:
* Import a Capsule:
* Looks like a Tale, but with many tags and authors
* Build Docker image from environment
* TT: discussion of DataONE package
* We have EML document and Tale yaml
* When we ingest data, we use the EML document (title, authors, etc)
* Different than importing a capsule
* CW: what do we/they have:
* Externally referenced/registered datasets
* Datalad git-annex could do this for Binder
* We have knowledge of the compute environment
* Capsules have Dockerfiles
* Binders implicitly are Git repos with env spec baked in
* Q. What distinguishes a "Tale" from a data package:
* Prov information
* CodeOcean enforces a directory structure
* data, code, environment
* WT can support directory hierarchy
* DataONE and Dataverse cannot
* CW: Does a Tale have to have an environment?
* KK: For Binder, you get a Jupyter notebook by default.
* TT: It needs at least a reference
* What if https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/29911 has a Dockerfile?
* Is the Dockerfile in data?
* Do we build the image?
* KK: With current import, we treat it as data.
* Need to detect whether it is data or a tale
* If data + code, import as data
* If environment, it's a tale
* Tale by convention vs specification
* CW: Should we impose structure?
* code, data, environment, etc.
* TT: It isn't BagIt
* codemeta?
* https://codemeta.github.io/
* Discussion of adapters during publication
* TT: We need to commit to support them
* What does export look like
* Current: curl -X GET --header 'Accept: application/zip' --header 'Girder-Token: 7xNaPOb1xweww1QCNrM8WRGuf7gwDkF4ld0D2jid4jZiDKuSEyovDw7aTid8UeLE' 'https://girder.dev.wholetale.org/api/v1/tale/59f0b91584b7920001b46f2e/export'
* Download zip file
* Unzip
* runtale.sh
* cd data
* docker run repo2docker:v .
* go get the data... and mount it...
* docker run -v $(pwd)/data:/data -P my/image
* Open browser to port?
* What about the data?
* Ala CodeOcean -- put it in the Zip?
* Can we use DataLad with WT as a provider?
* Who cares/What do I do with a prov trace?
* Verify that the runs are correct
* Check that your inputs/outputs match their inputs/outputs
* Visualize the prov trace
* User wants to verify their own stuff
* Q. Did I reproduce the experiment?
* Diffing the prov graph (see the prov-diff sketch at the end of these notes)
* DataLad does this too with Git diff
* What metadata do I care about?
* Is datasets.txt a new feature of repo2docker
* Constraints
* File path information
* Goal:
* Create slides/description consumable by PI team
* Relate to existing initiatives
* Inputs:
* RDA package specification
* codemeta
* schema.org
* BagIt
* Dataverse
* DataONE
* Binder
* CodeOcean
* bdbags
* Prov
* O2R/ERC?
* DataLad?
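
For the "diffing the prov graph" use case above, a minimal sketch that compares two W3C PROV-JSON traces. PROV-JSON does use top-level sections such as `entity` and `activity`; the trace file names here are placeholders.

```python
import json

def load_prov(path):
    """Load a W3C PROV-JSON document (top-level keys like 'entity', 'activity')."""
    with open(path) as fp:
        return json.load(fp)

def diff_prov(run_a, run_b, section="entity"):
    """Return identifiers that appear in one run's trace but not the other."""
    a = set(run_a.get(section, {}))
    b = set(run_b.get(section, {}))
    return {"only_in_a": sorted(a - b), "only_in_b": sorted(b - a)}

# Hypothetical usage; trace file names are made up.
# run1 = load_prov("run1.prov.json")
# run2 = load_prov("run2.prov.json")
# print(diff_prov(run1, run2, "entity"))
# print(diff_prov(run1, run2, "activity"))
```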