Tale Format design
=============================
2019-1-30
=========
Craig, Tommy, Kacper
* Tale format
* fetch.txt:
* We're not going to support it, since WT registers data (e.g., Globus) that standard BagIt clients can't fetch
* Adopting bagit-ro as much as possible
* Adding the Datasets array + parent_dataset reference in the context and in the aggregated items.
* Still need to figure out what to do with the narrative and environment (xxx attribute)
* LICENSES?
* Raises the question of Code v Data
* Also question of user's specified license and the license supported by the provider
* Next steps:
* Clean up repo, READMEs, etc.
* Write sample Python code to read and write both Zip and BagIt formats (see the manifest sketch after this list)
* Proof-of-concept export/import
* Publishing
* Refactor publishing to provider model -- ideally/eventually the Plugin-model
* manifest.json would be the same across publishing providers
* List of files in the workspace would be the same across providers
* But each provider needs to do something different -- i.e., call their own API
* Common Tale metadata is the same, but will be transformed into EML, DC, or whatever
* Publication destination (Member nodes) as an API endpoint instead of an array in Javascript
* License options as a provider API endpoint instead of the array in Javascript
* Other issues:
* Handling multiple authors, categories, citation info, etc.
* See also
* https://tools.ietf.org/html/draft-kunze-bagit-17#section-2.2.4
* https://researchobject.github.io/specifications/bundle/context.json
* https://github.com/ResearchObject/bagit-ro/
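
A minimal sketch of what the planned Python read/write code could look like for the manifest with a Datasets array. Only the Datasets array and the parent_dataset reference come from the discussion above; the other property names, identifiers, and paths are placeholders, and the context URL is the bundle context linked under "See also".

```python
import json

# Hypothetical manifest shape; values are illustrative, not a settled vocabulary.
manifest = {
    "@context": [
        "https://researchobject.github.io/specifications/bundle/context.json",
        {"Datasets": {"@type": "@id"}, "parent_dataset": {"@type": "@id"}},
    ],
    "Datasets": [
        {"@id": "doi:10.5063/F11C1V42", "@type": "Dataset", "name": "Example external dataset"}
    ],
    "aggregates": [
        {"uri": "../data/workspace/Dey_replication.ipynb"},
        {"uri": "https://cn.dataone.org/cn/v2/resolve/<pid>", "parent_dataset": "doi:10.5063/F11C1V42"},
    ],
}

# In the bag this would live under the metadata/ tag directory.
with open("manifest.json", "w") as fp:
    json.dump(manifest, fp, indent=2)

# Reading it back: external files are the aggregated items that carry a parent_dataset.
with open("manifest.json") as fp:
    m = json.load(fp)
external = [a for a in m["aggregates"] if "parent_dataset" in a]
print(f"{len(external)} aggregated item(s) reference a registered dataset")
```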
2019-1-29
=========
Craig, Tommy
* https://github.com/whole-tale/tale_serialization_formats
* fetch.txt
* WT supports data that can't be represented in fetch.txt (e.g., Globus)
* So either we put the registered data in both fetch.txt and the metadata or we ignore fetch.txt.
* Which metadata format would you recommend?
* The resource-map approach -- RO
* Q. Why CreativeWork? For the filename. Why do we need filename when we have the full path? So we don't need to parse the URI
* Q. Why not adopt the bagit-ro format as-is? We don't need certain things -- like author on every file.
* Q. Where's the dataset section?
* Q. What about entrypoints?
* Q. What about environment?
* Following the r2d model, the environment is defined by the workspace, and therefore just files in the payload.
* Optionally, we could serialize environment files as a separate tag directory
* Q. Where do we put the top-level README (ala CO "Reproducing.md")? Can we have a tag file that's not in a directory?
* Put it under /data
* Put it in the root and describe it in the tag manifest
* Wouldn't be a tag file (doesn't contain metadata)
* Maybe put README (or other docs) in a tag directory called "readme" or "reproducing" or "start-here"
* tag-manifest-md5.txt would contain
* hash start-here/readme.md
* Q. Is data=workspace or are we really putting /data/workspace?
* With tag directories, CW thinks data=workspace
* Q. Is a user expected to open a Bagged tale?
* (If not, why are we doing this...)
* We're not using BagIt to exchange between systems.
* We're not using BagIt strictly as an export format, because it's hard to understand.
* CW: We're using BagIt because PIs are recommending it -- to put a "metadata" or "standards" face on WT.
* In reality, most users would Export > Zip
* Q. Zip example - what would this look like without BagIt? (see the export sketch after this list)
* `<tale-id>`
* /workspace
* my stuff
* /metadata
* manifest.json <- bagit-ro manifest,
* provenance.json
* README.md
* Q. Why are we using bagit-ro?
* Because PI team suggests bagit
* bagit-ro covers lots of what we're doing -- emerging/proposed format that could evolve into a standard.
* If all providers (WT, CO, Gigantum, Binder) all exported bagged-ro objects, then we have something useful.
* This is where BagIt has some utility -- but today it seems premature.
* Q. Should we be using a Resource Map?
* There are inconsistencies between local_data and remote_data examples?
* bagit-ro doesn't use it, why would we.
* Look into replacing schema references with dc, pav, etc?
* Next steps:
* Tighten up README
* Complete example that covers majority cases:
* Combination of local and external data --
* Including dataset information (DOI)
* Globus and DataONE
* Propose whether to use fetch or not, acknowledging limitations
* With environment -- as Dockerfile
* Top-level "README"
* run-tale.sh - run the container + mount the external data
* "What am I looking at" readme file.
* BagIt v Zip
* Two different export formats
* Present to PI team
* Tommy
* Polish remote and local examples
* Create /example
* /Globus
* /DataONE?
* /Local
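
A sketch of what writing the plain-Zip (non-BagIt) export layout discussed above could look like. The function name and arguments are made up for illustration; only the directory layout (`<tale-id>/workspace`, `<tale-id>/metadata/manifest.json`, `README.md`) comes from the notes.

```python
import json
import zipfile

def export_tale_zip(tale_id, workspace_files, manifest, readme_text, out_path):
    """Write the plain-Zip layout: <tale-id>/workspace/..., <tale-id>/metadata/manifest.json,
    <tale-id>/README.md. workspace_files maps relative workspace paths to local paths on disk."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel_path, local_path in workspace_files.items():
            zf.write(local_path, f"{tale_id}/workspace/{rel_path}")
        zf.writestr(f"{tale_id}/metadata/manifest.json", json.dumps(manifest, indent=2))
        zf.writestr(f"{tale_id}/README.md", readme_text)

# Example usage (paths and IDs are made up):
# export_tale_zip("5c3e0000tale", {"Dey_replication.ipynb": "/tmp/Dey_replication.ipynb"},
#                 {"@id": "..."}, "What am I looking at ...", "tale.zip")
```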
2019-1-10
=========
Kacper, Craig, + Tommy
* Discussion of BagIt, BDBags, and JSON-LD
* Reviewed five examples:
* Bagit RO: https://github.com/ResearchObject/bagit-ro
* QDR: https://github.com/QualitativeDataRepository/dataverse/tree/IQSS/4706-DPN_Submission_of_archival_copies/src/test/resources
* DataONE (dataset only): https://knb.ecoinformatics.org/view/doi:10.5063/F11C1V42
* Logan's Tales in Whole Tale.
* The Bagit-ro and QDR examples do basically the same thing. There's basic Bagit structure + a metadata directory containing some additional information that's not part of the payload. We came up with the following example expanding on Logan's simpler tale:
* /logans-bagit-tale
* bag-info.txt (required by bagit)
* bagit.txt (required by bagit)
* data/ (Workspace)
* apt.txt
* dey_element_data.csv
* Dey_replication.ipynb
* dey_training_set.csv
* requirements.txt
* README.md
* manifest-sha512.txt (required by bagit)
* metadata/ (custom tag directory)
* external_data.txt (map of external data DOI/URI)
* tale.jsonld (our main metadata file, modeled after bagit-ro)
* Reproducing.md (top-level readme for user to make sense of things)
* run.sh (top-level script user can run)
* Pros and Cons of BagIt
* Pro
* Standards-based
* Gives us an "archival" and "metadata" face
* It's just a zip file
* Con
* For export, the researcher doesn't care that it's a BagIt
* Complicated for the researcher to understand, if they look at it
* Standard but not familiar
* Lots of features we may not care about
* Can't use fetch because, for example, the URI could be a Globus identifier.
* pid-mapping and fetch.txt do not really cover our external data needs.
* WT would need to implement PIDs
* For the average user "data" (payload) will be as confusing as home/jovyan/work/data
* Doesn't reflect the actual Tale structure. We've made a big deal about the UI and in-container structures matching, but for some reason it's OK for the export to be something totally different.
* BagIt is for archive and exchange -- we won't be archiving.
* We're basically using it for download
* Confusion between WT Bag and D1 Bag (these are not the same thing).
* If the Tale format is BagIt, what does it mean to publish to DataONE, Dataverse
* Export: Girder > Export > WT BagIt
* Publish: Girder > Publish (provider agnostic) > Transform > Publish
* Import: WT Bagit > Import > Girder
* Import from DataONE: D1 BagIt > Transform > WT BagIt > Import > Girder
* So, this means that Export and Import will be the only things that talk to Girder directly (and Publish, insofar as it is a generic operation). The produced "Bag" will be the WT bag; it is up to the provider-specific code to transform/adapt it.
* Discussion w/ Tommy about BagIt as transfer format:
* "I'd just discard the bagit information; Copy everything out of data and metadata, upload them to DataONE "
* Q. Craig: Why Bag? If we only use it
* Next steps:
* Create an example repo similar to https://github.com/ResearchObject/bagit-ro with a set of complete examples that explore different scenarios. Focus on real-world Tales (e.g., Logan, https://github.com/bocinsky/guedesbocinsky2018) and make sure we hit all of the cases (external data DOI, Globus, etc).
* The bagit-ro metadata format, as far as JSON-LD goes, seems like the best option for us (aggregates, etc). I'm not sure how it relates to the OAI-ORE map and why they went down this route.
* Jim Myers, involved with QDR/Dataverse, would be available to talk to us as well.
* Design/document how the Tale format relates to both export and publish, assuming multiple publishing providers with different weird constraints (file hierarchy support, requirements for specific metadata).
* Publishing API design for a "new" provider (e.g., Dataverse -- as 1) zipfile or 2) flattened); see the provider sketch at the end of this section
* Can we minimize "touchpoints" between the provider-specific code and the WT API.
* Publishing was done with D1 in mind
* https://github.com/whole-tale/gwvolman/blob/dataone_publishing/gwvolman/publish.py
* See data import providers
* Publish scenarios
* Publish to Dataverse:
* 1) Publish the zipfile
* Create a zipfile with the contents of the tale
* 2) Stream individual files
* tale.jsonld created
* Dataverse specific files (xml)
* Stream each tale file + tale.jsonld + dataverse.xml to dataverse
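
A sketch of the provider model discussed above (Export > Transform > Publish), assuming the common manifest.json and file list stay provider-agnostic and each provider only supplies the metadata transform plus its own API calls. Class and method names are hypothetical, and no real Dataverse or DataONE endpoints are shown.

```python
from abc import ABC, abstractmethod

class PublishProvider(ABC):
    """Hypothetical provider interface: the WT side hands every provider the same
    common Tale metadata and file list; provider-specific code does the rest."""

    @abstractmethod
    def transform_metadata(self, manifest: dict) -> bytes:
        """Turn the common Tale metadata into EML, an Atom entry, or whatever the repo needs."""

    @abstractmethod
    def publish(self, files, metadata_doc: bytes) -> str:
        """Push the files plus the transformed metadata; return the new PID/DOI."""

    def licenses(self):
        """Provider-specific license options (instead of a hard-coded array in JavaScript)."""
        return []

class DataversePublisher(PublishProvider):
    def transform_metadata(self, manifest):
        # Sketch only: build an Atom entry (or dataverse.xml) from the common manifest.
        return b"<entry>...</entry>"

    def publish(self, files, metadata_doc):
        # Scenario 1: upload a single zipfile; Scenario 2: stream each file + tale.jsonld.
        # Actual API calls deliberately omitted.
        return "doi:10.7910/DVN/XXXXX"

def publish_tale(manifest, files, provider: PublishProvider):
    doc = provider.transform_metadata(manifest)
    return provider.publish(files, doc)
```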
2019-1-8
========
Tommy, Craig
* Follow-up to PI call
* Recommendation to explore JSONLD and BagIt formats
* Review of Tommy's document:
* https://docs.google.com/document/d/1dQg4J6SN6QFDyJI99LcAV_9fdHju3VMK7kcZzbdJoA4/edit
* Author
* @id is the ORCID; we don't have it until we publish
* Author should be authors?
* system_entrypoint v user_entrypoint (see the sketch after this list)
* system_entrypoint is the file from which the Docker environment is built
* user_entrypoint is the file that the user says should be run
* https://schema.org/EntryPoint
* Questions about whether this is a dataset or softwaresourcecode and whether schema.org can be used to describe the contents of the package.
* environment:
* Is a Dockerfile source code?
* CreativeWork because of icon for environment
* hasPart -- not commonly used
* Idea: files in a workspace are part of the overall Tale
* No clear property for isPublic
* Other thoughts:
* Could use file/item endpoints as URIs
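
A rough sketch of how the author and entrypoint pieces discussed above might look in tale.jsonld, written here as a Python dict. Only the ORCID @id, system_entrypoint, user_entrypoint, and schema.org EntryPoint come from the notes; every other property name and value is a placeholder.

```python
# Hypothetical shape only; not a settled schema.
tale_metadata = {
    "authors": [
        {
            # @id is the ORCID; may be absent until the Tale is published
            "@id": "https://orcid.org/0000-0000-0000-0000",
            "@type": "schema:Person",
            "schema:givenName": "Jane",
            "schema:familyName": "Doe",
        }
    ],
    # file from which the Docker environment is built (e.g., a repo2docker-style config)
    "system_entrypoint": {"@type": "schema:EntryPoint", "schema:name": "requirements.txt"},
    # file the user says should be run
    "user_entrypoint": {"@type": "schema:EntryPoint", "schema:name": "Dey_replication.ipynb"},
}
```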
2018-12-06 Tale Format design
=============================
Tommy, Craig
* Tommy
* Our hands are tied in some regards when publishing to a third party
* Conform to Dataverse Atom file
* Conform to DataONE ORE
* Not completely related to the serialization format
* Craig
* CO supports exporting, but not re-importing
* Tommy
* Visit https://dev.nceas.ucsb.edu/view/urn:uuid:df7ccaed-85e1-474f-b5cc-39e160808dee
* Download All
* Open folder
* Navigate to data
* Open file that starts with `0077302e`
* View the contents to see how prov looks
* Prov statements are made with the `prov` namespace
* Prov was extended with `provone`
* Upload the prov information the same way as the metadata: follow repository specifics
* http://guides.dataverse.org/en/latest/user/dataset-management.html#data-provenance
* ACTION:
* Use provR, RDataTracker, NoWorkFlow, recordr, CamFlow to generate prov traces.
* Record/document the output format. Is it JSON, JSON-LD, XML, etc.?
* Document how a normal user would create provenance information on DataONE and Dataverse and provide examples
* ProvR
* https://github.com/ProvTools/provR
* Can output prov information to a file with `prov.json()`
* recordr
* https://github.com/NCEAS/recordr
* Stores information in sqlite database
* User can create prov, publish the prov+data to DataONE
* NoWorkFlow
* https://github.com/gems-uff/noworkflow
* Stores information in sqlite database
* User can export as prolog facts
* CamFlow
2018-12-05 Tale Format design
=============================
Tommy, Kacper, Craig
* Tommy
* Looked at Dataverse main metadata file/DC vocab terms covering licenses (type/file), name, author, etc.
* Still shooting for an RDF representation
* Review of notes:
* Distinction between export and publishing
* TT: When we do publish, we will have some sort of metadata document
* Craig:
* [12/04 Dataverse community call](https://docs.google.com/document/d/1EXKqARyxdPWW7yg3TPi9-QvXME44mwcyXkogxRGoFXk/edit) covered publishing Git repos and file hierarchy support
* Long-standing issue to support Zenodo-like Github publication
* Gigantum, Binder, WT talking about publishing to them
* Phase 1 recommendation is to publish Git repos as zipfiles
* Gigantum is publishing Git repos to dataverse as zipfiles to preserve hierarchy because Dataverse does not support file hierarchy.
* Re-iterate the Odum CoRE2 case, which is adding a Dockerfile to an existing Dataverse collection/dataset as a way to provide the environment
* Discussion
* TT: DC relates to the document we need to create for the DVN SWORD API
* http://guides.dataverse.org/en/latest/api/sword.html#create-a-dataset-with-an-atom-entry
* KK: They support adding dataset as zip -- zip is unpacked, zipped-zip is published as a zipfile
* CW: RDF is for internal structure which gets translated to each publisher?
* TT: may not have a 1:1 mapping
* CW:
* Publish metadata to DV/DataONE in their terms/format (i.e., their metadata) for discovery, etc.
* Facilitates discovery -- maximizes the value of integration with their system
* Publishing "tale.rdf" or "tale.yaml" allows us to know what we're dealing with -- serialize/deserialization export/import
* Spectrum of "tales" that might be in Dataverse/DataONE
* Created outside of WT
* Data + Code and no environment
* Dockerfile + Data + code
* Binder-like structure
* We would use the DataONE/DV metadata during tale import
* Tale -- tale.rdf/etc
* Can tales be run elsewhere?
* What we have that is different is data handling and mounting
* Q. What does DataONE publish format look like today?
* Data package has 4? objects:
* EML document: high-level overview of what's in the package, used by Metacat UI
* Datafile(s): mydata.txt
* ORE/RDF: describes the relationships between files. Package doesn't exist without this file.
* Metadata document for every file
* Discussion of DataONE API:
* https://releases.dataone.org/online/api-documentation-v2.0/apis/index.html
* Q. Why don't we publish a zipfile to DataONE?
* TT to follow-up -- quality/stigma, etc.
* Q. Publish to Zenodo via GitHub is always a zipfile?
* Yes
* Q. Do we need to be able to handle zipped repos?
* Yes, if we want to be able to run Binders, Gigantums
* Taking for granted that we publish a "tale.yml/rdf" file in the package to each repo
* Primary purpose is for us to read it back in.
* Q. Why go through the process of RDF-izing it if we're the only ones using it?
* Why deserialize RDF to put things into Girder in JSON?
* Cost/value?
* RDF is expensive to create -- making sure that things are mapping
* URI space + resolution -- hosting a service to handle concepts
* YAML is cheap and easy but non-standard
* schema.org
* DC is non-controversial / low hanging
* Mapping "tale.yaml" into DC/Schema.org space is super easy
* Provenance
* Q. How is provenance stored in DataONE?
* Always added to ORE document
* RDF/XML
* Q. Is provenance related to the file or all files in the dataset
* Provenance defined between two objects in a dataset
* https://www.dataone.org/best-practices/provenance
* Q. In Dataverse?
* Provenance file or free-text prov description?
* http://guides.dataverse.org/en/latest/api/native-api.html?highlight=provenance#provenance
* JSON format
* http://guides.dataverse.org/en/latest/user/dataset-management.html#data-provenance
* Q. Do we really need provenance information included in the "tale.yml"
* Q. What does WT do about provenance info?
* Use case: Diffing prov between runs
* Gist:
* Dublin Core/schema.org
* Provenance
* External data
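
To illustrate the "mapping into DC/Schema.org is super easy" point, a minimal sketch; the internal field names ("title", "authors", ...) are assumptions, not the actual tale.yaml schema.

```python
# Map internal Tale fields onto Dublin Core terms (illustrative field names only).
DC_MAP = {
    "title": "dc:title",
    "authors": "dc:creator",
    "description": "dc:description",
    "license": "dc:rights",
    "created": "dc:date",
}

def to_dublin_core(tale: dict) -> dict:
    dc = {}
    for field, term in DC_MAP.items():
        if field in tale:
            dc[term] = tale[field]
    return dc

print(to_dublin_core({"title": "Example Tale", "authors": ["Jane Doe"], "license": "CC-BY-4.0"}))
```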
2018-11-28 Tale Format design
=============================
Tommy, Kacper, Craig
Tale format discussion:
Review of Tommy's draft:
* Proposal:
* Tales described in RDF and serialized via JSON-LD
* Q. CW: Are we pushing a JSON-LD file to DataONE/Dataverse?
* Toss it into a "Bagit" zip
* BagIt
* Archival
* Directory structure w/ manifest (MD5s), bagit.txt with version/character encoding
* Every Bagit has a data folder
* DataONE uses Bagit when downloading packages
* Q. CW Are we talking about publishing BagIts to dataone?
* TT: This would be used on Export
* Discussion of use cases:
* Export from WT to laptop
* Publish to DataONE
* EML
* Write prov information to package resource map
* Publish to Dataverse
* [SWORD + XML](http://guides.dataverse.org/en/latest/api/sword.html)
* [Native API](http://guides.dataverse.org/en/latest/api/native-api.html)
* [Write prov information to a file](http://guides.dataverse.org/en/latest/api/native-api.html#id65)
* Interoperability with Odum CoRe2
* There are real "tales" in the wild that do not conform to a strict specification
* https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/29911
* "Tale by convention"
* Interoperability with Binders/Capsules/O2R ERC
* Comparing Binder to Tale
* Publish a Git repo to a remote repository ala Zenodo
* Get a list of files and directories from remote repo via repo2docker
* Git repo = filesystem/directory/files
* Environment description: apt.txt, etc
* binder.yml: repo2docker version etc.
* Capsules
* zip file with "uuid name.version"
* Predefined structure
* code, environment (Dockerfile)
* YML format, simpler than ours
* Kacper:
* We need to decide if it's declarative or imperative
* imperative: repo2docker. Binder specification is in code.
* declarative:
* Import a Capsule:
* Looks like a Tale, but with many tags and authors
* Build Docker image from environment
* TT: discussion of DataONE package
* We have EML document and Tale yaml
* When we ingest data, we use the EML document (title, authors, etc)
* Different than importing a capsule
* CW: what do we/they have:
* Externally referenced/registered datasets
* Datalad git-annex could do this for Binder
* We have knowledge of the compute environment
* Capsules have Dockerfiles
* Binders implicitly are Git repos with env spec baked in
* Q. What distinguishes a "Tale" from a data package:
* Prov information
* CodeOcean enforces a directory structure
* data, code, environment
* WT can support directory hierarchy
* DataONE and Dataverse cannot
* CW: Does a Tale have to have an environment?
* KK: For Binder, you get a Jupyter notebook by default.
* TT: It needs at least a reference
* What if https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/29911 has a Dockerfile?
* Is the Dockerfile in data?
* Do we build the image?
* KK: With current import, we treat it as data.
* Need to detect whether it is data or a tale
* If data + code, import as data
* If environment, it's a tale
* Tale by convention vs specification
* CW: Should we impose structure?
* code, data, environment, etc.
* TT: It isn't BagIt
* codemeta?
* https://codemeta.github.io/
* Discussion of adapters during publication
* TT: We need to commit to support them
* What does export look like
* Current: curl -X GET --header 'Accept: application/zip' --header 'Girder-Token: 7xNaPOb1xweww1QCNrM8WRGuf7gwDkF4ld0D2jid4jZiDKuSEyovDw7aTid8UeLE' 'https://girder.dev.wholetale.org/api/v1/tale/59f0b91584b7920001b46f2e/export'
* Download zip file
* Unzip
* runtale.sh
* cd data
* docker run repo2docker:v .
* go get the data... and mount it...
* docker run -v $(pwd)/data:/data -P my/image
* Open browser to port?
* What about the data?
* Ala CodeOcean -- put it in the Zip?
* Can we use DataLad with WT as a provider?
* Who cares/What do I do with a prov trace?
* Verify that the runs are correct
* Check that your inputs/outputs match their inputs/outputs
* Visualize the prov trace
* User wants to verify their own stuff
* Q. Did I reproduce the experiment?
* Diffing the prov graph (see the prov-diff sketch at the end of these notes)
* DataLad does this too with Git diff
* What metadata do I care about?
* Is datasets.txt a new feature of repo2docker
* Constraints
* File path information
* Goal:
* Create slides/description consumable by PI team
* Relate to existing initiatives
* Inputs:
* RDA package specification
* codemeta
* schema.org
* BagIt
* Dataverse
* DataONE
* Binder
* CodeOcean
* bdbags
* Prov
* O2R/ERC?
* DataLad?
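
For the "diffing the prov graph" use case above, a minimal sketch that compares two W3C PROV-JSON traces. PROV-JSON does use top-level sections such as `entity` and `activity`; the trace file names here are placeholders.

```python
import json

def load_prov(path):
    """Load a W3C PROV-JSON document (top-level keys like 'entity', 'activity')."""
    with open(path) as fp:
        return json.load(fp)

def diff_prov(run_a, run_b, section="entity"):
    """Return identifiers that appear in one run's trace but not the other."""
    a = set(run_a.get(section, {}))
    b = set(run_b.get(section, {}))
    return {"only_in_a": sorted(a - b), "only_in_b": sorted(b - a)}

# Hypothetical usage; trace file names are made up.
# run1 = load_prov("run1.prov.json")
# run2 = load_prov("run2.prov.json")
# print(diff_prov(run1, run2, "entity"))
# print(diff_prov(run1, run2, "activity"))
```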