fairly Package Design

Philosophy

We do not want to provide a package that replicates the functionality of the data repository APIs and offers generic methods for arbitrary metadata and dataset operations. Our intention is to develop a package that focuses on the core task, i.e. uploading datasets to data repositories, and provides the necessary basis for the JupyterFAIR extension.

Vision

import fairly

# Create a local dataset
dataset = fairly.create_dataset('/path/dataset')

# Set metadata
dataset.set_metadata({
    "title": "My wonderful dataset",
    "license": "CC BY 4.0",
    "keywords": ["FAIR", "data"],
    "authors": [
        "0000-0002-0156-185X",
        {
            "name": "John",
            "surname": "Doe",
            "role": "contributor",
        },
    ],
})

# Add data files and folders
dataset.add_files([
    "README.txt",
    "*.csv",
    "train/*.jpg",
    "test/*.jpg",
])

# Upload to the remote data repository
remote_dataset = dataset.upload("4tu")

# Change metadata
dataset.metadata["license"] = "MIT"

# Synchronize the remote dataset with the local dataset
dataset.synchronize()
import fairly

# Connect to the local repository
local = fairly.connect('/path/repository')

# Open an existing local dataset
# (loads existing metadata and imports the list of files and folders)
dataset = local.open_dataset('/path/dataset')

# Connect to the remote data repository
remote = fairly.connect("4tu", token="1b7e-6700-26b5-4fda")

# Find the remote dataset (i.e. archive) matching the local dataset, e.g. by using its DOI
archive = remote.find_dataset(dataset)

# Synchronize the archived dataset with the local copy
dataset.synchronize(archive)

or

# Find the archived dataset and synchronize it with the local copy
dataset.synchronize(remote)

Terminology

Dataset
A collection of one or more data files and folders, and related metadata.
Data file
A file that is part of a dataset and stores data in a specific format, including (compressed) archive of multiple files or folders.
Data folder
A folder that is part of a dataset and contains one or more data files or folders.
Metadata
Structured information about a dataset (according to a standard).
Metadata standard
A requirement intended to establish a common understanding of the meaning or semantics of datasets.
Data repository
A place where datasets are stored, including online services.
Platform
Software platform used by a data repository.

Overall Design

TODO: Describe overall design, including the workflow to define a dataset and upload it to a data repository.

Identifiers

Data repository identifiers

A data repository can be identified by one of the following:

  • id : Unique identifier of the data repository as defined by the package
  • url : URL address of the data repository known to the package
  • api_url and platform: URL address of the data repository API endpoint and its platform identifier

platform determines the implementation to be used, and api_url determines the connection point.
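This dispatch could be sketched as follows; the client class names and registry are illustrative, not part of the package:

```python
# Hypothetical dispatch from a platform identifier to its client
# implementation; the class names are illustrative only.
class FigshareClient:
    def __init__(self, api_url):
        self.api_url = api_url

class InvenioClient:
    def __init__(self, api_url):
        self.api_url = api_url

PLATFORM_CLIENTS = {
    "figshare": FigshareClient,
    "invenio": InvenioClient,
}

def make_client(platform, api_url):
    # platform selects the implementation, api_url the connection point
    try:
        return PLATFORM_CLIENTS[platform](api_url)
    except KeyError:
        raise ValueError(f"Unknown platform: {platform}")
```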

The platform of a repository can be determined from a generic uid value as follows:

  1. Check if it is a valid URL address
    Regular expression: /[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/ig
    (Source: https://regexr.com/39nr7)

    • Yes:
      Check if the URL address is a recognized data repository URL (see list_repositories())

      • Yes:
        Use the platform that is specified in the repository dictionary.

      • No:
        Use the platform specified by the platform argument if available, otherwise raise Unknown platform error.

    • No:
      Check if the id is a recognized data repository id (see list_repositories())

      • Yes:
        Use the platform that is specified in the repository dictionary.

      • No:
        Raise Invalid id error.
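The decision tree above could be implemented roughly as follows; the REPOSITORIES registry is a stand-in for the result of list_repositories(), and treating the regular expression as a full-string check is an assumption:

```python
import re

# Hypothetical repository registry; in the package this would come
# from fairly.list_repositories().
REPOSITORIES = {
    "4tu": {"platform": "figshare", "url": "https://data.4tu.nl/"},
    "zenodo": {"platform": "invenio", "url": "https://zenodo.org/"},
}

# URL pattern from the design notes (https://regexr.com/39nr7)
URL_REGEX = re.compile(
    r"[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}"
    r"\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)",
    re.IGNORECASE,
)

def resolve_platform(uid, platform=None):
    """Return the platform identifier for a generic uid value."""
    if URL_REGEX.fullmatch(uid):
        # Recognized data repository URL?
        for repository in REPOSITORIES.values():
            if uid.startswith(repository["url"]):
                return repository["platform"]
        # Fall back to the platform argument, if available
        if platform:
            return platform
        raise ValueError("Unknown platform")
    # Not a URL: treat the uid as a repository id
    if uid in REPOSITORIES:
        return REPOSITORIES[uid]["platform"]
    raise ValueError("Invalid id")
```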

Dataset identifiers

A dataset published at a specific data repository (i.e. dataset archive) can be identified by one of the following:

  • id : Repository-specific unique id of the dataset archive
  • doi : DOI of the dataset associated to the dataset archive
  • url : URL address of the dataset archive

The type of a generic uid value can be determined as follows:

  • Check if it is a valid URL address
    Regular expression: /[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/ig
    (Source: https://regexr.com/39nr7)

Remarks:

  • DOI and URL allow validation, i.e. it can be checked whether the identifier belongs to the specified repository:
    • For a DOI, the related URL address can be retrieved and compared with the base URL of the repository.
    • For a URL, it can be compared directly with the base URL of the repository.
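A minimal sketch of the URL comparison; resolving a DOI to its landing-page URL (e.g. by following https://doi.org/&lt;doi&gt;) involves a network call and is omitted here:

```python
from urllib.parse import urlparse

def belongs_to_repository(url, repository_url):
    """Check whether a dataset URL belongs to a repository by
    comparing the network locations of the two addresses."""
    return urlparse(url).netloc == urlparse(repository_url).netloc
```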

File paths

A file path can be a relative path including folders separated by slashes.

Examples:

  • README.txt
  • Agricultural SandboxNL Database V1.0.zip
  • Original Data/Ngari/SQ01.xlsx

Configuration

TODO: Describe overall configuration mechanism.

.dot files

TODO: Describe the use of .dot files.

Environmental variables

TODO: Describe the use of environmental variables.

  • FAIRLY_(ATTRIBUTE)
  • FAIRLY_(REPOSITORY)_(ATTRIBUTE)

Examples:

  • FAIRLY_USERNAME
  • FAIRLY_PASSWORD
  • FAIRLY_4TU_USERNAME
  • FAIRLY_4TU_PASSWORD

REMARKS:

  • Values of repository-specific variables override the values of generic variables.
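The lookup with this precedence could be sketched as:

```python
import os

def get_config(attribute, repository=None):
    """Resolve a configuration attribute from environment variables.
    A repository-specific variable (FAIRLY_<REPOSITORY>_<ATTRIBUTE>)
    takes precedence over the generic one (FAIRLY_<ATTRIBUTE>)."""
    if repository is not None:
        value = os.environ.get(
            f"FAIRLY_{repository.upper()}_{attribute.upper()}")
        if value is not None:
            return value
    return os.environ.get(f"FAIRLY_{attribute.upper()}")
```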

Platforms

The platforms that will be supported by the package are listed below:

Local

The local platform will allow access to local datasets, as well as creating and modifying them. Having a local implementation allows users to use the same methods to work with local and remote datasets.

Figshare

Figshare is an open-access, but closed-source commercial repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos. 4TU.ResearchData is currently based on Figshare.

Djehuty

Djehuty is the new open-source repository system of 4TU.ResearchData. It mimics the Figshare API; therefore, it can be supported easily. A separate implementation is only required to support deviations from the original Figshare API.

Invenio

Invenio is an open-source digital library and document repository framework used by Zenodo. We consider Zenodo a high-priority candidate repository to support. However, whether it will be implemented initially is still to be decided (based on time availability).

Global Methods

These methods are available in the fairly module.

List supported platforms

platforms = fairly.list_platforms()

Returns a dictionary of dictionaries of platforms supported by the package.
Keys of the dictionary are unique platform identifiers (string).

Parameters:

None

Raises:

None

Platform dictionary:

Attribute Type Description
name string Name of the platform.
url string URL address of the platform portal.
has_folders boolean True if folders are supported by the platform.
has_partial_upload boolean True if partial uploads are supported by the platform.
experimental boolean True if the platform support is experimental.

Example:

{
    "local": {
        "name": "Local",
        "url": "//localhost/",
        "has_folders": True,
        "has_partial_upload": True,
        "experimental": False,
    },
    "figshare": {
        "name": "Figshare",
        "url": "https://figshare.com/",
        "has_folders": False,
        "has_partial_upload": True,
        "experimental": False,
    },
    ...
}

List recognized data repositories

repositories = fairly.list_repositories(platform?=<id>)

Returns a list of dictionaries of the repositories recognized by the package.
Repository dictionaries may contain platform-specific information.

Parameters:

Name Type Description
platform string Platform identifier to filter repositories (optional)

Raises:

None

Repository dictionary:

Attribute Type Description
id string Unique identifier of the repository.
name string Name of the repository.
platform string Platform identifier of the repository.
url string URL address of the repository.
api_url string URL address of the repository API endpoint.
...

Example:

[
    {
        "id": "local",
        "name": "Local Repository",
        "platform": "local",
        "url": "//localhost/",
    },
    {
        "id": "4tu",
        "name": "4TU.ResearchData",
        "platform": "figshare",
        "url": "https://data.4tu.nl/",
        "api_url": "https://api.figshare.com/v2/",
        ...
    },
    ...
]

List licenses

licenses = fairly.list_licenses()

Returns a dictionary of dictionaries of licenses recognized by the package.
Keys are unique license identifiers (string).

Parameters:

None

Raises:

None

License dictionary:

Attribute Type Description
name string Name of the license.
url string URL address of the license.

Example:

{
    "CC BY 4.0": {
        "name": "Creative Commons Attribution 4.0 International",
        "url": "https://creativecommons.org/licenses/by/4.0/",
    },
    ...
}

NOTES:

  • Licenses can be stored as Markdown documents and can be copied to the local dataset folder during initialization (similar to GitHub repository initialization).
  • Support for custom licenses can be added.
    • As a user I want to be able to select a license and put it as a file in my dataset folder.

    • As a user I would like to add a custom license.

Connect to a data repository

repository = fairly.connect(<id>)
repository = fairly.connect(<url>, platform?=<id>)

Returns a data repository object.

Parameters:

Name Type Description
id string Identifier of the repository (see fairly.list_repositories)
url string URL address of the repository (see fairly.list_repositories)
platform string Platform identifier of the repository (see fairly.list_platforms)
username string Name of the user account
password string Password of the user account
token string Access token linked to the user account

Raises:

  • IOError

    • Error occurred while connecting to the repository.
  • ValueError

    • Invalid id.
    • Unknown platform.
    • Invalid username.
    • Invalid password.
    • Invalid token.

Example:

fairly.connect("4tu")
fairly.connect("/home/jovyan/research/dataset")
fairly.connect("https://api.figshare.com/v2/", platform="figshare")

REMARKS:

  • Environmental variables and .dot files should be considered for missing arguments.

Repository methods

The templates of the methods are defined in the fairly.repository interface module. Platform-specific methods are implemented in respective platform modules.

Get repository information

info = repository.get_info()

Returns the repository dictionary.
Repository dictionaries may contain repository-specific information.

Parameters:

None

Raises:

  • IOError:
    • Error occurred while retrieving the repository information.

Repository Dictionary:

Attribute Type Description
repository_url string URL address of the repository.
platform string Platform identifier of the repository.
username string Name of the user account.
token string Access token linked to the user account.
name string Name of the user.
surname string Surname of the user.
email string E-mail address of the user.

Example:

{
    "repository_url": "https://data.4tu.nl/",
    "platform": "figshare",
    "username": "s.girgin@utwente.nl",
    "token": "42370f95-8711-4c2d-8aeb-9f761bf35640",
    "name": "Serkan",
    "surname": "Girgin",
    "email": "s.girgin@utwente.nl",
    ...
}

Remarks:

  • Can be cached?

Create a dataset

dataset = repository.create_dataset(metadata={})

Creates a dataset in the repository and returns a dataset object of the newly created dataset.

TODO: Add description

Open a dataset

dataset = repository.open_dataset(uid, version?=<version>)

Returns a dataset object for the specified dataset in the repository.

Parameters

Name Type Description
uid string Identifier of the dataset (see Dataset identifiers)
version string Version of the dataset (optional). If not specified, the latest version is used

Raises

  • IOError

    • Error occurred while opening the dataset.
  • ValueError

    • Invalid id.

List datasets

datasets = repository.list_datasets()

Returns a list of dictionaries of the datasets available in the repository.

Parameters:
None

Raises:

  • IOError:
    • Error occurred while retrieving the list of datasets.

Dataset Dictionary:

Attribute Type Description
id string Repository-specific unique identifier of the dataset.
doi string DOI of the dataset (if available).
url string URL address of the dataset.
title string Title of the dataset.
date date Date of the dataset.
versions dictionary Dictionary of dataset versions, {version: date, }

Example:

[
    {
        "id": "14438750",
        "doi": "10.4121/14438750",
        "url": "https://data.4tu.nl/articles/dataset/Agricultural_SandboxNL_Database_V1_0/14438750",
        "title": "Agricultural SandboxNL Database V1.0",
        "date": "2022-03-17 09:28",
        "versions": {
            "1": "2021-07-16 15:42",
            "2": "2022-03-17 09:28",
        },
    },
    ...
]

Dataset Methods

These methods are available in the fairly.dataset module. A dataset object should keep a reference to the repository object and, to perform its tasks, call the related repository methods by specifying the dataset id and version. The methods are mainly for convenience; the following uses are identical to each other:

repository = fairly.connect("4tu", token="1b7e-6700-26b5-4fda")
dataset = repository.open_dataset("10.1204.56")

# Get metadata by using the dataset method
metadata = dataset.get_metadata()

# Get metadata by using the repository method
metadata = repository.get_metadata("10.1204.56", 2)

Get metadata of a dataset

metadata = dataset.get_metadata()

Returns the metadata dictionary of the dataset.

Parameters:
None

Raises

  • IOError
    • Error occurred while reading the metadata.

Metadata Dictionary:

- id          : Repository-specific unique identifier of the dataset.
- doi         : DOI of the dataset (if available).
- url         : URL address of the dataset.
- title       : Title of the dataset.
- date        : Publication date of the dataset.
- description : Description of the dataset.
- keywords    : List of keywords associated with the dataset.
- authors     : List of author dictionaries associated with the dataset.
- license     : License type of the dataset (see "License types").
- version     : Version of the dataset (if available).

Author dictionary:

- id          : Repository-specific unique identifier of the author.
- name        : Name(s) of the author.
- surname     : Surname(s) of the author.
- title       : Job title of the author (if available).
- institution : Institution of the author (if available).
- orcid_id    : ORCID identifier of the author (if available).

Set metadata of a dataset

dataset.set_metadata(metadata)

Sets specified metadata attributes of the dataset.
See dataset.get_metadata() for supported metadata attributes.

Parameters:

Attribute Type Description
metadata dictionary Metadata attributes to be updated

Raises:

  • IOError:
    • Error occurred while setting the metadata.

Remarks:

  • Setting specific metadata attributes by name can also be supported, e.g. dataset.set_metadata(title = "New title")
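A minimal sketch of a signature that supports both a dictionary and keyword arguments, assuming the dataset object caches its metadata in a dictionary:

```python
class Dataset:
    """Minimal stand-in for the fairly dataset object (sketch)."""

    def __init__(self, metadata=None):
        self.metadata = dict(metadata or {})

    def set_metadata(self, metadata=None, **kwargs):
        """Update the specified metadata attributes; keyword
        arguments take precedence over the dictionary."""
        attributes = dict(metadata or {})
        attributes.update(kwargs)
        self.metadata.update(attributes)
```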

List data files of a dataset

files = dataset.list_files()

Returns a list of dictionaries of the data files of the dataset.

Raises:

  • IOError:
    • Error occurred while retrieving the metadata.

Data File Dictionary:

- id   : Unique id of the data file (if available).
- url  : URL address of the data file [string].
- path : Path of the data file (see "File paths") [string].
- type : (MIME?) Type of the data file [string].
- size : Size of the data file in bytes [int].
- md5  : MD5 checksum of the data file [string].

Example:

[
    {
        "id": "36059786",
        "url": "https://data.4tu.nl/ndownloader/files/36059786",
        "path": "upload data.zip",
        "type": "application/zip",
        "size": 799271235,
        "md5": "50bcc2c08d3cc45e9fa8ffcf2b2c391c",
    },
    ...
]

Compute the differences between the metadata of two datasets

diff = dataset.diff_metadata(other_dataset)

TODO: Add description

Compute the differences between the data files of two datasets

diff = dataset.diff_files(other_dataset)

TODO: Add description

Synchronize two datasets

push_id = dataset.push(other_dataset, callback?=<callback>)

Updates the metadata and data files of the other dataset so that they match those of the dataset.
Repository-specific unique metadata attributes (e.g. id, url) should be excluded.

Synchronization includes the following actions:

  1. Update of the metadata
  2. Upload of the new files
  3. Upload of the updated files
  4. (Removal of the previous copies of the updated files)
  5. Removal of the deleted files

If the upload of an updated file does not replace the existing file automatically (i.e. not an in-place replacement), the existing file should be deleted after the upload (step 4).

Synchronization should be atomic, i.e. in case of failure at any step the other dataset should revert to its original state.
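The file actions (steps 2 to 5) could be planned from the file lists of the two datasets; a minimal sketch, assuming file dictionaries with path and md5 keys as returned by dataset.list_files():

```python
def plan_push(local_files, remote_files):
    """Determine the file actions needed to make the remote dataset
    match the local one (metadata update and atomicity not shown)."""
    local = {f["path"]: f["md5"] for f in local_files}
    remote = {f["path"]: f["md5"] for f in remote_files}
    actions = []
    for path, md5 in local.items():
        if path not in remote:
            actions.append(("upload", path))      # step 2: new file
        elif remote[path] != md5:
            actions.append(("upload", path))      # step 3: updated file
            actions.append(("delete_old", path))  # step 4: previous copy
    for path in remote:
        if path not in local:
            actions.append(("delete", path))      # step 5: deleted file
    return actions
```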

Get synchronization status

dataset.get_push_status(push_id)

TODO: Add description

Enumerations

Dataset types

Pre-defined dataset types supported by figshare:

  • figure
  • online resource
  • preprint
  • book
  • conference contribution
  • media
  • dataset
  • poster
  • journal contribution
  • presentation
  • thesis
  • software

NOTES:

  • Most of them are generic publication types, not dataset types. I'm not sure if they are meaningful to use. [SG]

Licenses

Pre-defined licenses supported by figshare:

  • CC BY 4.0
  • CC0
  • MIT
  • GPL
  • GPL 2.0+
  • Apache 2.0

NOTES:

  • figshare uses numeric license id to indicate the license. I suggest we use string license names instead and map them internally. [SG]
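The suggested internal mapping could look like the following; the numeric ids are illustrative placeholders, not the actual Figshare license ids:

```python
# Hypothetical mapping between string license names and the numeric
# ids used by the platform API; the ids below are illustrative only.
FIGSHARE_LICENSE_IDS = {
    "CC BY 4.0": 1,
    "CC0": 2,
    "MIT": 3,
    "GPL": 4,
    "GPL 2.0+": 5,
    "Apache 2.0": 7,
}

def to_platform_license(name):
    """Translate a string license name to the platform-specific id."""
    try:
        return FIGSHARE_LICENSE_IDS[name]
    except KeyError:
        raise ValueError(f"Unknown license: {name}")
```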

API vision, proposal and design notes

Separating local-related functions from remote-related functions like uploading, etc.

dataset.sync_metadata(archive)
status = dataset.sync_files(archive)

connection = fairly.connect("4tu")

# Remote copy of the dataset (e.g. article in case of figshare)
archive = connection.create_archive()
archive = connection.open_record(<id>)

dataset_record.get_metadata()
dataset_record.list_files()

Later:

dataset.add_file()        - Should update .ignore file
dataset.remove_file()     - Should update .ignore file
dataset.export_metadata() - Should export metadata following a metadata standard

File update from local to remote

User story: when I have uploaded wrong files, or files are missing for the publication, I need to update the repository. Because files are very heavy, I don't want to overwrite the entire archive, but instead upload only those that are not there yet, and perhaps delete those that are no longer there.

A kind of smart-overwrite feature:

List datasets

  • status : Status of the dataset [string].
    • "draft" : Not published yet (i.e. private)

    • "restricted" : Published with restrictions (i.e. under embargo)

    • "public" : Published publicly

      Example: "status": "public",

Business logic/rules

  • We only synchronize to unpublished repositories
  • The DOI should remain constant even if the dataset is duplicated in different data providers (Zenodo, Figshare, etc.)
  • A new version can only be made by the author of the dataset

Stories

List datasets

After having my dataset ready to be archived, I would like to identify the archive to which I want to deposit my data; therefore, I need a list to make this identification.

Get metadata of a dataset

Might be used to store the metadata locally, mostly for local readability and consultation. The downloaded metadata shouldn't be writable, for example: users can make a mistake and overwrite metadata, which is not desirable.

Select a repo