fairly Package Design

Philosophy

We do not want to provide a package that replicates the functionality of the data repository APIs and offers generic methods for arbitrary metadata and dataset operations. Our intention is to develop a package that focuses on the core task, i.e. uploading datasets to data repositories, and provides the necessary basis for the JupyterFAIR extension.

Vision

import fairly

# Create a local dataset
dataset = fairly.create_dataset('/path/dataset')

# Set metadata
dataset.set_metadata({
    "title": "My wonderful dataset",
    "license": "CC BY 4.0",
    "keywords": ["FAIR", "data"],
    "authors": [
        "0000-0002-0156-185X",
        {
            "name": "John",
            "surname": "Doe",
            "role": "contributor",
        },
    ],
})

# Add data files and folders
dataset.add_files([
    "README.txt",
    "*.csv",
    "train/*.jpg",
    "test/*.jpg",
])

# Upload to the remote data repository
remote_dataset = dataset.upload("4tu")

# Change metadata
dataset.metadata["license"] = "MIT"

# Synchronize the remote dataset with the local dataset
dataset.synchronize()
import fairly

# Connect to the local repository
local = fairly.connect('/path/repository')

# Open an existing local dataset
# (loads existing metadata and imports the list of files and folders)
dataset = local.open_dataset('/path/dataset')

# Connect to the remote data repository
remote = fairly.connect("4tu", token="1b7e-6700-26b5-4fda")

# Find the remote dataset (i.e. archive) matching the local dataset, e.g. by using its DOI
archive = remote.find_dataset(dataset)

# Synchronize the archived dataset with the local copy
dataset.synchronize(archive)

or

# Find the archived dataset and synchronize it with the local copy
dataset.synchronize(remote)

Terminology

Dataset
A collection of one or more data files and folders, and related metadata.
Data file
A file that is part of a dataset and stores data in a specific format, including (compressed) archive of multiple files or folders.
Data folder
A folder that is part of a dataset and contains one or more data files or folders.
Metadata
Structured information about a dataset (according to a standard).
Metadata standard
A requirement intended to establish a common understanding of the meaning or semantics of datasets.
Data repository
A place where datasets are stored, including online services.
Platform
Software platform used by a data repository.

Overall Design

TODO: Describe overall design, including the workflow to define a dataset and upload it to a data repository.

Identifiers

Data repository identifiers

A data repository can be identified by one of the following:

  • id : Unique identifier of the data repository as defined by the package
  • url : URL address of the data repository known to the package
  • api_url and platform: URL address of the data repository API endpoint and its platform identifier

platform determines the implementation to be used, and api_url determines the connection point.
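This dispatch could be sketched as follows; the client class names and registry are illustrative, not part of the package:

```python
# Hypothetical dispatch from a platform identifier to its client
# implementation; the class names are illustrative only.
class FigshareClient:
    def __init__(self, api_url):
        self.api_url = api_url

class InvenioClient:
    def __init__(self, api_url):
        self.api_url = api_url

PLATFORM_CLIENTS = {
    "figshare": FigshareClient,
    "invenio": InvenioClient,
}

def make_client(platform, api_url):
    # platform selects the implementation, api_url the connection point
    try:
        return PLATFORM_CLIENTS[platform](api_url)
    except KeyError:
        raise ValueError(f"Unknown platform: {platform}")
```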

The platform of a repository can be determined from a generic uid value as follows:

  1. Check if it is a valid URL address
    Regular expression: /[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/ig
    (Source: https://regexr.com/39nr7)

    • Yes:
      Check if the URL address is a recognized data repository URL (see list_repositories())

      • Yes:
        Use the platform that is specified in the repository dictionary.

      • No:
        Use the platform specified by the platform argument if available, otherwise raise Unknown platform error.

    • No:
      Check if the id is a recognized data repository id (see list_repositories())

      • Yes:
        Use the platform that is specified in the repository dictionary.

      • No:
        Raise Invalid id error.
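The decision tree above could be implemented roughly as follows; the REPOSITORIES registry is a stand-in for the result of list_repositories(), and treating the regular expression as a full-string check is an assumption:

```python
import re

# Hypothetical repository registry; in the package this would come
# from fairly.list_repositories().
REPOSITORIES = {
    "4tu": {"platform": "figshare", "url": "https://data.4tu.nl/"},
    "zenodo": {"platform": "invenio", "url": "https://zenodo.org/"},
}

# URL pattern from the design notes (https://regexr.com/39nr7)
URL_REGEX = re.compile(
    r"[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}"
    r"\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)",
    re.IGNORECASE,
)

def resolve_platform(uid, platform=None):
    """Return the platform identifier for a generic uid value."""
    if URL_REGEX.fullmatch(uid):
        # Recognized data repository URL?
        for repository in REPOSITORIES.values():
            if uid.startswith(repository["url"]):
                return repository["platform"]
        # Fall back to the platform argument, if available
        if platform:
            return platform
        raise ValueError("Unknown platform")
    # Not a URL: treat the uid as a repository id
    if uid in REPOSITORIES:
        return REPOSITORIES[uid]["platform"]
    raise ValueError("Invalid id")
```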

Dataset identifiers

A dataset published at a specific data repository (i.e. dataset archive) can be identified by one of the following:

  • id : Repository-specific unique id of the dataset archive
  • doi : DOI of the dataset associated to the dataset archive
  • url : URL address of the dataset archive

The type of a generic uid value can be determined as follows:

  • Check if it is a valid URL address
    Regular expression: /[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/ig
    (Source: https://regexr.com/39nr7)

Remarks:

  • DOI and URL allow validation, i.e. it can be checked whether the identifier belongs to the specified repository:
    • For a DOI, the related URL address can be retrieved and compared with the base URL of the repository.
    • For a URL, it can be compared directly with the base URL of the repository.
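A minimal sketch of the URL comparison; resolving a DOI to its landing-page URL (e.g. by following https://doi.org/&lt;doi&gt;) involves a network call and is omitted here:

```python
from urllib.parse import urlparse

def belongs_to_repository(url, repository_url):
    """Check whether a dataset URL belongs to a repository by
    comparing the network locations of the two addresses."""
    return urlparse(url).netloc == urlparse(repository_url).netloc
```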

File paths

A file path can be a relative path including folders separated by slashes.

Examples:

  • README.txt
  • Agricultural SandboxNL Database V1.0.zip
  • Original Data/Ngari/SQ01.xlsx

Configuration

TODO: Describe overall configuration mechanism.

.dot files

TODO: Describe the use of .dot files.

Environmental variables

TODO: Describe the use of environmental variables.

  • FAIRLY_(ATTRIBUTE)
  • FAIRLY_(REPOSITORY)_(ATTRIBUTE)

Examples:

  • FAIRLY_USERNAME
  • FAIRLY_PASSWORD
  • FAIRLY_4TU_USERNAME
  • FAIRLY_4TU_PASSWORD

REMARKS:

  • Values of repository-specific variables override the values of generic variables.
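The lookup with this precedence could be sketched as:

```python
import os

def get_config(attribute, repository=None):
    """Resolve a configuration attribute from environment variables.
    A repository-specific variable (FAIRLY_<REPOSITORY>_<ATTRIBUTE>)
    takes precedence over the generic one (FAIRLY_<ATTRIBUTE>)."""
    if repository is not None:
        value = os.environ.get(
            f"FAIRLY_{repository.upper()}_{attribute.upper()}")
        if value is not None:
            return value
    return os.environ.get(f"FAIRLY_{attribute.upper()}")
```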

Platforms

The platforms that will be supported by the package are listed below:

Local

The local platform will allow access to local datasets, as well as creating and modifying them. Having a local implementation allows users to use the same methods to work with local and remote datasets.

Figshare

Figshare is an open-access, but closed-source commercial repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos. 4TU.ResearchData is currently based on Figshare.

Djehuty

Djehuty is the new open-source repository system of 4TU.ResearchData. It mimics the Figshare API; therefore, it can be supported easily. A separate implementation is only required to support deviations from the original Figshare API.

Invenio

Invenio is an open-source digital library and document repository framework used by Zenodo. We consider Zenodo a high-priority candidate repository to support. However, whether it will be implemented initially is still to be decided (based on time availability).

Global Methods

These methods are available in the fairly module.

List supported platforms

platforms = fairly.list_platforms()

Returns a dictionary of dictionaries of platforms supported by the package.
Keys of the dictionary are unique platform identifiers (string).

Parameters:

None

Raises:

None

Platform dictionary:

Attribute Type Description
name string Name of the platform.
url string URL address of the platform portal.
has_folders boolean True if folders are supported by the platform.
has_partial_upload boolean True if partial uploads are supported by the platform.
experimental boolean True if the platform support is experimental.

Example:

{
    "local": {
        "name": "Local",
        "url": "//localhost/",
        "has_folders": True,
        "has_partial_upload": True,
        "experimental": False,
    },
    "figshare": {
        "name": "Figshare",
        "url": "https://figshare.com/",
        "has_folders": False,
        "has_partial_upload": True,
        "experimental": False,
    },
    ...
}

List recognized data repositories

repositories = fairly.list_repositories(platform?=<id>)

Returns a list of dictionaries of the repositories recognized by the package.
Repository dictionaries may contain platform-specific information.

Parameters:

Name Type Description
platform string Platform identifier to filter repositories (optional)

Raises:

None

Repository dictionary:

Attribute Type Description
id string Unique identifier of the repository.
name string Name of the repository.
platform string Platform identifier of the repository.
url string URL address of the repository.
api_url string URL address of the repository API endpoint.
...

Example:

[
    {
        "id": "local",
        "name": "Local Repository",
        "platform": "local",
        "url": "//localhost/",
    },
    {
        "id": "4tu",
        "name": "4TU.ResearchData",
        "platform": "figshare",
        "url": "https://data.4tu.nl/",
        "api_url": "https://api.figshare.com/v2/",
        ...
    },
    ...
]

List licenses

licenses = fairly.list_licenses()

Returns a dictionary of dictionaries of licenses recognized by the package.
Keys are unique license identifiers (string).

Parameters:

None

Raises:

None

License dictionary:

Attribute Type Description
name string Name of the license.
url string URL address of the license.

Example:

{
    "CC BY 4.0": {
        "name": "Creative Commons Attribution 4.0 International",
        "url": "https://creativecommons.org/licenses/by/4.0/",
    },
    ...
}

NOTES:

  • Licenses can be stored as Markdown documents and can be copied to the local dataset folder during initialization (similar to GitHub repository initialization).
  • Support for custom licenses can be added.
    • As a user I want to be able to select a license and put it as a file in my dataset folder.

    • As a user I would like to add a custom license.

Connect to a data repository

repository = fairly.connect(<id>)
repository = fairly.connect(<url>, platform?=<id>)

Returns a data repository object.

Parameters:

Name Type Description
id string Identifier of the repository (see fairly.list_repositories)
url string URL address of the repository (see fairly.list_repositories)
platform string Platform identifier of the repository (see fairly.list_platforms)
username string Name of the user account
password string Password of the user account
token string Access token linked to the user account

Raises:

  • IOError

    • Error occurred while connecting to the repository.
  • ValueError

    • Invalid id.
    • Unknown platform.
    • Invalid username.
    • Invalid password.
    • Invalid token.

Example:

fairly.connect("4tu")
fairly.connect("/home/jovyan/research/dataset")
fairly.connect("https://api.figshare.com/v2/", platform="figshare")

REMARKS:

  • Environmental variables and .dot files should be considered for missing arguments.

Repository methods

The templates of the methods are defined in the fairly.repository interface module. Platform-specific methods are implemented in respective platform modules.

Get repository information

info = repository.get_info()

Returns the repository dictionary.
Repository dictionaries may contain repository-specific information.

Parameters:

None

Raises:

  • IOError:
    • Error occurred while retrieving the repository information.

Repository Dictionary:

Attribute Type Description
repository_url string URL address of the repository.
platform string Platform identifier of the repository.
username string Name of the user account.
token string Access token linked to the user account.
name string Name of the user.
surname string Surname of the user.
email string E-mail address of the user.

Example:

{
    "repository_url": "https://data.4tu.nl/",
    "platform": "figshare",
    "username": "s.girgin@utwente.nl",
    "token": "42370f95-8711-4c2d-8aeb-9f761bf35640",
    "name": "Serkan",
    "surname": "Girgin",
    "email": "s.girgin@utwente.nl",
    ...
}

Remarks:

  • Can be cached?

Create a dataset

dataset = repository.create_dataset(metadata={})

Creates a dataset in the repository and returns a dataset object of the newly created dataset.

TODO: Add description

Open a dataset

dataset = repository.open_dataset(uid, version?=<version>)

Returns a dataset object for the specified dataset in the repository.

Parameters

Name Type Description
uid string Identifier of the dataset (see Dataset identifiers)
version string Version of the dataset (optional). If not specified, the latest version is used

Raises

  • IOError

    • Error occurred while opening the dataset.
  • ValueError

    • Invalid id.

List datasets

datasets = repository.list_datasets()

Returns a list of dictionaries of the datasets available in the repository.

Parameters:
None

Raises:

  • IOError:
    • Error occurred while retrieving the list of datasets.

Dataset Dictionary:

Attribute Type Description
id string Repository-specific unique identifier of the dataset.
doi string DOI of the dataset (if available).
url string URL address of the dataset.
title string Title of the dataset.
date date Date of the dataset.
versions dictionary Dictionary of dataset versions, {version: date, }

Example:

[
    {
        "id": "14438750",
        "doi": "10.4121/14438750",
        "url": "https://data.4tu.nl/articles/dataset/Agricultural_SandboxNL_Database_V1_0/14438750",
        "title": "Agricultural SandboxNL Database V1.0",
        "date": "2022-03-17 09:28",
        "versions": {
            "1": "2021-07-16 15:42",
            "2": "2022-03-17 09:28",
        },
    },
    ...
]

Dataset Methods

These methods are available in the fairly.dataset module. A dataset object should keep a reference to the repository object and, to perform its tasks, call the related repository methods by specifying the dataset id and version. The methods are mainly for convenience; the following uses are identical to each other:

repository = fairly.connect("4tu", token="1b7e-6700-26b5-4fda")
dataset = repository.open_dataset("10.1204.56")

# Get metadata by using the dataset method
metadata = dataset.get_metadata()

# Get metadata by using the repository method
metadata = repository.get_metadata("10.1204.56", 2)

Get metadata of a dataset

metadata = dataset.get_metadata()

Returns the metadata dictionary of the dataset.

Parameters:
None

Raises

  • IOError
    • Error occurred while reading the metadata.

Metadata Dictionary:

- id          : Repository-specific unique identifier of the dataset.
- doi         : DOI of the dataset (if available).
- url         : URL address of the dataset.
- title       : Title of the dataset.
- date        : Publication date of the dataset.
- description : Description of the dataset.
- keywords    : List of keywords associated with the dataset.
- authors     : List of author dictionaries associated with the dataset.
- license     : License type of the dataset (see "License types").
- version     : Version of the dataset (if available).

Author dictionary:

- id          : Repository-specific unique identifier of the author.
- name        : Name(s) of the author.
- surname     : Surname(s) of the author.
- title       : Job title of the author (if available).
- institution : Institution of the author (if available).
- orcid_id    : ORCID identifier of the author (if available).

Set metadata of a dataset

dataset.set_metadata(metadata)

Sets specified metadata attributes of the dataset.
See dataset.get_metadata() for supported metadata attributes.

Parameters:

Attribute Type Description
metadata dictionary Metadata attributes to be updated

Raises:

  • IOError:
    • Error occurred while setting the metadata.

Remarks:

  • Setting specific metadata attributes by name can also be supported, e.g. dataset.set_metadata(title = "New title")
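A minimal sketch of a signature that supports both a dictionary and keyword arguments, assuming the dataset object caches its metadata in a dictionary:

```python
class Dataset:
    """Minimal stand-in for the fairly dataset object (sketch)."""

    def __init__(self, metadata=None):
        self.metadata = dict(metadata or {})

    def set_metadata(self, metadata=None, **kwargs):
        """Update the specified metadata attributes; keyword
        arguments take precedence over the dictionary."""
        attributes = dict(metadata or {})
        attributes.update(kwargs)
        self.metadata.update(attributes)
```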

List data files of a dataset

files = dataset.list_files()

Returns a list of dictionaries of the data files of the dataset.

Raises:

  • IOError:
    • Error occurred while retrieving the metadata.

Data File Dictionary:

- id   : Unique id of the data file (if available).
- url  : URL address of the data file [string].
- path : Path of the data file (see "File paths") [string].
- type : (MIME?) Type of the data file [string].
- size : Size of the data file in bytes [int].
- md5  : MD5 checksum of the data file [string].

Example:

[
    {
        "id": "36059786",
        "url": "https://data.4tu.nl/ndownloader/files/36059786",
        "path": "upload data.zip",
        "type": "application/zip",
        "size": 799271235,
        "md5": "50bcc2c08d3cc45e9fa8ffcf2b2c391c",
    },
    ...
]

Compute the differences between the metadata of two datasets

diff = dataset.diff_metadata(other_dataset)

TODO: Add description

Compute the differences between the data files of two datasets

diff = dataset.diff_files(other_dataset)

TODO: Add description

Synchronize two datasets

push_id = dataset.push(other_dataset, callback?=<callback>)

Updates the metadata and data files of the other dataset so that they match those of the dataset.
Repository-specific unique metadata attributes (e.g. id, url) should be excluded.

Synchronization includes the following actions:

  1. Update of the metadata
  2. Upload of the new files
  3. Upload of the updated files
  4. (Removal of the previous copies of the updated files)
  5. Removal of the deleted files

If the upload of an updated file does not replace the existing file automatically (i.e. not an in-place replacement), the existing file should be deleted after the upload (step 4).

Synchronization should be atomic, i.e. in case of failure at any step the other dataset should revert to its original state.
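The file actions (steps 2 to 5) could be planned from the file lists of the two datasets; a minimal sketch, assuming file dictionaries with path and md5 keys as returned by dataset.list_files():

```python
def plan_push(local_files, remote_files):
    """Determine the file actions needed to make the remote dataset
    match the local one (metadata update and atomicity not shown)."""
    local = {f["path"]: f["md5"] for f in local_files}
    remote = {f["path"]: f["md5"] for f in remote_files}
    actions = []
    for path, md5 in local.items():
        if path not in remote:
            actions.append(("upload", path))      # step 2: new file
        elif remote[path] != md5:
            actions.append(("upload", path))      # step 3: updated file
            actions.append(("delete_old", path))  # step 4: previous copy
    for path in remote:
        if path not in local:
            actions.append(("delete", path))      # step 5: deleted file
    return actions
```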

Get synchronization status

dataset.get_push_status(push_id)

TODO: Add description

Enumerations

Dataset types

Pre-defined dataset types supported by figshare:

  • figure
  • online resource
  • preprint
  • book
  • conference contribution
  • media
  • dataset
  • poster
  • journal contribution
  • presentation
  • thesis
  • software

NOTES:

  • Most of them are generic publication types, not dataset types. I'm not sure if they are meaningful to use. [SG]

Licenses

Pre-defined licenses supported by figshare:

  • CC BY 4.0
  • CC0
  • MIT
  • GPL
  • GPL 2.0+
  • Apache 2.0

NOTES:

  • figshare uses numeric license id to indicate the license. I suggest we use string license names instead and map them internally. [SG]
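The suggested internal mapping could look like the following; the numeric ids are illustrative placeholders, not the actual Figshare license ids:

```python
# Hypothetical mapping between string license names and the numeric
# ids used by the platform API; the ids below are illustrative only.
FIGSHARE_LICENSE_IDS = {
    "CC BY 4.0": 1,
    "CC0": 2,
    "MIT": 3,
    "GPL": 4,
    "GPL 2.0+": 5,
    "Apache 2.0": 7,
}

def to_platform_license(name):
    """Translate a string license name to the platform-specific id."""
    try:
        return FIGSHARE_LICENSE_IDS[name]
    except KeyError:
        raise ValueError(f"Unknown license: {name}")
```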

API vision, proposal and design notes

Separating local-related functions from remote-related functions like uploading, etc.

dataset.sync_metadata(archive)
status = dataset.sync_files(archive)

connection = fairly.connect("4tu")

# Remote copy of the dataset (e.g. article in case of figshare)
archive = connection.create_archive()
archive = connection.open_record(<id>)

dataset_record.get_metadata()
dataset_record.list_files()

Later:

dataset.add_file()        - Should update .ignore file
dataset.remove_file()     - Should update .ignore file
dataset.export_metadata() - Should export metadata following a metadata standard

File update from local to remote

User story: when I have uploaded wrong files, or files are missing for the publication, I need to update the repository. Because files are very heavy, I don't want to overwrite the entire archive, but instead upload only those that are not there yet, and perhaps delete those that are no longer there.

A kind of smart-overwrite feature:

List datasets

  • status : Status of the dataset [string].
    • "draft" : Not published yet (i.e. private)

    • "restricted" : Published with restrictions (i.e. under embargo)

    • "public" : Published publicly

      Example: "status": "public",

Business logic/rules

  • We only synchronize to unpublished repositories
  • The DOI should remain constant even if the dataset is duplicated in different data providers (Zenodo, Figshare, etc.)
  • A new version can only be made by the author of the dataset

Stories

List datasets

After having my dataset ready to be archived, I would like to identify the archive to which I want to deposit my data; therefore, I need a list to make this identification.

Get metadata of a dataset

Might be used to store the metadata locally, mostly for local readability and consultation. The downloaded metadata shouldn't be writable, for example: users can make a mistake and overwrite metadata, which is not desirable.

Select a repo