We do not want to provide a package that replicates the functionality of the data repository APIs and offers generic methods for arbitrary metadata and dataset operations. Our intention is to develop a package that focuses on the core task, i.e. uploading datasets to data repositories, and provides the necessary basis for the JupyterFAIR extension.
```python
import fairly

# Create a local dataset
dataset = fairly.create_dataset('/path/dataset')

# Set metadata
dataset.set_metadata({
    "title": "My wonderful dataset",
    "license": "CC BY 4.0",
    "keywords": ["FAIR", "data"],
    "authors": [
        "0000-0002-0156-185X",
        {
            "name": "John",
            "surname": "Doe",
            "role": "contributor",
        },
    ],
})

# Add data files and folders
dataset.add_files([
    "README.txt",
    "*.csv",
    "train/*.jpg",
    "test/*.jpg",
])

# Upload to the remote data repository
remote_dataset = dataset.upload("4tu")

# Change metadata
dataset.metadata["license"] = "MIT"

# Synchronize the remote dataset with the local dataset
dataset.synchronize()
```
```python
import fairly

# Connect to the local repository
local = fairly.connect('/path/repository')

# Open an existing local dataset
# (loads existing metadata and imports the list of files and folders)
dataset = local.open_dataset('/path/dataset')

# Connect to the remote data repository
remote = fairly.connect("4tu", token="1b7e-6700-26b5-4fda")

# Find the remote dataset (i.e. archive) matching the local dataset, e.g. by using its DOI
archive = remote.find_dataset(dataset)

# Synchronize the archived dataset with the local copy
dataset.synchronize(archive)
```

or

```python
# Find the archived dataset and synchronize it with the local copy
dataset.synchronize(remote)
```
TODO: Describe overall design, including the workflow to define a dataset and upload it to a data repository.
A data repository can be identified by one of the following:
- `id`: Unique identifier of the data repository as defined by the package.
- `url`: URL address of the data repository known to the package.
- `api_url` and `platform`: URL address of the data repository API endpoint and its platform identifier. `platform` determines the implementation to be used, and `api_url` determines the connection point.
The platform of a repository can be found from a generic `uid` value as follows (a sketch follows the list):

- Check if it is a valid URL address.
  Regular expression: `/[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/ig`
  (Source: https://regexr.com/39nr7)
  - Yes: Check if the URL address is a recognized data repository URL (see `list_repositories()`).
    - Yes: Use the `platform` that is specified in the repository dictionary.
    - No: Use the `platform` specified by the platform argument if available; otherwise raise an "Unknown platform" error.
  - No: Check if the id is a recognized data repository id (see `list_repositories()`).
    - Yes: Use the `platform` that is specified in the repository dictionary.
    - No: Raise an "Invalid id" error.
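A minimal sketch of this resolution logic, assuming `fairly.list_repositories()` returns the repository dictionaries described below; the helper name `resolve_platform` and the simplified URL check are illustrative:

```python
import re

import fairly

# Simplified URL check for illustration; the full pattern from
# https://regexr.com/39nr7 is given above.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

def resolve_platform(uid: str, platform: str = None) -> str:
    # Hypothetical helper implementing the decision tree above
    repositories = fairly.list_repositories()
    if URL_RE.fullmatch(uid):
        # uid is a URL: check if it matches a recognized repository URL
        for repository in repositories:
            if uid.startswith(repository["url"]):
                return repository["platform"]
        if platform:
            return platform
        raise ValueError("Unknown platform")
    # uid is not a URL: check if it is a recognized repository id
    for repository in repositories:
        if repository["id"] == uid:
            return repository["platform"]
    raise ValueError("Invalid id")
```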
A dataset published at a specific data repository (i.e. dataset archive) can be identified by one of the following:
- `id`: Repository-specific unique id of the dataset archive.
- `doi`: DOI of the dataset associated with the dataset archive.
- `url`: URL address of the dataset archive.

The type of a generic `uid` value can be found as follows (see the sketch after the list):

- Check if it is a valid URL address.
  Regular expression: `/[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/ig`
  (Source: https://regexr.com/39nr7)
  - Yes: If the host is `doi.org`, then the identifier is a `doi`; otherwise it is a `url`.
  - No: Check if it is a valid DOI.
    Regular expression: `/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i`
    (Source: https://www.crossref.org/blog/dois-and-matching-regular-expressions/)
    - Yes: The identifier is a `doi`.
    - No: The identifier is an `id`.
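A minimal sketch of this classification; the helper name `classify_uid` and the simplified URL check are illustrative:

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
# DOI pattern from the Crossref blog post referenced above
DOI_RE = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

def classify_uid(uid: str) -> str:
    # Returns "doi", "url", or "id" following the decision tree above
    if URL_RE.fullmatch(uid):
        host = urlparse(uid).hostname or ""
        return "doi" if host.endswith("doi.org") else "url"
    if DOI_RE.match(uid):
        return "doi"
    return "id"
```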
Remarks:
A file path can be a relative path including folders separated by slashes (e.g. `train/images/001.jpg`).
Examples:
TODO: Describe overall configuration mechanism.
TODO: Describe the use of .dot files.
TODO: Describe the use of environment variables.
Examples:
REMARKS:
The platforms that will be supported by the package are listed below:

The local platform will allow access to local datasets, as well as creating and modifying them. Having a local implementation allows users to use the same methods to work with both local and remote datasets.

Figshare is an open-access but closed-source commercial repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos. 4TU.ResearchData is currently based on Figshare.

Djehuty is the new open-source repository system of 4TU.ResearchData. It mimics the Figshare API and can therefore be supported easily; a separate implementation is required only for deviations from the original Figshare API.

Invenio is an open-source digital library and document repository framework used by Zenodo. We consider Zenodo a high-priority candidate repository to support. However, whether it will be implemented initially is yet to be decided (based on time availability).
These methods are available in the `fairly` module.
platforms = fairly.list_platforms()
Returns a dictionary of dictionaries of platforms supported by the package.
Keys of the dictionary are unique platform identifiers (string).
Parameters:
None
Raises:
None
Platform dictionary:

| Attribute | Type | Description |
|---|---|---|
| name | string | Name of the platform. |
| url | string | URL address of the platform portal. |
| has_folders | boolean | True if folders are supported by the platform. |
| has_partial_upload | boolean | True if partial uploads are supported by the platform. |
| experimental | boolean | True if the platform support is experimental. |
Example:
```python
{
    "local": {
        "name": "Local",
        "url": "//localhost/",
        "has_folders": True,
        "has_partial_upload": True,
        "experimental": False,
    },
    "figshare": {
        "name": "Figshare",
        "url": "https://figshare.com/",
        "has_folders": False,
        "has_partial_upload": True,
        "experimental": False,
    },
    ...
}
```
repositories = fairly.list_repositories(platform?=<id>)
Returns a list of dictionaries of the repositories recognized by the package.
Repository dictionaries may contain platform-specific information.
Parameters:

| Name | Type | Description |
|---|---|---|
| platform | string | Platform identifier to filter repositories (optional). |
Raises:
None
Repository dictionary:

| Attribute | Type | Description |
|---|---|---|
| id | string | Unique identifier of the repository. |
| name | string | Name of the repository. |
| platform | string | Platform identifier of the repository. |
| url | string | URL address of the repository. |
| api_url | string | URL address of the repository API endpoint. |
| ... | | |
Example:
```python
[
    {
        "id": "local",
        "name": "Local Repository",
        "platform": "local",
        "url": "//localhost/",
    },
    {
        "id": "4tu",
        "name": "4TU.ResearchData",
        "platform": "figshare",
        "url": "https://data.4tu.nl/",
        "api_url": "https://api.figshare.com/v2/",
        ...
    },
    ...
]
```
licenses = fairly.list_licenses()
Returns a dictionary of dictionaries of licenses recognized by the package.
Keys are unique license identifiers (string).
Parameters:
None
Raises:
None
License dictionary:

| Attribute | Type | Description |
|---|---|---|
| name | string | Name of the license. |
| url | string | URL address of the license. |
Example:
```python
{
    "CC BY 4.0": {
        "name": "Creative Commons Attribution 4.0 International",
        "url": "https://creativecommons.org/licenses/by/4.0/",
    },
    ...
}
```
NOTES:
As a user I want to be able to select a license and put it as a file in my dataset folder.
As a user I would like to add a custom license.
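A sketch of the first user story on top of the proposed API; since the package does not (yet) define a way to obtain the full license text, only the license name and URL are written here:

```python
import fairly

# Look up a recognized license and place it as a file in the dataset folder
licenses = fairly.list_licenses()
license_info = licenses["CC BY 4.0"]

# Writing only the license name and URL; how the full license text
# would be obtained is not yet defined by the package.
with open("/path/dataset/LICENSE.txt", "w") as f:
    f.write(f"{license_info['name']}\n{license_info['url']}\n")
```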
repository = fairly.connect(<id>)
repository = fairly.connect(<url>, platform?=<id>)
Returns a data repository object.
Parameters:

| Name | Type | Description |
|---|---|---|
| id | string | Identifier of the repository (see `fairly.list_repositories`). |
| url | string | URL address of the repository (see `fairly.list_repositories`). |
| platform | string | Platform identifier of the repository (see `fairly.list_platforms`). |
| username | string | Name of the user account. |
| password | string | Password of the user account. |
| token | string | Access token linked to the user account. |
Raises:
- IOError
- ValueError
  - Invalid id.
  - Unknown platform.
  - Invalid username.
  - Invalid password.
  - Invalid token.
Example:
fairly.connect("4tu")
fairly.connect("/home/jovyan/research/dataset")
fairly.connect("https://api.figshare.com/v2/", platform="figshare")
REMARKS:
The templates of the methods are defined in the `fairly.repository` interface module. Platform-specific methods are implemented in the respective platform modules.
info = repository.get_info()
Returns the information dictionary of the repository.
Repository dictionaries may contain repository-specific information.
Parameters:
None
Raises:
- IOError

Repository Dictionary:

| Attribute | Type | Description |
|---|---|---|
| repository_url | string | URL address of the repository. |
| platform | string | Platform identifier of the repository. |
| username | string | Name of the user account. |
| token | string | Access token linked to the user account. |
| name | string | Name of the user. |
| surname | string | Surname of the user. |
| email | string | E-mail address of the user. |
Example:
```python
{
    "repository_url": "https://data.4tu.nl/",
    "platform": "figshare",
    "username": "s.girgin@utwente.nl",
    "token": "42370f95-8711-4c2d-8aeb-9f761bf35640",
    "name": "Serkan",
    "surname": "Girgin",
    "email": "s.girgin@utwente.nl",
    ...
}
```
Remarks:
dataset = repository.create_dataset(metadata={})
Creates a dataset in the repository and returns a dataset object of the newly created dataset.
TODO: Add description
dataset = repository.open_dataset(uid, version?=<version>)
Returns a dataset object for the specified dataset in the repository.
Parameters:

| Name | Type | Description |
|---|---|---|
| uid | string | Identifier of the dataset (see "Dataset identifiers"). |
| version | string | Version of the dataset (optional). If not specified, the latest version is used. |

Raises:
- IOError
- ValueError
  - Invalid id.
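A usage sketch, reusing the dataset identifiers from the `list_datasets` example below:

```python
repository = fairly.connect("4tu", token="1b7e-6700-26b5-4fda")

# Open the latest version by DOI
dataset = repository.open_dataset("10.4121/14438750")

# Open a specific version by repository-specific id
dataset_v1 = repository.open_dataset("14438750", version="1")
```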
datasets = repository.list_datasets()
Returns a list of dictionaries of the datasets available in the repository.
Parameters:
None
Raises:
Dataset Dictionary:

| Attribute | Type | Description |
|---|---|---|
| id | string | Repository-specific unique identifier of the dataset. |
| doi | string | DOI of the dataset (if available). |
| url | string | URL address of the dataset. |
| title | string | Title of the dataset. |
| date | date | Date of the dataset. |
| versions | dictionary | Dictionary of dataset versions, {version: date, …}. |
Example:
```python
[
    {
        "id": "14438750",
        "doi": "10.4121/14438750",
        "url": "https://data.4tu.nl/articles/dataset/Agricultural_SandboxNL_Database_V1_0/14438750",
        "title": "Agricultural SandboxNL Database V1.0",
        "date": "2022-03-17 09:28",
        "versions": {
            "1": "2021-07-16 15:42",
            "2": "2022-03-17 09:28",
        },
    },
    ...
]
```
These methods are available in the `fairly.dataset` module. A `dataset` object should keep a reference to its `repository` object; to perform its tasks, it should call the related repository methods, specifying the dataset id and version. The methods are mainly for convenience, and the following uses are identical to each other:
```python
repository = fairly.connect("4tu", token="1b7e-6700-26b5-4fda")
dataset = repository.open_dataset("10.1204.56", version="2")

# Get metadata by using the dataset method
metadata = dataset.get_metadata()

# Get metadata by using the repository method
metadata = repository.get_metadata("10.1204.56", 2)
```
metadata = dataset.get_metadata()
Returns the metadata dictionary of the dataset.
Parameters:
None
Raises:
- IOError
Metadata Dictionary:
- id : Repository-specific unique identifier of the dataset.
- doi : DOI of the dataset (if available).
- url : URL address of the dataset.
- title : Title of the dataset.
- date : Publication date of the dataset.
- description : Description of the dataset.
- keywords : List of keywords associated with the dataset.
- authors : List of author dictionaries associated with the dataset.
- license : License type of the dataset (see "License types").
- version : Version of the dataset (if available).
Author dictionary:
- id : Repository-specific unique identifier of the author.
- name : Name(s) of the author.
- surname : Surname(s) of the author.
- title : Job title of the author (if available).
- institution : Institution of the author (if available).
- orcid_id : ORCID identifier of the author (if available).
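An illustrative metadata dictionary, reusing values from the first example above (not actual repository output):

```python
{
    "title": "My wonderful dataset",
    "keywords": ["FAIR", "data"],
    "license": "CC BY 4.0",
    "version": "1",
    "authors": [
        {
            "name": "John",
            "surname": "Doe",
            "orcid_id": "0000-0002-0156-185X",
        },
    ],
}
```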
dataset.set_metadata(metadata)
Sets specified metadata attributes of the dataset.
See `dataset.get_metadata()` for supported metadata attributes.
Parameters:

| Name | Type | Description |
|---|---|---|
| metadata | dictionary | Metadata attributes to be updated. |
Raises:
Remarks:

Metadata attributes can also be passed as keyword arguments:

```python
dataset.set_metadata(title="New title")
```
files = dataset.list_files()
Returns a list of dictionaries of the data files of the dataset.
Raises:
Data File Dictionary:
- id : Unique id of the data file (if available).
- url : URL address of the data file [string].
- path : Path of the data file (see "File paths") [string].
- type : (MIME?) Type of the data file [string].
- size : Size of the data file in bytes [int].
- md5 : MD5 checksum of the data file [string].
Example:
```python
[
    {
        "id": "36059786",
        "url": "https://data.4tu.nl/ndownloader/files/36059786",
        "path": "upload data.zip",
        "type": "application/zip",
        "size": 799271235,
        "md5": "50bcc2c08d3cc45e9fa8ffcf2b2c391c",
    },
    ...
]
```
diff = dataset.diff_metadata(other_dataset)
==TODO: Add description==
diff = dataset.diff_files(other_dataset)
==TODO: Add description==
push_id = dataset.push(other_dataset, callback?=<callback>)
Updates the metadata and data files of the other dataset so that they match those of this dataset. Repository-specific unique metadata attributes (e.g. `id`, `url`) should be excluded.

Synchronization includes the following actions:

If the upload of an updated file does not replace the existing file automatically (i.e. not an in-place replacement), the existing file should be deleted after the upload (step 4).

Synchronization should be atomic, i.e. in case of failure at any step, the other dataset should revert to its original state.
dataset.get_push_status(push_id)
TODO: Add description
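A usage sketch for an asynchronous push with progress reporting; the callback signature and the status values used here are assumptions, as neither is specified yet:

```python
import time

def on_progress(current, total):
    # Hypothetical callback signature: files transferred so far and in total
    print(f"Transferred {current} of {total} files")

push_id = dataset.push(other_dataset, callback=on_progress)

# Poll the push status; the "in_progress" value is an assumption
while dataset.get_push_status(push_id) == "in_progress":
    time.sleep(1)
```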
Pre-defined dataset types supported by Figshare:
NOTES:
Pre-defined licenses supported by Figshare:
NOTES:
Separating local-related functions from remote-related functions like uploading, etc.
```python
dataset.sync_metadata(archive)
status = dataset.sync_files(archive)

connection = fairly.connect("4tu")

# Remote copy of the dataset (e.g. an article in the case of Figshare)
archive = connection.create_archive()
archive = connection.open_record(<id>)

dataset_record.get_metadata()
dataset_record.list_files()
```
Later:
- `dataset.add_file()` : should update the .ignore file.
- `dataset.remove_file()` : should update the .ignore file.
- `dataset.export_metadata()` : should export metadata following a metadata standard.
User story: when I have uploaded wrong files, or am missing files for the publication, I need to update the repository. Because the files are very heavy, I don't want to overwrite the entire archive, but instead upload only those that are not there yet, and perhaps delete those that are no longer there.

This is a kind of smart overwrite feature; see the sketch below.
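A sketch of such a smart overwrite on top of the proposed API, assuming `dataset.diff_files()` reports which files are missing from and stale in the archive; the diff attributes and archive methods used here are assumptions, as `diff_files()` is still a TODO above:

```python
# Compare local files against the archived copy
diff = dataset.diff_files(archive)

# Upload only the files that are not in the archive yet
for file in diff.missing:  # hypothetical attribute
    archive.upload_file(file["path"])  # hypothetical method

# Delete archived files that no longer exist locally
for file in diff.stale:  # hypothetical attribute
    archive.delete_file(file["id"])  # hypothetical method
```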
List datasets: a dataset dictionary may also include a status attribute, e.g. `"status": "public"`, with one of the following values:

- "draft" : Not published yet (i.e. private)
- "restricted" : Published with restrictions (i.e. under embargo)
- "public" : Published publicly
After my dataset is ready to be archived, I would like to identify the archive to which I want to deposit my data; therefore, I need a list to make this identification.
A local metadata file might be used to store the metadata locally, mostly for local readability and consultation. Metadata that is downloaded should not be writable, for example; users can make a mistake and overwrite the metadata, which is not desirable.