# *fairly* Package Design
## Philosophy
We don't want to provide a package that replicates the functionality of the data repository APIs and offers generic methods to perform arbitrary metadata and dataset operations. Our intention is to develop a package that focuses on the core task, i.e. uploading datasets to data repositories, and provides the necessary basis for the JupyterFAIR extension.
## Vision
```python=
import fairly
# Create a local dataset
dataset = fairly.create_dataset('/path/dataset')
# Set metadata
dataset.set_metadata({
    "title": "My wonderful dataset",
    "license": "CC BY 4.0",
    "keywords": ["FAIR", "data"],
    "authors": [
        "0000-0002-0156-185X",
        {
            "name": "John",
            "surname": "Doe",
            "role": "contributor",
        },
    ],
})
# Add data files and folders
dataset.add_files([
    "README.txt",
    "*.csv",
    "train/*.jpg",
    "test/*.jpg",
])
# Upload to the remote data repository
remote_dataset = dataset.upload("4tu")
# Change metadata
dataset.metadata["license"] = "MIT"
# Synchronize the remote dataset with the local dataset
dataset.synchronize()
```
```python=
import fairly
# Connect to the local repository
local = fairly.connect('/path/repository')
# Open an existing local dataset
# (Loads existing metadata and imports list of files and folders)
dataset = local.open_dataset('/path/dataset')
# Connect to the remote data repository
remote = fairly.connect("4tu", token="1b7e-6700-26b5-4fda")
# Find the remote dataset (i.e. archive) matching the local dataset, e.g. by using DOI
archive = remote.find_dataset(dataset)
# Synchronize the archived dataset with the local copy
dataset.synchronize(archive)
# or
# Find the archived dataset and synchronize it with the local copy
dataset.synchronize(remote)
```
## Terminology
Dataset
: A collection of one or more data files and folders, and related metadata.
Data file
: A file that is part of a dataset and stores data in a specific format, including (compressed) archive of multiple files or folders.
Data folder
: A folder that is part of a dataset and contains one or more data files or folders.
Metadata
: Structured information about a dataset (according to a standard).
Metadata standard
: A requirement intended to establish a common understanding of the meaning or semantics of datasets.
Data repository
: A place where datasets are stored, including online services.
Platform
: Software platform used by a data repository.
## Overall Design
==**TODO**: Describe overall design, including the workflow to define a dataset and upload it to a data repository.==
## Identifiers
### <a name="repository-identifiers"></a>Data repository identifiers
A data repository can be identified by one of the following:
- `id` : Unique identifier of the data repository as defined by the package
- `url` : URL address of the data repository known to the package
- `api_url` and `platform`: URL address of the data repository API endpoint and its platform identifier
`platform` determines the implementation to be used, and `api_url` determines the connection point.
The platform of a repository can be determined from a generic `uid` value as follows:
1. Check if it is a valid URL address
   Regular expression: `/[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/ig`
   (Source: https://regexr.com/39nr7)
   - Yes: Check if the URL address is a recognized data repository URL (see `list_repositories()`)
     - Yes: Use the `platform` that is specified in the repository dictionary.
     - No: Use the `platform` specified by the `platform` argument if available; otherwise, raise an `Unknown platform` error.
   - No: Check if the id is a recognized data repository id (see `list_repositories()`)
     - Yes: Use the `platform` that is specified in the repository dictionary.
     - No: Raise an `Invalid id` error.
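A minimal sketch of this resolution logic (the helper name `resolve_platform` and the simplified URL check are illustrative; `list_repositories()` is described under Global Methods below):
```python=
import re

import fairly

# Simplified check; the full regular expression above can be used instead
URL_PATTERN = re.compile(r"^https?://", re.IGNORECASE)

def resolve_platform(uid: str, platform: str = None) -> str:
    """Returns the platform identifier for a generic data repository uid."""
    repositories = fairly.list_repositories()
    if URL_PATTERN.match(uid):
        # uid is a URL address: look for a recognized repository URL
        for repository in repositories:
            if uid.rstrip("/") == repository["url"].rstrip("/"):
                return repository["platform"]
        if platform:
            return platform
        raise ValueError("Unknown platform.")
    # uid is not a URL address: look for a recognized repository id
    for repository in repositories:
        if uid == repository["id"]:
            return repository["platform"]
    raise ValueError("Invalid id.")
```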
### <a name="dataset-identifiers"></a>Dataset identifiers
A dataset published at a specific data repository (i.e. dataset archive) can be identified by one of the following:
- `id` : Repository-specific unique id of the dataset archive
- `doi` : DOI of the dataset associated to the dataset archive
- `url` : URL address of the dataset archive
The type of a generic `uid` value can be determined as follows:
- Check if it is a valid URL address
  Regular expression: `/[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/ig`
  (Source: https://regexr.com/39nr7)
  - Yes: If the host is `doi.org`, the identifier is a `doi`; otherwise, it is a `url`.
  - No: Check if it is a valid DOI
    Regular expression: `/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i`
    (Source: https://www.crossref.org/blog/dois-and-matching-regular-expressions/)
    - Yes: The identifier is a `doi`.
    - No: The identifier is an `id`.
**Remarks:**
- DOI and URL allow validation, i.e. it can be checked whether the identifier belongs to the specified repository:
  - For a DOI, the related URL address can be retrieved and compared with the base URL of the repository.
  - For a URL, it can be compared directly with the base URL of the repository.
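A minimal sketch of this type detection (the helper name `get_uid_type` and the simplified URL check are illustrative; the DOI pattern is the Crossref expression given above):
```python=
import re
import urllib.parse

# Simplified check; the full regular expression above can be used instead
URL_PATTERN = re.compile(r"^https?://", re.IGNORECASE)
# DOI pattern (Crossref)
DOI_PATTERN = re.compile(r"^10.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

def get_uid_type(uid: str) -> str:
    """Returns the identifier type ('doi', 'url', or 'id') of a generic uid."""
    if URL_PATTERN.match(uid):
        host = urllib.parse.urlparse(uid).netloc
        return "doi" if host.endswith("doi.org") else "url"
    if DOI_PATTERN.match(uid):
        return "doi"
    return "id"

# e.g. get_uid_type("https://doi.org/10.4121/14438750") -> "doi"
#      get_uid_type("10.4121/14438750")                 -> "doi"
#      get_uid_type("14438750")                         -> "id"
```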
### <a name="file-paths"></a>File paths
A file path is a relative path; folder names are separated by slashes.
Examples:
- README.txt
- Agricultural SandboxNL Database V1.0.zip
- Original Data/Ngari/SQ01.xlsx
## Configuration
==**TODO**: Describe overall configuration mechanism.==
### .dot files
==**TODO**: Describe the use of .dot files.==
### Environment variables
==**TODO**: Describe the use of environment variables.==
- FAIRLY_(ATTRIBUTE)
- FAIRLY_(REPOSITORY)_(ATTRIBUTE)
**Examples:**
- FAIRLY_USERNAME
- FAIRLY_PASSWORD
- FAIRLY_4TU_USERNAME
- FAIRLY_4TU_PASSWORD
**REMARKS:**
- Values of repository-specific variables override the values of generic variables.
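A minimal sketch of how these variables could be resolved (the helper name `get_config_value` is illustrative; the overall configuration mechanism is still to be described above):
```python=
import os

def get_config_value(attribute: str, repository: str = None) -> str:
    """Returns a configuration value from environment variables.

    A repository-specific variable (FAIRLY_<REPOSITORY>_<ATTRIBUTE>) overrides
    the generic variable (FAIRLY_<ATTRIBUTE>).
    """
    value = None
    if repository:
        value = os.environ.get(f"FAIRLY_{repository.upper()}_{attribute.upper()}")
    if value is None:
        value = os.environ.get(f"FAIRLY_{attribute.upper()}")
    return value

# e.g. get_config_value("username", "4tu") returns FAIRLY_4TU_USERNAME if set,
# otherwise FAIRLY_USERNAME
```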
## Platforms
The platforms that will be supported by the package are listed below:
### Local
The local platform will allow access to local datasets, as well as creating and modifying them. Having a local implementation will allow users to use the same methods to work with local and remote datasets.
### Figshare
[Figshare](https://figshare.com/) is an open-access, but closed-source commercial repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos. 4TU.ResearchData is currently based on Figshare.
### Djehuty
[Djehuty](https://github.com/4TUResearchData/djehuty) is the new open-source repository system of 4TU.ResearchData. It mimics the Figshare API and can therefore be supported easily. A separate implementation is required to support any deviations from the original Figshare API.
### Invenio
[Invenio](https://github.com/inveniosoftware/invenio) is an open-source digital library and document repository framework used by [Zenodo](https://zenodo.org/). We consider Zenodo a high-priority candidate repository. However, whether it will be supported in the initial release is still to be decided (based on time availability).
## Global Methods
These methods are available in the `fairly` module.
### List supported platforms
`platforms = fairly.list_platforms()`
Returns a dictionary of dictionaries of platforms supported by the package.
Keys of the dictionary are unique platform identifiers *(string)*.
**Parameters:**
*None*
**Raises:**
*None*
**Platform dictionary:**
| Attribute | Type | Description |
|----------------------|---------|----------------------------------------------------------|
| `name` | string | Name of the platform. |
| `url` | string | URL address of the platform portal. |
| `has_folders` | boolean | `True` if folders are supported by the platform. |
| `has_partial_upload` | boolean | `True` if partial uploads are supported by the platform. |
| `experimental` | boolean | `True` if the platform support is experimental. |
**Example:**
```python=
{
"local": {
"name": "Local",
"url": "//localhost/",
"has_folders": True,
"has_partial_upload": True,
"experimental": False,
},
"figshare": {
"name": "Figshare",
"url": "https://figshare.com/",
"has_folders": False,
"has_partial_upload": True,
"experimental": False,
},
...
}
```
### List recognized data repositories
`repositories = fairly.list_repositories(platform?=<id>)`
Returns a list of dictionaries of the repositories recognized by the package.
Repository dictionaries may contain platform-specific information.
**Parameters:**
| Name | Type | Description |
|------------|---------|-------------------------------------------------------|
| `platform` | string | Platform identifier to filter repositories (optional) |
**Raises:**
None
**Repository dictionary:**
| Attribute | Type | Description |
|------------|---------|---------------------------------------------|
| `id` | string | Unique identifier of the repository. |
| `name` | string | Name of the repository. |
| `platform` | string | Platform identifier of the repository. |
| `url` | string | URL address of the repository. |
| `api_url` | string | URL address of the repository API end-point |
| `...` | | |
**Example:**
```python=
[
{
"id": "local",
"name": "Local Repository",
"platform": "local",
"url": "//localhost/",
},
{
"id": "4tu",
"name": "4TU.ResearchData",
"platform": "figshare",
"url": "https://data.4tu.nl/",
"api_url": "https://api.figshare.com/v2/",
...
},
...
]
```
### List licenses
`licenses = fairly.list_licenses()`
Returns a dictionary of dictionaries of licenses recognized by the package.
Keys are unique license identifiers *(string)*.
**Parameters:**
*None*
**Raises:**
*None*
**License dictionary:**
| Attribute | Type | Description |
|-----------|---------|-----------------------------|
| `name` | string | Name of the license. |
| `url` | string | URL address of the license. |
**Example:**
```python=
{
"CC BY 4.0": {
"name": "Creative Commons Attribution 4.0 International",
"url": "https://creativecommons.org/licenses/by/4.0/",
},
...
}
```
==**NOTES:**==
- Licenses can be stored as Markdown documents and can be copied to the local dataset folder during initialization (similar to GitHub repository initialization).
- Support for custom licenses can be added.
- > As a user I want to be able to select a license and put it as a file in my dataset folder.
- > As a user I would like to add a custom license.
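As a rough sketch of the first note, a selected license document could be copied into the local dataset folder during initialization; the storage location, file naming, and helper name below are assumptions, not part of the proposed API:
```python=
import shutil
from pathlib import Path

# Assumed location of the bundled license documents
LICENSE_DIR = Path(__file__).parent / "licenses"

def copy_license(license_id: str, dataset_path: str) -> None:
    """Copies the Markdown document of a recognized license into a dataset folder."""
    source = LICENSE_DIR / f"{license_id}.md"
    if not source.exists():
        raise ValueError("Invalid license.")
    shutil.copy(source, Path(dataset_path) / "LICENSE.md")

# e.g. copy_license("CC BY 4.0", "/path/dataset")
```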
### Connect to a data repository
`repository = fairly.connect(<id>)`
`repository = fairly.connect(<url>, platform?=<id>)`
Returns a data repository object.
**Parameters:**
| Name | Type | Description |
|------------|--------|---------------------------------------------------------------------|
| id | string | Identifier of the repository (see `fairly.list_repositories`) |
| url | string | URL address of the repository (see `fairly.list_repositories`) |
| platform | string | Platform identifier of the repository (see `fairly.list_platforms`) |
| *username* | string | Name of the user account |
| *password* | string | Password of the user account |
| *token* | string | Access token linked to the user account |
**Raises:**
- `IOError`
- *Error occurred while connecting to the repository.*
- `ValueError`
- `Invalid id.`
- `Unknown platform.`
- `Invalid username.`
- `Invalid password.`
- `Invalid token.`
**Example:**
```python=
fairly.connect("4tu")
fairly.connect("/home/jovyan/research/dataset")
fairly.connect("https://api.figshare.com/v2/", platform="figshare")
```
**REMARKS**:
- Environment variables and .dot files should be consulted for missing arguments.
## Repository methods
The templates of the methods are defined in the `fairly.repository` interface module. Platform-specific methods are implemented in respective platform modules.
### Get repository information
`info = repository.get_info()`
Returns the information dictionary of the repository.
Repository dictionaries may contain repository-specific information.
**Parameters:**
*None*
**Raises:**
- `IOError`:
- *Error occurred while retrieving the repository information.*
**Repository Dictionary:**
| Attribute | Type | Description |
|----------------|--------|------------------------------------------|
| repository_url | string | URL address of the repository. |
| platform | string | Platform identifier of the repository. |
| *username* | string | Name of the user account. |
| *token* | string | Access token linked to the user account. |
| *name* | string | Name of the user. |
| *surname* | string | Surname of the user. |
| *email* | string | E-mail address of the user. |
**Example:**
```python=
{
"repository_url": "https://data.4tu.nl/",
"platform": "figshare",
"username": "s.girgin@utwente.nl",
"token": "42370f95-8711-4c2d-8aeb-9f761bf35640",
"name": "Serkan",
"surname": "Girgin",
"email": "s.girgin@utwente.nl",
...
}
```
Remarks:
- Can be cached?
### Create a dataset
`dataset = repository.create_dataset(metadata={})`
Creates a dataset in the repository and returns a dataset object of the newly created dataset.
==**TODO:** Add description==
### Open a dataset
`dataset = repository.open_dataset(uid, version?=<version>)`
Returns a dataset object for the specified dataset in the repository.
**Parameters**
| Name | Type | Description |
|------------|--------|---------------------------------------------------------------------------------|
| uid        | string | Identifier of the dataset (see [Dataset identifiers](#dataset-identifiers))      |
| version | string | Version of the dataset (optional). If not specified, the latest version is used |
**Raises**
- `IOError`
- *Error occurred while opening the dataset.*
- `ValueError`
- `Invalid id.`
### List datasets
`datasets = repository.list_datasets()`
Returns a list of dictionaries of the datasets available in the repository.
**Parameters:**
*None*
**Raises:**
- IOError:
- *Error occurred while retrieving the list of datasets.*
**Dataset Dictionary:**
| Attribute | Type | Description |
|-----------|------------|-------------------------------------------------------|
| id | string | Repository-specific unique identifier of the dataset. |
| doi | string | DOI of the dataset (if available). |
| url | string | URL address of the dataset. |
| title | string | Title of the dataset. |
| date | date | Date of the dataset. |
| versions | dictionary | Dictionary of dataset versions, {version: date, ...} |
**Example:**
```python=
[
{
"id": "14438750",
"doi": "10.4121/14438750",
"url": "https://data.4tu.nl/articles/dataset/Agricultural_SandboxNL_Database_V1_0/14438750",
"title": "Agricultural SandboxNL Database V1.0",
"date": "2022-03-17 09:28",
"versions": {
"1": "2021-07-16 15:42",
"2": "2022-03-17 09:28",
},
},
...
]
```
## Dataset Methods
These methods are available in the `fairly.dataset` module. A `dataset` object should keep a reference to the `repository` object and, to perform its tasks, should call the related repository methods by specifying the dataset id and version. The methods are mainly for convenience, and the following uses are identical to each other:
```python=
repository = fairly.connect("4tu", token="1b7e-6700-26b5-4fda")
dataset = repository.open_dataset("10.1204.56")
# Get metadata by using the dataset method
metadata = dataset.get_metadata()
# Get metadata by using the repository method
metadata = repository.get_metadata("10.1204.56")
```
### <a name="dataset-get-metadata"></a>Get metadata of a dataset
`metadata = dataset.get_metadata()`
Returns the metadata dictionary of the dataset.
**Parameters:**
*None*
**Raises**
- `IOError`
- *Error occurred while reading the metadata.*
**Metadata Dictionary:**
- id : Repository-specific unique identifier of the dataset.
- doi : DOI of the dataset (if available).
- url : URL address of the dataset.
- title : Title of the dataset.
- date : Publication date of the dataset.
- description : Description of the dataset.
- keywords : List of keywords associated with the dataset.
- authors : List of author dictionaries associated with the dataset.
- license : License type of the dataset (see "Licenses").
- version : Version of the dataset (if available).
**Author dictionary:**
- id : Repository-specific unique identifier of the author.
- name : Name(s) of the author.
- surname : Surname(s) of the author.
- title : Job title of the author (if available).
- institution : Institution of the author (if available).
- orcid_id : ORCID identifier of the author (if available).
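An illustrative metadata dictionary, assembled from the vision example above (identifier attributes are omitted and all values are hypothetical):
```python=
{
    "title": "My wonderful dataset",
    "description": "Description of the dataset.",
    "keywords": ["FAIR", "data"],
    "license": "CC BY 4.0",
    "version": "1",
    "authors": [
        {
            "name": "John",
            "surname": "Doe",
            "orcid_id": "0000-0002-0156-185X",
        },
    ],
}
```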
### <a name="dataset-set-metadata"></a>Set metadata of a dataset
`dataset.set_metadata(metadata)`
Sets specified metadata attributes of the dataset.
See [`dataset.get_metadata()`](#dataset-get-metadata) for supported metadata attributes.
**Parameters:**
| Attribute | Type | Description |
|-----------|------------|-----------------------------------|
| metadata | dictionary | Metadata attributes to be updated |
**Raises:**
- IOError:
- *Error occurred while setting the metadata.*
Remarks:
- Setting specific metadata attributes by name can also be supported, e.g. `dataset.set_metadata(title = "New title")`
### List data files of a dataset
`files = dataset.list_files()`
Returns a list of dictionaries of the data files of the dataset.
**Raises:**
- IOError:
  - *Error occurred while retrieving the list of data files.*
**Data File Dictionary:**
- id : Unique id of the data file (if available).
- url : URL address of the data file [string].
- path : Path of the data file (see "File paths") [string].
- type : (MIME?) Type of the data file [string].
- size : Size of the data file in bytes [int].
- md5 : MD5 checksum of the data file [string].
**Example:**
```python=
[
    {
        "id": "36059786",
        "url": "https://data.4tu.nl/ndownloader/files/36059786",
        "path": "upload data.zip",
        "type": "application/zip",
        "size": 799271235,
        "md5": "50bcc2c08d3cc45e9fa8ffcf2b2c391c",
    },
    ...
]
```
### Compute the differences between the metadata of two datasets
`diff = dataset.diff_metadata(other_dataset)`
==**TODO:** Add description==
### Compute the differences between the data files of two datasets
`diff = dataset.diff_files(other_dataset)`
==**TODO:** Add description==
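Although the description is still to be added, a possible sketch of the file comparison is shown below; the returned structure (added/updated/removed lists) is an assumption, not a decided design:
```python=
def diff_files(dataset, other_dataset):
    """Compares the data files of two datasets by path and MD5 checksum."""
    files = {f["path"]: f for f in dataset.list_files()}
    other_files = {f["path"]: f for f in other_dataset.list_files()}
    return {
        # present in dataset, but not in other_dataset
        "added": [path for path in files if path not in other_files],
        # present in both, but with different content
        "updated": [path for path in files
                    if path in other_files and files[path]["md5"] != other_files[path]["md5"]],
        # present in other_dataset, but not in dataset
        "removed": [path for path in other_files if path not in files],
    }
```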
### Synchronize two datasets
`push_id = dataset.push(other_dataset, callback?=<callback>)`
Updates the metadata and data files of the other dataset so that they become the same as those of the dataset.
Repository-specific unique metadata attributes (e.g. `id`, `url`) should be excluded.
Synchronization includes the following actions:
1. Update of the metadata
2. Upload of the new files
3. Upload of the updated files
4. (Removal of the previous copies of the updated files)
5. Removal of the deleted files
If the upload of an updated file does not replace the existing file automatically (i.e. no in-place replacement), the existing file should be deleted after the upload (step 4).
Synchronization should be atomic, i.e. in case of a failure at any step the other dataset should revert to its original state.
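A simplified sketch of these steps, assuming the `diff_files()` method above and hypothetical `upload_file()`/`delete_file()` helpers on the remote dataset; full atomicity would require more error handling than shown:
```python=
def push(dataset, other_dataset, callback=None):
    """Synchronizes the other dataset with the dataset (simplified, not fully atomic)."""
    # 1. Update the metadata, excluding repository-specific attributes
    metadata = {key: value for key, value in dataset.get_metadata().items()
                if key not in ("id", "doi", "url")}
    other_dataset.set_metadata(metadata)
    diff = dataset.diff_files(other_dataset)
    # 2-4. Upload new and updated files
    #      (previous copies should be deleted if uploads are not in-place replacements)
    for path in diff["added"] + diff["updated"]:
        other_dataset.upload_file(path)   # hypothetical helper method
        if callback:
            callback(path)
    # 5. Remove files that no longer exist in the dataset
    for path in diff["removed"]:
        other_dataset.delete_file(path)   # hypothetical helper method
```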
### Get synchronization status
`dataset.get_push_status(push_id)`
==**TODO:** Add description==
## Enumerations
### <a name="dataset-types"></a>Dataset types
Pre-defined dataset types supported by figshare:
- figure
- online resource
- preprint
- book
- conference contribution
- media
- dataset
- poster
- journal contribution
- presentation
- thesis
- software
**NOTES:**
- Most of them are generic publication types, not dataset types. I'm not sure if they are meaningful to use. [SG]
### <a name="licenses"></a>Licenses
Pre-defined licenses supported by figshare:
- CC BY 4.0
- CC0
- MIT
- GPL
- GPL 2.0+
- Apache 2.0
**NOTES:**
- figshare uses numeric license id to indicate the license. I suggest we use string license names instead and map them internally. [SG]
# API vision, proposal and design notes
Separate local-related functions from remote-related functions (uploading, etc.).
```python=
dataset.sync_metadata(archive)
status = dataset.sync_files(archive)
connection = fairly.connect("4tu")
# Remote copy of the dataset (e.g. article in case of figshare)
archive = connection.create_archive()
archive = connection.open_record(<id>)
dataset_record.get_metadata()
dataset_record.list_files()
# Later:
dataset.add_file()
# - Should update .ignore file
dataset.remove_file()
# - Should update .ignore file
dataset.export_metadata()
# - Should export metadata following a metadata standard
```
## File update from local to remote
**User story**: When I have uploaded wrong files, or files are missing for the publication, I need to update the repository. Because the files are very large, I don't want to overwrite the entire archive, but instead upload only those that are not there yet, and perhaps delete those that are no longer part of the dataset.
This is a kind of smart-overwrite feature, sketched below.
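A possible user-facing sketch using the vision API from the beginning of this document (method names follow the earlier examples and are not final):
```python=
import fairly

# Open the local dataset and connect to the remote data repository
local = fairly.connect('/path/repository')
dataset = local.open_dataset('/path/dataset')
remote = fairly.connect("4tu", token="1b7e-6700-26b5-4fda")

# Find the remote dataset (archive) matching the local dataset, e.g. by DOI
archive = remote.find_dataset(dataset)

# Inspect the differences before transferring anything
diff = dataset.diff_files(archive)

# Upload only new and updated files, and remove deleted ones
dataset.push(archive)
```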
**List datasets**
- status : Status of the dataset [string].
  - "draft" : Not published yet (i.e. private)
  - "restricted" : Published with restrictions (i.e. under embargo)
  - "public" : Published publicly
  e.g. `"status": "public"`
# Business logic/rules
- We only synchronize to unpublished repositories
- The DOI should remain constant even if the dataset is duplicated in different data providers (Zenodo, Figshare, etc.)
- A new version can only be made by the author of the dataset
## Stories
### List datasets
>After having my dataset ready to be archived, I would like to identify to which archive I want to deposit my data; therefore, I need a list to do this identification.
### Get metadata of a dataset
>Might be used to store the metadata locally, mostly for local readability and consultation. The downloaded metadata shouldn't be writable, for example; users could make a mistake and overwrite the metadata, which is not desirable.