Using Data Curator to publish data packages to CKAN

Purpose

These instructions are for a data publisher using:

Data Curator to create and update tabular data packages
CKAN and the CKAN Data Packager extension to publish and share data packages
CKAN Validation extension, when released, to validate data that has changed in CKAN. If the extension isn't available, then data may need to be validated using Data Curator before uploading to CKAN.

These instructions are being written to inform the development of:

Data Curator to inter-operate with the CKAN Data Packager extension
CKAN Data Packager extension as it currently doesn't fully implement the Frictionless Data specification and doesn't fully support the publishing scenarios identified below
CKAN Data Package Tools which is used by the CKAN Data Packager extension.

Assumptions and constraints

Assumptions and constraints that influence the instructions for each scenario:

Avoid the user editing the datapackage.json or tableschema.json files directly
Assume the data publisher is using data package versioning
Assume the CKAN Data Packager minimum viable product is implemented, including:
- a tableschema.json file stored or referenced for each data resource
- everything in a datapackage.zip file uploaded to CKAN is stored in some way and can be downloaded from CKAN as either a:
  - datapackage.zip file
  - datapackage.json (that doesn't include the README.md)
The instructions in the scenarios below may change depending on how the CKAN Data Packager and Data Curator are implemented.

Prior work

See: https://github.com/ckan/ideas-and-roadmap/blob/466418227cf4fca9bcc3b5e65d23b3ef117986b5/specs/datapackages/README.md#resource-schemas

Scenarios

There are a number of different scenarios for creating or updating data packages and publishing or accessing them on CKAN.

Data publishers can:

Create a data package using Data Curator
Publish a data package to CKAN for the first time
Change data in a published data package
Validate data using Data Curator
Add a new data resource to a published data package
Package an existing CKAN dataset and resources
Publish a major change to a published data package

and after the data package is published, data consumers can:

Download the data package as datapackage.json
Download the data package as datapackage.zip

1. Create a data package using Data Curator

To create a data package:

Using Data Curator, open or create data
Add a header row if one doesn't exist
Check Header Row in Tools menu
Guess column properties to set the name, type and format values
Set column properties to describe the data in more detail
Validate the data (fix data errors or explain them in the provenance information)
Set table properties
Set provenance information
Set package properties
Export the data package to a .zip file, containing:
- datapackage.json generated from the column, table and package properties
- README.md generated from the provenance information
- / data directory
  - one or more separated value data files, one for each data tab saved in Data Curator
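
The layout of the exported .zip file can be sketched in a few lines of Python. This is only an illustration of the file structure listed above, not how Data Curator builds the export; the function name `build_datapackage_zip` and the example descriptor values are hypothetical.

```python
import io
import json
import zipfile


def build_datapackage_zip(descriptor, readme_text, data_files):
    """Return the bytes of a .zip laid out like the export described above.

    descriptor  -- dict written as datapackage.json
    readme_text -- provenance information written as README.md
    data_files  -- mapping of file name to CSV text, stored under data/
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("datapackage.json", json.dumps(descriptor, indent=2))
        archive.writestr("README.md", readme_text)
        for file_name, csv_text in data_files.items():
            archive.writestr("data/" + file_name, csv_text)
    return buffer.getvalue()


# Hypothetical example values, for illustration only.
descriptor = {
    "profile": "tabular-data-package",
    "name": "example-package",
    "version": "1.0.0",
    "resources": [
        {
            "profile": "tabular-data-resource",
            "name": "example",
            "path": "data/example.csv",
            "schema": {"fields": [{"name": "id", "type": "integer"}]},
        }
    ],
}
zip_bytes = build_datapackage_zip(
    descriptor, "# Provenance\n", {"example.csv": "id\n1\n2\n"}
)
```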

2. Publish a data package to CKAN for the first time

To publish a data package to CKAN and create a dataset and related resources:

Using Data Curator, create a data package with version number 1.0.0 and export it to a .zip file
Log in to CKAN
Go to the Data page and select Import Data Package and upload the .zip file. The dataset and resources are now published in CKAN as:
- dataset metadata
- resource metadata entries
- resources for:
  - data file(s)
  - associated tableschema.json file(s)
  - a README.md file
Review the new dataset:
- Add any extra metadata to help users discover or understand the data, e.g.
  - Dataset: Tags
  - Resource: Next Review Date, ~~File size~~
- Make the dataset visible to the public

Thoughts

Consider adding support for the following properties in Data Curator:

keyword tags (currently supported by CKAN Data Package Tools.)
- This could result in a mess of tags as there is no lookup in Data Curator to suggest reusing existing tags
- If not implemented, then Data Curator will lose the tag information for data packages downloaded from CKAN. This could be important for scenarios 6 and 7 when the data package is uploaded to CKAN, requiring the tags to be reapplied
file size bytes (planned) (not currently supported by CKAN Data Package Tools.)
- Alternatively, why isn't file size bytes calculated by CKAN?

3. Change data in a published data package

Many data resources are published as complete snapshots of the data, e.g. at the end of a month, that month's data is appended to the end of the existing data.

To correct or add data to an existing data resource in CKAN:

Make the changes to the data and save it to a file. If the CKAN Validation extension isn't installed, validate the data using Data Curator.
Using CKAN, change the data resource by uploading the file. If the CKAN Validation extension is installed, the data will be validated against the associated tableschema.json file
Update CKAN metadata, e.g.
- Dataset: Version (increment the minor version number)
- Resource: Next Review Date, File size
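
Incrementing the version by hand is easy to get wrong, so a small helper may be useful. The sketch below assumes the version follows the MAJOR.MINOR.PATCH pattern used in scenario 2 (1.0.0); the function name `bump_version` is hypothetical and is not part of CKAN or Data Curator.

```python
def bump_version(version, part):
    """Increment one part of a MAJOR.MINOR.PATCH version string.

    Parts to the right of the incremented part are reset to zero.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")
```

A data change as described in this scenario would use `bump_version(current, "minor")`; an incompatible change would use `"major"`.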

4. Validate data using Data Curator

If the CKAN Validation extension isn't installed, before you add or change data in a published data package you may want to validate the data using Data Curator. This will provide two files to be uploaded to CKAN:

a validated separated value file
if needed, an updated README.md explaining any errors

To validate the data using Data Curator:

Download the datapackage.zip from CKAN
Open the datapackage.zip file in Data Curator
Change the data as required
Validate the data (fix data or explain errors in provenance)
- If the data is valid, save it to a separated value file to be uploaded to CKAN
- If the data is invalid and you can correct it, do so and save it to a separated value file to be uploaded to CKAN
- If the data is invalid and you have decided to publish the data with errors:
  - save the data to a separated value file to be uploaded to CKAN
  - update the provenance information (README.md) to explain the errors
  - export the data package to a .zip file
  - unzip the .zip file to access the README.md to be uploaded to CKAN
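
To make the idea of validating data against a table schema concrete, here is a minimal sketch using only the Python standard library. It is not how Data Curator or the CKAN Validation extension validate data; it assumes a Table Schema with a `fields` list, checks only the `integer` and `number` types and the `required` constraint, and the function name `validate_rows` is hypothetical.

```python
import csv
import io

# Casting functions for a few Table Schema field types; any other type is
# accepted without checking in this sketch.
CASTS = {"integer": int, "number": float}


def validate_rows(csv_text, schema):
    """Return a list of (row_number, field_name, message) errors."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for field in schema["fields"]:
            name = field["name"]
            value = row.get(name) or ""
            required = field.get("constraints", {}).get("required", False)
            if value == "":
                if required:
                    errors.append((row_number, name, "required value is missing"))
                continue
            cast = CASTS.get(field.get("type", "string"))
            if cast is not None:
                try:
                    cast(value)
                except ValueError:
                    message = f"{value!r} is not a valid {field['type']}"
                    errors.append((row_number, name, message))
    return errors


# Hypothetical example schema and data.
schema = {
    "fields": [
        {"name": "id", "type": "integer", "constraints": {"required": True}},
        {"name": "amount", "type": "number"},
    ]
}
errors = validate_rows("id,amount\n1,2.5\n,3\nx,abc\n", schema)
```

Any errors that are reported but not fixed are the ones that would need explaining in the provenance information.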

Thoughts

If Data Curator could open a datapackage.json file that references the data and table schemas by URL, then the requirement to provide a datapackage.zip download could be deferred.

The instructions above would still be valid, apart from an additional step if you decide to publish the data with errors. Because README.md is not included in the datapackage.json download, it would need to be downloaded separately and its contents pasted into the provenance information before it could be updated to explain the errors.

5. Add a new data resource to a published data package

Sometimes data is added to a dataset in increments, e.g. at the end of a year, that year's data is added as a new data resource alongside the other yearly data resources.

To add a new data resource to a published data package:

Acquire the new data and save it to a separated value file.
If the CKAN Validation extension isn't installed, validate the data using Data Curator
Using CKAN, add the new data resource:
- upload the data file
- complete the resource metadata
- If the CKAN Validation extension is installed, reference the relevant tableschema.json resource
Update CKAN metadata, e.g.
- Dataset: Version (increment the minor version number)
- Resource: Next Review Date, File size

6. Package an existing CKAN dataset and resources

Using CKAN, download the existing data resource files
Using Data Curator, open the data resource files
Use the CKAN dataset and resource metadata and other available information to create a data package
Explain the changes in the provenance information
Increment the data package major version number
Export the data package to a .zip file.

There is no way to upload the datapackage .zip file and apply it to the existing CKAN dataset. You can either:

Publish the data package to CKAN as a new dataset
Unzip the data package .zip file and manually upload components and update metadata in the existing CKAN entry.

7. Publish a major change to a published data package

A major change to a data package is when you make changes that are incompatible with prior versions, e.g.

Change the table schema
Change field or data package names or data package identifiers
Add, remove or re-order fields

An example could be adding a reference table as a new dataset and creating a foreign key relationship between it and the existing data.
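
One way to decide whether a change to a table schema counts as major is to compare the field lists before and after the change. The sketch below is a simplified check under the assumption that any difference in field names, order or types is incompatible; it ignores constraints, keys and package-level names, and the function name `is_major_change` is hypothetical.

```python
def is_major_change(old_schema, new_schema):
    """Return True if the field names, order or types differ."""

    def signature(schema):
        # Table Schema fields default to type "string" when no type is given.
        return [(f["name"], f.get("type", "string")) for f in schema["fields"]]

    return signature(old_schema) != signature(new_schema)


# Hypothetical example schemas.
old = {"fields": [{"name": "id", "type": "integer"}, {"name": "label"}]}
reordered = {"fields": [{"name": "label"}, {"name": "id", "type": "integer"}]}
extended = {"fields": old["fields"] + [{"name": "year", "type": "year"}]}
```

With these examples, `is_major_change(old, reordered)` and `is_major_change(old, extended)` both return `True`, while comparing `old` with itself returns `False`.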

To publish a major change to a published data package:

Using CKAN, download datapackage.zip
Using Data Curator, open the datapackage.zip and make the changes
Validate the data
Explain the changes in the provenance information
Increment the data package major version number
Export the data package to a .zip file.

There is no way to upload the datapackage .zip file and apply it to the existing CKAN dataset. You can either:

Publish the data package to CKAN as a new dataset with an explanation in the provenance information
Unzip the data package .zip file and manually upload components to the existing CKAN entry.

8. Download the data package as datapackage.json

To download a datapackage.json file:

Using CKAN, go to the dataset page and select Download Data Package.

9. Download the data package as datapackage.zip

To download a datapackage.zip:

Using CKAN, go to the dataset page
to be determined…

Implementation approach

Some new properties need to be included to support tabular data packages.

1. Create valid data package properties

Create valid data package properties for use in create.py. In converter.py, convert the following properties from a data package to a CKAN dataset.

profile mandatory for tabular data packages
licenses (#62)
contributors (#59) maps to author in CKAN
sources (#59) maps to maintainer in CKAN

See the notes below for the metadata that is currently lost when converting between CKAN and data packages.
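
The property mapping listed above could look roughly like the following. This is a sketch of the intent only, not the code in converter.py: the function names are hypothetical, taking the first entry of each array is an assumption (CKAN stores single values where a data package allows arrays), and storing `profile` as a CKAN "extras" entry is also an assumption.

```python
def first_value(items, key):
    """Return `key` from the first dict in `items`, or None if unavailable."""
    if items and isinstance(items[0], dict):
        return items[0].get(key)
    return None


def to_ckan_dataset(descriptor):
    """Map a few data package properties to CKAN dataset fields."""
    dataset = {"name": descriptor["name"]}
    mapped = {
        "author": first_value(descriptor.get("contributors"), "title"),
        "maintainer": first_value(descriptor.get("sources"), "title"),
        "license_id": first_value(descriptor.get("licenses"), "name"),
    }
    # Only set the fields that have a value in the data package.
    dataset.update({k: v for k, v in mapped.items() if v is not None})
    # Assumption: profile has no direct CKAN field, so keep it in "extras".
    if "profile" in descriptor:
        dataset["extras"] = [{"key": "profile", "value": descriptor["profile"]}]
    return dataset


# Hypothetical example descriptor.
example = {
    "profile": "tabular-data-package",
    "name": "example-package",
    "contributors": [{"title": "Jane Citizen", "role": "author"}],
    "sources": [{"title": "Example Agency"}],
    "licenses": [{"name": "CC-BY-4.0"}],
}
ckan_dataset = to_ckan_dataset(example)
```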

2. Create valid resource properties

Create valid data resource properties for use in create.py. In converter.py, convert the following properties from the data resources to CKAN resources.

schema mandatory for tabular data resources
profile mandatory for tabular data resources
dialect mandatory for tabular data resources, if it differs from specification defaults
encoding mandatory for tabular data resources, if it differs from specification default

See the notes below for the metadata that is currently lost when converting between data resources and CKAN.
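
The "only if it differs from the defaults" rule for dialect and encoding can be sketched as below. The defaults shown are limited to the ones named in the notes in this document (delimiter, line terminator, UTF-8), written with the CSV Dialect property name `lineTerminator`; the function name `resource_properties` is hypothetical and this is not the code in create.py.

```python
DEFAULT_ENCODING = "utf-8"
# Only the defaults named in this document; the CSV Dialect specification
# defines more.
DEFAULT_DIALECT = {"delimiter": ",", "lineTerminator": "\r\n"}


def resource_properties(name, path, schema, dialect=None, encoding="utf-8"):
    """Build tabular data resource properties, omitting default values."""
    resource = {
        "profile": "tabular-data-resource",
        "name": name,
        "path": path,
        "schema": schema,
    }
    # Keep dialect settings that differ from the defaults above; settings
    # without a listed default are always kept.
    changed = {
        key: value
        for key, value in (dialect or {}).items()
        if DEFAULT_DIALECT.get(key) != value
    }
    if changed:
        resource["dialect"] = changed
    if encoding.lower() != DEFAULT_ENCODING:
        resource["encoding"] = encoding
    return resource


# Hypothetical example values.
fields = {"fields": [{"name": "id", "type": "integer"}]}
plain = resource_properties("a", "data/a.csv", fields)
custom = resource_properties(
    "b",
    "data/b.csv",
    fields,
    dialect={"delimiter": ";", "lineTerminator": "\r\n"},
    encoding="windows-1252",
)
```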

3. Create a dataset and resources in CKAN

In create.py:

Store a schema property for each data resource. This would be a tableschema.json file for a Tabular Data Resource (#61)
Associate the schema with a data resource.
- Tabular Data Resources must have a schema that follows the Table Schema specification
- The association is needed between the data and the table schema to generate the datapackage.json and support the CKAN Validation extension
If needed, store the dialect for each data resource.
If needed, store the encoding for each data resource.

See:

CKAN Validation Overview and issue #9

4. Convert the CKAN dataset to a data package

Convert the CKAN dataset to a data package using converter.py dataset_to_datapackage

5. Convert the CKAN resources to data resources

Convert the CKAN resources to data resources using converter.py _convert_to_datapackage_resource

6. Generate `datapackage.json` for download

Generate a minimal, valid datapackage.json for download

Add the profile to the data package
For each resource add the associated:
- schema
  - use "schema": "URL" to point to the schema in CKAN (#49) (noting this discussion), or
  - embed the tableschema.json within the data package as an object.
- profile, dialect and encoding

README.md won't be included
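
The two options for attaching a schema (by URL or embedded as an object) can be illustrated as follows. This is a sketch of the resulting descriptor shape, not the extension's implementation; the function names and the `ckan.example.org` URLs are hypothetical.

```python
import json


def add_schema(resource, schema=None, schema_url=None):
    """Return a copy of `resource` with its schema referenced or embedded."""
    if (schema is None) == (schema_url is None):
        raise ValueError("provide exactly one of schema or schema_url")
    result = dict(resource)
    result["schema"] = schema_url if schema_url is not None else schema
    return result


def minimal_datapackage(name, resources):
    """Return a minimal tabular data package descriptor."""
    return {
        "profile": "tabular-data-package",
        "name": name,
        "resources": resources,
    }


# Hypothetical example values.
resource = {
    "profile": "tabular-data-resource",
    "name": "example",
    "path": "https://ckan.example.org/example.csv",
}
by_url = add_schema(resource, schema_url="https://ckan.example.org/tableschema.json")
embedded = add_schema(
    resource, schema={"fields": [{"name": "id", "type": "integer"}]}
)
descriptor_json = json.dumps(minimal_datapackage("example-package", [by_url]), indent=2)
```

Referencing the schema by URL keeps the descriptor small, while embedding it makes the datapackage.json self-contained.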

7. Store `README.md`

Store README.md (#60) using create.py

8. Generate `datapackage.zip` for download

Generate a full datapackage.zip for download (#52).

This should match the datapackage.zip used to upload the data package to CKAN (less any properties not yet implemented e.g. image).

9. Store data resources in CKAN Data Store

Store data resources in the CKAN Data Store (#44)

Notes

Converting datasets

Converting a data package to a CKAN dataset

The following properties are converted by CKAN Data Package Tools and the CKAN Data Packager extension (ignoring the issues mentioned above)

name
title
description
version
licenses (CKAN has a single value for a license but a data package supports an array of licenses)
sources
contributors (author role)
keywords

Other properties in the data package are converted to CKAN "extras" properties

Properties in the specification that are not directly converted:

profile (e.g. "tabular-data-package")
id
homepage
image
created

In the CKAN Data Packager extension, name is limited to 2-100 characters. Consider adding this validation to Data Curator (planned).
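
A length check of this kind is small enough to sketch directly. The example below covers only the 2-100 character limit stated above; any character restrictions CKAN applies to names are out of scope, and the function name `check_name_length` is hypothetical.

```python
def check_name_length(name, minimum=2, maximum=100):
    """Return an error message, or None if the name length is acceptable."""
    if len(name) < minimum:
        return f"name must be at least {minimum} characters long"
    if len(name) > maximum:
        return f"name must be at most {maximum} characters long"
    return None
```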

Converting a CKAN dataset to a data package

The following properties are converted by CKAN Data Package Tools and the CKAN Data Packager extension (ignoring the issues mentioned above)

name
title
description
homepage
version
licenses
sources
contributors (author role)
keywords

Other properties in CKAN are parsed into "extras" properties

Properties in the specification that are not directly converted:

id
profile
image
created

Converting resources

Converting a data resource in a data package to a CKAN resource

The following properties are converted by CKAN Data Package Tools and the CKAN Data Packager extension (ignoring the issues mentioned above)

path or data
name
title
description
format (e.g. "csv")
hash

Properties in the specification that are not directly converted:

profile (e.g. "tabular-data-resource")
schema (Table Schema for a Tabular Data Resource or another schema for other data resource types)
dialect (CSV Dialect for a Tabular Data Resource. Defaults include "lineTerminator": "\r\n" and "delimiter": ",")
encoding (e.g. default "UTF-8")
mediatype (e.g. "text/csv")
bytes
sources
licenses (CKAN doesn't store licenses at the resource level, they inherit from the dataset)

Would it help if a CKAN Schema was defined to support all data package metadata?

Converting a CKAN resource to a data resource in a data package

The following properties are converted by CKAN Data Package Tools and the CKAN Data Packager extension (ignoring the issues mentioned above)

name
path
title
description
format (e.g. "csv")
hash
schema (assumed to be the CKAN schema containing custom CKAN metadata properties, i.e. not a Table Schema)

Properties in the specification that are not directly converted:

profile (e.g. "tabular-data-resource")
mediatype (e.g. "text/csv")
encoding (e.g. "UTF-8")
bytes
sources
licenses (CKAN doesn't store licenses at the resource level, they inherit from the dataset)
schema (Table Schema for a Tabular Data Resource or another schema for other data resource types)
dialect (CSV Dialect for a Tabular Data Resource)
