Using Data Curator to publish data packages to CKAN

Purpose

These instructions are for a data publisher using:

These instructions are being written to inform the development of:

Assumptions and constraints

Assumptions and constraints that influence the instructions for each scenario:

  • Avoid the user editing the datapackage.json or tableschema.json files directly
  • Assume the data publisher is using data package versioning
  • Assume the CKAN Data Packager minimum viable product is implemented, including:
    • a tableschema.json file stored or referenced for each data resource
    • everything in a datapackage.zip file uploaded to CKAN is stored in some way and can be downloaded from CKAN as either a:
      • datapackage.zip file
      • datapackage.json (that doesn't include the README.md)
  • The instructions in the scenarios below may change depending on how the CKAN Data Packager and Data Curator are implemented.

Prior work

See: https://github.com/ckan/ideas-and-roadmap/blob/466418227cf4fca9bcc3b5e65d23b3ef117986b5/specs/datapackages/README.md#resource-schemas

Scenarios

There are a number of different scenarios for creating or updating data packages and publishing or accessing them on CKAN.

Data publishers can:

  1. Create a data package using Data Curator
  2. Publish a data package to CKAN for the first time
  3. Change data in a published data package
  4. Validate data using Data Curator
  5. Add a new data resource to a published data package
  6. Package an existing CKAN dataset and resources
  7. Publish a major change to a published data package

and after the data package is published, data consumers can:

  8. Download the data package as datapackage.json
  9. Download the data package as datapackage.zip

1. Create a data package using Data Curator

To create a data package:

  1. Using Data Curator, open or create data
  2. Add a header row if one doesn't exist
  3. Check Header Row in Tools menu
  4. Guess column properties to set the name, type and format values
  5. Set column properties to describe the data in more detail
  6. Validate the data (fix data errors or explain them in the provenance information)
  7. Set table properties
  8. Set provenance information
  9. Set package properties
  10. Export the data package to a .zip file, containing:
    • datapackage.json generated from the column, table and package properties
    • README.md generated from the provenance information
    • /data directory
      • one or more separated value data files, one for each data tab saved in Data Curator
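The layout of the exported .zip file can be sketched with Python's standard library. This only illustrates the structure listed above and is not how Data Curator builds the file; the function name `build_datapackage_zip`, the sample descriptor and the file names are assumptions made for the example.

```python
import io
import json
import zipfile


def build_datapackage_zip(descriptor, readme_text, data_files):
    """Return the bytes of a .zip laid out like the export described above.

    descriptor  -- dict that becomes datapackage.json
    readme_text -- provenance information that becomes README.md
    data_files  -- mapping of file name to CSV text, stored under data/
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("datapackage.json", json.dumps(descriptor, indent=2))
        archive.writestr("README.md", readme_text)
        for name, text in data_files.items():
            archive.writestr("data/" + name, text)
    return buffer.getvalue()


# Hypothetical package with a single data tab.
example_descriptor = {
    "profile": "tabular-data-package",
    "name": "example-package",
    "version": "1.0.0",
    "resources": [
        {
            "profile": "tabular-data-resource",
            "name": "observations",
            "path": "data/observations.csv",
            "schema": {"fields": [{"name": "id", "type": "integer"}]},
        }
    ],
}
example_zip = build_datapackage_zip(
    example_descriptor,
    "# Provenance\n\nNo known errors.\n",
    {"observations.csv": "id\n1\n2\n"},
)
```

Building the archive in memory keeps the sketch free of side effects; writing `example_zip` to disk would give a file with the same three entries.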

2. Publish a data package to CKAN for the first time

To publish a data package to CKAN and create a dataset and related resources:

  1. Using Data Curator, create a data package with version number 1.0.0 and export it to a .zip file
  2. Log in to CKAN
  3. Go to the Data page, select Import Data Package and upload the .zip file. The dataset and resources are now published in CKAN as:
    • dataset metadata
    • resource metadata entries
    • resources for:
      • data file(s)
      • associated tableschema.json file(s)
      • a README.md file
  4. Review the new dataset:
    • Add any extra metadata to help users discover or understand the data, e.g.
      • Dataset: Tags
      • Resource: Next Review Date, File size
    • Make the dataset visible to the public

Thoughts

Consider adding support for the following properties in Data Curator:

  • keyword tags (currently supported by CKAN Data Package Tools.)
    • This could result in a mess of tags as there is no lookup in Data Curator to suggest reusing existing tags
    • If not implemented, then Data Curator will lose the tag information for data packages downloaded from CKAN. This could be important for scenarios 6 and 7 when the data package is uploaded to CKAN, requiring the tags to be reapplied
  • file size bytes (planned) (not currently supported by CKAN Data Package Tools.)
    • Alternatively, why isn't file size bytes calculated by CKAN?

3. Change data in a published data package

Many data resources are published as complete snapshots of the data. For example, at the end of a month, that month's data is appended to the end of the existing data and the complete file is published again.

To correct or add data to an existing data resource in CKAN:

  1. Make the changes to the data and save it to a file. If the CKAN Validation extension isn't installed, validate the data using Data Curator.
  2. Using CKAN, change the data resource by uploading the file. If the CKAN Validation extension is installed, the data will be validated against the associated tableschema.json file
  3. Update CKAN metadata, e.g.
    • Dataset: Version (increment the minor version number)
    • Resource: Next Review Date, File size
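The version increment can be sketched as a small helper. It assumes the version is a three-part MAJOR.MINOR.PATCH string such as the 1.0.0 used for first publication, and that lower parts reset to zero when a higher part is incremented; the function name `bump_version` is hypothetical.

```python
def bump_version(version, part):
    """Increment one part of a MAJOR.MINOR.PATCH version string.

    Assumption: lower parts reset to 0 when a higher part is incremented.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


# A data change like the one in this scenario increments the minor number.
after_data_change = bump_version("1.0.0", "minor")
```

The same helper with `"major"` would cover the incompatible changes described in scenarios 6 and 7.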

4. Validate data using Data Curator

If the CKAN Validation extension isn't installed, before you add or change data in a published data package you may want to validate the data using Data Curator. This will provide two files to be uploaded to CKAN:

  • a validated separated value file
  • if needed, an updated README.md explaining any errors

To validate the data using Data Curator:

  1. Download the datapackage.zip from CKAN
  2. Open the datapackage.zip file in Data Curator
  3. Change the data as required
  4. Validate the data (fix data or explain errors in provenance)
    • If the data is valid, save it to a separated value file to be uploaded to CKAN
    • If the data is invalid and you can correct it, do so and save it to a separated value file to be uploaded to CKAN
    • If the data is invalid and you have decided to publish the data with errors:
      • save the data to a separated value file to be uploaded to CKAN
      • update the provenance information (README.md), explaining the errors
      • export the data package to a .zip file
      • unzip the .zip file to access the README.md to be uploaded to CKAN
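To make the idea of validating data against a table schema concrete, here is a deliberately small sketch using only the standard library. It is not Data Curator's or the CKAN Validation extension's validator: the function name `check_rows` is hypothetical, and it only checks the `required` constraint and the `integer`, `number` and `string` types.

```python
import csv
import io


def check_rows(csv_text, schema):
    """Return a list of (row_number, field_name, message) problems."""
    casts = {"integer": int, "number": float, "string": str}
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for field in schema["fields"]:
            name = field["name"]
            field_type = field.get("type", "string")
            value = row.get(name) or ""
            required = field.get("constraints", {}).get("required", False)
            if value == "":
                if required:
                    problems.append((row_number, name, "missing required value"))
                continue
            try:
                # Types this sketch does not know about are treated as strings.
                casts.get(field_type, str)(value)
            except ValueError:
                problems.append((row_number, name, f"not a valid {field_type}"))
    return problems


# Hypothetical schema and data with two deliberate errors.
example_schema = {
    "fields": [
        {"name": "id", "type": "integer", "constraints": {"required": True}},
        {"name": "rainfall", "type": "number"},
    ]
}
example_problems = check_rows("id,rainfall\n1,2.5\n,3.0\n3,wet\n", example_schema)
```

Each reported problem is either fixed in the data or explained in the provenance information, matching the choices listed above.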

Thoughts

If Data Curator could open a datapackage.json file that references the data and table schemas by URL, then the requirement to provide a datapackage.zip download could be deferred.

The instructions above would still be valid apart from an additional step if you decide to publish the data with errors. As the original README.md is not downloaded, it would need to be downloaded and its contents pasted into the provenance information before it could be updated explaining the errors.

5. Add a new data resource to a published data package

Sometimes data is added to a dataset in increments, e.g. at the end of a year, that year's data is added as a new data resource alongside the other yearly data resources.

To add a new data resource to a published data package:

  1. Acquire the new data and save it to a separated value file.
  2. If the CKAN Validation extension isn't installed, validate the data using Data Curator
  3. Using CKAN, add the new data resource:
    • upload the data file
    • complete the resource metadata
    • If the CKAN Validation extension is installed, reference the relevant tableschema.json resource
  4. Update CKAN metadata, e.g.
    • Dataset: Version (increment the minor version number)
    • Resource: Next Review Date, File size

6. Package an existing CKAN dataset and resources

  1. Using CKAN, download the existing data resource files
  2. Using Data Curator, open the data resource files
  3. Use the CKAN dataset and resource metadata and other available information to create a data package
  4. Explain the changes in the provenance information
  5. Increment the data package major version number
  6. Export the data package to a .zip file.

There is no way to upload the datapackage .zip file and apply it to the existing CKAN dataset. You can either:

  • Publish the data package to CKAN as a new dataset
  • Unzip the data package .zip file and manually upload components and update metadata in the existing CKAN entry.
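For the manual route, it can help to list what the exported .zip contains before uploading anything. The sketch below assumes the layout described in scenario 1 (datapackage.json, README.md and a data directory); the function name `list_upload_components` and the sample package are hypothetical.

```python
import io
import json
import zipfile


def list_upload_components(zip_bytes):
    """Summarise what is inside a data package .zip for manual upload."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        descriptor = json.loads(archive.read("datapackage.json"))
    return {
        "version": descriptor.get("version"),
        "data_files": [name for name in names if name.startswith("data/")],
        "readme": "README.md" if "README.md" in names else None,
    }


# Build a tiny in-memory package so the example runs without files on disk.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _archive:
    _archive.writestr(
        "datapackage.json", json.dumps({"name": "example", "version": "2.0.0"})
    )
    _archive.writestr("README.md", "Explains the changes.\n")
    _archive.writestr("data/observations.csv", "id\n1\n")

components = list_upload_components(_buffer.getvalue())
```

The summary gives a checklist: the data files and README.md to upload, and the version to enter in the existing CKAN entry.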

7. Publish a major change to a published data package

A major change to a data package is when you make changes that are incompatible with prior versions, e.g.

  • Change the table schema
  • Change field or data package names or data package identifiers
  • Add, remove or re-order fields

An example could be adding a reference table as a new dataset and creating a foreign key relationship between it and the existing data.

To publish a major change to a published data package:

  1. Using CKAN, download datapackage.zip
  2. Using Data Curator, open the datapackage.zip and make the changes
  3. Validate the data
  4. Explain the changes in the provenance information
  5. Increment the data package major version number
  6. Export the data package to a .zip file.

There is no way to upload the datapackage .zip file and apply it to the existing CKAN dataset. You can either:

  • Publish the data package to CKAN as a new dataset with an explanation in the provenance information
  • Unzip the data package .zip file and manually upload components to the existing CKAN entry.

8. Download the data package as datapackage.json

To download a datapackage.json file:

  1. Using CKAN, go to the dataset page and select Download Data Package.

9. Download the data package as datapackage.zip

To download a datapackage.zip:

  1. Using CKAN, go to the dataset page
  2. to be determined

Implementation approach

Some new properties need to be included to support tabular data packages.

1. Create valid data package properties

Create valid data package properties for use in create.py. In converter.py, convert the following properties from a data package to a CKAN dataset.

  1. profile mandatory for tabular data packages
  2. licenses (#62)
  3. contributors (#59) maps to author in CKAN
  4. sources (#59) maps to maintainer in CKAN

See the notes below for the metadata that is currently lost when converting between CKAN and data packages.
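The mapping listed above could look roughly like the following. This is a sketch, not the code in converter.py: the function name `package_properties_to_ckan` is hypothetical, keeping only the first licence and the first source is an assumption, and storing `profile` as a CKAN "extras" entry is one possible choice rather than a decision recorded here.

```python
def package_properties_to_ckan(descriptor):
    """Map a few data package properties onto a CKAN dataset dict."""
    dataset = {"extras": []}

    # Assumption: profile is kept as an "extras" key/value pair.
    if "profile" in descriptor:
        dataset["extras"].append({"key": "profile", "value": descriptor["profile"]})

    # CKAN holds a single licence, so only the first entry is kept.
    licenses = descriptor.get("licenses") or []
    if licenses:
        dataset["license_id"] = licenses[0].get("name")

    # The first contributor with the author role maps to the CKAN author.
    for contributor in descriptor.get("contributors") or []:
        if contributor.get("role") == "author":
            dataset["author"] = contributor.get("title")
            dataset["author_email"] = contributor.get("email")
            break

    # The first source maps to the CKAN maintainer.
    sources = descriptor.get("sources") or []
    if sources:
        dataset["maintainer"] = sources[0].get("title")
        dataset["maintainer_email"] = sources[0].get("email")
    return dataset


# Hypothetical descriptor with two licences to show what is dropped.
example_dataset = package_properties_to_ckan(
    {
        "profile": "tabular-data-package",
        "licenses": [{"name": "CC-BY-4.0"}, {"name": "ODC-PDDL-1.0"}],
        "contributors": [
            {"title": "Ann Author", "role": "author", "email": "ann@example.org"}
        ],
        "sources": [{"title": "Example Agency"}],
    }
)
```

Because CKAN stores one licence per dataset, the second licence in the example is not carried across; a real implementation would need to decide where the remainder goes.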

2. Create valid resource properties

Create valid data resource properties for use in create.py. In converter.py, convert the following properties from the data resources to CKAN resources.

  1. schema mandatory for tabular data resources
  2. profile mandatory for tabular data resources
  3. dialect mandatory for tabular data resources, if it differs from specification defaults
  4. encoding mandatory for tabular data resources, if it differs from specification default

See the notes below for the metadata that is currently lost when converting between data resources and CKAN.

3. Create a dataset and resources in CKAN

In create.py:

  1. Store a schema property for each data resource. This would be a tableschema.json file for a Tabular Data Resource (#61)
  2. Associate the schema with a data resource.
    • Tabular Data Resources must have a schema that follows the Table Schema specification
    • The association is needed between the data and the table schema to generate the datapackage.json and support the CKAN Validation extension
  3. If needed, store the dialect for each data resource.
  4. If needed, store the encoding for each data resource.

See:

4. Convert the CKAN dataset to a data package

Convert the CKAN dataset to a data package using converter.py dataset_to_datapackage

5. Convert the CKAN resources to data resources

Convert the CKAN resources to data resources using converter.py _convert_to_datapackage_resource

6. Generate datapackage.json for download

Generate a minimal, valid datapackage.json for download

  • Add the profile to the data package
  • For each resource add the associated:
    • schema
      • use "schema": "URL" to point to the schema in CKAN (#49) (noting this discussion), or
      • embed the tableschema.json within the data package as an object.
    • profile, dialect and encoding

README.md won't be included
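A minimal descriptor with the two schema options (by URL or embedded) might be assembled like this. The input shape, the example.org URLs and the function name `build_minimal_descriptor` are assumptions for illustration; they are not the structures used by CKAN or converter.py.

```python
def build_minimal_descriptor(dataset, resources, embed_schema=False):
    """Assemble a minimal datapackage.json as a dict.

    embed_schema=False points "schema" at a URL;
    embed_schema=True embeds the table schema as an object.
    """
    descriptor = {
        "profile": "tabular-data-package",
        "name": dataset["name"],
        "resources": [],
    }
    for resource in resources:
        entry = {
            "profile": "tabular-data-resource",
            "name": resource["name"],
            "path": resource["url"],
            "schema": resource["schema"] if embed_schema else resource["schema_url"],
        }
        # dialect and encoding are only added when they were stored.
        for optional in ("dialect", "encoding"):
            if optional in resource:
                entry[optional] = resource[optional]
        descriptor["resources"].append(entry)
    return descriptor


# Hypothetical stored resource with both a schema object and its URL.
_resources = [
    {
        "name": "observations",
        "url": "https://example.org/dataset/observations.csv",
        "schema": {"fields": [{"name": "id", "type": "integer"}]},
        "schema_url": "https://example.org/dataset/tableschema.json",
        "encoding": "windows-1252",
    }
]
by_url = build_minimal_descriptor({"name": "example-package"}, _resources)
embedded = build_minimal_descriptor(
    {"name": "example-package"}, _resources, embed_schema=True
)
```

Referencing the schema by URL keeps the download small, while embedding it makes the datapackage.json self-contained.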

7. Store README.md

Store README.md (#60) using create.py

8. Generate datapackage.zip for download

Generate a full datapackage.zip for download (#52).

This should match the datapackage.zip used to upload the data package to CKAN (less any properties not yet implemented e.g. image).

9. Store data resources in CKAN Data Store

Store data resources in the CKAN Data Store (#44)


Notes

Converting datasets

Converting a data package to a CKAN dataset

The following properties are converted by CKAN Data Package Tools and the CKAN Data Packager extension (ignoring the issues mentioned above)

  • name
  • title
  • description
  • version
  • licenses (CKAN has a single value for a license but a data package supports an array of licenses)
  • sources
  • contributor (author role)
  • keywords

Other properties in the data package are converted to CKAN "extras" properties

Properties in the specification that are not directly converted:

  • profile (e.g. "tabular-data-package")
  • id
  • homepage
  • image
  • created

In the CKAN Data Packager extension, name is limited to 2-100 characters. Consider adding this validation to Data Curator (planned).
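The length rule is simple to express; a sketch of the check follows. Only the 2-100 character limit stated above is tested, and the function name `name_length_error` is hypothetical.

```python
def name_length_error(name):
    """Return an error message if the name breaks the 2-100 character limit."""
    if len(name) < 2:
        return "name must be at least 2 characters long"
    if len(name) > 100:
        return "name must be at most 100 characters long"
    return None


too_short = name_length_error("a")
acceptable = name_length_error("rainfall-observations")
```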

Converting a CKAN dataset to a data package

The following properties are converted by CKAN Data Package Tools and the CKAN Data Packager extension (ignoring the issues mentioned above)

  • name
  • title
  • description
  • homepage
  • version
  • licenses
  • sources
  • contributor (author role)
  • keywords

Other properties in CKAN are parsed into "extras" properties

Properties in the specification that are not directly converted:

  • id
  • profile
  • image
  • created

Converting resources

Converting a data resource in a data package to a CKAN resource

The following properties are converted by CKAN Data Package Tools and the CKAN Data Packager extension (ignoring the issues mentioned above)

  • path or data
  • name
  • title
  • description
  • format (e.g. "csv")
  • hash

Properties in the specification that are not directly converted:

  • profile (e.g. "tabular-data-resource")
  • schema (Table Schema for a Tabular Data Resource or another schema for other data resource types)
  • dialect (CSV Dialect for a Tabular Data Resource. Defaults: "lineTerminator": "\r\n", "delimiter": ",")
  • encoding (e.g. default "UTF-8")
  • mediatype (e.g. "text/csv")
  • bytes
  • sources
  • licenses (CKAN doesn't store licenses at the resource level, they inherit from the dataset)

Would it help if a CKAN Schema was defined to support all data package metadata?
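To see what a given data resource would lose, its properties can be split using the two lists above. This is a sketch that only mirrors those lists; the function name `split_resource_properties` is hypothetical and it does not call CKAN Data Package Tools.

```python
# Properties listed above as converted to a CKAN resource.
CONVERTED = ("path", "data", "name", "title", "description", "format", "hash")


def split_resource_properties(resource):
    """Split a data resource into converted and not directly converted parts."""
    converted = {k: v for k, v in resource.items() if k in CONVERTED}
    not_converted = {k: v for k, v in resource.items() if k not in CONVERTED}
    return converted, not_converted


# Hypothetical tabular data resource.
kept, lost = split_resource_properties(
    {
        "profile": "tabular-data-resource",
        "name": "observations",
        "path": "data/observations.csv",
        "format": "csv",
        "mediatype": "text/csv",
        "encoding": "UTF-8",
        "bytes": 9,
    }
)
```

The second dictionary is the metadata that would need another home in CKAN, for example in a CKAN schema as the question above suggests.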

Converting a CKAN resource to a data resource in a data package

The following properties are converted by CKAN Data Package Tools and the CKAN Data Packager extension (ignoring the issues mentioned above)

  • name
  • path
  • title
  • description
  • format (e.g. "csv")
  • hash
  • "schema" (assume this is the CKAN schema containing custom CKAN metadata properties, i.e. not a Table Schema)

Properties in the specification that are not directly converted:

  • profile (e.g. "tabular-data-resource")
  • mediatype (e.g. "text/csv")
  • encoding (e.g. "UTF-8")
  • bytes
  • sources
  • licenses (CKAN doesn't store licenses at the resource level, they inherit from the dataset)
  • schema (Table Schema for a Tabular Data Resource or another schema for other data resource types)
  • dialect (CSV Dialect for a Tabular Data Resource)