
CKAN + Triplestore

Introduction

A triplestore is a database that is used to store and retrieve data in the form of triples. A triple consists of three parts: a subject, a predicate, and an object.

In the Open Data context, a triplestore is a storage engine for RDF (Resource Description Framework) data, which is modelled as directed labeled graphs.

Triplestores can be queried using SPARQL (SPARQL Protocol and RDF Query Language), a query language specifically designed for querying RDF data. SPARQL allows developers to specify complex queries that span multiple triples, enabling powerful data retrieval and analysis capabilities.
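To make the idea of triples and multi-triple queries concrete, here is a minimal sketch in plain Python. It is not a SPARQL engine: the `Triple` type, the `match` and `dataset_titles` helpers, and the `example.org` URIs are all invented for illustration. The two nested patterns roughly correspond to the SPARQL graph pattern `?d rdf:type dcat:Dataset . ?d dct:title ?title .`

```python
from typing import Iterable, Iterator, NamedTuple, Optional

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"
DCT_TITLE = "http://purl.org/dc/terms/title"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical sample data; the example.org URIs are made up.
TRIPLES = [
    Triple("http://example.org/dataset/1", RDF_TYPE, DCAT_DATASET),
    Triple("http://example.org/dataset/1", DCT_TITLE, "Air quality 2023"),
    Triple("http://example.org/catalog", "http://www.w3.org/ns/dcat#dataset",
           "http://example.org/dataset/1"),
]


def match(triples: Iterable[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching a pattern; None acts as a wildcard."""
    for t in triples:
        if ((s is None or t.subject == s)
                and (p is None or t.predicate == p)
                and (o is None or t.obj == o)):
            yield t


def dataset_titles(triples: Iterable[Triple]) -> dict:
    """Join two patterns on a shared subject: every dcat:Dataset and its title."""
    triples = list(triples)
    titles = {}
    for typed in match(triples, p=RDF_TYPE, o=DCAT_DATASET):
        for titled in match(triples, s=typed.subject, p=DCT_TITLE):
            titles[typed.subject] = titled.obj
    return titles
```

A real triplestore would evaluate such a pattern itself behind a SPARQL endpoint; the sketch only shows why a query naturally spans several triples.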

We are looking for general compatibility with RDF graphs in order to support DCAT (Data Catalog Vocabulary) and its various regional extensions such as DCAT-AP.

Problem

CKAN ships with a default metadata schema so that users can get started quickly and record the necessary information about their datasets. Users can also customize the metadata schema; the recommended approach is to use the https://github.com/ckan/ckanext-scheming extension. On the other hand, a number of metadata standards have been defined by different organizations and widely adopted:

DCAT has been adopted by a number of governments and other organizations including:

Some governments are also creating their own extensions of DCAT:

CKAN has a popular extension to support DCAT - https://github.com/ckan/ckanext-dcat. However, it does not support all features of DCAT: metadata is not fully preserved when RDF is transformed and loaded into PostgreSQL (CKAN's storage for metadata). In other words, when a user later exports the metadata from CKAN, some of the original information is lost. For example, as mentioned by Open Data Swiss, it doesn't provide a representation for the CatalogRecord and DataService classes, and it has no relationship between Catalog and CatalogRecord. There might be other items that haven't been considered yet.

Challenges

Going all-in on DCAT support by replacing PostgreSQL with a triplestore engine would raise two major issues:

  • Uncertainty. We don't know how many CKAN users need it. We are only aware of some European governments looking for this feature. Given the popularity of the CKAN project, we can assume that the majority of users would prefer to keep the current architecture with PostgreSQL for metadata storage.
  • Complexity. Migrating from the current SQL database to a triplestore engine might be a technically challenging piece of work.

In this note, we try to identify a solution that overcomes these challenges: one that requires minimal modifications to the core code and is technically simple (easier, cleaner, decoupled).

Architecture

Decoupling the triplestore (or RDF storage) from CKAN could be a solution to support DCAT. For example, we could have a completely separate triplestore where data publishers can upload their RDF without any interaction with CKAN. This would mean:

  • Having an ETL system that pulls RDF from the triplestore, transforms the metadata, and loads it into CKAN.
  • The original RDF would be preserved in the triplestore and could then be loaded into a destination (European data catalog or other RDF stores).
  • Having the triplestore also means we can expose its API to users.

```mermaid
graph TD

subgraph .
  CKAN -.API.- Frontend
  Frontend -.Find API.- User
end

RDF -.0. Upload.-> Triplestore
ETL -.1. Harvest.- Triplestore
ETL -.2. Load.-> CKAN

subgraph dest
  European-Data-Catalog
  Other-RDF-Catalogs
end

ETL -.3. Load.-> European-Data-Catalog
```
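
The harvest and load steps (1 and 2) in the diagram could look roughly like the sketch below. Everything here is an assumption made for illustration: a list of `(subject, predicate, object)` tuples stands in for the triplestore, a plain dict stands in for CKAN, and `FIELD_MAP`, `harvest`, `transform`, `load` and `run_etl` are hypothetical names. In a real deployment the load step would presumably go through CKAN's action API (for example `package_create`) instead of writing to a dict.

```python
import re

DCT = "http://purl.org/dc/terms/"
DCAT = "http://www.w3.org/ns/dcat#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumed mapping from RDF predicates to CKAN dataset fields.
FIELD_MAP = {DCT + "title": "title", DCT + "description": "notes"}


def harvest(triplestore):
    """Step 1: collect the (predicate, object) pairs of every dcat:Dataset."""
    subjects = {s for s, p, o in triplestore
                if p == RDF_TYPE and o == DCAT + "Dataset"}
    grouped = {s: [] for s in subjects}
    for s, p, o in triplestore:
        if s in grouped:
            grouped[s].append((p, o))
    return grouped


def slugify(text):
    """Turn a title into a URL-friendly dataset name."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def transform(uri, properties):
    """Map the RDF properties we understand onto a CKAN-style dataset dict."""
    dataset = {"extras": [{"key": "uri", "value": uri}]}
    for predicate, value in properties:
        field = FIELD_MAP.get(predicate)
        if field:
            dataset[field] = value
    dataset["name"] = slugify(dataset.get("title", uri))
    return dataset


def load(ckan, dataset):
    """Step 2: store the dataset; a dict stands in for the CKAN API here."""
    ckan[dataset["name"]] = dataset


def run_etl(triplestore, ckan):
    """Harvest, transform and load without modifying the triplestore."""
    for uri, properties in harvest(triplestore).items():
        load(ckan, transform(uri, properties))
```

Only the mapped fields reach CKAN. Triples the mapping does not understand (for example a CatalogRecord) stay untouched in the triplestore, which is the point of keeping the two systems decoupled.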

Discussion

The CKAN 3.0 team has initiated a discussion thread on GitHub to gather opinions from the community and understand their needs:

https://github.com/ckan/ckan/issues/7489

Please feel free to jump in if you have any thoughts on this matter.

Live example

Open Data Web in Taiwan has already taken a similar approach to provide triplestore support:

We could use this project as a proof of concept.

There might be various ways to achieve triplestore functionality including:

  • Native triplestores.
  • Triplestore plugin on top of SQL database.
  • NoSQL Graph Databases.

Options:

PostgreSQL:

Random ideas to check:

Awesome list for semantic data - https://github.com/semantalytics/awesome-semantic-web

Summary

Governments in Europe are looking for full support of DCAT-based metadata because they use DCAT-AP for conformance with the European data catalog, which aggregates all open data in the EU. Some countries are also creating their own extensions of DCAT-AP to meet their regional requirements.

To provide complete DCAT conformance, CKAN needs to integrate with an RDF store. Our proposal is to use a decoupled approach so that CKAN core stays unaffected, which is important for users who don't need triplestore integration.


OLD notes: General solution for datastore

Current backend: postgres

Potential solution: have any kind of backend

We add an interface class that contains the general backend logic for database behaviour in the datastore (CRUD).

Then we extend it with the existing postgres solution and use this "interface" class everywhere, except in the one place where we instantiate it with the backend defined in the configuration.
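
A minimal sketch of this idea, assuming Python's `abc` module for the interface. The names `BackendInterface`, `InMemoryBackend`, `BACKENDS`, `get_backend` and the `datastore.backend` config key are hypothetical and do not describe CKAN's actual API; the in-memory class only stands in for the existing postgres implementation.

```python
from abc import ABC, abstractmethod


class BackendInterface(ABC):
    """General CRUD contract that every datastore backend must fulfil."""

    @abstractmethod
    def create(self, resource_id, records): ...

    @abstractmethod
    def read(self, resource_id, filters=None): ...

    @abstractmethod
    def update(self, resource_id, filters, changes): ...

    @abstractmethod
    def delete(self, resource_id): ...


class InMemoryBackend(BackendInterface):
    """Stand-in for the postgres backend, kept in memory for illustration."""

    def __init__(self):
        self._tables = {}

    def create(self, resource_id, records):
        # Copy each record so callers cannot mutate stored rows by accident.
        rows = self._tables.setdefault(resource_id, [])
        rows.extend(dict(r) for r in records)

    def read(self, resource_id, filters=None):
        filters = filters or {}
        return [r for r in self._tables.get(resource_id, [])
                if all(r.get(k) == v for k, v in filters.items())]

    def update(self, resource_id, filters, changes):
        rows = self.read(resource_id, filters)
        for row in rows:
            row.update(changes)
        return len(rows)

    def delete(self, resource_id):
        self._tables.pop(resource_id, None)


# Registry of available backends; a postgres class would be added here.
BACKENDS = {"memory": InMemoryBackend}


def get_backend(config):
    """The single place where the configured backend is instantiated."""
    name = config.get("datastore.backend", "memory")
    return BACKENDS[name]()
```

The rest of the code would depend only on `BackendInterface`, so swapping the backend becomes a configuration change and no caller needs to be touched.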
