A triplestore is a database that is used to store and retrieve data in the form of triples. A triple consists of three parts: a subject, a predicate, and an object.
In the Open Data context, a triplestore is a storage engine for RDF (Resource Description Framework) data, which is modeled as directed labeled graphs.
Triplestores can be queried using SPARQL (SPARQL Protocol and RDF Query Language), a query language specifically designed for querying RDF data. SPARQL allows developers to specify complex queries that span multiple triples, enabling powerful data retrieval and analysis capabilities.
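To make the triple model and multi-triple queries concrete, here is a minimal in-memory sketch in plain Python. It is not a real triplestore and does not run SPARQL; the sample data, the `match` helper, and the `titles_of_csv_datasets` function are hypothetical names introduced only for illustration. The join mirrors what a SPARQL basic graph pattern with three triple patterns would express.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical sample data; prefixed names are kept as plain strings.
TRIPLES: List[Triple] = [
    ("ex:dataset1", "rdf:type", "dcat:Dataset"),
    ("ex:dataset1", "dct:title", "Air quality 2023"),
    ("ex:dataset1", "dcat:distribution", "ex:dist1"),
    ("ex:dist1", "dcat:mediaType", "text/csv"),
    ("ex:dataset2", "rdf:type", "dcat:Dataset"),
    ("ex:dataset2", "dct:title", "Road network"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def titles_of_csv_datasets(triples: List[Triple]) -> List[str]:
    """Join three triple patterns, similar in spirit to the SPARQL query:

    SELECT ?title WHERE {
      ?d    dcat:distribution ?dist .
      ?dist dcat:mediaType    "text/csv" .
      ?d    dct:title         ?title .
    }
    """
    titles: List[str] = []
    for dataset, _, dist in match(triples, p="dcat:distribution"):
        # Keep the dataset only if its distribution is a CSV file.
        if any(match(triples, s=dist, p="dcat:mediaType", o="text/csv")):
            titles.extend(title for _, _, title in match(triples, s=dataset, p="dct:title"))
    return titles
```

A real triplestore would index the triples and plan the join itself; the point of the sketch is only that one query can follow links across several triples.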
We are looking for general compatibility with RDF graphs in order to support DCAT (Data Catalog Vocabulary) and its various regional extensions such as DCAT-AP.
CKAN ships with a default metadata schema so that users can get started quickly and record the necessary information about datasets. Users can also customize the metadata schema; the recommended approach is to use the https://github.com/ckan/ckanext-scheming extension. On the other hand, a number of metadata standards have been defined by different organizations and are widely adopted:
DCAT has been adopted by a number of governments and other organizations, including:
Some governments are also creating their own extensions of DCAT:
CKAN has a popular extension to support DCAT - https://github.com/ckan/ckanext-dcat. However, it does not support all features of DCAT: it does not preserve the metadata information when RDF is transformed and loaded into PostgreSQL (CKAN's storage for metadata). That is, once a user exports the metadata from CKAN, the original information is lost. For example, as mentioned by Open Data Swiss, it does not provide a representation for the CatalogRecord and DataService classes. It also does not have the relationship between Catalog and CatalogRecord. There might be other items that have not been considered yet.
Going all-in on DCAT support by replacing PostgreSQL with a triplestore engine would have two major issues:
In this document, we try to identify a solution that overcomes these challenges while requiring minimal modifications to the core code and remaining technically simple (easier, cleaner, decoupled).
Decoupling the triplestore (or RDF storage) from CKAN could be a solution to support DCAT. For example, we could have a completely separate triplestore where data publishers can upload their RDF data without any interaction with CKAN. This would mean:
graph TD
subgraph .
CKAN -.API.- Frontend
Frontend -.Find API.- User
end
RDF -.0. Upload.-> Triplestore
ETL -.1. Harvest.- Triplestore
ETL -.2. Load.-> CKAN
subgraph dest
European-Data-Catalog
Other-RDF-Catalogs
end
ETL -.3. Load.-> European-Data-Catalog
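The harvest and load steps of this flow could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pipeline: the triplestore is represented by an in-memory list of triples, the predicate-to-field mapping and the `uri` extra are assumed conventions, and `harvest`, `to_ckan_dataset`, and `run_etl` are hypothetical names. In a real deployment the `load` callback would presumably call the CKAN API instead of appending to a list.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Assumed mapping from DCAT predicates to CKAN dataset fields.
PREDICATE_TO_FIELD = {"dct:title": "title", "dct:description": "notes"}


def harvest(triples: Iterable[Triple]) -> Dict[str, Dict[str, str]]:
    """Step 1: collect the mapped properties of every dcat:Dataset."""
    triples = list(triples)
    datasets: Dict[str, Dict[str, str]] = {
        s: {} for s, p, o in triples if p == "rdf:type" and o == "dcat:Dataset"
    }
    for s, p, o in triples:
        if s in datasets and p in PREDICATE_TO_FIELD:
            datasets[s][PREDICATE_TO_FIELD[p]] = o
    return datasets


def to_ckan_dataset(uri: str, fields: Dict[str, str]) -> Dict[str, Any]:
    """Flatten one dataset into a CKAN-style dict.

    The source URI is kept as an extra so the full RDF description
    can still be looked up in the triplestore afterwards.
    """
    name = uri.split(":")[-1].lower()
    return {"name": name, "extras": [{"key": "uri", "value": uri}], **fields}


def run_etl(triples: Iterable[Triple], load: Callable[[Dict[str, Any]], None]) -> List[str]:
    """Step 2: push every harvested dataset to CKAN via the load callback."""
    loaded: List[str] = []
    for uri, fields in sorted(harvest(triples).items()):
        dataset = to_ckan_dataset(uri, fields)
        load(dataset)
        loaded.append(dataset["name"])
    return loaded
```

Because only a flattened subset is loaded into CKAN, the triplestore remains the authoritative copy of the RDF, which is the point of the decoupled design.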
The CKAN 3.0 team has started a discussion thread on GitHub to gather opinions from the community and understand their needs:
https://github.com/ckan/ckan/issues/7489
Please feel free to jump in if you have any thoughts on this matter.
Open Data Web in Taiwan has already taken a similar approach to provide triplestore support:
We could use this project as a proof of concept.
There might be various ways to achieve triplestore functionality including:
Options:
PostgreSQL:
Random ideas to check:
Awesome list for semantic data - https://github.com/semantalytics/awesome-semantic-web
Governments in Europe are looking for full support of DCAT-based metadata, as they use DCAT-AP for conformance with the European data catalog, which aggregates all open data in the EU. Some countries are also creating their own extensions of DCAT-AP to meet their regional requirements.
To provide complete DCAT conformance, CKAN needs to integrate with an RDF store. Our proposal is to use a decoupled approach so that CKAN core stays unaffected, which is important for users who do not need triplestore integration.
Current backend: PostgreSQL
Potential solution: support any kind of backend
We add an interface class that contains the general backend logic for database behaviour in the datastore (CRUD).
Then we extend it with the existing PostgreSQL implementation and replace direct usage everywhere with this "interface" class, except in one place where we instantiate the interface with the backend defined in the configuration.
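A minimal sketch of this idea is shown below. All names are assumptions for illustration and are not taken from CKAN: `StorageBackend` is the hypothetical interface, `InMemoryBackend` stands in for a concrete implementation so the example runs without a database, and `datastore.backend` is an invented configuration key. The existing PostgreSQL logic would be wrapped in a subclass in the same way and registered under its own name.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Type

Record = Dict[str, Any]


class StorageBackend(ABC):
    """Hypothetical interface holding the CRUD contract for the datastore."""

    @abstractmethod
    def create(self, resource_id: str, records: List[Record]) -> None: ...

    @abstractmethod
    def read(self, resource_id: str) -> List[Record]: ...

    @abstractmethod
    def update(self, resource_id: str, records: List[Record]) -> None: ...

    @abstractmethod
    def delete(self, resource_id: str) -> None: ...


class InMemoryBackend(StorageBackend):
    """Stand-in implementation; a PostgreSQL backend would subclass the same way."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[Record]] = {}

    def create(self, resource_id: str, records: List[Record]) -> None:
        if resource_id in self._tables:
            raise ValueError(f"Resource already exists: {resource_id!r}")
        self._tables[resource_id] = list(records)

    def read(self, resource_id: str) -> List[Record]:
        return list(self._tables[resource_id])

    def update(self, resource_id: str, records: List[Record]) -> None:
        # Appends records to an existing resource.
        self._tables[resource_id].extend(records)

    def delete(self, resource_id: str) -> None:
        del self._tables[resource_id]


# Registry of available backends, keyed by the name used in the configuration.
BACKENDS: Dict[str, Type[StorageBackend]] = {"memory": InMemoryBackend}


def get_backend(config: Dict[str, str]) -> StorageBackend:
    """The single place where a concrete backend is chosen and instantiated."""
    name = config.get("datastore.backend", "memory")
    try:
        backend_class = BACKENDS[name]
    except KeyError:
        raise ValueError(f"Unknown backend: {name!r}") from None
    return backend_class()
```

The rest of the code would then depend only on `StorageBackend`, for example `backend = get_backend(config)` followed by `backend.read(resource_id)`, so adding a triplestore backend would not require touching the callers.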