# “Open Data Web” – A Linked Open Data Repository Built with CKAN

(p.1) Hi, I am Cheng-Jen Lee from Academia Sinica in Taiwan. Today I am going to share our experiences in building a linked data repository using CKAN.

(p.2) Before I start, the slides for this talk have been uploaded to SlideShare and the transcript is also available on HackMD. Or you can search for the hashtag CKANCon to get them.

(p.3) This is the outline of this talk.

(p.4) As you may know, digital archiving is an important process for preserving scholarly literature published in electronic form. In Taiwan, the government carried out a national digital archiving program called Digital Archives Taiwan and built a web catalog called the Union Catalog of Digital Archives Taiwan, which provides an interface for browsing digitized archives in 14 domains from institutions such as universities, NGOs, and libraries. Part of this catalog is released under a Creative Commons license, which means you are free to copy and redistribute it. Recently we have been working to represent these CC-licensed catalog records in linked data formats. We want to do this for two reasons: first, to provide semantic queries for time, place, and objects; second, to enrich resources by linking them to third-party datasets.

(p.5) For those who have never heard about linked data, I would like to explain what linked data is first. Linked data is a method of publishing structured data so that it can be interlinked and become useful through semantic queries, and linked open data is linked data that is open content. Linked data is mostly expressed in RDF, which stands for Resource Description Framework, a framework for expressing information about resources. Resources can be anything, including documents, people, physical objects, and abstract concepts. RDF can enrich a dataset by linking it to third-party datasets. For example, a dataset about paintings could be enriched by linking each painting to the corresponding artist in Wikidata.

(p.6) The data model of RDF is based on the triple, which is composed of three elements: subject, predicate, and object. The subject and the object represent the two resources being related; the predicate represents the nature of their relationship. For example, in the triple "Bob is a person", Bob is the subject, person is the object, and "is a" is the predicate showing the relationship between Bob and a person. We can visualize triples as a connected graph of nodes and arcs, like the picture in the bottom right corner of this slide.

(p.7) In this project, we converted the archive catalog into two versions of linked data. The first one is triples with Dublin Core descriptions; that is, we simply represent the catalog in a linked data format and do not change the column values in this version. We call it "version D". For the second one, we mapped column values in the catalog to external datasets with domain vocabularies to give enriched semantics. Specifically, we extracted place names, date strings, and titles of resources in the biology domain from the catalog, then normalized them or mapped them to third-party datasets including GeoNames.org, the Encyclopedia of Life, and Wikidata. We call these refined triples "version R".

(p.8) For the linked data conversion, we designed a two-step process. First, we map column values to vocabularies that describe them. For example, in a catalog record, the value of the Date field is "採集日期", which means "date collected" in English. Because this is the time description of a create action, we map this value to the schema:CreateAction RDF type and use the dwc:eventDate term to describe the date of that action. The mapping results are saved as a CSV file, which we call the RDF-like CSV. Second, we convert the RDF-like CSV to an RDF format. The Python scripts for this work are available on GitLab if you are interested in how we did it.
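To make this concrete, here is a rough sketch of what step two can look like with rdflib. It is not the actual scripts from our GitLab repository: the column names ("record_uri", "date_collected"), the file names, and the use of schema:object to connect the record with its create action are illustrative assumptions only.

```python
# A minimal sketch, assuming a hypothetical RDF-like CSV with columns
# "record_uri" and "date_collected"; the real conversion scripts live on GitLab.
import csv
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")
DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

g = Graph()
g.bind("schema", SCHEMA)
g.bind("dwc", DWC)

with open("rdf_like.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        record = URIRef(row["record_uri"])
        # One node per record for the create action described by "採集日期".
        action = URIRef(row["record_uri"] + "#create")
        g.add((action, RDF.type, SCHEMA.CreateAction))
        g.add((action, SCHEMA.object, record))  # illustrative linking property
        g.add((action, DWC.eventDate, Literal(row["date_collected"])))

# Export the triples in Turtle; other serializations such as RDF/XML work the same way.
g.serialize(destination="records.ttl", format="turtle")
```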
(p.9) After converting the catalog to linked data, the next problem is: where do we store this linked data? To do that, we built a data repository with CKAN called the Open Data Web (ODW for short), which is at data.odw.tw. For those familiar with linked data, we have also released an ontology for the ODW. However, since this is CKANCon, I will not talk about the ontology here.

(p.10) When you enter the site, you will see the main menu at the top of the page. The "Records" section brings you to the D-version triples, and the "Refined" section is for the R-version triples. (p.11) Both of them have a list of resources and some search filters to help you find the resources you want. (p.12) On the right side of each resource in the list, there is a button that lets you get the D or R version of that resource easily. Unfortunately, most of the R-version triples are unavailable at this time, since they are still being uploaded to CKAN.

(p.13) Take the resource "Girl Lost in Thought" as an example. You can explore the triples about it, (p.14) and export them in linked data formats including Turtle, JSON-LD, and RDF/XML. (p.15) The refined triples also enable spatial and temporal queries. For example, we can find resources about Tainan City in southern Taiwan from the map. (p.16) You can search for resources from the 19th century, too. (p.17) We also provide a SPARQL endpoint for machine access. (p.18) Finally, if the triples in the R version contain GeoNames information, they are marked on the map.

(p.19) The system architecture is shown in this picture. We use CKAN's harvest mechanism to import, or "harvest", the converted linked data of the catalog. Then, for a single resource, we use the scheming and repeating extensions to build an HTML page and the dcat extension to generate linked data. Meanwhile, since CKAN doesn't support SPARQL queries, we also import the converted catalog into an OpenLink Virtuoso server, then integrate an interface into our CKAN site for testing SPARQL queries.

(p.20) Here I would like to share some implementation details of the ODW. Firstly, for the custom fields, the scheming extension and the repeating extension are used to define CKAN custom fields for a data type in a JSON file. Each data type has its own directory; for example, record.json defines the fields in the D-version triples. In the JSON file, each field is defined by a JSON object, including its field name, label, display property in the HTML, and so on. You can also create presets to reuse frequently used properties.

(p.21) Secondly, we use the dcat extension for linked data import and export. The import tasks use the CKAN harvesting function provided by the harvest extension. By extending the DCATRDFHarvester class, we can create our own harvesting profile to import linked data. Then, to define the import and export tasks, we extend the RDFProfile class. The two key functions in the RDFProfile class are parse_dataset and graph_from_dataset: the former parses loaded linked data into a CKAN dataset, and the latter generates a linked data graph from a CKAN dataset.
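As an illustration of the idea rather than the actual ODW profile on GitLab, a custom profile could look roughly like this; the "event_date" extra field and the dwc:eventDate mapping are hypothetical examples:

```python
# A minimal sketch of a custom ckanext-dcat profile; the "event_date" extra
# and the dwc:eventDate mapping are hypothetical, not the real ODW code.
from rdflib import Literal, Namespace
from ckanext.dcat.profiles import RDFProfile

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")


class ODWExampleProfile(RDFProfile):

    def parse_dataset(self, dataset_dict, dataset_ref):
        # Import task: read the harvested triples for this dataset and copy
        # the values we care about into the CKAN dataset dict.
        event_date = self._object_value(dataset_ref, DWC.eventDate)
        if event_date:
            dataset_dict.setdefault("extras", []).append(
                {"key": "event_date", "value": event_date})
        return dataset_dict

    def graph_from_dataset(self, dataset_dict, dataset_ref):
        # Export task: write CKAN dataset fields back out as triples.
        g = self.g
        g.bind("dwc", DWC)
        for extra in dataset_dict.get("extras", []):
            if extra["key"] == "event_date":
                g.add((dataset_ref, DWC.eventDate, Literal(extra["value"])))
```

Such a profile is registered through the ckan.rdf.profiles entry point in setup.py so that the harvester and the dcat export endpoints can pick it up.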
Finally, since the dcat extension was originally designed for DCAT vocabularies, we also modified it to support more namespaces. (p.22) All you need to do is add those namespaces to processors.py in the dcat extension.

(p.23) Some other useful extensions are also used in the ODW project. We use the sparql extension to integrate the Virtuoso SPARQL endpoint, and the spatial extension supports spatial indexing and searching. We also developed the tempsearch extension for temporal indexing and searching. All source code is available on GitLab.

(p.24) In the ODW project, we used CKAN with slight modifications to build a simple linked data repository. The result seems good. However, there are still some challenges that need to be addressed. The first problem is that we are maintaining two triple stores, CKAN and Virtuoso, and they may become inconsistent since we do not sync them for now. The second, and more serious, problem is the slow harvesting speed on the CKAN side. In our experience, it takes more than four hours to harvest 20 thousand resources (or packages, in CKAN terms) on a 3.4 GHz Core i7 machine.

(p.25) The ODW project is at an early stage and there is still much work to do. Firstly, we want to provide native SPARQL queries in CKAN so that we no longer need Virtuoso. Also, to improve import speed, we will try to modify the harvesting profile to harvest multiple resources as one CKAN dataset. The time and place-name mappings to third-party datasets also need further verification.

(p.26) And that's it, thank you.