---
title: 'Data validation and schema interoperability'
tags:
- schema
- data validation
- interoperability
- FAIR
authors:
- name: Leyla Garcia
orcid: 0000-0003-3986-0510
affiliation: 1
- name: Jerven Bolleman
orcid: 0000-0002-7449-1266
affiliation: 2
- name: Michel Dumontier
orcid: 0000-0003-4727-9435
affiliation: 3
- name: Simon Jupp
orcid: 0000-0002-0643-3144
affiliation: 4
- name: Jose Emilio Labra Gayo
orcid: 0000-0001-8907-5348
affiliation: 5
- name: Thomas Liener
orcid: 0000-0003-3257-9937
affiliation: 6
- name: Tazro Ohta
orcid: 0000-0003-3777-5945
affiliation: 7
- name: Núria Queralt-Rosinach
orcid: 0000-0003-0169-8159
affiliation: 8
- name: Chunlei Wu
orcid: 0000-0002-2629-6124
affiliation: 9
affiliations:
- name: ZB MED Information Centre for Life Sciences, Gleueler Str. 60, 50931 Cologne, Germany
index: 1
- name: Swiss Institute of Bioinformatics, Quartier Sorge - Batiment Amphipole, 1015 Lausanne, Switzerland
index: 2
- name: Maastricht University, Minderbroedersberg 4-6, 6211 LK Maastricht, The Netherlands
index: 3
- name: European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, CB10 1SD, Hinxton, United Kingdom
index: 4
- name: Universidad de Oviedo, C/ Federico García Lorca, S/N, CP 33007, Oviedo, Spain
index: 5
- name: Thomas Liener Consultancy, www.linkedin.com/in/thomas-liener
index: 6
- name: Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Yata 1111, Mishima, Shizuoka, Japan
index: 7
- name: Harvard Medical School, Countway Library 10 Shattuck St, Boston, MA 02115, United States
index: 8
- name: The Scripps Research Institute, 10550 N Torrey Pines Rd, La Jolla, CA 92037, United States
index: 9
date: 20 December 2019
bibliography: paper.bib
---
# Background
Validating RDF data is necessary to ensure that the data complies with the conceptual model it follows, e.g., the schema or ontology behind the data. Validation can also help with data consistency and completeness. There are different approaches to validating RDF data. For instance, JSON Schema is particularly useful for data expressed in the JSON-LD RDF serialization, while Shape Expressions (ShEx) [@baker_shape_2019] and the Shapes Constraint Language (SHACL) [@knublauch_shapes_2017] can be used with other serializations as well. Currently, no validation approach prevails over the others; depending on data characteristics and personal preferences, one or the other can be used. In some cases the approaches are interchangeable; however, that is not always the case, making it necessary to identify a subset among them that can be seamlessly translated from one to another.
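As a minimal illustration of the SHACL route mentioned above, the sketch below validates a small RDF graph against a shape using the rdflib and pyshacl Python packages; the vocabulary and shape are illustrative and unrelated to the resources discussed later.

```python
from rdflib import Graph
from pyshacl import validate

# Illustrative data: one resource typed as ex:Dataset, missing its dct:title.
data_ttl = """
@prefix ex:  <http://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
ex:dataset1 a ex:Dataset .
"""

# Illustrative SHACL shape: every ex:Dataset must have exactly one dct:title.
shapes_ttl = """
@prefix ex:  <http://example.org/> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:DatasetShape a sh:NodeShape ;
    sh:targetClass ex:Dataset ;
    sh:property [
        sh:path dct:title ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
"""

data_graph = Graph().parse(data=data_ttl, format="turtle")
shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")

# pyshacl returns (conforms, results_graph, results_text).
conforms, _, report = validate(data_graph, shacl_graph=shapes_graph)
print(conforms)   # False: dct:title is missing
print(report)
```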
During the DBCLS/NBDC BioHackathon 2019, we worked on a variety of topics related to RDF data validation, including (i) development of ShEx shapes for a number of datasets, (ii) development of a tool to semi-automatically create ShEx shapes, (iii) improvements to the RDFShape tool [@labra-gayo_rdfshape:_2018], and (iv) enabling conversion of validation schemas from one format to another. In the following sections we detail the work done on each front.
# Hackathon results
## Development of ShEx shapes
We created and updated ShEx shapes for different biomedical resources, including Health Care Life Science (HCLS) dataset descriptions [@gray_dataset_2015], Bioschemas [@gray_bioschemas:_2017],
and DisGeNET [@pinero_disgenet:_2017]. To make future updates easier, we developed applications to automatically create ShEx shapes from the HCLS dataset specification and from Bioschemas profiles.
### Bioschemas
Schema.org is a collaborative effort aiming to create, maintain, and promote schemas for structured data on the Internet [@schema_org_noauthor_home_nodate]. Bioschemas is a community-driven project aiming to support schema.org types for the Life Sciences. It contributes to the community by adding Life Science types to schema.org, defining profiles adjusted to community needs, and developing supporting tools. A Bioschemas profile is a type customization that includes property cardinality and requirement level. Our Bioschemas shapes currently focus on the profiles used by the Biotea project, particularly those related to bibliographic data. Biotea [@garcia_biotea:_2018] provides a model to express scholarly articles in RDF, including not only bibliographic data but also article structure and named entities recognized in the text.
Biotea-Bioschemas ShEx shapes are created via a Jupyter notebook from the YAML Bioschemas profile files. Schema.org datatypes are transformed to XML Schema Definition (XSD) datatypes, while supporting shapes are created for any combination of schema.org types used as ranges. In addition, three main shapes are created for any Bioschemas profile, corresponding to the three property requirement levels, i.e., minimum, recommended, and optional. Profile information, i.e., profile name, schema.org type, and YAML file location, is encoded in a comma-separated values (CSV) file, making it easy to use the code to generate shapes for any other Bioschemas profile. More information is available in the GitHub repository for this project [@garcia_biotea/validation-shapes-bioschemas_2019].
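The notebook that implements this pipeline is available in the repository cited above. As a hedged illustration of the idea, the sketch below maps a made-up, Bioschemas-style YAML profile to one ShEx shape per requirement level; the profile layout, property names, and datatype mapping are illustrative assumptions, not the actual Biotea profiles or notebook code.

```python
# Hypothetical sketch: generate one ShEx shape per requirement level from a
# Bioschemas-style profile. Profile layout and datatype map are illustrative only.
import yaml  # PyYAML, assumed available

profile_yaml = """
name: ScholarlyArticle
properties:
  - name: name
    type: Text
    marginality: minimum
  - name: datePublished
    type: Date
    marginality: recommended
  - name: keywords
    type: Text
    marginality: optional
"""

# schema.org datatype -> XSD datatype (only the cases needed for this sketch).
XSD_MAP = {"Text": "xsd:string", "Date": "xsd:date", "Number": "xsd:decimal"}

LEVELS = ("minimum", "recommended", "optional")

def shape_for_level(profile, level):
    """Build one ShEx shape covering all properties up to the given level."""
    included = LEVELS[: LEVELS.index(level) + 1]
    constraints = [
        f"  schema:{p['name']} {XSD_MAP[p['type']]}"
        for p in profile["properties"]
        if p["marginality"] in included
    ]
    header = f":{profile['name']}{level.capitalize()}Shape {{"
    return header + "\n" + " ;\n".join(constraints) + "\n}"

profile = yaml.safe_load(profile_yaml)
for level in LEVELS:
    print(shape_for_level(profile, level), end="\n\n")
```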
### DisGeNET
DisGeNET is a comprehensive gene-disease association knowledge base in the Life Sciences. It is widely used by the biomedical community, and its Linked Data representation has been selected as an ELIXIR Europe [@noauthor_elixir_nodate] interoperability resource. However, it still lacks a way to easily query this vast amount of information and to explore this knowledge across other domains through its SPARQL endpoint.
During the BioHackathon we implemented the DisGeNET-RDF ShEx shape [@rosinach_nuriaqueralt/shex-shapes_2019]. To do so, we used RDFShape [@web_semantics_group_at_university_of_oviedo_rdfshape_nodate] and the suite of generation and validation tools it comes with. We detected some disagreements between the DisGeNET schema illustrated on its website and the actual underlying data. We also actively discussed how best to tackle the development of the ShEx shapes in an automatic and data-driven way, so that we can continue working on them after the BioHackathon; a minimal sketch of such a data-driven exploration is shown below.
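One way to seed this exploration, shown in the hedged Python sketch below, is to list the predicates actually used by instances of a class directly from the SPARQL endpoint with the SPARQLWrapper package and to use that list when drafting the shape. The endpoint URL is assumed from the public DisGeNET-RDF documentation, and the target class URI is a placeholder to be replaced with the class of interest from the DisGeNET schema.

```python
# Hedged sketch: list predicates used by instances of a class as a seed for a
# ShEx shape. Endpoint URL is assumed; the class URI is a placeholder.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://rdf.disgenet.org/sparql/"  # assumed DisGeNET SPARQL endpoint
TARGET_CLASS = "http://example.org/GeneDiseaseAssociation"  # placeholder class URI

query = f"""
SELECT ?p (COUNT(?s) AS ?uses) WHERE {{
  ?s a <{TARGET_CLASS}> ;
     ?p ?o .
}}
GROUP BY ?p
ORDER BY DESC(?uses)
LIMIT 50
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Each distinct predicate is a candidate triple constraint in the drafted shape.
for row in results["results"]["bindings"]:
    print(row["p"]["value"], row["uses"]["value"])
```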
### HCLS
The HCLS Community Profile for Dataset Descriptions offers a concrete guideline to specify dataset metadata as RDF, including elements of data description, versioning, and provenance, so as to support discovery, exchange, query, and retrieval of dataset metadata. As part of their work, the HCLS Community created Validata [@beveridge_validata:_2015], a web application to check the compliance of RDF documents against the guideline specifications. Validata used a non-standard extension of ShEx to check various compliance levels.
We created ShEx-compliant documents by processing the HCLS guideline with a PHP script [@dumontier_micheldumontier/hcls-shex_2019]. The result is a set of ShEx documents that can be used to check compliance at the various requirement levels (MUST, SHOULD, MAY, SHOULD NOT, MUST NOT). We validated our work against the exemplar documents provided as part of the guideline and also used it to check HCLS metadata from UniProt, which revealed errors both in the UniProt metadata and in the RDFShape tool.
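As an illustration of how such a document can be used to check compliance, the sketch below validates a toy dataset description with the PyShEx Python package (assuming its `ShExEvaluator` interface as documented); the two-property shape shown here is a tiny illustrative stand-in for a MUST-level ShEx document, not one of the actual generated files.

```python
# Hedged sketch: check a toy dataset description against a tiny, illustrative
# MUST-level shape using PyShEx (assumed API: ShExEvaluator(...).evaluate()).
from pyshex import ShExEvaluator

shex = """
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX :    <http://example.org/shapes/>

:DatasetMustShape {
  dct:title       xsd:string ;
  dct:description xsd:string
}
"""

rdf = """
@prefix dct: <http://purl.org/dc/terms/> .
@prefix ex:  <http://example.org/> .

ex:mydataset dct:title "My dataset" ;
             dct:description "A toy dataset description." .
"""

results = ShExEvaluator(rdf=rdf,
                        schema=shex,
                        focus="http://example.org/mydataset",
                        start="http://example.org/shapes/DatasetMustShape").evaluate()

for r in results:
    print("PASS" if r.result else f"FAIL: {r.reason}")
```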
### Rare disease catalogs and registries
Data on rare diseases is currently fragmented across various databases and online resources, making efficient and timely use of this data in rare disease research challenging. Several data catalogs exist that collect data from biobanks and patient registries, but these data are neither comprehensive nor readily interoperable across catalogs. There is now an international effort to improve the discovery, linkage, and sharing of rare disease data through the development of standards and the adoption of the FAIR data principles. One component of this process is the development of common metadata models for describing and sharing data across resources using standard vocabularies and ontologies.
During the hackathon we explored the use of both JSON Schema and ShEx for validating data that conforms to schemas developed as part of the European Joint Programme on Rare Diseases (EJP RD) [@noauthor_ejp_nodate]. The EJP RD schemas are expressed using JSON Schema and are accompanied by an additional JSON-LD context file that enables instance JSON data from data providers to be transformed into RDF. At the hackathon we developed a set of new ShEx shapes that could validate the resulting RDF. This required mapping validation rules, such as required properties and cardinality/value-type constraints, from JSON Schema to equivalent constraints in ShEx. We were also able to demonstrate how more complex types of validation are possible with ShEx when additional RDF-based resources are available. For example, we can express that the `dcat:theme` of a rare disease dataset must be a URI and that this URI should be a subclass of the root disease class in the Orphanet rare disease ontology. The resulting EJP RD schemas and accompanying ShEx files are all available on GitHub [@jupp_ejp-rd-vpresource-metadata-schema_2019].
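The actual ShEx shapes are in the repository cited above. As a hedged, programmatic alternative (not the ShEx shape itself), the rdflib sketch below checks the same kind of constraint with a SPARQL ASK query over a toy graph; all `ordo:`/`ex:` URIs are illustrative placeholders rather than the real Orphanet or EJP RD identifiers.

```python
# Hedged sketch: check that a dataset's dcat:theme is an IRI that is a (transitive)
# subclass of a root disease class. All ordo:/ex: URIs are placeholders.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .
@prefix ordo: <http://example.org/ordo/> .

ordo:SomeRareDisease rdfs:subClassOf ordo:RootDisease .
ex:dataset1 a dcat:Dataset ;
    dcat:theme ordo:SomeRareDisease .
""", format="turtle")

ask = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX ordo: <http://example.org/ordo/>

ASK {
  ?dataset dcat:theme ?theme .
  ?theme rdfs:subClassOf* ordo:RootDisease .
  FILTER(isIRI(?theme))
}
"""

print(g.query(ask).askAnswer)  # True for the toy graph above
```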
## ShEx creator
While ShEx is, as demonstrated above, very useful for validating RDF data, the syntax for actually writing a ShEx expression can be hard for new users and time-consuming even for experienced ones. Therefore, a prototype of a ShEx creator was proposed for the BioHackathon. This tool should help users write correct ShEx expressions faster. The prototype is implemented as a JavaScript tool that supports the user through, e.g., dropdown menus to create a correct ShEx structure, and it uses the RDFShape API in the background to validate the created ShEx expression. The prototype can be found in the corresponding GitHub repository [@liener_lltommy/rdfvalidation4humans_2019].
## Improvements to RDFShape tool
The RDFShape tool [@labra-gayo_rdfshape:_2018] comprises a set of tools to create and validate RDF data via ShEx and SHACL shapes. During the BioHackathon, it was used to create shapes and validate RDF data from different endpoints. This exercise allowed us to identify improvements for the tool, such as the possibility of validating triples obtained by mixing RDF data provided by the user with data already contained in a SPARQL endpoint. This feature was added to the new version developed during the BioHackathon.
We also explored and implemented new visualization features for ShEx. Our implementation resulted in the separation of the code base into several modules:
- RDFShape client [@noauthor_weso/rdfshape-client_2019], which consists of a JavaScript client based on the React framework.
- RDFShape server [@noauthor_weso/rdfshape_2019] contains the server part and is implemented in Scala using the http4s [@noauthor_http4s_nodate] library.
- umlShaclex [@noauthor_weso/umlshaclex_2019] is a module that generates UML-like visualizations from shapes schemas. The library can also be used as a standalone command-line tool.
- SHaclEX [@noauthor_weso/shaclex_2019] contains the main validation modules for ShEx and SHACL.
- SRDF [@noauthor_weso/srdf_2019] defines a simple RDF interface with the main features required by the validation library. The module contains several implementations of that interface, which enable validation with Apache Jena [@noauthor_apache_nodate] models, RDF4J [@guindon_eclipse_nodate], or SPARQL endpoints.
## Schema conversion across validation approaches
During the BioHackathon we also worked on identifying a common subset of ShEx that could be used as the basis for generating RDF data model documentation, which can later be converted to JSON Schema, ShEx, or SHACL. Although full interoperability between those languages is not feasible, we consider that a subset language could be defined that handles the most common cases [@Labra-Gayo2019].
Through CD2H's [@noauthor_cd2h_nodate] Data Discovery Engine [@noauthor_ctsa_nodate] project, we previously developed a web-based tool called Schema Playground [@noauthor_ctsa_nodate-1] to facilitate schema visualization, hosting, and extension. It helps developers publish their existing schemas as well as build new schemas by extending existing ones. Schema Playground currently supports schema.org schemas defined in JSON-LD format and JSON-Schema-based data validation. While JSON Schema is a good fit for the underlying JSON-based data structure, ShEx and SHACL provide a more expressive way to describe validation rules when the underlying data are represented as triples. At the hackathon, we converted several JSON-Schema-based validation rules to ShEx and performed the validation on the underlying data (e.g., dataset metadata); a small sketch of the JSON-Schema side of such a rule is shown below. These exercises helped us identify the requirements for adding ShEx support to our BioThings Schema Playground.
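As a small illustration of JSON-Schema-based validation, the sketch below checks a toy metadata record with the Python jsonschema package; the schema fragment is illustrative and not one of the actual Data Discovery Engine schemas. The same `required` and type constraints are the kind of rules that can be translated into ShEx triple constraints for the triple-based representation.

```python
# Hedged sketch: JSON-Schema-style validation of a toy dataset metadata record.
# The schema fragment is illustrative, not an actual Data Discovery Engine schema.
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
    "required": ["name", "description"],
}

record = {
    "name": "A toy dataset",
    "description": "Metadata record used only to illustrate validation.",
    "keywords": ["toy", "example"],
}

try:
    validate(instance=record, schema=schema)
    print("record conforms to the schema")
except ValidationError as err:
    print(f"validation failed: {err.message}")
```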
# Conclusion
We developed a formal description of the HCLS dataset metadata guidelines in a manner that is compliant with the latest version of ShEx. This work is important not only to the HCLS community that uses the guideline, but it can also form a basis for automated computational validation of metadata descriptions, as per the FAIR (Findable, Accessible, Interoperable, Reusable) principles. In a similar vein, we prototyped a (semi-)automatic ShEx shape generation solution for Bioschemas, which could later be extended to Bioschemas profiles other than those defined by Biotea. We also developed a prototype corresponding to the first formal description of the DisGeNET-RDF data model using ShEx [@rosinach_nuriaqueralt/shex-shapes_2019]. Our strategy to generate the DisGeNET-RDF ShEx shape comprised three steps: (i) manual building based on the schema depicted on the web, (ii) polishing via inference from actual data instances, and (iii) validating against the whole database via the SPARQL endpoint. The shapes created for DisGeNET will serve as a basis for developing a more automated solution for this resource.
The development of ShEx shapes using the RDFShape tools resulted in a user-testing exercise in which bugs were identified. This direct interaction with users allowed us to quickly implement fixes and immediately test them with users, resulting in a new version. In addition, from the creation of ShEx shapes and the transformation from one format to another, we identified a need to improve the tools and technologies used to describe and validate RDF data. Such validation could facilitate machine-readable community agreements regarding metadata, thus leading to more Findable, Accessible, Interoperable and Reusable (FAIR) data, as community-based validators could interoperate with the FAIR metrics evaluator [@wilkinson_evaluating_2018].
# Future work
Regarding the generation of ShEx shapes, the HCLS team plans to check the compliance of other HCLS dataset metadata documents on the web and report its findings to the community, while Bioschemas will work on a validation platform that can later communicate with the FAIR evaluator. Regarding DisGeNET, the ShEx shapes will be finalized and moved to a more automatic generation process.
To overcome the need to learn yet another syntax, i.e., the ShEx syntax, the work on ShEx tooling will continue. Currently, the ShEx creator is a rough prototype. Future work consists of (i) making the code more stable and potentially publishing it as an npm module, and (ii) integrating the ShEx creator into the RDFShape tool website, so it can be further combined with existing functionality, e.g., ShEx visualization in the RDFShape platform.
RDFShape will continue using user feedback to improve the services provided, taking into account the scalability requirements of big SPARQL endpoints. Several issues appeared when validating those big data portals, such as the need to improve error messages and to handle streaming validation for big RDF data. Regarding visualization, we will work in a direction similar to the one followed by the Japanese Life Science Database Integration portal [@noauthor_home_nodate]. This portal uses manually drawn data model representations that combine instances and schemas. In this way, it can show a visualization that users follow more easily, as they observe real data rather than only the underlying model. Future work could extend the visualization capabilities of RDFShape to automatically generate those kinds of visualizations. Other future work on the RDFShape platform includes the development of Jupyter notebooks integrating and showcasing the different tools provided.
The BioThings team also plans to continue their work after the hackathon to allow publishing and visualizing ShEx schemas in Schema Playground, alongside the existing support for schemas defined in schema.org and JSON Schema format. The ShEx parsing tools developed for RDFShape will be adopted to convert input ShEx schemas into a JSON format for indexing purposes, and the visualization tool from RDFShape can also be used to generate graph representations of ShEx schemas.
# Jupyter notebooks created
* Bioschemas ShEx shapes: https://github.com/biotea/validation-shapes-bioschemas
# References
@techreport{baker_shape_2019,
title = {Shape {Expressions} ({ShEx}) {Primer}},
url = {https://shexspec.github.io/primer/},
urldate = {2019-12-20},
author = {Baker, Thomas and Prud'hommeaux, Eric},
month = oct,
year = {2019}
}
@techreport{knublauch_shapes_2017,
title = {Shapes {Constraint} {Language} ({SHACL})},
url = {https://www.w3.org/TR/shacl/},
language = {en},
urldate = {2019-12-20},
author = {Knublauch, Holger and Kontokostas, Dimitris},
month = jul,
year = {2017}
}
@inproceedings{labra-gayo_rdfshape:_2018,
address = {Monterey, USA},
title = {{RDFShape}: {An} {RDF} playground based on {Shapes}},
abstract = {The quality of RDF based solutions requires the possibility to describe and validate RDF graphs. In the recent years, two technologies have been proposed for RDF validation: ShEx and SHACL. RDFShape is a web service that can be used as an RDF playground to validate RDF with those technologies, but can also be used as a general RDF playground allowing to parse, convert, visualize and query RDF data. The validation engine supports both ShEx and SHACL, allows the conversion between different schema formats, and is able visualize schemas using UML-like class diagrams. The system offers the possibility to play with different RDF technologies in a unified setting and has already been used to teach RDF technologies in several courses and tutorials. It has also been employed in industrial RDF workflows that needed a web service for RDF data validation and management.},
language = {en},
booktitle = {Proceedings of the {ISWC} 2018 {Posters} \& {Demonstrations}, {Industry} and {Blue} {Sky} {Ideas} {Tracks} co-located with 17th {International} {Semantic} {Web} {Conference}},
author = {Labra-Gayo, Jose Emilio and Fernández-Álvarez, Daniel and García-González, Herminio},
month = oct,
year = {2018},
pages = {4}
}
@techreport{gray_dataset_2015,
title = {Dataset descriptions: {HCLS} community profile},
url = {https://www.w3.org/TR/hcls-dataset/},
language = {en},
urldate = {2019-12-20},
author = {Gray, Alasdair J G and Baran, Joachim and Marshall, M. Scott and Dumontier, Michel},
month = may,
year = {2015}
}
@inproceedings{gray_bioschemas:_2017,
address = {Vienna, Austria},
title = {Bioschemas: {From} {Potato} {Salad} to {Protein} {Annotation}},
volume = {1963},
abstract = {The life sciences have a wealth of data resources with a wide range of overlapping content. Key repositories, such as UniProt for protein data or Entrez Gene for gene data, are well known and their content easily discovered through search engines. However, there is a long-tail of bespoke datasets with important content that are not so prominent in search results. Building on the success of Schema.org for making a wide range of structured web content more discoverable and interpretable, e.g. food recipes, the Bioschemas community (http://bioschemas.org) aim to make life sciences datasets more findable by encouraging data providers to embed Schema.org markup in their resources.},
language = {en},
booktitle = {Proceedings of the {ISWC} 2017 {Posters} \& {Demonstrations} and {Industry} {Tracks} co-located with 16th {International} {Semantic} {Web} {Conference}},
author = {Gray, Alasdair J G and Goble, Carole and Jimenez, Rafael C and The Bioschemas community},
month = oct,
year = {2017},
pages = {4}
}
@article{pinero_disgenet:_2017,
title = {{DisGeNET}: a comprehensive platform integrating information on human disease-associated genes and variants},
volume = {45},
issn = {0305-1048},
shorttitle = {{DisGeNET}},
url = {https://academic.oup.com/nar/article/45/D1/D833/2290909},
doi = {10.1093/nar/gkw943},
abstract = {The information about the genetic basis of human diseases lies at the heart of precision medicine and drug discovery. However, to realize its full potential to support these goals, several problems, such as fragmentation, heterogeneity, availability and different conceptualization of the data must be overcome. To provide the community with a resource free of these hurdles, we have developed DisGeNET (http://www.disgenet.org), one of the largest available collections of genes and variants involved in human diseases. DisGeNET integrates data from expert curated repositories, GWAS catalogues, animal models and the scientific literature. DisGeNET data are homogeneously annotated with controlled vocabularies and community-driven ontologies. Additionally, several original metrics are provided to assist the prioritization of genotype–phenotype relationships. The information is accessible through a web interface, a Cytoscape App, an RDF SPARQL endpoint, scripts in several programming languages and an R package. DisGeNET is a versatile platform that can be used for different research purposes including the investigation of the molecular underpinnings of specific human diseases and their comorbidities, the analysis of the properties of disease genes, the generation of hypothesis on drug therapeutic action and drug adverse effects, the validation of computationally predicted disease genes and the evaluation of text-mining methods performance.},
language = {en},
number = {D1},
urldate = {2019-12-20},
journal = {Nucleic Acids Research},
author = {Piñero, Janet and Bravo, Àlex and Queralt-Rosinach, Núria and Gutiérrez-Sacristán, Alba and Deu-Pons, Jordi and Centeno, Emilio and García-García, Javier and Sanz, Ferran and Furlong, Laura I.},
month = jan,
year = {2017},
pages = {D833--D839}
}
@article{garcia_biotea:_2018,
title = {Biotea: semantics for {Pubmed} {Central}},
volume = {6},
issn = {2167-8359},
shorttitle = {Biotea},
url = {https://peerj.com/articles/4201},
doi = {10.7717/peerj.4201},
abstract = {A significant portion of biomedical literature is represented in a manner that makes it difficult for consumers to find or aggregate content through a computational query. One approach to facilitate reuse of the scientific literature is to structure this information as linked data using standardized web technologies. In this paper we present the second version of Biotea, a semantic, linked data version of the open-access subset of PubMed Central that has been enhanced with specialized annotation pipelines that uses existing infrastructure from the National Center for Biomedical Ontology. We expose our models, services, software and datasets. Our infrastructure enables manual and semi-automatic annotation, resulting data are represented as RDF-based linked data and can be readily queried using the SPARQL query language. We illustrate the utility of our system with several use cases. Our datasets, methods and techniques are available at http://biotea.github.io.},
language = {en},
urldate = {2019-12-20},
journal = {PeerJ},
author = {Garcia, Alexander and Lopez, Federico and Garcia, Leyla and Giraldo, Olga and Bucheli, Victor and Dumontier, Michel},
month = jan,
year = {2018},
pages = {e4201}
}
@misc{garcia_biotea/validation-shapes-bioschemas_2019,
title = {biotea/validation-shapes-bioschemas},
copyright = {Apache-2.0},
url = {https://github.com/biotea/validation-shapes-bioschemas},
abstract = {ShEX validation shapes for Biotea-Bioschemas. Contribute to biotea/validation-shapes-bioschemas development by creating an account on GitHub.},
urldate = {2019-12-20},
publisher = {biotea},
author = {Garcia, Leyla},
month = dec,
year = {2019},
note = {original-date: 2019-09-04T03:25:40Z}
}
@misc{noauthor_elixir_nodate,
title = {{ELIXIR} {\textbar} {A} distributed infrastructure for life-science information},
url = {https://elixir-europe.org/},
urldate = {2019-12-20}
}
@misc{rosinach_nuriaqueralt/shex-shapes_2019,
title = {{NuriaQueralt}/shex-shapes},
url = {https://github.com/NuriaQueralt/shex-shapes},
abstract = {ShEx Shapes for RDF schema validation. Contribute to NuriaQueralt/shex-shapes development by creating an account on GitHub.},
urldate = {2019-12-20},
author = {Queralt-Rosinach, Núria},
month = sep,
year = {2019},
note = {original-date: 2019-09-05T12:02:13Z},
keywords = {rdf, shex}
}
@misc{web_semantics_group_at_university_of_oviedo_rdfshape_nodate,
title = {{RDFShape}},
url = {http://rdfshape.weso.es/},
urldate = {2019-12-20},
author = {Web semantics group at University of Oviedo}
}
@misc{beveridge_validata:_2015,
title = {Validata: {RDF} {Validator}},
url = {https://www.w3.org/2015/03/ShExValidata/},
abstract = {Validata was developed as part of a group project for students on the Software Engineering MEng course at Heriot-Watt University led by Alasdair J G Gray.
Building upon previous work, the group developed this tool to demonstrate real-world usage of Shape Expressions with a simple, intuitive interface.
All code and resources are licensed under the free and permissive MIT license.
Code for both the validator (NodeJS module) and Validata (this web interface) is available on GitHub
Development Team
Andrew Beveridge
Jacob Baungard Hansen
Johnny Val
Leif Gehrmann
Roisin Farmer
Sunil Khutan
Tomas Robertson
Contributing Authors
Eric Prud'Hommeaux},
urldate = {2019-12-20},
journal = {Validata: RDF Validator using Shape Expressions},
author = {Beveridge, Andrew and Baungard Hansen, Jacob and Val, Johnny and Gehrmann, Leif and Farmer, Roisin and Khutan, Sunil and Robertson, Tomas},
month = may,
year = {2015}
}
@misc{dumontier_micheldumontier/hcls-shex_2019,
title = {micheldumontier/hcls-shex},
copyright = {MIT},
url = {https://github.com/micheldumontier/hcls-shex},
abstract = {Contribute to micheldumontier/hcls-shex development by creating an account on GitHub.},
urldate = {2019-12-20},
author = {Dumontier, Michel},
month = oct,
year = {2019},
note = {original-date: 2019-09-04T13:04:32Z}
}
@article{10.1093/nar/gky1049,
author = {The UniProt Consortium},
title = "{UniProt: a worldwide hub of protein knowledge}",
journal = {Nucleic Acids Research},
volume = {47},
number = {D1},
pages = {D506-D515},
year = {2018},
month = {11},
abstract = "{The UniProt Knowledgebase is a collection of sequences and annotations for over 120 million proteins across all branches of life. Detailed annotations extracted from the literature by expert curators have been collected for over half a million of these proteins. These annotations are supplemented by annotations provided by rule based automated systems, and those imported from other resources. In this article we describe significant updates that we have made over the last 2 years to the resource. We have greatly expanded the number of Reference Proteomes that we provide and in particular we have focussed on improving the number of viral Reference Proteomes. The UniProt website has been augmented with new data visualizations for the subcellular localization of proteins as well as their structure and interactions. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.}",
issn = {0305-1048},
doi = {10.1093/nar/gky1049},
url = {https://doi.org/10.1093/nar/gky1049},
eprint = {http://oup.prod.sis.lan/nar/article-pdf/47/D1/D506/27437297/gky1049.pdf},
}
@misc{liener_lltommy/rdfvalidation4humans_2019,
title = {{LLTommy}/{RDFvalidation}4humans},
copyright = {MIT},
url = {https://github.com/LLTommy/RDFvalidation4humans},
abstract = {Repository for the Biohackathon 2019, Fukuoka. Topic: Human readable RDF validation},
urldate = {2019-12-20},
author = {Liener, Thomas},
month = sep,
year = {2019},
note = {original-date: 2019-08-29T12:42:23Z}
}
@misc{noauthor_weso/rdfshape-client_2019,
title = {weso/rdfshape-client},
copyright = {MIT},
url = {https://github.com/weso/rdfshape-client},
abstract = {RDFShape client. Contribute to weso/rdfshape-client development by creating an account on GitHub.},
urldate = {2019-12-20},
publisher = {Web Semantics Oviedo, University of Oviedo},
month = dec,
year = {2019},
note = {original-date: 2019-08-13T09:34:43Z}
}
@misc{noauthor_weso/rdfshape_2019,
title = {weso/rdfshape},
copyright = {MIT},
url = {https://github.com/weso/rdfshape},
abstract = {RDF Playground server. Contribute to weso/rdfshape development by creating an account on GitHub.},
urldate = {2019-12-20},
publisher = {Web Semantics Oviedo, University of Oviedo},
month = dec,
year = {2019},
note = {original-date: 2014-06-04T15:07:04Z},
keywords = {rdf, rdf-library, rdf-validator, scala, shacl, shex, turtle}
}
@misc{noauthor_http4s_nodate,
title = {http4s {\textbar} http4s},
url = {https://http4s.org/},
urldate = {2019-12-20}
}
@misc{noauthor_apache_nodate,
title = {Apache {Jena}},
url = {https://jena.apache.org/},
urldate = {2019-12-20}
}
@misc{guindon_eclipse_nodate,
title = {Eclipse {RDF}4J {\textbar} {The} {Eclipse} {Foundation}},
url = {https://rdf4j.eclipse.org/},
abstract = {a Java RDF/SPARQL framework},
language = {en},
urldate = {2019-12-20},
journal = {Eclipse rdf4j},
author = {Guindon, Christopher}
}
@misc{noauthor_weso/umlshaclex_2019,
title = {weso/{umlShaclex}},
copyright = {MIT},
url = {https://github.com/weso/umlShaclex},
abstract = {Converter from ShEx/SHACL to UML-like diagrams. Contribute to weso/umlShaclex development by creating an account on GitHub.},
urldate = {2019-12-20},
publisher = {Web Semantics Oviedo, University of Oviedo},
month = dec,
year = {2019},
note = {original-date: 2018-06-09T05:35:45Z}
}
@misc{noauthor_weso/shaclex_2019,
title = {weso/shaclex},
copyright = {MIT},
url = {https://github.com/weso/shaclex},
abstract = {SHACL/ShEx implementation . Contribute to weso/shaclex development by creating an account on GitHub.},
urldate = {2019-12-20},
publisher = {Web Semantics Oviedo, University of Oviedo},
month = dec,
year = {2019},
note = {original-date: 2016-03-30T15:00:01Z},
keywords = {earl-report, rdf-library, scala, schema, shacl, shex}
}
@misc{noauthor_weso/srdf_2019,
title = {weso/srdf},
copyright = {MIT},
url = {https://github.com/weso/srdf},
abstract = {Simple RDF interface. Contribute to weso/srdf development by creating an account on GitHub.},
urldate = {2019-12-20},
publisher = {Web Semantics Oviedo, University of Oviedo},
month = dec,
year = {2019},
note = {original-date: 2019-10-01T05:55:19Z},
keywords = {cats, cats-effect, rdf, rdf-interface, rdf-libraries, scala}
}
@misc{noauthor_cd2h_nodate,
title = {{CD}2H},
url = {https://ctsa.ncats.nih.gov/cd2h/},
language = {en-US},
urldate = {2019-12-20}
}
@misc{noauthor_ctsa_nodate,
title = {{CTSA} {Data} {Discovery} {Engine}},
url = {http://discovery.biothings.io/},
abstract = {A CD2H PROJECT TO PROMPT FAIR DATA-SHARING BEST PRACTICES \& MAXIMIZE THE RESEARCH IMPACT OF CTSA HUBS},
language = {en},
urldate = {2019-12-20},
journal = {http://discovery.biothings.io/}
}
@misc{noauthor_ctsa_nodate-1,
title = {{CTSA} {Data} {Discovery} {Engine} {\textbar} {Schema} playground},
url = {https://discovery.biothings.io/schema-playground},
abstract = {A CD2H PROJECT TO PROMPT FAIR DATA-SHARING BEST PRACTICES \& MAXIMIZE THE RESEARCH IMPACT OF CTSA HUBS},
language = {en},
urldate = {2019-12-20},
journal = {https://discovery.biothings.io/schema-playground}
}
@article{wilkinson_evaluating_2018,
title = {Evaluating {FAIR}-{Compliance} {Through} an {Objective}, {Automated}, {Community}-{Governed} {Framework}},
copyright = {© 2018, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/},
url = {https://www.biorxiv.org/content/10.1101/418376v2},
doi = {10.1101/418376},
abstract = {{\textless}h3{\textgreater}Abstract{\textless}/h3{\textgreater} {\textless}p{\textgreater}With the increased adoption of the FAIR Principles, a wide range of stakeholders, from scientists to publishers, funding agencies and policy makers, are seeking ways to transparently evaluate resource FAIRness. We describe the FAIR Evaluator, a software infrastructure to register and execute tests of compliance with the recently published FAIR Metrics. The Evaluator enables digital resources to be assessed objectively and transparently. We illustrate its application to three widely used generalist repositories - Dataverse, Dryad, and Zenodo - and report their feedback. Evaluations allow communities to select relevant Metric subsets to deliver FAIRness measurements in diverse and specialized applications. Evaluations are executed in a semi-automated manner through Web Forms filled-in by a user, or through a JSON-based API. A comparison of manual vs automated evaluation reveals that automated evaluations are generally stricter, resulting in lower, though more accurate, FAIRness scores. Finally, we highlight the need for enhanced infrastructure such as standards registries, like FAIRsharing, as well as additional community involvement in domain-specific data infrastructure creation.{\textless}/p{\textgreater}},
language = {en},
urldate = {2019-12-20},
journal = {bioRxiv},
author = {Wilkinson, Mark D. and Dumontier, Michel and Sansone, Susanna-Assunta and Santos, Luiz Olavo Bonino da Silva and Prieto, Mario and McQuilton, Peter and Gautier, Julian and Murphy, Derek and Crosas, Mercѐ and Schultes, Erik},
month = sep,
year = {2018},
pages = {418376}
}
@misc{noauthor_home_nodate,
title = {Home - integbio.jp},
url = {https://integbio.jp/en/},
urldate = {2019-12-20}
}
@Inbook{Labra-Gayo2019,
author="Labra-Gayo, Jose Emilio
and Garc{\'i}a-Gonz{\'a}lez, Herminio
and Fern{\'a}ndez-Alvarez, Daniel
and Prud'hommeaux, Eric",
editor="Alor-Hern{\'a}ndez, Giner
and S{\'a}nchez-Cervantes, Jos{\'e} Luis
and Rodr{\'i}guez-Gonz{\'a}lez, Alejandro
and Valencia-Garc{\'i}a, Rafael",
title="Challenges in RDF Validation",
bookTitle="Current Trends in Semantic Web Technologies: Theory and Practice",
year="2019",
publisher="Springer International Publishing",
address="Cham",
pages="121--151",
abstract="The RDF data model forms a cornerstone of the Semantic Web technology stack. Although there have been different proposals for RDF serialization syntaxes, the underlying simple data model enables great flexibility which allows it to be successfully employed in many different scenarios and to form the basis on which other technologies are developed. In order to apply an RDF-based approach in practice it is necessary to communicate the structure of the data that is being stored or represented. Data quality is of paramount importance for the acceptance of RDF as a data representation language and it must be enabled by the use of tools that can check if some data conforms to some specific structure. There have been several recent proposals for RDF validation languages like ShEx and SHACL. In this chapter, we describe both proposals and enumerate some challenges and trends that we foresee with regards to RDF validation. We devote more space to what we consider one of the main challenges, which is to compare ShEx and SHACL and to understand their underlying foundations. To that end, we propose an intermediate language and show how ShEx and SHACL can be converted to it.",
isbn="978-3-030-06149-4",
doi="10.1007/978-3-030-06149-4_6",
url="https://doi.org/10.1007/978-3-030-06149-4_6"
}
@misc{schema_org_noauthor_home_nodate,
title = {Home - schema.org},
url = {http://schema.org/},
urldate = {2020-01-07}
}
@misc{noauthor_ejp_nodate,
title = {{EJP} {RD} – {European} {Joint} {Programme} on {Rare} {Diseases}},
url = {http://www.ejprarediseases.org/},
language = {en-US},
urldate = {2020-01-07}
}
@misc{jupp_ejp-rd-vpresource-metadata-schema_2019,
title = {ejp-rd-vp/resource-metadata-schema},
url = {https://github.com/ejp-rd-vp/resource-metadata-schema},
abstract = {Metadata model and schemas for the EJP virtual platform},
urldate = {2020-01-07},
publisher = {ejp-rd-vp},
author = {Jupp, Simon and Cornet, Ronald and Rajaram and Holub, Petr and PhilipvD},
month = sep,
year = {2019},
note = {original-date: 2019-05-01T08:56:28Z},
keywords = {biosample-registries, ejp, metadata, rare-disease-registries, registries}
}