---
title: 'BioHackJP 2023 Report R4: RDF Data integration using Shape Expressions'
title_short: 'BioHackJP 2023 Data Integration ShEx'
tags:
  - Linked Data
  - Shape Expressions
  - RDF
  - Data Integration
  - SPARQL
authors:
  - name: Jose Emilio Labra Gayo
    orcid: 0000-0001-8907-5348
    affiliation: 1
  - name: Andra Waagmeester
    orcid: 0000-0001-9773-4008
    affiliation: 2
  - name: Yasunori Yamamoto
    orcid: 0000-0002-6943-6887
    affiliation: 3
  - name: Ángel Iglesias Préstamo
    orcid: 0009-0004-0686-4341
    affiliation: 1
  - name: Toshiaki Katayama
    orcid: 0000-0003-2391-0384
    affiliation: 3
  - name: Thomas Liener
    orcid: 0000-0003-3257-9937
    affiliation: 4
  - name: Deepak R. Unni
    orcid: 0000-0002-3583-7340
    affiliation: 6
  - name: Jerven Bolleman
    orcid: 0000-0002-7449-1266
    affiliation: 6
  - name: Kiyoko F. Aoki-Kinoshita
    orcid: 0000-0002-6662-8015
    affiliation: 7
  - name: Masashi Yokochi
    orcid: 0000-0002-3253-7449
    affiliation: 8
  - name: Núria Queralt Rosinach
    orcid: 0000-0003-0169-8159
    affiliation: 9
  - name: Hiroshi Mori
    orcid: 0000-0003-0806-7704
    affiliation: 10
  - name: Daniel Fernández Álvarez
    orcid: 0000-0002-8666-7660
    affiliation: 1
  - name: Alberto Labarga
    orcid: 0000-0001-6781-893X
    affiliation: 11
  - name: Robert Hoehndorf
    orcid: 0000-0001-8149-5890
    affiliation: 13
  - name: Eric Prud’hommeaux
    orcid: 0000-0003-1775-9921
    affiliation: 14
  - name: Claude Nanjo
    orcid: 0009-0002-1208-8858
    affiliation: 15
  - name: Nishad Thalhath
    orcid: 0000-0001-9845-9714
    affiliation: 12
  - name: Yoko Okabeppu
    orcid: 0000-0000-0000-0000
    affiliation: 16
affiliations:
  - name: WESO Lab, University of Oviedo, Spain
    index: 1
  - name: Micelio, Belgium
    index: 2
  - name: Database Center for Life Science, Japan
    index: 3
  - name: Unaffiliated
    index: 4
  - name: SIB Swiss Institute of Bioinformatics, Switzerland
    index: 6
  - name: Soka University, Hachioji, Tokyo, Japan
    index: 7
  - name: Osaka University, Suita, Osaka, Japan
    index: 8
  - name: Leiden University Medical Center, Netherlands
    index: 9
  - name: National Institute of Genetics, Mishima, Japan
    index: 10
  - name: Barcelona Supercomputing Center
    index: 11
  - name: RIKEN Center for Integrative Medical Sciences, Yokohama, JP
    index: 12
  - name: Janeiro Digital, USA
    index: 13
  - name: MedOntology, LLC, USA
    index: 14
  - name: OKBP inc. Japan
    index: 15
date: 30 June 2023
cito-bibliography: paper.bib
event: BH23JP
biohackathon_name: "BioHackathon Japan 2023"
biohackathon_url: "https://2023.biohackathon.org/"
biohackathon_location: "Kagawa, Japan, 2023"
group: R4
# URL to project git repo --- should contain the actual paper.md:
git_url: https://github.com/biohackathon-japan/bh23-dataintegrationshex
# This is the short authors description that is used at the
# bottom of the generated paper (typically the first two authors):
authors_short: Jose E. Labra \emph{et al.}
---

# Background

<!-- TODO: Review by Toshiaki -->

In this report, we describe the activities we carried out during BioHackathon 2023, held in Shodoshima, Japan. The main goal of the project has been to identify approaches, and the issues they raise, for integrating large RDF datasets by creating subsets described by [Shape Expressions](http://shex.io/) [@EricSemantics2014]. We have recently submitted a publication on creating [subsets from Wikidata](https://www.semantic-web-journal.net/content/wikidata-subsetting-approaches-tools-and-evaluation-0) [@SeyedWikidataSubsetting23].
Wikidata is a knowledge graph which is constantly in flux and has grown to a size that makes it hard to replicate locally. By creating topical subsets we can dissect a manageable subset that can be loaded into local RDF stores for further processing. However, this subsetting approach relies on the Wikidata daily dumps, which are available in JSON format. For this hackathon we specifically chose to extend the subsetting mechanisms to work on RDF dumps or SPARQL endpoints.

RDF data provides a solution for data interoperability which, in principle, can enable different data sources to be smoothly integrated. Nevertheless, in practice, the growing adoption of RDF has also made consuming RDF data challenging: the size of many RDF collections makes them infeasible to collect or manipulate easily. As an example, UniProt RDF is around 110 billion triples, or over 700 gigabytes of gzipped RDF/XML, and PubChem RDF is in the same order of magnitude in volume; both resources describe hundreds of different kinds of data. This situation requires intermediate agents that describe the RDF data being consumed and provide resources to manipulate it. As an example, DBCLS has created the RDF-config tool, which provides descriptions of the RDF data collections it covers.

In order to facilitate the integration of such RDF data, this project has been exploring ways to create subsets of RDF data which contain only the information of interest for some specific task. Making those subsets easily available to researchers could facilitate research activities and give the RDF data more value.

Creating subsets of RDF data can also help when the information available in those large RDF data sources is continually evolving: if a researcher wants to do reproducible research based on those SPARQL endpoints, the results of the same queries may differ over time. A tool that creates subsets can thus be seen as a way to create snapshots of the RDF data, which can later be packaged and distributed alongside the research results, supporting reproducible research based on RDF data. Another reason for creating RDF subsets is to obtain a subset from multiple data sources when it is unfeasible to get a designated dataset using federated queries, due to the data sizes and the technical immaturity of SPARQL federated query handling.

The Shape Expressions (ShEx) language was designed as a concise and human-readable language to describe and validate RDF data. Although it was not initially designed to create subsets, having a concise way to describe RDF makes it an ideal candidate for this task. Indeed, ShEx has already been used to create subsets of Wikidata; a list of the different subsetting approaches that have been used in Wikidata can be found in the paper cited above: <https://www.semantic-web-journal.net/content/wikidata-subsetting-approaches-tools-and-evaluation-0>.

The main goals of this project have been to review the different approaches to create subsets of large RDF data in order to facilitate the integration of different RDF collections. We identified the participation of 3 main agents:

- Data producers or providers, who are interested in producing RDF data that has more value and can be used by third parties.
- Data consumers, who are interested in an easy way to get access to the RDF data available in those sources and to create reproducible workflows which can contain manageable RDF subsets.
- Data integrators, who can help in the intermediate process of bringing the RDF data produced by the providers to the end consumers.

One important aspect to take into account is that the RDF data may need to be transformed, with actions like changing URI prefix declarations or manipulating the topology of the RDF data. Another aspect is the need to understand the structure of the RDF data collected and to agree on a common structure. For this, Shape Expressions also offers the possibility to validate both the RDF data that is produced and the RDF data that is consumed, avoiding the need for defensive programming techniques.
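To give a flavour of the notation used throughout this report, the following is a minimal, purely illustrative ShEx schema (the `ex:` vocabulary is made up); it states that a conforming node must have exactly one string name and may have any number of annotations, each of which must itself conform to a second shape:

```shex=
PREFIX ex: <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# A node conforming to <RecordShape> has exactly one name
# and zero or more annotations conforming to <AnnotationShape>.
<RecordShape> {
  ex:name xsd:string ;
  ex:annotation @<AnnotationShape> *
}

<AnnotationShape> {
  ex:comment xsd:string
}
```

The same schema can drive subsetting: starting from the nodes that match `<RecordShape>`, a subsetting tool can keep only the triples visited while checking conformance.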
# Outcomes

## Creation of ShEx based subsets

One important aspect of data integration is the possibility of creating small subsets from large RDF data. During the biohackathon we explored several use cases taking data from UniProt, PubChem, TogoGenome, etc. We created a GitHub repository called [subsetting-examples](https://github.com/shex-consolidator/subsetting-examples) that contains some example SPARQL queries, ShEx schemas and scripts used during the biohackathon. From those, two of the example queries were rewritten so that they are accepted by the [`pschema-rs`](https://github.com/angelip2303/pschema-rs) validator; more on this is discussed later on. The use cases that we explored are the following:

### UniProt subset based on proteins

A subset of proteins in UniProt are glycosylated; these are known as glycoproteins. We wanted to extract all glycoproteins from UniProt by taking those entries that had glycosylation annotations. We did a first experiment taking as input RDF dumps from UniProt. Kiyoko provided a [simple SPARQL query](https://github.com/shex-consolidator/subsetting-examples/blob/master/protein/protein.sparql) that obtains proteins and annotations. Jose Labra converted the SPARQL query to [ShEx](https://github.com/shex-consolidator/subsetting-examples/blob/master/protein/protein.shex). That ShEx schema was used as input to the [PSchema tool](https://github.com/angelip2303/pschema-rs) that has been developed by Ángel Iglesias in Rust, and we created a subset that was later [published in Zenodo](https://doi.org/10.5281/zenodo.8086938) with its own DOI.
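As a rough sketch of the kind of schema involved (the actual schema used is the one in the repository linked above; the shape names here are invented, and the UniProt core vocabulary `up:` terms are used informally), a shape selecting glycoproteins could look like this:

```shex=
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# A glycoprotein: a protein carrying at least one glycosylation annotation.
# EXTRA allows other kinds of up:annotation triples to be present
# without making the shape fail.
<GlycoproteinShape> EXTRA up:annotation {
  rdf:type [ up:Protein ] ;
  up:annotation @<GlycosylationShape> +
}

<GlycosylationShape> {
  rdf:type [ up:Glycosylation_Annotation ]
}
```

Validating a dump against such a shape and keeping the matched triples would yield the glycoprotein subset described above.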
### UniProt subset based on Subcellular locations

<!-- TODO: Review by: Yasunori -->

Yasunori provided an example [SPARQL query about Subcellular locations](https://github.com/shex-consolidator/subsetting-examples/blob/master/subcellular-locations/subcellular-locations.sparql) that was later converted by Jose Labra to [ShEx](https://github.com/shex-consolidator/subsetting-examples/blob/master/subcellular-locations/subcellular-locations.shex) in order to create subsets from UniProt. The query obtains UniProt annotations based on an amino-acid position of a specific protein, in particular to know whether an amino-acid residue is located in a cytoplasmic or a transmembrane domain. The subsets were generated by Ángel Iglesias using the PSchema Rust tool, which was interesting as we discovered a bug that required an update of the tool. In the following section we comment on the improvements that led to the results obtained.

### The results obtained from the experiment

<!-- TODO: Review by: Labra -->

It is worth mentioning that we used the `uniprotkb_reviewed_viruses_10239_0.rdf` dump (with a size of 1.13 GB) from the UniProt Knowledgebase to create the subsets. As it is serialized in RDF/XML, a tiny pipeline is required to convert it into N-Triples, the serialization format accepted by the tool. The script that automatically processes the [compressed dump](https://github.com/angelip2303/pschema-rs/blob/main/examples/from_uniprot/uniprotkb_reviewed_viruses_10239_0.rdf.xz) can be seen in the [examples](https://github.com/angelip2303/pschema-rs/blob/main/examples/bh23/run.sh) section of the tool's repository.

With that in place, `pschema-rs` was tested against the Shape Expressions described above, tracking the time the tool takes to process the dataset and the memory consumed. The program was run three times for each Shape Expression, after which the average results were calculated. The experiment was run on a machine with an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz (12 cores and 24 threads) and 40 GB of RAM. The results are presented in the table below:

| Shape Expression | Number of initial triples | Number of resulting triples | Time (s) | Memory consumption (GB) |
| ---------------- | ------------------------- | --------------------------- | -------- | ----------------------- |
| `protein` | 7,346,129 | 226,241 | 14.58 | 3.87 |
| `subcellular_location` | 7,346,129 | 1,084,151 | 37.76 | 3.75 |

Regarding the results, it is important to note that, despite being in the early stages of development, the tool demonstrates the capability to process large datasets in a reasonable amount of time while maintaining an acceptable level of memory consumption. To achieve this, numerous optimizations were implemented, including a system that converts the textual representations of subject-predicate-object triples into numeric ones. This approach enables the tool to use less memory and to improve its speed through better cache behaviour. Additionally, we observed that the memory usage of the tool does not appear to be directly related to the complexity of the expression, but rather to the number of triples matched. However, further testing is being conducted to make this assertion more robust.
### Analysis of use cases to extract subsets from PubChem, GlyCosmos and Reactome

Kiyoko suggested that we could use [example 10](https://pubchem.ncbi.nlm.nih.gov/docs/rdf-use-cases#section=Case-10-Summarize-the-statistics-about-the-total-number-of-substances-tested-in-the-PubChem-database-against-each-protein-target-) from the PubChem documentation, which works on PubChem RDF data. This query could be used to link the proteins in PubChem with those in UniProt. We analysed the [RDF data dumps](https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/) provided by PubChem.

Another use case, suggested from GlyCosmos, was this SPARQL query:

```sparql=
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX gco: <http://purl.jp/bio/12/glyco/conjugate#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX mass: <https://glycoinfo.gitlab.io/wurcsframework/org/glycoinfo/wurcsframework/1.0.1/wurcsframework-1.0.1.jar#>

SELECT (SUBSTR(STR(?protein), 33) AS ?ac) ?beginP ?saccharide
WHERE {
  ?protein gco:glycosylated_at ?glyco_site .
  ?glyco_site faldo:location ?site .
  ?site faldo:position ?beginP .
  ?glyco_site gco:has_saccharide ?saccharide .
}
```

That SPARQL query should work against the [GlyCosmos SPARQL endpoint](https://ts.alpha.glycosmos.org/sparql), which follows [this schema](https://glycosmos.org/programmatic). The RDF dump can be obtained by downloading the results of this SPARQL query as RDF:

```sparql=
PREFIX gco: <http://purl.jp/bio/12/glyco/conjugate#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>

CONSTRUCT {
  ?protein gco:glycosylated_at ?glyco_site .
  ?glyco_site faldo:location ?site .
  ?site faldo:position ?beginP .
  ?glyco_site gco:has_saccharide ?saccharide .
}
WHERE {
  ?protein gco:glycosylated_at ?glyco_site .
  ?glyco_site faldo:location ?site .
  ?site faldo:position ?beginP .
  ?glyco_site gco:has_saccharide ?saccharide .
}
```
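Connecting this query with the subsetting approach described earlier, the same graph pattern could be captured as a ShEx schema along the following lines (an illustrative sketch we did not run; the shape names are invented and `xsd:integer` is assumed for `faldo:position`):

```shex=
PREFIX gco: <http://purl.jp/bio/12/glyco/conjugate#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Proteins glycosylated at one or more sites.
<GlycosylatedProteinShape> {
  gco:glycosylated_at @<GlycosylationSiteShape> +
}

# Each glycosylation site has a location and an attached saccharide.
<GlycosylationSiteShape> {
  faldo:location @<SiteLocationShape> ;
  gco:has_saccharide IRI
}

<SiteLocationShape> {
  faldo:position xsd:integer
}
```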
### Reactome

Kiyoko also suggested a query from Reactome for extracting pathways involving glycoproteins. Reactome data are currently not accessible from any endpoint, so the data would first need to be downloaded in BioPAX level 3 format and loaded into a triplestore. Based on this, Kiyoko suggested using this SPARQL query:

```sparql=
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>

SELECT DISTINCT ?pathway_ID ?pathwayName ?organism
FROM <http://rdf.glycosmos.org/pathway_reactome_v83>
WHERE {
  ?pathway a biopax3:Pathway ;
           biopax3:displayName ?pathwayName ;
           biopax3:organism/biopax3:name ?organism ;
           biopax3:xref ?unixref .
  ?unixref biopax3:db ?db_g .
  FILTER(STR(?db_g) = "GlyCosmos")
  ?pathway biopax3:xref ?unixref_r .
  ?unixref_r biopax3:db ?db_r ;
             biopax3:id ?pathway_ID .
  FILTER(STR(?db_r) = "Reactome")
}
```

which should be run against the following SPARQL endpoint: <https://ts.glycosmos.org/sparql>. The underlying data correspond to [Reactome's BioPAX export](https://reactome.org/download/current/biopax.zip).

### TogoGenome

<!-- TODO: Review by Yoko Okabeppu -->

Hiroshi Mori and Yoko Okabeppu created this [example SPARQL query](https://is.gd/MGeyxI), which works against the [TogoGenome SPARQL endpoint](http://togogenome.org/sparql). TogoGenome is an RDF-based genome database developed by DBCLS. The query retrieves the list of UniProt protein IDs for a single microbial strain (*Hydrogenobacter thermophilus* TK-6). Jose Labra converted the SPARQL query to [ShEx](https://github.com/shex-consolidator/subsetting-examples/blob/master/togogenome/togogenome.shex).

## Subsetting book

<!-- TODO: Review by Andra Waagmeester -->

Andra Waagmeester created [an executable book](https://rdfsubsetting.github.io/) which will contain examples and descriptions about how to create RDF subsets.

<!-- TODO: Review by Nuria Queralt -->

The book was automatically published using GitHub Actions created by Núria Queralt Rosinach.

## Work on sheXer

<!-- TODO: Review by Yasunori -->

sheXer is a tool that extracts ShEx descriptions from RDF content (in local dumps or remote files/endpoints) by mining instance-level data. sheXer is able to produce ShEx schemas and constraint-level stats w.r.t. compliance with the original RDF input. The stats are expressed as inline comments next to the extracted shapes. A problem with this tool is that it can raise out-of-memory errors or spend too much computation time when handling very large RDF dumps. One possibility to overcome this issue is to split the RDF dumps into portions, run sheXer on those portions, and consolidate their results.

During the biohackathon, Daniel Fernández created a tool that consolidates ShEx results. The proposed solution is able to handle sheXer's output, so it is adequate for this use case. However, it assumes some structural features in the ShEx inputs, so it may raise issues when used with input ShEx files not generated by sheXer. The tool is able to detect shape labels mentioned in more than one result file and merge their internal constraints into a new single shape. The integration of constraints ensures conformance between every instance used to extract a shape in the input files and its corresponding merged shape. This is achieved by including a proper cardinality when a certain constraint is not observed among all the shapes to merge, as illustrated below. The tool is also capable of keeping sheXer's stats regarding the number of instances and the frequency of each observed constraint. However, the stats attached to merged shapes may be affected by cases in which a certain instance was used to extract a shape in N (more than one) partial dumps; in such cases, the instance will have N times more weight than it should in the merged stats.
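As an illustration of the cardinality-relaxation rule (a made-up example over a hypothetical `ex:` vocabulary, not actual sheXer output), consider a shape extracted from two portions of a dump and its consolidated result:

```shex=
PREFIX ex: <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Extracted from portion 1: every instance had a name and a sequence.
<Protein_portion1> {
  ex:name xsd:string ;
  ex:sequence xsd:string
}

# Extracted from portion 2: instances had a name, but no sequence was observed.
<Protein_portion2> {
  ex:name xsd:string
}

# Consolidated shape: ex:sequence becomes optional (cardinality ?),
# so every instance from both portions still conforms.
<Protein_consolidated> {
  ex:name xsd:string ;
  ex:sequence xsd:string ?
}
```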
## Use case about Multi-omics data integration

<!-- TODO: Review by Alberto Labarga -->
<!-- TODO: Review by Robert Hoehndorf -->

We discussed a possible use case about data integration in multi-omics:

- [Med2RDF](http://med2rdf.org/)
- UniProt
- A possible example from natural language: "Patient with headache and some symptoms"
- Tasks to do: obtain relationships, extract triplets, obtain treatments/possible diagnostics/tests
- HPO
- <https://athena.ohdsi.org/> (OMOP vocabularies)
- Generate RDF from relationships

Possible tools to use: [ShExML](http://shexml.herminiogarcia.com/).

## Analysis about RDF data description mechanisms

<!-- TODO: Review by Toshiaki -->

As with any data source, it is necessary to describe its structure, and if we want to obtain subsets of RDF data, the RDF data has to be described in sufficient detail to facilitate the subsetting process. One way is to write ShEx to describe what the subset ought to be. Another approach is to see whether other modes of representation can be used to describe the structure of RDF data and then generate shapes from that description. As part of this effort, we explored the following:

- [LinkML](https://linkml.io/): the Linked Data Modeling Language (LinkML) is a flexible modeling language that allows you to define schemas in YAML that describe the structure of your data. LinkML also provides a framework for translating this YAML representation into other forms like ShEx.
- [RDF-config](https://github.com/dbcls/rdf-config): RDF-config is a tool that generates SPARQL queries, schema diagrams and, most importantly, ShEx from a simple YAML-based configuration that conforms to a specification.

Both LinkML and RDF-config describe the structure of RDF data in YAML and have sufficient tooling to automatically generate ShEx shapes. As a result, we decided to compare the ShEx output of both tools using PubChem as the data source of interest. PubChem provides an RDF endpoint and has captured the structure of its RDF data as a LinkML YAML, which serves as the input to LinkML for generating ShEx shapes. Deepak provided the [ShEx shapes derived from the YAML](https://github.com/shex-consolidator/subsetting-examples/blob/master/linkml_rdfconfig/pubchemrdf_by_linkml.shex). Alternatively, Toshiaki had captured the structure of the PubChem RDF data in the YAML-based configuration of RDF-config; this YAML file was then used to generate ShEx shapes with the RDF-config tools. Toshiaki provided the [ShEx shapes derived from the YAML](https://github.com/shex-consolidator/subsetting-examples/blob/master/linkml_rdfconfig/pubchem_by_rdfconfig.shex).

It is clear that the philosophy adopted by LinkML and RDF-config is similar:

- define the structure of RDF data in YAML
- automatically generate ShEx shapes based on the YAML

Next, we analyzed the shapes, and a thorough side-by-side comparison revealed that the shapes generated by RDF-config are cleaner and simpler. This can be attributed to the fact that the RDF-config YAML was prepared after investigating the PubChem RDF endpoint and carefully cataloguing the classes, properties, and predicates. On the other hand, the shapes generated by LinkML contained more properties and used types that are native to LinkML, which provide more expressivity. The following is an example of the `Taxonomy` shape generated by LinkML:

```shex=
<Taxonomy> CLOSED {
  (
    $<Taxonomy_tes> (
      rdf:type @linkml:Uri * ;
      skos:prefLabel @linkml:String ? ;
      skos:altLabel @linkml:String ? ;
      owl:sameAs @linkml:Uri ? ;
      skos:closeMatch @linkml:Uri *
    ) ;
    rdf:type [ <Taxonomy> ] ?
  )
}
```

In the case of RDF-config, we found that the initial `Taxonomy` shape was the following:

```shex=
<PubChemTaxonomyShape> {
  rdf:type [ biopax:taxonomy ] ;
  dcterms:title xsd:string ;
  skos:closeMatch IRI *
}
```

This was later improved during the biohackathon; the updated version obtained from RDF-config was:

```shex=
<PubChemTaxonomyShape> {
  rdf:type [ sio:SIO_010000 ] ;
  skos:prefLabel xsd:string ;
  skos:altLabel xsd:string * ;
  skos:closeMatch IRI * ;
  cito:isDiscussedBy IRI * ;
  owl:sameAs IRI
}
```

One aspect of the shapes generated by LinkML is that the constraint applied to the `rdf:type` property needs to be improved. In the example above, the constraint on `rdf:type` is self-referential, i.e. it states that all instances of `Taxonomy` must have `rdf:type` `Taxonomy`. This is the default ShEx expression from LinkML and definitely needs improvement.
Another way to express the `rdf:type` constraint would be to explicitly state the values for `rdf:type` as an enumeration in the LinkML YAML, such that this is parsed appropriately. When we tried expressing the enumeration and then translating the YAML to ShEx, we realized that the enumerations were not being parsed properly. There is a [ticket](https://github.com/linkml/linkml/issues/1513) in the LinkML repository on GitHub and a corresponding [pull request](https://github.com/linkml/linkml/pull/1516) that fixes this issue.
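For reference, the enumeration-based encoding we attempted looks roughly like this in LinkML YAML (a fragment only, following the same pattern as the working LinkML example shown later in this report; the enum name is invented):

```yaml=
classes:
  Taxonomy:
    slot_usage:
      type:
        slot_uri: rdf:type
        range: TaxonomyTypeEnum   # constrain rdf:type to the enum values

enums:
  TaxonomyTypeEnum:
    permissible_values:
      taxonomy:
        meaning: sio:SIO_010000   # the class IRI used by PubChem RDF
```

With this encoding, the generated ShEx would be expected to constrain `rdf:type` to `[sio:SIO_010000]` instead of the self-referential `[<Taxonomy>]`.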
# Future work

## Add FROM <...> to ShEx

<!-- TODO: Review by Eric Prud'hommeaux -->

- Problem to solve: how to validate/extract RDF data when it is spread across different RDF graphs?
- Currently, ShEx processors assume a single RDF graph, but it may be interesting to add some kind of annotation to ShEx which allows ShEx processors to take into account the `FROM` declarations when they are working against SPARQL endpoints.

## A machine-readable way for RDF data providers to describe their RDF dumps

<!-- TODO: Review by Jerven Bolleman -->

One problem for creating RDF data subsets from RDF data dumps is that not all RDF providers follow the same structure to publish their RDF dumps, and the descriptions of the contents of those dumps are not published. To solve that problem, it would be interesting to follow some guidelines or common patterns, such as providing files in the `/.well-known/` namespace. These files should contain, or point to, machine-readable descriptions in either ShEx or SHACL. During the biohackathon we reviewed the ways different data providers describe their RDF dumps. Some of the providers we took into account were:

- [UniProt](https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/)
- [PubChem](https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/)
- [TogoGenome](http://togogenome.org/rdf/)

In order to add `.well-known/void` declarations, Jerven Bolleman already had the [void-generator](https://github.com/JervenBolleman/void-generator). [VoID](https://www.w3.org/TR/void/) is a framework to describe the statistical distribution of data elements and their links. During the biohackathon, Jerven Bolleman and Jose Labra started a new project called [void2shapes](https://github.com/shex-consolidator/void2shapes) to generate ShEx and SHACL declarations from [VoID descriptions](https://www.w3.org/TR/void/) in UniProt. We succeeded in generating a minimal SHACL representation from the VoID files of [UniProt](https://sparql.uniprot.org/sparql) and [SwissLipids](https://beta.sparql.swisslipids.org/sparql).

## Improve the ShEx tooling

<!-- TODO: Review by Nishad -->

We are currently transitioning some of the ShEx implementations that were initially written in Scala to Rust. We have also been looking at tools in the Rust ecosystem, like the [Serverless SPARQL endpoint](https://github.com/nishad/serverless-sparql-endpoint) that was developed by Nishad in JavaScript for Deno on top of the [Oxigraph database](https://oxigraph.org/), which is implemented in Rust and has a binding that works in [WebAssembly](https://webassembly.org/). We consider that this solution could enable better performance, dynamic scalability, and edge-computing use cases. One possible line of work would be to adapt the upcoming Rust implementation of ShEx to WebAssembly, so that it could be integrated with Oxigraph and perform ShEx validation on the fly.

## Improvements to sheXer

<!-- TODO: Review by: Yasunori -->

- Automatic splitting of large RDF data files
- Parallel/distributed extraction of ShEx portions
- Automatic consolidation of ShEx portions
- Generation of RDF-config YAML files

## SPARQL to ShEx converter

- We noticed that in several use cases we start with an example SPARQL query which we later want to convert to ShEx. An interesting project would be to generate ShEx from those SPARQL queries, as sketched below.

<!-- TODO: Review by Jose Labra -->
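As a toy illustration of the intended mapping (hypothetical `ex:` vocabulary; note that datatypes such as `xsd:string` cannot be derived from the query alone and would have to be guessed or supplied), a query like:

```sparql=
PREFIX ex: <http://example.org/>

SELECT ?gene ?name WHERE {
  ?gene a ex:Gene ;
        ex:name ?name .
}
```

could be converted into a shape that accepts the nodes matching `?gene`:

```shex=
PREFIX ex: <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# One triple constraint per triple pattern whose subject is ?gene.
<GeneShape> {
  a [ ex:Gene ] ;
  ex:name xsd:string
}
```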
## Explore the idea of ShEx transitions

- Being able to declare transitions between ShEx schemas so that they can be checked as pre- and post-conditions, for example before/after SPARQL updates. This idea was provided by Thomas Liener [in this document](https://docs.google.com/document/d/1LyBlRuwvl6BQZkEQG6_2gc2kJrs05qaYVie-w4-Lnvc/edit?usp=sharing).

<!-- TODO: Review by Thomas Liener -->

## ShEx template injection

- Being able to have parameterizable ShEx schemas whose specific values could be injected, so that ShEx schemas can be generated dynamically.

<!-- TODO: Review by Thomas Liener -->

## RDF representation of ShapeMaps

- We resumed the work on [this issue](https://github.com/shexSpec/shex/issues/67) and we expect to update the [ShapeMap specification](http://shex.io/shape-map/) to include a description of how to represent shape maps in RDF.

<!-- Review by Eric Prud'hommeaux -->

## Generating ShEx files from VoID descriptions

- The RDF representation of ShEx uses RDF lists, which are a bit challenging to generate from SPARQL CONSTRUCT queries. Jerven Bolleman solved this by generating IRIs for the intermediate nodes, but it may be interesting to generate the list nodes as blank nodes. Eric Prud'hommeaux raised [this issue](https://github.com/apache/jena/issues/1933) in Apache Jena, and it would be interesting to see whether we can solve it or find some way around the problem.
- This work will be followed up on in the [void2shapes](https://github.com/shex-consolidator/void2shapes/) project.

<!-- TODO: Review by Eric Prud'hommeaux -->

## Conversion from ShEx to LinkML

Another topic of discussion was the possibility of converting ShEx shapes to LinkML. This would allow the conversion of existing ShEx shapes to a flexible format such as the LinkML YAML, and enable the subsequent use of the tooling provided by LinkML to generate other outputs supported by the LinkML framework. For example, if we have a ShEx shape `GeneShape`,

```shex=
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX SO: <http://purl.obolibrary.org/obo/SO_>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<GeneShape> {
  rdf:type [ SO:0000704 ] ;
  dcterms:title xsd:string ;
  dcterms:alternative xsd:string *
}
```

then we should be able to convert and represent this shape in LinkML YAML like so:

```yaml=
id: http://example.org/Example
name: Example
prefixes:
  ex: http://example.org/
  linkml: https://w3id.org/linkml/
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  SO: http://purl.obolibrary.org/obo/SO_
  dcterms: http://purl.org/dc/terms/
  xsd: http://www.w3.org/2001/XMLSchema#
imports:
  - linkml:types
default_prefix: ex
default_range: string
classes:
  Gene:
    slots:
      - type
      - title
      - alternative
    slot_usage:
      type:
        slot_uri: rdf:type
        range: TypeEnum
      title:
        slot_uri: dcterms:title
      alternative:
        slot_uri: dcterms:alternative
slots:
  type:
  title:
  alternative:
enums:
  TypeEnum:
    permissible_values:
      gene:
        meaning: SO:0000704
```

The above LinkML YAML is a working example and can be parsed using the LinkML framework. Preliminary analysis seems to indicate that the conversion is possible, but with some limitations which can be addressed if the use cases are well defined and sufficient examples are available for rapid prototyping. This is something that Jose Labra and Deepak would like to work on in the future.

## Conversion from ShEx to RDF-config YAML files

<!-- TODO: Review by Toshiaki -->

Another possibility is to explore the generation of RDF-config YAML files that follow [this specification](https://github.com/dbcls/rdf-config/blob/master/doc/spec.md).

## Use case about integrating RDF data and clinical records

<!-- TODO: Review by Claude Nanjo -->

The accelerating pace of discoveries in biotechnology offers great hope in medicine. Yet access to the clinical information necessary for research and innovation has been challenging. Clinical data often resides in proprietary stores and proprietary formats that are not amenable to integration with other data sources. The increasing adoption of the Fast Healthcare Interoperability Resources ([FHIR](https://www.hl7.org/)) standard by Electronic Health Record (EHR) vendors, and FHIR's support for RDF, offer hope that clinical data may soon be made more accessible to a broader community. Moreover, current investigations into how to support known semantic web standards such as SPARQL, ShEx and [ShExMap](https://shex.io/extensions/Map/) within FHIR all stand to make FHIR, and thereby clinical data, more accessible to the broader research community.

## Use case about drug repurposing

<!-- TODO: Review by Núria -->

Computational drug repurposing is a well-known strategy to speed up drug development, with potentially lower overall development costs and shorter development timelines thanks to de-risked 'old' compounds. However, current knowledge graphs used in AI-based drug repurposing lack metabolite-related information, which is especially interesting to include for drug discovery in metabolic diseases. In this BioHackathon, we investigated ways to retrieve metabolite-related information across several knowledge bases, such as [UniProt or PubChem](https://nuriaqueralt.github.io/bh23-onlinebook/metabolite_subsetting.html), to feed drug repurposing knowledge graphs. Even though a lot of curated metabolite data is stored, this data is neither easily nor systematically retrievable from databases.
In the future, we aim to produce efficient ways of retrieving metabolite-related information for knowledge graph construction, based on subsetting strategies such as using RDF-config to generate SPARQL queries or using ShEx-based approaches.

## Acknowledgements

We would like to thank the fellow participants at BioHackathon 2023 for their collaboration and constructive advice, which greatly influenced our project. We are grateful to the organizers for providing this platform. Special thanks to our mentors, advisors, and colleagues for their guidance and support. Without their contributions, our project on RDF data integration using Shape Expressions would not have been possible.

## References
