# Welcome to the SWAT4HCLS 2022 hackathon
## Schedule (click on CET for other timezones):
09:00-10:00 [CET](https://www.timeanddate.com/worldclock/converter.html?iso=20220113T080000&p1=1306&p2=248&p3=304&p4=56&p5=tz_et&p6=tz_pt&p7=26) [Getting ready to hack](https://discord.gg/N8zrAX9F) Discord text channel
10:00-10:30 [CET](https://www.timeanddate.com/worldclock/converter.html?iso=20220113T090000&p1=1306&p2=248&p3=304&p4=56&p5=tz_et&p6=tz_pt&p7=26) [Pitches](https://discord.gg/cPF976Md) Discord video channel
10:30-13:00 [CET](https://www.timeanddate.com/worldclock/converter.html?iso=20220113T093000&p1=1306&p2=248&p3=304&p4=56&p5=tz_et&p6=tz_pt&p7=26) [Hack, Hack, Write paper, next project proposal, Chat](https://discord.gg/N8zrAX9F)
13:00-13:30 [CET](https://www.timeanddate.com/worldclock/converter.html?iso=20220113T120000&p1=1306&p2=248&p3=304&p4=56&p5=tz_et&p6=tz_pt&p7=26) [Intermediate reports and results](https://discord.gg/Y6bBtu5d)
13:30-17:00 [CET](https://www.timeanddate.com/worldclock/converter.html?iso=20220113T123000&p1=1306&p2=248&p3=304&p4=56&p5=tz_et&p6=tz_pt&p7=26) [Hack, Hack, Write paper, next project proposal, Chat](https://discord.gg/N8zrAX9F)
17:00-17:30 [CET](https://www.timeanddate.com/worldclock/converter.html?iso=20220113T160000&p1=1306&p2=248&p3=304&p4=56&p5=tz_et&p6=tz_pt&p7=26) [Intermediate reports and results](https://discord.gg/Y6bBtu5d)
17:30-20:30 [CET](https://www.timeanddate.com/worldclock/converter.html?iso=20220113T163000&p1=1306&p2=248&p3=304&p4=56&p5=tz_et&p6=tz_pt&p7=26) [Hack, Hack, Write paper, next project proposal, Chat](https://discord.gg/N8zrAX9F)
20:30-21:00 [CET](https://www.timeanddate.com/worldclock/converter.html?iso=20220113T193000&p1=1306&p2=248&p3=304&p4=56&p5=tz_et&p6=tz_pt&p7=26) [Intermediate reports and results](https://discord.gg/Y6bBtu5d)
## Introduction
The hackathon takes place on the same Discord server as the full conference. The links in the schedule direct you to the relevant Discord channels. If needed, we can create more channels: reach out to Andra Waagmeester (on Discord @andrawaag) or post in the [General channel](https://discord.gg/sAbVDHfT).
### Documentation
This HackMD document can also be used to document progress and results.
## Topics
<img style="float: right; height: 50px;" src="https://i.imgur.com/7lscekc.png"/>
### Developing ShEx (for chemistry and beyond!)
* Lead/Contact: Denise Slenter
* Participants: Jose Labra, Eric P, Jeaphianne, Scott, Matthijs, Pablo, ...
* Channel: #shex
This group will work on developing ShEx, using https://wikishape.weso.es/ including autocomplete for Wikidata (based on https://rdfshape.weso.es/).
Please join our group (Discord channel #shex) to learn how to write ShEx (also for your own RDF data), and how to validate data against it.
We will start with the Chemistry use case: https://github.com/kg-subsetting/biohackathon2020/tree/main/use_cases/chemistry .
1. Eric:
Introduction to Shex tutorial: https://shex.io/
Getting started: https://shex.io/shex-primer/index.html
**Notes on using regular expressions (regex) in ShEx:**
PCRE is compatible with the presented ShEx tools, but some regexes might not work right away.
Regex builder (useful for learning and validating regex): https://regexper.com/ Example: https://regexper.com/#%7Bhttp%7Chttps%7D
1. Jose:
Validating against Wikidata data: https://www.weso.es/YASHE/
Find all current chemistry schemas here: https://www.wikidata.org/wiki/Wikidata:Database_reports/EntitySchema_directory#chemistry
Example Wikidata Entity Schema for Chemical compounds:
https://www.wikidata.org/wiki/EntitySchema:E239
(Note: Click on "Get entities against this schema" to retrieve some entities from Wikidata, to validate the schema against.)
View the Wikidata RDF behind an entity (example Q2546): https://www.wikidata.org/wiki/Special:EntityData/Q2546.ttl
Slides about RDF, Wikibase, ShEx: https://www.validatingrdf.com/tutorial/swat4hcls22/slides/RDF_ShEx_EntitySchemas_Labra.pdf
Slide 38 about qualifiers/references represented as RDF
Command line tool: https://www.weso.es/wb/
- Schema 1: Passes entries that have an "instance of" "chemical compound" statement, but fails entries that also have a "subclass of" x statement (where x is a child term of chemical compound); e.g. water (Q283) failed.
```
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# Example SPARQL query: select * where { ?x wdt:P31 wd:Q11173 } limit 5
# Example SPARQL query: select * where { ?x wdt:P279 / wdt:P31 wd:Q11173 } limit 5
start = @<#wikidata-chemical-compound>
<#wikidata-chemical-compound> EXTRA wdt:P31 {
(
wdt:P31 [ wd:Q11173 ] ; # chemical compound
|
wdt:P279 @<#wikidata-chemical-compound>
)
}
```
- Schema 2: Passes entries that have an "instance of" "chemical compound" statement, as well as entries that have an "instance of" x statement where x is a "subclass of" (child term of) chemical compound; e.g. water (Q283) passes.
```
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
start = @<#wikidata-chemical-compound>
<#wikidata-chemical-compound> {
wdt:P31 @<ChemicalCompoundOrSubclassOf> ;
wdt:P31 . *
}
<ChemicalCompoundOrSubclassOf> [ wd:Q11173 ]
OR {
wdt:P279 @<ChemicalCompoundOrSubclassOf>
}
```
Results:
1. New Chemistry entity schema: https://www.wikidata.org/wiki/EntitySchema:E340
Validates which Chemical Compounds have a (molecular) mass, and the provenance of this mass (PubChem or ChEMBL).
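
To pick test entities for the schemas above, the recursion in Schema 2 roughly corresponds to the SPARQL property path `wdt:P31/wdt:P279*`. The sketch below only builds the query URL and the entity RDF link, so it runs offline. The helper names are illustrative assumptions, and the path is an over-approximation: Schema 2 additionally constrains every `wdt:P279` statement on the intermediate classes.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service
CHEMICAL_COMPOUND = "Q11173"


def entity_ttl_url(qid: str) -> str:
    """Turtle export of a single Wikidata entity (same pattern as the Q2546 link)."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.ttl"


def candidate_query(class_qid: str = CHEMICAL_COMPOUND, limit: int = 5) -> str:
    """SPARQL selecting items that are instances of the class or of any subclass."""
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        f"SELECT ?x WHERE {{ ?x wdt:P31/wdt:P279* wd:{class_qid} }} LIMIT {limit}"
    )


def query_url(sparql: str) -> str:
    """GET URL for the query; nothing is sent by this function."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


if __name__ == "__main__":
    print(entity_ttl_url("Q283"))  # water, the test case mentioned above
    print(query_url(candidate_query()))
```

The entities returned by such a query can then be validated against the schema in one of the tools listed above.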
<img style="float: right; height: 50px;" src="https://upload.wikimedia.org/wikipedia/commons/6/66/Wikidata-logo-en.svg"/>
### Aligning Health Care and Life Science resources on Wikidata
* Lead/Contact: Egon Willighagen
* Participants: Quinten Groom, ...
* Channel: [#wikidata](https://discord.gg/JbSQ9jeF)
* [Notes](https://hackmd.io/8gf3isyZRTGIeIDDw9Rctg)
The recent [ELIXIR BioHackathon](https://biohackathon-europe.org/) featured a [project on making ELIXIR activities more findable](https://github.com/elixir-europe/bioHackathon-projects-2021/tree/main/projects/32) by registering metadata about them in [Wikidata](https://wikidata.org/). This resulted in new Wikidata items, links between items, and [new SPARQL queries](https://www.wikidata.org/wiki/Wikidata:WikiProject_Elixir#Queries_for_Use_Cases). The BioHackathon also saw a growth in links to ELIXIR resources, with many thousands of new SwissLipids identifiers in Wikidata. But there is much left to do, and this project will continue with the open tasks, to create a bigger CCZero-licensed life sciences knowledge graph (KG) combining metadata and data. Visualization with Scholia, as for [the SWAT4HCLS KG](https://scholia.toolforge.org/event/Q110499790), is expected.
We seek data analysts, curators, coders, or people just interested in making their research a bit more FAIR by putting metadata in Wikidata.
<img style="float: right; height: 50px;" src="https://bioschemas.org/assets/img/square_logo2.svg"/>
### Bioschemas Data
* Lead/Contact: Alasdair Gray
* Participants:
- Ammar Ammar (https://orcid.org/0000-0002-8399-8990)
- François Belleau (https://orcid.org/0000-0002-9816-1093)
- Alasdair Gray (https://orcid.org/0000-0003-1460-8327)
* Channel: #bioschemas
We have harvested a selection of data from sites marked up with Bioschemas; details available in the GitHub repo linked below. What can we do with it?
We have done some initial exploration and made SPARQL queries available in a [Jupyter Notebook](https://mybinder.org/v2/gh/BioSchemas/bioschemas-data-harvesting/HEAD?labpath=AnalysisQueries.ipynb). Can these be further expanded with meaningful questions of the data?
*(link opens a MyBinder instance running the notebook that executes queries against the SPARQL endpoint)*
What would a semantic search engine look like (remember Sindice?) now that Bioschemas and the Linked Data cloud of Life Science are a reality? The first F of FAIR stands for Findable, and search is by far the principal tool on the web for leveraging the value of its information. During this hackathon we will explore the potential of loading the harvested JSON-LD RDF data into Elasticsearch, to see what happens when all available Bioschemas data is stored in a single index and visualized with Kibana.
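
As a rough illustration of the loading step, the sketch below turns JSON-LD records into the newline-delimited body expected by Elasticsearch's `_bulk` API. The index name `bioschemas`, the sample records, and the choice of `@id` as document id are assumptions made for this example, not a description of what was set up during the hackathon.

```python
import json
from typing import Iterable


def to_bulk_ndjson(records: Iterable[dict], index: str = "bioschemas") -> str:
    """Serialise JSON-LD records as an Elasticsearch _bulk request body."""
    lines = []
    for record in records:
        action = {"index": {"_index": index}}
        # Reuse the JSON-LD node identifier as document id when present,
        # so reloading the same harvest overwrites instead of duplicating.
        if "@id" in record:
            action["index"]["_id"] = record["@id"]
        lines.append(json.dumps(action))
        lines.append(json.dumps(record))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"@context": "https://schema.org/", "@type": "Dataset",
         "@id": "https://example.org/dataset/1", "name": "Example dataset"},
        {"@context": "https://schema.org/", "@type": "Protein",
         "name": "Example protein"},
    ]
    print(to_bulk_ndjson(sample), end="")
```

The resulting string could be sent to the `/_bulk` endpoint with the `application/x-ndjson` content type; Kibana can then be pointed at the same index.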
Links:
- [GitHub Readme](https://github.com/BioSchemas/bioschemas-data-harvesting/blob/main/README.md)
- [Notebook](https://github.com/BioSchemas/bioschemas-data-harvesting/blob/main/AnalysisQueries.ipynb)
- [SPARQL queries](https://github.com/BioSchemas/bioschemas-data-harvesting/tree/main/queries)
- GraphDB [triplestore](https://swel.macs.hw.ac.uk/data/)
- [Data download](https://swel.macs.hw.ac.uk/bioschemas-data/).
#### Hacking Activities
1. Extract a Bioschemas dump from ChEMBL, representing in the first instance their MolecularEntities but then going beyond to include targets
- ChEMBL already has MolecularEntity markup on its website:
https://validator.schema.org/#url=https%3A%2F%2Fwww.ebi.ac.uk%2Fchembl%2Fcompound_report_card%2FCHEMBL59%2F
It is not in the Bioschemas dump because we have not crawled it, since there is no sitemap.
- A mapping of the JSON structure is needed; it is not as straightforward as just putting a `@context` on the front
- Ammar investigated using a `CONSTRUCT` query for this and mapped the SmallMolecule and SingleProtein entities to the Bioschemas entities MolecularEntity and Protein, respectively.
- François has ChEMBL in a JSON format in his Elasticsearch instance. Alasdair reckons that we can simply add a context declaration to this to get ChEMBL in JSON-LD using the Bioschemas model for MolecularEntity.
1. Create a dashboard overview of the Bioschemas harvested data
- Convert the Bioschemas data into JSON-LD
- Load into Elasticsearch
- Create dashboard to analyse the content of the harvested data
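
The point about the JSON mapping can be illustrated with a small sketch: a ChEMBL-style record generally needs its keys renamed and its nesting flattened before a `@context` turns it into Bioschemas `MolecularEntity` markup. The input field names below (`molecule_chembl_id`, `pref_name`, `molecule_structures`, `molecule_properties`) are assumptions about the source JSON, the selection of output properties is illustrative, and the sample record is a placeholder rather than real ChEMBL data.

```python
import json


def to_molecular_entity(record: dict) -> dict:
    """Map one ChEMBL-style JSON record to a Bioschemas MolecularEntity node."""
    chembl_id = record["molecule_chembl_id"]
    # Nested blocks may be missing or null in the source record.
    structures = record.get("molecule_structures") or {}
    properties = record.get("molecule_properties") or {}
    url = f"https://www.ebi.ac.uk/chembl/compound_report_card/{chembl_id}/"
    entity = {
        "@context": "https://schema.org/",
        "@type": "MolecularEntity",
        "@id": url,
        "identifier": chembl_id,
        "name": record.get("pref_name"),
        "url": url,
        "smiles": structures.get("canonical_smiles"),
        "inChIKey": structures.get("standard_inchi_key"),
        "molecularFormula": properties.get("full_molformula"),
    }
    # Drop properties the source record did not provide.
    return {key: value for key, value in entity.items() if value is not None}


if __name__ == "__main__":
    # Placeholder record, not taken from ChEMBL.
    example = {
        "molecule_chembl_id": "CHEMBL_EXAMPLE",
        "pref_name": "EXAMPLE COMPOUND",
        "molecule_structures": {"canonical_smiles": "CCO"},
        "molecule_properties": None,
    }
    print(json.dumps(to_molecular_entity(example), indent=2))
```

Whether a bare context declaration is enough then depends on how closely the stored JSON keys already match the Bioschemas property names.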
<img style="float: right; height: 50px;" src="https://bioschemas.org/assets/img/square_logo2.svg"/>
#### Hacking Progress
1. Extracting Data from ChEMBL [GitHub](https://github.com/ammar257ammar/SWAT4HCLS2022-ChEMBL-bioschemas-mapping)
- **Tried using ChEMBL REST API with [ChEMBL Web Resource Client]()**
- Notebook created
- Compound information extracted
- Can't filter with API to Bioschemas properties
- Abandoned due to time needed to pursue further
- **MolecularEntity information extracted with SPARQL query**
The approach adopted in this project uses the [ChEMBL mirror SPARQL endpoint (v28)](https://chemblmirror.rdf.bigcat-bioinformatics.org/), hosted by the Department of Bioinformatics (BiGCaT) at [Maastricht University](https://www.maastrichtuniversity.nl/), to construct new RDF (following the [Bioschemas](https://bioschemas.org/) vocabulary) from the ChEMBL RDF. The mapping between the ChEMBL entities and predicates and the Bioschemas ones was performed using SPARQL [CONSTRUCT](https://www.w3.org/TR/rdf-sparql-query/#construct) queries.
- Query developed to construct Bioschemas model from ChEMBL Molecule report card
- Query run over Maastricht endpoint and JSON-LD file created
- Note that this is mostly duplicating the information in the ChEMBL report card pages, but means that we do not need to harvest the data
- There are some differences in the content of the markup
- **Protein information extraction with SPARQL query**
- Developed a query to extract Bioschemas markup for targets typed as `SingleProtein`
- Query uses `TargetComponent` to extract some of the properties used in the markup
- Query links through Assay to Activity to Molecule to provide a connection between a Protein and a MolecularEntity
- Note that this goes beyond the markup that is available on the ChEMBL website as they do not have markup for their target report cards
- **Implementation**
- The SPARQL queries used for the mapping are available in the [**queries**](https://github.com/ammar257ammar/SWAT4HCLS2022-ChEMBL-bioschemas-mapping/tree/master/queries) folder.
- The mapping was implemented using Python, Jupyter Notebook and the [SPARQLWrapper](https://pypi.org/project/SPARQLWrapper/) package. The [**notebook**](https://github.com/ammar257ammar/SWAT4HCLS2022-ChEMBL-bioschemas-mapping/blob/master/ETL.ipynb) contains the code for mapping the ChEMBL RDF to Bioschemas and serializing the results into JSON-LD format. The construction of the molecular entities was performed in batches (100k molecules in each batch).
- The process took ~4.5 hours using a personal laptop (Core-i7 CPU & 16GB RAM)
- Number of mapped molecules: **1920028 molecules** (~2 million molecules)
- Number of mapped proteins: **8525 proteins**
- Size of the output JSON-LD: **2.68 GB unzipped** (380 MB zipped)
- The following figure shows an overview of the implementation<br><br>

1. Creating Bioschemas.Search dashboard
- Done an initial load of some of the data
- Tidying up contexts
- Starting to scale up the load
- Aiming to load Ammar's ChEMBL MolecularEntity JSON-LD file
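
The batched mapping described under Implementation can be sketched roughly as follows. This is not the query from the linked repository: the `cco:` terms (`SmallMolecule`, `chemblId`), the mapped properties, and the helper names are assumptions chosen for illustration, and the snippet only builds the query strings so that it runs without network access.

```python
import math

BATCH_SIZE = 100_000  # batch size reported above

# Assumed vocabulary terms; check them against the ChEMBL RDF before use.
CONSTRUCT_TEMPLATE = """\
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX schema: <https://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {{
  ?molecule a schema:MolecularEntity ;
            schema:identifier ?chemblId ;
            schema:name ?label .
}}
WHERE {{
  {{
    # Stable ordering so that consecutive pages do not overlap.
    SELECT ?molecule WHERE {{ ?molecule a cco:SmallMolecule }}
    ORDER BY ?molecule LIMIT {limit} OFFSET {offset}
  }}
  ?molecule cco:chemblId ?chemblId ;
            rdfs:label ?label .
}}
"""


def batch_offsets(total: int, batch_size: int = BATCH_SIZE) -> list:
    """Offsets of the pages needed to cover `total` molecules."""
    return [i * batch_size for i in range(math.ceil(total / batch_size))]


def construct_queries(total: int, batch_size: int = BATCH_SIZE):
    """Yield one CONSTRUCT query per batch."""
    for offset in batch_offsets(total, batch_size):
        yield CONSTRUCT_TEMPLATE.format(limit=batch_size, offset=offset)


if __name__ == "__main__":
    queries = list(construct_queries(1_920_028))
    print(len(queries), "batches")
    print(queries[0])
```

Each query string could then be sent to the endpoint with SPARQLWrapper (`setQuery`), the package used in the notebook, and the returned graph serialized as JSON-LD.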
### Bioschemas/Phenopackets
Following Raja's talk, can we explore the connections between [Bioschemas Phenotype profile](https://bioschemas.org/profiles/Phenotype/0.2-DRAFT) and the Phenopackets model?
### Rare disease specific FAIR Maturity Indicators
Contact: Núria Queralt Rosinach
Participants: Rajaram Kaliyaperumal, Matthijs Sloep, César Bernabé, Karolis Cremers, Vincent Emonet, ..
Channel: [#fair-maturity](https://discord.gg/udQtX5cR)
Collaborative [document](https://docs.google.com/document/d/1oNJKtEH1xxWrG42JmsNYpcfV8CJ5SAnJWBY6Dw3hRsI/edit?usp=sharing)
[Objective and automated assessments of the FAIRness of digital resources](https://github.com/FAIRMetrics/Metrics) have so far mainly been developed in a domain-agnostic manner. In this hackathon, we want to explore and develop FAIR Maturity Indicator (or metric) tests for community-specific requirements. [The European Joint Programme on Rare Diseases (EJP RD)](https://www.ejprarediseases.org/) is a project that drives the application of the FAIR principles to research in the Rare Disease (RD) domain. After a ‘contentathon’ held within the EJP RD with FAIR stewards, we selected a list of FAIR choices made by the EJP RD community. In this project, we aim to implement and test new RD-domain-specific FAIR Maturity Indicator tests for the [FAIR Evaluation Services](https://fairsharing.github.io/FAIR-Evaluator-FrontEnd/#!/). In particular, we selected the following for implementation:
- FAIR for ERNs:
- MUST: Common Data Elements (CDE) Semantic Model, DCAT-based Metadata Semantic Model
- POTENTIALLY: HGNC, HGVS, LOINC, PPRL (EUPID), DUO
- FAIR for Virtual Platform:
- VP index search engine (F4 principle)
We need (EJP RD) FAIR stewards, RD data providers, coders, FAIR experts, people interested in making the FAIRification process more domain specific for RD research or people just curious! You all are welcome to contribute!
### sheXer project
- Umaka YummyData
It is possible to get access to each dataset through an endpoint provided by the API.
Documentation: https://yummydata.org/api
It also provides a visualization:
https://umaka-viewer-dev.dbcls.jp/
Example: https://umaka-viewer-dev.dbcls.jp/v/e0Ng1ZS3ivFMyGl80MIAftJM3s9Al_30
Running sheXer against a SPARQL endpoint can be problematic because the client could get banned by the server.
One approach is to use a small dataset as a prototype.
Until the SPARQL endpoint issue is solved, we try to get data from a source by running the shape extraction process locally, i.e., using local RDF files instead of consuming a SPARQL endpoint.
The chosen source is [DisGeNET](https://www.disgenet.org/), which has an [RDF dump](http://rdf.disgenet.org/download/v7.0.0/). The first attempts to run this on an average PC ran out of memory, although we reached some partial results by excluding the biggest files in the RDF dump. We are moving the whole computation to a Docker job executed on a machine with more resources, but the final results may not be available during this hackathon.
We tried instead to collect results from the WikiPathways project, which also has [RDF dumps available](https://wikipathways-data.wmcloud.org/current/rdf/).
This source was small enough to be completely processed on an average computer. We provide both [the resulting shapes](https://cdn.discordapp.com/attachments/931106290686656562/931252865127829524/wikipathways_v2.shex) and the Python script used to get them:
```python=
import os

from shexer.shaper import Shaper
from shexer.consts import TURTLE


def run(base_input_dir,
        out_path,
        namespaces):
    # Recursively collect every .ttl file under base_input_dir.
    in_files = [os.path.join(dp, f) for dp, dn, filenames in
                os.walk(base_input_dir) for f in filenames
                if os.path.splitext(f)[1] == '.ttl']

    # The input format TURTLE uses rdflib to parse the graph, so the whole
    # graph content is loaded in main memory. The TURTLE_ITER const causes
    # sheXer to use a different parser that walks the file iteratively,
    # annotating the target features without loading the graph in memory.
    # However, that parser has some known issues with BNodes. Do not use
    # TURTLE_ITER unless you know that the target source does not contain
    # BNodes written with '[]' syntax.
    shaper = Shaper(graph_list_of_files_input=in_files,
                    all_classes_mode=True,
                    input_format=TURTLE,
                    namespaces_dict=namespaces,
                    disable_exact_cardinality=True)

    # Verbose mode is active, so one can check which stage the execution is
    # in, some raw numbers about shapes and instances computed, and also
    # execution times.
    # This acceptance_threshold filters out any information observed in less
    # than 5% of the instances of any class.
    shaper.shex_graph(output_file=out_path,
                      verbose=True,
                      acceptance_threshold=0.05)
    print("Done!")


if __name__ == "__main__":
    ############### CONFIGURATION ###############
    # Directory with the WikiPathways dump (content unzipped). The process
    # will recursively look for any ttl file in this folder or any of its
    # subfolders, and it will merge them into a single graph.
    base_input_dir = r"F:\datasets\wikipathways"

    # Output ShEx file
    out_path = r"F:\datasets\wikipathways\wikipathways_v2.shex"

    # Namespace-prefix pairs to be used in the results
    namespaces_dict = {"http://purl.org/dc/terms/": "dc",
                       "http://rdfs.org/ns/void#": "void",
                       "http://www.w3.org/2001/XMLSchema#": "xsd",
                       "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
                       "http://purl.org/pav/": "pav",
                       "http://www.w3.org/ns/dcat#": "dcat",
                       "http://xmlns.com/foaf/0.1/": "foaf",
                       "http://www.w3.org/2002/07/owl#": "owl",
                       "http://www.w3.org/2000/01/rdf-schema#": "rdfs",
                       "http://www.w3.org/2004/02/skos/core#": "skos",
                       "http://vocabularies.wikipathways.org/gpml#": "gpml",
                       }

    ############### EXECUTION ###############
    run(base_input_dir=base_input_dir,
        out_path=out_path,
        namespaces=namespaces_dict)
```
At this point, we need to further discuss whether these shapes are adequate for their intended purpose. But that is going to happen after this hackathon.
If they are indeed adequate, we would need to work out how to deal with the endpoint consumption issues (timeouts, IP bans, complexity). Some ideas have already been proposed:
* To improve sheXer's strategy for gathering data from an endpoint, trying to decrease the number of queries and the amount of information retrieved each time.
* To limit the number of instances involved in the process, so that just a representative group of each class is used instead.
* To focus just on those sources that provide RDF dumps (it could be problematic for scalability and maintenance).
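
The second idea, using only a representative group of instances per class, could look something like the sketch below. It only builds size-limited queries and spaces out the requests; the cap of 1000 instances, the one-second delay, the example class IRI, and all helper names are arbitrary assumptions, and the code is independent of sheXer's own endpoint handling.

```python
import time
from typing import Callable, Iterable


def sample_instances_query(class_iri: str, cap: int = 1000) -> str:
    """SPARQL returning at most `cap` instances of `class_iri`."""
    return (f"SELECT DISTINCT ?instance WHERE {{ ?instance a <{class_iri}> }} "
            f"LIMIT {cap}")


def run_throttled(queries: Iterable[str],
                  send: Callable[[str], object],
                  delay_seconds: float = 1.0,
                  sleep: Callable[[float], None] = time.sleep) -> list:
    """Send queries one at a time, pausing between consecutive requests."""
    results = []
    for i, query in enumerate(queries):
        if i > 0:
            sleep(delay_seconds)  # be polite to the endpoint
        results.append(send(query))
    return results


if __name__ == "__main__":
    # Example class IRI, assumed for illustration.
    classes = ["http://vocabularies.wikipathways.org/wp#Pathway"]
    queries = [sample_instances_query(c, cap=100) for c in classes]
    # `print` stands in for a real function that sends the query.
    run_throttled(queries, send=print)
```

A plain `LIMIT` returns whichever instances the endpoint yields first, so the sample is not guaranteed to be representative; that trade-off would need to be evaluated against the shapes obtained from full dumps.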