Project 21 - Biohackathon 2021 - KG subsets

---
tags:
- Wikidata
- ShEx
- Knowledge graphs
---

# Project 21 - Biohackathon 2021 - KG subsets

## Links

The contents of this repo are also maintained on [HackMD](https://hackmd.io/QI5KK1ecQICgGERmSj981g).

GitHub repo: https://github.com/kg-subsetting/biohackathon2021

## Notes of 8th Nov 2021

### TODOs

* List existing subsetting tools & methods
* Create Gene Wiki subset from the eLife paper
* Where to host subsets?
  * wbstack?
  * Are there other resources?
  * Can we set up a dedicated Wikibase?
* Docker containers
  * Some success last year using "docker save", which somehow pickled the whole environment, but it seems fragile - @danbri.
* Computational infrastructure to run the subset generation tools
  * We are using AWS for some experiments
    * Last month's bill: approx. 500€
    * Not optimized
  * Maybe consider other resources...

### Tools

- WDumper
  - It creates subsets from JSON dumps and generates dumps with RDF serialization
- KGTK
  - Paper: [Creating and Querying Personalized Versions of Wikidata on a Laptop](https://arxiv.org/abs/2108.07119)
  - It seems not to support references
  - It is internally based on https://www.sqlite.org/
- WDSub
  - https://github.com/weso/wdsub
  - Sequentially going item by item
- SparkWDSub
  - https://github.com/weso/sparkwdsub
  - Graph traversal

### Participants

- Eric Prud'hommeaux
- Jose Emilio Labra Gayo
- Seyed Hosseini
- Ammar Ammar
- Frederic Bastian
- Julien Wollbrett
- Andra Waagmeester
- Sabah Ul-Hasan
- Dan Brickley (somewhat)

### Discussion about Use cases

- Ontology subsetting vs data subsetting
- Use of reasoners/inference. Possibility to subset an ontology based on entailed knowledge using a reasoner. Examples:
  - Find species that have "lungs".
  - Find all the anatomical structures existing in mammals.
  - Find anatomical structures developing from a developmental precursor (for example, find CL:0000027 "smooth muscle cell", CL:0000123 "neuron associated cell", etc., developing from CL:0000333 "migratory neural crest cell").
  - For all the above examples, being able to find the data in Wikidata mapped to the results of the subsetting. Example:
    - Find gene expression data in mammalian heart.
- Seyed talks about the need to support references.
- Issue about how to handle already created subsets which are in RDF ([some examples](https://zenodo.org/record/5117928#.YYlQFGDP3IU), .nt files).
  - You need to re-create the Wikibase triple by triple.
- Seyed: There are also some other considerations. For example, [in WDumper](https://github.com/seyedahbr/wdumper) the subset only includes first-level items. Suppose we tell WDumper to extract all type X molecules or type Y genes, and we then need the discoverer of a gene in the subset. In that case there would be no proper item for the discoverer, just a QID. It would be good if our Emperor tool (I call it that as it would be the emperor of all subsetting tools) could extract at least some `rdfs:label` or `schema:description` values of the second-level items (items that are not completely in the topic scope but are somehow related).
- Seyed: Another issue is publishing an endpoint over WDumper-based subsets: almost the only option is Blazegraph. Because of encoding problems, other triple stores like Jena and GraphDB issue lots of import errors. Blazegraph is good, but its configuration has many options, and one has to find a balance between import speed and query speed.
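Seyed's point about second-level items can be illustrated with a small sketch. It is not how WDumper or any of the tools above work; it only shows the idea on in-memory triples, assuming a hypothetical function name (`subset_with_second_level_labels`), hypothetical `ex:` predicates, and made-up example items.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Predicates that describe an item without pulling in its full content.
LABEL_PREDICATES = {"rdfs:label", "schema:description"}


def subset_with_second_level_labels(
    triples: Iterable[Triple], in_scope: Set[str]
) -> Set[Triple]:
    """Keep all triples of in-scope items plus labels of the items they point to."""
    triples = list(triples)
    subjects = {s for s, _, _ in triples}

    # First level: every statement whose subject is in the topic scope.
    first_level = {t for t in triples if t[0] in in_scope}

    # Second level: items referenced by first-level statements but out of scope.
    referenced = {o for _, _, o in first_level if o in subjects} - in_scope

    # Only labels and descriptions are copied for second-level items.
    second_level = {
        t for t in triples if t[0] in referenced and t[1] in LABEL_PREDICATES
    }
    return first_level | second_level


if __name__ == "__main__":
    # Made-up data: a gene, its discoverer, and the discoverer's birth place.
    data = [
        ("wd:Gene1", "rdfs:label", '"gene one"'),
        ("wd:Gene1", "ex:discoverer", "wd:Person1"),
        ("wd:Person1", "rdfs:label", '"A. Researcher"'),
        ("wd:Person1", "ex:birthPlace", "wd:City1"),
        ("wd:City1", "rdfs:label", '"Some city"'),
    ]
    for triple in sorted(subset_with_second_level_labels(data, {"wd:Gene1"})):
        print(triple)
```

With this approach the discoverer keeps a readable label in the subset, while the rest of the discoverer's statements (and anything further away) stay out.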
## 9th November 2021

- Ammar Ammar proposed creating a new subset based on Wikidata lipid entities (entities having the wdt:P2063 predicate)
- Seyed has 6 ShEx schemas that can be used as use cases: https://github.com/seyedahbr/Wikidata_Reference_Statistics/tree/main/ShEx%20schemata
- Background work we have been doing in the last months: https://arxiv.org/abs/2110.11709
- GeneWiki work: https://github.com/weso/genewikisub
- Work done in the last 3 months to obtain subsets from GeneWiki: https://docs.google.com/document/d/1mxEo6y4IJjVpDK1nT2PvcvkFgvTCLBJax6WipTMXCUM/edit?usp=sharing

From previous meetings:

- [eLife paper on WD KG for lifesci](https://elifesciences.org/articles/52614)
- [Google experiments](https://github.com/google/schemarama/tree/main/kgx/wikidata): wikidata-triples and bio/schema triples SPARQL

### Tasks to do

- Jose Labra: I will try to run sparkwdsub on Ammar's shape expression on lipids
- Ammar: Prepare ShEx on lipids, try to run pyshex+slurping
- Seyed: 6 examples with WDumper - references and qualifiers
- Frederic: generate an OWL ontology with complex axioms requiring reasoning, for Eric to try to translate it to ShEx and subset it more easily

## 10th Nov 2021

- Today we have to report on our progress. Template slides are available here: https://docs.google.com/presentation/d/1GVMgpQBPOzMrodsZ2tBOaSDas-P8HokZBxlA-W0Rp_g/edit#slide=id.g10110a35a6e_0_2257
- We started these slides to collect ideas: https://docs.google.com/presentation/d/1c1uWCVBedypoH5FOgQexZCBHQDYrbKzlmRopbgTe0C0/edit?usp=sharing

## 11th Nov 2021

The subsets created with WDSub are available [here](http://files.hpc.weso.es/) and are documented in this Google Doc: https://docs.google.com/document/d/1mxEo6y4IJjVpDK1nT2PvcvkFgvTCLBJax6WipTMXCUM/edit

### Running examples

#### ShEx.js + slurp

```
# Pull the ShEx.js image, then run it with the disease schema (-x) and a SPARQL-based shape map (-m)
sudo docker pull micelio/shex:firsttry
sudo docker run -it micelio/shex:firsttry \
  -x https://raw.githubusercontent.com/kg-subsetting/biohackathon2020/main/use_cases/genewiki/disease.shex \
  -m "SPARQL 'SELECT ?s { ?s wdt:P699 ?o } '@:disease" | tee do20211111.ttl
```

Result: [figshare](https://figshare.com/articles/dataset/DO_wikidata_subse/16990036)

### Meeting

Participants: Filip, Kat, Eric, Seyed, Sabah, Andra, Jose

- Created a table comparing different alternatives
- Use case about using a large knowledge graph as a kind of external database that can be easily queried without the need for many SPARQL queries, which are difficult to maintain.
  - It seems a nice use case for Shape Expressions: you can filter the data with a shape expression and later run simpler SPARQL queries instead of defensive SPARQL queries
  - Example: films with revenue can be defined with ShEx, and later queries can assume that a revenue is already present.
- Seyed raises the possibility of creating a tool to create subsets
  - Problem of obtaining resources to create such a tool
  - We will consider it for next steps
    - Look for funding opportunities
- Workshop on Shape Expressions in Leiden, jointly with SWAT4HCLS

---
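The "films with revenue" use case from the meeting can be illustrated with a small sketch in plain Python. It is only a stand-in for the idea: the dictionary `FILM_SHAPE` plays the role of a shape expression and the two functions play the role of SPARQL queries. All names and data here are hypothetical and do not come from ShEx, SPARQL, or any of the tools listed in these notes.

```python
from typing import Any, Dict, List

Item = Dict[str, Any]

# Stand-in for a shape: properties every film in the subset must have, with types.
FILM_SHAPE: Dict[str, Any] = {"title": str, "revenue": (int, float)}


def conforms(item: Item, shape: Dict[str, Any]) -> bool:
    """Return True when the item has every required property with the right type."""
    return all(
        key in item and isinstance(item[key], expected)
        for key, expected in shape.items()
    )


def defensive_total(items: List[Item]) -> float:
    """Query over raw data: must guard against missing or malformed revenue."""
    total = 0.0
    for item in items:
        revenue = item.get("revenue")
        if isinstance(revenue, (int, float)):
            total += revenue
    return total


def simple_total(films: List[Item]) -> float:
    """Query over the shape-filtered subset: revenue is known to be present."""
    return sum(film["revenue"] for film in films)


if __name__ == "__main__":
    # Made-up records; two of them lack a usable revenue.
    items = [
        {"title": "A", "revenue": 100},
        {"title": "B"},
        {"title": "C", "revenue": "unknown"},
        {"title": "D", "revenue": 50.5},
    ]
    subset = [item for item in items if conforms(item, FILM_SHAPE)]
    print(len(subset), simple_total(subset), defensive_total(items))
```

The filtering cost is paid once when the subset is built, after which every later query can drop its guards.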