---
title: 'Virtual biohackathon virus'
tags:
  - Wikidata
  - COVID-19
  - WikiPathways
  - Complex Portal
authors:
  - name: Andra Waagmeester
    orcid: 0000-0001-9773-4008
    email: andra'at'micel.io
    socials: andrawaag on Twitter
    affiliation: 1
  - name: Egon Willighagen
    orcid: 0000-0001-7542-0286
    affiliation: 2
  - name: John Samuel
    orcid: 0000-0001-8721-7007
    affiliation: 3
  - name: Leyla Garcia
    orcid: 0000-0003-3986-0510
    affiliation: 4
  - name: Tiago Lubiana
    orcid: 0000-0003-2473-2313
    affiliation: 6
  - name: Daniel Mietchen
    orcid: 0000-0001-9488-1870
    socials: EvoMRI on Twitter
    affiliation: 7
  - name: Maarten Trekels
    orcid: 0000-0001-8282-8765
    affiliation: 8
  - name: Jose Emilio Labra Gayo
    orcid: 0000-0001-8907-5348
    affiliation: 9
  - name: Alejandro González Hevia
    orcid: 0000-0003-1394-5073
    affiliation: 9
  - name: Pablo Menéndez Suárez
    orcid: 0000-0002-8602-6927
    affiliation: 9
  - name: Birgit Meldal
    orcid: 0000-0003-4062-6158
    affiliation: 10
affiliations:
  - name: Micelio, Antwerpen, Belgium
    index: 1
  - name: Department of Bioinformatics - BiGCaT, NUTRIM, FHML, Maastricht University, Universiteitssingel 50, Maastricht, Netherlands
    index: 2
  - name: Institution 1, address, city, country
    index: 3
  - name: ZB MED Information Centre for Life Sciences, Gleueler Str 6 50931, Cologne, Germany
    index: 4
  - name: CPE Lyon, France
    index: 5
  - name: Institute of Mathematics and Statistics, University of São Paulo, Brazil
    index: 6
  - name: School of Data Science, University of Virginia, Charlottesville, VA 22904 USA
    index: 7
  - name: Meise Botanic Garden, Nieuwelaan 38, 1860 Meise, Belgium
    index: 8
  - name: WESO Research Group, University of Oviedo, Spain
    index: 9
  - name: European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
    index: 10
date: 11 April 2020
bibliography: paper.bib
---

[bibtex file](https://hackmd.io/@yVAYnqXiSWS8SfLIa0ttEw/rkd0IICvL)

# Background

Wikidata is the public knowledge graph operated
by the Wikimedia Foundation and curated by a global community from many walks of life. Initially set up as the linked-data repository for the different language variants of Wikipedia, it currently caters to a wide variety of use cases and tools within Wikipedia and beyond. During this virtual edition of the BioHackathon, Wikidata was one of the topics. Here, we report on progress and results [@Katayamaetal-2010]. Within the Wikidata theme, there was a set of subprojects, each with its own specific theme.

# Sub projects

Please keep sections to a maximum of three levels, even better if only two levels.

## Wikibase and preprint

Given the rapid pace of the COVID-19 situation in 2020, a higher-than-usual number of preprints on a single subject was published. By the time of writing, medRxiv and bioRxiv, two well-known preprint servers in the life sciences and health care domains, had published 1059 and 293 articles, respectively. In order to facilitate machine access to this collection, we created a Wikibase instance hosting metadata for COVID-19 related preprints. Wikibase makes it possible to store structured data in a way that is similar to and compatible with Wikidata. Our Wikibase instance is publicly accessible at (TODO).

The preprints included can be validated via Shape Expressions (ShEx) (ref: https://shex.io/shex-primer/) with a shape created to that end (ref: https://www.wikidata.org/wiki/EntitySchema:E185). (TODO: how many preprints did comply or not?). One issue that arose was that the initial set of preprint items did not fully follow community practices established before ShEx became available to formalize them. In particular, author order was missing. The schema has been updated to take this into account.

As part of our future work, we plan to include preprints from the Open Science Framework (OSF) (ref: https://osf.io/).

Note: Add somewhere something about lack of coherence/consistency across validators.

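To compare validators on equal terms, each tool needs the same two inputs: the ShEx text of the schema and the RDF of the item. The sketch below is one possible way to retrieve both with the Python standard library. It assumes that Wikidata serves raw schema text under `Special:EntitySchemaText` and item RDF under `Special:EntityData`; the function names and the User-Agent string are illustrative and are not part of any tool used during the hackathon.

```python
import re
from urllib.request import Request, urlopen

WIKIDATA = "https://www.wikidata.org/wiki"
# Hypothetical User-Agent; replace it with one that identifies your own tool.
USER_AGENT = "biohackathon-shex-sketch/0.1 (example)"


def schema_text_url(schema_id: str) -> str:
    """Return the URL of the raw ShEx text of an EntitySchema such as E185."""
    if not re.fullmatch(r"E[0-9]+", schema_id):
        raise ValueError(f"not an EntitySchema ID: {schema_id!r}")
    return f"{WIKIDATA}/Special:EntitySchemaText/{schema_id}"


def entity_rdf_url(item_id: str) -> str:
    """Return the URL of the Turtle serialization of an item such as Q88973808."""
    if not re.fullmatch(r"Q[0-9]+", item_id):
        raise ValueError(f"not an item ID: {item_id!r}")
    return f"{WIKIDATA}/Special:EntityData/{item_id}.ttl"


def fetch(url: str) -> str:
    """Download a URL and return its body as text (requires network access)."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

For example, `fetch(schema_text_url("E185"))` and `fetch(entity_rdf_url("Q88973808"))` would return a schema and an item that can then be passed unchanged to each validator under comparison.
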
Preprint https://www.wikidata.org/wiki/Q88973808 does not validate on [wmflabs](https://tools.wmflabs.org/shex-simple/wikidata/packages/shex-webapp/doc/shex-simple.html) against [schema E185](https://www.wikidata.org/wiki/EntitySchema:E185) (version 03:12, 10 April 2020) due to cardinality issues on

```
<language> EXTRA wdt:P31 {
  wdt:P31 [ wd:Q34770 wd:Q436240 wd:Q1288568 ]
}
```

which needs to be modified to

```
<language> EXTRA wdt:P31 {
  wdt:P31 [ wd:Q34770 wd:Q436240 wd:Q1288568 ] + ;
}
```

Both shapes will work with the CheckShEx add-on on Wikidata.

We also worked on enriching the [Scholia](https://www.wikidata.org/wiki/Wikidata:Scholia) profiles for [SARS-CoV-2](https://tools.wmflabs.org/scholia/Q82069695), [COVID-19](https://tools.wmflabs.org/scholia/Q84263196) and the current [COVID-19 pandemic](https://tools.wmflabs.org/scholia/Q81068910) as well as related entities. This involved creating items for missing publications (especially preprints, as discussed above) and missing authors. We also enriched existing items about publications by annotating them with topics (via [P921](https://www.wikidata.org/wiki/Property:P921)) and by using the [Author Disambiguator](https://tools.wmflabs.org/author-disambiguator/) tool to convert author name strings ([P2093](https://www.wikidata.org/wiki/Property:P2093)) into links to the appropriate author items (via [P50](https://www.wikidata.org/wiki/Property:P50); here, the aforementioned author order is important). Finally, we enriched author items, e.g. with annotations about affiliations, awards and co-authors.

### Schemas

A range of new entity schemas has been created using various approaches ([example documentation](https://www.wikidata.org/wiki/User:Daniel_Mietchen/ShEx_for_clinical_trials)).


* outbreak (general): https://www.wikidata.org/wiki/EntitySchema:E173
* pandemic: https://www.wikidata.org/wiki/EntitySchema:E184
* preprint: https://www.wikidata.org/wiki/EntitySchema:E185
* macromolecular complex: https://www.wikidata.org/wiki/EntitySchema:E186
* hospital: https://www.wikidata.org/wiki/EntitySchema:E187
* local outbreaks of the 2020 coronavirus pandemic: https://www.wikidata.org/wiki/EntitySchema:E188
* clinical trial: https://www.wikidata.org/wiki/EntitySchema:E189
* lockdown: https://www.wikidata.org/wiki/EntitySchema:E190
* coronavirus related lockdown: https://www.wikidata.org/wiki/EntitySchema:E191
* virus taxon: https://www.wikidata.org/wiki/EntitySchema:E192
* preprint server: https://www.wikidata.org/wiki/EntitySchema:E193
* complex portal entity: https://www.wikidata.org/wiki/EntitySchema:E194

### SPARQL queries

### Jupyter notebooks, GitHub repositories and data repositories/release

## The virus taxonomy on Wikidata

### Jupyter notebooks, GitHub repositories and data repositories/release

* Virus taxonomy wikibase: https://virus-taxonomy.wiki.opencura.com/

## Subsets and overlay from Wikidata

### Jupyter notebooks, GitHub repositories and data repositories/release

## Federated queries

### Jupyter notebooks, GitHub repositories and data repositories/release

* https://federatedqueries.wiki.opencura.com/wiki/Main_Page

.....

* Please add a list here
* Make sure you let us know which of these correspond to Jupyter notebooks. Although not supported yet, we plan to add features for them
* And remember, software and data need a license for them to be used by others. No license means no clear rules, so nobody could legally use a non-licensed research object, whatever that object is

## Describing the virus

Please remember to introduce tables (see Table 1) before they appear in the document. We recommend centering tables, formulas and figures but not the corresponding captions. Feel free to modify the table style to better suit your data.

Table 1

| Header 1 | Header 2 |
| -------- | -------- |
| item 1   | item 2   |
| item 3   | item 4   |

Remember to introduce figures (see Figure 1) before they appear in the document.

![BioHackrXiv logo](./biohackrxiv.png)

Figure 1. A figure corresponding to the logo of our BioHackrXiv preprint.

## Subsetting session

Feel free to use numbered lists or bullet points as you need.

* Item 1
* Item 2

### Jupyter notebooks, GitHub repositories and data repositories

* Please add a list here
* Make sure you let us know which of these correspond to Jupyter notebooks. Although not supported yet, we plan to add features for them
* And remember, software and data need a license for them to be used by others. No license means no clear rules, so nobody could legally use a non-licensed research object, whatever that object is

## Macromolecular Complex session

Macromolecular complexes are structures made of multiple macromolecules, such as protein and RNA molecules. SARS-CoV-2 depends on a series of macromolecular complexes for efficient replication and cell entry. During this BioHackathon, we reconciled data about SARS-CoV-2 complexes from [EBI's Complex Portal](https://www.ebi.ac.uk/complexportal/) with Wikidata, creating data models and inserting items for complex subunits. Notably, most SARS-CoV-2 complexes are derived from protein fragments of two major polyproteins, the Replicase polyprotein 1a/1ab and the Spike Glycoprotein polyprotein. The cleavage fragments of these proteins were added to Wikidata in order to be able to represent complex composition.

### Schemas

* Entity Schemas:
    * Macromolecular complex ([E186](https://www.wikidata.org/wiki/EntitySchema:E186))
    * complex portal entity ([E194](https://www.wikidata.org/wiki/EntitySchema:E194))

### Updated Items

#### Standard Items

Our approach was to manually create a reference item in order to explore the modelling challenges of protein complexes on Wikidata.
We chose the [SARS-CoV-2 NSP9 complex](https://www.wikidata.org/wiki/Q89792653) as an initial target and filled in its information. Next, based on the information made available by the Complex Portal ([in this FTP link](ftp://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab)), we created a series of statements, using [OpenRefine](http://openrefine.org/) to reconcile Complex Portal information to Wikidata items. These statements were added to Wikidata with the batch editing tool [QuickStatements](https://tools.wmflabs.org/quickstatements/#/batch).

The work can be readily seen via the Wikidata SPARQL query service by clicking on the following links:

* Protein fragments in SARS-CoV-2: https://w.wiki/MZo
* Macromolecular fragments in SARS-CoV-2: https://w.wiki/MZp
* Complexes in SARS-CoV-2: https://w.wiki/???

### Jupyter notebooks, GitHub repositories and data repositories

* WikiPathways [WP4846](https://www.wikipathways.org/index.php/Pathway:WP4846) and [WP4863](https://www.wikipathways.org/index.php/Pathway:WP4863) demo linkouts to the Complex Portal
* BridgeDb now lists Complex Portal as an external database and now has an identifier mapping database (to be released)
* The Jupyter notebook used for matching complexes to subunits is available at [this git repository](https://github.com/lubianat/covid_19_sandbox/tree/master/virtual_biohackathon).

# Discussion and/or Conclusion

We recommend including some discussion or conclusion about your work. Feel free to modify the section title to better fit your manuscript.

# Future work

...

# Acknowledgements

Please always remember to acknowledge the BioHackathon, CodeFest, VoCamp, Sprint or similar where this work was (partially) developed.

# References

Leave this section blank; create a paper.bib with all your references.