NeXML to JSON-LD conversion
===========================
Original idea issue: https://github.com/phenoscape/KB-DataFest-2017/issues/27
## Results
* [NexLD version 1](https://github.com/cboettig/nexld/)
* Developed by Carl Boettiger and Scott Chamberlain
* Added [additional tests](https://github.com/cboettig/nexld/pull/13) to Carl Boettiger's `nexld` library for R, which can convert NexML to NexLD and back.
* [Example NeXML files](https://github.com/phenoscape/nexld/tree/test_conversion/tests/example_nexml)
* [Example JSON-LD files](https://github.com/phenoscape/nexld/tree/test_conversion/tests/example_jsonld)
* [Example RDF triples](https://github.com/phenoscape/nexld/tree/test_conversion/tests/example_rdf)
* [pynexld](https://github.com/phenoscape/pynexld), a Python library for converting NeXML to NexLD
* [nexldrb](https://github.com/phenoscape/nexldrb), a Ruby library for converting NeXML to NexLD
* Notes on [NexLD version 2](https://github.com/phenoscape/pheno-jsonld):
* A simpler, cleaner phylogeny/phenotype/genotype data representation format
* Convertible with NeXML
* [phenoscaperb](https://github.com/phenoscape/phenoscaperb), a low-level Ruby client for the Phenoscape API
* [Fishtank](https://github.com/phenoscape/fishtank): a visualization of Phenoscape attributes arranged anatomically
* [Example visualization](https://raw.githubusercontent.com/phenoscape/fishtank/master/doc/viz.png)
* Found a bug in DendroPy, which now supports Phenoscape OntoTrace files: https://github.com/jeetsukumaran/DendroPy/issues/87
## Questions we can focus on
* Can JSON-LD be used to express a JSON file that is directly convertible to a valid, usable NeXML file?
* Can we create a JSON-LD representation of phylogenies and associated traits that breaks out of NeXML/Nexus' character-based system to allow for connections with digital media, specimen records, observation information and phenotypic traits?
* What is the best way to combine trees, genotypic characters and phenotypic traits into a single file for transmission and analysis?
## Derivative outcomes
* Can DendroPy parse NeXML from OntoTrace? One part of the OntoTrace markup is not accessible in DendroPy.
*
## Goal
1. Develop a lossless transformation from NeXML to JSON-LD.
2. Develop a JSON-LD representation of phylogenetic data that can be used to transmit phylogenies, genotypic and phenotypic traits together.
## Use cases
* List of traits -> Phenoscape API -> download observations -> add annotations -> return formatted data -> ingest to Phenoscape
## Deliverables
* [ ] Example NeXML files, with traits
* From Phenoscape
* [File](https://raw.githubusercontent.com/cboettig/nexld/master/inst/extdata/ontotrace.xml)
* API call: `curl 'http://kb.phenoscape.org/api/ontotrace?taxon=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FVTO_0037519%3E&entity=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050%3E%20some%20%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0004765%3E&variable_only=true' -H 'Accept-Encoding: gzip, deflate' -H 'Connection: keep-alive' --compressed`
* From Open Tree of Life - these won't have traits
* From TreeBASE (see the "Example NeXML files" section below, which links to Rutger Vos's supertreebase dump of TreeBASE)
* [ ] Example OpenTree [NexSON](https://github.com/OpenTreeOfLife/phylesystem-api/wiki/NexSON) files (not recommended for interoperability).
* [ ] Example JSON-LD files
* [ ] MVP Ruby client for converting NeXML to JSON-LD
* [ ] MVP Python client for converting NeXML to JSON-LD
* [ ] web API service to convert NeXML to JSON-LD and vice versa
* [ ] improving the R implementation https://github.com/cboettig/nexld
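
The OntoTrace API call listed under the first deliverable carries two percent-encoded parameters that are hard to read in the `curl` form. A minimal sketch of how that URL could be assembled in Python follows; `ontotrace_url` is a hypothetical helper name, not part of any Phenoscape client, and the code only builds the URL without sending a request.

```python
from urllib.parse import quote, urlencode

OBO = "http://purl.obolibrary.org/obo/"


def ontotrace_url(taxon_id, entity_expr,
                  base="http://kb.phenoscape.org/api/ontotrace"):
    """Build an OntoTrace request URL (hypothetical helper, for illustration)."""
    params = {
        "taxon": f"<{OBO}{taxon_id}>",   # taxon IRI wrapped in angle brackets
        "entity": entity_expr,           # class expression, IRIs in angle brackets
        "variable_only": "true",
    }
    # quote_via=quote encodes spaces as %20 (not '+'), matching the curl call;
    # safe="" makes sure '/' and ':' are percent-encoded as well.
    return base + "?" + urlencode(params, quote_via=quote, safe="")


# "part of (BFO_0000050) some UBERON_0004765", as in the curl example above
entity = f"<{OBO}BFO_0000050> some <{OBO}UBERON_0004765>"
url = ontotrace_url("VTO_0037519", entity)
print(url)
```

The resulting string can be passed to any HTTP client; the `Accept-Encoding` and `Connection` headers from the `curl` call are left out of this sketch.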
## JSON-LD
## changes to make to the nexld implementation
* within each otu: `"@id": "VTO_0065870"` goes to `"@id": "obo:VTO_0065870"` <-- not necessary if a @vocab is set!
* identifiers specific to a nexml file (e.g. "otu1") should use the [blank node notation](https://json-ld.org/spec/latest/json-ld/#identifying-blank-nodes) (e.g. "\_:otu1")
* best practice:
* don't use `@vocab` in `@context`
* when a value is not used, set it to `null`? or remove the key?
*
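
The two `@id` changes proposed above could be prototyped as a small post-processing step. This is only a sketch under stated assumptions: it treats anything shaped like `PREFIX_digits` as an OBO term, treats every other bare identifier as local to the file, and the function names `rewrite_id` and `rewrite_ids` are hypothetical rather than part of `nexld`.

```python
import re

# Assumption: OBO term ids look like "VTO_0065870" (letters, underscore, digits).
OBO_TERM = re.compile(r"^[A-Za-z]+_\d+$")


def rewrite_id(value):
    """Prefix OBO terms with 'obo:' and turn file-local ids into blank nodes."""
    if ":" in value:
        return value              # already a blank node, compact IRI or full IRI
    if OBO_TERM.match(value):
        return "obo:" + value     # "VTO_0065870" -> "obo:VTO_0065870"
    return "_:" + value           # "otu1" -> "_:otu1"


def rewrite_ids(node):
    """Walk a parsed JSON document and rewrite every string-valued '@id'."""
    if isinstance(node, dict):
        return {
            key: rewrite_id(val) if key == "@id" and isinstance(val, str)
            else rewrite_ids(val)
            for key, val in node.items()
        }
    if isinstance(node, list):
        return [rewrite_ids(item) for item in node]
    return node


doc = {"otus": {"@id": "otus1", "otu": [{"@id": "VTO_0065870"}, {"@id": "otu1"}]}}
result = rewrite_ids(doc)
print(result)
```

Keys that refer to another node by id (for example `"otus"` inside `characters` or `trees`) are not touched here; a real implementation would need to rewrite those references consistently.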
## spec
* top level keys (are these okay, should they be changed?):
* `@context` [hash]
* ~~`@vocab`~~ TBD hash
* namespaces go here (e.g., `obo:http://purl.obolibrary.org/obo/`)
* `version` [string] (not needed if we don't use `@vocab`)
* `schemaLocation` [string] IRIs (optional?)
* `otus` [array] each element is an `otu` hash
* `characters` [hash]
* `@id` [IRI]
*
* `trees` [hash]
* `@id` [IRI]
* `otus` [string]
* ??? other metadata
* `otus` hash (arbitrary set of keys): what OTUs do we have information about
* `@type`
* `@id`
* `label`
* `about` ([Carl suggests to remove this](https://github.com/cboettig/nexld/issues/6)) **REMOVE**
* `dwc:taxonID`
* `characters` hash: what characters do we define here?
* `@id`
* `@type`
* `type`: another attribute for type of data
* `otus` **REMOVE**
* `format`
* `states`
* `char`
* `matrix` (**MOVE TO `data` HASH**)
* avoid `seq`
* something about [framing](https://json-ld.org/spec/latest/json-ld-framing/#framing)
* `data` hash (was `matrix`)
*
* `trees` hash
* `@id`
* `otus`
* `tree`
* `node`
* `edges`
* `labels`
* context file
```json
{
"@context": {
"@vocab": "http://www.nexml.org/2009/"
}
}
```
* framing file
```json
{
"@context": {
"@vocab": "http://www.nexml.org/2009/"
},
"otus": {},
"trees": {}
}
```
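
To make the effect of `@vocab` and namespace prefixes on property names concrete, here is a deliberately simplified sketch of how a key is expanded to an IRI. It is not the full JSON-LD expansion algorithm (a real processor such as a JSON-LD library handles many more cases), and `expand_term` is a hypothetical name; the `obo` prefix is the one listed in the spec notes above.

```python
def expand_term(term, context):
    """Expand a property name against a flat @context (simplified sketch)."""
    if term.startswith("@"):
        return term                       # JSON-LD keywords stay as they are
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix   # compact IRI, e.g. obo:IAO_0000219
    if sep:
        return term                       # unknown prefix: assume an absolute IRI
    vocab = context.get("@vocab")
    # Simplification: without @vocab the term is returned unchanged here,
    # whereas a real JSON-LD processor would drop an unmapped term.
    return vocab + term if vocab else term


context = {
    "@vocab": "http://www.nexml.org/2009/",
    "obo": "http://purl.obolibrary.org/obo/",
}

for key in ("otus", "obo:IAO_0000219", "@id"):
    print(key, "->", expand_term(key, context))
```

This covers property names only; values of `@id` are resolved differently in JSON-LD and are not handled by this sketch.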
<details> <summary><strong>example (click to expand)</strong></summary>
```
{
"nexml": {
"@context": {
"@vocab": "http://www.nexml.org/2009/"
},
"version": "0.9",
"schemaLocation": "http://www.nexml.org/2009 http://www.nexml.org/2009/nexml.xsd http://www.bioontologies.org/obd/schema/pheno http://purl.org/phenoscape/phenoxml.xsd",
"otus": {
"@id": "tdcdf576d-af84-47da-aa08-a98992ca20be",
"otu": [
{
"@type": "otu",
"@id": "VTO_0065870",
"label": "Xenurobrycon macropus",
"about": "#VTO_0065870",
"dwc:taxonID": "http://purl.obolibrary.org/obo/VTO_0065870"
},
...
]
},
"characters": {
"@id": "20c810fb-e700-482e-8553-b7971e815f04",
"@type": "StandardCells",
"otus": "tdcdf576d-af84-47da-aa08-a98992ca20be",
"format": {
"states": [
{
"@type": "states",
"@id": "sbaa1df0d-ee55-4eee-bb60-617520728203",
"state": {
"@id": "UBERON_0010527_1",
"label": "present",
"symbol": "1"
}
},
...
],
"char": [
{
"@type": "char",
"@id": "UBERON_0010527",
"label": "cavity of bone organ",
"about": "#UBERON_0010527",
"states": "sbaa1df0d-ee55-4eee-bb60-617520728203",
"obo:IAO_0000219": "http://purl.obolibrary.org/obo/UBERON_0010527"
},
...
]
},
"matrix": {
"row": []
}
},
"trees": {
"@id": "tdf18b967-e3cb-45e2-8158-a59c4a1c305e",
"otus": "tdcdf576d-af84-47da-aa08-a98992ca20be"
},
"dc:creator": {},
"dc:description": {}
}
}
```
</details>
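
In the example above, `characters` and `trees` each point back at the OTU set through an `otus` key holding the `@id` of the `otus` block. A minimal sketch of a consistency check for that cross-reference follows, assuming exactly the layout of the example (a single `otus`, `characters` and `trees` block under `nexml`); `check_otus_references` is a hypothetical helper.

```python
def check_otus_references(doc):
    """Return a list of problems; an empty list means all references resolve."""
    nexml = doc["nexml"]
    otus_id = nexml["otus"]["@id"]
    problems = []
    for block in ("characters", "trees"):
        # A missing block or missing "otus" key yields None and is reported too.
        ref = nexml.get(block, {}).get("otus")
        if ref != otus_id:
            problems.append(f"{block}.otus is {ref!r}, expected {otus_id!r}")
    return problems


# Abbreviated version of the example document above.
example = {
    "nexml": {
        "otus": {"@id": "tdcdf576d-af84-47da-aa08-a98992ca20be", "otu": []},
        "characters": {
            "@id": "20c810fb-e700-482e-8553-b7971e815f04",
            "otus": "tdcdf576d-af84-47da-aa08-a98992ca20be",
        },
        "trees": {
            "@id": "tdf18b967-e3cb-45e2-8158-a59c4a1c305e",
            "otus": "tdcdf576d-af84-47da-aa08-a98992ca20be",
        },
    }
}

problems = check_otus_references(example)
print(problems or "all otus references resolve")
```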
# Gaurav's notes
## Phenoscape and NeXML
* Produced by [OntoTrace](http://kb.phenoscape.org/#/ontotrace)
## Some data that's accessible from the Phenoscape RCN
* [`/term/search`](http://kb.phenoscape.org/apidocs/#/Terms/get_term_search) can look up taxonomic names, giving you a VTO term (e.g. http://purl.obolibrary.org/obo/VTO_0011993)
```
{
"results": [
{
"@id": "http://purl.obolibrary.org/obo/VTO_0011993",
"label": "Homo sapiens",
"matchType": "exact"
},
{
"@id": "http://purl.obolibrary.org/obo/VTO_9033255",
"label": "Homo sapiens idaltu",
"matchType": "partial"
},
{
"@id": "http://purl.obolibrary.org/obo/VTO_0015575",
"label": "Homo sapiens x Mus musculus hybrid cell line",
"matchType": "partial"
}
]
}
```
* [`/taxon`](http://kb.phenoscape.org/apidocs/#/Taxa/get_taxon) can give you a bunch of details about a taxon.
```
{
"rank": {
"@id": "http://purl.obolibrary.org/obo/TAXRANK_0000006",
"label": "species"
},
"label": "Homo sapiens",
"extinct": false,
"common_name": "human",
"@id": "http://purl.obolibrary.org/obo/VTO_0011993"
}
```
*
```
{
"results": [
{
"@id": "http://purl.org/phenoscape/uuid/16b25594-a8df-4ed6-924e-1fedbc24b584",
"description": "Ankle, Malleolar bone, presence: absent",
"matrix": {
"@id": "http://dx.doi.org/10.1126/science.1229237",
"label": "O’Leary et al. (2013)"
},
"phenotype": {
"@id": "http://purl.org/phenoscape/uuid/16b25594-a8df-4ed6-924e-1fedbc24b584#phenotype",
"label": "Ankle, Malleolar bone, presence: absent"
}
},
{
"@id": "http://purl.org/phenoscape/uuid/9f1435fd-7675-4453-97ba-18e2108db0b1",
"description": "Astraglus, Head, shape: convex (arc-shaped)",
"matrix": {
"@id": "http://dx.doi.org/10.1126/science.1229237",
"label": "O’Leary et al. (2013)"
},
"phenotype": {
"@id": "http://purl.org/phenoscape/uuid/9f1435fd-7675-4453-97ba-18e2108db0b1#phenotype",
"label": "Astraglus, Head, shape: convex (arc-shaped)"
}
},
...
]
}
```
* Some example SPARQL queries:
* Which character-state combinations do we have for *Homo sapiens*: http://yasgui.org/short/rJaPFd2bf
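
As a small illustration of working with the `/term/search` response shown above, the sketch below picks the IRI of the exact match out of a result list. The sample is an abbreviated copy of that response; `exact_match_id` is a hypothetical helper and no request is sent.

```python
import json

# Abbreviated copy of the /term/search response shown above.
response_text = """
{
  "results": [
    {"@id": "http://purl.obolibrary.org/obo/VTO_0011993",
     "label": "Homo sapiens", "matchType": "exact"},
    {"@id": "http://purl.obolibrary.org/obo/VTO_9033255",
     "label": "Homo sapiens idaltu", "matchType": "partial"}
  ]
}
"""


def exact_match_id(response):
    """Return the @id of the first exact match, or None if there is none."""
    for result in response.get("results", []):
        if result.get("matchType") == "exact":
            return result["@id"]
    return None


taxon_iri = exact_match_id(json.loads(response_text))
print(taxon_iri)
```

The returned IRI is the kind of value that a taxon lookup such as `/taxon` takes as input.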
## Example NeXML files
* Contains DNA characters: https://github.com/TreeBASE/supertreebase/blob/master/data/treebase/S1588.xml
* Contains labeled characters: https://github.com/TreeBASE/supertreebase/blob/master/data/treebase/S12957.xml
* Contains non-DNA, unlabeled characters: https://github.com/TreeBASE/supertreebase/blob/master/data/treebase/S10917.xml
* Contains continuous characters: https://github.com/TreeBASE/supertreebase/blob/master/data/treebase/S15067.xml
## Aligned vs unaligned data
- Aligned data is comparable by column, i.e. 'A' in position 301 can be compared to 'T' in position 301
- Unaligned data is comparable by type, i.e. femur_length is 12cm
We can also distinguish between cases where a character is homologous ("ratio of x to y") or non-homologous ("body_length").
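
The distinction between the two kinds of data can be sketched in a few lines: aligned data is compared position by position, unaligned data is compared by trait name. The function names and the sample values are illustrative assumptions only.

```python
def aligned_differences(seq_a, seq_b):
    """Compare aligned sequences column by column; return differing positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def shared_traits(traits_a, traits_b):
    """Compare unaligned data by trait name; only shared names are comparable."""
    return {
        name: (traits_a[name], traits_b[name])
        for name in traits_a.keys() & traits_b.keys()
    }


# Aligned: position 2 differs ('G' vs 'T'); positions are zero-based here.
diffs = aligned_differences("ACGT", "ACTT")

# Unaligned: values are assumed to be in centimetres.
traits = shared_traits(
    {"femur_length": 12.0, "body_length": 150.0},
    {"femur_length": 14.5},
)
print(diffs, traits)
```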
## Converting NeXML to CDAO
NeXML structure as per its XSD specification
* [nexml](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/nexml.xsd#L59)
- "version=": required, must be '0.9'
- "generator="
* *Annotated*
* [ResourceMeta](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/meta/annotations.xsd#L71)
* "href=": URI
* "rel=": xs:QName
* Allows embedded meta tags?
* [LiteralMeta](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/meta/annotations.xsd#L89)
* "property=": required, xs:QName
* "datatype=": xs:QName
* "content="
* Allows any embedded meta tags.
* [otus](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/nexml.xsd#L44) [min=1]: [Taxa](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/taxa/taxa.xsd#L30)
* *IDTagged*
* "id=": required
* [otu](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/taxa/taxa.xsd#L22): [Taxon](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/taxa/taxa.xsd#L22), IDTagged
* Maps to CDAO:TU
* [set](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/taxa/taxa.xsd#L35): [TaxonSet](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/meta/sets.xsd#L30)
* "otu=": xs:IDREFS
* [characters](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/nexml.xsd#L47)
* ?!
* [trees](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/nexml.xsd#L48)
* [network](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/trees/trees.xsd#L28): AbstractNetwork
* [tree](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/trees/trees.xsd#L29): [AbstractTree](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/trees/abstracttrees.xsd#L76), IDTagged
* "node=": AbstractNode
* Maps to CDAO:Node
*
* "rootedge=": AbstractRootEdge
* "edge=": AbstractEdge
* "set=": NodeAndRootEdgeAndEdgeSet
* [set](https://github.com/nexml/nexml/blob/544179c709d38abd273e426e93875f408209367b/xsd/trees/trees.xsd#L31): TreeAndNetworkSet
* "tree=": xs:IDREFS
* "network=": xs:IDREFS
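
The two mappings noted above (`otu` to CDAO:TU, `node` to CDAO:Node) can be tried out with the standard library XML parser. This is a sketch only: the inline sample is a heavily simplified NeXML fragment that is not schema-valid, the class names are kept as the labels used in these notes rather than resolved to CDAO IRIs, and `cdao_types` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified NeXML fragment for illustration (not valid against the XSD).
SAMPLE = """<nexml xmlns="http://www.nexml.org/2009" version="0.9">
  <otus id="otus1">
    <otu id="otu1" label="Xenurobrycon macropus"/>
  </otus>
  <trees id="trees1" otus="otus1">
    <tree id="tree1">
      <node id="n1" otu="otu1"/>
    </tree>
  </trees>
</nexml>"""

# Element-to-class mapping taken from the outline above.
ELEMENT_TO_CDAO = {"otu": "CDAO:TU", "node": "CDAO:Node"}


def cdao_types(nexml_text):
    """Return (id, CDAO class) pairs for mapped elements, in document order."""
    root = ET.fromstring(nexml_text)
    pairs = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]   # strip "{namespace}"
        cdao_class = ELEMENT_TO_CDAO.get(local_name)
        if cdao_class and "id" in element.attrib:
            pairs.append((element.attrib["id"], cdao_class))
    return pairs


pairs = cdao_types(SAMPLE)
print(pairs)
```

Extending the mapping to `characters`, `edge` and `rootedge` would need the corresponding CDAO classes, which the outline above leaves open.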
# Matt's notes
## temp - attributes
```
id | integer
descriptor_id | integer (character)
otu_id | integer
collection_object_id | integer
character_state_id | integer
name
label
position
frequency | character varying (always, never, sometimes)
continuous_value | numeric
continuous_unit | character varying
sample_n | integer
sample_min | numeric
sample_max | numeric
sample_median | numeric
sample_mean | numeric
sample_units | character varying
sample_standard_deviation | numeric
sample_standard_error | numeric
presence | boolean
description | text (free text)
```
## Framing spam
```json
{
  "@context": {"@vocab": "http://example.org/"},
  "@type": "Matrix",
  "contains": {
    "@type": "Rows",
    "contains": {
      "@type": "Columns",
      "contains": {
        "@type": "Cells"
      }
    }
  }
}
```