NeXML to JSON-LD conversion

original idea issue: https://github.com/phenoscape/KB-DataFest-2017/issues/27

Results

NexLD version 1
- Developed by Carl Boettiger and Scott Chamberlain
- Added tests to Carl Boettiger's nexld library for R, which converts NeXML to NexLD and back.
- pynexld, a Python library for converting NeXML to NexLD
- nexldrb, a Ruby library for converting NeXML to NexLD
Notes on NexLD version 2:
- A simpler, cleaner phylogeny/phenotype/genotype data representation format
- Convertible with NeXML
phenoscaperb, a low-level Ruby client for the Phenoscape API
Fishtank: a visualization of Phenoscape attributes arranged anatomically
- Example visualization
Found a bug in DendroPy; it now supports Phenoscape OntoTrace files: https://github.com/jeetsukumaran/DendroPy/issues/87

Questions we can focus on

Can JSON-LD be used to express a JSON file that is directly convertible to a valid, usable NeXML file?
Can we create a JSON-LD representation of phylogenies and associated traits that breaks out of NeXML/Nexus' character-based system to allow for connections with digital media, specimen records, observation information and phenotypic traits?
What is the best way to combine trees, genotypic characters and phenotypic traits into a single file for transmission and analysis?

Derivative outcomes

Can DendroPy parse NeXML from OntoTrace? One part of the OntoTrace markup is not accessible in DendroPy.

Goal

Develop a lossless transformation from NeXML to JSON-LD.
Develop a JSON-LD representation of phylogenetic data that can be used to transmit phylogenies, genotypic and phenotypic traits together.
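
One way to check the "lossless" goal is a round trip: convert, convert back, and compare. The sketch below does this with a plain nested dict rather than real JSON-LD, so it only illustrates the test strategy, not the nexld mapping. The `element_to_dict` and `dict_to_element` helpers and the trimmed sample document are hypothetical, and whitespace-only text and tail text are deliberately ignored.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Turn an element into a plain dict, keeping tag, attributes, text and child order."""
    return {
        "tag": elem.tag,
        "attrib": dict(elem.attrib),
        "text": (elem.text or "").strip(),
        "children": [element_to_dict(child) for child in elem],
    }


def dict_to_element(node):
    """Rebuild an element tree from the dict produced by element_to_dict."""
    elem = ET.Element(node["tag"], node["attrib"])
    if node["text"]:
        elem.text = node["text"]
    for child in node["children"]:
        elem.append(dict_to_element(child))
    return elem


# Trimmed, hypothetical NeXML fragment (not schema-valid; for illustration only).
SAMPLE = """<nexml xmlns="http://www.nexml.org/2009" version="0.9">
  <otus id="otus1">
    <otu id="otu1" label="Xenurobrycon macropus"/>
  </otus>
</nexml>"""

original = element_to_dict(ET.fromstring(SAMPLE))
round_tripped = element_to_dict(dict_to_element(original))
```

If `original` and `round_tripped` compare equal, nothing tracked by the dict form was lost. A real test suite would run the same comparison through the actual NeXML to JSON-LD converters.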

Use cases

List of traits -> Phenoscape API -> download observations -> add annotations -> return formatted data -> ingest to Phenoscape

Deliverables

Example NeXML files, with traits
- From Phenoscape
  - File
  - API call: curl 'http://kb.phenoscape.org/api/ontotrace?taxon=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FVTO_0037519%3E&entity=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050%3E%20some%20%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0004765%3E&variable_only=true' -H 'Accept-Encoding: gzip, deflate' -H 'Connection: keep-alive' --compressed
- From Open Tree of Life - these won't have traits
- From TreeBASE (see the "Example NeXML files" section below, which links to Rutger Vos's supertreebase dump of TreeBASE)
- [ ] Example of an OpenTree NexSON file (not recommended for interoperability).
Example JSON-LD files
MVP Ruby client for converting NeXML to JSON-LD
MVP Python client for converting NeXML to JSON-LD
web API service to convert NeXML to JSON-LD and vice versa
improving the R implementation https://github.com/cboettig/nexld
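
The OntoTrace API call listed under the Phenoscape example can also be assembled programmatically, which avoids hand-encoding the angle brackets and spaces. This is a minimal sketch that only builds the URL; the `ontotrace_url` helper and its parameter names are hypothetical, and the query shape (`taxon`, `entity`, `variable_only`) is copied from the curl command above rather than from API documentation.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the curl example in these notes.
ONTOTRACE_URL = "http://kb.phenoscape.org/api/ontotrace"

# BFO_0000050 is the "part of" relation used in the entity expression.
PART_OF = "http://purl.obolibrary.org/obo/BFO_0000050"


def ontotrace_url(taxon_iri, part_of_iri, variable_only=True):
    """Build an OntoTrace request URL for parts of an anatomical entity within a taxon."""
    params = {
        "taxon": f"<{taxon_iri}>",
        "entity": f"<{PART_OF}> some <{part_of_iri}>",
        "variable_only": "true" if variable_only else "false",
    }
    # quote (not quote_plus) encodes spaces as %20, matching the curl example.
    return ONTOTRACE_URL + "?" + urlencode(params, quote_via=quote)


url = ontotrace_url(
    "http://purl.obolibrary.org/obo/VTO_0037519",
    "http://purl.obolibrary.org/obo/UBERON_0004765",
)
```

The resulting `url` can be passed to any HTTP client to download the NeXML.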

JSON-LD

changes to make to the nexld implementation

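within each otu: "@id": "VTO_0065870" goes to "@id": "obo:VTO_0065870" (setting a @vocab does not make this unnecessary: @vocab expands keys and @type values, while @id values resolve against the document base)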
identifiers specific to a nexml file (e.g. "otu1") should use the blank node notation (e.g. "_:otu1")
best practice:
- don't use @vocab in @context
- when a value is not used, set it to null or remove the key?
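
The two identifier changes above could be applied mechanically to an existing nexld document. The following is a rough sketch: `normalize_id` and `normalize_ids` are hypothetical helpers, and the regular expression is an assumption that OBO-style identifiers look like an uppercase prefix, an underscore, and digits. It only rewrites "@id" values, so plain-string references to those ids (for example the "otus" value inside "trees") would still need separate handling.

```python
import re

# Assumption: OBO-style local identifiers look like "VTO_0065870".
OBO_ID = re.compile(r"^[A-Z]+_\d+$")


def normalize_id(value):
    """Prefix OBO identifiers with "obo:" and file-local identifiers with "_:"."""
    if value.startswith(("_:", "obo:", "http://", "https://")):
        return value  # already a blank node, CURIE, or absolute IRI
    if OBO_ID.match(value):
        return "obo:" + value
    return "_:" + value


def normalize_ids(node):
    """Recursively rewrite every "@id" value in a JSON-like structure."""
    if isinstance(node, dict):
        return {
            key: normalize_id(val) if key == "@id" and isinstance(val, str)
            else normalize_ids(val)
            for key, val in node.items()
        }
    if isinstance(node, list):
        return [normalize_ids(item) for item in node]
    return node


doc = {"otus": {"@id": "otus1", "otu": [{"@id": "VTO_0065870", "label": "Xenurobrycon macropus"}]}}
fixed = normalize_ids(doc)
```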

spec

top level keys (are these okay, should they be changed?):
- @context [hash]
  - ~~@vocab~~ TBD hash
  - namespaces go here (e.g., obo:http://purl.obolibrary.org/obo/)
- version [string] (not needed if we don't use @vocab)
- schemaLocation [string] IRIs (optional?)
- otus [array] each element is an otu hash
- characters [hash]
  - @id [IRI]
- trees [hash]
  - @id [IRI]
  - otus [string]
- ??? other metadata
otus hash (arbitrary set of keys): what OTUs do we have information about
- @type
- @id
- label
- about (Carl suggests removing this) REMOVE
- dwc:taxonID
characters hash: what characters do we define here?
- @id
- @type
- type: another attribute for type of data
- otus REMOVE
- format
  - states
  - char
- matrix (MOVE TO data HASH)
  - avoid seq
  - something about framing
data hash (was matrix)
*
trees hash
- @id
- otus
- tree
  - node
  - edges
  - labels
context file

{
  "@context": {
    "@vocab": "http://www.nexml.org/2009/"
  }
}
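
To make the effect of this context concrete, here is a simplified illustration of how a JSON-LD processor expands keys when "@vocab" is set. It is not the full expansion algorithm from the JSON-LD specification: `expand_term` is a hypothetical helper that handles only keywords, prefixed names, and vocabulary-relative terms, and it applies to keys, not to "@id" values.

```python
def expand_term(term, context):
    """Expand a JSON-LD key against a flat context (simplified illustration)."""
    if term.startswith("@"):
        return term  # JSON-LD keywords such as @id and @type stay as they are
    if ":" in term:
        prefix, _, suffix = term.partition(":")
        if prefix in context:
            return context[prefix] + suffix  # compact IRI, e.g. obo:VTO_0065870
        return term  # unknown prefix: treated as an already absolute IRI
    vocab = context.get("@vocab")
    # Without @vocab or a term definition, a real processor drops the key.
    return vocab + term if vocab else None


CONTEXT = {
    "@vocab": "http://www.nexml.org/2009/",
    "obo": "http://purl.obolibrary.org/obo/",
}

expanded = {key: expand_term(key, CONTEXT) for key in ["otus", "label", "obo:IAO_0000219", "@id"]}
```

This is why every bare key in the nexld output becomes a term under http://www.nexml.org/2009/ when "@vocab" is used, and why removing "@vocab" means each key needs its own definition in the context.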

framing file

{
  "@context": {
    "@vocab": "http://www.nexml.org/2009/"
  },
  "otus": {},
  "trees": {}
}

example

{
  "nexml": {
    "@context": {
      "@vocab": "http://www.nexml.org/2009/"
    },
    "version": "0.9",
    "schemaLocation": "http://www.nexml.org/2009 http://www.nexml.org/2009/nexml.xsd http://www.bioontologies.org/obd/schema/pheno http://purl.org/phenoscape/phenoxml.xsd",
    "otus": {
      "@id": "tdcdf576d-af84-47da-aa08-a98992ca20be",
      "otu": [
        {
          "@type": "otu",
          "@id": "VTO_0065870",
          "label": "Xenurobrycon macropus",
          "about": "#VTO_0065870",
          "dwc:taxonID": "http://purl.obolibrary.org/obo/VTO_0065870"
        },
        ...
      ]
    },
    "characters": {
      "@id": "20c810fb-e700-482e-8553-b7971e815f04",
      "@type": "StandardCells",
      "otus": "tdcdf576d-af84-47da-aa08-a98992ca20be",
      "format": {
        "states": [
          {
            "@type": "states",
            "@id": "sbaa1df0d-ee55-4eee-bb60-617520728203",
            "state": {
              "@id": "UBERON_0010527_1",
              "label": "present",
              "symbol": "1"
            }
          },
          ...
        ],
        "char": [
          {
            "@type": "char",
            "@id": "UBERON_0010527",
            "label": "cavity of bone organ",
            "about": "#UBERON_0010527",
            "states": "sbaa1df0d-ee55-4eee-bb60-617520728203",
            "obo:IAO_0000219": "http://purl.obolibrary.org/obo/UBERON_0010527"
          },
          ...
        ]
      },
      "matrix": {
        "row": []
      }
    },
    "trees": {
      "@id": "tdf18b967-e3cb-45e2-8158-a59c4a1c305e",
      "otus": "tdcdf576d-af84-47da-aa08-a98992ca20be"
    },
    "dc:creator": {},
    "dc:description": {}
  }
}

Gaurav's notes

Phenoscape and NeXML

Produced by OntoTrace

Some data that's accessible from the Phenoscape RCN

/term/search can look up taxonomic names, giving you a VTO term (e.g. http://purl.obolibrary.org/obo/VTO_0011993)

{
  "results": [
    {
      "@id": "http://purl.obolibrary.org/obo/VTO_0011993",
      "label": "Homo sapiens",
      "matchType": "exact"
    },
    {
      "@id": "http://purl.obolibrary.org/obo/VTO_9033255",
      "label": "Homo sapiens idaltu",
      "matchType": "partial"
    },
    {
      "@id": "http://purl.obolibrary.org/obo/VTO_0015575",
      "label": "Homo sapiens x Mus musculus hybrid cell line",
      "matchType": "partial"
    }
  ]
}
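
A response shaped like the one above can be reduced to a single VTO term by keeping only the exact match. This is a minimal sketch working on already-parsed JSON; `exact_match_iri` is a hypothetical helper, and the sample is a trimmed copy of the response shown above.

```python
def exact_match_iri(search_response):
    """Return the @id of the first result whose matchType is "exact", or None."""
    for hit in search_response.get("results", []):
        if hit.get("matchType") == "exact":
            return hit["@id"]
    return None


# Trimmed copy of the /term/search response shown above.
SAMPLE = {
    "results": [
        {
            "@id": "http://purl.obolibrary.org/obo/VTO_0011993",
            "label": "Homo sapiens",
            "matchType": "exact",
        },
        {
            "@id": "http://purl.obolibrary.org/obo/VTO_9033255",
            "label": "Homo sapiens idaltu",
            "matchType": "partial",
        },
    ]
}

iri = exact_match_iri(SAMPLE)
```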

/taxon can give you a bunch of details about a taxon.

{
  "rank": {
    "@id": "http://purl.obolibrary.org/obo/TAXRANK_0000006",
    "label": "species"
  },
  "label": "Homo sapiens",
  "extinct": false,
  "common_name": "human",
  "@id": "http://purl.obolibrary.org/obo/VTO_0011993"
}

{
  "results": [
    {
      "@id": "http://purl.org/phenoscape/uuid/16b25594-a8df-4ed6-924e-1fedbc24b584",
      "description": "Ankle, Malleolar bone, presence: absent",
      "matrix": {
        "@id": "http://dx.doi.org/10.1126/science.1229237",
        "label": "O’Leary et al. (2013)"
      },
      "phenotype": {
        "@id": "http://purl.org/phenoscape/uuid/16b25594-a8df-4ed6-924e-1fedbc24b584#phenotype",
        "label": "Ankle, Malleolar bone, presence: absent"
      }
    },
    {
      "@id": "http://purl.org/phenoscape/uuid/9f1435fd-7675-4453-97ba-18e2108db0b1",
      "description": "Astraglus, Head, shape: convex (arc-shaped)",
      "matrix": {
        "@id": "http://dx.doi.org/10.1126/science.1229237",
        "label": "O’Leary et al. (2013)"
      },
      "phenotype": {
        "@id": "http://purl.org/phenoscape/uuid/9f1435fd-7675-4453-97ba-18e2108db0b1#phenotype",
        "label": "Astraglus, Head, shape: convex (arc-shaped)"
      }
    },
    ...
  ]
}

Some example SPARQL queries:
Which character-state combinations do we have for Homo sapiens: http://yasgui.org/short/rJaPFd2bf

Example NeXML files

Contains DNA characters: https://github.com/TreeBASE/supertreebase/blob/master/data/treebase/S1588.xml
Contains labeled characters: https://github.com/TreeBASE/supertreebase/blob/master/data/treebase/S12957.xml
Contains non-DNA, unlabeled characters: https://github.com/TreeBASE/supertreebase/blob/master/data/treebase/S10917.xml
Contains continuous characters: https://github.com/TreeBASE/supertreebase/blob/master/data/treebase/S15067.xml

Aligned vs unaligned data

Aligned data is comparable by column, e.g. 'A' in position 301 can be compared to 'T' in position 301.
Unaligned data is comparable by type, e.g. femur_length is 12 cm.

We can also distinguish between cases where a character is homologous ("ratio of x to y") or non-homologous ("body_length").
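
The difference between the two kinds of data shows up in how they are looked up. In this sketch, aligned data is indexed by column position and unaligned data by trait name. All taxa, sequences, and measurements are made up for illustration, and `column` and `trait` are hypothetical helpers.

```python
# Hypothetical aligned matrix: every row has the same length, so columns line up.
aligned = {
    "taxon_a": "ACGT",
    "taxon_b": "ACTT",
}

# Hypothetical unaligned records: each OTU carries whichever traits were measured.
unaligned = {
    "taxon_a": {"femur_length": (12, "cm")},
    "taxon_b": {"femur_length": (15, "cm"), "body_length": (90, "cm")},
}


def column(matrix, position):
    """Aligned data: compare OTUs by the state at one column position."""
    return {otu: states[position] for otu, states in matrix.items()}


def trait(records, name):
    """Unaligned data: compare OTUs by a named trait, skipping OTUs that lack it."""
    return {otu: traits[name] for otu, traits in records.items() if name in traits}


third_column = column(aligned, 2)
femurs = trait(unaligned, "femur_length")
bodies = trait(unaligned, "body_length")
```

A position-based matrix cannot express the missing "body_length" for taxon_a without a placeholder, which is one reason the notes consider breaking out of the character-matrix model.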

Converting NeXML to CDAO

NeXML structure as per its XSD specification

nexml
- "version=": required, must be '0.9'
- "generator="
- Annotated
  - ResourceMeta
    - "href=": URI
    - "rel=": xs:QName
    - Allows embedded meta tags?
  - LiteralMeta
    - "property=": required, xs:QName
    - "datatype=": xs:QName
    - "content="
    - Allows any embedded meta tags.
- otus [min=1]: Taxa
  - IDTagged
    - "id=": required
  - otu: Taxon, IDTagged
    - Maps to CDAO:TU
  - set: TaxonSet
    - "otu=": xs:IDREFS
- characters
  - ?!
- trees
  - network: AbstractNetwork
  - tree: AbstractTree, IDTagged
    - "node=": AbstractNode
      - Maps to CDAO:Node
    - "rootedge=": AbstractRootEdge
    - "edge=": AbstractEdge
    - "set=": NodeAndRootEdgeAndEdgeSet
  - set: TreeAndNetworkSet
    - "tree=": xs:IDREFS
    - "network=": xs:IDREFS
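
The two mappings noted in the outline (otu to CDAO:TU, node to CDAO:Node) can be sketched as a pass over the NeXML tree that emits one type statement per element. This is only an illustration: `cdao_types` is a hypothetical helper, the class names are kept as the prefixed labels used in the outline rather than full IRIs, file-local ids are written as blank nodes, and the sample document is trimmed and not schema-valid.

```python
import xml.etree.ElementTree as ET

# Mappings taken from the outline above; all other elements are left unmapped.
TYPE_FOR_TAG = {"otu": "cdao:TU", "node": "cdao:Node"}

# Trimmed, hypothetical NeXML fragment (not schema-valid; for illustration only).
SAMPLE = """<nexml xmlns="http://www.nexml.org/2009" version="0.9">
  <otus id="otus1">
    <otu id="otu1"/>
  </otus>
  <trees id="trees1" otus="otus1">
    <tree id="tree1">
      <node id="n1" otu="otu1"/>
    </tree>
  </trees>
</nexml>"""


def cdao_types(nexml_text):
    """Return (subject, predicate, object) triples typing otu and node elements."""
    root = ET.fromstring(nexml_text)
    triples = []
    for elem in root.iter():  # document order
        local_name = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace
        if local_name in TYPE_FOR_TAG and "id" in elem.attrib:
            triples.append(("_:" + elem.get("id"), "rdf:type", TYPE_FOR_TAG[local_name]))
    return triples


triples = cdao_types(SAMPLE)
```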

Matt's notes

temp - attributes

 id                        | integer  
 
 descriptor_id             | integer   (character)   
                
 otu_id                    | integer                     
 collection_object_id      | integer                     

 character_state_id        | integer  
 name          
 label         
 position      
 frequency                 | character varying (always, never, sometimes)      
 
 continuous_value          | numeric                     
 continuous_unit           | character varying           

 sample_n                  | integer                     
 sample_min                | numeric                     
 sample_max                | numeric                     
 sample_median             | numeric                     
 sample_mean               | numeric                     
 sample_units              | character varying           
 sample_standard_deviation | numeric                     
 sample_standard_error     | numeric                     

 presence                  | boolean                     

 description               | text (free text)
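
The attribute list above could be carried around in code as a single record type. The sketch below covers a subset of the columns: the untyped name, label, and position fields and most of the sample statistics are left out for brevity. The `Observation` class and its `kind` method are hypothetical, and the idea that each row holds one kind of observation is an assumption, not something the notes state.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values for frequency, as listed in the notes.
FREQUENCIES = {"always", "never", "sometimes"}


@dataclass
class Observation:
    id: int
    descriptor_id: int  # the character being scored
    otu_id: Optional[int] = None
    collection_object_id: Optional[int] = None
    character_state_id: Optional[int] = None
    frequency: Optional[str] = None
    continuous_value: Optional[float] = None
    continuous_unit: Optional[str] = None
    sample_n: Optional[int] = None
    sample_mean: Optional[float] = None
    sample_units: Optional[str] = None
    presence: Optional[bool] = None
    description: Optional[str] = None

    def __post_init__(self):
        if self.frequency is not None and self.frequency not in FREQUENCIES:
            raise ValueError(f"unknown frequency: {self.frequency!r}")

    def kind(self):
        """Guess which kind of observation this row holds (assumes one kind per row)."""
        if self.character_state_id is not None:
            return "qualitative"
        if self.continuous_value is not None:
            return "continuous"
        if self.sample_n is not None:
            return "sample"
        if self.presence is not None:
            return "presence"
        return "free_text"


femur = Observation(id=1, descriptor_id=2, otu_id=3, continuous_value=12.0, continuous_unit="cm")
```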

Framing spam

{
  "@context": {"@vocab": "http://example.org/"},
  "@type": "Matrix",
  "contains": {
    "@type": "Rows",
    "contains": {
      "@type": "Columns",
      "contains": {
        "@type": "Cells"
      }
    }
  }
}
