
NeXML to JSON-LD conversion

original idea issue: https://github.com/phenoscape/KB-DataFest-2017/issues/27

Results

Questions we can focus on

  • Can JSON-LD be used to express a JSON file that is directly convertible to a valid, usable NeXML file?
  • Can we create a JSON-LD representation of phylogenies and associated traits that breaks out of NeXML/Nexus' character-based system to allow for connections with digital media, specimen records, observation information and phenotypic traits?
  • What is the best way to combine trees, genotypic characters and phenotypic traits into a single file for transmission and analysis?

Derivative outcomes

  • Can DendroPy parse NeXML from OntoTrace? One part of the OntoTrace markup is not accessible in DendroPy.

Goal

  1. Develop a lossless transformation from NeXML to JSON-LD.
  2. Develop a JSON-LD representation of phylogenetic data that can be used to transmit phylogenies, genotypic and phenotypic traits together.

Use cases

  • List of traits -> Phenoscape API -> download observations -> add annotations -> return formatted data -> ingest to Phenoscape

Deliverables

  • Example NeXML files, with traits
    • From Phenoscape
      • File
      • API call: curl 'http://kb.phenoscape.org/api/ontotrace?taxon=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FVTO_0037519%3E&entity=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050%3E%20some%20%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0004765%3E&variable_only=true' -H 'Accept-Encoding: gzip, deflate' -H 'Connection: keep-alive' --compressed
    • From Open Tree of Life - these won't have traits
    • From TreeBASE (see "Example NeXML files" section below linking to Rutger Vos's supertreebase dump of TreeBASE)
    • [ ] Example of an OpenTree NexSON file (not recommended for interoperability).
  • Example JSON-LD files
  • MVP Ruby client for converting NeXML to JSON-LD
  • MVP Python client for converting NeXML to JSON-LD
  • Web API service to convert NeXML to JSON-LD and vice versa
  • Improvements to the R implementation: https://github.com/cboettig/nexld
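The OntoTrace API call listed above can also be assembled programmatically. The following is a minimal sketch that only builds the request URL from the same taxon and entity expressions used in the curl example; the function name `ontotrace_url` is made up for illustration, and the code does not contact the service.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the curl example in these notes.
ONTOTRACE_URL = "http://kb.phenoscape.org/api/ontotrace"


def ontotrace_url(taxon_expr, entity_expr, variable_only=True):
    """Build an OntoTrace request URL (hypothetical helper, not part of any client)."""
    params = {
        "taxon": taxon_expr,
        "entity": entity_expr,
        "variable_only": "true" if variable_only else "false",
    }
    # quote (rather than the default quote_plus) encodes spaces as %20,
    # and safe="" also percent-encodes ':' and '/', as in the curl call.
    return ONTOTRACE_URL + "?" + urlencode(params, quote_via=quote, safe="")


url = ontotrace_url(
    "<http://purl.obolibrary.org/obo/VTO_0037519>",
    "<http://purl.obolibrary.org/obo/BFO_0000050> some "
    "<http://purl.obolibrary.org/obo/UBERON_0004765>",
)
print(url)
```

The resulting URL could then be fetched with any HTTP client to obtain an example NeXML file with traits.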

JSON-LD

changes to make to the nexld implementation

  • within each otu, "@id": "VTO_0065870" becomes "@id": "obo:VTO_0065870" (not necessary if a @vocab is set!)
  • identifiers specific to a NeXML file (e.g. "otu1") should use the blank node notation (e.g. "_:otu1")
  • best practice:
    • don't use @vocab in @context
    • when a value is not used, set it to null or remove the key?
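The identifier changes proposed above could be sketched roughly as follows. This is an illustration only, not nexld code: the rule that OBO-style local IDs look like `PREFIX_digits` is an assumption, and the helper names `normalize_id` and `drop_nulls` are hypothetical. `drop_nulls` shows the "remove the key" option; which option is preferable is still an open question in these notes.

```python
import re

# Assumption: OBO term IDs in these files look like "VTO_0065870".
OBO_LOCAL_ID = re.compile(r"^[A-Za-z]+_\d+$")


def normalize_id(identifier):
    """Rewrite an @id value following the conventions proposed in these notes."""
    if ":" in identifier:
        # Already a blank node ("_:x"), a compact IRI ("obo:x") or a full IRI.
        return identifier
    if OBO_LOCAL_ID.match(identifier):
        return "obo:" + identifier
    # Anything else is treated as local to the file, so it becomes a blank node.
    return "_:" + identifier


def drop_nulls(node):
    """Return a copy of a dict without keys whose value is None."""
    return {key: value for key, value in node.items() if value is not None}


otu = {"@id": "VTO_0065870", "label": "Xenurobrycon macropus", "about": None}
otu = drop_nulls({**otu, "@id": normalize_id(otu["@id"])})
print(otu)
```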

spec

  • top level keys (are these okay, should they be changed?):

    • @context [hash]
      • @vocab TBD hash
      • namespaces go here (e.g., "obo": "http://purl.obolibrary.org/obo/")
    • version [string] (not needed if we don't use @vocab)
    • schemaLocation [string] IRIs (optional?)
    • otus [array] each element is an otu hash
    • characters [hash]
      • @id [IRI]
    • trees [hash]
      • @id [IRI]
      • otus [string]
    • ??? other metadata
  • otus hash (arbitrary set of keys): what OTUs do we have information about

  • characters hash: what characters do we define here?

    • @id
    • @type
    • type: another attribute for type of data
    • otus REMOVE
    • format
      • states
      • char
    • matrix (MOVE TO data HASH)
      • avoid seq
      • something about framing
  • data hash (was matrix)

  • trees hash

    • @id
    • otus
    • tree
      • node
      • edges
      • labels
  • context file

{
  "@context": {
    "@vocab": "http://www.nexml.org/2009/"
  }
}
  • framing file
{
  "@context": {
    "@vocab": "http://www.nexml.org/2009/"
  },
  "otus": {},
  "trees": {}
}
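To make the effect of the @vocab in the context and framing files above concrete, here is a deliberately simplified sketch of how a JSON-LD processor turns keys into IRIs. It is not a real JSON-LD implementation (a library such as a JSON-LD processor handles many more cases), the function name `expand_term` is hypothetical, and the `obo` prefix is taken from the namespace example in the spec list.

```python
def expand_term(term, context):
    """Simplified key expansion: keywords, prefix:suffix, then @vocab."""
    if term.startswith("@"):
        return term  # JSON-LD keywords such as @id and @type are left alone
    if ":" in term:
        prefix, _, suffix = term.partition(":")
        if prefix in context:
            return context[prefix] + suffix  # compact IRI, e.g. obo:IAO_0000219
        return term  # assume it is already an absolute IRI
    vocab = context.get("@vocab")
    return vocab + term if vocab else term


context = {
    "@vocab": "http://www.nexml.org/2009/",
    "obo": "http://purl.obolibrary.org/obo/",
}

for key in ("otus", "trees", "obo:IAO_0000219", "@id"):
    print(key, "->", expand_term(key, context))
```

With @vocab set, plain keys such as otus and trees all land in the NeXML namespace, which is why the version key is only needed when @vocab is used.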
  • example
{
  "nexml": {
    "@context": {
      "@vocab": "http://www.nexml.org/2009/"
    },
    "version": "0.9",
    "schemaLocation": "http://www.nexml.org/2009 http://www.nexml.org/2009/nexml.xsd http://www.bioontologies.org/obd/schema/pheno http://purl.org/phenoscape/phenoxml.xsd",
    "otus": {
      "@id": "tdcdf576d-af84-47da-aa08-a98992ca20be",
      "otu": [
        {
          "@type": "otu",
          "@id": "VTO_0065870",
          "label": "Xenurobrycon macropus",
          "about": "#VTO_0065870",
          "dwc:taxonID": "http://purl.obolibrary.org/obo/VTO_0065870"
        },
        ...
      ]
    },
    "characters": {
      "@id": "20c810fb-e700-482e-8553-b7971e815f04",
      "@type": "StandardCells",
      "otus": "tdcdf576d-af84-47da-aa08-a98992ca20be",
      "format": {
        "states": [
          {
            "@type": "states",
            "@id": "sbaa1df0d-ee55-4eee-bb60-617520728203",
            "state": {
              "@id": "UBERON_0010527_1",
              "label": "present",
              "symbol": "1"
            }
          },
          ...
        ],
        "char": [
          {
            "@type": "char",
            "@id": "UBERON_0010527",
            "label": "cavity of bone organ",
            "about": "#UBERON_0010527",
            "states": "sbaa1df0d-ee55-4eee-bb60-617520728203",
            "obo:IAO_0000219": "http://purl.obolibrary.org/obo/UBERON_0010527"
          },
          ...
        ]
      },
      "matrix": {
        "row": []
      }
    },
    "trees": {
      "@id": "tdf18b967-e3cb-45e2-8158-a59c4a1c305e",
      "otus": "tdcdf576d-af84-47da-aa08-a98992ca20be"
    },
    "dc:creator": {},
    "dc:description": {}
  }
}
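As a starting point for the Python MVP, the otus part of the example above might be produced along these lines. This is a minimal sketch under several assumptions: the sample document and its otus id "t1" are invented for illustration, only the first otus block is handled, and meta annotations (such as dwc:taxonID) are ignored, so it is far from the lossless transformation the goals call for.

```python
import json
import xml.etree.ElementTree as ET

NEXML_NS = "http://www.nexml.org/2009"

# Hypothetical, heavily trimmed NeXML input.
SAMPLE = """<nexml xmlns="http://www.nexml.org/2009" version="0.9">
  <otus id="t1">
    <otu id="VTO_0065870" label="Xenurobrycon macropus" about="#VTO_0065870"/>
  </otus>
</nexml>"""


def otus_to_jsonld(nexml_text):
    """Convert the first otus block of a NeXML document to a JSON-LD-style dict."""
    ns = {"nex": NEXML_NS}
    root = ET.fromstring(nexml_text)
    otus_el = root.find("nex:otus", ns)
    return {
        "nexml": {
            "@context": {"@vocab": NEXML_NS + "/"},
            "version": root.get("version"),
            "otus": {
                "@id": otus_el.get("id"),
                "otu": [
                    {
                        "@type": "otu",
                        "@id": otu.get("id"),
                        "label": otu.get("label"),
                        "about": otu.get("about"),
                    }
                    for otu in otus_el.findall("nex:otu", ns)
                ],
            },
        }
    }


result = otus_to_jsonld(SAMPLE)
print(json.dumps(result, indent=2))
```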

Gaurav's notes

Phenoscape and NeXML

Some data that's accessible from the Phenoscape RCN

{
  "results": [
    {
      "@id": "http://purl.obolibrary.org/obo/VTO_0011993",
      "label": "Homo sapiens",
      "matchType": "exact"
    },
    {
      "@id": "http://purl.obolibrary.org/obo/VTO_9033255",
      "label": "Homo sapiens idaltu",
      "matchType": "partial"
    },
    {
      "@id": "http://purl.obolibrary.org/obo/VTO_0015575",
      "label": "Homo sapiens x Mus musculus hybrid cell line",
      "matchType": "partial"
    }
  ]
}
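A response shaped like the one above can be filtered by matchType to resolve a name to a single taxon IRI. The sketch below works on a pasted copy of part of that response rather than calling the service, and the helper name `exact_matches` is hypothetical.

```python
import json

# First two results copied from the response shown in these notes.
RESPONSE = """
{
  "results": [
    {"@id": "http://purl.obolibrary.org/obo/VTO_0011993",
     "label": "Homo sapiens", "matchType": "exact"},
    {"@id": "http://purl.obolibrary.org/obo/VTO_9033255",
     "label": "Homo sapiens idaltu", "matchType": "partial"}
  ]
}
"""


def exact_matches(payload):
    """Return the IRIs of all results whose matchType is "exact"."""
    return [r["@id"] for r in payload["results"] if r.get("matchType") == "exact"]


matches = exact_matches(json.loads(RESPONSE))
print(matches)
```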
  • /taxon can give you a bunch of details about a taxon.
{
  "rank": {
    "@id": "http://purl.obolibrary.org/obo/TAXRANK_0000006",
    "label": "species"
  },
  "label": "Homo sapiens",
  "extinct": false,
  "common_name": "human",
  "@id": "http://purl.obolibrary.org/obo/VTO_0011993"
}
{
  "results": [
    {
      "@id": "http://purl.org/phenoscape/uuid/16b25594-a8df-4ed6-924e-1fedbc24b584",
      "description": "Ankle, Malleolar bone, presence: absent",
      "matrix": {
        "@id": "http://dx.doi.org/10.1126/science.1229237",
        "label": "O’Leary et al. (2013)"
      },
      "phenotype": {
        "@id": "http://purl.org/phenoscape/uuid/16b25594-a8df-4ed6-924e-1fedbc24b584#phenotype",
        "label": "Ankle, Malleolar bone, presence: absent"
      }
    },
    {
      "@id": "http://purl.org/phenoscape/uuid/9f1435fd-7675-4453-97ba-18e2108db0b1",
      "description": "Astraglus, Head, shape: convex (arc-shaped)",
      "matrix": {
        "@id": "http://dx.doi.org/10.1126/science.1229237",
        "label": "O’Leary et al. (2013)"
      },
      "phenotype": {
        "@id": "http://purl.org/phenoscape/uuid/9f1435fd-7675-4453-97ba-18e2108db0b1#phenotype",
        "label": "Astraglus, Head, shape: convex (arc-shaped)"
      }
    },
    ...
  ]
}

Example NeXML files

Aligned vs unaligned data

  • Aligned data is comparable by column, e.g. 'A' in position 301 of one sequence can be compared to 'T' in position 301 of another
  • Unaligned data is comparable by type, e.g. a femur_length of 12 cm in one taxon can be compared to femur_length in another

We can also distinguish between cases where a character is homologous ("ratio of x to y") or non-homologous ("body_length").
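The aligned/unaligned distinction above can be illustrated with two small in-memory structures. All taxon names and values here are invented for the example, and the layout is only one possible way to hold such data.

```python
# Aligned data: every OTU has a value at every column position.
aligned = {
    "taxon_a": "ACGT",
    "taxon_b": "ACTT",
}


def column(matrix, position):
    """Compare aligned data by column: one state per OTU at a given position."""
    return {otu: states[position] for otu, states in matrix.items()}


# Unaligned data: each OTU carries whichever typed measurements exist for it.
unaligned = {
    "taxon_a": {"femur_length": (12.0, "cm")},
    "taxon_b": {"femur_length": (9.5, "cm"), "body_length": (40.0, "cm")},
}


def by_type(records, trait):
    """Compare unaligned data by type: only OTUs that have the trait appear."""
    return {otu: traits[trait] for otu, traits in records.items() if trait in traits}


print(column(aligned, 2))
print(by_type(unaligned, "femur_length"))
print(by_type(unaligned, "body_length"))
```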

Converting NeXML to CDAO

NeXML structure as per its XSD specification

  • nexml
    • "version=": required, must be '0.9'
    • "generator="
    • Annotated
      • ResourceMeta
        • "href=": URI
        • "rel=": xs:QName
        • Allows embedded meta tags?
      • LiteralMeta
        • "property=": required, xs:QName
        • "datatype=": xs:QName
        • "content="
        • Allows any embedded meta tags.
    • otus [min=1]: Taxa
      • IDTagged
        • "id=": required
      • otu: Taxon, IDTagged
        • Maps to CDAO:TU
      • set: TaxonSet
        • "otu=": xs:IDREFS
    • characters
      • ?!
    • trees
      • network: AbstractNetwork
      • tree: AbstractTree, IDTagged
        • "node=": AbstractNode
          • Maps to CDAO:Node
        • "rootedge=": AbstractRootEdge
        • "edge=": AbstractEdge
        • "set=": NodeAndRootEdgeAndEdgeSet
      • set: TreeAndNetworkSet
        • "tree=": xs:IDREFS
        • "network=": xs:IDREFS
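A few of the constraints listed above (version must be '0.9', at least one otus block, otus requires an id) can be checked with the standard library before attempting a conversion. This is only a sketch of a pre-flight check, not a substitute for validating against the XSD; the function name `check_nexml` and the sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

NEXML_NS = "http://www.nexml.org/2009"


def check_nexml(xml_text):
    """Return a list of problems found for a handful of the rules noted above."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{NEXML_NS}}}nexml":
        problems.append("root element is not nexml")
    if root.get("version") != "0.9":
        problems.append("version must be '0.9'")
    otus_blocks = root.findall(f"{{{NEXML_NS}}}otus")
    if not otus_blocks:
        problems.append("at least one otus block is required")
    for block in otus_blocks:
        if block.get("id") is None:
            problems.append("otus block is missing its required id")
    return problems


# Hypothetical inputs: one acceptable, one violating two of the rules.
GOOD = '<nexml xmlns="http://www.nexml.org/2009" version="0.9"><otus id="t1"/></nexml>'
BAD = '<nexml xmlns="http://www.nexml.org/2009" version="1.0"/>'

print(check_nexml(GOOD))
print(check_nexml(BAD))
```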

Matt's notes

temp - attributes

 id                        | integer

 descriptor_id             | integer (character)

 otu_id                    | integer
 collection_object_id      | integer

 character_state_id        | integer
 name                      |
 label                     |
 position                  |
 frequency                 | character varying (always, never, sometimes)

 continuous_value          | numeric
 continuous_unit           | character varying

 sample_n                  | integer
 sample_min                | numeric
 sample_max                | numeric
 sample_median             | numeric
 sample_mean               | numeric
 sample_units              | character varying
 sample_standard_deviation | numeric
 sample_standard_error     | numeric

 presence                  | boolean

 description               | text (free text)

Framing spam

{
  "@context": {"@vocab": "http://example.org/"},
  "@type": "Matrix",
  "contains": {
    "@type": "Rows",
    "contains": {
      "@type": "Columns",
      "contains": {
        "@type": "Cells"
      }
    }
  }
}