# System Architecture High Level
This is a back-end data system designed to deliver, at scale and volume, the information that drives the Phylogeny Explorer application.
[Link to Architecture Diagram](https://raw.githubusercontent.com/wiki/phylogeny-explorer/project-management/architecture.svg)

The data has some interesting characteristics:
* slow mutation rate relative to query rate - the delta:query ratio approaches 0
* graph structure - the overall shape of the data is a graph
* graph-directed inheritance - normalized attributes apply to children
* normalized storage - each attribute is represented only at change points in the canonical data
* highly denormalized delivery - delivered entry records carry all relevant attributes
There are several types of query that we have identified we will need to serve:
* Clade Entry - retrieval of a particular clade entry; dozens of times at page load, second-by-second during exploration
* Text Search index
  * Attribute names - the process of finding attributes to use for other searches
    * Autocomplete support - performant text substring searching; 100ms at 95% of peak active user load
  * Clade names
    * Autocomplete support - performant text substring searching; 100ms at 95% of peak active user load
  * Description text
    * more of a 'help us find things by description' query that backs a search result page
* Graph search index
  * retrieval of clade entries based on their relationships
* Time interval index - clades which intersect with a particular time point
## Scale and flow
A guiding principle of this design is to use the least possible amount of processing resources to serve a request. Evolved designs usually handle this with caching, but caching has downsides: it only speeds up questions that have been answered before, and it heavily loads the server on cache misses caused by expiration or by operational events such as a new CloudFront cache node coming online or a new version of the application being deployed. We can take a proactive approach precisely because the delta:query ratio approaches zero - we can spend a few very intense spot-minutes of high-CPU work pre-answering, assembling, and indexing all of the common query result patterns for rapid retrieval later, leaving minimal resource requirements for the rest of the site's operation and allowing a smaller cluster of servers to handle the live load. These approaches also make it trivial to keep requests stateless and to distribute them across more hardware as demand increases - with no star topology and its accompanying headaches of replication synchronization and ultimate dependency on a single conceptual system that coordinates state across the application.
## SYSTEM CONCEPTUAL BLOCKS
* Static Content
  * A simple file system accessible through the main web URL.
  * Provides application resources, images, and other unchanging bits of front-end-related content.
  * Also has a category that provides media resources particular to clade entries
* Front End Web Server
  * Provides routing for the web application front end
  * Provides API endpoints internal to the web application
* Open Authentication Server
  * Validates tokens
  * Provides authentication endpoints
  * Signs tokens that can be used by
    * the data API server
    * the front end web server
* Authentication and Authorization Store
  * Who are our users
  * What are our groups
  * What are our roles
  * What actions can a role do
  * What actions can't a role do
  * Which users are in which groups
  * Which groups are in which groups
* Data API Server
  * Provides API endpoints for access to the indexed and queryable data
  * Used by the web front end
  * Used by other clients
  * Access controlled by tokens
    * OAuth JWT verified with the Open Authentication Server
* Document Data Store
  * A Git repo with a series of clade entry documents
  * Access controlled by the Git permissions system
  * Contains the complete history of changes to the data set
* Integrator
  * A continuous integration/build process
  * Compiles the web application from source
    * deploys to static content
    * deploys to the front end web server
  * Reads and resolves normalized data to denormalized form
    * deploys denormalized documents to static content
  * Reads and resolves the relationship graph
    * deploys the graph to the graph database engine
  * Extracts all attributes and their entry locations
    * deploys the attribute index to the Attribute searcher
  * Generates the time segment tree index
    * deploys the tree to the time interval index
[Link to Flow Diagram](https://raw.githubusercontent.com/wiki/phylogeny-explorer/project-management/flow.svg)

## THAT'S NOT JSON IN YOUR EXAMPLES
Yes. This is a format called YAML (originally 'Yet Another Markup Language', now officially 'YAML Ain't Markup Language'). It is designed to be easy to type, yet it is as expressive as JSON and can be converted directly to JSON using a tool. I am typing this, so I used YAML to notate it. How do you read it, you might ask... Well, you can get all the details [here](https://symfony.com/doc/current/components/yaml/yaml_format.html), but for a brief summary:
* `---` Beginning-of-document marker
* `key: value` If you see a word and a `:`, you're in an object. In JSON, `{ "key": "value" }`
* `- value` The hyphen is used for an array entry. `[ "value" ]`
* ```yaml
  akey:
    bkey: value
  ```
  Indentation is for a nested object. In JSON, `{ "akey": { "bkey": "value" } }`
* ```yaml
  akey:
    - bkey: value
  ```
  In JSON, `{ "akey": [ { "bkey": "value" } ] }`

Voila.
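To sanity-check the mapping above, here is a small stdlib-only sketch: the four YAML shapes written as the Python structures a YAML parser (such as PyYAML's `yaml.safe_load`, assumed here but not required to run this) would produce, serialized back out with the `json` module to confirm the JSON equivalents.

```python
import json

# Each pair: (parsed structure a YAML loader would yield, expected JSON text)
examples = [
    ({"key": "value"},              '{"key": "value"}'),
    (["value"],                     '["value"]'),
    ({"akey": {"bkey": "value"}},   '{"akey": {"bkey": "value"}}'),
    ({"akey": [{"bkey": "value"}]}, '{"akey": [{"bkey": "value"}]}'),
]

for parsed, expected in examples:
    # json.dumps default separators (', ', ': ') match the JSON shown above
    assert json.dumps(parsed) == expected
```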
## STATES
### ANONYMOUS STATE
The client software (either the front end or some other client) will deliver credentials as necessary to obtain a token from the authorization system - this is a combination of a session token and identity that can be used to control what the API is able to do and query. The user control system is authoritative; the state is contained within the signed token. This is the OAuth flow.
### SESSION ACTIVE
A connection is made to the back end server with the signed authorization token. The back end verifies the authenticity of the token via the OAuth flow with the authentication service and returns a session token scoped to the back end servers. This allows the system of back end servers to maintain authority among themselves without depending, request by request, on the availability of the authoritative authentication server.
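The core idea of the SESSION ACTIVE state - that the back end cluster can verify tokens it signed itself, without a round trip to the authentication server - can be sketched with stdlib HMAC signing. This is an illustration only, not the OAuth/JWT machinery a real deployment would use; the secret and claim names are hypothetical.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared secret held by all back end servers in the cluster.
SECRET = b"back-end-cluster-shared-secret"

def sign_session(claims):
    """Serialize claims and attach an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def verify_session(token):
    """Return the claims if the signature checks out, else None."""
    body, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered, expired-format, or foreign token
    return json.loads(base64.urlsafe_b64decode(body))

token = sign_session({"sub": "user/098359f7", "scope": "data-api"})
assert verify_session(token)["scope"] == "data-api"
assert verify_session(token + "x") is None  # any tampering fails verification
```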
### QUERY RESULTS PENDING FRONT END ASSEMBLY
The application will make a query with its session token based on its own requirements. An example might be 'Biota and the first three clades beneath it', to drive the creation of an initial screen.
```yaml
---
version: 1.0
type: simple-graph
id: a9723dee-4bbc-11ea-b77f-2e728ce88125
depth: 2
languages:
- ar
- la
reference: GRAPH_ROOT
```
The response would come back with convenient display names in multiple languages - notice the RFC 2047 encoded UTF-8 strings for displaying the text in Arabic!
```yaml
---
version: 1.0
type: clade-set
meta:
  page: 1
  total_pages: 1
  total_resources: 4
  query:
    id: a9723dee-4bbc-11ea-b77f-2e728ce88125
    depth: 2
    reference: GRAPH_ROOT
objects:
  - display:
      la: Biota
      ar: =?UTF-8?B?2KfZhNmD2KfYptmG2KfYqiDYp9mE2K3Zitip?=
    link: /static/clade/uuid/562d0592-4bbc-11ea-b77f-2e728ce88125
  - name:
      la: Archaea
      ar: =?UTF-8?B?2KfZhNi52KrZitmC2Kk=?=
    link: /static/clade/uuid/562d0808-4bbc-11ea-b77f-2e728ce88125
  - name:
      la: Eubacteria
      ar: =?UTF-8?B?2KzZj9ix2ZLYq9mP2YXYqQ==?=
    link: /static/clade/uuid/562d095c-4bbc-11ea-b77f-2e728ce88125
  - name:
      la: Eukaryota
      ar: =?UTF-8?B?2YPYp9im2YYg2YjYrdmK2K8g2KfZhNiu2YTZitip?=
    link: /static/clade/uuid/562d0a88-4bbc-11ea-b77f-2e728ce88125
```
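The Arabic strings here are RFC 2047 encoded-words, which Python's standard library can decode directly; a minimal sketch:

```python
from email.header import decode_header

# Decode one of the Arabic display names from the response above.
# decode_header returns a list of (bytes, charset) pairs.
encoded = "=?UTF-8?B?2KfZhNi52KrZitmC2Kk=?="
raw, charset = decode_header(encoded)[0]
assert charset.lower() == "utf-8"
arabic_name = raw.decode(charset)  # the Arabic display text for Archaea
```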
### QUERY PROCESSING
A query against the data APIs results in a set of clade entry URLs.
The client software can then display those names and in turn retrieve (or load from any of several layers of cache) the documents for each clade.
### CLADE DATA RETRIEVE
A series of requests for these entries will be made to the static content server, which has the denormalized documents available.
http://explorer.phylogenyexplorerproject.com/static/clade/uuid/562d0592-4bbc-11ea-b77f-2e728ce88125
http://explorer.phylogenyexplorerproject.com/static/clade/uuid/562d0808-4bbc-11ea-b77f-2e728ce88125
http://explorer.phylogenyexplorerproject.com/static/clade/uuid/562d095c-4bbc-11ea-b77f-2e728ce88125
http://explorer.phylogenyexplorerproject.com/static/clade/uuid/562d0a88-4bbc-11ea-b77f-2e728ce88125
which would look something like:
```yaml
---
created_on: 1580929460
created_by:
  id: user/098359f7-59a3-43bd-82b6-2fcad56d0983
  name:
    la: Aran Ra
parent:
  id: 562d0592-4bbc-11ea-b77f-2e728ce88125
  link: /static/clade/uuid/562d0592-4bbc-11ea-b77f-2e728ce88125
  name:
    la: Biota
    ar: =?UTF-8?B?2KfZhNmD2KfYptmG2KfYqiDYp9mE2K3Zitip?=
depth: 2
name:
  la: Archaea
  ar: =?UTF-8?B?2KfZhNi52KrZitmC2Kk=?=
children:
  - name:
      la: Crenarchaeota
      ar: =?UTF-8?B?2LnYqtin2KbZgiDZhdi12K/YsdmK2Kk=?=
    link: /static/clade/uuid/562d0592-4bbc-11ea-b77f-2e728ce88125
    id: 562d0592-4bbc-11ea-b77f-2e728ce88125
  - name:
      la: Euryarchaeota
      ar: =?UTF-8?B?2LnYqtin2KbZgiDYudix2YrYttip?=
    link: /static/clade/uuid/f22f492b-9328-4a5a-8751-0efd1712acd3
    id: f22f492b-9328-4a5a-8751-0efd1712acd3
  - name:
      la: Korarchaeota
      ar: =?UTF-8?B?2LnYqtin2KbZgiDYtNin2KjYqQ==?=
    link: /static/clade/uuid/3800a855-64d8-4e46-961e-d1be2eb0a0b5
    id: 3800a855-64d8-4e46-961e-d1be2eb0a0b5
  - name:
      la: Thermoplasmales
    link: /static/clade/uuid/a74f9cc3-ff77-48f0-b1a6-abbfcee7c136
    id: a74f9cc3-ff77-48f0-b1a6-abbfcee7c136
no_children: 4
no_descendant_clades: 4125372
no_species: 2148125
name_meaning:
  la: =?UTF-8?B?R3JlZWsgzrXhvZYgKGV1LCAnd2VsbCcgb3IgJ3RydWUnKSArIM66zqzPgc+Fzr/OvSAoa2FyeW9uLCAnbnV0JyBvciAna2VybmVsJyk=?=
other_names:
  - la: Eukarya
  - la: Eucaryotae
common_names:
  - la: Eukaryotes
description:
  la: =?UTF-8?B?RXVrYXJ5b3RlcyAoL2p1y5DLiGvDpnJpb8qKdCwgLcmZdC8pIGFyZSBvcmdhbmlzbXMgd2hvc2UgY2VsbHMgaGF2ZSBhIG51Y2xldXMgZW5jbG9zZWQgd2l0aGluIG1lbWJyYW5lcywgdW5saWtlIHByb2thcnlvdGVzIChCYWN0ZXJpYSBhbmQgQXJjaGFlYSksIHdoaWNoIGhhdmUgbm8gbWVtYnJhbmUtYm91bmQgb3JnYW5lbGxlcy4uLg==?=
rank: Domain
citation:
  name: Whittaker & Margulis
extant: true
conservation_status: Least concern
synapomorphies:
  - name:
      la: Biomembrane
      ar: =?UTF-8?B?2LrYtNin2KEg2K3ZitmI2Yo=?=
    link: /static/synapomorphy/uuid/a74f9cc3-ff77-48f0-b1a6-abbfcee7c136
computed_stats:
  temporal_range:
    start:
      name:
        la: Orosirian
        ar: =?UTF-8?B?2KfZhNij2YjYsdmI2LPZitix2Yo=?=
      when:
        scale: 2050
        unit: Mya
```
## CANONICAL DATA TYPES
### Normalized Clade Entry
This is the 'base' data type of the tree. The tree of life is formed from these entries.
Each entry consists of a pointer to its parent and a list of characteristics.
The characteristics will be validated using the metadata of the Characteristics structure.
* node
  * has a parent characteristic
  * has a list of other characteristics
    * each with data
* species entry
  * is a leaf node
  * has species characteristics
    * each with data
* clade entry
  * has a list of asserted characteristics
    * each has characteristic data
    * described by a characteristic type
    * applies to all nodes beneath
  * has a list of deprecated characteristics
* taxonomy entry
  * parent representing taxonomies of which this is a subset
  * label
* characteristic entry
  * has a parent used for helping select and group characteristics
```yaml
---
Parent: PARENT-ID-HERE
NewTraits:
  - auriculae
LostTraits:
  - follicles
Characteristics: # contains a list of characteristic entries
  - characteristic: Binomun
    text:
      la: Brachylagus idahoensis # example of a name in Latin
  - characteristic: Physical
    sexualDimorphy:
      - gender: male
        quantification:
          type: length
          unit: cm
          scale: 32
      - gender: female
        quantification:
          type: length
          unit: cm
          scale: 30
```
### Characteristics
This is the way you attach information to a clade entry.
This structure defines the content and structure of any characteristic - it is metadata, that is to say, data-about-the-data: what do we call it, do we require it, how do we display it, what kind of content fields does it have?
It is intended that as new characteristics are decided upon, they will be added to this list - and additional content fields can be added to them as needed.
characteristic grammar:
* IDENTITY - a string (probably a UUID) used to identify a particular characteristic node
* LANGUAGE - a data type consisting of an object whose keys are LANGUAGE_CODE and whose values are data
  * LANGUAGE_CODE - international language code: 'en' for English, 'ar' for Arabic, etc.
  * RFC 2047 encoded UTF-8 string - either ascii-text or "=?" charset "?" encoding "?" encoded-text "?="
    * ascii-text - exactly like what you read here
    * charset - UTF-8
    * encoding - "B" for base64 or "Q" for quoted-printable
    * encoded-text - space/tab prohibited; encode occurrences of ?, =, and _ in accordance with the encoding
* SPECIES_REQUIRED_BOOLEAN - whether this characteristic is required on a leaf/species entry
* CLADE_REQUIRED_BOOLEAN - whether this characteristic is required on a node/clade entry
* INSTANCE_DATA_TYPE - describes a field whose configuration will be requested from the node
* DATA_TYPE - one of TEXT, DATE, FLAG, ENUM
  * TEXT - expect a LANGUAGE type of data at this point
  * DATE - expect a data type consisting of an object whose keys are "value" with a DATE_VALUE value and "unit" with a TIME_UNIT value
  * ENUM - a limited number of selections
```yaml
---
# template of a CHARACTERISTIC
id: IDENTITY_CHARACTERISTIC
parent: IDENTITY_CHARACTERISTIC # the parent of this characteristic, for relating characteristics to each other
label:
  LANGUAGE_CODE: RFC 2047 encoded UTF-8 string representing the label of the characteristic for selecting.
content:
  FIELD_NAME: INSTANCE_DATA_TYPE
required:
  leaf: SPECIES_REQUIRED_BOOLEAN
  node: CLADE_REQUIRED_BOOLEAN
```
```yaml
---
# template of a CHARACTERISTIC-DATA-TYPE
id: IDENTITY_CHARACTERISTIC_CONTENT
label:
  LANGUAGE_CODE: RFC 2047 encoded UTF-8 string representing the label of the characteristic data for entry.
type: DATA_TYPE
# example of a CHARACTERISTIC-DATA-TYPE
---
label:
  en: furred
type: ENUM
enum:
  - label: LANGUAGE
    value: true
  - label: LANGUAGE
    value: false
# example of a CHARACTERISTIC-DATA-TYPE
---
label:
  en: Taxonomy
type: ENUM
enum:
  - label:
      en: Linnaean
    value: LINNAEAN
  - label:
      en: Phylogenetic
    value: PHYLOGENETIC
---
label:
  en: Length in meters
type: MEASUREMENT
unit: meter
data: 'length(meters)'
---
label:
  en:
```
```yaml
---
# example Root characteristic
name: Name
id: naming
required: true
---
name: CommonName
id: commonName
parent: naming
required:
  leaf: true # so you can flag what you must have (name) vs. what you might (range)
label:
  en: Common Name(s)
content:
  name: LANGUAGE
---
name: VernacularName
parent: naming
required: false
label:
  en: Vernacular name(s)
content:
  text: LANGUAGE
---
name: Binomun
parent: naming
required: true # so you can flag what you must have (name) vs. what you might (range)
label:
  en: Binomun name # The name of the field
content:
  text: LANGUAGE
---
name: Attribution
required: true
label:
  en: Attribution
content:
  name: LANGUAGE
  date: DATE-YEAR
  original: FLAG
  vide: FLAG
  sensu: FLAG
  non: FLAG
  emended: FLAG
  basic: LANGUAGE
---
name: Physical
required: false
label:
  en: Physical
content:
  description: LANGUAGE
  sexualDimorphy:
    - gender: GENDER
      quantification: QUANTIFICATION
---
name: Image
required: false
label:
  en: Image(s)
content:
  image: IMAGEREF
  attribution: LANGUAGE
  description: LANGUAGE
---
```
## Quantification Types
```yaml
---
length:
  reference: meter
  unit:
    - name: kilometer
      factor: 1000
    - name: meter
      factor: 1
    - name: centimeter
      factor: 0.01
    - name: millimeter
      factor: 0.001
mass:
  reference: kilogram
  unit:
    - name: kilogram
      factor: 1
    - name: metric ton
      factor: 1000
    - name: US ton
      factor: 907.185
    - name: gram
      factor: 0.001
volume:
  reference: liter
  unit:
    - name: liter
      factor: 1
time:
  reference: day
  unit:
    - name: year
      factor: 365.25
    - name: day
      factor: 1
power:
  reference: watt
  unit:
    - name: watt
      factor: 1
```
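Each unit in these tables carries a factor relative to its quantity's reference unit, so conversion between any two units of the same quantity is `value * factor[from] / factor[to]`. A minimal sketch using the length table above:

```python
import math

# Factors relative to the reference unit (meter), as in the table above.
LENGTH = {"kilometer": 1000.0, "meter": 1.0, "centimeter": 0.01, "millimeter": 0.001}

def convert(value, from_unit, to_unit, table=LENGTH):
    """Convert via the reference unit: scale up to reference, then down."""
    return value * table[from_unit] / table[to_unit]

assert convert(1, "kilometer", "meter") == 1000.0
assert math.isclose(convert(150, "millimeter", "centimeter"), 15.0)
```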
### Exploration of Tree and Attributes
```yaml
# THE MATTER OF TREE - OR TAXONOMIC PARENTS
# Question: should the parent of a clade be qualified by the various taxonomies?
--- # (primates)
self:
  linnaean:
    rank: Order
    name: Primates
  cladistic: Primates
--- # (cheirogaleidae)
self:
  linnaean:
    rank: Family
    name: Cheirogaleidae
  cladistic: Cheirogaleidae
parent:
  linnaean:
    rank: Order
    name: Primates
  cladistic: Lemuriformes
--- # (lemurs)
self:
  linnaean:
    rank: Family
    name: Lepilemuridae
  cladistic: Lemurs
parent:
  linnaean:
    rank: Order
    name: Primates
  cladistic: Lemuriformes
--- # (lemurlike)
self:
  cladistic: Lemuriformes
parent:
  cladistic: Strepsirrhini
--- # (lemurish?)
self:
  cladistic: Strepsirrhini
parent:
  cladistic: Primates
--- # (Allocebus)
self:
  linnaean:
    rank: Genus
    name: Allocebus
parent:
  linnaean:
    rank: Family
    name: Cheirogaleidae
--- # (trichotis)
self:
  linnaean:
    rank: Species
    name: trichotis
  cladistic: Allocebus trichotis
parent:
  linnaean:
    rank: Genus
    name: Allocebus
  cladistic: Cheirogaleidae
flags:
  - extant
characteristics:
  - feature:
      bodyPart: ear
      observation: very short
  - feature:
      bodyPart: ear
      observation: tufts of wavy hair that project above ear pelage
  - feature:
      bodyPart: face
      observation: darker grey triangle between eyes
  - feature:
      bodyPart: eyes
      observation: dark narrow rings around
  - feature:
      bodyPart: lips
      observation: light pink color
  - feature:
      bodyPart: nose
      observation: light pink color
  - feature:
      bodyPart: tongue
      observation: long relative to other dwarf lemurs
  - sizeClass:
      relative: Dwarf
  - massRange:
      low: 75
      high: 98
      unit: gram
  - dimensionHead:
      low: 125
      high: 145
      unit: millimeter
  - dimensionHeadTail:
      low: 150
      high: 195
      unit: millimeter
```
* ~ two different conceptual trees in the example - one Linnaean, one cladistic, both constructible from the same normalized data structure
  * + allows expansion to any number of phylogenies
  * + provides the basis for 'switching phylogeny' programmatically
  * - demands a phylogeny context when searches are affected by phylogenetic relationships
  * + accommodates data that may be used in the future
  * - complicated questions regarding clade characteristics tagged between Linnaean nodes
  * - complicated questions regarding non-monophyletic matters and characteristics
  * + could restrict tree search to the most detailed, layered tree
  * - restricting to the cladistic tree alone would potentially confuse users with non-monophyletic expectations
## INTEGRATION SCRIPTING
### Tree/Graph Index
A tree is constructed by reading all of the files and, using the 'Parent' field in each, building a graph where each node has an ID and the edges between nodes express the characteristics/traits.
This structure will be loaded into a graph query engine named 'Graph Engine'.
### Denormalized Data Files
The most important of the scripts is the one that creates the denormalized clade entry documents, which are designed to be read by the front end/client software for full information on a clade node.
To make this happen, it needs to construct a tree: it will read all of the files and, using the 'Parent' field in each, build the tree.
Its main task is to copy all of the traits and characteristics from parents to their children - stopping where the traits stop.
The result is a series of documents whose traits accumulate as you head out toward the leaves, with occasional subtractions.
These documents will be available in the 'Clade Entry Index' for fast and immediate retrieval on demand by ID.
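The inheritance-resolution step described here can be sketched as follows, under an assumed minimal document shape: each normalized entry names its Parent and may add traits (NewTraits) or remove them (LostTraits), and resolving a node folds in everything still in effect from its ancestors. The node IDs and trait names are hypothetical.

```python
# Hypothetical normalized documents keyed by node ID.
docs = {
    "root": {"Parent": None,   "NewTraits": ["cells"], "LostTraits": []},
    "mid":  {"Parent": "root", "NewTraits": ["fur"],   "LostTraits": []},
    "leaf": {"Parent": "mid",  "NewTraits": [],        "LostTraits": ["fur"]},
}

def resolve(node_id):
    """Denormalize one node: inherit the parent's traits, apply local deltas."""
    doc = docs[node_id]
    inherited = set() if doc["Parent"] is None else resolve(doc["Parent"])
    return (inherited | set(doc["NewTraits"])) - set(doc["LostTraits"])

assert resolve("leaf") == {"cells"}  # 'fur' added at 'mid', lost again at 'leaf'
```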
### Attribute Search
Iterate over all of the defined variations of attributes and create a key-value document for each clade node and each data type.
* Clade node
  * each trait attribute and its value (to match trait attributes with clades)
* Data type
  * each trait type and the values that are in it (for autocomplete)
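Both key-value documents can be built in one pass over the resolved clade entries. A sketch under assumed shapes (the clade IDs and attributes are hypothetical): one index maps each attribute/value pair to the clades carrying it, the other collects every value seen per attribute type as raw material for autocomplete.

```python
from collections import defaultdict

# Hypothetical resolved clade entries: clade id -> {attribute: value}
clades = {
    "archaea":   {"rank": "Domain", "extant": True},
    "eukaryota": {"rank": "Domain", "extant": True},
    "allocebus": {"rank": "Genus",  "extant": True},
}

by_attribute = defaultdict(set)    # (attribute, value) -> set of clade ids
values_by_type = defaultdict(set)  # attribute -> set of values (autocomplete)
for clade_id, attrs in clades.items():
    for attr, value in attrs.items():
        by_attribute[(attr, value)].add(clade_id)
        values_by_type[attr].add(value)

assert by_attribute[("rank", "Domain")] == {"archaea", "eukaryota"}
assert values_by_type["rank"] == {"Domain", "Genus"}
```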
### Time Interval Index
Extract all of the time range quantifications and build a segment tree. This tree will allow rapid query of all clade nodes that intersect with any particular time point. It also permits time range queries to aggregate up all of the clades across a particular time range.
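The segment tree's job is to answer this stabbing query quickly; a naive linear scan (shown here for clarity, with hypothetical ranges in Mya, where larger means older) defines the answer it must produce.

```python
# Hypothetical temporal ranges: clade -> (start, end) in Mya, start >= end.
ranges = {
    "biota":     (4000, 0),
    "archaea":   (3500, 0),
    "trilobita": (521, 252),
}

def clades_at(time_mya):
    """All clades whose temporal range intersects the given time point."""
    return {c for c, (start, end) in ranges.items() if end <= time_mya <= start}

assert clades_at(300) == {"biota", "archaea", "trilobita"}
assert clades_at(100) == {"biota", "archaea"}
```

A segment tree over the interval endpoints returns the same sets in logarithmic rather than linear time, which matters once there are millions of clade nodes.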
### Image Content
A script will transfer images from the larger storage to which they have been allocated to each static content server for rapid serving.
### Front end application build
The compilation, verification, webpack-compression and deployment of the web code to the static content server and front end web servers
### Data Api application build
The compilation, verification, and deployment of the data api code to the back end servers
## Operational Data Versions
I am visualizing that any branch of the document data store can be made available (through the running of these scripts) to be queried via modifiers to the data API server.
When a new version of the indexes is available, the data API server should be able to report which versions of the data are currently available to query. Client software (such as our front end) can then display these choices to qualified users (editors) so that they can test and troubleshoot the new data version in the actual running system before deciding to make it live for the general public.
## Minimum Viable Product
We must identify the minimum viable queries necessary to support the explorer.
* what clades are near this clade (graph query)
* what is the data for this clade (clade entry index)
We must identify the minimum viable interaction with the editors
* access to git and modifying yaml files
To support these queries, we need only write basic scripting for:
* integration pipeline
  * clade canonical data integrity testing
  * front end build/deploy
  * data api build/deploy
* clade entry index build
* graph build