# System Architecture High Level

This is a back end data system designed to deliver, at scale and volume, the information that drives the phylogeny explorer application.

[Link to Architecture Diagram](https://raw.githubusercontent.com/wiki/phylogeny-explorer/project-management/architecture.svg)

![Architecture Diagram](https://raw.githubusercontent.com/wiki/phylogeny-explorer/project-management/architecture.svg)

The data has some interesting characteristics:

* slow mutation rate relative to query rate - the delta:query ratio approaches 0
* graph structure - the overall shape of the data is that of a graph
* graph-directed inheritance - normalized attributes apply to children
* normalized storage - each attribute is represented only at change points in the canonical data
* highly denormalized delivery - delivered entry records have all relevant attributes represented

There are several types of query that we have identified we will need to serve:

* Clade Entry - retrieval of a particular clade entry; dozens of times at page load, second-by-second during exploration
* Text Search index
  * Attribute names - the process of finding attributes to use for other searches
    * Autocomplete support - performant text substring searching - 100 ms at 95% of peak active user load
  * Clade names
    * Autocomplete support - performant text substring searching - 100 ms at 95% of peak active user load
  * Description text
    * more of a 'help us find things by description' query that has a search result page
* Graph search index
  * retrieval of clade entries based on their relationships
* Time interval index - clades which intersect with a particular time point

## Scale and flow

A principal principle of this design is to use the least possible amount of processing resources to serve a request. In evolved designs this is handled with a cache, but caching has downsides: it only caches questions that have been answered before, and it heavily loads the server on cache misses caused by expiration or by operational events such as a new CloudFront cache node coming online or a new version of the application being deployed.

We can take a proactive approach precisely because the delta:query ratio approaches zero - we can spend several very intense spot-minutes of high CPU capability pre-answering, assembling, and indexing all of the common query result patterns for rapid retrieval later, with minimal resources required for the rest of the site's operation, allowing a smaller cluster of servers to handle the live load. A minimal sketch of this pre-answering step appears at the end of this section.

These approaches also make it trivial to keep requests stateless, distributing them across more hardware as demand increases - with no star topology and its accompanying headaches of replication synchronization and ultimate dependency on a single conceptual system that coordinates state across the application.
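To make the pre-answering idea concrete, here is a minimal sketch, assuming a Node/TypeScript Integrator step and assuming the normalized documents are already available as JSON (the YAML in the repo converts directly). The `NormalizedClade` shape, file layout, and function names are illustrative assumptions, not the real implementation.

```typescript
// Hypothetical sketch of the "pre-answer" pass: resolve inherited traits once at
// integration time, write denormalized documents to static storage, and serve
// them as plain files afterwards.
import { readdir, readFile, writeFile, mkdir } from "fs/promises";
import { join } from "path";

interface NormalizedClade {
  id: string;
  parent?: string;                 // parent clade id
  traits: Record<string, unknown>; // traits asserted at this node
  lostTraits?: string[];           // traits that stop applying at this node
}

async function loadNormalized(dir: string): Promise<Map<string, NormalizedClade>> {
  const clades = new Map<string, NormalizedClade>();
  for (const file of await readdir(dir)) {
    const doc = JSON.parse(await readFile(join(dir, file), "utf8")) as NormalizedClade;
    clades.set(doc.id, doc);
  }
  return clades;
}

// Walk up the parent chain, accumulating inherited traits root-first and
// dropping any trait the node explicitly lost.
function resolveTraits(id: string, clades: Map<string, NormalizedClade>): Record<string, unknown> {
  const chain: NormalizedClade[] = [];
  for (let node = clades.get(id); node; node = node.parent ? clades.get(node.parent) : undefined) {
    chain.unshift(node);
  }
  const resolved: Record<string, unknown> = {};
  for (const node of chain) {
    Object.assign(resolved, node.traits);
    for (const lost of node.lostTraits ?? []) delete resolved[lost];
  }
  return resolved;
}

// Spend the CPU once at integration time; the live site only reads static files.
export async function preAnswer(srcDir: string, outDir: string): Promise<void> {
  const clades = await loadNormalized(srcDir);
  await mkdir(outDir, { recursive: true });
  for (const id of clades.keys()) {
    const denormalized = { id, traits: resolveTraits(id, clades) };
    await writeFile(join(outDir, `${id}.json`), JSON.stringify(denormalized));
  }
}
```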
## SYSTEM CONCEPTUAL BLOCKS

* Static Content
  * A simple file system accessible through the main web URL
  * Provides application resources, images, and other unchanging bits of front end related content
  * Also has a category to provide clade-entry-specific media resources
* Front End Web Server
  * Provides routing for the web application front end
  * Provides API endpoints internal to the web application
* Open Authentication Server
  * Validates tokens
  * Provides authentication endpoints
  * Signs tokens that can be used by
    * the data API server
    * the front end web server
* Authentication and Authorization Store
  * Who are our users
  * What are our groups
  * What are our roles
  * What actions can a role do
  * What actions can't a role do
  * Which users are in which groups
  * Which groups are in which groups
* Data API Server
  * Provides API endpoints for access to the indexed and queryable data
  * Used by the web front end
  * Used by other clients
  * Access controlled by tokens
    * OAuth JWT verified with the Open Authentication Server
* Document Data Store
  * A Git repo with a series of clade entry documents
  * Access controlled by the Git permissions system
  * Contains the complete history of changes to the data set
* Integrator
  * A continual integration/build process
  * Compiles the web application from source
    * deploys to static content
    * deploys to the front end web server
  * Reads and resolves normalized data to denormalized form
    * deploys denormalized documents to static content
  * Reads and resolves the relationship graph
    * deploys the graph to the graph database engine
  * Extracts all attributes and their entry locations
    * deploys the attribute index to the attribute searcher
  * Generates the time segment tree index
    * deploys the tree to the time interval index

[Link to Flow Diagram](https://raw.githubusercontent.com/wiki/phylogeny-explorer/project-management/flow.svg)

![Flow Diagram](https://raw.githubusercontent.com/wiki/phylogeny-explorer/project-management/flow.svg)

## THAT'S NOT JSON IN YOUR EXAMPLES

Yes. This is a format called YAML ('YAML Ain't Markup Language', originally 'Yet Another Markup Language'). It is designed to be easy to type, yet it is as comprehensive as JSON and can be converted directly to JSON using a tool. I am typing this, so I used YAML to notate it.

How do you read it, you might ask... You can get all the details [here](https://symfony.com/doc/current/components/yaml/yaml_format.html); for a simple brief:

* `---` - beginning of document marker
* `key: value` - if you see a word and a `:`, you're in an object; in JSON, `{ "key": "value" }`
* `- value` - the hyphen is used for an array entry; in JSON, `[ "value" ]`
* indentation creates a nested object:
  ```yaml
  akey:
    bkey: value
  ```
  in JSON, `{ "akey": { "bkey": "value" } }`
* and the same with an array of objects:
  ```yaml
  akey:
    - bkey: value
  ```
  in JSON, `{ "akey": [ { "bkey": "value" } ] }`

Voila.

## STATES

### ANONYMOUS STATE

The client software (either the front end or some other client) will deliver credentials as necessary to obtain a token from the authorization system - a combination of a session token and identity that can be used to control what the API is able to do and query. The user control system is authoritative; the state is contained within the signed token. This is the OAuth flow.

### SESSION ACTIVE

A connection is made to the back end server with the signed authorization token. The back end verifies the authenticity of the token via the OAuth flow with the authentication service and returns a session token scoped to the back end servers. This allows the system of back end servers to maintain authority among themselves without having to depend request-by-request upon the availability of the authoritative authentication server. A minimal sketch of this exchange follows.
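The following is a minimal sketch of that exchange, assuming a Node/TypeScript back end and the `jsonwebtoken` library. Key handling, claim names, and function names are illustrative assumptions, not the project's actual implementation.

```typescript
// Sketch of SESSION ACTIVE: verify the authentication server's signed token,
// then issue a session token scoped to the back end servers.
import jwt from "jsonwebtoken";

// Public key of the Open Authentication Server (obtained out of band).
const AUTH_SERVER_PUBLIC_KEY = process.env.AUTH_PUBLIC_KEY ?? "";
// Secret shared among the back end servers for their own scoped session tokens.
const BACK_END_SESSION_SECRET = process.env.SESSION_SECRET ?? "";

// Step 1: verify the signed authorization token issued by the authentication server.
function verifyAuthorizationToken(token: string): { sub: string; roles?: string[] } {
  return jwt.verify(token, AUTH_SERVER_PUBLIC_KEY, {
    algorithms: ["RS256"],
  }) as { sub: string; roles?: string[] };
}

// Step 2: issue a session token scoped to the back end servers, so subsequent
// requests do not depend on the authentication server being reachable.
function issueBackEndSession(claims: { sub: string; roles?: string[] }): string {
  return jwt.sign(
    { sub: claims.sub, roles: claims.roles ?? [] },
    BACK_END_SESSION_SECRET,
    { expiresIn: "1h", audience: "data-api" }
  );
}

// Example exchange handler: authorization token in, back end session token out.
export function exchangeToken(authorizationToken: string): string {
  const claims = verifyAuthorizationToken(authorizationToken);
  return issueBackEndSession(claims);
}
```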
### QUERY RESULTS PENDING FRONT END ASSEMBLY

The application will make a query with its session token based on its own requirements. An example might be 'Biota and the first three clades beneath it', to drive the creation of an initial screen.

```yaml
---
version: 1.0
type: simple-graph
id: a9723dee-4bbc-11ea-b77f-2e728ce88125
depth: 2
languages:
  - ar
  - la
reference: GRAPH_ROOT
```

The response would come back with convenient display names in multiple languages - notice the RFC 2047 encoded UTF-8 strings for displaying the text in Arabic!

```yaml
---
version: 1.0
type: clade-set
meta:
  page: 1
  total_pages: 1
  total_resources: 4
  query:
    id: a9723dee-4bbc-11ea-b77f-2e728ce88125
    depth: 2
    reference: GRAPH_ROOT
objects:
  - name:
      la: Biota
      ar: =?UTF-8?B?2KfZhNmD2KfYptmG2KfYqiDYp9mE2K3Zitip?=
    link: /static/clade/uuid/562d0592-4bbc-11ea-b77f-2e728ce88125
  - name:
      la: Archaea
      ar: =?UTF-8?B?2KfZhNi52KrZitmC2Kk=?=
    link: /static/clade/uuid/562d0808-4bbc-11ea-b77f-2e728ce88125
  - name:
      la: Eubacteria
      ar: =?UTF-8?B?2KzZj9ix2ZLYq9mP2YXYqQ==?=
    link: /static/clade/uuid/562d095c-4bbc-11ea-b77f-2e728ce88125
  - name:
      la: Eukaryota
      ar: =?UTF-8?B?2YPYp9im2YYg2YjYrdmK2K8g2KfZhNiu2YTZitip?=
    link: /static/clade/uuid/562d0a88-4bbc-11ea-b77f-2e728ce88125
```

### QUERY PROCESSING

A query against the data APIs results in a set of clade entry URLs being delivered. The client software can display those names and in turn retrieve (or load from any of several possible layers of cache) the document for each clade. A minimal client-side sketch of this round trip follows.
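The sketch below assumes the data API accepts and returns the JSON equivalent of the YAML shown above; the endpoint path, header names, and field names are illustrative assumptions.

```typescript
// Client-side sketch: issue the 'simple-graph' query, then follow each returned
// link to the denormalized document on the static content server.
interface CladeSummary {
  name: Record<string, string>; // language code -> display name
  link: string;                 // /static/clade/uuid/<id>
}

interface CladeSetResponse {
  version: string;
  type: "clade-set";
  objects: CladeSummary[];
}

const API_BASE = "http://explorer.phylogenyexplorerproject.com"; // host from the examples below

export async function exploreFromRoot(sessionToken: string): Promise<unknown[]> {
  // Issue the query with the back end session token.
  const queryResponse = await fetch(`${API_BASE}/api/query`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${sessionToken}`,
    },
    body: JSON.stringify({
      version: "1.0",
      type: "simple-graph",
      depth: 2,
      languages: ["ar", "la"],
      reference: "GRAPH_ROOT",
    }),
  });
  const cladeSet = (await queryResponse.json()) as CladeSetResponse;

  // Retrieve each denormalized clade document (the step described next).
  return Promise.all(
    cladeSet.objects.map(async (clade) => (await fetch(`${API_BASE}${clade.link}`)).json())
  );
}
```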
### CLADE DATA RETRIEVE

A series of requests for these entries will be made to the static content server, which has the denormalized documents available:

* http://explorer.phylogenyexplorerproject.com/static/clade/uuid/562d0592-4bbc-11ea-b77f-2e728ce88125
* http://explorer.phylogenyexplorerproject.com/static/clade/uuid/562d0808-4bbc-11ea-b77f-2e728ce88125
* http://explorer.phylogenyexplorerproject.com/static/clade/uuid/562d095c-4bbc-11ea-b77f-2e728ce88125
* http://explorer.phylogenyexplorerproject.com/static/clade/uuid/562d0a88-4bbc-11ea-b77f-2e728ce88125

Each document would look something like:

```yaml
---
created_on: 1580929460
created_by:
  id: user/098359f7-59a3-43bd-82b6-2fcad56d0983
  name:
    la: Aran Ra
parent:
  id: 562d0592-4bbc-11ea-b77f-2e728ce88125
  link: /static/clade/uuid/562d0592-4bbc-11ea-b77f-2e728ce88125
  name:
    la: Biota
    ar: =?UTF-8?B?2KfZhNmD2KfYptmG2KfYqiDYp9mE2K3Zitip?=
depth: 2
name:
  la: Archaea
  ar: =?UTF-8?B?2KfZhNi52KrZitmC2Kk=?=
children:
  - name:
      la: Crenarchaeota
      ar: =?UTF-8?B?2LnYqtin2KbZgiDZhdi12K/YsdmK2Kk=?=
    link: /static/clade/uuid/562d0592-4bbc-11ea-b77f-2e728ce88125
    id: 562d0592-4bbc-11ea-b77f-2e728ce88125
  - name:
      la: Euryarchaeota
      ar: 2LnYqtin2KbZgiDYudix2YrYttip
    link: /static/clade/uuid/f22f492b-9328-4a5a-8751-0efd1712acd3
    id: f22f492b-9328-4a5a-8751-0efd1712acd3
  - name:
      la: Korarchaeota
      ar: 2LnYqtin2KbZgiDYtNin2KjYqQ
    link: /static/clade/uuid/3800a855-64d8-4e46-961e-d1be2eb0a0b5
    id: 3800a855-64d8-4e46-961e-d1be2eb0a0b5
  - name:
      la: Thermoplasmales
    link: /static/clade/uuid/a74f9cc3-ff77-48f0-b1a6-abbfcee7c136
    id: a74f9cc3-ff77-48f0-b1a6-abbfcee7c136
no_children: 4
no_descendant_clades: 4125372
no_species: 2148125
name_meaning:
  la: =?UTF-8?B?R3JlZWsgzrXhvZYgKGV1LCAnd2VsbCcgb3IgJ3RydWUnKSArIM66zqzPgc+Fzr/OvSAoa2FyeW9uLCAnbnV0JyBvciAna2VybmVsJyk=?=
other_names:
  - la: Eukarya
  - la: Eucaryotae
common_names:
  - la: Eukaryotes
description:
  la: =?UTF-8?B?RXVrYXJ5b3RlcyAoL2p1y5DLiGvDpnJpb8qKdCwgLcmZdC8pIGFyZSBvcmdhbmlzbXMgd2hvc2UgY2VsbHMgaGF2ZSBhIG51Y2xldXMgZW5jbG9zZWQgd2l0aGluIG1lbWJyYW5lcywgdW5saWtlIHByb2thcnlvdGVzIChCYWN0ZXJpYSBhbmQgQXJjaGFlYSksIHdoaWNoIGhhdmUgbm8gbWVtYnJhbmUtYm91bmQgb3JnYW5lbGxlcy4uLg==?=
rank: Domain
citation:
  name: Whittaker & Margulis
extant: true
conservation_status: Least concern
synapomorphies:
  - name:
      la: Biomembrane
      ar: =?UTF-8?2LrYtNin2KEg2K3ZitmI2Yo=?=
    link: /static/synapomorphy/uuid/a74f9cc3-ff77-48f0-b1a6-abbfcee7c136
computed_stats:
  temporal_range:
    start:
      name:
        la: Orosirian
        ar: =?UTF-8?2KfZhNij2YjYsdmI2LPZitix2Yo=?=
      when:
        scale: 2050
        unit: Mya
```

## CANONICAL DATA TYPES

### Normalized Clade Entry

This is the 'base' data type of the tree; the tree of life is formed from these entries. Each entry consists of a pointer to its parent and a list of characteristics. The characteristics are validated using the metadata of the Characteristics structure.

* node
  * has a parent characteristic
  * has a list of other characteristics
    * each with data
* species entry
  * is a leaf node
  * has species characteristics
    * each with data
* clade entry
  * has a list of asserted characteristics
    * each has characteristic data
    * described by a characteristic type
    * applies to all nodes beneath
  * has a list of deprecated characteristics
* taxonomy entry
  * parent representing taxonomies of which this is a subset
  * label
* characteristic entry
  * has a parent used for helping select and group characteristics

```yaml
---
Parent: PARENT-ID-HERE
NewTraits:
  - auriculae
LostTraits:
  - follicles
Characteristics: # contains a list of characteristic entries
  - characteristic: Binomun
    text:
      la: Brachylagus idahoensis # name in Latin, for example
  - characteristic: Physical
    sexualDimorphy:
      - gender: male
        quantification:
          type: length
          unit: cm
          scale: 32
      - gender: female
        quantification:
          type: length
          unit: cm
          scale: 30
```

### Characteristics

This is the way you attach information to a clade entry. This structure defines what the content and structure of any characteristic is - it is metadata, that is to say, data about the data: what do we call it, do we require it, how do we display it, what kind of content fields does it have? It is intended that as new characteristics are decided upon, they will be added to this list, and additional content fields can be added to them as needed.

Characteristic grammar:

* IDENTITY - a string (probably a UUID) which is used to identify a particular characteristic node
* LANGUAGE - a data type consisting of an object whose keys are LANGUAGE_CODE and whose values are data
* LANGUAGE_CODE - international language code: 'en' for English, 'ar' for Arabic, etc.
* RFC 2047 encoded UTF-8 string - either ascii-text or `"=?" charset "?" encoding "?" encoded-text "?="`
  * ascii-text - exactly like what you read here
  * charset - UTF-8
  * encoding - "B" for base64 or "Q" for quoted-printable
  * encoded-text - space/tab prohibited; encode occurrences of `?`, `=`, and `_`, all in accordance with the encoding
* SPECIES_REQUIRED_BOOLEAN - whether this characteristic is required on a leaf/species entry
* CLADE_REQUIRED_BOOLEAN - whether this characteristic is required on a node/clade entry
* INSTANCE_DATA_TYPE - describes a field for which configuration will be requested from the node
* DATA_TYPE - one of TEXT, DATE, FLAG
  * TEXT - expect a LANGUAGE type of data for this point
  * DATE - expect a data type consisting of an object whose keys are "value" with a DATE_VALUE value and "unit" with a TIME_UNIT value
  * ENUM - a limited number of selections
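Rendered as TypeScript types, the grammar above looks roughly like the sketch below; the type and field names are illustrative assumptions, not a normative schema. The YAML templates follow.

```typescript
// Minimal TypeScript rendering of the characteristic grammar.
type Identity = string;                               // IDENTITY - probably a UUID
type LanguageCode = string;                           // LANGUAGE_CODE - 'en', 'ar', ...
type Rfc2047String = string;                          // RFC 2047 encoded UTF-8 string
type Language = Record<LanguageCode, Rfc2047String>;  // LANGUAGE

// DATA_TYPE variants as described above.
type DataType = "TEXT" | "DATE" | "FLAG" | "ENUM";

export interface CharacteristicDataType {
  id: Identity;                                 // IDENTITY_CHARACTERISTIC_CONTENT
  label: Language;                              // label shown when entering data
  type: DataType;
  enum?: { label: Language; value: unknown }[]; // only present for ENUM
}

export interface Characteristic {
  id: Identity;                                 // IDENTITY_CHARACTERISTIC
  parent?: Identity;                            // for relating characteristics to each other
  label: Language;                              // label shown when selecting
  required: {
    leaf: boolean;                              // SPECIES_REQUIRED_BOOLEAN
    node: boolean;                              // CLADE_REQUIRED_BOOLEAN
  };
  content: CharacteristicDataType[];            // the content fields this characteristic carries
}
```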
```yaml
---
# template of a CHARACTERISTIC
id: IDENTITY_CHARACTERISTIC
parent: IDENTITY_CHARACTERISTIC # the parent of this characteristic, for purposes of relating characteristics to each other
label:
  LANGUAGE_CODE: RFC2047 encoded UTF-8 string representing the label of the characteristic for selecting
content:
required:
  leaf: SPECIES_REQUIRED_BOOLEAN
  node: CLADE_REQUIRED_BOOLEAN
```

```yaml
---
# template of a CHARACTERISTIC-DATA-TYPE
id: IDENTITY_CHARACTERISTIC_CONTENT
label:
  LANGUAGE_CODE: RFC2047 encoded UTF-8 string representing the label of the characteristic data for entry
type: DATA_TYPE

# example of a CHARACTERISTIC-DATA-TYPE
---
label:
  en: furred
type: ENUM
enum:
  - label: LANGUAGE
    value: true
  - label: LANGUAGE
    value: false

# example of a CHARACTERISTIC-DATA-TYPE
---
label:
  en: Taxonomy
type: ENUM
enum:
  - label:
      en: Linaean
    value: LINAEAN
  - label:
      en: Phylogenetic
    value: PHYLOGENETIC

---
label:
  en: Length in meters
type: MEASUREMENT
unit: meter
data: 'length(meters)'

---
label:
  en:
```

```yaml
---
# example Root characteristic
name: Name
id: naming
required: true
---
name: CommonName
id: commonName
parent: naming
required:
  leaf: true # so you can flag what you must have (name) vs. what you might (range)
label:
  en: Common Name(s)
content:
  name: LANGUAGE
---
name: VernacularName
parent: naming
required: false
label:
  en: Vernacular name(s)
content:
  text: LANGUAGE
---
name: Binomun
parent: naming
required: true # so you can flag what you must have (name) vs. what you might (range)
label:
  en: Binomun name # The name of the field
content:
  text: LANGUAGE
---
name: Attribution
required: true
label:
  en: Attribution
content:
  name: LANGUAGE
  date: DATE-YEAR
  original: FLAG
  vide: FLAG
  sensu: FLAG
  non: FLAG
  emended: FLAG
  basic: LANGUAGE
---
name: Physical
required: false
label:
  en: Physical
content:
  description: LANGUAGE
  sexualDimorphy:
    - gender: GENDER
      quantification: QUANTIFICATION
---
name: Image
required: false
label:
  en: Image(s)
content:
  image: IMAGEREF
  attribution: LANGUAGE
  description: LANGUAGE
---
```

## Quantification Types

```yaml
---
length:
  reference: meter
  unit:
    - name: kilometer
      factor: 1000
    - name: meter
      factor: 1
    - name: centimeter
      factor: 0.01
    - name: millimeter
      factor: 0.001
mass:
  reference: kilogram
  unit:
    - name: kilogram
      factor: 1
    - name: metric ton
      factor: 1000
    - name: US ton
      factor: 907.185
    - name: grams
      factor: 0.001
volume:
  reference: liter
  unit:
    - name: liter
      factor: 1
time:
  reference: day
  unit:
    - name: year
      factor: 365.25
    - name: day
      factor: 1
energy:
  reference: watts
  unit:
    - name: watts
      factor: 1
```
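As an illustration of how the reference/factor table might be used, here is a minimal TypeScript sketch. It assumes the JSON equivalent of the YAML above; the function name and error handling are illustrative.

```typescript
// Convert a quantification between units by going through the reference unit.
interface UnitDef {
  name: string;
  factor: number; // multiplier to the reference unit
}

interface QuantityKind {
  reference: string;
  unit: UnitDef[];
}

type QuantificationTable = Record<string, QuantityKind>;

// value * fromFactor gives the value in the reference unit; dividing by
// toFactor converts it to the target unit.
function convert(
  table: QuantificationTable,
  kind: string,
  value: number,
  fromUnit: string,
  toUnit: string
): number {
  const kindDef = table[kind];
  if (!kindDef) throw new Error(`unknown quantity kind: ${kind}`);
  const from = kindDef.unit.find((u) => u.name === fromUnit);
  const to = kindDef.unit.find((u) => u.name === toUnit);
  if (!from || !to) throw new Error(`unknown unit for ${kind}`);
  return (value * from.factor) / to.factor;
}

// Example: 32 centimeters expressed in millimeters.
const table: QuantificationTable = {
  length: {
    reference: "meter",
    unit: [
      { name: "meter", factor: 1 },
      { name: "centimeter", factor: 0.01 },
      { name: "millimeter", factor: 0.001 },
    ],
  },
};
console.log(convert(table, "length", 32, "centimeter", "millimeter")); // 320
```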
### Exploration of Tree and Attributes

```yaml
# THE MATTER OF TREE - OR TAXONOMIC PARENTS
# Should the parent of a clade be qualified by the various taxonomies?

--- # (primates)
self:
  linnaean:
    rank: Order
    name: Primates
  cladistic: Primates

--- # (cheirogaleidae)
self:
  linnaean:
    rank: Family
    name: Cheirogaleidae
  cladistic: Cheirogaleidae
parent:
  linnaean:
    rank: Order
    name: Primates
  cladistic: Lemuriformes

--- # (lemurs)
self:
  linnaean:
    rank: Family
    name: Lepilemuridae
  cladistic: Lemurs
parent:
  linnaean:
    rank: Order
    name: Primates
  cladistic: Lemuriformes

--- # (lemurlike)
self:
  cladistic: Lemuriformes
parent:
  cladistic: Strepsirrhini

--- # (lemurish?)
self:
  cladistic: Strepsirrhini
parent:
  cladistic: Primates

--- # (Allocebus)
self:
  linnaean:
    rank: Genus
    name: Allocebus
parent:
  linnaean:
    rank: Family
    name: Cheirogaleidae

--- # (trichotis)
self:
  linnaean:
    rank: Species
    name: trichotis
  cladistic: Allocebus trichotis
parent:
  linnaean:
    rank: Genus
    name: Allocebus
  cladistic: Cheirogaleidae
flags:
  - extant
characteristics:
  - feature:
      bodyPart: ear
      observation: very short
  - feature:
      bodyPart: ear
      observation: tufts of wavy hair that project above ear pelage
  - feature:
      bodyPart: face
      observation: darker grey triangle between eyes
  - feature:
      bodyPart: eyes
      observation: dark narrow rings around
  - feature:
      bodyPart: lips
      observation: light pink color
  - feature:
      bodyPart: nose
      observation: light pink color
  - feature:
      bodyPart: tongue
      observation: long relative to other dwarf lemurs
  - sizeClass:
      relative: Dwarf
  - massRange:
      low: 75
      high: 98
      unit: gram
  - dimensionHead:
      low: 125
      high: 145
      unit: millimeter
  - dimensionHeadTail:
      low: 150
      high: 195
      unit: millimeter
```

* ~ two different conceptual trees in the example - one Linnaean, one cladistic, both constructable out of the same normalized data structure
* + allows expansion to any number of phylogenies
* + provides the basis for 'switching phylogeny' programmatically
* - demands phylogeny context when searches are done that are affected by phylogenetic relationships
* + accommodates data that may be used in the future
* - complicated questions regarding clade characteristics tagged between Linnaean nodes
* - complicated questions regarding non-monophyletic matter and characteristics
* + could leave the tree search restrictions to the most layered, detailed tree
* - restricting to the cladistic tree alone would potentially confuse non-monophyletic user expectations

## INTEGRATION SCRIPTING

### Tree/Graph Index

A tree is constructed by reading all of the files and using the 'Parent' field in each to construct a graph, where each node has an ID and the edges between nodes express the characteristics/traits. This structure is loaded into a graph query engine named 'Graph Engine'.

### Denormalized Data Files

The most important of the scripts is the one that creates the denormalized clade entry documents, which are designed to be read by the front end/client software for full information on a clade node.

In order to make this happen it needs to construct a tree - it reads all of the files and uses the 'Parent' field in each to build the tree. Its main task is to copy all of the traits and characteristics from parents to their children - stopping where the traits stop (see the pre-answering sketch under Scale and flow). The result is a series of documents with additive traits as you head out toward the leaves, with occasional subtractions.

These documents are made available in the 'Clade Entry Index' for fast and immediate retrieval on demand by ID.

### Attribute Search

Iterate over all of the defined variations of attributes and create a key-value document for each clade node and each data type:

* Clade node - each trait attribute and its value (to match trait attributes with clades)
* Data type - each trait type and the values that are in it (for autocomplete)

### Time Interval Index

Extract all of the time range quantifications and build a segment tree. This tree allows rapid query of all clade nodes that intersect with any particular time point. It also permits time range queries that aggregate all of the clades across a particular time range. A minimal sketch of such a query follows.
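To make the index contract concrete, here is a minimal sketch. The real index is described as a segment tree; this naive linear-scan stand-in simply illustrates the two queries it has to answer, with assumed field names and an assumed Mya (millions of years ago) convention.

```typescript
// Time interval index contract: point ("stab") queries and range queries.
interface CladeInterval {
  cladeId: string;
  start: number; // older bound, in Mya
  end: number;   // younger bound, in Mya; 0 for extant clades
}

class TimeIntervalIndex {
  constructor(private readonly intervals: CladeInterval[]) {}

  // All clades whose temporal range intersects a particular time point.
  stab(pointMya: number): string[] {
    return this.intervals
      .filter((i) => i.start >= pointMya && pointMya >= i.end)
      .map((i) => i.cladeId);
  }

  // All clades whose temporal range overlaps the range [olderMya, youngerMya].
  overlapping(olderMya: number, youngerMya: number): string[] {
    return this.intervals
      .filter((i) => i.start >= youngerMya && olderMya >= i.end)
      .map((i) => i.cladeId);
  }
}

// Example: which of these clades existed 300 Mya?
const index = new TimeIntervalIndex([
  { cladeId: "archaea", start: 3500, end: 0 },
  { cladeId: "dinosauria", start: 243, end: 66 },
]);
console.log(index.stab(300)); // ["archaea"]
```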
### Image Content

This script transfers images from the larger storage to which they have been allocated to each static content server for rapid serving.

### Front end application build

The compilation, verification, webpack compression, and deployment of the web code to the static content server and the front end web servers.

### Data API application build

The compilation, verification, and deployment of the data API code to the back end servers.

## Operational Data Versions

I am visualizing that any branch of the document data store can be made available (through the running of these scripts) and selected for querying via modifiers to the data API server. When a new version of the indexes is built, it should be possible to query the data API server about which versions of the data are currently available to query. Client software (such as our front end) can choose to display these choices to qualified users (editors) so that they can test and troubleshoot the new data version in the actual running system before deciding to make it live for the general public.

## Minimum Viable Product

We must identify the minimum viable queries necessary to support the explorer:

* what clades are near this clade (graph query)
* what is the data for this clade (clade entry index)

We must identify the minimum viable interaction with the editors:

* access to Git and modifying YAML files

To support these queries, we need only write basic scripting for:

* integration pipeline
  * clade canonical data integrity testing
  * front end build deploy
  * data api build deploy
* clade entry index build
* graph build