---
title: 'Avernus Project Brief'
disqus: hackmd
---

Avernus Project Outline
===

:::warning
**This is a work in progress**
Be careful, ok
:::

## Table of Contents

[TOC]

## Beginner's Guide

We already have some content built around Avernus. You can check it out here:

1. https://gitlab.com/ventures-data-services/avernus-data-lake-2
2. https://wiki.projects.ventures/index.php/Avernus_Data_Lake

## Problem statement

> There are no facts, only interpretations. [name=Friedrich Nietzsche]

* Why are we even doing this?
    * Microsoft licensing (expensive)
    * Legacy, complex stored procedures
    * We haven't got Evolution data
    * Data architecture isn't fun
        * Files / info spread across multiple locations, using different rules and storage
    * Data integrity / trust in data
    * Interoperability
    * Data lineage: the old DWH rules are not documented
    * Not future proof
    * Manual processes
    * Security
    * Technical debt

## Objectives

> I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it. [name=Bill Gates]

> Art is never finished, only abandoned. [name=Leonardo Da Vinci]

* What exactly do we need to achieve?
    * A large variety of data, from more sources
    * Standards/rules in place
    * Automation of processes (ETL)
    * Trusted reporting
    * Comprehensive documentation
    * A canonical view of patients / patient activity
    * Lineage tracked
    * Big audit data
    * User access automation, tracking, and security
    * API?
    * Automatic data source reconciliation/validation?
* What will we be able to do when the project is finished?
    * Use more data sources
    * CPI automation without manual practice submissions
    * Build other services from the data, e.g. a bot that can converse in some fashion
    * Interoperability projects become possible (FHIR)
    * Better point-in-time data
    * Quarterly reporting
    * Complete PMR / patient data across the MHN network
    * Clear business definitions of dimensions/measures
    * A clear path for new development and faster deployment
    * Scale services up and down automatically
    * Access to new/latest tools
* How will we know it is successful?

User stories
---

> The hardest thing to explain is the glaringly evident which everybody has decided not to see. [name=some lady]

> **Definition**: A user story is an informal, general explanation of a feature written from the perspective of the end user. Its purpose is to articulate how a feature will provide value to the customer. These stories use non-technical language to provide context for the development team and their efforts. After reading a user story, the team knows why they are building it, what they're building, and what value it creates.

1. Patient activities

```gherkin=
Feature: Patient activities
  The patient should be the centre of analysis,
  and all other events and transactions relate directly to that patient.

  # The first example has two steps
  Scenario: I want to see all activity for a single patient
    Given I have a unique identifier
    Then I can easily find all events related to that patient.

  # The second example has three steps
  Scenario: I want to see activities for patient cohorts
    Given I can identify a cohort of interest from diagnosis groups
    When I select one or several diagnosis groups
    Then I can easily find all events related to those patients.
```

2. Snapshots

```gherkin=
Feature: Snapshots
  We need to see 'current' patients and their activity.
  We also need to analyse data at a point in time.

  Scenario: Enrolments as at
    When I select a date range
    Then I can easily find patients enrolled *at that time*.

  Scenario: Labs/Measurement as at
    When I select a date range
    Then I can easily find patients with an HbA1c or BMI value *at that time*.
```

3. Data lineage

```gherkin=
Feature: Data lineage
  We can easily understand where each data point comes from.
  We can easily understand all transformations performed.

  Scenario: Audit response
    When auditors ask us to detail how our smoking data is built
    Then we can provide diagrams clarifying all steps and transformations.
```

> Gherkin reference here: https://docs.cucumber.io/gherkin/reference/

Solution options
---

> Do everything quickly and well. [name=GI Gurdjieff]

### 1. Continue as we are

*Let's just keep going.*

#### Pros

* We already put a lot of work into it.
* Gives Pedro angst; gives Alex a sense of accomplishment and innovation, even though data lakes have been around since 2014.

#### Cons

* Sunk cost fallacy
* Pedro keeps cryin' about all this jello shoe crap

### 2. AWS Lake Formation

*AWS Lake Formation is a service that makes it easy to set up a secure data lake in days!*

#### Pros

* Managed service
* Automated data lineage

#### Cons

* Cost (?)
* Rigidity (?)
* Using AWS Glue

### 3. RDS

*Let's just build a new database.*

> What's the point of going out? We're just going to wind up back here anyway. [name=Homer Simpson]

#### Pros

* More SQL knowledge in the team
* Simple to understand
* Compatibility with BI tools

#### Cons

* Cost
* Still need an ETL engine

### 4. Current + RDS

*Instead of creating a database from scratch, we keep the current ETL process, but instead of saving the dataframes as objects on S3, we save them as tables in one or more databases.*

#### Pros

* More SQL knowledge in the team
* Removes the cataloguing process
* Compatibility
* Best of both worlds (Spark for more advanced ETL, SQL for basic ETL)

#### Cons

* Cost (RDS + EMR)
* Have to re-do the data architecture

### 5. Amazon Neptune

*A graph database, so we can show our superiority to the masses; plus it makes sense given our clear separation of user + event.*

#### Pros

* Cool as heck
* Makes sense for our data

#### Cons

* Expensive
* None of us know Cypher
* None of us have designed a graph data architecture

POC
---

Let's use the Objectives and User stories to set up a quick POC process for the candidate solution options.

### POC design

The POC will ..........
In .......... timeframe
Based on XYZ datasets

### Option 1: What we were already doing (Avernus)

Anna/Dylan

### Option 2: +RDS

Pedro/Dan

### POC Measures

The POC will need to achieve the following (a rough code sketch follows the list):

* Join the following MoH and PMS datasets into a valid schema:
    * BPAC:
        * patient
        * inbox
        * register
    * Indici:
        * labs
        * patient
    * MoH:
        * NES
* The data will run from the beginning of the available records to 15/02/2021.
* Join PMS data with NES, for all enrolments since 1/1/2020 as a minimum.
* Deduplicate records.
* Set appropriate data types for date of birth (datetime) and lab results (int/float).
* End users can query the data using SQL, across time periods and by age group.
* End users can find average lab results, for specific tests, for a range of patient groups (based on age, ethnicity, and gender).
* The data can be visualised in Power BI.
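To give those measures a concrete shape, here is a minimal PySpark sketch of the kind of job either team will end up writing: it deduplicates records, sets proper types for date of birth and lab results, answers an "average lab result by age group" question in SQL, and, for the +RDS option, writes the curated frame to a database table over JDBC rather than to S3 objects. This is a sketch only: the paths, table names, and column names (`nhi_number`, `test_code`, `result_value`, `result_date`, the RDS endpoint) are placeholder assumptions, not our actual schema.

```python=
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("avernus-poc-sketch").getOrCreate()

# Hypothetical staging locations -- substitute the real BPAC/Indici/MoH extracts.
patients = spark.read.parquet("s3://avernus-staging/indici/patient/")
labs = spark.read.parquet("s3://avernus-staging/indici/labs/")

# Deduplicate records and set appropriate data types (POC measures).
patients = (
    patients
    .dropDuplicates(["nhi_number"])                            # one row per patient
    .withColumn("date_of_birth", F.to_date("date_of_birth"))   # DOB as a date type
)
labs = labs.withColumn("result_value", F.col("result_value").cast("float"))

# Expose the frames to SQL so end users can query across time and age groups.
patients.createOrReplaceTempView("patient")
labs.createOrReplaceTempView("lab")

# Average result for a specific test, by ten-year age band and gender.
avg_by_group = spark.sql("""
    SELECT FLOOR(DATEDIFF(l.result_date, p.date_of_birth) / 365.25 / 10) * 10 AS age_band,
           p.gender,
           AVG(l.result_value) AS avg_result
    FROM lab l
    JOIN patient p ON l.nhi_number = p.nhi_number
    WHERE l.test_code = 'HbA1c'
    GROUP BY 1, 2
    ORDER BY 1, 2
""")
avg_by_group.show()

# '+RDS' variant: persist the curated frame as a database table over JDBC
# instead of an S3 object (endpoint and credentials are placeholders).
(patients.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://<rds-endpoint>:5432/avernus")
    .option("dbtable", "reporting.patient")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("overwrite")
    .save())
```

Whichever option wins, the POC comparison is really about how much scaffolding (cataloguing, connection management, orchestration) each one wraps around a job of this shape.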
### POC Teams

The teams were split as follows:

* Zama Killer Katanas (ZK²) or (ZKK) [Daniel and Pedro]
* Tokyo Bomber Girls (TBG) [Anna and Dylan]

### POC Evaluation

In order to identify the best way forward from this POC, we need some evaluation measures in mind so that we can make a fair comparison. The team presentations should therefore cover:

* Cost
* Complexity
* Ease of maintenance
* Ability to scale and expand

## Appendix and FAQ

### Reference Data Model

```mermaid
graph TD
    id1[Indici] --> id2[PMS data]
    id3[Medtech] --> id2
    id2 --> id4[Staging]
    id5[NES] --> id6[MOH]
    id7[NNPAC] --> id6
    id8[NMDS] --> id6
    id6 --> id4
    id4 --> id9[Reporting]
    id9 --> id10[Power BI]
    id9 --> id11[Looker]
    id9 --> id12[Shiny]
    id9 --> id13[JDBC]
    id4 --> id14[FHIR]
    id14 --> id15[SEHR]
```

> Mermaid reference here: http://mermaid-js.github.io/mermaid/#/flowchart

:::info
**Find this document incomplete?** Leave a comment!
:::

###### tags: `Avernus` `Documentation`