---
title: 'Avernus Project Brief'
disqus: hackmd
---
Avernus Project Outline
===
:::warning
**This is a work in progress.** Be careful, OK?
:::
## Table of Contents
[TOC]
## Beginners' Guide
We already have some content built around Avernus. You can check it out here:
1. https://gitlab.com/ventures-data-services/avernus-data-lake-2
2. https://wiki.projects.ventures/index.php/Avernus_Data_Lake
## Problem statement
> There are no facts, only interpretations. [name=Friedrich Nietzsche]
* Why are we even doing this?
    * Microsoft licensing (expensive)
    * Legacy, complex stored procedures
    * We haven't got Evolution data
    * Data architecture ain't fun
    * Files/info spread across multiple locations, using different rules and storage
    * Data integrity / trust in data
    * Interoperability
    * Data lineage: the old DWH rules are not documented
    * Not future-proof
    * Manual processes
    * Security
    * Technical debt
## Objectives
> I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it. [name=Bill Gates]
> Art is never finished, only abandoned. [name=Leonardo Da Vinci]
* What exactly do we need to achieve?
    * Large variety of data, more sources
    * Standards/rules in place
    * Automation of processes (ETL)
    * Trusted reporting
    * Comprehensive documentation
    * Canonical view of patients / patient activity
    * Lineage tracked
    * Big audit data
    * User access automation, tracking, and security
    * API?
    * Automatic data source reconciliation/validation?
* What will we be able to do when the project is finished?
    * Bring in more data sources
    * CPI automation without manual practice submissions
    * Build other services from the data, e.g. a bot that can converse in some fashion
    * Make interoperability projects possible (FHIR)
    * Better point-in-time data
    * Quarterly reporting
    * Complete PMR / patient data across the MHN network
    * Clear business definitions of dimensions/measures
    * A clear path for new development and faster deployment
    * Scale services up and down automatically
    * Access to new/latest tools
* How will we know it is successful?
User stories
---
> The hardest thing to explain is the glaringly evident which everybody has decided not to see. [name=Ayn Rand]
>
**Definition**:
A user story is an informal, general explanation of a feature written from the perspective of the end user. Its purpose is to articulate how a feature will provide value to the customer.
These stories use non-technical language to provide context for the development team and their efforts. After reading a user story, the team knows why they are building it, what they are building, and what value it creates.
1. Patient activities
```gherkin=
Feature: Patient activities
  The patient should be the centre of analysis, and all other events
  and transactions relate directly to that patient.

  # The first example has two steps
  Scenario: I want to see all activity for a single patient
    Given I have a unique identifier
    Then I can easily find all events related to that patient.

  # The second example has three steps
  Scenario: I want to see activities for patient cohorts
    Given I can identify a cohort of interest from diagnosis groups
    When I select one or several diagnosis groups
    Then I can easily find all events related to those patients.
```
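To make this story concrete, here is a minimal PySpark sketch (Spark is already in use for our ETL). The `events` and `diagnoses` tables and their columns are hypothetical names, not the actual Avernus schema:
```python
# Minimal sketch only. `events`, `diagnoses`, `patient_id`, and
# `diagnosis_group` are illustrative names, not the real schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("patient-activity").getOrCreate()

events = spark.read.table("events")        # hypothetical: one row per patient event
diagnoses = spark.read.table("diagnoses")  # hypothetical: patient_id -> diagnosis_group

# Scenario 1: all activity for a single patient, via the unique identifier
single_patient = events.filter(events.patient_id == "ABC1234")

# Scenario 2: all activity for a cohort selected from diagnosis groups
cohort = (
    diagnoses.filter(diagnoses.diagnosis_group.isin("diabetes", "copd"))
             .select("patient_id")
             .distinct()
)
cohort_events = events.join(cohort, on="patient_id", how="inner")
```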
2. Snapshots
```gherkin=
Feature: Snapshots
  We need to see 'current' patients, and their activity.
  We also need to analyse data at a point in time.

  Scenario: Enrolments as at
    When I select a date range
    Then I can easily find patients enrolled *at that time*.

  Scenario: Labs/Measurements as at
    When I select a date range
    Then I can easily find patients with an HbA1c or BMI value *at that time*.
```
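A hedged sketch of the as-at logic, assuming an effective-dated `enrolments` table (with `valid_from`/`valid_to` columns) and a `labs` table with one row per result; all names are assumptions:
```python
# Sketch of point-in-time queries. Table and column names are assumptions.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("snapshots").getOrCreate()
as_at = F.to_date(F.lit("2021-01-31"))

# Enrolments as at: keep rows whose validity interval covers the date
enrolments = spark.read.table("enrolments")  # hypothetical effective-dated table
enrolled_then = enrolments.filter(
    (F.col("valid_from") <= as_at)
    & (F.col("valid_to").isNull() | (F.col("valid_to") > as_at))
)

# Labs as at: latest HbA1c per patient on or before the date
labs = spark.read.table("labs")  # hypothetical: patient_id, test_code, result, taken_at
latest = Window.partitionBy("patient_id").orderBy(F.col("taken_at").desc())
hba1c_then = (
    labs.filter((F.col("test_code") == "HbA1c") & (F.col("taken_at") <= as_at))
        .withColumn("rn", F.row_number().over(latest))
        .filter("rn = 1")
        .drop("rn")
)
```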
3. Data lineage
```gherkin=
Feature: Data lineage
  We can easily understand where each data point comes from.
  We can easily understand all transformations performed.

  Scenario: Audit response
    When auditors ask us to detail how our smoking data is built
    Then we can provide diagrams clarifying all steps and transformations.
```
> Gherkin reference here: https://docs.cucumber.io/gherkin/reference/
Solution options
---
> Do everything quickly and well. [name=G.I. Gurdjieff]
### 1. Continue as we are
*Let's just keep going.*
#### Pros
* We already put a lot of work into it.
* Gives Pedro angst, gives Alex a sense of accomplishment and innovation even though data lakes have been around since 2014
#### Cons
* Sunk cost fallacy
* Pedro keeps cryin' about all this jello shoe crap
### 2. AWS LakeFormation
*AWS Lake Formation is a service that makes it easy to set up a secure data lake in days!*
#### Pros
* Managed service
* Automated data lineage
#### Cons
* Cost (?)
* Rigidity (?)
* Tied to AWS Glue for ETL
### 3. RDS
*Let's just build a new database*
> What’s the point of going out? We’re just going to wind up back here anyway. [name=Homer Simpson]
#### Pros
* More SQL knowledge in team
* Simple to understand
* Compatibility with BI tools
#### Cons
* Cost
* Still need an ETL engine
### 4. Current + RDS
*Instead of creating a database from scratch, we keep the current ETL process, but rather than saving the dataframes as objects on S3, we save them as tables in one or more databases (see the sketch after the pros and cons).*
#### Pros
* More SQL knowledge in team
* Removes the cataloguing process
* Compatibility
* Best of both worlds (Spark for more advanced ETL, SQL for basic ETL)
#### Cons
* Cost (RDS + EMR)
* Re-do data architecture
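As promised above, a minimal sketch of option 4's final write step: keep the existing Spark job, but land the output in a Postgres-flavoured RDS instance over JDBC instead of writing objects back to S3. The S3 path, endpoint, schema, and credentials are all placeholders:
```python
# Option 4 sketch: the existing Spark ETL stays, only the final write changes.
# The bucket, RDS endpoint, and credentials below are placeholders.
# (Requires the Postgres JDBC driver on the Spark classpath.)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avernus-to-rds").getOrCreate()

df = spark.read.parquet("s3://avernus-staging/patients/")  # existing staged output

(
    df.write.format("jdbc")
      .option("url", "jdbc:postgresql://<rds-endpoint>:5432/avernus")
      .option("dbtable", "reporting.patients")
      .option("user", "etl_user")
      .option("password", "<from-secrets-manager>")
      .mode("overwrite")
      .save()
)
```
Spark still does the heavy transformations, while the catalogue and basic ETL become plain SQL tables that BI tools can query directly.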
### 5. Amazon Neptune
*Graph database so that we can show our superiority to the masses, plus it makes sense following our clear separation of user + event*
#### Pros
* Cool as heck
* Makes sense for our data
#### Cons
* Expensive
* None of us know Gremlin or SPARQL (Neptune's query languages)
* None of us have designed a graph data architecture
POC
---
Let's use the Objectives and User stories to set up a quick POC process for candidate Solution options.
### POC design
The POC will ..........
In .......... timeframe
Based on XYZ datasets
### Option 1: What we were already doing (Avernus)
Anna/Dylan
### Option 2: +RDS
Pedro/Dan
### POC Measures
The POC will need to achieve the following:
* Join the following MoH and PMS datasets into a valid schema:
* BPAC:
* patient
* inbox
* register
* Indici:
* labs
* patient
* MoH:
* NES
* The data will cover everything up to 15/02/2021.
* Join PMS data with NES, for all enrolments since 1/1/2020 as a minimum
* Deduplicate records.
* Set appropriate data types for date of birth (datetime) and lab results (int/float); see the sketch after this list.
* End user can query data using SQL across time periods and by age group
* End user can find average lab results, for specific tests, for a range of patient groups (based on age, ethnicity, gender)
* This data can be visualised in Power BI.
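To make these measures concrete, here is a hedged sketch of the deduplication, typing, and age-band query steps. Every table name, column name, and date format below is an assumption about the joined schema, not the real one.
```python
# POC-measure sketch: deduplicate, set types, then query average lab
# results by age band. All names and formats here are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("poc-measures").getOrCreate()

patients = (
    spark.read.table("patients")
         .dropDuplicates(["patient_id"])
         .withColumn("date_of_birth", F.to_date("date_of_birth", "dd/MM/yyyy"))
)
labs = (
    spark.read.table("labs")
         .dropDuplicates(["patient_id", "test_code", "taken_at"])
         .withColumn("result", F.col("result").cast("double"))
)

# Average HbA1c by 10-year age band, as at the POC cut-off (15/02/2021)
cutoff = F.to_date(F.lit("2021-02-15"))
avg_by_band = (
    labs.filter(F.col("test_code") == "HbA1c")
        .join(patients, "patient_id")
        .withColumn("age", F.floor(F.months_between(cutoff, F.col("date_of_birth")) / 12))
        .withColumn("age_band", F.floor(F.col("age") / 10) * 10)
        .groupBy("age_band")
        .agg(F.avg("result").alias("avg_hba1c"))
        .orderBy("age_band")
)
avg_by_band.show()
```
The resulting table could then be exposed to Power BI through whichever option wins the POC.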
### POC Teams
The teams were split as follows:
* Zama Killer Katanas (ZK²) or (ZKK) [Daniel and Pedro]
* Tokyo Bomber Girls (TBG) [Anna and Dylan]
### POC Evaluation
To identify the best way forward from this POC, we need some evaluation measures in mind so that we can make a fair comparison. The team presentations should therefore include commentary on:
* Cost
* Complexity
* Ease of maintenance
* Ability to scale and expand
## Appendix and FAQ
### Reference Data Model
```mermaid
graph TD
id1[Indici] --> id2[PMS data]
id3[Medtech] --> id2
id2 --> id4[Staging]
id5[NES] --> id6[MOH]
id7[NNPAC] --> id6
id8[NMDS] --> id6
id6 --> id4
id4 --> id9[Reporting]
id9 --> id10[Power BI]
id9 --> id11[Looker]
id9 --> id12[Shiny]
id9 --> id13[JDBC]
id4 --> id14[FHIR]
id14 --> id15[SEHR]
```
> Mermaid reference here: http://mermaid-js.github.io/mermaid/#/flowchart
:::info
**Find this document incomplete?** Leave a comment!
:::
###### tags: `Avernus` `Documentation`