# SQL-on-FHIR v2 (FHIR Connectathon May 2023)
## Connectathon follow ups:
* Ryan Brush to turn this doc into a level 2 spec draft proposal
* Ryan Brush to update the [experimental Python example](https://github.com/google/fhir-py/tree/main/google-fhir-views#readme) to follow the draft spec.
* John Grimes to create some test cases for level 2 views, including test data and expected outputs
* John Grimes to create proof-of-concept implementation of view execution
* John Grimes to create proof-of-concept implementation of Spark encoder that follows these patterns. (A modification of what Pathling currently does)
* Dan Gottlieb to write examples of level 2 syntax
* Nikolai working on annotated parser for FHIRPath to translate to SQL (Duckdb & Postgres jsonb)
* Nikolai generate parquet from SD
* Make Rich Hickey DM for ideas we discussed
## Goals and Requirements
[Previous draft of requirements here](https://github.com/FHIR/sql-on-fhir-v2/discussions/69).
### A portable, unambiguous specification
Any good standard is unambiguous and portable between technology stacks, and this is no exception.
## Define an "Annotated Layer" for db-friendly lossless storage of hierarchical FHIR data, annotated to improve queryability
Problem: Raw FHIR can be inefficient to use inside a database, requiring repetitive extraction, string parsing, date math, etc.
Solution: Annotated Layer establishes consistent pre-processing and annotation of raw FHIR resources to simplify downstream logic.
Inputs: raw FHIR data (e.g., `.ndjson` files from Bulk `$export`)
Outputs: annotated FHIR data in a query-ready format -- e.g. as json in files, as a json column in a database, or in a hierarchical schema in a columnar database
#### TODOs:
* Define approach for extracting contained resources
* Define additional annotations (see [annotation TODOs](#annotation-todos))
* Determine the scope of these recommendations (e.g. JSON, Parquet, etc -- or leave those details out of scope?)
* Determine what unit conversions (if any) should be done in the annotation layer, or defer those to the view or analytics layers?
## Define a "View Layer" that maps hierarchical FHIR data into rectangular tables
Some common features such as:
* filter resources by simple fhirpath subset
* allow unnesting across paths in a resource
* extract data into columns
Leave room for optional features such as:
* units canonicalization / conversion
* resource reference resolution
* conditional evaluations
Possible approach:
- Convert existing queries into declarative FHIRPath rule syntax
- Generate SQL queries from FHIRPath rules
### Core FHIRPath function support:
Implementations of this spec MUST support these:
* `where` function to select items in arrays (like the home address)
* `exists` function to support filtering iitems
* `extension(url)` shortcut function
* `join` function
* `split` ?
* `equals` operator, primarily for use in the where function above.
* `ofType` function to select the desired value type.
* `first` function. This is easily implemented as getting the first item in an array, and simplifies the output when users are looking for only a single value for a scalar column.
* boolean operators (_and_, _or_, _not_)
* basic arithmetic (+, -, *, /)
* comparisons: =, !=, >, <=
* Literals for strings, numbers
* ?? string `substring`, `matches`, `startsWith`
* ?? terminology `.hasConcept(system,code)`
* ?? combining collections with `|`
* ?? set membership checks
* ?? date normalization
* Potential functions:
- reference.getId()
- (date | dateTime | instant | period).getEarliestInstant()
- (date | dateTime | instant | period).getLatestInstant()
- (date | dateTime | instant | period).getOriginalPrecision()
### Optional FHIRPath function support:
Implementations of this spec MAY support these:
* Subset of unit conversion operators
* `memberOf` function to allow checking for value sets
* Current
#### Grahame discussion:
* How to deal with CodeableReferences, unnest support Observation with multiple components
* Nested codesystems, CodeableReferences?
* How to deal with normalized references? In the base layer or view layer? (May be satisfied with key extraction in the annotation layer)?
* Dealing with datetime/normalization. Convert into periods?
#### Discussion topics
Should we allow arrays/objects as outputs in our Views?
* Can have indicators on the view to express
* This view can output anything (primitives, arrays, maps)
* This view only outputs primitives and primitive[]
* This view only outputs only primitives
* Goal: explicit indicators to allow better linting + feedback when authoring views
* Goal: determine which views from a library will be compatible with your environment
* Can pass a flag to executor to indicate "please fail if you get any outputs other than ____(primitives)____"
* Goal: allow flesibility about how to handle vs suppress errors at runtime
##### Defining behavior for use cases when output includes structs or arrays
**Working proposal**
* View specification allows for arrays to be produced for columns (they are not error conditions). There are use cases where such arrays are useful.
* View runner implementations can determine whether to either output arrays (when supported) or produce errors indicating nested arrays are not supported.
##### Metadata per column
* Required metadata per column:
* Column Name, restricted to `[A-Z][a-z][0-9]_`, can't start with a number
* FHIRPath expression
* Optional metadata per column:
* Human-readable description of column
Potential future option to add database type hints as metadata information to columns if there is a need, but this is out of scope for initial spec.
##### Working Proposal for vars and unnests semantics
<div id="working-vars-proposal"></div>
** Working Proposal **
```json
"vars": [{
"name": "mrn",
"expr": "Patient.identifier",
"whenMany": "error" | "array" | "unnest" | "cross"
}, {
"name": "myMrn",
"expr": "'12345'",
"whenMany": "error" // default behavior is an error
}]
```
Processing model:
* Loop over vars with nested loops
* If a var is marked `"whenMany": "error"` (or has no `whenMany`) and multiple values are present, the processor should log or throw an error
* If a var is marked `"whenMany": "unnest"` but introduces a new path traversal, this is an error
**Discussion Notes**
?? Combine vars and expands into the same field and use a flag to indicate whether it unnests/cross join per row?
```json
"vars": [
{"name": "address", ""},
{"name": "system", "expr": "'http://snomed.info/sct'"}
],
```
// Works, always will look for SNOMED codes
code.where(system=%system)
TODOs: Consider Ryan's half-baked idea expanding to a specific point to create a row per item -- e.g. Patient.contact.telecom creates a row per telecom, cross-joined up the tree with parents.
##### Date/time type handling in views
**Working proposal:**
* Judiciously add functions to handle needed patterns; exact functions TBDs -- see discussion below.
* May build on annotations stored in layer 1, if they exist -- or compute on the fly.
**Discussion notes:**
* Options:
* ?? Judiciously add functions to FHIRPath to retrieve date ranges on date? (may use annotations under the hood if present or calculate in SQL)
* ?? Add column-level metadata in the view to handle such behavior?
* ?? Use FHIR extensions to logically represent this from the view layer (but implementations could still get them from annotations)
* ?? Level 1 should be able to store anything that passes FHIR profile validation -- but should have annotations to make working with it simpler?
* Could follow Grahame's approach of splitting less precise dates into timestamps as an annotation level?
* How would we access date range annotations at level 2? A custom FHIRPath function?
* Could be done in view layers?
* ?? Level 2: Include support for FHIRPath`toDate` and `toDateTime` -- runner can use first-class date types in those cases?
* Treat partial dates as an error or empty result?
* The toDate() behavior seems confusing -- may just return a date if only the year is set, not meeting the above behavior.
##### View Definition JSON Syntax:
Working view structure for convenience to define fields and semantics, which will be translated to a FHIR resource definition once we have clear semantics.
```json
{
"name": "",
"desc": "",
"resource": "",
// Variables can lead to unnested rows based on whenMany behavior.
"vars": [{
"name": "",
"expr": "",
// ?? Default to error?
"whenMany": "error" | "array" | "unnest" | "cross"
}, {
"name": "",
"expr": "",
"whenMany": ""
}],
// Order of columns define order seen in output table
// ?? Should we have a whenMany at the column if its expression producces many?
"columns": [
{name: "", expr: "", desc: ""},
{name: "", expr: "", desc: ""}
],
// Filters are joined with an implicit "and"
"filters": [
{name: "", expr: "", desc: ""},
{name: "", expr: "", desc: ""}
]
}
```
Processing variables:
* Each variable is evaluated against the resource, with all previous variables as context
* Each column is evaluated against the resource,th all variables as context
#### Alternative with nesting
?? We decided to follow the above approach
```yaml=
as: 'Patient'
expand:
- expr: 'Patient.contact'
name: cont
expand:
- name: tel
expr: '.telecom'
- name: ident:
expr: 'Patient.identifier'
table:
id: 'id'
name: '%con.name.family.first()'
tel: '%tel.value'
ident_sys: '%ident.system'
ident_val: '%ident.value'
```
#### Test cases - WIP
https://hackmd.io/@johngrimes/HkS3TYwE3
### TODO:
* Determine how to deal with date shorthand (e.g., `2023` rather than `2023-01-01`). Also, should this be in Annotation level or flattening level?
* Clearly define unnest/expand semantics and relationship with trees of unnests and the relationship with the parent
## Level 3 transformations - calculate analytics and measures based on flat tables:
- Test approach with existing queries
- Explore query portability between systems through SQL conversion tools like sqlglot and/or macro solutions like DBT
---
## Annotation layer algorithms
- Update resource ids to be unique (if necessary)
- Build URL with base url, resourceType, and id
- Remove scheme
- Hash
- Set as sof_id
- Extract contained resources
- Build URL with parent resource id + "#" + contained resource id
- Remove URL scheme
- Hash
- Update resource id
- Set as sof_id
- Extract resource from parent resource
- Update internal references in former parent to add new sof_id
- Extract ids from references and store alongside url version when creating unique ids
- If absolute URL, and base matches FHIR server base URL (ie not an external reference): id = hash of url without scheme
- If relative URL: id = hash of base url without scheme + relative url
- If contained URL: id = hash of base url without scheme + relative url + "#" + relative id
- Store previous Reference.reference as sof_reference_prev
- Update Reference.reference to [resourceType]/[hashed id]
- Populate reference.type if not populated
- Populate reference.sof_id with the hashed id
- Extract ids from references and store alongside url version when NOT creating unique ids
- Store previous Reference.reference as sof_reference_prev
- Populate reference.type if not populated
- Populate reference.sof_id with the last segment of the Reference.reference url
- Standardize date representation
- Convert value to UTC
- If partial dates, convert to start and end with sub-second precision. For example, '2018-05' will be populated with start date being '2018-05-01T00:00:00.000Z' and end date being '2018-05-31T23:59:59.999Z'. '2017-03-01' will be populated with start date being '2017-03-01T00:00:00.000Z' and end date being '2017-03-01T23:59:59.999Z'
- Instant types should have the same start and end
- Periods should have a start date of the earliest instant of period.start and an end date of the lastest instant of period.end
- Add to resource as sof_{elementName}.start and sof_{elementName}_aa.end
- *FHIR Timing elements are ignored at present due to their limited use and the complexity involved in converting them into a date range.*
- Standardize extension representation
- Standardize units (details tbd, look at implementations like pathling)
```mermaid
flowchart TD
A[Raw FHIR Resources] --> B[Annotated Layer\n-- Hierarchies, Generic]
B --> D[View Layer\n-- Rectangles, Use Case Specific]
B --> F
D --> F[Analytics Queries]
```