SQL-on-FHIR v2 (FHIR Connectathon May 2023)

# SQL-on-FHIR v2 (FHIR Connectathon May 2023) ## Connectathon follow ups: * Ryan Brush to turn this doc into a level 2 spec draft proposal * Ryan Brush to update the [experimental Python example](https://github.com/google/fhir-py/tree/main/google-fhir-views#readme) to follow the draft spec. * John Grimes to create some test cases for level 2 views, including test data and expected outputs * John Grimes to create proof-of-concept implementation of view execution * John Grimes to create proof-of-concept implementation of Spark encoder that follows these patterns. (A modification of what Pathling currently does) * Dan Gottlieb to write examples of level 2 syntax * Nikolai working on annotated parser for FHIRPath to translate to SQL (Duckdb & Postgres jsonb) * Nikolai generate parquet from SD * Make Rich Hickey DM for ideas we discussed ## Goals and Requirements [Previous draft of requirements here](https://github.com/FHIR/sql-on-fhir-v2/discussions/69). ### A portable, unambiguous specification Any good standard is unambiguous and portable between technology stacks, and this is no exception. ## Define an "Annotated Layer" for db-friendly lossless storage of hierarchical FHIR data, annotated to improve queryability Problem: Raw FHIR can be inefficient to use inside a database, requiring repetitive extraction, string parsing, date math, etc. Solution: Annotated Layer establishes consistent pre-processing and annotation of raw FHIR resources to simplify downstream logic. Inputs: raw FHIR data (e.g., `.ndjson` files from Bulk `$export`) Outputs: annotated FHIR data in a query-ready format -- e.g. as json in files, as a json column in a database, or in a hierarchical schema in a columnar database #### TODOs: * Define approach for extracting contained resources * Define additional annotations (see [annotation TODOs](#annotation-todos)) * Determine the scope of these recommendations (e.g. JSON, Parquet, etc -- or leave those details out of scope?) * Determine what unit conversions (if any) should be done in the annotation layer, or defer those to the view or analytics layers? ## Define a "View Layer" that maps hierarchical FHIR data into rectangular tables Some common features such as: * filter resources by simple fhirpath subset * allow unnesting across paths in a resource * extract data into columns Leave room for optional features such as: * units canonicalization / conversion * resource reference resolution * conditional evaluations Possible approach: - Convert existing queries into declarative FHIRPath rule syntax - Generate SQL queries from FHIRPath rules ### Core FHIRPath function support: Implementations of this spec MUST support these: * `where` function to select items in arrays (like the home address) * `exists` function to support filtering iitems * `extension(url)` shortcut function * `join` function * `split` ? * `equals` operator, primarily for use in the where function above. * `ofType` function to select the desired value type. * `first` function. This is easily implemented as getting the first item in an array, and simplifies the output when users are looking for only a single value for a scalar column. * boolean operators (_and_, _or_, _not_) * basic arithmetic (+, -, *, /) * comparisons: =, !=, >, <= * Literals for strings, numbers * ?? string `substring`, `matches`, `startsWith` * ?? terminology `.hasConcept(system,code)` * ?? combining collections with `|` * ?? set membership checks * ?? date normalization * Potential functions: - reference.getId() - (date | dateTime | instant | period).getEarliestInstant() - (date | dateTime | instant | period).getLatestInstant() - (date | dateTime | instant | period).getOriginalPrecision() ### Optional FHIRPath function support: Implementations of this spec MAY support these: * Subset of unit conversion operators * `memberOf` function to allow checking for value sets * Current #### Grahame discussion: * How to deal with CodeableReferences, unnest support Observation with multiple components * Nested codesystems, CodeableReferences? * How to deal with normalized references? In the base layer or view layer? (May be satisfied with key extraction in the annotation layer)? * Dealing with datetime/normalization. Convert into periods? #### Discussion topics Should we allow arrays/objects as outputs in our Views? * Can have indicators on the view to express * This view can output anything (primitives, arrays, maps) * This view only outputs primitives and primitive[] * This view only outputs only primitives * Goal: explicit indicators to allow better linting + feedback when authoring views * Goal: determine which views from a library will be compatible with your environment * Can pass a flag to executor to indicate "please fail if you get any outputs other than ____(primitives)____" * Goal: allow flesibility about how to handle vs suppress errors at runtime ##### Defining behavior for use cases when output includes structs or arrays **Working proposal** * View specification allows for arrays to be produced for columns (they are not error conditions). There are use cases where such arrays are useful. * View runner implementations can determine whether to either output arrays (when supported) or produce errors indicating nested arrays are not supported. ##### Metadata per column * Required metadata per column: * Column Name, restricted to `[A-Z][a-z][0-9]_`, can't start with a number * FHIRPath expression * Optional metadata per column: * Human-readable description of column Potential future option to add database type hints as metadata information to columns if there is a need, but this is out of scope for initial spec. ##### Working Proposal for vars and unnests semantics <div id="working-vars-proposal"></div> ** Working Proposal ** ```json "vars": [{ "name": "mrn", "expr": "Patient.identifier", "whenMany": "error" | "array" | "unnest" | "cross" }, { "name": "myMrn", "expr": "'12345'", "whenMany": "error" // default behavior is an error }] ``` Processing model: * Loop over vars with nested loops * If a var is marked `"whenMany": "error"` (or has no `whenMany`) and multiple values are present, the processor should log or throw an error * If a var is marked `"whenMany": "unnest"` but introduces a new path traversal, this is an error **Discussion Notes** ?? Combine vars and expands into the same field and use a flag to indicate whether it unnests/cross join per row? ```json "vars": [ {"name": "address", ""}, {"name": "system", "expr": "'http://snomed.info/sct'"} ], ``` // Works, always will look for SNOMED codes code.where(system=%system) TODOs: Consider Ryan's half-baked idea expanding to a specific point to create a row per item -- e.g. Patient.contact.telecom creates a row per telecom, cross-joined up the tree with parents. ##### Date/time type handling in views **Working proposal:** * Judiciously add functions to handle needed patterns; exact functions TBDs -- see discussion below. * May build on annotations stored in layer 1, if they exist -- or compute on the fly. **Discussion notes:** * Options: * ?? Judiciously add functions to FHIRPath to retrieve date ranges on date? (may use annotations under the hood if present or calculate in SQL) * ?? Add column-level metadata in the view to handle such behavior? * ?? Use FHIR extensions to logically represent this from the view layer (but implementations could still get them from annotations) * ?? Level 1 should be able to store anything that passes FHIR profile validation -- but should have annotations to make working with it simpler? * Could follow Grahame's approach of splitting less precise dates into timestamps as an annotation level? * How would we access date range annotations at level 2? A custom FHIRPath function? * Could be done in view layers? * ?? Level 2: Include support for FHIRPath`toDate` and `toDateTime` -- runner can use first-class date types in those cases? * Treat partial dates as an error or empty result? * The toDate() behavior seems confusing -- may just return a date if only the year is set, not meeting the above behavior. ##### View Definition JSON Syntax: Working view structure for convenience to define fields and semantics, which will be translated to a FHIR resource definition once we have clear semantics. ```json { "name": "", "desc": "", "resource": "", // Variables can lead to unnested rows based on whenMany behavior. "vars": [{ "name": "", "expr": "", // ?? Default to error? "whenMany": "error" | "array" | "unnest" | "cross" }, { "name": "", "expr": "", "whenMany": "" }], // Order of columns define order seen in output table // ?? Should we have a whenMany at the column if its expression producces many? "columns": [ {name: "", expr: "", desc: ""}, {name: "", expr: "", desc: ""} ], // Filters are joined with an implicit "and" "filters": [ {name: "", expr: "", desc: ""}, {name: "", expr: "", desc: ""} ] } ``` Processing variables: * Each variable is evaluated against the resource, with all previous variables as context * Each column is evaluated against the resource,th all variables as context #### Alternative with nesting ?? We decided to follow the above approach ```yaml= as: 'Patient' expand: - expr: 'Patient.contact' name: cont expand: - name: tel expr: '.telecom' - name: ident: expr: 'Patient.identifier' table: id: 'id' name: '%con.name.family.first()' tel: '%tel.value' ident_sys: '%ident.system' ident_val: '%ident.value' ``` #### Test cases - WIP https://hackmd.io/@johngrimes/HkS3TYwE3 ### TODO: * Determine how to deal with date shorthand (e.g., `2023` rather than `2023-01-01`). Also, should this be in Annotation level or flattening level? * Clearly define unnest/expand semantics and relationship with trees of unnests and the relationship with the parent ## Level 3 transformations - calculate analytics and measures based on flat tables: - Test approach with existing queries - Explore query portability between systems through SQL conversion tools like sqlglot and/or macro solutions like DBT --- ## Annotation layer algorithms - Update resource ids to be unique (if necessary) - Build URL with base url, resourceType, and id - Remove scheme - Hash - Set as sof_id - Extract contained resources - Build URL with parent resource id + "#" + contained resource id - Remove URL scheme - Hash - Update resource id - Set as sof_id - Extract resource from parent resource - Update internal references in former parent to add new sof_id - Extract ids from references and store alongside url version when creating unique ids - If absolute URL, and base matches FHIR server base URL (ie not an external reference): id = hash of url without scheme - If relative URL: id = hash of base url without scheme + relative url - If contained URL: id = hash of base url without scheme + relative url + "#" + relative id - Store previous Reference.reference as sof_reference_prev - Update Reference.reference to [resourceType]/[hashed id] - Populate reference.type if not populated - Populate reference.sof_id with the hashed id - Extract ids from references and store alongside url version when NOT creating unique ids - Store previous Reference.reference as sof_reference_prev - Populate reference.type if not populated - Populate reference.sof_id with the last segment of the Reference.reference url - Standardize date representation - Convert value to UTC - If partial dates, convert to start and end with sub-second precision. For example, '2018-05' will be populated with start date being '2018-05-01T00:00:00.000Z' and end date being '2018-05-31T23:59:59.999Z'. '2017-03-01' will be populated with start date being '2017-03-01T00:00:00.000Z' and end date being '2017-03-01T23:59:59.999Z' - Instant types should have the same start and end - Periods should have a start date of the earliest instant of period.start and an end date of the lastest instant of period.end - Add to resource as sof_{elementName}.start and sof_{elementName}_aa.end - *FHIR Timing elements are ignored at present due to their limited use and the complexity involved in converting them into a date range.* - Standardize extension representation - Standardize units (details tbd, look at implementations like pathling) ```mermaid flowchart TD A[Raw FHIR Resources] --> B[Annotated Layer\n-- Hierarchies, Generic] B --> D[View Layer\n-- Rectangles, Use Case Specific] B --> F D --> F[Analytics Queries] ```

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.