# Parquet on FHIR Introduction

## Open Questions / TODOs

- Definitions of:
  - impl pattern
  - use cases
  - fine grained schema levels
  - "overlay"
- Goals
  - Implementation patterns: Both "SoF" and "raw SQL" or not?
- Research
  - Support by specific systems for schema "overlay" (ClickHouse, BigQuery, Athena, etc.)
    - If not supported, what are fallbacks/workarounds? And are limitations acceptable for expected use cases?
    - Can add'l tooling help?
  - DDB specific: will there be future support for schema "overlay" for PQ?
- Parquet File Optimization for FHIR
  - Later: what optimizations are useful for PQ files (compression algos, row group size, etc.)

## Problem statement

The global healthcare community has a strong desire to use FHIR data with modern data warehouse and analytics platforms. A recent trend in these systems is the use of highly optimized column-oriented data representations because of their efficiency in running complex analytics queries over large data sets. A few popular examples of these systems are Apache Spark, Google BigQuery, Amazon Redshift, Snowflake, and DuckDB.

The de facto standard data format for large FHIR data sets is [NDJSON](https://github.com/ndjson/ndjson-spec) files that contain FHIR resources as JSON-serialized strings. The popularity of this format is mainly due to its natural fit with FHIR data represented as JSON and its use in the FHIR Bulk Data IG. However, limitations of the JSON and NDJSON formats hamper their ability to be queried efficiently at scale, such as the need for parsing and the lack of indexing and statistics.

A number of methods for representing FHIR data within columnar data formats already exist in tools and data platforms that have implemented FHIR support. There is an opportunity to standardize these methods into a unified data format. This would enable the use of standard tools to transform data from FHIR JSON and XML.
It would also simplify the query implementations required to extract data from this format by establishing a set of standard rules and expectations.

Columnar formats have other benefits as well, such as superior compression and encoding techniques, along with native encryption capabilities. These features could provide an efficient alternative for exchanging large volumes of FHIR data that minimizes the cost of bandwidth, storage, and processing.

## Proposed solution

While there are several options that could provide a basis for this format, this specification targets Apache Parquet. Parquet is an ideal target for this work since it is likely the most widely supported high-performance column-oriented open data storage format. Apache describes it as "an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools."

While this solution targets the Parquet format, it does not preclude efforts to target other formats. The concepts presented in this specification could be translated to similar formats such as ORC, Avro, and Protocol Buffers. Other formats may have features, such as support for recursive types, that give them advantages over a Parquet-based solution in representing FHIR data. Regardless of other potential solutions, our work demonstrates that this solution is a viable and practical representation of FHIR data.

The solution presented in this specification is a set of "rules" that define a method to create a Parquet "super-schema". This super-schema contains all the structures required to represent any permissible data for a given FHIR resource type. The super-schema is not profile-specific and can contain arbitrary extension content. It supports a broad range of FHIR versions, from DSTU2 to R5.
The super-schema is also technically infinite for FHIR resources that contain recursive structures. Some FHIR resources and datatypes contain elements that recurse (e.g. `Questionnaire.item` and `Reference.identifier.assigner`). This behavior makes it likely that most FHIR resources will have infinite super-schemas. It is not possible to implement an infinite schema.

Even where a super-schema is not infinite, it may not be practical or performant to implement the full schema, given the size and complexity of a schema that can accommodate all possible instances of FHIR data for a given resource type. For this reason, we have defined several profiles that describe constrained schemas that are subsets of the super-schema but compatible with its rules.

A schema that is compatible with the super-schema need not contain all possible fields. However, the fields that it does contain must exactly match those present in the super-schema. This enables the merging of different datasets that contain the same resource type but have different coverage of the data elements within that resource type.

The solution also features the concept of annotations, which are optional fields added to the schema to house additional data derived from the original data. Annotations can be used to augment data to improve the ease and performance of queries.

### Profiles

(Other words used: slim, minimal, classic, expanded, maximal, "overlay", use case based, resource/profile based)

#### Comprehensive schema

A comprehensive schema must contain all fields within the super-schema, with one limitation: recursive structures shall be represented to six (6) levels. This results in a large schema that can accommodate most FHIR data for a given resource type. Consumers of the comprehensive schema can rely upon the existence of all possible fields without the need to interrogate the schema of the Parquet table and conditionally query those fields.
All Parquet tables with the comprehensive schema for a given resource type will have exactly the same schema (with the exception of annotations). A given FHIR resource instance will not populate all fields within a comprehensive schema; fields that are not present for the resource instance will be set to NULL.

#### Focused schema

A focused schema can have any subset of the fields defined within the super-schema. The only qualification is that every focused schema must at least contain the `resourceType` field. This means that a focused schema can be limited to only those fields necessary to represent the FHIR resource instances present in a given Parquet table. It also means that within a focused schema, recursion can be represented to any depth.

Consumers of a focused schema will need to tolerate the fact that not all fields within the super-schema will be present within the Parquet table. An implementation that supports focused schemas also by definition supports comprehensive schemas, as a comprehensive schema _is_ a focused schema with additional constraints.

## Benefits

In contrast to the existing and other potential solutions, this solution has the following benefits:

- Ease of Adoption: Parquet is the de facto column-oriented data format. Most popular data platforms support the standard and can run efficient queries over this format without further processing.
- Utility of JSON Schema Inference: Some systems support Parquet schema inference directly from JSON objects. This reduces the effort required to create schemas.
- Formal Schema Input Not Required: Since schemas can be created directly from FHIR represented as JSON, the formal schema for the resource (StructureDefinition) is not required to create "focused schema" tables.
- Readability of Output: The focused schema approach creates readable output schemas in comparison with other approaches that would require much larger and more complicated schemas to attempt comprehensiveness.
- Cross FHIR Version Support: A single specification can support all FHIR versions by using a superset of data types.
- Profile-agnostic: A single specification can be used to represent FHIR resource data regardless of the profiles being declared. All extensions, including primitive extensions, can be represented within the format.
- Lossless: The "focused schema" approach can be transformed to and from FHIR with zero loss. The "comprehensive schema" approach can achieve this for almost all data, with the exception of highly nested resource instances.

## Drawbacks

The major drawback to the focused schema approach is the potential for complicated or expensive schema "merges". If multiple data sources exist for a given FHIR resource type, a schema inferred from one source may not be compatible with the others without manual effort. Assuming structurally valid FHIR data of the same version, different sources may include differing elements. In this case, individual focused schemas should be compatible and mergeable. The merged schema can then be used by data systems and communicated to downstream users. Some systems may provide support to automate this process. Merging will likely be an infrequent event, occurring, for example, when a new upstream data source is added.

An additional drawback is that queries written against focused schemas will need to guard against accessing fields that exist in the comprehensive schema but do not exist in the focused schema. The logic required to be robust to this possibility adds complexity to the query.

## Goals

This solution aims to support the following goals.
### Should be able to express all of FHIR

The solution should naturally support all types of extension content, including undeclared and primitive extensions.

### Lossless Bidirectional Transformations

No loss of fidelity should occur in transformation to and from the format. Decimals and datetimes should be transformed from and back to FHIR without loss.

### Performant at Real-world Scale

The solution should support efficient compression of various data types. Runtime usage should be able to be optimized with implementation-specific annotations.

### Wide Adoption Potential

The solution should target the most popular data warehouse systems.

(*** more?)

## Non-Goals

### Query Writer Experience

### Query Portability Across Different Systems

### Ultimate Efficiency

(*** more?)

## Primary Use Cases

### Bulk FHIR Output Format

### Standard "Data Lake" Format

Able to be used across many popular data systems.