# Bulk Publish (f.k.a. "Hosted Bulk Manifest")

Target use cases:

* Provider Directories
* Endpoint Directories
* Aggregate public health data (e.g., bed availability)

Note: This is a flavor of the [Bulk Publish](https://github.com/HL7/bulk-data/blob/bulk-publish/input/pagecontent/bulk-publish.md) approach. Dan Gotlieb (the author of a previous flavor of Bulk Publish, initially unknown to me) and I agree on the pattern and want to align the two into a single exchange method.

# Exchange Pattern

The directory runs a regularly scheduled job to generate NDJSON files and a Bulk-style manifest. The URL to the manifest is made available directly to consumers (no kickoff or notification is involved).

An example flow, on the directory side:

1. A directory runs a regularly scheduled job (e.g., nightly) to export all directory content as FHIR resources to NDJSON files.
2. The directory publishes a manifest URL to consumers, with the manifest containing pointers to the NDJSON files.
3. The directory may optionally run a scheduled diff job (e.g., hourly) and use the "partial manifest" pattern to include any new data since the nightly job.

On the consumer side:

1. The consumer queries the published manifest (e.g., daily).
2. The consumer downloads each NDJSON file referenced in the manifest.
3. Optionally, the consumer may check whether a "next" manifest is indicated, and if so, query it (e.g., hourly) to see whether any diffs have been posted.
4. If the "next" manifest indicates that new NDJSON files are present, the consumer downloads only those new files.
5. Each night, the process resets.
(TBD on this, see Open Questions)

# Examples

The directory would publish its primary manifest at: https://example.com/daily-output/manifest.json

This manifest would include:

```
{
  "transactionTime": "2021-01-01T00:00:00Z",
  "requiresAccessToken": false,
  "outputOrganizedBy": "Organization",
  "output": [
    {
      "url": "https://example.com/nightly-output/file_1.ndjson"
    },
    {
      "url": "https://example.com/output/file_2.ndjson",
      "continuesInFile": "https://example.com/nightly-output/file_3.ndjson"
    },
    {
      "url": "https://example.com/nightly-output/file_4.ndjson"
    }
  ],
  "link": [
    {
      "relation": "next",
      "url": "https://example.com/hourly-output-1/manifest-2.json"
    }
  ]
}
```

Initially, the "next" manifest URL would return 202, indicating that there is no new data (yet). If and when new data becomes available, for example via an hourly diff job, the "next" manifest URL would return 200, indicating that the consumer can download the manifest and its associated NDJSON files.

https://example.com/hourly-output-1/manifest-2.json

```
{
  "transactionTime": "2021-01-01T00:00:00Z",
  "requiresAccessToken": false,
  "outputOrganizedBy": "Organization",
  "output": [
    {
      "url": "https://example.com/hour-1-diff-output/file_1.ndjson"
    },
    {
      "url": "https://example.com/hour-1-diff-output/file_2.ndjson"
    }
  ],
  "deleted": [
    {
      "type": "Bundle",
      "url": "https://example.com/hour-1-diff-output/del_file_1.ndjson"
    }
  ],
  "link": [
    {
      "relation": "next",
      "url": "https://example.com/hourly-output-2/manifest-3.json"
    }
  ]
}
```

# Data Organization

The details of the profiles are out of scope for this pattern. However, it is common for directories to have a natural "grouping". For example, in a provider directory, a consumer may want to download directory information for the providers that deliver services within a single organization (for example, an organization the consumer often refers patients to), but may not be interested in providers at other organizations.
Directories can use the `outputOrganizedBy` property in the manifest to put all the directory info for a single org in a single file. TBD how we annotate the output entry so the consumer knows which files to download and which to ignore. See Open Questions.

# Advantages of this Pattern

* Very performant. Consumers access staged files (effectively a cache) and do not hit the production database directly.
* Directories could choose to publish the files on CDNs (Content Delivery Networks).
* Client registration and authentication may often be unnecessary, as the consumer is not imposing a processing burden on the directory.

# Open Questions

* How to differentiate "diff" next links from "manifest too large" next links.
* How to indicate a "reset", i.e., when the daily job runs again and consumers should start over with a full download. Or do we just do diffs forever?
* How to indicate the "outputOrganizedBy", given that there are parent/child Orgs and the intent is that the output is organized by the parent Org only.
* How can we communicate to consumers about multi-org directories, where they only want some subset? The goal is for "common patterns" to be enabled. Possibly just have separate, published primary manifests for each org?

# Prior Art

## US National Directory (NDH)

NDH has a "Scheduled Bulk Data Export" option. However, that still requires a kickoff step. For smaller directories, it may be preferable to always have a pre-scheduled export available, rather than requiring clients to request it. Some directories may choose to support both an NDH Scheduled Bulk Export, for when a consumer has specific criteria, and a Bulk Publish, where the directory chooses what is included. Bulk Publish is the "simple, for easy use cases" option and the NDH Scheduled Bulk Export is the "advanced use cases" option.

Note: A kickoff operation will normally require authentication, and thus client registration will normally be necessary for the NDH Scheduled Bulk Export. A key benefit of the Hosted Bulk Manifest option is that it may be fully public, with no client ID or pre-coordinated auth requirement, since the client is not imposing a processing burden on the server.

## SMART User Access Brands

User Access Brands uses a single Bundle, which works for smaller directories but doesn't scale well for larger ones. The Bulk Manifest approach scales better.

## Da Vinci Plan Net

Da Vinci Plan Net is currently REST-focused and has a nod to future Bulk options. [FHIR-51481](https://jira.hl7.org/browse/FHIR-51481) proposes to add this Bulk Publish option. This will be discussed on the Da Vinci Plan Net call on August 8, 2025.

## UDS+ and DEQM

The UDS+ and DEQM IGs have some similarities, in that the data holder can run a scheduled job to stage NDJSON files. Those IGs both involve a $import or $submit operation to notify the consumer that data is ready. Bulk Publish adopts the scheduled staging approach, but does not incorporate the notification approach.
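
Because there is no notification step, the consumer has to poll. The following is a minimal sketch of that consumer-side loop, not a reference implementation. It assumes the manifest shape shown in the Examples section and the 202/200 behavior described there. The names `collect_new_files`, `next_link`, `http_fetch`, and `max_hops` are hypothetical, and the sketch ignores `continuesInFile` and the still-open "reset" question.

```python
import json
import urllib.request
from typing import Callable, List, Optional, Tuple

# Hypothetical fetcher type: takes a URL, returns (HTTP status, parsed JSON or None).
Fetch = Callable[[str], Tuple[int, Optional[dict]]]


def http_fetch(url: str) -> Tuple[int, Optional[dict]]:
    """Fetch a manifest over HTTP; only a 200 response is parsed as JSON."""
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp) if resp.status == 200 else None
        return resp.status, body


def next_link(manifest: dict) -> Optional[str]:
    """Return the URL of the manifest's "next" link, if one is present."""
    for link in manifest.get("link", []):
        if link.get("relation") == "next":
            return link.get("url")
    return None


def collect_new_files(
    manifest_url: str, fetch: Fetch = http_fetch, max_hops: int = 100
) -> Tuple[List[str], List[str], Optional[str]]:
    """Follow "next" links starting at manifest_url until one is not ready.

    Returns (output_urls, deleted_urls, resume_url). resume_url is the
    manifest URL to poll on the next run, or None if no "next" link was given.
    """
    outputs: List[str] = []
    deleted: List[str] = []
    url: Optional[str] = manifest_url
    for _ in range(max_hops):  # guard against an unexpectedly long chain
        if url is None:
            break
        status, manifest = fetch(url)
        if status == 202:
            # Assumption from the Examples section: 202 means "no new data yet".
            return outputs, deleted, url
        if status != 200 or manifest is None:
            raise RuntimeError(f"unexpected response {status} for {url}")
        outputs.extend(entry["url"] for entry in manifest.get("output", []))
        deleted.extend(entry["url"] for entry in manifest.get("deleted", []))
        url = next_link(manifest)
    return outputs, deleted, url
```

A consumer would call `collect_new_files` once with the primary manifest URL, download the returned files, and on later runs (e.g., hourly) call it again with the returned `resume_url` so that only new diff files are fetched. Passing the fetcher in as a parameter keeps the link-following logic testable without a network.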