# Assessing the risks of language model fairness in health care
HHS recently updated its rule implementing Section 1557, the ACA's non-discrimination provision: https://www.hhs.gov/civil-rights/for-individuals/section-1557/index.html
This bears directly on algorithmic fairness in health care: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10148336/
## Plan (can do in parallel)
- [ ] outreach to publication partners & one sponsor (in progress)
- [x] load SyH-DR data using duckdb & dbt (done: https://github.com/onefact/synthetic-healthcare-data/tree/main/data_processing/models/ahrq.gov/generated/with_types)
- [x] load & visualize american community survey with `dbt` and `duckdb` (https://jaanli.github.io/american-community-survey/new-york-area/income-by-race)
- [x] acquire & visualize data on all assisted living facilities (https://arxiv.org/abs/2212.14092)
- [x] load Synthea data from MITRE (done: https://github.com/onefact/synthetic-healthcare-data/blob/main/data_processing/models/mitre.org/synthea.sql)
- [x] prototype dashboard (done: https://onefact.github.io/synthetic-healthcare-data/)
- [ ] load CPT codes linked from https://www.fepblue.org/tcr-machine-readable-files (using `for i in $(seq 1000); do sky launch --detach-setup --detach-run --down -c job-$i job.yaml --env JOB_INDEX=$i; done` on GCP/AWS)
- [x] source PE deal parameters: `Deal Date, Deal Type, Deal Currency, Deal Size (Mn), Target Name, Target City, Target State, Target Country, Target Address, Target Phone Number, Target Website, Target Formation Year, Target Industry, Target SubIndustry, Target Verticals, Investment Stake, Investors, Bought from Sellers (Firms), Pre Money Valuation, Enterprise Value (Mn), P/E Multiple, EV/EBITDA, Net Income, Operating Income`
- [ ] process MIMIC with `dbt` and `duckdb` (a minimal duckdb loading sketch follows this list)
- [ ] test JAX LLMs for higher throughput on TPUs (todo), using https://arxiv.org/pdf/2310.06552 and https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.363, plus ClinicalBERT and (maybe) constrained generation
- [ ] redo protocol for MedPAR based on the above ([fees](https://resdac.org/sites/datadocumentation.resdac.org/files/2024-02/CMS%20Fee%20List%20for%20Physical%20Research%20Data%20Requests.pdf))
- [ ] prototype leaderboard
- [ ] outline blog post
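
A minimal sketch of the duckdb loading step behind the `dbt` items above (e.g. the SyH-DR "with_types" models and the MIMIC task): read a raw CSV, cast columns to explicit types, and materialize parquet. The file path and column names here are placeholders, not the actual SyH-DR schema.

```python
import duckdb

# Hypothetical "with_types" load: the CSV path and columns below are
# placeholders, not the real SyH-DR schema.
con = duckdb.connect("syh_dr.duckdb")

con.execute("""
    CREATE OR REPLACE TABLE syh_dr_inpatient_with_types AS
    SELECT
        CAST(person_id AS BIGINT)            AS person_id,
        CAST(admission_date AS DATE)         AS admission_date,
        CAST(total_charge AS DECIMAL(12, 2)) AS total_charge,
        payer_type
    FROM read_csv_auto('data/syh_dr_inpatient.csv')
""")

# Materialize a typed parquet file that downstream dbt models or dashboards
# can read directly.
con.execute("""
    COPY syh_dr_inpatient_with_types
    TO 'syh_dr_inpatient_with_types.parquet' (FORMAT PARQUET)
""")
```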
### Technical
- [ ] launch preemptible TPU v3-256 VM
- [ ] run fine-tuned LLM batch inference (by loading model weights) on a TPU VM using PyTorch or JAX, and extract logits [no need for Hugging Face]
- [ ] run ClinicalBERT batch inference on a TPU VM and extract logits (a minimal logit-extraction sketch follows this list)
- [ ] run ClinicalBERT batch inference on a TPU VM using Guidance/constrained generation and extract logits
- [ ] implement https://arxiv.org/pdf/2310.06552
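
A minimal sketch of the ClinicalBERT batch-inference step that extracts logits. For clarity it uses Hugging Face `transformers` on CPU/GPU; the plan above targets TPU VMs (and may skip Hugging Face entirely for the fine-tuned LLM), and the checkpoint name and example notes are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed public checkpoint; fine-tuned weights would be loaded from disk instead.
MODEL_ID = "emilyalsentzer/Bio_ClinicalBERT"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# Placeholder clinical notes standing in for the real batch.
notes = [
    "Patient admitted with chest pain, ruled out for MI.",
    "Discharged home in stable condition.",
]

# Tokenize the batch, run a forward pass, and keep only the logits.
batch = tokenizer(notes, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits  # shape: (batch, seq_len, vocab_size)

print(logits.shape)
```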
### People
- [ ] ask Gábor for help: https://szarnyasg.github.io/talks/graphsys24-ldbc-keynote.pdf
## Notes on skypilot
Zongheng:
```
Quick thoughts:
Is it possible to try out SkyPilot for your workloads in a single cloud manner first?
Intention is to first test if the solutions match your needs :)
For cross-cloud VPN
The above looks feasible
Alternatively, maybe a service like Tailscale can be used/is easier for you?
We have a prototype of launching all VMs in Tailscale (in that PR, enabled for AWS; can enable for GCP analogously too)
https://cloud.google.com/network-connectivity/docs/vpn/tutorials/create-ha-vpn-connections-google-cloud-aws
https://github.com/GoogleCloudPlatform/gcp-to-aws-ha-vpn-terraform-module
Was thinking maybe you can “overshoot” by setting a bunch of regions initially, and let SkyPilot handle the out-of-capacity errors?
would it be possible to dynamically set the resources spec?
This should be possible if you write a simple script
That’s a good point! We don’t have visibility into each region/zone’s real-time availability, but you can add --retry-until-up / -r to sky launch to have SkyPilot automatically (re)try all locations for you. Does that work?
Zongheng Yang (skypilot):
Or, is this question more about figuring out which regions to put into the resources spec?
Many SkyPilot users are using Ray indeed. See for example a YAML for creating a 2-node cluster & running Ray on top: https://skypilot.readthedocs.io/en/latest/running-jobs/distributed-jobs.html#executing-a-distributed-ray-program
For the cross-cloud use case, you can use (docs):
resources:
  ..
  any_of:
    - cloud: aws
    - cloud: gcp
The compliance requirements are a great fit for the “Sky” idea. See here for an example of how to restrict launching to European regions only across clouds: https://skypilot.readthedocs.io/en/latest/reference/faq.html#region-settings
```
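A minimal sketch of wiring the advice above together: a job spec listing multiple clouds under `any_of`, launched with `--retry-until-up` so SkyPilot keeps retrying locations until it finds capacity. Only the `resources`/`any_of` block and the CLI flags come from the thread; the run command, file name, and cluster name are placeholders.

```python
import subprocess
from pathlib import Path

# Placeholder job spec; the resources/any_of block mirrors the thread above,
# and the run command is a stand-in for the real batch workload.
JOB_YAML = """\
resources:
  any_of:
    - cloud: aws
    - cloud: gcp

run: |
  echo "replace with the real batch job"
"""

Path("job.yaml").write_text(JOB_YAML)

# --retry-until-up makes SkyPilot (re)try all locations until capacity is
# found, per the thread above; --down tears the cluster down when the job ends.
subprocess.run(
    ["sky", "launch", "--retry-until-up", "--down", "-c", "fairness-job", "job.yaml"],
    check=True,
)
```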
other thread:
```
Hi all! Has anyone used SkyPilot for distributed analytics queries with AWS lambda or other such serverless architectures?
I see this: https://skypilot.readthedocs.io/en/latest/running-jobs/distributed-jobs.html
But I am not sure it is compatible with the analytics workloads we need for the health care sector. Specifically, for the nonprofit I started we need “burst” query planning algorithms that leverage duckdb.org to work across 1,000-10,000 lambdas, sharing memory-mapped database files via S3 and GP3 high-IOPS SSDs.
Zhanghao Wu (skypilot):
Thanks for the question @jaan! We currently do not support AWS Lambda, but what our users normally do for their embarrassingly parallel jobs is to run sky spot launch to launch several managed spot jobs that handle the data stored on a cloud bucket like S3 or GCS.
Those jobs are fully managed by SkyPilot and will automatically be terminated once the job finishes, so it is similar to a serverless function (you may have to chunk the dataset into larger pieces to reduce the overhead of launching and terminating those jobs).
Here is a blog post from a user running bioinformatics jobs on SkyPilot, applying mappings to a significant amount of data: https://cloud.google.com/blog/topics/hpc/salk-institute-brain-mapping-on-google-cloud-with-skypilot
```
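A minimal sketch of the per-job worker implied by the managed-spot pattern above: each job receives a chunk index (e.g. via `--env JOB_INDEX=$i`, as in the launch loop in the plan) and runs a duckdb query against its shard of parquet files on S3. The bucket path, partitioning scheme, query, and credentials setup are all placeholders.

```python
import os
import duckdb

# Each managed spot job processes one chunk; JOB_INDEX is assumed to be passed
# in by the launcher (e.g. `--env JOB_INDEX=$i`). S3 credentials are assumed
# to be configured on the VM.
job_index = int(os.environ.get("JOB_INDEX", "0"))

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Read only this job's shard of the parquet files on S3 and write a small
# per-chunk summary that a later step can aggregate.
shard_glob = f"s3://example-bucket/claims/chunk={job_index}/*.parquet"
con.execute(f"""
    COPY (
        SELECT payer_type, count(*) AS n, avg(total_charge) AS avg_charge
        FROM read_parquet('{shard_glob}')
        GROUP BY payer_type
    ) TO 'chunk_{job_index}_summary.parquet' (FORMAT PARQUET)
""")
```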
## Notes on identifying companies
- ACA identifier (via ACA marketplace, healthcare.gov)
- EIN (TIN): LLC/Corp
- Payor data: if a company is with UHC, price it against other companies
- reference population: based on AHRQ SyH-DR price
- SEO for free: one static HTML file generated with Jinja, one dashboard per company, named using the registrant's 3-digit ZIP code / CBSA and an estimate of size (min. 100 employees, i.e. min. two states, etc.); see the sketch below
- Linux Foundation - project
- Data model: needs to work with OMOP/FHIR-HL7/Tuva
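
A minimal sketch of the "SEO for free" idea above: render one static HTML page per company with Jinja. The template, company fields, and output layout are assumptions, not an existing pipeline; real records would come from the PE deal and payor data referenced above.

```python
from pathlib import Path
from jinja2 import Template

# Placeholder company records (hypothetical fields).
companies = [
    {"name": "Example Health LLC", "zip3": "100",
     "cbsa": "New York-Newark-Jersey City", "employees": 250},
]

# Placeholder template: one dashboard page per company, with the company name,
# 3-digit ZIP, and CBSA in the title for search indexing.
template = Template("""\
<!doctype html>
<html>
  <head><title>{{ c.name }} | {{ c.cbsa }} ({{ c.zip3 }}xx)</title></head>
  <body>
    <h1>{{ c.name }}</h1>
    <p>Estimated size: {{ c.employees }}+ employees</p>
    <!-- embed the per-company dashboard here -->
  </body>
</html>
""")

out = Path("site")
out.mkdir(exist_ok=True)
for c in companies:
    slug = c["name"].lower().replace(" ", "-")
    (out / f"{slug}.html").write_text(template.render(c=c))
```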