# Medical Coding ML System - Technical Design Document

## 1. Executive Summary

### System Overview

A cloud-native, ML-powered medical coding prediction system that leverages AWS infrastructure and the Snowflake data platform to predict both DRG and ICD-10 codes from clinical data. The system supports both real-time inference (<1s latency) and batch processing for high-volume coding operations.

### Key Capabilities

- **Dual Prediction Models**: DRG-LLaMA for DRG codes (54.6% top-1 accuracy target); Clinical BERT variants for ICD-10
- **Hybrid Processing**: Real-time API endpoints and batch processing pipelines
- **HIPAA Compliant**: End-to-end encryption, audit logging, BAA-covered services
- **Scalable Architecture**: Auto-scaling from 100 to 100K+ predictions/day
- **Epic-Ready**: Designed for future FHIR API and Clarity database integration

### Technology Stack

- **ML Platform**: AWS SageMaker (training, deployment, monitoring)
- **Data Platform**: Snowflake (existing instance leveraged)
- **Data Lake**: AWS S3 with lifecycle policies
- **Orchestration**: AWS Step Functions + Airflow
- **Security**: AWS KMS, PrivateLink, CloudTrail

---
## 2. System Architecture

### 2.1 High-Level Architecture

```mermaid
graph TB
    subgraph "External Data Sources"
        FHIR[Epic FHIR API]
        Clarity[Epic Clarity Database]
        Manual[Manual Upload/API<br/>CSV/JSON/HL7]
    end
    subgraph "Ingestion Layer - AWS"
        Gateway[AWS Transfer Family /<br/>API Gateway / EventBridge]
        S3Raw[S3 Raw Data Lake<br/>Encrypted]
        S3Paths["/raw/fhir/<br/>/raw/clarity/<br/>/raw/clinical-notes/<br/>/raw/structured/"]
    end
    subgraph "Data Processing Layer"
        subgraph "Snowflake + Snowpipe"
            Raw[Raw Layer]
            Staging[Staging Layer]
            Analytics[Analytics Layer]
            Feature[Feature Store<br/>Snowflake + SageMaker]
            Raw --> Staging
            Staging --> Analytics
            Analytics --> Feature
        end
    end
    subgraph "ML Platform Layer"
        subgraph "AWS SageMaker"
            Training[Training Jobs<br/>• DRG-LLaMA<br/>• Clinical BERT]
            Registry[Model Registry<br/>• Versioning<br/>• A/B Testing<br/>• Staging]
            Inference[Inference Endpoints<br/>• Real-time<br/>• Batch Transform<br/>• Multi-Model]
        end
    end
    subgraph "Application Layer"
        API["API Gateway + Lambda Functions<br/>• /predict/real-time<br/>• /predict/batch<br/>• /status/job/[id]<br/>• /metrics/performance"]
    end

    FHIR --> Gateway
    Clarity --> Gateway
    Manual --> Gateway
    Gateway --> S3Raw
    S3Raw --> S3Paths
    S3Paths --> Raw
    Feature --> Training
    Training --> Registry
    Registry --> Inference
    Inference --> API

    classDef aws fill:#FF9900,stroke:#232F3E,stroke-width:2px,color:#fff
    classDef snowflake fill:#29B5E8,stroke:#0C5B99,stroke-width:2px,color:#fff
    classDef epic fill:#CC0000,stroke:#660000,stroke-width:2px,color:#fff
    classDef ml fill:#04AA6D,stroke:#028A0F,stroke-width:2px,color:#fff
    class FHIR,Clarity epic
    class Gateway,S3Raw,S3Paths,API aws
    class Raw,Staging,Analytics,Feature snowflake
    class Training,Registry,Inference ml
```

### 2.2 AWS Account Structure

```mermaid
graph TD
    Org[AWS Organization Root]
    Mgmt[Management Account<br/>• AWS Organizations<br/>• CloudTrail Organization Trail<br/>• Cost Management]
    Security[Security Account<br/>• Security Hub<br/>• GuardDuty Master<br/>• AWS Config Aggregator]
    Prod[Production Account<br/>• Production Workloads<br/>• PHI Data Processing<br/>• Model Endpoints]
    Dev[Development Account<br/>• Development/Testing<br/>• Synthetic Data Only]
    Data[Data Account<br/>• S3 Data Lake<br/>• Snowflake External Stages<br/>• Backup/Archive]

    Org --> Mgmt
    Org --> Security
    Org --> Prod
    Org --> Dev
    Org --> Data
    Security -.->|monitors| Prod
    Security -.->|monitors| Dev
    Security -.->|monitors| Data

    classDef management fill:#FFA500,stroke:#FF8C00,stroke-width:2px
    classDef security fill:#DC143C,stroke:#8B0000,stroke-width:2px
    classDef prod fill:#228B22,stroke:#006400,stroke-width:2px
    classDef dev fill:#4169E1,stroke:#0000CD,stroke-width:2px
    classDef data fill:#9370DB,stroke:#4B0082,stroke-width:2px
    class Mgmt management
    class Security security
    class Prod prod
    class Dev dev
    class Data data
```

---

## 3. Core Components

### 3.1 Data Ingestion Pipeline

#### Component Definition

```yaml
Epic FHIR Connector:
  Type: Lambda Function + EventBridge
  Runtime: Python 3.11
  Memory: 3008 MB
  Timeout: 15 minutes
  Triggers:
    - Scheduled: Rate(15 minutes)
    - On-demand: API Gateway
  Functions:
    - Bulk export initiation
    - Incremental data sync
    - FHIR resource extraction
  Output: S3 Raw Layer (NDJSON format)

Clarity Database Connector:
  Type: AWS Glue Job
  Worker Type: G.2X
  Workers: 2-10 (auto-scaling)
  Schedule: Daily at 2 AM UTC
  Tables:
    - CLARITY_DX (diagnoses)
    - CLARITY_PRC (procedures)
    - HNO_INFO (clinical notes)
    - PATIENT (demographics)
  Output: S3 Raw Layer (Parquet format)

Manual Upload Handler:
  Type: S3 Event + Lambda
  Supported Formats: CSV, JSON, HL7, PDF
  Validation: JSON Schema / HL7 Parser
  Processing:
    - Format detection
    - Schema validation
    - PHI detection (Macie)
    - Quarantine invalid files
```

### 3.2 Data Processing & Feature Engineering

#### Snowflake Architecture

```mermaid
graph LR
    subgraph "External Sources"
        FHIR[FHIR Data<br/>NDJSON]
        Clarity[Clarity Data<br/>Parquet]
        Manual[Manual Uploads<br/>CSV/JSON]
    end
    subgraph "Snowflake Database: MEDICAL_CODING_ML"
        subgraph "RAW Schema"
            Raw1[CLINICAL_DATA]
            Raw2[FHIR_RESOURCES]
            Raw3[CLARITY_EXTRACTS]
        end
        subgraph "STAGING Schema"
            Stage1[CLEANED_ENCOUNTERS]
            Stage2[VALIDATED_DIAGNOSES]
            Stage3[PROCESSED_NOTES]
        end
        subgraph "FEATURES Schema"
            Feat1[PATIENT_ENCOUNTERS]
            Feat2[FEATURE_VECTORS]
            Feat3[TEXT_EMBEDDINGS]
        end
        subgraph "ANALYTICS Schema"
            Ana1[PREDICTION_RESULTS]
            Ana2[MODEL_METRICS]
            Ana3[CODING_ANALYTICS]
        end
    end

    FHIR --> Raw2
    Clarity --> Raw3
    Manual --> Raw1
    Raw1 --> Stage1
    Raw2 --> Stage2
    Raw3 --> Stage3
    Stage1 --> Feat1
    Stage2 --> Feat1
    Stage3 --> Feat3
    Feat1 --> Feat2
    Feat3 --> Feat2
    Feat2 --> Ana1
    Ana1 --> Ana2
    Ana1 --> Ana3

    classDef external fill:#FFE0B2,stroke:#FF6F00,stroke-width:2px
    classDef raw fill:#FFCDD2,stroke:#D32F2F,stroke-width:2px
    classDef staging fill:#C5CAE9,stroke:#303F9F,stroke-width:2px
    classDef features fill:#C8E6C9,stroke:#388E3C,stroke-width:2px
    classDef analytics fill:#E1BEE7,stroke:#7B1FA2,stroke-width:2px
    class FHIR,Clarity,Manual external
    class Raw1,Raw2,Raw3 raw
    class Stage1,Stage2,Stage3 staging
    class Feat1,Feat2,Feat3 features
    class Ana1,Ana2,Ana3 analytics
```

```sql
-- Database Structure
CREATE DATABASE IF NOT EXISTS MEDICAL_CODING_ML;

-- Schemas
CREATE SCHEMA IF NOT EXISTS RAW;       -- Raw ingested data
CREATE SCHEMA IF NOT EXISTS STAGING;   -- Cleaned, validated data
CREATE SCHEMA IF NOT EXISTS FEATURES;  -- ML-ready features
CREATE SCHEMA IF NOT EXISTS ANALYTICS; -- Aggregated metrics

-- Key Tables
CREATE TABLE FEATURES.PATIENT_ENCOUNTERS (
    encounter_id VARCHAR PRIMARY KEY,
    patient_id VARCHAR,
    admission_date TIMESTAMP,
    discharge_date TIMESTAMP,
    principal_diagnosis VARCHAR,
    secondary_diagnoses ARRAY,
    procedures ARRAY,
    clinical_notes_processed VARIANT,
    lab_results VARIANT,
    vital_signs VARIANT,
    features_vector ARRAY,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

-- Snowpipe for Continuous Ingestion
CREATE PIPE medical_coding_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO RAW.CLINICAL_DATA
  FROM @medical_coding_stage
  FILE_FORMAT = (TYPE = 'PARQUET');
```

#### Feature Engineering Pipeline

```python
# Core Feature Categories
feature_pipeline = {
    "demographic_features": [
        "age", "gender", "race", "ethnicity"
    ],
    "clinical_features": [
        "chief_complaint_embedding",
        "diagnosis_count",
        "procedure_count",
        "comorbidity_index",
        "severity_scores"
    ],
    "temporal_features": [
        "length_of_stay",
        "icu_days",
        "readmission_risk",
        "seasonal_patterns"
    ],
    "text_features": [
        "clinical_note_embeddings",
        "discharge_summary_entities",
        "radiology_report_findings"
    ],
    "lab_features": [
        "abnormal_lab_count",
        "critical_values",
        "trend_indicators"
    ]
}
```

### 3.3 ML Model Architecture

```mermaid
graph LR
    subgraph "Training Pipeline"
        Data[Training Data<br/>from Snowflake]
        Prep[Data Preprocessing<br/>• Tokenization<br/>• Feature Engineering]
        Train[Model Training<br/>• DRG-LLaMA<br/>• Clinical BERT]
        Eval[Model Evaluation<br/>• Accuracy Metrics<br/>• F1 Scores]
        Reg[Model Registry<br/>• Version Control<br/>• Metadata]
    end
    subgraph "Deployment Pipeline"
        Stage[Staging<br/>Endpoint]
        AB[A/B Testing<br/>• Traffic Split<br/>• Performance Compare]
        Prod[Production<br/>Endpoint]
        Monitor[Model Monitor<br/>• Drift Detection<br/>• Performance Tracking]
    end
    subgraph "Inference Modes"
        RT[Real-time<br/>• <500ms latency<br/>• Single predictions]
        Batch[Batch Transform<br/>• Overnight processing<br/>• 10K records/batch]
    end

    Data --> Prep
    Prep --> Train
    Train --> Eval
    Eval --> Reg
    Reg --> Stage
    Stage --> AB
    AB --> Prod
    Prod --> Monitor
    Monitor -.->|Retrain Trigger| Data
    Prod --> RT
    Prod --> Batch

    classDef training fill:#E8EAF6,stroke:#3F51B5,stroke-width:2px
    classDef deploy fill:#E0F2F1,stroke:#00796B,stroke-width:2px
    classDef inference fill:#FFF8E1,stroke:#F57F17,stroke-width:2px
    class Data,Prep,Train,Eval,Reg training
    class Stage,AB,Prod,Monitor deploy
    class RT,Batch inference
```

#### DRG Prediction Model

```yaml
Model: DRG-LLaMA
Architecture:
  Base Model: LLaMA-13B
  Fine-tuning Dataset: 2M+ hospital admissions
  Context Window: 1024 tokens
Training Infrastructure:
  Instance: ml.p4d.24xlarge
  GPUs: 8x A100
  Training Time: ~48 hours
Optimization:
  - Mixed Precision Training (FP16)
  - Gradient Checkpointing
  - DeepSpeed ZeRO-3
  - Learning Rate: 1e-5 with cosine schedule
Performance Targets:
  - Top-1 Accuracy: 54.6%
  - Top-5 Accuracy: 86.5%
  - Inference Latency: <500ms
```

#### ICD-10 Prediction Model

```yaml
Model: Hierarchical Clinical BERT
Architecture:
  Base Model: Bio_ClinicalBERT
  Task Head: Hierarchical Attention Network
  Label Space: Top 1000 ICD-10 codes (expandable)
Training Infrastructure:
  Instance: ml.g5.12xlarge
  Training Time: ~24 hours
Multi-Label Strategy:
  - Label-wise attention mechanism
  - Hierarchical loss function
  - Focal loss for class imbalance
Performance Targets:
  - Micro-F1: 0.54
  - Macro-F1: 0.48
  - Top-10 Recall: 0.75
```

### 3.4 Inference Architecture

#### Real-time Inference

```python
# SageMaker Multi-Model Endpoint Configuration
endpoint_config = {
    "EndpointName": "medical-coding-realtime",
    "ProductionVariants": [
        {
            "VariantName": "drg-model",
            "ModelName": "drg-llama-v1",
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 2,
            "AutoScaling": {
                "MinCapacity": 2,
                "MaxCapacity": 10,
                "TargetValue": 100,  # requests per second
                "ScaleInCooldown": 300,
                "ScaleOutCooldown": 60
            }
        },
        {
            "VariantName": "icd10-model",
            "ModelName": "clinical-bert-icd10-v1",
            "InstanceType": "ml.g5.xlarge",
            "InitialInstanceCount": 2
        }
    ]
}
```

#### Batch Processing

```python
# Step Functions State Machine Definition
batch_pipeline = {
    "StartAt": "ValidateInput",
    "States": {
        "ValidateInput": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:validate-batch-input",
            "Next": "ExtractFeatures"
        },
        "ExtractFeatures": {
            "Type": "Task",
            "Resource": "arn:aws:states:::snowflake:query",
            "Next": "ParallelPrediction"
        },
        "ParallelPrediction": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "PredictDRG",
                    "States": {
                        "PredictDRG": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::sagemaker:transform"
                        }
                    }
                },
                {
                    "StartAt": "PredictICD10",
                    "States": {
                        "PredictICD10": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::sagemaker:transform"
                        }
                    }
                }
            ],
            "Next": "PostProcess"
        },
        "PostProcess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:postprocess-predictions",
            "End": True
        }
    }
}
```

---

## 4. Data Flow Specifications

### 4.1 Real-time Data Flow

```mermaid
sequenceDiagram
    participant Client
    participant API Gateway
    participant Lambda
    participant Feature Store
    participant SageMaker
    participant DynamoDB

    Client->>API Gateway: POST /predict/real-time
    API Gateway->>Lambda: Invoke Preprocessor
    Lambda->>Feature Store: Fetch Features
    Feature Store-->>Lambda: Return Features
    Lambda->>SageMaker: Invoke Endpoint
    SageMaker-->>Lambda: Predictions
    Lambda->>DynamoDB: Store Results
    Lambda-->>API Gateway: Response
    API Gateway-->>Client: JSON Response
```

### 4.2 Batch Processing Flow

```mermaid
graph LR
    subgraph "Stage 1: Data Extraction [0-10 min]"
        Source[Snowflake/S3<br/>Source Data]
        Extract[Extract Pipeline<br/>• Parquet format<br/>• Date/facility partitions]
    end
    subgraph "Stage 2: Feature Engineering [10-30 min]"
        Struct[Snowpark UDFs<br/>Structured Features]
        NLP[SageMaker Processing<br/>NLP Features]
        Vectors[Feature Vectors<br/>in S3]
    end
    subgraph "Stage 3: Model Inference [30-90 min]"
        Transform[Batch Transform Jobs<br/>• Parallel processing<br/>• 10K records/batch<br/>• GPU optimization]
        DRGBatch[DRG Model<br/>Predictions]
        ICDBatch[ICD-10 Model<br/>Predictions]
    end
    subgraph "Stage 4: Post-processing [90-100 min]"
        Confidence[Confidence Scoring]
        Validation[Hierarchical<br/>Code Validation]
        Aggregate[Result<br/>Aggregation]
    end
    subgraph "Stage 5: Result Storage [100-110 min]"
        Snow[Write to<br/>Snowflake]
        Dash[Update<br/>Dashboards]
        Notify[Trigger<br/>Notifications]
    end

    Source --> Extract
    Extract --> Struct
    Extract --> NLP
    Struct --> Vectors
    NLP --> Vectors
    Vectors --> Transform
    Transform --> DRGBatch
    Transform --> ICDBatch
    DRGBatch --> Confidence
    ICDBatch --> Confidence
    Confidence --> Validation
    Validation --> Aggregate
    Aggregate --> Snow
    Snow --> Dash
    Snow --> Notify

    classDef extraction fill:#E8F5E9,stroke:#4CAF50,stroke-width:2px
    classDef feature fill:#E3F2FD,stroke:#2196F3,stroke-width:2px
    classDef inference fill:#FFF3E0,stroke:#FF9800,stroke-width:2px
    classDef processing fill:#F3E5F5,stroke:#9C27B0,stroke-width:2px
    classDef storage fill:#FFEBEE,stroke:#F44336,stroke-width:2px
    class Source,Extract extraction
    class Struct,NLP,Vectors feature
    class Transform,DRGBatch,ICDBatch inference
    class Confidence,Validation,Aggregate processing
    class Snow,Dash,Notify storage
```

---

## 5. Security & Compliance

```mermaid
graph TB
    subgraph "Data Security Layers"
        subgraph "Data at Rest"
            S3KMS[S3 with SSE-KMS<br/>Customer Managed Keys]
            SnowEnc[Snowflake Tri-Secret<br/>Secure Encryption]
            EBSEnc[EBS Encrypted<br/>Volumes]
        end
        subgraph "Data in Transit"
            TLS[TLS 1.2+ All Connections]
            PLink[PrivateLink for<br/>Snowflake]
            VPCEnd[VPC Endpoints<br/>for AWS Services]
            CertPin[Certificate Pinning<br/>for Epic APIs]
        end
        subgraph "Access Control"
            IAM[IAM Roles &<br/>Policies]
            MFA[Multi-Factor<br/>Authentication]
            RBAC[Role-Based<br/>Access Control]
            Secrets[AWS Secrets<br/>Manager]
        end
        subgraph "Audit & Compliance"
            Trail[CloudTrail<br/>Logging]
            Config[AWS Config<br/>Rules]
            Hub[Security Hub<br/>Monitoring]
            Macie[Amazon Macie<br/>PHI Detection]
        end
    end

    S3KMS --> IAM
    SnowEnc --> IAM
    EBSEnc --> IAM
    TLS --> RBAC
    PLink --> RBAC
    VPCEnd --> RBAC
    CertPin --> RBAC
    IAM --> Trail
    MFA --> Trail
    RBAC --> Trail
    Secrets --> Trail
    Trail --> Config
    Config --> Hub
    Hub --> Macie

    classDef encryption fill:#FFF3E0,stroke:#F57C00,stroke-width:2px
    classDef transit fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px
    classDef access fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
    classDef audit fill:#FCE4EC,stroke:#C2185B,stroke-width:2px
    class S3KMS,SnowEnc,EBSEnc encryption
    class TLS,PLink,VPCEnd,CertPin transit
    class IAM,MFA,RBAC,Secrets access
    class Trail,Config,Hub,Macie audit
```

### 5.1 Encryption Strategy

```yaml
Data at Rest:
  S3:
    - Server-side encryption: SSE-KMS
    - KMS Key: Customer Managed (CMK)
    - Rotation: Annual
  Snowflake:
    - Tri-Secret Secure encryption
    - Customer-managed keys in AWS KMS
  EBS Volumes:
    - Encrypted by default
    - KMS key per environment
Data in Transit:
  - TLS 1.2+ for all connections
  - PrivateLink for Snowflake connectivity
  - VPC Endpoints for AWS services
  - Certificate pinning for Epic APIs
```

### 5.2 Access Control

```yaml
IAM Roles:
  MLEngineerRole:
    - SageMaker full access
    - S3 read/write to ML buckets
    - Snowflake external stage access
  DataScientistRole:
    - SageMaker training/tuning
    - S3 read-only to production data
    - CloudWatch metrics access
  ApplicationRole:
    - SageMaker endpoint invoke
    - DynamoDB read/write
    - S3 read to model artifacts
  AuditorRole:
    - CloudTrail read-only
    - S3 audit logs access
    - Compliance reports generation
```

### 5.3 Audit & Monitoring

```python
# CloudWatch Metrics Configuration
custom_metrics = {
    "Model Performance": [
        "prediction_accuracy",
        "inference_latency",
        "endpoint_availability"
    ],
    "Data Quality": [
        "missing_field_rate",
        "schema_validation_failures",
        "data_drift_score"
    ],
    "Security": [
        "unauthorized_access_attempts",
        "phi_exposure_events",
        "encryption_failures"
    ],
    "Business": [
        "daily_prediction_volume",
        "code_distribution",
        "cost_per_prediction"
    ]
}

# Alerting Thresholds
alerts = {
    "Critical": {
        "accuracy_drop": "< 50%",
        "endpoint_failure": "availability < 99%",
        "data_breach": "any PHI exposure"
    },
    "Warning": {
        "accuracy_degradation": "< 52%",
        "high_latency": "> 1000ms p99",
        "cost_spike": "> 150% daily average"
    }
}
```
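The `alerts` thresholds above are declarative; a minimal sketch of how a monitoring function might evaluate a metrics snapshot against them. Function and field names are illustrative assumptions, not the production implementation:

```python
# Hypothetical evaluator for the alerting thresholds defined above.
# Critical: accuracy < 50%, availability < 99%, any PHI exposure.
# Warning: accuracy < 52%, p99 latency > 1000ms, cost > 150% of average.

def evaluate_alerts(metrics: dict) -> dict:
    """Return triggered alert names bucketed by severity."""
    triggered = {"Critical": [], "Warning": []}

    accuracy = metrics.get("prediction_accuracy", 1.0)
    if accuracy < 0.50:
        triggered["Critical"].append("accuracy_drop")
    elif accuracy < 0.52:
        triggered["Warning"].append("accuracy_degradation")

    if metrics.get("endpoint_availability", 1.0) < 0.99:
        triggered["Critical"].append("endpoint_failure")

    if metrics.get("phi_exposure_events", 0) > 0:
        triggered["Critical"].append("data_breach")

    if metrics.get("latency_p99_ms", 0.0) > 1000:
        triggered["Warning"].append("high_latency")

    if metrics.get("daily_cost", 0.0) > 1.5 * metrics.get("avg_daily_cost", float("inf")):
        triggered["Warning"].append("cost_spike")

    return triggered
```

In practice these comparisons would run inside CloudWatch alarms or a scheduled Lambda; the sketch only makes the threshold logic explicit.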
---

## 6. Implementation Phases

```mermaid
gantt
    title Medical Coding ML System Implementation Timeline
    dateFormat YYYY-MM-DD

    section Phase 1 Foundation
    Infrastructure Setup      :done, p1a, 2024-01-01, 14d
    Data Pipeline Foundation  :done, p1b, after p1a, 14d

    section Phase 2 ML Dev
    Feature Engineering       :active, p2a, after p1b, 14d
    Model Training            :p2b, after p2a, 28d
    Inference Pipeline        :p2c, after p2b, 14d

    section Phase 3 Production
    Integration & Testing     :p3a, after p2c, 14d
    Deployment & Monitoring   :p3b, after p3a, 14d

    section Phase 4 Optimization
    Performance Optimization  :p4a, after p3b, 28d
    Epic Integration Prep     :p4b, after p4a, 28d
```

### Phase 1: Foundation (Weeks 1-4)

```mermaid
flowchart LR
    subgraph "Week 1-2: Infrastructure"
        A1[AWS Account<br/>Structure] --> A2[VPC &<br/>Networking]
        A2 --> A3[IAM Roles<br/>& Policies]
        A3 --> A4[KMS Key<br/>Generation]
        A4 --> A5[Snowflake<br/>AWS Integration]
    end
    subgraph "Week 3-4: Data Pipeline"
        B1[S3 Bucket<br/>Structure] --> B2[Snowpipe<br/>Configuration]
        B2 --> B3[AWS Glue<br/>ETL Setup]
        B3 --> B4[Data Quality<br/>Framework]
        B4 --> B5[Epic FHIR<br/>Connector]
    end
    A5 --> B1
```

### Phase 2: ML Development (Weeks 5-12)

```mermaid
flowchart LR
    subgraph "Week 5-6: Features"
        C1[Snowflake<br/>Feature Tables] --> C2[NLP<br/>Preprocessing]
        C2 --> C3[Feature Store<br/>Setup]
        C3 --> C4[Data Validation<br/>Rules]
    end
    subgraph "Week 7-10: Training"
        D1[SageMaker<br/>Environment] --> D2[DRG-LLaMA<br/>Fine-tuning]
        D2 --> D3[Clinical BERT<br/>Training]
        D3 --> D4[Hyperparameter<br/>Optimization]
        D4 --> D5[Model<br/>Evaluation]
    end
    subgraph "Week 11-12: Inference"
        E1[Endpoint<br/>Deployment] --> E2[Batch Transform<br/>Setup]
        E2 --> E3[A/B Testing<br/>Framework]
        E3 --> E4[Performance<br/>Benchmarking]
    end
    C4 --> D1
    D5 --> E1
```

### Phase 3: Production Readiness (Weeks 13-16)

```mermaid
flowchart LR
    subgraph "Week 13-14: Integration"
        F1[API Gateway<br/>Config] --> F2[End-to-End<br/>Testing]
        F2 --> F3[Load<br/>Testing]
        F3 --> F4[Security<br/>Scanning]
        F4 --> F5[DR<br/>Testing]
    end
    subgraph "Week 15-16: Deployment"
        G1[Production<br/>Deployment] --> G2[Monitoring<br/>Dashboard]
        G2 --> G3[Alerting<br/>Configuration]
        G3 --> G4[Documentation<br/>Completion]
        G4 --> G5[Shadow Mode<br/>Activation]
    end
    F5 --> G1
```

### Phase 4: Optimization & Scale (Weeks 17-24)

```mermaid
flowchart LR
    subgraph "Week 17-20: Optimization"
        H1[Model<br/>Compression] --> H2[Inference<br/>Optimization]
        H2 --> H3[Cost<br/>Optimization]
        H3 --> H4[Cache<br/>Implementation]
    end
    subgraph "Week 21-24: Epic Integration"
        I1[FHIR API<br/>Deep Integration] --> I2[Clarity DB<br/>Connection]
        I2 --> I3[Workflow<br/>Integration Design]
        I3 --> I4[Clinical<br/>Validation]
    end
    H4 --> I1
```

---

## 7. Cost Optimization Strategies

### 7.1 Compute Optimization

```yaml
SageMaker:
  - Savings Plans: 3-year commitment for 64% savings
  - Spot Training: 70% cost reduction for training jobs
  - Multi-model endpoints: Share infrastructure
  - Automatic scaling: Scale to zero during off-hours
Snowflake:
  - Auto-suspend: 10-minute idle timeout
  - Warehouse sizing: Start small, scale as needed
  - Result caching: 24-hour cache retention
  - Clustering keys: Optimize query performance
Lambda:
  - Reserved concurrency: Control costs
  - Graviton2: 20% price-performance improvement
  - Memory optimization: Right-size based on profiling
```

### 7.2 Storage Optimization

```yaml
S3 Lifecycle Policies:
  - Infrequent Access: After 30 days
  - Glacier: After 90 days
  - Expiration: After 7 years (HIPAA requirement)
Data Compression:
  - Parquet format: 70% compression ratio
  - Gzip for JSON: 60% reduction
  - Model compression: Quantization to INT8
```
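To see why the lifecycle policy pays off, a back-of-the-envelope cost model: each object spends 30 days in Standard, 60 in Infrequent Access, then sits in Glacier until the 7-year expiration. The per-GB monthly prices below are hypothetical placeholders for illustration, not current AWS list prices:

```python
# Back-of-the-envelope sketch of the S3 lifecycle policy above.
# Prices are placeholder assumptions, not AWS list prices.
STANDARD_PER_GB_MONTH = 0.023
IA_PER_GB_MONTH = 0.0125
GLACIER_PER_GB_MONTH = 0.004


def lifetime_storage_cost_per_gb(retention_months: int = 84) -> float:
    """Cost (USD) of keeping 1 GB for the full 7-year retention period."""
    standard_months = 1  # first 30 days in Standard
    ia_months = 2        # days 30-90 in Infrequent Access
    glacier_months = retention_months - standard_months - ia_months
    return (standard_months * STANDARD_PER_GB_MONTH
            + ia_months * IA_PER_GB_MONTH
            + glacier_months * GLACIER_PER_GB_MONTH)
```

Under these placeholder prices, the tiered lifetime cost is a fraction of keeping the object in Standard for all 84 months, which is the argument for the lifecycle rules above.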
---

## 8. Performance Targets & SLAs

### 8.1 System Performance

```yaml
Availability:
  - API Uptime: 99.9% (43.8 min/month downtime)
  - Batch Processing: 99.5% success rate
Latency:
  - Real-time Inference: p50 < 200ms, p99 < 1000ms
  - Batch Processing: < 2 hours for 100K records
Throughput:
  - Real-time: 1000 requests/second
  - Batch: 1M records/day
Accuracy:
  - DRG Top-1: > 54%
  - DRG Top-5: > 85%
  - ICD-10 Micro-F1: > 0.52
```

### 8.2 Operational Metrics

```yaml
Recovery Objectives:
  - RTO: 4 hours
  - RPO: 1 hour
Model Management:
  - Retraining Frequency: Monthly
  - A/B Test Duration: 2 weeks minimum
  - Rollback Time: < 5 minutes
```

---

## 9. Monitoring & Observability

```mermaid
graph TB
    subgraph "Data Sources"
        SM[SageMaker<br/>Endpoints]
        Lambda[Lambda<br/>Functions]
        Snow[Snowflake<br/>Queries]
        API[API Gateway]
    end
    subgraph "Metrics Collection"
        CW[CloudWatch Metrics]
        Logs[CloudWatch Logs]
        XRay[AWS X-Ray<br/>Tracing]
        Custom[Custom Metrics<br/>via SDK]
    end
    subgraph "Monitoring Dashboards"
        Perf[Model Performance<br/>• Accuracy<br/>• Latency<br/>• Throughput]
        DQ[Data Quality<br/>• Missing Fields<br/>• Schema Violations<br/>• Data Drift]
        Sec[Security<br/>• Access Attempts<br/>• PHI Events<br/>• Encryption Status]
        Biz[Business Metrics<br/>• Daily Volume<br/>• Code Distribution<br/>• Cost/Prediction]
    end
    subgraph "Alerting"
        Crit[Critical Alerts<br/>• Accuracy < 50%<br/>• Endpoint Failure<br/>• Data Breach]
        Warn[Warning Alerts<br/>• Accuracy < 52%<br/>• High Latency<br/>• Cost Spike]
        SNS[Amazon SNS]
        PD[PagerDuty<br/>Integration]
    end

    SM --> CW
    Lambda --> CW
    Snow --> Custom
    API --> CW
    SM --> Logs
    Lambda --> Logs
    API --> XRay
    CW --> Perf
    Logs --> DQ
    XRay --> Perf
    Custom --> Biz
    Perf --> Crit
    DQ --> Warn
    Sec --> Crit
    Biz --> Warn
    Crit --> SNS
    Warn --> SNS
    SNS --> PD

    classDef source fill:#E1F5FE,stroke:#0277BD,stroke-width:2px
    classDef collect fill:#F3E5F5,stroke:#6A1B9A,stroke-width:2px
    classDef dashboard fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px
    classDef alert fill:#FFEBEE,stroke:#C62828,stroke-width:2px
    class SM,Lambda,Snow,API source
    class CW,Logs,XRay,Custom collect
    class Perf,DQ,Sec,Biz dashboard
    class Crit,Warn,SNS,PD alert
```

### 9.1 Dashboard Configuration

```python
# CloudWatch Dashboard Definition
dashboard = {
    "name": "MedicalCodingML-Operations",
    "widgets": [
        {
            "type": "metric",
            "properties": {
                "metrics": [
                    ["AWS/SageMaker", "ModelLatency", {"stat": "Average"}],
                    ["AWS/SageMaker", "Invocations", {"stat": "Sum"}],
                    ["Custom", "PredictionAccuracy", {"stat": "Average"}]
                ],
                "period": 300,
                "region": "us-east-1",
                "title": "Model Performance"
            }
        },
        {
            "type": "log",
            "properties": {
                "query": """
                    fields @timestamp, accuracy, model_version
                    | filter @type = "PREDICTION_RESULT"
                    | stats avg(accuracy) by bin(5m)
                """,
                "region": "us-east-1",
                "title": "Accuracy Trend"
            }
        }
    ]
}
```

### 9.2 Alerting Rules

```yaml
Critical Alerts:
  - Model accuracy < 50%
  - Endpoint health check failures
  - Data pipeline failures > 2 consecutive
  - PHI access violations
Warning Alerts:
  - Inference latency p99 > 1s
  - Daily cost > $5000
  - Model drift detected
  - Queue depth > 10000
```
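The "Inference latency p99 > 1s" rule depends on a percentile statistic. CloudWatch computes percentiles server-side in practice; as a reference for what the rule means, here is a dependency-free sketch using the nearest-rank method (helper names are illustrative):

```python
# Nearest-rank percentile: the smallest sample value that is greater
# than or equal to p% of all samples. Used here only to make the
# "p99 > 1s" alerting rule concrete.
import math


def percentile(samples: list, p: float) -> float:
    """Return the nearest-rank p-th percentile of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def p99_breached(latencies_ms: list, threshold_ms: float = 1000.0) -> bool:
    """True when the p99 of the latency window exceeds the SLA threshold."""
    return percentile(latencies_ms, 99) > threshold_ms
```

Note the implication for alert sensitivity: in a window of 100 requests, a single slow outlier does not breach p99, but two do.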
---

## 10. Future Enhancements

### Near-term (3-6 months)
- Multi-language support for clinical notes
- Automated retraining pipelines
- Explainability dashboard for predictions
- Integration with additional EHR systems

### Medium-term (6-12 months)
- Epic Cognitive Computing Platform integration
- Real-time learning from coder feedback
- Multi-site federated learning
- Advanced ensemble methods

### Long-term (12+ months)
- Full autonomous coding for specific specialties
- Predictive analytics for coding optimization
- Natural language query interface
- Cross-institutional benchmarking

---

## Appendix A: Configuration Files

### A.1 Terraform Variables

```hcl
variable "environment" {
  description = "Deployment environment"
  type        = string
  default     = "production"
}

variable "ml_instance_types" {
  description = "SageMaker instance types for endpoints"
  type        = map(string)
  default = {
    drg_model   = "ml.g5.2xlarge"
    icd10_model = "ml.g5.xlarge"
  }
}

variable "snowflake_account" {
  description = "Snowflake account identifier"
  type        = string
  sensitive   = true
}
```

### A.2 Model Configuration

```json
{
  "drg_model": {
    "name": "drg-llama-v1",
    "framework": "pytorch",
    "framework_version": "2.0",
    "max_sequence_length": 1024,
    "batch_size": 32,
    "quantization": "int8"
  },
  "icd10_model": {
    "name": "clinical-bert-icd10",
    "framework": "transformers",
    "framework_version": "4.35",
    "max_sequence_length": 512,
    "num_labels": 1000,
    "attention_heads": 12
  }
}
```

---

## Appendix B: Troubleshooting Guide

### Common Issues and Resolutions

1. **High Inference Latency**
   - Check endpoint instance metrics
   - Verify batch size configuration
   - Consider upgrading instance type
   - Enable SageMaker Model Monitor

2. **Data Pipeline Failures**
   - Validate Snowpipe notification configuration
   - Check S3 bucket permissions
   - Review CloudWatch logs for Lambda errors
   - Verify network connectivity

3. **Model Accuracy Degradation**
   - Analyze data drift metrics
   - Review recent data quality issues
   - Check for schema changes in source systems
   - Trigger manual retraining if needed

4. **Cost Overruns**
   - Review SageMaker endpoint utilization
   - Check for runaway Snowflake queries
   - Audit S3 storage classes
   - Implement auto-scaling policies
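For the accuracy-degradation playbook above, "analyze data drift metrics" can be made concrete with a Population Stability Index (PSI) check over binned feature distributions. This is a minimal sketch under the assumption that bin proportions have already been computed, not the deployed Model Monitor; a common rule of thumb is PSI > 0.2 indicating significant drift.

```python
# Population Stability Index over two binned distributions
# (training-time "expected" vs. recent "actual" bin proportions,
# each summing to 1). Illustrative sketch, not the production monitor.
import math


def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)  # guard against empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```

A score near 0 means the recent distribution matches training; large positive values would justify the manual retraining trigger mentioned in item 3.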