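# 1. General-Purpose Distributed Training and Inference Architecture

## 1.1 Distributed Training

### 1.1.1 Why distributed training is needed

As model parameter counts and training datasets grow rapidly, the GPU memory and compute of a single machine are no longer enough to finish training in a reasonable time. Distributed training spreads the work across multiple cooperating nodes, which shortens the time to convergence and supports models or batch sizes that exceed single-machine memory. ([IBM][1])

### 1.1.2 Common parallelism strategies

1. **Data parallelism**
   * Each worker holds a full copy of the model weights, processes its own subset of the data, computes gradients, and then takes part in synchronous or asynchronous gradient aggregation. ([IBM][1])
   * Suitable for most deep learning tasks, such as image classification and language modeling.

   ![image](https://hackmd.io/_uploads/HkQ41WolWe.png)
2. **Model parallelism**
   * A single model is split across nodes. This applies when the model itself is larger than the memory of one GPU (for example, very large language models). ([dlsys.cs.washington.edu][2])

   ![image](https://hackmd.io/_uploads/Sk8K1Wslbx.png)
3. **Pipeline parallelism**
   * The network is cut into stages by layer or submodule, and data flows through the nodes like a pipeline, which raises hardware utilization. ([Vipul Vaibhaw][3])
4. **Hybrid parallelism**
   * In practice, data, model, and pipeline parallelism are often combined, together with techniques such as ZeRO / FSDP, to balance memory use against communication overhead.

### 1.1.3 A general distributed training architecture

A typical distributed training architecture can be abstracted into the following components: ([tensorflow.org][4])

* **Participating nodes**
  * *Workers*: run the forward and backward passes.
  * *Parameter Servers* or an *All-Reduce Group*: aggregate gradients and update weights.
* **Communication patterns**
  * Parameter-server architecture: weights are stored centrally; at every step workers push gradients to the servers and pull updated weights back.
  * All-reduce architecture: every node keeps its own replica of the weights, and gradients are synchronized with algorithms such as ring all-reduce.
* **Shared storage and data sources**
  * A distributed file system or object storage (S3/GCS/Azure Blob) from which workers read training batches in parallel.
* **Scheduling and fault tolerance**
  * K8S or the cloud platform manages node lifecycle, restarts failed nodes, and scales elastically.

![image](https://hackmd.io/_uploads/B1A4g-jlbx.png)

> Suggested figures:
>
> * The cluster diagram (Workers + Parameter Servers) from the TensorFlow *Distributed training with ParameterServerStrategy* guide. ([tensorflow.org][4])
> * The data-parallelism diagram from IBM's *What is Distributed Machine Learning?* ([IBM][1])

---

## 1.2 Distributed Inference

### 1.2.1 Where the requirements come from

Once a model is in production, real clinical or business scenarios often require:

* High request concurrency (many physicians invoking report generation at the same time)
* Strict latency requirements (responses within seconds)
* Multi-region deployment and fault tolerance

A distributed inference architecture is therefore used: multiple model replicas and several layers of load balancing spread the traffic, and resources scale out automatically when needed. ([Google Cloud][5])

### 1.2.2 Components of a general inference architecture

1. **Entry layer (API Gateway / Endpoint)**
   * Exposes a single URL and handles authentication, traffic control, and version routing.
   * For example, an Azure Machine Learning Online Endpoint can host multiple deployments (A/B, blue-green). ([Microsoft Learn][6])
2. **Model serving layer (Model Serving)**
   * Multiple model container replicas (Pods / Instances), managed by K8S or a cloud service.
   * Supports synchronous online inference and batch inference (Batch Prediction / Batch Endpoint). ([Google Cloud Documentation][7])
3. **Feature pre-processing and post-processing services**
   * Convert upstream business data into the model input format (feature engineering, normalization).
   * Convert model output into reports, scores, or labels, combined with rule checks.
4. **Storage and monitoring**
   * Log every inference request together with the model version, latency, and error rate.
   * Monitor the prediction distribution and data drift, and trigger the retraining workflow. ([Microsoft Learn][8])

![image](https://hackmd.io/_uploads/SyYKlWjgZl.png)

> Suggested figures:
>
> * The "endpoint with multiple deployments" diagram in Azure Machine Learning *Endpoints for inference*, which clearly shows the relationship between an endpoint and its deployments. ([Microsoft Learn][6])
> * The architecture diagrams in the Vertex AI Predictions / Online & Batch Prediction tutorials, which show the data flow from BigQuery or GCS to a BatchPredictionJob/Endpoint. ([Google Cloud Documentation][7])

---

## 1.3 Distributed Training and Inference on the Three Major Clouds

### 1.3.1 Google Cloud (GCP: Vertex AI + GKE)

**(1) Distributed training**

* The official *Vertex AI Serverless training overview* provides a workflow diagram that covers:
  * Data sources (BigQuery / Cloud Storage)
  * Creating a Training Job (custom container or framework)
  * The platform automatically provisioning multiple GPU/TPU machines to run training
  * Writing results back to a Model resource and Artifact Registry. ([Google Cloud Documentation][9])
* *Architecture for MLOps using TFX and Vertex AI Pipelines* complements it by showing:
  * An overall CI/CD/CT architecture in which Pipelines chain data preprocessing, training, evaluation, and deployment. ([Google Cloud Documentation][10])

**(2) Distributed inference**

* *Overview of getting inferences on Vertex AI* and related documentation describe:
  * **Online Prediction**: the model is deployed to an Endpoint, and multiple backend replicas serve HTTP requests.
  * **Batch Prediction**: a BatchPredictionJob runs offline inference over data in GCS/BigQuery, which suits parallel processing of large volumes of data. ([Google Cloud Documentation][7])
* The Vertex AI conceptual architecture diagram also shows separate "Training" and "Serve / Inference" layers, built on GKE and GPU/TPU hardware. ([Medium][11])

![image](https://hackmd.io/_uploads/BkqRlbjeZx.png)

> Suggested figures:
>
> * The official Vertex AI "Serverless training workflow" diagram (training). ([Google Cloud Documentation][9])
> * The official Vertex AI "ML workflow / predictions overview" diagram (inference). ([Google Cloud Documentation][12])

![image](https://hackmd.io/_uploads/SJCbZ-ogZe.png)

---

### 1.3.2 AWS (SageMaker + EKS)

**(1) Distributed training**

* The official *Model training - Amazon SageMaker AI* documentation provides a diagram in which:
  * Training data is stored in S3.
  * SageMaker launches multiple EC2 training nodes according to the job definition.
  * After training, the model is saved back to S3 and can be registered in the Model Registry. ([AWS Documentation][13])
* *Guidance for Distributed Model Training on AWS* and its accompanying PDF give a more complete architecture:
  * Hybrid distributed training can run on Kubeflow on EKS or on SageMaker Managed Training.
  * The diagram includes components such as VPC, the EKS cluster, S3, CloudWatch, and SageMaker Training Jobs. ([Amazon Web Services, Inc.][14])

![image](https://hackmd.io/_uploads/r1f4bZslbg.png)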
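
To make concrete what the multi-node training jobs above do under data parallelism (section 1.1.2), here is a minimal sketch in plain Python, with no GPUs and no framework. The model (`y = w * x`), the toy data, and the helper names `local_gradient`, `all_reduce_mean`, and `train` are all assumptions made for illustration; a real job would rely on a collective-communication library rather than a Python list average.

```python
"""Toy simulation of synchronous data parallelism (illustrative only)."""


def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


def train(shards, w=0.0, lr=0.01, steps=200):
    # Each worker holds a full replica of the (single) weight.
    replicas = [w] * len(shards)
    for _ in range(steps):
        # 1. Every worker computes a gradient on its own data shard.
        grads = [local_gradient(wi, s) for wi, s in zip(replicas, shards)]
        # 2. Gradients are averaged across workers.
        synced = all_reduce_mean(grads)
        # 3. Every worker applies the same update, so replicas stay identical.
        replicas = [wi - lr * g for wi, g in zip(replicas, synced)]
    return replicas


# Hypothetical data generated from y = 3x, split evenly across 4 workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
replicas = train(shards)
print(replicas)
```

Because the shards are the same size, averaging the per-worker gradients gives the full-batch gradient, which is why the replicas never diverge from one another.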
![image](https://hackmd.io/_uploads/rkrubZilZx.png)

**(2) Distributed inference**

* The architecture diagram in the AWS sample project *Distributed training and inference using SageMaker AI and MLFlow* shows:
  * Model services on a SageMaker Endpoint or on EKS, running as multiple replicas backed by S3 / Feature Store.
  * API Gateway / ALB exposing a REST interface externally.
  * MLflow / CloudWatch collecting inference logs and metrics. ([GitHub][15])

![image](https://hackmd.io/_uploads/BJuqbWslZg.png)

> Suggested figures:
>
> * The official SageMaker training architecture diagram (from the train-model page). ([AWS Documentation][13])
> * The reference architecture diagram in the AWS *Guidance for Distributed Model Training on AWS* PDF. ([d1.awsstatic.com][16])
> * The end-to-end architecture diagram (training + inference) in the README of the GitHub project *distributed-training-inference-sagemaker-mlflow*. ([GitHub][15])

---

### 1.3.3 Microsoft Azure (Azure ML + AKS)

**(1) Distributed training**

* *Architecture & key concepts - Azure Machine Learning* provides an overall platform diagram:
  * Workspace, Data Store, Compute Cluster (including GPU nodes), and Pipelines.
  * Training jobs run on an Azure ML Compute Cluster and support distributed frameworks (PyTorch, TensorFlow, and others). ([Microsoft Learn][17])
* The MLOps article in the Azure Architecture Center presents end-to-end architectures for three application types (CV, NLP, and so on):
  * From data ingestion and feature processing to training, model registration, deployment, and retraining. ([Microsoft Learn][8])

![image](https://hackmd.io/_uploads/HJAaZbigZx.png)

**(2) Distributed inference**

* *Endpoints for inference - Azure ML* provides a conceptual diagram of an endpoint with multiple deployments:
  * A single Endpoint can host multiple deployments (for example, v1 / v2 of a model).

    ![image](https://hackmd.io/_uploads/B1eGfWsxWl.png)
  * Percentage-based traffic routing is supported, which enables A/B testing or gradual rollout. ([Microsoft Learn][6])
* The MLOps guide and the Azure AI Architecture material show:

  ![image](https://hackmd.io/_uploads/rkLbfZjxWx.png)
  * Traffic enters through Application Gateway / API Management at the front end.
  * The back end can be an Azure ML Online Endpoint or containerized models deployed on AKS, monitored with Azure Monitor and Log Analytics. ([Microsoft Learn][8])

![image](https://hackmd.io/_uploads/rkheM-sl-g.png)

> Suggested figures:
>
> * The official Azure ML "architecture & key concepts" platform diagram (training + management). ([Microsoft Learn][17])
> * The official Azure ML Endpoints "endpoint with multiple deployments" diagram (inference). ([Microsoft Learn][6])

---

# 2. Distributed Design for Multimodal Medical Report Generation

Building on the general architecture above, this section focuses on how to implement multimodal medical report generation (imaging + structured medical records + lab results + text notes + speech), split into a training stage and an inference stage.

---

## 2.1 Multimodal Distributed Training: Medical Report Generation

### 2.1.1 Data and modality design

* **Structured modality**: FHIR Resources (Encounter, Observation, Condition, Procedure, and so on) provide a uniform description of medical records, lab results, and procedures.
* **Imaging modality**: DICOM images exported from PACS; Study/Series/Instance metadata must be preserved.
* **Text modality**: previously hand-written reports, discharge summaries, and consultation notes.
* **Speech modality**: audio of physicians dictating reports, first converted to text by ASR.
* **Annotation target**: the full text of a standardized medical report (including Impression / Findings / Recommendation) plus the required structured fields.

Each training sample can be organized as:

```text
{
  patient_id,
  encounter_id,
  modalities: { image, structured, text_context, speech_transcript },
  target_report,
  meta: { modality_type, department, device_model, date_time, annotator_id }
}
```

### 2.1.2 Distributed training workflow (a common approach across the three clouds)

1. **Data ingestion and preprocessing**
   * A data lake (GCS/S3/Blob) stores the raw multimodal data.
   * Large-scale ETL runs with Spark or Dask on K8S (GKE/EKS/AKS):
     * De-identification and file format conversion (DICOM → tensor).
     * Alignment: match the images, lab results, reports, and speech that belong to the same encounter. ([Google Cloud Documentation][10])
2. **Feature engineering and data splitting**
   * Tools such as Ray Data / Spark DataFrame run on the cluster to:
     * Perform stratified sampling by disease type and balance departments and devices.
     * Automatically create versioned train/val/test splits and record them in the data warehouse (BigQuery/Redshift/Synapse).
3. **Multimodal model architecture**
   * Image encoder (ResNet/ViT, and so on)
   * Text encoder / LLM (fine-tuned on medical corpora)
   * Speech is converted to text by ASR and merged into the text modality
   * A fusion layer (similar to the LLaVA/Vision-LLM architecture) that projects image embeddings into the LLM input. ([Vipul Vaibhaw][3])
4. **Distributed training and hyperparameter tuning**
   * Train on multiple machines and GPUs with Vertex AI Training / SageMaker Training / Azure ML Compute Cluster:
     * Use DDP / FSDP / ZeRO and similar techniques for data parallelism and parameter sharding. ([AWS Documentation][18])
   * Search with Ray Tune or each cloud's own hyperparameter service to tune the learning rate, batch size, and loss combination.
5. **Model registration and version management**
   * After training, register the model and its evaluation metrics in:
     * Vertex AI Model Registry, SageMaker Model Registry, or Azure ML Model Registry. ([Google Cloud][5])
   * Record the data version, code version, and experiment settings to meet medical audit requirements.

> For teaching purposes, the three cloud architecture diagrams in 1.3 can be annotated with an extra "multimodal encoder + LLM training" layer to show how the same design lands on different clouds.

<svg width="800" height="1150" xmlns="http://www.w3.org/2000/svg">
<defs>
<marker id="arrowhead" markerWidth="10" markerHeight="7" refX="9" refY="3.5" orient="auto"> <polygon points="0 0, 10 3.5, 0 7" fill="#555" /> </marker>
<filter id="dropShadow" x="-20%" y="-20%" width="140%" height="140%"> <feGaussianBlur in="SourceAlpha" stdDeviation="3"/> <feOffset dx="2" dy="2" result="offsetblur"/> <feComponentTransfer> <feFuncA type="linear" slope="0.3"/> </feComponentTransfer> <feMerge> <feMergeNode/> <feMergeNode in="SourceGraphic"/> </feMerge> </filter>
</defs>
<rect width="800" height="1150" fill="#ffffff" />
<text x="400" y="40" font-family="sans-serif" font-size="24" font-weight="bold" text-anchor="middle" fill="#333"> 2. Multimodal Report Generation: Distributed Training </text>
<text x="400" y="65" font-family="sans-serif" font-size="14" fill="#666" text-anchor="middle"> 2.1.1 Data and modality design &amp; 2.1.2 Distributed training workflow </text>
<g transform="translate(50, 100)">
<rect x="0" y="0" width="700" height="140" fill="#e3f2fd" stroke="#2196f3" stroke-width="2" rx="10" filter="url(#dropShadow)" />
<text x="20" y="30" font-family="sans-serif" font-size="16" font-weight="bold" fill="#0d47a1">2.1.1 Multimodal data inputs</text>
<g transform="translate(30, 50)">
<rect x="0" y="0" width="150" height="70" fill="#fff" stroke="#1976d2" rx="5" />
<text x="75" y="30" font-family="sans-serif" font-size="14" font-weight="bold" text-anchor="middle" fill="#333">Structured</text>
<text x="75" y="50" font-family="sans-serif" font-size="11" text-anchor="middle" fill="#555">FHIR (Encounter/Obs)</text>
<rect x="165" y="0" width="150" height="70" fill="#fff" stroke="#1976d2" rx="5" />
<text x="240" y="30" font-family="sans-serif" font-size="14" font-weight="bold" text-anchor="middle" fill="#333">Imaging</text>
<text x="240" y="50" font-family="sans-serif" font-size="11" text-anchor="middle" fill="#555">DICOM / PACS</text>
<rect x="330" y="0" width="150" height="70" fill="#fff" stroke="#1976d2" rx="5" />
<text x="405" y="30" font-family="sans-serif" font-size="14" font-weight="bold" text-anchor="middle" fill="#333">Text</text>
<text x="405" y="50" font-family="sans-serif" font-size="11" text-anchor="middle" fill="#555">Summaries / consults</text>
<rect x="495" y="0" width="150" height="70" fill="#fff" stroke="#1976d2" rx="5" />
<text x="570" y="30" font-family="sans-serif" font-size="14" font-weight="bold" text-anchor="middle" fill="#333">Speech</text>
<text x="570" y="50" font-family="sans-serif" font-size="11" text-anchor="middle" fill="#555">Dictation audio (ASR)</text>
</g>
</g>
<line x1="400" y1="240" x2="400" y2="280" stroke="#555" stroke-width="2" marker-end="url(#arrowhead)" />
<g transform="translate(50, 280)">
<rect x="0" y="0" width="700" height="280" fill="#fff8e1" stroke="#ff8f00" stroke-width="2" rx="10" filter="url(#dropShadow)" />
<text x="20" y="30" font-family="sans-serif" font-size="16" font-weight="bold" fill="#e65100">2.1.2 Data ingestion and preprocessing (ETL)</text>
<g transform="translate(40, 50)">
<path d="M0,15 A30,10 0 1,1 60,15 A30,10 0 1,1 0,15 M0,15 v50 A30,10 0 0,0 60,65 v-50" fill="#ffe0b2" stroke="#ef6c00" stroke-width="2"/>
<text x="30" y="45" font-family="sans-serif" font-size="12" text-anchor="middle" font-weight="bold">Data Lake</text>
<text x="30" y="60" font-family="sans-serif" font-size="10" text-anchor="middle">(GCS/S3/Blob)</text>
</g>
<rect x="140" y="50" width="520" height="80" fill="#fff" stroke="#ef6c00" stroke-dasharray="5,5" rx="5" />
<text x="400" y="70" font-family="sans-serif" font-size="12" font-weight="bold" text-anchor="middle" fill="#e65100">Compute cluster (Spark/Dask on K8S)</text>
<rect x="160" y="80" width="140" height="40" fill="#ffe0b2" rx="3" />
<text x="230" y="105" font-family="sans-serif" font-size="12" text-anchor="middle">De-ID &amp; conversion</text>
<rect x="310" y="80" width="180" height="40" fill="#ffcc80" rx="3" stroke="#e65100" stroke-width="2"/>
<text x="400" y="100" font-family="sans-serif" font-size="12" font-weight="bold" text-anchor="middle">Alignment</text>
<text x="400" y="115" font-family="sans-serif" font-size="10" text-anchor="middle">Group by Encounter ID</text>
<rect x="500" y="80" width="140" height="40" fill="#ffe0b2" rx="3" />
<text x="570" y="105" font-family="sans-serif" font-size="12" text-anchor="middle">Features &amp; splits</text>
<line x1="100" y1="90" x2="140" y2="90" stroke="#555" stroke-width="2" marker-end="url(#arrowhead)" />
<rect x="140" y="160" width="360" height="90" fill="#fff3e0" stroke="#ef6c00" rx="5" />
<text x="155" y="180" font-family="monospace" font-size="11" fill="#333" xml:space="preserve"> Training sample: { patient_id, encounter_id, modalities: {image, struct, text, speech}, target_report, meta } </text>
<g transform="translate(560, 160)">
<path d="M0,15 A30,10 0 1,1 60,15 A30,10 0 1,1 0,15 M0,15 v50 A30,10 0 0,0 60,65 v-50" fill="#ffe0b2" stroke="#ef6c00" stroke-width="2"/>
<text x="30" y="45" font-family="sans-serif" font-size="12" text-anchor="middle" font-weight="bold">Data Warehouse</text>
<text x="30" y="60" font-family="sans-serif" font-size="10" text-anchor="middle">(BQ/Redshift)</text>
<text x="30" y="90" font-family="sans-serif" font-size="10" text-anchor="middle" fill="#e65100">Split records</text>
</g>
<line x1="500" y1="205" x2="550" y2="205" stroke="#555" stroke-width="1" stroke-dasharray="4,4" marker-end="url(#arrowhead)" />
</g>
<line x1="400" y1="560" x2="400" y2="600" stroke="#555" stroke-width="2" marker-end="url(#arrowhead)" />
<g transform="translate(50, 600)">
<rect x="0" y="0" width="700" height="320" fill="#e8f5e9" stroke="#2e7d32" stroke-width="2" rx="10" filter="url(#dropShadow)" />
<text x="20" y="30" font-family="sans-serif" font-size="16" font-weight="bold" fill="#1b5e20">2.1.2 Multimodal model and distributed training</text>
<g transform="translate(40, 50)">
<rect x="0" y="0" width="620" height="120" fill="#fff" stroke="#4caf50" rx="5" />
<text x="10" y="20" font-family="sans-serif" font-size="12" fill="#388e3c" font-weight="bold">Model architecture</text>
<rect x="40" y="40" width="120" height="60" fill="#c8e6c9" stroke="#2e7d32" rx="3" />
<text x="100" y="65" font-family="sans-serif" font-size="12" text-anchor="middle">Image encoder</text>
<text x="100" y="80" font-family="sans-serif" font-size="10" text-anchor="middle">(ResNet/ViT)</text>
<rect x="180" y="40" width="120" height="60" fill="#c8e6c9" stroke="#2e7d32" rx="3" />
<text x="240" y="65" font-family="sans-serif" font-size="12" text-anchor="middle">Text encoder</text>
<text x="240" y="80" font-family="sans-serif" font-size="10" text-anchor="middle">(LLM/BERT)</text>
<text x="320" y="75" font-size="20" font-weight="bold" fill="#2e7d32">→</text>
<rect x="350" y="40" width="100" height="60" fill="#a5d6a7" stroke="#2e7d32" rx="3" />
<text x="400" y="65" font-family="sans-serif" font-size="12" text-anchor="middle" font-weight="bold">Fusion Layer</text>
<text x="400" y="80" font-family="sans-serif" font-size="10" text-anchor="middle">(Projection)</text>
<text x="460" y="75" font-size="20" font-weight="bold" fill="#2e7d32">→</text>
<rect x="490" y="40" width="100" height="60" fill="#81c784" stroke="#1b5e20" rx="3" stroke-width="2" />
<text x="540" y="65" font-family="sans-serif" font-size="12" text-anchor="middle" font-weight="bold">LLM</text>
<text x="540" y="80" font-family="sans-serif" font-size="10" text-anchor="middle">Report Gen</text>
</g>
<g transform="translate(40, 190)">
<rect x="0" y="0" width="620" height="100" fill="#f1f8e9" stroke="#8bc34a" stroke-dasharray="5,5" rx="5" />
<text x="310" y="25" font-family="sans-serif" font-size="14" text-anchor="middle" fill="#33691e" font-weight="bold">Cloud training cluster (Vertex AI / SageMaker / Azure ML)</text>
<rect x="50" y="40" width="240" height="40" fill="#dcedc8" stroke="#558b2f" rx="3" />
<text x="170" y="65" font-family="sans-serif" font-size="12" text-anchor="middle" font-weight="bold">Parallelism techniques</text>
<text x="170" y="85" font-family="sans-serif" font-size="10" text-anchor="middle" fill="#333">(DDP / FSDP / ZeRO)</text>
<rect x="330" y="40" width="240" height="40" fill="#dcedc8" stroke="#558b2f" rx="3" />
<text x="450" y="65" font-family="sans-serif" font-size="12" text-anchor="middle" font-weight="bold">Hyperparameter tuning</text>
<text x="450" y="85" font-family="sans-serif" font-size="10" text-anchor="middle" fill="#333">(Ray Tune / Bayesian Opt)</text>
</g>
<line x1="310" y1="170" x2="310" y2="190" stroke="#2e7d32" stroke-width="2" marker-end="url(#arrowhead)" />
</g>
<line x1="400" y1="920" x2="400" y2="960" stroke="#555" stroke-width="2" marker-end="url(#arrowhead)" />
<g transform="translate(50, 960)">
<rect x="0" y="0" width="700" height="150" fill="#f3e5f5" stroke="#7b1fa2" stroke-width="2" rx="10" filter="url(#dropShadow)" />
<text x="20" y="30" font-family="sans-serif" font-size="16" font-weight="bold" fill="#4a148c">2.1.2 Model registration and version management</text>
<g transform="translate(200, 50)">
<path d="M0,15 A40,12 0 1,1 80,15 A40,12 0 1,1 0,15 M0,15 v50 A40,12 0 0,0 80,65 v-50" fill="#e1bee7" stroke="#8e24aa" stroke-width="2"/>
<text x="40" y="45" font-family="sans-serif" font-size="12" text-anchor="middle" font-weight="bold">Model Registry</text>
<text x="40" y="60" font-family="sans-serif" font-size="10" text-anchor="middle">(Vertex/SageMaker)</text>
</g>
<g transform="translate(350, 60)">
<text x="0" y="0" font-family="sans-serif" font-size="14" font-weight="bold" fill="#4a148c">Audit items</text>
<line x1="0" y1="5" x2="200" y2="5" stroke="#7b1fa2" stroke-width="1"/>
<text x="10" y="25" font-family="sans-serif" font-size="12" fill="#333">• Data version</text>
<text x="10" y="45" font-family="sans-serif" font-size="12" fill="#333">• Code version (commit)</text>
<text x="10" y="65" font-family="sans-serif" font-size="12" fill="#333">• Experiment settings (Hyperparams)</text>
</g>
<line x1="290" y1="80" x2="340" y2="80" stroke="#7b1fa2" stroke-width="2" stroke-dasharray="4,4" />
</g>
</svg>

---

## 2.2 Multimodal Distributed Inference: Medical Report Generation

### 2.2.1 Clinical usage scenarios

* A radiologist selects an imaging study in PACS and clicks "AI report draft".
* A resident opens the discharge summary page in the EMR and asks the AI to generate a first draft.
* Speech input: the physician dictates supplementary remarks, and the system transcribes the speech and adds it to the report context.

### 2.2.2 Core distributed inference workflow

1. **Entry and authentication**
   * Hospital systems connect to the cloud Endpoint (or an on-premises private cloud) over VPN or a dedicated line.
   * The API Gateway / Endpoint handles authentication, authorization checks, and traffic control. ([Microsoft Learn][6])
2. **Multimodal data collection and preprocessing**
   * The Orchestrator service:
     * Fetches structured records and lab values from the FHIR Gateway.
     * Pulls DICOM images from the PACS Adapter.
     * Sends speech to the ASR service to produce a transcript.
   * The preprocessing service converts the data into a standard JSON/tensor format that includes the required metadata.
3. **Multimodal encoders and LLM inference (distributed)**
   * A GPU Node Pool is created on the K8S cluster, and frameworks such as Ray / KServe / vLLM manage inference:
     * The image encoder service produces image embeddings.
     * The text encoder service processes the history summary, lab results, and ASR text.
     * The LLM service combines the multimodal embeddings with structured fields to generate the report draft. ([Medium][11])
   * Autoscaling and multi-replica deployment allow many physicians to call the service at the same time.
4. **Rule checks and audit output**
   * The report draft is sent to the Rule Engine, which:
     * Checks it against health insurance claim rules (combinations of diagnosis and procedure codes).
     * Checks it against clinical guidelines (for example, whether the report mentions an abnormal lab result).
   * It produces annotations such as "Consider adding an Impression" or "Diagnosis code inconsistent with the narrative".
5. **Physician review and write-back**
   * The front-end UI shows:
     * Image thumbnails, key lab fields, the multimodal AI draft, and rule hints.
   * After the physician edits and signs, the final report is written back to EMR / PACS, and the following are stored as well:
     * The AI draft version
     * The diff of manual edits
     * The model version used and a timestamp
   * This data can flow back into the training workflow in 2.1 as a source of labels for continual learning.

### 2.2.3 Healthcare-specific considerations for distributed inference

* **Security and privacy**
  * Apply the principle of minimum exposure: send only the fields needed for training or inference.
  * Encrypt all traffic in transit and track every access with an audit log. ([Google Cloud Documentation][19])
* **Latency and availability**
  * Plan multi-region deployment and standby nodes for the region where the hospital is located.
  * Workloads without real-time requirements (batch generation of discharge summaries) can go through a Batch Endpoint to reduce peak load. ([Google Cloud Documentation][7])
* **Positioning of human-AI collaboration**
  * The system is explicitly positioned as "report draft generation and assistive hints"; it does not replace the physician's clinical judgment.
  * Traceable evidence must be retained, including a summary of the input modalities, the model version, and the edit history, to satisfy internal hospital review and external regulatory requirements.

[1]: https://www.ibm.com/think/topics/distributed-machine-learning?utm_source=chatgpt.com "What Is Distributed Machine Learning?"
[2]: https://dlsys.cs.washington.edu/pdf/lecture11.pdf?utm_source=chatgpt.com "Lecture 11: Distributed Training and ..."
[3]: https://vaibhawvipul.github.io/2024/09/29/Distributed-Training-of-Deep-Learning-models-Part-~-1.html?utm_source=chatgpt.com "Distributed Training of Deep Learning models - Part ~ 1"
[4]: https://www.tensorflow.org/guide/distributed_training?utm_source=chatgpt.com "Distributed training with TensorFlow"
[5]: https://cloud.google.com/vertex-ai?utm_source=chatgpt.com "Vertex AI Platform"
[6]: https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints?view=azureml-api-2&utm_source=chatgpt.com "Endpoints for inference - Azure Machine Learning"
[7]: https://docs.cloud.google.com/vertex-ai/docs/predictions/overview?utm_source=chatgpt.com "Overview of getting inferences on Vertex AI"
[8]: https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/machine-learning-operations-v2?utm_source=chatgpt.com "Machine learning operations - Azure Architecture Center"
[9]: https://docs.cloud.google.com/vertex-ai/docs/training/overview?utm_source=chatgpt.com "Vertex AI serverless training overview"
[10]: https://docs.cloud.google.com/architecture/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-build?utm_source=chatgpt.com "Architecture for MLOps using TensorFlow Extended, Vertex ..."
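[11]: https://medium.com/%40techlatest.net/what-is-google-cloud-vertex-ai-its-architecture-and-key-features-3a265ae09f82?utm_source=chatgpt.com "Google Cloud Vertex AI, its architecture, and key components"
[12]: https://docs.cloud.google.com/vertex-ai/docs/start/introduction-unified-platform?utm_source=chatgpt.com "Introduction to Vertex AI"
[13]: https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html?utm_source=chatgpt.com "Model training - Amazon SageMaker AI"
[14]: https://aws.amazon.com/solutions/guidance/distributed-model-training-on-aws/?utm_source=chatgpt.com "Guidance for Distributed Model Training on AWS"
[15]: https://github.com/aws-samples/distributed-training-inference-sagemaker-mlflow?utm_source=chatgpt.com "aws-samples/distributed-training-inference-sagemaker ..."
[16]: https://d1.awsstatic.com/solutions/guidance/architecture-diagrams/distributed-model-training-on-aws.pdf?utm_source=chatgpt.com "Guidance for Distributed Model Training on AWS - awsstatic.com"
[17]: https://learn.microsoft.com/en-us/AZURE/machine-learning/concept-azure-machine-learning-architecture?preserve-view=true&view=azureml-api-1&utm_source=chatgpt.com "Architecture & key concepts (v1) - Azure Machine Learning"
[18]: https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html?utm_source=chatgpt.com "Introduction to the SageMaker AI distributed data ..."
[19]: https://docs.cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture?utm_source=chatgpt.com "GKE cluster architecture | Google Kubernetes Engine (GKE)"

---

# 3. Optimizing GPU Usage: System Design Modules and Algorithms for Training and Inference

## 3.1 Design goals and considerations

For large deep learning and multimodal models (LLM-class models in particular), GPU cost and latency are often the key constraints on system design. The goals of GPU optimization can be summarized in three points:

1. **Raise throughput and shorten training/inference time**: use parallelism, mixed precision, and kernel optimization to make full use of hardware features such as Tensor Cores.
2. **Reduce memory pressure and cost**: use parameter sharding, gradient compression, and quantization to fit larger models or more concurrent requests on the same hardware.
3. **Preserve numerical stability and model quality**: when mixed precision and quantization are used, add appropriate loss scaling, calibration, and offline evaluation so that model quality is not noticeably degraded.

The following sections cover common design modules and algorithms, first for training and then for inference.

---

## 3.2 GPU Training Optimization Modules and Algorithms

### 3.2.1 Mixed-precision training

* **Concept**: without hurting convergence, run most matrix operations in lower precision such as FP16/FP8, while keeping critical accumulations and statistics in FP32.
* **Hardware basis**: NVIDIA GPUs from Volta onward (V100, A100, H100, and so on) have Tensor Cores that substantially accelerate half-precision matrix multiplication. NVIDIA's official guide reports that mixed precision with Tensor Cores can deliver speedups of several times on multiple networks while maintaining the same accuracy.
* **Implementation notes**:
  * Use the AMP (automatic mixed precision) API provided by the framework.
  * Add loss scaling so that small gradient values do not underflow.
  * On cloud platforms (Vertex AI, SageMaker, Azure ML), choose GPU machine types with Tensor Cores and enable the corresponding containers.

### 3.2.2 Parallelism strategies and memory optimization

* **Data parallelism + all-reduce**
  * Synchronizing gradients across GPUs with the ring all-reduce supported by NCCL is the mainstream approach today.
* **Model parallelism and pipeline parallelism**
  * Very large models must place different layers on different GPUs and use pipeline parallelism to raise utilization; frameworks such as DeepSpeed and Megatron-LM have these strategies built in.
* **FSDP / ZeRO / checkpointing**
  * Fully Sharded Data Parallel (FSDP) or ZeRO shards parameters, gradients, and optimizer state across nodes; combined with activation checkpointing, this can significantly reduce per-GPU memory use.

### 3.2.3 Kernel fusion and compiler optimization

* **Operator / kernel fusion**
  * Fusing several consecutive operators into a single GPU kernel reduces memory round trips and kernel launch overhead.
  * Research such as Souffle (2024) shows that global DAG analysis combined with kernel fusion can yield a further speedup of roughly 3 to 7 times relative to TensorRT and XLA.
* **Deep learning compilers**
  * Tools such as XLA, TVM, and MLIR/IREE automatically optimize and compile the model's computation graph, improving performance for both training and inference.

### 3.2.4 I/O and data pipelines

* **Efficient input pipelines**
  * For image and speech training, decompression, augmentation, and loading must not become the bottleneck.
  * Approaches include:
    * Using sequential-read formats such as RecordIO/TFRecord/Parquet.
    * Running data preprocessing on the CPU or on a separate set of GPUs and feeding the trainer through prefetching and batching. ([Google Cloud Documentation][10])
* **Multi-GPU topology and bandwidth**
  * A 2024 GPU benchmark study indicates that, in multi-GPU clusters, the NVLink/NVSwitch topology and PCIe bandwidth have a significant effect on training efficiency, so choosing machine types with high-bandwidth interconnects is an important design decision for large-model training.

---

## 3.3 GPU Inference Optimization Modules and Algorithms

For LLMs and multimodal report generation systems, inference often accounts for the bulk of total cost. The current mainstream approach is to use a dedicated LLM inference engine such as vLLM, TensorRT-LLM, or DeepSpeed-MII.

### 3.3.1 Core inference engines and features

* **TensorRT-LLM**
  * NVIDIA's open-source LLM inference optimization toolkit, which integrates TensorRT, CUDA, and cuDNN.
  * It provides:
    * Custom attention kernels
    * In-flight batching
    * Paged KV cache
    * Quantization at several precisions (FP8, FP4, INT4 AWQ, INT8 SmoothQuant)
    * Speculative decoding, among other features.
* **vLLM / DeepSpeed-MII**
  * vLLM centers on "paged attention" and efficient KV cache management, which can greatly increase the number of concurrent requests per GPU.
  * DeepSpeed-MII integrates DeepSpeed's inference optimizations and model compression techniques.

Recent studies suggest that TensorRT-LLM usually achieves better latency and throughput than vLLM and DeepSpeed-MII on NVIDIA hardware, although the engines trade off differently on energy use and deployment flexibility.

### 3.3.2 Key algorithmic techniques

1. **Quantization**
   * Representing weights and the KV cache in INT8, INT4, or a mixed scheme (for example, W4A16KV8) can reduce memory use and latency, often with little loss of quality.
2. **KV cache management and paging (Paged KV Cache)**
   * Managing the KV cache in pages supports sharing and reuse across requests and reduces memory fragmentation and copy overhead, which is especially useful for long-context LLM report generation.
3. **Dynamic batching and request scheduling**
   * Requests are dynamically merged into GPU-friendly batches to trade latency against throughput.
   * Modern inference platforms mostly use "in-flight batching" and multi-level scheduling to handle requests of different lengths and priorities.
4. **Speculative decoding / draft-and-verify architecture**
   * A smaller "draft model" first proposes several candidate tokens; the large model then verifies them in a single pass and accepts a prefix, which reduces the number of sequential forward passes through the large model.
5. **Multi-GPU / multi-node inference**
   * For very large models, tensor parallelism or pipeline parallelism distributes inference across multiple GPUs.
   * Recent system designs also split encoding and decoding onto different GPUs, or adopt disaggregated inference.

---

## 3.4 Mapping to the Three Major Cloud Platforms

* **GCP (Vertex AI + GKE)**
  * Use A3/H100 or A2/A100 GPU nodes with Vertex AI Training and Vertex Online/Batch Prediction.
  * A self-managed vLLM/TensorRT-LLM service can run on GKE, with autoscaling and GPU node pools to control cost. ([Medium][11])
* **AWS (SageMaker + EKS)**
  * Use p4/p5 (A100/H100) instances with SageMaker distributed training, together with a SageMaker Endpoint or TensorRT-LLM/vLLM running on EKS.
  * The official architecture documentation provides a complete example from S3 → Training Jobs → Endpoint. ([AWS Documentation][13])
* **Azure (Azure ML + AKS)**
  * Use an Azure ML Compute Cluster (N-series GPUs) and the job API to run distributed training.
  * Inference can go through an Azure ML Online Endpoint or an inference engine deployed on AKS, with Application Gateway / Azure Monitor used to monitor performance. ([Microsoft Learn][17])

---

# 4. Security Design for Inference and Training

## 4.1 Threat model and risk overview

Recent literature groups the security and privacy risks of machine learning systems into three categories:

1. **Data-layer risks**: data leakage, re-identification attacks, and inference of sensitive information (for example, membership inference).
2. **Model-layer risks**: adversarial examples, model theft, model reverse engineering, and malicious weight injection.
3. **System-layer risks**: weak access control, API abuse, sensitive information left in logs, supply-chain attacks, and so on.

Security for training and inference environments therefore has to span the data, model, and infrastructure layers, and integrate with the cloud platform's existing security mechanisms.

---

## 4.2 Protection Architecture for the Training Stage

### 4.2.1 Access control and identity management

* **RBAC and least privilege**
  * Design role-based access control (RBAC) for data stores (for example, GCS/S3/Blob), the model repository, and the training cluster.
  * Only designated personnel may access raw data before de-identification; training code may read only datasets keyed by pseudonymized IDs.
* **Stronger authentication**
  * Administrators and CI/CD systems must use multi-factor authentication and short-lived tokens to reduce the risk of long-lived credentials leaking.

### 4.2.2 Data privacy and encryption

* **Encryption at rest and in transit**
  * Training data and model weights stored in the cloud use server-side or client-side encryption, with keys managed by KMS.
  * All cross-service communication is protected with TLS to prevent man-in-the-middle attacks.
* **Privacy-enhancing technologies (PETs)**
  * Depending on the scenario, introduce differentially private training, federated learning, or secure multi-party computation to reduce the risk of a single institution's data leaking; recent surveys indicate that these techniques can lower the success rate of both active and passive privacy attacks.

### 4.2.3 Confidential computing and hardware protection

* **Confidential computing**
  * Hardware trusted execution environments (TEEs) encrypt memory contents during CPU/GPU training, protecting against cloud platform administrators or a malicious hypervisor.
  * GCP Confidential VMs on H100, AWS Nitro Enclaves, and Azure Confidential VM provide confidential or isolated execution environments for AI training and analytics; whether GPU workloads are covered depends on the specific offering.

### 4.2.4 Supply-chain security for training data and code

* **Data versioning and provenance tracking**
  * Build a Data Catalog and lineage records that note the data source, processing steps, and purpose of use, to support audit and traceability.
* **Container and package scanning**
  * Docker images used for training must pass malware and vulnerability scanning and be built on pinned base image versions to reduce the risk of supply-chain implants.

---

## 4.3 Protection Architecture for the Inference Stage

### 4.3.1 API and endpoint protection

* **Endpoint access control**
  * Expose only the API Gateway / Endpoint externally, and keep back-end inference services in a private subnet.
  * Control endpoint access with OAuth2 / OIDC or the hospital's SSO, together with rate limiting, brute-force protection, and IP filtering. ([Microsoft Learn][6])
* **Input validation and throttling**
  * Check the size, format, and frequency of inference requests to reduce exposure to prompt injection, large-scale probing, or DoS attacks.

### 4.3.2 Model protection and misuse prevention

* **Preventing model theft and reverse engineering**
  * Limit the detail and bandwidth of inference API responses (for example, do not return logits) to make it harder to copy the model through queries.
* **Adversarial example and abuse detection**
  * Monitor the input distribution and model outputs to detect anomalous query patterns (such as large volumes of random input or known attack templates), and restrict the source or trigger manual review when necessary.

### 4.3.3 Logging and audit

* **Security log design**
  * For every inference, record the time, caller identity, model version, and a necessary summary of the input (avoid logging sensitive content in full).
  * Use a centralized log platform and SIEM tools for correlation analysis.
* **Privacy-preserving monitoring**
  * Monitor model performance and drift without exposing patient data, keeping only anonymized statistics when necessary.

---

## 4.4 Security Mechanisms Across the Three Major Cloud Platforms

* **GCP**
  * IAM + Cloud KMS + VPC Service Controls define the access boundary.
  * Cloud Logging / Cloud Monitoring provide model and inference monitoring.
  * Confidential VMs for H100/A3 support confidential training and inference.
* **AWS**
  * IAM, KMS, VPC, and Security Groups form the basis of network and access control.
  * SageMaker provides dataset and model lineage, endpoint-level logs, and CloudWatch monitoring.
  * Nitro Enclaves let critical tasks run in an isolated environment. ([AWS Documentation][13])
* **Azure**
  * Entra ID (formerly Azure AD) + RBAC + Key Vault manage identities and keys.
  * Azure ML integrates with Azure Monitor / Log Analytics for model monitoring and audit.
  * Confidential VM / Confidential Container provide a confidential computing environment. ([Microsoft Learn][17])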
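
The security log design in section 4.3.3 (record the time, caller identity, and model version, plus a summary of the input rather than the input itself) could be sketched as follows. This is only an illustration under stated assumptions: the function name `build_audit_record`, the field names, and the example identifiers are hypothetical, and a SHA-256 digest of the canonicalized payload stands in for the "input summary".

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(caller_id, model_version, payload, latency_ms):
    """Build a log entry that stores a digest of the input, not the input."""
    # Canonical JSON (sorted keys) so the same payload always hashes the same.
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "caller_id": caller_id,
        "model_version": model_version,
        "input_sha256": hashlib.sha256(canonical).hexdigest(),
        "input_fields": sorted(payload),  # field names only, never the values
        "latency_ms": latency_ms,
    }


record = build_audit_record(
    caller_id="dr-0001",              # hypothetical caller identifier
    model_version="report-gen-1.2",   # hypothetical model version tag
    payload={"encounter_id": "E123", "findings": "free text with PHI"},
    latency_ms=840,
)
print(json.dumps(record, indent=2))
```

A plain hash of a short or predictable input can be guessed by brute force, so a keyed digest (for example, HMAC with a key held in the KMS) may be a better fit where that matters.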