Accelerating Genome-Wide Association Studies Using Mixed-Precision Computations on the Alps Supercomputer [S72297]
Rabab Alomairy, Postdoctoral Researcher, MIT / King Abdullah University of Science and Technology (KAUST); Hatem Ltaief, Principal Research Scientist, KAUST

We exploit the widening spread of tensor-core performance across [FP64/FP32/FP16/INT8, FP64/FP32/FP16/FP8/INT8] on NVIDIA Ampere and Hopper GPUs to improve the performance of output-accuracy-preserving mixed-precision (MxP) computation for a genome-wide association study (GWAS) of 305,000 patients from the UK Biobank, the largest GWAS cohort ever used to study genetic epistasis with a multivariate approach. We highlight the numerical robustness of the MxP approach, illustrate its performance impact on our HPC GWAS application using the Summit/Leonardo/Alps supercomputers, and compare them with Frontier.

* Genome-wide association studies analyze DNA sequence variation across an entire genome (human or otherwise) to identify genetic risk factors for diseases or other traits present in a population
* We deploy high-performance, tile-centric matrix computations to democratize kernel ridge regression (KRR) so that it can capture genetic epistasis
* We accelerate the solution of the regularized KRR system with a new Cholesky-based four-precision solver that runs at 1.805 mixed-precision ExaOp/s on a nearly full Alps system, five orders of magnitude beyond the state-of-the-art CPU-only REGENIE GWAS software

Topic: Simulation / Modeling / Design - Supercomputing
Industry: Healthcare & Life Sciences
Technical Level: Technical - Intermediate
Primary Audience: Research: Academic
All Audience Types: Data Scientist, Developer/Engineer, Research: Non-Academic
Thursday, March 20, 8:00 PM - 8:40 PM CST
S72297_使用阿爾卑斯超級電腦上的混合精度計算加速全基因組關聯研究.pdf

AI Transcript

Hello, everyone! I'm excited to share this talk with you today. Joining me is Hatem Ltaief from the Extreme Computing Research Center. Together we will present our talk, "Accelerating Genome-Wide Association Studies Using Mixed-Precision Computations on the Alps Supercomputer." Our team is a collaboration spanning NVIDIA, KAUST, and MIT, and it includes members recognized for notable achievements in this area. Why do most of us come from computational science and computer science backgrounds? Because that is closely tied to the nature of this research.

Since the session is being recorded, our whole team will be available to answer your questions. Our research focuses on predicting genetic predisposition to diseases and on recent advances in interpreting the interaction of multiple genetic variations, known as the epistatic effect. We scale up to 300K real patients and 13 million synthetic patients in our algorithm. We employ adaptive precision, balancing computational efficiency against accuracy tuning. Our approach uses up to five data types, including four floating-point precisions. Our development has been deployed on four of the world's top ten supercomputers, and by leveraging FP8 and INT8 we reach 1.8 mixed-precision exaops per second on, for example, NVIDIA's Grace Hopper GH200 Superchip.
Do you know what an association study is? It is a genotype-to-phenotype mapping technique. We take as input a training population whose DNA has been sequenced and that has been characterized, together with environmental factors, for diseases and traits, that is, the phenotypes. From the sequencing and environmental data, we can predict the population's propensity for those diseases and traits, and the quality of the predictions can be assessed by comparing them with actual outcomes. You can think of it as the forward solve of an inference problem, used to detect specific genetic variations, relative to a reference genome, that are responsible for certain good or bad phenotypes and are characterized by a Manhattan plot: one plot per phenotype, showing which locations in the genotype correlate with it. The first significant study came shortly after the completion of the human reference genome and targeted macular degeneration, a leading cause of blindness late in life; it analyzed 116K genotypes. Seven years ago, the Gordon Bell Prize was awarded to a study surveying 882 patients across 28 million genotypes. Note that the number of patients determines the dimension of the kernel matrix to be factored, while the number of genotypes is the inner dimension of the inner product forming each matrix element; so the 2018 study covered potentially wide variation among a relatively small number of patients. Our study covers 300K patients across 43K genotypes, examining five disease phenotypes, and we have extended it to 13 million patients, fully exploiting the available memory, which becomes the limiting resource. Our next target is wheat, the source of roughly 19% of human nutrition and subject to numerous climate-related diseases. It is an interesting target because its DNA is about five times more complex than human DNA, once again stressing the inner dimension of the algorithm.

Let me first define some terms we will refer to in this talk. A single nucleotide polymorphism (SNP) is the substitution of a single nucleotide at a specific location in the genome. MSPE is simply the mean square prediction error of a phenotype over the test population. The correlation coefficient is a normalized measure of the covariance of two variables, ranging from -1 to 1: positive values indicate correlation, negative values anti-correlation, and zero no correlation. The main significance of this work, apart from the performance achieved through low-precision computation on today's GPUs, is its multivariate nature, which lets us account for interactions between multiple SNPs, since a phenotype may depend on one gene's expression triggering another. This recent development applies a kernel technique to standard association studies, mapping the data into a higher-dimensional space where associations become clearer. You may be familiar with the kernel method embedded in support vector machines: in 2D, for example, there may be no planar cut separating red points from green points, but when the data are lifted into a third dimension by exponentiating the negative distance from the center, a separating plane appears. Our highlights include leveraging four generations of GPUs, using INT8 for the genotype distance calculations, and exploiting the properties of dense kernel matrices to boost efficiency.
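As a minimal illustration of the two accuracy metrics just defined, the sketch below (plain NumPy, with hypothetical array names) computes the mean square prediction error and the Pearson correlation between predicted and observed phenotypes for a held-out test cohort.

```python
import numpy as np

def mspe(y_true, y_pred):
    """Mean square prediction error of a phenotype over the test population."""
    return float(np.mean((y_true - y_pred) ** 2))

def pearson(y_true, y_pred):
    """Pearson correlation: +1 correlated, -1 anti-correlated, 0 uncorrelated."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Hypothetical test-cohort phenotypes and model predictions.
rng = np.random.default_rng(0)
y_true = rng.standard_normal(1_000)
y_pred = y_true + 0.3 * rng.standard_normal(1_000)   # noisy predictions
print(f"MSPE = {mspe(y_true, y_pred):.3f}, r = {pearson(y_true, y_pred):.3f}")
```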
We are expanding population sizes into the millions, far beyond what the current state-of-the-art software regime can handle. This expansion is a direct response to the sharply declining cost of DNA sequencing. We note that the future market for 13 million patients could include 104 of the world's countries, whose entire populations could be sequenced; the genomes of most of these populations have never entered European or American medical studies. The attributes we offer are large populations, scalability, and performance beyond the recently developed kernel form of association studies, exploiting low-precision tensor cores with quantifiable tradeoffs in accuracy, and portability across multiple supercomputers with different generations of processors, delivering 1.8 mixed-precision exaops per second. These are the several supercomputers we run on. We use the encrypted data on our own machines, the compute-intensive linear-algebra phases run across many systems, and the whole pipeline runs with synthetic data on Alps. The scalability of the two compute-intensive phases of the pipeline has been studied, as you will see later. For the compute-intensive association phase we use fine-grained, tile-based linear algebra driven by task graphs. Our algorithm's signature is the ability to convert data types on a tile-by-tile basis to optimize precision for both communication and computation on massive data. This synergistic approach has powered our last three finalist submissions since 2022: low precision for the small-magnitude parts of the data, higher precision where large magnitudes decide the outcome of the computation, and low-rank decompositions for the smooth parts. Balancing accuracy and efficiency in this way is a classic advantage in large-scale computing. A dynamic runtime system acts as the master of ceremonies, maximizing concurrency and redistributing the load imbalance that arises from the algorithm's adaptivity. It consumes a task graph, firing tasks on the nodes as their data dependencies are satisfied, and batches tasks across both distributed and shared memory. What motivates all of this is the factor of 225 between the highest- and lowest-precision tensor cores.
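To give a feel for what "tile-based linear algebra driven by task graphs" means, here is a minimal sequential NumPy/SciPy sketch of a right-looking tile Cholesky factorization. Each call in the loop body corresponds to one task (POTRF, TRSM, SYRK, GEMM) in the directed acyclic graph that a dynamic runtime would schedule across GPUs once that task's input tiles are ready; the production solver additionally assigns a precision per tile, which this precision-agnostic sketch omits.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def tile_cholesky(A, nb):
    """Right-looking tile Cholesky (lower). Each operation below is one task in
    the DAG; a dynamic runtime would fire it once its tile dependencies are met."""
    n = A.shape[0]
    assert n % nb == 0, "illustrative sketch: matrix size must be a multiple of nb"
    T = n // nb
    L = A.astype(np.float64).copy()
    for k in range(T):
        kk = slice(k * nb, (k + 1) * nb)
        L[kk, kk] = cholesky(L[kk, kk], lower=True)            # POTRF task
        for i in range(k + 1, T):
            ii = slice(i * nb, (i + 1) * nb)
            L[ii, kk] = solve_triangular(L[kk, kk], L[ii, kk].T,
                                         lower=True).T          # TRSM task
        for i in range(k + 1, T):
            ii = slice(i * nb, (i + 1) * nb)
            L[ii, ii] -= L[ii, kk] @ L[ii, kk].T                # SYRK task
            for j in range(k + 1, i):
                jj = slice(j * nb, (j + 1) * nb)
                L[ii, jj] -= L[ii, kk] @ L[jj, kk].T            # GEMM task
    return np.tril(L)

# Quick check on a random symmetric positive-definite matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((512, 512))
A = X @ X.T + 512 * np.eye(512)
L = tile_cholesky(A, nb=128)
print(np.allclose(L @ L.T, A))   # True
```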
As you can see on the right of the slide, between the blue and the yellow bars, comparing FP64 on Volta with today's lowest precisions shows a rate increase of three orders of magnitude in seven years, but only if you can follow the precision downward along with the hardware built for AI. The good news is that for many problems, including association studies, you can. Let's get into the algorithm. As we mentioned, the standard model involves a large number of SNPs relative to the number of samples. This high-dimensional data poses challenges, including multicollinearity and an increased risk of overfitting. Ridge regression addresses these challenges by introducing a regularization term that shrinks the effect sizes of the SNPs, making the model more stable and robust; this is what is implemented in REGENIE, a widely used software tool for association studies. This is the objective function of ridge regression, and this is its closed-form solution, which translates into solving a system of linear equations built from a large, dense patient-by-SNP matrix. The computation involves a symmetric rank-k update, adding the regularization, and a matrix multiplication to form the right-hand side, followed by a Cholesky-based solver. Kernel ridge regression extends ridge regression with a kernel method in order to model complex, nonlinear relationships among SNPs and between SNPs and traits. This is the objective function of kernel ridge regression, and this is its closed-form solution, which can likewise be cast as a system of linear equations. Here we need a Gaussian RBF kernel that measures the similarity between individuals, and we again solve the system with a Cholesky-based solver; we call this the association phase. In the genotype matrix, SNPs are represented by the integer values 0, 1, or 2, shown as the green tiles here, followed by a few columns of confounder data, the purple tiles, stored in single precision. This multi-precision nature of the association-study dataset lets us use hardware features such as the tensor cores effectively. The first computational step forms the covariance matrix with Level-3 BLAS operations, specifically SYRK: a mixed-precision GEMM handles tiles containing integer genotype data, and a single-precision GEMM handles tiles containing the floating-point confounder data. The resulting symmetric matrix is then factorized with a Cholesky decomposition after adding the regularization term, and this Cholesky factorization is the most time-consuming step. It can be performed in full precision, with a banded mixed-precision approach, or with an adaptive mixed-precision approach that assigns a precision to each tile following Nick Higham's formulation: if the norm of a tile is sufficiently small compared with the scaled and normalized norm of the entire matrix, that tile can be computed in lower precision without compromising the overall accuracy. The kernel ridge regression algorithm divides into three main phases, the build phase, the associate phase, and the predict phase, and we will go into the details of each.
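To make the closed forms referred to above concrete: in standard notation, ridge regression minimizes ||y - G b||^2 + lambda ||b||^2, giving b = (G^T G + lambda I)^{-1} G^T y, while kernel ridge regression solves (K + lambda I) w = y for a kernel matrix K of pairwise similarities. The sketch below is a minimal NumPy/SciPy illustration of that KRR associate-and-predict path with a Gaussian RBF kernel and a Cholesky-based solve; the function names, bandwidth, and regularization values are illustrative assumptions, not the production tile-based mixed-precision solver.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel: K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def krr_fit(G_train, y, lam, sigma):
    """Associate phase: solve (K + lam*I) w = y with a Cholesky-based solver."""
    K = rbf_kernel(G_train, G_train, sigma)
    K[np.diag_indices_from(K)] += lam        # regularization term
    c, low = cho_factor(K, lower=True)       # Cholesky factorization
    return cho_solve((c, low), y)            # forward/backward triangular solves

def krr_predict(G_test, G_train, w, sigma):
    """Predict phase: cross-kernel between the new cohort and the training cohort."""
    return rbf_kernel(G_test, G_train, sigma) @ w

# Toy run on a hypothetical genotype matrix (patients x SNPs, values 0/1/2).
rng = np.random.default_rng(0)
G_train = rng.integers(0, 3, size=(200, 500)).astype(np.float32)
G_test  = rng.integers(0, 3, size=(50, 500)).astype(np.float32)
y = rng.standard_normal(200).astype(np.float32)
w = krr_fit(G_train, y, lam=0.1, sigma=20.0)
y_hat = krr_predict(G_test, G_train, w, sigma=20.0)
```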
Thank you, Rabab, for introducing the three-phase kernel ridge regression methodology. The first phase is the build phase, in which we compute the Euclidean distance between every pair of individuals. Done naively this is rather slow, and we have to do something about it. We then exponentiate every entry of the resulting matrix, and that is how we obtain the kernel (covariance) matrix. In terms of precision, we use INT8 when dealing with the integer entries of the matrix, FP16 or FP8 depending on the underlying hardware architecture, and FP64 and FP32 for the non-integer entries. The second phase, the associate phase, first applies adaptive mixed precision: in an entirely tile-centric way we decide which precision to use for each tile, and then we run the Cholesky-based solver against the list of phenotypes. We eventually solve the system and obtain the weights, the coefficients of the mapping from the kernel matrix to the phenotypes. The third phase is the predict phase, where we perform inference: we bring in a new cohort as the testing dataset, rely on the build-phase machinery to compute the interactions between this new dataset and the training dataset used initially, and determine the likelihood of each patient developing the final phenotypes. This is how we ride the AI wave with low-precision arithmetic in the context of association studies.
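The tile-by-tile precision decision in the associate phase can be pictured with a small sketch. Assuming the norm-based criterion described above (a tile whose norm is small relative to a scaled, normalized norm of the whole matrix may be demoted to lower precision), the NumPy function below builds a per-tile precision plan; the cut-off values and precision labels are purely illustrative, not the tuned production settings.

```python
import numpy as np

def tile_precision_plan(A, nb, cuts=(1e-2, 1e-4)):
    """Assign an illustrative compute precision to each nb-by-nb tile of A.

    A tile whose norm is small relative to a scaled, normalized norm of the
    whole matrix contributes little to the factorization error, so it can be
    handled in lower precision; the cut-offs here are placeholders."""
    n = A.shape[0]
    T = (n + nb - 1) // nb
    ref = np.linalg.norm(A) / T              # scaled reference norm of the matrix
    fp16_cut, fp8_cut = cuts
    plan = np.empty((T, T), dtype=object)
    for i in range(T):
        for j in range(T):
            tile = A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
            ratio = np.linalg.norm(tile) / max(ref, np.finfo(np.float64).tiny)
            plan[i, j] = "fp8" if ratio < fp8_cut else ("fp16" if ratio < fp16_cut else "fp32")
    return plan

# Example: an RBF-like kernel decays away from the diagonal, so distant
# off-diagonal tiles tend to be demoted while diagonal tiles stay in fp32.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 50, size=512))
K = np.exp(-np.abs(x[:, None] - x[None, :]))
print(tile_precision_plan(K, nb=128))
```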
We first assess the accuracy of constructing the kernel matrix on a real dataset of more than 300K patients and 43K genotypes. With an A100 in the backend you can activate FP16, and it turns out our kernel matrix construction is resilient to that; with a GH200 you can even activate FP8. This application, by its very nature, can exploit low-precision arithmetic. We then look at the impact of low precision on the mean square prediction error (MSPE) for the asthma phenotype, comparing ridge regression with kernel ridge regression. This is the MSPE we obtain, and lower is better, when we do everything in single precision. When we introduce half precision, the MSPE for ridge regression does not really improve. However, with an adaptive procedure that decides on a tile-by-tile basis which precision to use, we maintain similar accuracy while gaining considerable performance. And when we instead apply adaptive, mixed-precision kernel ridge regression, we can genuinely lower the MSPE. That is a really interesting finding, and in fact that was our goal: to study the difference between kernel ridge regression and plain ridge regression.

We further augment our accuracy assessment by looking at the Pearson correlation between the ground-truth phenotypes of the UK Biobank test cohort and the predictions of the ridge regression and kernel ridge regression models at the same FP16 precision. The kernel ridge regression results correlate with the ground truth up to four times more strongly than ridge regression does, demonstrating its ability to model more complex phenotype relationships. We then evaluate the accuracy of ridge regression and kernel ridge regression on synthetic datasets generated with msprime, a tool for simulating genetic data. We generated four synthetic datasets and report the MSPE for adaptive ridge regression combining FP32 and FP16, for adaptive kernel ridge regression with FP32 and FP16, and finally for adaptive kernel ridge regression with FP32 and FP8. The results confirm the earlier finding: kernel ridge regression delivers superior predictive performance over ridge regression across all synthetic datasets. Kernel ridge regression in the FP32/FP16 configuration also shows comparable correlations with the ground truth, and with FP8 the Pearson correlation and prediction accuracy are only slightly sacrificed. This demonstrates that FP8 precision offers a favorable tradeoff between accuracy and computational efficiency, making it a promising approach for large-scale genetic studies.
Now let's focus on the build phase, the workhorse of kernel ridge regression. It consists of the Euclidean distance calculation, which accounts for most of the elapsed time. Done directly, it relies only on scalar operations and works on two data types: integer entries and 32-bit floating-point entries. Can we accelerate it on tensor cores? There has been some high-quality work on this front for general matrices in double precision; we leverage that existing work and apply it in mixed precision to symmetric matrices instead. Let's look at the tensor-core-accelerated distance calculation. We expand the square of the Euclidean distance so you can better see how this works. Consider three patients, A, B, and C, each with two traits. The squared terms appear by taking the square of each patient's trait values, and you do that for patients A, B, and C. We can then compute only half of all pairwise distances by exploiting the symmetry of the operator, which saves memory and, obviously, computation. We then express the operator in matrix form and generate it on the fly, and that is where the cross terms show up. It turns out we can map this tensor-core-accelerated distance calculation onto entirely tile-centric mixed precision through three building blocks, chief among them the symmetric rank-k update, or SYRK. We also run on diverse systems to demonstrate software portability: Summit with NVIDIA V100s, Leonardo with NVIDIA A100s, Alps at CSCS with NVIDIA GH200s, and Frontier with AMD MI250X GPUs. These scientific computations also run on Shaheen III. I dare say it is a wonderfully diverse, even chaotic, collection of systems, and we embrace that. We also had early access to a system from NVIDIA, an excellent test system on which we could run preliminary experiments.
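The algebraic trick just described can be shown in a few lines: expanding ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b turns the dominant cost of the pairwise-distance matrix into one symmetric rank-k update (the Gram matrix G G^T), which is the piece that maps onto INT8/FP16 tensor-core SYRK/GEMM on the GPU. The NumPy sketch below only illustrates the algebra, not the tensor-core implementation.

```python
import numpy as np

def pairwise_sq_dists(G):
    """All-pairs squared Euclidean distances between the rows of G via
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b; the G @ G.T Gram matrix is the
    SYRK-like term that the GPU version offloads to low-precision tensor cores."""
    sq_norms = np.einsum("ij,ij->i", G, G)               # per-patient squared norms
    gram = G @ G.T                                        # symmetric rank-k update
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    return np.maximum(D, 0.0)                             # clamp tiny negative round-off

# Hypothetical genotype block: rows are patients, columns are SNPs in {0, 1, 2}.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(100, 400)).astype(np.float32)
D = pairwise_sq_dists(G)
print(np.allclose(D[0, 1], np.sum((G[0] - G[1])**2)))     # True
```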
Let's look at the performance breakdown of the build phase on Alps. We run on 256 GPUs all the way up to 4,096 GH200s, with matrix sizes from 2.6 million to 10.4 million. We achieve up to a 12X speedup, with 75% parallel efficiency. Next, the associate phase on Summit, moving from 256 to 1,000 nodes of V100 GPUs: we can see the huge advantage of enabling FP16, a 5X to 6X speedup compared with running in double precision only. On Leonardo with A100s, going from 64-bit to 32-bit to half precision yields around a 3.5X speedup as we increase the matrix size and the node count to 1,000 A100 GPUs, which is roughly 50% higher performance than Summit's V100s. On Alps we also run up to 1,000 nodes of GH200s. By integrating the FP8 computation enabled by the newer tensor cores, we obtain a 5X, sometimes 5.6X, speedup compared with running FP32 only, a dramatic acceleration for computations that would otherwise have to run in FP32.

Here is the associate phase across the V100s of Summit, the A100s of Leonardo, and the GH200s of Alps. With FP8 enabled on Alps, the performance boost is tremendous: 50%, as I mentioned, between Summit and Leonardo, and a factor of more than three between Alps and Leonardo. We then examine the associate phase normalized per GPU to assess weak and strong scalability on the Leonardo supercomputer. As we increase the number of GPUs and the matrix size, weak scalability is almost perfect. For strong scalability we observe a performance drop, mostly because we can no longer saturate the GPUs: the low-precision arithmetic achieves such high throughput that we spend most of our time feeding the GPUs, dealing with the network, and moving data around. Looking at the associate phase on Alps, again normalized per GPU, weak scalability is again perfect, while for strong scalability we see the same pattern as on Leonardo: as we scale strongly we can no longer saturate the GPUs and become network-bound, limited by the bandwidth available to move data from one GPU or node to another. On Alps we achieve 3X the performance of Leonardo by engaging FP8.
Now let's look at the large-scale, kernel-ridge-regression-based multivariate association study as an overall workflow, comprising the build and associate phases, and at its performance breakdown. Running kernel ridge regression on 1,000 nodes of Alps nearly doubles the performance again: for a matrix size of 11 million we reach more than 1.8 exaflops for the build phase and almost 900 petaflops for the associate phase. All in all, do we get 1.4 to 1.5 exaflops? Exactly, and that is what we see when the two stages are combined into a single workflow. Now for our hero run, comparing Alps against Frontier, Leonardo, and Summit. We look at the associate phase, which we could run on all the systems we had access to, and at the maximum performance we could achieve with a patient matrix of size 13 million and 20 million SNPs. For the associate phase we reach more than 1 exaflop on Alps, higher than Frontier, Summit, and Leonardo across their various GPU counts. This is quite impressive and significant, particularly when you consider the number of GPUs involved on the Frontier supercomputer versus Alps, close to full system capacity, which already gives a good sense of the performance dimension the GH200 brings. And when you look at the overall kernel ridge regression workflow, we achieve more than 1.8 exaflops on Alps, five orders of magnitude faster than the state-of-the-art CPU-only REGENIE software.
As a summary, with implications for the future: we developed a capability-class software tool that performs multivariate association studies, using kernel ridge regression to detect complex phenotype relationships. We cast the distance calculations onto tensor cores to build the Gram matrix. We accelerate the computation with mixed precision and exploit the new FP8 format. We demonstrate the accuracy superiority of kernel ridge regression over ridge regression. We achieve 1.8 exaflops on Alps with 13 million patients. Moving forward, we want to leverage the spatial locality of 3D genomic contact maps to improve the detection of combined effects on complex traits, and to explore data sparsity using low-rank matrix approximation. This is our motto: do linear algebra and see the world. Thank you so much for listening, and we are happy to answer any questions. Thank you!