# [PAPER] BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores

###### tags: `Paper-reading`

## Abstract

**`Q: What is BaGuaLu?`**

**`Q: What was the background for coming up with BaGuaLu?`**
` Why do we need it?`
` Where do we use it?`

This paper presents BaGuaLu, a method for **training very large neural network models on an exascale supercomputer**.
By combining **hardware-specific intra-node optimization** with **hybrid parallel strategies**, BaGuaLu achieves good performance and scalability on unprecedentedly large models.

> `Q: What is hardware-specific intra-node optimization?`
> `Q: What are hybrid parallel strategies?`

## 01 Introduction

**`Q: What challenges would we face when training a large-scale pretrained model on a supercomputer?`**

Increasing model scale helps improve accuracy.
Mixture-of-Experts (MoE) approaches have shown broad success in NLP.

> `Q: What is Mixture-of-Experts?`

**However, training large-scale pretrained models is challenging. The following issues need attention:**

- Architecture challenges
    - The system and the application implementation must be co-designed to fully utilize the compute resources and reach high performance.
- Huge memory capacity
    - Different partitioning strategies lead to different communication patterns.
    - How parameters, optimizer states, and gradients are partitioned strongly affects memory usage.
    - The larger the model, the more memory usage limits performance.
    > `Q: What is a partitioning strategy?`
    > `Q: What are communication patterns?`
- Parallel strategy
    - Directly scaling existing parallel strategies to a model of this size is inefficient.
    - MoE introduces a large amount of all-to-all communication.
    - When the model is scaled up to an entire supercomputer, load imbalance appears.
- Mixed precision
    - How to efficiently mix floating-point numbers of different precisions during training.

**The main contributions of this work are:**

- Effective hardware-specific intra-node optimizations on the New Generation Sunway Supercomputer
    - core scheduling
    - memory segmentation
    - memory access
- A hybrid parallel strategy, MoDa, combining:
    - MoE parallelism
    - data parallelism
- A distributed optimizer, ParO
    - reduces computation time
    - reduces memory usage
- A new load balancing strategy, SWIPE (for MoE)
    - reduces wasted compute resources
- A layer-wise mixed-precision strategy
    - speeds up training without affecting convergence
- A demonstration that BaGuaLu can train a model with 14.5 trillion parameters using 1 EFLOPS of compute

## 02 Background and Related Work

## 03 System Architecture

**`Q: What is the architecture of the system?`**

![](https://i.imgur.com/lMomtMl.png)

- **Compute node**
    - **1 heterogeneous CPU (SW26010-Pro)**
        - **6 CGs (core groups)**: connected by a network on chip (NoC)
            - **1 MPE (management processing element)**
                - a fully functional 64-bit RISC processor core
            - **1 array of 64 CPEs (computing processing elements)**
            - **DRAM channels**
- **Memory segments**
    - **cross segment**
        - interleaved addressing across all six CGs
        - supports synchronous memory access from all six CGs
    - **shared segment**
        - shared memory space for the MPE and CPEs within the same CG
        - supports synchronous memory access from one CG
    - **private segment**
        - private memory space for each CPE

## 04 Methodology

**`Q: What should we think about when mapping a large model onto a supercomputer?`**
` How do we design the hardware architecture?`
` How do we schedule hardware resources? Cores? Memory?`
` How do we map the software onto it?`

**`Q: What kinds of optimizations are done? What do we gain from them?`**

**`Q: What kind of parallel strategy is used? Anything special about it?`**

### hardware-specific intra-node optimization

**`Q: How do we use the hardware?`**

- Core scheduling (a rough sketch follows this block)
    - ![](https://i.imgur.com/5Y0TVVt.png)
    - **1 MPE with 1 process** to manage **all the CPEs** within one compute node
    - the other **5 MPEs** can be used to handle communication, I/O, and other lightweight tasks
    - **`[advantages]`**
        - Communication between processes is always performed across the NICs, which avoids extra communication overhead.
    > `Q: How to?`
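To make the core-scheduling idea concrete, here is a minimal, hypothetical sketch using `mpi4py` and Python threads: one process per node does the heavy compute (standing in for the MPE that drives the CPE clusters), while a small pool of helper threads (standing in for the other five MPEs) takes communication and I/O chores off the main path. This is only a conceptual stand-in, not BaGuaLu's implementation; `train_step`, `helper_loop`, and the task queue are made-up names for illustration.

```python
# Hypothetical sketch (not BaGuaLu's code): one process per compute node,
# with helper threads playing the role of the 5 "service" MPEs.
from mpi4py import MPI
import queue
import threading

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # one rank per node in this layout

lightweight_tasks = queue.Queue()   # communication / I/O work for the helpers

def helper_loop():
    """Stand-in for a service MPE: drain lightweight tasks off the main path."""
    while True:
        task = lightweight_tasks.get()
        if task is None:            # shutdown signal
            break
        task()                      # e.g. write a checkpoint shard, log metrics
        lightweight_tasks.task_done()

helpers = [threading.Thread(target=helper_loop, daemon=True) for _ in range(5)]
for t in helpers:
    t.start()

def train_step(step):
    """Stand-in for the compute driven through the CPE clusters."""
    local_loss = float(rank + step)
    # With one process per node, inter-process communication always
    # crosses the NIC (inter-node); it never stays inside a node.
    return comm.allreduce(local_loss, op=MPI.SUM)

for step in range(3):
    loss = train_step(step)
    # offload the lightweight follow-up work to the helper threads
    lightweight_tasks.put(lambda s=step, l=loss: print(f"rank {rank} step {s} loss {l}"))

lightweight_tasks.join()
for _ in helpers:
    lightweight_tasks.put(None)
for t in helpers:
    t.join()
```

Launched with something like `mpiexec -n 4 python sketch.py`, each rank would correspond to one compute node in this mapping; the real system of course schedules hardware MPEs/CPEs, not Python threads.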
- **Memory segmentation**
    - **cross segment**
        - for computation and communication
    - **shared segment**
        - for OS libraries
    - **private segments**
        - for CPE libraries
    - the rest: managed by the OS
- **Memory access**
    - **`SOMETHING SPECIAL`**
    > RMA + DMA
    > Combining RMA with DMA reduces the number of DMA calls over the NoC.
    > ![](https://i.imgur.com/NScD2HA.png)
    - global load/store
    - DMA
    - RMA

### efficient hybrid parallel strategy

- Hybrid MoE parallelism and data parallelism strategy (MoDa)
- SunWay Imbalance Proficiently Eliminated (SWIPE)
- Parallel partition-based optimizer (ParO)

### efficient I/O implementation

- Data loader
- Checkpoints

### mixed-precision training
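The note only names the layer-wise mixed-precision strategy (it speeds up training without affecting convergence). As a rough illustration of the general idea, and not BaGuaLu's exact recipe, the hypothetical PyTorch sketch below runs the compute-heavy feed-forward layers in low precision while keeping the precision-sensitive normalization and residual path in float32. `MixedPrecisionBlock` and its sizes are invented for this example.

```python
# Hypothetical illustration of layer-wise mixed precision (not BaGuaLu's recipe).
import torch
import torch.nn as nn

class MixedPrecisionBlock(nn.Module):
    """Toy block: heavy matmuls in low precision, sensitive layers in float32."""

    def __init__(self, d_model=256, d_ff=1024, low_dtype=torch.bfloat16):
        super().__init__()
        self.low_dtype = low_dtype
        # compute-heavy feed-forward layers stored and run in low precision
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        ).to(low_dtype)
        # precision-sensitive normalization kept in float32
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                           # x: float32 [batch, d_model]
        h = self.ffn(x.to(self.low_dtype)).float()  # low-precision compute
        return self.norm(x + h)                     # float32 residual + norm

if __name__ == "__main__":
    block = MixedPrecisionBlock()
    out = block(torch.randn(8, 256))
    print(out.dtype, out.shape)  # torch.float32 torch.Size([8, 256])
```

On accelerators one would typically pick `torch.float16` for the low-precision layers; `bfloat16` is used here only so the sketch also runs on a plain CPU.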