# [PAPER] BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores
###### tags: `Paper-reading`
## Abstract
**`Q: What is BaGuaLu?`**
**`Q: What was the background for coming up with BaGuaLu?`**
` Why do we need it?`
` Where do we use it?`
This paper proposes BaGuaLu, a framework for **training large-scale neural network models on an exascale supercomputer**.
By combining **hardware-specific intra-node optimization** and **hybrid parallel strategies**, BaGuaLu achieves good performance and scalability on unprecedentedly large models.
> `Q: What is hardware-specific intra-node optimization?`
`Q: What are hybrid parallel strategies?`
## 01 Introduction
**`Q: What challenges would we face, when training Large-scale pretrained model on a supercomputer?`**
Increasing model scale helps improve accuracy.
Mixture-of-Experts (MoE) approaches have shown broad success in NLP.
> `Q: What is Mixture-of-Experts?`
**But training large-scale pretrained models is challenging.
The following issues need attention:**
- Architecture challenges
- The system and the application implementation must be co-designed to fully utilize the compute resources and reach high performance.
- Huge memory capacity
- Different partitioning strategies lead to different communication patterns.
- How parameters, optimizer states, and gradients are partitioned strongly affects memory usage.
- The larger the model, the more memory usage limits performance (see the rough estimate after this list).
> `Q: What is a partitioning strategy?`
> `Q: What are communication patterns?`
- Parallel strategy
- Directly scaling existing parallel strategies to a model this large is inefficient.
- MoE brings a large amount of all-to-all communication.
- Scaling the model out to an entire supercomputer introduces load imbalance.
- Mixed-precision
- How to efficiently mix floating-point numbers of different precisions during training.
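To get a feel for why memory capacity becomes the bottleneck, here is a back-of-the-envelope estimate. The per-parameter byte counts (FP16 weights, FP32 master copies, two FP32 Adam moments) are generic mixed-precision-Adam assumptions for illustration, not numbers taken from the paper.

```python
# Rough memory footprint of a 14.5-trillion-parameter model
# (generic mixed-precision Adam accounting, not the paper's exact layout).
params = 14.5e12

fp16_weights = 2 * params   # FP16 weights:           ~29 TB
fp32_master  = 4 * params   # FP32 master weights:    ~58 TB
adam_moments = 8 * params   # two FP32 Adam moments: ~116 TB

total_tb = (fp16_weights + fp32_master + adam_moments) / 1e12
print(f"~{total_tb:.0f} TB of model/optimizer state before gradients or activations")
```

State at this scale runs into hundreds of terabytes, so how parameters and optimizer states are partitioned across nodes directly decides whether the model fits at all.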
**The main contributions of this work are:**
- Effective hardware-specific intra-node optimizations on the New Generation Sunway Supercomputer
- core scheduling
- memory segmentation
- memory access
- A hybrid parallel strategy, MoDa, combining:
- MoE parallelism
- Data parallelism
- A distributed optimizer, ParO
- Reduces computation time
- Reduces memory usage
- A new load balancing strategy, SWIPE (for MoE)
- Reduces wasted compute resources
- A layer-wise mixed-precision strategy
- Speeds up the training process without affecting convergence
- Demonstrates that BaGuaLu can train a model with 14.5 trillion parameters at 1 EFLOPS
## 02 Background and Related Work
## 03 System Architecture
**`Q: What is the architecture of the system?`**

- **Compute node**
- **1 heterogeneous CPU (SW26010-Pro)**
- **6 CGs (core groups)**: connected by a network on chip (NoC)
- **1 MPE (management processing element)**
- a fully functional 64-bit RISC processor core
- **1 array of 64 CPEs (computing processing elements)**
- **DRAM channels**
- **Memory segment**
- **cross segment**
- interleaved addressing across all six CGs
- supports concurrent memory access from all six CGs
- **shared segment**
- shared memory space for both MPE and CPEs within the same CG
- supports concurrent memory access from within one CG
- **private segment**
- private memory space for each CPE
## 04 Methodology
**`Q: What should we think about during mapping large model to a supercomputer?`**
` How do we design the hardware architecture?`
` How do we schedule hardware resources? Cores? Memory?`
` How do we map the software?`
**`Q: What kind of optimization did we do? What do we get after doing them?`**
**`Q: What kind of parallel strategy did we use? And anything special?`**
### hardware-specific intra-node optimization
**`Q: How do we use hardwares?`**
- Core scheduling
- **1 MPE with 1 process** to manage **all the CPEs** within one compute node
- **5 MPEs** can be used to handle communication, I/O, and other lightweight tasks
- **`[advantages]`**
- Communication between processes goes through the NICs,
which avoids extra communication overhead.
`Q: How?`
(A generic MPI sketch of this role split follows the list below.)
- **Memory segmentation**
- **cross segment**
- for computation and communication
- **shared segment**
- for OS libraries
- **private segments**
- for CPE libraries
- The rest
- managed by the OS
- **Memory access**
- **`SOMETHING SPECIAL`**
> Combine RMA with DMA.
> This reduces the number of DMA calls over the NoC.
- global load/store
- DMA
- RMA
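Below is a minimal sketch of the core-scheduling idea, written as a generic MPI role split and assuming one MPI rank per MPE (6 per node). The Sunway-specific parts (launching CPE kernels through the athread runtime) are omitted; `mpi4py` is used only to illustrate the role assignment, not as the paper's actual implementation.

```python
from mpi4py import MPI

# One rank per MPE: each SW26010-Pro node has 6 CGs, hence 6 MPEs.
world = MPI.COMM_WORLD

# Group the ranks that share a node (shared-memory split).
node_comm = world.Split_type(MPI.COMM_TYPE_SHARED, key=world.Get_rank())

if node_comm.Get_rank() == 0:
    # The single "compute" process: it would drive all CPEs of the node
    # (on Sunway this goes through the athread runtime, not shown here).
    role = "compute"
else:
    # The other 5 MPE processes only handle communication, I/O, and other
    # lightweight tasks, overlapping them with computation.
    role = "helper"

print(f"rank {world.Get_rank()} acts as a {role} process on its node")
```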
### efficient hybrid parallel strategy
- Hybrid MoE parallelism and data parallelism strategy (MoDa); a rough process-group sketch follows this list
- SunWay Imbalance Proficiently Eliminated (SWIPE)
- Parallel partition-based optimizer (ParO)
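A minimal sketch of how a MoDa-style hybrid could be wired up with `torch.distributed`, assuming ranks form a (data-parallel × expert-parallel) grid: within a row, each rank holds different experts and exchanges tokens via all-to-all; within a column, ranks replicate the same experts and all-reduce gradients. The exact group layout in BaGuaLu, as well as SWIPE and ParO, is not reproduced here; this only illustrates the two orthogonal communication groups.

```python
import torch
import torch.distributed as dist

def build_moda_groups(expert_parallel_size: int):
    """Return (expert_parallel_group, data_parallel_group) for this rank."""
    world, rank = dist.get_world_size(), dist.get_rank()
    assert world % expert_parallel_size == 0
    data_parallel_size = world // expert_parallel_size

    ep_group = dp_group = None
    # Rows: ranks holding *different* experts -> all-to-all token exchange.
    for row in range(data_parallel_size):
        ranks = list(range(row * expert_parallel_size,
                           (row + 1) * expert_parallel_size))
        g = dist.new_group(ranks)          # every rank must create every group
        if rank in ranks:
            ep_group = g
    # Columns: ranks holding *replicas* of the same experts -> gradient all-reduce.
    for col in range(expert_parallel_size):
        ranks = list(range(col, world, expert_parallel_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            dp_group = g
    return ep_group, dp_group

# Inside a training step (shapes omitted, purely illustrative):
#   dist.all_to_all_single(recv_tokens, send_tokens, group=ep_group)  # MoE dispatch
#   dist.all_reduce(param.grad, group=dp_group)                       # data parallelism
```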
### efficient I/O implementation
- Data loader
- Checkpoints
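A minimal sketch of sharded (per-rank) checkpointing, assuming each process stores only its own experts and its own optimizer partition; every rank writes its shard in parallel instead of funnelling the whole model through a single writer. File names and layout are illustrative, not the paper's checkpoint format.

```python
import os
import torch
import torch.distributed as dist

def save_sharded_checkpoint(model, optimizer, step: int, ckpt_dir: str):
    """Each rank writes only its local shard; all shards together form the checkpoint."""
    rank = dist.get_rank()
    os.makedirs(ckpt_dir, exist_ok=True)
    shard = {
        "step": step,
        "model": model.state_dict(),          # local (e.g. expert) parameters only
        "optimizer": optimizer.state_dict(),  # local optimizer partition only
    }
    torch.save(shard, os.path.join(ckpt_dir, f"shard_rank{rank:05d}.pt"))
    dist.barrier()   # declare success only once every shard is on disk

def load_sharded_checkpoint(model, optimizer, ckpt_dir: str) -> int:
    rank = dist.get_rank()
    path = os.path.join(ckpt_dir, f"shard_rank{rank:05d}.pt")
    shard = torch.load(path, map_location="cpu")
    model.load_state_dict(shard["model"])
    optimizer.load_state_dict(shard["optimizer"])
    return shard["step"]
```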
### mixed-precision training
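A minimal sketch of a layer-wise mixed-precision assignment, assuming compute-heavy linear layers run in FP16 while precision-sensitive layers (here LayerNorm and Embedding, chosen purely for illustration) stay in FP32. The concrete layer classification and loss-scaling details used by BaGuaLu are not reproduced.

```python
import torch
import torch.nn as nn

KEEP_FP32 = (nn.LayerNorm, nn.Embedding)   # assumed precision-sensitive layers

class HalfWrapper(nn.Module):
    """Run a single submodule in FP16 while the rest of the graph stays in FP32."""
    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module.half()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.module(x.half()).float()

def apply_layerwise_precision(model: nn.Module) -> nn.Module:
    """Cast large matmul layers to FP16, keep sensitive layers in FP32."""
    for name, child in model.named_children():
        if isinstance(child, KEEP_FP32):
            continue                                    # stays in FP32
        if isinstance(child, nn.Linear):
            setattr(model, name, HalfWrapper(child))    # FP16 compute
        else:
            apply_layerwise_precision(child)            # recurse into containers
    return model

# Example: an FFN block whose two Linear layers run in FP16, LayerNorm in FP32.
# (FP16 matmul on CPU needs a reasonably recent PyTorch; on GPU it always works.)
block = apply_layerwise_precision(
    nn.Sequential(nn.LayerNorm(512), nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
)
out = block(torch.randn(4, 512))           # output comes back as FP32
```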