# A Survey on Accelerated Technologies for Mixture-of-Experts Model Training Systems

## Link

https://www.sciopen.com/article/10.26599/TST.2025.9010169

## Abstract

> Mixture-of-Experts (MoE) models have emerged as a transformative paradigm for scaling Large Language Models (LLMs), enabling unprecedented model capacity while maintaining computational efficiency through sparse activation mechanisms. However, the unique architectural characteristics of MoE models introduce significant system-level challenges that fundamentally differ from those of traditional dense models. These challenges necessitate specialized system optimizations tailored to MoE's distinctive properties. This survey systematically analyzes accelerated technologies for MoE training systems, discussing recent advances across four critical optimization dimensions: hybrid parallel computing, comprehensive memory management, fine-grained communication scheduling, and adaptive load balancing. Our analysis reveals a paradigm shift from computation-centric to workload-centric optimization strategies. Moreover, we identify emerging research directions including machine learning-guided load balancing, cross-layer optimization frameworks, and hardware-software co-design for MoE training workloads. This work aims to provide researchers and system engineers with a comprehensive technical reference to support the design of more efficient and scalable next-generation MoE training systems.

## Intro

> The exponential growth in Large Language Model (LLM) capabilities has been largely driven by scaling model parameters, following the scaling laws that demonstrate consistent improvements with increased model size[1]. However, this scaling approach faces fundamental limitations due to the linear relationship between model capacity and computational complexity, making training increasingly expensive and resource-intensive. Mixture-of-Experts (MoE) models[2-5] have emerged as a transformative paradigm that decouples this linear relationship, enabling unprecedented model scaling while maintaining computational efficiency through conditional computation mechanisms.

> The past few years have witnessed remarkable progress in MoE model development, evolving from research prototypes such as GShard[6] and Switch Transformer[4] to production-scale deployments including Mixtral 8x7B[7], DeepSeek-V3[8], Snowflake Arctic[9], and PanGu-Ultra MoE[10]. The fundamental innovation of MoE architectures lies in two key mechanisms that dramatically reduce training costs while maintaining model expressiveness. First, sparse activation ensures that only a subset of expert networks is dynamically selected to participate in computation for each input token, significantly reducing the actual computational load per training iteration. Second, expert partitioning distributes different experts across multiple computing devices, effectively overcoming single-device memory limitations and enabling models with trillions of parameters to be trained on distributed systems.

> Despite these advantages, MoE architectures introduce unique system-level challenges that fundamentally differ from those of traditional dense models.
> These challenges make existing optimization techniques for dense LLM training systems inadequate. First, cross-device communication arising from expert partitioning becomes a critical training bottleneck. In particular, the All-to-All communication operations required in expert parallelism implementations can consume 35%−50% of total training time[11, 12]. Second, load imbalance severely impacts computational resource efficiency, as certain experts may become heavily loaded while others remain underutilized, creating system-wide performance bottlenecks[4, 13]. Third, the sparse characteristics of MoE models make memory management substantially more complex than for dense counterparts, requiring sophisticated strategies for expert parameter allocation, activation checkpointing, and dynamic memory scheduling[14, 15]. These challenges are interdependent and exhibit non-linear growth with increasing model scale and expert count, rendering simple hardware scaling solutions inadequate. Consequently, system-level optimization specifically tailored to MoE architectural characteristics has become critical for realizing the full potential of these models in large-scale training scenarios.

> Given the significant interest from both academia and industry in MoE training system optimization technologies, this survey provides a systematic investigation of recent advances in this rapidly evolving field. While several surveys have addressed related topics in LLM training system optimization[16-18] and MoE model architectures and system optimization[19-22], our work distinguishes itself by providing the first comprehensive analysis specifically focused on system-level optimization techniques for MoE training scenarios. Through systematic discussion of state-of-the-art works, we reveal key insights into the evolution from computation-centric to workload-centric optimization paradigms while identifying critical trade-offs between different strategies. This foundation enables us to outline promising future research directions including machine learning-guided optimization, cross-layer co-design, and hardware-software integration for MoE workloads.

> Specifically, the primary contributions of this survey are threefold:

> • We provide a systematic taxonomy and analysis of MoE training system optimization techniques across multiple dimensions;

> • We identify key technical insights and design principles that guide effective MoE system optimization;

> • We outline future research directions that will shape the next generation of scalable MoE training infrastructure.

> The structure of this survey is organized as follows. We begin by providing an overview of MoE architectures in Section 2. We then discuss performance evaluation references and system-level performance optimization challenges in Section 3. Subsequently, we examine four fundamental optimization dimensions for addressing these challenges: Section 4 investigates multi-dimensional parallelism strategies, Section 5 analyzes memory optimization technologies, Section 6 studies communication optimization approaches, and Section 7 discusses load balancing optimization methods. Section 8 synthesizes design principles and best practices derived from these optimization techniques, providing practical guidance for building efficient MoE training systems. Finally, Section 9 concludes this survey.
## 2 MoE Architecture

> The architectural evolution from dense to sparse models represents a fundamental paradigm shift in large language model design. The layer architecture of conventional dense Transformer LLMs is depicted in Fig. 1a. In dense models, each layer follows a fixed sequential processing pipeline. Input representations are first processed through the attention mechanism to capture contextual dependencies. This is followed by residual connections and layer normalization (add & norm) operations that facilitate information integration and training stabilization. Subsequently, the representations undergo nonlinear transformation via the Feed-Forward Network (FFN), which typically consists of two linear projections with an activation function. Finally, another round of residual connections and layer normalization completes the layer's computational cycle, ensuring gradient flow and representational stability throughout the network depth.

![TST-2025-0329-1](https://hackmd.io/_uploads/B1prUwm6bx.jpg)

> In contrast to the conventional dense LLM architecture, sparsely activated MoE models represent a fundamentally different computational paradigm that achieves model scalability while maintaining computational efficiency. In the MoE architecture, the single FFN in each Transformer layer is replaced by multiple parallel FFNs, referred to as "experts", whose combined output can be expressed as

$$F(x) = \sum_{i\in TopK(G(x))} G_i(x)E_i(x)$$

> where $E_i(x)$ denotes the output of expert $i$, $G_i(x)$ is the corresponding gating weight produced by the gating function $G(x)$, and $TopK(\cdot)$ selects the highest-scoring experts. The evolution of MoE architectures can be categorized into three types (as illustrated in Figs. 1b-1d), each addressing different trade-offs between computational efficiency, model capacity, and specialization granularity.
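To make the formula concrete, here is a minimal sketch of a top-k gated MoE layer in PyTorch. It assumes a softmax gate over expert logits; the class name, sizes, and the dense per-expert loop are purely illustrative (real systems dispatch tokens to experts via expert parallelism rather than iterating per slot).

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)   # gating function G(x)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                             # x: [tokens, d_model]
        scores = torch.softmax(self.gate(x), dim=-1)  # G_i(x) for all experts
        weights, idx = scores.topk(self.k, dim=-1)    # TopK(G(x))
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # weighted expert combine
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

For example, `TopKMoE()(torch.randn(16, 512))` routes each of 16 tokens through 2 of the 8 experts, so only a quarter of the expert parameters are activated per token.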
### 2.1 Coarse-grained MoE architecture

> As illustrated in Fig. 1b, coarse-grained MoE architecture represents a paradigmatic shift from dense to sparse computation. The key innovation lies in decomposing the monolithic FFN layer of traditional Transformer LLMs into multiple parallel expert networks. This design transcends the constraint of "full parameter activation" and transitions toward a computational mode of "selective activation". Coarse-grained MoE models typically employ 8 to 16 experts per layer, with each expert containing millions to billions of parameters. This architecture strikes a balance between computational efficiency and implementation complexity, making it suitable for production deployments. Representative works include Mixtral 8×7B[7] and DBRX[23]. These models have demonstrated remarkable performance across various benchmarks while significantly reducing computational costs compared to equivalent dense LLMs.

### 2.2 Fine-grained MoE architecture

> As depicted in Fig. 1c, fine-grained MoE architecture employs smaller, more numerous experts, enabling more refined knowledge partitioning and enhanced specialization compared to coarse-grained designs. This approach typically employs 64 to 256 experts per layer, with each expert being substantially smaller than those in coarse-grained designs. The increased granularity allows for more precise routing decisions and enhanced model expressiveness through specialized expert functions. Fine-grained MoE architectures have been shown to exhibit accelerated convergence rates compared to their coarse-grained counterparts, attributed to the reduced interference between different types of knowledge and more efficient gradient updates[24]. Pioneering implementations include Switch Transformer[4], GShard[6], and GLaM.

### 2.3 Fine-grained with shared-experts MoE architecture

> As presented in Fig. 1d, the fine-grained with shared-experts MoE architecture combines the benefits of both always-activated shared experts and dynamically routed ones, creating a more robust and stable training paradigm.

> The architecture employs two distinct types of expert networks: shared experts (highlighted in orange), which are always activated and responsible for capturing cross-domain universal patterns and fundamental linguistic representations; and routed experts (depicted in blue), which are selectively activated to focus on domain-specific and specialized knowledge acquisition.

> This design ensures that essential common knowledge is consistently accessible while still enabling fine-grained processing through dynamic routing algorithms.
> Formally, the output $F(x)$ of the shared-expert MoE layer can be expressed as

$$F(x) = F_{share}(x)+ \sum_{i\in TopK(G(x))} G_i(x)E_i(x)$$

> where the first term $F_{share}(x)$ represents the contribution from the always-activated shared experts, ensuring consistent access to universal knowledge across all inputs, while the second term captures the specialized processing from dynamically selected experts. This architecture has been successfully implemented in several state-of-the-art models, including DeepSeek-V3[8], Hunyuan-Large[25], Qwen2-MoE[26], and PanGu-Ultra MoE[10]. These models have demonstrated superior performance in both general language understanding and specialized domain tasks, validating the effectiveness of the hybrid design.

![TST-2025-0329-2](https://hackmd.io/_uploads/By58FDQ6-g.jpg)

> Fig. 2: System challenges in the distributed MoE training context, illustrating three key bottlenecks (huge memory demand, All-to-All communication overhead, and load imbalance across experts) within a typical distributed training setup.

## 3 Performance Evaluation and Challenges

> While MoE architectures offer compelling advantages in scaling LLMs, they introduce a unique set of system-level challenges that fundamentally differ from traditional dense model training scenarios. These challenges arise from the inherent characteristics of sparse activation, dynamic routing, and expert specialization, creating complex interdependencies that cannot be addressed through simple hardware scaling or conventional optimization techniques. Figure 2 illustrates the key system bottlenecks in distributed MoE training systems. To facilitate reader comprehension, this section first establishes the evaluation metrics and baselines for assessing the performance of MoE training systems, and then systematically analyzes the key system challenges and their implications for system design.

### 3.1 Evaluation metrics and benchmark

> Rigorous evaluation of MoE training systems requires comprehensive methodologies. These must capture both standard distributed training metrics and the unique performance dimensions introduced by sparse expert architectures. However, the absence of standardized benchmarks complicates cross-system comparisons, as different works employ varying model configurations, hardware platforms, and measurement methods. This section synthesizes the key metrics and benchmarks (reference implementations) used in MoE system evaluation.

> Performance metrics: Similar to dense LLM training systems, MoE counterparts are primarily evaluated on training throughput (indicated by tokens or samples per second) and computational performance, quantified as floating-point operations per second (FLOPS), which directly determine training time and cost.
> However, the sparse activation and dynamic routing characteristics of MoE introduce additional evaluation dimensions that are critical to understanding system efficiency: a system achieving high raw throughput may still under-perform if communication overhead consumes a significant part of iteration time, or if load imbalance leaves Graphics Processing Units (GPUs) underutilized. Therefore, hardware efficiency metrics are essential for quantifying accelerator utilization at the achieved throughput. Model FLOPS Utilization (MFU) measures the ratio of achieved FLOPS to the hardware's theoretical peak, while Hardware FLOPS Utilization (HFU) represents hardware efficiency when activation recomputation is applied (a worked MFU example follows at the end of this subsection). Together, these two indices reveal whether bottlenecks stem from underutilization of the accelerator hardware.

> Beyond computational efficiency metrics, two bandwidth metrics critically impact MoE training efficiency. Network bandwidth utilization measures achieved communication throughput as a percentage of the theoretical peak, while memory bandwidth utilization typically quantifies how effectively the system saturates the High-Bandwidth-Memory (HBM) bandwidth of accelerators. Together, these metrics reveal system bottlenecks and guide optimization priorities: low network utilization indicates the need for communication optimizations, whereas low memory bandwidth utilization suggests opportunities for memory access pattern improvements.

> Benchmarks and reference implementations: The MoE research community currently lacks fully standardized benchmarks comparable to those for dense models. For instance, MLPerf-Training[27] provides industry-standard evaluation protocols, but its primary LLM benchmarks target dense architectures. In the absence of standardized benchmarks, several systems have emerged as de facto evaluation baselines. Megatron-LM[28] establishes references for production-scale hybrid parallelism with MoE extensions. DeepSpeed-MoE[15] provides a baseline system that achieves memory optimization. FasterMoE[11] and Tutel[12] serve as primary references for communication optimization techniques in MoE training systems. Comparative evaluations typically measure performance improvements relative to these baselines under controlled conditions (identical model configurations, expert counts, hardware platforms, and datasets) to isolate the impact of specific optimization techniques. Despite the utility of these established baselines, comprehensive standardized evaluation protocols remain necessary to enable fair assessment of system performance across the diverse MoE training landscape.
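For concreteness, below is a minimal MFU calculation. It assumes the common approximation of roughly 6 FLOPs per activated parameter per token for the combined forward and backward passes of a decoder-only Transformer (attention FLOPs omitted); all numbers are hypothetical.

```python
def model_flops_utilization(tokens_per_sec, active_params, peak_flops):
    # Achieved training FLOPS under the ~6 FLOPs per active parameter
    # per token rule of thumb (forward + backward).
    achieved = 6 * active_params * tokens_per_sec
    return achieved / peak_flops

# Hypothetical MoE run: 6.5e9 *activated* parameters per token,
# 5e5 tokens/s aggregate, 64 accelerators at 989 TFLOPS peak (BF16).
mfu = model_flops_utilization(5e5, 6.5e9, 64 * 989e12)
print(f"MFU = {mfu:.1%}")   # ~30.8%
```

Note that MFU credits only the activated parameters of a sparse model, while HFU would additionally count recomputed FLOPs in the numerator when activation recomputation is enabled.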
### 3.2 System performance challenges

#### 3.2.1 Huge memory demand

> MoE models exhibit substantially larger memory requirements than dense models of equivalent computational complexity. While a dense Transformer model with a similar FLOP count might require tens of gigabytes of memory, MoE models with hundreds of experts can demand hundreds of gigabytes to several terabytes of memory for parameter storage alone. For instance, a 1-trillion-parameter MoE model trained under Brain Floating point 16-bit (BF16) mixed precision requires approximately 16 terabytes (TB) of memory to accommodate the model parameters, gradients, activations, and optimizer states. This massive scale pushes beyond the memory capacity of even the most advanced GPU clusters, necessitating sophisticated memory management strategies that span multiple memory hierarchies.

> Beyond sheer capacity requirements, the MoE training workload introduces fundamental management complexity problems. First, memory fragmentation occurs due to the sparse distribution of expert parameters across different devices and memory segments, leading to inefficient memory utilization and increased allocation overhead. Second, additional routing metadata and gating network states must be maintained for each layer, including expert selection histories, load balancing statistics, and routing probabilities. Third, dynamic activation patterns complicate memory locality optimization, as the set of active experts varies unpredictably across different inputs and training iterations.

> The technical complexity stems from the inherent dynamism in MoE training systems. Since expert token allocation varies continuously during training, traditional static memory allocation strategies prove inadequate. On the other hand, dynamic memory allocation schemes can improve memory utilization by 15%−20% compared to static approaches, but they incur additional runtime overhead due to frequent allocation and deallocation operations[28, 29]. Furthermore, memory hierarchy considerations become more complex due to the need to efficiently manage expert parameters distributed across different memory tiers depending on activation patterns and memory constraints[14].
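The 16 TB figure is consistent with a standard per-parameter accounting for BF16 mixed-precision training with an Adam-style optimizer (activations and communication buffers come on top and depend on batch size):

$$\underbrace{2}_{\text{BF16 params}} + \underbrace{2}_{\text{BF16 grads}} + \underbrace{4}_{\text{FP32 master copy}} + \underbrace{4+4}_{\text{Adam } m,\,v} = 16\ \text{bytes/param},\qquad 10^{12}\ \text{params}\times 16\ \text{B} = 16\ \text{TB}.$$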
#### 3.2.2 Communication bottleneck

> Expert parallelism across multiple accelerators introduces severe communication bottlenecks centered around All-to-All communication patterns, which constitute the most critical system constraint.

> Unlike the predictable communication patterns in dense model training, MoE systems exhibit dynamic, data-dependent communication requirements that create significant optimization challenges. Quantitative analysis reveals that in distributed MoE training systems, All-to-All communication operations consume 35%−50% of per-iteration time in typical multi-node configurations[7, 11].

> Moreover, All-to-All communication exhibits extremely high latency variability due to the unpredictable nature of token routing decisions, severely compromising training system predictability and convergence stability.

> The fundamental cause of communication bottlenecks lies in MoE's unique data-dependent communication patterns. Since token routing decisions are determined dynamically at runtime, the resulting communication exhibits several characteristics:

> * First, communication patterns are difficult to predict and optimize in advance, preventing effective pre-computation and scheduling.
> * Second, the dynamic nature of routing leads to highly imbalanced data exchange between devices, creating hotspots and underutilized links.
> * Third, the need for global synchronization across all participating devices increases synchronization barriers and reduces pipeline efficiency.

> Moreover, the severity of communication bottlenecks is highly dependent on hardware topology and interconnect characteristics. Within nodes connected by high-bandwidth, low-latency NVLink connections, All-to-All communication efficiency can reach 70%−80% of theoretical peak bandwidth. However, for cross-node communication over InfiniBand or Ethernet connections, this efficiency drops dramatically to 20%−35%[12, 30]. This efficiency disparity necessitates topology-aware optimization strategies that adapt communication patterns to specific network architectures and bandwidth hierarchies.
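To illustrate why this communication is data-dependent, the sketch below shows the dispatch half of an expert-parallel exchange using `torch.distributed.all_to_all_single` with uneven splits: the split sizes themselves must be exchanged first because they are only known after routing. It assumes an initialized NCCL process group, CUDA tensors, and one expert-parallel rank per expert group; the function name and shapes are illustrative.

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens, dest_rank, world_size):
    """tokens: [n, d]; dest_rank: [n], expert-parallel rank per token."""
    order = torch.argsort(dest_rank)                  # group tokens by destination
    send = tokens[order]
    in_splits = torch.bincount(dest_rank, minlength=world_size)
    out_splits = torch.empty_like(in_splits)
    dist.all_to_all_single(out_splits, in_splits)     # exchange counts first
    recv = send.new_empty(int(out_splits.sum()), tokens.shape[1])
    dist.all_to_all_single(recv, send,
                           output_split_sizes=out_splits.tolist(),
                           input_split_sizes=in_splits.tolist())
    return recv, order, out_splits                    # the combine step reverses this
```

Because `in_splits` depends on the router's decisions for the current batch, both the message sizes and the resulting link utilization change every iteration, which is precisely what makes static communication scheduling ineffective here.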
#### 3.2.3 Load imbalance

> Load imbalance represents the most challenging system performance issue in the MoE training context, stemming directly from the fundamental architectural design principles of expert specialization and dynamic token routing. This challenge is inherently difficult to resolve because perfect load balancing fundamentally conflicts with the core premise of MoE architectures: enabling experts to specialize on different types of input patterns and linguistic phenomena.

> Load imbalance manifests in two critical system-level problems that significantly impact training efficiency: First, resource underutilization occurs when certain experts or accelerators remain in idle or low-utilization states for extended periods, leading to poor hardware efficiency and wasted computational resources. Second, performance bottlenecks emerge when heavily loaded experts create system-wide throughput limitations, as the overall training speed is constrained by the slowest processing unit.

> The complexity of load balancing in MoE systems is further compounded by its dynamic nature. The degree of load imbalance evolves continuously throughout the training process as experts progressively specialize on different types of inputs, making their utilization patterns increasingly heterogeneous. This dynamic evolution renders static load balancing strategies inadequate and necessitates adaptive mechanisms capable of real-time adjustment based on current routing patterns, expert utilization statistics, and training phase characteristics[13, 31].
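Since this tension is typically mediated by an auxiliary training objective rather than resolved outright, the sketch below shows the load-balancing loss popularized by Switch Transformer[4], which pushes both the dispatch fraction and the mean router probability of each expert toward the uniform 1/E; tensor names are illustrative.

```python
import torch

def load_balancing_loss(router_probs, expert_index, n_experts):
    """router_probs: [tokens, E] softmax outputs; expert_index: [tokens] top-1 choice."""
    # f_i: fraction of tokens actually dispatched to expert i
    f = torch.bincount(expert_index, minlength=n_experts).float() / expert_index.numel()
    # P_i: mean router probability mass assigned to expert i
    p = router_probs.mean(dim=0)
    # E * sum(f_i * P_i); equals 1.0 when both distributions are uniform
    return n_experts * torch.sum(f * p)
```

Scaled by a small coefficient and added to the task loss, this term discourages the hot-expert routing patterns that cause the throughput bottlenecks described above.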
## 4 Parallelism Optimization

> The exponential scale expansion of model size with MoE architectures renders traditional single-accelerator or single-node training systems fundamentally insufficient to accommodate the LLM. Large-scale distributed MoE training necessitates efficient parallelism strategies encompassing multiple dimensions, including data parallelism for batch distribution across devices, tensor parallelism for intra-layer weight partitioning, and expert parallelism for expert distribution across accelerators[15, 28, 29]. Furthermore, to fully exploit the computational potential of multi-node, multi-accelerator systems, hybrid parallelism strategies have become essential in contemporary MoE training frameworks.

> This section examines multi-dimensional parallelism strategies for training MoE models at scale. We begin by introducing fundamental parallelism paradigms and their unique characteristics in MoE contexts. We then analyze three evolutionary stages of hybrid parallelism: homogeneous approaches that apply uniform strategies across all components; heterogeneous methods that exploit computational diversity between attention and expert layers; and dynamic techniques that adapt parallelism strategies at runtime based on workload characteristics. Through this analysis, we reveal the paradigm shift from treating parallelization as a static configuration problem to viewing it as a dynamic optimization challenge that requires workload-aware adaptation.

### 4.1 Fundamental parallelism

> Data Parallelism (DP): As the most fundamental and widely adopted parallelism strategy in deep learning training frameworks, DP uniformly distributes training data across multiple computing devices along the batch dimension[32]. Each DP group maintains a complete model replica, independently processes its assigned data samples, and synchronizes gradients through All-Reduce collective communication to update model parameters after completing forward and backward propagation for each batch of samples[33, 34]. By processing multiple batches in parallel across devices, DP effectively improves aggregate training throughput, though at the cost of increased memory consumption due to parameter replication. Because every device in a DP group stores the complete model parameters, DP suffers from memory redundancy, making it less suitable for accelerators with limited memory capacity.

> Tensor Parallelism (TP): By partitioning large matrix operations at the operator level, TP addresses the fundamental challenge of single-device memory limitations when accommodating large model layers[28, 29]. Typical TP implementations split tensors along either the row or column dimension, employing collective communications such as All-Reduce to aggregate computational results during both forward and backward propagation[35]. While TP effectively alleviates memory constraints for training large MoE models, it incurs significant communication overhead due to frequent synchronization requirements in each layer. Consequently, TP is typically constrained to intra-node deployment to avoid prohibitive inter-node communication costs.

> Pipeline Parallelism (PP): PP divides deep neural networks into multiple consecutive stages by layers, with different stages allocated to distinct devices for execution, thereby forming a computational pipeline[36-38]. Inter-stage data communication utilizes Point-to-Point (P2P) transfers, which typically incur lower overhead compared to collective communication primitives. This characteristic makes PP particularly valuable for resource-constrained platforms or scenarios with limited inter-device bandwidth. However, PP introduces pipeline bubbles during forward and backward training processes, leading to device underutilization and reduced computational efficiency, especially at pipeline boundaries[39].

> Sequence Parallelism (SP)/Context Parallelism (CP): Both approaches are specifically designed to address computational and memory inefficiencies when training large models on long-sequence data.
> SP partitions input samples along the sequence dimension, primarily reducing the memory consumption of long-sequence activations in non-attention layers such as LayerNorm, Dropout, and residual connections[29, 40]. In contrast, CP partitions key-value pairs in the attention mechanism along the sequence dimension, thereby optimizing the memory usage of attention matrices[40]. Furthermore, CP enables overlap of communication and computation operations, improving overall training efficiency[41].

> Note: in conventional TP, model weights are partitioned across GPUs, but layers such as LayerNorm and Dropout are typically computed in full on every GPU. SP removes this redundancy in three steps: (A) split the input tokens along the sequence dimension, so each GPU holds only its share of activations during LayerNorm; (B) before self-attention, where every token must attend to every other token, re-assemble the full sequence via All-Gather so each GPU again holds all tokens; (C) after the subsequent linear layer, re-partition the output via Reduce-Scatter so each GPU keeps only its shard, saving memory in the next layer.

> Expert Parallelism (EP): Since expert matrices in FFN layers constitute the primary component of model size and consume a substantial memory footprint, EP is specifically designed for MoE architectures to partition different subsets of experts in each layer across multiple accelerators[4, 13]. Due to the dynamic routing mechanism that maps training tokens to experts in each layer, EP is inherently coupled with All-to-All communication to facilitate token redistribution[12, 15]. Through this approach, each EP group maintains parameters for only a subset of experts, significantly reducing per-device memory overhead. In mainstream MoE training systems, EP is usually combined with DP by executing attention layers along the DP dimension while processing FFN layers along the EP dimension.

> Note: concretely, when a token on GPU 0 is routed to expert E7, it is dispatched to GPU 7 via All-to-All; each GPU then runs its local experts on the tokens it received; finally, a second All-to-All returns the results to the GPUs that originally sent the tokens.

> Expert Tensor Parallelism (ETP): ETP extends beyond traditional EP by further partitioning individual expert matrices, addressing scenarios where even single experts exceed device memory capacity[35, 42]. While conventional TP partitions attention layers and expert layers in a uniform manner, ETP differs from TP by providing tensor partitioning capability only for expert layers. This approach is particularly beneficial for managing "hot experts" that receive disproportionately high token assignments and require substantial memory to accommodate their computational workload[43]. However, ETP introduces additional communication overhead due to increased synchronization requirements for expert-level tensor operations, necessitating careful trade-offs between memory efficiency and communication costs.
### 4.2 Hybrid parallelism

> While each of the parallelism strategies discussed above offers distinct advantages, naively combining them into a hybrid approach for training MoE models often leads to significant inefficiencies. The fundamental challenge stems from the inherent incompatibilities between these strategies and the unique architectural and workload characteristics of MoE models. For instance, traditional pipeline parallelism, which was designed for uniform computation patterns, suffers from excessive bubble time when confronted with MoE's irregular and unpredictable workloads. Additionally, under Fully Sharded Data Parallel (FSDP), the distributed sharding of model parameters significantly amplifies the All-Gather communication overhead required for collecting expert parameters, making it substantially more expensive than in dense model training.

> MoE models exhibit three distinctive characteristics that demand hybrid solutions: (1) computational heterogeneity between attention and expert layers, (2) dynamic workload variation across training iterations due to routing evolution, and (3) parameter scaling that requires distributing massive models across devices while maintaining sparse activation patterns[6, 15].
> Overcoming these challenges and designing efficient multi-dimensional hybrid parallelism strategies has enabled MoE models with hundreds of billions to trillions of parameters, such as Google's 600-billion-parameter GShard[6] and Microsoft's 1.7-trillion-parameter models[15], to be successfully trained and deployed in production environments.

> The evolution of hybrid parallelism for MoE models can be classified into three categories, each addressing different aspects of these challenges: (1) homogeneous strategies that optimize parallelism approaches by considering both workload and topology characteristics; (2) heterogeneous approaches that exploit fine-grained intra-layer characteristics; and (3) dynamic parallelism techniques that adapt to runtime conditions. We outline related works in Fig. 3.

![TST-2025-0329-3](https://hackmd.io/_uploads/ByHR1dmp-e.jpg)

#### 4.2.1 Homogeneous hybrid parallelism

> Early-stage MoE parallel training systems applied uniform combinations of parallelism strategies across all model layers. This design treats the entire MoE model as a single computational unit, applying the same parallelism configuration throughout the network. FastMoE[44] exemplifies this approach, providing a unified framework where all layers use the same DP+EP configuration. DeepSpeed-MoE[15] extends the homogeneous paradigm by introducing TP alongside DP and EP, creating a three-dimensional parallelism space. This system maintains uniformity across layers while providing additional flexibility in resource allocation.

> Recognizing that large-scale distributed systems exhibit significant differences in communication bandwidth between intra-node and inter-node connections, researchers have developed topology-aware parallelism strategies. For example, based on communication topology characteristics, FasterMoE[11] introduces a distributed roofline modeling approach that considers both computation and communication characteristics to optimize hybrid parallel configurations. Colossal-AI[63] proposes placing more experts within nodes to encourage intra-node communication, thereby reducing the overhead of inter-node communication. BaGuaLu[45] designs architecture-specific optimizations for the Sunway Supercomputer, successfully training a massive MoE model with 174 trillion parameters through tailored data and expert parallel strategies. APTMoE[47] proposes efficient MoE training schemes on resource-constrained commercial accelerators by combining pipeline parallelism with affinity-aware offloading.

> These efforts demonstrate that substantial improvements in training efficiency can be achieved by optimizing parallelism combinations and considering hardware characteristics. However, the fundamental limitation of homogeneous approaches lies in their inability to exploit the distinct computational characteristics of the different components in each layer.
> Attention modules, with their dense and regular computation patterns, have different optimal parallelism strategies than MoE components with sparse, irregular patterns. This mismatch leads to suboptimal resource utilization, where either the attention part is over-parallelized (leading to communication overhead) or the expert modules are under-parallelized (leading to memory pressure).

#### 4.2.2 Heterogeneous hybrid parallelism

> Recognizing that attention and expert components have fundamentally different computational characteristics, heterogeneous parallelism strategies have been developed to improve MoE training efficiency. GShard[6] pioneered this approach by applying data parallelism to attention parts while introducing expert parallelism for MoE ones. This division allows each component to use its most efficient parallelism strategy while maintaining overall model consistency.

> DeepSpeed-TED[49] introduces a hierarchical view of parallelism. Instead of viewing different parallelism dimensions as competing alternatives, this approach treats them as complementary techniques that can be composed hierarchically. Recognizing that MoE and non-MoE parameters have different memory access patterns and computational characteristics, Colossal-AI proposed EZ-MoE[63], which introduces a different perspective on heterogeneous parallelism by combining Zero Redundancy Optimizer (ZeRO) powered data parallelism with expert parallelism. For MoE modules, the system creates multiple copies within devices to reduce cross-device communication, while non-MoE modules use global ZeRO parallelism for parameter efficiency.

> In Megatron-Core[64, 65], MoE parallel folding[50] has been introduced to separate the parallelism strategies between attention and MoE layers. Attention modules can configure parallelism combinations from the TP, CP, DP, and PP dimensions, while MoE modules adopt TP, DP, EP, and PP combinations. Furthermore, the EP×TP group in MoE sub-layers becomes a subgroup of the DP×CP×TP group in attention layers. This hierarchical relationship reduces the minimum GPU requirement from CP×EP to max(CP, EP), enabling efficient resource utilization while maintaining high communication locality. Recent work on FSSDP[48] achieves expert parallelism on top of FSDP by fully sharding MoE parameters and optimizer states across devices. Moreover, FSSDP materializes individual experts on demand through two sparse collectives (SparseAllGather and SparseReduceScatter), eliminating the need to store complete model parameters. This sparse materialization strategy achieves up to 3.54× speedup over state-of-the-art MoE training systems while maintaining memory efficiency.
> Heterogeneous accelerator optimization: Modern clusters often consist of heterogeneous accelerators with different computational capabilities and memory capacities to reduce infrastructure investment costs. Several approaches have emerged to effectively leverage this heterogeneity in MoE systems. HeterMoE[53] exploits hardware heterogeneity by recognizing that attention and expert computations have different hardware sensitivities. The system assigns attention computation to newer, more capable GPUs while placing expert computation on older hardware. This hardware-aware approach extends beyond simple workload assignment to include sophisticated memory management and scheduling strategies that account for the different capabilities of heterogeneous hardware.

> HEXA-MoE[66] takes a complementary approach by proposing a heterogeneity-aware expert allocation framework that redesigns MoE computation through expert-specific operators. Rather than relying on traditional General Matrix Multiplication (GEMM) kernels, which require token padding and lead to computation redundancy (a toy contrast follows at the end of this subsection), HEXA-MoE introduces three specialized operators: Expert-Specific Matrix Multiplication (ESMM), Expert-Specific Summation (ESS), and Expert-Specific Transposed Matrix Multiplication (ESTMM). Moreover, it enables workload distribution across heterogeneous devices by dynamically adjusting local batch sizes in data-centric configurations, or sub-dimensions of FFN intermediate sizes in model-centric configurations, based on each device's computational capacity.

> Heterogeneous hybrid parallelism can also be applied to different training phases, as the forward and backward passes in MoE training have fundamentally different computational and memory characteristics, motivating phase-specific parallelism strategies[62]. Therefore, optimal parallelism strategies should be designed not only in an architecture-aware manner but also in a phase-aware way, considering both the heterogeneous nature of modern clusters and the diverse computational requirements across different training phases.
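To make the padding-redundancy argument concrete, the toy contrast below compares a capacity-padded batched GEMM with a per-expert grouped matmul of the kind ESMM-style operators implement; names and shapes are hypothetical, and production kernels fuse the grouped loop rather than iterating in Python.

```python
import torch
import torch.nn.functional as F

def padded_bmm(tokens_per_expert, W):
    """tokens_per_expert: list of [n_e, d_in] tensors; W: [E, d_in, d_out]."""
    cap = max(t.shape[0] for t in tokens_per_expert)   # capacity = max expert load
    padded = torch.stack([F.pad(t, (0, 0, 0, cap - t.shape[0]))
                          for t in tokens_per_expert]) # [E, cap, d_in]
    return torch.bmm(padded, W)    # wastes FLOPs multiplying zero-padded rows

def grouped_matmul(tokens_per_expert, W):
    # Computes only the real tokens of each expert; no capacity padding.
    return [t @ W[e] for e, t in enumerate(tokens_per_expert)]
```

Under heavy load imbalance the padded variant can spend most of its FLOPs on zeros, which is exactly the redundancy expert-specific operators are designed to remove.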
#### 4.2.3 Dynamic hybrid parallelism

> The inherent dynamism of MoE training means that the most efficient parallelism strategy at training onset may become suboptimal as the model learns and expert utilization patterns stabilize. Dynamic hybrid parallelization techniques overcome these issues by enabling runtime adaptation of parallelism strategies, allowing systems to continuously optimize resource allocation based on evolving workload characteristics.

> A critical enabler of dynamic parallelism is the design of data structures that can efficiently support multiple parallelism strategies without incurring prohibitive transition costs. For instance, Tutel[12] proposes a unified data layout that maintains compatibility across different parallelism dimensions. The system's data layout abstraction decouples the physical data organization from the logical parallelism strategy, enabling transparent switching between DP, EP, and hybrid configurations without requiring expensive data redistribution operations.

> MPipeMoE[54] recognizes that traditional static pipeline parallelism fails to accommodate the dynamic memory requirements of MoE models, where expert activation patterns create varying memory footprints across pipeline stages. Based on this observation, MPipeMoE proposes an adaptive pipeline depth adjustment mechanism, which dynamically modifies the number of pipeline stages based on real-time memory utilization and expert load distribution. MPMoE[55] extends the concept of adaptive pipeline parallelism by introducing proactive memory management techniques that anticipate future memory requirements based on expert activation trends. The system employs predictive algorithms to forecast expert utilization patterns and preemptively adjusts parallelism strategies to prevent memory bottlenecks before they occur.

> Automatic parallelization: The complexity of selecting optimal parallelism strategies for efficient training of large language models has motivated researchers to develop automated systems that can optimize parallelism without human intervention.
> Alpa[67], Aceso[68], nnScaler[69], and Mist[70] demonstrate how to automatically generate optimal parallelism strategies for dense LLMs by modeling the problem as hierarchical optimization problems.

> In MoE systems, expert placement is regarded as a new dimension for automatic parallelization[56], which has motivated numerous adaptive expert placement approaches, including FasterMoE[11], Prophet[31], SmartMoE[56], Lazarus[58], and CCFuser[59]. Note that while these techniques fundamentally constitute dynamic parallelism, as they adapt parallelism strategies based on runtime workload characteristics, we discuss this body of work in more detail in the load balancing optimization section (Section 7); here, we outline a characteristic comparison of these works in Table 1.

### 4.3 Technical insights and research directions

> The evolution of MoE parallelism strategies, from homogeneous through heterogeneous to dynamic approaches, reveals a paradigm shift in distributed MoE training system design: from computation-centric to workload-centric optimization. This transition exposes several critical insights that will shape the future of large-scale MoE training systems.

> The principle of computational locality suggests that optimal parallelism must be co-designed with the intrinsic computational characteristics of each system component[71]. The success of heterogeneous approaches like MoE parallel folding demonstrates that treating different layer types uniformly is fundamentally suboptimal. However, current implementations still operate at coarse granularities: layer types, training phases, or expert groups. The next frontier lies in fine-grained adaptive parallelism that can dynamically adjust strategies at the granularity of individual operations or even tensor slices, based on real-time computational signatures[72].

> The tension between adaptability and predictability reveals a deeper challenge in system design. While dynamic systems like Tutel[12] and SmartMoE[56] demonstrate significant performance improvements through runtime adaptation, they introduce non-deterministic behavior that complicates debugging, performance analysis, and reproducibility. The field needs to develop principled dynamic optimization approaches that maintain system predictability while enabling adaptive behavior.

> On the other hand, the increasing complexity of hybrid parallelism strategies highlights the critical need for holistic system-hardware co-design. Current approaches often treat hardware topology as a constraint rather than a design parameter.
> The most promising future direction involves parallelism-aware hardware architectures that can efficiently support the complex communication patterns induced by heterogeneous and dynamic parallelism strategies. This includes developing specialized interconnect topologies, adaptive bandwidth allocation mechanisms, and hardware-accelerated collective operations that can seamlessly handle the irregular communication patterns characteristic of MoE workloads.

## 5 Memory Optimization

> Large-scale MoE model training presents unprecedented memory challenges that fundamentally differ from those in dense models. The massive memory consumption not only constrains batch sizes but also reduces accelerator utilization, creating a fundamental barrier to achieving optimal training performance[14]. While parallelism strategies effectively address some computational distribution challenges, they do not directly optimize memory usage during training and often introduce additional memory overhead through parameter replication and communication buffers.

> This section provides a comprehensive analysis of memory optimization techniques specifically designed for MoE training systems. We organize these techniques into four major categories that address different aspects of the memory performance challenge: redundancy optimization techniques, activation recomputation strategies, expert parameter offloading approaches, and low-precision training methods. Figure 4 illustrates the taxonomy of these memory optimization approaches and their relationships.

![TST-2025-0329-4](https://hackmd.io/_uploads/ryRumOmT-g.jpg)

### 5.1 Redundancy optimization

> Significant parameter redundancy in distributed MoE training is introduced by parallelism strategies. While parallelization is essential for scaling MoE models across multiple devices, it often comes at the cost of increased memory demand due to parameter redundancy across model replicas, gradients, and optimizer states. Additionally, during the training process, continuous memory allocation for temporary data and layer activations contributes substantial memory overhead. Redundancy optimization techniques address these challenges by detecting and reducing parameter replicas while eliminating unnecessary temporary memory allocations.

#### 5.1.1 Parameter redundancy elimination

> Data parallelism, while serving as one of the most effective strategies for improving training throughput, introduces significant parameter redundancy across data parallel groups.
Each device in a data-parallel group maintains its own complete copy of the model parameters, gradients, and optimizer states, leading to a substantial memory footprint. For MoE models, this memory-throughput trade-off exacerbates memory demands due to their massive parameter counts, making efficient redundancy elimination crucial for scalability.
>
> ZeRO[34], developed within the DeepSpeed training framework[82], represents a foundational technique for eliminating parameter redundancy in distributed training. The core innovation of ZeRO lies in its graduated three-stage approach: ZeRO-1 partitions optimizer states across devices; ZeRO-2 extends this to gradient partitioning; and ZeRO-3 further partitions the model parameters themselves.
>
> This hierarchical partitioning ensures that devices in the same data-parallel group store only parameter shards rather than complete replicas, reducing memory requirements to approximately 1/M of the original footprint, where M is the number of accelerators in the group.
>
> The impact of optimizer state redundancy is particularly significant in MoE training, as optimizers like Adam[83] maintain momentum and variance estimates that constitute a substantial portion of total memory consumption. Therefore, ZeRO-1 has been widely adopted in MoE training systems, including DeepSpeed-TED[49] and the Llama-3 training framework[41], to alleviate memory pressure. Similarly, PyTorch's FSDP[84] implements a redundancy elimination scheme that allows finer-grained parameter slicing, enabling users to apply parameter sharding selectively to different model components. For high-performance computing environments, BaGuaLu[45] introduces a topology-aware distributed optimizer specifically designed for the Sunway supercomputer's 96,000-node hierarchical heterogeneous network. This system reduces optimizer state overhead to less than 5% of the total memory footprint, enabling the training of 174-trillion-parameter MoE models through intelligent memory distribution that accounts for network topology constraints.
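As a concrete illustration of selective sharding, the sketch below wraps only the parameter-heavy expert bank of a toy MoE block in PyTorch FSDP units; the module names and sizes are invented for the example, and a real launch would use `torchrun` with an initialized NCCL process group.

```python
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class ToyMoEBlock(nn.Module):
    """Toy MoE block: a small dense part plus a parameter-heavy expert bank.
    All names and sizes here are illustrative."""
    def __init__(self, d_model=512, n_experts=8):
        super().__init__()
        self.attn = nn.Linear(d_model, d_model)  # dense (non-expert) part
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

def shard_selectively(block: ToyMoEBlock) -> FSDP:
    # Wrap each expert in its own FSDP unit so its parameters, gradients,
    # and optimizer states are sharded across the data-parallel group;
    # the remaining dense parameters form the outer FSDP unit.
    for i, expert in enumerate(block.experts):
        block.experts[i] = FSDP(expert)
    return FSDP(block)

if __name__ == "__main__":
    dist.init_process_group("nccl")  # assumes a torchrun-style launch
    model = shard_selectively(ToyMoEBlock().cuda())
```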
#### 5.1.2 Temporal memory reuse

> Since tensors in different layers exhibit non-overlapping lifetimes during forward and backward propagation, temporal memory sharing presents significant opportunities for reducing memory footprint. This temporal locality can be exploited through memory reuse strategies that let multiple operations share the same buffers whenever their execution windows do not overlap.
>
> DeepSpeed-TED[49] implements a tile-based method that partitions optimizer states into tiles, enabling the reuse of temporary buffers across different computational phases. This approach reduces memory fragmentation by approximately 40% while maintaining computational efficiency. HECATE[48] extends this concept with expert-level memory reuse mechanisms that shard MoE layers at expert granularity, enabling cross-layer memory reuse that materializes expert parameters on demand.
>
> MPipeMoE[54] addresses temporal memory fluctuations in pipeline parallelism through a three-pronged memory reuse strategy: (1) reusing activation memory across pipeline stages based on temporal dependencies; (2) sharing gradient computation buffers across non-overlapping pipeline stages; and (3) intelligently cycling expert parameter buffers according to pipeline scheduling patterns. Building upon MPipeMoE, MPMoE[55] proposes performance-model-directed memory configurations that jointly optimize pipeline parallelism and memory reuse strategies.
>
> The above redundancy elimination techniques substantially reduce memory fragmentation and peak memory requirements. Although they introduce additional communication overhead and scheduling complexity, they enable training of large-scale MoE models by fundamentally improving the trade-off between memory constraints and computational scalability.

### 5.2 Expert parameter offloading optimization

> For MoE models with massive expert parameters, temporarily relocating unused model parameters from GPU memory to slower storage media, such as Dynamic Random Access Memory (DRAM) or Non-Volatile Memory express (NVMe) Solid State Drives (SSDs), as shown in Fig. 5, can effectively overcome GPU memory limitations. When these parameters are required during the forward and backward passes, they are swapped back into GPU memory such as HBM. This on-demand parameter scheduling approach has been actively studied for dense LLMs[85-90], with ZeRO-Infinity[88] achieving over 6× model scaling through sophisticated prefetching and overlapping techniques that fully utilize NVMe bandwidth.

![TST-2025-0329-5](https://hackmd.io/_uploads/B1ckId7aWl.jpg)
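A minimal sketch of the swap-in/swap-out mechanism just described, using pinned host memory so the host-to-device copy can run asynchronously. This is a generic illustration, not the mechanism of any system cited here.

```python
import torch
import torch.nn as nn

def pin_expert(expert: nn.Module) -> nn.Module:
    """Keep the expert's master weights in pinned host memory so that
    host-to-device copies can run asynchronously."""
    for p in expert.parameters():
        p.data = p.data.cpu().pin_memory()
    return expert

def run_offloaded_expert(expert: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Swap one expert into GPU memory, run it, then evict it again.
    nn.Module.to() moves parameters in place, so the same module object
    is reused as its weights migrate between host and device."""
    expert.to("cuda", non_blocking=True)  # H2D swap-in (async from pinned memory)
    y = expert(x.to("cuda"))
    expert.to("cpu")                      # D2H eviction frees GPU memory
    return y
```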
> However, MoE architectures present unique challenges that render traditional offloading approaches inadequate. The parameter and optimizer state requirements of MoE models are orders of magnitude larger than those of dense models, while the available bandwidth between CPU and GPU remains constrained, at 32 GB/s for Peripheral Component Interconnect express (PCIe) 4.0. Consequently, traditional model-level or layer-level offloading strategies become prohibitively expensive in MoE training systems, creating communication bottlenecks that negate the computational efficiency gains MoE architectures are designed to provide.
>
> To address these limitations, expert-granular offloading strategies represent a paradigm shift toward fine-grained memory management aligned with MoE's natural computational boundaries. ES-MoE[14] pioneers this approach by offloading parameters and optimizer states at expert granularity, combined with sequential expert computation rather than traditional batched processing. This design enables the system to accommodate 67× more experts while achieving a 17.5× improvement in training throughput. The effectiveness of expert-level offloading stems from aligning memory management granularity with MoE's computational structure, allowing fine-grained control over which parameters reside in fast memory based on immediate computational requirements.
>
> Activation-aware offloading: The breakthrough in efficient MoE offloading emerged from addressing a fundamental limitation of existing memory management approaches: the inability to predict expert activation patterns before they occur. Traditional offloading strategies treat expert selection as essentially random, leading to reactive loading schemes that incur substantial latency penalties whenever experts must be fetched from slower memory tiers. This reactive approach is particularly problematic in MoE architectures, where the majority of parameters reside in experts, yet dynamic routing decisions make it impossible to determine which experts will be needed until computation begins.
>
> MoE-Infinity[73] addresses this challenge by exploiting temporal locality in expert activation patterns: despite dynamic routing, recurring activation patterns can be leveraged for intelligent memory management. The system's core innovation lies in its Expert Activation Matrix Collection (EAMC) framework, which systematically analyzes expert activation sequences across diverse input distributions to identify recurring patterns that enable predictive prefetching. By recognizing that certain experts consistently activate together for specific linguistic or semantic constructs, the system can anticipate future expert requirements from recent activation history, effectively transforming reactive memory management into proactive optimization. Janus[74] proposes a data-centric MoE training framework in which experts are transferred between workers on demand, with individual expert swapping that coordinates CPU-GPU transfers with computation schedules to minimize idle time and maximize utilization efficiency.
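The toy prefetcher below captures the spirit of this history-driven, activation-aware offloading. The simple frequency-count heuristic and all names are assumptions for illustration; this is not MoE-Infinity's actual EAMC logic.

```python
import collections
import torch

class ExpertPrefetcher:
    """Predict likely-next experts from recent routing history and copy
    their weights to the GPU on a side stream before they are needed.
    Illustrative sketch only."""

    def __init__(self, cpu_experts, top_k=2):
        self.cpu_experts = cpu_experts        # expert_id -> nn.Module on CPU
        self.history = collections.Counter()  # expert_id -> recent hit count
        self.top_k = top_k
        self.copy_stream = torch.cuda.Stream()

    def observe(self, routed_ids: torch.Tensor):
        # Record which experts the router just selected.
        self.history.update(routed_ids.flatten().tolist())

    def prefetch(self):
        # Speculatively move the most frequently used experts to the GPU,
        # overlapping the copies with ongoing computation.
        with torch.cuda.stream(self.copy_stream):
            for expert_id, _ in self.history.most_common(self.top_k):
                self.cpu_experts[expert_id].to("cuda", non_blocking=True)

    def sync(self):
        # Make the compute stream wait until prefetched weights have landed.
        torch.cuda.current_stream().wait_stream(self.copy_stream)
```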
> Hierarchical memory management: Recognizing that modern training systems possess heterogeneous communication pathways with different bandwidth characteristics, advanced offloading strategies exploit multiple pathways simultaneously to minimize offloading overhead. MoESys[75] exemplifies this approach through a hybrid storage strategy: it applies ZeRO-3-like distributed storage to dense model components, leveraging high-bandwidth NVLink connections for parameter prefetching, while sparse expert parameters are offloaded to DRAM and SSD storage. This hierarchical approach demonstrates that effective MoE memory management requires not only innovative algorithms but also careful consideration of the underlying hardware topology and communication infrastructure.

### 5.3 Activation recomputation strategy

> In large-scale model training, the storage of intermediate activations often dominates memory usage. Activation recomputation, also known as gradient checkpointing, addresses this challenge through a computation-for-storage trade-off: intermediate activations are discarded during forward propagation and recomputed when needed during backpropagation[29, 91]. This approach can significantly reduce peak memory usage, enabling training of deeper or wider models under memory constraints. For MoE training specifically, activation recomputation has been applied to mitigate out-of-memory failures caused by severe load imbalance during early training stages[41].
>
> Traditional activation recomputation strategies, while effective at reducing memory pressure, execute on the critical training path, so memory savings come at the expense of training throughput. This limitation often causes performance regressions that discourage adoption in production environments. Recent advances in MoE-specific activation recomputation address these limitations through intelligent selectivity and precision-aware strategies. MegaScale-MoE[78] implements selective recomputation that retains activations that are costly to recompute while recomputing only those generated by memory-intensive operations, eliminating the performance penalties traditionally associated with activation checkpointing. DeepSeek-V3[8] integrates FP8 mixed-precision storage with selective recomputation, showing how careful attention to numerical precision requirements lets recomputation deliver substantial memory reduction without compromising training stability.
>
> PanGu-Ultra-MoE[10] implements fine-grained recomputation at the level of individual operators rather than the traditional layer-wise strategy. It introduces Multi-head Latent Attention (MLA) Key-Value (KV)-only recomputation, where only the key-value vectors in attention modules are recomputed, while other activations are selectively stored or recomputed based on their computational cost and memory footprint.
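These selective schemes all build on the same underlying checkpointing primitive. Below is a minimal per-expert illustration using PyTorch's built-in `torch.utils.checkpoint`, with the selection policy left as a caller-supplied flag rather than any cited system's policy.

```python
import torch
from torch.utils.checkpoint import checkpoint

def expert_forward(expert, x: torch.Tensor, recompute: bool) -> torch.Tensor:
    """Selective activation recomputation at expert granularity: experts
    flagged as cheap to recompute discard their intermediate activations
    in the forward pass and rebuild them during backpropagation."""
    if recompute:
        # Activations inside `expert` are not stored; they are recomputed
        # when gradients flow back through this call. use_reentrant=False
        # selects the recommended non-reentrant implementation.
        return checkpoint(expert, x, use_reentrant=False)
    return expert(x)  # costly-to-recompute experts keep their activations
```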
> HECATE[48] reconceptualizes activation management through sparse materialization: only the activations corresponding to selected experts are computed and stored, and full activation patterns are reconstructed from these sparse representations during backpropagation.
>
> The sparse activation characteristics of MoE models present unique opportunities and challenges for activation recomputation. Since each token passes through only its selected experts, inactive components require no recomputation, enabling more efficient memory-computation trade-offs. MoETion[79] exploits this through a "sparse checkpoint" mechanism that selectively saves and recomputes activations based on MoE's routing sparsity. By analyzing which expert outputs are most critical for memory usage and gradient computation, the system checkpoints only a small portion of "low-reuse" activations, significantly reducing overall recomputation overhead. This demonstrates that, for dynamic computation graphs like MoE, customizing activation checkpointing with routing information can achieve superior memory-computation trade-offs.

### 5.4 Low-precision training

> Low-precision training has emerged as a powerful optimization strategy that enables training of massive MoE models within constrained memory budgets while maintaining competitive performance. Modern accelerators such as NVIDIA H100 GPUs support 8-bit Floating Point (FP8) arithmetic, making FP8 low-precision training particularly attractive: FP8 GEMMs in forward propagation, weight-gradient computation, and activation-gradient computation effectively halve memory requirements compared to BF16 precision.
>
> Several training frameworks and acceleration engines support FP8 training with flexible precision switching, including Microsoft's FP8-LM[92], NVIDIA's Transformer Engine[80], and Amazon SageMaker[81]. DeepSeek-V3[8] demonstrates the practical effectiveness of FP8 mixed-precision training on a 671-billion-parameter MoE model, achieving competitive quality while significantly reducing memory requirements. However, large-scale MoE training at extremely low precision faces unique challenges: compared with dense models, aggressive quantization can more easily cause expert collapse and training instability. The dynamic routing mechanisms of MoE models are particularly sensitive to precision reduction, as small quantization errors can significantly alter expert selection decisions. Despite these challenges, the substantial memory savings achievable through low-precision training make this an important area for continued research and development.
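As a usage sketch, NVIDIA Transformer Engine exposes FP8 GEMMs through an autocast context. The recipe parameters below are illustrative assumptions, and FP8 execution requires supporting hardware such as H100.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Illustrative scaling recipe; production runs tune the format and amax history.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID,
                                   amax_history_len=16)

layer = te.Linear(4096, 4096, bias=True).cuda()  # FP8-aware drop-in for nn.Linear
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # GEMM executes in FP8 with per-tensor scaling;
                  # master weights and optimizer states stay higher precision
y.sum().backward()
```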
### 5.5 Technical insights and research directions

> Current MoE memory optimization techniques reveal several critical insights about expert activation patterns and memory hierarchy utilization that will guide future research. Expert activation exhibits strong temporal and spatial locality, with certain experts frequently co-activated within short time windows and across similar input sequences[4, 6]. This locality can be exploited through intelligent prefetching and expert clustering strategies that place frequently co-activated experts in the same memory regions.
>
> The heterogeneous nature of expert utilization presents opportunities for adaptive precision strategies. Frequently activated experts that handle common patterns can benefit from higher precision to maintain model quality, while specialized experts that activate infrequently can tolerate lower precision with minimal impact on overall performance. This insight suggests dynamic precision scaling, where each expert's precision is adjusted during training based on its utilization, activation importance scores, and gradient magnitudes.
>
> On the other hand, the optimal balance between memory usage and computational overhead depends heavily on hardware characteristics and workload patterns, and, importantly, this balance is dynamic rather than static. The trade-off shifts with several factors: (1) batch size effects, where larger batch sizes amortize expert loading costs but increase memory pressure; (2) sequence length dependencies, where longer sequences benefit more from expert caching due to increased activation probability; and (3) hardware topology characteristics, where memory bandwidth, cache hierarchy, and interconnect properties fundamentally alter the optimal operating point.

## 6 Communication Optimization

> The sparse activation and dynamic routing characteristics of MoE models introduce significant communication challenges that fundamentally differ from those of traditional dense models. Traditional distributed training systems, designed for dense models with regular data distributions and predictable communication patterns, offer limited effectiveness for MoE, whose data-dependent communication patterns must be handled at runtime.
> This section discusses MoE-oriented communication optimization technologies, organized into three key categories: All-to-All collective optimization, computation-communication overlap techniques, and communication compression methods. We outline related work in Fig. 6. Through systematic analysis, we extract key insights about the importance of topology-aware scheduling, the resource complementarity that enables effective overlap, and emerging opportunities in compression-aware communication, where operations can be performed directly on compressed representations without full decompression.

![TST-2025-0329-6](https://hackmd.io/_uploads/H1CruOQTWe.jpg)

### 6.1 Collectives redesign and scheduling

> The naive All-to-All collective (also called Linear All-to-All) requires all participants to communicate with each other simultaneously. This approach is agnostic to the bandwidth heterogeneity between intra-node and inter-node connections and consequently underutilizes intra-node high-bandwidth links during data exchange and synchronization[55]. Furthermore, when training MoE models with hybrid parallelization methods, the interleaving of different collective communications, especially All-to-All and All-Reduce operations in the backward pass, creates bandwidth competition[74]. Therefore, resource-aware All-to-All communication is essential for improving training efficiency.
>
> Hierarchical All-to-All: To adapt to heterogeneous networking bandwidth, hierarchical All-to-All approaches, in which data exchange across workers is performed with awareness of cluster topology, have been studied and implemented in several MoE training systems. For example, HetuMoE[93] implements a topology-aware One-Dimensional (1D) All-to-All in which data destined for GPUs within the same node are first aggregated and sent together, then scattered to individual GPUs within the node. This design fully utilizes intra-node bandwidth while reducing the number of inter-node transfers.
>
> Building upon the 1D approach, both DeepSpeed-MoE[15] and SE-MoE[94] implement Two-Dimensional (2D) All-to-All communication, which decomposes the operation into separate intra-node and inter-node phases. This approach leverages high-bandwidth intra-node connections to minimize local data exchange latency and prepares large data chunks for efficient inter-node transfer. FasterMoE[11] extends this concept with group-based All-to-All, enabling intra-node data exchange to complete rapidly before expert computation begins. Tutel[12] further optimizes the 2D All-to-All approach by improving data layout and leveraging compiler optimizations to reduce synchronization overhead between All-to-All stages. More recently, HECATE[48] introduces advanced scheduling algorithms that dynamically adapt to network conditions and load patterns.
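The two-phase structure can be sketched with two process groups per rank, as below. This is a schematic under strong assumptions: it omits the inter-phase index remapping that a correct end-to-end permutation requires, and real systems create the groups once at startup rather than per call.

```python
import torch
import torch.distributed as dist

def hierarchical_all_to_all(x: torch.Tensor, gpus_per_node: int = 8) -> torch.Tensor:
    """Two-phase (2D) All-to-All: shuffle inside the node over fast links
    first, then perform one aggregated exchange per node pair over the
    network. Assumes one process per GPU, an initialized process group,
    and world_size divisible by gpus_per_node."""
    rank, world = dist.get_rank(), dist.get_world_size()
    n_nodes = world // gpus_per_node
    node, local = rank // gpus_per_node, rank % gpus_per_node

    # Every rank must participate in every new_group call.
    intra = [dist.new_group(list(range(n * gpus_per_node, (n + 1) * gpus_per_node)))
             for n in range(n_nodes)]
    inter = [dist.new_group(list(range(l, world, gpus_per_node)))
             for l in range(gpus_per_node)]

    # Phase 1: exchange over NVLink, aggregating traffic per destination node.
    staged = torch.empty_like(x)
    dist.all_to_all_single(staged, x, group=intra[node])

    # Phase 2: exchange the aggregated chunks across nodes via the NIC.
    out = torch.empty_like(x)
    dist.all_to_all_single(out, staged, group=inter[local])
    return out
```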
> Locality-aware scheduling: Beyond algorithmic optimizations, exploiting locality principles has shown significant promise. NETMOE[95] introduces network-aware expert placement strategies that minimize cross-node communication by co-locating frequently accessed experts. LUFFY[96] extends this concept by implementing dynamic expert migration based on workload patterns, achieving substantial communication overhead reduction in heterogeneous cluster environments.

### 6.2 Concurrent execution

> The naive workflow of MoE training systems typically executes communication and computation, as well as different types of collective communication, serially. For example, when training MoE models with hybrid parallelism strategies such as expert parallelism and expert-sharding parallelism, All-Gather or Reduce-Scatter operations for data collection can be executed concurrently with All-to-All[62]. Additionally, researchers have observed that All-to-All operations can be delayed by All-Reduce during the backward pass. Efficiently interleaving different tasks, through computation-communication overlap or interleaved communication, is therefore an effective way to improve training efficiency. This section discusses both computation-communication overlap and interleaved communication optimization.

#### 6.2.1 Computation-communication overlap

> Unlike the strict dependency relationships of traditional dense models, expert computation in MoE exhibits natural parallelism: forward computation of different experts can proceed simultaneously, and computation within one expert can overlap with communication for other experts. Thus, optimizing MoE training system performance through computation-communication overlap has attracted significant interest from both academia and industry.
>
> Task-level overlap: The most straightforward way to overlap communication and computation is to treat them as complete task units and enable overlap at the task level through scheduling, without modifying operator implementations. FasterMoE[11] and Tutel[12] separate these two types of tasks into distinct Compute Unified Device Architecture (CUDA) streams and launch communication/computation sub-tasks asynchronously and out of order as soon as their data dependencies are satisfied. Tutel further introduces unified data layout abstractions that maintain compatibility across different parallelism strategies, enabling transparent switching without expensive data reorganization.
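A minimal sketch of this stream-based scheduling pattern follows. Here `dispatch` and `experts` are placeholder callables standing in for the asynchronous All-to-All and the expert GEMMs of a real system.

```python
import torch

def overlapped_moe_layer(chunks, dispatch, experts):
    """Task-level overlap on two CUDA streams: chunk i's expert computation
    runs while the dispatches (communication) of later chunks are already
    in flight on a dedicated stream."""
    comm_stream = torch.cuda.Stream()
    in_flight, outputs = [], []

    for x in chunks:
        done = torch.cuda.Event()
        with torch.cuda.stream(comm_stream):
            routed = dispatch(x)       # communication task on its own stream
            done.record(comm_stream)
        in_flight.append((routed, done))

    for routed, done in in_flight:
        # Compute starts as soon as this chunk's dispatch has landed,
        # overlapping with the dispatches still running for later chunks.
        torch.cuda.current_stream().wait_event(done)
        outputs.append(experts(routed))
    return outputs
```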
> Inspired by the micro-batching technique used in GPipe[36], MPipeMoE[55] enforces overlap by partitioning mini-batch data into several micro-batches and interleaving the communication and computation operations of adjacent micro-batches.
>
> Beyond basic asynchronous scheduling, predictive approaches leverage temporal patterns in expert activation to proactively allocate resources and schedule tasks. Prophet[31] exploits temporal locality in token routing patterns to predict expert loads and proactively schedule parameter transfers across iterations, while restructuring gradient aggregation to overlap expert All-Reduce with non-expert layer communications. Building on this idea, Pro-Prophet[98] systematically decomposes load balancing into an offline planner (profiling inter-iteration distributions) and an online scheduler (maximizing overlap within iterations), enabling data-dependent optimization. Janus[74] implements a data-centric framework in which experts migrate on demand between workers, coordinating asynchronous CPU-GPU transfers with computation scheduling to minimize idle time.
>
> LANCET[99] extends task-level overlap optimization from individual MoE layers to the entire computation graph. Through Intermediate Representation (IR) analysis, LANCET discovers global overlap opportunities invisible to local optimizers, enabling whole-graph coordination of communication and computation. DeepSeek-V3[8] introduces DualPipe[103], an innovative bidirectional pipeline parallelism algorithm that achieves near-perfect overlap through careful task-level scheduling. DualPipe employs a bidirectional pipeline with fine-grained chunk decomposition (attention, All-to-All dispatch, MLP, and All-to-All combine) and manually tunes the ratio of GPU Streaming Multiprocessors (SMs) dedicated to communication versus computation within each chunk, maximizing hardware utilization through task-level resource allocation.
>
> Operator-level overlap: While task-level approaches schedule coarse-grained operations, operator-level techniques dive into individual operators, decomposing them into finer-grained sub-operations to enable deeper integration of communication and computation. However, concurrently running computation and communication kernels on the same accelerator can incur performance degradation[12]. Therefore, the key to effectively overlapping computation and communication at the operator level lies in kernel fusion and fine-grained decomposition that allow seamless integration within unified kernels.
>
> FLUX[101] introduces a fine-grained decomposition approach that over-decomposes communication and computation operations into much finer-grained tiles than existing methods, then fuses the tiled computation and communication into a single larger kernel. This aggressive decomposition enables communication and computation to execute concurrently within the same kernel, eliminating synchronization overhead between separate operations. COMET[100] observes that the granularity mismatch between communication and computation prevents task-level overlap from fully utilizing accelerators, resulting in suboptimal efficiency. It addresses this by decomposing coarse-grained operation dependencies into fine-grained data dependencies, allowing communication and computation to be fused into a single kernel at tensor-element granularity rather than at the operation level.
>
> Triton-distributed[102] takes a compiler-based approach to operator-level overlap.
> By integrating communication primitives compliant with the OpenSHMEM standard[107] directly into the Triton compiler[108], Triton-distributed allows programmers to achieve complex joint optimization of computation, memory access, and communication through a unified high-level Python programming model. The compiler automatically generates fused kernels that seamlessly integrate communication operations with computation at the operator level.
>
> Hybrid approaches: Several advanced systems combine task-level and operator-level techniques to achieve superior overlap efficiency. Lina[97] operates at both levels: at the task level, it analyzes expert activation patterns in depth to discover temporal locality, enabling predictive prefetching and priority-based collective scheduling; at the operator level, it decomposes communication operations into uniform micro-ops for fine-grained scheduling, allowing more precise control over communication-computation interleaving. MegaScale-MoE[78] implements a comprehensive hierarchical overlap scheme that explicitly optimizes at both levels: at the inter-operator (task) level, it achieves overlap through macro-module execution that coordinates multiple operators; at the intra-operator level, it employs tile-based GEMM decomposition, breaking individual operators into tiles so that communication and computation kernels can execute concurrently within the same operator. This two-level strategy allows MegaScale-MoE to maximize overlap opportunities across granularities.
>
> Table 2 compares these techniques in detail. The evolution from task-level to operator-level and finally to hybrid approaches reveals an important trend: as systems mature, they increasingly adopt multi-level optimization strategies that coordinate scheduling decisions across granularities to achieve near-optimal overlap efficiency.
#### 6.2.2 Concurrent collective communication

> As discussed earlier, advances in hybrid parallelism open more optimization space for MoE training but lead to more sophisticated communication patterns. One limitation of traditional All-to-All collectives is that their different tasks (or stages) execute serially. Since there are no data dependencies between intra-node and inter-node operations, All-to-All can be decoupled into two independent tasks executed concurrently. For example, ScheMoE[104] addresses the serial task execution of traditional All-to-All implementations by decoupling intra-node and inter-node communications into independent tasks executed concurrently on separate CUDA streams. This approach effectively hides intra-node communication latency while maintaining computation efficiency. Beyond decoupling communication tasks, another critical challenge arises from bandwidth contention between different collective operations during concurrent execution. Lina[97] systematically analyzes this bottleneck, revealing that All-to-All and All-Reduce operations often contend for network bandwidth when they overlap in the backward pass. To address this, Lina introduces tensor partitioning that decomposes communication operations into uniform micro-ops, enabling fine-grained priority scheduling in which All-to-All operations receive dedicated bandwidth while All-Reduce operations opportunistically use the remaining resources.
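A sketch of this decoupling using asynchronous collectives on separate process groups (constructed as in the hierarchical All-to-All sketch above). Splitting the payload into its intra-node and inter-node parts is assumed to be done by the caller.

```python
import torch
import torch.distributed as dist

def decoupled_all_to_all(intra_part, inter_part, intra_group, inter_group):
    """Launch the intra-node and inter-node halves of an All-to-All as two
    independent asynchronous operations: they have no data dependency and
    use different physical paths (NVLink vs. the NIC). Sketch only."""
    intra_out = torch.empty_like(intra_part)
    inter_out = torch.empty_like(inter_part)

    # Both collectives are in flight at once.
    w1 = dist.all_to_all_single(intra_out, intra_part,
                                group=intra_group, async_op=True)
    w2 = dist.all_to_all_single(inter_out, inter_part,
                                group=inter_group, async_op=True)
    w1.wait()
    w2.wait()
    return intra_out, inter_out
```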
> Note on "serial": stages execute strictly one after another.
>
> Note on node independence: this refers to the independence of both the physical channel (hardware path) and the data objects. Physical-channel independence: intra-node traffic travels over NVLink, the dedicated high-speed link between GPUs, while inter-node traffic goes through the Network Interface Card (NIC) and the network switches; the two paths are physically distinct, so each can be congested or idle independently of the other. Data-object independence (the key point): in an MoE All-to-All, the tokens held by GPU 0 have different destinations, some targeting experts on a neighboring GPU in the same machine and others targeting experts on a remote machine; there is no ordering dependency between these two batches of data.

### 6.3 Communication compression

> In MoE training, the typical communication volume of each All-to-All operation can reach millions of bytes[104]. While previous sections discussed communication optimization through collective scheduling and overlapping, from a data perspective an efficient way to reduce All-to-All latency is to compress or shrink the transmitted data. However, communication compression faces unique challenges in MoE training: compression algorithms must reduce transmission volume without destroying token semantics and routing information. This constraint renders many compression techniques that are effective in traditional distributed training ineffective in MoE scenarios.
>
> Volume compression: One straightforward way to apply communication compression is to compress and decompress data immediately before and after All-to-All communication. For instance, ScheMoE[104] designs abstractions for data compression and decompression in MoE layers, where low-bit data representations and compression algorithms can be plugged in to reduce communication volume. From an information-theoretic perspective, the compression potential of MoE communication derives primarily from redundancy in token embeddings and the predictability of routing decisions. Observing that data routed to the same expert exhibit similarity, LSH-MoE[105] uses Locality-Sensitive Hashing (LSH) to cluster tokens and transmits only cluster centroids rather than the original data, reducing communication volume.
>
> Lower dimension: Another way to reduce communication overhead is low-dimensional communication. BigMac[106] introduces Dimension-Compressed Communication Adaptation (DCCA), in which tensor dimensions are scaled down before the first All-to-All operation and scaled back up after the second. By adding descending and ascending projections at expert entry and exit points, BigMac achieves model quality comparable to fine-grained MoEs while reducing end-to-end training latency by up to 3.09×.
>
> Lower precision: Representing data at lower bit precision is a common way to reduce space or bandwidth requirements, and quantization techniques have demonstrated enormous potential for MoE compression. QMoE[109] achieves breakthrough results, compressing the 1.6-trillion-parameter Switch Transformer[4] from 3.2 TB to 160 GB, i.e., to less than 1 bit per parameter. MegaScale-MoE[78] implements a multi-tier precision strategy, reducing inter-node parameter synchronization precision from 32-bit Floating Point (FP32) to BF16, effectively halving communication overhead while maintaining convergence stability. DeepSeek-V3[8] adopts FP8 quantization for expert parallelism, halving communication volume compared to BF16.
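A minimal compress-communicate-decompress sketch in the spirit of these precision reductions, using BF16 as the wire format; production FP8 schemes additionally carry scaling factors alongside the payload.

```python
import torch
import torch.distributed as dist

def low_precision_all_to_all(x: torch.Tensor, group=None) -> torch.Tensor:
    """Halve All-to-All volume by casting FP32 activations to BF16 for the
    wire, then casting back before expert compute. Generic sketch only."""
    wire = x.to(torch.bfloat16)           # compress: 4 bytes/elem -> 2
    out = torch.empty_like(wire)
    dist.all_to_all_single(out, wire, group=group)
    return out.to(torch.float32)          # decompress before expert compute
```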
> Compression-aware communication: Existing compression approaches in MoE systems follow a compress-communicate-decompress paradigm, in which activations are compressed before All-to-All communication and fully decompressed upon arrival, before computation. While this reduces transmission overhead, the decompression step remains on the critical path, creating performance bottlenecks.
>
> A promising alternative, inspired by database systems such as TADOC[110] and CompressDB[111] that operate directly on compressed data, is to perform computation directly on compressed activations without full decompression. This approach fundamentally rethinks the communication-computation interface by treating compressed representations as computational substrates rather than mere storage formats. While systems like DeepSeek-V3[8] integrate low-precision communication with selective full-precision computation for numerically sensitive operations, current implementations still treat compressed formats purely as communication encodings. Developing truly compression-aware communication, where operations execute directly on compressed data, represents a significant opportunity for future research.

Continued: https://hackmd.io/NgQEwu7JTPuMUCQYloosqA