<h1>DualPipe could be better without the Dual</h1>
Penghui Qi\*, Xinyi Wan\*, Guangxing Huang, Min Lin (Sea AI Lab)
\* Equal Contributions. 27 Feb 2025
---
[Chinese version](https://hackmd.io/@ufotalent/S1N_ay0ckx)
DeepSeek open-sourced DualPipe on day 4 of their OpenSourceWeek. It is a co-design of Pipeline Parallelism and Expert Parallelism for better training performance.
In this blog, we show that the **Dual** part of DualPipe is actually harmful because of its 2× parameter redundancy. It is unnecessary and can be removed almost for free, with only slight impacts on the other properties of the schedule. The trick is to transform it into a V-Shape schedule by a simple "cut-in-half" procedure. We further show that when Expert Parallelism (EP) is not required, the efficiency can be improved further, which leads to the ZB-V schedule.
## Removing Duplicated Parameters from DualPipe
It is important to note that the DualPipe schedule can be divided into two mirrored halves, as illustrated in the figure below. For example, devices $0$ and $7$ hold the same pipeline stages and follow exactly the same schedule.

Figure 1. DualPipe can be divided into two mirrored halves.
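To make the mirrored structure concrete, here is a minimal Python sketch (our own illustration, not code from the DualPipe repository), assuming the bidirectional placement where device $i$ hosts stage $i$ for one direction and stage $d-1-i$ for the other; the helper `dualpipe_stages` is hypothetical.
```python
def dualpipe_stages(d: int) -> dict[int, set[int]]:
    """Stages held by each device in a d-device bidirectional (DualPipe-like) schedule:
    stage i for the top-down direction plus stage d-1-i for the bottom-up direction."""
    return {dev: {dev, d - 1 - dev} for dev in range(d)}

placement = dualpipe_stages(8)
assert placement[0] == placement[7] == {0, 7}   # mirrored devices hold identical stages
assert placement[3] == placement[4] == {3, 4}
# Each of the 8 stages is stored on two devices -> the 2x parameter redundancy.
assert sum(len(stages) for stages in placement.values()) == 2 * 8
```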
### Cut in half
By keeping only the first half of the devices (and grafting the late stages of the bottom-up microbatches onto the early stages of the top-down ones), we obtain a schedule without the "dual" that has exactly the same bubble rate, memory footprint, and other properties.

Figure 2. Illustration of halving the devices & grafting the stages.

Figure 3. The Cut-in-half schedule.
Let's call this schedule the Cut-in-half schedule. It has no duplication of parameters; however, since the number of devices is also halved, the _per-device_ parameter memory remains unchanged. Following the same philosophy, we can design an 8-device Cut-in-half schedule by increasing the number of pipeline stages and halving the layers per stage. Due to the halved layers, the per-device parameter memory is brought down to half of the original amount.
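As a back-of-the-envelope check, the snippet below builds the V-shape placement we assume for an 8-device Cut-in-half schedule and confirms that every stage is stored exactly once and that per-device parameters are halved relative to DualPipe. This is a sketch under our own assumptions; `cut_in_half_stages` and the 64-layer model are hypothetical.
```python
def cut_in_half_stages(d: int) -> dict[int, set[int]]:
    """V-shape placement on d devices with 2*d half-sized stages:
    device i runs stage i on the way down and stage 2*d-1-i on the way up."""
    return {dev: {dev, 2 * d - 1 - dev} for dev in range(d)}

d, layers = 8, 64                                   # illustrative 64-layer model
placement = cut_in_half_stages(d)                   # 16 stages of 4 layers each

assert placement[0] == {0, 15} and placement[7] == {7, 8}
assert sum(len(s) for s in placement.values()) == 2 * d   # every stage stored exactly once

dualpipe_layers_per_device = 2 * (layers // d)            # two full-size stages: 16 layers
cut_in_half_layers_per_device = 2 * (layers // (2 * d))   # two half-size stages:  8 layers
assert cut_in_half_layers_per_device == dualpipe_layers_per_device // 2
```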
We show a detailed comparison table below; all schedules are compared assuming the same number of devices (denoted as $d$).
Table 1. Comparison of various pipeline schedules
| Method | Bubble | Parameter | Activation | PP Communication |
|-------------|---------------------------------|-----------|------------|---|
| 1F1B | ($d$-1)(F+B) | 1× | $d$ | 1× |
| ZB1P | ($d$-1)(F+B-2W) | 1× | $d$ | 1× |
| DualPipe | ($d$/2-1)(F&B+B-3W) | 2× | $d$+1 | 1× |
| Cut-in-half* | (($d$-1)/2)(F&B+B-3W) | 1× | $d$+1/2 | 2× |
\* Note: $d$ is the number of devices. For Cut-in-half, it is the number of devices after cutting.
Cut-in-half doubles the PP communication volume compared to other methods. However, the benefit of halving parameter memory outweighs this, as PP communication is relatively minor compared to EP communication.
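To give the formulas in Table 1 a concrete feel, the snippet below plugs in illustrative timings (F = B = 1, W = 0.5, with the overlapped F&B block approximated as F + B). These numbers are assumptions for intuition only, not measurements.
```python
d = 8                        # number of devices
F, B, W = 1.0, 1.0, 0.5      # hypothetical per-microbatch timings, not measurements
FB = F + B                   # approximate the overlapped F&B block as F + B

bubbles = {
    "1F1B":        (d - 1) * (F + B),
    "ZB1P":        (d - 1) * (F + B - 2 * W),
    "DualPipe":    (d / 2 - 1) * (FB + B - 3 * W),
    "Cut-in-half": ((d - 1) / 2) * (FB + B - 3 * W),
}
for name, bubble in bubbles.items():
    print(f"{name:12s} bubble = {bubble:5.2f}")
# Under these assumptions: 1F1B 14.00, ZB1P 7.00, DualPipe 4.50, Cut-in-half 5.25
```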
## Cut-in-half is an EP-specialized ZB-V Schedule
The Cut-in-half schedule is reminiscent of the V-Shape schedules, so we highlight its connections to prior works. V-Shape (wave-like) schedules were first proposed in [3], improved to achieve zero bubble in the ZB-V schedule [1], and further modified in [2] for a reduced memory footprint.
Pipeline bubbles in DualPipe/Cut-in-half partially result from the specialization for Expert Parallelism (EP): the forward and backward passes are overlapped in the stable phase to mitigate EP communication overhead. Without considering EP, these bubbles can be further reduced, effectively converting Cut-in-half into a ZB-V schedule.
1. Untie F/B and squeeze
As shown in the figure below, untying the overlapped forward and backward passes into separate ones leads to more flexible dependencies, making it possible to squeeze the schedule. The schedule now looks more like a ZB-V schedule.

2. Untie B/W in the cool-down phase and reorder
By further untying B/W in the cool-down phase and bypassing the synchronizations (introduced in Section 4 of [1]), we obtain the ZB-V schedule and achieve zero bubble. A minimal sketch of this B/W split is shown after the figure below.

## Citation
If you find this blog helpful, please consider citing:
```
@misc{qi2025dual,
    title={DualPipe could be better without the Dual},
    author={Penghui Qi and Xinyi Wan and Guangxing Huang and Min Lin},
    year={2025},
    howpublished={\url{https://hackmd.io/@ufotalent/r1lVXsa9Jg}},
    note={Blog},
}
```
[1] https://arxiv.org/pdf/2401.10241
[2] https://arxiv.org/abs/2405.15362
[3] https://dl.acm.org/doi/pdf/10.1145/3581784.3607073