<h1>DualPipe could be better without the Dual</h1>
Penghui Qi\*, Xinyi Wan\*, Guangxing Huang, Min Lin (Sea AI Lab)
\* Equal Contributions. 27 Feb 2025
---
[Chinese version](https://hackmd.io/@ufotalent/S1N_ay0ckx)
DeepSeek open-sourced DualPipe on day 4 of their OpenSourceWeek. It is a co-design of Pipeline Parallelism and Expert Parallelism for better training performance.
In this blog, we show that the **Dual** part of DualPipe is actually harmful because of its 2× parameter redundancy. It is unnecessary and can be removed almost for free, with only slight impacts on the other properties of the schedule. The trick is to transform it into a V-Shape schedule by a simple "cut-in-half" procedure. We further show that when Expert Parallelism (EP) is not required, the efficiency can be improved further, which leads to the ZB-V schedule.
## Removing Duplicated Parameters from DualPipe
It is important to note that the DualPipe schedule can be divided into two mirrored halves, as illustrated in the figure below. For example, devices $0$ and $7$ hold the same pipeline stages and follow exactly the same schedule.

Figure 1. DualPipe can be divided into two mirrored halves.
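To make the mirrored structure concrete, here is a minimal Python sketch (our own illustration, not code from the DualPipe repository), assuming the bidirectional placement where device $i$ hosts stage $i$ for one direction and stage $d-1-i$ for the other; the helper `dualpipe_stages` is hypothetical.
```python
def dualpipe_stages(d: int) -> dict[int, set[int]]:
    """Stages held by each device in a d-device bidirectional (DualPipe-like) schedule:
    stage i for the top-down direction plus stage d-1-i for the bottom-up direction."""
    return {dev: {dev, d - 1 - dev} for dev in range(d)}

placement = dualpipe_stages(8)
assert placement[0] == placement[7] == {0, 7}   # mirrored devices hold identical stages
assert placement[3] == placement[4] == {3, 4}
# Each of the 8 stages is stored on two devices -> the 2x parameter redundancy.
assert sum(len(stages) for stages in placement.values()) == 2 * 8
```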
### Cut in half
By keeping only the first half of the devices (and grafting the late stages of the bottom-up microbatches onto the early stages of the top-down ones), we obtain a schedule without the "dual" that has exactly the same bubble rate, memory footprint, and other properties.

Figure 2. Illustration of halving the devices & grafting the stages.

Figure 3. The Cut-in-half schedule.
Let's call this schedule the Cut-in-half schedule. It has no duplication of parameters; however, since the number of devices is also halved, the _per-device_ parameter memory remains unchanged. Following the same philosophy, we can design an 8-device Cut-in-half schedule by increasing the number of pipeline stages and halving the layers per stage. Due to the halved layers, the per-device parameter memory is brought down to half of the original amount.
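As a back-of-the-envelope check, the snippet below builds the V-shape placement we assume for an 8-device Cut-in-half schedule and confirms that every stage is stored exactly once and that per-device parameters are halved relative to DualPipe. This is a sketch under our own assumptions; `cut_in_half_stages` and the 64-layer model are hypothetical.
```python
def cut_in_half_stages(d: int) -> dict[int, set[int]]:
    """V-shape placement on d devices with 2*d half-sized stages:
    device i runs stage i on the way down and stage 2*d-1-i on the way up."""
    return {dev: {dev, 2 * d - 1 - dev} for dev in range(d)}

d, layers = 8, 64                                   # illustrative 64-layer model
placement = cut_in_half_stages(d)                   # 16 stages of 4 layers each

assert placement[0] == {0, 15} and placement[7] == {7, 8}
assert sum(len(s) for s in placement.values()) == 2 * d   # every stage stored exactly once

dualpipe_layers_per_device = 2 * (layers // d)            # two full-size stages: 16 layers
cut_in_half_layers_per_device = 2 * (layers // (2 * d))   # two half-size stages:  8 layers
assert cut_in_half_layers_per_device == dualpipe_layers_per_device // 2
```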
We show a detailed comparison table below; all schedules are compared assuming the same number of devices (denoted as $d$).
Table 1. Comparison of various pipeline schedules
| Method | Bubble | Parameter | Activation | PP Communication |
|-------------|---------------------------------|-----------|------------|---|
| 1F1B | ($d$-1)(F+B) | 1× | $d$ | 1× |
| ZB1P | ($d$-1)(F+B-2W) | 1× | $d$ | 1× |
| DualPipe | ($d$/2-1)(F&B+B-3W) | 2× | $d$+1 | 1× |
| Cut-in-half* | (($d$-1)/2)(F&B+B-3W) | 1× | $d$+1/2 | 2× |
\* Note: $d$ is the number of devices. For Cut-in-half, it is the number of devices after cutting.
Cut-in-half doubles the PP communication volume compared to other methods. However, the benefit of halving parameter memory outweighs this, as PP communication is relatively minor compared to EP communication.
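To give the formulas in Table 1 a concrete feel, the snippet below plugs in illustrative timings (F = B = 1, W = 0.5, with the overlapped F&B block approximated as F + B). These numbers are assumptions for intuition only, not measurements.
```python
d = 8                        # number of devices
F, B, W = 1.0, 1.0, 0.5      # hypothetical per-microbatch timings, not measurements
FB = F + B                   # approximate the overlapped F&B block as F + B

bubbles = {
    "1F1B":        (d - 1) * (F + B),
    "ZB1P":        (d - 1) * (F + B - 2 * W),
    "DualPipe":    (d / 2 - 1) * (FB + B - 3 * W),
    "Cut-in-half": ((d - 1) / 2) * (FB + B - 3 * W),
}
for name, bubble in bubbles.items():
    print(f"{name:12s} bubble = {bubble:5.2f}")
# Under these assumptions: 1F1B 14.00, ZB1P 7.00, DualPipe 4.50, Cut-in-half 5.25
```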
## Cut-in-half is an EP-specialized ZB-V Schedule
The Cut-in-half schedule is reminiscent of the V-Shape schedules, so we highlight its connections to prior works. V-Shape (wave-like) schedules were first proposed in [3], improved to achieve zero bubble in the ZB-V schedule [1], and further modified in [2] for a reduced memory footprint.
Pipeline bubbles in DualPipe/Cut-in-half partially result from the specialization for Expert Parallelism (EP): the forward and backward passes are overlapped in the stable phase to mitigate EP communication overhead. Without considering EP, these bubbles can be further reduced, effectively converting Cut-in-half into a ZB-V schedule.
1. Untie F/B and squeeze
As shown in the figure below, untying the overlapped forward and backward passes into separate ones leads to more flexible dependencies, making it possible to squeeze the schedule. The schedule now looks more like a ZB-V schedule.

2. Untie B/W in the cool-down phase and reorder
By further untying B/W in the cool-down phase and bypassing the synchronizations (introduced in Section 4 of [1]), we obtain the ZB-V schedule and achieve zero bubble. A minimal sketch of this B/W split is shown after the figure below.

## Citation
If you find this blog helpful, please consider citing:
```
@misc{qi2025dual,
    title={DualPipe could be better without the Dual},
    author={Penghui Qi and Xinyi Wan and Guangxing Huang and Min Lin},
    year={2025},
    howpublished={\url{https://hackmd.io/@ufotalent/r1lVXsa9Jg}},
    note={Blog},
}
```
[1] https://arxiv.org/pdf/2401.10241
[2] https://arxiv.org/abs/2405.15362
[3] https://dl.acm.org/doi/pdf/10.1145/3581784.3607073