# **Lab 6: DeiT-S Pipeline Parallelism**
## Part 1: Pipeline Parallelism Implementation (40%)
Done!
## Part 2: Experiment (40%)
**Q1(15%)**
**Q1-1(5%): A naive way of splitting the model is to split it into equal sizes (pippy.split_into_equal_size). Is this a good policy? What are the split points of this policy?**
Ans1-1:
To maximize pipeline performance, we want every stage to take roughly the same processing time. The split_into_equal_size policy cuts the model at module boundaries so that each stage holds roughly the same amount of parameters; since DeiT-S consists of 12 transformer blocks that dominate the parameter count, a 4-stage equal-size split places the split points roughly every three blocks. Splitting by size is a reasonable first step, but equal size does not strictly guarantee equal processing time across stages. In practice, however, the per-block cost of DeiT-S is fairly uniform, so this policy still yields similar stage durations in most cases.
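A minimal sketch of applying this policy is shown below. The import paths, the Pipe.from_tracing argument names, and the batch size are assumptions based on the PiPPy and DeiT examples we followed and may differ between releases.

```python
# Sketch: equal-size split of DeiT-S into 4 pipeline stages.
# NOTE: import paths and argument names follow the PiPPy examples we used
# and may differ in other PiPPy releases -- treat them as assumptions.
import torch
from pippy import split_into_equal_size
from pippy.IR import Pipe

model = torch.hub.load('facebookresearch/deit:main',
                       'deit_small_patch16_224', pretrained=True)
model.eval()

example_input = torch.randn(32, 3, 224, 224)   # one full batch (batch size assumed)

split_policy = split_into_equal_size(4)        # 4 stages of roughly equal parameter size
pipe = Pipe.from_tracing(model,
                         num_chunks=7,          # micro-batches per batch, as in our Q1-2 analysis
                         example_args=(example_input,),
                         split_policy=split_policy)
print(pipe)  # printing the traced submodules reveals the actual split points
```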
**Q1-2(10%): Please describe the split policy you have used in your implementation. Explain why it is better.**
Ans1-2:
We first tried adding split points manually through annotate_split_points(), placing cuts at transformer-block boundaries in an attempt to balance the processing time of each stage. In practice, tuning these manual split points proved difficult, and the resulting pipelines performed worse than the equal-size split. While a more fine-grained manual split might achieve better balance in theory, for this specific model the equal-size split already provides sufficient performance, so it is the policy we used in our final implementation.
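For reference, this is roughly how the manual attempt looked. The module names ('blocks.2', 'blocks.5', 'blocks.8') assume the standard DeiT/timm naming, and the PipeSplitWrapper.SplitPoint enum is taken from the PiPPy release we used; both are assumptions.

```python
# Sketch: manual split points at transformer-block boundaries (3 blocks per stage).
# Module names and the SplitPoint enum follow the PiPPy/timm conventions we
# observed; other releases may name these differently.
from pippy.IR import annotate_split_points, PipeSplitWrapper

annotate_split_points(model, {
    'blocks.2': PipeSplitWrapper.SplitPoint.END,   # stage 0: patch embed + blocks 0-2
    'blocks.5': PipeSplitWrapper.SplitPoint.END,   # stage 1: blocks 3-5
    'blocks.8': PipeSplitWrapper.SplitPoint.END,   # stage 2: blocks 6-8
})                                                 # stage 3: blocks 9-11 + head
# The annotated model is then traced with Pipe.from_tracing as above,
# this time without passing a split_policy.
```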

To simplify our analysis, we assumed equal processing time per stage (denoted Fi,j) and ignored data-transmission overhead. In an ideal scenario, 4 devices each processing 7 chunks would complete 4 * 7 = 28 chunks' worth of work; however, 12 chunk-slots are lost to pipeline idle time (the fill and drain bubbles), so only 16 chunks' worth of useful work is actually done. This gives a theoretical speedup of (4 * 7 - 12) / 7 ≈ 2.29, which closely matches the observed speedup of the equal-size split (2.36). Given this close match and the difficulty of achieving perfect balance with fine-grained manual splitting, we opted for the simpler yet effective equal-size split.
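Restated compactly (all numbers taken from the measurement above):

```latex
\text{speedup}_{\text{theory}} = \frac{4 \times 7 - 12}{7} = \frac{16}{7} \approx 2.29
\qquad
\text{speedup}_{\text{observed}} = 2.36
```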
**Q2(15%)**
Ans2:
Comparing the speedup of the 2-, 3-, 4-, and 6-stage pipelines (see the screenshots below), we found that the speedup generally increases with the number of stages. However, the 6-stage result is lower than the 4-stage one, with the speedup dropping from 3.88 to 3.47. We believe that beyond a certain point the pipeline stages can no longer be kept balanced: when some stages carry a much heavier computational load than others, overall performance falls short of the ideal. In addition, the increased communication overhead between stages becomes a bottleneck even though the computation is parallelized.
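For each configuration, the speedup is the serial wall-clock time divided by the pipeline wall-clock time. Below is a minimal timing helper in the spirit of what we used; the helper name and the run_serial / run_pipeline callables are hypothetical placeholders, not PiPPy APIs.

```python
# Sketch: wall-clock timing used to compute speedup = serial_time / pipeline_time.
import time

def timeit(fn, warmup=2, iters=10):
    """Average wall-clock seconds per call of `fn` after a short warm-up."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Usage (run_serial / run_pipeline are placeholders for our two execution paths):
#   serial_time   = timeit(lambda: run_serial(batch))
#   pipeline_time = timeit(lambda: run_pipeline(batch))   # measured on rank 0
#   print(f"speedup = {serial_time / pipeline_time:.2f}")
```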

**Q3(10%)**
Based on the experimental results, we can see that the speedup of an n-stage pipeline falls well short of n, and the gap widens as the number of stages increases.
We used torch.profiler to analyze execution in both serial mode and pipeline mode. In the stage-2 profile (shown below), the most time-consuming operations are gloo:send and gloo:recv, which are used to send and receive activations between stages. This indicates that the model spends a significant amount of time on inter-stage communication during pipeline parallelism.
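A minimal sketch of the profiling setup is given below; profile_rank and pipeline_forward are our own hypothetical names, while the torch.profiler calls themselves are standard PyTorch APIs.

```python
# Sketch: profiling one rank's pipeline execution with torch.profiler.
# Sorting by total CPU time is what surfaced gloo:send / gloo:recv at the top
# of the stage-2 table.
from torch.profiler import profile, ProfilerActivity

def profile_rank(pipeline_forward):
    """Profile one rank's pipeline execution and print the hottest CPU ops."""
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        pipeline_forward()   # e.g. this rank's stage call for one batch
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```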



Combined with the previous conclusion, this explains why we cannot expect additional stages to yield a proportional speedup.
## Part 3: Improve Speed Up (20%)
**Report(15%)**
To explain how we further improved performance and to convey our interpretation of the lab, we break the work down into four parts, followed by a consolidated code sketch after the list.
1. Model Loading and Splitting:
First, the model is loaded onto the CPU. To keep the processing load balanced across stages, which improves overall performance, we adopt the split_into_equal_size policy, which automatically divides the model into 4 stages. Our experiments showed that this yields an acceptable split for DeiT-S.
2. Pipeline Creation:
Using Pipe.from_tracing, we trace the model's structure and generate a pipeline according to our split policy, configured with 4 stages and a depth of 20 chunks. However, we noted that increasing the depth can reduce overall efficiency, as some stages still sit idle while the pipeline fills at the beginning and drains at the end.
3. Pipeline Stages:
PipelineStage creates an individual processing stage on each device; every stage has a unique rank that defines its position in the pipeline.
4. Execution and Speedup:
During execution, each stage takes input according to its rank (e.g., rank 0 consumes the initial input), and the final output is produced by the last-ranked stage. As shown below, this parallel execution achieves a significant speedup of 3.88 compared to serial execution.
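The consolidated sketch of the four steps follows. The import paths, the Pipe.from_tracing and PipelineStage signatures, the torch.hub model name, and the batch size are assumptions based on the PiPPy and DeiT examples we followed; other releases may differ slightly.

```python
# Sketch of the full Part 3 flow: load DeiT-S on CPU, split into 4 equal-size
# stages, build the pipeline with 20 chunks, create one PipelineStage per rank,
# and run. Signatures follow the PiPPy examples we used (assumptions).
import os
import torch
import torch.distributed as dist
from pippy import split_into_equal_size        # import paths vary by PiPPy release
from pippy.IR import Pipe
from pippy.PipelineStage import PipelineStage

def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    device = torch.device("cpu")

    # 1. Load DeiT-S on the CPU.
    model = torch.hub.load('facebookresearch/deit:main',
                           'deit_small_patch16_224', pretrained=True).eval()
    example_input = torch.randn(32, 3, 224, 224)   # batch size is an assumption

    # 2. Trace and cut the model: 4 equal-size stages, depth of 20 chunks.
    pipe = Pipe.from_tracing(model,
                             num_chunks=20,
                             example_args=(example_input,),
                             split_policy=split_into_equal_size(world_size))

    # 3. One stage per rank; the rank defines its position in the pipeline.
    stage = PipelineStage(pipe, rank, device)

    # 4. Rank 0 feeds the input, the last rank produces the final output,
    #    and intermediate ranks only relay activations.
    with torch.no_grad():
        if rank == 0:
            stage(example_input)
        elif rank == world_size - 1:
            output = stage()
        else:
            stage()

if __name__ == "__main__":
    main()
```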

**Performance Evaluation of your work(5%)**
Our achieved speedup of ~3.88 corresponds to 3 pts on the grading scale.