EdgeAI-Lab6-Group7-Report
Part 2: Experiment
Q1
- Q1.1 A naive way of splitting the model is to split into equal size, is this a good policy? What are the split points of this policy?
- speed-up: 3.5167
- I don't think this is a good policy, because it introduces the following problems (a minimal sketch of such a naive split is shown below):
- Imbalanced workload distribution:
- Splitting the model into equal-sized chunks ignores the computational complexity of the individual layers. Some layers require significantly more computation than others, so the workload across the nodes becomes imbalanced.
- Increased communication overhead:
- Equal-sized splits do not optimize the data flow between layers. If layers that exchange large intermediate tensors end up on different nodes, the communication overhead between stages increases.
- split point:
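For illustration, here is a minimal sketch of what an equal-size split could look like, assuming the model can be expressed as an `nn.Sequential`; the toy CNN below is a placeholder, not the lab's actual model.

```python
import torch.nn as nn

def equal_split(model: nn.Sequential, num_stages: int = 4):
    """Naively split a sequential model into `num_stages` chunks with
    (roughly) the same number of layers, ignoring per-layer cost and
    the size of the activations crossing each split point."""
    layers = list(model.children())
    chunk = (len(layers) + num_stages - 1) // num_stages  # ceiling division
    return [nn.Sequential(*layers[i:i + chunk])
            for i in range(0, len(layers), chunk)]

# Toy placeholder model with 8 layers -> 4 stages of 2 layers each.
toy = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10), nn.Softmax(dim=1),
)
print([len(stage) for stage in equal_split(toy)])  # [2, 2, 2, 2]
```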
- Q1-2 (10%): Please describe the split policy you have used in your implementation. Explain why it is better.
- Initially, I split the model at the layer with the smallest output feature map, which reduces the communication overhead of the pipeline-parallel method across the four machines. As a result, this method outperformed the average split, achieving a speed-up of 3.6086 (result below).
- split point:
- Result: speed-up: 3.6086
- Subsequently, on top of the custom split strategy, we pruned the model to further improve the speed-up. After pruning and fine-tuning, the model reached a speed-up of 4.09, at the cost of accuracy dropping to 88.59%. For more details, refer to Part 3.
- The custom split strategy performs better than the average split because it distributes the workload strategically, minimizing communication overhead and improving resource utilization; this leads to more efficient parallel execution and a higher overall speed-up. A sketch of how such split points can be chosen is shown below.
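A minimal sketch of the idea behind the custom split: measure the size of each layer's output and place split points where the activation that must be sent to the next node is smallest. `activation_sizes` is an illustrative helper, not the code actually used in the lab.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def activation_sizes(model: nn.Sequential, example_input: torch.Tensor):
    """Number of elements each layer would send to the next stage if the
    model were cut right after that layer."""
    sizes = []
    x = example_input
    for layer in model:
        x = layer(x)
        sizes.append(x.numel())
    return sizes

# Hypothetical usage with the toy model from the previous sketch:
#   sizes = activation_sizes(toy, torch.randn(1, 3, 32, 32))
# Good split points are the layer indices where `sizes` is smallest,
# balanced against keeping each stage's compute roughly equal.
```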
Comparison of Profile Executions
Serial Execution (Single Machine)
- Profile Characteristics:
- Single-threaded execution.
- Processes a larger amount of data (the full batch) in each iteration on a single machine.
- No parallelism, thus no communication overhead.
- Longer execution time as all computations are performed sequentially.
- Throughput and speedup are limited by the single-thread performance.
Average Split on 4 Machines
- Profile Characteristics:
- Execution split evenly across 4 machines.
- Each machine processes a smaller amount of data (micro-batches) per iteration.
- Moderate parallelism with balanced load across all nodes.
- Reduced execution time compared to serial execution due to parallel processing.
- Some communication overhead due to data transfer between stages.
- Requires many `gloo:send` operations for transferring activations between stages.
Custom Split on 4 Machines
- Profile Characteristics:
- Execution optimized based on custom split strategy.
- Each machine processes a smaller amount of data (micro-batches) per iteration.
- Further enhanced parallelism and reduced execution time compared to the average split.
- Communication overhead is managed more effectively, leading to an improved speed-up.
- Requires many `gloo:send` operations for transferring activations between stages (see the profiler sketch below).
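The traces compared above were collected with the PyTorch profiler; a minimal sketch of profiling one pipeline stage on one node follows. `stage_model` and `micro_batch` are placeholders, and the distributed setup (torch.distributed with the gloo backend) is assumed to be initialised elsewhere.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Placeholders standing in for one pipeline stage and one micro-batch.
stage_model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
micro_batch = torch.randn(8, 3, 32, 32)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        out = stage_model(micro_batch)
        # In the real pipeline the output is then sent to the next rank,
        # which is where the many `gloo:send` entries in the trace appear:
        #   torch.distributed.send(out, dst=next_rank)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```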
Q2
In the previous setup, we split the model into 4 stages across 4 devices. Now, let's try splitting the model into more or fewer stages. Please compare the speed-up between 2-, 3-, 4-, and 6-stage pipelines.
- Multi-stage comparison:
- 2-stage
- speed-up: 2.02
- This setup utilizes fewer stages, reducing the complexity of synchronization but limiting the parallelism potential.
- 3-stage
- speed-up: 2.71
- It strikes a balance between reducing communication overhead and utilizing available nodes efficiently.
- 4-stage
- speed-up: 3.51 (refer to Q1.1)
- This setup efficiently utilizes all 4 nodes, maximizing parallelism; however, it incurs larger communication overhead than configurations with fewer stages.
- 6-stage
- speed-up: 2.62
- The main reason for the lower speed-up is the constraint of having only 4 nodes: in a 6-stage setup, some nodes must handle multiple stages, which increases their computational load and the communication overhead.
- This configuration would benefit from having 6 nodes, so that each node handles a single stage, potentially yielding a higher speed-up. One possible stage-to-node mapping is sketched below.
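One plausible way stages could be mapped onto nodes when there are more stages than nodes (hypothetical; the lab framework may assign ranks differently):

```python
def assign_stages(num_stages: int, num_nodes: int):
    """Round-robin mapping of pipeline stages to nodes."""
    return {stage: stage % num_nodes for stage in range(num_stages)}

print(assign_stages(6, 4))
# {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1}
# Nodes 0 and 1 each run two stages, so they carry roughly twice the
# compute and extra send/recv traffic, which caps the overall speed-up.
```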
Q3
- We used the PyTorch profiler to analyze:
- Serial model
- Average split pipeline model
- Custom split pipeline model
- While increasing the number of pipeline stages can enhance parallelism and potentially increase the speed-up, factors such as communication overhead, load imbalance, resource contention, and the limit described by Amdahl's Law prevent the speed-up from reaching the theoretical maximum equal to the number of stages. Two standard expressions for this limit are given below.
- Examining the models' execution profiles makes it evident that, while parallel execution reduces the overall computation time, the associated overheads, communication, and pipeline inefficiencies limit the achievable speed-up, keeping it below the ideal value of n.
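As a rough reference (standard textbook expressions, not measurements from this lab): Amdahl's Law with parallelizable fraction $p$ on $n$ workers, and the ideal speed-up of a balanced $n$-stage pipeline fed with $m$ micro-batches:

$$
S_{\text{Amdahl}}(n) = \frac{1}{(1-p) + \frac{p}{n}},
\qquad
S_{\text{pipeline}}(n, m) = \frac{n\,m}{n + m - 1}
$$

Both stay strictly below $n$ whenever the serial fraction is non-zero or the number of micro-batches is finite, and per-stage communication time lowers them further.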
Part 3: Improve Speed-Up
Methodologies:
https://github.com/VainF/Torch-Pruning/blob/master/examples/transformers/prune_timm_vit.py
- Re-implement the forwarding
- Prepare a pruner
- Prune model (a hedged sketch of these two steps is shown after this list)
- Fine-tune model
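A minimal sketch of steps 2-3, loosely following the linked prune_timm_vit.py example. The model, pruning ratio, and ignored layers below are placeholders rather than our actual settings; the `pruning_ratio` argument is named `ch_sparsity` in older Torch-Pruning releases; and the real example also re-implements the ViT forward pass so that pruned attention heads still work, which is omitted here.

```python
import torch
import timm
import torch_pruning as tp

# Placeholder model and input; not the model/ratios used in our experiments.
model = timm.create_model("vit_base_patch16_224", pretrained=False)
example_inputs = torch.randn(1, 3, 224, 224)

importance = tp.importance.MagnitudeImportance(p=2)  # L2-norm channel importance
ignored_layers = [model.head]                        # keep the classifier head intact

pruner = tp.pruner.MetaPruner(
    model,
    example_inputs,
    importance=importance,
    pruning_ratio=0.5,          # `ch_sparsity` in older Torch-Pruning versions
    ignored_layers=ignored_layers,
    global_pruning=False,
)
pruner.step()                   # physically removes channels in place

# Step 4: fine-tune the pruned model on the training set to recover accuracy.
```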
Hyperparameters:
1. Re-implement the forwarding
2. Prepare a pruner:
3. Prune model
4. Fine-tuning:
Pruned model accuracy test
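For reference, a minimal sketch of how the pruned model's top-1 accuracy could be measured; `pruned_model` and `test_loader` are placeholders for our pruned network and the evaluation data loader.

```python
import torch

@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    """Return top-1 accuracy of `model` over `loader`."""
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

# Hypothetical usage:
#   acc = evaluate(pruned_model, test_loader)
#   print(f"Pruned model accuracy: {acc:.2%}")  # we measured 88.59% after pruning
```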
