# EdgeAI-Lab6-Group7-Report
[toc]
## Part 2: Experiment
### Q1
- Q1.1: A naive way of splitting the model is to split it into equal-sized parts. Is this a good policy? What are the split points of this policy?
  - Speed-up: 3.5167
  - We don't think this is a good policy, because it introduces the following problems:
    - Imbalanced workload distribution: splitting the model into equal-sized parts ignores the computational complexity of individual layers. Some layers require significantly more computation than others, leading to an imbalanced workload across the nodes.
    - Increased communication overhead: equal-sized splits do not optimize the data flow between layers. If layers that heavily interact are placed on different nodes, the communication overhead increases.
  - Split points (a short sketch of how they are derived follows below):
    ```
    {
        blocks.2.mlp_norm,
        blocks.5.mlp_norm,
        blocks.8.mlp_norm
    }
    ```
- Q1.2 (10%): Please describe the split policy you used in your implementation and explain why it is better.
  - Initially, we split the model at the layers with the smallest output feature size, which reduces the communication overhead of the pipeline-parallel method across four machines. As a result, this method outperformed the average split, achieving a speed-up of 3.6086 (see figure below). A sketch of how the per-layer output sizes could be measured follows after this answer.
  - Split points:
    ```
    {
        blocks.2.attn.proj,
        blocks.5.attn.proj,
        blocks.8.attn.proj
    }
    ```
  - Result: speed-up of 3.6086

  - Subsequently, using a custom split strategy, we pruned our model to further improve the speed-up. After pruning and fine-tuning, the model reached a speed-up of 4.09, at the cost of accuracy dropping to 88.59%. For more details, refer to Part 3.
  - The custom split strategy outperforms the average split because it distributes the workload strategically, minimizing communication overhead and improving resource utilization. This leads to more efficient parallel execution and a higher overall speed-up.
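A minimal sketch of how the per-layer output sizes could be gathered to choose these split points, assuming forward hooks over one example batch (the helper name and bookkeeping are illustrative, not the exact code we ran):

```python
import torch

def measure_output_sizes(model, example_input):
    """Record how many elements each module outputs; smaller outputs are cheaper to send."""
    sizes, hooks = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            if isinstance(output, torch.Tensor):
                sizes[name] = output.numel()
        return hook

    for name, module in model.named_modules():
        hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(example_input)
    for h in hooks:
        h.remove()
    return sizes  # candidate split points are the layers with the smallest values
```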
### Comparison of Profile Executions
#### Serial Execution (Single Machine)
- Profile Characteristics:
  - Single-threaded execution.
  - Processes the full workload in each iteration on a single machine.
  - No parallelism, and therefore no communication overhead.
  - Longer execution time, since all computations are performed sequentially.
  - Throughput and speed-up are limited by single-thread performance.
#### Average Split on 4 Machines
- Profile Characteristics:
  - Execution split evenly across 4 machines.
  - Each machine processes a smaller portion of the work in each iteration.
  - Moderate parallelism with an equal number of layers per node.
  - Reduced execution time compared to serial execution thanks to parallel processing.
  - Some communication overhead due to data transfer between stages.
  - Requires many `gloo:send` operations (see the send/recv sketch after this section).
#### Custom Split on 4 Machines
- Profile Characteristics:
  - Execution partitioned according to the custom split strategy.
  - Each machine processes a smaller portion of the work in each iteration.
  - Further improved parallelism and reduced execution time compared to the average split.
  - Communication overhead is managed more effectively, leading to a better speed-up.
  - Requires many `gloo:send` operations.
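The `gloo:send` entries in the traces come from point-to-point transfers of intermediate activations between consecutive stages. A minimal sketch of one pipeline step, assuming `torch.distributed` with the gloo backend, one stage per rank, and a pre-allocated receive buffer (the function and variable names are illustrative):

```python
import torch
import torch.distributed as dist

def run_stage(stage_module, x, rank, world_size):
    """One pipeline step on one rank: receive, compute, send.

    `x` is the input batch on rank 0, or a pre-allocated buffer of the
    right shape on the other ranks.
    """
    if rank > 0:
        dist.recv(x, src=rank - 1)        # wait for the previous stage's activations
    out = stage_module(x)                  # this rank's slice of the model
    if rank < world_size - 1:
        dist.send(out, dst=rank + 1)       # appears as gloo:send in the profile
    return out
```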
### Q2
In the previous setup, we split the model into 4 stages across 4 devices. Now, let's try splitting the model into more or fewer stages. Please compare the speed-up between the 2-, 3-, 4-, and 6-stage pipelines.
- Multi-stage comparison:
  - 2-stage
    - Speed-up: 2.02
    - This setup uses fewer stages, reducing synchronization complexity but limiting the parallelism potential.
  - 3-stage
    - Speed-up: 2.71
    - It strikes a balance between reducing communication overhead and utilizing the available nodes efficiently.
  - 4-stage
    - Speed-up: 3.51 (refer to Q1.1)
    - This setup efficiently utilizes all 4 nodes, maximizing parallelism, but incurs a larger communication overhead than the configurations with fewer stages.
  - 6-stage
    - Speed-up: 2.62
    - The main reason for the lower speed-up is the constraint of having only 4 nodes. In a 6-stage setup, some nodes must handle multiple stages, leading to increased computational load and communication overhead (a stage-to-node assignment sketch follows this list).
    - This configuration would benefit from having 6 nodes, where each node could handle a single stage, potentially yielding a higher speed-up.
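A minimal sketch of the stage-to-node assignment problem for the 6-stage case, assuming 12 transformer blocks split evenly into stages and a simple round-robin mapping onto the 4 ranks (this mapping only illustrates the load imbalance and is not necessarily the exact assignment used by the lab framework):

```python
def assign_stages(num_blocks=12, num_stages=6, num_nodes=4):
    """Map each stage (a contiguous group of blocks) onto a node, round-robin."""
    blocks_per_stage = num_blocks // num_stages
    assignment = {}
    for stage in range(num_stages):
        first = stage * blocks_per_stage
        last = first + blocks_per_stage - 1
        node = stage % num_nodes  # with 6 stages on 4 nodes, nodes 0 and 1 host two stages each
        assignment[f"stage {stage} (blocks {first}-{last})"] = f"node {node}"
    return assignment

print(assign_stages())
```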
### Q3
- We use the torch profiler to analyze:
  - the serial model,
  - the average-split pipeline model, and
  - the custom-split pipeline model (a profiling sketch follows below).
- While increasing the number of pipeline stages can enhance parallelism and potentially increase the speed-up, factors such as communication overhead, load imbalance, resource contention, and the limits imposed by Amdahl's Law prevent the speed-up from reaching the theoretical maximum equal to the number of stages.
- Examining the models' execution profiles makes it evident that, although parallel execution reduces the overall computation time, the associated overheads, communication, and inefficiencies limit the achievable speed-up, preventing it from reaching the ideal value of n (the number of stages).
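A minimal sketch of how each variant can be profiled with `torch.profiler` (the tiny stand-in model and input below are assumptions that keep the snippet runnable; in the lab, the serial and pipeline models are profiled the same way):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(224, 10)   # stand-in for the serial / pipeline-stage model
inputs = torch.randn(16, 224)      # stand-in for one evaluation batch

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(inputs)

# Sorting by total CPU time makes compute vs. communication operators
# (e.g. gloo:send in the pipeline runs) easy to compare across variants.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```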
## Part 3: Improve Speed-Up

**Methodologies:**
https://github.com/VainF/Torch-Pruning/blob/master/examples/transformers/prune_timm_vit.py
1. Re-implement the forward function
2. Prepare a pruner
3. Prune model
4. Fine-tune model
**Hyperparameters:**
```python
BATCH_SIZE = 16
LEARNING_RATE = 5e-4
NUM_EPOCH = 15
```
**1. Re-implement the forward function**
```python=
import torch.nn.functional as F

# Here we re-implement the forward function of timm.models.vision_transformer.Attention,
# as the original forward function requires the input and output channels to be identical.
def forward(self, x):
    """https://github.com/huggingface/pytorch-image-models/blob/054c763fcaa7d241564439ae05fbe919ed85e614/timm/models/vision_transformer.py#L79"""
    B, N, C = x.shape
    qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
    q, k, v = qkv.unbind(0)
    q, k = self.q_norm(q), self.k_norm(k)

    if self.fused_attn:
        x = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.attn_drop.p,
        )
    else:
        q = q * self.scale
        attn = q @ k.transpose(-2, -1)
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        x = attn @ v

    # Use -1 instead of C so the reshape still works after the projection width is pruned.
    # Original implementation: x = x.transpose(1, 2).reshape(B, N, C)
    x = x.transpose(1, 2).reshape(B, N, -1)
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
```
**2. Prepare a pruner:**
```python=
imp = tp.importance.GroupHessianImportance()
example_image = next(iter(train_loader))[0][0]
ignored_layers = [model.head]  # never prune the classification head

# Patch every attention block with the re-implemented forward and record its head count.
num_heads = {}
for m in model.modules():
    if isinstance(m, timm.models.vision_transformer.Attention):
        m.forward = forward.__get__(m, timm.models.vision_transformer.Attention)
        num_heads[m.qkv] = m.num_heads

pruner = tp.pruner.MetaPruner(
    model,
    example_image,
    global_pruning=True,        # if False, a uniform pruning ratio is assigned to every layer
    importance=imp,             # importance criterion for parameter selection
    pruning_ratio=0.15,         # target pruning ratio
    ignored_layers=ignored_layers,
    num_heads=num_heads,        # number of heads in self-attention
    prune_num_heads=False,      # reduce num_heads by pruning entire heads (default: False)
    prune_head_dims=False,      # reduce head_dim by pruning feature dims of each head (default: True)
    head_pruning_ratio=0.0,     # fraction of heads to remove; only used when prune_num_heads=True (default: 0.0)
    round_to=2
)
```
**3. Prune model**
```python=
if isinstance(imp, (tp.importance.GroupTaylorImportance, tp.importance.GroupHessianImportance)):
    model.zero_grad()
    if isinstance(imp, tp.importance.GroupHessianImportance):
        imp.zero_grad()
    print("Accumulating gradients for pruning...")
    # Use a few mini-batches to estimate parameter importance before pruning.
    for k, (imgs, lbls) in enumerate(train_loader):
        if k >= 10:
            break
        imgs = imgs.to(device)
        lbls = lbls.to(device)
        output = model(imgs)
        if isinstance(imp, tp.importance.GroupHessianImportance):
            # Hessian importance requires per-sample gradients.
            loss = torch.nn.functional.cross_entropy(output, lbls, reduction='none')
            for l in loss:
                model.zero_grad()
                l.backward(retain_graph=True)
                imp.accumulate_grad(model)
        elif isinstance(imp, tp.importance.GroupTaylorImportance):
            loss = torch.nn.functional.cross_entropy(output, lbls)
            loss.backward()

for i, g in enumerate(pruner.step(interactive=True)):
    g.prune()
```
**4. Fine-tuning:**
```python=
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, NUM_EPOCH)
criterion = nn.CrossEntropyLoss()

best_acc = 0
for epoch_num in tqdm(range(1, NUM_EPOCH + 1)):
    train_one_epoch(model, criterion, optimizer, train_loader, device, scheduler)
    acc = evaluate_model(model, test_loader, device)
    print(f"epoch {epoch_num}:", acc)
    if acc > best_acc:
        # Keep the checkpoint with the best test accuracy so far.
        torch.save(model, "prune_iter.pth")
        best_acc = acc
```
**Pruned model accuracy test**
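The `evaluate_model` helper used above is not shown in the report; a minimal sketch of how the pruned model's top-1 accuracy could be measured on the test set (an assumption about the helper, not necessarily the exact code used):

```python
import torch

def evaluate_model(model, test_loader, device):
    """Top-1 classification accuracy over the test set."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for imgs, lbls in test_loader:
            imgs, lbls = imgs.to(device), lbls.to(device)
            preds = model(imgs).argmax(dim=1)
            correct += (preds == lbls).sum().item()
            total += lbls.size(0)
    return correct / total
```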
