
EdgeAI-Lab6-Group7-Report

Part 2: Experiment

Q1

  • Q1.1: A naive way of splitting the model is to split it into equal-size parts. Is this a good policy? What are the split points of this policy?
    • speed-up: 3.5167
    • I don't think this is a good policy, because it introduces problems such as:
      • Imbalanced workload distribution:
        • Splitting the model into equal-size parts ignores the computational complexity of the individual layers. Some layers require significantly more computation than others, so the workload across the nodes becomes imbalanced.
      • Increased communication overhead:
        • Equal-size splits do not optimize the data flow between layers. If layers that exchange large activations end up on different nodes, the communication overhead increases.
    • split points (the model's 12 transformer blocks divided evenly across 4 stages, so the boundaries fall after blocks 2, 5, and 8):

      ```
      {
          blocks.2.mlp_norm,
          blocks.5.mlp_norm,
          blocks.8.mlp_norm
      }
      ```
      
  • Q1.2 (10%): Please describe the split policy used in your implementation and explain why it is better.
    • Initially, we split the model at the layers with the smallest output feature maps, which reduces the communication overhead of the pipeline-parallel setup across four machines. As a result, this policy outperformed the average split, achieving a speed-up of 3.6086 (see the result below; a sketch of how such split points can be located follows this list).
    • split points:

      ```
      {
          blocks.2.attn.proj,
          blocks.5.attn.proj,
          blocks.8.attn.proj
      }
      ```
    
    • Result: speed-up: 3.6086
    • Subsequently, on top of the custom split strategy, we pruned the model to further improve the speed-up. After pruning and fine-tuning, the model reached a speed-up of 4.09, at the cost of accuracy dropping to 88.59%. For details, refer to Part 3.
    • The custom split strategy outperforms the average split because it distributes the workload deliberately, minimizing communication overhead and improving resource utilization. This yields more efficient parallel execution and a higher overall speed-up.
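
As mentioned above, split points like these can be located empirically: hook each module, run one forward pass, and rank modules by output activation size, since the activation at a split point is exactly what is sent between machines. The sketch below is a rough illustration; the timm model name, input resolution, and the hook-based measurement are assumptions, not our exact measurement script.

```python
import timm
import torch

# Rough sketch: hook every named module, run one dummy forward pass, and rank
# modules by output size; smaller outputs mean cheaper pipeline boundaries.
# Model name and input size are assumptions; adapt them to the actual setup.
model = timm.create_model("vit_base_patch16_224", pretrained=False).eval()

sizes = {}
def make_hook(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            sizes[name] = output.numel()  # elements crossing a potential split point
    return hook

handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules() if n]
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
for h in handles:
    h.remove()

# The smallest activations are candidate low-communication split points.
for name, numel in sorted(sizes.items(), key=lambda kv: kv[1])[:10]:
    print(name, numel)
```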

Comparison of Execution Profiles

Serial Execution (Single Machine)

  • Profile Characteristics:
    • Single-threaded execution.
    • Processes the full workload in each iteration on one machine.
    • No parallelism, and therefore no communication overhead.
    • Longer execution time, since all computations run sequentially.
    • Throughput and speed-up are limited by single-thread performance.

Average Split on 4 Machines

  • Profile Characteristics:
    • Execution split evenly across 4 machines.
    • Each machine processes a smaller slice of the data per iteration.
    • Moderate parallelism with a balanced load across all nodes.
    • Reduced execution time compared to serial execution, due to parallel processing.
    • Some communication overhead from data transfer between stages.
    • The profile contains many "gloo:send" operations (inter-stage activation transfers).

Custom Split on 4 Machines

  • Profile Characteristics:
    • Execution optimized according to the custom split strategy.
    • Each machine processes a smaller slice of the data per iteration.
    • Better parallelism and shorter execution time than the average split.
    • Communication overhead is managed more effectively, leading to improved speed-up.
    • The profile again contains many "gloo:send" operations (see the stage sketch below).
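
Each "gloo:send" in the profiles corresponds to one inter-stage activation transfer over the Gloo backend. Below is a minimal sketch of a hand-rolled stage loop, assuming `dist.init_process_group("gloo", ...)` has already been called and that rank r hosts stage r; `run_stage`, the shapes, and the rank layout are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.distributed as dist

# Minimal sketch of one pipeline stage's inner loop over the Gloo backend.
# Assumes dist.init_process_group("gloo", ...) was called and rank r hosts
# stage r; module, shapes, and names are illustrative assumptions.
def run_stage(stage_module, rank, world_size, num_micro_batches, act_shape):
    for _ in range(num_micro_batches):
        if rank == 0:
            x = torch.randn(act_shape)      # the real first stage reads input data here
        else:
            x = torch.empty(act_shape)
            dist.recv(x, src=rank - 1)      # appears as "gloo:recv" in the profiler
        with torch.no_grad():
            y = stage_module(x)
        if rank < world_size - 1:
            dist.send(y, dst=rank + 1)      # appears as "gloo:send" in the profiler
```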

Q2

In the previous setup, we split the model into 4 stages across 4 devices. Now, let's try splitting the model into more or fewer stages. Please compare the speed-up between 2-, 3-, 4-, and 6-stage pipelines.

  • 2 stages
    • speed-up: 2.02
    • This setup uses fewer stages, which reduces synchronization complexity but limits the potential for parallelism.
  • 3 stages
    • speed-up: 2.71
    • It strikes a balance between reducing communication overhead and using the available nodes efficiently.
  • 4 stages
    • speed-up: 3.51 (see Q1.1)
    • This setup uses all 4 nodes and maximizes parallelism, but it incurs a larger communication overhead than configurations with fewer stages.
  • 6 stages
    • speed-up: 2.62
    • The main reason for the lower speed-up is the constraint of having only 4 nodes: in a 6-stage setup, some nodes must host multiple stages, which increases both their computational load and the communication overhead (see the toy mapping below).
    • This configuration would benefit from having 6 nodes, one stage per node, which could yield a higher speed-up.
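
As a toy illustration of the 6-stage constraint (the actual stage-to-node assignment in our runs may differ), a round-robin mapping makes two of the four nodes host two stages each:

```python
# Toy round-robin mapping of 6 pipeline stages onto 4 nodes: with more stages
# than nodes, some nodes inevitably host two stages and do double the work.
stages, nodes = 6, 4
assignment = {stage: stage % nodes for stage in range(stages)}
print(assignment)  # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1}
```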

Q3

  • Using the PyTorch profiler, we analyzed:
    • the serial model
    • the average-split pipeline model
    • the custom-split pipeline model
  • While increasing the number of pipeline stages can enhance parallelism and potentially increase the speed-up, factors such as communication overhead, load imbalance, resource contention, and the limits described by Amdahl's Law keep the speed-up below the theoretical maximum of n (the number of stages).
  • The execution profiles make this evident: parallel execution reduces the overall computation time, but the associated communication overheads and scheduling inefficiencies cap the achievable speed-up below the ideal value of n. The toy bound below makes the same point quantitatively.
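
Even with zero communication cost, a pipeline with n equal stages and m micro-batches needs m + n - 1 stage-steps instead of m * n, so the ideal speed-up is S = mn / (m + n - 1), which is always strictly below n. A small numeric sketch (the micro-batch count of 32 is an assumed value, not our actual configuration):

```python
# Ideal pipeline speed-up with n equal stages and m micro-batches, ignoring
# all communication cost: m*n serial stage-steps are replaced by m + n - 1
# pipelined stage-steps (fill + steady state + drain).
def ideal_speedup(n_stages, n_microbatches):
    return n_stages * n_microbatches / (n_microbatches + n_stages - 1)

for n in (2, 3, 4, 6):
    print(f"{n} stages: <= {ideal_speedup(n, 32):.2f}x")
# Measured numbers deviate from this toy bound (in either direction) because
# of gloo send/recv time, unequal stage durations, and measurement noise.
```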

Part 3: Improve Speed-Up


Methodology (based on the Torch-Pruning ViT example):
https://github.com/VainF/Torch-Pruning/blob/master/examples/transformers/prune_timm_vit.py

  1. Re-implement the attention forward pass
  2. Prepare a pruner
  3. Prune the model
  4. Fine-tune the model

Hyperparameters:

```python
BATCH_SIZE = 16
LEARNING_RATE = 5e-4
NUM_EPOCH = 15
```

1. Re-implement the attention forward pass

```python
import torch.nn.functional as F

# Here we re-implement the forward function of timm.models.vision_transformer.Attention,
# as the original forward function requires the input and output channel counts to be
# identical; after pruning, head_dim can shrink, so we reshape with -1 instead of C.
def forward(self, x):
    """https://github.com/huggingface/pytorch-image-models/blob/054c763fcaa7d241564439ae05fbe919ed85e614/timm/models/vision_transformer.py#L79"""
    B, N, C = x.shape
    qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
    q, k, v = qkv.unbind(0)
    q, k = self.q_norm(q), self.k_norm(k)

    if self.fused_attn:
        x = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.attn_drop.p,
        )
    else:
        q = q * self.scale
        attn = q @ k.transpose(-2, -1)
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        x = attn @ v

    x = x.transpose(1, 2).reshape(B, N, -1)  # original: x.transpose(1, 2).reshape(B, N, C)
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
```

2. Prepare a pruner:

```python
import timm
import torch_pruning as tp

imp = tp.importance.GroupHessianImportance()
# One example input for dependency tracing (unsqueeze adds the batch dimension).
example_image = next(iter(train_loader))[0][0].unsqueeze(0)
ignored_layers = [model.head]

# Patch every attention module with the re-implemented forward and record its head count.
num_heads = {}
for m in model.modules():
    if isinstance(m, timm.models.vision_transformer.Attention):
        m.forward = forward.__get__(m, timm.models.vision_transformer.Attention)
        num_heads[m.qkv] = m.num_heads

pruner = tp.pruner.MetaPruner(
    model,
    example_image,
    global_pruning=True,      # if False, a uniform pruning ratio is assigned to every layer
    importance=imp,           # importance criterion for parameter selection
    pruning_ratio=0.15,       # target pruning ratio
    ignored_layers=ignored_layers,
    num_heads=num_heads,      # number of heads in self-attention
    prune_num_heads=False,    # reduce num_heads by pruning entire heads (default: False)
    prune_head_dims=False,    # reduce head_dim by pruning feature dims of each head (default: True)
    head_pruning_ratio=0.0,   # fraction of heads to remove; only used when prune_num_heads=True
    round_to=2,
)
```

3. Prune the model

```python
import torch

# Accumulate gradients on a few batches so the (Hessian/Taylor) importance
# criterion can score parameter groups before pruning.
if isinstance(imp, (tp.importance.GroupTaylorImportance, tp.importance.GroupHessianImportance)):
    model.zero_grad()
    if isinstance(imp, tp.importance.GroupHessianImportance):
        imp.zero_grad()
    print("Accumulating gradients for pruning...")
    for k, (imgs, lbls) in enumerate(train_loader):
        if k >= 10:
            break
        imgs = imgs.to(device)
        lbls = lbls.to(device)
        output = model(imgs)
        if isinstance(imp, tp.importance.GroupHessianImportance):
            # Per-sample losses: backprop one at a time to accumulate gradients.
            loss = torch.nn.functional.cross_entropy(output, lbls, reduction='none')
            for l in loss:
                model.zero_grad()
                l.backward(retain_graph=True)
                imp.accumulate_grad(model)
        elif isinstance(imp, tp.importance.GroupTaylorImportance):
            loss = torch.nn.functional.cross_entropy(output, lbls)
            loss.backward()

# Execute the pruning plan group by group.
for i, g in enumerate(pruner.step(interactive=True)):
    g.prune()
```

4. Fine-tune the model

```python
import torch
import torch.nn as nn
from tqdm import tqdm

optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, NUM_EPOCH)
criterion = nn.CrossEntropyLoss()

best_acc = 0
for epoch_num in tqdm(range(1, NUM_EPOCH + 1)):
    train_one_epoch(model, criterion, optimizer, train_loader, device, scheduler)
    acc = evaluate_model(model, test_loader, device)
    print(f"epoch {epoch_num}:", acc)
    if acc > best_acc:
        # Keep the best checkpoint seen so far.
        torch.save(model, "prune_iter.pth")
        best_acc = acc
```
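
The helpers `train_one_epoch` and `evaluate_model` used above are not shown in the note; below is a minimal sketch of what they might look like. The signatures match the call sites, but the bodies are assumptions.

```python
# Hedged sketch of the helpers referenced above; the originals are not shown,
# so these bodies are assumptions that merely match the call sites.
def train_one_epoch(model, criterion, optimizer, loader, device, scheduler):
    model.train()
    for imgs, lbls in loader:
        imgs, lbls = imgs.to(device), lbls.to(device)
        optimizer.zero_grad()
        loss = criterion(model(imgs), lbls)
        loss.backward()
        optimizer.step()
    scheduler.step()  # cosine schedule stepped once per epoch

def evaluate_model(model, loader, device):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for imgs, lbls in loader:
            imgs, lbls = imgs.to(device), lbls.to(device)
            pred = model(imgs).argmax(dim=1)
            correct += (pred == lbls).sum().item()
            total += lbls.numel()
    return correct / total
```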

Pruned model accuracy test

(Screenshot: pruned-model accuracy test, 2024-06-21 01:11:47; image unavailable in the exported note.)