# [EdgeAI] Lab 4: Quantize DeiT-S
###### Institute of Computer Science and Engineering, 2nd-year M.S., 311551174 李元亨
---
## 1. A visual analysis of the weight/activation parameters of the model. (20%)
### Weight parameters
Below is a visualization of the layers with the largest parameter counts. Note that we do not present all of the weights, since the transformer blocks share similar characteristics.

The bar plot above shows that the MLPs inside each transformer encoder block contribute the most to the model size.
To reduce the model size effectively, we should therefore focus on quantizing these parameters, while small tensors such as biases and ls.gamma can stay in float.
We also visualize the weight distributions of the model; a sketch of how such histograms can be produced follows the observations below.


From the above plot, a few things are worth noting:
- Some distributions are narrow and centered around 0 but contain outliers on both the positive and negative sides. These outliers make quantization harder because the quantization range must be widened to cover them, e.g., pos_embed and the qkv-related modules.
- Some distributions are shifted away from zero. This asymmetry can cause quantization errors if it is not handled properly, e.g., norm and ls.
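For reference, histograms like the ones above can be reproduced with a short script. This is a minimal sketch: the checkpoint name `deit_small_patch16_224` and the subset of parameters plotted are assumptions for illustration, and the lab's exact DeiT-S variant (e.g., one with layer scale) may differ.

```python
import timm
import matplotlib.pyplot as plt

# Assumed timm checkpoint name; replace with the lab's actual DeiT-S variant.
model = timm.create_model("deit_small_patch16_224", pretrained=True)

# A few representative parameter tensors mentioned in the analysis above.
names = [
    "pos_embed",
    "blocks.0.attn.qkv.weight",
    "blocks.0.mlp.fc1.weight",
    "blocks.0.norm1.weight",
]

params = dict(model.named_parameters())
fig, axes = plt.subplots(1, len(names), figsize=(4 * len(names), 3))
for ax, name in zip(axes, names):
    values = params[name].detach().flatten().numpy()
    ax.hist(values, bins=100)
    ax.set_title(name, fontsize=8)
    ax.set_xlabel("value")
fig.tight_layout()
fig.savefig("deit_s_weight_histograms.png")
```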
## 2. Identify the specific layers or components that quantization impacts the most. (10%)
- Positional Embedding (pos_embed):
The positional embedding shows a narrow distribution with outliers, suggesting that a straightforward quantization approach might introduce errors. Advanced techniques that handle outliers effectively will be required.
- Query-Key-Value (qkv) Modules:
Similar to the positional embeddings, qkv modules have distributions with outliers. Effective quantization of these modules is crucial due to their central role in the transformer's attention mechanism.
- Normalization Layers (norm):
These layers exhibit shifted distributions. Proper quantization techniques that account for this asymmetry are necessary to avoid significant quantization errors.
- Layer Scale (ls):
Also showing shifted distributions, the ls parameters require careful quantization to prevent errors that could degrade model performance. (A sketch that quantifies this per-layer sensitivity follows this list.)
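To make this ranking more quantitative, a sketch like the one below simulates naive per-tensor symmetric int8 quantization of each parameter tensor and reports the relative error; the checkpoint name is again an assumption.

```python
import timm
import torch

def int8_symmetric_error(w: torch.Tensor) -> float:
    """Relative error after simulated per-tensor symmetric int8 quantization."""
    scale = w.abs().max() / 127.0
    if scale == 0:
        return 0.0
    q = torch.clamp(torch.round(w / scale), -128, 127)
    return ((w - q * scale).norm() / (w.norm() + 1e-12)).item()

# Assumed timm checkpoint name for illustration.
model = timm.create_model("deit_small_patch16_224", pretrained=True)

errors = {
    name: int8_symmetric_error(param.detach().flatten())
    for name, param in model.named_parameters()
}

# Parameters that lose the most precision under naive symmetric quantization.
for name, err in sorted(errors.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{name:45s} relative error = {err:.4f}")
```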
## 3. Explain why DeiT-S is harder to quantize than MobileNet. Use relevant charts or graphs to support your findings.
We compare the weight distributions of DeiT-S and MobileNet; a sketch that computes a simple outlier statistic for both models follows this comparison.
- DeiT-S:


- DeiT-S is based on the transformer architecture, which heavily relies on self-attention mechanisms and includes numerous multi-head attention layers, MLP blocks, and normalization layers.
- The presence of outliers means that quantization ranges must be wide enough to accommodate extreme values, which can reduce the precision of the majority of the weights.
- Layers with asymmetric weight distributions can suffer a more severe accuracy drop when symmetric quantization is applied.
- MobileNet:

- The weight distributions in MobileNet are typically more uniform and centered around zero, making them more amenable to quantization.
- More uniform and symmetrical weight distributions are easier to quantize using standard techniques, leading to less quantization error.
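A rough way to back this comparison with numbers is to measure how heavy the tails of each weight tensor are, for example via the |max|/std ratio. The sketch below does this for both models; both timm checkpoint names (`deit_small_patch16_224`, `mobilenetv2_100`) are assumptions chosen for illustration.

```python
import timm
import torch

def outlier_ratio(w: torch.Tensor) -> float:
    """|max| / std of a weight tensor: a rough indicator of how heavy the tails are."""
    return (w.abs().max() / (w.std() + 1e-12)).item()

# Assumed checkpoint names; the exact MobileNet variant used for comparison may differ.
deit = timm.create_model("deit_small_patch16_224", pretrained=True)
mobilenet = timm.create_model("mobilenetv2_100", pretrained=True)

for tag, model in [("DeiT-S", deit), ("MobileNetV2", mobilenet)]:
    ratios = torch.tensor([
        outlier_ratio(p.detach().flatten())
        for p in model.parameters()
        if p.dim() > 1  # weight matrices / conv kernels only
    ])
    print(f"{tag}: mean |max|/std = {ratios.mean().item():.1f}, "
          f"worst layer = {ratios.max().item():.1f}")
```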
## 4. Suggestions for improving quantization on DeiT-S.
- Asymmetric Quantization: Use asymmetric quantization to handle weight distributions that are not centered around zero. This lets the quantization range fit the actual distribution of the weights better (see the QConfig sketch after this list).
- Per-Channel Quantization: Apply per-channel quantization instead of per-tensor quantization. This approach quantizes weights differently for each channel, accommodating variations in the distribution across different channels.
- Mixed-Precision Quantization: Use mixed-precision quantization, where critical layers or weights with significant outliers are quantized to a higher bit-width (e.g., 16-bit) while others use lower bit-widths (e.g., 8-bit).
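For reference, the first two suggestions map directly onto PyTorch's QConfig API. The sketch below combines asymmetric per-tensor activation quantization with per-channel symmetric weight quantization; the specific dtypes and ranges are illustrative choices, not the lab's required settings.

```python
import torch
from torch.ao.quantization import QConfig
from torch.ao.quantization.fake_quantize import FusedMovingAvgObsFakeQuantize
from torch.ao.quantization.observer import (
    MovingAverageMinMaxObserver,
    MovingAveragePerChannelMinMaxObserver,
)

# Asymmetric (affine) activation quantization plus per-channel symmetric weights.
improved_qconfig = QConfig(
    activation=FusedMovingAvgObsFakeQuantize.with_args(
        observer=MovingAverageMinMaxObserver,
        dtype=torch.quint8,                   # unsigned affine range suits shifted activations
        quant_min=0,
        quant_max=255,
        qscheme=torch.per_tensor_affine,
    ),
    weight=FusedMovingAvgObsFakeQuantize.with_args(
        observer=MovingAveragePerChannelMinMaxObserver,
        dtype=torch.qint8,
        quant_min=-128,
        quant_max=127,
        qscheme=torch.per_channel_symmetric,  # one scale per output channel
        ch_axis=0,
    ),
)

# Mixed precision can be approximated by assigning this qconfig only to robust
# modules and leaving outlier-heavy ones in float (qconfig = None in the FX flow).
```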
## 5. Explain what you have done in your quantization pipeline to enhance model performance in part 3.
- Quantization Configuration
- Activation Quantization:
Set to use FusedMovingAvgObsFakeQuantize with dtype=torch.int8.
Specified range with quant_min=-128 and quant_max=127.
Applied torch.per_tensor_affine scheme.
Incorporated eps=2**-12 for enhanced precision.
- Weight Quantization:
Configured to use FusedMovingAvgObsFakeQuantize with dtype=torch.int8.
Utilized the torch.per_tensor_symmetric scheme.
Added MovingAverageMinMaxObserver with eps=2**-12.
- Bias Quantization:
Applied PlaceholderObserver with dtype=torch.float.
- Model Preparation for Quantization-Aware Training (QAT)
- Ignore Configurations:
Defined ignore patterns for layers including norm, qkv, and ls.
The quantization configuration described above is applied to the remaining layers, while the ignored layers are kept in float; a sketch of this setup follows below.
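Below is a minimal sketch of how this configuration could be written with PyTorch's PT2E quantizer data structures (QuantizationSpec / QuantizationConfig). The import paths vary across PyTorch versions, the weight range is assumed to mirror the activation range, and the way the ignore patterns are consumed by the lab's quantizer is an assumption.

```python
import torch
from torch.ao.quantization.fake_quantize import FusedMovingAvgObsFakeQuantize
from torch.ao.quantization.observer import MovingAverageMinMaxObserver, PlaceholderObserver
from torch.ao.quantization.quantizer import QuantizationSpec
from torch.ao.quantization.quantizer.xnnpack_quantizer_utils import QuantizationConfig

# Activations: signed 8-bit, per-tensor affine, with eps=2**-12 passed to the observer.
act_spec = QuantizationSpec(
    dtype=torch.int8,
    quant_min=-128,
    quant_max=127,
    qscheme=torch.per_tensor_affine,
    observer_or_fake_quant_ctr=FusedMovingAvgObsFakeQuantize.with_args(eps=2**-12),
)

# Weights: signed 8-bit, per-tensor symmetric, moving-average min/max observer.
# (The -128..127 range mirrors the activation range and is an assumption.)
weight_spec = QuantizationSpec(
    dtype=torch.int8,
    quant_min=-128,
    quant_max=127,
    qscheme=torch.per_tensor_symmetric,
    observer_or_fake_quant_ctr=FusedMovingAvgObsFakeQuantize.with_args(
        observer=MovingAverageMinMaxObserver, eps=2**-12
    ),
)

# Biases are left in floating point (no fake quantization).
bias_spec = QuantizationSpec(
    dtype=torch.float,
    observer_or_fake_quant_ctr=PlaceholderObserver,
)

qat_config = QuantizationConfig(act_spec, act_spec, weight_spec, bias_spec, is_qat=True)

# Layers whose names match these substrings are skipped by the quantizer and stay
# in float; the patterns follow the sensitivity analysis in parts 1-2.
IGNORE_PATTERNS = ("norm", "qkv", "ls")
```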