# Lab 4 [DeiT Quantization]
A122582 陳佑祥
## Table of Contents
[TOC]
---
###### tags: `Edge AI`
# Part 1: Visual analysis of the weight/activation parameters
## Experiment Setup
In this study, we analyze the DeiT (Data-efficient Image Transformers) model to identify layers whose activation outputs contain potential outliers, which could affect quantization accuracy. We define a threshold of 5% to decide whether a layer's outliers are significant. Specifically, we detect outliers using the Interquartile Range (IQR) method and compute the percentage of outliers in each layer's activation outputs.
### Methodology
- **Outlier Detection:** Outliers are identified using the IQR method. A data point is considered an outlier if it lies more than 1.5 times the IQR above the third quartile or more than 1.5 times the IQR below the first quartile (a code sketch follows this list).
- **Significance Threshold:** Layers with an outlier percentage above 5% are considered significant.
- **Visualization:** We generate boxplots to visualize the distribution of activation outputs for each layer and highlight the layers with significant outliers.
- Each layer's output distribution after activation **(potential outliers are marked in red)**:
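A minimal sketch of this outlier check, assuming `model` is the loaded DeiT-S instance and `images` is one calibration batch (both names are placeholders, not the lab's actual variables):
```python
import numpy as np
import torch

def outlier_percentage(values: np.ndarray) -> float:
    """Share of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], in percent."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return 100.0 * outliers.mean()

# Collect each layer's activation output with forward hooks (hypothetical setup).
activations = {}
def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu().numpy().ravel()
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):          # qkv, proj, fc1/fc2, head
        module.register_forward_hook(make_hook(name))

_ = model(images)                                    # one calibration batch

percentages = {name: outlier_percentage(act) for name, act in activations.items()}
significant = {name: p for name, p in percentages.items() if p > 5.0}
```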

:::info
**Note:** In the appendix section, we provide detailed plots of the weight distribution for each layer to further illustrate these observations.
:::
---
# Part 2: Identify the Specific Layers that are Sensitive to Quantization
To quantify the impact of quantization on each layer's weights, we use the following formula for the **Change Score**:
$$
C = \frac{\text{mean change}}{\text{mean}(\text{original data})} + \frac{\text{std change}}{\text{std}(\text{original data})}
$$
This formula quantifies the parameter change in each layer by summing the normalized changes in mean and standard deviation.
- **Code Block**:
```python
# Calculate a simple change score as a sum of normalized changes
change_score = mean_change / np.mean(original_data) + std_change / np.std(original_data)
change_scores[name] = change_score
```

The **Change Score** reflects the degree to which quantization affects each layer's parameters and weights, with a higher score indicating a greater impact.
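For context, a sketch of how this score could be computed over all weight tensors; `compute_change_scores` is our own helper name, and it assumes both models expose floating-point weights through `named_parameters()` (a converted quantized model may require dequantizing first):
```python
import numpy as np

def compute_change_scores(original_model, quantized_model):
    change_scores = {}
    original_params = dict(original_model.named_parameters())
    for name, q_param in quantized_model.named_parameters():
        if name not in original_params:
            continue
        original_data = original_params[name].detach().cpu().numpy().ravel()
        quantized_data = q_param.detach().cpu().numpy().ravel()
        # Absolute shift of the first two moments caused by quantization
        mean_change = abs(np.mean(quantized_data) - np.mean(original_data))
        std_change = abs(np.std(quantized_data) - np.std(original_data))
        change_scores[name] = (mean_change / np.mean(original_data)
                               + std_change / np.std(original_data))
    return change_scores
```
Note that dividing by the original mean can inflate the score for near-zero-mean weight tensors, which may partly explain why the raw scores in the table below are so large.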
### Key Observations
1. **Attention Projection (`proj.weight`) and Query-Key-Value (`qkv.weight`) Layers**: These are critical components of the Transformer's self-attention mechanism. The projection weights transform the input features into different spaces for queries, keys, and values. Given the sensitivity of the softmax operation that follows these transformations (as part of calculating attention scores), even minor perturbations in these weights due to quantization can significantly affect the model's output.
2. **Initial Layers (`blocks.0`) vs. Deeper Layers (`blocks.5`)**: It is notable that the initial layers rank among the most sensitive. This is likely because early layers in deep networks typically capture low-level features, and errors introduced here propagate to all subsequent layers, amplifying their impact.
3. **Variability in Sensitivity Across Layers**: The varying degrees of change scores across different layers suggest that some layers have learned representations that are more robust to reduced precision than others. This might be influenced by the nature of the features these layers are processing or their role in the network architecture.
### Results
#### Top 10 Potential Outlier Layers:
The following layers exhibited the highest percentages of outliers in their activation outputs:
| Layer | Outlier Percentage |
|---|---:|
| `blocks.0.attn.proj` | 20.26% |
| `blocks.0.attn.qkv` | 18.19% |
| `blocks.0.mlp.fc2` | 14.41% |
| `blocks.1.attn.proj` | 12.38% |
| `blocks.1.mlp.fc2` | 11.70% |
| `blocks.2.attn.qkv` | 10.11% |
| `blocks.1.attn.qkv` | 8.86% |
| `head` | 7.44% |
| `blocks.9.mlp.fc2` | 7.14% |
| `blocks.3.attn.qkv` | 6.70% |
#### Top 5 Layers Affected by Quantization Based on Parameter Changes:
The following layers showed the highest change scores, indicating they are significantly affected by quantization:
| Layer | Change Score |
|---|---:|
| `blocks.0.attn.proj.weight` | 71134.7031 |
| `blocks.5.attn.proj.weight` | 59492.6641 |
| `blocks.3.attn.qkv.weight` | 55263.0820 |
| `blocks.0.attn.qkv.weight` | 41150.3867 |
| `blocks.2.attn.qkv.weight` | 34689.1172 |
### Observations
From our analysis, we observe that certain layers in the DeiT model exhibit a significant percentage of outliers in their activation outputs. These layers could potentially impact the accuracy of the quantization process. Additionally, layers such as `blocks.0.attn.proj.weight` and `blocks.5.attn.proj.weight` show high change scores, indicating that they are heavily affected by quantization.
In the appendix section, we provide detailed plots of the weight distribution for each layer to further illustrate these observations.
---
# Part 3: Why is DeiT harder to quantize than MobileNet?
### 1. **Architectural Differences**
DeiT-S is based on the Transformer architecture, which uses self-attention mechanisms extensively. This architecture is fundamentally different from the convolution-based architecture used in MobileNet.
- **Attention Mechanisms**: Transformers compute attention scores from the dot products of queries and keys, which can have a large dynamic range, especially before softmax normalization. The softmax function itself is sensitive to small input perturbations, and low-precision computation exacerbates this: the combination of a wide dynamic range and softmax sensitivity means that reduced numeric precision can significantly distort the attention outputs (a small numerical illustration follows this list).

- **Depthwise Convolutions**: MobileNet utilizes depthwise separable convolutions, which are a type of convolution that decouples filtering and combining steps, resulting in fewer parameters and generally less computational complexity. These operations typically exhibit robust behavior under quantization because they inherently involve fewer multiplications and additions that could amplify quantization errors.
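As a small numerical illustration of the softmax sensitivity mentioned above (the logit values and the quantization step are made up for the example):
```python
import torch

# Hypothetical pre-softmax attention logits and a coarse quantization step.
logits = torch.tensor([6.2, 5.9, 0.3, -1.4])
step = 0.5
rounded = torch.round(logits / step) * step   # simulate rounding to a coarse grid

print(torch.softmax(logits, dim=0))    # attention clearly favors the first token
print(torch.softmax(rounded, dim=0))   # the two largest logits now coincide,
                                       # so the attention mass is split roughly evenly
```
Even a rounding error of a few tenths in the logits noticeably redistributes the attention weights, and in a real model such perturbations occur in every attention head of every block.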
### 2. **Data Flow and Feature Representation**
- **Feature Distribution**: In Transformers like DeiT-S, the information represented across different heads of multi-head attention can vary significantly, potentially leading to a wide distribution of values. Efficient quantization requires careful handling of this distribution to avoid information loss. In contrast, features in MobileNet, processed through convolutions, tend to have a more regular and predictable distribution, which is generally more amenable to uniform quantization.

- **Skip Connections**: While both architectures utilize skip connections, in Transformers, these are critical for propagating information across multiple layers without attenuation. Any quantization error introduced in early layers can propagate and even amplify through these connections, impacting the overall model accuracy more severely.
### 3. **Model Sensitivity and Error Propagation**
- **Error Accumulation**: Transformers aggregate outputs from multiple attention heads and layers, where each contributes a small error due to quantization. These errors can accumulate, especially in deeper models like DeiT-S, leading to a significant deviation from the expected output.
- **Batch Normalization**: MobileNet makes extensive use of batch normalization, which can help mitigate some of the quantization effects by normalizing the output distributions of layers. This normalization can reduce the impact of quantization errors on the model's performance.
### Conclusion
Theoretically, the complexity and sensitivity of the operations in DeiT-S, combined with its reliance on dynamic range-heavy computations like softmax in attention, make it more susceptible to performance degradation under quantization compared to the more straightforward, convolution-dominated operations in MobileNet. Quantizing a Transformer model like DeiT-S often requires more nuanced handling to maintain a balance between efficiency and accuracy.
---
# Part 4: Exploring Quantization-Aware Training (QAT) for Improved Accuracy
### Experiment Setup: Partial Post-Training Quantization (PTQ)
Initially, we aimed to improve quantization results by excluding potentially problematic layers identified as outliers. These layers were added to an ignore list during PTQ:
```python
ignore_list = [
    "blocks.0.attn.proj.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.0.mlp.fc2",
    "blocks.1.attn.proj",
    "blocks.1.mlp.fc2",
    "blocks.2.attn.qkv",
    "blocks.1.attn.qkv",
    "head",
    "blocks.9.mlp.fc2",
    "blocks.3.attn.qkv",
]
```
We experimented with excluding 3 to 10 of these layers, and even iterated over different subsets in a loop when performing PTQ. Despite these efforts, accuracy remained suboptimal, ranging from 83% to 85%, compared to the original model accuracy of 90.9%. The wide dynamic range of these layers likely contributes to the accuracy drop during quantization.
Given the poor results from partial PTQ, we decided to switch to Quantization-Aware Training (QAT), a more robust approach for maintaining accuracy in quantized models.
### Implementing QAT with Optuna Fine-Tuning
For QAT, we implemented a strategy to capture the model graph, initialize the quantizer, and prepare the model for quantization. We used the XNNPACK quantizer with symmetric quantization configurations and trained the model for five epochs. To further optimize the hyperparameters, we employed Optuna for fine-tuning.
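A minimal sketch of this flow under the PyTorch 2 export (PT2E) quantization API; `model` stands for the DeiT-S instance, and the exact graph-capture call may differ between PyTorch 2.x versions:
```python
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_qat_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

example_inputs = (torch.randn(1, 3, 224, 224),)

# Capture the model graph for quantization.
exported_model = capture_pre_autograd_graph(model, example_inputs)

# XNNPACK quantizer with a symmetric config in QAT mode (inserts fake-quant ops).
quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config(is_qat=True)
)
prepared_model = prepare_qat_pt2e(exported_model, quantizer)

# ... run the usual training loop on `prepared_model` for a few epochs ...

quantized_model = convert_pt2e(prepared_model)
```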
Here are the key steps and results:
1. **Hyperparameter Optimization** (values found by the Optuna study; a sketch of the objective follows this list):
- Learning Rate: $$3.647 \times 10^{-5}$$
- Momentum: $$0.857$$
- Weight Decay: $$0.00129$$
2. **Training Process**:
- The model was trained over ten epochs, with observers and batch normalization stats adjusted during the training.
- We evaluated the quantized model's accuracy after each epoch to monitor performance.
3. **Results**:
- QAT with Optuna fine-tuning achieved an accuracy of 90.0%, closely matching the original model's accuracy of 90.9%.
- The final model size was 21.94 MB, with an execution time for the training process of 54.76 seconds.
- The model score post-QAT and fine-tuning was 38.00, reflecting the effectiveness of this approach.
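For reference, a sketch of the kind of Optuna objective used for this search; `prepare_qat_model`, `train_qat`, and `evaluate_quantized` are hypothetical stand-ins for the lab's own training and evaluation code, and the search ranges are illustrative:
```python
import optuna
import torch

def objective(trial):
    # Illustrative search space around the hyperparameters listed above.
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    momentum = trial.suggest_float("momentum", 0.8, 0.99)
    weight_decay = trial.suggest_float("weight_decay", 1e-4, 1e-2, log=True)

    prepared = prepare_qat_model()                  # hypothetical: PT2E-prepared DeiT-S
    optimizer = torch.optim.SGD(prepared.parameters(), lr=lr,
                                momentum=momentum, weight_decay=weight_decay)
    train_qat(prepared, optimizer, epochs=5)        # hypothetical training helper
    return evaluate_quantized(prepared)             # hypothetical: returns test accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```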
### Conclusion
While partial PTQ did not yield satisfactory accuracy, QAT proved to be an effective method for quantization, preserving the model's performance.

---
# Part 4(II): Suggestions for improving quantization on DeiT-S.
**This part describes my original idea. I asked the TA in class and we are not allowed to use dynamic quantization, but I still record it here.**
For DeiT-S (Data-efficient Image Transformer, Small), I think dynamic quantization stands out as an effective way to balance model size, inference speed, and accuracy. Below, we discuss why dynamic quantization is well suited to DeiT-S and provide suggestions for improving quantization effectiveness.
**Dynamic Quantization: A Perfect Fit for Transformers**
Dynamic quantization involves quantizing the weights of the model once and quantizing the activations on-the-fly during inference. This approach is particularly suited for transformers due to their diverse and variable activation ranges across different layers. Each activation function is quantized dynamically, adapting to the specific data being processed. This dynamic adaptation ensures that the quantization process closely matches the actual data distribution, minimizing information loss and preserving accuracy.
**Explanation**
1. **Weight Quantization (Static)**
- Quantize weights $$ W $$ **Once** during model conversion:
$$
W_q = \text{round}\left(\frac{W - W_{\text{min}}}{\Delta_W}\right)
$$
- Dequantize during inference:
$$
W \approx W_q \cdot \Delta_W + W_{\text{min}}
$$
2. **Activation Quantization (Dynamic)**
- Determine the range of activations $$ A $$ **dynamically** during inference:
$$
A_q = \text{round}\left(\frac{A - A_{\text{min}}}{\Delta_A}\right)
$$
- Dequantize during inference:
$$
A \approx A_q \cdot \Delta_A + A_{\text{min}}
$$
:::info
**Note:** Static quantization fixes the activation quantization parameters once (during calibration), while dynamic quantization re-computes them for each layer's activations at inference time.
:::
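In PyTorch, this approach can be sketched in a few lines; the snippet below assumes the DeiT-S weights come from `timm` (in the lab, the fine-tuned checkpoint would be loaded instead) and quantizes only the `nn.Linear` layers, which dominate a Transformer's compute:
```python
import torch
from timm import create_model

# Hypothetical: load DeiT-S; in practice, load the fine-tuned checkpoint.
model = create_model("deit_small_patch16_224", pretrained=True).eval()

# Dynamic quantization: weights are quantized once, activations on-the-fly
# at inference time, using the observed activation range of each batch.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```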
**Comparison with Static Quantization**
Static quantization uses fixed quantization parameters determined during a calibration phase with a representative dataset. This approach can lead to suboptimal performance for transformers, which exhibit high variance in activation ranges. In contrast, dynamic quantization adapts to the input data, providing more accurate quantization and better preserving model performance.
**Advantages of Dynamic Quantization**
- **Flexibility**: Dynamic quantization adapts to the current input data, ensuring optimal quantization parameters for each layer.
- **Accuracy**: By reducing quantization error through dynamic adaptation, dynamic quantization helps maintain higher model accuracy.
- **Simplified Process**: There is no need for extensive calibration, making it easier to implement and deploy.
- **Comprehensive Quantization**: Unlike static quantization, which might necessitate ignoring some layers to prevent significant accuracy loss, dynamic quantization can be applied uniformly across all layers without degrading performance.
**Performance Evaluation**
The following data highlights the impact of dynamic quantization on the DeiT-S model:
| Model Variant | Accuracy on Test Images | Model Size (MB) | Execution Time (s) | Model Score |
|--------------------------------|-------------------------|-----------------|-------------------|-------------|
| Before Quantization | 93.6% | 86.91 | 4.3 | - |
| Static Quantization | 84.8% | 21.94 | 65.25 | 20.00 |
| Dynamic Quantization | 93.0% | 23.08 | 46.48 | 40.00 |
- **Accuracy**: Dynamic quantization achieves 93.0% accuracy, closely matching the original model's performance (93.6%), while static quantization significantly reduces accuracy to 84.8%.
- **Model Size**: Both quantization methods substantially reduce the model size, with dynamic quantization resulting in a slightly larger but still compact model compared to static quantization.
- **Execution Time**: Dynamic quantization speeds up inference (46.48 seconds) compared to static quantization (65.25 seconds).
- **Model Score**: The model score, a composite metric considering size, accuracy, and execution time, is significantly higher for the dynamically quantized model (40.00) compared to the statically quantized model (20.00).
**Conclusion**
Dynamic quantization offers superior accuracy and a simpler deployment process. Unlike the static quantization approach, which required ignoring some layers to preserve accuracy, it can be applied uniformly across the model.

# Part 5: The Quantization pipeline to enhance model performance.
1. **Initial PTQ Attempt**:
- **Outlier Detection**: Identify layers with a high percentage of outliers to understand which layers might affect the quantization process.
- **Ignoring Outlier Layers**: Initially, attempt to exclude layers with high outlier percentages from the PTQ process to maintain accuracy.
2. **PTQ Results**:
- **Partial PTQ**: Perform partial PTQ by ignoring the identified outlier layers. Despite these efforts, the resulting accuracy was only around 83-85%, which was significantly lower than the original accuracy of 90.9%.
3. **Switch to QAT**:
- **QAT Preparation**: Initialize the quantization-aware training (QAT) process by preparing the model and data loader for QAT.
- **Observer and BatchNorm Handling**: Fine-tune the model with specific epochs dedicated to observer updates and batch normalization updates.
- **Fine-Tuning with Optuna**: Use Optuna to optimize hyperparameters such as learning rate, momentum, and weight decay to enhance model performance during QAT.
4. **QAT Implementation**:
- **Training and Evaluation**: Train the model for several epochs, periodically evaluating the quantized model's accuracy.
- **Final Results**: Achieve a final accuracy of 90.0%, which is comparable to the original model, indicating successful quantization with QAT.
### Flow Diagram
```mermaid
graph TD
A[Initial Model: 90.9% Accuracy] --> B{Outlier Detection}
B --> C{PTQ}
C -->|Ignore Outlier Layers| D[Partial PTQ]
D -->|83-85% Accuracy| E[Switch to QAT]
E --> F[Prepare Model for QAT]
F --> G[Observer and BatchNorm Handling]
G --> H[Hyperparameter Optimization with Optuna]
H --> I[Train and Evaluate Model]
I --> J{Final Accuracy: 90.0%}
style A fill:#f9f,stroke:#333,stroke-width:2px;
style B fill:#f96,stroke:#333,stroke-width:2px;
style C fill:#6f9,stroke:#333,stroke-width:2px;
style D fill:#9f6,stroke:#333,stroke-width:2px;
style E fill:#69f,stroke:#333,stroke-width:2px;
style F fill:#96f,stroke:#333,stroke-width:2px;
style G fill:#f69,stroke:#333,stroke-width:2px;
style H fill:#9f9,stroke:#333,stroke-width:2px;
style I fill:#f99,stroke:#333,stroke-width:2px;
style J fill:#99f,stroke:#333,stroke-width:2px;
```
---
# Appendix
In the appendix section, we include plots of the weight distribution for each layer, which provide further insights into how each layer's parameters are distributed and highlight potential areas of concern for quantization.
- Each layer's weight distribution:
