# Reshaping Deep Learning Models: Quantization
Quantize a classic neural network model to reduce model size and latency. The task is to implement K-Means Quantization and Linear Quantization, with the following goals:
- Understand the basic concepts of quantization
- Implement and apply k-means quantization
- Implement and apply quantization-aware training for k-means quantization
- Implement and apply linear quantization
- Implement and apply integer-only inference for linear quantization
- Gain a basic understanding of the performance gains (e.g., speedup) that quantization brings
- Understand the differences and trade-offs between these quantization methods
1. K-Means Quantization
2. Linear Quantization
---
## 1. K-Means Quantization
```
Step 1: measure the current model
VGG fp32: accuracy = 92.95%, size = 35.2 MiB
```
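For reference, the 35.2 MiB above follows directly from the parameter count at 32 bits per weight. A minimal sketch of such a size helper, assuming a plain count-the-parameters estimate (the name `get_model_size` is illustrative, not the lab's exact API):

```python
import torch

def get_model_size(model: torch.nn.Module, data_width_bits: int = 32) -> float:
    """Estimate model size in MiB, assuming every parameter is stored
    with `data_width_bits` bits (32 for fp32; 8/4/2 after quantization)."""
    num_params = sum(p.numel() for p in model.parameters())
    return num_params * data_width_bits / 8 / (1024 ** 2)
```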
```
Step 2: review how k-means works
```
```
Step 3: implement 2-bit k-means quantization on a test tensor,
then apply 2/4/8-bit k-means quantization to the VGG model
```
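Below is a minimal sketch of such a k-means quantizer, assuming scikit-learn's `KMeans` as the clustering backend (the lab may use a GPU k-means instead; the name `kmeans_quantize` is illustrative):

```python
import torch
from sklearn.cluster import KMeans

def kmeans_quantize(tensor: torch.Tensor, bitwidth: int):
    """Cluster the weights into 2**bitwidth centroids; each weight is then
    represented by the index (label) of its nearest centroid."""
    n_clusters = 2 ** bitwidth
    flat = tensor.detach().cpu().numpy().reshape(-1, 1)
    kmeans = KMeans(n_clusters=n_clusters, n_init=1).fit(flat)
    codebook = torch.from_numpy(kmeans.cluster_centers_.flatten()).float()
    labels = torch.from_numpy(kmeans.labels_).long().reshape(tensor.shape)
    quantized = codebook[labels]  # decode: look each label up in the codebook
    return labels, codebook, quantized
```

Applying it in place is just `tensor.data.copy_(quantized)`; only the `bitwidth`-bit labels plus the tiny codebook need to be stored, which is where the size reduction comes from.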

```
k-means quantizing model into 8 bits
8-bit k-means quantized model has size=8.80 MiB
8-bit k-means quantized model has accuracy=92.79%
k-means quantizing model into 4 bits
4-bit k-means quantized model has size=4.40 MiB
4-bit k-means quantized model has accuracy=83.03%
k-means quantizing model into 2 bits
2-bit k-means quantized model has size=2.20 MiB
2-bit k-means quantized model has accuracy=10.00%
```
```
Recall that fine-grained pruning earlier achieved 8.15 MiB at 92.8% accuracy (and at what numeric precision? still fp32).
8-bit k-means quantization already rivals that, but accuracy at 2/4 bits drops sharply. How much accuracy can retraining recover?
```
Step 4: quantization-aware training (QAT)
According to [1], the gradient for the centroids is:
$\frac{\partial \mathcal{L} }{\partial C_k} = \sum_{j} \frac{\partial \mathcal{L} }{\partial W_{j}} \frac{\partial W_{j} }{\partial C_k} = \sum_{j} \frac{\partial \mathcal{L} }{\partial W_{j}} \mathbf{1}(I_{j}=k)$
where $\mathcal{L}$ is the loss, $C_k$ is the *k*-th centroid, and $I_{j}$ is the centroid label for weight $W_{j}$. $\mathbf{1}(\cdot)$ is the indicator function, and $\mathbf{1}(I_{j}=k)$ means $1\;\mathrm{if}\;I_{j}=k\;\mathrm{else}\;0$, *i.e.*, $I_{j}==k$.
How should we implement this?
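One way, sketched below, assumes we kept each weight's `labels` tensor and the `codebook` from the quantization step (all names illustrative): after a normal backward pass, gather the weight gradients per centroid and take an SGD step on the codebook.

```python
import torch

def update_codebook(weight_grad: torch.Tensor, labels: torch.Tensor,
                    codebook: torch.Tensor, lr: float = 1e-3) -> None:
    """One SGD step on the centroids. Per the formula above, the gradient
    of centroid C_k is the sum of dL/dW_j over all weights j with I_j = k."""
    grad = torch.zeros_like(codebook)
    # Scatter-add each weight's gradient into the slot of its centroid.
    grad.scatter_add_(0, labels.flatten(), weight_grad.flatten())
    codebook -= lr * grad
```

After each step the weights are refreshed from the codebook (`weight.data = codebook[labels]`), so the forward pass always sees quantized values.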
QAT turns out to be remarkably effective: the model is compressed to 2.2 MiB at accuracy = 91.68%.
```
k-means quantizing model into 8 bits
8-bit k-means quantized model has size=8.80 MiB
8-bit k-means quantized model has accuracy=92.79% before quantization-aware training
No need for quantization-aware training since accuracy drop=0.16% is smaller than threshold=0.50%
k-means quantizing model into 4 bits
4-bit k-means quantized model has size=4.40 MiB
4-bit k-means quantized model has accuracy=83.03% before quantization-aware training
Quantization-aware training due to accuracy drop=9.92% is larger than threshold=0.50%
Epoch 0 Accuracy 92.50% / Best Accuracy: 92.50%
k-means quantizing model into 2 bits
2-bit k-means quantized model has size=2.20 MiB
2-bit k-means quantized model has accuracy=10.00% before quantization-aware training
Quantization-aware training due to accuracy drop=82.95% is larger than threshold=0.50%
Epoch 0 Accuracy 90.88% / Best Accuracy: 90.88%
Epoch 1 Accuracy 91.22% / Best Accuracy: 91.22%
Epoch 2 Accuracy 91.43% / Best Accuracy: 91.43%
Epoch 3 Accuracy 91.68% / Best Accuracy: 91.68%
Epoch 4 Accuracy 91.57% / Best Accuracy: 91.68%
```
## 2. Linear Quantization
Linear quantization can be expressed as
$r = S(q-Z)$
where $r$ is a floating-point real number, $q$ is an *n*-bit integer, $Z$ is an *n*-bit integer, and $S$ is a floating-point real number. $Z$ is the quantization zero point and $S$ is the quantization scaling factor; both constants $Z$ and $S$ are quantization parameters.
The integer $q$ is compared against the reference point $Z$; the difference $(q-Z)$, rescaled by the factor $S$, yields a value in a discrete set of reals. The continuous input space is thereby mapped onto a discrete output space, which is precisely what quantizing the input values means.
As usual, write a linear_quantize() function and test it.
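A minimal sketch of an asymmetric `linear_quantize()` on a signed *n*-bit grid; the exact rounding and clamping policy here is an assumption and may differ from the reference solution:

```python
import torch

def linear_quantize(fp_tensor: torch.Tensor, bitwidth: int):
    """Quantize fp_tensor onto a signed `bitwidth`-bit integer grid.
    Returns (quantized_tensor, scale, zero_point) with r ≈ S * (q - Z)."""
    q_min, q_max = -(2 ** (bitwidth - 1)), 2 ** (bitwidth - 1) - 1
    r_min, r_max = fp_tensor.min().item(), fp_tensor.max().item()
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    zero_point = max(q_min, min(q_max, zero_point))  # keep Z on the integer grid
    q = torch.clamp(torch.round(fp_tensor / scale) + zero_point, q_min, q_max)
    return q.to(torch.int8), scale, zero_point       # int8 holds up to 8 bits
```

A quick test: dequantizing with `scale * (q.float() - zero_point)` should reproduce `fp_tensor` up to a rounding error of at most `scale / 2` inside the clamping range.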

Special case: linear quantization on the weight tensor
Once the function can return quantized_tensor, scale, and zero_point, look at a concrete example: apply linear quantization to the weights. The figures below show the weights of each VGG layer quantized from 32 bits down to 4 bits and 2 bits.

```
One more caveat:
extensive experiments show that using a different scaling factor S and zero point Z for each output channel gives better results,
so S and Z must be determined independently for each output channel's sub-tensor.
"Independent output channel" refers to each output channel of the weight tensor. For example, a convolutional layer's weight tensor typically has shape [output channels, input channels, height, width]; each "independent output channel" is then one slice of this 4-D tensor along the first dimension, i.e., that output channel's weights.
```
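A minimal per-output-channel sketch that simply reuses the per-tensor `linear_quantize()` from above on each slice along dimension 0 (an illustrative loop; a real implementation would vectorize the per-channel ranges):

```python
import torch

def linear_quantize_per_channel(weight: torch.Tensor, bitwidth: int):
    """Quantize a weight tensor of shape [out_ch, ...] with an independent
    (scale, zero_point) pair for every output channel."""
    qs, scales, zero_points = [], [], []
    for oc in range(weight.shape[0]):
        q, s, z = linear_quantize(weight[oc], bitwidth)
        qs.append(q)
        scales.append(s)
        zero_points.append(z)
    return torch.stack(qs), scales, zero_points
```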
```
After quantization, inference in the convolutional and fully-connected layers changes as well.
```
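For a fully-connected layer $\mathbf{y} = \mathbf{x}\mathbf{W}^\top + \mathbf{b}$, substituting $r = S(q-Z)$ for every operand and rearranging gives $q_\mathrm{out} = \frac{S_x S_w}{S_\mathrm{out}}\left[(q_x - Z_x)(q_w - Z_w)^\top + q_b\right] + Z_\mathrm{out}$, where the bias is pre-quantized as $q_b = \mathrm{round}(b / (S_x S_w))$ with zero point 0. A minimal sketch under these assumptions (all names illustrative):

```python
import torch

def quantized_linear(q_x, q_w, q_b, z_x, z_w, z_out, s_x, s_w, s_out):
    """Integer inference for a fully-connected layer:
    q_out = (s_x*s_w/s_out) * [(q_x - z_x) @ (q_w - z_w).T + q_b] + z_out."""
    acc = (q_x.long() - z_x) @ (q_w.long() - z_w).t()  # wide integer accumulator
    acc = acc + q_b                                    # bias pre-scaled by s_x * s_w
    out = torch.round(acc * (s_x * s_w / s_out)) + z_out
    return torch.clamp(out, -128, 127).to(torch.int8)  # back onto the 8-bit grid
```

The matmul and bias addition run entirely on integers; the one remaining floating-point rescale by `s_x * s_w / s_out` is what production kernels typically replace with a fixed-point multiply and shift.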
---
Ref.
[1] Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR 2016.