# Reshaping Deep Learning Models: Quantization
Quantize a classic neural network model to reduce model size and latency. The task is to implement K-Means Quantization and Linear Quantization, with the following goals:
- Understand the basic concepts of quantization
- Implement and apply k-means quantization
- Implement and apply quantization-aware training for k-means quantization
- Implement and apply linear quantization
- Implement and apply integer-only inference for linear quantization
- Gain a basic understanding of the performance gains (e.g., speedup) that quantization brings
- Understand the differences and trade-offs between these quantization methods
1. K-Means Quantization
2. Linear Quantization
---
## 1. K-Means Quantization
```
Step 1: measure the current model
VGG fp32: accuracy = 92.95%, size = 35.2 MiB
```
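For reference, the 35.2 MiB above follows directly from the parameter count at 32 bits per weight. A minimal sketch of such a size helper, assuming a plain count-the-parameters estimate (the name `get_model_size` is illustrative, not the lab's exact API):

```python
import torch

def get_model_size(model: torch.nn.Module, data_width_bits: int = 32) -> float:
    """Estimate model size in MiB, assuming every parameter is stored
    with `data_width_bits` bits (32 for fp32; 8/4/2 after quantization)."""
    num_params = sum(p.numel() for p in model.parameters())
    return num_params * data_width_bits / 8 / (1024 ** 2)
```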
```
Step 2: review how k-means works
```
```
Step 3: implement 2-bit k-means quantization on a test tensor,
then apply 2/4/8-bit k-means quantization to the VGG model
```
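Below is a minimal sketch of such a k-means quantizer, assuming scikit-learn's `KMeans` as the clustering backend (the lab may use a GPU k-means instead; the name `kmeans_quantize` is illustrative):

```python
import torch
from sklearn.cluster import KMeans

def kmeans_quantize(tensor: torch.Tensor, bitwidth: int):
    """Cluster the weights into 2**bitwidth centroids; each weight is then
    represented by the index (label) of its nearest centroid."""
    n_clusters = 2 ** bitwidth
    flat = tensor.detach().cpu().numpy().reshape(-1, 1)
    kmeans = KMeans(n_clusters=n_clusters, n_init=1).fit(flat)
    codebook = torch.from_numpy(kmeans.cluster_centers_.flatten()).float()
    labels = torch.from_numpy(kmeans.labels_).long().reshape(tensor.shape)
    quantized = codebook[labels]  # decode: look each label up in the codebook
    return labels, codebook, quantized
```

Applying it in place is just `tensor.data.copy_(quantized)`; only the `bitwidth`-bit labels plus the tiny codebook need to be stored, which is where the size reduction comes from.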

```
k-means quantizing model into 8 bits
8-bit k-means quantized model has size=8.80 MiB
8-bit k-means quantized model has accuracy=92.79%
k-means quantizing model into 4 bits
4-bit k-means quantized model has size=4.40 MiB
4-bit k-means quantized model has accuracy=83.03%
k-means quantizing model into 2 bits
2-bit k-means quantized model has size=2.20 MiB
2-bit k-means quantized model has accuracy=10.00%
```
```
Recall that fine-grained pruning earlier achieved 8.15 MiB at 92.8% accuracy (and at what numeric precision? still fp32).
8-bit k-means quantization already rivals that, but accuracy at 2/4 bits drops sharply. How much accuracy can retraining recover?
```
Step 4: quantization-aware training (QAT)
According to [1], the gradient for the centroids is:
$\frac{\partial \mathcal{L} }{\partial C_k} = \sum_{j} \frac{\partial \mathcal{L} }{\partial W_{j}} \frac{\partial W_{j} }{\partial C_k} = \sum_{j} \frac{\partial \mathcal{L} }{\partial W_{j}} \mathbf{1}(I_{j}=k)$
where $\mathcal{L}$ is the loss, $C_k$ is the *k*-th centroid, and $I_{j}$ is the centroid label for weight $W_{j}$. $\mathbf{1}(\cdot)$ is the indicator function, and $\mathbf{1}(I_{j}=k)$ means $1\;\mathrm{if}\;I_{j}=k\;\mathrm{else}\;0$, *i.e.*, $I_{j}==k$.
How should we implement this?
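One way, sketched below, assumes we kept each weight's `labels` tensor and the `codebook` from the quantization step (all names illustrative): after a normal backward pass, gather the weight gradients per centroid and take an SGD step on the codebook.

```python
import torch

def update_codebook(weight_grad: torch.Tensor, labels: torch.Tensor,
                    codebook: torch.Tensor, lr: float = 1e-3) -> None:
    """One SGD step on the centroids. Per the formula above, the gradient
    of centroid C_k is the sum of dL/dW_j over all weights j with I_j = k."""
    grad = torch.zeros_like(codebook)
    # Scatter-add each weight's gradient into the slot of its centroid.
    grad.scatter_add_(0, labels.flatten(), weight_grad.flatten())
    codebook -= lr * grad
```

After each step the weights are refreshed from the codebook (`weight.data = codebook[labels]`), so the forward pass always sees quantized values.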
QAT turns out to be remarkably effective: the model is compressed to 2.2 MiB at accuracy = 91.68%.
```
k-means quantizing model into 8 bits
8-bit k-means quantized model has size=8.80 MiB
8-bit k-means quantized model has accuracy=92.79% before quantization-aware training
No need for quantization-aware training since accuracy drop=0.16% is smaller than threshold=0.50%
k-means quantizing model into 4 bits
4-bit k-means quantized model has size=4.40 MiB
4-bit k-means quantized model has accuracy=83.03% before quantization-aware training
Quantization-aware training due to accuracy drop=9.92% is larger than threshold=0.50%
Epoch 0 Accuracy 92.50% / Best Accuracy: 92.50%
k-means quantizing model into 2 bits
2-bit k-means quantized model has size=2.20 MiB
2-bit k-means quantized model has accuracy=10.00% before quantization-aware training
Quantization-aware training due to accuracy drop=82.95% is larger than threshold=0.50%
Epoch 0 Accuracy 90.88% / Best Accuracy: 90.88%
Epoch 1 Accuracy 91.22% / Best Accuracy: 91.22%
Epoch 2 Accuracy 91.43% / Best Accuracy: 91.43%
Epoch 3 Accuracy 91.68% / Best Accuracy: 91.68%
Epoch 4 Accuracy 91.57% / Best Accuracy: 91.68%
```
## 2. Linear Quantization
Linear quantization can be expressed as
$r = S(q-Z)$
where $r$ is a floating-point real number, $q$ is an *n*-bit integer, $Z$ is an *n*-bit integer, and $S$ is a floating-point real number. $Z$ is the quantization zero point and $S$ is the quantization scaling factor; both constants $Z$ and $S$ are quantization parameters.
The integer $q$ is compared against the reference point $Z$; the difference $(q-Z)$, rescaled by the factor $S$, yields a value in a discrete set of reals. The continuous input space is thereby mapped onto a discrete output space, which is precisely what quantizing the input values means.
As usual, write a linear_quantize() function and test it.
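A minimal sketch of an asymmetric `linear_quantize()` on a signed *n*-bit grid; the exact rounding and clamping policy here is an assumption and may differ from the reference solution:

```python
import torch

def linear_quantize(fp_tensor: torch.Tensor, bitwidth: int):
    """Quantize fp_tensor onto a signed `bitwidth`-bit integer grid.
    Returns (quantized_tensor, scale, zero_point) with r ≈ S * (q - Z)."""
    q_min, q_max = -(2 ** (bitwidth - 1)), 2 ** (bitwidth - 1) - 1
    r_min, r_max = fp_tensor.min().item(), fp_tensor.max().item()
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    zero_point = max(q_min, min(q_max, zero_point))  # keep Z on the integer grid
    q = torch.clamp(torch.round(fp_tensor / scale) + zero_point, q_min, q_max)
    return q.to(torch.int8), scale, zero_point       # int8 holds up to 8 bits
```

A quick test: dequantizing with `scale * (q.float() - zero_point)` should reproduce `fp_tensor` up to a rounding error of at most `scale / 2` inside the clamping range.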

Special case: linear quantization on the weight tensor
Once the function can return quantized_tensor, scale, and zero_point, look at a concrete example: apply linear quantization to the weights. The figures below show the weights of each VGG layer quantized from 32 bits down to 4 bits and 2 bits.

```
One more caveat:
extensive experiments show that using a different scaling factor S and zero point Z for each output channel gives better results,
so S and Z must be determined independently for each output channel's sub-tensor.
"Independent output channel" refers to each output channel of the weight tensor. For example, a convolutional layer's weight tensor typically has shape [output channels, input channels, height, width]; each "independent output channel" is then one slice of this 4-D tensor along the first dimension, i.e., that output channel's weights.
```
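A minimal per-output-channel sketch that simply reuses the per-tensor `linear_quantize()` from above on each slice along dimension 0 (an illustrative loop; a real implementation would vectorize the per-channel ranges):

```python
import torch

def linear_quantize_per_channel(weight: torch.Tensor, bitwidth: int):
    """Quantize a weight tensor of shape [out_ch, ...] with an independent
    (scale, zero_point) pair for every output channel."""
    qs, scales, zero_points = [], [], []
    for oc in range(weight.shape[0]):
        q, s, z = linear_quantize(weight[oc], bitwidth)
        qs.append(q)
        scales.append(s)
        zero_points.append(z)
    return torch.stack(qs), scales, zero_points
```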
```
After quantization, inference in the convolutional and fully-connected layers changes as well.
```
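For a fully-connected layer $\mathbf{y} = \mathbf{x}\mathbf{W}^\top + \mathbf{b}$, substituting $r = S(q-Z)$ for every operand and rearranging gives $q_\mathrm{out} = \frac{S_x S_w}{S_\mathrm{out}}\left[(q_x - Z_x)(q_w - Z_w)^\top + q_b\right] + Z_\mathrm{out}$, where the bias is pre-quantized as $q_b = \mathrm{round}(b / (S_x S_w))$ with zero point 0. A minimal sketch under these assumptions (all names illustrative):

```python
import torch

def quantized_linear(q_x, q_w, q_b, z_x, z_w, z_out, s_x, s_w, s_out):
    """Integer inference for a fully-connected layer:
    q_out = (s_x*s_w/s_out) * [(q_x - z_x) @ (q_w - z_w).T + q_b] + z_out."""
    acc = (q_x.long() - z_x) @ (q_w.long() - z_w).t()  # wide integer accumulator
    acc = acc + q_b                                    # bias pre-scaled by s_x * s_w
    out = torch.round(acc * (s_x * s_w / s_out)) + z_out
    return torch.clamp(out, -128, 127).to(torch.int8)  # back onto the 8-bit grid
```

The matmul and bias addition run entirely on integers; the one remaining floating-point rescale by `s_x * s_w / s_out` is what production kernels typically replace with a fixed-point multiply and shift.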
---
Ref.
[1] Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR 2016.