# Week 26: ONNX / Quantization Technical Study
###### tags: `Tech Study`
## 1. Topics This Week
* ONNX model conversion
* Quantization for model compression
## 2. ONNX Model Conversion
### 2.1 Introduction to ONNX
https://onnx.ai/
ONNX (Open Neural Network Exchange) is an open file format designed for machine learning, used to store trained models.
It lets different **AI frameworks** (such as PyTorch and MXNet) **store model data in the same format** and interoperate.
The ONNX specification and code are developed jointly by Microsoft, Amazon, Facebook, IBM, and other companies, and are hosted as open source on GitHub.
Deep learning frameworks that officially support loading ONNX models for inference include: [Caffe, Caffe2, Keras, PyTorch, MXNet, ML.NET, TensorRT, TensorFlow, Microsoft CNTK](https://github.com/onnx/tutorials#converting-to-onnx-format).
Besides converting a model yourself, you can also get ONNX pretrained models from the ONNX Model Zoo:
[Vision](https://github.com/onnx/models#image-classification-)
- Image Classification
- Object Detection & Image Segmentation
- Body, Face & Gesture Analysis
- Image Manipulation (style transfer, or enhancing images by increasing resolution)

Language
- Machine Comprehension
- Machine Translation
- Language Modelling

Other
- Visual Question Answering & Dialog
- Speech & Audio Processing
- Other interesting models
### 2.2 Converting from PyTorch
First, load the model checkpoint you want to convert:
```python=
import torch
from src.model import CRNN

# load the trained checkpoint and restore the weights
checkpoint = torch.load('crnn.pth')
model = CRNN(...)  # construct with the same hyperparameters used in training
model.load_state_dict(checkpoint['state_dict'])
```
PyTorch itself has built-in support for exporting to ONNX.
[Official tutorial](https://docs.microsoft.com/zh-tw/windows/ai/windows-ml/tutorials/pytorch-convert-model)
```python=
model.eval()  # remember to switch to eval mode before exporting
dummy_input = torch.randn(1, 1, 32, 100, requires_grad=True)  # example input with the expected shape
save_path = 'crnn.onnx'
torch.onnx.export(model, dummy_input, save_path,
                  export_params=True,              # store the trained weights in the file
                  keep_initializers_as_inputs=True,
                  input_names=['Inputs'])          # name the input node for inference later
```
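After exporting, it is worth validating the file before using it. A small sketch using the standard `onnx` checker (the file name follows the example above):
```python=
import onnx

# load the exported file and verify the graph is well-formed
onnx_model = onnx.load('crnn.onnx')
onnx.checker.check_model(onnx_model)
print(onnx.helper.printable_graph(onnx_model.graph))  # human-readable graph dump
```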
### 2.3 Inference time (CPU/GPU)
To use an ONNX model, first install the related packages.
:exclamation: To run on GPU, install the onnxruntime-gpu package [(match it to your CUDA version)](https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html)
```python=
!pip install onnx==1.8.1  # the latest version hit bugs in our tests
!pip install onnxruntime
!pip install onnxruntime-gpu  # only needed for GPU inference
!pip install --upgrade protobuf  # skipping this also caused bugs
```
Configure single-threaded execution:
```python=
import onnx
import onnxruntime as ort

# restrict onnxruntime to a single thread for a fair timing comparison
options = ort.SessionOptions()
options.intra_op_num_threads = 1
options.inter_op_num_threads = 1
ort_session = ort.InferenceSession(path_or_bytes='crnn.onnx', sess_options=options)
```
```python=
ort.get_device()  # reports whether onnxruntime is running on CPU or GPU
```
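Note that on newer onnxruntime releases (roughly 1.9 and later), the execution provider must be selected explicitly when creating the session; a sketch, assuming onnxruntime-gpu is installed:
```python=
# request CUDA first, falling back to CPU if it is unavailable
ort_session = ort.InferenceSession(
    'crnn.onnx',
    sess_options=options,
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
```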
Run the inference:
```python=
import numpy as np

# 'Inputs' matches the input_names given to torch.onnx.export
outputs = ort_session.run(None, {'Inputs': image.astype(np.float32)})
```
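The timings in the table below can be reproduced with a simple loop; a sketch using a random input of the export shape (the ± values in the table suggest the original numbers came from `%timeit`, so treat this as illustrative):
```python=
import time
import numpy as np

image = np.random.randn(1, 1, 32, 100).astype(np.float32)  # dummy input

ort_session.run(None, {'Inputs': image})  # warm-up run
start = time.perf_counter()
for _ in range(50):
    ort_session.run(None, {'Inputs': image})
print(f'{(time.perf_counter() - start) / 50 * 1000:.2f} ms per run')
```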
Experimental results:
| Metric | CRNN (pth) | CRNN (onnx) | Yolov4 (pth) | Yolov4 (onnx) |
| -------- | -------- | -------- | -------- | -------- |
| Model file size | 30 MB | 30 MB | 256 MB | 256 MB |
| CPU inference time (over 50 runs) | 39.3 ms ± 410 μs | 29.5 ms ± 2.11 ms | 2.6 s ± 221 ms | 5.6 s ± 53.3 ms |
| GPU inference time (over 50 runs) | 10.2 ms ± 238 μs | 8.36 ms ± 78.6 μs | 38 ms ± 621 μs | 37.9 ms ± 77 μs |
## 3. Quantization for Model Compression
### 3.1 Introduction to Quantization
Quantization means <font color=red>performing computation and memory access with lower-precision data</font>, typically using the INT8 data type.
* ==Pros:== smaller model files, lower memory usage, faster inference
* ==Cons:== quantization is by design an approximation, so model accuracy usually drops slightly (see the toy sketch after the figures below)
* Model size comparison
![](https://i.imgur.com/75HpxDx.png)
* Inference time comparison
![](https://i.imgur.com/pNu6ZKF.png)
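To make the approximation concrete, here is a toy sketch of symmetric per-tensor INT8 quantization (a textbook formulation, not PyTorch's exact implementation):
```python=
import numpy as np

def quantize(x, scale):
    # affine mapping to int8: q = round(x / scale), clipped to the int8 range
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.52, -1.20, 0.03], dtype=np.float32)
scale = np.abs(w).max() / 127  # symmetric per-tensor scale
w_hat = dequantize(quantize(w, scale), scale)
print(w_hat)  # close to w but not exact -> the source of the accuracy drop
```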
### 3.2 PyTorch Implementation (CRNN)
PyTorch supports quantization out of the box [(link)](https://pytorch.org/tutorials/recipes/quantization.html).
There are three quantization approaches in total:
#### 3.2.1 Post Training Dynamic Quantization
* <font color=red>only supports nn.Linear and nn.LSTM</font>
* the official examples apply it to BERT / LSTM-based models
```python=
import torch

# the set of layer types to quantize dynamically
layers = {torch.nn.Linear, torch.nn.LSTM}
model_dynamic_quantized = torch.quantization.quantize_dynamic(
    model,              # the original model
    layers,             # qconfig_spec: which layer types to quantize
    dtype=torch.qint8)  # the target dtype for quantized weights
```
The architecture after conversion:
```
CRNN(
(cnn): Sequential(
(conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(batchnorm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu0): ReLU(inplace=True)
(pooling0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
...
(conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
(batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu6): ReLU(inplace=True)
)
(map_to_seq): DynamicQuantizedLinear(in_features=512, out_features=64, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(rnn1): DynamicQuantizedLSTM(64, 256, bidirectional=True)
(rnn2): DynamicQuantizedLSTM(512, 256, bidirectional=True)
(dense): DynamicQuantizedLinear(in_features=512, out_features=42, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
```
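Because the quantized weights are approximations, a quick sanity check on the output drift is cheap; a minimal sketch with a random input of the CRNN's expected shape:
```python=
import torch

x = torch.randn(1, 1, 32, 100)  # dummy CRNN input
with torch.no_grad():
    out_fp32 = model(x)
    out_int8 = model_dynamic_quantized(x)
print(torch.abs(out_fp32 - out_int8).max())  # small but nonzero difference
```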
#### 3.2.2 Post Training Static Quantization
Some background first:
fuse every layer sequence that can be fused, which improves both speed and accuracy (only the following patterns can be fused):
* Convolution, Batch normalization
* Convolution, Batch normalization, Relu
* Convolution, Relu
* Linear, Relu
* Batch normalization, Relu
```python=
# fuse each (conv, batchnorm, relu) triple inside the CNN backbone
for m in model.cnn.modules():
    if isinstance(m, torch.nn.Sequential):
        for i in range(7):
            torch.quantization.fuse_modules(
                m, [f'conv{i}', f'batchnorm{i}', f'relu{i}'], inplace=True)
```
The architecture becomes:
```
CRNN(
(cnn): Sequential(
(conv0): ConvReLU2d(
(0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
)
(batchnorm0): Identity()
(relu0): Identity()
(pooling0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv1): ConvReLU2d(
(0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
)
(batchnorm1): Identity()
(relu1): Identity()
(pooling1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv2): ConvReLU2d(
(0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
)
(batchnorm2): Identity()
(relu2): Identity()
(conv3): ConvReLU2d(
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
)
(batchnorm3): Identity()
(relu3): Identity()
(pooling2): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
(conv4): ConvReLU2d(
(0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
)
(batchnorm4): Identity()
(relu4): Identity()
(conv5): ConvReLU2d(
(0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
)
(batchnorm5): Identity()
(relu5): Identity()
(pooling3): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
(conv6): ConvReLU2d(
(0): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
(1): ReLU(inplace=True)
)
(batchnorm6): Identity()
(relu6): Identity()
)
(map_to_seq): Linear(in_features=512, out_features=64, bias=True)
(rnn1): LSTM(64, 256, bidirectional=True)
(rnn2): LSTM(512, 256, bidirectional=True)
  (dense): Linear(in_features=512, out_features=42, bias=True)
)
```
Now run the actual conversion:
```python=
model.eval()  # static quantization is applied in eval mode
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = torch.quantization.prepare(model)
# ...run calibration data through model_prepared here (see sketch below)...
model_int8 = torch.quantization.convert(model_prepared)
```
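Between `prepare` and `convert`, the inserted observers need to see representative data so they can record activation ranges; eager-mode static quantization also expects `QuantStub`/`DeQuantStub` around the quantized part of the forward pass. A minimal calibration sketch, assuming a hypothetical `calibration_loader`:
```python=
# `calibration_loader` is a hypothetical DataLoader of representative images
with torch.no_grad():
    for images, _ in calibration_loader:
        model_prepared(images)
```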
#### 3.2.3 Quantization Aware Training
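Quantization Aware Training inserts fake-quantize modules during fine-tuning so the network learns to compensate for quantization error, typically recovering most of the accuracy lost by the post-training methods. A minimal sketch following PyTorch's quantization recipe (the fine-tuning loop itself is omitted):
```python=
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)
# ...fine-tune model_prepared for a few epochs as usual...
model_int8 = torch.quantization.convert(model_prepared.eval())
```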
### 3.3 Model size & Inference time
The result is still a PyTorch model, so inference works exactly the same as before.
==However, quantized inference does not seem to support GPU yet, sadly==
| Metric | Original | Dynamic Quantization | Static Quantization | Static + Dynamic Quantization |
| -------- | -------- | -------- | -------- | -------- |
| Model file size | 30 MB | 24 MB | 14 MB | 7.6 MB |
| CPU inference time (over 50 runs) | 39.3 ms ± 410 μs | 34.6 ms ± 574 μs | 49.2 ms ± 1.59 ms | 40.4 ms ± 1.12 ms |
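The file sizes above can be reproduced by serializing each variant's `state_dict` and checking the size on disk; a small helper sketch (`model_int8` from above, though any of the variants works):
```python=
import os
import torch

def model_size_mb(model, path='tmp.pt'):
    # serialize the weights and report the on-disk size in MB
    torch.save(model.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f'{model_size_mb(model_int8):.1f} MB')
```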