Week 26: ONNX / Quantization Technical Study
1. Overview
- ONNX model conversion
- Quantization for model compression
2. ONNX Model Conversion
2.1 Introduction to ONNX
https://onnx.ai/
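ONNX (Open Neural Network Exchange) is an open file format designed for machine learning and used to store trained models.
It lets different AI frameworks (such as PyTorch and MXNet) store model data in the same format and exchange it with one another.
The ONNX specification and code are developed mainly by Microsoft, Amazon, Facebook, IBM, and other companies, and are hosted as open source on GitHub.
At the time of writing, the frameworks and runtimes that officially support loading ONNX models for inference are: Caffe, Caffe2, Keras, PyTorch, MXNet, ML.NET, TensorRT, TensorFlow, and Microsoft CNTK.
Besides converting a model yourself, you can also get pretrained ONNX models from the ONNX Model Zoo: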
Vision
- Image Classification
- Object Detection & Image Segmentation
- Body, Face & Gesture Analysis
- Image Manipulation (style transfer or enhancing images by increasing resolution)
Language
- Machine Comprehension
- Machine Translation
- Language Modelling
Other
- Visual Question Answering & Dialog
- Speech & Audio Processing
- Other interesting models
2.2 Converting from PyTorch
First, load the model file you want to convert.
torch has built-in support for exporting a model to ONNX.
Official documentation
2.3 Inference time (CPU/GPU)
To use an ONNX model, first install the required packages.
To use the GPU, install the onnxruntime-gpu package (pick the build that matches your CUDA version).
Configure the session to use a single thread.
Run inference.
Experiment results
| Item | CRNN (pth) | CRNN (onnx) | Yolov4 (pth) | Yolov4 (onnx) |
| --- | --- | --- | --- | --- |
| Model file size | 30 MB | 30 MB | 256 MB | 256 MB |
| CPU inference time (50 runs) | 39.3 ms ± 410 μs | 29.5 ms ± 2.11 ms | 2.6 s ± 221 ms | 5.6 s ± 53.3 ms |
| GPU inference time (50 runs) | 10.2 ms ± 238 μs | 8.36 ms ± 78.6 μs | 38 ms ± 621 μs | 37.9 ms ± 77 μs |
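The notes do not say how these timings were collected. One plausible way to get a "mean ± standard deviation over 50 runs" figure is sketched below using only the standard library; `time_inference` and `fake_inference` are hypothetical names, and `fake_inference` stands in for a real `model(x)` or `session.run(...)` call.

```python
import statistics
import time


def time_inference(fn, runs=50, warmup=5):
    """Return (mean, standard deviation) of the wall-clock time of fn, in seconds."""
    for _ in range(warmup):  # warm-up runs are not measured
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def fake_inference():
    """Hypothetical stand-in for one forward pass."""
    sum(i * i for i in range(10_000))


mean_s, std_s = time_inference(fake_inference)
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per run (50 runs)")
```

When timing PyTorch on a GPU, kernels run asynchronously, so a `torch.cuda.synchronize()` call before each clock reading would be needed for the numbers to be meaningful.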
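3. Quantization for Model Compression
3.1 Introduction to Quantization
Quantization means performing computation and memory access with lower-precision data, typically the INT8 data type.
- Pros: smaller model, lower memory usage, faster inference
- Cons: quantization is by design an approximation, so it usually lowers model accuracy slightly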
- Model size comparison [figure]
- Inference time comparison [figure]
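To make the "approximation" in the drawbacks above concrete, here is a minimal sketch of affine INT8 quantization in plain Python. The helper names (`choose_qparams`, `quantize`, `dequantize`) and the sample weights are illustrative assumptions; this shows the general idea and is not how PyTorch implements it internally.

```python
def choose_qparams(x_min, x_max, qmin=-128, qmax=127):
    """Pick a scale and zero point that map [x_min, x_max] onto [qmin, qmax]."""
    # Extend the range to include 0.0 so that zero is represented exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all values are zero; any scale works
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to (approximate) floats."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.4, 1.55]
scale, zero_point = choose_qparams(min(weights), max(weights))
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, restored, max_error)
```

Each value is stored in 8 bits instead of 32, which is where the size reduction comes from, and the round trip is off by at most about half of `scale`, which is where the small accuracy loss comes from.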
3.2 PyTorch Implementation (CRNN)
torch has built-in support for quantization (link).
There are three quantization methods in total:
3.2.1 Post Training Dynamic Quantization
- Supported only nn.Linear and nn.LSTM at the time of writing
- The official examples apply it to BERT and LSTM-based models
Architecture after conversion:
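As a sketch of what this conversion looks like, the following applies `torch.quantization.quantize_dynamic` to a small model and prints the converted modules. `TinyRecognizer` and its layer sizes are hypothetical stand-ins for the recurrent and dense part of CRNN, not the author's model.

```python
import torch
import torch.nn as nn


class TinyRecognizer(nn.Module):
    """Hypothetical stand-in for the LSTM + Linear part of a CRNN."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(64, 32, bidirectional=True)
        self.dense = nn.Linear(64, 42)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.dense(out)


model = TinyRecognizer().eval()

# Weights of the listed module types are converted to INT8 ahead of time;
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.LSTM}, dtype=torch.qint8
)
print(quantized)

# Inference is called exactly as before (sequence length 5, batch size 1).
out = quantized(torch.randn(5, 1, 64))
print(out.shape)
```

No calibration data is needed for this method, which is why it is the easiest of the three to try first.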
3.2.2 Post Training Static Quantization
TODO: add an introduction.
Fuse the layers that can be fused, to improve speed and accuracy (only the following sequences can be fused):
- Convolution, Batch normalization
- Convolution, Batch normalization, ReLU
- Convolution, ReLU
- Linear, ReLU
- Batch normalization, ReLU
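Why Convolution and Batch normalization can be fused at all comes down to a little arithmetic: at inference time batch normalization is a fixed per-channel affine transform, so it can be folded into the weight and bias of the layer before it. The sketch below checks this for a single channel with a scalar weight; `fold_batchnorm` and all the numbers are illustrative assumptions.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold gamma * ((w*x + b) - mean) / sqrt(var + eps) + beta into new (w, b)."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


# One channel of a layer, reduced to a scalar weight and bias.
w, b = 0.5, 0.1
# Batch-norm parameters and running statistics for that channel.
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0
x = 2.0

# Two separate operations: layer, then batch norm.
unfused = gamma * ((w * x + b) - mean) / math.sqrt(var + 1e-5) + beta

# One operation with folded parameters.
w_folded, b_folded = fold_batchnorm(w, b, gamma, beta, mean, var)
fused = w_folded * x + b_folded

print(unfused, fused)
```

The two results agree up to floating-point rounding, so the fused layer does the same work in one step and there is one fewer intermediate tensor to quantize.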
The architecture becomes:
CRNN(
  (cnn): Sequential(
    (conv0): ConvReLU2d(
      (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm0): Identity()
    (relu0): Identity()
    (pooling0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (conv1): ConvReLU2d(
      (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm1): Identity()
    (relu1): Identity()
    (pooling1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (conv2): ConvReLU2d(
      (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm2): Identity()
    (relu2): Identity()
    (conv3): ConvReLU2d(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm3): Identity()
    (relu3): Identity()
    (pooling2): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
    (conv4): ConvReLU2d(
      (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm4): Identity()
    (relu4): Identity()
    (conv5): ConvReLU2d(
      (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm5): Identity()
    (relu5): Identity()
    (pooling3): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
    (conv6): ConvReLU2d(
      (0): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm6): Identity()
    (relu6): Identity()
  )
  (map_to_seq): Linear(in_features=512, out_features=64, bias=True)
  (rnn1): LSTM(64, 256, bidirectional=True)
  (rnn2): LSTM(512, 256, bidirectional=True)
  (dense): Linear(in_features=512, out_features=42, bias=True)
)
Start the conversion.
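A minimal sketch of the whole static flow (fuse, attach a qconfig, calibrate, convert) is given below using PyTorch's eager-mode quantization API. `TinyCNN`, the placement of `QuantStub`/`DeQuantStub`, and the random calibration batches are assumptions for illustration; a real run would calibrate on representative images.

```python
import torch
import torch.nn as nn


class TinyCNN(nn.Module):
    """Hypothetical stand-in for the convolutional part of a CRNN."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # float -> int8 at the input
        self.conv = nn.Conv2d(1, 4, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(4)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> float at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        return self.dequant(x)


model = TinyCNN().eval()

# 1. Fuse Convolution + Batch normalization + ReLU into a single module.
fused = torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]])
print(fused)

# 2. Choose a quantization config for the backend this build of PyTorch uses.
fused.qconfig = torch.quantization.get_default_qconfig(
    torch.backends.quantized.engine
)

# 3. Insert observers, then run calibration data through the model so the
#    observers can record activation ranges (random data is a placeholder).
prepared = torch.quantization.prepare(fused)
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(2, 1, 8, 8))

# 4. Replace the observed modules with their INT8 counterparts.
quantized = torch.quantization.convert(prepared)
out = quantized(torch.randn(2, 1, 8, 8))
print(out.shape)
```

Printing `fused` shows the same pattern as the architecture listed above: the convolution becomes `ConvReLU2d` and the original batch-norm and ReLU slots become `Identity()`.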
3.2.3 Quantization Aware Training
3.3 Model size & Inference time
The result is still a PyTorch model, so inference works the same way as before.
Unfortunately, GPU inference does not seem to be supported yet.
| Item | Original | Dynamic Quantization | Static Quantization | Static + Dynamic Quantization |
| --- | --- | --- | --- | --- |
| Model file size | 30 MB | 24 MB | 14 MB | 7.6 MB |
| CPU inference time (50 runs) | 39.3 ms ± 410 μs | 34.6 ms ± 574 μs | 49.2 ms ± 1.59 ms | 40.4 ms ± 1.12 ms |
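Reading the table as ratios makes the trade-off easier to see. The short calculation below uses only the mean values from the table, assuming that "M" means megabytes; it ignores the reported spread of each measurement.

```python
# Mean values copied from the table above (sizes assumed to be in MB).
sizes_mb = {
    "Original": 30.0,
    "Dynamic": 24.0,
    "Static": 14.0,
    "Static + Dynamic": 7.6,
}
cpu_ms = {
    "Original": 39.3,
    "Dynamic": 34.6,
    "Static": 49.2,
    "Static + Dynamic": 40.4,
}

for name in sizes_mb:
    size_ratio = sizes_mb["Original"] / sizes_mb[name]  # > 1 means smaller file
    time_ratio = cpu_ms[name] / cpu_ms["Original"]      # > 1 means slower
    print(f"{name:18s} {size_ratio:4.2f}x smaller, {time_ratio:4.2f}x CPU time")
```

In these measurements the combined method gives the smallest file at roughly the original CPU time, while static quantization alone shrinks the file but is slower than the original on CPU.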