# Week 26: ONNX / Quantization Tech Study
###### tags: `技術研討`

## 1. Overview
* ONNX model conversion
* Quantization model compression

## 2. ONNX Model Conversion
### 2.1 About ONNX
https://onnx.ai/

ONNX (Open Neural Network Exchange) is an open file format designed for machine learning and used to store trained models. It lets different **AI frameworks** (e.g. PyTorch, MXNet) **store model data in a common format** and exchange it. The ONNX specification and code are developed jointly by Microsoft, Amazon, Facebook, IBM, and other companies, and are hosted as open source on GitHub.

Deep learning frameworks that officially support loading ONNX models for inference include [Caffe, Caffe2, Keras, PyTorch, MXNet, ML.NET, TensorRT, TensorFlow, Microsoft CNTK](https://github.com/onnx/tutorials#converting-to-onnx-format).

Besides converting models yourself, pretrained ONNX models are also available from the ONNX Model Zoo:

[Vision](https://github.com/onnx/models#image-classification-)
- Image Classification
- Object Detection & Image Segmentation
- Body, Face & Gesture Analysis
- Image Manipulation (style transfer or enhancing images by increasing resolution)

Language
- Machine Comprehension
- Machine Translation
- Language Modelling

Other
- Visual Question Answering & Dialog
- Speech & Audio Processing
- Other interesting models

### 2.2 Converting from PyTorch
First, load the model checkpoint to be converted:

```python=
import torch
from src.model import CRNN

# Restore the trained weights from the checkpoint
checkpoint = torch.load('crnn.pth')
model = CRNN(...)
model.load_state_dict(checkpoint['state_dict'])
```

PyTorch has built-in support for exporting to ONNX ([official docs](https://docs.microsoft.com/zh-tw/windows/ai/windows-ml/tutorials/pytorch-convert-model)):

```python=
model.eval()  # remember to switch to eval mode
dummy_input = torch.randn(1, 1, 32, 100, requires_grad=True)
save_path = 'crnn.onnx'

torch.onnx.export(model,
                  dummy_input,
                  save_path,
                  export_params=True,
                  keep_initializers_as_inputs=True,
                  input_names=['Inputs'])
```

### 2.3 Inference time (CPU/GPU)
To use an ONNX model, install the required packages first.
:exclamation: To run on GPU, install the onnxruntime-gpu package [(pick the build matching your CUDA version)](https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html)

```python=
!pip install onnx==1.8.1  # the latest version hit a bug in our tests
!pip install onnxruntime
!pip install onnxruntime-gpu
!pip install --upgrade protobuf  # skipping this also triggers a bug
```

Configure single-threaded execution:

```python=
import onnx
import onnxruntime as ort

options = ort.SessionOptions()
options.intra_op_num_threads = 1  # note: the attributes are named *_num_threads
options.inter_op_num_threads = 1
ort_session = ort.InferenceSession(path_or_bytes='crnn.onnx', sess_options=options)
```

```python=
ort.get_device()  # reports whether this build runs on CPU or GPU
```

Run inference:

```python=
import numpy as np

# `image` is the preprocessed input array; the key must match the
# `input_names` given at export time
outputs = ort_session.run(None, {'Inputs': image.astype(np.float32)})
```

Results:

| Item | CRNN (pth) | CRNN (onnx) | YOLOv4 (pth) | YOLOv4 (onnx) |
| -------- | -------- | -------- | -------- | -------- |
| Model file size | 30 MB | 30 MB | 256 MB | 256 MB |
| CPU inference time (50 runs) | 39.3 ms ± 410 μs | 29.5 ms ± 2.11 ms | 2.6 s ± 221 ms | 5.6 s ± 53.3 ms |
| GPU inference time (50 runs) | 10.2 ms ± 238 μs | 8.36 ms ± 78.6 μs | 38 ms ± 621 μs | 37.9 ms ± 77 μs |
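For reference, a minimal sketch of a timing loop matching the 50-run setup above (the session and input name follow §2.2–2.3; the random dummy input is illustrative, and real measurements should feed the preprocessed `image` instead):

```python=
import time
import numpy as np
import onnxruntime as ort

ort_session = ort.InferenceSession('crnn.onnx')
dummy = np.random.randn(1, 1, 32, 100).astype(np.float32)

ort_session.run(None, {'Inputs': dummy})  # warm-up run

# Average wall-clock time over 50 runs, as in the table above
start = time.perf_counter()
for _ in range(50):
    ort_session.run(None, {'Inputs': dummy})
print(f'{(time.perf_counter() - start) / 50 * 1000:.2f} ms per run')
```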
## 3. Quantization Model Compression
### 3.1 About Quantization
Quantization means <font color=red>running computation and memory access with lower-precision data</font>, typically the INT8 data type.
* ==Pros:== smaller models, lower memory usage, faster inference
* ==Cons:== quantization is by design an approximation, so model accuracy usually drops slightly

* Model size comparison
![](https://i.imgur.com/75HpxDx.png)
* Inference time comparison
![](https://i.imgur.com/pNu6ZKF.png)

### 3.2 PyTorch Implementation (CRNN)
PyTorch supports quantization natively [(link)](https://pytorch.org/tutorials/recipes/quantization.html). There are three quantization approaches:

#### 3.2.1 Post Training Dynamic Quantization
* <font color=red>only supports nn.Linear and nn.LSTM</font>
* the official examples apply it to BERT / LSTM-based models

```python=
import torch

# the set of layer types to dynamically quantize
layers = {torch.nn.Linear, torch.nn.LSTM}
model_dynamic_quantized = torch.quantization.quantize_dynamic(
    model,              # the original model
    layers,             # layer types to quantize
    dtype=torch.qint8)  # the target dtype for quantized weights
```

After conversion, the architecture becomes:

```
CRNN(
  (cnn): Sequential(
    (conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu0): ReLU(inplace=True)
    (pooling0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    ...
    (conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
    (batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu6): ReLU(inplace=True)
  )
  (map_to_seq): DynamicQuantizedLinear(in_features=512, out_features=64, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
  (rnn1): DynamicQuantizedLSTM(64, 256, bidirectional=True)
  (rnn2): DynamicQuantizedLSTM(512, 256, bidirectional=True)
  (dense): DynamicQuantizedLinear(in_features=512, out_features=42, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
```

#### 3.2.2 Post Training Static Quantization
One preparation step first: fuse the layers that can be fused, which improves both speed and accuracy. Only these sequences can be fused:
* Convolution, Batch normalization
* Convolution, Batch normalization, ReLU
* Convolution, ReLU
* Linear, ReLU
* Batch normalization, ReLU

```python=
for m in model.cnn.modules():
    if type(m) == torch.nn.Sequential:  # compare against the class, not an instance
        for i in range(7):
            torch.quantization.fuse_modules(
                m, [f'conv{i}', f'batchnorm{i}', f'relu{i}'], inplace=True)
```

The architecture becomes:

```
CRNN(
  (cnn): Sequential(
    (conv0): ConvReLU2d(
      (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm0): Identity()
    (relu0): Identity()
    (pooling0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (conv1): ConvReLU2d(
      (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm1): Identity()
    (relu1): Identity()
    (pooling1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (conv2): ConvReLU2d(
      (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm2): Identity()
    (relu2): Identity()
    (conv3): ConvReLU2d(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm3): Identity()
    (relu3): Identity()
    (pooling2): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
    (conv4): ConvReLU2d(
      (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm4): Identity()
    (relu4): Identity()
    (conv5): ConvReLU2d(
      (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm5): Identity()
    (relu5): Identity()
    (pooling3): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
    (conv6): ConvReLU2d(
      (0): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm6): Identity()
    (relu6): Identity()
  )
  (map_to_seq): Linear(in_features=512, out_features=64, bias=True)
  (rnn1): LSTM(64, 256, bidirectional=True)
  (rnn2): LSTM(512, 256, bidirectional=True)
  (dense): Linear(in_features=512, out_features=42, bias=True)
)
```
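The conversion itself boils down to `prepare` (insert observers) plus `convert` (swap in INT8 modules), shown next. Note, however, that eager-mode static quantization also expects `QuantStub`/`DeQuantStub` at the model boundaries and a calibration pass over representative data between those two calls. A sketch of that fuller flow, assuming PyTorch's built-in `QuantWrapper` helper and an illustrative `calibration_loader` (neither is from the original write-up):

```python=
import torch

# Wrap the fused model so inputs are quantized on entry and outputs
# dequantized on exit (QuantStub/DeQuantStub are inserted automatically)
wrapped = torch.quantization.QuantWrapper(model)
wrapped.eval()
wrapped.qconfig = torch.quantization.get_default_qconfig('fbgemm')
# for submodules static quantization cannot handle (e.g. the LSTMs here),
# set `qconfig = None` on them so they stay in float

prepared = torch.quantization.prepare(wrapped)

# Calibration: run a few representative batches so the observers can
# record activation ranges; `calibration_loader` is a stand-in for real data
with torch.no_grad():
    for images in calibration_loader:
        prepared(images)

model_int8 = torch.quantization.convert(prepared)
```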
The core conversion calls:

```python=
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = torch.quantization.prepare(model)
model_int8 = torch.quantization.convert(model_prepared)
```

#### 3.2.3 Quantization Aware Training
Quantization Aware Training simulates quantization with fake-quantize ops during training, so the network learns to compensate for the precision loss; it typically retains the most accuracy of the three approaches. It was not benchmarked here, but a minimal sketch is included after the results below.

### 3.3 Model size & Inference time
The quantized model is still a PyTorch model, so inference works the same way as before.
==However, quantized inference does not seem to support GPU yet QQ==

| Item | Original | Dynamic Quantization | Static Quantization | Static + Dynamic Quantization |
| -------- | -------- | -------- | -------- | -------- |
| Model file size | 30 MB | 24 MB | 14 MB | 7.6 MB |
| CPU inference time (50 runs) | 39.3 ms ± 410 μs | 34.6 ms ± 574 μs | 49.2 ms ± 1.59 ms | 40.4 ms ± 1.12 ms |
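As mentioned in §3.2.3, a minimal Quantization Aware Training sketch (the fine-tuning loop, `train_loader`, and `criterion` are illustrative placeholders, not from the original write-up):

```python=
import torch

# Insert fake-quantization ops, then fine-tune so the weights adapt to
# INT8 precision before the real conversion
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)

# `train_loader` and `criterion` are illustrative placeholders
optimizer = torch.optim.SGD(model_prepared.parameters(), lr=1e-4)
for images, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model_prepared(images), targets)
    loss.backward()
    optimizer.step()

model_prepared.eval()
model_int8 = torch.quantization.convert(model_prepared)
```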