
Week 26: ONNX / Quantization Technical Study

tags: Technical Study

1. Topics This Week

  • ONNX model conversion
  • Quantization model compression

2. ONNX Model Conversion

2.1 Introduction to ONNX

https://onnx.ai/

ONNX (Open Neural Network Exchange) is an open file format designed for machine learning, used to store trained models.
It allows different AI frameworks (such as PyTorch and MXNet) to store model data in a common format and exchange it.
The ONNX specification and code are developed jointly by Microsoft, Amazon, Facebook, IBM, and other companies, and are hosted as open source on GitHub.
Deep learning frameworks that officially support loading ONNX models for inference include: Caffe2, Keras, PyTorch, MXNet, ML.NET, TensorRT, TensorFlow, and Microsoft CNTK.

Besides converting a model yourself, you can also obtain ONNX pretrained models from the ONNX Model Zoo (a load-and-check sketch follows the category list):

Vision

  • Image Classification
  • Object Detection & Image Segmentation
  • Body, Face & Gesture Analysis
  • Image Manipulation (style transfer or enhancing images by increasing resolution)

Language

  • Machine Comprehension
  • Machine Translation
  • Language Modelling

Other

  • Visual Question Answering & Dialog
  • Speech & Audio Processing
  • Other interesting models
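
A model taken from the Model Zoo (or one you convert yourself) can be loaded and sanity-checked with the onnx package. A minimal sketch, assuming a local crnn.onnx file:

import onnx

onnx_model = onnx.load('crnn.onnx')                   # any .onnx file works here
onnx.checker.check_model(onnx_model)                  # raises if the graph is malformed
print(onnx.helper.printable_graph(onnx_model.graph))  # human-readable graph dump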

2.2 Converting from PyTorch

First, load the model checkpoint you want to convert:

import torch
from src.model import CRNN

checkpoint = torch.load('crnn.pth')
model = CRNN(...)
model.load_state_dict(checkpoint['state_dict'])

PyTorch has built-in support for exporting to ONNX (see the official torch.onnx documentation).

model.eval()  # remember to switch to eval mode before exporting
dummy_input = torch.randn(1, 1, 32, 100, requires_grad=True)
save_path = 'crnn.onnx'
torch.onnx.export(model, dummy_input, save_path,
                  export_params=True,
                  keep_initializers_as_inputs=True,
                  input_names=['Inputs'])
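
Note that the exported graph fixes the input to the dummy shape (1, 1, 32, 100). If variable batch sizes are needed at inference time, torch.onnx.export also accepts a dynamic_axes argument; a sketch reusing the names above:

torch.onnx.export(model, dummy_input, save_path,
                  export_params=True,
                  keep_initializers_as_inputs=True,
                  input_names=['Inputs'],
                  dynamic_axes={'Inputs': {0: 'batch'}})  # axis 0 may vary at runtime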

2.3 Inference time (CPU/GPU)

To use an ONNX model, first install the related packages.

To run on GPU, also install the onnxruntime-gpu package (pick the build matching your CUDA version).

!pip install onnx==1.8.1   # the latest version hit bugs when we tried it
!pip install onnxruntime
!pip install onnxruntime-gpu
!pip install --upgrade protobuf   # skipping this upgrade also causes bugs

Configure the session to use a single thread:

import onnx
import onnxruntime as ort

options = ort.SessionOptions()
options.intra_op_num_threads = 1   # threads within a single op
options.inter_op_num_threads = 1   # threads across ops
ort_session = ort.InferenceSession(path_or_bytes='crnn.onnx', sess_options=options)
ort.get_device()  # reports whether this build runs on CPU or GPU

Now run inference:

import numpy as np

# 'Inputs' matches the input name given at export time
outputs = ort_session.run(None, {'Inputs': image.astype(np.float32)})
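
Before trusting the conversion, it is worth checking that the ONNX output matches the original PyTorch model. A minimal sketch, assuming image holds the same (1, 1, 32, 100) input used above:

import numpy as np
import torch

with torch.no_grad():
    torch_out = model(torch.from_numpy(image.astype(np.float32)))

# the two backends should agree up to small numerical differences
np.testing.assert_allclose(torch_out.numpy(), outputs[0], rtol=1e-3, atol=1e-5)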

Experimental results

Item                          CRNN (pth)        CRNN (onnx)        Yolov4 (pth)    Yolov4 (onnx)
Model file size               30 MB             30 MB              256 MB          256 MB
CPU inference time (50 runs)  39.3 ms ± 410 μs  29.5 ms ± 2.11 ms  2.6 s ± 221 ms  5.6 s ± 53.3 ms
GPU inference time (50 runs)  10.2 ms ± 238 μs  8.36 ms ± 78.6 μs  38 ms ± 621 μs  37.9 ms ± 77 μs

3. Quantization Model Compression

3.1 Introduction to Quantization

Quantization means running computation and memory access at lower numerical precision, typically using the INT8 data type.

  • Pros: smaller model files, lower memory usage, faster inference
  • Cons: quantization works by approximation, so model accuracy usually drops slightly (a small round-trip example follows this list)
  • Model size comparison [figure unavailable]
  • Inference time comparison [figure unavailable]
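
To make the approximation concrete, here is a tiny round trip through torch's per-tensor quantization; the scale and zero_point values are arbitrary choices for illustration:

import torch

x = torch.tensor([0.03, 0.57, -1.24, 2.0])
# map fp32 values onto an int8 grid with step 0.1 (illustrative scale)
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)
print(q.int_repr())    # the stored int8 values
print(q.dequantize())  # ≈ x, but rounded to the nearest multiple of 0.1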

3.2 PyTorch Implementation (CRNN)

PyTorch supports quantization natively (link).
There are three quantization approaches in total:

3.2.1 Post Training Dynamic Quantization

  • only supports nn.Linear and nn.LSTM
  • the official examples apply it to BERT / LSTM-based models
import torch

# the set of layer types to dynamically quantize
layers = {torch.nn.Linear, torch.nn.LSTM}
model_dynamic_quantized = torch.quantization.quantize_dynamic(
    model,              # the original model
    layers,             # which layer types to quantize
    dtype=torch.qint8,  # the target dtype for quantized weights
)

The architecture after conversion:

CRNN(
  (cnn): Sequential(
    (conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu0): ReLU(inplace=True)
    (pooling0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    ...
    (conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
    (batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu6): ReLU(inplace=True)
  )
  (map_to_seq): DynamicQuantizedLinear(in_features=512, out_features=64, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
  (rnn1): DynamicQuantizedLSTM(64, 256, bidirectional=True)
  (rnn2): DynamicQuantizedLSTM(512, 256, bidirectional=True)
  (dense): DynamicQuantizedLinear(in_features=512, out_features=42, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
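
A quick way to see the size reduction reported in section 3.3 is to serialize each state dict to disk and compare; a minimal sketch (the tmp.pt filename is arbitrary):

import os
import torch

def print_model_size(m, tag):
    torch.save(m.state_dict(), 'tmp.pt')  # serialize weights only
    print(tag, os.path.getsize('tmp.pt') / 1e6, 'MB')
    os.remove('tmp.pt')

print_model_size(model, 'fp32')
print_model_size(model_dynamic_quantized, 'dynamic int8')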

3.2.2 Post Training Static Quantization

As a preliminary step, fuse every layer sequence that can be fused; this improves both speed and accuracy (only these sequences can be fused):

  • Convolution, Batch normalization
  • Convolution, Batch normalization, Relu
  • Convolution, Relu
  • Linear, Relu
  • Batch normalization, Relu
model.eval()  # Conv + BatchNorm fusion requires eval mode
for m in model.cnn.modules():
    if isinstance(m, torch.nn.Sequential):
        torch.quantization.fuse_modules(m, ['conv0', 'batchnorm0', 'relu0'], inplace=True)
        torch.quantization.fuse_modules(m, ['conv1', 'batchnorm1', 'relu1'], inplace=True)
        torch.quantization.fuse_modules(m, ['conv2', 'batchnorm2', 'relu2'], inplace=True)
        torch.quantization.fuse_modules(m, ['conv3', 'batchnorm3', 'relu3'], inplace=True)
        torch.quantization.fuse_modules(m, ['conv4', 'batchnorm4', 'relu4'], inplace=True)
        torch.quantization.fuse_modules(m, ['conv5', 'batchnorm5', 'relu5'], inplace=True)
        torch.quantization.fuse_modules(m, ['conv6', 'batchnorm6', 'relu6'], inplace=True)

The architecture becomes:

CRNN(
  (cnn): Sequential(
    (conv0): ConvReLU2d(
      (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm0): Identity()
    (relu0): Identity()
    (pooling0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (conv1): ConvReLU2d(
      (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm1): Identity()
    (relu1): Identity()
    (pooling1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (conv2): ConvReLU2d(
      (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm2): Identity()
    (relu2): Identity()
    (conv3): ConvReLU2d(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm3): Identity()
    (relu3): Identity()
    (pooling2): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
    (conv4): ConvReLU2d(
      (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm4): Identity()
    (relu4): Identity()
    (conv5): ConvReLU2d(
      (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm5): Identity()
    (relu5): Identity()
    (pooling3): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
    (conv6): ConvReLU2d(
      (0): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
      (1): ReLU(inplace=True)
    )
    (batchnorm6): Identity()
    (relu6): Identity()
  )
  (map_to_seq): Linear(in_features=512, out_features=64, bias=True)
  (rnn1): LSTM(64, 256, bidirectional=True)
  (rnn2): LSTM(512, 256, bidirectional=True)
  (dense): Linear(in_features=512, out_features=42, bias=True)
)

Now run the conversion:

model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = torch.quantization.prepare(model)     # inserts observers
model_int8 = torch.quantization.convert(model_prepared)
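
Static quantization needs observed activation ranges, so between prepare and convert the prepared model should see some representative data. A minimal calibration sketch, where calibration_loader is an assumed DataLoader of sample inputs:

import torch

model_prepared.eval()
with torch.no_grad():
    for images, _ in calibration_loader:  # hypothetical loader of real inputs
        model_prepared(images)            # observers record activation ranges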

3.2.3 Quantization Aware Training
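
Quantization aware training inserts fake-quantization ops during fine-tuning, so the network learns to tolerate INT8 rounding. A sketch of the typical PyTorch flow (train_one_epoch is a hypothetical fine-tuning loop, not from this project):

import torch

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)  # adds fake-quant modules

for epoch in range(3):  # brief fine-tuning with fake quantization active
    train_one_epoch(model_prepared)

model_prepared.eval()
model_int8 = torch.quantization.convert(model_prepared)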

3.3 Model size & Inference time

The model is still a PyTorch model, so inference works the same way as before.
However, quantized inference does not appear to support GPU yet, unfortunately.

Item                          Original          Dynamic Quantization  Static Quantization  Static + Dynamic Quantization
Model file size               30 MB             24 MB                 14 MB                7.6 MB
CPU inference time (50 runs)  39.3 ms ± 410 μs  34.6 ms ± 574 μs      49.2 ms ± 1.59 ms    40.4 ms ± 1.12 ms