YH Hsu
# Model Quantization Note 模型量化筆記

###### tags: `Edge AI` `Quantization` `TensorRT` `cuda` `deployment` `YOLOv7` `Nvidia` `Jetson` `Object Detection`

## Notes on NVIDIA Jetson Deployment

### Basic environment setup
- [Jetson AGX Xavier 系統環境設定1_在windows10環境連接與安裝](https://hackmd.io/@YungHuiHsu/HJ2lcU4Rj)
- [Jetson AGX Xavier 系統環境設定2_Docker安裝或從源程式碼編譯](https://hackmd.io/k-lnDTxVQDWo_V13WEnfOg)
- [NVIDIA Container Toolkit 安裝筆記](https://hackmd.io/wADvyemZRDOeEduJXA9X7g)
- [Jetson 邊緣裝置查詢系統性能指令jtop](https://hackmd.io/VXXV3T5GRIKi6ap8SkR-tg)
- [Jetson Network Setup 網路設定](https://hackmd.io/WiqAB7pLSpm2863N2ISGXQ)
- [OpenCV turns on cuda acceleration in Nvidia Jetson platform<br>OpenCV在Nvidia Jetson平台開啟cuda加速](https://hackmd.io/6IloyiWMQ_qbIpIE_c_1GA)

### Model deployment and acceleration
- [[Object Detection_YOLO] YOLOv7 論文筆記](https://hackmd.io/xhLeIsoSToW0jL61QRWDcQ)
- [Deploy YOLOv7 on Nvidia Jetson](https://hackmd.io/kZftj6AgQmWJsbXsswIwEQ)
- [Convert PyTorch model to TensorRT for 3-8x speedup<br>將PyTorch模型轉換為TensorRT,實現3-8倍加速](https://hackmd.io/_oaJhYNqTvyL_h01X1Fdmw?both)
- [Accelerate multi-streaming cameras with DeepStream and deploy custom (YOLO) models<br>使用DeepStream加速多串流攝影機並部署客製(YOLO)模型](https://hackmd.io/@YungHuiHsu/rJKx-tv4h)
- [Model Quantization Note 模型量化筆記](https://hackmd.io/riYLcrp1RuKHpVI22oEAXA)

---

## Introduction

Model acceleration at the deployment stage centers on model compression, which includes model pruning, model quantization, model distillation, and low-rank factorization.

![](https://hackmd.io/_uploads/ByjdtdgS2.png =400x)
(source: [2020。A comprehensive survey on model compression and acceleration](https://www.researchgate.net/figure/Different-types-of-compression-techniques-for-DNN-and-traditional-ML-methods-Here-the_fig1_339129502))

- A quick introduction to the four common compression techniques: [4 Popular Model Compression Techniques Explained](https://xailient.com/blog/4-popular-model-compression-techniques-explained/#3_The_knowledge_distillation_technique)

This note focuses on the most mainstream and broadly applicable of these techniques: model quantization.

## Principles

- Ask ChatGPT:
> Benefits of model quantization:
> + By lowering numerical precision and simplifying model parameters, quantization greatly reduces model size and computational load, and improves the model's efficiency on specific hardware.

### What is quantization?

> Quantization is a model-size-reduction technique that converts model weights from a high-precision floating-point representation to a lower-precision floating-point (FP) or integer (INT) representation. By converting the weights from high precision to low precision, model size and inference speed can improve significantly without sacrificing much accuracy. Quantization also boosts performance by reducing memory-bandwidth requirements and improving cache utilization.

- Challenge
  - Quantization introduces some loss of model accuracy, forcing a trade-off between [speed, memory footprint] and [accuracy].

### Reducing the dynamic range of the data

Model quantization works mainly by converting floating-point weights and activations to fixed-point representations. In the process, the very large FP32 dynamic range must be compressed into the 255 ($2^8-1$) values of INT8, or even encoded into the 15 ($2^4-1$) values of INT4.

![](https://hackmd.io/_uploads/rJw6Y9gH2.png =400x)
(source: [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/))

- float32 → int8

![](https://hackmd.io/_uploads/BkkGq9xSn.png =400x)
(source: [Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/))

## Calibration

Calibration after quantization is the process used to recover the accuracy lost in a quantized model. Because converting floating-point parameters to a low-bit representation introduces quantization error, model performance may degrade. The goal of calibration is to adjust the quantization parameters so that this error is minimized, preserving the model's performance and accuracy.

- Calibration illustration (FP32 → INT8)

![](https://hackmd.io/_uploads/ry9GZ6xS2.png =500x)
(source: [Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/))

### Quantization basics: range mapping

- (NVIDIA TensorRT adopts symmetric quantization and calibrates with KL divergence.)

The conversion maps (compresses) a real-valued distribution with a large dynamic range $[\beta, \alpha]$ onto the smaller integer range $[-2^{b-1}, 2^{b-1}-1]$.
- The integer range here is that of a signed integer with bit width $b$.
- Both mappings are linear transforms of the form $f(x) = s \cdot x + z$; when $z = 0$ this reduces to $f(x) = s \cdot x$. The two cases are referred to as Affine and Scale quantization, respectively.

![](https://hackmd.io/_uploads/rJobHsbHh.png)
(source: [2020。INTEGER QUANTIZATION FOR DEEP LEARNING INFERENCE: PRINCIPLES AND EMPIRICAL EVALUATION](https://arxiv.org/pdf/2004.09602.pdf))

On the left is asymmetric quantization (Affine Quantization); on the right is symmetric quantization (Scale Quantization). A few key observations:

- In symmetric quantization, real 0 maps to integer 0; in asymmetric quantization, real 0 does not necessarily map to integer 0 but to the zero-point $z$.
- The real-valued range in symmetric quantization is the symmetric interval $[-\alpha, \alpha]$ (integers $[-127, 127]$), whereas the asymmetric range $[\beta, \alpha]$ is not symmetric (integers $[-128, 127]$).
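The two linear mappings just described can be sketched in a few lines of NumPy, following the formulas of the Wu et al. paper cited above. This is a minimal illustration (the function names are my own, not any library's API); TensorRT's actual implementation differs.

```python
import numpy as np

def scale_quantize(x, alpha, b=8):
    """Symmetric (scale) quantization f(x) = s*x: real 0 maps to integer 0."""
    s = (2 ** (b - 1) - 1) / alpha                       # s = 127/alpha for INT8
    q = np.clip(np.round(s * x), -(2 ** (b - 1) - 1), 2 ** (b - 1) - 1)
    return q.astype(np.int32), s

def affine_quantize(x, beta, alpha, b=8):
    """Asymmetric (affine) quantization f(x) = s*x + z: real 0 maps to z."""
    s = (2 ** b - 1) / (alpha - beta)                    # spread [beta, alpha] over 2^b bins
    z = -np.round(beta * s) - 2 ** (b - 1)               # zero-point
    q = np.clip(np.round(s * x + z), -(2 ** (b - 1)), 2 ** (b - 1) - 1)
    return q.astype(np.int32), s, z

x = np.array([-0.5, 0.0, 0.75, 1.5])
q_sym, s_sym = scale_quantize(x, alpha=1.5)
q_aff, s_aff, z = affine_quantize(x, beta=-0.5, alpha=1.5)
print(q_sym, np.abs(x - q_sym / s_sym).max())            # dequantize: x ≈ q / s
print(q_aff, np.abs(x - (q_aff - z) / s_aff).max())      # dequantize: x ≈ (q - z) / s
```

Note that the scale mapping sends 0.0 exactly to integer 0, which is what makes symmetric quantization cheaper at inference time: no zero-point term has to be carried through the integer matrix multiply.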
#### a) Affine Quantization (asymmetric)

Affine quantization is an asymmetric method whose integer range is $[-128, 127]$. The range of the quantized values differs from the original range, though not by much.

ps: after working through the computational cost of the two schemes, NVIDIA recommends Scale Quantization, which is cheaper to compute.

#### b) Scale Quantization (symmetric)

From the symmetric formula $f(x) = s \cdot x$, the range mapping is carried out with a single scale factor $s$.

The input range is $x \in [-\alpha, \alpha]$, and the maximum value $\alpha$ (amax) is calibrated to maximize accuracy. Once $\alpha$ is calibrated, the mapping multiplies/divides by the scale factor $s$:

$$s=\frac{2^{b-1}-1}{\alpha} = \frac{127}{\alpha}$$

$$x_{q} = \text{quantize}(x, b, s) = \text{clip}(\text{round}(s \cdot x),\ -2^{b-1}+1,\ 2^{b-1}-1)$$

Everything in the dynamic range outside the $\alpha$ interval is clipped (outlier clipping), and everything inside it is rounded to the nearest integer. These ranges must be chosen with care: a large $\alpha$ "covers" more values but yields coarse quantization and high quantization error, so the choice is typically a trade-off between clipping error and rounding error.

### Calibration methods

Calibration is the process of choosing the representable range for the model's weights and activations; for Scale Quantization this range is $[-\alpha, \alpha]$. The dashed lines in the figure below show three methods for determining $\alpha$ (the clipping threshold).

![](https://hackmd.io/_uploads/SJJt1hbBn.png)
(source: [2020。INTEGER QUANTIZATION FOR DEEP LEARNING INFERENCE: PRINCIPLES AND EMPIRICAL EVALUATION](https://arxiv.org/pdf/2004.09602.pdf))

#### Max
- Use the maximum of the absolute-value distribution as the scaling bound $\alpha$. This introduces no clipping error, but if outliers are present the rounding error becomes large.

#### KL divergence / relative entropy
- Choose the $\alpha$ that minimizes the KL divergence between the quantized distribution and the original distribution.
- If the original distribution is close to normal, this method performs well and achieves the best trade-off between clipping error and rounding error; if the distribution is far from normal, its performance suffers.
- KL divergence is a measure of the difference between two distributions; here, the parameter distributions before and after quantization are compared to select a suitable $\alpha$.

#### Percentile
- Use the $\alpha$ corresponding to the $k$-th percentile of the absolute-value distribution. As a heuristic it requires tuning an extra parameter, $k$, which lets the user discard outliers while achieving the desired trade-off.

## Post-training quantization (PTQ) vs Quantization-aware training (QAT)

Quantization methods fall into two broad categories:

#### Workflow overview

![](https://hackmd.io/_uploads/SykYgKlB2.png =300x)
(source: [2021。Olivia Weng。Neural Network Quantization for Efficient Inference: A Survey](https://www.researchgate.net/publication/357014029_Neural_Network_Quantization_for_Efficient_Inference_A_Survey))

|               | PTQ | QAT |
|:-------------:|:--------------------------------------:|:--------------------------------------------------------------:|
| Definition    | Convert an already-trained model<br>into a low-precision model | Account for both accuracy and the training objective<br>during training to produce a low-precision model |
| Workflow      | Quantize the model after training completes | Train the high- and low-precision models together during training,<br>gradually adjusting the quantization factors |
| Accuracy loss | Low to moderate | Slight |
| Requirements  | Train once; use directly after conversion | Extra training time and compute |
| Use cases     | Model deployment in static settings | Dynamic systems that need low latency and low power |

### Post-training quantization (PTQ)
- Quantization is applied after training.
- The original floating-point model is converted via calibration into a fixed-precision integer model, reducing its storage and compute requirements.
- It can be applied to an already-trained model, quantizing (and lightly tuning) it without significantly affecting accuracy.

### Quantization-aware training (QAT)
- QAT injects quantization constraints into the training process, optimizing the model weights by simulating inference-time quantization to improve performance on downstream tasks. During training it uses "fake" quantization modules, Q/DQ (quantize then de-quantize), to mimic the behavior of the test or inference stage.
- Training simulates INT8 precision (fake-quantization nodes are inserted in front of the original conv modules) while the actual computation still runs in float.
- QAT effectively reduces post-quantization accuracy loss, but training becomes more complex and time-consuming, since the model must be fine-tuned and trained with precision in mind.

![](https://hackmd.io/_uploads/B1AWa9ZBh.png =800x)
(source: [Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/))

### Implementation
- Reference implementation: [NVIDIA-AI-IOT/yolo_deepstream/tree/main/yolov7_qat](https://github.com/NVIDIA-AI-IOT/yolo_deepstream/tree/main/yolov7_qat)
- [Convert PyTorch model to TensorRT for 3-8x speedup<br>將PyTorch模型轉換為TensorRT,實現3-8倍加速](https://hackmd.io/_oaJhYNqTvyL_h01X1Fdmw?both)

## References

### Concept and Theory
- [2020。Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, Paulius Micikevicius。arXiv:2004.09602。INTEGER QUANTIZATION FOR DEEP LEARNING INFERENCE: PRINCIPLES AND EMPIRICAL EVALUATION](https://arxiv.org/pdf/2004.09602.pdf)
- [The Ultimate Guide to Deep Learning Model Quantization and Quantization-Aware Training](https://deci.ai/quantization-and-quantization-aware-training/#:~:text=Post%2Dtraining%20quantization%20(PTQ)%20is%20a%20quantization%20technique%20where,trained%20with%20quantization%20in%20mind.)
- [贰浪先生。神经网络模型量化综述](https://zhuanlan.zhihu.com/p/374374300)
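As an appendix, the calibration trade-off discussed above can be made concrete on synthetic data. The sketch below is a toy illustration with invented helper names, not TensorRT's calibrator (the KL-divergence search is omitted for brevity): it picks $\alpha$ by the max and percentile rules and measures the resulting INT8 round-trip error.

```python
import numpy as np

def choose_alpha_max(x):
    """Max calibration: alpha = max |x|; zero clipping error, outlier-sensitive."""
    return float(np.abs(x).max())

def choose_alpha_percentile(x, k=99.9):
    """Percentile calibration: clip the largest (100 - k)% of magnitudes."""
    return float(np.percentile(np.abs(x), k))

def int8_roundtrip_mse(x, alpha):
    """Quantize with s = 127/alpha, dequantize, and report the mean squared error."""
    s = 127.0 / alpha
    q = np.clip(np.round(s * x), -127, 127)
    return float(np.mean((x - q / s) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)
x[:10] = 8.0                                   # inject a few outliers

for name, alpha in [("max", choose_alpha_max(x)),
                    ("percentile 99.9", choose_alpha_percentile(x))]:
    print(f"{name:>16}: alpha = {alpha:6.2f}, mse = {int8_roundtrip_mse(x, alpha):.2e}")
```

With the injected outliers, the max rule stretches $\alpha$ to 8.0 and coarsens the quantization step for the overwhelming majority of values near zero, while the percentile rule keeps a fine step at the cost of clipping the outliers: exactly the clipping-vs-rounding trade-off described in the calibration section.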
