# MobileViTs

# Intro

- Ever since ViT appeared it has been a heavy model, hard to run comfortably on mobile devices in practice. Can we combine CNNs and ViTs to build a light-weight network? The MobileViT series does exactly that: it folds ViT into MobileNetV2 to get a light-weight ViT model, and V2 further proposes separable self-attention to make the computation even lighter.
- Apple published MobileViT in 2021 (ICLR 2022), claiming higher accuracy than MobileNetV3 with fewer parameters. The same team released MobileViT v2 in June 2022, and in October 2022 Purdue University + Micron released MobileViT v3. v3 has no timm support and did not come from the original Apple team, so it is not covered here.
- [timm links](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilevit.py)

# MobileViT v1

[paper link](https://arxiv.org/abs/2110.02178)

- Parameter count of ViT-B/16 vs MobileNetV3: 86 vs 7.5 million parameters
- The goal of this paper is a ViT model that is 1.) light-weight, 2.) general-purpose, and 3.) low latency

![](https://i.imgur.com/hwRwctd.png)

## Review of MobileNetV1~V2

### MobileNetV1

Replaces the 3x3 Conv with a 3x3 Depthwise Conv + Pointwise Conv (1x1 Conv); the whole module is called a Depthwise Separable Conv.

- The Depthwise Conv convolves each channel separately to reduce computation, while the Pointwise Conv learns the relationships between the different channels of the same feature map.

![](https://i.imgur.com/Egh9qov.jpg)

### MobileNetV2

Built on V1's Depthwise Separable Conv, it adds 1.) a linear bottleneck layer, which avoids the information ReLU loses in low-dimensional spaces, and 2.) an InvertedResidual that connects the bottleneck layers so the required information is obtained more quickly.

![](https://i.imgur.com/IFEcriu.jpg)

![](https://i.imgur.com/7g0B0SS.png)

The difference is that the left connects the expansion layers, while the right connects the bottleneck layers.

![](https://i.imgur.com/fSv01JY.jpg)

In short, it is 1x1 PW + 3x3 DW + 1x1 PW + residual (from before the first 1x1 PW to after the last 1x1 PW).

## Standard ViT vs MobileViT

- Standard ViT: input HxWxC → reshape into NxPC token patches → project to Nxd feature vectors
  - N: number of patches, P = w*h = pixels in a patch of size (w, h)

![](https://i.imgur.com/GStGYBF.png)

- MobileViT blocks: a 3x3 Conv first extracts token patches from the image, several MobileViT blocks then do the feature extraction (Unfold → transformer → Fold), a 1x1 Conv maps the channels back so the result can be concatenated with the original feature map, and a final Conv serves as the head.

![](https://i.imgur.com/cHlKoTW.png)

> N: number of patches = HW / P = HW / wh, d: transformer projection dimension, d > C

- The Unfold → Fold step (feature map → patches → feature map) is really just a reshape that lets self-attention attend only among tokens of the same "color". This is the main source of the claimed light-weight attention: **attention is computed only among tokens of the same color (the patch size determines how many colors there are)**.
- Regular attention simply flattens H and W: [N, H, W, C] -> [N, H*W, C]. MobileViT adds an extra step that flattens only the same-colored tokens before computing the attention weight matrix, which is why these extra reshapes exist.
- The transformer it uses is therefore a plain transformer; the mechanism itself is unchanged.

![](https://i.imgur.com/ZQLeYNb.png)

![](https://i.imgur.com/ohiLld6.png)

- The detailed MobileViT block pipeline:

![](https://i.imgur.com/1psvTFl.png)

[H, W, C] goes through an **n * n conv** to encode local features → a **1 * 1 conv (PWConv)** expands the channels to d, giving [H, W, d] → [H, W, d] is unfolded into [P, N, d] (P pixels per patch, N patches) → L transformer layers encode global features and output [P, N, d] → it is folded back into [H, W, d] → a PWConv maps it back to [H, W, C] → it is concatenated with the original input into [H, W, 2C] → an n * n conv fuses it back into [H, W, C]. A rough sketch of the unfold/fold reshape is given below.
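To make the "same-color" grouping concrete, here is a minimal PyTorch sketch of the unfold/fold reshape. This is not the timm implementation; the function names, the [B*P, N, d] token layout, and the assumption that H and W are divisible by the patch size are mine.

```python
import torch


def unfold_to_patches(x: torch.Tensor, ph: int, pw: int) -> torch.Tensor:
    """[B, d, H, W] -> [B * P, N, d]: group pixels that sit at the same position
    inside their patch ("same color"), so attention mixes one pixel per patch
    across all N patches instead of all H*W positions."""
    B, d, H, W = x.shape
    nh, nw = H // ph, W // pw                      # patches along each spatial axis
    # [B, d, nh, ph, nw, pw] -> [B, ph, pw, nh, nw, d]
    x = x.reshape(B, d, nh, ph, nw, pw).permute(0, 3, 5, 2, 4, 1)
    # P = ph*pw pixels per patch, N = nh*nw patches
    return x.reshape(B * ph * pw, nh * nw, d)


def fold_to_map(x: torch.Tensor, B: int, d: int, H: int, W: int, ph: int, pw: int) -> torch.Tensor:
    """Inverse of unfold_to_patches: [B * P, N, d] -> [B, d, H, W]."""
    nh, nw = H // ph, W // pw
    x = x.reshape(B, ph, pw, nh, nw, d).permute(0, 5, 3, 1, 4, 2)   # -> [B, d, nh, ph, nw, pw]
    return x.reshape(B, d, H, W)


# Round trip: attention would run on `tokens` between unfold and fold.
x = torch.randn(2, 64, 32, 32)                      # [B, d, H, W]
tokens = unfold_to_patches(x, 2, 2)                 # [8, 256, 64]: N=256 tokens per group
assert torch.equal(fold_to_map(tokens, 2, 64, 32, 32, 2, 2), x)
```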
How do we best encode global and local features at the same time?

- Patch size experiments
  - Although 3x3 performed best, they settled on a 2x2 patch size: [feature map dimensions (K*((W−F+2P)/S+1))](https://cs231n.github.io/convolutional-networks/#conv) are usually multiples of two, so a 3x3 patch size would require padding or resizing, which in turn requires masking transformer tokens and drives the compute cost up. On top of that, the latency gap is considerable.

> n: kernel size; h, w: patch size

![](https://i.imgur.com/pu8VXBV.png)

![](https://i.imgur.com/c1n3Zx7.png)

## Model architecture

### Overall architecture

- MV2 is just the [MobileNetV2 block](https://github.com/rwightman/pytorch-image-models/blob/4e24f75289d46176159c6cff3ed01a5c73d886d3/timm/models/_efficientnet_blocks.py#L133), i.e. conv_pw → conv_dw → conv_pw

![](https://i.imgur.com/drCpVaJ.png)

The MobileViT block is clearest in the [timm implementation](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilevit.py#L228): conv_kxk → conv_1x1 → Unfold → transformer → Fold.

### Model configs

- For the model configs they propose three parameter combinations: S, XS, and XXS
- The design is the same for all of them:
  - 3 * 3 conv with 2x down-sampling
  - → 4 MobileNetV2 blocks, with 2x down-sampling twice
  - → alternate MobileViT blocks and MV2 blocks
  - → PWConv to compress the channels
  - → Global pooling

![](https://i.imgur.com/4wpXISd.png)

![](https://i.imgur.com/RTdcHXc.png)

## Ablations

- Weight decay sweep: 1e-2 weight decay works best

![](https://i.imgur.com/ZrLxMOf.png)

- Skip connection test: the variant with the skip connection performs better
  - However, this is removed in V2

![](https://i.imgur.com/ZnGvSf1.png)

## Results

### Compare with CNNs

![](https://i.imgur.com/beIWdQ7.png)

### Compare with ViTs

![](https://i.imgur.com/EI7a9yv.jpg)

### Object detection benchmark

![](https://i.imgur.com/AegcReV.png)

### Real-world performance is still not good enough, not fast enough

They attribute this to existing mobile devices having optimized kernels for CNN computation, but not for ViT architectures.

![](https://i.imgur.com/8KttUmm.png)

# MobileViT v2: **Separable Self-attention for Mobile Vision Transformers**

- After v1 they already had the lightest and most accurate ViT model around, yet it still lost to some light-weight CNNs. Their analysis pointed to **Multi-Head Self-Attention (MHSA) as the main performance bottleneck**, so v2 sets out to do for MHSA what MobileNet did for convolution when it split Conv into DW and PW: decompose MHSA into a lighter combination.
- They claim the proposed separable self-attention reaches linear complexity O(k) in the number of tokens k, breaking the usual O(k^2) cost of attention. The trick is to compute attention with **element-wise operations**, **decomposing the matrix multiplications into nothing but additions and multiplications**.

## Background

- Linear-time attention is not exactly news. Before this work, [Linformer](https://arxiv.org/abs/2006.04768), [Reformer](https://arxiv.org/abs/2001.04451) and others approximated linear time with low-rank matrices or by hashing tokens into buckets. However, Linformer's mechanism still spends a lot of time in bmm (batch-wise matrix multiplication), so the results were not that satisfying, and both Linformer and Reformer only start to pay off when the token count exceeds 512, 1024, or even 2048, which simply does not fit the ViT setting.
  - [Recommended reading on Reformer](https://marssu.coderbridge.io/2021/02/09/reformer/)
  - [Recommended reading on Linformer](https://www.youtube.com/watch?v=-_2AF9Lhweo)

The left figure below compares the Top-5 most time-consuming attention operations inside one transformer block (token = 256); the right figure shows how latency scales with the number of tokens.

![](https://i.imgur.com/VLP6NBm.png)

## Review of MHSA (multi-headed self-attention)

The mathematical form of MHSA as described in MobileViT V2:

![](https://i.imgur.com/7ajbABj.png)

### MHSA computation flow

The input x is multiplied by three different weight matrices → Q, K, V → the original q is projected into n per-head q's (one per head), and k and v likewise produce n k's and v's → each q computes attention scores with the k of the same index, then takes a weighted sum over the v of the same index

![](https://i.imgur.com/DvEM1NM.png)

→ the n heads produce n outputs b, which are concatenated and passed through the linear transform $W^O$ to obtain $b^i$ for the next layer. A minimal sketch of this standard flow is given below.

![](https://i.imgur.com/awY6waN.png)
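As a reference point for the O(k^2) cost, here is a minimal, generic multi-head self-attention sketch in PyTorch. This is not the MobileViT code; the class name, head count, and shapes are illustrative.

```python
import torch
import torch.nn as nn


class NaiveMHSA(nn.Module):
    """Generic multi-head self-attention over k tokens of dim d.
    The k x k score matrix per head is where the O(k^2) cost comes from."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # W^Q, W^K, W^V fused
        self.proj = nn.Linear(dim, dim)      # W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, k, d = x.shape
        # [B, k, 3d] -> 3 x [B, heads, k, head_dim]
        q, kk, v = (
            self.qkv(x)
            .reshape(B, k, 3, self.num_heads, self.head_dim)
            .permute(2, 0, 3, 1, 4)
        )
        # [B, heads, k, k] score matrix: every token attends to every token
        attn = (q @ kk.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = attn @ v                                # [B, heads, k, head_dim]
        out = out.transpose(1, 2).reshape(B, k, d)    # concatenate the heads
        return self.proj(out)                         # W^O


x = torch.randn(2, 256, 64)          # 256 tokens of dim 64
y = NaiveMHSA(64, num_heads=4)(x)    # same shape out, but cost grows with 256^2
```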
## **Separable self-attention**

![](https://i.imgur.com/b2x2iYF.png)

- In the original self-attention, every query has to compute a dot product with every key to get the attention scores, and this has always been the computational bottleneck of attention.
- Separable self-attention replaces this whole input → score path with a k-dimensional latent token $L$: the input goes through a linear projection + softmax (a Conv2d in the implementation) to become L, and L is then multiplied element-wise with the key to directly obtain the score (the position of the softmax changes).
- In other words, input → q, k, v → score becomes input → L; both still use a softmax to obtain the score $c_s$. That this substitution works suggests that **for vision tasks we very likely do not need a full Q to learn global features; giving K a set of learnable weights is already enough to capture them**.
- The $c_s$ obtained for the different tokens is then used as weights on $X_K$ to produce $c_v$, the output of this separable attention pass, which encodes the contextual relationships among the input tokens. It is effectively the counterpart of the $Q*K$ result $a$ in the original attention formulation.

![](https://i.imgur.com/LDyyJIj.png)

- $c_v$ is multiplied element-wise with V, and a final linear layer $W_O$ gives the output y

![](https://i.imgur.com/O02NcvV.png)

![](https://i.imgur.com/6Zq8tZt.png)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearSelfAttention(nn.Module):
    """
    This layer applies a self-attention with linear complexity, as described in
    `https://arxiv.org/abs/2206.02680`
    This layer can be used for self- as well as cross-attention.

    Args:
        embed_dim (int): :math:`C` from an expected input of size :math:`(N, C, H, W)`
        attn_drop (float): Dropout value for context scores. Default: 0.0
        bias (bool): Use bias in learnable layers. Default: True

    Shape:
        - Input: :math:`(N, C, P, N)` where :math:`N` is the batch size, :math:`C` is the input channels,
          :math:`P` is the number of pixels in the patch, and :math:`N` is the number of patches
        - Output: same as the input

    .. note::
        For MobileViTv2, we unfold the feature map [B, C, H, W] into [B, C, P, N] where P is the number
        of pixels in a patch and N is the number of patches. Because channel is the first dimension in this
        unfolded tensor, we use point-wise convolution (instead of a linear layer). This avoids a transpose
        operation (which may be expensive on resource-constrained devices) that may be required to convert
        the unfolded tensor from channel-first to channel-last format in case of a linear layer.
    """

    def __init__(
        self,
        embed_dim: int,
        attn_drop: float = 0.0,
        proj_drop: float = 0.0,
        bias: bool = True,
    ) -> None:
        super().__init__()
        self.embed_dim = embed_dim

        self.qkv_proj = nn.Conv2d(
            in_channels=embed_dim,
            out_channels=1 + (2 * embed_dim),
            bias=bias,
            kernel_size=1,
        )
        self.attn_drop = nn.Dropout(attn_drop)
        self.out_proj = nn.Conv2d(
            in_channels=embed_dim,
            out_channels=embed_dim,
            bias=bias,
            kernel_size=1,
        )
        self.out_drop = nn.Dropout(proj_drop)

    def _forward_self_attn(self, x: torch.Tensor) -> torch.Tensor:
        # [B, C, P, N] --> [B, h + 2d, P, N]
        qkv = self.qkv_proj(x)

        # Project x into query, key and value
        # Query --> [B, 1, P, N]
        # value, key --> [B, d, P, N]
        query, key, value = qkv.split([1, self.embed_dim, self.embed_dim], dim=1)

        # apply softmax along N dimension
        context_scores = F.softmax(query, dim=-1)
        context_scores = self.attn_drop(context_scores)

        # Compute context vector
        # [B, d, P, N] x [B, 1, P, N] -> [B, d, P, N] --> [B, d, P, 1]
        context_vector = (key * context_scores).sum(dim=-1, keepdim=True)

        # combine context vector with values
        # [B, d, P, N] * [B, d, P, 1] --> [B, d, P, N]
        out = F.relu(value) * context_vector.expand_as(value)
        out = self.out_proj(out)
        out = self.out_drop(out)
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The timm version also supports cross-attention via an optional second input;
        # only the self-attention path is reproduced here.
        return self._forward_self_attn(x)
```

### Compare with Linformer and standard transformer

- Here the self-attention inside MobileViT is swapped out for the alternatives being compared
- Accuracy drops slightly, but latency is cut to roughly a third

![](https://i.imgur.com/Y47WaqK.png)

### What does the context score actually capture?

- The context scores $c_s$ (MxN) produced by separable self-attention at different layers are extracted and stitched back to the same spatial resolution, giving a context score map $c_m$ (HxW)
- M is the number of pixels in a patch (M = hw), N is the number of patches

![](https://i.imgur.com/2Y72RqA.png)

## Model architecture

- Largely unchanged, essentially reusing V1; the first conv in the block is still 3x3, but V2 replaces it with DW + PW

![](https://i.imgur.com/f8vYr6t.png)

- The configs barely change either, but there is no more XXS/XS/S; instead a single scale factor $\alpha$ from 0.5 to 2.0 scales the parameters

![](https://i.imgur.com/mHFxrd0.png)

- The skip connection is removed, however. Borrowing the MobileViT V3 figure for comparison:
  - V3 changes the initial 3x3 Conv to a DW Conv
  - It also changes the 3x3 Conv in V1's fusion block to a 1x1 Conv
  - And it adds back the skip connection that V2 removed (a Top-1 difference of 0.6). The skip also moves: the local and global outputs are concatenated, passed through a 1x1 conv to adjust the dimensions, and then added to the input (the reasoning being that **global information is more closely related to local information than it is to the input features**)

![](https://i.imgur.com/RiteMaj.png)

- [See the timm implementation for more detail](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilevit.py#L514) (a quick usage sketch follows below)
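For quick experiments, both generations can be instantiated through timm. This is a minimal sketch; the exact registered model names depend on your timm version, so verify them with `timm.list_models` first.

```python
import timm
import torch

# Check which MobileViT variants this timm version ships with
print(timm.list_models('mobilevit*'))

# v1 (S config) and v2 (width multiplier alpha = 1.0) -- names assumed from timm's registry
v1 = timm.create_model('mobilevit_s', pretrained=False, num_classes=1000)
v2 = timm.create_model('mobilevitv2_100', pretrained=False, num_classes=1000)

x = torch.randn(1, 3, 256, 256)          # 256x256 is the resolution used in the papers
print(v1(x).shape, v2(x).shape)          # torch.Size([1, 1000]) for both
```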
## Results

### Image classification

- From-scratch training on ImageNet-1k
  - AdamW
- Pretraining → fine-tune
  - Initialize with ImageNet-1k weights → pretrain on ImageNet-21k with AdamW for 80 epochs → fine-tune on ImageNet-1k with SGDm (m = 0.9) for 50 epochs
- The same old refrain again: ViT still lacks optimization on edge devices, so latency is still not ideal compared with CNNs

![](https://i.imgur.com/tsWXaZg.png)

- They found that fine-tuning at a larger image size works somewhat better

![](https://i.imgur.com/Etzz50E.png)

### Object Detection

![](https://i.imgur.com/OyBDljk.png)

### Compare with light-weight CNNs and ViTs

![](https://i.imgur.com/x0mVzap.png)

# Conclusion

- V1 takes MobileNet as the reference to produce the initial architecture, V2 optimizes the attention, and V3 runs assorted experiments to find the best architecture
- In any case, copy as much of MobileNet as you can and you will end up light-weight
- As long as the optimization problem remains unsolved, the motivation for using ViT still depends on whether the use case actually needs global information
- This series puts MB blocks in front to capture local features and stacks transformers afterwards for global information, much like the recent SOTA ViT [MaxViT](https://arxiv.org/abs/2204.01697). Besides the same "MB blocks first, transformers after" recipe, MaxViT also proposes a linear-time attention operation
  - Instead of decomposing the QK computation, it uses two partitioning schemes to produce non-overlapping windows (grid & block), turning the transformer block into the pipeline shown below; its attention therefore never uses all tokens at once, but splits them in two ways to avoid the N^2 cost
  - Grid attention is the novelty, though it looks a bit like a dilated conv

![](https://i.imgur.com/dqGp89L.png)

![](https://i.imgur.com/8V9I6bP.png)

# Reference

[MobileViT模型简介](https://blog.csdn.net/qq_37541097/article/details/126715733)

[MobileViT、MobileViTv2、MobileViTv3学习笔记(自用)](https://blog.csdn.net/weixin_44911037/article/details/127515858)
