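# MobileViTs

# Intro

- Ever since ViT appeared it has been heavy, which makes it hard to run comfortably on mobile devices in practice. Can CNNs and ViTs be combined into a lightweight network? The MobileViT series does this by integrating ViT into MobileNetV2 to build a lightweight ViT model, and V2 goes further by proposing separable self-attention to make the computation lighter.
- Apple published MobileViT in 2021 (ICLR 2022), claiming higher accuracy than MobileNetV3 with fewer parameters. The same authors released MobileViT v2 in June 2022. In October 2022, Purdue University and Micron released MobileViT v3, but v3 has no timm support and does not come from the original Apple team, so it is not covered in detail here.
- [timm links](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilevit.py)

# MobileViT v1

[paper link](https://arxiv.org/abs/2110.02178)

- Parameter counts of ViT-B/16 vs MobileNetv3: 86 vs 7.5 million parameters
- The goal of the paper is a ViT model that is 1.) light-weight 2.) general-purpose 3.) low latency

![](https://i.imgur.com/hwRwctd.png)

## Review of MobileNetV1~V2

### MobileNetV1

MobileNetV1 replaces the 3x3 Conv with a 3x3 Depthwise Conv + Pointwise Conv (1x1 Conv). The whole module is called Depthwise Separable Conv.

- Depthwise Conv convolves each channel separately to reduce computation, while Pointwise Conv learns the relationships between the different channels of the same feature map.

![](https://i.imgur.com/Egh9qov.jpg)

### MobileNetV2

MobileNetV2 builds on the Depthwise Separable Conv of V1 and adds:

1. A linear bottleneck layer, which avoids ReLU losing too much information in a low-dimensional space.
2. An InvertedResidual connection between bottleneck layers, so the needed information is obtained more quickly.

![](https://i.imgur.com/IFEcriu.jpg)

![The left block connects expansion layers; the right block connects bottleneck layers](https://i.imgur.com/7g0B0SS.png)

The difference is that the left block connects the expansion layers, while the right block connects the bottleneck layers.

![](https://i.imgur.com/fSv01JY.jpg)

In short, the block is 1x1 PW + 3x3 DW + 1x1 PW + residual (from before the first 1x1 PW to after the last 1x1 PW).

## Standard ViT vs MobileViT

- Standard ViT: input HxWxC → reshape to NxPC token patches → project to Nxd feature vectors
    - N: number of patches, P = w*h = number of pixels in a patch of size (w, h)

![](https://i.imgur.com/GStGYBF.png)

- MobileViT blocks: a 3x3 Conv first extracts features from the input, then several MobileViT blocks do the feature extraction (Unfold → Transformer → Fold). A 1x1 Conv restores the channel count, the result is concatenated with the block input, and one more Conv fuses them to produce the output.

![](https://i.imgur.com/cHlKoTW.png)

> N: number of patches = HW / P = HW / wh, d: transformer projection dimension, d > C

- The so-called Unfold → Transformer → Fold (feature map → patches → feature map) is really just a reshape that lets self-attention learn only among tokens of the same color. This is the main source of the "lightweight" attention: **attention is computed only among tokens of the same color (there are as many colors as there are pixels in a patch)**.
    - Standard attention simply flattens H and W, [N, H, W, C] -> [N, H*W, C]. MobileViT needs one extra step that flattens the tokens of each color separately before computing the attention weight matrix, which is why the extra reshapes exist.
    - So the transformer it uses is an ordinary transformer. Its mechanism is unchanged.

![](https://i.imgur.com/ZQLeYNb.png)

![](https://i.imgur.com/ohiLld6.png)

- The detailed operations inside a MobileViT block:

![](https://i.imgur.com/1psvTFl.png)

[H, W, C] goes through an **n * n conv** to encode local features
→ a **1 * 1 conv (PWConv)**, a pointwise linear projection, expands the channels to d, giving [H, W, d]
→ [H, W, d] is unfolded into [P, N, d] (P = wh pixels per patch, N = HW / P patches)
→ L transformer layers, applied across the N patches for each of the P pixel positions, output [P, N, d] and encode global features
→ [P, N, d] is folded back to [H, W, d]
→ a PWConv restores [H, W, C]
→ concatenation with the original input gives [H, W, 2C]
→ an n * n conv fuses them into [H, W, C]

How can global and local features best be encoded at the same time?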
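To make the "same color" grouping in the MobileViT block concrete, here is a minimal pure-Python sketch. It is not the timm or Apple implementation: the helper names `attention_groups` and `attention_pairs` and the toy 4x4 feature map are assumptions made for illustration only. It groups pixel coordinates by their offset inside a patch and counts how many query-key pairs attention has to evaluate with and without the grouping.

```python
from collections import defaultdict


def attention_groups(height, width, patch_h, patch_w):
    """Group pixel coordinates by their offset inside a patch.

    Pixels sharing the same (row % patch_h, col % patch_w) offset are the
    "same color" tokens that attend to one another.
    """
    if height % patch_h or width % patch_w:
        raise ValueError("feature map must be divisible by the patch size")
    groups = defaultdict(list)
    for row in range(height):
        for col in range(width):
            groups[(row % patch_h, col % patch_w)].append((row, col))
    return dict(groups)


def attention_pairs(height, width, patch_h, patch_w):
    """Count query-key pairs when attention stays inside each group."""
    groups = attention_groups(height, width, patch_h, patch_w)
    return sum(len(tokens) ** 2 for tokens in groups.values())


# Toy sizes, chosen only for illustration (hypothetical, not from the paper).
H, W, h, w = 4, 4, 2, 2
groups = attention_groups(H, W, h, w)
full_pairs = (H * W) ** 2                    # every pixel attends to every pixel
grouped_pairs = attention_pairs(H, W, h, w)  # P groups of N tokens each

print(len(groups), groups[(0, 0)])  # 4 [(0, 0), (0, 2), (2, 0), (2, 2)]
print(full_pairs, grouped_pairs)    # 256 64
```

With P = 4 pixels per patch and N = 4 patches, the grouped version evaluates P * N^2 = 64 pairs instead of (HW)^2 = 256, a reduction by a factor of P. The attention inside each group is still ordinary quadratic attention, which matches the note above that the transformer itself is unchanged.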
- Testing patch sizes
    - Although 3x3 performed best, the authors ended up using a 2x2 patch size. [Feature-map dimensions (K*((W−F+2P)/S+1))](https://cs231n.github.io/convolutional-networks/#conv) are usually multiples of two, so a 3x3 patch size requires padding or resizing, which in turn requires masking transformer tokens and raises the compute cost. The latency gap was also fairly large.

> n: kernel size, h, w: patch size

![](https://i.imgur.com/pu8VXBV.png)

![](https://i.imgur.com/c1n3Zx7.png)

## Model architecture

### Overall architecture

- MV2 is the [MobileNetv2 block](https://github.com/rwightman/pytorch-image-models/blob/4e24f75289d46176159c6cff3ed01a5c73d886d3/timm/models/_efficientnet_blocks.py#L133). Inside it is conv_pw → conv_dw → conv_pw.

![](https://i.imgur.com/drCpVaJ.png)

For the MobileViT block, reading the [timm implementation](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilevit.py#L228) directly is clearer. Inside it is convkxk → conv1x1 → Unfold → transformer → Fold.

### Model configs

- The paper proposes three workable parameter sets: S, XS, and XXS.
- They all share the same design:
    - 3 * 3 conv with 2x down-sampling
    - → 4 MobileNetV2 blocks, with 2x down-sampling twice
    - → alternating MobileViT blocks and MV2 blocks
    - → a PWConv to adjust the channel count
    - → Global pooling

![](https://i.imgur.com/4wpXISd.png)

![](https://i.imgur.com/RTdcHXc.png)

## Ablation studies

- Weight decay: a weight decay of 1e-2 worked best.

![](https://i.imgur.com/ZrLxMOf.png)

- Skip connection: the model performs better with the skip connection.
    - However, it is removed in V2.

![](https://i.imgur.com/ZnGvSf1.png)

## Results

### Compare with CNNs

![](https://i.imgur.com/beIWdQ7.png)

### Compare with ViTs

![](https://i.imgur.com/EI7a9yv.jpg)

### Object detection benchmark

![](https://i.imgur.com/AegcReV.png)

### Real-world performance is not good enough and not fast enough

The authors attribute this to current mobile devices being optimized for CNN computation but not for ViT architectures.

![](https://i.imgur.com/8KttUmm.png)

# MobileViT v2: **Separable Self-attention for Mobile Vision Transformers**

- After v1, MobileViT was the lightest and most accurate of the ViT models, yet it still lost to some lightweight CNN models. The authors' analysis found that **Multi-Head Self-Attention (MHSA) was the main culprit behind the performance bottleneck**. So v2 follows the way MobileNet splits a Conv into DW and PW, and decomposes MHSA into a lighter combination.
- They claim the proposed separable self-attention reaches linear time complexity O(k), instead of the O(k^2) that standard attention needs, where k is the number of tokens (patches). The trick is to use **element-wise operations** to compute attention, **breaking the matrix multiplication down into element-wise additions and multiplications only**.

## Background

- Linear-time attention is not really big news. Before this work, [Linformer](https://arxiv.org/abs/2006.04768) and [Reformer](https://arxiv.org/abs/2001.04451), among others, used low-rank matrices or hashed groups of tokens to approximate near-linear time. But Linformer's mechanism still leaves a slow bmm (batch-wise matrix multiplication), so the results are not that satisfying. In addition, Linformer and Reformer only start to pay off when the token count exceeds 512, 1024, or even 2048, which does not fit the ViT setting at all.
    - [Recommended reading on Reformer](https://marssu.coderbridge.io/2021/02/09/reformer/)
    - [Recommended reading on Linformer](https://www.youtube.com/watch?v=-_2AF9Lhweo)

The left figure below compares the top-5 most time-consuming operations in one transformer block for different attention variants (token = 256). The right figure shows latency as a function of the number of tokens.

![](https://i.imgur.com/VLP6NBm.png)

## Review of MHSA (**multi-headed self-attention**)

The mathematical form of MHSA as described in MobileViT V2:

![](https://i.imgur.com/7ajbABj.png)

### MHSA computation flow

The input x is multiplied by three different weight matrices → Q, K, V
→ each original q is further projected into n per-head q's (n = number of heads), and k and v likewise produce the same number of k's and v's
→ each q computes attention with the k of the same head index to get a score, which is then used for a weighted sum over the v of the same head index

![](https://i.imgur.com/DvEM1NM.png)

→ n heads produce n outputs b. These n b's are concatenated and passed through the linear transform $W^O$ to get $b^i$, which is sent to the next layer.

![](https://i.imgur.com/awY6waN.png)

## **Separable self-attention**

![](https://i.imgur.com/b2x2iYF.png)

- In the original self-attention, every query needs a dot product with every key to get the attention score. This has always been the computational bottleneck of attention.
- Separable self-attention replaces the whole input → score path with a latent token $L$ (a d-dimensional weight vector). The input is linearly projected onto $L$ (a 1x1 Conv2d in the code), giving one scalar per token, and a softmax over the k tokens turns these scalars into the context scores. The scores are then multiplied element-wise with the keys (note that the position of the softmax has changed).
- In other words, input → q, k, v → score becomes input → L. Both variants still use softmax to get the score $c_s$. That this replacement works suggests that **vision tasks may not actually need a full Q to learn global features. Giving K a learned weight may be enough to learn them.**
- The $c_s$ obtained for the different tokens are then used as weights on $X_K$ to get $c_v$, which represents the output of this separable attention step. It contains the contextual relationships among the input tokens, and plays the same role as the Q*K result $a$ in the original attention formula.

![](https://i.imgur.com/LDyyJIj.png)

- $c_v$ is multiplied element-wise with V (after a ReLU), and a final linear layer $W_O$ produces the result y.

![](https://i.imgur.com/O02NcvV.png)

![](https://i.imgur.com/6Zq8tZt.png)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearSelfAttention(nn.Module):
    """
    This layer applies a self-attention with linear complexity, as described in
    `https://arxiv.org/abs/2206.02680`.
    This layer can be used for self- as well as cross-attention.

    Args:
        embed_dim (int): :math:`C` from an expected input of size :math:`(B, C, H, W)`
        attn_drop (float): Dropout value for context scores. Default: 0.0
        proj_drop (float): Dropout value for the output projection. Default: 0.0
        bias (bool): Use bias in learnable layers. Default: True

    Shape:
        - Input: :math:`(B, C, P, N)` where :math:`B` is the batch size,
          :math:`C` is the input channels, :math:`P` is the number of pixels
          in the patch, and :math:`N` is the number of patches
        - Output: same as the input

    .. note::
        For MobileViTv2, we unfold the feature map [B, C, H, W] into [B, C, P, N]
        where P is the number of pixels in a patch and N is the number of patches.
        Because channel is the first dimension in this unfolded tensor, we use
        point-wise convolution (instead of a linear layer). This avoids a transpose
        operation (which may be expensive on resource-constrained devices) that may
        be required to convert the unfolded tensor from channel-first to
        channel-last format in case of a linear layer.
    """

    def __init__(
        self,
        embed_dim: int,
        attn_drop: float = 0.0,
        proj_drop: float = 0.0,
        bias: bool = True,
    ) -> None:
        super().__init__()
        self.embed_dim = embed_dim

        # One 1x1 conv produces the 1-channel query and the d-channel key and value.
        self.qkv_proj = nn.Conv2d(
            in_channels=embed_dim,
            out_channels=1 + (2 * embed_dim),
            bias=bias,
            kernel_size=1,
        )
        self.attn_drop = nn.Dropout(attn_drop)
        self.out_proj = nn.Conv2d(
            in_channels=embed_dim,
            out_channels=embed_dim,
            bias=bias,
            kernel_size=1,
        )
        self.out_drop = nn.Dropout(proj_drop)

    def _forward_self_attn(self, x: torch.Tensor) -> torch.Tensor:
        # [B, C, P, N] --> [B, 1 + 2d, P, N]
        qkv = self.qkv_proj(x)

        # Project x into query, key and value
        # Query --> [B, 1, P, N]
        # value, key --> [B, d, P, N]
        query, key, value = qkv.split([1, self.embed_dim, self.embed_dim], dim=1)

        # apply softmax along N dimension
        context_scores = F.softmax(query, dim=-1)
        context_scores = self.attn_drop(context_scores)

        # Compute context vector
        # [B, d, P, N] x [B, 1, P, N] -> [B, d, P, N] --> [B, d, P, 1]
        context_vector = (key * context_scores).sum(dim=-1, keepdim=True)

        # combine context vector with values
        # [B, d, P, N] * [B, d, P, 1] --> [B, d, P, N]
        out = F.relu(value) * context_vector.expand_as(value)
        out = self.out_proj(out)
        out = self.out_drop(out)
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Added so the excerpt is callable; the cross-attention path is omitted here.
        return self._forward_self_attn(x)
```

### Compare with Linformer and standard transformer

- Here the self-attention inside MobileViT is swapped out for each variant to compare them.
- Accuracy drops slightly, but latency falls to about one third.

![](https://i.imgur.com/Y47WaqK.png)

### What does the context score actually capture?

- The context scores $c_s$ (MxN) produced by separable self-attention at different layers are extracted and reshaped to the same spatial dimensions to get the context score map $c_m$ (HxW).
    - M is the number of pixels in a patch (M = hw), N is the number of patches.

![](https://i.imgur.com/2Y72RqA.png)

## Model architecture

- Mostly unchanged and largely carried over from V1. The leading Conv is still 3x3, but V2 replaces it with DW + PW.

![](https://i.imgur.com/f8vYr6t.png)

- The config has not changed much either, but XXS, XS, and S are gone. Instead, a scale factor $\alpha$ in the range 0.5 to 2.0 scales the parameters.

![](https://i.imgur.com/mHFxrd0.png)

- The skip connection was removed, however. The MobileViT V3 figure is borrowed here to show it.
    - V3 changes the leading Conv 3x3 into a DW Conv.
    - It also changes the 3x3 Conv in the V1 fusion block into a 1x1 Conv.
    - It then adds back the skip connection that V2 removed (a 0.6 difference in Top-1). The place where the skip is applied also changes: the local output and the global output are concatenated, passed through a 1x1 conv to adjust the dimension, and then added to the input. (Their reasoning is that **global information is more closely related to local information than it is to the input features**.)

![](https://i.imgur.com/RiteMaj.png)

- [The timm code gives more detail](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilevit.py#L514)

## Results

### Image classification

- Training from scratch on ImageNet-1k
    - AdamW
- Pretraining → fine-tune
    - Initialize with ImageNet-1k weights → pretrain on ImageNet-21k with AdamW for 80 epochs, then fine-tune on ImageNet-1k with SGDm (m = 0.9) for 50 epochs
- Once again they make the familiar point that ViTs still lack optimization on edge devices, so latency is still not ideal compared with CNNs.

![](https://i.imgur.com/tsWXaZg.png)

- They found that fine-tuning with a larger image size works somewhat better.

![](https://i.imgur.com/Etzz50E.png)

### Object Detection

![](https://i.imgur.com/OyBDljk.png)

### Compare with light-weight CNNs and ViTs

![](https://i.imgur.com/x0mVzap.png)

# Conclusion

- V1 borrows from MobileNet to produce the initial architecture, V2 optimizes the attention, and V3 runs a range of experiments to find the best architecture.
- Whatever else you do, copy as much of MobileNet as you can, and the model will be lightweight.
- While the on-device optimization problem remains unsolved, the motivation for using a ViT still depends on whether the use case needs global information.
- This series places MB blocks at the front to get local features and then stacks transformers to get global information. That is close in spirit to a recent SOTA ViT called [MaxViT](https://arxiv.org/abs/2204.01697). Besides likewise putting MB blocks in front and transformers behind, MaxViT also proposes a linear-time attention operation.
    - It does not just decompose the QK computation. It uses two partitioning schemes to produce non-overlapping windows (grid & block) and turns the transformer block into the flow shown below. Its attention therefore does not run over all tokens at once but over the two partitions separately, which avoids the N^2 computation.
    - Grid attention is the novel part, although it looks somewhat like a Dilated Conv.

![](https://i.imgur.com/dqGp89L.png)

![](https://i.imgur.com/8V9I6bP.png)

# Reference

- [MobileViT模型简介](https://blog.csdn.net/qq_37541097/article/details/126715733)
- [MobileViT、MobileViTv2、MobileViTv3学习笔记（自用）](https://blog.csdn.net/weixin_44911037/article/details/127515858)
