---
tags: Paper
---
# HRViT & MobileViT
[TOC]

---

## HRNet
$\because$ HRViT combines HRNet with Transformers $\therefore$ a brief intro to HRNet first.
[Paper link](https://arxiv.org/pdf/1902.09212.pdf)

When processing high-resolution images, the usual approach is to keep shrinking the feature maps with strides and pooling before feeding them into a classifier.
After repeated shrinking, however, too much high-resolution information is lost.
HRNet is designed to <font color=orange>**maintain high-resolution features**</font> throughout the network.

![](https://newprediction.blob.core.windows.net/dogchen-hackmd/Ui8des4.png =500x)

As shown in the figure, the top branch never goes through any stride or pooling and keeps the same spatial size.
While maintaining high-resolution features, the network also has to handle multiple scales, so part of the network keeps downsampling to process multi-scale information. These branches (or feature maps) at different scales are fused with one another via **upsampling/downsampling**.

![](https://newprediction.blob.core.windows.net/dogchen-hackmd/nkY21Vj.png =500x)

## HRViT
A ViT backbone design for semantic segmentation.

### Architecture
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/n7TX7k3.png =700x)
* Downsampling convolutional stem (4x)
  ![](https://newprediction.blob.core.windows.net/dogchen-hackmd/chU6w24.png =250x)
* 4 progressive Transformer stages, where the n^th^ stage contains n parallel multi-scale Transformer branches
    * One or more modules
        * Cross-resolution fusion layer
          ![](https://newprediction.blob.core.windows.net/dogchen-hackmd/prkyPCv.png =500x)
            * <font color=orange>Fuses feature maps of different scales to enable cross-resolution interaction</font>
        * Repeated **Transformer Blocks**
            * Augmented cross-shaped local self-attention block (**HRViTAttn**)
              ![](https://newprediction.blob.core.windows.net/dogchen-hackmd/gHctFNW.png =500x)
                * Split the input into two halves, giving two [H x W x C/2] tensors
                * One half is split into s×W horizontal windows, the other into H×s vertical windows
                * Within each window, the patch is chunked into K d~k~-dimensional heads, then a local self-attention is applied
                * Share the linear projections for key and value tensors to save computation and parameters
            * Mixed-scale convolutional feedforward network (**MixCFN**)
              ![](https://newprediction.blob.core.windows.net/dogchen-hackmd/EEO4BuU.png =500x)
                * Uses 3 * 3 and 5 * 5 convs to extract multi-scale features

### Performance
On ADE20K:
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/2orOHwq.png =600x)
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/OTwhNcZ.png =600x)

## MobileViT

:::spoiler Recall: Standard ViT
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/oMwgT4W.png =x180)
* Flatten [H, W, C] into [N, PC]
    * N: number of patches
    * P: number of pixels per patch
* Linear projection to [N, d]
* Positional embedding
* Pass through L transformers
* Linear projection to the final output
:::

### MobileViT block
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/t8k1EZX.png =x200)

[H, W, C] goes through an **n * n conv** to <font color=orange>encode **local** features</font>
↓
A 1 * 1 conv (i.e. **PWConv**) expands the channel dimension to d, giving [H, W, d]
↓
[H, W, d] is unfolded into [(H * W), 1, d]
↓
L **transformers** are applied, outputting [(H * W), 1, d], which <font color=orange>encodes **global** features</font>
↓
[(H * W), 1, d] is folded back into [H, W, d]
↓
A **PWConv** restores the channels, giving [H, W, C]
↓
The result is **concatenated** with the original input into [H, W, 2C]
↓
An **n * n conv** fuses them, giving [H, W, C]
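
To make the flow above concrete, here is a minimal PyTorch sketch of the block. It is not the official implementation: the name `MobileViTBlockSketch` is made up, and the unfold step follows this note's simplification of treating every pixel as one token (i.e. a 1 * 1 patch).

```python
# Minimal sketch of the MobileViT block pipeline described above (hypothetical
# class name; patch size fixed to 1x1 so every pixel is a token, as in this note).
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    def __init__(self, c, d, n=3, num_layers=2, num_heads=2):
        super().__init__()
        self.local_conv = nn.Conv2d(c, c, n, padding=n // 2)   # n*n conv: local features
        self.expand = nn.Conv2d(c, d, 1)                        # PWConv: C -> d
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=num_heads, batch_first=True)
        self.transformers = nn.TransformerEncoder(layer, num_layers)  # L transformers: global features
        self.project = nn.Conv2d(d, c, 1)                       # PWConv: d -> C
        self.fuse = nn.Conv2d(2 * c, c, n, padding=n // 2)      # n*n conv after concat

    def forward(self, x):                                       # x: [B, C, H, W]
        y = self.local_conv(x)
        y = self.expand(y)                                      # [B, d, H, W]
        b, d, h, w = y.shape
        y = y.flatten(2).transpose(1, 2)                        # unfold -> [B, H*W, d]
        y = self.transformers(y)                                # global self-attention
        y = y.transpose(1, 2).reshape(b, d, h, w)               # fold back -> [B, d, H, W]
        y = self.project(y)                                     # [B, C, H, W]
        y = torch.cat([x, y], dim=1)                            # concat with input -> [B, 2C, H, W]
        return self.fuse(y)                                     # [B, C, H, W]

# x = torch.randn(1, 32, 28, 28)
# out = MobileViTBlockSketch(c=32, d=64)(x)   # out: [1, 32, 28, 28]
```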
#### Advantages
* Compared with a plain convolution, i.e. unfolding $\rightarrow$ matrix multiplication $\rightarrow$ folding, replacing the matrix multiplication with a transformer lets the block also learn **global** features
* Every **pixel of the transformer output** contains information from **every pixel of the input**
  ![](https://newprediction.blob.core.windows.net/dogchen-hackmd/eG8ND1j.png =200x)

### Architecture
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/GgGoVb6.png =x140)
* A **3 * 3 conv** with 2x down-sampling
* 4 **MobileNetV2 blocks**, with 2x down-sampling applied twice among them
* **MobileViT blocks** and MV2 blocks stacked alternately
* A **PWConv** to compress the channels
* Global pooling

### Performance
On ImageNet-1k,
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/PqaY1Zc.png =350x)

<!--
MIoU (Mean Intersection over Union)
* The mean of orange / (red + orange + yellow)
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/zp9eGLa.png =300x)
-->
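
Side note on the segmentation metric: the ADE20K results in the HRViT section above are reported in mIoU, the per-class intersection-over-union averaged over classes. Below is a minimal NumPy sketch of the metric for a single prediction/ground-truth pair (the helper name `mean_iou` is illustrative only; benchmark scores accumulate these counts over the whole validation set).

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU: per-class intersection / union, averaged over classes present."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:                      # class absent in both pred and gt: skip
            continue
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# pred = np.array([[0, 1], [1, 2]]); gt = np.array([[0, 1], [2, 2]])
# mean_iou(pred, gt, num_classes=3)  # mean of {1.0, 0.5, 0.5} = 0.667
```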