---
tags: Paper
---
# HRViT & MobileViT
[TOC]

---

## HRNet
$\because$ HRViT combines HRNet with Transformers $\therefore$ a brief intro to HRNet first.
[Paper link](https://arxiv.org/pdf/1902.09212.pdf)

When processing high-resolution images, the usual approach is to keep shrinking the feature maps with strides and pooling before feeding them into a classifier.
After repeated shrinking, however, too much high-resolution information is lost.
HRNet is designed to <font color=orange>**maintain high-resolution features**</font> throughout the network.

![](https://newprediction.blob.core.windows.net/dogchen-hackmd/Ui8des4.png =500x)

As shown in the figure, the top branch never goes through any stride or pooling and keeps the same spatial size.
While maintaining high-resolution features, the network also has to handle multiple scales, so part of the network keeps downsampling to process multi-scale information. These branches (or feature maps) at different scales are fused with one another via **upsampling/downsampling**.

![](https://newprediction.blob.core.windows.net/dogchen-hackmd/nkY21Vj.png =500x)

## HRViT
A ViT backbone design for semantic segmentation.

### Architecture
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/n7TX7k3.png =700x)
* Downsampling convolutional stem (4x)
  ![](https://newprediction.blob.core.windows.net/dogchen-hackmd/chU6w24.png =250x)
* 4 progressive Transformer stages, where the n^th^ stage contains n parallel multi-scale Transformer branches
    * One or more modules
        * Cross-resolution fusion layer
          ![](https://newprediction.blob.core.windows.net/dogchen-hackmd/prkyPCv.png =500x)
            * <font color=orange>Fuses feature maps of different scales to enable cross-resolution interaction</font>
        * Repeated **Transformer Blocks**
            * Augmented cross-shaped local self-attention block (**HRViTAttn**)
              ![](https://newprediction.blob.core.windows.net/dogchen-hackmd/gHctFNW.png =500x)
                * Split the input into two halves, giving two [H x W x C/2] tensors
                * One half is split into s×W horizontal windows, the other into H×s vertical windows
                * Within each window, the patch is chunked into K d~k~-dimensional heads, then a local self-attention is applied
                * Share the linear projections for key and value tensors to save computation and parameters
            * Mixed-scale convolutional feedforward network (**MixCFN**)
              ![](https://newprediction.blob.core.windows.net/dogchen-hackmd/EEO4BuU.png =500x)
                * Uses 3 * 3 and 5 * 5 convs to extract multi-scale features

### Performance
On ADE20K:
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/2orOHwq.png =600x)
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/OTwhNcZ.png =600x)

## MobileViT

:::spoiler Recall: Standard ViT
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/oMwgT4W.png =x180)
* Flatten [H, W, C] into [N, PC]
    * N: number of patches
    * P: number of pixels per patch
* Linear projection to [N, d]
* Positional embedding
* Pass through L transformers
* Linear projection to the final output
:::

### MobileViT block
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/t8k1EZX.png =x200)

[H, W, C] goes through an **n * n conv** to <font color=orange>encode **local** features</font>
↓
A 1 * 1 conv (i.e. **PWConv**) expands the channel dimension to d, giving [H, W, d]
↓
[H, W, d] is unfolded into [(H * W), 1, d]
↓
L **transformers** are applied, outputting [(H * W), 1, d], which <font color=orange>encodes **global** features</font>
↓
[(H * W), 1, d] is folded back into [H, W, d]
↓
A **PWConv** restores the channels, giving [H, W, C]
↓
The result is **concatenated** with the original input into [H, W, 2C]
↓
An **n * n conv** fuses them, giving [H, W, C]
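
To make the flow above concrete, here is a minimal PyTorch sketch of the block. It is not the official implementation: the name `MobileViTBlockSketch` is made up, and the unfold step follows this note's simplification of treating every pixel as one token (i.e. a 1 * 1 patch).

```python
# Minimal sketch of the MobileViT block pipeline described above (hypothetical
# class name; patch size fixed to 1x1 so every pixel is a token, as in this note).
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    def __init__(self, c, d, n=3, num_layers=2, num_heads=2):
        super().__init__()
        self.local_conv = nn.Conv2d(c, c, n, padding=n // 2)   # n*n conv: local features
        self.expand = nn.Conv2d(c, d, 1)                        # PWConv: C -> d
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=num_heads, batch_first=True)
        self.transformers = nn.TransformerEncoder(layer, num_layers)  # L transformers: global features
        self.project = nn.Conv2d(d, c, 1)                       # PWConv: d -> C
        self.fuse = nn.Conv2d(2 * c, c, n, padding=n // 2)      # n*n conv after concat

    def forward(self, x):                                       # x: [B, C, H, W]
        y = self.local_conv(x)
        y = self.expand(y)                                      # [B, d, H, W]
        b, d, h, w = y.shape
        y = y.flatten(2).transpose(1, 2)                        # unfold -> [B, H*W, d]
        y = self.transformers(y)                                # global self-attention
        y = y.transpose(1, 2).reshape(b, d, h, w)               # fold back -> [B, d, H, W]
        y = self.project(y)                                     # [B, C, H, W]
        y = torch.cat([x, y], dim=1)                            # concat with input -> [B, 2C, H, W]
        return self.fuse(y)                                     # [B, C, H, W]

# x = torch.randn(1, 32, 28, 28)
# out = MobileViTBlockSketch(c=32, d=64)(x)   # out: [1, 32, 28, 28]
```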
#### Advantages
* Compared with a plain convolution, i.e. unfolding $\rightarrow$ matrix multiplication $\rightarrow$ folding, replacing the matrix multiplication with a transformer lets the block also learn **global** features
* Every **pixel of the transformer output** contains information from **every pixel of the input**
  ![](https://newprediction.blob.core.windows.net/dogchen-hackmd/eG8ND1j.png =200x)

### Architecture
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/GgGoVb6.png =x140)
* A **3 * 3 conv** with 2x down-sampling
* 4 **MobileNetV2 blocks**, with 2x down-sampling applied twice among them
* **MobileViT blocks** and MV2 blocks stacked alternately
* A **PWConv** to compress the channels
* Global pooling

### Performance
On ImageNet-1k,
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/PqaY1Zc.png =350x)

<!--
MIoU (Mean Intersection over Union)
* The mean of orange / (red + orange + yellow)
![](https://newprediction.blob.core.windows.net/dogchen-hackmd/zp9eGLa.png =300x)
-->
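
Side note on the segmentation metric: the ADE20K results in the HRViT section above are reported in mIoU, the per-class intersection-over-union averaged over classes. Below is a minimal NumPy sketch of the metric for a single prediction/ground-truth pair (the helper name `mean_iou` is illustrative only; benchmark scores accumulate these counts over the whole validation set).

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU: per-class intersection / union, averaged over classes present."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:                      # class absent in both pred and gt: skip
            continue
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# pred = np.array([[0, 1], [1, 2]]); gt = np.array([[0, 1], [2, 2]])
# mean_iou(pred, gt, num_classes=3)  # mean of {1.0, 0.5, 0.5} = 0.667
```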