[CV] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE (2020)

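###### tags: `Paper`

https://arxiv.org/pdf/2010.11929.pdf
https://www.youtube.com/watch?v=TrdevFK_am4

### Intro
- Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention, some replacing the convolutions entirely.
- The latter have not yet been scaled effectively to large datasets because of the specialized changes they make to how the Transformer is used, so the SOTA is still the ResNet family.
- Split an image into patches -> embedding -> each patch is treated as a token (word) for the Transformer.
- On mid-sized datasets, performance is slightly worse than ResNet. The authors attribute this to the Transformer lacking the CNN's built-in locality information. Once the data becomes very large (14M-300M images), the Transformer outperforms CNNs.

### Related Works
- Most earlier applications of Transformers to images used pixel-wise attention, which is too expensive to compute, so various variants appeared, such as local attention and attention along a single axis.
- Up to this point, almost no work had applied global attention over the entire image.

### Method (Vision Transformer, ViT)
![](https://i.imgur.com/4YTA9cN.png)
![](https://i.imgur.com/HYUCYhI.png)
- Each image is reshaped from HxWxC into Nx(P^2xC), where N = (HxW)/P^2 is the sequence length of the Transformer input.
- **(1)** Each flattened patch is linearly embedded into a D-dim vector, and a learnable class vector is prepended to the sequence.
- **(2), (3)** l is the index of the Transformer layer (MSA: multi-head self-attention, LN: layer-norm).
- **(4)** The final classification head is applied to z(0,L), the class-token output of the last layer.
- **Positional Embedding**: experiments show that a 2D embedding is no better than 1D, so 1D is used.
    - **Variants (defined per patch, not per pixel):**
        - No positional embedding
        - 1D (this paper)
        - 2D: one embedding is learned for the X axis and one for the Y axis, each of length D/2, and the two are concatenated when used
        - Relative positional embeddings: define the relative distance between every pair of patches and use 1D relative attention to learn an embedding of that distance
    - **How the embedding is applied:**
        - Added once, before entering the Transformer (this paper)
        - Added at the input of every layer, with each layer learning its own embedding parameters
        - Learned once and added (shared) at the input of every layer
    - The authors argue that because patches are used instead of raw pixels, the exact form of the spatial embedding matters less. The input grid is also much smaller than at pixel level (14x14 vs. 224x224), so the spatial relations should be learnable either way.

    ![](https://i.imgur.com/EcohKBp.png)
- Hybrid Architecture: replace the patch embedding with the feature map from the first few ResNet layers (not much benefit).

### Fine Tuning
- Fine-tuning uses higher-resolution images while keeping the patch size the same, so the number of patches grows (the sequence gets longer). The Transformer's own Wq, Wk are independent of the input "length" (they depend only on the embedding size).
- Because the sequence length changes, the pre-trained 1D positional embeddings no longer line up. They are interpolated (2D interpolation in the paper) according to each patch's location in the original image.

### Result
![](https://i.imgur.com/ptZBrAd.png)
![](https://i.imgur.com/xkp8UrZ.png)
![](https://i.imgur.com/28MIzG1.png)
- The larger the pre-training dataset, the better.

![](https://i.imgur.com/5rlQcJa.png)
- The Transformer is able to attend to far-away pixels even in very shallow layers (and past a depth of about 10 it effectively sees globally).

![](https://i.imgur.com/W3YGkqc.png)

### Conclusion
- The CNN's strong inductive prior (bias) is to look at neighbors within a kernel's range and then gradually combine them with other kernels.
- When there is enough data, the inductive bias can actually hurt performance (Transformers are more general).
- **The skip connections in the Transformer may be a focus of future work**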
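
To make the patch-splitting step from the Method section concrete, here is a minimal sketch of reshaping an HxWxC image into N flattened patches of length P^2xC. It is written in plain Python on nested lists for readability and is not the paper's implementation; the function name `split_into_patches` and the toy image are assumptions made for illustration.

```python
def split_into_patches(image, patch_size):
    """Split an H x W x C nested-list image into N flattened patches.

    Returns a list of N = (H * W) / P^2 patches, each a flat list of
    P * P * C values, ordered left-to-right, top-to-bottom.
    (Hypothetical helper for illustration, not taken from the paper's code.)
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("H and W must be divisible by the patch size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


# Toy 4x4 image with 3 channels; pixel (r, c) holds [r, c, 0].
toy_image = [[[r, c, 0] for c in range(4)] for r in range(4)]
toy_patches = split_into_patches(toy_image, patch_size=2)

# N = (4 * 4) / 2^2 = 4 patches, each of length P^2 * C = 12.
print(len(toy_patches), len(toy_patches[0]))
```

With a 224x224 input and P = 16 the same function yields 196 patches, which matches the 14x14 grid mentioned in the positional-embedding discussion. Each flattened patch would then go through the linear embedding of step (1).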
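
The Fine Tuning section notes that the pre-trained positional embeddings must be interpolated when the patch grid grows. The sketch below shows one way this could look, assuming the patch embeddings are laid out as a square grid and using bilinear interpolation with aligned corners. The paper only states that 2D interpolation is used, so the interpolation mode, the function name `resize_pos_embedding`, and the omission of the class-token embedding (which would be kept as-is) are all assumptions.

```python
def resize_pos_embedding(grid, new_size):
    """Bilinearly resize a square grid of positional-embedding vectors.

    grid: old_size x old_size nested list, each cell a list of D floats.
    Returns a new_size x new_size grid of D-dim vectors.
    (Hypothetical helper; bilinear mode is an assumption, see lead-in.)
    """
    old_size = len(grid)
    dim = len(grid[0][0])
    # Map each new index to an old coordinate so that the corners align.
    scale = (old_size - 1) / (new_size - 1) if new_size > 1 else 0.0

    resized = []
    for i in range(new_size):
        y = i * scale
        y0 = int(y)
        y1 = min(y0 + 1, old_size - 1)
        wy = y - y0  # vertical blend weight toward row y1
        row = []
        for j in range(new_size):
            x = j * scale
            x0 = int(x)
            x1 = min(x0 + 1, old_size - 1)
            wx = x - x0  # horizontal blend weight toward column x1
            cell = [
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ]
            row.append(cell)
        resized.append(row)
    return resized


# Toy example: a 2x2 grid of 1-dim embeddings resized to 3x3.
small_grid = [[[0.0], [1.0]], [[2.0], [3.0]]]
resized_grid = resize_pos_embedding(small_grid, 3)
print(resized_grid[1][1])  # center cell is the average of the four corners
```

In the fine-tuning setting this would correspond to, for example, stretching a 14x14 grid of pre-trained embeddings to the larger grid produced by a higher-resolution input with the same patch size.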