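# MAE

###### tags: `self-sup`

## Overview

**MAE (Masked Autoencoder)** is a self-supervised learning model. During pretraining, random patches of the input image are masked and the model learns to reconstruct the original image. MAE has two core designs:

1. **Asymmetric encoder-decoder**: the encoder operates only on the unmasked patches, while a lightweight decoder reconstructs the original image from the encoder output and mask tokens.
2. **Masking a high proportion of the input image**: experiments show that masking 75% of the patches gives very good results.

## Introduction

BERT has been very successful in NLP, which leads the authors to ask: what makes masked autoencoding different between vision and language?

1. **Structure**: traditional CNN architectures differ substantially from the BERT architecture, and this gap only closed with the arrival of ViT.
2. **Information density**: language has high information density, whereas images have comparatively low information density (a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes). Experiments show that raising the masking ratio in MAE improves training results.
3. **Decoder**: BERT's decoder is an MLP, whereas MAE's decoder has to reconstruct the original image. Experiments show that the decoder design strongly affects the encoder's ability to extract features.

Experiments show that MAE self-supervised pretraining outperforms supervised pretraining, and the gap becomes more pronounced as the model is scaled up.

## Method

![](https://hackmd.io/_uploads/Hyw8bauH2.png)

MAE uses an asymmetric design: the encoder operates only on the unmasked patches, and the decoder produces the reconstruction from the encoder output and mask tokens.

1. **Masking**
2. **MAE encoder**: a ViT model that is applied only to the unmasked patches. As in a standard ViT, the patches first go through a linear projection, then positional embeddings are added, and the result is passed through a series of Transformer blocks.
3. **MAE decoder**: the decoder input consists of the encoded visible patches and the mask tokens. All mask tokens are one shared, learnable vector. The decoder first adds positional embeddings to all of its inputs and then applies a series of Transformer blocks. **The decoder is used only during pretraining.**
4. **Reconstruction target**: the last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch. The output tensor is reshaped to form the reconstructed image. The loss is MSE, computed only on the masked patches. Experiments show that using the normalized pixel values of each masked patch as the reconstruction target improves representation quality.

## ImageNet Experiments

The model is first pretrained in a self-supervised manner on the ImageNet-1K training set. Supervised training is then used to evaluate the quality of the learned representations, through

+ end-to-end fine-tuning
+ linear probing

With **ViT-Large** as the baseline backbone, experiments show that the model pretrained with MAE clearly outperforms the same model trained without pretraining.

![](https://hackmd.io/_uploads/SywGCCOrh.png)

The authors also ran ablation studies:

1.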
**Masking ratio**: a masking ratio of 75% works well for both linear probing and fine-tuning. This differs from BERT, whose typical masking ratio is 15%.

   ![](https://hackmd.io/_uploads/BJJ1JkYS2.png)

   ![](https://hackmd.io/_uploads/BJUhpvBUn.png)

2. **Decoder design**:
   + **decoder depth**: linear probing benefits from a deeper decoder, whereas decoder depth has little effect on fine-tuning, because the last layers of the encoder are themselves tuned during fine-tuning.

     ![](https://hackmd.io/_uploads/By_0xkFSh.png)

   + **decoder width**: a width of 512-d gives good results for both linear probing and fine-tuning.

     ![](https://hackmd.io/_uploads/B16J-JFHn.png)

3. **Mask Token**: when the encoder also takes mask tokens as input, the model performs worse. Under linear probing, accuracy drops by 14%.

   ![](https://hackmd.io/_uploads/HkB2W1YH3.png)

4. **Reconstruction target**

   ![](https://hackmd.io/_uploads/HkB-Mytrh.png)

5. **Data augmentation**: MAE achieves good results even without data augmentation, because random masking itself acts as a form of data augmentation.

   ![](https://hackmd.io/_uploads/HyhLMyKrh.png)

6. **Mask sampling strategy**

   ![](https://hackmd.io/_uploads/ryTeQ1KS2.png)

## Comparison with supervised pretraining

![](https://hackmd.io/_uploads/r1zMVytHn.png)
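
## Sketch: masking and masked-patch loss

To make the masking, decoder-input, and loss steps from the Method section concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`random_masking`, `build_decoder_input`, `normalize_patch`, `masked_mse`) are hypothetical, patches are assumed to be already flattened into lists of pixel values, and the `eps` constant is an assumed value for numerical stability. A real implementation would operate on batched tensors instead of lists.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, rng=None):
    """Pick which patches stay visible. Returns (keep_ids, mask), mask[i] == 1 if masked."""
    rng = rng or random.Random()
    num_keep = round(num_patches * (1 - mask_ratio))
    ids = list(range(num_patches))
    rng.shuffle(ids)                      # uniform random permutation of patch indices
    keep_ids = sorted(ids[:num_keep])     # the encoder only sees these patches
    kept = set(keep_ids)
    mask = [0 if i in kept else 1 for i in range(num_patches)]
    return keep_ids, mask


def build_decoder_input(encoded_visible, keep_ids, num_patches, mask_token):
    """Put encoded visible tokens back at their positions; fill the rest with the shared mask token."""
    tokens = [mask_token] * num_patches
    for pos, tok in zip(keep_ids, encoded_visible):
        tokens[pos] = tok
    return tokens


def normalize_patch(patch, eps=1e-6):
    """Normalize one patch by its own mean and standard deviation (eps is an assumed constant)."""
    mean = sum(patch) / len(patch)
    var = sum((p - mean) ** 2 for p in patch) / len(patch)
    std = (var + eps) ** 0.5
    return [(p - mean) / std for p in patch]


def masked_mse(pred, target, mask, norm_pix=True):
    """Mean squared error averaged over masked patches only."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m == 0:
            continue                      # visible patches do not contribute to the loss
        if norm_pix:
            t = normalize_patch(t)        # per-patch normalized reconstruction target
        total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)
        count += 1
    return total / count if count else 0.0
```

For a 224×224 image with 16×16 patches there are 196 patches, so `random_masking(196, 0.75)` keeps 49 patches for the encoder and masks the other 147. The decoder then receives all 196 positions from `build_decoder_input`, and `masked_mse` scores only the 147 masked ones.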