# An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., *An image is worth 16x16 words: Transformers for image recognition at scale*, International Conference on Learning Representations, 2021.

[TOC]

## Abstract

> First work to fully drop CNNs and apply the Transformer architecture to classification; at large data scale it is also more resource-efficient.

Transformers have become the standard for NLP tasks. In CV, attention is either applied in conjunction with CNNs, or used to replace certain components of CNNs while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and that a pure Transformer applied directly to sequences of image patches can perform very well on image classification tasks. ViT attains excellent results compared to CNNs while requiring substantially fewer computational resources to train.

## Introduction

> 1. Keep the Transformer architecture as intact as possible; introduce the patch (token) concept to replace convolution.
> 2. Performance is unremarkable on mid-sized datasets, but excellent and resource-efficient at large data scale.

Inspired by the Transformer scaling successes in NLP, we experiment with **applying a standard Transformer directly to images, with the fewest possible modifications**. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as input to a Transformer. Image **patches are treated the same way as tokens** in an NLP application. We train the model on image classification in supervised fashion.

When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: **Transformers lack some of the inductive biases inherent to CNNs**, such as **translation equivariance** and **locality**. However, the picture changes if the models are trained on larger datasets. We find that **large-scale training trumps inductive bias**. ViT attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints.

## Method

![image](https://hackmd.io/_uploads/S121Ig1Ha.png)

reshape: $x\in\mathbb{R}^{H\times W\times C}\rightarrow x_p\in\mathbb{R}^{N\times(P^2\cdot C)}$, where $N=HW/P^2$ is the number of patches and $P$ is the patch size (a code sketch of this patch embedding appears below, after the Inductive bias subsection).

### Fine-Tuning and Higher Resolution

> 1. Pre-train on a large dataset, then replace the MLP head with a $D\rightarrow K$ layer and train it on the target data.
> 2. Fine-tune at higher resolution for the target data while keeping the same $P$ (the sequence gets longer).

We pre-train ViT on large datasets, and fine-tune to downstream tasks. For this, we remove the pre-trained prediction head and attach a zero-initialized $D\times K$ feedforward layer, where $K$ is the number of downstream classes. It is often beneficial to fine-tune at higher resolution than pre-training (see the fine-tuning sketch below).

### Inductive bias

> ViT has less image-specific inductive bias than CNNs.
> **CNN**: every layer has locality, 2D neighborhood structure, and translation equivariance.
> **ViT**: the MLPs are local and translation-equivariant while attention is global; only the initial patch splitting and the fine-tuning step use the 2D neighborhood structure.

We note that **ViT has much less image-specific inductive bias than CNNs**. In CNNs, locality, 2D neighborhood structure, and translation equivariance are baked into each layer. In ViT, only the MLP layers are local and translationally equivariant, while the self-attention layers are global. The 2D neighborhood structure is used very sparingly: at the beginning of the model, by cutting the image into patches, and at fine-tuning time, for adjusting the position embeddings for images of different resolution. Other than that, all spatial relations between the patches have to be learned from scratch.
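To make the reshape and patch-embedding step concrete, here is a minimal PyTorch sketch (not the authors' reference implementation, which is in JAX/Flax); the class and parameter names (`ViTEmbed`, `embed_dim`, ...) are illustrative. It projects each $P\times P$ patch to $D$ dimensions with a strided convolution that is equivalent to the linear projection $E$, prepends a learnable `[class]` token, and adds learned position embeddings.

```python
import torch
import torch.nn as nn

class ViTEmbed(nn.Module):
    """Image -> sequence of patch embeddings with [class] token and position embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        # A Conv2d with kernel = stride = P is equivalent to flattening each
        # P x P x C patch and applying the linear projection E: P^2*C -> D.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                         # x: (B, C, H, W)
        x = self.proj(x)                          # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)          # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)            # (B, N + 1, D)
        return x + self.pos_embed                 # learned 1D position embeddings


tokens = ViTEmbed()(torch.randn(2, 3, 224, 224))  # (2, 197, 768), fed to the Transformer encoder
```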
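The fine-tuning recipe can likewise be sketched. This is an assumption-laden illustration, not the paper's code: `model.head` and `model.pos_embed` are hypothetical attribute names, and bicubic interpolation of the 2D position-embedding grid is one common way to realize the "2D interpolation of the pre-trained position embeddings" that the paper describes for higher-resolution inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prepare_for_finetuning(model, embed_dim, num_classes, old_grid, new_grid):
    """Swap in a zero-initialized D -> K head and resize position embeddings.

    `model.head` / `model.pos_embed` are assumed attribute names (illustrative).
    """
    # 1) Replace the pre-trained prediction head with a zero-initialized D x K layer.
    model.head = nn.Linear(embed_dim, num_classes)
    nn.init.zeros_(model.head.weight)
    nn.init.zeros_(model.head.bias)

    # 2) Higher-resolution fine-tuning keeps the same patch size P, so the patch
    #    sequence gets longer; interpolate the pre-trained position embeddings on
    #    their 2D grid (old_grid x old_grid -> new_grid x new_grid).
    pos = model.pos_embed                              # (1, 1 + old_grid**2, D)
    cls_pos, patch_pos = pos[:, :1], pos[:, 1:]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, embed_dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, embed_dim)
    model.pos_embed = nn.Parameter(torch.cat([cls_pos, patch_pos], dim=1))
    return model
```

For example, going from 224x224 pre-training to 384x384 fine-tuning with $P=16$ corresponds to `old_grid=14, new_grid=24`.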
### Hybrid Architecture

> Use a CNN feature map as the input sequence.

As an alternative to raw image patches, the input sequence can be formed from the feature maps of a CNN. In this hybrid model, the patch embedding projection $E$ is applied to patches extracted from a CNN feature map (see the sketch at the end of this note).

## Experiments

Compare ResNet, ViT, and the hybrid.

### Models

![image](https://hackmd.io/_uploads/ryWAwl1S6.png =560x)

### Datasets

**Pre-train:** ImageNet, ImageNet-21k, JFT-300M
**Transfer (fine-tune):** ImageNet/ReaL, CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102

Pre-processing follows BiT (https://arxiv.org/abs/1912.11370).

![image](https://hackmd.io/_uploads/HJrrFgkSa.png =640x)

### Pretrain

ViT outperforms CNNs only when pre-trained on large datasets.

![image](https://hackmd.io/_uploads/ByJaYxJSp.png)

The hybrid slightly outperforms ViT at small computational budgets, but the difference vanishes for larger models.

![image](https://hackmd.io/_uploads/BkPW9x1Sa.png =640x)

## Open Source

https://github.com/google-research/vision_transformer
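To make the hybrid variant concrete, here is a minimal sketch in the same spirit as the earlier ones (the stand-in CNN below is purely illustrative; the paper extracts feature maps from a ResNet): the patch-embedding projection $E$ is applied to the CNN feature map, and because the backbone has already reduced the spatial resolution, the "patches" can be as small as 1x1.

```python
import torch
import torch.nn as nn

# Stand-in CNN backbone (illustrative only; the paper uses ResNet feature maps).
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

embed_dim = 768
# Patch embedding projection E applied to the feature map; 1x1 "patches" suffice
# because the backbone has already reduced the spatial resolution.
proj = nn.Conv2d(256, embed_dim, kernel_size=1, stride=1)

x = torch.randn(2, 3, 224, 224)
feat = backbone(x)                                  # (2, 256, 28, 28)
tokens = proj(feat).flatten(2).transpose(1, 2)      # (2, 784, 768): input sequence for the Transformer
```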