{%hackmd SybccZ6XD %}
# AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
###### tags: `paper`
## ABSTRACT
- Image classification tasks
  - Previous work: CNN-based models
  - This paper: a pure Transformer applied directly to sequences of image patches
- How: split the image into patches and feed the patch embeddings to a standard Transformer
- Experiments
  - Pre-train on large datasets and transfer to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.)
## INTRODUCTION
- Steps
  - Split the image into fixed-size patches
  - Linearly embed each patch and add position embeddings
  - Feed the resulting sequence to a standard Transformer encoder
  - Classify with an MLP head
- When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies, a few percentage points below ResNets of comparable size
- Transformers lack some of the inductive biases of CNNs, such as translation equivariance and locality (note to self: what if self-attention were applied within each patch separately?)
- When trained on larger datasets, ViT achieves excellent results
  - "large scale training trumps inductive bias"
## METHOD
The model design follows the original Transformer (Vaswani et al., 2017) as closely as possible.
### VISION TRANSFORMER (VIT)

### Reshape and Unroll
The Transformer takes a 1D sequence of token embeddings as input, so the image $x \in \mathbb{R}^{H \times W \times C}$ is reshaped into a sequence of $N = HW/P^2$ flattened patches $x_p \in \mathbb{R}^{N \times (P^2 C)}$.
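
A minimal sketch of this reshape in PyTorch (toy shapes chosen so that N = 9, matching the example below; not the paper's code):
```python=
import torch

# Toy example: batch of images (B, C, H, W) with H = W = 48 and patch size P = 16,
# giving N = H*W / P^2 = 9 patches of dimension P*P*C = 768 each
B, C, H, W, P = 2, 3, 48, 48, 16
x = torch.randn(B, C, H, W)

patches = x.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)   # (B, H/P, W/P, C, P, P)
patches = patches.reshape(B, -1, C * P * P)   # (B, N, P*P*C) = (2, 9, 768)
```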

### Linear Projection and Embedding
$z_0 = [x_{class}; x_p^1E; x_p^2E; ...; x_p^NE] + E_{pos}$
N = 9 in this example (a 3×3 grid of patches)
$E\in \mathbb{R}^{(P^2C)\times D}$
$E_{pos}\in \mathbb{R}^{(N+1)\times D}$

Embedding code
```python=
# Learnable position embedding for the class token plus num_patches patch tokens
self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
# Add the position embeddings to the (n + 1)-token sequence
x += self.pos_embedding[:, :(n + 1)]
```
Class token: a learnable embedding prepended to the patch sequence, similar to BERT's `[class]` token (zero-initialized). Its state at the Transformer output is used as the image representation for classification.
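
A minimal sketch of prepending the class token (names such as `cls_token` are illustrative, not from the paper's code):
```python=
import torch
import torch.nn as nn

B, N, dim = 2, 9, 768                              # batch size, number of patches, hidden size D
x = torch.randn(B, N, dim)                         # patch embeddings x_p^i E

cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [class] token, zero-initialized
cls = cls_token.expand(B, -1, -1)                  # (B, 1, dim)
x = torch.cat([cls, x], dim=1)                     # (B, N + 1, dim); E_pos is added after this
```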

### Transformer Encoder

Each encoder block applies multi-head self-attention (MSA) and an MLP, with LayerNorm applied before and a residual connection after each of them.

[Transformer](https://hackmd.io/11069bzHTUyQU1ClW9_TMQ)
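
A minimal sketch of one encoder block following this description, with `nn.MultiheadAttention` standing in for MSA and ViT-Base hyperparameters assumed:
```python=
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT-style encoder block: LN -> MSA -> residual, then LN -> MLP -> residual."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(            # two fully connected layers with GELU in between
            nn.Linear(dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, dim),
        )

    def forward(self, x):                    # x: (B, N + 1, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

z = EncoderBlock()(torch.randn(2, 10, 768))  # same shape in and out: (2, 10, 768)
```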
### MLP
Two fully connected layers with a GELU activation in between (the MLP block inside each encoder block).
### FINE-TUNING AND HIGHER RESOLUTION
Pre-train on large datasets, then fine-tune on smaller downstream tasks.
Steps
- Pre-train with a prediction head that is an MLP with one hidden layer
- For fine-tuning, remove the pre-trained prediction head
- Attach a zero-initialized $D \times K$ feedforward layer, where $K$ is the number of downstream classes (see the sketch below)
- Fine-tuning is typically done at higher resolution than pre-training: the patch size stays the same, so the sequence gets longer, and the pre-trained position embeddings are 2D-interpolated to fit the new resolution
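
A minimal sketch of the head swap (the model attribute `vit.head` and the values of D and K are assumptions, not the paper's code):
```python=
import torch.nn as nn

D, K = 768, 10                      # hidden size D, number of downstream classes K
new_head = nn.Linear(D, K)          # zero-initialized D x K feedforward layer
nn.init.zeros_(new_head.weight)
nn.init.zeros_(new_head.bias)
# vit.head = new_head               # attach in place of the pre-trained prediction head
```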
## EXPERIMENTS
### SETUP
==Datasets.==
Pre-training datasets
- ILSVRC-2012 ImageNet dataset: 1k classes and 1.3M images
- ImageNet-21k: 21k classes and 14M images
- JFT: 18k classes and 303M high resolution images
Benchmark (downstream) tasks, preprocessed following Big Transfer (BiT, "General Visual Representation Learning"):
- ImageNet on the original validation labels and the cleaned-up ReaL labels
- CIFAR-10/100
- Oxford-IIIT Pets
- Oxford Flowers-102
19-task VTAB classification suite (1,000 training examples per task):
- Natural: tasks like the above, e.g. Pets, CIFAR
- Specialized: medical and satellite imagery
- Structured: tasks that require geometric understanding, like localization
==Model Variants.==
- Layers: number of encoder blocks
- Hidden size D: dimension of the patch embeddings (the output of the linear projection), kept fixed through the network
- MLP size: hidden dimension of the MLP block in each encoder layer (4 x D for all variants)
- Heads: number of heads in the multi-head self-attention
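
For reference, a hypothetical config dict with the variant values from Table 1 of the paper:
```python=
# Variant configurations from Table 1 of the paper
VIT_VARIANTS = {
    "ViT-Base":  {"layers": 12, "hidden_D": 768,  "mlp_size": 3072, "heads": 12},  # ~86M params
    "ViT-Large": {"layers": 24, "hidden_D": 1024, "mlp_size": 4096, "heads": 16},  # ~307M params
    "ViT-Huge":  {"layers": 32, "hidden_D": 1280, "mlp_size": 5120, "heads": 16},  # ~632M params
}
```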

==Training & Fine-tuning.==
Pre-training
- Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$, batch size 4096, and a high weight decay of 0.1
Fine-tuning
- SGD with momentum, batch size 512
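
A minimal sketch of these optimizer settings in PyTorch (the model and learning rates are placeholders; the paper's warmup and decay schedules are omitted):
```python=
import torch
import torch.nn as nn

model = nn.Linear(768, 1000)        # stand-in for the ViT; learning rates below are illustrative

# Pre-training: Adam with beta_1 = 0.9, beta_2 = 0.999, high weight decay of 0.1
pretrain_opt = torch.optim.Adam(model.parameters(), lr=1e-3,
                                betas=(0.9, 0.999), weight_decay=0.1)

# Fine-tuning: SGD with momentum
finetune_opt = torch.optim.SGD(model.parameters(), lr=3e-2, momentum=0.9)
```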
==Metrics.==
Results are reported as
- few-shot accuracy: obtained by solving a regularized least-squares regression problem that maps the (frozen) representations of a subset of training images to $\{-1,1\}^K$ target vectors (see the sketch below)
- fine-tuning accuracy: performance after fine-tuning the model on the respective dataset
:::warning
Note (shots)
n-shot: each class has n training samples
:::
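
A minimal sketch of that few-shot linear probe, assuming frozen features `Z`, integer labels `y`, and a small ridge penalty `lam`:
```python=
import torch

def fewshot_probe(Z, y, num_classes, lam=1e-3):
    """Ridge regression from frozen features Z (n, D) to {-1, 1}^K target vectors."""
    Y = -torch.ones(Z.shape[0], num_classes)
    Y[torch.arange(Z.shape[0]), y] = 1.0              # {-1, 1} targets
    A = Z.T @ Z + lam * torch.eye(Z.shape[1])
    return torch.linalg.solve(A, Z.T @ Y)             # closed-form solution W: (D, K)

# Usage (toy shapes): W = fewshot_probe(torch.randn(50, 768), torch.randint(0, 10, (50,)), 10)
# Predictions on new frozen features: (Z_test @ W).argmax(dim=1)
```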
### COMPARISON TO STATE OF THE ART


### PRE-TRAINING DATA REQUIREMENTS
Comparison of pre-training datasets of increasing size (ImageNet, ImageNet-21k, JFT-300M):
- When pre-trained on the smallest dataset, large ViT models perform worse than BiT ResNets
- Large ViT models shine when pre-trained on the larger datasets

The number in the model names in the picture above (e.g. /16, /32) is the patch size P.

Comparison across random subsets of JFT-300M of different sizes (9M, 30M, 90M, and the full 300M)
- Early stopping is used, and the best validation accuracy achieved during training is reported

### SCALING STUDY

### INSPECTING VISION TRANSFORMER
### SELF-SUPERVISION
## My results
| Preprocessing | Accuracy | Training time |
| --- | --- | --- |
| RandomCrop + resize | 0.9737 | 3:03:00 |
| Resize | 0.9823 | 2:58:00 |
| My resize | 0.9874 | 2:20:00 |