# Transformer
###### tags: `Deep Learning for Computer Vision`
## Self-Attention
* query, key: compute the relationship (similarity) between a pair of words
* softmax: turn the raw scores (logits) $a_{i,j}$ into normalized attention weights
* value: the vectors that are combined into a weighted sum using those attention weights
### Step 1 :
Use $X = \{x_1, x_2, \dots, x_N\}$ to represent the $N$ input tokens, and obtain the initial representations $Q$, $K$, $V$ through **linear transformations** $W$
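In matrix form, with one row per token, this step is usually written with three separate learned projections (using the conventional names $W^Q$, $W^K$, $W^V$):

$$
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
$$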

### Step 2 :
Calculate the similarity score between each query $q$ and each key $k$ by taking their inner product
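For a single query-key pair this gives the raw score below; the original Transformer additionally scales by $\sqrt{d_k}$, the key dimension, to keep the dot products from growing too large:

$$
a_{i,j} = \frac{q_i \cdot k_j}{\sqrt{d_k}}
$$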

### Step 3 :
Apply softmax to normalize the scores into the range 0~1
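Concretely, softmax exponentiates each score and normalizes across the row, so every row of weights sums to 1 (writing $\hat{a}_{i,j}$ for the normalized weight):

$$
\hat{a}_{i,j} = \frac{\exp(a_{i,j})}{\sum_{t} \exp(a_{i,t})}
$$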

### Step 4 :
Compute the weighted sum of the value vectors using the normalized weights $\hat{a}_{i,j}$
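With $b_i$ denoting the output at position $i$ (notation assumed here for clarity):

$$
b_i = \sum_{j} \hat{a}_{i,j} \, v_j
$$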

### Step 5 :
Repeat the attention computation for each word, so that every position produces its own output vector
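Stacking all positions together recovers the familiar matrix form of scaled dot-product attention:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$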

### Implementation
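A minimal sketch of steps 1-4 as a single-head self-attention module in PyTorch; the names (`SelfAttention`, `d_model`) are illustrative, not a specific reference implementation:

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal single-head self-attention (steps 1-4 above)."""
    def __init__(self, d_model):
        super().__init__()
        # Step 1: learned linear maps that produce Q, K, V from the input
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, N, d_model)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # Step 2: scaled dot-product similarity between queries and keys
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        # Step 3: softmax normalizes each row of scores into weights in [0, 1]
        attn = scores.softmax(dim=-1)
        # Step 4: weighted sum of the value vectors
        return attn @ v                            # (batch, N, d_model)

x = torch.randn(2, 5, 8)                           # batch of 2, N=5 tokens, d_model=8
out = SelfAttention(8)(x)                          # output shape: (2, 5, 8)
```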

#### Multi-Head Self-Attention
Split the representation into multiple heads, forming multiple subspaces, so that the model can focus on **different aspects of information** (each head's weights are initialized randomly)
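A minimal sketch of the multi-head version, assuming `d_model` is divisible by the number of heads; reshaping into `num_heads` heads lets each head attend in its own `d_model / num_heads`-dimensional subspace:

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention sketch."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.w_out = nn.Linear(d_model, d_model)       # output projection

    def forward(self, x):                              # x: (batch, N, d_model)
        b, n, _ = x.shape
        # Split the projections into heads: each head works in its own subspace
        qkv = self.w_qkv(x).reshape(b, n, 3, self.num_heads, self.d_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (b, heads, N, d_head)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = scores.softmax(dim=-1)                  # per-head attention weights
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)  # concatenate the heads
        return self.w_out(out)
```

PyTorch also ships an equivalent building block, `torch.nn.MultiheadAttention`, which additionally supports masking and dropout.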


#### Layer normalization
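Layer normalization standardizes each token's feature vector with its own mean $\mu$ and variance $\sigma^2$, then applies a learned scale $\gamma$ and shift $\beta$ (the standard formulation, stated here for completeness):

$$
\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$

Unlike batch normalization, the statistics are computed per token over the feature dimension, so the result does not depend on the other samples in the batch.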

#### The Decoder in Transformer

# Vision Transformer

## Query-Key-Value Attention
* Assume that the input is partitioned into 4 patches and the feature dimension is 3, that is, $P=4$ and $D=3$
* Note that there are $(P+1)$ rows, since we prepend an additional [class] token (see the sketch below)
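A minimal sketch of how the $(P+1)$ token rows arise, assuming $P=4$ patches of dimension $D=3$ and a learnable [class] token (variable names are illustrative):

```python
import torch
import torch.nn as nn

P, D = 4, 3                                      # 4 patches, feature dimension 3
patch_tokens = torch.randn(1, P, D)              # (batch, P, D) patch embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, D))   # learnable [class] token

# Prepending the class token gives the token matrix its (P + 1) rows
tokens = torch.cat([cls_token, patch_tokens], dim=1)
print(tokens.shape)                              # torch.Size([1, 5, 3])
```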

## CNN v.s. Transformer
When we train a CNN, the kernels are learnable, and each kernel is convolved over the entire image; regions where the convolution response is strong are treated as more important.
Similarly, the Transformer uses the attention mechanism to compute query-key weights, which indicate how much each position should attend to the others.
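A toy sketch of the key difference, under assumed toy shapes: a CNN kernel is fixed after training and slid over every image unchanged, while attention weights are recomputed from each input's queries and keys:

```python
import torch

x1 = torch.randn(1, 16, 3)                       # two different inputs,
x2 = torch.randn(1, 16, 3)                       # each with 16 tokens of dim 3

def attn_weights(t):
    # attention weights are computed FROM the input itself
    return (t @ t.transpose(-2, -1)).softmax(dim=-1)

# Unlike a trained CNN kernel (the same for every image), the attention
# map below is recomputed per input, so the two maps differ:
print(torch.allclose(attn_weights(x1), attn_weights(x2)))   # False (almost surely)
```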
## PS-ViT
* Vision Transformers with Progressive Sampling
* Progressively select important patches by shifting patch centers

## Transformer for Semantic Segmentation
<img src="https://i.imgur.com/FevokBI.png" width="300"/>
### Architecture

### Different patch size
