# Transformer

###### tags: `Deep Learning for Computer Vision`

## Self-Attention

- query, key: calculate the relationship (similarity) between two words
- softmax: normalize the scalar scores $a_{i,j}$ (the logits) into attention weights between 0 and 1
- value: each word's vector, which is combined by a weighted sum over the other words

### Step 1 : Use $X = \{x_1, x_2, ..., x_N\}$ to represent the $N$ input tokens, and get the initial representations $Q, K, V$ through **linear transformations** $W^Q, W^K, W^V$

![](https://i.imgur.com/vRr9pok.jpg)

### Step 2 : Calculate the similarity score between query $q$ and key $k$ by taking the inner product

![](https://i.imgur.com/li0UKpk.jpg)

### Step 3 : Apply softmax to normalize the scores into the range between 0 and 1

![](https://i.imgur.com/FdqDTzy.jpg)

### Step 4 : Use each $a_{i,j}$ as a weight and take the weighted sum of the value vectors

![](https://i.imgur.com/CP9vVHf.jpg)

### Step 5 : Repeat the attention computation for every word (query position)

![](https://i.imgur.com/dpRIOl2.jpg)

### Implementation

![](https://i.imgur.com/O7gkD81.jpg)

(A minimal NumPy sketch of this computation is given at the end of these notes.)

#### Multi-Head Self-Attention

Divide the model into multiple heads to form multiple subspaces, allowing the model to focus on **different aspects of information** (each head's weights are initialized randomly).

![](https://i.imgur.com/RXKOFeo.jpg)

![](https://i.imgur.com/KSbcFgm.jpg)

#### Layer normalization

![](https://i.imgur.com/TMm5XSU.jpg)

#### The Decoder in Transformer

![](https://i.imgur.com/djnIHpQ.jpg)

# Vision Transformer

![](https://i.imgur.com/TIYP8NX.jpg)

## Query-Key-Value Attention

* Assume that the input is partitioned into 4 patches and the feature dimension is 3, that is, $P=4$ and $D=3$
* Note that there are $(P+1)$ rows since we have an additional class token (see the sketch at the end of these notes)

![](https://i.imgur.com/Plzaagu.jpg)

## CNN vs. Transformer

When we train a CNN, the kernel is learnable, and we convolve the entire image with it; regions that produce high convolution responses are treated as more important. Similarly, the Transformer uses the attention mechanism to compute query-key weights that indicate how much each region matters.

## PS-ViT

* Vision Transformers with Progressive Sampling
* Progressively select important patches by shifting the patch sampling centers

![](https://i.imgur.com/xzsNXHb.jpg)

## Transformer for Semantic Segmentation

<img src="https://i.imgur.com/FevokBI.png" width="300"/>

### Architecture

![](https://i.imgur.com/RTI0DsV.jpg)

### Different patch sizes

![](https://i.imgur.com/SXsfvqg.jpg)
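## Minimal Code Sketches

The sketches below recap the computations described in these notes. They follow the standard formulations; all variable names, dimensions, and random toy inputs are illustrative assumptions, not code from the lecture. First, single-head self-attention covering Steps 1–5 (the $\sqrt{d_k}$ scaling is part of the standard scaled dot-product attention, even though the steps above omit it):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax (Step 3)
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over N tokens.

    X:             (N, d_model) input token features
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    # Step 1: linear projections to queries, keys, values
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Step 2: similarity scores via inner products, scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (N, N)
    # Step 3: softmax turns each row of scores into weights a_{i,j}
    A = softmax(scores, axis=-1)         # (N, N)
    # Steps 4-5: weighted sum of the values for every query position
    return A @ V                         # (N, d_k)

# toy example: N = 4 tokens, d_model = d_k = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```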
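For Multi-Head Self-Attention, a common implementation splits the model dimension into `num_heads` subspaces, runs the same attention in each head, then concatenates the heads and applies an output projection $W^O$. This sketch assumes $d_{model}$ is divisible by the number of heads; in practice the projection matrices are learned, not random.

```python
import numpy as np

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention.

    X:                  (N, d_model) input token features
    W_q, W_k, W_v, W_o: (d_model, d_model) learned projections
    Each head attends in its own d_model // num_heads dimensional subspace.
    """
    N, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                        # (N, d_model)

    def split_heads(M):
        # (N, d_model) -> (num_heads, N, d_head)
        return M.reshape(N, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)      # (heads, N, N)
    scores -= scores.max(axis=-1, keepdims=True)               # stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    heads = A @ Vh                                             # (heads, N, d_head)
    concat = heads.transpose(1, 0, 2).reshape(N, d_model)      # concatenate heads
    return concat @ W_o                                        # output projection

# toy example: N = 4 tokens, d_model = 8, 2 heads
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (4, 8)
```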
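Finally, for the Query-Key-Value Attention example in the Vision Transformer part ($P = 4$ patches, $D = 3$), the $(P+1) \times D$ input token matrix can be built as below. The patch dimension, the random weights, and the positional-embedding step follow the standard ViT recipe and are assumptions here, not details from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

P, D = 4, 3                 # P patches, embedding dimension D (as in the example above)
patch_dim = 12              # e.g. a flattened 2x2 patch with 3 channels (assumed)

patches = rng.normal(size=(P, patch_dim))        # flattened image patches

# learned linear patch embedding (random here for illustration)
W_embed = rng.normal(size=(patch_dim, D))
tokens = patches @ W_embed                       # (P, D)

# prepend the learnable class token -> (P + 1) rows, as noted above
cls_token = rng.normal(size=(1, D))
tokens = np.concatenate([cls_token, tokens], axis=0)   # (P + 1, D)

# add positional embeddings before the Transformer encoder
pos_embed = rng.normal(size=(P + 1, D))
tokens = tokens + pos_embed
print(tokens.shape)   # (5, 3)
```

The resulting $(P+1) \times D$ matrix is what the query-key-value attention above operates on; the extra first row is the class token, whose final representation is used for classification.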