
Transformer

tags: Deep Learning for Computer Vision

Self-Attention

query, key : compute the relationship (similarity) between two words
softmax : transform the raw similarity scores into attention weights \(a_{i,j}\) between 0 and 1
value : the vector of each word; the output is the weighted sum of the values over all words

Step 1 :

Use \(X = \{x_1, x_2, ..., x_N\}\) to represent the \(N\) input tokens, and obtain the initial representations of the three vectors \(Q, K, V\) through linear transformations \(W\)

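In matrix form, Step 1 is three learned projections (a standard formulation; \(W^Q, W^K, W^V\) denote the learned weight matrices):

\[ Q = XW^Q, \quad K = XW^K, \quad V = XW^V \]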

Step 2 :

Calculate the similarity score between query \(q\) and key \(k\) by taking their inner product

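As a worked equation (the \(\sqrt{d_k}\) scaling follows the original Transformer, where \(d_k\) is the key dimension):

\[ s_{i,j} = \frac{q_i^\top k_j}{\sqrt{d_k}} \]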

Step 3 :

Apply softmax so that the scores for each query lie between 0 and 1 and sum to 1

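Written out, with \(s_{i,j}\) denoting the similarity score from Step 2:

\[ a_{i,j} = \frac{\exp(s_{i,j})}{\sum_{j'} \exp(s_{i,j'})} \]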

Step 4 :

Compute the output as the weighted sum of the values, using the attention weights \(a_{i,j}\)

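In symbols, the output for word \(i\) is:

\[ z_i = \sum_j a_{i,j} \, v_j \]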

Step 5 :

Repeat the attention computation for every word, so that each query attends over all keys

Implementation
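A minimal NumPy sketch of the five steps above (single head; the toy dimensions and the weight matrices `Wq`, `Wk`, `Wv` are illustrative assumptions, not values from the course):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a sequence X of shape (N, d_model)."""
    Q = X @ Wq                                   # Step 1: linear projections
    K = X @ Wk
    V = X @ Wv
    d_k = K.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                   # Step 2: similarity scores s_{i,j}
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)        # Step 3: softmax -> weights a_{i,j}
    return A @ V                                 # Steps 4-5: weighted sum for every word

rng = np.random.default_rng(0)
N, d_model = 4, 8
X = rng.standard_normal((N, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)  # (4, 8): one output vector per input word
```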

Multi-Head Self-Attention

Divide the model dimension into multiple heads, forming multiple subspaces so that the model can attend to different aspects of the information (each head's weights are initialized randomly)
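One common way to realize the split into subspaces (a sketch, assuming `d_model` divides evenly by `num_heads`; shapes only):

```python
import numpy as np

def split_heads(X, num_heads):
    """Reshape (N, d_model) -> (num_heads, N, d_model // num_heads)."""
    N, d_model = X.shape
    assert d_model % num_heads == 0
    return X.reshape(N, num_heads, d_model // num_heads).transpose(1, 0, 2)

def merge_heads(H):
    """Inverse: (num_heads, N, d_head) -> (N, num_heads * d_head)."""
    num_heads, N, d_head = H.shape
    return H.transpose(1, 0, 2).reshape(N, num_heads * d_head)

X = np.arange(4 * 8, dtype=float).reshape(4, 8)  # N=4 tokens, d_model=8
H = split_heads(X, num_heads=2)                  # (2, 4, 4): each head sees a subspace
print(H.shape, merge_heads(H).shape)             # (2, 4, 4) (4, 8)
```

Each head then runs the single-head attention independently, and the per-head outputs are merged back and mixed by one more linear layer.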

Layer normalization
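Layer normalization rescales each token's feature vector to zero mean and unit variance, then applies a learned scale and shift. A minimal NumPy sketch (the learnable `gamma`/`beta` are shown as plain arrays for illustration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature dimension (last axis), independently per token."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(), y.std())  # ~0.0, ~1.0
```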

The Decoder in Transformer

Vision Transformer

Query-Key-Value Attention

  • Assume that the input is partitioned into 4 patches and the feature dimension is 3, that is, P=4 and D=3
  • Note that there are (P+1) rows, since an additional class token is prepended to the patch tokens
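With the numbers above, the token matrix fed to attention has \((P+1) \times D = 5 \times 3\) entries. A quick shape check (the patch and class-token values here are placeholders, not real embeddings):

```python
import numpy as np

P, D = 4, 3                                    # 4 patches, feature dimension 3
patches = np.ones((P, D))                      # patch embeddings (placeholder values)
cls_token = np.zeros((1, D))                   # extra class token (placeholder values)
tokens = np.concatenate([cls_token, patches])  # (P + 1) rows enter attention
print(tokens.shape)  # (5, 3)
```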

CNN v.s. Transformer

When we train a CNN, the kernel is learnable, and we slide the kernel over the entire image to convolve it. Regions where the convolution response is high are treated as more important.

Similarly, the Transformer uses the attention mechanism to compute the query-key weights.

PS-ViT

  • Vision Transformers with Progressive Sampling
  • Progressively select important patches by shifting patch centers

Transformer for Semantic Segmentation


Architecture

Different patch size
