
Transformer

tags: Deep Learning for Computer Vision

Self-Attention

query, key : compute the relationship (similarity) between two words
softmax : transform the raw similarity scores into attention weights \(a_{i,j}\) between 0 and 1
value : the vector of each word; the output is the weighted sum of the values over all words

Step 1 :

Use \(X = \{x_1, x_2, ..., x_N\}\) to represent the \(N\) input tokens, and obtain the initial representations of the three vectors \(Q, K, V\) through linear transformations \(W\)

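In matrix form, Step 1 is three learned projections (a standard formulation; \(W^Q, W^K, W^V\) denote the learned weight matrices):

\[ Q = XW^Q, \quad K = XW^K, \quad V = XW^V \]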

Step 2 :

Calculate the similarity score between query \(q\) and key \(k\) by taking their inner product

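As a worked equation (the \(\sqrt{d_k}\) scaling follows the original Transformer, where \(d_k\) is the key dimension):

\[ s_{i,j} = \frac{q_i^\top k_j}{\sqrt{d_k}} \]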

Step 3 :

Apply softmax so that the scores for each query lie between 0 and 1 and sum to 1

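Written out, with \(s_{i,j}\) denoting the similarity score from Step 2:

\[ a_{i,j} = \frac{\exp(s_{i,j})}{\sum_{j'} \exp(s_{i,j'})} \]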

Step 4 :

Compute the output as the weighted sum of the values, using the attention weights \(a_{i,j}\)

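In symbols, the output for word \(i\) is:

\[ z_i = \sum_j a_{i,j} \, v_j \]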

Step 5 :

Repeat the attention computation for every word, so that each query attends over all keys

Implementation
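A minimal NumPy sketch of the five steps above (single head; the toy dimensions and the weight matrices `Wq`, `Wk`, `Wv` are illustrative assumptions, not values from the course):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a sequence X of shape (N, d_model)."""
    Q = X @ Wq                                   # Step 1: linear projections
    K = X @ Wk
    V = X @ Wv
    d_k = K.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                   # Step 2: similarity scores s_{i,j}
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)        # Step 3: softmax -> weights a_{i,j}
    return A @ V                                 # Steps 4-5: weighted sum for every word

rng = np.random.default_rng(0)
N, d_model = 4, 8
X = rng.standard_normal((N, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)  # (4, 8): one output vector per input word
```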

Multi-Head Self-Attention

Divide the model dimension into multiple heads, forming multiple subspaces so that the model can attend to different aspects of the information (each head's weights are initialized randomly)
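One common way to realize the split into subspaces (a sketch, assuming `d_model` divides evenly by `num_heads`; shapes only):

```python
import numpy as np

def split_heads(X, num_heads):
    """Reshape (N, d_model) -> (num_heads, N, d_model // num_heads)."""
    N, d_model = X.shape
    assert d_model % num_heads == 0
    return X.reshape(N, num_heads, d_model // num_heads).transpose(1, 0, 2)

def merge_heads(H):
    """Inverse: (num_heads, N, d_head) -> (N, num_heads * d_head)."""
    num_heads, N, d_head = H.shape
    return H.transpose(1, 0, 2).reshape(N, num_heads * d_head)

X = np.arange(4 * 8, dtype=float).reshape(4, 8)  # N=4 tokens, d_model=8
H = split_heads(X, num_heads=2)                  # (2, 4, 4): each head sees a subspace
print(H.shape, merge_heads(H).shape)             # (2, 4, 4) (4, 8)
```

Each head then runs the single-head attention independently, and the per-head outputs are merged back and mixed by one more linear layer.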

Layer normalization
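Layer normalization rescales each token's feature vector to zero mean and unit variance, then applies a learned scale and shift. A minimal NumPy sketch (the learnable `gamma`/`beta` are shown as plain arrays for illustration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature dimension (last axis), independently per token."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(), y.std())  # ~0.0, ~1.0
```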

The Decoder in Transformer

Vision Transformer

Query-Key-Value Attention

  • Assume that the input is partitioned into 4 patches and the feature dimension is 3, that is, P=4 and D=3
  • Note that there are (P+1) rows, since an additional class token is prepended to the patch tokens
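With the numbers above, the token matrix fed to attention has \((P+1) \times D = 5 \times 3\) entries. A quick shape check (the patch and class-token values here are placeholders, not real embeddings):

```python
import numpy as np

P, D = 4, 3                                    # 4 patches, feature dimension 3
patches = np.ones((P, D))                      # patch embeddings (placeholder values)
cls_token = np.zeros((1, D))                   # extra class token (placeholder values)
tokens = np.concatenate([cls_token, patches])  # (P + 1) rows enter attention
print(tokens.shape)  # (5, 3)
```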

CNN v.s. Transformer

When we train a CNN, the kernel is learnable, and we slide the kernel over the entire image to convolve it. Regions where the convolution response is high are treated as more important.

Similarly, the Transformer uses the attention mechanism to compute the query-key weights.

PS-ViT

  • Vision Transformers with Progressive Sampling
  • Progressively select important patches by shifting patch centers

Transformer for Semantic Segmentation


Architecture

Different patch size
