[Transformer_CV] Vision Transformer (ViT) Key Notes
tags: Literature Reading, Transformer, ViT, Vision Transformer
AI / ML learning notes entry page
Paper Highlights
The first work to completely abandon CNNs and apply the Transformer architecture to computer vision; most subsequent ViT-related improvements build on this paper.
Original paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Model Architecture
Figure 1: Model overview. The image is split into fixed-size patches (image patches/tokens); each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed into a standard Transformer encoder.
Whereas each input unit in NLP is a word embedding, this paper proposes patch embeddings: the image is split into patch vectors.
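The patch-embedding step can be sketched in a few lines of NumPy (an illustrative sketch with toy random weights, not the authors' implementation; shapes follow ViT-Base: 224x224 RGB input, 16x16 patches, embedding dimension 768):

```python
import numpy as np

H = W = 224          # image height / width
P = 16               # patch size, so N = (224/16)^2 = 196 patches
C = 3                # RGB channels
D = 768              # embedding dimension (ViT-Base)

image = np.random.rand(H, W, C)

# Split the image into non-overlapping P x P patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C)                    # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)   # (196, 768)

# Linear projection E maps each flattened patch to a D-dim embedding.
E = np.random.randn(P * P * C, D) * 0.02
patch_embeddings = patches @ E                                      # (196, 768)

# Prepend a learnable [class] token and add position embeddings
# (both random here purely for illustration).
cls_token = np.zeros((1, D))
E_pos = np.random.randn(1 + patches.shape[0], D) * 0.02
z0 = np.concatenate([cls_token, patch_embeddings], axis=0) + E_pos

print(z0.shape)  # (197, 768)
```

The sequence `z0` of 196 patch tokens plus one [class] token is what the Transformer encoder consumes.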
ViT Algorithm
Embedding: each flattened patch is linearly projected, a learnable [class] token is prepended, and position embeddings are added to form the input sequence \(z_0\).
Transformer Encoder: \(L\) alternating blocks of multi-head self-attention (MSA) and MLP, each preceded by LayerNorm (LN) and followed by a residual connection.
The final representation is \(LN(z_L^0)\), the layer-normalized [class] token from the last layer, which is fed to the classification head.
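Written out, these steps are equations (1)–(4) of the original paper:

```latex
\begin{aligned}
z_0 &= [x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \cdots;\; x_p^N E] + E_{pos},
  & E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{pos} \in \mathbb{R}^{(N+1) \times D} \\
z'_\ell &= \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1},
  & \ell = 1, \dots, L \\
z_\ell &= \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell,
  & \ell = 1, \dots, L \\
y &= \mathrm{LN}(z_L^0)
\end{aligned}
```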
Inductive Biases in CNNs vs. ViT
What are inductive biases?
From a probabilistic perspective, an inductive bias can be viewed as a prior over the hypotheses a learner prefers.
In machine learning, every model family encodes such assumptions:
KNN (K-Nearest Neighbors): nearby samples tend to share the same label.
SVM (Support Vector Machines): the best decision boundary is the one that maximizes the margin between classes.
CNN (Convolutional Neural Networks): locality and translation equivariance — nearby pixels are related, and a feature is useful wherever it appears.
RNN (Recurrent Neural Networks): sequentiality — the current state depends on previous time steps.
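The CNN bias above can be made concrete with a minimal NumPy sketch (toy code of my own, not from the paper) showing that a circular 1-D convolution is translation-equivariant: shifting the input simply shifts the output.

```python
import numpy as np

def circular_corr(x, k):
    """Naive circular 1-D cross-correlation (a toy stand-in for convolution)."""
    n = len(x)
    return np.array([sum(x[(i + j) % n] * k[j] for j in range(len(k)))
                     for i in range(n)])

rng = np.random.default_rng(0)
x = rng.random(8)                    # a toy 1-D "image"
k = np.array([1.0, -2.0, 1.0])       # a toy filter

y = circular_corr(x, k)
y_of_shifted = circular_corr(np.roll(x, 3), k)

# Translation equivariance: conv(shift(x)) == shift(conv(x))
assert np.allclose(y_of_shifted, np.roll(y, 3))
```

ViT's self-attention layers carry no such built-in constraint, which is why the paper notes that spatial relations must be learned from data.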
The role of inductive biases
Convolutional inductive biases: CNN vs. ViT in the original paper
Discussion of inductive bias in the original paper (p. 4):
Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.
In plain terms
Spatial constraints of the convolution operation
The convolution operation imposes two important spatial constraints — locality and translation equivariance — that facilitate learning visual features.
However, this convolutional inductive bias lacks a global view of the image: convolutions excel at extracting local visual features, but they cannot model the long-range dependencies between those features.
CNNs are good at capturing local features.
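By contrast, a single self-attention layer mixes every token with every other token, so global dependencies are available from the very first layer. A minimal single-head sketch in NumPy (toy sizes of my choosing):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every output row mixes ALL input rows."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (N, N) attention weights
    return A @ V, A

N, d = 6, 8                                       # 6 tokens, 8-dim embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))
Wq, Wk, Wv = [0.1 * rng.normal(size=(d, d)) for _ in range(3)]

out, A = self_attention(X, Wq, Wk, Wv)
# A is dense: token i puts nonzero weight on every token j (global context),
# unlike a convolution, whose receptive field is local.
assert A.shape == (N, N) and np.allclose(A.sum(axis=1), 1.0)
```

The dense (N, N) attention matrix is exactly the global dependency modeling that convolution's local receptive field cannot provide in one layer.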

Is this the end for Convolutional Neural Networks?
ImageNet-trained CNN models tend to classify images by texture rather than shape:
ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
One of the main differences between ViT and ResNet is the large receptive field of ViT's initial layers: self-attention lets even the first layer attend across the whole image.
Key characteristics of ViT (Vision Transformer)
Good scalability and computational efficiency
Parameter counts (Table 1 of the paper):

Model | Layers | Hidden size D | MLP size | Heads | Params
---|---|---|---|---|---
ViT-Base | 12 | 768 | 3072 | 12 | 86M
ViT-Large | 24 | 1024 | 4096 | 16 | 307M
ViT-Huge | 32 | 1280 | 5120 | 16 | 632M

Benchmark performance and computational efficiency
Performance depends on dataset size: ViT trails comparable ResNets when trained on mid-sized datasets such as ImageNet alone, but overtakes them after pre-training on larger datasets (ImageNet-21k, JFT-300M).
Performance versus computational cost across model architectures
ViT References
[论文阅读] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Recommended learning resources
Articles
Videos
The ViT family and recent developments
2101.01169 Transformers in Vision: A Survey
Figure: taxonomy of self-attention spatial designs
2111.06091 A Survey of Visual Transformers
A recent Transformer survey from the Chinese Academy of Sciences and others
Vision Transformer — Embedding Classification
Learning resources
Deep Learning notes
Self-supervised Learning
Object Detection
ViT and Transformer
Autoencoder