# An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale(翻譯)
###### tags:`論文翻譯` `deeplearning`
[TOC]
## 說明
- 此論文翻譯採用Gemini直接翻譯並人工調校
- 專業用語翻譯參考國家教育研究院,並寫入提示詞強制要求Gemini採用
- 利用『重要的話說兩次』的方式,讓Gemini以快捷的方式輸出,實測效果不錯
- 附錄的部份就只挑個人有興趣的部份翻譯
排版的說明:
1. 先原文
2. 後中文
3. 有數學式的部份就說明數學式,並說明參考來源
4. >個人註解
參考資料:
* [paper hyperlink](https://arxiv.org/abs/2010.11929)
## ABSTRACT
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
雖然 Transformer 架構已成為自然語言處理(NLP)任務的業界標準(de-facto standard),但其在電腦視覺領域上的應用仍然有限。在視覺領域中,注意力機制(attention)要麼與卷積網路(convolutional networks)結合使用,要麼被用來替換卷積網路中的某些元件,同時仍保留其整體的結構基礎。我們證明了這種對卷積神經網路(CNNs)的依賴並非必要,直接將純 Transformer 應用於影像區塊(image patches)序列,也能在影像分類任務上表現得非常出色。當在大量數據上進行預訓練,並轉移(transferred)至多個中小型影像識別基準測試(如 ImageNet、CIFAR-100、VTAB 等)時,Vision Transformer (ViT) 與當前最先進的(state-of-the-art)卷積網路相比,不僅取得了優異的結果,且訓練所需的計算資源大幅減少。
## 1 INTRODUCTION
Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become the model of choice in natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019). Thanks to Transformers’ computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the models and datasets growing, there is still no sign of saturating performance.
以自注意力(Self-attention)為基礎的架構,特別是 Transformers (Vaswani et al., 2017),已成為自然語言處理(NLP)領域的首選模型。目前主流的方法是在大型文字語料庫上進行預訓練,接著再針對較小的特定任務資料集進行微調(fine-tune)(Devlin et al., 2019)。得益於 Transformers 的運算效率與可擴展性(scalability),訓練具有前所未有規模的模型已成為可能,其參數數量甚至超過 100B (Brown et al., 2020; Lepikhin et al., 2020)。隨著模型與資料集的增長,效能至今仍未見飽和跡象。
In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016). Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020).
然而,在電腦視覺領域中,卷積架構(Convolutional architectures)仍佔據主導地位 (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016)。受到自然語言處理(NLP)成功的啟發,許多研究嘗試將類 CNN 架構與自注意力(Self-attention)相結合 (Wang et al., 2018; Carion et al., 2020),有些研究甚至完全取代了卷積運算 (Ramachandran et al., 2019; Wang et al., 2020a)。後者這些模型雖然在理論上很有效率,但由於使用了特殊的注意力模式,尚未能在現代硬體加速器上實現有效的規模化(Scaled)。因此,在大規模影像辨識中,經典的類 ResNet(ResNet-like)架構目前仍處於領先地位(State of the art)(Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020)。
Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.
受到 Transformer 在自然語言處理(NLP)領域規模化成功的啟發,我們嘗試在盡可能減少修改的情況下,將標準的 Transformer 直接應用於影像。為了達成這個目標,我們將一張影像分割成多個區塊(patches),並將這些區塊的線性嵌入(linear embeddings)序列作為 Transformer 的輸入。在處理上,影像區塊被視為等同於 NLP 應用中的標記(tokens,即單詞)。我們以監督式學習的方式,在影像分類任務上對此模型進行訓練。
When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
當在缺乏強效正規化(regularization),且規模僅如 ImageNet 的中型資料集上進行訓練時,這些模型的準確度表現平平,比規模相當的 ResNets 低了幾個百分點。這種看似令人沮喪的結果其實是在預料之中的:Transformer 缺乏一些卷積神經網路(CNNs)固有的歸納偏置(inductive biases),例如平移等變性(translation equivariance)與局部性(locality),因此在資料量不足的情況下進行訓練時,泛化(generalize)能力並不理想。
However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the art on multiple image recognition benchmarks. In particular, the best model reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.
然而,若模型是在更大的資料集(1400 萬至 3 億張影像)上進行訓練,情況則有所改變。我們發現,大規模訓練的效果優於歸納偏置(inductive bias)。當我們的 Vision Transformer (ViT) 在足夠規模下進行預訓練,並遷移(transferred)至資料點較少的任務時,能取得極佳的結果。當使用公開的 ImageNet-21k 資料集或 in-house JFT-300M 資料集進行預訓練時,ViT 在多項影像辨識基準測試中,達到或超越了目前的尖端技術(state of the art)。具體而言,表現最佳的模型在 ImageNet 上達到 88.55%、在 ImageNet-ReaL 上達到 90.72%、在 CIFAR-100 上達到 94.55%,以及在包含 19 個任務的 VTAB 評測套件中達到 77.63% 的準確率。
## 2 RELATED WORK
Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since become the state of the art method in many NLP tasks. Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019) uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task (Radford et al., 2018; 2019; Brown et al., 2020).
Vaswani et al. (2017) 提出了 Transformer 架構並應用於機器翻譯,自此之後,該架構已成為許多自然語言處理(NLP)任務中的頂尖(state of the art)方法。大型的 Transformer 基礎模型通常先在大型語料庫上進行預訓練(pre-trained),接著再針對特定任務進行微調(fine-tuned):例如 BERT (Devlin et al., 2019) 採用去噪自監督預訓練任務,而 GPT 系列的研究則使用語言模型(language modeling)作為其預訓練任務 (Radford et al., 2018; 2019; Brown et al., 2020)。
Naive application of self-attention to images would require that each pixel attends to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes. Thus, to apply Transformers in the context of image processing, several approximations have been tried in the past. Parmar et al. (2018) applied the self-attention only in local neighborhoods for each query pixel instead of globally. Such local multi-head dot-product self-attention blocks can completely replace convolutions (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020). In a different line of work, Sparse Transformers (Child et al., 2019) employ scalable approximations to global self-attention in order to be applicable to images. An alternative way to scale attention is to apply it in blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only along individual axes (Ho et al., 2019; Wang et al., 2020a). Many of these specialized attention architectures demonstrate promising results on computer vision tasks, but require complex engineering to be implemented efficiently on hardware accelerators.
若以樸素(naive)的方式直接將自注意力機制應用於影像,會需要每個像素都對其它所有像素進行注意。由於計算成本隨像素數量呈 **二次方(quadratic cost)** 增長,這使得模型無法擴展至現實中的輸入尺寸。因此,為了在影像處理的上下文中應用 Transformer,過去已經嘗試過幾種近似方法。Parmar et al. (2018) 針對每個查詢像素(query pixel)僅在局部鄰域內應用自注意力,而非進行全域運算。這種局部多頭點積自注意力區塊可以完全取代卷積運算 (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020)。在另一項研究路線中,Sparse Transformers (Child et al., 2019) 採用了可擴展的全域自注意力近似法,以便應用於影像。另一種擴展注意力的方法是將其應用於大小不一的區塊 (Weissenborn et al., 2019),在極端情況下甚至僅沿著單一軸向進行 (Ho et al., 2019; Wang et al., 2020a)。許多此類專門的注意力架構在電腦視覺任務上展現了極具潛力的結果,但若要在硬體加速器(hardware accelerators)上高效實作,則需要複雜的工程技術。
Most related to ours is the model of Cordonnier et al. (2020), which extracts patches of size 2 × 2 from the input image and applies full self-attention on top. This model is very similar to ViT, but our work goes further to demonstrate that large scale pre-training makes vanilla transformers competitive with (or even better than) state-of-the-art CNNs. Moreover, Cordonnier et al. (2020) use a small patch size of 2 × 2 pixels, which makes the model applicable only to small-resolution images, while we handle medium-resolution images as well.
與我們研究最為相關的是 Cordonnier et al. (2020) 的模型,該模型從輸入影像中提取大小為 $2 \times 2$ 的影像區塊(patches),並在上方應用完整的自注意力機制。這個模型與 ViT 非常相似,但我們的研究進一步證明了大規模的預訓練能使原始(vanilla)的 Transformer 具備與當前最先進的 CNN 抗衡(甚至超越)的競爭力。此外,Cordonnier et al. (2020) 使用 $2 \times 2$ 像素的微小影像區塊尺寸,這使得該模型僅適用於低解析度影像,而我們的方法是能夠處理中等解析度的影像。
There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms of self-attention, e.g. by augmenting feature maps for image classification (Bello et al., 2019) or by further processing the output of a CNN using self-attention, e.g. for object detection (Hu et al., 2018; Carion et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classification (Wu et al., 2020), unsupervised object discovery (Locatello et al., 2020), or unified text-vision tasks (Chen et al., 2020c; Lu et al., 2019; Li et al., 2019).
學界對於將卷積神經網絡(CNNs)與各種形式的自注意力相結合也有著濃厚的興趣,例如:透過增強用於影像分類的特徵圖(Bello et al., 2019),或是利用自注意力進一步處理 CNN 的輸出結果,其應用涵蓋了物件偵測(Hu et al., 2018; Carion et al., 2020)、影片處理(Wang et al., 2018; Sun et al., 2019)、影像分類(Wu et al., 2020)、非監督式物件發現(Locatello et al., 2020),以及統一文本視覺任務(Chen et al., 2020c; Lu et al., 2019; Li et al., 2019)。
Another recent related model is image GPT (iGPT) (Chen et al., 2020a), which applies Transformers to image pixels after reducing image resolution and color space. The model is trained in an unsupervised fashion as a generative model, and the resulting representation can then be fine-tuned or probed linearly for classification performance, achieving a maximal accuracy of 72% on ImageNet. Our work adds to the increasing collection of papers that explore image recognition at larger scales than the standard ImageNet dataset. The use of additional data sources allows to achieve state-of-the-art results on standard benchmarks (Mahajan et al., 2018; Touvron et al., 2019; Xie et al., 2020).
另一個近期相關的模型是 image GPT (iGPT) (Chen et al., 2020a),該模型在降低影像解析度與色彩空間後,將 Transformer 應用於影像像素上。該模型以非監督式學習的方式作為生成模型進行訓練,其產生的表示(representation)隨後可以進行微調,或透過線性探測(linearly probed)來評估分類效能,在 ImageNet 上達到了 72% 的最高準確率。我們的研究進一步擴充了日益增多的論文集,這些論文旨在探索比標準 ImageNet 資料集規模更大的影像辨識研究。使用額外的資料來源,使我們能夠在標準基準測試(benchmarks)中取得當前最先進(state-of-the-art)的結果 (Mahajan et al., 2018; Touvron et al., 2019; Xie et al., 2020)。
Moreover, Sun et al. (2017) study how CNN performance scales with dataset size, and Kolesnikov et al. (2020); Djolonga et al. (2020) perform an empirical exploration of CNN transfer learning from large scale datasets such as ImageNet-21k and JFT-300M. We focus on these two latter datasets as well, but train Transformers instead of ResNet-based models used in prior works.
此外,Sun et al. (2017) 研究了 CNN 的效能如何隨資料集規模而縮放;而 Kolesnikov et al. (2020) 與 Djolonga et al. (2020) 則針對來自大規模資料集(如 ImageNet-21k 與 JFT-300M)的 CNN 遷移學習進行了實證探索。我們同樣專注於後者這兩個資料集,但我們訓練的是 Transformer,而非先前研究中所使用的基於 ResNet 的模型。
## 3 METHOD
In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible. An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and their efficient implementations – can be used almost out of the box.
在模型設計方面,我們盡可能地遵循原始的 Transformer (Vaswani et al., 2017) 架構。這種刻意簡化的設定有一個優點,即那些具擴展性的自然語言處理(NLP)Transformer 架構及其高效的實作版本,幾乎可以達到「開箱即用」的效果。
### 3.1 VISION TRANSFORMER (VIT)
An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image $\mathbf{x}\in\mathbb{R}^{H\times W\times C}$ into a sequence of flattened 2D patches $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2\cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses constant latent vector size $D$ through all of its layers, so we flatten the patches and map to $D$ dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.
模型概觀如 Figure 1 所示。標準的 Transformer 接收一維的 token embeddings 序列作為輸入。為了處理二維影像,我們將影像 $\bf{x}\in\mathbb{R}^{H\times W\times C}$ 重新整理為攤平後的二維影像區塊(patches)序列 $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2\cdot C)}$,其中 $(H, W)$ 為原始影像的解析度,$C$ 為通道數,$(P, P)$ 為每個影像區塊的解析度,而 $N = HW/P^2$ 則是產生的影像區塊數量,這也作為 Transformer 的有效輸入序列長度。Transformer 在其所有網路層中使用固定的隱藏向量維度 $D$,因此我們將這些影像區塊攤平,並透過一個可訓練的線性投影(Eq. 1)將其映射至 $D$ 維空間。我們將此投影的輸出稱為 patch embeddings。
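:::info
以論文常用的設定做個簡單的換算(假設輸入為 $224 \times 224 \times 3$、$P=16$):
$$
N = \frac{HW}{P^2} = \frac{224 \times 224}{16 \times 16} = 196, \qquad P^2 \cdot C = 16 \times 16 \times 3 = 768
$$
也就是一張影像會變成 196 個 token,每個 token 攤平後是 768 維的向量,再經由可訓練的投影 $\mathbf{E}$ 映射到 $D$ 維。
:::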

Figure 1: Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence. The illustration of the Transformer encoder was inspired by Vaswani et al. (2017).
Figure 1:模型概觀。我們將一張圖片分割成固定大小的影像區塊,對每一個影像區塊進行線性嵌入(linearly embed),加入位置嵌入(position embeddings),並將產生的向量序列輸入至標準的 Transformer 編碼器(encoder)。為了執行分類任務,我們採用標準做法,在序列中加入一個額外的可學習「分類標記」(classification token)。此 Transformer 編碼器的插圖靈感來自於 Vaswani et al. (2017)。
Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded patches ($\mathbf{z}^0_0=\mathbf{x}_{\text{class}}$), whose state at the output of the Transformer encoder ($\mathbf{z}_L^0$) serves as the
image representation $y$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $\mathbf{z}_L^0$. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
與 BERT 的 [class] token 類似,我們在嵌入影像區塊(embedded patches)的序列最前方,加上一個可學習的 embedding ($\mathbf{z}^0_0=\mathbf{x}_{\text{class}}$);該 embedding 在 Transformer 編碼器輸出端狀態 ($\mathbf{z}_L^0$) 即作為影像表示 $y$ (Eq. 4)。無論在預訓練(pre-training)或微調(fine-tuning)階段,分類標頭(classification head)皆會連接至 $\mathbf{z}_L^0$。在預訓練時,分類標頭是由具備一個隱藏層的 MLP 實現;而在微調時,則由單一線性層(linear layer)實現。
Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors serves as input to the encoder.
將位置嵌入(Position embeddings)加入 patch embeddings 中,以保留位置資訊。我們使用標準的可學習一維(1D)位置嵌入,因為我們並未觀察到使用更進階的二維感知(2D-aware)位置嵌入能帶來顯著的性能提升(Appendix D.4)。最後產出的嵌入向量序列將作為編碼器(encoder)的輸入。
The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019). The MLP contains two layers with a GELU non-linearity.
Transformer 編碼器 (Vaswani et al., 2017) 是由多頭自注意力 (MSA, 參見附錄 A) 層與 MLP 區塊 (式 2, 3) 交替組成。在每個區塊之前都會應用層歸一化 (LN),並在每個區塊之後加入殘差連接 (Wang et al., 2019; Baevski & Auli, 2019)。該 MLP 包含兩層,並使用 GELU 非線性活化函數。
$$
\begin{aligned}
\mathbf{z}_0 &= [\mathbf{x}_{\text{class}}; \mathbf{x}_p^1\mathbf{E}; \mathbf{x}_p^2\mathbf{E}; \cdots; \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos}, & & \mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D}, \mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D} \\
\mathbf{z}_{\ell}^{\prime} &= \text{MSA}(\text{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, & & \ell=1 \ldots L \\
\mathbf{z}_{\ell} &= \text{MLP}(\text{LN}(\mathbf{z}_{\ell}^{\prime})) + \mathbf{z}_{\ell}^{\prime}, & & \ell=1 \ldots L \\
\mathbf{y} &= \text{LN}(\mathbf{z}_L^0)
\end{aligned}
$$
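:::info
下面是我依照式 (2)(3) 的 Pre-LN 結構寫的極簡示意(非論文官方實作;這裡直接借用 PyTorch 內建的 `nn.MultiheadAttention`,`num_heads=12`、`mlp_ratio=4` 只是以 ViT-Base 為例的假設值,細節未必與原實作完全相同):
```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """對應式 (2)(3):先 LayerNorm 再 MSA / MLP,並各自加上殘差連接"""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        # z' = MSA(LN(z_{l-1})) + z_{l-1}
        h = self.ln1(z)
        z = self.attn(h, h, h, need_weights=False)[0] + z
        # z_l = MLP(LN(z')) + z'
        z = self.mlp(self.ln2(z)) + z
        return z
```
:::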
**Inductive bias.** We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.
**Inductive bias.** 我們注意到,相較於 CNN,Vision Transformer 具有更少影像特有的歸納偏置(inductive bias)。在 CNN 中,局部性(locality)、二維鄰域結構(two-dimensional neighborhood structure)以及平移等變性(translation equivariance)被內建於整個模型的每一層之中。而在 ViT 中,只有 MLP 層具有局部性與平移等變性,自注意力(self-attention)層則是全局性的。二維鄰域結構的使用非常少:僅在模型開始處將影像切分為影像區塊(patches),以及在微調(fine-tuning)階段為了調整不同解析度影像的位置編碼(position embeddings)時才會用到(如下文所述)。除此之外,初始化時的位置編碼並不包含影像區塊(patches)二維位置的資訊,影像區塊(patches)之間所有的空間關係都必須從頭開始學習。
**Hybrid Architecture.** As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection $\mathbf{E}$ (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above.
**Hybrid Architecture.** 作為原始影像區塊(raw image patches)的替代方案,輸入序列也可以由 CNN 的特徵圖(feature maps)構成 (LeCun et al., 1989)。在這個混合模型中,patch embedding 投影矩陣 $\mathbf{E}$ (Eq. 1) 被應用於從 CNN 特徵圖中提取出的影像區塊。在一種特殊情況下,這些影像區塊的空間維度可以是 $1 \times 1$,這意味著輸入序列是透過簡單地將特徵圖的空間維度拉平(flattening),並投影到 Transformer 維度來獲得的。最後,再按照前述方式加入分類輸入嵌入(classification input embedding)與位置嵌入(position embeddings)。
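:::info
混合架構的輸入端大致可以理解成下面這樣(僅為示意;`feature_map` 的形狀、骨幹網路輸出 (B, 2048, 7, 7) 等都是假設,非論文原始碼):
```python
import torch
import torch.nn as nn

def hybrid_tokens(feature_map: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
    """把 CNN 特徵圖的每個空間位置(1x1 patch)當成一個 token,再投影到 Transformer 的維度 D"""
    B, C, h, w = feature_map.shape
    tokens = feature_map.flatten(2).transpose(1, 2)  # (B, h*w, C)
    return proj(tokens)                              # (B, h*w, D)

# 使用示意:假設骨幹網路輸出 (B, 2048, 7, 7) 的特徵圖
feat = torch.randn(2, 2048, 7, 7)
proj = nn.Linear(2048, 768)
print(hybrid_tokens(feat, proj).shape)  # torch.Size([2, 49, 768])
```
:::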
### 3.2 FINE-TUNING AND HIGHER RESOLUTION
Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For this, we remove the pre-trained prediction head and attach a zero-initialized $D \times K$ feedforward layer, where $K$ is the number of downstream classes. It is often beneficial to fine-tune at higher resolution than pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding images of higher resolution, we keep the patch size the same, which results in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), however, the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image. Note that this resolution adjustment and patch extraction are the only points at which an inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.
通常情況下,我們會在大型資料集上對 ViT 進行預訓練,並針對(較小的)下游任務進行微調。為此,我們會移除預訓練好的預測頭(prediction head),並附加一個以零初始化的 $D \times K$ 前饋層,其中 $K$ 代表下游任務的類別數量。在微調時,採用比預訓練時更高的解析度通常是有益的 (Touvron et al., 2019; Kolesnikov et al., 2020)。當輸入更高解析度的圖像時,我們保持影像區塊(patch)的大小不變,這會導致更長的有效序列長度。Vision Transformer 可以處理任意長度的序列(受限於記憶體限制),然而,預訓練好的位置嵌入(position embeddings)可能不再具有意義。因此,我們根據預訓練位置嵌入在原始圖像中的位置,對其進行 2D 內插(2D interpolation)。請注意,這種解析度調整與影像區塊提取,是手動將影像 2D 結構的歸納偏置(inductive bias)注入 Vision Transformer 的唯一之處。
:::info
1. 先把原本 196 個位置編碼(扣掉 [CLS] Token),重新折疊回 14x14 的 2D 矩陣形狀。
2. 使用影像處理常見的「雙三次內插(Bicubic Interpolation)」,把這個 14x14 的座標網格,平滑地放大成 24x24 的網格。
3. 再把這個 24x24 的網格展平成 576 的 1D 序列,最後把 [CLS] 加回去。
這樣一來,新產生的位置座標依然保留了原本預訓練學到的「左上角、右下角、相鄰關係」等空間幾何意義。
:::
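:::info
依照上面的步驟,用 PyTorch 大概可以這樣寫(僅為示意,假設不含 [CLS] 的位置嵌入原本是 14×14=196 個、要放大成 24×24=576 個;`resize_pos_embed` 是我自己取的名字):
```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid=14, new_grid=24) -> torch.Tensor:
    """pos_embed: (1, 1 + old_grid*old_grid, D),index 0 為 [CLS] 的位置嵌入"""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pe.shape[-1]
    # 折回 2D 網格並換成 (1, D, H, W),以便使用影像內插
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    # 攤平回 1D 序列,再把 [CLS] 的位置嵌入接回去
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pe, patch_pe], dim=1)  # (1, 1 + 576, D)

print(resize_pos_embed(torch.randn(1, 197, 768)).shape)  # torch.Size([1, 577, 768])
```
:::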
## 4 EXPERIMENTS
We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and the hybrid. To understand the data requirements of each model, we pre-train on datasets of varying size and evaluate many benchmark tasks. When considering the computational cost of pre-training the model, ViT performs very favourably, attaining state of the art on most recognition benchmarks at a lower pre-training cost. Lastly, we perform a small experiment using self-supervision, and show that self-supervised ViT holds promise for the future.
我們評估了 ResNet、Vision Transformer (ViT) 以及混合模型(hybrid)的表示學習能力。為了瞭解每個模型對資料的需求,我們在不同規模的資料集上進行預訓練,並評估多項基準測試任務。當考量預訓練模型的計算成本時,ViT 的表現非常出色,能以較低的預訓練成本,在大多數的辨識基準測試中達到當前最先進(state of the art)的水準。最後,我們進行了一項使用自監督學習(self-supervision)的小型實驗,並展示了自監督式的 ViT 在未來具有發展潛力。
### 4.1 SETUP
**Datasets.** To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with 21k classes and 14M images (Deng et al., 2009), and JFT (Sun et al., 2017) with 18k classes and 303M high-resolution images. We de-duplicate the pre-training datasets w.r.t. the test sets of the downstream tasks following Kolesnikov et al. (2020). We transfer the models trained on these dataset to several benchmark tasks: ImageNet on the original validation labels and the cleaned-up ReaL labels (Beyer et al., 2020), CIFAR-10/100 (Krizhevsky, 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). For these datasets, pre-processing follows Kolesnikov et al. (2020).
**Datasets.** 為了探討模型的擴展性(scalability),我們使用了包含 1,000 個類別、130 萬張圖片的 ILSVRC-2012 ImageNet 資料集(以下簡稱為 ImageNet);其[母集](https://terms.naer.edu.tw/detail/9464147328c9134146388e18899477de/)(superset)ImageNet-21k 則包含 21,000 個類別與 1,400 萬張圖片(Deng et al., 2009);以及擁有 18,000 個類別、3.03 億張高解析度圖片的 JFT(Sun et al., 2017)。我們參考 Kolesnikov et al. (2020) 的方法,針對預訓練資料集與下游任務(downstream tasks)的測試集進行了去重(de-duplicate)處理。我們將在這些資料集上訓練的模型遷移至多個基準測試任務:包含原始驗證標籤與清理後的 ReaL 標籤(Beyer et al., 2020)的 ImageNet、CIFAR-10/100(Krizhevsky, 2009)、Oxford-IIIT Pets(Parkhi et al., 2012)以及 Oxford Flowers-102(Nilsback & Zisserman, 2008)。對於這些資料集,預處理流程均遵循 Kolesnikov et al. (2020) 的做法。
We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates low-data transfer to diverse tasks, using 1 000 training examples per task. The tasks are divided into three groups: Natural – tasks like the above, Pets, CIFAR, etc. Specialized – medical and satellite imagery, and Structured – tasks that require geometric understanding like localization.
我們也在包含 19 個任務的 VTAB 分類套件上進行評估 (Zhai et al., 2019b)。VTAB 透過在每個任務僅使用 1,000 個訓練樣本,來評估模型在少量資料轉移至多樣化任務時的表現。這些任務被分為三個族群:自然 (Natural) —— 包含上述任務、Pets、CIFAR 等;專業 (Specialized) —— 包含醫學與衛星影像;以及結構化 (Structured) —— 需要幾何理解(如定位)的任務。
**Model Variants.** We base ViT configurations on those used for BERT (Devlin et al., 2019), as summarized in Table 1. The “Base” and “Large” models are directly adopted from BERT and we add the larger “Huge” model. In what follows we use brief notation to indicate the model size and the input patch size: for instance, ViT-L/16 means the “Large” variant with 16×16 input patch size. Note that the Transformer’s sequence length is inversely proportional to the square of the patch size, thus models with smaller patch size are computationally more expensive.
**Model Variants.** 我們參考了 BERT (Devlin et al., 2019) 所使用的配置來設定 ViT 的參數,如 Table 1 所示。其中的「Base」和「Large」模型是直接採用自 BERT,而我們額外增加了更大的「Huge」模型。在下文中,我們使用簡短的標記方式來表示模型大小和輸入的影像區塊(patch)尺寸:例如,ViT-L/16 表示 16×16 輸入影像區塊尺寸的「Large」變體。請注意,Transformer 的序列長度與影像區塊尺寸的平方成反比,因此影像區塊尺寸越小的模型,其計算成本就越高。
For the baseline CNNs, we use ResNet (He et al., 2016), but replace the Batch Normalization layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018), and used standardized convolutions (Qiao et al., 2019). These modifications improve transfer (Kolesnikov et al., 2020), and we denote the modified model “ResNet (BiT)”. For the hybrids, we feed the intermediate feature maps into ViT with patch size of one “pixel”. To experiment with different sequence lengths, we either (i) take the output of stage 4 of a regular ResNet50 or (ii) remove stage 4, place the same number of layers in stage 3 (keeping the total number of layers), and take the output of this extended stage 3. Option (ii) results in a 4x longer sequence length, and a more expensive ViT model.
對於基準線的卷積神經網路(CNNs),我們使用 ResNet (He et al., 2016),但將 Batch Normalization (Ioffe & Szegedy, 2015) 替換為 Group Normalization (Wu & He, 2018),並使用了標準化卷積(standardized convolutions) (Qiao et al., 2019)。這些修改提升了遷移效果 (Kolesnikov et al., 2020),我們將此修改後的模型稱為「ResNet (BiT)」。至於混合模型(hybrids),我們將中間特徵圖(intermediate feature maps)輸入至 ViT,其影像區塊尺寸設為一個「像素(pixel)」。為了實驗不同的序列長度,我們採取以下兩種做法:(i) 取常規 ResNet50 第 4 階段(stage 4)的輸出,或者 (ii) 移除第 4 階段,並在第 3 階段放置相同數量的層數(保持總層數不變),接著取此擴展後第 3 階段的輸出。選項 (ii) 會導致序列長度增加 4 倍,且會產生計算成本更高的 ViT 模型。

Table 1: Details of Vision Transformer model variants.
**Training & Fine-tuning.** We train all models, including ResNets, using Adam (Kingma & Ba,
2015) with $\beta_1=0.9, \beta_2=0.999$, a batch size of 4096 and apply a high weight decay of 0.1, which we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common practices, Adam works slightly better than SGD for ResNets in our setting). We use a linear learning rate warmup and decay, see Appendix B.1 for details. For fine-tuning we use SGD with momentum, batch size 512, for all models, see Appendix B.1.1. For ImageNet results in Table 2, we fine-tuned at higher resolution: 512 for ViT-L/16 and 518 for ViT-H/14, and also used Polyak & Juditsky (1992) averaging with a factor of 0.9999 (Ramachandran et al., 2019; Wang et al., 2020b).
**Training & Fine-tuning.** 我們訓練所有模型(包含 ResNets)時,皆使用 Adam (Kingma & Ba, 2015) 優化器,參數設定為 $\beta_1 = 0.9$、$\beta_2 = 0.999$、批次大小(batch size)為 4096,並施加 0.1 的高權重衰減(weight decay),我們發現這對所有模型的遷移(transfer)都很有幫助(Appendix D.1 顯示,與一般做法不同,在我們的設定中,Adam 在 ResNets 上的表現略優於 SGD)。我們使用線性學習率來暖機(warmup)與衰減,詳情請參閱 Appendix B.1。對於微調(fine-tuning),我們對所有模型皆使用帶有動量(momentum)的 SGD,批次大小為 512,請參閱 Appendix B.1.1。對於 Table 2 中的 ImageNet 結果,我們以更高的解析度進行微調:ViT-L/16 為 512,ViT-H/14 為 518,並使用了係數為 0.9999 的 Polyak & Juditsky (1992) 平均法 (Ramachandran et al., 2019; Wang et al., 2020b)。
**Metrics.** We report results on downstream datasets either through few-shot or fine-tuning accuracy. Fine-tuning accuracies capture the performance of each model after fine-tuning it on the respective dataset. Few-shot accuracies are obtained by solving a regularized least-squares regression problem that maps the (frozen) representation of a subset of training images to $\{-1, 1\}^K$ target vectors. This formulation allows us to recover the exact solution in closed form. Though we mainly focus on fine-tuning performance, we sometimes use linear few-shot accuracies for fast on-the-fly evaluation where fine-tuning would be too costly.
**Metrics.** 我們透過少樣本(few-shot)或微調準確度來報告下游資料集上的結果。微調準確度呈現了各個模型在對應資料集上進行微調後的效能。少樣本準確度則是透過解決一個正規化最小平方法(regularized least-squares)回歸問題獲得,該問題將訓練影像子集的(凍結)表示映射至 $\{-1, 1\}^K$ 目標向量。此公式化方法讓我們能以閉合解(closed form)求得精確解。雖然我們主要關注於微調效能,但有時為了進行快速的即時評估(當微調成本過高時),我們會使用線性少樣本準確率。
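:::info
這裡的「正規化最小平方法 + 閉合解」大致可以這樣示意(僅為概念示意,$\lambda$、特徵維度等都是假設值,非論文原始碼):
```python
import torch

def few_shot_linear(feats: torch.Tensor, labels: torch.Tensor, num_classes: int, lam: float = 1.0):
    """feats: (n, d) 凍結的影像表示;labels: (n,) 整數類別。
    以閉合解求 W,使 feats @ W 逼近 {-1, 1}^K 的目標向量。"""
    n, d = feats.shape
    targets = -torch.ones(n, num_classes)
    targets[torch.arange(n), labels] = 1.0           # {-1, 1}^K 的 one-vs-rest 目標
    A = feats.T @ feats + lam * torch.eye(d)         # (d, d)
    return torch.linalg.solve(A, feats.T @ targets)  # (X^T X + λI)^{-1} X^T T

# 使用示意:以求得的 W 對新影像的(凍結)表示做線性分類
W = few_shot_linear(torch.randn(50, 768), torch.randint(0, 5, (50,)), num_classes=5)
pred = (torch.randn(8, 768) @ W).argmax(dim=1)
```
:::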
### 4.2 COMPARISON TO STATE OF THE ART
We first compare our largest models – ViT-H/14 and ViT-L/16 – to state-of-the-art CNNs from the literature. The first comparison point is Big Transfer (BiT) (Kolesnikov et al., 2020), which performs supervised transfer learning with large ResNets. The second is Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here. All models were trained on TPUv3 hardware, and we report the number of TPUv3-core-days taken to pre-train each of them, that is, the number of TPU v3 cores (2 per chip) used for training multiplied by the training time in days.
我們首先將最大的模型——ViT-H/14 與 ViT-L/16——與文獻中目前最先進(SOTA)的 CNN 進行比較。第一個比較基準是 Big Transfer (BiT) (Kolesnikov et al., 2020),該模型使用大型 ResNets 執行監督式遷移學習。第二個是 Noisy Student (Xie et al., 2020),這是一個大型的 EfficientNet,其在移除標籤後的 ImageNet 與 JFT300M 資料上,使用半監督式學習進行訓練。目前,Noisy Student 是 ImageNet 上最先進的技術,而 BiT-L 則是在此報告的其它資料集上領先。所有模型均在 TPUv3 硬體上進行訓練,我們記錄了預訓練每個模型所需的 TPUv3-core-days,即訓練所使用的 TPU v3 核心數量(每顆晶片有 2 個核心)乘以訓練天數。
Table 2 shows the results. The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L (which is pre-trained on the same dataset) on all tasks, while requiring substantially less computational resources to train. The larger model, ViT-H/14, further improves the performance, especially on the more challenging datasets – ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this model still took substantially less compute to pre-train than prior state of the art. However, we note that pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, weight decay, etc. We provide a controlled study of performance vs. compute for different architectures in Section 4.4. Finally, the ViT-L/16 model pre-trained on the public ImageNet-21k dataset performs well on most datasets too, while taking fewer resources to pre-train: it could be trained using a standard cloud TPUv3 with 8 cores in approximately 30 days.
Table 2 顯示了結果。在 JFT-300M 上預訓練的較小模型 ViT-L/16,在所有任務上的表現均優於 BiT-L(同樣在該資料集上預訓練),同時訓練所需的運算資源大幅減少。較大的模型 ViT-H/14 進一步提升了效能,特別是在更具挑戰性的資料集上——如 ImageNet、CIFAR-100 以及 VTAB 測試組合。有趣的是,與先前的 SOTA 相比,該模型預訓練所需的運算量仍然大幅減少。然而,我們注意到預訓練效率可能不僅受到架構選擇的影響,還受到其它參數的影響,例如訓練排程(training schedule)、優化器(optimizer)、權重衰減(weight decay)等。我們在 Section 4.4 中針對不同架構的效能與運算量提供了對照研究。最後,在公開的 ImageNet-21k 資料集上預訓練的 ViT-L/16 模型,在大多數資料集上也表現良好,且預訓練耗費的資源較少:使用標準的雲端 TPUv3(含 8 個核心)在大約 30 天內即可完成訓練。

Table 2: Comparison with state of the art on popular image classification benchmarks. We report mean and standard deviation of the accuracies, averaged over three fine-tuning runs. Vision Transformer models pre-trained on the JFT-300M dataset outperform ResNet-based baselines on all datasets, while taking substantially less computational resources to pre-train. ViT pre-trained on the smaller public ImageNet-21k dataset performs well too. ∗Slightly improved 88.5% result reported in Touvron et al. (2020).
Table 2: 與熱門影像分類基準測試中當前最先進(state of the art)結果的比較。我們報告了三次微調執行後的準確度平均值與標準差。在 JFT-300M 資料集上進行預訓練的 Vision Transformer 模型,在所有資料集上的表現均優於以 ResNet 為基礎的基準模型,且預訓練所需的運算資源大幅減少。在較小的公開 ImageNet-21k 資料集上進行預訓練的 ViT 表現同樣出色。 ∗Touvron et al. (2020) 報告了稍微改善後的 88.5% 結果。

Figure 2: Breakdown of VTAB performance in Natural, Specialized, and Structured task groups.
Figure 2 decomposes the VTAB tasks into their respective groups, and compares to previous SOTA methods on this benchmark: BiT, VIVI – a ResNet co-trained on ImageNet and Youtube (Tschannen et al., 2020), and S4L – supervised plus semi-supervised learning on ImageNet (Zhai et al., 2019a). ViT-H/14 outperforms BiT-R152x4, and other methods, on the Natural and Structured tasks. On the Specialized the performance of the top two models is similar.
Figure 2 將 VTAB 任務分解為各自的組別,並與該基準測試中先前的 SOTA 方法進行比較:包含 BiT、VIVI(在 ImageNet 和 Youtube 上協同訓練的 ResNet,Tschannen et al., 2020),以及 S4L(在 ImageNet 上進行監督加半監督式學習,Zhai et al., 2019a)。ViT-H/14 在「自然」(Natural)和「結構化」(Structured)任務上的表現優於 BiT-R152x4 及其他方法。在「專業」(Specialized)任務上,表現最好的前兩個模型效能相近。
### 4.3 PRE-TRAINING DATA REQUIREMENTS
The Vision Transformer performs well when pre-trained on a large JFT-300M dataset. With fewer inductive biases for vision than ResNets, how crucial is the dataset size? We perform two series of experiments.
當在大型的 JFT-300M 資料集上進行預訓練時,Vision Transformer 的表現非常優異。由於其針對視覺所設計的歸納偏置(inductive biases)比 ResNets 更少,資料集的大小究竟有多關鍵?我們進行了兩系列的實驗。
First, we pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT300M. To boost the performance on the smaller datasets, we optimize three basic regularization parameters – weight decay, dropout, and label smoothing. Figure 3 shows the results after finetuning to ImageNet (results on other datasets are shown in Table 5). When pre-trained on the smallest dataset, ImageNet, ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. With ImageNet-21k pre-training, their performances are similar. Only with JFT-300M, do we see the full benefit of larger models. Figure 3 also shows the performance region spanned by BiT models of different sizes. The BiT CNNs outperform ViT on ImageNet, but with the larger datasets, ViT overtakes.
首先,我們在規模逐漸增加的資料集上預訓練 ViT 模型:ImageNet、ImageNet-21k 以及 JFT-300M。為了提升在較小資料集上的性能,我們優化了三個基本的正規化(regularization)參數:權重衰減(weight decay)、丟棄法(dropout)以及標籤平滑(label smoothing)。Figure 3 顯示了在 ImageNet 上進行微調後的結果(在其它資料集上的結果則如 Table 5 所示)。當在最小的資料集 ImageNet 上進行預訓練時,儘管進行了(適度的)正規化,ViT-Large 模型的表現仍不如 ViT-Base 模型。而在 ImageNet-21k 的預訓練下,兩者的表現相近。只有在 JFT-300M 之下,我們才能看到大型模型帶來的完整效益。Figure 3 同時也顯示了不同規模的 BiT 模型所涵蓋的性能區間。BiT CNNs 在 ImageNet 上的表現優於 ViT,但在較大的資料集下,ViT 則實現了超越。

Figure 3: Transfer to ImageNet. While large ViT models perform worse than BiT ResNets (shaded area) when pre-trained on small datasets, they shine when pre-trained on larger datasets. Similarly, larger ViT variants overtake smaller ones as the dataset grows.
Figure 3: Transfer to ImageNet。雖然當在小型資料集上進行預訓練時,大型 ViT 模型的表現不如 BiT ResNets(陰影區域),但當在較大的資料集上進行預訓練時,它們的表現便脫穎而出。同樣地,隨著資料集規模的成長,較大型的 ViT 變體會超越較小的變體。

Table 5: Top1 accuracy (in %) of Vision Transformer on various datasets when pre-trained on ImageNet, ImageNet-21k or JFT300M. These values correspond to Figure 3 in the main text. Models are fine-tuned at 384 resolution. Note that the ImageNet results are computed without additional techniques (Polyak averaging and 512 resolution images) used to achieve results in Table 2.
Table 5:Vision Transformer 在 ImageNet、ImageNet-21k 或 JFT300M 進行預訓練後,在各個資料集上的 Top1 準確度(以 % 表示)。這些數值對應於正文中的 Figure 3。模型是在 384 解析度下進行微調。請注意,此處的 ImageNet 結果在計算時並未使用 Table 2 中為了取得結果而採用的額外技術(Polyak 平均法和 512 解析度影像)。
Second, we train our models on random subsets of 9M, 30M, and 90M as well as the full JFT300M dataset. We do not perform additional regularization on the smaller subsets and use the same hyper-parameters for all settings. This way, we assess the intrinsic model properties, and not the effect of regularization. We do, however, use early-stopping, and report the best validation accuracy achieved during training. To save compute, we report few-shot linear accuracy instead of full finetuning accuracy. Figure 4 contains the results. Vision Transformers overfit more than ResNets with comparable computational cost on smaller datasets. For example, ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets. The same is true for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.
其次,我們在 9M、30M 和 90M 的隨機子集以及完整的 JFT300M 資料集上訓練模型。我們不對較小的子集進行額外的正規化(regularization),且在所有設定中皆使用相同的超參數(hyper-parameters)。透過這種方式,我們得以評估模型的內在特性,而非正規化帶來的影響。然而,我們確實使用了早停法(early-stopping),並記錄訓練過程中所達到的最佳驗證準確度。為了節省運算量,我們報告的是少樣本線性準確度(few-shot linear accuracy),而非全微調(full finetuning)準確度。Figure 4 包含了這些結果。在運算成本相當的情況下,Vision Transformers 在較小資料集上的過擬合(overfit)情形比 ResNets 更為嚴重。例如,ViT-B/32 比 ResNet50 稍快;它在 9M 子集上的表現差得多,但在 90M 以上的子集表現則較好。ResNet152x2 和 ViT-L/16 之間的情況亦然。這項結果強化了一種直覺:卷積歸納偏置(convolutional inductive bias)對於較小的資料集非常有用,但對於較大的資料集,直接從資料中學習相關特徵模式就已經足夠,甚至更有裨益。
Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB (Table 2) seem promising for very low-data transfer. Further analysis of few-shot properties of ViT is an exciting direction of future work.
總體而言,在 ImageNet 上的少樣本結果(Figure 4)以及在 VTAB 上的低資料量結果(Table 2),對於極低資料量的轉移學習(transfer learning)展現出很好的前景。進一步分析 ViT 的少樣本特性是未來工作中一個令人振奮的方向。

Figure 4: Linear few-shot evaluation on ImageNet versus pre-training size. ResNets perform better with smaller pre-training datasets but plateau sooner than ViT, which performs better with larger pre-training. ViT-b is ViT-B with all hidden dimensions halved.
Figure 4:ImageNet 上的線性少樣本評估(Linear few-shot evaluation)對比預訓練規模。ResNets 在較小的預訓練資料集上表現較佳,但比 ViT 更早進入高原期(效能飽和),而 ViT 在較大的預訓練規模下表現更好。ViT-b 是將 ViT-B 的所有隱藏層維度減半後的版本。
### 4.4 SCALING STUDY
We perform a controlled scaling study of different models by evaluating transfer performance from JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1, R50x2, R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre-trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the end of the model name stands not for the patch size, but for the total downsampling ratio in the ResNet backbone).
我們透過評估 JFT-300M 的遷移效能,對不同模型進行受控的擴展性研究(scaling study)。在這種設定下,資料量大小不會成為模型效能的瓶頸,我們評估的是每個模型的效能與預訓練成本之間的關係。模型集包含:7 個 ResNets,分別為 R50x1、R50x2、R101x1、R152x1、R152x2(預訓練 7 個 epoch),以及 R152x2 與 R200x3(預訓練 14 個 epoch);6 個 Vision Transformers,分別為 ViT-B/32、B/16、L/32、L/16(預訓練 7 個 epoch),以及 L/16 與 H/14(預訓練 14 個 epoch);以及 5 個混合模型(hybrids),分別為 R50+ViT-B/32、B/16、L/32、L/16(預訓練 7 個 epoch),以及 R50+ViT-L/16(預訓練 14 個 epoch)。(對於混合模型,模型名稱末尾的數字並非代表影像區塊大小,而是指 ResNet 骨幹網路中的總下採樣率)。
Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5 for details on computational costs). Detailed results per model are provided in Table 6 in the Appendix. A few patterns can be observed. First, Vision Transformers dominate ResNets on the performance/compute trade-off. ViT uses approximately 2 − 4× less compute to attain the same performance (average over 5 datasets). Second, hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. This result is somewhat surprising, since one might expect convolutional local feature processing to assist ViT at any size. Third, Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts.
Figure 5 包含了遷移效能與總預訓練計算量之間的關係(關於計算成本的細節請參閱 Appendix D.5)。Table 6 則提供了每個模型的詳細結果。我們可以觀察到幾個模式:首先,在效能與計算量的權衡(trade-off)上,Vision Transformers 優於 ResNets。ViT 僅需大約 $2-4\times$ 較少的計算量即可達到相同的效能(5 個資料集的平均值)。其次,在較小的計算預算下,混合模型的表現略優於 ViT,但隨著模型規模增大,這種差異便消失了。這個結果有些令人驚訝,因為人們原先可能預期卷積的局部特徵處理在任何規模下都能對 ViT 有所幫助。第三,Vision Transformers 在嘗試的範圍內似乎尚未達到飽和,這激勵了未來的擴展嘗試。

Figure 5: Performance versus pre-training compute for different architectures: Vision Transformers, ResNets, and hybrids. Vision Transformers generally outperform ResNets with the same computational budget. Hybrids improve upon pure Transformers for smaller model sizes, but the gap vanishes for larger models.
Figure 5:不同架構(Vision Transformers、ResNets 與混合模型)的效能與預訓練運算量對比。在相同的運算預算下,Vision Transformers 的表現通常優於 ResNets。在較小的模型尺寸下,混合模型的表現優於純 Transformer,但隨著模型尺寸增大,兩者之間的差距便會消失。

Table 6: Detailed results of model scaling experiments. These correspond to Figure 5 in the main paper. We show transfer accuracy on several datasets, as well as the pre-training compute (in exaFLOPs).
Table 6:模型縮放實驗的詳細結果。這些結果對應於正文中的 Figure 5。我們展示了在數個資料集上的遷移準確度(transfer accuracy),以及預訓練的運算量(單位為 exaFLOPs)。
### 4.5 INSPECTING VISION TRANSFORMER
To begin to understand how the Vision Transformer processes image data, we analyze its internal representations. The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.
為了初步了解 Vision Transformer 如何處理影像資料,我們分析了其內部表示(internal representations)。Vision Transformer 的第一層將展平的影像區塊(flattened patches)線性投影到一個低維空間(Eq. 1)。Figure 7(左)顯示了學習到的嵌入濾波器(embedding filters)之前幾項主要成分(top principal components)。這些成分類似於用來表示每個影像區塊內微細結構的低維表示之合理[基底函數](https://terms.naer.edu.tw/detail/bf70b4ee760a49f21f17b58f3a918b0b/)(basis functions)。
After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings. Further, the row-column structure appears; patches in the same row/column have similar embeddings. Finally, a sinusoidal structure is sometimes apparent for larger grids (Appendix D). That the position embeddings learn to represent 2D image topology explains why hand-crafted 2D-aware embedding variants do not yield improvements (Appendix D.4).
在投影之後,會將學習到的位置嵌入(learned position embedding)加入到影像區塊表示中(patch representations)。Figure 7(中)顯示模型透過位置嵌入的相似性來學習編碼影像內的距離,意即距離較近的影像區塊傾向於具有更相似的位置嵌入。此外,圖中出現了列—行(row-column)結構;位於同一列或同一行的影像區塊具有相似的嵌入。最後,在較大的網格中,有時可以明顯看到正弦結構(Appendix D)。位置嵌入學習到如何表示 2D 影像拓撲(topology),這解釋了為何手工設計的 2D 感知嵌入變體並未帶來改進(Appendix D.4)。
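:::info
Figure 7(中)那種相似度圖,概念上可以這樣計算(僅為示意,假設 14×14 的網格、且已去除 [CLS] 的位置嵌入):
```python
import torch
import torch.nn.functional as F

def pos_embed_similarity(pos_embed: torch.Tensor, grid=14) -> torch.Tensor:
    """pos_embed: (grid*grid, D)。回傳 (grid, grid, grid, grid):
    取 [i, j] 即為該位置的嵌入與所有其它位置嵌入的餘弦相似度圖。"""
    pe = F.normalize(pos_embed, dim=-1)
    sim = pe @ pe.T                      # (196, 196) 兩兩餘弦相似度
    return sim.reshape(grid, grid, grid, grid)

sim = pos_embed_similarity(torch.randn(196, 768))
print(sim[0, 0].shape)  # 左上角 patch 對所有 patch 的相似度圖:torch.Size([14, 14])
```
:::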

Figure 7: Left: Filters of the initial linear embedding of RGB values of ViT-L/32. Center: Similarity of position embeddings of ViT-L/32. Tiles show the cosine similarity between the position embedding of the patch with the indicated row and column and the position embeddings of all other patches. Right: Size of attended area by head and network depth. Each dot shows the mean attention distance across images for one of 16 heads at one layer. See Appendix D.7 for details.
Figure 7:左圖:ViT-L/32 對 RGB 值進行初始線性嵌入(linear embedding)的濾波器。中圖:ViT-L/32 位置嵌入(position embeddings)的相似度。各個方塊顯示了指定行列之影像區塊(patch)的位置嵌入,與所有其他影像區塊位置嵌入之間的餘弦相似度(cosine similarity)。右圖:依注意力頭(head)與網路深度區分的關注區域大小。每個點代表某一層中 16 個注意力頭之一,在所有圖像上的平均注意力距離。詳情請參閱 Appendix D.7。
Self-attention allows ViT to integrate information across the entire image even in the lowest layers. We investigate to what degree the network makes use of this capability. Specifically, we compute the average distance in image space across which information is integrated, based on the attention weights (Figure 7, right). This “attention distance” is analogous to receptive field size in CNNs. We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model. Other attention heads have consistently small attention distances in the low layers. This highly localized attention is less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the attention distance increases with network depth. Globally, we find that the model attends to image regions that are semantically relevant for classification (Figure 6)
自注意力(Self-attention)使得 ViT 即使在最底層也能整合整個影像的資訊。我們調查了網路在何種程度上利用了這項能力。具體而言,我們根據注意力權重,計算了資訊整合時在影像空間中的平均距離(Figure 7,右)。這種「注意力距離」(attention distance)類似於 CNN 中的感受野(receptive field)大小。我們發現某些注意力頭(attention heads)在最底層就已經關注到影像的大部分區域,這顯示模型確實利用了全域整合資訊的能力。其它的注意力頭在低層則始終保持較小的注意力距離。這種高度局部化的注意力在 Transformer 之前先應用 ResNet 的混合模型(hybrid models)中較不明顯(Figure 7,右),這暗示它可能具有與 CNN 早期卷積層相似的功能。此外,注意力距離會隨著網路深度增加。從全域來看,我們發現模型會關注與分類具有語義相關性的影像區域(Figure 6)。
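:::info
「注意力距離」一種可能的計算方式如下(僅為我的理解示意,實際定義請以論文 Appendix D.7 為準;`grid`、`patch_size` 為假設值):
```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid=14, patch_size=16) -> torch.Tensor:
    """attn: (num_heads, N, N) 的注意力權重,N = grid*grid(已去除 [CLS])。
    回傳每個 head 以注意力權重加權後的平均像素距離。"""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch_size  # (N, 2)
    dist = torch.cdist(coords, coords)             # (N, N) 每一對 patch 的像素距離
    # 對每個 query 以注意力權重加權距離,再對所有 query 取平均
    return (attn * dist).sum(dim=-1).mean(dim=-1)  # (num_heads,)

print(mean_attention_distance(torch.softmax(torch.randn(16, 196, 196), dim=-1)))
```
:::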

Figure 6: Representative examples of attention from the output token to the input space. See Appendix D.7 for details.
Figure 6: 從輸出標記(output token)到輸入空間之注意力的具體範例。詳情請參閱 Appendix D.7。
### 4.6 SELF-SUPERVISION
Transformers show impressive performance on NLP tasks. However, much of their success stems not only from their excellent scalability but also from large scale self-supervised pre-training (Devlin et al., 2019; Radford et al., 2018). We also perform a preliminary exploration on masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch, but still 4% behind supervised pre-training. Appendix B.1.2 contains further details. We leave exploration of contrastive pre-training (Chen et al., 2020b; He et al., 2020; Bachman et al., 2019; Henaff et al., 2020) to future work.
Transformer 在自然語言處理(NLP)任務中展現了令人驚豔的效能。然而,其成功很大程度上不僅源於其優異的可擴展性,還歸功於大規模的自監督預訓練(Devlin et al., 2019; Radford et al., 2018)。我們也針對自監督進行了「遮蔽影像區塊預測」(masked patch prediction)的初步探索,模仿 BERT 中所使用的遮蔽語言模型任務。透過自監督預訓練,我們較小的 ViT-B/16 模型在 ImageNet 上達到了 79.9% 的準確率,相較於從頭開始訓練(training from scratch)有著 2% 的顯著提升,但仍落後於監督式預訓練 4%。Appendix B.1.2 包含了進一步的細節。我們將對比式預訓練(contrastive pre-training,Chen et al., 2020b; He et al., 2020; Bachman et al., 2019; Henaff et al., 2020)的探索留待未來的工作。
## 5 CONCLUSION
We have explored the direct application of Transformers to image recognition. Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.
我們探索了將 Transformer 直接應用於影像辨識的可行性。不同於以往在電腦視覺領域中使用自注意力的研究,除了初始的影像區塊(patch)提取步驟外,我們並未在架構中引入特定於影像的歸納偏置(inductive biases)。相反地,我們將一張影像視為一系列的影像區塊(patches),並使用 NLP 中常見的標準 Transformer 編碼器(encoder)進行處理。當這種簡單且具擴展性的策略與大型資料集上的預訓練結合時,效果出奇地好。因此,Vision Transformer 在許多影像分類資料集上達到或超越了當前的先進技術(state of the art),同時預訓練的成本相對較低。
While these initial results are encouraging, many challenges remain. One is to apply ViT to other computer vision tasks, such as detection and segmentation. Our results, coupled with those in Carion et al. (2020), indicate the promise of this approach. Another challenge is to continue exploring selfsupervised pre-training methods. Our initial experiments show improvement from self-supervised pre-training, but there is still large gap between self-supervised and large-scale supervised pretraining. Finally, further scaling of ViT would likely lead to improved performance.
雖然這些初步結果令人振奮,但仍面臨許多挑戰。其一是將 ViT 應用於其他電腦視覺任務,例如偵測(detection)與分割(segmentation)。我們的結果結合 Carion et al. (2020) 的研究,顯示了此方法的發展前景。另一個挑戰是持續探索自監督(self-supervised)的預訓練方法。我們的初步實驗顯示,自監督預訓練確實能帶來改進,但在自監督與大規模監督式預訓練之間仍存在巨大差距。最後,進一步擴大 ViT 的規模可能會帶來更好的性能。
## APPENDIX
### B EXPERIMENT DETAILS
#### B.1 TRAINING
Table 3 summarizes our training setups for our different models. We found strong regularization to be key when training models from scratch on ImageNet. Dropout, when used, is applied after every dense layer except for the qkv-projections and directly after adding positional- to patch embeddings. Hybrid models are trained with the exact setup as their ViT counterparts. Finally, all training is done on resolution 224.
Table 3 總結了我們針對不同模型所採用的訓練設定。我們發現,在 ImageNet 上從頭開始訓練模型時,強大的正規化(regularization)是關鍵。在使用 Dropout 時,除了 qkv-projections 之外,我們會將其應用於每個全連接層(dense layer)之後,以及在將位置編碼(positional embeddings)加到 patch embeddings 之後。混合模型(Hybrid models)的訓練設定與其對應的 ViT 模型完全相同。最後,所有的訓練均在解析度 224 下完成。

Table 3: Hyperparameters for training. All models are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet we found it beneficial to additionally apply gradient clipping at global norm 1. Training resolution is 224.
##### B.1.1 FINE-TUNING
We fine-tune all ViT models using SGD with a momentum of 0.9. We run a small grid search over learning rates, see learning rate ranges in Table 4. To do so, we use small sub-splits from the training set (10% for Pets and Flowers, 2% for CIFAR, 1% ImageNet) as development set and train on the remaining data. For final results we train on the entire training set and evaluate on the respective test data. For fine-tuning ResNets and hybrid models we use the exact same setup, with the only exception of ImageNet where we add another value 0.06 to the learning rate sweep. Additionally, for ResNets we also run the setup of Kolesnikov et al. (2020) and select the best results across this run and our sweep. Finally, if not mentioned otherwise, all fine-tuning experiments run at 384 resolution (running fine-tuning at different resolution than training is common practice (Kolesnikov et al., 2020)).
我們使用動量(momentum)為 0.9 的 SGD 對所有 ViT 模型進行微調(fine-tune)。我們針對學習率(learning rates)進行了小規模的網格搜索(grid search),具體的學習率範圍請參閱 Table 4。為了進行搜索,我們從訓練集中提取了小部分的子集(Pets 與 Flowers 為 10%、CIFAR 為 2%、ImageNet 為 1%)作為開發集(development set),並利用剩餘的資料進行訓練。至於最終結果,我們則使用整個訓練集進行訓練,並在各自的測試資料上進行評估。在微調 ResNets 與混合模型(hybrid models)時,我們採用完全相同的設定,唯一的例外是 ImageNet,我們在學習率掃描(sweep)中額外增加了一個數值 0.06。此外,針對 ResNets,我們也運行了 Kolesnikov et al. (2020) 的設定,並從該次運行與我們的掃描結果中選擇最佳表現。最後,若無特別說明,所有的微調實驗均在 384 解析度下進行(在與訓練階段不同的解析度下進行微調是常見的做法 (Kolesnikov et al., 2020))。

Table 4: Hyperparameters for fine-tuning. All models are fine-tuned with cosine learning rate decay, a batch size of 512, no weight decay, and grad clipping at global norm 1. If not mentioned otherwise, fine-tuning resolution is 384.
When transferring ViT models to another dataset, we remove the whole head (two linear layers) and replace it by a single, zero-initialized linear layer outputting the number of classes required by the target dataset. We found this to be a little more robust than simply re-initializing the very last layer.
當將 ViT 模型遷移(transferring)到另一個資料集時,我們會移除整個頭部(head,即兩個線性層),並將其替換為一個單一的、零初始化的線性層,該層會根據目標資料集所需的類別數量進行輸出。我們發現這種做法比單純重新初始化最後一層更具強健性(robust)。
For VTAB we follow the protocol in Kolesnikov et al. (2020), and use the same hyperparameter setting for all tasks. We use a learning rate of 0.01 and train for 2500 steps (Tab. 4). We chose this setting by running a small sweep over two learning rates and two schedules, and selecting the setting with the highest VTAB score on the 200-example validation sets. We follow the pre-processing used in Kolesnikov et al. (2020), except that we do not use task-specific input resolutions. Instead we find that Vision Transformer benefits most from a high resolution (384 × 384) for all tasks.
對於 VTAB,我們遵循 Kolesnikov et al. (2020) 中的協定,並對所有任務使用相同的超參數設定。我們使用 0.01 的學習率並訓練 2500 個步數(steps)(Tab. 4)。我們透過對兩種學習率和兩種排程(schedules)進行小規模掃描來選擇此設定,並在包含 200 個樣本的驗證集上選擇 VTAB 分數最高的設定。我們遵循 Kolesnikov et al. (2020) 中使用的預處理(pre-processing),不同之處在於我們不使用特定於任務的輸入解析度。相反地,我們發現 Vision Transformer 在所有任務中,從高解析度(384 × 384)獲益最多。
##### B.1.2 SELF-SUPERVISION
We employ the masked patch prediction objective for preliminary self-supervision experiments. To do so we corrupt 50% of patch embeddings by either replacing their embeddings with a learnable [mask] embedding (80%), a random other patch embedding (10%) or just keeping them as is (10%). This setup is very similar to the one used for language by Devlin et al. (2019). Finally, we predict the 3-bit, mean color (i.e., 512 colors in total) of every corrupted patch using their respective patch representations.
我們採用遮罩影像區塊預測(masked patch prediction)目標的方式來進行初步的自監督(self-supervision)實驗。為此,我們破壞(corrupt)了 50% 的 patch embeddings,其方式包含:將其 embedding 替換為一個可學習的 [mask] embedding (80%)、替換為一個隨機的其它影像區塊 embedding (10%),或者保持原樣 (10%)。此設置與 Devlin et al. (2019) 在語言處理中所使用的設置非常相似。最後,我們利用各個受損影像區塊對應的影像區塊表示法(patch representations),來預測其 3-bit 的平均顏色(即總共 512 種顏色)。
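:::info
這裡的 50% / 80% / 10% / 10% 損壞策略,我寫了一個簡化的示意(僅為概念驗證,非論文原始碼;`mask_token` 在實際模型中應是可學習的參數,這裡只是假設):
```python
import torch

def corrupt_patches(patch_emb: torch.Tensor, mask_token: torch.Tensor, corrupt_ratio=0.5):
    """patch_emb: (B, N, D)。隨機挑 50% 的位置:其中 80% 換成 [mask]、
    10% 換成隨機其它 patch 的 embedding、10% 保持原樣。
    回傳損壞後的 embeddings 與被選中的位置(只對這些位置計算預測損失)。"""
    B, N, D = patch_emb.shape
    corrupted = patch_emb.clone()
    selected = torch.rand(B, N) < corrupt_ratio               # 被選為「損壞」的位置
    r = torch.rand(B, N)
    use_mask = selected & (r < 0.8)                           # 80%:換成 [mask] embedding
    use_rand = selected & (r >= 0.8) & (r < 0.9)              # 10%:換成隨機其它 patch
    corrupted[use_mask] = mask_token
    rand_idx = torch.randint(0, N, (B, N))
    rand_emb = patch_emb[torch.arange(B).unsqueeze(1).expand(B, N), rand_idx]
    corrupted[use_rand] = rand_emb[use_rand]                  # 剩下 10% 保持原樣
    return corrupted, selected

out, sel = corrupt_patches(torch.randn(2, 196, 768), mask_token=torch.zeros(768))
```
:::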
We trained our self-supervised model for 1M steps (ca. 14 epochs) with batch size 4096 on JFT. We use Adam, with a base learning rate of $2\times 10^{-4}$ , warmup of 10k steps and cosine learning rate decay. As prediction targets for pretraining we tried the following settings: 1) predicting only the mean, 3bit color (i.e., 1 prediction of 512 colors), 2) predicting a 4 × 4 downsized version of the 16 × 16 patch with 3bit colors in parallel (i.e., 16 predictions of 512 colors), 3) regression on the full patch using L2 (i.e., 256 regressions on the 3 RGB channels). Surprisingly, we found that all worked quite well, though L2 was slightly worse. We report final results only for option 1) because it has shown best few-shot performance. We also experimented with 15% corruption rate as used by Devlin et al. (2019) but results were also slightly worse on our few-shot metrics.
我們在 JFT 資料集上,以 4096 的批次大小(batch size)訓練了自監督模型共 1M 個步數(約 14 個輪次/epochs)。我們使用 Adam 優化器,基礎學習率(base learning rate)為 $2\times 10^{-4}$,並採用 10k 步的暖機(warmup)與餘弦學習率衰減(cosine learning rate decay)。作為預訓練的預測目標,我們嘗試了以下設置:1) 僅預測平均的 3-bit 顏色(即 1 個 512 類顏色的預測);2) 平行地預測 16 × 16 影像區塊的 4 × 4 縮小版本(同樣以 3-bit 顏色表示,即 16 個 512 類顏色的預測);3) 使用 L2 對整個影像區塊進行迴歸(即在 3 個 RGB 通道上進行 256 次迴歸)。令人驚訝的是,我們發現所有方式的效果都相當不錯,儘管 L2 的表現略差。我們僅報告選項 1) 的最終結果,因為它展現了最佳的少樣本(few-shot)效能。我們也嘗試過 Devlin et al. (2019) 所使用的 15% 損壞率,但在我們的少樣本指標上結果也略差。
Lastly, we would like to remark that our instantiation of masked patch prediction doesn’t require such an enormous amount of pretraining nor a large dataset such as JFT in order to lead to similar performance gains on ImageNet classification. That is, we observed diminishing returns on downstream performance after 100k pretraining steps, and see similar gains when pretraining on ImageNet.
最後,我們想說明,我們對遮罩影像區塊預測的實作,並不需要如此龐大的預訓練量,也不需要像 JFT 這樣的大型資料集,就能在 ImageNet 分類任務上獲得類似的效能提升。也就是說,我們觀察到在 100k 個預訓練步數後,下游任務的效能收益開始遞減,且在 ImageNet 上進行預訓練時也看到了類似的增益。
### D.4 POSITIONAL EMBEDDING
We ran ablations on different ways of encoding spatial information using positional embedding. We tried the following cases:
- Providing no positional information: Considering the inputs as a bag of patches.
- 1-dimensional positional embedding: Considering the inputs as a sequence of patches in the raster order (default across all other experiments in this paper).
- 2-dimensional positional embedding: Considering the inputs as a grid of patches in two dimensions. In this case, two sets of embeddings are learned, each for one of the axes, X-embedding, and Y-embedding, each with size D/2. Then, based on the coordinate of the patch in the input, we concatenate the X and Y embedding to get the final positional embedding for that patch.
- Relative positional embeddings: Considering the relative distance between patches to encode the spatial information instead of their absolute position. To do so, we use 1-dimensional Relative Attention, in which we define the relative distance between all possible pairs of patches. Thus, for every given pair (one as query, and the other as key/value in the attention mechanism), we have an offset $p_q - p_k$, where each offset is associated with an embedding. Then, we simply run extra attention, where we use the original query (the content of query), but use relative positional embeddings as keys. We then use the logits from the relative attention as a bias term and add it to the logits of the main attention (content-based attention) before applying the softmax.
我們針對使用位置嵌入(positional embedding)對空間資訊進行編碼的不同方式進行了消融實驗(ablations)。我們嘗試了以下幾種情況:
- 不提供空間資訊: 將輸入視為一袋(bag)影像區塊。
- 一維位置嵌入(1-dimensional positional embedding):將輸入視為依光柵順序(raster order)排列的一系列影像區塊(這是本論文中所有其他實驗的預設設置)。
- 二維位置嵌入(2-dimensional positional embedding): 將輸入視為二維網格狀的影像區塊。在這種情況下,會學習兩組嵌入,分別對應於其中一個軸:X-embedding 與 Y-embedding,各別的大小為 $D/2$。接著,根據輸入影像區塊中的坐標,我們將 X 和 Y 嵌入級聯(concatenate)起來,以獲得該影像區塊最終的位置嵌入。
- 相對位置嵌入(Relative positional embeddings): 考慮影像區塊之間的相對距離來編碼空間資訊,而非使用其絕對位置。為此,我們使用一維相對注意力(Relative Attention),其中我們定義了所有可能的影像區塊對之間的相對距離。因此,對於每一組給定的配對(一個作為查詢(query),另一個作為注意力機制中的鍵/值(key/value)),我們有一個偏移量 $p_q - p_k$,每個偏移量都與一個嵌入相關聯。接著,我們簡單地運行額外的注意力機制,其中我們使用原始查詢(查詢的內容),但使用相對位置嵌入作為鍵。隨後,我們將來自相對注意力的對數值(logits)作為偏置項(bias term),在應用 softmax 之前將其加到主注意力(基於內容的注意力)的對數值中。(下方附上一個簡化的示意程式碼。)
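:::info
上面「一維相對注意力」的描述,我用下面這段簡化程式示意(僅為概念示意;head 維度、表格大小等都是假設值,非論文原始碼):
```python
import torch
import torch.nn as nn

class RelativeAttentionBias(nn.Module):
    """以可學習的相對位置嵌入作為 key,與原始 query 做點積,
    產生加在內容注意力 logits 上的偏置項(之後才做 softmax)"""
    def __init__(self, num_patches=196, head_dim=64):
        super().__init__()
        # 偏移量 p_q - p_k 的範圍是 [-(N-1), N-1],共 2N-1 種,各對應一個嵌入
        self.rel_emb = nn.Embedding(2 * num_patches - 1, head_dim)
        self.num_patches = num_patches

    def forward(self, q: torch.Tensor, content_logits: torch.Tensor) -> torch.Tensor:
        # q: (B, heads, N, head_dim);content_logits: (B, heads, N, N)
        N = self.num_patches
        pos = torch.arange(N, device=q.device)
        offset = pos[:, None] - pos[None, :] + (N - 1)    # (N, N),平移成非負索引
        rel_k = self.rel_emb(offset)                      # (N, N, head_dim)
        bias = torch.einsum("bhqd,qkd->bhqk", q, rel_k)   # 相對注意力的 logits
        return content_logits + bias
```
:::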
In addition to different ways of encoding spatial information, we also tried different ways of incorporating this information in our model. For the 1-dimensional and 2-dimensional positional embeddings, we tried three different cases: (1) add positional embeddings to the inputs right after the stem of the model and before feeding the inputs to the Transformer encoder (default across all other experiments in this paper); (2) learn and add positional embeddings to the inputs at the beginning of each layer; (3) add a learned positional embedding to the inputs at the beginning of each layer (shared between layers).
除了編碼空間資訊的不同方式外,我們還嘗試了將這些資訊整合進模型的不同方法。對於一維和二維位置嵌入,我們嘗試了三種不同的情況:(1) 在模型的主幹(stem)之後、將輸入送入 Transformer 編碼器之前,將位置嵌入加到輸入中(這是本論文中所有其他實驗的預設設置);(2) 在每一層的開頭學習並將位置嵌入加到輸入中;(3) 在每一層的開頭將學習到的位置嵌入加到輸入中(各層之間共享)。
Table 8 summarizes the results from this ablation study on a ViT-B/16 model. As we can see, while there is a large gap between the performances of the model with no positional embedding and models with positional embedding, there is little to no difference between different ways of encoding positional information. We speculate that since our Transformer encoder operates on patch-level inputs, as opposed to pixel-level, the differences in how to encode spatial information is less important. More precisely, in patch-level inputs, the spatial dimensions are much smaller than the original pixel-level inputs, e.g., 14 × 14 as opposed to 224 × 224, and learning to represent the spatial relations in this resolution is equally easy for these different positional encoding strategies. Even so, the specific pattern of position embedding similarity learned by the network depends on the training hyperparameters (Figure 10).
Table 8 總結了在 ViT-B/16 模型上進行此消融研究的結果。如我們所見,雖然不使用位置嵌入的模型與使用位置嵌入的模型在性能上存在巨大差距,但不同位置資訊編碼方式之間的差異卻微乎其微。我們推測,由於我們的 Transformer 編碼器運作於影像區塊等級(patch-level)的輸入,而非像素等級(pixel-level),因此編碼空間資訊方式的差異變得不那麼重要。更準確地說,在影像區塊等級的輸入中,空間維度比原始像素等級的輸入小得多(例如 $14 \times 14$ 相對於 $224 \times 224$),在這種解析度下學習表示空間關係,對於這些不同的位置編碼策略來說同樣容易。即便如此,網路所學習到的位置嵌入相似性之特定模式仍取決於訓練超參數(Figure 10)。

Figure 10: Position embeddings of models trained with different hyperparameters.

Table 8: Results of the ablation study on positional embeddings with ViT-B/16 model evaluated on ImageNet 5-shot linear.
## 延伸閱讀
### PatchEmbedding
下面程式碼是我利用 Gemini Pro 生成,做為自己研究 ViT 的觀念使用,不保證正確,但我用多個模型驗證過,應該是沒有錯才對,應該啦。
```python
import torch
import torch.nn as nn
class PatchEmbedding(nn.Module):
"""
處理 1. Image 轉 Patch 與 2. Patch 轉 Token
"""
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
super().__init__()
self.img_size = img_size
self.patch_size = patch_size
self.num_patches = (img_size // patch_size) * (img_size // patch_size)
# 【關鍵實作】
# 論文中提到將圖片切成 16x16 的 patches,然後展平並做線性轉換。
# 在 PyTorch 中,最有效率的等價做法是使用 Conv2d,
# 將 kernel_size 和 stride 都設定為 patch_size。
self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
def forward(self, x):
# 假設輸入 x 的維度為: (Batch_Size, 3, 224, 224)
# Image -> Patch & Token:
# Conv2d 一次完成了切割與特徵提取(轉換為 embedding 維度)
x = self.proj(x)
# 此時維度變成: (Batch_Size, 768, 14, 14) -> 14x14 = 196 個 patches
# 展平並調整維度以符合 Transformer 的輸入格式 (Batch_Size, Sequence_Length, Embedding_Dim)
x = x.flatten(2) # 變成 (Batch_Size, 768, 196)
x = x.transpose(1, 2) # 變成 (Batch_Size, 196, 768)
return x
```
這邊的實作很巧妙地利用二維卷積,以不重疊(non-overlapping)的大步幅卷積將影像分塊,然後展平後面的維度再做轉置,最終得到每一個影像區塊對應的 768 維向量表示。
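下面這段小驗證是我自己寫的(非官方程式碼),用來確認「kernel_size = stride = patch_size 的 Conv2d」與「先切塊攤平、再用同一組權重做線性投影」的輸出完全一致:
```python
import torch
import torch.nn as nn

P, C, D = 16, 3, 768
conv = nn.Conv2d(C, D, kernel_size=P, stride=P)
x = torch.randn(1, C, 224, 224)

# 路徑一:Conv2d 做 patch embedding
out_conv = conv(x).flatten(2).transpose(1, 2)                      # (1, 196, 768)

# 路徑二:unfold 取出不重疊的 patch、攤平成 (C*P*P),再做線性投影
patches = x.unfold(2, P, P).unfold(3, P, P)                        # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, C * P * P)
out_linear = patches @ conv.weight.reshape(D, -1).t() + conv.bias  # (1, 196, 768)

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # True
```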
### PatchPosition
```python
class ViTDataPreparation(nn.Module):
"""
處理 3. 位置編碼 (Positional Encoding) 與 Class Token
"""
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
super().__init__()
self.patch_embed = PatchEmbedding(img_size, patch_size, in_chans, embed_dim)
num_patches = self.patch_embed.num_patches
# 定義一個可學習的分類 Token (類似 BERT 的 [CLS])
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
# 【位置編碼處理】
# 論文採用的是「可學習的 1D 位置編碼 (Learnable 1D Position Embeddings)」,
# 而不是像原始 NLP Transformer 使用的正弦/餘弦固定編碼。
# 長度為 num_patches + 1 (因為多了一個 cls_token)
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
def forward(self, x):
B = x.shape[0] # 取得 Batch Size
# 1 & 2: 取得 Patch Tokens
x = self.patch_embed(x) # 維度: (B, 196, 768)
# 準備 [CLS] Token 並與 Patch Tokens 在序列維度上拼接 (Concatenate)
cls_tokens = self.cls_token.expand(B, -1, -1) # 擴展到與 Batch 同等大小: (B, 1, 768)
x = torch.cat((cls_tokens, x), dim=1) # 拼接後維度: (B, 197, 768)
# 3: 加入位置編碼 (Broadcasting 機制會自動將 pos_embed 加到每一個 Batch 上)
x = x + self.pos_embed # 維度保持: (B, 197, 768)
# 到這裡,資料就完全準備好,可以送入 Transformer Encoder 了
return x
```
這邊利用 `self.cls_token` 來學習圖片所代表的類別表示,位置編碼則是一個很簡單的一維可學習的參數。
### ViTClassifier
```python
class ViTClassifier(nn.Module):
"""
處理 4. 萃取 [CLS] Token 並輸出分類結果 (MLP Head)
"""
def __init__(self, embed_dim=768, num_classes=1000):
super().__init__()
# 為了穩定訓練,通常會在最後的線性層之前加上一層 Layer Normalization
self.norm = nn.LayerNorm(embed_dim)
# 這就是最後的「分類器」,把 768 維的特徵,轉換成 1000 個類別的機率分佈 (Logits)
self.head = nn.Linear(embed_dim, num_classes)
def forward(self, x):
# 假設這裡的 x 是剛從 Transformer Encoder 走出來的完整序列
# x 的維度: (Batch_Size, 197, 768)
# 【關鍵實作 1:切片操作 Slicing】
# 我們只在乎序列中的「第一個」Token,也就是位置在 index 0 的 [CLS] Token
# 語法 x[:, 0] 的意思是:取出所有的 Batch,但只拿第 0 個 Token 的所有特徵
cls_output = x[:, 0]
# 此時維度變成: (Batch_Size, 768) ,那 196 個 Patch Tokens 就這樣被丟棄了!
# 【關鍵實作 2:分類運算】
# 先做正規化
cls_output = self.norm(cls_output)
# 丟入線性分類層
logits = self.head(cls_output)
# 此時維度變成: (Batch_Size, 1000)
return logits
# 測試這段架構
if __name__ == "__main__":
# 模擬從 Transformer Encoder 輸出的資料 (Batch=2, Seq_Len=197, Embed_Dim=768)
encoder_output = torch.randn(2, 197, 768)
# 假設我們要分類成 1000 種不同的物件 (如 ImageNet)
classifier = ViTClassifier(embed_dim=768, num_classes=1000)
# 取得分類結果
predictions = classifier(encoder_output)
print(f"Transformer 輸出的維度: {encoder_output.shape}")
print(f"最終分類 Logits 的維度: {predictions.shape}")
# 預期輸出: torch.Size([2, 1000])
```
這邊說明的是將學習到的 `cls_token` 取出做計算並取得其 `logits`,再根據你是訓練、推論的階段來做不同的後處理就可以了。
根據我與 Gemini 的互動,這個 `cls_token` 並沒有被特別的處理,就是很標準的 QKV 計算。