# Very Deep Convolutional Networks for Large-Scale Image Recognition(VGG16)(翻譯) ###### tags: `VGG` `CNN` `論文翻譯` `deeplearning` >[name=Shaoe.chen] [time=Thu, Feb 14, 2020] [TOC] ## 說明 區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院 :::info 原文 ::: :::success 翻譯 ::: :::warning 個人註解,任何的翻譯不通暢部份都請留言指導 ::: :::danger * [paper hyperlink](https://arxiv.org/pdf/1409.1556.pdf) * [參考來源_VGGNet 閱讀理解- Very Deep Convolutional Networks for Large-Scale Image Recognition](https://blog.csdn.net/zziahgf/article/details/79614822) ::: ## Abstract :::info In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision. ::: :::success 在這個工作中,我們研究卷積網路的深度在[大型](http://terms.naer.edu.tw/detail/6632949/)影像辨識環境對準確度的影響。我們主要的貢獻就是使用一個帶有非常小(3x3)的卷積[過濾器](http://terms.naer.edu.tw/detail/6622536/)的架構對深度增加的網路做全面的評估,這說明了,將深度推深到16-19個權重層可以實現對現有技術配置的明顯改進。這些發現是我們2014年ImageNet[挑戰賽](http://terms.naer.edu.tw/detail/6588066/)提交的基礎,其中我們團隊在定位與分類過程中分別得到第一、第二名。我們還表明,我們的表示可以很好的泛化到其它的資料集,在這些資料集上可以來到當前最佳結果。我們已經公開兩個最佳效能的卷積模型,以促進在電腦視覺中使用深度視覺表示的進一步的研究。 ::: ## 1 INTRODUCTION :::info Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014) which has become possible due to the large public image repositories, such as ImageNet (Deng et al., 2009), and high-performance computing systems, such as GPUs or large-scale distributed clusters (Dean et al., 2012). In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014), which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings (Perronnin et al., 2010) (the winner of ILSVRC-2011) to deep ConvNets (Krizhevsky et al.,2012) (the winner of ILSVRC-2012). ::: :::success 卷積網路(ConvNets)近來在大型影像與視頻辨識中取得巨大的成功(Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014),這成功的可能是由於大型公開影像庫,像是ImageNet (Deng et al., 2009),與高效率的計算系統,像是GPUs或大型分佈式[叢集](http://terms.naer.edu.tw/detail/6686743/)(Dean et al., 2012)。特別是,ImageNet大型視覺辨識[挑戰賽](http://terms.naer.edu.tw/detail/6588066/)(ILSVRC) (Russakovsky et al., 2014),在深度視覺辨識架構的發展中起了重要的作用,這挑戰賽已經是幾代大型影像分類系統的[測試平台](http://terms.naer.edu.tw/detail/3229776/),從高維淺層特徵編碼(Perronnin et al., 2010) (the winner of ILSVRC-2011)到深度卷積網路 (Krizhevsky et al.,2012) (the winner of ILSVRC-2012)。 ::: :::info With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. 
(2012) in a bid to achieve better accuracy. For instance, the best-performing submissions to the ILSVRC2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design – its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3×3) convolution filters in all layers. ::: :::success 隨著卷積神經網路在電腦視覺領域中日益成為一種商品,人們嚐試改進Krizhevsky et al. (2012)的原始架構,試圖達到更好的準確度。舉例來說,提交ILSVRC2013的最佳效能(Zeiler & Fergus, 2013; Sermanet et al., 2014)利用較小的感受視窗大小與較小的第一層卷積層的步幅(stride)。另一項改進涉及在整個影像與多個尺度上密集地訓練與測試網路(Sermanet et al., 2014; Howard, 2014)。這篇論文中,我們討論另卷積神經網路架構設計的另一個重要方面-它的深度。為此,我們固定架構的其它參數,透過加入更多的卷積層來逐步的增加網路深度,這是可行的,因為所有的卷積層中都使用非常小(3x3)的[過濾器](http://terms.naer.edu.tw/detail/6622536/)。 ::: :::info As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of a relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models^1^ to facilitate further research. ::: :::success 因此,我們提出更精確的卷積神經網路架構,不止在ILSVRC分類與定位任務上得到最佳準確度,而且還適用於其它影像辨識資料集,即使做為相對簡單的[管線](http://terms.naer.edu.tw/detail/6665364/)的一部份使用(例如,由線性SVM分類的深度特徵,不需要微調)依然可以獲得出色的效能。我們已經發佈兩個最佳效能的模型^1^,以方便進一步的研究。 ::: :::info The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions. ::: :::success 論文的其它部份安排如下。在Sect.2中,我們說明我們的卷積神經網路的配置。影像分類與估測的細節在Sect. 3中說明,在ILSVRC分類任務上的配置比較在Sect. 4中。Sect. 5是論文的結論。為了完整起見,我們還在附錄A中說明並評估我們的ILSVRC-2014物體定位系統,並在附錄B中討論將非常深的特徵泛化到其它資料集上。最後,附錄C包含主要論文修訂的清單。 ::: ## 2 CONVNET CONFIGURATIONS :::info To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3. ::: :::success 為了在公平環境中衡量增加網路深度帶來的改進,所有我們的卷積神經網路層配置皆使用相同的原則設計,靈感來自Ciresan et al. (2011); Krizhevsky et al. (2012)。在這一節中,我們首先說明卷積神經網路配置的通用佈局(Sect. 2.1),然後,詳細說明評估中使用的特定配置(Sect. 2.2)。然後討論我們的設計選擇,在Sect. 2.3中與現有技術進行比較。 ::: ### 2.1 ARCHITECTURE :::info During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. 
The only pre-processing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window, with stride 2. ::: :::success 訓練期間,我們的卷積神經網路的輸入是固定大小的224x224 RGB影像。我們唯一的預處理就是從每個像素中減掉在訓練集上計算出來的RGB均值。影像通過堆疊的卷積層傳遞,卷積層中我們使用[過濾器](http://terms.naer.edu.tw/detail/6622536/)有著非常小的感受域:3x3(這是補捉左/右,上/上,中心的概念的最小尺寸)。在其中一個配置中,我們還利用1x1卷積過濾器,這個[過濾器](http://terms.naer.edu.tw/detail/6622536/)以視為輸入channels的線性轉換(其次是非線性)。卷積的步幅(stride)固定為1像素;卷積層輸入的空間填充使得卷積後的空間解析度得以維持不變。即3x3卷積層的填充為1像素。空間的池化(pooling)由五個max-pooling執行,這五層在部份卷積層後面(並非所有的卷積層後面都接max-pooling)。Max-pooling以2x2像素視窗,步幅(stride)為2執行。 ::: :::info A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks. ::: :::success 一個堆疊的卷積層(不同架構有著不同深度)後面接著三個全連接層(FC)~(Fully-Connected)~:前兩個FC有4096個channels,第三個執行1000-way ILSVRC分類,因此包含1000個channels(每一個channel代表一個類別)。最後一層為softmax。所有網路的全連接層的配置都是一樣的。 ::: :::info All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012). ::: :::success 所有的隱藏層都配有非線性校正(ReLU (Krizhevsky et al., 2012))。我們有注意到,我們的網路(一個除外)並不包含Local Response Normalisation (LRN)正規化(Krizhevsky et al., 2012):將在Sec. 4中說明,這類的正規化並不能在ILSVRC資料集上得到效能的提高,但是會導致記憶體的消耗以計算時間的增加。在適用的情況下,LRN層的參數為(Krizhevsky et al., 2012)(AlexNet)。 ::: ### 2.2 CONFIGURATION :::info The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512. 
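:::

:::warning
依照上面 2.1 與 2.2 的敘述(中文翻譯在下方綠色區塊),我自己用一小段 Python 把配置 D(即 VGG-16)的層配置列出來,並順手計算參數量,可以對照後面 Table 2 的 138M。這只是個人理解的示意寫法,`cfg_D`、`count_params` 等名稱都是自己取的,並非論文的官方實作。

```python
# 配置 D (VGG-16) 的卷積部份:數字代表輸出 channels,'M' 代表 2x2、stride 2 的 max-pooling
cfg_D = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

def count_params(cfg, in_ch=3, in_size=224, num_classes=1000):
    total, ch, spatial = 0, in_ch, in_size
    for v in cfg:
        if v == 'M':
            spatial //= 2                 # max-pooling 使空間解析度減半
        else:
            total += 3 * 3 * ch * v + v   # 3x3 conv (padding=1) 的權重 + bias,空間解析度不變
            ch = v
    # 三層全連接層:4096 -> 4096 -> 1000
    for n_in, n_out in [(spatial * spatial * ch, 4096), (4096, 4096), (4096, num_classes)]:
        total += n_in * n_out + n_out
    return total

print(count_params(cfg_D) / 1e6)  # 約 138.36(百萬),與 Table 2 中配置 D 的 138M 一致
```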
:::

:::success
Table 1中概述了本論文所評估的卷積神經網路配置,每個column一種配置。後續我們將以它們的名稱(A-E)來稱呼這些網路。所有的配置都依循Sect. 2.1中所說明的通用設計,唯一不同處在深度:從網路A的11層權重層(8層卷積與3層全連接層)到網路E的19層權重層(16層卷積與3層全連接層)。卷積層的寬度(channels數)相當小,從第一層的64開始,在每個max-pooling後增加2倍,直到512。
:::

:::info
Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as "conv⟨receptive field size⟩-⟨number of channels⟩". The ReLU activation function is not shown for brevity.

Table 1:卷積神經網路的配置(以columns呈現)。配置的深度從左(A)到右(E)增加(增加的層以粗體顯示)。卷積層參數標記為"conv⟨感受域大小⟩-⟨channels數⟩"。為求版面簡潔,並沒有顯示ReLU啟動函數。
![](https://i.imgur.com/SKe2xxr.png)
:::

:::info
In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).
:::

:::success
在Table 2中,我們報告了每個配置的參數量。儘管深度很大,但我們網路的權重數並不會比擁有較大卷積層寬度與感受域的淺層網路中的權重數還要多(例如(Sermanet et al., 2014)的144M個權重)。
:::

:::info
Table 2: Number of parameters (in millions).
Table 2:參數量(單位:百萬)
![](https://i.imgur.com/oINR4zI.png)
:::

### 2.3 DISCUSSION
:::info
Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014). Rather than using relatively large receptive fields in the first conv. layers (e.g.
11×11 with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)),我們在整個網路中使用非常小(3x3)的感受域,這些感受域與輸入的每個像素進行卷積(步幅~(stride)~為1)。顯而易見的,兩個3x3卷積層(中間沒有空間池化)的堆疊具有5x5感受域的效果;如果三個,那就有7x7感受域的效果。那這麼做我們到底得到什麼(例如,以三個3x3卷積堆疊而不是一個7x7卷積)?首先,我們結合三個(3x3)非[線性整流](https://zh.wikipedia.org/zh-tw/%E7%BA%BF%E6%80%A7%E6%95%B4%E6%B5%81%E5%87%BD%E6%95%B0)層來替代單一個(7x7),這讓整個決策函數更具區別性。第二,我們減少參數的數量,假設三層3x3的卷積堆疊的輸入與輸出都有$C$個channels,這堆疊由$3(3^2C^2)=27C^2$個權重參數化;同時,單一個7x7卷積層將需要$7^2C^2=49C^2$,足足多了81%。這可以視為在7x7的卷積過慮器上施加正規化,強迫它們透過3x3的過濾器[分解](http://terms.naer.edu.tw/detail/930141/)(中間注入非線性)。 ::: :::warning 想法上,用更少的參數(3x3 filter)來達成相同的結果(7x7 filter),某種角度來看也是一種正規化。 ::: :::info The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the 1 × 1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1×1 conv. layers have recently been utilised in the "Network in Network" architecture of Lin et al. (2014). ::: :::success 結合1x1卷積層(配置C,Table 1)是一種在不影響卷積層的感受域情況下,增加決策函數的非線性的一種方法。即使在我們的情況下,1x1卷積本質上是在相同維度空間上的線性投射(輸入與輸出的channels相同),但整流函數會引入其它非線性。值得注意的是,最近在Lin et al. (2014)的"Network in Network"架構中使用了1x1卷積層。 ::: :::info Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance. GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNet (22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model is outperforming that of Szegedy et al. (2014) in terms of the single-network classification accuracy. ::: :::success 小型卷積過濾器已經在先前(2014)被Ciresan等人使用過,但是他們網路的深度明顯小帶我們,而且他們並沒有在大型ILSVRC資料集上評估過。Goodfellow et al. (2014)應用深度卷積網路(11個權重層)在街道號碼辨識任務,並表明深度的增加會導致更好的效能。GoogLeNet(Szegedy et al., 2014),ILSVRC-2014分類任務最佳作品,其開發與我們的工作無關,但類似,GoogLeNet基於非常深的卷積神經網路(22個權重層)與小的卷積過濾器(除了3x3之外,他們還使用1x1與5x5的卷積)。然而,他們的網路拓撲比我們的更複雜,而且在第一層更積極的降低feature maps的空間解析度,以減少計算量。如Sect. 4.5所說明,在單網路分類準確度上面,我們的模型優於Szegedy et al. (2014)。 ::: ## 3 CLASSIFICATION FRAMEWORK :::info In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation. ::: :::success 先前的章節中,我們說明了關於我們網路配置的細節。這個章節,我們將說明卷積神經網路訓練與評估的細節。 ::: ### 3.1 TRAINING :::info The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). 
Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the $L_2$ penalty multiplier set to $5 \cdot 10^{-4}$) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to $10^{−2}$, and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required less epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
:::

:::success
卷積神經網路的訓練過程會依循Krizhevsky et al. (2012)(除了從multi-scale訓練影像中對輸入剪裁進行採樣之外,後續說明)。也就是說,訓練是透過使用momentum的小批量梯度下降(基於反向傳播(LeCun et al., 1989))來最佳化[多項式](http://terms.naer.edu.tw/detail/3216877/)邏輯斯迴歸目標來進行。batch size設置為256,momentum為0.9。訓練透過權重衰減($L_2$懲罰乘數設置為$5 \cdot 10^{-4}$)以及前兩個全連接層執行dropout做正規化(dropout rate設置為0.5)。learning rate初始設置為$10^{−2}$,然後在驗證集準確度不再提升的時候降低10倍。總體來說,learning rate降低3次,學習在370K次迭代(74個epochs)後停止。我們推測,儘管與(Krizhevsky et al., 2012)相比,我們的網路的參數量更大,而且更深,但網路收斂所需的epochs更少,這是因為(a)較大的深度與較小的卷積過濾器尺寸所造成的隱式正規化;(b)某些層的預初始化。
:::

:::warning
dropout的功用在聽李弘毅老師的線上課程的時候有提到,就跟隨機森林一樣概念,每次生成不同的tree,最後ensemble使用。
:::

:::info
The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and $10^{−2}$ variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
:::

:::success
網路權重的初始化是重要的,因為深度網路中梯度的不穩定性,不良的初始化會造成學習停滯。為了避免這個問題,我們從訓練配置A(Table 1)開始,它夠淺,可以用隨機初始化進行訓練。然後,當訓練更深的架構時,我們以網路A的層初始化前四層卷積層與最後三層全連接層(中間層隨機初始化)。我們沒有降低預初始化層的learning rate,而是允許它們在訓練期間改變。對於隨機初始化的部份(如適用),我們會從均值為零且方差為$10^{-2}$的正態分佈中隨機抽樣權重。偏差單元初始化為零。值得注意的是,在論文提交之後,我們發現透過使用[Glorot & Bengio (2010)](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)的隨機初始化程序,就可以在不做預訓練的前提下初始化權重。
:::

:::info
To obtain the fixed-size 224×224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.
:::

:::success
為了獲得固定大小的卷積神經網路輸入影像(224x224),它們從重新縮放的訓練影像中隨機剪裁(每次SGD迭代每張影像一個剪裁)。為了更進一步的增強訓練集,剪裁的部份會進行隨機的水平翻轉與隨機的RGB色彩偏移(Krizhevsky et al., 2012)。訓練影像的重新縮放在下面說明。
:::

:::info
**Training image size.**
Let $S$ be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to $S$ as the training scale). While the crop size is fixed to 224 × 224, in principle $S$ can take on any value not less than 224: for $S = 224$ the crop will capture whole-image statistics, completely spanning the smallest side of a training image; for $S ≫ 224$ the crop will correspond to a small part of the image, containing a small object or an object part. ::: :::success **訓練影像大小。** 假設$S$是isotropically-rescaled(等向縮放?)訓練影像的最小邊,從中剪裁卷積神經網路的輸入(我們也將$S$稱為訓練尺度(比例))。雖然剪裁大小固定為224x224,但原則上$S$可以是任何不小於224的值:當$S=224$時,剪裁會獲得整體影像的統計信息,完全跨過訓練影像的最小邊;當$S >> 224$時,剪裁會對應影像的一小部份,包含一小個物件或物件部份。 ::: :::info We consider two approaches for setting the training scale $S$. The first is to fix S, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics). In our experiments, we evaluated models trained at two fixed scales: $S = 256$ (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and $S = 384$. Given a ConvNet configuration, we first trained the network using $S = 256$. To speed-up training of the $S = 384$ network, it was initialised with the weights pre-trained with $S = 256$, and we used a smaller initial learning rate of $10^{−3}$. ::: :::success 我們考慮兩種方法來設置訓練尺度(比例)$S$。第一種是對應固定single-scale訓練的$S$(注意,採樣剪裁中的影像內容仍然可以表示multi-scale影像統計信息)。在我們的實驗中,我們以兩種固定尺度評估訓練的模型:$S = 256$(現有技術中已廣泛使用(Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014))與$S=384$。給定卷積神經網路的配置,我們首先使用$S=256$訓練網路。為了加快$S=384$的訓練速度,以$S = 256$預訓練的權重初始化,而且使用更小的初始learning rate($10^{-3}$)。 ::: :::warning 可以看到不是224就是384,個人理解是,每次的convolutional之後都會padding回去input dimension,然後再經過pooling減半,而這過程經過五次的pooling,也就是dimension會除掉32,而224跟384也正好都是32的倍數,這樣子在conv的過程中才不會有奇數維度的問題。 ::: :::info The second approach to setting $S$ is multi-scale training, where each training image is individually rescaled by randomly sampling $S$ from a certain range $[S_\min, S_\max]$ (we used $S_\min = 256$ and $S_\max = 512$). Since objects in images can be of different size, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales. For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed $S = 384$. ::: :::success 第二種方法是設置$S$為multi-scale訓練,其中每一張訓練影像從某一個區間($[S_\min, S_\max]$)中各別以隨機採樣的$S$重新縮放(我們使用$S_\min = 256$,$S_\max = 512$)。由於影像中的物件可以擁有不同的大小,因此在訓練過程中考慮這點是有好處的。這可以視為是利用scale jittering(尺度抖動?)做訓練集增強,其中訓練單一模型在大的範圍尺度內辨識物件。出於速度原因,我們透過使用相同配置的single-scale模型對所有的層做微調來訓練多尺度模型,並使用固定$S=384$來做預訓練。 ::: :::warning scale jittering,大致理解就是影像的短邊縮放到設置的範圍區間,然後以固定的剪裁尺寸去隨機剪一張影像這樣,假設輸入dimension為224x224,那就是將影像短邊縮放到\[256, 512\]間隨機一個值,然後從這張縮放影像剪一張224x224的影像做為訓練影像。 ::: ### 3.2 TESTING :::info At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as $Q$ (we also refer to it as the test scale). We note that $Q$ is not necessarily equal to the training scale $S$ (as we will show in Sect. 4, using several values of $Q$ for each $S$ leads to improved performance). 
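:::

:::warning
針對上面 3.1 的 multi-scale 訓練(scale jittering),我用 Python 寫了一個自己理解的資料增強示意:每次迭代隨機取 $S \in [256, 512]$、等向縮放讓最短邊等於 $S$、再隨機剪裁 224x224 並隨機水平翻轉。這裡假設用 Pillow 處理影像,且省略了 RGB colour shift 與「減去訓練集 RGB 均值」的步驟,並非論文的 Caffe 實作。

```python
import random
import numpy as np
from PIL import Image

def sample_train_crop(img, s_min=256, s_max=512, crop=224):
    S = random.randint(s_min, s_max)                     # 每次迭代隨機取訓練尺度 S
    w, h = img.size
    scale = S / min(w, h)                                # 等向縮放:讓最短邊 = S
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    x0 = random.randint(0, w - crop)                     # 隨機剪裁 224x224
    y0 = random.randint(0, h - crop)
    patch = img.crop((x0, y0, x0 + crop, y0 + crop))
    if random.random() < 0.5:                            # 隨機水平翻轉
        patch = patch.transpose(Image.FLIP_LEFT_RIGHT)
    return np.asarray(patch, dtype=np.float32)           # 實際上還要再減去訓練集的 RGB 均值
```
:::

:::info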
Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled). We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image. ::: :::success 在測試時,給定一個訓練好的卷積神經網路以輸入影像,以下面方式進行分類。首先,它被isotropically rescaled(等向縮放?)到預定義的最小影像邊,標記為$Q$(我們也將它稱為測試尺度)。我們注意到,$Q$並沒有一定要等於訓練尺度$S$(我們將在Sect. 4中說明,對每個$S$使用多個$Q$的值可以提高效能)。然後,以類似於(Sermanet et al., 2014)的作法將重新縮放的測試影像密集地應用在網路上。也就是說,全連接層首先轉換為卷積層(第一個FC layer為7x7卷積,最後兩個FC layer為1x1卷積)。然後將得到的全完卷積網路應用到全部(未剪裁)影像。結果是一個類別分數圖,其channel的數量等同於類別數量,並依據輸入影像大小變化[空間解析度](http://terms.naer.edu.tw/detail/3123223/)。最後,為了獲得影像的類別分數的固定大小向量,其類別分數圖在空間上是平均的(sum-pooled)。我們還利用水平翻轉影像來增強測試集;將原始影像與翻轉影像的soft-max類別後驗進行平均,以獲得影像的最終分數。 ::: :::warning 觀察VGG16的最後一個卷積層的output,它最後的output是7x7x512,因此將第一個FC調整為7x7的卷積,其output就是1x1,後面再接上兩層1x1卷積,替換掉之後就是所謂的FCN(Fully-Convolutional Network),這麼做的好處在於可以處理各種大小的輸入,這也是為什麼你在使用keras的預訓練模型時,如果要`include_top`的話,那`input_shape`就一定要固定(224x224x3)的原因。 以224x224x3的影像做為輸入,最終的output是1x1x1000,但如果是以384為輸入影像,那最終的output就會是6x6x1000,然後再經過softmax。 ::: :::info Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net. Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured. While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (5 × 5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to 144 crops over 4 scales used by Szegedy et al. (2014). ::: :::success 因為將完全卷積網路應用於全部的影像上,因此不需要在測試時做多剪裁採樣(Krizhevsky et al., 2012),採樣的話效率低,因為每一個剪裁都需要網路重新計算。同時,如 Szegedy et al. (2014)所做,使用大量剪裁可以提高準確度,因為與完全卷積網路相比,它可以對輸入影像做更精細的採樣。同樣的,由於不同的卷積邊界條件,多剪裁(multi-crop)評估與密集評估是[互補](http://terms.naer.edu.tw/detail/656063/)的:將卷積神經網路應用於剪裁的時候,卷積的feature maps以零填充,儘管在密集評估的情況下,相同剪裁的填充自然而然的來自影像的相鄰部份(由於卷積與空間的池化),這大大的增加整個網路的感受域,因此獲得更多的上下文。儘管我們認為在實作中,增加多剪裁(multiple crops)的計算時間並不能證明準確度的的潛在提高,但作為參考,我們也是使用每個尺度縮放50個剪裁來評估我們的網路,在3個縮放尺度上總共150個剪裁(5x5常規網格,2次翻轉),相當於Szegedy et al. 
(2014)使用的4個縮放尺度上144個剪裁。 ::: :::warning dense evaluation、multi-crop evaluation的手法似乎是FCN來的,後續可以閱讀相關論文。 ::: ### 3.3 IMPLEMENTATION DETAILS :::info Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above). Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU. ::: :::success 我們的實驗[衍生自](http://terms.naer.edu.tw/detail/6566267/)可公開使用的 C++ Caffe toolbox (Jia, 2013)(branched out in December 2013),但包含許多重大改版,讓我們可以在單一系統中安裝多GPUs進行訓練與評估,以及以多尺度(如上所述)的全尺寸(未剪裁)影像做訓練與評估。多GPU利用資料的[平行性](http://terms.naer.edu.tw/detail/264282/),而且透過將每個訓練影像batch分成多個GPU batches,在每個GPU上平行處理。在GPU batch梯度計算完成之後,它會平均所有batch獲得的梯度。梯度計算在GPUs之間是同步的,因此結果與在單一GPU上會完全相同。 ::: :::info While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU. On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture. ::: :::success 雖然最近提出更複雜的方法來加速卷積神經網路的訓練(Krizhevsky, 2014),這方法對網路的不同層採用模型與資料平行,我們發現到,[概念上](https://tw.voicetube.com/definition/conceptually)而言,我們已經提供一個與使用單GPU相比,在[現成](http://terms.naer.edu.tw/detail/1283425/)4-GPU系統上能加速到3.75倍的更簡單的方案。在配有四張NVIDIA Titan Black GPUs的系統上,依不同的架構,訓練單一網路需要2-3週。 ::: ## 4 CLASSIFICATION EXPERIMENTS :::info **Dataset.** In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges). The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). The classification performance is evaluated using two measures: the top-1 and top-5 error. The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories. ::: :::success **資料集。** 在這一節,我們說明透過在ILSVRC-2012資料集上介紹的卷積神經網路架構所實現的影像分類結果(用於ILSVRC 2012–2014挑戰)。資料集包含1000個類別的影像,而且分割為三組:訓練(1.3M),驗證(50K)與測試(100K(具類別標籤影像))。使用兩種方法評估分類效能:top-1與top-5誤差。前者是多類別分類誤差,即錯誤分類影像的比例;後者是ILSVRC中使用的主要評估標準,而且按影像計算[基準真相](https://zh.wikipedia.org/wiki/Ground_Truth)類別不存在top-5預測類別的比例(大概就是計算實際類別不存在top-5的比例)。 ::: :::info For the majority of experiments, we used the validation set as the test set. Certain experiments were also carried out on the test set and submitted to the official ILSVRC server as a "VGG" team entry to the ILSVRC-2014 competition (Russakovsky et al., 2014). 
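:::

:::warning
前面 **Dataset** 段落提到的 top-1 / top-5 error,我用 NumPy 寫個簡單的計算示意(純屬個人理解的版本,`topk_error` 這個函數名稱與隨機測資都是自己假設的):

```python
import numpy as np

def topk_error(scores, labels, k=5):
    # scores: (N, 1000) 的類別分數;labels: (N,) 的實際(ground-truth)類別
    topk = np.argsort(-scores, axis=1)[:, :k]        # 每張影像分數最高的 k 個預測類別
    hit = (topk == labels[:, None]).any(axis=1)      # 實際類別是否落在 top-k 之中
    return 1.0 - hit.mean()                          # top-k error = 未命中的影像比例

scores = np.random.rand(4, 1000)
labels = np.array([3, 10, 999, 0])
print(topk_error(scores, labels, k=1), topk_error(scores, labels, k=5))
```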
::: :::success 對於多數的實驗,我們使用驗證集做為測試集。測試集上還做了某些實驗,而且提交ILSVRC官方伺服哈,做為"VGG"小組的參加ILSVRC-2014競賽的作品(Russakovsky et al., 2014)。 ::: ### 4.1 SINGLE SCALE EVALUATION :::info We begin with evaluating the performance of individual ConvNet models at a single scale with the layer configurations described in Sect. 2.2. The test image size was set as follows: $Q = S$ for fixed $S$, and $Q = 0.5(S_\min + S_\max)$ for jittered $S \in [S_\min, S_\max]$. The results of are shown in Table 3. ::: :::success 我們首先以single scale來評估各別卷積神經網路模型的效能,相關層的配置見Sect. 2.2說明。測試影像大小設置如下:對於固定$S$的$Q = S$,然後對於抖動的$S \in [S_\min, S_\max]$,設置$Q = 0.5(S_\min + S_\max)$,結果請見Table 3.。 ::: :::info Table 3: ConvNet performance at a single test scale. Table 3:單一測試尺度的卷積神經網路效能 ![](https://i.imgur.com/hZUzC64.png) ::: :::info First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers. We thus do not employ normalisation in the deeper architectures (B–E). ::: :::success 首先,我們注意到,在模型A沒有任何正規化層的情況下使用local response normalisation(A-LRN network)並不能提高效能。因此,我們在較深的架構(B-E)並沒有採用正規化。 ::: :::info Second, we observe that the classification error decreases with the increased ConvNet depth: from 11 layers in A to 19 layers in E. Notably, in spite of the same depth, the configuration C (which contains three 1 × 1 conv. layers), performs worse than the configuration D, which uses 3 × 3 conv. layers throughout the network. This indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C). The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets. We also compared the net B with a shallow net with five 5 × 5 conv. layers, which was derived from B by replacing each pair of 3 × 3 conv. layers with a single 5 × 5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters. ::: :::success 第二,我們觀察到分類誤差會隨著深度的增加而減少:從A的11層到E的19層。特別是,儘管有著相同的深度,配置C(包含三個1x1卷積)的效能比配置D還要差(整個網路都使用3x3卷積)。這指出了,雖然額外的非線性是有幫助的(C比B好),但透過使用具有non-trivial感受域的卷積過濾器來補獲空間上下文也是很重要的(D比C好)。當深度達到19層的時候,我們的架構的誤差率就飽合了,但是,更深的模型可能對更大的資料集是更有利的。我們還比較一個衍生自B的淺層網路(五個5x5卷積層),將B內成對的3x3卷積層以一個5x5卷積層替代(見Sect. 2.3中說明,擁有相同的感受域)。淺層網路的top-1誤差率比B還要高7%(中心剪裁上),這證明了,擁有小型過濾器的深度網路會比擁有大型過濾器的的淺層網路還要好。 ::: :::info Finally, scale jittering at training time ($S \in [256; 512]$) leads to significantly better results than training on images with fixed smallest side ($S = 256$ or $S = 384$), even though a single scale is used at test time. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics. ::: :::success 最後,訓練期間的scale jittering(尺度抖動?)($S \in [256; 512]$)明顯比固定最小邊($S = 256$ or $S = 384$)有更好的結果,即使測試的時候使用single scale。這證明透過scale jittering(尺度抖動?)做訓練集增強確實對補獲multi-scale影像統計信息有幫助。 ::: ### 4.2 MULTI-SCALE EVALUATION :::info Having evaluated the ConvNet models at a single scale, we now assess the effect of scale jittering at test time. It consists of running a model over several rescaled versions of a test image (corresponding to different values of $Q$), followed by averaging the resulting class posteriors. 
Considering that a large discrepancy between training and testing scales leads to a drop in performance, the models trained with fixed $S$ were evaluated over three test image sizes, close to the training one: $Q = \{S − 32, S, S + 32\}$. At the same time, scale jittering at training time allows the network to be applied to a wider range of scales at test time, so the model trained with variable $S \in [S_\min; S_\max]$ was evaluated over a larger range of sizes $Q = \{S_\min, 0.5(S_\min + S_\max), S_\max\}$.
:::

:::success
在single scale上評估完卷積神經網路模型之後,我們現在來評估scale jittering(尺度抖動)對測試時的影響。它包括在多個重新縮放版本的測試影像上執行模型(對應不同的$Q$值),然後對得到的類別[後驗](http://terms.naer.edu.tw/detail/3189962/)計算平均。考量到訓練與測試尺度之間的巨大差異會導致效能下降,以固定$S$訓練的模型,在接近訓練尺度的三個測試影像大小上評估:$Q = \{S − 32, S, S + 32\}$。同時,訓練期間的scale jittering允許網路在測試期間應用於更大範圍的尺度,因此以可變$S \in [S_\min; S_\max]$訓練的模型,在更大範圍的尺度$Q = \{S_\min, 0.5(S_\min + S_\max), S_\max\}$上評估。
:::

:::info
The results, presented in Table 4, indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale, shown in Table 3). As before, the deepest configurations (D and E) perform the best, and scale jittering is better than training with a fixed smallest side $S$. Our best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error (highlighted in bold in Table 4). On the test set, the configuration E achieves 7.3% top-5 error.
:::

:::success
結果於Table 4中說明,測試期間的scale jittering(尺度抖動)會帶來更好的效能(相較於以single scale評估相同模型,如Table 3所示)。一如以往,最深的配置(D與E)表現最好,而且scale jittering(尺度抖動)會比以固定最小邊$S$訓練來得好。我們在驗證集上的最佳單一網路效能為24.8%/7.5%(top-1/top-5)誤差(在Table 4中以粗體顯示)。在測試集上,配置E達到7.3%的top-5誤差。
:::

:::info
Table 4: ConvNet performance at multiple test scales.
Table 4: 卷積神經網路在多個測試尺度上的效能
![](https://i.imgur.com/skJm6Py.png)
:::

### 4.3 MULTI-CROP EVALUATION
:::info
In Table 5 we compare dense ConvNet evaluation with multi-crop evaluation (see Sect. 3.2 for details). We also assess the complementarity of the two evaluation techniques by averaging their softmax outputs. As can be seen, using multiple crops performs slightly better than dense evaluation, and the two approaches are indeed complementary, as their combination outperforms each of them. As noted above, we hypothesize that this is due to a different treatment of convolution boundary conditions.
:::

:::success
在Table 5中,我們比較dense卷積神經網路評估與multi-crop(多剪裁)評估(細節請見Sect. 3.2)。我們還透過平均它們的softmax輸出來評估這兩種評估技術的[互補性](http://terms.naer.edu.tw/detail/559051/)。可以看得出來,使用多剪裁的效能略比dense評估還要好,而且兩種方法確實是[相輔相成](http://terms.naer.edu.tw/detail/559051/)的,因為它們的組合優於各自單獨使用。如上所述,我們假設這是由於對卷積邊界條件的不同處理。
:::

:::info
Table 5: ConvNet evaluation techniques comparison. In all experiments the training scale $S$ was sampled from \[256; 512\], and three test scales $Q$ were considered: {256, 384, 512}.
Table 5:卷積神經網路評估技術的比較。所有實驗中的訓練尺度$S$都是由\[256; 512\]採樣,且三個測試尺度$Q$為{256, 384, 512}。
![](https://i.imgur.com/hkvdHbP.png)
:::

### 4.4 CONVNET FUSION
:::info
Up until now, we evaluated the performance of individual ConvNet models. In this part of the experiments, we combine the outputs of several models by averaging their soft-max class posteriors. This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
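:::

:::warning
這裡的 ConvNet fusion 就是把多個模型的 soft-max 類別後驗取平均之後再分類。下面是我用 NumPy 寫的簡化示意(假設已經拿到各模型對同一批影像的輸出 logits,變數與函數名稱都是自己假設的):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # 先減去最大值,維持數值穩定
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ensemble_predict(logits_list):
    # logits_list:多個模型對同一批影像的輸出,每個為 (N, 1000)
    posteriors = [softmax(z) for z in logits_list]
    avg = np.mean(posteriors, axis=0)           # 平均各模型的 soft-max 類別後驗
    return avg.argmax(axis=1)                   # 取平均後驗最大者作為預測類別

preds = ensemble_predict([np.random.randn(2, 1000) for _ in range(7)])  # 例:7 個模型的組合
```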
::: :::success 目前為止,我們評估了各別的卷積神經網路模型的效能。在這部份的實驗中,我們透過平均模型的softmax類別後驗來結合多個模型的輸出。由於模型的互補性,這提高了效能,並且在2012 (Krizhevsky et al., 2012)與2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014)的頂級提交中使用。 ::: :::info The results are shown in Table 6. By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers). The resulting ensemble of 7 networks has 7.3% ILSVRC test error. After the submission, we considered an ensemble of only two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0% using dense evaluation and 6.8% using combined dense and multi-crop evaluation. For reference, our best-performing single model achieves 7.1% error (model E, Table 5). ::: :::success 結果於Table 6中說明。到我們提交ILSVRC的時候,我們只訓練single-scale網路,還有一個multi-scale模型D(僅對全連接層微調)。這七個網路的組合(ensemble)結果為7.3%的ILSVRC測試誤差。在提交之後,我們僅考慮兩個最佳效能的multi-scale模型的組合(D與E),使用dense evaluation,測試誤差來到7.0%,而結合dense與multi-crop evaluation則是來到6.8%。作為參考,我們的最佳效能單一模型達到7.1%的誤差(E, Table 5)。 ::: :::info Table 6: Multiple ConvNet fusion results. ![](https://i.imgur.com/uswD61K.png) ::: ### 4.5 COMPARISON WITH THE STATE OF THE ART :::info Finally, we compare our results with the state of the art in Table 7. In the classification task of ILSVRC-2014 challenge (Russakovsky et al., 2014), our "VGG" team secured the 2nd place with 7.3% test error using an ensemble of 7 models. After the submission, we decreased the error rate to 6.8% using an ensemble of 2 models. ::: :::success 最後,將我們的結果與Table 7中的最新技術做比較。在ILSVRC-2014挑戰賽(Russakovsky et al., 2014)的分類任務中,我們的"VGG"小組使用7個模型的組合,以7.3%的測試誤差得到第二名。在提交之後,我們使用兩個模型的組合將誤差率降低6.8%。 ::: :::info As can be seen from Table 7, our very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it. This is remarkable, considering that our best result is achieved by combining just two models – significantly less than used in most ILSVRC submissions. In terms of the single-net performance, our architecture achieves the best result (7.0% test error), outperforming a single GoogLeNet by 0.9%. Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth. ::: :::success 從Table 7可以看的出來,我們的深度卷積神經網路明顯優於上一代的模型(ILSVRC-2012與ILSVRC-2013競賽中得到最佳結果)。我們的結果與分類任務的獲勝者(GoogLeNet with 6.7% error)相比也是也有競爭力的,本質上還優於ILSVRC-2013獲券者Clarifai(在外部訓練資料協助下來到11.2%的誤差,沒有外部訓練資料則來到11.7%)。這是很厲害的,考慮到我們的最佳結果只結合兩個模型,明顯低於多數ILSVRC提交所使用的數量。就單一網路的效能而言,我們的架構實現最佳結構(7.0%測試誤差),優化單一GoogLeNet0.9%。特別是,我們並沒有脫離LeCun et al. (1989)的經典架構,而是透過大幅增加深度來改進它。 ::: :::info Table 7: Comparison with the state of the art in ILSVRC classification. Our method is denoted as “VGG”. Only the results obtained without outside training data are reported. ![](https://i.imgur.com/QiXG2Wn.png) ::: ## 5 CONCLUSION :::info In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. 
It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results yet again confirm the importance of depth in visual representations. ::: :::success 這項工作中,我們評估了非常深的卷積神經網路(最多19個權重層),用於大規模的影像分類。這證明了,代表深度對分類的準度確是有利的,而且可以使用一般的卷積神經網路架構(LeCun et al., 1989; Krizhevsky et al., 2012)並顯著的加深深度來實現ImageNet挑戰賽資料集上的最佳效能。在附錄中,我們還說明我們模型可以很好的泛化到各種任務與資料集,匹配或優於圍繞深度較少的影像表示所建置的更複雜的辨識管道。我們的結果再次的證明深度在視覺表示的重要性。 ::: ## A LOCALISATION :::warning 這邊說明物體定位的結果 ::: :::info In the main body of the paper we have considered the classification task of the ILSVRC challenge, and performed a thorough evaluation of ConvNet architectures of different depth. In this section, we turn to the localisation task of the challenge, which we have won in 2014 with 25.3% error. It can be seen as a special case of object detection, where a single object bounding box should be predicted for each of the top-5 classes, irrespective of the actual number of objects of the class. For this we adopt the approach of Sermanet et al. (2014), the winners of the ILSVRC-2013 localisation challenge, with a few modifications. Our method is described in Sect. A.1 and evaluated in Sect. A.2. ::: :::success 在論文主體中,我們考慮了ILSVRC挑戰賽的分類任務,而且對不同深度的卷積神經網路架構做了全面的評估。這一節中,我們轉而挑戰定位任務,在2014年中以25.3%的誤差贏得比賽。這可以視為物體偵測的一種特殊狀況,這種情況下,不管類別的實際物體數量多少,top-5類別中的每一個類別都應該預測一個物體邊界框。為此,我們採用Sermanet et al. (2014)的方法(ILSVRC-2013挑戰賽定位冠軍),然後做了一些調整。我們的方法在Sect.A.1中說明,在Sect.A.2中評估。 ::: ### A.1 LOCALISATION CONVNET :::info To perform object localisation, we use a very deep ConvNet, where the last fully connected layer predicts the bounding box location instead of the class scores. A bounding box is represented by a 4-D vector storing its center coordinates, width, and height. There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR (Sermanet et al., 2014)) or is class-specific (per-class regression, PCR). In the former case, the last layer is 4-D, while in the latter it is 4000-D (since there are 1000 classes in the dataset). Apart from the last bounding box prediction layer, we use the ConvNet architecture D (Table 1), which contains 16 weight layers and was found to be the best-performing in the classification task (Sect. 4). ::: :::success 為了執行物件的定位,我們使用非常深的卷積神經網路,其中最後一層全連接層預測邊界框的定位,而不是類別分數。邊界框以4維向量來表示,保存其中心座標,寬與高。可以選擇邊界框的預測是所有類別之間的共享(單一類別迴歸,SCR (Sermanet et al., 2014))還是特定於類別(每個類別迴歸,PCR)。基於前者,則最後一層是4維,基於後者,則最後一層是4000維(因為資料集中有1000個類別)。除了最後一個邊界框預測層之外,我們使用卷積神經網路架構D(Table 1),包含16層權重層,而且被認為是分類任務中表示最好的模型(Sect. 4)。 ::: :::info **Training.** Training of localisation ConvNets is similar to that of the classification ConvNets (Sect. 3.1). The main difference is that we replace the logistic regression objective with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth. We trained two localisation models, each on a single scale: $S = 256$ and $S = 384$ (due to the time constraints, we did not use training scale jittering for our ILSVRC-2014 submission). 
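:::

:::warning
上面提到定位模型把 logistic regression objective 換成 Euclidean loss,懲罰預測的邊界框參數(中心座標、寬、高共 4 維)與 ground truth 的偏差。下面是我自己理解的最小示意:0.5 係數與對 batch 取平均的方式是我的假設,論文並未給出確切公式;PCR 的 4000 維輸出如何對應各類別也只是示意寫法。

```python
import numpy as np

def euclidean_bbox_loss(pred, gt):
    # pred, gt: (N, 4),每列為 (cx, cy, w, h) 的邊界框參數
    return 0.5 * np.sum((pred - gt) ** 2, axis=1).mean()

def select_class_bbox(out_4000, cls):
    # PCR(per-class regression)時最後一層輸出 4000 維(假設為 1000 類 x 4 維的排列),
    # 先取出各影像對應類別的那 4 維,再計算上面的 Euclidean loss
    return out_4000.reshape(-1, 1000, 4)[np.arange(len(cls)), cls]
```
:::

:::info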
Training was initialised with the corresponding classification models (trained on the same scales), and the initial learning rate was set to $10^{−3}$ . We explored both fine-tuning all layers and fine-tuning only the first two fully-connected layers, as done in (Sermanet et al., 2014). The last fully-connected layer was initialised randomly and trained from scratch ::: :::success **訓練。** 訓練定位卷積神經網路類似於訓練分類卷積神經網路(Sect. 3.1)。主要的問題是我們以歐幾里德損失取代掉logistic regression objective,這懲罰預測邊界框參數與實際情況的偏差。我們訓練兩個定位模型,每個模型都在單一尺度上:$S = 256$ 與 $S = 384$(因為時間限制,我們在提交ILSVRC-2014時並沒有使用scale jittering(尺度抖動?)訓練)。使用相對應的分類模型初始化訓練(依相同尺度訓練),初始的learning rate設置為$10^{−3}$。我們探索了對所有層的微調以及僅對前兩個全連接層微調(如Sermanet et al., 2014所做)。最後一個連接層是隨機初始化,從頭開始訓練。 ::: :::info **Testing.** We consider two testing protocols. The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth-class (to factor out the classification errors). The bounding box is obtained by applying the network only to the central crop of the image. ::: :::success **測試。** 我們考慮兩個測試協議。第一個用於比較驗證集上不同的網路調整,而且僅考慮實際類別的邊界框預測(以排除分類錯誤)。邊界框是透過將網路應用於影像的中心剪裁而獲得。 ::: :::info The second, fully-fledged, testing procedure is based on the dense application of the localisation ConvNet to the whole image, similarly to the classification task (Sect. 3.2). The difference is that instead of the class score map, the output of the last fully-connected layer is a set of bounding box predictions. To come up with the final prediction, we utilise the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions (by averaging their coordinates), and then rates them based on the class scores, obtained from the classification ConvNet. When several localisation ConvNets are used, we first take the union of their sets of bounding box predictions, and then run the merging procedure on the union. We did not use the multiple pooling offsets technique of Sermanet et al. (2014), which increases the spatial resolution of the bounding box predictions and can further improve the results. ::: :::success 第二個協議,完全成熟的測試過程,基於定位卷積神經網路在整個影像上的密集應用,類似於分類任務(Sect. 3.2)。差異在於,最後一個連全接層的輸出是一組邊界框的預測,而不是類別分數的映射。為了得到最終的預測,我們使用greedy merging procedure(貪婪合併程序?)(Sermanet et al. (2014)),它首先合併空間上接近的預測(通過對座標計算平均),然後根據從分類卷積神經網路中獲得的分數進行評分。當使用多個定位的卷積神經網路時,我們首先取得邊界框預測集的聯集,然後在聯集上執行合併程序。我們並沒有使用multiple pooling offsets的技術(Sermanet et al. (2014)),這技術可以增加邊界框預測的空間解析度,而且可以進一步的改善結果。 ::: ### A.2 LOCALISATION EXPERIMENTS :::info In this section we first determine the best-performing localisation setting (using the first test protocol), and then evaluate it in a fully-fledged scenario (the second protocol). The localisation error is measured according to the ILSVRC criterion (Russakovsky et al., 2014), i.e. the bounding box prediction is deemed correct if its intersection over union ratio with the ground-truth bounding box is above 0.5. ::: :::success 這一節中,我們首先確定最佳效能的定位設置(使用第一個測試協議),然後在成熟的場景中(第二個測試協議)對其進行評估。定位誤差的測量是根據ILSVRC標準(Russakovsky et al., 2014),即,如果邊界框與實際邊界框的交集比例高過0.5,那麼邊界框的預測就認定為正確。 ::: :::info **Settings comparison.** As can be seen from Table 8, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), which differs from the findings of Sermanet et al. (2014), where PCR was outperformed by SCR. We also note that fine-tuning all layers for the localisation task leads to noticeably better results than fine-tuning only the fully-connected layers (as done in (Sermanet et al., 2014)). 
In these experiments, the smallest images side was set to $S = 384$; the results with $S = 256$ exhibit the same behaviour and are not shown for brevity. ::: :::success **設置比較。** 從Table 8可以看的出來,per-class regression (PCR)優於class-agnostic single-class regression (SCR),這與Sermanet et al. (2014)所發現的不同(PCR優於SCR這件事不同)。我們還注意到,針對定位任務而言,對全部的層做微調所得到的結果會比單純對全連接層做微調還要好(如(Sermanet et al., 2014)所做)。在這些實驗中,最小的影像邊設置為$S=384$;其結果與$S = 256$所表現相同,為了版面簡潔而未呈現。 ::: :::info Table 8: Localisation error for different modifications with the simplified testing protocol: the bounding box is predicted from a single central image crop, and the ground-truth class is used. All ConvNet layers (except for the last one) have the configuration D (Table 1), while the last layer performs either single-class regression (SCR) or per-class regression (PCR). Table 8:使用簡化的測試協議做不同調整的定位誤差:邊界框由單個中心影像剪裁所預測,並使用實際類別。所有卷積神經網路層(最後一層除外)都使用配置D(Table 1),而最後一層執行single-class regression (SCR)或per-class regression (PCR)。 ![](https://i.imgur.com/gObFct7.png) ::: :::info **Fully-fledged evaluation.** Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014). As can be seen from Table 9, application of the localisation ConvNet to the whole image substantially improves the results compared to using a center crop (Table 8), despite using the top-5 predicted class labels instead of the ground truth. Similarly to the classification task (Sect. 4), testing at several scales and combining the predictions of multiple networks further improves the performance. ::: :::success **成熟的評估。** 確定最佳定位設置之後(PCR, fine-tuning of all layers),現在我們要將它應用在成熟的場景上,其中top-5類別標記使用我們最佳效能的分類系統來預測(Sect. 4.5),並且使用Sermanet et al. (2014)的方法合併多個densely-computed邊界框的預測。從Table 9可以看的出來,與使用中心剪裁相比(Table 8),儘管使用top-5預測類別標記,而不是實際的標記,將定位的卷積神經網路應用於整個影像是可以明顯的改善結果。類似於分類任務(Sect 4.),測試多個尺度並組合多個網路的預測可以進一步提高效能。 ::: :::info Table 9: Localisation error Table 9:定位誤差 ![](https://i.imgur.com/vCvyKrY.png) ::: :::info **Comparison with the state of the art.** We compare our best localisation result with the state of the art in Table 10. With 25.3% test error, our “VGG” team won the localisation challenge of ILSVRC-2014 (Russakovsky et al., 2014). Notably, our results are considerably better than those of the ILSVRC-2013 winner Overfeat (Sermanet et al., 2014), even though we used less scales and did not employ their resolution enhancement technique. We envisage that better localisation performance can be achieved if this technique is incorporated into our method. This indicates the performance advancement brought by our very deep ConvNets – we got better results with a simpler localisation method, but a more powerful representation. ::: :::success **與最新技術相比。** Table 10中我們比較了我們的最佳定位結果與目前最新技術。我們的"VGG"團隊以25.3%測試誤差贏得ILSVRC-2014定位挑戰賽(Russakovsky et al., 2014)。值得注意的是,即使我們使用較小的尺度,而且沒有使用他們的解析度增強技術,我們的結果依然比ILSVRC-2013冠軍Overfeat (Sermanet et al., 2014)還要好。我們設想,如果將這個技術併入我們的方法,那就可以達到更好的定位效能。這表明了,我們的深度卷積神經網路帶來了效能上的提高-我們用更簡單的定位方法,但有更強力的表示能力,因此得到更好的結果。 ::: :::info Table 10: Comparison with the state of the art in ILSVRC localisation. Our method is denoted as "VGG". 
Table 10:比較ILSVRC定位中的最佳技術。我們的方法標記為"VGG" ![](https://i.imgur.com/P28KDYY.png) ::: ## B GENERALISATION OF VERY DEEP FEATURES :::warning 這邊說明transfer learning,將訓練好的模型應用到其它小型資料集上,觀察其泛化狀況。 ::: :::info In the previous sections we have discussed training and evaluation of very deep ConvNets on the ILSVRC dataset. In this section, we evaluate our ConvNets, pre-trained on ILSVRC, as feature extractors on other, smaller, datasets, where training large models from scratch is not feasible due to over-fitting. Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin. Following that line of work, we investigate if our models lead to better performance than more shallow models utilised in the state-of-the-art methods. In this evaluation, we consider two models with the best classification performance on ILSVRC (Sect. 4) – configurations “Net-D” and “Net-E” (which we made publicly available). ::: :::success 前幾個章節中,我們討論在ILSVRC資料集上非常深的深度卷積神經網路的訓練與評估。這一節中,我們會評估在ILSVRC上預訓練的卷積神經網路,做為其它較小的資料集的特徵提取器(由於過擬點,小資料集無法在大型模型上重頭訓練)。最近,人們對這種情況有很大的興趣(Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014),我們發現,訓練於ILSVRC上,其深度影像的representations可以很好的泛化到其它的資料集,其很大程度上優於手工製做的representations。依循著這個工作,我們調查我們的模型是否比最新技術中使用的更淺層的模型能夠帶來更好的效能。在這個評估中,我們考慮ILSVRC上(Sect. 4)兩個最佳分類效能的模型配置"網路-D"與"網路-E"(我們已經公開)。 ::: :::info To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales. The resulting image descriptor is $L_2$-normalised and combined with a linear SVM classifier, trained on the target dataset. For simplicity, pre-trained ConvNet weights are kept fixed (no fine-tuning is performed). ::: :::success 為了利用在ILSVRC上預測試的卷積神經網路對其它資料集做影像分類,我們移除最後一層全連接層(執行1000-way ILSVRC分類),然後使用倒數第二層的4096維啟動(激活?)作為影像特徵,這跨多個位置與尺度做彙總。得到的影像[描述符](http://terms.naer.edu.tw/detail/6607408/)經過$L_2$正規化,並且與在目標資料集上做訓練的線性SVM分類器結合。為了簡單起見,預訓練的卷積神經網路權重保持固定(不執行微調) ::: :::info Aggregation of features is carried out in a similar manner to our ILSVRC evaluation procedure (Sect. 3.2). Namely, an image is first rescaled so that its smallest side equals $Q$, and then the network is densely applied over the image plane (which is possible when all weight layers are treated as convolutional). We then perform global average pooling on the resulting feature map, which produces a 4096-D image descriptor. The descriptor is then averaged with the descriptor of a horizontally flipped image. As was shown in Sect. 4.2, evaluation over multiple scales is beneficial, so we extract features over several scales $Q$. The resulting multi-scale features can be either stacked or pooled across scales. Stacking allows a subsequent classifier to learn how to optimally combine image statistics over a range of scales; this, however, comes at the cost of the increased descriptor dimensionality. We return to the discussion of this design choice in the experiments below. We also assess late fusion of features, computed using two networks, which is performed by stacking their respective image descriptors. 
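:::

:::warning
附錄 B 的作法大致是:拿掉最後一層全連接層、取倒數第二層的 4096 維啟動值當影像描述符,經 $L_2$ 正規化後再訓練線性 SVM(預訓練權重固定)。下面用 NumPy + scikit-learn 的 LinearSVC 寫個流程示意;特徵抽取本身沒有寫出來,這裡以隨機陣列代替 4096 維描述符,僅供理解流程,並非論文實作。

```python
import numpy as np
from sklearn.svm import LinearSVC

def l2_normalise(x, eps=1e-10):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def aggregate_scales(feats_per_scale):
    # feats_per_scale:各尺度 Q 下(已含水平翻轉平均)的 4096 維描述符
    return np.mean(feats_per_scale, axis=0)     # 這裡示意「跨尺度平均」;論文在 Caltech 上改用「堆疊」

# 以隨機陣列代替實際由預訓練網路倒數第二層取得的 4096 維描述符
train_feats = l2_normalise(np.random.randn(100, 4096))
train_labels = np.random.randint(0, 20, size=100)
clf = LinearSVC(C=1.0).fit(train_feats, train_labels)   # 預訓練權重固定,只訓練線性 SVM
pred = clf.predict(l2_normalise(np.random.randn(10, 4096)))
```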
::: :::success 執行特徵的[聚合](http://terms.naer.edu.tw/detail/1012749/)(彙總)與我們的ILSVRC評估過程類似(Sect. 3.2)。 也就是說,影像首先重新縮放以使最小邊等於$Q$,然後將網路密集地應用在影像平面上(當所有權重層都視為卷積的時候,這是可能的)。然後,我們將得到的feature map做全域平均池化(global average pooling),產生一個4096維的影像[描述符](http://terms.naer.edu.tw/detail/6607408/)。然後將這個[描述符](http://terms.naer.edu.tw/detail/6607408/)水平翻轉影像的[描述符](http://terms.naer.edu.tw/detail/6607408/)做平均計算。如Sect. 4.2說明,透過multiple scales評估是有利的,因此我們在多個尺度$Q$中提取特徵。得到的multi-scale features可以堆疊或跨尺度合併。堆疊可以讓[後續](http://terms.naer.edu.tw/detail/6561405/)的分類器在一定區間內學習最佳組合影像統計信息,然而,這是以增加[描述符](http://terms.naer.edu.tw/detail/6607408/)維度為代價而得的。下面的實驗中,我們會回來討論這種設計的選擇。我們評估使用兩個網路所計算的特徵的late fusion,這是透過堆疊各自的影像描述符來執行的。 ::: :::warning early fusion/late fusion,early fusion為feature-level,late fusion為score-level。([參考來源](https://zhuanlan.zhihu.com/p/48351805)) ::: :::info **Image Classification on VOC-2007 and VOC-2012.** We begin with the evaluation on the image classification task of PASCAL VOC-2007 and VOC-2012 benchmarks (Everingham et al., 2015). These datasets contain 10K and 22.5K images respectively, and each image is annotated with one or several labels, corresponding to 20 object categories. The VOC organisers provide a pre-defined split into training, validation, and test data (the test data for VOC-2012 is not publicly available; instead, an official evaluation server is provided). Recognition performance is measured using mean average precision (mAP) across classes. ::: :::success **Image Classification on VOC-2007 and VOC-2012。** 我們首先評估PASCAL VOC-2007與VOC-2012基準的影像分類任務(Everingham et al., 2015)。這些資料集分別包含10K與22.5K張影像,每一張影像都標註一個或多個標籤,對應20個物件類別。VOC組織者提供預定義好的分割資料集(訓練、驗證、測試)(VOC-2012的測試資料並沒有公開,而是提供官方評估伺服器)。辨識效能使用各類別的mean average precision(平均精確度的平均)(mAP)來衡量。 ::: :::info Notably, by examining the performance on the validation sets of VOC-2007 and VOC-2012, we found that aggregating image descriptors, computed at multiple scales, by averaging performs similarly to the aggregation by stacking. We hypothesize that this is due to the fact that in the VOC dataset the objects appear over a variety of scales, so there is no particular scale-specific semantics which a classifier could exploit. Since averaging has a benefit of not inflating the descriptor dimensionality, we were able to aggregated image descriptors over a wide range of scales: $Q \in {256, 384, 512, 640, 768}$. It is worth noting though that the improvement over a smaller range of {256, 384, 512} was rather marginal (0.3%). ::: :::success 特別是,透過檢查VOC-2007與VOC-2012驗證集的效能,我們發現到,透過平均在多個尺度上計算的聚合影像描述符的效能與透過堆疊聚合的效能相似。我們假設這是由於以下事實,在VOC資料集中,物件以各種尺度出現,因此,沒有特定的尺度語意可以給分類器使用。因為平均有一個好處,就是不會造成[描述符](http://terms.naer.edu.tw/detail/6607408/)維度的膨脹,因此我們能夠在較大區間的尺度上聚合影像[描述符](http://terms.naer.edu.tw/detail/6607408/):$Q \in {256, 384, 512, 640, 768}$。值得注意的是,在{256, 384, 512}的較小區間內的改善是很小的 (0.3%)。 ::: :::info The test set performance is reported and compared with other approaches in Table 11. Our networks “Net-D” and “Net-E” exhibit identical performance on VOC datasets, and their combination slightly improves the results. Our methods set the new state of the art across image representations, pre-trained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6%. It should be noted that the method of Wei et al. (2014), which achieves 1% better mAP on VOC-2012, is pre-trained on an extended 2000-class ILSVRC dataset, which includes additional 1000 categories, semantically close to those in VOC datasets. 
:::info
The test set performance is reported and compared with other approaches in Table 11. Our networks “Net-D” and “Net-E” exhibit identical performance on VOC datasets, and their combination slightly improves the results. Our methods set the new state of the art across image representations, pre-trained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6%. It should be noted that the method of Wei et al. (2014), which achieves 1% better mAP on VOC-2012, is pre-trained on an extended 2000-class ILSVRC dataset, which includes additional 1000 categories, semantically close to those in VOC datasets. It also benefits from the fusion with an object detection-assisted classification pipeline.
:::

:::success
測試集效能的結果以及與其它方法的比較列於Table 11。我們的網路"Net-D"與"Net-E"在VOC資料集上表現出相同的效能,而且兩者的結合可以稍微改善結果。我們的方法在以ILSVRC資料集預訓練的影像representations中創下新的最佳水準,以超過6%的幅度優於Chatfield et al. (2014)先前的最佳結果。應該注意的是,Wei et al. (2014)的方法在VOC-2012上的mAP比我們高1%,該方法是在擴展為2000類的ILSVRC資料集上預訓練的,其中額外包含1000個語意上接近VOC資料集的類別。它還受益於與物件偵測輔助分類pipeline的融合。
:::

:::info
Table 11: Comparison with the state of the art in image classification on VOC-2007, VOC-2012, Caltech-101, and Caltech-256. Our models are denoted as "VGG". Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (2000 classes).
Table 11:在VOC-2007、VOC-2012、Caltech-101與Caltech-256上影像分類的最新技術比較。我們的模型標記為"VGG"。以\*標記的結果是使用在擴展的ILSVRC資料集(2000個類別)上預訓練的卷積神經網路所達到的。
![](https://i.imgur.com/HJ5dv22.png)
:::

:::info
**Image Classification on Caltech-101 and Caltech-256.** In this section we evaluate very deep features on Caltech-101 (Fei-Fei et al., 2004) and Caltech-256 (Griffin et al., 2007) image classification benchmarks. Caltech-101 contains 9K images labelled into 102 classes (101 object categories and a background class), while Caltech-256 is larger with 31K images and 257 classes. A standard evaluation protocol on these datasets is to generate several random splits into training and test data and report the average recognition performance across the splits, which is measured by the mean class recall (which compensates for a different number of test images per class). Following Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014), on Caltech-101 we generated 3 random splits into training and test data, so that each split contains 30 training images per class, and up to 50 test images per class. On Caltech-256 we also generated 3 splits, each of which contains 60 training images per class (and the rest is used for testing). In each split, 20% of training images were used as a validation set for hyper-parameter selection.
:::

:::success
**Caltech-101與Caltech-256上的影像分類。** 本節中,我們在Caltech-101 (Fei-Fei et al., 2004)與Caltech-256 (Griffin et al., 2007)影像分類基準上評估非常深的特徵。Caltech-101包含9K張已標記影像,共102個類別(101個物件類別與1個背景類別);Caltech-256較大,有31K張影像與257個類別。這些資料集的標準評估協議是,產生多組隨機分割的訓練與測試資料,並回報各分割的平均辨識效能;效能以mean class recall衡量(以補償各類別測試影像數量不同的影響)。依循Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014),在Caltech-101上,我們產生3組隨機分割的訓練與測試資料,每組分割中每個類別包含30張訓練影像,以及每類最多50張測試影像。在Caltech-256上,我們同樣產生3組分割,每組分割中每個類別包含60張訓練影像(其餘用於測試)。每組分割中,20%的訓練影像做為驗證集,用於超參數的選擇。
:::

:::info
We found that unlike VOC, on Caltech datasets the stacking of descriptors, computed over multiple scales, performs better than averaging or max-pooling. This can be explained by the fact that in Caltech images objects typically occupy the whole image, so multi-scale image features are semantically different (capturing the whole object vs. object parts), and stacking allows a classifier to exploit such scale-specific representations. We used three scales $Q \in \{256, 384, 512\}$.
:::

:::success
我們發現,與VOC不同,在Caltech資料集上,堆疊在多個尺度上計算的描述符,其效能比平均(averaging)或max-pooling還要好。這可以用以下事實解釋:Caltech影像中的物件通常佔據整張影像,因此多尺度影像特徵在語義上是不同的(捕捉整個物件 vs. 物件的一部份),而堆疊可以讓分類器利用這類特定尺度(scale-specific)的representations。我們使用三個尺度$Q \in \{256, 384, 512\}$。
:::
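:::warning
以下是個人補充的最小示意(非官方程式),說明Caltech評估協議中mean class recall的計算(各類別recall的平均,可補償各類別測試影像數量不同),以及「每類固定張數訓練影像、其餘至多若干張做測試」的隨機分割;`mean_class_recall`與`random_split_per_class`皆為自行命名的假設性函式。
```python
import numpy as np

def mean_class_recall(y_true, y_pred):
    """y_true、y_pred 為一維類別標籤陣列;回傳各類別 recall 的平均。"""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

def random_split_per_class(labels, n_train=30, max_test=50, seed=0):
    """每個類別隨機取 n_train 張做訓練,其餘(至多 max_test 張)做測試。"""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:n_train + max_test])
    return np.array(train_idx), np.array(test_idx)
```
:::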
:::info
Our models are compared to each other and the state of the art in Table 11. As can be seen, the deeper 19-layer Net-E performs better than the 16-layer Net-D, and their combination further improves the performance. On Caltech-101, our representations are competitive with the approach of He et al. (2014), which, however, performs significantly worse than our nets on VOC-2007. On Caltech-256, our features outperform the state of the art (Chatfield et al., 2014) by a large margin (8.6%).
:::

:::success
Table 11中列出我們的模型彼此之間以及與最新技術的比較。可以看得出來,較深的19層Net-E效能比16層的Net-D好,兩者的結合可進一步提升效能。在Caltech-101上,我們的representations與He et al. (2014)的方法具有競爭力;然而,該方法在VOC-2007上的效能明顯不如我們的網路。在Caltech-256上,我們的特徵以很大的幅度(8.6%)優於現有技術(Chatfield et al., 2014)。
:::

:::info
**Action Classification on VOC-2012.** We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al., 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action. The dataset contains 4.6K training images, labelled into 11 classes. Similarly to the VOC-2012 object classification task, the performance is measured using the mAP. We considered two training settings: (i) computing the ConvNet features on the whole image and ignoring the provided bounding box; (ii) computing the features on the whole image and on the provided bounding box, and stacking them to obtain the final representation. The results are compared to other approaches in Table 12.
:::

:::success
**VOC-2012上的動作分類。** 我們也在PASCAL VOC-2012動作分類任務(Everingham et al., 2015)上評估我們表現最佳的影像representation(Net-D與Net-E特徵的堆疊);這個任務是在給定執行動作之人的邊界框下,從單張影像預測動作類別。資料集包含4.6K張訓練影像,標記為11個類別。與VOC-2012物件分類任務類似,效能以mAP衡量。我們考慮兩種訓練設置:(i)在整張影像上計算卷積神經網路特徵,並忽略所提供的邊界框;(ii)分別在整張影像與所提供的邊界框上計算特徵,並將其堆疊以獲得最終的representation。與其它方法的比較結果列於Table 12。
:::

:::info
Table 12: Comparison with the state of the art in single-image action classification on VOC2012. Our models are denoted as “VGG”. Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (1512 classes).
Table 12:VOC2012上的單一影像動作分類與最新技術的比較。我們的模型標記為"VGG"。標\*的結果是使用在擴展的ILSVRC資料集(1512個類別)上預訓練的卷積神經網路所達到的。
![](https://i.imgur.com/VWrSdao.png)
:::

:::info
Our representation achieves the state of the art on the VOC action classification task even without using the provided bounding boxes, and the results are further improved when using both images and bounding boxes. Unlike other approaches, we did not incorporate any task-specific heuristics, but relied on the representation power of very deep convolutional features.
:::

:::success
即使不使用所提供的邊界框,我們的representation在VOC動作分類任務上也達到最佳水準;同時使用影像與邊界框時,結果可進一步提升。與其它方法不同,我們沒有採用任何特定於任務的啟發式方法,而是仰賴非常深的卷積特徵的表示能力。
:::
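:::warning
以下是個人補充的假設性示意(非官方程式),對應上述兩種設置:`descriptor_fn(image)`沿用前面示意中的多尺度4096維描述符介面(自行假設),邊界框格式假設為(x0, y0, x1, y1),影像假設為PIL.Image。
```python
import numpy as np

def l2_normalise(v):
    return v / (np.linalg.norm(v) + 1e-12)

def action_descriptor(descriptor_fn, image, person_box=None):
    """設置(i):只用整張影像;設置(ii):整張影像與人物邊界框裁切的描述符堆疊。"""
    d_image = descriptor_fn(image)
    if person_box is None:                                      # (i) 忽略邊界框
        return l2_normalise(d_image)
    x0, y0, x1, y1 = person_box
    d_box = descriptor_fn(image.crop((x0, y0, x1, y1)))         # (ii) 邊界框裁切
    return l2_normalise(np.concatenate([d_image, d_box]))       # 堆疊成最終 representation
```
:::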
:::info
**Other Recognition Tasks.** Since the public release of our models, they have been actively used by the research community for a wide range of image recognition tasks, consistently outperforming more shallow representations. For instance, Girshick et al. (2014) achieve the state of the art object detection results by replacing the ConvNet of Krizhevsky et al. (2012) with our 16-layer model. Similar gains over a more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).
:::

:::success
**其它的辨識任務。** 自從我們公開發佈模型以來,它們已被研究界廣泛應用於各種影像辨識任務,並且始終優於較淺層的representations。舉例來說,Girshick et al. (2014)以我們的16層模型取代Krizhevsky et al. (2012)的卷積神經網路,達到當時最佳的物件偵測結果。在語義分割(Long et al., 2014)、影像字幕生成(Kiros et al., 2014; Karpathy & Fei-Fei, 2014)、紋理與材質辨識(Cimpoi et al., 2014; Bell et al., 2014)等任務上,相較於Krizhevsky et al. (2012)的較淺架構,也觀察到類似的增益。
:::