# Deep Residual Learning for Image Recognition(ResNet)(翻譯)
###### tags: `CNN` `論文翻譯` `deeplearning`
>[name=Shaoe.chen] [time=Thu, Feb 24, 2020]

[TOC]

## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
個人註解,任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/pdf/1512.03385v1.pdf)
* [Kaiming He介紹ResNet](http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf)
:::

## Abstract
:::info
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets \[41\] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
:::

:::success
較深的神經網路是難以訓練的。我們提出一個[殘差](http://terms.naer.edu.tw/detail/550265/)學習框架,這可以簡化比之前所使用的網路更深的網路訓練。我們很明確的將layers重新以參照layer inputs的學習殘差函數來表示,而不是學習未參照的函數。我們提供全面性的經驗上的證據來說明,這些殘差網路更容易最佳化,而且可以從大幅增加的深度中取得更高的準確度。在ImageNet資料集上,我們評估深達152層的殘差網路,比VGG網路\[41\]深8倍,但複雜度仍然較低。這些殘差網路的ensemble在ImageNet的測試集上可以得到3.57%的誤差率。這個結果在ILSVRC 2015分類任務上贏得第一名。我們也在CIFAR-10上提出100層與1000層網路的分析。
:::

:::info
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions^1^, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
:::

:::success
representations的深度對於許多視覺辨識任務是十分重要的。光是因為我們這麼深的一個representations就可以讓我們在COCO目標檢測資料集上得到28%的相對改善。深度殘差網路是我們提交到ILSVRC & COCO 2015競賽的基礎,這個競賽我們還贏得ImageNet檢測、ImageNet定位、COCO檢測以及COCO分割等任務的第一名。
:::

:::info
^1^ http://image-net.org/challenges/LSVRC/2015/ and http://mscoco.org/dataset/#detections-challenge2015
:::

## 1. Introduction
:::info
Deep convolutional neural networks \[22, 21\] have led to a series of breakthroughs for image classification \[21, 50, 40\]. Deep networks naturally integrate low/mid/high-level features \[50\] and classifiers in an end-to-end multi-layer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence \[41, 44\] reveals that network depth is of crucial importance, and the leading results \[41, 44, 13, 16\] on the challenging ImageNet dataset \[36\] all exploit “very deep” \[41\] models, with a depth of sixteen \[41\] to thirty \[16\]. Many other nontrivial visual recognition tasks \[8, 12, 7, 32, 27\] have also greatly benefited from very deep models.
:::

:::success
深度卷積神經網路\[22, 21\]引發了影像分類\[21, 50, 40\]一系列的突破。深度網路以[端到端](http://terms.naer.edu.tw/detail/1277823/)的多層方式很自然地整合了低/中/高階特徵\[50\]與分類器,而特徵的"級別~(levels)~"可以透過堆疊的層(深度)的數量來豐富。最新的證據\[41, 44\]顯示,網路的深度是至關重要的,在具挑戰性的ImageNet資料集\[36\]上領先的結果\[41, 44, 13, 16\]都是利用"非常深"\[41\]的模型,深度從16層\[41\]到30層\[16\]。許多其它不平凡的視覺辨識任務\[8, 12, 7, 32, 27\]也都從非常深的模型中受益匪淺。
:::

:::info
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients \[1, 9\], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization \[23, 9, 37, 13\] and intermediate normalization layers \[16\], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation \[22\].
:::

:::success
在深度重要性的驅動之下,出現一個問題:學習更好的網路是否與堆疊更多層一樣的簡單?回答這個問題的障礙就是眾所皆知的梯度消失/爆炸問題\[1, 9\],它從一開始就阻礙收斂。然而,這個問題已經透過normalized initialization \[23, 9, 37, 13\]與intermediate normalization layers \[16\]得到很大的解決,這種方式讓數十層深的網路能夠使用反向傳播做隨機梯度下降(SGD)\[22\]開始收斂。
:::

:::warning
個人觀點:
* intermediate normalization layers: BN
:::

:::info
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in \[11, 42\] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
:::

:::success
當更深層的網路可以開始收斂時,這就曝露出一個退化問題~(degradation problem)~:隨著網路深度的增加,準確度達到飽和(這不足為奇),然後又快速退化。讓人意外的是,這樣的退化問題並不是由過擬合所引起,而且在適當深度的模型上增加更多的層數會導致更高的訓練誤差(如\[11, 42\]所述,且經由我們的實驗驗證)。Fig. 1說明一個經典案例。
:::

:::info
![](https://i.imgur.com/xB38g0O.png)
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.

Figure 1. 20層與56層的"普通"網路,在CIFAR-10上的訓練誤差(左)與測試誤差(右)。較深的網路有較高的訓練誤差,因而測試誤差也較高。ImageNet上也有類似的現象,如Fig. 4所示。
:::

:::info
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).
:::

:::success
訓練準確度退化的問題指出,並非所有的系統都一樣那麼容易最佳化。讓我們考慮一個比較淺的架構跟一個比較深的對應架構(加入更多層)。對這個較深的模型存在著一個構造出來的解:增加的層為[恆等映射](http://terms.naer.edu.tw/detail/2117540/),而其它層是從學習好的較淺模型中複製而來。這個構造解的存在指出,較深的模型所產生的訓練誤差不應該高於其較淺的對應模型。但是實驗顯示,我們目前現有的解答器並不能找到與這個構造解相當或更好的解決方案(或者無法在可行時間內找到)。
:::

:::info
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping.
Formally, denoting the desired underlying mapping as $H(x)$, we let the stacked nonlinear layers fit another mapping of $F(x) := H(x)−x$. The original mapping is recast into $F(x)+x$. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
:::

:::success
這篇論文中,我們透過引入深度殘差學習框架來解決退化問題。我們並不希望每個堆疊的層都直接擬合所需的底層映射,而是明確的讓這些層擬合殘差映射。形式上來說,我們將所需的底層映射表示為$H(x)$,並讓堆疊的非線性層擬合$F(x) := H(x)−x$的另一個映射。原始的映射改寫為$F(x)+x$。我們假設最佳化殘差映射會比最佳化原始未參照的映射還要來的容易。最極端的情況,如果[恆等映射](http://terms.naer.edu.tw/detail/2117540/)是最佳的,那將殘差推到零會比擬合一個利用非線性層堆疊而成的[恆等映射](http://terms.naer.edu.tw/detail/2117540/)還要容易。
:::

:::info
The formulation of $F(x)+x$ can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections \[2, 34, 49\] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe \[19\]) without modifying the solvers.
:::

:::success
公式$F(x)+x$可以利用帶有"shortcut connections"的前饋神經網路來實現(Fig. 2)。shortcut connections\[2, 34, 49\]就是跳過一或多層。在我們的案例中shortcut connections單純的執行[恆等映射](http://terms.naer.edu.tw/detail/2117540/),並將它們的輸出加到堆疊層的輸出中(Fig. 2)。Identity shortcut connections既不會增加額外的參數,也不會增加計算複雜度。整個網路依然可以用SGD利用反向傳播做[端到端](http://terms.naer.edu.tw/detail/1277823/)的訓練,而且只需要用通用套件(像是Caffe \[19\])就可以輕鬆實現,不需要調整解答器。
:::

:::info
![](https://i.imgur.com/tXW2AJz.png)
Figure 2. Residual learning: a building block.

Figure 2. 殘差學習:一個建構區塊。
:::

:::info
We present comprehensive experiments on ImageNet \[36\] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
:::

:::success
我們在ImageNet\[36\]上做了全面性的實驗,以此說明退化問題並評估我們的方法。我們提出:1)我們這麼深的一個殘差網路是非常容易最佳化的,但是對應的"普通"網路(單純的堆疊層)在深度增加時表現出更高的訓練誤差;2)我們的深度殘差網路能夠從深度的增加中輕鬆的得到準確度的提高,產生的結果比起之前的網路都還要來的好。
:::

:::info
Similar phenomena are also shown on the CIFAR-10 set \[20\], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.
:::

:::success
類似的現象也在CIFAR-10資料集\[20\]上發生,這意味著,最佳化的困難與我們所提出的方法的影響並不僅是針對特定的資料集。我們在這個資料集上成功地訓練超過100層的模型,並探索超過1000層的模型。
:::

:::info
On the ImageNet classification dataset \[36\], we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets \[41\]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition.
The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.
:::

:::success
在ImageNet分類資料集\[36\]上,我們利用極深的殘差網路獲得非常亮眼的成績。我們152層的殘差網路是ImageNet上所提出的最深的網路(沒有之一),同時比VGG網路\[41\]有著更低的複雜度。我們的ensemble在ImageNet測試集上有著3.57%的top-5誤差,並在ILSVRC 2015分類競賽中贏得第一名。這麼深的一個representations在其它辨識任務上也有著出色的泛化能力,也讓我們進一步的贏得ILSVRC & COCO 2015的ImageNet detection、ImageNet localization、COCO detection與COCO segmentation第一名。這強力的證據證明了殘差學習原理是通用的,而且我們期待它能夠適用於其它視覺或非視覺的問題。
:::

## 2. Related Work
:::info
**Residual Representations.** In image recognition, VLAD \[18\] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector \[30\] can be formulated as a probabilistic version \[18\] of VLAD. Both of them are powerful shallow representations for image retrieval and classification \[4, 48\]. For vector quantization, encoding residual vectors \[17\] is shown to be more effective than encoding original vectors.
:::

:::success
**Residual Representations.** 在影像辨識中,VLAD\[18\]是透過相對於字典的殘差向量做編碼所得的一個representation,而Fisher Vector \[30\]可以以公式表示為VLAD的機率版本\[18\]。這兩者都是用於影像檢索與分類\[4, 48\]的強大shallow representations~(淺層表示)~。對於[向量量化](http://terms.naer.edu.tw/detail/1288493/),對殘差向量\[17\]編碼顯得比對原始向量編碼還要來的有效。
:::

:::warning
個人見解:
* VLAD: vector of locally aggregated descriptors
* [VLAD paper](https://lear.inrialpes.fr/pubs/2010/JDSP10/jegou_compactimagerepresentation.pdf)
* 可能要讀過才能瞭解encodes by the residual vectors with respect to a dictionary這句話的實際意義
:::

:::info
In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method \[3\] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning \[45, 46\], which relies on variables that represent residual vectors between two scales. It has been shown \[3, 45, 46\] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.
:::

:::success
在[低階](http://terms.naer.edu.tw/detail/6644324/)視覺與電腦[圖形學](http://terms.naer.edu.tw/detail/6621149/)中,為了解[偏微分方程式](http://terms.naer.edu.tw/detail/254317/)(PDEs)的問題,廣泛使用的Multigrid method \[3\]將系統重新以[多重尺度](http://terms.naer.edu.tw/detail/2120113/)的[子問題](http://terms.naer.edu.tw/detail/2125646/)來表示,其中每一個子問題都負責介於較粗糙與較細微尺度之間的殘差解。Multigrid的一個替代方案是階層式基礎預處理\[45, 46\],這取決於代表兩個尺度之間的殘差向量的變數。\[3, 45, 46\]已經說明,這些解答器的收斂速度比未注意到解的殘差性質的標準解答器還要快。這些方法說明了,好的重構或預處理可以簡化最佳化。
:::

:::warning
個人見解:
* [Multigrid method_多重網格法](http://terms.naer.edu.tw/detail/255510/)
:::

:::info
**Shortcut Connections.** Practices and theories that lead to shortcut connections \[2, 34, 49\] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output \[34, 49\]. In \[44, 24\], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients.
The papers of \[39, 38, 31, 47\] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In \[44\], an “inception” layer is composed of a shortcut branch and a few deeper branches.
:::

:::success
**Shortcut Connections.** 引出shortcut connections \[2, 34, 49\]的實踐與理論已經被研究很長一段時間。早期訓練多層感知器(MLPs)的實踐是加入一個從網路輸入連接到輸出的線性層\[34, 49\]。在\[44, 24\]中,一些中間層直接連接到輔助分類器,以解決梯度消失/爆炸的問題。論文\[39, 38, 31, 47\]提出以shortcut connections來實作的方法,用來集中(centering)層的響應、梯度與傳播誤差。在\[44\]中,"inception" layer由一個shortcut branch與一些更深的branch所組成。
:::

:::info
Concurrent with our work, “highway networks” [42, 43] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).
:::

:::success
與我們的工作同時期,"highway networks"\[42, 43\]提出具有gating functions\[15\]的shortcut connections。這些gates是data-dependent~(依靠資料)~而且具有參數,對比我們的identity shortcuts,我們的是沒有參數的。當一個gated shortcut"關閉"的時候(接近零),highway networks中的layers就表示non-residual functions。相反的,我們的公式總是學習殘差函數;我們的identity shortcuts永遠不會關閉,所有的信息都是會傳遞的,並學習額外的殘差函數。除此之外,highway networks還沒有被證明可以從極深的深度中獲得準確度的提升(像是超過100層)。
:::

## 3. Deep Residual Learning
### 3.1. Residual Learning
:::info
Let us consider $H(x)$ as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with $x$ denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions^2^, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., $H(x) − x$ (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate $H(x)$, we explicitly let these layers approximate a residual function $F(x) := H(x) − x$. The original function thus becomes $F(x)+x$. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.
:::

:::success
讓我們將$H(x)$視為由一些堆疊的層(不必是整個網路)所要擬合的底層映射,以$x$表示這些層中第一層的輸入。如果假設多個非線性層可以逐漸地近似複雜的函數^2^,那麼它等同於假設它們可以逐漸地近似殘差函數,即$H(x) − x$(假設輸入與輸出的維度相同)。因此,我們並不是期望堆疊的層去近似$H(x)$,而是讓這些層近似殘差函數$F(x) := H(x) − x$。原始函數因而變成$F(x)+x$。儘管兩者形式都應該可以逐漸地近似所需的函數(如假設的那樣),但學習的難易度可能不一樣。
:::

:::info
This reformulation is motivated by the counter-intuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
:::

:::success
這種重構是由關於退化問題的違反直覺的現象(Fig. 1, 左)所引起。如我們在引言中所討論,如果可以將增加的層構造為[恆等映射](http://terms.naer.edu.tw/detail/2117540/),那麼較深的模型的訓練誤差應該不會比較淺的對應模型來的大。退化問題意味著,解答器在利用多個非線性層來近似[恆等映射](http://terms.naer.edu.tw/detail/2117540/)可能是有困難的。利用殘差學習來重新表示,如果[恆等映射](http://terms.naer.edu.tw/detail/2117540/)是最佳的,那麼解答器只需要將多個非線性層的權重推向零就可以接近[恆等映射](http://terms.naer.edu.tw/detail/2117540/)。
:::
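:::warning
個人補充(非原文):下面用一段PyTorch風格的Python草稿來示意$y = F(x) + x$的分解,以及「把殘差推向零就等同於恆等映射」這個性質。其中的類別與參數名稱(如`TinyResidualBlock`)皆為個人假設,並非論文提供的程式碼。

```python
import torch
import torch.nn as nn

class TinyResidualBlock(nn.Module):
    """示意用的殘差區塊:y = relu(F(x) + x),F為兩層的非線性映射"""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()
        # 將F的最後一層初始化為零:此時F(x)=0,整個區塊一開始就是恆等映射
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        residual = self.fc2(self.relu(self.fc1(x)))  # F(x)
        return self.relu(residual + x)               # F(x) + x,相加後再做非線性

x = torch.randn(4, 8)
block = TinyResidualBlock(8)
# 因為F的輸出初始為零,區塊輸出等於relu(x)(對非負輸入而言即為恆等)
print(torch.allclose(block(x), torch.relu(x)))  # True
```
:::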
:::info
In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.
:::

:::success
在實際情況中,[恆等映射](http://terms.naer.edu.tw/detail/2117540/)不大可能是最佳的,但是我們的重新定義可能有助於這個問題的預處理。如果這個最佳函數更接近[恆等映射](http://terms.naer.edu.tw/detail/2117540/)而不是零映射,那對解答器而言,應該是更容易找到與[恆等映射](http://terms.naer.edu.tw/detail/2117540/)相關的[擾動](http://terms.naer.edu.tw/detail/838760/),而不是將函數視為一個新的函數來學習。我們透過實驗證明(Fig. 7),學習到的殘差函數通常有較小的響應,這說明了[恆等映射](http://terms.naer.edu.tw/detail/2117540/)提供了合理的預處理。
:::

:::info
![](https://i.imgur.com/nw98y4v.png)
Figure 7. Standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3×3 layer, after BN and before nonlinearity. **Top:** the layers are shown in their original order. **Bottom:** the responses are ranked in descending order.

Figure 7. CIFAR-10上層響應的[標準差](http://terms.naer.edu.tw/detail/1113798/)(std)。響應是每一個3x3 layer在BN之後、非線性之前的輸出。**Top:** 以原始的順序顯示各層。**Bottom:** 響應依降序排序。
:::

### 3.2. Identity Mapping by Shortcuts
:::info
We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:

$$
y = F(x, \left\{ W_i\right\}) + x \qquad (1)
$$
:::

:::success
我們對每幾個堆疊的層採用殘差學習。建構的區塊如Fig. 2所示。確切來說,此論文中我們考慮將建構的區塊定義為:

$$
y = F(x, \left\{ W_i\right\}) + x \qquad (1)
$$
:::

:::info
Here $x$ and $y$ are the input and output vectors of the layers considered. The function $F(x, \{W_i\})$ represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, $F = W_2\sigma(W_1x)$ in which $\sigma$ denotes ReLU [29] and the biases are omitted for simplifying notations. The operation $F + x$ is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., $\sigma(y)$, see Fig. 2).
:::

:::success
這邊的$x$與$y$分別是layers的輸入與輸出向量。函數$F(x, \{W_i\})$代表要學習的殘差映射。以Fig. 2有兩層的例子來說,$F = W_2\sigma(W_1x)$,其中$\sigma$表示ReLU\[29\],並且為了簡化符號而忽略偏差。$F + x$是透過shortcut connection與逐元素的相加來執行。我們在相加之後才採用第二次的非線性(即,$\sigma(y)$,見Fig. 2)。
:::

:::info
The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).
:::

:::success
公式(1)的shortcut connections既沒有引入多餘的參數,也沒有增加計算的複雜度。這不僅僅在實作中具有吸引力,而且在我們比較普通與殘差網路的時候也非常重要。我們可以公平地比較同時具有相同數量的參數、深度、寬度與計算成本的普通/殘差網路(除了可以忽略的逐元素相加)。
:::

:::info
The dimensions of $x$ and $F$ must be equal in Eqn.(1).
If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection $W_s$ by the shortcut connections to match the dimensions:

$$
y = F(x, \left\{ W_i\right\}) + W_sx \qquad (2)
$$

We can also use a square matrix $W_s$ in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus $W_s$ is only used when matching dimensions.
:::

:::success
在公式(1)中,$x$與$F$的維度必須相同。如果不是這樣(例如,改變輸入/輸出的channels時),那我們可以透過shortcut connections執行線性投影$W_s$以匹配維度:

$$
y = F(x, \left\{ W_i\right\}) + W_sx \qquad (2)
$$

我們也可以在公式(1)中使用[方陣](http://terms.naer.edu.tw/detail/2125185/)$W_s$。但是,我們將透過實驗來說明,[恆等映射](http://terms.naer.edu.tw/detail/2117540/)足以解決退化問題,而且是非常經濟的作法,因此,只有在匹配維度的時候才使用$W_s$。
:::

:::info
The form of the residual function $F$ is flexible. Experiments in this paper involve a function $F$ that has two or three layers (Fig. 5), while more layers are possible. But if $F$ has only a single layer, Eqn.(1) is similar to a linear layer: $y = W_1x + x$, for which we have not observed advantages.
:::

:::success
殘差函數$F$的形式是非常靈活的。此論文中的實驗涉及一個具有兩層或三層的函數$F$(Fig. 5),即使更多層也還是可以的。但如果$F$只有一層,那公式(1)就會變得類似線性層:$y = W_1x + x$,對此我們還沒有觀察到有任何的優點。
:::

:::info
![](https://i.imgur.com/ZE12a5Z.png)
Figure 5. A deeper residual function $F$ for ImageNet. Left: a building block (on 56×56 feature maps) as in Fig. 3 for ResNet-34. Right: a “bottleneck” building block for ResNet-50/101/152.

Figure 5. 用於ImageNet的更深的殘差函數$F$。Left:ResNet-34的建構區塊(於56x56的feature maps上),如Fig. 3所示。Right:ResNet-50/101/152的"bottleneck"建構區塊。
:::

:::info
We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function $F(x, \{W_i\})$ can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel.
:::

:::success
我們還注意到,儘管上述的符號為了簡單起見主要是關於全連接層的,但它們仍然適用於卷積層。函數$F(x, \{W_i\})$就可以表示多個卷積層。在兩個feature maps上逐個channel做逐元素的相加。
:::

### 3.3. Network Architectures
:::info
We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.
:::

:::success
我們測試了各種普通/殘差網路,而且觀察到了一致的現象。為了提供討論的實例,我們描述兩個ImageNet的模型,如下。
:::

:::info
**Plain Network.** Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets \[41\] (Fig. 3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle).
:::

:::success
**Plain Network.** 普通網路的基線(Fig. 3中間),主要受到VGG網路\[41\]原理的啟發(Fig. 3左)。卷積層大多具有3x3的濾波器,並遵循兩個簡單的設計規則:(i)對於相同的output feature map大小,這些層有著相同的濾波器數量;(ii)如果feature map的大小減半,那濾波器的數量就增加一倍,以保持每一層的時間複雜度。我們直接用stride=2的卷積層來做downsampling。網路以global average pooling layer與帶有softmax的1000-way全連接層結束。權重層的總數為34(Fig. 3中間)。
:::

:::info
It is worth noticing that our model has fewer filters and lower complexity than VGG nets \[41\] (Fig. 3, left). Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).
:::

:::success
值得注意的是,我們的模型比起VGG網路\[41\]有更少的濾波器與更低的複雜度(Fig. 3, 左)。我們的34-layer基線網路有36億次FLOPs(multiply-adds),這只是VGG-19(196億次FLOPs)的18%。
:::

:::info
**Residual Network.** Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
:::

:::success
**Residual Network.** 基於上述的普通網路,我們插入shortcut connections(Fig. 3, 右),將網路轉為對照的殘差版本。當輸入、輸出的維度相同的時候(Fig. 3的實線),可以直接使用identity shortcuts(公式(1))。當維度增加的時候(Fig. 3的虛線),我們有兩個選項:(A)shortcut仍然執行[恆等映射](http://terms.naer.edu.tw/detail/2117540/),不足維度以零填充。這個選項並不會增加額外的參數;(B)使用公式(2)的projection shortcut匹配維度(使用1x1卷積)。對於這兩個選項,當shortcuts穿過兩種大小的feature maps的時候,它們會以stride=2來執行。
:::
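:::warning
個人補充(非原文):以下依照Fig. 2 / Fig. 5(左)與公式(1)、(2)的描述,寫成一份PyTorch風格的basic building block草稿:兩個3x3卷積、每個卷積後接BN、相加之後才做第二次ReLU;維度不符時以1x1卷積做projection shortcut(即選項B)。類別與參數名稱為個人假設,並非論文的官方程式碼。

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """兩層3x3卷積的殘差區塊:y = F(x, {W_i}) + x(或 + W_s x)"""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

        # 輸入/輸出維度相同時使用恆等捷徑(公式(1));否則用1x1卷積投影W_s(公式(2),選項B)
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # W_1:卷積、BN、ReLU
        out = self.bn2(self.conv2(out))           # W_2:卷積、BN(尚未做非線性)
        out = out + self.shortcut(x)              # F(x) + x:逐元素相加
        return self.relu(out)                     # 相加之後才做第二次ReLU
```
:::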
### 3.4. Implementation
:::info
Our implementation for ImageNet follows the practice in \[21, 41\]. The image is resized with its shorter side randomly sampled in \[256, 480\] for scale augmentation \[41\]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted \[21\]. The standard color augmentation in \[21\] is used. We adopt batch normalization (BN) \[16\] right after each convolution and before activation, following \[16\]. We initialize the weights as in \[13\] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to $60 \times 10^4$ iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout \[14\], following the practice in \[16\].
:::

:::success
我們ImageNet的實作依循\[21, 41\]的作法。影像的短邊在\[256, 480\]中隨機採樣並據此重新縮放,做為尺度的增強\[41\]。從影像或其水平翻轉的影像中隨機剪裁224x224,並減去per-pixel均值\[21\]。使用\[21\]中的標準色彩增強。依循著\[16\],我們在每個卷積之後、啟動(激活)之前使用batch normalization(BN)\[16\]。以\[13\]的方式初始化權重,並從頭開始訓練所有普通/殘差網路。以SGD最佳化,其mini-batch size為256。learning rate從0.1開始,並在誤差趨於平緩的時候除以10,模型訓練高達$60 \times 10^4$次的迭代。我們使用weight decay為0.0001,momentum為0.9。依循著\[16\]的作法,我們並沒有使用dropout\[14\]。
:::

:::info
In testing, for comparison studies we adopt the standard 10-crop testing \[21\]. For best results, we adopt the fully-convolutional form as in \[41, 13\], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).
:::

:::success
測試期間,為了比較研究,我們採用標準的10-crop測試\[21\]。為了得到最佳結果,我們採用\[41, 13\]的全卷積形式,然後平均多個尺度的分數(影像重新縮放,使其短邊為{224, 256, 384, 480, 640})。
:::
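:::warning
個人補充(非原文):把3.4節的ImageNet訓練設定整理成一份PyTorch風格的草稿(SGD、momentum 0.9、weight decay 1e-4、mini-batch 256、learning rate 0.1並在誤差趨平時除以10)。這裡以torchvision內建的ResNet-34與常見的資料增強做近似示意,並非論文的官方實作。

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# 訓練時的資料增強:近似論文的尺度增強、224x224隨機剪裁、水平翻轉與減去均值
train_transform = T.Compose([
    T.RandomResizedCrop(224),        # 以常見的隨機縮放剪裁近似 [256, 480] 的尺度增強(假設)
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # 常見的ImageNet正規化,僅作示意
])

model = models.resnet34()  # 以torchvision內建的ResNet-34作為示意模型

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                # 初始learning rate 0.1
    momentum=0.9,
    weight_decay=1e-4,     # weight decay 0.0001
)
# 論文是在誤差趨於平緩時將learning rate除以10;這裡以ReduceLROnPlateau近似該行為
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
# 其餘設定:mini-batch size 256、訓練約60x10^4次迭代、不使用dropout
```
:::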
## 4. Experiments
### 4.1. ImageNet Classification
:::info
We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates.

**Plain Networks.** We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.
:::

:::success
我們在ImageNet 2012分類資料集\[36\]上(包含1000個類別)評估我們的方法。模型以128萬張訓練影像訓練,以50,000張驗證影像評估。我們也獲得測試伺服器報告的100,000張測試影像的最終結果。同時評估top-1與top-5的誤差率。

**Plain Networks.** 我們首先評估18-layer與34-layer的普通網路。34-layer的普通網路就是Fig. 3中間那個。18-layer的普通網路有類似的形式,架構上的細節請見Table 1。
:::

:::success
![](https://i.imgur.com/EUG5ivi.png)
Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

Table 1. ImageNet的架構。括號內顯示建構的區塊(另參考Fig. 5),旁邊的數字是堆疊的區塊數量。以stride=2的conv3_1、conv4_1與conv5_1來執行downsampling。
:::

:::info
The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem - the 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one.
:::

:::success
Table 2中的結果說明了,比較深的34-layer的普通網路比起較淺的18-layer的普通網路有較高的驗證誤差。為了能夠顯露出原因,在Fig. 4(左),我們比較它們在訓練過程中的訓練/驗證誤差。我們觀察到退化問題-即使18-layer普通網路的[解空間](http://terms.naer.edu.tw/detail/2124959/)是34-layer普通網路的子空間,但在整個訓練過程中,34-layer的普通網路卻有著較高的訓練誤差。
:::

:::info
![](https://i.imgur.com/50jNTJq.png)
Table 2. Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameter compared to their plain counterparts. Fig. 4 shows the training procedures.

Table 2. ImageNet驗證集上的Top-1誤差(%, 10-crop testing)。這裡的ResNets與其對照的普通網路相比並沒有額外的參數。Fig. 4顯示了訓練過程。
:::

:::info
![](https://i.imgur.com/YpYzv44.png)
Table 3. Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 are of option B that only uses projections for increasing dimensions.

Table 3. ImageNet validation上的誤差率(%, 10-crop testing)。VGG-16基於我們的測試。ResNet-50/101/152採用選項B,僅在增加維度時使用投影。
:::

:::info
We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN \[16\], which ensures forward propagated signals to have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent. We conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error^3^. The reason for such optimization difficulties will be studied in the future.
:::

:::success
我們認為這種最佳化的困難並不像是由消失的梯度所引起。這些普通網路的訓練是有加上BN\[16\]的,這確保正向傳播信號具有非零的變異數(方差)。我們還驗證了加上BN的反向傳播梯度呈現正常的[範數](http://terms.naer.edu.tw/detail/2120585/)。因此,正向或反向信號都沒有消失。事實上,34-layer的普通網路仍然可以達到具有競爭力的準確度(Table 3),這說明了,解答器在某種程度上是可以作業的。我們推測,深度的普通網路可能有指數級別的低收斂速度,這影響了訓練誤差的減少^3^。將來要進一步研究造成這種最佳化困難的原因。
:::

:::info
^3^We have experimented with more training iterations (3×) and still observed the degradation problem, suggesting that this problem cannot be feasibly addressed by simply using more iterations.
:::

:::success
^3^我們已經嘗試更多的訓練迭代(3x),而且仍然觀察到退化問題,這說明了,這個退化問題無法單純的透過更多的迭代來解決。
:::

:::info
**Residual Networks.** Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3 (right).
In the first comparison (Table 2 and Fig. 4 right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A). So they have no extra parameter compared to the plain counterparts.
:::

:::success
**Residual Networks.** 接下來我們評估18-layer與34-layer的殘差網路(ResNets)。基線架構與上述的普通網路相同,除了每兩個3x3濾波器加入shortcut connection,如Fig. 3(右)所示。在第一個比較中(Table 2與Fig. 4右),所有的shortcut都使用[恆等映射](http://terms.naer.edu.tw/detail/2117540/),並以zero-padding來增加維度(選項A)。因此模型與對照的普通網路相比並沒有額外的參數。
:::

:::info
![](https://i.imgur.com/PEJyaxN.png)
Figure 3. Example network architectures for ImageNet. Left: the VGG-19 model [41] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Table 1 shows more details and other variants.

Figure 3. 用於ImageNet的範例網路架構。左:VGG-19模型\[41\](19.6 billion FLOPs)做為參照。中:具有34個參數層的普通網路(3.6 billion FLOPs)。右:具有34個參數層的殘差網路(3.6 billion FLOPs)。虛線的shortcuts會增加維度。Table 1說明更多詳細資訊與其它變體。
:::

:::info
![](https://i.imgur.com/P9jTCgd.png)
Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts.

Figure 4. 以ImageNet訓練。細的曲線表示訓練誤差,粗的曲線表示中心剪裁的驗證誤差。左:18、34層的普通網路。右:18、34層的殘差網路。在這張圖中,殘差網路與它們對照的普通網路相比並沒有額外的參數。
:::

:::info
We have three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning – the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and is generalizable to the validation data. This indicates that the degradation problem is well addressed in this setting and we manage to obtain accuracy gains from increased depth.
:::

:::success
從Table 2與Fig. 4中,我們得到三個主要的觀察結果。首先,使用殘差學習後情況逆轉了 - 34-layer的ResNet比18-layer ResNet還要好(低2.8%)。更重要的是,34-layer的ResNet表現出更低的訓練誤差,而且可以泛化到驗證資料。這指出,在這種情況之下,退化問題得到很好的解決,而且我們設法從增加的深度中得到準確度提升的好處。
:::

:::info
Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Fig. 4 right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems.
:::

:::success
第二,與對照的普通網路相比,34-layer ResNet的top-1誤差降低3.5%(Table 2),這是因為我們成功地降低訓練誤差(Fig. 4右 vs. 左)。這項比較驗證了在極深的系統上殘差學習的有效性。
:::

:::info
Last, we also note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs. left). When the net is “not overly deep” (18 layers here), the current SGD solver is still able to find good solutions to the plain net. In this case, the ResNet eases the optimization by providing faster convergence at the early stage.
:::

:::success
最後,我們還注意到,18-layer普通/殘差網路的準確度相當(Table 2),但18-layer ResNet的收斂速度比較快(Fig. 4 右 vs. 左)。當網路"沒有很深"的時候(這邊指18-layer),當前的SGD解答器仍然可以為普通網路找到很好的解。在這種情況下,ResNet透過在早期提供更快的收斂來簡化最佳化。
:::

:::info
**Identity vs. Projection Shortcuts.** We have shown that parameter-free, identity shortcuts help with training. Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Fig. 4 right);
(B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.
:::

:::success
**Identity vs. Projection Shortcuts.** 我們已經說明parameter-free的identity shortcuts有助於訓練。接下來我們要來探討projection shortcuts(公式(2))。Table 3中,我們比較三個選項:(A)使用zero-padding shortcuts增加維度,而且所有的shortcuts都是parameter-free(與Table 2、Fig. 4右相同);(B)使用projection shortcuts增加維度,而且其它的shortcuts都是identity;(C)所有的shortcuts都是投影。
:::

:::info
Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A. We argue that this is because the zero-padded dimensions in A indeed have no residual learning. C is marginally better than B, and we attribute this to the extra parameters introduced by many (thirteen) projection shortcuts. But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below.
:::

:::success
Table 3顯示,這三個選項都比對照的普通網路還要好很多。B比A好一點。我們認為這是因為A裡面的zero-padded dimensions並沒有起到殘差學習的作用。C比B稍微好一點,我們將它歸因於許多(十三個)projection shortcuts所引入的額外參數。但是A/B/C之間的些許差異也說明著,對於解決退化問題,projection shortcuts並不是必要的。因此,在這論文中的其它部份我們都沒有採用選項C,以減少記憶體/時間複雜度與模型大小。Identity shortcuts對於不增加下面介紹的瓶頸架構的複雜度特別的重要。
:::

:::info
**Deeper Bottleneck Architectures.** Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design^4^. For each residual function $F$, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.
:::

:::success
**Deeper Bottleneck Architectures.** 接下來,我們將介紹用於ImageNet的更深的網路。考量到我們所能負擔的訓練時間,我們將建構區塊修改為瓶頸設計^4^。對於每一個殘差函數$F$,我們改使用3層而不是2層(Fig. 5)。這三層分別是1x1、3x3、1x1卷積,其中1x1負責降低,然後增加(還原)維度,讓3x3這一層變成一個有著較小的輸入/輸出維度的瓶頸層。Fig. 5說明這個範例,這兩種設計都有類似的時間複雜度。
:::

:::info
The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig. 5 (right) is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.
:::

:::success
parameter-free identity shortcuts對瓶頸架構特別重要。如果將Fig. 5(右)的identity shortcut以投影來取代,可以看得出來,時間複雜度與模型大小都會加倍,因為shortcut連接的是兩個高維度的端點。因此,identity shortcuts可以為瓶頸設計帶來更高效的模型。
:::

:::info
**50-layer ResNet:** We replace each 2-layer block in the 34-layer net with this 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs.
:::

:::success
**50-layer ResNet:** 我們將34-layer的每一個2-layer block以3-layer bottleneck block取代掉,得到一個50-layer的ResNet(Table 1)。我們使用選項B來增加維度。這個模型有3.8 billion FLOPs。
:::

:::info
**101-layer and 152-layer ResNets:** We construct 101-layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs).
:::

:::success
**101-layer and 152-layer ResNets:** 我們用更多的3-layer blocks來建構101-layer與152-layer的ResNet(Table 1)。讓人感到不可思議的是,儘管深度大大的增加,但152-layer ResNet(11.3 billion FLOPs)仍然比VGG-16/19(15.3/19.6 billion FLOPs)有更低的複雜度。
:::
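:::warning
個人補充(非原文):Fig. 5(右)的"bottleneck"建構區塊(1x1降維、3x3、1x1還原維度,捷徑盡量使用不帶參數的恆等映射)的PyTorch風格草稿;`Bottleneck`、`expansion`等名稱為個人假設,並非論文原始碼。

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 的瓶頸殘差區塊(ResNet-50/101/152所使用的形式)"""
    expansion = 4  # 區塊輸出通道數是3x3層通道數的4倍(例如 64 -> 256)

    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)   # 1x1:降維
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False)  # 3x3:瓶頸
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)  # 1x1:還原(增加)維度
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 盡量使用不帶參數的恆等捷徑;維度不符時才以1x1投影(選項B)
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))  # 相加之後才做ReLU
```
:::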
:::info
The 50/101/152-layer ResNets are more accurate than the 34-layer ones by considerable margins (Table 3 and 4). We do not observe the degradation problem and thus enjoy significant accuracy gains from considerably increased depth. The benefits of depth are witnessed for all evaluation metrics (Table 3 and 4).
:::

:::success
50/101/152-layer的ResNets比起34-layer ResNet的準確度要高出許多(Table 3、4)。我們並沒有觀察到退化問題,因此可以從大幅增加的深度中得到明顯的準確度提升。所有的評估指標都見證了深度帶來的好處(Table 3、4)。
:::

:::info
![](https://i.imgur.com/nuORRE5.png)
Table 4. Error rates (%) of single-model results on the ImageNet validation set (except † reported on the test set).

Table 4. ImageNet驗證集上單一模型得到的誤差率(%)(除了†是在測試集上)。
:::

:::info
**Comparisons with State-of-the-art Methods.** In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets have achieved very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (only with two 152-layer ones at the time of submitting). This leads to 3.57% top-5 error on the test set (Table 5). This entry won the 1st place in ILSVRC 2015.
:::

:::success
**Comparisons with State-of-the-art Methods.** 在Table 4中我們與先前最佳的單一模型結果做了比較。我們的基線34-layer ResNets得到一個非常有競爭力的準確度。我們的152-layer ResNet的單一模型top-5驗證誤差為4.49%。這個單一模型的結果優於所有先前集成(ensemble)的結果(Table 5)。我們結合六個不同深度的模型來形成一個集成(ensemble)模型(提交的時候只有兩個152-layer的模型)。這在測試集上得到了3.57%的top-5誤差(Table 5)。這個參賽作品贏得ILSVRC 2015第一名。
:::

:::info
![](https://i.imgur.com/C3MBaPx.png)
Table 5. Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server.

Table 5. 集成(ensembles)的誤差率(%)。top-5誤差是在ImageNet測試集上,由測試伺服器報告。
:::

### 4.2. CIFAR-10 and Analysis
:::info
We conducted more studies on the CIFAR-10 dataset \[20\], which consists of 50k training images and 10k testing images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviors of extremely deep networks, but not on pushing the state-of-the-art results, so we intentionally use simple architectures as follows.
:::

:::success
我們在CIFAR-10資料集\[20\]上做了更多的研究,這資料集包含50k的訓練影像與10k的測試影像,共10個類別。我們提出在訓練集上訓練並在測試集上評估的實驗。我們關注的是非常深的網路的表現,而不是為了得到最佳結果,因此,我們特地只使用簡單的架構,如下。
:::

:::info
The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images, with the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes {32, 16, 8} respectively, with 2n layers for each feature map size. The numbers of filters are {16, 32, 64} respectively. The subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. There are totally 6n+2 stacked weighted layers. The following table summarizes the architecture:

|output map size| 32x32 | 16x16 | 8x8 |
|---|---|---|---|
|# layers| 1+2n | 2n | 2n |
|# filters| 16 | 32 | 64 |

When shortcut connections are used, they are connected to the pairs of 3×3 layers (totally $3n$ shortcuts). On this dataset we use identity shortcuts in all cases (i.e., option A), so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts.
:::

:::success
普通/殘差架構依著Fig. 3的形式(中/右)。網路的輸入為32x32的影像,並減去per-pixel的均值。第一層為3x3的卷積。接著我們分別在大小為\{32, 16, 8\}的feature maps上使用堆疊的6$n$個3x3卷積層,每種feature map大小各為2$n$層。濾波器的數量分別為\{16, 32, 64\}。以stride=2的卷積來做subsampling。網路以global average pooling、10-way fully-connected layer、softmax做為結束。總共有6$n$+2個堆疊的權重層。下面表單總結了這個架構:

|output map size| 32x32 | 16x16 | 8x8 |
|---|---|---|---|
|# layers| 1+2n | 2n | 2n |
|# filters| 16 | 32 | 64 |

當使用shortcut connections的時候,它們連接到成對的3x3 layers(總共$3n$個shortcuts)。在這資料集上,所有的狀況我們都是使用identity shortcuts(也就是選項A),因此我們的殘差網路與對照的普通網路有著完全相同的深度、寬度以及參數數量。
:::
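:::warning
個人補充(非原文):依照上表的規格(第一層3x3卷積、feature map大小{32, 16, 8}、每種大小2n層、濾波器數{16, 32, 64}、最後接global average pooling與10-way全連接層)寫成的6n+2層CIFAR-10網路草稿,沿用前面假設的`BasicBlock`;n=3/5/7/9/18分別對應20/32/44/56/110層。注意:論文在CIFAR-10上全部使用選項A(零填充)的恆等捷徑,這裡沿用的`BasicBlock`在升維時改用1x1投影(選項B),僅為簡化的示意。

```python
import torch
import torch.nn as nn

class CifarResNet(nn.Module):
    """6n+2層的CIFAR-10 ResNet草稿(假設前面示意的BasicBlock已經定義)"""
    def __init__(self, n):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1, bias=False)  # 第一層3x3卷積
        self.bn1 = nn.BatchNorm2d(16)
        self.relu = nn.ReLU(inplace=True)
        # 三個階段:feature map 32x32 / 16x16 / 8x8,濾波器 16 / 32 / 64,每個階段2n層(n個區塊)
        self.stage1 = self._make_stage(16, 16, n, stride=1)
        self.stage2 = self._make_stage(16, 32, n, stride=2)  # 以stride=2的卷積做subsampling
        self.stage3 = self._make_stage(32, 64, n, stride=2)
        self.pool = nn.AdaptiveAvgPool2d(1)                   # global average pooling
        self.fc = nn.Linear(64, 10)                           # 10-way全連接層(softmax由損失函數處理)

    def _make_stage(self, in_ch, out_ch, num_blocks, stride):
        blocks = [BasicBlock(in_ch, out_ch, stride=stride)]
        blocks += [BasicBlock(out_ch, out_ch) for _ in range(num_blocks - 1)]
        return nn.Sequential(*blocks)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.stage3(self.stage2(self.stage1(out)))
        out = self.pool(out).flatten(1)
        return self.fc(out)

# n = 3, 5, 7, 9, 18 分別對應 20, 32, 44, 56, 110 層
model = CifarResNet(n=3)
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```
:::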
:::info
We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in \[13\] and BN \[16\] but with no dropout. These models are trained with a mini-batch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.
:::

:::success
我們使用0.0001的權重衰減,以及0.9的momentum,並使用\[13\]的權重初始化方式,以及\[16\]的BN,但沒有使用dropout。這些模型以兩塊GPUs訓練,其mini-batch size為128。learning rate以0.1開始,並且在32k、48k次迭代的時候除以10,並在64k次迭代的時候終止訓練,這是由45k/5k的訓練/驗證資料分割所決定。我們依循\[24\]所用的簡單資料增強來訓練:每一邊都填充4個像素,再從填充後的影像或其水平翻轉的影像中隨機剪裁出32x32。測試的時候,我們只評估原始32x32影像的single view。
:::

:::info
We compare $n = \{3, 5, 7, 9\}$, leading to 20, 32, 44, and 56-layer networks. Fig. 6 (left) shows the behaviors of the plain nets. The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that on ImageNet (Fig. 4, left) and on MNIST (see \[42\]), suggesting that such an optimization difficulty is a fundamental problem.
:::

:::success
我們比較了$n = \{3, 5, 7, 9\}$所得出的20、32、44、與56層的網路。Fig. 6(左)說明了普通網路的表現。深度的普通網路受深度增加所苦,當網路愈深,訓練誤差就愈高。這種現象與ImageNet(Fig. 4, 左)、MNIST(見\[42\])所發生的狀況類似,這說明了,這種最佳化困難是一個基本問題。
:::

:::info
![](https://i.imgur.com/GQOkcqJ.png)
Figure 6. Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers.

Figure 6. 在CIFAR-10上訓練。虛線表示訓練誤差,而粗線表示測試誤差。左:普通網路。plain-110的誤差高於60%,因此沒有顯示。中:ResNets。右:110層與1202層的ResNets。
:::

:::info
Fig. 6 (middle) shows the behaviors of ResNets. Also similar to the ImageNet cases (Fig. 4, right), our ResNets manage to overcome the optimization difficulty and demonstrate accuracy gains when the depth increases.
:::

:::success
Fig. 6(中)說明ResNets的狀況。與ImageNet的狀況類似(Fig. 4, 右),我們的ResNets設法克服最佳化的困難,並在深度增加時展現出準確度的提升。
:::

:::info
We further explore $n = 18$ that leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging^5^. So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training. The rest of the learning schedule is as done previously. This 110-layer network converges well (Fig. 6, middle).
It has fewer parameters than other deep and thin networks such as FitNet \[35\] and Highway \[42\] (Table 6), yet is among the state-of-the-art results (6.43%, Table 6).
:::

:::success
我們進一步探討當$n=18$所得到的110-layer的ResNet。在這種情況下,我們發現初始的learning rate設置為0.1略微過大,以至於難以開始收斂^5^。因此我們使用0.01來暖身,一直到訓練誤差低於80%(大約400次迭代),然後再調回0.1,繼續訓練。其餘的學習行程就與先前所說的一致。這個110-layer的網路收斂得很好(Fig. 6, 中)。與其它又深又瘦的網路(如FitNet\[35\]、Highway\[42\])相比,它有更少的參數(Table 6),但仍然屬於最佳結果之列(6.43%, Table 6)。
:::

:::info
![](https://i.imgur.com/GGg5N1P.png)
Table 6. Classification error on the CIFAR-10 test set. All methods are with data augmentation. For ResNet-110, we run it 5 times and show “best (mean±std)” as in \[43\].

Table 6. CIFAR-10測試集上的分類誤差。所有方法都有使用資料增強。對於ResNet-110,我們執行五次,並如\[43\]所示顯示"best (mean±std)"。
:::

:::info
**Analysis of Layer Responses.** Fig. 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3×3 layer, after BN and before other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Fig. 7 shows that ResNets have generally smaller responses than their plain counterparts. These results support our basic motivation (Sec.3.1) that the residual functions might be generally closer to zero than the non-residual functions. We also notice that the deeper ResNet has smaller magnitudes of responses, as evidenced by the comparisons among ResNet-20, 56, and 110 in Fig. 7. When there are more layers, an individual layer of ResNets tends to modify the signal less.
:::

:::success
**Analysis of Layer Responses.** Fig. 7說明layer responses的標準差(std)。這個響應是每一個3x3 layer在BN之後、其它非線性(ReLU/相加)之前的輸出。對於ResNets而言,這個分析揭示出殘差函數的響應強度。Fig. 7說明,ResNets通常有著比與之對照的普通網路還要小的響應。這些結果支持著我們的基礎動機(見Sec.3.1),也就是殘差函數通常可能比非殘差函數更接近零。我們還注意到,較深的ResNet有著更小的響應幅度,這可以由Fig. 7中ResNet-20、56與110之間的比較得到證明。當層數更多的時候,ResNets的單一層對信號的改變會趨於更少。
:::

:::info
![](https://i.imgur.com/rhsv9nh.png)
Table 7. Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets using baseline Faster R-CNN. See also Table 10 and 11 for better results.

Table 7. PASCAL VOC 2007/2012測試集上的目標檢測mAP(%),使用基線Faster R-CNN。同時參閱Table 10與11以得到更好的結果。
:::

:::info
![](https://i.imgur.com/byG5srD.png)
Table 9. Object detection improvements on MS COCO using Faster R-CNN and ResNet-101.

Table 9. MS COCO上使用Faster R-CNN與ResNet-101,目標檢測的改善。
:::

:::info
![](https://i.imgur.com/qxnwt3O.png)
Table 10. Detection results on the PASCAL VOC 2007 test set. The baseline is the Faster R-CNN system. The system “baseline+++” include box refinement, context, and multi-scale testing in Table 9.

Table 10. PASCAL VOC 2007測試集的偵測結果。基線系統為Faster R-CNN。系統"baseline+++"包含Table 9中的框的優化、上下文與多尺度測試。
:::

:::info
![](https://i.imgur.com/fXENcN6.png)
Table 11. Detection results on the PASCAL VOC 2012 test set (http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=4). The baseline is the Faster R-CNN system. The system “baseline+++” include box refinement, context, and multi-scale testing in Table 9.

Table 11. PASCAL VOC 2012測試集的偵測結果。基線系統為Faster R-CNN。系統"baseline+++"包含Table 9中的框的優化、上下文與多尺度測試。
:::

:::info
**Exploring Over 1000 layers.** We explore an aggressively deep model of over 1000 layers. We set $n = 200$ that leads to a 1202-layer network, which is trained as described above. Our method shows no optimization difficulty, and this $10^3$-layer network is able to achieve training error <0.1% (Fig. 6, right). Its test error is still fairly good (7.93%, Table 6).
:::

:::success
**Exploring Over 1000 layers.** 我們探索一個超過1000層的超級深度模型。我們設置$n=200$,得到一個1202-layer的網路,訓練方式如上所述。我們的方法沒有顯示出最佳化的困難,而且這個$10^3$-layer的網路能夠達到訓練誤差<0.1%(Fig. 6, 右)。它的測試誤差仍然相當不錯(7.93%, Table 6)。
:::

:::info
But there are still open problems on such aggressively deep models. The testing result of this 1202-layer network is worse than that of our 110-layer network, although both have similar training error. We argue that this is because of overfitting. The 1202-layer network may be unnecessarily large (19.4M) for this small dataset. Strong regularization such as maxout \[10\] or dropout \[14\] is applied to obtain the best results (\[10, 25, 24, 35\]) on this dataset. In this paper, we use no maxout/dropout and just simply impose regularization via deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. But combining with stronger regularization may improve results, which we will study in the future.
:::

:::success
但是,這麼深的模型仍然有待解決的問題。即使1202-layer與110-layer的網路有著相似的訓練誤差,但是1202-layer的測試結果仍然比110-layer還要差。我們認為這是因為過擬合所造成。對於這個小資料集而言,1202-layer的網路(19.4M)可能大得沒有必要。在這資料集上使用一些很強的正規化,像是maxout\[10\]或dropout\[14\],可以獲得最佳的結果(\[10, 25, 24, 35\])。在這篇論文中,我們並沒有使用maxout/dropout,而是單純藉由設計上又深又窄的架構來施加正規化,以免分散對最佳化困難的關注。但是結合更強的正規化也許可以改善結果,未來我們將更進一步的研究。
:::

### 4.3. Object Detection on PASCAL and MS COCO
:::info
Our method has good generalization performance on other recognition tasks. Table 7 and 8 show the object detection baseline results on PASCAL VOC 2007 and 2012 \[5\] and COCO \[26\]. We adopt Faster R-CNN \[32\] as the detection method. Here we are interested in the improvements of replacing VGG-16 \[41\] with ResNet-101. The detection implementation (see appendix) of using both models is the same, so the gains can only be attributed to better networks. Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO’s standard metric (mAP@\[.5, .95\]), which is a 28% relative improvement. This gain is solely due to the learned representations.
:::

:::success
我們的方法在其它辨識任務上有很好的泛化效能。Table 7與Table 8說明PASCAL VOC 2007、2012\[5\]以及COCO\[26\]的目標檢測基線結果。我們採用Faster R-CNN\[32\]做為檢測方法。這邊我們感興趣的是以ResNet-101取代VGG-16\[41\]所帶來的改善。兩個模型使用的檢測實作是相同的(見附錄說明),因此檢測結果的提升只能歸因於更好的網路。最引人注意的是,在極具挑戰性的COCO資料集上,我們在COCO的標準指標(mAP@\[.5, .95\])上得到6.0%的增加,相對改善了28%。這個改善完全歸因於所學到的representations。
:::

:::info
![](https://i.imgur.com/ZSff5D4.png)
Table 8. Object detection mAP (%) on the COCO validation set using baseline Faster R-CNN. See also Table 9 for better results.

Table 8. COCO驗證集上的目標檢測mAP(%),使用的基線為Faster R-CNN。更好的結果可參考Table 9。
:::

:::info
Based on deep residual nets, we won the 1st places in several tracks in ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The details are in the appendix.
:::

:::success
基於深度殘差網路,我們贏得ILSVRC & COCO 2015競賽中多個項目的第一名:ImageNet detection、ImageNet localization、COCO detection與COCO segmentation。細節請見附錄。
:::