# Network In Network(翻譯)
###### tags: `CNN` `論文翻譯` `deeplearning`
>[name=Shaoe.chen] [time=2020 03 02]
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
個人註解,任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/pdf/1312.4400.pdf)
* [吳恩達老師NIN課程筆記](https://hackmd.io/@shaoeChen/HJUZTKMZz/https%3A%2F%2Fhackmd.io%2Fs%2FSJx83co_f#2-5Network-in-Network-and-1x1-convloutions)
* [hjimce_深度學習(二十六)Network In Network學習筆記](https://blog.csdn.net/hjimce/article/details/50458190)
:::
## Abstract
:::info
We propose a novel deep network structure called “Network In Network”(NIN) to enhance model discriminability for local patches within the receptive field. The conventional convolutional layer uses linear filters followed by a nonlinear activation function to scan the input. Instead, we build micro neural networks with more complex structures to abstract the data within the receptive field. We instantiate the micro neural network with a multilayer perceptron, which is a potent function approximator. The feature maps are obtained by sliding the micro networks over the input in a similar manner as CNN; they are then fed into the next layer. Deep NIN can be implemented by stacking mutiple of the above described structure. With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers. We demonstrated the state-of-the-art classification performances with NIN on CIFAR-10 and CIFAR-100, and reasonable performances on SVHN and MNIST datasets.
:::
:::success
我們提出一種新的深層網路結構,稱為"Network In Network"(NIN),以提高模型在[接受域](http://terms.naer.edu.tw/detail/3278959/)內對local patches的[區辨性](http://terms.naer.edu.tw/detail/6845354/)。傳統的卷積層使用線性[濾波器](http://terms.naer.edu.tw/detail/6620937/)後接非線性啟動函數來掃描輸入。做為替代,我們建立擁有更複雜結構的微神經網路,來抽象化[接受域](http://terms.naer.edu.tw/detail/3278959/)中的資料。我們以多層感知器(一種強力的函數逼近器)來實例化這個微神經網路。用類似於CNN的方式,在輸入上滑動微網路以取得feature maps;然後將它們送入下一層。Deep NIN可以藉由堆疊多個上述結構來實現。透過微網路增強局部建模,我們能夠在分類層的feature maps上使用global average pooling,比起傳統的全連接層,這更容易解釋,也比較不容易過擬合。我們展示了NIN在CIFAR-10與CIFAR-100上最先進的分類效能,以及在SVHN與MNIST資料集上合理的效能。
:::
:::warning
個人見解:
* 一直在想,local patch是不是指filter的kernel sizes所蓋的那個區塊就叫做local patch?
* micro network以圖1~(b)~來看,就是在卷積之後再放幾個mlpconv layer做計算再output,多了幾層非線性轉換的概念?
:::
## 1 Introduction
:::info
Convolutional neural networks (CNNs) \[1\] consist of alternating convolutional layers and pooling layers. Convolution layers take inner product of the linear filter and the underlying receptive field followed by a nonlinear activation function at every local portion of the input. The resulting outputs are called feature maps.
:::
:::success
卷積神經網路(CNNs)\[1\]由交替的卷積層與池化層所組成。卷積層在輸入的每個局部區域上計算線性濾波器與其下接受域的內積,隨後套用非線性啟動函數。輸出的結果稱為feature maps。
:::
:::info
The convolution filter in CNN is a generalized linear model (GLM) for the underlying data patch, and we argue that the level of abstraction is low with GLM. By abstraction we mean that the feature is invariant to the variants of the same concept \[2\]. Replacing the GLM with a more potent nonlinear function approximator can enhance the abstraction ability of the local model. GLM can achieve a good extent of abstraction when the samples of the latent concepts are linearly separable, i.e. the variants of the concepts all live on one side of the separation plane defined by the GLM. Thus conventional CNN implicitly makes the assumption that the latent concepts are linearly separable. However, the data for the same concept often live on a nonlinear manifold, therefore the representations that capture these concepts are generally highly nonlinear function of the input. In NIN, the GLM is replaced with a ”micro network” structure which is a general nonlinear function approximator. In this work, we choose multilayer perceptron [3] as the instantiation of the micro network, which is a universal function approximator and a neural network trainable by back-propagation.
:::
:::success
CNN中的卷積濾波器是底層data patch的[廣義線性模型](https://zh.wikipedia.org/wiki/%E5%BB%A3%E7%BE%A9%E7%B7%9A%E6%80%A7%E6%A8%A1%E5%9E%8B)(GLM),而我們認為GLM的抽象程度是比較低的。所謂抽象,我們指的是特徵對同一概念的變體具有不變性\[2\]。用一個更強力的非線性函數逼近器來取代GLM,可以提升局部模型的抽象能力。當隱含概念的樣本是線性可分的時候,也就是說,概念的變體都落在GLM所定義的分離平面的同一側時,GLM就能達到不錯的抽象程度。因此,傳統的CNN隱含地假設隱含概念是線性可分的。然而,相同概念的資料通常落在非線性流形上,因此捕捉這些概念的representations通常是輸入的高度非線性函數。在NIN中,GLM以"微網路~(micro network)~"結構取代,這是一個[一般的](http://terms.naer.edu.tw/detail/2116719/)非線性函數逼近器。在這項工作中,我們選擇多層感知器\[3\]做為微網路的實例,它是一個通用函數逼近器,也是一個可以用反向傳播訓練的神經網路。
:::
:::warning
個人見解:
* 單純的conv->non-linear activation->pooling所提取到的特徵都是抽象化較低的,因此加入mlpconv layer來提高抽像化程度,有一種將抽象出來的特徵再做幾次的非線性轉換的概念。
:::
:::info
The resulting structure which we call an mlpconv layer is compared with CNN in Figure 1. Both the linear convolutional layer and the mlpconv layer map the local receptive field to an output feature vector. The mlpconv maps the input local patch to the output feature vector with a multilayer perceptron (MLP) consisting of multiple fully connected layers with nonlinear activation functions. The MLP is shared among all local receptive fields. The feature maps are obtained by sliding the MLP over the input in a similar manner as CNN and are then fed into the next layer. The overall structure of the NIN is the stacking of multiple mlpconv layers. It is called “Network In Network” (NIN) as we have micro networks (MLP), which are composing elements of the overall deep network, within mlpconv layers.
:::
:::success
圖1中,我們將所得到的結構(我們稱之為mlpconv layer)與CNN做比較。線性卷積層與mlpconv layer都將局部接受域映射到輸出的特徵向量。mlpconv以多層感知器(MLP)將輸入的local patch映射到輸出的特徵向量,這個多層感知器由多個帶非線性啟動函數的全連接層所組成。MLP在所有的局部接受域之間共享。透過類似CNN的方式,在輸入上滑動MLP以取得feature maps,然後將它們送到下一層。NIN的整體結構是多個mlpconv layers的堆疊。之所以稱為"Network In Network"(NIN),是因為在mlpconv layers內有微網路(MLP)做為整個深度網路的組成要素。
:::
:::info

Figure 1: Comparison of linear convolution layer and mlpconv layer. The linear convolution layer
includes a linear filter while the mlpconv layer includes a micro network (we choose the multilayer
perceptron in this paper). Both layers map the local receptive field to a confidence value of the latent
concept.
:::
:::info
Instead of adopting the traditional fully connected layers for classification in CNN, we directly output the spatial average of the feature maps from the last mlpconv layer as the confidence of categories via a global average pooling layer, and then the resulting vector is fed into the softmax layer. In traditional CNN, it is difficult to interpret how the category level information from the objective cost layer is passed back to the previous convolution layer due to the fully connected layers which act as a black box in between. In contrast, global average pooling is more meaningful and interpretable as it enforces correspondance between feature maps and categories, which is made possible by a stronger local modeling using the micro network. Furthermore, the fully connected layers are prone to overfitting and heavily depend on dropout regularization \[4\] \[5\], while global average pooling is itself a structural regularizer, which natively prevents overfitting for the overall structure.
:::
:::success
我們不採用CNN中傳統的全連接層來做分類,而是透過global average pooling layer,直接將最後一個mlpconv layer所輸出feature maps的空間平均值做為類別的置信度,然後再將得到的向量送入softmax layer。傳統CNN中,由於夾在中間的全連接層就好比是一個黑盒子,難以解釋來自objective cost layer的類別層級信息是如何傳回先前的卷積層。相比之下,global average pooling更有意義也更具解釋性,因為它強化了feature maps與類別之間的對應關係,而這是藉由微網路所做的更強的局部建模才得以實現。此外,全連接層容易過擬合且嚴重依賴dropout regularization\[4\] \[5\],而global average pooling本身就是一個結構性的正規化器,天生就能預防整體結構的過擬合。
:::
:::warning
個人見解:
* mlpconv layer => global average pooling layer => softmax
* Inception最後也是利用global average pooling layer來取代fully connected layer
:::
## 2 Convolutional Neural Networks
:::info
Classic convolutional neuron networks [1] consist of alternatively stacked convolutional layers and spatial pooling layers. The convolutional layers generate feature maps by linear convolutional filters followed by nonlinear activation functions (rectifier, sigmoid, tanh, etc.). Using the linear rectifier as an example, the feature map can be calculated as follows:
$$
f_{i, j, k} = \max \left( w^T_kx_{i, j}, 0 \right) \qquad (1)
$$
Here $(i, j)$ is the pixel index in the feature map, $x_{ij}$ stands for the input patch centered at location $(i, j)$, and $k$ is used to index the channels of the feature map.
:::
:::success
經典的卷積神經網路\[1\]由交替堆疊的卷積層與空間池化層所組成。卷積層以線性卷積濾波器後接非線性啟動函數(rectifier, sigmoid, tanh 等)來生成feature maps。以線性整流器為例,feature map的計算如下:
$$
f_{i, j, k} = \max \left( w^T_kx_{i, j}, 0 \right) \qquad (1)
$$
這邊$(i, j)$是feature map中的像素索引,$x_{ij}$代表以$(i, j)$為中心的input patch,而$k$則是feature map的channel索引值。
:::
:::warning
個人見解:
* linear rectifier即ReLU
:::
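:::warning
個人補充:以下用NumPy對公式(1)做一個小示意(非論文原始碼,輸入大小與濾波器大小皆為假設值):對以$(i, j)$為中心的input patch與第$k$個線性濾波器做內積,再套用ReLU。

```python
import numpy as np

# 假設:單一 channel 的 5x5 輸入,3x3 的線性濾波器(僅為示意用的設定)
x = np.random.randn(5, 5)          # 輸入
w_k = np.random.randn(3, 3)        # 第 k 個線性濾波器

def relu(v):
    return np.maximum(v, 0)

# f_{i,j,k} = max(w_k^T x_{i,j}, 0),其中 x_{i,j} 是以 (i, j) 為中心的 patch
feature_map = np.zeros((3, 3))     # valid convolution 的輸出大小
for i in range(3):
    for j in range(3):
        patch = x[i:i + 3, j:j + 3]                      # x_{i,j}
        feature_map[i, j] = relu(np.sum(w_k * patch))    # 內積後套用 ReLU

print(feature_map.shape)           # (3, 3)
```
:::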
:::info
This linear convolution is sufficient for abstraction when the instances of the latent concepts are linearly separable. However, representations that achieve good abstraction are generally highly nonlinear functions of the input data. In conventional CNN, this might be compensated by utilizing an over-complete set of filters \[6\] to cover all variations of the latent concepts. Namely, individual linear filters can be learned to detect different variations of a same concept. However, having too many filters for a single concept imposes extra burden on the next layer, which needs to consider all combinations of variations from the previous layer \[7\]. As in CNN, filters from higher layers map to larger regions in the original input. It generates a higher level concept by combining the lower level concepts from the layer below. Therefore, we argue that it would be beneficial to do a better abstraction on each local patch, before combining them into higher level concepts.
:::
:::success
當隱含概念的實例是線性可分的時候,這樣的線性卷積就足以進行抽象。然而,能達到良好抽象的表示~(representations)~通常是輸入資料的高度非線性函數。在傳統的CNN中,也許可以透過使用一組over-complete的濾波器\[6\]來覆蓋隱含概念的所有變體以做補償。也就是說,可以學習各別的線性濾波器來檢測同一概念的不同變體。但是,單一概念使用過多的濾波器會給下一層帶來額外的負擔,因為下一層需要考慮來自前一層所有變體的組合\[7\]。在CNN中,較高層的濾波器會映射到原始輸入中較大的區域,它透過結合下層的較低層級概念來生成較高層級的概念。因此,我們認為在將local patches結合成更高層級的概念之前,先對每個local patch做更好的抽象是有好處的。
:::
:::warning
問題:
* latent concepts?
* concepts?
:::
:::info
In the recent maxout network \[8\], the number of feature maps is reduced by maximum pooling over affine feature maps (affine feature maps are the direct results from linear convolution without applying the activation function). Maximization over linear functions makes a piecewise linear approximator which is capable of approximating any convex functions. Compared to conventional convolutional layers which perform linear separation, the maxout network is more potent as it can separate concepts that lie within convex sets. This improvement endows the maxout network with the best performances on several benchmark datasets.
:::
:::success
最近的maxout network\[8\]中,透過對affine feature maps做maximum pooling來減少feature maps的數量(affine feature maps是線性卷積未套用啟動函數的直接結果)。對多個線性函數取最大值會形成一個[分段](http://terms.naer.edu.tw/detail/2121747/)線性逼近器,其能夠近似任意的[凸函數](http://terms.naer.edu.tw/detail/2113749/)。與執行線性分離的傳統卷積層相比,maxout network更為強大,因為它可以分離落在凸集合內的概念。這項改善讓maxout network在多個基準資料集上取得最佳效能。
:::
:::info
However, maxout network imposes the prior that instances of a latent concept lie within a convex set in the input space, which does not necessarily hold. It would be necessary to employ a more general function approximator when the distributions of the latent concepts are more complex. We seek to achieve this by introducing the novel “Network In Network” structure, in which a micro network is introduced within each convolutional layer to compute more abstract features for local patches.
:::
:::success
然而,maxout network強加了一個先驗假設:隱含概念的實例位於輸入空間中的某個凸集合內,而這一點並不一定成立。當隱含概念的分佈更為複雜的時候,就必須採用一個更通用的函數逼近器。我們試著透過引入新的"Network In Network"結構來實現這一點,也就是在每個卷積層中引入一個微網路,為local patches計算更抽象的特徵。
:::
:::info
Sliding a micro network over the input has been proposed in several previous works. For example, the Structured Multilayer Perceptron (SMLP) \[9\] applies a shared multilayer perceptron on different patches of the input image; in another work, a neural network based filter is trained for face detection \[10\]. However, they are both designed for specific problems and both contain only one layer of the sliding network structure. NIN is proposed from a more general perspective, the micro network is integrated into CNN structure in persuit of better abstractions for all levels of features.
:::
:::success
在輸入上滑動微網路的做法已經在先前的幾個工作中被提出。舉例來說,Structured Multilayer Perceptron (SMLP)\[9\]在輸入影像的不同patches上套用一個共享的多層感知器;在另一個工作中,則訓練了一個基於神經網路的濾波器用於臉部偵測\[10\]。然而,它們都是針對特定的問題所設計,而且都只包含一層的滑動網路結構。NIN則是從更一般的角度提出,將微網路整合進CNN結構中,以對所有層級的特徵取得更好的抽象。
:::
## 3 Network In Network
:::info
We first highlight the key components of our proposed “Network In Network” structure: the MLP convolutional layer and the global averaging pooling layer in Sec. 3.1 and Sec. 3.2 respectively. Then we detail the overall NIN in Sec. 3.3.
:::
:::success
首先,我們分別在Sec. 3.1與Sec. 3.2中強調我們所提出的"Network In Network"結構的關鍵組成:MLP卷積層與全域平均池化層。然後在Sec. 3.3中詳述NIN的整體結構。
:::
### 3.1 MLP Convolution Layers
:::info
Given no priors about the distributions of the latent concepts, it is desirable to use a universal function approximator for feature extraction of the local patches, as it is capable of approximating more abstract representations of the latent concepts. Radial basis network and multilayer perceptron are two well known universal function approximators. We choose multilayer perceptron in this work for two reasons. First, multilayer perceptron is compatible with the structure of convolutional neural networks, which is trained using back-propagation. Second, multilayer perceptron can be a deep model itself, which is consistent with the spirit of feature re-use \[2\]. This new type of layer is called mlpconv in this paper, in which MLP replaces the GLM to convolve over the input. Figure 1 illustrates the difference between linear convolutional layer and mlpconv layer. The calculation performed by mlpconv layer is shown as follows:
$$\begin{aligned}
& f^1_{i, j, k_1} = \max \left( {w^1_{k_1}}^T x_{i, j} + b_{k_1}, 0 \right) \\
& \qquad \vdots \\
& f^n_{i, j, k_n} = \max \left( {w^n_{k_n}}^T f^{n - 1}_{i, j} + b_{k_n}, 0 \right) \qquad (2)
\end{aligned}$$
Here $n$ is the number of layers in the multilayer perceptron. Rectified linear unit is used as the activation function in the multilayer perceptron.
:::
:::success
在沒有隱含概念分佈的先驗的情況下,最好使用通用的函數逼近器來提取local patches的特徵,因為它能夠近似隱含概念更為抽象的表示~(representations)~。[徑向基底網路](http://terms.naer.edu.tw/detail/6661780/)與多層感知器是兩個眾所皆知的通用函數逼近器。我們在這項工作中選擇多層感知器有兩個原因。首先,多層感知器與卷積神經網路的結構相容,同樣可以使用反向傳播來訓練。第二,多層感知器本身可以是一個深層模型,這與特徵重用的精神一致\[2\]。本篇論文將這種新類型的層稱為mlpconv,其中以MLP取代GLM在輸入上進行卷積。圖1說明了線性卷積層與mlpconv layer的差異。mlpconv layer所執行的計算如下:
$$\begin{aligned}
& f^1_{i, j, k_1} = \max \left( {w^1_{k_1}}^T x_{i, j} + b_{k_1}, 0 \right) \\
& \qquad \vdots \\
& f^n_{i, j, k_n} = \max \left( {w^n_{k_n}}^T f^{n - 1}_{i, j} + b_{k_n}, 0 \right) \qquad (2)
\end{aligned}$$
這裡的$n$是多層感知器中層的數量。多層感知器使用整流線性單元(ReLU)做為啟動函數。
:::
:::info
From cross channel (cross feature map) pooling point of view, Equation 2 is equivalent to cascaded cross channel parametric pooling on a normal convolution layer. Each pooling layer performs weighted linear recombination on the input feature maps, which then go through a rectifier linear unit. The cross channel pooled feature maps are cross channel pooled again and again in the next layers. This cascaded cross channel parameteric pooling structure allows complex and learnable interactions of cross channel information.
:::
:::success
從跨通道~(cross channel, cross feature map)~池化的觀點來看,公式2等價於在一個一般的卷積層上做[級聯](http://terms.naer.edu.tw/detail/3646774/)的跨通道參數池化~(cascaded cross channel parametric pooling)~。每一個pooling layer都對input feature maps執行加權的線性重組,然後再通過整流線性單元。經過跨通道池化的feature maps,在接下來的各層中會一次又一次地再做跨通道池化。這種[級聯](http://terms.naer.edu.tw/detail/3646774/)跨通道參數池化的結構,讓跨通道的信息能夠進行複雜且可學習的[交互作用](http://terms.naer.edu.tw/detail/229511/)。
:::
:::info
The cross channel parametric pooling layer is also equivalent to a convolution layer with 1x1 convolution kernel. This interpretation makes it straightforawrd to understand the structure of NIN.
:::
:::success
這種跨通道參數池化層也等價於具有1x1卷積核的卷積層。這個解釋讓瞭解NIN的結構變得更為直接。
:::
:::warning
個人見解:
* 這句話已經點出,我們可以在conv layer->non-linear activation->pooling之後直接接上1x1的filter,非常直觀,如下方程式碼的示意。
:::
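:::warning
個人補充:根據「跨通道參數池化等價於1x1卷積」的說法,以下用PyTorch寫一個mlpconv block的示意(非論文原始碼;層數與channel數192/160/96參考常見的NIN實作,屬於假設值):

```python
import torch
import torch.nn as nn

# 一個 mlpconv block:一般卷積(GLM 的線性濾波)後接兩層 1x1 卷積(即文中的 MLP),
# 每層後面都接 ReLU
mlpconv = nn.Sequential(
    nn.Conv2d(3, 192, kernel_size=5, padding=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(192, 160, kernel_size=1),   # 跨通道參數池化 = 1x1 卷積
    nn.ReLU(inplace=True),
    nn.Conv2d(160, 96, kernel_size=1),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 32, 32)   # 假設輸入為 CIFAR-10 大小的影像
print(mlpconv(x).shape)         # torch.Size([1, 96, 32, 32])
```
:::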
:::info
Comparison to maxout layers: the maxout layers in the maxout network performs max pooling across multiple affine feature maps \[8\]. The feature maps of maxout layers are calculated as follows:
$$
f_{i, j, k} = \max_m \left( w^T_{k_m} x_{i, j}\right) \qquad (3)
$$
:::
:::success
與maxout layers相比:maxout network中的maxout layers跨多個affine feature maps做max pooling\[8\]。maxout layers的feature maps的計算如下:
$$
f_{i, j, k} = \max_m \left( w^T_{k_m} x_{i, j}\right) \qquad (3)
$$
:::
:::info
Maxout over linear functions forms a piecewise linear function which is capable of modeling any convex function. For a convex function, samples with function values below a specific threshold form a convex set. Therefore, by approximating convex functions of the local patch, maxout has the capability of forming separation hyperplanes for concepts whose samples are within a convex set (i.e. l2 balls, convex cones). Mlpconv layer differs from maxout layer in that the convex function approximator is replaced by a universal function approximator, which has greater capability in modeling various distributions of latent concepts.
:::
:::success
對線性函數取maxout會形成一個[分段線性](http://terms.naer.edu.tw/detail/150894/)函數,其能夠對任意的[凸函數](http://terms.naer.edu.tw/detail/2113749/)建模。對於[凸函數](http://terms.naer.edu.tw/detail/2113749/)而言,函數值低於某個閾值的樣本會形成一個[凸集](http://terms.naer.edu.tw/detail/2113763/)。因此,透過近似local patch上的凸函數,maxout能夠對樣本落在[凸集](http://terms.naer.edu.tw/detail/2113763/)(如l2 balls、[凸錐](http://terms.naer.edu.tw/detail/2113744/))內的概念形成分離的超平面。mlpconv layer與maxout layer的差異在於,凸函數逼近器被換成了通用函數逼近器,後者在對隱含概念的各種分佈建模時能力更強。
:::
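:::warning
個人補充:為了對照公式(2)與公式(3),以下用NumPy在單一個local patch上示意兩者的差異(維度、層數皆為假設值):maxout是對多組線性輸出取最大值,mlpconv則是把patch送進帶ReLU的小型MLP。

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)

x = np.random.randn(27)             # 假設:攤平後的 local patch(3x3x3)

# maxout(公式 3):k 個單元,每個單元對 m 組線性輸出取最大值
m, k = 5, 4
W_maxout = np.random.randn(k, m, 27)
f_maxout = np.max(W_maxout @ x, axis=1)        # shape: (k,)

# mlpconv(公式 2):兩層帶 ReLU 的 MLP,作用在同一個 patch 上
W1, b1 = np.random.randn(16, 27), np.zeros(16)
W2, b2 = np.random.randn(k, 16), np.zeros(k)
f_mlpconv = relu(W2 @ relu(W1 @ x + b1) + b2)  # shape: (k,)

print(f_maxout.shape, f_mlpconv.shape)
```
:::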
### 3.2 Global Average Pooling
:::info
Conventional convolutional neural networks perform convolution in the lower layers of the network. For classification, the feature maps of the last convolutional layer are vectorized and fed into fully connected layers followed by a softmax logistic regression layer \[4\] \[8\] \[11\]. This structure bridges the convolutional structure with traditional neural network classifiers. It treats the convolutional layers as feature extractors, and the resulting feature is classified in a traditional way.
:::
:::success
傳統的卷積神經網路在網路的較低層中執行卷積。對於分類,最後一個卷積層的feature maps會被向量化後送入全連接層,接著是softmax logistic regression layer\[4\] \[8\] \[11\]。這個結構將卷積結構與傳統的神經網路分類器連接在一起。它將卷積層視為特徵提取器,然後以傳統的方式對得到的特徵做分類。
:::
:::info
However, the fully connected layers are prone to overfitting, thus hampering the generalization ability of the overall network. Dropout is proposed by Hinton et al. [5] as a regularizer which randomly sets half of the activations to the fully connected layers to zero during training. It has improved the generalization ability and largely prevents overfitting [4].
:::
:::success
但是,全連接層容易過擬合,因而妨礙了整個網路的泛化能力。Hinton et al. \[5\]提出dropout做為正規化器,在訓練過程中隨機將全連接層一半的啟動值(activations)設為零。它提高了泛化能力,並且很大程度上預防了過擬合\[4\]。
:::
:::info
In this paper, we propose another strategy called global average pooling to replace the traditional fully connected layers in CNN. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer. One advantage of global average pooling over the fully connected layers is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling thus overfitting is avoided at this layer. Futhermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.
:::
:::success
在這篇論文中,我們提出另一種策略,稱為global average pooling~(全域平均池化)~,用來取代CNN中傳統的全連接層。想法是在最後一個mlpconv layer中,為分類任務的每一個相對應類別生成一張feature map。我們不是在feature maps之上加入全連接層,而是取每一張feature map的平均值,然後將得到的向量直接送入softmax layer。相較於全連接層,global average pooling的一個優點是,它強化了feature maps與類別之間的對應關係,因此更貼近卷積結構本身。如此一來,feature maps可以很容易地被解釋為類別的置信度映射(confidence maps)。另一個優點是,global average pooling沒有需要最佳化的參數,因此在這一層可以避免過擬合。此外,global average pooling匯總了空間信息,因此對輸入的空間平移更具魯棒性。
:::
:::info
We can see global average pooling as a structural regularizer that explicitly enforces feature maps to be confidence maps of concepts (categories). This is made possible by the mlpconv layers, as they makes better approximation to the confidence maps than GLMs.
:::
:::success
我們可以將global average pooling視為一個結構性的正規化器,它明確地強制feature maps成為概念(類別)的置信度映射(confidence maps)。這是由mlpconv layers使之成為可能的,因為它們比GLMs能更好地近似置信度映射。
:::
:::warning
個人見解:
* 將feature maps強制為概念(類別)的置信度映射(confidence maps)就是將feature maps映射到機率空間中,這樣子出來的資訊就直接是由feature maps所提供,而不是再經過全連接層的學習而得,也不再是黑盒子?
:::
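:::warning
個人補充:以下用PyTorch示意「最後一層mlpconv為每個類別輸出一張feature map,做global average pooling後直接送入softmax」的流程(batch大小、feature map大小與類別數皆為假設值):

```python
import torch
import torch.nn.functional as F

num_classes = 10
# 假設最後一層 mlpconv 的輸出:batch=4,每個類別各對應一張 8x8 的 feature map
feature_maps = torch.randn(4, num_classes, 8, 8)

# global average pooling:對每張 feature map 的空間維度取平均,沒有任何可學參數
gap = feature_maps.mean(dim=(2, 3))      # shape: (4, 10)

# 得到的向量直接送入 softmax,即為各類別的置信度
probs = F.softmax(gap, dim=1)
print(probs.shape, probs.sum(dim=1))     # (4, 10),每列總和為 1
```
:::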
### 3.3 Network In Network Structure
:::info
The overall structure of NIN is a stack of mlpconv layers, on top of which lie the global average pooling and the objective cost layer. Sub-sampling layers can be added in between the mlpconv layers as in CNN and maxout networks. Figure 2 shows an NIN with three mlpconv layers. Within each mlpconv layer, there is a three-layer perceptron. The number of layers in both NIN and the micro networks is flexible and can be tuned for specific tasks.
:::
:::success
NIN的整體結構是mlpconv layers的堆疊,最上面是global average pooling與objective cost layer。如同CNN與maxout networks一般,可以在mlpconv layers之間加入[sub-sampling](http://terms.naer.edu.tw/detail/2125658/) layers。圖2展示一個具有三個mlpconv layers的NIN。在每個mlpconv layer中都有一個三層的感知器。NIN與微網路的層數都很有彈性,可以針對特定任務進行調整。
:::
:::info

Figure 2: The overall structure of Network In Network. In this paper the NINs include the stacking of three mlpconv layers and one global average pooling layer.
圖2:Network In Network的整體結構。論文中,NINs包含三個mlpconv layers的堆疊與一個global average pooling layer。
:::
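:::warning
個人補充:依照圖2的描述(三個mlpconv layers堆疊,中間夾max pooling與dropout,最後接global average pooling),以下是一個PyTorch的結構示意;各層的channel數與kernel大小參考常見的CIFAR-10 NIN實作,屬於假設值,實際設定以論文補充材料為準。

```python
import torch
import torch.nn as nn

def mlpconv(in_ch, ch1, ch2, ch3, kernel, padding):
    # 一般卷積 + 兩層 1x1 卷積,皆接 ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, ch1, kernel, padding=padding), nn.ReLU(inplace=True),
        nn.Conv2d(ch1, ch2, 1), nn.ReLU(inplace=True),
        nn.Conv2d(ch2, ch3, 1), nn.ReLU(inplace=True),
    )

nin = nn.Sequential(
    mlpconv(3, 192, 160, 96, kernel=5, padding=2),
    nn.MaxPool2d(3, stride=2, padding=1),
    nn.Dropout(0.5),
    mlpconv(96, 192, 192, 192, kernel=5, padding=2),
    nn.MaxPool2d(3, stride=2, padding=1),
    nn.Dropout(0.5),
    mlpconv(192, 192, 192, 10, kernel=3, padding=1),  # 最後輸出 10 張 feature map 對應 10 個類別
    nn.AdaptiveAvgPool2d(1),                          # global average pooling
    nn.Flatten(),                                     # 輸出 (batch, 10),之後接 softmax / cross entropy
)

x = torch.randn(2, 3, 32, 32)
print(nin(x).shape)   # torch.Size([2, 10])
```
:::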
## 4 Experiments
### 4.1 Overview
:::info
We evaluate NIN on four benchmark datasets: CIFAR-10 \[12\], CIFAR-100 \[12\], SVHN \[13\] and MNIST \[1\]. The networks used for the datasets all consist of three stacked mlpconv layers, and the mlpconv layers in all the experiments are followed by a spatial max pooling layer which downsamples the input image by a factor of two. As a regularizer, dropout is applied on the outputs of all but the last mlpconv layers. Unless stated specifically, all the networks used in the experiment section use global average pooling instead of fully connected layers at the top of the network. Another regularizer applied is weight decay as used by Krizhevsky et al. \[4\]. Figure 2 illustrates the overall structure of NIN network used in this section. The detailed settings of the parameters are provided in the supplementary materials. We implement our network on the super fast cuda-convnet code developed by Alex Krizhevsky \[4\]. Preprocessing of the datasets, splitting of training and validation sets all follow Goodfellow et al. \[8\].
:::
:::success
我們在四個基準資料集上評估NIN:CIFAR-10 \[12\]、CIFAR-100 \[12\]、SVHN \[13\]與MNIST \[1\]。用於這些資料集的網路都包含三個堆疊的mlpconv layers,而且所有實驗中的mlpconv layers後面都接一個spatial max pooling layer,將輸入影像[降採樣](http://terms.naer.edu.tw/detail/6615261/)兩倍。做為正規化器,除了最後一個mlpconv layer之外,其餘mlpconv layers的輸出都套用dropout。除非特別說明,實驗中所使用的所有網路,其頂部都使用global average pooling而非全連接層。另一個使用的正規化器則是Krizhevsky et al. \[4\]所用的權重衰減。圖2說明這一節中所使用NIN網路的整體結構。參數的細部設置則在補充材料中提供。我們在Alex Krizhevsky \[4\]所開發的超快速cuda-convnet程式碼上實作我們的網路。資料集的預處理、訓練集與驗證集的分割都遵循Goodfellow et al. \[8\]。
:::
:::info
We adopt the training procedure used by Krizhevsky et al. \[4\]. Namely, we manually set proper initializations for the weights and the learning rates. The network is trained using mini-batches of size 128. The training process starts from the initial weights and learning rates, and it continues until the accuracy on the training set stops improving, and then the learning rate is lowered by a scale of 10. This procedure is repeated once such that the final learning rate is one percent of the initial value.
:::
:::success
我們採用Krizhevsky et al. \[4\]所使用的訓練程序。也就是,我們手動設置適當的權重初始值與learning rates。網路使用大小為128的mini-batches進行訓練。訓練過程從初始的權重與learning rates開始,一直持續到訓練集上的準確度不再提升,然後將learning rate除以10。這個程序重複一次,使得最終的learning rate是初始值的百分之一。
:::
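:::warning
個人補充:以下用PyTorch大略示意上述的訓練流程:mini-batch大小128、當訓練集準確率不再提升就把learning rate除以10、重複一次使最終learning rate為初始值的百分之一。其中的模型、初始learning rate、momentum與weight decay皆為假設的佔位設定,並非論文數值。

```python
import torch
import torch.nn as nn

# 佔位用的小模型,實際應替換為前面示意的 NIN
model = nn.Sequential(nn.Conv2d(3, 10, 3, padding=1),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)  # weight decay 做為正規化器
criterion = nn.CrossEntropyLoss()

images = torch.randn(128, 3, 32, 32)          # mini-batch 大小 128(假資料)
labels = torch.randint(0, 10, (128,))

def train_step():
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

def drop_lr(factor=10.0):
    # 當訓練集準確率不再提升時呼叫
    for group in optimizer.param_groups:
        group['lr'] /= factor

print(train_step())
drop_lr()   # 第一次:0.1 -> 0.01
drop_lr()   # 重複一次:0.01 -> 0.001,即初始值的百分之一
```
:::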
### 4.2 CIFAR-10
:::info
The CIFAR-10 dataset \[12\] is composed of 10 classes of natural images with 50,000 training images in total, and 10,000 testing images. Each image is an RGB image of size 32x32. For this dataset, we apply the same global contrast normalization and ZCA whitening as was used by Goodfellow et al. in the maxout network \[8\]. We use the last 10,000 images of the training set as validation data.
:::
:::success
CIFAR-10資料集\[12\]由10個類別的自然影像組成,總共有50,000張訓練影像與10,000張測試影像。每一張都是32x32的RGB影像。對於這個資料集,我們採用與Goodfellow et al.在maxout network\[8\]中相同的global contrast normalization與ZCA whitening。我們用訓練集中最後10,000張影像做為驗證資料。
:::
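:::warning
個人補充:global contrast normalization與ZCA whitening可以用NumPy如下示意(資料為隨機值、epsilon等細節為假設,實際前處理請參考maxout network\[8\]的設定):

```python
import numpy as np

# 假設:X 的每一列是一張攤平後的影像 (N, 3*32*32),此處以隨機資料代替
X = np.random.rand(500, 3 * 32 * 32).astype(np.float64)

# global contrast normalization:每張影像減去自身平均值並除以自身標準差
X_gcn = X - X.mean(axis=1, keepdims=True)
X_gcn = X_gcn / (X_gcn.std(axis=1, keepdims=True) + 1e-8)

# ZCA whitening:對訓練集的共變異數矩陣做特徵分解,白化後再投影回原空間
mean = X_gcn.mean(axis=0)
Xc = X_gcn - mean
cov = Xc.T @ Xc / Xc.shape[0]
eigval, eigvec = np.linalg.eigh(cov)
eps = 1e-2                                   # 假設的平滑項
W_zca = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
X_white = Xc @ W_zca

print(X_white.shape)   # (500, 3072)
```
:::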
:::info
The number of feature maps for each mlpconv layer in this experiment is set to the same number as in the corresponding maxout network. Two hyper-parameters are tuned using the validation set, i.e. the local receptive field size and the weight decay. After that the hyper-parameters are fixed and we re-train the network from scratch with both the training set and the validation set. The resulting model is used for testing. We obtain a test error of 10.41% on this dataset, which improves more than one percent compared to the state-of-the-art. A comparison with previous methods is shown in Table 1.
:::
:::success
這次實驗中,每一個mlpconv layer的feature maps數量都設置成與相應的maxout network相同。使用驗證集調校兩個超參數,即局部接受域大小與權重衰減。之後固定超參數,我們用訓練集加上驗證集從頭重新訓練網路,並將得到的模型用於測試。我們在這個資料集上獲得10.41%的測試誤差,比目前最佳結果改善了1%以上。Table 1說明與先前方法的比較。
:::
:::info
Table 1: Test set error rates for CIFAR-10 of various methods.

:::
:::info
It turns out in our experiment that using dropout in between the mlpconv layers in NIN boosts the performance of the network by improving the generalization ability of the model. As is shown in Figure 3, introducing dropout layers in between the mlpconv layers reduced the test error by more than 20%. This observation is consistant with Goodfellow et al. \[8\]. Thus dropout is added in between the mlpconv layers to all the models used in this paper. The model without dropout regularizer achieves an error rate of 14.51% for the CIFAR-10 dataset, which already surpasses many previous state-of-the-arts with regularizer (except maxout). Since performance of maxout without dropout is not available, only dropout regularized version are compared in this paper.
:::
:::success
我們的實驗結果顯示,在NIN的mlpconv layers之間使用dropout,能藉由提升模型的泛化能力進而提升網路效能。如圖3所示,在mlpconv layers之間引入dropout layers使測試誤差降低了20%以上。這個觀察與Goodfellow et al. \[8\]一致。因此,本論文中所有的模型都在mlpconv layers之間加入dropout。在CIFAR-10資料集上,沒有dropout正規化器的模型,其誤差率為14.51%,已經超越許多先前帶有正規化器的最佳方法(maxout除外)。由於無法取得maxout不使用dropout時的效能,本論文僅比較使用dropout正規化的版本。
:::
:::info

Figure 3: The regularization effect of dropout in between mlpconv layers. Training and testing error of NIN with and without dropout in the first 200 epochs of training is shown.
Figure 3:在mlpconv layers之間,dropout的正規化效果。說明在前200個epochs中,NIN加與不加dropout的訓練與測試誤差。
:::
:::info
To be consistent with previous works, we also evaluate our method on the CIFAR-10 dataset with translation and horizontal flipping augmentation. We are able to achieve a test error of 8.81%, which sets the new state-of-the-art performance.
:::
:::success
為了與先前的工作一致,我們也在使用平移與水平翻轉增強的CIFAR-10資料集上評估我們的方法。我們達到8.81%的測試誤差,創下新的最佳效能。
:::
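:::warning
個人補充:平移與水平翻轉的資料增強可以用PyTorch如下示意(padding 4個像素屬於假設值,論文並未說明實際的平移幅度):

```python
import random
import torch
import torch.nn.functional as F

img = torch.rand(3, 32, 32)                  # 假資料,僅供示意

# 水平翻轉:沿著寬度維度翻轉
flipped = torch.flip(img, dims=[2])

# 平移:先 zero padding 4 個像素,再隨機裁回 32x32
padded = F.pad(img, (4, 4, 4, 4))            # shape: (3, 40, 40)
top, left = random.randint(0, 8), random.randint(0, 8)
translated = padded[:, top:top + 32, left:left + 32]

print(flipped.shape, translated.shape)       # 皆為 torch.Size([3, 32, 32])
```
:::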
### 4.3 CIFAR-100
:::info
The CIFAR-100 dataset \[12\] is the same in size and format as the CIFAR-10 dataset, but it contains 100 classes. Thus the number of images in each class is only one tenth of the CIFAR-10 dataset. For CIFAR-100 we do not tune the hyper-parameters, but use the same setting as the CIFAR-10 dataset. The only difference is that the last mlpconv layer outputs 100 feature maps. A test error of 35.68% is obtained for CIFAR-100 which surpasses the current best performance without data augmentation by more than one percent. Details of the performance comparison are shown in Table 2.
:::
:::success
CIFAR-100資料集\[12\]在大小與格式上都與CIFAR-10相同,但它包含100個類別。因此,每個類別的影像數量只有CIFAR-10資料集的十分之一。針對CIFAR-100,我們沒有另外調整超參數,而是使用與CIFAR-10資料集相同的設置。唯一的不同只有最後一層mlpconv layer輸出100個feature maps。CIFAR-100獲得35.68%的測試誤差,比目前不使用資料增強的最佳效能好了1%以上。Table 2說明效能比較的細節。
:::
:::info
Table 2: Test set error rates for CIFAR-100 of various methods.

:::
### 4.4 Street View House Numbers
:::info
The SVHN dataset \[13\] is composed of 630,420 32x32 color images, divided into training set, testing set and an extra set. The task of this data set is to classify the digit located at the center of each image. The training and testing procedure follow Goodfellow et al. \[8\]. Namely 400 samples per class selected from the training set and 200 samples per class from the extra set are used for validation. The remainder of the training set and the extra set are used for training. The validation set is only used as a guidance for hyper-parameter selection, but never used for training the model.
:::
:::success
SVHN資料集\[13\]由630,420張32x32彩色影像所組成,分為訓練集、測試集與附加集。這資料集的任務是對每一張影像中心位置的數字進行分類。訓練與測試程序都遵循Goodfellow et al. \[8\]。也就是,從訓練集中的每個類別選出400張樣本,以及從附加集中的每個類別選出200張樣本,以作為驗證集。餘下的訓練集與附加集都拿來做為訓練使用。驗證集單純的做為超參數選擇的一個指南,絕不拿來做為模型訓練用。
:::
:::info
Preprocessing of the dataset again follows Goodfellow et al. \[8\], which was a local contrast normalization. The structure and parameters used in SVHN are similar to those used for CIFAR-10, which consist of three mlpconv layers followed by global average pooling. For this dataset, we obtain a test error rate of 2.35%. We compare our result with methods that did not augment the data, and the comparison is shown in Table 3.
:::
:::success
資料的預處理也是遵循Goodfellow et al.\[8\]的作法,也就是局部對比的正規化。SVHN中使用的結構與參數類似於CIFAR-10所使用,包含三個mlpconv layers,然後是global average pooling。對於這個資料集,我們獲得2.35%的誤差率。我們將結果與未做資料增強的方法做比較,如Table 3所示。
:::
:::info
Table 3: Test set error rates for SVHN of various methods.

:::
### 4.5 MNIST
:::info
The MNIST \[1\] dataset consists of hand written digits 0-9 which are 28x28 in size. There are 60,000 training images and 10,000 testing images in total. For this dataset, the same network structure as used for CIFAR-10 is adopted. But the numbers of feature maps generated from each mlpconv layer are reduced. Because MNIST is a simpler dataset compared with CIFAR-10; fewer parameters are needed. We test our method on this dataset without data augmentation. The result is compared with previous works that adopted convolutional structures, and are shown in Table 4.
:::
:::success
MNIST資料集\[1\]由大小為28x28的手寫數字0-9影像所組成,總共有60,000張訓練影像與10,000張測試影像。針對這個資料集,採用與CIFAR-10相同的網路結構,但每一層mlpconv layer所生成的feature maps數量有所減少,因為相較於CIFAR-10,MNIST是較簡單的資料集,需要的參數較少。我們在這個資料集上測試我們的方法時沒有使用資料增強。結果與先前採用卷積結構的工作比較,如Table 4所示。
:::
:::info
Table 4: Test set error rates for MNIST of various methods.

:::
:::info
We achieve comparable but not better performance (0.47%) than the current best (0.45%) since MNIST has been tuned to a very low error rate.
:::
:::success
因為MNIST已經被調校到非常低的誤差率,因此我們獲得的效能(0.47%)雖然可以與目前最佳效能(0.45%)相比擬,但卻不是最好的。
:::
### 4.6 Global Average Pooling as a Regularizer
:::info
Global average pooling layer is similar to the fully connected layer in that they both perform linear transformations of the vectorized feature maps. The difference lies in the transformation matrix. For global average pooling, the transformation matrix is prefixed and it is non-zero only on block diagonal elements which share the same value. Fully connected layers can have dense transformation matrices and the values are subject to back-propagation optimization. To study the regularization effect of global average pooling, we replace the global average pooling layer with a fully connected layer, while the other parts of the model remain the same. We evaluated this model with and without dropout before the fully connected linear layer. Both models are tested on the CIFAR-10 dataset, and a comparison of the performances is shown in Table 5.
:::
:::success
global average pooling layer與全連接層類似,因為它們都對向量化後的feature maps執行線性轉換,差異在於[轉換矩陣](http://terms.naer.edu.tw/detail/958673/)。對global average pooling而言,[轉換矩陣](http://terms.naer.edu.tw/detail/958673/)是預先固定的,而且只有在區塊對角的元素上非零,且這些元素共享相同的數值。全連接層則可以有稠密的[轉換矩陣](http://terms.naer.edu.tw/detail/958673/),其數值由反向傳播最佳化。為了研究global average pooling的正規化效果,我們將global average pooling layer替換成全連接層,而模型的其餘部份維持不變。我們評估這個模型在全連接層之前有無dropout的情況。兩個模型都在CIFAR-10資料集上測試,效能比較如Table 5所示。
:::
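:::warning
個人補充:「global average pooling等價於一個固定的、只有區塊對角元素非零且數值相同的轉換矩陣」可以用NumPy如下示意(feature map大小與類別數為假設值):

```python
import numpy as np

C, H, W = 10, 6, 6                      # 假設:10 個類別,每張 feature map 為 6x6
maps = np.random.randn(C, H, W)
v = maps.reshape(-1)                    # 向量化後長度為 C*H*W

# GAP 對應的轉換矩陣:固定不學習、區塊對角、每個區塊內的元素皆為 1/(H*W)
M_gap = np.zeros((C, C * H * W))
for k in range(C):
    M_gap[k, k * H * W:(k + 1) * H * W] = 1.0 / (H * W)

# 全連接層則對應一個稠密、可由反向傳播學習的矩陣
M_fc = np.random.randn(C, C * H * W)

print(np.allclose(M_gap @ v, maps.mean(axis=(1, 2))))  # True:兩種算法一致
print((M_gap != 0).sum(), (M_fc != 0).sum())           # GAP 的非零元素少得多且固定
```
:::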
:::info
Table 5: Global average pooling compared to fully connected layer.

:::
:::info
As is shown in Table 5, the fully connected layer without dropout regularization gave the worst performance (11.59%). This is expected as the fully connected layer overfits to the training data if no regularizer is applied. Adding dropout before the fully connected layer reduced the testing error (10.88%). Global average pooling has achieved the lowest testing error (10.41%) among the three.
:::
:::success
如Table 5所示,沒有dropout正規化的全連接層得到最糟的效能(11.59%)。這是可以預期的,因為,如果沒有使用正規化器,全連接層很容易過擬合。在全連接層之前加入dropout可以減少測試誤差(10.88%)。Global average pooling在三者之間得到最低的測試誤差(10.41%)。
:::
:::info
We then explore whether the global average pooling has the same regularization effect for conventional CNNs. We instantiate a conventional CNN as described by Hinton et al. \[5\], which consists of three convolutional layers and one local connection layer. The local connection layer generates 16 feature maps which are fed to a fully connected layer with dropout. To make the comparison fair, we reduce the number of feature map of the local connection layer from 16 to 10, since only one feature map is allowed for each category in the global average pooling scheme. An equivalent network with global average pooling is then created by replacing the dropout + fully connected layer with global average pooling. The performances were tested on the CIFAR-10 dataset.
:::
:::success
接下來,我們探討global average pooling對傳統的CNN是否有相同的正規化效果。我們依照Hinton et al. \[5\]所述,實作一個傳統的CNN,包含三個卷積層與一個局部連接層。局部連接層生成16張feature maps,送入帶有dropout的全連接層。為了公平比較,我們將局部連接層的feature maps數量由16減少至10,因為在global average pooling的方案中,每個類別只允許一張feature map。然後,把dropout + 全連接層換成global average pooling,建立一個等價的網路。效能在CIFAR-10資料集上測試。
:::
:::info
This CNN model with fully connected layer can only achieve the error rate of 17.56%. When dropout is added we achieve a similar performance (15.99%) as reported by Hinton et al. \[5\]. By replacing the fully connected layer with global average pooling in this model, we obtain the error rate of 16.46%, which is one percent improvement compared with the CNN without dropout. It again verifies the effectiveness of the global average pooling layer as a regularizer. Although it is slightly worse than the dropout regularizer result, we argue that the global average pooling might be too demanding for linear convolution layers as it requires the linear filter with rectified activation to model the confidence maps of the categories.
:::
:::success
使用全連接層的這個CNN模型,其誤差率只能達到17.56%。加入dropout後,我們得到與Hinton et al. \[5\]所報告相近的效能(15.99%)。將這個模型的全連接層換成global average pooling後,我們獲得16.46%的誤差率,相較於沒有dropout的CNN改善了1%。這再次驗證了global average pooling layer做為正規化器的效果。雖然它比dropout正規化器的結果略差,但我們認為,global average pooling對線性卷積層而言可能要求過高,因為它需要帶整流啟動的線性濾波器來對類別的置信度映射建模。
:::
### 4.7 Visualization of NIN
:::info
We explicitly enforce feature maps in the last mlpconv layer of NIN to be confidence maps of the categories by means of global average pooling, which is possible only with stronger local receptive field modeling, e.g. mlpconv in NIN. To understand how much this purpose is accomplished, we extract and directly visualize the feature maps from the last mlpconv layer of the trained model for CIFAR-10.
:::
:::success
我們透過global average pooling,明確地強制NIN最後一層mlpconv layer的feature maps成為類別的置信度映射,而這只有在更強的局部接受域建模(例如NIN中的mlpconv)之下才有可能。為了瞭解這個目的達成的程度,我們從以CIFAR-10訓練好的模型中,提取最後一層mlpconv layer的feature maps並直接可視化。
:::
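:::warning
個人補充:「只顯示feature maps中前10%的啟動值」的做法可以用NumPy如下示意(以隨機張量代替訓練好的模型輸出,feature map大小為假設值):

```python
import numpy as np

# 假設:最後一層 mlpconv 對某張影像輸出 10 張 8x8 的 feature maps
feature_maps = np.random.rand(10, 8, 8)

# 每張 feature map 只保留前 10% 的啟動值,其餘設為 0
threshold = np.percentile(feature_maps, 90, axis=(1, 2), keepdims=True)
top10 = np.where(feature_maps >= threshold, feature_maps, 0.0)

pred = feature_maps.mean(axis=(1, 2)).argmax()   # GAP 後取最大者即為預測類別
print(pred, (top10 > 0).mean())                  # 約有 10% 的位置被保留
```
:::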
:::info
Figure 4 shows some examplar images and their corresponding feature maps for each of the ten categories selected from CIFAR-10 test set. It is expected that the largest activations are observed in the feature map corresponding to the ground truth category of the input image, which is explicitly enforced by global average pooling. Within the feature map of the ground truth category, it can be observed that the strongest activations appear roughly at the same region of the object in the original image. It is especially true for structured objects, such as the car in the second row of Figure 4. Note that the feature maps for the categories are trained with only category information. Better results are expected if bounding boxes of the objects are used for fine grained labels.
:::
:::success
圖4顯示從CIFAR-10測試集中十個類別各自選出的一些範例影像,以及它們所對應的feature maps。正如預期,最大的啟動值(激活)出現在與輸入影像真實類別相對應的feature map上,這是由global average pooling明確強制而得。在真實類別的feature map中,可以觀察到最強的啟動值大致出現在原始影像中物件所在的相同區域,對結構化的物件尤其如此(如圖4第二個row的car)。要注意的是,類別的feature maps只使用類別信息來訓練。如果使用物件的邊界框做為更細緻~(fine grained)~的標記,可以預期得到更好的結果。
:::
:::info

Figure 4: Visualization of the feature maps from the last mlpconv layer. Only top 10% activations in the feature maps are shown. The categories corresponding to the feature maps are: 1. airplane, 2. automobile, 3. bird, 4. cat, 5. deer, 6. dog, 7. frog, 8. horse, 9. ship, 10. truck. Feature maps corresponding to the ground truth of the input images are highlighted. The left panel and right panel are just different examplars.
Figure 4:從最後一層mlpconv layer可視化feature maps。僅顯示feature maps中啟動最高的10%。與feature map相對應的類別為:1. 飛機,2. 汽車,3. 鳥,4. 貓,5. 鹿,6. 狗,7. 青蛙,8. 馬,9.船,10. 卡車。突出顯示與輸入影像的真實類別相對應的feature map。左、右兩邊是不同的範例。
:::
:::info
The visualization again demonstrates the effectiveness of NIN. It is achieved via a stronger local receptive field modeling using mlpconv layers. The global average pooling then enforces the learning of category level feature maps. Further exploration can be made towards general object detection. Detection results can be achieved based on the category level feature maps in the same flavor as in the scene labeling work of Farabet et al. \[20\]
:::
:::success
可視化再次證明了NIN的效果。這是藉由mlpconv layers所提供的更強局部接受域建模而達成,接著global average pooling強制網路學習類別層級的feature maps。未來可以朝一般的目標檢測進一步探索:以與Farabet et al. \[20\]的場景標記工作相同的方式,基於類別層級的feature maps便能取得檢測結果。
:::
## 5 Conclusions
:::info
We proposed a novel deep network called “Network In Network” (NIN) for classification tasks. This new structure consists of mlpconv layers which use multilayer perceptrons to convolve the input and a global average pooling layer as a replacement for the fully connected layers in conventional CNN. Mlpconv layers model the local patches better, and global average pooling acts as a structural regularizer that prevents overfitting globally. With these two components of NIN we demonstrated state-of-the-art performance on CIFAR-10, CIFAR-100 and SVHN datasets. Through visualization of the feature maps, we demonstrated that feature maps from the last mlpconv layer of NIN were confidence maps of the categories, and this motivates the possibility of performing object detection via NIN.
:::
:::success
我們為分類任務提出一個新的深度網路,稱為"Network In Network"(NIN)。這個新的結構由mlpconv layers(使用多層感知器對輸入做卷積)與global average pooling layer(取代傳統CNN中的全連接層)所組成。mlpconv layers能對local patches做更好的建模,而global average pooling則作為結構性的正規化器,從整體上預防過擬合。憑藉NIN的這兩個組件,我們在CIFAR-10、CIFAR-100與SVHN資料集上展示了最先進的效能。透過feature maps的可視化,我們證明NIN最後一層mlpconv layer的feature maps是類別的置信度映射,這也帶出了用NIN執行目標檢測的可能性。
:::