Going deeper with convolutions (Inception) (translation)

tags: inception CNN 論文翻譯 deeplearning

Shaoe.chen, Thu, Feb 24, 2020

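Notes

Blocks are classified as follows: original-text blocks have a blue background and translated blocks have a green background. Some technical terms follow the translations published by the National Academy for Educational Research.

Original text

Translation

Personal annotations. If any part of the translation reads poorly, please leave a comment with corrections.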

Abstract

We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

我們提出一個深度卷積神經網路架構,代號為Inception,這個架構負責設置ImageNet 2014大型視覺辨識挑戰賽(ILSVRC14)中分類與偵測的最新技術。這架構的主要特點在於提高網路內部計算資源的利用率。這是精心設計所實現的,該架構允許在增加網路的深度與寬度的同時保持計算的預算不變。為了最佳化品質,架構的決策基於Hebbian原則與multi-scale(多尺度)處理的直覺。在我們所提交的ILSVRC14中使用的一種特別的實體稱為GoogLeNet,是一個22層的深度網路,其質量在分類與偵測的情境中進行評估。

Hebbian理論是一種神經科學理論,解釋在學習過程中腦中神經元所發生的變化(參考維基百科),如果兩個神經元總是同時被激發,那它們就是一種組合。

1 Introduction

In the last three years, mainly due to the advances of deep learning, more concretely convolutional networks [10], the quality of image recognition and object detection has been progressing at a dramatic pace. One encouraging news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12× fewer parameters than the winning architecture of Krizhevsky et al [9] from two years ago, while being significantly more accurate. The biggest gains in object-detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al [6].

過去三年中,主要由於深度學習的進步,更具體的說是卷積神經網路[10],影像辨識與物體偵測的質量以驚人的速度成長。一個令人鼓舞的消息是,多數的發展不僅是因為更強大的硬體,更大的資料集以更大的模型,最主要是因為創新思維,演算法與網路架構的改進。舉例來說,ILSVRC 2014挑戰賽中的前幾名作品,除了用於偵測目的同一比賽的分類資料之外,並沒有用新的資料集。事實上,我們的GoogLeNet提交到ILSVRC 2014的參數量,比起兩年前的Krizhevsky et al [9]的冠軍架構的參數量還要少12倍,但精確度卻大幅提升。在目標檢測中的最大收獲並不是來自單純的深度網路或更大的模型,而是來自深度架構與經典電腦視覺的協同作用,就像是Girshick et al [6]的R-CNN演算法。

Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up to be a purely academic curiosity, but could be put to real world use, even on large datasets, at a reasonable cost.

另一個值得注意的因素是,隨著手機與崁入式計算的發展,我們演算法的效率-特別是功率與記憶體的使用-變的愈來愈重要。值得注意的是,此論文提出的深度架構的設計考慮包含了這個因素,而不單純的關注在準確數字。大多數的實驗中,模型的設計在推理時會維持在15億次的multiply-adds的計算預算,這樣它們就不會成為單純的學術好奇,而可以以合理的成本投入實際世界的應用,即使是在大型資料集上。

In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, on which it significantly outperforms the current state of the art.

論文中,我們將關注於一個用於電腦視覺的高效深度神經網路架構,代號為Inception,其名稱源自Lin et al [12]的Network in network論文,以及著名的"we need to go deeper"的網路媒因[1]。在我們的案例中,"deep"乙詞有兩種不同的涵義:首先,某種意義上,我們以"Inception module"的形式引入一種新的組織層次,更直接的意義上來說是網路深度的增加。通常,人們可以將Inception model視為[12]的邏輯頂點,同時從Arora et al [2]的理論工作中得到靈感與指導。此架構的優勢已經在ILSVRC 2014分類與偵測挑戰賽中得到實驗驗證,在該挑戰賽上,它明顯優於當前最佳技術。

2 Related Work

Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure – stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to-date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as Imagenet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.

從LeNet-5[10]開始,卷積神經網路(CNN)通常擁有一種標準結構-堆疊卷積層(可選-後面是接contrast normalization與max-pooling),然後是一個或多個全連接層。這種基本設置的變體在影像分類文獻中很普遍,而且已經在MNIST,CIFAR與最知名的ImageNet分類挑戰賽[9, 21]上取得目前為止的最佳結果。對於較大型的資料集(如Imangenet),近來的趨勢是增加layer的數量[12]與layer size[21, 14],同時使用dropout[7]來解決過擬合的問題。

Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19]. Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] use a series of fixed Gabor filters of different sizes in order to handle multiple scales, similarly to the Inception model. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception model are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.

儘管擔心max-pooling layers會有準確空間信息丟失的問題,但與[9]相同的卷積網路架構也已經成功的應用於定位[9, 14],目標檢測[6, 14, 18, 5]與人體姿態估計[19]。受到靈長類視覺皮質神經科學模型的啟發,Serre et al. [15]使用一系列不同大小的Gabor filters來處理多個尺度(類似於Inception model)。然而,與[15]固定2層的深度模型相反,Inception model中的所有的filters是學習來的。此外,Inception layers重覆多次,導致GoogLeNet Model在這種情況下變為擁有22層的深度模型。

Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. When applied to convolutional layers, the method could be viewed as additional 1×1 convolutional layers followed typically by the rectified linear activation [9]. This enables it to be easily integrated in the current CNN pipelines. We use this approach heavily in our architecture. However, in our setting, 1 × 1 convolutions have dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks, that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without significant performance penalty.

Network-in-Network是由Lin et al. [12]所提出的一種方法,為了增加神經網路的表示能力。當應用在卷積層的時候,此方法可以被視為是附加的1x1卷積層,然後通常是經過整流線性啟動(激活)[9]。這讓它(Network-in-Network)能夠很輕鬆的整合到當前的CNN pipelines。我們大量的在我們的架構中使用這個方法。但是,在我們的環境中,1x1卷積有兩個目的:最關鍵的是,它們主要用來做為維度降低的模組,以消除計算瓶頸,否則會限制我們網路的規模。這不僅可以增加網路的深度,還可以增加網路的寬度,而且不會有明顯的效能損失。

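Personal notes:

  • This passage mainly explains the advantage of Network-in-Network: it can be used for dimension reduction, and the same approach also helps extract richer features.
  • See Andrew Ng's course for reference.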
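
To make the 1×1 convolution described above concrete, here is a minimal pure-Python sketch. It treats the layer as a per-pixel linear map across channels followed by ReLU, which is how the paragraph characterizes it. The function name, the toy feature map, and the weights are all hypothetical and chosen only for illustration.

```python
def conv1x1_relu(feature_map, weights):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map: H x W x C_in nested lists.
    weights:     C_out x C_in nested lists (one row per output filter).
    """
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:
            # Each output channel is a dot product over the input channels of
            # this pixel only; no neighbouring pixels are involved.
            out_row.append([
                max(0.0, sum(w * x for w, x in zip(filt, pixel)))
                for filt in weights
            ])
        out.append(out_row)
    return out


# Hypothetical toy input: a 2x2 grid with 4 channels, reduced to 2 channels.
fmap = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
        [[-1.0, -1.0, -1.0, -1.0], [2.0, 0.0, 2.0, 0.0]]]
w = [[0.25, 0.25, 0.25, 0.25],   # averages the four channels
     [1.0, -1.0, 1.0, -1.0]]     # alternating-sign filter
reduced = conv1x1_relu(fmap, w)
```

The spatial size is unchanged while the channel count drops from 4 to 2, which is the dimension-reduction role the paper assigns to these layers.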

The current leading approach for object detection is the Regions with Convolutional Neural Networks (R-CNN) proposed by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: to first utilize low-level cues such as color and superpixel consistency for potential object proposals in a category-agnostic fashion, and to then use CNN classifiers to identify object categories at those locations. Such a two stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the highly powerful classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.

當前目標檢測領先的方法是由Girshick et al. [6]所提出的Regions with Convolutional Neural Networks (R-CNN)。R-CNN將檢測問題分解為兩個子問題:首先利用與類別無關的方法將低階信息~(low-level cues)~(如顏色與超像素一致性)用於潛在目標建議,然後使用CNN分類器在那些位置辨識目標類別。這種兩階段的方法利用了擁有低階信息的邊界框分段的準確度以及最先進的CNNs強大的分類能力。我們在提交的檢測任務中採用了類似的pipeline,但在兩個階段都進行改進,像是針對更高的目標邊界框召回率的multi-box[5]的預測,以及對邊界框建議做更好的分類的組合方法。

3 Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth – the number of levels – of the network and its width: the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However this simple solution comes with two major drawbacks.

改善深度神經網路的效能最直接的方法就是增加它們的規模。這包含增加深度(層數)與寬度(每一層的單元數)。這是訓練高品質模型的一個簡單又安全的方法,特別是考慮到有大量的標記訓練資料的可用性。但是,這種簡單的解決方案有兩個主要缺點。

Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This can become a major bottleneck, since the creation of high quality training sets can be tricky and expensive, especially if expert human raters are necessary to distinguish between fine-grained visual categories like those in ImageNet (even in the 1000-class ILSVRC subset) as demonstrated by Figure 1.

較大的尺寸通常意味著更多的參數量,這使得放大後的網路更容易過擬合,特別是,如果訓練集中有標記的樣本量是有限的情況。這可能會成為主要的瓶頸,因為建立高品質的訓練集會是非常棘手與昂貴的,特別是如果需要專業的評估人員來區分像是ImageNet中那樣細粒度的視覺類別((甚至是ILSVRC子集的1000-class)),如圖1所示。

Figure 1: Two distinct classes from the 1000 classes of the ILSVRC 2014 classification challenge.

圖1:ILSVRC 2014分類挑戰賽的1000個類別中不同的兩個類別。

Another drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up to be close to zero), then a lot of computation is wasted. Since in practice the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of results.

均勻增加網路大小的另一個缺點就是計算資源明顯的增加。舉例來說,在一個深度視覺網路中,如果兩個卷積層鏈接在一起,它們的filter數量任何的均勻增加都會導致計算量的平方增加。如果增加的容量其使用效率不佳(舉例來說,如果多數權重最終都接近零),那就會浪費大量的計算量。實際上因為計算的預算始終是有限的,即使主要的目標是增加結果的質量,也要有效分配計算資源,而不是隨意增加網路大小。
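
The quadratic growth mentioned above can be checked with simple arithmetic. The sketch below counts multiply-adds for the second of two chained convolutional layers, whose input depth equals the first layer's filter count. The layer sizes are hypothetical; stride 1 and "same" padding are assumed so that the spatial size stays fixed.

```python
def conv_mult_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds of a stride-1, same-padded convolution layer."""
    return height * width * kernel * kernel * c_in * c_out


def chained_cost(filters, size=28, kernel=3):
    # Second layer of a chain where both layers have `filters` filters:
    # its cost scales with filters * filters.
    return conv_mult_adds(size, size, kernel, filters, filters)


base = chained_cost(64)
doubled = chained_cost(128)
ratio = doubled / base  # doubling the filters quadruples the cost
```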

The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well known Hebbian principle – neurons that fire together, wire together – suggests that the underlying idea is applicable even under less strict conditions, in practice.

解決這兩個問題的根本方法是最終從完全連接的架構轉到稀疏連接的架構,即使在卷積內部也是如此。除了模仿生物系統之外,由於Arora et al. [2]的開創性工作,這將擁有更堅固的理論基礎的優勢。他們主要的結果表明,如果資料集的機率分佈可以由一個大型且非常稀疏的深度神經網路來表示,那麼就可以透過分析最後一層啟動(激活)的相關統計信息以及對高度相關輸出的神經元做聚類,以此逐層建構最佳化網路拓撲。儘管嚴格的數學證明需要非常嚴格的條件,但是這個陳述與眾所皆知的Hebbian principle產生共鳴 - 一起激發的神經元連在一起 - 這說明著,即使在不嚴格的條件下,實際上也可以應用基本思想。

On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off. The gap is widened even further by the use of steadily improving, highly tuned, numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, the trend changed back to full connections with [9] in order to better optimize parallel computing. The uniformity of the structure and a large number of filters and greater batch size allow for utilizing efficient dense computation.

缺點是,今天的計算基礎結構在非均勻稀疏資料結構上的數值計算是非常沒有效率的。即使算術的運算量減少100倍,查找與快取misses的開銷仍然是主要的,以至於切換到稀疏矩陣都不見得會成功。透過使用穩定改善,高度優化的數值套件(可以快速計算密集的矩陣乘法),利用CPU或GPU底層的微小細節來進一步擴大差距[16, 9]。還有,非均勻稀疏模型需要更複雜的工程與計算基礎結構。現行多數面向視覺的機器學習系統僅透過卷積來利用空間域中的稀疏性。然而,卷積的實現是密集連接到前幾層中patches的集合。自[11]開始,傳統上,ConvNets在特徵維度中使用隨機與稀疏連結表,以打破對稱性並改善學習,為了有更好的最佳化平行計算,趨勢變回[9]的全連接。結構的一致性與大量的filters以及更大的batch size可以利用效率高的密集計算。

Question: what are sparse connection tables?
See Zhihu for reference.

This raises the question whether there is any hope for a next, intermediate step: an architecture that makes use of the extra sparsity, even at filter level, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.

這提出一個問題,不論下一步(中間的步驟)是否有希望:如理論所建議,一個架構,使用額外的稀疏性,即使在filter level,但透過利用密集矩陣計算來利用我們當前的硬體。大量的稀疏矩陣計算的文獻說明著(如[3]),將稀疏矩陣聚類為相對密集的子矩陣往往會為稀疏矩陣乘法提供最先進的實用效能。在不久的將來,類似的方法將被用於自動構建非均勻的深度學習架構,這想法似乎也不牽強。

The Inception architecture started out as a case study of the first author for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, only after two iterations on the exact choice of topology, we could already see modest gains against the reference architecture based on [12]. After further tuning of learning rate, hyperparameters and improved training methodology, we established that the resulting Inception architecture was especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly, they turned out to be at least locally optimal.

Inception架構起始於第一個作者的案例研究,目的是評估複雜的網路拓撲架構演算法的假設輸出,這演算法試著估計[2]所隱含的視覺網路的稀疏結構,並透過密集,現有的組件覆蓋假設的結果。儘管是一個高度推測的工作,但只有在拓撲確實選擇並執行兩次迭代後,我們還是可以看到基於[12]的參考架構的適度獲益。在進一步調整learning rate,超參數以及改善訓練方法,我們確定所得的Inception架構(做為[6]、[5]的基礎網路)在定位與目標檢測的上下文中特別有效。有趣的是,儘管多數的原始架構的選擇都受到徹底的質疑與測試,但結果證明它們至少是局部最優的。

One must be cautious though: although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have led to its construction. Making sure would require much more thorough analysis and verification: for example, if automated tools based on the principles described below would find similar, but better topology for the vision networks. The most convincing proof would be if an automated system would create network topologies resulting in similar gains in other domains using the same algorithm but with very differently looking global architecture. At very least, the initial success of the Inception architecture yields firm motivation for exciting future work in this direction.

但必須謹慎的是:儘管所提出的架構在電腦視覺上已經成功,但它的品質是否可以被歸因於其建構其架構的指導原則,這點仍然是值得懷疑的。要確認這點需要更徹底的分析與驗證:舉例來說,如果基於下述原則的自動化工具可以為視覺網路找到類似但更好的拓撲。那最有說服力的證明就是,如果一個自動化系統建立的網路拓撲使用相同的演算法,但全域架構大不相同情況下,可以在其它領域上得到類似的獲益。至少,Inception架構最初的成功在這個方向上為未來的工作產生堅定的動力。

4 Architectural Details

The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggests a layer-by-layer construction in which one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from the earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. This means, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5, however this decision was based more on convenience rather than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. 
Additionally, since pooling operations have been essential for the success in current state of the art convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have additional beneficial effect, too (see Figure 2(a)).

Inception架構的主要想法是基於找出卷積視覺網路中的最佳局部稀疏結構如何被現在密集組件估計與覆蓋。注意到,假設平移不變性意味著我們的網路由卷積建構區塊所建構。我們所需要的就是去找出最佳的局部結構並在空間上重覆。Arora et al. [2]等人提出一種逐層結構,其中應該分析最後一層的相關統計信息,並將其聚類為高相關性單元群組。這些聚類形成下一層的單元,並連結到上一層的單元。我們假設來自前幾層的每一個單元都相對應於輸入影像的某些區域,而且這些單元被群組為filter banks。在較低層中(接近輸入層),相關單元將集中在局部區域。這意味著,最終我們將有大量的聚類集中在單一區域中,而且它們可以被下一層的1x1卷積層覆蓋,如[12]所說明。然而,可以預期的是,在空間上散開的聚類會變少(透過在較大的patches上卷積來覆蓋),而且在愈來愈大的區域上,其patches的數量會愈來愈少。為了避免patch-alignment的問題,Inception架構的當前實體被限制其filter size為1x1,3x3,5x5,但是,這個決定更多是基於便利性而不是必要性。這也意味著,這個建議的架構是所有這些層以及其輸出的filter banks的組合,這filter banks併列為一個單獨的輸出向量,形成下一階段的輸入。除此之外,由於pooling的操作對於當前最新技術的卷積網路的成功是必要的,因此建議在每個這樣的階段中加入一個可選的parallel pooling path,這應該也可以得到額外的效益才對(見圖2(a))。

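Personal understanding:

  1. Concatenating the filter banks into a single output vector refers to the output of an Inception block; that output is the input of the next block.
  2. The parallel pooling path means that every block includes a pooling operation; whether to rely on the pooled features is left for the model to decide during training.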

Figure 2: Inception module

As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease suggesting that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.

因為這些"Inception modules"相互堆疊,因此它們的輸出相關性統計信息必然發生變化:隨著更高層所補獲到更高抽象的特徵,它們的空間集中度預期會降低,這說明了,當我們往愈高層去的時候,3x3與5x5的卷積比例應該增加。

One big problem with the above modules, at least in this naïve form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: their number of output filters equals to the number of filters in the previous stage. The merging of the output of the pooling layer with the outputs of convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. Even while this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.

至少以這種形式,上面模組的一個大問題是,即使是數量有限有5x5卷積,在帶有大量filters的卷積層頂部,其成本依然人讓望之卻步。一旦pooling單元加入混合,這個問題將會變的更加明顯:它們的output filter的數量等於上一階段的filter數量。pooling layer的輸出與卷積層的輸出,這兩者的合併將導致階段到階段的輸出數量的必然增加。即使這個架構也許涵蓋最佳稀疏結構,但它的低率將導致在幾個階段內出現計算爆炸。
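
A small sketch of why the naïve module grows from stage to stage: the pooling branch keeps the input depth, so every module's output is at least as deep as its input plus whatever the convolution branches add. The branch widths below are hypothetical and held constant across stages only to keep the arithmetic easy to follow.

```python
def naive_inception_out_channels(c_in, n1x1, n3x3, n5x5):
    # The pooling branch preserves the input depth, so it contributes c_in
    # channels to the concatenated output.
    return n1x1 + n3x3 + n5x5 + c_in


channels = 192  # hypothetical input depth
history = [channels]
for _ in range(4):
    channels = naive_inception_out_channels(channels, n1x1=64, n3x3=128, n5x5=32)
    history.append(channels)
# history == [192, 416, 640, 864, 1088]
```

Because the 3×3 and 5×5 branches in the next stage convolve over this ever-deeper input, their cost rises along with it.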

This leads to the second idea of the proposed architecture: judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form and compressed information is harder to model. We would like to keep our representation sparse at most places (as required by the conditions of [2]) and compress the signals only whenever they have to be aggregated en masse. That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation which makes them dual-purpose. The final result is depicted in Figure 2(b).

這引出我們提出架構的第二個想法:在計算需求過多的地方明確的進行維度降低與投影。這基於embedding的成功:即使是低維度的embedding也可能包含大量有關相對較大的image patch的信息。而然,embedding的信息表示是密集,壓縮形式,而且用壓縮的信息建模是困難的。我們希望在多數地方中保持我們的representation是稀疏的(如[2]的條件所要求),而且只有在集體匯總的時候才壓縮信號。也這就是,在使用昂貴的3x3與5x5卷積之前,1x1卷積是由來計算縮減量。除了用來縮減量之外,它們還包含使用整流線性啟動(激活),因此它們~(1x1 conv.)~有雙重的功用。最終結果如圖2(b)所示。
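
The saving from placing a 1×1 reduction in front of a 5×5 convolution can be estimated by counting multiply-adds. The figures used below (28×28×192 input, 16 reduction filters, 32 5×5 filters) are what Table 1 of the paper lists for inception (3a). That table is not reproduced on this page, so treat them as assumed inputs. Stride 1 and "same" padding are also assumptions of this sketch.

```python
def conv_mult_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds of a stride-1, same-padded convolution layer."""
    return height * width * kernel * kernel * c_in * c_out


H = W = 28
C_IN, C_REDUCE, C_OUT = 192, 16, 32  # assumed values, see the note above

# 5x5 convolution applied directly to the 192-channel input.
direct = conv_mult_adds(H, W, 5, C_IN, C_OUT)

# 1x1 reduction to 16 channels, then the 5x5 convolution.
reduced = (conv_mult_adds(H, W, 1, C_IN, C_REDUCE)
           + conv_mult_adds(H, W, 5, C_REDUCE, C_OUT))

saving = direct / reduced  # a bit under 10x fewer multiply-adds
```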

In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.

普遍而言,一個Inception網路是由上述類型的模組相互堆疊所組成,偶爾max-pooling layers設置stride為2來減半網格的解析度。因為技術原因(訓練期間的記憶體效率),在較低層以傳統卷積的方式,而僅在較高層使用Inception模組是比較效益的。這並非嚴格必要,只是單純的反應我們當前實現中的一些基礎結構的低效。

One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.

這個架構的主要優點之一就是,它可以很明顯的在每一階段增加單元的數量,而不會造成計算複雜性時失控爆炸。普遍使用的降維手法可以屏蔽上一階段大量的input filter到下一階段,首先降低它們的維度,再以較大的patch size對它們做卷積。這設計的另一個實際有用的方面是,直觀來看,視覺信息應該以不同尺度進行處理,然後匯總,如此,下一階段可以同時從不同尺度中抽取特徵。

The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. Another way to utilize the inception architecture is to create slightly inferior, but computationally cheaper versions of it. We have found that all the included knobs and levers allow for a controlled balancing of computational resources that can result in networks that are 2-3× faster than similarly performing networks with non-Inception architecture, however this requires careful manual design at this point.

計算資源的改進使用可以在不引發計算困難的同時增加每個階段的寬與階段的數量。另一個利用Inception架構方法就是建立一個稍微弱一點,但計算成本上更便宜的版本。我們發現到,所有包含的knobs and levers都可以控制計算資源的平衡,這可以讓網路比起non-Inception架構的類似效能的網路還要快2-3倍,然而,這點需要仔細的手工設置。

5 GoogLeNet

We chose GoogLeNet as our team-name in the ILSVRC14 competition. This name is an homage to Yann LeCun's pioneering LeNet 5 network [10]. We also use GoogLeNet to refer to the particular incarnation of the Inception architecture used in our submission for the competition. We have also used a deeper and wider Inception network, the quality of which was slightly inferior, but adding it to the ensemble seemed to improve the results marginally. We omit the details of that network, since our experiments have shown that the influence of the exact architectural parameters is relatively minor. Here, the most successful particular instance (named GoogLeNet) is described in Table 1 for demonstrational purposes. The exact same topology (trained with different sampling methods) was used for 6 out of the 7 models in our ensemble.

在ILSVRC14競賽中,我們選擇GoogLeNet做為我們的團隊名稱。這個名稱是對Yann LeCuns開創LeNet-5網路的致敬[10]。我們還使用GoogLeNet提到在競賽中提交所使用的Inception架構的具體特殊表現。我們還使用更深,更寬的Inception network,其質量稍微弱一點,但是將它加入到集合(ensemble)中似乎可以些許改善結果。我們省略該網路的細節,因為我們的實驗已經說明,確切的架構參數所帶來的影響相對較小。這邊,為了說明目的,Table 1中說明了最成功的具體實例。在我們的集合(ensemble)模型中,7個模型就有6個使用完全相同的拓撲(使用不同採樣方法訓練)。

Table 1: GoogLeNet incarnation of the Inception architecture
Table 1:Inception架構的GoogLeNet實體

All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224 taking RGB color channels with mean subtraction. “#3×3 reduce” and “#5×5 reduce” stands for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.

所有的卷積,包含Inception模型內部的那些卷積,都使用整流線性啟動(激活)。在我們的網路中,其接收域的大小為224x224,使用均值相減的RGB色通道。"#3×3 reduce"與"#5×5 reduce"表示在3x3與5x5卷積之前在縮減層中所使用的1x1 filter的數量。在pool proj column中built-in max-pooling之後,可以看的到投射層中1x1 filter的數量。所有的這些縮減/投射層都使用整流線性啟動(激活)。

Personal notes:

  • The rectified linear activation mentioned here is the ReLU activation.
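
As a reading aid for the Table 1 columns described above, the sketch below computes the output depth of an Inception module from its branch widths. The "reduce" columns are internal widths and do not appear in the concatenated output. The numbers are the ones the paper's Table 1 gives for inception (3a) and (3b); since the table itself is missing from this page, they should be read as assumed inputs.

```python
def inception_output_depth(n1x1, n3x3, n5x5, pool_proj):
    # Output depth is the sum of the four concatenated branches:
    # 1x1, 3x3, 5x5, and the 1x1 projection after max-pooling.
    return n1x1 + n3x3 + n5x5 + pool_proj


depth_3a = inception_output_depth(n1x1=64, n3x3=128, n5x5=32, pool_proj=32)
depth_3b = inception_output_depth(n1x1=128, n3x3=192, n5x5=96, pool_proj=64)
```

Under these inputs the two modules produce 256 and 480 channels respectively.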

The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices including even those with limited computational resources, especially with low-memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. However this number depends on the machine learning infrastructure system used. The use of average pooling before the classifier is based on [12], although our implementation differs in that we use an extra linear layer. This enables adapting and fine-tuning our networks for other label sets easily, but it is mostly convenience and we do not expect it to have a major effect. It was found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%, however the use of dropout remained essential even after removing the fully connected layers.

這網路在設計的時候就考慮了計算效率與實用性,因此可以在個別設備上執行推理(inference),即使是計算資源有限的設備(特別是記憶體佔用較低的設備)。僅計算帶參數的層的話,網路的深度有22層(如果連同pooling layer一起計算,那就是27層)。用於網路構建的層的總數大約有100層。但是,這個數字取決於所用的機器學習基礎結構系統。分類別之前所用的average pooling是基於[12]所述,儘管我們的實現不同處在於我們使用額外的線性層。但這讓我們的網路可以輕鬆的適應和微調其它的標記集,但這主要是方便,我們並不希望它產生重大的影響。已經知道,將全連接層轉到average pooling提高了top-1準確度大約0.6%,然而,即使移除全連接層,還是必需使用dropout。

Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. One interesting insight is that the strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.

有鑒於網路的深度相對較大,因此,以一個有效的方式將梯度傳播回所有層的能力是一件重要的事情。一個有趣的見解是,相對較淺的網路在這任務上的強大效能表示了,由網路中間各層所產生的特徵應該有所區別。透過增加輔助分類器連接到這些中間層,我們應該希望鼓勵在分類別的較低階段中有所區別,增加傳播回來的梯度的信號,並提供額外的正規化。這些分類器採用較小的卷積網路的形式,位於Inception (4a)與(4d)模組的輸出之上。訓練期間,它們的loss會以discount weight的方法加到網路的total loss中(輔助分類器的loss加權為0.3)。在推理(inference)時,這些輔助網路會被摒除。

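Personal notes:

  • What is described here is presumably that, in the Inception architecture, outputs exist not only at the last layer: a few intermediate blocks are also connected to a softmax, which adds extra regularization. These heads are used only during training and not at inference.
  • Did the authors clarify this claim in v3?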
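
The loss combination described above can be written in a few lines. The 0.3 weight comes from the paper. The function name and the example loss values are hypothetical.

```python
AUX_WEIGHT = 0.3  # discount weight stated in the paper


def total_training_loss(main_loss, aux_losses, aux_weight=AUX_WEIGHT):
    """Main classifier loss plus the discounted auxiliary classifier losses."""
    return main_loss + aux_weight * sum(aux_losses)


# Hypothetical per-batch losses: main head, then the (4a) and (4d) heads.
loss = total_training_loss(2.0, [3.0, 1.0])
```

At inference time only the main head is evaluated, so this function is a training-time construct.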

The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:

  • An average pooling layer with 5×5 filter size and stride 3, resulting in an 4×4×512 output for the (4a), and 4×4×528 for the (4d) stage.
  • A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
  • A fully connected layer with 1024 units and rectified linear activation.
  • A dropout layer with 70% ratio of dropped outputs.
  • A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).

側邊額外網路的確切架構(包含額外的分類器)如下:

  • average pooling: 5x5 filter size, stride 3。(4a)的output為4x4x512,以及(4d)的output為4x4x528
  • 128個1x1卷積的filter,用於維度縮減以及整流線性啟動(ReLU)
  • fully connected: 1024 units與整流線性啟動(ReLU)
  • dropout: 70% ratio
  • 以softmax loss線性層做為分類器(預測與主要分類器相同的1000個類別,但是在推理(inference)時移除)。
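
The list above can be checked by tracing tensor shapes through the auxiliary head. The sketch assumes a 14×14 input to the pooling layer (the spatial size Table 1 of the paper gives for stage 4) and unpadded pooling; both are assumptions, though they agree with the 4×4 output size quoted in the list.

```python
def pool_output_size(size, kernel, stride):
    # Output length of an unpadded ("valid") pooling window.
    return (size - kernel) // stride + 1


def aux_head_shapes(in_size, in_channels, num_classes=1000):
    s = pool_output_size(in_size, kernel=5, stride=3)
    return [
        ("avg_pool 5x5, stride 3", (s, s, in_channels)),
        ("conv 1x1, 128 filters + ReLU", (s, s, 128)),
        ("fc 1024 + ReLU", (1024,)),
        ("dropout 0.7", (1024,)),
        ("linear + softmax", (num_classes,)),
    ]


shapes_4a = aux_head_shapes(in_size=14, in_channels=512)
shapes_4d = aux_head_shapes(in_size=14, in_channels=528)
```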

A schematic view of the resulting network is depicted in Figure 3.

生成網路的示意圖如圖3所示。

Figure 3: GoogLeNet network with all the bells and whistles

圖3:GoogLeNet網路與所有產品特色

6 Training Methodology

Our networks were trained using the DistBelief [4] distributed machine learning system using modest amount of model and data-parallelism. Although we used CPU based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17], fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.

我們的網路使用DistBelief[4]分佈式機器學習系統,使用少量的模型與資料數據平行性進行訓練。儘賽我們只有使用基於CPU的實現,但概略估計說明了,GoogLeNet網路可以使用少許的高階GPUs在一個禮拜內收斂,主要的限制還是在於記憶體用量。我們的訓練使用非同步隨機梯度下降,momentum為0.9[17],固定learning rate schedule(每8個epoch降低4%的learning rate)。使用Polyak averaging[13]建立推理(inference)時所用的最終模型。
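
The fixed schedule described above (a 4% decrease every 8 epochs) amounts to a step decay. In the sketch below the base learning rate is a hypothetical placeholder, since the paper does not state it in this passage.

```python
def learning_rate(epoch, base_lr=0.01, drop=0.04, every=8):
    """Step schedule: multiply the rate by (1 - drop) once every `every` epochs.

    base_lr is a hypothetical starting value, not taken from the paper.
    """
    return base_lr * (1.0 - drop) ** (epoch // every)


schedule = [learning_rate(e) for e in (0, 7, 8, 16)]
```

Epochs 0 through 7 share the base rate, epochs 8 through 15 use 96% of it, and so on.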

Question:

  1. What is Polyak averaging?
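On the Polyak averaging question: the idea is to keep a running average of the parameter values visited during training and use that average, rather than the last iterate, as the final model. The sketch below combines it with the other ingredients named in this section (0.9 momentum, learning rate reduced by 4% every 8 epochs) on a toy one-parameter problem. It is an illustration under stated assumptions, not the authors' setup: the base learning rate of 0.1, the noisy quadratic objective, and the single-process (non-asynchronous) loop are all made up for the example.

```python
import random


def learning_rate(epoch, base_lr=0.1, drop=0.04, every=8):
    """Step schedule: multiply the rate by (1 - drop) once every `every` epochs."""
    return base_lr * (1.0 - drop) ** (epoch // every)


def train(epochs=40, steps_per_epoch=50, momentum=0.9, seed=0):
    """Momentum SGD on the toy loss 0.5 * (w - 3)^2 with noisy gradients."""
    rng = random.Random(seed)
    w, velocity = 0.0, 0.0      # single scalar parameter and its momentum buffer
    w_avg, n = 0.0, 0           # running (Polyak) average of the iterates
    for epoch in range(epochs):
        lr = learning_rate(epoch)
        for _ in range(steps_per_epoch):
            grad = (w - 3.0) + rng.gauss(0.0, 1.0)   # true gradient plus noise
            velocity = momentum * velocity - lr * grad
            w += velocity
            n += 1
            w_avg += (w - w_avg) / n                 # incremental mean of all iterates
    return w, w_avg


last_w, averaged_w = train()
print("last iterate:", last_w, "Polyak average:", averaged_w)
```

The last iterate keeps jittering because of gradient noise, while the averaged parameter smooths that noise out. Practical implementations often average only late iterates or use an exponential moving average instead of the plain mean shown here.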

Our image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, like dropout and learning rate, so it is hard to give a definitive guidance to the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3. Also, we found that the photometric distortions by Andrew Howard [8] were useful to combat overfitting to some extent. In addition, we started to use random interpolation methods (bilinear, area, nearest neighbor and cubic, with equal probability) for resizing relatively late and in conjunction with other hyperparameter changes, so we could not tell definitely whether the final results were affected positively by their use.

我們的影像採樣方法在競賽前的幾個月中有了很大的改變,而且已經收斂的模型又以其它選項繼續訓練,有時還同時調整了超參數(像是dropout與learning rate),因此很難對訓練這些網路最有效的單一方法給出明確的指引。更複雜的是,受到[8]的啟發,一些模型主要是在相對較小的剪裁上訓練,另一些則在較大的剪裁上訓練。儘管如此,有一個在競賽之後被驗證為非常有效的做法:對影像採樣各種尺寸的patch,其面積均勻分佈在影像面積的8%到100%之間,且其長寬比在3/4與4/3之間隨機選擇。此外,我們發現Andrew Howard [8]的photometric distortions(光度失真)在某種程度上有助於對抗過擬合。另外,我們相對較晚才開始在調整影像大小時使用隨機插值方法(雙線性(bilinear)、區域(area)、最近鄰(nearest neighbor)與[立方](http://terms.naer.edu.tw/detail/6606951/)(cubic),機率相同),而且是與其它超參數的調整同時進行,因此我們無法明確判斷這些方法是否對最終結果有正面影響。

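Personal notes

  • photometric distortions: an image augmentation technique, mainly a brightness adjustment in which a value drawn at random from a range is added to the image.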
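The crop-sampling prescription from this section (patch area between 8% and 100% of the image, aspect ratio between 3/4 and 4/3) can be sketched as follows. The paper does not say how the aspect ratio is drawn, so uniform sampling is an assumption here, as are the retry loop, the whole-image fallback, and the function name `sample_crop`.

```python
import math
import random


def sample_crop(width, height, rng=random, max_tries=10):
    """Return (x, y, w, h) of a random patch covering 8%-100% of the image area
    with aspect ratio w/h in [3/4, 4/3]."""
    for _ in range(max_tries):
        area = rng.uniform(0.08, 1.0) * width * height   # target patch area
        ratio = rng.uniform(3 / 4, 4 / 3)                # target w / h
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)                # random top-left corner
            y = rng.randint(0, height - h)
            return x, y, w, h
    # Fallback (an assumption, not from the paper): use the whole image.
    return 0, 0, width, height


print(sample_crop(500, 375, random.Random(0)))
```

The sampled patch would then be resized to the network input size, for example with one of the randomly chosen interpolation methods mentioned above.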

7 ILSVRC 2014 Classification Challenge Setup and Results

The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the Imagenet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.

Table 2: Classification performance

Table 3: GoogLeNet classification performance break down

ILSVRC 2014分類挑戰賽的任務是將影像分類為Imagenet階層中1000個葉節點類別之一。大約有120萬張訓練影像,50,000張驗證影像,以及100,000張測試影像。每一張影像都與一個真實類別相關聯,效能則根據分類器得分最高的預測來衡量。通常會報告兩個數字:top-1準確率,將實際類別與第一個預測類別做比較;以及top-5誤差率,將實際類別與前5個預測類別做比較:只要實際類別位於前五名之中,影像就視為正確分類(不考慮其排序)。挑戰賽使用top-5誤差率做為排名依據。
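The two metrics described above differ only in how many of the highest-scoring classes are checked. A minimal sketch, using a made-up 4-class example and k = 2 in place of the challenge's 1000 classes and k = 5:

```python
def top_k_error(scores, labels, k):
    """Fraction of images whose true label is NOT among the k highest-scoring classes."""
    wrong = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            wrong += 1
    return wrong / len(labels)


# Hypothetical classifier scores for three images and their true labels.
scores = [
    [0.60, 0.30, 0.05, 0.05],   # true class 0: ranked first
    [0.20, 0.50, 0.25, 0.05],   # true class 2: ranked second
    [0.40, 0.30, 0.20, 0.10],   # true class 3: ranked last
]
labels = [0, 2, 3]

top1_accuracy = 1.0 - top_k_error(scores, labels, k=1)
top2_error = top_k_error(scores, labels, k=2)
print(top1_accuracy, top2_error)
```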

We participated in the challenge with no external data used for training. In addition to the training techniques aforementioned in this paper, we adopted a set of techniques during testing to obtain a higher performance, which we elaborate below.

  1. We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, mainly because of an oversight) and learning rate policies, and they only differ in sampling methodologies and the random order in which they see input images.
  2. During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resize the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions. This results in 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year’s entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).
  3. The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they lead to inferior performance than the simple averaging.

我們參加挑戰賽,而且沒有使用額外的資料訓練。除了論文前述的訓練技術,我們在測試過程中採用一系列的技術來獲得更高的效能,下面將說明。

  1. 我們獨立訓練了7個版本的同一個GoogLeNet模型(包含一個更寬的版本),並以這些模型做ensemble prediction。這些模型以相同的初始化(甚至是相同的初始權重,主要是因為疏忽)與learning rate policies訓練,差別只在於採樣的方法以及看到輸入影像的隨機順序。
  2. 測試過程中,我們採用了比Krizhevsky et al. [9]更積極的剪裁方法。具體來說,我們將影像重新縮放到4個尺度,其短邊(高或寬)分別為256、288、320與352,然後取這些縮放後影像的左、中、右正方形(對於直擺的影像,則取上、中、下正方形)。接著,對每一個正方形取四個邊角與中心的224x224剪裁,再加上將整個正方形縮放為224x224的版本,以及它們的鏡像版本。結果就是每張影像有4x3x6x2=144個剪裁。Andrew Howard [8]在前一年的參賽作品中使用了類似的方法,我們以實驗驗證其效能比我們所提出的方案略差。我們注意到,這種積極的剪裁在實際應用上也許不是必要的,因為在剪裁達到合理數量之後,更多剪裁帶來的效益就變得微不足道(後續會說明)。
  3. 將多個剪裁以及所有個別分類器的softmax機率取平均,以獲得最終預測。在我們的實驗中,我們在驗證資料上分析了其它替代做法,像是對剪裁取max pooling並對分類器取平均,但是它們的效能都比簡單的平均差。
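The simple averaging in step 3 can be sketched as below. The nested-list layout `probs[model][crop]` and the function names are assumptions made for this example; each innermost list stands for one softmax vector.

```python
def average_predictions(probs):
    """probs[m][c] is the softmax vector of model m on crop c; return the mean vector."""
    vectors = [p for model in probs for p in model]      # flatten models x crops
    n_classes = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(n_classes)]


def predict(probs):
    """Index of the class with the highest averaged probability."""
    avg = average_predictions(probs)
    return max(range(len(avg)), key=lambda i: avg[i])


# Two hypothetical models, two crops each, two classes.
probs = [
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.2, 0.8], [0.6, 0.4]],
]
print(average_predictions(probs), predict(probs))
```

When every model sees the same number of crops, this flat mean equals averaging over crops first and over models second.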

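Personal notes:

  1. All versions apparently ended up with the same initial weights because of an oversight.
  2. 4 scales × 3 square crops (top/center/bottom or left/center/right) × 6 views (four corners + center + the whole square resized to 224x224) × 2 (mirrored versions).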
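The 4 × 3 × 6 × 2 = 144 count can be reproduced by enumerating the crop boxes. This sketch only computes coordinates, with no image library involved. The tuple layout and the function name `multi_crop_boxes` are made up for illustration, and rounding the resized long side to the nearest pixel is an assumption.

```python
def multi_crop_boxes(width, height, scales=(256, 288, 320, 352), crop=224):
    """Enumerate test-time crops as (scale, box, kind, mirrored) tuples.
    Boxes are (left, top, right, bottom) in the resized image."""
    crops = []
    for s in scales:
        # Resize so that the shorter side equals s, keeping the aspect ratio.
        if width <= height:
            w, h = s, round(height * s / width)
        else:
            w, h = round(width * s / height), s
        long_side = max(w, h)
        # Three s x s squares along the longer side: start, center, end.
        for off in (0, (long_side - s) // 2, long_side - s):
            sx, sy = (off, 0) if w >= h else (0, off)
            m = s - crop                                  # slack inside the square
            c = m // 2                                    # offset of the center crop
            views = [
                ("corner", (sx, sy, sx + crop, sy + crop)),
                ("corner", (sx + m, sy, sx + s, sy + crop)),
                ("corner", (sx, sy + m, sx + crop, sy + s)),
                ("corner", (sx + m, sy + m, sx + s, sy + s)),
                ("center", (sx + c, sy + c, sx + c + crop, sy + c + crop)),
                ("resized", (sx, sy, sx + s, sy + s)),    # whole square, later resized to crop x crop
            ]
            for kind, box in views:
                for mirrored in (False, True):
                    crops.append((s, box, kind, mirrored))
    return crops


print(len(multi_crop_boxes(500, 375)))
```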

In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.

論文的其餘部份,我們分析影響最終提交的總體效能的多個因素。

Our final submission in the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking the first among other participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about 40% relative reduction compared to the previous year’s best approach (Clarifai), both of which used external data for training the classifiers. The following table shows the statistics of some of the top-performing approaches.

挑戰賽的最終提交在驗證與測試集上獲得top-5誤差率6.67%,排名第一。與2012年的SuperVision方法比較,相對減少56.5%,與上一年的最佳方法(Clarifai)相比,減少大約40%,這兩個都使用額外的資料來訓練分類器。下面表格說明了一些效能最好的方法的統計信息。

We also analyze and report the performance of multiple testing choices, by varying the number of models and the number of crops used when predicting an image in the following table. When we use one model, we chose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order to not overfit to the testing data statistics.

我們還透過調整下面表格中預測影像時所用的模型數量與剪裁數量來分析、報告多個測試選擇的效能。當我們使用一個模型時,我們在驗證資料上選擇top-1誤差率最低的模型。所有數字都是在驗證資料上的報告,以避免過擬合測試集的統計信息。

8 ILSVRC 2014 Detection Challenge Setup and Results

The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the groundtruth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary from large to tiny. Results are reported using the mean average precision (mAP).

ILSVRC檢測任務是在影像中為200種可能類別的目標生成邊界框。如果檢測到的目標與實際類別相符,且邊界框的重疊至少達50%(使用Jaccard index),則視為正確。多餘的檢測視為偽陽性,並且會受到懲罰。與分類任務不同的是,每一張影像可能包含多個目標,也可能沒有任何目標,而且目標的尺度可能從很大到很小不等。結果以mean average precision (mAP)報告。

The Jaccard index is the same measure as IoU (intersection over union).
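The correctness criterion described above (matching class plus at least 50% overlap) can be written out directly. A minimal sketch, assuming axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates; the function names are made up for this example.

```python
def iou(box_a, box_b):
    """Jaccard index (intersection over union) of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)        # zero if the boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_box, pred_class, gt_box, gt_class, threshold=0.5):
    """A detection counts as correct if the class matches and IoU >= threshold."""
    return pred_class == gt_class and iou(pred_box, gt_box) >= threshold


print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

Two 10×10 boxes shifted by half their width share an area of 50 out of a union of 150, so their IoU is 1/3 and the detection would not count as correct.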

The approach taken by GoogLeNet for detection is similar to the R-CNN by [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the Selective Search [20] approach with multi-box [5] predictions for higher object bounding box recall. In order to cut down the number of false positives, the superpixel size was increased by 2×. This halves the proposals coming from the selective search algorithm. We added back 200 region proposals coming from multi-box [5] resulting, in total, in about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 ConvNets when classifying each region which improves results from 40% to 43.9% accuracy. Note that contrary to R-CNN, we did not use bounding box regression due to lack of time.

GoogLeNet用於檢測的方法與R-CNN[6]類似,但是以Inception模型做為區域分類器來加強。此外,透過結合Selective Search [20]與multi-box [5] predictions改進了region proposal的步驟,以提高目標邊界框的召回率。為了降低偽陽性的數量,superpixel的大小增加為2倍,這使得來自selective search algorithm的proposals減半。我們再加回來自multi-box [5]的200個region proposals,總數大約是[6]所用proposals的60%,同時將覆蓋率從92%提高到93%。減少proposals的數量並提高覆蓋率的整體效果是,單一模型狀況下的mAP提高1%。最後,在對每個region分類的時候,我們使用6個卷積神經網路的ensemble,這將結果的準確度從40%提高到43.9%。注意到,與R-CNN不同的是,由於時間不足,我們並沒有使用邊界框迴歸(bounding box regression)。
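As a quick consistency check on the proposal count above, write $N$ for the number of proposals per image used by [6] ($N$ is a symbol introduced here, not in the paper) and assume the larger superpixels exactly halve the Selective Search output:

$$
\frac{N}{2} + 200 \approx 0.6\,N \quad\Longrightarrow\quad 0.1\,N \approx 200 \quad\Longrightarrow\quad N \approx 2000
$$

Under that assumption the detector sees roughly 1000 + 200 = 1200 proposals per image, which is consistent with the roughly 2000 Selective Search proposals per image reported for R-CNN.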

We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top performing teams all use Convolutional Networks. We report the official scores in Table 4 and common strategies for each team: the use of external data, ensemble models or contextual models. The external data is typically the ILSVRC12 classification data for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data the same way classification is used for pre-training. The GoogLeNet entry did not use the localization data for pre-training.

首先,我們報告排名最前的檢測結果,並說明自第一屆檢測任務以來的進展。與2013年的結果相比,準確度幾乎翻倍。表現最好的小組都使用卷積網路。我們在Table 4中報告官方成績以及各小組常用的策略:使用外部資料、ensemble models或contextual models。外部資料通常是ILSVRC12分類資料,用來預訓練模型,之後再於檢測資料上微調。部份小組還提到使用了定位資料。由於定位任務的邊界框有相當一部份並不包含在檢測資料集中,因此可以用這些資料預訓練一個通用的邊界框迴歸器,方式與使用分類資料做預訓練相同。GoogLeNet的參賽作品並沒有使用定位資料做預訓練。

Table 4: Detection performance

In Table 5, we compare results using a single model only. The top performing model is by Deep Insight and surprisingly only improves by 0.3 points with an ensemble of 3 models while the GoogLeNet obtains significantly stronger results with the ensemble.

Table 5中,我們比較僅使用單一模型的結果。表現最好的模型來自Deep Insight,令人驚訝的是,它使用3個模型的ensemble只提高了0.3個百分點,而GoogLeNet使用ensemble則獲得明顯更強的結果。

Table 5: Single model performance for detection

9 Conclusion

Our results seem to yield a solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and less wide networks. Also note that our detection work was competitive despite of neither utilizing context nor performing bounding box regression and this fact provides further evidence of the strength of the Inception architecture. Although it is expected that similar quality of result can be achieved by much more expensive networks of similar depth and width, our approach yields solid evidence that moving to sparser architectures is feasible and useful idea in general. This suggest promising future work towards creating sparser and more refined structures in automated ways on the basis of [2].

我們的結果似乎提供了有力的證據,說明以現成的密集建構區塊來近似預期的最佳稀疏結構,是改善電腦視覺神經網路的可行方法。與較淺且較窄的網路相比,這方法的主要優點是在計算需求適度增加的情況下,明顯提高品質。也請注意,儘管我們既沒有使用上下文,也沒有使用邊界框迴歸,我們的檢測工作依然具有競爭力,這個事實進一步證明了Inception架構的實力。儘管可以預期,深度與寬度相近但昂貴得多的網路也能達到類似品質的結果,但是我們的方法提供了有力的證據,說明轉向更稀疏的架構整體而言是可行且有用的想法。這顯示了未來值得在[2]的基礎上,以自動化的方式建立更稀疏且更精細的結構。