# Fully Convolutional Networks for Semantic Segmentation (翻譯) ###### tags: `CNN` `論文翻譯` `deeplearning` >[name=Shaoe.chen] [time=Thu, Feb 24, 2020] [TOC] ## 說明 區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院 :::info 原文 ::: :::success 翻譯 ::: :::warning 個人註解,任何的翻譯不通暢部份都請留言指導 ::: :::danger * [paper hyperlink](https://arxiv.org/pdf/1411.4038.pdf) * [Shift and stitch理解](https://zhuanlan.zhihu.com/p/56035377) * [FCN的學習及理解(Fully Convolutional Networks for Semantic Segmentation)](https://blog.csdn.net/qq_36269513/article/details/80420363) * [卷積神經網絡CNN(3)—— FCN(Fully Convolutional Networks)要點解釋](https://blog.csdn.net/Fate_fjh/article/details/53446630) * [全卷積網絡FCN 詳解](https://www.cnblogs.com/gujianhan/p/6030639.html) ::: ## Abstract :::info Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixelsto-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet \[22\], the VGG net \[34\], and GoogLeNet \[35\]) into fully convolutional networks and transfer their learned representations by fine-tuning \[5\] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves stateof-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image. ::: :::success 卷積網路是非常強大的視覺模型,它可以產生特徵的[階層](http://terms.naer.edu.tw/detail/6629463/)。我們證明了,經過end-to-end、pixels-to-pixels訓練之後的卷積網路,在語義分割的部份是可以超過當前最佳技術。我們的主要觀點就是建立"全卷積"~(fully convolutional)~的網路,這讓網路可以接受任意大小的輸入,並能夠有效的推理與學習,產生相對應大小的輸出。我們定義並詳細說明全卷積網路的空間,解釋它們應用於空間密集預測任務,並且說明與先前模型的關聯。我們採用當代的分類網路(AlexNet\[22\]、VGG\[34\]、GoogLeNet\[35\])做為全卷積網路,然後再微調\[5\]模型來轉換它們所學到的representations,做為分割任務使用。然後我們定義一個skip architecture,結合來自深層的語義信息(較粗糙)以及淺層的外觀信息(較精細),以生成準確而且詳細的分割。我們的全卷積網路得到 PASCAL VOC(20% relative improvement to 62.2% mean IU on 2012)、NYUDv2與SIFT Flow的最佳分割,對於典型的影像,其推理所需的時間不到五分之一秒。 :::k :::warning 個人見解: * spatially dense prediction task:就是對於影像內的每一個pixel都需要進行分類,因為語義分割就是要針對影像內的每一個pixel做分類,像是二值化,歸於黑或白 ::: ## 1. Introduction :::info Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification \[19, 31, 32\],, but also making progress on local tasks with structured output. These include advances in bounding box object detection \[29, 12, 17\], part and keypoint prediction \[39, 24\],, and local correspondence \[24, 9\]. ::: :::success 卷積網路正推動辨識技術的進步。卷積並不只是改善整個影像的分類\[19, 31, 32\],也在具結構輸出的定位任務上取得進展。這些進展包括邊界框目標檢測\[29, 12, 17\],部份與關鍵點的預測\[39, 24\],以及局部對應\[24, 9\]。 ::: :::info The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation \[27, 2, 8, 28, 16, 14, 11\], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses. 
:::

:::success
從粗略推理到精細推理的進展中,很自然的下一步就是在每個像素上做出預測。先前的方法已經使用卷積做語義分割\[27, 2, 8, 28, 16, 14, 11\],其中每個像素都會標記為其包圍住的物件或區域的類別,但存在著一些缺點,而這正是本工作所要解決的。
:::

:::info
We show that a fully convolutional network (FCN) trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.
:::

:::success
我們說明了,在語義分割上以end-to-end、pixels-to-pixels所訓練的全卷積網路(FCN),超過了最新技術,而且不需要加入其它的方法。據我們所知,這是第一個以end-to-end訓練FCN用於像素級別的預測而且來自監督式預訓練的工作。現有網路的全卷積版本可以從任意大小的輸入預測密集輸出。透過密集的前饋計算與反向傳播,學習與推理都是一次以整張影像來執行。網路內的upsampling layers使得具有subsampled pooling的網路能夠做像素級別的預測與學習。
:::

:::info
This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common \[27, 2, 8, 28, 11\], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels \[8, 16\], proposals \[16, 14\], or post-hoc refinement by random fields or local classifiers \[8, 16\]. Our model transfers recent success in classification \[19, 31, 32\] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training \[8, 28, 27\].
:::

:::success
這個方法在[漸近性](http://terms.naer.edu.tw/detail/6688957/)與絕對性上都是有效的,而且不需要其它工作中的複雜處理。patchwise的訓練方式很常見\[27, 2, 8, 28, 11\],但缺乏全卷積訓練的效率。我們的方法沒有使用前、後處理的複雜性,包含superpixels\[8, 16\]、proposals\[16, 14\]或是透過[隨機場](http://terms.naer.edu.tw/detail/2123027/)或局部分類器的[事後](http://terms.naer.edu.tw/detail/3266334/)[細分](http://terms.naer.edu.tw/detail/2123388/)\[8, 16\]。我們的模型透過將分類網路重新解釋為全卷積網路,並以它們所學到的representations來微調,將最近在分類任務上成功\[19, 31, 32\]的模型轉為密集預測。相比之下,先前的工作在沒有監督式預訓練的情況下使用小型的卷積網路\[8, 28, 27\]。
:::

:::info
Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies encode location and semantics in a nonlinear local-to-global pyramid. We define a skip architecture to take advantage of this feature spectrum that combines deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).
:::

:::success
語義分割面臨著語義與位置之間的固有矛盾:全域信息解決「是什麼」~(what)~,而局部信息解決「在哪裡」~(where)~。深度特徵階層在一個非線性局部到全域的[角錐體](http://terms.naer.edu.tw/detail/2122702/)中對位置與語義做編碼。我們在4.2節定義了一個skip architecture,以利用這個特徵[值譜](http://terms.naer.edu.tw/detail/2125070/)的優點,將深層、粗糙的語義信息與淺層、精細的外觀信息結合起來(見Figure 3)。
:::

:::info
![](https://i.imgur.com/gWXXPH3.png)
Figure 3. Our DAG nets learn to combine coarse, high layer information with fine, low layer information. Pooling and prediction layers are shown as grids that reveal relative spatial coarseness, while intermediate layers are shown as vertical lines. First row (FCN-32s): Our singlestream net, described in Section 4.1, upsamples stride 32 predictions back to pixels in a single step.
Second row (FCN-16s): Combining predictions from both the final layer and the pool4 layer, at stride 16, lets our net predict finer details, while retaining high-level semantic information. Third row (FCN-8s): Additional predictions from pool3, at stride 8, provide further precision. Figure 3. 我們的DAG網路學習將粗糙、高層的信息與精細、低層的信息相結合。池化與預測層以網格來顯示,顯示為相對空間的[粗糙度](http://terms.naer.edu.tw/detail/928179/),中間層則以垂直線來表示。First row (FCN-32s):我們的單流網路~(single-stream)~(Section 4.1說明),一個步驟中以stride=32將預測升採樣還原為像素。Second row (FCN-16s):以stride=16結合最後一層與pool4的預測,讓我們的網路預測有更精細的細節,同時保有高階的語義信息。Third row (FCN-8s):以stride=8,加入來自pool3的預測,提供了更高的精確度。 ::: :::warning 個人見解: * upsample:查詢維基翻譯為[升採樣](https://zh.wikipedia.org/wiki/%E5%8D%87%E6%8E%A1%E6%A8%A3),大陸用語為上採樣,大致就是與pool相反,pool是將尺寸縮小,而upsample是放大 ::: :::info In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow. ::: :::success 在下一節中,我們將回顧關於深度分類網路、FCNs、以及近來使用卷積做語義分割的相關工作。下面的章節說明FCN的設計與密集預測的權衡,介紹我們的架構,具有in-network upsampling與多層的結合,並說明我們的實驗框架。最後我們在PASCAL VOC 2011-2、NYUDv2與SIFT Flow上說明最新的結果。 ::: ## 2. Related work :::info Our approach draws on recent successes of deep nets for image classification \[19, 31, 32\] and transfer learning \[4, 38\]. Transfer was first demonstrated on various visual recognition tasks \[4, 38\], then on detection, and on both instance and semantic segmentation in hybrid proposalclassifier models \[12, 16, 14\]. We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework. ::: :::success 我們的方法利用最近在影像分類\[19, 31, 32\]與遷移學習\[4, 38\]上成功的深度網路。遷移學習首先在各種視覺辨識任務上被證明\[4, 38\],然後在檢測上,以及在hybrid proposal classifier models中的實例與語義分割上被證明\[12, 16, 14\]。現在,我們重新架構並微調分類網路,以針對語義分割的密集預測。我們繪製FCNs的空間,並且以前的模型置於此框架中(包含歷史與最近的)。 ::: :::warning 個人見解: * 有人說轉移學習,有人說遷移學習,直接一點就是transfer learning ::: :::info **Fully convolutional networks** To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. \[25\], which extended the classic LeNet \[21\] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt \[37\] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. \[27\] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference. ::: :::success **Fully convolutional networks** 據我們所知,將卷積網路擴展到任意大小的輸入的想法首先出現在Matan等人\[25\],他們擴展經典的LeNet\[21\]來辨識數字字串。因為他們的網路受限於1維的輸入字串,因此,Matan等人使用[維特比](http://terms.naer.edu.tw/detail/6936407/)編碼來獲得其輸出。Wolf與Platt\[37\]將卷積網路擴展到郵政地址區塊的四個角的檢測分數的2維映射。這兩個歷史著作都以推理並學習全卷積來進行檢測。Ning等人\[27\]定義一個卷積網路,用於以全卷積推理來替[隱桿線蟲](http://terms.naer.edu.tw/detail/5455438/)組織的分段做概略的分類。 ::: :::info Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. 
\[29\], semantic segmentation by Pinheiro and Collobert \[28\], and image restoration by Eigen et al. \[5\] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. \[35\] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.
:::

:::success
全卷積網路計算在當前多層網路的時代也已經得到廣泛的應用。Sermanet等人\[29\]的滑動視窗,Pinheiro與Collobert的語義分割\[28\]與Eigen等人\[5\]的影像還原,這些都使用了全卷積的推理。全卷積的訓練是很少見的,但是Tompson等人\[35\]有效地將其用於學習end-to-end的局部檢測器以及姿勢估測~(pose estimation)~的空間模型,儘管他們沒有說明或分析這種方法。
:::

:::info
Alternatively, He et al. \[17\] discard the non-convolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.
:::

:::success
另外,He等人\[17\]拋棄分類網路的非卷積部份,來製做特徵提取器。他們結合proposals與spatial pyramid pooling來生成用於分類的局部、固定長度的特徵。雖然快又有效,但這個混合模型無法以end-to-end的方式來學習。
:::

:::info
**Dense prediction with convnets** Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. \[27\], Farabet et al. \[8\], and Pinheiro and Collobert \[28\]; boundary prediction for electron microscopy by Ciresan et al. \[2\] and for natural images by a hybrid convnet/nearest neighbor model by Ganin and Lempitsky \[11\]; and image restoration and depth estimation by Eigen et al. \[5, 6\]. Common elements of these approaches include
* small models restricting capacity and receptive fields;
* patchwise training \[27, 2, 8, 28, 11\];
* post-processing by superpixel projection, random field regularization, filtering, or local classification \[8, 2, 11\];
* input shifting and output interlacing for dense output \[28, 11\] as introduced by OverFeat \[29\];
* multi-scale pyramid processing \[8, 28, 11\];
* saturating tanh nonlinearities \[8, 5, 28\]; and
* ensembles \[2, 11\],

whereas our method does without this machinery. However, we do study patchwise training 3.4 and “shift-and-stitch” dense output 3.2 from the perspective of FCNs. We also discuss in-network upsampling 3.3, of which the fully connected prediction by Eigen et al. \[6\] is a special case.
:::

:::success
**Dense prediction with convnets** 最近的一些著作已經開始將卷積網路應用於密集預測的問題,包含Ning等人\[27\],Farabet等人\[8\]與Pinheiro與Collobert\[28\]的語義分割;Ciresan等人\[2\]的電子顯微鏡邊界預測,以及Ganin與Lempitsky\[11\]的混合卷積/最近鄰模型的自然影像;以及Eigen等人\[5, 6\]的影像還原與深度估計。這些方法的共同要素包含:
* 限制容量與接收域的小型模型;
* patchwise training\[27, 2, 8, 28, 11\];
* 透過超像素的投射,[隨機場](http://terms.naer.edu.tw/detail/2123027/)正規化,濾波,或局部分類做[後處理](http://terms.naer.edu.tw/detail/253446/)\[8, 2, 11\];
* 用於密集輸出的輸入[移位](http://terms.naer.edu.tw/detail/2124541/)與輸出[交錯](http://terms.naer.edu.tw/detail/6634453/)\[28, 11\],由OverFeat\[29\]所引入;
* multi-scale pyramid processing\[8, 28, 11\];
* 飽和的雙曲正切~(tanh)~非線性\[8, 5, 28\];與
* ensembles \[2, 11\],

而我們的方法並沒有這些機制。然而,我們從FCNs的角度研究patchwise training(3.4節)與"shift-and-stitch"密集輸出(3.2節)。我們還討論in-network upsampling(3.3節),其中Eigen等人\[6\]的全連接預測是一個特例。
:::

:::warning
個人見解:
* image wise是影像級別,而影像是pixel所組成;pixel wise是像素級別;patch wise是一個區塊的級別,就好比是卷積的filter是一個nxm的區塊下去計算一般。
* patch wise training,理解上是對感興趣的pixel,以它為中心取patch,然後輸入網路,輸出則為該pixel的label
* 參考:[Don’t Just Scan This: Deep Learning Techniques for MRI](https://medium.com/stanford-ai-for-healthcare/dont-just-scan-this-deep-learning-techniques-for-mri-52610e9b7a85)
:::

:::info
Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.
:::

:::success
與目前現行方法不同,我們採用並擴展深度分類架構,使用影像分類做為監督式預訓練,並以全卷積微調模型,從整張影像輸入與實際類別中簡單又有效地學習。
:::

:::info
Hariharan et al. \[16\] and Gupta et al. \[14\] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system \[12\] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end.
:::

:::success
Hariharan等人\[16\]與Gupta等人\[14\]同樣地採用深度分類網路來處理語義分割,但是是在hybrid proposal-classifier模型中這麼做的。這些方法透過採樣邊界框與/或區域候選~(region proposals)~來微調R-CNN系統\[12\],以執行檢測、語義分割與實例分割。這兩種方法都不是end-to-end的學習。
:::

:::info
They achieve state-of-the-art segmentation results on PASCAL VOC and NYUDv2 respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.
:::

:::success
他們分別在PASCAL VOC與NYUDv2上得到最佳的分割結果,因此,我們在Section 5中直接把我們end-to-end的FCN與他們的語義分割結果做比較。
:::

:::info
We fuse features across layers to define a nonlinear localto-global representation that we tune end-to-end. In contemporary work Hariharan et al. [18] also use multiple layers in their hybrid model for semantic segmentation.
:::

:::success
我們以跨層融合特徵的方式定義非線性的局部到全域~(local-to-global)~的representation~(表示)~,再以end-to-end的方式微調。在當代的作品當中,Hariharan等人\[18\]還在其混合模型中使用多層來做語義分割。
:::

## 3. Fully convolutional networks

:::info
Each layer of data in a convnet is a three-dimensional array of size $h × w × d$, where $h$ and $w$ are spatial dimensions, and $d$ is the feature or channel dimension. The first layer is the image, with pixel size $h × w$, and $d$ color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.
:::

:::success
卷積網路中的每一層資料都是三維陣列,大小為$h \times w \times d$,其中$h$跟$w$是空間維度,而$d$是特徵或通道維度。第一層為影像,像素大小為$h \times w$,並有$d$個顏色通道。較高層中的位置對應於它們在影像中[路徑連通](http://terms.naer.edu.tw/detail/2121517/)的位置,這稱之為它們的接收域。
:::

:::info
Convnets are built on translation invariance.
Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates. Writing $x_{ij}$ for the data vector at location $(i, j)$ in a particular layer, and $y_{ij}$ for the following layer, these functions compute outputs $y_{ij}$ by

$$
y_{ij} = f_{ks}\left(\left\{ x_{si+\delta i,\, sj+\delta j} \right\}_{0 \leq \delta i, \delta j \leq k}\right)
$$

where $k$ is called the kernel size, $s$ is the stride or subsampling factor, and $f_{ks}$ determines the layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an element-wise nonlinearity for an activation function, and so on for other types of layers.
:::

:::success
卷積網路建立在[平移](http://terms.naer.edu.tw/detail/955614/)不變性上。它們的基本組件(卷積、池化與啟動函數)在局部輸入區域上操作,而且僅取決於相對的空間坐標。將特定層中位置$(i, j)$的資料向量記為$x_{ij}$,下一層的則記為$y_{ij}$,這些函數透過下面公式計算輸出$y_{ij}$

$$
y_{ij} = f_{ks}\left(\left\{ x_{si+\delta i,\, sj+\delta j} \right\}_{0 \leq \delta i, \delta j \leq k}\right)
$$

其中$k$稱為kernel size~(filter的大小)~,$s$為步幅~(每次平移幾步)~或[次取樣](http://terms.naer.edu.tw/detail/6570346/)因子,而$f_{ks}$則決定層的類型:卷積或average pooling的矩陣乘法,max pooling的空間最大值,或啟動函數的element-wise nonlinearity~(元素非線性)~,對其它類型的層則依此類推。
:::

:::info
This functional form is maintained under composition, with kernel size and stride obeying the transformation rule

$$
f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\, ss'}
$$

While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network. An FCN naturally operates on an input of any size, and produces an output of corresponding (possibly resampled) spatial dimensions.
:::

:::success
這種函數形式在複合~(composition)~之下保持不變,其kernel size與stride依循著轉換規則

$$
f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\, ss'}
$$

一般的深度網路計算的是一般的非線性函數,而僅由這種形式的層所組成的網路則計算非線性濾波器,我們稱之為深度濾波器~(deep filter)~或全卷積網路。FCN很自然的可以在任意大小的輸入上操作,並產生相對應的空間維度輸出(可能是重新採樣的)。
:::

:::info
A real-valued loss function composed with an FCN defines a task. If the loss function is a sum over the spatial dimensions of the final layer, $\ell(x; \theta)=\sum_{ij} \ell'(x_{ij};\theta)$, its gradient will be a sum over the gradients of each of its spatial components. Thus stochastic gradient descent on $\ell$ computed on whole images will be the same as stochastic gradient descent on $\ell'$, taking all of the final layer receptive fields as a minibatch.
:::

:::success
與FCN組合在一起的[實值損失函數](http://terms.naer.edu.tw/detail/2123206/)定義了任務。如果損失函數是最後一層的空間維度上的總和,$\ell(x; \theta)=\sum_{ij} \ell'(x_{ij};\theta)$,那它的梯度就會是它的每個空間[組件](http://terms.naer.edu.tw/detail/2113100/)的梯度的總和。因此,對整張影像計算的$\ell$上的隨機梯度下降,會與把最後一層的所有接收域當作一個minibatch、在$\ell'$上做隨機梯度下降相同。
:::

:::info
When these receptive fields overlap significantly, both feedforward computation and backpropagation are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.
:::

:::success
當這些接收域明顯重疊時,以layer-by-layer的方式在整張影像上計算,而不是獨立地patch-by-patch計算,前饋計算與反向傳播都會更有效率。
:::

:::info
We next explain how to convert classification nets into fully convolutional nets that produce coarse output maps. For pixelwise prediction, we need to connect these coarse outputs back to the pixels. Section 3.2 describes a trick, fast scanning \[29\], introduced for this purpose. We gain insight into this trick by reinterpreting it as an equivalent network modification.
As an efficient, effective alternative, we introduce deconvolution layers for upsampling in Section 3.3. In Section 3.4 we consider training by patchwise sampling, and give evidence in Section 4.3 that our whole image training is faster and equally effective.
:::

:::success
接下來,我們會解釋如何將分類網路轉換為生成粗糙輸出映射的全卷積網路。對於像素級別的預測,我們需要將這些粗糙的輸出連接回像素。Section 3.2為此說明一個技巧,就是fast scanning\[29\]。我們透過將它重新解釋為等效的網路修改,以深入瞭解這個技巧。做為一個高效、有效的替代方案,我們將在Section 3.3中說明用於[上取樣](http://terms.naer.edu.tw/detail/6570853/)的deconvolution~(反卷積)~ layer。Section 3.4中,我們考慮利用patchwise sampling訓練,並在Section 4.3中給出證明,證明我們以整張影像的訓練是快速而且同樣有效。
:::

### 3.1. Adapting classifiers for dense prediction

:::info
Typical recognition nets, including LeNet \[21\], AlexNet \[19\], and its deeper successors \[31, 32\], ostensibly take fixed-sized inputs and produce non-spatial outputs. The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions. Doing so casts them into fully convolutional networks that take input of any size and output classification maps. This transformation is illustrated in Figure 2. (By contrast, nonconvolutional nets, such as the one by Le et al. [20], lack this capability.)
:::

:::success
傳統的辨識網路,包含LeNet\[21\],AlexNet\[19\],以及更深的後繼作品\[31, 32\],表面上是以固定大小的輸入並生成非空間的輸出。這些網路的全連接層具有固定的維度,並拋棄了空間坐標。然而,這些全連接層也可以被視為kernel覆蓋其整個輸入區域的卷積。這麼做會將它們轉換為全卷積的網路,這些網路可以接受任意大小的輸入,並輸出類別的映射。Figure 2說明這種轉換。(相較之下,非卷積的網路,例如Le等人\[20\]的網路,則缺乏這種能力。)
:::

:::info
![](https://i.imgur.com/zNH6ADk.png)
Figure 2. Transforming fully connected layers into convolution layers enables a classification net to output a heatmap. Adding layers and a spatial loss (as in Figure 1) produces an efficient machine for end-to-end dense learning.
Figure 2. 將全連接層轉換為卷積層讓分類網路輸出熱力圖。增加層與空間損失(如Figure 1所示)可以生成一個有效的機器做為end-to-end的密集學習。
:::

:::info
![](https://i.imgur.com/XLYXMCn.png)
Figure 1. Fully convolutional networks can efficiently learn to make dense predictions for per-pixel tasks like semantic segmentation.
Figure 1. 全卷積網路可以有效學習為每一個像素做密集預測的任務(如語意分割)。
:::

:::warning
個人見解:
* 最後的output是21個channel,因為有20個類別再加上1個背景,總計21
:::

:::info
Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches. For example, while AlexNet takes 1.2 ms (on a typical GPU) to infer the classification scores of a 227×227 image, the fully convolutional net takes 22 ms to produce a 10×10 grid of outputs from a 500×500 image, which is more than 5 times faster than the naïve approach^1^
:::

:::success
另外,雖然生成的映射等價於特定輸入區塊在原始網路上的[計值](http://terms.naer.edu.tw/detail/2111307/),但這些區塊的整個重疊區域的計算是被高度分攤的。舉例來說,雖然AlexNet需要1.2ms(在典型的GPU上)來推理227x227影像的類別分數,而全卷積則需要22ms從500x500的影像中生成一個10x10的輸出網格,比單純的方法還要快5倍多^1^。
:::

:::info
^1^ Assuming efficient batching of single image inputs. The classification scores for a single image by itself take 5.4 ms to produce, which is nearly 25 times slower than the fully convolutional version.
^1^ 假設單一影像輸入的有效批處理。單一影像的類別分數本身需要5.4ms來產生,這比全卷積版本還要慢幾乎25倍。
:::

:::info
The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation. With ground truth available at every output cell, both the forward and backward passes are straightforward, and both take advantage of the inherent computational efficiency (and aggressive optimization) of convolution.
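:::

:::warning
個人補充:以下用PyTorch寫一個把VGG16的全連接層「卷積化」的簡單示意(原論文是以Caffe實作;這裡的`conv6`、`conv7`、`score`等名稱,以及`torchvision`的`weights=None`參數,都是為了說明而做的假設,並非論文原始碼)。重點是fc6等價於一個7x7的卷積、fc7與分類層等價於1x1的卷積,轉換後網路便能接受任意大小的輸入,並輸出空間上的類別分數圖。

```python
import torch
import torch.nn as nn
from torchvision import models

# 取出VGG16的卷積與池化部份(輸出為 512 x H/32 x W/32 的特徵圖)
vgg = models.vgg16(weights=None)   # 示意用;實務上會載入預訓練權重再微調
features = vgg.features

# 把fc6、fc7的權重reshape成卷積核:fc6 -> 7x7卷積、fc7 -> 1x1卷積
fc6, fc7 = vgg.classifier[0], vgg.classifier[3]
conv6 = nn.Conv2d(512, 4096, kernel_size=7)
conv6.weight.data.copy_(fc6.weight.data.view(4096, 512, 7, 7))
conv6.bias.data.copy_(fc6.bias.data)
conv7 = nn.Conv2d(4096, 4096, kernel_size=1)
conv7.weight.data.copy_(fc7.weight.data.view(4096, 4096, 1, 1))
conv7.bias.data.copy_(fc7.bias.data)

# 21 = PASCAL VOC的20個類別 + 1個背景,對每個粗糙輸出位置預測類別分數
score = nn.Conv2d(4096, 21, kernel_size=1)

fcn_head = nn.Sequential(features, conv6, nn.ReLU(inplace=True),
                         conv7, nn.ReLU(inplace=True), score)

x = torch.randn(1, 3, 500, 500)    # 任意大小的輸入
print(fcn_head(x).shape)           # 1 x 21 x h' x w' 的粗糙類別分數圖(heatmap)
```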
:::

:::success
這些卷積模型的空間輸出映射使得它們成為語義分割等密集問題的自然選擇。由於每個輸出單元都有實際類別,因此正向與反向的過程都非常直接,而且都有卷積固有的高效計算(與積極的最佳化)。
:::

:::info
The corresponding backward times for the AlexNet example are 2.4 ms for a single image and 37 ms for a fully convolutional 10 × 10 output map, resulting in a speedup similar to that of the forward pass. This dense backpropagation is illustrated in Figure 1.
:::

:::success
以AlexNet為例,相對應的反向時間,單張影像為2.4ms,而全卷積的10x10輸出映射則為37ms,產生類似於正向傳遞的加速效果。Figure 1說明了這種密集的反向傳播。
:::

:::info
While our reinterpretation of classification nets as fully convolutional yields output maps for inputs of any size, the output dimensions are typically reduced by subsampling. The classification nets subsample to keep filters small and computational requirements reasonable. This coarsens the output of a fully convolutional version of these nets, reducing it from the size of the input by a factor equal to the pixel stride of the receptive fields of the output units.
:::

:::success
儘管我們將分類網路重新解釋為全卷積,而且可以針對任意大小的輸入產生輸出映射,但通常輸出的維度會因[次取樣](http://terms.naer.edu.tw/detail/1060460/)而降低。分類網路利用次取樣來維持濾波器的小以及合理的計算需求。這使得這些網路的全卷積版本的輸出變得粗糙,將其從輸入的大小縮小一個倍數,而這個倍數等於輸出單元接收域的像素步幅。
:::

### 3.2. Shift-and-stitch is filter rarefaction

:::info
Input shifting and output interlacing is a trick that yields dense predictions from coarse outputs without interpolation, introduced by OverFeat \[29\]. If the outputs are downsampled by a factor of f, the input is shifted (by left and top padding) $x$ pixels to the right and $y$ pixels down, once for every value of $(x, y) \in \left\{0, \dots, f-1 \right\} \times \left\{0, \dots, f-1 \right\}$. These $f^2$ inputs are each run through the convnet, and the outputs are interlaced so that the predictions correspond to the pixels at the centers of their receptive fields.
:::

:::success
輸入的[移位](http://terms.naer.edu.tw/detail/2124541/)與輸出的[交錯](http://terms.naer.edu.tw/detail/6634453/)是一種技巧,從粗糙的輸出中產生密集的預測,而不需要[插值](http://terms.naer.edu.tw/detail/2118159/),由OverFeat所引入\[29\]。如果輸出被以因子$f$做降採樣~(downsample)~,那麼輸入會向右[移位](http://terms.naer.edu.tw/detail/2124541/)$x$個像素(透過左、上的padding),向下[移位](http://terms.naer.edu.tw/detail/2124541/)$y$個像素,$(x, y) \in \left\{0, \dots, f-1 \right\} \times \left\{0, \dots, f-1 \right\}$的每一組值各做一次。這些$f^2$個輸入都各自通過卷積網路執行,而且輸出是交錯的,使得預測對應於它們接收域中心的像素。
:::

:::info
Changing only the filters and layer strides of a convnet can produce the same output as this shift-and-stitch trick. Consider a layer (convolution or pooling) with input stride $s$, and a following convolution layer with filter weights $f_{ij}$ (eliding the feature dimensions, irrelevant here). Setting the lower layer’s input stride to 1 upsamples its output by a factor of $s$, just like shift-and-stitch. However, convolving the original filter with the upsampled output does not produce the same result as the trick, because the original filter only sees a reduced portion of its (now upsampled) input. To reproduce the trick, rarefy the filter by enlarging it as

$$
f'_{ij} = \begin{cases} f_{i/s, j/s} & \text{if s divides both i and j;} \\[2ex] 0 & \text{otherwise}, \end{cases}
$$

(with $i$ and $j$ zero-based). Reproducing the full net output of the trick involves repeating this filter enlargement layer-by-layer until all subsampling is removed.
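:::

:::warning
個人補充:以下用numpy寫一個shift-and-stitch想法的玩具示意。`coarse_net`只是用步幅$f$的次取樣來代表「會讓輸出變粗糙的網路」,`shift_and_stitch`等名稱皆為假設;另外這裡是以「往左上裁切、右下補零」來實現位移(與論文所述的left/top padding方向相反),但把$f^2$個粗糙輸出交錯拼回密集預測的原理相同。

```python
import numpy as np

def coarse_net(img, f):
    # 示意用的「粗糙網路」:以步幅 f 次取樣,輸出大小約為輸入的 1/f
    return img[::f, ::f]

def shift_and_stitch(img, f):
    # 對 f*f 種 (y, x) 位移各跑一次 coarse_net,再把輸出交錯(stitch)回密集預測
    H, W = img.shape
    dense = np.zeros((H, W), dtype=img.dtype)
    for y in range(f):
        for x in range(f):
            shifted = np.pad(img[y:, x:], ((0, y), (0, x)), mode="constant")
            out = coarse_net(shifted, f)
            # 粗糙輸出 out[i, j] 對應原圖座標 (f*i + y, f*j + x)
            dense[y::f, x::f] = out[: (H - y + f - 1) // f, : (W - x + f - 1) // f]
    return dense

img = np.arange(36, dtype=float).reshape(6, 6)
# coarse_net 取的就是像素本身,因此交錯拼回後應該還原整張圖
assert np.allclose(shift_and_stitch(img, 2), img)
```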
:::

:::success
只需要改變卷積網路的filters與layer strides就可以產生如這個shift-and-stitch技巧一般的輸出。考慮一個輸入步幅為$s$的層(卷積或池化),以及下一個具有filter weights $f_{ij}$的卷積層(省略特徵維度,在此處不相關)。將較低層的輸入步幅設為1,會將它的輸出以$s$倍做[上取樣](http://terms.naer.edu.tw/detail/6570853/),就像shift-and-stitch一樣。然而,以原始的濾波器對[上取樣](http://terms.naer.edu.tw/detail/6570853/)後的輸出計算卷積,並不會產生與這個技巧相同的結果,因為原始的濾波器只看得到其(現已[上取樣](http://terms.naer.edu.tw/detail/6570853/)的)輸入中縮減後的一部分。為了要重現技巧的結果,要將濾波器稀疏化~(rarefy)~,放大為

$$
f'_{ij} = \begin{cases} f_{i/s, j/s} & \text{if s divides both i and j;} \\[2ex] 0 & \text{otherwise}, \end{cases}
$$

(其中$i$與$j$為zero-based~(零基)~)。重現這個技巧的完整網路的輸出,需要逐層重覆放大濾波器,一直到移除所有的[次取樣](http://terms.naer.edu.tw/detail/1060460/)。
:::

:::info
Simply decreasing subsampling within a net is a tradeoff: the filters see finer information, but have smaller receptive fields and take longer to compute. We have seen that the shift-and-stitch trick is another kind of tradeoff: the output is made denser without decreasing the receptive field sizes of the filters, but the filters are prohibited from accessing information at a finer scale than their original design.
:::

:::success
單純地減少網路中的[次取樣](http://terms.naer.edu.tw/detail/1060460/)是一種折衷:濾波器可以看見更精細的信息,但會有較小的接收域以及比較長的計算時間。我們已經看到,shift-and-stitch技巧是另一種類型的折衷:輸出更為密集而不需要降低濾波器的接收域大小,但是濾波器被禁止以比其原始設計更精細的尺度存取信息。
:::

:::info
Although we have done preliminary experiments with shift-and-stitch, we do not use it in our model. We find learning through upsampling, as described in the next section, to be more effective and efficient, especially when combined with the skip layer fusion described later on.
:::

:::success
儘管我們已經完成shift-and-stitch的初步實驗,但我們並沒有在模型中使用它。我們發現,透過[上取樣](http://terms.naer.edu.tw/detail/6570853/)學習(如下一章節所述),其效率與效果都更好,特別是結合後續說明的skip layer fusion時效果更好。
:::

### 3.3. Upsampling is backwards strided convolution

:::info
Another way to connect coarse outputs to dense pixels is interpolation. For instance, simple bilinear interpolation computes each output $y_{ij}$ from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells.
:::

:::success
另一種將粗糙輸出連結到密集像素的方法就是[插值法](http://terms.naer.edu.tw/detail/2118159/)。舉例來說,簡單的[雙線性](http://terms.naer.edu.tw/detail/2111829/)[插值法](http://terms.naer.edu.tw/detail/2118159/)利用線性映射從最近的四個輸入中計算每一個輸出$y_{ij}$,而線性映射僅取決於輸入與輸出單元的相對位置。
:::

:::info
In a sense, upsampling with factor $f$ is convolution with a fractional input stride of $1/f$. So long as $f$ is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of $f$. Such an operation is trivial to implement, since it simply reverses the forward and backward passes of convolution. Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.
:::

:::success
就某種意義而言,以因子$f$做[上取樣](http://terms.naer.edu.tw/detail/6570853/)是一種卷積,其分數式輸入步幅為$1/f$。只要$f$為整數,很自然的就是以輸出步幅為$f$的反向卷積(有些時候稱為deconvolution)來做[上取樣](http://terms.naer.edu.tw/detail/6570853/)。這種操作很容易可以實現,因為它只是單純的[反轉](http://terms.naer.edu.tw/detail/6673691/)卷積的正向與反向傳送。因此,[上取樣](http://terms.naer.edu.tw/detail/6570853/)是透過從pixelwise loss做反向傳播,在網路中以end-to-end的方式學習。
:::

:::info
Note that the deconvolution filter in such a layer need not be fixed (e.g., to bilinear upsampling), but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.
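:::

:::warning
個人補充:以下用PyTorch的`ConvTranspose2d`示意「以反向的跨步卷積做上取樣,權重初始化為雙線性插值、但仍可繼續學習」的做法。`bilinear_kernel`、`upsample`等名稱與kernel大小(2f)的選擇是常見慣例下的假設,並非論文原始碼。

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    # 建立雙線性插值的權重,用來初始化反卷積(backwards strided convolution)層
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - torch.abs(og - center) / factor
    kernel2d = filt[:, None] * filt[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):          # 每個channel各自做相同的雙線性上取樣
        weight[c, c] = kernel2d
    return weight

# 以 stride=f 的 ConvTranspose2d 做 f 倍上取樣;初始化為雙線性,之後可由反向傳播更新
f, n_class = 2, 21
upsample = nn.ConvTranspose2d(n_class, n_class, kernel_size=2 * f, stride=f,
                              padding=f // 2, bias=False)
upsample.weight.data.copy_(bilinear_kernel(n_class, 2 * f))

coarse = torch.randn(1, n_class, 16, 16)   # 粗糙的類別分數圖
print(upsample(coarse).shape)              # torch.Size([1, 21, 32, 32])
```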
::: :::success 要注意到,這類層反卷積濾波器並不需要固定(即,固定為[雙線性](http://terms.naer.edu.tw/detail/2111829/)[上取樣](http://terms.naer.edu.tw/detail/6570853/)),它是透過學習而得的。反卷積層與啟動函數的堆疊甚至可以學到非線性的[上取樣](http://terms.naer.edu.tw/detail/6570853/)。 ::: :::info In our experiments, we find that in-network upsampling is fast and effective for learning dense prediction. Our best segmentation architecture uses these layers to learn to upsample for refined prediction in Section 4.2. ::: :::success 在我們的實驗中,我們發現,in-network upsampling對於學習密集預測是快又有效。我們最佳的分割架構使用這些層來學習用於Section 4.2的精密預測做[上取樣](http://terms.naer.edu.tw/detail/6570853/), ::: ### 3.4. Patchwise training is loss sampling :::info In stochastic optimization, gradient computation is driven by the training distribution. Both patchwise training and fully-convolutional training can be made to produce any distribution, although their relative computational efficiency depends on overlap and minibatch size. Whole image fully convolutional training is identical to patchwise training where each batch consists of all the receptive fields of the units below the loss for an image (or collection of images). While this is more efficient than uniform sampling of patches, it reduces the number of possible batches. However, random selection of patches within an image may be recovered simply. Restricting the loss to a randomly sampled subset of its spatial terms (or, equivalently applying a DropConnect mask \[36\] between the output and the loss) excludes patches from the gradient computation. ::: :::success 在隨機最佳化中,梯度的計算是由訓練分佈來驅動。不論是patchwise training或是全卷積訓練都可以用來生成任意的分佈,儘管它們的相對計算效率是取決於[交疊](http://terms.naer.edu.tw/detail/2121289/)與minibatch size。整張影像的全卷積訓練與patchwise training是一樣的,每個batch都包含低於影像損失(或影像集合)的單元的所有接收域。儘管這比起區塊的均勻採樣還要來的有效,但它減少了可能的批處理數量。但是,可以簡單的恢復影像中隨機選擇的區塊。將損失限制為其空間項的隨機採樣子集(或等效於在輸出與損失之間執行DropConnect mask\[36\]),可以將區塊排除於梯度計算之外。 ::: :::info If the kept patches still have significant overlap, fully convolutional computation will still speed up training. If gradients are accumulated over multiple backward passes, batches can include patches from several images.^2^ ::: :::success 如果保留的區塊仍然存在著明顯的重疊,那麼全卷積計算仍將加快訓練速度。如果梯度是由多個反向傳遞所累積,那batches就可以包含來自多張影像的區塊。^2^ ::: :::info ^2^Note that not every possible patch is included this way, since the receptive fields of the final layer units lie on a fixed, strided grid. However, by shifting the image left and down by a random value up to the stride, random selection from all possible patches may be recovered. ::: :::success ^2^注意到,並非所有可能的區塊都以這種方式包含,因為最後一層神經元的接收域是位於一個固定而且跨步的網格上。但是,透過以隨機數值(直到步幅)將影像左、下移動,這可以恢復從所有可能區塊中所做的隨機選擇。 ::: :::info Sampling in patchwise training can correct class imbalance \[27, 8, 2\] and mitigate the spatial correlation of dense patches \[28, 16\]. In fully convolutional training, class balance can also be achieved by weighting the loss, and loss sampling can be used to address spatial correlation. ::: :::success patchwise training中的採樣可以校準類別的不平衡\[27, 8, 2\],並減輕dense patches的空間相關性\[28, 16\]。在全卷積訓練中,類別平衡也可以透過加權損失來實現,而且loss sampling也可以用來解決空間相關性。 ::: :::info We explore training with sampling in Section 4.3, and do not find that it yields faster or better convergence for dense prediction. Whole image training is effective and efficient. ::: :::success 我們在Section 4.3中探討採樣訓練~(training with sampling)~,並沒有發現對於密集預測有產生更快或更好的收斂性。整張影像訓練是快又有效的。 ::: ## 4. Segmentation Architecture :::info We cast ILSVRC classifiers into FCNs and augment them for dense prediction with in-network upsampling and a pixelwise loss. 
We train for segmentation by fine-tuning. Next, we build a novel skip architecture that combines coarse, semantic and local, appearance information to refine prediction. ::: :::info 我們將ILSVRC分類器轉換為FCNs,並且以in-network upsampling與pixelwise loss來增強它們以執行密集預測。我們利用微調來訓練分割。接下來,我們建立一個新穎的skip architecture,其結合粗糙、語義與局部,外觀信息來優化預測。 ::: :::info For this investigation, we train and validate on the PASCAL VOC 2011 segmentation challenge [7]. We train with a per-pixel multinomial logistic loss and validate with the standard metric of mean pixel intersection over union, with the mean taken over all classes, including background. The training ignores pixels that are masked out (as ambiguous or difficult) in the ground truth. ::: :::success 為了這個調查,我們在PASCAL VOC 2011分割挑戰賽上\[7\]訓練並驗證。我們以per-pixel[多項式](http://terms.naer.edu.tw/detail/3216877/)邏輯損失做訓練,並以mean pixel intersection over union(平均像素IoU)的標準度量做為驗證,並採用所有類別的均值(包含背景)。這訓練忽略實際類別中被隱藏的像素(模棱兩可或困難)。 ::: ### 4.1. From classifier to dense FCN :::info We begin by convolutionalizing proven classification architectures as in Section 3. We consider the AlexNet^3^ architecture \[19\] that won ILSVRC12, as well as the VGG nets \[31\] and the GoogLeNet^4^ \[32\] which did exceptionally well in ILSVRC14. We pick the VGG 16-layer net^5^ , which we found to be equivalent to the 19-layer net on this task. For GoogLeNet, we use only the final loss layer, and improve performance by discarding the final average pooling layer. We decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions. We append a 1 × 1 convolution with channel dimension 21 to predict scores for each of the PASCAL classes (including background) at each of the coarse output locations, followed by a deconvolution layer to bilinearly upsample the coarse outputs to pixel-dense outputs as described in Section 3.3. Table 1 compares the preliminary validation results along with the basic characteristics of each net. We report the best results achieved after convergence at a fixed learning rate (at least 175 epochs). ::: :::success 我們首先將經驗證過的分類架構做卷積化,如Section 3所述。我們考慮了贏得ILSVRC12的AlexNet^3^架構\[19\],還有ILSVRC14異常出色的VGG網路\[31\],以及GoogLeNet^4^\[32\]。我們選擇VGG 16-layer^5^的網路,我們發現它相當於這任務上的19-layer。而GoogLeNet,我們單純使用最後的損失層,並放棄最後的平均池化層來提高效能。我們砍了每一個網路最終的分類層,然後將全連接層轉換為卷積。我們加入一個channel為21的1x1卷積來預測每一個粗略輸出位置上的每一個PASCAL類別(包含背景)的分數,然後是deconvolution layer來將概略的輸出做雙線性升採樣為像素密集~(pixel-dense)~輸出,如Section 3.3所述。Table 1比較了初步的驗證結果,以及每個網路的基本特性。我們報告了以固定的learning rate收斂之後所得到的最佳結果(最少175個epochs)。 ::: :::info ![](https://i.imgur.com/DpI7aV4.png) Table 1. We adapt and extend three classification convnets to segmentation. We compare performance by mean intersection over union on the validation set of PASCAL VOC 2011 and by inference time (averaged over 20 trials for a 500 × 500 input on an NVIDIA Tesla K40c). We detail the architecture of the adapted nets as regards dense prediction: number of parameter layers, receptive field size of output units, and the coarsest stride within the net. (These numbers give the best performance obtained at a fixed learning rate, not best performance possible.) Table 1. 我們調整並擴展三個分類卷積網路為分割使用。我們在PASCAL VOC2011驗證集上以mean intersection over union比較效能與推理時間(在NVIDIA Tesla K40c以500 x 500做為輸入的20個試驗的平均值)。我們詳細說明調整後的網路架構(如密集預測):參數層的數量,輸出單元的接收域大小,以及網路內的最粗糙步幅。(這些數值在固定的learning rate下獲得最佳效能,而不是可能的最佳效能) ::: :::info Fine-tuning from classification to segmentation gave reasonable predictions for each net. Even the worst model achieved ∼ 75% of state-of-the-art performance. 
The segmentation-equipped VGG net (FCN-VGG16) already appears to be state-of-the-art at 56.0 mean IU on val, compared to 52.6 on test [16]. Training on extra data raises performance to 59.4 mean IU on a subset of val^7^. Training details are given in Section 4.3.
:::

:::success
從分類到分割的微調為每個網路提供了合理的預測。即使是最糟的模型,也可以達到約75%的最佳效能。配有分割功能的VGG網路(FCN-VGG16)在驗證集(val)上以56.0的平均IU已經達到最佳結果,相較之下\[16\]在測試集上為52.6。以額外的資料訓練,讓val的一個子集^7^上的效能提高到59.4的平均IU。訓練細節於Section 4.3中說明。
:::

:::info
Despite similar classification accuracy, our implementation of GoogLeNet did not match this segmentation result.
:::

:::success
儘管在分類上的準確度是相似的,但是我們實作的GoogLeNet並沒有達到這樣的分割結果。
:::

### 4.2. Combining what and where

:::info
We define a new fully convolutional net (FCN) for segmentation that combines layers of the feature hierarchy and refines the spatial precision of the output. See Figure 3.
:::

:::success
我們定義了一個用於分割的新的全卷積網路(FCN),它結合了特徵階層中的各層,並改善輸出空間的精準度。見Figure 3。
:::

:::info
While fully convolutionalized classifiers can be fine-tuned to segmentation as shown in 4.1, and even score highly on the standard metric, their output is dissatisfyingly coarse (see Figure 4). The 32 pixel stride at the final prediction layer limits the scale of detail in the upsampled output.
:::

:::success
儘管全卷積化的分類器可以如4.1節所說明的那樣微調為分割,甚至在標準度量上的得分很高,但它們的輸出卻是讓人不滿意的粗糙(見Figure 4)。最後預測層的32像素步幅限制了升採樣輸出中的細節比例。
:::

:::info
![](https://i.imgur.com/mpPrubD.png)
Figure 4. Refining fully convolutional nets by fusing information from layers with different strides improves segmentation detail. The first three images show the output from our 32, 16, and 8 pixel stride nets (see Figure 3).
Figure 4. 利用以不同步幅從層中融合信息來優化全卷積網路,可以改善分割細節。前三張影像說明從我們的32、16、與8個像素步幅網路的輸出(見Figure 3)。
:::

:::info
We address this by adding links that combine the final prediction layer with lower layers with finer strides. This turns a line topology into a DAG, with edges that skip ahead from lower layers to higher ones (Figure 3). As they see fewer pixels, the finer scale predictions should need fewer layers, so it makes sense to make them from shallower net outputs. Combining fine layers and coarse layers lets the model make local predictions that respect global structure. By analogy to the multiscale local jet of Florack et al. [10], we call our nonlinear local feature hierarchy the deep jet.
:::

:::success
我們透過增加一些[連結](http://terms.naer.edu.tw/detail/2119058/)來解決這個問題,這些連結將最後的預測層與較低、步幅更精細的層結合起來。這將線性拓撲轉換為DAG~(Directed Acyclic Graph,有向非循環圖)~,其中的邊~(edges)~會從較低的層往前跳到較高的層(Figure 3)。由於它們看到的像素較少,更精細尺度的預測應該需要較少的層,因此,讓它們從較淺的網路輸出來產生是有意義的。結合精細層與粗糙層可以讓模型根據全域結構來產出局部預測。這類似於Florack等人\[10\]的multiscale local jet,我們將我們的非線性局部特徵階層稱為deep jet。
:::

:::info
We first divide the output stride in half by predicting from a 16 pixel stride layer. We add a 1 × 1 convolution layer on top of pool4 to produce additional class predictions. We fuse this output with the predictions computed on top of conv7 (convolutionalized fc7) at stride 32 by adding a 2× upsampling layer and summing^6^ both predictions. (See Figure 3). We initialize the 2× upsampling to bilinear interpolation, but allow the parameters to be learned as described in Section 3.3. Finally, the stride 16 predictions are upsampled back to the image. We call this net FCN-16s. FCN-16s is learned end-to-end, initialized with the parameters of the last, coarser net, which we now call FCN-32s. The new parameters acting on pool4 are zero-initialized so that the net starts with unmodified predictions. The learning rate is decreased by a factor of 100.
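:::

:::warning
個人補充:以下用PyTorch把這一段FCN-16s的融合流程寫成簡單示意:pool4先經過1x1卷積(零初始化)得到分數,conv7的分數做2x上取樣後與之相加,最後再16x上取樣回輸入大小。張量大小、kernel大小與`score_pool4`等名稱都是為了說明而假設的;實際實作還牽涉padding與裁切(crop)的對齊細節,這裡省略。

```python
import torch
import torch.nn as nn

n_class = 21
# 假設 pool4 的特徵(stride 16)與 conv7 的分數圖(stride 32)已經算好,這裡以亂數代替
pool4   = torch.randn(1, 512, 32, 32)       # 512 x H/16 x W/16
score32 = torch.randn(1, n_class, 16, 16)   # 21  x H/32 x W/32

score_pool4 = nn.Conv2d(512, n_class, kernel_size=1)
nn.init.zeros_(score_pool4.weight)          # 作用於 pool4 的新參數以零初始化
nn.init.zeros_(score_pool4.bias)

upsample2x  = nn.ConvTranspose2d(n_class, n_class, 4, stride=2, padding=1, bias=False)
upsample16x = nn.ConvTranspose2d(n_class, n_class, 32, stride=16, padding=8, bias=False)

# FCN-16s:conv7 的分數 2x 上取樣後與 pool4 的分數相加,再 16x 上取樣回輸入大小
fused  = upsample2x(score32) + score_pool4(pool4)
output = upsample16x(fused)
print(output.shape)                          # torch.Size([1, 21, 512, 512])
```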
::: :::success 我們首先依據16像素步幅層的預測將輸出步幅減半。我們在pool4的頂部加入一個1x1的卷積層,以此生成其它類別的預測。我們以步幅32將這個輸出與conv7所計算的預測(卷積fc7)融合,然後利用增加2x的upsampling layer將兩個預測相加(見Figure 3)。我們將2x upsampling初始化為雙線性插值,但是允許如Section 3.3所述學習參數。最後,將步幅16的預測升採樣回影像。我們將這個網路稱為FCN-16s。FCN-16s是透過end-to-end學習而來,並使用最後一個,較為粗糙的網路(FCN-32s)的參數來初始化。[作用](http://terms.naer.edu.tw/detail/2110835/)於pool4的新參數初始化為零,因此這網路是從未修改的預測開始。learning rate降低100倍。 ::: :::info Learning this skip net improves performance on the validation set by 3.0 mean IU to 62.4. Figure 4 shows improvement in the fine structure of the output. We compared this fusion with learning only from the pool4 layer (which resulted in poor performance), and simply decreasing the learning rate without adding the extra link (which results in an insignificant performance improvement, without improving the quality of the output). ::: :::success 學習這個skip net讓驗證集的效能提高3個平均IU,來到62.4。Figure 4說明了輸出精細結構的改善。我們比較了這種融合與單純的從pool4 layer學習(這導致了較差的效能)的結果,並且僅降低learning rate而沒有增加額外的連結(這只有些微改善效能,並沒有提高輸出的品質)。 ::: :::info We continue in this fashion by fusing predictions from pool3 with a 2× upsampling of predictions fused from pool4 and conv7, building the net FCN-8s. We obtaina minor additional improvement to 62.7 mean IU, and find a slight improvement in the smoothness and detail of our output. At this point our fusion improvements have met diminishing returns, both with respect to the IU metric which emphasizes large-scale correctness, and also in terms of the improvement visible e.g. in Figure 4, so we do not continue fusing even lower layers. ::: :::success 我們持續這種方式,以2x upsampling融合pool3與pool4、conv7的預測,建置FCN-8s網路。我們得到些微的額外提升到62.7的平均IU,並且在我們的輸出的平滑度與細節上發現些許的改進。就這一點,我們的融合改進方式遇到[報酬遞減](http://terms.naer.edu.tw/detail/448481/)的問題,無論是在IU指標強調大規模正確性,還是如Figure 4中可見的改進,因此我們不會繼續融合更低的層。 ::: :::info Refinement by other means Decreasing the stride of pooling layers is the most straightforward way to obtain finer predictions. However, doing so is problematic for our VGG16-based net. Setting the pool5 layer to have stride 1 requires our convolutionalized fc6 to have a kernel size of 14 × 14 in order to maintain its receptive field size. In addition to their computational cost, we had difficulty learning such large filters. We made an attempt to re-architect the layers above pool5 with smaller filters, but were not successful in achieving comparable performance; one possible explanation is that the initialization from ImageNet-trained weights in the upper layers is important. ::: :::success 透過其它方式來減少池化層的步幅是獲得更精細預測的最直接的方法。然而,這麼做對我們基於VGG的網路是有問題的。要將pool5 layer的步幅設置為1,這需要我們的卷積fc6的kernel size為14x14才有辦法維持其接收域的大小。除了計算成本之外,我們難以學習這麼大的濾波器。我們嚐試著以較小的濾波器重新架構pool5以上的層,但是並沒有成功得到可以比較的效能;一種可能的解釋是,由ImageNet訓練而來的權重在較高層初始化是重要的。 ::: :::info Another way to obtain finer predictions is to use the shift-and-stitch trick described in Section 3.2. In limited experiments, we found the cost to improvement ratio from this method to be worse than layer fusion. ::: :::success 另一種得到較好的預測的方法是使用shift-and-stitch,如Section 3.2中所述。在有限的實驗中,我們發現這種方法的提升率比層的融合方法還要差。 ::: ### 4.3. Experimental framework :::info **Optimization** We train by SGD with momentum. We use a minibatch size of 20 images and fixed learning rates of 10^−3^, 10^−4^, and 5^-5^ for FCN-AlexNet, FCN-VGG16, and FCN-GoogLeNet, respectively, chosen by line search. We use momentum 0.9, weight decay of 5^−4^ or 2^−4^ , and doubled the learning rate for biases, although we found training to be insensitive to these parameters (but sensitive to the learning rate). 
We zero-initialize the class scoring convolution layer, finding random initialization to yield neither better performance nor faster convergence. Dropout was included where used in the original classifier nets. ::: :::success **Optimization** 我們以SGD + momentum訓練。我們使用minibatch size=20,並分別針對FCN-AlexNet、FCN-VGG16與FCN-GoogLeNet以固定learning rate,10^−3^、10^−4^、5^-5^,透過[直線搜尋](http://terms.naer.edu.tw/detail/67396/)選擇。我們使用mementum=0.9,weight decay為5^-4^或2^-4^,並讓biases的learning rate加倍,儘管我們發現訓練過程對這些參數不是那麼敏感。我們將類別分數的卷積層初始化為零,因為我們發現,隨機初始化既不會產生更好的效能,也不會收斂的比較快。原始分類器網路使用了包含dropout。 ::: :::info **Fine-tuning** We fine-tune all layers by backpropagation through the whole net. Fine-tuning the output classifier alone yields only 70% of the full fine-tuning performance as compared in Table 2. Training from scratch is not feasible considering the time required to learn the base classification nets. (Note that the VGG net is trained in stages, while we initialize from the full 16-layer version.) Fine-tuning takes three days on a single GPU for the coarse FCN-32s version, and about one day each to upgrade to the FCN-16s and FCN-8s versions. ::: :::success **Fine-tuning** 我們透過整個網路的反向傳播對所有層做微調。如Table 2比較,如果單獨的微調輸出分類器就只能產生70%的完整微調效能。考慮到學習基本的分類網路所需的時間,從頭開始訓練是行不通的。(注意到,VGG網路是分階段訓練的,而我們是從完整的16-layer版本初始化的)對於粗略的FCN-32s版本,單GPU的訓練時間是3天,而升級到FCN-16s與FCN-8s版本則大約需要1天。 ::: :::info ![](https://i.imgur.com/Xdolri2.png) Table 2. Comparison of skip FCNs on a subset of PASCAL VOC2011 validation . Learning is end-to-end, except for FCN32s-fixed, where only the last layer is fine-tuned. Note that FCN32s is FCN-VGG16, renamed to highlight stride. Table 2. 比較skip FCNs在PSACAL VOC2011驗證集子集上的結果。以end-to-end學習,除了FCN32s-fixed,僅最後一層微調。注意到,FCN32s是FCN-VGG16,重新命名以突顯步幅。 ::: :::info **Patch Sampling** As explained in Section 3.4, our full image training effectively batches each image into a regular grid of large, overlapping patches. By contrast, prior work randomly samples patches over a full dataset \[27, 2, 8, 28, 11\], potentially resulting in higher variance batches that may accelerate convergence \[22\]. We study this tradeoff by spatially sampling the loss in the manner described earlier, making an independent choice to ignore each final layer cell with some probability $1−p$. To avoid changing the effective batch size, we simultaneously increase the number of images per batch by a factor $1/p$. Note that due to the efficiency of convolution, this form of rejection sampling is still faster than patchwise training for large enough values of $p$ (e.g., at least for $p > 0.2$ according to the numbers in Section 3.1). Figure 5 shows the effect of this form of sampling on convergence. We find that sampling does not have a significant effect on convergence rate compared to whole image training, but takes significantly more time due to the larger number of images that need to be considered per batch. We therefore choose unsampled, whole image training in our other experiments. ::: :::success **Patch Sampling** 如Section 3.4所解釋,我們的完整影像訓練有效地將每個影像批處理為一個有規律的大型網格,重疊的區塊。相比之下,先前的作品在整個資料集上隨機採樣區塊\[27, 2, 8, 28, 11\],可能導致更高變異批次,這可能加速收斂\[22\]。我們利用前面描述過的方式對損失做空間的採樣來研究這之間的一個折衷,以大約$1-p$的機率獨立選擇忽略每個最後一層的神經元。為了避免改更有效的批次大小,我們同時將每一批的影像數量增加$1/p$倍。注意而,由於卷積的效率,這種形式的[棄卻抽樣](http://terms.naer.edu.tw/detail/3645603/)在$p$足夠大的情況下仍然比patchwise training還來的快(即,根據Section 3.1,最少$p > 0.25$)。Figure 5說明,這種形式的採樣對收斂的影響。我們發現,對比整個影像的訓練,採樣對收斂速度並沒有明顯的影響,但是由於每一批需要考慮的數量變多,因此花費的時間明顯變多。因此,我們選擇不採樣,在其它的實驗中將以整個影像訓練來執行。 ::: :::info ![](https://i.imgur.com/6CKmPWP.png) Figure 5. 
Training on whole images is just as effective as sampling patches, but results in faster (wall time) convergence by making more efficient use of data. Left shows the effect of sampling on convergence rate for a fixed expected batch size, while right plots the same by relative wall time. Figure 5. 以整張影像訓練與sampling patches一樣的有效,但是透過更有效地利用資料,可以更快的收斂。左圖說明在固定的預期批量大小下,sampling對收斂速度的影響,而右圖則是以相對的[經過時間](http://terms.naer.edu.tw/detail/6690193/)繪出相同的東西。 ::: :::info **Class Balancing** Fully convolutional training can balance classes by weighting or sampling the loss. Although our labels are mildly unbalanced (about 3/4 are background), we find class balancing unnecessary. ::: :::success **Class Balancing** 全卷積訓練可以透過加權或採樣損失來平衡類別。儘管我們的標記是有些許的不平衡(大約3/4背景),但我們發現類別的平徑並不是必需的。 ::: :::info **Dense Prediction** The scores are upsampled to the input dimensions by deconvolution layers within the net. Final layer deconvolutional filters are fixed to bilinear interpolation, while intermediate upsampling layers are initialized to bilinear upsampling, and then learned. Shift-and-stitch (Section 3.2), or the filter rarefaction equivalent, are not used. ::: :::success **Dense Prediction** 透過網路內的deconvolution layer,得分被upsampled為輸入維度。最後一層的deconvolutional filters固定為雙線性插值,而中間的upsampling layers被初始化為bilinear upsampling,然後學習。這並沒有使用shift-and-stitch(Section 3.2)或filter rarefaction equivalent。 ::: :::info **Augmentation** We tried augmenting the training data by randomly mirroring and “jittering” the images by translating them up to 32 pixels (the coarsest scale of prediction) in each direction. This yielded no noticeable improvement. ::: :::success **Augmentation** 我們試著利用隨機鏡像與"抖動(jittering)"(每個方向最多[位移](http://terms.naer.edu.tw/detail/2454761/)32個像素)來增強訓練資料(最粗糙的預測比例)。這並沒有產生明顯的改善。 ::: :::info **More Training Data** The PASCAL VOC 2011 segmentation challenge training set, which we used for Table 1, labels 1112 images. Hariharan et al. \[15\] have collected labels for a much larger set of 8498 PASCAL training images, which was used to train the previous state-of-the-art system, SDS \[16\]. This training data improves the FCN-VGG16 validation score7 by 3.4 points to 59.4 mean IU. ::: :::success PASCAL VOC 2011分割挑戰賽訓練集(用於Table 1),標記了1112張影像。Hariharan\[15\]等人收集了更大的8498 PASCAL訓練影像標記,這些資料集用於先前最佳的系統SDS\[16\]。這些訓練資料提高FCN-VGG16的驗證得分^7^3.4個百分點,來到59.4平均IU。 ::: :::info ^7^There are training images from \[15\] included in the PASCAL VOC 2011 val set, so we validate on the non-intersecting set of 736 images. An earlier version of this paper mistakenly evaluated on the entire val set. ^7^PASCAL VOC 2011驗證集包含來自\[15\]的訓練影像,因此我們對736張影像的non-intersecting set上做驗證。此論文的早期版本錯誤的評估了整個驗證集。 ::: :::info **Implementation** All models are trained and tested with Caffe \[18\] on a single NVIDIA Tesla K40c. The models and code will be released open-source on publication. ::: :::success 所有模型的訓練與測試,都在單張的NVIDIA Tesla K40c上用Caffe\[18\]完成。模型與程式碼都會以開源的形式發佈。 ::: ## 5. Results :::info We test our FCN on semantic segmentation and scene parsing, exploring PASCAL VOC, NYUDv2, and SIFT Flow. Although these tasks have historically distinguished between objects and regions, we treat both uniformly as pixel prediction. We evaluate our FCN skip architecture8 on each of these datasets, and then extend it to multi-modal input for NYUDv2 and multi-task prediction for the semantic and geometric labels of SIFT Flow. 
:::

:::success
我們在語義分割與場景解析上測試FCN,探討PASCAL VOC、NYUDv2、與SIFT Flow。儘管這些任務歷年來在物件與區域之間是有所區別的,但是我們一律將之視為像素預測。我們在這些資料集上評估FCN skip architecture^8^,然後,將其擴展為multi-modal輸入(用於NYUDv2),以及multi-task prediction(用於SIFT Flow的語義與幾何標記)。
:::

:::info
**Metrics** We report four metrics from common semantic segmentation and scene parsing evaluations that are variations on pixel accuracy and region intersection over union (IU). Let $n_{ij}$ be the number of pixels of class $i$ predicted to belong to class $j$, where there are $n_{cl}$ different classes, and let $t_i = \sum_j n_{ij}$ be the total number of pixels of class $i$. We compute:
* pixel accuracy: $\sum_i n_{ii} / \sum_i t_i$
* mean accuracy: $(1 / n_{cl}) \sum_i n_{ii} / t_i$
* mean IU: $(1 / n_{cl}) \sum_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
* frequency weighted IU: $(\sum_k t_k)^{-1} \sum_i t_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
:::

:::success
**Metrics** 我們報告四種常見用於語義分割與場景解析的指標。這些指標是像素準確度與region intersection over union (IU)的變化。令$n_{ij}$為類別$i$被預測為類別$j$的像素數量,其中共有$n_{cl}$個不同類別,並令$t_i = \sum_j n_{ij}$為類別$i$的總像素數量,我們計算:
* pixel accuracy: $\sum_i n_{ii} / \sum_i t_i$
* mean accuracy: $(1 / n_{cl}) \sum_i n_{ii} / t_i$
* mean IU: $(1 / n_{cl}) \sum_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
* frequency weighted IU: $(\sum_k t_k)^{-1} \sum_i t_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
:::

:::info
**PASCAL VOC** Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS \[16\], and the well-known R-CNN \[12\]. We achieve the best results on mean IU^9^ by a relative margin of 20%. Inference time is reduced 114× (convnet only, ignoring proposals and refinement) or 286× (overall).
:::

:::success
**PASCAL VOC** Table 3給出FCN-8s在PASCAL VOC 2011與2012測試集上的效能,並且與先前的最佳技術SDS\[16\],以及眾所皆知的R-CNN\[12\]做比較。我們在平均IU上得到最佳結果^9^,相對優勢~(relative margin)~為20%。推理時間則是減少114x(僅convnet,不計proposals與refinement)或286x(總體)。
:::

:::info
![](https://i.imgur.com/7Zig9Dw.png)
Table 3. Our fully convolutional net gives a 20% relative improvement over the state-of-the-art on the PASCAL VOC 2011 and 2012 test sets, and reduces inference time.
Table 3. 以PASCAL VOC2011、2012測試集與最佳技術相比,我們的全卷積網路得到20%的相對改進,而且減少推理時間。
:::

:::info
**NYUDv2** \[30\] is an RGB-D dataset collected using the Microsoft Kinect. It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. \[13\]. We report results on the standard split of 795 training images and 654 testing images. (Note: all model selection is performed on PASCAL 2011 val.) Table 4 gives the performance of our model in several variations. First we train our unmodified coarse model (FCN-32s) on RGB images. To add depth information, we train on a model upgraded to take four-channel RGB-D input (early fusion). This provides little benefit, perhaps due to the difficultly of propagating meaningful gradients all the way through the model. Following the success of Gupta et al. \[14\], we try the three-dimensional HHA encoding of depth, training nets on just this information, as well as a “late fusion” of RGB and HHA where the predictions from both nets are summed at the final layer, and the resulting two-stream net is learned end-to-end. Finally we upgrade this late fusion net to a 16-stride version.
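:::

:::warning
個人補充:回頭補充上面 **Metrics** 所列的四個指標。以下用numpy寫一個從混淆矩陣$n_{ij}$計算這四個指標的簡單示意;`segmentation_metrics`這個函數名稱是假設的,而且沒有處理某個類別完全沒出現($t_i = 0$)的情況。

```python
import numpy as np

def segmentation_metrics(pred, gt, n_cl):
    # n[i, j]:真實類別 i 被預測成類別 j 的像素數(混淆矩陣)
    n = np.bincount(gt.reshape(-1) * n_cl + pred.reshape(-1),
                    minlength=n_cl * n_cl).reshape(n_cl, n_cl)
    t  = n.sum(axis=1)                       # t_i:類別 i 的總像素數
    iu = np.diag(n) / (t + n.sum(axis=0) - np.diag(n))
    return {
        "pixel accuracy":        np.diag(n).sum() / t.sum(),
        "mean accuracy":         np.mean(np.diag(n) / t),
        "mean IU":               np.mean(iu),
        "frequency weighted IU": (t * iu).sum() / t.sum(),
    }

gt   = np.random.randint(0, 21, size=(500, 500))
pred = np.random.randint(0, 21, size=(500, 500))
print(segmentation_metrics(pred, gt, n_cl=21))
```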
:::

:::success
**NYUDv2**\[30\]是一個RGB-D的資料集,使用Microsoft Kinect收集而得。它擁有1449張帶有像素級別標記的RGB-D影像,並且被Gupta等人\[13\]合併為40個類別的語義分割任務。我們報告795張訓練影像與654張測試影像的標準分割的結果(注意:所有的模型選擇都是在PASCAL 2011驗證集~(val)~上進行的)。Table 4給出幾種模型的效能說明。首先,我們在RGB影像上訓練未修改的粗糙模型(FCN-32s)。為了增加深度的信息,我們訓練一個升級後的模型,以便使用四通道RGB-D的輸入(早期融合)。這幾乎沒有什麼好處,可能是因為難以在整個模型中傳播有意義的梯度。繼Gupta等人\[14\]成功之後,我們試了深度的三維HHA編碼,僅用這些信息訓練網路,以及RGB與HHA的"後期融合",並在最後一層將兩個網路的預測相加,從而得到end-to-end學習的雙流網路。最終,我們將這個後期融合的網路更新為16-stride版本。
:::

:::warning
個人見解:
* early fusion所指的是特徵上的融合,而late fusion指的是在預測(分數)上的融合
* [參考來源](https://zhuanlan.zhihu.com/p/48351805)
:::

:::info
![](https://i.imgur.com/Aep8AxY.png)
Table 4. Results on NYUDv2. RGBD is early-fusion of the RGB and depth channels at the input. HHA is the depth embedding of \[14\] as horizontal disparity, height above ground, and the angle of the local surface normal with the inferred gravity direction. RGB-HHA is the jointly trained late fusion model that sums RGB and HHA predictions.
Table 4. NYUDv2上的結果。RGBD是輸入的RGB與深度通道的早期融合。HHA是\[14\]所提出的深度嵌入~(depth embedding)~,把深度編碼為水平視差、離地高度,以及局部表面法線與推測重力方向之間的夾角。RGB-HHA是聯合訓練的後期融合模型,將RGB與HHA的預測相加。
:::

:::info
**SIFT Flow** is a dataset of 2,688 images with pixel labels for 33 semantic categories (“bridge”, “mountain”, “sun”), as well as three geometric categories (“horizontal”, “vertical”, and “sky”). An FCN can naturally learn a joint representation that simultaneously predicts both types of labels. We learn a two-headed version of FCN-16s with semantic and geometric prediction layers and losses. The learned model performs as well on both tasks as two independently trained models, while learning and inference are essentially as fast as each independent model by itself. The results in Table 5, computed on the standard split into 2,488 training and 200 test images,^10^ show state-of-the-art performance on both tasks.
:::

:::success
**SIFT Flow** 是一個有2,688張影像的資料集,帶有33個語義類別("橋"、"山"、"太陽")以及三個幾何類別("水平"、"垂直"與"天空")的像素標記。FCN可以很自然的學到聯合表示,同時預測兩種標記類型。我們學習一個雙頭~(two-headed)~版本的FCN-16s,同時擁有語義與幾何的預測層以及損失。學習到的模型在兩個任務上的表現,都與兩個獨立訓練的模型一樣好,而學習與推理的速度基本上與單一個獨立模型一樣快。結果置於Table 5,是在2,488張訓練影像與200張測試影像的標準分割上計算的^10^,顯示出兩個任務的目前最佳效能。
:::

:::info
![](https://i.imgur.com/CczXFJa.png)
Table 5. Results on SIFT Flow^10^ with class segmentation (center) and geometric segmentation (right). Tighe [33] is a non-parametric transfer method. Tighe 1 is an exemplar SVM while 2 is SVM + MRF. Farabet is a multi-scale convnet trained on class-balanced samples (1) or natural frequency samples (2). Pinheiro is a multi-scale, recurrent convnet, denoted RCNN~3~ ($O^3$ ). The metric for geometry is pixel accuracy.
Table 5. SIFT Flow^10^上的結果(類別分割(中)與幾何分割(右))。Tighe\[33\]是一個非參數轉換的方法。Tighe 1是exemplar SVM,而Tighe 2是SVM + MRF。Farabet是訓練於類別平衡樣本(1)或自然頻率樣本(2)的多尺度卷積網路。Pinheiro為多尺度、遞迴的卷積網路,表示為RCNN~3~($O^3$)。幾何的指標是像素準確度。
:::

:::info
![](https://i.imgur.com/WzwKBIN.png)
Figure 6. Fully convolutional segmentation nets produce stateof-the-art performance on PASCAL. The left column shows the output of our highest performing net, FCN-8s. The second shows the segmentations produced by the previous state-of-the-art system by Hariharan et al. [16]. Notice the fine structures recovered (first row), ability to separate closely interacting objects (second row), and robustness to occluders (third row). The fourth row shows a failure case: the net sees lifejackets in a boat as people.
Figure 6.
全卷積分割網路在PASCAL上產出最佳效能。左邊的column說明著我們最高效能網路(FCN-8s)的輸出。第二個column則是先前最佳系統(Hariharan等人\[16\])所產生的分割結果。注意到還原出來的精細結構(第一個row),分離緊密互動物件的能力(第二個row),以及對遮蔽物的魯棒性(第三個row)。第四個row是一個失敗的案例:模型將船上的救生衣看成人了。
:::

## 6. Conclusion

:::info
Fully convolutional networks are a rich class of models, of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state-of-the-art, while simultaneously simplifying and speeding up learning and inference.
:::

:::success
全卷積網路是一類豐富的模型,現代的分類卷積網路就是其中的一個特例。認識到這一點,將這些分類網路擴展到分割應用,並以多解析度的層組合來改善架構,可以極大地改善現有技術,同時簡化並加快學習與推理速度。
:::