Improved Residual Networks for Image and Video Recognition (iResNet)

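# Improved Residual Networks for Image and Video Recognition (iResNet)

###### tags: `paper` `public`

[toc]

## Introduction and Related Work

The authors train a 404-layer deep CNN on ImageNet and a 3002-layer network on CIFAR-10 and CIFAR-100. The original ResNet fails to converge in these extreme settings.

The improvements target three main components:
- the information flow through the network layers
- the projection shortcut
- the residual building block

Four contributions:
1. A stage-based residual learning architecture is introduced. It facilitates learning by giving information a better path to propagate through the network layers.
2. An improved projection shortcut is proposed that reduces information loss and gives better results.
3. A building block is proposed that considerably increases the number of spatial channels, so that more powerful spatial patterns can be learned.
4. The proposed approach improves consistently over the baselines. Importantly, these improvements are obtained without increasing model complexity.

Validation is carried out on six datasets (four for images and two for large-scale video classification):
- ImageNet / CIFAR-10 / CIFAR-100 / COCO
- Something-Something-v2 / Kinetics-400

Earlier work used grouped convolution to split the computation of the convolutions over two GPUs to overcome limitations in computational resources.

## Improved residual networks

### Improved information flow through the network

![](https://i.imgur.com/VmJYYud.png)

#### Drawbacks of the existing approaches

At one extreme is the original ResNet, which hampers information propagation by placing too many gates (for example, ReLUs) on the main path. At the other extreme is the pre-activation ResNet, which lets the signal pass through the network in an uncontrolled way.

1. First, in all four stages ([3,4,6,3]) the full signal is never normalized, so it becomes more and more "unnormalized" as blocks are added, which makes learning difficult. Both the original ResNet and the pre-act ResNet have this problem.
2. Second, there are four projection shortcuts (four 1x1 convs). In theory, most blocks may learn an identity mapping (the branch in the building block with convolutions can output zeros). In that case, a pre-act ResNet with four main stages ends up with only four consecutive 1x1 convs and no nonlinearity in between, which limits its learning capacity.

#### The improvement, based on the above

Our approach (iResNet) addresses both problems: it stabilizes the signal before each main stage and guarantees at least one nonlinearity at the end of each stage.

iResNet splits each stage of the original ResNet into one Start ResBlock, a stage-dependent number of Middle ResBlocks ([1,2,4,1] for the four stages), and one End ResBlock. The four stages are therefore laid out as [1,1,1, 1,2,1, 1,4,1, 1,1,1]. The key point is that the proposed solution does not increase model complexity. Unlike the original approaches, the proposed ResStage has a fixed number of ReLUs on the main path for forward and backward information propagation. In the original ResNet, for instance, the number of ReLUs on the main propagation path is proportional to the network depth. In our ResStage, across the four main stages, there are only four ReLUs on the main propagation path, regardless of depth. This keeps the signal from being impeded as it passes through many layers.

:::success
:bulb: Why are there almost no ReLUs on the main path?
A ReLU at this position zeroes out negative values, which hurts information propagation, especially early in training, when many values are negative.

:bulb: The pre-act ResNet is easier to optimize than the original ResNet, so why change the pre-act ResBlock?
> 1. There is no BN on the main path, so the signal is never normalized, which makes learning harder.
> 2. All four residual blocks end with a 1x1 convolution and there is no nonlinearity between the blocks, which limits learning capacity.
:::

We specifically split the network into several stages, each of which contains three parts.
1. The End ResBlock of each stage is completed with a BN and ReLU, which can be seen as preparation for the next stage, stabilizing and preparing the signal to enter into a new stage.
2. In our Start ResBlock, there is a BN layer after the last conv, which normalizes the signal, preparing it for the element-wise addition with the projection shortcut (which also provides a normalized signal).
3. We eliminate the first BN in the first Middle ResBlock, as the signal is already normalized by our Start ResBlock. (Note: this should mean the first Middle ResBlock in each stage.)

The proposed approach facilitates learning by giving information a better path through the network. The ResStage is easy to optimize and allows extremely deep networks to be trained easily. The network can readily and dynamically choose which ResBlocks to use and which to discard during learning. The stage-based design aims to make the information flow efficient while keeping the signal under control.

### Improved projection shortcut

#### Drawbacks of the original approach

The original projection shortcut uses a conv with a 1×1 kernel and stride 2 to project the channels of x to the number of output channels of F. A BN is then applied before the element-wise addition with the output of F. Both channel matching and spatial matching are thus handled by a single 1×1 conv, which causes a significant loss of information: with stride 2, the 1×1 conv skips 75% of the feature-map activations when it reduces the spatial size. Moreover, no meaningful criterion determines which 25% of the activations the 1×1 conv does consider. The noisy output of the projection shortcut therefore contributes roughly half of the information passed to the next ResBlock.

![](https://i.imgur.com/AcqUkHl.png)

#### The improvement

We separate the channel projection from the spatial projection. The spatial projection is a 3x3 max pooling with stride 2. It is followed by the channel projection, a 1x1 conv with stride 1, and then a BN.

With the **max pooling**, we introduce a criterion for **selecting which activations should be considered** for 1×1 conv. Furthermore, **the spatial projection considers all the information from the feature maps and picks the element with the highest activation to be considered in the next step**. Note that the kernel of the max pooling coincides with the kernel of the middle conv from the ResBlock, which ensures that the element-wise addition is performed between elements computed over the same spatial window.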
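
The difference between the two spatial projections can be checked with a small, dependency-free sketch. It covers only the spatial part of the shortcut (the 1x1 channel projection and the BN are left out), and the helper names `window_positions`, `max_pool` and `touched_fraction`, the toy grid, and the padding of 1 for the 3x3 max pooling are assumptions made for illustration, not details taken from the paper.

```python
def window_positions(size, kernel, stride, padding):
    """Per output index, the valid input indices covered by its window (1-D)."""
    out = (size + 2 * padding - kernel) // stride + 1
    return [
        [i for i in range(o * stride - padding, o * stride - padding + kernel)
         if 0 <= i < size]
        for o in range(out)
    ]


def max_pool(grid, kernel, stride, padding):
    """2-D max pooling; with kernel=1 it reduces to plain strided subsampling."""
    rows = window_positions(len(grid), kernel, stride, padding)
    cols = window_positions(len(grid[0]), kernel, stride, padding)
    return [[max(grid[y][x] for y in ys for x in xs) for xs in cols] for ys in rows]


def touched_fraction(height, width, kernel, stride, padding):
    """Fraction of input activations read by at least one window."""
    rows = {y for ys in window_positions(height, kernel, stride, padding) for y in ys}
    cols = {x for xs in window_positions(width, kernel, stride, padding) for x in xs}
    return len(rows) * len(cols) / (height * width)


# Toy 4x4 feature map (hypothetical values).
grid = [
    [1, 9, 2, 3],
    [4, 5, 6, 7],
    [0, 1, 8, 2],
    [3, 2, 1, 4],
]

# Positions a 1x1 conv with stride 2 looks at: a fixed, content-independent pick.
print(max_pool(grid, kernel=1, stride=2, padding=0))   # [[1, 2], [0, 8]]
# 3x3 max pooling with stride 2 (padding 1 assumed): highest activation per window.
print(max_pool(grid, kernel=3, stride=2, padding=1))   # [[9, 9], [5, 8]]

print(touched_fraction(8, 8, kernel=1, stride=2, padding=0))  # 0.25
print(touched_fraction(8, 8, kernel=3, stride=2, padding=1))  # 1.0
```

On an 8x8 map the strided 1x1 sampling reads only a quarter of the activations, which matches the 75% loss described above, while the 3x3 windows cover every activation before the maximum is taken.
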
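Our proposed projection shortcut reduces the information loss and, in the experimental section, we show the benefit in performance.

There are three motivations for the projection shortcut:
1. It reduces information loss and perturbation of the signal (as described above).
2. Max pooling in the Start ResBlock of each main stage improves the translation invariance of the network and, in turn, the overall recognition performance.
3. With our projection shortcut, the downsampling in the Start ResBlock of each stage can be seen as a combination of "soft downsampling" (performed by the 3x3 conv of the Start ResBlock) and "hard downsampling" (performed by the 3x3 max pooling of the projection shortcut).

:::success
:mag: hard downsampling
Picks a single element, which helps classification (picking the element with the highest activation).

:mag: soft downsampling
Considers every element (weighted downsampling), which helps avoid losing all spatial context (therefore, helping for better localization, as the transition between elements is smoother).
:::

A ResNet usually needs only four projection shortcuts, one at the start of each stage (on the Start ResBlock of a stage). The added computational cost of the proposed projection shortcut is therefore negligible, because only three additional max pooling layers are needed, and these are generally cheap. For the first stage, the max pooling already present in ResNet is used.

:::success
:bulb: Why change the downsampling method?
A 1x1 conv with stride 2 discards 75% of the information, and the remaining 25% is not selected by any meaningful criterion, which harms the information flow on the main path.

:bulb: Why redesign the downsampling around spatial context?
The redesigned shortcut considers all the information in the feature maps and passes the element with the highest activation to the next step, which reduces information loss. In addition, the 3x3 conv can be seen as soft downsampling and the 3x3 max pooling as hard downsampling, and the two complement each other. Hard downsampling helps select the element with the highest activation, while soft downsampling helps avoid losing all spatial information, which benefits localization.
:::

### Grouped building block

For practical reasons, the bottleneck building block was introduced to keep the computational cost reasonable as network depth increases. It consists of a 1×1 conv that reduces the number of channels, a bottleneck 3×3 conv that operates on the smallest number of input/output channels, and a final 1×1 conv that restores the original number of channels. The reason for this design is to run the 3×3 conv on fewer channels, which keeps the computational cost and the number of parameters under control. However, the 3×3 conv is very important because it is the only component able to learn spatial patterns, yet in the bottleneck design it receives the smallest number of input/output channels.

![](https://i.imgur.com/nhrQK0R.png)

We propose an improved building block in which the 3×3 conv has the largest number of input/output channels. The design uses grouped convolution, and we call it the ResGroup block. Grouped convolution was originally used as a way to distribute a model over two GPUs, to overcome limits on computational cost and memory. More recent work uses grouped convolution to improve accuracy. In a standard convolution, every output channel is connected to all input channels. The main idea of grouped convolution is to split the input channels into several groups and convolve each group independently. The number of parameters and floating-point operations is thereby reduced by a factor equal to the number of groups (that is, divided by G).

The number of parameters and FLOPs can be expressed as

$$
\begin{aligned}
params &= \frac{ch_{in}}{G} \cdot ch_{out} \cdot k_{1} \cdot k_{2} \\
FLOPs &= \frac{ch_{in}}{G} \cdot ch_{out} \cdot k_{1} \cdot k_{2} \cdot w \cdot h
\end{aligned}
$$

where $ch_{in}, ch_{out}$ are the numbers of input and output channels; $k_{1}, k_{2}$ are the kernel size of the conv; $w, h$ are the width and height of the channel; and $G$ is the number of groups the channels are split into.

- If G=1, then we have a standard convolution.
- If G is equal to the number of input channels, then we are at the other extreme, which is called depthwise convolution.

![](https://i.imgur.com/sweclL2.png)

We use grouped convolution with a 3×3 spatial kernel.

* ResGroupFix-50 denotes the case where the number of groups is fixed for every stage (64 in our case).
* ResGroup-50 denotes the case where the number of groups is adapted to the number of channels, so that every group has the same number of channels in all stages. In this way the 3×3 conv has the largest number of channels and a higher capacity for learning spatial patterns.

Both ResNet and ResNeXt use a bottleneck-shaped building block, in which the spatial operation (the 3x3 conv) runs on the smallest number of channels, whereas in our proposed block the 3×3 conv runs on the largest number of channels.

:::success
:bulb: Why ResGroup?
In the original bottleneck, the 3x3 conv has too few channels, which weakens the learning capacity of the network, so the channels are first increased and then reduced. Grouped conv is used to cut the number of parameters and FLOPs.
:::

## Experiment

![](https://i.imgur.com/W8c4xPx.png)

This work proposed an improved version of residual networks. Our improvements address all three main components of a ResNet: information propagation through the network, the projection shortcut, and the building block.

![](https://i.imgur.com/oI7mmsX.png)
![](https://i.imgur.com/DaIVmTg.png)
![](https://i.imgur.com/JqkfXI1.png)
![](https://i.imgur.com/JDkvjeU.png)
![](https://i.imgur.com/aocY2zR.png)

Our proposed approach facilitates learning of extremely deep networks, showing no optimization issues when training networks with over 400 layers (on ImageNet) and over 3000 layers (on CIFAR-10/100).

## References

- [Improved Residual Networks for Image and Video Recognition](https://arxiv.org/pdf/2004.04989.pdf)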