Deep Residual Learning for Image Recognition
====
> Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
> 2015, Microsoft Research
> [highlight paper on kami](https://kami.app/wzDizwQFEnIB)

[Back to index](https://hackmd.io/@marmot0814/ryzDejzGr)

# Abstract
- Big picture
  - Deeper neural networks are more difficult to train
    - degradation
  - Residual learning framework
    - easier to optimize
    - gains accuracy from considerably increased depth
- VGG net
  - a kind of convolutional neural network
  - 3×3 convolution filters
  - 2×2 pooling
- COCO
  - Microsoft dataset

# Introduction
- ImageNet dataset
- network depth is of crucial importance, and the leading results on ImageNet all exploit very deep models
  ```
  - Difference between Vanishing Gradient and Degradation
    - Vanishing Gradient
      - emerges earlier than degradation
      - as the network gets deeper, the gradient becomes too small for the network to keep learning
      - solved by...
        - normalization
        - ReLU activation function
    - Degradation
      - as the layers get deeper, the network gets a higher error rate
      - solved by...
        - Residual Network
  ```
- Vanishing/Exploding gradients
  - have been largely addressed by...
    - normalized initialization
    - intermediate normalization layers
- Degradation
  - with the network depth increasing, accuracy gets saturated and then degrades rapidly
  - not caused by overfitting
    - the deeper network also has low *training* accuracy, so it is not simply overfitting
  - indicates that not all systems are similarly easy to optimize
  - consider a shallower architecture and its deeper counterpart that adds more layers onto it
    - the existence of a constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart
      - the added layers are identity mappings
      - the remaining layers are copied from the learned shallower architecture
    - but experiments show that **current solvers are unable to find solutions** that are comparably good or better than the constructed solution, or are **unable to do so in feasible time**
- Deep residual learning framework
  - addresses the degradation problem
  - let these layers fit a residual mapping instead of the desired underlying mapping
    - desired underlying mapping: $H(x)$
    - the stacked nonlinear layers fit another mapping $F(x) := H(x) - x$
    - recast the original underlying mapping as $H(x) = F(x) + x$
    - ![](https://i.imgur.com/cKINrkM.png)
  - can be realized as feedforward neural networks with shortcut connections
    - adds neither extra parameters nor computational complexity
  - hypothesis
    - it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping: in the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers
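  - a minimal PyTorch sketch of this $F(x) + x$ block (a hypothetical illustration, not the authors' code; the class name, channel count, and two-layer structure are assumptions):

    ```python
    import torch
    import torch.nn as nn


    class BasicResidualBlock(nn.Module):
        """The stacked layers learn F(x) = H(x) - x; the shortcut adds x back, so the block outputs H(x) = F(x) + x."""

        def __init__(self, channels):
            super().__init__()
            # Two 3x3 conv layers form the residual function F(x).
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            # Identity shortcut: adds no extra parameters and no extra computation.
            return self.relu(out + x)


    # Usage: the block keeps the spatial size and channel count unchanged.
    block = BasicResidualBlock(channels=64)
    y = block(torch.randn(1, 64, 56, 56))  # output shape stays (1, 64, 56, 56)
    ```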
- experiments on ImageNet
  - our extremely deep residual nets
    - easy to optimize
    - can easily enjoy accuracy gains from greatly increased depth
  - the counterpart plain nets
    - higher training error when the depth increases
  - achieve excellent results

# Related Work
- Residual Representations
  - powerful shallow representations (more effective)
    - VLAD
      - encodes by residual vectors
    - Fisher Vector
      - a probabilistic version of VLAD
- Shortcut Connections

# Deep Residual Learning

## Residual Learning
- origin of $H(x)$
  - $H(x) = F(x) + x$
  - a deeper network should perform no worse than its shallower counterpart, since the extra layers can degenerate to identity mappings

## Identity Mapping by Shortcuts
- $y = F(x, \{W_i\}) + x$
- problem
  - the dimensions of $F$ may differ from those of $x$
- solution
  - use a projection $W_s$ to change the dimensions of $x$ (see the sketch at the end of this note)
  - the equation becomes
    - $y = F(x, \{W_i\}) + W_s x$
- if $F$ has only a single layer, the block is similar to a linear layer, $y = W_1 x + x$, for which no advantages were observed

## Network Architectures
- plain network
  - based on VGG nets
  - basic design rule
    - for the same output feature map size, the layers have the same number of filters
- residual network
  - based on the plain network
  - add shortcut connections to turn it into the residual counterpart
  - identity shortcuts do not add parameters to the network
  - use a projection matrix to match the dimensions of $F(x)$ and $x$ when they differ

## Implementation
- batch normalization
- SGD
  - learning rate: 0.1
  - weight decay of 0.0001 and momentum of 0.9
- **dropout is not used**, because the fully connected layer accounts for only a small fraction of the parameters (about 0.01%)

# Experiments

## ImageNet Classification
- results
  - plain network
    - ![](https://i.imgur.com/VKIhDk2.png)
  - residual network
    - ![](https://i.imgur.com/haKQUSZ.png)
- conclusions
  - the error rate goes down as the number of layers in the residual network increases; by contrast, the error rate goes up as the number of layers in the plain network increases
  - in the early stage of training, the residual network converges faster than the plain network

## CIFAR-10 and Analysis

## Object Detection on PASCAL and MS COCO

# Reference
- [VGG net](https://dgschwend.github.io/netscope/#/preset/vgg-16)
- [Vanishing Gradient vs Degradation](https://medium.com/@shaoliang.jia/vanishing-gradient-vs-degradation-b719594b6877)
- [Paper Reading: Deep Residual Learning for Image Recognition](https://zhuanlan.zhihu.com/p/47199669)
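As mentioned under *Identity Mapping by Shortcuts*, a projection $W_s$ is used when the dimensions of $F(x)$ and $x$ differ. Below is a minimal PyTorch sketch of such a block, followed by the SGD settings listed under *Implementation*; the class name, channel counts, and stride are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn


class ProjectionResidualBlock(nn.Module):
    """Residual block whose shortcut is a projection W_s (a strided 1x1 convolution)."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        # F(x, {W_i}): two 3x3 conv layers; the first changes channels and spatial size.
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # W_s: a strided 1x1 convolution that matches the dimensions of x to F(x).
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1,
                      stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # y = F(x, {W_i}) + W_s x
        return self.relu(out + self.shortcut(x))


# Training setup using the hyperparameters listed in the Implementation section.
model = ProjectionResidualBlock(in_channels=64, out_channels=128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
```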