The implementation uses PyTorch as the framework. To see the full implementation, please refer to this repository.
Also, if you want to read other "Summary and Implementation" posts, feel free to check them out on my blog.
I) Summary
It's important to understand that the main problem here is the difficulty of optimizing a deep network rather than its lack of ability to learn features.
Feature learning (or representation learning) is the ability to find a transformation that maps raw data into a representation better suited to a machine learning task (e.g., classification).
1) Problem
Intuitively, the more layers we have, the better the accuracy will be.
So if we take a shallow network that performs well, copy its layers, and stack them to make the model deeper, we would expect the deep network to perform at least as well as its shallower counterpart.
Surprisingly, as we go deeper, accuracy increases up to a saturation point and then begins to degrade.
Unexpectedly, such degradation is not caused by overfitting, and making the network even deeper leads to a higher training error.
Thus, the deep network performs worse than the shallow network.
One possible explanation could be that the deep network suffered from the vanishing gradient problem.
However, it can mostly be fixed with batch normalization and normalized initializations.
A second explanation could be that the deep network wasn't able to learn the identity function.
Indeed, it could at least perform exactly like the shallow network by just "learning nothing" (remember, the deep network was built by copying and stacking the layers of the shallow network).
But the fact that it wasn't able to match the shallow network means it has trouble "learning nothing", i.e., learning the identity function!
This raises a new question: is learning better networks as easy as stacking more layers?
2) Solution
The solution to this problem is to use a residual module, so that adding more layers does not cause performance degradation.
A residual module is composed of:
a sequence of convolutions, batch normalizations, and ReLU activations,
a residual connection that carries the input unchanged.
We then combine the residual connection with the output of the sequence through addition.
Suppose $y = F(x) + x$, where $F(x)$ is the output of the convolutional sequence and $x$ is the input carried by the residual connection. If the deep network wants to learn the identity function, it just has to rely on the residual connection and set $F(x)$ to 0!
It is always easier for a sequence of layers to fit zero than to fit the identity function, so the proposed structure is easier to train and ensures that a deeper network will be at least as good as its shallower counterpart (a neutral-or-better characteristic).
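As an illustration, here is a minimal sketch of such a residual module in PyTorch. The name `ResidualBlock`, the two 3x3 convolutions, and the assumption that input and output have the same number of channels are simplifications for clarity, not the exact blocks used in the repository:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual module computing y = F(x) + x.

    Assumes the input and output have the same number of channels,
    so the identity connection can be added directly.
    """
    def __init__(self, channels):
        super().__init__()
        # F(x): conv -> batch norm -> ReLU -> conv -> batch norm
        self.residual_function = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Combine F(x) with the skip connection x through addition.
        return self.relu(self.residual_function(x) + x)
```

For example, `ResidualBlock(64)` maps a tensor of shape `(N, 64, H, W)` to a tensor of the same shape, so such blocks can be chained freely.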
The residual connection is also called a skip connection because it gives the information a chance to skip the function located within the residual module.
Skip connections provide a clear path for gradients to backpropagate to the early layers of the network, which speeds up learning by mitigating the vanishing gradient problem.
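A toy autograd sketch (a hypothetical example, not code from the repository) shows the intuition: since $y = F(x) + x$, the gradient reaching $x$ always contains an identity term, no matter how poorly $F$ passes gradients.

```python
import torch

# Hypothetical "residual function" F whose gradient is tiny,
# mimicking a block that would normally choke gradient flow.
def F(t):
    return 1e-6 * t ** 2

x = torch.ones(3, requires_grad=True)

y = (F(x) + x).sum()   # with the skip connection
y.backward()
print(x.grad)          # ~1 everywhere: the identity path keeps the gradient alive

x.grad = None
z = F(x).sum()         # without the skip connection
z.backward()
print(x.grad)          # ~0 everywhere: the gradient nearly vanishes
```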
However, the trade-off is that residual networks are more prone to overfitting.
It seems that residual modules are most beneficial for very deep networks and could even hurt the performance of very shallow networks if employed improperly.
When several residual modules are stacked, a residual network can be thought of as a combination, or ensemble, of many shallower networks.
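To make the stacking concrete, here is a short sketch that chains the hypothetical `ResidualBlock` from the earlier snippet (so it assumes that class is defined); at every block the signal can flow through $F$ or around it via the skip connection, which is what gives rise to the many shallower effective paths:

```python
import torch
import torch.nn as nn

# Stack residual modules with the same channel count; each block can fall back
# to (approximately) the identity, so extra depth should not degrade performance.
def make_residual_stack(channels, num_blocks):
    return nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])

stack = make_residual_stack(channels=64, num_blocks=8)
x = torch.randn(1, 64, 32, 32)
print(stack(x).shape)  # torch.Size([1, 64, 32, 32])
```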