---
title: ConvNet Evolutions, Architectures, Implementation Details and Advantages.
authors: Chris Ick, Soham Tamba, Ziyu Lei, Hengyu Tang
date: 14 February 2020
---
## [Proto-CNNs and Evolution to Modern CNNs](00:56:40-1:11:20)
### [Proto-Convolutional Neural Nets on Small Dataset](00:56:40-00:58:15)
Inspired by Fukushima's work on visual cortex modeling, the simple/complex cell hierarchy combined with supervised training and backpropagation led to the development of the first ConvNets at the University of Toronto in '88-'89 by Prof. Yann LeCun. The experiments used a small dataset of 320 'mouse-written' digits. The performance of the following architectures was compared:
1. Single FC (fully connected) layer
2. Two FC Layers
3. Locally Connected Layers w/o shared weights
4. Constrained network w/ shared weights and local connections
5. Constrained network w/ shared weights and local connections 2 (more feature maps)
The most successful networks (the constrained networks with shared weights) had the strongest generalizability, and they form the basis for modern CNNs. Meanwhile, the single FC layer tends to overfit.
### [First "Real" ConvNets at Bell Labs](00:58:15-00:59:39)
After moving to Bell Labs, LeCun's research shifted to using handwritten zip codes from the US Postal Service to train a larger CNN (see the sketch after the list below):
* 256 (16x16) input layer
* 12 5x5 kernels with stride 2 (the kernel steps 2 pixels at a time): the next layer has lower resolution
* **NO** separate pooling
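In modern PyTorch terms, the downsampling comes from the stride of the convolution itself. Below is a minimal sketch of this idea; the layer sizes follow the bullet points above, while the actual 1989 network differed in its details:

```python
import torch
import torch.nn as nn

# 12 feature maps produced by 5x5 kernels that step 2 pixels at a time.
# The stride does the downsampling, so no separate pooling layer is needed.
conv = nn.Conv2d(in_channels=1, out_channels=12, kernel_size=5, stride=2)

x = torch.randn(1, 1, 16, 16)   # one 16x16 grey-scale zip code digit
print(conv(x).shape)            # torch.Size([1, 12, 6, 6]): lower resolution
```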
### [Convolutional Network Architecture w/ Pooling](00:59:39-1:05:52)
The next year, some changes were made: separate pooling was introduced. Pooling is done by averaging the input values, adding a bias, and passing the result through a nonlinear function (the hyperbolic tangent). The 2x2 pooling was performed with a stride of 2, hence reducing the resolution by half.
<center>
<img src="detailed_convNet.png" width="600px" /><br>
<b>Fig. 1</b> ConvNet Architecture
</center>
An example of a single convolutional layer would be as follows:
1. Take an input with size *32x32*
2. The convolution layer passes a 5x5 kernel with stride 1 over the image, resulting in a feature map of size *28x28*
3. Pass the feature map to a nonlinear function: size *28x28*
4. Pass to the pooling layer that averages over a 2x2 window with stride 2: size *14x14*
5. Repeat 1-4 for 4 kernels
The first-layer convolution/pool combinations usually detect simple features, such as oriented edges. After the first convolution/pool layer, the objective is to detect combinations of features from previous layers. To do this, steps 2 to 4 are repeated with multiple kernels over the previous layer's feature maps, and the results are summed into a new feature map:
8. A new 5x5 kernel is slid over all the feature maps from previous layers, and the results are summed up. (Note: in Prof. LeCun's 1989 experiment the connections were not full, for computational reasons; modern settings usually enforce full connections): size *10x10*
9. Pass the output of the convolution to a nonlinear function: size *10x10*
10. Repeat steps 8/9 for 16 kernels.
11. Pass the result to a pooling layer that averages over a 2x2 window with stride 2: size *5x5* for each feature map
To generate the output, a final convolutional layer is applied; it looks like a set of full connections but is in fact convolutional.
12. The final convolution layer slides a 5x5 kernel over all feature maps, with results summed up: size *1x1*
13. Pass through nonlinear function: size *1x1*
14. Generate the single output for one category.
15. Repeat steps 12-14 for each of the 10 categories (in parallel), as sketched in the code below.
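Putting the steps together, here is a minimal PyTorch sketch of the pipeline that tracks the feature-map sizes listed above. The kernel counts (4, 16, 10) follow the steps; note that `nn.Conv2d` connects every input map to every output map, whereas the 1989 network used sparser connections, and the bias-plus-tanh pooling described earlier is simplified to a plain average pool here:

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=5),    # 32x32 -> 4 maps of 28x28   (steps 1-2, 5)
    nn.Tanh(),                         # nonlinearity               (step 3)
    nn.AvgPool2d(2, stride=2),         # 28x28 -> 14x14             (step 4)
    nn.Conv2d(4, 16, kernel_size=5),   # 14x14 -> 16 maps of 10x10  (steps 8, 10)
    nn.Tanh(),                         # nonlinearity               (step 9)
    nn.AvgPool2d(2, stride=2),         # 10x10 -> 5x5               (step 11)
    nn.Conv2d(16, 10, kernel_size=5),  # 5x5 -> 10 maps of 1x1      (steps 12, 14-15)
    nn.Tanh(),                         # nonlinearity               (step 13)
)

x = torch.randn(1, 1, 32, 32)
print(net(x).shape)                    # torch.Size([1, 10, 1, 1]): one score per category
```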
See [this animation](http://cs231n.github.io/convolutional-networks/) on Andrej Karpathy's website for how convolutions change the shape of the next layer's feature maps. The full paper can be found [here](https://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation-network.pdf).
### [Shift Equivariance](1:05:52-1:09:25)
<center>
<img src="shift_invariance.gif" width="300px" /><br>
<b>Fig. 2</b> Shift Equivariance
</center>
As demonstrated by the animation on the slides (here's another example), translating the input image results in the same translation of the feature maps. However, the changes in the feature maps are scaled by the convolution/pooling operations. E.g. the 2x2 pooling with stride 2 will reduce a 1-pixel shift in the input layer to a 0.5-pixel shift in the following feature maps. Spatial resolution is thus exchanged for an increased number of feature types, i.e. the representation becomes more abstract and less sensitive to shifts and distortions.
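This can be checked numerically with a toy example. In the hedged sketch below, a small bright blob on a black background is shifted right by 2 pixels, and the feature maps before and after pooling are compared:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, bias=False)
pool = nn.AvgPool2d(2, stride=2)

# A small bright blob, and the same blob shifted right by 2 pixels.
x = torch.zeros(1, 1, 8, 8)
x[..., 3:5, 2:4] = 1.0
x_shift = torch.zeros(1, 1, 8, 8)
x_shift[..., 3:5, 4:6] = 1.0

y, y_shift = conv(x), conv(x_shift)
# The feature map contains the same pattern, translated by the same 2 pixels.
print(torch.allclose(y_shift[..., :, 2:], y[..., :, :-2]))   # True

# After 2x2 pooling with stride 2, the 2-pixel input shift becomes a 1-pixel shift.
p, p_shift = pool(y), pool(y_shift)
print(torch.allclose(p_shift[..., :, 1:], p[..., :, :-1]))   # True
```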
### [Overall Architecture Breakdown](1:09:25-1:11:20)
A generic CNN architecture can be broken down into several basic layer archetypes:
* Normalization layers - adjusting whitening (optional)
* Subtractive methods, e.g. average removal, high-pass filtering
* Divisive: local contrast normalization, variance normalization
* Filter Banks: increasing dimensionality, finding edges, etc.
* Non-linearities: sparsification
* Typically Rectified Linear Unit (ReLU): $\text{ReLU}(x) = \text{max}(x, 0)$
* Pooling: aggregating over a feature map:
* $\text{MAX}= \text{Max}_i(X_i)$
* $L_p = \left(\sum_{i=1}^n |X_i|^p\right)^{1/p}$
* $\text{Prob}= \frac{1}{b} \log\left(\sum_{i=1}^n e^{b X_i} \right)$
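These pooling operations can be written directly on the activations inside one pooling window. A small sketch follows; here $b$ is the hyperparameter of the log-prob pooling, and a large $b$ makes it approach the max:

```python
import torch

x = torch.tensor([0.5, 2.0, -1.0, 0.2])   # activations inside one pooling window
p, b = 2.0, 1.0

max_pool  = x.max()                                    # MAX pooling
lp_pool   = x.abs().pow(p).sum().pow(1 / p)            # L_p pooling (here p = 2)
prob_pool = (1 / b) * torch.logsumexp(b * x, dim=0)    # log-prob pooling

print(max_pool.item(), lp_pool.item(), prob_pool.item())
```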
## [LeNet5 and Digit Recognition](1:11:20-1:27:58)
### [Implementation of LeNet5 in PyTorch](1:11:20-1:16:50)
LeNet5 consists of the following layers (1 being the top-most layer):
1. Log-softmax
2. Fully connected layer of dimensions 500x10
3. ReLU
4. Fully connected layer of dimensions (5x5x20)x500
5. Max Pooling of dimensions 2x2, stride of 2.
6. ReLU
7. Convolution with 20 output channels, 5x5 kernel, stride of 1.
8. Max Pooling of dimensions 2x2, stride of 2.
9. ReLU
10. Convolution with 20 output channels, 5x5 kernel, stride of 1.
The input is a 32x32 gray scale image (1 input channel).
LeNet5 can be implemented in PyTorch with the following code:
```python
import torch.nn as nn
import torch.nn.functional as F


class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 20, 5, 1)
        self.fc1 = nn.Linear(5*5*20, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))    # 32x32 -> 28x28, 20 channels
        x = F.max_pool2d(x, 2, 2)    # 28x28 -> 14x14
        x = F.relu(self.conv2(x))    # 14x14 -> 10x10, 20 channels
        x = F.max_pool2d(x, 2, 2)    # 10x10 -> 5x5
        x = x.view(-1, 5*5*20)       # flatten to a 500-dimensional vector
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
```
Although fc1 and fc2 are fully connected layers, they can be thought of as convolutional layers whose kernels cover the entire input. Fully connected layers are used for efficiency purposes.
The same network can be expressed using `nn.Sequential`, but that style is considered old-fashioned.
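For reference, here is a hedged sketch of the same network written with `nn.Sequential`, using the layer sizes assumed above and `nn.Flatten` standing in for the `view` call:

```python
import torch.nn as nn

lenet5_seq = nn.Sequential(
    nn.Conv2d(1, 20, 5, 1), nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(20, 20, 5, 1), nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(5*5*20, 500), nn.ReLU(),
    nn.Linear(500, 10),
    nn.LogSoftmax(dim=1),
)
```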
## [Advantages of CNN](1:16:50-1:20:21)
In a fully convolutional network, there is no need to specify the size of the input. However, changing the size of the input changes the size of the output.
Consider a cursive handwriting recognition system. We do not have to break the input image into segments: we can apply the CNN over the entire image, and the kernels will cover all locations and record the same output regardless of where the pattern is located. Applying the CNN over an entire image is much cheaper than applying it at multiple locations separately. No prior segmentation is required, which is a relief, because the task of segmenting an image is about as hard as recognizing one.
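To see why no input size needs to be fixed, note that a stack made only of convolutions and poolings accepts any sufficiently large input and simply produces a larger output map for a larger image. A quick sketch of this behaviour, reusing the convolutional part of the LeNet5 sizes assumed above:

```python
import torch
import torch.nn as nn

# The convolutional/pooling part of LeNet5: nothing here fixes the input size.
features = nn.Sequential(
    nn.Conv2d(1, 20, 5, 1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(20, 20, 5, 1), nn.ReLU(), nn.MaxPool2d(2, 2),
)

print(features(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 20, 5, 5])
print(features(torch.randn(1, 1, 32, 64)).shape)   # torch.Size([1, 20, 5, 13]): a wider image gives a wider output
```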
### [Example: MNIST](1:20:21-1:27:58)
LeNet5 is trained on MNIST images of size 32x32 to classify individual digits in the center of the image. Data augmentation was applied by shifting the digits around, changing their size, and inserting digits at the side. The network was also trained with an 11<sup>th</sup> category which represented none of the above. Images labelled with this category were generated either by producing blank images, or by placing digits at the side but not in the center.
<center>
<img src="feature_binding.gif" width="300px" /><br>
<b>Fig. 3</b> Sliding Window ConvNet
</center>
The above image demonstrates that a LeNet5 network trained on 32x32 can be applied on a 32x64 input image to recognise the digit at multiple locations.
### [Feature Binding Problem](1:27:58-1:31:11)
#### What is the feature binding problem?
Visual neuroscientists and computer vision researchers face the problem of defining an object as an object. An object is a collection of features, but how do we bind all of these features together to form the object?
#### How to solve it?
We can solve this feature binding problem using a very simple ConvNet: just two layers of convolutions with pooling plus another two fully connected layers, with no explicit mechanism for binding, provided that we have enough non-linearities and data to train the ConvNet.
<center>
<img src="feature_binding.gif" width="300px" /><br>
<b>Fig. 4</b> ConvNet Addressing Feature Binding
</center>
The above animation showcases the ability of the ConvNet to recognize different digits by moving a single stroke around, demonstrating its ability to address the feature binding problem, i.e. to recognize features in a hierarchical, compositional way.
### [Example: Dynamic Input Length](1:31:11-1:40:34)
We can build a ConvNet with 2 convolution layers with stride 1 and 2 pooling layers with stride 2 such that the overall stride is 4. Thus, if we want to get a new output, we need to shift our input window by 4. To be more explicit, we can see the figure below (green units). First, we have an input of size 10, and we perform convolution of size 3 to get 8 units. After that, we perform pooling of size 2 to get 4 units. Similarly, we repeat convolution and pooling again and eventually we get 1 output.
<center>
<img src="example.jpg" width="600px" /><br>
<b>Fig. 5</b> ConvNet Architecture On Variant Input Size Binding
</center>
Let’s assume we add 4 units at the input layer (pink units above), so that we get 4 more units after the first convolution layer, 2 more units after the first pooling layer, 2 more units after the second convolution layer, and 1 more unit at the output. Therefore, the window size needed to generate a new output is 4 (the two stride-2 poolings give an overall stride of 4). Moreover, this demonstrates that if we increase the size of the input, we increase the size of every layer.
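A minimal 1D sketch of this behaviour follows; the layer sizes mirror the figure, with convolutions of size 3 and stride 1 and poolings of size 2 and stride 2, giving an overall stride of 4:

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv1d(1, 1, kernel_size=3, stride=1), nn.MaxPool1d(2, stride=2),
    nn.Conv1d(1, 1, kernel_size=3, stride=1), nn.MaxPool1d(2, stride=2),
)

print(net(torch.randn(1, 1, 10)).shape)   # torch.Size([1, 1, 1]): one output
print(net(torch.randn(1, 1, 14)).shape)   # torch.Size([1, 1, 2]): 4 extra inputs give one extra output
```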
## [What are ConvNets Good For](1:40:34-1:45:40)
ConvNets are good for signals that come to you in the form of multidimensional arrays. They have two major characteristics:
1. **Locality**: The first is that there is a strong local correlation between values. If we take two nearby pixels of a natural image, those pixels are very likely to have the same color. As two pixels get further apart, the similarity between them decreases. The local correlations can help us detect local features, which is exactly what ConvNets do. If we feed a ConvNet images with permuted pixels, it will not perform well at recognizing them, while an FC network will not be affected. The local correlation justifies local connections.
2. **Stationarity**: The second is that features are essential and can appear anywhere in the image, which justifies the shared weights and pooling. Moreover, statistical signals are uniformly distributed, which means we need to repeat the feature detection at every location.
Furthermore, people make good use of ConvNets for video, image, text, and speech recognition.