# A brief introduction to CNN - convolutional neural network

NOTE: this is for a non-profit knowledge-sharing purpose only.

![](https://hackmd.io/_uploads/rkO495nLh.png)
(from https://developersbreach.com/convolution-neural-network-deep-learning/)

Computer vision is the earliest and biggest success story of deep learning. It is now common everywhere, for example:

- Google Photos
- Google image search
- iPhone camera
- YouTube
- Video filters
- OCR software
- Autonomous driving
- Robotics
- AI-assisted medical diagnostics
- Autonomous retail checkout systems
- Autonomous farming
- ...

The concept of using CNNs for computer vision existed at least about 25 years ago (KSW: not sure if there were other pioneers even before then). Yet CNNs for computer vision were not widely accepted until around 2016, and before then they faced intense skepticism.

![](https://hackmd.io/_uploads/S1biU4nUh.png)
(from Yann LeCun et al 1998, https://ieeexplore.ieee.org/document/726791)

## KEY CONCEPT: the visual world is fundamentally spatially hierarchical

![](https://hackmd.io/_uploads/rknXRQn8n.png)
(from Ref.1)

A cat image can be broken down into fundamental visual components such as line segments and patches. These fundamental components can be used to assemble more complicated structures such as eyes, nose, ears, etc. These structures can in turn be used to assemble even more complicated structures such as the face, body, legs, etc. This process keeps going until a full cat image is constructed. A cat may be *considered* as a cat by our brain even when we see only parts of it (eg eyes and/or nose and/or ears).

## Convnet

The convolutional neural network, also known as a *convnet*, is a type of deep learning model and is now used almost universally in computer vision applications.

### How does it work in keras?

Instantiating a small convnet for the MNIST dataset:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
```

Visually, it looks *similar* to

![](https://hackmd.io/_uploads/H1eSY72U3.png)
(from https://www.researchgate.net/figure/The-overall-architecture-of-the-Convolutional-Neural-Network-CNN-includes-an-input_fig4_331540139 with some edits)

An exact table view looks like

![](https://hackmd.io/_uploads/Bkw5tmhIh.png)
(from Ref.1)

(Counting the numbers of free parameters: `320 = 1x32x9 + 32`, `18496 = 32 (previous filters) x 64 (new filters) x 9 (3x3 kernel) + 64 (biases)`, `73856 = 64x128x9 + 128`; the flattened vector has `1152 = 3x3x128` entries, so the final Dense layer has `11530 = 1152x10 + 10` parameters.)

A basic convnet is a stack of `Conv2D` and `MaxPooling2D` layers. We can see that the output of every `Conv2D` and `MaxPooling2D` layer is a rank-3 tensor of shape `(height, width, channels)`. The width and height dimensions tend to *shrink* as we go deeper in the model. The number of channels (aka filters) is controlled by the first argument passed to the `Conv2D` layers (ie 32, 64, or 128).

After the last `Conv2D` layer, we end up with an output of shape `(3, 3, 128)` - a 3x3 feature map of 128 channels. The next step is to feed this output into a densely connected classifier (a stack of `Dense` layers) like those we talked about last time. These classifiers process vectors, which are 1D, whereas the current output is a rank-3 tensor. To bridge the gap, we flatten the 3D outputs to 1D with a `Flatten` layer before adding the `Dense` layer. Then we do 10-way classification in the last layer.

Trained on the MNIST dataset, this CNN gives us an accuracy of 99.1%. A simple densely connected network gives 97.8%.
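For reference, here is a minimal sketch of how the model above could be trained and evaluated. The optimizer, loss, epoch count, and batch size are typical choices for this dataset, not values taken from the text.

```python
from tensorflow.keras.datasets import mnist

# Load MNIST, add the channels axis, and scale pixel values to [0, 1].
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)).astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype("float32") / 255

# `model` is the convnet defined above; the settings below are typical, not prescribed by the text.
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.3f}")
```

Sparse categorical cross-entropy is used here because the MNIST labels are plain integers (0-9) rather than one-hot vectors.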
### KEY STEP: the convolution operation

The fundamental difference between a densely connected layer and a convolution layer is this: `Dense` layers learn *global* patterns in their input feature space (for example, for a MNIST digit, patterns involving *all pixels*), whereas convolution layers learn *local* patterns - in the case of images, patterns found in small 2D windows (ie the convolution kernel) of the inputs. In the previous example, these windows were all 3x3.

![](https://hackmd.io/_uploads/SyXAc_3Ln.png)
(from Ref.1)

![](https://hackmd.io/_uploads/r1f2amnLh.png)
(from Ref.1)

![](https://hackmd.io/_uploads/rkqRqO3U3.png)
(from Ref.1)

This key characteristic gives convnets two interesting properties:

* *The patterns they learn are translation-invariant.* After learning a certain pattern in the lower-right corner of a picture, a convnet can recognize it anywhere: for example, in the upper-left corner. A densely connected model would have to learn the pattern anew if it appeared at a new location. This makes convnets data-efficient when processing images (because the visual world is fundamentally translation-invariant): they need fewer training samples to learn representations that have generalization power.
* *They can learn spatial hierarchies of patterns.* A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layer, and so on. This allows convnets to efficiently learn increasingly complex and abstract visual concepts, because *the visual world is fundamentally spatially hierarchical*.

![](https://hackmd.io/_uploads/rknXRQn8n.png)
(from Ref.1)

Convolutions operate over rank-3 tensors called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue. For a black-and-white picture, like the MNIST digits, the depth is 1 (levels of gray).

The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. This output feature map is still a rank-3 tensor: it has a width and a height. Its depth can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors as in RGB input; rather, they stand for *filters*. *Filters encode specific aspects of the input data*: at a high level, a single filter could encode the concept "presence of a face in the input," for instance.

In the MNIST example, the first convolution layer takes a feature map of size `(28, 28, 1)` and outputs a feature map of size `(26, 26, 32)`: it computes 32 filters over its input. Each of these 32 output channels contains a 26x26 grid of values, which is a response map of the filter over the input, indicating the response of that filter pattern at different locations in the input.
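To make the idea of a response map concrete, here is a minimal NumPy sketch of the sliding-window computation for a single filter on a single-channel input. This is for illustration only (it is not how Keras implements convolution internally), and the input and kernel values are random placeholders.

```python
import numpy as np

def response_map(image, kernel):
    """Slide a 2D kernel over a 2D image (stride 1, no padding) and
    return the map of dot products between the kernel and each patch."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # extract the local window
            out[i, j] = np.sum(patch * kernel)  # response of the filter at (i, j)
    return out

image = np.random.rand(28, 28)   # placeholder single-channel "image"
kernel = np.random.rand(3, 3)    # placeholder 3x3 filter (learned in a real convnet)
print(response_map(image, kernel).shape)  # (26, 26)
```

With a 28x28 input and a 3x3 window, there are only 26x26 valid window positions, which is exactly why the first `Conv2D` layer's output shrinks from 28x28 to 26x26.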
![](https://hackmd.io/_uploads/rkw3JEnI2.png)
(from Ref.1)

That is what the term feature map means: every dimension in the depth axis is a feature (or filter), and the rank-2 tensor output `[:, :, n]` is the 2D spatial map of the response of this filter over the input.

Kernel examples:

![](https://hackmd.io/_uploads/rJ1uyHn83.jpg)
(from https://pylessons.com/CNN-tutorial-introduction#)

A convolution works by sliding these windows of size 3x3 or 5x5 over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features (shape `(window_height, window_width, input_depth)`). Each such 3D patch is then transformed into a 1D vector of shape `(output_depth,)`, which is done via a tensor product with a learned weight matrix, called the convolution kernel (the same kernel is reused across every patch). All of these vectors (one per patch) are then spatially reassembled into a 3D output map of shape `(height, width, output_depth)`.

![](https://hackmd.io/_uploads/HkWcbV3Un.png)

### Padding and strides

How padding works:

![](https://hackmd.io/_uploads/HJOlMV3Ln.png)
![](https://hackmd.io/_uploads/HkQzGE38n.png)
(from Ref.1)

How strides work:

![](https://hackmd.io/_uploads/HJ5zM4383.png)
(from Ref.1)

Strided convolutions are rarely used in classification models, but they come in handy for some other types of models. In classification models, instead of strides, we tend to use the *max-pooling* operation to downsample feature maps.

### KEY STEP: the max-pooling operation

In the convnet example, we may have noticed that the size of the feature maps is halved after every `MaxPooling2D` layer. For instance, before the first `MaxPooling2D` layer, the feature map is 26x26, but the max-pooling operation halves it to 13x13. That's the role of max pooling: *to aggressively downsample feature maps, much like strided convolutions*.

Max pooling consists of extracting windows from the input feature maps and outputting the max value of each channel. It's conceptually similar to convolution, except that instead of transforming local patches via a learned linear transformation (the convolution kernel), they're transformed via a hardcoded `max` tensor operation. A big difference from convolution is that max pooling is usually done with 2x2 windows and stride 2, in order to downsample the feature maps by a factor of 2. Convolution, on the other hand, is typically done with 3x3 windows and no stride (stride 1).

### Why downsample feature maps this way?

Why not remove the max-pooling layers and keep fairly large feature maps all the way up? If we remove the max-pooling layers, our convnet would look like

![](https://hackmd.io/_uploads/HJjsX42Lh.png)
(from Ref.1)

What's wrong with this setup?

* It isn't conducive to learning a spatial hierarchy of features. The 3x3 windows in the third layer will only contain information coming from 7x7 windows in the initial input. The high-level patterns learned by the convnet will still be very small with regard to the initial input, which may not be enough to learn to classify digits (try recognizing a digit by only looking at it through windows that are 7x7 pixels!). We need the features from the last convolution layer to contain information about the totality of the input.
* The final feature map has 22x22x128 = 61,952 total coefficients per sample. This is huge. When we flatten it to stick a `Dense` layer of size 10 on top, that layer would have over half a million parameters (see the sketch after this list). This is far too large for such a small model and would result in intense overfitting.
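As a quick sanity check of those numbers, the following sketch rebuilds the no-pooling variant (the filter counts 32/64/128 are assumed to mirror the figure above) and counts the parameters of its final `Dense` layer:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical no-pooling variant: the same Conv2D stack as before, but without MaxPooling2D.
inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)  # 22 x 22 x 128 = 61,952 values per sample
outputs = layers.Dense(10, activation="softmax")(x)
model_no_pool = keras.Model(inputs=inputs, outputs=outputs)

# Parameters of the final Dense layer: 61,952 x 10 weights + 10 biases = 619,530.
print(model_no_pool.layers[-1].count_params())
```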
In short, the reason to use downsampling is to reduce the number of feature-map coefficients to process, as well as to *induce spatial-filter hierarchies* by making successive convolution layers look at increasingly large windows (in terms of the fraction of the original input they cover).

Note that max pooling isn't the only way to achieve such downsampling. We can also use strides in the prior convolution layer, or use average pooling instead of max pooling, where each local input patch is transformed by taking the average value of each channel over the patch rather than the max. But max pooling tends to work better than these alternatives. The reason is that features tend to encode the spatial presence of some pattern or concept over the different tiles of the feature map (hence the term feature map), and it's more informative to look at the maximal presence of different features than at their average presence.

The most reasonable subsampling strategy is therefore to first produce dense maps of features (via unstrided convolutions) and then look at the maximal activation of the features over small patches, rather than looking at sparser windows of the inputs (via strided convolutions) or averaging input patches, which could cause you to miss or dilute feature-presence information.

## References

1. https://www.manning.com/books/deep-learning-with-python-second-edition (chapter 8)
2. https://poloclub.github.io/cnn-explainer/
3. https://saturncloud.io/blog/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way/
4. https://pylessons.com/CNN-tutorial-introduction#
5. https://towardsdatascience.com/convolutional-neural-network-feature-map-and-filter-visualization-f75012a5a49c