# Convolutional Neural Networks

## 1. Convolution Operation
### 1.1 Motivation
Disadvantages of a plain (fully connected) neural network for image classification :
1. Too many trainable variables.
2. Too much redundant computation.
For disadvantage 1, consider a neural network as follows :

Total number of trainable variables is $120000 \times 5000 + 5000 + 5000 \times 10 + 10 = 600,055,010$
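A quick check of this arithmetic (assuming an input of 120,000 values, e.g. a flattened $200 \times 200 \times 3$ image, one hidden layer with 5,000 neurons, and 10 output classes; these sizes are only illustrative) :
```python
# Hypothetical fully connected network: 120,000 inputs -> 5,000 hidden units -> 10 classes.
inputs, hidden, outputs = 120_000, 5_000, 10

# Each layer contributes (weights + biases).
params = (inputs * hidden + hidden) + (hidden * outputs + outputs)
print(params)  # 600055010
```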
For disadvantage 2, see the following figure :

Dotted lines mean that the value may not carry important information for the next neuron.
Too much redundant information (noise) hinders the neural network from extracting features.
### 1.2 Feature Extraction
#### 1.2.1 Local Features
Idea : how about extracting features from local regions in an image ?

<!--
1. Only need to pay attention to small regions
2. Shared features
-->
1. Maybe we can use some method to find local features in an image.
2. Then we can merge several local features into a high-level feature.
3. Finally, we can use these high-level features to do the classification.

#### 1.2.2 Filter
But how do we extract local features ?
We can use many **filters** to search for local features in an image.
Filter : a matrix (2D or 3D) which represents a specific pattern.

We can perform a dot product between a filter and a local region of an image.
- If the result of the dot product exceeds a threshold, we expect that the region contains a valid local feature. (Remember **bias** ?)

The filter performs the above operation on every local region of the image.
- We get an output matrix after the operation.
- The output matrix is called a **feature map**.


Next, we can apply the same filtering method to the feature maps to extract even higher-level features.

### 1.3 Convolution

Filter : an $H \times W \times C$ matrix ($H$ = filter height, $W$ = filter width, $C$ = number of input channels)

- In practice, we usually let $H = W$ in a 2D convolution layer.
- $H$ and $W$ are almost always odd numbers.
- [Why convolutions always use odd-numbers as filter_size](https://datascience.stackexchange.com/questions/23183/why-convolutions-always-use-odd-numbers-as-filter-size)

Convolution layer : a layer that contains many filters.
- We can use an $N \times H \times W \times C$ matrix to represent all filters in a convolution layer ($N$ = number of filters).

Filters are applied to the image (or input data) to generate feature maps.
- A convolution layer generates feature maps using all the filters in the layer.
We can use an equation to represent a convolution layer :
$$
Output = Activation(Convolution(Input, Filters) + Bias)
$$
- $Convolution(.)$ works as in the following figure :

- Each feature map gets one bias value added, so $Bias$ here is an $N$-dimensional vector.
- Assume $Input$ is an $I \times J \times C$ matrix, where $I$ = height, $J$ = width.
- We can expect $Output$ to be an $(I - H + 1) \times (J - W + 1) \times N$ matrix.
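A minimal numpy sketch of this layer for a single input (no stride or padding yet); the function name `conv2d` and all sizes are only illustrative :
```python
import numpy as np

def conv2d(x, filters, bias):
    """x: (I, J, C), filters: (N, H, W, C), bias: (N,) -> output: (I-H+1, J-W+1, N)."""
    N, H, W, C = filters.shape
    I, J, _ = x.shape
    out = np.zeros((I - H + 1, J - W + 1, N))
    for n in range(N):                                  # each filter yields one feature map
        for i in range(I - H + 1):
            for j in range(J - W + 1):
                patch = x[i:i + H, j:j + W, :]          # local region of the input
                out[i, j, n] = np.sum(patch * filters[n]) + bias[n]
    return out

x = np.random.rand(8, 8, 3)                             # I = J = 8, C = 3
filters = np.random.rand(4, 3, 3, 3)                    # N = 4, H = W = 3
bias = np.zeros(4)
y = np.maximum(conv2d(x, filters, bias), 0)             # Activation(Convolution(.) + Bias), with ReLU
print(y.shape)                                          # (6, 6, 4) = (I-H+1, J-W+1, N)
```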
After a convolution layer, we can send the output to the next convolution layer.
How do we define the filters in a convolution layer ?
- We randomly initialize all filters and biases and treat them as trainable variables.
- They are optimized with backpropagation.
Now we can compute the number of trainable variables in a convolution layer :
- Assume $H = W = 3$, $C = 3$, $N = 64$ ; then the number of trainable variables is $3 \times 3 \times 3 \times 64 + 64 = 1792$.
- Note that the number of parameters depends only on how the convolution layer is defined ; it is independent of the input size.
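A quick check with Keras (assuming TensorFlow is available) ; the count stays the same no matter what the input height and width are :
```python
import tensorflow as tf

# The parameter count of a Conv2D layer does not depend on the input height/width.
for size in (32, 224):
    layer = tf.keras.layers.Conv2D(filters=64, kernel_size=3)
    layer.build(input_shape=(None, size, size, 3))      # 3 input channels
    print(size, layer.count_params())                   # 1792 = 3*3*3*64 + 64 in both cases
```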
### 1.4 Presenting CNN in the form of DNN
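A convolution layer can be viewed as a sparsely connected, weight-sharing fully connected layer : each output value is the dot product of a flattened local patch with the flattened filter weights, plus a bias. A minimal numpy sketch of this view (all names and sizes are only illustrative) :
```python
import numpy as np

# One 3x3x3 filter applied to an 8x8x3 input, written as ordinary DNN math:
# unroll every local patch into a row vector and take a matrix-vector product.
x = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3)                              # shared weights (the filter)
b = 0.1                                                  # shared bias

patches = np.array([x[i:i + 3, j:j + 3, :].ravel()       # "im2col": shape (36, 27)
                    for i in range(6) for j in range(6)])
dense_out = patches @ w.ravel() + b                      # a fully connected computation
feature_map = dense_out.reshape(6, 6)                    # identical to the convolution output
```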


### 1.5 Stride
Until now, we have at least 3 hyperparameters : $N$, $H$, $W$
- **Notice** : $C$ is not a hyperparameter. $C$ must match the number of channels of the input data.
Now we focus on the step size with which a filter slides :
If we slide one step at a time :
- we get an $(I - H + 1) \times (J - W + 1) \times N$ output matrix.
If we slide two steps at a time :
- we get a $ceil((I - H + 1) / 2) \times ceil((J - W + 1) / 2) \times N$ output matrix.
If we slide $S$ steps at a time :
- we get a $ceil((I - H + 1)/S) \times ceil((J - W + 1)/S) \times N$ output matrix.

We define the **stride** as the step size of each slide.
We can use the stride to control the dimensions of the output matrix.
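A quick Keras check of the effect of the stride (assuming TensorFlow is available ; the sizes are only illustrative) :
```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))        # a batch with one 32x32 RGB image
for stride in (1, 2):
    conv = tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=stride)
    print(stride, conv(x).shape)            # (1, 30, 30, 64), then (1, 15, 15, 64)
```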
### 1.6 Zero-padding
You may find it a little hard to compute the output dimensions of a convolution layer.
So, we can add 0s around the input data to avoid dimension reduction.

First consider stride = 1 : zero-padding can keep the height and width of the output matrix the same as those of the input matrix.
Then consider stride > 1 : we can pad with 0s so that the output dimensions become $ceil(I/S) \times ceil(J/S) \times N$.
Advantages of zero-padding :
1. Easier to design networks.
2. Allows us to design deeper networks.
3. Padding actually improves performance by keeping border information.
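A quick comparison of "same" (zero-padded) and "valid" (no padding) convolutions in Keras (assuming TensorFlow is available) :
```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))
same  = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same")    # pads with zeros
valid = tf.keras.layers.Conv2D(64, 3, strides=2, padding="valid")   # no padding
print(same(x).shape, valid(x).shape)        # (1, 16, 16, 64) vs. (1, 15, 15, 64)
```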
## 2. Pooling Operation
### 2.1 Motivation

Image source : https://www.loveupets.com/report_detail_pc.php?id=417
1. Humans can recognize objects in a low-resolution image.
- If a 500 x 500 image is compressed to a 250 x 250 image, we can still recognize the objects in it.
2. Some values in feature maps may be redundant or noisy.
3. Computing high-dimensional feature maps requires more software and hardware resources.
### 2.2 Pooling
Idea : we can convert every $P \times P$ region of a feature map into a single value. The following figure shows the case $P = 2$.

Two ways to generate the value from a $P \times P$ matrix :
1. Average pooling : the value is the average of the matrix elements.
2. Max pooling : the value is the maximum element in the matrix.
- Max pooling is the most common pooling operation in convolutional neural networks.
Pooling layer : a layer that performs the pooling operation.
- There are no trainable variables here.
- There is no activation function either.
- The pooling operation is performed on each feature map separately ; feature maps are independent of each other.
### 2.3 Stride
In section 1, we saw that the stride affects the dimensions of the output matrix. So, if we want a pooling layer to reduce the dimensions of its output, we should use stride > 1.
- Commonly, the stride is set to 2.
We can expect the output dimensions to be roughly $(H/S) \times (W/S) \times N$, where $H$ and $W$ are now the height and width of the input feature maps.
- The output dimensions are (almost) independent of $P$.
### 2.4 Zero-padding
We can pad with 0s before a pooling operation.
- The output dimensions after the pooling layer will then be exactly $ceil(H/S) \times ceil(W/S) \times N$.
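A short Keras check (assuming TensorFlow is available) ; with "same" padding and stride 2, the output size is $ceil(H/S)$ regardless of $P$ :
```python
import tensorflow as tf

fmaps = tf.random.normal((1, 32, 32, 64))   # feature maps from a previous layer
for p in (2, 3):
    pool = tf.keras.layers.MaxPooling2D(pool_size=p, strides=2, padding="same")
    print(p, pool(fmaps).shape)             # (1, 16, 16, 64) for both pool sizes
```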
## 3. Convolutional Neural Networks
### 3.1 Basic Architecture

Feel free to arrange your convolution layers and pooling layers.
Flatten : reshape the feature maps into a 1D vector.
- For example, a 3D feature-map tensor of shape $50 \times 50 \times 16$ will be reshaped to a vector of length $50 \times 50 \times 16 = 40000$.
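A minimal Keras sketch of such an arrangement (assuming TensorFlow is available ; the input size and layer widths are only illustrative) :
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100, 100, 3)),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),  # 100 x 100 x 16
    tf.keras.layers.MaxPooling2D(pool_size=2),                         # 50 x 50 x 16
    tf.keras.layers.Flatten(),                                         # 50 * 50 * 16 = 40000 values
    tf.keras.layers.Dense(10, activation="softmax"),                   # 10-class classifier
])
model.summary()
```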
Visualize filters :

### 3.2 1D Convolutional Neural Networks
Convolutional neural networks are not limited to 2D image data.
- They can be applied to any features with **local** patterns.
For example :
1. Text
2. Voice
3. Protein sequences
4. DNA sequences
Here, we focus on 1D sequence features.
1D convolutional neural networks are similar to 2D convolutional neural networks, but the convolution and pooling operations in a 1D network slide along only one direction.
The following figure represents a 1D convolution operation :

The following figure represents a 1D pooling operation :

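A minimal 1D example in Keras (assuming TensorFlow is available ; the sequence length and encoding are only illustrative) :
```python
import tensorflow as tf

seq = tf.random.normal((1, 100, 4))          # e.g. a length-100 sequence with 4 channels (one-hot A/C/G/T)
conv = tf.keras.layers.Conv1D(filters=32, kernel_size=5, activation="relu")
pool = tf.keras.layers.MaxPooling1D(pool_size=2)
print(pool(conv(seq)).shape)                 # (1, 48, 32): the filters slide along one axis only
```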
:::info
Hint :
The size of a filter depends on the size of the local pattern.
:::
## 4. Evolution of Convolutional Neural Networks
In this section, we will discuss some famous convolutional neural network architectures :
- The first convolutional neural network
- Champions of the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) over the years
- Some improvements of convolutional neural networks
We only introduce the basic concepts of these networks. For more detail, please take a look at their papers.
### 4.1 LeNet
Paper : [Gradient-Based Learning Applied to Document Recognition](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf)

- It is **the first convolutional neural network**.
- Total number of layers : 7
- The subsampling layer is neither max pooling nor average pooling :
    - $Sigmoid(Sum(x_{00}, x_{01}, x_{10}, x_{11}) \times w + b)$
- The Gaussian connection is its output layer, used for classification :
    - $y_i = \sum_{j} (x_j - w_{ij}) ^ 2$
- Loss function : MSE (mean squared error) + MLE (maximum likelihood estimation)
    - $E(W) = \frac{1}{P} \sum_{p=1}^{P} \left( y_{D_p}(Z_p, W) + \log\left(e^{-j} + \sum_i e^{-y_i(Z_p, W)}\right) \right)$
### 4.2 AlexNet
Paper : [ImageNet Classification with Deep Convolutional Neural Networks](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)

- Champion of ILSVRC 2012
- Trained on **two GPUs**
- Use **ReLU** as its activation function<!-- Replaces the earlier tanh and sigmoid, which mitigates the vanishing gradient problem -->
- Use **data augmentation**<!-- During training, 224x224 regions are randomly cropped from the original 256x256 images and horizontally flipped, increasing the amount of data by (256-224)^2 x 2 = 2048 times, which greatly enlarges the dataset and reduces overfitting -->
- Local response normalization :
$$
b_{x,y}^i = a_{x,y}^i / \left(k + \alpha\sum_{j=max(0, i-n/2)}^{min(N-1, i+n/2)} (a_{x,y}^j)^2\right)^\beta
$$
- Overlapping pooling : $P = 3, S = 2$<!-- LeNet used average pooling, which tends to blur features; AlexNet uses max pooling with a step smaller than the pooling kernel, so neighboring outputs overlap and fewer features are lost -->
- **Dropout** in the first two fully connected layers.
### 4.3 VGG
Paper : [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556.pdf)

- Contribution : **deeper convolutional neural networks** have better performance.<!-- The paper explains why only 3x3 conv kernels are used: two stacked 3x3 kernels have a receptive field equivalent to one 5x5 kernel (each output pixel correlates with a 5x5 neighborhood), and three stacked 3x3 kernels are equivalent to one 7x7. However, two 3x3 layers use only (3x3x2)/(5x5) = 0.72 times the parameters of one 5x5 layer, and three 3x3 layers use (3x3x3)/(7x7) = 0.55 times the parameters of one 7x7 layer, so the same receptive field needs fewer parameters. Moreover, stacking more Conv+ReLU layers learns features better than a single Conv+ReLU layer. -->
- Very small filters : $3 \times 3$<!-- Established the trend of using 3x3 conv kernels -->
- VGG-11, VGG-13, VGG-16, VGG-19 : 11, 13, 16, 19 are the numbers of layers.
### 4.4 Network in Network
Paper : [Network In Network](https://arxiv.org/pdf/1312.4400.pdf)

#### 4.4.1 MLP convolution layer

Motivation :
- Increase the non-linear complexity between two convolution layers
- Reduce trainable variables
We can use a convolution layer with $1 \times 1$ filters as the MLP.
Why does it reduce trainable variables ? Consider the following two architectures (a quick check follows the list) :
- conv 3x3 (input = 64 channels, output = 128 channels) : $3 \times 3 \times 64 \times 128 = 73728$
- conv 3x3 (input = 64 channels, output = 32 channels) -> conv 1x1 (input = 32 channels, output = 128 channels) : $3 \times 3 \times 64 \times 32 + 1 \times 1 \times 32 \times 128 = 22528$
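The same weight counts in plain Python (biases omitted, as in the comparison above) :
```python
# Single 3x3 convolution, 64 -> 128 channels.
direct = 3 * 3 * 64 * 128
# 3x3 convolution down to 32 channels, then a 1x1 convolution up to 128 channels.
factored = 3 * 3 * 64 * 32 + 1 * 1 * 32 * 128
print(direct, factored)        # 73728 22528
```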
#### 4.4.2 Global average pooling
Motivation :
- Fully connected layers have too many parameters (trainable variables).
- This causes overfitting easily.
- For example, about 85% of the parameters ($\cong 123,000,000$) in VGG-19 come from its fully connected layers.
We can see that **flatten** keeps too many values, so the following fully connected layer needs a lot of parameters.
Idea : replace flatten with global average pooling.
- Convert each feature map into a single value.
- Max pooling would lose too much information.
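A minimal Keras illustration (assuming TensorFlow is available ; the feature-map sizes are only illustrative) :
```python
import tensorflow as tf

fmaps = tf.random.normal((1, 7, 7, 512))            # e.g. the last block of feature maps
gap = tf.keras.layers.GlobalAveragePooling2D()
print(gap(fmaps).shape)                             # (1, 512): one value per feature map, no flatten
```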

Advantages :
- Reduce parameters
- Regularization
### 4.5 GoogLeNet
Paper : [Going deeper with convolutions](https://arxiv.org/pdf/1409.4842.pdf)

- Champion of ILSVRC 2014
- Builds on **network in network** :
    - Uses convolution layers with $1 \times 1$ filters and global average pooling.
- 22 layers deep (counting layers with parameters), yet about 12 times fewer parameters than AlexNet.
- Multiple parallel branches inside each module
- Inception module (a simplified sketch follows at the end of this section) :
    - How do we concatenate the feature maps coming from the convolution layers and the pooling layer ?
        - Zero-padding
        - Stride in the pooling layer is **1**.

- Inception v2, v3, v4 :
- [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)
- [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567.pdf)
- [Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/pdf/1602.07261.pdf)
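A simplified sketch of an inception-style module in Keras (assuming TensorFlow is available ; the real module also uses $1 \times 1$ reductions before the larger convolutions, and the filter counts here are only illustrative) :
```python
import tensorflow as tf
from tensorflow.keras import layers

# Parallel 1x1 / 3x3 / 5x5 convolutions and a stride-1 max pool, all with "same"
# padding so the branch outputs keep the same height/width and can be concatenated.
x = layers.Input(shape=(28, 28, 192))
b1 = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
b2 = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(x)
b4 = layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(x)
out = layers.Concatenate()([b1, b2, b3, b4])        # channels: 64 + 128 + 32 + 192
model = tf.keras.Model(x, out)
print(model.output_shape)                           # (None, 28, 28, 416)
```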
### 4.6 ResNet
Paper : [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385.pdf)

- Champion of ILSVRC 2015
- **Residual block** : the skip connection helps avoid vanishing gradients (a minimal sketch follows at the end of this section).

- It is possible to train a convolutional neural network with more than 1000 layers.
- Variants :
- DenseNet
- Inception-ResNet
- Analysis of the performance of different residual block architectures : [Identity Mappings in Deep Residual Networks](https://arxiv.org/pdf/1603.05027.pdf)
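A minimal sketch of a residual block with an identity shortcut in Keras (assuming TensorFlow is available ; batch normalization is omitted for brevity, and the filter count is only illustrative) :
```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([x, y])                 # skip connection: output = F(x) + x
    return layers.ReLU()(y)

inputs = layers.Input(shape=(32, 32, 64))    # channel count must match `filters` for the identity shortcut
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)                    # (None, 32, 32, 64)
```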
### 4.7 More (Keyword only)
Classification :
- Xception
- SENet
- EfficientNet v1/v2
- ResNeSt
Object Detection :
- SPPNet
- R-CNN
- Fast R-CNN
- Faster R-CNN
- EfficientDet
- [YOLO](https://pjreddie.com/darknet/yolo/) v1, v2, v3, ... v4, v5
Semantic segmentation :
- Fully Convolutional Networks (FCN)
- U-Net / U^2-Net
- [DeepLab](https://colab.research.google.com/github/tensorflow/models/blob/master/research/deeplab/deeplab_demo.ipynb#scrollTo=edGukUHXyymr) v1, v2, v3
- Mask R-CNN
Style transfer:
- [Algorithmia](https://demos.algorithmia.com/deep-style)
- AdaIN
- CycleGAN
- StyleGAN v1/v2/v3
GAN :
- CGAN / Pix2Pix
- SinGAN
- [GAUGAN](http://nvidia-research-mingyuliu.com/gaugan)
Mobile device :
- ShuffleNet
- SqueezeNet
- MobileNet v1, v2
## 5. References
- [Why convolutions always use odd-numbers as filter_size](https://datascience.stackexchange.com/questions/23183/why-convolutions-always-use-odd-numbers-as-filter-size)
- [Convolutional Layers: To pad or not to pad?](https://stats.stackexchange.com/questions/246512/convolutional-layers-to-pad-or-not-to-pad)
- [ILSVRC deep learning models over the years (in Chinese)](https://chtseng.wordpress.com/2017/11/20/ilsvrc-%E6%AD%B7%E5%B1%86%E7%9A%84%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92%E6%A8%A1%E5%9E%8B/)
- [Hung-yi Lee - Batch Normalization](https://www.youtube.com/watch?v=BZh1ltr5Rkg)
- [Wikipedia - Feature Scaling](https://en.wikipedia.org/wiki/Feature_scaling)
- [UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=32E7AA2ED75E8060BB8CE283166B218F?doi=10.1.1.703.6858&rep=rep1&type=pdf)