# Convolutional Neural Networks

## 1. Convolution Operation
### 1.1 Motivation
Disadvantages of a plain (fully connected) neural network for image classification :
1. Too many trainable variables.
2. Too much redundant computation.
For disadvantage 1, consider a neural network as follows :

Total number of trainable variables is $120000 \times 5000 + 5000 + 5000 \times 10 + 10 = 600,055,010$
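A quick check of this arithmetic (assuming an input of 120,000 values, e.g. a flattened $200 \times 200 \times 3$ image, one hidden layer with 5,000 neurons, and 10 output classes; these sizes are only illustrative) :
```python
# Hypothetical fully connected network: 120,000 inputs -> 5,000 hidden units -> 10 classes.
inputs, hidden, outputs = 120_000, 5_000, 10

# Each layer contributes (weights + biases).
params = (inputs * hidden + hidden) + (hidden * outputs + outputs)
print(params)  # 600055010
```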
For disadvantage 2, see the following figure :

Dotted lines mean that the value may not carry important information for the next neuron.
Too much redundant information (noise) hinders the neural network from extracting features.
### 1.2 Feature Extraction
#### 1.2.1 Local Features
Idea : how about extracting features from local regions in an image ?

<!--
1. Only need to pay attention to small regions
2. Shared features
-->
1. Maybe we can use some method to find local features in an image.
2. Then we can merge several local features into a high-level feature.
3. Finally, we can use these high-level features to do the classification.

#### 1.2.2 Filter
But how do we extract local features ?
We can use many **filters** to search for local features in an image.
Filter : a matrix (2D or 3D) which represents a specific pattern.

We can perform a dot product between a filter and a local region of an image.
- If the result of the dot product exceeds a threshold, we expect that the region contains a valid local feature. (Remember **bias** ?)

The filter performs the above operation on every local region of the image.
- We get an output matrix after the operation.
- The output matrix is called a **feature map**.


Next, we can apply the same filtering method to the feature maps to extract even higher-level features.

### 1.3 Convolution

Filter : an $H \times W \times C$ matrix ($H$ = filter height, $W$ = filter width, $C$ = number of input channels)

- In practice, we usually let $H = W$ in a 2D convolution layer.
- $H$ and $W$ are almost always odd numbers.
- [Why convolutions always use odd-numbers as filter_size](https://datascience.stackexchange.com/questions/23183/why-convolutions-always-use-odd-numbers-as-filter-size)

Convolution layer : a layer that contains many filters.
- We can use an $N \times H \times W \times C$ matrix to represent all filters in a convolution layer ($N$ = number of filters).

Filters are applied to the image (or input data) to generate feature maps.
- A convolution layer generates feature maps using all the filters in the layer.
We can use an equation to represent a convolution layer :
$$
Output = Activation(Convolution(Input, Filters) + Bias)
$$
- $Convolution(.)$ works as in the following figure :

- Each feature map gets one bias value added, so $Bias$ here is an $N$-dimensional vector.
- Assume $Input$ is an $I \times J \times C$ matrix, where $I$ = height, $J$ = width.
- We can expect $Output$ to be an $(I - H + 1) \times (J - W + 1) \times N$ matrix.
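A minimal numpy sketch of this layer for a single input (no stride or padding yet); the function name `conv2d` and all sizes are only illustrative :
```python
import numpy as np

def conv2d(x, filters, bias):
    """x: (I, J, C), filters: (N, H, W, C), bias: (N,) -> output: (I-H+1, J-W+1, N)."""
    N, H, W, C = filters.shape
    I, J, _ = x.shape
    out = np.zeros((I - H + 1, J - W + 1, N))
    for n in range(N):                                  # each filter yields one feature map
        for i in range(I - H + 1):
            for j in range(J - W + 1):
                patch = x[i:i + H, j:j + W, :]          # local region of the input
                out[i, j, n] = np.sum(patch * filters[n]) + bias[n]
    return out

x = np.random.rand(8, 8, 3)                             # I = J = 8, C = 3
filters = np.random.rand(4, 3, 3, 3)                    # N = 4, H = W = 3
bias = np.zeros(4)
y = np.maximum(conv2d(x, filters, bias), 0)             # Activation(Convolution(.) + Bias), with ReLU
print(y.shape)                                          # (6, 6, 4) = (I-H+1, J-W+1, N)
```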
After a convolution layer, we can send the output to the next convolution layer.
How do we define the filters in a convolution layer ?
- We randomly initialize all filters and biases and treat them as trainable variables.
- They are optimized with backpropagation.
Now we can compute the number of trainable variables in a convolution layer :
- Assume $H = W = 3$, $C = 3$, $N = 64$ ; then the number of trainable variables is $3 \times 3 \times 3 \times 64 + 64 = 1792$.
- Note that the number of parameters depends only on how the convolution layer is defined ; it is independent of the input size.
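A quick check with Keras (assuming TensorFlow is available) ; the count stays the same no matter what the input height and width are :
```python
import tensorflow as tf

# The parameter count of a Conv2D layer does not depend on the input height/width.
for size in (32, 224):
    layer = tf.keras.layers.Conv2D(filters=64, kernel_size=3)
    layer.build(input_shape=(None, size, size, 3))      # 3 input channels
    print(size, layer.count_params())                   # 1792 = 3*3*3*64 + 64 in both cases
```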
### 1.4 Presenting CNN in the form of DNN
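A convolution layer can be viewed as a sparsely connected, weight-sharing fully connected layer : each output value is the dot product of a flattened local patch with the flattened filter weights, plus a bias. A minimal numpy sketch of this view (all names and sizes are only illustrative) :
```python
import numpy as np

# One 3x3x3 filter applied to an 8x8x3 input, written as ordinary DNN math:
# unroll every local patch into a row vector and take a matrix-vector product.
x = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3)                              # shared weights (the filter)
b = 0.1                                                  # shared bias

patches = np.array([x[i:i + 3, j:j + 3, :].ravel()       # "im2col": shape (36, 27)
                    for i in range(6) for j in range(6)])
dense_out = patches @ w.ravel() + b                      # a fully connected computation
feature_map = dense_out.reshape(6, 6)                    # identical to the convolution output
```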


### 1.5 Stride
Until now, we have at least 3 hyperparameters : $N$, $H$, $W$
- **Notice** : $C$ is not a hyperparameter. $C$ must match the number of channels of the input data.
Now we focus on the step size with which a filter slides :
If we slide one step at a time :
- we get an $(I - H + 1) \times (J - W + 1) \times N$ output matrix.
If we slide two steps at a time :
- we get a $ceil((I - H + 1) / 2) \times ceil((J - W + 1) / 2) \times N$ output matrix.
If we slide $S$ steps at a time :
- we get a $ceil((I - H + 1)/S) \times ceil((J - W + 1)/S) \times N$ output matrix.

We define the **stride** as the step size of each slide.
We can use the stride to control the dimensions of the output matrix.
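A quick Keras check of the effect of the stride (assuming TensorFlow is available ; the sizes are only illustrative) :
```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))        # a batch with one 32x32 RGB image
for stride in (1, 2):
    conv = tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=stride)
    print(stride, conv(x).shape)            # (1, 30, 30, 64), then (1, 15, 15, 64)
```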
### 1.6 Zero-padding
You may find it a little hard to compute the output dimensions of a convolution layer.
So, we can add 0s around the input data to avoid dimension reduction.

First consider stride = 1 : zero-padding can keep the height and width of the output matrix the same as those of the input matrix.
Then consider stride > 1 : we can pad with 0s so that the output dimensions become $ceil(I/S) \times ceil(J/S) \times N$.
Advantages of zero-padding :
1. Easier to design networks.
2. Allows us to design deeper networks.
3. Padding actually improves performance by keeping border information.
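A quick comparison of "same" (zero-padded) and "valid" (no padding) convolutions in Keras (assuming TensorFlow is available) :
```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))
same  = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same")    # pads with zeros
valid = tf.keras.layers.Conv2D(64, 3, strides=2, padding="valid")   # no padding
print(same(x).shape, valid(x).shape)        # (1, 16, 16, 64) vs. (1, 15, 15, 64)
```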
## 2. Pooling Operation
### 2.1 Motivation

Image source : https://www.loveupets.com/report_detail_pc.php?id=417
1. Humans can recognize objects in a low-resolution image.
- If a 500 x 500 image is compressed to a 250 x 250 image, we can still recognize the objects in it.
2. Some values in feature maps may be redundant or noisy.
3. Computing high-dimensional feature maps requires more software and hardware resources.
### 2.2 Pooling
Idea : we can convert every $P \times P$ region of a feature map into a single value. The following figure shows the case $P = 2$.

Two ways to generate the value from a $P \times P$ matrix :
1. Average pooling : the value is the average of the matrix elements.
2. Max pooling : the value is the maximum element in the matrix.
- Max pooling is the most common pooling operation in convolutional neural networks.
Pooling layer : a layer that performs the pooling operation.
- There are no trainable variables here.
- There is no activation function either.
- The pooling operation is performed on each feature map separately ; feature maps are independent of each other.
### 2.3 Stride
In section 1, we saw that the stride affects the dimensions of the output matrix. So, if we want a pooling layer to reduce the dimensions of its output, we should use stride > 1.
- Commonly, the stride is set to 2.
We can expect the output dimensions to be roughly $(H/S) \times (W/S) \times N$, where $H$ and $W$ are now the height and width of the input feature maps.
- The output dimensions are (almost) independent of $P$.
### 2.4 Zero-padding
We can pad with 0s before a pooling operation.
- The output dimensions after the pooling layer will then be exactly $ceil(H/S) \times ceil(W/S) \times N$.
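A short Keras check (assuming TensorFlow is available) ; with "same" padding and stride 2, the output size is $ceil(H/S)$ regardless of $P$ :
```python
import tensorflow as tf

fmaps = tf.random.normal((1, 32, 32, 64))   # feature maps from a previous layer
for p in (2, 3):
    pool = tf.keras.layers.MaxPooling2D(pool_size=p, strides=2, padding="same")
    print(p, pool(fmaps).shape)             # (1, 16, 16, 64) for both pool sizes
```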
## 3. Convolutional Neural Networks
### 3.1 Basic Architecture

Feel free to arrange your convolution layers and pooling layers.
Flatten : reshape the feature maps into a 1D vector.
- For example, a 3D feature-map tensor of shape $50 \times 50 \times 16$ will be reshaped to a vector of length $50 \times 50 \times 16 = 40000$.
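A minimal Keras sketch of such an arrangement (assuming TensorFlow is available ; the input size and layer widths are only illustrative) :
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100, 100, 3)),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),  # 100 x 100 x 16
    tf.keras.layers.MaxPooling2D(pool_size=2),                         # 50 x 50 x 16
    tf.keras.layers.Flatten(),                                         # 50 * 50 * 16 = 40000 values
    tf.keras.layers.Dense(10, activation="softmax"),                   # 10-class classifier
])
model.summary()
```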
Visualize filters :

### 3.2 1D Convolutional Neural Networks
Convolutional neural networks are not limited to 2D image data.
- They can be applied to any features with **local** patterns.
For example :
1. Text
2. Voice
3. Protein sequences
4. DNA sequences
Here, we focus on 1D sequence features.
1D convolutional neural networks are similar to 2D convolutional neural networks, but the convolution and pooling operations in a 1D network slide along only one direction.
The following figure represents a 1D convolution operation :

The following figure represents a 1D pooling operation :

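A minimal 1D example in Keras (assuming TensorFlow is available ; the sequence length and encoding are only illustrative) :
```python
import tensorflow as tf

seq = tf.random.normal((1, 100, 4))          # e.g. a length-100 sequence with 4 channels (one-hot A/C/G/T)
conv = tf.keras.layers.Conv1D(filters=32, kernel_size=5, activation="relu")
pool = tf.keras.layers.MaxPooling1D(pool_size=2)
print(pool(conv(seq)).shape)                 # (1, 48, 32): the filters slide along one axis only
```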
:::info
Hint :
The size of a filter depends on the size of the local pattern.
:::
## 4. Evolution of Convolutional Neural Networks
In this section, we will discuss some famous convolutional neural network architectures :
- The first convolutional neural network
- Champions of the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) over the years
- Some improvements of convolutional neural networks
We only introduce the basic concepts of these networks. For more detail, please take a look at their papers.
### 4.1 LeNet
Paper : [Gradient-Based Learning Applied to Document Recognition](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf)

- It is **the first convolutional neural network**.
- Total number of layers : 7
- The subsampling layer is neither max pooling nor average pooling :
    - $Sigmoid(Sum(x_{00}, x_{01}, x_{10}, x_{11}) \times w + b)$
- The Gaussian connection is its output layer, used for classification :
    - $y_i = \sum_{j} (x_j - w_{ij}) ^ 2$
- Loss function : MSE (mean squared error) + MLE (maximum likelihood estimation)
    - $E(W) = \frac{1}{P} \sum_{p=1}^{P} \left( y_{D_p}(Z_p, W) + \log\left(e^{-j} + \sum_i e^{-y_i(Z_p, W)}\right) \right)$
### 4.2 AlexNet
Paper : [ImageNet Classification with Deep Convolutional Neural Networks](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)

- Champion of ILSVRC 2012
- Trained on **two GPUs**
- Use **ReLU** as its activation function<!-- Replaces the earlier tanh and sigmoid, which mitigates the vanishing gradient problem -->
- Use **data augmentation**<!-- During training, 224x224 regions are randomly cropped from the original 256x256 images and horizontally flipped, increasing the amount of data by (256-224)^2 x 2 = 2048 times, which greatly enlarges the dataset and reduces overfitting -->
- Local response normalization :
$$
b_{x,y}^i = a_{x,y}^i / \left(k + \alpha\sum_{j=max(0, i-n/2)}^{min(N-1, i+n/2)} (a_{x,y}^j)^2\right)^\beta
$$
- Overlapping pooling : $P = 3, S = 2$<!-- LeNet used average pooling, which tends to blur features; AlexNet uses max pooling with a step smaller than the pooling kernel, so neighboring outputs overlap and fewer features are lost -->
- **Dropout** in the first two fully connected layers.
### 4.3 VGG
Paper : [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556.pdf)

- Contribution : **deeper convolutional neural networks** have better performance.<!-- The paper explains why only 3x3 conv kernels are used: two stacked 3x3 kernels have a receptive field equivalent to one 5x5 kernel (each output pixel correlates with a 5x5 neighborhood), and three stacked 3x3 kernels are equivalent to one 7x7. However, two 3x3 layers use only (3x3x2)/(5x5) = 0.72 times the parameters of one 5x5 layer, and three 3x3 layers use (3x3x3)/(7x7) = 0.55 times the parameters of one 7x7 layer, so the same receptive field needs fewer parameters. Moreover, stacking more Conv+ReLU layers learns features better than a single Conv+ReLU layer. -->
- Very small filters : $3 \times 3$<!-- Established the trend of using 3x3 conv kernels -->
- VGG-11, VGG-13, VGG-16, VGG-19 : 11, 13, 16, 19 are the numbers of layers.
### 4.4 Network in Network
Paper : [Network In Network](https://arxiv.org/pdf/1312.4400.pdf)

#### 4.4.1 MLP convolution layer

Motivation :
- Increase the non-linear complexity between two convolution layers
- Reduce trainable variables
We can use a convolution layer with $1 \times 1$ filters as the MLP.
Why does it reduce trainable variables ? Consider the following two architectures (a quick check follows the list) :
- conv 3x3 (input = 64 channels, output = 128 channels) : $3 \times 3 \times 64 \times 128 = 73728$
- conv 3x3 (input = 64 channels, output = 32 channels) -> conv 1x1 (input = 32 channels, output = 128 channels) : $3 \times 3 \times 64 \times 32 + 1 \times 1 \times 32 \times 128 = 22528$
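The same weight counts in plain Python (biases omitted, as in the comparison above) :
```python
# Single 3x3 convolution, 64 -> 128 channels.
direct = 3 * 3 * 64 * 128
# 3x3 convolution down to 32 channels, then a 1x1 convolution up to 128 channels.
factored = 3 * 3 * 64 * 32 + 1 * 1 * 32 * 128
print(direct, factored)        # 73728 22528
```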
#### 4.4.2 Global average pooling
Motivation :
- Fully connected layers have too many parameters (trainable variables).
- This causes overfitting easily.
- For example, about 85% of the parameters ($\cong 123,000,000$) in VGG-19 come from its fully connected layers.
We can see that **flatten** keeps too many values, so the following fully connected layer needs a lot of parameters.
Idea : replace flatten with global average pooling.
- Convert each feature map into a single value.
- Max pooling would lose too much information.
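A minimal Keras illustration (assuming TensorFlow is available ; the feature-map sizes are only illustrative) :
```python
import tensorflow as tf

fmaps = tf.random.normal((1, 7, 7, 512))            # e.g. the last block of feature maps
gap = tf.keras.layers.GlobalAveragePooling2D()
print(gap(fmaps).shape)                             # (1, 512): one value per feature map, no flatten
```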

Advantages :
- Reduce parameters
- Regularization
### 4.5 GoogLeNet
Paper : [Going deeper with convolutions](https://arxiv.org/pdf/1409.4842.pdf)

- Champion of ILSVRC 2014
- Builds on **network in network** :
    - Uses convolution layers with $1 \times 1$ filters and global average pooling.
- 22 layers deep (counting layers with parameters), yet about 12 times fewer parameters than AlexNet.
- Multiple parallel branches inside each module
- Inception module (a simplified sketch follows at the end of this section) :
    - How do we concatenate the feature maps coming from the convolution layers and the pooling layer ?
        - Zero-padding
        - Stride in the pooling layer is **1**.

- Inception v2, v3, v4 :
- [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)
- [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567.pdf)
- [Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/pdf/1602.07261.pdf)
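A simplified sketch of an inception-style module in Keras (assuming TensorFlow is available ; the real module also uses $1 \times 1$ reductions before the larger convolutions, and the filter counts here are only illustrative) :
```python
import tensorflow as tf
from tensorflow.keras import layers

# Parallel 1x1 / 3x3 / 5x5 convolutions and a stride-1 max pool, all with "same"
# padding so the branch outputs keep the same height/width and can be concatenated.
x = layers.Input(shape=(28, 28, 192))
b1 = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
b2 = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(x)
b4 = layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(x)
out = layers.Concatenate()([b1, b2, b3, b4])        # channels: 64 + 128 + 32 + 192
model = tf.keras.Model(x, out)
print(model.output_shape)                           # (None, 28, 28, 416)
```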
### 4.6 ResNet
Paper : [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385.pdf)

- Champion of ILSVRC 2015
- **Residual block** : the skip connection helps avoid vanishing gradients (a minimal sketch follows at the end of this section).

- It is possible to train a convolutional neural network with more than 1000 layers.
- Variants :
- DenseNet
- Inception-ResNet
- Analysis of the performance of different residual block architectures : [Identity Mappings in Deep Residual Networks](https://arxiv.org/pdf/1603.05027.pdf)
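A minimal sketch of a residual block with an identity shortcut in Keras (assuming TensorFlow is available ; batch normalization is omitted for brevity, and the filter count is only illustrative) :
```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([x, y])                 # skip connection: output = F(x) + x
    return layers.ReLU()(y)

inputs = layers.Input(shape=(32, 32, 64))    # channel count must match `filters` for the identity shortcut
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)                    # (None, 32, 32, 64)
```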
### 4.7 More (Keyword only)
Classification :
- Xception
- SENet
- EfficientNet v1/v2
- ResNeSt
Object Detection :
- SPPNet
- R-CNN
- Fast R-CNN
- Faster R-CNN
- EfficientDet
- [YOLO](https://pjreddie.com/darknet/yolo/) v1, v2, v3, ... v4, v5
Semantic segmentation :
- Fully Convolutional Networks (FCN)
- U-Net / U^2-Net
- [DeepLab](https://colab.research.google.com/github/tensorflow/models/blob/master/research/deeplab/deeplab_demo.ipynb#scrollTo=edGukUHXyymr) v1, v2, v3
- Mask R-CNN
Style transfer:
- [Algorithmia](https://demos.algorithmia.com/deep-style)
- AdaIN
- CycleGAN
- StyleGAN v1/v2/v3
GAN :
- CGAN / Pix2Pix
- SinGAN
- [GAUGAN](http://nvidia-research-mingyuliu.com/gaugan)
Mobile device :
- ShuffleNet
- SqueezeNet
- MobileNet v1, v2
## 5. References
- [Why convolutions always use odd-numbers as filter_size](https://datascience.stackexchange.com/questions/23183/why-convolutions-always-use-odd-numbers-as-filter-size)
- [Convolutional Layers: To pad or not to pad?](https://stats.stackexchange.com/questions/246512/convolutional-layers-to-pad-or-not-to-pad)
- [ILSVRC deep learning models over the years (in Chinese)](https://chtseng.wordpress.com/2017/11/20/ilsvrc-%E6%AD%B7%E5%B1%86%E7%9A%84%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92%E6%A8%A1%E5%9E%8B/)
- [Hung-yi Lee - Batch Normalization](https://www.youtube.com/watch?v=BZh1ltr5Rkg)
- [Wikipedia - Feature Scaling](https://en.wikipedia.org/wiki/Feature_scaling)
- [UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=32E7AA2ED75E8060BB8CE283166B218F?doi=10.1.1.703.6858&rep=rep1&type=pdf)