---
title: VGG - MobileNet
tags: khtn backbone
---
VGG - MobileNet
===
## AlexNet - 2012
- First use of ReLU
- Local response normalization (Norm) layers
- Heavy data augmentation
- Minibatch 128
- Learning rate: 1e-2, reduced by 10 manually when validation accuracy plateaus (see the optimizer sketch after this list)
- Dropout: 0.5
- SGD momentum 0.9
- L2 weight decay 5e-4
- 7-CNN ensemble: 18.2% -> 15.4% error (2.8% improvement)
- Network spread across 2 GPUs
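
A minimal PyTorch sketch of this training recipe, using `torchvision`'s AlexNet as a stand-in for the original two-GPU network; the data loader and training loop are placeholders, not part of the original notes:

```python
import torch
import torchvision

# torchvision's AlexNet as a stand-in for the original two-GPU network.
model = torchvision.models.alexnet(num_classes=1000)

# SGD with momentum 0.9, L2 weight decay 5e-4, initial learning rate 1e-2.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)

# The paper reduced the learning rate by 10x manually when validation
# accuracy plateaued; ReduceLROnPlateau automates the same schedule.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=5)

# Training-loop sketch: minibatch size 128; dropout 0.5 is already part of
# the AlexNet classifier head in torchvision.
# for images, labels in loader:          # DataLoader with batch_size=128
#     loss = torch.nn.functional.cross_entropy(model(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
# scheduler.step(val_accuracy)           # once per epoch, on validation accuracy
```
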
## VGG - 2014
### Network design
- Only uses 3x3 convolutional layers
- Reducing volume size is handled by max pooling
- Two fully connected layers (4096 nodes each) followed by a softmax classifier at the end

**Figure:** VGG 16

**Figure:** Architecture

**Figure:** Number of parameters
### Receptive field
- The receptive field is defined as the region in the input space that a particular CNN feature is looking at (i.e., is affected by); see the sketch below
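
A small plain-Python sketch of the usual receptive-field recursion, with layers given as `(kernel_size, stride)` pairs (an assumed convention); it shows that three stacked 3x3 convolutions see the same 7x7 input region as a single 7x7 convolution:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, from input to output."""
    rf, jump = 1, 1              # rf: receptive field size, jump: cumulative stride
    for k, s in layers:
        rf += (k - 1) * jump     # each layer widens the field by (k-1) * stride product
        jump *= s
    return rf

# Three stacked 3x3 / stride-1 convs vs. a single 7x7 conv.
print(receptive_field([(3, 1)] * 3))   # 7
print(receptive_field([(7, 1)]))       # 7
```
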

### Drawbacks
- Slow to train
- The number of weights is quite large: the model is ~528 MB
### Summary
- Increased number of layers
- Uses smaller 3x3 filters (three stacked 3x3 filters instead of one 7x7): same effective receptive field, but deeper with more non-linearities (see the parameter-count sketch after this list)
- Spatial size decreases while the number of filters/parameters increases
- fc7 features generalize well to other tasks
- Training
    - Batch size: 256
    - SGD + momentum: 0.9
    - Regularization: L2
    - Dropout 0.5 on the first two FC layers
    - Learning rate: $10^{-2}$, divided by 10 when validation error plateaus
    - Weight decay of $5\cdot10^{-4}$
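
A rough check of the 3x3-vs-7x7 point above, using PyTorch with an arbitrary channel count of 64 and biases omitted: the stack of three 3x3 convolutions covers the same receptive field with roughly half the weights.

```python
import torch.nn as nn

C = 64  # arbitrary channel count for illustration

stack_3x3 = nn.Sequential(
    nn.Conv2d(C, C, 3, padding=1, bias=False), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, 3, padding=1, bias=False), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, 3, padding=1, bias=False), nn.ReLU(inplace=True),
)
single_7x7 = nn.Conv2d(C, C, 7, padding=3, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stack_3x3))   # 3 * 9 * 64 * 64 = 110_592
print(count(single_7x7))  # 49 * 64 * 64   = 200_704
```
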
## GoogLeNet - 2014
- Inception module (see the sketch after this list)
    - Stack 1x1, 3x3, 5x5 convolutions and max pooling in parallel
    - 1x1 convolutions to reduce complexity
- Stem network at the start to aggressively downsample the input
- Reduced number of FC layers (global average pooling instead)
- Auxiliary classification outputs
    - Help with vanishing gradients
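
A hedged PyTorch sketch of an Inception-style module, with 1x1 convolutions reducing channel depth before the expensive 3x3 and 5x5 branches. The branch widths below roughly follow the first inception module (3a), but treat them as illustrative rather than the exact GoogLeNet configuration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / max-pool branches, concatenated on channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, 1)            # 1x1 branch
        self.branch3 = nn.Sequential(                     # 1x1 reduce, then 3x3
            nn.Conv2d(in_ch, 96, 1), nn.ReLU(inplace=True),
            nn.Conv2d(96, 128, 3, padding=1))
        self.branch5 = nn.Sequential(                     # 1x1 reduce, then 5x5
            nn.Conv2d(in_ch, 16, 1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 5, padding=2))
        self.branch_pool = nn.Sequential(                 # 3x3 max-pool, then 1x1
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

y = InceptionModule(192)(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 256, 28, 28]) -- 64 + 128 + 32 + 32 channels
```
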
## ResNet - 2015
### The problem of very deep neural networks

- More layers, more powerful?
- Yes and no: very deep networks are hard to train!
- Gradients vanish/explode when we stack too many layers

### Residual Network
- Solution: a "shortcut" or a "skip connection"
- Allows the gradient to be directly backpropagated to earlier layers
- Ensures a deep network can perform at least as well as a shallow one
- $z = W_1x + b_1$, $h = \theta(z)$, $y = W_2h + x$
- $y = F(x) + x$
- Each layer adds something to the previous value, rather than producing an entirely new value.
#### The identity block

- Input and output are of the same dimensions
- $y=F(x, \{W_i\})+x$
- No extra parameters
#### The convolutional block

- Input and output have different dimensions
- $y=F(x, \{W_i\})+W_sx$
- Needs extra parameters for the projection $W_s$ (see the block sketch after this list)
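
A minimal PyTorch sketch covering both cases in one module: an identity shortcut when shapes match, and a 1x1 projection ($W_s$) when they do not. This is a simplified basic block (two 3x3 convolutions), not the exact published layer configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """y = F(x) + shortcut(x); the shortcut is the identity unless shapes change."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if stride != 1 or in_ch != out_ch:
            # Convolutional block: 1x1 projection W_s adds extra parameters.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            # Identity block: no extra parameters on the shortcut.
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))   # F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64, 64)(x).shape)             # identity shortcut, same shape
print(ResidualBlock(64, 128, stride=2)(x).shape)  # projection shortcut, downsampled
```
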
### Architecture of ResNet
#### BottleNeck
- \# parameters with the bottleneck (1x1 -> 3x3 -> 1x1): 256 x 64 + 64 x 3 x 3 x 64 + 64 x 256 = ~70K (verified in the sketch below)
- \# parameters just using a 3x3x256x256 conv layer = ~600K
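
A quick PyTorch check of these counts, with biases omitted so the numbers match the arithmetic above:

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),            # 256 * 64
    nn.Conv2d(64, 64, 3, padding=1, bias=False),  # 64 * 3 * 3 * 64
    nn.Conv2d(64, 256, 1, bias=False),            # 64 * 256
)
plain_3x3 = nn.Conv2d(256, 256, 3, padding=1, bias=False)  # 3 * 3 * 256 * 256

print(count(bottleneck))  # 69_632  (~70K)
print(count(plain_3x3))   # 589_824 (~600K)
```
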
#### Diagram

### Summary
- Extreme depth: 152 layers
- Idea: the deep model should perform at least as well as a shallower model
- $F(x)=H(x)-x$
- Training
    - Batch Norm
    - Xavier/2 initialization from He et al.
    - Mini-batch size 256
    - Weight decay of 1e-5
    - No dropout used
    - SGD + momentum (0.9)
    - Learning rate: 0.1, divided by 10 when validation error plateaus
## MobileNet V1

- Notation: $D_K$ = kernel size, $M$ = number of input channels, $N$ = number of output channels, $D_F$ = feature map (spatial) size
- Standard convolution:
    - $\text{computational cost}=D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$
- Depthwise separable convolution = depthwise convolution + pointwise convolution:
    - Depthwise convolution (one filter per input channel): $\text{cost}=D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$
    - Pointwise (1x1) convolution (mixes channels): $\text{cost}=M \cdot N \cdot D_F \cdot D_F$
    - Total: $\text{cost}=D_K \cdot D_K \cdot M \cdot D_F \cdot D_F+M \cdot N \cdot D_F \cdot D_F$
- Reduction in computation:
    - $\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F+M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F}=\frac{1}{N}+\frac{1}{D^2_K}$
- There are many ways to make a deep learning model run faster while still achieving good accuracy; one of them is the depthwise separable convolution.
- The depthwise layer extracts spatial features per channel, and the pointwise layer mixes channels to create new features (see the sketch at the end of this section).
- The first layer of the model is a standard convolution, because we don't want to lose information at the first step.
- The number of parameters is small, so not much regularization is needed to prevent overfitting.
- Architecture


- Training
    - RMSprop with asynchronous gradient descent
    - Less regularization and data augmentation (small models overfit less)
    - No side heads or label smoothing; additionally, the amount of image distortion is reduced by limiting the size of the small crops used in large-Inception training
    - Little or no weight decay (L2 regularization)
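
A hedged PyTorch sketch of one depthwise separable block in the MobileNet style (depthwise 3x3 via `groups=in_ch`, then a pointwise 1x1), together with a check of the $\frac{1}{N}+\frac{1}{D^2_K}$ ratio on the per-position multiply counts; the channel sizes are illustrative, not the published configuration:

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    """Depthwise 3x3 (spatial filtering per channel) + pointwise 1x1 (channel mixing),
    each followed by BatchNorm and ReLU as in MobileNet-style blocks."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),      # depthwise: D_K * D_K * M weights
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),  # pointwise: M * N weights
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

# Cost-ratio check for illustrative sizes M=128 input channels, N=256 output, D_K=3.
M, N, DK = 128, 256, 3
separable = DK * DK * M + M * N   # depthwise + pointwise multiplies per spatial position
standard = DK * DK * M * N        # standard convolution multiplies per spatial position
print(separable / standard)       # ~0.115
print(1 / N + 1 / DK ** 2)        # same value: 1/N + 1/D_K^2

x = torch.randn(1, M, 32, 32)
print(depthwise_separable(M, N)(x).shape)   # torch.Size([1, 256, 32, 32])
```
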
## Group Conv
## Question
- Residual block???
- SqueezeNet?? How does it relate to MobileNet?