---
title: VGG - MobileNet
tags: khtn backbone
---

VGG - MobileNet
===

## AlexNet - 2012
- First use of ReLU
- Norm layers (local response normalization)
- Heavy data augmentation
- Minibatch size 128
- Learning rate: 1e-2, reduced by 10 manually when val accuracy plateaus
- Dropout: 0.5
- SGD momentum 0.9
- L2 weight decay 5e-4
- 7-CNN ensemble: 18.2% -> 15.4% (a 2.8% improvement)
- Network spread across 2 GPUs

## VGG - 2014

### Network design
- Only uses 3x3 convolutional layers
- Reducing volume size is handled by max pooling
- 2 fully connected layers followed by a softmax classifier at the end

![](https://cv-tricks.com/wp-content/uploads/2017/05/vgg16.png)
**Figure:** VGG 16

![](https://i.imgur.com/He3jhvE.png)
**Figure:** Architecture

![](https://i.imgur.com/kGJJl1j.png)
**Figure:** Number of parameters

### Receptive field
- The receptive field is defined as the region in the input space that a particular CNN feature is looking at (i.e. is affected by)

![](https://www.researchgate.net/publication/316950618/figure/fig4/AS:495826810007552@1495225731123/The-receptive-field-of-each-convolution-layer-with-a-3-3-kernel-The-green-area-marks.png)

### Drawbacks
- Slow to train
- The number of weights is quite large: 528MB

### Summary
- Increase the number of layers
- Use smaller 3x3 filters (three 3x3 filters instead of one 7x7): same effective receptive field, but deeper, with more non-linearities
- Feature map size decreases while the number of parameters increases as we go deeper
- fc7 features generalize well to other tasks
- Training
    - Batch size: 256
    - SGD + Momentum: 0.9
    - Regularization: L2
    - Dropout 0.5 on the first two FC layers
    - Learning rate: $10^{-2}$, divided by 10 when validation error plateaus
    - Weight decay of $5*10^{-4}$

## GoogLeNet - 2014
- Inception module
    - Stack 1x1, 3x3, 5x5, maxpool
    - 1x1 conv to reduce complexity
- Stem network
- Reduce the number of FC layers
- Auxiliary classification outputs
    - Help with the vanishing gradient problem

## ResNet - 2015

### The problem of very deep neural networks
![](https://neurohive.io/wp-content/uploads/2019/01/plain-networks-training-results-770x272.png)
- More layers, more powerful?
- Yes and no: very deep networks are hard to train!
- Gradients vanish/explode when we stack too many layers

![reduce gradient](https://cdn-images-1.medium.com/max/800/1*fzfbjRP-Ki2-rftR6O5HWA.png)

### Residual Network
- Solution: a "shortcut" or a "skip connection"
- Allows the gradient to be directly backpropagated to earlier layers
- A deep network then works at least as well as a shallow network
- ![](https://cdn-images-1.medium.com/max/1600/1*1y9hueMSZAeo1Hbp9KYKiw.png)
- $z=Wx + b\\h=\theta(z)\\y=Wh + x$
- $y = F(x) + x$
- ![](https://codesachin.files.wordpress.com/2017/02/screen-shot-2017-02-17-at-4-35-49-pm.png)
- Each layer adds something to the previous value, rather than producing an entirely new value (see the sketch below).
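To make the skip connection concrete, here is a minimal PyTorch-style sketch of a basic residual block with an identity shortcut. The class name, channel count, and layer choices are illustrative assumptions, not the original ResNet code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Sketch of a residual block: y = F(x) + x (identity shortcut)."""
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions with batch norm
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: gradients flow directly back to earlier layers through "+ x"
        return F.relu(out + x)

# Example: input and output keep the same shape (identity block case)
x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```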
#### The identity block
![](https://cdn-images-1.medium.com/max/1600/1*OIMU3ekaWGvEdZpQlTUSyg.png)
- Input and output are of the same dimensions
- $y=F(x, \{W_i\})+x$
- No extra parameters

#### The convolutional block
![](https://cdn-images-1.medium.com/max/1600/1*U5wkA4O1IpY-ekXqFh0tUQ.png)
- Input and output have different dimensions
- $y=F(x, \{W_i\})+Wx$
- Needs extra parameters

### Architecture of ResNet

#### Bottleneck
- \# parameters: 256 x 64 + 64 x 3 x 3 x 64 + 64 x 256 = ~70K
- \# parameters just using a 3x3x256x256 conv layer: ~600K
- ![](https://i.stack.imgur.com/1DTb8.png)

#### Diagram
![](https://cdn-images-1.medium.com/max/1600/1*4tlPOipWjcwIoNUlQ6IWFQ.png)

### Summary
- Extreme depth: 152 layers
- Idea: at least as good as the shallow model
- $F(x)=H(x)-x$
- Training
    - Batch Norm
    - Xavier/2 initialization from He et al.
    - Mini-batch size 256
    - Weight decay of 1e-5
    - No dropout used
    - SGD + Momentum (0.9)
    - Learning rate: 0.1, divided by 10 when validation error plateaus

## MobileNet V1
![](https://i.imgur.com/bSffObV.png)
- Notation: $D_K$: kernel size, $M$: number of input channels, $N$: number of output channels, $D_F$: feature map size
- Standard convolution:
    - $\text{computational cost}=D_K*D_K*M*N*D_F*D_F$
- Depthwise separable convolution:
    - Depthwise convolution: $\text{computational cost}=D_K*D_K*M*D_F*D_F$
    - Pointwise (1x1) convolution: $\text{computational cost}=M*N*D_F*D_F$
    - Total: $\text{computational cost}=D_K*D_K*M*D_F*D_F+M*N*D_F*D_F$
- Reduction in computation:
    - $\frac{D_K*D_K*M*D_F*D_F+M*N*D_F*D_F}{D_K*D_K*M*N*D_F*D_F}=\frac{1}{N}+\frac{1}{D^2_K}$
- There are many ways to make a deep learning model run faster while still achieving pretty good accuracy. One of them is using depthwise separable convolution (see the sketch at the end of this note).
- The depthwise layer extracts features and the pointwise layer mixes channels to create new features.
- The first layer of the model should be a traditional convolution, because we don't want to lose information at the first step.
- The number of parameters is small, so we don't have to apply many techniques to prevent overfitting.
- Architecture
![](https://i.imgur.com/gVhWh42.png)
![](https://i.imgur.com/pTvcnm5.png)
- Training
    - RMSprop with asynchronous gradient descent
    - Less regularization and data augmentation
    - No side heads or label smoothing; additionally, reduce the amount of image distortion by limiting the size of the small crops used in large Inception training
    - Little or no weight decay (L2 regularization)

## Group Conv

## Question
- Residual block???
- SqueezeNet?? How does it relate to MobileNet?
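The sketch referenced in the MobileNet V1 section: a minimal PyTorch-style depthwise separable convolution block (depthwise 3x3 followed by pointwise 1x1, each with BN + ReLU, as described above). The class name and channel counts are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Sketch of a MobileNet-V1-style depthwise separable convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch),
        # cost ~ D_K * D_K * M * D_F * D_F
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # Pointwise: 1x1 convolution mixes channels, cost ~ M * N * D_F * D_F
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x

# Example: 32 -> 64 channels; far fewer multiplications than a full 3x3x32x64 convolution
block = DepthwiseSeparableConv(32, 64)
x = torch.randn(1, 32, 112, 112)
print(block(x).shape)  # torch.Size([1, 64, 112, 112])
```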