# MobileNet V1~V3
###### tags: `paper notes` `deep learning`

[V1 paper link](https://arxiv.org/pdf/1704.04861.pdf)
[V1 Code](https://github.com/shanglianlm0525/PyTorch-Networks/blob/master/Lightweight/MobileNetV1.py)
[V2 paper link](https://arxiv.org/pdf/1801.04381.pdf)
[V2 Code](https://github.com/shanglianlm0525/PyTorch-Networks/blob/master/Lightweight/MobileNetV2.py)
[V3 paper link](https://arxiv.org/pdf/1905.02244.pdf)
[V3 Code](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mobilenetv3.py)

## What is MobileNet?
- A lightweight CNN image classification model proposed by Google in 2017, aimed mainly at phones and embedded devices
![](https://i.imgur.com/FPIVZQb.jpg)
- It is not the only lightweight model; others include SqueezeNet and ShuffleNet
- The Pixel Neural Core in my [Pixel 4](https://ai.googleblog.com/2019/11/introducing-next-generation-on-device.html) runs MobileNetEdgeTPU, which builds on MobileNetV3 and was optimized for the Edge TPU with AutoML
- Incidentally, the newer [Pixel 6](https://ai.googleblog.com/2021/11/improved-on-device-ml-on-pixel-6-with.html) ships MobileNetEdgeTPUV2 (image classification), SpaghettiNet-EdgeTPU (object detection), FaceSSD (face detection), and MobileBERT (NLP)
![](https://i.imgur.com/E4sMnjn.jpg)

## MobileNet V1: Depthwise Separable Convolution
- Depthwise Separable Conv == Depthwise Conv + Pointwise Conv
- Depthwise Conv convolves each channel separately to cut computation, while Pointwise Conv learns the relationships between the channels of the same feature map
- Standard Conv and Depthwise Conv differ in how they treat channels: a standard kernel spans all input channels, a depthwise kernel sees only one
![](https://i.imgur.com/ook6xdF.jpg)
- Pointwise Conv is simply a 1x1 Conv: it convolves across all channels at each 1x1 spatial location
![](https://i.imgur.com/2RkXLbO.jpg)
- Incidentally, MobileNet was not the first to propose separable convolution. The idea already appeared in [Simplifying ConvNets for Fast Learning, 2012](https://liris.cnrs.fr/Documents/Liris-5659.pdf) and was extended to the depthwise separable form in [Rigid-Motion Scattering For Image Classification, 2013](https://www.di.ens.fr/data/publications/papers/phd_sifre.pdf)

### Why does convolving each channel separately reduce the parameter count?
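
A quick numeric sketch of the question in this heading, using the note's notation ($D_k$ kernel size, $D_F$ feature map size, $M$ input channels, $N$ output channels). It assumes the cost formulas from the V1 paper: $D_k \cdot D_k \cdot M \cdot N \cdot D_F \cdot D_F$ for a standard conv, and $D_k \cdot D_k \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$ for a depthwise separable conv. The function names and the example layer sizes are made up for illustration.

```python
def standard_conv_cost(d_k: int, d_f: int, m: int, n: int) -> int:
    """Mult-adds of a standard conv: D_k * D_k * M * N * D_F * D_F."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_separable_cost(d_k: int, d_f: int, m: int, n: int) -> int:
    """Mult-adds of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = d_k * d_k * m * d_f * d_f  # one D_k x D_k filter per input channel
    pointwise = m * n * d_f * d_f          # 1x1 conv that mixes the M channels into N
    return depthwise + pointwise


def cost_ratio(d_k: int, n: int) -> float:
    """Closed form of separable / standard cost: 1/N + 1/D_k^2."""
    return 1.0 / n + 1.0 / (d_k * d_k)


if __name__ == "__main__":
    # Example layer sizes chosen only for illustration.
    d_k, d_f, m, n = 3, 14, 512, 512
    std = standard_conv_cost(d_k, d_f, m, n)
    sep = depthwise_separable_cost(d_k, d_f, m, n)
    print(f"standard:  {std:,} mult-adds")
    print(f"separable: {sep:,} mult-adds")
    print(f"ratio:     {sep / std:.4f} (closed form {cost_ratio(d_k, n):.4f})")
```

Setting `d_f = 1` drops the spatial factor and leaves the weight counts (ignoring biases), so the same ratio applies to parameters. For 3x3 kernels the saving comes out at roughly 8 to 9 times.
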
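Notation:
- $D_k$: kernel width and height
- $D_F$: feature map width and height
- $M$: number of input channels
- $N$: number of output channels (number of kernels)

- Cost of a Standard Conv:
![](https://i.imgur.com/4yYEuSs.jpg)
- Cost of a Depthwise Conv:
![](https://i.imgur.com/6XrG95F.jpg)
- Cost of a Depthwise Separable Conv:
![](https://i.imgur.com/bXIOG7Y.jpg)
- Decomposition
![](https://i.imgur.com/6rm1wdl.jpg)

### How much computation is saved?
![](https://i.imgur.com/pBCBRsK.jpg)

### Architecture comparison
- Standard Conv vs Depthwise Separable Conv
![](https://i.imgur.com/S2iNLTK.jpg)
![](https://i.imgur.com/tZQp62o.jpg)
- Full V1 architecture
![](https://i.imgur.com/qkx2Zoo.png)

### V1 result
- MobileNet performs on par with much larger models while using less computation and a smaller model size
![](https://i.imgur.com/N3IbMXh.png)
- The authors also break down the share of computation taken by each component; the 1x1 Conv layers take the largest share
![](https://i.imgur.com/Ht6uEr0.png)

## MobileNet V2: Inverted Residuals and Linear Bottlenecks
- The authors found that many kernels in V1's depthwise separable convolutions end up empty. They traced this to ReLU: applied in a low-dimensional space it loses a lot of information, whereas in a high-dimensional space it does not
    - This is the dying ReLU problem
![](https://i.imgur.com/HChl8FX.png)
- V2 therefore adds Linear Bottlenecks on top of V1's depthwise separable convolution: it raises the dimensionality of the input before ReLU is applied and replaces the ReLU after the final projection
![](https://i.imgur.com/UeNtzOJ.jpg)

### Linear Bottlenecks
- The ReLU in the Point-wise Conv is replaced with a linear function
![](https://i.imgur.com/cYyPTMI.png)
- V2 also applies a Point-wise Conv (1x1 Conv) before the Depth-wise Conv to raise the dimensionality so that more features can be extracted; the paper calls this the expansion layer
![](https://i.imgur.com/IFEcriu.jpg)

### Inverted Residuals
- The recently popular [ConvNeXt, 2020s](https://github.com/facebookresearch/ConvNeXt) adopts this design to reduce computation
- A residual connection is added here, which makes the block more memory efficient
- Note that a classical residual block connects layers with many channels, whereas an inverted residual connects only the bottlenecks
- Layers drawn with diagonal hatching do not use a nonlinearity
![](https://i.imgur.com/7g0B0SS.png)

```python
import torch.nn as nn


# Assumed minimal stand-ins for the helpers defined in the linked repository.
def _conv_bn(cin, cout, k, stride=1, groups=1, relu=True):
    layers = [nn.Conv2d(cin, cout, k, stride, k // 2, groups=groups, bias=False),
              nn.BatchNorm2d(cout)]
    return nn.Sequential(*layers, nn.ReLU6(inplace=True)) if relu else nn.Sequential(*layers)

def Conv1x1BNReLU(cin, cout): return _conv_bn(cin, cout, 1)
def Conv3x3BNReLU(cin, cout, stride, groups): return _conv_bn(cin, cout, 3, stride, groups)
def Conv1x1BN(cin, cout): return _conv_bn(cin, cout, 1, relu=False)


class InvertedResidual(nn.Module):
    def __init__(self, in_channels, out_channels, stride, expansion_factor=6):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        mid_channels = in_channels * expansion_factor
        self.bottleneck = nn.Sequential(
            Conv1x1BNReLU(in_channels, mid_channels),                               # expansion
            Conv3x3BNReLU(mid_channels, mid_channels, stride, groups=mid_channels),  # depthwise
            Conv1x1BN(mid_channels, out_channels),                                   # linear projection
        )
        if self.stride == 1:
            # 1x1 conv shortcut so the channel counts always match; many other
            # implementations use an identity shortcut when in_channels == out_channels
            self.shortcut = Conv1x1BN(in_channels, out_channels)

    def forward(self, x):
        out = self.bottleneck(x)
        # No residual connection when stride == 2: the spatial size changes
        out = (out + self.shortcut(x)) if self.stride == 1 else out
        return out
```

- Computation comparison
![](https://i.imgur.com/NKYsy4x.png)
- MobileNet V1 vs MobileNet V2
![](https://i.imgur.com/fSv01JY.jpg)
- ResNet vs MobileNet V2
    - ResNet: reduce dimensions first (0.25x) -> Conv -> expand
    - MobileNetV2: expand dimensions first (6x) -> Conv -> reduce
    - The reason for this design is that they want features to be extracted in a high-dimensional space
![](https://i.imgur.com/FRmvt12.jpg)

### Overall architecture
![](https://i.imgur.com/xtiwAmV.png)
- The table shows an expansion factor $t$ = 6, meaning the expansion Point-wise Conv outputs 6*k channels for k input channels (the first bottleneck uses $t$ = 1)
- $c$ = number of output channels, $n$ = number of repeats, $s$ = stride

Comparison with other similar networks
- When V2 reaches a 3x3 Conv with stride=2, it drops the residual connection because the input and output sizes differ (the spatial size is halved)
![](https://i.imgur.com/MbLoXWL.png)
- Compared with other networks, V2 already has fewer parameters in the intermediate layers
![](https://i.imgur.com/HYaSaxj.png)

### V2 Result
- On ImageNet (Google Pixel 1)
![](https://i.imgur.com/EhS20ep.png)
- On COCO
    - SSDLite here refers to a lightweight SSD variant in which the Conv layers in SSD are replaced with separable convolutions (depthwise followed by 1x1 projection)
![](https://i.imgur.com/1e97pa9.png)
![](https://i.imgur.com/3ACcjjQ.png)
    - Its parameter count is far smaller than that of YOLOv2 and SSD

## MobileNet V3: Squeeze and Excitation with NAS
- Besides keeping the features of the previous versions, V3 adds NAS and the Squeeze and Excitation structure from SENet, which uses Global Average Pooling (GAP) to compute a weight for each feature map, strengthening the influence of important feature maps and weakening that of unimportant ones
    - NAS: Neural Architecture Search, an AutoML method that Google has always been fond of ~~(because few others can afford to play with it)~~
- In addition, the swish activation function is replaced with h-swish to avoid computing the sigmoid, and the V2 architecture is adjusted slightly to reduce computation cost further

### SENet
[paper link](https://arxiv.org/pdf/1709.01507.pdf)
- The main goal is to learn the relationships among feature channels and highlight how important each channel is, thereby improving model performance
    - The learning is done through attention or gating, so there is more than one way to implement it
- It can be used to strengthen important feature maps and weaken unimportant ones
![](https://i.imgur.com/faehOlm.png)
- x is the input, x = w * h * c1 (width * height * channel)
- After the convolutional transform $F_{tr}$, the output is w * h * c2 (width * height * channel), that is, c2 feature maps $u_c$ of size w*h
![](https://i.imgur.com/ayiponP.png =400x150)
- $v_c$ denotes the parameters of the c-th filter

SENet pipeline:
1. The squeeze operation $F_{sq}$ outputs 1 * 1 * c2 (the Squeeze part)
    - The authors use global average pooling as the squeeze operation (averaging over the w and h dimensions to get one scalar per channel), which prepares the input for the learning step that follows
    ![](https://i.imgur.com/JFul9qD.png)
2. The $F_{ex}(W)$ operation learns the weights (the Excitation part)
    - $F_{ex}(W)$ consists of two fully connected layers and two nonlinear activation functions (ReLU, Sigmoid), which together form a gating mechanism
    ![](https://i.imgur.com/vkZex16.png)
3. Finally, $F_{scale}$ outputs the re-weighted w * h * c2 tensor
    - $s_c$ is the weight of each feature map. The paper notes that this amounts to learning a self-attention weight for each feature map, but it does not explain in detail how to swap in an SA version
    ![](https://i.imgur.com/nYHYNtm.png)

Implementation of SENet in [timm](https://github.com/rwightman/pytorch-image-models/blob/07379c6d5dbb809b3f255966295a4b03f23af843/timm/models/efficientnet_blocks.py#L17), using gating

```python
import torch.nn as nn
from timm.models.layers import create_act_layer  # lives in timm.layers in newer timm releases


class SqueezeExcite(nn.Module):
    """ Squeeze-and-Excitation w/ specific features for EfficientNet/MobileNet family

    Args:
        in_chs (int): input channels to layer
        rd_ratio (float): ratio of squeeze reduction
        act_layer (nn.Module): activation layer of containing block
        gate_layer (Callable): attention gate function
        force_act_layer (nn.Module): override block's activation fn if this is set/bound
        rd_round_fn (Callable): specify a fn to calculate rounding of reduced chs
    """

    def __init__(
            self, in_chs, rd_ratio=0.25, rd_channels=None, act_layer=nn.ReLU,
            gate_layer=nn.Sigmoid, force_act_layer=None, rd_round_fn=None):
        super(SqueezeExcite, self).__init__()
        if rd_channels is None:
            rd_round_fn = rd_round_fn or round
            rd_channels = rd_round_fn(in_chs * rd_ratio)
        act_layer = force_act_layer or act_layer
        self.conv_reduce = nn.Conv2d(in_chs, rd_channels, 1, bias=True)
        self.act1 = create_act_layer(act_layer, inplace=True)
        self.conv_expand = nn.Conv2d(rd_channels, in_chs, 1, bias=True)
        self.gate = create_act_layer(gate_layer)

    def forward(self, x):
        x_se = x.mean((2, 3), keepdim=True)  # squeeze: global average pooling over H and W
        x_se = self.conv_reduce(x_se)        # excitation: reduce -> act -> expand
        x_se = self.act1(x_se)
        x_se = self.conv_expand(x_se)
        return x * self.gate(x_se)           # scale: re-weight each channel
```

An SE block can be plugged into an Inception block or a residual block
![](https://i.imgur.com/0bwmZC7.png)

MobileNet V2
![](https://i.imgur.com/vEMBGON.png)

MobileNetV2 + Squeeze-and-Excite = MobileNetV3
- The SE block is placed after the depthwise conv, forming the new bottleneck
- Because the SE computation adds some cost, the authors fix the width of the SE bottleneck to 1/4 of the expansion layer's channel count; they found this improves accuracy without a noticeable increase in latency
![](https://i.imgur.com/RCFdf3Y.png)

### NAS
- I have not worked with it much, so only briefly
- It mainly uses platform-aware NAS + NetAdapt
    - The former searches each block of the network under a compute budget, called block-wise search
![](https://i.imgur.com/RaParbs.png)
    - The latter tunes the number of kernels in the layers inside each fixed block, called layer-wise search
![](https://i.imgur.com/G2X4rxF.png)
- The search considers two kinds of changes: 1) reducing the size of any expansion layer, 2) reducing the bottleneck in all blocks
- While running NAS they also found that a few layers of V2 are relatively expensive, which is why they modified the architecture further

### Architecture tweaks
- Their experiments showed that the last 1x1 Conv used to raise the dimensionality in V2 (the final expansion layer) adds considerable computation, so it was moved after the avg pooling
    - The pipeline first uses avg pooling to shrink the feature map from 7x7 to 1x1 and then applies the 1x1 Conv to raise the dimensionality, cutting that layer's computation by a factor of 7x7=49
- The authors also removed the preceding 3x3 Conv and 1x1 Conv, which lowers latency by 15 ms without losing accuracy
- V2
![](https://i.imgur.com/tG8oUOc.png)
- V3
    - A sharp eye will also notice that V3 changes the number of filters in the first layer: V2 uses 32 3x3 conv kernels, and experiments showed this can be cut to 16 without affecting accuracy while saving another 2 ms
![](https://i.imgur.com/swNdrwP.jpg)
- Overall architecture after the tweaks
![](https://i.imgur.com/qDIV2CU.png)

### Nonlinearities
- The original swish uses a sigmoid, which is very expensive to compute on mobile devices
![](https://i.imgur.com/ByUgXtT.png)
- So they modified it: the activations in some of the deeper layers use h-swish, and ReLU replaces swish in the rest
![](https://i.imgur.com/3413W1t.png)

Building on ReLU has two advantages:
1. It can be computed on any platform
2. It removes the potential accuracy loss caused by imperfect floating-point arithmetic

![](https://i.imgur.com/f98J0rb.png)
- Latency (ms) saved by using h-swish
    - @n, n = number of channels
![](https://i.imgur.com/5CSmTUV.png)
- The authors' experiments show that h-swish should be used in layers with channel >= 80 to get the best results
![](https://i.imgur.com/dl5uXPt.png)

### V3 Result
- The development path of MobileNet V3 and the improvement at each step
![](https://i.imgur.com/HlcyspF.png)
- V1 vs V2 vs V3 on COCO
![](https://i.imgur.com/fuOQ2zJ.png)
- V1 vs V2 on Pixel 1
![](https://i.imgur.com/MuegIrN.png)
- Experiments on Pixel 1, 2, 3
![](https://i.imgur.com/GkkITtm.png)
![](https://i.imgur.com/tspno96.png)

## References
- [轻量级神经网络“巡礼”（二）—— MobileNet，从V1到V3](https://zhuanlan.zhihu.com/p/70703846)
- [卷積神經網路(Convolutional neural network, CNN): 1×1卷積計算在做什麼](https://chih-sheng-huang821.medium.com/%E5%8D%B7%E7%A9%8D%E7%A5%9E%E7%B6%93%E7%B6%B2%E8%B7%AF-convolutional-neural-network-cnn-1-1%E5%8D%B7%E7%A9%8D%E8%A8%88%E7%AE%97%E5%9C%A8%E5%81%9A%E4%BB%80%E9%BA%BC-7d7ebfe34b8)
- [深度學習-MobileNet (Depthwise separable convolution)](https://chih-sheng-huang821.medium.com/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92-mobilenet-depthwise-separable-convolution-f1ed016b3467)
- [[論文筆記] MobileNet演變史-從MobileNetV1到MobileNetV3](https://chihangchen.medium.com/%E8%AB%96%E6%96%87%E7%AD%86%E8%A8%98-mobilenetv3%E6%BC%94%E8%AE%8A%E5%8F%B2-f5de728725bc)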
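
As a supplement to the Nonlinearities section above, here is a minimal sketch comparing swish and h-swish in plain Python. It assumes the usual definitions, swish(x) = x * sigmoid(x) and h-swish(x) = x * ReLU6(x + 3) / 6; the function names are illustrative and do not come from any of the linked repositories.

```python
import math


def relu6(x: float) -> float:
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def swish(x: float) -> float:
    """swish(x) = x * sigmoid(x); needs an exponential."""
    return x / (1.0 + math.exp(-x))


def hard_swish(x: float) -> float:
    """h-swish(x) = x * ReLU6(x + 3) / 6; only an add, a clamp and a multiply."""
    return x * relu6(x + 3.0) / 6.0


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  h-swish={hard_swish(x):+.4f}")
```

The piecewise-linear gate ReLU6(x + 3) / 6 stands in for the sigmoid, which is why h-swish is cheaper on hardware where exponentials are slow.
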

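
The factor of 49 quoted for the V3 last-stage change (moving the final 1x1 Conv after avg pooling) can be checked with a small sketch. It assumes the mult-adds of a 1x1 Conv scale with the number of spatial positions; the function name and the channel counts are illustrative assumptions, not values taken from these notes.

```python
def pointwise_conv_mult_adds(height: int, width: int, c_in: int, c_out: int) -> int:
    """Mult-adds of a 1x1 conv: one c_in x c_out product per spatial position."""
    return height * width * c_in * c_out


if __name__ == "__main__":
    c_in, c_out = 960, 1280  # illustrative channel counts (assumption)
    before = pointwise_conv_mult_adds(7, 7, c_in, c_out)  # 1x1 conv applied on the 7x7 map
    after = pointwise_conv_mult_adds(1, 1, c_in, c_out)   # same conv applied after pooling to 1x1
    print(f"before pooling: {before:,}")
    print(f"after pooling:  {after:,}")
    print(f"reduction:      {before // after}x")
```

The ratio depends only on the spatial size, so the same factor holds for any choice of channel counts.
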