> 1. https://medium.com/ai-blog-tw/efficientnet-v2%E7%9A%84%E8%83%8C%E5%BE%8C-%E9%87%8B%E6%94%BEmobilenet%E5%9C%A8gpu-tpu%E4%B8%8A%E7%9A%84%E6%95%88%E7%8E%87-f19abde55b05
> 2. https://www.zhihu.com/question/343343895
> 3. https://medium.com/ai-academy-taiwan/efficient-cnn-%E4%BB%8B%E7%B4%B9-%E4%B8%80-mobilenetv1-304c96f5eb7e
> 4. https://medium.com/ai-academy-taiwan/efficient-cnn-%E4%BB%8B%E7%B4%B9-%E4%BA%8C-mobilenetv2-7809721f0bc8
> 5. https://www.youtube.com/watch?v=_OZsGQHB41s
> 6. https://www.youtube.com/watch?v=fR_0o25kigM&list=PLhhyoLH6IjfxeoooqP9rhU3HJIAVAJ3Vz&index=20
## MobileNet V1
### Depth-wise Separable Convolution
Traditional convolution operations consider the correlation of both space and channel inside a single kernel: every output channel looks at every input channel over the full kernel window.

:::info
:bulb: **The Amount of Computation**
*Ksize × Ksize × out_featureH × out_featureW × in_channel × out_channel* (each element of the output feature map needs Ksize × Ksize multiplications per input channel, repeated for every output channel)
:::
Separable convolutions use a **depth-wise convolution** to capture correlated **spatial information** and a **point-wise convolution** to capture **channel-related information**.

:::info
:bulb: **The Amount of Computation**
*Ksize × Ksize × in_channel × out_featureH × out_featureW (depth-wise) + in_channel × out_channel × out_featureH × out_featureW (point-wise)*
:::
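As a concrete illustration, here is a minimal PyTorch sketch of a depth-wise separable convolution (my own module, not from the paper): the depth-wise step uses `groups=in_channels` so each channel gets its own filter, and the point-wise step is a 1 × 1 convolution that mixes channels.
```python
import torch
from torch import nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise conv (spatial, per channel) followed by a point-wise 1x1 conv (channel mixing)."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # groups=in_channels -> every input channel is convolved with its own K x K filter
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        # 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
y = DepthwiseSeparableConv(32, 64)(x)   # -> [1, 64, 56, 56]
# Multiplication count for this example:
#   standard conv:   3*3*56*56*32*64              ~= 57.8M
#   separable conv:  3*3*32*56*56 + 32*64*56*56   ~= 7.3M
```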
## MobileNet V2
### Linear Bottlenecks
The feature maps produced by a model's layers contain redundant or unimportant information. To trim the redundancy while keeping accuracy, we can reduce the channel size of each layer so that it holds just the important information. However, if ReLU is applied to such an already-reduced feature map, part of the remaining information (every value below 0) will be lost. In other words, ReLU is only safe to apply when the feature map still has enough redundancy to absorb that loss. All in all, a linear bottleneck produces a compact feature map while keeping all the important information: it reduces the channel size but applies no non-linearity.

**(b):** Depth-wise -> Point-wise (channel increased)
**\(c\) and (d):** Depth-wise -> ReLU -> Point-wise (channel reduced, producing the linear bottleneck) -> Point-wise -> ReLU -> Depth-wise (repeated)
:::success
***TLDR: Reduce channel size, don’t use ReLU***
:::
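Here is a small sketch of the intuition (shapes are illustrative, not from the paper): after the point-wise projection down to the bottleneck, MobileNetV2 keeps the output linear, because any values zeroed out by a ReLU at this low-dimensional stage could not be recovered later.
```python
import torch
from torch import nn

x = torch.randn(1, 144, 14, 14)                  # expanded feature map
project = nn.Conv2d(144, 24, 1, bias=False)      # point-wise projection to the bottleneck

linear_bottleneck = project(x)                   # MobileNetV2: no activation here
relu_bottleneck = torch.relu(project(x))         # hypothetical: applying ReLU anyway

# Roughly half of the already-compressed feature map would be wiped out by the ReLU
print((relu_bottleneck == 0).float().mean())
```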
### Inverted Residual Block

The biggest difference between the residual block and the inverted residual block is that the skip connection happens at the linear bottleneck layer. Since the linear bottleneck already contains all the important information, the connection only carries that important information, while the expansion layer exists solely to reduce the information loss caused by ReLU.
:::info
:bulb: **Implementation Notes:**
1. The residual connection is applied only when **input_channel == output_channel** and **stride == 1**. (*Intuition: a connection is only possible when the two feature maps have the same shape.*)
2. Use ReLU6 to cap the maximum feature map value at 6.0.
3. **Depth-wise layers** do not change the channel size, but they can change the feature map size.
4. **Point-wise layers** do not change the feature map size, but they can change the channel size.
:::
### Code Example (OpenCL)
```cpp
// Block 2: point-wise expand to 96 -> depth-wise (stride 2) -> point-wise project to 24; no residual (stride != 1)
model.add(new ocl::pwconv2d(model.back(), 96, false)); //9
model.add(new ocl::batchNorm(model.back(), 1e-5, true));
model.add(new ocl::relu6(model.back(), true));
model.add(new ocl::dwconv2d(model.back(), 3, 2, 1, false));
model.add(new ocl::batchNorm(model.back(), 1e-5, true));
model.add(new ocl::relu6(model.back(), true));
model.add(new ocl::pwconv2d(model.back(), 24, false));
model.add(new ocl::batchNorm(model.back(), 1e-5, true));
// Block 3: point-wise expand to 144 -> depth-wise (stride 1) -> point-wise project to 24 -> residual add
model.add(new ocl::pwconv2d(model.back(), 144, false)); //17
model.add(new ocl::batchNorm(model.back(), 1e-5, true));
model.add(new ocl::relu6(model.back(), true));
model.add(new ocl::dwconv2d(model.back(), 3, 1, 1, false));
model.add(new ocl::batchNorm(model.back(), 1e-5, true));
model.add(new ocl::relu6(model.back(), true));
model.add(new ocl::pwconv2d(model.back(), 24, false));
model.add(new ocl::batchNorm(model.back(), 1e-5, true));
model.add(new ocl::add(model.back(), model[16], true));
// Block 3 has a residual connection: model[16] is the block input (the end of block 2), and the input and output channel sizes are both 24 with stride == 1
// Block 4: point-wise expand to 144 -> depth-wise (stride 2) -> point-wise project to 32; no residual (stride != 1, channels differ)
model.add(new ocl::pwconv2d(model.back(), 144, false)); //26
model.add(new ocl::batchNorm(model.back(), 1e-5, true));
model.add(new ocl::relu6(model.back(), true));
model.add(new ocl::dwconv2d(model.back(), 3, 2, 1, false));
model.add(new ocl::batchNorm(model.back(), 1e-5, true));
model.add(new ocl::relu6(model.back(), true));
model.add(new ocl::pwconv2d(model.back(), 32, false));
model.add(new ocl::batchNorm(model.back(), 1e-5, true));
```
### Possible Pitfall
MobileNet reduces the amount of computation needed compared to a normal convolutional layer. However, the benchmark referenced below shows that MobileNet's GPU inference time can actually be higher than that of other, heavier models.
> https://www.zhihu.com/question/343343895

We can assume the GPU is able to run each layer's computation fully in parallel, so the more layers a model has, the more memory transfers the GPU has to perform for the intermediate feature maps. MobileNet slices one convolution layer into two layers, a depth-wise and a point-wise convolution, so the number of layers increases. Because each layer depends on the previous one, the previous layer must finish before the next can start. Therefore, as MobileNet increases the number of layers, the number of memory transfers also increases, and the whole computation becomes memory-bound on the GPU. The CPU, on the other hand, is bound mainly by the amount of computation, so it benefits from separable convolutions.
## EfficientNet V1
EfficientNet provides a compound coefficient for scaling up a convolutional neural network. The computational resources required grow with this coefficient, so the total computational cost can be estimated and adjusted according to the hardware we have or the performance we want.
:::info
:information_source: **There are three independent ways to scale up a convolutional model**
1. Increase the input image resolution.
2. Increase the model's width (channel size).
3. Increase the model's depth (number of layers).

:bulb: **The compound scaling method**
**depth:** d = α^φ
**width:** w = β^φ
**resolution:** r = γ^φ
with α = 1.2, β = 1.1, γ = 1.15
:::
:::success
***TLDR: A systematic way to scale up the convolutional neural network.***
:::
### Use of inverted residual block (mobile inverted bottleneck MBConv), Squeeze & excitation
A grid search is applied to find the constants and coefficients (α, β, γ).

The following Python list and dictionary show the model architecture configuration and the scaling parameters of each variant.
```python
# expand_ratio, out_channels, repeats, stride, kernel_size
base_model = [
    [1, 16, 1, 1, 3],
    [6, 24, 2, 2, 3],
    [6, 40, 2, 2, 5],
    [6, 80, 3, 2, 3],
    [6, 112, 3, 1, 5],
    [6, 192, 4, 2, 5],
    [6, 320, 1, 1, 3],
]

# phi_value, resolution, drop_rate
phi_values = {
    "b0": (0, 224, 0.2),
    "b1": (0.5, 240, 0.2),
    "b2": (1, 260, 0.3),
    "b3": (2, 300, 0.3),
    "b4": (3, 380, 0.4),
    "b5": (4, 456, 0.4),
    "b6": (5, 528, 0.5),
    "b7": (6, 600, 0.5),
}
```
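As a small sketch (my own helper, following the compound-scaling formulas above), this is how a `phi` from `phi_values` can be turned into depth and width factors that scale `base_model`; rounding channels to a multiple of 4 is an implementation choice, not something fixed by the paper.
```python
import math

def scale_factors(phi, alpha=1.2, beta=1.1):
    """Depth and width multipliers for a given compound coefficient phi."""
    depth_factor = alpha ** phi
    width_factor = beta ** phi
    return depth_factor, width_factor

phi, resolution, drop_rate = phi_values["b3"]      # phi = 2, 300 x 300 input
depth_factor, width_factor = scale_factors(phi)

for expand_ratio, channels, repeats, stride, kernel_size in base_model:
    scaled_channels = 4 * math.ceil(channels * width_factor / 4)  # keep channels divisible by 4
    scaled_repeats = math.ceil(repeats * depth_factor)
    print(scaled_channels, scaled_repeats)
```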
### Squeeze & excitation (channel attention)
> https://towardsdatascience.com/squeeze-and-excitation-networks-9ef5e71eacd7

Global average pooling is applied to each channel, giving a vector of size C. This vector is passed through two 1 × 1 convolutions and a sigmoid layer, producing one value between 0 and 1 per channel. Each value represents how much attention a channel should receive and is used to scale that channel.
```python
from torch import nn

# Compute an attention score for each channel
class SqueezeExcitation(nn.Module):
    def __init__(self, in_channels, reduced_dim):
        super(SqueezeExcitation, self).__init__()
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # C x H x W -> C x 1 x 1
            nn.Conv2d(in_channels, reduced_dim, 1),
            nn.SiLU(),
            nn.Conv2d(reduced_dim, in_channels, 1),
            nn.Sigmoid(),                            # output will be in [0, 1]
        )

    def forward(self, x):
        return x * self.se(x)
```
### Stochastic Depth
Similar to dropout, but entire layers (residual branches) are dropped instead of individual activations.

This forces the model to learn more general features. However, layers are only dropped while the model is training; at inference time every layer is kept.
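Below is a minimal sketch of the idea (my own, not torchvision's `StochasticDepth`): during training the residual branch of a block is dropped per sample with some probability and rescaled by the survival probability, while at inference the branch is always kept.
```python
import torch
from torch import nn

class StochasticDepth(nn.Module):
    """Randomly skip the residual branch of a block during training."""
    def __init__(self, survival_prob=0.8):
        super().__init__()
        self.survival_prob = survival_prob

    def forward(self, branch_out, identity):
        if not self.training:
            return branch_out + identity            # inference: always keep the branch
        # one Bernoulli draw per sample in the batch
        keep = (torch.rand(branch_out.shape[0], 1, 1, 1,
                           device=branch_out.device) < self.survival_prob).float()
        # divide by survival_prob so the expected output matches inference
        return branch_out * keep / self.survival_prob + identity
```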
### Inverted Residual Block Implementation notes
```
(features): Sequential(
  (0): CNNBlock(
    (cnn): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (silu): SiLU()
  )
  (1): InvertedResidualBlock(
    (conv): Sequential(
      (0): CNNBlock(
        (cnn): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
        (bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (silu): SiLU()
      )
      (1): SqueezeExcitation(
        (se): Sequential(
          (0): AdaptiveAvgPool2d(output_size=1)
          (1): Conv2d(32, 8, kernel_size=(1, 1), stride=(1, 1))
          (2): SiLU()
          (3): Conv2d(8, 32, kernel_size=(1, 1), stride=(1, 1))
          (4): Sigmoid()
        )
      )
      (2): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (3): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  ...
)
```
The printed structure of EfficientNet V1 above shows that the first inverted residual block does not expand: the first depth-wise convolution keeps the channel size at 32 in and 32 out. At the end of this inverted residual block, a point-wise convolution reduces the channel size to 16, creating the linear bottleneck.
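Putting the pieces together, here is a compact sketch of one inverted residual block that matches the print-out above (my own simplified version; it reuses the `SqueezeExcitation` module defined earlier and omits stochastic depth for brevity):
```python
import torch
from torch import nn

class CNNBlock(nn.Module):
    """Conv -> BatchNorm -> SiLU."""
    def __init__(self, in_c, out_c, kernel_size, stride, groups=1):
        super().__init__()
        self.cnn = nn.Conv2d(in_c, out_c, kernel_size, stride,
                             padding=kernel_size // 2, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(out_c)
        self.silu = nn.SiLU()

    def forward(self, x):
        return self.silu(self.bn(self.cnn(x)))

class InvertedResidualBlock(nn.Module):
    def __init__(self, in_c, out_c, kernel_size, stride, expand_ratio, reduction=4):
        super().__init__()
        hidden = in_c * expand_ratio
        self.use_residual = in_c == out_c and stride == 1    # same rule as MobileNetV2
        layers = []
        if expand_ratio != 1:
            layers.append(CNNBlock(in_c, hidden, 1, 1))                    # point-wise expand
        layers += [
            CNNBlock(hidden, hidden, kernel_size, stride, groups=hidden),  # depth-wise
            SqueezeExcitation(hidden, hidden // reduction),                # channel attention
            nn.Conv2d(hidden, out_c, 1, bias=False),                       # point-wise project
            nn.BatchNorm2d(out_c),                                         # no activation: linear bottleneck
        ]
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.conv(x) if self.use_residual else self.conv(x)

x = torch.randn(1, 32, 112, 112)
block = InvertedResidualBlock(32, 16, kernel_size=3, stride=1, expand_ratio=1)
print(block(x).shape)   # torch.Size([1, 16, 112, 112]), matching block (1) above
```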
## EfficientNet V2
EfficientNet V1 does not run efficiently on GPUs; most of its performance gain only shows up on sequential processors such as CPUs. The following bullet points show the modifications EfficientNet V2 makes to the original EfficientNet V1.
1. Replace depth-wise separable convolutions with a fused version (Fused-MBConv) in the early stages.
2. Neural Architecture Search is applied to find an architecture suitable for GPUs.
3. Use non-uniform scaling, meaning each stage has its own scaling factors.
4. Gradually increase the image resolution during training while also increasing the regularization strength (see the sketch after this list).
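Item 4 (progressive learning) can be sketched as a simple training schedule; the numbers below are illustrative, not the paper's exact values:
```python
# Illustrative progressive-learning schedule: image size and regularization
# strength are interpolated linearly from "easy" to "hard" across training stages.
def progressive_schedule(stage, num_stages=4,
                         min_size=128, max_size=300,
                         min_dropout=0.1, max_dropout=0.3):
    t = stage / (num_stages - 1)
    image_size = int(min_size + t * (max_size - min_size))
    dropout = min_dropout + t * (max_dropout - min_dropout)
    return image_size, dropout

for stage in range(4):
    print(progressive_schedule(stage))   # (128, 0.1) ... (300, 0.3)
```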
### Replace depth-wise separable convolution with fused version
The MBConv block is the same as the bottleneck block in MobileNetV2; the fused version replaces its expansion point-wise convolution and depth-wise convolution with a single regular convolution.

### Model Architecture

:::info
:bulb: **Implementation Notes**
1. ***Fused-MBConv*** simply **replaces the expansion point-wise convolution and the depth-wise convolution with a single traditional convolution layer**.
2. The squeeze-and-excitation layer is only applied in the MBConv blocks, the same kind of block used in MobileNetV3.
3. The first Fused-MBConv does not expand in its first convolution layer, since its expand factor is 1.
4. Batch normalization **uses eps = 1e-3 to avoid division by zero, instead of the PyTorch default of 1e-5**.
:::
### Fused-MBConv building block
```
(0): FusedMBConv(
  (block): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(24, 96, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(96, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
      (2): SiLU(inplace=True)
    )
    (1): Conv2dNormActivation(
      (0): Conv2d(96, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (1): BatchNorm2d(48, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (stochastic_depth): StochasticDepth(p=0.01, mode=row)
)
```
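For comparison with the print-out above, here is a minimal sketch of a Fused-MBConv block (my own simplified version, not torchvision's implementation): a single regular 3 × 3 convolution replaces the 1 × 1 expansion plus depth-wise convolution, followed by the 1 × 1 projection with no activation.
```python
import torch
from torch import nn

class FusedMBConv(nn.Module):
    def __init__(self, in_c, out_c, expand_ratio, stride):
        super().__init__()
        hidden = in_c * expand_ratio
        self.use_residual = in_c == out_c and stride == 1
        self.block = nn.Sequential(
            # fused: one regular 3x3 conv instead of 1x1 expansion + 3x3 depth-wise
            nn.Conv2d(in_c, hidden, 3, stride, padding=1, bias=False),
            nn.BatchNorm2d(hidden, eps=1e-3),   # note eps = 1e-3, as in the print-out above
            nn.SiLU(inplace=True),
            # 1x1 projection back down, no activation (linear bottleneck)
            nn.Conv2d(hidden, out_c, 1, bias=False),
            nn.BatchNorm2d(out_c, eps=1e-3),
        )

    def forward(self, x):
        return x + self.block(x) if self.use_residual else self.block(x)

x = torch.randn(1, 24, 56, 56)
print(FusedMBConv(24, 48, expand_ratio=4, stride=2)(x).shape)  # torch.Size([1, 48, 28, 28])
```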