--- tags: machine-learning --- # MobileNet-V1: Summary and Implementation <div style="text-align: center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/0.png?token=AMAXSKKHKUZMTMMNM24KTUK6WMHWE"> </div> >This post is divided into 2 sections: Summary and Implementation. > >We are going to have an in-depth review of [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/pdf/1704.04861.pdf) paper which introduces the MobileNet architecture. > > The implementation uses Pytorch as framework. To see full implementation, please refer to this [repository](https://github.com/3outeille/Research-Paper-Summary/tree/master/src/architecture/mobilenet_v1/pytorch). > > Also, if you want to read other "Summary and Implementation", feel free to check them at my [blog](https://ferdinandmom.tech/deep-learning/). # I) Summary - This paper describes a ==very small, low latency and efficient network architecture called MobileNet for mobile and embedded vision applications.== - To do so, ==they use depthwise separable convolutions and a set of 2 hyperparameters (width and resolution multiplier)==. ## 1) Depthwise separable convolution - To understand what a depthwise separable convolution really is, let's compare it to a normal convolution between a 12x12x3 input and 256 kernels of size 5x5x3. - A depthwise separable convolution is divided into 2 parts: - **Depthwise convolution**. - **Pointwise convolution**. **<ins>Depthwise convolution: </ins>** - In a normal convolution, **all channels** of a kernel are used to produce a feature map. - In a depthwise convolution, **each channel** of a kernel is used to produce a feature map. <div style="text-align: center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/1.png?token=AMAXSKIH67DEERTBAXM2ZOK6WMIA6"> <figcaption > Figure: Normal convolution</figcaption> </div> <br> <div style="text-align: center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/2.png?token=AMAXSKNEPT5RBYH6XDD3J5S6WMIBA"> <figcaption > Figure: Depthwise convolution</figcaption> </div> <br> **<ins>Pointwise convolution: </ins>** - To increase the number of channels in our output image to 256: - In a **normal convolution**, we just have to use **256 filters of size 5x5x3**. - In a **pointwise convolution**, we just have to use **256 filters of size 1x1x3**. <div style="text-align: center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/3.png?token=AMAXSKPZV72JHBQEPJ45GPC6WMIBA"> <figcaption > Figure: Normal convolution</figcaption> </div> <br> <div style="text-align: center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/4.png?token=AMAXSKI2R4EZK23RLWFBGOC6WMIBC"> <figcaption > Figure: Pointwise convolution</figcaption> </div> <br> - **What's the main difference between a depthwise separable convolution and normal convolution ?** - The **main difference** is the **number of computations**. In our example: - For a **normal convolution**, we have ((8x8x5x5)x3)x256 = **1,228,800** operations. - For a **depthwise separable convolution**, we have 4800 + 49,152 = **53,952** operations: - in a **depthwise convolution**, (8x8x5x5)x3 = 4800 operations. - in a **pointwise convolution**, ((8x8x1x1)x3)x256 = 49,152 operations. - ==We can clearly see that a depthwise separable convolution is **less expensive** than a normal convolution (~22.7% less computations)==. - The reason is, in a normal convolution, we are **transforming the image 256 times** whereas in a depthwise separable convolution, we transform the image **once** and then **expand it 256 times** along the channel axis. - The authors made a comparison between a MobileNet with depthwise seperable convolution and one with normal convolution on Imagenet. **Turns out, the accuracy only dropped by 1% but has less parameters and operations**. <div style="text-align: center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/5.png?token=AMAXSKJZCA5FBIOK434LQZ26WMIBE"> </div> ## 2) Hyperparameters - ==They demonstrated how to build smaller and faster MobileNets using width multiplier ($\alpha$) and resolution multiplier ($\rho$) by trading off a reasonable amount of accuracy to reduce size and latency==. - The **width multiplier** ($\alpha$) (also known as "**depth multiplier**") with values $\{1, 0.75, 0.5, 0.25\}$, thins a network uniformly at each layer leading to a **reduction in computational cost and number of parameters**. - For example, if the width multiplier is 1, the network starts off with 32 channels and ends up with 1024. - Using a width multiplier of 0.5 will halve the number of channels used in each layer resulting in a reduction of number of computations by a factor of 4 and a number of learnable parameters by a factor 3 (see Table 6). Therefore, the new model is faster but less accurate than the full model. - The **resolution multiplier** ($\rho$) with values $\{224, 192, 160, 128\}$ reduces the input size leading to a **reduction in computational cost**. <div style="text-align: center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/6.png?token=AMAXSKPKNQBNAGMOSJ3T2WC6WMIBG"> </div> <br> ## 3) Architecture Here is MobileNet-V1 architecture: <div style="text-align: center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/7.png?token=AMAXSKNCRMA65JQ4EVXJTUK6WMIBE"> </div> <br> <div style="text-align: center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/8.png?token=AMAXSKKZFQDTHOSDDEOFAOK6WMIBI"> </div> <br> <div style="text-align: center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/9.png?token=AMAXSKMNGERL23WED4IB3Y26WMIBI"> </div> <br> # II) Implementation I am not going to implement the resolution multiplier as I believe this could be handle during the preprocessing step. ## 1) Architecture build ```python class DSConv(nn.Module): def __init__(self, f_3x3, f_1x1, stride=1, padding=0): super(DSConv, self).__init__() self.feature = nn.Sequential(OrderedDict([ ('dconv', nn.Conv2d(f_3x3, f_3x3, kernel_size=3, groups=f_3x3, stride=stride, padding=padding, bias=False )), ('bn1', nn.BatchNorm2d(f_3x3)), ('act1', nn.ReLU()), ('pconv', nn.Conv2d(f_3x3, f_1x1, kernel_size=1, bias=False)), ('bn2', nn.BatchNorm2d(f_1x1)), ('act2', nn.ReLU()) ])) def forward(self, x): out = self.feature(x) return out class MobileNet(nn.Module): """ MobileNet-V1 architecture for CIFAR-10. """ def __init__(self, channels, width_multiplier=1.0, num_classes=1000): super(MobileNet, self).__init__() channels = [int(elt * width_multiplier) for elt in channels] self.conv = nn.Sequential(OrderedDict([ ('conv', nn.Conv2d(3, channels[0], kernel_size=3, stride=2, padding=1, bias=False)), ('bn', nn.BatchNorm2d(channels[0])), ('act', nn.ReLU()) ])) self.features = nn.Sequential(OrderedDict([ ('dsconv1', DSConv(channels[0], channels[1], 1, 1)), ('dsconv2', DSConv(channels[1], channels[2], 2, 1)), ('dsconv3', DSConv(channels[2], channels[2], 1, 1)), ('dsconv4', DSConv(channels[2], channels[3], 2, 1)), ('dsconv5', DSConv(channels[3], channels[3], 1, 1)), ('dsconv6', DSConv(channels[3], channels[4], 2, 1)), ('dsconv7_a', DSConv(channels[4], channels[4], 1, 1)), ('dsconv7_b', DSConv(channels[4], channels[4], 1, 1)), ('dsconv7_c', DSConv(channels[4], channels[4], 1, 1)), ('dsconv7_d', DSConv(channels[4], channels[4], 1, 1)), ('dsconv7_e', DSConv(channels[4], channels[4], 1, 1)), ('dsconv8', DSConv(channels[4], channels[5], 2, 1)), ('dsconv9', DSConv(channels[5], channels[5], 1, 1)) ])) self.avgpool = nn.AdaptiveAvgPool2d((1,1)) self.linear = nn.Linear(channels[5], num_classes) def forward(self, x): out = self.conv(x) out = self.features(out) out = self.avgpool(out) out = torch.flatten(out, 1) out = self.linear(out) return out ``` ```python= def MobileNetV1(): return MobileNet(channels=[32, 64, 128, 256, 512, 1024], width_multiplier=1) ``` ## 2) Training on CIFAR-10 ``` train_costs, val_costs = train_model() ``` ``` [Epoch 1/15]: train-loss = 1.559708 | train-acc = 0.433 | val-loss = 0.032007 | val-acc = 0.586 [Epoch 2/15]: train-loss = 1.004967 | train-acc = 0.645 | val-loss = 0.025247 | val-acc = 0.701 [Epoch 3/15]: train-loss = 0.747347 | train-acc = 0.742 | val-loss = 0.021409 | val-acc = 0.745 [Epoch 4/15]: train-loss = 0.596251 | train-acc = 0.793 | val-loss = 0.022274 | val-acc = 0.776 [Epoch 5/15]: train-loss = 0.493434 | train-acc = 0.832 | val-loss = 0.014880 | val-acc = 0.797 [Epoch 6/15]: train-loss = 0.415764 | train-acc = 0.856 | val-loss = 0.012545 | val-acc = 0.808 [Epoch 7/15]: train-loss = 0.356464 | train-acc = 0.877 | val-loss = 0.014014 | val-acc = 0.803 [Epoch 8/15]: train-loss = 0.318513 | train-acc = 0.891 | val-loss = 0.011231 | val-acc = 0.808 [Epoch 9/15]: train-loss = 0.265170 | train-acc = 0.909 | val-loss = 0.017637 | val-acc = 0.824 [Epoch 10/15]: train-loss = 0.234187 | train-acc = 0.920 | val-loss = 0.021207 | val-acc = 0.827 [Epoch 11/15]: train-loss = 0.203297 | train-acc = 0.930 | val-loss = 0.020546 | val-acc = 0.829 [Epoch 12/15]: train-loss = 0.184567 | train-acc = 0.937 | val-loss = 0.015497 | val-acc = 0.827 [Epoch 13/15]: train-loss = 0.167254 | train-acc = 0.942 | val-loss = 0.018095 | val-acc = 0.824 [Epoch 14/15]: train-loss = 0.142094 | train-acc = 0.951 | val-loss = 0.019001 | val-acc = 0.826 [Epoch 15/15]: train-loss = 0.128344 | train-acc = 0.955 | val-loss = 0.019820 | val-acc = 0.817 ``` ## 3) Evaluating model ```python nb_test_examples = 10000 correct = 0 model.eval().cuda() with torch.no_grad(): for inputs, labels in test_loader: inputs, labels = inputs.to(device), labels.to(device) # Make predictions. prediction = model(inputs) # Retrieve predictions indexes. _, predicted_class = torch.max(prediction.data, 1) # Compute number of correct predictions. correct += (predicted_class == labels).float().sum().item() test_accuracy = correct / nb_test_examples print('Test accuracy: {}'.format(test_accuracy)) ``` ``` Test accuracy: 0.8007 ```