# MobileNet-V1: Summary and Implementation
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/0.png?token=AMAXSKKHKUZMTMMNM24KTUK6WMHWE">
>This post is divided into 2 sections: Summary and Implementation.
> We are going to review in depth the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/pdf/1704.04861.pdf), which introduces the MobileNet architecture.
> The implementation uses PyTorch. For the full implementation, please refer to this [repository](https://github.com/3outeille/Research-Paper-Summary/tree/master/src/architecture/mobilenet_v1/pytorch).
> If you want to read other "Summary and Implementation" posts, feel free to check them out on my [blog](https://ferdinandmom.tech/deep-learning/).
# I) Summary
- This paper describes a ==very small, low-latency and efficient network architecture called MobileNet for mobile and embedded vision applications.==
- To do so, ==they use depthwise separable convolutions and a set of 2 hyperparameters (the width and resolution multipliers)==.
## 1) Depthwise separable convolution
- To understand what a depthwise separable convolution really is, let's compare it to a normal convolution between a 12x12x3 input and 256 kernels of size 5x5x3.
- A depthwise separable convolution is divided into 2 parts:
    - **Depthwise convolution**.
    - **Pointwise convolution**.
**<ins>Depthwise convolution: </ins>**
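- In a normal convolution, **all channels** of a kernel are used together to produce a single feature map: each 5x5x3 kernel turns the 12x12x3 input into one 8x8x1 map.
- In a depthwise convolution, **each channel** of a kernel produces its own feature map: a single 5x5x3 kernel turns the 12x12x3 input into an 8x8x3 output.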
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/1.png?token=AMAXSKIH67DEERTBAXM2ZOK6WMIA6">
<figcaption > Figure: Normal convolution</figcaption>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/2.png?token=AMAXSKNEPT5RBYH6XDD3J5S6WMIBA">
<figcaption > Figure: Depthwise convolution</figcaption>
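
To make the "one feature map per channel" idea concrete, here is a minimal pure-Python sketch of a depthwise convolution on nested lists. The helper names `conv2d_single` and `depthwise_conv` are illustrative only and are not part of the linked repository; the all-ones input and kernels are placeholder data chosen so the result is easy to check.

```python
def conv2d_single(channel, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation of one channel with one 2D kernel."""
    k = len(kernel)
    out_h = len(channel) - k + 1
    out_w = len(channel[0]) - k + 1
    return [
        [
            sum(channel[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def depthwise_conv(image, kernels):
    """Filter each input channel with its own 2D kernel; channels are never mixed."""
    return [conv2d_single(channel, kernel) for channel, kernel in zip(image, kernels)]


# Placeholder data: a 3-channel 12x12 input and one 5x5 kernel slice per channel.
image = [[[1.0] * 12 for _ in range(12)] for _ in range(3)]
kernels = [[[1.0] * 5 for _ in range(5)] for _ in range(3)]

out = depthwise_conv(image, kernels)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)  # (3, 8, 8): still 3 channels, spatial size reduced from 12 to 8
```

The output keeps the 3 input channels, which is why a second step is needed to reach 256 channels.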
**<ins>Pointwise convolution: </ins>**
- To increase the number of channels in our output image to 256:
    - In a **normal convolution**, we just have to use **256 filters of size 5x5x3** on the 12x12x3 input.
    - In a **pointwise convolution**, we just have to use **256 filters of size 1x1x3** on the 8x8x3 output of the depthwise convolution.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/3.png?token=AMAXSKPZV72JHBQEPJ45GPC6WMIBA">
<figcaption > Figure: Normal convolution</figcaption>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/4.png?token=AMAXSKI2R4EZK23RLWFBGOC6WMIBC">
<figcaption > Figure: Pointwise convolution</figcaption>
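
The pointwise step can be sketched the same way. A 1x1 filter has one weight per input channel, so each output pixel is just a weighted sum of the input channels at that location. The function name `pointwise_conv` and the constant placeholder values below are assumptions made for illustration.

```python
def pointwise_conv(feature_maps, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [
            [
                sum(w_c * fmap[i][j] for w_c, fmap in zip(filter_weights, feature_maps))
                for j in range(width)
            ]
            for i in range(height)
        ]
        for filter_weights in weights
    ]


# Placeholder data: an 8x8x3 depthwise output and 256 filters of size 1x1x3.
feature_maps = [[[25.0] * 8 for _ in range(8)] for _ in range(3)]
weights = [[1.0] * 3 for _ in range(256)]

out = pointwise_conv(feature_maps, weights)
print((len(out), len(out[0]), len(out[0][0])))  # (256, 8, 8)
```

The spatial size is untouched; only the channel dimension grows from 3 to 256.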
- **What's the main difference between a depthwise separable convolution and a normal convolution?**
    - The **main difference** is the **number of computations**. In our example:
        - For a **normal convolution**, we have ((8x8x5x5)x3)x256 = **1,228,800** operations.
        - For a **depthwise separable convolution**, we have 4,800 + 49,152 = **53,952** operations:
            - in the **depthwise convolution**, (8x8x5x5)x3 = 4,800 operations.
            - in the **pointwise convolution**, ((8x8x1x1)x3)x256 = 49,152 operations.
    - ==We can clearly see that a depthwise separable convolution is **less expensive** than a normal convolution (about 22.8 times fewer computations, i.e. roughly 4.4% of the cost)==.
    - The reason is that, in a normal convolution, we are **transforming the image 256 times**, whereas in a depthwise separable convolution, we transform the image **once** and then **expand it 256 times** along the channel axis.
- The authors compared a MobileNet built with depthwise separable convolutions to one built with normal convolutions on ImageNet. **It turns out that accuracy only dropped by about 1%, while the number of parameters and operations was substantially reduced**.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/5.png?token=AMAXSKJZCA5FBIOK434LQZ26WMIBE">
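
The operation counts from the example above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for illustration. It also checks the counts against the reduction ratio given in the paper, $\frac{1}{N} + \frac{1}{D_K^2}$, where $N$ is the number of output channels and $D_K$ the kernel size.

```python
def standard_conv_ops(out_size, kernel_size, in_channels, n_filters):
    """Multiplications for a normal convolution producing an out_size x out_size map."""
    return out_size**2 * kernel_size**2 * in_channels * n_filters


def separable_conv_ops(out_size, kernel_size, in_channels, n_filters):
    """Multiplications for the depthwise and pointwise steps, returned separately."""
    depthwise = out_size**2 * kernel_size**2 * in_channels
    pointwise = out_size**2 * in_channels * n_filters
    return depthwise, pointwise


# 12x12x3 input, 5x5 kernels, 256 output channels -> 8x8 output maps.
standard = standard_conv_ops(8, 5, 3, 256)
depthwise, pointwise = separable_conv_ops(8, 5, 3, 256)
separable = depthwise + pointwise

print(standard)                        # 1228800
print(depthwise, pointwise)            # 4800 49152
print(separable)                       # 53952
print(round(standard / separable, 2))  # 22.78
print(separable / standard)            # 0.04390625, i.e. 1/256 + 1/25
```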
## 2) Hyperparameters
- ==They demonstrated how to build smaller and faster MobileNets using width multiplier ($\alpha$) and resolution multiplier ($\rho$) by trading off a reasonable amount of accuracy to reduce size and latency==.
- The **width multiplier** ($\alpha$) (also known as "**depth multiplier**"), with typical values $\{1, 0.75, 0.5, 0.25\}$, thins the network uniformly at each layer, leading to a **reduction in computational cost and number of parameters**.
    - For example, if the width multiplier is 1, the network starts off with 32 channels and ends up with 1024.
    - Using a width multiplier of 0.5 halves the number of channels in each layer, which reduces the number of computations by a factor of roughly 4 and the number of learnable parameters by a factor of roughly 3 (see Table 6). The new model is therefore faster but less accurate than the full model.
- The **resolution multiplier** ($\rho \in (0, 1]$) scales down the input image. It is set implicitly by choosing an input resolution of 224, 192, 160 or 128. Every feature map shrinks with it, so the **computational cost drops by roughly $\rho^2$** while the number of parameters stays the same.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/6.png?token=AMAXSKPKNQBNAGMOSJ3T2WC6WMIBG">
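
To see how the two multipliers act on the cost, here is a small sketch based on the per-layer cost given in the paper, $D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$. The function name and the choice of a 14x14x512 layer as the example are assumptions made for illustration.

```python
def separable_layer_cost(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """Mult-adds of one depthwise separable layer with both multipliers applied.

    d_k: kernel size, m / n: input / output channels, d_f: feature-map size.
    """
    m_scaled, n_scaled, d_scaled = alpha * m, alpha * n, rho * d_f
    depthwise = d_k * d_k * m_scaled * d_scaled * d_scaled
    pointwise = m_scaled * n_scaled * d_scaled * d_scaled
    return depthwise + pointwise


# Example layer: 3x3 depthwise separable conv, 512 -> 512 channels, 14x14 maps.
full = separable_layer_cost(3, 512, 512, 14)
half_width = separable_layer_cost(3, 512, 512, 14, alpha=0.5)
low_res = separable_layer_cost(3, 512, 512, 14, rho=160 / 224)

print(round(full / half_width, 2))  # 3.93: close to the factor of 4 (alpha squared)
print(round(low_res / full, 2))     # 0.51: equal to rho squared
```

The width multiplier falls slightly short of a full factor of 4 because the depthwise term only scales linearly with $\alpha$.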
## 3) Architecture
Here is the MobileNet-V1 architecture:
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/7.png?token=AMAXSKNCRMA65JQ4EVXJTUK6WMIBE">
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/8.png?token=AMAXSKKZFQDTHOSDDEOFAOK6WMIBI">
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v1/9.png?token=AMAXSKMNGERL23WED4IB3Y26WMIBI">
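
As a quick sanity check on the architecture, the spatial size of the feature maps can be traced layer by layer. The sketch below assumes 3x3 kernels with padding 1 everywhere and uses the stride pattern from the implementation in section II (one stem convolution followed by 13 depthwise separable blocks); `conv_out_size`, `STRIDES` and `trace_sizes` are illustrative names.

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Stride of the stem convolution, then of the 13 depthwise separable blocks.
STRIDES = [2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1]


def trace_sizes(input_size):
    """Return the spatial size of the input and of every layer output."""
    sizes = [input_size]
    for stride in STRIDES:
        sizes.append(conv_out_size(sizes[-1], stride=stride))
    return sizes


print(trace_sizes(224))
# [224, 112, 112, 56, 56, 28, 28, 14, 14, 14, 14, 14, 14, 7, 7]
print(trace_sizes(32)[-1])  # 1
```

With a 32x32 CIFAR-10 image the feature map is already 1x1 before the average pooling, so the pooling layer has nothing left to reduce.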
# II) Implementation
I am not going to implement the resolution multiplier, as I believe it can be handled during the preprocessing step by resizing the input images.
## 1) Architecture build
```python
from collections import OrderedDict

import torch
import torch.nn as nn


class DSConv(nn.Module):
    def __init__(self, f_3x3, f_1x1, stride=1, padding=0):
        super(DSConv, self).__init__()

        self.feature = nn.Sequential(OrderedDict([
            # Depthwise 3x3 (kernel size, groups and bias=False are assumed from
            # the layer's role): one filter per input channel.
            ('dconv', nn.Conv2d(f_3x3, f_3x3, kernel_size=3, stride=stride,
                                padding=padding, groups=f_3x3, bias=False)),
            ('bn1', nn.BatchNorm2d(f_3x3)),
            ('act1', nn.ReLU()),
            # Pointwise 1x1 (same assumption): mixes f_3x3 channels into f_1x1.
            ('pconv', nn.Conv2d(f_3x3, f_1x1, kernel_size=1, bias=False)),
            ('bn2', nn.BatchNorm2d(f_1x1)),
            ('act2', nn.ReLU())
        ]))

    def forward(self, x):
        out = self.feature(x)
        return out


class MobileNet(nn.Module):
    """
        MobileNet-V1 architecture for CIFAR-10.
    """
    # num_classes defaults to 1000 (ImageNet); CIFAR-10 only needs 10.
    def __init__(self, channels, width_multiplier=1.0, num_classes=1000):
        super(MobileNet, self).__init__()

        # Width multiplier: thin every layer uniformly.
        channels = [int(elt * width_multiplier) for elt in channels]

        self.conv = nn.Sequential(OrderedDict([
            ('conv', nn.Conv2d(3, channels[0], kernel_size=3,
                               stride=2, padding=1, bias=False)),
            ('bn', nn.BatchNorm2d(channels[0])),
            ('act', nn.ReLU())
        ]))

        # DSConv(in_channels, out_channels, stride, padding)
        self.features = nn.Sequential(OrderedDict([
            ('dsconv1', DSConv(channels[0], channels[1], 1, 1)),
            ('dsconv2', DSConv(channels[1], channels[2], 2, 1)),
            ('dsconv3', DSConv(channels[2], channels[2], 1, 1)),
            ('dsconv4', DSConv(channels[2], channels[3], 2, 1)),
            ('dsconv5', DSConv(channels[3], channels[3], 1, 1)),
            ('dsconv6', DSConv(channels[3], channels[4], 2, 1)),
            ('dsconv7_a', DSConv(channels[4], channels[4], 1, 1)),
            ('dsconv7_b', DSConv(channels[4], channels[4], 1, 1)),
            ('dsconv7_c', DSConv(channels[4], channels[4], 1, 1)),
            ('dsconv7_d', DSConv(channels[4], channels[4], 1, 1)),
            ('dsconv7_e', DSConv(channels[4], channels[4], 1, 1)),
            ('dsconv8', DSConv(channels[4], channels[5], 2, 1)),
            ('dsconv9', DSConv(channels[5], channels[5], 1, 1))
        ]))

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.linear = nn.Linear(channels[5], num_classes)

    def forward(self, x):
        out = self.conv(x)
        out = self.features(out)
        out = self.avgpool(out)
        out = torch.flatten(out, 1)  # (batch, channels, 1, 1) -> (batch, channels)
        out = self.linear(out)
        return out


def MobileNetV1():
    return MobileNet(channels=[32, 64, 128, 256, 512, 1024], width_multiplier=1)
```
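
To get a feel for what the width multiplier does to model size, the learnable parameters of this layout can be counted by hand without instantiating the network. The sketch below is a rough estimate under stated assumptions: bias-free convolutions, two learnable parameters (scale and shift) per BatchNorm channel, and a linear layer with bias. `count_parameters` is a hypothetical helper, not part of the repository.

```python
def count_parameters(channels, width_multiplier=1.0, num_classes=1000):
    """Count learnable parameters of the stem, 13 separable blocks and classifier."""
    c = [int(ch * width_multiplier) for ch in channels]

    # (in_channels, out_channels) of the 13 depthwise separable blocks.
    blocks = [(c[0], c[1]), (c[1], c[2]), (c[2], c[2]), (c[2], c[3]),
              (c[3], c[3]), (c[3], c[4])]
    blocks += [(c[4], c[4])] * 5
    blocks += [(c[4], c[5]), (c[5], c[5])]

    total = 3 * 3 * 3 * c[0] + 2 * c[0]            # stem 3x3 conv + BatchNorm
    for c_in, c_out in blocks:
        total += 3 * 3 * c_in + 2 * c_in           # depthwise 3x3 + BatchNorm
        total += c_in * c_out + 2 * c_out          # pointwise 1x1 + BatchNorm
    total += c[5] * num_classes + num_classes      # linear layer (weights + bias)
    return total


channels = [32, 64, 128, 256, 512, 1024]
full = count_parameters(channels, width_multiplier=1.0)
half = count_parameters(channels, width_multiplier=0.5)

print(full)                   # 4231976 (about 4.2M)
print(half)                   # 1331592 (about 1.3M)
print(round(full / half, 2))  # 3.18
```

Under these assumptions the counts line up with the roughly 4.2M and 1.3M parameters reported in the paper for width multipliers of 1 and 0.5, which is where the "factor of roughly 3" comes from.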
## 2) Training on CIFAR-10
```python
train_costs, val_costs = train_model()
```

```
[Epoch 1/15]: train-loss = 1.559708 | train-acc = 0.433 | val-loss = 0.032007 | val-acc = 0.586
[Epoch 2/15]: train-loss = 1.004967 | train-acc = 0.645 | val-loss = 0.025247 | val-acc = 0.701
[Epoch 3/15]: train-loss = 0.747347 | train-acc = 0.742 | val-loss = 0.021409 | val-acc = 0.745
[Epoch 4/15]: train-loss = 0.596251 | train-acc = 0.793 | val-loss = 0.022274 | val-acc = 0.776
[Epoch 5/15]: train-loss = 0.493434 | train-acc = 0.832 | val-loss = 0.014880 | val-acc = 0.797
[Epoch 6/15]: train-loss = 0.415764 | train-acc = 0.856 | val-loss = 0.012545 | val-acc = 0.808
[Epoch 7/15]: train-loss = 0.356464 | train-acc = 0.877 | val-loss = 0.014014 | val-acc = 0.803
[Epoch 8/15]: train-loss = 0.318513 | train-acc = 0.891 | val-loss = 0.011231 | val-acc = 0.808
[Epoch 9/15]: train-loss = 0.265170 | train-acc = 0.909 | val-loss = 0.017637 | val-acc = 0.824
[Epoch 10/15]: train-loss = 0.234187 | train-acc = 0.920 | val-loss = 0.021207 | val-acc = 0.827
[Epoch 11/15]: train-loss = 0.203297 | train-acc = 0.930 | val-loss = 0.020546 | val-acc = 0.829
[Epoch 12/15]: train-loss = 0.184567 | train-acc = 0.937 | val-loss = 0.015497 | val-acc = 0.827
[Epoch 13/15]: train-loss = 0.167254 | train-acc = 0.942 | val-loss = 0.018095 | val-acc = 0.824
[Epoch 14/15]: train-loss = 0.142094 | train-acc = 0.951 | val-loss = 0.019001 | val-acc = 0.826
[Epoch 15/15]: train-loss = 0.128344 | train-acc = 0.955 | val-loss = 0.019820 | val-acc = 0.817
```
## 3) Evaluating model
```python
nb_test_examples = 10000
correct = 0

# Make sure BatchNorm uses its running statistics (no-op if already in eval mode).
model.eval()
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Make predictions.
        prediction = model(inputs)

        # Retrieve the predicted class indices.
        _, predicted_class = torch.max(prediction.data, 1)

        # Compute number of correct predictions.
        correct += (predicted_class == labels).float().sum().item()

test_accuracy = correct / nb_test_examples
print('Test accuracy: {}'.format(test_accuracy))
```

```
Test accuracy: 0.8007
```