---
tags: machine-learning
---
# MobileNet-V2: Summary and Implementation
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/0.png">
</div>
>This post is divided into 2 sections: Summary and Implementation.
>
>We are going to have an in-depth review of [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/pdf/1801.04381.pdf) paper which introduces the MobileNet-V2 architecture.
>
> The implementation uses PyTorch as the framework. To see the full implementation, please refer to this [repository](https://github.com/3outeille/Research-Paper-Summary/tree/master/src/architecture/mobilenet_v2/pytorch).
>
> Also, if you want to read other "Summary and Implementation", feel free to check them at my [blog](https://ferdinandmom.engineer/deep-learning/).
# I) Summary
- This paper introduces the MobileNet-V2 architecture which is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers.
- The authors show that removing non-linearities in the narrow layers is important to maintain representational power and improve performance.
- It results in a very memory-efficient inference model.
## 1) Depthwise separable convolution
- To understand what a depthwise separable convolution really is, let's compare it to a normal convolution of a 12x12x3 input with 256 kernels of size 5x5x3 (stride 1, no padding, which produces an 8x8x256 output).
- A depthwise separable convolution is divided into 2 parts:
- **Depthwise convolution**.
- **Pointwise convolution**.
**<ins>Depthwise convolution: </ins>**
- In a normal convolution, **all channels** of a kernel are combined to produce a single feature map.
- In a depthwise convolution, **each channel** of the kernel is applied to its corresponding input channel, producing one feature map per channel.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/1.png">
<figcaption > Figure: Normal convolution</figcaption>
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/2.png">
<figcaption > Figure: Depthwise convolution</figcaption>
</div>
<br>
**<ins>Pointwise convolution: </ins>**
- To increase the number of channels in our output image to 256:
- In a **normal convolution**, we just have to use **256 filters of size 5x5x3**.
- In a **pointwise convolution**, we just have to use **256 filters of size 1x1x3**.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/3.png">
<figcaption > Figure: Normal convolution</figcaption>
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/4.png">
<figcaption > Figure: Pointwise convolution</figcaption>
</div>
<br>
- **What's the main difference between a depthwise separable convolution and a normal convolution?**
- The **main difference** is the **number of computations**. In our example:
- For a **normal convolution**, we have ((8x8x5x5)x3)x256 = **1,228,800** operations.
- For a **depthwise separable convolution**, we have 4,800 + 49,152 = **53,952** operations:
- in a **depthwise convolution**, (8x8x5x5)x3 = 4,800 operations.
- in a **pointwise convolution**, ((8x8x1x1)x3)x256 = 49,152 operations.
- ==We can clearly see that a depthwise separable convolution is **less expensive** than a normal convolution (~22.8x fewer operations)== (the short sketch below reproduces these numbers).
- The reason is that a normal convolution transforms the image 256 times, whereas a depthwise separable convolution transforms it once (depthwise) and then simply expands the result to 256 channels (pointwise).
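- To make these numbers concrete, here is a minimal PyTorch sketch (mine, not from the paper) that builds both versions with `nn.Conv2d` and recomputes the multiplication counts for the 12x12x3 / 5x5 / 256-filter toy example above:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 12, 12)  # toy input: 12x12 image, 3 channels

# Normal convolution: 256 kernels of size 5x5x3.
normal_conv = nn.Conv2d(3, 256, kernel_size=5, bias=False)

# Depthwise separable convolution:
#   - depthwise: one 5x5 filter per input channel (groups=3),
#   - pointwise: 256 filters of size 1x1x3.
depthwise = nn.Conv2d(3, 3, kernel_size=5, groups=3, bias=False)
pointwise = nn.Conv2d(3, 256, kernel_size=1, bias=False)

print(normal_conv(x).shape)           # torch.Size([1, 256, 8, 8])
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 256, 8, 8])

# Multiplication counts for the 8x8 output maps:
normal_ops    = 8*8 * 5*5 * 3 * 256   # 1,228,800
depthwise_ops = 8*8 * 5*5 * 3         # 4,800
pointwise_ops = 8*8 * 1*1 * 3 * 256   # 49,152
print(normal_ops, depthwise_ops + pointwise_ops)  # 1228800 53952
```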
## 2) Linear bottleneck
- ==Linear bottleneck layers are just bottleneck layers with a linear activation.==
- It has long been assumed that the manifold of interest could be embedded in a low-dimensional subspace. ==In other words, the data representation we are interested in could be embedded in a small subspace.==
- Thus, one could simply reduce the dimensionality of a layer until it matches the manifold of interest.
- However, this intuition breaks down because deep neural networks use ReLU, which squashes away too much information when the features are already low-dimensional (Figure 1 illustrates this; a small numerical sketch at the end of this section shows the same effect).
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/5.png" width="60%">
</div>
<br>
- However, there are 2 properties that indicate when the manifold of interest can lie in a low-dimensional subspace (of the higher-dimensional activation space):
<br>
1. If the manifold of interest remains non-zero volume after ReLU transformation, it corresponds to a linear transformation.
2. ReLU is capable of preserving complete information about the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space.
<br>
- ==Thus, if we assume the manifold of interest is low-dimensional, **we can capture this by inserting linear bottleneck layers into the convolutional blocks**==.
- ==Experimental evidence shows that using non-linear layers in the bottleneck destroys too much information in low-dimensional space==, even though, in general, linear bottleneck models are strictly less powerful than models with non-linearities.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/6.png" width="50%">
</div>
<br>
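- The following small experiment (a rough sketch inspired by Figure 1, not the paper's exact code) shows the information-loss effect numerically: 2-D points are embedded into n dimensions with a random matrix, passed through ReLU, and projected back with the pseudo-inverse. The reconstruction error is much larger when n is small.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 1000)  # 1000 random 2-D points: our low-dimensional "manifold"

for n in [2, 3, 5, 15, 30]:
    errs = []
    for _ in range(10):                      # average over several random embeddings
        T = torch.randn(n, 2)                # random embedding into n dimensions
        y = torch.relu(T @ x)                # ReLU applied in the n-dimensional space
        x_rec = torch.linalg.pinv(T) @ y     # project back to 2-D
        errs.append((x - x_rec).pow(2).mean())
    print(f"n = {n:2d} | mean reconstruction error = {torch.stack(errs).mean().item():.4f}")

# The error is much larger for small n: ReLU destroys information when the
# manifold fills the space, but preserves most of it when the manifold lies in a
# low-dimensional subspace of a higher-dimensional space.
```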
## 3) Inverted Residuals
- The original residual block contains an input followed by several bottlenecks and then an expansion, and the shortcuts connect the thick layers (layers with many channels).
- However, inspired by the intuition that the bottlenecks actually contain all the necessary information while the expansion layer acts merely as a non-linear transformation, MobileNetV2 uses shortcuts directly between the bottlenecks (thin layers); a minimal code sketch is given at the end of this section. In the figure below, hatched layers use a linear activation.
- ReLU6 is used as the non-linear activation because of its robustness with low-precision computation.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/7.png">
</div>
<br>
- The new result reported in this paper is that shortcuts connecting the bottlenecks perform better than shortcuts connecting the expanded layers.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/8.png" width="50%">
</div>
<br>
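- As a minimal sketch (assumptions mine: stride 1 and equal input/output channels, so the shortcut applies), an inverted residual block with expansion factor t = 6 looks like this. Note the narrow → wide → narrow channel flow and the absence of an activation after the final projection:

```python
import torch
import torch.nn as nn

def inverted_residual(c, t=6):
    """Sketch of one inverted residual block: expand -> depthwise -> linear projection."""
    return nn.Sequential(
        # 1x1 expansion: c -> c*t channels (wide), with ReLU6.
        nn.Conv2d(c, c * t, kernel_size=1, bias=False),
        nn.BatchNorm2d(c * t), nn.ReLU6(),
        # 3x3 depthwise convolution on the wide representation, with ReLU6.
        nn.Conv2d(c * t, c * t, kernel_size=3, padding=1, groups=c * t, bias=False),
        nn.BatchNorm2d(c * t), nn.ReLU6(),
        # 1x1 linear projection back to c channels (thin): no activation here.
        nn.Conv2d(c * t, c, kernel_size=1, bias=False),
        nn.BatchNorm2d(c),
    )

block = inverted_residual(24)
x = torch.randn(1, 24, 56, 56)
out = block(x) + x        # shortcut between the thin (24-channel) ends
print(out.shape)          # torch.Size([1, 24, 56, 56])
```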
## 4) Information flow interpretation
- MobileNetV2 provides a natural separation between two concepts that have been tangled together in previous architectures:
- Capacity: input/output domains of the building blocks (bottleneck layers).
- Expressiveness: layer transformation, a non-linear function that converts input to the output (expansion layers).
- The authors say that exploring these concepts separately is an important direction for future research.
## 5) Architecture
- This is a fully convolutional network (no linear layers). Here is its architecture:
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/9.png">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/10.png" width="60%">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/11.png">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/12.png">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/13.png">
</div>
<br>
# II) Implementation
- The paper doesn't explain how to deal with an input/output channel mismatch in the shortcut connections. Thus, in this implementation, a shortcut is only used when the stride is 1 and the input/output channels match.
## 1) Architecture build
```python=
import torch.nn as nn
from collections import OrderedDict


class LambdaLayer(nn.Module):
    def __init__(self, lambd):
        super(LambdaLayer, self).__init__()
        self.lambd = lambd

    def forward(self, x):
        return self.lambd(x)


class Bottleneck(nn.Module):
    def __init__(self, in_channels, out_channels, t, stride):
        super(Bottleneck, self).__init__()
        self.stride = stride
        self.in_channels = in_channels
        self.out_channels = out_channels

        self.features = nn.Sequential(OrderedDict([
            # 1x1 pointwise convolution: expand by the factor t (ReLU6).
            ('pconv1', nn.Conv2d(in_channels,
                                 in_channels*t,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0,
                                 bias=False)),
            ('bn1', nn.BatchNorm2d(in_channels*t)),
            ('act1', nn.ReLU6()),
            # 3x3 depthwise convolution (groups = number of channels, ReLU6).
            ('dconv', nn.Conv2d(in_channels*t,
                                in_channels*t,
                                kernel_size=3,
                                groups=in_channels*t,
                                stride=stride,
                                padding=1,
                                bias=False)),
            ('bn2', nn.BatchNorm2d(in_channels*t)),
            ('act2', nn.ReLU6()),
            # 1x1 pointwise convolution: linear bottleneck (no activation).
            ('pconv3', nn.Conv2d(in_channels*t,
                                 out_channels,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0,
                                 bias=False)),
            ('bn3', nn.BatchNorm2d(out_channels))
        ]))

    def forward(self, x):
        out = self.features(x)
        # Shortcut between the thin layers, only when shapes match.
        if self.stride == 1 and self.in_channels == self.out_channels:
            out += x
        return out
```
```python=
class MobileNet(nn.Module):
    def __init__(self, block_type, bottleneck_settings, width_multiplier, num_classes):
        super(MobileNet, self).__init__()
        self.num_classes = num_classes
        self.b_s = bottleneck_settings
        self.b_s['c'] = [int(elt * width_multiplier) for elt in self.b_s['c']]
        self.in_channels = int(32 * width_multiplier)
        self.out_channels = int(1280 * width_multiplier)

        # Feature
        self.conv0 = nn.Sequential(OrderedDict([
            ('conv0', nn.Conv2d(3, self.in_channels, 1, stride=2, bias=False)),
            ('bn0', nn.BatchNorm2d(self.in_channels)),
            ('act0', nn.ReLU6())
        ]))

        self.bottleneck1 = self.__build_layer(block_type,
                                              self.in_channels,
                                              self.b_s['c'][0],
                                              self.b_s['t'][0],
                                              self.b_s['s'][0],
                                              self.b_s['n'][0])

        self.bottleneck2 = self.__build_layer(block_type,
                                              self.b_s['c'][0],
                                              self.b_s['c'][1],
                                              self.b_s['t'][1],
                                              self.b_s['s'][1],
                                              self.b_s['n'][1])

        self.bottleneck3 = self.__build_layer(block_type,
                                              self.b_s['c'][1],
                                              self.b_s['c'][2],
                                              self.b_s['t'][2],
                                              self.b_s['s'][2],
                                              self.b_s['n'][2])

        self.bottleneck4 = self.__build_layer(block_type,
                                              self.b_s['c'][2],
                                              self.b_s['c'][3],
                                              self.b_s['t'][3],
                                              self.b_s['s'][3],
                                              self.b_s['n'][3])

        self.bottleneck5 = self.__build_layer(block_type,
                                              self.b_s['c'][3],
                                              self.b_s['c'][4],
                                              self.b_s['t'][4],
                                              self.b_s['s'][4],
                                              self.b_s['n'][4])

        self.bottleneck6 = self.__build_layer(block_type,
                                              self.b_s['c'][4],
                                              self.b_s['c'][5],
                                              self.b_s['t'][5],
                                              self.b_s['s'][5],
                                              self.b_s['n'][5])

        self.bottleneck7 = self.__build_layer(block_type,
                                              self.b_s['c'][5],
                                              self.b_s['c'][6],
                                              self.b_s['t'][6],
                                              self.b_s['s'][6],
                                              self.b_s['n'][6])

        # Classifier
        self.conv8 = nn.Sequential(OrderedDict([
            ('conv8', nn.Conv2d(self.b_s['c'][6], self.out_channels, 1, bias=False)),
            ('bn8', nn.BatchNorm2d(self.out_channels)),
            ('act8', nn.ReLU6())
        ]))

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.conv9 = nn.Conv2d(self.out_channels, num_classes, 1)

    def __build_layer(self, block_type, in_channels, out_channels, t, s, n):
        # Each stage repeats the block n times; only the first block uses stride s.
        layers = []
        tmp_channels = in_channels

        for i in range(n):
            if i == 0:
                layers.append(block_type(tmp_channels, out_channels, t, s))
            else:
                layers.append(block_type(tmp_channels, out_channels, t, 1))
            tmp_channels = out_channels

        return nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv0(x)
        out = self.bottleneck1(out)
        out = self.bottleneck2(out)
        out = self.bottleneck3(out)
        out = self.bottleneck4(out)
        out = self.bottleneck5(out)
        out = self.bottleneck6(out)
        out = self.bottleneck7(out)
        out = self.conv8(out)
        out = self.avgpool(out)
        out = self.conv9(out)
        out = out.view(-1, self.num_classes)
        return out
```
```python=
def MobileNetV2():
    # Bottleneck settings from Table 2 of the paper:
    #   c = output channels, t = expansion factor, s = stride, n = repetitions.
    bottleneck_settings = {
        'c': [16, 24, 32, 64, 96, 160, 320],
        't': [1, 6, 6, 6, 6, 6, 6],
        's': [1, 2, 2, 2, 1, 2, 1],
        'n': [1, 2, 3, 4, 3, 3, 1]
    }

    return MobileNet(block_type=Bottleneck,
                     bottleneck_settings=bottleneck_settings,
                     width_multiplier=.5,
                     num_classes=1000)
```
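- As a quick sanity check (not part of the original post's code), the model can be instantiated and run on a dummy batch to confirm the output shape:

```python
import torch

model = MobileNetV2()
x = torch.randn(2, 3, 224, 224)   # dummy batch of two 224x224 RGB images
print(model(x).shape)             # torch.Size([2, 1000])
```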
## 2) Training on CIFAR-10
```
train_costs, val_costs = train_model()
```
```
[Epoch 1/15]: train-loss = 1.641850 | train-acc = 0.397 | val-loss = 0.035919 | val-acc = 0.528
[Epoch 2/15]: train-loss = 1.192912 | train-acc = 0.569 | val-loss = 0.035297 | val-acc = 0.612
[Epoch 3/15]: train-loss = 0.997534 | train-acc = 0.642 | val-loss = 0.033239 | val-acc = 0.654
[Epoch 4/15]: train-loss = 0.868714 | train-acc = 0.692 | val-loss = 0.028015 | val-acc = 0.696
[Epoch 5/15]: train-loss = 0.760505 | train-acc = 0.733 | val-loss = 0.021540 | val-acc = 0.721
[Epoch 6/15]: train-loss = 0.681773 | train-acc = 0.763 | val-loss = 0.013692 | val-acc = 0.752
[Epoch 7/15]: train-loss = 0.617383 | train-acc = 0.786 | val-loss = 0.023159 | val-acc = 0.749
[Epoch 8/15]: train-loss = 0.568520 | train-acc = 0.802 | val-loss = 0.017240 | val-acc = 0.749
[Epoch 9/15]: train-loss = 0.527746 | train-acc = 0.816 | val-loss = 0.017224 | val-acc = 0.763
[Epoch 10/15]: train-loss = 0.491380 | train-acc = 0.828 | val-loss = 0.019151 | val-acc = 0.778
[Epoch 11/15]: train-loss = 0.469383 | train-acc = 0.837 | val-loss = 0.019102 | val-acc = 0.785
[Epoch 12/15]: train-loss = 0.432044 | train-acc = 0.849 | val-loss = 0.022228 | val-acc = 0.785
[Epoch 13/15]: train-loss = 0.404078 | train-acc = 0.860 | val-loss = 0.015088 | val-acc = 0.790
[Epoch 14/15]: train-loss = 0.385579 | train-acc = 0.865 | val-loss = 0.016302 | val-acc = 0.793
[Epoch 15/15]: train-loss = 0.363964 | train-acc = 0.872 | val-loss = 0.017327 | val-acc = 0.794
```
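- `train_model` itself is not reproduced in this post (the full version lives in the repository linked above). Purely as an illustration, a minimal training loop under the same assumptions (a `model`, a CIFAR-10 `train_loader`, and a `device` already defined) could look like the sketch below, with validation omitted for brevity; the optimizer and hyperparameters are mine, not necessarily those used for the log above.

```python
import torch
import torch.nn as nn

def train_model(num_epochs=15, lr=0.01):
    # Hypothetical sketch: `model`, `train_loader` and `device` are assumed to
    # exist in the surrounding notebook; validation is omitted for brevity.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    train_costs = []

    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        train_costs.append(running_loss / len(train_loader))
        print(f"[Epoch {epoch + 1}/{num_epochs}]: train-loss = {train_costs[-1]:.6f}")

    return train_costs
```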
## 3) Evaluating model
```python
nb_test_examples = 10000
correct = 0

model.eval().cuda()

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Make predictions.
        prediction = model(inputs)

        # Retrieve predictions indexes.
        _, predicted_class = torch.max(prediction.data, 1)

        # Compute number of correct predictions.
        correct += (predicted_class == labels).float().sum().item()

test_accuracy = correct / nb_test_examples
print('Test accuracy: {}'.format(test_accuracy))
```
```
Test accuracy: 0.8007
```