
MobileNet-V1: Summary and Implementation

This post is divided into 2 sections: Summary and Implementation.

We are going to have an in-depth review of the MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications paper, which introduces the MobileNet architecture.

The implementation uses PyTorch as its framework. For the full implementation, please refer to this repository.

Also, if you want to read other "Summary and Implementation" posts, feel free to check them out on my blog.

I) Summary

  • This paper describes a small, low-latency, and efficient network architecture called MobileNet for mobile and embedded vision applications.
  • To do so, it uses depthwise separable convolutions and two hyperparameters: a width multiplier and a resolution multiplier.

1) Depthwise separable convolution

  • To understand what a depthwise separable convolution really is, let's compare it to a normal convolution between a 12x12x3 input and 256 kernels of size 5x5x3.
  • A depthwise separable convolution is divided into 2 parts:
    • Depthwise convolution.
    • Pointwise convolution.

Depthwise convolution:

  • In a normal convolution, all channels of a kernel are combined to produce a single feature map.
  • In a depthwise convolution, each channel of the kernel convolves only with its corresponding input channel, producing one feature map per channel.
Figure: Normal convolution

Figure: Depthwise convolution

Pointwise convolution:

  • To increase the number of channels in our output feature map to 256:
    • In a normal convolution, we just have to use 256 filters of size 5x5x3.
    • In a pointwise convolution, we just have to use 256 filters of size 1x1x3 (see the PyTorch sketch after the figures below).
Figure: Normal convolution

Figure: Pointwise convolution

  • What's the main difference between a depthwise separable convolution and a normal convolution?

  • The main difference is the number of computations. In our example (the arithmetic is verified in the sketch after this list):

    • For a normal convolution, we have ((8x8x5x5)x3)x256 = 1,228,800 operations.
    • For a depthwise separable convolution, we have 4,800 + 49,152 = 53,952 operations:
      • in the depthwise convolution, (8x8x5x5)x3 = 4,800 operations;
      • in the pointwise convolution, ((8x8x1x1)x3)x256 = 49,152 operations.
  • We can clearly see that a depthwise separable convolution is far less expensive than a normal convolution (roughly 22.8x fewer computations, i.e. about 4.4% of the cost).

  • The reason is that, in a normal convolution, we transform the image 256 times, whereas in a depthwise separable convolution, we transform the image once and then expand it 256 times along the channel axis.

  • The authors compared a MobileNet built with depthwise separable convolutions against one built with normal convolutions on ImageNet. The accuracy dropped by only about 1%, while the number of parameters and operations dropped dramatically.
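
The operation counts above can be verified with a few lines of arithmetic, using the shapes of the running example (a 12x12x3 input, 5x5 kernels, an 8x8x256 output):

out_h, out_w = 8, 8         # output spatial size (12 - 5 + 1 = 8)
k = 5                       # kernel size
in_c, out_c = 3, 256        # input and output channels

normal = (out_h * out_w * k * k * in_c) * out_c       # 1,228,800
depthwise = out_h * out_w * k * k * in_c              # 4,800
pointwise = (out_h * out_w * 1 * 1 * in_c) * out_c    # 49,152
separable = depthwise + pointwise                     # 53,952

print(normal / separable)                             # ~22.8x fewer operations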

2) Hyperparameters

  • They demonstrate how to build smaller and faster MobileNets using a width multiplier (α) and a resolution multiplier (ρ), trading off a reasonable amount of accuracy to reduce size and latency.
  • The width multiplier (α) (also known as the "depth multiplier"), with values {1, 0.75, 0.5, 0.25}, thins the network uniformly at each layer, reducing both the computational cost and the number of parameters (see the sketch after this list).
    • For example, with a width multiplier of 1, the network starts off with 32 channels and ends up with 1024.
    • Using a width multiplier of 0.5 halves the number of channels in each layer, reducing the number of computations by a factor of about 4 and the number of learnable parameters by a factor of about 3 (see Table 6 of the paper). The resulting model is therefore faster but less accurate than the full model.
  • The resolution multiplier (ρ), with input resolutions {224, 192, 160, 128}, reduces the input size, which lowers the computational cost.
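
The width multiplier simply scales every channel count in the network. A minimal sketch, using the same channel list as the implementation below:

channels = [32, 64, 128, 256, 512, 1024]

for alpha in (1.0, 0.75, 0.5, 0.25):
    thinned = [int(c * alpha) for c in channels]
    print(alpha, thinned)
# 1.0  -> [32, 64, 128, 256, 512, 1024]
# 0.75 -> [24, 48, 96, 192, 384, 768]
# 0.5  -> [16, 32, 64, 128, 256, 512]
# 0.25 -> [8, 16, 32, 64, 128, 256]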

3) Architecture

Here is the MobileNet-V1 architecture:

Figure: MobileNet-V1 architecture

II) Implementation

I am not going to implement the resolution multiplier, as I believe this can be handled during the preprocessing step.
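
For example, the input resolution can be set when building the data loaders; a minimal sketch using torchvision transforms (the 160x160 value is only an illustration, not a setting from the repository):

from torchvision import transforms

# Resolution multiplier handled at preprocessing time: resize the inputs
# before they reach the network.
preprocess = transforms.Compose([
    transforms.Resize((160, 160)),
    transforms.ToTensor(),
])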

1) Architecture build

from collections import OrderedDict

import torch
import torch.nn as nn


class DSConv(nn.Module):
    """
        Depthwise separable convolution: a 3x3 depthwise convolution followed
        by a 1x1 pointwise convolution, each with BatchNorm and ReLU.
    """
    def __init__(self, f_3x3, f_1x1, stride=1, padding=0):
        super(DSConv, self).__init__()
        
        self.feature = nn.Sequential(OrderedDict([
            # Depthwise convolution: groups=f_3x3 gives one 3x3 filter per input channel.
            ('dconv', nn.Conv2d(f_3x3,
                                f_3x3,
                                kernel_size=3,
                                groups=f_3x3,
                                stride=stride,
                                padding=padding,
                                bias=False
                                )),
            ('bn1', nn.BatchNorm2d(f_3x3)),
            ('act1', nn.ReLU()),
            # Pointwise convolution: 1x1 kernels map f_3x3 channels to f_1x1.
            ('pconv', nn.Conv2d(f_3x3,
                                f_1x1,
                                kernel_size=1,
                                bias=False)),
            ('bn2', nn.BatchNorm2d(f_1x1)),
            ('act2', nn.ReLU())
        ]))
    
    def forward(self, x):
        out = self.feature(x)
        return out
        
class MobileNet(nn.Module):
    """
        MobileNet-V1 architecture.
    """
    def __init__(self, channels, width_multiplier=1.0, num_classes=1000):
        super(MobileNet, self).__init__()
        
        # Width multiplier: uniformly thin every layer's channel count.
        channels = [int(elt * width_multiplier) for elt in channels]
        
        # Standard 3x3 convolution stem with stride 2, as in the paper.
        self.conv = nn.Sequential(OrderedDict([
            ('conv', nn.Conv2d(3, channels[0], kernel_size=3,
                               stride=2, padding=1, bias=False)),
            ('bn', nn.BatchNorm2d(channels[0])),
            ('act', nn.ReLU()) 
        ]))
        
        # 13 depthwise separable blocks: DSConv(in_channels, out_channels, stride, padding).
        self.features = nn.Sequential(OrderedDict([
            ('dsconv1', DSConv(channels[0], channels[1], 1, 1)),
            ('dsconv2', DSConv(channels[1], channels[2], 2, 1)),
            ('dsconv3', DSConv(channels[2], channels[2], 1, 1)),
            ('dsconv4', DSConv(channels[2], channels[3], 2, 1)),
            ('dsconv5', DSConv(channels[3], channels[3], 1, 1)),
            ('dsconv6', DSConv(channels[3], channels[4], 2, 1)),
            ('dsconv7_a', DSConv(channels[4], channels[4], 1, 1)),
            ('dsconv7_b', DSConv(channels[4], channels[4], 1, 1)),
            ('dsconv7_c', DSConv(channels[4], channels[4], 1, 1)),
            ('dsconv7_d', DSConv(channels[4], channels[4], 1, 1)),
            ('dsconv7_e', DSConv(channels[4], channels[4], 1, 1)),
            ('dsconv8', DSConv(channels[4], channels[5], 2, 1)),
            ('dsconv9', DSConv(channels[5], channels[5], 1, 1))
        ]))
        
        # Global average pooling followed by the classifier.
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.linear = nn.Linear(channels[5], num_classes)

    def forward(self, x):
        out = self.conv(x)
        out = self.features(out)
        out = self.avgpool(out)
        out = torch.flatten(out, 1)
        out = self.linear(out)
        return out

def MobileNetV1():
    # num_classes=10 for CIFAR-10 (the default of 1000 targets ImageNet).
    return MobileNet(channels=[32, 64, 128, 256, 512, 1024],
                     width_multiplier=1.0,
                     num_classes=10)
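
As a quick sanity check (a sketch; the random batch is only for illustration), we can feed a CIFAR-10-sized batch through the model and inspect the output shape and parameter count:

model = MobileNetV1()
x = torch.randn(8, 3, 32, 32)        # a batch of 8 CIFAR-10-sized images
print(model(x).shape)                # torch.Size([8, 10])
print(sum(p.numel() for p in model.parameters()))  # total number of learnable parameters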

2) Training on CIFAR-10

train_costs, val_costs = train_model()
[Epoch 1/15]: train-loss = 1.559708 | train-acc = 0.433 | val-loss = 0.032007 | val-acc = 0.586
[Epoch 2/15]: train-loss = 1.004967 | train-acc = 0.645 | val-loss = 0.025247 | val-acc = 0.701
[Epoch 3/15]: train-loss = 0.747347 | train-acc = 0.742 | val-loss = 0.021409 | val-acc = 0.745
[Epoch 4/15]: train-loss = 0.596251 | train-acc = 0.793 | val-loss = 0.022274 | val-acc = 0.776
[Epoch 5/15]: train-loss = 0.493434 | train-acc = 0.832 | val-loss = 0.014880 | val-acc = 0.797
[Epoch 6/15]: train-loss = 0.415764 | train-acc = 0.856 | val-loss = 0.012545 | val-acc = 0.808
[Epoch 7/15]: train-loss = 0.356464 | train-acc = 0.877 | val-loss = 0.014014 | val-acc = 0.803
[Epoch 8/15]: train-loss = 0.318513 | train-acc = 0.891 | val-loss = 0.011231 | val-acc = 0.808
[Epoch 9/15]: train-loss = 0.265170 | train-acc = 0.909 | val-loss = 0.017637 | val-acc = 0.824
[Epoch 10/15]: train-loss = 0.234187 | train-acc = 0.920 | val-loss = 0.021207 | val-acc = 0.827
[Epoch 11/15]: train-loss = 0.203297 | train-acc = 0.930 | val-loss = 0.020546 | val-acc = 0.829
[Epoch 12/15]: train-loss = 0.184567 | train-acc = 0.937 | val-loss = 0.015497 | val-acc = 0.827
[Epoch 13/15]: train-loss = 0.167254 | train-acc = 0.942 | val-loss = 0.018095 | val-acc = 0.824
[Epoch 14/15]: train-loss = 0.142094 | train-acc = 0.951 | val-loss = 0.019001 | val-acc = 0.826
[Epoch 15/15]: train-loss = 0.128344 | train-acc = 0.955 | val-loss = 0.019820 | val-acc = 0.817
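
The train_model function itself lives in the linked repository. Below is a minimal sketch of what such a loop could look like; the criterion, optimizer, learning rate, and the train_loader/val_loader/device globals are assumptions, not necessarily the settings that produced the log above:

def train_model(num_epochs=15):
    # Sketch only: these hyperparameters are assumptions.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    train_costs, val_costs = [], []

    for epoch in range(num_epochs):
        # Training pass.
        model.train()
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        train_costs.append(running_loss / len(train_loader))

        # Validation pass.
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                val_loss += criterion(model(inputs), labels).item()
        val_costs.append(val_loss / len(val_loader))

    return train_costs, val_costs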

3) Evaluating model

nb_test_examples = 10000
correct = 0 

model.eval().to(device)

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        # Make predictions.
        prediction = model(inputs)

        # Retrieve predictions indexes.
        _, predicted_class = torch.max(prediction.data, 1)

        # Compute number of correct predictions.
        correct += (predicted_class == labels).float().sum().item()

test_accuracy = correct / nb_test_examples
print('Test accuracy: {}'.format(test_accuracy))
Test accuracy: 0.8007