---
tags: machine-learning
---

# MobileNet-V2: Summary and Implementation

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/0.png">
</div>

> This post is divided into 2 sections: Summary and Implementation.
>
> We are going to have an in-depth review of the [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/pdf/1801.04381.pdf) paper, which introduces the MobileNet-V2 architecture.
>
> The implementation uses PyTorch as framework. To see the full implementation, please refer to this [repository](https://github.com/3outeille/Research-Paper-Summary/tree/master/src/architecture/mobilenet_v2/pytorch).
>
> Also, if you want to read other "Summary and Implementation" posts, feel free to check them out at my [blog](https://ferdinandmom.engineer/deep-learning/).

# I) Summary

- This paper introduces the MobileNet-V2 architecture, which is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers.
- The authors show that it is important to remove non-linearities in the narrow layers in order to maintain representational power and to improve performance.
- The result is a very memory-efficient inference model.

## 1) Depthwise separable convolution

- To understand what a depthwise separable convolution really is, let's compare it to a normal convolution between a 12x12x3 input and 256 kernels of size 5x5x3.
- A depthwise separable convolution is divided into 2 parts:
    - **Depthwise convolution**.
    - **Pointwise convolution**.

**<ins>Depthwise convolution: </ins>**

- In a normal convolution, **all channels** of a kernel are used to produce a feature map.
- In a depthwise convolution, **each channel** of a kernel is used to produce its own feature map.

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/1.png">
    <figcaption> Figure: Normal convolution</figcaption>
</div>
<br>
<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/2.png">
    <figcaption> Figure: Depthwise convolution</figcaption>
</div>
<br>

**<ins>Pointwise convolution: </ins>**

- To increase the number of channels in our output image to 256:
    - In a **normal convolution**, we just have to use **256 filters of size 5x5x3**.
    - In a **pointwise convolution**, we just have to use **256 filters of size 1x1x3**.

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/3.png">
    <figcaption> Figure: Normal convolution</figcaption>
</div>
<br>
<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/4.png">
    <figcaption> Figure: Pointwise convolution</figcaption>
</div>
<br>

- **What is the main difference between a depthwise separable convolution and a normal convolution?**
    - The **main difference** is the **number of computations**. In our example:
        - For a **normal convolution**, we have ((8x8x5x5)x3)x256 = **1,228,800** operations.
        - For a **depthwise separable convolution**, we have 4,800 + 49,152 = **53,952** operations:
            - in the **depthwise convolution**, (8x8x5x5)x3 = 4,800 operations.
            - in the **pointwise convolution**, ((8x8x1x1)x3)x256 = 49,152 operations.
- ==We can clearly see that a depthwise separable convolution is **less expensive** than a normal convolution (about 22.8x fewer computations, i.e. roughly 4.4% of the original cost)==.
- The reason is that, in a normal convolution, we are **transforming the image 256 times**, whereas in a depthwise separable convolution, we transform the image **once** and then **expand it 256 times** along the channel axis.
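To make the comparison concrete, here is a minimal PyTorch sketch of the example above (my own illustration, not code from the paper): a normal convolution versus a depthwise convolution (`groups=3`) followed by a pointwise 1x1 convolution, together with the multiplication counts.

```python
import torch
import torch.nn as nn

# 12x12 RGB input, as in the example above.
x = torch.randn(1, 3, 12, 12)

# Normal convolution: 256 kernels of size 5x5x3.
normal = nn.Conv2d(3, 256, kernel_size=5, bias=False)

# Depthwise separable convolution:
#  - depthwise: one 5x5 kernel per input channel (groups=3),
#  - pointwise: 256 kernels of size 1x1x3.
depthwise = nn.Conv2d(3, 3, kernel_size=5, groups=3, bias=False)
pointwise = nn.Conv2d(3, 256, kernel_size=1, bias=False)

print(normal(x).shape)                # torch.Size([1, 256, 8, 8])
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 256, 8, 8])

# Multiplication counts on the 8x8 output:
print(8*8*5*5*3*256)          # 1,228,800 (normal convolution)
print(8*8*5*5*3 + 8*8*3*256)  # 53,952    (depthwise separable convolution)
```

Both paths produce a 256-channel 8x8 output, but the separable version does it with roughly 23x fewer multiplications.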
## 2) Linear bottleneck

- ==Linear bottleneck layers are just bottleneck layers with a linear activation.==
- It was long assumed that the manifold of interest could be embedded in a low-dimensional subspace. ==In other words, the data representation that we are interested in could be embedded in a tiny subspace of the activation space.==
- Thus, we should be able to simply reduce the dimensionality of a layer until it matches the manifold of interest.
- However, this intuition breaks down because deep neural networks use ReLU, which squashes away too much information if the features already lie in a low-dimensional space (the figure below illustrates this).

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/5.png" width="60%">
</div>
<br>

- But there are 2 properties that are indicative of the requirement that the manifold of interest should lie in a low-dimensional subspace (of the higher-dimensional activation space):
    1. If the manifold of interest has non-zero volume after the ReLU transformation, then ReLU acts on it as a linear transformation.
    2. ReLU is capable of preserving complete information about the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space.

- ==Thus, if we assume the manifold of interest is low-dimensional, **we can capture this by inserting linear bottleneck layers into the convolutional blocks**==.
- ==Experimental evidence shows that using non-linear layers in the bottlenecks destroys too much information in the low-dimensional space==, even though, in general, linear bottleneck models are strictly less powerful than models with non-linearities.

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/6.png" width="50%">
</div>
<br>
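As a quick illustration of why ReLU hurts in low dimensions, here is a toy sketch in the spirit of the paper's Figure 1 (my own snippet, not the authors' code): a 2-D spiral is embedded into n dimensions with a random matrix, passed through ReLU, and projected back with the pseudo-inverse. The distortion (compared up to scale) is typically large for small n and shrinks as n grows.

```python
import math
import torch

# A 2-D "manifold of interest": points on a spiral.
theta = torch.linspace(0.1, 4 * math.pi, 200)
x = torch.stack([theta * torch.cos(theta), theta * torch.sin(theta)], dim=1)  # (200, 2)

torch.manual_seed(0)
for n in [2, 3, 5, 15, 30]:
    T = torch.randn(2, n)             # random embedding into n dimensions
    y = torch.relu(x @ T)             # ReLU applied in the n-dimensional space
    x_rec = y @ torch.linalg.pinv(T)  # project back to 2-D with the pseudo-inverse
    # Compare shapes up to scale: how much of the spiral's structure survived ReLU?
    err = (x / x.norm() - x_rec / x_rec.norm()).norm().item()
    print(f"n={n:2d} | shape distortion after ReLU: {err:.3f}")
```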
## 3) Inverted Residuals

- The original residual block contains an input followed by several bottlenecks, then an expansion, and the shortcuts connect the thick layers (layers with many channels).
- However, inspired by the intuition that the bottlenecks actually contain all the necessary information and that the expansion layer acts merely as a non-linear transformation, MobileNet-V2 uses shortcuts directly between the bottlenecks (thin layers). In the figure below, hatched layers use a linear activation.
- ReLU6 (i.e. min(max(x, 0), 6)) is used as the non-linear activation because of its robustness when used with low-precision computation.

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/7.png">
</div>
<br>

- The new result reported in this paper is that shortcuts connecting the bottlenecks perform better than shortcuts connecting the expanded layers.

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/8.png" width="50%">
</div>
<br>

## 4) Information flow interpretation

- MobileNet-V2 provides a natural separation between two things which have been tangled together in previous architectures:
    - Capacity: the input/output domains of the building blocks (the bottleneck layers).
    - Expressiveness: the layer transformation, a non-linear function that converts input to output (the expansion layers).
- The authors say that exploring these concepts separately is an important direction for future research.

## 5) Architecture

- This is a Fully Convolutional Network (no linear layers). Here is its architecture:

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/9.png">
</div>
<br>
<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/10.png" width="60%">
</div>
<br>
<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/11.png">
</div>
<br>
<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/12.png">
</div>
<br>
<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/mobilenet-v2/13.png">
</div>
<br>

# II) Implementation

- The paper doesn't explain how to deal with an input/output channel mismatch in the shortcut connections. Thus, shortcut connections are only used when the input and output channels match (and the stride is 1).

## 1) Architecture build

```python=
import torch
import torch.nn as nn
from collections import OrderedDict


class LambdaLayer(nn.Module):
    def __init__(self, lambd):
        super(LambdaLayer, self).__init__()
        self.lambd = lambd

    def forward(self, x):
        return self.lambd(x)


class Bottleneck(nn.Module):
    def __init__(self, in_channels, out_channels, t, stride):
        super(Bottleneck, self).__init__()
        self.stride = stride
        self.in_channels = in_channels
        self.out_channels = out_channels

        self.features = nn.Sequential(OrderedDict([
            # 1x1 expansion convolution (expansion factor t).
            ('pconv1', nn.Conv2d(in_channels, in_channels*t, kernel_size=1, stride=1, padding=0, bias=False)),
            ('bn1', nn.BatchNorm2d(in_channels*t)),
            ('act1', nn.ReLU6()),
            # 3x3 depthwise convolution (one filter per channel).
            ('dconv', nn.Conv2d(in_channels*t, in_channels*t, kernel_size=3, groups=in_channels*t, stride=stride, padding=1, bias=False)),
            ('bn2', nn.BatchNorm2d(in_channels*t)),
            ('act2', nn.ReLU6()),
            # 1x1 linear projection back to the bottleneck (no activation).
            ('pconv3', nn.Conv2d(in_channels*t, out_channels, kernel_size=1, stride=1, padding=0, bias=False)),
            ('bn3', nn.BatchNorm2d(out_channels))
        ]))

    def forward(self, x):
        out = self.features(x)
        # Identity shortcut only when spatial size and channels are preserved.
        if self.stride == 1 and self.in_channels == self.out_channels:
            out += x
        return out
```
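A quick sanity check of the block (my own snippet, not from the repository) confirms the output shapes and when the identity shortcut kicks in:

```python
# Stride 1 and matching channels: the identity shortcut is used.
block = Bottleneck(in_channels=32, out_channels=32, t=6, stride=1)
x = torch.randn(1, 32, 56, 56)
print(block(x).shape)  # torch.Size([1, 32, 56, 56])

# Stride 2 (or mismatched channels): no shortcut, spatial size is halved.
block = Bottleneck(in_channels=32, out_channels=64, t=6, stride=2)
print(block(x).shape)  # torch.Size([1, 64, 28, 28])
```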
```python=
class MobileNet(nn.Module):
    def __init__(self, block_type, bottleneck_settings, width_multiplier, num_classes):
        super(MobileNet, self).__init__()
        self.num_classes = num_classes
        self.b_s = bottleneck_settings
        # Scale the number of channels by the width multiplier.
        self.b_s['c'] = [int(elt * width_multiplier) for elt in self.b_s['c']]
        self.in_channels = int(32 * width_multiplier)
        self.out_channels = int(1280 * width_multiplier)

        # Feature extractor
        self.conv0 = nn.Sequential(OrderedDict([
            ('conv0', nn.Conv2d(3, self.in_channels, 1, stride=2, bias=False)),
            ('bn0', nn.BatchNorm2d(self.in_channels)),
            ('act0', nn.ReLU6())
        ]))

        self.bottleneck1 = self.__build_layer(block_type, self.in_channels, self.b_s['c'][0],
                                              self.b_s['t'][0], self.b_s['s'][0], self.b_s['n'][0])
        self.bottleneck2 = self.__build_layer(block_type, self.b_s['c'][0], self.b_s['c'][1],
                                              self.b_s['t'][1], self.b_s['s'][1], self.b_s['n'][1])
        self.bottleneck3 = self.__build_layer(block_type, self.b_s['c'][1], self.b_s['c'][2],
                                              self.b_s['t'][2], self.b_s['s'][2], self.b_s['n'][2])
        self.bottleneck4 = self.__build_layer(block_type, self.b_s['c'][2], self.b_s['c'][3],
                                              self.b_s['t'][3], self.b_s['s'][3], self.b_s['n'][3])
        self.bottleneck5 = self.__build_layer(block_type, self.b_s['c'][3], self.b_s['c'][4],
                                              self.b_s['t'][4], self.b_s['s'][4], self.b_s['n'][4])
        self.bottleneck6 = self.__build_layer(block_type, self.b_s['c'][4], self.b_s['c'][5],
                                              self.b_s['t'][5], self.b_s['s'][5], self.b_s['n'][5])
        self.bottleneck7 = self.__build_layer(block_type, self.b_s['c'][5], self.b_s['c'][6],
                                              self.b_s['t'][6], self.b_s['s'][6], self.b_s['n'][6])

        # Classifier
        self.conv8 = nn.Sequential(OrderedDict([
            ('conv8', nn.Conv2d(self.b_s['c'][6], self.out_channels, 1, bias=False)),
            ('bn8', nn.BatchNorm2d(self.out_channels)),
            ('act8', nn.ReLU6())
        ]))
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.conv9 = nn.Conv2d(self.out_channels, num_classes, 1)

    def __build_layer(self, block_type, in_channels, out_channels, t, s, n):
        # First block of the stage uses stride s, the remaining n-1 blocks use stride 1.
        layers = []
        tmp_channels = in_channels
        for i in range(n):
            if i == 0:
                layers.append(block_type(tmp_channels, out_channels, t, s))
            else:
                layers.append(block_type(tmp_channels, out_channels, t, 1))
            tmp_channels = out_channels
        return nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv0(x)
        out = self.bottleneck1(out)
        out = self.bottleneck2(out)
        out = self.bottleneck3(out)
        out = self.bottleneck4(out)
        out = self.bottleneck5(out)
        out = self.bottleneck6(out)
        out = self.bottleneck7(out)
        out = self.conv8(out)
        out = self.avgpool(out)
        out = self.conv9(out)
        out = out.view(-1, self.num_classes)
        return out
```

```python=
def MobileNetV2():
    # Bottleneck settings from Table 2 of the paper:
    # c = output channels, t = expansion factor, s = stride, n = number of blocks.
    bottleneck_settings = {
        'c': [16, 24, 32, 64, 96, 160, 320],
        't': [1, 6, 6, 6, 6, 6, 6],
        's': [1, 2, 2, 2, 1, 2, 1],
        'n': [1, 2, 3, 4, 3, 3, 1]
    }

    return MobileNet(block_type=Bottleneck,
                     bottleneck_settings=bottleneck_settings,
                     width_multiplier=.5,
                     num_classes=1000)
```
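Before training, a quick smoke test (my own snippet) verifies that the network produces the expected output shape and reports a parameter count:

```python
# Forward a dummy ImageNet-sized batch through the model.
model = MobileNetV2()
x = torch.randn(2, 3, 224, 224)
print(model(x).shape)  # torch.Size([2, 1000])

# Number of trainable parameters (with width_multiplier = 0.5 here).
print(sum(p.numel() for p in model.parameters() if p.requires_grad))
```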
## 2) Training on CIFAR-10

```
train_costs, val_costs = train_model()
```

```
[Epoch 1/15]: train-loss = 1.641850 | train-acc = 0.397 | val-loss = 0.035919 | val-acc = 0.528
[Epoch 2/15]: train-loss = 1.192912 | train-acc = 0.569 | val-loss = 0.035297 | val-acc = 0.612
[Epoch 3/15]: train-loss = 0.997534 | train-acc = 0.642 | val-loss = 0.033239 | val-acc = 0.654
[Epoch 4/15]: train-loss = 0.868714 | train-acc = 0.692 | val-loss = 0.028015 | val-acc = 0.696
[Epoch 5/15]: train-loss = 0.760505 | train-acc = 0.733 | val-loss = 0.021540 | val-acc = 0.721
[Epoch 6/15]: train-loss = 0.681773 | train-acc = 0.763 | val-loss = 0.013692 | val-acc = 0.752
[Epoch 7/15]: train-loss = 0.617383 | train-acc = 0.786 | val-loss = 0.023159 | val-acc = 0.749
[Epoch 8/15]: train-loss = 0.568520 | train-acc = 0.802 | val-loss = 0.017240 | val-acc = 0.749
[Epoch 9/15]: train-loss = 0.527746 | train-acc = 0.816 | val-loss = 0.017224 | val-acc = 0.763
[Epoch 10/15]: train-loss = 0.491380 | train-acc = 0.828 | val-loss = 0.019151 | val-acc = 0.778
[Epoch 11/15]: train-loss = 0.469383 | train-acc = 0.837 | val-loss = 0.019102 | val-acc = 0.785
[Epoch 12/15]: train-loss = 0.432044 | train-acc = 0.849 | val-loss = 0.022228 | val-acc = 0.785
[Epoch 13/15]: train-loss = 0.404078 | train-acc = 0.860 | val-loss = 0.015088 | val-acc = 0.790
[Epoch 14/15]: train-loss = 0.385579 | train-acc = 0.865 | val-loss = 0.016302 | val-acc = 0.793
[Epoch 15/15]: train-loss = 0.363964 | train-acc = 0.872 | val-loss = 0.017327 | val-acc = 0.794
```

## 3) Evaluating model

```python
nb_test_examples = 10000
correct = 0

model.eval().cuda()

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Make predictions.
        prediction = model(inputs)

        # Retrieve predictions indexes.
        _, predicted_class = torch.max(prediction.data, 1)

        # Compute number of correct predictions.
        correct += (predicted_class == labels).float().sum().item()

test_accuracy = correct / nb_test_examples
print('Test accuracy: {}'.format(test_accuracy))
```

```
Test accuracy: 0.8007
```