MobileNet-V2: Summary and Implementation
This post is divided into two sections: Summary and Implementation.
We are going to have an in-depth review of the MobileNetV2: Inverted Residuals and Linear Bottlenecks paper, which introduces the MobileNet-V2 architecture.
The implementation uses PyTorch as its framework. For the full implementation, please refer to this repository.
Also, if you want to read other "Summary and Implementation" posts, feel free to check them out on my blog.
I) Summary
- This paper introduces the MobileNet-V2 architecture, which is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers.
- The authors show that it is important to remove non-linearities in the narrow layers in order to maintain representational power and improve performance.
- The result is a very memory-efficient inference model.
1) Depthwise separable convolution
- To understand what a depthwise separable convolution really is, let's compare it to a normal convolution between a 12x12x3 input and 256 kernels of size 5x5x3 (no padding, so the output is 8x8x256).
- A depthwise separable convolution is divided into 2 parts:
- Depthwise convolution.
- Pointwise convolution.
Depthwise convolution:
- In a normal convolution, all channels of a kernel are combined to produce a single feature map.
- In a depthwise convolution, each channel of the kernel is applied to its corresponding input channel, producing one feature map per input channel.
Figure: Normal convolution
Figure: Depthwise convolution
Pointwise convolution:
- To increase the number of channels in our output image to 256:
- In a normal convolution, we just have to use 256 filters of size 5x5x3.
- In a pointwise convolution, we just have to use 256 filters of size 1x1x3.
Figure: Normal convolution
Figure: Pointwise convolution
- The main difference is the number of computations. In our example:
  - For a normal convolution, we have ((8x8x5x5)x3)x256 = 1,228,800 operations.
  - For a depthwise separable convolution, we have 4,800 + 49,152 = 53,952 operations:
    - in the depthwise convolution, (8x8x5x5)x3 = 4,800 operations;
    - in the pointwise convolution, ((8x8x1x1)x3)x256 = 49,152 operations.
- We can clearly see that a depthwise separable convolution is far less expensive than a normal convolution: about 22.8x fewer computations (53,952 vs. 1,228,800). The sketch below verifies these numbers.
- The reason is that, in a normal convolution, we transform the image 256 times, whereas in a depthwise separable convolution we transform the image once and then expand it to 256 channels with cheap 1x1 convolutions.
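To make the comparison concrete, here is a minimal PyTorch sketch of the example above. A depthwise convolution is expressed with nn.Conv2d by setting groups to the number of input channels:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 12, 12)  # the 12x12x3 input from the example

# Normal convolution: 256 kernels of size 5x5x3 -> 8x8x256 output
normal = nn.Conv2d(3, 256, kernel_size=5, bias=False)

# Depthwise separable convolution: per-channel 5x5, then 1x1 across channels
depthwise = nn.Conv2d(3, 3, kernel_size=5, groups=3, bias=False)  # -> 8x8x3
pointwise = nn.Conv2d(3, 256, kernel_size=1, bias=False)          # -> 8x8x256

print(normal(x).shape)                # torch.Size([1, 256, 8, 8])
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 256, 8, 8])

# Multiplication counts, matching the numbers above
print(8*8*5*5*3*256)              # 1228800
print(8*8*5*5*3 + 8*8*1*1*3*256)  # 53952
```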
2) Linear bottleneck
- Linear bottleneck layers are simply bottleneck layers with a linear activation.
- It was long assumed that the manifold of interest could be embedded in a low-dimensional subspace. In other words, the data representation that we are interested in could live in a tiny subspace of the activation space.
- Thus, we could simply reduce the dimensionality of a layer until we capture the manifold of interest.
- However, this intuition breaks down because deep neural networks use ReLU, which squashes away too much information when the features are already low-dimensional (Fig. 1 illustrates this perfectly).
Figure 1: ReLU transformations of low-dimensional manifolds embedded in higher-dimensional spaces
- Still, there are two properties that indicate the manifold of interest should lie in a low-dimensional subspace (of the higher-dimensional activation space):
  - If the manifold of interest has non-zero volume after the ReLU transformation, then on that part of the input ReLU acts as a linear transformation.
  - ReLU is capable of preserving complete information about the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space.
- Thus, if we assume the manifold of interest is low-dimensional, we can capture this by inserting linear bottleneck layers into the convolutional blocks.
- Experimental evidence shows that using non-linearities in the bottleneck destroys too much information in the low-dimensional space, even though, in general, linear bottleneck models are strictly less powerful than models with non-linearities (a short numerical sketch of this follows below).
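The paper's Figure 1 experiment can be sketched in a few lines: embed a low-dimensional point set into n dimensions with a random matrix T, apply ReLU, then project back with the pseudo-inverse of T. The point cloud and dimensions below are arbitrary choices, but the trend matches the paper: the larger n is, the more information survives the non-linearity.

```python
import torch

torch.manual_seed(0)
points = torch.randn(1000, 2)  # a 2-D "manifold of interest"

for n in (3, 30):
    T = torch.randn(2, n)        # random embedding into n dimensions
    back = torch.linalg.pinv(T)  # projection back down to 2-D
    recovered = torch.relu(points @ T) @ back
    print(f"n={n}: reconstruction MSE = {(points - recovered).pow(2).mean():.4f}")
```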
3) Inverted Residuals
- The original residual block contains an input followed by several bottlenecks, then followed by an expansion, and the shortcuts exist between the thick layers (layers with many channels).
- However, inspired by the intuition that the bottlenecks actually contain all the necessary information while the expansion layers act merely as a non-linear transformation, MobileNet-V2 uses shortcuts directly between the bottlenecks (thin layers). Hatched layers in the figure below use a linear activation.
- ReLU6 is used as the non-linear activation because of its robustness when used with low-precision computation (demonstrated below).
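ReLU6 is simply min(max(x, 0), 6); clamping the output to [0, 6] keeps activations representable in low-precision fixed-point formats:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 3.0, 10.0])
print(F.relu6(x))  # tensor([0., 3., 6.]): negatives clipped to 0, large values to 6
```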
Figure: the difference between a residual block and an inverted residual block (hatched layers do not use non-linearities)
- The new result reported in this paper is that shortcuts connecting the bottlenecks perform better than shortcuts connecting the expanded layers.
- MobileNet-V2 provides a natural separation between two things that have been tangled together in previous architectures:
  - Capacity: the input/output domains of the building blocks (bottleneck layers).
  - Expressiveness: the layer transformation, a non-linear function that converts input to output (expansion layers).
- The authors say that exploring these concepts separately is an important direction for future research.
4) Architecture
- This is a fully convolutional network (no linear layers). Here is its architecture:
| Input | Operator | t | c | n | s |
|---|---|---|---|---|---|
| 224x224x3 | conv2d 3x3 | - | 32 | 1 | 2 |
| 112x112x32 | bottleneck | 1 | 16 | 1 | 1 |
| 112x112x16 | bottleneck | 6 | 24 | 2 | 2 |
| 56x56x24 | bottleneck | 6 | 32 | 3 | 2 |
| 28x28x32 | bottleneck | 6 | 64 | 4 | 2 |
| 14x14x64 | bottleneck | 6 | 96 | 3 | 1 |
| 14x14x96 | bottleneck | 6 | 160 | 3 | 2 |
| 7x7x160 | bottleneck | 6 | 320 | 1 | 1 |
| 7x7x320 | conv2d 1x1 | - | 1280 | 1 | 1 |
| 7x7x1280 | avgpool 7x7 | - | - | 1 | - |
| 1x1x1280 | conv2d 1x1 | - | k | - | - |

where t is the expansion factor, c the number of output channels, n the number of times the block is repeated, and s the stride of the first block of each sequence (all the others use stride 1).
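For the implementation below, this table can be encoded as a plain dictionary; the t/c/n/s keys mirror the columns above and are the format the MobileNet class below consumes:

```python
# Table 2 of the paper as a settings dict: t = expansion factor, c = output
# channels, n = number of repeats, s = stride of the first block in a sequence.
bottleneck_settings = {
    't': [1, 6, 6, 6, 6, 6, 6],
    'c': [16, 24, 32, 64, 96, 160, 320],
    'n': [1, 2, 3, 4, 3, 3, 1],
    's': [1, 2, 2, 2, 1, 2, 1],
}
```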
II) Implementation
- The paper doesn't explicitly spell out how to deal with an input/output channel mismatch in the shortcut connections. Thus, in this implementation, a shortcut is only used when the stride is 1 and the input and output channel counts match.
1) Architecture build
from collections import OrderedDict

import torch
import torch.nn as nn


class LambdaLayer(nn.Module):
    """Helper that wraps an arbitrary function as an nn.Module (usable inside nn.Sequential)."""
    def __init__(self, lambd):
        super(LambdaLayer, self).__init__()
        self.lambd = lambd

    def forward(self, x):
        return self.lambd(x)
class Bottleneck(nn.Module):
    """Inverted residual block: 1x1 expansion -> 3x3 depthwise -> 1x1 linear projection."""
    def __init__(self, in_channels, out_channels, t, stride):
        super(Bottleneck, self).__init__()
        self.stride = stride
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.features = nn.Sequential(OrderedDict([
            # 1x1 pointwise convolution expanding the channels by a factor of t
            ('pconv1', nn.Conv2d(in_channels, in_channels*t, kernel_size=1,
                                 stride=1, padding=0, bias=False)),
            ('bn1', nn.BatchNorm2d(in_channels*t)),
            ('act1', nn.ReLU6()),
            # 3x3 depthwise convolution (groups == number of channels)
            ('dconv', nn.Conv2d(in_channels*t, in_channels*t, kernel_size=3,
                                groups=in_channels*t, stride=stride, padding=1, bias=False)),
            ('bn2', nn.BatchNorm2d(in_channels*t)),
            ('act2', nn.ReLU6()),
            # 1x1 pointwise projection back down: the linear bottleneck (no activation)
            ('pconv3', nn.Conv2d(in_channels*t, out_channels, kernel_size=1,
                                 stride=1, padding=0, bias=False)),
            ('bn3', nn.BatchNorm2d(out_channels))
        ]))

    def forward(self, x):
        out = self.features(x)
        # Shortcut only when spatial size and channel count are preserved
        if self.stride == 1 and self.in_channels == self.out_channels:
            out += x
        return out
class MobileNet(nn.Module):
    def __init__(self, block_type, bottleneck_settings, width_multiplier, num_classes):
        super(MobileNet, self).__init__()
        self.num_classes = num_classes
        # Scale every channel count by the width multiplier
        self.b_s = bottleneck_settings
        self.b_s['c'] = [int(elt * width_multiplier) for elt in self.b_s['c']]
        self.in_channels = int(32 * width_multiplier)
        self.out_channels = int(1280 * width_multiplier)
        # First layer: regular 3x3 convolution with stride 2, as in the architecture table
        self.conv0 = nn.Sequential(OrderedDict([
            ('conv0', nn.Conv2d(3, self.in_channels, 3, stride=2, padding=1, bias=False)),
            ('bn0', nn.BatchNorm2d(self.in_channels)),
            ('act0', nn.ReLU6())
        ]))
        # Seven sequences of inverted residual blocks, following the architecture table
        self.bottleneck1 = self.__build_layer(block_type, self.in_channels, self.b_s['c'][0],
                                              self.b_s['t'][0], self.b_s['s'][0], self.b_s['n'][0])
        self.bottleneck2 = self.__build_layer(block_type, self.b_s['c'][0], self.b_s['c'][1],
                                              self.b_s['t'][1], self.b_s['s'][1], self.b_s['n'][1])
        self.bottleneck3 = self.__build_layer(block_type, self.b_s['c'][1], self.b_s['c'][2],
                                              self.b_s['t'][2], self.b_s['s'][2], self.b_s['n'][2])
        self.bottleneck4 = self.__build_layer(block_type, self.b_s['c'][2], self.b_s['c'][3],
                                              self.b_s['t'][3], self.b_s['s'][3], self.b_s['n'][3])
        self.bottleneck5 = self.__build_layer(block_type, self.b_s['c'][3], self.b_s['c'][4],
                                              self.b_s['t'][4], self.b_s['s'][4], self.b_s['n'][4])
        self.bottleneck6 = self.__build_layer(block_type, self.b_s['c'][4], self.b_s['c'][5],
                                              self.b_s['t'][5], self.b_s['s'][5], self.b_s['n'][5])
        self.bottleneck7 = self.__build_layer(block_type, self.b_s['c'][5], self.b_s['c'][6],
                                              self.b_s['t'][6], self.b_s['s'][6], self.b_s['n'][6])
        # Final 1x1 convolution up to 1280 channels
        self.conv8 = nn.Sequential(OrderedDict([
            ('conv8', nn.Conv2d(self.b_s['c'][6], self.out_channels, 1, bias=False)),
            ('bn8', nn.BatchNorm2d(self.out_channels)),
            ('act8', nn.ReLU6())
        ]))
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        # 1x1 convolution as the classifier, keeping the network fully convolutional
        self.conv9 = nn.Conv2d(self.out_channels, num_classes, 1)

    def __build_layer(self, block_type, in_channels, out_channels, t, s, n):
        # n blocks: the first uses stride s, the remaining n-1 use stride 1
        layers = []
        tmp_channels = in_channels
        for i in range(n):
            stride = s if i == 0 else 1
            layers.append(block_type(tmp_channels, out_channels, t, stride))
            tmp_channels = out_channels
        return nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv0(x)
        out = self.bottleneck1(out)
        out = self.bottleneck2(out)
        out = self.bottleneck3(out)
        out = self.bottleneck4(out)
        out = self.bottleneck5(out)
        out = self.bottleneck6(out)
        out = self.bottleneck7(out)
        out = self.conv8(out)
        out = self.avgpool(out)
        out = self.conv9(out)
        return out.view(-1, self.num_classes)
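As a quick sanity check, we can instantiate the network with the bottleneck_settings dict from the architecture section and run a dummy batch through it:

```python
model = MobileNet(Bottleneck, bottleneck_settings, width_multiplier=1.0, num_classes=10)
out = model(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 10])
```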
2) Training on CIFAR-10
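Below is a minimal training sketch. The optimizer and hyperparameters are illustrative choices for CIFAR-10, not the paper's recipe (the authors trained on ImageNet with RMSProp):

```python
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MobileNet(Bottleneck, bottleneck_settings, 1.0, num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=4e-5)

model.train()
for epoch in range(10):  # the epoch count here is arbitrary
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch + 1}: loss = {running_loss / len(train_loader):.3f}")
```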
3) Evaluating model
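Evaluation reuses the model, device, and transform from the training sketch and measures top-1 accuracy on the test split:

```python
test_set = torchvision.datasets.CIFAR10(root='./data', train=False,
                                        download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False)

model.eval()
correct = total = 0
with torch.no_grad():  # no gradients needed at inference time
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"test accuracy: {100 * correct / total:.2f}%")
```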