This post is divided into 2 sections: Summary and Implementation.
We are going to review the paper "Rethinking the Inception Architecture for Computer Vision" in depth. It introduces the Inception-V2/Inception-V3 architecture.
The implementation uses PyTorch as its framework. To see the full implementation, please refer to this repository.
Also, if you want to read other "Summary and Implementation" posts, feel free to check them out on my blog.
The authors study how factorizing convolutions and aggressive dimension reduction inside a neural network can yield networks with relatively low computational cost while maintaining high quality.
They factorize a 5x5 convolution into two stacked 3x3 convolutions, which results in a
To get these numbers, let's say we have a 5x5 input image and, after a convolution, we want to produce a 5x5 output image (thanks to padding). The goal is to compare the number of operations between a 5x5 convolution and 2 stacked 3x3 convolutions.
Moreover, it is worth noticing that 2 stacked 3x3 filters give the same receptive field as a 5x5 filter.
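To make the comparison concrete, here is a minimal sketch of the arithmetic. It assumes a single input channel and a single filter, counts only multiplications, and the helper names `conv_mults` and `stacked_receptive_field` are made up for this illustration.

```python
def conv_mults(out_h, out_w, k_h, k_w):
    """Multiplications for one k_h x k_w filter producing an out_h x out_w map."""
    return out_h * out_w * k_h * k_w


# One 5x5 convolution producing a 5x5 output (the input is padded).
cost_5x5 = conv_mults(5, 5, 5, 5)          # 25 positions * 25 weights = 625
# Two stacked 3x3 convolutions, each producing a 5x5 output.
cost_two_3x3 = 2 * conv_mults(5, 5, 3, 3)  # 2 * (25 positions * 9 weights) = 450

saving = 1 - cost_two_3x3 / cost_5x5
print(cost_5x5, cost_two_3x3, f"{saving:.0%}")  # 625 450 28%


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


print(stacked_receptive_field([3, 3]))  # 5, the same as a single 5x5 filter
```

The ratio (9 + 9) / 25 does not depend on the size of the output grid, so the same saving holds for larger feature maps under this counting.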
They factorize an n x n convolution into a combination of a 1 x n convolution and an n x 1 convolution. They call it "asymmetric convolution". For example, a 3x3 convolution can be replaced by first performing a 1x3 convolution and then performing a 3x1 convolution on its output, which covers the same 3x3 receptive field. They found it to be
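As a rough sketch of why this is cheaper, we can count the weights per filter (which is also the number of multiplications per output position). The helper names below are hypothetical, and the count assumes one input and one output channel. The last part illustrates that the two-step version reproduces a full n x n kernel exactly only when that kernel is separable, i.e. an outer product of a column and a row.

```python
def square_cost(n):
    """Weights (and multiplications per output position) of one n x n filter."""
    return n * n


def asymmetric_cost(n):
    """Same count for a 1 x n filter followed by an n x 1 filter."""
    return n + n


for n in (3, 7):
    saving = 1 - asymmetric_cost(n) / square_cost(n)
    print(f"n={n}: {square_cost(n)} vs {asymmetric_cost(n)} weights, {saving:.0%} cheaper")
# n=3: 9 vs 6 weights, 33% cheaper
# n=7: 49 vs 14 weights, 71% cheaper


def outer(col, row):
    """Outer product of a column (n x 1) and a row (1 x n) kernel."""
    return [[c * r for r in row] for c in col]


# A separable 3x3 kernel built from a 3x1 column and a 1x3 row.
kernel = outer([1, 2, 1], [1, 0, -1])
print(kernel)  # [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
```

The saving grows with n, which is consistent with the 1x7 and 7x1 filters used in the implementation further down.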
Traditionally, convolutional networks apply some pooling before convolution operations to reduce the grid size of the feature maps. The problem is that this can introduce a representational bottleneck.
The authors argue that increasing the number of filters (expanding the filter bank) removes the representational bottleneck. This is achieved by the inception module.
In the left picture, we introduce a representational bottleneck by first reducing the grid size and then expanding the filter bank; the right picture does it the other way around.
To get the intuition behind it, let's follow this simple example. Suppose you are a primary student split between choosing either general study or technical study.
However, the right side is more expensive so they proposed another solution that reduces the computational cost while eliminating the bottleneck (by using 2 parallel stride 2 pooling/convolution blocks).
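To see why the right side is more expensive, here is a back-of-the-envelope sketch. It assumes a d x d grid with k channels that must become a d/2 x d/2 grid with 2k channels, and it counts one multiplication per input/output channel pair per position (as for a 1x1 convolution). The values of `d` and `k` and the helper `conv_cost` are made up for the illustration.

```python
def conv_cost(grid, in_ch, out_ch):
    """Multiplications of a 1x1 convolution applied on a grid x grid feature map."""
    return grid * grid * in_ch * out_ch


d, k = 32, 64  # hypothetical grid size and number of channels

# Left picture: pool first (d -> d/2), then expand k -> 2k filters on the small grid.
pool_then_conv = conv_cost(d // 2, k, 2 * k)
# Right picture: expand k -> 2k filters on the full grid, then pool.
conv_then_pool = conv_cost(d, k, 2 * k)

print(pool_then_conv, conv_then_pool, conv_then_pool // pool_then_conv)
# 2097152 8388608 4
```

Under these assumptions the bottleneck-free ordering costs 4 times more, which is the motivation for the parallel pooling/convolution block.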
The whole network is 42 layers deep, and its computational cost is about 2.5 times higher than that of GoogLeNet.
Remark: The authors say they use variations of the reduction technique (picture below) to reduce the grid sizes between the Inception blocks whenever applicable. They also add an auxiliary classifier on top of the last 17×17 layer.
In a CNN classifier, the label is a vector. If you have 3 classes, the one-hot labels are [0, 0, 1], [0, 1, 0] or [1, 0, 0]. Each vector stands for one class at the output layer. Label smoothing, in my understanding, replaces the one-hot vector with a relatively smooth vector to represent the ground-truth label. For example, [0, 0, 1] can be represented as [0.1, 0.1, 0.8]. It is used when the loss function is the cross-entropy.
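As a small illustration, the smoothed label can be written as a mix of the one-hot vector and a uniform distribution over the K classes: (1 - eps) * one_hot + eps / K. The sketch below uses eps = 0.3 only because it reproduces the [0.1, 0.1, 0.8] example above; the paper itself uses eps = 0.1 with 1000 classes. The function names and the predicted probabilities are made up for the example.

```python
import math


def smooth_labels(one_hot, eps):
    """Mix a one-hot vector with the uniform distribution over its K classes."""
    k = len(one_hot)
    return [(1 - eps) * p + eps / k for p in one_hot]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


hard = [0, 0, 1]
soft = smooth_labels(hard, eps=0.3)
print([round(t, 3) for t in soft])  # [0.1, 0.1, 0.8]

# A very confident (hypothetical) prediction for the correct class.
probs = [0.05, 0.05, 0.90]
print(cross_entropy(hard, probs))  # only the true class contributes
print(cross_entropy(soft, probs))  # larger: over-confidence is now penalized
```

With the smoothed target, pushing the predicted probability of the true class towards 1 no longer drives the loss to 0, which discourages over-confident predictions.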
According to the author:
They claim that by using label smoothing, the top-1 and top-5 error rates are each reduced by about 0.2% (absolute).
Inception Net v3 incorporated all of the above upgrades stated for Inception v2, and in addition used the following:
We will implement Inception-V2. According to the paper,
"The detailed structure of the network, including the sizes of filter banks inside the Inception modules, is given in the supplementary material, given in the model.txt"
However, no model.txt was found, so we have to refer to the TensorFlow implementation.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Convolution followed by batch normalization and ReLU."""

    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):
        super(ConvBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.act(x)
        return x
class InceptionF5(nn.Module):
    """
    From the paper, figure 5 inception module.
    """

    def __init__(self, in_channels):
        super(InceptionF5, self).__init__()
        # 1x1 followed by two stacked 3x3 (the factorized 5x5): 96 channels.
        self.branch1 = nn.Sequential(
            ConvBlock(in_channels, 64, kernel_size=1, stride=1, padding=0),
            ConvBlock(64, 96, kernel_size=3, stride=1, padding=1),
            ConvBlock(96, 96, kernel_size=3, stride=1, padding=1)
        )
        # 1x1 followed by a single 3x3: 64 channels.
        self.branch2 = nn.Sequential(
            ConvBlock(in_channels, 48, kernel_size=1, stride=1, padding=0),
            ConvBlock(48, 64, kernel_size=3, stride=1, padding=1)
        )
        # Pooling followed by 1x1: 64 channels.
        self.branch3 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            ConvBlock(in_channels, 64, kernel_size=1, stride=1, padding=0)
        )
        # Plain 1x1: 64 channels.
        self.branch4 = nn.Sequential(
            ConvBlock(in_channels, 64, kernel_size=1, stride=1, padding=0)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        # Concatenate along the channel axis: 96 + 64 + 64 + 64 = 288 channels.
        return torch.cat([branch1, branch2, branch3, branch4], 1)
class InceptionF6(nn.Module):
    """
    From the paper, figure 6 inception module.
    """

    def __init__(self, in_channels, f_7x7):
        super(InceptionF6, self).__init__()
        self.branch1 = nn.Sequential(
            ConvBlock(in_channels, f_7x7, kernel_size=1, stride=1, padding=0),
            ConvBlock(f_7x7, f_7x7, kernel_size=(1,7), stride=1, padding=(0,3)),
            ConvBlock(f_7x7, f_7x7, kernel_size=(7,1), stride=1, padding=(3,0)),
            ConvBlock(f_7x7, f_7x7, kernel_size=(1,7), stride=1, padding=(0,3)),
            ConvBlock(f_7x7, 192, kernel_size=(7,1), stride=1, padding=(3,0))
        )
        self.branch2 = nn.Sequential(
            ConvBlock(in_channels, f_7x7, kernel_size=1, stride=1, padding=0),
            ConvBlock(f_7x7, f_7x7, kernel_size=(1,7), stride=1, padding=(0,3)),
            ConvBlock(f_7x7, 192, kernel_size=(7,1), stride=1, padding=(3,0))
        )
        self.branch3 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            ConvBlock(in_channels, 192, kernel_size=1, stride=1, padding=0)
        )
        self.branch4 = nn.Sequential(
            ConvBlock(in_channels, 192, kernel_size=1, stride=1, padding=0)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        return torch.cat([branch1, branch2, branch3, branch4], 1)
class InceptionF7(nn.Module):
    """
    From the paper, figure 7 inception module.
    """

    def __init__(self, in_channels):
        super(InceptionF7, self).__init__()
        self.branch1 = nn.Sequential(
            ConvBlock(in_channels, 448, kernel_size=1, stride=1, padding=0),
            ConvBlock(448, 384, kernel_size=(3,3), stride=1, padding=1)
        )
        # The end of branch1 is split into parallel 1x3 and 3x1 convolutions.
        self.branch1_top = ConvBlock(384, 384, kernel_size=(1,3), stride=1, padding=(0,1))
        self.branch1_bot = ConvBlock(384, 384, kernel_size=(3,1), stride=1, padding=(1,0))
        self.branch2 = ConvBlock(in_channels, 384, kernel_size=1, stride=1, padding=0)
        # Same parallel 1x3 / 3x1 split for branch2.
        self.branch2_top = ConvBlock(384, 384, kernel_size=(1,3), stride=1, padding=(0,1))
        self.branch2_bot = ConvBlock(384, 384, kernel_size=(3,1), stride=1, padding=(1,0))
        self.branch3 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            ConvBlock(in_channels, 192, kernel_size=1, stride=1, padding=0)
        )
        self.branch4 = nn.Sequential(
            ConvBlock(in_channels, 320, kernel_size=1, stride=1, padding=0)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch1 = torch.cat([self.branch1_top(branch1), self.branch1_bot(branch1)], 1)
        branch2 = self.branch2(x)
        branch2 = torch.cat([self.branch2_top(branch2), self.branch2_bot(branch2)], 1)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        # 768 + 768 + 192 + 320 = 2048 output channels.
        return torch.cat([branch1, branch2, branch3, branch4], 1)
class InceptionRed(nn.Module):
    """
    From the paper, figure 10 improved pooling operation.
    """

    def __init__(self, in_channels, f_3x3_r, add_ch=0):
        super(InceptionRed, self).__init__()
        self.branch1 = nn.Sequential(
            ConvBlock(in_channels, f_3x3_r, kernel_size=1, stride=1, padding=0),
            ConvBlock(f_3x3_r, 178 + add_ch, kernel_size=3, stride=1, padding=1),
            ConvBlock(178 + add_ch, 178 + add_ch, kernel_size=3, stride=2, padding=0)
        )
        self.branch2 = nn.Sequential(
            ConvBlock(in_channels, f_3x3_r, kernel_size=1, stride=1, padding=0),
            ConvBlock(f_3x3_r, 302 + add_ch, kernel_size=3, stride=2, padding=0)
        )
        self.branch3 = nn.Sequential(
            nn.MaxPool2d(3, stride=2, padding=0)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        return torch.cat([branch1, branch2, branch3], 1)
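To check that the numbers in the reduction block line up, here is a small sketch in plain Python that mirrors its arithmetic without building the module. The helpers `conv_out` and `reduction_output` are made up for this check, and the 35x35 and 17x17 grid sizes assume a 299x299 input as in the paper.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def reduction_output(in_channels, size, add_ch=0):
    """Channels and grid size produced by the reduction block above."""
    # branch1: 1x1 -> 3x3 (padding 1, stride 1) -> 3x3 (stride 2)
    b1 = conv_out(conv_out(size, 3, stride=1, padding=1), 3, stride=2)
    # branch2: 1x1 -> 3x3 (stride 2)
    b2 = conv_out(size, 3, stride=2)
    # branch3: 3x3 max pooling (stride 2), keeps the input channels
    b3 = conv_out(size, 3, stride=2)
    # All branches must share the same grid size to be concatenated.
    assert b1 == b2 == b3
    channels = (178 + add_ch) + (302 + add_ch) + in_channels
    return channels, b1


print(reduction_output(288, 35))             # (768, 17)
print(reduction_output(768, 17, add_ch=16))  # (1280, 8)
```

The results match the input channels expected by the next modules in the network below (768 for the figure 6 modules, 1280 for the first figure 7 module).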
class InceptionAux(nn.Module):
    """
    From the paper, auxiliary classifier
    """

    def __init__(self, in_channels, num_classes):
        super(InceptionAux, self).__init__()
        self.pool = nn.AdaptiveAvgPool2d((4,4))
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1, stride=1, padding=0)
        self.act = nn.ReLU()
        # Flattened features: 128 channels * 4 * 4 = 2048.
        self.fc1 = nn.Linear(2048, 1024)
        self.dropout = nn.Dropout(0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.pool(x)
        x = self.conv(x)
        x = self.act(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = self.act(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
class InceptionV2(nn.Module):
    def __init__(self, num_classes=10):
        super(InceptionV2, self).__init__()
        # Stem: plain convolutions and one max pooling.
        self.conv1 = ConvBlock(3, 32, kernel_size=3, stride=2, padding=0)
        self.conv2 = ConvBlock(32, 32, kernel_size=3, stride=1, padding=0)
        self.conv3 = ConvBlock(32, 64, kernel_size=3, stride=1, padding=1)
        self.pool1 = nn.MaxPool2d(3, stride=2, padding=0)
        self.conv4 = ConvBlock(64, 80, kernel_size=3, stride=1, padding=0)
        self.conv5 = ConvBlock(80, 192, kernel_size=3, stride=2, padding=0)
        self.conv6 = ConvBlock(192, 288, kernel_size=3, stride=1, padding=1)

        # 3 x figure 5 modules, then grid reduction.
        self.inception3a = InceptionF5(288)
        self.inception3b = InceptionF5(288)
        self.inception3c = InceptionF5(288)
        self.inceptionRed1 = InceptionRed(288, f_3x3_r=64, add_ch=0)

        # 5 x figure 6 modules, then grid reduction.
        self.inception4a = InceptionF6(768, f_7x7=128)
        self.inception4b = InceptionF6(768, f_7x7=160)
        self.inception4c = InceptionF6(768, f_7x7=160)
        self.inception4d = InceptionF6(768, f_7x7=160)
        self.inception4e = InceptionF6(768, f_7x7=192)
        self.inceptionRed2 = InceptionRed(768, f_3x3_r=192, add_ch=16)

        # Auxiliary classifier, fed by the output of inception4e.
        self.aux = InceptionAux(768, num_classes)

        # 2 x figure 7 modules, then the classification head.
        self.inception5a = InceptionF7(1280)
        self.inception5b = InceptionF7(2048)
        self.pool6 = nn.AdaptiveAvgPool2d((1,1))
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.pool1(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = self.conv6(x)

        x = self.inception3a(x)
        x = self.inception3b(x)
        x = self.inception3c(x)
        x = self.inceptionRed1(x)

        x = self.inception4a(x)
        x = self.inception4b(x)
        x = self.inception4c(x)
        x = self.inception4d(x)
        x = self.inception4e(x)
        aux = self.aux(x)
        x = self.inceptionRed2(x)

        x = self.inception5a(x)
        x = self.inception5b(x)
        x = self.pool6(x)
        x = self.dropout(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        # Main logits and auxiliary logits.
        return x, aux
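As a sanity check on the stem, the following sketch traces the spatial size through the layers above using the usual output-size formula. It assumes a 299x299 input (the size used in the paper; the input size used for training in this post is not shown here), and the `stem` list simply copies the kernel, stride and padding values from the class above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding), copied from the stem and the two reduction blocks.
stem = [
    ("conv1", 3, 2, 0),
    ("conv2", 3, 1, 0),
    ("conv3", 3, 1, 1),
    ("pool1", 3, 2, 0),
    ("conv4", 3, 1, 0),
    ("conv5", 3, 2, 0),
    ("conv6", 3, 1, 1),
    ("inceptionRed1", 3, 2, 0),
    ("inceptionRed2", 3, 2, 0),
]

size = 299  # assumed input resolution
sizes = {}
for name, kernel, stride, padding in stem:
    size = conv_out(size, kernel, stride, padding)
    sizes[name] = size
    print(f"{name}: {size}x{size}")
# conv1: 149, conv2: 147, conv3: 147, pool1: 73, conv4: 71,
# conv5: 35, conv6: 35, inceptionRed1: 17, inceptionRed2: 8
```

The inception modules themselves keep the grid size unchanged, so under this assumption the three stages run on 35x35, 17x17 and 8x8 grids, and the auxiliary classifier sees a 17x17 feature map.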
train_costs, val_costs = train_model()
[Epoch 1/15]: train-loss = 2.559885 | train-acc = 0.259 | val-loss = 2.215568 | val-acc = 0.359
[Epoch 2/15]: train-loss = 1.992639 | train-acc = 0.424 | val-loss = 1.846719 | val-acc = 0.472
[Epoch 3/15]: train-loss = 1.640984 | train-acc = 0.537 | val-loss = 1.591195 | val-acc = 0.558
[Epoch 4/15]: train-loss = 1.385598 | train-acc = 0.619 | val-loss = 1.439533 | val-acc = 0.614
[Epoch 5/15]: train-loss = 1.168221 | train-acc = 0.684 | val-loss = 1.305238 | val-acc = 0.655
[Epoch 6/15]: train-loss = 1.002453 | train-acc = 0.731 | val-loss = 1.214881 | val-acc = 0.681
[Epoch 7/15]: train-loss = 0.851118 | train-acc = 0.771 | val-loss = 1.217857 | val-acc = 0.687
[Epoch 8/15]: train-loss = 0.711145 | train-acc = 0.811 | val-loss = 1.147356 | val-acc = 0.711
[Epoch 9/15]: train-loss = 0.608349 | train-acc = 0.838 | val-loss = 1.140132 | val-acc = 0.724
[Epoch 10/15]: train-loss = 0.515004 | train-acc = 0.863 | val-loss = 1.154486 | val-acc = 0.735
[Epoch 11/15]: train-loss = 0.437409 | train-acc = 0.885 | val-loss = 1.180348 | val-acc = 0.741
[Epoch 12/15]: train-loss = 0.386996 | train-acc = 0.899 | val-loss = 1.190746 | val-acc = 0.742
[Epoch 13/15]: train-loss = 0.339299 | train-acc = 0.911 | val-loss = 1.171836 | val-acc = 0.749
[Epoch 14/15]: train-loss = 0.293521 | train-acc = 0.923 | val-loss = 1.245975 | val-acc = 0.751
[Epoch 15/15]: train-loss = 0.235749 | train-acc = 0.938 | val-loss = 1.239976 | val-acc = 0.761
nb_test_examples = 10000
correct = 0

model.to(device).eval()

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Make predictions (the model returns the main and the auxiliary logits).
        prediction, _ = model(inputs)

        # Retrieve predictions indexes.
        _, predicted_class = torch.max(prediction, 1)

        # Compute number of correct predictions.
        correct += (predicted_class == labels).float().sum().item()

test_accuracy = correct / nb_test_examples
print('Test accuracy: {}'.format(test_accuracy))
Test accuracy: 0.7418