---
tags: machine-learning
---
# Inception-V1 (GoogLeNet): Summary and Implementation
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/0.png?token=AMAXSKI25HM5VANVHC4XGHS6WMGZY">
</div>
> This post is divided into 2 sections: Summary and Implementation.
>
> We are going to have an in-depth review of the [Going Deeper with Convolutions](https://arxiv.org/pdf/1409.4842.pdf) paper, which introduces the Inception-V1/GoogLeNet architecture.
>
> The implementation uses PyTorch as its framework. For the full implementation, please refer to this [repository](https://github.com/3outeille/Research-Paper-Summary/tree/master/src/architecture/inception_v1/pytorch).
>
> Also, if you want to read other "Summary and Implementation", feel free to check them at my [blog](https://ferdinandmom.engineer/deep-learning/).
# I) Summary
- The paper [Going Deeper with Convolutions](https://arxiv.org/pdf/1409.4842.pdf) introduces the first version of the Inception model, called GoogLeNet.
- During ILSVRC-2014, it achieved 1st place in the classification task (top-5 test error = 6.67%).
- It has around 6.8 million parameters (without the auxiliary classifiers), which is about 9x fewer than AlexNet (the ILSVRC-2012 winner) and about 20x fewer than its competitor VGG-16.
- In most standard network architectures, it is not clear why or when to perform a max-pooling operation rather than a convolution. For example, in AlexNet convolution and max-pooling operations follow each other, whereas in VGGNet 3 convolutions in a row are followed by 1 max-pooling layer.
- ==Thus, **the idea behind GoogLeNet is to use all of these operations at the same time**. It computes multiple kernels of different sizes over the same input map in parallel and concatenates their results into a single output. This is called an **Inception module**.==
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/1.png?token=AMAXSKKTCOTQ7ZEOA3YO2DK6WMGYI"
height="100%" width="100%">
</div>
- Consider the following example (a 5x5 convolution with 32 filters over a 28x28x192 input):
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/2_a.png?token=AMAXSKOX3N3ZM5CZF4RAEA26WMGKW"
height="50%" width="70%">
</div>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/2_b.png?token=AMAXSKJWSXQWWKA2ZN44TYS6WMGKY"
height="50%" width="90%">
</div>
- The naive approach is computationally expensive:
    - Computation cost = ((28 x 28 x 5 x 5) x 192) x 32 $\simeq$ **120 Mil**
    - We perform (28 x 28 x 5 x 5) operations over 192 input channels for each of the 32 filters.
- The dimension-reduction approach is **less** computationally expensive:
    - 1st layer computation cost = ((28 x 28 x 1 x 1) x 192) x 16 $\simeq$ 2.4 Mil
    - 2nd layer computation cost = ((28 x 28 x 5 x 5) x 16) x 32 $\simeq$ 10 Mil
    - Total computation cost $\simeq$ **12.4 Mil**, roughly a 10x reduction (reproduced in the sketch below).
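These figures can be reproduced with a few lines of Python (counting multiply operations only; this is a back-of-the-envelope check, not a profiler):
```python
def conv_mults(h, w, k, c_in, c_out):
    """Multiplications for a k x k convolution producing an h x w output map:
    each output pixel costs k * k * c_in multiplications, for c_out filters."""
    return h * w * k * k * c_in * c_out

naive = conv_mults(28, 28, 5, 192, 32)
bottleneck = conv_mults(28, 28, 1, 192, 16)   # 1x1 reduction layer
conv5x5 = conv_mults(28, 28, 5, 16, 32)       # 5x5 on the reduced map

print(f"naive 5x5:      {naive:,}")                  # 120,422,400
print(f"1x1 reduction:  {bottleneck:,}")             # 2,408,448
print(f"5x5 reduced:    {conv5x5:,}")                # 10,035,200
print(f"total reduced:  {bottleneck + conv5x5:,}")   # 12,443,648
```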
---
Here is its architecture:
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/3.png?token=AMAXSKLECW7TXM42VJERNRS6WMGK2"
height="100%" width="100%">
</div>
- There are:
    - 9 Inception modules (red boxes).
    - A global average pooling layer used instead of a fully-connected layer.
        - This makes the network easier to adapt and fine-tune.
    - 2 auxiliary classifiers (green boxes).
        - Their role is to inject the classification loss into intermediate layers, helping to ensure that the features computed in the middle of the network are discriminative enough.
        - It turns out that softmax0 and softmax1 have a regularization effect.
        - During training, their losses are added to the total loss with a discount weight (each auxiliary loss is weighted by 0.3), as shown in the sketch after this list.
        - During inference, they are discarded.
        - Structure:
            - Average pooling layer with 5×5 filter size and stride 3, resulting in an output of:
                - 4x4x512 for the 1st green box.
                - 4x4x528 for the 2nd green box.
            - 128 1x1 convolutions + ReLU.
            - Fully-connected layer with 1024 units + ReLU.
            - Dropout = 70%.
            - Linear layer (1000 classes) + Softmax.
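A minimal sketch of how the discounted auxiliary losses combine during training (the 0.3 weight comes from the paper; the tensors here are dummy stand-ins for the three outputs the model returns in section II):
```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Dummy stand-ins: in section II the model returns (logits, aux1, aux2).
logits = torch.randn(8, 1000)
aux1 = torch.randn(8, 1000)
aux2 = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))

# Auxiliary losses are discounted by 0.3 and only used during training.
total_loss = criterion(logits, labels) \
           + 0.3 * criterion(aux1, labels) \
           + 0.3 * criterion(aux2, labels)
```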
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/4.png?token=AMAXSKOVSG2CINH3G6SS2N26WMGK4"
height="100%" width="80%">
</div>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/5.png?token=AMAXSKM7XU4DLG6LR5HGYZC6WMGK4"
height="100%" width="100%">
</div>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/6.png?token=AMAXSKOEL3QIPOMOEPVLVEK6WMGYQ"
height="100%" width="100%">
</div>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/7.png?token=AMAXSKP2IXA5AYCOJT664L26WMGYQ"
height="100%" width="100%">
</div>
<br>
# II) Implementation
### 1) Architecture build
```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution + batch norm + ReLU.

    Note: batch normalization is used here for training stability even
    though the original GoogLeNet paper predates it.
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):
        super(ConvBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.act(x)
        return x
```
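A quick shape check of the block (the input is a dummy tensor; 7x7/stride-2 halves the spatial resolution, as in the network's stem):
```python
block = ConvBlock(3, 64, kernel_size=7, stride=2, padding=3)
out = block(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 64, 112, 112])
```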
```python
class InceptionModule(nn.Module):
    """Four parallel branches whose outputs are concatenated along channels."""
    def __init__(self, in_channels, f_1x1, f_3x3_r, f_3x3, f_5x5_r, f_5x5, f_pp):
        super(InceptionModule, self).__init__()
        # Branch 1: 1x1 convolution.
        self.branch1 = nn.Sequential(
            ConvBlock(in_channels, f_1x1, kernel_size=1, stride=1, padding=0)
        )
        # Branch 2: 1x1 reduction followed by a 3x3 convolution.
        self.branch2 = nn.Sequential(
            ConvBlock(in_channels, f_3x3_r, kernel_size=1, stride=1, padding=0),
            ConvBlock(f_3x3_r, f_3x3, kernel_size=3, stride=1, padding=1)
        )
        # Branch 3: 1x1 reduction followed by a 5x5 convolution.
        self.branch3 = nn.Sequential(
            ConvBlock(in_channels, f_5x5_r, kernel_size=1, stride=1, padding=0),
            ConvBlock(f_5x5_r, f_5x5, kernel_size=5, stride=1, padding=2)
        )
        # Branch 4: 3x3 max pooling followed by a 1x1 projection.
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1, ceil_mode=True),
            ConvBlock(in_channels, f_pp, kernel_size=1, stride=1, padding=0)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        # Concatenate along the channel dimension.
        return torch.cat([branch1, branch2, branch3, branch4], 1)
```
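As a sanity check, module 3a's filter counts should concatenate to 64 + 128 + 32 + 32 = 256 output channels while preserving the spatial size:
```python
module = InceptionModule(in_channels=192, f_1x1=64, f_3x3_r=96, f_3x3=128,
                         f_5x5_r=16, f_5x5=32, f_pp=32)
out = module(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28])
```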
```python
class InceptionAux(nn.Module):
    """Auxiliary classifier attached to an intermediate feature map."""
    def __init__(self, in_channels, num_classes):
        super(InceptionAux, self).__init__()
        # Adaptive average pooling to a fixed 4x4 map (the paper's 5x5,
        # stride-3 average pooling also yields 4x4 at these stages).
        self.pool = nn.AdaptiveAvgPool2d((4,4))
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1, stride=1, padding=0)
        self.act = nn.ReLU()
        self.fc1 = nn.Linear(2048, 1024)  # 128 channels x 4 x 4 = 2048
        self.dropout = nn.Dropout(0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.pool(x)
        x = self.conv(x)
        x = self.act(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = self.act(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
```
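And a shape check for the first auxiliary head (after inception 4a the feature map is 512x14x14 for a 224x224 input):
```python
aux = InceptionAux(512, num_classes=10)
out = aux(torch.randn(1, 512, 14, 14))
print(out.shape)  # torch.Size([1, 10])
```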
```python
class GoogLeNet(nn.Module):
    def __init__(self, num_classes=10):
        super(GoogLeNet, self).__init__()
        # Stem: conv -> pool -> conv -> conv -> pool.
        self.conv1 = ConvBlock(3, 64, kernel_size=7, stride=2, padding=3)
        self.pool1 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)
        self.conv2 = ConvBlock(64, 64, kernel_size=1, stride=1, padding=0)
        self.conv3 = ConvBlock(64, 192, kernel_size=3, stride=1, padding=1)
        self.pool3 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)
        # Stage 3: inception 3a and 3b.
        self.inception3A = InceptionModule(in_channels=192,
                                           f_1x1=64, f_3x3_r=96, f_3x3=128,
                                           f_5x5_r=16, f_5x5=32, f_pp=32)
        self.inception3B = InceptionModule(in_channels=256,
                                           f_1x1=128, f_3x3_r=128, f_3x3=192,
                                           f_5x5_r=32, f_5x5=96, f_pp=64)
        self.pool4 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)
        # Stage 4: inception 4a to 4e.
        self.inception4A = InceptionModule(in_channels=480,
                                           f_1x1=192, f_3x3_r=96, f_3x3=208,
                                           f_5x5_r=16, f_5x5=48, f_pp=64)
        self.inception4B = InceptionModule(in_channels=512,
                                           f_1x1=160, f_3x3_r=112, f_3x3=224,
                                           f_5x5_r=24, f_5x5=64, f_pp=64)
        self.inception4C = InceptionModule(in_channels=512,
                                           f_1x1=128, f_3x3_r=128, f_3x3=256,
                                           f_5x5_r=24, f_5x5=64, f_pp=64)
        self.inception4D = InceptionModule(in_channels=512,
                                           f_1x1=112, f_3x3_r=144, f_3x3=288,
                                           f_5x5_r=32, f_5x5=64, f_pp=64)
        self.inception4E = InceptionModule(in_channels=528,
                                           f_1x1=256, f_3x3_r=160, f_3x3=320,
                                           f_5x5_r=32, f_5x5=128, f_pp=128)
        self.pool5 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)
        # Stage 5: inception 5a and 5b.
        self.inception5A = InceptionModule(in_channels=832,
                                           f_1x1=256, f_3x3_r=160, f_3x3=320,
                                           f_5x5_r=32, f_5x5=128, f_pp=128)
        self.inception5B = InceptionModule(in_channels=832,
                                           f_1x1=384, f_3x3_r=192, f_3x3=384,
                                           f_5x5_r=48, f_5x5=128, f_pp=128)
        # Head: global average pooling instead of a fully-connected layer.
        self.pool6 = nn.AdaptiveAvgPool2d((1,1))
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(1024, num_classes)
        # Auxiliary classifiers attached after inception 4a and 4d.
        self.aux4A = InceptionAux(512, num_classes)
        self.aux4D = InceptionAux(528, num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.pool3(x)
        x = self.inception3A(x)
        x = self.inception3B(x)
        x = self.pool4(x)
        x = self.inception4A(x)
        aux1 = self.aux4A(x)  # 1st auxiliary output
        x = self.inception4B(x)
        x = self.inception4C(x)
        x = self.inception4D(x)
        aux2 = self.aux4D(x)  # 2nd auxiliary output
        x = self.inception4E(x)
        x = self.pool5(x)
        x = self.inception5A(x)
        x = self.inception5B(x)
        x = self.pool6(x)
        x = torch.flatten(x, 1)
        x = self.dropout(x)
        x = self.fc(x)
        return x, aux1, aux2
```
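An end-to-end shape check on a dummy batch at the paper's 224x224 input size:
```python
model = GoogLeNet(num_classes=10)
logits, aux1, aux2 = model(torch.randn(2, 3, 224, 224))
print(logits.shape, aux1.shape, aux2.shape)
# torch.Size([2, 10]) torch.Size([2, 10]) torch.Size([2, 10])
```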
### 2) Training on CIFAR-10
```python
train_costs, val_costs = train_model()
```
```
[Epoch 1/15]: train-loss = 2.376666 | train-acc = 0.462 | val-loss = 1.732332 | val-acc = 0.617
[Epoch 2/15]: train-loss = 1.534975 | train-acc = 0.665 | val-loss = 1.419659 | val-acc = 0.691
[Epoch 3/15]: train-loss = 1.155955 | train-acc = 0.756 | val-loss = 1.148954 | val-acc = 0.758
[Epoch 4/15]: train-loss = 0.888322 | train-acc = 0.817 | val-loss = 1.016156 | val-acc = 0.790
[Epoch 5/15]: train-loss = 0.727873 | train-acc = 0.852 | val-loss = 1.007011 | val-acc = 0.796
[Epoch 6/15]: train-loss = 0.566593 | train-acc = 0.887 | val-loss = 0.955890 | val-acc = 0.812
[Epoch 7/15]: train-loss = 0.449871 | train-acc = 0.912 | val-loss = 0.940923 | val-acc = 0.820
[Epoch 8/15]: train-loss = 0.358857 | train-acc = 0.931 | val-loss = 0.970173 | val-acc = 0.827
[Epoch 9/15]: train-loss = 0.282615 | train-acc = 0.947 | val-loss = 0.998978 | val-acc = 0.826
[Epoch 10/15]: train-loss = 0.214605 | train-acc = 0.960 | val-loss = 1.033067 | val-acc = 0.836
[Epoch 11/15]: train-loss = 0.188588 | train-acc = 0.964 | val-loss = 1.048824 | val-acc = 0.838
[Epoch 12/15]: train-loss = 0.156200 | train-acc = 0.972 | val-loss = 1.125927 | val-acc = 0.832
[Epoch 13/15]: train-loss = 0.142177 | train-acc = 0.974 | val-loss = 1.076445 | val-acc = 0.838
[Epoch 14/15]: train-loss = 0.109996 | train-acc = 0.980 | val-loss = 1.123746 | val-acc = 0.838
[Epoch 15/15]: train-loss = 0.110901 | train-acc = 0.980 | val-loss = 1.147809 | val-acc = 0.839
```
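`train_model` itself lives in the linked repository. Below is a minimal sketch of such a loop; the Adam optimizer, learning rate, and bare `ToTensor` transform are assumptions for illustration, not the repository's exact recipe:
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Assumption: no augmentation/normalization; the repository may differ.
transform = transforms.ToTensor()
train_set = torchvision.datasets.CIFAR10(root='./data', train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = GoogLeNet(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # assumption

for epoch in range(15):
    model.train()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        logits, aux1, aux2 = model(inputs)
        # Main loss plus the auxiliary losses discounted by 0.3.
        loss = criterion(logits, labels) \
             + 0.3 * criterion(aux1, labels) \
             + 0.3 * criterion(aux2, labels)
        loss.backward()
        optimizer.step()
```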
### 3) Evaluating the model
```python
nb_test_examples = 10000  # size of the CIFAR-10 test set
correct = 0

model = model.to(device)
model.eval()

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Make predictions (the auxiliary outputs are ignored at inference).
        prediction, _, _ = model(inputs)

        # Retrieve the indexes of the predicted classes.
        _, predicted_class = torch.max(prediction, 1)

        # Count the number of correct predictions.
        correct += (predicted_class == labels).float().sum().item()

test_accuracy = correct / nb_test_examples
print('Test accuracy: {}'.format(test_accuracy))
```
```
Test accuracy: 0.8099
```