---
tags: machine-learning
---

# Inception-V1 (GoogLeNet): Summary and Implementation

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/0.png?token=AMAXSKI25HM5VANVHC4XGHS6WMGZY">
</div>

> This post is divided into 2 sections: Summary and Implementation.
>
> We are going to have an in-depth review of the [Going Deeper with Convolutions](https://arxiv.org/pdf/1409.4842.pdf) paper, which introduces the Inception-V1/GoogLeNet architecture.
>
> The implementation uses PyTorch as its framework. To see the full implementation, please refer to this [repository](https://github.com/3outeille/Research-Paper-Summary/tree/master/src/architecture/inception_v1/pytorch).
>
> Also, if you want to read other "Summary and Implementation" posts, feel free to check them out at my [blog](https://ferdinandmom.engineer/deep-learning/).

# I) Summary

- The paper [Going Deeper with Convolutions](https://arxiv.org/pdf/1409.4842.pdf) introduces the first version of the Inception model, called GoogLeNet.
- During ILSVRC-2014, it achieved 1st place in the classification task (top-5 test error = 6.67%).
- It has around 6.8 million parameters (without auxiliary classifiers), which is 9x fewer than AlexNet (ILSVRC-2012 winner) and 20x fewer than its competitor VGG-16.
- In most standard network architectures, it is not clear why and when to use a max-pooling operation rather than a convolutional operation. For example, in AlexNet, convolutional and max-pooling operations follow each other, whereas in VGGNet, we have 3 convolutional operations in a row and then 1 max-pooling layer.
- ==Thus, **the idea behind GoogLeNet is to use all the operations at the same time**. It computes multiple kernels of different sizes over the same input map in parallel and concatenates their results into a single output. This is called an **Inception module**.==

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/1.png?token=AMAXSKKTCOTQ7ZEOA3YO2DK6WMGYI" height="100%" width="100%">
</div>

- Consider the following:

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/2_a.png?token=AMAXSKOX3N3ZM5CZF4RAEA26WMGKW" height="50%" width="70%">
</div>

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/2_b.png?token=AMAXSKJWSXQWWKA2ZN44TYS6WMGKY" height="50%" width="90%">
</div>

- The naive approach is computationally expensive:
    - Computation cost = ((28 x 28 x 5 x 5) x 192) x 32 $\simeq$ **120 Mil**
    - We perform (28 x 28 x 5 x 5) operations along 192 channels for each of the 32 filters.
- The dimension reduction approach is **less** computationally expensive:
    - 1st layer computation cost = ((28 x 28 x 1 x 1) x 192) x 16 $\simeq$ 2.4 Mil
    - 2nd layer computation cost = ((28 x 28 x 5 x 5) x 16) x 32 $\simeq$ 10 Mil
    - Total computation cost $\simeq$ **12.4 Mil**
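To sanity-check these numbers, here is a quick back-of-the-envelope computation (my own addition, counting kernel multiplications only and ignoring biases and additions):

```python
# Multiplication counts for a 28x28x192 input map.

def conv_cost(out_h, out_w, k, in_channels, n_filters):
    """Each output pixel of each filter does k*k multiplications per input channel."""
    return out_h * out_w * k * k * in_channels * n_filters

# Naive: 32 filters of 5x5 applied directly on 192 channels.
naive = conv_cost(28, 28, 5, 192, 32)

# Dimension reduction: 16 filters of 1x1 first, then 32 filters of 5x5 on 16 channels.
reduce_1x1 = conv_cost(28, 28, 1, 192, 16)
conv_5x5 = conv_cost(28, 28, 5, 16, 32)

print(f"naive:   {naive:,}")                  # 120,422,400 ~ 120 Mil
print(f"reduced: {reduce_1x1 + conv_5x5:,}")  # 12,443,648  ~ 12.4 Mil
```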
---

Here is its architecture:

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/3.png?token=AMAXSKLECW7TXM42VJERNRS6WMGK2" height="100%" width="100%">
</div>

- There are:
    - 9 Inception modules (red box).
    - Global average pooling used instead of a fully-connected layer.
        - It makes adapting and fine-tuning the network easier.
    - 2 auxiliary classifiers with softmax (green box).
        - Their role is to push the network toward its goal and to help ensure that the intermediate features are good enough for the network to learn.
        - It turns out that softmax0 and softmax1 give a regularization effect.
        - During training, their losses get added to the total loss with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3).
        - During inference, they are discarded.
        - Structure:
            - Average pooling layer with 5x5 filter size and stride 3, resulting in an output of size:
                - 4x4x512 for the 1st green box.
                - 4x4x528 for the 2nd green box.
            - 128 1x1 convolutions + ReLU.
            - Fully-connected layer with 1024 units + ReLU.
            - Dropout = 70%.
            - Linear layer (1000 classes) + Softmax.

<br>

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/4.png?token=AMAXSKOVSG2CINH3G6SS2N26WMGK4" height="100%" width="80%">
</div>

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/5.png?token=AMAXSKM7XU4DLG6LR5HGYZC6WMGK4" height="100%" width="100%">
</div>

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/6.png?token=AMAXSKOEL3QIPOMOEPVLVEK6WMGYQ" height="100%" width="100%">
</div>

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/research-paper-summary/inception-v1/7.png?token=AMAXSKP2IXA5AYCOJT664L26WMGYQ" height="100%" width="100%">
</div>

<br>

# II) Implementation

### 1) Architecture build

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):
        super(ConvBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.act(x)
        return x
```

```python
class InceptionModule(nn.Module):
    def __init__(self, in_channels, f_1x1, f_3x3_r, f_3x3, f_5x5_r, f_5x5, f_pp):
        super(InceptionModule, self).__init__()

        # 1x1 convolution branch.
        self.branch1 = nn.Sequential(
            ConvBlock(in_channels, f_1x1, kernel_size=1, stride=1, padding=0)
        )

        # 1x1 dimension reduction followed by 3x3 convolution.
        self.branch2 = nn.Sequential(
            ConvBlock(in_channels, f_3x3_r, kernel_size=1, stride=1, padding=0),
            ConvBlock(f_3x3_r, f_3x3, kernel_size=3, stride=1, padding=1)
        )

        # 1x1 dimension reduction followed by 5x5 convolution.
        self.branch3 = nn.Sequential(
            ConvBlock(in_channels, f_5x5_r, kernel_size=1, stride=1, padding=0),
            ConvBlock(f_5x5_r, f_5x5, kernel_size=5, stride=1, padding=2)
        )

        # 3x3 max-pooling followed by 1x1 projection.
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1, ceil_mode=True),
            ConvBlock(in_channels, f_pp, kernel_size=1, stride=1, padding=0)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        # Concatenate the four branches along the channel dimension.
        return torch.cat([branch1, branch2, branch3, branch4], 1)
```

```python
class InceptionAux(nn.Module):
    def __init__(self, in_channels, num_classes):
        super(InceptionAux, self).__init__()
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1, stride=1, padding=0)
        self.act = nn.ReLU()
        self.fc1 = nn.Linear(2048, 1024)  # 128 channels * 4 * 4 = 2048
        self.dropout = nn.Dropout(0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.pool(x)
        x = self.conv(x)
        x = self.act(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = self.act(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
```
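As a quick smoke test (my own addition, not from the original post), we can check that an Inception module with the 3a configuration turns a 192-channel map into a 256-channel map (64 + 128 + 32 + 32), and that the auxiliary head produces class logits:

```python
# Shape check against the paper's table for inception (3a).
x = torch.randn(1, 192, 28, 28)
module = InceptionModule(in_channels=192, f_1x1=64, f_3x3_r=96, f_3x3=128,
                         f_5x5_r=16, f_5x5=32, f_pp=32)
print(module(x).shape)  # torch.Size([1, 256, 28, 28]) -> 64 + 128 + 32 + 32 channels

# The auxiliary head maps any spatial size down to logits.
aux = InceptionAux(in_channels=512, num_classes=10)
print(aux(torch.randn(1, 512, 14, 14)).shape)  # torch.Size([1, 10])
```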
```python
class GoogLeNet(nn.Module):
    def __init__(self, num_classes=10):
        super(GoogLeNet, self).__init__()

        # Stem: conv and pooling layers before the Inception modules.
        self.conv1 = ConvBlock(3, 64, kernel_size=7, stride=2, padding=3)
        self.pool1 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)
        self.conv2 = ConvBlock(64, 64, kernel_size=1, stride=1, padding=0)
        self.conv3 = ConvBlock(64, 192, kernel_size=3, stride=1, padding=1)
        self.pool3 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)

        self.inception3A = InceptionModule(in_channels=192, f_1x1=64, f_3x3_r=96, f_3x3=128, f_5x5_r=16, f_5x5=32, f_pp=32)
        self.inception3B = InceptionModule(in_channels=256, f_1x1=128, f_3x3_r=128, f_3x3=192, f_5x5_r=32, f_5x5=96, f_pp=64)
        self.pool4 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)

        self.inception4A = InceptionModule(in_channels=480, f_1x1=192, f_3x3_r=96, f_3x3=208, f_5x5_r=16, f_5x5=48, f_pp=64)
        self.inception4B = InceptionModule(in_channels=512, f_1x1=160, f_3x3_r=112, f_3x3=224, f_5x5_r=24, f_5x5=64, f_pp=64)
        self.inception4C = InceptionModule(in_channels=512, f_1x1=128, f_3x3_r=128, f_3x3=256, f_5x5_r=24, f_5x5=64, f_pp=64)
        self.inception4D = InceptionModule(in_channels=512, f_1x1=112, f_3x3_r=144, f_3x3=288, f_5x5_r=32, f_5x5=64, f_pp=64)
        self.inception4E = InceptionModule(in_channels=528, f_1x1=256, f_3x3_r=160, f_3x3=320, f_5x5_r=32, f_5x5=128, f_pp=128)
        self.pool5 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)

        self.inception5A = InceptionModule(in_channels=832, f_1x1=256, f_3x3_r=160, f_3x3=320, f_5x5_r=32, f_5x5=128, f_pp=128)
        self.inception5B = InceptionModule(in_channels=832, f_1x1=384, f_3x3_r=192, f_3x3=384, f_5x5_r=48, f_5x5=128, f_pp=128)

        # Global average pooling instead of a fully-connected layer.
        self.pool6 = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(1024, num_classes)

        # Auxiliary classifiers, attached after inception4A and inception4D.
        self.aux4A = InceptionAux(512, num_classes)
        self.aux4D = InceptionAux(528, num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.pool3(x)

        x = self.inception3A(x)
        x = self.inception3B(x)
        x = self.pool4(x)

        x = self.inception4A(x)
        aux1 = self.aux4A(x)

        x = self.inception4B(x)
        x = self.inception4C(x)
        x = self.inception4D(x)
        aux2 = self.aux4D(x)

        x = self.inception4E(x)
        x = self.pool5(x)

        x = self.inception5A(x)
        x = self.inception5B(x)
        x = self.pool6(x)

        x = torch.flatten(x, 1)
        x = self.dropout(x)
        x = self.fc(x)

        return x, aux1, aux2
```
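Since the model returns three outputs, the training loop has to combine them. Below is a minimal sketch of one training step, assuming `model`, a `criterion` (e.g. `nn.CrossEntropyLoss()`), an `optimizer`, and an `(inputs, labels)` batch already exist; the actual `train_model()` used below lives in the repository and may differ in details. The auxiliary losses get the paper's 0.3 discount weight:

```python
# One training step: main loss plus 0.3-weighted auxiliary losses (training only).
out, aux1, aux2 = model(inputs)
loss = criterion(out, labels) \
     + 0.3 * (criterion(aux1, labels) + criterion(aux2, labels))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```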
### 2) Training on CIFAR-10

```python
train_costs, val_costs = train_model()
```

```
[Epoch 1/15]: train-loss = 2.376666 | train-acc = 0.462 | val-loss = 1.732332 | val-acc = 0.617
[Epoch 2/15]: train-loss = 1.534975 | train-acc = 0.665 | val-loss = 1.419659 | val-acc = 0.691
[Epoch 3/15]: train-loss = 1.155955 | train-acc = 0.756 | val-loss = 1.148954 | val-acc = 0.758
[Epoch 4/15]: train-loss = 0.888322 | train-acc = 0.817 | val-loss = 1.016156 | val-acc = 0.790
[Epoch 5/15]: train-loss = 0.727873 | train-acc = 0.852 | val-loss = 1.007011 | val-acc = 0.796
[Epoch 6/15]: train-loss = 0.566593 | train-acc = 0.887 | val-loss = 0.955890 | val-acc = 0.812
[Epoch 7/15]: train-loss = 0.449871 | train-acc = 0.912 | val-loss = 0.940923 | val-acc = 0.820
[Epoch 8/15]: train-loss = 0.358857 | train-acc = 0.931 | val-loss = 0.970173 | val-acc = 0.827
[Epoch 9/15]: train-loss = 0.282615 | train-acc = 0.947 | val-loss = 0.998978 | val-acc = 0.826
[Epoch 10/15]: train-loss = 0.214605 | train-acc = 0.960 | val-loss = 1.033067 | val-acc = 0.836
[Epoch 11/15]: train-loss = 0.188588 | train-acc = 0.964 | val-loss = 1.048824 | val-acc = 0.838
[Epoch 12/15]: train-loss = 0.156200 | train-acc = 0.972 | val-loss = 1.125927 | val-acc = 0.832
[Epoch 13/15]: train-loss = 0.142177 | train-acc = 0.974 | val-loss = 1.076445 | val-acc = 0.838
[Epoch 14/15]: train-loss = 0.109996 | train-acc = 0.980 | val-loss = 1.123746 | val-acc = 0.838
[Epoch 15/15]: train-loss = 0.110901 | train-acc = 0.980 | val-loss = 1.147809 | val-acc = 0.839
```

### 3) Evaluating model

```python
nb_test_examples = 10000
correct = 0

model.eval().cuda()

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Make predictions.
        prediction, _, _ = model(inputs)

        # Retrieve predictions indexes.
        _, predicted_class = torch.max(prediction.data, 1)

        # Compute number of correct predictions.
        correct += (predicted_class == labels).float().sum().item()

test_accuracy = correct / nb_test_examples
print('Test accuracy: {}'.format(test_accuracy))
```

```
Test accuracy: 0.8099
```