Also known as GoogLeNet, this is a 22-layer network that won the 2014 ILSVRC classification competition.
The design goal was to expand the width and depth of the network on top of existing architectures. The motivation: increasing a network's depth, overall size, and training-set size generally improves performance, but it also inflates the number of parameters (making overfitting easy), wastes computing resources when the added capacity is used inefficiently, and runs into the fact that producing high-quality datasets is expensive.
Its design philosophy is to replace fully connected structures with sparse ones, and to try to introduce this sparsity even inside the convolutions.
The main idea is to design an Inception module and increase the depth and width of the network by repeatedly stacking these modules; GoogLeNet mainly extends them in depth. Each Inception module contains four parallel branches whose outputs are concatenated at the end. In the paper, 1x1 convolutions are used mainly for dimensionality reduction to avoid computational bottlenecks. Auxiliary softmax losses are also attached to branches off earlier layers to counter the vanishing gradient problem.
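As a concrete illustration, here is a minimal PyTorch sketch of an Inception-v1-style module with the four branches and 1x1 reductions described above. This is a sketch rather than the original implementation; the channel counts are illustrative (the usage example borrows the paper's "inception 3a" configuration).

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception-v1 style block: four parallel branches concatenated on the
    channel axis. 1x1 convolutions reduce dimensionality before the more
    expensive 3x3 and 5x5 convolutions."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(              # 1x1 conv branch
            nn.Conv2d(in_ch, c1, kernel_size=1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(              # 1x1 reduce -> 3x3 conv
            nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(              # 1x1 reduce -> 5x5 conv
            nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        self.branch4 = nn.Sequential(              # 3x3 max pool -> 1x1 projection
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate the four branch outputs along the channel dimension.
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

# Example: the "inception 3a" configuration from the paper.
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = block(torch.randn(1, 192, 28, 28))   # -> torch.Size([1, 256, 28, 28])
```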
The most direct way to improve the performance of deep neural networks is to increase their size: both their depth (the number of layers) and their width (the number of units per layer).
Another easy and safe way is to increase the size of the training data.
Larger models mean more parameters, which makes it easier for the network to overfit, especially when the number of labeled samples in the training set is limited. At the same time, producing high-quality training sets is tricky and expensive, especially when human experts are required, and even then the labels can carry a significant error rate.
Another drawback is that uniformly increasing the size of the network increases the use of computing resources. For example, if two convolutional layers are chained, any uniform increase in the number of their filters causes a quadratic increase in computation. If the added capacity is used inefficiently, for instance if most of the weights end up close to zero, a lot of that computation is wasted. Since the computational budget is always finite, an efficient distribution of computing resources is preferable to indiscriminately increasing the model size, even when the main objective is to improve result quality. The fundamental way to address both problems is to move from fully connected to sparsely connected architectures, even inside the convolutions.
The layer-by-layer details of the GoogLeNet network are shown in the following table:
The momentum is set to 0.9, and the learning rate is decreased by 4% every 8 epochs.
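A minimal sketch of this schedule in PyTorch, assuming a placeholder model and an arbitrary base learning rate (the original asynchronous training setup is not reproduced here):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)   # placeholder standing in for GoogLeNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Multiply the learning rate by 0.96 (i.e. decrease it by 4%) every 8 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.96)

for epoch in range(100):
    # ... run one training epoch here ...
    scheduler.step()
```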
Seven versions of the model were trained and ensembled; to make the sampling more varied, some were trained on small crops and some on large crops.
Factors that helped the model train well include: sampling patches of various sizes from the image, with the patch area distributed uniformly between 8% and 100% of the image and the aspect ratio between 3/4 and 4/3; photometric (illumination) distortions, which help reduce overfitting; and randomly chosen interpolation methods when resizing the images.
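A hedged sketch of such an augmentation pipeline using torchvision; the ColorJitter strengths and the set of interpolation modes are assumptions for illustration, not values taken from the paper:

```python
import random
from torchvision import transforms
from torchvision.transforms import InterpolationMode

class RandomInterpolationCrop:
    """RandomResizedCrop that also picks a random interpolation method,
    mirroring the random resizing interpolation described above."""
    def __init__(self, size=224):
        self.size = size
        self.modes = [InterpolationMode.BILINEAR, InterpolationMode.BICUBIC,
                      InterpolationMode.NEAREST]

    def __call__(self, img):
        crop = transforms.RandomResizedCrop(
            self.size,
            scale=(0.08, 1.0),        # patch area between 8% and 100% of the image
            ratio=(3 / 4, 4 / 3),     # aspect ratio between 3/4 and 4/3
            interpolation=random.choice(self.modes))
        return crop(img)

train_transform = transforms.Compose([
    RandomInterpolationCrop(224),
    transforms.ColorJitter(0.4, 0.4, 0.4),  # photometric (illumination) distortions
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```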
This architecture is a landmark in the development of deep network models. Its most prominent contribution is the Batch Normalization (BN) layer, which normalizes each layer's output into a relatively uniform range. Without BN, the input and output ranges of different layers can differ greatly, so each layer would effectively need a different learning rate; the BN layer avoids this situation. This accelerates training and also acts as a regularizer to some extent, reducing overfitting. In the subsequent development of network models, most architectures have added BN layers in one form or another.
In this paper, BN is applied to the pre-activation values, i.e. before the input to the activation function. In addition, following VGG, two 3x3 convolutions replace the 5x5 convolutions in the Inception module to reduce the number of parameters and speed up computation.
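A minimal sketch of the Conv -> BN -> ReLU ordering, assuming PyTorch; the channel counts are illustrative:

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel_size, **kwargs):
    """Convolution -> BatchNorm -> ReLU, with normalization applied to the
    pre-activation values as described above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, bias=False, **kwargs),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Example: a 3x3 Conv-BN-ReLU block with padding to preserve spatial size.
block = conv_bn_relu(64, 96, 3, padding=1)
```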
This architecture focuses on how to replace a large convolution kernel with two or more smaller ones, and it also introduces asymmetric (one-dimensional) convolutions. It further proposes remedies for the loss of spatial information that pooling layers can cause, along with ideas such as label smoothing and BN-auxiliary (batch normalization in the auxiliary classifiers).
Experiments were performed on inputs of different resolutions. The results show that although low-resolution inputs take longer to train, the accuracy they reach is not much lower than with high-resolution inputs.
Overall, the computational cost is reduced while the accuracy of the network is improved.
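Of the ideas listed above, label smoothing is easy to show in isolation. A minimal PyTorch sketch (the smoothing factor 0.1 follows the paper; the logits and targets are dummy data):

```python
import torch
import torch.nn as nn

# Label smoothing: each target distribution mixes the one-hot label with a
# uniform distribution, q'(k) = (1 - eps) * q(k) + eps / K.
# PyTorch >= 1.10 supports this directly; eps = 0.1 matches Inception-v3.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 1000)            # batch of 8, 1000 classes
targets = torch.randint(0, 1000, (8,))
loss = criterion(logits, targets)
```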
The paper describes several design principles derived from extensive experiments with different convolutional architectures. At this point their utility is still partly speculative, and additional experiments will be needed to assess their accuracy and range of validity.
Avoid representational bottlenecks. A representational bottleneck arises when a large proportion of the features are compressed away in an intermediate layer (for example by an aggressive pooling operation); this loses feature and spatial information. Although pooling is important in CNNs, there are ways to mitigate this loss as much as possible (author's note: the later dilated/atrous convolutions are one such method).
Higher-dimensional representations converge faster. The independence of features is strongly related to the speed of model convergence: the more independent (disentangled) the features, the more thoroughly the input information is decomposed and the easier training becomes. This echoes the Hebbian principle: neurons that fire together, wire together.
Reduce the amount of computation through dimensionality reduction. In v1, features are first reduced with 1x1 convolutions before the larger convolutions. Since adjacent dimensions are correlated, dimensionality reduction can be understood as lossless or low-loss compression: even after reducing the dimensions, the correlations can still be used to recover much of the original information.
Balance the depth and width of the network. The performance of the model is best improved when depth and width are increased in roughly equal proportion.
With the same number of convolution kernels, larger kernels (such as 5x5 or 7x7) are more expensive to compute than 3x3 kernels; a 5x5 convolution costs about 25/9 = 2.78 times as much. A 5x5 kernel can of course capture correlations between activations that are farther apart in the previous layer, but given its heavy computational cost, reducing the physical size of the kernel is still attractive.
We therefore ask whether a 5x5 convolutional layer can be replaced, with the input and output sizes kept the same, by a multi-layer convolution with fewer parameters. Zooming into the computation of a 5x5 convolution, each output behaves like a small fully connected network sliding over a 5x5 input window (refer to Figure 1). Exploiting translation invariance, this small network can itself be replaced by two convolutional layers: the first a 3x3 convolution, the second acting as a fully connected layer on top of the 3x3 output grid (refer to Figure 1). Sliding this small network over the input amounts to replacing one 5x5 convolution with two 3x3 convolutions (refer to Figures 4 and 5). This allows weight sharing between neighboring tiles and reduces the computation to about (9 + 9) / 25 of the original.
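A small sketch comparing the two forms, assuming PyTorch and illustrative channel counts; the printed parameter ratio matches the (9 + 9) / 25 argument above:

```python
import torch
import torch.nn as nn

# A single 5x5 convolution versus its factorized replacement: two stacked
# 3x3 convolutions with the same receptive field and output size.
conv5x5 = nn.Conv2d(64, 64, kernel_size=5, padding=2, bias=False)
conv3x3_stack = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

print(num_params(conv5x5), num_params(conv3x3_stack))  # 102400 vs 73728 (= 18/25)
x = torch.randn(1, 64, 28, 28)
assert conv5x5(x).shape == conv3x3_stack(x).shape       # same output shape
```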
One might wonder whether the kernels can be made even smaller, for example 2x2, but an asymmetric factorization turns out to be better: using nx1 convolutions. For example, a [3x1 + 1x3] stack has the same receptive field as a single 3x3 convolution (refer to Figure 3), and this asymmetric scheme saves ((3x3) - (3 + 3)) / (3x3) = 33% of the computation, whereas factorizing into two 2x2 convolutions saves only 11%. In theory this can be pushed further: any nxn convolution can be replaced by a [1xn + nx1] stack (refer to Figure 6). In practice this factorization does not work well in the early layers, but it performs well on medium-sized feature maps (mxm with m between 12 and 20), where a [1x7 + 7x1] stack gives very good results.
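A sketch of the asymmetric factorization on a 17x17 feature map, assuming PyTorch; the channel count of 192 is illustrative:

```python
import torch
import torch.nn as nn

# Asymmetric factorization: a 1x7 convolution followed by a 7x1 convolution
# covers the same 7x7 receptive field as a single 7x7 kernel, at roughly
# (7 + 7) / 49 of the cost.
asym_7x7 = nn.Sequential(
    nn.Conv2d(192, 192, kernel_size=(1, 7), padding=(0, 3), bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(192, 192, kernel_size=(7, 1), padding=(3, 0), bias=False),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 192, 17, 17)       # medium-sized feature map (12 <= m <= 20)
print(asym_7x7(x).shape)              # torch.Size([1, 192, 17, 17])
```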
Inception-v1 introduced auxiliary classifiers (branches attached to intermediate layers, each with its own softmax loss that is back-propagated) to improve convergence in very deep networks. The original motivation was to push useful gradients back to the earlier convolutional layers so that they converge effectively, improving feature quality and avoiding the vanishing gradient problem.
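A minimal sketch of such an auxiliary head, assuming PyTorch; the layer sizes (128-channel 1x1 convolution, 1024-unit fully connected layer) loosely follow the GoogLeNet paper, and the 0.3 loss weight is the value used there:

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary head attached to an intermediate feature map, in the spirit
    of GoogLeNet's side branches. Layer sizes here are illustrative."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(4)
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# During training the auxiliary losses are added to the main loss with a
# small weight, e.g. loss = main_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2
```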
Traditionally, pooling layers are used in convolutional networks to reduce the size of the feature maps. To avoid a representational bottleneck, the number of convolution filters can be expanded before applying max pooling or average pooling.
For example, for a d×d layer with K feature maps, to produce a (d/2)×(d/2) layer with 2K feature maps we could first apply 2K stride-1 convolution kernels and then a pooling layer; this costs roughly 2d²K² operations. Swapping the pooling and the convolution costs roughly 2(d/2)²K², a fourfold reduction, but it creates a representational bottleneck because the representation first shrinks to (d/2)²K, which inevitably loses spatial information (refer to Figure 9). Instead, a different structure avoids this bottleneck (refer to Figure 10): two parallel branches are used, one a pooling layer (max or average) with stride 2 and the other a convolutional branch with stride 2, and their outputs are concatenated.
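A sketch of this parallel reduction block, assuming PyTorch; the kernel sizes and channel counts are illustrative rather than those of any specific reduction module:

```python
import torch
import torch.nn as nn

class ReductionBlock(nn.Module):
    """Grid reduction without a representational bottleneck: a stride-2
    convolution branch and a stride-2 pooling branch run in parallel and
    their outputs are concatenated."""
    def __init__(self, in_ch, conv_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(conv_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        # Halve the spatial size in both branches, then stack the channels.
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

block = ReductionBlock(192, 192)
print(block(torch.randn(1, 192, 35, 35)).shape)   # torch.Size([1, 384, 18, 18])
```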
After ResNet appeared, its residual structure was incorporated into the Inception family. Building on Inception-v3 and adding ResNet-style skip connections, an ensemble of three residual networks and one Inception-v4 reached a top-5 error of 3.08% on ImageNet classification (CLS).
1-Introduction: Residual connections work well when training very deep networks. Because the Inception architecture can be made very deep, it is reasonable to replace the filter concatenation stage with residual connections.
Compared with v3, Inception-v4 has a more uniform, simplified structure and more Inception modules.
Fig 9 is the overall architecture, and Figs 3, 4, 5, 6, 7 and 8 are its local structures. For the specific structure of each module, see the end of the article.
For the residual versions of the Inception network, Inception modules that are cheaper than the original ones are used. Each Inception module is followed by a 1x1 filter-expansion convolution that scales the dimensionality back up before the addition, which compensates for the dimensionality reduction inside the Inception module.
One variant is named Inception-ResNet-v1, whose computational cost matches that of Inception-v3; the other is named Inception-ResNet-v2, whose computational cost matches that of Inception-v4.
Figure 15 shows the structure of both. However, Inception-v4 is actually slower in practice, probably because it has more layers.
Another small technique: in the Inception-ResNet modules, the BN layer is used on top of the traditional layers but not on top of the summations. There is reason to believe that applying BN everywhere would help, but in order to fit more Inception modules into memory, this compromise was made.
The paper finds that when the number of convolution filters exceeds 1,000, the residual variants start to show instability and the network "dies" early in training: after only a small number of iterations, the layer just before the average pooling begins to output only zeros. This cannot be prevented by lowering the learning rate or adding extra BN layers. Kaiming He's ResNet paper also mentions this phenomenon.
The paper finds that scaling down the residuals before adding them to the activations of the previous layer stabilizes training; the scale coefficient is set between 0.1 and 0.3. To prevent unstable training of deep residual networks, He's paper suggested splitting training into two stages: a first warm-up stage in which the model is trained with a very low learning rate, followed by a second stage with a higher learning rate. This paper finds that when the number of filters is very high, even a learning rate of 0.00001 cannot remove the instability, and the high-learning-rate stage can destroy what has been learned; it therefore considers residual scaling more reliable than warm-up. Even where scaling is not strictly necessary, it does not appear to harm final accuracy, and it stabilizes the training process.
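A minimal sketch of residual scaling combined with the 1x1 filter expansion described earlier, assuming PyTorch; the inner branch is simplified to a 1x1 + 3x3 stack, and the scale of 0.2 sits inside the 0.1-0.3 range from the paper:

```python
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Sketch of the residual-scaling trick: the output of an Inception-style
    branch is expanded back to the input width by a 1x1 convolution (no
    activation), multiplied by a small constant, and added to the input."""
    def __init__(self, channels, scale=0.2):
        super().__init__()
        self.scale = scale                      # paper suggests 0.1 - 0.3
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels // 4, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels // 4, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.expand = nn.Conv2d(channels // 4, channels, kernel_size=1)  # filter expansion

    def forward(self, x):
        residual = self.expand(self.branch(x))
        return torch.relu(x + self.scale * residual)

out = ScaledResidualBlock(256)(torch.randn(1, 256, 17, 17))
```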
Inception-ResNet-v1: a network architecture combining Inception modules with residual connections, with a computational cost similar to Inception-v3;
Inception-ResNet-v2 : A more expensive but better performing network architecture.
Inception-v4 : A pure inception module, without residual connections, but with performance similar to Inception-ResNet-v2.
A big picture of the various module structures of Inception-v4 / Inception-ResNet-v1 / v2:
Fig3-Stem: (Inception-v4 & Inception-ResNet-v2)
Fig4-Inception-A: (Inception-v4)
Fig7-Reduction-A: (Inception-v4 & Inception-ResNet-v1 & Inception-ResNet-v2)
Fig8-Reduction-B: (Inception-v4)
Fig10-Inception-ResNet-A: (Inception-ResNet-v1)
Fig11-Inception-ResNet-B: (Inception-ResNet-v1)
Fig12-Reduction-B: (Inception-ResNet-v1)
Fig13-Inception-ResNet-C: (Inception-ResNet-v1)
Fig14-Stem: (Inception-ResNet-v1)
Fig16-Inception-ResNet-A: (Inception-ResNet-v2)
Fig17-Inception-ResNet-B: (Inception-ResNet-v2)
Fig18-Reduction-B: (Inception-ResNet-v2)
Fig19-Inception-ResNet-C: (Inception-ResNet-v2)