# Notes on "[ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation](https://arxiv.org/abs/1606.02147)"

###### tags: `notes` `segmentation` `supervised`

#### Author

[Akshay Kulkarni](https://akshayk07.weebly.com/)

## Brief Outline

This paper presents a network architecture that is faster and more compact, aimed at low-latency real-time inference.

## Introduction

* This paper introduces a novel neural network architecture for fast inference and high accuracy.
* No post-processing steps are used, so the architecture is a completely end-to-end CNN.

## Network Architecture

![Initial Block](https://i.imgur.com/qymfQnS.png)

* The above module is called `InitialBlock`.
* The $3\times3$ convolutional block has 13 convolutional filters (applied with stride 2). The MaxPooling branch uses non-overlapping $2\times2$ filters. So, for an RGB input, the concatenated output has 16 channels in total (13 from the convolution and 3 from the pooled input).

![Bottleneck Block](https://i.imgur.com/Z4UrxMI.png)

* The above module is called `Bottleneck`. It is of 3 types:
    * `RegularBottleneck` does not downsample; the input and output channels stay the same, and only the intermediate feature map channels are reduced according to an internal ratio. The main branch is an identity connection.
    * `DownsampleBottleneck` downsamples the input, while the intermediate feature map channels are reduced according to an internal ratio. The main branch is a MaxPooling layer followed by padding: the feature maps are zero-padded to match the number of channels before the addition.
    * `UpsampleBottleneck` upsamples the input, while the intermediate feature map channels are reduced according to an internal ratio. The main branch is a convolutional layer followed by batch normalization and PReLU activation.
* The `conv` layer may be either a regular, dilated, or full convolution (deconvolution) with $3\times3$ filters, or a $5\times5$ convolution decomposed into two asymmetric ones.
* For the regularizer, Spatial Dropout is used with $p = 0.01$ before `bottleneck2.0` and $p = 0.1$ afterwards.
* Check the implementation of both [here](https://github.com/akshaykvnit/pl-sem-seg/blob/master/models/enet/parts.py).

![ENet Architecture](https://i.imgur.com/CHyAGXU.png)

* Check the implementation [here](https://github.com/akshaykvnit/pl-sem-seg/blob/master/models/enet/model.py).
* No bias terms are used in any of the operations. This reduces computation and memory usage.

### Design Choices

#### Feature Map Resolution

* Downsampling (reducing feature map size) has 2 main drawbacks:
    * It causes loss of spatial information like exact edge shape.
    * Generally, the output is expected to have the same resolution as the input, so strong downsampling requires equally strong upsampling. This increases model size and computational cost.
* The first issue is addressed in FCN ([Long et al. 2015](https://arxiv.org/abs/1411.4038)) by utilizing feature maps produced at different stages of the encoder, and in SegNet ([Badrinarayanan et al. 2015](https://arxiv.org/abs/1505.07293)) by saving the indices from the max pooling layers and using them in the decoder for unpooling.
* The SegNet approach is used in this paper because of its lower memory requirements (a minimal sketch of this index-based unpooling is given after this list).
* Despite this, strong downsampling still hurts accuracy. However, downsampling has one advantage:
    * Filters operating on downsampled feature maps have a bigger receptive field, so they gather more context. This is especially important when differentiating between classes like rider and pedestrian in a road scene (for example). It is not enough to learn how a person looks; the context in which they appear is equally important.
* However, dilated convolutions ([Yu and Koltun, 2015](https://arxiv.org/abs/1511.07122)) are a better way of enlarging the receptive field for this purpose.
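To make the SegNet-style trick concrete, here is a minimal PyTorch sketch (an illustration, not code from the linked repo): the downsampling path saves the max-pooling argmax indices, and the upsampling path reuses them for unpooling instead of learning the upsampling from scratch. All shapes are arbitrary examples.

```python
import torch
import torch.nn as nn

# Downsampling side: keep the argmax indices of the max-pooling operation.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# Upsampling side: place values back at the remembered positions, zeros elsewhere.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 16, 128, 128)      # (batch, channels, H, W)
pooled, indices = pool(x)             # pooled: (1, 16, 64, 64)
restored = unpool(pooled, indices)    # restored: (1, 16, 128, 128), sparse
print(pooled.shape, restored.shape)
```

Storing only the pooling indices is much cheaper than storing full encoder feature maps for skip connections, which is why the paper prefers the SegNet approach to the FCN one.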
#### Early Downsampling

* The early layers in most architectures process very large input frames, which is computationally very expensive.
* The first two ENet blocks heavily reduce the input size and use only a small set of feature maps. The idea behind this is that visual information is highly spatially redundant, and thus can be compressed into a more efficient representation.
* Also, the initial layers should not directly contribute to classification. Instead, they should act as good feature extractors (i.e. preprocess the input for the later layers).

#### Decoder Size

* Generally, encoder-decoder architectures have a decoder symmetric to the encoder. Instead, ENet has a large encoder and a small decoder.
* The encoder should work like a classical classification network, i.e. operate on lower-resolution data and provide information processing and filtering.
* The decoder should upsample the encoder output, only fine-tuning the details.

#### Nonlinear Operations

* PReLUs ([He et al. 2015](https://arxiv.org/abs/1502.01852)) have an additional learnable parameter $a$ which determines the slope for negative inputs.

$$
\text{PReLU}(x) =
\begin{cases}
x, & x \geq 0 \\
ax, & x < 0
\end{cases}
$$

* If all ReLUs were replaced by PReLUs, one would expect
    * $a = 0$ if a ReLU is useful at that layer,
    * $a = 1$ if the identity (i.e. no non-linearity) is useful at that layer.
* They found that the early layers had $a$ close to zero, i.e. ReLU-like behaviour. Thus, the early layers filter out information. The identity hypothesis did not hold because ENet is very shallow compared to ResNets, so the network has to filter out information quickly.
* Further, the upsampling layers at the end had positive $a$, implying that the decoder only fine-tunes the encoder output (it does not filter).

#### Information Preserving Dimensionality Changes

* Aggressive dimensionality reduction can hinder information flow.
* According to [Szegedy et al. 2015](https://arxiv.org/abs/1512.00567) (very interesting paper btw), convolution and pooling should be performed in parallel and their feature maps concatenated. This is better than
    * having a convolutional layer followed by a pooling layer, because that is computationally expensive, and
    * having a pooling layer followed by a convolutional layer, because, while relatively cheap, it introduces a representational bottleneck (or forces the use of more filters, which again increases computation).
* A problem in the original ResNet is that the first $1 \times 1$ projection of the convolutional branch has stride $2$, which means most of the input is not seen by the filter.
* Increasing the filter size to $2 \times 2$ takes the full input into consideration, improving information flow. However, it increases computation (very few such filters are used in ENet, so this is not a problem).

#### Factorizing Filters

* Convolutional weights have some redundancy, and each $n \times n$ convolution can be decomposed into sequential $n \times 1$ and $1 \times n$ filters (proposed in [Szegedy et al. 2015](https://arxiv.org/abs/1512.00567)).
* This increases the variety of functions learned by the network and enlarges the receptive field at low cost (the asymmetric pair with $n = 5$ costs about the same as a single $3 \times 3$ convolution).
* It also makes the learned functions richer, since non-linear operations can be inserted between the $n \times 1$ and $1 \times n$ layers.
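A minimal sketch of this factorization, assuming PyTorch (layer names and channel counts are illustrative, not from the paper or the linked repo): a $5 \times 5$ convolution is replaced by a $5 \times 1$ followed by a $1 \times 5$ convolution, with a non-linearity in between, for roughly the cost of a single $3 \times 3$.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 64, 64

# Single 5x5 convolution: 64 * 64 * 5 * 5 = 102,400 weights (no bias, as in ENet).
conv5x5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2, bias=False)

# Asymmetric factorization: 5x1 then 1x5, with a non-linearity in between.
# Conv weights: 64 * 64 * 5 + 64 * 64 * 5 = 40,960, close to a 3x3 (36,864).
asym = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=(5, 1), padding=(2, 0), bias=False),
    nn.PReLU(),
    nn.Conv2d(out_ch, out_ch, kernel_size=(1, 5), padding=(0, 2), bias=False),
)

x = torch.randn(1, in_ch, 32, 32)
print(conv5x5(x).shape, asym(x).shape)  # both: torch.Size([1, 64, 32, 32])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv5x5), count(asym))      # 102400 vs 40961 (40960 conv weights + 1 PReLU slope)
```

Both versions cover the same $5 \times 5$ receptive field, but the factorized one uses far fewer weights and gains an extra non-linearity.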
#### Dilated Convolutions

* As mentioned, it is important to have a large receptive field, so that classification is performed taking a wider context into account.
* The best accuracy was obtained by interleaving dilated convolution bottleneck blocks with other bottleneck blocks (both regular and asymmetric) instead of arranging them in a sequence ([Yu and Koltun, 2015](https://arxiv.org/abs/1511.07122)).

#### Regularization

* L2 weight decay did not work well.
* According to [Huang et al. 2016](https://arxiv.org/abs/1603.09382), stochastic depth increases accuracy. However, dropping entire branches (setting their output to zero) is a special case of Spatial Dropout ([Tompson et al. 2014](https://arxiv.org/abs/1411.4280)).
* Instead, using Spatial Dropout at the end of the convolutional branches (before the addition) worked better (see the sketch at the end of these notes).

### Training Procedure

* Adam optimization is used in 2 stages:
    * First, train only the encoder to categorize downsampled regions of the input image.
    * Then, append the decoder and train the full network to perform upsampling and pixel-wise classification.

## Conclusion

* This paper presents an efficient architecture and training procedure for real-time inference.
* It combines several methods to reduce computation and memory requirements.
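To tie the regularization and dilation choices together, here is a minimal PyTorch sketch of a regular (non-downsampling) bottleneck with Spatial Dropout at the end of its convolutional branch, right before the residual addition. This is an illustration under my own naming and channel choices, not the implementation from the linked repo; `nn.Dropout2d` zeroes entire feature-map channels, which matches the Spatial Dropout of Tompson et al.

```python
import torch
import torch.nn as nn

class RegularBottleneckSketch(nn.Module):
    """Illustrative regular bottleneck: identity main branch + regularized conv branch."""
    def __init__(self, channels, internal_ratio=4, dropout_p=0.1, dilation=1):
        super().__init__()
        mid = channels // internal_ratio  # reduced intermediate channels
        self.branch = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),        # 1x1 projection
            nn.BatchNorm2d(mid), nn.PReLU(),
            nn.Conv2d(mid, mid, 3, padding=dilation,        # main (possibly dilated) conv
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(mid), nn.PReLU(),
            nn.Conv2d(mid, channels, 1, bias=False),        # 1x1 expansion
            nn.BatchNorm2d(channels),
            nn.Dropout2d(p=dropout_p),                      # Spatial Dropout: drops whole channels
        )
        self.act = nn.PReLU()

    def forward(self, x):
        # Identity main branch added to the convolutional branch, then PReLU.
        return self.act(x + self.branch(x))

block = RegularBottleneckSketch(channels=128, dropout_p=0.1, dilation=2)
out = block(torch.randn(1, 128, 64, 64))
print(out.shape)  # torch.Size([1, 128, 64, 64])
```

Setting `dilation=2` here also illustrates how the interleaved dilated bottlenecks enlarge the receptive field without any further downsampling.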