# Notes on "[Recent progress in semantic image segmentation](https://arxiv.org/ftp/arxiv/papers/1809/1809.10198.pdf)"
###### tags: `review` `segmentation` `supervised`

#### Author
[Akshay Kulkarni](https://akshayk07.weebly.com/)

This paper presents a review of semantic segmentation approaches, both traditional and DL-based.

## Brief Outline
The paper reviews some traditional approaches (used before DL), but focuses more on DL-based approaches, which are more recent and brought large improvements in the SOTA. It also reviews the available datasets and evaluation metrics for semantic segmentation.

## Datasets and Evaluation Metrics
### Datasets
General datasets:
* PASCAL Visual Object Classes (VOC) ([Everingham et al. 2010](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/PascalVOC_IJCV2009.pdf)) - 20 classes
* Microsoft Common Objects in Context (COCO) ([Lin et al. 2014](http://cocodataset.org/#home)) - 91 classes, 2.5 million labelled instances in 328k images
* ADE20K ([Zhou et al. 2017](https://groups.csail.mit.edu/vision/datasets/ADE20K/)) - 150 classes; also has masks for the eyes, nose, and mouth of humans

Related to autonomous driving:
* Cityscapes ([Cordts et al. 2016](https://www.cityscapes-dataset.com/)) - focuses on urban street scenes. 5k finely annotated images and 20k coarsely annotated ones. Covers the spring, summer, and fall seasons across 50 cities.
* KITTI ([Fritsch et al. 2013, Menze and Geiger 2015](http://www.cvlibs.net/datasets/kitti/)) - images captured while driving around the city of Karlsruhe, on highways, and in rural areas. Another dataset for autonomous driving applications.

Other datasets:
* SUN ([Xiao et al. 2010](https://groups.csail.mit.edu/vision/SUN/))
* [Shadow detection/Texture segmentation vision dataset](https://zenodo.org/record/59019#.WWHm3oSGNeM)
* Berkeley segmentation dataset ([Martin and Fowlkes 2017](https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/))
* LabelMe images database ([Russell et al. 2008](http://labelme.csail.mit.edu/Release3.0/))

### Metrics
Notation:
* $n_{ij}$ = number of pixels of class $i$ predicted to belong to class $j$
* $n_{cl}$ = number of different classes
* $t_i = \sum_j n_{ij}$ = total number of pixels of class $i$

Metrics:
* Pixel accuracy $P_{acc} = \frac{\sum_i n_{ii}}{\sum_i t_i}$
* Mean accuracy $M_{acc} = \frac{1}{n_{cl}}\sum_i\frac{n_{ii}}{t_i}$
* Mean IoU (Intersection over Union) $M_{IU} = \frac{1}{n_{cl}}\sum_i\frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$
* Frequency Weighted IoU $FW_{IU} = \frac{1}{\sum_k t_k}\sum_i\frac{t_i n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$

Note the $\sum_j n_{ji}$ term in the IoU denominators: it counts the pixels *predicted* as class $i$, so the denominator is the union of the ground truth and the prediction for class $i$.
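All four metrics can be computed from a single confusion matrix. Below is a minimal NumPy sketch; the function and variable names are my own, and it assumes every class appears at least once in the ground truth (otherwise the per-class divisions would produce NaNs):

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute the four metrics from a confusion matrix `conf`,
    where conf[i, j] = n_ij = pixels of class i predicted as class j."""
    n_ii = np.diag(conf).astype(float)       # correctly classified pixels per class
    t_i = conf.sum(axis=1).astype(float)     # ground-truth pixels per class
    pred_i = conf.sum(axis=0).astype(float)  # pixels predicted as each class (sum_j n_ji)
    union = t_i + pred_i - n_ii              # t_i + sum_j n_ji - n_ii

    pixel_acc = n_ii.sum() / t_i.sum()
    mean_acc = np.mean(n_ii / t_i)
    iou = n_ii / union
    mean_iou = np.mean(iou)
    fw_iou = (t_i * iou).sum() / t_i.sum()
    return pixel_acc, mean_acc, mean_iou, fw_iou

# Toy example: 2 classes, flattened 3x3 "image"
gt   = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1])
pred = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1])
conf = np.zeros((2, 2), dtype=int)
np.add.at(conf, (gt, pred), 1)               # accumulate (gt, pred) pairs
print(segmentation_metrics(conf))
```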
## Traditional Methods
Go through the paper; the authors describe several hand-crafted features and classical algorithms useful for image segmentation.

## DL-based Methods
### [FCN](https://www.ncbi.nlm.nih.gov/pubmed/27244717)
![Fully Convolutional Network](https://i.imgur.com/7jUCMm4.png)

* First paper to introduce Fully Convolutional Networks (no fully connected layers).
* An **interpolation layer** makes the output the same size as the input, which is essential in segmentation. Skip connections are also used.
* The network can be trained end-to-end, takes arbitrarily sized input, and produces correspondingly sized output.
* Uses a modified VGG-Net as the network.
* Achieved a 20% relative improvement in mean IoU over the then SOTA.

### [Upsample method: Deconvolution](https://ieeexplore.ieee.org/document/7410535)
![Deconvolution Network](https://i.imgur.com/yLpQ7lk.png)

* Uses deconvolution and unpooling layers to recover the size lost to convolution and pooling.
* Deconvolution is the (learned) reverse operation of convolution, whereas FCN's upsampling layer uses fixed bilinear interpolation. Bilinear interpolation is computationally efficient and recovers the image well, which is why FCN uses it.
* Unlike FCN, this network is applied to individual object proposals to obtain instance-wise segmentations, which are then combined into the final semantic segmentation. This is how the original paper linked above does it, but there are other implementations as well.

### [FCN + CRF and other traditional methods](https://arxiv.org/abs/1511.03328)
* The responses at the final layer of deep CNNs are not sufficiently localized to give accurate object segmentation.
* This is overcome by combining a Conditional Random Field (CRF) with the final layer of the deep CNN.
* A later work by the same authors used the Domain Transform (DT) instead of a CRF, because CRF inference is computationally expensive.
* Some other approaches involve superpixels and Markov Random Fields (MRF); check the paper for references to those.

### [Dilated Convolutions](https://ieeexplore.ieee.org/document/7913730)
![Dilated Convolution](https://i.imgur.com/cT6Tkqp.gif)
* [Source of animation](https://github.com/vdumoulin/conv_arithmetic)
* Most of the work prior to this was based on conventional CNNs. However, the conventional CNN is geared towards image classification, which is structurally different from the dense prediction required by semantic segmentation.
* Dilated convolutions support exponential expansion of the receptive field without loss of coverage or resolution.
* [Dilated Residual Networks](https://arxiv.org/abs/1705.09914) were also developed, which overcome the gridding effect of dilated convolutions.

### Progress in Backbone Network
* The backbone network is the main feature-extraction structure used.
* Some notable architectures (lower is newer/improved):
    * VGG
    * ResNet
    * ResNeXt
    * Inception-v2
    * Inception-v3
    * Inception-v4
    * Inception-ResNet

### Pyramid Methods
#### Image Pyramid
* An image pyramid ([Adelson et al. 1984](http://persci.mit.edu/pub_pdfs/RCA84.pdf)) is a collection of images which are successively downsampled until a certain criterion is reached.
* Two types of pyramids:
    * Gaussian pyramid: downsamples images
    * Laplacian pyramid: reconstructs an image from one lower in the pyramid
* [Lin et al. 2016](https://arxiv.org/abs/1504.01013) present a network with sliding pyramid pooling which improves segmentation by using patch-background contextual information.
* Similarly, [Chen et al. 2016](https://arxiv.org/abs/1511.03339) implement an image pyramid structure which extracts multi-scale features by feeding multiple resized input images to a shared deep network. At the end of the network, the features are merged for pixel-wise classification.
* The Laplacian pyramid is also used in some papers.

#### Atrous Spatial Pyramid Pooling (ASPP)
![ASPP](https://i.imgur.com/7AqIaT9.png)
* [Chen et al. 2017](https://ieeexplore.ieee.org/document/7913730) propose Atrous Spatial Pyramid Pooling (ASPP) to segment objects robustly at multiple scales.
* ASPP probes the incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view (FOV), capturing objects and image context at multiple scales.
* The architecture is shown in the figure; a minimal sketch follows below.
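Since ASPP is essentially a set of parallel dilated (atrous) convolutions over the same feature map, it is easy to sketch in PyTorch. This is a simplified version assuming the DeepLab-v2 setup, with sampling rates {6, 12, 18, 24} and branches fused by summation; the paper's branches are deeper (each has two additional 1x1 conv layers), which is omitted here:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates probe the same
    feature map at multiple effective fields-of-view; outputs are fused by sum."""
    def __init__(self, in_ch, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for 3x3 kernels
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, num_classes, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        return torch.stack([branch(x) for branch in self.branches]).sum(dim=0)

feats = torch.randn(1, 512, 32, 32)        # hypothetical backbone output
logits = ASPP(512, num_classes=21)(feats)  # 21 = 20 VOC classes + background
print(logits.shape)                        # torch.Size([1, 21, 32, 32])
```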
#### Pooling Pyramid
![Pooling Pyramid](https://i.imgur.com/ItHv9vw.png)
* [Zhao et al. 2016](https://arxiv.org/abs/1612.01105) exploit global context information through different-region-based context aggregation and name their network the Pyramid Scene Parsing Network (PSPNet).
* The pyramid pooling module pools at several scales, upsamples each output back to the original size, and finally concatenates the results to form a mixed feature representation (a minimal sketch is given at the end of these notes).
* In the figure, the different pooling scales are marked with different colors.

#### Feature Pyramid
* Recent deep learning object detectors have avoided pyramid representations because they are compute- and memory-intensive.
* [Lin et al. 2016](https://arxiv.org/abs/1612.03144) exploit the multi-scale, pyramidal hierarchy of CNNs to construct feature pyramids with marginal extra cost.
* Their Feature Pyramid Network (FPN) builds high-level semantic feature maps at all scales.

### Multi-level and Multi-stage method
![Hariharan et al. 2015](https://i.imgur.com/TprITii.png)
* Recognition algorithms based on CNNs use the output of the last layer as a feature representation. However, the information in this layer is too coarse for dense prediction. Conversely, earlier layers may be precise in localization but do not capture semantics.
* To get the best of both, [Hariharan et al. 2015](https://arxiv.org/abs/1411.5752) define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. **Shown in the figure.**
* Multi-model ensembles are another way to deal with image tasks, also explored in some papers.
* [Li et al. 2017](https://arxiv.org/abs/1704.01344) propose the deep Layer Cascade (LC) method to improve the accuracy and speed of semantic segmentation.
* Unlike the conventional model cascade (MC), which consists of multiple independent models, LC treats a single deep model as a cascade of several sub-models: most easy regions are classified in the shallow stages, letting the deeper stages focus on a few hard regions.
* This not only improves segmentation performance but also accelerates both training and testing of the deep network.

## Conclusion
* This review covers some pure DL-based approaches to semantic segmentation, along with many DL approaches inspired by traditional methods.
* A more recent review should also be read, since this one only covers work up to 2017.
* The paper also misses other segmentation architectures like U-Net, ENet, etc.
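As referenced in the Pooling Pyramid section, here is a minimal PyTorch sketch of PSPNet's pyramid pooling module. The bin sizes (1, 2, 3, 6) and the per-branch channel reduction to $1/N$ of the input follow the paper; the batch norm and ReLU after each 1x1 convolution are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """PSPNet-style pyramid pooling: pool the feature map to several grid sizes,
    reduce channels with 1x1 convs, upsample back, and concatenate with the input."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)  # each branch gets 1/N of the input channels
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, out_ch, 1))
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # upsample every pooled branch back to the input resolution
        pyramids = [x] + [
            F.interpolate(stage(x), size=(h, w), mode='bilinear', align_corners=False)
            for stage in self.stages
        ]
        return torch.cat(pyramids, dim=1)  # mixed feature representation

feats = torch.randn(1, 2048, 60, 60)       # e.g. a ResNet stage-5 feature map
out = PyramidPooling(2048)(feats)
print(out.shape)                           # torch.Size([1, 4096, 60, 60])
```

In PSPNet, a final convolution is applied on top of this concatenated representation to produce the per-pixel class scores.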