# Notes on "[Recent progress in semantic image segmentation](https://arxiv.org/ftp/arxiv/papers/1809/1809.10198.pdf)"
###### tags: `review` `segmentation` `supervised`
#### Author
[Akshay Kulkarni](https://akshayk07.weebly.com/)
This paper presents a review of semantic segmentation approaches - traditional as well as DL-based.
## Brief Outline
The paper reviews some traditional approaches (used before DL), but focuses more on DL-based approaches, which are more recent and brought large improvements in the SOTA. It also reviews available datasets and evaluation metrics for semantic segmentation.
## Datasets and Evaluation Metrics
### Datasets
General datasets:
* PASCAL Visual Object Classes (VOC) ([Everingham et al. 2010](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/PascalVOC_IJCV2009.pdf)) - 20 classes
* Microsoft Common Objects in Context (COCO) ([Lin et al. 2014](http://cocodataset.org/#home)) - 91 classes, 2.5 million labelled instances in 328k images.
* ADE20K ([Zhou et al. 2017](https://groups.csail.mit.edu/vision/datasets/ADE20K/)) - 150 classes, also has masks for eyes, nose, mouth of humans.
Related to autonomous driving:
* Cityscapes ([Cordts et al. 2016](https://www.cityscapes-dataset.com/)) - focuses on urban street scenes. 5k finely annotated images, 20k coarsely annotated. Covers the spring, summer, and fall seasons across 50 cities.
* KITTI ([Fritsch et al. 2013; Menze and Geiger 2015](http://www.cvlibs.net/datasets/kitti/)) - images captured while driving around Karlsruhe, on highways, and in rural areas; another popular dataset for autonomous driving applications.
Other datasets:
* SUN ([Xiao et al. 2010](https://groups.csail.mit.edu/vision/SUN/))
* [Shadow detection/Texture segmentation vision dataset](https://zenodo.org/record/59019#.WWHm3oSGNeM)
* Berkeley segmentation dataset ([Martin and Fowlkes 2017](https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/))
* LabelMe images database ([Russell et al. 2008](http://labelme.csail.mit.edu/Release3.0/))
### Metrics
Notations:
* $n_{ij}$ = number of pixels of class $i$ predicted to belong to class $j$
* $n_{cl}$ = number of different classes
* $t_i = \sum_jn_{ij}$ = total number of pixels of class $i$
Metrics:
* Pixel accuracy $P_{acc} = \frac{\sum_in_{ii}}{\sum_it_i}$
* Mean accuracy $M_{acc} = \frac{1}{n_{cl}}\sum_i\frac{n_{ii}}{t_i}$
* Mean IoU (Intersection over Union) $M_{IU} = \frac{1}{n_{cl}}\sum_i\frac{n_{ii}}{t_i + \sum_jn_{ji} - n_{ii}}$
* Frequency Weighted IoU $FW_{IU} = \frac{1}{\sum_kt_k}\sum_i\frac{t_in_{ii}}{t_i + \sum_jn_{ji} - n_{ii}}$
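To make the notation concrete, here is a minimal NumPy sketch (the function name and the confusion-matrix convention are mine, not the paper's) that computes all four metrics from a confusion matrix:

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = n_ij = number of pixels of class i predicted as class j.
    Assumes every class appears at least once (no division by zero)."""
    n_ii = np.diag(conf)                     # correctly classified pixels per class
    t_i = conf.sum(axis=1)                   # ground-truth pixels of class i
    union = t_i + conf.sum(axis=0) - n_ii    # t_i + sum_j n_ji - n_ii

    p_acc = n_ii.sum() / t_i.sum()                    # pixel accuracy
    m_acc = np.mean(n_ii / t_i)                       # mean accuracy
    m_iou = np.mean(n_ii / union)                     # mean IoU
    fw_iou = (t_i * n_ii / union).sum() / t_i.sum()   # frequency weighted IoU
    return p_acc, m_acc, m_iou, fw_iou
```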
## Traditional Methods
See the paper for details; the authors describe several hand-crafted features and classical algorithms that were used for image segmentation before deep learning.
## DL-based Methods
### [FCN](https://www.ncbi.nlm.nih.gov/pubmed/27244717)
![Fully Convolutional Network](https://i.imgur.com/7jUCMm4.png)
* First paper that introduced Fully Convolutional Networks (no fully connected layers).
* Using an **interpolation layer**, the output is upsampled to the same size as the input, which is essential for segmentation. Skip connections are also used to combine coarse and fine features.
* Network can be trained end-to-end, can take arbitrary sized input and produces correspondingly sized output.
* Uses a modified VGG-Net as the backbone network.
* Achieved a 20% relative improvement in mean IoU over the then SOTA.
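A minimal PyTorch sketch of the core FCN idea: a 1x1 convolution scores each class, then an interpolation layer restores the input size. The channel/class sizes here are illustrative, and the skip connections are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCNHead(nn.Module):
    """1x1 conv produces per-class scores at low resolution, then bilinear
    interpolation upsamples the scores back to the input size."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.score = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, features, input_size):
        scores = self.score(features)                  # (B, C, h, w)
        return F.interpolate(scores, size=input_size,
                             mode='bilinear', align_corners=False)

# e.g. backbone features at stride 32 for a 256x256 input
feats = torch.randn(1, 512, 8, 8)
head = FCNHead(512, num_classes=21)         # 20 VOC classes + background
out = head(feats, input_size=(256, 256))   # (1, 21, 256, 256)
```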
### [Upsample method: Deconvolution](https://ieeexplore.ieee.org/document/7410535)
![Deconvolution Network](https://i.imgur.com/yLpQ7lk.png)
* Uses deconvolution and unpooling layers to recover the size after convolution and pooling.
* Deconvolution (transposed convolution) is a learned upsampling operation that reverses the spatial effect of convolution, whereas FCN's interpolation layer uses fixed bilinear interpolation. Bilinear interpolation is computationally efficient and recovers images well, which is why FCN uses it; deconvolution instead learns its upsampling filters.
* Unlike FCN, this network is applied to individual object proposals to obtain instance-wise segmentations, which are combined to get the final semantic segmentation. This is how the original paper linked above does it, though other implementations exist as well.
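A small PyTorch sketch contrasting the two recovery mechanisms (layer sizes are illustrative); `return_indices=True` lets the unpooling layer place each value back at the position of the original maximum:

```python
import torch
import torch.nn as nn

# Pooling with return_indices=True records where each maximum came from,
# so MaxUnpool2d can restore values to their original locations.
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
# Learned upsampling: this transposed convolution doubles the spatial size.
deconv = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 32, 32)
pooled, indices = pool(x)           # (1, 64, 16, 16)
unpooled = unpool(pooled, indices)  # (1, 64, 32, 32), sparse activations
upsampled = deconv(pooled)          # (1, 64, 32, 32), dense, learned
```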
### [FCN + CRF and other traditional methods](https://arxiv.org/abs/1511.03328)
* The responses at the final layer of deep CNNs are not sufficiently localized to give accurate object segmentation.
* This is overcome by using a Conditional Random Field in combination with the final layer of the deep CNN.
* Another work after this (by same authors) used Domain Transform (DT) instead of CRF because CRF inference is computationally expensive.
* Some other approaches involve super-pixels and Markov Random Fields (MRF); see the paper for references to those.
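As a rough illustration of CRF post-processing, here is a common recipe using the third-party `pydensecrf` package; the pairwise parameters are illustrative defaults, not the authors' exact settings:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, iters=5):
    """image: (H, W, 3) uint8 RGB array; probs: (C, H, W) CNN softmax output."""
    c, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(probs.astype(np.float32)))
    # Pairwise terms: smoothness (positions only) and appearance (positions + color).
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(iters)   # approximate mean-field inference
    return np.argmax(np.array(q).reshape(c, h, w), axis=0)
```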
### [Dilated Convolutions](https://ieeexplore.ieee.org/document/7913730)
![Dilated Convolution](https://i.imgur.com/cT6Tkqp.gif)
* [Source of Animation](https://github.com/vdumoulin/conv_arithmetic)
* Most of the work prior to this was based on the conventional CNN. However, the conventional CNN is geared towards the image classification task, which is structurally different from dense prediction tasks like semantic segmentation.
* Dilated convolutions support exponential receptive field expansion without loss of coverage or resolution.
* [Dilated Residual Networks](https://arxiv.org/abs/1705.09914) were also developed, which overcome the gridding artifacts of dilated convolutions.
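A short PyTorch sketch of the resolution-preserving property (tensor sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# A 3x3 kernel with dilation d covers a (2d + 1) x (2d + 1) window with the
# same 9 weights, so stacking layers with rates 1, 2, 4, ... grows the
# receptive field exponentially while padding keeps the resolution fixed.
for d in (1, 2, 4):
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=d, padding=d)
    print(d, conv(x).shape)   # stays (1, 64, 32, 32) for every rate
```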
### Progress in Backbone Network
* Backbone network refers to the main structure used.
* Some notable architectures (listed roughly from older to newer):
* VGG
* ResNet
* ResNeXt
* Inception-v2
* Inception-v3
* Inception-v4
* Inception-ResNet
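As a rough sketch of how such a backbone is adapted for segmentation (this assumes a recent torchvision; `replace_stride_with_dilation` is torchvision's own argument and ties in with the dilated convolutions above):

```python
import torch
from torchvision import models

# Replace the strides of the last two ResNet stages with dilation so the
# output stride drops from 32 to 8 without discarding any weights.
backbone = models.resnet50(weights=None,
                           replace_stride_with_dilation=[False, True, True])
features = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc

x = torch.randn(1, 3, 224, 224)
print(features(x).shape)   # (1, 2048, 28, 28), i.e. output stride 8
```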
### Pyramid Methods
#### Image Pyramid
* An image pyramid ([Adelson et al. 1984](http://persci.mit.edu/pub_pdfs/RCA84.pdf)) is a collection of images which are successively downsampled until a certain stopping criterion is reached.
* Two types of pyramids:
    * Gaussian pyramid: repeatedly smooths and downsamples the image
    * Laplacian pyramid: stores the residuals needed to reconstruct each image from the coarser level below it in the pyramid
* [Lin et al. 2016](https://arxiv.org/abs/1504.01013) present a network with sliding pyramid pooling which improves image segmentation by using patch-background contextual information.
* Similarly, [Chen et al. 2016](https://arxiv.org/abs/1511.03339) implement an image pyramid structure which extracts multi-scale features by feeding multiple resized input images to a shared deep network. At the end of the network, the features are merged for pixel-wise classification.
* The Laplacian pyramid is also used in some papers.
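A minimal OpenCV sketch of both pyramid types (the image path and the 3 levels are illustrative):

```python
import cv2

img = cv2.imread('scene.png')   # any test image; path is illustrative

# Gaussian pyramid: each level smooths and downsamples the previous one.
gaussian = [img]
for _ in range(3):
    gaussian.append(cv2.pyrDown(gaussian[-1]))

# Laplacian pyramid: each level keeps the detail lost by downsampling, so the
# image can be rebuilt from the coarsest Gaussian level plus these residuals.
laplacian = []
for i in range(3):
    up = cv2.pyrUp(gaussian[i + 1],
                   dstsize=(gaussian[i].shape[1], gaussian[i].shape[0]))
    laplacian.append(cv2.subtract(gaussian[i], up))
```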
#### Atrous Spatial Pyramid Pooling (ASPP)
![ASPP](https://i.imgur.com/7AqIaT9.png)
* [Chen et al. 2017](https://ieeexplore.ieee.org/document/7913730) propose Atrous Spatial Pyramid Pooling (ASPP) to segment objects robustly at multiple scales.
* ASPP probes the incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view (FOV), capturing objects and image context at multiple scales.
* Architecture shown in figure.
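A minimal PyTorch sketch of an ASPP block (rates 6, 12, 18, 24 follow the paper; fusing the branches by concatenation plus a 1x1 projection, as done here, is a common variant, while the original sums per-class scores):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates probe the same
    feature map at several effective fields-of-view; the branch outputs are
    fused by concatenation followed by a 1x1 projection."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 512, 32, 32)
print(ASPP(512, 256)(x).shape)   # (1, 256, 32, 32), resolution preserved
```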
#### Pooling Pyramid
![Pooling Pyramid](https://i.imgur.com/ItHv9vw.png)
* [Zhao et al. 2016](https://arxiv.org/abs/1612.01105) exploit the capability of global context information by different-region-based context aggregation, and name their network the pyramid scene parsing network (PSPNet).
* The pyramid pooling adopts different pooling sizes, then upsamples the outputs to the original size, and finally concatenates the results to form a mixed feature representation.
* In the figure, the different pooling sizes are marked with different colors.
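A minimal PyTorch sketch of such a pooling pyramid (bin sizes 1, 2, 3, 6 follow PSPNet; the channel reduction and module name are my own simplification):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool the feature map to several grid sizes, reduce channels, upsample
    back to the original size, and concatenate with the input to form a
    mixed feature representation."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1))
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        pyramids = [F.interpolate(stage(x), size=(h, w),
                                  mode='bilinear', align_corners=False)
                    for stage in self.stages]
        return torch.cat([x] + pyramids, dim=1)

x = torch.randn(1, 512, 60, 60)
print(PyramidPooling(512)(x).shape)   # (1, 1024, 60, 60)
```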
#### Feature Pyramid
* Recent deep learning object detectors have avoided pyramid representation because it is compute and memory intensive.
* [Lin et al. 2016](https://arxiv.org/abs/1612.03144) exploit the inherent multi-scale, pyramidal hierarchy of CNNs to construct feature pyramids with marginal extra cost.
* The resulting Feature Pyramid Network (FPN) builds high-level semantic feature maps at all scales.
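A minimal PyTorch sketch of the top-down pathway with lateral connections (channel widths match ResNet stages but are illustrative; nearest-neighbor upsampling and a 3x3 smoothing conv after each merge follow the FPN design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Each backbone stage is projected to a common width by a 1x1 conv
    (lateral connection), then merged with the upsampled coarser level,
    yielding semantically strong maps at every scale."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), width=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, width, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(width, width, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):   # feats ordered finest to coarsest
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):   # top-down merging
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[2:], mode='nearest')
        return [s(l) for s, l in zip(self.smooth, laterals)]

feats = [torch.randn(1, c, s, s)
         for c, s in zip((256, 512, 1024, 2048), (56, 28, 14, 7))]
outs = MiniFPN()(feats)
print([tuple(o.shape) for o in outs])   # 256 channels at every scale
```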
### Multi-level and Multi-stage method
![Hariharan et. al. 2015](https://i.imgur.com/TprITii.png)
* Recognition algorithms based on CNNs use the output of the last layer as a feature representation. However, the information in this layer is too coarse for dense prediction. On the contrary, earlier layers may be precise in localization, but they will not capture semantics.
* To get the best of both advantages, [Hariharan et al. 2015](https://arxiv.org/abs/1411.5752) define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. **Shown in figure**. A minimal extraction sketch is given after this list.
* Multi-model ensembling is another way to approach image tasks, also explored in some papers.
* [Li et al. 2017](https://arxiv.org/abs/1704.01344) propose the deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation.
* Unlike the conventional model cascade (MC), which consists of multiple independent models, LC treats a single deep model as a cascade of several sub-models: most of the easy regions are classified in the shallow stages, letting the deeper stages focus on a few hard regions.
* It not only improves segmentation performance but also accelerates both training and testing of the deep network.
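A minimal PyTorch sketch of hypercolumn extraction (layer choices and channel sizes are illustrative): upsample feature maps from several depths to a common resolution and stack them, so each pixel gets a vector of activations from all layers:

```python
import torch
import torch.nn.functional as F

def hypercolumns(feature_maps, size):
    """Upsample feature maps from several layers to the input resolution and
    concatenate them along the channel axis."""
    ups = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
           for f in feature_maps]
    return torch.cat(ups, dim=1)   # (B, sum of channels, H, W)

# e.g. fine-to-coarse maps from three stages of a CNN for a 224x224 input
maps = [torch.randn(1, c, s, s) for c, s in [(64, 56), (256, 28), (512, 14)]]
hc = hypercolumns(maps, size=(224, 224))   # (1, 832, 224, 224)
```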
## Conclusion
* This review covers some pure DL-based approaches to semantic segmentation, along with many DL-based approaches inspired by traditional methods.
* A more recent review should also be read, since this one only covers work up to 2017.
* This paper also omits other segmentation architectures like U-Net, E-Net, etc.