# Notes on "[Recent progress in semantic image segmentation](https://arxiv.org/ftp/arxiv/papers/1809/1809.10198.pdf)"
###### tags: `review` `segmentation` `supervised`

#### Author
[Akshay Kulkarni](https://akshayk07.weebly.com/)

This paper presents a review of semantic segmentation approaches, both traditional and DL-based.

## Brief Outline
The paper reviews some traditional approaches (used before DL), but focuses more on DL-based approaches, which are more recent and brought large improvements in the SOTA. It also reviews the available datasets and evaluation metrics for semantic segmentation.

## Datasets and Evaluation Metrics
### Datasets
General datasets:
* PASCAL Visual Object Classes (VOC) ([Everingham et al. 2010](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/PascalVOC_IJCV2009.pdf)) - 20 classes
* Microsoft Common Objects in Context (COCO) ([Lin et al. 2014](http://cocodataset.org/#home)) - 91 classes, 2.5 million labelled instances in 328k images
* ADE20K ([Zhou et al. 2017](https://groups.csail.mit.edu/vision/datasets/ADE20K/)) - 150 classes; also has masks for the eyes, nose, and mouth of humans

Related to autonomous driving:
* Cityscapes ([Cordts et al. 2016](https://www.cityscapes-dataset.com/)) - focuses on urban street scenes. 5k finely annotated images and 20k coarsely annotated ones. Covers the spring, summer, and fall seasons across 50 cities.
* KITTI ([Fritsch et al. 2013, Menze and Geiger 2015](http://www.cvlibs.net/datasets/kitti/)) - images captured while driving around the city of Karlsruhe, on highways, and in rural areas. Another dataset for autonomous driving applications.

Other datasets:
* SUN ([Xiao et al. 2010](https://groups.csail.mit.edu/vision/SUN/))
* [Shadow detection/Texture segmentation vision dataset](https://zenodo.org/record/59019#.WWHm3oSGNeM)
* Berkeley segmentation dataset ([Martin and Fowlkes 2017](https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/))
* LabelMe images database ([Russell et al. 2008](http://labelme.csail.mit.edu/Release3.0/))

### Metrics
Notation:
* $n_{ij}$ = number of pixels of class $i$ predicted to belong to class $j$
* $n_{cl}$ = number of different classes
* $t_i = \sum_j n_{ij}$ = total number of pixels of class $i$

Metrics:
* Pixel accuracy $P_{acc} = \frac{\sum_i n_{ii}}{\sum_i t_i}$
* Mean accuracy $M_{acc} = \frac{1}{n_{cl}}\sum_i\frac{n_{ii}}{t_i}$
* Mean IoU (Intersection over Union) $M_{IU} = \frac{1}{n_{cl}}\sum_i\frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$
* Frequency Weighted IoU $FW_{IU} = \frac{1}{\sum_k t_k}\sum_i\frac{t_i n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$

Note the $\sum_j n_{ji}$ term in the IoU denominators: it counts the pixels *predicted* as class $i$, so the denominator is the union of the ground truth and the prediction for class $i$.
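All four metrics can be computed from a single confusion matrix. Below is a minimal NumPy sketch; the function and variable names are my own, and it assumes every class appears at least once in the ground truth (otherwise the per-class divisions would produce NaNs):

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute the four metrics from a confusion matrix `conf`,
    where conf[i, j] = n_ij = pixels of class i predicted as class j."""
    n_ii = np.diag(conf).astype(float)       # correctly classified pixels per class
    t_i = conf.sum(axis=1).astype(float)     # ground-truth pixels per class
    pred_i = conf.sum(axis=0).astype(float)  # pixels predicted as each class (sum_j n_ji)
    union = t_i + pred_i - n_ii              # t_i + sum_j n_ji - n_ii

    pixel_acc = n_ii.sum() / t_i.sum()
    mean_acc = np.mean(n_ii / t_i)
    iou = n_ii / union
    mean_iou = np.mean(iou)
    fw_iou = (t_i * iou).sum() / t_i.sum()
    return pixel_acc, mean_acc, mean_iou, fw_iou

# Toy example: 2 classes, flattened 3x3 "image"
gt   = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1])
pred = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1])
conf = np.zeros((2, 2), dtype=int)
np.add.at(conf, (gt, pred), 1)               # accumulate (gt, pred) pairs
print(segmentation_metrics(conf))
```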
## Traditional Methods
Go through the paper; the authors describe several hand-crafted features and classical algorithms useful for image segmentation.

## DL-based Methods
### [FCN](https://www.ncbi.nlm.nih.gov/pubmed/27244717)
![Fully Convolutional Network](https://i.imgur.com/7jUCMm4.png)

* First paper to introduce Fully Convolutional Networks (no fully connected layers).
* An **interpolation layer** makes the output the same size as the input, which is essential in segmentation. Skip connections are also used.
* The network can be trained end-to-end, takes arbitrarily sized input, and produces correspondingly sized output.
* Uses a modified VGG-Net as the network.
* Achieved a 20% relative improvement in mean IoU over the then SOTA.

### [Upsample method: Deconvolution](https://ieeexplore.ieee.org/document/7410535)
![Deconvolution Network](https://i.imgur.com/yLpQ7lk.png)

* Uses deconvolution and unpooling layers to recover the size lost to convolution and pooling.
* Deconvolution is the (learned) reverse operation of convolution, whereas FCN's upsampling layer uses fixed bilinear interpolation. Bilinear interpolation is computationally efficient and recovers the image well, which is why FCN uses it.
* Unlike FCN, this network is applied to individual object proposals to obtain instance-wise segmentations, which are then combined into the final semantic segmentation. This is how the original paper linked above does it, but there are other implementations as well.

### [FCN + CRF and other traditional methods](https://arxiv.org/abs/1511.03328)
* The responses at the final layer of deep CNNs are not sufficiently localized to give accurate object segmentation.
* This is overcome by combining a Conditional Random Field (CRF) with the final layer of the deep CNN.
* A later work by the same authors used the Domain Transform (DT) instead of a CRF, because CRF inference is computationally expensive.
* Some other approaches involve superpixels and Markov Random Fields (MRF); check the paper for references to those.

### [Dilated Convolutions](https://ieeexplore.ieee.org/document/7913730)
![Dilated Convolution](https://i.imgur.com/cT6Tkqp.gif)
* [Source of animation](https://github.com/vdumoulin/conv_arithmetic)
* Most of the work prior to this was based on conventional CNNs. However, the conventional CNN is geared towards image classification, which is structurally different from the dense prediction required by semantic segmentation.
* Dilated convolutions support exponential expansion of the receptive field without loss of coverage or resolution.
* [Dilated Residual Networks](https://arxiv.org/abs/1705.09914) were also developed, which overcome the gridding effect of dilated convolutions.

### Progress in Backbone Network
* The backbone network is the main feature-extraction structure used.
* Some notable architectures (lower is newer/improved):
    * VGG
    * ResNet
    * ResNeXt
    * Inception-v2
    * Inception-v3
    * Inception-v4
    * Inception-ResNet

### Pyramid Methods
#### Image Pyramid
* An image pyramid ([Adelson et al. 1984](http://persci.mit.edu/pub_pdfs/RCA84.pdf)) is a collection of images which are successively downsampled until a certain criterion is reached.
* Two types of pyramids:
    * Gaussian pyramid: downsamples images
    * Laplacian pyramid: reconstructs an image from one lower in the pyramid
* [Lin et al. 2016](https://arxiv.org/abs/1504.01013) present a network with sliding pyramid pooling which improves segmentation by using patch-background contextual information.
* Similarly, [Chen et al. 2016](https://arxiv.org/abs/1511.03339) implement an image pyramid structure which extracts multi-scale features by feeding multiple resized input images to a shared deep network. At the end of the network, the features are merged for pixel-wise classification.
* The Laplacian pyramid is also used in some papers.

#### Atrous Spatial Pyramid Pooling (ASPP)
![ASPP](https://i.imgur.com/7AqIaT9.png)
* [Chen et al. 2017](https://ieeexplore.ieee.org/document/7913730) propose Atrous Spatial Pyramid Pooling (ASPP) to segment objects robustly at multiple scales.
* ASPP probes the incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view (FOV), capturing objects and image context at multiple scales.
* The architecture is shown in the figure; a minimal sketch follows below.
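Since ASPP is essentially a set of parallel dilated (atrous) convolutions over the same feature map, it is easy to sketch in PyTorch. This is a simplified version assuming the DeepLab-v2 setup, with sampling rates {6, 12, 18, 24} and branches fused by summation; the paper's branches are deeper (each has two additional 1x1 conv layers), which is omitted here:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates probe the same
    feature map at multiple effective fields-of-view; outputs are fused by sum."""
    def __init__(self, in_ch, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for 3x3 kernels
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, num_classes, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        return torch.stack([branch(x) for branch in self.branches]).sum(dim=0)

feats = torch.randn(1, 512, 32, 32)        # hypothetical backbone output
logits = ASPP(512, num_classes=21)(feats)  # 21 = 20 VOC classes + background
print(logits.shape)                        # torch.Size([1, 21, 32, 32])
```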
#### Pooling Pyramid
![Pooling Pyramid](https://i.imgur.com/ItHv9vw.png)
* [Zhao et al. 2016](https://arxiv.org/abs/1612.01105) exploit global context information through different-region-based context aggregation and name their network the Pyramid Scene Parsing Network (PSPNet).
* The pyramid pooling module pools at several scales, upsamples each output back to the original size, and finally concatenates the results to form a mixed feature representation (a minimal sketch is given at the end of these notes).
* In the figure, the different pooling scales are marked with different colors.

#### Feature Pyramid
* Recent deep learning object detectors have avoided pyramid representations because they are compute- and memory-intensive.
* [Lin et al. 2016](https://arxiv.org/abs/1612.03144) exploit the multi-scale, pyramidal hierarchy of CNNs to construct feature pyramids with marginal extra cost.
* Their Feature Pyramid Network (FPN) builds high-level semantic feature maps at all scales.

### Multi-level and Multi-stage method
![Hariharan et al. 2015](https://i.imgur.com/TprITii.png)
* Recognition algorithms based on CNNs use the output of the last layer as a feature representation. However, the information in this layer is too coarse for dense prediction. Conversely, earlier layers may be precise in localization but do not capture semantics.
* To get the best of both, [Hariharan et al. 2015](https://arxiv.org/abs/1411.5752) define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. **Shown in the figure.**
* Multi-model ensembles are another way to deal with image tasks, also explored in some papers.
* [Li et al. 2017](https://arxiv.org/abs/1704.01344) propose the deep Layer Cascade (LC) method to improve the accuracy and speed of semantic segmentation.
* Unlike the conventional model cascade (MC), which consists of multiple independent models, LC treats a single deep model as a cascade of several sub-models: most easy regions are classified in the shallow stages, letting the deeper stages focus on a few hard regions.
* This not only improves segmentation performance but also accelerates both training and testing of the deep network.

## Conclusion
* This review covers some pure DL-based approaches to semantic segmentation, along with many DL approaches inspired by traditional methods.
* A more recent review should also be read, since this one only covers work up to 2017.
* The paper also misses other segmentation architectures like U-Net, ENet, etc.
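As referenced in the Pooling Pyramid section, here is a minimal PyTorch sketch of PSPNet's pyramid pooling module. The bin sizes (1, 2, 3, 6) and the per-branch channel reduction to $1/N$ of the input follow the paper; the batch norm and ReLU after each 1x1 convolution are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """PSPNet-style pyramid pooling: pool the feature map to several grid sizes,
    reduce channels with 1x1 convs, upsample back, and concatenate with the input."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)  # each branch gets 1/N of the input channels
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, out_ch, 1))
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # upsample every pooled branch back to the input resolution
        pyramids = [x] + [
            F.interpolate(stage(x), size=(h, w), mode='bilinear', align_corners=False)
            for stage in self.stages
        ]
        return torch.cat(pyramids, dim=1)  # mixed feature representation

feats = torch.randn(1, 2048, 60, 60)       # e.g. a ResNet stage-5 feature map
out = PyramidPooling(2048)(feats)
print(out.shape)                           # torch.Size([1, 4096, 60, 60])
```

In PSPNet, a final convolution is applied on top of this concatenated representation to produce the per-pixel class scores.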