Notes on "Recent progress in semantic image segmentation"

tags: review segmentation supervised

Author

Akshay Kulkarni

This paper presents a review on semantic segmentation approaches - traditional as well as DL-based.

Brief Outline

The paper reviews some traditional approaches (used before DL), but focuses more on DL-based approaches, which are more recent and have improved the SOTA considerably. It also reviews the available datasets and evaluation metrics for semantic segmentation.

Datasets and Evaluation Metrics

Datasets

General datasets:

  • PASCAL Visual Object Classes (VOC) (Everingham et al. 2010) - 20 classes
  • Microsoft Common Objects in Context (COCO) (Lin et al. 2014) - 91 classes, 2.5 million labelled instances in 328k images.
  • ADE20K (Zhou et al. 2017) - 150 classes, also has masks for eyes, nose, mouth of humans.

Related to autonomous driving:

  • Cityscapes (Cordts et al. 2016) - focuses on urban street scenes. 5k finely annotated images, 20k coarsely annotated. Covers spring, fall, summer seasons for over 50 cities.
  • KITTI (Fritsch et al. 2013, Menze and Geiger 2015) - another dataset for autonomous driving applications. Images were captured while driving around the city of Karlsruhe, on highways and in rural areas.

Other datasets:

Metrics

Notations:

  • $n_{ij}$ = number of pixels of class $i$ predicted to belong to class $j$
  • $n_{cl}$ = number of different classes
  • $t_i = \sum_j n_{ij}$ = total number of pixels of class $i$

Metrics:

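  • Pixel accuracy
    $P_{acc} = \frac{\sum_i n_{ii}}{\sum_i t_i}$
  • Mean accuracy
    $M_{acc} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i}$
  • Mean IoU (Intersection over Union)
    $M_{IU} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$
  • Frequency Weighted IoU
    $FW_{IU} = \frac{1}{\sum_k t_k} \sum_i \frac{t_i \, n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$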
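
A minimal sketch of how these four metrics can be computed from a confusion matrix, using only the standard library. The function name `segmentation_metrics` and the toy matrix are made up for illustration, and the sketch assumes every class has at least one ground-truth pixel (so no division by zero). The IoU denominator is the union: ground-truth pixels of class i, plus pixels predicted as class i, minus the correctly predicted ones.

```python
def segmentation_metrics(conf):
    """Compute the four metrics from a confusion matrix.

    conf[i][j] is n_ij: the number of pixels of class i predicted as class j.
    Assumes every class has at least one ground-truth pixel.
    """
    n_cl = len(conf)
    diag = [conf[i][i] for i in range(n_cl)]        # n_ii: correct pixels
    t = [sum(conf[i]) for i in range(n_cl)]         # t_i: row sums
    # Column sums: all pixels predicted as class i (sum over j of n_ji).
    predicted = [sum(conf[j][i] for j in range(n_cl)) for i in range(n_cl)]
    # Per-class IoU = intersection / union.
    iou = [diag[i] / (t[i] + predicted[i] - diag[i]) for i in range(n_cl)]
    return {
        "pixel_acc": sum(diag) / sum(t),
        "mean_acc": sum(diag[i] / t[i] for i in range(n_cl)) / n_cl,
        "mean_iou": sum(iou) / n_cl,
        "fw_iou": sum(t[i] * iou[i] for i in range(n_cl)) / sum(t),
    }


# Toy 2-class example: rows are ground truth, columns are predictions.
metrics = segmentation_metrics([[3, 1], [2, 4]])
print(metrics)
```

For the toy matrix, 7 of the 10 pixels lie on the diagonal, so the pixel accuracy is 0.7; the per-class IoUs are 3/6 and 4/7.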

Traditional Methods

See the paper for details. The authors mention several features and algorithms useful for image segmentation.

DL-based Methods

FCN

  • First paper that introduced Fully Convolutional Networks (no fully connected layers).
  • An interpolation layer upsamples the output to the same size as the input, which is essential in segmentation. Skip connections are also used.
  • Network can be trained end-to-end, can take arbitrary sized input and produces correspondingly sized output.
  • Uses modified VGG-Net as the network.
  • Achieved a 20% relative improvement in mean IoU over the then SOTA.

Upsample method: Deconvolution

  • Uses deconvolution and unpooling layers to recover the size after convolution and pooling.
  • Deconvolution (transposed convolution) maps in the reverse direction of convolution with learned filters (it is not a true inverse), whereas an interpolation layer upsamples with fixed bilinear interpolation. Bilinear interpolation is computationally efficient and recovers images well, which is why it is used in FCN.
  • Unlike FCN, this network is applied to individual object proposals to obtain instance-wise segmentations, which are combined to get the final semantic segmentation. This is how the original paper does it, but other implementations exist.
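
To make the relation between deconvolution and interpolation concrete, here is a small 1D sketch in plain Python (the function name `transposed_conv1d` and the toy values are illustrative, not from the paper). Each input value scatters a scaled copy of the kernel into the output. With the fixed triangular kernel `[0.5, 1, 0.5]` and stride 2, the result is linear interpolation of the input; a deconvolution layer would learn the kernel instead of fixing it.

```python
def transposed_conv1d(x, kernel, stride=2):
    """1D transposed convolution: every input sample adds a scaled
    copy of `kernel` to the output, shifted by `stride` per sample."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


# Fixed triangular kernel: behaves like linear interpolation.
upsampled = transposed_conv1d([1.0, 2.0, 3.0], [0.5, 1.0, 0.5], stride=2)
print(upsampled)  # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 1.5]
```

The original samples 1, 2, 3 appear at positions 1, 3, 5 and the values in between are their midpoints; the first and last entries are border effects of the unpadded kernel.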

FCN + CRF and other traditional methods

  • The responses at the final layer of deep CNNs are not sufficiently localized to give accurate object segmentation.
  • This is overcome by using a Conditional Random Field in combination with the final layer of the deep CNN.
  • A later work by the same authors used a Domain Transform (DT) instead of a CRF because CRF inference is computationally expensive.
  • Some other approaches involve super-pixels and Markov Random Fields (MRF). See the paper for references to those.

Dilated Convolutions

  • Most prior work was based on conventional CNNs. However, conventional CNNs are geared towards image classification, which is structurally different from dense prediction tasks such as semantic segmentation.
  • Dilated convolutions support exponential receptive field expansion without loss of coverage or resolution.
  • Dilated Residual Networks were also developed, which overcome the gridding effect of dilated convolutions.
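
A rough 1D sketch of the idea in plain Python (the function names are illustrative). A dilated convolution leaves gaps of `dilation` samples between kernel taps, so the kernel covers a wider span with the same number of weights. Stacking 3-tap layers with dilations 1, 2, 4 gives a receptive field of 15, compared with 7 for three undilated layers.

```python
def dilated_conv1d(x, kernel, dilation=1):
    """'Valid' 1D convolution (cross-correlation form, as in CNNs)
    with `dilation` samples between consecutive kernel taps."""
    span = (len(kernel) - 1) * dilation + 1   # input width covered by the kernel
    return [
        sum(w * x[start + k * dilation] for k, w in enumerate(kernel))
        for start in range(len(x) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated conv layers."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=1))  # [6, 9, 12]
print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=2))  # [9]
print(receptive_field(3, [1, 1, 1]))  # 7
print(receptive_field(3, [1, 2, 4]))  # 15
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth, while the number of parameters per layer stays the same.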

Progress in Backbone Network

  • The backbone network is the main network structure that the segmentation model is built on.
  • Some notable architectures (entries lower in the list are newer/improved)
    • VGG
    • ResNet
    • ResNeXt
    • Inception-v2
    • Inception-v3
    • Inception-v4
    • Inception-ResNet

Pyramid Methods

Image Pyramid

  • An image pyramid (Adelson et al. 1984) is a collection of images which are successively downsampled until a stopping criterion is reached.
  • 2 types of pyramids:
    • Gaussian pyramid: used to downsample images
    • Laplacian pyramid: used to reconstruct an upsampled image from an image lower in the pyramid (i.e. with lower resolution)
  • Lin et al. 2016 present a network with sliding pyramid pooling which improves image segmentation by using patch-background contextual information.
  • Similarly, Chen et al. 2016 implement an image pyramid structure which extracts multi-scale features by feeding multiple resized input images to a shared deep network. At the end of the network, the features are merged for pixel-wise classification.
  • Laplacian pyramids are also used in some papers.
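
The two pyramid types can be sketched on a 1D signal in plain Python. This is a simplified illustration, not the method of any cited paper: the `[1, 2, 1] / 4` blur kernel, the nearest-neighbour upsampling and all function names are assumptions made for the example.

```python
def downsample(signal):
    """Blur with a [1, 2, 1] / 4 kernel (edges clamped), then keep every second sample."""
    n = len(signal)
    blurred = [
        (signal[max(i - 1, 0)] + 2 * signal[i] + signal[min(i + 1, n - 1)]) / 4
        for i in range(n)
    ]
    return blurred[::2]


def upsample(signal, length):
    """Nearest-neighbour upsampling back to `length` samples."""
    return [signal[min(i // 2, len(signal) - 1)] for i in range(length)]


def gaussian_pyramid(signal, min_length=2):
    """Successively downsample until the stopping criterion (length) is reached."""
    levels = [list(signal)]
    while len(levels[-1]) > min_length:
        levels.append(downsample(levels[-1]))
    return levels


def laplacian_pyramid(gaussian):
    """Each level stores the detail lost by going one level down;
    the coarsest Gaussian level is kept as the last entry."""
    levels = []
    for fine, coarse in zip(gaussian, gaussian[1:]):
        up = upsample(coarse, len(fine))
        levels.append([f - u for f, u in zip(fine, up)])
    levels.append(gaussian[-1])
    return levels


def reconstruct(laplacian):
    """Rebuild the full-resolution signal from the coarsest level upwards."""
    current = laplacian[-1]
    for residual in reversed(laplacian[:-1]):
        up = upsample(current, len(residual))
        current = [r + u for r, u in zip(residual, up)]
    return current


signal = [4.0, 8.0, 6.0, 2.0, 0.0, 4.0, 8.0, 8.0]
gauss = gaussian_pyramid(signal)
print([len(level) for level in gauss])  # [8, 4, 2]
print(reconstruct(laplacian_pyramid(gauss)) == signal)
```

The Gaussian pyramid only goes down in resolution, while the Laplacian pyramid keeps enough residual detail to go back up.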

Atrous Spatial Pyramid Pooling (ASPP)

  • Chen et al. 2017 propose Atrous Spatial Pyramid Pooling (ASPP) to segment objects robustly at multiple scales.
  • ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view (FOV), thus capturing objects as well as image context at multiple scales.
  • Architecture shown in figure.
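
A toy 1D sketch of the ASPP idea in plain Python: the same input is probed by parallel dilated (atrous) convolutions at several rates, and each branch sees context at a different scale. The rates `(1, 2, 4)`, the shared kernel and the function names are illustrative assumptions; the actual network uses 2D filters with separate learned weights per branch and fuses the branch outputs afterwards.

```python
def dilated_conv1d_same(x, kernel, rate):
    """Zero-padded dilated convolution; output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + (k - half) * rate      # tap position in the input
            if 0 <= j < len(x):            # taps outside the input count as zero
                acc += weight * x[j]
        out.append(acc)
    return out


def aspp_branches(x, kernel, rates=(1, 2, 4)):
    """One output per sampling rate, all computed from the same input."""
    return [dilated_conv1d_same(x, kernel, rate) for rate in rates]


# An impulse shows which positions each branch can "see" it from.
impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
for rate, branch in zip((1, 2, 4), aspp_branches(impulse, [1, 1, 1])):
    print(rate, branch)
```

With rate 1 the impulse is only visible from its immediate neighbours, while with rate 4 it reaches positions four samples away, which is the larger field-of-view.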

Pooling Pyramid

  • Zhao et al. 2016 exploit the capability of global context information through different-region based context aggregation, and name their network the pyramid scene parsing network (PSPNet).
  • The pyramid pooling module pools at several different scales, upsamples the outputs to the original size, and finally concatenates the results to form a mixed feature representation.
  • In the figure, different scales of pooling sizes are marked with different colors.
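
The pool, upsample and concatenate steps can be sketched in 1D with plain Python. This is a simplification under stated assumptions: it uses nearest-neighbour upsampling and skips the convolution layers of the real module, and the function names are made up. The bin sizes `(1, 2, 3, 6)` follow the PSPNet paper.

```python
def average_pool_to_bins(features, bins):
    """Adaptive average pooling: split into `bins` regions and average each."""
    n = len(features)
    pooled = []
    for b in range(bins):
        start = (b * n) // bins
        end = ((b + 1) * n + bins - 1) // bins   # ceiling, so regions cover the input
        region = features[start:end]
        pooled.append(sum(region) / len(region))
    return pooled


def upsample_nearest(pooled, length):
    """Stretch the pooled values back to the original length."""
    bins = len(pooled)
    return [pooled[(i * bins) // length] for i in range(length)]


def pyramid_pooling(features, bin_sizes=(1, 2, 3, 6)):
    """Return the input plus one context map per pooling scale.
    Stacking the rows corresponds to channel-wise concatenation."""
    branches = [list(features)]
    for bins in bin_sizes:
        pooled = average_pool_to_bins(features, bins)
        branches.append(upsample_nearest(pooled, len(features)))
    return branches


for row in pyramid_pooling([1, 2, 3, 4, 5, 6]):
    print(row)
```

The single-bin branch gives every position the global average (3.5 here), while the finer branches keep progressively more local context.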

Feature Pyramid

  • Recent deep learning object detectors have avoided pyramid representations because they are compute and memory intensive.
  • Lin et al. 2016 exploit the inherent multi-scale, pyramidal hierarchy of deep CNNs to construct feature pyramids at marginal extra cost.
  • Their Feature Pyramid Network (FPN), a top-down architecture with lateral connections, builds high-level semantic feature maps at all scales.

Multi-level and Multi-stage method

  • Recognition algorithms based on CNNs use the output of the last layer as a feature representation. However, the information in this layer is too coarse for dense prediction. Conversely, earlier layers may be precise in localization, but they do not capture semantics.
  • To get the best of both, Hariharan et al. 2015 define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. Shown in figure.
  • Multi-model ensembles are another way to deal with image tasks, and are also explored in some papers.
  • Li et al. 2017 propose the deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation.
  • Unlike the conventional model cascade (MC), which consists of multiple independent models, LC treats a single deep model as a cascade of several sub-models: the shallow stages classify most of the easy regions, leaving the deeper stages to focus on a few hard regions.
  • It not only improves segmentation performance but also accelerates both training and testing of the deep network.
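
The control flow of a layer cascade can be illustrated with a toy sketch in plain Python. Everything here is hypothetical: the stage functions are stand-ins that threshold a brightness value, the confidence threshold of 0.9 is not taken from the paper, and in the real method the deeper stages operate on features from the earlier stages rather than on raw pixels. The sketch only shows how easy pixels are decided early and hard ones are passed on.

```python
def layer_cascade(pixels, stages, threshold=0.9):
    """Classify pixels stage by stage, passing only low-confidence pixels on.

    `stages` is a list of functions mapping a pixel value to (label, confidence).
    Returns the labels and the number of pixels decided at each stage.
    """
    labels = [None] * len(pixels)
    decided_per_stage = []
    active = list(range(len(pixels)))
    for index, stage in enumerate(stages):
        last_stage = index == len(stages) - 1
        still_hard = []
        for p in active:
            label, confidence = stage(pixels[p])
            if confidence >= threshold or last_stage:
                labels[p] = label        # easy pixel: decided at this stage
            else:
                still_hard.append(p)     # hard pixel: deferred to a deeper stage
        decided_per_stage.append(len(active) - len(still_hard))
        active = still_hard
    return labels, decided_per_stage


# Hypothetical stand-ins for the sub-models.
def shallow_stage(value):
    label = 1 if value > 0.5 else 0
    return label, abs(value - 0.5) * 2   # confident only far from the boundary


def deep_stage(value):
    return (1 if value > 0.45 else 0), 1.0   # pretend the deeper stage is always sure


pixels = [0.02, 0.97, 0.48, 0.99, 0.55, 0.01]
labels, decided = layer_cascade(pixels, [shallow_stage, deep_stage])
print(labels)   # [0, 1, 1, 1, 1, 0]
print(decided)  # [4, 2]
```

Four of the six toy pixels are settled by the shallow stage, so the deeper stage only processes the two ambiguous ones, which is where the speed-up comes from.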

Conclusion

  • This review covers some purely DL-based approaches to semantic segmentation along with many DL-based approaches inspired by traditional methods.
  • Another review paper covering more recent work should also be read (this one only goes up to 2017).
  • This paper also misses other segmentation architectures like U-Net, ENet, etc.