This paper reviews semantic segmentation approaches, both traditional and DL-based.
Brief Outline
The paper reviews some traditional approaches (used before DL), but focuses more on DL-based approaches, which are more recent and have substantially improved the SOTA. It also reviews available datasets and evaluation metrics for semantic segmentation.
Microsoft Common Objects in Context (COCO) (Lin et al. 2014) - 91 classes, 2.5 million labelled instances in 328k images.
ADE20K (Zhou et al. 2017) - 150 classes, also has masks for eyes, nose, mouth of humans.
Related to autonomous driving:
Cityscapes (Cordts et al. 2016) - focuses on urban street scenes. 5k finely annotated images, 20k coarsely annotated. Covers spring, fall, summer seasons for over 50 cities.
KITTI (Fritsch et al. 2013, Menze and Geiger 2015) - images captured while driving around Karlsruhe, on highways and in rural areas; another dataset for autonomous driving applications.
Uses deconvolution and unpooling layers to recover the size after convolution and pooling.
Deconvolution is the learned reverse operation of convolution; a simpler alternative is upsampling by bilinear interpolation, which is computationally efficient, recovers the image well, and is therefore commonly used.
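The unpooling idea (recording max locations during pooling and reusing them to place values back, as in DeconvNet) can be sketched in NumPy; all function names here are illustrative, not from the paper:

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max pooling that also records argmax positions (the 'switches')."""
    h, w = x.shape
    out = np.zeros((h // k, w // k))
    idx = np.zeros((h // k, w // k), dtype=int)  # flat index into x
    for i in range(0, h, k):
        for j in range(0, w, k):
            patch = x[i:i+k, j:j+k]
            p = np.argmax(patch)
            out[i // k, j // k] = patch.flat[p]
            idx[i // k, j // k] = (i + p // k) * w + (j + p % k)
    return out, idx

def max_unpool(pooled, idx, shape):
    """Place each pooled value back at its recorded location; the rest stay zero."""
    out = np.zeros(shape)
    out.flat[idx.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 2., 0., 3.],
              [4., 0., 1., 0.],
              [0., 0., 5., 6.],
              [1., 2., 0., 0.]])
p, idx = max_pool_with_indices(x)
u = max_unpool(p, idx, x.shape)
```

Each pooled maximum returns to its original position and everything else stays zero; the deconvolution layers then densify this sparse map.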
Unlike FCN, this network is applied to individual object proposals to obtain instance-wise segmentations, which are combined into the final semantic segmentation. This is how the original paper does it, but other implementations exist as well.
Most of the work prior to this was based on the conventional CNN. However, the conventional CNN is geared towards image classification, which is structurally different from dense prediction tasks such as semantic segmentation.
Dilated convolutions support exponential receptive field expansion without loss of coverage or resolution.
Dilated Residual Networks have also been developed to overcome the gridding artifacts of dilated convolutions.
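A minimal NumPy sketch of a dilated (atrous) convolution, with illustrative names: stacking 3x3 kernels with rates 1, 2, 4 grows the receptive field from 3 to 7 to 15 pixels per side, while plain 3x3 convolutions would only reach 7.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """'Valid' 2-D cross-correlation with a dilated kernel."""
    kh, kw = kernel.shape
    eff_h = (kh - 1) * rate + 1   # effective kernel extent after dilation
    eff_w = (kw - 1) * rate + 1
    H, W = x.shape
    out = np.zeros((H - eff_h + 1, W - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i+eff_h:rate, j:j+eff_w:rate]  # sample input with gaps
            out[i, j] = np.sum(patch * kernel)
    return out

# Stacked 3x3 convolutions with rates 1, 2, 4: one output unit ends up
# seeing a 15x15 input region, i.e. exponential receptive-field growth.
x = np.ones((15, 15))
k = np.ones((3, 3))
y = dilated_conv2d(dilated_conv2d(dilated_conv2d(x, k, 1), k, 2), k, 4)
```

With a 15x15 all-ones input, the final output collapses to a single value, confirming that one output unit covers the entire input.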
Progress in Backbone Network
Backbone network refers to the main feature-extraction structure on which the segmentation head is built.
Some notable architectures (roughly in order of development)
VGG
Inception-v2
Inception-v3
ResNet
Inception-v4
Inception-ResNet
ResNeXt
Pyramid Methods
Image Pyramid
An image pyramid (Adelson et al. 1984) is a collection of images that are successively downsampled until a stopping criterion is reached.
2 types of pyramids:
Gaussian pyramid: successively downsamples the image
Laplacian pyramid: stores the residual detail needed to reconstruct each level from the next, coarser one
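A rough NumPy sketch of the two pyramid types, using 2x2 average pooling and nearest-neighbour expansion in place of proper Gaussian smoothing (a simplifying assumption):

```python
import numpy as np

def downsample(img):
    """One Gaussian-pyramid step, approximated by 2x2 average pooling."""
    h, w = img.shape
    return img[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Nearest-neighbour 2x expansion (stand-in for the smooth expand step)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

img = np.arange(64, dtype=float).reshape(8, 8)
gaussian = [img]
while gaussian[-1].shape[0] > 1:          # downsample until 1x1
    gaussian.append(downsample(gaussian[-1]))

# Each Laplacian level stores what the coarser level lost, so level i can be
# reconstructed exactly as upsample(level i+1) + residual.
laplacian = [g - upsample(g_next) for g, g_next in zip(gaussian, gaussian[1:])]
recon = upsample(gaussian[1]) + laplacian[0]
```

Here `recon` recovers the original image exactly, which is the defining property of the Laplacian pyramid.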
Lin et al. 2016 present a network with sliding pyramid pooling that improves segmentation by using patch-background contextual information.
Similarly, Chen et al. 2016 implement an image pyramid structure that extracts multi-scale features by feeding multiple resized input images to a shared deep network. At the end of the network, the features are merged for pixel-wise classification.
Chen et al. 2017 propose Atrous Spatial Pyramid Pooling (ASPP) to robustly segment objects at multiple scales.
ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view (FOV), and thereby captures objects as well as image context at multiple scales.
Zhao et al. 2016 exploit global context information through different-region-based context aggregation and name their network the pyramid scene parsing network (PSPNet).
The pyramid pooling module pools at several scales, upsamples the outputs back to the original size, and concatenates the results to form a mixed feature representation.
In the figure, the different pooling scales are marked with different colors.
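A PSPNet-style pyramid pooling module can be sketched in NumPy as follows; the bin sizes (1, 2, 3, 6) follow the paper, while the nearest-neighbour upsampling and function names are simplifications:

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool a 2-D map into a bins x bins grid (even divisions assumed)."""
    h, w = x.shape
    return x.reshape(bins, h // bins, bins, w // bins).mean(axis=(1, 3))

def pyramid_pool(feat, bin_sizes=(1, 2, 3, 6)):
    """Pool at several grid sizes, upsample each result back to the input size
    (nearest-neighbour here), and stack with the original feature map."""
    h, w = feat.shape
    levels = [feat]
    for b in bin_sizes:
        pooled = adaptive_avg_pool(feat, b)
        levels.append(pooled.repeat(h // b, axis=0).repeat(w // b, axis=1))
    return np.stack(levels)   # mixed multi-scale feature representation

feat = np.random.rand(6, 6)
mixed = pyramid_pool(feat)
```

The coarsest level (bin size 1) is simply the global average broadcast over the whole map, which supplies the global context PSPNet is after.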
Feature Pyramid
Recent deep learning object detectors have avoided pyramid representations because they are compute- and memory-intensive.
Lin et al. 2016 exploit the inherent multi-scale, pyramidal hierarchy of CNNs to construct feature pyramids at marginal extra cost. Their Feature Pyramid Network (FPN) builds high-level semantic feature maps at all scales.
Recognition algorithms based on CNNs use the output of the last layer as the feature representation. However, the information in this layer is spatially too coarse for dense prediction. Conversely, earlier layers are precise in localization but do not capture semantics.
To get the best of both, Hariharan et al. 2015 define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel, as shown in the figure.
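A hypercolumn can be sketched in NumPy by upsampling each layer's feature map to the image grid and stacking them; the toy layer shapes below are illustrative:

```python
import numpy as np

def hypercolumns(feature_maps, out_hw):
    """Upsample each layer's feature map to the image grid (nearest-neighbour
    here) and stack them, so every pixel gets a vector of activations drawn
    from all layers, coarse and fine."""
    H, W = out_hw
    cols = []
    for fmap in feature_maps:      # fmap: (channels, h, w); h, w divide H, W
        c, h, w = fmap.shape
        cols.append(fmap.repeat(H // h, axis=1).repeat(W // w, axis=2))
    return np.concatenate(cols)    # (sum of channels, H, W)

# Toy "network": a fine early layer and a coarse, semantically rich late layer.
early = np.random.rand(4, 8, 8)    # precise localization
late = np.random.rand(16, 2, 2)    # coarse semantics
hc = hypercolumns([early, late], (8, 8))
```

Each pixel's hypercolumn `hc[:, i, j]` then concatenates fine-grained and semantic features for dense prediction.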
Multi-model ensembling is another way to deal with image tasks, and several papers take this approach.
Li et al. 2017 propose the deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation.
Unlike the conventional model cascade (MC), which consists of multiple independent models, LC treats a single deep model as a cascade of several sub-models: most of the easy regions are classified in the shallow stages, letting the deeper stages focus on a few hard regions.
It not only improves segmentation performance but also accelerates both training and testing of the deep network.
Conclusion
This review covers some pure DL-based approaches to semantic segmentation, along with many DL-based approaches inspired by traditional methods.
Other review papers covering more recent work should also be read (this one only goes up to 2017).
This paper also misses other segmentation architectures such as U-Net and ENet.