# Notes on "[Understanding Deep Learning Techniques for Image Segmentation](https://arxiv.org/abs/1907.06119)"

###### tags: `review` `segmentation` `unsupervised` `supervised` `adversarial` `weakly-supervised`

#### Author
[Akshay Kulkarni](https://akshayk07.weebly.com/)

## Brief Outline
The paper aims to provide an intuitive understanding of significant DL-based approaches to segmentation.

Subtasks of Image Segmentation:
* Semantic Segmentation: Each pixel is classified into one of a predefined set of classes such that pixels belonging to the same class belong to a unique semantic entity in the image. Note that the semantics (logic) in question depend not only on the data but also on the problem being addressed.
* Saliency Detection: Focuses on the most important object in a scene.
* Instance Segmentation: Segments multiple instances of the same object in a scene.
* Segmentation in the temporal space: Object tracking requires segmentation in the spatial domain as well as over time (temporal domain).
* Oversegmentation: Images are divided into extremely small regions to ensure boundary adherence, at the cost of creating a lot of spurious edges. Region merging techniques are then used to perform image segmentation.
* Color or texture segmentation: Also found to be useful for certain applications.

**This image shows the legend for further images.**

![Legend](https://i.imgur.com/zpxfhDt.png)

## Effectiveness of convolutions for segmentation
* It is observed that the convolutional kernels of a CNN tend to generate activation maps w.r.t. certain features of objects.
* These activation maps can be seen as segmentation masks for objects having specific features.
* Thus, specific segmentation maps can effectively be generated from these output feature maps.
* Most image segmentation algorithms utilize this property of CNNs in some way to achieve segmentation.
* It is important to note that earlier layers give sharper activations than later ones.

## Image Segmentation using Deep Learning
The following table shows a brief summary of DL-based approaches to segmentation:

![Summary of DL-based approaches](https://i.imgur.com/eIucEeT.png)

### Fully Convolutional Layers
* Classification tasks require an output which is a probability distribution over the classes.
* Flattening of 2D feature maps allowed fully connected layers to perform classification. However, flattening results in loss of the spatial relation between pixels in the feature map.
* To overcome this, fully convolutional networks (FCN) were introduced, where the output of the last convolutional block is used directly for pixel-level classification.

![FCN for Segmentation](https://i.imgur.com/EtlSiL0.png)

* Another approach to avoiding fully connected layers is using average pooling to convert a set of 2D feature maps to a set of scalar values. This process is called Global Average Pooling.
* An issue with fully convolutional approaches is loss of sharpness due to intermediate subsampling. This issue has been approached in several ways; skip connections are one of them.

### Region Proposal Networks
* While RCNNs were primarily aimed at object detection (bounding box generation), the approach was relevant to segmentation as well.
* Mask-RCNN ([He et al. 2017](https://arxiv.org/abs/1703.06870)) added a pixel-level classification branch alongside the bounding box regression branch of earlier versions.
* Thus, the object detection capabilities of RCNN-like approaches have often been coupled with segmentation models for instance segmentation.

![RCNN family](https://i.imgur.com/UQTqv0Y.png)

### DeepLab
* Pixel-level segmentation had some issues:
    * Smaller kernel sizes could not capture contextual information. For classification problems, this is solved using pooling layers. However, pooling causes loss of sharpness in the segmented output.
    * Alternatively, using larger kernels is slower due to the larger number of trainable parameters.
* To handle these issues, the DeepLab family of algos used
    * Atrous/Dilated Convolutions: increase the field of view without increasing the number of parameters.
    * Spatial Pyramid Pooling
    * Fully connected Conditional Random Fields (CRF)
* These papers and techniques are covered in the [notes](https://hackmd.io/@akshayk07/B1lv_WN9B) on the other review paper and are thus skipped in these notes.

![DeepLab Family](https://i.imgur.com/hjwcZEY.png)

## Multi-Scale Networks
* One of the major problems in segmentation is that the size of an object is unpredictable, as objects may look smaller or bigger depending on the relative position of the camera and the object.
* In CNNs, small-scale features are captured in the early layers, while later layers are more feature specific for larger objects. For example, a smaller object, like a tiny car, has a lower chance of being detected in the higher layers due to downsampling and pooling operations.
* Thus, it is useful to extract information from feature maps of various scales so that the output considers smaller objects in the image as well.

### PSPNet
* Pyramid Scene Parsing (PSP) Net is explained in the other [notes](https://hackmd.io/@akshayk07/B1lv_WN9B).

![PSPNet](https://i.imgur.com/vnmP4VV.png)

### RefineNet
* Using features from the last layer of a CNN gives soft boundaries in segmentation. DeepLab avoided this issue using atrous/dilated convolutions.
* RefineNet uses an alternative approach. It refines intermediate feature maps and hierarchically concatenates them to combine multi-scale activations while preventing loss of sharpness.

![RefineNet](https://i.imgur.com/K4Nqkpf.png)

## Convolutional Autoencoders
* Autoencoders are traditionally used for representation learning. An autoencoder has 2 parts: an encoder and a decoder.
* The encoder encodes the raw input to a lower dimensional representation, while the decoder attempts to reconstruct the input from the encoded representation.
* The decoder's generative nature can be modified to achieve segmentation tasks.
* The major issue with such approaches is preventing over-abstraction of images during the encoding process.
* The major benefit of these approaches is the generation of sharper boundaries without much complication. Unlike classification approaches, the decoder's generative nature can learn to generate delicate boundaries using the extracted features.
* Another benefit is that the autoencoder approach allows any input size.
* The commonly used techniques for decoding are transposed convolutions or unpooling layers.

### Skip Connections
* Linear skip connections are often used to improve gradient flow through a large number of layers.
* Skip connections are also useful to combine different levels of abstraction from different layers to produce sharp segmentation output.

#### U-Net ([Ronneberger et al. 2015](https://arxiv.org/abs/1505.04597))
* The network has an encoder with a sequence of convolution and max pooling layers. The decoder has a mirrored sequence of transposed convolutions and upsampling layers. Up to this point, it behaves like a traditional autoencoder.
* To incorporate various levels of abstraction, skip connections copy uncompressed feature maps from each encoder stage to its counterpart in the decoder.

![U-Net](https://i.imgur.com/mBhN0Kx.png)

* The feature extractor (encoder and decoder architecture) can be modified. [Jégou et al. 2016](https://arxiv.org/abs/1611.09326) used DenseNets along with the U-Net approach. Capsule networks ([Sabour et al. 2017](https://arxiv.org/abs/1710.09829)) have also been used, along with locally constrained routing ([LaLonde et al. 2018](https://arxiv.org/abs/1804.04241)).

### Forward Pooling Indices
* During max pooling, the maximum response of a region of pixels is the output.
* The forward pooling indices are saved, i.e. the location of the maximum value within each region of pixels is saved. This location is used by the decoder for max-unpooling.
* If this is not done, and the maximum value is copied to an arbitrary location while unpooling, it causes inconsistencies in the segmentation map (especially in the boundary regions).

![maxpool and unpool](https://i.imgur.com/PKjEt0J.png)

#### SegNet ([Badrinarayanan et al. 2015](https://arxiv.org/abs/1511.00561))
* This uses the concept of saving forward pooling indices and passing them to the corresponding unpooling layer.

![SegNet](https://i.imgur.com/JNrreNq.png)

## Adversarial Models ([Luc et al. 2016](https://arxiv.org/abs/1611.08408))
* Previous approaches like FCN, DeepLab, etc. were purely discriminative and generated a probability distribution for each pixel.
* Autoencoders used a generative approach, but the last layer used a pixel-wise softmax classifier.
* From the adversarial learning perspective, the segmentation network is treated as a generator which produces segmentation masks for each class, while a discriminator network attempts to predict whether a set of masks is from the ground truth or from the generator output.
* This framework is generally used where the semantic boundaries of the image and the required output do not necessarily coincide.

![SegGAN](https://i.imgur.com/yfRrV4n.png)

## Sequential Models
* Previous approaches deal with semantic segmentation.
* Instance-level segmentation is generally handled as learning to output a sequence of object segments.
* Sequential models are used for this purpose.

### Recurrent Models ([Romera-Paredes et al. 2016](https://arxiv.org/abs/1511.08250))
* A convolutional LSTM is used to perform instance-level segmentation.
* The convolutional part captures spatial information of images, while the LSTM models long and short term memory across sequential inputs.
* Generally, convolutional LSTMs are appended as a suffix to object segmentation networks.
* The purpose of the recurrent model is to select one instance of the object at each timestep of the sequential output.

### Attention Models
* Attention models are designed to have more control over the process of localizing individual instances.
* Spatial inhibition ([Romera-Paredes et al. 2016](https://arxiv.org/abs/1511.08250)) is one simple method to do this. It learns a bias parameter that cuts off previously detected segments from future activations.
* Attention models have been further developed using a dedicated attention module and an external module to keep track of segments ([Ren et al. 2017](https://arxiv.org/abs/1605.09410)).

## Weakly Supervised or Unsupervised Models
* One simple way is to use networks pretrained on larger datasets with similar kinds of samples and ground truths, and then apply clustering algorithms like K-means on the feature maps.
* However, this is inefficient for data samples that have a unique distribution of sample space. Another issue is that the training is done on data which is different from the test data.
* **The key problem in fully unsupervised segmentation is the development of a loss function capable of measuring the quality of segments or clusters of pixels.**

### Weakly Supervised Algorithms
Even without pixel-level annotations, segmentation algos can exploit coarser annotations like bounding boxes or even image-level labels.

#### Exploiting bounding boxes
* Datasets with bounding boxes are much more widely available than those with pixel-level segmentations.
* A bounding box can be used as weak supervision to generate pixel-level segmentation maps.
* In [Dai et al. 2015](https://arxiv.org/abs/1503.01640), segmentation proposals were generated using region proposal methods like selective search. After that, multi-scale combinatorial grouping is used to combine candidate masks, and the objective is to select the optimal combination that has the highest IoU with the box.
* This segmentation is used to tune a traditional segmentation network like FCN.

### Unsupervised Segmentation
* The success of these depends mostly on the learning mechanism.

#### Learning multiple objectives
* A common variant called JULE (joint unsupervised learning of deep representations) has been used where there is a lack of samples with ground truth.
* JULE basically uses a sequential model along with a deep feature extraction model. The learning methodology is primarily an image clustering algorithm.
* In JULE, the objective function considers the affinity between samples in the clusters and also the negative affinity between the clusters and their neighbours. Agglomerative clustering is performed across the timesteps of a recurrent network.
* Note: Agglomerative means a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
* Another approach ([Ranjan et al. 2019](https://arxiv.org/abs/1805.09806)) uses the idea of learning multiple objectives through adversarial collaboration among networks with independent jobs like monocular depth estimation, camera motion estimation, optical flow estimation, and segmentation of a video into static and moving regions.
* Through competitive collaboration, the networks compete to explain the same pixel, which belongs to either the static or the moving class, and share their learned concepts with a moderator that performs the motion segmentation.

#### Using refinement modules for self supervision
* In [Kanezaki 2018](https://ieeexplore.ieee.org/document/8462533), multiple constraints like similarity between features, spatial continuity and intra-axis normalization are enforced and optimized using backpropagation.
* Spatial continuity is achieved by extracting superpixels from the image using a standard algorithm like SLIC and forcing all pixels within a superpixel to have the same label.
* The difference between the 2 segmentation maps (the network's prediction and its superpixel-refined version) is used as a supervisory signal to update the weights.

#### Other relevant unsupervised techniques
* Using CNNs to solve jigsaw puzzles ([Noroozi et al. 2017](https://arxiv.org/abs/1603.09246), [Mundhenk et al. 2018](https://arxiv.org/abs/1711.06379)) derived from images can be used to learn semantic connections between various parts of objects. The proposed context-free network takes a set of image tiles as input and tries to establish the correct spatial relations between them. In the process, it simultaneously learns features specific to parts of an object as well as their semantic relationships.
* Context encoders ([Pathak et al. 2016](https://arxiv.org/abs/1604.07379)) can also derive spatial and semantic relationships among various parts of an image. Context encoders are CNNs trained to generate arbitrary regions of an image conditioned on their surroundings.
* Another demonstration of extracting semantic information can be found in the process of image colorization ([Zhang et al. 2016](https://arxiv.org/abs/1603.08511)). Colorization requires pixel-level understanding of the semantic boundaries corresponding to objects.
* Other self-supervision techniques ([Zhan et al. 2019](https://arxiv.org/abs/1903.11412)) can leverage motion cues in videos to segment various parts of an object for better semantic understanding. This method can learn several structural and coherent features for tasks like semantic segmentation, human parsing, instance segmentation and so on.

### W-Net ([Xia et al. 2017](https://arxiv.org/abs/1711.08506))
* Inspired by U-Net. It consists of 2 cascaded U-Nets.
* The first U-Net acts as an encoder which segments the image, and the second U-Net attempts to reconstruct the original image from the segmented image.
* 2 loss functions are minimized simultaneously. The first is the MSE between the input and its reconstruction from the 2nd U-Net. The second is a soft version of Normalized Cut ([Shi et al. 2000](https://ieeexplore.ieee.org/document/868688)).
* Output segmentation maps are further refined using fully connected conditional random fields. Remaining insignificant segments are then merged using hierarchical clustering.

## Interactive Segmentation
* A little interaction and guidance from users can significantly improve the performance of segmentation algorithms.
* With the powerful feature extraction of CNNs, the amount of interaction can be reduced.

### Two stream fusion
* [Hu et al. 2019](https://arxiv.org/abs/1807.02480) use a straightforward approach with 2 parallel branches, one for the image and another for an image representing the user interaction, and fuse them to perform segmentation.
* Need to read paper for details.

### Deep Extreme Cut
* Contrary to 2 stream fusion, this ([Maninis et al. 2018](http://www.vision.ee.ethz.ch/~cvlsegmentation/dextr/)) uses a single pipeline to produce segmentation maps from RGB images.
* This method expects 4 points from the user denoting the four extreme points on the boundary of the object (leftmost, rightmost, topmost, bottommost).
* By creating a heatmap from the points, a 4 channel input is fed into a ResNet-101 network.
* The final feature map of the network is passed into a pyramid scene parsing module for analyzing global context to perform the final segmentation.

### Polygon-RNN ([Castrejon et al. 2017](https://arxiv.org/abs/1704.05548))
* Multi-scale features are extracted from different layers of a typical VGG network and concatenated to create a feature block for a recurrent network.
* The RNN in turn provides a sequence of points as output that represents the contour of the object.
* The system is primarily designed as an interactive image annotation tool. Users can interact in two different ways. Firstly, users must provide a tight bounding box for the object of interest. Secondly, after the polygon is built, users are allowed to edit any point in the polygon.
* However, this editing is not used for any further training of the system and hence presents a small avenue for improvement.

## More efficient networks

### ENet
* ENet ([Paszke et al. 2016](https://arxiv.org/abs/1606.02147)) has a deeper encoder and a shallower decoder, unlike the symmetric encoder-decoder architectures of SegNet or U-Net.
* Instead of increasing channel sizes after pooling, pooling operations are performed in parallel with convolutions of stride 2 to reduce overall features.
* PReLU is used instead of ReLU to increase the learning capability. The transfer function remains dynamic so that it can behave as a ReLU as well as an identity function when required.
* Using factorized filters also allowed for a smaller number of parameters.

### Deep Layer Cascade
* This ([Li et al. 2017](https://arxiv.org/abs/1704.01344)) tackled several challenges and made 2 significant contributions.
* Firstly, it analyzed the difficulty of pixel-level segmentation for various classes. With a cascaded network, easier segments are discovered in the earlier stages while the later layers focus on regions that need more delicate segmentation.
* Secondly, the proposed layer cascading can be used with common networks to improve the speed and also the performance to some extent.
* The basic principle is to create a multi-stage pipeline where, in each stage, a certain number of pixels is classified into one of the segments. In the earlier stages, the easiest pixels are classified, and the harder pixels with more uncertainty move forward to later stages.
* In subsequent stages, convolutions take place only on those pixels which could not be classified in the previous stage, while yet harder pixels are forwarded to the next stage.

### SegFast
* **Very fast forward pass (0.38 seconds without GPU)**
* The approach ([Pal et al. 2018](https://www.researchgate.net/publication/329521736_SegFast_A_Faster_SqueezeNet_based_Semantic_Image_Segmentation_Technique_using_Depth-wise_Separable_Convolutions)) combined the concept of depth-wise separable convolutions with the fire modules of SqueezeNet.
* SqueezeNet's fire modules reduced the number of convolutional weights. With depth-wise separable convolutions, the number of parameters went down further.
* Depth-wise separable transposed convolutions are used for decoding.

### Segmentation using superpixels
* Refer to paper for details. Couldn't understand much as of now.

## Applications

### Content-based Image Retrieval (CBIR)
* Since structured and unstructured data is abundant and ever increasing on the Internet, development of efficient information retrieval systems is important.
* Other related problems are visual question answering, interactive query based image processing, and description generation.
* Unsupervised approaches are useful for handling bulk amounts of non-annotated data.

### Medical Imaging
* Many kinds of diagnostic procedures involve working with images from different types of imaging sources and of various parts of the body.
* The most common tasks are segmentation of organic elements like vessels, tissues, nerves, and so on.
* Other tasks are localization of abnormalities like tumors, aneurysms, and so on.
* Microscopic images also need various kinds of segmentation, like cell or nuclei detection, counting the number of cells, cell structure analysis for cancer detection, and so on.
* The **primary challenge** in this domain is the **lack of bulk amount of data** for challenging diseases and the **variety in the quality of images** due to the different types of imaging devices involved.

### Object Detection
* Applications include robotic maneuverability, autonomous driving, intelligent motion detection, tracking systems, and deep sea or space exploration using intelligent robots.
* UAVs can detect anomalies or threats in remote areas. Geostatistical analysis can be done from satellite images.
* For image or video post-production, segmentation is required for tasks like image matting, compositing, rotoscoping, etc.

### Forensics
* Biometric verification systems based on iris, fingerprint, finger vein, dental records, etc. involve segmentation of various informative regions for efficient analysis.

## Discussion

### Datasets and Annotations
* One of the most important aspects for DL-based approaches is the availability of datasets and annotations.
* When working with a small-scale dataset, pretraining on a larger dataset of a similar domain is a common practice.
* Sometimes, ample data is available but pixel-level labels are not, as creating them is taxing. Even in such cases, pretraining parts of networks on other related problems like classification or localization can help.

### Approach
* A decision related to datasets/annotations is the choice of supervised, unsupervised or weakly supervised approaches.
* The low number of unsupervised and weakly supervised approaches is a legitimate concern because data collection can be automated, but perfect annotation requires manual labour.
* Contributions in terms of building end-to-end scalable systems that can model the data distribution, decide on the optimal number of classes and create accurate pixel-level segmentation maps in a completely unsupervised domain are required.
* Weakly supervised algorithms are also promising since it is much easier to collect annotations for problems like classification or localization.

### Selection of Network
* Pretrained classifiers can be used for FCN approaches. Such classifiers are often used for the encoder part.
* Information passed from various layers of the encoder to correspondingly sized layers of the decoder can give multi-scale information.
* Another benefit of the encoder-decoder architecture is that careful choice of downsampling and upsampling operations allows for the same input and output size. This is a major benefit over simple convolutional approaches like FCN.
* If finer, instance-specific segments are required, object detection techniques are coupled with segmentation ones. This is one way; other approaches use attention-based models or recurrent models for instance segmentation.

### Accuracy
* CRFs are the most commonly used post-processing module for refining outputs. CRFs can be simulated as an RNN to create end-to-end trainable modules which give very precise segmentation maps.
* Other refinement strategies include the use of oversegmentation algos like superpixels or using human interaction to guide the algos.

### Speed
* Networks can be highly compressed using strategies like depth-wise separable convolutions, kernel factorizations, reducing the number of spatial convolutions, etc.
* This reduces the computation while preserving accuracy to some extent.

## Conclusion and Future Scope
* Progress depends on the quality and quantity of available data. There is an abundance of unstructured data on the Internet, but the lack of accurate annotations is a legitimate concern.
* In particular, pixel-level annotations are difficult to obtain without manual intervention.
* The ideal scenario would be to exploit the data distribution itself to analyze and extract meaningful segments that represent concepts rather than content. This is an incredibly challenging task, especially when working with a huge amount of unstructured data.
* The key is to map a representation of the data distribution to the intent of the problem statement such that the derived segments are meaningful in some way and contribute to the overall purpose of the system.

Note: Some traditional approaches for segmentation (not DL-based) are given as supplementary information in the paper. Go through if required.
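
Supplementary sketch: the forward pooling indices idea from the SegNet section can be made concrete with a small pure-Python example. This is only an illustration under stated assumptions: the function names `max_pool_with_indices` and `max_unpool`, the non-overlapping 2x2 window and the toy feature map are all hypothetical and are not taken from the paper or from any SegNet implementation.

```python
# Minimal sketch of max pooling with saved indices and the matching unpooling.
# Names, window size and data are illustrative assumptions, not SegNet's code.

def max_pool_with_indices(fmap, k=2):
    """Non-overlapping k x k max pooling over a 2D list.

    Returns the pooled map and, for every output cell, the (row, col)
    position of the maximum in the input (the forward pooling index).
    """
    rows, cols = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for r in range(0, rows - k + 1, k):
        pooled_row, index_row = [], []
        for c in range(0, cols - k + 1, k):
            # Scan the window and remember where the maximum sits.
            best_val, best_pos = fmap[r][c], (r, c)
            for dr in range(k):
                for dc in range(k):
                    if fmap[r + dr][c + dc] > best_val:
                        best_val = fmap[r + dr][c + dc]
                        best_pos = (r + dr, c + dc)
            pooled_row.append(best_val)
            index_row.append(best_pos)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, out_shape):
    """Put each pooled value back at its saved location; other cells stay 0."""
    rows, cols = out_shape
    out = [[0] * cols for _ in range(rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_with_indices(feature_map)
restored = max_unpool(pooled, indices, (4, 4))
print(pooled)    # [[4, 5], [6, 7]]
print(restored)  # maxima restored at their original positions, zeros elsewhere
```

Because the decoder reuses the saved positions, each maximum returns to the pixel it came from. Placing it at an arbitrary position inside the window would shift responses by up to one pixel per unpooling stage, which is the boundary inconsistency the notes mention.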