This paper presents a two-stream CNN for semantic segmentation: one stream is a standard CNN (the classical stream), while the other is a shape stream that processes shape information explicitly.
Introduction
Classical CNNs for image segmentation are inefficient by design because color, shape and texture information are all processed together inside a single deep network.
Residual, skip or dense connections bring performance gains because they let information flow across different depths of the network.
However, the authors argue that disentangling these representations by design would lead to a more natural and effective recognition pipeline.
Thus, a two-stream CNN architecture is proposed that explicitly wires shape information in as a separate processing branch, running in parallel with the classical stream.
The first stream is a standard segmentation CNN ("Regular Stream"). The second stream ("Shape Stream") processes shape information in the form of semantic boundaries.
Regular Stream
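The Regular Stream produces dense pixel features. It is denoted $\mathcal{R}_\theta(I)$ with parameters $\theta$, takes an image $I \in \mathbb{R}^{3 \times H \times W}$ as input, and outputs a feature representation $r \in \mathbb{R}^{C \times \frac{H}{m} \times \frac{W}{m}}$, where $m$ is the stride of the regular stream.
This may be any feedforward fully convolutional network, such as a ResNet- or VGG-based segmentation network (the authors use ResNet-101 and WideResNet).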
Shape Stream
The Shape Stream, denoted $\mathcal{S}_\phi$ with parameters $\phi$, takes the image gradients $\nabla I$ as well as the output of the first convolutional layer of the Regular Stream as input, and produces semantic boundaries as output.
The network is composed of a few residual blocks interleaved with gated convolutional layers (GCL), which are explained later.
The GCL ensures that the shape stream processes only boundary-relevant information. Ground-truth (GT) boundary edges, derived from the GT segmentation masks, supervise the shape stream through a binary cross-entropy loss on the output boundaries.
The output boundary map of the shape stream is denoted $s \in \mathbb{R}^{H \times W}$.
The shape stream is shown in detail in the architecture image inserted above.
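The notes above say the GT boundaries come from the GT segmentation masks but do not say how. A minimal sketch of one plausible rule, assumed here for illustration only, marks a pixel as boundary when any of its 4-neighbours carries a different label. The function name `boundary_from_mask` is hypothetical.

```python
def boundary_from_mask(mask):
    """Return a binary map with 1 where a pixel has a 4-neighbour of a different class.

    `mask` is a list of rows of integer class labels. This neighbour rule is an
    assumption for illustration, not the procedure used by the authors.
    """
    h, w = len(mask), len(mask[0])
    edges = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w and mask[ni][nj] != mask[i][j]:
                    edges[i][j] = 1
                    break
    return edges


# Two classes split by a vertical line: the two columns touching the split are edges.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edges = boundary_from_mask(mask)
```

With this rule the boundary is two pixels wide, since both sides of a label change are marked.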
Fusion Module
This module, denoted $\mathcal{F}_\gamma$ with parameters $\gamma$, takes as input the dense feature representation $r$ (from the Regular Stream) and fuses it with the boundary map $s$ (from the Shape Stream) in a way that preserves multi-scale contextual information.
Although the paper does not describe the structure of the fusion module in detail, it can be worked out from the code. The architecture is therefore explained in the following two images (Part 1 and Part 2):
Conv2d represents the 2D convolution operator (specifically the PyTorch implementation, as far as parameters are concerned). The unnamed parameters in the brackets are, in order: number of input channels, number of output channels, and 2D kernel/filter size (i.e. 3 implies a $3 \times 3$ filter).
Any other parameters are named, such as pad (padding applied to the input before the convolution) or dil (dilation factor of the convolution).
BN denotes a batch normalization layer, and ReLU denotes the ReLU non-linearity. A concat block concatenates feature maps along the channel axis.
adap avg pool means an Adaptive Average Pooling layer.
Note that ASPP refers to Atrous Spatial Pyramid Pooling, which covers the 4 parallel operations (in the Part 1 image) and the subsequent concatenation. The resulting output, out, is then used in the Part 2 image.
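To read the pad and dil parameters in the figures, it helps to check how they change the spatial size of a feature map. The sketch below uses the output-size formula given in the PyTorch `Conv2d` documentation. The helper name `conv2d_out_size` and the example values (a 64-pixel side, dilation 6) are illustrative assumptions, not values taken from the figures.

```python
def conv2d_out_size(size, kernel, pad=0, dil=1, stride=1):
    """Spatial output size of a 2D convolution along one axis."""
    return (size + 2 * pad - dil * (kernel - 1) - 1) // stride + 1


# A 1x1 convolution never changes the spatial size.
same_1x1 = conv2d_out_size(64, kernel=1)

# A dilated 3x3 convolution keeps the size when pad equals dil.
same_dilated = conv2d_out_size(64, kernel=3, pad=6, dil=6)

# Without padding, a plain 3x3 convolution shrinks each side by 2.
shrunk = conv2d_out_size(64, kernel=3)
```

Keeping every parallel branch at the same spatial size is what allows their outputs to be concatenated along the channel axis.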
Gated Convolutional Layer (GCL)
The GCL helps the shape stream process only relevant information by filtering out the rest. It lets the shape stream deactivate those of its own activations that the higher-level information in the regular stream deems irrelevant.
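GCL is used in a number of locations between the two streams. Let $m$ denote the number of locations, and let $t \in \{0, 1, \dots, m\}$ be a running index, where $r_t$ and $s_t$ denote the intermediate representations of the corresponding regular and shape streams.
First, an attention map $\alpha_t \in \mathbb{R}^{H \times W}$ is obtained by concatenating $r_t$ and $s_t$, followed by a $1 \times 1$ convolutional layer $C_{1 \times 1}$ and a sigmoid function $\sigma$:
$$\alpha_t = \sigma\left(C_{1 \times 1}(s_t \,\|\, r_t)\right)$$
where $\|$ denotes concatenation of feature maps.
Given $\alpha_t$, GCL is applied on $s_t$ as an element-wise product $\odot$ with the attention map $\alpha_t$, followed by a residual connection and channel-wise weighting with kernel $w_t$. At each pixel $(i, j)$, the GCL operation $\circledast$ is:
$$\hat{s}_t^{(i,j)} = (s_t \circledast w_t)_{(i,j)} = \left( \left( s_t^{(i,j)} \odot \alpha_t^{(i,j)} \right) + s_t^{(i,j)} \right)^T w_t$$
$\hat{s}_t$ is then passed on to the next layer in the shape stream for further processing. Note that these computations are differentiable, so backpropagation can be done end-to-end.
Intuitively, $\alpha_t$ can be seen as an attention map that weighs areas with important boundary information more heavily.
Bilinear interpolation is used to upsample the feature maps from the regular stream wherever required.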
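The gating steps above can be sketched for a single pixel in plain Python: concatenate the two feature vectors, compute one attention scalar with a 1x1 convolution and a sigmoid, gate the shape features, add the residual, then apply the channel-wise weighting. All names (`gcl_pixel`, `attn_w`, `attn_b`, `w`) and the toy values are assumptions for illustration; a real implementation would use tensor operations over whole feature maps.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gcl_pixel(s, r, attn_w, attn_b, w):
    """Gated convolution at one pixel.

    s: shape-stream features at this pixel (length Cs)
    r: regular-stream features at this pixel (length Cr)
    attn_w, attn_b: weights and bias of the 1x1 conv that yields the attention scalar
    w: channel-wise weighting kernel, one row of length Cs per output channel
    """
    concat = s + r  # channel concatenation of s and r
    alpha = sigmoid(sum(wi * xi for wi, xi in zip(attn_w, concat)) + attn_b)
    gated = [si * alpha + si for si in s]  # element-wise gate plus residual
    out = [sum(wk * g for wk, g in zip(row, gated)) for row in w]
    return alpha, out


# Toy values: zero attention weights give alpha = 0.5, identity kernel keeps channels.
alpha, out = gcl_pixel(
    s=[1.0, -2.0],
    r=[0.5],
    attn_w=[0.0, 0.0, 0.0],
    attn_b=0.0,
    w=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because the attention value lies between 0 and 1, the gated features before the weighting kernel are the shape features scaled by a factor between 1 and 2.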
Joint Multi-Task Learning
Segmentation and boundary map prediction are supervised jointly during training. The boundary map is a binary representation of all the outlines of object classes in the scene.
A standard Binary Cross Entropy (BCE) loss is used on the predicted boundary map $s$ and a standard Cross Entropy (CE) loss on the predicted semantic segmentation $f$:
$$\mathcal{L}^{\theta\phi,\gamma} = \lambda_1 \mathcal{L}_{BCE}^{\theta,\phi}(s, \hat{s}) + \lambda_2 \mathcal{L}_{CE}^{\theta\phi,\gamma}(\hat{y}, f)$$
where $\hat{s} \in \mathbb{R}^{H \times W}$ denotes the GT boundaries, $\hat{y} \in \mathbb{R}^{H \times W}$ denotes the GT semantic labels, and $\lambda_1$ and $\lambda_2$ are two hyperparameters that control the weighting between the two losses.
Since there are far fewer edge pixels than non-edge pixels, the classes are imbalanced, which biases the loss. Thus, a weighted BCE loss with coefficient $\beta$ is used (as in Xie and Tu 2015).
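A minimal sketch of a class-balanced BCE in the style of Xie and Tu (2015) follows, where the coefficient is the fraction of non-edge pixels, so the rare edge pixels receive the larger weight. Averaging over pixels and clamping with `eps` are choices made here for illustration, and the notes above do not confirm that the paper's weighting matches this form exactly.

```python
import math


def weighted_bce(pred, target, eps=1e-7):
    """Class-balanced BCE over flat lists of probabilities and {0, 1} labels.

    beta is the fraction of non-edge pixels; edge pixels are weighted by beta
    and non-edge pixels by (1 - beta).
    """
    n = len(target)
    beta = (n - sum(target)) / n
    loss = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        if t == 1:
            loss -= beta * math.log(p)
        else:
            loss -= (1.0 - beta) * math.log(1.0 - p)
    return beta, loss / n


# One edge pixel among four, with an uninformative prediction everywhere.
beta, loss = weighted_bce(pred=[0.5, 0.5, 0.5, 0.5], target=[1, 0, 0, 0])
```

In this example the single edge pixel contributes 0.75 * ln 2 and the three non-edge pixels together contribute 3 * 0.25 * ln 2, so both classes carry equal total weight.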
Dual Task Regularizer
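$p(y \mid r, s) \in \mathbb{R}^{K \times H \times W}$ denotes the categorical distribution output by the fusion module. Let $\zeta \in \mathbb{R}^{H \times W}$ be a potential that represents whether a pixel belongs to a semantic boundary in the input image $I$.
It is computed by taking a spatial derivative on the segmentation output:
$$\zeta = \frac{1}{\sqrt{2}} \left\| \nabla \left( G * \arg\max_k \, p(y^k \mid r, s) \right) \right\|$$
where $G$ denotes a Gaussian filter. If we assume $\hat{\zeta}$ is a GT binary mask computed in the same way from the GT semantic labels $\hat{y}$, then:
$$\mathcal{L}_{reg\rightarrow}^{\theta\phi,\gamma} = \lambda_3 \sum_{p^+} \left| \zeta(p^+) - \hat{\zeta}(p^+) \right|$$
where $p^+$ contains the set of all non-zero pixel coordinates in both $\zeta$ and $\hat{\zeta}$.
Intuitively, this penalizes boundary pixels when they do not match the GT boundaries, while preventing non-boundary pixels from dominating the loss function.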
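The segmentation-to-boundary term can be illustrated on hard label maps with a simplified sketch. It replaces the Gaussian-smoothed derivative with plain forward differences and works on the argmax labels directly, both of which are simplifying assumptions. The names `boundary_potential`, `reg_seg_to_boundary` and `lam3` are hypothetical.

```python
import math


def boundary_potential(labels):
    """(1 / sqrt 2) * gradient magnitude of a label map, via forward differences.

    No Gaussian smoothing is applied; this is a simplification for illustration.
    """
    h, w = len(labels), len(labels[0])
    zeta = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            dx = labels[i][min(j + 1, w - 1)] - labels[i][j]
            dy = labels[min(i + 1, h - 1)][j] - labels[i][j]
            zeta[i][j] = math.sqrt(dx * dx + dy * dy) / math.sqrt(2)
    return zeta


def reg_seg_to_boundary(pred_labels, gt_labels, lam3=1.0):
    """Sum of |zeta - zeta_gt| over pixels where either potential is non-zero."""
    z = boundary_potential(pred_labels)
    z_gt = boundary_potential(gt_labels)
    total = 0.0
    for z_row, gt_row in zip(z, z_gt):
        for a, b in zip(z_row, gt_row):
            if a != 0.0 or b != 0.0:  # pixels that are zero in both add nothing
                total += abs(a - b)
    return lam3 * total


gt_labels = [[0, 0, 1, 1], [0, 0, 1, 1]]
pred_labels = [[0, 1, 1, 1], [0, 1, 1, 1]]  # boundary shifted one pixel to the left
loss_shifted = reg_seg_to_boundary(pred_labels, gt_labels)
loss_exact = reg_seg_to_boundary(gt_labels, gt_labels)
```

A prediction whose boundary is shifted by one pixel is penalized at both the predicted and the true boundary location, while a perfect prediction gives zero loss.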
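Similarly, the boundary prediction $s$ from the shape stream can be used to ensure consistency between the binary boundary prediction $s$ and the predicted semantics $p(y \mid r, s)$:
$$\mathcal{L}_{reg\leftarrow}^{\theta\phi,\gamma} = \lambda_4 \sum_{k,p} \mathbb{1}_{s_p} \left[ \hat{y}_p^k \log p(y_p^k \mid r, s) \right]$$
where $p$ and $k$ run over all image pixels and semantic classes respectively. $\mathbb{1}_s = \{1 : s > \text{thrs}\}$ is the indicator function and thrs is a confidence threshold (0.8 in the paper).
The total dual task regularizer loss is the sum of the segmentation-to-boundary term $\mathcal{L}_{reg\rightarrow}$ and the boundary-to-segmentation term $\mathcal{L}_{reg\leftarrow}$:
$$\mathcal{L}^{\theta\phi,\gamma} = \mathcal{L}_{reg\rightarrow}^{\theta\phi,\gamma} + \mathcal{L}_{reg\leftarrow}^{\theta\phi,\gamma}$$
Note that $\lambda_3$ and $\lambda_4$ are hyperparameters that control the weighting of the regularizer.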
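The boundary-to-segmentation term amounts to a cross entropy that is evaluated only at pixels where the shape stream is confident there is a boundary. The sketch below writes it as a quantity to minimize, with a leading minus sign; that sign convention, the flat per-pixel lists, and the names `reg_boundary_to_seg` and `lam4` are assumptions for illustration.

```python
import math


def reg_boundary_to_seg(probs, gt_labels, boundary, thrs=0.8, lam4=1.0):
    """Cross entropy restricted to confidently predicted boundary pixels.

    probs[p][k]: predicted probability of class k at pixel p
    gt_labels[p]: GT class index at pixel p
    boundary[p]: predicted boundary probability at pixel p
    """
    total = 0.0
    for p_k, y, s in zip(probs, gt_labels, boundary):
        if s > thrs:  # indicator: keep only confident boundary pixels
            total -= math.log(p_k[y])  # the one-hot GT selects the true class
    return lam4 * total


probs = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]
gt_labels = [0, 0, 1]
boundary = [0.9, 0.95, 0.1]  # the third pixel is below the threshold and is ignored
loss = reg_boundary_to_seg(probs, gt_labels, boundary)
```

Only the first two pixels contribute here, giving -log(0.5) - log(0.9).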
Gradient Propagation during Training
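Backpropagating through the segmentation-to-boundary term requires the gradient of the boundary potential $\zeta$. Let $g = \|\cdot\|$; the partial derivative w.r.t. a given parameter $\eta$ is
$$\frac{\partial L}{\partial \eta_i} = \sum_{j,l} \nabla G * \frac{\partial L}{\partial \zeta_j} \frac{\partial \zeta_j}{\partial g_l} \frac{\partial \arg\max_k \, p(y^k)_l}{\partial \eta_i}$$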
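Because this expression contains an argmax, it is non-differentiable, so the Gumbel softmax trick is used (as in Jang et al. 2016).
During the backward pass, the argmax operator is approximated by a softmax with temperature $\tau$:
$$\frac{\partial \arg\max_k \, p(y^k)}{\partial \eta} = \nabla_\eta \frac{\exp\left( (\log p(y^k) + g_k) / \tau \right)}{\sum_j \exp\left( (\log p(y^j) + g_j) / \tau \right)}$$
where $g_j \sim \text{Gumbel}(0, I)$ and $\tau$ is a hyperparameter. The operator $\nabla G *$ can be computed by filtering with a Sobel kernel.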
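The relaxed argmax can be sketched in plain Python as a forward computation only: add Gumbel noise to the log-probabilities, divide by the temperature, and take a softmax. The function names and the example probabilities are illustrative assumptions. In a real training loop an autodiff framework would differentiate through this expression.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_softmax(probs, tau=1.0, rng=random):
    """Soft one-hot sample from a categorical distribution with positive probs."""
    logits = [(math.log(p) + sample_gumbel(rng)) / tau for p in probs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Same noise (same seed), two temperatures: the ranking of classes is unchanged,
# only the sharpness of the distribution differs.
soft = gumbel_softmax([0.7, 0.2, 0.1], tau=1.0, rng=random.Random(0))
sharp = gumbel_softmax([0.7, 0.2, 0.1], tau=0.1, rng=random.Random(0))
```

Lower temperatures push the output towards a one-hot vector, which approximates the argmax more closely but makes the gradients less smooth.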
Experiments
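Experiments are run on the Cityscapes dataset (fine annotations) using PyTorch. The authors use $\lambda_1 = 20$, $\lambda_2 = 1$, $\lambda_3 = 1$ and $\lambda_4 = 1$ for the loss weights, and $\tau = 1$ for the Gumbel softmax.
They claim state-of-the-art performance (at least at the time of writing).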
Conclusions
This paper proposes a two-stream CNN architecture in which one stream is a regular CNN and the other processes shape information.
Gated Convolutional Layers (GCL) connect the intermediate layers of the two streams.
A new loss function exploits the duality between the semantic segmentation task and the semantic boundary prediction task.