# Notes on "[Gated-SCNN: Gated Shape CNNs for Semantic Segmentation](http://openaccess.thecvf.com/content_ICCV_2019/html/Takikawa_Gated-SCNN_Gated_Shape_CNNs_for_Semantic_Segmentation_ICCV_2019_paper.html)"

###### tags: `notes` `segmentation` `supervised`

#### Author

[Akshay Kulkarni](https://akshayk07.weebly.com/)

## Brief Outline

This paper presents a 2-stream CNN: one stream is a standard CNN (classical stream), while the other is a shape stream that explicitly processes shape information in a separate branch.

## Introduction

* Classical CNNs used for image segmentation are inefficient by design because color, shape and texture information are all processed together inside a single deep CNN.
* Residual skip connections or dense connections lead to performance gains because they allow information to flow across different scales of network depth.
* However, disentangling these representations by design leads to a more natural and effective recognition pipeline.
* Thus, a 2-stream CNN architecture is proposed that explicitly wires shape information as a separate processing branch. Particularly, a classical CNN is used in one stream while the other stream is the shape stream, which processes shape information in parallel.

![Detailed Network Architecture](https://i.imgur.com/cNcBsoH.png)

## Network Architecture

* The first stream is a standard segmentation CNN ("Regular Stream"). The second stream ("Shape Stream") processes shape information in the form of semantic boundaries.

### Regular Stream

* The Regular Stream $R_\theta(I)$ produces dense pixel features. It has parameters $\theta$, takes an image $I \in \mathbb{R}^{3 \times H \times W}$ and outputs a feature representation $r \in \mathbb{R}^{C \times \frac{H}{m} \times \frac{W}{m}}$, where $m$ is the stride of the regular stream.
* This may be any feedforward fully convolutional network like a ResNet or VGG based segmentation network (they use ResNet-101 and WideResNet).

### Shape Stream

* The Shape Stream $S_\phi$ takes image gradients $\nabla I$ as well as the output of the first convolutional layer of the Regular Stream as input, and produces semantic boundaries as output.
* The network architecture is composed of a few residual blocks interleaved with gated convolutional layers (GCL, explained later).
* The GCL ensures that the shape stream only processes boundary-relevant information. Ground-truth (GT) boundary edges (obtained from GT segmentation masks) are used to supervise the shape stream with a binary cross entropy loss on the output boundaries.
* The output boundary map of the shape stream is denoted as $s \in \mathbb{R}^{H \times W}$.
* The shape stream is shown in detail in the detailed architecture image inserted above.

### Fusion Module

* This module $\mathcal{F}_\gamma$ with parameters $\gamma$ takes as input the dense feature representation $r$ (from the Regular Stream) and fuses it with the boundary map $s$ (from the Shape Stream) in a way that multi-scale contextual information is preserved.
* While the fusion module structure is not explained in detail in the paper, it can be understood from [the code](https://github.com/nv-tlabs/GSCNN). Thus, the fusion module architecture is explained in the following two images (in 2 parts):

![Fusion Module Part 1](https://i.imgur.com/Gk8AIPX.jpg)
![Fusion Module Part 2](https://i.imgur.com/pgofjxo.jpg)

* The notation used in the images is as follows:
    * `Conv2d` represents the 2D convolution operator (particularly the PyTorch implementation as far as parameters are concerned). The unnamed parameters in the brackets are (in order): number of input channels, number of output channels, and 2D kernel/filter size (i.e. 3 implies a 3 $\times$ 3 filter).
    * Any other parameters used are named, like `pad` (padding on the input before the convolution operation) or `dil` (dilation factor of the convolution).
    * `BN` represents the use of normalization layers, while `ReLU` implies use of the ReLU non-linearity. The `concat` block implies concatenation of feature maps along the channel axis.
    * `adap avg pool` means an Adaptive Average Pooling layer.
    * Note that ASPP refers to Atrous Spatial Pyramid Pooling, which refers to the 4 parallel operations (in the Part 1 image) and the subsequent concatenation. Its output `out` is used in the Part 2 image.

### Gated Convolutional Layer (GCL)

* The GCL helps the shape stream process only relevant information by filtering out the rest. The GCL deactivates its own activations that are deemed irrelevant by the higher-level information in the regular stream.
* The GCL is used at a number of locations. Let $m$ denote the number of locations, and let $t \in \{0, 1, \dots, m\}$ be a running index, where $r_t$ and $s_t$ denote intermediate representations of the corresponding regular and shape streams.
* First, an attention map $\alpha_t \in \mathbb{R}^{H \times W}$ is obtained by concatenating $r_t$ and $s_t$, followed by a $1 \times 1$ convolutional layer $C_{1 \times 1}$ and a sigmoid function:

$$
\alpha_t = \sigma(C_{1 \times 1}(s_t || r_t))
$$

where $||$ denotes concatenation of feature maps.

* Given $\alpha_t$, the GCL is applied on $s_t$ as an element-wise product $\odot$ with the attention map $\alpha_t$, followed by a residual connection and channel-wise weighting with kernel $w_t$. At each pixel $(i, j)$, the GCL $\otimes$ is computed as:

$$
\hat{s}_t^{(i, j)} = (s_t \otimes w_t)_{(i, j)} = ((s_t^{(i, j)} \odot \alpha_t^{(i, j)}) + s_t^{(i, j)})^T w_t
$$

* $\hat{s}_t$ is then passed on to the next layer in the shape stream for further processing (see the sketch below). Note that these computations are differentiable and thus, backprop can be done end-to-end.
* Intuitively, $\alpha_t$ can be seen as an attention map that weighs areas with important boundary information more heavily.
* Bilinear interpolation is used to upsample feature maps from the regular stream wherever required.
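The GCL is simple to implement. Below is a minimal PyTorch sketch, not the authors' exact code: the channel counts, the use of a $1 \times 1$ convolution for the channel-wise weighting $w_t$, and the bilinear upsampling of $r_t$ inside the layer are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvLayer(nn.Module):
    """Minimal sketch of a GCL: gates shape-stream features s_t with an
    attention map alpha_t computed from the concatenation (s_t || r_t)."""
    def __init__(self, s_channels, r_channels):
        super().__init__()
        # C_1x1 producing a single-channel attention map from the concatenation
        self.attn_conv = nn.Conv2d(s_channels + r_channels, 1, kernel_size=1)
        # channel-wise weighting w_t, implemented here as a 1x1 convolution
        self.weight = nn.Conv2d(s_channels, s_channels, kernel_size=1)

    def forward(self, s_t, r_t):
        # bilinearly upsample regular-stream features to the shape stream's size
        r_t = F.interpolate(r_t, size=s_t.shape[-2:], mode='bilinear',
                            align_corners=False)
        # alpha_t = sigmoid(C_1x1(s_t || r_t))
        alpha_t = torch.sigmoid(self.attn_conv(torch.cat([s_t, r_t], dim=1)))
        # gating with a residual connection, then channel-wise weighting:
        # s_hat_t = ((s_t . alpha_t) + s_t)^T w_t
        return self.weight(s_t * alpha_t + s_t)
```

Since every operation here (concatenation, convolution, sigmoid, element-wise product) is differentiable, gradients flow through $r_t$ back into the regular stream, which is what lets the boundary supervision influence the regular stream's features.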
### Joint Multi-Task Learning

* Segmentation and boundary map prediction are jointly supervised during training. The boundary map is a binary representation of all the outlines of object classes in the scene.
* A standard Binary Cross Entropy (BCE) loss is used on the predicted boundary map $s$, and a standard Cross Entropy (CE) loss on the predicted semantic segmentation $f$:

$$
\mathcal{L}^{\theta, \phi, \gamma} = \lambda_1 \mathcal{L}_{BCE}^{\theta, \phi}(s, \hat{s}) + \lambda_2 \mathcal{L}_{CE}^{\theta, \phi, \gamma}(\hat{y}, f)
$$

where $\hat{s} \in \mathbb{R}^{H \times W}$ denotes the GT boundaries, $\hat{y} \in \mathbb{R}^{H \times W}$ denotes the GT semantic labels, and $\lambda_1$ and $\lambda_2$ are 2 hyperparameters that control the weighting between the 2 losses.

* Since the number of edge pixels is much smaller than the number of non-edge pixels, there is an imbalance which introduces a bias. Thus, a weighted BCE loss with coefficient $\beta$ is used (as in [Xie and Tu 2015](https://arxiv.org/abs/1504.06375)).

### Dual Task Regularizer

* $p(y|r, s) \in \mathbb{R}^{K \times H \times W}$ denotes the categorical distribution output by the fusion module. Let $\zeta \in \mathbb{R}^{H \times W}$ be a potential that represents whether a pixel belongs to a semantic boundary in the input image $I$.
* It is computed by taking a spatial derivative of the segmentation output:

$$
\zeta = \frac{1}{\sqrt{2}} ||\nabla (G * \text{argmax}_k (p(y|r, s)))||
$$

where $G$ denotes a Gaussian filter. If $\hat{\zeta}$ is the GT binary mask computed in the same way from the GT semantic labels $\hat{y}$, then:

$$
\mathcal{L}_{reg\rightarrow}^{\theta, \phi, \gamma} = \lambda_3 \sum_{p^+} |\zeta(p^+) - \hat{\zeta}(p^+)|
$$

where $p^+$ contains the set of all non-zero pixel coordinates in both $\zeta$ and $\hat{\zeta}$.

* Intuitively, this ensures that boundary pixels are penalized when there is a mismatch with the GT boundaries, while preventing non-boundary pixels from dominating the loss function.
* Similarly, the boundary prediction from the shape stream $s \in \mathbb{R}^{H \times W}$ can be used to ensure consistency between the binary boundary prediction $s$ and the predicted semantics $p(y|r, s)$:

$$
\mathcal{L}_{reg\leftarrow}^{\theta, \phi, \gamma} = \lambda_4 \sum_{k, p} \mathbb{1}_{s_p} [-\hat{y}_p^k \log p(y_p^k|r, s)]
$$

where $p$ and $k$ run over all image pixels and semantic classes respectively, $\mathbb{1}_{s_p} = \mathbb{1}[s_p > thrs]$ is an indicator function, and $thrs$ is a confidence threshold (0.8 in the paper).

* The total dual task regularizer loss function is:

$$
\mathcal{L}^{\theta, \phi, \gamma} = \mathcal{L}_{reg\rightarrow}^{\theta, \phi, \gamma} + \mathcal{L}_{reg\leftarrow}^{\theta, \phi, \gamma}
$$

* Note that $\lambda_3$ and $\lambda_4$ are hyperparameters that control the weighting of the regularizer.

#### Gradient Propagation during Training

* The partial derivative of the $\zeta$ equation w.r.t. a given parameter $\eta_i$ is

$$
\frac{\partial L}{\partial \eta_i} = \sum_{j, l} \nabla G * \frac{\partial L}{\partial \zeta_j} \frac{\partial \zeta_j}{\partial g_l} \frac{\partial \text{argmax}_k p(y^k)_l}{\partial \eta_i}
$$

where $g$ denotes the smoothed segmentation output $G * \text{argmax}_k p(y^k)$.

* Since it contains an argmax, it is non-differentiable. So, the Gumbel softmax trick is used (as in [Jang et al. 2016](https://arxiv.org/abs/1611.01144)).
* During the backward pass, the argmax operator is approximated with a softmax with temperature $\tau$:

$$
\frac{\partial \text{argmax}_k p(y^k)_l}{\partial \eta_i} = \nabla_{\eta_i} \frac{\exp((\log p(y_k) + g_k) / \tau)}{\sum_j \exp((\log p(y_j) + g_j) / \tau)}
$$

where $g_j \sim \text{Gumbel}(0, I)$ and $\tau$ is a hyperparameter. The operator $\nabla G *$ can be computed by filtering with a Sobel kernel (a rough sketch follows below).
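To make the backward-pass discussion concrete, here is a rough PyTorch sketch of a differentiable $\zeta$. This is a minimal illustration, not the authors' implementation: the Gaussian smoothing $G$ is omitted for brevity, `F.gumbel_softmax` with `hard=True` stands in for the straight-through argmax, and the class index per pixel is recovered from the one-hot output by a weighted sum.

```python
import torch
import torch.nn.functional as F

def boundary_potential(logits, tau=1.0):
    """Sketch: differentiable boundary potential zeta from segmentation
    logits of shape (N, K, H, W). Gaussian smoothing G is omitted."""
    # hard one-hot argmax in the forward pass, soft (temperature tau)
    # gradients in the backward pass (straight-through Gumbel softmax)
    y = F.gumbel_softmax(logits, tau=tau, hard=True, dim=1)
    # recover the argmax class index per pixel from the one-hot map
    classes = torch.arange(logits.size(1), dtype=logits.dtype,
                           device=logits.device)
    seg = (y * classes.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)
    # Sobel kernels stand in for the nabla(G *) operator
    sx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=logits.device).view(1, 1, 3, 3)
    sy = sx.transpose(-1, -2)
    gx = F.conv2d(seg, sx, padding=1)
    gy = F.conv2d(seg, sy, padding=1)
    # zeta = (1 / sqrt(2)) * ||gradient|| (small eps keeps sqrt differentiable)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8) / (2.0 ** 0.5)
```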
## Experiments

* Experiments are done on the Cityscapes fine dataset in PyTorch. They use $\lambda_1 = 20$, $\lambda_2 = 1$, $\lambda_3 = 1$ and $\lambda_4 = 1$. They use $\tau = 1$ for the Gumbel softmax.
* They claim SOTA performance (at least at the time of writing).

## Conclusions

* This paper proposes a 2-stream CNN architecture, i.e. one stream is a regular CNN and the other processes shape information.
* They propose Gated Convolutional Layers (GCL) to connect intermediate layers of the two streams.
* They use a new loss function that exploits the duality between the semantic segmentation task and the semantic boundary prediction task.
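As a closing summary, the pieces above can be wired into a single training objective. The sketch below is schematic only: it assumes `zeta` and `zeta_hat` are precomputed (e.g., with the `boundary_potential` sketch from earlier), uses a plain BCE where the paper uses the $\beta$-weighted version, interprets $p^+$ as pixels that are non-zero in either potential, and takes the $\lambda$ defaults from the Experiments section.

```python
import torch
import torch.nn.functional as F

def gscnn_objective(seg_logits, boundary_pred, gt_seg, gt_boundary,
                    zeta, zeta_hat, l1=20.0, l2=1.0, l3=1.0, l4=1.0,
                    thrs=0.8):
    """Schematic combination of the four loss terms. Shapes assumed:
    seg_logits (N, K, H, W); boundary_pred, gt_boundary, zeta, zeta_hat
    (N, H, W); gt_seg (N, H, W) long. Reductions follow the paper's sums."""
    # joint multi-task loss: lambda_1 * BCE(boundaries) + lambda_2 * CE(seg)
    loss = l1 * F.binary_cross_entropy(boundary_pred, gt_boundary)
    loss = loss + l2 * F.cross_entropy(seg_logits, gt_seg)
    # L_reg->: L1 penalty on boundary potentials over non-zero pixels p+
    pplus = (zeta > 0) | (zeta_hat > 0)
    loss = loss + l3 * (zeta - zeta_hat).abs()[pplus].sum()
    # L_reg<-: CE restricted to pixels where the shape stream is confident
    conf = boundary_pred > thrs
    ce_map = F.cross_entropy(seg_logits, gt_seg, reduction='none')
    loss = loss + l4 * ce_map[conf].sum()
    return loss
```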