# Notes on "[Gated-SCNN: Gated Shape CNNs for Semantic Segmentation](http://openaccess.thecvf.com/content_ICCV_2019/html/Takikawa_Gated-SCNN_Gated_Shape_CNNs_for_Semantic_Segmentation_ICCV_2019_paper.html)"
###### tags: `notes` `segmentation` `supervised`
#### Author
[Akshay Kulkarni](https://akshayk07.weebly.com/)
## Brief Outline
This paper presents a 2-stream CNN: one stream is a standard CNN (the classical stream), while the other is a shape stream that explicitly processes shape information in parallel.
## Introduction
* The design of classical CNNs used for image segmentation is inefficient because color, shape, and texture information are processed together inside a single deep network.
* Using residual skip or dense connections leads to performance gains because they allow information to flow across different scales of network depth.
* However, disentangling these representations by design will lead to a more natural and effective recognition pipeline.
* Thus, a 2-stream CNN architecture is proposed that explicitly wires shape information as a separate processing branch. Particularly, a classical CNN is used in one stream, while the other is the shape stream, which processes shape information in parallel.
![Detailed Network Architecture](https://i.imgur.com/cNcBsoH.png)
## Network Architecture
* The first stream is a standard segmentation CNN ("Regular Stream"). The second stream ("Shape Stream") processes shape information in the form of semantic boundaries.
### Regular Stream
* The Regular Stream $R_\theta(I)$ produces dense pixel features. It has parameters $\theta$, takes an image $I \in \mathbb{R}^{3 \times H \times W}$, and outputs a feature representation $r \in \mathbb{R}^{C \times \frac{H}{m} \times \frac{W}{m}}$, where $m$ is the stride of the regular stream.
* This may be any feedforward fully convolutional network, like a ResNet or VGG based segmentation network (they use ResNet101 and WideResNet).
### Shape Stream
* The Shape Stream $S_\phi$ takes image gradients $\nabla I$ as well as output of the first convolutional layer of the Regular Stream as input and produces semantic boundaries as output.
* The network architecture is composed of a few residual blocks interleaved with gated convolutional layers (GCL) (explained later).
* This GCL ensures that the shape stream only processes boundary-relevant information. Ground-truth (GT) boundary edges (from GT segmentation masks) are used to supervise the shape stream using binary cross entropy loss on output boundaries.
* Output boundary map of shape stream is denoted as $s \in \mathbb{R}^{H \times W}$.
* Shape stream is shown in detail in the detailed architecture image inserted above.
### Fusion Module
* This module $\mathcal{F}_\gamma$ with parameters $\gamma$ takes as input the dense feature representation $r$ (from Regular Stream) and fuses it with boundary map $s$ (from Shape Stream) in a way that multi-scale contextual information is preserved.
* While the fusion module structure is not explained in detail in the paper, it can be understood using [the code](https://github.com/nv-tlabs/GSCNN). Thus, the fusion module architecture is explained in the following two images (in 2 parts):
![Fusion Module Part 1](https://i.imgur.com/Gk8AIPX.jpg)
![Fusion Module Part 2](https://i.imgur.com/pgofjxo.jpg)
* The notations used in the images are as follows:
* `Conv2d` represents 2D convolution operator (particularly the PyTorch implementation as far as parameters are concerned). The unnamed parameters in the brackets are (in order): number of input channels, number of output channels, 2D kernel/filter size (i.e. 3 implies 3 $\times$ 3 filter).
* Any other parameters used are named like `pad` (padding on input before convolution operation) or `dil` (dilation factor of convolution).
* `BN` represents the use of normalization layers, while `ReLU` implies the use of the ReLU non-linearity. The `concat` block implies concatenation of feature maps along the channel axis.
* `adap avg pool` means Adaptive Average Pooling layer.
* Note that ASPP refers to Atrous Spatial Pyramid Pooling, which here refers to the 4 parallel operations (in the Part 1 image) and the subsequent concatenation. The output `out` is then used in the Part 2 image.
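Below is a minimal PyTorch sketch of this ASPP-style fusion, reconstructed from the images above. The channel counts, dilation rates, and the omission of the `BN`/ReLU blocks are simplifying assumptions for illustration, not the exact settings of [the code](https://github.com/nv-tlabs/GSCNN).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionASPP(nn.Module):
    """Sketch of the ASPP-based fusion; channel counts and dilation
    rates are assumptions, and BN/ReLU blocks are omitted for brevity."""
    def __init__(self, in_ch=2048, out_ch=256, dilations=(6, 12, 18)):
        super().__init__()
        # 1x1 branch plus three dilated 3x3 branches, run in parallel
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d)
             for d in dilations])
        self.img_conv = nn.Conv2d(in_ch, out_ch, 1)   # image-level branch
        self.edge_conv = nn.Conv2d(1, out_ch, 1)      # shape-stream branch

    def forward(self, r, s):
        h, w = r.shape[-2:]
        # global pooling branch ('adap avg pool'), upsampled back
        img = F.interpolate(self.img_conv(F.adaptive_avg_pool2d(r, 1)),
                            size=(h, w), mode='bilinear', align_corners=False)
        # boundary map s from the shape stream, resized to the feature size
        edge = self.edge_conv(F.interpolate(s, size=(h, w), mode='bilinear',
                                            align_corners=False))
        # concatenate all parallel outputs along the channel axis
        return torch.cat([b(r) for b in self.branches] + [img, edge], dim=1)
```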
### Gated Convolutional Layer (GCL)
* GCL helps the shape stream process only the relevant information by filtering out the rest. In effect, the higher-level information from the regular stream is used to deactivate shape-stream activations that are deemed irrelevant.
* GCL is used at a number of locations. Let $m$ denote the number of locations (note this overloads $m$, used earlier for the stride), and let $t \in \{0, 1, ..., m\}$ be a running index, where $r_t$ and $s_t$ denote intermediate representations of the corresponding regular and shape streams.
* First, an attention map $\alpha_t \in \mathbb{R}^{H \times W}$ is obtained by concatenating $r_t$ and $s_t$, followed by a $1 \times 1$ convolutional layer $C_{1 \times 1}$ followed by a sigmoid function:
$$
\alpha_t = \sigma(C_{1 \times 1}(s_t || r_t))
$$
where $||$ denotes concatenation of feature maps.
* Given $\alpha_t$, GCL is applied on $s_t$ as an element-wise product $\odot$ with attention map $\alpha_t$ followed by a residual connection and channel-wise weighting with kernel $w_t$. At each pixel, GCL $\otimes$ is:
$$
\hat{s}_t^{(i, j)} = (s_t \otimes w_t)_{(i, j)} = ((s_t^{(i, j)} \odot \alpha_t^{(i, j)}) + s_t^{(i, j)})^T w_t
$$
* $\hat{s}_t$ is then passed on to the next layer in the shape stream for further processing. Note that these computations are differentiable, so backpropagation can be done end-to-end.
* Intuitively, $\alpha$ can be seen as an attention map that weighs areas with important boundary information more heavily.
* Bilinear interpolation is used to upsample feature maps from regular stream wherever required.
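A minimal PyTorch sketch of one GCL is given below, following the equations above. The channel counts, the absence of normalization, and the use of a $1 \times 1$ convolution for the channel-wise weighting $w_t$ are assumptions for illustration, not the exact implementation from [the code](https://github.com/nv-tlabs/GSCNN).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvLayer(nn.Module):
    """Minimal sketch of a GCL; channel counts, normalization, and the
    1x1 implementation of w_t are illustrative assumptions."""
    def __init__(self, shape_ch, regular_ch):
        super().__init__()
        # C_{1x1} producing the single-channel attention map alpha_t
        self.attention = nn.Conv2d(shape_ch + regular_ch, 1, kernel_size=1)
        # channel-wise weighting w_t, realized as a 1x1 convolution
        self.weight = nn.Conv2d(shape_ch, shape_ch, kernel_size=1)

    def forward(self, s_t, r_t):
        # bilinear upsampling of regular-stream features to match s_t
        r_t = F.interpolate(r_t, size=s_t.shape[-2:], mode='bilinear',
                            align_corners=False)
        # alpha_t = sigmoid(C_{1x1}(s_t || r_t))
        alpha = torch.sigmoid(self.attention(torch.cat([s_t, r_t], dim=1)))
        # element-wise gating with a residual connection, then weighting
        return self.weight(s_t * alpha + s_t)
```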
### Joint Multi-Task Learning
* Segmentation and boundary map prediction are jointly supervised during training. The boundary map is a binary representation of all the outlines of object classes in the scene.
* Standard Binary Cross Entropy (BCE) loss is used for predicted boundary maps $s$ and standard Cross Entropy (CE) loss on predicted semantic segmentation $f$:
$$
\mathcal{L}^{\theta, \phi, \gamma} = \lambda_1\mathcal{L}_{BCE}^{\theta, \phi}(s, \hat{s}) + \lambda_2\mathcal{L}_{CE}^{\theta, \phi, \gamma}(\hat{y}, f)
$$
where $\hat{s} \in \mathbb{R}^{H \times W}$ denotes GT boundaries and $\hat{y} \in \mathbb{R}^{H \times W}$ denotes GT semantic labels, and $\lambda_1$ and $\lambda_2$ are 2 hyperparameters that control the weight between the 2 losses.
* Since the number of edge pixels is much smaller than the number of non-edge pixels, there is a class imbalance that biases training. Thus, a weighted BCE loss with coefficient $\beta$ is used (as in [Xie and Tu 2015](https://arxiv.org/abs/1504.06375)).
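As a sketch, the class-balanced weighting can be implemented as below, taking $\beta$ as the fraction of non-edge pixels (an assumption consistent with Xie and Tu 2015; the exact batch-wise computation may differ). `pred_boundary` is assumed to hold probabilities in $[0, 1]$ and `gt_boundary` a binary mask.

```python
import torch
import torch.nn.functional as F

def weighted_bce(pred_boundary, gt_boundary):
    # beta = fraction of non-edge pixels, so the rare edge pixels
    # receive the larger weight (beta); non-edge pixels get 1 - beta
    beta = 1.0 - gt_boundary.mean()
    weight = torch.where(gt_boundary > 0.5, beta, 1.0 - beta)
    return F.binary_cross_entropy(pred_boundary, gt_boundary, weight=weight)
```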
### Dual Task Regularizer
* $p(y|r, s) \in \mathbb{R}^{K \times H \times W}$ denotes a categorical distribution output of the fusion module. Let $\zeta \in \mathbb{R}^{H \times W}$ be a potential that represents whether a pixel belongs to a semantic boundary in the input image $I$.
* It is computed by taking a spatial derivative on the segmentation output:
$$
\zeta = \frac{1}{\sqrt{2}}||\nabla(G * \text{argmax}_k(p(y|r, s)))||
$$
where $G$ denotes a Gaussian filter. If we assume $\hat{\zeta}$ is a GT binary mask computed in the same way from the GT semantic labels $\hat{y}$, then:
$$
\mathcal{L}_{reg\rightarrow}^{\theta, \phi, \gamma} = \lambda_3\sum_{p^+}|\zeta(p^+) - \hat{\zeta}(p^+)|
$$
where $p^+$ contains the set of all non-zero pixel coordinates in both $\zeta$ and $\hat{\zeta}$.
* Intuitively, it ensures that boundary pixels are penalized when there is a mismatch with GT boundaries, and it avoids letting non-boundary pixels dominate the loss function.
* Similarly, the boundary prediction from the shape stream $s \in \mathbb{R}^{H \times W}$ can be used to ensure consistency between the binary boundary prediction $s$ and the predicted semantics $p(y|r, s)$:
$$
\mathcal{L}_{reg\leftarrow}^{\theta, \phi, \gamma} = \lambda_4\sum_{k, p}\mathbb{1}_{s_p}[-\hat{y}_p^k \log p(y_p^k|r, s)]
$$
where $p$ and $k$ run over all image pixels and semantic classes respectively. $\mathbb{1}_s = \{1 : s > thrs\}$ corresponds to the indicator function and $thrs$ is a confidence threshold (0.8 in the paper).
* The total dual task regularizer loss function is:
$$
\mathcal{L}^{\theta, \phi, \gamma} = \mathcal{L}_{reg\rightarrow}^{\theta, \phi, \gamma} + \mathcal{L}_{reg\leftarrow}^{\theta, \phi, \gamma}
$$
* Note that $\lambda_3$ and $\lambda_4$ are hyperparameters that control the weighting of the regularizer.
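A hedged sketch of the two regularizer terms follows, assuming the predicted potential $\zeta$ is computed separately (a differentiable version is sketched in the next subsection). The tensor shapes, the union reading of $p^+$, and the helper names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dual_task_regularizer(p, zeta, gt_zeta, gt_label, s,
                          thrs=0.8, lambda3=1.0, lambda4=1.0):
    """p: (N, K, H, W) class probabilities from the fusion module,
    zeta / gt_zeta: (N, 1, H, W) predicted / GT boundary potentials,
    gt_label: (N, H, W) integer GT labels, s: (N, 1, H, W) shape-stream
    boundary prediction. Shapes and names are assumptions."""
    # L_reg->: L1 over p+, read here as the union of the non-zero
    # coordinates of the two potentials (one interpretation of 'both')
    mask = (zeta > 0) | (gt_zeta > 0)
    l_fwd = lambda3 * (zeta - gt_zeta).abs()[mask].sum()

    # L_reg<-: per-pixel cross entropy, kept only where the boundary
    # confidence exceeds the threshold (the indicator 1_{s > thrs})
    ce = F.nll_loss(torch.log(p + 1e-8), gt_label, reduction='none')
    keep = (s.squeeze(1) > thrs).float()
    l_bwd = lambda4 * (keep * ce).sum()
    return l_fwd + l_bwd
```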
#### Gradient Propagation during Training
* The partial derivative of the loss $L$ w.r.t. a given parameter $\eta_i$, through the $\zeta$ equation, is
$$
\frac{\partial L}{\partial \eta_i} = \sum_{j, l} \nabla G * \frac{\partial L}{\partial \zeta_j}\frac{\partial \zeta_j}{\partial g_l} \frac{\partial \text{argmax}_kp(y^k)_l}{\partial \eta_i}
$$
* Since it contains an argmax, it is non-differentiable. So, the Gumbel softmax trick is used (as in [Jang et al. 2016](https://arxiv.org/abs/1611.01144)).
* During the backward pass, we approximate the argmax operator with a softmax with temperature $\tau$:
$$
\frac{\partial\, \text{argmax}_k p(y^k)_l}{\partial \eta_i} = \nabla_{\eta_i} \frac{\exp((\log p(y^k) + g_k) / \tau)}{\sum_j \exp((\log p(y^j) + g_j) / \tau)}
$$
where $g_j \sim \text{Gumbel}(0, I)$ and $\tau$ is a hyperparameter. The operator $\nabla G *$ can be computed by filtering with a Sobel kernel.
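A sketch of this differentiable boundary potential is given below, assuming `F.gumbel_softmax` for the relaxation and Sobel filtering for $\nabla G *$. The expected-class-index collapse and the omission of a separate Gaussian smoothing step are simplifications, not the paper's exact procedure.

```python
import math

import torch
import torch.nn.functional as F

def soft_zeta(seg_logits, tau=1.0):
    # Gumbel-softmax relaxation of argmax_k p(y^k) over the class axis
    y = F.gumbel_softmax(seg_logits, tau=tau, hard=False, dim=1)
    # collapse classes to a single map via the expected class index
    # (a simplification; we only need a differentiable surrogate)
    k = torch.arange(y.shape[1], device=y.device, dtype=y.dtype).view(1, -1, 1, 1)
    seg = (y * k).sum(dim=1, keepdim=True)
    # Sobel kernels stand in for the nabla(G * .) operator
    sx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=seg.device).view(1, 1, 3, 3)
    sy = sx.transpose(2, 3)
    gx = F.conv2d(seg, sx, padding=1)
    gy = F.conv2d(seg, sy, padding=1)
    # gradient magnitude, scaled by 1/sqrt(2) as in the zeta equation
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8) / math.sqrt(2.0)
```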
## Experiments
* Experiments are done on the Cityscapes fine dataset in PyTorch. They use $\lambda_1 = 20$, $\lambda_2 = 1$, $\lambda_3 = 1$ and $\lambda_4 = 1$. They use $\tau = 1$ for the Gumbel softmax.
* They claim SOTA performance (at least at the time of writing).
## Conclusions
* This paper proposes a 2-stream CNN architecture, i.e. one stream is a regular CNN and the other processes shape information.
* They propose Gated Convolutional Layers (GCL) to connect intermediate layers.
* They use a new loss function that exploits the duality between semantic segmentation task and semantic boundary prediction task.