
Notes on "Gated-SCNN: Gated Shape CNNs for Semantic Segmentation"

tags: notes segmentation supervised

Author

Akshay Kulkarni

Brief Outline

This paper presents a two-stream CNN: one stream is a standard CNN (the classical/regular stream), while the other is a shape stream that explicitly processes shape information in parallel.

Introduction

  • The classical CNN design used for image segmentation is inefficient because color, shape, and texture information are processed together inside a single deep network.
  • Using residual skip or dense connections leads to performance gains because they allow information to flow across different scales of network depth.
  • However, disentangling these representations by design will lead to a more natural and effective recognition pipeline.
  • Thus, a 2-stream CNN architecture is proposed that explicitly wires shape information into a separate processing branch: a classical CNN is used in one stream, while the other is a shape stream that processes shape information in parallel.

(Figure: overall Gated-SCNN two-stream architecture; image not available.)

Network Architecture

  • The first stream is a standard segmentation CNN ("Regular Stream"). The second stream ("Shape Stream") processes shape information in the form of semantic boundaries.

Regular Stream

  • The Regular Stream $R_\theta(I)$ produces dense pixel features. It has parameters $\theta$, takes an image $I \in \mathbb{R}^{3 \times H \times W}$, and outputs a feature representation $r \in \mathbb{R}^{C \times \frac{H}{m} \times \frac{W}{m}}$, where $m$ is the stride of the regular stream.
  • This may be any feedforward fully convolutional network, such as a ResNet- or VGG-based segmentation network (the authors use ResNet-101 and WideResNet).
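As an illustration, here is a minimal sketch of such a regular stream, using a torchvision ResNet-101 truncated before its classification head (the truncation point and input size are assumptions for illustration; the paper's actual backbones also use dilated convolutions to control the output stride):

```python
import torch
import torchvision

# Regular stream sketch: ResNet-101 without its average-pool and
# fully-connected head, so it outputs dense features instead of logits.
backbone = torchvision.models.resnet101(weights=None)
regular_stream = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 512, 512)   # I in R^{3 x H x W}
features = regular_stream(image)      # r in R^{C x H/m x W/m}, here m = 32
print(features.shape)                 # torch.Size([1, 2048, 16, 16])
```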

Shape Stream

  • The Shape Stream $S_\phi$ takes image gradients $\nabla I$ as well as the output of the first convolutional layer of the Regular Stream as input, and produces semantic boundaries as output.
  • The network architecture is composed of a few residual blocks interleaved with gated convolutional layers (GCL) (explained later).
  • This GCL ensures that the shape stream only processes boundary-relevant information. Ground-truth (GT) boundary edges (from GT segmentation masks) are used to supervise the shape stream using binary cross entropy loss on output boundaries.
  • The output boundary map of the shape stream is denoted $s \in \mathbb{R}^{H \times W}$.
  • The shape stream is shown in the detailed architecture figure above.

Fusion Module

  • This module $F_\gamma$ with parameters $\gamma$ takes as input the dense feature representation $r$ (from the Regular Stream) and fuses it with the boundary map $s$ (from the Shape Stream) in a way that preserves multi-scale contextual information.
  • While the fusion module structure is not explained in detail in the paper, it can be understood from the released code. Thus, the fusion module architecture is illustrated in the following two images (in 2 parts):
    (Figures: fusion module architecture, Part 1 and Part 2; images not available.)
  • The notation used in the images is as follows:
    • Conv2d represents the 2D convolution operator (particularly the PyTorch implementation as far as parameters are concerned). The unnamed parameters in the brackets are (in order): number of input channels, number of output channels, and 2D kernel/filter size (i.e. 3 implies a $3 \times 3$ filter).
    • Any other parameters used are named like pad (padding on input before convolution operation) or dil (dilation factor of convolution).
    • BN represents Batch Normalization layers, while ReLU implies use of the ReLU non-linearity. A concat block implies concatenation of feature maps along the channel axis.
    • adap avg pool means Adaptive Average Pooling layer.
    • Note that ASPP refers to Atrous Spatial Pyramid Pooling, which here refers to the 4 parallel operations (in the Part 1 image) and their subsequent concatenation. Its output out is used in the Part 2 image. A minimal sketch of such a module follows this list.
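Since the exact fusion module is only described via the images above, here is a minimal, hand-rolled sketch of an ASPP-style block with parallel atrous convolutions, an adaptive-average-pooling branch, and channel-wise concatenation. The channel counts and dilation rates are assumptions for illustration, not the paper's exact values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Parallel atrous convolutions + pooled branch, concatenated."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        def branch(k, dil):
            pad = 0 if k == 1 else dil  # keep spatial size
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dil, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
        # One 1x1 branch plus three dilated 3x3 branches (4 parallel ops).
        self.branches = nn.ModuleList([branch(1, 1)] + [branch(3, r) for r in rates])
        # Image-level context branch via adaptive average pooling.
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                  nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        return torch.cat(feats + [pooled], dim=1)  # concat along channels
```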

Gated Convolutional Layer (GCL)

  • GCL helps the shape stream process only relevant information by filtering out the rest. Using GCL, the shape stream deactivates its own activations that are deemed irrelevant by the higher-level information in the regular stream.
  • GCL is used at a number of locations. Let $m$ denote the number of locations, and let $t \in \{0, 1, \dots, m\}$ be a running index, where $r_t$ and $s_t$ denote intermediate representations of the corresponding regular and shape streams.
  • First, an attention map $\alpha_t \in \mathbb{R}^{H \times W}$ is obtained by concatenating $r_t$ and $s_t$, followed by a $1 \times 1$ convolutional layer $C_{1 \times 1}$ and a sigmoid function:

    $$\alpha_t = \sigma(C_{1 \times 1}(s_t \,||\, r_t))$$

    where $||$ denotes concatenation of feature maps.
  • Given $\alpha_t$, GCL is applied on $s_t$ as an element-wise product $\odot$ with the attention map $\alpha_t$, followed by a residual connection and channel-wise weighting with kernel $w_t$. At each pixel $(i, j)$, GCL computes:

    $$\hat{s}_t^{(i,j)} = (s_t \circledast w_t)^{(i,j)} = \left( \left( s_t^{(i,j)} \odot \alpha_t^{(i,j)} \right) + s_t^{(i,j)} \right)^T w_t$$
  • $\hat{s}_t$ is then passed on to the next layer in the shape stream for further processing. Note that these computations are differentiable, so backpropagation can be done end-to-end.
  • Intuitively, $\alpha$ can be seen as an attention map that weighs areas with important boundary information more heavily.
  • Bilinear interpolation is used to upsample feature maps from the regular stream wherever required. A minimal sketch of a GCL is given below.
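Here is a minimal sketch of a GCL following the two equations above. The channel-wise weighting $w_t$ is implemented as a $1 \times 1$ convolution, and the normalization details of the released code are omitted (assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCL(nn.Module):
    """Gated Convolutional Layer: gates shape features s_t with an
    attention map computed from (s_t || r_t)."""
    def __init__(self, shape_ch, regular_ch):
        super().__init__()
        # C_1x1 followed by a sigmoid gives alpha_t.
        self.attention = nn.Sequential(
            nn.Conv2d(shape_ch + regular_ch, 1, kernel_size=1),
            nn.Sigmoid())
        # Channel-wise weighting w_t as a 1x1 convolution.
        self.weight = nn.Conv2d(shape_ch, shape_ch, kernel_size=1, bias=False)

    def forward(self, s_t, r_t):
        # Bilinearly upsample regular-stream features to s_t's resolution.
        r_t = F.interpolate(r_t, size=s_t.shape[-2:],
                            mode='bilinear', align_corners=False)
        alpha_t = self.attention(torch.cat([s_t, r_t], dim=1))
        # Gated residual update: (s_t * alpha_t + s_t), then w_t.
        return self.weight(s_t * alpha_t + s_t)
```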

Joint Multi-Task Learning

  • Segmentation and boundary map prediction are jointly supervised during training. The boundary map is a binary representation of the outlines of all object classes in the scene.
  • Standard Binary Cross Entropy (BCE) loss is used for the predicted boundary maps $s$, and standard Cross Entropy (CE) loss for the predicted semantic segmentation $f$:

    $$\mathcal{L}^{\theta, \phi, \gamma} = \lambda_1 \mathcal{L}_{BCE}^{\theta, \phi}(s, \hat{s}) + \lambda_2 \mathcal{L}_{CE}^{\theta, \phi, \gamma}(\hat{y}, f)$$

    where $\hat{s} \in \mathbb{R}^{H \times W}$ denotes GT boundaries, $\hat{y} \in \mathbb{R}^{H \times W}$ denotes GT semantic labels, and $\lambda_1$ and $\lambda_2$ are two hyperparameters that control the weighting between the two losses.
  • Since the number of edge pixels is very small compared to the number of non-edge pixels, there is an imbalance that introduces a bias. Thus, a weighted BCE loss with coefficient $\beta$ is used (as in Xie and Tu 2015); a minimal sketch is given below.
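A minimal sketch of such a class-balanced BCE, following the weighting scheme of Xie and Tu (2015), assuming boundary logits and a binary GT boundary map as inputs:

```python
import torch
import torch.nn.functional as F

def weighted_bce(logits, target):
    """logits: (N, 1, H, W) boundary predictions; target: binary GT map."""
    # beta = fraction of non-edge pixels; weight edge pixels by beta and
    # non-edge pixels by (1 - beta) to counter the class imbalance.
    beta = 1.0 - target.mean()
    weights = torch.where(target > 0.5, beta, 1.0 - beta)
    return F.binary_cross_entropy_with_logits(logits, target, weight=weights)

# Usage sketch with random tensors:
logits = torch.randn(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.95).float()  # sparse edges
print(weighted_bce(logits, target))
```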

Dual Task Regularizer

  • $p(y | r, s) \in \mathbb{R}^{K \times H \times W}$ denotes the categorical distribution output of the fusion module. Let $\zeta \in \mathbb{R}^{H \times W}$ be a potential that represents whether a pixel belongs to a semantic boundary in the input image $I$.
  • It is computed by taking a spatial derivative of the segmentation output:

    $$\zeta = \frac{1}{\sqrt{2}} \left\| \nabla \left( G * \operatorname{argmax}_k p(y^k | r, s) \right) \right\|$$

    where $G$ denotes a Gaussian filter. If we assume $\hat{\zeta}$ is a GT binary mask computed in the same way from the GT semantic labels $\hat{f}$, then:

    $$\mathcal{L}_{reg \rightarrow}^{\theta, \phi, \gamma} = \lambda_3 \sum_{p^+} \left| \zeta(p^+) - \hat{\zeta}(p^+) \right|$$

    where $p^+$ is the set of all non-zero pixel coordinates in both $\zeta$ and $\hat{\zeta}$.
  • Intuitively, this ensures that boundary pixels are penalized when there is a mismatch with GT boundaries, and it avoids having non-boundary pixels dominate the loss function.
  • Similarly, the boundary prediction from the shape stream $s \in \mathbb{R}^{H \times W}$ can be used to ensure consistency between the binary boundary prediction $s$ and the predicted semantics $p(y | r, s)$:

    $$\mathcal{L}_{reg \leftarrow}^{\theta, \phi, \gamma} = \lambda_4 \sum_{k, p} \mathbb{1}_{s_p} \left[ \hat{y}_p^k \log p(y_p^k | r, s) \right]$$

    where $p$ and $k$ run over all image pixels and semantic classes respectively, $\mathbb{1}_s = \{1 : s > thrs\}$ corresponds to the indicator function, and $thrs$ is a confidence threshold (0.8 in the paper).
  • The total dual-task regularizer loss function is:

    $$\mathcal{L}^{\theta, \phi, \gamma} = \mathcal{L}_{reg \rightarrow}^{\theta, \phi, \gamma} + \mathcal{L}_{reg \leftarrow}^{\theta, \phi, \gamma}$$
  • Note that $\lambda_3$ and $\lambda_4$ are hyperparameters that control the weighting of the regularizer. A minimal sketch of the $\zeta$ computation and the first regularizer term is given below.
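A minimal sketch of the boundary potential $\zeta$ and the first regularizer term, with a hand-rolled Gaussian kernel and finite-difference spatial gradient. The kernel size and the use of a hard argmax here are illustrative assumptions; during training the paper replaces the argmax with the Gumbel softmax described in the next section:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=5, sigma=1.0):
    """Simple 2D Gaussian filter G as a (1, 1, size, size) tensor."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

def boundary_potential(seg_probs, kernel):
    """zeta = (1/sqrt(2)) * ||grad(G * argmax_k p(y^k))||."""
    hard = seg_probs.argmax(dim=1, keepdim=True).float()
    smoothed = F.conv2d(hard, kernel, padding=kernel.shape[-1] // 2)
    # Finite-difference gradients, padded back to the input size.
    gy = F.pad(smoothed[..., 1:, :] - smoothed[..., :-1, :], (0, 0, 0, 1))
    gx = F.pad(smoothed[..., :, 1:] - smoothed[..., :, :-1], (0, 1, 0, 0))
    return torch.sqrt(gx ** 2 + gy ** 2) / (2 ** 0.5)

def reg_loss(zeta, zeta_gt, lam3=1.0):
    """L1 penalty over p+, the pixels where either map is non-zero."""
    p_plus = (zeta > 0) | (zeta_gt > 0)
    return lam3 * (zeta - zeta_gt).abs()[p_plus].sum()
```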

Gradient Propagation during Training

  • The partial derivative of the $\zeta$ equation w.r.t. a given parameter $\eta$ is:

    $$\frac{\partial L}{\partial \eta_i} = \sum_{j, l} \nabla G * \frac{\partial L}{\partial \zeta_j} \frac{\partial \zeta_j}{\partial g_l} \frac{\partial \operatorname{argmax}_k p(y^k)_l}{\partial \eta_i}$$
  • Since this contains an argmax, it is non-differentiable. So, the Gumbel softmax trick is used (as in Jang et al. 2016).
  • During the backward pass, the argmax operator is approximated with a softmax with temperature $\tau$:

    $$\frac{\partial \operatorname{argmax}_k p(y^k)_l}{\partial \eta_i} \approx \frac{\partial}{\partial \eta_i} \frac{\exp((\log p(y^k) + g_k) / \tau)}{\sum_j \exp((\log p(y^j) + g_j) / \tau)}$$

    where $g_j \sim \text{Gumbel}(0, I)$ and $\tau$ is a hyperparameter. The operator $\nabla G$ can be computed by filtering with a Sobel kernel. A minimal sketch of the trick follows.
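A minimal sketch of the straight-through Gumbel softmax replacement for the argmax. PyTorch also ships this as torch.nn.functional.gumbel_softmax; the hand-rolled version below just makes the mechanics explicit:

```python
import torch
import torch.nn.functional as F

def gumbel_argmax(log_probs, tau=1.0):
    """Hard argmax in the forward pass, tempered softmax gradients in
    the backward pass (straight-through estimator)."""
    # g ~ Gumbel(0, 1), sampled via inverse transform of uniform noise.
    g = -torch.log(-torch.log(torch.rand_like(log_probs) + 1e-10) + 1e-10)
    soft = F.softmax((log_probs + g) / tau, dim=1)
    hard = F.one_hot(soft.argmax(dim=1), soft.shape[1])
    hard = hard.permute(0, 3, 1, 2).float()
    return hard + soft - soft.detach()  # forward: hard; backward: soft
```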

Experiments

  • Experiments are done on the Cityscapes fine dataset in PyTorch. They use $\lambda_1 = 20$, $\lambda_2 = 1$, $\lambda_3 = 1$, and $\lambda_4 = 1$. They use $\tau = 1$ for the Gumbel softmax.
  • They report state-of-the-art (SOTA) performance (at least at the time of writing). A sketch of the total training objective with these weights is given below.
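For concreteness, here is a sketch of how the four loss terms would be combined with the paper's weights; the individual terms correspond to the sketches from the previous sections, and the argument names are assumptions:

```python
def total_loss(bce_boundary, ce_segmentation, reg_zeta, reg_semantic,
               lambdas=(20.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the two task losses and the two regularizers."""
    l1, l2, l3, l4 = lambdas
    return (l1 * bce_boundary + l2 * ce_segmentation
            + l3 * reg_zeta + l4 * reg_semantic)
```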

Conclusions

  • This paper proposes a 2-stream CNN architecture: one stream is a regular CNN, while the other explicitly processes shape information.
  • They propose Gated Convolutional Layers (GCLs) to connect intermediate layers of the two streams.
  • They use a new loss function that exploits the duality between the semantic segmentation and semantic boundary prediction tasks.