
Notes on "Consistency Regularization with High-dimensional Non-adversarial Source-guided Perturbation for UDA in Segmentation"

tags: notes domain-adaptation unsupervised segmentation aaai21 adain augmentation

Notes author: Akshay Kulkarni

AAAI '21 paper; Code Release

Brief Outline

They propose a bidirectional style-induced DA method (BiSIDA) that employs consistency regularization to efficiently exploit information from the unlabeled target dataset using a simple neural style transfer model.

Introduction

  • To perform domain alignment at the pixel level or feature level, existing works (Tsai et al. CVPR '18, Hoffman et al. ICML '18) typically use adversarial training, and training on the aligned data is supervised with the labeled source data. However, this introduces extra complexity and instability into training.
  • Alternative approaches (Zou et al. ECCV '18, Vu et al. CVPR '19) seek to exploit information from the unlabeled target data by performing semi-supervised learning, including entropy minimization, pseudo-labeling, and consistency regularization. However, these play only an auxiliary role alongside supervised learning or fail to take full advantage of the target data.
  • This work proposes a two-stage pipeline:
    • In the supervised learning phase, a style-induced image generator translates source images with different styles to align the source domain towards the direction of the target domain.
    • In the unsupervised phase, they perform high-dimensional perturbations on target domain images with consistency regularization.

Methodology

Background

Adaptive Instance Normalization (AdaIN)

  • Given a content image $c$ and a style image $s$ from another domain, an image that mimics the style of $s$ while preserving the content of $c$ is synthesized.
  • Formally, the feature maps of the content image $c$ and the style image $s$ obtained through an encoder $f$ can be represented as $t_c = f(c)$ and $t_s = f(s)$. By normalizing the per-channel mean and standard deviation of $t_c$ to match those of $t_s$, the target feature map $\hat{t}$ is produced as follows:

$$\hat{t} = \mathrm{AdaIN}(t_c, t_s) = \sigma(t_s)\,\frac{t_c - \mu(t_c)}{\sigma(t_c)} + \mu(t_s) \tag{1}$$

  • Here, $\mu(t)$ and $\sigma(t)$ are the per-channel mean and standard deviation of the feature map $t$.
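
A minimal PyTorch-style sketch of the AdaIN operation in Eq. (1) may help; the function name, tensor shapes, and `eps` value are assumptions for illustration, not taken from the paper's code release.

```python
import torch

def adain(t_c: torch.Tensor, t_s: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization (Eq. 1), a sketch.

    t_c, t_s: feature maps of shape (N, C, H, W) from the encoder f.
    Returns a feature map whose channel-wise statistics are replaced
    by the mean/std of t_s while keeping the content of t_c.
    """
    # Channel-wise statistics over the spatial dimensions
    mu_c = t_c.mean(dim=(2, 3), keepdim=True)
    sigma_c = t_c.std(dim=(2, 3), keepdim=True) + eps
    mu_s = t_s.mean(dim=(2, 3), keepdim=True)
    sigma_s = t_s.std(dim=(2, 3), keepdim=True)

    # Normalize content features, then re-scale/shift with style statistics
    return sigma_s * (t_c - mu_c) / sigma_c + mu_s
```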

Self-ensembling

  • To stabilize the generation of pseudo-labels, they employ self-ensembling (Tarvainen and Valpola, NeurIPS '17), which consists of a segmentation network acting as the student network $F_s$ and a teacher network $F_t$ with the same architecture.
  • The teacher is essentially a temporal ensemble of the student network, so radical changes in the teacher's weights are smoothed out and more reliable predictions can be made.
  • The weight $\theta_i^t$ of the teacher $F_t$ at the $i$-th iteration is updated as the exponential moving average of the student weight $\theta_i^s$, i.e. $\theta_i^t = \eta\,\theta_{i-1}^t + (1-\eta)\,\theta_i^s$, where $\eta$ is the decay factor (see the EMA sketch below).
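
A short sketch of this EMA update in PyTorch, assuming the usual mean-teacher setup where the teacher starts as a copy of the student and receives no gradients; the function name and the value of `eta` are illustrative.

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, eta: float = 0.999) -> None:
    """EMA update: theta_t_i = eta * theta_t_{i-1} + (1 - eta) * theta_s_i."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(eta).add_(p_s, alpha=1.0 - eta)

# Assumed setup: the teacher is a frozen copy of the student
# (copy.deepcopy(student), requires_grad_(False)), updated only
# through this EMA rule after each optimizer step.
```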


Approach

Continuous style-induced image generator

  • They synthesize an image with the combined style of a source and a target image, controlled by a content-style trade-off parameter $\alpha \in [0, 1]$, through the generator $G$:

$$G(c, s, \alpha) = g(\alpha\,\hat{t} + (1-\alpha)\,t_c) \tag{2}$$

  • Here, $g$ is the decoder that maps features back to image space. $\alpha = 0$ reconstructs the content image, while $\alpha = 1$ reconstructs an image with the content of the content image and the style of the style image; intermediate values interpolate continuously between the two (see the sketch below).
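
A sketch of this generator, reusing the `adain` function from the earlier snippet; the encoder/decoder pair is assumed to be a pretrained VGG-style encoder $f$ and a decoder $g$ trained to invert it, as in standard AdaIN style transfer, and is not specified here.

```python
import torch
import torch.nn as nn

class StyleInducedGenerator(nn.Module):
    """Continuous style-induced image generator (Eq. 2), a sketch."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # f: image -> features (assumed pretrained)
        self.decoder = decoder   # g: features -> image (assumed pretrained)

    def forward(self, c: torch.Tensor, s: torch.Tensor, alpha: float) -> torch.Tensor:
        t_c = self.encoder(c)              # content features t_c
        t_s = self.encoder(s)              # style features t_s
        t_hat = adain(t_c, t_s)            # stylized features (Eq. 1, sketch above)
        # Interpolate between content and stylized features with alpha
        return self.decoder(alpha * t_hat + (1.0 - alpha) * t_c)
```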

Target-guided supervised learning

  • Given a source dataset $\{(x_i^S, y_i^S)\}_{i=1}^{N_S}$ and a target dataset $\{x_i^T\}$, they first perform a random color space perturbation $A$ on a source image $x_s$ to get $A(x_s)$, which enhances the randomness.
  • This augmented image is passed through the style-induced generator $G$ as a stronger augmentation, performing style transfer with a target image $x_t$ as the style image and the trade-off parameter $\alpha$ sampled from a uniform distribution $U(0, 1)$, giving $\hat{x}_s = G(A(x_s), x_t, \alpha)$.
  • The translation is applied only with probability $p_{s \to t}$, so that the model is still sometimes trained on the original details (the translation incurs a loss of resolution); otherwise $\hat{x}_s = A(x_s)$ is used.
  • Finally, the supervised loss $L_s$ (cross-entropy) between the probability map $p_s = F_s(\hat{x}_s)$ and the label $y_s$ is given by:

$$L_s = -\frac{1}{HW}\sum_{m=1}^{H \times W}\sum_{c=1}^{C} y_s^{mc}\,\log\left(p_s^{mc}\right) \tag{3}$$

  • Using a strong and directed augmentation method facilitates generalization across different styles and further enables adaptation towards the direction of the target domain (a sketch of this supervised step follows).
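
A minimal sketch of one target-guided supervised step under the assumptions above; the function names, the probability value `p_s2t`, and the use of `F.cross_entropy` (which applies log-softmax internally, matching Eq. (3)) are illustrative rather than the authors' implementation.

```python
import random
import torch
import torch.nn.functional as F

def supervised_step(x_s, y_s, x_t, student, generator, augment, p_s2t=0.5):
    """One target-guided supervised step (sketch).

    x_s:     source image batch (N, 3, H, W)
    y_s:     integer label maps (N, H, W)
    x_t:     target image batch, used only as style images
    augment: random color-space perturbation A
    """
    x_aug = augment(x_s)                           # A(x_s)
    if random.random() < p_s2t:
        alpha = random.uniform(0.0, 1.0)           # alpha ~ U(0, 1)
        with torch.no_grad():
            x_hat = generator(x_aug, x_t, alpha)   # x_hat_s = G(A(x_s), x_t, alpha)
    else:
        x_hat = x_aug                              # keep full-resolution details
    logits = student(x_hat)                        # p_s = F_s(x_hat_s)
    return F.cross_entropy(logits, y_s)            # Eq. (3)
```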

Source-guided unsupervised learning

  • Since their model is better adapted to the source domain, where supervised learning is performed, the quality of the produced pseudo-labels is generally higher when the input is closer to the source domain. Consequently, pseudo-labels are computed from target images translated towards the appearance of the source domain.
  • First, they perform a random color space perturbation $A$ on a target image $x_t$ to get $A(x_t)$. Then, each $A(x_t)$ is further augmented through $G$ using $k$ randomly sampled source images $\{x_s^i\}_{i=1}^{k}$ as style images, with probability $p_{t \to s}$ (again accounting for the loss of resolution).
  • After the augmentation process, the transformed images $\{\hat{x}_t^i\}_{i=1}^{k}$ are passed through the teacher model $F_t$ individually to obtain more stable predictions $\hat{y}^i = F_t(\hat{x}_t^i)$. These are averaged to get the probability map $p_l$ for the pseudo-label: $p_l = \frac{1}{k}\sum_{i=1}^{k}\hat{y}^i$.
  • They sharpen (Berthelot et al. NeurIPS '19) the predictions before aggregation to make them more peaky. Finally, the pseudo-label $q = \arg\max(p_l)$ is obtained, which is used to compute the unsupervised loss $L_u$ in a supervised manner.
  • The class imbalance in the dataset causes the model to be biased towards frequent or easy classes, especially when relying on semi-supervised signals such as pseudo-labels. To address this, they add a simple class weighting to the loss based on the proportion of pixels in each class in the source data. Note: over-complicating a simple training trick by providing explicit equations! (A sketch of the pseudo-label generation follows.)
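
A sketch of the source-guided pseudo-label generation under the assumptions above; `k`, `p_t2s`, and the `temperature` used for sharpening are illustrative values, and the class weighting is omitted.

```python
import random
import torch

@torch.no_grad()
def make_pseudo_labels(x_t, source_batch, teacher, generator, augment,
                       k=3, p_t2s=0.5, temperature=0.5):
    """Source-guided pseudo-label generation (sketch).

    Averages the teacher's (sharpened) predictions over k source-stylized
    copies of the target image and takes the argmax as the pseudo-label q.
    """
    x_aug = augment(x_t)                                       # A(x_t)
    probs = []
    for i in range(k):
        x_s = source_batch[i % len(source_batch)].unsqueeze(0) # style image x_s^i
        if random.random() < p_t2s:
            alpha = random.uniform(0.0, 1.0)
            x_hat = generator(x_aug, x_s, alpha)               # translate towards source style
        else:
            x_hat = x_aug
        p = torch.softmax(teacher(x_hat), dim=1)               # teacher prediction y_hat_i
        p = p ** (1.0 / temperature)                           # sharpening (more peaky)
        p = p / p.sum(dim=1, keepdim=True)
        probs.append(p)
    p_l = torch.stack(probs).mean(dim=0)                       # p_l = (1/k) sum_i y_hat_i
    return p_l.argmax(dim=1)                                   # q = argmax(p_l)
```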

Optimization

  • The final loss $L$, given the unsupervised loss weight $\lambda_u$, is combined in a multi-task learning manner:

$$L = L_s + \lambda_u L_u \tag{4}$$

  • During training, the student network $F_s$ is updated in the direction of the gradient computed via back-propagation of the loss $L$, while the weights of the teacher network $F_t$ are updated as the exponential moving average of the student's weights (see the sketch below).
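
Putting the pieces together, a sketch of one full training iteration that reuses the helper functions from the earlier snippets; all names and hyperparameter values here are assumptions for illustration, not the authors' implementation.

```python
import torch.nn.functional as F

def training_step(x_s, y_s, x_t, student, teacher, generator, augment,
                  optimizer, lambda_u=1.0, eta=0.999):
    """One BiSIDA iteration (sketch): Eq. (3) + Eq. (4), then the EMA update."""
    loss_s = supervised_step(x_s, y_s, x_t, student, generator, augment)

    # Pseudo-labels from the teacher on source-stylized target images;
    # the per-class weighting from the paper is omitted here for brevity.
    q = make_pseudo_labels(x_t, x_s, teacher, generator, augment)
    loss_u = F.cross_entropy(student(augment(x_t)), q)

    loss = loss_s + lambda_u * loss_u          # Eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    update_teacher(student, teacher, eta)      # EMA update of F_t
    return loss.item()
```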

Conclusion

  • They propose a Bidirectional Style-induced DA (BiSIDA) framework that optimizes a segmentation model via target-guided supervised learning and source-guided unsupervised learning.
  • Using a continuous style-induced generator, they show the effectiveness of learning from the unlabeled target data by providing high-dimensional perturbations for consistency regularization.
  • They also show that alignment between source and target in both directions is achievable without adversarial training.