
Notes on "Discover, Hallucinate, and Adapt: Open Compound Domain Adaptation for Semantic Segmentation"

tags: notes domain-adaptation segmentation unsupervised open-compound

NeurIPS '20 paper; Code not released as of 30/11/20.

Author: Akshay Kulkarni

Brief Outline

This work investigates open compound domain adaptation (OCDA) for semantic segmentation which deals with mixed and novel situations at the same time. They first cluster the compound target data based on style (discover), then hallucinate multiple latent target domains in source using image translation, and perform target-to-source alignment separately between domains (adapt).

Introduction

  • Most existing UDA techniques focus on a single-source single-target setting instead of a more practical scenario where target consists of multiple data distributions without clear distinctions.
  • Towards this, they study open compound domain adaptation (OCDA) (CVPR '20) where target data is a union of multiple homogeneous domains without domain labels. Unseen target data is also considered at test-time (open domains).
  • Naive use of UDA techniques for OCDA has a fundamental limitation: it induces a biased alignment where only the target data close to the source aligns well.
  • They propose a framework that incorporates three key functionalities: discover, hallucinate, and adapt. The key idea is to decompose a hard OCDA problem into multiple easy UDA problems.
    • First, the scheme discovers $K$ latent domains in the compound target data (discover). They use style information as the domain-specific representation and cluster the compound target using latent target styles.
    • Second, the scheme generates $K$ target-like source domains using an exemplar-guided image translation network (CVPR '19), hallucinating multiple latent target domains in source (hallucinate).
    • Third, the scheme matches the latent domains of source and target. Using $K$ different discriminators, the domain-invariance is captured separately between domains (adapt).

Methodology

  • Source data and corresponding labels are denoted by $X_S = \{x_S^i\}_{i=1}^{N_S}$ and $Y_S = \{y_S^i\}_{i=1}^{N_S}$ respectively. Compound target data is denoted by $X_T = \{x_T^i\}_{i=1}^{N_T}$, which is a mixture of multiple homogeneous data distributions. All domains share the same space of classes (closed label set).


Discover (Multiple Latent Target Domains Discovery)

  • The key motivation is to make implicit multiple target domains explicit. For this, they assume that latent domain of images is reflected in their style, and use style information to cluster the compound target domain.
  • They introduce a hyperparameter $K$ and divide the compound target domain $T$ into $K$ latent domains by style, $\{T_j\}_{j=1}^{K}$. Here, style information is the convolutional feature statistics (channel-wise mean and standard deviation).
  • After this $k$-means clustering based discovery step, the target data in the $j$-th latent domain ($j \in \{1, \dots, K\}$) can be expressed as $X_{T,j} = \{x_{T,j}^i\}_{i=1}^{N_{T,j}}$.
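The discovery step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the style vector here is the channel-wise mean and standard deviation of a hypothetical convolutional feature map, and a plain k-means with farthest-point initialization stands in for whatever clustering setup the paper actually uses:

```python
import numpy as np

def style_vector(feat):
    """Style representation: channel-wise mean and std of a conv feature map.
    feat has shape (C, H, W); output has shape (2*C,)."""
    return np.concatenate([feat.mean(axis=(1, 2)), feat.std(axis=(1, 2))])

def kmeans(X, K, iters=20):
    """Plain k-means with farthest-point initialization.
    X has shape (N, D); returns integer cluster labels of shape (N,)."""
    centers = [X[0]]
    for _ in range(1, K):
        # Next center: the point farthest from all centers chosen so far.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        centers = np.stack([X[labels == k].mean(axis=0)
                            if (labels == k).any() else centers[k]
                            for k in range(K)])
    return labels

# Toy compound target: feature maps drawn from two distinct "styles".
rng = np.random.default_rng(0)
feats = [rng.normal(0.0, 1.0, (8, 16, 16)) for _ in range(10)] + \
        [rng.normal(5.0, 1.0, (8, 16, 16)) for _ in range(10)]
styles = np.stack([style_vector(f) for f in feats])
labels = kmeans(styles, K=2)  # two discovered latent domains
```

With $K$ matching the true number of latent styles, images sharing feature statistics land in the same cluster, mirroring the partition into $X_{T,j}$ above.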

Hallucinate (Latent Target Domains Hallucination in Source)

  • They hallucinate $K$ latent target domains in the source domain by formulating it as image translation. For example, $G(x_S^i, x_{T,j}^z) \rightarrow x_{S,j}^i$ is the hallucination of the $j$-th latent target domain, where $G$ is an exemplar-guided image translation network (and $z$ is a random index).
  • An effective image translation network must satisfy the following:
    1. high-resolution image translation
    2. source-content preservation
    3. target-style reflection
  • They use TGCF-DA (ICCV '19) as a baseline. The framework is cycle-free (no cyclic consistency loss) and uses a strong semantic constraint loss. The generator is trained with 2 losses: $\mathcal{L}_{GAN}$ and $\mathcal{L}_{sem}$ (described in the ICCV paper).
  • However, the limitation is that it fails to reflect diverse target styles (from multiple latent domains) in the output image. Rather, it falls into mode collapse. This is attributed to the lack of style consistency constraints in the framework.
  • To address this issue, they introduce a style consistency loss using a discriminator $D_{Sty}$ that operates on a pair of target images, either both from the same latent domain or not:

$$\begin{aligned} \mathcal{L}_{Style}^j(G, D_{Sty}) = \; & \mathbb{E}_{x_{T,j} \sim X_{T,j},\, x'_{T,j} \sim X_{T,j}}\big[\log D_{Sty}(x_{T,j}, x'_{T,j})\big] \\ & + \sum_{l \neq j} \mathbb{E}_{x_{T,j} \sim X_{T,j},\, x_{T,l} \sim X_{T,l}}\big[\log\big(1 - D_{Sty}(x_{T,j}, x_{T,l})\big)\big] \\ & + \mathbb{E}_{x_S \sim X_S,\, x_{T,j} \sim X_{T,j}}\big[\log\big(1 - D_{Sty}(x_{T,j}, G(x_S, x_{T,j}))\big)\big] \end{aligned} \tag{1}$$

  • Here, $x_{T,j}$ and $x'_{T,j}$ are a pair of target images sampled from the same latent domain $j$ (i.e. same style), while $x_{T,j}$ and $x_{T,l}$ are a pair of images sampled from different latent domains (i.e. different styles).
  • $D_{Sty}$ learns awareness of style consistency between a pair of images. Simultaneously, $G$ learns to fool $D_{Sty}$ by synthesizing images with the same style as the exemplar $x_{T,j}$.
  • Using image translation, the hallucination step reduces the domain gap between the source and the target at a pixel-level.
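Equation (1) can be made concrete with a small numeric sketch. The pairwise scores below are hypothetical stand-ins for $D_{Sty}$ outputs (probabilities that a pair shares the same style); the real pairs of images and the discriminator itself are omitted:

```python
import numpy as np

def style_consistency_loss(d_same, d_diff, d_fake):
    """Discriminator-side value of Eq. (1) for one latent domain j.
    d_same: D_sty score (in (0, 1)) for a pair from the same latent domain.
    d_diff: list of scores for pairs from different latent domains (l != j).
    d_fake: score for the pair (exemplar, G(source, exemplar)).
    All scores are hypothetical discriminator outputs."""
    return (np.log(d_same)
            + sum(np.log(1.0 - d) for d in d_diff)
            + np.log(1.0 - d_fake))

# A discriminator that separates styles well drives the loss toward 0 ...
good = style_consistency_loss(0.99, [0.01], 0.01)
# ... while an uninformative one is heavily penalized.
poor = style_consistency_loss(0.5, [0.5], 0.5)
```

Maximizing this over $D_{Sty}$ while $G$ tries to push the last term up is what forces the generator to actually reproduce the exemplar's style.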

Adapt (Domain-wise Adversaries)

  • Given $K$ latent target domains $\{T_j\}_{j=1}^{K}$ and $K$ translated source domains $\{S_j\}_{j=1}^{K}$, the model attempts to learn domain-invariant features. Assuming the translated source and latent targets are both uni-modal, one could apply existing SOTA UDA techniques directly.
  • However, as the latent multi-mode structure is not fully exploited, this may be sub-optimal and give inferior performance (experimentally observed). Thus, they use $K$ different discriminators $\{D_{O,j}\}_{j=1}^{K}$ to achieve latent domain-wise adversaries.
  • The $j$-th discriminator $D_{O,j}$ focuses only on discriminating the output probability of the segmentation model from the $j$-th latent domain (i.e. either from $T_j$ or $S_j$). The loss for the $j$-th target domain is defined as

$$\mathcal{L}_{Out}^j(F, D_{O,j}) = \mathbb{E}_{x_{S,j} \sim X_{S,j}}\big[\log D_{O,j}(F(x_{S,j}))\big] + \mathbb{E}_{x_{T,j} \sim X_{T,j}}\big[\log\big(1 - D_{O,j}(F(x_{T,j}))\big)\big] \tag{2}$$

  • Here, $F$ is the segmentation network. The segmentation task loss is the standard cross-entropy loss. The source data translated to the $j$-th latent domain can be trained with the original annotations as:

$$\mathcal{L}_{task}^j(F) = -\mathbb{E}_{(x_{S,j},\, y_S) \sim (X_{S,j},\, Y_S)} \sum_{h,w} \sum_{c} y_S^{(h,w,c)} \log\big(F(x_{S,j})^{(h,w,c)}\big) \tag{3}$$
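A minimal numeric sketch of Eqs. (2) and (3), with hypothetical discriminator scores and a toy softmax output standing in for the real networks $D_{O,j}$ and $F$:

```python
import numpy as np

def output_adv_loss(d_src, d_tgt):
    """Eq. (2): D_O,j scores on segmentation outputs.
    d_src: score on a translated-source prediction; d_tgt: on a target one.
    Both are hypothetical outputs of the j-th output-level discriminator."""
    return np.log(d_src) + np.log(1.0 - d_tgt)

def task_loss(prob, onehot):
    """Eq. (3): pixel-wise cross-entropy for one translated-source image.
    prob: (H, W, C) softmax output of F; onehot: (H, W, C) one-hot labels."""
    return -(onehot * np.log(prob + 1e-12)).sum()

# Toy 2x2 prediction over 3 classes, fairly confident and correct.
onehot = np.zeros((2, 2, 3))
onehot[..., 0] = 1.0             # every pixel belongs to class 0
prob = np.full((2, 2, 3), 0.05)
prob[..., 0] = 0.90              # each pixel's probabilities sum to 1.0
ce = task_loss(prob, onehot)     # small, since predictions are right
adv = output_adv_loss(0.9, 0.1)  # a discriminator that separates well
```

Because the translated images keep the source content, the original labels $y_S$ supervise every hallucinated copy of the source image.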

Overall Objective

  • The proposed framework utilizes adaptation techniques, including pixel-level alignment, semantic consistency, style consistency and output-level alignment. The overall objective function is:

$$\mathcal{L}_{total} = \sum_j \big[\lambda_{GAN} \mathcal{L}_{GAN}^j + \lambda_{sem} \mathcal{L}_{sem}^j + \lambda_{Style} \mathcal{L}_{Style}^j + \lambda_{Out} \mathcal{L}_{Out}^j + \lambda_{task} \mathcal{L}_{task}^j\big] \tag{4}$$

  • Finally, the training process corresponds to solving the optimization problem $F^* = \arg\min_F \min_D \max_G \mathcal{L}_{total}$, where $G$ and $D$ represent the generator (in $\mathcal{L}_{sem}, \mathcal{L}_{GAN}, \mathcal{L}_{Style}$) and all the discriminators (in $\mathcal{L}_{GAN}, \mathcal{L}_{Style}, \mathcal{L}_{Out}$) respectively.
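Equation (4) is just a weighted sum of the five per-domain losses over the $K$ latent domains; a minimal sketch (the loss values and weights below are placeholders, not the paper's settings):

```python
def total_loss(domain_losses, lambdas):
    """Eq. (4): weighted sum of the five per-domain losses over all K domains.
    domain_losses: one dict per latent domain with keys
    'GAN', 'sem', 'Style', 'Out', 'task'; lambdas: the weight for each key."""
    return sum(lambdas[k] * d[k] for d in domain_losses for k in lambdas)

lambdas = {'GAN': 1.0, 'sem': 1.0, 'Style': 1.0, 'Out': 1.0, 'task': 1.0}
losses = [{'GAN': 1.0, 'sem': 1.0, 'Style': 1.0, 'Out': 1.0, 'task': 1.0}
          for _ in range(2)]  # K = 2 latent domains
total = total_loss(losses, lambdas)
```

In training, the generator, discriminators, and segmentation network would be updated alternately against this objective, per the min-max problem above.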

Conclusion

  • This work presented a novel OCDA framework for semantic segmentation using 3 core design principles: Discover, Hallucinate, and Adapt.
    • Based on the latent target styles, the compound data is clustered and each group is considered as one specific latent target domain.
    • These target domains are hallucinated in the source domain via image translation. This reduces the domain gap and changes the classifier boundary to cover the latent domains.
    • Finally, domain-wise target-to-source alignment is performed using multiple discriminators, with each discriminator focusing on one latent domain.
  • Key idea presented was to decompose OCDA into multiple UDA problems.