This work investigates open compound domain adaptation (OCDA) for semantic segmentation, which handles a mixture of multiple target distributions and previously unseen (open) domains at the same time. The authors first cluster the compound target data by style (discover), then hallucinate the multiple latent target domains in the source via exemplar-guided image translation (hallucinate), and finally perform target-to-source alignment separately for each latent domain (adapt).
Introduction
Most existing UDA techniques assume a single-source, single-target setting rather than the more practical scenario where the target consists of multiple data distributions without clear distinctions between them.
Towards this, the authors study open compound domain adaptation (OCDA) (CVPR '20), where the target data is a union of multiple homogeneous domains without domain labels. Unseen target data is also considered at test time (open domains).
Naive use of UDA techniques for OCDA has a fundamental limitation: it induces a biased alignment where only the target data close to the source align well.
They propose a framework that incorporates three key functionalities: discover, hallucinate, and adapt. The key idea is to decompose a hard OCDA problem into multiple easy UDA problems.
First, the scheme discovers latent domains in the compound target data (discover). They use style information as a domain-specific representation and cluster the compound target data by latent target styles.
Second, the scheme generates target-like source domains using an exemplar-guided image translation network (CVPR '19), hallucinating multiple latent target domains in source (hallucinate).
Third, the scheme matches the latent domains of the source and target. Using a separate discriminator per latent domain, domain invariance is captured domain-wise (adapt).
Methodology
Source data and the corresponding labels are denoted by $X_S = \{x_s\}$ and $Y_S = \{y_s\}$, respectively. The compound target data, denoted by $X_T = \{x_t\}$, is a mixture of multiple homogeneous data distributions. All domains share the same space of classes (closed label set).
The key motivation is to make the implicit multiple target domains explicit. To this end, the authors assume that the latent domain of an image is reflected in its style, and use style information to cluster the compound target domain.
They introduce a hyperparameter $K$ and divide the compound target domain into $K$ latent domains by style, $X_T = \{X_{T_1}, X_{T_2}, \dots, X_{T_K}\}$. Here, the style information is convolutional feature statistics (channel-wise mean and standard deviation).
After this $K$-means clustering based discovery step, the target data in the $j$-th latent domain ($1 \le j \le K$) can be expressed as $X_{T_j} = \{x_{t_j}\}$.
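A minimal PyTorch sketch of this discovery step. The choice of shallow VGG-16 features as the style layer and scikit-learn for $K$-means are assumptions for illustration, not necessarily the paper's exact choices; `target_loader` is a hypothetical data loader yielding target image batches.

```python
# Sketch of the "discover" step: cluster the compound target images by style.
import torch
import torchvision.models as models
from sklearn.cluster import KMeans

K = 3  # number of latent target domains (hyperparameter)

vgg = models.vgg16(pretrained=True).features[:4].eval()  # shallow layers (up to relu1_2)

@torch.no_grad()
def style_vector(img):
    """Channel-wise mean and std of shallow conv features ~ image 'style'."""
    feat = vgg(img)                             # (B, C, H, W)
    mu = feat.mean(dim=(2, 3))                  # (B, C)
    sigma = feat.std(dim=(2, 3))                # (B, C)
    return torch.cat([mu, sigma], dim=1)        # (B, 2C)

styles = torch.cat([style_vector(x) for x in target_loader])   # style codes for all target images
domain_id = KMeans(n_clusters=K).fit_predict(styles.numpy())   # latent domain index per image
```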
Hallucinate (Latent Target Domains Hallucination in Source)
They hallucinate the latent target domains in the source domain by formulating the problem as image translation. For example, $G(x_s, x^{r}_{t_j})$ is the hallucination of the $j$-th latent target domain in a source image $x_s$, where $G$ is the exemplar-guided image translation network and $r$ is a random index, i.e. $x^{r}_{t_j}$ is a randomly sampled exemplar from $X_{T_j}$.
The challenge is to design an effective image translation network that satisfies the following:
high-resolution image translation
source-content preservation
target-style reflection
They use TGCF-DA (ICCV '19) as a baseline. The framework is cycle-free (no cycle-consistency loss) and instead uses a strong semantic constraint loss. Its generator is trained with two losses, $\mathcal{L}_{GAN}$ and $\mathcal{L}_{sem}$ (described in the ICCV paper).
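As a rough sketch of how these two constraints are commonly implemented (the exact formulation is in the ICCV '19 paper, so treat this as an approximation): the adversarial loss pushes the translated image toward the target distribution, while a frozen source-trained segmenter keeps the source semantics intact. `D_img` (an image-level discriminator) and `f_frozen` (the frozen segmenter) are assumed names.

```python
# Sketch of the two generator constraints in a TGCF-DA-style translation setup.
import torch
import torch.nn.functional as F

def generator_losses(G, D_img, f_frozen, x_s, y_s, x_tj):
    fake = G(x_s, x_tj)                  # source content rendered in target-j style

    # L_GAN: the translated image should be indistinguishable from target images.
    logits = D_img(fake)
    loss_gan = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    # L_sem: a frozen source-trained segmenter must still predict the source
    # labels on the translated image, i.e. the content/semantics are preserved.
    loss_sem = F.cross_entropy(f_frozen(fake), y_s)

    return loss_gan, loss_sem
```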
However, the limitation of this baseline is that it fails to reflect the diverse target styles (from the multiple latent domains) in the output images; instead, it falls into mode collapse. The authors attribute this to the lack of a style consistency constraint in the framework.
To address this issue, they introduce a style consistency loss using a discriminator $D_{style}$ that operates on a pair of target images - either both from the same latent domain or not:

$$\mathcal{L}_{style}(G, D_{style}) = \mathbb{E}\big[\log D_{style}(x_{t_j}, x'_{t_j})\big] + \mathbb{E}\big[\log\big(1 - D_{style}(x_{t_j}, x_{t_i})\big)\big] + \mathbb{E}\big[\log\big(1 - D_{style}\big(x_{t_j}, G(x_s, x^{r}_{t_j})\big)\big)\big]$$

Here, $x_{t_j}$ and $x'_{t_j}$ are a pair of target images sampled from the same latent domain (i.e. same style), while $x_{t_j}$ and $x_{t_i}$ ($i \neq j$) are a pair sampled from different latent domains (i.e. different styles).
$D_{style}$ learns to be aware of style consistency between a pair of images. Simultaneously, $G$ learns to fool $D_{style}$ by synthesizing images with the same style as the exemplar $x^{r}_{t_j}$.
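A minimal PyTorch sketch of this pairwise style objective, assuming the image pair is fed to $D_{style}$ as a channel-wise concatenation (an implementation assumption, not specified here):

```python
# Sketch of the style consistency loss with a pair-wise discriminator D_style.
import torch
import torch.nn.functional as F

bce = F.binary_cross_entropy_with_logits

def d_style_loss(D_style, x_tj, x_tj_prime, x_ti, fake_tj):
    """Discriminator: same-domain pairs are real; cross-domain pairs and
    (target, translated) pairs are fake."""
    real = D_style(torch.cat([x_tj, x_tj_prime], dim=1))
    fake_cross = D_style(torch.cat([x_tj, x_ti], dim=1))
    fake_trans = D_style(torch.cat([x_tj, fake_tj.detach()], dim=1))
    return (bce(real, torch.ones_like(real))
            + bce(fake_cross, torch.zeros_like(fake_cross))
            + bce(fake_trans, torch.zeros_like(fake_trans)))

def g_style_loss(D_style, x_tj, fake_tj):
    """Generator: make the translated image's style indistinguishable from
    real images of the same latent domain."""
    pred = D_style(torch.cat([x_tj, fake_tj], dim=1))
    return bce(pred, torch.ones_like(pred))
```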
Through this image translation, the hallucination step reduces the domain gap between the source and the target at the pixel level.
Adapt (Domain-wise Adversaries)
Given the $K$ latent target domains $\{X_{T_j}\}_{j=1}^{K}$ and the $K$ translated source domains $\{G(X_S, X_{T_j})\}_{j=1}^{K}$, the model attempts to learn domain-invariant features. If the translated source and the latent targets were both uni-modal, one could apply existing state-of-the-art UDA techniques directly.
However, doing so does not fully exploit the latent multi-mode structure and is sub-optimal, giving inferior performance (observed experimentally). Thus, they use $K$ different discriminators to achieve latent domain-wise adversaries.
The $j$-th discriminator $D_j$ focuses only on discriminating the output probabilities of the segmentation model that come from the $j$-th latent domain (i.e. either from $X_{T_j}$ or from the translated source $G(X_S, X_{T_j})$). The output-level adversarial loss for the $j$-th target domain is defined as

$$\mathcal{L}^{j}_{adv}(F, D_j) = \mathbb{E}\big[\log D_j\big(F(G(x_s, x^{r}_{t_j}))\big)\big] + \mathbb{E}\big[\log\big(1 - D_j(F(x_{t_j}))\big)\big]$$

Here, $F$ is the segmentation network. The segmentation task loss is the standard cross-entropy loss; since the source data translated to the $j$-th latent domain keeps its original annotation $y_s$, it can be trained as

$$\mathcal{L}^{j}_{seg}(F) = -\,\mathbb{E}\Big[\sum_{h,w}\sum_{c} y_s^{(h,w,c)} \log F\big(G(x_s, x^{r}_{t_j})\big)^{(h,w,c)}\Big]$$
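A condensed sketch of one adapt update for latent domain $j$, assuming AdaptSegNet-style output-level alignment; `seg` (the segmentation network, $F$ in the text), `D[j]`, the optimizers, and the adversarial weight are illustrative stand-ins rather than the paper's exact settings.

```python
# One "adapt" update for latent domain j.
import torch
import torch.nn.functional as F

bce = F.binary_cross_entropy_with_logits

def adapt_step(seg, D, j, fake_j, y_s, x_tj, opt_seg, opt_Dj, lambda_adv=0.001):
    fake_j = fake_j.detach()                              # treat the translated image as fixed input here

    # --- update the segmentation network ---
    opt_seg.zero_grad()
    loss_seg = F.cross_entropy(seg(fake_j), y_s)          # supervised CE on translated source
    out_t = D[j](torch.softmax(seg(x_tj), dim=1))         # fool D_j on target predictions
    loss_adv = bce(out_t, torch.ones_like(out_t))
    (loss_seg + lambda_adv * loss_adv).backward()         # lambda_adv is an assumed weight
    opt_seg.step()

    # --- update the j-th discriminator ---
    opt_Dj.zero_grad()
    out_s = D[j](torch.softmax(seg(fake_j).detach(), dim=1))
    out_t = D[j](torch.softmax(seg(x_tj).detach(), dim=1))
    loss_d = bce(out_s, torch.ones_like(out_s)) + bce(out_t, torch.zeros_like(out_t))
    loss_d.backward()
    opt_Dj.step()
```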
Overall Objective
The proposed framework combines several adaptation techniques: pixel-level alignment, semantic consistency, style consistency, and output-level alignment. With the weighting coefficients omitted for brevity, the overall objective function is

$$\mathcal{L}_{total} = \mathcal{L}_{GAN} + \mathcal{L}_{sem} + \mathcal{L}_{style} + \sum_{j=1}^{K}\big(\mathcal{L}^{j}_{seg} + \mathcal{L}^{j}_{adv}\big)$$

Finally, the training process corresponds to solving the min-max optimization problem $\min_{G, F}\max_{D}\mathcal{L}_{total}$, where $G$ is the generator (appearing in the translation losses) and $D$ collectively denotes all the discriminators (appearing in the adversarial losses).
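Putting the pieces together, a deliberately schematic training loop might look as follows; all loaders and helper functions are hypothetical placeholders (`update_translation` bundles the $\mathcal{L}_{GAN}$ / $\mathcal{L}_{sem}$ / $\mathcal{L}_{style}$ updates sketched earlier, and `adapt_step` is the function from the previous sketch).

```python
# High-level loop: the hard OCDA problem is decomposed into K easier UDA
# problems, one per discovered latent domain.
def train(G, seg, D_img, D_style, D, source_iter, target_iters, exemplar_iters,
          opt_seg, opt_D, K, num_iters, update_translation, adapt_step):
    for _ in range(num_iters):
        for j in range(K):                            # latent domain-wise adversaries
            x_s, y_s = next(source_iter)              # labelled source batch
            x_tj = next(target_iters[j])              # unlabelled batch, latent domain j
            exemplar_j = next(exemplar_iters[j])      # style exemplar, latent domain j

            fake_j = G(x_s, exemplar_j)               # hallucinate domain j in the source

            update_translation(G, D_img, D_style, x_s, y_s, x_tj, exemplar_j)
            adapt_step(seg, D, j, fake_j, y_s, x_tj, opt_seg, opt_D[j])
```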
Conclusion
This work presented a novel OCDA framework for semantic segmentation using 3 core design principles: Discover, Hallucinate, and Adapt.
Based on the latent target styles, the compound data is clustered and each group is considered as one specific latent target domain.
These target domains are hallucinated in the source domain via image translation. This reduces the domain gap and changes the classifier boundary to cover the latent domains.
Finally, domain-wise target-to-source alignment is performed using multiple discriminators, each focusing on one latent domain.
The key idea presented was to decompose the hard OCDA problem into multiple easier UDA problems.