# Notes on "[Discover, Hallucinate, and Adapt: Open Compound Domain Adaptation for Semantic Segmentation](https://papers.nips.cc/paper/2020/hash/7a9a322cbe0d06a98667fdc5160dc6f8-Abstract.html)"

###### tags: `notes` `domain-adaptation` `segmentation` `unsupervised` `open-compound`

NeurIPS '20 paper; code not released as of 30/11/20.

Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)

## Brief Outline

This work investigates open compound domain adaptation (OCDA) for semantic segmentation, which deals with mixed and novel situations at the same time. They first cluster the compound target data based on style *(discover)*, then *hallucinate* multiple latent target domains in the source using image translation, and finally perform target-to-source alignment separately for each domain *(adapt)*.

## Introduction

* Most existing UDA techniques focus on a single-source single-target setting, rather than the more practical scenario where the target consists of multiple data distributions without clear distinctions.
* Towards this, they study open compound domain adaptation (OCDA) ([CVPR '20](https://arxiv.org/pdf/1909.03403.pdf)), where the target data is a union of multiple homogeneous domains without domain labels. Unseen target data is also considered at test time (open domains).
* Naive use of UDA techniques for OCDA has a fundamental limitation: it induces a biased alignment where only the target data close to the source align well.
* They propose a framework that incorporates three key functionalities: discover, hallucinate, and adapt. The key idea is to decompose a hard OCDA problem into multiple easy UDA problems.
* First, the scheme discovers $K$ latent domains in the compound target data (**discover**). They use style information as a domain-specific representation and cluster the compound target using *latent target styles*.
* Second, the scheme generates $K$ target-like source domains using an exemplar-guided image translation network ([CVPR '19](https://openaccess.thecvf.com/content_CVPR_2019/html/Wang_Example-Guided_Style-Consistent_Image_Synthesis_From_Semantic_Labeling_CVPR_2019_paper.html)), hallucinating multiple latent target domains in the source (**hallucinate**).
* Third, the scheme matches the latent domains of source and target. Using $K$ different discriminators, domain invariance is captured separately for each domain (**adapt**).

## Methodology

* Source data and corresponding labels are denoted by $X_S=\{x_S^i\}_{i=1}^{N_S}$ and $Y_S=\{ y_S^i \}_{i=1}^{N_S}$ respectively. Compound target data is denoted by $X_T=\{x_T^i\}_{i=1}^{N_T}$, which is a mixture of multiple homogeneous data distributions. All domains share the same class space (closed label set).

![Overview](https://i.imgur.com/tqbHrRH.png)

### Discover (Multiple Latent Target Domains Discovery)

* The key motivation is to make *implicit* multiple target domains *explicit*. For this, they assume that the latent domain of an image is reflected in its *style*, and use style information to cluster the compound target domain.
* They introduce a hyperparameter $K$ and divide the compound target domain $T$ into $K$ latent domains by style, $\{T_j\}_{j=1}^K$. Here, style information is the convolutional feature statistics (channel-wise mean and standard deviation).
* After this $k$-means clustering based discovery step, the target data in the $j^{\text{th}}$ latent domain ($j \in 1,\dots, K$) can be expressed as $X_{T, j}=\{ x_{T, j}^i \}_{i=1}^{N_{T, j}}$.

### Hallucinate (Latent Target Domains Hallucination in Source)

* They hallucinate the $K$ latent target domains in the source domain by formulating it as image translation. For example, $G(x_S^i, x_{T, j}^z)\to x_{S, j}^i$ is the hallucination of the $j^{\text{th}}$ latent target domain, where $G$ is an exemplar-guided image translation network (and $z$ is a random index).
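As a toy sketch of the discovery step described above: cluster images by their convolutional feature statistics (channel-wise mean and standard deviation). The `style_features` and `kmeans` helpers are hypothetical, random arrays stand in for encoder features, and the deterministic farthest-point seeding is my choice for reproducibility, not necessarily the paper's:

```python
import numpy as np

def style_features(feats):
    """Style vector per image: channel-wise mean and std of conv features.
    Input shape (N, C, H, W) -> output shape (N, 2C)."""
    mu = feats.mean(axis=(2, 3))
    sigma = feats.std(axis=(2, 3))
    return np.concatenate([mu, sigma], axis=1)

def kmeans(x, k, iters=50):
    """Minimal k-means over rows of x, returning cluster assignments."""
    # Greedy farthest-point seeding: deterministic and well-spread centers.
    centers = [x[0]]
    for _ in range(k - 1):
        d = np.min([((x - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        # Assign each style vector to its nearest center.
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # Recompute centers; keep the old center if a cluster empties.
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return assign

# Toy compound target: two latent domains with different feature statistics.
rng = np.random.default_rng(1)
bright = rng.normal(3.0, 0.1, size=(20, 8, 4, 4))   # e.g. a "cloudy" style
dark = rng.normal(-3.0, 0.1, size=(20, 8, 4, 4))    # e.g. a "night" style
feats = np.concatenate([bright, dark])

assign = kmeans(style_features(feats), k=2)
# Images from the same latent domain should land in the same cluster.
```

With $K=2$ here, the two synthetic "styles" separate cleanly; in the paper the features would come from a pretrained encoder over real target images.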
* How to design an effective image translation network satisfying the following:
    1. high-resolution image translation
    2. source-content preservation
    3. target-style reflection
* They use TGCF-DA ([ICCV '19](https://openaccess.thecvf.com/content_ICCV_2019/papers/Choi_Self-Ensembling_With_GAN-Based_Data_Augmentation_for_Domain_Adaptation_in_Semantic_ICCV_2019_paper.pdf)) as a baseline. The framework is cycle-free (no cyclic consistency loss) and uses a strong semantic constraint loss. It involves a generator trained with two losses, $L_{GAN}$ and $L_{sem}$ (described in the ICCV paper).
* However, its limitation is that it fails to reflect the diverse target styles (from multiple latent domains) in the output image. Rather, it falls into mode collapse. This is attributed to the lack of style consistency constraints in the framework.
* To address this issue, they introduce a *style consistency loss* using a discriminator $D_{Sty}$ that takes a pair of target images, which are either from the same latent domain or not:

$$
L_{Style}^j (G, D_{Sty}) = \mathbb{E}_{x_{T, j}^\prime \sim X_{T, j}, x_{T, j}^{\prime\prime} \sim X_{T, j}} [\log D_{Sty}(x_{T, j}^{\prime}, x_{T, j}^{\prime\prime})] \\ +\sum_{l \neq j} \mathbb{E}_{x_{T, j}\sim X_{T, j}, x_{T, l}\sim X_{T, l}} [\log(1-D_{Sty}(x_{T, j}, x_{T, l}))] \\ +\mathbb{E}_{x_S \sim X_S, x_{T, j} \sim X_{T, j}}[\log(1-D_{Sty}(x_{T, j}, G(x_S, x_{T, j})))] \tag{1}
$$

* Here, $x_{T, j}^{\prime}$ and $x_{T, j}^{\prime\prime}$ are a pair of target images sampled from the same latent domain $j$ (i.e., same style), while $x_{T, j}$ and $x_{T, l}$ are a pair of images sampled from different latent domains (i.e., different styles).
* $D_{Sty}$ learns to judge style consistency between a pair of images. Simultaneously, $G$ learns to fool $D_{Sty}$ by synthesizing images with the same style as the exemplar $x_{T, j}$.
* Using image translation, the hallucination step reduces the domain gap between source and target at the pixel level.
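The three terms of Eq. (1) can be sketched numerically. The scalar `style_score` below is a hypothetical stand-in for $D_{Sty}$ (in the paper it is a convolutional discriminator over image pairs), and the flat arrays stand in for images; the point is only how the same-style, different-style, and (exemplar, translated) pairs enter the loss:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def style_score(a, b):
    """Toy stand-in for D_Sty: outputs near 1 when two images share
    feature statistics (here, just the mean intensity), near 0 otherwise."""
    gap = abs(a.mean() - b.mean())
    return sigmoid(4.0 - 8.0 * gap)

def style_consistency_loss(same_pair, diff_pairs, exemplar, translated):
    """Eq. (1): D_Sty should say 1 for a same-style real pair, and 0 both for
    mismatched-style pairs and for the (exemplar, translated-source) pair."""
    loss = np.log(style_score(*same_pair))
    loss += sum(np.log(1 - style_score(a, b)) for a, b in diff_pairs)
    loss += np.log(1 - style_score(exemplar, translated))
    return loss

# Toy images: latent domain j is bright, latent domain l is dark.
x_j1, x_j2 = np.full((4, 4), 0.9), np.full((4, 4), 0.85)  # same latent domain
x_l = np.full((4, 4), 0.1)                                # different latent domain
g_out = np.full((4, 4), 0.88)                             # G(x_S, x_j1) mimicking style j

loss = style_consistency_loss((x_j1, x_j2), [(x_j1, x_l)], x_j1, g_out)
```

From $D_{Sty}$'s perspective this quantity is maximized; $G$ opposes it through the last term, which is large and negative exactly when the translation successfully matches the exemplar's style.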
### Adapt (Domain-wise Adversaries)

* Given $K$ latent target domains $\{ T_j\}_{j=1}^K$ and $K$ translated source domains $\{ S_j\}_{j=1}^K$, the model attempts to learn domain-invariant features. Assuming the translated source and latent targets are both uni-modal, one could apply existing SOTA UDA techniques directly.
* However, since the latent multi-mode structure is then not fully exploited, this is sub-optimal and gives inferior performance (experimentally observed). Thus, they use $K$ different discriminators $\{D_{O, j}\}_{j=1}^K$ to achieve latent domain-wise adversaries.
* The $j^{\text{th}}$ discriminator $D_{O, j}$ focuses only on discriminating the output probability of the segmentation model for the $j^{\text{th}}$ latent domain (i.e., inputs from either $T_j$ or $S_j$). The loss for the $j^{\text{th}}$ target domain is defined as

$$
L_{Out}^j(F, D_{O, j}) = \mathbb{E}_{x_{S, j} \sim X_{S, j}}[\log D_{O, j}(F(x_{S, j}))] + \mathbb{E}_{x_{T, j} \sim X_{T, j}}[\log(1-D_{O, j}(F(x_{T, j})))] \tag{2}
$$

* Here, $F$ is the segmentation network. The segmentation task loss is the standard cross-entropy loss. The source data translated to the $j^{\text{th}}$ latent domain can be trained with the original annotations:

$$
L_{task}^j (F)=-\mathbb{E}_{(x_{S, j}, y_S) \sim (X_{S, j}, Y_S)} \sum_{h, w} \sum_{c} y_S^{(h, w, c)} \log (F(x_{S, j})^{(h, w, c)}) \tag{3}
$$

### Overall Objective

* The proposed framework utilizes several adaptation techniques: pixel-level alignment, semantic consistency, style consistency, and output-level alignment.
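Before these are combined into the overall objective, the adapt-step losses of Eqs. (2) and (3) can be sketched with toy numpy tensors. The scalar lambda standing in for $D_{O,j}$ and the random logits standing in for $F$'s outputs are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def output_adv_loss(d_j, p_src, p_tgt):
    """Eq. (2): the j-th discriminator d_j sees segmentation softmax maps and
    should output 1 for translated-source and 0 for latent-target inputs."""
    return np.log(d_j(p_src)) + np.log(1 - d_j(p_tgt))

def task_loss(p_src, y_onehot):
    """Eq. (3): pixel-wise cross-entropy on translated source data,
    reusing the original source annotations."""
    return -(y_onehot * np.log(p_src)).sum(axis=-1).mean()

# Toy setup: H=W=2 output maps over C=3 classes; random logits stand in for F(x).
rng = np.random.default_rng(0)
p_src = softmax(rng.normal(size=(2, 2, 3)))  # F(x_{S,j}), softmax probabilities
p_tgt = softmax(rng.normal(size=(2, 2, 3)))  # F(x_{T,j})
y = np.eye(3)[rng.integers(0, 3, size=(2, 2))]  # one-hot labels y_S, shape (2, 2, 3)

d_j = lambda p: 1.0 / (1.0 + np.exp(-p.mean()))  # toy stand-in for D_{O,j}
adv = output_adv_loss(d_j, p_src, p_tgt)
ce = task_loss(p_src, y)
```

Note the asymmetry: Eq. (2) is a minimax game between $F$ and $D_{O, j}$, while Eq. (3) is supervised and only trains $F$, since the translated source images inherit the original labels $y_S$.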
The overall objective function is:

$$
L_{total} = \sum_j[\lambda_{GAN} L_{GAN}^j + \lambda_{sem} L_{sem}^j + \lambda_{Style}L_{Style}^j + \lambda_{Out}L_{Out}^j + \lambda_{task}L_{task}^j] \tag{4}
$$

* Finally, training corresponds to solving the optimization problem $F^* = \arg \min_F \min_G \max_D L_{total}$, where $G$ is the generator (appearing in $L_{sem}, L_{GAN}, L_{Style}$) and $D$ denotes all the discriminators (appearing in $L_{GAN}, L_{Style}, L_{Out}$). With the losses written as above, the discriminators are maximized over while $F$ and $G$ are minimized over.

## Conclusion

* This work presented a novel OCDA framework for semantic segmentation using three core design principles: Discover, Hallucinate, and Adapt.
* Based on the latent target styles, the compound data is clustered and each group is treated as one specific latent target domain.
* These target domains are hallucinated in the source domain via image translation. This reduces the domain gap and changes the classifier boundary to cover the latent domains.
* Finally, domain-wise target-to-source alignment is performed using multiple discriminators, each focusing on one latent domain.
* The key idea is to decompose OCDA into multiple UDA problems.