# Notes on "[Domain Adaptation for Structured Output via Discriminative Patch Representations](https://arxiv.org/abs/1901.05427)"

###### tags: `notes` `domain-adaptation` `adversarial` `unsupervised` `segmentation`

Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)

**Note**: The [code](https://github.com/wasidennis/AdaptSegNet) for [Tsai et al. 2018](https://arxiv.org/abs/1802.10349) mentions some implementation details for this method. No open-source implementation of this method was available as of 20/04/2020.

## Brief Outline

They propose an unsupervised domain adaptation method that explicitly discovers many modes in the structured output space of semantic segmentation to learn a better discriminator between the 2 domains, ultimately leading to better domain alignment.

## Introduction

- They leverage pixel-level semantic annotations available in the source domain, but instead of working on the output space (like [Tsai et al. 2018](https://arxiv.org/abs/1802.10349)), the adaptation occurs in 2 stages.
- They extract patches from the source domain (represented using their annotations) and discover major modes by grouping the patches using $K$-means clustering. They use a $K$-way classifier to predict the cluster/mode index of each patch using the image as input.
- Their method, patch-level alignment, operates on the $K$-dimensional probability vector space. The discriminator learned on this space can backpropagate the gradient through the cluster/mode index classifier to the semantic segmentation network.
- They propose an adversarial adaptation framework for structured prediction that explicitly tries to discover and predict modes of the output patches.

## Related Work

### Unsupervised Domain Adaptation (UDA)

- One common practice is to adopt adversarial learning ([Ganin et al. 2015](https://arxiv.org/abs/1505.07818)) or to minimize the Maximum Mean Discrepancy ([Long et al. 2015](http://proceedings.mlr.press/v37/long15.html)).
- Recent approaches for UDA in semantic segmentation can be categorized into:
    1. Output space ([Tsai et al. 2018](https://arxiv.org/abs/1802.10349)) and spatial-aware ([Chen et al. 2017](https://arxiv.org/abs/1711.11556)) adaptations, which aim to align the global structure (like scene layout) across domains.
    2. Pixel-level adaptation, which synthesizes target samples to reduce the domain gap during training (not relevant to this paper).
    3. Pseudo label re-training, which generates pseudo ground truth for target images to finetune the model trained on the source domain (not relevant to this paper).
- The first category is the most relevant to this approach. However, those methods don't handle the intrinsic domain gap (like camera poses), which this method can.
- They also mention that the other 2 categories and other DA techniques, like robust loss function design, are orthogonal to their work (i.e. this method can be combined with them to further improve performance).

### Learning Disentangled Representations

- These approaches use predefined factors to learn interpretable representations of the image (see paper for approaches).
- Although these present promising results, they focus on handling data in a single domain.
- Motivated by this line of research, they propose to learn discriminative representations for patches to help the DA task.
- To do so, they use the available label annotations as a disentangled factor (so the method does not require any predefined factors).

## Methodology

### Overview

- Given the source and target images $I_s, I_t \in \mathbb{R}^{H \times W \times 3}$, where only source data is annotated with per-pixel semantic categories $Y_s$, they seek to learn a sem-seg model $G$ that works on both domains.
- Since the target domain is unlabeled, the goal is to align the predicted output distribution $O_t$ of the target data with the source distribution $O_s$.
- They project the category distribution of patches to a clustered space in which various patch modes (i.e. $K$ clusters) have already been discovered from the annotations in the source domain.
- For the target domain, they then employ adversarial learning to align the patch-level distributions across domains in the $K$-dimensional space.

![Overview](https://i.imgur.com/9YnAL9u.png)

### Patch-level Alignment

- This section first describes how they construct the clustered space and learn discriminative patch representations.
- It then describes adversarial alignment using the learned patch representation.

#### Patch Mode Discovery

- To discover modes and learn a discriminative feature space, class labels or predefined factors are usually provided as supervisory signals.
- Here, per-pixel annotations from the source domain are used to construct a space of semantic patch representations.
- They use label histograms for patches.
- First, randomly sample patches from source images, use a $2 \times 2$ grid on each patch to extract spatial label histograms, and concatenate them to obtain a $2 \times 2 \times C$ dimensional vector (where $C$ is the number of semantic categories).
- Second, apply $K$-means clustering on these histograms, thereby assigning each ground truth label patch a unique cluster index.
- Note: This process of finding the cluster membership for each patch in a ground truth label map $Y_s$ is defined as $\Gamma(Y_s)$.
- To implement this clustered space for training $G$ on source data, they use a classification module $H$ on top of the predicted output $O_s$, which tries to predict the cluster membership $\Gamma(Y_s)$ for all locations.
- They denote the learned representation as $F_s = H(G(I_s)) \in (0, 1)^{U \times V \times K}$, obtained through the softmax function (where $K$ is the number of clusters).
- Each datapoint on the spatial map $F_s$ corresponds to a patch of the input image, and the group label for each patch can be obtained using $\Gamma(Y_s)$.
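
The two steps of patch mode discovery can be sketched roughly as follows. This is a minimal NumPy illustration on a random toy label map, not the authors' implementation: the helper names (`patch_histogram`, `sample_patch_histograms`, `kmeans`), the patch size, the number of sampled patches, and the plain Lloyd's $K$-means loop are all assumptions chosen for brevity.

```python
import numpy as np


def patch_histogram(label_patch, num_classes, grid=2):
    """Concatenate normalized label histograms over a grid x grid split of a patch."""
    h, w = label_patch.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell = label_patch[i * h // grid:(i + 1) * h // grid,
                               j * w // grid:(j + 1) * w // grid]
            hist = np.bincount(cell.ravel(), minlength=num_classes).astype(np.float64)
            cells.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(cells)  # length grid * grid * num_classes


def sample_patch_histograms(label_map, num_classes, patch_size, num_patches, rng):
    """Randomly sample square patches from one label map and describe each by its histogram."""
    h, w = label_map.shape
    feats = []
    for _ in range(num_patches):
        top = rng.integers(0, h - patch_size + 1)
        left = rng.integers(0, w - patch_size + 1)
        patch = label_map[top:top + patch_size, left:left + patch_size]
        feats.append(patch_histogram(patch, num_classes))
    return np.stack(feats)


def kmeans(x, k, num_iters=20, rng=None):
    """Plain Lloyd's K-means; returns (centroids, cluster index per row of x)."""
    rng = np.random.default_rng(0) if rng is None else rng
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(num_iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


rng = np.random.default_rng(0)
C, K = 5, 4  # toy values (assumption); the paper's experiments use K = 50
toy_label_map = rng.integers(0, C, size=(64, 128))  # stand-in for a source label map Y_s
features = sample_patch_histograms(toy_label_map, C, patch_size=16, num_patches=200, rng=rng)
centroids, cluster_ids = kmeans(features, K, rng=rng)
print(features.shape, centroids.shape, cluster_ids[:10])
```

Each row of `features` is a $2 \times 2 \times C$ histogram vector, and `cluster_ids` plays the role of $\Gamma(Y_s)$ for the sampled patches. A real pipeline would also need to handle ignored label values, which this sketch assumes are absent.
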
The learning process to construct the clustered space can then be formulated as a cross-entropy loss:

$$
\mathcal{L}_d(F_s, \Gamma(Y_s); G, H) = -\sum_{u, v}\sum_{k \in K}CE^{(u, v, k)}
\tag{1}
$$

- Here, $CE^{(u, v, k)}=\Gamma(Y_s)^{(u, v, k)}\log(F_s^{(u, v, k)})$.

#### Adversarial Alignment

- The ensuing task is to align the representations of target patches to the clustered space constructed in the source domain, ideally so that each is aligned to one of the $K$ modes.
- For this, they use an adversarial loss between $F_s$ and $F_t$, where $F_t$ is generated in the same way using $H$, i.e. $F_t = H(G(I_t))$.
- They formulate the patch distribution alignment as an adversarial objective:

$$
\mathcal{L}_{adv}(F_s, F_t; G, H, D) = \sum_{u, v}\{ \mathbb{E}[\log D(F_s)^{(u, v, 1)}] + \mathbb{E}[\log(1 - D(F_t)^{(u, v, 1)})] \}
\tag{2}
$$

- Here, $D$ is the discriminator that classifies whether the feature representation $F$ is from the source domain or the target domain.

#### Learning Objective

- Combining Equations 1 and 2 into a min-max objective (showing only the optimization variables):

$$
\min_{G, H} \max_D \{\mathcal{L}_s(G) + \lambda_d \mathcal{L}_d(G, H) + \lambda_{adv}\mathcal{L}_{adv}(G, H, D) \}
\tag{3}
$$

- Here, $\mathcal{L}_s$ is the cross-entropy loss for learning the structured prediction (sem-seg) on source data, and $\lambda_d$ and $\lambda_{adv}$ are the weights for the corresponding losses.

### Network Optimization

- To solve the optimization problem of Equation 3, they follow the training procedure of GANs and alternate between 2 steps:
    1. Update $D$.
    2. Update $G$ and $H$ while fixing $D$.

#### Update $D$

- The discriminator $D$ is trained to classify whether the feature representation $F$ is from the source domain (labeled as 1) or the target domain (labeled as 0).
- The maximization problem w.r.t. $D$ in Equation 3 is equivalent to minimizing the binary cross-entropy loss:

$$
\mathcal{L}_D(F_s, F_t; D) = -\sum_{u, v}\{ \mathbb{E}[\log D(F_s)^{(u, v, 1)}] + \mathbb{E}[\log(1 - D(F_t)^{(u, v, 1)})] \}
\tag{4}
$$

#### Update $G$ and $H$

- The goal of this step is to push the target distribution closer to the source distribution using the optimized $D$, while maintaining good performance on the main tasks using $G$ and $H$.
- So, the minimization problem in Equation 3 is the combination of 2 supervised loss functions and an adversarial loss (which is a binary cross-entropy loss that assigns the source label to the target distribution):

$$
\mathcal{L}_{G, H} = \mathcal{L}_s + \lambda_d \mathcal{L}_d - \lambda_{adv} \sum_{u, v} \log (D(F_t)^{(u, v, 1)})
\tag{5}
$$

- Note that updating $H$ also influences $G$, thus enhancing the feature representations in $G$. Also, $H$ and $D$ are required only for training, so inference time is unaffected.

## Ablation Study

### Loss Functions

- Adding discriminative patch representations (just using $\mathcal{L}_s + \mathcal{L}_d$) already improves the performance.
- This demonstrates that the learned feature representation enhances the discrimination and generalization ability.
- The proposed patch-level alignment (adding $\mathcal{L}_{adv}$) further improves the performance.

### Learning Clustered Space

- $K$-means provides an additional supervisory signal to separate different patch patterns, while alignment is performed in this clustered space.
- Without the clustering loss $\mathcal{L}_d$, it would be difficult to align patch modes across the two domains. Removing this term from training reduces the performance.

### Number of Clusters

- The performance is robust to the choice of $K$.
- However, when $K$ is too large (>300), it causes confusion between patch modes and increases the training difficulty.
- To balance efficiency and accuracy, they use $K=50$ in all experiments.

### Combination with other DA methods

- Fusing the patch-level alignment method separately with other methods, like output space adaptation, pixel-level adaptation and pseudo label re-training, improves the performance.
- Fusing all of these at the same time gives the best performance.
- Combining different DA approaches does not improve performance incrementally. However, adding patch-level alignment consistently improves performance in all settings.

## Conclusion

- They present a DA method for structured output via patch-level alignment.
- They learn discriminative representations of patches by constructing a clustered space of the source patches, and adopt an adversarial learning scheme to push the target patch distribution closer to the source one.
- Patch-level alignment is complementary to other DA approaches and provides additional improvements.

# Supplementary Notes

- The following will be useful for implementation (of experiments) but may or may not be explicitly in the paper.

## Datasets

### GTA5

- Introduced by [Richter et al. 2016](https://arxiv.org/abs/1608.02192), it contains around 25k images generated from the photo-realistic computer game Grand Theft Auto 5 (GTA5).
- Using their labeling pipeline, they labeled the entire dataset in 49 hours (7 sec/image), compared to 12 years for similarly sized datasets labeled without their pipeline.

### Cityscapes

- Introduced by [Cordts et al. 2016](https://arxiv.org/abs/1604.01685), it has 5k high quality pixel-level annotations and 20k coarse annotations for road scenes (mostly in sunny conditions).

### Synthia

- Introduced by [Ros et al. 2016](https://ieeexplore.ieee.org/document/7780721), it consists of images generated from a photo-realistic virtual city.

### Oxford Robot Car

- Introduced by [Maddern et al. 2016](https://journals.sagepub.com/doi/abs/10.1177/0278364916679498), it contains around 20 million images collected over the same route in Central Oxford over a period of 1 year.
- This dataset doesn't have ground truth labels by default. However, this paper's authors selected a sequence tagged as 'rainy' and used 895 of its images for training.
- 297 separate images from the same sequence were manually annotated for testing (and are available on the [project page](https://www.nec-labs.com/~mas/adapt-seg/adapt-seg.html)).

![Image and Patch Sizes](https://i.imgur.com/TZ0VDlp.png)

## Evaluation Metrics

- They use Intersection over Union (IoU) as a metric for all experiments.
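
For reference, per-class IoU and its mean can be computed from a confusion matrix along the following lines. This is a generic NumPy sketch rather than the paper's evaluation script; the function names and the `ignore_index=255` convention for unlabeled pixels are assumptions.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Build a num_classes x num_classes matrix; rows = ground truth, columns = prediction."""
    valid = target != ignore_index  # assumption: unlabeled pixels are marked with ignore_index
    idx = num_classes * target[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def per_class_iou(conf):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c); classes absent from both maps give NaN."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    return np.where(union > 0, tp / np.maximum(union, 1), np.nan)


# Toy 2 x 3 label maps with 3 classes; the last target pixel is ignored.
target = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
iou = per_class_iou(confusion_matrix(pred, target, num_classes=3))
print(iou, np.nanmean(iou))
```

On this toy input the per-class IoUs are 1/2, 2/3 and 1, giving a mean IoU of 13/18. When evaluating a whole dataset, the confusion matrices of all images are usually summed before computing IoU, rather than averaging per-image scores.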