# Notes on "[DACS: Domain Adaptation via Cross-domain Mixed Sampling](https://arxiv.org/abs/2007.08702)"

###### tags: `notes` `domain-adaptation` `unsupervised` `segmentation` `image-mixup`

Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)

Note: WACV '21 paper; [Official Code Release](https://github.com/vikolss/DACS)

## Brief Outline
They propose Domain Adaptation via Cross-domain Mixed Sampling (DACS), which mixes images from the two domains along with their corresponding labels and pseudo-labels. The network is trained on these mixed samples in addition to the labelled source data.

## Introduction
* Unsupervised Domain Adaptation (UDA) deals with the problem where labelled data is available for a source domain, but only unlabelled data is available for the target domain.
* Pseudo-labelling or self-training ([Lee, 2013](http://deeplearning.net/wp-content/uploads/2013/03/pseudo_label_final.pdf)) was originally proposed for Semi-Supervised Learning (SSL), and involves training on artificial targets based on the class predictions of the network. It was later adapted to UDA with certain modifications to compensate for the domain shift.
* [Zou et al. 2018](https://openaccess.thecvf.com/content_ECCV_2018/papers/Yang_Zou_Unsupervised_Domain_Adaptation_ECCV_2018_paper.pdf) pointed out that naively applying pseudo-labelling to UDA tends to bias the predictions of the network towards easy-to-predict classes, causing difficult classes to stop being predicted during training. That work proposed class-balanced sampling of pseudo-labels to combat the issue.
* An additional issue, that pseudo-labels may simply be incorrect due to the domain shift, has inspired works to add uncertainty estimation modules ([Zou et al. 2019](https://arxiv.org/abs/1908.09822) and [Zheng and Yang, 2020](https://arxiv.org/abs/2003.03773)).
* In existing methods for correcting incorrect pseudo-labels, certain images in the target domain are oversampled and low-confidence pixels are filtered out. However, many low-confidence pixels lie along semantic class boundaries, so filtering them out diminishes the training signal at those boundaries.
* To circumvent these issues, they propose Domain Adaptation via Cross-domain Mixed Sampling (DACS), which adapts the augmentation technique ClassMix ([Olsson et al. 2020](https://arxiv.org/abs/2007.07936)), originally proposed for SSL.
* Naively using ClassMix (mixing within the target domain) with pseudo-labelling causes classes to be conflated (confused with other classes), similar to the findings of [Zou et al. 2018](https://openaccess.thecvf.com/content_ECCV_2018/papers/Yang_Zou_Unsupervised_Domain_Adaptation_ECCV_2018_paper.pdf). They propose to solve this problem by pasting classes from labelled source domain images onto unlabelled target domain images.
* Their main contributions are
    * They apply an SSL method based on ClassMix to UDA, illustrate its flaws, and provide an analysis of possible causes.
    * They introduce DACS, a simple framework that adapts ClassMix based on cross-domain mixing of samples.
    * They present improvements over the SOTA in UDA for the GTA5 to Cityscapes and Synthia to Cityscapes tasks.

## Methodology
### ClassMix
* ClassMix ([Olsson et al. 2020](https://arxiv.org/abs/2007.07936)) is a recently proposed data augmentation technique for challenging semi-supervised semantic segmentation benchmarks.
* It *mixes* 2 images $A$ and $B$ from the unlabelled dataset into an augmented image while also generating a pseudo-label for it.
First, a segmentation network makes predictions for $A$ and $B$, resulting in semantic maps (pseudo-labels) $Y_A$ and $Y_B$ respectively.
* Half of the classes present in $Y_A$ are selected and a binary mask $M$ is generated (the mask contains 1 for pixels of the selected classes of $A$ and 0 elsewhere). The augmented image $X_M$ and its corresponding label $Y_M$ are generated using the mask
$$
X_M = M \odot A + (1 - M) \odot B \\
Y_M = M \odot Y_A + (1 - M) \odot Y_B \tag{1}
$$
* Here, $\odot$ denotes element-wise multiplication. An illustration of ClassMix is shown below.

![ClassMix example](https://i.imgur.com/OnEqtqD.png)

### Naive ClassMix for UDA
* ClassMix uses unlabelled samples to generate augmented images. In UDA, the unlabelled samples come from the target dataset, so the naive approach is to mix target domain images with each other.
* Similar to the original ClassMix work ([Olsson et al. 2020](https://arxiv.org/abs/2007.07936)), they use the Mean Teacher framework ([Tarvainen and Valpola, 2017](https://arxiv.org/abs/1703.01780)): instead of generating pseudo-labels with the current parameters of the network, they use an exponential moving average of the weights from previous optimization steps, resulting in more stable predictions.
* In practice, this performs poorly and confuses some of the classes when predicting the semantics of target domain images. This impacts performance considerably, and occurs exclusively for target domain images, not for source domain images.
* There is a bias towards easy-to-transfer classes when applying pseudo-labelling naively (as per [Zou et al. 2018](https://openaccess.thecvf.com/content_ECCV_2018/papers/Yang_Zou_Unsupervised_Domain_Adaptation_ECCV_2018_paper.pdf)). Since the ClassMix approach relies on the generation of pseudo-labels, it can be expected to inherit the same underlying issues.
* While other works proposed improvements to correct the erroneous pseudo-label generation, they instead propose a change in the augmentation procedure.

### DACS
* Instead of mixing images only from the target domain, they mix images across domains. The mix is performed as in Equation 1, but the source domain image $X_S$ is used to compute the mask $M$. Also, the ground truth label $Y_S$ can be used instead of predictions from the segmentation network (see the code sketch below).

![DACS example](https://i.imgur.com/dcp2cMp.png)

* In Naive ClassMix, due to the potentially large domain gap, the network may implicitly learn to discern between the domains in order to perform better on the task, and (incorrectly) learn that the class distributions are very different in the two domains.
* With cross-domain mixing, they introduce new data. Since the labels for these new images partly come from the source domain, they won't be conflated for entire images.
* Further, the pseudo-labelled pixels (target domain) and the ground truth labelled pixels (source domain) may now be neighbours in an image, making the implicit discerning between domains unlikely, since it would have to be done at a pixel level.
* Both of these aspects help the network to better deal with the domain gap and effectively solve the class conflation problem, resulting in considerably better performance.
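As a rough illustration (not the official implementation), the cross-domain mixing of Equation 1 with a mask computed from the source ground truth could look like the minimal PyTorch-style sketch below; the function name, tensor shapes, and random class selection are assumptions made for illustration.

```python
import torch

def dacs_mix(x_s, y_s, x_t, y_t_pseudo):
    """Cross-domain mixing sketch (Eq. 1): paste half of the source image's
    classes (and their ground-truth labels) onto a target image.

    x_s, x_t        : images, float tensors of shape (3, H, W)
    y_s, y_t_pseudo : labels, long tensors of shape (H, W)
    """
    # Select half of the classes present in the source ground truth y_s.
    classes = torch.unique(y_s)
    n_select = classes.numel() // 2
    selected = classes[torch.randperm(classes.numel())[:n_select]]

    # Binary mask M: 1 where y_s belongs to a selected class, 0 elsewhere.
    mask = torch.isin(y_s, selected).float()  # (H, W)

    # X_M = M * X_S + (1 - M) * X_T ;  Y_M = M * Y_S + (1 - M) * Y_T
    x_m = mask.unsqueeze(0) * x_s + (1 - mask.unsqueeze(0)) * x_t
    y_m = (mask * y_s + (1 - mask) * y_t_pseudo).long()
    return x_m, y_m
```

Here `y_t_pseudo` would be the pseudo-label for the target image, e.g. the argmax of the mean-teacher network's prediction.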
* The overall algorithm of DACS is as follows:

![DACS algorithm](https://i.imgur.com/uffV261.png)

### Loss Function
* In DACS, the network parameters $\theta$ are trained by minimizing the following loss:
$$
\mathcal{L}(\theta) = \mathbb{E}[H(f_\theta(x_S), y_S) + \lambda H(f_\theta(x_M), y_M)] \tag{2}
$$
* Here, the expectation is over the random variables $x_S, y_S, x_M, y_M$, where $x_S$ is an image sampled uniformly at random from the source distribution, and $y_S$ is its corresponding label.
* The random variables $x_M$ and $y_M$ are the mixed image and its pseudo-label, created by performing cross-domain mixed sampling from an image sampled uniformly at random from the source domain and one from the target domain, as explained previously.
* $H$ is the cross-entropy between the predicted semantic map and the corresponding label (ground truth or pseudo), averaged over all pixels, and $\lambda$ is a hyperparameter which decides how much the unsupervised part of the loss affects the overall training.
* Following ClassMix and [French et al. 2020](https://arxiv.org/abs/1906.01916), they use an adaptive schedule for $\lambda$: for each image, $\lambda$ is set to the proportion of pixels where the predictions of $f_\theta$ on that image have a confidence above a certain threshold.

## Experiments
* They evaluate on the standard GTA5 to Cityscapes and Synthia to Cityscapes tasks and surpass the SOTA on both.

### Early Stopping
* They express concern over the use of early stopping by other works on the Cityscapes validation set, since the same set of images is also used for the final evaluation.
* This makes the evaluation unfair, as high performance on the validation set then no longer implies high performance on unseen data.
* They also note that using early stopping in their method gives an increase of around 3-4% on all results (including the baselines).
* The reason early stopping gives such a considerable increase is that the network's performance on the validation set fluctuates a lot over the course of training, rather than the model overfitting the training data.

### Ablation Study
#### No Mixing
* Since Naive ClassMix mixes images based on predictions, they investigate whether the class conflation problem could be related to the mixing component.
* They use only pseudo-labelling (i.e. remove the mixing component), and even more classes stop being predicted by the network, with overall performance dropping below the source-only baseline.
* Thus, it is reasonable to assume that the pseudo-labelling component causes the class conflation, not the mixing.

#### Distribution Alignment
* Another way to address class conflation would be to impose a prior class distribution on the pseudo-labels, as in [Vu et al. 2019](https://arxiv.org/abs/1811.12833) and [Berthelot et al. 2019](https://arxiv.org/abs/1911.09785).
* For this, they follow [Berthelot et al. 2019](https://arxiv.org/abs/1911.09785) and use Distribution Alignment, meaning that a distribution of classes is forced upon the predictions. This is another way of injecting entropy into the pseudo-labels, making it more likely that the network learns to segment correctly and avoids class conflation.
* They perform experiments where the ground truth class distribution $p$ of the target dataset is used to guide the training. This is done by transforming each output prediction $q$ for each pixel into $\tilde{q} = \text{Normalize}(q \times p / \tilde{p})$, where $\tilde{p}$ is a running average of all predictions made by the network on the target data (a small sketch of this re-weighting is given below).
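A hypothetical sketch of this re-weighting step; the function and variable names, and the use of an exponential moving average with a fixed momentum for $\tilde{p}$, are illustrative assumptions rather than details from the paper.

```python
import torch

def align_distribution(q, p, running_p, momentum=0.999):
    """Distribution alignment sketch: q_tilde = Normalize(q * p / p_tilde).

    q         : per-pixel softmax predictions, shape (C, H, W)
    p         : target-domain class prior, shape (C,)
    running_p : running average of predictions on target data, shape (C,)
    """
    # Update the running average of the predicted class frequencies.
    running_p = momentum * running_p + (1 - momentum) * q.mean(dim=(1, 2))

    # Re-weight each pixel's prediction by p / running_p and renormalize.
    q_tilde = q * (p / running_p).view(-1, 1, 1)
    q_tilde = q_tilde / q_tilde.sum(dim=0, keepdim=True)
    return q_tilde, running_p
```

The aligned `q_tilde` would then replace `q` when taking the argmax to produce pseudo-labels for the target image.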
* While using the ground truth target distribution is not legitimate in a true UDA setting, it is interesting to see that this approach also solves the class conflation problem. This further strengthens the hypothesis that artificially injecting entropy into training can help avoid class conflation.
* An interesting direction for future research would be to use an estimated class distribution in a similar way.

## Conclusion
* They proposed Domain Adaptation via Cross-domain mixed Sampling (DACS) for UDA in semantic segmentation, based on the ClassMix data augmentation technique.
* They study the systematic problems caused by naive application of ClassMix to UDA and detail the changes needed to correct these issues.