# Notes on "[Domain Adaptive Semantic Segmentation Using Weak Labels](https://arxiv.org/abs/2007.15176)" ###### tags: `notes` `domain-adaptation` `segmentation` `weakly-supervised` `unsupervised` ECCV '20 paper; [Project Page](http://www.nec-labs.com/~mas/WeakSegDA/); Code not released as of 20/09/20. Author: [Akshay Kulkarni](https://akshayk07.weebly.com/) ## Brief Outline This paper proposes a framework for domain adaptation (DA) in semantic segmentation with image-level weak labels in the target domain. They use weak labels to enable the interplay between feature alignment and pseudo-labeling, improving both in DA. ## Introduction * Existing UDA methods for semantic segmentation are developed mainly using 2 mechanisms * Psuedo label self-training * In this, pixel-wise pseudo labels are generated via strategies such as confidence scores ([BMVC '18](http://www.bmva.org/bmvc/2018/contents/papers/0200.pdf), [CVPR '19](https://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Bidirectional_Learning_for_Domain_Adaptation_of_Semantic_Segmentation_CVPR_2019_paper.pdf)) or self-paced learning ([ECCV '18](https://openaccess.thecvf.com/content_ECCV_2018/html/Yang_Zou_Unsupervised_Domain_Adaptation_ECCV_2018_paper.html)). * But, such pseudo labels are specific to the target domain and do not consider alignment between domains. * Distribution alignment between source and target domains * Numerous spaces could be considered for the alignment procedure, such as pixel ([ICML '18](http://proceedings.mlr.press/v80/hoffman18a.html), [CVPR '18](https://openaccess.thecvf.com/content_cvpr_2018/papers/Murez_Image_to_Image_CVPR_2018_paper.pdf)), feature ([Hoffman et. al. 
2016](https://arxiv.org/abs/1612.02649), [ICCV '17](https://openaccess.thecvf.com/content_ICCV_2017/papers/Zhang_Curriculum_Domain_Adaptation_ICCV_2017_paper.pdf)), output ([CVPR '18](https://arxiv.org/abs/1802.10349), [CVPR '18](https://arxiv.org/abs/1711.11556)) and patch ([ICCV '19 Oral](https://arxiv.org/abs/1901.05427)) spaces. * However, alignment by these methods is category-agnostic, which may be problematic as the domain gap may vary across categories. * To alleviate the lack of annotations in the target domain, they propose utilizing weak labels in the form of image- or point-level annotations. * The weak labels can be estimated from the model prediction in the UDA setting or provided by a human oracle in the weakly-supervised DA (WDA) paradigm. Note that this is the first paper to introduce a WDA setting for semantic segmentation. * Specifically, they use weak labels to perform * image-level classification to identify the presence/absence of categories in an image as a regularization. * category-wise domain alignment using such categorical labels. * For the image-level classification task, weak labels help obtain a better pixel-wise attention map per category. These category-wise attention maps act as guidance to further pool category-wise features for the proposed domain alignment procedure. * The main contributions of this work are * They propose the concept of using weak labels to help DA for semantic segmentation. * They utilize weak labels to improve category-wise alignment for better feature space adaptation. * They demonstrate the applicability of their method to both UDA and WDA settings. ## Methodology ### Problem Definition * In the source domain, they have images and pixel-wise labels denoted as $\mathcal{I}_s=\{ X_s^i, Y_s^i \}_{i=1}^{N_s}$, whereas the target dataset contains images and only image-level labels, $\mathcal{I}_t=\{ X_t^i, y_t^i \}_{i=1}^{N_t}$. 
* Here, $X_s, X_t \in \mathbb{R}^{H \times W \times 3}$, $Y_s \in \mathbb{B}^{H \times W \times C}$ with pixel-wise one-hot vectors, $y_t \in \mathbb{B}^{C}$ is a multi-hot vector representing the categories present in the image and $C$ is the number of categories (same for both source and target datasets). * The image-level labels $y_t$, termed weak labels, can be estimated (in which case, they are called pseudo-weak labels i.e. UDA) or acquired from a human oracle (in which case, they are called oracle-weak labels i.e. WDA). * Given such data, the problem is to adapt a segmentation model $\mathbf{G}$ learned on the source dataset $\mathcal{I}_s$ to the target dataset $\mathcal{I}_t$. ### Algorithm Overview ![Overall Procedure](https://i.imgur.com/AApo7Wf.png) * They first pass both the source and target images through the segmentation network $\mathbf{G}$ and obtain their features $F_s, F_t \in \mathbb{R}^{H' \times W' \times 2048}$, segmentation predictions $A_s, A_t \in \mathbb{R}^{H' \times W' \times C}$ and the upsampled pixel-wise predictions $O_s, O_t \in \mathbb{R}^{H \times W \times C}$. * As a baseline, they use source pixel-wise annotations to learn $\mathbf{G}$, while aligning the output space distribution $O_s$ and $O_t$, following this [CVPR '18](https://arxiv.org/abs/1802.10349) paper. * First, they introduce a module which learns to predict the categories that are present in a target image. Second, they formulate a mechanism to align the features of each individual category between source and target domains. * To this end, they use category-specific domain discriminators $D^c$ guided by the weak labels to determine which categories should be aligned. ### Weak Labels for Category Classification * To predict whether a category is absent/present in a particular image, they define an image classification task using the weak labels, such that $\mathbf{G}$ can discover those categories. 
They feed the target images $X_t$ through $\mathbf{G}$ to obtain the predictions $A_t$ and then apply a global pooling layer to obtain a single vector of predictions for each category: $$ p_t^c = \sigma_s\Big[\frac{1}{k} \log \frac{1}{H' W'} \sum_{h', w'} \exp \big( kA_t^{(h', w', c)} \big)\Big] \tag{1} $$ * Here, $\sigma_s$ is the sigmoid function, so $p_t$ represents the probability that a particular category appears in an image. Note that (1) is a smooth approximation of the $\max$ function, and the higher the value of $k$, the better it approximates $\max$. They use $k=1$. * Using $p_t$ and the weak labels $y_t$, they compute the category-wise binary CE loss: $$ \mathcal{L}_c(X_t; \mathbf{G}) = \sum_{c=1}^C -y_t^c \log(p_t^c) - (1-y_t^c)\log(1 - p_t^c) \tag{2} $$ * This loss $\mathcal{L}_c$ helps to identify the categories which are absent/present in a particular image and forces $\mathbf{G}$ to pay attention to those objects/stuff that are only partially identified when the source model is applied directly to the target images. ### Weak Labels for Feature Alignment * Methods in the literature align either the feature space or the output space across domains. However, these methods are category-agnostic, so they may align features of categories not present in certain images. * Also, features belonging to different categories may have different domain gaps. Thus, category-wise alignment could be beneficial but has not been widely studied in UDA for semantic segmentation. #### Category-wise Feature Pooling * Given the last layer features $F$ and the segmentation prediction $A$, they obtain the category-wise features by using the prediction as an attention over the features. 
Specifically, they obtain the category-wise feature $\mathcal{F}^c$ as a 2048-dimensional vector for the $c^{th}$ category as follows: $$ \mathcal{F}^c = \sum_{h', w'} \sigma(A)^{(h', w', c)} F^{(h', w')} \tag{3} $$ * Here, $\sigma(A)$ is a tensor of dimension $H' \times W' \times C$, with each channel along the category dimension representing the category-wise attention obtained by the softmax operation $\sigma$ over the spatial dimensions. * $\sigma(A)^{(h', w', c)}$ is a scalar and $F^{(h', w')}$ is a 2048-dimensional vector. So, $\mathcal{F}^c$ is the sum of the features $F^{(h', w')}$ weighted by $\sigma(A)^{(h', w', c)}$ over the spatial map $H' \times W'$. Note that the subscripts $s$ and $t$ are dropped as they employ the same operation to obtain the category-wise features for both domains. * Note that $\mathcal{F}^c$ denotes the pooled features for the $c^{th}$ category and $\mathcal{F}^C$ denotes the set of pooled features for all categories. #### Category-wise Feature Alignment * To learn $\mathbf{G}$ such that source and target category-wise features are aligned, they use an adversarial loss with category-specific discriminators $D^C=\{ D^c \}_{c=1}^C$. * The reason for using category-specific discriminators is to ensure that the feature distribution for each category can be aligned independently, which avoids the noisy distribution modeling from a mixture of categories. * They train $C$ distinct category-specific discriminators $D^C$ as follows: $$ \mathcal{L}_d^C(\mathcal{F}_s^C, \mathcal{F}_t^C; D^C) = \sum_{c=1}^C -y_s^c \log D^c(\mathcal{F}_s^c) - y_t^c \log (1 - D^c(\mathcal{F}_t^c)) \tag{4} $$ * While training the discriminators, they only compute the loss for those categories which are present in the particular images, via the weak labels $y_s, y_t \in \mathbb{B}^C$ that indicate whether a category occurs in an image or not. 
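The pooling in (3) and the discriminator objective in (4) can be sketched in a few lines of NumPy. This is a minimal illustration written for these notes (the authors' code is unreleased), so the function names and the toy discriminator outputs are assumptions; array shapes follow the notation above.

```python
import numpy as np

def spatial_softmax(A):
    """Softmax over the spatial map (H', W'), independently per category channel."""
    flat = A.reshape(-1, A.shape[-1])                  # (H'*W', C)
    flat = flat - flat.max(axis=0, keepdims=True)      # numerical stability
    e = np.exp(flat)
    return (e / e.sum(axis=0, keepdims=True)).reshape(A.shape)

def categorywise_pool(F, A):
    """Eq. (3): pool features F (H', W', D) with attention sigma(A) (H', W', C).
    Returns a (C, D) array: one D-dimensional pooled feature per category."""
    attn = spatial_softmax(A)
    # Sum over spatial positions: F^c = sum_{h', w'} attn[h', w', c] * F[h', w', :]
    return np.einsum('hwc,hwd->cd', attn, F)

def discriminator_loss(d_s, d_t, y_s, y_t, eps=1e-8):
    """Eq. (4): per-category BCE over the C discriminator outputs, gated by the
    multi-hot weak labels y_s, y_t so that absent categories contribute nothing."""
    return float(np.sum(-y_s * np.log(d_s + eps) - y_t * np.log(1.0 - d_t + eps)))
```

Because the weak labels multiply each category's term, a category absent from an image is simply dropped from the loss, which is how the weak labels gate the category-wise alignment.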
The adversarial loss for the target images to train $\mathbf{G}$ is: $$ \mathcal{L}_{adv}^C(\mathcal{F}_t^C; \mathbf{G}, D^C)=\sum_{c=1}^C -y_t^c \log D^c(\mathcal{F}_t^c) \tag{5} $$ * Similarly, they use the target weak labels $y_t$ to align only those categories present in the target image. * Note: these two loss functions are effectively those used in the original GAN paper and also in the output space adaptation paper ([CVPR '18](https://arxiv.org/abs/1802.10349)). ### Network Optimization #### Discriminator Training * Both source and target images are used to train a set of $C$ distinct discriminators, one for each category $c$, which learn to distinguish between category-wise features drawn from the source and the target domain. * The optimization problem to train the discriminators can be expressed as $\min_{D^C}\mathcal{L}_d^C(\mathcal{F}_s^C, \mathcal{F}_t^C)$. #### Segmentation Network Training * They train $\mathbf{G}$ with the pixel-wise CE loss $\mathcal{L}_s$ on the source images, and the image classification loss $\mathcal{L}_c$ and adversarial loss $\mathcal{L}_{adv}^C$ on the target images. * The combined loss function to train $\mathbf{G}$ is: $$ \min_{\mathbf{G}} \mathcal{L}_s(X_s) + \lambda_c \mathcal{L}_c(X_t) + \lambda_d \mathcal{L}_{adv}^C(\mathcal{F}_t^C) \tag{6} $$ * They follow the standard GAN training procedure ([NeurIPS '14](https://papers.nips.cc/paper/5423-generative-adversarial-nets)) to alternately update $\mathbf{G}$ and $D^C$. ### Acquiring Weak Labels #### Pseudo-Weak Labels (UDA) * One way is to directly estimate the weak labels from the available data, i.e. source images/labels and target images, which is the UDA setting. In this work, they utilize the baseline model ([CVPR '18](https://arxiv.org/abs/1802.10349)) to adapt a model learned from the source to the target domain, and obtain the weak labels of target images as follows: $$ y_t^c = \begin{cases} 1, & p_t^c > T,\\ 0, & \text{otherwise}. 
\end{cases} \tag{7} $$ * Here, $p_t^c$ is the probability for category $c$ as computed in (1) and $T$ is a threshold, which they set to 0.2 in the experiments. * They forward a target image through the model and obtain the weak labels using (7) in an online manner. Since these labels do not require human supervision, this corresponds to the UDA setting. #### Oracle-Weak Labels (WDA) * In this setting, they obtain weak labels by querying a human oracle for a list of the categories that occur in the target image. * They further show that their method can use different forms of oracle-weak labels, e.g. point supervision ([ECCV '16](https://arxiv.org/abs/1506.02106)), which requires only slightly more annotation effort than image-level supervision. * In point supervision, they randomly obtain one pixel coordinate for each category present in the image, i.e. the set of tuples $\{ (h^c, w^c, c) | \forall y_t^c = 1 \}$. For an image, they compute the loss as follows: $\mathcal{L}_{point}=-\sum_{\forall y_t^c=1}y_t^c \log (O_t^{(h^c, w^c, c)})$, where $O_t \in \mathbb{R}^{H \times W \times C}$ is the target output prediction after pixel-wise softmax. ## Conclusion * In this work, they use weak labels to improve domain adaptation for semantic segmentation in both UDA and WDA settings, with the latter being a novel setting. * They design an image-level classification module using weak labels, forcing the network to pay attention to the categories present in the image. With this guidance from weak labels, they further utilize a category-wise alignment method to improve adversarial alignment in the feature space. * Their formulation generalizes to both pseudo-weak and oracle-weak labels.
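As a concrete recap of how the pseudo-weak labels are estimated, the pipeline of Eqs. (1) and (7) — image-level probabilities via the smooth max, then thresholding at $T$ — can be sketched in NumPy. This is an illustrative reimplementation for these notes, not the authors' code; the function names are assumptions.

```python
import numpy as np

def image_level_probs(A, k=1.0):
    """Eq. (1): sigmoid of a smoothed max (log-mean-exp) over the spatial map.
    A: (H', W', C) segmentation logits; returns (C,) per-category probabilities."""
    m = (k * A).max(axis=(0, 1))                            # shift for stability
    lse = np.log(np.exp(k * A - m).mean(axis=(0, 1))) + m   # log-mean-exp
    return 1.0 / (1.0 + np.exp(-lse / k))                   # sigmoid

def pseudo_weak_labels(A, T=0.2, k=1.0):
    """Eq. (7): threshold the probabilities to get a multi-hot weak-label vector."""
    return (image_level_probs(A, k) > T).astype(int)
```

For example, logits that are uniformly high for one category and uniformly low for another yield the weak-label vector `[1, 0]` at the paper's threshold $T = 0.2$.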