# Notes on "[Domain Adaptive Semantic Segmentation Using Weak Labels](https://arxiv.org/abs/2007.15176)"
###### tags: `notes` `domain-adaptation` `segmentation` `weakly-supervised` `unsupervised`
ECCV '20 paper; [Project Page](http://www.nec-labs.com/~mas/WeakSegDA/); Code not released as of 20/09/20.
Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)
## Brief Outline
This paper proposes a framework for domain adaptation (DA) in semantic segmentation with image-level weak labels in the target domain. They use weak labels to enable the interplay between feature alignment and pseudo-labeling, improving both in DA.
## Introduction
* Existing UDA methods for semantic segmentation are developed mainly using 2 mechanisms
* Pseudo-label self-training
* In this, pixel-wise pseudo labels are generated via strategies such as confidence scores ([BMVC '18](http://www.bmva.org/bmvc/2018/contents/papers/0200.pdf), [CVPR '19](https://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Bidirectional_Learning_for_Domain_Adaptation_of_Semantic_Segmentation_CVPR_2019_paper.pdf)) or self-paced learning ([ECCV '18](https://openaccess.thecvf.com/content_ECCV_2018/html/Yang_Zou_Unsupervised_Domain_Adaptation_ECCV_2018_paper.html)).
* But, such pseudo labels are specific to the target domain and do not consider alignment between domains.
* Distribution alignment between source and target domains
* Numerous spaces could be considered for the alignment procedure, such as pixel ([ICML '18](http://proceedings.mlr.press/v80/hoffman18a.html), [CVPR '18](https://openaccess.thecvf.com/content_cvpr_2018/papers/Murez_Image_to_Image_CVPR_2018_paper.pdf)), feature ([Hoffman et al. 2016](https://arxiv.org/abs/1612.02649), [ICCV '17](https://openaccess.thecvf.com/content_ICCV_2017/papers/Zhang_Curriculum_Domain_Adaptation_ICCV_2017_paper.pdf)), output ([CVPR '18](https://arxiv.org/abs/1802.10349), [CVPR '18](https://arxiv.org/abs/1711.11556)) and patch ([ICCV '19 Oral](https://arxiv.org/abs/1901.05427)) spaces.
* However, alignment by these methods is category-agnostic, which may be problematic as the domain gap may vary across categories.
* To alleviate the issue of lacking annotations in the target domain, they propose utilizing weak labels in the form of image- or point-level annotations in the target domain.
* The weak labels can be estimated from the model prediction in the UDA setting or provided by a human oracle in the weakly-supervised DA (WDA) setting. Note that this is the first paper to introduce a WDA setting for semantic segmentation.
* Specifically, they use weak labels to perform
* image-level classification to identify the presence/absence of categories in an image as a regularization.
* category-wise domain alignment using such categorical labels.
* For the image-level classification task, weak labels help obtain a better pixel-wise attention map per category. These category-wise attention maps act as guidance to further pool category-wise features for the proposed domain alignment procedure.
* The main contributions of this work are
* They propose the concept of using weak labels to help DA for semantic segmentation.
* They utilize weak labels to improve category-wise alignment for better feature space adaptation.
* They demonstrate the applicability of their method to both UDA and WDA settings.
## Methodology
### Problem Definition
* In the source domain, they have images and pixel-wise labels denoted as $\mathcal{I}_s=\{ X_s^i, Y_s^i \}_{i=1}^{N_s}$. Whereas, the target dataset contains images and only image-level labels as $\mathcal{I}_t=\{ X_t^i, y_t^i \}_{i=1}^{N_t}$.
* Here, $X_s, X_t \in \mathbb{R}^{H \times W \times 3}$, $Y_s \in \mathbb{B}^{H \times W \times C}$ with pixel-wise one-hot vectors, $y_t \in \mathbb{B}^{C}$ is a multi-hot vector representing the categories present in the image and $C$ is the number of categories (same for both source and target datasets).
* The image-level labels $y_t$, termed weak labels, can be estimated (in which case, they are called pseudo-weak labels i.e. UDA) or acquired from a human oracle (in which case, they are called oracle-weak labels i.e. WDA).
* Given such data, the problem is to adapt a segmentation model $\mathbf{G}$ learned on the source dataset $\mathcal{I}_s$ to the target dataset $\mathcal{I}_t$.
### Algorithm Overview
![Overall Procedure](https://i.imgur.com/AApo7Wf.png)
* They first pass both the source and target images through the segmentation network $\mathbf{G}$ and obtain their features $F_s, F_t \in \mathbb{R}^{H' \times W' \times 2048}$, segmentation predictions $A_s, A_t \in \mathbb{R}^{H' \times W' \times C}$ and the upsampled pixel-wise predictions $O_s, O_t \in \mathbb{R}^{H \times W \times C}$.
* As a baseline, they use source pixel-wise annotations to learn $\mathbf{G}$, while aligning the output space distribution $O_s$ and $O_t$, following this [CVPR '18](https://arxiv.org/abs/1802.10349) paper.
* First, they introduce a module which learns to predict the categories that are present in a target image. Second, they formulate a mechanism to align the features of each individual category between source and target domains.
* To this end, they use category-specific domain discriminators $D^c$ guided by the weak labels to determine which categories should be aligned.
### Weak Labels for Category Classification
* To predict whether a category is absent/present in a particular image, they define an image classification task using the weak labels, such that $\mathbf{G}$ can discover those categories.
* They feed the target images $X_t$ through $\mathbf{G}$ to obtain the predictions $A_t$ and then apply a global pooling layer to obtain a single vector of predictions for each category:
$$
p_t^c = \sigma_s\left[ \frac{1}{k} \log \frac{1}{H'W'} \sum_{h', w'} \exp\left( k A_t^{(h', w', c)} \right) \right]
\tag{1}
$$
* Here, $\sigma_s$ is the sigmoid function, so $p_t^c$ represents the probability that category $c$ appears in the image. Note that (1) is a smooth approximation of the $\max$ function, and the higher the value of $k$, the better it approximates $\max$. They use $k=1$.
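The pooling in (1) can be sketched in a few lines of numpy (the function name and toy shapes are mine, not the paper's):

```python
import numpy as np

def smooth_max_pool(A_t, k=1.0):
    """Global pooling of Eq. (1): a log-sum-exp smooth approximation of max.

    A_t: (H', W', C) segmentation prediction; returns p_t of shape (C,),
    the per-category probability that the category appears in the image.
    Larger k pushes the pooling toward a hard spatial max; the paper uses k=1.
    """
    # (1/k) * log of the spatial mean of exp(k * A_t), per category
    lse = (1.0 / k) * np.log(np.mean(np.exp(k * A_t), axis=(0, 1)))  # (C,)
    return 1.0 / (1.0 + np.exp(-lse))  # sigmoid sigma_s
```

With all-zero logits the pooled score is exactly $\sigma_s(0) = 0.5$; a single large activation pulls that category's score toward 1, which is the intended max-like behavior.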
* Using $p_t$ and the weak labels $y_t$, they compute the category-wise binary CE loss:
$$
\mathcal{L}_c(X_t; \mathbf{G}) = \sum_{c=1}^C -y_t^c \log(p_t^c) - (1-y_t^c)\log(1 - p_t^c)
\tag{2}
$$
* This loss $\mathcal{L}_c$ helps identify the categories absent/present in a particular image and encourages $\mathbf{G}$ to pay attention to objects/stuff that are only partially identified when the source model is applied directly to target images.
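Eq. (2) is a standard multi-label binary cross-entropy summed over categories; a minimal numpy sketch (the clipping for numerical safety is my addition, not in the paper):

```python
import numpy as np

def weak_label_bce(p_t, y_t, eps=1e-7):
    """Category-wise binary CE of Eq. (2).

    p_t: (C,) category probabilities from the pooling step;
    y_t: (C,) multi-hot weak label. Summed (not averaged) over
    categories, matching the paper's formulation.
    """
    p = np.clip(p_t, eps, 1.0 - eps)  # avoid log(0)
    return float(np.sum(-y_t * np.log(p) - (1.0 - y_t) * np.log(1.0 - p)))
```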
### Weak Labels for Feature Alignment
* Methods in the literature align either the feature space or the output space across domains. However, these alignments are category-agnostic, so they may align features of categories that are not present in certain images.
* Also, features belonging to different categories may have different domain gaps. Thus, category-wise alignment could be beneficial but has not been widely studied in UDA for semantic segmentation.
#### Category-wise Feature Pooling
* Given the last layer features $F$ and the segmentation prediction $A$, they obtain the category-wise features by using the prediction as an attention over the features. Specifically, they obtain the category-wise feature $\mathcal{F}^c$ as a 2048-dimensional vector for the $c^{th}$ category as follows:
$$
\mathcal{F}^c = \sum_{h', w'} \sigma(A)^{(h', w', c)} F^{(h', w')}
\tag{3}
$$
* Here, $\sigma(A)$ is a tensor of dimension $H' \times W' \times C$ with each channel along the category dimension representing the category-wise attention obtained by the softmax operation $\sigma$ over the spatial dimensions.
* $\sigma(A)^{(h', w', c)}$ is a scalar and $F^{(h', w')}$ is a 2048-dimensional vector. So, $\mathcal{F}^c$ is the summed feature of $F^{(h', w')}$ weighted by $\sigma(A)^{(h', w', c)}$ over the spatial map $H' \times W'$. Note that the subscripts $s$ and $t$ are dropped as they employ the same operation to obtain the category-wise features for both domains.
* Note that $\mathcal{F}^c$ denotes the pooled feature for the $c^{th}$ category and $\mathcal{F}^C$ denotes the set of pooled features for all categories.
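The attention-weighted pooling of Eq. (3) amounts to a spatial softmax per category followed by a weighted sum of features; a numpy sketch under my own shape conventions:

```python
import numpy as np

def category_feature_pool(F, A):
    """Category-wise pooling of Eq. (3).

    F: (H', W', D) last-layer features; A: (H', W', C) segmentation logits.
    Softmax over the *spatial* positions gives a per-category attention map,
    which weights and sums the features into one D-dim vector per category.
    """
    C = A.shape[-1]
    flat = A.reshape(-1, C)                      # (H'W', C)
    att = np.exp(flat - flat.max(axis=0))        # numerically stable softmax
    att = att / att.sum(axis=0, keepdims=True)   # columns sum to 1 over space
    Ff = F.reshape(-1, F.shape[-1])              # (H'W', D)
    return att.T @ Ff                            # (C, D) pooled features F^c
```

With uniform logits the attention is uniform, so each pooled vector reduces to the spatial mean of the features.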
#### Category-wise Feature Alignment
* To learn $\mathbf{G}$ such that source and target category-wise features are aligned, they use an adversarial loss while using category-specific discriminators $D^C=\{ D^c \}_{c=1}^C$.
* The reason for using category-specific discriminators is to ensure that the feature distribution for each category could be aligned independently, which avoids the noisy distribution modeling from a mixture of categories.
* They train $C$ distinct category-specific discriminators $D^C$ as follows:
$$
\mathcal{L}_d^C(\mathcal{F}_s^C, \mathcal{F}_t^C; D^C) = \sum_{c=1}^C -y_s^c \log D^c(\mathcal{F}_s^c) - y_t^c \log (1 - D^c(\mathcal{F}_t^c))
\tag{4}
$$
* While training the discriminators, they only compute the loss for those categories which are present in the particular images via the weak labels $y_s, y_t \in \mathbb{B}^C$ that indicate whether a category occurs in an image or not.
* The adversarial loss for the target images to train $\mathbf{G}$ is:
$$
\mathcal{L}_{adv}^C(\mathcal{F}_t^C; \mathbf{G}, D^C)=\sum_{c=1}^C -y_t^c \log D^c(\mathcal{F}_t^c)
\tag{5}
$$
* Similarly, they use the target weak labels $y_t$ to align only those categories present in the target image.
* Note: These 2 loss functions are effectively those used in the original GAN paper and also in the output space adaptation paper ([CVPR '18](https://arxiv.org/abs/1802.10349)).
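Eqs. (4) and (5) can be written down directly, with the weak labels acting as a per-category mask (a numpy sketch; `d_s`/`d_t` stand in for the scalar discriminator outputs $D^c(\mathcal{F}^c)$, and the clipping is my addition):

```python
import numpy as np

def discriminator_loss(d_s, d_t, y_s, y_t, eps=1e-7):
    """Eq. (4): per-category discriminator loss, masked by weak labels.

    d_s, d_t: (C,) discriminator outputs in (0, 1) for source/target pooled
    features; y_s, y_t: (C,) multi-hot weak labels. Only categories present
    in each image (label = 1) contribute to the loss.
    """
    d_s = np.clip(d_s, eps, 1.0 - eps)
    d_t = np.clip(d_t, eps, 1.0 - eps)
    return float(np.sum(-y_s * np.log(d_s) - y_t * np.log(1.0 - d_t)))

def adversarial_loss(d_t, y_t, eps=1e-7):
    """Eq. (5): loss on G that pushes target features to fool each D^c."""
    d_t = np.clip(d_t, eps, 1.0 - eps)
    return float(np.sum(-y_t * np.log(d_t)))
```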
### Network Optimization
#### Discriminator Training
* Both source and target images are used to train a set of $C$ distinct discriminators for each category $c$, which learn to distinguish between the category-wise features drawn from either the source or the target domain.
* The optimization problem to train the discriminator can be expressed as $\min_{D^C}\mathcal{L}_d^C(\mathcal{F}_s^C, \mathcal{F}_t^C)$.
#### Segmentation Network Training
* They train $\mathbf{G}$ with the pixel-wise CE loss $\mathcal{L}_s$ on the source images, image classification loss $\mathcal{L}_c$ and adversarial loss $\mathcal{L}_{adv}^C$ on the target images.
* The combined loss function to train $\mathbf{G}$ is:
$$
\min_{\mathbf{G}} \mathcal{L}_s(X_s) + \lambda_c \mathcal{L}_c(X_t) + \lambda_d \mathcal{L}_{adv}^C(\mathcal{F}_t^C)
\tag{6}
$$
* They follow standard GAN training procedure ([NeurIPS '14](https://papers.nips.cc/paper/5423-generative-adversarial-nets)) to alternatively update $\mathbf{G}$ and $D^C$.
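The alternating schedule can be sketched abstractly: first the discriminators take a step on Eq. (4), then $\mathbf{G}$ takes a step on the combined objective of Eq. (6). Here `update_D`/`update_G` stand in for optimizer steps, and the $\lambda$ values are illustrative placeholders, not the paper's reported hyperparameters:

```python
def train_step(losses, update_D, update_G, lambda_c=0.2, lambda_d=0.001):
    """One alternating GAN-style update.

    losses: dict of current scalar loss terms {"L_d", "L_s", "L_c", "L_adv"};
    update_D / update_G: callables standing in for the optimizer steps.
    """
    L_d = losses["L_d"]
    update_D(L_d)  # min over D^C, Eq. (4)
    # combined objective of Eq. (6), minimized over G
    L_G = losses["L_s"] + lambda_c * losses["L_c"] + lambda_d * losses["L_adv"]
    update_G(L_G)
    return L_d, L_G
```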
### Acquiring Weak Labels
#### Pseudo-Weak Labels (UDA)
* One way is to directly estimate the weak labels using the data available i.e. source images/labels and target images, which is the UDA setting. In this work, they utilize the baseline model ([CVPR '18](https://arxiv.org/abs/1802.10349)) to adapt a model learned from the source to the target domain, and obtain the weak labels of target images as follows:
$$
y_t^c = \begin{cases}
1, & p_t^c > T,\\
0, & \text{otherwise}.
\end{cases}
\tag{7}
$$
* Here, $p_t^c$ is the probability for category $c$ as computed in (1) and $T$ is a threshold, which they set to 0.2 in the experiments.
* They forward a target image through the model and obtain its weak labels using (7) in an online manner. Since this requires no human supervision, it is a UDA setting.
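The thresholding in (7) is a one-liner over the pooled probabilities from (1) (function name mine):

```python
import numpy as np

def pseudo_weak_labels(p_t, T=0.2):
    """Eq. (7): threshold the per-category probabilities p_t (shape (C,))
    to obtain pseudo-weak labels online; the paper uses T = 0.2."""
    return (p_t > T).astype(np.int64)
```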
#### Oracle-Weak Labels (WDA)
* In this, they obtain weak labels by querying a human oracle to provide a list of categories that occur in the target image.
* They further show that their method can use different forms of oracle-weak labels by using point supervision ([ECCV '16](https://arxiv.org/abs/1506.02106)) (which is only slightly more effort compared to image-level supervision).
* In point supervision, they randomly obtain one pixel coordinate for each category present in the image, i.e. the set of tuples $\{ (h^c, w^c, c) \mid \forall y_t^c = 1 \}$. For an image, they compute the loss as follows: $\mathcal{L}_{point}=-\sum_{\forall y_t^c=1}y_t^c \log (O_t^{(h^c, w^c, c)})$, where $O_t \in \mathbb{R}^{H \times W \times C}$ is the target output prediction after pixel-wise softmax.
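The point-supervision loss is just a cross-entropy evaluated at the labeled pixels; a numpy sketch under my own naming:

```python
import numpy as np

def point_supervision_loss(O_t, points):
    """L_point: CE at the oracle-labeled pixels.

    O_t: (H, W, C) pixel-wise softmax output for a target image;
    points: list of (h, w, c) tuples, one labeled pixel per category
    present in the image (y_t^c = 1).
    """
    return float(sum(-np.log(O_t[h, w, c]) for h, w, c in points))
```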
## Conclusion
* In this work, they use weak labels to improve domain adaptation for semantic segmentation in both UDA and WDA settings, with the latter being a novel setting.
* They design an image-level classification module using weak labels, enforcing the network to pay attention to categories present in the image. With this guidance from weak labels, they further utilize a category-wise alignment method to improve adversarial alignment in the feature space.
* Their formulation generalizes to both pseudo-weak and oracle-weak labels.