# Notes on "[Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-Supervision](https://arxiv.org/abs/2004.07703)"
###### tags: `notes` `unsupervised` `domain-adaptation` `segmentation` `self-supervised` `cvpr20`
Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)
Note: CVPR '20 Oral, [Code](https://github.com/feipan664/IntraDA)
## Brief Outline
They propose a two-step self-supervised DA approach to minimize the inter-domain and intra-domain gaps together.
1. They conduct inter-domain adaptation and from this, they separate the target domain into an easy and hard split using an entropy-based ranking function.
2. For the intra-domain adaptation, they propose a self-supervised adaptation technique from the easy to the hard split.
## Introduction
* Target data collected from the real world have diverse scene distributions, caused by factors such as moving objects and changing weather conditions, which leads to a large gap within the target domain itself (the intra-domain gap).
* Previous DA works focus more on inter-domain gap, so this paper presents a 2-step DA approach to minimize the inter-domain and the intra-domain gaps.
* Their model consists of 3 parts
* An inter-domain adaptation module to close the inter-domain gap between labeled source data and unlabeled target data.
* An entropy-based ranking system to separate target data into an easy and hard split.
* An intra-domain adaptation module to close intra-domain gap between the easy and hard split (using pseudo labels from the easy domain).
## Methodology
![Proposed Method](https://i.imgur.com/usLsH29.jpg)
* Let $\mathcal{S}$ denote a source domain consisting of a set of images $\in \mathbb{R}^{H \times W \times 3}$ with their associated ground-truth $C$-class segmentation maps $\in (1, C)^{H \times W}$. Similarly, let $\mathcal{T}$ denote a target domain containing a set of unlabeled images $\in \mathbb{R}^{H \times W \times 3}$.
* The first step is inter-domain adaptation, based on common UDA approaches ([Tsai et. al. 2018](https://arxiv.org/abs/1802.10349) and [Vu et. al. 2019](https://arxiv.org/abs/1811.12833)). Then, the pseudo labels and predicted entropy maps are used by an entropy-based ranking system to cluster the target data into the easy and hard split.
* The second step is intra-domain adaptation, which consists of aligning the pseudo-labeled easy split with the hard split. The full procedure is illustrated in the figure above.
* The proposed network consists of the inter-domain generator and discriminator $\{G_{inter}, D_{inter}\}$, and the intra-domain generator and discriminator $\{G_{intra}, D_{intra}\}$.
### Inter-Domain Adaptation
* A sample $X_s \in \mathbb{R}^{H \times W \times 3}$ is from the source domain with its associated map $Y_s$. Each entry $Y_s^{(h, w)}=[Y_s^{(h, w, c)}]_c$ of $Y_s$ provides the label of pixel $(h, w)$ as a one-hot vector.
* The network $G_{inter}$ takes $X_s$ as input and generates a "soft segmentation map" $P_s=G_{inter}(X_s)$. $G_{inter}$ is optimized in a supervised way by minimizing the CE loss
$$
\mathcal{L}_{inter}^{seg}(X_s, Y_s)=-\sum_{h, w}\sum_c Y_s^{(h, w, c)} \log(P_s^{(h, w, c)})
\tag{1}
$$
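As a rough PyTorch sketch of Eq. 1 (names like `inter_seg_loss` and `pred_probs` are illustrative, not taken from the paper or its code), assuming $G_{inter}$ ends in a softmax so that $P_s$ contains per-pixel class probabilities:

```python
import torch

def inter_seg_loss(pred_probs, onehot_labels, eps=1e-8):
    """Eq. 1: cross-entropy between the soft segmentation map P_s and the one-hot
    ground truth Y_s, summed over classes and pixels (shapes: N x C x H x W)."""
    return -(onehot_labels * torch.log(pred_probs + eps)).sum()
```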
* ADVENT ([Vu et. al. 2019](https://arxiv.org/abs/1811.12833)) assumes that trained models tend to produce over-confident (low entropy) predictions for source-like images, and under-confident (high entropy) predictions for target-like images. Based on this, they propose to utilize entropy maps to align the distribution shift of the features.
* This paper adopts ADVENT for inter-domain adaptation due to its simplicity and effectiveness. The generator $G_{inter}$ takes a target image $X_t$ as input and produces the segmentation map $P_t=G_{inter}(X_t)$, and the entropy map $I_t$ is formulated as
$$
I_t^{(h, w)}=\sum_c -P_t^{(h, w, c)} \log(P_t^{(h, w, c)})
\tag{2}
$$
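A corresponding sketch for the entropy map of Eq. 2, again assuming softmax probabilities of shape $N \times C \times H \times W$:

```python
import torch

def entropy_map(pred_probs, eps=1e-8):
    """Eq. 2: per-pixel entropy of a soft segmentation map (N x C x H x W);
    summing over the class dimension gives a map of shape (N, H, W)."""
    return -(pred_probs * torch.log(pred_probs + eps)).sum(dim=1)
```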
* To reduce the inter-domain gap, $D_{inter}$ is trained to predict the domain labels for the entropy maps while $G_{inter}$ is trained to fool $D_{inter}$, and the optimization is achieved via the loss function
$$
\mathcal{L}_{inter}^{adv}(X_s;X_t)=\sum_{h, w} \log(1 - D_{inter}(I_t^{(h, w)}))+\log(D_{inter}(I_s^{(h, w)}))
\tag{3}
$$
* Here, $I_s$ is the entropy map of $X_s$. The losses in Eqs. 1 and 3 are optimized jointly to align the distribution shift between the source and target domains.
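A hedged sketch of how this min-max game could be implemented. A binary cross-entropy formulation is used here, and the discriminator is assumed to take the 1-channel entropy maps directly; both choices may differ from the official IntraDA code.

```python
import torch
import torch.nn.functional as F

def inter_adv_losses(d_inter, ent_src, ent_tgt):
    """Sketch of Eq. 3 in a binary cross-entropy form.

    ent_src, ent_tgt: entropy maps of shape (N, H, W); d_inter is assumed to be a
    fully convolutional discriminator returning per-location domain logits."""
    # Discriminator step: source entropy maps are labeled 1, target entropy maps 0.
    d_src = d_inter(ent_src.unsqueeze(1).detach())
    d_tgt = d_inter(ent_tgt.unsqueeze(1).detach())
    loss_d = F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src)) \
           + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt))
    # Adversarial step for G_inter: push target entropy maps toward the source label.
    d_tgt_g = d_inter(ent_tgt.unsqueeze(1))
    loss_g_adv = F.binary_cross_entropy_with_logits(d_tgt_g, torch.ones_like(d_tgt_g))
    return loss_d, loss_g_adv
```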
### Entropy-based Ranking
* Some target prediction maps are clean (confident and smooth) while others are noisy, despite being generated from the same model. Since this intra-domain gap exists among target images, a straightforward solution is to decompose the target domain into small subdomains.
* To build these splits, they use entropy maps to determine the confidence levels of target predictions. They rank the predictions using the mean value of the entropy map $I_t$ given by
$$
R(X_t)=\frac{1}{HW}\sum_{h, w}I_t^{(h, w)}
\tag{4}
$$
* Let $X_{te}$ and $X_{th}$ denote target images assigned to the easy and hard splits, respectively. For domain separation, they define the ratio $\lambda = \frac{|X_{te}|}{|X_t|}$, where $|X_{te}|$ is the cardinality (number of images) of the easy split and $|X_t|$ is the cardinality of the whole target set.
* Note that a fixed threshold value for separation is not used since it would be specific to a dataset. Instead, they treat the ratio $\lambda$ as a hyperparameter, which generalizes well to other datasets.
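A minimal sketch of this ranking and splitting step; here $R(X_t)$ is assumed to be precomputed per image (e.g. by averaging the `entropy_map` output from the sketch above), and `rank_and_split` is an illustrative name:

```python
def rank_and_split(mean_entropies, image_ids, lam):
    """Eq. 4 plus the lambda-based separation: sort target images by their mean
    prediction entropy R(X_t) and assign the lowest-entropy fraction `lam` to the
    easy split and the rest to the hard split."""
    order = sorted(range(len(image_ids)), key=lambda i: mean_entropies[i])
    n_easy = int(lam * len(image_ids))
    easy = [image_ids[i] for i in order[:n_easy]]
    hard = [image_ids[i] for i in order[n_easy:]]
    return easy, hard
```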
#### Entropy Normalization
* Complex scenes (containing many objects) tend to have higher mean prediction entropy and might therefore be unfairly categorized as *hard*.
* For a more *representative ranking*, they adopt a normalization that divides the mean entropy by the number of predicted rare classes in the target image.
* Note that rare classes are pre-defined from the set of all classes (see results section in the paper for definition).
* The entropy normalization helps to move images with many objects to the easy split.
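A hedged sketch of the normalized ranking score; the handling of images with no predicted rare class is an assumption made here, not something stated in the notes:

```python
def normalized_rank(ent_map, pred_labels, rare_class_ids):
    """Normalized ranking score: mean entropy divided by the number of pre-defined
    rare classes appearing in the predicted label map (pred_labels: H x W class ids)."""
    mean_entropy = ent_map.mean()
    present = set(pred_labels.unique().tolist())
    n_rare = len(present & set(rare_class_ids))
    # Guarding against zero predicted rare classes is an assumption made here;
    # the paper's exact handling of that case is not described in these notes.
    return mean_entropy / max(n_rare, 1)
```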
### Intra-domain Adaptation
* They propose to use the predictions from $G_{inter}$ as pseudo labels for the easy split. Given an image $X_{te}$ from the easy split, the prediction map $P_{te}=G_{inter}(X_{te})$ is a *soft segmentation map*, which is converted to $\mathcal{P}_{te}$, where each entry is a one-hot vector.
* Using these pseudo-labels, $G_{intra}$ is optimized by minimizing the CE loss
$$
\mathcal{L}_{intra}^{seg}(X_{te})=-\sum_{h, w}\sum_c \mathcal{P}_{te}^{(h, w, c)} \log(G_{intra}(X_{te})^{(h, w, c)})
\tag{5}
$$
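A sketch of the pseudo-label generation, assumed here to be a plain per-pixel argmax (the notes do not mention any confidence filtering). Eq. 5 then has the same form as Eq. 1, so the `inter_seg_loss` sketch above can be reused with $G_{intra}(X_{te})$ and these pseudo labels.

```python
import torch.nn.functional as F

def make_pseudo_labels(soft_map):
    """Turn the soft map P_te from the frozen G_inter (N x C x H x W) into one-hot
    pseudo labels for Eq. 5 via a per-pixel argmax over classes."""
    hard = soft_map.argmax(dim=1)                      # (N, H, W) class indices
    num_classes = soft_map.shape[1]
    return F.one_hot(hard, num_classes).permute(0, 3, 1, 2).float()
```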
* An image $X_{th}$ from the hard split gives the segmentation map $P_{th}=G_{intra}(X_{th})$ (produced by the $G_{intra}$ being trained, not by the fixed $G_{inter}$) and the corresponding entropy map $I_{th}$.
* To close the intra-domain gap, the intra-domain discriminator $D_{intra}$ is trained to predict the split labels of $I_{te}$ (easy split) and $I_{th}$ (hard split), while $G_{intra}$ is trained to fool $D_{intra}$. The adversarial loss can be formulated as
$$
\mathcal{L}_{intra}^{adv}(X_{te},X_{th})=\sum_{h, w} \log(1 - D_{intra}(I_{th}^{(h, w)}))+\log(D_{intra}(I_{te}^{(h, w)}))
\tag{6}
$$
* The complete loss function $\mathcal{L}$ is the sum of the four losses in Eqs. 1, 3, 5, and 6, and the objective is to learn the target model according to
$$
G^*=\text{argmin}_{G_{intra}} \min_{\substack{G_{inter} \\ G_{intra}}} \max_{\substack{D_{inter} \\ D_{intra}}} \mathcal{L}
\tag{7}
$$
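As a small illustrative helper for the combined objective (the plain sum follows the description above; the weight `w_adv` is an assumption, since the paper may weight the adversarial terms differently):

```python
def total_loss(l_inter_seg, l_inter_adv, l_intra_seg, l_intra_adv, w_adv=1.0):
    """The complete loss: sum of the two segmentation losses (Eqs. 1 and 5) and the
    two adversarial losses (Eqs. 3 and 6). The weight `w_adv` is an assumption; the
    notes describe a plain sum, and the paper may use specific loss weightings."""
    return l_inter_seg + l_intra_seg + w_adv * (l_inter_adv + l_intra_adv)
```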
* Since the proposed method is a 2-step self-supervised approach, it is difficult to train in a single stage. They instead minimize the objective in 3 stages as follows:
1. Train the inter-domain adaptation model to optimize $G_{inter}$ and $D_{inter}$.
2. Generate target pseudo labels using $G_{inter}$ and rank all target images based on $R(X_t)$.
3. Train the intra-domain adaptation model to optimize $G_{intra}$ and $D_{intra}$.
**Note**: See Section 4.3 of the paper for theoretical analysis (not able to understand yet).
## Conclusion
* A self-supervised DA approach is proposed to minimize the inter-domain and intra-domain gaps simultaneously.