# Notes on "[Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-Supervision](https://arxiv.org/abs/2004.07703)" ###### tags: `notes` `unsupervised` `domain-adaptation` `segmentation` `self-supervised` `cvpr20` Author: [Akshay Kulkarni](https://akshayk07.weebly.com/) Note: CVPR '20 Oral, [Code](https://github.com/feipan664/IntraDA) ## Brief Outline They propose a two-step self-supervised DA approach to minimize the inter-domain and intra-domain gap together. 1. They conduct inter-domain adaptation and from this, they separate the target domain into an easy and hard split using an entropy-based ranking function. 2. For the intra-domain adaptation, they propose a self-supervised adaptation technique from the easy to the hard split. ## Introduction * Target data collected from real world have diverse scene distributions, caused by various factors such as moving objects, weather conditions, which leads to a large gap within the target (intra-domain gap). * Previous DA works focus more on inter-domain gap, so this paper presents a 2-step DA approach to minimize the inter-domain and the intra-domain gaps. * Their model consists of 3 parts * An inter-domain adaptation module to close the inter-domain gap between labeled source data and unlabeled target data. * An entropy-based ranking system to separate target data into an easy and hard split. * An intra-domain adaptation module to close intra-domain gap between the easy and hard split (using pseudo labels from the easy domain). ## Methodology ![Proposed Method](https://i.imgur.com/usLsH29.jpg) * Let $\mathcal{S}$ denote a source domain consisting of a set of images $\in \mathbb{R}^{H \times W \times 3}$ with their associated ground-truth $C$-class segmentation maps $\in (1, C)^{H \times W}$. Similarly, let $\mathcal{T}$ denote a target domain containing a set of unlabeled images $\in \mathbb{R}^{H \times W \times 3}$. * The first step is inter-domain adaptation, based on common UDA approaches ([Tsai et. al. 2018](https://arxiv.org/abs/1802.10349) and [Vu et. al. 2019](https://arxiv.org/abs/1811.12833)). Then, the pseudo labels and predicted entropy maps are used by an entropy-based ranking system to cluster the target data into the easy and hard split. * The second step is intra-domain adaptation, which consists of aligning the pseudo-labeled easy split with the hard split. The full procedure is illustrated in the figure above. * The proposed network consists of the inter-domain generator and discriminator $\{G_{inter}, D_{inter}\}$, and the intra-domain generator and discriminator $\{G_{intra}, D_{intra}\}$. ### Inter-Domain Adaptation * A sample $X_s \in \mathbb{R}^{H \times W \times 3}$ is from the source domain with it's associated map $Y_s$. Each entry $Y_s^{(h, w)}=[Y_s^{(h, w, c)}]_c$ of $Y_s$ provides a label of a pixel $(h, w)$ as a one-hot vector. * The network $G_{inter}$ takes $X_s$ as input and generates a "soft segmentation map" $P_s=G_{inter}(X_s)$. $G_{inter}$ is optimized in a supervised way by minimizing the CE loss $$ \mathcal{L}_{inter}^{seg}(X_s, Y_s)=-\sum_{h, w}\sum_c Y_s^{(h, w, c)} \log(P_s^{(h, w, c)}) \tag{1} $$ * ADVENT ([Vu et. al. 2019](https://arxiv.org/abs/1811.12833)) assumes that trained models tend to produce over-confident (low entropy) predictions for source-like images, and under-confident (high entropy) predictions for target-like images. Based on this, they propose to utilize entropy maps to align the distribution shift of the features. 
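To make this concrete, below is a minimal PyTorch sketch of the supervised loss in Eq. (1) and of the per-pixel prediction entropy that ADVENT relies on (formalized in Eq. (2) below). This is not the authors' implementation: the tensor shapes, the 19-class default, and averaging over pixels instead of summing are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def source_seg_loss(logits_s, labels_s, num_classes=19):
    """Supervised per-pixel cross-entropy on source images (Eq. 1).

    logits_s: (B, C, H, W) raw scores from G_inter; softmax gives P_s.
    labels_s: (B, H, W) integer ground-truth class indices (Y_s as class ids).
    """
    log_p_s = F.log_softmax(logits_s, dim=1)                         # log P_s
    y_s = F.one_hot(labels_s, num_classes).permute(0, 3, 1, 2).float()  # one-hot Y_s
    # -sum_c Y_s^(h,w,c) log P_s^(h,w,c), averaged over pixels here
    return -(y_s * log_p_s).sum(dim=1).mean()

def entropy_map(logits):
    """Per-pixel prediction entropy used by ADVENT (Eq. 2 below), shape (B, H, W)."""
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-30)).sum(dim=1)
```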
* This paper adopts ADVENT for inter-domain adaptation due to its simplicity and effectiveness. The generator $G_{inter}$ takes a target image $X_t$ as input and produces the segmentation map $P_t=G_{inter}(X_t)$, and the entropy map $I_t$ is formulated as

$$
I_t^{(h, w)}=\sum_c -P_t^{(h, w, c)} \log(P_t^{(h, w, c)}) \tag{2}
$$

* To reduce the inter-domain gap, $D_{inter}$ is trained to predict the domain labels for the entropy maps while $G_{inter}$ is trained to fool $D_{inter}$, and the optimization is achieved via the loss function

$$
\mathcal{L}_{inter}^{adv}(X_s, X_t)=\sum_{h, w} \log(1 - D_{inter}(I_t^{(h, w)}))+\log(D_{inter}(I_s^{(h, w)})) \tag{3}
$$

* Here, $I_s$ is the entropy map of $X_s$. The loss functions (1) and (3) are optimized to align the distribution shift between the source and target domains.

### Entropy-based Ranking
* Some target prediction maps are clean (confident and smooth) while others are noisy, despite being generated from the same model. Since this intra-domain gap exists among target images, a straightforward solution is to decompose the target domain into small subdomains.
* To build these splits, they use entropy maps to determine the confidence levels of target predictions. They rank the predictions using the mean value of the entropy map $I_t$, given by

$$
R(X_t)=\frac{1}{HW}\sum_{h, w}I_t^{(h, w)} \tag{4}
$$

* Let $X_{te}$ and $X_{th}$ denote a target image assigned to the easy and hard splits respectively. For domain separation, they define $\lambda = \frac{|X_{te}|}{|X_t|}$, where $|X_{te}|$ is the cardinality (number of elements) of the easy split and $|X_{t}|$ is the cardinality of the whole target set.
* Note that a threshold value on the entropy is not used for the separation since it would be specific to a dataset. Instead, they choose the ratio $\lambda$ as a hyperparameter, which generalizes well to other datasets (a standalone code sketch of this ranking and split is given at the end of these notes).

#### Entropy Normalization
* Complex scenes (containing many objects) might otherwise be categorized as *hard*.
* For a more *representative ranking*, they adopt a normalization that divides the mean entropy by the number of predicted rare classes in the target image.
* Note that rare classes are pre-defined from the set of all classes (see the results section in the paper for the definition).
* This entropy normalization helps move images with many objects into the easy split.

### Intra-domain Adaptation
* They propose to use the predictions from $G_{inter}$ as pseudo labels for the easy split. Given an image $X_{te}$ from the easy split, the prediction map $P_{te}=G_{inter}(X_{te})$ is a *soft segmentation map*, which is converted to $\mathcal{P}_{te}$ where each entry is a one-hot vector (a short sketch of this step is given below, before Eq. (6)).
* Using these pseudo labels, $G_{intra}$ is optimized by minimizing the CE loss

$$
\mathcal{L}_{intra}^{seg}(X_{te})=-\sum_{h, w}\sum_c \mathcal{P}_{te}^{(h, w, c)} \log(G_{intra}(X_{te})^{(h, w, c)}) \tag{5}
$$

* An image $X_{th}$ from the hard split gives the segmentation map $P_{th}=G_{intra}(X_{th})$ (note that this is the $G_{intra}$ being trained, not the fixed $G_{inter}$) and the entropy map $I_{th}$.
* To close the intra-domain gap, the intra-domain discriminator $D_{intra}$ is trained to predict the split labels of $I_{te}$ (easy split) and $I_{th}$ (hard split), and $G_{intra}$ is trained to fool $D_{intra}$.
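Before the adversarial loss of this step is written down in Eq. (6) below, here is a rough PyTorch sketch of the pseudo-labeling and the self-supervised loss of Eq. (5). It is a sketch under assumptions, not the released code: the hard-argmax conversion to one-hot labels, the tensor shapes, and the 19-class default are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(g_inter, x_te, num_classes=19):
    """Turn the fixed G_inter's soft map P_te into one-hot pseudo labels (the map written as \\mathcal{P}_te above)."""
    p_te = F.softmax(g_inter(x_te), dim=1)   # (B, C, H, W) soft segmentation map
    hard = p_te.argmax(dim=1)                # most confident class per pixel
    return F.one_hot(hard, num_classes).permute(0, 3, 1, 2).float()

def intra_seg_loss(g_intra, x_te, pseudo_te):
    """Self-supervised cross-entropy on the easy split (Eq. 5), averaged over pixels."""
    log_q = F.log_softmax(g_intra(x_te), dim=1)
    return -(pseudo_te * log_q).sum(dim=1).mean()
```

Per the staged training described later, the pseudo labels would be produced once by the frozen $G_{inter}$ after the first stage and then kept fixed while $G_{intra}$ is trained.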
* The adversarial loss for this game is formulated as

$$
\mathcal{L}_{intra}^{adv}(X_{te}, X_{th})=\sum_{h, w} \log(1 - D_{intra}(I_{th}^{(h, w)}))+\log(D_{intra}(I_{te}^{(h, w)})) \tag{6}
$$

* The complete loss function $\mathcal{L}$ is the sum of the four losses in Equations (1), (3), (5), and (6), and the objective is to learn a target model $G$ according to

$$
G^*=\arg\min_{G_{intra}} \min_{G_{inter}} \max_{\substack{D_{inter} \\ D_{intra}}} \mathcal{L} \tag{7}
$$

* Since the proposed method is a 2-step self-supervised approach, it is difficult to train in a single stage. They choose to minimize the objective in 3 stages:
    1. Train the inter-domain adaptation model to optimize $G_{inter}$ and $D_{inter}$.
    2. Generate target pseudo labels using $G_{inter}$ and rank all target images based on $R(X_t)$.
    3. Train the intra-domain adaptation model to optimize $G_{intra}$ and $D_{intra}$.

**Note**: See Section 4.3 of the paper for a theoretical analysis (which I am not able to fully understand yet).

## Conclusion
* A self-supervised DA approach is proposed to minimize the inter-domain and intra-domain gaps together.
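For completeness, here is a standalone sketch of stage 2, the entropy-based ranking and easy/hard split, combining Eqs. (2) and (4) with the ratio $\lambda$ and the optional rare-class normalization. It is a hedged sketch rather than the released IntraDA code: the data-loader interface, the $\lambda = 0.67$ default, and the rare-class handling are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_and_split(g_inter, target_loader, lam=0.67, rare_classes=None):
    """Rank target images by mean prediction entropy (Eq. 4) and split them
    into an easy and a hard subset using the ratio lambda.

    target_loader is assumed to yield (name, image) pairs with image of shape
    (1, 3, H, W); rare_classes is an optional iterable of class indices used
    for the entropy normalization described in the notes above.
    """
    names, scores = [], []
    for name, x_t in target_loader:
        p_t = F.softmax(g_inter(x_t), dim=1)               # soft map P_t, (1, C, H, W)
        ent = -(p_t * torch.log(p_t + 1e-30)).sum(dim=1)   # entropy map I_t (Eq. 2)
        r = ent.mean().item()                               # R(X_t), Eq. 4
        if rare_classes:
            pred = p_t.argmax(dim=1)
            n_rare = sum(int((pred == c).any()) for c in rare_classes)
            r /= max(n_rare, 1)                             # divide by # predicted rare classes
        names.append(name)
        scores.append(r)
    order = sorted(range(len(names)), key=lambda i: scores[i])  # lowest entropy first
    n_easy = int(lam * len(names))
    easy = [names[i] for i in order[:n_easy]]
    hard = [names[i] for i in order[n_easy:]]
    return easy, hard
```

The two returned lists of image names would then feed the easy-split and hard-split data loaders used in stage 3.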