# Notes on "[Fourier Domain Adaptation for Semantic Segmentation](https://openaccess.thecvf.com/content_CVPR_2020/papers/Yang_FDA_Fourier_Domain_Adaptation_for_Semantic_Segmentation_CVPR_2020_paper.pdf)"

###### tags: `notes` `unsupervised` `domain-adaptation` `segmentation`

Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)

## Brief Outline

A simple method for UDA in which the discrepancy between the source and target distributions is reduced by swapping the low-frequency spectrum of one with the other. The method is illustrated for semantic segmentation.

## Introduction

- Unsupervised domain adaptation (UDA) refers to adapting a model trained with annotated samples from one distribution (source) to operate on a different (target) distribution for which no annotations are given.
- Simply training the model on the source data does not yield satisfactory performance on the target data, due to covariate shift.
- In some cases, perceptually insignificant changes in the low-level statistics can cause significant deterioration in the performance of the trained model, unless UDA is performed.
- They explore the hypothesis that a simple alignment of the low-level statistics between the source and target distributions can improve UDA, without needing any training beyond the primary task.
- They compute the FFT of each input image, replace the low-frequency amplitudes of the source images with those of the target images, reconstitute the image using the inverse FFT (iFFT), and train with the original annotations of the source domain.

![Spectral Transfer](https://i.imgur.com/zY9QS3p.jpg)

- Since their method surpasses the adversarially trained SOTA UDA methods, they claim that such a simple method is more effective for managing low-level nuisance variability than sophisticated adversarial training.
- The motivation is that the low-level amplitude spectrum can vary significantly without affecting the perception of high-level semantics.
- Using such a method, known nuisance variability (like rescaling of the color map or non-linear contrast changes) can be dealt with at the outset, without the need to learn it through complex adversarial training.
- This is important since networks don't transfer well across different low-level statistics, according to [Achille et al. 2019](https://arxiv.org/abs/1711.08856).
- DA and self-supervised learning (SSL) are closely related. When the domains are aligned, UDA becomes SSL. CBST ([Zou et al. 2018](https://arxiv.org/abs/1810.07911)) and BDL ([Li et al. 2019](https://arxiv.org/abs/1904.10620)) used *self-training* as regularization, exploiting target images by treating pseudo-labels as ground truth. This work also uses this method.
- ADVENT ([Vu et al. 2019](https://arxiv.org/abs/1811.12833)) minimizes both the entropy of the pixel-wise predictions and the adversarial loss of the entropy maps. This work also uses *entropy minimization* to regularize the segmentation training.
- Inspired by [Tarvainen and Valpola, 2017](https://arxiv.org/abs/1703.01780), [Laine and Aila, 2017](https://arxiv.org/abs/1610.02242) and [French et al. 2018](https://arxiv.org/abs/1706.05208), they average the outputs of different models trained with different spectral domain sizes, which fosters multi-band transfer (detailed later).

## Methodology

- In UDA, we are given a source dataset $D^s= \{ (x_i^s, y_i^s) \sim P(x^s, y^s)\}_{i=1}^{N_s}$, where $x^s \in \mathbb{R}^{H \times W \times 3}$ is a color image and $y^s \in \mathbb{R}^{H \times W}$ is the semantic map associated with $x^s$.
- Similarly, $D^t=\{ x_i^t \}_{i=1}^{N_t}$ is the target dataset, where ground truth semantic labels are absent.

### Fourier Domain Adaptation

- Let $\mathcal{F}^A, \mathcal{F}^P: \mathbb{R}^{H \times W \times 3} \rightarrow\mathbb{R}^{H \times W \times 3}$ be the amplitude and phase components of the Fourier transform $\mathcal{F}$ of an RGB image, i.e. for a single channel image $x$, we have

$$
\mathcal{F}(x)(m, n)=\sum_{h,w}x(h,w)e^{-j2\pi(\frac{h}{H}m + \frac{w}{W}n)} \tag{1}
$$

- where $j^2=-1$. This can be efficiently implemented using the FFT ([Frigo and Johnson, 1998](http://www.fftw.org/fftw-paper-icassp.pdf)). Accordingly, $\mathcal{F}^{-1}$ is the inverse Fourier transform, which maps the spectral signals (phase and amplitude) back to image space.
- Further, a mask $M_\beta$, $\beta \in(0, 1)$, is defined whose value is zero except for the centre region:

$$
M_\beta= \unicode{x1D7D9}_{(h, w)\in[-\beta H:\beta H, -\beta W: \beta W]} \tag{2}
$$

- where the centre of the image is assumed to be $(0, 0)$. Note that $\beta$ is not measured in pixels, so the choice of $\beta$ does not depend on image size or resolution.
- Given two randomly sampled images $x^s \sim D^s, x^t \sim D^t$, FDA can be formalized as

$$
x^{s \rightarrow t} = \mathcal{F}^{-1}([M_\beta \circ \mathcal{F}^A(x^t) + (1 - M_\beta)\circ \mathcal{F}^A(x^s), \mathcal{F}^P(x^s)]) \tag{3}
$$

- where the low-frequency part of the amplitude of the source image $\mathcal{F}^A(x^s)$ is replaced by that of the target image $x^t$.
- Then, the modified spectral representation of $x^s$ (with its phase component unaltered) is mapped back to the image $x^{s \rightarrow t}$, whose content is the same as $x^s$ but whose appearance resembles a sample from $D^t$. This is shown in the first image above.

#### Choice of $\beta$

![effect of beta](https://i.imgur.com/0bQzSVU.jpg)

- As $\beta$ increases to 1, the image $x^{s \rightarrow t}$ approaches the target image $x^t$, but also exhibits visible artifacts (as seen in the above image). So, they use $\beta \leq 0.15$.
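To make the spectral swap concrete, below is a minimal NumPy sketch of Eq. 3. The function name, the default `beta`, and the centring convention (shifting the spectrum with `fftshift` so that $M_\beta$ is a centred square) are my own choices for illustration; the authors' released implementation may organize this differently.

```python
import numpy as np

def fda_source_to_target(src, trg, beta=0.09):
    """Sketch of Eq. 3: swap the low-frequency amplitude of src with trg's.

    src, trg: float arrays of shape (H, W, C) with the same spatial size.
    beta: half-width of the swapped square as a fraction of H and W (Eq. 2).
    """
    # Per-channel 2D FFT over the spatial dimensions.
    fft_src = np.fft.fft2(src, axes=(0, 1))
    fft_trg = np.fft.fft2(trg, axes=(0, 1))

    # Split into amplitude and phase; only the amplitude is modified.
    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_trg = np.abs(fft_trg)

    # Shift the zero frequency to the centre so M_beta is a centred square.
    amp_src = np.fft.fftshift(amp_src, axes=(0, 1))
    amp_trg = np.fft.fftshift(amp_trg, axes=(0, 1))

    H, W = src.shape[:2]
    b_h, b_w = int(beta * H), int(beta * W)
    c_h, c_w = H // 2, W // 2

    # M_beta: replace the low-frequency source amplitude with the target's.
    amp_src[c_h - b_h:c_h + b_h, c_w - b_w:c_w + b_w] = \
        amp_trg[c_h - b_h:c_h + b_h, c_w - b_w:c_w + b_w]

    amp_src = np.fft.ifftshift(amp_src, axes=(0, 1))

    # Recombine with the unaltered source phase and invert the FFT.
    x_s2t = np.fft.ifft2(amp_src * np.exp(1j * pha_src), axes=(0, 1))
    return np.real(x_s2t)
```

During training, each source image would be paired with a randomly drawn target image and transformed this way before being fed to the segmentation network together with its original source labels.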
### FDA for Semantic Segmentation

- Given the adapted source dataset $D^{s \rightarrow t}$, a semantic segmentation network $\phi^w$ (with parameters $w$) can be trained by minimizing the cross-entropy (CE) loss

$$
\mathcal{L}_{ce}(\phi^w;D^{s \rightarrow t})=-\sum_i\langle y_i^s, \log(\phi^w(x_i^{s \rightarrow t}))\rangle \tag{4}
$$

- Note: since $D^{s \rightarrow t}$ can theoretically contain $N_s \times N_t$ examples (which can become very large), they generate $D^{s \rightarrow t}$ online by random pairing, so only a smaller number of samples is actually used.
- Since FDA aligns the two domains, the problem becomes one of SSL. The key to SSL is the regularization model. As a criterion, they use a penalty on the decision boundary crossing clusters in the unlabeled space.
- This can be achieved, assuming class separation, by penalizing the decision boundary traversing regions densely populated by data points, which can be done by minimizing the prediction entropy on the target images.
- However, ADVENT ([Vu et al. 2019](https://arxiv.org/abs/1811.12833)) notes that this is ineffective in regions that already have low entropy. So, they use the Charbonnier penalty function ([Bruhn and Weickert, 2005](http://www.wisdom.weizmann.ac.il/~vision/courses/2006_2/papers/optic_flow_multigrid/bruhn_iccv05.pdf)), defined as $\rho(x) = (x^2 + 0.001^2)^\eta$, as a weighting function for entropy minimization

$$
\mathcal{L}_{ent}(\phi^w;D^t)=\sum_i \rho(-\langle\phi^w(x_i^t), \log(\phi^w(x_i^t)) \rangle) \tag{5}
$$

- The penalty function penalizes high-entropy predictions more than low-entropy ones for $\eta > 0.5$, as shown below. They use this instead of penalizing only high-entropy predictions using a threshold.

![Penalty function](https://i.imgur.com/Eh32yq2.png)

- Combining this with the segmentation loss on the adapted source images, we can use the following overall loss to train the semantic segmentation network $\phi^w$ from scratch

$$
\mathcal{L}(\phi^w; D^{s\rightarrow t}, D^t)=\mathcal{L}_{ce}(\phi^w;D^{s\rightarrow t}) + \lambda_{ent}\mathcal{L}_{ent}(\phi^w; D^t) \tag{6}
$$
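As a rough illustration of Eq. 5, the sketch below computes the per-pixel prediction entropy on a target image and applies the Charbonnier penalty $\rho$ to it. The function name and the default $\eta = 2.0$ are assumptions made for the example; the notes only require $\eta > 0.5$ for the intended weighting behaviour.

```python
import numpy as np

def charbonnier_weighted_entropy(prob, eta=2.0, eps=1e-3):
    """Sketch of Eq. 5: Charbonnier-weighted entropy of target predictions.

    prob: softmax output of shape (H, W, K), per-pixel class probabilities.
    Returns the scalar loss summed over all pixels.
    """
    # Per-pixel entropy -<p, log p>; the tiny constant avoids log(0).
    ent = -np.sum(prob * np.log(prob + 1e-30), axis=-1)
    # rho(x) = (x^2 + eps^2)^eta penalizes high-entropy pixels more for eta > 0.5.
    return np.sum((ent ** 2 + eps ** 2) ** eta)
```

In training, this term would be scaled by $\lambda_{ent}$ and added to the cross-entropy loss on $D^{s\rightarrow t}$, as in Eq. 6.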
### Self-Supervised Training

- Self-supervised training (or self-training) is a common way of attempting to boost the performance of SSL by using highly confident pseudo-labels predicted by $\phi^w$ as if they were ground truth.
- Inspired by [Tarvainen and Valpola, 2017](https://arxiv.org/abs/1703.01780), they use the mean of the predictions of multiple models to regularize the self-learning.
- They train $M=3$ segmentation networks $\phi_{\beta_m}^w, m=1,2,3$ from scratch using Eq. 6, and the mean prediction for a certain target image $x_i^t$ is

$$
\hat{y}_i^t = \arg \max_k \frac{1}{M} \sum_m \phi_{\beta_m}^w(x_i^t) \tag{7}
$$

- Using these pseudo-labels generated by the $M$ models, $\phi_\beta^w$ can be trained for further improvement using the following self-supervised loss

$$
\mathcal{L}_{sst}(\phi^w; D^{s\rightarrow t}, D^t, \hat{D}^t) = \mathcal{L}_{ce}(\phi^w;D^{s\rightarrow t}) + \lambda_{ent}\mathcal{L}_{ent}(\phi^w; D^t) + \mathcal{L}_{ce}(\phi^w;\hat{D}^t) \tag{8}
$$

- where $\hat{D}^t$ is $D^t$ augmented with the pseudo-labels $\hat{y}_i^t$. Since this involves different $\beta$'s in the FDA operation, they refer to this self-supervised training using the mean prediction of different segmentation networks as Multi-band Transfer (MBT). A short sketch of the pseudo-labelling step appears at the end of these notes.
- The full training procedure of the FDA semantic segmentation network consists of one round of initial training of the $M$ models from scratch using Eq. 6, and two more rounds of self-supervised training using Eq. 8.

## Conclusion

- A simple method for domain alignment is proposed, which does not require any learning and can be easily integrated into learning systems that transform UDA into SSL.
- They use regularizers in both entropy minimization and self-training. Their MBT technique does not require joint training of networks or any complicated model selection.
- Despite its simplicity, the method outperforms SOTA, which suggests that some of the distributional misalignment due to low-level statistics (which wreaks havoc on generalization across domains) is simple to capture using the FFT.
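Closing with a small sketch of the MBT pseudo-labelling step in Eq. 7: the softmax outputs of the $M$ networks (trained with different $\beta$) are averaged and the per-pixel argmax is taken. The function name is hypothetical, and in practice one would typically also discard pixels whose mean confidence is low, which this sketch omits.

```python
import numpy as np

def mbt_pseudo_labels(probs):
    """Sketch of Eq. 7: average M softmax outputs and take the argmax.

    probs: list of M arrays of shape (H, W, K), one per network.
    Returns an (H, W) integer map of pseudo-labels used in Eq. 8.
    """
    mean_prob = np.mean(np.stack(probs, axis=0), axis=0)
    return np.argmax(mean_prob, axis=-1)
```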