# Notes on "[Fourier Domain Adaptation for Semantic Segmentation](https://openaccess.thecvf.com/content_CVPR_2020/papers/Yang_FDA_Fourier_Domain_Adaptation_for_Semantic_Segmentation_CVPR_2020_paper.pdf)"

###### tags: `notes` `unsupervised` `domain-adaptation` `segmentation`

Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)

## Brief Outline

A simple method for UDA in which the discrepancy between the source and target distributions is reduced by swapping the low-frequency spectrum of one with the other. The method is illustrated for semantic segmentation.

## Introduction

- Unsupervised domain adaptation (UDA) refers to adapting a model trained with annotated samples from one distribution (source) to operate on a different (target) distribution for which no annotations are given.
- Simply training the model on the source data does not yield satisfactory performance on the target data, due to covariate shift.
- In some cases, perceptually insignificant changes in the low-level statistics can cause significant deterioration in the performance of the trained model, unless UDA is performed.
- They explore the hypothesis that a simple alignment of the low-level statistics between the source and target distributions can improve UDA, without needing any training beyond the primary task.
- They compute the FFT of each input image, replace the low-frequency amplitudes of the source images with those of the target images, reconstitute the image using the inverse FFT (iFFT), and train with the original annotations of the source domain.

![Spectral Transfer](https://i.imgur.com/zY9QS3p.jpg)

- Since their method surpasses the adversarially trained SOTA UDA methods, they claim that such a simple method is more effective for managing low-level nuisance variability than sophisticated adversarial training.
- The motivation is that the low-level amplitude spectrum can vary significantly without affecting the perception of high-level semantics.
- Using such a method, known nuisance variability (like rescaling of the color map or non-linear contrast changes) can be dealt with at the outset, without the need to learn it through complex adversarial training.
- This is important since networks don't transfer well across different low-level statistics, according to [Achille et al. 2019](https://arxiv.org/abs/1711.08856).
- DA and self-supervised learning (SSL) are closely related. When the domains are aligned, UDA becomes SSL. CBST ([Zou et al. 2018](https://arxiv.org/abs/1810.07911)) and BDL ([Li et al. 2019](https://arxiv.org/abs/1904.10620)) used *self-training* as regularization, exploiting target images by treating pseudo-labels as ground truth. This work also uses this method.
- ADVENT ([Vu et al. 2019](https://arxiv.org/abs/1811.12833)) minimizes both the entropy of the pixel-wise predictions and the adversarial loss of the entropy maps. This work also uses *entropy minimization* to regularize the segmentation training.
- Inspired by [Tarvainen and Valpola, 2017](https://arxiv.org/abs/1703.01780), [Laine and Aila, 2017](https://arxiv.org/abs/1610.02242) and [French et al. 2018](https://arxiv.org/abs/1706.05208), they average the outputs of different models trained with different spectral domain sizes, which fosters multi-band transfer (detailed later).

## Methodology

- In UDA, we are given a source dataset $D^s= \{ (x_i^s, y_i^s) \sim P(x^s, y^s)\}_{i=1}^{N_s}$, where $x^s \in \mathbb{R}^{H \times W \times 3}$ is a color image and $y^s \in \mathbb{R}^{H \times W}$ is the semantic map associated with $x^s$.
- Similarly, $D^t=\{ x_i^t \}_{i=1}^{N_t}$ is the target dataset, where ground truth semantic labels are absent.

### Fourier Domain Adaptation

- Let $\mathcal{F}^A, \mathcal{F}^P: \mathbb{R}^{H \times W \times 3} \rightarrow\mathbb{R}^{H \times W \times 3}$ be the amplitude and phase components of the Fourier transform $\mathcal{F}$ of an RGB image, i.e. for a single channel image $x$, we have

$$
\mathcal{F}(x)(m, n)=\sum_{h,w}x(h,w)e^{-j2\pi(\frac{h}{H}m + \frac{w}{W}n)} \tag{1}
$$

- where $j^2=-1$. This can be efficiently implemented using the FFT ([Frigo and Johnson, 1998](http://www.fftw.org/fftw-paper-icassp.pdf)). Accordingly, $\mathcal{F}^{-1}$ is the inverse Fourier transform, which maps the spectral signals (phase and amplitude) back to image space.
- Further, a mask $M_\beta$, $\beta \in(0, 1)$, is defined whose value is zero except for the centre region:

$$
M_\beta= \unicode{x1D7D9}_{(h, w)\in[-\beta H:\beta H, -\beta W: \beta W]} \tag{2}
$$

- where the centre of the image is assumed to be $(0, 0)$. Note that $\beta$ is not measured in pixels, so the choice of $\beta$ does not depend on image size or resolution.
- Given two randomly sampled images $x^s \sim D^s, x^t \sim D^t$, FDA can be formalized as

$$
x^{s \rightarrow t} = \mathcal{F}^{-1}([M_\beta \circ \mathcal{F}^A(x^t) + (1 - M_\beta)\circ \mathcal{F}^A(x^s), \mathcal{F}^P(x^s)]) \tag{3}
$$

- where the low-frequency part of the amplitude of the source image $\mathcal{F}^A(x^s)$ is replaced by that of the target image $x^t$.
- Then, the modified spectral representation of $x^s$ (with its phase component unaltered) is mapped back to the image $x^{s \rightarrow t}$, whose content is the same as $x^s$ but whose appearance resembles a sample from $D^t$. This is shown in the first image above.

#### Choice of $\beta$

![effect of beta](https://i.imgur.com/0bQzSVU.jpg)

- As $\beta$ increases to 1, the image $x^{s \rightarrow t}$ approaches the target image $x^t$, but also exhibits visible artifacts (as seen in the above image). So, they use $\beta \leq 0.15$.
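To make the spectral swap concrete, below is a minimal NumPy sketch of Eq. 3. The function name, the default `beta`, and the centring convention (shifting the spectrum with `fftshift` so that $M_\beta$ is a centred square) are my own choices for illustration; the authors' released implementation may organize this differently.

```python
import numpy as np

def fda_source_to_target(src, trg, beta=0.09):
    """Sketch of Eq. 3: swap the low-frequency amplitude of src with trg's.

    src, trg: float arrays of shape (H, W, C) with the same spatial size.
    beta: half-width of the swapped square as a fraction of H and W (Eq. 2).
    """
    # Per-channel 2D FFT over the spatial dimensions.
    fft_src = np.fft.fft2(src, axes=(0, 1))
    fft_trg = np.fft.fft2(trg, axes=(0, 1))

    # Split into amplitude and phase; only the amplitude is modified.
    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_trg = np.abs(fft_trg)

    # Shift the zero frequency to the centre so M_beta is a centred square.
    amp_src = np.fft.fftshift(amp_src, axes=(0, 1))
    amp_trg = np.fft.fftshift(amp_trg, axes=(0, 1))

    H, W = src.shape[:2]
    b_h, b_w = int(beta * H), int(beta * W)
    c_h, c_w = H // 2, W // 2

    # M_beta: replace the low-frequency source amplitude with the target's.
    amp_src[c_h - b_h:c_h + b_h, c_w - b_w:c_w + b_w] = \
        amp_trg[c_h - b_h:c_h + b_h, c_w - b_w:c_w + b_w]

    amp_src = np.fft.ifftshift(amp_src, axes=(0, 1))

    # Recombine with the unaltered source phase and invert the FFT.
    x_s2t = np.fft.ifft2(amp_src * np.exp(1j * pha_src), axes=(0, 1))
    return np.real(x_s2t)
```

During training, each source image would be paired with a randomly drawn target image and transformed this way before being fed to the segmentation network together with its original source labels.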
### FDA for Semantic Segmentation

- Given the adapted source dataset $D^{s \rightarrow t}$, a semantic segmentation network $\phi^w$ (with parameters $w$) can be trained by minimizing the cross-entropy (CE) loss

$$
\mathcal{L}_{ce}(\phi^w;D^{s \rightarrow t})=-\sum_i\langle y_i^s, \log(\phi^w(x_i^{s \rightarrow t}))\rangle \tag{4}
$$

- Note: since $D^{s \rightarrow t}$ can theoretically contain $N_s \times N_t$ examples (which can become very large), they generate $D^{s \rightarrow t}$ online by random pairing, so only a smaller number of samples is actually used.
- Since FDA aligns the two domains, the problem becomes one of SSL. The key to SSL is the regularization model. As a criterion, they use a penalty on the decision boundary crossing clusters in the unlabeled space.
- This can be achieved, assuming class separation, by penalizing the decision boundary traversing regions densely populated by data points, which can be done by minimizing the prediction entropy on the target images.
- However, ADVENT ([Vu et al. 2019](https://arxiv.org/abs/1811.12833)) notes that this is ineffective in regions that already have low entropy. So, they use the Charbonnier penalty function ([Bruhn and Weickert, 2005](http://www.wisdom.weizmann.ac.il/~vision/courses/2006_2/papers/optic_flow_multigrid/bruhn_iccv05.pdf)), defined as $\rho(x) = (x^2 + 0.001^2)^\eta$, as a weighting function for entropy minimization

$$
\mathcal{L}_{ent}(\phi^w;D^t)=\sum_i \rho(-\langle\phi^w(x_i^t), \log(\phi^w(x_i^t)) \rangle) \tag{5}
$$

- The penalty function penalizes high-entropy predictions more than low-entropy ones for $\eta > 0.5$, as shown below. They use this instead of penalizing only high-entropy predictions using a threshold.

![Penalty function](https://i.imgur.com/Eh32yq2.png)

- Combining this with the segmentation loss on the adapted source images, we can use the following overall loss to train the semantic segmentation network $\phi^w$ from scratch

$$
\mathcal{L}(\phi^w; D^{s\rightarrow t}, D^t)=\mathcal{L}_{ce}(\phi^w;D^{s\rightarrow t}) + \lambda_{ent}\mathcal{L}_{ent}(\phi^w; D^t) \tag{6}
$$
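As a rough illustration of Eq. 5, the sketch below computes the per-pixel prediction entropy on a target image and applies the Charbonnier penalty $\rho$ to it. The function name and the default $\eta = 2.0$ are assumptions made for the example; the notes only require $\eta > 0.5$ for the intended weighting behaviour.

```python
import numpy as np

def charbonnier_weighted_entropy(prob, eta=2.0, eps=1e-3):
    """Sketch of Eq. 5: Charbonnier-weighted entropy of target predictions.

    prob: softmax output of shape (H, W, K), per-pixel class probabilities.
    Returns the scalar loss summed over all pixels.
    """
    # Per-pixel entropy -<p, log p>; the tiny constant avoids log(0).
    ent = -np.sum(prob * np.log(prob + 1e-30), axis=-1)
    # rho(x) = (x^2 + eps^2)^eta penalizes high-entropy pixels more for eta > 0.5.
    return np.sum((ent ** 2 + eps ** 2) ** eta)
```

In training, this term would be scaled by $\lambda_{ent}$ and added to the cross-entropy loss on $D^{s\rightarrow t}$, as in Eq. 6.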
### Self-Supervised Training

- Self-supervised training (or self-training) is a common way of attempting to boost the performance of SSL by using highly confident pseudo-labels predicted by $\phi^w$ as if they were ground truth.
- Inspired by [Tarvainen and Valpola, 2017](https://arxiv.org/abs/1703.01780), they use the mean of the predictions of multiple models to regularize the self-learning.
- They train $M=3$ segmentation networks $\phi_{\beta_m}^w, m=1,2,3$ from scratch using Eq. 6, and the mean prediction for a certain target image $x_i^t$ is

$$
\hat{y}_i^t = \arg \max_k \frac{1}{M} \sum_m \phi_{\beta_m}^w(x_i^t) \tag{7}
$$

- Using these pseudo-labels generated by the $M$ models, $\phi_\beta^w$ can be trained for further improvement using the following self-supervised loss

$$
\mathcal{L}_{sst}(\phi^w; D^{s\rightarrow t}, D^t, \hat{D}^t) = \mathcal{L}_{ce}(\phi^w;D^{s\rightarrow t}) + \lambda_{ent}\mathcal{L}_{ent}(\phi^w; D^t) + \mathcal{L}_{ce}(\phi^w;\hat{D}^t) \tag{8}
$$

- where $\hat{D}^t$ is $D^t$ augmented with the pseudo-labels $\hat{y}_i^t$. Since this involves different $\beta$'s in the FDA operation, they refer to this self-supervised training using the mean prediction of different segmentation networks as Multi-band Transfer (MBT). A short sketch of the pseudo-labelling step appears at the end of these notes.
- The full training procedure of the FDA semantic segmentation network consists of one round of initial training of the $M$ models from scratch using Eq. 6, and two more rounds of self-supervised training using Eq. 8.

## Conclusion

- A simple method for domain alignment is proposed, which does not require any learning and can be easily integrated into learning systems that transform UDA into SSL.
- They use regularizers in both entropy minimization and self-training. Their MBT technique does not require joint training of networks or any complicated model selection.
- Despite its simplicity, the method outperforms SOTA, which suggests that some of the distributional misalignment due to low-level statistics (which wreaks havoc on generalization across domains) is simple to capture using the FFT.
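Closing with a small sketch of the MBT pseudo-labelling step in Eq. 7: the softmax outputs of the $M$ networks (trained with different $\beta$) are averaged and the per-pixel argmax is taken. The function name is hypothetical, and in practice one would typically also discard pixels whose mean confidence is low, which this sketch omits.

```python
import numpy as np

def mbt_pseudo_labels(probs):
    """Sketch of Eq. 7: average M softmax outputs and take the argmax.

    probs: list of M arrays of shape (H, W, K), one per network.
    Returns an (H, W) integer map of pseudo-labels used in Eq. 8.
    """
    mean_prob = np.mean(np.stack(probs, axis=0), axis=0)
    return np.argmax(mean_prob, axis=-1)
```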