
Notes on "Fourier Domain Adaptation for Semantic Segmentation"

tags: notes unsupervised domain-adaptation segmentation

Author: Akshay Kulkarni

Brief Outline

A simple method for UDA where the discrepancy between the source and target distributions is reduced by swapping the low-frequency spectrum of one with the other. The method is illustrated for semantic segmentation.

Introduction

  • Unsupervised domain adaptation (UDA) refers to adapting a model trained with annotated samples from one distribution (source), to operate on a different (target) distribution for which no annotations are given.
  • Simply training the model on the source data does not yield satisfactory performance on the target data, due to the covariate shift.
  • In some cases, perceptually insignificant changes in the low-level statistics can cause significant deterioration of the performance of the trained model, unless UDA is performed.
  • They explore the hypothesis that simple alignment of the low-level statistics between the source and target distributions can improve UDA, without needing any training beyond the primary task.
  • They compute the FFT of each input image, replace the low-frequency amplitude spectrum of the source images with that of the target images, reconstitute the images using the inverse FFT (iFFT), and train using the original annotations of the source domain.

(Figure: illustration of FDA — the low-frequency amplitude spectrum of a target image is swapped into a source image before the iFFT.)

  • Since their method surpasses the adversarially trained SOTA UDA methods, they claim that such a simple method is more effective for managing low-level nuisance variability compared to sophisticated adversarial training.
  • The motivation is that the low-level amplitude spectrum can vary significantly without affecting the perception of high-level semantics.
  • Using such a method, known nuisance variability (like rescaling of color map or non-linear contrast changes) can be dealt with at the outset, without the need to learn it through complex adversarial training.
  • This is important since networks do not transfer well across different low-level statistics (Achille et al. 2019).
  • DA and self-supervised learning (SSL) are closely related: when the domains are aligned, UDA becomes SSL. CBST (Zou et al. 2018) and BDL (Li et al. 2019) used self-training as regularization, exploiting target images by treating pseudo-labels as ground truth. This work also uses self-training.
  • ADVENT (Vu et al. 2019) minimizes both the entropy of the pixel-wise predictions and the adversarial loss of the entropy maps. This work also uses entropy minimization to regularize the segmentation training.
  • Inspired by Tarvainen and Valpola, 2017, Laine and Aila, 2017 and French et al. 2018, they average the outputs of models trained with different spectral domain sizes, which fosters multi-band transfer (detailed later).

Methodology

  • In UDA, we are given a source dataset $D^s = \{(x_i^s, y_i^s) \sim P(x^s, y^s)\}_{i=1}^{N_s}$, where $x^s \in \mathbb{R}^{H \times W \times 3}$ is a color image and $y^s \in \mathbb{R}^{H \times W}$ is the semantic map associated with $x^s$.
  • Similarly, $D^t = \{x_i^t\}_{i=1}^{N_t}$ is the target dataset, where ground truth semantic labels are absent.

Fourier Domain Adaptation

  • Let $\mathcal{F}^A, \mathcal{F}^P : \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{H \times W \times 3}$ be the amplitude and phase components of the Fourier transform $\mathcal{F}$ of an RGB image, i.e., for a single-channel image $x$, we have

$$\mathcal{F}(x)(m, n) = \sum_{h,w} x(h, w)\, e^{-j 2\pi \left(\frac{h}{H} m + \frac{w}{W} n\right)} \tag{1}$$

  • where $j^2 = -1$. This can be implemented efficiently using the FFT (Frigo and Johnson, 1998). Accordingly, $\mathcal{F}^{-1}$ is the inverse Fourier transform, which maps the spectral signals (phase and amplitude) back to image space.
  • Further, a mask $M_\beta$, $\beta \in (0, 1)$, is defined whose value is zero except for the centre region:

$$M_\beta = \mathbb{1}_{(h, w) \in [-\beta H : \beta H,\, -\beta W : \beta W]} \tag{2}$$

  • where the centre of the image is assumed to be $(0, 0)$. Note that $\beta$ is not measured in pixels, so the choice of $\beta$ does not depend on image size or resolution.
  • Given two randomly sampled images $x^s \sim D^s$, $x^t \sim D^t$, FDA can be formalized as

$$x^{s \to t} = \mathcal{F}^{-1}\left(\left[M_\beta \circ \mathcal{F}^A(x^t) + (1 - M_\beta) \circ \mathcal{F}^A(x^s),\; \mathcal{F}^P(x^s)\right]\right) \tag{3}$$

  • where the low-frequency part of the amplitude of the source image $\mathcal{F}^A(x^s)$ is replaced by that of the target image $x^t$.
  • Then, the modified spectral representation of $x^s$ (with its phase component unaltered) is mapped back to the image $x^{s \to t}$, whose content is the same as $x^s$ but whose appearance resembles a sample from $D^t$. This is shown in the first figure.
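The spectral swap of Eq. 3 can be sketched in numpy. This is a minimal single-image sketch, not the paper's code: the function name and the centred-mask indexing are my own, and float images of shape (H, W, C) are assumed.

```python
import numpy as np

def fda_source_to_target(x_s, x_t, beta=0.01):
    """Swap the low-frequency amplitude of source x_s with that of
    target x_t (Eq. 3). x_s, x_t: float arrays of shape (H, W, C)."""
    # FFT per channel; shift the zero frequency to the centre
    fft_s = np.fft.fftshift(np.fft.fft2(x_s, axes=(0, 1)), axes=(0, 1))
    fft_t = np.fft.fftshift(np.fft.fft2(x_t, axes=(0, 1)), axes=(0, 1))

    amp_s, pha_s = np.abs(fft_s), np.angle(fft_s)
    amp_t = np.abs(fft_t)

    # Mask M_beta: the centred low-frequency square of half-width beta*H
    H, W = x_s.shape[:2]
    b_h, b_w = int(beta * H), int(beta * W)
    c_h, c_w = H // 2, W // 2
    amp_s[c_h - b_h:c_h + b_h, c_w - b_w:c_w + b_w] = \
        amp_t[c_h - b_h:c_h + b_h, c_w - b_w:c_w + b_w]

    # Recombine the swapped amplitude with the original source phase
    fft_st = amp_s * np.exp(1j * pha_s)
    x_st = np.fft.ifft2(np.fft.ifftshift(fft_st, axes=(0, 1)), axes=(0, 1))
    return np.real(x_st)
```

With $\beta = 0$ nothing is swapped and the input is recovered; increasing $\beta$ transfers a larger band of the target's amplitude spectrum.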

Choice of $\beta$

(Figure: effect of increasing $\beta$ on $x^{s \to t}$ — a larger $\beta$ transfers more of the target amplitude but introduces visible artifacts.)

  • As $\beta$ increases to 1, the image $x^{s \to t}$ approaches the target image $x^t$, but also exhibits visible artifacts (as seen in the figure above). Hence, they use $\beta \le 0.15$.

FDA for Semantic Segmentation

  • Given the adapted source dataset
    Dst
    , a semantic segmentation network
    ϕw
    (with parameters
    w
    ) can be trained by minimizing the CE loss

(4)Lce(ϕw;Dst)=iyis,log(ϕw(xst))

  • Note: since $D^{s \to t}$ can in principle contain $N_s \times N_t$ examples (which can become very large), they generate $D^{s \to t}$ online by randomly pairing source and target images, drawing a much smaller number of samples.
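The per-pixel CE loss of Eq. 4 can be sketched in numpy as follows (the function name and the `eps` guard are my own; softmax probabilities and an integer label map are assumed):

```python
import numpy as np

def seg_cross_entropy(probs, labels, eps=1e-8):
    """Pixel-wise cross-entropy loss (Eq. 4), sketched in numpy.
    probs: predicted class probabilities, shape (H, W, K);
    labels: integer ground-truth semantic map, shape (H, W)."""
    H, W, K = probs.shape
    onehot = np.eye(K)[labels]                 # (H, W, K) one-hot targets
    return -np.sum(onehot * np.log(probs + eps))
```

The one-hot encoding realizes the inner product $\langle y, \log \phi^w(x) \rangle$ of Eq. 4 by selecting the log-probability of the ground-truth class at each pixel.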
  • Since FDA aligns the two domains, the problem becomes one of SSL. The key to SSL is the regularization model; as a criterion, they penalize the decision boundary for crossing clusters in the unlabeled space.
  • Assuming class separation, this can be achieved by penalizing the decision boundary traversing regions densely populated by data points, which is done by minimizing the prediction entropy on the target images.
  • However, ADVENT (Vu et al. 2019) notes that this is ineffective in regions with low entropy. So, they use the Charbonnier penalty function (Bruhn and Weickert, 2005), defined as $\rho(x) = (x^2 + 0.001^2)^\eta$, as a weighting function for entropy minimization:

$$L_{ent}(\phi^w; D^t) = \sum_i \rho\left(-\langle \phi^w(x_i^t), \log(\phi^w(x_i^t)) \rangle\right) \tag{5}$$

  • The penalty function penalizes high-entropy predictions more than low-entropy ones for $\eta > 0.5$, as shown in the figure below. They use this instead of penalizing only high-entropy predictions via a threshold.
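A minimal numpy sketch of the Charbonnier-weighted entropy penalty of Eq. 5 (function name and the `eps` guard are my own additions):

```python
import numpy as np

def charbonnier_entropy(probs, eta=2.0, eps=1e-8):
    """Charbonnier-weighted entropy penalty (Eq. 5), sketched in numpy.
    probs: per-pixel class probabilities, shape (H, W, K)."""
    # Pixel-wise Shannon entropy: -<p, log p>
    ent = -np.sum(probs * np.log(probs + eps), axis=-1)
    # Charbonnier penalty rho(x) = (x^2 + 0.001^2)^eta; for eta > 0.5 this
    # penalizes high-entropy pixels more heavily than low-entropy ones
    rho = (ent ** 2 + 0.001 ** 2) ** eta
    return rho.sum()
```

Confident (near one-hot) predictions thus contribute almost nothing, while diffuse predictions dominate the loss.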

(Figure: the Charbonnier penalty $\rho(x)$ for different values of $\eta$.)

  • Combining this with the segmentation loss on the adapted source images, the overall loss used to train the semantic segmentation network $\phi^w$ from scratch is

$$L(\phi^w; D^{s \to t}, D^t) = L_{ce}(\phi^w; D^{s \to t}) + \lambda_{ent} L_{ent}(\phi^w; D^t) \tag{6}$$

Self-Supervised Training

  • Self-supervised training (or self-training) is a common way of attempting to boost the performance of SSL by using highly confident pseudo-labels predicted by
    ϕw
    as if they were ground-truth.
  • Inspired by Tarvainen and Valpola, 2017, they use the mean of predictions of multiple models to regularize the self-learning.
  • They train $M = 3$ segmentation networks $\phi^w_{\beta_m},\, m = 1, 2, 3$, from scratch using Eq. 6, and the mean prediction for a certain target image $x_i^t$ is

$$\hat{y}_i^t = \arg\max_k \frac{1}{M} \sum_m \phi^w_{\beta_m}(x_i^t) \tag{7}$$

  • Using the pseudo-labels generated by the $M$ models, $\phi^w_\beta$ can be trained for further improvement using the following self-supervised loss:

$$L_{sst}(\phi^w; D^{s \to t}, D^t, \hat{D}^t) = L_{ce}(\phi^w; D^{s \to t}) + \lambda_{ent} L_{ent}(\phi^w; D^t) + L_{ce}(\phi^w; \hat{D}^t) \tag{8}$$

  • where $\hat{D}^t$ is $D^t$ augmented with the pseudo-labels $\hat{y}_i^t$. Since this involves different $\beta$'s in the FDA operation, they call this self-supervised training using the mean prediction of different segmentation networks Multi-band Transfer (MBT).
  • The full training procedure of the FDA semantic segmentation network consists of one round of initial training of the $M$ models from scratch using Eq. 6, and two further rounds of self-supervised training using Eq. 8.
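The mean-prediction pseudo-labelling of Eq. 7 can be sketched in numpy (function name mine; each input is assumed to be a softmax output from one of the $M$ networks):

```python
import numpy as np

def mbt_pseudo_labels(prob_maps):
    """Multi-band Transfer pseudo-labels (Eq. 7), sketched in numpy.
    prob_maps: list of M softmax outputs, each of shape (H, W, K), from
    segmentation networks trained with different beta values."""
    mean_pred = np.mean(np.stack(prob_maps, axis=0), axis=0)  # (H, W, K)
    return np.argmax(mean_pred, axis=-1)                      # (H, W)
```

In practice one would also threshold or mask out low-confidence pixels before treating these labels as ground truth for Eq. 8.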

Conclusion

  • A simple method for domain alignment is proposed, which does not require any learning and can be easily integrated in learning systems that transform UDA into SSL.
  • They use regularizers in both entropy minimization and self-training. Their MBT technique does not require joint training of networks or any complicated model selection.
  • The results outperform the SOTA despite the method's simplicity, which suggests that some of the distributional misalignment due to low-level statistics, which wreaks havoc on generalization across domains, is simple to capture with the FFT.