They propose a bidirectional style-induced DA method (BiSIDA) that employs consistency regularization to efficiently exploit information from the unlabeled target dataset using a simple neural style transfer model.
Introduction
To perform domain alignment at the pixel or feature level, existing works (Tsai et al. CVPR '18, Hoffman et al. ICML '18) typically rely on adversarial training, with training on the aligned data supervised by the labeled source data. However, adversarial training introduces extra complexity and instability.
Alternative approaches (Zou et al. ECCV '18, Vu et al. CVPR '19) instead exploit the unlabeled target data through semi-supervised learning techniques such as entropy minimization, pseudo-labeling, and consistency regularization. However, these techniques either play an auxiliary role alongside supervised learning or fail to take full advantage of the target data.
This work proposes a two-stage pipeline:
In the supervised phase, a style-induced image generator translates source images into different target styles, aligning the source domain in the direction of the target domain.
In the unsupervised phase, they apply high-dimensional perturbations to target-domain images and enforce consistency regularization.
Methodology
Background
Adaptive Instance Normalization (AdaIN)
Given a content image $x_c$ and a style image $x_s$ from another domain, an image that mimics the style of $x_s$ while preserving the content of $x_c$ is synthesized.
Formally, the feature maps of the content image $x_c$ and the style image $x_s$ produced by an encoder $E$ can be represented as $f_c = E(x_c)$ and $f_s = E(x_s)$. Normalizing the mean and standard deviation of each channel of $f_c$ to match those of $f_s$, the target feature map is produced as follows:

$$\mathrm{AdaIN}(f_c, f_s) = \sigma(f_s)\,\frac{f_c - \mu(f_c)}{\sigma(f_c)} + \mu(f_s)$$

Here, $\mu(f)$ and $\sigma(f)$ are the channel-wise mean and standard deviation of a feature map $f$.
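A minimal PyTorch sketch of this operation (the shape convention and the function name are my own, not from the paper):

```python
import torch

def adain(f_c: torch.Tensor, f_s: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """AdaIN: re-normalize content features f_c (N, C, H, W) so each channel
    matches the mean and standard deviation of the style features f_s."""
    mu_c = f_c.mean(dim=(2, 3), keepdim=True)
    sigma_c = f_c.std(dim=(2, 3), keepdim=True) + eps   # eps avoids division by zero
    mu_s = f_s.mean(dim=(2, 3), keepdim=True)
    sigma_s = f_s.std(dim=(2, 3), keepdim=True)
    return sigma_s * (f_c - mu_c) / sigma_c + mu_s
```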
Self-ensembling
To stabilize the generation of pseudo-labels, they employ self-ensembling (Tarvainen and Valpola, NeurIPS '17), which consists of a segmentation network $F$ acting as the student and a teacher network $F'$ with the same architecture.
The teacher is essentially a temporal ensemble of the student network, so radical changes in the teacher's weights are alleviated and more informed predictions can be made.
The weights $\theta'_t$ of the teacher at iteration $t$ are updated as the exponential moving average of the student weights $\theta_t$, i.e. $\theta'_t = \gamma\,\theta'_{t-1} + (1-\gamma)\,\theta_t$, where $\gamma$ is the decay factor.
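A sketch of the EMA update in PyTorch (the `decay` default is a typical value, not taken from the paper; in practice batch-norm buffers are usually copied or averaged as well):

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module,
                   decay: float = 0.999) -> None:
    """theta'_t = decay * theta'_{t-1} + (1 - decay) * theta_t"""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```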
Style-induced image generator
They synthesize an image with the combined style of a source image and a target image, controlled by a content-style trade-off parameter $\alpha \in [0, 1]$, through the generator $G$ (an encoder $E$ followed by a decoder $D$):

$$G(x_c, x_s, \alpha) = D\big((1-\alpha)\,f_c + \alpha\,\mathrm{AdaIN}(f_c, f_s)\big)$$

Here, $\alpha = 0$ indicates the content image $x_c$ will be reconstructed, while $\alpha > 0$ will reconstruct $x_c$ with a combination of the styles of the content image $x_c$ and the style image $x_s$.
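A sketch of this interpolation, reusing `adain` from above (the `encoder`/`decoder` modules, e.g. frozen VGG features plus a learned decoder, are assumptions):

```python
import torch

def stylize(x_c: torch.Tensor, x_s: torch.Tensor, alpha: float,
            encoder: torch.nn.Module, decoder: torch.nn.Module) -> torch.Tensor:
    """Blend content features with their AdaIN-stylized version:
    alpha = 0 reconstructs the content image, alpha = 1 is full style transfer."""
    f_c, f_s = encoder(x_c), encoder(x_s)
    f_t = (1.0 - alpha) * f_c + alpha * adain(f_c, f_s)
    return decoder(f_t)
```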
Target-guided supervised learning
Given a source dataset $\mathcal{X}_S = \{(x_s, y_s)\}$ and a target dataset $\mathcal{X}_T = \{x_t\}$, they first perform a random color-space perturbation on a source image $x_s$ to get $\hat{x}_s$, enhancing randomness.
This augmented image is passed through the style-induced generator $G$ as a stronger augmentation, using a target image $x_t$ as the style image with the trade-off parameter $\alpha$ randomly sampled from a uniform distribution, to get $\tilde{x}_s = G(\hat{x}_s, x_t, \alpha)$.
The translation is applied only with probability $p$, since a loss of resolution occurs in the translation; with probability $1-p$, $\hat{x}_s$ is used directly so the model can still be trained on full-resolution details.
Finally, the supervised loss (cross-entropy) between the probability map $F(\tilde{x}_s)$ predicted by the student network and the label $y_s$ is given by:

$$\mathcal{L}_{sup} = -\sum_{h,w}\sum_{c} y_s^{(h,w,c)} \log F(\tilde{x}_s)^{(h,w,c)}$$
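A hedged sketch of the whole supervised step (the jitter strengths, $p = 0.5$, and sampling $\alpha \sim U(0, 1)$ are assumptions; `stylize` is the sketch above):

```python
import random
import torch
import torch.nn.functional as F_nn
from torchvision import transforms

color_jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3,
                                      saturation=0.3, hue=0.1)

def supervised_step(student, x_src, y_src, x_tgt, encoder, decoder, p=0.5):
    x_hat = color_jitter(x_src)                # random color-space perturbation
    if random.random() < p:                    # translate only with probability p
        alpha = random.random()                # alpha ~ U(0, 1)
        x_tilde = stylize(x_hat, x_tgt, alpha, encoder, decoder)
    else:
        x_tilde = x_hat                        # keep full-resolution details
    logits = student(x_tilde)                  # (N, C, H, W)
    return F_nn.cross_entropy(logits, y_src)   # L_sup, y_src: (N, H, W) class ids
```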
Using a strong and directed augmentation method facilitates generalization across different styles and further enables adaptation in the direction of the target domain.
Source-guided unsupervised learning
Since the model is better adapted to the source domain, where supervised learning is performed, the quality of the produced pseudo-labels is generally higher when the input is closer to the source domain. Consequently, pseudo-labels are computed from target images translated toward the appearance of the source domain.
First, they perform random color-space perturbations on a target image $x_t$ to get $K$ augmented copies $\{\hat{x}_t^k\}_{k=1}^K$. Each $\hat{x}_t^k$ is then further augmented through $G$, using a randomly sampled source image as the style image, again only with probability $p$ given the loss of resolution.
After this augmentation process, the transformed images $\tilde{x}_t^k$ are passed through the teacher model $F'$ individually to acquire more stable predictions. Each prediction is sharpened (Berthelot et al. NeurIPS '19) to make it more peaked before the predictions are averaged into the probability map $\bar{p}$:

$$\bar{p} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{Sharpen}\big(F'(\tilde{x}_t^k),\,T\big), \qquad \mathrm{Sharpen}(p, T)_i = \frac{p_i^{1/T}}{\sum_j p_j^{1/T}},$$

where $T$ is a temperature. Finally, the pseudo-label $\hat{y}_t = \arg\max_c \bar{p}$ is obtained, which is used to compute the unsupervised loss $\mathcal{L}_{unsup}$ in a supervised manner.
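A sketch of pseudo-label generation ($K$, the temperature, and the helper names are assumptions; `color_jitter` and `stylize` come from the sketches above):

```python
import random
import torch

@torch.no_grad()
def make_pseudo_label(teacher, x_tgt, src_pool, encoder, decoder,
                      K=3, T=0.5, p=0.5):
    probs = []
    for _ in range(K):
        x_hat = color_jitter(x_tgt)                        # color-space perturbation
        if random.random() < p:                            # source-guided style transfer
            x_src = random.choice(src_pool)
            x_hat = stylize(x_hat, x_src, random.random(), encoder, decoder)
        pr = torch.softmax(teacher(x_hat), dim=1)          # (N, C, H, W)
        pr = pr ** (1.0 / T)                               # sharpen each prediction
        probs.append(pr / pr.sum(dim=1, keepdim=True))     # renormalize
    p_bar = torch.stack(probs).mean(dim=0)                 # average probability map
    return p_bar.argmax(dim=1)                             # pseudo-label (N, H, W)
```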
Class imbalance in the dataset biases the model towards frequent or easy classes, especially when relying on semi-supervised signals like pseudo-labels. To address this, they add a simple per-class weighting to the loss based on the proportion of pixels of each class in the source data. Note: over-complicating a simple training trick by providing explicit equations!
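For illustration, one common way to compute such weights (the exact scheme in the paper may differ):

```python
import torch

def class_weights_from_labels(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Inverse-frequency weights from source pixel labels; pass as weight=
    to F.cross_entropy to down-weight frequent classes."""
    valid = labels[labels < num_classes]       # drop ignore-index pixels if any
    counts = torch.bincount(valid, minlength=num_classes).float()
    return counts.sum() / (counts * num_classes + 1e-8)
```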
Optimization
The final loss $\mathcal{L}$, given the unsupervised loss weight $\lambda$, combines both objectives in a multi-task learning manner:

$$\mathcal{L} = \mathcal{L}_{sup} + \lambda\,\mathcal{L}_{unsup}$$
During training, the student network is updated via gradient descent on $\mathcal{L}$ computed through back-propagation, while the weights of the teacher network are updated as the exponential moving average of the student's.
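Tying everything together, a condensed training step under the same assumptions as the sketches above ($\lambda$ is `lam`; all helpers are the hypothetical functions defined earlier):

```python
import torch.nn.functional as F_nn

def train_step(student, teacher, optimizer, x_src, y_src, x_tgt, src_pool,
               encoder, decoder, lam=1.0):
    l_sup = supervised_step(student, x_src, y_src, x_tgt, encoder, decoder)
    y_pseudo = make_pseudo_label(teacher, x_tgt, src_pool, encoder, decoder)
    # consistency: student must match pseudo-labels on a perturbed target image
    l_unsup = F_nn.cross_entropy(student(color_jitter(x_tgt)), y_pseudo)
    loss = l_sup + lam * l_unsup               # L = L_sup + lambda * L_unsup
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_teacher(teacher, student)           # EMA update of the teacher
    return loss.item()
```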
Conclusion
They propose a Bidirectional Style-induced DA (BiSIDA) framework that optimizes a segmentation model via target-guided supervised learning and source-guided unsupervised learning.
Using a continuous style-induced generator, they show the effectiveness of learning from the unlabeled target domain by providing high-dimensional perturbations for consistency regularization.
They also show that alignment between source and target from both directions is achievable without adversarial training.