They propose a bidirectional style-induced DA method (BiSIDA) that employs consistency regularization to efficiently exploit information from the unlabeled target dataset using a simple neural style transfer model.
Introduction
To perform domain alignment at the pixel or feature level, existing works (Tsai et al. CVPR '18, Hoffman et al. ICML '18) typically rely on adversarial training, with training on the aligned data supervised by the labeled source data. However, adversarial training introduces extra complexity and instability.
Alternative approaches (Zou et al. ECCV '18, Vu et al. CVPR '19) instead exploit the unlabeled target data through semi-supervised learning techniques such as entropy minimization, pseudo-labeling, and consistency regularization. However, these techniques either play an auxiliary role alongside supervised learning or fail to take full advantage of the target data.
This work proposes a two-stage pipeline:
In the supervised phase, a style-induced image generator translates source images into different target styles, aligning the source domain in the direction of the target domain.
In the unsupervised phase, they apply high-dimensional perturbations to target-domain images and enforce consistency regularization.
Methodology
Background
Adaptive Instance Normalization (AdaIN)
Given a content image $x_c$ and a style image $x_s$ from another domain, an image that mimics the style of $x_s$ while preserving the content of $x_c$ is synthesized.
Formally, the feature maps of the content image $x_c$ and the style image $x_s$ produced by an encoder $E$ can be represented as $f_c = E(x_c)$ and $f_s = E(x_s)$. Normalizing the mean and standard deviation of each channel of $f_c$ to match those of $f_s$, the target feature map is produced as follows:

$$\mathrm{AdaIN}(f_c, f_s) = \sigma(f_s)\,\frac{f_c - \mu(f_c)}{\sigma(f_c)} + \mu(f_s)$$

Here, $\mu(f)$ and $\sigma(f)$ are the channel-wise mean and standard deviation of a feature map $f$.
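A minimal PyTorch sketch of this operation (the shape convention and the function name are my own, not from the paper):

```python
import torch

def adain(f_c: torch.Tensor, f_s: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """AdaIN: re-normalize content features f_c (N, C, H, W) so each channel
    matches the mean and standard deviation of the style features f_s."""
    mu_c = f_c.mean(dim=(2, 3), keepdim=True)
    sigma_c = f_c.std(dim=(2, 3), keepdim=True) + eps   # eps avoids division by zero
    mu_s = f_s.mean(dim=(2, 3), keepdim=True)
    sigma_s = f_s.std(dim=(2, 3), keepdim=True)
    return sigma_s * (f_c - mu_c) / sigma_c + mu_s
```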
Self-ensembling
To stabilize the generation of pseudo-labels, they employ self-ensembling (Tarvainen and Valpola, NeurIPS '17), which consists of a segmentation network $F$ acting as the student and a teacher network $F'$ with the same architecture.
The teacher is essentially a temporal ensemble of the student network, so radical changes in the teacher's weights are alleviated and more informed predictions can be made.
The weights $\theta'_t$ of the teacher at iteration $t$ are updated as the exponential moving average of the student weights $\theta_t$, i.e. $\theta'_t = \gamma\,\theta'_{t-1} + (1-\gamma)\,\theta_t$, where $\gamma$ is the decay factor.
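A sketch of the EMA update in PyTorch (the `decay` default is a typical value, not taken from the paper; in practice batch-norm buffers are usually copied or averaged as well):

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module,
                   decay: float = 0.999) -> None:
    """theta'_t = decay * theta'_{t-1} + (1 - decay) * theta_t"""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```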
Style-induced image generator
They synthesize an image with the combined style of a source image and a target image, controlled by a content-style trade-off parameter $\alpha \in [0, 1]$, through the generator $G$ (an encoder $E$ followed by a decoder $D$):

$$G(x_c, x_s, \alpha) = D\big((1-\alpha)\,f_c + \alpha\,\mathrm{AdaIN}(f_c, f_s)\big)$$

Here, $\alpha = 0$ indicates the content image $x_c$ will be reconstructed, while $\alpha > 0$ will reconstruct $x_c$ with a combination of the styles of the content image $x_c$ and the style image $x_s$.
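A sketch of this interpolation, reusing `adain` from above (the `encoder`/`decoder` modules, e.g. frozen VGG features plus a learned decoder, are assumptions):

```python
import torch

def stylize(x_c: torch.Tensor, x_s: torch.Tensor, alpha: float,
            encoder: torch.nn.Module, decoder: torch.nn.Module) -> torch.Tensor:
    """Blend content features with their AdaIN-stylized version:
    alpha = 0 reconstructs the content image, alpha = 1 is full style transfer."""
    f_c, f_s = encoder(x_c), encoder(x_s)
    f_t = (1.0 - alpha) * f_c + alpha * adain(f_c, f_s)
    return decoder(f_t)
```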
Target-guided supervised learning
Given a source dataset $\mathcal{X}_S = \{(x_s, y_s)\}$ and a target dataset $\mathcal{X}_T = \{x_t\}$, they first perform a random color-space perturbation on a source image $x_s$ to get $\hat{x}_s$, enhancing randomness.
This augmented image is passed through the style-induced generator $G$ as a stronger augmentation, using a target image $x_t$ as the style image with the trade-off parameter $\alpha$ randomly sampled from a uniform distribution, to get $\tilde{x}_s = G(\hat{x}_s, x_t, \alpha)$.
The translation is applied only with probability $p$, since a loss of resolution occurs in the translation; with probability $1-p$, $\hat{x}_s$ is used directly so the model can still be trained on full-resolution details.
Finally, the supervised loss (cross-entropy) between the probability map $F(\tilde{x}_s)$ predicted by the student network and the label $y_s$ is given by:

$$\mathcal{L}_{sup} = -\sum_{h,w}\sum_{c} y_s^{(h,w,c)} \log F(\tilde{x}_s)^{(h,w,c)}$$
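A hedged sketch of the whole supervised step (the jitter strengths, $p = 0.5$, and sampling $\alpha \sim U(0, 1)$ are assumptions; `stylize` is the sketch above):

```python
import random
import torch
import torch.nn.functional as F_nn
from torchvision import transforms

color_jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3,
                                      saturation=0.3, hue=0.1)

def supervised_step(student, x_src, y_src, x_tgt, encoder, decoder, p=0.5):
    x_hat = color_jitter(x_src)                # random color-space perturbation
    if random.random() < p:                    # translate only with probability p
        alpha = random.random()                # alpha ~ U(0, 1)
        x_tilde = stylize(x_hat, x_tgt, alpha, encoder, decoder)
    else:
        x_tilde = x_hat                        # keep full-resolution details
    logits = student(x_tilde)                  # (N, C, H, W)
    return F_nn.cross_entropy(logits, y_src)   # L_sup, y_src: (N, H, W) class ids
```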
Using a strong and directed augmentation method facilitates generalization across different styles and further enables adaptation in the direction of the target domain.
Source-guided unsupervised learning
Since the model is better adapted to the source domain, where supervised learning is performed, the quality of the produced pseudo-labels is generally higher when the input is closer to the source domain. Consequently, pseudo-labels are computed from target images translated toward the appearance of the source domain.
First, they perform random color-space perturbations on a target image $x_t$ to get $K$ augmented copies $\{\hat{x}_t^k\}_{k=1}^K$. Each $\hat{x}_t^k$ is then further augmented through $G$, using a randomly sampled source image as the style image, again only with probability $p$ given the loss of resolution.
After this augmentation process, the transformed images $\tilde{x}_t^k$ are passed through the teacher model $F'$ individually to acquire more stable predictions. Each prediction is sharpened (Berthelot et al. NeurIPS '19) to make it more peaked before the predictions are averaged into the probability map $\bar{p}$:

$$\bar{p} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{Sharpen}\big(F'(\tilde{x}_t^k),\,T\big), \qquad \mathrm{Sharpen}(p, T)_i = \frac{p_i^{1/T}}{\sum_j p_j^{1/T}},$$

where $T$ is a temperature. Finally, the pseudo-label $\hat{y}_t = \arg\max_c \bar{p}$ is obtained, which is used to compute the unsupervised loss $\mathcal{L}_{unsup}$ in a supervised manner.
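A sketch of pseudo-label generation ($K$, the temperature, and the helper names are assumptions; `color_jitter` and `stylize` come from the sketches above):

```python
import random
import torch

@torch.no_grad()
def make_pseudo_label(teacher, x_tgt, src_pool, encoder, decoder,
                      K=3, T=0.5, p=0.5):
    probs = []
    for _ in range(K):
        x_hat = color_jitter(x_tgt)                        # color-space perturbation
        if random.random() < p:                            # source-guided style transfer
            x_src = random.choice(src_pool)
            x_hat = stylize(x_hat, x_src, random.random(), encoder, decoder)
        pr = torch.softmax(teacher(x_hat), dim=1)          # (N, C, H, W)
        pr = pr ** (1.0 / T)                               # sharpen each prediction
        probs.append(pr / pr.sum(dim=1, keepdim=True))     # renormalize
    p_bar = torch.stack(probs).mean(dim=0)                 # average probability map
    return p_bar.argmax(dim=1)                             # pseudo-label (N, H, W)
```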
Class imbalance in the dataset biases the model towards frequent or easy classes, especially when relying on semi-supervised signals like pseudo-labels. To address this, they add a simple per-class weighting to the loss based on the proportion of pixels of each class in the source data. Note: over-complicating a simple training trick by providing explicit equations!
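For illustration, one common way to compute such weights (the exact scheme in the paper may differ):

```python
import torch

def class_weights_from_labels(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Inverse-frequency weights from source pixel labels; pass as weight=
    to F.cross_entropy to down-weight frequent classes."""
    valid = labels[labels < num_classes]       # drop ignore-index pixels if any
    counts = torch.bincount(valid, minlength=num_classes).float()
    return counts.sum() / (counts * num_classes + 1e-8)
```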
Optimization
The final loss $\mathcal{L}$, given the unsupervised loss weight $\lambda$, combines both objectives in a multi-task learning manner:

$$\mathcal{L} = \mathcal{L}_{sup} + \lambda\,\mathcal{L}_{unsup}$$
During training, the student network is updated via gradient descent on $\mathcal{L}$ computed through back-propagation, while the weights of the teacher network are updated as the exponential moving average of the student's.
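Tying everything together, a condensed training step under the same assumptions as the sketches above ($\lambda$ is `lam`; all helpers are the hypothetical functions defined earlier):

```python
import torch.nn.functional as F_nn

def train_step(student, teacher, optimizer, x_src, y_src, x_tgt, src_pool,
               encoder, decoder, lam=1.0):
    l_sup = supervised_step(student, x_src, y_src, x_tgt, encoder, decoder)
    y_pseudo = make_pseudo_label(teacher, x_tgt, src_pool, encoder, decoder)
    # consistency: student must match pseudo-labels on a perturbed target image
    l_unsup = F_nn.cross_entropy(student(color_jitter(x_tgt)), y_pseudo)
    loss = l_sup + lam * l_unsup               # L = L_sup + lambda * L_unsup
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_teacher(teacher, student)           # EMA update of the teacher
    return loss.item()
```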
Conclusion
They propose a Bidirectional Style-induced DA (BiSIDA) framework that optimizes a segmentation model via target-guided supervised learning and source-guided unsupervised learning.
Using a continuous style-induced generator, they show the effectiveness of learning from the unlabeled target domain by providing high-dimensional perturbations for consistency regularization.
They also show that alignment between source and target from both directions is achievable without adversarial training.