However, there is a domain gap between this synthetic domain and the real domain, so a model trained on synthetic data does not generalize well to real data. Domain Adaptation addresses this issue by reducing the domain gap.
One approach, pixel-level adaptation, uses image translation algorithms like CycleGAN (Zhu et al., 2017) to reduce the gap in visual appearance between the domains.
Although CycleGAN helps to some extent, it struggles to overcome the fundamental difference between the domains: texture. CycleGAN-translated images pick up Cityscapes' gray color tone, but not its real texture.
To overcome this limitation, they propose a method to adapt to the target domain's texture.
They generate a texture-diversified source dataset using a style transfer algorithm. A model trained on this dataset is guided to learn texture-invariant representations.
Then, they finetune the model using self-training to get direct supervision of the target texture.
Methodology
Stylized GTA5/Synthia
Adaptive Instance Normalization (AdaIN) (Huang and Belongie, 2017) is an efficient style transfer algorithm, but the authors say that it distorts the content image considerably with wave patterns, so they don't use it.
However, I have personally used AdaIN, and it works fine when the degree of stylization is reduced (they probably set it to 1, while it can be set anywhere between 0 and 1).
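For reference, here is a minimal sketch of the AdaIN operation (my own illustration, not the paper's code): the content feature map is re-normalized to the channel-wise statistics of the style feature map, and the blending weight `alpha` is the stylization degree mentioned above. The encoder and decoder that map between image space and feature space are omitted.

```python
import torch

def adain(content_feat, style_feat, alpha=0.5, eps=1e-5):
    """Adaptive Instance Normalization (Huang and Belongie, 2017) on encoder
    feature maps of shape (N, C, H, W). alpha in [0, 1] controls the degree of
    stylization; alpha=1 means full stylization."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps

    stylized = (content_feat - c_mean) / c_std * s_std + s_mean
    # Blending with the original content features reduces content distortion
    return alpha * stylized + (1 - alpha) * content_feat
```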
The photo-realistic style transfer algorithm of Li et al. (2018) preserves the precise structure of the original image using a smoothing step after stylization. However, this preserves the original texture as well, so they don't use this algorithm either.
Their requirements for stylization are:
Enough stylization to remove the synthetic texture while not distorting the structure of the original image too much.
The stylization process should be time-efficient since the synthetic dataset has a large volume and high image resolution.
To generate diverse stylized results, it should be able to transfer various styles.
Considering these conditions, they choose Style-swap (Chen and Schmidt, 2016). Some stylized images are shown below.
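To make the choice concrete, here is a rough sketch of the core Style-swap operation (my own illustration, assuming feature maps from a pretrained encoder such as VGG; the trained inverse network that decodes the swapped features back into an image is omitted). Every content-feature patch is replaced by the style-feature patch with the highest normalized cross-correlation.

```python
import torch
import torch.nn.functional as F

def style_swap(content_feat, style_feat, patch_size=3, stride=1):
    """Style-swap (Chen and Schmidt, 2016) core: swap each content patch with
    its best-matching style patch. Inputs are (1, C, H, W) encoder feature maps."""
    # Extract style patches as convolution filters: (n_patches, C, k, k)
    patches = style_feat.unfold(2, patch_size, stride).unfold(3, patch_size, stride)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(
        -1, style_feat.size(1), patch_size, patch_size)

    # L2-normalize the patches so that conv2d computes normalized cross-correlation
    norms = patches.flatten(1).norm(dim=1).clamp(min=1e-8).view(-1, 1, 1, 1)
    filters = patches / norms

    # Correlate every content location with every style patch, then pick the best match
    corr = F.conv2d(content_feat, filters, stride=stride)   # (1, n_patches, H', W')
    one_hot = torch.zeros_like(corr).scatter_(1, corr.argmax(dim=1, keepdim=True), 1.0)

    # Rebuild the feature map from the selected (un-normalized) style patches,
    # averaging the contributions of overlapping patches
    swapped = F.conv_transpose2d(one_hot, patches, stride=stride)
    overlap = F.conv_transpose2d(one_hot, torch.ones_like(patches), stride=stride)
    return swapped / overlap
```

Since the swap is essentially one convolution, one argmax, and one transposed convolution, it can stylize a large, high-resolution synthetic dataset quickly, which matches the time-efficiency requirement above.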
The goal of Stage 1 is to learn texture-invariant representations using the texture-diversified dataset. They train the segmentation model with both the images stylized by Style-swap and the images translated by CycleGAN.
They alternate between the stylized and the translated images at each iteration (see the sketch below).
The stylized images make the model learn texture invariance, while the translated images guide it towards the target style.
Additionally, they use output-level adversarial training (following Tsai et al., 2018) to further align the feature space between the two domains.
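The alternation itself is simple; a minimal sketch (with hypothetical loader names) could look like this:

```python
def stage1_source_batches(stylized_loader, translated_loader):
    """Yield the Stage-1 source batch for each iteration, alternating between the
    Style-swap-stylized dataset and the CycleGAN-translated dataset. Both loaders
    are assumed to yield (image, label) batches."""
    for i, (stylized, translated) in enumerate(zip(stylized_loader, translated_loader)):
        yield stylized if i % 2 == 0 else translated
```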
The goal of Stage 2 is to finetune the segmentation network to the target domain's texture, based on the learned texture-invariant representation. For this, they use a self-training approach.
Following Li et al. (2019), they generate pseudo-labels with the model trained in Stage 1.
They set a confidence threshold of 0.9 on the predictions for the target training images: only pixels predicted with higher confidence than this are kept as pseudo-labels.
Then, they finetune the model with the generated pseudo-labels and the translated source images, and they apply this process iteratively.
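A minimal sketch of the pseudo-label generation step (my own illustration; `ignore_index=255` is an assumed convention for pixels that should receive no supervision):

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(model, target_image, threshold=0.9, ignore_index=255):
    """Keep a pixel's predicted class as a pseudo-label only if its softmax
    confidence exceeds the threshold; `model` is assumed to return per-class
    logits of shape (1, C, H, W) for a target image."""
    probs = torch.softmax(model(target_image), dim=1)     # (1, C, H, W)
    confidence, pseudo_label = probs.max(dim=1)           # both (1, H, W)
    pseudo_label[confidence < threshold] = ignore_index   # discard low-confidence pixels
    return pseudo_label
```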
Training Objectives
Segmentation Model Training
Since the GT labels are available only for the source domain, the segmentation loss is the per-pixel cross-entropy on the source predictions:

$$\mathcal{L}_{seg} = -\sum_{h,w}\sum_{c=1}^{C} y_s^{(h,w,c)} \log\left(F_s^{(h,w,c)}\right)$$

And when a target image is given, they calculate the adversarial loss using the discriminator $D$:

$$\mathcal{L}_{adv} = -\sum_{h,w} \log\left(D(F_t)^{(h,w,1)}\right)$$

Here, $x_s$ and $x_t$ are the input images from the source and target domains, $F_s$ and $F_t$ are the final (per-class probability) feature maps of the source and target images, $y_s$ is the source domain GT label, $C$ is the number of classes, and $D$ is a fully convolutional discriminator.

Thus, the total loss function for the segmentation network is

$$\mathcal{L}_{total} = \mathcal{L}_{seg} + \lambda_{adv}\,\mathcal{L}_{adv},$$

where $\lambda_{adv}$ weights the adversarial term.
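As a PyTorch-style sketch (not the authors' code; the value of `lambda_adv` and the convention that the discriminator outputs a per-location logit with "source" as the positive class are assumptions):

```python
import torch
import torch.nn.functional as F

def segmentation_network_loss(seg_logits_src, labels_src, disc_out_tgt,
                              lambda_adv=0.001, ignore_index=255):
    """Loss for updating the segmentation network: cross-entropy on the
    (stylized or translated) source batch plus the adversarial term that pushes
    the discriminator to classify target feature maps as 'source'."""
    loss_seg = F.cross_entropy(seg_logits_src, labels_src, ignore_index=ignore_index)
    loss_adv = F.binary_cross_entropy_with_logits(
        disc_out_tgt, torch.ones_like(disc_out_tgt))   # fool D: target -> 'source'
    return loss_seg + lambda_adv * loss_adv
```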
Discriminator Training
The discriminator takes the feature maps and classifies them as coming from the source or the target domain:

$$\mathcal{L}_{D} = -\sum_{h,w}\left[(1-z)\,\log\left(D(F)^{(h,w,0)}\right) + z\,\log\left(D(F)^{(h,w,1)}\right)\right]$$

Here, $z = 1$ if the feature map $F$ is from the source domain and $z = 0$ if it is from the target domain.
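The same objective as a short sketch (again assuming a per-location logit output where "source" is the positive class):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc_out_src, disc_out_tgt):
    """Update the discriminator to classify source feature maps as z=1 and
    target feature maps as z=0 (per-location binary cross-entropy)."""
    loss_src = F.binary_cross_entropy_with_logits(disc_out_src, torch.ones_like(disc_out_src))
    loss_tgt = F.binary_cross_entropy_with_logits(disc_out_tgt, torch.zeros_like(disc_out_tgt))
    return 0.5 * (loss_src + loss_tgt)
```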
Self-Training
In Stage 2, they apply the segmentation loss to the generated pseudo-labels of the target images to get direct supervision of the target domain textures:

$$\mathcal{L}_{st} = -\sum_{h,w}\sum_{c=1}^{C} m_t^{(h,w)}\,\hat{y}_t^{(h,w,c)}\,\log\left(F_t^{(h,w,c)}\right)$$

Here, $\hat{y}_t$ is the pseudo-label and $m_t^{(h,w)} \in \{0, 1\}$ indicates whether each pixel of the target prediction is accepted as a pseudo-label or not.
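With the pseudo-label sketch above, the mask $m_t$ is realized simply by marking rejected pixels with `ignore_index`, so the Stage-2 loss reduces to an ordinary cross-entropy:

```python
import torch.nn.functional as F

def self_training_loss(seg_logits_tgt, pseudo_labels, ignore_index=255):
    """Cross-entropy of the target predictions against the pseudo-labels;
    pixels set to `ignore_index` (below the confidence threshold) contribute nothing."""
    return F.cross_entropy(seg_logits_tgt, pseudo_labels, ignore_index=ignore_index)
```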
Discussion
Ablation Study
In the table below, Original source only means training the segmentation network only with the original GTA5 images; Stylized source only and Translated source only use the datasets generated by Style-swap and CycleGAN, respectively.
Alternating between Stylized source and Translated source performs better, because the stylized images enable texture invariance while the translated images guide the model toward the target style.
Conclusion
They present a style-transfer-based method to adapt to the target texture. The varied textures of the stylized dataset act as a regularizer that pushes the model to learn texture-invariant representations.
Based on this texture-invariant representation, they use self-training to obtain direct supervision of the target texture and finetune the model.