# Notes on "[Learning Texture Invariant Representation for Domain Adaptation of Semantic Segmentation](https://openaccess.thecvf.com/content_CVPR_2020/html/Kim_Learning_Texture_Invariant_Representation_for_Domain_Adaptation_of_Semantic_Segmentation_CVPR_2020_paper.html)"

###### tags: `notes` `domain-adaptation` `unsupervised` `segmentation` `adversarial` `cvpr20`

Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)

Note: CVPR '20 paper, [Official Code Release](https://github.com/MyeongJin-Kim/Learning-Texture-Invariant-Representation)

## Brief Outline

This paper proposes a method to adapt to the target domain's texture.

1. They diversify the texture of source images using a style transfer algorithm, which prevents the model from overfitting to a specific texture.
2. Then, they finetune the model with self-training to get direct supervision of the target texture.

## Introduction

* To reduce the cost of annotation, synthetic datasets such as GTA5 ([Richter et al. 2016](https://download.visinf.tu-darmstadt.de/data/from_games/)) and Synthia ([Ros et al. 2016](https://synthia-dataset.net/)) have been proposed.
* However, there is a domain gap between the synthetic domain and the real domain, so a model trained on synthetic data does not generalize well to real data. Domain adaptation addresses this issue by reducing the domain gap.
* One class of approaches, pixel-level adaptation, uses image translation algorithms like CycleGAN ([Zhu et al. 2017](https://arxiv.org/abs/1703.10593)) to reduce the gap in visual appearance between the domains.
* Although CycleGAN works to some extent, it struggles to overcome the fundamental difference between the domains: texture. Translated images pick up Cityscapes' gray color tone, but not its real texture.
* To overcome this limitation, they propose a method to adapt to the target domain's texture.
    * They generate a texture-diversified source dataset using a style transfer algorithm. A model trained on this dataset is guided to learn texture-invariant representations.
    * Then, they finetune the model using self-training to get direct supervision of the target texture.

## Methodology

### Stylized GTA5/Synthia

* Adaptive Instance Normalization (AdaIN) ([Huang and Belongie, 2017](https://arxiv.org/abs/1703.06868)) is an efficient style transfer algorithm. However, the authors say that it distorts the content image considerably with wave patterns, so they don't use AdaIN.
    * However, I have personally used AdaIN and it works fine when the degree of stylization is reduced (they probably set the stylization weight to 1, while it can be set anywhere between 0 and 1).
* The photo-realistic style transfer algorithm of [Li et al. 2018](https://arxiv.org/abs/1802.06474) preserves the precise structure of the original image using a smoothing step after stylization. However, that preserves the original texture as well, so they don't use this algorithm either.
* Their requirements for stylization are:
    * Enough stylization to remove the synthetic texture, while not distorting the structure of the original image too much.
    * The stylization process should be time-efficient, since the synthetic dataset has a large volume and high image resolution.
    * To generate diverse stylized results, it should be able to transfer various styles.
* Considering these conditions, they choose Style-swap ([Chen and Schmidt, 2016](https://arxiv.org/abs/1612.04337)). Some stylized images are shown below.

![Stylization Results](https://i.imgur.com/aj3HASd.jpg)
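Since I mention AdaIN's stylization weight above, here is a minimal PyTorch sketch of the core AdaIN operation with an `alpha` blending factor. This is only an illustration of AdaIN (not the Style-swap method the paper actually uses), and in a real pipeline it sits between a pretrained VGG encoder and a trained decoder; the shapes below are placeholders.

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor,
          alpha: float = 0.5, eps: float = 1e-5) -> torch.Tensor:
    """Align channel-wise mean/std of content features to those of the style features.

    content_feat, style_feat: (N, C, H, W) feature maps from an encoder.
    alpha: degree of stylization in [0, 1]; 1 = full AdaIN, 0 = keep content features.
    """
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps

    normalized = (content_feat - c_mean) / c_std
    stylized = normalized * s_std + s_mean
    # Blend with the original content features to control how much the structure is distorted.
    return alpha * stylized + (1.0 - alpha) * content_feat

# Toy usage: random tensors standing in for encoder activations.
content = torch.randn(1, 512, 32, 64)  # e.g. an encoded GTA5 crop
style = torch.randn(1, 512, 32, 64)    # e.g. an encoded style image
out = adain(content, style, alpha=0.5)
print(out.shape)  # torch.Size([1, 512, 32, 64])
```

Lowering `alpha` trades off how much synthetic texture is removed against how much the content structure is distorted, which is exactly the trade-off discussed above.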
### Stage 1 Training

* The goal of this stage is to learn texture-invariant representations using the texture-diversified dataset. They train the segmentation model with both the images stylized by Style-swap and the images translated by CycleGAN.
    * They alternate between the stylized and translated images at each iteration.
    * The stylized images make the model learn texture-invariance, while the translated images guide the model towards the target style.
* Additionally, they use output-level adversarial training (as in [Tsai et al. 2018](https://arxiv.org/abs/1802.10349)) to further align the feature space between the two domains.
* The following diagram shows the process of Stage 1:

![Stage 1 Diagram](https://i.imgur.com/UL53Onv.png)

### Stage 2 Training

* The goal of this stage is to finetune the segmentation network to the target domain's texture, based on the learned texture-invariant representation. For this, they use a self-training approach.
* Following [Li et al. 2019](https://arxiv.org/abs/1904.10620), they generate pseudo-labels with the model trained in Stage 1.
    * Predictions on the target training images with confidence above a threshold of 0.9 are kept as pseudo-labels.
* Then, they finetune the model with the generated pseudo-labels and the translated source images. They apply this process iteratively.

### Training Objectives

#### Segmentation Model Training

* Since GT labels are available only for the source domain, the segmentation loss is given by

$$
L_{seg}(I_s) = -\sum_{h, w}\sum_{c=1}^C y_s^{(h, w, c)} \log P_s^{(h, w, c)} \tag{1}
$$

* When a target image is given, they calculate the adversarial loss using the discriminator:

$$
L_{adv}(I_t) = -\sum_{h, w} \log D(P_t^{(h, w, c)}) \tag{2}
$$

* Here, $I_s$ and $I_t$ are the input images from the source and target domains, $P_s^{(h, w, c)}$ and $P_t^{(h, w, c)}$ are the final feature maps of the source and target images, $y_s^{(h, w, c)}$ is the source domain GT label, $C$ is the number of classes, and $D$ is a fully convolutional discriminator.
* Thus, the total loss function for the segmentation network is

$$
L(I_s, I_t) = L_{seg}(I_s) + \lambda_{adv}L_{adv}(I_t) \tag{3}
$$

#### Discriminator Training

* The discriminator takes the output features and classifies them as coming from the source or the target domain:

$$
L_D(P) = -\sum_{h, w}\big((1 - z)\log D(P_s^{(h, w, c)}) + z \log D(P_t^{(h, w, c)})\big) \tag{4}
$$

* Here, $z=0$ if the feature is from the source domain and $z=1$ if the feature is from the target domain.

#### Self-Training

* In Stage 2, they use the segmentation loss on the generated pseudo-labels of target images to get direct supervision of the target domain's texture:

$$
L_{ST}(I_t) = -\sum_{h, w} \unicode{x1D7D9}_{pseudo}\sum_{c=1}^C \hat{y}_t^{(h, w, c)} \log P_t^{(h, w, c)} \tag{5}
$$

* Here, $\unicode{x1D7D9}_{pseudo}$ indicates whether each pixel of the target prediction is a pseudo-label or not.
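Below is a hedged PyTorch sketch of how Eqs. (1)–(5) could be implemented. The tensor shapes, the 0/1 label convention for the discriminator, and the use of `ignore_index=255` are my own assumptions for illustration; only the overall loss structure follows the equations above.

```python
import torch
import torch.nn.functional as F

def seg_loss(pred_s, label_s):
    # Eq. (1): cross-entropy on source predictions.
    # pred_s: (N, C, H, W) logits; label_s: (N, H, W) class indices.
    return F.cross_entropy(pred_s, label_s, ignore_index=255)

def adv_loss(d_out_t):
    # Eq. (2): push the discriminator to label target outputs as "source" (z = 0 here).
    # d_out_t: discriminator logits computed on the target prediction map.
    return F.binary_cross_entropy_with_logits(d_out_t, torch.zeros_like(d_out_t))

def disc_loss(d_out_s, d_out_t):
    # Eq. (4): classify source outputs as z = 0 and target outputs as z = 1.
    loss_s = F.binary_cross_entropy_with_logits(d_out_s, torch.zeros_like(d_out_s))
    loss_t = F.binary_cross_entropy_with_logits(d_out_t, torch.ones_like(d_out_t))
    return loss_s + loss_t

def self_training_loss(pred_t, prob_t_stage1, thresh=0.9):
    # Eq. (5): cross-entropy against pseudo-labels from the Stage 1 model,
    # keeping only pixels whose max softmax confidence exceeds `thresh` (0.9).
    conf, pseudo = prob_t_stage1.max(dim=1)   # both (N, H, W)
    pseudo = pseudo.clone()
    pseudo[conf < thresh] = 255               # low-confidence pixels are ignored below
    return F.cross_entropy(pred_t, pseudo, ignore_index=255)
```

The Stage 1 objective of Eq. (3) would then be `seg_loss(...) + lambda_adv * adv_loss(...)` for the segmentation network, with the discriminator updated separately using `disc_loss`.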
## Discussion

### Ablation Study

* In the table below, *Original source only* means training the segmentation network only with the original GTA5 images. *Stylized source only* and *Translated source only* use the datasets generated by Style-swap and CycleGAN, respectively.

![Stage 1 Ablation Study](https://i.imgur.com/yF67SBn.png)

* Alternating between *Stylized source* and *Translated source* performs better because, while the stylized images enable texture-invariance, the translated images guide the model towards the target style.

![Stage 2 Ablation Study](https://i.imgur.com/d7HXMWU.png)

### Robustness Test

* To verify the texture-invariance of the model, they test it on perturbed validation sets distorted by various noises.
* They generate noisy Cityscapes validation sets with noises that do not distort the shapes of objects (a small illustrative sketch of such perturbations is included at the end of these notes).

![Robustness Ablation](https://i.imgur.com/BgOWCQF.png)

## Conclusion

* They present a style transfer based method to adapt to the target texture. The varied textures of the stylized dataset act as a regularizer that makes the model learn texture-invariant representations.
* Based on the texture-invariant representation, they use self-training to get direct supervision of the target texture and finetune the model.
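As an aside to the robustness test above, here is a small sketch of the kind of shape-preserving perturbations one could apply to validation images. The specific noise types (additive Gaussian noise, blur, color jitter) and their magnitudes are my own illustrative assumptions; these notes don't record the paper's exact noise settings.

```python
import torch
from torchvision import transforms

def add_gaussian_noise(img: torch.Tensor, std: float = 0.05) -> torch.Tensor:
    # img: float tensor in [0, 1], shape (C, H, W).
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)

# Texture/color perturbations that leave object shapes (and hence the GT labels) intact.
perturbations = {
    "gaussian_noise": add_gaussian_noise,
    "blur": transforms.GaussianBlur(kernel_size=5, sigma=1.0),
    "color_jitter": transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
}

img = torch.rand(3, 512, 1024)  # stand-in for a Cityscapes validation image
noisy_versions = {name: fn(img) for name, fn in perturbations.items()}
print({name: tuple(v.shape) for name, v in noisy_versions.items()})
```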