
Notes on "Learning Texture Invariant Representation for Domain Adaptation of Semantic Segmentation"

tags: notes domain-adaptation unsupervised segmentation adversarial cvpr20

Author: Akshay Kulkarni

Note: CVPR '20 paper, Official Code Release

Brief Outline

This paper proposes a method to adapt to the target domain's texture.

  1. They diversify the texture of source images using a style transfer algorithm, which prevents the model from overfitting to a specific texture.
  2. Then, they finetune the model with self-training to get direct supervision of the target texture.

Introduction

  • To reduce the cost of annotation, synthetic datasets such as GTA5 (Richter et al. 2016) and Synthia (Ros et al. 2016) have been proposed.
  • However, there is a domain gap between this synthetic domain and the real domain, so a model trained on synthetic data does not generalize well to real data. Domain Adaptation addresses this issue by reducing the domain gap.
  • One of the approaches, pixel-level adaptation, uses image translation algorithms like CycleGAN (Zhu et al. 2017) to reduce the gap in visual appearance between the domains.
  • Although CycleGAN works to some extent, it is challenging to overcome the fundamental difference: texture. Translated images (by CycleGAN) take on Cityscapes' gray color tone, but not its real texture.
  • To overcome this limitation, they propose a method to adapt to the target domain's texture
    • They generate a texture-diversified source dataset using a style transfer algorithm. A model trained on this is guided to learn texture-invariant representations.
    • Then, they finetune the model using self-training to get direct supervision of the target texture.

Methodology

Stylized GTA5/Synthia

  • Adaptive Instance Normalization (AdaIN) (Huang and Belongie, 2017) is an efficient style transfer algorithm. However, the authors observe that it distorts the content image considerably with wave patterns, so they do not use AdaIN.
  • However, I have personally used AdaIN and it works fine when the degree of stylization is reduced (they probably set it to 1, whereas it can be set anywhere between 0 and 1).
  • The photo-realistic style transfer algorithm (Li et al. 2018) preserves the precise structure of the original image using a smoothing step after stylization. However, it preserves the original texture as well, so they don't use this algorithm.
  • Their requirements for stylization are
    • Enough stylization to remove the synthetic texture while not distorting the structure of the original image too much.
    • The stylization process should be time-efficient since the synthetic dataset has a large volume and high image resolution.
    • To generate diverse stylized results, it should be able to transfer various styles.
  • Considering these conditions, they choose Style-swap (Chen and Schmidt, 2016). Some stylized images are shown below.

(Figure: examples of stylized source images)
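At its core, Style-swap replaces each patch of the content image's feature map with the most similar patch from the style image's feature map, and then decodes the swapped features back into an image. Below is a minimal PyTorch sketch of the swap step; it assumes single-image batches, 3×3 patches, and features from the same encoder layer, and is only an illustration of the idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def style_swap(content_feat, style_feat, patch_size=3):
    """Replace every patch of the content feature map with its most similar
    (cosine similarity) patch from the style feature map.
    Assumes batch size 1 and feature maps from the same encoder layer."""
    _, c, _, _ = style_feat.shape
    pad = patch_size // 2

    # Extract style patches and turn them into convolution filters: (L, C, k, k)
    patches = F.unfold(style_feat, kernel_size=patch_size, padding=pad)
    patches = patches.permute(0, 2, 1).reshape(-1, c, patch_size, patch_size)

    # Normalise the filters so that a plain convolution measures cosine similarity
    norms = patches.flatten(1).norm(dim=1).clamp(min=1e-8)
    filters = patches / norms.view(-1, 1, 1, 1)

    # Similarity of every content location to every style patch: (1, L, H, W)
    sim = F.conv2d(content_feat, filters, padding=pad)

    # Hard nearest-neighbour assignment: one-hot over the style-patch dimension
    one_hot = torch.zeros_like(sim).scatter_(1, sim.argmax(dim=1, keepdim=True), 1.0)

    # Paste the selected (unnormalised) style patches back and average the overlaps
    swapped = F.conv_transpose2d(one_hot, patches, padding=pad)
    overlap = F.conv_transpose2d(one_hot, torch.ones_like(patches), padding=pad)
    return swapped / overlap.clamp(min=1e-8)
```

Because each source frame can be swapped against many different style images, the resulting dataset shows the same scenes under many different textures, which is exactly the diversity the method relies on.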

Stage 1 Training

  • The goal of this stage is to learn texture-invariant representations using the texture-diversified dataset. They train the segmentation model with both the images stylized by Style-swap and the images translated by CycleGAN.
  • They alternate the stylized and translated images at each iteration.
  • The stylized images make the model learn texture-invariance while the translated images guide the model towards the target style.
  • Additionally, they use output-level adversarial training (as in Tsai et al. 2018) to further align the feature spaces of the two domains.
  • The following diagram shows the Stage 1 training process:

(Figure: overview of the Stage 1 training process)
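The alternation itself is straightforward; a minimal sketch (with hypothetical iterator arguments and an even/odd schedule, both assumptions on my part) could look like this:

```python
def pick_source_batch(iteration, stylized_iter, translated_iter):
    """Stage 1: alternate the source batch between the Style-swap-stylized and
    the CycleGAN-translated datasets at every iteration."""
    if iteration % 2 == 0:
        return next(stylized_iter)    # (images, labels) from stylized GTA5/Synthia
    return next(translated_iter)      # (images, labels) from translated GTA5/Synthia
```

Both kinds of source batches are fed to the same segmentation network and combined with the adversarial objective described under Training Objectives.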

Stage 2 Training

  • The goal of this stage is to finetune the segmentation network to the target domain's texture, based on the learned texture-invariant representation. For this, they use a self-training approach.
  • Following Li et al. 2019, they generate pseudo-labels with the model trained in Stage 1.
  • Predictions on the target training images whose confidence exceeds a threshold of 0.9 are taken as pseudo-labels.
  • Then, they finetune the model with the generated pseudo-labels and translated source images. They apply this process iteratively.
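A minimal sketch of the pseudo-label generation step, assuming a PyTorch segmentation model that outputs per-class logits (the function and argument names are mine, not the official code's):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_labels(model, target_image, threshold=0.9, ignore_index=255):
    """Keep the argmax class wherever its softmax confidence exceeds the
    threshold (0.9 in the paper); all other pixels are left unsupervised."""
    prob = F.softmax(model(target_image), dim=1)   # (B, C, H, W)
    conf, label = prob.max(dim=1)                  # per-pixel confidence and predicted class
    label[conf < threshold] = ignore_index         # low-confidence pixels get no pseudo-label
    return label
```

These labels are regenerated with the newly finetuned model at each round, which is what makes the process iterative.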

Training Objectives

Segmentation Model Training

  • Since the GT labels are available only for the source domain, the segmentation loss is given by

$$\mathcal{L}_{seg}(I_s) = -\sum_{h,w}\sum_{c=1}^{C} y_s^{(h,w,c)} \log P_s^{(h,w,c)} \tag{1}$$

  • When a target image is given, they calculate the adversarial loss using the discriminator:

$$\mathcal{L}_{adv}(I_t) = -\sum_{h,w} \log D\big(P_t^{(h,w,c)}\big) \tag{2}$$

  • Here, $I_s$ and $I_t$ are the input images from the source and target domains, $P_s^{(h,w,c)}$ and $P_t^{(h,w,c)}$ are the final feature maps of the source and target images, $y_s^{(h,w,c)}$ is the source domain GT label, $C$ is the number of classes, and $D$ is a fully convolutional discriminator.
  • Thus, the total loss function for the segmentation network is

$$\mathcal{L}(I_s, I_t) = \mathcal{L}_{seg}(I_s) + \lambda_{adv}\,\mathcal{L}_{adv}(I_t) \tag{3}$$
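As a sketch, Eqs. (1)–(3) can be implemented with standard cross-entropy and binary cross-entropy. The function below assumes logits from the segmentation network, a fully convolutional discriminator that outputs a logit map, and the convention z = 0 for source; λ_adv and the ignore index are typical values, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def segmentation_net_loss(src_pred, src_label, tgt_pred, discriminator, lambda_adv=1e-3):
    """Total loss for the segmentation network (Eq. 3).
    src_pred, tgt_pred: (B, C, H, W) logits; src_label: (B, H, W) class indices."""
    # Eq. (1): supervised cross-entropy on the (stylized or translated) source batch
    loss_seg = F.cross_entropy(src_pred, src_label, ignore_index=255)

    # Eq. (2): adversarial term -- push the discriminator to label target outputs
    # as "source" (z = 0), so target predictions become indistinguishable from source ones
    d_out = discriminator(F.softmax(tgt_pred, dim=1))
    loss_adv = F.binary_cross_entropy_with_logits(d_out, torch.zeros_like(d_out))

    return loss_seg + lambda_adv * loss_adv
```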

Discriminator Training

  • The discriminator takes the features and classifies them as coming from the source or the target domain.

$$\mathcal{L}_{D}(P) = -\sum_{h,w}\Big((1-z)\log D\big(P_s^{(h,w,c)}\big) + z\log D\big(P_t^{(h,w,c)}\big)\Big) \tag{4}$$

  • Here, $z=0$ if the feature is from the source domain and $z=1$ if it is from the target domain.
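A matching sketch for the discriminator update (Eq. 4), with the segmentation outputs detached so that only the discriminator's weights receive gradients (again an illustration, not the official code):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, src_pred, tgt_pred):
    """Eq. (4): binary cross-entropy with z = 0 for source and z = 1 for target."""
    d_src = discriminator(F.softmax(src_pred.detach(), dim=1))
    d_tgt = discriminator(F.softmax(tgt_pred.detach(), dim=1))
    loss_src = F.binary_cross_entropy_with_logits(d_src, torch.zeros_like(d_src))  # z = 0
    loss_tgt = F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt))   # z = 1
    return loss_src + loss_tgt
```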

Self-Training

  • In Stage 2, they apply the segmentation loss to the generated pseudo-labels of the target images to get direct supervision of the target domain texture.

$$\mathcal{L}_{ST}(I_t) = -\sum_{h,w} \mathbb{1}_{pseudo} \sum_{c=1}^{C} \hat{y}_t^{(h,w,c)} \log P_t^{(h,w,c)} \tag{5}$$

  • Here, $\mathbb{1}_{pseudo}$ indicates whether each pixel of the target prediction is a pseudo-label or not.
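In practice the indicator $\mathbb{1}_{pseudo}$ can be realised simply by assigning an ignore index to pixels that did not receive a pseudo-label; a minimal sketch, assuming the pseudo-labels were generated as in the snippet above:

```python
import torch.nn.functional as F

def self_training_loss(tgt_pred, pseudo_label, ignore_index=255):
    """Eq. (5): cross-entropy on target predictions, restricted to pixels that
    received a pseudo-label (all others carry the ignore index)."""
    return F.cross_entropy(tgt_pred, pseudo_label, ignore_index=ignore_index)
```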

Discussion

Ablation Study

  • In the table below, Original source only means training the segmentation network only with original GTA5 images. Stylized source only and Translated source only use the datasets generated by Style-swap and CycleGAN, respectively.

(Table: ablation study over the source dataset variants)

  • Alternating between Stylized source and Translated source performs better because while the stylized images enable texture-invariance, the translated images guide the model towards the target style.


Robustness Test

  • To verify the texture-invariance of the model, they test it on perturbed validation sets distorted by various types of noise.
  • They generate noisy Cityscapes validation sets with noises that do not distort the shapes of objects.
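The exact corruption types and severities are not listed in these notes; as an illustration, two simple shape-preserving perturbations of this kind could look as follows:

```python
import torch
import torch.nn.functional as F

def gaussian_noise(img, std=0.1):
    """Additive Gaussian noise on an image tensor in [0, 1]; object shapes are unchanged."""
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

def box_blur(img, kernel_size=5):
    """Local averaging that perturbs texture while preserving object shapes."""
    return F.avg_pool2d(img, kernel_size, stride=1, padding=kernel_size // 2)
```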


Conclusion

  • They present a style transfer based algorithm to adapt to the target texture. The various textures of the stylized datasets work as a regularizer to make the model learn texture-invariant representations.
  • Based on the texture-invariant representation, they use self-training to get direct supervision of the target texture, and finetune the model.