# Notes on "[Learning Texture Invariant Representation for Domain Adaptation of Semantic Segmentation](https://openaccess.thecvf.com/content_CVPR_2020/html/Kim_Learning_Texture_Invariant_Representation_for_Domain_Adaptation_of_Semantic_Segmentation_CVPR_2020_paper.html)"
###### tags: `notes` `domain-adaptation` `unsupervised` `segmentation` `adversarial` `cvpr20`
Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)
Note: CVPR '20 paper, [Official Code Release](https://github.com/MyeongJin-Kim/Learning-Texture-Invariant-Representation)
## Brief Outline
This paper proposes a method to adapt to the target domain's texture.
1. They diversify the texture of source images using a style transfer algorithm, which prevents the model from overfitting to a specific texture.
2. Then, they finetune the model with self-training to get direct supervision of the target texture.
## Introduction
* To reduce the cost of annotation, synthetic datasets such as GTA5 ([Richter et al. 2016](https://download.visinf.tu-darmstadt.de/data/from_games/)) and Synthia ([Ros et al. 2016](https://synthia-dataset.net/)) have been proposed.
* However, there is a domain gap between this synthetic domain and the real domain, so a model trained on synthetic data does not generalize well to real data. Domain Adaptation addresses this issue by reducing the domain gap.
* One of the approaches, pixel-level adaptation, uses image translation algorithms like CycleGAN ([Zhu et al. 2017](https://arxiv.org/abs/1703.10593)) to reduce the gap in visual appearance between the domains.
* Although CycleGAN works to some extent, it is hard for it to overcome the fundamental difference between the domains: texture. Translated images (by CycleGAN) take on Cityscapes' gray color tone, but not its real texture.
* To overcome this limitation, they propose a method to adapt to the target domain's texture.
* They generate a texture-diversified source dataset using a style transfer algorithm. A model trained on this is guided to learn texture-invariant representations.
* Then, they finetune the model using self-training to get direct supervision of the target texture.
## Methodology
### Stylized GTA5/Synthia
* Adaptive Instance Normalization (AdaIN) ([Huang and Belongie, 2017](https://arxiv.org/abs/1703.06868)) is an efficient style transfer algorithm. However, the authors say that it distorts the content image considerably with wave patterns, so they don't use AdaIN.
    * However, I have personally used AdaIN and it works fine when the degree of stylization is reduced (they probably set it to 1, while it can be any value between 0 and 1).
* The photo-realistic style transfer algorithm ([Li et al. 2018](https://arxiv.org/abs/1802.06474)) preserves the precise structure of the original image using a smoothing step after stylization. However, it preserves the original texture as well, so they don't use this algorithm either.
* Their requirements for stylization are
* Enough stylization to remove the synthetic texture while not distorting the structure of the original image too much.
* The stylization process should be time-efficient since the synthetic dataset has a large volume and high image resolution.
* To generate diverse stylized results, it should be able to transfer various styles.
* Considering these conditions, they choose Style-swap ([Chen and Schmidt, 2016](https://arxiv.org/abs/1612.04337)). Some stylized images are shown below, and a rough sketch of the patch-swap operation follows them.
![Stylization Results](https://i.imgur.com/aj3HASd.jpg)
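As a rough illustration of how Style-swap diversifies texture while keeping structure, here is a PyTorch sketch of its core patch-swap operation on encoder features. This is not the authors' or the original Style-swap implementation; the function name, arguments, and the choice of feature layer are my assumptions, and the full method wraps this operation between a pretrained encoder and a learned inverse network.

```python
import torch
import torch.nn.functional as F

def style_swap(content_feat, style_feat, patch_size=3, stride=1):
    """Sketch of the Style-swap core operation on (1, C, H, W) feature maps
    (e.g. from a pretrained VGG layer). Each content patch is replaced by
    the most similar style patch under normalized cross-correlation."""
    C = content_feat.size(1)

    # Extract k x k style patches as convolution filters: (L, C, k, k)
    patches = F.unfold(style_feat, kernel_size=patch_size, stride=stride)   # (1, C*k*k, L)
    patches = patches.squeeze(0).transpose(0, 1).reshape(-1, C, patch_size, patch_size)

    # Normalize each patch so conv2d computes normalized cross-correlation
    norms = patches.flatten(1).norm(dim=1).clamp_min(1e-8)
    patches_norm = patches / norms.view(-1, 1, 1, 1)

    # Similarity of every content location to every style patch: (1, L, H', W')
    similarity = F.conv2d(content_feat, patches_norm, stride=stride)

    # One-hot selection of the best-matching style patch per location
    one_hot = torch.zeros_like(similarity)
    one_hot.scatter_(1, similarity.argmax(dim=1, keepdim=True), 1.0)

    # Paste the selected (un-normalized) patches back and average the overlaps
    swapped = F.conv_transpose2d(one_hot, patches, stride=stride)
    overlap = F.conv_transpose2d(one_hot, torch.ones_like(patches), stride=stride)
    return swapped / overlap.clamp_min(1e-8)
```

Because each content patch is replaced by its most similar style patch, the spatial layout of the content image is preserved while the local texture comes from the style image, which is exactly the property the authors need for texture diversification.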
### Stage 1 Training
* The goal of this stage is to learn texture-invariant representations using the texture-diversified dataset. They train the segmentation model with both the images stylized by Style-swap and the images translated by CycleGAN.
* They alternate the stylized and translated images at each iteration.
* The stylized images make the model learn texture-invariance while the translated images guide the model towards the target style.
* Additionally, they use output-level adversarial training (like [Tsai et al. 2018](https://arxiv.org/abs/1802.10349)) to further align the feature space between the two domains.
* The following diagram shows the Stage 1 training process (a short sketch of the per-iteration alternation is given after it):
![Stage 1 Diagram](https://i.imgur.com/UL53Onv.png)
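A minimal sketch of the per-iteration alternation described above, assuming two ordinary data loaders (the names and structure are mine, not the authors' code); the corresponding segmentation and discriminator updates are sketched after Eqs. (3) and (4) below.

```python
def stage1_source_stream(stylized_loader, translated_loader, max_iters):
    """Yield one source batch per iteration, alternating between the
    Style-swap-stylized and CycleGAN-translated GTA5/Synthia datasets."""
    def forever(loader):
        # restart the loader whenever it is exhausted
        while True:
            yield from loader

    stylized, translated = forever(stylized_loader), forever(translated_loader)
    for it in range(max_iters):
        # even iterations: stylized source batch; odd iterations: translated source batch
        yield next(stylized) if it % 2 == 0 else next(translated)
```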
### Stage 2 Training
* The goal of this stage is to finetune the segmentation network to the target domain's texture, based on the learned texture-invariant representation. For this, they use a self-training approach.
* Following [Li et al. 2019](https://arxiv.org/abs/1904.10620), they generate pseudo-labels with the model trained in Stage 1.
* They apply a confidence threshold of 0.9 to the predictions on the target training images; only pixels above this threshold are kept as pseudo-labels.
* Then, they finetune the model with the generated pseudo-labels and translated source images. They apply this process iteratively.
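A hedged sketch of the pseudo-label generation step with the stated 0.9 threshold; the function signature and the use of an ignore index for discarded pixels are assumptions, not the authors' implementation.

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(seg_net, target_image, threshold=0.9, ignore_index=255):
    """Keep only pixels whose maximum class probability exceeds the threshold;
    the rest are marked with ignore_index and skipped during fine-tuning."""
    probs = torch.softmax(seg_net(target_image), dim=1)   # (N, C, H, W)
    confidence, labels = probs.max(dim=1)                 # per-pixel confidence and argmax class
    labels[confidence < threshold] = ignore_index
    return labels
```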
### Training Objectives
#### Segmentation Model Training
* Since the GT labels are available only for the source domain, the segmentation loss is given by
$$
L_{seg}(I_s) = -\sum_{h, w}\sum_{c=1}^C y_s^{(h, w, c)} \log P_s^{(h, w, c)}
\tag{1}
$$
* And when a target image is given, they calculate the adversarial loss using the discriminator:
$$
L_{adv}(I_t) = -\sum_{h, w} \log D(P_t)^{(h, w)}
\tag{2}
$$
* Here, $I_s$ and $I_t$ are the input images from the source and target domains, $P_s$ and $P_t$ are the corresponding softmax prediction maps (indexed by spatial position $(h, w)$ and class $c$), $y_s^{(h, w, c)}$ is the one-hot source domain GT label, $C$ is the number of classes, and $D$ is a fully convolutional discriminator whose output $D(\cdot)^{(h, w)}$ is the per-pixel probability that a prediction comes from the source domain.
* Thus, the total loss function for the segmentation network is
$$
L(I_s, I_t) = L_{seg}(I_s) + \lambda_{adv}L_{adv}(I_t)
\tag{3}
$$
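A PyTorch sketch of Eqs. (1)-(3) for one training iteration of the segmentation network. The $\lambda_{adv}$ value, the ignore index, and the single-channel sigmoid discriminator are assumptions; the adversarial term trains the network to make the discriminator label target predictions as source.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(seg_net, discriminator, src_img, src_label, tgt_img, lambda_adv=0.001):
    # Eq. (1): supervised cross-entropy on the stylized/translated source image
    src_logits = seg_net(src_img)                          # (N, C, H, W)
    loss_seg = F.cross_entropy(src_logits, src_label,      # src_label: (N, H, W) long tensor
                               ignore_index=255)

    # Eq. (2): fool the discriminator into labelling target predictions as "source"
    tgt_probs = F.softmax(seg_net(tgt_img), dim=1)
    d_logits = discriminator(tgt_probs)                    # (N, 1, h, w) raw logits
    loss_adv = F.binary_cross_entropy_with_logits(
        d_logits, torch.ones_like(d_logits))               # label 1 = "source" under Eq. (4)

    # Eq. (3): total objective for the segmentation network
    return loss_seg + lambda_adv * loss_adv
```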
#### Discriminator Training
* The discriminator takes the prediction maps and classifies each pixel as belonging to the source or the target domain.
$$
L_D(P) = -\sum_{h, w}\left((1 - z)\log D(P)^{(h, w)} + z \log \left(1 - D(P)^{(h, w)}\right)\right)
\tag{4}
$$
* Here, $P$ is the prediction map of either a source or a target image, with $z=0$ if it comes from the source domain and $z=1$ if it comes from the target domain.
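A matching sketch of the discriminator update in Eq. (4), again assuming a single-channel discriminator trained with binary cross-entropy, where an output of 1 means source and 0 means target.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, src_probs, tgt_probs):
    # src_probs / tgt_probs: softmax outputs of the segmentation network,
    # detached so only the discriminator is updated here
    d_src = discriminator(src_probs.detach())              # (N, 1, h, w) logits
    d_tgt = discriminator(tgt_probs.detach())
    # Eq. (4): z = 0 for source, z = 1 for target; the BCE label is 1 - z,
    # since D predicts the probability of "source"
    loss_src = F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
    loss_tgt = F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt))
    return loss_src + loss_tgt
```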
#### Self-Training
* In Stage 2, they apply the segmentation loss to the generated pseudo-labels on target images to get direct supervision of the target domain's texture.
$$
L_{ST}(I_t) = -\sum_{h, w} \unicode{x1D7D9}_{pseudo}\sum_{c=1}^C \hat{y}_t^{(h, w, c)} \log P_t^{(h, w, c)}
\tag{5}
$$
* Here, $\unicode{x1D7D9}_{pseudo}$ indicates whether each pixel of the target prediction is a pseudo-label or not.
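A sketch of Eq. (5), assuming the pseudo-labels were generated as in the earlier snippet with low-confidence pixels set to an ignore index, so the indicator $\unicode{x1D7D9}_{pseudo}$ is realized as the ignore mask of the cross-entropy.

```python
import torch.nn.functional as F

def self_training_loss(seg_net, target_image, pseudo_labels, ignore_index=255):
    # Eq. (5): cross-entropy only on target pixels that received a pseudo-label;
    # pixels marked with ignore_index (the complement of 1_pseudo) are skipped
    tgt_logits = seg_net(target_image)                     # (N, C, H, W)
    return F.cross_entropy(tgt_logits, pseudo_labels, ignore_index=ignore_index)
```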
## Discussion
### Ablation Study
* In the table below, *Original source only* means training the segmentation network only with original GTA5 images. *Stylized source only* and *Translated source only* use the datasets generated by Style-swap and CycleGAN, respectively.
![Stage 1 Ablation Study](https://i.imgur.com/yF67SBn.png)
* Alternating between *Stylized source* and *Translated source* performs better because while the stylized images enable texture-invariance, the translated images guide the model towards the target style.
![Stage 2 Ablation Study](https://i.imgur.com/d7HXMWU.png)
### Robustness Test
* To verify the texture-invariance of the model, they test it on perturbed validation sets corrupted by various types of noise.
* They generate noisy Cityscapes validation sets using corruptions that do not distort the shapes of objects; an illustrative perturbation sketch follows the figure below.
![Robustness Ablation](https://i.imgur.com/BgOWCQF.png)
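The exact corruptions are not listed in these notes; purely as an illustrative assumption, a perturbed copy of the validation set could be built with shape-preserving corruptions such as additive Gaussian noise and Gaussian blur, e.g.:

```python
import torch
import torchvision.transforms as T

def perturb(image, noise_std=0.05, blur_sigma=1.0):
    """Apply shape-preserving corruptions to a (C, H, W) image tensor in [0, 1].
    These particular corruptions are an illustrative assumption, not the
    paper's exact noise set."""
    blurred = T.GaussianBlur(kernel_size=5, sigma=blur_sigma)(image)
    noisy = blurred + noise_std * torch.randn_like(blurred)
    return noisy.clamp(0.0, 1.0)
```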
## Conclusion
* They present a style-transfer-based algorithm to adapt to the target texture. The various textures of the stylized datasets work as a regularizer, making the model learn texture-invariant representations.
* Based on the texture-invariant representation, they use self-training to get direct supervision of the target texture, and finetune the model.