# Notes on "[Learning Texture Invariant Representation for Domain Adaptation of Semantic Segmentation](https://openaccess.thecvf.com/content_CVPR_2020/html/Kim_Learning_Texture_Invariant_Representation_for_Domain_Adaptation_of_Semantic_Segmentation_CVPR_2020_paper.html)"

###### tags: `notes` `domain-adaptation` `unsupervised` `segmentation` `adversarial` `cvpr20`

Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)

Note: CVPR '20 paper, [Official Code Release](https://github.com/MyeongJin-Kim/Learning-Texture-Invariant-Representation)

## Brief Outline

This paper proposes a method to adapt to the target domain's texture.

1. They diversify the texture of source images using a style transfer algorithm, which prevents the model from overfitting to a specific texture.
2. Then, they finetune the model with self-training to get direct supervision of the target texture.

## Introduction

* To reduce the cost of annotation, synthetic datasets such as GTA5 ([Richter et al. 2016](https://download.visinf.tu-darmstadt.de/data/from_games/)) and Synthia ([Ros et al. 2016](https://synthia-dataset.net/)) have been proposed.
* However, there is a domain gap between the synthetic domain and the real domain, so a model trained on synthetic data does not generalize well to real data. Domain adaptation addresses this issue by reducing the domain gap.
* One class of approaches, pixel-level adaptation, uses image translation algorithms like CycleGAN ([Zhu et al. 2017](https://arxiv.org/abs/1703.10593)) to reduce the gap in visual appearance between the domains.
* Although CycleGAN works to some extent, it struggles to overcome the fundamental difference between the domains: texture. Translated images pick up Cityscapes' gray color tone, but not its real texture.
* To overcome this limitation, they propose a method to adapt to the target domain's texture.
    * They generate a texture-diversified source dataset using a style transfer algorithm. A model trained on this dataset is guided to learn texture-invariant representations.
    * Then, they finetune the model using self-training to get direct supervision of the target texture.

## Methodology

### Stylized GTA5/Synthia

* Adaptive Instance Normalization (AdaIN) ([Huang and Belongie, 2017](https://arxiv.org/abs/1703.06868)) is an efficient style transfer algorithm. However, the authors say that it distorts the content image considerably with wave patterns, so they don't use AdaIN.
    * However, I have personally used AdaIN and it works fine when the degree of stylization is reduced (they probably set the stylization weight to 1, while it can be set anywhere between 0 and 1).
* The photo-realistic style transfer algorithm of [Li et al. 2018](https://arxiv.org/abs/1802.06474) preserves the precise structure of the original image using a smoothing step after stylization. However, that preserves the original texture as well, so they don't use this algorithm either.
* Their requirements for stylization are:
    * Enough stylization to remove the synthetic texture, while not distorting the structure of the original image too much.
    * The stylization process should be time-efficient, since the synthetic dataset has a large volume and high image resolution.
    * To generate diverse stylized results, it should be able to transfer various styles.
* Considering these conditions, they choose Style-swap ([Chen and Schmidt, 2016](https://arxiv.org/abs/1612.04337)). Some stylized images are shown below.

![Stylization Results](https://i.imgur.com/aj3HASd.jpg)
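Since I mention AdaIN's stylization weight above, here is a minimal PyTorch sketch of the core AdaIN operation with an `alpha` blending factor. This is only an illustration of AdaIN (not the Style-swap method the paper actually uses), and in a real pipeline it sits between a pretrained VGG encoder and a trained decoder; the shapes below are placeholders.

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor,
          alpha: float = 0.5, eps: float = 1e-5) -> torch.Tensor:
    """Align channel-wise mean/std of content features to those of the style features.

    content_feat, style_feat: (N, C, H, W) feature maps from an encoder.
    alpha: degree of stylization in [0, 1]; 1 = full AdaIN, 0 = keep content features.
    """
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps

    normalized = (content_feat - c_mean) / c_std
    stylized = normalized * s_std + s_mean
    # Blend with the original content features to control how much the structure is distorted.
    return alpha * stylized + (1.0 - alpha) * content_feat

# Toy usage: random tensors standing in for encoder activations.
content = torch.randn(1, 512, 32, 64)  # e.g. an encoded GTA5 crop
style = torch.randn(1, 512, 32, 64)    # e.g. an encoded style image
out = adain(content, style, alpha=0.5)
print(out.shape)  # torch.Size([1, 512, 32, 64])
```

Lowering `alpha` trades off how much synthetic texture is removed against how much the content structure is distorted, which is exactly the trade-off discussed above.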
### Stage 1 Training

* The goal of this stage is to learn texture-invariant representations using the texture-diversified dataset. They train the segmentation model with both the images stylized by Style-swap and the images translated by CycleGAN.
    * They alternate between the stylized and translated images at each iteration.
    * The stylized images make the model learn texture-invariance, while the translated images guide the model towards the target style.
* Additionally, they use output-level adversarial training (as in [Tsai et al. 2018](https://arxiv.org/abs/1802.10349)) to further align the feature space between the two domains.
* The following diagram shows the process of Stage 1:

![Stage 1 Diagram](https://i.imgur.com/UL53Onv.png)

### Stage 2 Training

* The goal of this stage is to finetune the segmentation network to the target domain's texture, based on the learned texture-invariant representation. For this, they use a self-training approach.
* Following [Li et al. 2019](https://arxiv.org/abs/1904.10620), they generate pseudo-labels with the model trained in Stage 1.
    * Predictions on the target training images with confidence above a threshold of 0.9 are kept as pseudo-labels.
* Then, they finetune the model with the generated pseudo-labels and the translated source images. They apply this process iteratively.

### Training Objectives

#### Segmentation Model Training

* Since GT labels are available only for the source domain, the segmentation loss is given by

$$
L_{seg}(I_s) = -\sum_{h, w}\sum_{c=1}^C y_s^{(h, w, c)} \log P_s^{(h, w, c)} \tag{1}
$$

* When a target image is given, they calculate the adversarial loss using the discriminator:

$$
L_{adv}(I_t) = -\sum_{h, w} \log D(P_t^{(h, w, c)}) \tag{2}
$$

* Here, $I_s$ and $I_t$ are the input images from the source and target domains, $P_s^{(h, w, c)}$ and $P_t^{(h, w, c)}$ are the final feature maps of the source and target images, $y_s^{(h, w, c)}$ is the source domain GT label, $C$ is the number of classes, and $D$ is a fully convolutional discriminator.
* Thus, the total loss function for the segmentation network is

$$
L(I_s, I_t) = L_{seg}(I_s) + \lambda_{adv}L_{adv}(I_t) \tag{3}
$$

#### Discriminator Training

* The discriminator takes the output features and classifies them as coming from the source or the target domain:

$$
L_D(P) = -\sum_{h, w}\big((1 - z)\log D(P_s^{(h, w, c)}) + z \log D(P_t^{(h, w, c)})\big) \tag{4}
$$

* Here, $z=0$ if the feature is from the source domain and $z=1$ if the feature is from the target domain.

#### Self-Training

* In Stage 2, they use the segmentation loss on the generated pseudo-labels of target images to get direct supervision of the target domain's texture:

$$
L_{ST}(I_t) = -\sum_{h, w} \unicode{x1D7D9}_{pseudo}\sum_{c=1}^C \hat{y}_t^{(h, w, c)} \log P_t^{(h, w, c)} \tag{5}
$$

* Here, $\unicode{x1D7D9}_{pseudo}$ indicates whether each pixel of the target prediction is a pseudo-label or not.
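Below is a hedged PyTorch sketch of how Eqs. (1)–(5) could be implemented. The tensor shapes, the 0/1 label convention for the discriminator, and the use of `ignore_index=255` are my own assumptions for illustration; only the overall loss structure follows the equations above.

```python
import torch
import torch.nn.functional as F

def seg_loss(pred_s, label_s):
    # Eq. (1): cross-entropy on source predictions.
    # pred_s: (N, C, H, W) logits; label_s: (N, H, W) class indices.
    return F.cross_entropy(pred_s, label_s, ignore_index=255)

def adv_loss(d_out_t):
    # Eq. (2): push the discriminator to label target outputs as "source" (z = 0 here).
    # d_out_t: discriminator logits computed on the target prediction map.
    return F.binary_cross_entropy_with_logits(d_out_t, torch.zeros_like(d_out_t))

def disc_loss(d_out_s, d_out_t):
    # Eq. (4): classify source outputs as z = 0 and target outputs as z = 1.
    loss_s = F.binary_cross_entropy_with_logits(d_out_s, torch.zeros_like(d_out_s))
    loss_t = F.binary_cross_entropy_with_logits(d_out_t, torch.ones_like(d_out_t))
    return loss_s + loss_t

def self_training_loss(pred_t, prob_t_stage1, thresh=0.9):
    # Eq. (5): cross-entropy against pseudo-labels from the Stage 1 model,
    # keeping only pixels whose max softmax confidence exceeds `thresh` (0.9).
    conf, pseudo = prob_t_stage1.max(dim=1)   # both (N, H, W)
    pseudo = pseudo.clone()
    pseudo[conf < thresh] = 255               # low-confidence pixels are ignored below
    return F.cross_entropy(pred_t, pseudo, ignore_index=255)
```

The Stage 1 objective of Eq. (3) would then be `seg_loss(...) + lambda_adv * adv_loss(...)` for the segmentation network, with the discriminator updated separately using `disc_loss`.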
## Discussion

### Ablation Study

* In the table below, *Original source only* means training the segmentation network only with the original GTA5 images. *Stylized source only* and *Translated source only* use the datasets generated by Style-swap and CycleGAN, respectively.

![Stage 1 Ablation Study](https://i.imgur.com/yF67SBn.png)

* Alternating between *Stylized source* and *Translated source* performs better because, while the stylized images enable texture-invariance, the translated images guide the model towards the target style.

![Stage 2 Ablation Study](https://i.imgur.com/d7HXMWU.png)

### Robustness Test

* To verify the texture-invariance of the model, they test it on perturbed validation sets distorted by various noises.
* They generate noisy Cityscapes validation sets with noises that do not distort the shapes of objects (a small illustrative sketch of such perturbations is included at the end of these notes).

![Robustness Ablation](https://i.imgur.com/BgOWCQF.png)

## Conclusion

* They present a style transfer based method to adapt to the target texture. The varied textures of the stylized dataset act as a regularizer that makes the model learn texture-invariant representations.
* Based on the texture-invariant representation, they use self-training to get direct supervision of the target texture and finetune the model.
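As an aside to the robustness test above, here is a small sketch of the kind of shape-preserving perturbations one could apply to validation images. The specific noise types (additive Gaussian noise, blur, color jitter) and their magnitudes are my own illustrative assumptions; these notes don't record the paper's exact noise settings.

```python
import torch
from torchvision import transforms

def add_gaussian_noise(img: torch.Tensor, std: float = 0.05) -> torch.Tensor:
    # img: float tensor in [0, 1], shape (C, H, W).
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)

# Texture/color perturbations that leave object shapes (and hence the GT labels) intact.
perturbations = {
    "gaussian_noise": add_gaussian_noise,
    "blur": transforms.GaussianBlur(kernel_size=5, sigma=1.0),
    "color_jitter": transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
}

img = torch.rand(3, 512, 1024)  # stand-in for a Cityscapes validation image
noisy_versions = {name: fn(img) for name, fn in perturbations.items()}
print({name: tuple(v.shape) for name, v in noisy_versions.items()})
```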