# Notes on "[Consistency Regularization with High-dimensional Non-adversarial Source-guided Perturbation for UDA in Segmentation](https://arxiv.org/abs/2009.08610)"

###### tags: `notes` `domain-adaptation` `unsupervised` `segmentation` `aaai21` `adain` `augmentation`

Notes author: [Akshay Kulkarni](https://akshayk07.weebly.com/)

AAAI '21 paper; [Code Release](https://github.com/wangkaihong/BiSIDA)

## Brief Outline
They propose a bidirectional style-induced DA method (BiSIDA) that employs consistency regularization to efficiently exploit information from the unlabeled target dataset using a simple neural style transfer model.

## Introduction
* To perform domain alignment at the pixel level or feature level, existing works ([Tsai *et al.* CVPR '18](https://openaccess.thecvf.com/content_cvpr_2018/html/Tsai_Learning_to_Adapt_CVPR_2018_paper.html), [Hoffman *et al.* ICML '18](http://proceedings.mlr.press/v80/hoffman18a.html)) typically use adversarial training, while training on the aligned data is supervised with the labeled source data. However, this introduces extra complexity and instability in training.
* Alternative approaches ([Zou *et al.* ECCV '18](https://arxiv.org/abs/1810.07911), [Vu *et al.* CVPR '19](https://arxiv.org/abs/1811.12833)) seek to exploit information from the unlabeled target data through semi-supervised learning techniques, including entropy minimization, pseudo-labeling, and consistency regularization. However, these either play an auxiliary role beside supervised learning or fail to take full advantage of the target data.
* This work proposes a 2-stage pipeline:
    * In the supervised learning phase, a style-induced image generator translates source images with different styles to align them towards the target domain.
    * In the unsupervised phase, they apply high-dimensional perturbations to target domain images and enforce consistency regularization.

## Methodology
### Background
#### Adaptive Instance Normalization (AdaIN)
* Given a content image $c$ and a style image $s$ from another domain, an image that mimics the style of $s$ while preserving the content of $c$ is synthesized.
* Formally, the feature maps of the content image $c$ and the style image $s$ through an encoder $f$ can be represented as $t^c = f(c)$ and $t^s = f(s)$. Normalizing the mean and standard deviation for each channel of $t^c$ and $t^s$, the target feature map $\hat{t}$ is produced as follows:

$$
\hat{t} = \text{AdaIN}(t^c, t^s) = \sigma(t^s) \frac{t^c - \mu(t^c)}{\sigma(t^c)} + \mu(t^s) \tag{1}
$$

* Here, $\mu(t^*)$ and $\sigma(t^*)$ are the channel-wise mean and standard deviation of the feature map $t^*$.

#### Self-ensembling
* To stabilize the generation of pseudo-labels, they employ self-ensembling ([Tarvainen and Valpola, NeurIPS '17](https://arxiv.org/abs/1703.01780)), which consists of a segmentation network as the student $F^s$ and a teacher network $F^t$ with the same architecture.
* The teacher is essentially a temporal ensemble of the student network, so radical changes in the teacher's weights are alleviated and more informed predictions can be made.
* The weights $\theta_i^t$ of the teacher $F^t$ at the $i^\text{th}$ iteration are updated as the exponential moving average of the student weights $\theta_i^s$, *i.e.* $\theta_i^t = \eta \theta_{i-1}^t + (1-\eta)\theta_i^s$, where $\eta$ is the decay factor.
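
A minimal PyTorch sketch of the AdaIN operation in Eq. 1 (not the authors' code; `content_feat` and `style_feat` are assumed to be encoder feature maps of shape `(N, C, H, W)`):

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Eq. 1: re-normalize content features with the style's channel-wise statistics."""
    n, c = content_feat.shape[:2]
    # Flatten spatial dims so channel-wise statistics are easy to compute.
    c_flat = content_feat.view(n, c, -1)
    s_flat = style_feat.view(n, c, -1)
    c_mean, c_std = c_flat.mean(-1, keepdim=True), c_flat.std(-1, keepdim=True) + eps
    s_mean, s_std = s_flat.mean(-1, keepdim=True), s_flat.std(-1, keepdim=True) + eps
    # Normalize the content features, then re-scale/shift with the style statistics.
    normalized = (c_flat - c_mean) / c_std
    return (normalized * s_std + s_mean).view_as(content_feat)
```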
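
Similarly, a sketch of the exponential-moving-average teacher update, assuming `teacher` and `student` are two modules with identical architecture and `eta` is the decay factor $\eta$:

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module, eta: float = 0.999) -> None:
    """theta_t <- eta * theta_t + (1 - eta) * theta_s, applied parameter-wise."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(eta).add_(s_param, alpha=1.0 - eta)
```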
![Approach](https://i.imgur.com/pqcrQAb.png)

### Approach
#### Continuous style-induced image generator
* They synthesize an image with the combined style of a source and a target image, controlled by a content-style trade-off parameter $\alpha \in [0, 1]$, through the generator $G$ (a minimal sketch is given further below):

$$
G(c, s, \alpha) = g(\alpha \hat{t} + (1-\alpha)t^c) \tag{2}
$$

* Here, $\alpha=0$ reconstructs the content image, while $\alpha=1$ decodes the fully stylized features $\hat{t}$; intermediate values interpolate between the content features and the stylized features.

#### Target-guided supervised learning
* Given a source dataset $\{(x_i^\mathcal{S}, y_i^\mathcal{S})\}_{i=1}^{N^\mathcal{S}}$ and a target dataset $\{x_i^\mathcal{T}\}$, they first perform a random color space perturbation $\mathcal{A}$ on a source image $x_s$ to get $\mathcal{A}(x_s)$, which enhances randomness.
* This augmented image is passed through the style-induced generator $G$ as a stronger augmentation, using a target image $x_t$ as the style image and a trade-off parameter $\alpha$ sampled from a uniform distribution $\mathcal{U}(0, 1)$, to get $\hat{x}_s = G(\mathcal{A}(x_s), x_t, \alpha)$.
* The translation is applied with probability $p_{s\to t}$; otherwise $\hat{x}_s=\mathcal{A}(x_s)$ is used, so the model still trains on full-resolution details, since the translation incurs a loss of resolution.
* Finally, the supervised loss $L_s$ (CE loss) between the probability map $p_s=F^s(\hat{x}_s)$ and the label $y_s$ is given by:

$$
L_s = -\frac{1}{HW} \sum_{m=1}^{H \times W} \sum_{c=1}^C y_s^{mc}\log(p_s^{mc}) \tag{3}
$$

* Using a strong and directed augmentation method facilitates generalization towards different styles and further enables adaptation in the direction of the target domain.

#### Source-guided unsupervised learning
* Since the model is better adapted to the source domain, where supervised learning is performed, the quality of the produced pseudo-labels is generally higher when the input is closer to the source domain. Consequently, pseudo-labels are computed from target images transferred towards the appearance of the source domain.
* First, they perform the random color space perturbation $\mathcal{A}$ on a target image $x_t$ to get $\mathcal{A}(x_t)$. Then, $\mathcal{A}(x_t)$ is further augmented through $G$ using $k$ randomly sampled source images $\{x_s^i\}_{i=1}^k$ as style images, again with probability $p_{t \to s}$ because of the loss of resolution.
* After the augmentation, the transformed images $\{\hat{x}_t^i\}_{i=1}^k$ are passed through the teacher model $F^t$ individually to acquire more stable predictions $\hat{y}^i=F^t(\hat{x}_t^i)$. These are averaged to get the probability map $p_l=\frac{1}{k}\sum_{i=1}^k\hat{y}^i$ for the pseudo-label (see the sketch further below).
* They sharpen ([Berthelot *et al.* NeurIPS '19](https://proceedings.neurips.cc/paper/2019/hash/1cd138d0499a68f4bb72bee04bbec2d7-Abstract.html)) the predictions before aggregation (making them more peaky). Finally, the pseudo-label $q = \arg\max (p_l)$ is obtained, which is used to compute the unsupervised loss $L_u$ in a supervised manner.
* The class imbalance in the dataset biases the model towards popular or easy classes, especially when relying on semi-supervised techniques like pseudo-labeling. To address this, they add a simple weighting to the loss based on the proportion of pixels belonging to each class in the source data. Note: over-complicating a simple training trick by providing explicit equations!
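
A minimal sketch of the continuous style-induced generator in Eq. 2, assuming a pretrained AdaIN encoder `f` and decoder `g` (hypothetical modules standing in for the actual style-transfer network), the `adain` helper sketched earlier, and that the style-transfer network is kept frozen during adaptation:

```python
import torch

def style_transfer(f: torch.nn.Module, g: torch.nn.Module,
                   content: torch.Tensor, style: torch.Tensor,
                   alpha: float) -> torch.Tensor:
    """Eq. 2: decode an interpolation between content features and stylized features.

    alpha = 0 reconstructs the content image; alpha = 1 gives the fully stylized image.
    """
    with torch.no_grad():  # assumed: the style-transfer network is not updated here
        t_c = f(content)               # content feature map t^c
        t_hat = adain(t_c, f(style))   # stylized feature map t_hat (Eq. 1)
        mixed = alpha * t_hat + (1.0 - alpha) * t_c
        return g(mixed)
```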
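
And a sketch of the source-guided pseudo-labeling step: sharpened teacher predictions over the $k$ source-stylized views of a target image are averaged, and the argmax gives the pixel-wise pseudo-label. The sharpening temperature and the helper names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def sharpen(prob: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Make class probabilities peakier (Berthelot et al., 2019); dim 1 is the class axis."""
    p = prob ** (1.0 / temperature)
    return p / p.sum(dim=1, keepdim=True)

@torch.no_grad()
def source_guided_pseudo_label(teacher: torch.nn.Module,
                               translated_targets: list,  # k source-stylized views of one target image
                               temperature: float = 0.5) -> torch.Tensor:
    """Average sharpened teacher predictions over the k views and take the argmax."""
    probs = [sharpen(F.softmax(teacher(x), dim=1), temperature) for x in translated_targets]
    p_l = torch.stack(probs, dim=0).mean(dim=0)  # aggregated probability map p_l
    return p_l.argmax(dim=1)                     # hard pseudo-label q per pixel
```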
#### Optimization
* The final loss $L$, given the unsupervised loss weight $\lambda_u$, follows a multi-task learning formulation:

$$
L = L_s + \lambda_u L_u \tag{4}
$$

* During training, the student network $F^s$ is updated with the gradients obtained by back-propagating the loss $L$, while the weights of the teacher network $F^t$ are updated as the exponential moving average of the student's weights (a minimal training-step sketch is given at the end of these notes).

## Conclusion
* They propose a Bidirectional Style-induced DA (BiSIDA) framework that optimizes a segmentation model via target-guided supervised learning and source-guided unsupervised learning.
* Using a continuous style-induced generator, they show the effectiveness of learning from the unlabeled target domain by providing high-dimensional perturbations for consistency regularization.
* They also show that alignment between source and target in both directions is achievable without adversarial training.
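
Putting the pieces together, a minimal training-step sketch under the assumptions above (the class weighting is applied to the unsupervised CE term, and all helper names, e.g. `update_teacher`, come from the earlier sketches; the released code may differ in details):

```python
import torch
import torch.nn.functional as F

def training_step(student, teacher, optimizer,
                  x_s_aug, y_s,       # target-guided augmented source image and its label
                  x_t_aug, q_t,       # perturbed target image and its pseudo-label (argmax of p_l)
                  class_weights,      # assumed per-class weights from source pixel proportions
                  lambda_u=1.0, eta=0.999):
    # Supervised loss (Eq. 3) on the (possibly style-transferred) source image.
    loss_s = F.cross_entropy(student(x_s_aug), y_s)
    # Unsupervised consistency loss on the perturbed target image against the pseudo-label.
    loss_u = F.cross_entropy(student(x_t_aug), q_t, weight=class_weights)
    # Final multi-task objective (Eq. 4).
    loss = loss_s + lambda_u * loss_u
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Teacher follows the student via EMA (see update_teacher above).
    update_teacher(teacher, student, eta)
    return loss.item()
```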