# Notes on "[Prototypical Pseudo Label Denoising and Target Structure Learning for DA sem. seg.](https://arxiv.org/abs/2101.10979)" ###### tags: `notes` Notes Author: [Akshay Kulkarni](https://akshayk07.weebly.com/) CVPR '21 paper; [Code Release](https://github.com/microsoft/ProDA) SOTA on sem. seg. UDA benchmarks (as of 10/04/21). However, their code shows use of DeepLabv3+ type of decoder instead of DeepLabv2 (as claimed in the paper). Thus, it is unclear whether performance improvements are architecture-dependent or not. ## Brief Outline They use representative prototypes (class feature centroids) to address 2 issues in self-training for UDA in semantic segmentation. * Exploit the feature distances from prototypes to estimate likelihood of pseudo-labels to facilitate online correction during training. * Align prototypical assignments based on relative feature distances for 2 different views of the same target, producing a more compact target feature space. Further, distilling the already learned knowledge to a self-supervised pretrained model further boosts performance. ## Introduction * Self-training has recently emerged as a simple yet competitive approach for UDA rather than explicitly aligning the distributions of source and targets (adversarial alignment). * Two key ingredients are lacking in self-training: * Typical practices select pseudo-labels according to a strict confidence threshold. Since high scores are not necessarily correct, network fails to learn reliable knowledge in the target domain. * Due to the domain gap, network is prone to produce dispersed features in the target domain. It is likely that for target data, the closer to the source distribution, the higher the confidence score. So, data far from source distribution (*i.e.* low scores) will never be considered during training. ![Issues in self-training](https://i.imgur.com/cFITgdB.png) * This work proposes to online denoise pseudo-labels and learn a compact target structure to address the above 2 issues respectively. They use **prototypes** *i.e.* class-wise feature centroids to accomplish the 2 tasks: * Rectify pseudo-labels by estimating the class-wise likelihoods according to its relative feature distances to all class prototypes. Prototypes are computed on-the-fly and thus, pseudo-labels are progressively corrected throughout training. * Inspired by Deepcluster ([Caron *et al.* ECCV '18](https://openaccess.thecvf.com/content_ECCV_2018/papers/Mathilde_Caron_Deep_Clustering_for_ECCV_2018_paper.pdf)), they learn the intrinsic structure of the target domain. They propose to align soft prototypical assignments for different views of the same target, which produces a more compact target feature space. * They call their method **ProDA** as they rely heavily on prototypes for DA. * Further, they find that DA can benefit from task-agnostic pretraining. Distilling the knowledge to a self-supervised model ([SimCLRv2, NeurIPS '20](https://arxiv.org/abs/2006.10029)) further boosts the performance. ## Methodology ### Preliminaries * Given source dataset $\mathcal{X}_s = \{ x_s \}_{j=1}^{n_s}$ with labels $\mathcal{Y}_s = \{ y_s \}_{j=1}^{n_s}$, aim is to train a segmentation network to achieve low risk on the unlabeled target dataset $\mathcal{X}_t = \{ x_t \}_{j=1}^{n_t}$ where classes are same across domains. * Typically, source-trained models cannot generalize well to target data. 
* To transfer the knowledge, traditional self-training techniques optimize the categorical cross-entropy (CE) with pseudo-labels $\hat{y}_t$:
$$ l_{ce}^t = -\sum_{i=1}^{H \times W} \sum_{k=1}^{K} \hat{y}_t^{(i, k)} \log p_t^{(i, k)} \tag{1} $$
* Typically, the most probable class predicted by the source n/w is used as the pseudo-label:
$$ \hat{y}_t^{(i, k)} = \begin{cases} 1, & \text{if } k=\arg\max_{k'}p_t^{(i, k')} \\ 0, & \text{otherwise} \\ \end{cases} \tag{2} $$
* This conversion from soft predictions to hard labels is denoted by $\hat{y}_t = \xi(p_t)$. Further, in practice, only pixels whose prediction confidence exceeds a given threshold are used as pseudo-labels (due to noise in the predictions).

### Prototypical pseudo-label denoising

* Updating the pseudo-labels after one training stage is too late, as the n/w may have already overfitted to the noisy labels. On the other hand, simultaneously updating pseudo-labels and n/w weights is prone to give trivial solutions.
* The key is to fix the soft pseudo-labels and progressively weight them by class-wise probabilities, with the update in accordance with freshly learned knowledge. Formally, they propose to use the weighted pseudo-labels for self-training:
$$ \hat{y}_t^{(i, k)} = \xi(w_t^{(i, k)}p_{t, 0}^{(i, k)}) \tag{3} $$
* Here, $w_t^{(i, k)}$ is the proposed weight for modulating the probability (and it changes as training proceeds), whereas $p_{t, 0}^{(i, k)}$ is initialized by the source model and remains fixed throughout training (a boilerplate for subsequent refinement).
* They use distances from the prototypes to gradually rectify the pseudo-labels. Let $f(x_t)^{(i)}$ represent the feature of $x_t$ at index $i$. If it is far from the prototype $\eta^{(k)}$ (feature centroid of class $k$), it is more probable that the learned feature is an outlier, hence its probability of being classified into the $k^\text{th}$ category is downweighted.
* Concretely, the modulation weight is defined as the softmax over feature distances to the prototypes (a small code sketch follows after this subsection):
$$ w_t^{(i, k)} = \frac{\exp(-||\tilde{f}(x_t)^{(i)}-\eta^{(k)}|| / \tau)}{\sum_{k'}\exp(-||\tilde{f}(x_t)^{(i)}-\eta^{(k')}|| / \tau)} \tag{4} $$
* Here, $\tilde{f}$ denotes a momentum encoder ([He *et al.* CVPR '20](https://openaccess.thecvf.com/content_CVPR_2020/html/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.html)) of the feature extractor $f$, since a reliable feature estimate of $x_t$ is desired, and $\tau$ is the softmax temperature (empirically, $\tau=1$). In other words, $w_t^{(i, k)}$ approximates the trust confidence of pixel $x_t^{(i)}$ belonging to the $k^\text{th}$ class.
* Note about the momentum encoder: $\tilde{f}$ is architecturally the same as $f$. However, $f$ is updated at every iteration using the gradients from backprop, whereas $\tilde{f}$ is updated as an exponential moving average of the weights of $f$, *i.e.* $\tilde{f}\leftarrow \lambda \tilde{f}+ (1-\lambda) f$. Thus, $\tilde{f}$ cannot be used to backpropagate the loss, while $f$ can.

#### Prototype computation

* The proposed method requires computation of the prototypes on-the-fly. At the beginning, the prototypes are initialized according to the predicted pseudo-labels $\hat{y}_t$ for the target domain images as:
$$ \eta^{(k)} = \frac{\sum_{x_t \in \mathcal{X}_t} \sum_i f(x_t)^{(i)} \cdot \unicode{x1D7D9}(\hat{y}_t^{(i, k)} = 1)}{\sum_{x_t\in\mathcal{X}_t}\sum_i\unicode{x1D7D9}(\hat{y}_t^{(i, k)} = 1)} \tag{5} $$
* Here, $\unicode{x1D7D9}$ is the indicator function. However, such computation over the entire dataset is expensive during training. Thus, they estimate the prototypes as a moving average of the cluster centroids in mini-batches to track the slowly moving prototypes. Formally,
$$ \eta^{(k)} \leftarrow \lambda\eta^{(k)} + (1 - \lambda)\eta'^{(k)} \tag{6} $$
* Here, $\eta'^{(k)}$ is the mean feature of class $k$ calculated within the current training batch from the momentum encoder, and $\lambda$ is the momentum coefficient, set to $0.9999$.
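Below is a minimal PyTorch sketch of the prototypical weighting (Eq. 4), the pseudo-label rectification (Eq. 3), and the EMA prototype update (Eq. 6). The function names, tensor shapes, and the use of `torch.cdist` are my own assumptions for illustration; this is not taken from the official ProDA code.

```python
import torch
import torch.nn.functional as F

def prototype_weights(feat, prototypes, tau=1.0):
    """Soft modulation weights w_t (Eq. 4): softmax over negative
    feature-to-prototype distances.

    feat:       (N, D, H, W) features, ideally from the momentum encoder f~
    prototypes: (K, D) class feature centroids eta
    returns:    (N, K, H, W) weights
    """
    N, D, H, W = feat.shape
    f = feat.permute(0, 2, 3, 1).reshape(-1, D)    # (N*H*W, D)
    dist = torch.cdist(f, prototypes)              # (N*H*W, K) L2 distances
    w = F.softmax(-dist / tau, dim=1)              # closer prototype -> larger weight
    return w.reshape(N, H, W, -1).permute(0, 3, 1, 2)

def rectify_pseudo_labels(w, p_t0):
    """Rectified hard pseudo-labels (Eq. 3): argmax of the weighted,
    fixed source-model probabilities p_{t,0} (both (N, K, H, W))."""
    return (w * p_t0).argmax(dim=1)                # (N, H, W) class indices

@torch.no_grad()
def update_prototypes(prototypes, feat, pseudo_label, momentum=0.9999):
    """EMA prototype update (Eq. 6) using the mini-batch class means eta'."""
    N, D, H, W = feat.shape
    f = feat.permute(0, 2, 3, 1).reshape(-1, D)
    lbl = pseudo_label.reshape(-1)
    for k in range(prototypes.shape[0]):
        mask = lbl == k
        if mask.any():                             # skip classes absent from this batch
            batch_mean = f[mask].mean(dim=0)
            prototypes[k] = momentum * prototypes[k] + (1 - momentum) * batch_mean
    return prototypes
```

In this sketch, the features passed to `prototype_weights` and `update_prototypes` would come from the momentum encoder $\tilde{f}$, and the prototypes would be initialized once with Eq. 5 before training starts.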
#### Pseudo-label training loss

* Instead of the standard CE loss, they use the more robust symmetric cross-entropy (SCE) loss ([Wang *et al.* ICCV '19](https://arxiv.org/abs/1908.06112)) to further enhance noise tolerance and stabilize early training. Specifically, they enforce
$$ l_{sce}^t = \alpha l_{ce}(p_t, \hat{y}_t) + \beta l_{ce}(\hat{y}_t, p_t) \tag{7} $$
* Here, $\alpha=0.1$ and $\beta=1.0$ are balancing coefficients.

#### Why are prototypes useful for pseudo-label denoising?

1. The prototypes are less sensitive to the outliers (wrong pseudo-labels), which are assumed to be the minority.
2. The prototypes treat different classes equally regardless of occurrence frequency, which is useful due to the class imbalance in semantic segmentation.

* Also, see Fig. 1a for a visual explanation.

### Structure learning by enforcing consistency

* Pseudo-labels can be denoised when the feature extractor $f$ generates compact target features. However, due to the domain gap, the generated target features are likely to be dispersed (Fig. 1b).
* To achieve compact target features, they aim to learn the underlying structure of the target domain. They use the prototypical assignment under weak augmentation to guide the learning for the strongly augmented view.
* Let $\mathcal{T}(x_t)$ and $\mathcal{T}'(x_t)$ respectively denote the weak and strong augmented views of $x_t$. They use the momentum encoder $\tilde{f}$ to generate a reliable prototypical assignment for $\mathcal{T}(x_t)$:
$$ z_\mathcal{T}^{(i, k)} = \frac{\exp(-|| \tilde{f}(\mathcal{T}(x_t))^{(i)}-\eta^{(k)} || / \tau)}{\sum_{k'}\exp(-|| \tilde{f}(\mathcal{T}(x_t))^{(i)}-\eta^{(k')} || / \tau)} \tag{8} $$
$$ z_{\mathcal{T}'}^{(i, k)} = \frac{\exp(-|| f(\mathcal{T}'(x_t))^{(i)}-\eta^{(k)} || / \tau)}{\sum_{k'}\exp(-|| f(\mathcal{T}'(x_t))^{(i)}-\eta^{(k')} || / \tau)} \tag{8.5} $$
* The soft assignment $z_{\mathcal{T}'}$ for $\mathcal{T}'(x_t)$ is obtained similarly, except that the current trainable feature extractor $f$ is used. Since $z_{\mathcal{T}}$ is more reliable (its feature comes from the momentum encoder and the input suffers less distortion), they use it to teach $f$ to produce consistent assignments for $\mathcal{T}'(x_t)$.
* Hence, they minimize the KL divergence between the prototypical assignments of the 2 views (see the sketch at the end of this subsection):
$$ l_{kl}^t = \text{KL}(z_\mathcal{T} || z_{\mathcal{T}'}) \tag{9} $$
* Intuitively, this enforces the n/w to give consistent prototypical labeling for adjacent feature points, resulting in a more compact target feature space.
* The proposed method may suffer from a degeneration issue, *i.e.* one cluster becomes empty. To amend this, they use a regularization term ([Zou *et al.* ICCV '19](https://arxiv.org/abs/1908.09822)) which encourages the output to be evenly distributed over the different classes:
$$ l_{reg}^t = -\sum_{i=1}^{H \times W}\sum_{k=1}^K \log p_t^{(i, k)} \tag{10} $$
* They train the DA n/w with the following total loss:
$$ l_{total} = l_{ce}^s + l_{sce}^t + \gamma_1 l_{kl}^t + \gamma_2 l_{reg}^t \tag{11} $$
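To make the consistency term concrete, here is a rough PyTorch sketch of Eqs. 8 and 9. It mirrors the soft-assignment computation from the earlier snippet (redefined here so the block is self-contained); the augmentation callables, tensor shapes, and the `1e-8` clamp are assumptions for illustration, not the official implementation.

```python
import torch
import torch.nn.functional as F

def soft_assignment(feat, prototypes, tau=1.0):
    """Soft prototypical assignment z (Eq. 8): softmax over negative
    pixel-feature-to-prototype distances, flattened over pixels."""
    N, D, H, W = feat.shape
    f = feat.permute(0, 2, 3, 1).reshape(-1, D)   # (N*H*W, D)
    dist = torch.cdist(f, prototypes)             # (N*H*W, K)
    return F.softmax(-dist / tau, dim=1)

def structure_consistency_loss(f, f_momentum, x_t, weak_aug, strong_aug, prototypes):
    """KL consistency between the two views (Eq. 9): the weak view through
    the momentum encoder is the teacher, the strong view through the
    trainable extractor f is the student."""
    with torch.no_grad():                         # no gradients through the teacher
        z_weak = soft_assignment(f_momentum(weak_aug(x_t)), prototypes)
    z_strong = soft_assignment(f(strong_aug(x_t)), prototypes)
    # KL(z_T || z_T'); F.kl_div expects the student's log-probabilities first
    return F.kl_div(z_strong.clamp_min(1e-8).log(), z_weak, reduction='batchmean')
```

This term would then be weighted by $\gamma_1$ and added to the source CE loss, the SCE pseudo-label loss, and the regularizer, as in Eq. 11.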
### Distillation to self-supervised model

* After training with Eq. 11 has converged, they further transfer knowledge from the learned target model to a student model with the same architecture, but pretrained in a self-supervised manner.
* They initialize the student feature extractor $h^\dagger$ with SimCLRv2 pretrained weights and apply a knowledge distillation (KD) loss (a KL divergence loss).
* Besides, following the self-training paradigm, the teacher model $h$ generates one-hot pseudo-labels $\xi(p_t)$ to teach the student model.
* To prevent the model from forgetting the source domain, source images are also utilized. Altogether, the student model is trained with:
$$ l_{\text{KD}} = l_{ce}^s(p_s, y_s) + l_{ce}^t(p_t^\dagger, \xi(p_t)) + \beta \text{KL}(p_t || p_t^\dagger) \tag{12} $$
* Here, $p_t^\dagger = h^\dagger(x_t)$ is the output of the student model and $\beta=1$. In practice, such self-distillation can be applied multiple times after model convergence to further boost the DA performance.

## Ablation Study

![Ablations Table](https://i.imgur.com/Lh7Zmik.png)

## Conclusion

* They propose *ProDA*, which resorts to prototypes to denoise the pseudo-labels online and learn a compact target feature space.
* Knowledge distillation to a self-supervised pretrained model further boosts the performance.