# Notes on "[Unsupervised Multi-Target Domain Adaptation Through Knowledge Distillation](https://arxiv.org/pdf/2007.07077.pdf)" ###### tags: `notes` `unsupervised` `domain-adaptation` Notes Author: [Rohit Lal](https://rohitlal.net/) --- ## Brief Outline - Paper propose a novel unsupervised MTDA approach to train a CNN that can generalize well across multiple target domains. - Their Multi-Teacher MTDA (MTMTDA) method relies on multi-teacher knowledge distillation (KD) to iteratively distill target domain knowledge from multiple teachers to a common student. ## Introduction - Multi-Target Domain Adaptation (MTDA) remains largely unexplored despite its practical importance - The MTDA problem can be addressed by adapting one specialized model per target domain, although this solution is too costly in many real-world applications. - MTDA problems can be solved by producing one model per target domain, this approach becomes costly and impractical in applications with a growing number of target domains. ### Application - For instance, in video-surveillance applications, each camera of a distributed network corresponds to a different non-overlapping viewpoint (target domain). - A DL model for person re-identification should normally be adapted to multiple different camera viewpoints. ## Methodology ![](https://i.imgur.com/rAbQPQW.png) Authors argue that having better preservation of target specificity leads to higher accuracy ### Domain Adaptation of Teachers: - Let us define the source domain as $S=\{x_s, y_s\}$. The set of target domains is defined as $T = \{T_1, T_2, ... , T_n$\}, each one defined as $T_i = \{x_i\}$ - For each target domain $T_i$, we define a teacher model $\phi_i$, and each of these teachers will be adapted to a corresponding target domain using the UDA technique proposed in "Unsupervised domain adaptation by backpropagation" - The domain adaptation of the teacher relies on a domain classifier, a gradient reversal layer (GRL), and the domain confusion loss ![](https://i.imgur.com/Z5wImtK.png) - The final domain adaptation loss is then defined as: ![](https://i.imgur.com/Z2wUESH.png) ### Teacher to Student Knowledge Distillation: ![](https://i.imgur.com/vXqlgtO.png) - employ knowledge distillation based on logits as in "Distilling the knowledge in a neural network." - Logits from a teacher/student model are fed to a temperature-based softmax function, in combination with a KL divergence loss on both the teacher and student outputs: ![](https://i.imgur.com/n0nxTf2.png) ### Multi-Teacher Multi-Target DA: ![](https://i.imgur.com/JQHbhDr.png) - For progressive UDA of teacher models and transfer of knowledge from teacher to the student model, they adapt an exponential growing rate to gradually transfer the importance of UDA to KD - Growth rate ![](https://i.imgur.com/t4FOozC.png). where s is the starting value, f the final value, and $N_e$ the number of total epochs. This growth rate is used to calculate $\beta = s * exp(g* e)$ - overall loss function for optimization of one teacher: ![](https://i.imgur.com/u2cUo7F.png) - instead of using deterministic fusion functions, such as average fusion, employs an alternative learning scheme for knowledge distillation from multiple teachers. ![](https://i.imgur.com/kUr9tAL.png) ## Implementation Details - Datasets used: Digits, Office31, OfficeHome, PACS - MT-MTDA is compared to 1. a lower bound, which is only trained on source and tested on target 2. 
## Implementation Details

- Datasets used: Digits, Office31, OfficeHome, PACS.
- MT-MTDA is compared to:
  1. a lower bound, which is only trained on source and tested on target;
  2. the current state of the art in MTDA with domain labels, such as MTDA-ITA (from "Unsupervised multi-target domain adaptation: An information theoretic approach");
  3. MTDA without domain labels, such as AMEANS;
  4. baseline methods such as RevGrad, which is the basis of their MTDA method, as well as other baselines like DAN and ADDA.
- For the Digits-Five dataset, they employ a LeNet student backbone with ResNet50 teachers (a minimal instantiation sketch appears at the end of these notes).
- For Office31 and OfficeHome, they use an AlexNet student backbone with ResNet50 teacher models; for the comparison with a ResNet50 student backbone, ResNeXt-101 teachers are used.

## Conclusion

- The paper explores a previously unexplored avenue for MTDA: relying on multiple teachers in order to distill knowledge from multiple target domains into a single student.
- Experimental results show that the method outperforms the current state of the art, especially when using compact student models, which can facilitate use in numerous real-time applications.

## Limitations

- The accuracy of each teacher model is determined by the underlying STDA algorithm.
- The transfer of target-domain knowledge needs to be improved when the student model is compact.
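For reference, here is a minimal sketch of how the Digits-Five student/teacher pairing from the implementation details could be instantiated. The LeNet layout, the number of target domains, and the use of torchvision's ResNet-50 are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn
from torchvision import models


class LeNetStudent(nn.Module):
    """Compact LeNet-style student for 32x32 RGB digit images (illustrative layout)."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


# One ResNet-50 teacher per target domain, one shared compact student.
num_target_domains, num_classes = 4, 10
teachers = [models.resnet50(num_classes=num_classes) for _ in range(num_target_domains)]
student = LeNetStudent(num_classes=num_classes)
```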