# Notes on "[Unsupervised Domain Adaptation with Residual Transfer Networks](https://papers.nips.cc/paper/6110-unsupervised-domain-adaptation-with-residual-transfer-networks.pdf)"
###### tags: `notes` `unsupervised` `domain-adaptation`
Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)
Note: The paper was published at NIPS'16 (NeurIPS since 2018).
## Brief Outline
A domain adaptation approach that can jointly learn adaptive classifiers and transferable features from labeled data in the source domain and unlabeled data in the target domain.
## Introduction
* Domain Adaptation (DA) is machine learning under the shift between training and test distributions.
* Several DA approaches aim to bridge the source and target domains by learning domain-invariant feature representations without using target labels, so that the classifier learnt on the source domain can be used on the target domain.
* [Donahue et al. 2014](https://arxiv.org/abs/1310.1531) and [Yosinski et al. 2014](https://papers.nips.cc/paper/5347-how-transferable-are-features-in-deep-neural-networks) show that deep networks can learn more transferable features for DA by disentangling explanatory factors of variations behind domains.
* [Tzeng et al. 2014](https://arxiv.org/abs/1412.3474), [Long et al. 2015](http://proceedings.mlr.press/v37/long15.html), [Ganin and Lempitsky, 2015](http://proceedings.mlr.press/v37/ganin15.html), and [Tzeng et al. 2015](https://arxiv.org/abs/1510.02192) embed DA in the pipeline of deep feature learning to extract domain-invariant features.
* The previous DA approaches assume that the source classifier can be directly transferred to the target domain on top of the learned domain-invariant feature representations. This assumption is strong in practice, as it is not feasible to check whether the source and target classifiers can be safely shared.
* Hence, this paper focuses on a more general DA scenario in which the source and target classifiers differ by a small perturbation function.
* They enable classifier adaptation by plugging several layers into deep networks to explicitly learn the residual function with reference to the target classifier. Through this, the source and target classifiers can be bridged tightly in backprop.
* They fuse features of multiple layers with tensor product and embed them into reproducing kernel Hilbert spaces (RKHS). See [these notes](https://hackmd.io/@FtbpSED3RQWclbmbmkChEA/rkTjKdRMS) by my colleague for an introduction to RKHS.
## Methodology
* In a UDA problem, a source domain $\mathcal{D}_s=\{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$ of $n_s$ labeled examples and a target domain $\mathcal{D}_t=\{\mathbf{x}_j^t\}_{j=1}^{n_t}$ of $n_t$ unlabeled examples are given. Source and target domains are sampled from different distributions $p$ and $q$ respectively.
* This paper aims to design a DNN that enables learning of transfer classifiers $y=f_s(\mathbf{x})$ and $y=f_t(\mathbf{x})$ to close the source-target discrepancy, such that the expected target risk $R_t(f_t)=\mathbb{E}_{(\mathbf{x}, y)\sim q}[f_t(\mathbf{x})\neq y]$ can be bounded using the source domain labeled data.
* The distribution discrepancy may give rise to mismatches in both features and classifiers, i.e. $p(\mathbf{x}) \neq q(\mathbf{x})$ and $f_s(\mathbf{x}) \neq f_t(\mathbf{x})$. Both mismatches should be fixed by joint adaptation of features and classifiers to enable effective DA.
* Classifier adaptation is more difficult than feature adaptation since it is directly related to labels but the target domain is unlabeled.
### Feature Adaptation
* Deep features in standard CNNs must eventually transition from general to specific along the network, and the transferability of features and classifiers decreases as the cross-domain discrepancy increases ([Yosinski et al. 2014](https://papers.nips.cc/paper/5347-how-transferable-are-features-in-deep-neural-networks)).
* Here, they perform feature adaptation by matching the feature distributions of multiple layers $\ell \in \mathcal{L}$.
* They reduce feature dimensions by adding a bottleneck layer $fcb$ on top of the last feature layer of CNNs, and then finetune CNNs on source labeled examples such that feature distributions of the source and target are made similar under new feature representations in multiple layers $\mathcal{L}=\{ fcb, fcc \}$.
* They propose taking the tensor product between the features of multiple layers to perform lossless multi-layer feature fusion, i.e. $\mathbf{z}_i^s \triangleq \otimes_{\ell \in \mathcal{L}}\mathbf{x}_i^{s\ell}$ and $\mathbf{z}_j^t \triangleq \otimes_{\ell \in \mathcal{L}}\mathbf{x}_j^{t\ell}$. Then, they perform feature adaptation by minimizing the Maximum Mean Discrepancy (MMD) ([Gretton et al. 2012](http://jmlr.csail.mit.edu/papers/v13/gretton12a.html)) between the source and target domains using these fusion features (termed tensor MMD) as
$$
\min_{f_s, f_t}\mathcal{D}_\mathcal{L}(\mathcal{D}_s, \mathcal{D}_t)=\sum_{i=1}^{n_s}\sum_{j=1}^{n_s} \frac{k(\mathbf{z}_i^s, \mathbf{z}_j^s)}{n_s^2} + \sum_{i=1}^{n_t}\sum_{j=1}^{n_t} \frac{k(\mathbf{z}_i^t, \mathbf{z}_j^t)}{n_t^2} - 2 \sum_{i=1}^{n_s}\sum_{j=1}^{n_t} \frac{k(\mathbf{z}_i^s, \mathbf{z}_j^t)}{n_sn_t}
\tag{1}
$$
* Here, the characteristic kernel $k(\mathbf{z}, \mathbf{z}')=e^{-||\text{vec}(\mathbf{z}) - \text{vec}(\mathbf{z}')||^2 / b}$ is the Gaussian kernel function defined on the vectorizations of tensors $\mathbf{z}$ and $\mathbf{z}'$ with bandwidth parameter $b$.
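A minimal PyTorch sketch of this tensor-product fusion and the Gaussian-kernel MMD of eq. (1) is given below. The helper names, mini-batch estimation, and the median-heuristic bandwidth are illustrative assumptions, not taken from the authors' code.

```python
import torch

def fuse(feat_fcb, feat_fcc):
    # Tensor (outer) product of the two layers' features per example, then vectorized.
    z = torch.einsum('ni,nj->nij', feat_fcb, feat_fcc)
    return z.flatten(start_dim=1)                      # shape: (n, d_fcb * d_fcc)

def squared_dists(a, b):
    # Pairwise squared Euclidean distances (no sqrt, for stable gradients).
    a2 = (a * a).sum(dim=1, keepdim=True)              # (n, 1)
    b2 = (b * b).sum(dim=1, keepdim=True).t()          # (1, m)
    return (a2 + b2 - 2.0 * a @ b.t()).clamp(min=0.0)

def gaussian_kernel(a, b, bandwidth):
    # k(z, z') = exp(-||vec(z) - vec(z')||^2 / b)
    return torch.exp(-squared_dists(a, b) / bandwidth)

def tensor_mmd(zs, zt, bandwidth=None):
    if bandwidth is None:
        # Median heuristic for the bandwidth b (an illustrative choice).
        with torch.no_grad():
            bandwidth = squared_dists(zs, zt).median().clamp(min=1e-6)
    # Eq. (1): mean source-source + mean target-target - 2 * mean cross kernel values.
    return (gaussian_kernel(zs, zs, bandwidth).mean()
            + gaussian_kernel(zt, zt, bandwidth).mean()
            - 2.0 * gaussian_kernel(zs, zt, bandwidth).mean())

# Usage: loss_mmd = tensor_mmd(fuse(fcb_s, fcc_s), fuse(fcb_t, fcc_t))
```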
![Overview](https://i.imgur.com/YoiRbQN.png)
### Classifier Adaptation
* Although source and target classifiers are different, $f_s(\mathbf{x}) \neq f_t(\mathbf{x})$, they should be related to ensure the feasibility of DA. It is reasonable to assume that they differ only by a small perturbation function $\Delta f(\mathbf{x})$.
* Other methods used labeled data from the target domain to learn the perturbation function $\Delta f(\mathbf{x})$ (which is a function of the input $\mathbf{x}$). However, this is not possible in UDA.
* If multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, $F(\mathbf{x}) - \mathbf{x}$.
* Rather than expecting stacked layers to approximate $F(\mathbf{x})$, they explicitly let these layers approximate a residual function $\Delta F(\mathbf{x}) \triangleq F(\mathbf{x}) - \mathbf{x}$.
* While it is unlikely that identity mappings are optimal, it should be easier to find the perturbations with reference to an identity mapping, than to learn the function anew.
* This is inspired by [He et al. 2016](https://arxiv.org/abs/1512.03385), which bridges the input and output of residual layers with a shortcut connection (identity mapping) such that $F(\mathbf{x}) = \Delta F(\mathbf{x}) + \mathbf{x}$; this eases the learning of the residual function $\Delta F(\mathbf{x})$ (analogous to the perturbation function here).
* They reformulate the residual block to bridge the source and target classifiers, $f_S(\mathbf{x})$ and $f_T(\mathbf{x})$, by letting $\mathbf{x}\triangleq f_T(\mathbf{x})$, $F(\mathbf{x})\triangleq f_S(\mathbf{x})$ and $\Delta F(\mathbf{x})\triangleq \Delta f(\mathbf{x})$.
* Note that $f_S$ is the output of the elementwise addition operation while $f_T$ is the output of the $fcc$ layer; both are taken before the softmax activation $\sigma(\cdot)$, i.e. $f_s(\mathbf{x}) \triangleq \sigma(f_S(\mathbf{x}))$ and $f_t(\mathbf{x})\triangleq \sigma(f_T(\mathbf{x}))$. See the diagram above.
* They connect the source and target classifiers using the residual block as
$$
f_S(\mathbf{x})=f_T(\mathbf{x})+\Delta f(\mathbf{x})
\tag{2}
$$
* Note that the residual block is applied to the outputs before softmax, which ensures that the final classifiers $f_s$ and $f_t$ output valid probabilities.
* They set the source classifier $f_S$ as the output of the residual block so that it can be trained directly on the source labeled data. Setting $f_T$ as the output is not possible, since target labels are unavailable and $f_T$ therefore cannot be learned with standard backprop.
* This design ensures valid classifiers satisfying $|\Delta f(\mathbf{x})| \ll |f_T(\mathbf{x})| \approx |f_S(\mathbf{x})|$ and, more importantly, makes the perturbation function $\Delta f(\mathbf{x})$ depend on both the source classifier $f_S(\mathbf{x})$ (through the backprop pipeline) and the target classifier $f_T(\mathbf{x})$ (through the functional dependency). A sketch of this residual classifier block is given at the end of this subsection.
* Although classifier adaptation is cast into the residual learning framework, it only encourages the target classifier $f_t(\mathbf{x})$ not to deviate much from the source classifier $f_s(\mathbf{x})$; it cannot guarantee that $f_t(\mathbf{x})$ will fit target-specific structures well.
* So, they use entropy minimization ([Grandvalet and Bengio, 2004](https://papers.nips.cc/paper/2740-semi-supervised-learning-by-entropy-minimization.pdf)), which encourages low-density separation between classes (i.e. reduces overlap between classes) by minimizing the entropy of the class-conditional distribution $f_j^t(\mathbf{x}_i^t)=p(y_i^t=j|\mathbf{x}_i^t;f_t)$ on the target domain data $\mathcal{D}_t$ as
$$
\min_{f_t} \frac{1}{n_t} \sum_{i=1}^{n_t} H(f_t(\mathbf{x}_i^t))
\tag{3}
$$
* Here, $H$ is the entropy function of class-conditional distribution $f_t(\mathbf{x}_i^t)$ defined as $H(f_t(\mathbf{x}_i^t))=-\sum_{j=1}^{c} f_j^t(\mathbf{x}_i^t) \log(f_j^t(\mathbf{x}_i^t))$, $c$ is the number of classes and $f_j^t(\mathbf{x}_i^t)$ is the probability of predicting $\mathbf{x}_i^t$ to class $j$.
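Below is a minimal PyTorch sketch of the residual classifier block (eq. 2) together with the entropy penalty (eq. 3). The layer dimensions and the two fc layers used for the perturbation function $\Delta f$ are illustrative assumptions; only the structure $f_S = f_T + \Delta f$ with softmax on top follows the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualClassifier(nn.Module):
    """Classifier head: fcb (bottleneck) -> fcc (f_T), plus a residual Δf giving f_S."""
    def __init__(self, in_dim=2048, bottleneck_dim=256, num_classes=31):
        super().__init__()
        self.fcb = nn.Linear(in_dim, bottleneck_dim)        # bottleneck layer fcb
        self.fcc = nn.Linear(bottleneck_dim, num_classes)   # target classifier f_T (pre-softmax)
        # Perturbation function Δf: a small stack of fc layers on top of f_T (assumed form).
        self.delta = nn.Sequential(
            nn.Linear(num_classes, num_classes),
            nn.ReLU(),
            nn.Linear(num_classes, num_classes),
        )

    def forward(self, features):
        fcb = F.relu(self.fcb(features))
        f_T = self.fcc(fcb)              # target logits; f_t(x) = softmax(f_T(x))
        f_S = f_T + self.delta(f_T)      # eq. (2): f_S(x) = f_T(x) + Δf(x); f_s(x) = softmax(f_S(x))
        return fcb, f_S, f_T

def entropy_loss(target_logits):
    # Eq. (3): average entropy of the class-conditional distribution f_t on target data.
    p = F.softmax(target_logits, dim=1)
    return -(p * torch.log(p.clamp(min=1e-8))).sum(dim=1).mean()
```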
### Overall Training Procedure
* Combining the source classification loss, the entropy penalty, and the tensor MMD penalty, the overall objective is
$$
\min_{f_S=f_T+\Delta f}\frac{1}{n_s}\sum_{i=1}^{n_s}L(f_s(\mathbf{x}_i^s), y_i^s) + \frac{\gamma}{n_t}\sum_{i=1}^{n_t}H(f_t(\mathbf{x}_i^t)) + \lambda \mathcal{D}_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t)
\tag{4}
$$
* Here, $\lambda$ and $\gamma$ are the tradeoff parameters for the tensor MMD penalty and the entropy penalty respectively.
* Note that optimizing the tensor MMD penalty requires a carefully designed algorithm to enable linear-time training (detailed in [Long et al. 2015](http://proceedings.mlr.press/v37/long15.html)).
* They also use bilinear pooling ([Lin et al. 2015](https://arxiv.org/abs/1504.07889)) to reduce the dimensionality of the fusion features used in the tensor MMD.
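As a rough illustration of how the pieces fit together, here is a sketch of a single training step for the overall objective in eq. (4), reusing the hypothetical `ResidualClassifier`, `fuse`, `tensor_mmd`, and `entropy_loss` helpers sketched above; the backbone, optimizer, and the default values of `lam` ($\lambda$) and `gamma` ($\gamma$) are placeholders, not the paper's settings.

```python
import torch.nn.functional as F

def train_step(backbone, head, optimizer, xs, ys, xt, lam=1.0, gamma=0.3):
    # head is a ResidualClassifier; xs/ys: labeled source batch, xt: unlabeled target batch.
    fcb_s, fS_s, fT_s = head(backbone(xs))
    fcb_t, fS_t, fT_t = head(backbone(xt))

    cls_loss = F.cross_entropy(fS_s, ys)       # source loss on f_s = softmax(f_S)
    ent_loss = entropy_loss(fT_t)              # entropy penalty on target f_t (eq. 3)
    # Tensor MMD over the fused {fcb, fcc} features of both domains (eq. 1).
    mmd_loss = tensor_mmd(fuse(fcb_s, fT_s), fuse(fcb_t, fT_t))

    loss = cls_loss + gamma * ent_loss + lam * mmd_loss   # eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```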
* [Caffe code](https://github.com/thuml/DAN) is available, while the [PyTorch code](https://github.com/thuml/Xlearn) has not been updated since 2018.
## Conclusion
* This paper presented a UDA approach which enables end-to-end learning of adaptive classifiers and transferable features.
* Similar to prior DA techniques, feature adaptation is achieved by matching the distribution of features across domains.
* Unlike prior work, this also supports classifier adaptation, implemented through a residual transfer module that bridges the two classifiers.