They introduce two criteria to regularize the optimization involved in Unsupervised Domain Adaptation (UDA).
The first criterion, inspired by visual psychophysics, is that the map between the two image domains be phase-preserving.
The second criterion aims to leverage ecological statistics (or regularities in the scene), regardless of the illuminant or imaging sensor.
Introduction
Unsupervised domain adaptation (UDA) aims to leverage an annotated source dataset in designing learning schemes for a target dataset for which no ground-truth is available.
If the two datasets are sampled from the same distribution, this is a standard semi-supervised learning problem.
The twist in UDA is that the distributions from which source and target data are drawn differ enough that a model trained on the former performs poorly, out-of-the-box, on the latter.
Typical DA works using deep NNs proceed either by learning a map that aligns the source and target (marginal) distributions, or by training a backbone to be insensitive to the domain change through an auxiliary discrimination loss on the domain variable.
Either way, these approaches operate on the marginal distributions, since the labels are not available in the target domain.
However, the marginals could be perfectly aligned while the labels are scrambled: trees in one domain could map to houses in the other, and vice versa.
Since class information has to be transferred, ideally the class-conditional distributions should be aligned, but these are not available without target labels. As the problem is ill-posed, constraints or priors have to be enforced in UDA.
They introduce two priors or constraints, one on the map between the domains, the other on the classifier in the target domain, both unknown at the outset.
From visual psychophysics, it is known that semantic information in an image is associated with the phase of its Fourier Transform (FT).
Changes in the amplitude of the FT can significantly alter an image's appearance, but not its interpretation. This suggests placing an incentive for the transformation between domains to be phase-preserving.
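As a quick illustration of why one might believe this (my own sketch, not from the paper): combining the amplitude spectrum of one image with the phase spectrum of another yields an image that is typically recognized as the phase donor, a classic observation often attributed to Oppenheim and Lim (1981).

```python
import numpy as np

def mix_phase_amplitude(phase_donor: np.ndarray, amp_donor: np.ndarray) -> np.ndarray:
    """Rebuild an image from the PHASE of one image and the AMPLITUDE of another.

    The result is usually interpreted as the phase donor, supporting the claim
    that semantic content lives in the phase of the Fourier transform.
    Both inputs are (H, W) grayscale arrays of the same shape.
    """
    F_phase = np.fft.fft2(phase_donor)
    F_amp = np.fft.fft2(amp_donor)
    mixed = np.abs(F_amp) * np.exp(1j * np.angle(F_phase))  # |F_b| e^{i angle(F_a)}
    return np.real(np.fft.ifft2(mixed))
```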
For the classifier in the target domain, even in the absence of annotations, a target image informs the set of possible hypotheses (segmentations), due to the statistical regularities of natural scenes (ecological statistics, Brunswik and Kamiya, 1953 and Elder and Goldberg, 2002).
Semantic segments are unlikely to be highly irregular due to the regularity of the shape of objects in the scene.
Such generic priors, informed by each single unlabeled image, could be learned from other labeled images and transfer across image domains, since they arise from properties of the scene they portray.
They use a Conditional Prior Network (Yang and Soatto, 2018) to learn a data-dependent prior on segmentations that can be imposed in an end-to-end framework when learning a classifier in the target domain in UDA.
Methodology
Image Translation for UDA
Consider two probability distributions, a source $P^s(x, y)$ and a target $P^t(x, y)$, which are generally different (covariate shift), as measured by the Kullback-Leibler divergence $\mathbb{KL}(P^s \,\|\, P^t)$.
Here $x \in \mathbb{R}^{H \times W \times 3}$ are color images and $y \in \{1, \dots, K\}^{H \times W}$ are segmentation maps where each pixel has an associated label among $K$ classes. There are images and labels in the source domain, $D^s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, but only images in the target domain, $D^t = \{x_i^t\}_{i=1}^{N_t}$.
The goal of UDA for semantic segmentation is to train a model $\phi^t$ that maps target images to estimated segmentations, $\hat y = \phi^t(x^t)$, leveraging source domain annotations.
Any invertible map $T$ between samples in the source and target domains, $x \mapsto T(x)$, induces a pushforward map between their distributions, $T_* P^s$, where $T_* P^s(A) = P^s(T^{-1}(A))$ for any measurable set $A$.
The map $T$ can be implemented by a transformer network, and the target domain risk is minimized by a CE loss on translated source images:
$$L_{CE}(\phi^t, T) = \frac{1}{N_s} \sum_{i=1}^{N_s} \ell\big(\phi^t(T(x_i^s)),\, y_i^s\big) \tag{1}$$
where $\ell$ is the per-pixel cross-entropy.
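A minimal PyTorch sketch of Eq. 1; `T` and `phi_t` are hypothetical stand-ins for the transformer and segmentation networks:

```python
import torch
import torch.nn.functional as F

def translated_source_ce(T, phi_t, x_s, y_s):
    """CE on source images translated into the target domain (Eq. 1).

    x_s: (B, 3, H, W) source images; y_s: (B, H, W) integer label maps.
    T and phi_t are hypothetical nn.Modules: image translator and segmenter.
    """
    x_s2t = T(x_s)          # x^{s->t} = T(x^s)
    logits = phi_t(x_s2t)   # (B, K, H, W) per-pixel class scores
    return F.cross_entropy(logits, y_s)
```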
The domain gap is measured by the discrepancy between the pushforward $T_* P^s$ and $P^t$, and can be minimized by adversarially maximizing the domain confusion, as measured by a domain discriminator $D$ that maps each image into the probability of it coming from the source or target domain:
$$L_{adv}(T, D) = \frac{1}{N_t}\sum_{i=1}^{N_t} \log D(x_i^t) + \frac{1}{N_s}\sum_{i=1}^{N_s} \log\big(1 - D(T(x_i^s))\big) \tag{2}$$
where ideally $D$ returns 1 for images drawn from $P^t$ and 0 otherwise; $D$ maximizes Eq. 2 while $T$ minimizes it.
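A sketch of the adversarial pair under the usual non-saturating GAN convention (my rendering; the paper may use a different variant). `D` is assumed to end in a sigmoid so its output is a probability:

```python
import torch
import torch.nn.functional as F

def domain_confusion_losses(D, x_s2t, x_t):
    """D is trained to output 1 on real target images and 0 on translated
    source images; T is trained adversarially to fool D."""
    d_real = D(x_t)                    # D(x^t), pushed towards 1
    d_fake = D(x_s2t.detach())         # D(T(x^s)), pushed towards 0
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_fool = D(x_s2t)                  # gradients flow back into T here
    loss_T = F.binary_cross_entropy(d_fool, torch.ones_like(d_fool))
    return loss_D, loss_T
```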
Limitations and Challenges
Ideally, jointly minimizing the two previous equations would yield a segmentation model that operates in the target domain, producing estimated segmentations $\hat y^t = \phi^t(x^t)$.
Unfortunately, the transformation network trained using Eq. 2 does not yield a good target domain classifier, since $T$ can match the image statistics, but nothing encourages it to match the semantics.
Semantic information is included in the phase, not the amplitude, of the spectrum. This motivated them to hypothesize that the transformation should be phase-preserving.
Let $\mathcal{F}$ be the FT. Phase consistency for a transformation $T$, for a single channel image $x$, is obtained by minimizing
$$L_{PH}(T) = -\sum_{k} \frac{\big\langle \mathcal{F}(x)(k),\, \mathcal{F}(T(x))(k) \big\rangle}{\big|\mathcal{F}(x)(k)\big| \, \big|\mathcal{F}(T(x))(k)\big|}$$
where $\langle \cdot, \cdot \rangle$ is the dot product (each complex Fourier coefficient treated as a vector in $\mathbb{R}^2$) and $|\cdot|$ is the norm. Each summand is the negative cosine of the difference between the original and transformed phases at frequency $k$.
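A sketch of this loss with `torch.fft`; it uses the identity that the dot product of two complex coefficients, viewed as 2D vectors, equals the real part of one times the conjugate of the other:

```python
import torch

def phase_consistency_loss(x: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
    """Negative cosine between the phases of F(x) and F(T(x)).

    x, x_t: (B, C, H, W) original and translated images. Each complex
    Fourier coefficient is treated as a 2D vector (Re, Im); the normalized
    dot product of two such vectors is the cosine of their phase difference.
    """
    F_x = torch.fft.fft2(x)
    F_xt = torch.fft.fft2(x_t)
    eps = 1e-8
    dot = (F_x * F_xt.conj()).real           # <F(x), F(T(x))> as 2D vectors
    cos = dot / (F_x.abs() * F_xt.abs() + eps)
    return -cos.mean()
```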
Prior on Scene Compatibility
Given an unlabeled image, we may not know what classes are present, but we know that objects have certain regularities:
it is unlikely that photometrically homogeneous regions are segmented into many pieces, or that segments span many image boundaries;
it is also unlikely that the segmentation map is highly irregular.
These characteristics inform the probability of a segmentation given the image in the target domain, $P(y \mid x^t)$. So, $P(y \mid x)$ is a function that scores each hypothesis $y$ based on the plausibility of the resulting segmentation given the input image $x$.
Such a function can be learnt from a set of labeled images, and they use a Conditional Prior Network (Yang and Soatto, 2018). However, training on $D^s$, which is sampled from $P^s(x, y)$, will make the learned prior approximate $P^s(y \mid x)$, i.e., overfit the source dataset (which is useless for the target domain).
Simply using $P^s(y \mid x)$ would make the CPN capture both the domain-related unary prediction term and the domain-irrelevant pairwise term that depends on the image structure. To make this point explicit, they decompose the log-conditional as follows:
$$\log P^s(y \mid x) = \sum_{j} \log P^s(y_j \mid x) + \sum_{j \neq k} \log P(y_j, y_k \mid x) + \dots$$
Here, they omit the higher-order terms for simplicity (I think those must represent correlations between more than two classes at a time).
The unary terms $P^s(y_j \mid x)$ measure the likelihood of the semantic label of a single pixel $j$ given the image. The pairwise terms $P(y_j, y_k \mid x)$ measure the labelling compatibility between pixels $j$ and $k$, which depends much less on the domain.
Some of my understanding and doubts on the decomposition (not in paper of course):
Pairwise terms are sort of correlations between pixels (RGB values, position) and labels. Thus, they are considered domain independent (at least for segmentation).
The decomposition is not properly explained in the paper. If we consider each component as independent of each other, then such a decomposition seems correct (but I'm not completely sure).
They don't use the superscript $s$ on the second term since it's considered domain independent, but I'm not sure that checks out mathematically.
To prevent overfitting the source domain (through the unary terms), they randomly permute the labels in $D^s$ according to a uniform distribution over permutations.
Here, $\sigma$ is a random permutation of the class IDs for the $K$ classes, applied pixel-wise, and they denote the permuted segmentation masks as $\tilde y = \sigma(y)$ (which scales up the dataset size by a factor of $K!$).
This new dataset is denoted $\tilde D^s$, which will render the learned conditional distribution invariant to the domain-dependent unaries.
Note that the resulting prior only evaluates the compatibility based on the semantic layout, not the semantic meanings.
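A minimal sketch of the relabeling; in practice one would presumably sample a fresh permutation per training example rather than materializing all $K!$ copies (my assumption about the implementation):

```python
import torch

def permute_labels(y: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Apply one uniformly sampled permutation of class IDs to a whole
    segmentation mask, destroying class semantics but preserving layout.

    y: (H, W) integer mask with values in {0, ..., num_classes - 1}.
    """
    sigma = torch.randperm(num_classes)   # random permutation of class IDs
    return sigma[y]                       # relabel every pixel through sigma
```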
They claim to train a CPN using KL divergence and an information capacity constraint:
$$\min_{\theta} \; \mathbb{KL}\big(\tilde P(y \mid x) \,\|\, P_\theta(y \mid x)\big) \quad \text{s.t.} \quad I(y; z) \le I_0 \tag{9}$$
Here, $I(y; z)$ denotes the mutual information between $y$ and its CPN encoding $z$. However, they later mention that they use the NLL (negative log-likelihood) loss, derived from the KLD term, to reconstruct the randomly permuted GT segmentation masks $\tilde y$.
They also mention that the information capacity constraint is implemented as a structural bottleneck. Through this training, they obtain a compatibility function (in the form of the trained CPN), i.e., $C(x, y) \doteq \log P_\theta(y \mid x)$.
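A hypothetical training step, assuming the CPN is an encoder-decoder that takes the image and the permuted mask and reconstructs the mask through a structural bottleneck:

```python
import torch
import torch.nn.functional as F

def cpn_training_step(cpn, optimizer, x, y_perm):
    """One NLL step: reconstruct the permuted GT mask conditioned on the image.

    cpn(x, y_perm) -> (B, K, H, W) logits; the bottleneck inside `cpn`
    implements the information capacity constraint structurally.
    After training, C(x, y) can be scored as the negative of this loss.
    """
    logits = cpn(x, y_perm)
    loss = F.cross_entropy(logits, y_perm)   # NLL of the permuted mask
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```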
Overall Training Loss
Combining all the parts (and pretraining the CPN using Equation 9), the overall training loss for the image transformation network $T$ and the target domain segmentation network $\phi^t$ is given as follows:
$$L = L_{CE}(\phi^t, T) + \lambda_{adv} L_{adv}(T, D) + \lambda_{PH} L_{PH}(T) - \lambda_{C} \frac{1}{N_t}\sum_{i=1}^{N_t} C\big(x_i^t, \phi^t(x_i^t)\big)$$
Here, the $\lambda$'s are hyperparameters that control each constraint/loss. Note that the outputs of $\phi^t$ are not permuted for the evaluation of the scene compatibility term.
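Putting it together, reusing the sketches above (the $\lambda$ values and the soft-compatibility scoring are illustrative assumptions of mine, not the paper's exact recipe):

```python
import torch.nn.functional as F

def scene_compatibility(cpn, x_t, logits):
    """Soft NLL of the predicted target segmentation under the frozen CPN.

    Predictions that violate scene regularities reconstruct poorly through
    the CPN bottleneck and therefore score a high loss.
    """
    probs = logits.softmax(dim=1)
    recon = cpn(x_t, probs)   # CPN conditioned on image + soft (un-permuted) mask
    return -(probs * recon.log_softmax(dim=1)).sum(dim=1).mean()

def total_loss(T, phi_t, D, cpn, x_s, y_s, x_t,
               lam_adv=1e-3, lam_ph=1.0, lam_c=0.1):
    """Weighted sum of the four terms; D is updated separately with loss_D."""
    x_s2t = T(x_s)
    l_ce = F.cross_entropy(phi_t(x_s2t), y_s)           # translated-source CE
    _, l_adv = domain_confusion_losses(D, x_s2t, x_t)   # fool the discriminator
    l_ph = phase_consistency_loss(x_s, x_s2t)           # keep FT phases aligned
    l_c = scene_compatibility(cpn, x_t, phi_t(x_t))     # CPN score of predictions
    return l_ce + lam_adv * l_adv + lam_ph * l_ph + lam_c * l_c
```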
They propose certain priors to improve UDA for semantic segmentation. However, imposing semantic-consistency and ecological-statistics priors on general UDA tasks other than semantic segmentation remains an open problem.
The capacity of the CPN is chosen empirically, and its analysis is an open problem. Finding the optimal bottleneck capacity for specific tasks will require quantitatively measuring the information that the CPN leverages from images.
They introduce two assumptions, with corresponding priors and variational renditions, that are used in end-to-end differentiable learning:
1. The transformations mapping one domain to another only affect the magnitude, not the phase, of the spectrum.
2. A prior meant to capture ecological statistics, i.e., characteristics of the images induced by regularities in the scene, and thus shared across domains.