They introduce two criteria to regularize the optimization involved in Unsupervised Domain Adaptation (UDA).
The first criterion, inspired by visual psychophysics, is that the map between the two image domains be phase-preserving.
The second criterion aims to leverage ecological statistics (or regularities in the scene), regardless of the illuminant or imaging sensor.
Introduction
Unsupervised domain adaptation (UDA) aims to leverage an annotated source dataset in designing learning schemes for a target dataset for which no ground-truth is available.
If the two datasets are sampled from the same distribution, this is a standard semi-supervised learning problem.
The twist in UDA is that the distributions from which source and target data are drawn differ enough that a model trained on the former performs poorly, out-of-the-box, on the latter.
Typical DA works using deep NNs proceed either by learning a map that aligns the source and target (marginal) distributions, or by training a backbone to be insensitive to the domain change through an auxiliary discrimination loss on the domain variable.
Either way, these approaches operate on the marginal distributions, since the labels are not available in the target domain.
However, the marginals could be perfectly aligned while the labels are scrambled: trees in one domain could map to houses in the other, and vice versa.
Since class information has to be transferred, ideally the class-conditional distributions should be aligned, but these are not available without target labels. As the problem is ill-posed, constraints or priors have to be enforced in UDA.
They introduce two priors or constraints, one on the map between the domains, the other on the classifier in the target domain, both unknown at the outset.
From visual psychophysics, it is known that semantic information in an image is associated with the phase of its Fourier Transform (FT).
Changes in the amplitude of the FT can significantly alter an image's appearance, but not its interpretation. This suggests placing an incentive for the transformation between domains to be phase-preserving.
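As a quick illustration of why one might believe this (my own sketch, not from the paper): combining the amplitude spectrum of one image with the phase spectrum of another yields an image that is typically recognized as the phase donor, a classic observation often attributed to Oppenheim and Lim (1981).

```python
import numpy as np

def mix_phase_amplitude(phase_donor: np.ndarray, amp_donor: np.ndarray) -> np.ndarray:
    """Rebuild an image from the PHASE of one image and the AMPLITUDE of another.

    The result is usually interpreted as the phase donor, supporting the claim
    that semantic content lives in the phase of the Fourier transform.
    Both inputs are (H, W) grayscale arrays of the same shape.
    """
    F_phase = np.fft.fft2(phase_donor)
    F_amp = np.fft.fft2(amp_donor)
    mixed = np.abs(F_amp) * np.exp(1j * np.angle(F_phase))  # |F_b| e^{i angle(F_a)}
    return np.real(np.fft.ifft2(mixed))
```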
For the classifier in the target domain, even in the absence of annotations, a target image informs the set of possible hypotheses (segmentations), due to the statistical regularities of natural scenes (ecological statistics, Brunswik and Kamiya, 1953 and Elder and Goldberg, 2002).
Semantic segments are unlikely to be highly irregular due to the regularity of the shape of objects in the scene.
Such generic priors, informed by each single unlabeled image, could be learned from other labeled images and transfer across image domains, since they arise from properties of the scene they portray.
They use a Conditional Prior Network (Yang and Soatto, 2018) to learn a data-dependent prior on segmentations that can be imposed in an end-to-end framework when learning a classifier in the target domain in UDA.
Methodology
Image Translation for UDA
Consider two probability distributions, a source $P^s(x, y)$ and a target $P^t(x, y)$, which are generally different (covariate shift), as measured by the Kullback-Leibler divergence $\mathbb{KL}(P^s \,\|\, P^t)$.
Here $x \in \mathbb{R}^{H \times W \times 3}$ are color images and $y \in \{1, \dots, K\}^{H \times W}$ are segmentation maps where each pixel has an associated label among $K$ classes. There are images and labels in the source domain, $D^s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, but only images in the target domain, $D^t = \{x_i^t\}_{i=1}^{N_t}$.
The goal of UDA for semantic segmentation is to train a model $\phi^t$ that maps target images to estimated segmentations, $\hat y = \phi^t(x^t)$, leveraging source domain annotations.
Any invertible map $T$ between samples in the source and target domains, $x \mapsto T(x)$, induces a pushforward map between their distributions, $T_* P^s$, where $T_* P^s(A) = P^s(T^{-1}(A))$ for any measurable set $A$.
The map $T$ can be implemented by a transformer network, and the target domain risk is minimized by a CE loss on translated source images:
$$L_{CE}(\phi^t, T) = \frac{1}{N_s} \sum_{i=1}^{N_s} \ell\big(\phi^t(T(x_i^s)),\, y_i^s\big) \tag{1}$$
where $\ell$ is the per-pixel cross-entropy.
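A minimal PyTorch sketch of Eq. 1; `T` and `phi_t` are hypothetical stand-ins for the transformer and segmentation networks:

```python
import torch
import torch.nn.functional as F

def translated_source_ce(T, phi_t, x_s, y_s):
    """CE on source images translated into the target domain (Eq. 1).

    x_s: (B, 3, H, W) source images; y_s: (B, H, W) integer label maps.
    T and phi_t are hypothetical nn.Modules: image translator and segmenter.
    """
    x_s2t = T(x_s)          # x^{s->t} = T(x^s)
    logits = phi_t(x_s2t)   # (B, K, H, W) per-pixel class scores
    return F.cross_entropy(logits, y_s)
```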
The domain gap is measured by the discrepancy between the pushforward $T_* P^s$ and $P^t$, and can be minimized by adversarially maximizing the domain confusion, as measured by a domain discriminator $D$ that maps each image into the probability of it coming from the source or target domain:
$$L_{adv}(T, D) = \frac{1}{N_t}\sum_{i=1}^{N_t} \log D(x_i^t) + \frac{1}{N_s}\sum_{i=1}^{N_s} \log\big(1 - D(T(x_i^s))\big) \tag{2}$$
where ideally $D$ returns 1 for images drawn from $P^t$ and 0 otherwise; $D$ maximizes Eq. 2 while $T$ minimizes it.
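A sketch of the adversarial pair under the usual non-saturating GAN convention (my rendering; the paper may use a different variant). `D` is assumed to end in a sigmoid so its output is a probability:

```python
import torch
import torch.nn.functional as F

def domain_confusion_losses(D, x_s2t, x_t):
    """D is trained to output 1 on real target images and 0 on translated
    source images; T is trained adversarially to fool D."""
    d_real = D(x_t)                    # D(x^t), pushed towards 1
    d_fake = D(x_s2t.detach())         # D(T(x^s)), pushed towards 0
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_fool = D(x_s2t)                  # gradients flow back into T here
    loss_T = F.binary_cross_entropy(d_fool, torch.ones_like(d_fool))
    return loss_D, loss_T
```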
Limitations and Challenges
Ideally, jointly minimizing the two previous equations would yield a segmentation model that operates in the target domain, producing estimated segmentations $\hat y^t = \phi^t(x^t)$.
Unfortunately, the transformation network trained using Eq. 2 does not yield a good target domain classifier, since $T$ can match the image statistics, but nothing encourages it to match the semantics.
Semantic information is included in the phase, not the amplitude, of the spectrum. This motivated them to hypothesize that the transformation should be phase-preserving.
Let $\mathcal{F}$ be the FT. Phase consistency for a transformation $T$, for a single channel image $x$, is obtained by minimizing
$$L_{PH}(T) = -\sum_{k} \frac{\big\langle \mathcal{F}(x)(k),\, \mathcal{F}(T(x))(k) \big\rangle}{\big|\mathcal{F}(x)(k)\big| \, \big|\mathcal{F}(T(x))(k)\big|}$$
where $\langle \cdot, \cdot \rangle$ is the dot product (each complex Fourier coefficient treated as a vector in $\mathbb{R}^2$) and $|\cdot|$ is the norm. Each summand is the negative cosine of the difference between the original and transformed phases at frequency $k$.
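A sketch of this loss with `torch.fft`; it uses the identity that the dot product of two complex coefficients, viewed as 2D vectors, equals the real part of one times the conjugate of the other:

```python
import torch

def phase_consistency_loss(x: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
    """Negative cosine between the phases of F(x) and F(T(x)).

    x, x_t: (B, C, H, W) original and translated images. Each complex
    Fourier coefficient is treated as a 2D vector (Re, Im); the normalized
    dot product of two such vectors is the cosine of their phase difference.
    """
    F_x = torch.fft.fft2(x)
    F_xt = torch.fft.fft2(x_t)
    eps = 1e-8
    dot = (F_x * F_xt.conj()).real           # <F(x), F(T(x))> as 2D vectors
    cos = dot / (F_x.abs() * F_xt.abs() + eps)
    return -cos.mean()
```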
Prior on Scene Compatibility
Given an unlabeled image, we may not know what classes are present, but we know that objects have certain regularities:
it is unlikely that photometrically homogeneous regions are segmented into many pieces, or that segments span many image boundaries;
it is also unlikely that the segmentation map is highly irregular.
These characteristics inform the probability of a segmentation given the image in the target domain, $P(y \mid x^t)$. So, $P(y \mid x)$ is a function that scores each hypothesis $y$ based on the plausibility of the resulting segmentation given the input image $x$.
Such a function can be learnt from a set of labeled images, and they use a Conditional Prior Network (Yang and Soatto, 2018). However, training on $D^s$, which is sampled from $P^s(x, y)$, will make the learned prior approximate $P^s(y \mid x)$, i.e., overfit the source dataset (which is useless for the target domain).
Simply using $P^s(y \mid x)$ would make the CPN capture both the domain-related unary prediction term and the domain-irrelevant pairwise term that depends on the image structure. To make this point explicit, they decompose the log-conditional as follows:
$$\log P^s(y \mid x) = \sum_{j} \log P^s(y_j \mid x) + \sum_{j \neq k} \log P(y_j, y_k \mid x) + \dots$$
Here, they omit the higher-order terms for simplicity (I think those must represent correlations between more than two classes at a time).
The unary terms $P^s(y_j \mid x)$ measure the likelihood of the semantic label of a single pixel $j$ given the image. The pairwise terms $P(y_j, y_k \mid x)$ measure the labelling compatibility between pixels $j$ and $k$, which depends much less on the domain.
Some of my understanding and doubts on the decomposition (not in paper of course):
Pairwise terms are sort of correlations between pixels (RGB values, position) and labels. Thus, they are considered domain independent (at least for segmentation).
The decomposition is not properly explained in the paper. If we consider each component as independent of each other, then such a decomposition seems correct (but I'm not completely sure).
They don't use the superscript $s$ on the second term since it's considered domain independent, but I'm not sure that checks out mathematically.
To prevent overfitting the source domain (through the unary terms), they randomly permute the labels in $D^s$ according to a uniform distribution over permutations.
Here, $\sigma$ is a random permutation of the class IDs for the $K$ classes, applied pixel-wise, and they denote the permuted segmentation masks as $\tilde y = \sigma(y)$ (which scales up the dataset size by a factor of $K!$).
This new dataset is denoted $\tilde D^s$, which will render the learned conditional distribution invariant to the domain-dependent unaries.
Note that the resulting prior only evaluates the compatibility based on the semantic layout, not the semantic meanings.
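A minimal sketch of the relabeling; in practice one would presumably sample a fresh permutation per training example rather than materializing all $K!$ copies (my assumption about the implementation):

```python
import torch

def permute_labels(y: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Apply one uniformly sampled permutation of class IDs to a whole
    segmentation mask, destroying class semantics but preserving layout.

    y: (H, W) integer mask with values in {0, ..., num_classes - 1}.
    """
    sigma = torch.randperm(num_classes)   # random permutation of class IDs
    return sigma[y]                       # relabel every pixel through sigma
```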
They claim to train a CPN using KL divergence and an information capacity constraint:
$$\min_{\theta} \; \mathbb{KL}\big(\tilde P(y \mid x) \,\|\, P_\theta(y \mid x)\big) \quad \text{s.t.} \quad I(y; z) \le I_0 \tag{9}$$
Here, $I(y; z)$ denotes the mutual information between $y$ and its CPN encoding $z$. However, they later mention that they use the NLL (negative log-likelihood) loss, derived from the KLD term, to reconstruct the randomly permuted GT segmentation masks $\tilde y$.
They also mention that the information capacity constraint is implemented as a structural bottleneck. Through this training, they obtain a compatibility function (in the form of the trained CPN), i.e., $C(x, y) \doteq \log P_\theta(y \mid x)$.
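A hypothetical training step, assuming the CPN is an encoder-decoder that takes the image and the permuted mask and reconstructs the mask through a structural bottleneck:

```python
import torch
import torch.nn.functional as F

def cpn_training_step(cpn, optimizer, x, y_perm):
    """One NLL step: reconstruct the permuted GT mask conditioned on the image.

    cpn(x, y_perm) -> (B, K, H, W) logits; the bottleneck inside `cpn`
    implements the information capacity constraint structurally.
    After training, C(x, y) can be scored as the negative of this loss.
    """
    logits = cpn(x, y_perm)
    loss = F.cross_entropy(logits, y_perm)   # NLL of the permuted mask
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```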
Overall Training Loss
Combining all the parts (and pretraining the CPN using Equation 9), the overall training loss for the image transformation network $T$ and the target domain segmentation network $\phi^t$ is given as follows:
$$L = L_{CE}(\phi^t, T) + \lambda_{adv} L_{adv}(T, D) + \lambda_{PH} L_{PH}(T) - \lambda_{C} \frac{1}{N_t}\sum_{i=1}^{N_t} C\big(x_i^t, \phi^t(x_i^t)\big)$$
Here, the $\lambda$'s are hyperparameters that control each constraint/loss. Note that the outputs of $\phi^t$ are not permuted for the evaluation of the scene compatibility term.
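Putting it together, reusing the sketches above (the $\lambda$ values and the soft-compatibility scoring are illustrative assumptions of mine, not the paper's exact recipe):

```python
import torch.nn.functional as F

def scene_compatibility(cpn, x_t, logits):
    """Soft NLL of the predicted target segmentation under the frozen CPN.

    Predictions that violate scene regularities reconstruct poorly through
    the CPN bottleneck and therefore score a high loss.
    """
    probs = logits.softmax(dim=1)
    recon = cpn(x_t, probs)   # CPN conditioned on image + soft (un-permuted) mask
    return -(probs * recon.log_softmax(dim=1)).sum(dim=1).mean()

def total_loss(T, phi_t, D, cpn, x_s, y_s, x_t,
               lam_adv=1e-3, lam_ph=1.0, lam_c=0.1):
    """Weighted sum of the four terms; D is updated separately with loss_D."""
    x_s2t = T(x_s)
    l_ce = F.cross_entropy(phi_t(x_s2t), y_s)           # translated-source CE
    _, l_adv = domain_confusion_losses(D, x_s2t, x_t)   # fool the discriminator
    l_ph = phase_consistency_loss(x_s, x_s2t)           # keep FT phases aligned
    l_c = scene_compatibility(cpn, x_t, phi_t(x_t))     # CPN score of predictions
    return l_ce + lam_adv * l_adv + lam_ph * l_ph + lam_c * l_c
```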
They propose certain priors to improve UDA for semantic segmentation. However, imposing semantic-consistency and ecological-statistics priors on general UDA tasks other than semantic segmentation remains an open problem.
The capacity of the CPN is chosen empirically, and its analysis is an open problem. Finding the optimal bottleneck capacity for specific tasks will require quantitatively measuring the information that the CPN leverages from images.
They introduce two assumptions, with corresponding priors and variational renditions, that are used in end-to-end differentiable learning:
1. The transformations mapping one domain to another only affect the magnitude, not the phase, of the spectrum.
2. A prior meant to capture ecological statistics, i.e., characteristics of the images induced by regularities in the scene, and thus shared across domains.