# Notes on "[Domain Adaptation for Structured Output via Discriminative Patch Representations](https://arxiv.org/abs/1901.05427)"
###### tags: `notes` `domain-adaptation` `adversarial` `unsupervised` `segmentation`
Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)
**Note**: The [code](https://github.com/wasidennis/AdaptSegNet) for [Tsai et al. 2018](https://arxiv.org/abs/1802.10349) mentions some implementation details for this method. No open-source implementation of this method itself is available (as of 20/04/2020).
## Brief Outline
They propose an unsupervised domain adaptation method that explicitly discovers many modes in the structured output space of semantic segmentation to learn a better discriminator between the two domains, ultimately leading to better domain alignment.
## Introduction
- They leverage pixel-level semantic annotations available in the source domain, but instead of working directly on the output space (like [Tsai et al. 2018](https://arxiv.org/abs/1802.10349)), the adaptation occurs in 2 stages.
- They extract patches from the source domain (represented using annotations) and discover major modes by grouping the patches using $K$-means clustering. They use a $K$-way classifier to predict the cluster/mode index of each patch using the image as input.
- Their method, patch-level alignment, operates on the $K$-dimensional probability vector space. The learned discriminator on this space can backpropagate the gradient through the cluster/mode index classifier to the semantic segmentation network.
- They propose an adversarial adaptation framework for structured prediction that explicitly tries to discover and predict modes of the output patches.
## Related Work
### Unsupervised Domain Adaptation (UDA)
- One common practice is to adopt adversarial learning ([Ganin et al. 2015](https://arxiv.org/abs/1505.07818)) or to minimize the Maximum Mean Discrepancy ([Long et al. 2015](http://proceedings.mlr.press/v37/long15.html)).
- Recent approaches for UDA in semantic segmentation can be categorized into:
    1. output-space ([Tsai et al. 2018](https://arxiv.org/abs/1802.10349)) and spatial-aware ([Chen et al. 2017](https://arxiv.org/abs/1711.11556)) adaptations, which aim to align the global structure (like scene layout) across domains.
2. pixel-level adaptation synthesizes target samples to reduce the domain gap during training (not relevant to this paper).
3. pseudo label re-training generates pseudo ground truth of target images to finetune the model trained on source domain (not relevant to this paper).
- The first category is the most relevant to this approach. However, those methods don't handle the intrinsic domain gap (like differences in camera poses), which this method can handle.
- They also note that the other 2 categories, as well as other DA techniques like robust loss function design, are orthogonal to their work (i.e. this method can be combined with them to further improve performance).
### Learning Disentangled Representations
- These approaches use predefined factors to learn interpretable representations of the image (see paper for approaches).
- Although these present promising results, they focus on handling the data in a single domain.
- Motivated by this line of research, they propose to learn discriminative representations for patches to help the DA task.
- Thus, they use the available labels as a disentangled factor, so the method does not require any predefined factors.
## Methodology
### Overview
- Given the source and target images $I_s, I_t \in \mathbb{R}^{H \times W \times 3}$, where only source data is annotated with per-pixel semantic categories $Y_s$, they seek to learn a sem-seg model $G$ that works on both domains.
- Since the target domain is unlabeled, the goal is to align the predicted output distribution $O_t$ of the target data with the source distribution $O_s$.
- They project the category distribution of patches to the clustered space that already discovers various patch modes (i.e. $K$ clusters) based on the annotations in the source domain.
- For the target domain, they then employ adversarial learning to align the patch level distributions across domains in the $K$-dimensional space.
![Overview](https://i.imgur.com/9YnAL9u.png)
### Patch-level Alignment
- This section describes how they construct the clustered space and learn discriminative patch representations.
- It then describes adversarial alignment using the learned patch representations.
#### Patch Mode Discovery
- To discover modes and learn a discriminative feature space, class labels or predefined factors are usually provided as supervisory signals.
- Here, per-pixel annotations from the source domain are used to construct a space of semantic patch representations.
- They represent each patch by its label histogram.
- First, they randomly sample patches from source images, overlay a $2\times2$ grid on each patch to extract spatial label histograms, and concatenate these into a $2\times2\times C$-dimensional vector (where $C$ is the number of classes).
- Second, they apply $K$-means clustering to these histograms, thereby assigning each ground truth label patch a unique cluster index.
- Note: This process of finding the cluster membership for each patch in a ground truth label map $Y_s$ is defined as $\Gamma(Y_s)$.
- To implement this clustered space for training $G$ on source data, they use a classification module $H$ on top of the predicted output $O_s$, which tries to predict the cluster membership $\Gamma(Y_s)$ for all locations.
- They denote the learned representation as $F_s=H(G(I_s)) \in (0, 1)^{U \times V \times K}$ through the softmax function (where $K$ is the number of clusters).
- Each datapoint on the spatial map $F_s$ corresponds to a patch of the input image, and the group label for each patch can be obtained using $\Gamma(Y_s)$. Then, the learning process to construct the clustered space can be formulated using a cross-entropy loss:
$$
\mathcal{L}_d(F_s, \Gamma(Y_s); G, H) = -\sum_{u, v}\sum_{k=1}^{K}CE^{(u, v, k)}
\tag{1}
$$
- Here, $CE^{(u, v, k)}=\Gamma(Y_s)^{(u, v, k)}\log(F_s^{(u, v, k)})$
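The mode-discovery step above (spatial label histograms followed by $K$-means to obtain $\Gamma(Y_s)$) can be sketched in numpy. This is a minimal illustration, not the authors' implementation; function names and the tiny K-means loop are my own, and a real pipeline would use a library clusterer.

```python
import numpy as np

def patch_label_histogram(label_patch, num_classes):
    """Per-cell class histograms over a 2x2 grid, concatenated
    into a (2*2*C,)-dimensional vector, as the paper describes."""
    h, w = label_patch.shape
    cells = (label_patch[:h // 2, :w // 2], label_patch[:h // 2, w // 2:],
             label_patch[h // 2:, :w // 2], label_patch[h // 2:, w // 2:])
    feats = []
    for cell in cells:
        hist = np.bincount(cell.ravel(), minlength=num_classes).astype(float)
        feats.append(hist / hist.sum())  # normalize each cell's histogram
    return np.concatenate(feats)

def kmeans(X, k, iters=20, seed=0):
    """Tiny Lloyd's K-means; returns centroids and the cluster index of
    each row, i.e. the role of Gamma(Y_s) for sampled patches."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = X[assign == j].mean(0)
    return centroids, assign
```

The cluster indices returned by `kmeans` then serve as classification targets for the module $H$.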
#### Adversarial Alignment
- The ensuing task is to align the representations of target patches to the clustered space constructed in the source domain, ideally aligned to one of the $K$ modes.
- For this, they use an adversarial loss between $F_s$ and $F_t$, where $F_t$ is generated similarly using $H$.
- They formulate the patch distribution alignment in an adversarial objective:
$$
\mathcal{L}_{adv}(F_s, F_t; G, H, D) = \sum_{u, v}\{ \mathbb{E}[\log D(F_s)^{(u, v, 1)}] + \mathbb{E}[\log(1 - D(F_t)^{(u, v, 1)})] \}
\tag{2}
$$
- Here, $D$ is the discriminator to classify whether the feature representation $F$ is from the source domain or the target domain.
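Equations 1 and 2 can be written out as plain numpy functions, assuming $F$ is a $(U, V, K)$ softmax map and $D(F)^{(u,v,1)}$ is given as a $(U, V)$ map of per-location source probabilities. This is a sketch of the loss arithmetic only (the `eps` clamp is my addition for numerical safety), not the authors' code.

```python
import numpy as np

def cluster_ce_loss(F_s, gamma_onehot):
    """Eq. 1: cross-entropy between the predicted patch-mode
    probabilities F_s (U, V, K) and the one-hot K-means
    assignments Gamma(Y_s) (U, V, K)."""
    eps = 1e-8
    return -np.sum(gamma_onehot * np.log(F_s + eps))

def adversarial_obj(d_src, d_tgt):
    """Eq. 2: d_src / d_tgt are (U, V) maps of D's per-location
    probability that the patch representation came from the source."""
    eps = 1e-8
    return np.sum(np.log(d_src + eps) + np.log(1.0 - d_tgt + eps))
```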
#### Learning Objective
- Writing equations 1 and 2 together in a min-max objective (showing only the optimization variables):
$$
\min_{G, H} \max_D \{\mathcal{L}_s(G) + \lambda_d \mathcal{L}_d(G, H) + \lambda_{adv}\mathcal{L}_{adv}(G, H, D) \}
\tag{3}
$$
- Here, $\mathcal{L}_s$ is the cross-entropy loss for learning the structured prediction (here, sem-seg) on source data, and the $\lambda$'s weight the different losses.
### Network Optimization
- To solve the optimization problem of Equation 3, they follow the standard GAN training procedure and alternate between 2 steps:
1. update $D$
2. update $G$ and $H$ while fixing $D$
#### Update $D$
- The discriminator $D$ is trained to classify whether the feature representation $F$ is from the source (labeled as 1) or the target domain (labeled as 0).
- The maximization problem w.r.t. $D$ in Equation 3 is equivalent to minimizing the binary cross-entropy loss:
$$
\mathcal{L}_D(F_s, F_t; D) = -\sum_{u, v}\{ \mathbb{E}[\log D(F_s)^{(u, v, 1)}] + \mathbb{E}[\log(1 - D(F_t)^{(u, v, 1)})] \}
\tag{4}
$$
#### Update $G$ and $H$
- The goal of this step is to push the target distribution closer to the source distribution using the optimized $D$, while maintaining good performance on main tasks using $G$ and $H$.
- So, the minimization problem in Equation 3 is the combination of 2 supervised loss functions and an adversarial loss (which is a binary cross-entropy loss that assigns the source label to the target distribution):
$$
\mathcal{L}_{G, H} = \mathcal{L}_s + \lambda_d \mathcal{L}_d - \lambda_{adv} \sum_{u, v} \log (D(F_t)^{(u, v, 1)})
\tag{5}
$$
- Note that updating $H$ also influences $G$, thus enhancing the feature representations in $G$. Also, $H$ and $D$ are required only for training, so inference time is unaffected.
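The two alternating losses (Eq. 4 for $D$, the adversarial term of Eq. 5 for $G$ and $H$) can be sketched as follows, again assuming $D$'s per-location source probabilities are given as $(U, V)$ arrays. A real training loop would compute these with a deep-learning framework and backpropagate each loss only into the networks being updated in that step.

```python
import numpy as np

EPS = 1e-8

def d_loss(d_src, d_tgt):
    """Eq. 4: binary cross-entropy for the discriminator,
    with source labeled 1 and target labeled 0."""
    return -np.sum(np.log(d_src + EPS) + np.log(1.0 - d_tgt + EPS))

def gh_adv_loss(d_tgt):
    """Adversarial term of Eq. 5: assign the *source* label to target
    representations, so minimizing this pushes the target patch
    distribution toward the source modes."""
    return -np.sum(np.log(d_tgt + EPS))
```

Note the asymmetry: `d_loss` is small when $D$ separates the domains well, while `gh_adv_loss` is small when $D$ is fooled into labeling target patches as source.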
## Ablation Study
### Loss Functions
- Adding discriminative patch representations (just using $\mathcal{L}_s + \mathcal{L}_d$) already improves the performance.
- This demonstrates that the learned feature representation enhances the discrimination and generalization ability.
- The proposed patch-level alignment (adding $\mathcal{L}_{adv}$) further improves the performance.
### Learning Clustered Space
- $K$-means provides an additional supervisory signal that separates different patch patterns, and alignment is then performed in this clustered space.
- Without the clustered loss $\mathcal{L}_d$, it would be difficult to align patch modes across two domains. Removing this term from training reduces the performance.
### Number of Clusters
- The performance is robust to the choice of $K$.
- However, when $K$ is too large (e.g. $>300$), patch modes become confusable, which increases the training difficulty.
- To balance efficiency and accuracy, they use $K=50$ in all experiments.
### Combination with other DA methods
- Fusing the patch-level alignment method separately with each of the other methods (output-space adaptation, pixel-level adaptation, and pseudo-label re-training) improves the performance.
- Fusing all of these at the same time gives the best performance.
- Combining the other DA approaches with each other does not always improve performance incrementally. However, adding patch-level alignment consistently improves performance in all settings.
## Conclusion
- They present a DA method for structured output via patch-level alignment.
- They learn discriminative representations of patches by constructing a clustered space of the source patches and adopt an adversarial learning scheme to push the target patch distribution closer to the source ones.
- Patch-level alignment is complementary to other DA approaches and provides additional improvements.
# Supplementary Notes
- The following is useful for implementing the experiments, but may or may not be explicitly stated in the paper.
## Datasets
### GTA5
- Introduced by [Richter et al. 2016](https://arxiv.org/abs/1608.02192), it contains around 25k images generated from the photo-realistic computer game Grand Theft Auto V (GTA5).
- Using their labeling pipeline, they labeled the entire dataset in 49 hours (~7 sec/image), compared to an estimated 12 years for a similarly sized dataset without their pipeline.
### Cityscapes
- Introduced by [Cordts et al. 2016](https://arxiv.org/abs/1604.01685), it has 5k high-quality pixel-level annotations and 20k coarse annotations for road scenes (mostly in sunny conditions).
### Synthia
- Introduced by [Ros et al. 2016](https://ieeexplore.ieee.org/document/7780721), it contains images generated from a photo-realistic virtual city.
### Oxford Robot Car
- Introduced by [Maddern et al. 2016](https://journals.sagepub.com/doi/abs/10.1177/0278364916679498), it contains around 20 million images collected by repeatedly traversing the same route in central Oxford over the period of a year.
- It doesn't have ground truth labels by default, but this paper's authors selected a sequence tagged as 'rainy' and used 895 of its images for training.
- 297 separate images from the same sequence were manually annotated for testing (and are available on the [project page](https://www.nec-labs.com/~mas/adapt-seg/adapt-seg.html)).
![Image and Patch Sizes](https://i.imgur.com/TZ0VDlp.png)
## Evaluation Metrics
- They use Intersection over Union (IoU) as a metric for all experiments.
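For reference, per-class IoU and mean IoU can be computed from a confusion matrix as below. This is a standard formulation ($\text{IoU} = \text{TP} / (\text{TP} + \text{FP} + \text{FN})$), not code from the paper; the averaging convention over absent classes may differ from the authors' evaluation script.

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Per-class IoU = TP / (TP + FP + FN) from a confusion matrix;
    mIoU averages over classes with a non-empty union."""
    cm = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - tp  # TP + FP + FN per class
    valid = union > 0
    iou = np.zeros(num_classes)
    iou[valid] = tp[valid] / union[valid]
    return iou, iou[valid].mean()
```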