# Notes on "[Variational Adversarial Active Learning](https://arxiv.org/abs/1904.00370)"
###### tags: `notes` `adversarial` `variational`
Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)
Note: For proper understanding, knowledge of Variational AutoEncoders (VAE) is highly recommended. [This post](https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html) provides a very good and in-depth explanation.
## Brief Outline
- Active learning algorithms attempt to incrementally select samples for annotation that result in high classification performance with low labelling cost.
- This paper introduces a pool-based active learning strategy which learns a low dimensional latent space from labeled and unlabeled data using a VAE.
## Introduction
- This method (VAAL) selects, from the unlabeled pool, instances that are sufficiently different in the latent space learned by the VAE, so as to maximize the performance of the representation learned on the newly labeled data.
- Sample selection in VAAL is performed by an adversarial network which classifies which pool the instances belong to (labeled or unlabeled) and does not depend on the task or tasks for which we are trying to collect labels.
- The VAE and the discriminator are framed as a two-player mini-max game, similar to GANs, such that the VAE learns a feature space to trick the adversarial network into predicting that all datapoints, from both the labeled and unlabeled sets, are from the labeled pool while the discriminator network learns how to discriminate between them.
- The intuition is that once the active learner is trained, the probability associated with the discriminator's prediction effectively estimates how representative that sample is of the pool it has been deemed to be from.
## Related Work
### Active Learning
- Current active learning techniques are broadly of 2 types:
- *Query-acquiring (Pool-based)*: Use different sampling strategies to determine how to select the most informative samples. References to read - [Mahapatra et al., 2018](https://arxiv.org/abs/1806.05473), [Mayer and Timofte, 2018](https://arxiv.org/abs/1808.06671), and [Zhu and Bento, 2018](https://arxiv.org/abs/1702.07956v5).
- *Query-synthesizing*: Use generative models to generate informative samples.
- Pool-based learning techniques are of 3 types:
- *Uncertainty-based methods*
- *Representation-based methods*
- Combination of the two
- A review of these is given by [Settles, 2012](https://www.morganclaypool.com/doi/abs/10.2200/S00429ED1V01Y201207AIM018).
### Variational AutoEncoder (VAE) and Adversarial Learning
- A VAE ([Kingma and Welling, 2013](https://arxiv.org/abs/1312.6114)) is a latent variable model with an encoder-decoder architecture that places a prior distribution on the feature space and uses the Evidence Lower Bound (ELBO) to optimize the learned posterior.
- Adversarial Autoencoders ([Makhzani et al., 2015](https://arxiv.org/abs/1511.05644)) minimize the adversarial loss in the latent space between a sample from the prior and the posterior distribution.
- The use of an adversarial network ([Goodfellow et al., 2014](https://arxiv.org/abs/1406.2661)) enables training the model by solving a mini-max optimization problem.
## Methodology
![VAAL](https://i.imgur.com/lidR4TN.png)
- $(x_L, y_L)$ = a sample pair from a pool of labeled data $(X_L, Y_L)$.
- $x_U$ = a sample from a much larger pool of unlabeled data $X_U$.
- The active learner iteratively queries a fixed sampling *budget* of $b$ samples: in each round, an acquisition function picks the $b$ most informative samples from the unlabeled pool ($X_U$) to be annotated by the oracle, such that the expected loss is minimized.
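The querying procedure described above can be sketched as a generic pool-based loop. This is only an illustration, not the paper's implementation: `active_learning_loop` and the `acquisition` callback are hypothetical names, the oracle is reduced to moving indices from $X_U$ to $X_L$, and the retraining that would normally happen between rounds is omitted.

```python
from typing import Callable, List, Set


def active_learning_loop(
    pool_size: int,
    initial_labeled: Set[int],
    budget: int,
    rounds: int,
    acquisition: Callable[[int], float],
) -> List[Set[int]]:
    """Run a pool-based loop and return the labeled index set after each round.

    `acquisition(i)` is a hypothetical scoring function: a higher score
    means sample i is considered more informative.
    """
    labeled = set(initial_labeled)
    history: List[Set[int]] = []
    for _ in range(rounds):
        unlabeled = [i for i in range(pool_size) if i not in labeled]
        if not unlabeled:
            break
        # Query the `budget` highest-scoring samples from the unlabeled pool.
        queried = sorted(unlabeled, key=acquisition, reverse=True)[:budget]
        # A real oracle would return labels here, and the models would be
        # retrained before the next round; this sketch only tracks indices.
        labeled.update(queried)
        history.append(set(labeled))
    return history
```

In VAAL the score would come from the discriminator rather than from the task model, which is what makes the selection task-agnostic.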
### Transductive Representation Learning ($\beta$-VAE)
- They use a $\beta$-VAE for representation learning. The encoder learns a low dimensional space for the underlying distribution using a Gaussian prior, while the decoder reconstructs the input.
- To capture the features missing in the representation learned on the labeled pool, we can use the unlabeled data and perform *transductive learning*.
- Note: Transductive learning is closely related to semi-supervised learning (learning some property from unlabeled data which will help you learn your main task using labeled data); here, the unlabeled pool itself is used during training. In contrast, standard supervised learning is a form of *inductive learning*.
- The objective function of the $\beta$-VAE is the variational lower bound on the marginal likelihood of a given sample (it is maximized during training, or equivalently its negative is minimized), formulated as
$$
\begin{aligned}
\mathcal{L}_{VAE}^{trd} = {} & \mathbb{E}[\log p_\theta(x_L|z_L)] - \beta \mathrm{D}_{KL}(q_\phi(z_L|x_L)||p(z)) \\
& + \mathbb{E}[\log p_\theta(x_U|z_U)] - \beta \mathrm{D}_{KL}(q_\phi(z_U|x_U)||p(z))
\end{aligned}
\tag{1}
$$
- Here, $q_\phi$ and $p_\theta$ are the encoder and decoder parametrized by $\phi$ and $\theta$ respectively. $p(z)$ is the prior chosen as a unit Gaussian and $\beta$ is the Lagrangian parameter chosen for the optimization problem.
- For a better understanding of this loss function (and why it works), refer to [the VAE post mentioned at the beginning](https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html). It is sufficient, but if you want, you can look at the original paper ([Kingma and Welling, 2013](https://arxiv.org/abs/1312.6114)).
- The reparametrization trick ([Kingma and Welling, 2013](https://arxiv.org/abs/1312.6114) but also explained in [the blog post](https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html)) is used to make the VAE trainable.
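To make the pieces of Eq. 1 concrete, here is a minimal sketch of the reparametrization trick and the closed-form KL term for a diagonal Gaussian posterior against the unit Gaussian prior. The function names are hypothetical, the encoder is assumed to output a mean and a log-variance per latent dimension, and squared error stands in for $-\log p_\theta(x|z)$ (an assumption that corresponds to a fixed-variance Gaussian decoder, up to constants).

```python
import math
import random
from typing import List, Sequence


def reparameterize(mu: Sequence[float], log_var: Sequence[float],
                   rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_unit_gaussian(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def beta_vae_loss(x: Sequence[float], x_recon: Sequence[float],
                  mu: Sequence[float], log_var: Sequence[float],
                  beta: float) -> float:
    """Negative of one (reconstruction, KL) pair of Eq. 1, to be minimized.

    Assumption: squared error replaces the negative log-likelihood term.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * kl_to_unit_gaussian(mu, log_var)
```

The transductive loss would then be the sum of `beta_vae_loss` evaluated on a labeled sample and on an unlabeled sample, since Eq. 1 contains one such pair of terms for each pool.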
### Adversarial Representation Learning
- An ideal active learning agent is assumed to have a perfect sampling strategy that is capable of sending the most *informative* unlabeled data to the oracle.
- Most of the sampling strategies rely on the model's uncertainty i.e. the more uncertain the prediction, the more informative that specific unlabeled sample must be. However, this introduces vulnerability to outliers.
- In contrast, VAAL uses a discriminator (like in GANs) to map the latent representation of $z_L \cup z_U$ to a binary label (which is 1 if sample belongs to $X_L$ and is 0 otherwise).
- The VAE and discriminator are learned together in an adversarial fashion.
- The VAE maps the labeled and unlabeled data into the same latent space with similar probability distributions $q_\phi(z_L|x_L)$ and $q_\phi(z_U|x_U)$, and thereby tries to fool the discriminator into classifying all inputs as labeled.
- On the other hand, the discriminator attempts to correctly estimate which pool each latent code comes from: $D(\cdot)$ is its estimated probability of the labeled pool, so $1 - D(\cdot)$ is the probability of the unlabeled one.
- The objective function for the adversarial role of the VAE can be formulated as a Binary Cross-Entropy Loss as follows
$$
\mathcal{L}_{VAE}^{adv} = -\mathbb{E}[\log (D(q_\phi(z_L|x_L)))] - \mathbb{E}[\log (D(q_\phi(z_U|x_U)))]
\tag{2}
$$
- The objective function to train the discriminator is given as
$$
\mathcal{L}_D = -\mathbb{E}[\log (D(q_\phi(z_L|x_L)))] - \mathbb{E}[\log (1-D(q_\phi(z_U|x_U)))]
\tag{3}
$$
- By combining Eq. 1 and Eq. 2, we get the full objective function for the VAE for VAAL as
$$
\mathcal{L}_{VAE} = \lambda_1 \mathcal{L}_{VAE}^{trd} + \lambda_2 \mathcal{L}_{VAE}^{adv}
\tag{4}
$$
- Here, $\lambda_1$ and $\lambda_2$ are hyperparameters that determine the effect of each component to learn an effective variational representation.
![VAAL Algorithm](https://i.imgur.com/mkF12wk.png)
- The task module (T in the first figure) learns the task for which the active learner is being trained. T is trained separately from the active learner as they don't depend on each other.
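Eqs. 2 to 4 can be written down directly as functions of the discriminator outputs. The sketch below is an illustration with hypothetical function names; it takes lists of $D$ outputs (probabilities of "labeled") for a batch of labeled and unlabeled latent codes, and assumes `l_trd` is already expressed as a quantity to minimize.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0)


def _mean_log(values: Sequence[float]) -> float:
    """Batch estimate of E[log v]."""
    return sum(math.log(max(v, EPS)) for v in values) / len(values)


def vae_adversarial_loss(d_labeled: Sequence[float],
                         d_unlabeled: Sequence[float]) -> float:
    """Eq. 2: the VAE wants D to output 1 ('labeled') for both pools."""
    return -_mean_log(d_labeled) - _mean_log(d_unlabeled)


def discriminator_loss(d_labeled: Sequence[float],
                       d_unlabeled: Sequence[float]) -> float:
    """Eq. 3: D wants 1 on labeled codes and 0 on unlabeled codes."""
    return -_mean_log(d_labeled) - _mean_log([1.0 - d for d in d_unlabeled])


def vae_total_loss(l_trd: float, l_adv: float,
                   lambda_1: float, lambda_2: float) -> float:
    """Eq. 4: weighted sum of the transductive and adversarial terms."""
    return lambda_1 * l_trd + lambda_2 * l_adv
```

The two losses differ only in the target assigned to the unlabeled codes, which is what sets up the mini-max game: the same $D$ outputs that lower `discriminator_loss` raise `vae_adversarial_loss`.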
### Noisy Oracles
- Note: Oracles are just sources of labels (they may be humans, information already available online, etc.).
- The labels provided by the oracles might vary in how accurate they are, depending on the quality of available human resources.
- They consider 2 types of oracles:
- ideal oracle, which always provides correct labels for the active learner.
- noisy oracle, which non-adversarially provides erroneous labels for some specific classes.
- This noise might occur in practical cases due to similarities across some classes causing ambiguity for the labeler. So, for a realistic oracle, they apply a targeted noise on visually similar classes.
- Note: The implementation of the noisy oracle is detailed in Section 5.2 of the paper.
### Sampling Strategy
- Sampling strategy is shown below
![Sampling Strategy VAAL](https://i.imgur.com/ae8flh7.png)
- The probability output by the discriminator is used as a score: in every batch, the $b$ samples with the lowest $D$ output (those most strongly predicted as 'unlabeled') are collected and sent to the oracle.
- Note that the closer the $D$ output probability is to zero, the more likely it is that it comes from the unlabeled pool.
- The key idea in this approach is that instead of relying on the performance of the training algorithm on the main task (which may be unreliable at the beginning), samples are selected based on their representativeness w.r.t. other samples which the $D$ thinks belong to the unlabeled pool.
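The selection rule above amounts to a sort on the discriminator outputs. A minimal sketch, assuming `d_scores[i]` holds the $D$ output (probability of "labeled") for the $i$-th sample of the unlabeled pool; `select_for_annotation` is a hypothetical name.

```python
from typing import List, Sequence


def select_for_annotation(d_scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` samples with the lowest D output."""
    order = sorted(range(len(d_scores)), key=lambda i: d_scores[i])
    return order[:budget]
```

For example, with scores `[0.9, 0.1, 0.5, 0.05]` and a budget of 2, the samples at indices 3 and 1 would be sent to the oracle, since $D$ considers them the least likely to belong to the labeled pool.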
## Analysis of VAAL
### Ablation Study
Note: Ablation study is basically taking your system apart, and analyzing what each part does. You'll get it more clearly when you see what these people have done.
The variants of ablation considered are:
- Eliminating VAE
- This explores the role of the VAE as the representation learner by having only a discriminator trained (to discriminate between labeled and unlabeled pool).
- This results in $D$ only memorizing the data and yields the lowest performance.
- It reveals the key role of the VAE in not only learning a rich latent space but also playing an effective mini-max game with $D$ to avoid overfitting.
- Frozen VAE with $D$
- In this, they add a frozen VAE (not trainable) to the previous setting. Thus, they explore the VAE's role as an autoencoder.
- This performs better than having only the $D$ trained, but performs similarly to or worse than random sampling, suggesting that the $D$ failed to learn the representativeness of the samples in the unlabeled pool.
- Eliminating $D$
- This explores the role of $D$ by training only a VAE that uses a 2-Wasserstein distance from the cluster centroid of the labeled dataset as a heuristic to explicitly measure uncertainty.
- For a multivariate isotropic Gaussian distribution, the closed form solution ([Givens and Shortt, 1984](https://scinapse.io/papers/2040104067) - worth going through?) of the 2-Wasserstein distance between 2 probability distributions can be written as:
$$
W_{ij} = [||\mu_i - \mu_j||_2^2 + ||\Sigma_i^{\frac{1}{2}} - \Sigma_j^{\frac{1}{2}}||_\mathcal{F}^2]^{\frac{1}{2}}
\tag{5}
$$
- Here, $||.||_\mathcal{F}$ represents the Frobenius norm, $\mu_i$, $\Sigma_i$ denote the mean and variance predicted by the encoder, and $\mu_j$, $\Sigma_j$ are the mean and variance for the normal distribution over the labeled data from which the latent variable $z$ is generated.
- In this, there is an improvement over random sampling which shows the effect of explicitly measuring the uncertainty in the learned latent space.
- However, VAAL outperforms all these scenarios by implicitly learning the uncertainty over the adversarial game between the VAE and the $D$.
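For diagonal covariances, the closed form used in the "Eliminating $D$" variant is easy to evaluate by hand: $\Sigma^{\frac{1}{2}}$ is just the vector of standard deviations, and the Frobenius norm reduces to a Euclidean norm. The sketch below assumes diagonal covariances given as per-dimension variances and returns the distance itself, i.e. the square root of the sum of the two squared terms; `wasserstein2_diag` is a hypothetical name.

```python
import math
from typing import Sequence


def wasserstein2_diag(mu_i: Sequence[float], var_i: Sequence[float],
                      mu_j: Sequence[float], var_j: Sequence[float]) -> float:
    """2-Wasserstein distance between two Gaussians with diagonal covariance."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    # Squared Frobenius norm of the difference of the covariance square roots,
    # which for diagonal matrices is a sum over the standard deviations.
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_i, var_j))
    return math.sqrt(mean_term + cov_term)
```

In this ablation, each unlabeled sample's encoder output would be compared against the Gaussian fitted to the labeled data, and a larger distance would be read as higher uncertainty.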
### Robustness of VAAL
#### Effect of biased initial labels
- Intuitively, bias can affect the training such that it causes the initially labeled samples to not be representative of the underlying data distribution by being inadequate to cover most of the regions in the latent space.
- They model bias in the labeled pool by not providing labels for $m$ chosen classes at random, and compare it to the case where samples are randomly selected from all classes.
- They report that their method performs better than or similar to other methods in these experiments.
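The biased setting can be sketched as follows: pick $m$ classes at random, then draw the initial labeled pool only from the remaining classes. This is an illustrative reading of the setup described above, not the authors' code; `biased_initial_pool` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple


def biased_initial_pool(labels: Sequence[int], num_initial: int, m: int,
                        rng: random.Random) -> Tuple[List[int], Set[int]]:
    """Draw an initial labeled pool that contains no sample of m random classes."""
    classes = sorted(set(labels))
    excluded = set(rng.sample(classes, m))
    # Only samples from the non-excluded classes may start out labeled.
    eligible = [i for i, y in enumerate(labels) if y not in excluded]
    chosen = rng.sample(eligible, num_initial)
    return chosen, excluded
```

Setting `m = 0` recovers the unbiased baseline in which the initial samples are drawn at random from all classes.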
#### Effect of budget size
- They repeat their experiments for 2 budget sizes.
- Experiments with lower budget size perform better because a larger budget size results in adding redundant samples instead of more informative ones.
#### Noisy vs Ideal oracle
- CIFAR100 has 100 classes which are grouped into 20 super-classes. So, each image has a fine label (of the 100 classes) and a coarse label (of the 20 super-classes).
- For the noise, they randomly change the ground truth labels of a subset of the dataset, but only within the same super-class (which is meaningful, as such a mistake may be made by human labelers due to ambiguity).
- Since this method (VAAL) does not depend on the main task, the relative performance is comparable to the ideal oracle.
- Also, as the percentage of noisy labels increases, all the active learning strategies converge to random sampling (which is intuitive because incorrect labels bring some randomness).
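A noisy oracle of this kind can be sketched as a label flip restricted to the same super-class. The code below is a simplified illustration under stated assumptions, not the procedure from Section 5.2 of the paper: `noisy_oracle`, `fine_to_coarse`, and `noise_rate` are hypothetical names, and each query is corrupted independently with probability `noise_rate`.

```python
import random
from typing import Dict, Hashable


def noisy_oracle(true_label: int, fine_to_coarse: Dict[int, Hashable],
                 noise_rate: float, rng: random.Random) -> int:
    """Return the true fine label, or with probability `noise_rate` another
    fine label from the same super-class."""
    if rng.random() >= noise_rate:
        return true_label
    coarse = fine_to_coarse[true_label]
    siblings = [f for f, c in fine_to_coarse.items()
                if c == coarse and f != true_label]
    # A class with no siblings in its super-class cannot be confused.
    return rng.choice(siblings) if siblings else true_label
```

With `noise_rate=0.0` this behaves like the ideal oracle, so both settings can share the same active learning loop.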
#### Choice of architecture
- They repeat the experiments with ResNet18 (earlier done with VGG16) and report that the performance gaps between VAAL and other methods remain similar.
## Conclusion
- This paper presents a new task-agnostic active learning algorithm, VAAL, that learns a latent representation on both labeled and unlabeled data in an adversarial game between a VAE and a discriminator.
- It implicitly learns the uncertainty for the samples deemed to be from the unlabeled pool.
- They claim SOTA results in terms of accuracy and sampling time for image classification and semantic segmentation tasks.
- They also show that VAAL is robust to noisy oracles and biased initial data. It also performs consistently well across different budget sizes.