# Notes on "[Active Mixup for Data-Efficient Knowledge Distillation](https://arxiv.org/abs/2003.13960)"
###### tags: `notes` `knowledge-distillation` `active-learning` `image-mixup`
Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)
## Brief Outline
They propose to blend active learning ([Gissin and Shalev-Shwartz, 2019](https://arxiv.org/abs/1907.06347)) and image mixup ([Zhang et al., 2017](https://arxiv.org/abs/1710.09412)) to tackle data-efficient knowledge distillation from a blackbox teacher model.
## Introduction
- Synthesize a big pool of images from few training examples by *mixup* and then use *active learning* to select the most helpful subset from the pool to query the teacher model.
- Treat the teacher model's outputs as ground truth labels and train the student network using them.
- Image mixup ([Zhang et al., 2017](https://arxiv.org/abs/1710.09412), [Guo et al., 2019](https://arxiv.org/abs/1809.02499), [Berthelot et al., 2019](https://arxiv.org/abs/1905.02249)) synthesizes a virtual image as a convex combination of two training images. While the resulting image may be cluttered or semantically meaningless, it resides near the *manifold* of natural images.
- A large mixup pool provides good coverage of the manifold of natural images. Thus, a student network that imitates the teacher on these mixup images is expected to make predictions on the test set similar to the teacher's.
- Instead of querying the blackbox teacher model by all the mixup images, they use active learning to improve the querying efficiency. Check out [these paper notes](https://hackmd.io/@akshayk07/Sk0KdyMOL) for some information on active learning.
## Methodology
Given a blackbox teacher model and a small number of unlabeled images, the approach iterates over the following 3 steps:
1. constructing a big candidate pool of synthesized images from the small number of unlabeled images
2. actively choosing a subset from the pool for which the current student network is the most uncertain
3. querying the blackbox teacher model to acquire labels for this subset and to retrain the student network
### Constructing a candidate pool
- Given two natural images $x_i$ and $x_j$, mixup generates multiple synthetic images by a convex combination of the two with different coefficients,
$$
\hat{x}_{ij}(\lambda) = \lambda x_i + (1 - \lambda)x_j
\tag{1}
$$
- Here, the coefficient $\lambda \in [0, 1]$. This technique is handy and effective as it exponentially expands the size of the initial image pool.
- The pool of synthetic images can be viewed as a dense sampling of the convex hull of the natural images. The test images likely fall into (or close to) this convex hull assuming the initial images are diverse and representative.
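The pool construction in Eq. 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the function name, the enumeration over unordered pairs, and the fixed grid of $\lambda$ values are my own choices.

```python
import numpy as np

def build_mixup_pool(images, lambdas=(0.25, 0.5, 0.75)):
    """Synthesize virtual images via Eq. 1 for every pair i < j.

    images: array of shape (N, ...); lambdas: coefficients in [0, 1].
    Returns a list of ((i, j, lam), mixed_image) entries, so each
    synthetic image stays tied to the pair and coefficient it came from.
    """
    pool = []
    n = len(images)
    for i in range(n):
        for j in range(i + 1, n):
            for lam in lambdas:
                # Eq. 1: convex combination of the two natural images
                mixed = lam * images[i] + (1.0 - lam) * images[j]
                pool.append(((i, j, lam), mixed))
    return pool
```

With $N$ initial images and $L$ coefficients this yields $L \cdot N(N-1)/2$ synthetic images, which is how a small unlabeled set expands into a large candidate pool.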
### Actively choosing a subset
- Let $\{\hat{x}_{ij}(\lambda), \lambda \in [0, 1], i \neq j\}$ denote the augmented pool of virtual images. It is straightforward to query the teacher to obtain (soft) labels for all these, but this may incur high computational and financial costs.
- Instead, use active learning to reduce the cost. Define the student network's confidence over an input $x$ as $C_1(x):=\max_yP_s(y|x)$ where $P_s(y|x)$ is the probability of input $x$ belonging to the class $y$ predicted by the student network.
- Lower confidence implies that the student network can gain more from the teacher's label for that input. However, ranking all the synthetic images in the pool and choosing the top ones may lead to near-duplicate images (the same pair with slightly different $\lambda$ gets selected repeatedly because of very low confidence).
- Thus, they rank image pairs in the candidate pool. They define the confidence of the student network over a pair of images $x_i$ and $x_j$ as
$$
C_2(x_i, x_j) := \min_{\lambda \in [0, 1]} C_1(\hat{x}_{ij}(\lambda))
\tag{2}
$$
- A pair's least-confident synthetic image is selected if its confidence score $C_2$ is among the $k$ lowest, so at most one image per pair enters the query set. The query set size $k$ is a hyperparameter.
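The pair-level selection of Eq. 2 can be sketched as follows. This is a hypothetical illustration (the function names and the dict-based input layout are my own): `pair_probs` holds the student's predicted class probabilities on each pair's mixup images, and we keep the $k$ pairs whose $C_2$ is lowest.

```python
import numpy as np

def student_confidence(probs):
    """C1(x) = max_y P_s(y|x): the max predicted probability per image."""
    return probs.max(axis=-1)

def select_pairs(pair_probs, k):
    """Rank pairs by C2 (Eq. 2) and return the k least-confident ones.

    pair_probs: dict mapping (i, j) -> array of shape (L, num_classes),
    the student's probabilities on the L mixup images of that pair.
    For each pair, C2 = min over lambda of C1; the lambda index attaining
    the minimum identifies the synthetic image to query the teacher with.
    """
    scored = []
    for (i, j), probs in pair_probs.items():
        c1 = student_confidence(probs)   # shape (L,)
        lam_idx = int(np.argmin(c1))     # lambda attaining the minimum
        scored.append((float(c1[lam_idx]), (i, j), lam_idx))
    scored.sort(key=lambda t: t[0])      # least confident first
    return scored[:k]
```

Taking the minimum over $\lambda$ before ranking is what enforces the at-most-one-image-per-pair constraint that plain image-level ranking lacks.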
### Training the student
- Query the blackbox teacher model with the actively selected set of images, and use the soft predictions as labels for the images.
- Merge them with the previously labeled data (if this is not the first iteration) and train the student network with a cross-entropy loss against the soft labels.
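The training objective can be written as a cross-entropy between the teacher's soft predictions (treated as ground truth) and the student's output distribution. A minimal NumPy sketch of that loss, under my own naming (the paper does not publish this code):

```python
import numpy as np

def soft_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy of the student against the teacher's soft labels.

    L = -sum_y P_t(y|x) * log P_s(y|x), averaged over the batch.
    student_logits, teacher_probs: arrays of shape (batch, num_classes).
    """
    # Numerically stable log-softmax of the student logits
    z = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-(teacher_probs * log_probs).sum(axis=1).mean())
```

With one-hot teacher outputs this reduces to the ordinary cross-entropy; with soft teacher probabilities it also transfers the teacher's relative confidence across classes.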
![Overall Algorithm](https://i.imgur.com/KniS8zn.png)
## Ablation Study
### Data Efficiency and Query Efficiency
- They investigate how the results change as they vary the total number of unlabeled real images (data efficiency) and the number of synthetic images selected by the active learning scheme (query efficiency).
- The more synthetic images selected by the confidence score (Eq. 2), the higher the quality of the distilled student network. Similarly, the more unlabeled real images available, the higher the distillation success rate.
- When the number of synthetic images is already high, the gain from adding real images diminishes.
- Another trend: halving the number of real images can be compensated by doubling the number of synthetic images, maintaining a similar distillation success rate.
### Active Mixup vs Random Search
- For a fair comparison, they give the same set of natural images to both random search and active mixup.
- Since active mixup avoids redundancy by using the improved confidence score (Eq. 2), they equip random search with similar capability by using a single coefficient $\lambda=0.5$ to construct synthetic images.
- Results show that active mixup outperforms random search, which verifies the query efficiency of the proposed method.
### Active Mixup vs Vanilla Active Learning
- Active mixup improves on the vanilla scheme by selecting at most one synthetic image from each pair of real images. To quantify the improvement, they replace active mixup with the vanilla active learning scheme.
- The vanilla approach performs worse than both active mixup and random search. This is because it selects nearly duplicated images to query the teacher model.
- The prominent improvement suggests that the constraint imposed by $C_2$ in Eq. 2 is crucial to the approach.
## Extra Experiments
### Active Mixup with Out-of-Domain Data for KD
- They apply their technique in the case where the actual training data is unavailable but data for a similar task is, i.e. there is a domain gap between the actual and available datasets.
- They claim similar performance with out-of-domain data as using in-domain data with their approach.
- Distillation accuracy increases as more (out-of-domain) real images are used.
- Without the original training data, mixup augmentation is critical to boost the performance.
## Conclusion
- The few examples available for KD are insufficient to represent the variation in the original training data, so the authors propose mixup augmentation.
- Blackbox teacher models are often expensive to query, so the authors propose to use an active learning scheme.
- Thus, they apply a combination of mixup and active learning for data efficient KD.