###### tags: `PaperReview`
# Variational Information Bottleneck for Effective Low-Resource Fine-Tuning
> ICLR 2021
## Introduction
- **Transfer learning** has emerged as the standard technique in natural language processing (NLP), where large-scale language models are **pretrained on an immense amount of text** to learn a **general-purpose** representation, which is then **transferred to the target domain** by fine-tuning on target task data.
- However, such **pretrained models have a huge number of parameters**, potentially making **fine-tuning susceptible to overfitting**.
- In particular, if the amount of **target task data is small**, fine-tuning can struggle to **distinguish relevant from irrelevant information**, leading to overfitting by spuriously associating irrelevant information with the target labels.
- This paper proposes a **fine-tuning method that uses the Variational Information Bottleneck (VIB)** to improve transfer learning in low-resource scenarios.
## Fine-Tuning on Low-Resource Datasets

### Information Bottleneck
- The objective of IB is to **find a maximally compressed representation** $Z$ **of the input representation** $X$ (compression loss) that **maximally preserves information about the output** $Y$ (prediction loss), by minimizing:
$$
\mathcal{L}_{IB} = \beta I(X, Z) - I(Z, Y)
$$
in which $\beta \geq 0$ is a hyper-parameter that **controls the balance between compression and prediction**, $I(X, Z)$ and $I(Z, Y)$ are the **compression loss** and **prediction loss** respectively, and $I(\cdot, \cdot)$ denotes **mutual information**.
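- The mutual information terms are intractable in general; following the standard deep VIB derivation, each is replaced with a variational bound using a decoder $q_\phi(y \mid z)$ and a prior $r(z)$:
$$
I(Z, Y) \geq \underset{p(y, z)}{\mathbb{E}}\left[\log q_\phi(y \mid z)\right]+H(Y), \qquad I(X, Z) \leq \underset{x}{\mathbb{E}}\left[\mathrm{KL}\left[p_\theta(z \mid x), r(z)\right]\right]
$$
Dropping the constant $H(Y)$ and weighting the compression bound by $\beta$ gives the VIB loss in the next subsection.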
### Variational Information Bottleneck
- The loss for VIB can be derived as:
$$
\mathcal{L}_{\mathrm{VIB}}=\beta \underset{x}{\mathbb{E}}\left[\mathrm{KL}\left[p_\theta(z \mid x), r(z)\right]\right]+\underset{z \sim p_\theta(z \mid x)}{\mathbb{E}}\left[-\log q_\phi(y \mid z)\right]
$$
- For two $K$-dimensional Gaussians, the KL term has the closed form:
$$
\operatorname{KL}\left(\mathcal{N}\left(\mu_0, \Sigma_0\right) \| \mathcal{N}\left(\mu_1, \Sigma_1\right)\right)=\frac{1}{2}\left(\operatorname{tr}\left(\Sigma_1^{-1} \Sigma_0\right)+\left(\mu_1-\mu_0\right)^T \Sigma_1^{-1}\left(\mu_1-\mu_0\right)-K+\log \frac{\operatorname{det}\left(\Sigma_1\right)}{\operatorname{det}\left(\Sigma_0\right)}\right)
$$
- During training, the compressed sentence representation $z$ is **sampled from the distribution** $p_\theta(z|x)$, meaning that a **specific pattern of noise is added** to the input of the output classifier $q_\phi(y|z)$.
- During testing, the **expected value of** $z$ is used for predicting labels with $q_\phi(y|z)$.
- A **parametric Gaussian distribution** is assumed for both the prior $r(z)$ and the posterior $p_\theta(z|x)$, which allows an **analytic computation of their KL divergence** (see the sketch below).
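- A minimal PyTorch-style sketch of such a VIB head on top of a sentence encoder is given below; the layer sizes, names, and the diagonal-Gaussian / standard-normal-prior choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBHead(nn.Module):
    """Compress a sentence embedding x into z and classify from z (sketch)."""
    def __init__(self, in_dim=768, z_dim=128, num_classes=2, beta=1e-3):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)       # mean of p_theta(z|x)
        self.logvar = nn.Linear(in_dim, z_dim)   # log-variance (diagonal covariance)
        self.classifier = nn.Linear(z_dim, num_classes)  # q_phi(y|z)
        self.beta = beta

    def forward(self, x, y=None):
        mu, logvar = self.mu(x), self.logvar(x)
        if self.training:
            # Reparameterized sample z ~ p_theta(z|x): noise is injected during training
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        else:
            # At test time, use the expected value of z
            z = mu
        logits = self.classifier(z)
        if y is None:
            return logits
        # Closed-form KL[N(mu, diag(sigma^2)), N(0, I)] for a standard-normal prior r(z)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
        ce = F.cross_entropy(logits, y)          # E[-log q_phi(y|z)]
        return ce + self.beta * kl               # L_VIB
```

  During fine-tuning, `x` would be, e.g., the pooled BERT sentence embedding, and `beta` plays the role of $\beta$ in $\mathcal{L}_{\mathrm{VIB}}$.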
## Experiments
- Evaluated on **text classification, natural language inference, similarity, and paraphrase detection** tasks.
- **BERT~Base~ and BERT~Large~** are used as base models.
### Varying Resource Results

### OOD generalization

### Impact of VIB on Overfitting
- Analyzes the effect of the $\beta$ parameter on training and validation loss.

- For **small values of** $\beta$, where VIB has little effect, the **validation loss is substantially higher than the training loss**, indicating **overfitting**.
- As $\beta$ becomes **too large**, both the **training and validation losses shoot up** because the amount of **preserved information is insufficient** to differentiate between the classes.
# Task-specific Fine-tuning via Variational Information Bottleneck for Weakly-supervised Pathology Whole Slide Image Classification
> CVPR 2023
## Introduction
- The reading of Whole Slide Images (WSIs) with **gigapixel resolution** is **time-consuming**, which poses an urgent need for **automatic computer-assisted diagnosis**.
- Though computers can boost the speed of the diagnosis process, the **enormous resolution, over 100M, makes it infeasible to acquire precise and exhaustive annotations for model training**; moreover, current hardware can hardly support parallel training on all patches of a WSI.
- Proposes a solution based on the **Variational IB** to tackle the dilemma between fine-tuning and computational limitations.

## Method
- Given a WSI $X$, the goal is to make slide-level prediction $Y$ by learning a classifier $f(X ; \theta)$.
- Due to its extremely high resolution, $X$ is split into a huge bag of small instances (patches) $X=\left\{x_1, \ldots, x_N\right\}$, where $N$ is the number of instances.
- The slide-level prediction $\hat{Y}$ is given by a max-pooling operation over the latent labels $\hat{y}_i$ of the instances $x_i$, defined as (see the sketch at the end of this section):
$$
\hat{Y}=\max \left\{\hat{y}_1, \ldots, \hat{y}_N\right\}
$$

- Define the IB module as:
$$
z = m \odot x
$$
where $m$ is a binary mask whose entries $m_i$ are Bernoulli($\pi$) distributed.
- The KL can be decomposed as:
$$
\mathrm{KL}\left[p_\theta\left(m_i \mid x\right), r\left(m_i\right)\right]+\pi H(X)
$$

where $H(X)$ is the entropy of $X$, which can be **omitted during minimization** since it is constant with respect to the model parameters.
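
A minimal sketch of the instance-level pipeline above is given below; the module names, the use of precomputed patch features, the Bernoulli prior value, and the straight-through relaxation of the non-differentiable Bernoulli sampling are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MaskIBMIL(nn.Module):
    """Sketch: Bernoulli-mask IB over patch features + max-pooling MIL head."""
    def __init__(self, feat_dim=512, beta=1e-2, prior_pi=0.5):
        super().__init__()
        self.mask_logits = nn.Linear(feat_dim, 1)   # predicts pi_i = p_theta(m_i = 1 | x_i)
        self.instance_clf = nn.Linear(feat_dim, 1)  # produces the latent instance label y_hat_i
        self.beta = beta
        self.prior_pi = prior_pi                    # r(m_i): Bernoulli prior (assumed value)

    def forward(self, x):                           # x: (N, feat_dim) patch features of one WSI
        pi = torch.sigmoid(self.mask_logits(x))     # (N, 1) mask probabilities
        if self.training:
            # Straight-through Bernoulli sample so gradients flow through pi
            m = torch.bernoulli(pi) + pi - pi.detach()
        else:
            m = (pi > 0.5).float()
        z = m * x                                   # z = m ⊙ x (masked instance features)
        y_hat = self.instance_clf(z).squeeze(-1)    # latent instance scores
        Y_hat = y_hat.max()                         # slide-level prediction via max-pooling
        # KL[Bernoulli(pi), Bernoulli(prior_pi)] per instance; the H(X) term is constant and omitted
        p = pi.clamp(1e-6, 1 - 1e-6)
        r = self.prior_pi
        kl = (p * torch.log(p / r) + (1 - p) * torch.log((1 - p) / (1 - r))).mean()
        return Y_hat, self.beta * kl
```

Training would combine the returned slide-level score with the weak slide label via, e.g., `F.binary_cross_entropy_with_logits`, and add the weighted KL term.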
## Result
