###### tags: `PaperReview`
# Variational Information Bottleneck for Effective Low-Resource Fine-Tuning
> ICLR 2021
## Introduction
- **Transfer learning** has emerged as the standard technique in natural language processing (NLP), where large-scale language models are **pretrained on an immense amount of text** to learn a **general-purpose** representation, which is then **transferred to the target domain** by fine-tuning on target task data.
- However, such **pretrained models have a huge number of parameters**, potentially making **fine-tuning susceptible to overfitting**.
- In particular, if the amount of **target task data is small**, fine-tuning can struggle to **distinguish relevant from irrelevant information**, leading to overfitting by spuriously associating irrelevant information with the target labels.
- This paper proposes a **fine-tuning method that uses the Variational Information Bottleneck (VIB)** to improve transfer learning in low-resource scenarios.
## Fine-Tuning on Low-Resource Datasets

### Information Bottleneck
- The objective of IB is to **find a maximally compressed representation** $Z$ **of the input representation** $X$ (compression loss) that **maximally preserves information about the output** $Y$ (prediction loss), by minimizing:
$$
\mathcal{L}_{IB} = \beta I(X, Z) - I(Z, Y)
$$
in which $\beta \geq 0$ is a hyper-parameter that **controls the balance between compression and prediction**, $I(X, Z)$ and $I(Z, Y)$ are the **compression loss** and **prediction loss** respectively, and $I(\cdot, \cdot)$ denotes **mutual information**.
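- The mutual information terms are intractable in general; following the standard deep VIB derivation, each is replaced with a variational bound using a decoder $q_\phi(y \mid z)$ and a prior $r(z)$:
$$
I(Z, Y) \geq \underset{p(y, z)}{\mathbb{E}}\left[\log q_\phi(y \mid z)\right]+H(Y), \qquad I(X, Z) \leq \underset{x}{\mathbb{E}}\left[\mathrm{KL}\left[p_\theta(z \mid x), r(z)\right]\right]
$$
Dropping the constant $H(Y)$ and weighting the compression bound by $\beta$ gives the VIB loss in the next subsection.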
### Variational Information Bottleneck
- The loss for VIB can be derived as:
$$
\mathcal{L}_{\mathrm{VIB}}=\beta \underset{x}{\mathbb{E}}\left[\mathrm{KL}\left[p_\theta(z \mid x), r(z)\right]\right]+\underset{z \sim p_\theta(z \mid x)}{\mathbb{E}}\left[-\log q_\phi(y \mid z)\right]
$$
- For two $K$-dimensional Gaussians, the KL term has the closed form:
$$
\operatorname{KL}\left(\mathcal{N}\left(\mu_0, \Sigma_0\right) \| \mathcal{N}\left(\mu_1, \Sigma_1\right)\right)=\frac{1}{2}\left(\operatorname{tr}\left(\Sigma_1^{-1} \Sigma_0\right)+\left(\mu_1-\mu_0\right)^T \Sigma_1^{-1}\left(\mu_1-\mu_0\right)-K+\log \frac{\operatorname{det}\left(\Sigma_1\right)}{\operatorname{det}\left(\Sigma_0\right)}\right)
$$
- During training, the compressed sentence representation $z$ is **sampled from the distribution** $p_\theta(z|x)$, meaning that a **specific pattern of noise is added** to the input of the output classifier $q_\phi(y|z)$.
- During testing, the **expected value of** $z$ is used for predicting labels with $q_\phi(y|z)$.
- A **parametric Gaussian distribution** is assumed for both the prior $r(z)$ and the posterior $p_\theta(z|x)$, which allows an **analytic computation of their KL divergence** (see the sketch below).
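- A minimal PyTorch-style sketch of such a VIB head on top of a sentence encoder is given below; the layer sizes, names, and the diagonal-Gaussian / standard-normal-prior choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBHead(nn.Module):
    """Compress a sentence embedding x into z and classify from z (sketch)."""
    def __init__(self, in_dim=768, z_dim=128, num_classes=2, beta=1e-3):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)       # mean of p_theta(z|x)
        self.logvar = nn.Linear(in_dim, z_dim)   # log-variance (diagonal covariance)
        self.classifier = nn.Linear(z_dim, num_classes)  # q_phi(y|z)
        self.beta = beta

    def forward(self, x, y=None):
        mu, logvar = self.mu(x), self.logvar(x)
        if self.training:
            # Reparameterized sample z ~ p_theta(z|x): noise is injected during training
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        else:
            # At test time, use the expected value of z
            z = mu
        logits = self.classifier(z)
        if y is None:
            return logits
        # Closed-form KL[N(mu, diag(sigma^2)), N(0, I)] for a standard-normal prior r(z)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
        ce = F.cross_entropy(logits, y)          # E[-log q_phi(y|z)]
        return ce + self.beta * kl               # L_VIB
```

  During fine-tuning, `x` would be, e.g., the pooled BERT sentence embedding, and `beta` plays the role of $\beta$ in $\mathcal{L}_{\mathrm{VIB}}$.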
## Experiments
- Evaluated on **text classification, natural language inference, similarity, and paraphrase detection** tasks.
- **BERT~Base~ and BERT~Large~** are used as base models.
### Varying Resource Results

### OOD generalization

### Impact of VIB on Overfitting
- Analyzes the effect of the $\beta$ parameter on training and validation loss.

- For **small values of** $\beta$, where VIB has little effect, the **validation loss is substantially higher than the training loss**, indicating **overfitting**.
- As $\beta$ becomes **too large**, both the **training and validation losses shoot up** because the amount of **preserved information is insufficient** to differentiate between the classes.
# Task-specific Fine-tuning via Variational Information Bottleneck for Weakly-supervised Pathology Whole Slide Image Classification
> CVPR 2023
## Introduction
- The reading of Whole Slide Images (WSIs) with **gigapixel resolution** is **time-consuming**, which poses an urgent need for **automatic computer-assisted diagnosis**.
- Though computers can boost the speed of the diagnosis process, the **enormous resolution, over 100M, makes it infeasible to acquire precise and exhaustive annotations for model training**; moreover, current hardware can hardly support parallel training on all patches of a WSI.
- Proposes a solution based on the **Variational IB** to tackle the dilemma between fine-tuning and computational limitations.

## Method
- Given a WSI $X$, the goal is to make slide-level prediction $Y$ by learning a classifier $f(X ; \theta)$.
- Due to its extremely high resolution, $X$ is split into a huge bag of small instances (patches) $X=\left\{x_1, \ldots, x_N\right\}$, where $N$ is the number of instances.
- The slide-level prediction $\hat{Y}$ is given by a max-pooling operation over the latent labels $\hat{y}_i$ of the instances $x_i$, defined as (see the sketch at the end of this section):
$$
\hat{Y}=\max \left\{\hat{y}_1, \ldots, \hat{y}_N\right\}
$$

- Define the IB module as:
$$
z = m \odot x
$$
where $m$ is a binary mask whose entries $m_i$ are Bernoulli($\pi$) distributed.
- The KL can be decomposed as:
$$
\mathrm{KL}\left[p_\theta\left(m_i \mid x\right), r\left(m_i\right)\right]+\pi H(X)
$$

where $H(X)$ is the entropy of $X$, which can be **omitted during minimization** since it is constant with respect to the model parameters.
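
A minimal sketch of the instance-level pipeline above is given below; the module names, the use of precomputed patch features, the Bernoulli prior value, and the straight-through relaxation of the non-differentiable Bernoulli sampling are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MaskIBMIL(nn.Module):
    """Sketch: Bernoulli-mask IB over patch features + max-pooling MIL head."""
    def __init__(self, feat_dim=512, beta=1e-2, prior_pi=0.5):
        super().__init__()
        self.mask_logits = nn.Linear(feat_dim, 1)   # predicts pi_i = p_theta(m_i = 1 | x_i)
        self.instance_clf = nn.Linear(feat_dim, 1)  # produces the latent instance label y_hat_i
        self.beta = beta
        self.prior_pi = prior_pi                    # r(m_i): Bernoulli prior (assumed value)

    def forward(self, x):                           # x: (N, feat_dim) patch features of one WSI
        pi = torch.sigmoid(self.mask_logits(x))     # (N, 1) mask probabilities
        if self.training:
            # Straight-through Bernoulli sample so gradients flow through pi
            m = torch.bernoulli(pi) + pi - pi.detach()
        else:
            m = (pi > 0.5).float()
        z = m * x                                   # z = m ⊙ x (masked instance features)
        y_hat = self.instance_clf(z).squeeze(-1)    # latent instance scores
        Y_hat = y_hat.max()                         # slide-level prediction via max-pooling
        # KL[Bernoulli(pi), Bernoulli(prior_pi)] per instance; the H(X) term is constant and omitted
        p = pi.clamp(1e-6, 1 - 1e-6)
        r = self.prior_pi
        kl = (p * torch.log(p / r) + (1 - p) * torch.log((1 - p) / (1 - r))).mean()
        return Y_hat, self.beta * kl
```

Training would combine the returned slide-level score with the weak slide label via, e.g., `F.binary_cross_entropy_with_logits`, and add the weighted KL term.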
## Result
