# A Stability Analysis of Fine-Tuning a Pre-Trained Model

###### tags: `Meeting`

## Introduction

- Fine-tuning a pre-trained model has proven to be one of the most promising paradigms in recent NLP research.
- Numerous recent works indicate that fine-tuning suffers from an instability problem:
    - Tuning the same model under the same setting results in significantly different performance
    - This instability substantially impairs model performance and makes different fine-tuned models incomparable with each other

> **ON THE STABILITY OF FINE-TUNING BERT: MISCONCEPTIONS, EXPLANATIONS, AND STRONG BASELINES** - ICLR 2021
>
> *Why is fine-tuning prone to failures and how can we improve its stability?*
> - Fine-tuning a model multiple times on the same dataset, varying only the random seed, leads to a large standard deviation of the fine-tuning accuracy
> - The observed instability is caused by optimization difficulties that lead to vanishing gradients, and by differences in generalization late in training
> - They propose to use a **smaller learning rate** and **more iteration steps**
>     - Use small learning rates with bias correction to avoid vanishing gradients early in training
>     - Increase the number of iterations considerably and train to (almost) zero training loss
>
> ![](https://hackmd.io/_uploads/BJ40zu7Eh.png)

> **Better fine-tuning by reducing representational collapse** - ICLR 2021
> - Fine-tuning for each task has been shown to be a highly unstable process
>     - Many hyperparameter settings produce failed fine-tuning runs, unstable results (large variation between random seeds), over-fitting, and other unwanted consequences
> - **Catastrophic forgetting**, originally proposed as **catastrophic interference**: a phenomenon that occurs during sequential training where new updates interfere catastrophically with previous updates, manifesting in **forgetting of certain examples with respect to a fixed task**
> - **Representational collapse** is the degradation of the generalizable representations of pre-trained models during the fine-tuning stage
>     - Fine-tuning collapses the wide range of information available in the representations into a smaller set needed only for the immediate task and the particular training set
>
> ![](https://hackmd.io/_uploads/BkMdEY7N2.png =50%x)
> - They propose to control the **Lipschitz constant** with different noise regularizations

- In this paper, they give a unified theoretical stability analysis of the two most widely used fine-tuning paradigms:
    1. Full fine-tuning
    2. Head tuning (linear probing)
- They propose three novel strategies to stabilize the fine-tuning procedure:
    1. Maximal Margin Regularizer (MMR)
        - Maximizes the margin between the encoded features from different classes by adding a **new regularization term** to the original loss
        - Minimizing this term increases the distance between the features
    2. Multi-Head Loss (MHLoss)
        - Several linear heads are trained simultaneously and combined to predict the label
        - Accelerates the convergence rate of the training procedure, which improves the fine-tuning stability
    3. Self Unsupervised Re-Training (SURT)
        - Fine-tuning from a model whose weights are closer to the final weights helps stabilize the fine-tuning procedure

## Theoretical Analysis

### Stability Analysis for Full Fine-Tuning

- Some works propose to use the pointwise hypothesis stability to measure the output variation after removing one of the training samples.
- Some works propose to directly use the distance between model parameters as a measure of stability (a minimal sketch of this idea follows below).
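As a concrete illustration of the parameter-distance view of stability mentioned above, here is a minimal PyTorch sketch (not from the paper): fine-tune from the same pre-trained weights twice, e.g. with different seeds or with one training sample removed, and compare the flattened parameter vectors. The `fine_tune` routine and its arguments are hypothetical placeholders.

```python
import torch

def flatten_params(model: torch.nn.Module) -> torch.Tensor:
    """Concatenate all model parameters into a single vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def parameter_distance(model_a: torch.nn.Module, model_b: torch.nn.Module) -> float:
    """||w_A - w_B||: the distance-between-parameters stability measure."""
    return torch.norm(flatten_params(model_a) - flatten_params(model_b)).item()

# Hypothetical usage: `fine_tune(seed, dataset)` is a placeholder training routine.
# model_1 = fine_tune(seed=1, dataset=train_set)
# model_2 = fine_tune(seed=2, dataset=train_set)   # or train_set with one sample removed
# print(parameter_distance(model_1, model_2))
```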
**Definition 1** (Leave-one-out Model Stability). They say that a learning algorithm $A$ has leave-one-out model stability $\epsilon$ if $\forall i \in \{ 1, ...,n \}$,
$$
\mathbb{E}_S[\|A(S^i)-A(S)\|] \leq \epsilon
$$
They take $\epsilon$ to be the infimum over all values for which Definition 1 holds.

They assume the overall loss function $f$ is $L$-Lipschitz and $\beta$-smooth. Moreover, as empirically indicated by previous work, the **pre-trained parameter** $w_0$ **and the fine-tuned parameter** $w_*$ **are very close to each other**. Therefore, around the optimal solution $w_*$, $f(w,x)$ can be approximated by its **second-order Taylor expansion**:
$$
f(w,x)=f(w_*,x)+(w-w_*)^T\nabla f(w_*,x)+\frac{1}{2}(w-w_*)^T\nabla^2f(w_*,x)(w-w_*)
$$
where $x$ stands for the given fixed input data and $w$ is the function parameter. Since $w_*$ is the optimal solution, the Hessian matrix $\nabla^2f(w_*,x)$ is positive semidefinite. Moreover, big pre-trained models almost achieve zero training loss, so they assume that $f(w_*,x)=0$.

**Theorem 2** (Stability Bound for Full Fine-Tuning). Suppose that the loss function $(w,x)\mapsto f(w,x)$ is non-negative, $L$-Lipschitz, and $\beta$-smooth with respect to $w$, that $\mu I \leq \nabla^2 f(w_*,x)$ with $\mu > 0$, and that $f(w_*,x)=0$. If $A$ is the gradient descent method with learning rate $\eta=\frac{1}{\beta}$, then the leave-one-out model stability satisfies
$$
\mathbb{E}_S[\|A(S^i)-A(S)\|] \leq \frac{\sqrt{2L\|w_0-w_*\|/\beta}}{n(1/\sqrt[4]{1-\frac{\mu}{\beta}}-1)}
$$

- Increasing the training sample size $n$ reduces the bound
    - It brings down the stability upper bound and potentially stabilizes the fine-tuning procedure
- Reducing the Lipschitz constant $L$ of the function $f$ can similarly diminish the leave-one-out model stability, stabilizing the training procedure
- They note that reducing the distance $\|w_0-w_*\|$ between the initial parameter $w_0$ and the optimal parameter $w_*$ can also improve stability
    - When the start point and the endpoint are close to each other, the optimization procedure is less likely to jump to some other local minimum, which makes the training procedure more stable

:::spoiler Proof of Theorem 2
- They first recall the gradient descent convergence bound for a $\mu$-strongly convex function, given in Lemma A.1.1
![](https://hackmd.io/_uploads/SkXcW-I42.png)
![](https://hackmd.io/_uploads/HyM-MWLEh.png)
- They give a standard lemma (Lemma A.1.2) widely used in smooth convex optimization
![](https://hackmd.io/_uploads/BJ-2W-8Vn.png)
- A lemma for a self-bounded function is given in Lemma A.1.3
![](https://hackmd.io/_uploads/rJbEfWLV3.png =80%x)
- Finally, they give the proof of Theorem 2, which builds on the Taylor expansion used for full fine-tuning
![](https://hackmd.io/_uploads/S10dfWU43.png)
![](https://hackmd.io/_uploads/ryzcMbUE3.png)
![](https://hackmd.io/_uploads/H1oizWINn.png)
![](https://hackmd.io/_uploads/r1xaf-84n.png)
:::

### Stability Analysis for Head Tuning

- If we train a linear model on a linearly separable dataset with the loss $\ell$, the norm of $w$ diverges to infinity, $\lim_{t \rightarrow \infty}\|w_t\|=\infty$, so the raw parameter distance of Definition 1 is no longer informative

**Definition 3** (Normalized Leave-one-out Model Stability).
$$
\mathbb{E}_S \left[ \left\|\frac{A(S^i)}{\|A(S^i)\|}-\frac{A(S)}{\|A(S)\|} \right\| \right] \leq \epsilon
$$
Definition 3 normalizes the parameters trained on the corresponding training data and focuses on the direction gap between them.
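A small NumPy sketch (an illustration, not the paper's code) of the quantity in Definition 3: both weight vectors are rescaled to unit norm before taking the distance, so the measure depends only on the angle between the two solutions and equals $\sqrt{2-2\cos\theta}$.

```python
import numpy as np

def normalized_gap(w_full: np.ndarray, w_loo: np.ndarray) -> float:
    """Direction gap || w/||w|| - w'/||w'|| || from Definition 3."""
    u = w_full / np.linalg.norm(w_full)
    v = w_loo / np.linalg.norm(w_loo)
    return float(np.linalg.norm(u - v))

w1 = np.array([2.0, 1.0])
w2 = np.array([4.0, 2.2])           # nearly the same direction, very different norm
print(normalized_gap(w1, w2))       # small: only the direction difference matters

# Equivalent form via the angle between the two solutions: sqrt(2 - 2*cos(theta))
cos_t = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
print(np.sqrt(2 - 2 * cos_t))
```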
The head tuning approach aims to find a separation plane $w^T \tilde{x}=0$ that classifies the encoded features $\tilde{x}_i$ into two categories.

**Theorem 4** (Stability Bound for Head Tuning). Given a linearly separable dataset $S$, suppose that the encoded features $E(x_i)$ are bounded as $\|E(x_i)\|\leq B$, $\forall i\in\{1,...,n\}$. Let $\gamma_S$ be the maximal margin between the separation plane $\hat{w}^T_S\tilde{x}=0$ and the encoded features $E(x_i)$. Suppose further that the model parameter $w$ is optimized by gradient descent with $t$ iterations and learning rate $\eta < 2\beta^{-1}\sigma^{-1}_{max}(\tilde{X})$. Then, for some constants $C$, $\lambda$, $\nu$, the normalized leave-one-out model stability is upper bounded as
$$
\mathbb{E}_S\left[ \left\| \frac{A(S^i)}{\|A(S^i)\|}-\frac{A(S)}{\|A(S)\|} \right\| \right] \leq \frac{C\log\log t}{\log t} + \nu \max \left\{ \sqrt{\frac{2}{\lambda n}\left( 1+\frac{B}{\gamma_S} \right)}, \frac{B+\sqrt{B^2+8n\lambda(1+B/\gamma_S)}}{2n\lambda} \right\}
$$

- This theorem builds on [Soudry et al. (2018)](https://arxiv.org/abs/1710.10345)'s result that training a linear model on linearly separable data with gradient descent converges in direction to the **max-margin classifier (SVM)** (a small numerical illustration follows the proof below)
- Increasing the number of iterations $t$ can help stabilize the training procedure
- Increasing the sample size $n$ can help stabilize the model

:::spoiler Proof of Theorem 4
- To analyze the SVM classifier, they use $w=[v^T,b]^T$ to denote all the parameters of the linear classifier
- They denote $\hat{w}_S=[\hat{v}_S^T,\hat{b}_S]^T$ as the SVM solution trained on dataset $S$ and $\hat{w}_{S^i}=[\hat{v}_{S^i}^T,\hat{b}_{S^i}]^T$ as the SVM solution trained on dataset $S^i$
- First, in Lemma A.2.1, they study the Lipschitz constant of the normalization function $f(x)=x/\|x\|$
![](https://hackmd.io/_uploads/BJezLW843.png)
![](https://hackmd.io/_uploads/Hk1mIZLE3.png)
- In Lemma A.2.2, they study some properties of the function $f(v,b)=\max(0,1-y(v^Tx-b))$. They show it is a **convex** function and bound $|f(v_1,b_1)-f(v_2,b_2)|$
![](https://hackmd.io/_uploads/rkQE8ZIEh.png)
![](https://hackmd.io/_uploads/Bk-SL-8Nn.png)
- In Lemma A.2.3, they study the bound for the intercept $\hat{b}$
![](https://hackmd.io/_uploads/ry1ILZUV3.png)
- In Lemma A.2.4, they compare the optimal solutions $\hat{v}_S$ and $\hat{v}_{S^i}$, the maximal margin classifier weights for datasets $S$ and $S^i$ respectively, and show that they are tightly related through the margins
![](https://hackmd.io/_uploads/By8c8bUE3.png)
- Finally, equipped with these lemmas, they give the proof of Theorem 4
![](https://hackmd.io/_uploads/rJlP_W8N3.png)
![](https://hackmd.io/_uploads/H1ZOdZ8En.png)
![](https://hackmd.io/_uploads/BkGYu-8E2.png)
![](https://hackmd.io/_uploads/B1Y9d-U4n.png)
:::
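To make the first bullet above concrete, here is a small self-contained sketch (NumPy plus SciPy and scikit-learn as assumed dependencies, not the paper's code) of the behaviour underlying Theorem 4: gradient descent on the logistic loss over separable data keeps increasing $\|w_t\|$, while the direction $w_t/\|w_t\|$ approaches the hard-margin SVM direction.

```python
import numpy as np
from scipy.special import expit          # numerically stable sigmoid
from sklearn.svm import SVC              # linear kernel + large C ~ hard-margin SVM

rng = np.random.default_rng(0)

# Two linearly separable Gaussian blobs with labels in {+1, -1}.
n = 200
X = np.vstack([rng.normal(+2.0, 0.5, size=(n // 2, 2)),
               rng.normal(-2.0, 0.5, size=(n // 2, 2))])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

# Plain gradient descent on the (exponentially tailed) logistic loss.
w, eta = np.zeros(2), 0.1
for _ in range(20_000):
    margins = y * (X @ w)
    grad = -(X * (y * expit(-margins))[:, None]).mean(axis=0)
    w -= eta * grad

# Reference max-margin direction from a (nearly) hard-margin SVM.
w_svm = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()

def unit(v):
    return v / np.linalg.norm(v)

print("||w_t||:", np.linalg.norm(w))                                   # keeps growing with t
print("direction gap to SVM:", np.linalg.norm(unit(w) - unit(w_svm)))  # shrinks slowly with t
```

The norm keeps growing while the direction gap shrinks only at the slow $O(\log\log t/\log t)$ rate, which is why the first term of the bound decays so slowly in $t$.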
They further derive a corollary of Theorem 4 to show how the Lipschitz constant controls stability in the head tuning setting. They simplify the distance between the parameters trained on the two datasets so that it incorporates the Lipschitz constant $L$.

**Corollary 5**. Given a linearly separable dataset $S$, suppose that the encoded features $E(x_i)$ are bounded as $\|E(x_i)\|\leq B$, $\forall i\in\{1,...,n\}$. Suppose further that the model parameter $w$ is optimized by gradient descent with $t$ iterations and learning rate $\eta < 2\beta^{-1}\sigma^{-1}_{max}(\tilde{X})$. Then, for some constants $C$, $\lambda$, $\nu$, the normalized leave-one-out model stability is upper bounded as
$$
\mathbb{E}_S\left[ \left\| \frac{A(S^i)}{\|A(S^i)\|}-\frac{A(S)}{\|A(S)\|} \right\| \right] \leq \frac{C\log\log t}{\log t} + \frac{\nu L}{\lambda n}
$$

- If the function $f$ has a smaller Lipschitz constant $L$, training the model is more stable

:::spoiler Proof of Corollary 5
![](https://hackmd.io/_uploads/SkAEFbI42.png)
:::

## Methods

### Max Margin Regularizer

- If the margin between the encoded representations and the separation plane is large, the model has better stability
- Calculating the margin exactly is computationally costly, and the margin itself is not differentiable
- They therefore propose to maximize the distance between the center points of the two classes

Assuming each category contains at least one sample, MMR can be written as
$$
R(S)=\frac{1}{1+\left\|\sum_{i=1}^n E(x_i)y_i \left(\frac{1+y_i}{\sum_{j=1}^n(1+y_j)}+\frac{1-y_i}{\sum_{j=1}^n(1-y_j)}\right)\right\|}
$$

- It calculates the center points of the two classes and then takes the distance between them
- A constant 1 is added to the denominator for numerical stability

The overall training loss then becomes
$$
\mathcal{L}_{\text{MMR}}(E,w)=\frac{1}{n}\sum_{i=1}^n\ell(w^T\tilde{x}_iy_i)+\alpha R(S)
$$
where $\alpha$ is the weight parameter for $R(S)$.
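A minimal PyTorch sketch of the regularizer above (an illustration, not the authors' implementation). For labels in $\{+1,-1\}$ the weighted sum inside the norm reduces to the difference between the two class centers, so $R(S)=1/(1+\|c_+-c_-\|)$; minimizing $R(S)$ therefore pushes the centers apart.

```python
import torch

def mmr_penalty(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """R(S) = 1 / (1 + ||c_+ - c_-||), where c_+ and c_- are the centers of the
    encoded features of the two classes; labels are assumed to be +1 / -1 and
    each class must appear at least once in the batch."""
    pos_center = features[labels > 0].mean(dim=0)
    neg_center = features[labels < 0].mean(dim=0)
    return 1.0 / (1.0 + torch.norm(pos_center - neg_center))

# Hypothetical usage inside a training step (alpha and task_loss are placeholders):
# loss = task_loss(w, encoder_output, y) + alpha * mmr_penalty(encoder_output, y)
```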
### Multi-Head Loss

- The convergence rate $O\left( \frac{\log\log t}{\log t} \right)$ of the first term in Theorem 4 is quite slow as $t$ grows
    - The effect of raising $t$ to lower the bound gradually wears off, especially when $t$ is already very large
- Instead of using one linear head to calculate the loss, they propose to use $H$ linear heads of the same shape simultaneously and take the average of their outputs

In the training stage, the $h$-th head $(h\in\{ 1,...,H \})$ with parameter $w_h$ is trained separately by minimizing the loss $\ell(w_h^T\tilde{x}_iy_i)$. The overall loss is
$$
\mathcal{L}_{\text{MH}}(E,w_1,...,w_H)=\frac{1}{nH}\sum_{h=1}^{H}\sum_{i=1}^n \ell(w^T_h \tilde{x}_iy_i)
$$
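A minimal PyTorch sketch of this multi-head scheme (an illustration, not the authors' implementation), taking $\ell(z)=\log(1+e^{-z})$ (the logistic loss) and labels in $\{+1,-1\}$; each head is a separate linear layer, the per-head losses are averaged as in $\mathcal{L}_{\text{MH}}$, and prediction averages the head outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadLoss(nn.Module):
    """H independent linear heads on top of the encoded features."""

    def __init__(self, hidden_dim: int, num_heads: int = 10):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(num_heads)])

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # L_MH = (1 / nH) * sum_h sum_i l(w_h^T x_i * y_i), with l(z) = softplus(-z).
        losses = [F.softplus(-head(features).squeeze(-1) * labels).mean()
                  for head in self.heads]
        return torch.stack(losses).mean()

    @torch.no_grad()
    def predict(self, features: torch.Tensor) -> torch.Tensor:
        # Combine the heads by averaging their outputs, then take the sign.
        logits = torch.stack([head(features).squeeze(-1) for head in self.heads])
        return torch.sign(logits.mean(dim=0))

# Hypothetical usage on top of an encoder output of width 768:
# mh = MultiHeadLoss(hidden_dim=768, num_heads=10)
# loss = mh(encoder_output, y)   # y is a float tensor with values in {+1., -1.}
```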
To theoretically support the benefit of multiple heads, they establish Corollary 6, which is based on Theorem 4.

**Corollary 6** (Stability Bound for Multi-Head Loss). Consider a multi-head loss with $H$ heads, where $H>2+8\ln\frac{1}{\delta}$, $\delta\in(0,1)$, and $\overline{w}^T \tilde{w}_S \not=0$. Under the same assumptions as in Theorem 4, for some constants $C$, $\xi$, $\nu$, with probability $1-\delta$ we have
$$
\mathbb{E}_S\left[ \left\| \frac{A(S^i)}{\|A(S^i)\|}-\frac{A(S)}{\|A(S)\|} \right\| \right] \leq \sqrt{\frac{2+8\log\frac{1}{\delta}}{H}}\frac{C\xi\log\log t}{\log t} + \nu \max \left\{ \sqrt{\frac{2}{\lambda n}\left( 1+\frac{B}{\gamma_S} \right)}, \frac{B+\sqrt{B^2+8n\lambda(1+B/\gamma_S)}}{2n\lambda} \right\}
$$

- As $H$ increases, the first term decreases at the rate of $\mathcal{O}(\frac{1}{\sqrt{H}})$, which is better than simply using one head

:::spoiler Proof of Corollary 6
![](https://hackmd.io/_uploads/Bym9YWLV2.png)
![](https://hackmd.io/_uploads/ByGotb84n.png)
![](https://hackmd.io/_uploads/H1khK-UE3.png)
![](https://hackmd.io/_uploads/HytaFZUV3.png)
![](https://hackmd.io/_uploads/Bk1ZqW843.png)
![](https://hackmd.io/_uploads/BJyG5bU42.png)
![](https://hackmd.io/_uploads/SyKG5-L4h.png)
:::

### Self Unsupervised Re-Training

- Reducing the distance $\|w_0-w_*\|$ between the initialization weight $w_0$ and the solution weight $w_*$ reduces the stability upper bound
- This observation inspires them to start fine-tuning from a pre-trained model that is already very close to the final model
- They propose the novel Self Unsupervised Re-Training (SURT) method, which first re-trains the given pre-trained model on the same training corpus as the one used in the fine-tuning task
    - The model is re-trained with the unsupervised masked language model objective, which only needs the given training corpus without any annotated labels
    - The model is then fine-tuned from the re-trained checkpoint with the given labeled data

## Experiments

- To check stability, they run each experiment 10 times with different random seeds and report the mean scores and the standard deviation
- They compare their model with several widely used baseline models that focus on fine-tuning stability
    - The **FineTune** model is the standard full fine-tuning baseline that directly fine-tunes all the parameters of the backbone model together with the task-specific head on each task
    - The **MixOut** model is another widely used fine-tuning method that mixes the trained model parameters and the original model parameters according to a specific ratio
    - The **LNSR** model uses a regularizer to diminish the Lipschitz constant of the prediction function in order to improve stability
- The RoBERTa-base model is used as the backbone model

### Experiment Results for GLUE/SuperGLUE

#### Main Experiment

![](https://hackmd.io/_uploads/HJxb1FHE2.png)

- MHLoss helps stabilize the fine-tuning procedure: its output is more stable than that of the vanilla FineTune model
- MMR strengthens the stability on many tasks, which verifies the prediction that increasing the margin of the encoded features helps improve stability
- The SURT model also outperforms the FineTune model by a notable margin in terms of standard deviation
- The LNSR model's results show that reducing the Lipschitz constant also helps improve stability

#### Impact of Head Number, Training Epochs, and Learning Rate

- As the head number $H$ increases, the standard deviation decreases
- As the number of training epochs increases, the stability of nearly all tasks correspondingly improves
- As the learning rate $\eta$ decreases, the model achieves better stability

![](https://hackmd.io/_uploads/rk2MgFrN3.png)

#### Impact of Sample Count

- As the number of training samples grows, the models become more and more stable

![](https://hackmd.io/_uploads/HJ9zbYSNh.png =50%x)

#### Head Tuning Stability Analysis

- In this setting, the pre-trained encoders are fixed and only the head layer's parameters are fine-tuned

![](https://hackmd.io/_uploads/HJ1RZKHN3.png)

#### Data Perturbation Stability

- The models are trained on several datasets with 10% of their training samples randomly removed
- MHLoss, MMR, and SURT help stabilize the training procedure under perturbation of the training data

![](https://hackmd.io/_uploads/B17YfFrV2.png)

#### Large Pre-Trained Model Stability

- They run several tasks with the RoBERTa-large model
- The scores for most of the experiments improve when using the much larger backbone model
- All stabilizing methods reduce the variance of the FineTune model

![](https://hackmd.io/_uploads/S160QtrV2.png)

## Conclusion

- In this paper, they propose a novel theoretical analysis of the stability of fine-tuning a pre-trained model
    - Defining theoretical stability bounds in two commonly used settings
    - Giving a theoretical analysis that provides the basis for four existing and widely used methods proposed by previous works
- Based on their theory, they propose the Max Margin Regularizer (MMR), Multi-Head Loss (MHLoss), and Self Unsupervised Re-Training (SURT) methods to stabilize fine-tuning
- The experiment results show the effectiveness of their proposed methods and hence validate their theory as well