Outline
- Introduction
- Preliminaries
- Bridging the Gap – a Unified View
- Experiment
- Conclusion
Introduction
- Recent work has proposed a variety of parameter-efficient transfer learning methods that only fine-tune a small number of extra parameters to attain strong performance.
- While effective, the critical ingredients for success and the connections among the various methods are poorly understood.
- How are these methods connected?
- Do these methods share design elements that are essential for their effectiveness?
- Can the effective ingredients of each method be transferred to others to yield more effective variants?
- We first derive an alternative form of prefix tuning that reveals prefix tuning’s close connections with adapters.
- We then devise a unified framework that frames the aforementioned methods as different ways to modify the hidden representations of frozen PLMs.
- Our unified framework decomposes previous methods along a shared set of design dimensions:
- The function used to perform the modification.
- The position in which to impose this modification.
- How to integrate the modification.
- Our NLP tasks cover XSum (summarization), WMT en-ro (machine translation), MNLI (natural language inference), and SST2 (sentiment classification).
Preliminaries
- Transformer models are composed of \(L\) stacked blocks, where each block contains multi-head self-attention and a fully connected feed-forward network (FFN).
\[
\operatorname{Attn}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\operatorname{softmax}\left(\frac{\boldsymbol{Q} \boldsymbol{K}^T}{\sqrt{d_k}}\right) \boldsymbol{V}
\]
- Given a sequence of \(m\) vectors \(\boldsymbol{C} \in \mathbb{R}^{m \times d}\) and a query vector \(\boldsymbol{x} \in \mathbb{R}^d\)
\[
\operatorname{MHA}(\boldsymbol{C}, \boldsymbol{x})=\operatorname{Concat}\left(\text { head }_1, \cdots, \text { head }_{\mathrm{h}}\right) \boldsymbol{W}_o, \text { head }_{\mathrm{i}}=\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q^{(i)}, \boldsymbol{C} \boldsymbol{W}_k^{(i)}, \boldsymbol{C} \boldsymbol{W}_v^{(i)}\right)
\]
\[
\operatorname{FFN}(\boldsymbol{x})=\operatorname{ReLU}\left(\boldsymbol{x} \boldsymbol{W}_1+\boldsymbol{b}_1\right) \boldsymbol{W}_2+\boldsymbol{b}_2
\]
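- To make these definitions concrete, here is a minimal PyTorch sketch of one attention head and the FFN as written above; the multi-head concatenation and output projection \(W_o\) are omitted, and class names and the single-query form are illustrative.
```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, for a single head."""
    def __init__(self, d: int, d_k: int):
        super().__init__()
        self.W_q = nn.Linear(d, d_k, bias=False)
        self.W_k = nn.Linear(d, d_k, bias=False)
        self.W_v = nn.Linear(d, d_k, bias=False)
        self.d_k = d_k

    def forward(self, x: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
        # x: (d,) query vector; C: (m, d) context vectors
        q, K, V = self.W_q(x), self.W_k(C), self.W_v(C)
        scores = q @ K.T / self.d_k ** 0.5        # (m,)
        return torch.softmax(scores, dim=-1) @ V  # (d_k,)

class FFN(nn.Module):
    """FFN(x) = ReLU(x W_1 + b_1) W_2 + b_2, with d_m = 4d in Transformers."""
    def __init__(self, d: int, d_m: int):
        super().__init__()
        self.W_1 = nn.Linear(d, d_m)
        self.W_2 = nn.Linear(d_m, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W_2(torch.relu(self.W_1(x)))
```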
Overview of previous parameter-efficient tuning methods
- Adapters:
\[
\boldsymbol{h} \leftarrow \boldsymbol{h}+f\left(\boldsymbol{h} \boldsymbol{W}_{\text {down }}\right) \boldsymbol{W}_{\text {up }}
\]
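- A minimal sketch of the (sequential) adapter update above, assuming a ReLU nonlinearity for \(f\) and an illustrative bottleneck size \(r\); biases are omitted to match the formula.
```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """h <- h + f(h W_down) W_up  (sequential adapter with residual connection)."""
    def __init__(self, d: int, r: int):
        super().__init__()
        self.W_down = nn.Linear(d, r, bias=False)  # down-projection to bottleneck r
        self.W_up = nn.Linear(r, d, bias=False)    # up-projection back to d
        self.f = nn.ReLU()                         # nonlinearity

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.W_up(self.f(self.W_down(h)))
```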
- Prefix Tuning:
\[
\operatorname{head}_i=\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q^{(i)}, \operatorname{concat}\left(\boldsymbol{P}_k^{(i)}, \boldsymbol{C} \boldsymbol{W}_k^{(i)}\right), \operatorname{concat}\left(\boldsymbol{P}_v^{(i)}, \boldsymbol{C} \boldsymbol{W}_v^{(i)}\right)\right)
\]
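- A minimal sketch of prefix tuning for a single head, following the formula above; the prefix length \(l\), random initialization, and class name are illustrative.
```python
import torch
import torch.nn as nn

class PrefixAttentionHead(nn.Module):
    """One attention head with l trainable prefix vectors prepended to keys and values."""
    def __init__(self, d: int, d_k: int, l: int):
        super().__init__()
        self.W_q = nn.Linear(d, d_k, bias=False)
        self.W_k = nn.Linear(d, d_k, bias=False)
        self.W_v = nn.Linear(d, d_k, bias=False)
        for lin in (self.W_q, self.W_k, self.W_v):
            lin.weight.requires_grad_(False)      # pretrained projections stay frozen
        self.P_k = nn.Parameter(torch.randn(l, d_k))  # trainable prefix keys
        self.P_v = nn.Parameter(torch.randn(l, d_k))  # trainable prefix values
        self.d_k = d_k

    def forward(self, x: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
        q = self.W_q(x)                                   # (d_k,)
        K = torch.cat([self.P_k, self.W_k(C)], dim=0)     # (l + m, d_k)
        V = torch.cat([self.P_v, self.W_v(C)], dim=0)     # (l + m, d_k)
        scores = q @ K.T / self.d_k ** 0.5
        return torch.softmax(scores, dim=-1) @ V
```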
- LoRA:
\[
\boldsymbol{h} \leftarrow \boldsymbol{h}+s \cdot \boldsymbol{x} \boldsymbol{W}_{\text {down }} \boldsymbol{W}_{\text {up }}
\]
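- A minimal sketch of a LoRA-augmented linear layer per the formula above; the zero initialization of \(W_{up}\) and the fixed scalar \(s\) follow common practice but are illustrative here.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = x W + s * x W_down W_up, with W frozen and only the low-rank pair trained."""
    def __init__(self, d_in: int, d_out: int, r: int, s: float = 1.0):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)           # stands in for the frozen pretrained weight
        self.W_down = nn.Linear(d_in, r, bias=False)  # trainable rank-r down-projection
        self.W_up = nn.Linear(r, d_out, bias=False)   # trainable rank-r up-projection
        nn.init.zeros_(self.W_up.weight)              # start from the pretrained behavior
        self.s = s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W(x) + self.s * self.W_up(self.W_down(x))
```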
- Others:
- BitFit only fine-tunes the bias vectors in the pre-trained model (a minimal sketch follows this list).
- Diff-pruning learns a sparse parameter update vector.
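- A rough illustration of BitFit, assuming a standard PyTorch model whose bias parameters contain "bias" in their names (true for `nn.Linear` and `nn.LayerNorm` modules):
```python
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> None:
    """Freeze everything except bias vectors, as BitFit does."""
    for name, param in model.named_parameters():
        param.requires_grad = "bias" in name
```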
Bridging the Gap – a Unified View
A closer look at prefix tuning
\[
\begin{aligned}
& \text { head }=\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q, \operatorname{concat}\left(\boldsymbol{P}_k, \boldsymbol{C} \boldsymbol{W}_k\right), \operatorname{concat}\left(\boldsymbol{P}_v, \boldsymbol{C} \boldsymbol{W}_v\right)\right) \\
& =\operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \operatorname{concat}\left(\boldsymbol{P}_k, \boldsymbol{C} \boldsymbol{W}_k\right)^{\top}\right)\left[\begin{array}{c}
\boldsymbol{P}_v \\
\boldsymbol{C} \boldsymbol{W}_v
\end{array}\right] \\
& =(1-\lambda(\boldsymbol{x})) \operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{W}_k^{\top} \boldsymbol{C}^{\top}\right) \boldsymbol{C} \boldsymbol{W}_v+\lambda(\boldsymbol{x}) \operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right) \boldsymbol{P}_v \\
& =(1-\lambda(\boldsymbol{x})) \underbrace{\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q, \boldsymbol{C} \boldsymbol{W}_k, \boldsymbol{C} \boldsymbol{W}_v\right)}_{\text {standard attention }}+\lambda(\boldsymbol{x}) \underbrace{\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q, \boldsymbol{P}_k, \boldsymbol{P}_v\right)}_{\text {independent of } \boldsymbol{C}},
\end{aligned}
\]
- \(\lambda(\boldsymbol{x})\) is a scalar that represents the sum of normalized attention weights on the prefixes:
\[
\lambda(\boldsymbol{x})=\frac{\sum_i \exp \left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right)_i}{\sum_i \exp \left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right)_i+\sum_j \exp \left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{W}_k^{\top} \boldsymbol{C}^{\top}\right)_j}
\]
- The equation gives an alternative view of prefix tuning that essentially applies a position-wise modification to the original head attention output \(\boldsymbol{h}\) through linear interpolation:
\[
\boldsymbol{h} \leftarrow(1-\lambda(\boldsymbol{x})) \boldsymbol{h}+\lambda(\boldsymbol{x}) \Delta \boldsymbol{h}, \quad \Delta \boldsymbol{h}:=\operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right) \boldsymbol{P}_v
\]
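The decomposition above can be checked numerically; the sketch below compares attention over the concatenated keys/values against the interpolated form, using random tensors with illustrative shapes (the \(1/\sqrt{d_k}\) scaling is omitted, as in the derivation, since it does not affect the equivalence).
```python
import torch

d_k, m, l = 16, 10, 4
torch.manual_seed(0)
q = torch.randn(d_k)                                 # x W_q, already projected
K, V = torch.randn(m, d_k), torch.randn(m, d_k)      # C W_k, C W_v
P_k, P_v = torch.randn(l, d_k), torch.randn(l, d_k)  # prefix keys/values

def attn(q, K, V):
    return torch.softmax(q @ K.T, dim=-1) @ V        # scaling omitted for brevity

# Prefix tuning: attention over the concatenated keys and values.
lhs = attn(q, torch.cat([P_k, K]), torch.cat([P_v, V]))

# Gated interpolation of standard attention and prefix-only attention.
lam = torch.exp(q @ P_k.T).sum() / (torch.exp(q @ P_k.T).sum() + torch.exp(q @ K.T).sum())
rhs = (1 - lam) * attn(q, K, V) + lam * attn(q, P_k, P_v)

assert torch.allclose(lhs, rhs, atol=1e-5)
```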
- The Connection with Adapters: defining \(\boldsymbol{W}_1:=\boldsymbol{W}_q \boldsymbol{P}_k^{\top}\), \(\boldsymbol{W}_2:=\boldsymbol{P}_v\), and \(f:=\operatorname{softmax}\), prefix tuning reaches a form strikingly similar to the adapter update:
\[
\boldsymbol{h} \leftarrow(1-\lambda(\boldsymbol{x})) \boldsymbol{h}+\lambda(\boldsymbol{x}) f\left(\boldsymbol{x} \boldsymbol{W}_1\right) \boldsymbol{W}_2
\]
- The Differences from Adapters:
- Prefix tuning uses \(\boldsymbol{x}\), the input of the PLM layer, to compute \(\Delta \boldsymbol{h}\), while adapters use \(\boldsymbol{h}\), the output of the PLM layer.
- Adapters are more flexible about the position being modified: they typically modify attention or FFN outputs, whereas prefix tuning only modifies the attention output of each head.
The unified framework
- All of the above methods can be cast as learning a modification \(\Delta \boldsymbol{h}\) that is applied to a hidden representation \(\boldsymbol{h}\) of the frozen PLM; they differ in the function used to compute \(\Delta \boldsymbol{h}\), the representation that is modified (per-head attention output, attention output, or FFN output), and the composition used to integrate \(\Delta \boldsymbol{h}\) (simple additive, gated additive, or scaled additive).
Transferring design elements
- Parallel Adapter is the variant obtained by transferring the parallel insertion of prefix tuning into adapters.
- Multi-head Parallel Adapter applies parallel adapters to modify the attention output of each head, as prefix tuning does.
- Scaled Parallel Adapter is the variant obtained by transferring the composition and insertion form of LoRA into adapters (a minimal sketch follows this list).
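- A minimal sketch of the scaled parallel adapter at the FFN sub-layer, combining parallel insertion (it reads the sub-layer input rather than its output) with LoRA-style scaling; the bottleneck \(r\), scalar \(s\), and wiring around a frozen FFN are illustrative.
```python
import torch
import torch.nn as nn

class ScaledParallelAdapter(nn.Module):
    """Delta_h = s * f(x W_down) W_up, computed from the sub-layer input x."""
    def __init__(self, d: int, r: int, s: float = 4.0):
        super().__init__()
        self.W_down = nn.Linear(d, r, bias=False)
        self.W_up = nn.Linear(r, d, bias=False)
        self.f = nn.ReLU()
        self.s = s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.s * self.W_up(self.f(self.W_down(x)))

def ffn_sublayer(x, frozen_ffn, adapter):
    # Parallel insertion: the adapter branch runs alongside the frozen FFN and is added to its output.
    return frozen_ffn(x) + adapter(x)
```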
Experiment
- We use \(\mathrm{BART_{LARGE}}\) and its multilingual version, \(\mathrm{mBART_{LARGE}}\), as the underlying pretrained models for XSum and en-ro translation, respectively.
- We use \(\mathrm{RoBERTa_{BASE}}\) for MNLI and SST2.
- The results of existing methods: they are competitive with full fine-tuning on MNLI and SST2, but a clear gap remains on the harder XSum and en-ro translation tasks.
- Which Insertion Form – Sequential or Parallel?
- The parallel insertion form outperforms its sequential counterpart, for adapters placed at both the attention and FFN sub-layers.
- Which Modified Representation – Attention or FFN?
- Modifying the FFN representation is more effective than modifying attention at comparable parameter budgets.
- Note that FFN modification is also more parameter-hungry per rank: \(\boldsymbol{W}_{up}\) in LoRA applied to \(\boldsymbol{W}_1\) (and similarly \(\boldsymbol{W}_{down}\) for \(\boldsymbol{W}_2\)) has dimensions \(r \times d_m\), where \(d_m = 4d\) (a worked count follows below).
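- As a worked parameter count (our own arithmetic under the dimensions noted above, not a figure from the paper), applying rank-\(r\) LoRA to both FFN matrices of one layer costs
\[
\underbrace{(d \cdot r+r \cdot 4 d)}_{\text {for } \boldsymbol{W}_1}+\underbrace{(4 d \cdot r+r \cdot d)}_{\text {for } \boldsymbol{W}_2}=10\, r d
\]
parameters, versus \(2(d \cdot r+r \cdot d)=4\, r d\) for applying the same rank to the attention query and value projections, so FFN modification consumes a larger share of the parameter budget at a given rank.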
- The scaling composition function is better than the vanilla additive one.
- An effective integration by transferring favorable design elements:
- Scaled parallel adapter is the best variant to modify FFN.
- FFN can better utilize modification at larger capacities.
- Modifying head attention as prefix tuning does can achieve strong performance with only 0.1% of the parameters.
- Mix-And-Match (MAM) Adapter: we use prefix tuning with a small bottleneck dimension (\(l = 30\)) at the attention sub-layers and allocate more of the parameter budget to modifying the FFN representation with the scaled parallel adapter (\(r = 512\)); a rough sketch follows.
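- Below is a rough, self-contained sketch of how this mix-and-match recipe could be wired into one pair of sub-layers (single attention head, no layer norms, residuals, or output projection); the module names, random initialization, and scaling factor \(s\) are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn

class MAMSublayers(nn.Module):
    """Sketch: prefix tuning (l=30) at attention + scaled parallel adapter (r=512) at the FFN."""
    def __init__(self, d: int, d_m: int, l: int = 30, r: int = 512, s: float = 4.0):
        super().__init__()
        # Frozen pretrained pieces (single-head attention projections and FFN).
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d, d_m), nn.ReLU(), nn.Linear(d_m, d))
        for module in (self.W_q, self.W_k, self.W_v, self.ffn):
            for p in module.parameters():
                p.requires_grad = False
        # Trainable parameter-efficient parts.
        self.P_k = nn.Parameter(torch.randn(l, d))   # prefix keys
        self.P_v = nn.Parameter(torch.randn(l, d))   # prefix values
        self.adapter_down = nn.Linear(d, r, bias=False)
        self.adapter_up = nn.Linear(r, d, bias=False)
        self.s = s

    def forward(self, x: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
        # Attention sub-layer with trainable prefixes prepended to keys/values.
        q = self.W_q(x)
        K = torch.cat([self.P_k, self.W_k(C)], dim=0)
        V = torch.cat([self.P_v, self.W_v(C)], dim=0)
        h = torch.softmax(q @ K.T / x.shape[-1] ** 0.5, dim=-1) @ V
        # FFN sub-layer with a scaled parallel adapter reading the sub-layer input h.
        return self.ffn(h) + self.s * self.adapter_up(torch.relu(self.adapter_down(h)))
```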

Conclusion
- We provide a unified framework for several performant parameter-efficient tuning methods, which enables us to instantiate a more effective variant that matches the performance of full fine-tuning by transferring favorable design elements across approaches.
Reference
- He et al. Towards a Unified View of Parameter-Efficient Transfer Learning. ICLR 2022.