## Outline

- Introduction
- Preliminaries
- Bridging the Gap – a Unified View
- Experiment
- Conclusion

## Introduction

- Recent work has proposed a variety of parameter-efficient transfer learning methods that fine-tune only a small number of extra parameters while attaining strong performance.
- While effective, the critical ingredients for success and the connections among the various methods are poorly understood:
  - How are these methods connected?
  - Do these methods share design elements that are essential for their effectiveness?
  - Can the effective ingredients of each method be transferred to others to yield more effective variants?
- We first derive an alternative form of **prefix tuning** that reveals its close connection to adapters.
- We then devise a unified framework that frames the aforementioned methods as different ways to **modify the hidden representations** of frozen PLMs.
- Our unified framework decomposes previous methods along a *shared* set of design **dimensions**:
  - The function used to compute the modification.
  - The position at which to impose the modification.
  - How to integrate the modification.
- Our NLP tasks cover XSum (summarization), WMT En-Ro (translation), MNLI (NLU), and SST-2 (classification).

## Preliminaries

### Recap of the Transformer architecture

- Transformer models are composed of $L$ stacked blocks, where each block contains **multi-head self-attention** and **a fully connected feed-forward network (FFN)**.

$$
\operatorname{Attn}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\operatorname{softmax}\left(\frac{\boldsymbol{Q} \boldsymbol{K}^{\top}}{\sqrt{d_k}}\right) \boldsymbol{V}
$$

- Given a sequence of $m$ vectors $\boldsymbol{C} \in \mathbb{R}^{m \times d}$ and a query vector $\boldsymbol{x} \in \mathbb{R}^d$:

$$
\operatorname{MHA}(\boldsymbol{C}, \boldsymbol{x})=\operatorname{Concat}\left(\operatorname{head}_1, \cdots, \operatorname{head}_h\right) \boldsymbol{W}_o, \quad \operatorname{head}_i=\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q^{(i)}, \boldsymbol{C} \boldsymbol{W}_k^{(i)}, \boldsymbol{C} \boldsymbol{W}_v^{(i)}\right)
$$

$$
\operatorname{FFN}(\boldsymbol{x})=\operatorname{ReLU}\left(\boldsymbol{x} \boldsymbol{W}_1+\boldsymbol{b}_1\right) \boldsymbol{W}_2+\boldsymbol{b}_2
$$

### Overview of previous parameter-efficient tuning methods

A minimal code sketch of these update rules follows this overview.

- Adapters insert a small bottleneck module and add its output back to the hidden state:

$$
\boldsymbol{h} \leftarrow \boldsymbol{h}+f\left(\boldsymbol{h} \boldsymbol{W}_{\text {down }}\right) \boldsymbol{W}_{\text {up }}
$$

- Prefix tuning prepends $l$ tunable prefix vectors to the keys and values of every attention head:

$$
\operatorname{head}_i=\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q^{(i)}, \operatorname{concat}\left(\boldsymbol{P}_k^{(i)}, \boldsymbol{C} \boldsymbol{W}_k^{(i)}\right), \operatorname{concat}\left(\boldsymbol{P}_v^{(i)}, \boldsymbol{C} \boldsymbol{W}_v^{(i)}\right)\right)
$$

- LoRA injects a trainable low-rank update into the attention weight matrices:

$$
\boldsymbol{h} \leftarrow \boldsymbol{h}+s \cdot \boldsymbol{x} \boldsymbol{W}_{\text {down }} \boldsymbol{W}_{\text {up }}
$$

- Others:
  - BitFit only fine-tunes the bias vectors of the pre-trained model.
  - Diff pruning learns a sparse parameter update vector.
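A minimal PyTorch-style sketch of the three update rules above, assuming a frozen pre-trained sub-layer; the module names, initializations, and the single-head `prefix_attention` helper are illustrative assumptions, not the original implementations.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Sequential adapter: h <- h + f(h W_down) W_up, with f = ReLU here."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

class LoRALinear(nn.Module):
    """LoRA: h <- x W0 + s * x W_down W_up, with the pre-trained W0 frozen."""
    def __init__(self, w0: nn.Linear, rank: int, scale: float):
        super().__init__()
        self.w0 = w0
        for p in self.w0.parameters():
            p.requires_grad = False
        self.down = nn.Parameter(torch.randn(w0.in_features, rank) * 0.02)
        self.up = nn.Parameter(torch.zeros(rank, w0.out_features))
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w0(x) + self.scale * (x @ self.down @ self.up)

def prefix_attention(x, C, Wq, Wk, Wv, Pk, Pv):
    """Prefix tuning for one head: prepend trainable Pk, Pv to the keys/values.

    x: (n, d) queries' input, C: (m, d) context, Wq/Wk/Wv: (d, d_head), Pk/Pv: (l, d_head).
    """
    q = x @ Wq                             # (n, d_head)
    k = torch.cat([Pk, C @ Wk], dim=0)     # (l + m, d_head)
    v = torch.cat([Pv, C @ Wv], dim=0)     # (l + m, d_head)
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```

Only the adapter bottleneck, the LoRA factors, and the prefix vectors are trained; all pre-trained weights stay frozen.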
![](https://hackmd.io/_uploads/HkdOhYcK3.png)

## Bridging the Gap – a Unified View

### A closer look at prefix tuning

$$
\begin{aligned}
\operatorname{head} &=\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q, \operatorname{concat}\left(\boldsymbol{P}_k, \boldsymbol{C} \boldsymbol{W}_k\right), \operatorname{concat}\left(\boldsymbol{P}_v, \boldsymbol{C} \boldsymbol{W}_v\right)\right) \\
&=\operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \operatorname{concat}\left(\boldsymbol{P}_k, \boldsymbol{C} \boldsymbol{W}_k\right)^{\top}\right)\left[\begin{array}{c}
\boldsymbol{P}_v \\
\boldsymbol{C} \boldsymbol{W}_v
\end{array}\right] \\
&=(1-\lambda(\boldsymbol{x})) \operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{W}_k^{\top} \boldsymbol{C}^{\top}\right) \boldsymbol{C} \boldsymbol{W}_v+\lambda(\boldsymbol{x}) \operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right) \boldsymbol{P}_v \\
&=(1-\lambda(\boldsymbol{x})) \underbrace{\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q, \boldsymbol{C} \boldsymbol{W}_k, \boldsymbol{C} \boldsymbol{W}_v\right)}_{\text {standard attention }}+\lambda(\boldsymbol{x}) \underbrace{\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q, \boldsymbol{P}_k, \boldsymbol{P}_v\right)}_{\text {independent of } \boldsymbol{C}}
\end{aligned}
$$

- $\lambda(\boldsymbol{x})$ is a scalar that represents the sum of normalized attention weights on the prefixes:

$$
\lambda(\boldsymbol{x})=\frac{\sum_i \exp \left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right)_i}{\sum_i \exp \left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right)_i+\sum_j \exp \left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{W}_k^{\top} \boldsymbol{C}^{\top}\right)_j}
$$

- This gives an alternative view of prefix tuning: it applies a position-wise modification to the original head attention output $\boldsymbol{h}$ through linear interpolation:

$$
\boldsymbol{h} \leftarrow(1-\lambda(\boldsymbol{x})) \boldsymbol{h}+\lambda(\boldsymbol{x}) \Delta \boldsymbol{h}, \quad \Delta \boldsymbol{h}:=\operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right) \boldsymbol{P}_v
$$

![](https://hackmd.io/_uploads/H1OZm59F2.png)

- The connection with adapters: defining $\boldsymbol{W}_1:=\boldsymbol{W}_q \boldsymbol{P}_k^{\top}$, $\boldsymbol{W}_2:=\boldsymbol{P}_v$, and $f:=\operatorname{softmax}$, prefix tuning takes an adapter-like form:

$$
\boldsymbol{h} \leftarrow(1-\lambda(\boldsymbol{x})) \boldsymbol{h}+\lambda(\boldsymbol{x}) f\left(\boldsymbol{x} \boldsymbol{W}_1\right) \boldsymbol{W}_2
$$

- The differences from adapters:
  - Prefix tuning uses $\boldsymbol{x}$, the input of the PLM sub-layer, to compute $\Delta \boldsymbol{h}$, while adapters use $\boldsymbol{h}$, the output of the PLM sub-layer.
  - Adapters are more flexible:
    - Adapters typically modify attention or FFN outputs.
    - Prefix tuning only modifies the attention output of each head.

### The unified framework

![](https://hackmd.io/_uploads/H1jpMq9Kn.png)

### Transferring design elements

The variants below transfer design elements across methods; a minimal sketch follows this list.

- **Parallel Adapter** is the variant obtained by transferring the parallel insertion of prefix tuning into adapters.
- **Multi-head Parallel Adapter** applies parallel adapters to modify the head attention outputs, as prefix tuning does.
- **Scaled Parallel Adapter** is the variant obtained by transferring the composition and insertion form of LoRA into adapters.
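Under the unified view, each method computes a modification $\Delta \boldsymbol{h}$ from either the sub-layer input $\boldsymbol{x}$ or output $\boldsymbol{h}$ and composes it with $\boldsymbol{h}$. Below is a minimal PyTorch-style sketch of these transferred variants around a frozen FFN sub-layer; the names `ScaledParallelAdapter`, `prefix_gate`, `gated_composition` and the bottleneck/scale values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ScaledParallelAdapter(nn.Module):
    """Delta_h = f(x W_down) W_up computed from the sub-layer *input* x (parallel insertion,
    as in prefix tuning and LoRA), composed by scaling as in LoRA: h <- h + s * Delta_h.
    With s = 1 this is the plain parallel adapter; feeding h instead of x and inserting
    after the sub-layer recovers the sequential adapter."""
    def __init__(self, d_model: int, bottleneck: int, scale: float = 1.0):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.scale = scale

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return h + self.scale * self.up(torch.relu(self.down(x)))

def prefix_gate(x, Wq, Wk, Pk, C):
    """lambda(x): the total normalized attention mass assigned to the prefix positions."""
    prefix_mass = (x @ Wq @ Pk.T).exp().sum(-1)         # sum_i exp(x Wq Pk^T)_i
    context_mass = (x @ Wq @ (C @ Wk).T).exp().sum(-1)  # sum_j exp(x Wq Wk^T C^T)_j
    return prefix_mass / (prefix_mass + context_mass)

def gated_composition(h, delta_h, lam):
    """Prefix-tuning-style composition: h <- (1 - lambda(x)) * h + lambda(x) * Delta_h."""
    return (1 - lam.unsqueeze(-1)) * h + lam.unsqueeze(-1) * delta_h

# Usage around one frozen FFN sub-layer (shapes and values are illustrative)
d_model = 1024
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
for p in ffn.parameters():
    p.requires_grad = False            # the pre-trained sub-layer stays frozen

adapter = ScaledParallelAdapter(d_model, bottleneck=512, scale=4.0)  # only these weights train
x = torch.randn(16, d_model)           # sub-layer input
h = ffn(x)                             # frozen sub-layer output
out = adapter(x, h)                    # scaled parallel adapter modifying the FFN representation
```

In this decomposition, prefix tuning corresponds to $\Delta \boldsymbol{h}=\operatorname{softmax}(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}) \boldsymbol{P}_v$ combined via `gated_composition`, while LoRA corresponds to a linear $\Delta \boldsymbol{h}$ combined by simple scaling.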
## Experiment

- We use $\mathrm{BART_{LARGE}}$ and its multilingual version, $\mathrm{mBART_{LARGE}}$, as the underlying pre-trained models for XSum summarization and WMT En-Ro translation, respectively.
- We use $\mathrm{RoBERTa_{BASE}}$ for MNLI and SST-2.
- Results of existing methods:

![](https://hackmd.io/_uploads/Bya-WHGch.png)

- Which Insertion Form – Sequential or Parallel?

![](https://hackmd.io/_uploads/H1HZoyhYh.png)

- Which Modified Representation – Attention or FFN?
  - When LoRA is applied to the FFN, its $\boldsymbol{W}_{\text{up}}$ for $\boldsymbol{W}_1$ (and similarly $\boldsymbol{W}_{\text{down}}$ for $\boldsymbol{W}_2$) has dimensions $r \times d_m$, where $d_m = 4d$ is the FFN inner dimension.

![](https://hackmd.io/_uploads/HJAnPrM5n.png)

- Which Composition Function?
  - The scaling composition function is better than the vanilla additive one.

![](https://hackmd.io/_uploads/rJ8_urfc3.png)

- An effective integration by transferring favorable design elements:
  1. The scaled parallel adapter is the best variant to modify the FFN.
  2. The FFN can better utilize the modification at larger capacities.
  3. Modifying head attention as prefix tuning does can achieve strong performance with only 0.1% of the parameters.
- We therefore use prefix tuning with a small bottleneck dimension ($l = 30$) at the attention sub-layers and allocate a larger parameter budget to modify the FFN representation with the scaled parallel adapter ($r = 512$); this combination is the paper's **MAM Adapter**.

![](https://hackmd.io/_uploads/r17aHYMcn.png)

## Conclusion

- We provide a unified framework for several performant parameter-efficient tuning methods, which enables us to instantiate a more effective variant that matches the performance of full fine-tuning by **transferring techniques** across approaches.

## Reference

[Towards a Unified View of Parameter-Efficient Transfer Learning](https://arxiv.org/pdf/2110.04366.pdf)