## Outline

- Introduction
- Preliminaries
- Bridging the Gap – a Unified View
- Experiment
- Conclusion

## Introduction

- Recent work has proposed a variety of parameter-efficient transfer learning methods that fine-tune only a small number of extra parameters while attaining strong performance.
- While effective, the critical ingredients for success and the connections among the various methods are poorly understood:
  - How are these methods connected?
  - Do these methods share design elements that are essential for their effectiveness?
  - Can the effective ingredients of each method be transferred to others to yield more effective variants?
- We first derive an alternative form of **prefix tuning** that reveals its close connection to adapters.
- We then devise a unified framework that frames the aforementioned methods as different ways to **modify the hidden representations** of frozen PLMs.
- Our unified framework decomposes previous methods along a *shared* set of design **dimensions**:
  - The function used to compute the modification.
  - The position at which to impose the modification.
  - How to integrate the modification.
- Our NLP tasks cover XSum (summarization), WMT En-Ro (translation), MNLI (NLU), and SST-2 (classification).

## Preliminaries

### Recap of the Transformer architecture

- Transformer models are composed of $L$ stacked blocks, where each block contains **multi-head self-attention** and **a fully connected feed-forward network (FFN)**.

$$
\operatorname{Attn}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\operatorname{softmax}\left(\frac{\boldsymbol{Q} \boldsymbol{K}^{\top}}{\sqrt{d_k}}\right) \boldsymbol{V}
$$

- Given a sequence of $m$ vectors $\boldsymbol{C} \in \mathbb{R}^{m \times d}$ and a query vector $\boldsymbol{x} \in \mathbb{R}^d$:

$$
\operatorname{MHA}(\boldsymbol{C}, \boldsymbol{x})=\operatorname{Concat}\left(\operatorname{head}_1, \cdots, \operatorname{head}_h\right) \boldsymbol{W}_o, \quad \operatorname{head}_i=\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q^{(i)}, \boldsymbol{C} \boldsymbol{W}_k^{(i)}, \boldsymbol{C} \boldsymbol{W}_v^{(i)}\right)
$$

$$
\operatorname{FFN}(\boldsymbol{x})=\operatorname{ReLU}\left(\boldsymbol{x} \boldsymbol{W}_1+\boldsymbol{b}_1\right) \boldsymbol{W}_2+\boldsymbol{b}_2
$$

### Overview of previous parameter-efficient tuning methods

A minimal code sketch of these update rules follows this overview.

- Adapters insert a small bottleneck module and add its output back to the hidden state:

$$
\boldsymbol{h} \leftarrow \boldsymbol{h}+f\left(\boldsymbol{h} \boldsymbol{W}_{\text {down }}\right) \boldsymbol{W}_{\text {up }}
$$

- Prefix tuning prepends $l$ tunable prefix vectors to the keys and values of every attention head:

$$
\operatorname{head}_i=\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q^{(i)}, \operatorname{concat}\left(\boldsymbol{P}_k^{(i)}, \boldsymbol{C} \boldsymbol{W}_k^{(i)}\right), \operatorname{concat}\left(\boldsymbol{P}_v^{(i)}, \boldsymbol{C} \boldsymbol{W}_v^{(i)}\right)\right)
$$

- LoRA injects a trainable low-rank update into the attention weight matrices:

$$
\boldsymbol{h} \leftarrow \boldsymbol{h}+s \cdot \boldsymbol{x} \boldsymbol{W}_{\text {down }} \boldsymbol{W}_{\text {up }}
$$

- Others:
  - BitFit only fine-tunes the bias vectors of the pre-trained model.
  - Diff pruning learns a sparse parameter update vector.
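A minimal PyTorch-style sketch of the three update rules above, assuming a frozen pre-trained sub-layer; the module names, initializations, and the single-head `prefix_attention` helper are illustrative assumptions, not the original implementations.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Sequential adapter: h <- h + f(h W_down) W_up, with f = ReLU here."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

class LoRALinear(nn.Module):
    """LoRA: h <- x W0 + s * x W_down W_up, with the pre-trained W0 frozen."""
    def __init__(self, w0: nn.Linear, rank: int, scale: float):
        super().__init__()
        self.w0 = w0
        for p in self.w0.parameters():
            p.requires_grad = False
        self.down = nn.Parameter(torch.randn(w0.in_features, rank) * 0.02)
        self.up = nn.Parameter(torch.zeros(rank, w0.out_features))
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w0(x) + self.scale * (x @ self.down @ self.up)

def prefix_attention(x, C, Wq, Wk, Wv, Pk, Pv):
    """Prefix tuning for one head: prepend trainable Pk, Pv to the keys/values.

    x: (n, d) queries' input, C: (m, d) context, Wq/Wk/Wv: (d, d_head), Pk/Pv: (l, d_head).
    """
    q = x @ Wq                             # (n, d_head)
    k = torch.cat([Pk, C @ Wk], dim=0)     # (l + m, d_head)
    v = torch.cat([Pv, C @ Wv], dim=0)     # (l + m, d_head)
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```

Only the adapter bottleneck, the LoRA factors, and the prefix vectors are trained; all pre-trained weights stay frozen.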
![](https://hackmd.io/_uploads/HkdOhYcK3.png)

## Bridging the Gap – a Unified View

### A closer look at prefix tuning

$$
\begin{aligned}
\operatorname{head} &=\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q, \operatorname{concat}\left(\boldsymbol{P}_k, \boldsymbol{C} \boldsymbol{W}_k\right), \operatorname{concat}\left(\boldsymbol{P}_v, \boldsymbol{C} \boldsymbol{W}_v\right)\right) \\
&=\operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \operatorname{concat}\left(\boldsymbol{P}_k, \boldsymbol{C} \boldsymbol{W}_k\right)^{\top}\right)\left[\begin{array}{c}
\boldsymbol{P}_v \\
\boldsymbol{C} \boldsymbol{W}_v
\end{array}\right] \\
&=(1-\lambda(\boldsymbol{x})) \operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{W}_k^{\top} \boldsymbol{C}^{\top}\right) \boldsymbol{C} \boldsymbol{W}_v+\lambda(\boldsymbol{x}) \operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right) \boldsymbol{P}_v \\
&=(1-\lambda(\boldsymbol{x})) \underbrace{\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q, \boldsymbol{C} \boldsymbol{W}_k, \boldsymbol{C} \boldsymbol{W}_v\right)}_{\text {standard attention }}+\lambda(\boldsymbol{x}) \underbrace{\operatorname{Attn}\left(\boldsymbol{x} \boldsymbol{W}_q, \boldsymbol{P}_k, \boldsymbol{P}_v\right)}_{\text {independent of } \boldsymbol{C}}
\end{aligned}
$$

- $\lambda(\boldsymbol{x})$ is a scalar that represents the sum of normalized attention weights on the prefixes:

$$
\lambda(\boldsymbol{x})=\frac{\sum_i \exp \left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right)_i}{\sum_i \exp \left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right)_i+\sum_j \exp \left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{W}_k^{\top} \boldsymbol{C}^{\top}\right)_j}
$$

- This gives an alternative view of prefix tuning: it applies a position-wise modification to the original head attention output $\boldsymbol{h}$ through linear interpolation:

$$
\boldsymbol{h} \leftarrow(1-\lambda(\boldsymbol{x})) \boldsymbol{h}+\lambda(\boldsymbol{x}) \Delta \boldsymbol{h}, \quad \Delta \boldsymbol{h}:=\operatorname{softmax}\left(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}\right) \boldsymbol{P}_v
$$

![](https://hackmd.io/_uploads/H1OZm59F2.png)

- The connection with adapters: defining $\boldsymbol{W}_1:=\boldsymbol{W}_q \boldsymbol{P}_k^{\top}$, $\boldsymbol{W}_2:=\boldsymbol{P}_v$, and $f:=\operatorname{softmax}$, prefix tuning takes an adapter-like form:

$$
\boldsymbol{h} \leftarrow(1-\lambda(\boldsymbol{x})) \boldsymbol{h}+\lambda(\boldsymbol{x}) f\left(\boldsymbol{x} \boldsymbol{W}_1\right) \boldsymbol{W}_2
$$

- The differences from adapters:
  - Prefix tuning uses $\boldsymbol{x}$, the input of the PLM sub-layer, to compute $\Delta \boldsymbol{h}$, while adapters use $\boldsymbol{h}$, the output of the PLM sub-layer.
  - Adapters are more flexible:
    - Adapters typically modify attention or FFN outputs.
    - Prefix tuning only modifies the attention output of each head.

### The unified framework

![](https://hackmd.io/_uploads/H1jpMq9Kn.png)

### Transferring design elements

The variants below transfer design elements across methods; a minimal sketch follows this list.

- **Parallel Adapter** is the variant obtained by transferring the parallel insertion of prefix tuning into adapters.
- **Multi-head Parallel Adapter** applies parallel adapters to modify the head attention outputs, as prefix tuning does.
- **Scaled Parallel Adapter** is the variant obtained by transferring the composition and insertion form of LoRA into adapters.
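Under the unified view, each method computes a modification $\Delta \boldsymbol{h}$ from either the sub-layer input $\boldsymbol{x}$ or output $\boldsymbol{h}$ and composes it with $\boldsymbol{h}$. Below is a minimal PyTorch-style sketch of these transferred variants around a frozen FFN sub-layer; the names `ScaledParallelAdapter`, `prefix_gate`, `gated_composition` and the bottleneck/scale values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ScaledParallelAdapter(nn.Module):
    """Delta_h = f(x W_down) W_up computed from the sub-layer *input* x (parallel insertion,
    as in prefix tuning and LoRA), composed by scaling as in LoRA: h <- h + s * Delta_h.
    With s = 1 this is the plain parallel adapter; feeding h instead of x and inserting
    after the sub-layer recovers the sequential adapter."""
    def __init__(self, d_model: int, bottleneck: int, scale: float = 1.0):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.scale = scale

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return h + self.scale * self.up(torch.relu(self.down(x)))

def prefix_gate(x, Wq, Wk, Pk, C):
    """lambda(x): the total normalized attention mass assigned to the prefix positions."""
    prefix_mass = (x @ Wq @ Pk.T).exp().sum(-1)         # sum_i exp(x Wq Pk^T)_i
    context_mass = (x @ Wq @ (C @ Wk).T).exp().sum(-1)  # sum_j exp(x Wq Wk^T C^T)_j
    return prefix_mass / (prefix_mass + context_mass)

def gated_composition(h, delta_h, lam):
    """Prefix-tuning-style composition: h <- (1 - lambda(x)) * h + lambda(x) * Delta_h."""
    return (1 - lam.unsqueeze(-1)) * h + lam.unsqueeze(-1) * delta_h

# Usage around one frozen FFN sub-layer (shapes and values are illustrative)
d_model = 1024
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
for p in ffn.parameters():
    p.requires_grad = False            # the pre-trained sub-layer stays frozen

adapter = ScaledParallelAdapter(d_model, bottleneck=512, scale=4.0)  # only these weights train
x = torch.randn(16, d_model)           # sub-layer input
h = ffn(x)                             # frozen sub-layer output
out = adapter(x, h)                    # scaled parallel adapter modifying the FFN representation
```

In this decomposition, prefix tuning corresponds to $\Delta \boldsymbol{h}=\operatorname{softmax}(\boldsymbol{x} \boldsymbol{W}_q \boldsymbol{P}_k^{\top}) \boldsymbol{P}_v$ combined via `gated_composition`, while LoRA corresponds to a linear $\Delta \boldsymbol{h}$ combined by simple scaling.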
## Experiment

- We use $\mathrm{BART_{LARGE}}$ and its multilingual version, $\mathrm{mBART_{LARGE}}$, as the underlying pre-trained models for XSum summarization and WMT En-Ro translation, respectively.
- We use $\mathrm{RoBERTa_{BASE}}$ for MNLI and SST-2.
- Results of existing methods:

![](https://hackmd.io/_uploads/Bya-WHGch.png)

- Which Insertion Form – Sequential or Parallel?

![](https://hackmd.io/_uploads/H1HZoyhYh.png)

- Which Modified Representation – Attention or FFN?
  - When LoRA is applied to the FFN, its $\boldsymbol{W}_{\text{up}}$ for $\boldsymbol{W}_1$ (and similarly $\boldsymbol{W}_{\text{down}}$ for $\boldsymbol{W}_2$) has dimensions $r \times d_m$, where $d_m = 4d$ is the FFN inner dimension.

![](https://hackmd.io/_uploads/HJAnPrM5n.png)

- Which Composition Function?
  - The scaling composition function is better than the vanilla additive one.

![](https://hackmd.io/_uploads/rJ8_urfc3.png)

- An effective integration by transferring favorable design elements:
  1. The scaled parallel adapter is the best variant to modify the FFN.
  2. The FFN can better utilize the modification at larger capacities.
  3. Modifying head attention as prefix tuning does can achieve strong performance with only 0.1% of the parameters.
- We therefore use prefix tuning with a small bottleneck dimension ($l = 30$) at the attention sub-layers and allocate a larger parameter budget to modify the FFN representation with the scaled parallel adapter ($r = 512$); this combination is the paper's **MAM Adapter**.

![](https://hackmd.io/_uploads/r17aHYMcn.png)

## Conclusion

- We provide a unified framework for several performant parameter-efficient tuning methods, which enables us to instantiate a more effective variant that matches the performance of full fine-tuning by **transferring techniques** across approaches.

## Reference

[Towards a Unified View of Parameter-Efficient Transfer Learning](https://arxiv.org/pdf/2110.04366.pdf)