# Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning

###### tags: `Research`, `prompt`, `multitask`, `transfer learning`

> [Paper link](https://openreview.net/pdf?id=Nk2pDtuhTq) | ICLR 2023 | NLP Group 2023/02/20

## Abstract

Existing methods typically learn soft prompt vectors from scratch, and it has not been clear how to exploit the rich cross-task knowledge in task-specific prompt vectors to improve performance on target downstream tasks. This paper proposes multitask prompt tuning (MPT), which first learns a **single transferable prompt by decomposing and distilling knowledge from multiple task-specific source prompts**, and then learns multiplicative low-rank updates to this shared prompt to efficiently adapt it to each downstream target task.

## Introduction

The conventional paradigm of full task-specific finetuning is difficult to scale to multiple tasks, given that contemporary PLMs can have hundreds of millions (or even billions) of parameters. There has thus been growing interest in developing *parameter-efficient* methods for model tuning.

**Prompt tuning (PT)**, which prepends continuous prompt vectors to the input, has emerged as a promising approach for parameter-efficient transfer learning with PLMs. **PT freezes the PLM parameters and only learns a small set of task-specific prompt vectors.** However, prompt tuning on task-specific training data alone is more sensitive to initialization and requires significantly more training time than full finetuning.

***Transferring* prompt vectors from different tasks:**
1. Train soft prompts on multiple source tasks
2. Use these pretrained prompts to initialize the prompt for further finetuning on a target task, based on a (potentially learned) similarity measure

---

This paper introduces multitask prompt tuning (MPT), which **uses multitask data to learn a single prompt** that can be efficiently transferred to target tasks.

:::info
This is practically challenging, as it requires **learning commonalities across source tasks while minimizing interference.** They **decompose the soft prompt of each source task** (which can be represented as a prompt matrix) **as a multiplication of a shared matrix and a low-rank task-specific matrix**, and find that this decomposition is more effective than simply sharing the prompt matrix across all tasks.
:::

To transfer to new tasks, they perform low-rank multiplicative updates to the shared prompt matrix.

<center> <img src = "https://i.imgur.com/dJqXzbV.png"> </center><br>
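To make the decomposition concrete, here is a rough PyTorch-style sketch of a prompt parameterized as a shared matrix modulated by a rank-one task-specific matrix (an illustrative reimplementation, not the authors' code; the number of tasks, prompt length, and hidden size below are placeholder values):

```python
import torch
import torch.nn as nn

class DecomposedPrompt(nn.Module):
    """Task prompt = shared prompt (elementwise) * rank-one task-specific matrix."""

    def __init__(self, num_tasks: int, prompt_len: int, hidden_dim: int):
        super().__init__()
        # "Slow" weights: a single prompt matrix shared across all source tasks.
        self.shared_prompt = nn.Parameter(torch.randn(prompt_len, hidden_dim))
        # "Fast" weights: per-task vectors u_k (length l) and v_k (length d).
        self.u = nn.Parameter(torch.randn(num_tasks, prompt_len))
        self.v = nn.Parameter(torch.randn(num_tasks, hidden_dim))

    def forward(self, task_id: int) -> torch.Tensor:
        # Rank-one task-specific matrix W_k = u_k v_k^T, shape (l, d).
        w_k = torch.outer(self.u[task_id], self.v[task_id])
        # Hadamard (elementwise) product with the shared prompt.
        return self.shared_prompt * w_k

# The resulting (l x d) matrix is prepended to the input embeddings of a
# frozen PLM, just like a vanilla soft prompt.
prompts = DecomposedPrompt(num_tasks=6, prompt_len=100, hidden_dim=768)
task_prompt = prompts(task_id=0)  # shape: (100, 768)
```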
Extensive experiments on 21 NLP datasets across diverse tasks demonstrate the effectiveness of the proposed approach over state-of-the-art prompt transfer methods.

![](https://i.imgur.com/0peY3OZ.png)

## Related work

* Parameter-Efficient Transfer Learning
    * Adapters and their variants
    * Prompt tuning: only updates soft prompts prepended to the input
    * Learning sparse updates to the original PLM

    This paper **learns a single shared prompt by decomposing and distilling knowledge from source prompts in a structured way** for efficient adaptation to a diverse set of target tasks.
* Multitask Learning
    * Transferring a model finetuned on multiple source tasks to another target task
    * Zero-shot and few-shot transfer capabilities of language models through massive multitask learning
    * Designing specific parameter-sharing strategies

    This paper focuses on multitask prompt transfer for parameter-efficient adaptation of language models.
* Knowledge Distillation
    * Model compression
    * Transfer learning
    * PANDA: uses knowledge distillation with a new metric to better predict prompt transferability across different combinations of source and target tasks

    This paper's MPT approach leverages multitask learning to better exploit the rich cross-task knowledge in prompt transfer.

## Methodology

Given a set of source tasks $\boldsymbol{S}=\left\{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_\kappa\right\}$ and target tasks $\boldsymbol{T}=\left\{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_\tau\right\}$, the goal is to learn a single soft prompt over $\boldsymbol{S}$ that can be efficiently updated to enable better performance on $\boldsymbol{T}$.

### Multitask prompt tuning

The proposed framework consists of two stages, *source training* and *target adaptation*: MPT first performs source training to learn a single soft prompt (via prompt decomposition and prompt distillation), which is then reused in the second stage for target-task adaptation.

:::info
Task prompts for source tasks are decomposed into:
1. a task-shared component, shared across all source tasks
2. a low-rank task-specific component, encoding knowledge specific to each task
:::

**Prompt Decomposition**

Prompt decomposition enables efficient knowledge sharing across $\boldsymbol{S}$ while still allowing each task to maintain its own parameters to encode task-specific knowledge. The method decomposes the soft prompt $\boldsymbol{P}_k$ for the $k$-th task into two parts:

<center> <img src = "https://i.imgur.com/JjDQ46r.png"> </center><br>

Let $\boldsymbol{P}^* \in \mathbb{R}^{l \times d}$ denote the shared prompt across all tasks, and further let $\boldsymbol{u}_k \in \mathbb{R}^l, \boldsymbol{v}_k \in \mathbb{R}^d$ be the task-specific vectors for each task $k$. The task-specific vectors form a rank-one matrix $\boldsymbol{W}_k = \boldsymbol{u}_k \cdot \boldsymbol{v}_k^T$. The task prompt $\widehat{\boldsymbol{P}}_k$ for the $k$-th source task is then parameterized as:

$$
\begin{equation} \tag{1}\widehat{\boldsymbol{P}}_k=\boldsymbol{P}^* \circ \boldsymbol{W}_k=\boldsymbol{P}^* \circ\left(\boldsymbol{u}_k \cdot \boldsymbol{v}_k^T\right) \end{equation}
$$

where the general information across $\boldsymbol{S}$ is captured by the "slow" weights $\boldsymbol{P}^*$ shared across tasks, while the "fast" weights $\boldsymbol{W}_k$ encode task-specific knowledge in a low-rank subspace.

**Prompt Distillation**

The authors find knowledge distillation from separately-trained source prompts (the teachers) into the decomposed prompt (the student) to be an effective strategy for learning good decomposable prompts. The first loss matches the output probability distributions of student and teacher by minimizing their KL divergence:

$$
\begin{equation} \tag{2}\mathcal{L}_{\text {Logits }}=\sum_k \sum_{i \in \mathcal{S}_k} \operatorname{KL}\left(P\left(\boldsymbol{y}_i \mid\left[\boldsymbol{P}_k^t ; \boldsymbol{x}_i\right]\right), P\left(\boldsymbol{y}_i \mid\left[\widehat{\boldsymbol{P}}_{k}^{s} ; \boldsymbol{x}_i\right]\right)\right) \end{equation}
$$

A temperature $T$ controls the smoothness of the output distribution for both teacher and student models, i.e., $p_j = \frac{1}{Z} \exp (z_j / T)$, where $z_j$ is the logit score for class $j$ and $Z$ is the normalization factor. A second loss regularizes the student's hidden states to stay close to the teacher's:

$$
\begin{equation} \tag{3}\mathcal{L}_{\text {Hidden }}=\sum_k \sum_{i \in \mathcal{S}_k}\left(\boldsymbol{H}_{k i}^s-\boldsymbol{H}_{k i}^t\right)^2 \end{equation}
$$

where $\boldsymbol{H}_{k i}^t$ and $\boldsymbol{H}_{k i}^s$ denote the hidden states of the teacher and student networks, respectively, each consisting of a sequence of hidden vectors for the $i$-th input. The total training loss is:

$$
\begin{equation} \tag{4}\mathcal{L}_{\text {Total }}=\mathcal{L}_{\text {PLM }}+\lambda\left(\mathcal{L}_{\text {Logits }}+\mathcal{L}_{\text {Hidden }}\right) \end{equation}
$$

where $\mathcal{L}_{\text {PLM}} = \sum_k \mathcal{L}_{\text {PLM}}^k$ is the aggregated task loss over all source tasks, and $\lambda$ is a weight balancing the impact of the distillation loss terms.
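A rough PyTorch-style sketch of how these distillation terms might be computed in practice (illustrative only, not the paper's implementation; the function name, tensor shapes, temperature, and $\lambda$ value are assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_logits, student_logits,
                      teacher_hidden, student_hidden,
                      task_loss, temperature=2.0, lam=0.9):
    """Combine the task loss with the two distillation terms (Eqs. 2-4).

    Illustrative shapes: logits are (batch, num_classes); hidden states
    are (batch, seq_len, d). temperature and lam are placeholder values.
    """
    # Eq. (2): KL divergence between temperature-smoothed teacher and
    # student output distributions (teacher || student).
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    loss_logits = F.kl_div(s_logprob, t_prob, reduction="batchmean")

    # Eq. (3): squared error between teacher and student hidden states
    # (mean-reduced here for simplicity).
    loss_hidden = F.mse_loss(student_hidden, teacher_hidden)

    # Eq. (4): total loss = task loss + weighted distillation terms.
    return task_loss + lam * (loss_logits + loss_hidden)
```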
### Source training and target adaptation

Training the single source prompt to be transferred to target tasks requires two steps:

1. Teacher prompts for all source tasks are pretrained individually through vanilla prompt tuning
2. Multitask training is conducted on $\boldsymbol{S}$ to jointly learn the single shared prompt via the knowledge-distillation loss in $(4)$

For target adaptation, the target prompt is parameterized as the Hadamard product of the learned shared prompt and a new low-rank task-specific matrix, which is then tuned on the target task. The proposed method can also be used for multitask learning on groups of target tasks, enabling even more parameter-efficient adaptation of pretrained language models.

**Parameter-Efficiency**

Each task uses the shared $l \times d$ prompt matrix, which has the same dimensions as a vanilla soft prompt, plus only $(l + d)$ task-specific parameters. The total number of tunable parameters is therefore $(l \times d) + (l + d) \tau$, where $\tau$ is the number of target tasks.

## Experiments

### Experimental setup

* Datasets and Tasks
* Models: T5-Base, T5-Small, and T5-Large
* Baselines
    * Full finetuning (FT)
    * Vanilla prompt tuning (PT)
    * Popular parameter-efficient methods: Adapters, BitFit
    * Existing prompt transfer methods: SPoT, ATTEMPT

<center> <img src = "https://i.imgur.com/Tm4XiRq.png"> </center>

### Results and analysis

![](https://i.imgur.com/8dELoUw.png)

<center> <img src = "https://i.imgur.com/Fqe0dfo.png"> </center><br>

<center> <img src = "https://i.imgur.com/bPNsPRg.png"> </center>

### Ablation studies

* Effectiveness of Prompt Decomposition
* Effectiveness of Prompt Distillation
* Effectiveness of Stochastic Task Sampling

## Conclusion

This paper introduced and studied multitask prompt tuning (MPT), which learns a single transferable prompt by decomposing and distilling knowledge from multiple source tasks and their task-specific source prompts. **MPT decomposes each task prompt into the Hadamard product of a shared prompt matrix and a rank-one task-specific matrix.** The shared component is then transferred to target tasks and further tuned for each of them.
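As a quick back-of-the-envelope check of the parameter-efficiency formula from the Methodology section (the prompt length, hidden size, and number of target tasks below are illustrative assumptions, not values reported in the paper, aside from 768 being T5-Base's hidden size):

```python
# Illustrative values: l = 100 prompt tokens, d = 768 (T5-Base hidden size),
# tau = 21 target tasks.
l, d, tau = 100, 768, 21

shared = l * d                 # one shared prompt matrix, reused by every task
task_specific = (l + d) * tau  # one (u_k, v_k) vector pair per target task
mpt_total = shared + task_specific

vanilla_pt_total = l * d * tau  # vanilla PT learns a full prompt per task

print(mpt_total, vanilla_pt_total)  # 95028 vs. 1612800 tunable parameters
```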