<style>
img {
  display: block;
  margin-left: auto;
  margin-right: auto;
}
.red {
  color: red;
}
</style>

:::info
**Parameter-Efficient Methods**
1. Prefix tuning: (1), (5)
2. Adapter tuning: (2), (4), (5)
3. Sparse training: (3)
4. ...
:::

## Control Prefixes for Parameter-Efficient Text Generation

> [Paper link](https://arxiv.org/abs/2110.08329) | [Code link](https://github.com/jordiclive/ControlPrefixes) | GEM 2022 |

Prefix-tuning is a powerful lightweight technique for adapting a large pre-trained language model to a downstream application. However, it uses the same dataset-level tuned prompt for all examples in the dataset. This paper extends this idea and proposes a dynamic method, **Control Prefixes, which allows for the inclusion of conditional input-dependent information, combining the benefits of prompt tuning and controlled generation**. The method incorporates attribute-level learnable representations into different layers of a pre-trained transformer, allowing the generated text to be guided in a particular direction.

![](https://hackmd.io/_uploads/SkBAGwluh.png)

This work considers sequence-to-sequence tasks where the objective is to model the conditional probability $P(Y \mid X)$, with $X$ and $Y$ representing the tokenized input and output sequences respectively. They experiment with ==$\text{T5-large}$ and $\text{BART}_{\text{LARGE}}$== as the underlying pre-trained LMs with parameters $\phi$, and $\phi$ remains frozen.

The model uses **a general task prefix $P_\theta$** ("task-specific parameters") and also trains **a set of control prefixes $C_\theta$** that change depending on the input ("attribute-level parameters"). This requires **attribute-level information or guidance $G$** to indicate which control prefixes should be used while processing a given input $X$. The parallel corpus is $\mathcal{Z}=\left\{\left\langle X^j, Y^j, G^j\right\rangle\right\}_{j=1, \ldots, N}$, where $G^j$ indicates all the conditional attribute-level information for sample $j$.

**<p class="red">The goal is to optimize through gradient descent the final inference parameters, $\theta$, whilst the underlying $\phi$ parameters of the pre-trained LM remain frozen:</p>**

$$
\tag{1} \theta^*=\arg \max _\theta \sum_{j=1}^N \log p\left(Y^j \mid X^j, G^j ; P_\theta, C_\theta, \phi\right)
$$

**General Prefix**

For each attention class $(E, D_c, D_m)$, a distinct prefix of key-value pairs is learnt, $P=\left\{P_1, \ldots, P_L\right\}$, where $P_l \in \mathbb{R}^{\rho \times 2d} \ \forall l \in\{1, \ldots, L\}$. Hence $P \in \mathbb{R}^{\rho \times 2dL}$, where $\rho$ is the prompt length, i.e. the number of additional key-value pairs in each attention computation. In prefix-tuning, for an attention computation in the $l$-th layer, $K_l$ and $V_l$ are augmented to become

$$
\tag{2} K_l^{\prime}=\left[P_{l, K} ; K_l\right], \quad V_l^{\prime}=\left[P_{l, V} ; V_l\right]
$$

where $K_l^{\prime}, V_l^{\prime} \in \mathbb{R}^{(\rho+M) \times d}$. The overall general prefix, parameterized by $\theta$, is $P_\theta=\left\{P^E, P^{D_c}, P^{D_m}\right\}$, where $P_\theta \in \mathbb{R}^{\rho \times 6dL}$.

**Control Prefixes**

Here they consider one attribute with $R$ possible labels, $C_\theta=\left\{C_{\theta, 1}, \ldots, C_{\theta, R}\right\}$, where $C_{\theta, r} \in \mathbb{R}^{\rho_c \times 6dL}$, $\forall r \in\{1, \ldots, R\}$. $C_{\theta, r}$ is the control prefix learnt for the $r$-th attribute label, and $\rho_c$ denotes the control prompt length for this attribute.
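To make the mechanism concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code) of how a shared general prefix and a label-specific control prefix could be prepended to the keys and values of a single attention layer, as Equation (3) below formalizes. The shapes, module name, and the single-attribute lookup are simplifying assumptions; in the actual method the prefixes are per-layer and re-parameterized.

```python
import torch
import torch.nn as nn

class PrefixAugmentedKV(nn.Module):
    """Prepend a shared general prefix and a label-specific control prefix
    to the keys/values of one attention layer (illustrative sketch)."""

    def __init__(self, d_model: int, rho: int = 10, rho_c: int = 5, num_labels: int = 3):
        super().__init__()
        # General (task-level) prefix: rho key rows and rho value rows.
        self.P_K = nn.Parameter(torch.randn(rho, d_model) * 0.02)
        self.P_V = nn.Parameter(torch.randn(rho, d_model) * 0.02)
        # Control prefixes: one (rho_c x d) key/value block per attribute label.
        self.C_K = nn.Parameter(torch.randn(num_labels, rho_c, d_model) * 0.02)
        self.C_V = nn.Parameter(torch.randn(num_labels, rho_c, d_model) * 0.02)

    def forward(self, K: torch.Tensor, V: torch.Tensor, label: torch.Tensor):
        """K, V: (batch, M, d) original keys/values; label: (batch,) attribute ids."""
        B = K.size(0)
        P_K = self.P_K.unsqueeze(0).expand(B, -1, -1)   # (B, rho, d)
        P_V = self.P_V.unsqueeze(0).expand(B, -1, -1)
        C_K = self.C_K[label]                           # (B, rho_c, d), i.e. A(G)
        C_V = self.C_V[label]
        # K'' = [A(G)_K ; P_K ; K],  V'' = [A(G)_V ; P_V ; V]
        return torch.cat([C_K, P_K, K], dim=1), torch.cat([C_V, P_V, V], dim=1)

# Usage: the augmented K/V have (rho_c + rho + M) rows, so the attention over them
# covers the control prefix, the general prefix, and the real tokens.
kv = PrefixAugmentedKV(d_model=16)
K = V = torch.randn(2, 7, 16)
K2, V2 = kv(K, V, torch.tensor([0, 2]))
```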
Let $\mathcal{A}$ be a function which returns the corresponding control prefix for the attribute label indicated by $G$. In Control Prefixes, $K_l$ and $V_l$ are augmented to become

$$
\tag{3} K_l^{\prime \prime}=\left[\mathcal{A}(G)_{l, K} ; P_{l, K} ; K_l\right], \quad V_l^{\prime \prime}=\left[\mathcal{A}(G)_{l, V} ; P_{l, V} ; V_l\right]
$$

where $K_l^{\prime \prime}, V_l^{\prime \prime} \in \mathbb{R}^{\left(\rho_c+\rho+M\right) \times d}$.

**Shared Re-parameterization**

They introduce feed-forward networks to re-parameterize the prefixes. Rather than one network, they use three distinct two-layered feed-forward neural networks, one per attention class, applied row-wise. For each attention class $(E, D_c, D_m)$, $P = \text{MLP}(\tilde{P})$, where $\tilde{P} \in \mathbb{R}^{\rho \times d}$ is much smaller than the matrix $P \in \mathbb{R}^{\rho \times 2dL}$ (and analogously for the control prefixes with $\rho_c$). This re-parameterization stabilizes prefix optimization, while sharing the networks keeps the increase in trainable parameters small.

### Data-to-Text

![](https://hackmd.io/_uploads/S1MQCaWd3.png)

### Simplification: make an article easier to understand

![](https://hackmd.io/_uploads/H1_DC6bd2.png)

### Summarization: condense a document while preserving its most important information

![](https://hackmd.io/_uploads/SkUOCpZdh.png)

## AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning

> [Paper link](https://arxiv.org/abs/2205.12410) | [Note link](https://zhuanlan.zhihu.com/p/522404268) | [Code link](https://github.com/microsoft/AdaMix) | EMNLP 2022 |

This paper proposes $\verb|AdaMix|$ as a general PEFT method that **tunes a mixture of adaptation modules** – given the underlying PEFT method of choice – introduced in each Transformer layer while keeping most of the PLM weights frozen.

Overall, their work makes the following contributions:
1. Given any PEFT method of choice, such as adapters or low-rank decompositions, $\verb|AdaMix|$ improves downstream task performance over the underlying PEFT method.
2. $\verb|AdaMix|$ is trained with stochastic routing and adaptation-module merging to retain the same computational cost (e.g., FLOPs, #tunable adaptation parameters) and benefits of the underlying PEFT method.
3. By tuning only $0.1$–$0.2\%$ of a pre-trained language model's parameters, $\verb|AdaMix|$ is the first PEFT method to outperform full model fine-tuning for all NLU tasks on GLUE, and it outperforms other competing methods on NLG and few-shot NLU tasks.

<!---
#### Mixture-of-Experts

Mixture-of-Experts (MoE) is a model design approach that aims to increase the parameter count and support conditional computation in neural models like Transformers. In MoE, multiple "experts" are used, each with its own set of learnable weights, to compute different representations of an input token based on context. These experts are typically feed-forward networks (FFN).

#### Adapters

The adapter design involves introducing new parameters into the original PLM while keeping the majority of the pre-trained weights frozen. Adapters typically consist of two fully connected layers. The first layer, known as the adapter layer, projects the input representation to a lower-dimensional space using a down projection matrix. This lower-dimensional space is referred to as the bottleneck dimension. The projected features then pass through a non-linear activation function. Finally, an up-projection matrix is used to project the low-dimensional features back to the original dimension.
--->

![](https://hackmd.io/_uploads/rkxb7Dgu2.png)

<!-- ![](https://hackmd.io/_uploads/Sk19mwl_3.png) -->

### Routing Policy

They **use a stochastic routing policy** for $\verb|AdaMix|$ with adapters: at any training step, they randomly select a pair of feedforward-up and feedforward-down projection matrices in the $i^{th}$ Transformer layer, e.g. $A_i = \{ W_{ij}^{up}, W_{ik}^{down} \}$ and $B_i = \{ W_{ij^\prime}^{up}, W_{ik^\prime}^{down} \}$ for two stochastic routes, and apply the usual adapter computation

$$
\tag{1} x \leftarrow x + f(x \cdot W^{down}) \cdot W^{up}
$$

Such stochastic routing enables adaptation modules to learn different transformations during training and obtain multiple views of the task. However, it also creates the challenge of deciding which modules to use during inference, because the routing during training is random.

### Consistency regularization

The objective of consistency regularization is to enable the adaptation modules to **share information and prevent divergence**. They add the following consistency loss as a regularizer to the task-specific optimization loss:

$$
\tag{2} \mathcal{L} = - \left(\sum_{c = 1}^{C} \mathcal{I}(x, c) \log \text{softmax}(z_c^\mathcal{A}(x)) + \frac{1}{2}\left(\mathcal{KL}(z^\mathcal{A}_{(\cdot)}(x) \| z^\mathcal{B}_{(\cdot)}(x)) + \mathcal{KL}(z^\mathcal{B}_{(\cdot)}(x) \| z^\mathcal{A}_{(\cdot)}(x))\right) \right)
$$

where $\mathcal{I}(x, c)$ is a binary indicator (0 or 1) of whether class label $c$ is the correct classification for $x$, and $z^\mathcal{A}_{(\cdot)}(x)$ and $z^\mathcal{B}_{(\cdot)}(x)$ are the logits predicted with the two routed sets of adaptation modules.

### Adaptation module merging

While the above regularization mitigates the inconsistency of random module selection during inference, hosting several adaptation modules still increases the serving cost. They therefore employ adaptation-module merging **only during inference**. Given a set of adaptation modules $W_{ij}^{up}$ and $W_{ik}^{down}$ for $i \in \{ 1 \dots L \}$ and $\{ j,k \} \in \{ 1 \dots M \}$, they simply average the weights of all the corresponding modules in every Transformer layer to collapse them into a single module $\{ W_{i}^{\prime \ up}, W_{i}^{\prime \ down} \}$, where:

$$
\tag{3} W_{i}^{\prime \ up} \leftarrow \frac{1}{M} \sum_{j=1}^M W_{ij}^{up}, \qquad W_{i}^{\prime \ down} \leftarrow \frac{1}{M} \sum_{j=1}^M W_{ij}^{down}
$$

![](https://hackmd.io/_uploads/rkZISweuh.png)

They use BERT-base and RoBERTa-large as encoders for NLU tasks, and ==GPT-2 for NLG tasks.==

### NLU

![](https://hackmd.io/_uploads/HyrdIkfd3.png)

![](https://hackmd.io/_uploads/r1ycIyf_n.png)

### NLG

![](https://hackmd.io/_uploads/H1sjIJMOn.png)

![](https://hackmd.io/_uploads/BJRLvJz_n.png)
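As a concrete picture of the route-then-merge recipe, here is a minimal PyTorch sketch (my own illustration, not the released AdaMix code) of a mixture-of-adapters layer that routes to a randomly chosen expert during training and averages expert weights into a single adapter at inference; the module name, sizes, and the merging trigger are assumptions.

```python
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    """Sketch: M adapter 'experts' with stochastic routing (training) and
    weight averaging into a single adapter (inference)."""

    def __init__(self, d_model: int = 768, bottleneck: int = 16, num_experts: int = 4):
        super().__init__()
        self.down = nn.ModuleList(nn.Linear(d_model, bottleneck) for _ in range(num_experts))
        self.up = nn.ModuleList(nn.Linear(bottleneck, d_model) for _ in range(num_experts))
        self.act = nn.GELU()
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Stochastic routing: pick independent random down/up experts.
            j = torch.randint(self.num_experts, (1,)).item()
            k = torch.randint(self.num_experts, (1,)).item()
            return x + self.up[j](self.act(self.down[k](x)))  # x <- x + f(x W_down) W_up
        # Inference: merge experts by averaging their weights (Eq. 3).
        w_down = torch.stack([m.weight for m in self.down]).mean(0)
        b_down = torch.stack([m.bias for m in self.down]).mean(0)
        w_up = torch.stack([m.weight for m in self.up]).mean(0)
        b_up = torch.stack([m.bias for m in self.up]).mean(0)
        h = self.act(nn.functional.linear(x, w_down, b_down))
        return x + nn.functional.linear(h, w_up, b_up)
```

In this sketch the consistency loss of Equation (2) would be computed from two forward passes over the same batch, each taking a different random route; in practice the merged weights would also be computed once after training rather than on every inference call.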
## Parameter-Efficient Sparsity for Large Language Models Fine-Tuning

> [Paper link](https://arxiv.org/abs/2205.11005) | [Note link](https://zhuanlan.zhihu.com/p/545718980) | [Code link](https://github.com/yuchaoli/PST) | IJCAI 2022 |

This paper proposes a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during **sparse-aware training in downstream tasks**. First, it combines data-free and data-driven criteria to efficiently and accurately measure the importance of weights. Then it investigates the intrinsic redundancy of the data-driven weight importance and derives two notable characteristics, i.e., low-rankness and structuredness. Based on that, two groups of small matrices are introduced to compute the data-driven importance of weights, instead of using the original large importance score matrix, which makes the sparse training resource-efficient and parameter-efficient.

Their main contributions:

- It exploits both the **low-rankness** and **structuredness** of the data-driven importance score and thus replaces it with several small matrices. This opens a novel research direction: how to compress the redundancy of the importance score to efficiently obtain the importance of weights.
- Extensive experiments demonstrate the effectiveness of ==their method across various typical pre-trained large language models (e.g., BERT, RoBERTa, and GPT-2) upon diverse datasets.== In particular, compared with previous works, PST obtains $98.5\%$ trainable-parameter savings with a $0.12$ average-score improvement on GLUE.

![](https://hackmd.io/_uploads/HktRQPxO3.png)

### Preliminaries

Given a weight matrix $W \in \mathbb{R}^{n \times k}$, a network sparsification strategy introduces an importance score $S \in \mathbb{R}^{n \times k}$ to determine which weights should be removed. Based on $S$, a binary mask $M \in \{ 0, 1\}^{n \times k}$ is generated for the computation $Y = (W \odot M)X$, where $Y$ and $X$ are the output and input of the layer and $\odot$ denotes element-wise multiplication. A common strategy is to keep the top-$v$ entries of the weight $W$ based on the importance score $S$; the mask $M$ then selects the top-$v$ values:

$$
\tag{1} M_{i,j} = f(S, v)_{i,j} = \begin{cases} 1, & S_{i,j} \text{ in the top-}v \text{ values}, \\ 0, & \text{otherwise.} \end{cases}
$$

Thus, the optimization problem of language-model fine-tuning is:

$$
\tag{2} \min_{W,S} \mathcal{L}(W \odot f(S,v) ; \mathcal{D}), \quad \text{s.t. } \frac{v}{n \cdot k} \le 1 - p
$$

where $\mathcal{D}$ is the observed dataset, $\mathcal{L}$ represents the loss function, and $p$ denotes the target compression ratio.

### Low-Rankness ($A$ and $B$ in Figure 1)

The data-driven importance computed from weights and gradients is low-rank, so it can be represented by a set of rank-decomposition matrices.

### Structuredness ($R$ and $C$ in Figure 1)

Examining the distribution of sparse weights, they observe that some rows/columns are in general less important than others, which motivates introducing a set of small matrices that measure the importance of each row/column of the weight.

![](https://hackmd.io/_uploads/rk0KriM_n.png)
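To illustrate the general top-$v$ masking mechanism above, here is a minimal PyTorch sketch (my own, not PST's exact criterion) in which a data-free magnitude term is combined with a data-driven term approximated by small matrices — a low-rank product $AB$ plus row/column terms $R$ and $C$ — and the resulting score decides which weights survive; the combination rule and shapes are simplifying assumptions.

```python
import torch

def topv_mask(score: torch.Tensor, v: int) -> torch.Tensor:
    """Binary mask keeping the v highest-scoring entries (Eq. 1)."""
    threshold = score.flatten().topk(v).values.min()
    return (score >= threshold).to(score.dtype)

# Frozen pre-trained weight of one layer (n x k).
n, k, r = 64, 32, 4
W = torch.randn(n, k)

# Data-free importance: weight magnitude (no gradients needed).
S_free = W.abs()

# Data-driven importance, approximated by small trainable matrices:
# a low-rank part A @ B plus row/column structure terms R and C.
A, B = torch.randn(n, r) * 0.01, torch.randn(r, k) * 0.01
R, C = torch.randn(n, 1) * 0.01, torch.randn(1, k) * 0.01
S_data = A @ B + R + C                  # broadcasts to (n, k)

S = S_free + S_data                     # combined importance score (assumed rule)
p = 0.5                                 # target sparsity: keep (1 - p) of the weights
v = int((1 - p) * W.numel())
M = topv_mask(S, v)

X = torch.randn(k, 8)                   # a batch of 8 input columns
Y = (W * M) @ X                         # sparse forward pass: Y = (W ⊙ M) X
print(M.mean().item())                  # ~0.5 of the weights kept
```

The point of the sketch is the parameter count: only the small matrices $A, B, R, C$ (and $W$ itself during sparse-aware fine-tuning) carry the data-driven score, instead of a full $n \times k$ score matrix $S$.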
## Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning

> [Paper link](https://aclanthology.org/2020.findings-emnlp.41/) | [Code link](https://github.com/zlinao/VGLM) | EMNLP 2020 |

This paper proposes an effective way to **fine-tune multiple downstream generation tasks simultaneously using a single, large pre-trained model**. The Versatile Language Model (VLM) is composed of:
1. ==A pre-trained language model backbone (e.g., GPT-2)==
2. Two kinds of specialized parameters for each generation task:
   - Low-rank residual adapters
   - Task embeddings

![](https://hackmd.io/_uploads/rJPmHvldh.png)

### Residual Adapters

These are trainable modules which steer the pre-trained model towards different downstream tasks. The residual adapter computes:

$$
\tag{1} \textbf{Adapter}(H_i) = (\textbf{ReLU}(\textbf{LN}(H_i)W_i^E))W_i^D + H_i
$$

where $H_i \in \mathbb{R}^{t \times d}$ is the hidden representation from layer $i$ of the language model, $d$ is the hidden dimension, and $t$ is the current generation step. $W^E_i$ and $W_i^D$ are parameters of dimension $d \times m$ and $m \times d$ respectively, and $\textbf{LN}(\cdot)$ denotes layer normalization. The bottleneck dimension $m$ is tunable and allows adjusting the capacity of the adapter according to the complexity of the target task.

### Task Embedding

To adapt unconditional generative language models to different conditional language generation tasks (e.g., CoQA, summarization), they construct a set of task-specific segment embeddings.

### Knowledge Distillation

For tasks with a large distributional shift from the original pre-trained language model, they propose sentence-level knowledge distillation. First, they fully fine-tune a GPT-2 model on the training set of a task (e.g., machine translation). Then they replace the gold target (e.g., the gold translation) in the training set with the greedy-decoded output of the fully fine-tuned model. Finally, the newly constructed training set is used to fine-tune the student VLM.

![](https://hackmd.io/_uploads/ByeqIJ7d2.png)

## HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks

> [Paper link](https://arxiv.org/abs/2203.03878) | arXiv 2022 |

They design a novel unified **parameter-efficient transfer learning framework** that works effectively on **both pure language and V&L tasks**. It uses **a shared hypernetwork that takes trainable hyper-embeddings as input and outputs weights for fine-tuning different small modules in a pretrained language model**, such as the parameters inserted into multi-head attention blocks (i.e., prefix-tuning) and feed-forward blocks (i.e., adapter-tuning).

They define a set of embeddings (e.g., layer, block, task and visual embeddings) as the key components to calculate hyper-embeddings, which thus can support both pure language and V&L tasks.

### Preliminaries

==They use $\text{T5}_{\text{BASE}}$ as their backbone model.==

Consider the standard multi-task setting with a set of tasks $\{ \mathcal{D}_\tau \}_{\tau=1}^T$, where $T$ is the total number of tasks and $\mathcal{D}_\tau = \{ (x_\tau^i, y_\tau^i )\}^{N_\tau}_{i=1}$ is the training data of the $\tau$-th task with $N_\tau$ samples. Standard fine-tuning minimizes the following loss on the training set:

$$
\tag{1} \mathcal{L}_{\text{total}} = \sum_{\tau=1}^T \sum_{(x_\tau^i, y_\tau^i) \in \mathcal{D}_\tau} \mathcal{L}_{\text{task}}(\theta, x_\tau^i, y_\tau^i)
$$

where $\mathcal{L}_{\text{task}}$ is the task loss, usually defined as the cross-entropy loss.

The key idea of the hypernetwork is to learn a parametric task-specific hyper-embedding $\{ I_\tau \}^T_{\tau=1}$ for each task. The hypernetwork, with parameters $\theta_h$, generates the task-specific parameters $\Delta \theta = h(\theta_h, I_\tau)$ for the rest of the network.
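As a rough picture of this hypernetwork idea (a minimal sketch under assumed shapes, not the paper's implementation), the module below maps a concatenation of task, layer, and block embeddings to a hyper-embedding and then to the flattened parameters of a target module, mirroring $\Delta \theta = h(\theta_h, I_\tau)$ and Equation (3) below.

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Sketch: generate a target module's parameters from task/layer/block embeddings."""

    def __init__(self, d_emb: int = 32, d_hyper: int = 64, target_numel: int = 768 * 24):
        super().__init__()
        # h_I: project (task, layer, block) embeddings to a hyper-embedding I_tau.
        self.proj = nn.Sequential(
            nn.Linear(3 * d_emb, d_hyper), nn.ReLU(), nn.Linear(d_hyper, d_hyper)
        )
        # h: map the hyper-embedding to the flattened parameters of the target module.
        self.to_params = nn.Linear(d_hyper, target_numel)

    def forward(self, z_task, l_layer, b_block):
        I_tau = self.proj(torch.cat([z_task, l_layer, b_block], dim=-1))
        return self.to_params(I_tau)  # delta_theta, reshaped by the caller

# Only the hypernetwork and the small embeddings are trained; the frozen backbone
# consumes the generated delta_theta, e.g. as prefix or adapter weights.
hyper = HyperNetwork()
z, l, b = (torch.randn(32) for _ in range(3))
delta = hyper(z, l, b)  # one flat parameter vector per (task, layer, block)
```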
---

During training, only the hypernetwork parameters $\theta_h$, the hyper-embeddings $\{ I_\tau \}^T_{\tau = 1}$, and the layer-normalization parameters are updated, while the remaining model parameters $\theta$ stay fixed, as in Equation $(2)$:

$$
\begin{aligned} \mathcal{L}_{\text{total}} & = \sum_{\tau = 1}^T \sum_{(x_\tau^i, y_\tau^i) \in \mathcal{D}_\tau} \mathcal{L}_{\text{task}} (\Delta \theta, \theta, x_\tau^i, y_\tau^i) \\ & = \sum_{\tau = 1}^T \sum_{(x_\tau^i, y_\tau^i) \in \mathcal{D}_\tau} \mathcal{L}_{\text{task}} (I_\tau, \theta_h, \theta, x_\tau^i, y_\tau^i) \end{aligned} \tag{2}
$$

### Hyper-Embeddings for PELT

They introduce a set of layer-id embeddings $\mathcal{I} = \{ l_i \}^L_{i=1}$ and block-type embeddings $\mathcal{B} = \{ b_j \}^5_{j=1}$, which specify the positions where the parameters $\Delta \theta$ are inserted. They compute a hyper-embedding $I_\tau^t \in \mathbb{R}^{d_I}$ for each individual task via a task projector network $h_I^t(\cdot)$, a multi-layer perceptron consisting of two feed-forward layers and a ReLU non-linearity:

$$
\tag{3} I_\tau^t = h_I^t(z_\tau, l_i, b_j)
$$

where $z_\tau \in \mathbb{R}^{d_t}$ are task embeddings, $l_i \in \mathbb{R}^{d_t}$ are layer-id embeddings, and $b_j \in \mathbb{R}^{d_t}$ are block-type embeddings.

### HyperPrefix: Incorporate with Prefix-tuning

In the original prefix-tuning, the prefix vectors of each attention block are re-parameterized by a two-layer feed-forward network:

$$
\tag{4} P = W_{\text{up}} \phi (W_{\text{down}} E)
$$

where $E \in \mathbb{R}^{d \times N}$ is a randomly initialized embedding matrix. In their method, they extend the dimension of the different embeddings, i.e. $z_\tau, l_i, b_j$, to match the prefix length $N$, and then compute the hyper-embedding $I_\tau^t$. They then employ a hypernetwork $h_P^t(\cdot)$ with trainable parameters $\theta_{h^t_P}$ to project $I_\tau^t$ to prefix vectors $P^t \in \mathbb{R}^{N \times d}$, named $HyperPrefix$:

$$
\tag{5} P^t = h_P^t (\theta_{h^t_P}, I_\tau^t)
$$

### HyperPELT: Incorporate with Adapter

They introduce a hypernetwork-based adapter layer with a trainable scaling parameter $\lambda$, inserted in parallel with the feed-forward blocks, named HyperPELT. This task-conditioned adapter layer $A_\tau$ consists of a down-projection $W_{\text{down}}^\tau$, a GeLU non-linearity, and an up-projection $W_{\text{up}}^\tau$:

$$
\tag{6} A_\tau^t(x) = \lambda \, \text{LN} (W_{\text{up}}^\tau \, \text{GeLU} (W_{\text{down}}^\tau x)) + x
$$

The adapter weights $(W_{\text{up}}^\tau, W_{\text{down}}^\tau)$ are generated by a hypernetwork $h_A^t(\cdot)$:

$$
\tag{7} (W_{\text{up}}^\tau, W_{\text{down}}^\tau) := h_A^t(\theta_{h_A^t}, I_\tau^t)
$$

![](https://hackmd.io/_uploads/rkCD4PlO2.png)

![](https://hackmd.io/_uploads/Byfc_ZmOn.png)
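To tie Equations (6)–(7) together, here is a minimal PyTorch sketch (my own, not the authors' code) of a task-conditioned adapter whose down/up projections are produced by a hypernetwork from the hyper-embedding $I_\tau^t$; the dimensions and the initialization of $\lambda$ are assumptions.

```python
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """Adapter whose down/up projection weights are generated from a hyper-embedding."""

    def __init__(self, d_model: int = 768, bottleneck: int = 24, d_hyper: int = 64):
        super().__init__()
        # h_A: hypernetwork producing flattened (W_down, W_up) from I_tau (Eq. 7).
        self.gen_down = nn.Linear(d_hyper, bottleneck * d_model)
        self.gen_up = nn.Linear(d_hyper, d_model * bottleneck)
        self.ln = nn.LayerNorm(d_model)
        self.lam = nn.Parameter(torch.tensor(1.0))  # trainable scaling lambda
        self.d_model, self.bottleneck = d_model, bottleneck

    def forward(self, x: torch.Tensor, I_tau: torch.Tensor) -> torch.Tensor:
        # Generate task-conditioned adapter weights from the hyper-embedding.
        W_down = self.gen_down(I_tau).view(self.bottleneck, self.d_model)
        W_up = self.gen_up(I_tau).view(self.d_model, self.bottleneck)
        # A(x) = lambda * LN(W_up GeLU(W_down x)) + x   (Eq. 6)
        h = nn.functional.gelu(x @ W_down.t())
        return self.lam * self.ln(h @ W_up.t()) + x

adapter = HyperAdapter()
x = torch.randn(2, 10, 768)   # (batch, seq, d_model) hidden states
I_tau = torch.randn(64)       # hyper-embedding for one (task, layer, block)
out = adapter(x, I_tau)
```

Because the adapter weights come from the shared hypernetwork rather than being stored per task, only the hypernetwork and the small task/layer/block embeddings are trained, which is where the parameter savings come from.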