# Parameter-Efficient Transfer Learning for NLP

ICML 2019 [paper](https://arxiv.org/pdf/1902.00751.pdf)

## Introduction

### Fine-tuning

- For numerous downstream tasks, fine-tuning is an **inefficient parameter updating approach**, because it requires **updating all parameters** of the language model for every task.
- A common alternative is to freeze most of the pre-trained language model's parameters and only update the parameters of the **last few layers** for each downstream task.
- Different layers of the language model capture different features, so this approach causes a **decrease in accuracy**.

### Adapter

- Adapters are an **efficient** way to update parameters.
- Adapter tuning adds two small modules to **each Transformer layer** of the model and **only updates the adapter parameters** when training on downstream tasks.
- This transfers the abilities of a powerful large-scale language model to many downstream tasks efficiently, while **preserving the model's performance** on those tasks.

## Adapter tuning for NLP

### Transformer Layer

- A multi-headed attention sub-layer and a feed-forward sub-layer.
- Each sub-layer projects its features back to the original input dimension.
- Each sub-layer is followed by a skip connection and layer normalization.

The adapter module is inserted twice into each Transformer layer:

- After the projection following multi-headed attention
- After the two feed-forward layers

The adapter keeps its input and output in the **same dimension**, so its output is passed directly to the subsequent network layers **without any further modification**.

<center>

![](https://hackmd.io/_uploads/S1BIfRWC3.jpg =40%x)

</center>

### Adapter Layer

- Feed-forward down-projection
- Nonlinearity
- Feed-forward up-projection
- Skip connection (if the projection parameters are initialized close to 0, the adapter starts out close to an identity mapping thanks to the skip connection, thus **ensuring the effectiveness of training**)

Model parameters:

- $d$ denotes the input feature dimension
- $m$ denotes the middle (bottleneck) feature dimension

The number of parameters of the down-projection is

<center>

$dm + m$

</center>

The number of parameters of the up-projection is

<center>

$dm + d$

</center>

The total is

<center>

$2dm + d + m$, with $m \ll d$

</center>

In practice, the newly added parameters generally account for only **0.5% to 8%** of the total parameters of the language model.

Besides, **the parameters of the layer normalization after the adapter** also have to be updated.

- $x$ denotes the input data
- $\mu$ denotes the mean of $x$
- $\sigma^2$ denotes the variance of $x$
- $\gamma$ denotes a learnable scale
- $\beta$ denotes a learnable shift
- $\epsilon$ denotes a small positive number

<center>

$\mathrm{LayerNorm}(x) = \gamma \cdot \dfrac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$

</center>

$\gamma$ and $\beta$ are updated to adapt to different input data distributions and task requirements.

<center>

![](https://hackmd.io/_uploads/Bk36pRZAh.jpg =40%x)

</center>
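To make the adapter layer above concrete, here is a minimal PyTorch sketch of the bottleneck structure (down-projection, nonlinearity, up-projection, skip connection) with near-zero initialization. The class name `Adapter`, the choice of GELU as the nonlinearity, the initialization scale, and the dimensions `d=768`, `m=64` in the usage example are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter sketch: down-project -> nonlinearity -> up-project -> skip."""

    def __init__(self, d: int, m: int):
        super().__init__()
        self.down_project = nn.Linear(d, m)   # d*m weights + m biases
        self.nonlinearity = nn.GELU()         # assumed nonlinearity for illustration
        self.up_project = nn.Linear(m, d)     # m*d weights + d biases
        # Near-zero initialization keeps the adapter close to an identity mapping
        # at the start of training, since the skip connection passes x through.
        nn.init.normal_(self.down_project.weight, std=1e-3)
        nn.init.zeros_(self.down_project.bias)
        nn.init.normal_(self.up_project.weight, std=1e-3)
        nn.init.zeros_(self.up_project.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Trainable parameters per adapter: 2*d*m + d + m, with m << d.
        return x + self.up_project(self.nonlinearity(self.down_project(x)))


if __name__ == "__main__":
    # Hypothetical sizes: BERT-base hidden size with a 64-dimensional bottleneck.
    adapter = Adapter(d=768, m=64)
    n_params = sum(p.numel() for p in adapter.parameters())
    assert n_params == 2 * 768 * 64 + 768 + 64
    print(n_params)  # 99136 parameters per adapter module
```

During downstream training, only these adapter parameters (together with the layer-normalization scale and shift) would be left trainable, while the rest of the pre-trained model stays frozen.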
## Experiments

Comparing adapter tuning against fully fine-tuned BERT-Large, we can see:

- The adapter method, which **only trains a small number of parameters**, performs comparably to the traditional approach of **fine-tuning all parameters of the language model**.
- Adapters are an **efficient parameter training method** that can quickly transfer the capabilities of a language model to downstream tasks.
- The optimal bottleneck dimension of the adapter is **different** on different datasets.

<center>

![](https://hackmd.io/_uploads/S11pYMMR3.png)

</center>

To explore the relationship between the **adapter's parameter efficiency and model performance**, the paper conducts further experiments comparing against the **fine-tuning** baseline.

<center>

![](https://hackmd.io/_uploads/HyExofGC3.png)

</center>

We can see that fine-tuned BERT **performs poorly** when only a few parameters are tuned, while adapters **maintain good performance** even when very few parameters are trained.

## Conclusion

- Adapters are an **efficient parameter update method** that achieves performance comparable to fine-tuning the full model while training only a **small number of parameters**.
- Training only a small number of parameters also means **lower requirements on training data** and **faster training**.
- However, adapters add new modules to the original language model, which **increases latency during inference**.
- Therefore, we need to **weigh training efficiency against inference efficiency** to decide which method to choose.