# Parameter-Efficient Transfer Learning for NLP
ICML 2019
[paper](https://arxiv.org/pdf/1902.00751.pdf)
## Introduction
### Fine-tune
- With numerous downstream tasks, fine-tuning is a **parameter-inefficient** way to adapt, because it requires **updating all parameters** of the language model for every task.
- A common alternative is to freeze most of the pre-trained network and only update the parameters of the **last few layers** for each downstream task.
- However, different layers of the language model capture different features, so training only the top layers typically causes a **drop in accuracy**.
### Adapter
- The adapter is a **parameter-efficient** way to update a model.
- Adapters add two small modules to **each transformer layer** of the model, and **only the adapter parameters are updated** when training on downstream tasks.
- This efficiently transfers the ability of a powerful large-scale language model to many downstream tasks while **preserving its performance** on those tasks.
## Adapter tuning for NLP
### Transformer Layer
- Each transformer layer contains a multi-headed attention sub-layer and a feed-forward sub-layer.
- Each sub-layer ends with a projection that maps its features back to the original input dimension.
- A skip connection and layer normalization follow each sub-layer.
The adapter module is inserted twice into each transformer layer:
- After the projection following multi-headed attention
- After the two feed-forward layers
The adapter's input and output have the **same dimension**, so its output is passed directly to subsequent network layers **without any further modification**; a rough sketch of where the two adapters sit follows.
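A minimal PyTorch sketch (not the paper's code) of where the two adapters sit inside one transformer layer. The hidden size `d = 768`, bottleneck `m = 64`, GELU nonlinearity, and the simplified attention/feed-forward blocks are illustrative assumptions; the adapter itself is described in the next subsection.

```python
import torch
import torch.nn as nn


def bottleneck_adapter(d: int, m: int) -> nn.Sequential:
    # Placeholder for the adapter detailed in the next subsection:
    # feedforward down-project, nonlinearity, feedforward up-project.
    return nn.Sequential(nn.Linear(d, m), nn.GELU(), nn.Linear(m, d))


class TransformerLayerWithAdapters(nn.Module):
    def __init__(self, d: int = 768, n_heads: int = 12, m: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.adapter_attn = bottleneck_adapter(d, m)  # after the attention projection
        self.adapter_ffn = bottleneck_adapter(d, m)   # after the two feed-forward layers
        self.ln_attn = nn.LayerNorm(d)
        self.ln_ffn = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-layer: projection -> adapter (with its own skip) -> residual -> layer norm.
        h, _ = self.attn(x, x, x)
        h = h + self.adapter_attn(h)
        x = self.ln_attn(x + h)
        # Feed-forward sub-layer: FFN -> adapter (with its own skip) -> residual -> layer norm.
        h = self.ffn(x)
        h = h + self.adapter_ffn(h)
        x = self.ln_ffn(x + h)
        return x
```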
### Adapter Layer
- Feedforward down-project
- Nonlinearity
- Feedforward up-project
- Skip connection (with the adapter's weights initialized near zero, the skip connection makes the adapter start out as an approximate identity mapping, which **keeps training stable**; see the sketch after this list.)
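Expanding the placeholder used above, a minimal sketch of the adapter block. The GELU nonlinearity and the specific near-zero initialization are illustrative assumptions, chosen so that the adapter starts close to an identity mapping as described in the last bullet.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: feedforward down-project -> nonlinearity -> feedforward up-project -> skip."""

    def __init__(self, d: int, m: int):
        super().__init__()
        self.down = nn.Linear(d, m)  # down-project: d -> m  (d*m weights + m biases)
        self.act = nn.GELU()
        self.up = nn.Linear(m, d)    # up-project:   m -> d  (m*d weights + d biases)
        # Near-zero initialization: together with the skip connection below, the
        # adapter behaves approximately as an identity mapping at the start of training.
        for layer in (self.down, self.up):
            nn.init.normal_(layer.weight, std=1e-3)
            nn.init.zeros_(layer.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # skip connection keeps the output dimension at d
```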
Model parameters:
- $d$ denotes the input feature dimension
- $m$ denotes the bottleneck (middle) feature dimension
The number of parameters in the down-projection is
<center>
$dm + m$
</center>
The number of parameters in the up-projection is
<center>
$dm + d$
</center>
The total is
<center>
$2dm + d + m \qquad (m \ll d)$
</center>
In practice, the newly added parameters generally account for only **0.5% to 8%** of the language model's total parameters, as the rough calculation below illustrates.
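A quick sanity check of the formula with assumed dimensions (not numbers reported in the paper):

```python
def adapter_params(d: int, m: int) -> int:
    """Parameter count of one adapter: down-projection (d*m + m) plus up-projection (m*d + d)."""
    return 2 * d * m + d + m


# Illustrative setting: hidden size d = 768 as in BERT-Base, bottleneck m = 64,
# two adapters in each of 12 transformer layers.
d, m, layers = 768, 64, 12
per_adapter = adapter_params(d, m)      # 99,136 parameters per adapter
total_added = 2 * layers * per_adapter  # about 2.4M parameters added in total
print(per_adapter, total_added)         # roughly 2% of BERT-Base's ~110M parameters,
                                        # within the 0.5%-8% range mentioned above
```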
In addition, **the parameters of the layer normalization following each adapter** also have to be updated.
- $x$ denotes the input
- $\mu$ denotes the mean of $x$
- $\sigma^2$ denotes the variance of $x$
- $\gamma$ denotes the learned scale
- $\beta$ denotes the learned shift
- $\epsilon$ denotes a small positive constant for numerical stability
<center>
$\mathrm{LayerNorm}(x)=\gamma \cdot \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta$
</center>
$\gamma$ and $\beta$ are updated so that the model can adapt to different input distributions and task requirements; a sketch of how the remaining parameters are frozen follows.
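A small helper sketching how everything except the adapters and the layer-norm parameters might be frozen. Selecting adapters by the substring `adapter` in module names is an assumption matching the sketches above, not the API of any particular library.

```python
import torch.nn as nn


def freeze_for_adapter_tuning(model: nn.Module) -> None:
    """Freeze the pre-trained weights; leave only adapters and layer normalization trainable."""
    # Freeze everything first ...
    for param in model.parameters():
        param.requires_grad = False
    # ... then re-enable the adapters and the LayerNorm scale/shift (gamma, beta),
    # which are also updated for each downstream task.
    for name, module in model.named_modules():
        if "adapter" in name or isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.requires_grad = True
```

Applied to the layer sketched earlier, this leaves roughly the 2.4M adapter parameters plus the layer-norm parameters trainable.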
## Experiments
Comparing adapter tuning against fully fine-tuned BERT-Large, we can see:
- The adapter method, which **trains only a small number of parameters**, performs comparably to the traditional approach of **fine-tuning all parameters** of the language model.
- Adapter tuning is a **parameter-efficient training method** that can quickly transfer the capabilities of a language model to downstream tasks.
- The optimal bottleneck dimension of the adapter **differs across datasets**.
To explore the relationship between the **adapter's parameter efficiency and model performance**, the paper conducts further experiments comparing adapters against the **fine-tuning** baseline as the number of trained parameters varies.
We can see that fine-tuned BERT **degrades noticeably** when only a few parameters are updated, while adapters **maintain strong performance** even when very few parameters are trained.
## Conclusion
- Adapters are a **parameter-efficient** update method that achieves performance comparable to **fine-tuning all model parameters** while training only a **small number of parameters**.
- Training only a small number of parameters also means **lower training-data requirements** and **faster training**.
- However, adapters add new modules to the original language model, which **increases inference latency**.
- Therefore, one has to **weigh training efficiency against inference efficiency** when deciding which approach to use.