# Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners
###### tags: `RL Group meeting` Date: 2023-05-16
**ICLR 2022**
## Outline
* Introduction
* Related work
* Background
* Approach
* Experiments
* Conclusion and future work
## Introduction
- Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners.
- Determining the appropriate prompts requires domain expertise, and handcrafting a high-performing prompt often requires impractically large validation sets.
- The paper proposes DifferentiAble pRompT (**DART**), which can convert **small language models** into better few-shot learners.
- Main principle:
    1. Reformulate the natural language processing task as the pre-trained language model's own task (masked-token prediction).
    2. Differentially optimize the prompt template as well as the target label tokens with **backpropagation**.
- Main contribution:
    - Optimizing label tokens in continuous space is a new line of research that has not previously been explored in language model prompting.
- A systematic evaluation of 15 NLP tasks shows that the simple-yet-effective method contributes towards improvements across all these tasks.
## Related work
- Language Model Prompting:
    - GPT-3 is not designed for fine-tuning; it mainly relies on handcrafted prompts.
    - This study aims to develop a novel few-shot learning framework based on pre-trained language models that reduces both prompt engineering and external parameter optimization.
- Few-shot learning:
    - It can significantly improve learning capability and enable practical adaptive applications by accessing only a small number of labeled examples.
## Background
$$
\tilde{X}_{in} = [\text{CLS}]\; X_{in}\; [\text{SEP}]
$$
$$
X_{prompt} = [\text{CLS}]\; X_{in}\; [\text{SEP}]\; \mathcal{T}\; [\text{SEP}]
$$
$$
p\left(y \mid X_{prompt}\right)=\sum_{w \in \mathcal{V}_y} p\left([\text{MASK}]=w \mid X_{prompt}\right)
$$
- where $w$ ranges over $\mathcal{V}_y$, the set of label tokens mapped to class $y$. A minimal code sketch of this prompt-based classification follows.
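
A minimal sketch of this prompt-based formulation, assuming a BERT-style masked language model, a hypothetical template "It was [MASK].", and an assumed verbalizer with label words such as "great"/"terrible" (none of these specifics are from the paper):

```python
# Sketch: prompt-based classification with a masked LM and a verbalizer.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# X_prompt = [CLS] X_in [SEP] T [SEP], where the template T contains a [MASK].
x_in = "The movie was full of surprises."
template = f"It was {tokenizer.mask_token}."
inputs = tokenizer(x_in, template, return_tensors="pt")

# V_y: assumed label words for each class y (not taken from the paper).
label_words = {"positive": ["great", "good"], "negative": ["terrible", "bad"]}

with torch.no_grad():
    logits = model(**inputs).logits               # (1, seq_len, vocab_size)

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(-1)

# p(y | X_prompt) = sum over w in V_y of p([MASK] = w | X_prompt)
scores = {
    y: sum(probs[i].item() for i in tokenizer.convert_tokens_to_ids(ws))
    for y, ws in label_words.items()
}
print(scores)
```

The class with the highest summed label-word probability is taken as the prediction.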
## Approach

## Differentiable template optimization
- We utilize pseudo tokens to construct templates and then optimize them with backpropagation.
- The traditional discrete prompts satisfy $[T_i] \in \mathcal{V}$ and map $\mathcal{T}$ to:
$$
\left\{\mathbf{w}\left(\left[\mathrm{T}_{0: i}\right]\right), \mathbf{w}([\mathrm{MASK}]), \mathbf{w}\left(\left[\mathrm{T}_{i+1: m}\right]\right)\right\}
$$
- **DART** considers [${T_i}$] as pseudo tokens and maps the template as follows:
$$
\left\{h_0, \ldots, h_i, \mathbf{w}([\text { MASK }]), h_{i+1}, \ldots, h_m\right\}
$$
- Differentiable template optimization can obtain expressive templates beyond the original vocabulary $\mathcal{V}$.
$$
\hat{h}_{0: m}=\underset{h}{\arg \min } \mathcal{L}\left(X_{\text {prompt }}, y\right)
$$
- **DART** leverages an auxiliary fluency constraint objective to associate the prompt embeddings with each other.
- **DART** stimulates the model to focus on context representation learning; a minimal sketch of trainable template embeddings follows.
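
A minimal sketch of differentiable template optimization, assuming a BERT-style masked LM from Hugging Face Transformers. The pseudo tokens $h_0, \ldots, h_m$ are ordinary trainable vectors spliced into the input embeddings; the template layout, hyperparameters, and the fixed label word used as the target here are illustrative assumptions, not the paper's exact setup:

```python
# Sketch: template slots h_0..h_m are trainable vectors (not vocabulary
# tokens) inserted into the input embeddings and updated by backpropagation.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
hidden = model.config.hidden_size

num_pseudo = 4                                    # m + 1 pseudo tokens (assumed)
pseudo = nn.Parameter(torch.randn(num_pseudo, hidden) * 0.02)

def build_inputs_embeds(x_in: str):
    # Layout: [CLS] X_in [SEP] h_0 .. h_m [MASK] [SEP]
    ids = tokenizer(x_in, return_tensors="pt")["input_ids"][0]
    word_emb = model.get_input_embeddings()
    mask_emb = word_emb(torch.tensor([tokenizer.mask_token_id]))
    sep_emb = word_emb(torch.tensor([tokenizer.sep_token_id]))
    embeds = torch.cat([word_emb(ids), pseudo, mask_emb, sep_emb], dim=0)
    mask_pos = embeds.size(0) - 2                 # [MASK] sits before the final [SEP]
    return embeds.unsqueeze(0), mask_pos

# Only the pseudo tokens are updated in this sketch; DART also tunes the model.
optimizer = torch.optim.AdamW([pseudo], lr=1e-3)

embeds, mask_pos = build_inputs_embeds("The movie was full of surprises.")
logits = model(inputs_embeds=embeds).logits[0, mask_pos]
# Fixed label word as the target for brevity; DART makes labels differentiable too.
target = torch.tensor([tokenizer.convert_tokens_to_ids("great")])
loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), target)
loss.backward()
optimizer.step()
```

Because the pseudo tokens live in embedding space rather than in the vocabulary, backpropagation can move them to values no discrete token has.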
## Differentiable label optimization
- Previous approaches convert the class type $Y_j$ into a variable number of label tokens $\{\ldots, v_1, \ldots, v_k, \ldots\}$.
- **DART** instead maps $Y_j$ to a continuous vocabulary space as follows:
$$\mathcal{M}(Y_j)=\{h_{m+j}\}
$$
- $m$ is the number of trainable embeddings in the template; a minimal sketch of trainable label embeddings follows.
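
A minimal sketch of differentiable label optimization. One way to realize $\mathcal{M}(Y_j)=\{h_{m+j}\}$, assumed here, is to keep one trainable embedding per class and score the output representation at the [MASK] position against each of them; the paper's exact implementation may differ:

```python
# Sketch: each class Y_j gets a trainable embedding h_{m+j}; class logits are
# dot products between the [MASK] hidden state and these embeddings.
import torch
import torch.nn as nn

hidden, num_classes = 768, 2
label_emb = nn.Parameter(torch.randn(num_classes, hidden) * 0.02)  # h_{m+1}, h_{m+2}, ...

def class_logits(mask_hidden_state: torch.Tensor) -> torch.Tensor:
    """mask_hidden_state: (batch, hidden) output at the [MASK] position."""
    return mask_hidden_state @ label_emb.T        # (batch, num_classes)

# Toy batch of two [MASK] representations and their gold labels.
mask_h = torch.randn(2, hidden)
labels = torch.tensor([0, 1])
loss = torch.nn.functional.cross_entropy(class_logits(mask_h), labels)
loss.backward()                                   # gradients flow into label_emb
```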
## Training objectives
- Class Discrimination Objective:
$$
\mathcal{L}_C = \operatorname{CE}\left(g\left(y \mid X_{prompt}\right)\right)
$$
- Fluency Constraint Objective:
$$
\begin{gathered}
h\left(x^{m} \mid x^{\prime}, y\right)=\frac{\exp\left(\left[f\left(x^{\prime}, y\right)\right]_{x^{m}}\right)}{\sum_{v^{\prime} \in \mathcal{V}^{\prime}} \exp\left(\left[f\left(x^{\prime}, y\right)\right]_{v^{\prime}}\right)} \\
\mathcal{L}_F=\sum_{m \in M} \operatorname{BCE}\left(h\left(x^{m} \mid x^{\prime}, y\right)\right)
\end{gathered}
$$
- Training objective:
$$
\mathcal{L} = \mathcal{L}_C + \lambda\,\mathcal{L}_F
$$
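
A minimal sketch of how the combined objective could be computed, assuming `class_logits` come from the prompt's [MASK] position and `mlm_logits` from a randomly re-masked input token $x^m$; names, shapes, and the use of cross-entropy for the fluency term (the paper writes it with BCE) are illustrative assumptions:

```python
# Sketch: combined loss L = L_C + lambda * L_F.
import torch
import torch.nn.functional as F

def dart_loss(class_logits, labels, mlm_logits, masked_token_ids, lam=1.0):
    # L_C: class discrimination over the prompt's [MASK] prediction.
    l_c = F.cross_entropy(class_logits, labels)
    # L_F: fluency constraint, recover the original token at each masked position.
    l_f = F.cross_entropy(mlm_logits, masked_token_ids)
    return l_c + lam * l_f

loss = dart_loss(
    class_logits=torch.randn(4, 2),               # (batch, num_classes)
    labels=torch.tensor([0, 1, 1, 0]),
    mlm_logits=torch.randn(4, 30522),             # (batch, vocab_size)
    masked_token_ids=torch.randint(0, 30522, (4,)),
    lam=1.0,
)
print(loss.item())
```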
## Experiments
- These results indicate that DART better elicits the latent ability of the pre-trained language model, making it a better few-shot learner.

- DART outperforms the conventional fine-tuning approach as well as LM-BFF by a large margin on relation extraction and event extraction datasets, in both the few-shot and fully supervised settings.


- While both methods learn separable hidden states:
- Differentiable prompts’ representation is relatively more compact.
- The representation generated from fixed prompts is more scattered.

## Conclusion
- **DART** improves few-shot learning with pre-trained language models.
- The proposed approach can produce satisfactory improvements in the few-shot scenarios.
- The proposed method is also pluggable into other pre-trained language models.
## Appendix
- The class discrimination objective is the main objective that aims to classify the sentences.

- The loss can be implemented with either binary cross-entropy or standard cross-entropy, as sketched below.
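
A minimal, illustrative sketch of the two formulations (shapes and values are made up; this is not the paper's exact derivation):

```python
# Sketch: class discrimination loss written as cross-entropy over class
# logits, or as binary cross-entropy against a one-hot target.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2)                        # (batch, num_classes)
labels = torch.tensor([0, 1, 1, 0])

ce_loss = F.cross_entropy(logits, labels)

one_hot = F.one_hot(labels, num_classes=2).float()
bce_loss = F.binary_cross_entropy_with_logits(logits, one_hot)

print(ce_loss.item(), bce_loss.item())
```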

## Reference
[Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners](https://arxiv.org/pdf/2108.13161.pdf)
[Improving and Simplifying Pattern Exploiting Training](https://arxiv.org/pdf/2103.11955.pdf)
[GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)