# Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners
###### tags: `RL Group meeting` Date: 2023-05-16
**ICLR 2022**
## Outline
* Introduction
* Related work
* Background
* Approach
* Experiments
* Conclusion and future work
## Introduction
- Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners.
- Determining the appropriate prompts requires domain expertise, and handcrafting a high-performing prompt often requires impractically large validation sets.
- The paper proposes DifferentiAble pRompT (**DART**), which can convert **small language models** into better few-shot learners.
- Main principle:
    1. Reformulate the natural language processing task as the pre-trained language model's own task (masked-token prediction).
    2. Differentially optimize the prompt template as well as the target label tokens with **backpropagation**.
- Main contribution:
    - Optimizing label tokens in continuous space is a new line of research that has not previously been explored in language model prompting.
- A systematic evaluation of 15 NLP tasks shows that the simple-yet-effective method contributes towards improvements across all these tasks.
## Related work
- Language Model Prompting:
    - GPT-3 is not designed for fine-tuning; it mainly relies on handcrafted prompts.
    - This study aims to develop a novel few-shot learning framework based on pre-trained language models that reduces both prompt engineering and external parameter optimization.
- Few-shot learning:
    - It can significantly improve learning capability and enable practical adaptive applications by accessing only a small number of labeled examples.
## Background
$$
\tilde{X}_{in} = [\text{CLS}]\; X_{in}\; [\text{SEP}]
$$
$$
X_{prompt} = [\text{CLS}]\; X_{in}\; [\text{SEP}]\; \mathcal{T}\; [\text{SEP}]
$$
$$
p\left(y \mid X_{prompt}\right)=\sum_{w \in \mathcal{V}_y} p\left([\text{MASK}]=w \mid X_{prompt}\right)
$$
- where $w$ ranges over $\mathcal{V}_y$, the set of label tokens mapped to class $y$. A minimal code sketch of this prompt-based classification follows.
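
A minimal sketch of this prompt-based formulation, assuming a BERT-style masked language model, a hypothetical template "It was [MASK].", and an assumed verbalizer with label words such as "great"/"terrible" (none of these specifics are from the paper):

```python
# Sketch: prompt-based classification with a masked LM and a verbalizer.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# X_prompt = [CLS] X_in [SEP] T [SEP], where the template T contains a [MASK].
x_in = "The movie was full of surprises."
template = f"It was {tokenizer.mask_token}."
inputs = tokenizer(x_in, template, return_tensors="pt")

# V_y: assumed label words for each class y (not taken from the paper).
label_words = {"positive": ["great", "good"], "negative": ["terrible", "bad"]}

with torch.no_grad():
    logits = model(**inputs).logits               # (1, seq_len, vocab_size)

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(-1)

# p(y | X_prompt) = sum over w in V_y of p([MASK] = w | X_prompt)
scores = {
    y: sum(probs[i].item() for i in tokenizer.convert_tokens_to_ids(ws))
    for y, ws in label_words.items()
}
print(scores)
```

The class with the highest summed label-word probability is taken as the prediction.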
## Approach

## Differentiable template optimization
- We utilize pseudo tokens to construct templates and then optimize them with backpropagation.
- The traditional discrete prompts satisfy $[T_i] \in \mathcal{V}$ and map $\mathcal{T}$ to:
$$
\left\{\mathbf{w}\left(\left[\mathrm{T}_{0: i}\right]\right), \mathbf{w}([\mathrm{MASK}]), \mathbf{w}\left(\left[\mathrm{T}_{i+1: m}\right]\right)\right\}
$$
- **DART** considers [${T_i}$] as pseudo tokens and maps the template as follows:
$$
\left\{h_0, \ldots, h_i, \mathbf{w}([\text { MASK }]), h_{i+1}, \ldots, h_m\right\}
$$
- Differentiable template optimization can obtain expressive templates beyond the original vocabulary $\mathcal{V}$.
$$
\hat{h}_{0: m}=\underset{h}{\arg \min } \mathcal{L}\left(X_{\text {prompt }}, y\right)
$$
- **DART** leverages an auxiliary fluency constraint objective to associate the prompt embeddings with each other.
- **DART** stimulates the model to focus on context representation learning; a minimal sketch of trainable template embeddings follows.
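
A minimal sketch of differentiable template optimization, assuming a BERT-style masked LM from Hugging Face Transformers. The pseudo tokens $h_0, \ldots, h_m$ are ordinary trainable vectors spliced into the input embeddings; the template layout, hyperparameters, and the fixed label word used as the target here are illustrative assumptions, not the paper's exact setup:

```python
# Sketch: template slots h_0..h_m are trainable vectors (not vocabulary
# tokens) inserted into the input embeddings and updated by backpropagation.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
hidden = model.config.hidden_size

num_pseudo = 4                                    # m + 1 pseudo tokens (assumed)
pseudo = nn.Parameter(torch.randn(num_pseudo, hidden) * 0.02)

def build_inputs_embeds(x_in: str):
    # Layout: [CLS] X_in [SEP] h_0 .. h_m [MASK] [SEP]
    ids = tokenizer(x_in, return_tensors="pt")["input_ids"][0]
    word_emb = model.get_input_embeddings()
    mask_emb = word_emb(torch.tensor([tokenizer.mask_token_id]))
    sep_emb = word_emb(torch.tensor([tokenizer.sep_token_id]))
    embeds = torch.cat([word_emb(ids), pseudo, mask_emb, sep_emb], dim=0)
    mask_pos = embeds.size(0) - 2                 # [MASK] sits before the final [SEP]
    return embeds.unsqueeze(0), mask_pos

# Only the pseudo tokens are updated in this sketch; DART also tunes the model.
optimizer = torch.optim.AdamW([pseudo], lr=1e-3)

embeds, mask_pos = build_inputs_embeds("The movie was full of surprises.")
logits = model(inputs_embeds=embeds).logits[0, mask_pos]
# Fixed label word as the target for brevity; DART makes labels differentiable too.
target = torch.tensor([tokenizer.convert_tokens_to_ids("great")])
loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), target)
loss.backward()
optimizer.step()
```

Because the pseudo tokens live in embedding space rather than in the vocabulary, backpropagation can move them to values no discrete token has.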
## Differentiable label optimization
- Previous approaches convert the class type $Y_j$ into a variable number of label tokens $\{\ldots, v_1, \ldots, v_k, \ldots\}$.
- **DART** instead maps $Y_j$ to a continuous vocabulary space as follows:
$$\mathcal{M}(Y_j)=\{h_{m+j}\}
$$
- $m$ is the number of trainable embeddings in the template; a minimal sketch of trainable label embeddings follows.
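
A minimal sketch of differentiable label optimization. One way to realize $\mathcal{M}(Y_j)=\{h_{m+j}\}$, assumed here, is to keep one trainable embedding per class and score the output representation at the [MASK] position against each of them; the paper's exact implementation may differ:

```python
# Sketch: each class Y_j gets a trainable embedding h_{m+j}; class logits are
# dot products between the [MASK] hidden state and these embeddings.
import torch
import torch.nn as nn

hidden, num_classes = 768, 2
label_emb = nn.Parameter(torch.randn(num_classes, hidden) * 0.02)  # h_{m+1}, h_{m+2}, ...

def class_logits(mask_hidden_state: torch.Tensor) -> torch.Tensor:
    """mask_hidden_state: (batch, hidden) output at the [MASK] position."""
    return mask_hidden_state @ label_emb.T        # (batch, num_classes)

# Toy batch of two [MASK] representations and their gold labels.
mask_h = torch.randn(2, hidden)
labels = torch.tensor([0, 1])
loss = torch.nn.functional.cross_entropy(class_logits(mask_h), labels)
loss.backward()                                   # gradients flow into label_emb
```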
## Training objectives
- Class Discrimination Objective:
$$
\mathcal{L}_C = \operatorname{CE}\left(g\left(y \mid X_{prompt}\right)\right)
$$
- Fluency Constraint Objective:
$$
\begin{gathered}
h\left(x^{m} \mid x^{\prime}, y\right)=\frac{\exp\left(\left[f\left(x^{\prime}, y\right)\right]_{x^{m}}\right)}{\sum_{v^{\prime} \in \mathcal{V}^{\prime}} \exp\left(\left[f\left(x^{\prime}, y\right)\right]_{v^{\prime}}\right)} \\
\mathcal{L}_F=\sum_{m \in M} \operatorname{BCE}\left(h\left(x^{m} \mid x^{\prime}, y\right)\right)
\end{gathered}
$$
- Training objective:
$$
\mathcal{L} = \mathcal{L}_C + \lambda\,\mathcal{L}_F
$$
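
A minimal sketch of how the combined objective could be computed, assuming `class_logits` come from the prompt's [MASK] position and `mlm_logits` from a randomly re-masked input token $x^m$; names, shapes, and the use of cross-entropy for the fluency term (the paper writes it with BCE) are illustrative assumptions:

```python
# Sketch: combined loss L = L_C + lambda * L_F.
import torch
import torch.nn.functional as F

def dart_loss(class_logits, labels, mlm_logits, masked_token_ids, lam=1.0):
    # L_C: class discrimination over the prompt's [MASK] prediction.
    l_c = F.cross_entropy(class_logits, labels)
    # L_F: fluency constraint, recover the original token at each masked position.
    l_f = F.cross_entropy(mlm_logits, masked_token_ids)
    return l_c + lam * l_f

loss = dart_loss(
    class_logits=torch.randn(4, 2),               # (batch, num_classes)
    labels=torch.tensor([0, 1, 1, 0]),
    mlm_logits=torch.randn(4, 30522),             # (batch, vocab_size)
    masked_token_ids=torch.randint(0, 30522, (4,)),
    lam=1.0,
)
print(loss.item())
```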
## Experiments
- These results indicate that DART better elicits the latent ability of the pre-trained language model, making it a better few-shot learner.

- DART outperforms the conventional fine-tuning approach as well as LM-BFF by a large margin on relation extraction and event extraction datasets, in both the few-shot and fully supervised settings.


- While both methods learn separable hidden states:
- Differentiable prompts’ representation is relatively more compact.
- The representation generated from fixed prompts is more scattered.

## Conclusion
- **DART** improves few-shot learning with pre-trained language models.
- The proposed approach can produce satisfactory improvements in the few-shot scenarios.
- The proposed method is also pluggable into other pre-trained language models.
## Appendix
- The class discrimination objective is the main objective that aims to classify the sentences.

- The loss can be implemented with either binary cross-entropy or standard cross-entropy, as sketched below.
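
A minimal, illustrative sketch of the two formulations (shapes and values are made up; this is not the paper's exact derivation):

```python
# Sketch: class discrimination loss written as cross-entropy over class
# logits, or as binary cross-entropy against a one-hot target.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2)                        # (batch, num_classes)
labels = torch.tensor([0, 1, 1, 0])

ce_loss = F.cross_entropy(logits, labels)

one_hot = F.one_hot(labels, num_classes=2).float()
bce_loss = F.binary_cross_entropy_with_logits(logits, one_hot)

print(ce_loss.item(), bce_loss.item())
```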

## Reference
[Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners](https://arxiv.org/pdf/2108.13161.pdf)
[Improving and Simplifying Pattern Exploiting Training](https://arxiv.org/pdf/2103.11955.pdf)
[GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)