Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners
ICLR 2022
Outline
- Introduction
- Related work
- Background
- Approach
- Experiments
- Conclusion and future work
Introduction
- Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners.
- Determining the appropriate prompts requires domain expertise, and handcrafting a high-performing prompt often requires impractically large validation sets.
- This paper proposes DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners.
- Main principle:
- Reformulate the downstream NLP task as a (masked) language modeling task of the pre-trained language model.
- Differentiably optimize the prompt template as well as the target label tokens with backpropagation.
- Main contributions:
- Optimizing label tokens in continuous space is a new branch of research that had not previously been explored in language model prompting.
- A systematic evaluation of 15 NLP tasks shows that the simple-yet-effective method contributes towards improvements across all these tasks.
Related work
- Language Model Prompting:
- GPT-3 is not designed for fine-tuning; it mainly relies on handcrafted prompts.
- This study aims to develop a novel few-shot learning framework based on pre-trained language models that reduces prompt engineering and external parameter optimization.
- Few-shot learning:
- Learning from only a small number of labeled examples can significantly improve the capabilities of machine intelligence and enable practical adaptive applications.
Background
- Prompt-based prediction treats classification as masked language modeling: $p(y \mid x) = p\big([\mathrm{MASK}] = \mathcal{M}(y) \mid \mathcal{T}(x)\big)$, where $\mathcal{M}(y)$ represents the label token of class $y$ and $\mathcal{T}(\cdot)$ is the prompt template.
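As a hedged illustration (a standard sentiment-analysis example, not taken verbatim from the slides), a template such as "$x$. It was $[\mathrm{MASK}]$." with the verbalizer $\mathcal{M}(\texttt{positive}) = \textit{great}$ and $\mathcal{M}(\texttt{negative}) = \textit{terrible}$ scores the classes as:

$$
p(\texttt{positive} \mid x) = p\big([\mathrm{MASK}] = \textit{great} \mid \mathcal{T}(x)\big),
\qquad
p(\texttt{negative} \mid x) = p\big([\mathrm{MASK}] = \textit{terrible} \mid \mathcal{T}(x)\big).
$$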
Approach
Differentiable template optimization
- We utilize pseudo tokens to construct templates and then optimize them with backpropagation.
- The traditional discrete prompts satisfy $[T_i] \in \mathcal{V}$ and map the template $\mathcal{T}$ into $\{\,w([T_1]), \ldots, w([T_i]), w([\mathrm{MASK}]), w([T_{i+1}]), \ldots, w([T_m])\,\}$.
- DART considers $[T_i]$ as pseudo tokens and maps the template as follows: $\mathcal{T} = \{\,h_1, \ldots, h_i, w([\mathrm{MASK}]), h_{i+1}, \ldots, h_m\,\}$, where the $h_i$ are trainable embedding parameters.
- Differentiable template optimization can obtain expressive templates beyond the original vocabulary $\mathcal{V}$.
- DART leverages an auxiliary fluency constraint objective to associate the prompt embeddings with each other.
- DART stimulates the model to focus on context representation learning; a minimal code sketch of the pseudo-token template is given below.
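Below is a minimal sketch (not the authors' released code) of differentiable template optimization, assuming a BERT-style masked LM from Hugging Face `transformers`: the pseudo-token embeddings $h_1, \ldots, h_m$ are ordinary trainable parameters spliced into the input embedding sequence around a real `[MASK]` token, so backpropagation can optimize the template directly. The model name, template length, and input layout are illustrative assumptions.

```python
# Hedged sketch of DART-style differentiable templates (assumptions noted inline).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-uncased"              # assumption: any masked-LM backbone works
tokenizer = AutoTokenizer.from_pretrained(model_name)
mlm = AutoModelForMaskedLM.from_pretrained(model_name)
embed = mlm.get_input_embeddings()            # the word-embedding layer w(.)
hidden = embed.embedding_dim

m = 3                                          # number of pseudo template tokens h_1..h_m
template = nn.Parameter(torch.randn(m, hidden) * 0.02)   # trainable h_i

def build_inputs(sentence: str):
    """Build inputs_embeds with the layout [CLS] x h_1..h_m [MASK] [SEP]."""
    ids = tokenizer(sentence, add_special_tokens=False, return_tensors="pt")["input_ids"]
    cls_emb = embed(torch.tensor([[tokenizer.cls_token_id]]))
    sep_emb = embed(torch.tensor([[tokenizer.sep_token_id]]))
    mask_emb = embed(torch.tensor([[tokenizer.mask_token_id]]))  # w([MASK]) stays a real token
    tok_emb = embed(ids)
    inputs_embeds = torch.cat(
        [cls_emb, tok_emb, template.unsqueeze(0), mask_emb, sep_emb], dim=1
    )
    mask_pos = inputs_embeds.size(1) - 2       # index of the [MASK] slot
    return inputs_embeds, mask_pos

inputs_embeds, mask_pos = build_inputs("no reason to watch .")
logits = mlm(inputs_embeds=inputs_embeds).logits[0, mask_pos]    # vocabulary logits at [MASK]
# Gradients flow through inputs_embeds into `template`, optimizing the prompt end to end.
```

Using `inputs_embeds` keeps the backbone unchanged; only which vectors occupy the template slots differs from standard fine-tuning.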
Differentiable label optimization
- Previous approaches convert the class type into a variable number of label tokens.
- DART maps the labels $\mathcal{Y}$ to a continuous vocabulary space as follows: $\mathcal{Y} \rightarrow \{h_{m+1}, \ldots, h_{m+n}\}$ (one trainable label embedding per class), where $m$ is the number of trainable embeddings in the template and $n$ is the number of classes. A minimal sketch of this label scoring is given below.
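A similarly hedged sketch of differentiable label optimization (an illustration under assumptions, not the paper's exact implementation): each class $j$ gets a trainable label embedding $h_{m+j}$, and the hidden state at the `[MASK]` position is scored against these embeddings, so the labels are optimized in continuous space together with the template.

```python
# Hedged sketch: trainable label embeddings scored at the [MASK] position.
import torch
import torch.nn as nn

hidden, n_classes = 768, 2                                  # illustrative sizes
label_embeds = nn.Parameter(torch.randn(n_classes, hidden) * 0.02)  # h_{m+1}..h_{m+n}

def class_logits(mask_hidden: torch.Tensor) -> torch.Tensor:
    """mask_hidden: (batch, hidden) final-layer states taken at the [MASK] slot."""
    return mask_hidden @ label_embeds.T                     # (batch, n_classes)

# Class discrimination as ordinary cross entropy over these logits; gradients
# reach both the backbone (through mask_hidden) and the label embeddings.
criterion = nn.CrossEntropyLoss()
loss = criterion(class_logits(torch.randn(4, hidden, requires_grad=True)),
                 torch.tensor([0, 1, 1, 0]))
loss.backward()
```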
Training objectives
- Class Discrimination Objective: the main classification loss, predicting the correct label token at the $[\mathrm{MASK}]$ position.
- Fluency Constraint Objective: a masked-language-modeling loss that associates the prompt embeddings with each other and maintains language fluency.
- Training objective: a weighted combination of both losses (see the reconstruction below).
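A hedged reconstruction of the combined objective (the weighting symbol $\lambda$ is an assumed notation): the class discrimination loss $\mathcal{L}_C$ and the fluency constraint loss $\mathcal{L}_F$ are combined with a balancing hyper-parameter,

$$
\mathcal{L} \;=\; \mathcal{L}_C + \lambda\,\mathcal{L}_F .
$$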
Experiments
- These results indicate that DART can better stimulate the potential ability of the pre-trained language model, making it a better few-shot learner.
- DART outperforms the conventional fine-tuning approach as well as LM-BFF by a large margin on relation extraction and event extraction datasets in both the few-shot and fully supervised settings.
- While both methods learn separable hidden states:
- Differentiable prompts’ representation is relatively more compact.
- The representation generated from fixed prompts is more scattered.
Conclusion
- DART improves the few-shot learning ability of pre-trained language models.
- The proposed approach produces satisfactory improvements in few-shot scenarios.
- The proposed method is also pluggable into other language models.
Appendix
- The class discrimination objective is the main objective, which aims to classify the sentences.
- The loss can be rewritten using binary cross entropy or regular cross entropy, as reconstructed below.
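A hedged reconstruction of the two standard forms (the paper's exact notation may differ), with $p_j$ the predicted probability of label token $j$ at the $[\mathrm{MASK}]$ position and $y_j \in \{0,1\}$ the gold indicator:

$$
\mathcal{L}_{\mathrm{CE}} = -\sum_{j} y_j \log p_j,
\qquad
\mathcal{L}_{\mathrm{BCE}} = -\sum_{j} \Big[\, y_j \log p_j + (1 - y_j)\log(1 - p_j) \,\Big].
$$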
References
Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners
Improving and Simplifying Pattern Exploiting Training
GPT Understands, Too