Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners

tags: RL Group meeting · Date: 2023-05-16

ICLR 2022

Outline

  • Introduction
  • Related work
  • Background
  • Approach
  • Experiments
  • Conclusion and future work

Introduction

  • Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners.
  • Determining the appropriate prompts requires domain expertise, and handcrafting a high-performing prompt often requires impractically large validation sets.
  • This paper proposes DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners.
  • Main principle:
    1. Reformulate the potential natural language processing task into the task of the pre-trained language model (i.e., masked-token prediction).
    2. Differentially optimize the prompt template as well as the target label with backpropagation.
  • Main contribution:
    • Optimizing label tokens in continuous space is also a new branch of research that has not been explored in language model prompting.
    • A systematic evaluation of 15 NLP tasks shows that the simple-yet-effective method contributes towards improvements across all these tasks.
  • Language Model Prompting :
    • GPT-3 is not designed for fine-tuning; it mainly relies on handcrafted prompts.
    • This study aims to develop a novel few-shot learning framework based on pre-trained language models which can reduce the prompt engineering and external parameter optimization.
  • Few-shot learning :
    • It can significantly improve the learning capabilities for machine intelligence and practical adaptive applications by accessing only a small number of labeled examples.

Background

$\tilde{X}_{in} = [\text{CLS}]\, X_{in}\, [\text{SEP}]$
$X_{prompt} = [\text{CLS}]\, X_{in}\, [\text{SEP}]\, T\, [\text{SEP}]$

$p(y \mid X_{prompt}) = \sum_{w \in V_y} p([\text{MASK}] = w \mid X_{prompt})$

  • where $V_y$ is the set of label tokens of class $y$, and $w$ ranges over those tokens.
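As a sanity check, the verbalizer probability above can be sketched in a few lines of numpy: softmax the [MASK] logits over the vocabulary, then sum the probabilities of the label tokens in $V_y$. The vocabulary size, logits, and label-token ids below are toy assumptions, not the paper's setup.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def class_probability(mask_logits, label_token_ids):
    """p(y | X_prompt): sum the [MASK] probabilities of the
    label tokens V_y associated with class y."""
    probs = softmax(mask_logits)
    return probs[label_token_ids].sum()

# Toy vocabulary of 6 tokens; suppose class y maps to tokens {1, 4}.
logits = np.array([0.1, 2.0, -1.0, 0.3, 1.5, 0.0])
p_y = class_probability(logits, [1, 4])
```

Summing over all vocabulary tokens recovers probability 1, which is a quick way to check the normalization.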

Approach


Differentiable template optimization

  • We utilize pseudo tokens to construct templates and then optimize them with backpropagation.
  • Traditional discrete prompts satisfy $[T_i] \in V$ and map the template $T$ into:
    $\{w([T_{0:i}]),\ w([\text{MASK}]),\ w([T_{i+1:m}])\}$
  • DART instead treats each $[T_i]$ as a pseudo token and maps the template as follows:
    $\{h_0, \dots, h_i,\ w([\text{MASK}]),\ h_{i+1}, \dots, h_m\}$
  • Differentiable template optimization can obtain expressive templates beyond the original vocabulary $V$:
    $\hat{h}_{0:m} = \arg\min_{h} \mathcal{L}(X_{prompt}, y)$
  • DART leverages an auxiliary fluency-constraint objective to associate the prompt embeddings with one another, stimulating the model to focus on context representation learning.
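The $\arg\min$ over $h$ can be illustrated with a minimal numpy SGD loop. The quadratic surrogate loss and the `context` target below are stand-in assumptions; in the actual method the gradient would come from backpropagation through the language model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pseudo = 8, 4

# Pseudo tokens [T_i] are free vectors h_0..h_m rather than rows of the
# frozen vocabulary embedding table, so gradients can move them anywhere.
h = rng.normal(size=(n_pseudo, dim))
context = rng.normal(size=(n_pseudo, dim))  # stand-in target signal

def loss_and_grad(h, context):
    """Toy quadratic surrogate for L(X_prompt, y); the real gradient
    would be obtained by backprop through the pre-trained LM."""
    diff = h - context
    return 0.5 * (diff ** 2).sum(), diff  # loss and dL/dh

lr = 0.1
for _ in range(100):  # approach h_hat = argmin_h L by plain SGD
    loss, grad = loss_and_grad(h, context)
    h -= lr * grad
```

The point of the sketch is only that the template embeddings are ordinary trainable parameters updated by gradient descent, not discrete vocabulary entries.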

Differentiable label optimization

  • Previous approaches convert the class type $Y_j$ into a variable number of label tokens $\{\dots, v_1, \dots, v_k, \dots\}$
  • DART maps $Y_j$ to the continuous vocabulary space as follows:
    $\mathcal{M}(Y_j) = \{h_{m+j}\}$
  • $m$ is the number of trainable embeddings in the template.
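The label mapping can be sketched as indexing into one shared trainable table: the first $m$ rows hold the template pseudo tokens, and label $Y_j$ reuses row $m + j$. The sizes below are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, m, n_classes = 8, 4, 3

# Single trainable table: rows 0..m-1 are template pseudo tokens
# h_0..h_{m-1}; label Y_j occupies the reserved row m + j, so labels
# live in the same continuous space as the template.
table = rng.normal(size=(m + n_classes, dim))

def label_embedding(j):
    """M(Y_j) = {h_{m+j}}: look up class j's continuous label vector."""
    return table[m + j]
```

Because labels and template tokens share one parameter table, both are optimized jointly by the same backward pass.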

Training objectives

  • Class discrimination objective:
    $\mathcal{L}_C = \mathrm{CE}\big(g(y \mid X_{prompt})\big)$
  • Fluency constraint objective:
    $h(x_m \mid x, y) = \frac{\exp\big([[f(x, y)]]_{x_m}\big)}{\sum_{v \in V} \exp\big([[f(x, y)]]_v\big)}, \qquad \mathcal{L}_F = \sum_{m \in M} \mathrm{BCE}\big(h(x_m \mid x, y)\big)$
  • Overall training objective:
    $\mathcal{L} = \mathcal{L}_C + \lambda\, \mathcal{L}_F$
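Combining the two objectives is then just a weighted sum. The probabilities and $\lambda$ below are made-up numbers, and the BCE term is simplified to its positive-label form $-\log p$; this is a sketch of the loss arithmetic, not the paper's implementation.

```python
import numpy as np

lam = 0.5              # hypothetical weighting hyperparameter lambda
p_gold = 0.7           # assumed p(y | X_prompt) for the gold class
p_masked = [0.6, 0.8]  # assumed h(x_m | x, y) for each masked token m in M

# Class discrimination: cross entropy on the gold-class probability.
L_C = -np.log(p_gold)

# Fluency constraint: BCE per recovered token; with a positive label it
# reduces to -log p (a simplification of the paper's BCE term).
L_F = sum(-np.log(p) for p in p_masked)

L = L_C + lam * L_F    # L = L_C + lambda * L_F
```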

Experiments

  • These results indicate that DART can better stimulate the potential ability of the pre-trained language model, making it a better few-shot learner.


  • DART outperforms the conventional fine-tuning approach as well as LM-BFF by a large margin on relation extraction and event extraction datasets, in both the few-shot and fully supervised settings.


  • While both methods learn separable hidden states:

    • Differentiable prompts’ representation is relatively more compact.
    • The representation generated from fixed prompts is more scattered.

Conclusion

  • DART improves the few-shot learning ability of pre-trained language models.
  • The proposed approach can produce satisfactory improvements in the few-shot scenarios.
  • The proposed method is also pluggable for other language models.

Appendix

  • The class discrimination objective is the main objective that aims to classify the sentences.


  • The loss can be rewritten using binary cross entropy or regular cross entropy (the derivation image is unavailable in the original note).

Reference

Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners

Improving and Simplifying Pattern Exploiting Training

GPT Understands, Too