# Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners

###### tags: `RL Group meeting`

Date: 2023/05/16
**ICLR 2022**

## Outline

* Introduction
* Related work
* Background
* Approach
* Experiments
* Conclusion and future work

## Introduction

- Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners.
- Determining the appropriate prompts requires domain expertise, and handcrafting a high-performing prompt often requires impractically large validation sets.
- DifferentiAble pRompT (DART) can convert **small language models** into better few-shot learners.
- Main principle:
    1. Reformulate the potential natural language processing task into the task of a pre-trained language model (a cloze-style masked language modeling problem).
    2. Differentiably optimize the prompt template as well as the target label with **backpropagation**.
- Main contributions:
    - Optimizing the label tokens in continuous space is a new branch of research that has not been explored in language model prompting.
    - A systematic evaluation on 15 NLP tasks shows that this simple yet effective method yields improvements across all of them.

## Related work

- Language model prompting:
    - GPT-3 is not designed for fine-tuning; it mainly relies on handcrafted prompts.
    - This study aims to develop a novel few-shot learning framework based on pre-trained language models that reduces prompt engineering and external parameter optimization.
- Few-shot learning:
    - It can significantly improve the learning capabilities of machine intelligence and practical adaptive applications by accessing only a small number of labeled examples.

## Background

$$
\tilde{X}_{in} = [\text{CLS}]\, X_{in}\, [\text{SEP}]
$$

$$
X_{prompt} = [\text{CLS}]\, X_{in}\, [\text{SEP}]\, \mathcal{T}\, [\text{SEP}]
$$

$$
p\left(y \mid X_{prompt}\right)=\sum_{w \in \mathcal{V}_y} p\left([\text{MASK}]=w \mid X_{prompt}\right)
$$

- where $\mathcal{V}_y$ is the set of label tokens mapped to class $y$ and $w$ ranges over those tokens.

## Approach

![](https://hackmd.io/_uploads/SJzJytP4h.png)

## Differentiable template optimization

- DART utilizes pseudo tokens to construct the template and then optimizes them with backpropagation.
- A traditional discrete prompt satisfies $[T_i] \in \mathcal{V}$ and maps $\mathcal{T}$ to:

$$
\left\{\mathbf{w}\left(\left[\mathrm{T}_{0: i}\right]\right), \mathbf{w}([\text{MASK}]), \mathbf{w}\left(\left[\mathrm{T}_{i+1: m}\right]\right)\right\}
$$

- **DART** instead treats each $[T_i]$ as a pseudo token and maps the template to:

$$
\left\{h_0, \ldots, h_i, \mathbf{w}([\text{MASK}]), h_{i+1}, \ldots, h_m\right\}
$$

- Differentiable template optimization can obtain expressive templates beyond the original vocabulary $\mathcal{V}$:

$$
\hat{h}_{0: m}=\underset{h}{\arg \min }\; \mathcal{L}\left(X_{prompt}, y\right)
$$

- **DART** leverages an auxiliary fluency constraint objective to associate the prompt embeddings with each other.
- This stimulates the model to focus on context representation learning.

## Differentiable label optimization

- Previous approaches convert each class type $Y_j$ into a variable number of label tokens $\{\ldots, v_1, \ldots, v_k, \ldots\}$.
- **DART** instead maps $Y_j$ into the continuous vocabulary space:

$$
\mathcal{M}(Y_j)=\{h_{m+j}\}
$$

- where $m$ is the number of trainable embeddings in the template (see the sketch below).
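Below is a minimal PyTorch-style sketch of the template and label construction described above. It is not the authors' implementation: the class name `DartPromptModel`, the choice of `roberta-base`, the assumption that the pseudo tokens sit directly after `[CLS]`, and the way class scores are computed (a dot product between the `[MASK]` hidden state and the continuous label embeddings, standing in for $p([\text{MASK}]=h_{m+j} \mid X_{prompt})$) are all simplifying assumptions.

```python
# Illustrative sketch of DART-style differentiable template and label optimization.
# NOT the authors' code: names, layout, and the class-scoring shortcut are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoTokenizer


class DartPromptModel(nn.Module):
    def __init__(self, model_name="roberta-base", n_template_tokens=3, n_classes=2):
        super().__init__()
        self.backbone = AutoModelForMaskedLM.from_pretrained(model_name)
        dim = self.backbone.get_input_embeddings().embedding_dim
        # h_0 ... h_{m-1}: trainable pseudo tokens replacing the discrete template T.
        self.template_embeds = nn.Parameter(torch.randn(n_template_tokens, dim) * 0.02)
        # h_{m+j}: one trainable embedding per class Y_j (continuous label tokens).
        self.label_embeds = nn.Parameter(torch.randn(n_classes, dim) * 0.02)

    def forward(self, input_ids, attention_mask, mask_positions):
        # Embed [CLS] X_in [SEP] <template placeholders> [MASK] [SEP] as usual ...
        embeds = self.backbone.get_input_embeddings()(input_ids).clone()
        # ... then splice the trainable pseudo-token embeddings into the template
        # slots (assumed here to sit directly after [CLS]; real layouts differ).
        m = self.template_embeds.shape[0]
        embeds[:, 1:1 + m, :] = self.template_embeds
        out = self.backbone(inputs_embeds=embeds,
                            attention_mask=attention_mask,
                            output_hidden_states=True)
        # Hidden state at the [MASK] position for each example in the batch.
        batch_idx = torch.arange(input_ids.size(0))
        mask_hidden = out.hidden_states[-1][batch_idx, mask_positions]   # (B, dim)
        # Score each class by similarity between the [MASK] representation and its
        # continuous label embedding; both template and label embeddings receive
        # gradients through this path, so they are optimized with backpropagation.
        return mask_hidden @ self.label_embeds.t()                       # (B, n_classes)


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("roberta-base")
    model = DartPromptModel()
    # The leading placeholder words only reserve template positions; their
    # embeddings are overwritten by the trainable pseudo tokens in forward().
    text = "it was it A three-hour cinema master class. " + tok.mask_token
    enc = tok(text, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0]
    class_logits = model(enc["input_ids"], enc["attention_mask"], mask_pos)
    print(class_logits.shape)  # torch.Size([1, 2])
```

The design point this sketch tries to mirror is that both the template and the label are represented by trainable embeddings rather than handcrafted tokens, so both can be optimized directly with backpropagation, which keeps the approach pluggable into any masked language model.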
## Training objectives

- Class discrimination objective:

$$
\mathcal{L}_C = \text{CE}\left(g\left(y \mid X_{prompt}\right)\right)
$$

- Fluency constraint objective:

$$
\begin{gathered}
h\left(x^{m} \mid x^{\prime}, y\right)=\frac{\exp \left(\left[f\left(x^{\prime}, y\right)\right]_{x^{m}}\right)}{\sum_{v^{\prime} \in \mathcal{V}^{\prime}} \exp \left(\left[f\left(x^{\prime}, y\right)\right]_{v^{\prime}}\right)} \\
\mathcal{L}_F=\sum_{m \in M} \operatorname{BCE}\left(h\left(x^{m} \mid x^{\prime}, y\right)\right)
\end{gathered}
$$

- Overall training objective (a minimal sketch of this combined loss is given in the supplement at the end of these notes):

$$
\mathcal{L} = \mathcal{L}_C + \lambda\,\mathcal{L}_F
$$

## Experiments

- These results indicate that DART better stimulates the potential ability of the pre-trained language model, making it a better few-shot learner.

![](https://hackmd.io/_uploads/HJTpYfAV3.png)

- DART outperforms the conventional fine-tuning approach as well as LM-BFF by a large margin on relation extraction and event extraction datasets, in both the few-shot and fully supervised settings.

![](https://hackmd.io/_uploads/r1cz6fC42.png)
![](https://hackmd.io/_uploads/HkAtpGRV3.png)

- While both methods learn separable hidden states:
    - The representation produced with differentiable prompts is relatively more compact.
    - The representation produced with fixed prompts is more scattered.

![](https://hackmd.io/_uploads/r10e1mCV2.png)

## Conclusion

- **DART** improves the few-shot learning ability of pre-trained language models.
- The proposed approach produces satisfactory improvements in few-shot scenarios.
- The proposed method is also pluggable into other pre-trained language models.

## Appendix

- The class discrimination objective is the main objective, which aims to classify the sentences.

![](https://hackmd.io/_uploads/BJnwVogS3.png)

- The loss can be rewritten using binary cross entropy or regular cross entropy as:

![](https://hackmd.io/_uploads/BkXJetJBh.png)

## Reference

[Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners](https://arxiv.org/pdf/2108.13161.pdf)
[Improving and Simplifying Pattern Exploiting Training](https://arxiv.org/pdf/2103.11955.pdf)
[GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
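## Supplement: Training objective sketch

As referenced in the *Training objectives* section, here is a minimal PyTorch-style sketch of the combined loss $\mathcal{L} = \mathcal{L}_C + \lambda\,\mathcal{L}_F$. It is an assumption-laden illustration, not the authors' code: `dart_loss` and its arguments are hypothetical names, and, following the appendix note that the loss can be rewritten with regular cross entropy, both terms are implemented with `cross_entropy` rather than the BCE form.

```python
# Supplementary sketch of DART's combined objective L = L_C + lambda * L_F.
# All names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F


def dart_loss(class_logits, labels, fluency_logits, masked_token_ids, lam=1.0):
    """class_logits:     (B, n_classes) scores for p(y | X_prompt)
    labels:           (B,) gold class indices
    fluency_logits:   (B, |V|) MLM logits at a randomly masked input token x^m
    masked_token_ids: (B,) original ids of those masked tokens
    lam:              weight of the fluency constraint objective."""
    # Class discrimination objective L_C: classify the sentence via the [MASK] scores.
    l_c = F.cross_entropy(class_logits, labels)
    # Fluency constraint objective L_F: recover the masked input token, which keeps
    # the prompt embeddings associated with the surrounding context.
    l_f = F.cross_entropy(fluency_logits, masked_token_ids)
    return l_c + lam * l_f


if __name__ == "__main__":
    B, n_classes, vocab_size = 4, 2, 50265
    loss = dart_loss(torch.randn(B, n_classes, requires_grad=True),
                     torch.randint(0, n_classes, (B,)),
                     torch.randn(B, vocab_size, requires_grad=True),
                     torch.randint(0, vocab_size, (B,)),
                     lam=1.0)
    loss.backward()  # gradients flow back to whatever produced the logits
    print(loss.item())
```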