# Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners (ICLR 2022)

###### tags: `group meeting`

[link](https://openreview.net/pdf?id=ek9a0qIafW)

## Introduction

* The pre-train-then-fine-tune paradigm has contributed significantly to natural language processing (NLP) and has achieved excellent results on several benchmarks.
* However, supervised fine-tuning still depends heavily on labeled data in practice and faces non-negligible challenges owing to variations in domains, languages, and tasks. These drawbacks motivate research on an important technique, few-shot learning, which can significantly improve the learning capability of practical adaptive applications by accessing only a small number of labeled examples.
* Recently, an emerging fine-tuning methodology has arisen to equip smaller language models (LMs) with few-shot capabilities: adapting the pre-trained LM directly as a predictor through the completion of a cloze task, which treats the downstream task as a (masked) language modeling problem.
* In this paper, the authors propose DifferentiAble pRompT (DART), a novel fine-tuning approach that is model-agnostic and parameter-efficient. They propose an optimization algorithm to jointly learn templates as well as labels, and further introduce a fluency constraint objective to ensure the association among the prompt embeddings.

## Background

![](https://i.imgur.com/aX9iKoX.png)

### Hard prompt

* In the standard "pre-training and fine-tuning" paradigm, there is a gap between the pre-training stage and the downstream task: the objectives are different. Downstream tasks usually require new parameters; for example, a BERT-large model with a binary classification head requires an additional set of 1,024 x 2 parameters.
* Prompting, on the other hand, lets downstream tasks take the same format as the pre-training objective and requires no new parameters.
* For a classification task, we just need to design a template ("It was") and the expected text responses (we call these label words, e.g., "great" for the positive label and "terrible" for the negative label).
* By closing the gap between the two stages, deploying the pre-trained models on specific tasks becomes much easier.

### Soft prompt

* The idea of a soft prompt is to insert random vectors (not tied to specific word embeddings from the vocabulary) into the input sequence and tune them, with the other parts of the pre-trained model fixed.
* Soft prompts alleviate the effort of handcrafting prompts and provide a parameter-efficient way to use the PLM.

## Method

![](https://i.imgur.com/HG1H6UJ.png)

### Differentiable Template Optimization

* The paper utilizes pseudo tokens to construct templates and then optimizes them with backpropagation. Specifically, given the template $\mathcal{T}=\left\{\left[\mathrm{T}_{0:i}\right],[\mathrm{MASK}],\left[\mathrm{T}_{i+1:m}\right]\right\}$, which differs from traditional discrete prompts, we can map $\mathcal{T}$ into:

$$
\left\{\mathbf{w}\left(\left[\mathrm{T}_{0:i}\right]\right), \mathbf{w}([\mathrm{MASK}]), \mathbf{w}\left(\left[\mathrm{T}_{i+1:m}\right]\right)\right\}
$$

DART instead considers each $\left[\mathrm{T}_i\right]$ as a pseudo token and maps the template as follows:

$$
\left\{h_0, \ldots, h_i, \mathbf{w}([\mathrm{MASK}]), h_{i+1}, \ldots, h_m\right\}
$$

where the $h_i$ are trainable parameters. Differentiable template optimization can thus obtain expressive templates beyond the original vocabulary $\mathcal{V}$. Lastly, the templates $h_i$ are differentiably optimized by:

$$
\hat{h}_{0:m}=\underset{h}{\arg \min } \mathcal{L}\left(X_{\text{prompt}}, y\right)
$$

A minimal implementation sketch of this template construction follows.
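To make the mapping concrete, here is a minimal PyTorch sketch assuming a HuggingFace masked LM. The class name `SoftTemplate`, the template length `n_pseudo`, the midpoint split around `[MASK]`, and appending the prompt after the sentence are all illustrative assumptions of this note, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoTokenizer

class SoftTemplate(nn.Module):
    """Sketch: splice trainable pseudo-token embeddings {h_0..h_m} and
    w([MASK]) onto the word embeddings of the input sentence."""

    def __init__(self, model_name="bert-base-uncased", n_pseudo=4):
        super().__init__()
        self.lm = AutoModelForMaskedLM.from_pretrained(model_name)
        self.tok = AutoTokenizer.from_pretrained(model_name)
        hidden = self.lm.config.hidden_size
        # h_0..h_m: free vectors, not tied to any row of the vocabulary matrix
        self.pseudo = nn.Embedding(n_pseudo, hidden)

    def forward(self, sentence: str) -> torch.Tensor:
        ids = self.tok(sentence, return_tensors="pt")["input_ids"]
        word_emb = self.lm.get_input_embeddings()(ids)          # [1, L, H]
        mask_id = torch.tensor([[self.tok.mask_token_id]])
        mask_emb = self.lm.get_input_embeddings()(mask_id)      # [1, 1, H]
        h = self.pseudo(torch.arange(self.pseudo.num_embeddings).unsqueeze(0))
        i = h.size(1) // 2                                      # split point (assumed)
        # X_prompt = {x_in, h_0..h_i, w([MASK]), h_{i+1}..h_m}
        x_prompt = torch.cat([word_emb, h[:, :i], mask_emb, h[:, i:]], dim=1)
        return self.lm(inputs_embeds=x_prompt).logits           # [1, L', |V|]
```

With such a module, differentiable template optimization amounts to backpropagating the task loss into `self.pseudo` (and, depending on the setting, the LM weights as well), realizing $\hat{h}_{0:m}=\arg\min_h \mathcal{L}(X_{\text{prompt}}, y)$.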
### Differentiable Label Optimization

* Prompt-based fine-tuning requires filling in one word, and the masked-word prediction is mapped through a verbalizer to a class (e.g., "Yes" → True, "No" → False).
* For each class $c \in Y$, previous approaches estimate the conditional likelihood on a pruned set $\mathcal{V}^c \subset \mathcal{V}$ of the top $k$ vocabulary words.
* However, this brute-force label search **(1)** is computationally intensive and **(2)** scales poorly as the number of classes increases (many classification datasets have more than 100 classes): the number of candidate assignments may be $k^C$ (where $C$ is the total number of classes), which is exponential and thus intractable.
* Additionally, class labels contain rich, complex semantic knowledge, and one discrete token may be insufficient to represent this information.
* Specifically, given the labels $Y=\left\{Y_1, Y_2, \ldots, Y_n\right\}$, unlike previous approaches that convert each class $Y_j$ into a variable number of label tokens $\left\{\ldots, v_1, \ldots, v_k, \ldots\right\}$, DART maps $Y_j$ to the continuous vocabulary space as follows:

$$
\mathcal{M}\left(Y_j\right)=\left\{h_{m+j}\right\}
$$

where $m$ is the number of trainable embeddings in the template.

## Training Objective

### Class Discrimination Objective

* The class discrimination objective is the main objective, which aims to classify the sentences. Given $\left(X_{\text{in}}, \mathcal{T}\right)$, we can generate $X_{\text{prompt}}$ and compute:

$$
\mathcal{L}_C=\operatorname{CE}\left(g\left(y \mid X_{\text{prompt}}\right)\right)
$$

### Fluency Constraint Objective

* Since the pseudo tokens in the prompt template must be co-dependent on each other, the authors introduce a fluency constraint during training.
* To ensure the association among the template tokens and to maintain the language-understanding ability inherited from the PLMs, they leverage a fluency constraint objective based on masked language modeling (MLM).
* During training, if the label is correct, the model has to predict the original token; conversely, if the label is wrong, the model is forced not to predict the original token.
* One token in the input sentence is randomly masked, and masked-token prediction is conducted. $x$ and $x^{\prime}$ are the original and masked sequences, respectively. Let $x^m$ be the target token that has been masked out in $x^{\prime}$; then $h\left(x^m \mid x^{\prime}, y\right)$ is maximized as follows:

$$
h\left(x^m \mid x^{\prime}, y\right)=\frac{\exp \left(\left[\left[f\left(x^{\prime}, y\right)\right]\right]_{x^m}\right)}{\sum_{v^{\prime} \in \mathcal{V}} \exp \left(\left[\left[f\left(x^{\prime}, y\right)\right]\right]_{v^{\prime}}\right)}
$$

$$
\mathcal{L}_F=\operatorname{BCE}\left(h\left(x^m \mid x^{\prime}, y^*\right), 1\right)+\sum_{y \neq y^*} \operatorname{BCE}\left(h\left(x^m \mid x^{\prime}, y\right), 0\right)
$$

By optimizing $\mathcal{L}_F$, the language model can obtain a better contextual representation with rich associations among the template tokens. The overall training objective is:

$$
\mathcal{L}=\mathcal{L}_C+\lambda \mathcal{L}_F
$$

A condensed sketch of these two losses appears below.
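The following sketch shows one way the two losses could be wired up. The dot-product scoring of the `[MASK]` hidden state against the label embeddings, and collecting the per-label token probabilities $h(x^m \mid x', y)$ into a single vector, are this note's assumptions about the equations above, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def class_discrimination_loss(mask_state, label_embeds, gold):
    """L_C (sketch): score the [MASK] hidden state against each trainable
    label embedding M(Y_j) = h_{m+j}, then apply cross-entropy.

    mask_state:   [B, H]  hidden state at the [MASK] position
    label_embeds: [C, H]  trainable label embeddings
    gold:         [B]     gold class indices
    """
    scores = mask_state @ label_embeds.T          # [B, C]
    return F.cross_entropy(scores, gold)

def fluency_constraint_loss(token_probs, gold):
    """L_F (sketch): token_probs[y] = h(x^m | x', y), the probability of
    recovering the masked token under label y. The masked token should be
    predictable when y = y* (target 1) and not otherwise (target 0).

    token_probs: [C]  probabilities in (0, 1), one per candidate label
    gold:        int  index of the gold label y*
    """
    target = torch.zeros_like(token_probs)
    target[gold] = 1.0
    return F.binary_cross_entropy(token_probs, target, reduction="sum")

def dart_loss(l_c, l_f, lam: float = 1.0):
    """Overall objective: L = L_C + lambda * L_F."""
    return l_c + lam * l_f
```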
## Experiment

* DART obtains better performance than conventional fine-tuning and achieves results comparable to LM-BFF.

![](https://i.imgur.com/M9FUmDD.png)

* In the ablation study, DART exhibits a performance decay when any one of its modules is removed. Furthermore, differentiable label optimization has a larger impact on performance and is highly beneficial for DART, especially in low-resource settings.

![](https://i.imgur.com/IudOYgm.png)