> [Paper link](https://arxiv.org/pdf/2107.13586.pdf) | [Note link](https://zhuanlan.zhihu.com/p/396098543) | ACM 2023

<center> <img src="https://i.imgur.com/moeSaDy.png"> </center>

## Abstract

Prompt-based learning is **based on language models that model the probability of text directly**, unlike traditional supervised learning, which trains a model to take in an input $\boldsymbol{x}$ and predict an output $\boldsymbol{y}$ as $P(\boldsymbol{y} \ | \ \boldsymbol{x})$. To use these models to perform prediction tasks, the original input $\boldsymbol{x}$ is modified using a *template* into a textual string *prompt* $\boldsymbol{x}^{'}$ that has **some unfilled slots**, and then the language model is used to **probabilistically fill the unfilled information** to obtain a final string $\hat{\boldsymbol{x}}$, from which the final output $\boldsymbol{y}$ can be derived. The paper introduces the basics of this promising paradigm, describes a unified set of mathematical notations that can cover a wide variety of existing work, and organizes existing work along several dimensions.

The framework is powerful for two reasons:

1. It allows the language model to be *pre-trained* on massive amounts of raw text.
2. By defining a new prompting function, the model is able to perform *few-shot* or even *zero-shot* learning.

## Two Sea Changes in NLP

* Feature engineering: using domain knowledge to **define and extract salient features from raw data** and provide models with the appropriate inductive bias to learn from this limited data.
* Architecture engineering: inductive bias is instead provided through the design of a suitable **network architecture conducive to learning such features**.
* Objective engineering: designing the training objectives used at both the **pre-training and fine-tuning** stages.
* Prompt engineering: **finding the most appropriate prompt to allow a LM to solve the task at hand**.

The "pre-train, fine-tune" procedure is replaced by one we dub "pre-train, prompt, and predict".

<center> <img src="https://i.imgur.com/s1x0fek.png"> <a href="https://thegradient.pub/prompting/">Image source: "Prompting: Better Ways of Using Language Models for NLP Tasks"</a> </center><br>

> ==[**Constantly-updated survey and paper list**](http://pretrain.nlpedia.ai/)==

<center> <img src="https://i.imgur.com/MWyHPHT.png"> </center>

## A Formal Description of Prompting

### Supervised Learning in NLP

In a traditional supervised learning system for NLP, we take an input $\boldsymbol{x}$, usually text, and predict an output $\boldsymbol{y}$ based on a model $P(\boldsymbol{y} \ | \ \boldsymbol{x} ; \theta)$. This can be illustrated with two stereotypical examples:

* First, *text classification* takes an input text $\boldsymbol{x}$ and predicts a label $y$ from a fixed label set $\mathcal{Y}$. For example, given the input $\boldsymbol{x} =$ "I love this movie.", the model predicts the label $y = ++$ out of the label set $\mathcal{Y} = \{ ++, +, \sim, -, -- \}$.
* Second, *conditional text generation* takes an input $\boldsymbol{x}$ and generates another text $\boldsymbol{y}$. For example, the input is text in one language, such as the Finnish $\boldsymbol{x} =$ "Hyvää huomenta.", and the output is the English $\boldsymbol{y} =$ "Good morning.".
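To make this baseline concrete before moving on to prompting, here is a minimal sketch of the traditional "pre-train, fine-tune" classification setup that models $P(\boldsymbol{y} \ | \ \boldsymbol{x} ; \theta)$ with a pre-trained encoder plus a task-specific head. It assumes the HuggingFace `transformers` library; the checkpoint name and the 5-way sentiment label set are illustrative, and the classification head would normally be fine-tuned on labeled data before use.

```python
# Minimal sketch of the "pre-train, fine-tune" setup: a pre-trained encoder plus a
# task-specific classification head models P(y | x; theta).
# Assumes the HuggingFace `transformers` library; the checkpoint name and the
# 5-way sentiment label set {++, +, ~, -, --} are illustrative only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["++", "+", "~", "-", "--"]  # fixed label set Y
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)  # the head is randomly initialized and would normally be fine-tuned on labeled data

x = "I love this movie."
inputs = tokenizer(x, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # unnormalized scores over Y
probs = torch.softmax(logits, dim=-1)[0]       # P(y | x; theta)
y_hat = LABELS[int(probs.argmax())]
print(y_hat, probs.tolist())
```

Prompting, described next, removes the need for this task-specific head by reformulating the task so that the pre-trained LM itself produces the answer.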
### Prompting Basics

Basic prompting predicts the highest-scoring $\hat{\boldsymbol{y}}$ in three steps:

* **Prompt Addition**

  A *prompting function* $f_{\mathrm{prompt}}(\cdot)$ is applied to modify the input text $\boldsymbol{x}$ into a *prompt* $\boldsymbol{x}^{'} = f_{\mathrm{prompt}}(\boldsymbol{x})$:

  1. Apply a *template*, which is a textual string that has two slots: an *input slot* $[\mathrm{X}]$ for input $\boldsymbol{x}$ and an *answer slot* $[\mathrm{Z}]$ for an intermediate generated *answer* text $\boldsymbol{z}$ that will later be mapped into $\boldsymbol{y}$.
  2. Fill slot $[\mathrm{X}]$ with the input text $\boldsymbol{x}$.

  In the first variety of prompt, the slot to fill is in the middle of the text; this is a *cloze prompt*. In the second variety, the input text comes entirely before $\boldsymbol{z}$; this is a *prefix prompt*.

* **Answer Search**

  We then search for the highest-scoring text $\hat{\boldsymbol{z}}$ that maximizes the score of the LM:

  $$ \hat{\boldsymbol{z}}=\underset{\boldsymbol{z} \in \mathcal{Z}}{\operatorname{search}} \ P\left(f_{\text{fill}}\left(\boldsymbol{x}^{\prime}, \boldsymbol{z}\right) ; \theta\right) \tag{1} $$

  The search function could be an *argmax* search that returns the highest-scoring output, or *sampling* that randomly generates outputs following the probability distribution of the LM. Here $f_{\mathrm{fill}}(\boldsymbol{x}^{'}, \boldsymbol{z})$ is the function that fills the location $[\mathrm{Z}]$ in prompt $\boldsymbol{x}^{'}$ with the potential answer $\boldsymbol{z}$. We refer to any prompt that has gone through this process as a *filled prompt*; in particular, if the prompt is filled with a true answer, we refer to it as an *answered prompt*.

* **Answer Mapping**

  In the last step, we go from the highest-scoring *answer* $\hat{\boldsymbol{z}}$ to the highest-scoring *output* $\hat{\boldsymbol{y}}$. (A worked sketch of all three steps follows the figure below.)

<center> <img src="https://i.imgur.com/ra5G8Jx.png"> </center>
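To make the three steps concrete, here is a minimal sketch assuming the HuggingFace `transformers` library and `bert-base-uncased` as the masked LM; the cloze template, the answer space $\mathcal{Z}$, and the answer-to-label mapping are illustrative choices rather than the paper's prescribed implementation.

```python
# A minimal sketch of prompt addition, answer search, and answer mapping with a
# masked LM. Assumes the HuggingFace `transformers` library; the template, the
# answer space Z, and the answer-to-label mapping are illustrative choices only.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# 1. Prompt addition: f_prompt maps x into x' via a cloze template with slots [X] and [Z].
template = "{x} Overall, it was a {z} movie."
x = "I love this movie."
answer_space = {"fantastic": "++", "boring": "--"}   # Z and its mapping into Y

# 2. Answer search: score each candidate z by the LM's probability of the filled prompt.
#    With a single-token [Z] slot we can read the score directly at the mask position.
prompt = template.format(x=x, z=tokenizer.mask_token)      # x' with [Z] left unfilled
inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
scores = {z: logits[tokenizer.convert_tokens_to_ids(z)].item() for z in answer_space}
z_hat = max(scores, key=scores.get)                        # argmax over Z

# 3. Answer mapping: map the best answer z_hat to the output label y_hat.
y_hat = answer_space[z_hat]
print(z_hat, y_hat)   # e.g. "fantastic", "++"
```

With a prefix prompt and a left-to-right LM, step 2 would instead generate the answer $\boldsymbol{z}$ auto-regressively after the prompt rather than scoring a fixed answer set at a mask position.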
## Pre-trained Language Models

This chapter of the paper presents a systematic view of various pre-trained LMs, which

1. organizes them along various axes in a more systematic way, and
2. focuses particularly on aspects salient to prompting methods.

The paper details them through the lens of:

* Main training objective
* Type of text noising
* Auxiliary training objective
* Attention mask
* Typical architecture
* Preferred application scenarios

### Training Objectives

The main training objective of a pre-trained LM almost invariably consists of some sort of objective predicting the probability of text $\boldsymbol{x}$. A popular alternative to standard LM objectives is a *denoising* objective, which applies some noising function $\tilde{\boldsymbol{x}}=f_{\text{noise}}(\boldsymbol{x})$ to the input sentence and then tries to predict the original input sentence given this noised text, $P(\boldsymbol{x} \ | \ \tilde{\boldsymbol{x}})$. Two flavors are:

1. Corrupted Text Reconstruction (CTR)
2. Full Text Reconstruction (FTR)

### Noising Functions

<center> <img src="https://i.imgur.com/KNhdmTO.png"> </center>

### Directionality of Representations

In general, there are two widely used ways to calculate such representations:

1. Left-to-right: the representation of each word is calculated based on the word itself and all previous words in the sentence.
2. Bidirectional: the representation of each word is calculated based on all words in the sentence.

Some examples of such attention masks are shown below.

<center> <img src="https://i.imgur.com/QMmNH7Q.png"> </center><br>

#### Left-to-Right Language Model

Left-to-right LMs (L2R LMs), a variety of *auto-regressive LM*, predict the upcoming words or assign a probability $P(\boldsymbol{x})$ to a sequence of words $\boldsymbol{x} = x_1, \ldots, x_n$. The probability is commonly broken down using the chain rule in a left-to-right fashion: $P(\boldsymbol{x}) = P(x_1) \times \cdots \times P(x_n \ | \ x_1 \cdots x_{n-1})$. (A concrete sketch of this factorization appears at the end of this note.)

<center> <img src="https://i.imgur.com/Vu9H2xI.png"> </center>

#### Masked Language Models

Masked LMs are a popular bidirectional objective used widely in representation learning: $P(x_i \ | \ x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ represents the probability of the word $x_i$ given the surrounding context.

#### Prefix and Encoder-Decoder

For conditional text generation tasks such as translation and summarization, where an input text $\boldsymbol{x} = x_1, \ldots, x_n$ is given and the goal is to generate target text $\boldsymbol{y}$, the common recipe is to

1. use an encoder with a fully-connected mask to encode the source $\boldsymbol{x}$ first, and then
2. decode the target $\boldsymbol{y}$ auto-regressively (from left to right).

* Prefix Language Model: the prefix LM is a left-to-right LM that decodes $\boldsymbol{y}$ conditioned on a prefixed sequence $\boldsymbol{x}$, which is **encoded by the *same* model parameters but with a fully-connected mask**.
* Encoder-decoder: the encoder-decoder model uses a left-to-right LM to decode $\boldsymbol{y}$ conditioned on a *separate* encoder for text $\boldsymbol{x}$ with a fully-connected mask. **The encoder and decoder parameters are not shared.**

### Typical Pre-training Methods

<center> <img src="https://i.imgur.com/Wj9db3O.png"> </center>

## Prompt Engineering

<center> <img src="https://i.imgur.com/8Bzmdvn.png"> </center>

## Answer Engineering

<center> <img src="https://i.imgur.com/Il0fZlL.png"> </center>

## Multi-Prompt Learning

<center> <img src="https://i.imgur.com/8RgQAMw.png"> </center>

## Training Strategies for Prompting Methods

<center> <img src="https://i.imgur.com/jTy9xPq.png"> </center>

## Applications

## Prompt-relevant Topics

## Challenges

## Meta Analysis

## Conclusion

## Appendix: Pre-trained Language Model Families

<iframe scrolling="no" width="100%" height="600" src="https://www.docdroid.net/Aynv3kg/210713586-pdf" frameborder="0" allowtransparency allowfullscreen> </iframe>
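## Appendix: Left-to-Right LM Scoring Sketch

As a complement to the "Left-to-Right Language Model" subsection above, here is a minimal sketch of the chain-rule factorization $P(\boldsymbol{x}) = P(x_1) \times \cdots \times P(x_n \ | \ x_1 \cdots x_{n-1})$, assuming the HuggingFace `transformers` library and GPT-2 as an illustrative auto-regressive LM; the example sentence is arbitrary.

```python
# A minimal sketch of left-to-right (auto-regressive) LM scoring:
# log P(x) = sum_t log P(x_t | x_1 ... x_{t-1}).
# Assumes the HuggingFace `transformers` library; GPT-2 and the example sentence
# are illustrative choices only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

x = "Good morning."
ids = tokenizer(x, return_tensors="pt").input_ids           # x_1 ... x_n
with torch.no_grad():
    logits = model(ids).logits                               # (1, n, vocab)

# The logits at position t score the *next* token, i.e. P(x_{t+1} | x_1 ... x_t).
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
log_p_x = token_lp.sum()   # log P(x), omitting the unconditioned first-token term
print(f"log P(x) ~ {log_p_x.item():.2f}")
```

Summing the per-token log-probabilities in this way is also how a left-to-right LM scores a filled prefix prompt in Equation (1).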