> [Paper link](https://arxiv.org/pdf/2107.13586.pdf) | [Note link](https://zhuanlan.zhihu.com/p/396098543) | ACM 2023
<center>
<img src= "https://i.imgur.com/moeSaDy.png">
</center>
## Abstract
Prompt-based learning is **based on language models that model the probability of text directly**, unlike traditional supervised learning, which trains a model to take in an input $\boldsymbol{x}$ and predict an output $\boldsymbol{y}$ as $P(\boldsymbol{y} \ | \ \boldsymbol{x})$.
To use these models to perform prediction tasks, the original input $\boldsymbol{x}$ is modified using a *template* into a textual string *prompt* $\boldsymbol{x}^{'}$ that has **some unfilled slots**, and then the language model is used to **probabilistically fill the unfilled information** to obtain a final string $\hat{\boldsymbol{y}}$, from which the final output $\boldsymbol{y}$ can be derived.
In this paper we introduce the basics of this promising paradigm, describe a unified set of mathematical notations that can cover a wide variety of existing work, and organize existing work along several dimensions.
The framework is powerful since:
1. It allows the language model to be *pre-trained* on massive amounts of raw text
2. By defining a new *prompting function*, the model is able to perform *few-shot* or even *zero-shot* learning, adapting to new scenarios with few or no labeled data
## Two Sea Changes in NLP
* Feature engineering:
using domain knowledge to **define and extract salient features from raw data** and provide models with the appropriate inductive bias to learn from this limited data
* Architecture engineering:
inductive bias was rather provided through the design of a suitable **network architecture conducive to learning such features**
* Objective engineering:
designing the training objectives used at both the **pre-training and fine-tuning** stages
* Prompt engineering:
**finding the most appropriate prompt to allow an LM to solve the task at hand**. The “pre-train, fine-tune” procedure is replaced by one dubbed “pre-train, prompt, and predict”.
<center>
<img src= "https://i.imgur.com/s1x0fek.png">
<a href="https://thegradient.pub/prompting/">Image source from "Prompting: Better Ways of Using Language Models for NLP Tasks"</a>
</center><br>
> ==[**Constantly-updated survey, and paperlist**](http://pretrain.nlpedia.ai/)==
<center>
<img src= "https://i.imgur.com/MWyHPHT.png">
</center>
## A Formal Description of Prompting
### Supervised Learning in NLP
In a traditional supervised learning system for NLP, we take an input $\boldsymbol{x}$, usually text, and predict an output $\boldsymbol{y}$ based on a model $P(\boldsymbol{y} \ | \ \boldsymbol{x} ; \theta)$.
This can be illustrated with two stereotypical examples:
* First, *text classification* takes an input text $\boldsymbol{x}$ and predicts a label $y$ from a fixed label set $\mathcal{Y}$.
For example, given the input $\boldsymbol{x} =$ "I love this movie.", the model predicts the label $y = ++$ out of the label set $\mathcal{Y} = \{ ++, +, \sim, -, -- \}$ (a toy sketch of this formulation follows the list).
* Second, *conditional text generation* takes an input $\boldsymbol{x}$ and generates another text $\boldsymbol{y}$.
The input is text in one language, such as the Finnish $\boldsymbol{x}$ = “Hyvää huomenta.”, and the output is the English $\boldsymbol{y}$ = “Good morning.”.
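As a point of contrast with prompting, here is a toy sketch (not from the paper) of the traditional supervised formulation $P(\boldsymbol{y} \ | \ \boldsymbol{x} ; \theta)$ for text classification: an encoder plus a classifier head scoring the fixed label set $\mathcal{Y}$. The architecture and token ids are illustrative placeholders.
```python
import torch
import torch.nn as nn

LABELS = ["++", "+", "~", "-", "--"]  # the fixed label set Y

class TextClassifier(nn.Module):
    """Toy model of P(y | x; theta): encode x, then score each label in Y."""
    def __init__(self, vocab_size=30522, hidden=128):
        super().__init__()
        self.encoder = nn.EmbeddingBag(vocab_size, hidden)  # stand-in text encoder
        self.head = nn.Linear(hidden, len(LABELS))           # scores over Y

    def forward(self, token_ids):
        return torch.softmax(self.head(self.encoder(token_ids)), dim=-1)

model = TextClassifier()
token_ids = torch.tensor([[101, 1045, 2293, 2023, 3185, 102]])  # placeholder ids for "I love this movie."
probs = model(token_ids)                    # shape [1, |Y|], i.e. P(y | x; theta)
y_hat = LABELS[probs.argmax(dim=-1).item()]
```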
### Prompting Basics
Basic prompting predicts the highest-scoring $\hat{\boldsymbol{y}}$ in three steps:
* **Prompt Addition**
A *prompting function* $f_{\mathrm{prompt}}(\cdot)$ is applied to modify the input text $\boldsymbol{x}$ into a *prompt* $\boldsymbol{x}^{'} = f_{\mathrm{prompt}}(\boldsymbol{x})$ in two steps:
1. Apply a *template*, which is a textual string that has two slots:
an *input slot* $[\mathrm{X}]$ for input $\boldsymbol{x}$ and an *answer slot* $[\mathrm{Z}]$ for an intermediate generated *answer* text $\boldsymbol{z}$ that will later be mapped into $\boldsymbol{y}$.
2. Fill slot $[\mathrm{X}]$ with the input text $\boldsymbol{x}$.
There are two varieties of prompt:
    * A *cloze prompt*, in which the slot $[\mathrm{Z}]$ to fill is in the middle of the text.
    * A *prefix prompt*, in which the input text comes entirely before the slot $[\mathrm{Z}]$.
* **Answer Search**
Then, we search for the highest-scoring text $\hat{\boldsymbol{z}}$ that maximizes the score of the LM:
$$
\begin{equation}
\tag{1}\hat{\boldsymbol{z}}=\underset{\boldsymbol{z} \in \mathcal{Z}}{\operatorname{search}} P\left(f_{\text {fill }}\left(\boldsymbol{x}^{\prime}, \boldsymbol{z}\right) ; \theta\right)
\end{equation}
$$
This search function could be an *argmax* search that searches for the highest-scoring output, or *sampling* that randomly generates outputs following the probability distribution of the LM.
Here, $f_{\mathrm{fill}}(\boldsymbol{x}^{'}, \boldsymbol{z})$ is a function that fills in the location $[\mathrm{Z}]$ in prompt $\boldsymbol{x}^{'}$ with a potential answer $\boldsymbol{z}$, and $\mathcal{Z}$ is the set of permissible answers.
Any prompt that has gone through this process is called a *filled prompt*; in particular, if the prompt is filled with a true answer, we refer to it as an *answered prompt*.
* **Answer Mapping**
In the last step, we go from the highest-scoring *answer* $\hat{\boldsymbol{z}}$ to the highest-scoring *output* $\hat{\boldsymbol{y}}$, e.g. mapping an answer word such as "excellent" to the label $++$. A minimal code sketch of all three steps follows below.
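The sketch below walks through the three steps for sentiment classification, assuming the HuggingFace Transformers library and a BERT-style masked LM with a cloze prompt. The template, answer words, and label mapping are illustrative assumptions, not taken from the paper, and the answer words are assumed to be single tokens in the model's vocabulary.
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Prompt addition: a template with an input slot [X] and an answer slot [Z].
template = "[X] Overall, it was a [Z] movie."
# Answer mapping: each permissible answer z maps to a label y.
answer_map = {"excellent": "++", "good": "+", "okay": "~", "bad": "-", "terrible": "--"}

def f_prompt(x):
    # Fill [X] with the input; [Z] becomes the LM's mask token (cloze prompt).
    return template.replace("[X]", x).replace("[Z]", tokenizer.mask_token)

def predict(x):
    inputs = tokenizer(f_prompt(x), return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    # Answer search: argmax over the permissible answer set Z.
    z_hat = max(answer_map, key=lambda z: logits[tokenizer.convert_tokens_to_ids(z)].item())
    # Answer mapping: from the answer z-hat to the output y-hat.
    return answer_map[z_hat]

print(predict("I love this movie."))  # expected to land on "++" or "+"
```
The same skeleton also accommodates *sampling* instead of argmax, by drawing $\boldsymbol{z}$ from the softmax over the answer set rather than taking the maximum.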
<center>
<img src= "https://i.imgur.com/ra5G8Jx.png">
</center>
## Pre-trained Language Models
In this chapter, the paper presents a systematic view of various pre-trained LMs that
1. Organizes them along various axes in a more systematic way
2. Particularly focuses on aspects salient to prompting methods
Below, they are detailed through the lens of:
* Main training objective
* Type of text noising
* Auxiliary training objective
* Attention mask
* Typical architecture
* Preferred application scenarios.
### Training Objectives
The main training objective of a pre-trained LM almost invariably consists of some sort of objective predicting the probability of text $\boldsymbol{x}$.
A popular alternative to standard LM objectives are *denoising* objectives, which apply some noising function $\tilde{\boldsymbol{x}}=f_{\text {noise }}(\boldsymbol{x})$ to the input sentence, then try to predict the original input sentence given this noised text $P(\boldsymbol{x} \ | \ \tilde{\boldsymbol{x}})$.
1. *Corrupted Text Reconstruction* (CTR): restore the processed text to its uncorrupted state, computing the loss only over the noised parts of the input.
2. *Full Text Reconstruction* (FTR): reconstruct the entire text, computing the loss over the entirety of the input regardless of whether it has been noised.
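As a concrete illustration of $f_{\text{noise}}$, here is a minimal sketch (not from the paper) of one possible noising function, random token masking; the masking rate and mask symbol are arbitrary choices.
```python
import random

def f_noise(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Return a noised copy x-tilde of the token sequence x by random masking."""
    return [mask_token if random.random() < mask_prob else tok for tok in tokens]

x = "I love this movie .".split()
x_tilde = f_noise(x)
print(x_tilde)  # e.g. ['I', '[MASK]', 'this', 'movie', '.']
# A CTR objective would compute P(x | x-tilde) only at the masked positions,
# whereas an FTR objective reconstructs the entire sequence.
```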
### Noising Functions
<center>
<img src= "https://i.imgur.com/KNhdmTO.png">
</center>
### Directionality of Representations
In general, there are two widely used ways to calculate the representation of a word in a sequence:
1. Left-to-Right: The representation of each word is calculated based on the word itself and all previous words in the sentence.
2. Bidirectional: The representation of each word is calculated based on all words in the sentence
Some examples of such attention masks are shown below, followed by a small code illustration.
<center>
<img src= "https://i.imgur.com/QMmNH7Q.png">
</center><br>
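To make the two patterns concrete, here is a small illustration (not from the paper) of the corresponding attention-mask matrices for a sequence of length 4, where entry $(i, j) = 1$ means position $i$ may attend to position $j$.
```python
import numpy as np

n = 4  # sequence length
causal_mask = np.tril(np.ones((n, n), dtype=int))  # left-to-right: attend to itself and previous tokens only
full_mask = np.ones((n, n), dtype=int)             # bidirectional: attend to every token

print(causal_mask)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```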
#### Left-to-Right Language Model
Left-to-right LMs (L2R LMs), a variety of *auto-regressive LM*, predict the upcoming words or assign a probability $P(\boldsymbol{x})$ to a sequence of words $\boldsymbol{x} = x_1, ..., x_n$.
The probability is commonly broken down using the chain rule in a left-to-right fashion:
$P(\boldsymbol{x}) = P(x_1) \times \cdots \times P(x_n \ | \ x_1, \ldots, x_{n-1})$
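This chain-rule decomposition can be evaluated with any auto-regressive LM by summing per-token log-probabilities. Below is a minimal sketch assuming the HuggingFace Transformers library and GPT-2; the model choice is illustrative.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text):
    """log P(x): sum of log P(x_i | x_1 ... x_{i-1}) for each next token.
    (The first token is treated as given, since GPT-2 adds no BOS token by default.)"""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                        # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Log-probability of each actual next token given its left context.
    next_token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return next_token_lp.sum().item()

print(sequence_log_prob("Good morning."))
```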
<center>
<img src= "https://i.imgur.com/Vu9H2xI.png">
</center>
#### Masked Language Models
One popular bidirectional objective, used widely in representation learning, is the *masked language model* (MLM), which predicts masked text pieces based on the surrounding context.
For example, $P(x_i \ | \ x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)$ represents the probability of the word $x_i$ given the surrounding context.
#### Prefix and Encoder-Decoder
For conditional text generation tasks such as translation and summarization, an input text $\boldsymbol{x} = x_1, ..., x_n$ is given and the goal is to generate target text $\boldsymbol{y}$. A popular choice is to:
1. Use an encoder with a fully-connected mask to encode the source $\boldsymbol{x}$ first, and then
2. Decode the target $\boldsymbol{y}$ auto-regressively (from left to right).
* Prefix Language Model:
The prefix LM is a left-to-right LM that decodes $\boldsymbol{y}$ conditioned on a prefixed sequence $\boldsymbol{x}$, which is **encoded by the *same* model parameters but with a fully-connected mask.**
* Encoder-decoder:
The encoder-decoder model is a model that uses a left-to-right LM to decode $\boldsymbol{y}$ conditioned on a *separate* encoder for text $\boldsymbol{x}$ with a fully-connected mask. **The encoder and decoder are not shared.**
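As a minimal sketch of the encoder-decoder case, the snippet below assumes the HuggingFace Transformers library and T5 (a separate encoder and decoder); the model name, task prefix, and input text are illustrative.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The encoder reads x with a fully-connected (bidirectional) mask;
# the decoder then generates y auto-regressively, attending to the encoder outputs.
x = "summarize: The movie was long, but the acting and the score were superb."
inputs = tokenizer(x, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
A prefix LM would instead process the concatenation of $\boldsymbol{x}$ and $\boldsymbol{y}$ with the *same* parameters, applying a fully-connected mask over the $\boldsymbol{x}$ portion only.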
### Typical Pre-training Methods
<center>
<img src= "https://i.imgur.com/Wj9db3O.png">
</center>
## Prompt Engineering
<center>
<img src= "https://i.imgur.com/8Bzmdvn.png">
</center>
## Answer Engineering
<center>
<img src= "https://i.imgur.com/Il0fZlL.png">
</center>
## Multi-Prompt Learning
<center>
<img src= "https://i.imgur.com/8RgQAMw.png">
</center>
## Training Strategies for Prompting Methods
<center>
<img src= "https://i.imgur.com/jTy9xPq.png">
</center>
## Applications
## Prompt-relevant Topics
## Challenges
## Meta Analysis
## Conclusion
## Appendix: Pre-trained Language Model Families
<iframe scrolling="no"
width="100%" height="600"
src="https://www.docdroid.net/Aynv3kg/210713586-pdf"
frameborder="0" allowtransparency allowfullscreen>
</iframe>