> [Paper link](https://arxiv.org/pdf/2009.07118.pdf) | [Note link](https://www.sohu.com/a/422484297_500659) | [Code link](https://github.com/timoschick/pet) | NAACL 2021
## Abstract
Unlike GPT-3, a pretrained LM with billions of parameters, this paper shows that **performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count** is several orders of magnitude smaller.
The method converts textual inputs into cloze questions that contain a task description and combines this with **gradient-based optimization**; exploiting unlabeled data gives further improvements.
<center>
<img src = "https://i.imgur.com/UPWC7Ct.png">
</center>
## Introduction
Language modeling is not only a powerful pretraining objective, but many **tasks can be reformulated as cloze questions** (e.g., by appending phrases such as **the correct answer is __**), allowing pretrained LMs to solve them without any or with only very few labeled examples.
Although GPT-3, a pretrained LM with an enormous 175 billion parameters, shows amazing few-shot abilities, this approach has two major drawbacks:
1. It requires a gigantic LM to work well, making it **unusable in many real-world scenarios and resulting in a large carbon footprint**
2. It **does not scale to more than a few examples** as the context window of most LMs is limited to a few hundred tokens
The previous PET (Pattern-Exploiting Training) model addresses these problems by combining the idea of reformulating tasks as cloze questions with regular gradient-based finetuning.
But PET only works when the answers to be **predicted by the LM correspond to a single token** in its vocabulary; this is a severe limitation as many tasks cannot easily be worded that way.
This paper adapts PET to tasks that require **predicting multiple tokens**.
## Related Work
* Enabling LMs to perform zero-shot learning by providing task descriptions
* Reformulating tasks as cloze questions that are understood well by LMs
* Reducing the amount of compute required for few-shot learning
## Pattern-Exploiting Training
* $M$: masked language model (MLM)
* $T$: vocabulary
* $\_ \_ \in T$: the mask token; the set of all token sequences is denoted $T^*$
$q_M^k (t \ | \ \mathrm{\textbf{z}})$ is the probability that $M$ assigns to $t$ at the $k$-th masked position in $\mathrm{\textbf{z}}$, where $\mathrm{\textbf{z}} \in T^*$ contains at least $k$ masks and $t \in T$.
The corresponding logits before the softmax are denoted $s_M^k(t \ | \ \mathrm{\textbf{z}})$; as in PET, the task considered is to map inputs $x \in X$ to outputs $y \in Y$.
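To make the notation concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library; the checkpoint choice and the helper name `masked_scores` are illustrative, not the paper's code) of reading $s_M^k(t \mid \mathrm{\textbf{z}})$ and $q_M^k(t \mid \mathrm{\textbf{z}})$ off an MLM's output:
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative checkpoint; the paper's experiments use ALBERT-xxlarge-v2.
tokenizer = AutoTokenizer.from_pretrained("albert-xxlarge-v2")
model = AutoModelForMaskedLM.from_pretrained("albert-xxlarge-v2")

def masked_scores(z: str) -> torch.Tensor:
    """Return logits s_M^k(. | z) for every masked position k in z (row k-1 of the result)."""
    enc = tokenizer(z, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**enc).logits[0]          # (sequence_length, vocab_size)
    return logits[mask_pos]                      # (number_of_masks, vocab_size)

z = f"Best pizza ever! It was {tokenizer.mask_token}."
q = torch.softmax(masked_scores(z), dim=-1)      # q[k-1, t] = q_M^k(t | z)
top = torch.topk(q[0], k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()), top.values.tolist())
```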
PET requires a set of *pattern-verbalizer pairs* (PVPs), each of which consists of
* a *pattern* $P \ : \ X \rightarrow T^*$
Maps inputs to cloze questions containing a single mask
* a *verbalizer* $v \ : \ Y \rightarrow T$
Maps each output to a single token representing its task-specific meaning in the pattern
A PVP is thus written as $\mathrm{\textbf{p}} = (P, v)$
<center>
<img src = "https://i.imgur.com/VyLFeAb.png">
</center><br>
Conditional probability distribution $q_{\mathrm{\textbf{p}}}$ of $y$ given $x$ is defined as
$$
\begin{equation}
\tag{1}q_{\mathbf{p}}(y \mid x)=\frac{\exp s_{\mathbf{p}}(y \mid x)}{\sum_{y^{\prime} \in Y} \exp s_{\mathbf{p}}\left(y^{\prime} \mid x\right)}
\end{equation}
$$
where $s_{\mathbf{p}}(y \mid x)=s_M^1(v(y) \mid P(x))$ is the raw score of $v(y)$ at the masked position in $P(x)$.
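As an illustration, Eq. (1) is just a softmax over the raw scores of the verbalized outputs at the mask position. The sketch below reuses `masked_scores` and `tokenizer` from the snippet above; the pattern and verbalizer are hypothetical single-token examples, not the paper's exact PVPs.
```python
def q_p(x: str, pattern, verbalizer: dict) -> dict:
    """Eq. (1): softmax over s_p(y | x) = s_M^1(v(y) | P(x)) across all outputs y."""
    logits = masked_scores(pattern(x))[0]        # s_M^1(. | P(x)), the single mask
    ys = list(verbalizer)
    ids = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(verbalizer[y])[0]) for y in ys]
    probs = torch.softmax(logits[ids], dim=-1)   # restrict the softmax to the v(y) scores
    return dict(zip(ys, probs.tolist()))

# Hypothetical PVP for sentiment: P(x) = "x It was [MASK].", v(+1) = "great", v(-1) = "bad"
pattern = lambda x: f"{x} It was {tokenizer.mask_token}."
print(q_p("Best pizza ever!", pattern, {+1: "great", -1: "bad"}))
```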
---
PET enables a combination of multiple PVPs $\mathrm{\textbf{P}} = \{ \mathrm{\textbf{p}}_1, ..., \mathrm{\textbf{p}}_n \}$ as follows:
1. For each PVP $\mathrm{\textbf{p}}$, an MLM is finetuned on training examples $(x, y)$ by minimizing the cross entropy between $y$ and $q_{\mathrm{\textbf{p}}}(y \ | \ x)$; since finetuning on only a few examples can be unstable, three MLMs are trained per pattern.
2. The ensemble of finetuned MLMs is used to annotate a set of unlabeled examples; each unlabeled example $x \in X$ is annotated with soft labels based on the probability distribution
$$
\begin{equation}
\tag{2}q_{\mathbf{P}}(y \mid x) \propto \exp \sum_{\mathbf{p} \in \mathbf{P}} w_{\mathbf{p}} \cdot s_{\mathbf{p}}(y \mid x)
\end{equation}
$$
3. The resulting soft-labeled dataset is used to train a regular sequence classifier by minimizing cross entropy between its output and $q_{\mathrm{\textbf{P}}}$
As steps (2) and (3) above closely resemble knowledge distillation, they also refer to them simply as *distillation*.
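A compact sketch of the distillation steps (2) and (3), assuming each finetuned MLM is wrapped in a scoring function `s_p(y, x)` returning the raw score $s_{\mathbf{p}}(y \mid x)$ and that the weights $w_{\mathbf{p}}$ are given (in PET they are based on accuracy on the training set before finetuning); all names here are illustrative.
```python
import torch

def soft_label(x, scorers, weights, labels) -> torch.Tensor:
    """Eq. (2): combine weighted raw scores of all PVPs into a soft label q_P(. | x)."""
    combined = torch.zeros(len(labels))
    for s_p, w_p in zip(scorers, weights):
        combined += w_p * torch.tensor([s_p(y, x) for y in labels])   # w_p * s_p(y | x)
    return torch.softmax(combined, dim=-1)

# Step (3): the pairs (x, soft_label(x, ...)) for unlabeled x then serve as soft targets
# for training a regular sequence classifier with cross entropy.
```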
### PET with Multiple Masks
An important limitation of PET is that the verbalizer $v$ must map each output to a *single* token, which is impossible for many tasks.
To overcome this, they generalize verbalizers to functions $v \ : \ Y \rightarrow T^*$.
They also generalize PET in that the output space is no longer assumed to be identical for each input:
* For each $x \in X$, they denote with $Y_x \subseteq Y$ the set of possible outputs given $x$ as input
* PVP $\mathrm{\textbf{p}} = (P, v)$
* $l(x) = \max_{y \in Y_x} |v(y)|$ as maximum number of tokens required to express any output in $Y_x$
* $P^k(x)$ to be $P(x)$ with the mask token replaced by $k$ masks
As a running example, they consider the task of binary sentiment classification for restaurant reviews with $Y = \{+1, -1\}$
<center>
<img src = "https://i.imgur.com/XJocmlp.png">
</center>
#### Inference
For $x \in X, y \in Y_x$ and $|v(y)| = k$, they redefine $q_{\mathrm{\textbf{p}}}(y \ | \ x)$ in an autoregressive fashion:
$$
\begin{equation}
\tag{3}q\left(t_1 \ldots t_k \mid \mathbf{z}\right)= \begin{cases}1 & \text { if } k=0 \\ q_M^j\left(t_j \mid \mathbf{z}\right) \cdot q\left(t^{\prime} \mid \mathbf{z}^{\prime}\right) & \text { if } k \geq 1\end{cases}
\end{equation}
$$
with $j=\arg \max _{i=1}^k q_M^i\left(t_i \mid \mathbf{z}\right)$, where $\mathrm{\textbf{z}}^{'}$ is $\mathrm{\textbf{z}}$ with the $j$-th mask replaced by $t_j$, and $t' = t_1 ... t_{j-1} t_{j+1} ... t_{k}$.
The sentiment classification example in Figure 3 illustrates how $q_{\mathrm{\textbf{p}}}(-1 \ | \ x)$ is computed:
1. First, compute the probability of each token in $v(y)$.
2. Then, choose the token with the highest probability, put it in place of the corresponding mask token, and use the resulting cloze question $\mathrm{\textbf{z}}^{'}$ to compute the probability of the remaining token.
The overall score for $y = -1$ is computed as
$$
q_{\mathbf{p}}(-1 \mid x)=q_M^2(\text{ble} \mid \mathbf{z}) \cdot q_M^1\left(\text{terri} \mid \mathbf{z}^{\prime}\right)
$$
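Below is a minimal sketch of this inference procedure (Eq. (3)), reusing `tokenizer` and `model` from the earlier snippet; the pattern string and the subword split of "terrible" are illustrative (the actual tokenization depends on the vocabulary).
```python
def q_p_multi(input_ids: torch.Tensor, target_ids: list) -> float:
    """Eq. (3): probability of a multi-token output, filling the most confident mask first."""
    if not target_ids:                                       # k = 0
        return 1.0
    with torch.no_grad():
        logits = model(input_ids=input_ids.unsqueeze(0)).logits[0]
    mask_pos = (input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    probs = torch.softmax(logits[mask_pos], dim=-1)          # q_M^i(. | z) for each remaining mask
    scores = [probs[i, t].item() for i, t in enumerate(target_ids)]
    j = max(range(len(scores)), key=scores.__getitem__)      # j = argmax_i q_M^i(t_i | z)
    filled = input_ids.clone()
    filled[mask_pos[j]] = target_ids[j]                      # z': the j-th mask replaced by t_j
    rest = target_ids[:j] + target_ids[j + 1:]               # t' = t_1 ... t_{j-1} t_{j+1} ... t_k
    return scores[j] * q_p_multi(filled, rest)               # q_M^j(t_j | z) * q(t' | z')

x = "Best pizza ever!"
target = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("terrible"))   # v(-1), k tokens
cloze = f"{x} It was {' '.join([tokenizer.mask_token] * len(target))}."    # P^k(x)
enc = tokenizer(cloze, return_tensors="pt")
print(q_p_multi(enc["input_ids"][0], target))                              # ~ q_p(-1 | x)
```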
#### Training
Computing $q_{\mathrm{\textbf{p}}}(y \ | \ x)$ as equation $(3)$ for each training example $(x, y)$.
To enable computation of all required probabilities in a single forward pass:
1. Always inserting the maximum number of mask tokens required to express any output
2. For each $y^{'} \in Y_x$, predicting all tokens in $v(y^{'}) = t_1, ... t_k$ in parallel, where the model’s predictions for all $l(x) - k$ superfluous mask tokens are simply ignored:
$$
\begin{equation}
\tag{4}\tilde{q}_{\mathbf{p}}\left(y^{\prime} \mid x\right)=\prod_{i=1}^k q_M^i\left(t_i \mid P^{l(x)}(x)\right)
\end{equation}
$$
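A rough sketch of Eq. (4), reusing the earlier `tokenizer` and `model`; the pattern string is again illustrative, and no `torch.no_grad()` is used because gradients must flow through these probabilities during finetuning. It returns the log of $\tilde{q}_{\mathbf{p}}(y^{'} \mid x)$, which is the quantity that appears in the loss below.
```python
def log_q_tilde(x: str, v_y_tokens: list, l_x: int) -> torch.Tensor:
    """Eq. (4): single-forward-pass approximation; superfluous masks are simply ignored."""
    cloze = f"{x} It was {' '.join([tokenizer.mask_token] * l_x)}."    # P^{l(x)}(x), illustrative pattern
    enc = tokenizer(cloze, return_tensors="pt")
    logits = model(**enc).logits[0]                                    # gradients flow through this
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    log_probs = torch.log_softmax(logits[mask_pos], dim=-1)            # log q_M^i(. | P^{l(x)}(x))
    ids = tokenizer.convert_tokens_to_ids(v_y_tokens)                  # t_1 ... t_k with k <= l(x)
    return sum(log_probs[i, t] for i, t in enumerate(ids))             # masks k+1 .. l(x) are ignored
```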
For the loss function, they opt for a [multiclass hinge loss](https://stats.stackexchange.com/questions/336205/where-does-the-multi-class-hinge-loss-come-from#:~:text=The%20%28multi-class%29%20hinge%20loss%20can%20be%20understood%20as,some%20margin%20%CE%94%3E0%2C%20otherwise%20a%20loss%20is%20incurred.) and minimize:
$$
\begin{equation}
\tag{5}\sum_{y^{\prime} \in Y_x} \max \left(0 ; 1-\log \tilde{q}_{\mathbf{p}}(y \mid x)+\log \tilde{q}_{\mathbf{p}}\left(y^{\prime} \mid x\right)\right)
\end{equation}
$$
That is, the difference between the log probability of $y$ and the log probability of any output $y^{'} \in Y_x \ \backslash \ \{y\}$ is required to be at least $1$.
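Putting the two pieces together, here is a hedged sketch of the objective in Eq. (5), built on `log_q_tilde` from the previous block; the verbalizer is the hypothetical sentiment one used throughout these sketches.
```python
def hinge_loss(x: str, y, Y_x: list, verbalizer) -> torch.Tensor:
    """Eq. (5): margin of 1 between log q~_p(y | x) and log q~_p(y' | x) for every other y'."""
    l_x = max(len(verbalizer(y2)) for y2 in Y_x)                       # l(x)
    log_q = {y2: log_q_tilde(x, verbalizer(y2), l_x) for y2 in Y_x}
    zero = torch.tensor(0.0)
    return sum(torch.maximum(zero, 1 - log_q[y] + log_q[y2])
               for y2 in Y_x if y2 != y)

# Hypothetical usage: v(+1) = tokens of "great", v(-1) = tokens of "terrible"
verbalizer = lambda y: tokenizer.tokenize("great" if y == +1 else "terrible")
loss = hinge_loss("Best pizza ever!", +1, [+1, -1], verbalizer)
loss.backward()   # gradients reach the MLM parameters via log_q_tilde
```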
## Experiments
This paper compares PET and GPT-3 on SuperGLUE, a natural language understanding benchmark consisting of eight challenging tasks.
For training data, they create new training sets by randomly selecting 32 examples for each task using a fixed random seed.
Also, they additionally create sets of up to 20,000 unlabeled examples for each task; this is done by removing all labels from the original training sets.
The resulting sets of training examples and unlabeled examples are released as [*FewGLUE*](https://github.com/timoschick/fewglue).
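The data construction itself is simple; below is a rough sketch of the subsampling described above (fixed seed, 32 labeled examples, up to 20,000 unlabeled examples obtained by stripping labels), not the authors' actual FewGLUE script, and the seed value is illustrative.
```python
import random

def fewglue_split(train_examples, seed=42, n_labeled=32, n_unlabeled=20_000):
    """Subsample a SuperGLUE training set into FewGLUE-style labeled and unlabeled sets."""
    rng = random.Random(seed)                                      # fixed random seed per task
    shuffled = list(train_examples)
    rng.shuffle(shuffled)
    labeled = shuffled[:n_labeled]
    unlabeled = [{k: v for k, v in ex.items() if k != "label"}     # remove all labels
                 for ex in shuffled[:n_unlabeled]]
    return labeled, unlabeled
```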
### Tasks
* BoolQ
A yes/no question answering task: given a passage $p$ and a question $q$, decide the answer
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/6x8zLCh.png">
</td>
<td>
True → yes / true, False → no / false (depending on the pattern)
</td>
</tr>
</table>
* CB / RTE
Textual entailment
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/3qxxQQj.png">
</td>
<td>
yes, maybe, no
</td>
</tr>
</table>
* COPA
Determine the *cause* or *effect* of the premise given two options $c_1$ and $c_2$
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/1HeCDPr.png">
</td>
<td>
identity function
</td>
</tr>
</table>
* WIC
Given a word $w$ and two sentences $s_1$ and $s_2$ in which it occurs, the task is to decide if $w$ is used with the same sense in both sentences.
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/pFTipRY.png">
</td>
<td>
yes / no (first two patterns), b / 2 (third pattern)
</td>
</tr>
</table>
* WSC
Given a sentence with a marked pronoun $p$ and a noun $n$, the task is to determine whether $p$ refers to $n$.
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/woFyHAH.png">
</td>
<td>
identity function
</td>
</tr>
</table>
* MultiRC
Given a passage $p$, a question $q$ and an answer candidate $a$, the task is to decide whether $a$ is a correct answer for $q$.
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/yT06VXK.png">
</td>
<td>
True → yes / true, False → no / false (depending on the pattern)
</td>
</tr>
</table>
* ReCoRD
Given a passage $p$ and a cloze question $q$, the task is to decide which of a given set of answer candidates is the correct replacement for the placeholder in the cloze question.
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
the concatenation of $p$ and $q$
</td>
<td>
identity function
</td>
</tr>
</table>
With **only one PVP** for ReCoRD, there is **no need to perform knowledge distillation**, so the resulting model is directly used as the final classifier.
### Setups
As the underlying LM for PET, they choose ALBERT-xxlarge-v2.
They run PET on the FewGLUE training sets for all SuperGLUE tasks.

## Analysis
### Patterns
<center>
<img src = "https://i.imgur.com/4sPHs5a.png">
</center>
### Unlabeled Data Usage
<center>
<img src = "https://i.imgur.com/Dtv68IO.png">
</center>
### Labeled Data Usage
<center>
<img src = "https://i.imgur.com/RV7x7AR.png">
</center>
<br>
<center>
<img src = "https://i.imgur.com/zLowaYB.png">
</center>
### Model Type
<center>
<img src = "https://i.imgur.com/I3P5nsz.png">
</center>
### PET with Multiple Masks
<center>
<img src = "https://i.imgur.com/ED4ogFl.png">
</center>
### Training Examples
<center>
<img src = "https://i.imgur.com/eCPtpVD.png">
</center>
## Conclusion
This paper proposes a simple yet effective modification of PET, **enabling its use for tasks that require predicting multiple tokens.**
The strong performance of PET combined with ALBERT is attributed to:
- The possibility to concurrently use multiple patterns for transforming examples into cloze questions
- The ability to compensate for patterns that are difficult to understand
- The usage of labeled data to perform parameter updates
Using PET, it is possible to achieve few-shot text classification performance similar to GPT-3 on SuperGLUE with LMs that have three orders of magnitude fewer parameters.
Most of all, **PET reduces environmental impact immensely** and leads to a much smaller carbon footprint.