> [Paper link](https://arxiv.org/pdf/2009.07118.pdf) | [Note link](https://www.sohu.com/a/422484297_500659) | [Code link](https://github.com/timoschick/pet) | NAACL 2021
## Abstract
Unlike GPT-3, a pretrained LM with billions of parameters, this paper shows that **performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count** is several orders of magnitude smaller.
The method converts textual inputs into cloze questions that contain a task description and combines this with **gradient-based optimization**; exploiting unlabeled data gives further improvements.
<center>
<img src = "https://i.imgur.com/UPWC7Ct.png">
</center>
## Introduction
Language modeling is not only a powerful pretraining objective, but many **tasks can be reformulated as cloze questions** (e.g., by appending phrases such as **the correct answer is __**), allowing pretrained LMs to solve them without any or with only very few labeled examples.
Although GPT-3, a pretrained LM with an enormous 175 billion parameters, shows amazing few-shot abilities, this approach has two major drawbacks:
1. It requires a gigantic LM to work well, making it **unusable in many real-world scenarios and resulting in a large carbon footprint**
2. It **does not scale to more than a few examples** as the context window of most LMs is limited to a few hundred tokens
The previous PET (Pattern-Exploiting Training) model addresses these problems by combining the idea of reformulating tasks as cloze questions with regular gradient-based finetuning.
But PET only works when the answers to be **predicted by the LM correspond to a single token** in its vocabulary; this is a severe limitation as many tasks cannot easily be worded that way.
This paper adapts PET to tasks that require **predicting multiple tokens**.
## Related Work
* Enabling LMs to perform zero-shot learning by providing task descriptions
* Reformulating tasks as cloze questions that are understood well by LMs
* Reducing the amount of compute required for few-shot learning
## Pattern-Exploiting Training
* $M$: masked language model (MLM)
* $T$: vocabulary
* $\_ \_ \in T$: the mask token; the set of all token sequences is denoted $T^*$
$q_M^k (t \ | \ \mathrm{\textbf{z}})$ is the probability that $M$ assigns to $t$ at the $k$-th masked position in $\mathrm{\textbf{z}}$, where $\mathrm{\textbf{z}} \in T^*$ contains at least $k$ masks and $t \in T$.
The corresponding logits before the softmax are denoted $s_M^k(t \ | \ \mathrm{\textbf{z}})$; as in PET, the task considered is to map inputs $x \in X$ to outputs $y \in Y$.
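To make the notation concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library; the checkpoint choice and the helper name `masked_scores` are illustrative, not the paper's code) of reading $s_M^k(t \mid \mathrm{\textbf{z}})$ and $q_M^k(t \mid \mathrm{\textbf{z}})$ off an MLM's output:
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative checkpoint; the paper's experiments use ALBERT-xxlarge-v2.
tokenizer = AutoTokenizer.from_pretrained("albert-xxlarge-v2")
model = AutoModelForMaskedLM.from_pretrained("albert-xxlarge-v2")

def masked_scores(z: str) -> torch.Tensor:
    """Return logits s_M^k(. | z) for every masked position k in z (row k-1 of the result)."""
    enc = tokenizer(z, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**enc).logits[0]          # (sequence_length, vocab_size)
    return logits[mask_pos]                      # (number_of_masks, vocab_size)

z = f"Best pizza ever! It was {tokenizer.mask_token}."
q = torch.softmax(masked_scores(z), dim=-1)      # q[k-1, t] = q_M^k(t | z)
top = torch.topk(q[0], k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()), top.values.tolist())
```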
PET requires a set of *pattern-verbalizer pairs* (PVPs), each of which consists of
* a *pattern* $P \ : \ X \rightarrow T^*$
Maps inputs to cloze questions containing a single mask
* a *verbalizer* $v \ : \ Y \rightarrow T$
Maps each output to a single token representing its task-specific meaning in the pattern
A PVP is thus written as $\mathrm{\textbf{p}} = (P, v)$
<center>
<img src = "https://i.imgur.com/VyLFeAb.png">
</center><br>
Conditional probability distribution $q_{\mathrm{\textbf{p}}}$ of $y$ given $x$ is defined as
$$
\begin{equation}
\tag{1}q_{\mathbf{p}}(y \mid x)=\frac{\exp s_{\mathbf{p}}(y \mid x)}{\sum_{y^{\prime} \in Y} \exp s_{\mathbf{p}}\left(y^{\prime} \mid x\right)}
\end{equation}
$$
where $s_{\mathbf{p}}(y \mid x)=s_M^1(v(y) \mid P(x))$ is the raw score of $v(y)$ at the masked position in $P(x)$.
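As an illustration, Eq. (1) is just a softmax over the raw scores of the verbalized outputs at the mask position. The sketch below reuses `masked_scores` and `tokenizer` from the snippet above; the pattern and verbalizer are hypothetical single-token examples, not the paper's exact PVPs.
```python
def q_p(x: str, pattern, verbalizer: dict) -> dict:
    """Eq. (1): softmax over s_p(y | x) = s_M^1(v(y) | P(x)) across all outputs y."""
    logits = masked_scores(pattern(x))[0]        # s_M^1(. | P(x)), the single mask
    ys = list(verbalizer)
    ids = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(verbalizer[y])[0]) for y in ys]
    probs = torch.softmax(logits[ids], dim=-1)   # restrict the softmax to the v(y) scores
    return dict(zip(ys, probs.tolist()))

# Hypothetical PVP for sentiment: P(x) = "x It was [MASK].", v(+1) = "great", v(-1) = "bad"
pattern = lambda x: f"{x} It was {tokenizer.mask_token}."
print(q_p("Best pizza ever!", pattern, {+1: "great", -1: "bad"}))
```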
---
PET enables a combination of multiple PVPs $\mathrm{\textbf{P}} = \{ \mathrm{\textbf{p}}_1, ..., \mathrm{\textbf{p}}_n \}$ as follows:
1. For each PVP $\mathrm{\textbf{p}}$, an MLM is finetuned on training examples $(x, y)$ by minimizing the cross entropy between $y$ and $q_{\mathrm{\textbf{p}}}(y \ | \ x)$; since finetuning on only a few examples can be unstable, three MLMs are trained per pattern.
2. The ensemble of finetuned MLMs is used to annotate a set of unlabeled examples; each unlabeled example $x \in X$ is annotated with soft labels based on the probability distribution
$$
\begin{equation}
\tag{2}q_{\mathbf{P}}(y \mid x) \propto \exp \sum_{\mathbf{p} \in \mathbf{P}} w_{\mathbf{p}} \cdot s_{\mathbf{p}}(y \mid x)
\end{equation}
$$
3. The resulting soft-labeled dataset is used to train a regular sequence classifier by minimizing cross entropy between its output and $q_{\mathrm{\textbf{P}}}$
As steps (2) and (3) above closely resemble knowledge distillation, they also refer to them simply as *distillation*.
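A compact sketch of the distillation steps (2) and (3), assuming each finetuned MLM is wrapped in a scoring function `s_p(y, x)` returning the raw score $s_{\mathbf{p}}(y \mid x)$ and that the weights $w_{\mathbf{p}}$ are given (in PET they are based on accuracy on the training set before finetuning); all names here are illustrative.
```python
import torch

def soft_label(x, scorers, weights, labels) -> torch.Tensor:
    """Eq. (2): combine weighted raw scores of all PVPs into a soft label q_P(. | x)."""
    combined = torch.zeros(len(labels))
    for s_p, w_p in zip(scorers, weights):
        combined += w_p * torch.tensor([s_p(y, x) for y in labels])   # w_p * s_p(y | x)
    return torch.softmax(combined, dim=-1)

# Step (3): the pairs (x, soft_label(x, ...)) for unlabeled x then serve as soft targets
# for training a regular sequence classifier with cross entropy.
```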
### PET with Multiple Masks
An important limitation of PET is that the verbalizer $v$ must map each output to a *single* token, which is impossible for many tasks.
To overcome this, they generalize verbalizers to functions $v \ : \ Y \rightarrow T^*$.
They also generalize PET in that the output space is no longer assumed to be identical for each input:
* For each $x \in X$, they denote with $Y_x \subseteq Y$ the set of possible outputs given $x$ as input
* PVP $\mathrm{\textbf{p}} = (P, v)$
* $l(x) = \max_{y \in Y_x} |v(y)|$ as maximum number of tokens required to express any output in $Y_x$
* $P^k(x)$ to be $P(x)$ with the mask token replaced by $k$ masks
As a running example, they consider the task of binary sentiment classification for restaurant reviews with $Y = \{+1, -1\}$
<center>
<img src = "https://i.imgur.com/XJocmlp.png">
</center>
#### Inference
For $x \in X, y \in Y_x$ and $|v(y)| = k$, they redefine $q_{\mathrm{\textbf{p}}}(y \ | \ x)$ in an autoregressive fashion:
$$
\begin{equation}
\tag{3}q\left(t_1 \ldots t_k \mid \mathbf{z}\right)= \begin{cases}1 & \text { if } k=0 \\ q_M^j\left(t_j \mid \mathbf{z}\right) \cdot q\left(t^{\prime} \mid \mathbf{z}^{\prime}\right) & \text { if } k \geq 1\end{cases}
\end{equation}
$$
with $j=\arg \max _{i=1}^k q_M^i\left(t_i \mid \mathbf{z}\right)$, where $\mathrm{\textbf{z}}^{'}$ is $\mathrm{\textbf{z}}$ with the $j$-th mask replaced by $t_j$, and $t' = t_1 ... t_{j-1} t_{j+1} ... t_{k}$.
The sentiment classification example in Figure 3 illustrates how $q_{\mathrm{\textbf{p}}}(-1 \ | \ x)$ is computed:
1. First, compute the probability of each token in $v(y)$.
2. Then, choose the token with the highest probability, put it in place of the corresponding mask token, and use the resulting cloze question $\mathrm{\textbf{z}}^{'}$ to compute the probability of the remaining token.
The overall score for $y = -1$ is computed as
$$
q_{\mathbf{p}}(-1 \mid x)=q_M^2(\text{ble} \mid \mathbf{z}) \cdot q_M^1\left(\text{terri} \mid \mathbf{z}^{\prime}\right)
$$
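Below is a minimal sketch of this inference procedure (Eq. (3)), reusing `tokenizer` and `model` from the earlier snippet; the pattern string and the subword split of "terrible" are illustrative (the actual tokenization depends on the vocabulary).
```python
def q_p_multi(input_ids: torch.Tensor, target_ids: list) -> float:
    """Eq. (3): probability of a multi-token output, filling the most confident mask first."""
    if not target_ids:                                       # k = 0
        return 1.0
    with torch.no_grad():
        logits = model(input_ids=input_ids.unsqueeze(0)).logits[0]
    mask_pos = (input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    probs = torch.softmax(logits[mask_pos], dim=-1)          # q_M^i(. | z) for each remaining mask
    scores = [probs[i, t].item() for i, t in enumerate(target_ids)]
    j = max(range(len(scores)), key=scores.__getitem__)      # j = argmax_i q_M^i(t_i | z)
    filled = input_ids.clone()
    filled[mask_pos[j]] = target_ids[j]                      # z': the j-th mask replaced by t_j
    rest = target_ids[:j] + target_ids[j + 1:]               # t' = t_1 ... t_{j-1} t_{j+1} ... t_k
    return scores[j] * q_p_multi(filled, rest)               # q_M^j(t_j | z) * q(t' | z')

x = "Best pizza ever!"
target = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("terrible"))   # v(-1), k tokens
cloze = f"{x} It was {' '.join([tokenizer.mask_token] * len(target))}."    # P^k(x)
enc = tokenizer(cloze, return_tensors="pt")
print(q_p_multi(enc["input_ids"][0], target))                              # ~ q_p(-1 | x)
```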
#### Training
Computing $q_{\mathrm{\textbf{p}}}(y \ | \ x)$ as equation $(3)$ for each training example $(x, y)$.
To enable computation of all required probabilities in a single forward pass:
1. Always inserting the maximum number of mask tokens required to express any output
2. For each $y^{'} \in Y_x$, predicting all tokens in $v(y^{'}) = t_1, ... t_k$ in parallel, where the model’s predictions for all $l(x) - k$ superfluous mask tokens are simply ignored:
$$
\begin{equation}
\tag{4}\tilde{q}_{\mathbf{p}}\left(y^{\prime} \mid x\right)=\prod_{i=1}^k q_M^i\left(t_i \mid P^{l(x)}(x)\right)
\end{equation}
$$
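A rough sketch of Eq. (4), reusing the earlier `tokenizer` and `model`; the pattern string is again illustrative, and no `torch.no_grad()` is used because gradients must flow through these probabilities during finetuning. It returns the log of $\tilde{q}_{\mathbf{p}}(y^{'} \mid x)$, which is the quantity that appears in the loss below.
```python
def log_q_tilde(x: str, v_y_tokens: list, l_x: int) -> torch.Tensor:
    """Eq. (4): single-forward-pass approximation; superfluous masks are simply ignored."""
    cloze = f"{x} It was {' '.join([tokenizer.mask_token] * l_x)}."    # P^{l(x)}(x), illustrative pattern
    enc = tokenizer(cloze, return_tensors="pt")
    logits = model(**enc).logits[0]                                    # gradients flow through this
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    log_probs = torch.log_softmax(logits[mask_pos], dim=-1)            # log q_M^i(. | P^{l(x)}(x))
    ids = tokenizer.convert_tokens_to_ids(v_y_tokens)                  # t_1 ... t_k with k <= l(x)
    return sum(log_probs[i, t] for i, t in enumerate(ids))             # masks k+1 .. l(x) are ignored
```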
For the loss function, they opt for a [multiclass hinge loss](https://stats.stackexchange.com/questions/336205/where-does-the-multi-class-hinge-loss-come-from#:~:text=The%20%28multi-class%29%20hinge%20loss%20can%20be%20understood%20as,some%20margin%20%CE%94%3E0%2C%20otherwise%20a%20loss%20is%20incurred.) and minimize:
$$
\begin{equation}
\tag{5}\sum_{y^{\prime} \in Y_x} \max \left(0 ; 1-\log \tilde{q}_{\mathbf{p}}(y \mid x)+\log \tilde{q}_{\mathbf{p}}\left(y^{\prime} \mid x\right)\right)
\end{equation}
$$
That is, the difference between the log probability of $y$ and the log probability of any output $y^{'} \in Y_x \ \backslash \ \{y\}$ is required to be at least $1$.
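Putting the two pieces together, here is a hedged sketch of the objective in Eq. (5), built on `log_q_tilde` from the previous block; the verbalizer is the hypothetical sentiment one used throughout these sketches.
```python
def hinge_loss(x: str, y, Y_x: list, verbalizer) -> torch.Tensor:
    """Eq. (5): margin of 1 between log q~_p(y | x) and log q~_p(y' | x) for every other y'."""
    l_x = max(len(verbalizer(y2)) for y2 in Y_x)                       # l(x)
    log_q = {y2: log_q_tilde(x, verbalizer(y2), l_x) for y2 in Y_x}
    zero = torch.tensor(0.0)
    return sum(torch.maximum(zero, 1 - log_q[y] + log_q[y2])
               for y2 in Y_x if y2 != y)

# Hypothetical usage: v(+1) = tokens of "great", v(-1) = tokens of "terrible"
verbalizer = lambda y: tokenizer.tokenize("great" if y == +1 else "terrible")
loss = hinge_loss("Best pizza ever!", +1, [+1, -1], verbalizer)
loss.backward()   # gradients reach the MLM parameters via log_q_tilde
```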
## Experiments
This paper compares PET and GPT-3 on SuperGLUE, a natural language understanding benchmark consisting of eight challenging tasks.
For training data, they create new training sets by randomly selecting 32 examples for each task using a fixed random seed.
Also, they additionally create sets of up to 20,000 unlabeled examples for each task; this is done by removing all labels from the original training sets.
The resulting sets of training examples and unlabeled examples are released as [*FewGLUE*](https://github.com/timoschick/fewglue).
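The data construction itself is simple; below is a rough sketch of the subsampling described above (fixed seed, 32 labeled examples, up to 20,000 unlabeled examples obtained by stripping labels), not the authors' actual FewGLUE script, and the seed value is illustrative.
```python
import random

def fewglue_split(train_examples, seed=42, n_labeled=32, n_unlabeled=20_000):
    """Subsample a SuperGLUE training set into FewGLUE-style labeled and unlabeled sets."""
    rng = random.Random(seed)                                      # fixed random seed per task
    shuffled = list(train_examples)
    rng.shuffle(shuffled)
    labeled = shuffled[:n_labeled]
    unlabeled = [{k: v for k, v in ex.items() if k != "label"}     # remove all labels
                 for ex in shuffled[:n_unlabeled]]
    return labeled, unlabeled
```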
### Tasks
* BoolQ
A yes/no question answering task: given a passage $p$ and a question $q$, decide the answer
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/6x8zLCh.png">
</td>
<td>
True → yes / true, False → no / false (depending on the pattern)
</td>
</tr>
</table>
* CB / RTE
Textual entailment
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/3qxxQQj.png">
</td>
<td>
yes, maybe, no
</td>
</tr>
</table>
* COPA
Determine the *cause* or *effect* of the premise given two options $c_1$ and $c_2$
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/1HeCDPr.png">
</td>
<td>
identity function
</td>
</tr>
</table>
* WIC
Given a word $w$ and two sentences $s_1$ and $s_2$ in which it occurs, the task is to decide if $w$ is used with the same sense in both sentences.
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/pFTipRY.png">
</td>
<td>
yes / no (first two patterns), b / 2 (third pattern)
</td>
</tr>
</table>
* WSC
Given a sentence with a marked pronoun $p$ and a noun $n$, the task is to determine whether $p$ refers to $n$.
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/woFyHAH.png">
</td>
<td>
identity function
</td>
</tr>
</table>
* MultiRC
Given a passage $p$, a question $q$ and an answer candidate $a$, the task is to decide whether $a$ is a correct answer for $q$.
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/yT06VXK.png">
</td>
<td>
True → yes / true, False → no / false (depending on the pattern)
</td>
</tr>
</table>
* ReCoRD
Given a passage $p$ and a cloze question $q$, the task is to decide which of a given set of answer candidates is the correct replacement for the placeholder in the cloze question.
<table>
<tr>
<th>
Patterns
</th>
<th>
Verbalizer
</th>
</tr>
<tr>
<td>
the concatenation of $p$ and $q$
</td>
<td>
identity function
</td>
</tr>
</table>
With **only one PVP** for ReCoRD, there is **no need to perform knowledge distillation**, so the resulting model is directly used as the final classifier.
### Setups
As the underlying LM for PET, they choose ALBERT-xxlarge-v2.
They run PET on the FewGLUE training sets for all SuperGLUE tasks.

## Analysis
### Patterns
<center>
<img src = "https://i.imgur.com/4sPHs5a.png">
</center>
### Unlabeled Data Usage
<center>
<img src = "https://i.imgur.com/Dtv68IO.png">
</center>
### Labeled Data Usage
<center>
<img src = "https://i.imgur.com/RV7x7AR.png">
</center>
<br>
<center>
<img src = "https://i.imgur.com/zLowaYB.png">
</center>
### Model Type
<center>
<img src = "https://i.imgur.com/I3P5nsz.png">
</center>
### PET with Multiple Masks
<center>
<img src = "https://i.imgur.com/ED4ogFl.png">
</center>
### Training Examples
<center>
<img src = "https://i.imgur.com/eCPtpVD.png">
</center>
## Conclusion
This paper proposes a simple yet effective modification of PET, **enabling its use for tasks that require predicting multiple tokens.**
The strong performance of PET combined with ALBERT is attributed to:
- The possibility to concurrently use multiple patterns for transforming examples into cloze questions
- The ability to compensate for patterns that are difficult to understand
- The usage of labeled data to perform parameter updates
Using PET, it is possible to achieve few-shot text classification performance similar to GPT-3 on SuperGLUE with LMs that have three orders of magnitude fewer parameters.
Most of all, **PET reduces environmental impact immensely** and leads to a much smaller carbon footprint.