Albert Webson, Ellie Pavlick
[arxiv](https://arxiv.org/abs/2109.01247)
## Introduction
> [CLS]No weapons of mass destruction found in Iraq yet.
> [SEP]Weapons of mass destruction found in Iraq.
>
> 0 or 1?
- This setup is similar to the pretrain-and-fine-tune setup which has dominated NLP in recent years
- Models are asked to classify a sentence representation (e.g., a CLS token) into some arbitrary dimensions of a one-hot vector
> <u>Given that</u> "No weapons of mass destruction found in Iraq yet.", <u>is it definitely correct that</u> "Weapons of mass destruction found in Iraq."<u>?</u>
- With such a prompt, models can perform the task more accurately and without needing many examples to figure out what the task is
- Reformatting NLP tasks with prompts such as the underlined text above has dramatically improved zero-shot and few-shot performance over traditional fine-tuned models
- Such results naturally give rise to the hypothesis that the extra prompt text included within each input example serves as semantically meaningful task instructions
- i.e., that prompts help models learn faster in the way task instructions help humans learn faster
- In this paper, they find that in most cases models learn just as fast when given irrelevant or misleading templates as they do when given instructively good templates
- Despite prompt-based models’ dramatic improvement in zero-shot and few-shot learning, there is limited evidence that this improvement derives from models understanding task instructions in ways analogous to humans’ use of task instructions
## Related Work
### Discrete Prompts (hard prompt)
- Discrete prompts reformat each example with some template text (see the sketch below)
- {sent} In summary, the restaurant is [prediction]
- The predicted mask word is converted to a class prediction by a predefined mapping
- e.g., {“great” → positive, “terrible” → negative}
- The prompts can be manually written or automatically generated
- This approach typically tunes all parameters of the model
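
A minimal Python sketch of this pipeline, using the template and verbalizer from the examples above; the function names are illustrative, not from the paper:

```python
# Discrete (hard) prompt: wrap the raw example in template text, then map the
# word the masked LM predicts back to a class label via a fixed verbalizer.
TEMPLATE = "{sent} In summary, the restaurant is [MASK]."
VERBALIZER = {"great": "positive", "terrible": "negative"}

def build_prompt(sent: str) -> str:
    """Reformat a raw example with the template text."""
    return TEMPLATE.format(sent=sent)

def to_class(predicted_word: str) -> str:
    """Convert the word predicted at the mask position into a class label."""
    return VERBALIZER[predicted_word]

prompt = build_prompt("The pasta was cold and the service was slow.")
# Suppose the masked LM fills the mask with "terrible":
print(to_class("terrible"))  # -> "negative"
```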
### Priming (in-context learning)
- Priming prepends *k* priming examples to the evaluation example (see the sketch below)
- Question: {$\text{sent}_1$} True or false? {$\text{label}_1$} ...
Question: {$\text{sent}_k$} True or false? {$\text{label}_k$}
Question: {eval_sent} True or false? [prediction]
- Parameters do not receive gradient updates based on those examples
- [Brown et al. (2020)](https://arxiv.org/abs/2005.14165) report that it only performs well on the largest GPT-3 model
- and the GPT-3 API is costly and difficult to use for academic research
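
A rough sketch of how a primed input could be assembled; the helper below is hypothetical, the wording follows the template above, and no gradient updates are involved:

```python
# Priming (in-context learning): prepend k labeled examples as plain text and
# let the frozen model complete the answer for the evaluation example.
def build_priming_input(priming_examples, eval_sent):
    """priming_examples: list of (sentence, label) pairs with label in {"True", "False"}."""
    lines = [f"Question: {sent} True or false? {label}" for sent, label in priming_examples]
    lines.append(f"Question: {eval_sent} True or false?")  # the model predicts the answer here
    return "\n".join(lines)

demo = [("No weapons of mass destruction found in Iraq yet.", "False")]
print(build_priming_input(demo, "Weapons of mass destruction found in Iraq."))
```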
### Continuous Prompts (soft prompt)
- Continuous prompts prepend special tokens to each example, optionally initialized with word embeddings (see the sketch below)
- Tokens can be updated arbitrarily such that the final embeddings often *do not* correspond to any real word in the vocabulary
- Efficiently tunes a much smaller set of model parameters
- Harder to study their semantics, and it is not clear if continuous prompts serve as task-specific instructions or simply more efficient model parameters
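
A minimal PyTorch sketch of the idea, assuming a generic encoder whose backbone is kept frozen; the shapes and initialization are illustrative, not the exact recipe of any particular soft-prompt method:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable 'virtual token' embeddings prepended to the input embeddings."""
    def __init__(self, n_prompt_tokens: int, embed_dim: int):
        super().__init__()
        # These vectors are updated freely and need not match any real word embedding.
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

# Usage: freeze the backbone model and optimize only soft_prompt.parameters().
soft_prompt = SoftPrompt(n_prompt_tokens=20, embed_dim=768)
dummy_embeds = torch.randn(2, 16, 768)
print(soft_prompt(dummy_embeds).shape)  # torch.Size([2, 36, 768])
```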
## Overall Setup
### Baseline Model
- In preliminary experiments, they fine-tuned and prompt-tuned BERT, DistilBERT, RoBERTa, ALBERT, and T5.
- They find **ALBERT** consistently yields the best performance, so they use it as their baseline model
### Instruction-Tuned Model
- They additionally experiment with T0, a recently proposed instruction-tuned model which is trained on over 60 datasets formatted with hundreds of manually written prompts
- They experiment with both sizes of T0 (3B and 11B)
### Very Large Model
- They experiment with the largest GPT-3 (175B) via priming
### Data
- They use **Recognizing Textual Entailment (RTE)**, a series of expert-annotated NLI datasets
- They use the SuperGLUE collection of RTE (i.e., RTE1, 2, 3, and 5; all converted to **binary classification**)
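
For reference, the SuperGLUE version of RTE can be loaded with the Hugging Face `datasets` library (assuming that dependency is available; the labels are already binary):

```python
from datasets import load_dataset

# SuperGLUE RTE: premise/hypothesis pairs with binary labels.
rte = load_dataset("super_glue", "rte")
example = rte["train"][0]
print(example["premise"])
print(example["hypothesis"])
print(example["label"])  # 0 = entailment, 1 = not_entailment
```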
## Effect of Templates
- This experiment asks whether models understand prompts as meaningful task instructions, analogous to how humans would use them
- They write 5 categories of templates (illustrated in the sketch below)
- **Instructive**: how we would describe the NLI task to a human who has never seen this task before
- **Misleading-Moderate**: instruct the models to perform a task related or tangential to NLI
- **Misleading-Extreme**: instruct the models to perform a task unrelated to NLI
- **Irrelevant**: concatenate the premise, a sentence unrelated to any NLP task, and the hypothesis
- **Null**: concatenate the premise and hypothesis with no template text at all (control condition)

- Key Intuition: If models understand prompts, we'd expect their learning speeds to be:
- instructive > irrelevant
- instructive > misleading-moderate
- instructive > misleading-extreme
- instructive > null (no instruction; control condition)
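
An illustrative sketch of the five categories; only the instructive wording follows the prompt quoted in the introduction, and the other templates are made-up stand-ins rather than the paper's actual wording:

```python
# Illustrative templates for each category; {premise} and {hypothesis} are the
# two RTE sentences. Only "instructive" mirrors the wording quoted earlier.
TEMPLATES = {
    "instructive":         'Given that "{premise}", is it definitely correct that "{hypothesis}"?',
    "misleading-moderate": '"{premise}" Does the above paraphrase "{hypothesis}"?',              # related-but-wrong task
    "misleading-extreme":  '"{premise}" Is the following sentence grammatical? "{hypothesis}"',  # unrelated task
    "irrelevant":          '"{premise}" The weather is nice today. "{hypothesis}"',              # no task description
    "null":                '{premise} {hypothesis}',                                             # control: no template text
}

def render(category: str, premise: str, hypothesis: str) -> str:
    return TEMPLATES[category].format(premise=premise, hypothesis=hypothesis)

print(render("instructive",
             "No weapons of mass destruction found in Iraq yet.",
             "Weapons of mass destruction found in Iraq."))
```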
### Results
- **Irrelevant Templates**: models trained with irrelevant templates learn just as fast as those trained with instructive templates

- **Misleading Templates**: there is no consistent relation between the performance of models trained with moderately misleading templates and those trained with extremely misleading templates


- **Null Templates**: models trained with null templates perform far worse than those trained with any other category of templates

## Effect of Target Words
- In this experiment, they study the effect of different LM target words given a fixed template
- They write 4 categories of targets
1. **Yes-no**: Model is expected to predict the word “yes” for entailment and “no” for non-entailment
2. **Yes-no-like**: Semantically equivalent to yes-no but using superficially different words, e.g., “true”/“false”, “positive”/“negative”
3. **Arbitrary**: Model is expected to predict arbitrary words that have no semantic relation to the entailment task, e.g., “cat” for entailment, “dog” for non-entailment
4. **Reversed**: Model is expected to predict the opposite of the (intuitive) yes-no and yes-no-like labels, e.g., “no” for entailment, “yes” for non-entailment
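
These categories amount to different verbalizers over the same binary labels; a minimal sketch using the word pairs listed above:

```python
# Target-word (verbalizer) categories for binary NLI: each maps the gold label
# to the word the LM is trained to predict.
TARGETS = {
    "yes-no":      {"entailment": "yes",  "non-entailment": "no"},
    "yes-no-like": {"entailment": "true", "non-entailment": "false"},
    "arbitrary":   {"entailment": "cat",  "non-entailment": "dog"},
    "reversed":    {"entailment": "no",   "non-entailment": "yes"},
}

def target_word(category: str, label: str) -> str:
    """Return the LM target word for a given gold label under a category."""
    return TARGETS[category][label]

print(target_word("reversed", "entailment"))  # -> "no"
```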
- For both ALBERT and T0, they find that models trained with yes-no targets learn a good deal faster than those trained with yes-no-like targets
- and dramatically faster than those trained with arbitrary or reversed targets

- The choice of target words matters much more than the meaning of the templates

- The effect of the target words overrides the semantics of the overall prompt, e.g., models trained with
1. an irrelevant or misleading template + yes-no targets learn faster than models trained with
2. an instructive template + arbitrary targets

- When they try to help the models by appending target hints such as “True or false?” to the templates, performance often drops instead
- i.e., including answer choices in the input sequence makes models perform worse on certain tasks
## Conclusion
- Models often learn equally fast with misleading and irrelevant templates as they do with instructive ones
- The choice of the target words overrides the meaning of the overall prompts
- Results contradict a hypothesis commonly assumed in the literature that **prompts serve as semantically meaningful task instructions**
- Writing high-performing prompts requires domain expertise