# Do Prompt-Based Models Really Understand the Meaning of Their Prompts?
Albert Webson, Ellie Pavlick [arxiv](https://arxiv.org/abs/2109.01247)

## Introduction
> [CLS] No weapons of mass destruction found in Iraq yet.
> [SEP] Weapons of mass destruction found in Iraq.
>
> 0 or 1?

- This setup reflects the pretrain-and-fine-tune paradigm that has dominated NLP in recent years
- Models are asked to classify a sentence representation (e.g., a CLS token) into some arbitrary dimension of a one-hot vector

> ++Given that++ "No weapons of mass destruction found in Iraq yet."++, is it definitely correct that++ "Weapons of mass destruction found in Iraq."++?++

- With the task reformatted in natural language (underlined above), models can perform it more accurately and without needing many examples to figure out what the task is
- Reformatting NLP tasks with prompts such as the underlined text above has dramatically improved zero-shot and few-shot performance over traditional fine-tuned models
- Such results naturally give rise to the hypothesis that the extra prompt text included within each input example serves as semantically meaningful task instructions
    - i.e., prompts help models learn faster in the way task instructions help humans learn faster
- In this paper, they find that in most cases models learn just as fast when given irrelevant or misleading templates as they do when given instructively good templates
- There is limited evidence that prompt-based models' improvement derives from understanding task instructions in ways analogous to humans' use of task instructions

## Related Work
### Discrete Prompts (hard prompt)
- Discrete prompts reformat each example with some template text (a minimal sketch of this pipeline appears at the end of this Related Work section)
    - {sent} In summary, the restaurant is [prediction]
- The predicted mask word is converted to a class prediction by a predefined mapping
    - e.g., {"great" → positive, "terrible" → negative}
- The prompts can be manually written or automatically generated
- This approach typically tunes all parameters of the model

### Priming (in-context learning)
- Priming prepends *k* labeled priming examples to the evaluation example
    - Question: {$\text{sent}_1$} True or false? {$\text{label}_1$} ... Question: {$\text{sent}_k$} True or false? {$\text{label}_k$} Question: {eval_sent} True or false? [prediction]
- Parameters do not receive gradient updates based on those examples
- [Brown et al. (2020)](https://arxiv.org/abs/2005.14165) report that priming only performs well on the largest GPT-3 model
    - the API is costly and difficult to use for academic research

### Continuous Prompts (soft prompt)
- Continuous prompts prepend examples with special tokens, optionally initialized with word embeddings
- The tokens can be updated arbitrarily, so the final embeddings often *do not* correspond to any real word in the vocabulary
- Efficiently tunes a much smaller set of model parameters
- Harder to study their semantics: it is not clear whether continuous prompts serve as task-specific instructions or simply as more efficient model parameters
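To make the discrete-prompt pipeline concrete, here is a minimal sketch under assumed choices (the `bert-base-uncased` checkpoint, the restaurant template above, and the great/terrible verbalizer); it illustrates the general recipe, not the paper's code.

```python
# Minimal sketch of a discrete prompt + verbalizer, assuming the Hugging Face
# `transformers` library. The checkpoint, template, and label words are
# illustrative choices, not the paper's exact setup.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Verbalizer: predefined mapping from LM target words to class labels
verbalizer = {"great": "positive", "terrible": "negative"}

def classify(sentence: str) -> str:
    # Reformat the example with the template text
    text = f"{sentence} In summary, the restaurant is {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Position of the mask token in the input
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    # Compare only the logits of the verbalizer's target words at the mask position
    scores = {word: logits[0, mask_pos, tokenizer.convert_tokens_to_ids(word)].item()
              for word in verbalizer}
    return verbalizer[max(scores, key=scores.get)]

print(classify("The food was cold and the service was slow."))  # hopefully "negative"
```

In the prompt-based few-shot setting studied in this paper, the same masked-LM head is additionally fine-tuned on a small number of labeled examples rather than used purely zero-shot.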
## Overall Setup
### Baseline Model
- In preliminary experiments, they fine-tuned and prompt-tuned BERT, DistilBERT, RoBERTa, ALBERT, and T5
- They find **ALBERT** consistently yields the best performance, so they use it as their baseline model

### Instruction-Tuned Model
- They additionally experiment with T0, a recently proposed instruction-tuned model trained on over 60 datasets formatted with hundreds of manually written prompts
- They experiment with both sizes of T0 (3B and 11B)

### Very Large Model
- They experiment with the largest GPT-3 (175B) via priming

### Data
- They use **Recognizing Textual Entailment (RTE)**, a series of expert-annotated NLI datasets
- Specifically, the SuperGLUE collection of RTE (i.e., RTE 1, 2, 3, and 5), all converted to **binary classification**

## Effect of Templates
- This experiment asks whether models understand templates as meaningful task instructions, analogous to how humans would
- They write 5 categories of templates (illustrative instantiations are sketched at the end of this section)
    - **Instructive**: how we would describe the NLI task to a human who has never seen this task before
    - **Misleading-Moderate**: instructs the model to perform a task related or tangential to NLI
    - **Misleading-Extreme**: instructs the model to perform a task unrelated to NLI
    - **Irrelevant**: concatenates the premise, a sentence unrelated to any NLP task, and the hypothesis
    - **Null**: concatenates the premise and hypothesis with no additional instruction text (control condition)
    ![](https://hackmd.io/_uploads/HynAAe_t2.png =70%x)
- Key intuition: if models understand prompts, we'd expect their learning speeds to be:
    - instructive > irrelevant
    - instructive > misleading-moderate
    - instructive > misleading-extreme
    - instructive > null (no instruction; control condition)

### Result
- **Irrelevant templates**: models trained with irrelevant templates learn just as fast as those trained with instructive templates
    ![](https://hackmd.io/_uploads/ryJyZZdt3.png =70%x)
- **Misleading templates**: there is no consistent relation between the performance of models trained with moderately misleading vs. extremely misleading templates
    ![](https://hackmd.io/_uploads/HyvVXWdY2.png =70%x) ![](https://hackmd.io/_uploads/SyxLX-ut3.png =70%x)
- **Null templates**: models trained with null templates perform far worse than all other categories of templates
    ![](https://hackmd.io/_uploads/SJuGBW_K2.png =90%x)
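For concreteness, a minimal sketch of how the five template categories could be instantiated for an RTE premise-hypothesis pair is below. Only the instructive wording follows the paper's example from the introduction; the misleading and irrelevant wordings are illustrative stand-ins, not the paper's exact templates.

```python
# Illustrative instantiations of the five template categories for RTE.
# Only `instructive` follows the paper's own example; the other wordings
# are paraphrases for illustration.
def instructive(premise, hypothesis, mask="[MASK]"):
    return f'Given that "{premise}", is it definitely correct that "{hypothesis}"? {mask}'

def misleading_moderate(premise, hypothesis, mask="[MASK]"):
    # A task related to NLI (here: paraphrase detection) rather than NLI itself
    return f'"{premise}" Does the previous sentence mean the same thing as "{hypothesis}"? {mask}'

def misleading_extreme(premise, hypothesis, mask="[MASK]"):
    # A task unrelated to NLI (here: grammaticality judgment)
    return f'"{premise}" Is the following sentence grammatical? "{hypothesis}" {mask}'

def irrelevant(premise, hypothesis, mask="[MASK]"):
    # Premise + a sentence unrelated to any NLP task + hypothesis
    return f'{premise} The recipe calls for two cups of flour. {hypothesis} {mask}'

def null(premise, hypothesis, mask="[MASK]"):
    # Control condition: no instruction text at all
    return f'{premise} {hypothesis} {mask}'

premise = "No weapons of mass destruction found in Iraq yet."
hypothesis = "Weapons of mass destruction found in Iraq."
for template in (instructive, misleading_moderate, misleading_extreme, irrelevant, null):
    print(template(premise, hypothesis))
```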
## Effect of Target Words
- In this experiment, they study the effect of different LM target words given a fixed template
- They write 4 categories of targets (see the sketch at the end of this note):
    1. **Yes-no**: the model is expected to predict the word "yes" for entailment and "no" for non-entailment
    2. **Yes-no-like**: semantically equivalent to yes-no but using superficially different words, e.g., "true"/"false", "positive"/"negative"
    3. **Arbitrary**: the model is expected to predict arbitrary words that have no semantic relation to the entailment task, e.g., "cat" for entailment, "dog" for non-entailment
    4. **Reversed**: the model is expected to predict the opposite of the (intuitive) yes-no and yes-no-like labels, e.g., "no" for entailment, "yes" for non-entailment
- For both ALBERT and T0, models trained with yes-no targets learn a good deal faster than those trained with yes-no-like targets
    - and dramatically faster than those trained with arbitrary or reversed targets
    ![](https://hackmd.io/_uploads/HkiNKWOK2.png =90%x)
- The choice of target words matters much more than the meaning of the templates
    ![](https://hackmd.io/_uploads/HygwFbdF2.png =90%x)
- The effect of the target words overrides the semantics of the overall prompt: (1) an irrelevant or misleading template with yes-no targets outperforms (2) an instructive template with arbitrary targets
    ![](https://hackmd.io/_uploads/r16Jj-uY2.png =90%x)
- When they try to help the models by appending target hints such as "True or false?" to the templates, performance often drops instead
    - i.e., including the answer choices in the input sequence makes models perform worse on certain tasks

## Conclusion
- Models often learn just as fast with misleading and irrelevant templates as they do with instructive ones
- The choice of target words overrides the meaning of the overall prompt
- These results contradict a hypothesis commonly assumed in the literature, namely that **prompts serve as semantically meaningful task instructions**
- Writing high-performing prompts requires domain expertise
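For concreteness, the four target-word mappings compared in the Effect of Target Words section can be written as simple verbalizer dictionaries. The yes-no and reversed pairs follow the paper's description; the specific yes-no-like and arbitrary word pairs here are just the examples listed above, not an exhaustive list of the paper's choices.

```python
# Illustrative verbalizers for the four target-word categories.
# Each maps an RTE label (entailment / non-entailment) to the word the LM is
# trained to predict at the mask position.
TARGETS = {
    "yes-no":      {"entailment": "yes",  "non-entailment": "no"},
    "yes-no-like": {"entailment": "true", "non-entailment": "false"},
    "arbitrary":   {"entailment": "cat",  "non-entailment": "dog"},
    "reversed":    {"entailment": "no",   "non-entailment": "yes"},
}

def target_word(category: str, label: str) -> str:
    """Return the LM target word for a given verbalizer category and RTE label."""
    return TARGETS[category][label]

# e.g., with reversed targets the model must answer "no" for an entailed pair
assert target_word("reversed", "entailment") == "no"
```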