# Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
###### tags: `Meeting`
## Introduction

- Large language models (LMs) have shown impressive performance on downstream tasks by simply conditioning on a few input-label pairs (demonstrations), i.e., in-context learning
- Despite in-context learning consistently outperforming zero-shot inference on a wide range of tasks, there is little understanding of *how* it works and *which* aspects of the demonstrations contribute to end-task performance
---
> **An Explanation of In-context Learning as Implicit Bayesian Inference** - ICLR 2022
> - They study how in-context learning can emerge when pretraining documents have long-range coherence
> - LM must infer a latent document-level concept to generate coherent next tokens during pretraining
> - At test time, in-context learning occurs when the LM also infers a shared latent concept between examples in a prompt
---
- In this paper, they show that ground truth demonstrations are in fact **not required** for effective in-context learning
- Replacing the labels in demonstrations with **random labels** barely hurts performance on a range of classification and multiple-choice tasks
- The model *does not* rely on the input-label mapping in the demonstrations to perform the task

- Further analysis investigates which parts of the demonstrations actually do contribute to the performance:
- The **label space** and the **distribution of the input text** *specified* by the demonstrations are both key to in-context learning, regardless of whether the labels are correct for individual inputs
- Specifying the **overall format** is also crucial: when the label space is unknown, using random English words as labels is significantly better than using no labels
- **Meta-training** with an in-context learning objective magnifies these effects
    - the models almost exclusively exploit simpler aspects of the demonstrations, such as the format, rather than the input-label mapping
---


> **Noisy Channel Language Model Prompting for Few-Shot Text Classification** - ACL 2022
> - Instead of computing the likelihood of the label given the input, $P(y|x)$ (the **direct** method), channel models compute the **conditional probability of the input** given the label, $P(x|y)$
> - The model is thus required to explain every word in the input
> - Channel models significantly outperform their direct counterparts, which the authors attribute to their **stability**
>     - lower variance
>     - higher worst-case accuracy
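
To make the direct vs. channel distinction concrete, below is a minimal scoring sketch with a Hugging Face causal LM; the prompt templates and the `sequence_logprob` helper are illustrative assumptions, not the implementation from the paper.

```python
# Minimal sketch: direct vs. channel scoring with a causal LM.
# Templates and helper are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(context: str, continuation: str) -> float:
    """Sum of log P(each continuation token | context and preceding continuation tokens)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, cont_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    offset = ctx_ids.shape[1]
    # logits at position t predict the token at position t + 1
    return sum(
        log_probs[0, offset + i - 1, input_ids[0, offset + i]].item()
        for i in range(cont_ids.shape[1])
    )

labels = ["positive", "negative"]
x = "A deeply moving film with superb performances."

# Direct: argmax_y P(y | x)
direct_pred = max(labels, key=lambda y: sequence_logprob(f"Review: {x}\nSentiment:", f" {y}"))
# Channel: argmax_y P(x | y) -- the model must "explain" every word of the input
channel_pred = max(labels, key=lambda y: sequence_logprob(f"Sentiment: {y}\nReview:", f" {x}"))
print(direct_pred, channel_pred)
```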
---
## Ground Truth Matters Little
### Gold labels vs. random labels
1. **No demonstrations**: typical zero-shot method $$\arg\max_{y\in\mathcal{C}}P(y|x)$$
2. **Demonstrations w/ gold labels**: typical in-context learning method with $k$ labeled examples $(x_1,y_1)...(x_k,y_k)$ $$\arg\max_{y\in\mathcal{C}}P(y|x_1,y_1,...,x_k,y_k,x)$$
3. **Demonstrations w/ random labels**: formed with labels $\tilde{y}_i$ sampled uniformly at random from the label set $\mathcal{C}$ $$\arg\max_{y\in\mathcal{C}}P(y|x_1,\tilde{y}_1,...,x_k,\tilde{y}_k,x)$$
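
As a concrete illustration of these three settings, here is a minimal prompt-construction sketch; the `build_prompt` helper and the newline-separated format are assumptions, not the paper's exact templates.

```python
# Minimal sketch of the three demonstration settings (helper and format are assumed).
import random

def build_prompt(demos, test_input, label_set, mode="gold"):
    """demos: list of (input, gold_label) pairs. mode: 'none' | 'gold' | 'random'."""
    lines = []
    if mode != "none":  # mode == 'none' reproduces the zero-shot setting
        for x, y in demos:
            # 'random' swaps the gold label for one sampled uniformly from the label set
            label = y if mode == "gold" else random.choice(label_set)
            lines.append(f"{x}\n{label}")
    lines.append(test_input)
    return "\n\n".join(lines)

# The LM then scores each candidate y in the label set as a continuation of this prompt.
prompt = build_prompt([("great movie!", "positive"), ("boring plot.", "negative")],
                      "a masterpiece.", ["positive", "negative"], mode="random")
```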

- They find that replacing gold labels with random labels only marginally hurts performance
- This result indicates that the ground truth input-label pairs are not necessary to achieve performance gains, which is counter-intuitive given that correctly paired training data is critical in typical supervised learning
- Models are capable of recovering the expected input-label correspondence for the task
    - however, this capability does *not* come directly from the pairings in the demonstrations
:::info
1. Selecting random labels from the true distribution of labels (instead of a uniform distribution) reduces the gap even further
2. The trends may depend on the dataset, although the overall trend is consistent over most datasets
:::
### Ablations
#### Does the number of correct labels matter?

- Model performance is fairly insensitive to the number of correct labels in the demonstrations
- Even using incorrect labels for every demonstration significantly outperforms using no demonstrations at all (a sketch of this ablation follows)
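
A sketch of how the correct-label fraction might be controlled, reusing the `(input, label)` pair format from above; the `corrupt_labels` helper is hypothetical:

```python
# Hypothetical helper: keep a given fraction of gold labels, corrupt the rest.
import random

def corrupt_labels(demos, label_set, correct_frac):
    """Keep `correct_frac` of gold labels; replace the others with random *incorrect* labels."""
    keep = set(random.sample(range(len(demos)), round(correct_frac * len(demos))))
    return [
        (x, y if i in keep else random.choice([l for l in label_set if l != y]))
        for i, (x, y) in enumerate(demos)
    ]
```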
#### Is the result consistent with varying k?

- Using the demonstrations significantly outperforms the no-demonstrations method even with small $k$ ($k = 4$)
- Performance drop from using gold labels to using random labels is consistently small across varying $k$
- Model performance does not increase much as $k$ increases when $k \geq 8$, both with gold labels and with random labels
#### Is the result consistent with better templates?

- The finding that replacing gold labels with random labels barely hurts performance also holds with manual templates
- Using manual templates does not always outperform using minimal templates.

## Why does In-Context Learning work?
The previous section showed that the ground truth input-label mapping in the demonstrations has little impact on the performance gains from in-context learning; this section examines what other aspects of the demonstrations lead to good performance.
- They identify four aspects of the demonstrations that potentially provide learning signal:
1. The input-label mapping
2. The distribution of the input text
3. The label space
4. The format

- They design a series of variants of the demonstrations that quantify the impact of each aspect in isolation
- They also track how models meta-trained with an in-context learning objective behave under each variant
#### Impact of the distribution of the input text
- They replace the inputs in the demonstrations with random sentences sampled from an out-of-distribution (OOD) corpus
    - keeping the label space and the format of the demonstrations intact (see the sketch at the end of this subsection)

- Using out-of-distribution inputs instead of inputs from the training data significantly degrades performance
- This suggests that **in-distribution inputs** in the demonstrations substantially contribute to performance gains
- Conditioning on in-distribution text **makes the task closer to language modeling**, since the LM always conditions on in-distribution text during training
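
A minimal sketch of this variant, assuming the external corpus is available as a plain Python list of sentences:

```python
# Sketch: swap each demonstration input for an out-of-distribution sentence,
# keeping the labels (label space) and the pair format intact.
import random

def with_ood_inputs(demos, external_corpus):
    """external_corpus: list of raw sentences from a different distribution."""
    return [(random.choice(external_corpus), y) for _, y in demos]
```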
#### Impact of the label space
- They experiment with demonstrations whose labels are replaced with **random English words**, destroying the label space while keeping the format (sketched at the end of this subsection)

- With direct models, the performance gap between using random labels within the label space and using random English words is significant
- This indicates that conditioning on the label space significantly contributes to performance gains
- Removing the output space does not lead to a significant drop for the channel models
    - the channel models only condition on the labels, and thus do not benefit from knowing the label space, in contrast to direct models, which must generate the correct labels
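
A sketch of the random-English-words variant; the per-example sampling and the word list are illustrative assumptions:

```python
# Sketch: replace each label with a random English word, destroying the label space
# while keeping the input distribution and the pair format.
import random

def with_random_word_labels(demos, english_words):
    """english_words: list of arbitrary English words unrelated to the task labels."""
    return [(x, random.choice(english_words)) for x, _ in demos]
```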
#### Impact of input-label pairing

- They evaluate **demonstrations with no labels**, where the LM is conditioned on the concatenation of $x_1...x_k$
- **Demonstrations with labels only**, where the LM is conditioned on the concatenation of $y_1...y_k$ (both variants are sketched after this list)

- Removing the format is close to or worse than no demonstrations, indicating the importance of the format
- Conditioning on a sequence of input-label pairs triggers the model to mimic the overall format and complete the new example as expected when the test input is given
- Removing inputs instead of using OOD inputs, or removing labels instead of using random English words, is significantly worse, indicating that **keeping the format of the input-label pairs is key**
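
Sketches of the two format-removing variants, under the same assumed newline-separated prompt format:

```python
# Sketches of the format-removing variants (prompt format is assumed).

def inputs_only(demos):
    """Demonstrations with no labels: condition on the concatenation x_1 ... x_k."""
    return "\n\n".join(x for x, _ in demos)

def labels_only(demos):
    """Demonstrations with labels only: condition on the concatenation y_1 ... y_k."""
    return "\n\n".join(y for _, y in demos)
```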
#### Impact of meta-training
- Multi-task training on a large collection of supervised datasets (called meta-training) enables generalization to new tasks
- They aim to better understand the role of this meta-training in relation to their findings by closely examining the results of MetaICL
- The patterns they see so far are significantly more evident with MetaICL than with other models
- the ground truth input-label mapping matters even less
- keeping the format of the demonstrations matters even more
- They hypothesize that **meta-training encourages the model to exclusively exploit simpler aspects of the demonstrations and to ignore others**
- the input-label mapping is likely harder to exploit
- the format is likely easier to exploit
- the space of the text that the model is trained to generate is likely easier to exploit than the space of the text that the model conditions on
## Conclusion
- They study the role of the demonstrations with respect to the success of in-context learning
- The ground truth input-label mapping in the demonstrations matters significantly less than one might think
- replacing gold labels with random labels in the demonstrations only marginally lowers the performance
- They identify a series of aspects in the demonstrations and examine which aspect actually contributes to performance gains
- gains are mainly coming from *independent* specification of the input space and the label space
- the models can retain up to 95% of the performance gains using either the inputs only or the label set only, provided the **right format is used**
- meta-training with an in-context learning objective magnifies these trends
## Appendix

```bash
# Direct GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method direct --out_dir out/gpt2-large --do_zeroshot
# Channel GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method channel --out_dir out/gpt2-large --do_zeroshot
# Direct MetaICL
python test.py --dataset {dataset} --gpt2 metaicl --method direct --out_dir out/metaicl --do_zeroshot
# Channel MetaICL
python test.py --dataset {dataset} --gpt2 channel-metaicl --method channel --out_dir out/channel-metaicl --do_zeroshot
# Direct GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method direct --out_dir out/gpt-j --do_zeroshot
# Channel GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method channel --out_dir out/gpt-j --do_zeroshot
# GPT-3
python test_gpt3.py --dataset {dataset} --gpt3 {ada|babbage|curie|davinci} --method {direct|channel} --out_dir out/gpt3 --do_zeroshot --api {API key}
```