# Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

###### tags: `Meeting`

## Introduction

![](https://hackmd.io/_uploads/Hk8wSg_v2.png =70%x)

- Large language models (LMs) have shown impressive performance on downstream tasks by simply conditioning on a few input-label pairs (demonstrations)<!--Despite in-context learning consistently outperforming zero-shot inference on a wide range of tasks-->
- There is little understanding of *how* in-context learning works and *which* aspects of the demonstrations contribute to end-task performance

---

> **An Explanation of In-context Learning as Implicit Bayesian Inference** - ICLR 2022
> - They study how in-context learning can emerge when pretraining documents have long-range coherence
> - The LM must infer a latent document-level concept to generate coherent next tokens during pretraining
> - At test time, in-context learning occurs when the LM also infers a shared latent concept between the examples in a prompt
> ![](https://hackmd.io/_uploads/BJJCKgdwn.png)

---

- In this paper, they show that ground-truth input-label mappings in the demonstrations are in fact **not required** for effective in-context learning
    - Replacing the labels in demonstrations with **random labels** barely hurts performance on a range of classification and multi-choice tasks
    - The model *does not* rely on the input-label mapping in the demonstrations to perform the task

![](https://hackmd.io/_uploads/HJ3_U5Vv3.png =50%x)

<!-- Further analysis investigates which parts of demonstrations actually do contribute to the performance -->

- The **label space** and the **distribution of the input text** *specified* by the demonstrations are both key to in-context learning<!-- regardless of whether the labels are correct for individual inputs -->
- Specifying the **overall format** is also crucial<!-- when the label space is unknown, using random English words as labels is significantly better than using no labels -->
- **Meta-training** with an in-context learning objective magnifies these effects
    - the models almost exclusively exploit simpler aspects of the demonstrations, such as the format, rather than the input-label mapping

---

*Background: direct vs. channel inference (both are evaluated in this paper)*

![](https://hackmd.io/_uploads/rkrHi8iv2.png)
![](https://hackmd.io/_uploads/B1sPsIsw3.png)

- Instead of computing the likelihood of the label given the input, channel models compute the **conditional probability of the input** given the label
    - the model is required to explain every word in the input
- Channel models significantly outperform their direct counterparts, which is attributed to their **stability**
    - lower variance
    - higher worst-case accuracy

---

## Ground Truth Matters Little

### Gold labels vs. random labels

1. **No demonstrations**: typical zero-shot method
$$\arg\max_{y\in\mathcal{C}}P(y|x)$$
2. **Demonstrations w/ gold labels**: typical in-context learning method with $k$ labeled examples $(x_1,y_1),\dots,(x_k,y_k)$
$$\arg\max_{y\in\mathcal{C}}P(y|x_1,y_1,\dots,x_k,y_k,x)$$
3. **Demonstrations w/ random labels**: the same demonstration inputs, but each label is replaced by $\tilde{y}_i$ sampled at random from the label set $\mathcal{C}$
$$\arg\max_{y\in\mathcal{C}}P(y|x_1,\tilde{y}_1,\dots,x_k,\tilde{y}_k,x)$$
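As a concrete illustration of the three conditions, below is a minimal sketch (not the authors' code). The toy sentiment task, label set, and minimal `input\nlabel` template are illustrative assumptions; candidate labels are scored with GPT-2 in the direct fashion, i.e. $\arg\max_{y\in\mathcal{C}}P(y\mid \text{prompt})$.

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy sentiment task (illustrative assumption, not a dataset from the paper).
LABELS = ["positive", "negative"]                                  # label set C
demos = [("a gripping, well-acted thriller", "positive"),
         ("flat characters and a predictable plot", "negative"),
         ("one of the best films of the year", "positive"),
         ("a tedious, overlong mess", "negative")]                 # k = 4
test_input = "an unexpectedly moving story"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def label_logprob(context, label):
    """Sum of log P(label tokens | context) under the LM."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    lbl_ids = tokenizer(" " + label, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, lbl_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits[0, :-1], dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    positions = range(ctx_ids.shape[1] - 1, input_ids.shape[1] - 1)
    return sum(log_probs[p, t].item() for p, t in zip(positions, lbl_ids[0]))

def predict(pairs, x):
    """argmax_{y in C} P(y | x_1, y_1, ..., x_k, y_k, x)."""
    context = "\n\n".join([f"{xi}\n{yi}" for xi, yi in pairs] + [x]) + "\n"
    return max(LABELS, key=lambda y: label_logprob(context, y))

print("no demos:     ", predict([], test_input))                   # condition 1
print("gold labels:  ", predict(demos, test_input))                # condition 2
random_demos = [(x, random.choice(LABELS)) for x, _ in demos]      # y_i ~ C
print("random labels:", predict(random_demos, test_input))         # condition 3
```

Note that swapping `demos` for `random_demos` is the *only* difference between conditions 2 and 3, which is what makes the paper's finding surprising.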
![](https://hackmd.io/_uploads/ryNAtXdvn.png)

- They find that replacing gold labels with random labels only marginally hurts performance
- This result indicates that ground-truth input-label pairs are not necessary to achieve performance gains
    - this is counter-intuitive, given that correctly paired training data is critical in typical supervised training
- The models are capable of recovering the expected input-label correspondence for the task on their own
    - this capacity does *not* come directly from the pairings in the demonstrations

:::info
1. Selecting random labels from the true distribution of labels (instead of a uniform distribution) reduces the gap even further
2. The trends may vary by dataset, although the overall trend is consistent over most datasets
:::

### Ablations

#### Does the number of correct labels matter?

![](https://hackmd.io/_uploads/HkfxTVOwh.png)

- Model performance is fairly insensitive to the number of correct labels in the demonstrations
- Even using only incorrect labels significantly outperforms using no demonstrations

#### Is the result consistent with varying k?

![](https://hackmd.io/_uploads/HJ1Q4zoDh.png)

- Using demonstrations significantly outperforms using no demonstrations even with small $k$ ($k = 4$)
- The performance drop from gold labels to random labels is consistently small across varying $k$
- Model performance does not increase much as $k$ increases once $k \geq 8$, with both gold labels and random labels

#### Is the result consistent with better templates?

![](https://hackmd.io/_uploads/H19wrzivh.png)

- The finding that replacing gold labels with random labels barely hurts performance also holds with manual templates
- Using manual templates does not always outperform using minimal templates

![](https://hackmd.io/_uploads/rktZ8fsP2.png)

## Why does In-Context Learning work?

<!-- Section 4 shows that the ground truth input-label mapping in the demonstrations has little impact on performance gains from in-context learning. This section further examines what other aspects of the demonstrations lead to good performance of in-context learning -->

- They identify four aspects of the demonstrations that potentially provide learning signal
    1. The input-label mapping
    2. The distribution of the input text
    3. The label space
    4. The format

![](https://hackmd.io/_uploads/rJeJvGiPn.png)

- They design a series of demonstration variants that quantify the impact of each aspect in isolation (a rough sketch of such variants follows below)
- They also examine how these trends change for models meta-trained with an in-context learning objective
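Concretely, the variants isolating each aspect can be thought of as simple transformations of the demonstration pairs. The sketch below is an assumption-laden illustration, not the paper's code; `OOD_CORPUS` and `RANDOM_WORDS` are tiny hypothetical stand-ins for an external corpus and a random English word list.

```python
import random

# Hypothetical stand-ins for an external OOD corpus and random English words.
OOD_CORPUS = ["the committee adjourned at noon",
              "cells divide through mitosis"]
RANDOM_WORDS = ["unanimity", "wave", "forest"]

demos = [("a gripping, well-acted thriller", "positive"),
         ("a tedious, overlong mess", "negative")]

def with_ood_inputs(pairs):
    # Keep the labels and the format, replace each input with an OOD sentence
    # -> isolates the distribution of the input text.
    return [(random.choice(OOD_CORPUS), y) for _, y in pairs]

def with_random_word_labels(pairs):
    # Keep the inputs and the format, replace each label with a random English
    # word -> isolates the label space.
    return [(x, random.choice(RANDOM_WORDS)) for x, _ in pairs]

def inputs_only(pairs):
    # Drop the labels entirely -> removes the input-label pair format.
    return [x for x, _ in pairs]

def labels_only(pairs):
    # Drop the inputs entirely -> also removes the format.
    return [y for _, y in pairs]

print(with_ood_inputs(demos))
print(with_random_word_labels(demos))
print(inputs_only(demos), labels_only(demos))
```

Each variant keeps everything else fixed, so comparing it against the original demonstrations measures one aspect at a time.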
#### Impact of the distribution of the input text

- They replace the inputs in the demonstrations with sentences randomly sampled from an external, out-of-distribution (OOD) corpus
- The label space and the format of the demonstrations are kept unchanged

![](https://hackmd.io/_uploads/BkOZpzow2.png)

- Using out-of-distribution inputs instead of inputs from the training data significantly degrades performance
- This suggests that **in-distribution inputs** in the demonstrations substantially contribute to performance gains
- Conditioning on in-distribution text **makes the task closer to language modeling**, since the LM always conditioned on in-distribution text during training

#### Impact of the label space

- They experiment with demonstrations whose labels are replaced with **random English words** (outside the label space)

![](https://hackmd.io/_uploads/BySGFUiD2.png)

- With direct models, the performance gap between using random labels within the label space and using random English words is significant
    - this indicates that conditioning on the label space significantly contributes to performance gains
- Removing the label space does not lead to a significant drop for channel models
    - channel models only condition on the labels, and thus do not benefit from knowing the label space <!-- This is in contrast to direct models which must generate the correct labels -->

#### Impact of input-label pairing (the format)

![](https://hackmd.io/_uploads/rJeJvGiPn.png)

- They evaluate **demonstrations with no labels**, where the LM is conditioned on the concatenation of $x_1,\dots,x_k$
- and **demonstrations with labels only**, where the LM is conditioned on the concatenation of $y_1,\dots,y_k$

![](https://hackmd.io/_uploads/HyQz1Dsv2.png)

- Removing the format performs close to or worse than using no demonstrations, indicating the importance of the format
- Conditioning on a sequence of input-label pairs triggers the model to mimic the overall format and complete the new example as expected when the test input is given
- Removing the inputs (rather than using OOD inputs), or removing the labels (rather than using random English words), is significantly worse, indicating that **keeping the format of the input-label pairs is key**

#### Impact of meta-training

- MetaICL performs multi-task training on a large collection of supervised datasets with an in-context learning objective (called meta-training) in order to generalize to new tasks
- They aim to better understand the role of this meta-training in relation to their findings by closely examining the results of MetaICL
- The patterns seen so far are significantly more evident with MetaICL than with other models
    - the ground truth input-label mapping matters even less
    - keeping the format of the demonstrations matters even more
- They hypothesize that **meta-training encourages the model to exclusively exploit simpler aspects of the demonstrations and to ignore others**
    - the input-label mapping is likely harder to exploit
    - the format is likely easier to exploit
    - the space of the text that the model is trained to generate is likely easier to exploit than the space of the text that the model conditions on
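For intuition about what "meta-training with an in-context learning objective" involves, here is a rough sketch of a single meta-training step under stated assumptions; it is not MetaICL's actual code, and `meta_train_tasks`, the template, and the hyperparameters are hypothetical. The idea: sample a task, build a prompt from $k$ demonstrations plus a query, and apply the language-modeling loss only to the query's label tokens.

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical meta-training collection: task name -> list of (input, label) pairs.
meta_train_tasks = {
    "sentiment": [("great movie", "positive"), ("boring plot", "negative"),
                  ("loved it", "positive"), ("a waste of time", "negative"),
                  ("truly memorable", "positive")],
}

def meta_training_step(k=4):
    # Sample a task, then k demonstrations plus one held-out query example.
    task = random.choice(list(meta_train_tasks))
    examples = random.sample(meta_train_tasks[task], k + 1)
    demos, (query_x, query_y) = examples[:k], examples[k]
    context = "\n\n".join(f"{x}\n{y}" for x, y in demos) + f"\n\n{query_x}\n"
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(query_y, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    # Mask the context with -100 so the cross-entropy loss covers only the label tokens.
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100
    loss = model(input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

print(meta_training_step())
```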
## Conclusion

- They study the role of the demonstrations in the success of in-context learning
- The ground-truth input-label mapping in the demonstrations matters significantly less than one might think
    - replacing gold labels with random labels in the demonstrations only marginally lowers performance
- They identify a series of aspects of the demonstrations and examine which of them actually contribute to performance gains
    - gains mainly come from *independent* specification of the input space and the label space
    - the models can still retain up to 95% of the performance gains by using either the inputs only or the label set only, as long as the **right format is used**
    - meta-training with an in-context learning objective magnifies these trends

## Appendix

![](https://hackmd.io/_uploads/rktZ8fsP2.png)

<!--
# Direct GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method direct --out_dir out/gpt2-large --do_zeroshot

# Channel GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method channel --out_dir out/gpt2-large --do_zeroshot

# Direct MetaICL
python test.py --dataset {dataset} --gpt2 metaicl --method direct --out_dir out/metaicl --do_zeroshot

# Channel MetaICL
python test.py --dataset {dataset} --gpt2 channel-metaicl --method channel --out_dir out/channel-metaicl --do_zeroshot

# Direct GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method direct --out_dir out/gpt-j --do_zeroshot

# Channel GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method channel --out_dir out/gpt-j --do_zeroshot

# GPT-3
python test_gpt3.py --dataset {dataset} --gpt3 {ada|babbage|curie|davinci} --method {direct|channel} --out_dir out/gpt3 --do_zeroshot --api {API key}
-->