# Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
###### tags: `Meeting`
## Introduction

- Large language models (LMs) have shown impressive performance on downstream tasks by simply conditioning on a few input-label pairs (demonstrations), i.e., in-context learning
- Despite in-context learning consistently outperforming zero-shot inference on a wide range of tasks, there is little understanding of *how* it works and *which* aspects of the demonstrations contribute to end-task performance
---
> **An Explanation of In-context Learning as Implicit Bayesian Inference** - ICLR 2022
> - They study how in-context learning can emerge when pretraining documents have long-range coherence
> - LM must infer a latent document-level concept to generate coherent next tokens during pretraining
> - At test time, in-context learning occurs when the LM also infers a shared latent concept between examples in a prompt
---
- In this paper, they show that ground truth demonstrations are in fact **not required** for effective in-context learning
- Replacing the labels in demonstrations with **random labels** barely hurts performance on a range of classification and multiple-choice tasks
- The model *does not* rely on the input-label mapping in the demonstrations to perform the task

- Further analysis investigates which parts of the demonstrations actually do contribute to the performance:
- The **label space** and the **distribution of the input text** *specified* by the demonstrations are both key to in-context learning, regardless of whether the labels are correct for individual inputs
- Specifying the **overall format** is also crucial: when the label space is unknown, using random English words as labels is significantly better than using no labels
- **Meta-training** with an in-context learning objective magnifies these effects
    - the models almost exclusively exploit simpler aspects of the demonstrations, such as the format, rather than the input-label mapping
---


> **Noisy Channel Language Model Prompting for Few-Shot Text Classification** - ACL 2022
> - Instead of computing the likelihood of the label given the input, $P(y|x)$ (the **direct** method), channel models compute the **conditional probability of the input** given the label, $P(x|y)$
> - The model is thus required to explain every word in the input
> - Channel models significantly outperform their direct counterparts, which the authors attribute to their **stability**
>     - lower variance
>     - higher worst-case accuracy
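
To make the direct vs. channel distinction concrete, below is a minimal scoring sketch with a Hugging Face causal LM; the prompt templates and the `sequence_logprob` helper are illustrative assumptions, not the implementation from the paper.

```python
# Minimal sketch: direct vs. channel scoring with a causal LM.
# Templates and helper are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(context: str, continuation: str) -> float:
    """Sum of log P(each continuation token | context and preceding continuation tokens)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, cont_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    offset = ctx_ids.shape[1]
    # logits at position t predict the token at position t + 1
    return sum(
        log_probs[0, offset + i - 1, input_ids[0, offset + i]].item()
        for i in range(cont_ids.shape[1])
    )

labels = ["positive", "negative"]
x = "A deeply moving film with superb performances."

# Direct: argmax_y P(y | x)
direct_pred = max(labels, key=lambda y: sequence_logprob(f"Review: {x}\nSentiment:", f" {y}"))
# Channel: argmax_y P(x | y) -- the model must "explain" every word of the input
channel_pred = max(labels, key=lambda y: sequence_logprob(f"Sentiment: {y}\nReview:", f" {x}"))
print(direct_pred, channel_pred)
```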
---
## Ground Truth Matters Little
### Gold labels vs. random labels
1. **No demonstrations**: typical zero-shot method $$\arg\max_{y\in\mathcal{C}}P(y|x)$$
2. **Demonstrations w/ gold labels**: typical in-context learning method with $k$ labeled examples $(x_1,y_1)...(x_k,y_k)$ $$\arg\max_{y\in\mathcal{C}}P(y|x_1,y_1,...,x_k,y_k,x)$$
3. **Demonstrations w/ random labels**: formed with labels $\tilde{y}_i$ sampled uniformly at random from the label set $\mathcal{C}$ $$\arg\max_{y\in\mathcal{C}}P(y|x_1,\tilde{y}_1,...,x_k,\tilde{y}_k,x)$$
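
As a concrete illustration of these three settings, here is a minimal prompt-construction sketch; the `build_prompt` helper and the newline-separated format are assumptions, not the paper's exact templates.

```python
# Minimal sketch of the three demonstration settings (helper and format are assumed).
import random

def build_prompt(demos, test_input, label_set, mode="gold"):
    """demos: list of (input, gold_label) pairs. mode: 'none' | 'gold' | 'random'."""
    lines = []
    if mode != "none":  # mode == 'none' reproduces the zero-shot setting
        for x, y in demos:
            # 'random' swaps the gold label for one sampled uniformly from the label set
            label = y if mode == "gold" else random.choice(label_set)
            lines.append(f"{x}\n{label}")
    lines.append(test_input)
    return "\n\n".join(lines)

# The LM then scores each candidate y in the label set as a continuation of this prompt.
prompt = build_prompt([("great movie!", "positive"), ("boring plot.", "negative")],
                      "a masterpiece.", ["positive", "negative"], mode="random")
```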

- They find that replacing gold labels with random labels only marginally hurts performance
- This result indicates that the ground truth input-label pairs are not necessary to achieve performance gains, which is counter-intuitive given that correctly paired training data is critical in typical supervised learning
- Models are capable of recovering the expected input-label correspondence for the task
    - however, this capability does *not* come directly from the pairings in the demonstrations
:::info
1. Selecting random labels from the true distribution of labels (instead of a uniform distribution) reduces the gap even further
2. The trends may depend on the dataset, although the overall trend is consistent over most datasets
:::
### Ablations
#### Does the number of correct labels matter?

- Model performance is fairly insensitive to the number of correct labels in the demonstrations
- Even using incorrect labels for every demonstration significantly outperforms using no demonstrations at all (a sketch of this ablation follows)
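
A sketch of how the correct-label fraction might be controlled, reusing the `(input, label)` pair format from above; the `corrupt_labels` helper is hypothetical:

```python
# Hypothetical helper: keep a given fraction of gold labels, corrupt the rest.
import random

def corrupt_labels(demos, label_set, correct_frac):
    """Keep `correct_frac` of gold labels; replace the others with random *incorrect* labels."""
    keep = set(random.sample(range(len(demos)), round(correct_frac * len(demos))))
    return [
        (x, y if i in keep else random.choice([l for l in label_set if l != y]))
        for i, (x, y) in enumerate(demos)
    ]
```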
#### Is the result consistent with varying k?

- Using the demonstrations significantly outperforms the no-demonstrations method even with small $k$ ($k = 4$)
- Performance drop from using gold labels to using random labels is consistently small across varying $k$
- Model performance does not increase much as $k$ increases when $k \geq 8$, both with gold labels and with random labels
#### Is the result consistent with better templates?

- The finding that replacing gold labels with random labels barely hurts performance also holds with manual templates
- Using manual templates does not always outperform using minimal templates.

## Why does In-Context Learning work?
The previous section showed that the ground truth input-label mapping in the demonstrations has little impact on the performance gains from in-context learning; this section examines what other aspects of the demonstrations lead to good performance.
- They identify four aspects of the demonstrations that potentially provide learning signal:
1. The input-label mapping
2. The distribution of the input text
3. The label space
4. The format

- They design a series of variants of the demonstrations that quantify the impact of each aspect in isolation
- They also track how models meta-trained with an in-context learning objective behave under each variant
#### Impact of the distribution of the input text
- They replace the inputs in the demonstrations with random sentences sampled from an out-of-distribution (OOD) corpus
    - keeping the label space and the format of the demonstrations intact (see the sketch at the end of this subsection)

- Using out-of-distribution inputs instead of inputs from the training data significantly degrades performance
- This suggests that **in-distribution inputs** in the demonstrations substantially contribute to performance gains
- Conditioning on in-distribution text **makes the task closer to language modeling**, since the LM always conditions on in-distribution text during training
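
A minimal sketch of this variant, assuming the external corpus is available as a plain Python list of sentences:

```python
# Sketch: swap each demonstration input for an out-of-distribution sentence,
# keeping the labels (label space) and the pair format intact.
import random

def with_ood_inputs(demos, external_corpus):
    """external_corpus: list of raw sentences from a different distribution."""
    return [(random.choice(external_corpus), y) for _, y in demos]
```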
#### Impact of the label space
- They experiment with demonstrations whose labels are replaced with **random English words**, destroying the label space while keeping the format (sketched at the end of this subsection)

- With direct models, the performance gap between using random labels within the label space and using random English words is significant
- This indicates that conditioning on the label space significantly contributes to performance gains
- Removing the output space does not lead to a significant drop for the channel models
    - the channel models only condition on the labels, and thus do not benefit from knowing the label space, in contrast to direct models, which must generate the correct labels
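
A sketch of the random-English-words variant; the per-example sampling and the word list are illustrative assumptions:

```python
# Sketch: replace each label with a random English word, destroying the label space
# while keeping the input distribution and the pair format.
import random

def with_random_word_labels(demos, english_words):
    """english_words: list of arbitrary English words unrelated to the task labels."""
    return [(x, random.choice(english_words)) for x, _ in demos]
```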
#### Impact of input-label pairing

- They evaluate **demonstrations with no labels**, where the LM is conditioned on the concatenation of $x_1...x_k$
- **Demonstrations with labels only**, where the LM is conditioned on the concatenation of $y_1...y_k$ (both variants are sketched after this list)

- Removing the format is close to or worse than no demonstrations, indicating the importance of the format
- Conditioning on a sequence of input-label pairs triggers the model to mimic the overall format and complete the new example as expected when the test input is given
- Removing inputs instead of using OOD inputs, or removing labels instead of using random English words, is significantly worse, indicating that **keeping the format of the input-label pairs is key**
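
Sketches of the two format-removing variants, under the same assumed newline-separated prompt format:

```python
# Sketches of the format-removing variants (prompt format is assumed).

def inputs_only(demos):
    """Demonstrations with no labels: condition on the concatenation x_1 ... x_k."""
    return "\n\n".join(x for x, _ in demos)

def labels_only(demos):
    """Demonstrations with labels only: condition on the concatenation y_1 ... y_k."""
    return "\n\n".join(y for _, y in demos)
```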
#### Impact of meta-training
- Multi-task training on a large collection of supervised datasets (called meta-training) enables generalization to new tasks
- They aim to better understand the role of this meta-training in relation to their findings by closely examining the results of MetaICL
- The patterns they see so far are significantly more evident with MetaICL than with other models
- the ground truth input-label mapping matters even less
- keeping the format of the demonstrations matters even more
- They hypothesize that **meta-training encourages the model to exclusively exploit simpler aspects of the demonstrations and to ignore others**
- the input-label mapping is likely harder to exploit
- the format is likely easier to exploit
- the space of the text that the model is trained to generate is likely easier to exploit than the space of the text that the model conditions on
## Conclusion
- They study the role of the demonstrations with respect to the success of in-context learning
- The ground truth input-label mapping in the demonstrations matters significantly less than one might think
- replacing gold labels with random labels in the demonstrations only marginally lowers the performance
- They identify a series of aspects in the demonstrations and examine which aspect actually contributes to performance gains
- gains are mainly coming from *independent* specification of the input space and the label space
- the models can retain up to 95% of the performance gains using either the inputs only or the label set only, provided the **right format is used**
- meta-training with an in-context learning objective magnifies these trends
## Appendix

```bash
# Direct GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method direct --out_dir out/gpt2-large --do_zeroshot
# Channel GPT-2 Large
python test.py --dataset {dataset} --gpt2 gpt2-large --method channel --out_dir out/gpt2-large --do_zeroshot
# Direct MetaICL
python test.py --dataset {dataset} --gpt2 metaicl --method direct --out_dir out/metaicl --do_zeroshot
# Channel MetaICL
python test.py --dataset {dataset} --gpt2 channel-metaicl --method channel --out_dir out/channel-metaicl --do_zeroshot
# Direct GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method direct --out_dir out/gpt-j --do_zeroshot
# Channel GPT-J
python test.py --dataset {dataset} --gpt2 gpt-j-6B --method channel --out_dir out/gpt-j --do_zeroshot
# GPT-3
python test_gpt3.py --dataset {dataset} --gpt3 {ada|babbage|curie|davinci} --method {direct|channel} --out_dir out/gpt3 --do_zeroshot --api {API key}
```