# ICLR 2025 Bias Node Pruning Rebuttals

# Reviewer dYSz

![image](https://hackmd.io/_uploads/r1h6s-Zfyg.png)

<!-- ##### To me, this feels like a ChatGPT reviewer ![image](https://hackmd.io/_uploads/HJ2RrGHzJl.png) ##### Here's the evaluation for the third reviewer, for comparison ![image](https://hackmd.io/_uploads/BkzX8MHfJe.png) -->

#### 1. Pruning model weights can influence model behavior in unforeseen ways.

Thank you for your insightful point. First, we would like to note that extensive research has been conducted on LLM parameter pruning for efficient modeling or general-purpose applications [1][2]. However, our approach involves pruning only a very small fraction of the model parameters. For example, in the case of Llama-3, which has 8 billion parameters, we prune just 32 nodes (approximately 0.05% of the total model size).

To support this point, we evaluated Llama-3's performance on two general NLP tasks, Sentiment Analysis and Text Summarization, after pruning 8, 16, and 32 nodes. For Sentiment Analysis, we used the "Multi-class Sentiment Analysis Dataset" [3], and for Text Summarization, we used the "CNN/DailyMail Dataset" [4]. The results are presented in the tables below, with the top table corresponding to Sentiment Analysis and the bottom table to Text Summarization. We observed a slight decline in performance as more nodes were pruned; however, the degradation was not severe enough to significantly affect general linguistic performance. Given that our method is specifically designed for multiple-choice question (MCQ) tasks, we believe that a minor decrease in performance on general NLP tasks is not a significant concern.

| # Pruned Nodes | F1 | Acc |
|:---:|:---:|:---:|
| 0 | 32.7 | 22.0 |
| 8 | 32.7 | 22.7 |
| 16 | 31.7 | 20.2 |
| 32 | 31.3 | 20.6 |

| # Pruned Nodes | ROUGE-L | ROUGE-1 |
|:---:|:---:|:---:|
| 0 | 13.8 | 20.4 |
| 8 | 13.8 | 20.2 |
| 16 | 11.8 | 17.1 |
| 32 | 11.5 | 16.6 |

[1] Ma, Xinyin et al. "LLM-Pruner: On the Structural Pruning of Large Language Models." NeurIPS 2023.
[2] Dong, Harry et al. "Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation." COLM 2024.
[3] https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
[4] https://huggingface.co/datasets/abisee/cnn_dailymail

#### 2. Marginal performance improvements

To assess the significance of the improvements, we ran the experiment eight times with randomly permuted choices. The mean performance values for all three datasets are presented in the tables below, with standard deviations shown in parentheses. **All improvements were statistically significant, with t-test p-values below 0.001.** We have also updated the manuscript by adding the results in Appendix D.1.
| ARC-Challenge | Acc | F1 | RSD | CKLD |
|---|---|---|---|---|
| Llama-3 | 53.2 (1.3) | 55.4 (1.3) | 0.640 (0.142) | 0.485 (0.049) |
| Llama-3 + BNP | 57.4 (1.0) | 58.0 (1.1) | 0.533 (0.145) | 0.304 (0.029) |
| Llama-3 + AOI | 62.7 (1.0) | 63.0 (1.1) | 0.417 (0.133) | 0.201 (0.023) |
| Llama-3 + BNP + AOI | 66.8 (1.0) | 66.6 (0.9) | 0.340 (0.140) | 0.121 (0.010) |

| MMLU-Redux | Acc | F1 | RSD | CKLD |
|---|---|---|---|---|
| Llama-3 | 39.8 (1.6) | 44.4 (1.8) | 0.982 (0.097) | 0.673 (0.063) |
| Llama-3 + BNP | 40.8 (1.7) | 44.8 (1.8) | 0.936 (0.100) | 0.595 (0.065) |
| Llama-3 + AOI | 44.5 (1.8) | 47.0 (2.0) | 0.657 (0.097) | 0.384 (0.042) |
| Llama-3 + BNP + AOI | 45.4 (1.6) | 47.5 (1.8) | 0.564 (0.018) | 0.346 (0.041) |

| CommonsenseQA | Acc | F1 | RSD | CKLD |
|---|---|---|---|---|
| Llama-3 | 63.3 (1.1) | 64.2 (0.9) | 0.282 (0.026) | 0.106 (0.018) |
| Llama-3 + BNP | 64.9 (1.1) | 65.2 (1.1) | 0.222 (0.012) | 0.073 (0.007) |
| Llama-3 + AOI | 65.9 (0.9) | 66.3 (0.8) | 0.220 (0.020) | 0.069 (0.010) |
| Llama-3 + BNP + AOI | 67.2 (0.6) | 67.2 (0.6) | 0.175 (0.011) | 0.052 (0.004) |

#### 3. AOI provides the best overall results; this is expected, given similar behavior observed in models with the SQuAD v1 and SQuAD v2 datasets.

Thanks for your comment. It is not clear to us which specific work is being referred to here. If [1] is the paper in question, we would like to note that it emphasizes the importance of models recognizing unanswerable questions but does not specifically discuss the effect of including an "I don't know" option for multiple-choice question-answering tasks.

[1] Rajpurkar, Pranav et al. "Know What You Don’t Know: Unanswerable Questions for SQuAD." ACL 2018.

#### 4. Limited number of datasets and the method addresses only one form of bias.

We have conducted experiments on four different datasets: ARC-Challenge, MMLU-Redux, CommonsenseQA, and HellaSwag (Appendix). Considering that the ICLR 2024 Spotlight paper [1] reported results on only three datasets, we believe that using four datasets provides a sufficient basis to demonstrate the effectiveness and robustness of our proposed approaches.

Furthermore, we acknowledge that LLMs are subject to various types of biases. However, the focus of our work is specifically on Selection Bias. Our goal is not to address other forms of bias, such as demographic or cultural biases, but rather to study and mitigate Selection Bias in the context of multiple-choice question-answering tasks.

Also, the reviewer has mentioned the work of Mikula et al. (2024) without reference to a specific paper. **Could you please specify which paper you are referring to?**

We sincerely hope this provides additional context and clarifies the scope of our study.

[1] Zheng, Chujie et al. "Large Language Models are Not Robust Multiple Choice Selectors." ICLR 2024.

#### 5. Impact of data subset size on performance

<!-- Since we are demonstrating a zero-shot inference task, the size of the dataset does not influence the general trend of the results. Essentially, reducing the size of the dataset is equivalent to reducing the size of the test set, which could impact the statistical robustness of individual results but does not alter the overall trends or conclusions drawn from the analysis. -->

As we are demonstrating a zero-shot inference task, the size of the test dataset does not affect the general trends observed in the results. Also, the size of the training set has minimal impact on the overall performance of the extracted bias vector.
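For reference, below is a minimal sketch of the kind of bias-vector extraction discussed above (the mean final-layer embedding of the incorrectly answered choice permutations minus that of the correctly answered ones, in the spirit of $\mathbf{b}_\mathbf{x}$ in Figure 4(a) and Equation (1)). The function names, tensor shapes, and sign convention are illustrative assumptions, not our exact implementation.

```python
import torch

def question_bias_vector(correct_embs: torch.Tensor, incorrect_embs: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch only.

    correct_embs / incorrect_embs: (num_permutations, hidden_dim) final-layer
    embeddings of choice-permuted copies of one question, split by whether the
    model answered them correctly. Taking the difference of the two set means
    cancels the question's semantic content and keeps the ordering-dependent
    ("incorrectness") component.
    """
    return incorrect_embs.mean(dim=0) - correct_embs.mean(dim=0)

def dataset_bias_vector(per_question_vectors: list[torch.Tensor]) -> torch.Tensor:
    # Averaging over many questions further smooths out question-specific effects,
    # which is why the extracted vector is largely insensitive to the subset size.
    return torch.stack(per_question_vectors).mean(dim=0)
```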
## Note to the AC / PC

Subject: Concern Regarding Reviewer dYSz

Dear Area Chair and Program Chair,

We hope this message finds you well. We are writing to respectfully raise a concern regarding Reviewer dYSz, as we suspect that the reviewer’s comments may have been generated by an AI system. We share our concerns in the hope that they may be considered as part of the decision-making process. Our reasons for this suspicion are as follows:

- The reviewer criticizes our work for not addressing other types of biases, which is explicitly outside the stated scope of our research. This style of critique, focusing on unrelated areas, is commonly observed in AI-generated content.
- The review predominantly comprises high-level comments (e.g., marginal performance gain, number of datasets used, and out-of-scope suggestions). While each point can be meaningful in itself, the lack of detailed feedback raised our concerns.
- The reviewer refers to prior work without providing concrete references.
- We analyzed the reviewer’s Summary and Weaknesses using an [AI detection tool](https://quillbot.com/ai-content-detector), which indicated a 73% likelihood of AI authorship. In comparison, the likelihoods for the other reviewers were 0%.

We share these observations in the hope of ensuring fairness and maintaining the integrity of the review process. While we do not wish to overstep or draw definitive conclusions, we felt it was important to bring this to your attention for consideration. Thank you very much.

Best regards,
the authors

# Reviewer zisy

![image](https://hackmd.io/_uploads/BJR12--G1l.png)

#### Q1. Section 2 lacks details and needs clarification.

#### -- "Is Figure 2 evaluated based on zero-shot or in-context learning?"

Figure 2 is derived under the zero-shot setting. We have updated the Figure 2 caption in the manuscript to make this clear. Thank you for pointing it out.

#### -- "Are there any scaling trends?"

We extracted the "RSD / CKLD" values from Table 3 and conducted additional experiments with a larger variant of Llama-3, as summarized in the table below. Overall, larger models tend to exhibit lower Selection Bias across the three datasets. However, this trend is largely dependent on the specific dataset being evaluated.

| Model | Param Size | ARC-Challenge | MMLU-Redux | CSQA |
|---|:---:|:---:|:---:|:---:|
| Bloomz | 7 Billion | 0.703 / 0.208 | 1.102 / 0.523 | 0.252 / 0.142 |
| Mistral | 7 Billion | 0.140 / 0.036 | 0.216 / 0.069 | 0.155 / 0.031 |
| Llama-3 | 8 Billion | 0.086 / 0.007 | 0.184 / 0.034 | 0.051 / 0.003 |
| Llama-3 | 70 Billion | 0.024 / 0.002 | 0.122 / 0.019 | 0.073 / 0.003 |
| Claude-3-Haiku | Unknown | 0.095 / 0.024 | 0.057 / 0.008 | 0.587 / 0.331 |
| Claude-3-Sonnet | 180 Billion | 0.034 / 0.001 | 0.113 / 0.024 | 0.072 / 0.015 |

#### -- "Are the open-weight LMs and black-box LMs evaluated using the same criterion? If not, does the evaluation criterion matter?"

In Figure 2, the open-weight LLMs (Llama3, Bloomz, Mistral) are evaluated based on choice token probability, and the black-box LLMs (Claude3-Sonnet) are evaluated based on the Jaccard similarity of their outputs. While this distinction does not undermine the core findings, we acknowledge that it could influence evaluation results. We have added a clarification on the selection bias evaluation criteria in the Figure 2 caption.

<!-- #### -- "I think the current evaluation is too brief to draw a holistic conclusion."

Thank you for the suggestions. We have revised section 2 of the manuscript to contain more evaluation details.
- <span style="color:lightgreen">Revise Manuscript</span> -->

#### 2. Demonstrate BNP in black-box settings with open-weight models

We tried applying BNP to the open-weight model parameters and evaluated the models in black-box settings. The results are presented in the tables below. The impact of BNP is mixed in this setup, though it shows greater effectiveness on particular datasets, such as CommonsenseQA.

| CommonsenseQA | Acc | F1 | RSD | CKLD |
|---|:---:|:---:|:---:|:---:|
| Llama-3 | 69.9 | 69.8 | 0.051 | 0.003 |
| Llama-3 + AOI | 71.3 | 71.2 | **0.030** | **0.003** |
| Llama-3 + AOI + BNP | **71.4** | **71.3** | 0.053 | 0.007 |
| Bloomz | 55.9 | 55.3 | 0.252 | 0.142 |
| Bloomz + AOI | 59.2 | 58.2 | 0.180 | 0.105 |
| Bloomz + AOI + BNP | **61.8** | **61.6** | **0.132** | **0.044** |
| Mistral | 54.6 | 54.8 | 0.155 | 0.031 |
| Mistral + AOI | 62.8 | 62.8 | **0.082** | **0.013** |
| Mistral + AOI + BNP | **63.6** | **63.6** | 0.090 | **0.013** |

| ARC-Challenge | Acc | F1 | RSD | CKLD |
|---|:---:|:---:|:---:|:---:|
| Llama-3 | 65.7 | 65.8 | 0.086 | 0.007 |
| Llama-3 + AOI | **66.9** | **66.9** | **0.076** | **0.007** |
| Llama-3 + AOI + BNP | 66.0 | 66.2 | 0.135 | 0.020 |
| Bloomz | 41.9 | 42.6 | 0.703 | 0.208 |
| Bloomz + AOI | 44.7 | 45.0 | **0.305** | 0.155 |
| Bloomz + AOI + BNP | **45.6** | **45.4** | 0.513 | **0.030** |
| Mistral | 55.2 | 55.2 | 0.140 | 0.036 |
| Mistral + AOI | **59.0** | 59.0 | **0.117** | **0.020** |
| Mistral + AOI + BNP | **59.0** | **59.1** | **0.117** | 0.021 |

| MMLU-Redux | Acc | F1 | RSD | CKLD |
|---|:---:|:---:|:---:|:---:|
| Llama-3 | 51.9 | 52.2 | 0.184 | 0.034 |
| Llama-3 + AOI | **52.6** | **53.0** | **0.177** | **0.033** |
| Llama-3 + AOI + BNP | 51.9 | 52.5 | 0.214 | 0.050 |
| Bloomz | 27.6 | 31.0 | 1.102 | 0.523 |
| Bloomz + AOI | 29.4 | **31.8** | 0.972 | 0.413 |
| Bloomz + AOI + BNP | **29.6** | 30.6 | **0.627** | **0.124** |
| Mistral | 47.4 | 47.6 | 0.216 | 0.069 |
| Mistral + AOI | **48.5** | **48.8** | **0.217** | 0.069 |
| Mistral + AOI + BNP | 48.4 | 48.7 | **0.217** | **0.068** |

#### 3. Rationale for using AOI to mitigate Selection Bias

When collecting answers from humans, including an "I don't know" response can improve data quality [1]. Because the models were more likely to show selection bias when they were incorrect, we hypothesized that offering an "I don't know" option would improve the quality of the responses provided by the model.

[1] Converse, Jean M., and Stanley Presser. 1986. Survey Questions: Handcrafting the Standardized Questionnaire. Beverly Hills, CA: Sage.

#### 4. How AOI affects the distribution of the choices

The distributional effect of AOI is already presented in Section 6.3, Figure 7. The dark-blue bar can be compared with the light-blue bar to see its effect. For all three datasets, the choice distribution becomes closer to uniform when AOI is applied.

#### 5. It is unclear what $\mathbf{z}$['A'] + $\mathbf{z}$['_A'] means.

'_A' is a token that represents "A" with a space in front of it, whereas 'A' is a one-character token. Since these two represent the same thing, we aggregate their logits $\mathbf{z}$ for accurate evaluation. We have updated the manuscript by including further explanations in Appendix A.2.

#### 6. Why does an LLM need to match the ground truth label ratio?

Consider a scenario in which an LLM exhibits a bias toward selecting option 'A'.
In cases where the LLM is uncertain about the correct answer and resorts to random selection, it is more likely to choose 'A', resulting in a skewed overall choice distribution that diverges from the ground truth distribution. In contrast, an unbiased LLM would select options uniformly under uncertainty, producing a choice distribution that more closely aligns with the original ground truth distribution. Therefore, the extent to which an LLM's predictions match the ground truth distribution can serve as a proxy for measuring Selection Bias. We have included this discussion in Appendix C.1 of the manuscript.

#### 7. Statistical significance testing by permuting choices

Thank you for the excellent idea. As suggested, we conducted a significance test by running the experiment eight times with randomly permuted choices. The mean performance values for all three datasets are presented in the tables below, with standard deviations shown in parentheses. **All improvements were statistically significant, with t-test p-values below 0.001.** We have also updated the manuscript by adding the results in Appendix D.1.

| ARC-Challenge | Acc | F1 | RSD | CKLD |
|---|---|---|---|---|
| Llama-3 | 53.2 (1.3) | 55.4 (1.3) | 0.640 (0.142) | 0.485 (0.049) |
| Llama-3 + BNP | 57.4 (1.0) | 58.0 (1.1) | 0.533 (0.145) | 0.304 (0.029) |
| Llama-3 + AOI | 62.7 (1.0) | 63.0 (1.1) | 0.417 (0.133) | 0.201 (0.023) |
| Llama-3 + BNP + AOI | 66.8 (1.0) | 66.6 (0.9) | 0.340 (0.140) | 0.121 (0.010) |

| MMLU-Redux | Acc | F1 | RSD | CKLD |
|---|---|---|---|---|
| Llama-3 | 39.8 (1.6) | 44.4 (1.8) | 0.982 (0.097) | 0.673 (0.063) |
| Llama-3 + BNP | 40.8 (1.7) | 44.8 (1.8) | 0.936 (0.100) | 0.595 (0.065) |
| Llama-3 + AOI | 44.5 (1.8) | 47.0 (2.0) | 0.657 (0.097) | 0.384 (0.042) |
| Llama-3 + BNP + AOI | 45.4 (1.6) | 47.5 (1.8) | 0.564 (0.018) | 0.346 (0.041) |

| CommonsenseQA | Acc | F1 | RSD | CKLD |
|---|---|---|---|---|
| Llama-3 | 63.3 (1.1) | 64.2 (0.9) | 0.282 (0.026) | 0.106 (0.018) |
| Llama-3 + BNP | 64.9 (1.1) | 65.2 (1.1) | 0.222 (0.012) | 0.073 (0.007) |
| Llama-3 + AOI | 65.9 (0.9) | 66.3 (0.8) | 0.220 (0.020) | 0.069 (0.010) |
| Llama-3 + BNP + AOI | 67.2 (0.6) | 67.2 (0.6) | 0.175 (0.011) | 0.052 (0.004) |

#### 8. Performance on larger model families

We also evaluated our methods on Llama-3-70B-Instruct, with the results presented in the tables below. While the model's baseline performance is already exceptionally high, we observe the best performance when applying BNP and/or AOI on all three datasets.

| ARC-Challenge | Acc | F1 | RSD | CKLD |
|---|---|---|---|---|
| Llama-3-70B | 89.6 | 89.6 | 0.024 | 0.002 |
| Llama-3-70B + BNP | 89.7 | 89.6 | 0.024 | 0.002 |
| Llama-3-70B + AOI | 91.0 | 91.0 | 0.010 | 0.000 |
| Llama-3-70B + BNP + AOI | 91.4 | 91.4 | 0.016 | 0.001 |

| MMLU-Redux | Acc | F1 | RSD | CKLD |
|---|---|---|---|---|
| Llama-3-70B | 67.0 | 67.1 | 0.122 | 0.019 |
| Llama-3-70B + BNP | 67.4 | 67.5 | 0.110 | 0.018 |
| Llama-3-70B + AOI | 68.1 | 68.2 | 0.090 | 0.010 |
| Llama-3-70B + BNP + AOI | 68.3 | 68.3 | 0.077 | 0.009 |

| CommonsenseQA | Acc | F1 | RSD | CKLD |
|---|---|---|---|---|
| Llama-3-70B | 77.8 | 77.9 | 0.073 | 0.003 |
| Llama-3-70B + BNP | 78.8 | 78.9 | 0.107 | 0.013 |
| Llama-3-70B + AOI | 79.4 | 79.5 | 0.062 | 0.001 |
| Llama-3-70B + BNP + AOI | 79.8 | 79.8 | 0.082 | 0.009 |

# Reviewer KaKJ

![image](https://hackmd.io/_uploads/SkHz3b-f1e.png)

#### 1. Clarification on the Embedding Difference Analysis in Figure 3(b)

#### 1.1 Four choices are enough to span the 50 tokens in Figure 3(b).
We appreciate your thorough review of our preliminary analysis. To address your concern regarding soundness, we reference one sample from the ARC-Challenge dataset, which was involved in generating Figure 3(b): :::info Which statement best compares single-celled and multi-celled organisms? (A) Tissues in a single-celled organism are like the cells in a multi-celled organism. (B) The nucleus in a single-celled organism is like the skin of a multi-celled organism. \(C) Organelles in a single-celled organism are like the organs in a multi-celled organism. (D) The cytoplasm in a single-celled organism is like the nervous system in a multi-celled organism. ::: In this sample, the Llama-3 tokenizer tokenizes the four answer choices into 92 tokens. Such samples with lengthy answer choices contribute to the difference in the early token locations of Figure 3(b). We also conducted a sanity check on the code and confirmed that **no error exists in our analysis in Figure 3(b).** #### 1.2 Clarification on the meaning of Embedding Difference Thank you for raising the insightful point that "final layer embeddings must differ due to different input choice orderings." While your point is valid, we would like to clarify that the key takeaway from Figure 3(b) is "**the embedding differences are more prominent in the final decoder layers than earlier layers**". In our analysis, we computed the difference between the average embeddings of "correct" and "incorrect" questions (refer to $\mathbf{b}_\mathbf{x}$ in Figure 4(a) and Equation (1)). This eliminates the semantic information of the sample itself, while the factor contributing to the "incorrectness" remains in the difference. Additionally, by averaging across multiple samples, we further smooth out the effect of the differences in input text. Hence, we can infer that **the embedding difference reflects the choice-ordering factor that causes incorrect responses, i.e., the Selection Bias.** <!-- #### 1.2 Clarifying the connection between the Embedding Difference and Selection Bias Please note that in Figure 3(b), we are NOT comparing embeddings on a sample-to-sample basis. Rather, we compare the averaged "correct" embeddings and the averaged "incorrect" embeddings within the set of choice-permuted questions from a sample (see $\mathbf{b}_\mathbf{x}$ in Figure 4(a) and Equation (1) for reference). Taking the difference between these two average embeddings cancels out the semantic information of the sample itself, while the information that contributes to the "incorrectness" remains in the difference. Since the only distinction between the "correct" and "incorrect" question sets is the order of choices, we infer that **this embedding difference reflects the choice-ordering factor (i.e., Selection Bias) that causes incorrect responses**. --> #### 1.3 Further Supporting Analysis To further substantiate our claim that the embedding difference reflects Selection Bias, we present an additional analysis that compares the average intra-difference within the "correct" question set and the average inter-difference between the "correct" and "incorrect" question sets. 
Specifically, for each dataset, we compute:

$$ \mathbf{d}_\text{intra} = \frac{1}{|\mathcal{Z}_+| \times |\mathcal{Z}_+|}\sum_{\mathbf{z}_+^i \in \mathcal{Z}_+} \sum_{\mathbf{z}_+^j \in \mathcal{Z}_+} ||\mathbf{z}_+^i - \mathbf{z}_+^j||_2^2 $$

$$ \mathbf{d}_\text{inter} = \frac{1}{|\mathcal{Z}_-| \times |\mathcal{Z}_+|}\sum_{\mathbf{z}_-^i \in \mathcal{Z}_-} \sum_{\mathbf{z}_+^j \in \mathcal{Z}_+} ||\mathbf{z}_-^i - \mathbf{z}_+^j||_2^2, $$

where $\mathcal{Z}_+, \mathcal{Z}_-$ are the "correct" and "incorrect" embedding sets, respectively.

If the embedding difference does NOT reflect Selection Bias, the average intra-difference within the "correct" embeddings ($\mathbf{d}_\text{intra}$) should be comparable to the average inter-difference between "correct" and "incorrect" embeddings ($\mathbf{d}_\text{inter}$). However, as shown in the table below, $\mathbf{d}_\text{inter}$ consistently exhibits higher values than $\mathbf{d}_\text{intra}$. This observation suggests that the embedding difference captures information correlated with the "incorrectness" of certain choice orderings, thereby reflecting Selection Bias.

|| d intra | d inter |
|---|:---:|:---:|
| ARC-Challenge | 18.97 | 21.20 |
| MMLU-Redux | 20.10 | 23.17 |
| CommonsenseQA | 25.82 | 26.55 |

#### 2. Clarification on the definition of Selection Bias and its relation to metrics.

<!-- With all due respect, we believe the reviewer has misunderstood the definition of Selection Bias. -->
With all due respect, we believe the reviewer may have interpreted the definition of Selection Bias differently. Here, we will clarify its definition and address all the questions raised by the reviewer.

#### 2.1 Where is the "dataset" in the definition of Selection Bias?

In all previous works, Selection Bias has been discussed **in the context of specific datasets** [1][2]. Likewise, we define Selection Bias as the discrepancy between the model's selection under the original choice ordering (of the dataset) and the expected (average) model selection over all possible choice orderings (of the dataset). Consequently, the Selection Bias metrics (RSD, RStd, and our proposed CKLD) are inherently conditioned on the dataset being evaluated. This framing is natural, as it accounts for the diversity in question difficulty and format across different datasets.

Thank you for bringing this to our attention. We have revised Section 2.1 of the manuscript to provide clearer definitions and have highlighted the modifications in blue.

[1] Zheng, Chujie et al. "Large Language Models are Not Robust Multiple Choice Selectors." ICLR 2024.
[2] Wei, Sheng-Lun et al. "Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models." ACL 2024.

#### 2.2 Addressing the reviewer's questions

Q1: *"a robust metric should provide similar scores across different datasets, as seen with Rstd and RSD"*

A1: Datasets vary in difficulty and question format, both of which impact Selection Bias. For example, a model may exhibit lower Selection Bias on easier datasets because its confidence in selecting the correct answer outweighs its preference for specific choice options. Thus, **Selection Bias metrics should yield different scores for different datasets**. Furthermore, contrary to the reviewer's claim, the RSD values reported in Table 1 of the manuscript differ significantly across the three datasets.
Similarly, as shown in an ICLR 2024 Spotlight paper [1], gpt-3.5-turbo's RStd values for ARC-Challenge, MMLU, and CSQA were 3.3, 5.5, and 2.2, respectively. These results further underscore the variability of Selection Bias metrics across datasets.

[1] Zheng, Chujie et al. "Large Language Models are Not Robust Multiple Choice Selectors." ICLR 2024 Spotlight.

Q2: *"the authors likely made an error in their code because while RSD appears consistent under different ground truth distributions, Rstd varies"*

A2: It is unlikely that we made an error in the RStd code, as the core part is taken from the code repository of [1] ([code repo](https://github.com/chujiezheng/LLM-MCQ-Bias/blob/e78748b673346f3b307e728e68eb48a0f1baf2bc/code/debias_pride.py#L153)). We share our RStd and RSD code below.

```python
import numpy as np
from sklearn.metrics import classification_report

CHOICES = "ABCDEFGHIJK"

def rstd(preds, labels):
    # Standard deviation (in percentage points) of the per-choice 'recall' values
    # returned by classification_report.
    report = classification_report(preds, labels, output_dict=True, zero_division=0)
    recalls = []
    for choice in CHOICES:
        if choice in report.keys():
            recalls.append(report[choice]['recall'] * 100)
    return np.round(np.std(recalls), 4)

def rsd(preds, labels):
    # Relative standard deviation of the per-choice accuracies; classification_report
    # is only used here to enumerate which choice labels appear.
    report = classification_report(preds, labels, output_dict=True, zero_division=0)
    accs = []
    for choice in CHOICES:
        if choice in report.keys():
            choice_corr = [1 if pred == label and label == choice else 0 for pred, label in zip(preds, labels)]
            choice_support = [1 if label == choice else 0 for label in labels]
            acc_choice = sum(choice_corr) / sum(choice_support) if sum(choice_support) != 0 else 0.0
            accs.append(acc_choice)
    avg_acc = sum(accs) / len(accs)
    acc_var = [(x - avg_acc) ** 2 for x in accs]
    # Overall accuracy used for normalization (assumed definition; computed here for self-containment).
    acc = np.mean([pred == label for pred, label in zip(preds, labels)])
    rsd_acc = np.sqrt(np.mean(acc_var)) / acc if acc != 0 else -1
    return np.round(rsd_acc, 4)
```

[1] Zheng, Chujie et al. "Large Language Models are Not Robust Multiple Choice Selectors." ICLR 2024 Spotlight.

Q3: *"if a classifier randomly outputs one of the options with a 0.25 probability, under CKLD, the model would not receive a good selection bias score because it deviates significantly from the dataset’s ground truth distribution."*

A3: Yes, that is correct! CKLD measures the divergence between the choice and the answer distribution but does not factor in the model's performance. This is why we emphasized in lines 304–305 that "it is important to refer to multiple metrics for a robust assessment." We believe that performance-based metrics (e.g., RSD) and distribution-based metrics (e.g., CKLD) are complementary and should be used together to ensure comprehensive evaluation.

Q4: *"The authors state that RSD gives the lowest score when the selection rate for 'A' is 0.25, as if this is problematic."*

A4: The problem with RSD is that it gives the lowest score **regardless of the answer choice distribution in the dataset**. Ideally, the distribution of the unbiased prediction should roughly be similar to the dataset's answer choice distribution.

- May you report how many samples are able to cover 50 tokens with 4 options and how many of them not?

Thank you for the follow-up question. For Figure 3(b), 31.3% of the samples had answer choices spanning over 50 tokens. We believe this is enough to highlight the earlier token locations in the figure.

- I am not saying there is no difference on the final decoder layer or similar to this. Of course, the bias problem may be reflected in the final decoder layer but this is a trivial claim. Every difference in the earlier layers may exist in the final layers because of the residual streams. It doesn't show the final decoder layer makes a real contribution.
In summary: I criticize your claim which is "Selection bias stems from the final decoder layers". This is simply not supported and is very likely to be wrong.

We understand your concern regarding the potential overstatement of pinpointing the final decoder layers as the source of selection bias. In response to your thoughtful comments, we have revised our claim in Section 2.2 to state that "Selection bias is prominently captured in the final decoder layers." This way, we are not overstating that the final decoder layers are the origin of the bias, but rather emphasizing that the bias is more readily observed in these layers compared to earlier ones. Additionally, we would like to note that this point serves as an intermediate analysis to motivate the design of our Bias Node Pruning method. As such, this change in the manuscript does not affect the overall conclusions of the paper. Thank you for your insightful and careful review.

- Could you please provide the standard deviations for the mean experiment as well? This mean difference is not meaningful without stds. Besides, even though the stds are low, how do you attribute these results to selection bias? If you repeat this experiment for different layers, you may observe a similar pattern. You cannot make big claims by showing a correlation with 3 mean values. Something more convincing and possibly a causal relationship should be shown.

Here, we provide the t-test results for the mean values and have also conducted the same experiment on the median and first layers. As the layer moves closer to the input layer, the scale of the differences decreases and the t-test p-values increase. This indicates that the difference between $\mathbf{d}_\text{intra}$ and $\mathbf{d}_\text{inter}$ is more statistically significant in the final layer, suggesting that selection bias is captured more towards the last decoder layers.

| Final Layer | d intra | d inter | t-test p-value |
|---|:---:|:---:|:---:|
| ARC-Challenge | 18.97 | 21.20 | 0.005 ** |
| MMLU-Redux | 20.10 | 23.17 | 0.016 ** |
| CommonsenseQA | 25.82 | 26.55 | 0.040 * |

| Median Layer | d intra | d inter | t-test p-value |
|---|:---:|:---:|:---:|
| ARC-Challenge | 0.40 | 0.46 | 0.095 |
| MMLU-Redux | 0.56 | 0.64 | 0.195 |
| CommonsenseQA | 0.43 | 0.47 | 0.204 |

| First Layer | d intra | d inter | t-test p-value |
|---|:---:|:---:|:---:|
| ARC-Challenge | 0.0054 | 0.0059 | 0.424 |
| MMLU-Redux | 0.0080 | 0.0093 | 0.298 |
| CommonsenseQA | 0.0047 | 0.0049 | 0.400 |

2.1 Yes, the model selection bias will be evaluated in a dataset but the selection bias is something the model has or not. It's a model's feature. The dataset here is only for estimating the model's selection bias performance. In some datasets model can perform higher bias, in other datasets it can be lower. If you were able to test a model's selection bias on all possible questions in the universe, then you would get the actual selection bias. The definition shouldn't be a function of dataset D. It should be an expectation over all possible datasets.

Yes, we understand your concern. However, as you may recognize, it is impractical to evaluate a model's Selection Bias across the entire data distribution. Hence, all prior studies on Selection Bias have used dataset performance as a proxy for estimating the level of Selection Bias exhibited by a model.

2.2 "a robust metric should provide similar scores across different datasets, as seen with RStd and RSD".
This claim is TRUE for your proposed synthetic models because that model's bias performance doesn't change from one dataset to another dataset. It will always make some portion correct and the other portion does randomly. However, for real models, the selection bias can be different from one dataset to another because the model for instance can have more selection bias in medical questions than mathematical questions. That's why these two metrics can output different values in different datasets. However, this difference should reflect the model's actual selection bias and should be independent of the choice distribution of the dataset.

2.3 "the authors likely made an error in their code because while RSD appears consistent under different ground truth distributions, Rstd varies".

You don't need to run ANY experiments to get RSTD results with the proposed synthetic classifier. You can calculate it analytically. The recall of option (i) is 0.5 + 0.5 x (predictor's probability of sampling option i). The first element 0.5 comes from the fact that the model predicts 50% of questions correctly no matter what and the second element comes from the probability that the model would predict correctly (by chance) in the questions whose answer is option i. However, your plots show that, for a given classifier, RSTD outputs differently in different dataset distributions. Therefore, there has to be a problem in the code because as you see above the recall of option (i) is not a function of the choice distribution of a dataset. Lastly, you only give the code for calculating the std of recalls but it is likely that how to calculate recall is wrong (still don't need to run a simulation for that).

To our understanding, **the recall for option (i) is NOT "0.5 + 0.5 x (predictor's probability of sampling option i)"**. Rather, it is $0.5 + 0.5 \times P(\hat{y} = i \mid y = i)$. The last probability term, $P(\hat{y} = i \mid y = i)$, is conditioned on the ground truth label, indicating that recall inherently depends on the dataset choice distribution. Therefore, the RStd trend in Figure 5 is not wrong. For the recall function in our implementation, we use the [scikit-learn package](https://scikit-learn.org/0.15/modules/generated/sklearn.metrics.classification_report.html), the same package used in the codebase of the ICLR 2024 Spotlight paper [1]. Also, we have re-verified the evaluation code to ensure it does not contain any errors.

Nonetheless, we greatly appreciate your insightful comment regarding the importance of evaluation metrics being agnostic to the dataset distribution so as to isolate the pure effect of Selection Bias from model performance. However, our findings suggest that none of the currently available metrics fully achieve this goal. We hope this encourages further research to develop metrics that can address this limitation.

[1] Zheng, Chujie et al. "Large Language Models are Not Robust Multiple Choice Selectors." ICLR 2024 Spotlight.

2.4) If a classifier outputs uniformly random then there is no selection bias. This also follows your definition. For any given option order, the model's probability to choose any content is the SAME therefore there is no bias. Mathematically,

2.5) "Ideally, the distribution of the unbiased prediction should roughly be similar to the dataset's answer choice distribution."

This is ideal for a good performant classifier, not a classifier that doesn't have selection bias. You make the same mistake. Please, do not confuse the selection bias and the model performance.
There can be a dummy model that would give terrible performance without having a bias to any option. I agree that "performance-based metrics (e.g., RSD) and distribution-based metrics (e.g., CKLD) are complementary and should be used together to ensure comprehensive evaluation" but your proposed metric cannot replace RSD or RSTD.

As stated in lines 261–262, the experiments in Figure 5 are designed to demonstrate the impact of different data characteristics (i.e., choice distributions) when evaluated using each metric. Hence, we would like to clarify that the synthetic scenario, where half of the predictions are correct and the other half are randomly decided, does not assume a specific model. Rather, the synthetic predictions $\hat{Y}$ are considered to be generated by an arbitrary model with an unknown level of selection bias.

# Reviewer fF6T

![image](https://hackmd.io/_uploads/SkuhhZWzJx.png)

#### 1. Motivation 1 in Section 2.2.

The purpose of this subsection was to motivate the extraction of the "bias vector." However, we understand the potential for confusion it may cause. We have made appropriate modifications to the manuscript to address this issue. Thank you for bringing this to our attention.

#### 2. Justification for AOI with alternative auxiliary options

When collecting answers from humans, including an "I don't know" response can improve data quality [1]. Because the models were more likely to show selection bias when they were incorrect, we hypothesized that offering an "I don't know" option would improve the quality of the responses provided by the model.

To support our claim, we have already included an ablation experiment in Table 4 of the manuscript to demonstrate the effect of alternative auxiliary options. Specifically, we compared the "I don't know" option with "I know the answer." Here, we additionally tested the "None of the above" option, and the results are reported below. Overall, our "I don't know" AOI outperforms the alternatives on nearly every metric. Notably, the alternative option contents can even degrade performance for Mistral and Bloomz. We have updated Section 6.2 of the manuscript to include the results.

| | Acc | F1 | RSD | CKLD |
|---|:---:|:---:|:---:|:---:|
| Llama-3 | 41.8 | 46.7 | 1.021 | 0.589 |
| Llama-3 w/ "None of the above" | 42.4 | 42.7 | 0.833 | 0.487 |
| Llama-3 w/ "I know the answer" | 45.6 | 46.5 | 0.790 | 0.366 |
| **Llama-3 w/ "I don't know"** (Ours) | 48.3 | 50.5 | 0.531 | 0.288 |

| | Acc | F1 | RSD | CKLD |
|---|:---:|:---:|:---:|:---:|
| Mistral | 46.4 | 47.6 | 0.366 | 0.186 |
| Mistral w/ "None of the above" | 48.0 | 47.8 | 0.596 | 0.159 |
| Mistral w/ "I know the answer" | 9.7 | 3.9 | 0.762 | 1.888 |
| **Mistral w/ "I don't know"** (Ours) | 48.6 | 49.3 | 0.309 | 0.140 |

| | Acc | F1 | RSD | CKLD |
|---|:---:|:---:|:---:|:---:|
| Bloomz | 28.0 | 32.8 | 1.003 | 0.661 |
| Bloomz w/ "None of the above" | 26.5 | 25.9 | 0.730 | 0.518 |
| Bloomz w/ "I know the answer" | 28.0 | 26.1 | 0.618 | 0.314 |
| **Bloomz w/ "I don't know"** (Ours) | 32.0 | 33.3 | 0.672 | 0.205 |

[1] Converse, Jean M., and Stanley Presser. 1986. Survey Questions: Handcrafting the Standardized Questionnaire. Beverly Hills, CA: Sage.

#### 3. Impact of BNP on different task performance

We evaluated Llama-3's performance on two general NLP tasks, Sentiment Analysis and Text Summarization, after pruning 8, 16, and 32 nodes. For Sentiment Analysis, we used the "Multi-class Sentiment Analysis Dataset" [1], and for Text Summarization, we used the "CNN/DailyMail Dataset" [2].
The results are presented in the tables below, with the top table corresponding to Sentiment Analysis and the bottom table to Text Summarization. We observed a slight decline in performance as more nodes were pruned; however, the degradation was not severe enough to significantly affect general linguistic performance. Given that our method is specifically designed for multiple-choice question (MCQ) tasks, we believe that a minor decrease in performance on general NLP tasks is not a significant concern.

| # Pruned Nodes | F1 | Acc |
|:---:|:---:|:---:|
| 0 | 32.7 | 22.0 |
| 8 | 32.7 | 22.7 |
| 16 | 31.7 | 20.2 |
| 32 | 31.3 | 20.6 |

| # Pruned Nodes | ROUGE-L | ROUGE-1 |
|:---:|:---:|:---:|
| 0 | 13.8 | 20.4 |
| 8 | 13.8 | 20.2 |
| 16 | 11.8 | 17.1 |
| 32 | 11.5 | 16.6 |

[1] https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
[2] https://huggingface.co/datasets/abisee/cnn_dailymail

#### 4. Why is the final layer particularly sensitive to bias?

Figure 3(b) and Figure 8 present detailed layer-wise analyses across multiple models, consistently showing that the bias is concentrated in the final layers. This aligns with the established understanding that later transformer layers handle higher-level semantic tasks while earlier layers process more basic features. The effectiveness of final-layer pruning (demonstrated in Tables 1-2) empirically validates our focus on this layer. We hypothesize this is because the final layer directly maps to the output probabilities, making it particularly susceptible to systematic biases in token selection. Section 2.2 provides a mathematical analysis showing how the final-layer parameters interact with the bias vector. While expanding this analysis could be interesting future work, our current results already demonstrate the practical utility of targeting this layer.
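For completeness, below is a minimal sketch of the kind of layer-wise comparison referred to above: per layer, the norm of the difference between the mean last-token hidden states of "correct" and "incorrect" choice-permuted prompts. The model name, prompt handling, and use of the last-token state are illustrative assumptions and do not reproduce the exact analysis behind Figure 3(b) and Figure 8.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

@torch.no_grad()
def last_token_states(prompt: str) -> list[torch.Tensor]:
    """Last-token hidden state of every layer (embedding output included) for one prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return [h[0, -1].float().cpu() for h in out.hidden_states]

def layerwise_embedding_difference(correct_prompts: list[str], incorrect_prompts: list[str]) -> list[float]:
    """Per layer: L2 norm of (mean 'correct' state - mean 'incorrect' state)."""
    correct = [last_token_states(p) for p in correct_prompts]
    incorrect = [last_token_states(p) for p in incorrect_prompts]
    diffs = []
    for layer in range(len(correct[0])):
        mean_c = torch.stack([s[layer] for s in correct]).mean(dim=0)
        mean_i = torch.stack([s[layer] for s in incorrect]).mean(dim=0)
        diffs.append(torch.norm(mean_c - mean_i).item())
    return diffs
```

In this sketch, the "correct"/"incorrect" split is assumed to follow the same criterion as in Section 2.2, i.e., whether the model answers the permuted question correctly.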