### Overall response
We thank the reviewers for their kind comments and valuable input. Before proceeding with in-depth responses, we highlight strengths of our work noted by reviewers.
* Our empirical results are strong and extensive. (reviewers WMJS, zepo, WDsk, BQsX)
* Our **unsupervised approach is novel** (reviewers zepo and WDsk) and **effective and intuitive** (reviewers BQsX and WMJS).
The reviewers raised several questions and suggestions they wished to see addressed. We appreciate these and respond to all of them below.
### Reviewer WMJS (Score: 5)
Thank you for noting the **efficacy of our method** and the strength of our evaluation!
* **On what the bias direction represents.** As suggested by the reviewer, we study the direction captured by SteerFair. More specifically, we seek to determine whether the direction also captures core information related to the task. Our setup is as follows: we run the SteerFair direction-finding procedure on ScienceQA (2 options) MCQ samples that we synthetically bias by moving all ground-truth answers to the first position (A) in one file and to the second position (B) in another. We then find the bias directions using these files (bias toward the first option from the first file and toward the second option from the second file). Intuitively, if SteerFair captures core information directions, the performance should drop significantly.
|Method|Avg%|Std|
|:--|:-:|:-:|
|Vanilla model (LLaVA 13B)|65.45%|0.026|
|ITI (supervised, 500)|64.05%|0.015|
|SteerFair + *biased samples*|65.86%|**0.0052**|
|SteerFair + non-biased samples|**67.64%**|0.0068|
The results show that SteerFair with biased samples incurs a slight Avg% degradation (~2% relative to non-biased samples), suggesting that the direction does capture some core information. However, because the degradation is only slight, **the direction found by SteerFair is still mainly influenced by the bias and not the core information**.
* **On principal components.** We pick only the first principal component (PC) for ease of computation when combining multiple directions from different bias rules (e.g., {'choose the first option,' 'choose the second option'}, Section 3.6). As suggested by the reviewer, we analyze the other PCs by running SteerFair with each of them as the steering direction on ScienceQA (2 options) with the LLaVA 13B model (a brief code sketch of the PC extraction follows at the end of our response to this reviewer).
|PC|Avg%|Std|
|:-:|:-:|:-:|
|1|66.67%|0.034|
|2|65.56%|0.042|
|3|66.82%|0.021|
|4|59.71%|0.017|
The results above show that the top 3 PCs perform similarly, while the 4th PC substantially reduces Avg%. This suggests that the bias is spread across the top PCs, whereas lower PCs may also capture core knowledge, hence the lower Avg%. This is undesirable, as we want to preserve original model performance while reducing the bias.
* **On top-K attention heads.** Interestingly, the top attention heads are scattered across the middle layers (10th to 20th). We plot the PCA values of all attention heads of the LLaVA 13B model (we pick the top-K attention heads with the largest values) on the ScienceQA and VGR datasets: [anonymized link](https://anonymous.4open.science/r/steerfair_figs-DD5B/).
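As referenced above, the following is a minimal sketch (illustrative Python, not our exact implementation) of how the k-th PC of a head's activations can be extracted and used as a candidate bias direction; array names such as `head_activations` are placeholders.

```python
# Minimal sketch (illustrative, not our exact code): extract the k-th principal
# component of attention-head activations as a candidate bias direction.
import numpy as np
from sklearn.decomposition import PCA

def kth_pc_direction(head_activations: np.ndarray, k: int = 1) -> np.ndarray:
    """head_activations: (num_samples, head_dim) activations collected from
    prompts that all follow one bias rule (e.g., 'answer is always option A').
    Returns the unit-norm k-th principal component (k=1 is the first PC)."""
    pca = PCA(n_components=k).fit(head_activations)
    direction = pca.components_[k - 1]           # shape: (head_dim,)
    return direction / np.linalg.norm(direction)

# Example: compare the 1st vs. 4th PC as steering directions for one head.
acts = np.random.randn(500, 128)                 # placeholder activations
d1, d4 = kth_pc_direction(acts, 1), kth_pc_direction(acts, 4)
```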
### Reviewer zepo (Score: 4)
Thank you for noting the **novelty of our work** and the strength of our evaluation results!
* **On applications.** Our experiments show that SteerFair is effective in question-answering settings (MCQ and yes/no questions), which are widely adopted in LLM-centric scenarios such as automatic LLM evaluation [1,2]. Additionally, in **Appendix F**, we present results showing that SteerFair can be adapted to open-ended generation tasks. By leveraging bias directions extracted from toxic word corpora, **SteerFair effectively steers the model away from generating toxic content**.
* **On the motivation for using PCA and QR decomposition.**
* PCA: Given samples demonstrating a rule (e.g., always choose the first option), our technique performs PCA on these samples and takes the first principal component (PC) (direction that captures the most variance/pattern in the samples) as the bias direction.
* QR decomposition: Given directions (vectors in the latent space) from multiple bias rules, we first use QR decomposition to find an orthogonal basis for the directions, removing correlations between them, before taking their average (a brief code sketch of this step follows after this list of responses).
* **On SteerFair improvement in experimental results.** SteerFair aims to mitigate foundation model bias while retaining model performance. While a biased model may exhibit high average accuracy (Avg%), it can also display a high standard deviation (Std) due to its tendency to favor specific prompt orderings. This variability with prompt ordering is undesirable, as it introduces inconsistency in model behavior. Our method does not aim to enhance the average accuracy per se, but rather to sustain it while mitigating bias towards any particular prompt ordering, thereby reducing the standard deviation. Our experimental results show that **we maintain (and in some cases even improve) the Avg% while significantly reducing the Std**.
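As referenced above, here is a minimal sketch (illustrative Python, not our exact implementation) of how directions from multiple bias rules could be orthogonalized with QR decomposition and then averaged into a single steering direction.

```python
# Minimal sketch (illustrative): combine bias directions from multiple rules.
# QR decomposition yields an orthonormal basis spanning the rule directions,
# removing correlations between them before averaging.
import numpy as np

def combine_bias_directions(directions: np.ndarray) -> np.ndarray:
    """directions: (num_rules, dim) array, one bias direction per rule
    (e.g., 'choose the first option', 'choose the second option').
    Returns a single unit-norm combined direction."""
    Q, _ = np.linalg.qr(directions.T)   # Q: (dim, num_rules), orthonormal columns
    combined = Q.mean(axis=1)           # average the orthogonal bases
    return combined / np.linalg.norm(combined)

# Example with two hypothetical 4096-dimensional rule directions.
rule_dirs = np.random.randn(2, 4096)
v_bias = combine_bias_directions(rule_dirs)
```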
[1] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
[2] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
### Reviewer WDsk (Score: 6)
Thank you for noting the robustness of our unsupervised method and our comprehensive evaluation!
* **On proprietary foundation models.** Our method's primary strength lies in its universal applicability across any transformer architecture. This versatility means it can be leveraged by closed-source model developers at their discretion, offering them a powerful tool to enhance their models' performance and mitigate biases. Furthermore, there is a noticeable trend towards open-sourcing models, exemplified by recent initiatives like [1], and an increasing number of companies are embracing transparency by releasing their previously proprietary models as open source, as evidenced by [2, 3].
* **On correlation between representation and output.** In Appendix H, we present exhaustive non-averaged results showing that SteerFair consistently preserves the original model accuracy while significantly reducing sensitivity to prompt variations (Std). **This holds across different model sizes**, from the larger LLaVA 13B to the smaller IDEFICS 9B, indicating that SteerFair's effectiveness is not heavily reliant on the initial quality of the model's representation (i.e., how strongly the representation correlates with the output).
* **On code release.** Yes, we will release the code with paper publication. We also **attached the code zip file** as part of our initial submission.
* **On projecting away the bias direction from the representation.** We first want to highlight the key difference between SteerFair and [4]: ours does not seek to find word embeddings that are invariant to certain concepts (e.g., gender). Instead, we aim to identify the direction in the representation space that best encapsulates the bias and subtract it during inference; our method performs a straightforward subtraction rather than a more complex vector projection. Secondly, our method is an **inference-time procedure, so it does not modify the original weights/embeddings of the model**. Instead, we modify the activation values during inference (a brief code sketch is provided at the end of our response, below), so the model remains intact for any other downstream task.
* **On handling multiple concept subspaces.** In its current form, our method can only handle multiple bias directions of the same nature. For example, in the order-bias problem, our QR decomposition + averaging handles combining the multiple directions for the different bias rules (e.g., bias to the first option, to the second, etc.).
* **On the two-means approach and other debiasing methods.** In our understanding, the two-means approach used in [4] finds a center point of rotation to ensure proper orthogonalization of multiple concepts. We would like to restate our point from above: in SteerFair, we do not seek to find embeddings invariant to certain concepts; we only want to find the direction that best represents the bias and steer activation values away from it during inference.
In our comparisons with other debiasing methods, we specifically focus on techniques that modify a model's activation values, such as [5]. We deliberately exclude methods that target debiasing word embeddings due to the fundamental differences between the operations involved in modifying word embeddings (vectors) and activation values (one vector per attention head per layer).
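As referenced above, the following is a minimal sketch (illustrative PyTorch, not our exact implementation) of inference-time steering via a forward hook that subtracts a precomputed bias direction from a module's activations; the module path and names are hypothetical and depend on the backbone.

```python
# Minimal sketch (assumptions: a PyTorch transformer backbone and a precomputed,
# unit-norm bias direction for a targeted layer; names are illustrative).
# The original weights are untouched; activations are only shifted at inference.
import torch

def make_steering_hook(bias_direction: torch.Tensor, alpha: float = 1.0):
    """Returns a forward hook that subtracts alpha * bias_direction from a
    module's output activations."""
    def hook(module, inputs, output):
        return output - alpha * bias_direction.to(output.device, output.dtype)
    return hook

# Example usage (hypothetical module path for a LLaMA-style backbone):
# v_bias = torch.tensor(...)  # direction found via PCA/QR, shape (hidden_dim,)
# handle = model.model.layers[15].self_attn.o_proj.register_forward_hook(
#     make_steering_hook(v_bias, alpha=5.0))
# ...generate as usual...
# handle.remove()             # removing the hook fully restores the model
```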
[1] Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A. M., Kiela, D., et al. Obelisc: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527, 2023.
[2] Jiang, Albert Q., et al. "Mistral 7B." arXiv preprint arXiv:2310.06825 (2023).
[3] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
[4] Aboagye, Prince Osei, et al. "Interpretable debiasing of vectorized language representations with iterative orthogonalization." The Eleventh International Conference on Learning Representations. 2022.
[5] Li, Kenneth, et al. "Inference-time intervention: Eliciting truthful answers from a language model." Advances in Neural Information Processing Systems 36 (2024).
### Reviewer BQsX (Score: 6)
Thank you for noting the intuitive nature of our method and the strength of our evaluation!
* **On use cases outside multiple-choice settings.** We agree that the open-ended use case is an important task, and we will move it to the main body of the revised version of our paper. The main body of our current version focuses on order bias in question-answering tasks, which are widely adopted in LLM-centric scenarios such as automatic LLM evaluation [1,2], where an LLM is tasked with assessing the quality of model-generated answers, a process heavily reliant on accurate question answering. We believe that mitigating order bias is an essential step towards better automatic evaluation of LLMs.
* **On results description, typos, and writing suggestions.** Thank you for pointing these out! We have amended the wording of our results description, fixed the table highlighting and the typos, and will include these changes in our revised version.
* **On the figure in Section 5.3.** The plot in Figure 4 of Section 5.3 shows directions for two different bias rules found by two separate PCA processes. It shows **not** the first 2 PCs of the same PCA, but the 1st PC of each of two different PCA processes, so the two directions are not orthogonal by construction (a small sketch below illustrates this).
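To make the distinction concrete, here is a minimal sketch (illustrative Python, with synthetic data standing in for actual head activations) showing that the first PCs of two separate PCA fits are generally not orthogonal, unlike the first two PCs of a single PCA.

```python
# Minimal sketch (synthetic data): first PCs of two *separate* PCA fits on two
# different bias-rule sample sets need not be orthogonal, unlike the first two
# PCs of a single PCA fit.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
acts_rule_a = rng.normal(size=(500, 64))   # stand-in for 'always option A' activations
acts_rule_b = rng.normal(size=(500, 64))   # stand-in for 'always option B' activations

pc_a = PCA(n_components=1).fit(acts_rule_a).components_[0]
pc_b = PCA(n_components=1).fit(acts_rule_b).components_[0]

# Typically nonzero: directions from separate PCA processes are not orthogonal.
print(float(np.dot(pc_a, pc_b)))
```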