### Overall response

We thank the reviewers for their kind comments and valuable input. Before proceeding with in-depth responses, we highlight strengths of our work noted by the reviewers:

* Our empirical results are strong and extensive. (reviewers WMJs, zepo, WDsk, BQsX)
* Our **unsupervised approach is novel** (reviewers zepo and WDsk), **effective, and intuitive**. (reviewers BQsX and WMJs)

The reviewers had several questions and suggestions they wished to see addressed. We appreciate these and respond to all of them below.

### Reviewer WMJS (Score: 5)

Thank you for noting the **efficacy of our method** and the strength of our evaluation!

* **On what the bias direction represents.** As suggested by the reviewer, we study the direction captured by SteerFair. More specifically, we seek to find out whether the direction also captures core information related to the task. Our setup is as follows: we run the SteerFair direction-finding procedure on ScienceQA (2 options) MCQ samples where we synthetically bias the ground-truth answers by moving all answers to the first position (A) in one file and to the second position (B) in another. We then find the bias directions using these files (bias toward the first option using the first file and toward the second using the second file). Intuitively, if SteerFair captures some core information directions, performance should drop significantly.

  | Method | Avg% | Std |
  |-|-|:-:|
  | Vanilla model (LLaVA 13B) | 65.45% | 0.026 |
  | ITI (supervised 500) | 64.05% | 0.015 |
  | SteerFair + *biased samples* | 65.86% | **0.0052** |
  | SteerFair + non-biased samples | **67.64%** | 0.0068 |

  The results show that SteerFair on biased samples has a slight performance degradation in Avg% (~2%), suggesting that SteerFair captures some core information. However, the slightness of the degradation suggests that **the direction found by SteerFair is still mainly influenced by the bias and not the core information**.
* **On principal components.** We pick only the first principal component (PC) for ease of computation when combining multiple directions from different bias rules (e.g., {'choose the first option', 'choose the second option'}; Section 3.6). As suggested by the reviewer, we analyze other PCs by running SteerFair and using these directions on ScienceQA (2 options) with the LLaVA 13B model.

  | PC | Avg% | Std |
  |-|-|:-:|
  | 1 | 66.67% | 0.034 |
  | 2 | 65.56% | 0.042 |
  | 3 | 66.82% | 0.021 |
  | 4 | 59.71% | 0.017 |

  The results above show that the top 3 PCs perform similarly, while the 4th PC substantially reduces Avg%. This suggests that the bias is scattered across the top PCs, but lower PCs might also capture core knowledge, hence the lower Avg%. This is not desirable, as we want to preserve the original model performance while reducing the bias.
* **On top-K attention heads.** Interestingly, the top attention heads are scattered in the middle layers (10th to 20th). We plot the PCA values of all attention heads of the LLaVA 13B model (we pick the top K attention heads with the largest values) on the ScienceQA and VGR datasets: [anonymized link](https://anonymous.4open.science/r/steerfair_figs-DD5B/).

### Reviewer zepo (Score: 4)

Thank you for noting the **novelty of our work** and the strength of our evaluation results!

* **On applications.** Our experiments show that SteerFair is effective in question-answering settings (MCQ and yes/no questions), which are widely adopted in LLM-centric scenarios such as automatic LLM evaluation [1,2]. Additionally, in **Appendix F**, we present results showing that SteerFair can be adapted to open-ended generation tasks. By leveraging a bias direction extracted from toxic word corpora, **SteerFair effectively steers the model away from generating toxic content**.
* **On the motivation for using PCA and QR decomposition.**
  * PCA: Given samples demonstrating a rule (e.g., always choose the first option), our technique performs PCA on these samples and takes the first principal component (PC), i.e., the direction that captures the most variance/pattern in the samples, as the bias direction.
  * QR decomposition: Given directions (vectors in the latent space) from multiple bias rules, we use QR decomposition to find the orthogonal bases of the directions before taking their average, which removes correlations between the directions.
* **On SteerFair's improvement in the experimental results.** SteerFair aims to mitigate foundation-model bias while retaining model performance. While a biased model may exhibit a high average accuracy (Avg%), it can also display a high standard deviation (Std) due to its tendency to favor specific prompt orderings. This variability based on prompt ordering is undesirable, as it introduces inconsistency in model behavior. Our method does not aim to enhance the average accuracy per se, but rather to sustain it while mitigating bias toward any particular prompt ordering, thereby reducing the standard deviation. Our experimental results show that **we maintain (and in some cases even improve) the Avg% while significantly reducing the Std**.

[1] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.

[2] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
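To make the PCA and QR steps above concrete, here is a minimal numpy sketch. The shapes and the synthetic activations are assumptions for illustration; the paper's actual pipeline collects real per-head activations from samples that follow each bias rule.

```python
import numpy as np

# Hypothetical stand-ins for activations collected from one attention head
# while the model processes samples that all follow one bias rule
# (e.g., "always answer A"). Rows are samples, columns are hidden dimensions.
rng = np.random.default_rng(0)
acts_rule_a = rng.normal(size=(500, 128))
acts_rule_b = rng.normal(size=(500, 128))

def first_pc(activations):
    """First principal component: the direction of maximum variance."""
    centered = activations - activations.mean(axis=0)
    # SVD of the centered matrix; the top right-singular vector is the 1st PC.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

# One bias direction per rule.
directions = np.stack([first_pc(acts_rule_a), first_pc(acts_rule_b)])  # (2, 128)

# QR decomposition of the stacked directions yields orthonormal bases,
# removing correlations between the rule directions before averaging.
q, _ = np.linalg.qr(directions.T)            # columns of q are orthonormal
combined = q[:, :len(directions)].mean(axis=1)
combined /= np.linalg.norm(combined)         # unit-norm combined bias direction
```

Averaging the orthogonalized bases, rather than the raw directions, prevents correlated rule directions from dominating the combined estimate.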
### Reviewer WDsk (Score: 6)

Thank you for noting the robustness of our unsupervised method and our comprehensive evaluation!

* **On proprietary foundation models.** Our method's primary strength lies in its universal applicability across any transformer architecture. This versatility ensures that it can be leveraged by closed-source model developers at their discretion, offering them a powerful tool to enhance their models' performance and mitigate biases. Furthermore, there is a noticeable trend toward open-sourcing proprietary models, exemplified by recent initiatives like [1]. Additionally, an increasing number of companies are embracing transparency by releasing their previously proprietary models as open source, as evidenced by [2, 3].
* **On the correlation between representation and output.** In Appendix H, we present exhaustive non-averaged results demonstrating that SteerFair consistently preserves the original model accuracy while significantly reducing sensitivity to prompt variations (Std). **This performance extends across different model sizes**, from the larger LLaVA 13B to the smaller IDEFICS 9B, indicating that SteerFair's effectiveness is not heavily reliant on the initial quality of the model's representation (i.e., how much the representation correlates with the output).
* **On code release.** Yes, we will release the code with the paper's publication. We also **attached the code zip file** as part of our initial submission.
* **On projecting away the bias direction from the representation.** We first want to highlight the key difference between SteerFair and [4]: ours does not seek to find word embeddings that are invariant to certain concepts (e.g., gender). Instead, we aim to identify the direction in the representation space that best encapsulates bias and subtract it during inference.
  Our method performs straightforward subtraction rather than more complex vector projections. Secondly, our method is an **inference-time procedure, so it does not modify the original weights/embeddings of the model**. Instead, we modify the activation values during inference; thus, the model remains intact for any other downstream task.
* **On handling multiple concept subspaces.** In its current form, our method can only handle multiple bias directions of the same nature. For example, in the order-bias problem, our QR decomposition + averaging combines the multiple directions for different bias rules (e.g., bias toward the first option, toward the second, etc.).
* **On the two-means approach and other debiasing methods.** In our understanding, the two-means approach used in [4] finds a center point of rotation to ensure proper orthogonalization of multiple concepts. We would like to restate our point from above: in SteerFair, we do not seek to find an embedding invariant to certain concepts; we only want to find the direction that best represents bias and steer activation values away from it during inference. In our comparisons with other debiasing methods, we specifically focus on techniques that modify a model's activation values, such as [5]. We deliberately exclude methods that target debiasing word embeddings due to the fundamental differences between modifying word embeddings (vectors) and modifying activation values (one vector per attention head per layer).

[1] Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A. M., Kiela, D., et al. OBELISC: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527, 2023.

[2] Jiang, Albert Q., et al. "Mistral 7B." arXiv preprint arXiv:2310.06825 (2023).

[3] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S.
InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.

[4] Aboagye, Prince Osei, et al. "Interpretable debiasing of vectorized language representations with iterative orthogonalization." The Eleventh International Conference on Learning Representations, 2023.

[5] Li, Kenneth, et al. "Inference-time intervention: Eliciting truthful answers from a language model." Advances in Neural Information Processing Systems 36 (2024).

### Reviewer BQsX (Score: 6)

Thank you for noting the intuitive nature of our method and the strength of our evaluation!

* **On use cases outside multiple-choice settings.** We agree that the open-ended use case is an important task, and we will move it to the main body of the revised version of our paper. The main body of the current version focuses on order bias in question-answering tasks, which are widely adopted in LLM-centric scenarios such as automatic LLM evaluation [1,2], where LLMs are tasked with assessing the quality of model-generated answers, a process heavily reliant on accurate question-answering capabilities. We believe that mitigating order bias is an essential step toward better automatic evaluation for LLMs.
* **On the results description, typos, and writing suggestions.** Thank you for pointing these out! We have amended our wording in the results description, fixed the table highlights and typos, and will include these changes in the revised version.
* **On the figure in Section 5.3.** The plot in Figure 4 of Section 5.3 shows directions for two different bias rules found by two separate PCA processes. They are **not** the first 2 PCs of the same PCA, but the 1st PCs of two different PCA processes, and thus are not orthogonal by definition.
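The inference-time steering discussed in our response to Reviewer WDsk (subtracting the bias direction from activations, without touching model weights) can be sketched as follows. The steering strength `alpha` and the per-head hook mechanics are assumptions for illustration; the rebuttal only states that activation values are shifted away from the bias direction at inference time.

```python
import numpy as np

def steer_activation(activation, bias_direction, alpha=1.0):
    """Shift one head's activation away from the bias direction.

    `alpha` (steering strength) is a hypothetical knob; the direction is
    normalized so the shift has a consistent scale across heads.
    """
    direction = bias_direction / np.linalg.norm(bias_direction)
    return activation - alpha * direction

# Toy usage: an activation with a component along the bias direction.
direction = np.zeros(8)
direction[0] = 1.0                 # bias direction = first axis
act = np.ones(8)
steered = steer_activation(act, direction, alpha=0.5)
# The component along the bias direction shrinks; the others are untouched.
```

Because this runs as a hook on activations at inference time, the underlying weights stay unchanged and the model remains usable for any other downstream task.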
