# General Comment
We sincerely thank all the reviewers for their time, insightful suggestions, and valuable comments.
We are glad that reviewers find our tuning-free method simple to implement (*rZpw, JXR2*) and sample efficient, making it more practical than DPO for real-world applications (*rZpw, JXR2*). Furthermore, our method demonstrates greater robustness to noisy data, making it applicable to imperfect datasets (*rZpw, ptGY, JXR2*). We are also encouraged that reviewers found our empirical analysis thorough (*ptGY*) and our theoretical analysis well written (*ptGY*), doing an "excellent job in explaining the theoretical pathway of the proposed method and its theoretical connection with the DPO alignment algorithm" (*JXR2*).
We respond to each reviewer's comments in detail below. We have also revised our paper according to the reviewers’ suggestions, and we believe this makes our paper stronger. The main changes we made include the following additions:
**1. How DeTox relates to existing literature on Toxicity Reduction:**
Toxicity reduction methods can be largely categorized into three classes [10]:
1. Tuning based approaches [1-4, *inter alia*] - These approaches require large amounts of data and are computationally expensive to train.
2. Decoding based approaches [5-8, *inter alia*] - These approaches often require trained classifiers, thus also needing data, and certain approaches like PPLM can be very slow. They have also been shown to reduce fluency in certain cases[9].
3. Editing approaches - These approaches are tuning-free, lightweight, and computationally cheap. Since our work specifically targets reducing toxicity in a compute- and data-efficient manner, we compare our work with existing literature in this category:
[10] reduces toxicity at inference, with no additional data requirement. Their method involves two forward passes: one to identify toxic directions in the activations of attention heads, and one to edit the activations by steering them along this direction. While their work addresses the same problem as ours, it incurs slower inference due to the repeated forward passes, and it edits activations (we instead edit weights). Additionally, their study focuses on how attention head *activations* encode toxicity; conversely, we focus on analyzing the mechanisms of MLP *weights*, providing findings complementary to this work.
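To make the contrast concrete, below is a minimal toy sketch of the two intervention styles (illustrative PyTorch code on random tensors, not the actual implementation of either method):

```python
import torch

torch.manual_seed(0)
d = 8
W = torch.randn(d, d)                  # stand-in for an MLP output weight matrix
h = torch.randn(d)                     # stand-in for a hidden activation
v = torch.randn(d); v = v / v.norm()   # hypothetical "toxic direction"

# (a) Inference-time activation steering (a simplified stand-in for [10], which
#     steers attention-head activations): W is untouched, and the activations
#     are adjusted during every forward pass.
out = W @ h
out_steered = out - (out @ v) * v

# (b) Weight editing (the style DeTox follows): fold the same projection into W
#     once, so all later inference is an ordinary forward pass with no extra cost.
P = torch.outer(v, v)
W_edited = (torch.eye(d) - P) @ W
out_edited = W_edited @ h

print(torch.allclose(out_steered, out_edited, atol=1e-6))   # True: same effect, no inference-time overhead
```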
[11] deviates from other work on model editing for alignment by instead applying knowledge editing through constrained fine-tuning for detoxification. Specifically, they use an approach highly similar to [12], extended to the toxicity task: the model is fine-tuned with constraints to increase the probability of non-toxic tokens while keeping the probability of generations for non-adversarial prompts unchanged. Our method, in contrast, avoids tuning altogether and instead surgically removes toxic regions of the model's weights.
In addition to the points above, we theoretically motivate our method through factor analysis and provide novel theoretical and empirical connections to tuning-based alignment, showing that DeTox may function as a denoised version of a single DPO step.
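For intuition, the edit and its connection to DPO can be sketched schematically as follows (the notation here is simplified and illustrative; the precise form of the edit and the formal statement of the connection are given in the paper):

$$
W_{\text{edited}} = (I - P_{\text{toxic}})\,W, \qquad P_{\text{toxic}} = \sum_{k \le r} v_k v_k^{\top},
$$

where the $v_k$ are the top singular vectors of the centered matrix of paired toxic-minus-non-toxic embedding differences. Roughly speaking, a single DPO gradient step moves $W$ along directions dominated by these same pairwise differences together with sample-level noise, and the rank-$r$ truncation is what plays the role of denoising.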
**2. Comparison Against Toxicity Reduction Baselines:**
We compare DeTox with the related editing baseline of [10], which has been shown to outperform other popular tuning-based, decoding-based, and prompting-based toxicity reduction methods [2,4,6,7,13].
Table 3 (in the attached document) compares DeTox with [10]; both show nearly identical toxicity improvements over the original and DPO models. In addition, DeTox has the advantage of being faster at inference, as it requires no additional inference-time operations.
[1] Direct preference optimization: Your language model is secretly a reward model.
[2] Don't stop pretraining: Adapt language models to domains and tasks.
[3] Exploring the limits of domain-adaptive training for detoxifying large-scale language models.
[4] Ctrl: A conditional transformer language model for controllable generation.
[5] Plug and play language models: A simple approach to controlled text generation.
[6] DExperts: Decoding-time controlled text generation with experts and anti-experts.
[7] Gedi: Generative discriminator guided sequence generation.
[8] Mil-decoding: Detoxifying language models at token-level via multiple instance learning.
[9] Detoxifying language models risks marginalizing minority voices.
[10] Self-detoxifying language models via toxification reversal.
[11] Detoxifying Large Language Models via Knowledge Editing.
[12] Model editing with canonical examples.
[13] Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP.
# Official Review of Submission7799 by Reviewer rZpw
We thank the reviewer for their time and feedback!
**1. Comparisons with more methods for Editing based Toxicity Reduction:**
<!-- "My main concern about this paper is that it lacks a comprehensive literature review given that detoxifying Large Language Models via model editing has already been discussed before [1-3]. Moreover, it compares their method with DPO only while ignoring many existing baselines, which undercuts its contribution." -->
We provide a comprehensive comparison of DeTox against the body of toxicity reduction literature, with a specific focus on other editing approaches, in our general comment to all reviewers.
Additionally, we compare DeTox with a related editing baseline beyond DPO to better calibrate the performance of our method. Table 3 (in the document attached to the general comment) compares DeTox with [1]; both show nearly identical toxicity improvements over the base model.
**2. Technical Contribution:**
<!-- - The technical contribution is relatively weak. It identifies a layer responsible for generating toxic content and edit that layer. Further in-depth analysis is required. -->
A key motivation of our work is to connect the emerging body of editing for alignment with prevalent tuning-based alignment approaches like DPO. Our method is principally designed with theoretical motivations from factor analysis, and we introduce novel theoretical and empirical analyses showing that both approaches optimize for similar objectives: our editing approach is conceptually similar to a denoised version of a single DPO step. With this, we hope to validate the use of editing as a data- and compute-efficient alternative in a principled manner, and thus inspire more studies in this direction.
**3. Extensions to other preferences:**
<!-- While effective for toxicity reduction, the applicability of DeTox to other alignment tasks (e.g., bias reduction) may require further investigation. -->
A sizeable body of literature on model editing for alignment has optimized for single objectives/preferences such as truthfulness [2-7], bias [8-10], and toxicity [11]. For simplicity of theoretical analysis, we follow this trend and demonstrate the applicability of our method on the use case of toxicity. However, as in other editing literature, the principles behind our method and analysis are more general and can be applied to other preferences. We will empirically extend our findings to other preferences, such as bias reduction, in the next revision of our work.
We hope this addresses your questions and concerns! We await your thoughts.
[1] Self-detoxifying language models via toxification reversal.
[2] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.
[3] Spectral Editing of Activations for Large Language Model Alignment.
[4] TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space.
[5] Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning.
[6] Localizing lying in llama: Understanding instructed dishonesty on true-false questions through prompting, probing, and patching.
[7] Representation engineering: A top-down approach to AI transparency.
[8] Don’t Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information.
[9] Identifying and Reducing Gender Bias in Word-Level Language Models.
[10] Interpretable debiasing of vectorized language representations with iterative orthogonalization.
[11] Self-Detoxifying Language Models via Toxification Reversal.
# Official Review of Submission7799 by Reviewer ptGY
We thank the reviewer for their comments!
**1. Using SVD to decompose toxic embeddings:**
<!-- - The idea of using SVD to decompose toxic embeddings is not novel and has been explored in prior works [1-2]. -->
The use of SVD to identify interpretable components is not new, and we do not claim novelty over it. Prior work has indeed used SVD to find concept-related components in model parameters [1] or in parameters calibrated by activations [2]. In contrast, we find these components using the differences between activations of paired preference data.
Furthermore, our work is novel in grounding the use of SVD in theory from factor analysis, and in using this to draw connections to DPO, showing that an SVD-based edit approximates a denoised DPO step.
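For concreteness, a minimal sketch of this step (illustrative code, simplified relative to our actual implementation; variable and function names are placeholders):

```python
import torch

def toxic_subspace(toxic_emb: torch.Tensor,      # (n, d) embeddings of toxic sentences
                   nontoxic_emb: torch.Tensor,   # (n, d) embeddings of paired non-toxic sentences
                   rank: int = 2) -> torch.Tensor:
    """Estimate a projection onto the toxic subspace from paired preference embeddings."""
    diffs = toxic_emb - nontoxic_emb                  # pairwise preference differences
    diffs = diffs - diffs.mean(dim=0, keepdim=True)   # remove the corpus-mean / frequent-token component
    _, _, Vh = torch.linalg.svd(diffs, full_matrices=False)
    V = Vh[:rank].T                                   # top singular directions span the toxic subspace
    return V @ V.T                                    # (d, d) projection matrix

def edit_weight(W: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """One plausible form of the projection edit: remove the toxic subspace from a weight matrix."""
    return (torch.eye(P.shape[0]) - P) @ W
```

The key difference from applying SVD to model parameters is that the decomposition here operates purely on activations of preference data; the weights are only touched by the final projection.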
**2. Factor Analysis assumption:**
<!-- - The assumption that a sentence embedding can be broken down into a form like Equation 5 is strong. More motivating analysis could be included to support this assumption. -->
Our methodological insight derives from factor analysis, a statistical technique with a long track record across the social sciences. When plotting the singular values of the differences between toxic and non-toxic embeddings, a few large singular values clearly stand out from a bulk of small ones. This motivated a signal-plus-noise model, where the signal corresponds to the toxic subspace and the noise generically captures other aspects of the embeddings. We therefore provide an interpretable model for the toxic and non-toxic embeddings such that their difference follows the factor model: the embeddings themselves may share other components, such as context, but these cancel when taking the difference. The factor model thus provides the intuition for why our editing procedure takes the form it does. Furthermore, Anthropic has recently shown that dictionary learning, a close relative of factor analysis, can extract interpretable features from Claude 3 Sonnet [3]. Their success provides further evidence for the power of factor models to unveil important aspects of a deep network and thereby make it safer and more trustworthy.
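Schematically, the structure we have in mind is the following (a simplified restatement of the model behind Equation 5; the notation here is illustrative):

$$
x_i^{\text{toxic}} = \mu + B f_i + (\text{shared context terms}) + \epsilon_i,
\qquad
x_i^{\text{non-toxic}} = \mu + (\text{shared context terms}) + \epsilon_i',
$$

so that

$$
x_i^{\text{toxic}} - x_i^{\text{non-toxic}} = B f_i + (\epsilon_i - \epsilon_i'),
$$

i.e., a low-rank toxic signal $B f_i$ plus noise. The corpus mean $\mu$ and the shared context components cancel in the difference, and the few large singular values of the difference matrix correspond to the directions spanned by the columns of $B$.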
**3. Performance Degradations with Editing:**
<!-- - Recent research [3-5] has shown that knowledge editing approaches can experience catastrophic forgetting and significant declines in performance for unedited samples. How does DeTox ensure it does not suffer from these issues? -->
As correctly pointed out, prior research has shown that knowledge editing, in which edited and unedited facts are connected through long reasoning chains, can cause unwanted changes to unedited parts of the model.
However, our work addresses a different problem: editing for alignment, where there is no explicit set of facts with which to update the model's parametric knowledge. To the best of our knowledge, no work has shown catastrophic forgetting or other unwanted side effects for such alignment-focused editing methods. Furthermore, we show that our model retains similar perplexity and zero-shot performance on a variety of tasks even after editing (Table 2 in our paper), indicating that DeTox does not cause unwanted changes in the model.
**4. Similarity in trends across layers:**
<!-- - Are the results from Table 1 similar for different layers? -->
Table 1 in our paper shows the top tokens represented by the first few singular vectors of the activations at layer 19. In Tables 1 and 2 of our general comment to all reviewers, we show that this trend holds across multiple layers: the corpus mean still encodes frequent tokens, while the first few singular vectors encode toxicity. Additionally, Table 6 in our paper shows the consistency of the top words across different subsets of data.
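For reference, a sketch of how such a top-token readout can be computed (illustrative code; the model name and the variables in the final comment are placeholders, not our exact pipeline):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                 # placeholder model name
model = AutoModelForCausalLM.from_pretrained("gpt2")
U = model.get_output_embeddings().weight                    # (vocab, d) unembedding matrix

def top_tokens(direction: torch.Tensor, k: int = 10):
    """Interpret a residual-stream direction by its highest-scoring vocabulary tokens."""
    scores = U @ direction                                   # (vocab,)
    return [tok.decode(int(i)) for i in scores.topk(k).indices]

# e.g., compare top_tokens(corpus_mean) with top_tokens(singular_vector_1),
# where corpus_mean and singular_vector_1 are computed at a given layer (hypothetical variables).
```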
**5. Data requirement:**
<!-- - How many pairs of toxic and non-toxic sentence pairs would DeTox require to effectively edit the model weights? -->
We use 500 sentence pairs to apply DeTox in our main results. However, in Figure 2 of our paper, we show that edits more powerful than DPO can be achieved with as few as 20-50 sentence pairs.
[1] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
[2] Assessing the brittleness of safety alignment via pruning and low-rank modifications.
[3] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
# Official Review of Submission7799 by Reviewer JXR2
<!-- I believe that framing the proposed method as an alignment alternative and comparing it to other alignment algorithms is inappropriate. This paper targets the preference dimension of reducing toxicity, whereas algorithms like DPO aim to optimize across a much broader spectrum of preferences, rather than targeting toxicity alone. If the authors want to support this framing, they should provide evidence across a wide range of preference dimensions.
From the perspective of targeting toxicity reduction, I believe the authors should compare their method with existing detoxification methods[1,2,3,4,5, inter alia] (at least with some of them) and demonstrate its superiority.
Nevertheless, I think the community can gain insights from the method and analysis. Therefore, I am willing to see the authors convince me why the current framing as an alignment alternative is reasonable or provide more evidence (either from an alignment perspective or from a detoxification perspective).
[1] Krause, Ben, et al. "GeDi: Generative Discriminator Guided Sequence Generation." Findings of the Association for Computational Linguistics: EMNLP 2021.
[2] Liu, Alisa, et al. "DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts." Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
[3] Cao, Meng, et al. "Systematic Rectification of Language Models via Dead-end Analysis." The Eleventh International Conference on Learning Representations.
[4] Leong, Chak Tou, et al. "Self-Detoxifying Language Models via Toxification Reversal." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.
[5] Zhang, Xu, and Xiaojun Wan. "Mil-decoding: Detoxifying language models at token-level via multiple instance learning." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
**Response:** -->
Thank you for the insightful feedback and pointers to existing literature!
A key motivation of our work is to connect the emerging body of editing for alignment with prevalent tuning-based alignment approaches like DPO. By introducing both theoretical and empirical analyses showing that the two approaches optimize for similar objectives (i.e., editing approximates a denoised DPO step), we hope to validate the use of editing as a data- and compute-efficient alternative in a principled manner, and to provide a stepping stone to more studies in this direction.
For simplicity of theoretical analysis, and following recent work [1], we use toxicity as a testbed and optimize both our edit method and DPO solely for toxicity reduction. Furthermore, a body of existing work on editing for alignment uses single objectives (e.g., bias [2-4], toxicity [5], truthfulness [6-11]) to introduce editing-based alternatives to alignment methods like SFT and DPO. We hope this helps justify our comparisons and analysis against DPO on a single objective. Additionally, we hope to extend our analysis to the more general setting of multi-preference optimization in future work.
Finally, to better justify our use of toxicity reduction as a testbed, we compare DeTox with [12], which is also a data-efficient model editing approach for toxicity reduction and has been shown to outperform popular tuning-based, decoding-based, and prompting-based toxicity reduction methods [13-17]. Table 3 (in the document attached to the general comment) compares DeTox and [12], showing that both achieve almost identical reductions in toxicity over the original model. In our general comment to all reviewers, we also explain how our work fits into the existing body of toxicity reduction literature and how it differs methodologically from these works.
We hope this at least partially addresses your questions and concerns! We await your thoughts.
[1] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
[2] Identifying and Reducing Gender Bias in Word-Level Language Models.
[3] Interpretable debiasing of vectorized language representations with iterative orthogonalization.
[4] Don’t Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information.
[5] Self-Detoxifying Language Models via Toxification Reversal.
[6] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.
[7] Spectral Editing of Activations for Large Language Model Alignment.
[8] TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space.
[9] Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning.
[10] Localizing lying in llama: Understanding instructed dishonesty on true-false questions through prompting, probing, and patching.
[11] Representation engineering: A top-down approach to AI transparency.
[12] Self-detoxifying language models via toxification reversal.
[13] Don't stop pretraining: Adapt language models to domains and tasks.
[14] Ctrl: A conditional transformer language model for controllable generation.
[15] DExperts: Decoding-time controlled text generation with experts and anti-experts.
[16] Gedi: Generative discriminator guided sequence generation.
[17] Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP.