# General Comment
We sincerely thank all the reviewers for their time, insightful suggestions, and valuable comments.
We are glad that reviewers find our tuning-free method simple to implement (*rZpw, JXR2*) and sample efficient, making it more practical than DPO for real-world applications (*rZpw, JXR2*). Furthermore, our method demonstrates greater robustness to noisy data, making it applicable to imperfect datasets (*rZpw, ptGY, JXR2*). We are also encouraged that reviewers found our empirical analysis thorough (*ptGY*) and our theoretical analysis well written (*ptGY*), doing an "excellent job in explaining the theoretical pathway of the proposed method and its theoretical connection with the DPO alignment algorithm" (*JXR2*).
We respond to each reviewer's comments in detail below. We have also revised our paper according to the reviewers’ suggestions, and we believe this makes our paper stronger. The main changes we made include the following additions:
**1. How DeTox relates to existing literature on Toxicity Reduction:**
Toxicity reduction methods can be largely categorized into three classes [10]:
1. Tuning based approaches [1-4, *inter alia*] - These approaches require large amounts of data and are computationally expensive to train.
2. Decoding based approaches [5-8, *inter alia*] - These approaches often require trained classifiers, thus also needing data, and certain approaches like PPLM can be very slow. They have also been shown to reduce fluency in certain cases[9].
3. Editing approaches - These approaches are tuning-free, lightweight, and computationally cheap. Since our work specifically targets reducing toxicity in a compute- and data-efficient manner, we compare our work with existing literature in this category:
[10] reduces toxicity at inference, with no additional data requirement. Their method involves two forward passes: one to identify toxic directions in the activations of attention heads, and one to edit the activations by steering them along this direction. While their work addresses the same problem as ours, it incurs slower inference due to the repeated forward passes, and it edits activations (we instead edit weights). Additionally, their study focuses on how attention head *activations* encode toxicity; conversely, we focus on analyzing the mechanisms of MLP *weights*, providing findings complementary to this work.
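To make the contrast concrete, below is a minimal toy sketch of the two intervention styles (illustrative PyTorch code on random tensors, not the actual implementation of either method):

```python
import torch

torch.manual_seed(0)
d = 8
W = torch.randn(d, d)                  # stand-in for an MLP output weight matrix
h = torch.randn(d)                     # stand-in for a hidden activation
v = torch.randn(d); v = v / v.norm()   # hypothetical "toxic direction"

# (a) Inference-time activation steering (a simplified stand-in for [10], which
#     steers attention-head activations): W is untouched, and the activations
#     are adjusted during every forward pass.
out = W @ h
out_steered = out - (out @ v) * v

# (b) Weight editing (the style DeTox follows): fold the same projection into W
#     once, so all later inference is an ordinary forward pass with no extra cost.
P = torch.outer(v, v)
W_edited = (torch.eye(d) - P) @ W
out_edited = W_edited @ h

print(torch.allclose(out_steered, out_edited, atol=1e-6))   # True: same effect, no inference-time overhead
```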
[11] deviates from other work on model editing for alignment by instead applying knowledge editing through constrained fine-tuning for detoxification. Specifically, they use an approach highly similar to [12], extended to the toxicity task: the model is fine-tuned with constraints to increase the probability of non-toxic tokens while keeping the probability of generations for non-adversarial prompts unchanged. Our method, in contrast, avoids tuning altogether and instead surgically removes toxic regions of the model's weights.
In addition to the points above, we theoretically motivate our method through factor analysis and provide novel theoretical and empirical connections to tuning-based alignment, showing that DeTox may function as a denoised version of a single DPO step.
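For intuition, the edit and its connection to DPO can be sketched schematically as follows (the notation here is simplified and illustrative; the precise form of the edit and the formal statement of the connection are given in the paper):

$$
W_{\text{edited}} = (I - P_{\text{toxic}})\,W, \qquad P_{\text{toxic}} = \sum_{k \le r} v_k v_k^{\top},
$$

where the $v_k$ are the top singular vectors of the centered matrix of paired toxic-minus-non-toxic embedding differences. Roughly speaking, a single DPO gradient step moves $W$ along directions dominated by these same pairwise differences together with sample-level noise, and the rank-$r$ truncation is what plays the role of denoising.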
**2. Comparison Against Toxicity Reduction Baselines:**
We compare DeTox with the related editing baseline of [10], which has been shown to outperform other popular tuning-based, decoding-based, and prompting-based toxicity reduction methods [2,4,6,7,13].
Table 3 (in the attached document) compares DeTox with [10]; both show nearly identical toxicity improvements over the original and DPO models. In addition, DeTox has the advantage of being faster at inference, as it requires no additional inference-time operations.
[1] Direct preference optimization: Your language model is secretly a reward model.
[2] Don't stop pretraining: Adapt language models to domains and tasks.
[3] Exploring the limits of domain-adaptive training for detoxifying large-scale language models.
[4] Ctrl: A conditional transformer language model for controllable generation.
[5] Plug and play language models: A simple approach to controlled text generation.
[6] DExperts: Decoding-time controlled text generation with experts and anti-experts.
[7] Gedi: Generative discriminator guided sequence generation.
[8] Mil-decoding: Detoxifying language models at token-level via multiple instance learning.
[9] Detoxifying language models risks marginalizing minority voices.
[10] Self-detoxifying language models via toxification reversal.
[11] Detoxifying Large Language Models via Knowledge Editing.
[12] Model editing with canonical examples.
[13] Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP.
# Official Review of Submission7799 by Reviewer rZpw
We thank the reviewer for their time and feedback!
**1. Comparisons with more methods for Editing based Toxicity Reduction:**
<!-- "My main concern about this paper is that it lacks a comprehensive literature review given that detoxifying Large Language Models via model editing has already been discussed before [1-3]. Moreover, it compares their method with DPO only while ignoring many existing baselines, which undercuts its contribution." -->
We provide a comprehensive comparison of DeTox against the body of toxicity reduction literature, with a specific focus on other editing approaches, in our general comment to all reviewers.
Additionally, we compare DeTox with a related editing baseline beyond DPO to better calibrate the performance of our method. Table 3 (in the document attached to the general comment) compares DeTox with [1]; both show nearly identical toxicity improvements over the base model.
**2. Technical Contribution:**
<!-- - The technical contribution is relatively weak. It identifies a layer responsible for generating toxic content and edit that layer. Further in-depth analysis is required. -->
A key motivation of our work is to connect the emerging body of editing for alignment with prevalent tuning-based alignment approaches like DPO. Our method is principally designed with theoretical motivations from factor analysis, and we introduce novel theoretical and empirical analyses showing that both approaches optimize for similar objectives: our editing approach is conceptually similar to a denoised version of a single DPO step. With this, we hope to validate the use of editing as a data- and compute-efficient alternative in a principled manner, and thus inspire more studies in this direction.
**3. Extensions to other preferences:**
<!-- While effective for toxicity reduction, the applicability of DeTox to other alignment tasks (e.g., bias reduction) may require further investigation. -->
A sizeable body of literature on model editing for alignment has optimized for single objectives/preferences such as truthfulness [2-7], bias [8-10], and toxicity [11]. For simplicity of theoretical analysis, we follow this trend and demonstrate the applicability of our method on the use case of toxicity. However, as in other editing literature, the principles behind our method and analysis are more general and can be applied to other preferences. We will empirically extend our findings to other preferences, such as bias reduction, in the next revision of our work.
We hope this addresses your questions and concerns! We await your thoughts.
[1] Self-detoxifying language models via toxification reversal.
[2] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.
[3] Spectral Editing of Activations for Large Language Model Alignment.
[4] TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space.
[5] Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning.
[6] Localizing lying in llama: Understanding instructed dishonesty on true-false questions through prompting, probing, and patching.
[7] Representation engineering: A top-down approach to AI transparency.
[8] Don’t Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information.
[9] Identifying and Reducing Gender Bias in Word-Level Language Models.
[10] Interpretable debiasing of vectorized language representations with iterative orthogonalization.
[11] Self-Detoxifying Language Models via Toxification Reversal.
# Official Review of Submission7799 by Reviewer ptGY
We thank the reviewer for their comments!
**1. Using SVD to decompose toxic embeddings:**
<!-- - The idea of using SVD to decompose toxic embeddings is not novel and has been explored in prior works [1-2]. -->
The use of SVD to identify interpretable components is not new, and we do not claim novelty over it. Prior work has indeed used SVD to find concept-related components in model parameters [1] or in parameters calibrated by activations [2]. In contrast, we find these components using the differences between activations of paired preference data.
Furthermore, our work is novel in grounding the use of SVD in theory from factor analysis, and in using this to draw connections to DPO, showing that an SVD-based edit approximates a denoised DPO step.
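For concreteness, a minimal sketch of this step (illustrative code, simplified relative to our actual implementation; variable and function names are placeholders):

```python
import torch

def toxic_subspace(toxic_emb: torch.Tensor,      # (n, d) embeddings of toxic sentences
                   nontoxic_emb: torch.Tensor,   # (n, d) embeddings of paired non-toxic sentences
                   rank: int = 2) -> torch.Tensor:
    """Estimate a projection onto the toxic subspace from paired preference embeddings."""
    diffs = toxic_emb - nontoxic_emb                  # pairwise preference differences
    diffs = diffs - diffs.mean(dim=0, keepdim=True)   # remove the corpus-mean / frequent-token component
    _, _, Vh = torch.linalg.svd(diffs, full_matrices=False)
    V = Vh[:rank].T                                   # top singular directions span the toxic subspace
    return V @ V.T                                    # (d, d) projection matrix

def edit_weight(W: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """One plausible form of the projection edit: remove the toxic subspace from a weight matrix."""
    return (torch.eye(P.shape[0]) - P) @ W
```

The key difference from applying SVD to model parameters is that the decomposition here operates purely on activations of preference data; the weights are only touched by the final projection.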
**2. Factor Analysis assumption:**
<!-- - The assumption that a sentence embedding can be broken down into a form like Equation 5 is strong. More motivating analysis could be included to support this assumption. -->
Our methodological insight derives from factor analysis, a statistical technique with a long track record across the social sciences. When plotting the singular values of the differences between toxic and non-toxic embeddings, a few large singular values clearly stand out from a bulk of small ones. This motivated a signal-plus-noise model, where the signal corresponds to the toxic subspace and the noise generically captures other aspects of the embeddings. We therefore provide an interpretable model for the toxic and non-toxic embeddings such that their difference follows the factor model: the embeddings themselves may share other components, such as context, but these cancel when taking the difference. The factor model thus provides the intuition for why our editing procedure takes the form it does. Furthermore, Anthropic has recently shown that dictionary learning, a close relative of factor analysis, can extract interpretable features from Claude 3 Sonnet [3]. Their success provides further evidence for the power of factor models to unveil important aspects of a deep network and thereby make it safer and more trustworthy.
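Schematically, the structure we have in mind is the following (a simplified restatement of the model behind Equation 5; the notation here is illustrative):

$$
x_i^{\text{toxic}} = \mu + B f_i + (\text{shared context terms}) + \epsilon_i,
\qquad
x_i^{\text{non-toxic}} = \mu + (\text{shared context terms}) + \epsilon_i',
$$

so that

$$
x_i^{\text{toxic}} - x_i^{\text{non-toxic}} = B f_i + (\epsilon_i - \epsilon_i'),
$$

i.e., a low-rank toxic signal $B f_i$ plus noise. The corpus mean $\mu$ and the shared context components cancel in the difference, and the few large singular values of the difference matrix correspond to the directions spanned by the columns of $B$.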
**3. Performance Degradations with Editing:**
<!-- - Recent research [3-5] has shown that knowledge editing approaches can experience catastrophic forgetting and significant declines in performance for unedited samples. How does DeTox ensure it does not suffer from these issues? -->
As correctly pointed out, prior research has shown that knowledge editing, in which edited and unedited facts are connected through long reasoning chains, can cause unwanted changes to unedited parts of the model.
However, our work addresses a different problem: editing for alignment, where there is no explicit set of facts with which to update the model's parametric knowledge. To the best of our knowledge, no work has shown catastrophic forgetting or other unwanted side effects for such alignment-focused editing methods. Furthermore, we show that our model retains similar perplexity and zero-shot performance on a variety of tasks even after editing (Table 2 in our paper), indicating that DeTox does not cause unwanted changes in the model.
**4. Similarity in trends across layers:**
<!-- - Are the results from Table 1 similar for different layers? -->
Table 1 in our paper shows the top tokens represented by the first few singular vectors of the activations at layer 19. In Tables 1 and 2 of our general comment to all reviewers, we show that this trend holds across multiple layers: the corpus mean still encodes frequent tokens, while the first few singular vectors encode toxicity. Additionally, Table 6 in our paper shows the consistency of the top words across different subsets of data.
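For reference, a sketch of how such a top-token readout can be computed (illustrative code; the model name and the variables in the final comment are placeholders, not our exact pipeline):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                 # placeholder model name
model = AutoModelForCausalLM.from_pretrained("gpt2")
U = model.get_output_embeddings().weight                    # (vocab, d) unembedding matrix

def top_tokens(direction: torch.Tensor, k: int = 10):
    """Interpret a residual-stream direction by its highest-scoring vocabulary tokens."""
    scores = U @ direction                                   # (vocab,)
    return [tok.decode(int(i)) for i in scores.topk(k).indices]

# e.g., compare top_tokens(corpus_mean) with top_tokens(singular_vector_1),
# where corpus_mean and singular_vector_1 are computed at a given layer (hypothetical variables).
```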
**5. Data requirement:**
<!-- - How many pairs of toxic and non-toxic sentence pairs would DeTox require to effectively edit the model weights? -->
We use 500 sentence pairs to apply DeTox in our main results. However, in Figure 2 of our paper, we show that edits more powerful than DPO can be achieved with as few as 20-50 sentence pairs.
[1] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
[2] Assessing the brittleness of safety alignment via pruning and low-rank modifications.
[3] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
# Official Review of Submission7799 by Reviewer JXR2
<!-- I believe that framing the proposed method as an alignment alternative and comparing it to other alignment algorithms is inappropriate. This paper targets the preference dimension of reducing toxicity, whereas algorithms like DPO aim to optimize across a much broader spectrum of preferences, rather than targeting toxicity alone. If the authors want to support this framing, they should provide evidence across a wide range of preference dimensions.
From the perspective of targeting toxicity reduction, I believe the authors should compare their method with existing detoxification methods[1,2,3,4,5, inter alia] (at least with some of them) and demonstrate its superiority.
Nevertheless, I think the community can gain insights from the method and analysis. Therefore, I am willing to see the authors convince me why the current framing as an alignment alternative is reasonable or provide more evidence (either from an alignment perspective or from a detoxification perspective).
[1] Krause, Ben, et al. "GeDi: Generative Discriminator Guided Sequence Generation." Findings of the Association for Computational Linguistics: EMNLP 2021.
[2] Liu, Alisa, et al. "DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts." Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
[3] Cao, Meng, et al. "Systematic Rectification of Language Models via Dead-end Analysis." The Eleventh International Conference on Learning Representations.
[4] Leong, Chak Tou, et al. "Self-Detoxifying Language Models via Toxification Reversal." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.
[5] Zhang, Xu, and Xiaojun Wan. "Mil-decoding: Detoxifying language models at token-level via multiple instance learning." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
**Response:** -->
Thank you for the insightful feedback and pointers to existing literature!
A key motivation of our work is to connect the emerging body of editing for alignment with prevalent tuning-based alignment approaches like DPO. By introducing both theoretical and empirical analyses showing that the two approaches optimize for similar objectives (i.e., editing approximates a denoised DPO step), we hope to validate the use of editing as a data- and compute-efficient alternative in a principled manner, and to provide a stepping stone to more studies in this direction.
For simplicity of theoretical analysis, and following recent work [1], we use toxicity as a testbed and optimize both our edit method and DPO solely for toxicity reduction. Furthermore, a body of existing work on editing for alignment uses single objectives (e.g., bias [2-4], toxicity [5], truthfulness [6-11]) to introduce editing-based alternatives to alignment methods like SFT and DPO. We hope this helps justify our comparisons and analysis against DPO on a single objective. Additionally, we hope to extend our analysis to the more general setting of multi-preference optimization in future work.
Finally, to better justify our use of toxicity reduction as a testbed, we compare DeTox with [12], which is also a data-efficient model editing approach for toxicity reduction and has been shown to outperform popular tuning-based, decoding-based, and prompting-based toxicity reduction methods [13-17]. Table 3 (in the document attached to the general comment) compares DeTox and [12], showing that both achieve almost identical reductions in toxicity over the original model. In our general comment to all reviewers, we also explain how our work fits into the existing body of toxicity reduction literature and how it differs methodologically from these works.
We hope this at least partially addresses your questions and concerns! We await your thoughts.
[1] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
[2] Identifying and Reducing Gender Bias in Word-Level Language Models.
[3] Interpretable debiasing of vectorized language representations with iterative orthogonalization.
[4] Don’t Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information.
[5] Self-Detoxifying Language Models via Toxification Reversal.
[6] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.
[7] Spectral Editing of Activations for Large Language Model Alignment.
[8] TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space.
[9] Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning.
[10] Localizing lying in llama: Understanding instructed dishonesty on true-false questions through prompting, probing, and patching.
[11] Representation engineering: A top-down approach to AI transparency.
[12] Self-detoxifying language models via toxification reversal.
[13] Don't stop pretraining: Adapt language models to domains and tasks.
[14] Ctrl: A conditional transformer language model for controllable generation.
[15] DExperts: Decoding-time controlled text generation with experts and anti-experts.
[16] Gedi: Generative discriminator guided sequence generation.
[17] Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP.