# Rebuttal: Intent Conditioned Counterspeech Generation using Multi-task Instruction Tuning with RLAIF

## Reviewer 1

We thank the reviewer for their insightful comments and valuable suggestions. In the subsequent section, we offer thorough responses to each query raised by the reviewer.

---

1. Line 463: for the sake of reproducibility, specify which version of GPT-3.5-Turbo was used, or the time period during which the API was accessed.
   * Thank you for your suggestion. We will specify the exact version of the GPT model used (**gpt-3.5-turbo-1106**) in the camera-ready version of the paper.

---

2. After the anonymous review period, is there a plan to make the data and code associated with this paper publicly available?
   * Yes, we are committed to releasing both the dataset and our source code after the anonymous review period.
   * Our source code will be released as an open-source **GitHub** project, and the dataset will be released on **HuggingFace**. We will add the necessary specifications in the camera-ready version of the paper.

---

3. The paper presents a framework that appears to integrate established methods, such as multi-task learning, LoRA, and RLAIF. Therefore, the framework's value to the community might be perceived as limited. It would be beneficial for the authors to highlight how their approach differentiates from or improves upon the existing methods.
   * While we leverage established methods like multi-task instruction tuning, LoRA, and RLAIF, our proposed system is not merely an ensemble. It is designed to address two main challenges: (i) generating intent-specific counterspeech that is both relevant and coherent with the hate speech, and (ii) optimizing the counterspeech for effectiveness and non-toxicity. We list our major contributions below.
   * **Addressing limitations in SOTA counterspeech methods**: We identified a critical limitation in current state-of-the-art counterspeech generation methods regarding their handling of short, implied expressions of hate. To tackle this, our proposed system includes an auxiliary explanation generation (AEG) component: the base model is initially trained to generate explanations for the hate speech, which then aids the downstream task of counterspeech generation. The effectiveness of this strategy is evident in our results and detailed in the ablation study.
   * **Parameter-efficient fine-tuning**: We propose a parameter-efficient method for fine-tuning a model while maintaining high performance in counterspeech generation. This approach, which involves training task-specific LoRA weights, demonstrates superior performance compared to the traditional, more resource-intensive supervised fine-tuning (SFT) setup, as shown in the experimental results in Table 2.
   * **PLM-based feedback in RLAIF**: Diverging from conventional RLAIF methods that rely on large language models (LLMs) for generating preference data, our framework introduces a novel reward mechanism based on pretrained language models (PLMs). The results and ablation studies presented in Table 2 indicate that feedback from PLMs can effectively align language model responses toward desired attributes such as non-toxicity (an illustrative sketch of such a reward follows below).
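To make the PLM-based reward concrete, here is a minimal sketch of how a pretrained classifier could supply the scalar reward during RLAIF training. The specific checkpoint (`unitary/toxic-bert`) and the `1 − toxicity` shaping are illustrative assumptions on our part, not necessarily the reward model reported in the paper.

```python
# Minimal sketch of a PLM-based scalar reward for RLAIF.
# Assumption: an off-the-shelf toxicity classifier acts as the PLM
# judge; the paper's actual reward model and shaping may differ.
from transformers import pipeline

toxicity_judge = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,  # return scores for every label, not just the top one
)

def plm_reward(counterspeech: str) -> float:
    """Scalar reward in [0, 1]: less toxic generations score higher."""
    scores = toxicity_judge([counterspeech])[0]  # list of {label, score}
    toxic_prob = next(s["score"] for s in scores if s["label"] == "toxic")
    return 1.0 - toxic_prob

# Example: score a candidate generation before a policy-gradient update.
print(plm_reward("Generalizing about an entire group of people is unfair."))
```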
## Reviewer 2

We thank the reviewer for their insightful comments and valuable suggestions. In the subsequent section, we offer thorough responses to each query raised by the reviewer.

---

1. FLAN-T5 with SFT: The authors include a DialoGPT baseline (SFTed for counterspeech generation), but it is not clear to me whether they include a pure-SFT baseline using their base model. I think this is important to rule out that gains are coming from a new choice of base model.
   * We would like to clarify that we have indeed incorporated a pure-SFT baseline for our base model (FLAN-T5-XXL), which is reported as part of the ablation studies labeled "CoARL - LoRA" in Table 2 of our paper. Additionally, in Table 3, we have compared the Win Rate percentage of our proposed method, CoARL, against this pure-SFT baseline.
   * We realize that the confusion might stem from non-uniform notation for the pure-SFT baseline, and we aim to make it perfectly clear and understandable in the camera-ready version of the paper by:
     * Moving the pure-SFT baseline from the ablation studies section to report it independently along with other SFT baselines in Table 2.
     * Ensuring a consistent and uniform notation for the SFT baseline across the manuscript, with explicit highlighting in Sections 5.1 and 6.3, respectively.

---

2. GPT-4: The authors run GPT-3.5, which is great, but it isn't clear why they would not consider GPT-4 as well, given that it is often better than GPT-3.5, which is already included anyway. I think that a negative result here would not necessarily be a bad thing, but it would be important to know either that (i) CoARL is better than GPT-4 (great, even cooler result!), or (ii) GPT-4 outperforms CoARL (and quantifying the size of the gap would itself be interesting).
   * Initially, we were operating under budget constraints, which led to the exclusion of GPT-4 due to its higher cost. Specifically, at the time of our experiments, a 2k-token API call/response of GPT-4 (**gpt-4-0613**) was approximately 23 times more expensive than that of GPT-3.5 (**gpt-3.5-turbo-1106**).
   * However, we acknowledge the significance of GPT-4 as a benchmark in the field of large language models, and its inclusion would indeed provide a more comprehensive evaluation.
   * Therefore, following the reviewer's suggestion, we have updated our evaluation to include a GPT-4 baseline. Similar to GPT-3.5, we report both zero- and few-shot performances of GPT-4 on our test set and conduct a human evaluation comparing the responses from GPT-4 against our proposed method, CoARL. Note that we use OpenAI model version ***gpt-4*** for our experiments [[source link](https://platform.openai.com/docs/models/continuous-model-upgrades)]. An illustrative version-pinning snippet follows below.
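For reproducibility, requesting dated model snapshots rather than floating aliases guards against silent model upgrades between runs. The following sketch uses the OpenAI Python client; the prompt is a placeholder, not our actual evaluation prompt.

```python
# Sketch: pin dated model snapshots so evaluation results stay
# reproducible. The prompt below is a placeholder, not the
# evaluation prompt used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for model_id in ("gpt-3.5-turbo-1106", "gpt-4-0613"):
    response = client.chat.completions.create(
        model=model_id,  # dated snapshot, not a floating alias like "gpt-4"
        temperature=0,   # decoding as deterministic as the API allows
        messages=[{
            "role": "user",
            "content": "Write a non-toxic counterspeech to: <hate speech>",
        }],
    )
    print(model_id, response.choices[0].message.content)
```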
---

### Table: Quantitative Results

Comparative evaluation of CoARL against GPT-3.5 and GPT-4 prompting baselines. The symbol ↑ (↓) indicates the higher (lower) value is better.

| Method | Prompt / Adapter | R1 ↑ | R2 ↑ | RL ↑ | M ↑ | BS ↑ | CS ↑ | CA ↑ | PC ↓ | AQ ↑ | T ↓ |
| --- | --- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| GPT-3.5-turbo | ZS | 0.204 | 0.058 | 0.181 | 0.274 | 0.856 | 0.323 | 0.828 | 0.118 | 0.898 | 0.038 |
| GPT-3.5-turbo | FS | 0.230 | 0.067 | 0.199 | **0.293** | 0.885 | 0.310 | 0.891 | -0.045 | **0.914** | 0.043 |
| GPT-4 | ZS | 0.242 | 0.057 | 0.211 | 0.270 | 0.874 | 0.345 | 0.929 | 0.149 | 0.854 | **0.012** |
| GPT-4 | FS | 0.247 | 0.056 | 0.214 | 0.267 | **0.886** | **0.346** | 0.924 | 0.148 | 0.856 | 0.013 |
| CoARL (Ours) | LoRA10 | **0.251** | **0.078** | **0.220** | 0.244 | 0.876 | 0.226 | **0.944** | **-0.130** | 0.824 | 0.067 |
| ∆ CoARL (Ours) − Best Method | | +0.004 | +0.011 | +0.006 | -0.049 | -0.010 | -0.120 | +0.015 | +0.085 | -0.090 | -0.055 |

**Key Observations:**

* Our proposed method (CoARL) tends to generate responses that more closely conform to the intended counterspeech, as indicated by the IC (Independent Counterspeech) and CA (Category Accuracy) scores. In particular, for "Denouncing" counterspeech, annotators preferred responses generated by CoARL over those by GPT-3.5 and GPT-4, leading to higher CA scores. We plan to elaborate on this finding in the Ablations section of the camera-ready version, if accepted.
* GPT-4 and GPT-3.5 produced responses that are more grammatically coherent and structured, as reflected in their high scores in Adequacy (A) and Argumentative Effectiveness (AE).
* GPT-4 demonstrates an improved ability to identify both explicit and implicit expressions of bias, prejudice, or stereotype in hate speech, leading to more contextually relevant and well-rounded responses than those generated by CoARL.

---

### Table: Human Evaluation

Win Rate % of our proposed method (CoARL) vs few-shot baselines of GPT-3.5 and GPT-4, respectively. A detailed description of all metrics is provided under Section 6.3. The symbol ↑ indicates the higher value is better.

| Model | IC ↑ | A ↑ | CR ↑ | AE ↑ | CA ↑ |
| --- | --- | --- | --- | --- | --- |
| CoARL vs GPT-4 (FS) | **0.57** | 0.32 | 0.39 | 0.46 | **0.61** |
| CoARL vs GPT-3.5 (FS) | **0.60** | 0.31 | **0.55** | 0.38 | **0.68** |

**Key Observations:**

* During human evaluation, we observed that responses generated by our proposed method, CoARL, were more aligned with the intended counter-narrative and more effectively contested the hate speech. This was particularly evident in the Independent Counterspeech (IC) and Category Accuracy (CA) scores. Notably, in cases where a "Denouncing" type of counterspeech was required, annotators showed a marked preference for responses generated by CoARL over those from GPT-3.5 and GPT-4, with a preference ratio of approximately 8 out of 10. This led to significantly higher CA scores. We plan to provide a more detailed discussion of this observation in the Ablations section of the camera-ready version of the paper, if accepted.
* We also found that GPT-4 and GPT-3.5 tended to produce responses that were more grammatically coherent and structured, as reflected in their high scores for Adequacy (A) and Argumentative Effectiveness (AE).
* Interestingly, GPT-4 demonstrated an enhanced ability to identify both explicit and implicit expressions of bias, prejudice, or stereotypes in hate speech. This capability resulted in responses that were better rounded and contextually more relevant to the hate speech than those generated by CoARL.
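As a reading aid for the win rates above: each cell is the share of pairwise comparisons in which annotators preferred CoARL's response over the baseline's. A minimal sketch of that computation follows; the judgment encoding and the half-credit treatment of ties are our assumptions here, not necessarily the exact annotation protocol of Section 6.3.

```python
# Sketch: win rate from pairwise human judgments. Assumption: each
# judgment is "coarl", "baseline", or "tie"; ties earn half credit,
# which may differ from the paper's protocol.
from typing import Iterable

def win_rate(judgments: Iterable[str]) -> float:
    judgments = list(judgments)
    wins = sum(j == "coarl" for j in judgments)
    ties = sum(j == "tie" for j in judgments)
    return (wins + 0.5 * ties) / len(judgments)

# Example: 6 wins, 1 tie, and 3 losses over 10 comparisons -> 0.65.
print(win_rate(["coarl"] * 6 + ["tie"] + ["baseline"] * 3))
```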
---

3. Why is SFT mentioned in Phase 3? From what I understand, SFT is exactly what is done in Phase 2, right? Or is there a difference between Phase 2 and SFT in Phase 3? (If there isn't a difference, I would mention it once; right now it reads a bit like they are two separate steps.)
   * The reviewer is correct in noting that the SFT process described in Phase 3 is indeed the same as that in Phase 2. Our intention in reiterating the SFT component in Phase 3 was to maintain consistency with the narrative structure employed in the original paper by [Lee et al., 2023](https://arxiv.org/abs/2309.00267). This approach was adopted to facilitate a clearer understanding of the individual components of the RLAIF pipeline, especially for readers who might be encountering these concepts for the first time.
   * We realize that this may be confusing, and duly note the reviewer's suggestion. We are committed to making the necessary changes in the camera-ready version of the paper.

---

4. What exactly is the "AI" in RLAIF here? Is it because some models are used to define the reward model? I'm more used to seeing RLAIF in the context of LLMs generating preference data, so I want to make sure I understand its usage here.
   * In our context, "AI" indeed refers to the pretrained language models (PLMs) employed in defining the reward model. This indicates that the feedback, specifically the scalar reward, is generated by an AI system rather than derived from human feedback.
   * We recognize that the term "AI" is typically associated with LLMs generating preference data. However, in our work, the use of "AI" in RLAIF is aligned with its original definition and does not deviate from established uses in the literature. For further clarity and substantiation, we refer to the following key sources that influenced our methodology and terminology:
     * [1] [Lee et al., 2023](https://arxiv.org/abs/2309.00267)
     * [2] [Bai et al., 2022](https://arxiv.org/abs/2212.08073)

---

5. Where is the term AEG (auxiliary explanation generation) introduced? I see it first mentioned in L462 but I can't find where it was actually defined prior to that.
   * We introduce the term Auxiliary Explanation Generation (AEG) on L294, as the first phase of our proposed framework.
   * While we have provided a brief background for this on L273, we acknowledge the need to explicitly highlight what "Auxiliary Explanation Generation" indicates in the context of the proposed method. For concreteness, a sketch of the two instruction formats involved is given below.
   * We also acknowledge the need to use a consistent notation for AEG throughout the manuscript, and we are committed to making the necessary changes (including L462) in the camera-ready version, if accepted.
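The following sketch shows the two instruction-tuning tasks as described above: AEG first teaches the model to explain the implied bias, and counterspeech generation then conditions on the target intent. The exact prompt wording is an illustrative assumption, not our released templates.

```python
# Sketch of the two instruction-tuning task formats. The prompt
# wording is illustrative; the released templates may differ.
aeg_example = {
    "instruction": ("Explain the implicit bias, prejudice, or stereotype "
                    "expressed in the following hate speech."),
    "input": "<hate speech>",
    "output": "<explanation of the implied stereotype>",
}

counterspeech_example = {
    "instruction": ("Write a counterspeech with the intent 'Denouncing' "
                    "for the following hate speech."),
    "input": "<hate speech>",
    "output": "<intent-specific counterspeech>",
}
```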
---

6. Do you intend to release the code and data for this project? (I couldn't find this mentioned anywhere in the paper, and will update my software/dataset scores if this is the case.)
   * Yes, we are committed to releasing both the dataset and our source code after the anonymous review period.
   * Our source code will be released as an open-source **GitHub** project, and the dataset will be released on **HuggingFace**. We will add the necessary specifications in the camera-ready version of the paper.

---

7. I found the multi-task learning explanation and notation in Phase 1 a bit confusing. I would recommend elaborating on this a bit more, and removing (or modifying) equation (1), since the notation feels a bit non-standard (it took me a while to parse out what each arrow meant and what the "scope" of each sub-expression was).
   * We thank the reviewer for their suggestion. We agree that the current notation might be difficult to parse, and we propose to modify equation (1) in the revised version of our manuscript.
   * We aim to make the following change to equation (1), so as to make it more readable:
     * Current version: $\small {\Theta_m : \Theta \rightarrow \{1,...,N\} \leftarrow \underset{\Theta}{\mathrm{argmin}} \left( \sum_{n=1}^{N} L_0 (I_n; \Theta) \right) }$
     * Revised version: $\small \Theta_{m} = \underset{\Theta}{\mathrm{argmin}} \sum_{n=1}^{N} L_n(I_n; \Theta)$
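To illustrate how the revised notation reads, consider for instance $N = 2$ tasks, taking AEG as task 1 and counterspeech generation as task 2 (this task assignment is for illustration only):

$$
\Theta_{m} = \underset{\Theta}{\mathrm{argmin}} \, \bigl( L_{1}(I_{1}; \Theta) + L_{2}(I_{2}; \Theta) \bigr),
$$

where $I_n$ is the instruction set for task $n$, $L_n$ is its task-specific loss, and a single shared parameter vector $\Theta$ is optimized jointly over both tasks.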
---

8. The in-line bullet points in 6.2 and 6.3 look a bit weird. I would either do a list, or write it out as regular text.
   * To conserve space, we utilized in-line bullet points in Sections 6.2 and 6.3.
   * We note the reviewer's suggestion, and we intend to rewrite them as a list in the revised version of our manuscript.

## Reviewer 3

We thank the reviewer for their insightful comments and valuable suggestions. In the subsequent section, we offer thorough responses to each query raised by the reviewer.

---

1. Human evaluation, the most relevant evaluation for LLMs, was performed only for ChatGPT vs CoARL, leaving state-of-the-art baselines like GPS, DialoGPT, and QUARK out. This can potentially mislead the final evaluation outcome.
   * We acknowledge the reviewer's concern regarding the limited scope of human evaluation in our study. The decision to restrict the human evaluation to ChatGPT and CoARL was primarily influenced by resource constraints, as a comprehensive human evaluation across several models was beyond the budget of our project.
   * However, we recognize the validity of the reviewer's point. Excluding state-of-the-art models like GPS, DialoGPT, and QUARK from human evaluation could potentially lead to a partial view of the final evaluation outcome.
   * In response to this concern, we have conducted additional human evaluations comparing our proposed method (CoARL) with DialoGPT, which emerged as the best-performing state-of-the-art baseline.
   * We observe a notable performance gap between CoARL and the other baselines across automated metrics (as detailed in Table 2 of the paper). Thus, we chose to exclude GPS and QUARK from further human evaluation due to their relatively lower performance compared to CoARL on these automated metrics.
   * The results from this additional human evaluation are as follows:

### Table: Human Evaluation

Win Rate % of our proposed method (CoARL) vs the DialoGPT baseline. A detailed description of all metrics is provided under Section 6.3.

| Model | IC ↑ | A ↑ | CR ↑ | AE ↑ | CA ↑ |
| --- | --- | --- | --- | --- | --- |
| CoARL vs DialoGPT | **0.62** | 0.47 | **0.73** | **0.58** | **0.66** |

**Key Observations:**

* Human evaluators consistently favored responses from CoARL over DialoGPT in terms of Independent Counterspeech (IC), Contextual Relevance (CR), Argumentative Effectiveness (AE), and Category Accuracy (CA).
* The Adequacy (A) metric showed no definitive preference for either CoARL or DialoGPT, indicating relatively balanced performance in this aspect.
* These observations from the human evaluation align with the quantitative results reported in Table 2, where CoARL surpasses DialoGPT on metrics assessing lexical similarity, relevance, effectiveness, intent conformity, and toxicity.
* We commit to incorporating these findings into Section 6.3 in the camera-ready version of the paper, if accepted.

---

2. The initial instruction SFT stage can be quite expensive to reproduce for the LLMs at which this work is aimed. It would be useful to open-source the pre-trained checkpoints that the authors used, to facilitate the adoption of this technique; I didn't see that offered in the paper.
   * We acknowledge this crucial point regarding the accessibility of pre-trained checkpoints, which is especially pertinent considering the resource-intensive nature of the initial instruction SFT stage for LLMs. In response to this concern, we are committed to releasing the SFT model checkpoints along with the source code and dataset. These checkpoints will be made available on the HuggingFace platform after the anonymous review period concludes.
   * Our source code will be released as an open-source **GitHub** project, and the dataset will be released on **HuggingFace**. We will add the necessary specifications in the camera-ready version of the paper.
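Once the artifacts are public, adopting them should reduce to a few lines. The sketch below is hypothetical: the `<org>/...` repository identifiers are placeholders, since the actual HuggingFace repository names have not yet been announced.

```python
# Hypothetical sketch: load the released SFT checkpoint and attach a
# task-specific LoRA adapter from the HuggingFace Hub. Both <org>/...
# repo ids are placeholders, not announced artifact names.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

sft_id = "<org>/coarl-flan-t5-xxl-sft"         # placeholder SFT checkpoint
adapter_id = "<org>/coarl-counterspeech-lora"  # placeholder LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(sft_id)
model = AutoModelForSeq2SeqLM.from_pretrained(sft_id)  # SFT weights
model = PeftModel.from_pretrained(model, adapter_id)   # task adapter on top

inputs = tokenizer("Write a counterspeech for: <hate speech>",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```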
# Response to the meta reviewer

We express our gratitude to all the reviewers and the meta reviewer for the valuable feedback. In our response, we have meticulously addressed every concern. We hereby commit to integrating all the recommendations outlined in the reviews and the meta review into the final camera-ready version of our paper, should it be accepted.

> ### Clarification of Differentiation: Clarify how CoARL differentiates from or advances beyond existing methods in the field.

As suggested by the reviewers, we have provided a detailed response during the discussion phase. Following is a brief summary of our response:

We elaborate on how our proposed method, CoARL, advances beyond established techniques in the field. CoARL incorporates multi-task instruction tuning, LoRA, and RLAIF to uniquely address three pivotal challenges in counterspeech generation, setting it apart from existing approaches:

- **Contextual Sensitivity**: CoARL is adept at handling hate speech with insufficient context, especially when biases or stereotypes are not explicitly stated but are implied. This capability ensures that our system can effectively interpret and respond to a wider range of hate speech instances.
- **Intent-Specific Counterspeech**: Our method excels at generating counterspeech that is not only relevant but also tailored to the specific intent behind the hate speech. This precision ensures that the counterspeech is meaningful and directly addresses the underlying issues.
- **Effectiveness and Non-Toxicity**: CoARL ensures that the generated counterspeech is both impactful and devoid of toxic elements, contributing to a healthier online discourse.

#### Key Innovations of CoARL:

1. **Auxiliary Explanation Generation (AEG)**: We introduce an AEG component to overcome limitations in current counterspeech methods, particularly in handling short, implied hate expressions. Training our base model to generate explanations for hate speech first significantly improves the quality and relevance of the generated counterspeech, as validated in our ablation study.
2. **Parameter-Efficient Fine-Tuning**: Unlike traditional supervised fine-tuning (SFT), which is resource-intensive, our approach employs task-specific LoRA weights for fine-tuning, maintaining high counterspeech generation performance with far fewer trainable parameters (see the sketch after this list). This method's effectiveness is demonstrated in our experimental results, which showcase superior performance.
3. **PLM-Based Feedback in RLAIF**: Departing from conventional RLAIF, which relies on large language models (LLMs) for preference data, CoARL introduces a novel reward mechanism using pretrained language models (PLMs). This approach aligns language model responses with desired attributes, such as non-toxicity, as our results and ablation studies indicate.
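As a concrete illustration of innovation 2, the sketch below attaches a task-specific LoRA adapter to a FLAN-T5 model with the `peft` library. The hyperparameters (rank, alpha, dropout, target modules) are illustrative assumptions, not our reported configuration.

```python
# Sketch: parameter-efficient fine-tuning with a task-specific LoRA
# adapter on FLAN-T5. Hyperparameters are illustrative only.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5 attention query/value projections
)

model = get_peft_model(base, lora_cfg)
# Only the small LoRA matrices are trainable; the 11B base stays frozen.
model.print_trainable_parameters()
```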
To substantiate our claims, we have conducted extensive experiments comparing CoARL with state-of-the-art (SOTA) methods and LLMs like GPT-4. Moreover, our detailed ablation study meticulously examines each component of CoARL, providing empirical evidence of its effectiveness and the innovative contributions of our approach. These efforts collectively demonstrate that CoARL represents a significant leap forward in the generation of counterspeech, effectively addressing the nuanced challenges of contemporary online discourse.

> ### Extensive Baseline Comparisons: Include more extensive baseline comparisons, encompassing newer models such as GPT-4 and additional methods like FLAN-T5 with SFT.

As suggested by the reviewers, we have conducted additional experiments incorporating both GPT-4 and FLAN-T5 with SFT. Following is a brief summary of our response:

**1. FLAN-T5 with SFT:** We have clarified the inclusion of a pure-SFT baseline using FLAN-T5-XXL, detailed in our ablation studies as "CoARL - LoRA." We acknowledge the confusion arising from inconsistent notation and commit to unifying the terminology across the manuscript for clarity. Specifically, we will:

- Move the pure-SFT baseline comparisons to a more prominent section alongside other SFT baselines.
- Standardize and clearly denote the SFT baseline throughout the paper, particularly in the methodology and results discussion.

**2. GPT-4 Baseline Inclusion:** Initially constrained by budgetary limitations, we prioritized GPT-3.5 for our experiments due to cost considerations. However, recognizing the importance of comparing our method against the most advanced models, we extended our evaluation to include GPT-4, conducting both zero-shot and few-shot analyses. Following is a snapshot of the GPT-4 results:

#### Table: Quantitative Results

Comparative evaluation of CoARL against GPT-3.5 and GPT-4 prompting baselines. The symbol ↑ (↓) indicates the higher (lower) value is better.

| Method | Prompt / Adapter | R1 ↑ | R2 ↑ | RL ↑ | M ↑ | BS ↑ | CS ↑ | CA ↑ | PC ↓ | AQ ↑ | T ↓ |
| --- | --- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| GPT-3.5-turbo | ZS | 0.204 | 0.058 | 0.181 | 0.274 | 0.856 | 0.323 | 0.828 | 0.118 | 0.898 | 0.038 |
| GPT-3.5-turbo | FS | 0.230 | 0.067 | 0.199 | **0.293** | 0.885 | 0.310 | 0.891 | -0.045 | **0.914** | 0.043 |
| GPT-4 | ZS | 0.242 | 0.057 | 0.211 | 0.270 | 0.874 | 0.345 | 0.929 | 0.149 | 0.854 | **0.012** |
| GPT-4 | FS | 0.247 | 0.056 | 0.214 | 0.267 | **0.886** | **0.346** | 0.924 | 0.148 | 0.856 | 0.013 |
| CoARL (Ours) | LoRA10 | **0.251** | **0.078** | **0.220** | 0.244 | 0.876 | 0.226 | **0.944** | **-0.130** | 0.824 | 0.067 |
| ∆ CoARL (Ours) − Best Method | | +0.004 | +0.011 | +0.006 | -0.049 | -0.010 | -0.120 | +0.015 | +0.085 | -0.090 | -0.055 |

#### Key Observations:

Our updated evaluation showcases CoARL's competitive performance against both the GPT-3.5 and GPT-4 baselines, with the latter offering a more comprehensive benchmark due to its advanced capabilities. Noteworthy findings include:

- **Superiority in Intent-Specific Counterspeech:** CoARL achieves higher scores in producing intent-specific and relevant counterspeech, particularly in nuanced categories of hate speech, demonstrating its effectiveness over both GPT-3.5 and GPT-4.
- **Contextual Relevance and Non-Toxicity:** Despite the grammatical coherence and structure offered by the GPT models, CoARL excels at generating non-toxic and contextually relevant responses, underlining its designed purpose and innovation in counterspeech generation.

These adjustments and findings affirm CoARL's advancement in the field, addressing the reviewers' suggestions comprehensively. We are committed to enhancing the manuscript's clarity and ensuring the robustness of our comparative analysis in the final version.

> ### Human Evaluation Metrics: Address the limitation in human evaluation metrics, which may restrict the understanding of the model's real-world applicability and effectiveness.

We thank the reviewers for their constructive suggestion on expanding our human evaluation metrics to better assess the real-world applicability and effectiveness of our model, CoARL. Our initial evaluation focused on ChatGPT and CoARL due to resource constraints. Acknowledging the importance of a broader comparison, we have extended our human evaluations to include comparisons of CoARL with GPT-4 and DialoGPT, respectively.

#### Table: Human Evaluation

Win Rate % of our proposed method (CoARL) vs **DialoGPT**, and few-shot baselines of **GPT-3.5** and **GPT-4**, respectively. A detailed description of all metrics is provided under Section 6.3. The symbol ↑ indicates the higher value is better.

| Model | IC ↑ | A ↑ | CR ↑ | AE ↑ | CA ↑ |
| --- | --- | --- | --- | --- | --- |
| CoARL vs GPT-4 (FS) | **0.57** | 0.32 | 0.39 | 0.46 | **0.61** |
| CoARL vs GPT-3.5 (FS) | **0.60** | 0.31 | **0.55** | 0.38 | **0.68** |
| CoARL vs DialoGPT | **0.62** | 0.47 | **0.73** | **0.58** | **0.66** |

#### Key Observations:

- The additional human evaluation highlighted CoARL's superiority over DialoGPT across several metrics, notably Independent Counterspeech (IC), Contextual Relevance (CR), Argumentative Effectiveness (AE), and Category Accuracy (CA), with Adequacy (A) showing balanced performance between both models. These findings are consistent with the automated metric results, reinforcing CoARL's effectiveness.
- In our human evaluation, CoARL outperformed GPT-3.5 and GPT-4 in generating counterspeech that closely aligned with the intended counter-narrative, particularly excelling in Independent Counterspeech (IC) and Category Accuracy (CA). Annotators displayed a strong preference for CoARL's "Denouncing" counterspeech type, choosing it over the alternatives around 80% of the time, which contributed to its high CA scores. These findings, along with a deeper analysis, will be detailed in the Ablations section of our camera-ready paper, pending acceptance.
- While GPT-4 and GPT-3.5 produced responses that were grammatically more coherent and showed higher Argumentative Effectiveness (AE), GPT-4, in particular, demonstrated a superior capability to identify both explicit and implicit biases in hate speech. This ability led to responses that were more contextually relevant than those from CoARL.
- Moreover, CoARL consistently outperformed DialoGPT across several metrics, including IC, Contextual Relevance (CR), AE, and CA, with the Adequacy (A) metric showing similar performance for both models. These human evaluation results are in line with our quantitative findings, showcasing CoARL's superiority in lexical similarity, relevance, effectiveness, intent conformity, and toxicity management.
- We intend to integrate these comprehensive insights into Section 6.3 of the camera-ready version of our paper, should it be accepted, to provide a full account of CoARL's performance and its comparative advantages over existing models.
---

# Response to the meta reviewer

We express our gratitude to all the reviewers and the meta reviewer for the valuable feedback. In our response, we have meticulously addressed every concern. We hereby commit to integrating all the recommendations outlined in the reviews and the meta review into the final camera-ready version of our paper, should it be accepted. Following is a brief summary of our response.

> Clarification of how CoARL differentiates from or advances beyond existing methods in the field.

CoARL introduces innovative solutions to advance counterspeech generation, addressing key challenges with three main innovations:

- **Auxiliary Explanation Generation (AEG)**: AEG tackles the limitation of existing methods in dealing with implied hate speech by first generating explanations for the hate speech, thereby enhancing the counterspeech's relevance and quality.
- **Parameter-Efficient Fine-Tuning**: Leveraging task-specific LoRA weights for fine-tuning, our method achieves robust counterspeech generation with fewer parameters compared to traditional SFT.
- **PLM-Based Feedback in RLAIF**: CoARL employs PLMs for feedback in RLAIF, diverging from the typical reliance on LLMs for preference data.

To substantiate our claims, we have conducted extensive experiments comparing CoARL with state-of-the-art (SOTA) methods and LLMs like GPT-4. Moreover, our detailed ablation study meticulously examines each component of CoARL, providing empirical evidence of its effectiveness and the innovative contributions of our approach.

> Including GPT-4 as a baseline:

- As suggested by the reviewers, we have conducted additional experiments incorporating both GPT-4 and FLAN-T5 with SFT.

> Incorporating a broader human evaluation:

- Our initial evaluation focused on ChatGPT and CoARL due to resource constraints.
- Acknowledging the importance of a broader comparison, we have extended our human evaluations to include comparisons of CoARL with GPT-4 and DialoGPT, respectively.
