# NAACL 2024 Rebuttal - 2052

## Response to Metareview

We are grateful for the insightful feedback from the reviewers and area chair. Guided by these comments, we are dedicated to enhancing our manuscript. The planned revisions include:

- **Diversification of Experimental Models**: We expanded our experiments to include recent models and found that the latest models are also susceptible to our attacks, while robust performance is achieved with only a naive BERT model when it employs the simple pooling strategies we suggest. These additional experiments will be included to emphasize the robustness and versatility of our approaches.
- **Enhanced Clarity for Experiments**: We aim to refine certain presentations to ensure a clearer interpretation of the results. This involves clarifying how the attacks were created and how our approaches compare to others. We will also revise expressions for clarity and provide more detailed explanations.
- **Expanding on Cross-Lingual Generalizability**: While our approaches focused on demonstrating the efficiency of attacks in Korean, we fully acknowledge the importance of cross-lingual applicability. We plan to provide additional clarification on this aspect, emphasizing the flexibility of our pooling strategies for researchers working with other languages.
- **Real-World Applicability of the Proposed Attacks**: We will incorporate examples of our attacks in real-world scenarios, such as those observed on YouTube, to underscore the practical relevance of the proposed attacks. These examples will be included in the Appendix of our revised manuscript.

The above changes were further developed during the author response period and will be integrated into the final version of our paper. We believe that these revisions enhance the quality and impact of our work, aligning it significantly closer to the standards of NAACL.

## General Response from Authors

Once again, we are very grateful to all the reviewers for their thoughtful consideration of the direction of our research. Below, we summarize the discussion points raised by each reviewer (```6CS2```, ```T2wY```, and ```4V1B```); these reflections represent our thoughts on the reviewers' feedback and will be incorporated into the camera-ready version.

- **Additional models for the experiment (```6CS2```, ```T2wY```, and ```4V1B```)**: In alignment with the referenced papers and the adopted task, we performed experiments using fine-tuned BERTs. Specifically, we compared models that did or did not use noisy texts directly in the pre-training process. RNN-based and ensemble models were also included in the experiments. Since we fully acknowledge the existence of more recent models, we **further conducted experiments** with [DeBERTaV3](https://huggingface.co/team-lucid/deberta-v3-base-korean) and [XLMR](https://huggingface.co/xlm-roberta-base) (please refer to the individual responses for a result table). In conclusion, recent models were also found to be vulnerable to the proposed attacks. Even with a naive BERT, which is not the latest model, we observed robust defence performance against the attacks when combined with our layer-wise pooling strategies (see the illustrative sketch below). These additional experiments will be incorporated in our camera-ready version.
- **Interpreting the experiments (```6CS2```)**: Our observations indicate that, in general, incorporating information from only the first layer enhances performance (as shown in Table 3). Furthermore, we expected that giving more weight to the first and last layers would lead to improved performance, but this was not the best case (as shown in Table 5). We concluded that the optimal performance is achieved by using only the first layer, which most closely captures the distribution of token embeddings. **We are willing to revise expressions** in the camera-ready version to make this easier for the reader to understand.
- **Scalability of the proposed approach (```6CS2```)**: We acknowledge that our attacks are primarily based on specific characteristics of Korean (lines 179-181 in the paper), and the model we employed was pre-trained on specific Korean datasets. We are therefore cautious about claiming that our approach is a general method for detecting attacked offensive language. However, we expect that other languages possess their own linguistic characteristics, and our approach can serve as a foundation for extension. Since our method does not require additional noisy texts and can be implemented with simple modifications, we believe it is easily adaptable by other researchers. **We are committed to modifying the presentation** in the camera-ready version to make this easier for the reader to understand.
- **How the attacks were created (```T2wY```)**: We attempted to utilize the offensive-span boundaries provided by existing datasets; however, due to their human-annotated nature, we encountered some inconsistencies, as illustrated in the examples (please refer to Response 2 to Reviewer ```T2wY```). We also encountered limitations in generating malicious spans, as doing so conflicts with policies that restrict generating offensive content for LLMs like ChatGPT. Consequently, we conducted experiments varying the attack rate over 30%, 60%, and 90%, aiming to create scenarios with diverse attacks across as many words as possible. **We are committed to incorporating this clarification** into the camera-ready version to enhance the readers' understanding.
- **Additional suggestions for the experiment (```T2wY```)**: In our approach, which takes the perspective of applying adversarial attacks, we implemented the proposed attacks on the test set. This means the training set remains unchanged and only the test set is modified. Initially, we explored adversarial training and text augmentation, considering their inclusion in the experiment. However, adversarial training involves modifying the training set itself, and text augmentation directly increases the amount of training data. Consequently, we concluded that a fair comparison between our approach and these approaches would not be justified. **We are dedicated to including this explanation** in the camera-ready version to enhance reader comprehension.
- **Evidence for the proposed attacks (```4V1B```)**: We mentioned in the paper that we utilized existing Korean offensive language datasets to derive the attacks. To address this concern and provide a clearer understanding, we decided to **incorporate real-world examples** from platforms such as YouTube. This will be further supplemented in our camera-ready version.

In addition to the above, there were a few other minor writing issues that reviewers pointed out. We are also willing to revise these expressions in the camera-ready version as needed.
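
For concreteness, the following is a minimal, hypothetical sketch of the kind of layer-wise pooling referred to above: the [CLS] representation of a chosen layer (or an aggregate over layers) is combined additively with the last layer's [CLS] vector before classification. The model name, variable names, and the additive combination are illustrative assumptions on our part, not the exact implementation from the paper; the weighted and down-up/up-down variants discussed in the individual responses would replace the simple selection below with per-layer weights.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative only: any BERT-style Korean or multilingual encoder could be used here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def layerwise_pooled_cls(text: str, strategy: str = "first") -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = encoder(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (num_layers + 1) tensors of shape [1, seq_len, hidden];
    # index 0 is the embedding layer, index -1 is the last encoder layer.
    layers = torch.stack(outputs.hidden_states, dim=0)
    cls_per_layer = layers[:, :, 0, :]          # [CLS] vector from every layer

    if strategy == "mean":                      # mean pooling over layers
        pooled = cls_per_layer.mean(dim=0)
    elif strategy == "max":                     # element-wise max over layers
        pooled = cls_per_layer.max(dim=0).values
    elif strategy == "first":                   # information from the first encoder layer only
        pooled = cls_per_layer[1]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Assumed additive combination with the last layer ("as an additive");
    # the result would then be fed to the classification head.
    return cls_per_layer[-1] + pooled

print(layerwise_pooled_cls("예시 문장입니다", strategy="first").shape)  # torch.Size([1, 768])
```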
We believe that our paper, in the final form it will take through the rebuttal process, adheres to the high standards of NAACL. Once again, we deeply appreciate your time and attention to our paper and responses.

## Reviewer 6CS2

### Response to Weakness 0: concern about the connection between two parts

> While the paper at the end shows their proposed claim that some layer-wise pooling helps improve the detection of Korean offensive language, the connection between two parts is somewhat unclear. On one end, we have a very peculiar classification problem for Korean language, while on the other end, we have a very general method that can be potentially applied to any NN method. So, are authors claiming that the general idea in the latter case helps improve the specific problem in the former? If so, authors could have strengthened their argument in several ways:

We appreciate your insightful feedback and suggestions concerning the connection between the two parts. We demonstrated **how the tokenization results diverge significantly** when malicious user attacks are introduced into offensive language, as illustrated in Tables 1-2 and lines 217-229 of the paper. Even with relatively simple attacks, the model tokenizes the text in a distinctly different manner. To address this, we proposed layer-wise pooling strategies that do not rely solely on information from the last layer but also incorporate information from the preceding layers. Consequently, our proposed strategies exhibited robust performance under various attack rates. We believe this establishes the connection between the two parts, and we delve further into whether it represents a general idea in 'Response to Weakness 2'. Thank you for bringing this to our attention; it serves as a valuable reminder of how the two parts connect, and we would appreciate it if you could take this into consideration when reading our responses to the later weaknesses.

### Response to Weakness 1: concern about the models in the experiment

> First, the paper must have laid out the baselines of how customized detection methods to detect Korean offensive languages perform in general, compared to BERT types in Table 3. Or, is BERT the best method to detect offensive Korean language? Many recent papers that studied the Korean language were cited/discussed, so authors can start from those.

We sincerely appreciate your consideration regarding baselines for the offensive language detection experiments. As you may have noticed, BERT is widely used in offensive language detection tasks, often in fine-tuned form. Notably, the Korean papers we referenced **also conducted experiments based on fine-tuned BERT**, focusing on proposing tailored datasets for offensive language. Therefore, we also opted for BERT-based models, distinguishing them by whether they were pre-trained on noisy texts ('clean' vs. 'noise' BERT models in our experiments). We did not limit ourselves to these: we expanded our experiments to include older RNN-based models, and we additionally employed ensemble models with two BERT models to evaluate the effectiveness of the layer-wise pooling strategies. Nevertheless, we fully acknowledge the existence of more recent models capable of applying layer-wise pooling strategies, and we conducted **additional experiments** to complement our original ones.
As a result, we have employed [DeBERTaV3](https://huggingface.co/team-lucid/deberta-v3-base-korean), fine-tuned for Korean, and [XLMR](https://huggingface.co/xlm-roberta-base), a multilingual model; the results are shown below.

| Model | Original F1 | 30% Attacked F1 | 60% Attacked F1 | 90% Attacked F1 | Average F1 | Average ∆atk |
|---|---|---|---|---|---|---|
| $\text{DeBERTaV3}$ | 80.17 | 76.48 | 70.44 | 64.99 | 70.63 | -11.89% |
| $\text{DeBERTaV3} + \text{mean}$ | 79.17 | 75.28 | 70.10 | 65.31 | 70.23 | **-11.29%** |
| $\text{DeBERTaV3} + \text{max}$ | 79.42 | 75.21 | 69.81 | 64.81 | 69.94 | -11.93% |
| $\text{DeBERTaV3} + \text{weighted}$ | 79.21 | 75.13 | 65.28 | 65.28 | 70.19 | -11.38% |
| $\text{DeBERTaV3} + \text{first}$ | 79.99 | 75.92 | 65.74 | 65.74 | **70.78** | -11.51% |
| $\text{XLMR}$ | 72.38 | 67.87 | 60.87 | 55.78 | 61.50 | -15.03% |
| $\text{XLMR} + \text{mean}$ | 72.37 | 68.60 | 62.16 | 57.40 | 62.72 | -13.33% |
| $\text{XLMR} + \text{max}$ | 72.62 | 69.45 | 64.36 | 59.47 | 64.42 | **-11.29%** |
| $\text{XLMR} + \text{weighted}$ | 76.46 | 71.10 | 65.49 | 62.45 | **66.34** | -13.23% |
| $\text{XLMR} + \text{first}$ | 72.65 | 69.01 | 62.75 | 57.34 | 63.03 | -13.24% |

First, we observed the robustness of DeBERTa to the attacks, evidenced by an increase in the average F1-score when utilizing first pooling, consistent with the observations in our paper. As for XLMR, all layer-wise pooling strategies exhibited resilience against the attacks, with weighted pooling achieving the highest F1-score and max pooling proving the most robust in terms of ∆atk. This demonstrates that layer-wise pooling strategies can effectively defend against the attacks even for recent models. However, neither of the recent models could achieve scores comparable to the $\text{BERT}_{clean}$ or $\text{BERT}_{noise}$ models employed in the paper in terms of average F1-score or ∆atk, even though XLMR uses more than twice as many parameters as the other models. This implies that even **naive BERT models can achieve sufficiently robust performance by utilizing a layer-wise pooling strategy** that does not require additional training.

| Model | Original F1 | 30% Attacked F1 | 60% Attacked F1 | 90% Attacked F1 | Average F1 | Average ∆atk |
|---|---|---|---|---|---|---|
| $\text{BERT}_{clean}$ | 78.64 | 75.19 | 67.96 | 62.44 | 68.52 | -12.85% |
| $\text{BERT}_{clean} + \text{max}$ | 78.65 | 75.54 | 68.29 | 61.72 | 68.51 | -12.88% |
| $\text{BERT}_{clean} + \text{first}$ | 79.21 | 77.02 | 71.21 | 65.49 | **71.24** | **-10.05%** |
| $\text{BERT}_{noise}$ | 79.64 | 77.17 | 71.33 | 66.96 | 71.82 | -9.81% |
| $\text{BERT}_{noise} + \text{max}$ | 80.23 | 77.57 | 73.64 | 71.06 | **74.08** | -7.64% |
| $\text{BERT}_{noise} + \text{first}$ | 79.12 | 76.79 | 73.28 | 70.66 | 73.57 | **-7.00%** |

### Response to Weakness 2: concern about where the improvements come from, scalability for other languages, and whether the proposed is the general idea

> Second, results in Tables 3-5 show that when one applies some pooling strategy to Korean offensive detection problem, it shows some improvement (but not always uniformly and consistently). However, this doesn't help us understand where this improvement comes from exactly.
> If the proposed pooling indeed is good at capturing the offensiveness and embedding of the language in context, for instance, how does it work for other offensive languages, say Chinese and Korean? Do we see the improvements just because the context is Korean (if so why or why not)? Or, are authors claiming that the layer-wise pooling idea as the general idea toward offensive language detection?

We genuinely appreciate your feedback, which has provided valuable insights into the importance of discussing our approach and its possible concerns. First of all, we will discuss why the improvements did not appear uniform and consistent later, in 'Response to Weakness 3'. When we applied the proposed pooling strategies, we observed that first pooling, which utilizes only the information from the first layer as an additive, generally performed best (lines 352-362 in the paper). Subsequently, we conducted further experiments expecting that performance could be improved by weighting the first and last layers most heavily, focusing on the token embeddings and the offensiveness, respectively. We found that weighting layers close to the first and last layers more than the middle layers indeed led to better performance (**down-up pooling > up-down pooling**, lines 438-448 in the paper). However, this did not surpass using only the information from the first layer (**first pooling > down-up pooling**, lines 449-457 in the paper). We therefore concluded that while utilizing information close to the first and last layers can be beneficial, certain layers actually hinder model performance. As a result, we found that using information solely from the first layer, which **most closely captures the token embeddings, results in performance gains** in the context of attacked offensive language.

The premise of the layer-wise pooling strategies is grounded in the observation that when attacks from malicious users are introduced, **the tokenization results differ significantly from the original ones**, as illustrated in Tables 1-2 in the paper. Especially in the case of Korean, as mentioned in lines 179-181, the characteristic that a single character must contain an initial sound, a middle sound, and an optional final sound allows for a wide variety of attacks. In a similar vein, we anticipate that other languages, such as Chinese, may possess **unique linguistic features that facilitate adversarial attacks from the perspective of malicious users**. Our pooling strategies are expected to be meaningful whenever such attacks result in different tokenization outcomes. It is important to note that additional factors, such as the pre-trained models in each language and the datasets used to train them, should also be taken into consideration.

Finally, we are cautious about claiming that the layer-wise pooling strategies are a general solution for offensive language detection. As explained above, **our assumptions were based on a scenario** where 1) adversarial attacks specific to the Korean language were applied, and 2) multilingual models were also experimented with, although our primary focus remained on models pre-trained on Korean. In conclusion, given that we adopted layer-wise pooling strategies to capture aspects of the text that vary under adversarial attacks, we believe that there are ample opportunities to extend our approach to other languages.
Furthermore, this does not require the use of noisy texts in the training process and can be implemented with simple modifications to the model. Researchers are encouraged to carefully consider the definition of adversarial attacks based on linguistic characteristics and their possible use. We appreciate the reviewer's thoughtful insights into the reasoning behind our experimental results and the scalability of our ideas. Thanks to these valuable points, we were able to refocus the direction of the paper, and we will incorporate the necessary additions into the camera-ready version. Once again, we express our gratitude for highlighting these issues.

### Response to Weakness 3: explanation about the experiments

> Third, as the attack rates increase, offensive langue detection using pooling idea also seems to increase. Why is this happening? Also, Table 1 shows many types of attacks possible in Korean, yet description in 4.1 doesn't detail how attacks were performed and what types. For instance, when you say "30% attacks," what does it mean? which type of attacks in Table 1 were applied and how? What type of attacks were more easily detectable or not, and what's their relationship with respect to the pooling strategies? None of these relevant questions were answered, limiting the learning from this paper.

We sincerely appreciate your feedback, as it has shed light on the importance of the discussion, particularly regarding concerns related to the experiments and notation. We presented how the attacks are performed in lines 274-279 of the paper, **with further details** provided in lines 803-804 **in the Appendix**: one of the proposed attacks is randomly selected for each attacked word. For instance, with a 30% attack rate and 9 words in a sentence, we attacked 3 words (30% of them), with each word randomly receiving one of the proposed attacks (a minimal sketch of this procedure is given below). We **also investigated which type of attack is more likely to be detected** in the Appendix; the results are presented in lines 811-828 and Table 11. We presented the details of these settings separately in the Appendix to offer a comparative description of how the attacks were performed in the main and Appendix experiments.

We measured the performance under each type of attack and found that the performance ranking was *Insert* > *Copy* >>> *Decompose*, which suggests that **the difficulty of the attacks follows the order *Insert* < *Copy* <<< *Decompose***. When applying first pooling to each attack, the preserved performance was 1.88%, 1.97%, and 2.37% for *Insert*, *Copy*, and *Decompose*, respectively, i.e., a higher ratio for the *Copy* and *Decompose* attacks, which reflect the characteristics of Korean. This difference can be attributed to the fact that, unlike the *Insert* attack, which simply adds unnecessary special symbols at certain locations, **our pooling strategies can effectively defend against attacks that exploit the Korean language**, which offers a much wider variety of attacks because each letter is composed of an initial sound, a middle sound, and an optional final sound (lines 179-181 in the paper). As a result, the layer-wise pooling strategies that capture various token distributions prove more effective for the *Copy* and *Decompose* attacks that use these characteristics of Korean. This difference in the effectiveness of each attack type is the primary cause of the experimental results that might be perceived as inconsistent across attack rates.
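
To make the sampling procedure above concrete, here is a minimal sketch of how such word-level attacks could be applied. The helper functions are simplified stand-ins of our own (the *Copy* attack is omitted because its exact form is not spelled out in this response, and NFD normalization only approximates the jamo decomposition used for the *Decompose* attack); this is not the paper's actual implementation.

```python
import random
import unicodedata

def insert_attack(word: str) -> str:
    """'Insert': add an unnecessary special symbol at a random position."""
    pos = random.randint(0, len(word))
    return word[:pos] + random.choice("!@#*") + word[pos:]

def decompose_attack(word: str) -> str:
    """'Decompose' (approximation): split Hangul syllables into jamo
    (initial/middle/final sounds) via Unicode NFD normalization."""
    return unicodedata.normalize("NFD", word)

ATTACKS = [insert_attack, decompose_attack]  # 'Copy' omitted in this sketch

def attack_sentence(sentence: str, rate: float) -> str:
    """Attack `rate` of the words; each attacked word gets one random attack type."""
    words = sentence.split()
    n_attacked = max(1, round(rate * len(words)))   # e.g., 3 of 9 words at a 30% rate
    targets = set(random.sample(range(len(words)), n_attacked))
    return " ".join(
        random.choice(ATTACKS)(w) if i in targets else w
        for i, w in enumerate(words)
    )

random.seed(0)
print(attack_sentence("이 문장에 공격을 적용해 봅니다", rate=0.3))
```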
As we mentioned above, the attack rate denotes the amount of attack, and each attack type is randomly selected. Due to this randomness in choosing the attacks, the resulting performance improvement may appear inconsistent even as the attack rate increases. However, despite such possible inconsistency, the experimental results suggest that our proposed method can defend against these attacks without significant effort, regardless of the amount of attack. We conducted the additional experiments above in the Appendix to confirm the impact of each attack, and we appreciate the opportunity to further clarify our intent in response to these questions. We hope that our explanation is helpful in addressing the reviewer's concerns.

## Reviewer T2wY

### Response to Weakness 1: concern about the attacks

> For the attacks, the clean model only drops by 4% when 30% of the text is affected. This only increases to 20% at 90% of the text. This seems to point to a possible weakness in the attack itself. I understand that the attack is trying to replicate a user, but a user has the advantage of knowing which words are more likely offensive, where the attack is using randomness.

We sincerely appreciate your feedback, which sheds light on the significance of the discussion regarding this concern about the attacks. When we defined the types of attacks, we referred to the forms of offensive language found in existing datasets. Additionally, we chose to introduce only one type of attack per word, regardless of the word's length. Although it was feasible to incorporate multiple attacks into a single word, potentially leading to stronger attacks, we concluded that such a scenario does not accurately represent the real-world cases observed online. Therefore, we implemented the proposed **attacks at a realistic level** and established three attack rates, 30%, 60%, and 90%, for quantitative comparison.

In the early phases of our study, we considered inserting attacks into the more offensive words. The datasets we utilized, KOLD (Jeong et al., 2022) and K-HATERS (Park et al., 2023a), provide **boundary information** indicating which parts of a sentence contain offensiveness. However, these datasets rely on human annotation, and when we examined the boundaries in the datasets, we found **inconsistencies in the criteria for establishing boundaries** in each sentence. We frequently observed cases labeled as boundaries even though they did not actually contain offensiveness, and vice versa.

[Examples of inconsistencies]

- Example 1 from the KOLD dataset
  - Korean: "사원은 못하지만 대리 과장은 할수 있습니다!!!!"
  - English: "Employees can't but acting chiefs can!!!!"
  - Boundary: [사원은, 못하지만, 대리, 과장은, 할수, 있습니다]
  - Comment: Every word was annotated as offensive, even though there are no explicit offensive expressions.
- Example 2 from the K-HATERS dataset
  - Korean: "마약쟁이들 특히 뽕쟁이들은 사형이지"
  - English: "The death penalty for junkies, especially meth heads"
  - Boundary: []
  - Comment: Despite the offensive language in the sentence, no words were annotated as offensive.

After that, we tried to employ language models to identify words containing offensiveness. However, we encountered some challenges, **including typo errors** present in the existing datasets, making it difficult to accurately select words containing offensiveness.
Additionally, when employing an LLM service like ChatGPT, we faced limitations because **its usage policy (https://openai.com/policies/usage-policies) restricts outputs** containing severe swearing and discriminatory expressions. Therefore, in our experiments, we configured the attack rates at 30%, 60%, and 90% to ensure the maximal inclusion of user-intended adversarial attacks across as many words as possible. It is noteworthy that determining which words are more offensive from the model's perspective, irrespective of typo errors, exceeds the scope of our current study: such a setting assumes that malicious users are aware of the model output for each input, which is not part of the current scenario. We will clarify this setting in the camera-ready version.

### Response to Weakness 2: concern about the pooling strategies

> The pooling method is even more of a concern than the attack. When the method is applied, the score is improved, but at a small rate (4.18 -> 2.76 for 30%). Even when the drop is larger (20% at 90%) the improvement does not scale accordingly, the differences for 30%, 60%, and 90% are 1.42, 3.06, 2.82. These small increases mean the model is still greatly affected by the attacks.

We sincerely appreciate your feedback, as it has illuminated the importance of the discussion concerning our approach, particularly regarding the pooling strategies. While it is true that our layer-wise pooling strategies resulted in a small improvement in scores, it is noteworthy that these improvements, albeit small, **not only outperform the ensemble model** that uses two models simultaneously **but are also comparable to $\text{BERT}_\textit{noise}$**, which was pre-trained on noisy texts. Moreover, considering that we achieved these results without incorporating additional noisy texts in the training stage, we believe it is crucial to pay closer attention to how effectively we leveraged the provided dataset and the model structure.

### Response to Weakness 3: suggestion for the comparison

> The introduction (line 083) suggests that the advantage of this method is no retraining is necessary (which is true), however, a more apt comparison would be a model which employs adversarial training (or data augmentation) rather than the Bert noisy model.

We genuinely appreciate your feedback, which has provided valuable insights into the importance of discussing our approach and considering this suggestion for comparison. In the early phases of our study, we initially explored **adversarial training** with the proposed attacks. In this approach, the attacked offensive texts are used directly in the training stage, creating the potential for selective proficiency against the types of attacks observed during training. In the case of Korean, as mentioned in lines 179-181 of the paper, a single character must contain an initial sound, a middle sound, and an optional final sound. Significantly, the proposed user-intended adversarial attacks, namely *Copy* and *Decompose*, capitalize on this characteristic, **leading to the potential occurrence of numerous adversarial attacks**. Given this, we believe that incorporating specific adversarial attacks directly into the training process would diminish the real-world applicability of offensive language detection.

We also explored **data augmentation**; however, since our approach primarily revolves around the concept of adversarial attacks, we decided to **introduce adversarial attacks into the test set without changing the training set**. Given this scenario, we believe that a direct comparison with data augmentation under the same conditions would not be feasible. While we appreciate the suggestion of comparing against these two approaches, we believe that establishing a fair comparison with adversarial training or data augmentation would be challenging for our approach.

## Reviewer 4V1B

Thank you for dedicating your time to a comprehensive review of our manuscript. We have carefully considered your comments and concerns and would like to address them as follows:

### Response to Weakness 1: Realistic evidence for user-intended adversarial attacks

> The authors claim that the adversarial attacks they consider are realistic and frequent, but do not support this claim with evidence. It would be good to add this for the camera-ready.

We acknowledge your feedback, which rightfully points out the necessity of concrete evidence supporting our claim. In response, we are committed to improving our manuscript by **including real-world examples of the discussed attacks**, specifically drawn from platforms like YouTube. Recognizing the crucial importance of privacy and the avoidance of potential profile leakage, we assure you that we will meticulously anonymize the profiles involved in these examples to uphold confidentiality. Your insightful comment emphasizes the importance of grounding our theoretical claims in practical, real-world evidence. We believe that this addition will not only strengthen our argument but also offer a more comprehensive understanding of the issue at hand. We are grateful for the opportunity to refine our work, making it more relevant and impactful.

### Response to Weakness 2: Suggestion for the experiment using recent models

> BERT is quite old by now and is known to use suboptimal tokenization (especially for under-resourced languages like Korean). It would be good to include at least one more recent language model like DeBERTa. I understand this may not be available pretrained for Korean. In that case, a multilingual model like XLMR would still be a good and more recent comparison.

We genuinely appreciate your feedback, which has brought attention to the importance of discussing our approach, especially regarding the suggestion to include recent models. As you may have noticed, while BERT is quite old, it is widely used in offensive language detection tasks, often in fine-tuned form. Notably, numerous Korean papers we referenced have conducted experiments based on fine-tuned BERT. Consequently, we chose BERT-based models, distinguishing them by whether they were pre-trained on noisy texts ($\text{BERT}_{clean}$ vs. $\text{BERT}_{noise}$). That said, we fully agree with the reviewer's suggestion to include more recent models. As a result, we have employed [DeBERTaV3](https://huggingface.co/team-lucid/deberta-v3-base-korean), fine-tuned for Korean, and [XLMR](https://huggingface.co/xlm-roberta-base), a multilingual model, for **additional experiments**. The results are shown below.

| Model | Original F1 | 30% Attacked F1 | 60% Attacked F1 | 90% Attacked F1 | Average F1 | Average ∆atk |
|---|---|---|---|---|---|---|
| $\text{DeBERTaV3}$ | 80.17 | 76.48 | 70.44 | 64.99 | 70.63 | -11.89% |
| $\text{DeBERTaV3} + \text{mean}$ | 79.17 | 75.28 | 70.10 | 65.31 | 70.23 | **-11.29%** |
| $\text{DeBERTaV3} + \text{max}$ | 79.42 | 75.21 | 69.81 | 64.81 | 69.94 | -11.93% |
| $\text{DeBERTaV3} + \text{weighted}$ | 79.21 | 75.13 | 65.28 | 65.28 | 70.19 | -11.38% |
| $\text{DeBERTaV3} + \text{first}$ | 79.99 | 75.92 | 65.74 | 65.74 | **70.78** | -11.51% |
| $\text{XLMR}$ | 72.38 | 67.87 | 60.87 | 55.78 | 61.50 | -15.03% |
| $\text{XLMR} + \text{mean}$ | 72.37 | 68.60 | 62.16 | 57.40 | 62.72 | -13.33% |
| $\text{XLMR} + \text{max}$ | 72.62 | 69.45 | 64.36 | 59.47 | 64.42 | **-11.29%** |
| $\text{XLMR} + \text{weighted}$ | 76.46 | 71.10 | 65.49 | 62.45 | **66.34** | -13.23% |
| $\text{XLMR} + \text{first}$ | 72.65 | 69.01 | 62.75 | 57.34 | 63.03 | -13.24% |

First, we observed the robustness of DeBERTa to the attacks, evidenced by an increase in the average F1-score when utilizing first pooling, consistent with the observations in our paper. As for XLMR, all layer-wise pooling strategies exhibited resilience against the attacks, with weighted pooling achieving the highest F1-score and max pooling proving the most robust in terms of ∆atk. This demonstrates that layer-wise pooling strategies can effectively defend against the attacks even for recent models. However, neither of the recent models could achieve scores comparable to the $\text{BERT}_{clean}$ or $\text{BERT}_{noise}$ models employed in the paper in terms of average F1-score or ∆atk, even though XLMR uses more than twice as many parameters as the other models. This implies that even **naive BERT models can achieve sufficiently robust performance by utilizing a layer-wise pooling strategy** that does not require additional training.

| Model | Original F1 | 30% Attacked F1 | 60% Attacked F1 | 90% Attacked F1 | Average F1 | Average ∆atk |
|---|---|---|---|---|---|---|
| $\text{BERT}_{clean}$ | 78.64 | 75.19 | 67.96 | 62.44 | 68.52 | -12.85% |
| $\text{BERT}_{clean} + \text{max}$ | 78.65 | 75.54 | 68.29 | 61.72 | 68.51 | -12.88% |
| $\text{BERT}_{clean} + \text{first}$ | 79.21 | 77.02 | 71.21 | 65.49 | **71.24** | **-10.05%** |
| $\text{BERT}_{noise}$ | 79.64 | 77.17 | 71.33 | 66.96 | 71.82 | -9.81% |
| $\text{BERT}_{noise} + \text{max}$ | 80.23 | 77.57 | 73.64 | 71.06 | **74.08** | -7.64% |
| $\text{BERT}_{noise} + \text{first}$ | 79.12 | 76.79 | 73.28 | 70.66 | 73.57 | **-7.00%** |

### Response to Comments & Suggestions

> In lines 454f it sounds like "first" pooling combines both the first and the last layer? Same in Figure 1. Is this correct? If so, it may be good to rename "first" into "first-last" pooling.
> - For figure 2, would it not make more sense for each line to be a model? And then the x-axis could be attack rate from 0 to 90%.

We really appreciate your invaluable insights regarding the naming of our pooling strategy. Your reading of "first" pooling is correct, and we think renaming it is a good idea; we will change it in the camera-ready version. Thank you for your comment, which will enhance the clarity of our method for readers. Additionally, we are grateful for your observations on Figure 2.
We initially followed the reviewer's suggestion and plotted Figure 2 accordingly, but we concluded that its current form is more intuitive and makes it easier to compare models: it is clearer to show the difference between a model before and after applying a layer-wise pooling strategy than to put the attack rates on the x-axis. Thank you for considering the clarity of the figure.

> I would consider only showing F1 and delta(atk) in the main body for tables like Table 5, then move the other metrics to the appendix.
> - Table 4: specify "certain" ratio of adversarial attacks

We also appreciate your suggestions concerning the presentation of the tables. Your insights are indeed valuable, and we will make sure to integrate them thoughtfully into the camera-ready version of our manuscript. Your feedback has been instrumental in enhancing the clarity and effectiveness of our presentation, and we are thankful for your constructive guidance.
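
For readers cross-checking the tables above: the aggregate columns are consistent with the following definitions, which are our inference from the reported numbers rather than something stated explicitly in these responses.

$$
\text{Average F1} = \frac{1}{3}\sum_{r \in \{30\%,\, 60\%,\, 90\%\}} \text{F1}_{r},
\qquad
\text{Average } \Delta_{\text{atk}} = \frac{\text{Average F1} - \text{F1}_{\text{original}}}{\text{F1}_{\text{original}}} \times 100\%
$$

For example, for the $\text{DeBERTaV3}$ row, $(76.48 + 70.44 + 64.99)/3 \approx 70.64$ and $(70.63 - 80.17)/80.17 \approx -11.9\%$, matching the reported 70.63 and -11.89%.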
