## To AC: Authors' Concerns Regarding the Discussion Period
Dear Area Chair,
Thank you for your previous responses to our concerns regarding the reviews from Reviewer gbA5 & Reviewer iova. We deeply and sincerely thank you and all reviewers for your valuable time and comments on our paper. Your efforts greatly help us to improve our paper.
In this letter, **we wish to bring to your attention our concerns regarding the author-reviewer discussion period**, especially the follow-up comments from Reviewer gbA5 and Reviewer iova. Unfortunately, **the concerns about these two reviewers that we raised in our previous letter to you seem to be coming true**.
- **Concerns regarding Reviewer gbA5**. This reviewer asked some new questions after we submitted the rebuttal, but **many of them seem to have been raised merely to justify rejecting our submission**.
- For example, **this reviewer insists on criticizing some reasonable assumptions of our theorem**. However, these assumptions are commonly used in many deep learning theory papers. More importantly, **our paper is not theory-oriented**: we provided Theorem 3.1 solely to argue that the parameter-oriented scaling consistency (PSC) phenomenon is not accidental.
- In particular, **his/her scoring is completely unfair**. As acknowledged in the initial review, our paper introduces a novel concept of leveraging parameter-oriented scaling consistency for backdoor detection; the idea is simple and interesting, and the observed PSC phenomenon is backed by a theoretical analysis. However, **despite such praise and the fact that we have addressed almost all of his/her concerns, he/she still maintains an unreasonable 'reject' score**.
- **Concerns regarding Reviewer iova**. During the discussion, Reviewer iova admitted that we had clarified his/her misunderstandings about our threat model and targeted defense scenarios. Currently, only two minor questions regarding our additional experiments remain, and we believe we have addressed them well in our follow-up responses. However, despite our repeated reminders, **Reviewer iova disappeared during the remaining discussion period**. In particular, **Reviewer iova still has not updated his/her score**, even though we have addressed almost all of his/her concerns.
- Besides, since we are now within a few hours of the end of the discussion period, **we may not have enough time to answer new questions if they arise**, although we are willing to do so.
**We sincerely hope that you could kindly take note of these issues and take our rebuttal and these circumstances into account when making your final decision** :)
We are deeply sorry for any inconvenience that our notifications may cause you. Thank you again for all your kind and valuable efforts throughout the reviewing process.
Best Regards, Paper5283 Author(s)
---
## Author Response (Reviewer gbA5)
Thank you very much for your careful review of our paper and thoughtful comments. We are encouraged by your positive comments on our **novel approach**, **effectiveness**, **novel concept**, and **simple and interesting idea**. We hope the following responses can help clarify potential misunderstandings and alleviate your concerns.
---
**Q1**: Limitation to 10% Poisoning Rate: The experiments are primarily conducted under a 10% poisoning rate. Considering backdoor attacks can vary in intensity, it's crucial to test the method under a wider range of poisoning rates to ensure its effectiveness in different scenarios. Even the scaled-up paper considers 5%-10% poisoning rates.
**R1**: Thank you for highlighting this important aspect! We understand your concern about the generalizability of our defense with respect to the poisoning rate. As you may have overlooked, we have already tested the impact of varying poisoning rates (from 2% to 10%) in Section 5.4 (Figure 7). For your convenience, we reproduce the results in Tables 1-3 below. **The results suggest that our method is highly effective even when the poisoning rate is small** (e.g., 2%).
**Table 1. The performance (AUROC, F1) of our IBD-PSC in defending against BadNets with different poisoning rates ($\rho$) on CIFAR-10.**
| $\rho$ (%) | 2 | 4 | 6 |8 | 10 |
|-----|-------|-------|-------|-----|-----|
|ASR| 1.000 | 1.000|1.000 | 1.000 | 1.000 |
|AUROC| 0.999 |0.998|0.999|1.000|1.000|
|F1|0.928|0.912|0.961 |0.981| 0.967|
**Table 2. The performance (AUROC, F1) of our IBD-PSC in defending against WaNet with different poisoning rates ($\rho$) on CIFAR-10.**
| $\rho$ (%) | 2 | 4 | 6 |8 | 10 |
|-----|-------|-------|-------|-----|-----|
|ASR| 0.968 | 0.972|0.994 | 0.996 | 0.997 |
|AUROC| 0.966 |0.977|0.978|0.983|0.984|
|F1|0.944|0.959|0.955|0.960|0.956|
**Table 3. The performance (AUROC, F1) of our IBD-PSC in defending against BATT with different poisoning rates ($\rho$) on CIFAR-10.**
| $\rho$ (%) | 2 | 4 | 6 |8 | 10 |
|-----|-------|-------|-------|-----|-----|
|ASR| 0.999 | 1.000| 1.000 | 1.000 | 1.000 |
|AUROC| 0.999 |1.000|0.998|0.985|0.999|
|F1|0.972|0.967|0.942|0.958|0.979|
---
**Q2**: Hyperparameter Selection: The paper acknowledges the presence of multiple hyperparameters (e.g., the scaling factor, predefined threshold for selecting layers, the number of models, and the threshold for determining poisoned images) but does not provide a clear methodology for their optimal selection, which could affect the reproducibility and effectiveness of the method in different contexts.
**R2**: Thank you for this insightful question! We are deeply sorry that our submission may lead to some potential misunderstandings that we want to clarify here.
- **We use consistent hyper-parameter settings across different attacks and datasets.** We maintain the same set of hyperparameters throughout all experiments, ensuring the transparency and replicability of our method. These parameters, detailed in Section 5.1, include the default settings of $\omega=1.5$, $n=5$, $\epsilon=60\%$, and $T=0.9$. As such, **our defense works well because we have a better understanding of backdoor attacks and defenses, not because it has more hyperparameters to tune**.
- **Our defense is not sensitive to the selection of hyper-parameters**. As shown in the ablation studies included in our appendix (Appendices J, N, O, and Q), our method achieves stable and promising performance (>0.9) when the hyper-parameters are near our default settings (i.e., $n=5$, $T=0.9$, $\epsilon=60\%$, and $\omega=1.5$).
- We argue that **finding a clear methodology for selecting hyper-parameters is not always possible**. For example, we cannot directly optimize the selection of learning rates and training epochs for DNNs, and other existing defenses also involve quite a few hyperparameters. However, we do understand your concern. We will add more details about how we chose these default settings in the appendix of our revision (see the sketch below).
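For clarity, a minimal sketch of this default configuration (the variable names below are illustrative, not taken from our released code):

```python
# Default IBD-PSC hyperparameters used in all experiments (names are illustrative).
DEFAULTS = {
    "omega": 1.5,     # scaling factor applied to BN parameters
    "n": 5,           # number of parameter-amplified models
    "epsilon": 0.60,  # threshold for selecting BN layers to amplify
    "T": 0.9,         # threshold for flagging an input as poisoned
}
```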
---
**Q3**: Informal Theorem Presentation: The theorem provided appears informal and lacks essential assumptions, making it challenging to understand the mathematical basis for why poisoned inputs would be classified into the target class under the model's conditions. Are inputs here clean or poisoned? If they are poisoned, then the model should always predict them to target.
**R3**: Thank you for pointing this out! We are deeply sorry that we failed to present this theorem clearly.
- In this theorem, we intended to prove that both poisoned and benign samples will be predicted as the target class when the scaling factors are sufficiently large.
- The prediction of poisoned samples during the amplification process remains the same since the label of poisoned samples is already the target label.
- In this theorem, we only assume that the batch norm feature of samples at the $l$-th layer of a backdoored DNN, i.e., $b=f_l\circ\cdots\circ f_1$, follows a mixture of Gaussian distributions. The definitions of the batch norm feature, (the $l$-th layer of) a DNN, and a mixture of Gaussian distributions are given in Section 3 (Lines 135-143), Eq.(1), and Eq.(A1)-(A9), respectively.
- To avoid potential misunderstandings, we will revise this theorem as follows:
**Theorem 3.1. Let $F=FC\circ f_L\circ\dots\circ f_1$ be a backdoored DNN with $L$ hidden layers and target class $t$. Let $x$ be an input and $b=f_l\circ\cdots\circ f_1(x)$ be its batch-normalized feature after the $l$-th layer ($1\leq l\leq L$). Assume that $b$ follows a mixture of Gaussian distributions. Then the following two statements hold: (1) Amplifying the $\beta$ and $\gamma$ parameters of the $l$-th layer can make $\Vert b \Vert_2$ arbitrarily large, and (2) There exists a positive constant $M$ that is independent of $\hat{b}$, such that whenever $\hat{b}$ (i.e., the amplified version of $b$) satisfies $\Vert \hat{b} \Vert_2 > M$, then $\arg \max FC\circ f_L\circ\dots\circ f_{l+1}(\hat{b})=t$, even when $\arg \max FC\circ f_L\circ\dots\circ f_{l+1}(b)\not=t$.**
PS: We will also add a remark after this theorem to provide more details about our assumptions.
---
**Q4**: Lack of Consideration for Adaptive Attacks Designed for this defense: While the paper tests resistance to some 'adaptive' attacks (not developed for this method), adaptive strategies designed specifically to counter the proposed detection method would strengthen its validity. For example, an attack can generate poisoned data to mitigate the impact of amplifying parameters.
**R4**: Thank you for highlighting this important aspect! We do understand your concern about the resistance of our defense to attacks that are directly designed for our method.
- As you may have overlooked, **we have designed strong adaptive attacks in Section 5.4 (Lines 402-439)**, where we assume the adversaries possess complete knowledge of our method. Specifically, we designed an adaptive loss term in Eq.(5). In general, it plays a crucial role in ensuring the accurate prediction of benign samples even under model parameter amplification. It reduces the prediction discrepancy between benign and poisoned samples during amplification and therefore serves as an adaptive attack. As shown in Table 4 below, **our method remains effective even under this adaptive attack**. We argue that this effectiveness primarily originates from our adaptive layer selection strategy, which dynamically identifies BN layers for amplification, regardless of whether the model is a vanilla or an adaptively backdoored one. The layers selected during the inference stage typically differ from those used in the training phase, enabling IBD-PSC to effectively detect poisoned samples.
**Table 4. The performance (AUROC, F1) of IBD-PSC under adaptive attacks designed in our submission.**
| $\alpha$$\rightarrow$ | 0.2 |0.5 | 0.9 | 0.99|
|------------------------|-------------|----------|-------------|-------------|
| Attacks$\downarrow$ | AUROC / F1 | AUROC /F1 | AUROC/ F1| AUROC / F1 |
| BadNets | 0.992/0.978 | 0.986 / 0.964 | 0.995 / 0.962 | 0.996 / 0.951 |
| WaNet | 0.947 / 0.949 | 0.956 / 0.942 | 0.931 /0.927 | 0.819 / 0.862 |
| BATT | 0.986 / 0.968 | 0.994 / 0.956 | 0.982 /0.975 | 0.979 / 0.959 |
- **As you suggested, we can also design another adaptive attack to mitigate the impact of amplifying parameters**. Specifically, we focus on reducing the confidence with which poisoned samples are predicted into the target class by parameter-amplified models. To further alleviate your concern, we design another adaptive loss term inspired by label smoothing and conduct additional experiments under the same settings as those used in Section 5.4. As shown in Table 5 below,
- **Decreasing the confidence of poisoned samples significantly reduces the accuracy of backdoored models on benign samples (BA)**. This is because poisoned samples retain a considerable portion of benign features. When confidence scores are reduced, it becomes more difficult for models to learn these benign features, which are generally harder to capture than trigger features. This is further illustrated by the almost unaffected BA of the WaNet attack, since WaNet distorts the entire image, leaving few unmodified pixels in the poisoned images.
- **Our method remains effective even under this new adaptive attack**. As analyzed in the previous discussion on adaptive attacks, the effectiveness of IBD-PSC largely stems from our adaptive layer selection strategy. This strategy dynamically identifies BN layers for amplification, regardless of whether the model is a vanilla or an adaptively backdoored model, ensuring the robustness of our defense mechanism across various scenarios.
**Table 5. The effectiveness of the adaptive attack suggested by the reviewer and the performance (AUROC, F1) of IBD-PSC against the adaptive attack on CIFAR-10. We mark the failed cases (where $BA<70\%$) in red, given that the accuracy of models unaffected by backdoor attacks on clean samples is 94.40%.**
| $\alpha'$$\rightarrow$ |0.01 | 0.01 |0.1|0.1| 0.5 | 0.5 |
|------------------------|-------------|----------|-------------|-------------|-------------|-------------|
| Attacks$\downarrow$ Metrics $\rightarrow$ | BA / ASR | AUROC /F1 | BA / ASR | AUROC / F1 |BA / ASR |AUROC/ F1|
| BadNets|0.832 / 0.887 |0.877 / 0.924|0.802 / 0.874 |0.874 / 0.861| <font color="red">0.101</font>/ 0.997 |- / -|
| WaNet |0.909 / 0.999 | 0.999 / 0.956| 0.871 / 0.992 | 0.985 / 0.934 |0.852 / 0.891|0.887 / 0.895|
| BATT |0.745 / 0.997 |0.996 / 0.982 | <font color="red">0.648</font> / 0.998|- / - |<font color="red">0.463</font> / 0.994 |- / -|
Note: **We also experimented with an adaptive attack variant that directly classifies poisoned samples to their ground-truth labels under parameter amplification.** However, this approach imposed too strong a regularization, preventing us from obtaining a backdoored model with an acceptable Attack Success Rate (ASR). This phenomenon aligns with our Theorem 3.1, which suggests that deliberate parameter amplification pushes any sample toward the target class. Consequently, **forcing poisoned samples to their ground-truth class under amplification during training could result in attack failure**.
---
## Author Response (Reviewer 1rE9)
Thank you very much for your careful review of our paper and thoughtful comments. We are encouraged by your positive comments on our **interesting and insightful analysis**, **greatly significant topic**, **sufficient interest to ICML audiences**, **exciting theoretical support and inspiring empirical results**, **simple yet effective method**, **comprehensive experiments**, **well-written paper**, **good paper with deep insights**, and **good performance**. We hope the following responses can address your concerns well.
**Q1**: Line 212-Line215 claims the limitations of scaling up one BN with different values. I would like to see the results, although it sounds reasonable.
**R1**: Thank you for this insightful comment! We are deeply sorry that we failed to provide more details in our submission. Following your suggestion, we have conducted additional experiments to validate the observations mentioned in Lines 212-215. Specifically, we investigated the percentage of benign samples predicted as the target class when amplifying the learnable parameters of an individual Batch Normalization (BN) layer with scale $S$. We present the results for BN layers positioned at the beginning, middle, and end of the model architecture in Table 1 below. We draw three primary observations:
- The amplification factor for achieving effective defense varies considerably from layer to layer.
- Some attacks (e.g., WaNet and BATT) require an unreasonably large amplification factor to achieve a substantial misclassification rate.
- Amplifying only a single BN layer may not be adequate to misclassify the majority of benign samples in some cases. For instance, amplifying the first BN layer alone cannot misclassify benign samples from the Ada-patch attack into the intended target class.
**Table 1. The proportion (\%) of benign samples predicted to the target class in the CIFAR-10 dataset with the ResNet-18 model when amplifying only a single BN layer. We mark the failed cases ($<70\%$) in red.**
|Layer$\rightarrow$|1 | 1 | 1 | 1|5 | 5 | 5 | 5|15 | 15 | 15| 15|
| --- | --- | --- | --- | --- |--- | --- | --- | --- |--- | --- | --- | --- |
| Scale $\downarrow$| BadNets |WaNet | BATT| Ada-patch| BadNets |WaNet | BATT| Ada-patch| BadNets |WaNet | BATT| Ada-patch|
| 5 | 96.75 | <span style="color:red">10.50</span> | <span style="color:red">62.86</span> | <span style="color:red">0.00</span> |92.43 | 93.25 | <span style="color:red">5.04</span> | <span style="color:red">12.85</span> | <span style="color:red">11.37</span> | 99.32 | 99.13 | 76.81 |
|10| 100.00 | <span style="color:red">53.53</span>| <span style="color:red">38.81</span> | <span style="color:red">0.00</span>|100.00 | 100.00 | <span style="color:red">2.19</span> | <span style="color:red">27.40</span> |<span style="color:red">16.33</span> | 100.00 | 100.00 | 89.66 |
|100| 100.00| 100.00| 100.00| <span style="color:red">0.15</span>|100.00 | 100.00 | 99.96 | 91.56 | <span style="color:red">27.40</span> | 100.00 | 100.00 | 96.10 |
| 1000 | 100.00 | 100.00 | 100.00 | <span style="color:red">0.43</span> | 100.00 | 100.00 | 100.00 |93.99 | <span style="color:red">28.89</span> |100.00|100.00|96.45|
| 100000 | 100.00 | 100.00 | 100.00 | <span style="color:red">0.44</span> |100.00 | 100.00 | 100.00 | 94.18 | <span style="color:red">29.01</span> | 100.00 | 100.00 | 96.49 |
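For reference, the following is a minimal PyTorch sketch of how such a single-layer amplification experiment can be run. It is illustrative only (not our released code) and assumes a convolutional model with `BatchNorm2d` layers and a loader of benign images:

```python
import copy
import torch

@torch.no_grad()
def target_rate_single_bn(model, loader, layer_idx, scale, target_class, device="cpu"):
    """Scale the gamma/beta of one BN layer by `scale` and report the percentage of
    benign inputs predicted as `target_class`."""
    amplified = copy.deepcopy(model).to(device).eval()
    bn_layers = [m for m in amplified.modules() if isinstance(m, torch.nn.BatchNorm2d)]
    bn_layers[layer_idx].weight.mul_(scale)  # gamma
    bn_layers[layer_idx].bias.mul_(scale)    # beta
    hits = total = 0
    for x, _ in loader:
        preds = amplified(x.to(device)).argmax(dim=1)
        hits += (preds == target_class).sum().item()
        total += x.size(0)
    return 100.0 * hits / total
```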
---
**Q2**: It seems that the theorem in this paper is different from the one provided in SCALE-UP (with NTK), which is good. However, I am interested in why they differ, given there are connections between the input image and model parameters to the model prediction.
**R2**: Thank you for this insightful question! In general, SCALE-UP starts from the learning process of DNNs to analyze variations of the input image. In contrast, our study is based on an already-trained model, exploring variations in its parameters rather than observing from the perspective of the training process. This difference in analytical angle leads to the divergence in our proofs and assumptions.
---
**Q3**: The proposed method seems to require more storage and computational resources than SCALE-UP. Although it is not a big problem, I would like to see more comparison and analyses about it.
**R3**: Thank you for the valuable reminder. We apologize for the oversight. This analysis was initially part of our limitations but was unintentionally omitted during the document formatting process. We will correct it in the revision.
Specifically, our defense requires more memory and inference time than standard model inference without any defense. Let $M_s$ and $M_d$ denote the memory (for loading models) required by standard model inference and by our defense, respectively, and let $T_s$ and $T_d$ denote the corresponding inference times. Assuming that we adopt $n$ (e.g., $n=5$) parameter-amplified models for our defense, we have the following relation: $M_d \cdot T_d = n \times M_s \cdot T_s.$ Accordingly, users may need more GPUs to load all/some amplified models simultaneously to ensure efficiency, or they may need more time for prediction by loading those models one by one when memory is limited. In particular, the storage costs of our defense are similar to those without defense, since we can easily obtain amplified models from the standard one and therefore only need to save one model copy (e.g., the vanilla model). We will explore how to reduce those costs in our future work.
The time expenditure for SCALE-UP is nearly negligible, especially when inputs are batch-fed into the deployed model rather than being processed sequentially. Accordingly, its costs are similar to those of standard model inference. Our method indeed requires roughly $n$ times more computation than SCALE-UP, but this cost involves a trade-off between storage and time. **Users have the flexibility to exchange storage for time or vice versa, according to their specific needs and constraints** (see the illustration below).
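To make the implied trade-off concrete, here is a tiny illustrative calculation in normalized units (not part of our submission), assuming $n=5$ amplified models:

```python
# Trade-off implied by M_d * T_d = n * M_s * T_s, in normalized units with n = 5.
n, M_s, T_s = 5, 1.0, 1.0
parallel = {"memory": n * M_s, "time": T_s}    # load all amplified models at once
sequential = {"memory": M_s, "time": n * T_s}  # load and run them one by one
print(parallel, sequential)                    # both satisfy memory * time = 5
```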
---
**Q4**: There are a lot of important results in the appendix, which is nice. However, the author should at least mention them in the main content, although they could be placed in the appendix due to page limitations.
**R4**: Thank you for this constructive suggestion! In our revision, we will make sure to reference the pertinent analyses found in the appendix within the corresponding sections of the main text. For instance, in Section 5.3, which covers the ablation study, we will mention our analysis of the hyper-parameters of our method, the impact on clean models, and the analysis of detection effectiveness when scaling down BN parameters, all of which are detailed in the appendix.
---
**Q5**: It would be better if the authors could also analyze the potential limitations and future works of their method.
**R5**: Thank you for this constructive suggestion! We aim to address this oversight by including a detailed discussion of the limitations and future directions of our work in the appendix of our revision. Below, we detail the limitations and the future work.
- Firstly, our defense requires more memory and inference time than standard model inference without any defense. Specifically, let $M_s$ and $M_d$ denote the memory (for loading models) required by standard model inference and by our defense, respectively, and let $T_s$ and $T_d$ denote the corresponding inference times. Assuming that we adopt $n$ (e.g., $n=5$) parameter-amplified models for our defense, we have the following relation: $M_d \cdot T_d = n \times M_s \cdot T_s.$ Accordingly, users may need more GPUs to load all/some amplified models simultaneously to ensure efficiency, or they may need more time for prediction by loading those models one by one if memory is limited. In particular, the storage costs of our defense are similar to those without defense, since we can easily obtain amplified models from the standard one and therefore only need to save one model copy (e.g., the vanilla model). We will explore how to reduce those costs in our future work.
- Secondly, our IBD-PSC requires a few local benign samples, although their number could be small (e.g., 25). We will explore how to extend our method to the 'data-free' scenarios in future research.
- Thirdly, our method can only detect whether a suspicious testing image is malicious. Currently, our defense cannot recover the correct label of malicious samples or their trigger patterns. As such, the users can only mark and refuse to predict those samples. We will explore how to incorporate those additional functionalities in our future work.
- Fourthly, our work currently focuses only on image classification tasks. We will explore its performance on other modalities (e.g., text and audio) and tasks (e.g., detection and tracking) in our future work.
---
## Author Response (Reviewer iova)
Thank you very much for your careful review of our paper and thoughtful comments. We are encouraged by your positive comments on our **new and interesting observation**, **well explained observation and motivation**, **technically sound method**, and **effective detection**. We hope the following responses could help clarify potential misunderstandings and alleviate your concerns.
---
**Q1**: The threat model is slightly broader than typical data detection and filtering. For typical backdoor data detection, the defender downloads poisoned data rather than a backdoored model. The backdoor data detection is a form of automatic data filtering. If the experiments can demonstrate this with some training-controlled backdoor attacks, this might become a strength.
**R1**: We apologize for any misunderstanding that may have arisen.
- **Our defense focuses on real-time input sample detection during the inference phase**. Our primary concern addresses a challenge distinct from training set purification. Specifically, our focus is on scenarios where users employ third-party models. In such situations, users are not equipped to detect or eliminate backdoors within the models. Instead, they can only deploy these models and then proceed to monitor each input sample in real-time during the model's operation to identify any poisoned samples. This detection setup functions similarly to a firewall and is consistent with the paradigm established by SCALE-UP.
- **We have evaluated our defense under different types of backdoor attacks, including training-controlled attacks**. Specifically, we have evaluated the effectiveness of our method against 13 representative and advanced backdoor attack methods. These encompass poison-only backdoor attacks, including BadNets, ISSBA, LC, and NARCISSUS; training-controlled backdoor attacks such as IAD, WaNet, BPP, PhysicalBA, and BATT; as well as the model-controlled backdoor attack, SRA. It's noteworthy that **training set purification methods are typically effective only against poison-only backdoor attacks**. The specifics of each selected attack method, including their settings and performance, are detailed in Appendix E.1.
---
**Q2**: The empirical evaluation is limited. There is a concurrent work [1] in ICLR 2024 that makes a similar observation with IBD. It would be great to include some baseline methods that were evaluated in this paper. For example, CD [2], Meta-SIFT [3], and ASSET [4]. The reviewer understands that the author is not required to compare with this work [1] since it is very recent. The reviewer's rating did not consider this recent work.
**R2**: We appreciate the reviewer bringing the interesting concurrent work (MSPC) [1] to our attention. It presents intriguing observations similar to ours regarding SCALE-UP. However, it's important to clarify that **the findings related to SCALE-UP in our paper constitute only a minor component of our research scope**. Our study, while similar to the SCALE-UP framework, diverges significantly by exploring the concept from a different angle—parameter scaling. This distinction underlines a substantial difference between our approach and MSPC, which we will ensure to highlight in our related work section.
We would also like to clarify a misunderstanding regarding the application scenario of our work, which differs from the training set purification focused on by MSPC.
- **Our method requires fewer assumptions about potential adversaries**. We explore scenarios where the user employs a third-party model and necessitates a real-time detection of poisoned samples during the inference phase. This detection setup is akin to a firewall and aligns with the setting proposed in SCALE-UP. Importantly, **we do not limit our adversaries to employing poison-only attack methods, which is required by MSPC**.
- **Differences in Detection Focus**. Our defense is implemented during the inference stage, necessitating the capability for real-time detection. In contrast, training set purification methods are relatively more relaxed regarding detection time since they operate during the data collection phase.
As such, our experimental design did not originally include comparisons with training set purification methodologies.
Nonetheless, we understand the reviewer's concerns and have endeavored to apply our method within the suggested context. Following the methodology outlined in references [1,2,4], **we first train a model on a potentially compromised training set and subsequently apply our detection method to identify and filter potentially poisoned samples within that training set**.
The detection performance is presented in Table 1, which **demonstrates the effectiveness of our method in filtering training set samples across various attacks**, achieving a 100% True Positive Rate (TPR) and nearly 100% Area Under the Receiver Operating Characteristic (AUROC) while maintaining a False Positive Rate (FPR) close to 0%.
**Table 1. The performance (AUROC, TPR, FPR) of our IBD-PSC on identifying potentially poisoned samples in the training set.**
| Attack$\rightarrow$ | BadNets | BadNets | BadNets | WaNet | WaNet | WaNet | BATT | BATT | BATT |
|-------|:-------:|:-----:|-------|-------|-------|-------|-------|-------|-------|
| Defense$\downarrow$, Metric$\rightarrow$ | AUROC | TPR | FPR | AUROC | TPR | FPR | AUROC | TPR | FPR |
| MSPC| 0.980| 1.000 |0.144| 0.747 | 0.551 | 0.186 | 0.986 | 0.991 | 0.131 |
| CD | 0.875 | 0.749 | 0.018 | 0.462 | 0.045 | 0.157 | 0.232 | 0.044 | 0.138 |
| Ours | 1.000 | 1.000 | 0.066 | 0.998 | 1.000 | 0.081 | 0.994 | 1.000 | 0.079 |
**Note 1**: We only compare with MSPC and CD because:
- We failed to reproduce the results of ASSET based on its open-sourced codes.
- Meta-SIFT was designed to precisely filter out some, instead of all, benign samples. Accordingly, it performs relatively poorly in typical data detection and filtering, as also mentioned in [1].
**Note 2**: We reproduce MSPC using its open-source codes with their default settings. However, it performs relatively poorly in defending against WaNet compared to the results reported in its original paper. We speculate that this is probably because we used WaNet with its noise mode, whereas MSPC was tested on the vanilla WaNet (as mentioned in their Appendix E).
**Reference**
[1] Pal, Soumyadeep, et al. "Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction Consistency." ICLR 2024.
[2] Huang, Hanxun, et al. "Distilling Cognitive Backdoor Patterns within an Image." ICLR 2023.
[3] Zeng, Yi, et al. "META-SIFT: How to Sift Out a Clean Subset in the Presence of Data Poisoning?" USENIX Security Symposium 2023.
[4] Pan, Minzhou, et al. "ASSET: Robust Backdoor Data Detection Across a Multiplicity of Deep Learning Paradigms." USENIX Security Symposium 2023.
---
**Q3**: The discussion of recent backdoor data detection works is limited. In addition to the previous point, the following works [5,6,7,8] should be discussed and compared.
**R3**: Thanks for your comments. The works referenced as [5,6,7,8] indeed focus on the task of training set purification, a point that ties back to the misunderstanding addressed in Q2. As we have previously highlighted, **our primary investigation does not directly align with training set purification**.
In our revised submission, we plan to dedicate a section in the appendix to thoroughly compare and discuss these mentioned works. The comparative results, as illustrated in Table 1, showcase our method's superior detection efficacy, **achieving a 100% TPR and nearly 100% AUROC while maintaining an FPR close to 0%**.
**Reference**
[5] Chen, Bryant, et al. "Detecting backdoor attacks on deep neural networks by activation clustering." arXiv preprint arXiv:1811.03728 (2018).
[6] Hayase, Jonathan, et al. "Spectre: Defending against backdoor attacks using robust statistics." ICML, 2021.
[7] Chen, Weixin, Baoyuan Wu, and Haoqian Wang. "Effective backdoor defense by exploiting sensitivity of poisoned samples." NeurIPS 2022.
[8] Zeng, Yi, et al. "Rethinking the backdoor attacks' triggers: A frequency perspective." ICCV. 2021.
---
**Q4**: The threat model claims that the proposed method can work on a downloaded backdoored model. However, it is not evaluated in the experiments. It would be good to include some attacks and the pre-trained model discussed in 2.1 for Training-controlled Backdoor Attacks.
**R4**: We appreciate the opportunity to clarify a misunderstanding regarding the threat model we have employed in our study.
- **Our experimental results include training-controlled backdoor attacks (i.e., attacks that produce downloaded backdoored models), such as IAD, PhysicalBA, WaNet, BATT, and BPP**. Detailed descriptions of these attacks can be found in Appendix E.1 of our submission.
- **The results in Table 2 demonstrate the effectiveness of our method against the selected training-controlled backdoor attacks**.
**Table 2. The performance (AUROC, F1) of our IBD-PSC against training-controlled backdoor attacks on CIFAR-10.**
| Metrics $\downarrow$ Attacks $\rightarrow$| IAD | PhysicalBA | WaNet | BATT | BPP |
|-----------------|---------|------------|-------|-------|-------|
| AUROC | 0.983 | 0.972 | 0.984 | 0.999 | 0.990 |
| F1 | 0.952 | 0.942 | 0.956 | 0.966 | 0.968 |
---
**Q5**: The poisoning rate experiment only shows a minimum of 2%; can the proposed method work on even lower poisoning rates? Such as 1% and 0.5%?
**R5**: We extend our sincere gratitude to the reviewer for their insightful comments. Initially, our decision to test down to a 2% poisoning rate was influenced by observations that some attacks exhibit significantly lower attack success rates (ASRs) at reduced poisoning rates. For instance, the WaNet attack demonstrates an ASR of merely 13.10% at a 0.5% poisoning rate.
Following the reviewer's valuable suggestion, we have conducted further tests at lower poisoning rates (0.5% and 1%) to assess the detection performance of our method. We focus on three representative attacks: BadNets, WaNet, and BATT, and show the result in Tables 3-5. It is evident that **our method maintains effectiveness under low poisoning rates**, achieving AUROC and F1 scores well above 0.9, even at a poisoning rate as low as 0.5%.
**Table 3. The performance (AUROC, F1) of our IBD-PSC in defending against BadNets with different poisoning rates ($\rho$).**
| $\rho$ (%) | 0.5 | 1 | 2 | 4 | 6 |8 | 10 |
|-----|-------|-------|-------|-----|-----|-------|-------|
|ASR|1.000 | 1.000 | 1.000 | 1.000|1.000 | 1.000 | 1.000 |
|AUROC| 0.955|0.950|0.999 |0.998|0.999|1.000|1.000|
|F1|0.950|0.951|0.928|0.912|0.961 |0.981| 0.967|
**Table 4. The performance (AUROC, F1) of our IBD-PSC in defending against WaNet with different poisoning rates ($\rho$).**
| $\rho$ (%) | 0.5 | 1 | 2 | 4 | 6 |8 | 10 |
|-----|-------|-------|-------|-----|-----|-------|-------|
|ASR|0.131 | 0.356 | 0.968 | 0.972|0.994 | 0.996 | 0.997 |
|AUROC| -|-|0.966 |0.977|0.978|0.983|0.984|
|F1|-|-|0.944|0.959|0.955|0.960|0.956|
**Table 5. The performance (AUROC, F1) of our IBD-PSC in defending against BATT with different poisoning rates ($\rho$).**
| $\rho$ (%) | 0.5 | 1 | 2 | 4 | 6 |8 | 10 |
|-----|-------|-------|-------|-----|-----|-------|-------|
|ASR|0.961 | 0.993 | 0.999 | 1.000| 1.000 | 1.000 | 1.000 |
|AUROC| 0.913|0.948|0.999 |1.000|0.998|0.985|0.999|
|F1|0.927|0.948|0.972|0.967|0.942|0.958|0.979|
---
**Q6**: While the detection performance looks impressive, it could be more convincing to include an experiment to remove a certain percentage of the data to retrain a model and then report the ASR on this retrained model.
**R6**: We appreciate the reviewer's perspective and understand that the question might stem from a misunderstanding, as previously discussed in response to Q2.
While our initial focus diverges from the scenario of training set purification, we acknowledge the critical importance of this scenario and recognize that our method is also applicable to it. Accordingly, **we have conducted additional experiments where we first detect and remove suspected poisoned samples from the training set. Subsequently, we retrain the model on this purified dataset to evaluate both its accuracy on benign samples (BA) and the attack success rate (ASR)**.
We conduct experiments on the CIFAR-10 dataset against three representative attacks, and the results are presented in Table 6 below. We observe that **the ASR scores of these retrained models are less than 0.5%, thereby rendering these backdoor attacks ineffective**.
**Table 6. Effect of retraining models without poisoned samples identified by our IBD-PSC.**
| Attacks$\downarrow$, Metrics$\rightarrow$| ASR | BA | Number of Removed Samples |
|-----------------|-------|-------|--------|
| BadNets | 0.005 | 0.893 | 7985 |
| WaNet | 0.002 | 0.878 | 9119 |
| BATT | 0.009 | 0.860 | 8536 |
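For completeness, a hedged sketch of the filtering step described above is shown below; `score_fn` is a hypothetical stand-in for a per-sample suspiciousness detector, and the retraining itself is an ordinary training loop (omitted):

```python
from typing import Callable
import torch
from torch.utils.data import Dataset, Subset

def purify_training_set(train_set: Dataset,
                        score_fn: Callable[[torch.Tensor], float],
                        threshold: float = 0.9) -> Subset:
    """Keep only the samples whose suspiciousness score is below `threshold`."""
    keep = [i for i, (x, _) in enumerate(train_set) if score_fn(x) < threshold]
    return Subset(train_set, keep)

# The purified subset is then used to retrain the model from scratch with a
# standard training loop, after which BA and ASR are measured on the test set.
```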
---
## Author Response (Reviewer soFj)
Thank you very much for your careful review of our paper and thoughtful comments. We are encouraged by your positive comments on our **empirically justified idea**, **superiority**, and **extensive appendices**. We hope the following responses could help clarify potential misunderstandings and alleviate your concerns.
---
**Q1**: There are several typos throughout the paper. Some of these typos make it hard to understand the underlying meaning (see Questions).
**R1**: We are grateful for the opportunity to enhance our work and appreciate your assistance in improving the quality and clarity of our submission. We sincerely apologize for any confusion these mistakes may have caused.
In response to your valuable feedback, we are committed to conducting a thorough review of our manuscript to rectify these errors. For instance, we will clarify that the batch norm features mentioned in line 198 pertain to benign samples, and we will correct the poisoning rate range from 0.02 to 1 to the accurate range of 0.02 to 0.1, as mentioned in line 382.
---
**Q2**: The paper shows experiments for only one architecture, ResNet-18. As the defense works by modifying the architecture, it would be prudent to show that this defense works with different architectures, or even in different domains than images.
**R2**: We sincerely apologize for any confusion our presentation may have caused.
- **Due to space constraints within the main text, we have included the detection results for other model architectures (i.e., PreactResNet18 and MobileNet) in Appendix M**, with the detailed outcomes presented in Table A8.
- To address your concerns, we have extracted the results for three representative attacks (BadNets, WaNet, and BATT) in Table 1 below. As illustrated, the majority of the average AUROC and F1 scores on both the PreactResNet18 and MobileNet architectures exceed 0.93. **The consistency in performance signifies that our method is effective on different model architectures**. Similar efficacy is observed on other datasets and against different attacks; the comprehensive results are provided in Table A8.
- We are grateful for your recommendation about extending our work to domains other than images. We mainly focused on the image domain simply because most baseline defenses were designed for it. We will explore how to extend our method to other important domains (such as text, audio, and graph) in our future works.
**Table 1. The performance (AUROC, F1) of our defense with other model architectures on CIFAR-10 against representative attacks.**
|Attacks $\downarrow$ Models $\rightarrow$|PreactResNet18|MobileNet| Avg. |
|-|-|-|-|
|BadNets|0.978 / 0.931 | 0.970 / 0.943 | 0.974 / 0.937|
|WaNet|0.977 / 0.949 | 0.937 / 0.940 |0.957 / 0.945|
|BATT|0.972 / 0.958| 0.951 / 0.953| 0.962 / 0.956|
---
**Q3**: The claimed high-efficiency of IBD-PSC is only supported by measuring the inference time of different models. However, as I understand it, this is heavily dependent on how many scaled models are chosen by the defender, and N scaled up models would multiply inference computation by N times. It would be good to explicitly discuss this if that is the case.
**R3**: Thank you for your understanding and patience. Below, we detail the analysis of time and memory consumption.
Our defense requires more memory and inference time than standard model inference without any defense. Specifically, let $M_s$ and $M_d$ denote the memory (for loading models) required by standard model inference and by our defense, respectively, and let $T_s$ and $T_d$ denote the corresponding inference times. Assuming that we adopt $n$ (e.g., $n=5$) parameter-amplified models for our defense, we have the following relation: **$M_d \cdot T_d = n \times M_s \cdot T_s.$** Accordingly, users may need more GPUs to load all/some amplified models simultaneously to ensure efficiency, or they may need more time for prediction by loading those models one by one when memory is limited. In particular, the storage costs of our defense are similar to those without defense, since we can easily obtain amplified models from the standard one and therefore only need to save one model copy (e.g., the vanilla model). We will explore how to reduce those costs in our future work.
We will discuss this limitation more thoroughly in the revision of our paper.
---
**Q4**: How is "confidence" defined in this paper / calculated in the output?
**R4**: We apologize for any ambiguity in our initial submission regarding the definition of "confidence". In our paper, **"confidence" refers to the predicted probability assigned to an input sample for a specified label**. For instance, if an image of a "cat" is predicted as the "cat" label with a probability of 0.9, then the "confidence" of the input under the "cat" label is 0.9.
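For illustration, a minimal PyTorch snippet of this definition (the logits below are made up for the example):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # made-up model outputs for one image
probs = F.softmax(logits, dim=1)           # predicted probabilities over classes
confidence, label = probs.max(dim=1)       # confidence = probability of the predicted label
print(label.item(), round(confidence.item(), 3))
```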
<!-- "confidence" 就是the predicted probability of a input sample on the specified predicted label by DNN model. 比如,输入一张图像"cat", model将该输入以0.9的预测概率预测到cat label上,那input 在cat 类别上的"confidence"就是0.9. -->
---
**Q5**: The second proposition of Theorem 3.1 states "2) there exists a constant M that is independent of b, such that when b is scaled up to $||b^2||>M$, it will be classified into the attacked target class instead of its ground-truth one." What is "it"? What will be classified? What is the "target class" and what is the "ground-truth" class?
**R5**: We are grateful for your valuable comment and apologize for any ambiguity present in our initial submission.
We emphasize that Theorem 3.1 aims to demonstrate that, within a poisoned model, we can amplify the learnable parameters of the Batch Normalization (BN) layers to misclassify any sample into the attacker-specified target class. Specifically:
- **'it' refers to the benign or poisoned sample**.
- **The 'target class' is defined as the category that the attacker aims to misclassify input samples into via the poisoning attack**. For instance, if an attacker aims to misclassify any input as "administrator", then the "administrator" class would be the designated target class.
- **The 'ground-truth' class indicates the true label (instead of its prediction) of a sample**. For example, if an image of a cat is predicted as a dog, its ground-truth class is 'cat' while its prediction is 'dog'.
- To avoid potential misunderstandings, we will revise this theorem as follows:
**Theorem 3.1. Let $F=f_L\circ\dots\circ f_1$ be a backdoored DNN with $L$ layers and target class $t$. Let $x$ be an input and $b=f_l\circ\cdots\circ f_1(x)$ be its batch-normalized feature after the $l$-th layer ($1\leq l\leq L$). Assume that $b$ follows a mixture of Gaussian distributions. Then the following two statements hold: (1) Amplifying the $\beta$ and $\gamma$ parameters of the $l$-th layer can make $\Vert b \Vert_2$ arbitrarily large, and (2) There exists a positive constant $M$ that is independent of $\hat{b}$, such that whenever $\hat{b}$ (i.e., the amplified version of $b$) satisfies $\Vert \hat{b} \Vert_2 > M$, then $\arg \max f_L\circ\dots\circ f_{l+1}(\hat{b})=t$, even when $\arg \max f_L\circ\dots\circ f_{l+1}(b)\not=t$.**
---
**Q6**: In the last part of Sec 3, what does this sentence mean? "...amplifying multiple BN layers with a small factor (e.g., 1.5) can also accumulate increasing to the feature norm considerably in the last pre-FC layer, and is more stable and robust across different settings."
**R6**: We are deeply thankful for your insightful comment.
Theorem 3.1 illustrates that amplifying the parameters of a BN layer leads to an increase in the feature norm in the last pre-FC layer, subsequently causing benign samples to be misclassified into the target class. This principle underpins our method. However, as shown in Table 2 below,
- For each attack, **the amplification factor required to achieve an effective defense may vary considerably**, even if amplification is performed at different BN layers.
- **Certain attacks, such as WaNet and BATT, require an unreasonably large amplification factor to achieve a substantial misclassification rate**.
- **Amplifying a single BN layer may not be adequate to misclassify the majority of benign samples across different attacks**. For instance, amplifying the first BN layer alone cannot misclassify benign samples from the Ada-patch attack into the intended target class.
For these reasons, we claimed that 'amplifying only a single BN layer may require an unreasonably large amplification factor, and is unstable across different attacks or even BN layers'.
To address these problems, **we decided to amplify multiple BN layers, which allows us to use a fixed and small scaling factor across different settings**. This makes our method more stable and robust across different settings. It is motivated by the understanding that amplifying multiple consecutive BN layers simultaneously boosts the feature norm more significantly due to the cumulative layer amplification effect. Figure 3 in our submission illustrates that even a modest amplification factor, such as 1.5, can significantly increase the feature norm and, therefore, cause benign samples to be predicted as the target class (see also the sketch after Table 2).
We will add more explanations in Section 3 and the appendix of our revision to make this clearer.
**Table 2. The proportion (\%) of benign samples predicted to the target class in the CIFAR-10 dataset with the ResNet-18 model when amplifying only a single BN layer. We mark the failed cases ($<70\%$) in red.**
|Layer$\rightarrow$|1 | 1 | 1 | 1|5 | 5 | 5 | 5|15 | 15 | 15| 15|
| --- | --- | --- | --- | --- |--- | --- | --- | --- |--- | --- | --- | --- |
| Scale$\downarrow$| BadNets |WaNet | BATT| Ada-patch| BadNets |WaNet | BATT| Ada-patch| BadNets |WaNet | BATT| Ada-patch|
| 5 | 96.75 | <span style="color:red">10.50</span> | <span style="color:red">62.86</span> | <span style="color:red">0.00</span> |92.43 | 93.25 | <span style="color:red">5.04</span> | <span style="color:red">12.85</span> | <span style="color:red">11.37</span> | 99.32 | 99.13 | 76.81 |
|10| 100.00 | <span style="color:red">53.53</span>| <span style="color:red">38.81</span> | <span style="color:red">0.00</span>|100.00 | 100.00 | <span style="color:red">2.19</span> | <span style="color:red">27.40</span> |<span style="color:red">16.33</span> | 100.00 | 100.00 | 89.66 |
|100| 100.00| 100.00| 100.00| <span style="color:red">0.15</span>|100.00 | 100.00 | 99.96 | 91.56 | <span style="color:red">27.40</span> | 100.00 | 100.00 | 96.10 |
| 1000 | 100.00 | 100.00 | 100.00 | <span style="color:red">0.43</span> | 100.00 | 100.00 | 100.00 |93.99 | <span style="color:red">28.89</span> |100.00|100.00|96.45|
| 100000 | 100.00 | 100.00 | 100.00 | <span style="color:red">0.44</span> |100.00 | 100.00 | 100.00 | 94.18 | <span style="color:red">29.01</span> | 100.00 | 100.00 | 96.49 |
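As referenced above, here is a minimal PyTorch sketch of how the cumulative feature-norm effect can be checked. It is illustrative only (not our released code) and assumes a torchvision-style ResNet-18 whose classifier head is named `fc`:

```python
import copy
import torch

@torch.no_grad()
def prefc_feature_norm(model, x, omega=1.0, num_bn_to_scale=0):
    """Mean L2 norm of the pre-FC feature for a batch `x`, after scaling the gamma/beta
    of the last `num_bn_to_scale` BN layers by `omega` (0 means no amplification)."""
    m = copy.deepcopy(model).eval()
    bn = [mod for mod in m.modules() if isinstance(mod, torch.nn.BatchNorm2d)]
    for layer in (bn[-num_bn_to_scale:] if num_bn_to_scale > 0 else []):
        layer.weight.mul_(omega)  # gamma
        layer.bias.mul_(omega)    # beta
    feats = {}
    m.fc.register_forward_hook(lambda mod, inp, out: feats.update(z=inp[0]))
    m(x)
    return feats["z"].norm(dim=1).mean().item()

# e.g., compare prefc_feature_norm(model, x) with prefc_feature_norm(model, x, omega=1.5, num_bn_to_scale=5)
```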
---
**Q7**: Sec. 5.4 says that experiments are conducted with sampling rate 0.02 to 1, but in the associated table, only 0.2 to 1 is shown. Is this a typo? Regardless, I would like to see results where the sampling rate is 0.02.
**R7**: We are thankful for the valuable comment. There is indeed a typo in Section 5.4: the correct range of poisoning rates explored should be from 0.02 to 0.1, instead of 0.02 to 1.
Additionally, we would like to clarify that the detection performance across different poisoning rates is depicted in Figure 7 (instead of Table 4). To address the reviewer's concern more specifically, we have detailed the results in Table 3 below. It is evident from this table that, **even at a poisoning rate as low as 0.02, our method attains detection performance with both AUROC and F1 scores surpassing 0.9**.
**Table 3. The performance (AUROC, F1) of our IBD-PSC in defending against various backdoor attacks with a poisoning rate of 0.02.**
|Attack| ASR | AUROC | F1 |
|-------|-------|-------|-------|
|BadNets|1.000 | 0.999 | 0.928 |
| WaNet | 0.968 | 0.966 |0.944|
| BATT | 0.999 | 0.999 |0.972|
---
**Q8**: In Sec 5.4, the adaptive attack that is designed to make a poisoned input classify to the correct output when subjected to model parameter amplification seems misaligned with what the defense is doing. Would it be more appropriate for the adaptive attack to specifically just decrease the confidence of the targeted class when model parameter amplification occurs?
**R8**: Thank you for highlighting this important aspect!
We do understand your concern about the resistance of our defense to attacks that are directly designed for our method.
- There may be a misunderstanding here: **our adaptive attack specifically aims to ensure that benign, rather than poisoned, samples are correctly classified** even when the parameters are amplified. Specifically, we designed an adaptive loss term $\mathcal{L}_{\textrm{ada}} = \sum_{i=1}^{|\mathcal{D}_b|} \mathcal{L}(\hat{\mathcal{F}_k^{\omega}}(\pmb{x}_{i};\hat{\pmb{\theta}}),y_i)$ in Eq.(5), where $\pmb{x}_{i}$ represents a benign sample.
- The loss term $\mathcal{L}_{\textrm{ada}}$ reduces the prediction discrepancy between benign and poisoned samples during amplification, which directly **contradicts the core assumption of our proposed IBD-PSC defense**: that clean samples would be misclassified when model parameters are amplified, whereas poisoned samples would not. As such, **it serves as an adaptive attack to our defense**.
- As shown in the following Table 4, **our method remains effective even under this adaptive attack**. We argue that the effectiveness primarily originates from our adaptive layer selection strategy, which dynamically identifies BN layers for amplification, regardless of whether it is a vanilla or an adaptive backdoored model. The layers selected during the inference stage typically differ from those used in the training phase, enabling the IBD-PSC to effectively detect poisoned samples.
**Table 4. The performance (AUROC, F1) of IBD-PSC under adaptive attacks designed in our submission.**
| $\alpha$$\rightarrow$ | 0.2 |0.5 | 0.9 | 0.99|
|------------------------|-------------|----------|-------------|-------------|
| Attacks$\downarrow$ | AUROC / F1 | AUROC /F1 | AUROC/ F1| AUROC / F1 |
| BadNets | 0.992/0.978 | 0.986 / 0.964 | 0.995 / 0.962 | 0.996 / 0.951 |
| WaNet | 0.947 / 0.949 | 0.956 / 0.942 | 0.931 /0.927 | 0.819 / 0.862 |
| BATT | 0.986 / 0.968 | 0.994 / 0.956 | 0.982 /0.975 | 0.979 / 0.959 |
- **As you suggested, we can also design another adaptive attack by decreasing the confidence with which parameter-amplified models predict poisoned samples into the targeted class**. To further alleviate your concern, we design another adaptive loss term $\mathcal{L}'_{\textrm{ada}}= \sum_{j=1}^{|\mathcal{D}_p|} \mathcal{L}(\hat{\mathcal{F}_k^{\omega}} (\pmb{x}_{j};\hat{\pmb{\theta}}),\hat{y}_j)$, where $\hat{y}_j$ represents the label-smoothing [1] form of the target class $t$, defined as: $$\hat{y}_{j,c} = \begin{cases} 1 - \zeta & \text{if } c = t \\ \frac{\zeta}{C-1} & \text{otherwise}. \end{cases}$$
Here, $\zeta$ is set to 0.2, specifically chosen to lower the confidence with which poisoned samples are classified into the target class. The term $C$ denotes the total number of classes, $|\mathcal{D}_p|$ represents the number of poisoned samples in the training set, and $\pmb{x}_{j}$ denotes a poisoned sample.
$\mathcal{L}'_{\textrm{ada}}$ aims to decrease the confidence of poisoned samples when model parameter amplification occurs, aligning with the reviewer's suggestion. We integrate this adaptive loss term $\mathcal{L}'_{\textrm{ada}}$ with the vanilla backdoor loss $\mathcal{L}_{\textrm{bd}}$ to formulate the overall loss function as $\mathcal{L} = \alpha' \mathcal{L}_{\textrm{bd}} + (1-\alpha') \mathcal{L}'_{\textrm{ada}},$ where $\alpha'$ is a weighting factor.
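For clarity, a minimal PyTorch sketch of the smoothed target and the corresponding loss term for one poisoned sample (illustrative only, not our training code):

```python
import torch
import torch.nn.functional as F

def smoothed_target(target_class: int, num_classes: int, zeta: float = 0.2) -> torch.Tensor:
    """Soft label: 1 - zeta on the target class, zeta / (C - 1) on every other class."""
    y = torch.full((num_classes,), zeta / (num_classes - 1))
    y[target_class] = 1.0 - zeta
    return y

def smoothed_ce(logits: torch.Tensor, target_class: int, zeta: float = 0.2) -> torch.Tensor:
    """Cross-entropy of one poisoned sample's logits against the smoothed target."""
    y = smoothed_target(target_class, logits.size(-1), zeta).to(logits.device)
    return -(y * F.log_softmax(logits, dim=-1)).sum()
```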
We conduct additional experiments under the same settings as those used in our Section 5.4. As shown in the following Table 5,
- **Decreasing the confidence of poisoned samples significantly reduces the accuracy of backdoored models on benign samples (BA)**. This is because poisoned samples retain a considerable portion of benign features. When confidence scores are reduced, it becomes more difficult for models to learn these benign features, which are generally harder to capture than trigger features. This is further illustrated by the almost unaffected BA of the WaNet attack, since WaNet distorts the entire image, leaving few unmodified pixels in the poisoned images.
- **Our method remains effective even under this new adaptive attack**. As analyzed in the previous discussion on adaptive attacks, the effectiveness of IBD-PSC largely stems from our adaptive layer selection strategy. This strategy dynamically identifies BN layers for amplification, regardless of whether the model is a vanilla or an adaptively backdoored model, ensuring the robustness of our defense mechanism across various scenarios.
Note: **We also experimented with an adaptive attack variant that directly classifies poisoned samples to their ground-truth labels under parameter amplification.** However, this approach imposed too strong a regularization, preventing us from obtaining a backdoored model with an acceptable Attack Success Rate (ASR). This phenomenon aligns with our Theorem 3.1, which suggests that deliberate parameter amplification pushes any sample toward the target class. Consequently, **forcing poisoned samples to their ground-truth class under amplification during training could result in attack failure**.
**Table 5. The effectiveness of the adaptive attack suggested by the reviewer and the performance (AUROC, F1) of IBD-PSC against the adaptive attack on CIFAR-10. We mark the failed cases (where $BA<70\%$) in red, given that the accuracy of models unaffected by backdoor attacks on clean samples is 94.40%.**
| $\alpha'$$\rightarrow$ |0.01 | 0.01 |0.1|0.1| 0.5 | 0.5 |
|------------------------|-------------|----------|-------------|-------------|-------------|-------------|
| Attacks$\downarrow$ Metrics $\rightarrow$ | BA / ASR | AUROC /F1 | BA / ASR | AUROC / F1 |BA / ASR |AUROC/ F1|
| BadNets|0.832 / 0.887 |0.877 / 0.924|0.802 / 0.874 |0.874 / 0.861| <font color="red">0.101</font>/ 0.997 |- / -|
| WaNet |0.909 / 0.999 | 0.999 / 0.956| 0.871 / 0.992 | 0.985 / 0.934 |0.852 / 0.891|0.887 / 0.895|
| BATT |0.745 / 0.997 |0.996 / 0.982 | <font color="red">0.648</font> / 0.998|- / - |<font color="red">0.463</font> / 0.994 |- / -|
**Reference**
[1] Müller, Rafael, Simon Kornblith, and Geoffrey E. Hinton. "When does label smoothing help?." Advances in neural information processing systems 32 (2019).
---
**Q9**: I did not see a limitations section, but there should probably be one. There should be discussion that this defense requires batch normalization layers to be used in the architecture. Also, there should be some note or analysis that this defense increases inference cost by N times, where N is the number of amplified models chosen by the defender to detect poisoned inputs.
**R9**: Thank you for this constructive suggestion! We aim to address this oversight by including a detailed discussion of the limitations and future directions of our work in the appendix of our revision. Below, we detail the limitations and the future work.
- Firstly, our defense requires more memory and inference time than standard model inference without any defense. Specifically, let $M_s$ and $M_d$ denote the memory (for loading models) required by standard model inference and by our defense, respectively, and let $T_s$ and $T_d$ denote the corresponding inference times. Assuming that we adopt $n$ (e.g., $n=5$) parameter-amplified models for our defense, we have the following relation: $M_d \cdot T_d = n \times M_s \cdot T_s.$ Accordingly, users may need more GPUs to load all/some amplified models simultaneously to ensure efficiency, or they may need more time for prediction by loading those models one by one if memory is limited. In particular, the storage costs of our defense are similar to those without defense, since we can easily obtain amplified models from the standard one and therefore only need to save one model copy (e.g., the vanilla model). We will explore how to reduce those costs in our future work.
- Secondly, our IBD-PSC requires a few local benign samples, although their number could be small (e.g., 25). We will explore how to extend our method to the 'data-free' scenarios in future research.
- Thirdly, our method can only detect whether a suspicious testing image is malicious. Currently, our defense cannot recover the correct label of malicious samples or their trigger patterns. As such, the users can only mark and refuse to predict those samples. We will explore how to incorporate those additional functionalities in our future work.
- Fourthly, our work currently focuses only on image classification tasks. We will explore its performance on other modalities (e.g., text and audio) and tasks (e.g., detection and tracking) in our future work.
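To illustrate the storage point in the first limitation above, here is a minimal sketch (our own illustration; `model_ctor`, the checkpoint path, and the amplification factors are placeholders, and the copies shown here differ only in a single scaling factor rather than following the exact construction in the paper) of deriving the $n$ amplified models in memory from a single stored checkpoint:

```python
import copy
import torch


def load_amplified_models(checkpoint_path, model_ctor,
                          factors=(1.25, 1.5, 1.75, 2.0, 2.25)):
    """Only one checkpoint is stored on disk; the n parameter-amplified copies
    used at inference time are derived in memory on the fly."""
    base = model_ctor()
    base.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    base.eval()

    amplified = []
    for factor in factors:
        m = copy.deepcopy(base)
        for layer in m.modules():
            if isinstance(layer, torch.nn.BatchNorm2d):
                layer.weight.data.mul_(factor)  # illustrative amplification
                layer.bias.data.mul_(factor)
        amplified.append(m.eval())
    return base, amplified
```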
---
**Reviewer soFj Response to Authors**
Dear Reviewer soFj, thank you for your recognition of our paper! We also deeply and sincerely thank you for your valuable time and comments on our paper. Your efforts greatly help us to improve our paper. We hope the following responses can further alleviate your remaining concerns.
---
**Q1**: Also, thank you for clarifying that the computation for this defense is $n$ times the memory or time of the original model inference. This does not seem efficient to me. It would be helpful to justify the claimed efficiency in these terms compared to other relevant defenses.
**R1**: Thank you for this constructive suggestion! We are deeply sorry that we failed to explain this more clearly.
- **In our paper, 'efficiency' refers to the running/inference time** (rather than the memory requirements), as specified on Page 7 (Lines 365-372). The running time is critical for this task (i.e., detecting poisoned testing images) because the detector is usually deployed as a 'firewall' for online inference.
- We also respectfully note that **we calculated the inference time of all methods under *identical* and *ideal* conditions for evaluating efficiency**. For example, we assume that defenders will load all required models and images simultaneously (with more memory requirements compared to the vanilla model inference). This comparison is relatively fair and reasonable since different defenses differ greatly in their mechanisms and requirements.
- To verify our efficiency, we also summarize the results from Figure 5 in Table 1 below. As shown in this table, **the efficiency of our IBD-PSC is on par with or even better than all baseline defenses**. Even compared to no defense, the extra time is negligible.
- However, as we admitted in our previous limitation analyses, our method is more costly in terms of memory if we need to ensure efficiency. We will discuss how to alleviate this in future work.
We will also include more discussions in the main experiments and appendix of our revision to avoid potential misunderstandings.
**Table 1: The inference time on CIFAR-10.**
| Defenses | Time (s) |
|------------|----------|
| No Defense | 0.005 |
| STRIP | 0.065 |
| TeCo | 0.560 |
| SCALE-UP | 0.021 |
| Ours | 0.017 |
---
**Q2**: For Q6, I still think there may be a typo in that sentence. Is both the word "accumulate" and "increasing" supposed to be in that sentence? Otherwise, I appreciate the explanation of what it means.
**R2**: Thank you for your careful review and for bringing this to our attention! In our revision, we will replace the original sentence segment with the following one for correction:
'amplifying multiple BN layers with a small factor (e.g., 1.5) can also significantly increase the feature norm in the last pre-FC layer and is more stable and robust across different settings.'
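To make the corrected sentence concrete, the following self-contained sketch (our own illustration; the ResNet-18 architecture, the hook on `avgpool`, the choice of the last four BN layers, and the factor 1.5 are all assumptions for demonstration) measures the pre-FC feature norm before and after amplifying a few BN layers:

```python
import copy
import torch
import torchvision


def pre_fc_feature_norm(model, x):
    """Average L2 norm of the features feeding the final fully-connected layer.

    For torchvision ResNets, these are the (flattened) outputs of `avgpool`.
    """
    feats = []
    handle = model.avgpool.register_forward_hook(
        lambda module, inp, out: feats.append(out.flatten(1)))
    with torch.no_grad():
        model(x)
    handle.remove()
    return feats[0].norm(dim=1).mean().item()


model = torchvision.models.resnet18(num_classes=10).eval()
x = torch.randn(8, 3, 32, 32)  # an illustrative batch of inputs

amplified = copy.deepcopy(model)
bn_layers = [m for m in amplified.modules() if isinstance(m, torch.nn.BatchNorm2d)]
for bn in bn_layers[-4:]:           # amplify several of the last BN layers
    bn.weight.data.mul_(1.5)        # small factor, as in the corrected sentence
    bn.bias.data.mul_(1.5)

print("pre-FC feature norm (original) :", pre_fc_feature_norm(model, x))
print("pre-FC feature norm (amplified):", pre_fc_feature_norm(amplified, x))
```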
---
**Q3**: As for Q8, I think I did misunderstand the adaptive attack. However, this clarified version still seems misaligned. Is it the case that the adaptive attack now modifies all the benign samples as well? I would think an adaptive attack would only affect the classification behavior of the poisoned samples, as that is what the attacker previously controlled. For this reason, I refer back to my original review: Would it be more appropriate for the adaptive attack to specifically just decrease the confidence of the targeted class when model parameter amplification occurs?
**R3**: Thank you for this insightful question! We are deeply sorry that our submission may have led to some misunderstandings, which we would like to clarify here.
- We admit that the adaptive attack designed in our submission **needs to control the behavior of benign samples** (not only poisoned samples) **to some extent**. However, we respectfully note that **we modified the training loss rather than the samples themselves in our adaptive attack**.
- We respectfully note that we focus on detecting poisoned testing samples where the deployed model comes from a third party (e.g., a malicious model zoo). As such, **adversaries are allowed to modify all samples (including benign ones) as well as the training loss** in adaptive attacks.
---
**Q4**: The following experiment attempting to do this is helpful in this regard, but I do not understand why this statement is true: Decreasing the confidence of poisoned samples significantly reduces the accuracy of backdoored models on benign samples (BA). Using this reasoning, I would think that more confident poison samples would hurt BA more since the samples are pushing the model more to classify the retained benign features to the wrong class.
**R4**: Thank you for highlighting this interesting phenomenon. We are deeply sorry that we failed to explain its potential causes more clearly and that our submission may have led to some misunderstandings.
- We respectfully note that **this sentence is a description of our experimental results rather than a theoretical claim**.
- We argue that this phenomenon arises mostly because **DNNs connect both benign features and trigger features to the target class when label smoothing is applied**, although the connection between trigger features and the target class remains stronger. Specifically, attacked models tend to overfit the trigger features when we place full confidence (e.g., 1) on poisoned samples (as in vanilla backdoor attacks). After label smoothing, however, the attacked models tend to rely on both trigger and benign features, since the smoothed task is more complicated and harder to fit well. As a result, the adaptive-attacked model prefers to predict benign samples as the target class and therefore has relatively low benign accuracy. (A brief sketch after Table 2 illustrates the smoothed targets.)
- To further verify this, we calculated the distribution of misclassified benign samples. As shown in the following Table 2, **almost all misclassified benign samples are predicted as the target label** (i.e., 0) rather than other classes.
- We will provide more details in the appendix of our revision. We will also further explore this intriguing phenomenon in our future work.
**Table 2: The proportion (\%) of misclassified benign samples assigned to each class by the model. In our cases, the target label is 0.**
| Attacks $\downarrow$ Labels $\rightarrow$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------------------|--------|-------|--------|--------|--------|--------|-------|-------|-------|-------|
| BadNets (Original) | 14.68 | 5.71 | 11.09 | 22.02 | 10.77 | 14.19 | 6.36 | 3.92 | 5.55 | 5.71 |
| BadNets (Adaptive) | 99.57 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| BATT (Original) | 10.81 | 5.50 | 13.18 | 22.27 | 10.62 | 13.18 | 6.23 | 5.88 | 5.78 | 6.54 |
| BATT (Adaptive) | 94.67 | 0.02 | 0.30 | 3.84 | 0.91 | 0.09 | 0.06 | 0.00 | 0.02 | 0.09 |
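As referenced above, here is a minimal sketch of the label-smoothing-style targets for poisoned samples in this adaptive setting (our own illustration; `smoothed_target`, `epsilon`, and the example values are placeholders, not the exact configuration used in our experiments):

```python
import torch
import torch.nn.functional as F


def smoothed_target(y_target, num_classes, epsilon):
    """Soft target for the backdoor class: the target class keeps most of the
    probability mass (1 - epsilon + epsilon / num_classes) and the remainder
    is spread uniformly over the other classes."""
    one_hot = F.one_hot(y_target, num_classes).float()
    return one_hot * (1.0 - epsilon) + epsilon / num_classes


# Vanilla backdoor training uses (close to) full confidence on the target class;
# the adaptive variant lowers it, which is where the BA drop discussed above appears.
y = torch.tensor([0, 0, 0])  # target label 0, as in Table 2
print(smoothed_target(y, num_classes=10, epsilon=0.4))
```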
---
Dear Reviewer iova, thank you very much for your valuable time and careful review of our paper. We hope the following responses can further alleviate your remaining concerns.
---
**Q1**: It seems that there is a discrepancy with the performance reported in the R2 experiment compared to the number reported in CD. For example, for the BadNets and WaNet. Can the author further explain the potential cause?
**R1**: Thank you for bringing this discrepancy to our attention! After more in-depth explorations, we believe that it is due to two main reasons, as follows.
- **The performance of CD is not stable with different random seeds**, as shown in the following Table 1. This seems to be a common problem for almost all trigger inversion methods like CD, as different trigger initializations can have a large impact on the results, especially if the real trigger is a small patch. We didn't notice this in our previous rebuttal because we just ran the experiment once (using the open-sourced toolbox [backdoor-toolbox](https://github.com/vtu81/backdoor-toolbox)). We will report the mean results of CD in the appendix of our revision.
- **We use different experimental settings for WaNet than the CD paper**. For example, CD was tested on the vanilla WaNet (without the noise mode) and with a different grid size, as shown in its appendix.
**Table 1. The performance of CD against three representative attacks with different random seeds on CIFAR-10.**
| Attack$\rightarrow$ | BadNets|BadNets | BadNets| WaNet|WaNet|WaNet|BATT|BATT|BATT|
|-------|:-------:|:-----:|:-----:|:-------:|:-----:|:-----:|:-------:|:-----:|:-----:|
|Seed$\downarrow$, Metric$\rightarrow$| AUROC | TPR | FPR |AUROC | TPR | FPR |AUROC | TPR | FPR |
|0|0.875|0.749| 0.018|0.462 | 0.045 | 0.157|0.232|0.044| 0.138|
|888|0.980 |0.895 |0.052|0.710| 0.303|0.121 | 0.767|0.403|0.117|
|999|0.923 |0.700 | 0.086 |0.663 |0.269|0.120| 0.712|0.307|0.120|
|Mean (Std)|0.926 (0.053)|0.781 (0.101)|0.052 (0.033)|0.612 (0.132) | 0.206 (0.140)|0.133 (0.021) |0.570 (0.294) | 0.251 (0.186) | 0.125 (0.011)
---
**Q2**: For Table 4, can the author also report the detection performance for 0.5% and 1%? 10 - 30% ASR is still a concerning threat.
**R2**: Thank you for these comments and we do understand your concern.
- We respectfully note that **we calculated the ASR over all poisoned testing samples (instead of only those not from the target class)** in our previous rebuttal; a short sketch after Table 2 below illustrates the two conventions. The recomputed ASRs of WaNet (excluding testing samples from the target class) are 3.9% and 28.6% under poisoning rates of 0.5% and 1%, respectively. As such, **at least for the 0.5% poisoning rate, the attack fails**.
- However, we do understand your concern about whether our method remains effective when the ASR is relatively low. To further alleviate it, we conducted additional experiments under your suggested cases (i.e., WaNet with 0.5% and 1% poisoning rates). Specifically, following the suggestion from the backdoor-toolbox, we remove samples that contain trigger patterns but still cannot be correctly predicted as the target label by the attacked DNNs. As shown in Table 2, **our method is still highly effective in these cases**, although its performance is slightly lower than that of TeCo, which requires significantly more inference time. In contrast, both STRIP and SCALE-UP fail.
**Table 2. The performance in defending against WaNet with poisoning rate ($\rho \in \{0.5\%, 1\%\}$) on CIFAR-10.**
|Defenses $\downarrow$| $\rho$ (%) $\downarrow$, Metrics $\rightarrow$| AUROC |TPR | FPR|
|-----|-----|-----|-------|-----|
|TeCo|0.5|1.000| 1.000| 0.156|
|TeCo|1|1.000|1.000|0.068|
|STRIP|0.5|0.403|0.039|0.100|
|STRIP|1|0.421|0.033|0.100|
|SCALE-UP|0.5|0.461|0.440|0.389|
|SCALE-UP|1|0.467|0.424|0.345|
|Ours|0.5|0.936|0.791|0.864|
|Ours|1| 0.953|1.000| 0.129|
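As referenced in the first bullet above, here is a minimal sketch of the two ASR conventions (our own illustration; tensor names are placeholders):

```python
import torch


def asr_all_samples(pred_on_poisoned, target_label):
    """ASR over all poisoned testing samples (the convention used in our
    previous rebuttal); samples whose ground truth is already the target
    class are also counted."""
    return (pred_on_poisoned == target_label).float().mean().item()


def asr_non_target_samples(pred_on_poisoned, true_labels, target_label):
    """ASR computed only over samples whose ground truth is NOT the target
    class (the more common convention)."""
    mask = true_labels != target_label
    return (pred_on_poisoned[mask] == target_label).float().mean().item()
```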
---
**Reviewer gbA5 Response to Authors**
Dear Reviewer gbA5, thank you for your careful review and further questions regarding our theorem. We hope the following responses can alleviate your remaining concerns.
We would like to clarify some potential misunderstandings before we respond to your questions one by one.
- **Our paper is not theory-oriented**. We provided Theorem 3.1 solely to argue that the parameter-oriented scaling consistency (PSC) phenomenon is not accidental, although this phenomenon is counterintuitive. We respectfully argue that the theorem has fulfilled this purpose. It is also consistent, to a considerable extent, with the promising performance of our method in the main experiments.
- **Most of our assumptions are classical assumptions in deep learning theory**, although they may deviate from practice to some extent. We would certainly like to prove the theorem without any assumptions, but this is currently infeasible: the theorem is closely tied to the learning theory of deep models, and a proof free of assumptions would require this holy-grail problem to be solved first.
---
**Q1**: The assumption that the probability of the target class $t$ at the $l$-th layer exceeds those of other classes is very strong. While it's accurate that $t$'s final prediction probability surpasses others, this may not hold at the $l$-th layer. Given that subsequent mappings can alter the probabilities, this assumption may not apply in general.
**R1**: Thank you for this insightful question!
- Theoretically, once we assume a Gaussian mixture distribution and a well-trained network, this assumption is safe. In Eq. A3 and A4 of our paper, we assume the conditional distribution $a_l(x) \mid \arg\max f(x) = c$ to be Gaussian. This conditional distribution is obtained after completing the forward pass and looking back at the earlier layers, so it remains unchanged across layers, i.e., $p(a_l(x)\in a_l(A)\mid \arg\max f(x) = c) = p(a_{l+1}(x)\in a_{l+1}(A) \mid \arg\max f(x) = c)$, where $A=\{x: \arg\max f(x)=s\}$. Consequently, the probability assigned to class $c$ at the $l$-th layer is always larger than those of the remaining classes, exactly as in the last layer. Therefore, the assumption holds for the conditional distribution of $a_l(x) \mid \arg\max f(x) = c$.
- In practice, when applying our proposed method, we typically start from the last hidden layers of the network and stop when the amplification effects are no longer significant. This ensures that the effectiveness of the method is not compromised. Additionally, the assumption of a well-trained network implies that the probabilities are adjusted during training to align with the desired class, which further increases the likelihood of this assumption holding.
---
**Q2**: The assumption in the context of a backdoor attack, where the l2 norm of $t$ is significantly less than 1, appears too presumptive and does not reflect practical scenarios.
**R2**: We are deeply sorry that our submission may have led to a potential misunderstanding. In our paper, we assume that all images are normalized, i.e., each pixel value of $x$ lies in $[0, 1]$. Accordingly, $\|t\|_2 \ll 1$ holds in practice since the triggers are either very sparse (e.g., BadNets) or have a small overall magnitude (e.g., WaNet). We will make this clearer in the revision.
---
**Q3**: Modeling $x(t)=x+t$ does not match most of the backdoor attacks. In reality, $t$ would vary with different inputs, making this model less representative of common backdoor strategies.
**R3**: Thank you for this insightful comment! We acknowledge that the simple form $x(t)=x+t$ was chosen to illustrate the concept effectively: it emphasizes that a poisoned sample consists of a clean image term, denoted as $x$, and a backdoor term, denoted as $t$. However, we respectfully note that **our approach does not assume whether $t$ is constant or adaptive with respect to $x$**. To address your concern, we can represent the poisoned sample as $x(t)=x+G(x)$, where $G(x)$ is a small perturbation ($\Vert G(x)\Vert \ll 1$).
To further support this argument, we refer to Eq. A13-A15 in our paper; the subsequent proof relies on the conclusion stated in Eq. A15. Our insight is that the quadratic form $a \Vert y\Vert_2^2+b^Ty + c>0$ (as shown in Eq. A13) holds for nearly all $y = x + vt = x + vG(x)$. Since $G(x)$ is very small ($\Vert G(x)\Vert<\epsilon$), we obtain
$$0<a \Vert x\Vert^2+b^Tx+c + 2ax^T v G(x)+a\Vert v G(x)\Vert^2+b^T v G(x) <a \Vert x\Vert^2+b^Tx+c+ o(\epsilon) \;\Rightarrow\; a \Vert x\Vert^2+b^Tx+c+ o(\epsilon)>0.$$
For this inequality to hold for nearly all $x$, we still require $a=\sigma_t^2-\sigma_c^2>0$. Therefore, Eq. A15 remains a valid conclusion and the subsequent proof still applies. We appreciate your feedback and will make these points clearer in our revision.
---
**Q4**: If the theorem is universally applicable to every data point, regardless of whether it's poisoned or benign, it's unclear how it can be used to specifically identify poisoned images.
**R4**: Thank you for pointing this out! We are deeply sorry that we failed to explain the mechanism more clearly.
Theorem 3.1 indicates that the parameter-amplified versions of the attacked model will predict benign samples as the target class rather than the ground-truth labels assigned to them by the attacked model. As such, the amplification process induces a decrease in the confidence of the originally predicted class (by the attacked model) when the input is benign. In contrast, the prediction confidences of poisoned images remain stable during this process, since their class originally predicted by the attacked model is already the target label. Accordingly, **we can detect whether a suspicious image is malicious by examining its parameter-oriented scaling consistency (PSC): a larger PSC value indicates a higher likelihood that the suspicious image is malicious**.
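To make this mechanism explicit, here is a minimal sketch of PSC-style detection (our own simplification: the construction of the amplified models, the averaging over their confidences, and the threshold are illustrative placeholders rather than the exact procedure in the paper):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def psc_score(model, amplified_models, x):
    """Average confidence that the parameter-amplified models assign to the class
    originally predicted by the deployed model. Benign inputs tend to lose this
    confidence under amplification (Theorem 3.1); poisoned inputs keep it."""
    original_pred = model(x).argmax(dim=1)                       # (B,)
    confs = []
    for m in amplified_models:
        probs = F.softmax(m(x), dim=1)                           # (B, C)
        confs.append(probs.gather(1, original_pred.unsqueeze(1)).squeeze(1))
    return torch.stack(confs, dim=0).mean(dim=0)                 # (B,)


def detect_poisoned(model, amplified_models, x, threshold=0.5):
    """Flag inputs whose PSC exceeds the (illustrative) threshold as likely poisoned."""
    return psc_score(model, amplified_models, x) > threshold
```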
---
**Q5**: The theorem is also applicable to any benign models as well, meaning it postulates the existence of an amplification effect that can skew predictions towards specific classes for all data points. Think about features unique to a class or universal adversarial perturbations. I can make the same claim for a benign model that it can predict all inputs to a certain class by doing the same operation. In these scenarios, a model like $x(t)=x+t$ might be more fitting.
**R5**: Thank you for bringing up this point. Extending this operation to benign models faces two challenges. Firstly, in the case of benign models, the class toward which all inputs are pushed may not align with a pre-specified target class; it is determined by the learning procedure and is generally uncontrollable. The controllability of this class in poisoned models is a major focus of our theorem and forms the foundation of our method design.
Secondly, benign models often encounter a phenomenon known as "neural collapse" [1,2,3], where the features of each class form a simplex equiangular tight frame. This implies that all features share (nearly) the same within-class variance. In such cases, amplifying model parameters to manipulate prediction results can be both inefficient and unstable. The accumulation of errors during computation may dominate the results and make the approach ineffective.
Therefore, our theorem specifically addresses the controllability of classes in poisoned models and provides a foundation for our methodology.
[1] Prevalence of neural collapse during the terminal phase of deep learning training, PNAS.
[2] Another step toward demystifying deep neural networks, PNAS.
[3] Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training, PNAS.
---
**Q6**: The theorem's reliance on additional assumptions, such as assuming Gaussian mixture model on $a$ and the uniform distribution of $\sigma$ and $\mu$.
**R6**: Our theorem aims to provide a deeper understanding of the phenomenon of Parameter-oriented Scaling Consistency in DNNs and demonstrate that it is not accidental. In order to facilitate the demonstration and proof while uncovering the underlying insights, certain assumptions are employed. The Gaussian mixture assumption is commonly used in many deep learning theory papers [1,2,3] as it simplifies the analysis and provides a tractable framework for modeling complex data distributions.
Regarding the uniform distribution assumptions for $\sigma$ and $\mu$ in benign models, these assumptions are justified based on recent advances [4,5,6] in the study of "neural collapse." We explain in the appendix that in neural collapse scenarios, the features of each class form a simplex equiangular tight frame. This implies that all features share (nearly) the same within-class variance and exhibit uniform mean values.
By incorporating these assumptions, we can gain valuable insights into the behavior of neural networks and develop a clearer understanding of the parameter-oriented scaling consistency phenomenon.
[1] Gaussian Mixture Solvers for Diffusion Models, NeurIPS 2023.
[2] Natural Images, Gaussian Mixtures and Dead Leaves, NeurIPS 2012.
[3] Learning Gaussian Mixtures with Generalized Linear Models, NeurIPS 2021.
[4] Prevalence of neural collapse during the terminal phase of deep learning training, PNAS.
[5] Another step toward demystifying deep neural networks, PNAS.
[6] Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training, PNAS.
---
**Reviewer gbA5 Response to Authors**
Dear Reviewer gbA5, thank you for your careful review and further questions regarding our experiments. We hope the following responses can address your remaining concerns.
---
**Q1**: In cases of low poisoning ratios, it is unclear why the authors did not compare their findings with other baselines or extend their experiments to datasets beyond CIFAR-10. Is the same set of hyperparameters effective across different datasets and varying poisoning ratios?
**R1**: Thank you for your comments and we do understand your concerns. We argue that **our method is effective with the same set of hyperparameters across different datasets and varying poisoning ratios**.
- We only reported the results of our method on CIFAR-10 due to time and space limitations. We respectfully note that these are ablation experiments rather than main experiments, and the figure would be hard to read if we also included the results of the baseline defenses.
- However, we do understand and respect your concerns. To further alleviate them, we conduct additional experiments **under the same set of hyperparameters**.
- As shown in Table 1-3, **our method is better than baseline defenses in most cases under different poisoning rates**.
- As shown in Table 4, **our method is still highly effective in defending against attacks with different poisoning rates on the SubImageNet-200 dataset** (instead of only effective on CIFAR-10).
We will add more details and these results in the appendix of our revision.
**Table 1. The performance in defending against BadNets with different poisoning rates ($\rho$) on CIFAR-10.**
| Defenses$\downarrow$ | Metrics$\downarrow$, $\rho\rightarrow$ | 2% | 4% | 6% | 8% | 10% |
|-----|-------|-------|-------|-----|-----|-----|
|STRIP|AUROC|0.881| 0.895|0.868|0.769| 0.931
|STRIP|F1|0.679| 0.752|0.657|0.429|0.842|
|TeCo|AUROC|1.000|0.997|0.981|0.994| 0.998 |
|TeCo|F1|0.916|0.952|0.929|0.937|0.970|
|SCALE-UP|AUROC|0.959|0.964|0.959|0.971|0.962|
|SCALE-UP|F1|0.918|0.914|0.910|0.915|0.913|
|Ours|AUROC| 0.999 |0.998|0.999|1.000|1.000|
|Ours|F1|0.928|0.912|0.961 |0.981| 0.967|
**Table 2. The performance in defending against WaNet with different poisoning rates ($\rho$) on CIFAR-10.**
| Defenses$\downarrow$ | Metrics$\downarrow$, $\rho\rightarrow$ | 2% | 4% | 6% | 8% | 10% |
|-----|-------|-------|-------|-----|-----|-----|
|STRIP|AUROC| 0.493| 0.485|0.479|0.466|0.469|
|STRIP|F1|0.137 | 0.138|0.131|0.116|0.125|
|TeCo|AUROC|0.992|0.976|0.999|0.906|0.923|
|TeCo|F1|0.891|0.905|0.944|0.945|0.915|
|SCALE-UP|AUROC|0.746|0.766|0.730|0.689|0.672|
|SCALE-UP|F1|0.698|0.726|0.624|0.646|0.529|
|Ours|AUROC| 0.966 |0.977|0.978|0.983|0.984|
|Ours|F1|0.944|0.959|0.955|0.960|0.956|
**Table 3. The performance in defending against BATT with different poisoning rates ($\rho$) on CIFAR-10.**
| Defenses$\downarrow$ | Metrics$\downarrow$, $\rho\rightarrow$ | 2% | 4% | 6% | 8% | 10% |
|-----|-----|-------|-------|-------|-----|-----|
|STRIP|AUROC|0.779|0.650|0.656|0.808|0.449|
|STRIP|F1|0.579|0.364|0.385|0.639|0.258|
|TeCo|AUROC|0.803|0.814|0.809|0.871|0.914|
|TeCo|F1|0.683|0.684|0.685|0.685|0.673|
|SCALE-UP|AUROC|0.944|0.968|0.957|0.968| 0.959|
|SCALE-UP|F1|0.880|0.868|0.907|0.871|0.911|
|Ours|AUROC| 0.999 |1.000|0.998|0.985|0.999|
|Ours|F1|0.972|0.967|0.942|0.958|0.979|
**Table 4. The performance of our method in defending against attacks with different poisoning rates on SubImageNet-200.**
|Attack$\rightarrow$|BadNets|BadNets|BadNets|WaNet|WaNet|WaNet|BAAT|BAAT|BAAT|
|-----|-----|-----|-------|-----|-----|-----|-----|-----|-----|
|Poisoning Rate$\downarrow$, Metric$\rightarrow$|ASR|AUROC |F1|ASR|AUROC |F1|ASR|AUROC |F1|
|2%|0.955|0.999|0.905|0.004|-|-| 0.981|0.999|0.998|
| 4%|0.960|1.000|0.996|0.182|0.944|-|0.967|1.000|0.999|
|6%|0.972|0.991|0.989|0.786|1.000|0.996|0.981|0.999|0.997|
| 8%| 0.974|0.997|0.995|0.818|0.986 |0.976|0.980|1.000|0.999|
| 10%|0.998 | 1.000 | 0.992 | 0.967|0.967 | 0.981| 0.997 | 0.998 | 0.998 |
**Note**: We omit some of the results since:
- The WaNet attack failed with 2% poisoning rate.
- The sample set for detection (after excluding the failed poisoned samples) is imbalanced, primarily composed of clean samples with only a small fraction of poisoned ones.
---
**Q2**: I believe a more dangerous attack scenario involves training a neural network that, in the absence of amplification, classifies poisoned images as the target class, but when amplification is applied, it categorizes the poisoned images as another class or their actual ground truth classes.
**R2**: Thank you for this insightful comment!
- In our previous rebuttal (Round 1, R4, 'Note'), we conducted experiments on an adaptive attack that makes the parameter-amplified attacked DNNs categorize poisoned images as their actual ground-truth classes. However, even with a very small trade-off hyper-parameter on the adaptive loss term, **it failed to implant a backdoor and significantly reduced the benign accuracy (by >20%)**.
- Following your suggestion, we also designed another adaptive loss term that makes the parameter-amplified attacked DNNs categorize poisoned images as another (non-target) class. However, even with a very small trade-off hyper-parameter on the adaptive loss term, **it significantly decreased the benign accuracy (by >30%) of the attacked model**.
We will add more details and discussions in the appendix of our revision.
---
**Q3**: As I noted in my previous questions regarding the theorem, the mechanism by which the model differentiates between poisoned and benign data within the target classes remains unclear to me. Instead of merely reporting the F1 score, the authors should also disclose the false positive rate for both the target and benign classes.
**R3**: Thank you for pointing this out. We are deeply sorry that we failed to explain the mechanism more clearly.
- Theorem 3.1 indicates that the parameter-amplified versions of the attacked model will predict benign samples as the target class rather than the ground-truth labels assigned to them by the attacked model. As such, the amplification process induces a decrease in the confidence of the originally predicted class (by the attacked model) when the input is benign. In contrast, the prediction confidences of poisoned images remain stable during this process, since their class originally predicted by the attacked model is already the target label. Accordingly, **we can detect whether a suspicious image is malicious by examining its parameter-oriented scaling consistency (PSC): a larger PSC value indicates a higher likelihood that the suspicious image is malicious**.
- However, we do understand your concerns about whether our method may treat some benign samples from the target class as poisoned samples, leading to a high false positive rate on the target class.
- **We actually considered this issue at the beginning of the design of our methodology**. This is why we use the consistency of confidence rather than that of the predicted label (as used in SCALE-UP) for detection.
- However, as shown in the following Table 5 (following your suggestion), we have to admit that our method with confidence consistency (dubbed 'Ours-C') still produces some false positives on the target class, although **it is significantly better than the variant with label consistency (dubbed 'Ours-L') and SCALE-UP**; a brief sketch after Table 5 contrasts the two scoring rules. This is mostly because the amplification process may also decrease the prediction confidence of benign samples from the target class, since there may be some 'short-cut' classes that are easier to predict. We will explore how to further alleviate this problem in our future work.
**Table 5: The False Positive Rate (FPR) (\%) of SCALE-UP and our defense on target and benign classes on CIFAR-10. In our cases, the target label is 0.**
|Defense$\rightarrow$|SCALE-UP | SCALE-UP |Ours-L | Ours-L|Ours-C| Ours-C|
|-------|-------|-------|-------|-------|-------|-------|
|Attack$\downarrow$, Class$\rightarrow$ | Target | Benign| Target | Benign| Target | Benign|
|BadNets| 72.74 |29.00 |87.73 |10.73 | 0.20 | 1.88|
|Blend| 54.28| 19.80|22.55 | 3.39| 18.34 |2.64|
|PhysicalBA|90.58 | 23.98 | 60.72| 19.20| 4.10 | 1.50 |
|WaNet|76.70|28.11 |81.41| 10.05|69.20 | 8.16 |
|ISSBA|93.93 | 20.70|20.94 |3.00 |17.22 | 2.50|
|BATT |57.74 | 18.78| 81.54|9.61 |59.30| 7.40 |
|SRA|65.55 |29.33 | 0.62| 10.48|0.50| 10.13 |
|Ada-Patch| 93.80|25.77 |8.67 |4.78| 4.34| 3.00|
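As referenced above, here is a brief sketch contrasting the two scoring rules compared in Table 5 (our own illustration, with simplified details): 'Ours-L' checks whether each amplified model keeps the originally predicted label, while 'Ours-C' uses the confidence assigned to that label.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def consistency_scores(model, amplified_models, x):
    """Return (label_consistency, confidence_consistency) per input, i.e.,
    'Ours-L'-style and 'Ours-C'-style scores in the simplified form above."""
    original_pred = model(x).argmax(dim=1)
    label_hits, confs = [], []
    for m in amplified_models:
        probs = F.softmax(m(x), dim=1)
        label_hits.append((probs.argmax(dim=1) == original_pred).float())
        confs.append(probs.gather(1, original_pred.unsqueeze(1)).squeeze(1))
    # 'Ours-L': fraction of amplified models that keep the original label.
    # 'Ours-C': average confidence on the original label.
    return torch.stack(label_hits).mean(dim=0), torch.stack(confs).mean(dim=0)
```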
---