# Rebuttal_DBD
## Response to Reviewer RsZj
We sincerely thank you for your valuable time and comments. We are encouraged by the positive comments on the **novelties**, **extensive experiments**, **good performance**, and **benefits to the field**.
**<u>Q1</u>**: Did not compare DBD with some recent works such as "Spectral signatures in backdoor attacks" in NIPS 2018 and "Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering" in AAAI 2019, both of which are based on data removal. It is unclear how differently DBD removes the poisoned data compared with these existing works.
**<u>R1</u>**: Thanks for these insightful comments! As mentioned in Section 4.4, we do not intend to accurately separate poisoned and benign samples, as detection-based methods (e.g., Spectral Signatures (SS) and Activation Clustering (AC)) do. This is mainly because these methods may not be able to remove enough poisoned samples while preserving enough benign samples at the same time, i.e., there is a trade-off between BA and ASR. However, we do understand your concern. To alleviate it, we compare the filtering ability of our DBD (stage 2) with that of the suggested methods. As shown in Tables 1-2, the filtering performance of DBD is on par with that of SS and AC, and DBD is even better when filtering poisoned samples generated by more complicated attacks (i.e., WaNet and Label-Consistent).
Table 1. The successful filtering rate (poisoned/all, %) w.r.t. the number of filtered samples (in the target class) on the CIFAR-10 dataset.
| | $\epsilon \rightarrow$ | 250 | 500 | 1000 | 1500 |
|:----------------:|:----------------------:|:-------------:|:--------------:|:--------------:|:--------------:|
| BadNets | SS <br> DBD | 95.73<br>100 | 93.20<br>97.60 | 87.60<br>90.87 | 80.71<br>70.09 |
| Blended | SS <br> DBD | 0.80<br>97.87 | 10.53<br>94.67 | 28.87<br>87.27 | 35.29<br>75.16 |
| WaNet | SS <br> DBD | 0<br>100 | 0<br>100 | 1.00<br>99.47 | 7.42<br>97.46 |
| Label-Consistent | SS <br> DBD | 2.40<br>43.47 | 4.40<br>37.07 | 9.00<br>34.47 | 13.78<br>32.53 |
Table 2. The number of remaining poisoned samples over the number of samples retained as non-malicious on the CIFAR-10 dataset.
| | BadNets | Blended | WaNet | Label-Consistent |
|:--------------------:|:----------:|:----------:|:----------:|:----------------:|
| SS ($\epsilon=500$) | 1801/42500 | 2421/42500 | 2500/42500 | 1217/42500 |
| SS ($\epsilon=1000$) | 1186/35000 | 2067/35000 | 2400/35000 | 1115/35000 |
| AC | 0/42500 | 0/37786 | 5000/45546 | 1250/39998 |
| DBD | 8/25000 | 6/25000 | 38/25000 | 13/25000 |
Besides, we also conduct standard training on the samples retained as non-malicious by SS and AC. As shown in Table 3, the hidden backdoor is still created in many cases, even though the detection-based defenses are sometimes accurate.
Table 3. The BA (%) over ASR (%) of models trained on non-malicious samples filtered by SS and AC on CIFAR-10 dataset.
|                  | SS ($\epsilon=500$) | SS ($\epsilon=1000$) | AC          |
|:----------------:|:-------------------:|:--------------------:|:-----------:|
| BadNets | 92.99/100 | 93.27/99.99 | 85.90/0 |
| Blended | 92.84/99.07 | 92.56/99.18 | 77.17/0 |
| WaNet | 92.69/98.13 | 91.92/99.00 | 84.60/99.02 |
| Label-Consistent | 92.93/99.79 | 92.88/99.86 | 75.95/99.75 |
Please refer to Appendix (Section M) in our revision for more details.
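For clarity, the two metrics reported in Tables 1-2 can be computed as in the minimal Python sketch below; the argument names (`all_idx`, `filtered_idx`, `poisoned_idx`) are hypothetical placeholders rather than identifiers from our codebase.

```python
def filtering_metrics(all_idx, filtered_idx, poisoned_idx):
    """Compute the two metrics reported in Tables 1-2 for a given defense."""
    all_idx, filtered_idx, poisoned_idx = map(set, (all_idx, filtered_idx, poisoned_idx))
    kept_idx = all_idx - filtered_idx  # samples retained as "non-malicious"

    # Table 1: successful filtering rate = fraction of filtered samples that are truly poisoned.
    success_rate = 100.0 * len(filtered_idx & poisoned_idx) / max(len(filtered_idx), 1)

    # Table 2: remaining poisoned samples among the retained (non-malicious) samples.
    remaining_poisoned = len(kept_idx & poisoned_idx)
    return success_rate, remaining_poisoned, len(kept_idx)
```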
**<u>Q2</u>**: I suspect the proposed approach works only with SimCLR. What are the necessary conditions that make a feature extractor scatter poisoned data points in the feature space? Can you point out other feature extractors having the same properties as SimCLR? Does DBD work with these extractors too?
**<u>R2</u>**: Thanks for your insightful questions! We believe that all self-supervised methods (not just SimCLR) can be adopted in our DBD. As described in the Introduction and Section 3, this is mainly due to the power of the decoupling process and the strong data transformations involved in self-supervised learning (a minimal sketch of such transformations is given after Table 4). To alleviate your concerns, we also examine our DBD with other self-supervised methods. As shown in Table 4, all DBD variants have similar performance. Please refer to Appendix (Section N) in our revision for more details.
Table 4. The BA (%) over ASR (%) of our DBD with different self-supervised methods on CIFAR-10 dataset.
| | BadNets | Blended | WaNet | Label-Consistent |
|:------:|:----------:|:----------:|:----------:|:----------------:|
| SimCLR | 92.41/0.96 | 92.18/1.73 | 91.20/0.39 | 91.45/0.34 |
| MoCo | 93.01/1.21 | 92.42/0.24 | 91.69/1.30 | 91.46/0.19 |
| BYOL | 91.98/0.82 | 91.38/0.51 | 91.37/1.28 | 90.09/0.17 |
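For reference, below is a minimal sketch of the kind of strong, label-free augmentations used by SimCLR-style self-supervised learning; the exact transform parameters are illustrative and not necessarily those used in our experiments.

```python
import torchvision.transforms as T

# Illustrative SimCLR-style strong augmentation for 32x32 CIFAR-10 images.
# Aggressive cropping and color distortion tend to destroy small, fixed
# trigger patterns, which is the intuition behind the decoupling stage.
strong_aug = T.Compose([
    T.RandomResizedCrop(32, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

# Two independently augmented "views" of the same image are fed to the
# contrastive (or BYOL/MoCo-style) objective:
# view1, view2 = strong_aug(img), strong_aug(img)
```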
**<u>Q3</u>**: The effect of the final fine-tuning step is unclear. How does DBD perform without this phase?
**<u>R3</u>**: Thanks for this question. As described in Section 4.5, the (semi-supervised) fine-tuning process can prevent the side effects of poisoned samples while exploiting the useful information they contain, and therefore increases the BA and decreases the ASR simultaneously. The results in Table 5 verify this; please refer to Section 5.3.3 (Table 3) in our paper for more details. Note: 'SS with SCE' denotes DBD without semi-supervised fine-tuning (a sketch of the SCE loss is given after Table 5).
Table 5. The BA (%) over ASR (%) of our DBD with or without semi-supervised fine-tuning.
| | BadNets | Blended | Label-Consistent | WaNet |
|:---------:|:----------:|:----------:|:----------------:|:----------:|
| DBD (w/o) | 82.34/5.12 | 82.30/6.24 | 81.81/5.43 | 81.15/7.08 |
| DBD (w/) | 92.41/0.96 | 92.18/1.73 | 91.45/0.34 | 91.20/0.39 |
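For completeness, a minimal sketch of the symmetric cross-entropy (SCE) loss used in the second stage; the coefficients `alpha`, `beta`, and the log-clamp value below are illustrative defaults rather than the exact values in our experiments.

```python
import torch
import torch.nn.functional as F

def sce_loss(logits, targets, alpha=0.1, beta=1.0, num_classes=10, clamp_min=1e-4):
    """Symmetric cross-entropy: alpha * CE(p, y) + beta * RCE(p, y)."""
    # Standard cross-entropy term.
    ce = F.cross_entropy(logits, targets)

    # Reverse cross-entropy term: -sum_k p_k * log(y_k), with log(0) clamped.
    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes).float()
    rce = (-pred * torch.log(one_hot.clamp(min=clamp_min))).sum(dim=1).mean()

    return alpha * ce + beta * rce
```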
**<u>Q4</u>**: I also suggest the authors move the section about the resistance to adaptive attacks from the Appendix into the main paper, as adaptive attacks are becoming a more serious threat today. Please explain your adaptive attack settings more clearly (for example, what trigger size you used and how you tuned the hyper-parameters).
**<u>R4</u>**: Thank you for this constructive suggestion! We have moved it into the main paper (Section X) and provided more details in our revision.
**<u>Q5</u>**: It would also be good if the author discuss how an attack may work around the proposed defense, and how to further defend such workarounds.
**<u>R5</u>**: Thank you for this constructive suggestion! We discussed the resistance of our DBD to potential adaptive attacks in Appendix H. The results show that our method is resistant to the discussed adaptive attack; as such, we did not further analyze how to better defend against it. We are very willing to test your suggested adaptive methods if you can kindly provide more details.
## Response to Reviewer h7wA
We sincerely thank you for your valuable time and comments. We are encouraged by the positive comments on the **novelties**, **strong sets of experiments and baselines**, **practicability**, and **good writing**.
**<u>Q1</u>**: The proposed method modifies the underlying training procedure, multiplying the training time, which I think is significant for practitioners. Addressing this issue is essential to support the practicality of the method.
**<u>R1</u>**: Thank you for this insightful comment! As analyzed in Appendix K, our method is roughly 2-3 times slower than the standard training process in general, which we think is tolerable, and therefore our DBD is still practical. In fact, most existing backdoor defenses (e.g., NC and NAD) also incur additional computational costs. However, we do understand your concern. There are two possible ways to accelerate DBD:
1. If there is a secure pre-trained backbone (e.g., one from a trusted source), people can use it directly and save the time of the self-supervised stage.
2. People can apply existing acceleration techniques (e.g., mixed precision training) to each stage of DBD to speed up the whole training process (see the sketch below).
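As a concrete illustration of point 2, a minimal mixed-precision training sketch with PyTorch AMP is given below; `model`, `loader`, `optimizer`, and `criterion` are placeholders for the components of any DBD stage.

```python
import torch

def train_one_epoch_amp(model, loader, optimizer, criterion, device="cuda"):
    """One training epoch with automatic mixed precision (AMP)."""
    scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid fp16 underflow
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()      # backward on the scaled loss
        scaler.step(optimizer)
        scaler.update()
```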
**<u>Q2</u>**: The primary assumption of the paper is that learning in semi-supervised learning is safe. However, Carlini's recent work [1] demonstrates the attacker's effectiveness under the same threat model, i.e., when the attacker is only allowed to poison data. If the poisoning is effective, then the proposed defense exposes the model to a different attack.
**<u>R2</u>**: Thank you for this insightful comment! Carlini's work requires that attackers know which samples are labeled and which are unlabeled (at least partially). This threat model is practical in classical semi-supervised training, where the attacker can control the training set. However, in our method, whether a sample is labeled or not is determined by the first two stages of DBD, whose results are not accessible to attackers. As such, this attack cannot be applied against our DBD, since our threat model targets poisoning-based backdoor attacks, where attackers can neither know nor control the training details. We have added more discussion in Appendix (Section H).
## Response to Reviewer MY71
We sincerely thank you for your valuable time and comments. We are encouraged by the positive comments on the **good generalization** and **effectiveness**.
**<u>Q1</u>**: The paper lacks a theoretical analysis of the proposed method. I understand that this paper mainly focuses on empirical performance, but it is quite surprising that the proposed method performs well on label-consistent attacks. This is because the proposed method decouples label corruption and feature corruption. When label corruption no longer exists, what is the advantage of the proposed method?
**<u>R1</u>**: Thank you for the comments and the insightful question!
- We admit that we do not provide a theoretical analysis of our DBD. Such an analysis is very difficult, especially since ours is the first work trying to analyze the learning behavior of poisoned samples. We hope that this work can inspire follow-up works to better understand the learning of poisoned samples (theoretically or empirically).
- Our DBD is effective in defending against clean-label attacks not because it can successfully filter poisoned samples; indeed, DBD fails to filter these poisoned samples since there is no label corruption (as you mentioned). The success of DBD against clean-label attacks is mostly because the strong data augmentations involved in self-supervised learning damage trigger patterns and therefore make them unlearnable (without the guidance of labels). In particular, the trigger pattern is harder to learn under clean-label attacks than under poison-label attacks, since the target-class features contained in the poisoned images hinder the learning of the trigger pattern.
**<u>Q2</u>**: For the second step, label-noise learning, there are also many choices besides the symmetric cross-entropy method. Investigating more noisy-label algorithms might be interesting.
**<u>R2</u>**: Thank you for this constructive suggestion! We do understand your concern that the selection of noisy-label algorithms may sharply influence the overall performance of DBD. To alleviate it, we also examine our DBD with other noisy-label methods (i.e., GCE and NCE+RCE). As shown in Table 2, all DBD variants have similar performance (a sketch of the GCE loss is given after Table 2). Please refer to Appendix (Section O) in our revision for more details.
Table 2. The BA (%) over ASR (%) of our DBD with different noisy-label methods on CIFAR-10 dataset.
| | BadNets | Blended | WaNet | Label-Consistent |
|:---------------:|:----------:|:----------:|:----------:|:----------------:|
| DBD (SCE) | 92.41/0.96 | 92.18/1.73 | 91.20/0.39 | 91.45/0.34 |
| DBD (GCE) | 92.93/0.88 | 93.06/1.27 | 92.25/1.51 | 91.05/0.15 |
| DBD (NCE + RCE) | 92.95/1.00 | 92.65/0.78 | 92.24/1.40 | 91.08/0.14 |
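For reference, a minimal sketch of the generalized cross-entropy (GCE) loss used in the 'DBD (GCE)' variant; the exponent `q` below is a commonly used default and is only illustrative.

```python
import torch
import torch.nn.functional as F

def gce_loss(logits, targets, q=0.7):
    """Generalized cross-entropy (Zhang & Sabuncu, 2018): (1 - p_y^q) / q."""
    pred = F.softmax(logits, dim=1)
    # Probability assigned to the (possibly noisy) ground-truth class.
    p_y = pred.gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.clamp(min=1e-7) ** q) / q).mean()
```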
**<u>Q3</u>**: The two-step method makes the algorithm not in an end-to-end fashion. It would be interesting to investigate the possibility to make an end-to-end algorithm.
**<u>R3</u>**: Thank you for this constructive suggestion! We think one of the most interesting parts of this paper is revealing that the classical end-to-end training paradigm strengthens the backdoor threat, and we believe keeping our DBD in a non-end-to-end fashion better emphasizes this point. Besides, since no human operation is required between the stages of our DBD, the whole training pipeline is general and can, in a sense, be regarded as end-to-end. However, we are very willing to try it if you can kindly provide more details about the end-to-end design.
**<u>Q4</u>**: I do not think excluding the detection-based methods is fair, since those methods are strong baselines, especially for the BadNets and Blended attacks. Also, since the main contribution of the paper is the empirical performance, it is necessary to compare with different kinds of baselines.
**<u>R4</u>**: Thanks for these insightful comments! As mentioned in Section 4.4, we do not intend to accurately separate poisoned and benign samples, as detection-based methods (e.g., Spectral Signatures (SS) and Activation Clustering (AC)) do. This is mainly because these methods may not be able to remove enough poisoned samples while preserving enough benign samples at the same time, i.e., there is a trade-off between BA and ASR. However, we do understand your concern. To alleviate it, we compare the filtering ability of our DBD (stage 2) with that of two representative detection-based methods (i.e., SS and AC); please see our response to Reviewer RsZj (Tables 1-3) for the detailed results.
## Response to Reviewer G82o
We sincerely thank you for your valuable time and comments. We are encouraged by the positive comments on the **simple but effective idea** and **extensive experiments**.
**<u>Q1</u>**: The primary concern about this paper is the extra computation cost of the proposed DBD pipeline, considering that training self-supervised and semi-supervised models costs far more computational resources than training a supervised learning model. Would it be possible to use a public pre-trained feature extractor to replace the self-supervised feature extractor? The authors are welcome to discuss this.
**<u>R1</u>**: Thank you for the insightful comment and question! As we analyzed in Appendix K, our method is roughly 2-3 times slower than the standard training process in general, which we think is tolerable and therefore our DBD is still practical. In fact, most existing backdoor defenses (e.g., NC and NAD) require additional computational costs. However, we do understand your concern. There are two possible ways to accelerate DBD:
- If there is a secure pre-trained backbone (e.g., one from a trusted source), people can use it directly and save the time of the self-supervised stage (as you suggested).
- People can apply existing acceleration techniques (e.g., mixed precision training) to each stage of DBD.
Following your suggestion, we downloaded a pre-trained ResNet-50 backbone from https://github.com/leftthomas/SimCLR and replaced the self-supervised feature extractor in our DBD with it. As shown in Table 1, DBD with the public pre-trained backbone achieves comparable or even better performance.
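A minimal sketch of this replacement is shown below; the checkpoint filename and state-dict key layout are assumptions that depend on how the repository saves its weights.

```python
import torch
import torchvision.models as models

# Assumed checkpoint filename and key layout; adjust to the file actually
# provided by the repository (projection-head keys, if any, are ignored).
backbone = models.resnet50(num_classes=10)
state = torch.load("simclr_resnet50.pth", map_location="cpu")
backbone.load_state_dict(state, strict=False)

# Freeze the backbone and train only the classification head in the later stages.
for name, param in backbone.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False
```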
However, we should note that using public pre-trained feature extractors is still risky when the model source is not secure. For example, if the pre-trained feature extractor is infected (with the same poisoned samples contained in the training set), using it in our DBD will still create hidden backdoors, as shown in Table 3 (Line 'DBD without SS').
Table 1. The BA (%) over ASR (%) of our DBD with (w/) or without (w/o) the public pre-trained feature extractor on the CIFAR-10 dataset.
| | BadNets | Blended | Label-Consistent | WaNet |
|:---------:|:----------:|:----------:|:----------------:|:----------:|
| DBD (w/) | 94.53/0.54 | 94.81/0.73 | 93.22/0 | 94.08/1.36 |
| DBD (w/o) | 92.41/0.96 | 92.18/1.73 | 91.45/0.34 | 91.20/0.39 |
**<u>Q2</u>**: In Section 5.2, the authors list the results on "No defense" to show the backdoor defenses' impact on the original model's accuracy and the effectiveness of the backdoor mitigation. It is not clear how the model works without a defense. Suppose the authors train the original model in a supervised learning manner. In that case, directly using a self-supervised learning paradigm should also be included to illustrate each step's contribution.
**<u>R2</u>**: Thank you for the constructive suggestion! In our paper, 'No Defense' means training a model with the standard end-to-end supervised training. We do understand your concern that analyzing each step's contribution is important. Note that directly using a self-supervised learning paradigm should be considered a defense, since it introduces the decoupling process. In the ablation study (Section 5.3.3, Table 3), we have discussed the effectiveness of each stage of our DBD, including directly using a self-supervised learning paradigm as you mentioned. The results show that it is also effective in reducing backdoor threats (although less effective than our full DBD). Please refer to Table 3 (Lines 'SS with CE' and 'SS with SCE') for more details.
**<u>Q3</u>**: In the fine-tuning step, the sensitivity of lambda should also be discussed, since it controls the ratio of unlabelled data.
**<u>R3</u>**: Thank you for this constructive suggestion! We do understand your concern that analyzing the effects of key hyper-parameters is important. In fact, we have included these experiments in Section 5.3.1 (Figure 5). The results show that DBD can still maintain relatively high benign accuracy (and low ASR) even when the filtering rate $\alpha$ is relatively small (e.g., 30%). However, we also note that the high-credible dataset may contain poisoned samples when $\alpha$ is very large, which in turn creates hidden backdoors again during the fine-tuning process. Defenders should specify $\alpha$ based on their specific needs.
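For clarity, a minimal sketch of how the filtering rate $\alpha$ determines the high-credible (labeled) set, assuming the per-sample losses from stage 2 are already available; `per_sample_loss` is a placeholder name.

```python
import torch

def split_by_filtering_rate(per_sample_loss, alpha=0.5):
    """Keep the alpha fraction of lowest-loss samples as the high-credible (labeled) set."""
    n = per_sample_loss.numel()
    n_labeled = int(alpha * n)
    order = torch.argsort(per_sample_loss)   # ascending: low loss = more credible
    labeled_idx = order[:n_labeled]          # treated as labeled during fine-tuning
    unlabeled_idx = order[n_labeled:]        # labels discarded; used as unlabeled data
    return labeled_idx, unlabeled_idx
```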
**<u>Q4</u>**: In Figure 1(a), the poisoned samples' embedding seems far from the target label (label 3) and closer to other un-targeted labels (label 1, 7, and 9). It would be great if the authors could give explanations.
**<u>R4</u>**: Thank you for the insightful question! First, we should note that t-SNE only preserves the 'local structure' rather than the 'global structure'; in other words, it can show the difference between clusters, while specific distances are sometimes meaningless. However, we do understand your concern about why a backdoor attack can succeed even when the embeddings of poisoned samples are not close to those of the (benign) samples with the target label. In general, a successful backdoor attack only requires that the embeddings of different types of samples (i.e., samples with different labels and samples with or without the trigger) be split into separate clusters; the remaining FC layers will 'assign' the label to each cluster. Besides, Figure 1(a) is the visualization of a poison-label attack, where different poisoned samples may have different ground-truth labels and therefore contain many features different from those contained in benign samples with the target label. This is probably why the phenomenon you mentioned occurs. In contrast, Figure 1(b) is the visualization of a clean-label attack, where all poisoned samples have the same ground-truth label; as such, in this case, the embeddings of poisoned samples are close to those of samples with the target label.
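For reference, a minimal sketch of how such a t-SNE visualization can be produced from penultimate-layer features; the argument names are placeholders for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, is_poisoned):
    """features: (N, D) numpy array of penultimate-layer embeddings;
    labels: (N,) class ids; is_poisoned: (N,) boolean mask (placeholders)."""
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.scatter(emb[~is_poisoned, 0], emb[~is_poisoned, 1],
                c=labels[~is_poisoned], s=3, cmap="tab10")   # benign samples, colored by class
    plt.scatter(emb[is_poisoned, 0], emb[is_poisoned, 1],
                c="black", s=3, label="poisoned")            # poisoned samples
    plt.legend()
    plt.show()
```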
Title: Thank you for the recognition and insightful comments!
Thanks for your recognition and insightful comments! Please kindly find our detailed explanations as follows:
---
**Q1**: The safety of semi-supervised learning.
**R1**: Thank you for your further questions and we do understand your concern. We try to explain it from three aspects, as follows:
- If the attackers poison only unlabeled samples, as Carlini's work does, Carlini's attack cannot affect the semi-supervised stage of our DBD, since the input to our DBD is a fully labeled dataset. In this case, our DBD abandons the unlabeled data at the very beginning, such that the interpolated malicious unlabeled data are removed.
- If the attackers instead add labels to those malicious samples, we admit that such samples will be used in the training process of our DBD. However, even in this case, Carlini's attack is still not able to attack the semi-supervised stage of our DBD. Specifically, as shown in Section 3.1 of Carlini's work, the attackers need to insert a path of unlabeled samples beginning from a selected labeled sample $x'$ (with the target class) and ending with the target sample $x^{*}$ to fulfill their malicious purpose. However, since the labeled/unlabeled partition depends on the loss calculated in the second stage of DBD and the defender-assigned filtering rate $\alpha$, attackers cannot know which samples are labeled, i.e., they cannot pick an $x'$.
- If the attackers somehow find a way to ensure that the attacker-specified $x'$ is chosen as a labeled sample in the semi-supervised stage of our DBD, the attack goal of misclassifying the selected sample as the target class can probably be achieved. We will explore this in our future work. If such a single-point attack succeeds against our DBD, we would not be surprised, as DBD is not designed to defend against this attack. In particular, Carlini's work provides some useful methods to defend against it, such as pairwise influence (see Section 5.3 in Carlini's work). This method can be naturally combined with the second stage of our DBD to identify triggered and interpolated points simultaneously.
Based on the above three aspects, we do not think that Carlini's work poses a severe threat to our DBD method. We hope that these explanations provide clearer information and alleviate your concern. The discussion with you has been very pleasant and helpful, and we are happy to discuss further if you have any remaining concerns. Thanks again for your insightful comment, which inspires us to explore a more effective adaptive attack against our DBD in future work.
---
**Q2**: More details about the speed constraints.
**R2**: Thank you for the constructive suggestions! Following your suggestion, we report the training time of each stage in our DBD as follows:
Table 1. The training time (minutes) of each stage in our DBD on CIFAR-10 dataset.
| | Stage 1 (Self-supervised Learning) | Stage 2 (Filtering) + Stage 3 (Fine-tuning) |
|:---:|:----------------------------------:|:-------------------------------------------:|
| DBD | 50.8 | 260.5 |
As shown in Table 1, using a safe pre-trained backbone (to skip the self-supervised learning stage) can indeed save some time in our DBD. Note that we trained for only 100 epochs (instead of the standard 1000 epochs) in our DBD to save time, as described in Appendix K. This is why using a pre-trained backbone saves only 50.8 minutes under this setting; with the standard 1000 epochs, the saving would be about 508 minutes.
Inspired by your questions, we also notice that our DBD can be further accelerated by using fewer epochs in Stages 2-3. As shown in Table 2, our DBD has converged after 115 epochs; in other words, we can save another 115.8 minutes by setting the number of epochs to 115 (instead of 200) in Stages 2-3. People should assign the training epochs based on their specific needs (a minimal early-stopping sketch is given after Table 2).
Table 2. The benign accuracy (%) w.r.t. the training epoch in Stages 2-3 of our DBD on the CIFAR-10 dataset.
| Epoch | 25 | 40 | 60 | 80 | 85 | 115 | 116 | 117 | 118 | 125 | 126 | 127 |
|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| BA | 85.86 | 88.96 | 90.60 | 91.85 | 92.42 | 93.02 | 93.02 | 93.16 | 93.15 | 93.37 | 93.04 | 93.00 |
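As an illustration of how the Stage 2-3 training budget could be chosen adaptively rather than fixed, a minimal patience-based early-stopping sketch is given below; `train_one_epoch`, `evaluate_ba`, and the thresholds are hypothetical placeholders.

```python
def train_with_early_stop(train_one_epoch, evaluate_ba, max_epochs=200,
                          patience=10, min_delta=0.1):
    """Stop Stages 2-3 once the benign accuracy (BA) stops improving.

    `train_one_epoch` and `evaluate_ba` are placeholder callables: one epoch
    of Stage 2-3 training, and BA evaluation on a held-out benign set.
    """
    best_ba, epochs_without_gain = 0.0, 0
    for _ in range(max_epochs):
        train_one_epoch()
        ba = evaluate_ba()
        if ba > best_ba + min_delta:
            best_ba, epochs_without_gain = ba, 0
        else:
            epochs_without_gain += 1
        if epochs_without_gain >= patience:
            break   # in Table 2, BA plateaus around epochs 115-125
    return best_ba
```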
Thank you again for your insightful questions. We will add more details and discussions in Appendix K in our final version.