# Rebuttal: SCALE-UP
## Author Response (Reviewer GMnc)
We sincerely thank you for your valuable time and constructive comments. We are encouraged by your positive comments on our **idea and paper writing**, **method effectiveness**, **extensive and insightful experiments**, and **comprehensive appendix**! We will alleviate your remaining concerns as follows.
**Note**: All modified contents are marked in orange in our revision.
---
**Q1**: I think the key limitation of the method is that SCALE-UP does not recover the trojan pattern. In other words, it cannot identify if the model is trojaned or not offline (e.g., https://www.ijcai.org/proceedings/2019/647). So even though your method achieves a very high AUROC score, it is not applicable in real-time applications such as a self-driving car. You cannot afford to lose ~2% of the real-time frames due to the false positive samples.
**R1**: Thank you for this insightful comment and we do understand your concerns. We are sorry that our submission may have caused some misunderstandings, which we would like to clarify in this rebuttal as follows:
- We admit that our method cannot recover trigger patterns. However, **this does not necessarily mean that our method cannot be used to detect trojans offline**. For example, when the training dataset (of the suspicious model) is available, defenders can use our method to filter poisoned training samples. If our method identifies a sufficiently large number of poisoned samples, the suspicious model can be regarded as trojaned.
- Even if our method cannot detect trojans offline, **it is still practical in real-world applications since it can serve as a 'firewall' that helps to block and trace back malicious samples in MLaaS scenarios** (as we mentioned in Related Work). We are sorry that we failed to make this clear in our original submission. We have added more explanations at the beginning of the penultimate paragraph of the Introduction in our revision.
- **Recovering trigger patterns is very challenging under our input-level backdoor detection setting since defenders have limited capacities**. No existing defense under this setting can fulfill it: existing methods that recover trigger patterns are either model-level [1-5] or require the white-box setting [6]. However, we do agree that it would be better if our method could also recover trigger patterns, and we will explore how to extend it to support this functionality in future work.
- We admit that our method may produce some false positives if defenders intend to recall all poisoned samples. However, **there is a trade-off between recall and precision for all detection-based methods**, unless the AUROC reaches 100% (which is usually impossible). Users can adjust the threshold $T$ involved in our method to trade off recall and precision based on their specific needs in real-world applications (see the illustrative sketch after the reference list).
[1] DeepInspect: A Black-box Trojan Detection and Mitigation Framework for Deep Neural Networks. IJCAI, 2019.
[2] Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. IEEE S&P, 2019.
[3] AEVA: Black-box Backdoor Detection Using Adversarial Extreme Value Analysis. ICLR, 2022.
[4] Backdoor Scanning for Deep Neural Networks through K-Arm Optimization. ICML, 2021.
[5] ABS: Scanning Neural Networks for Backdoors by Artificial Brain Stimulation. CCS, 2019.
[6] SentiNet: Detecting Localized Universal Attacks Against Deep Learning Systems. IEEE S&P Workshop, 2020.
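For concreteness, the following minimal sketch (not our released code) illustrates how a defender could tune $T$ on a small labeled validation set to meet a target precision; the `scores` and `is_poisoned` arrays are synthetic placeholders rather than real detection scores.

```python
# Minimal sketch: choosing the detection threshold T to trade off recall
# and precision. `scores` are hypothetical per-sample detection scores
# (higher = more suspicious); `is_poisoned` are ground-truth labels.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.3, 0.1, 900),   # benign samples
                         rng.normal(0.8, 0.1, 100)])  # poisoned samples
is_poisoned = np.concatenate([np.zeros(900), np.ones(100)])

precision, recall, thresholds = precision_recall_curve(is_poisoned, scores)

# Smallest threshold whose precision reaches 95%, i.e., accept a lower
# recall in exchange for fewer false positives.
target_precision = 0.95
idx = int(np.argmax(precision[:-1] >= target_precision))
T = thresholds[idx]
print(f"T = {T:.3f}, precision = {precision[idx]:.3f}, recall = {recall[idx]:.3f}")
```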
---
**Q2**: I don't understand why your method is only ~5% slower compared to the "no defense" method. You need to infer the sample images multiple times (up to 14 times as given in Figure 12). Also, there is no computation reuse between different inferences (due to input changes). Can you explain this issue?
**R2**: Thank you for this insightful question! We are sorry that our submission may have caused some misunderstandings, which we would like to clarify in this rebuttal as follows:
- **We calculated the inference time of our method by feeding all scaled variants of the suspicious image into the deployed model in a single batch, instead of predicting them one by one** (a minimal sketch is given at the end of this response). Accordingly, the inference time of our method is similar to that of No Defense, thanks to the high efficiency of batched matrix operations. Defenders can easily and efficiently construct all scaled variants of a suspicious image before feeding them into the deployed model.
- **This measurement protocol is fair** since we adopted the same batch-based procedure when measuring the inference time of all baseline methods ($e.g.$, STRIP and ShrinkPad).
- Even under restricted scenarios where defenders can only obtain the prediction of a single image at a time, they can still exploit parallel computation to achieve a similar inference time at the cost of additional memory.
To avoid misunderstandings, we have added more details in Appendix G of our revision. We apologize again for the confusion.
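To make the measurement protocol concrete, the snippet below is a minimal PyTorch sketch (not our exact implementation; it assumes a single `(C, H, W)` image with values in `[0, 1]` and an illustrative set of scaling factors) of predicting all scaled variants in one batched forward pass.

```python
# Minimal sketch: query the deployed model once with all scaled variants
# of a suspicious image stacked into a single batch.
import torch

SCALES = [1, 3, 5, 7, 9, 11]  # illustrative pixel-wise multipliers

@torch.no_grad()
def scaled_predictions(model, image):
    """image: a (C, H, W) tensor with pixel values in [0, 1]."""
    # Multiply pixel values by each scale and clip back to the valid range.
    variants = torch.stack([torch.clamp(image * s, 0.0, 1.0) for s in SCALES])
    logits = model(variants)      # one batched forward pass for all variants
    return logits.argmax(dim=1)   # predicted label of each scaled variant
```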
---
**Q3**: AUROC is the only metric you applied throughout the paper (and appendix). Is it possible to provide the evaluation score of other metrics as well? Or maybe you can simply plot out some ROC curves.
**R3**: Thank you for this constructive suggestion! We have added the ROC curves of defenses under each attack in Appendix O of our revision.
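For reference, the sketch below illustrates how such a curve can be produced (the scores are synthetic placeholders; Appendix O uses the real detection scores of each defense under each attack).

```python
# Minimal sketch: plotting an ROC curve and reporting its AUROC.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.3, 0.1, 900),   # benign samples
                         rng.normal(0.8, 0.1, 100)])  # poisoned samples
labels = np.concatenate([np.zeros(900), np.ones(100)])

fpr, tpr, _ = roc_curve(labels, scores)
plt.plot(fpr, tpr, label=f"AUROC = {roc_auc_score(labels, scores):.3f}")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc_curve.pdf")
```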
---
## Author Response (Reviewer 4qGQ)
We sincerely thank you for your valuable time and constructive comments. We are encouraged by your positive comments on our **practical setting and application**, **interesting phenomenon**, **technical novelty**, and **high effectiveness**! We will alleviate your remaining concerns as follows.
**Note**: All modified contents are marked in orange in our revision.
---
**Q1**: In Table 1-2, the defense methods (STRIP, ShrinkPad, DeepSweep, Frequency) are all designed for patch-based attacks. However, the no-patch-based attacks are used for detection, which may be more favorable for the method in the article, and the comparison may be unfair. I suggest the authors add experiments about no-patch-based defenses for comparison.
**R1**: Thank you for your comments and we do understand your concerns. We are sorry that our submission may have caused some misunderstandings, which we would like to clarify in this rebuttal as follows:
- Firstly, **the baseline defenses are not designed only for patch-based attacks**. As shown in Tables 1-2 of our main manuscript (or Table 1 below), all baseline defenses can detect TUAP to some extent, whose trigger pattern is an image-size additive perturbation rather than a local patch. These defenses are less effective against WaNet and ISSBA because their triggers are sample-specific, not because their triggers are non-patch-based.
- **We have already compared our method with almost all baseline methods that can be used under the black-box input-level backdoor detection setting**. We would be grateful if you could provide the names of any black-box input-level backdoor detection methods that you think we have missed. We are willing to compare our method with them before the rebuttal period ends.
**Note**: For your convenience, we reproduce the key TUAP-related results from Tables 1-2 of our submission below:
Table 1. The performance (AUROC) of baseline defenses in detecting TUAP on the CIFAR-10 and Tiny ImageNet datasets.
| Dataset$\downarrow$, Defense$\rightarrow$ | STRIP | ShrinkPad | DeepSweep | Frequency |
|:----------------:|:-----:|:---------:|:---------:|:---------:|
| CIFAR-10 | 0.671 | 0.869 | 0.743 | 0.851 |
| Tiny ImageNet | 0.638 | 0.866 | 0.759 | 0.837 |
---
**Q2**: Please provide the code at the beginning of the rebuttal for reproducibility.
**R2**: Thank you for this constructive suggestion and we do agree that providing source code is important for reproducibility. However, there seems to be a misunderstanding: we have already included our code in the supplementary materials of our submission, as mentioned in our Reproducibility Statement. To avoid further misunderstanding, we have highlighted this sentence in italics in our revision.
---
## Author Response (Reviewer JRLs)
We sincerely thank you for your valuable time and constructive comments. We are encouraged by your positive comments on our **paper quality**, **motivation**, **interesting phenomenon**, **theoretical foundation**, **technical novelty**, and **comprehensive and fair experiments**! We will alleviate your remaining concerns as follows.
**Note**: All modified contents are marked in orange in our revision.
---
**Q1**: The most critical weakness of this work is the lack of diversity in terms of neural network architectures. To the best of this reviewer's attention, the method was only evaluated over ResNet. This choice would impose the question of whether the same observations are valid for different DNN architectures.
**R1**: Thank you for this insightful comment! We do understand your concern and agree that it is critical to verify that scaled prediction consistency holds across different DNN architectures. To alleviate your concern, we conducted additional experiments with all defenses using VGG-19 (with BN) on the Tiny ImageNet dataset, as follows:
Table 1. The performance (AUROC) on the Tiny ImageNet dataset under VGG-19. For each attack, the best result is marked in boldface and the underlined value denotes the second-best result. Failed cases ($i.e.$, AUROC $<0.55$) are marked in red. Note that STRIP requires the predicted probability vectors while other methods only need the predicted labels.
| Defense$\downarrow$, Attack$\rightarrow$ | BadNets | Label Consistent | PhysicalBA | TUAP | WaNet | ISSBA | Average |
|:-------------------:|:-------:|:----------------:|:----------:|:-----:|:-----:|:-----:|:-------:|
| STRIP | **0.941** | <u>0.908</u> | **0.941** | 0.576 | <font color='red'>0.521</font> | <font color='red'>0.489</font> | 0.729 |
| ShrinkPad | 0.857 | **0.919** | 0.631 | 0.831 | <font color='red'>0.499</font> | <font color='red'>0.490</font> | 0.705 |
| DeepSweep | <u>0.939</u> | 0.907 | <u>0.921</u> | 0.744 | <font color='red'>0.511</font> | 0.711 | 0.788 |
| Frequency | 0.864 | 0.859 | 0.864 | 0.827 | <font color='red'>0.428</font> | <font color='red'>0.540</font> | 0.730 |
| Ours (Data-free) | 0.936 | 0.846 | 0.907 | <u>0.858</u> | <u>0.893</u> | <u>0.767</u> | <u>0.868</u> |
| Ours (Data-limited) | 0.936 | 0.851 | 0.907 | **0.888** | **0.904** | **0.836** | **0.887** |
**The aforementioned results again verify the effectiveness of our defenses**. Besides, we also visualize the average confidence ($i.e.$, the average probability on the originally predicted label) of benign and poisoned samples $w.r.t.$ pixel-wise multiplications under the benign model and each attacked model (Figure 15 in our revision). Please find more details in Appendix N of our revision. A sketch of the visualized statistic is given below.
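The quantity plotted in Figure 15 can be computed as in the following minimal sketch (a hypothetical helper, not our exact plotting code; it assumes images in `[0, 1]` and illustrative scaling factors).

```python
# Minimal sketch: average probability assigned to the originally predicted
# label as the pixel-wise multiplier grows.
import torch

@torch.no_grad()
def average_confidence(model, images, scales=(1, 3, 5, 7, 9, 11)):
    """images: a (N, C, H, W) tensor of benign or poisoned samples in [0, 1]."""
    original_labels = model(images).argmax(dim=1)
    curve = []
    for s in scales:
        probs = torch.softmax(model(torch.clamp(images * s, 0.0, 1.0)), dim=1)
        # Probability that each scaled sample keeps its original prediction.
        kept = probs[torch.arange(len(images)), original_labels]
        curve.append(kept.mean().item())
    return curve  # one average confidence per scaling factor
```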
---