Rebuttal_Template

## Author Response (Reviewer cftx)

Dear Reviewer cftx, thank you very much for your careful review of our paper and your thoughtful comments. We are encouraged by your positive comments on our **clear motivation**, **reasonable method pipeline**, **novel evaluation paradigm**, and **extensive experiments**. We hope the following responses help clarify potential misunderstandings and alleviate your concerns.

---

**Q1**: Equation (2) in section 4.3 is essentially obtained by replacing $y_t$ with $\hat{{y}}_t+\hat{{y}}$ in equation (1), aiming to achieve the limitations of positional generalization. Can you provide further explanation for the feasibility of this operation?

**R1**: Thanks for the insightful comment! We are deeply sorry that our submission may have led to some misunderstandings, which we would like to clarify.

- **Eq.(2) is not obtained by replacing $y_t$ with $\hat{{y}}_t+\hat{{y}}$ in Eq.(1)**. Specifically, $\hat{{y}}_t+\hat{{y}}$ appears in Eq.(1) but not in Eq.(2), while $y_t$ appears in Eq.(2) but not in Eq.(1). Accordingly, we cannot replace $y_t$ with $\hat{{y}}_t+\hat{{y}}$ in Eq.(1).
- **Eq.(2) is not essentially obtained by replacing $\hat{{y}}_t+\hat{{y}}$ with $y_t$ in Eq.(1)**. Formally, the penalty loss (in Eq.(1)) and the generalization loss (in Eq.(2)) differ fundamentally. Firstly, the penalty loss is a maximization, while the generalization loss is essentially a minimization since its objective is $-L$ instead of $L$. Secondly, the constraint in the penalty loss is $|m'|\leq t$, while that in the generalization loss is $|m' \cap m| \leq \tau \cdot |m|$. Thirdly, the generalization loss contains a term $\mu \cdot |m'|$ that is not included in the penalty loss.
- **Penalty loss and generalization loss have fundamentally different meanings**. In general, **penalty loss synthesizes the potential trigger pattern and penalizes its effects based on its distance to the original one**.
If the potential trigger pattern is close to the ground-truth one, the predictions of watermarked DNNs on samples containing this pattern should be similar to the target label; otherwise, their predictions should be similar to their ground-truth labels. In contrast, **generalization loss generates the most effective potential trigger pattern (other than the original one) and then minimizes its effects**.
- **Eq.(2) can feasibly reduce trigger generalization since it generates potential trigger patterns (other than the original one) and minimizes their effects**. In particular, as mentioned in Section 4.3 and illustrated in the Appendix (Section 4), **we designed an adaptive optimization method to find promising synthesized patterns** with the highest attack effectiveness and differences from the original trigger. This is also important for the success of our GLBW method.

We will add more details in Section 4.2 and Section 4.3 to better clarify Eq.(1) and Eq.(2) in our revision, to avoid potential misunderstandings.

---

**Q2**: There should be a visualization of GLBW to prove that the ranking is orderly and reliable, compared with the results shown in Figure 2.

**R2**: Thank you for this constructive suggestion! We are deeply sorry that our submission may have led to some misunderstandings, which we would like to clarify.

- **Figure 2 is simply a visual example** to better illustrate the unreliable nature of existing backdoor-based XAI evaluation methods. In this figure, we did not intend to summarize the experimental results obtained in our paper. Accordingly, it does not contain the results of our GLBW.
- To further alleviate your concerns, we hereby provide the ranks of SRV methods evaluated by our GLBW with original and synthesized triggers. As shown in the following Table 1, **our method leads to more consistent ranks** (compared to those of the standard backdoor-based watermark), although the IOU values may differ (as shown in the following Table 2).
Accordingly, our GLBW method is more reliable.

**Table 1.** The rank (based on the average IOU over all poisoned samples) of SRV methods evaluated by our GLBW with original and synthesized triggers.

| Dataset$\downarrow$ | Trigger$\downarrow$, SRV$\rightarrow$ | BP | GBP | GCAM | GGCAM | OCC | FA | LIME |
|:--------:|:------------:|:--:|:---:|:----:|:-----:|:---:|:--:|:----:|
| CIFAR-10 | Original | 4 | 4 | 6 | 6 | 2 | 3 | 1 |
| CIFAR-10 | Synthesized | 4 | 4 | 7 | 6 | 3 | 2 | 1 |
| GTSRB | Original | 4 | 4 | 6 | 6 | 2 | 3 | 1 |
| GTSRB | Synthesized | 4 | 4 | 6 | 6 | 2 | 3 | 1 |

**Table 2.** The average IOU over all poisoned samples of SRV methods evaluated by our GLBW with original and synthesized triggers.

| Dataset$\downarrow$ | Trigger$\downarrow$, SRV$\rightarrow$ | BP | GBP | GCAM | GGCAM | OCC | FA | LIME |
|:--------:|:------------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| CIFAR-10 | Original | 0.147 | 0.147 | 0.000 | 0.000 | 0.692 | 0.531 | 0.950 |
| CIFAR-10 | Synthesized | 0.030 | 0.030 | 0.000 | 0.002 | 0.358 | 0.392 | 0.529 |
| GTSRB | Original | 0.169 | 0.169 | 0.009 | 0.009 | 0.974 | 0.934 | 1.000 |
| GTSRB | Synthesized | 0.199 | 0.199 | 0.005 | 0.005 | 0.525 | 0.473 | 0.529 |

- To further alleviate your concerns, we also provide the visualization results of our GLBW, following the same format as Figure 2. As shown in Figure 1 in the rebuttal PDF, **our method leads to more consistent ranks**. PS: We cannot directly insert the figure here due to the limitations of the OpenReview system.

We will add more details in the appendix of our revision.

---

**Q3**: I am still confused about the consequences of the position generalization of the backdoor watermarks. I suggest the authors supply more examples, explanations, or references for the unreliability caused by generalization.

**R3**: Thank you for this question, and we are deeply sorry that we failed to explain it more clearly in our submission.
In general, **position generalization will make the results of backdoor-based SRV evaluation less reliable**. More details are as follows:

- **Backdoor-based SRV evaluation methods rely on a latent assumption that only the trigger used for training (dubbed 'original trigger') can activate backdoors**. These methods assume that trigger regions are the regions that contribute the most to the model's prediction (i.e., the target label) on poisoned samples, because the ground-truth labels of these samples are not the target label. Accordingly, they use the area of the original trigger as the ground-truth reference. For the backdoored model, they calculate the average intersection over union (IOU) between this area and the saliency areas generated by the SRV method over different backdoored samples, and use it as an indicator to evaluate the SRV method.
- **This assumption does not hold when the backdoor watermark has position generalization**. As shown in our Figure 4, there are many potential trigger patterns other than the original one that can still activate backdoors. In other words, **this assumption does not hold for existing watermarks**.
- **Its failure may lead to unreliable results**. For example, given an SRV method, assume that its generated saliency areas for most poisoned samples cover only a small part of the original trigger. According to backdoor-based SRV evaluation approaches, this SRV method will be rated very poorly since it has a small IOU. However, due to the generalization of backdoor watermarks, the model may have learned this local region rather than the whole trigger. In this case, the evaluated SRV method is in fact highly effective, contradicting the results of existing backdoor-based SRV evaluation methods.

We will add more details in the introduction and the appendix of our revision to make this clearer.

---

**Q4**: Will the settings of the two patterns in Figure 3 affect the ranking?
Empirically, I believe that both the appearance of the patterns and the number of pixels can fluctuate the results. Therefore, experimental data is needed to evaluate it.

**R4**: Thank you for this insightful question! Indeed, other than the location, both the appearance and the number of pixels (#pixels) may influence the ranking. This is why we exploited two patterns (in Figure 2) with different locations for evaluation. To further alleviate your concerns, we conduct additional experiments, as follows.

- **The Effects of Appearance (Shape)**: We design and adopt two additional trigger patterns, 'pencil' and 'triangle', with the same #pixels as the square-type trigger used in our submission. Other settings are the same as those used in Section 3.1 (standardized version). We calculate the average IOU over all poisoned samples and the rank based on it. As shown in the following tables, **appearance has only a mild effect on the final ranks**, although it may influence the average IOU values. These results also partly verify the reliability of our method.

**Table 3.** The rank (based on the average IOU over all poisoned samples) of SRV methods evaluated with different trigger appearances.

| Dataset$\downarrow$ | Trigger$\downarrow$, SRV$\rightarrow$ | BP | GBP | GCAM | GGCAM | OCC | FA | LIME |
|:--------:|:------------:|:--:|:---:|:----:|:-----:|:---:|:--:|:----:|
| CIFAR-10 | Square | 4 | 4 | 7 | 6 | 2 | 3 | 1 |
| CIFAR-10 | Pencil | 4 | 4 | 7 | 6 | 3 | 2 | 1 |
| CIFAR-10 | Triangle | 4 | 4 | 7 | 6 | 3 | 2 | 1 |
| GTSRB | Square | 4 | 4 | 7 | 6 | 2 | 3 | 1 |
| GTSRB | Pencil | 4 | 4 | 7 | 6 | 2 | 3 | 1 |
| GTSRB | Triangle | 4 | 4 | 7 | 6 | 2 | 3 | 1 |

**Table 4.** The average IOU over all poisoned samples of SRV methods evaluated with different trigger appearances.
| Dataset$\downarrow$ | Trigger$\downarrow$, SRV$\rightarrow$ | BP | GBP | GCAM | GGCAM | OCC | FA | LIME |
|:--------:|:------------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|
| CIFAR-10 | Square | 0.2123 | 0.2123 | 0.0000 | 0.2068 | 0.8849 | 0.4935 | 0.9792 |
| CIFAR-10 | Pencil | 0.1675 | 0.1675 | 0.0000 | 0.1653 | 0.6776 | 0.7228 | 0.9956 |
| CIFAR-10 | Triangle | 0.2534 | 0.2534 | 0.0000 | 0.2489 | 0.6777 | 0.8463 | 0.9771 |
| GTSRB | Square | 0.3438 | 0.3438 | 0.0000 | 0.3151 | 0.7388 | 0.4233 | 1.0000 |
| GTSRB | Pencil | 0.2604 | 0.2604 | 0.0000 | 0.2225 | 0.8978 | 0.7570 | 0.9979 |
| GTSRB | Triangle | 0.3003 | 0.3003 | 0.0000 | 0.2762 | 0.5627 | 0.4371 | 0.9890 |

- **The Effects of #Pixels**: We provide the results of the square-type trigger with different sizes (i.e., $3\times 3$, $4\times 4$, $5\times 5$). Other settings are the same as those used in Section 3.1 (standardized version). We calculate the average IOU over all poisoned samples and the rank based on it. As shown in the following tables, **the number of pixels also has only a mild effect on the final ranks**, although it may influence the average IOU values. These results again partly verify the reliability of our method.

**Table 5.** The rank (based on the average IOU over all poisoned samples) of SRV methods evaluated with different trigger sizes.

| Dataset$\downarrow$ | Size$\downarrow$, SRV$\rightarrow$ | BP | GBP | GCAM | GGCAM | OCC | FA | LIME |
|:--------:|:------------:|:--:|:---:|:----:|:-----:|:---:|:--:|:----:|
| CIFAR-10 | $3 \times 3$ | 4 | 4 | 7 | 6 | 2 | 3 | 1 |
| CIFAR-10 | $4 \times 4$ | 4 | 4 | 7 | 6 | 3 | 2 | 1 |
| CIFAR-10 | $5 \times 5$ | 4 | 4 | 7 | 6 | 2 | 3 | 1 |
| GTSRB | $3 \times 3$ | 4 | 4 | 7 | 6 | 2 | 3 | 1 |
| GTSRB | $4 \times 4$ | 4 | 4 | 7 | 6 | 3 | 1 | 2 |
| GTSRB | $5 \times 5$ | 4 | 4 | 7 | 6 | 2 | 3 | 1 |

**Table 6.** The average IOU over all poisoned samples of SRV methods evaluated with different trigger sizes.
| Dataset$\downarrow$ | Size$\downarrow$, SRV$\rightarrow$ | BP | GBP | GCAM | GGCAM | OCC | FA | LIME |
|:--------:|:------------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|
| CIFAR-10 | $3 \times 3$ | 0.2123 | 0.2123 | 0.0000 | 0.2068 | 0.8849 | 0.4935 | 0.9792 |
| CIFAR-10 | $4 \times 4$ | 0.2867 | 0.2867 | 0.0000 | 0.2814 | 0.8796 | 0.9821 | 0.9841 |
| CIFAR-10 | $5 \times 5$ | 0.3811 | 0.3811 | 0.0000 | 0.3798 | 0.7255 | 0.6672 | 0.9923 |
| GTSRB | $3 \times 3$ | 0.3438 | 0.3438 | 0.0000 | 0.3151 | 0.7388 | 0.4233 | 1.0000 |
| GTSRB | $4 \times 4$ | 0.3360 | 0.3360 | 0.0000 | 0.2857 | 0.7803 | 1.0000 | 0.9914 |
| GTSRB | $5 \times 5$ | 0.3436 | 0.3436 | 0.0000 | 0.2861 | 0.8364 | 0.5205 | 0.9838 |

We will add more details in the appendix of our revision.

---

## Author Response (Reviewer WLLB)

Dear Reviewer WLLB, thank you very much for your careful review of our paper and your thoughtful comments. We are encouraged by your positive comments on our **valuable analysis**, **novel and interesting method**, **good presentation**, and **good soundness**. We hope the following responses help clarify potential misunderstandings and alleviate your concerns.

---

**Q1**: What does the description from lines 143 to 145 mean?

**R1**: Thank you for this question, and we are deeply sorry that we failed to explain it more clearly in our submission. We hereby provide more explanations.

- Given an image and a trained model, existing SRV methods need to obtain the 'influence value' (e.g., gradient) of each pixel on the model's prediction for the image before generating its saliency map. **The influence value can be positive or negative**. A positive value indicates that increasing the pixel value will have a positive effect on the prediction.
- In general, **to obtain the saliency map, we usually need to take the absolute value of each influence value before SRV methods filter out the most critical pixel positions** (those with sufficiently high influence scores). Otherwise, pixels having significantly negative effects on the prediction will not be selected.
- However, the existing backdoor-based SRV evaluation **[13] only took the absolute value for BP while keeping the original influence value for other SRV methods** (i.e., GBP, GCAM, GGCAM, OCC, FA, LIME).
- **This inconsistent setting will lead to unreliable results** (as we explained before). Accordingly, we take the absolute value for all SRV methods to evaluate them under the same fair setting.

We will add more details in Section 3.1 and the appendix in our revision.

---

**Q2**: I could not see the description of how to calculate the average ranges used in the experiment. This is important to judge the experimental results.

**R2**: Thank you for your question, and we are deeply sorry that we failed to explain it more clearly in our submission. However, we are not sure what 'average ranges' refers to, since no such quantity is introduced in our experiments. **We guess that you mean 'average ranks' instead of 'average ranges'.** To calculate the average ranks of SRV methods, we first calculate the average IOU value of each SRV method across all poisoned samples with each trigger pattern on each dataset. After that, we calculate the rank of all SRV methods in each scenario (trigger+dataset). Finally, we average the rank of each method over all triggers to obtain the average ranks reported in our submission. We will provide more details in Section 3.1 and the appendix in our revision.

In particular, **please let us know if we have misunderstood 'average ranges', and if so, please provide more detailed references and information**.
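
The three-step average-rank procedure described above can be sketched as follows. This is a minimal illustration rather than our released implementation. It assumes that tied methods share the smallest rank, which is consistent with BP and GBP both receiving rank 4 in our tables. The function names `competition_rank` and `average_ranks` are hypothetical, and the example values are the CIFAR-10 rows for the $3\times 3$ and $4\times 4$ triggers from Table 6 in our response to Reviewer cftx.

```python
SRV_METHODS = ["BP", "GBP", "GCAM", "GGCAM", "OCC", "FA", "LIME"]


def competition_rank(mean_iou):
    """Rank methods by mean IOU, highest first (1 = best).

    Assumption: tied methods share the smallest rank, e.g. two methods tied
    for 4th place both get rank 4 and the next method gets rank 6.
    """
    # A method's rank is 1 plus the number of methods with a strictly higher score.
    return [1 + sum(other > score for other in mean_iou) for score in mean_iou]


def average_ranks(mean_iou_per_scenario):
    """Average the per-scenario ranks of each method over all scenarios (rows)."""
    per_scenario = [competition_rank(row) for row in mean_iou_per_scenario]
    num_scenarios = len(per_scenario)
    return [sum(column) / num_scenarios for column in zip(*per_scenario)]


# Mean IOU per SRV method; one row per scenario (trigger + dataset).
mean_iou = [
    [0.2123, 0.2123, 0.0000, 0.2068, 0.8849, 0.4935, 0.9792],  # CIFAR-10, 3x3 trigger
    [0.2867, 0.2867, 0.0000, 0.2814, 0.8796, 0.9821, 0.9841],  # CIFAR-10, 4x4 trigger
]

for name, rank in zip(SRV_METHODS, average_ranks(mean_iou)):
    print(f"{name}: {rank:.2f}")
```

On these two rows, the per-scenario ranks produced by the sketch agree with the corresponding rows of Table 5 in the same response (4, 4, 7, 6, 2, 3, 1 and 4, 4, 7, 6, 3, 2, 1).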
We are happy to explain this further during the discussion period :)

---

**Q3**: Regarding the selection of $M$ pixels, I do not think it is significantly better than the previous method. One example is that the size of the calculated saliency is significantly larger than $M$, while the size of the trigger is small. Because of $M$, the IoU value is high, but in reality, it is small due to the union.

**R3**: Thank you for this insightful comment! We hereby provide more explanations to alleviate your concerns.

- Existing SRV methods require setting a pre-defined threshold to filter out the most critical regions. However, due to different factors (e.g., the way of calculating influence scores), different SRV methods produce very different values even under the same setting. Accordingly, **setting the same threshold to compare them is unfair**. Besides, **it is hard to select an 'optimal threshold' for each method without human inspection**.
- We argue that selecting the $M$ pixel locations with the maximum saliency values as the significant regions for analysis is, to a large extent, a feasible solution. Specifically, they can be regarded as the most critical regions identified by the evaluated SRV method, although there might be larger significant areas with relatively high saliency values, as you mentioned. **Comparing the ability of different SRV methods to recognize the most critical regions of the same size is a more equitable approach**.
- However, we do understand your concerns. We hereby exploit a classical adaptive binarization method (i.e., [OTSU](https://cw.fel.cvut.cz/b201/_media/courses/a6m33bio/otsu.pdf)) to obtain regions with high influence and calculate the rank of each SRV method. We compare the results of our standardized method (dubbed 'Ours') with those of the standardized method with adaptive binarization (dubbed 'Adaptive'). As shown in the following table, **using the adaptive binarization method has only a mild influence on the final average ranking**.
These results verify the effectiveness of our method.

**Table 1.** The rank (based on the average IOU over all poisoned samples), averaged across four trigger patterns, of SRV methods evaluated with different methods.

| Dataset$\downarrow$ | Method$\downarrow$, SRV$\rightarrow$ | BP | GBP | GCAM | GGCAM | OCC | FA | LIME |
|:--------:|:---------:|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|
| CIFAR-10 | Ours | 4.75 | 4.75 | 7 | 4.5 | 2 | 3 | 1 |
| CIFAR-10 | Adaptive | 4.5 | 4.5 | 7 | 5 | 2.5 | 2.5 | 1 |
| GTSRB | Ours | 4.5 | 4.5 | 7 | 5 | 2 | 2.75 | 1.25 |
| GTSRB | Adaptive | 4.75 | 4.75 | 6.5 | 5 | 2.25 | 2.75 | 1 |

We will provide more details in the appendix of our revision. Besides, due to the limited rebuttal time, we are not able to re-calculate all results with the adaptive binarization method. However, if you think it is necessary, we promise that we will conduct the remaining experiments after the rebuttal and provide all of them in the appendix of our final version.

---

**Q4**: The paper treats the universal adversarial perturbations and the triggers the same, while their characteristics are different. Backdoor attacks are about to provide bad training data containing the triggers, in which the models learn the features of the triggers. Adversarial perturbations, on the other hand, are tiny malicious noises causing a change in the decision boundaries. They could not be considered as location generalizations of trigger patterns.

**R4**: Thank you for this insightful comment! We hereby provide more explanations to alleviate your concerns.

- We admit that we treat universal adversarial perturbations (UAPs) and triggers equally in our optimization process designed for reducing trigger generalization, as you pointed out. In general, universal adversarial perturbations can be regarded as the triggers of the 'natural backdoors' that models learn from samples ([Wenger et al.
2022](https://proceedings.neurips.cc/paper_files/paper/2022/file/8af749935131cc8ea5dae4f6d8cdb304-Paper-Datasets_and_Benchmarks.pdf)). Accordingly, they have very similar properties to backdoor triggers, and therefore **it is very difficult (or probably even impossible) to distinguish them from original and potential triggers**.
- **Even though we treat them the same, our method can still reduce trigger generalization** since it minimizes both trigger generalization and the risk of UAPs simultaneously during its optimization process.
- Besides, **minimizing the risk of UAPs may have potential benefits in reducing trigger generalization**. Recent studies (e.g., [Andrew et al. 2019](https://proceedings.neurips.cc/paper_files/paper/2019/file/e2c420d928d4bf8ce0ff2ec19b371514-Paper.pdf)) revealed that adversarially robust models focus more on 'robust features' instead of non-robust ones (e.g., textures). Accordingly, our method may make DNNs rely more on the original trigger pattern for poisoned samples and could therefore reduce trigger generalization.

We will provide more discussions in the appendix of our revision and further explore this problem in our future work.

---

**Q5**: What is wrong with the location generalization of trigger patterns? The purpose of the measurement is to check if the salience is overlapped with the trigger. If they are overlapped, it means that the SRV method correctly visualizes the learned features. I would like to see this discussion in the paper.

**R5**: Thank you for this question, and we are deeply sorry that we failed to explain it more clearly in our submission. In general, **position generalization will make the results of backdoor-based SRV evaluation less reliable**. More details are as follows:

- **Backdoor-based SRV evaluation methods rely on a latent assumption that only the trigger used for training (dubbed 'original trigger') can activate backdoors**.
These methods assume that trigger regions are the regions that contribute the most to the model's prediction (i.e., the target label) on poisoned samples, because the ground-truth labels of these samples are not the target label. Accordingly, they use the area of the original trigger as the ground-truth reference. For the backdoored model, they calculate the average intersection over union (IOU) between this area and the saliency areas generated by the SRV method over different backdoored samples, and use it as an indicator to evaluate the SRV method.
- **This assumption does not hold when the backdoor watermark has position generalization**. As shown in our Figure 4, there are many potential trigger patterns other than the original one that can still activate backdoors. In other words, **this assumption does not hold for existing watermarks**.
- **Its failure may lead to unreliable results**. For example, given an SRV method, assume that its generated saliency areas for most poisoned samples cover only a small part of the original trigger. According to backdoor-based SRV evaluation approaches, this SRV method will be rated very poorly since it has a small IOU. However, due to the position generalization of backdoor watermarks, the model may have learned this local region rather than the whole trigger. In this case, the evaluated SRV method is in fact highly effective, contradicting the results of existing backdoor-based SRV evaluation methods.

We will add more details in the introduction and the appendix of our revision to make this clearer.

---

## Author Response (Reviewer p94E)

Dear Reviewer p94E, thank you very much for your careful review of our paper and your thoughtful comments. We are encouraged by your positive comments on our **extensive experiments** and **intriguing findings of trigger generalization**. We hope the following responses help clarify potential misunderstandings and alleviate your concerns.

---

**Q1**: This paper is not well-written and a little hard to read.
**R1**: Thank you for this comment, and we are deeply sorry that we failed to write our submission more clearly. Could you please provide more details and explanations for this comment? We are willing to provide more explanations and further polish our paper based on them during the discussion period.

---

**Q2**: The motivation for the proposed method is not clear to me.

**R2**: Thank you for this comment, and we are deeply sorry that we failed to state our motivation more clearly. Could you please provide more details and explanations for this comment? We are willing to provide more explanations and further polish our paper based on them during the discussion period.

---

**Q3**: The evaluation is limited.

**R3**: Thank you for this comment, and we are deeply sorry that we may have failed to provide more comprehensive evaluations in our submission. Could you please provide more details and explanations for this comment? We are willing to provide more explanations and experiments based on them during the discussion period.

---

**Q4**: Why would having potential triggers for the models be a problem for XAI evaluation?

**R4**: Thank you for this question, and we are deeply sorry that we failed to explain it more clearly in our submission. In general, **position generalization will make the results of backdoor-based SRV evaluation less reliable**. More details are as follows:

- **Backdoor-based SRV evaluation methods rely on a latent assumption that only the trigger used for training (dubbed 'original trigger') can activate backdoors**. These methods assume that trigger regions are the regions that contribute the most to the model's prediction (i.e., the target label) on poisoned samples, because the ground-truth labels of these samples are not the target label.
Accordingly, they use the area of the original trigger as the ground-truth reference. For the backdoored model, they calculate the average intersection over union (IOU) between this area and the saliency areas generated by the SRV method over different backdoored samples, and use it as an indicator to evaluate the SRV method.
- **This assumption does not hold when the backdoor watermark has position generalization**. As shown in our Figure 4, there are many potential trigger patterns other than the original one that can still activate backdoors. In other words, **this assumption does not hold for existing watermarks**.
- **Its failure may lead to unreliable results**. For example, given an SRV method, assume that its generated saliency areas for most poisoned samples cover only a small part of the original trigger. According to backdoor-based SRV evaluation approaches, this SRV method will be rated very poorly since it has a small IOU. However, due to the generalization of backdoor watermarks, the model may have learned this local region rather than the whole trigger. In this case, the evaluated SRV method is in fact highly effective, contradicting the results of existing backdoor-based SRV evaluation methods.

We will add more details in the introduction and the appendix of our revision to make this clearer.

---

**Q5**: The explainability metric is based on simple backdoored patterns, but why would it be a good reference for complicated features in real images for practice?

**R5**: Thank you for this insightful question! We admit that our GLBW and other existing backdoor-based SRV evaluation methods all rank SRV methods based on their performance in explaining simple backdoored patterns (instead of high-level complicated features). However, **this does not necessarily mean that our method is impractical**. We hereby provide more explanations:

- In practice, **humans categorize a given image based on its local regions in most cases**.
For example, we only need to see the 'head' region of an image to know that it shows a bird. **These local regions are similar to the trigger patterns** (e.g., a patch) used in our method. Accordingly, the results produced by our method can be a good reference for real images in practice.
- Currently, **it is impossible to faithfully evaluate the performance of SRV methods for complicated features** since there is no ground-truth saliency map for them. Even a human expert cannot mark the saliency map for complicated features.
- Simple backdoored patterns are also features used by DNNs for their predictions. Accordingly, **the evaluation of SRV methods on trigger features is the first and the most important step toward evaluating their general performance** and is therefore of great significance.

We will provide more details and explanations in the introduction and the appendix of our revision.

---

**Q6**: Due to the training scheme of GLBW, the clean accuracy of the model is very low for CIFAR-10. I think the model weights currently are in a weird local minimum, so would the explainability result be meaningful in this case?

**R6**: Thank you for this insightful question! We are deeply sorry that we failed to explain it more clearly in our submission. We hereby provide more details to alleviate your concerns.

- **The decrease in clean accuracy is caused by minimizing the risks of universal adversarial perturbations during our optimization process**, not by the model weights being in a weird local minimum. Specifically, we treat universal adversarial perturbations (UAPs) and potential triggers equally during our optimization process, mostly because we can hardly separate them in practice. Accordingly, **our GLBW has a similar effect to conducting adversarial training on UAPs**, which significantly decreases clean accuracy, especially when the task is relatively complicated. Note: the GTSRB task is significantly easier than the CIFAR-10 task.
- **Having a relatively low clean accuracy will not reduce the reliability and practicality of our method**. Firstly, our evaluation is based on poisoned samples instead of clean samples. As such, **we only need to ensure that our method leads to a high watermark success rate (instead of a high clean accuracy) for faithful results**. The watermark success rates are higher than 90\% in all cases, which is sufficiently high. Secondly, as mentioned in our experiments, **the watermarked DNNs are only used for evaluating SRV methods, not for deployment**. Accordingly, the decrease in clean accuracy caused by our method will not hinder its usefulness.
- The clean accuracy of models trained with our method is higher than 78\% in all cases. Arguably, **this accuracy is not very low since the models make correct predictions in most cases**.

We will provide more details and explanations in the appendix of our revision.

---

## Author Response (Reviewer Vt99)

Dear Reviewer Vt99, thank you very much for your careful review of our paper and your thoughtful comments. We are encouraged by your positive comments on our **thorough analysis**, **method design**, **convincing experiments**, **good presentation**, and **good soundness**. We hope the following responses help clarify potential misunderstandings and alleviate your concerns.

---

**Q1**: This paper claims that their method achieves a more reliable and accurate ranking. However this paper only shows that, based on their backdoored model, the ranking result of saliency-based representation visualization (SRV) methods is consistent. In my opinion, there is a gap between "consistent ranking" and "more reliable and accurate ranking".

**R1**: Thank you for this insightful comment! We hereby provide more explanations to alleviate your concerns.

- In Section 3.1, we revealed and standardized three implementation limitations of the existing backdoor-based SRV evaluation.
**Our standardizations ensure that evaluations are conducted under a consistent and fair setting**. Accordingly, our results should be more reliable and accurate than those of the existing backdoor-based evaluation [13].
- As illustrated in our introduction, **backdoor-based SRV evaluation methods rely on a latent assumption that existing backdoor attacks have low trigger generalization**. However, existing backdoor watermarks have high trigger generalization. In contrast, our method can reach low generalization and therefore leads to more reliable and accurate results than the existing backdoor-based evaluation [13]. Please refer to **R4 for Reviewer p94E** for more details about why trigger generalization is important for XAI evaluation.
- We admit that consistent ranking does not necessarily imply reliable and accurate ranking. However, **it is a prerequisite for reliable and accurate results**. As such, it is reasonable to say that our results are more reliable and accurate than those of the existing backdoor-based evaluation [13].

We will add more details in Section 3 and the appendix in our revision to avoid potential misunderstandings.

---

**Q2**: Based on Eq. 2, this paper backdoors a neural network like adversarial training [1] by generating universal adversarial perturbations. In fact, on line 278, the authors also talk about the degradation of clean accuracy of the backdoored model, which is also the side-effect of adversarial training. Here comes my question: can we trust the consistent ranking based on the backdoored model which is different from the vanilla model in terms of training dynamics? If we can, can we infer that a SRV method with higher order in the ranking list is better than that with lower order? If we can, please demonstrate it with the specific experiments.

**R2**: Thank you for these insightful questions! We hereby provide more explanations to alleviate your concerns.
- **SRV methods target all DNNs, not just vanilla models**. Accordingly, a more consistent ranking on backdoored models is of great significance.
- The training process of DNNs is essentially learning features. Even for obtaining a vanilla model, there are many different training dynamics, such as standard supervised learning, semi-supervised learning, and self-supervised learning followed by fine-tuning. **We do not need to focus on the specific form of training, but rather on what kind of features the model learns**. Accordingly, we can trust the consistent ranking based on the backdoored model, since it learns a given backdoor pattern that can be used as a reference when calculating the IOU values.
- To further alleviate your concerns, we generate 10 groups of saliency maps based on 10 randomly selected images on GTSRB. Each group contains 8 images: one saliency map for each SRV method (7 in total) and one poisoned image with the trigger for reference. We conduct human inspection experiments by inviting 10 people and asking them to grade all groups of saliency maps independently. As shown in the following table, **the rankings generated by our GLBW method are similar to those given by people**. This result supports the inference that an SRV method with a higher order in the ranking list is better than one with a lower order.

**Table 1.** The average rank of our evaluation and human inspection on GTSRB.

| Method$\downarrow$, SRV$\rightarrow$ | BP | GBP | GCAM | GGCAM | OCC | FA | LIME |
|:--------:|:---------:|:----:|:----:|:----:|:-----:|:----:|:----:|
| Ours | 4.60 | 4.60 | 7.00 | 4.80 | 2.00 | 2.80 | 1.20 |
| Human | 4.00 | 4.00 | 6.00 | 6.00 | 2.00 | 3.00 | 1.00 |

We will add more details in the appendix of our revision.

<font color="red">**Note:** This result has changed. If the reviewer asks about it, admit that Table 4 was run incorrectly: it used BadNets rather than our GLBW.</font>

---

**Q3**: The phenomenon of trigger generalization is discovered by previous work on line 88.
This paper mainly performs empirical studies to reveal this phenomenon in Sec. 3.2. I do not see novel findings or theoretical analyses. Moreover, the way of generating the potential trigger is based on existing works of universal adversarial perturbation (UAP). Overall, the originality of this paper is limited.

**R3**: Thank you for these comments. We are deeply sorry that our submission may have led to some misunderstandings, which we would like to clarify here.

- We admit that a few existing works (those on line 88) **initially** revealed the existence of trigger generalization. In general, they simply showed that the trigger pattern reversed by Neural Cleanse may be different from the original one **in a few cases**. In this paper, however, we dive deeper into this problem: we **provide a systematic visualization of trigger generalization and reveal its statistical patterns** (loss and distance). To the best of our knowledge, no previous work provides such analyses.
- In particular, **we are the first to reveal the significance of trigger generalization** in the important topic of SRV evaluation. More importantly, **we are the first to try to control and manipulate trigger generalization**. In other words, **analyzing the phenomenon of trigger generalization is only a small part of our contributions**.
- **Our method has fundamental differences compared to those of UAP**. Firstly, our method needs to optimize the mask and the perturbation simultaneously, while UAP methods do not; Secondly, we intend to make the positions of the generated perturbations significantly different from that of the original trigger pattern, which is not necessary for UAP methods; Thirdly, as we illustrated in our appendix (Section 4), we designed an adaptive optimization method to find promising synthesized patterns.
However, this approach is not involved in existing UAP works; Fourthly, our approach needs to consider the size and location of triggers/perturbations, while existing UAP methods generally do not.
- Besides, optimization techniques serve the method rather than define it. We have already provided detailed motivations and analyses of our method designs. We argue that it is unfair to claim that our method is not novel simply because our inner maximization has some similarities to UAP, not to mention that they have fundamental differences.

We will provide more discussions in the appendix of our revision to avoid potential misunderstandings.

---

**Q4**: This paper only targets one backdoor baseline, BadNet, without further experiments on other backdoor baselines, raising my question about the generalization and effectiveness of their method.

**R4**: Thank you for these comments. We are deeply sorry that our submission may have led to some misunderstandings, which we would like to clarify here.

- **Patch-based poisoned-label backdoor watermark is the most suitable and probably the only suitable method for backdoor-based SRV evaluation**. Specifically, clean-label attacks, where the adversaries only poison samples from the target class, are not suitable, since the backdoored DNNs will use both 'ground-truth' features (related to the target class) and trigger features, instead of just trigger features, for classifying poisoned samples. The trigger patterns of all existing classical non-patch-based attacks cover the full image. Accordingly, they are also not suitable for SRV evaluation.
- BadNets is the first and most classical patch-based poisoned-label backdoor watermark. Accordingly, **we follow this setting used in our baseline research [13] for our evaluation**.
- To further alleviate your concerns, we conduct additional experiments with the two remaining patch-based poisoned-label backdoor watermarks: the [blended attack](https://arxiv.org/pdf/1712.05526.pdf) (i.e., BadNets with trigger transparency) and a poisoned-label attack with a patch-size additive trigger (dubbed 'additive attack'). As shown in the following table, **our GLBW method is still effective under blended attack and additive attack**. These results verify the generalization and effectiveness of our method.

**Table 2.** The generalization and effectiveness evaluation of our method on other backdoor baselines on GTSRB.

| Watermark | BA | WSR | Chamfer | PLG |
|:--------:|:---------:|:----:|:----:|:----:|
| Blended | 97.55 | 91.79 | 61.59 | 94.60 |
| GLBW (Blended) | 90.10 | 88.03 | **9.81** | **100.00** |
| Additive | 97.60 | 95.01 | 66.47 | 95.00 |
| GLBW (Additive) | 94.93 | 91.92 | **26.42** | **100.00** |

We will provide more discussions and details in Section 3 and the appendix of our revision to avoid potential misunderstandings.

---

Experimental settings:

1. Dataset: GTSRB
2. Trigger: a 3×3 white patch in the bottom-right corner
3. For the Blended method, weight = 0.5; for the Additive method, the added pixel value is 64
4. Potential triggers are searched for with the Neural Cleanse method
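
To make the experimental settings above concrete, the following is a minimal sketch of how the three patch triggers might be stamped onto an image. It is an illustration under assumptions rather than our exact implementation: the function names, the `uint8` H×W×C image layout, the blending rule `(1 - weight) * x + weight * 255` applied only inside the patch, and the clipping to [0, 255] for the additive trigger are all assumed here. Only the patch size, its position, the weight, and the additive value come from the settings listed above.

```python
import numpy as np

PATCH_SIZE = 3      # 3x3 trigger patch (from the settings above)
BLEND_WEIGHT = 0.5  # blended attack weight (from the settings above)
ADD_VALUE = 64      # additive pixel value (from the settings above)


def patch_region(image, size=PATCH_SIZE):
    """Return row/column slices for the bottom-right size x size patch."""
    height, width = image.shape[:2]
    return slice(height - size, height), slice(width - size, width)


def badnets_trigger(image):
    """Overwrite the patch with white pixels (assumed BadNets-style stamp)."""
    out = image.copy()
    rows, cols = patch_region(out)
    out[rows, cols] = 255
    return out


def blended_trigger(image, weight=BLEND_WEIGHT):
    """Blend a white patch into the image with the given transparency weight."""
    out = image.astype(np.float32)
    rows, cols = patch_region(out)
    out[rows, cols] = (1.0 - weight) * out[rows, cols] + weight * 255.0
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def additive_trigger(image, value=ADD_VALUE):
    """Add a constant to the patch pixels, clipping to the valid range."""
    out = image.astype(np.int16)  # widen first to avoid uint8 overflow
    rows, cols = patch_region(out)
    out[rows, cols] += value
    return np.clip(out, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    # Hypothetical 32x32 RGB image with a constant gray value.
    image = np.full((32, 32, 3), 101, dtype=np.uint8)
    for stamp in (badnets_trigger, blended_trigger, additive_trigger):
        poisoned = stamp(image)
        print(stamp.__name__, poisoned[-1, -1], poisoned[0, 0])
```

Each function leaves every pixel outside the bottom-right patch untouched, so the three variants differ only in how strongly the patch deviates from the clean image.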
