# SaTML Rebuttal

## Reviewer 9Loo

Thank you for reviewing our paper; we are glad the reviewer raised insightful questions that we can discuss further here:

**[1] Input space attacks vs Feature targeted attacks**:

- Indeed, the pre-trained classifier does not have to be acquired by contrastive learning and could be any classifier. In fact, in our initial assessment we used a pre-trained Inception-V3 as the classifier and observed the same findings on the same attacks considered in this paper. However, we consider it a rather weak classifier (e.g., approximately 78% clean accuracy on CIFAR-10 compared to 93.03% for ResNet-50 pre-trained with contrastive learning). As far as we know, contrastive learning is one of the strongest techniques for learning good feature extractors, so it is the ideal model to evaluate.
- For any end-to-end training scheme, **only input space attacks** are applicable, as no target feature can be obtained for a dynamic $f$. Lu et al., 2023 demonstrate that GC is powerful in such scenarios.
- In our case of poisoning a pre-trained classifier, input space attacks are still the most direct approach. However, we find the optimization to be difficult, which motivates us to break the constrained problem down into several stages.
- Of course, feature targeted attacks are only one way to mitigate the optimization difficulty, and an interesting future direction is to develop better optimization methods to improve the GC input space attack directly.

**[2] Comparison between GC-input and Feature Matching**:

- It is true that GC-input (with constraints) and Feature Matching ($\beta=0.25$) both generate poisoned points within the normal data range and only add imperceptible noise, so it is hard to compare the perturbations directly beyond visualization and attack performance.
- However, it is possible to compare them indirectly by evaluating their performance against data sanitization defenses. Here we perform an experiment with a distribution-wise certified defense called Sever [1], which removes an $\epsilon_d$ fraction of training points with the highest outlier scores, defined via the top singular value of the gradient matrix (a minimal sketch of this scoring step is given at the end of this item). In the table below (+ indicates the accuracy recovered by the defense), we observe that Feature Matching is more effective than GC-input, with or without the Sever defense, which may suggest that the noise introduced by Feature Matching is harder to detect.

| $\epsilon_d=0.03$ | Before defense | After defense |
|---------------------------------|----------------|----------------|
| GC-input (with constraints) | -2.5% | -1.3% / +1.2% |
| Feature Matching ($\beta=0.25$) | -6.3% | -5.6% / +0.7% |

*[1] Diakonikolas et al., "Sever: A Robust Meta-Algorithm for Stochastic Optimization". In ICML 2019.*

- We expand on the necessity of imperceptible noise here. A successful attack should have two properties: (a) good attack performance, indicated by the accuracy drop; (b) robustness against possible defenses, i.e., the poisoned points stay within the valid data range and remain visually similar to normal data. We have quantified property (a) in Table 1 and Table 2 in the paper. Here we extend the above experiment by also considering out-of-bound attacks to quantify property (b). In the table below, we observe that although attacks with unconstrained noise perform well without the Sever defense, their effectiveness is largely reduced by data sanitization. Thus we conclude that imperceptible noise is a desired property and makes the poisoned samples more robust against possible defenses.

| $\epsilon_d=0.03$ | Before defense | After defense |
|---------------------------------|----------------|------------------|
| GC-input (with constraints) | -2.50% | -1.30% / +1.20% |
| GC-input (no constraints) | -29.54% | -1.10% / +28.44% |
| Feature Matching ($\beta=0.25$) | -6.30% | -5.60% / +0.70% |
| Feature Matching ($\beta=0.05$) | -14.37% | -3.39% / +10.98% |
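For concreteness, the snippet below is a minimal sketch of the Sever scoring step described above, assuming per-sample loss gradients are already stacked into a matrix; the function name `sever_filter` and the single-round form are ours for illustration (the full defense of Diakonikolas et al. iterates filtering and re-training).

```python
import numpy as np

def sever_filter(grads: np.ndarray, eps_d: float) -> np.ndarray:
    """One round of Sever-style filtering (illustrative sketch).

    grads: (n, d) matrix of per-sample loss gradients w.r.t. the model parameters.
    eps_d: fraction of training points to remove (e.g., 0.03).
    Returns the indices of the points to keep.
    """
    n = grads.shape[0]
    centered = grads - grads.mean(axis=0, keepdims=True)
    # Top right-singular vector of the centered gradient matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_dir = vt[0]
    # Outlier score: squared projection onto the top singular direction.
    scores = (centered @ top_dir) ** 2
    # Drop the eps_d fraction of points with the highest scores.
    n_remove = int(np.ceil(eps_d * n))
    return np.argsort(scores)[: n - n_remove]
```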
**[4] CIFAR-10 images**: indeed, CIFAR-10 images are only of $32\times32$ resolution and can appear saturated when zoomed in.

**[5]** We have added raw numbers as a direct indication of the poisoning budget in Table 1 and Table 2, marked in teal.

Please let us know if we have adequately addressed your concerns; we are happy to discuss further.

## Reviewer f5HB

We would first like to thank the reviewer for acknowledging our contribution. Some of the concerns can be addressed now, and we elaborate on them in this reply. For those that require additional experiments, we list our plans below and will update our results as soon as possible.

**[1] Regarding targeted attacks:**

- Firstly, we want to clarify our threat model: (1) in this paper we only focus on indiscriminate (availability) attacks, where an adversary aims at decreasing the overall test accuracy; (2) we do apply a clean-label targeted attack (the feature collision attack) to our setting; however, the threat model of targeted attacks (where an adversary aims at altering the prediction of a specific test sample) is not considered.
- Secondly, the two threat models differ in their objectives and assumptions; for example, indiscriminate attacks do not require knowledge of the test sample. Thus the two types of attacks cannot be used interchangeably without certain changes. In this paper, we show that feature collision attacks can be adapted to our setting with moderate modification.
- Thirdly, we agree with the reviewer that targeted attacks are a strong threat and it is intriguing to study them when poisoning a fixed feature extractor. Indeed, previous works, e.g., Shafahi et al. 2018, have explored this direction by examining the feature collision attack on fine-tuning and transfer learning. Shafahi et al. observe that their attack is more effective against fine-tuning than against end-to-end training, which is very different from our observation on indiscriminate attacks, where fine-tuning is less vulnerable than end-to-end training.
- In summary, in the context of data poisoning, the study of fine-tuning pre-trained feature extractors exists for targeted attacks but is lacking for indiscriminate attacks. We hope our paper fills this gap. We have also added a paragraph in Section 3.A specifying our target model, marked in teal.

**[2] Cross-modality:** We agree that CLIP is widely used and should be considered. Specifically, CLIP models, consisting of an image encoder and a text encoder, are obtained by performing contrastive learning on image-text pairs. For downstream tasks, the two encoders can be used separately for transfer learning on smaller datasets:

- Image encoder: we plan to conduct an experiment on poisoning the CLIP image encoder (ResNet-50 backbone) when transferring to CIFAR-10 image classification (a sketch of the clean transfer pipeline we plan to poison is given after this list).
- Text encoder: regrettably, poisoning language models is a non-trivial task, as modifications to clean text are relatively heuristic (such as deleting, reordering, or substituting), i.e., it is not straightforward to design an algorithm that creates an optimal perturbation. As the methodology is not clear at this moment, we leave it to future work.
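To make the planned experiment concrete, below is a minimal sketch of the clean transfer pipeline we intend to poison, assuming the open-source `clip` package and a simple linear probe on frozen CLIP RN50 features; the training hyperparameters shown are placeholders rather than our final experimental settings.

```python
import clip  # https://github.com/openai/CLIP
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen CLIP image encoder (ResNet-50 backbone); `preprocess` resizes CIFAR-10 images to CLIP's input size.
model, preprocess = clip.load("RN50", device=device)
model.eval()

train_set = CIFAR10(root="./data", train=True, download=True, transform=preprocess)
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)

# Linear probe on top of the frozen 1024-dimensional CLIP RN50 image features.
probe = nn.Linear(1024, 10).to(device)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for images, labels in loader:  # one pass shown for brevity
    images, labels = images.to(device), labels.to(device)
    with torch.no_grad():
        feats = model.encode_image(images).float()  # features from the frozen encoder
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this setting, a poisoning attack would perturb (part of) the CIFAR-10 training images before they enter this loop, while the CLIP encoder itself stays fixed.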
**[3] Figure 6**: Thank you for the suggestion; we have changed Figure 6 to a histogram for better interpretability.

**[4] Hardware**: Thank you for pointing this out; we have changed the sentence to:

> Fine-tuning experiments are run on a cluster with NVIDIA T4 and P100 GPUs, while transfer learning experiments are run on another machine with 2 NVIDIA 4090 GPUs.

**[5] Colors in Tables I and II**: sorry for the confusion. In these tables, a white background indicates methods that generate target parameters or features, blue indicates input space attacks, and pink indicates feature targeted attacks. We have added the corresponding descriptions to the captions.

**[6] Difference from feature collision attacks**: indeed, we adapt the feature collision attack (which is designed for targeted attacks) as part of our feature targeted attack, namely step (3) in Equation 12 and lines 7-9 in Algorithm 1. The differences lie in the choice of target features, the additional GC feature space attack, and the threat model (we consider indiscriminate attacks). Of course, we do not claim originality for this part of the attack and we have credited Shafahi et al. (2018) for the algorithm.

**[7] "No constraint" results**: when there are no constraints, the attack effectiveness generally increases with the attack budget:

- For the TGDA attack: the attack performance increases significantly when $\epsilon_d$ is larger.
- For the GC attack: taking Table 1 as an example, the target parameter (the GradPC (input space) row) yields an accuracy drop of 36.56% and serves as an upper bound on attack effectiveness. Our results suggest that GC is already close to this upper bound (29.54% accuracy drop) even when $\epsilon_d$ is small. However, as the performance of GC is upper-bounded by the target parameter, the improvement from a higher budget is not significant.
- Notably, even when the attack effectiveness for different $\epsilon_d$ is similar, the poisoned samples are still different. For example, poisoned points generated with a larger $\epsilon_d$ are of smaller magnitude than those generated with a smaller $\epsilon_d$.

**In summary, as the plan for the next step**, we will conduct an experiment on poisoning the CLIP image encoder (ResNet-50 backbone) when transferring to CIFAR-10 image classification. In the meantime, please let us know if your concerns have been adequately addressed and if there is anything else we could provide.

## Reviewer zq67

Thank you for acknowledging our contribution and providing great suggestions. We first comment on the concerns we can address now:

**[1] GC feature space attack outperforms feature matching**:

- We first want to clarify that this phenomenon is expected, since the GC feature space attack provides the target features for the subsequent feature matching algorithms. In other words, the feature matching algorithms make the strong GC feature space attack realistic by generating poisoned samples in the input space. As a result, the performance of the feature matching algorithms is upper-bounded by that of the GC feature space attack.
- Of course, an "ideal" feature matching attack should be able to match the performance of the GC feature space attack. We acknowledge that the current attack is not perfect and can be further improved; we have listed this limitation in Section 6 and hope to address it in future work. A schematic sketch of the feature-matching objective is given below for reference.
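As referenced above, here is a schematic sketch of the feature-matching step (the analogue of step (3) in Equation 12), assuming a frozen feature extractor `f`, a target feature `target_feat` produced by the GC feature space attack, and a clean base image `x_base`; the optimizer, step count, and clamping bounds are illustrative choices, not the exact procedure of Algorithm 1.

```python
import torch

def feature_match(f, x_base, target_feat, beta=0.25, steps=500, lr=0.01,
                  x_min=-2.4291, x_max=2.7537):
    """Craft a poisoned sample whose features approach `target_feat` while
    staying close to `x_base` and inside the normalized data range.

    Objective (feature-collision style):
        ||f(x_p) - target_feat||^2 + beta * ||x_p - x_base||^2,
    where a larger beta enforces higher visual similarity to the base image.
    """
    x_p = x_base.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_p], lr=lr)
    for _ in range(steps):
        loss = (f(x_p) - target_feat).pow(2).sum() + beta * (x_p - x_base).pow(2).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Project back into the valid (normalized) data range after each step.
        with torch.no_grad():
            x_p.clamp_(x_min, x_max)
    return x_p.detach()
```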
**[2] Figure 5:** We have added the difference between the original source images and the poisoned samples in Figure 9 in the appendix to visualize the perturbations.

**[3] Transferability of poisoned samples:** To understand the poisoned points generated in Figure 2, we follow the reviewer's suggestion and examine their performance on a model other than their target model. Specifically, we take the poisoned points generated by the GC input space attack for ResNet-18 (which induce an accuracy drop of 29.54%) and feed them into the fine-tuning process of a pre-trained ResNet-50 network on CIFAR-10. Here we observe that the accuracy drop is only 6.23%, which may further suggest that the poisoned samples could be non-semantic.

For the **additional experiments** requested, we first list our plans. As running these experiments will take extra time, we will update our results as soon as possible:

**[1] CIFAR-100 dataset:** We agree that investigating classification problems with more classes would be very interesting, and we plan to perform transfer learning experiments on CIFAR-100 for the attacks considered in this paper.

**[2] Medical dataset:** We have found a related paper examining such datasets: *Truong et al., "How transferable are self-supervised features in medical image classification tasks?". Machine Learning for Health, PMLR, 2021.* Specifically, we plan to use the digital pathology dataset PatchCamelyon, which concerns histopathologic cancer detection, and examine the performance of data poisoning attacks against transfer learning from SimCLR pre-trained models to this dataset.

In the meantime, please let us know if you have further questions or comments; we are happy to discuss them with you.

## Reviewer 24cC

We are thrilled that you find our paper interesting! Regarding your concerns, we first address those that do not require additional experiments:

**[1] Limited novelty of the attacks:** We agree that many of the attack methods are adapted from previous works, and we do not consider designing new attacks a main objective or major contribution of this paper. Instead, we aim to examine the threat of existing attacks in new settings. Upon exploration, we observed that applying these attacks is not straightforward and that moderate adjustments are necessary. Specifically, designing the three-stage feature targeted algorithm is not trivial, and we believe alleviating the challenge of optimizing input space attacks under constraints is a contribution to the field.

**[2] Fine-tuning setting:** We understand your concern regarding the threat model and provide additional justification here:

- Motivation of the attack: for contrastive learning methods, such a fine-tuning process (pre-training and fine-tuning on the same dataset) is a common protocol for examining the quality of the learned representations, in comparison with supervised learning methods. We thus consider it a valid and important downstream task to poison.
- Practicality: let us introduce a scenario where fine-tuning could be practical, e.g., anomaly detection in surveillance videos: (1) one may use existing footage to pre-train a feature extractor; (2) upon deployment, a fine-tuning process is required with labeled video clips (indicating whether each clip is an anomaly or not); however, as time elapses, additional data is collected and used for fine-tuning; (3) this additional data could be contaminated with poisoned samples, thus realizing the proposed threat model.
- Other strange pieces: we assume that "very high poisoning rate" refers to the case where $\epsilon_d=1$? Indeed, having half of the training set poisoned could be impractical, but we still consider it in order to show the threat of a particularly strong poisoning attack.

**[3] Constraint on the pixel values:** We want to clarify that we apply normalization to the clean dataset when constructing the data loader in PyTorch with the following code:

> transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))

such that the clean data lie in the range $[-2.4291, 2.7537]$. We therefore take this range, rather than the raw pixel range, as our constraint when adding poisoned data to the dataloader. We agree that out-of-constraint poisoned points could be unrealistic, which motivated us to optimize our attacks to generate strictly in-bound poisoned samples. Still, we believe it is necessary to report the unbounded attacks to show the difficulty of the optimization when constraints are added. A short sketch of how this range is derived and enforced is given below.
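As mentioned above, here is a short sketch of how the constraint range follows from these normalization statistics and how poisoned samples can be clamped into it; the global bounds below reproduce the $[-2.4291, 2.7537]$ range quoted above (per-channel bounds would be slightly tighter), and `project_in_bounds` is a helper name of ours for illustration.

```python
import torch

# CIFAR-10 normalization statistics used when constructing the data loader.
mean = torch.tensor([0.4914, 0.4822, 0.4465])
std = torch.tensor([0.2023, 0.1994, 0.2010])

# Raw pixels in [0, 1] map to (x - mean) / std, so the normalized data range is:
lo = ((0.0 - mean) / std).min()  # ~ -2.4291
hi = ((1.0 - mean) / std).max()  # ~  2.7537

def project_in_bounds(x_poison: torch.Tensor) -> torch.Tensor:
    """Clamp poisoned samples into the same range as the normalized clean data."""
    return x_poison.clamp(min=lo.item(), max=hi.item())
```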
**[4] $\mu$ and $\nu$**: we have changed the definition to empirical distributions in Section 2.

**[5] Figure 6**: Thank you for the suggestion; we have changed Figure 6 to a histogram for better interpretability.

**[6] More experiments on other encoders/datasets:**

- Other encoders: over the past few weeks we have conducted experiments on poisoning vision transformers pre-trained with MoCo-V3. We have added the results in Table 5 and obtain observations similar to those for ResNet-50, which suggests that our conclusion still holds for cross entropy pretraining. We also plan to poison the image encoder of CLIP to gain further insight into multi-modal pretraining, as suggested by Reviewer f5HB.
- Other datasets: we plan to conduct experiments on poisoning transfer learning to CIFAR-100 and to a medical dataset on histopathologic cancer detection, as suggested by Reviewer zq67.

We will update the additional results as soon as possible and let you know. In the meantime, please let us know if your concerns have been addressed; we would be happy to discuss further.