# ICML 2023 rebuttal

## Reviewer 1

Thank you for your positive feedback. We have modified the draft according to your valuable suggestions and updated the paper at https://drive.google.com/file/d/1jvLL2gd3iRk234BadmwxqRuasGipIGar/view?usp=sharing (changes related to your comments are marked in teal). We address your concerns and summarize our changes for your reference:

(1) **Replacing attack**:

(a) Firstly, we agree that replacing part of the clean data with poisoned data is interesting and important (it is also closely related to the nasty noise model mentioned by the reviewer), and it can be generalized within our formulation. We extend the discussion below and in Appendix C.10 of the revision. Recall that we consider the mixture of the clean distribution $\mu$ and the poisoned distribution $\nu$:
$$\chi = (1-\lambda)\mu + \lambda \nu,$$
where $\lambda$ is the poisoning proportion. Replacing part of the clean training data is then equivalent to:
$$\chi = (1-\lambda')\mu' + \lambda' \nu,$$
where $\mu'$ is a subset of $\mu$, and $\lambda'=|\mathcal{D}_p|/|\mathcal{D}_{tr}|$, with $|\mathcal{D}_{tr}|$ being the size of the clean training set.

(b) Secondly, with the ability to replace data points we can still apply Gradient Canceling in a straightforward manner: the only difference is that in Algorithm 1 we change $\mu$ to $\mu'$, a random subset of $\mu$. Following this idea, we performed a simple experiment: we choose $\epsilon_d=3\%$ and take $\mu'$ to be a random subset of $\mathcal{D}_{tr}$ of size $\frac{1}{1+\epsilon_d}|\mathcal{D}_{tr}|\approx 0.97|\mathcal{D}_{tr}|$. The results on MNIST are presented below:

| Model | Clean | GC (adding-only) | GC (replacing) |
|-------|-------|------------------|----------------|
| LR    | 92.35 | -22.97           | -23.10         |
| NN    | 98.04 | -6.10            | -6.35          |
| CNN   | 99.13 | -9.55            | -9.62          |

We observe that the ability to replace clean training data is indeed (slightly) more powerful than the corresponding adding-only attack. Note that we remove training samples randomly, which may be relatively weak. Ideally, an adversary would remove the most important points (e.g., as in [1]) to further reduce the test accuracy. This improved replacing attack might be of future interest.

[1] Datamodels: Predicting Predictions from Training Data, Ilyas et al.

(c) Thirdly, from the application perspective, replacing attacks may be harder to deploy than adding-only ones, as removing part of the training data likely requires direct access to the database, which may be difficult for a more limited adversary.

We have added the above discussion to Appendix C.10.
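For concreteness, here is a minimal sketch of how $\mu'$ is sampled for the replacing variant; this is our own illustrative code, not part of the paper, and the `gradient_canceling` / `retrain_from_scratch` calls in the usage comment are hypothetical placeholders for Algorithm 1 and the victim's training routine:

```python
import numpy as np

def build_replacing_split(clean_X, clean_y, eps_d, seed=0):
    """Sample mu', a random subset of the clean training set of size
    |D_tr| / (1 + eps_d), i.e. ~0.97 |D_tr| for eps_d = 3%. GC is then run
    against mu' instead of mu, and the victim retrains on mu' together with D_p."""
    rng = np.random.default_rng(seed)
    n = len(clean_X)
    n_keep = int(round(n / (1.0 + eps_d)))
    keep = rng.choice(n, size=n_keep, replace=False)
    return clean_X[keep], clean_y[keep]

# Illustrative usage (data arrays and helper functions are placeholders):
# mu_X, mu_y = build_replacing_split(train_X, train_y, eps_d=0.03)
# poison_X, poison_y = gradient_canceling(...)   # Algorithm 1 run with mu' in place of mu
# retrain_from_scratch(np.concatenate([mu_X, poison_X]),
#                      np.concatenate([mu_y, poison_y]))
```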
(2) **Connections with learning theory:** We thank the reviewer for mentioning the theory literature, which deserves further discussion in context. At a high level, the classic theory literature focuses on poisoning *worst-case* distributions, whereas practical attacks must handle distributions that arise in practice. In more detail: two models of robust PAC learning are the malicious noise model (where the adversary adds points) and the nasty noise model (where the adversary may both add and remove points). The paper linked by the reviewer focuses on the latter. Their main negative result (roughly) says the following: for (almost) any hypothesis class (e.g., linear classifiers), *there exists* a (worst-case) distribution such that an $\eta$-adversarial corruption of the distribution cannot be learned to test error better than $2\eta$. (Qualitatively similar results hold under malicious/additive noise.) These worst-case distributions (as described in the proof of Theorem 5) are rather contrived and unnatural: they are supported on two points of the domain, and the attacker's strategy is to purposely flip the labels of these points. This differs from our setting in three ways:

- when poisoning realistic datasets, the data distribution is far more complex than the worst-case distributions explored in the linked paper;
- flipping labels would result in training points that are visibly mislabeled, in contrast to most of our attacks, which are "clean label";
- we focus on model-targeted attacks whose goal is to induce certain target parameters, while the above-mentioned references focus directly on decreasing accuracy on the test sample.

We have cited the suggested reference (and more) and added the relevant discussion at the end of Section 2.

(3) **Poisonability:** We agree that "poisonability" could be a little misleading; "poisoning reachable" is a great suggestion that we have followed throughout our updated draft.

(4) **Figure 1:** Thank you for pointing this out. We have moved Figure 1 to the end of Section 3.2 (where we introduce $\tau$ formally) and added an intuitive discussion in the introduction.

(5) **Vague terms:** We have changed this sentence to "sufficiently (and sometimes exceedingly, e.g., $\epsilon_d>100\%$) large" in Section 2.

(6) **Indirect:** We use the term *indirect* because data poisoning attacks can only change model parameters indirectly through the constructed poisoned data (which the defender retrains on, thereby inducing the parameter change), as opposed to parameter attacks, which hack the model parameters *directly* without constructing any poisoned data or retraining.

(7) **Notation of G:** $\mathcal{G}$ is the set of all gradient vectors (defined in lines 152-153 of the original submission), while $\mathbb{G}$ (typeset as \mathds{G} in the paper) is the average of gradient vectors over poisoned distributions (Equation (4)). We have simplified the notation to remove $\mathcal{G}$ and only use \mathds{G} throughout.

(8) **Notation $\vee$:** We have changed it to the more explicit maximum, i.e., $a \vee b := \max\{a, b\}$.

(9) **Definition regarding $\tau$:** The reviewer is correct about the intuitive meaning of $\tau$. What we meant to highlight is the relative ease of computing $\tau$: no training is involved and an analytic formula is readily available. We have changed the wording to make this point clear.

(10) **Not "general" and do not cover "all" learning tasks:** Point well taken. We do not claim to cover all learning tasks, and we have acknowledged this limitation explicitly in the revised Conclusion. Nevertheless, we believe our work covers a significant portion of practical interest. Moreover, our work can be used as a distillation device for existing poisoning attacks: upon crafting a poisoned set, an existing attack (implicitly) induces a target parameter on the victim model (obtained by simply retraining on the clean and poisoned sets). One can therefore use our threshold to decide how efficient (in terms of size) the output poisoning set is, and then use GC to possibly further distill and improve it.
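To make this distillation recipe concrete, here is a minimal sketch; `compute_tau`, `retrain_from_scratch`, and `gradient_canceling` are hypothetical placeholders standing in for our analytic formula for $\tau$, the victim's training routine, and Algorithm 1, respectively:

```python
def distill_poisoning_attack(clean_data, poison_data, compute_tau,
                             retrain_from_scratch, gradient_canceling):
    """Use the threshold tau as a yardstick for an existing attack's poisoned set,
    then (optionally) re-synthesize a smaller set with GC for the same target."""
    # The target parameter implicitly induced by the existing attack.
    w_induced = retrain_from_scratch(clean_data + poison_data)

    eps_used = len(poison_data) / len(clean_data)   # poisoning ratio actually spent
    tau = compute_tau(w_induced, clean_data)        # minimum ratio needed to reach w_induced

    if eps_used > tau:
        # The attack "wasted" budget: GC may reach the same target with fewer points.
        poison_data = gradient_canceling(w_induced, clean_data, eps_d=tau)
    return poison_data
```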
## Reviewer 2

Thank you for highlighting our contributions and for the critical comments, which we believe we can address here. We have modified the draft according to your valuable suggestions and updated the paper at https://drive.google.com/file/d/1jvLL2gd3iRk234BadmwxqRuasGipIGar/view?usp=sharing (changes related to your comments are marked in blue). All suggested references have been added and discussed in the revision. Here we address your concerns and summarize our changes for your reference:

(1) **[major issue] False claim:** We apologize for the misunderstanding. The reviewer is correct that a successful data poisoning attack need not be based on producing a target parameter (which is indeed an indirect goal for achieving certain accuracy drops). We have thoroughly revised the draft to make it clear at every appearance that by "successful" in this paper we always mean producing a target parameter, and we have removed every ambiguous statement that may sound over-claiming (not our intention). It is also true that GC may not be perfectly optimized, so our experimental results around it are indicative and serve as empirical evidence to support and verify our theoretical findings about $\tau$, whose validity is not affected by the possible suboptimality of GC. We have changed the sentence pointed out by the reviewer to:

> The failure of GC indicates that a larger poisoning proportion $\epsilon_d$ may be necessary to produce the target parameters, as confirmed by our theory.

(2) **[major issue] Title & abstract over-claiming relative to the actual results:** We apologize for the (unintended) possibility of misleading and over-claiming statements. We have thoroughly revised the draft to make it explicitly clear at every appearance that our work is limited to model-targeted attacks. Major changes to address the reviewer's comments include:

(a) we have changed the Title to:

> Exploring the Limits of **Model-Targeted** Indiscriminate Data Poisoning Attacks

(b) we have made the following changes in the Abstract:

> ...to explore the intrinsic limits of indiscriminate data poisoning attacks **towards target parameters (i.e., model-targeted attacks)**...

> ...data poisoning attacks can **achieve certain target parameters** only when the poisoning ratio exceeds our threshold.

(c) in the Introduction we now mention explicitly:

> In this work we **focus on model-targeted attacks** ...

(d) in Footnote 3 we have acknowledged possible intractability:

> **As pointed out by a reviewer, this may not be computationally feasible if one is too ambitious about the set $\mathcal{W}$.**

(e) throughout the paper we have emphasized our goal of, and limitation to, **producing the target parameter $\mathbf{w}$**;

(f) in the Conclusion we acknowledge our limitations:

> One limitation of this work is its focus on achieving specific target parameters, which may not always be available or necessary. Indeed, data poisoning attacks that are not based on any target parameter abound. However, we point out that our work may still be valuable for the latter class of attacks, for instance, as a distillation device: a data poisoning attack can use our threshold to evaluate the potential "wastefulness" of its constructed poisoning set (along with the model parameter obtained by retraining) and then use GC to further distill and improve it. Another limitation is that most existing data poisoning attacks, including GC, assume a lot of knowledge of the victim model (e.g., fixed architecture, access to clean training data, etc.) and hence may not always be realistic.
> Advanced and adaptive defense mechanisms may also thwart the effectiveness of many attacks (including GC). Further investigations of these issues form another important direction for future research.

(3) **[major issue] Unclear evaluation protocol:** We apologize for the ambiguity here. To clarify:

- all poisoned models **are trained from scratch** on $\mathcal{D}_{tr} \cup \mathcal{D}_{p}$ for evaluation, with the same random initialization for all baselines;
- there is a difference between the *attack protocol* and the *evaluation protocol*: (a) during the attack, we take target parameters (either $\mathbf{w}$ in the main paper or $\mathbf{w}/2$ in Appendix C.6) as input to GC, which returns the poisoned set $\mathcal{D}_{p}$; (b) during evaluation, we retrain the model on $\mathcal{D}_{tr} \cup \mathcal{D}_{p}$ from scratch (with the same random initialization for all baselines) and use the resulting poisoned model to evaluate the test accuracy drop;
- we would like to emphasize that the target parameter is only used during the *attack*, while the *evaluation* does not involve any target parameter and is the same across all baselines and experiments throughout the paper.

We have **added the evaluation protocol to the end of Section 5.1**. Hopefully this removes any ambiguity and makes it clear that there is no unfair advantage.

Moreover, let us clarify the purpose of Appendix C.6: any target parameter $\mathbf{w}$ yields a threshold $\tau(\mathbf{w})$, which serves as a lower bound on the number of poisoned examples needed **to induce $\mathbf{w}$**. We observed that as $\mathbf{w} \to 0$, $\tau(\mathbf{w}) \to 0$ as well. So if we scale a target parameter $\mathbf{w}$ (say in LR) down (e.g., divide by 2), its label predictions and hence its accuracy do not change (although the logits get closer to uniform), but the corresponding $\tau$ gets smaller, meaning that a smaller poisoning set may suffice to induce $\mathbf{w}/2$. Appendix C.6 is a preliminary experiment confirming this observation.

(4) **[major issue] Dated baseline defense:** We agree that it is probably not appropriate to list Sever as the SOTA defense. We have removed "SOTA" from our paper and toned down our claims regarding effectiveness against defenses. Moreover, we have read the suggested references carefully and added the discussion below:

(a) we already include the data augmentation defense (MaxUp) mentioned in [2] in Appendix C.8;

(b) we have studied [3][4][5] in detail and added the following discussion:

- DPA in [3], FA in [4], and DPA_aug in [5] all use a data partition aggregation training procedure to provide a pointwise certified defense against data poisoning, whose effectiveness can be measured by how many test samples are certified to be robust;
- DPA (and related defenses) can be readily applied to the indiscriminate data poisoning setting (and our GC attack): given a specific $\epsilon_d$ (and the $\mathcal{D}_p$ returned by GC) and a number of partitions $k$, we train $k$ base classifiers and perform inference by majority vote (a minimal sketch follows after this list). We verify the performance of DPA against GC with two metrics: the percentage of certified robustness (or certified accuracy in [3]) and the final test accuracy;
- following the above procedure, we take DPA as an example and evaluate its effectiveness against GC. We follow [3] and choose $k=3000$ (we also experiment with $k=1200$ and observe that $k=3000$ performs relatively better; see the complete results in Table 4). We choose $\epsilon_d = 0.008$ to roughly preserve the median certified robustness and perform experiments on MNIST. Note that the base classifiers are the same as the target models.
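As promised above, here is a rough sketch of the partition-aggregation procedure; this is not the implementation of [3], `fit_fn`/`predict_fn` are placeholder training and inference routines, and the index-based partitioning and margin-based certification check are simplifications:

```python
import numpy as np

def dpa_predict(train_X, train_y, test_X, k, fit_fn, predict_fn, num_classes):
    """Partition-aggregation defense, sketched: split the (possibly poisoned)
    training set into k disjoint partitions, train one base classifier per
    partition, and predict by majority vote over the k classifiers."""
    # Deterministic partition; [3] hashes inputs, here we simply use index mod k.
    parts = [np.where(np.arange(len(train_X)) % k == i)[0] for i in range(k)]
    models = [fit_fn(train_X[idx], train_y[idx]) for idx in parts]

    votes = np.zeros((len(test_X), num_classes), dtype=int)
    for m in models:
        preds = predict_fn(m, test_X)                     # shape: (n_test,)
        votes[np.arange(len(test_X)), preds] += 1
    majority = votes.argmax(axis=1)

    # Rough certification check: each poisoned point can flip at most one vote,
    # so a prediction is certified against |D_p| insertions if its vote margin
    # exceeds 2 * |D_p| (up to tie-breaking details handled in [3]).
    top2 = np.sort(votes, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    return majority, margin
```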
We present the results in the table below (CA denotes certified accuracy):

| **Model** | **Clean (w/o DPA)** | **Clean (w/ DPA)** | **GC** | **CA** | **GC (w/ DPA) / change** |
|-----------|---------------------|--------------------|--------|--------|--------------------------|
| LR        | 92.35%              | 89.97%             | -8.25% | 49.23% | -4.21% / +4.04%          |
| NN        | 98.04%              | 92.37%             | -2.25% | 48.52% | -1.17% / +1.08%          |
| CNN       | 99.13%              | 93.15%             | -2.77% | 50.01% | -1.52% / +1.25%          |

We observe that DPA is effective against GC: the change in the (relative) accuracy drop of GC approaches the certified accuracy of DPA (e.g., $4.04 \approx 8.25 \times 0.4923$ for LR). We have **updated Section 5.4 with the above discussion**.

(5) **[secondary issue] Trivial arguments:** We have removed this sentence in the revision.

(6) **[minor issues]:** Thank you for pointing these out. We have corrected all of them.

## Reviewer 3

We would like to first thank you for your positive review; we are thrilled to hear that you enjoyed reading our paper. We have modified the draft according to your suggestions and updated it at https://drive.google.com/file/d/1jvLL2gd3iRk234BadmwxqRuasGipIGar/view?usp=sharing (changes related to your comments are marked in violet). Here we address your concerns and summarize our changes for your reference:

(1) **Confusion on model poisonability:** We certainly understand your confusion. We would like to clarify that:

- in this paper, we consider an interesting (but possibly limited) scenario of indiscriminate data poisoning attacks **based on achieving target parameters**. This type of attack has been considered by prior works (e.g., Koh et al. 2022 and Suya et al. 2021) and appears to be quite competitive;
- consequently, we define "successful" as reaching the target parameter in remark (a) under Definition 1, and our definition of "poisonability" (now changed to "poisoning reachability" for better interpretation) means that the target parameter $\mathbf{w}$ is reachable by data poisoning attacks.

Furthermore, to align our presentation with our theory, we have made several changes to avoid over-claiming (also suggested by Reviewer 3Brt). For example:

(a) we have changed the Title to:

> Exploring the Limits of **Model-Targeted** Indiscriminate Data Poisoning Attacks

(b) in the Abstract, we have changed the following sentence:

> In this work, we introduce the notion of model poisonability as a technical tool to explore the intrinsic limits of indiscriminate data poisoning attacks on **achieving certain target parameters**...

(c) throughout the paper we have emphasized our goal of, and limitation to, achieving the target parameter $\mathbf{w}$; success is always measured in terms of achieving the target parameter and the desired accuracy drop;

(d) in the Conclusion, we have added a discussion of the limitations of our work.

(2) **Benefits of considering targeted models:** We provide some insights on why target parameters may be desirable:

- in indiscriminate attacks, although reducing test accuracy is the ultimate goal, it is usually formulated as a non-zero-sum bilevel optimization problem and empirically found to be ineffective and inefficient (e.g., experiments in Lu et al. 2022).
By reducing this complex problem to achieving certain target parameters, the optimization becomes much easier, and one is often able to produce more powerful attacks.
- target parameters can be thought of as a "teacher" that guides the GC attack in finding an effective poisoned set. Target parameters themselves need not be constructed by data poisoning attacks (as in our experiments) and hence offer another layer of flexibility, provided that we are able to mimic them, which is what our work aims to address (e.g., the threshold $\tau$ for guiding the choice of the size $\epsilon_d$, and GC for constructing the poisoned set);
- a data poisoning attack by definition outputs a poisoned set, and hence implicitly induces a target parameter on the victim model (which can be obtained by retraining on the clean and poisoned sets). In this sense, our results provide a distillation device for data poisoning: we can pinpoint the minimum size of the poisoned set required to achieve the (implicit) target parameter and possibly apply GC for further distillation and improvement.

Our threshold $\tau$ is a lower bound (and sometimes also sufficient) on the fraction of poisoned points required to produce a poisoned model similar to the target parameter. If $\epsilon_d$, the allowed poisoning budget, is smaller than $\tau$, then it is not possible to get close to the target parameter (although it may still be possible to obtain a similar accuracy drop). On the other hand, if $\epsilon_d \geq \tau$, then one can use GC to simulate the target parameter. If it is crucial to use a much smaller $\epsilon_d$, then scaling $\mathbf{w}$ down might help (since $\tau$ gets smaller as a consequence); see our experiment in Appendix C.6.

(3) **Possible insufficiency of the definition:** The reviewer is correct in pointing out the insufficiency of first-order optimality conditions for nonconvex models, which we briefly mentioned in lines 134-136 (of the revision). We would like to add that:

* for convex models, our definition is clearly both sufficient and necessary. We note that convex models are still very relevant, and some existing works (e.g., Koh et al. 2022 and Suya et al. 2021) focus extensively on them;
* for non-convex models (e.g., neural networks), our results remain valid as a necessary condition and provide a valid (albeit not necessarily tight) lower bound on the poisoning ratio;
* while it is possible to define a sufficient and necessary condition for poisonability, such a definition would not be computationally feasible for nonconvex models. In the end, one would likely have to retreat to first-order optimality conditions, so we hope our compromise here is understandable from a practical optimization perspective.

(4) **Connection between theory and GC:**

(a) For clarification, Definition 1 directly motivates GC: Equation (16) in the revision is a simple empirical version of Equation (5), which is exactly the condition of model poisonability.

(b) A strength of GC is its flexibility to build on top of any target parameter, and it tends to perform better if the target parameter is more effective. In our experiments, we first generate the target parameters using the GradPC algorithm of Sun et al. (2020), which directly perturbs the model parameters and hence is not a valid data poisoning attack (since it does not produce any poisoned set). Algorithm 1 then allows us to distill the GradPC target parameters into a bona fide poisoning attack, with the size $\epsilon_d$ of the constructed poisoned set as small as possible (ideally approaching $\tau$).
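For concreteness, the following is a minimal sketch of the gradient-canceling idea as we describe it here, i.e., optimizing the poisoned inputs so that the empirical stationarity residual at the frozen target parameter is driven towards zero; the cross-entropy loss, the Adam optimizer, and the omission of the projection step are illustrative simplifications rather than a verbatim transcription of Algorithm 1:

```python
import torch
import torch.nn.functional as F

def gradient_canceling_sketch(model, clean_X, clean_y, poison_X, poison_y,
                              steps=200, lr=0.1):
    """Sketch of Gradient Canceling: the model is frozen at the target parameter w;
    we optimize the poisoned inputs so that the gradient of the total loss over
    D_tr and D_p at w (the empirical stationarity residual) is cancelled."""
    params = list(model.parameters())
    poison_X = poison_X.clone().requires_grad_(True)
    opt = torch.optim.Adam([poison_X], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        total_loss = F.cross_entropy(model(clean_X), clean_y, reduction="sum") \
                   + F.cross_entropy(model(poison_X), poison_y, reduction="sum")
        # Gradient w.r.t. the (fixed) target parameter; create_graph keeps the
        # dependence on poison_X so we can differentiate through it.
        grads = torch.autograd.grad(total_loss, params, create_graph=True)
        residual = sum(g.pow(2).sum() for g in grads)   # squared norm of the total gradient at w
        residual.backward()
        opt.step()
        # (The full algorithm additionally projects poison_X back to the valid input range.)
    return poison_X.detach()
```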
(c) Selection of target parameters: our theory suggests that a target parameter is desirable if it induces **a large accuracy drop** (empirically measured by $\epsilon_w$ in Table 1) and also has a **small threshold $\tau$**, so that it does not require injecting too many poisoned points. From Figure 3, we observe a trade-off between the two, and we want to choose a target parameter for GC that is both effective and easily achievable. Our theory offers a reasonable range for $\epsilon_d$ in the GC algorithm. In Appendix C.6, we also show empirically that scaling a target parameter $\mathbf{w}$ down results in a smaller $\tau$ and hence makes it easier to achieve.

(5) **Presentation:** We agree that Section 3 is notation-heavy and can be difficult to digest. We have followed your suggestions to largely simplify the notation and the numbering of equations, and to move the neural network discussion back to the main paper. More technical results are now deferred to the appendix (e.g., Theorem 5 and Theorem 6).

(6) **Clipping poisoned data:** In Table 3 we have two observations regarding clipping:

- as expected, clipping the poisoned data makes GC less effective, sometimes even reducing its effectiveness by half (though it remains more effective than the baseline methods);
- intuitively, a reasonable input range makes the poisoned samples more robust against data sanitization defenses (e.g., outlier removal methods such as Sever in Table 3).

(7) **Two $\tau$'s in Table 2:** We understand your confusion and would like to clarify that:

- the $\tau$ in the GradPC column is the threshold calculated w.r.t. the target parameter generated by GradPC;
- the $\tau$ in the GC column denotes setting $\epsilon_d=\tau$ (for example, for LR on MNIST, we can simply change this $\tau$ to 1.15). This column is informative in revealing what happens to GC when its poisoning ratio $\epsilon_d$ approaches the theoretical bound $\tau$ (which is sometimes larger than 1).

We have also changed Table 2 correspondingly.

(8) **Limitations:** We have explicitly added a discussion paragraph on the limitations of our work to the Conclusion section, reproduced below:

> One limitation of this work is its focus on achieving specific target parameters, which may not always be available or necessary. Indeed, data poisoning attacks that are not based on any target parameter abound. However, we point out that our work may still be valuable for the latter class of attacks, for instance, as a distillation device: a data poisoning attack can use our threshold to evaluate the potential "wastefulness" of its constructed poisoning set (along with the model parameter obtained by retraining) and then use GC to further distill and improve it. Another limitation is that most existing data poisoning attacks, including GC, assume a lot of knowledge of the victim model (e.g., fixed architecture, access to clean training data, etc.) and hence may not always be realistic. Advanced and adaptive defense mechanisms may also thwart the effectiveness of many attacks (including GC). Further investigations of these issues form another important direction for future research.