# NIPS Paper1974 Rebuttal
---
# Reviewer1
We sincerely thank you for your valuable time and comments. We are encouraged by the positive comments on the **interesting approach**. We hope to address your questions below:
**Q1**: The chosen model for the binary classifier is not explained in the paper (in Fig. 1 it is specified as a logistic regressor, but in the appendix it seems that a three-layer neural network is used - even if the authors talk about a three-layer linear classifier (?)). The authors should better highlight this detail in the paper. It is also not clear how many are the input features of this classifier.
**A1:** Thank you for the constructive question. As shown in Table 1, we train a three-layer binary classifier $\mathcal{C}$ composed of linear (fully connected) layers, with an input size of $100 \times 5=500$, where 100 is the number of anchor samples and 5 is the number of convolutional layers of ResNet-34. Please refer to Appendix (Section A.3) in our revision for more details.
Table 1. The architecture of the binary classifier $\mathcal{C}$.
| Index | Layer Type | Shape |
| --- | --- | --- |
|1| Linear | [500, 256] |
|2| Linear | [256, 32] |
|3| Linear | [32, 2] |
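For concreteness, a minimal PyTorch sketch of this architecture is shown below. Only the layer shapes [500, 256], [256, 32], [32, 2] come from Table 1; the ReLU activations between layers are an illustrative assumption, since they are not listed above.

```python
import torch.nn as nn

# Minimal sketch of the binary classifier C from Table 1.
# Layer shapes follow Table 1; the ReLU activations are an assumption.
classifier = nn.Sequential(
    nn.Linear(500, 256),
    nn.ReLU(),
    nn.Linear(256, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)
```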
**Q2: It is not clear if all anchor samples are used to obtain the input features for the binary classifier: in this case, we should have a FDI vector for each anchor sample. If this is not the case, and only an anchor sample is used: how is it selected?**
**A2:** Thank you for the question.
1. All anchor samples are used to obtain input features.
2. For each class $c$, we select 100 high-confidence instances from the victim's training set as anchor samples, as mentioned in Lines 143-144 of our original manuscript.
3. For each inspected query, we compare its feature representation with all 100 anchor samples to obtain a 500-dimensional FDI vector (illustrated in the sketch below).
Please refer to Appendix A.3 in our revision for more details.
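To make the dimensionality concrete, the sketch below shows how such a 500-dimensional vector can be assembled for one query. The Euclidean distance between layer-wise features and the `fdi_vector` helper are illustrative assumptions, not the exact formulation in the paper.

```python
import torch

def fdi_vector(query_feats, anchor_feats):
    """Assemble a 500-dim feature-deviation vector for one inspected query.

    query_feats : list of 5 tensors (one per convolutional stage of ResNet-34),
                  each flattened to shape [d_l].
    anchor_feats: list of 5 tensors of shape [100, d_l], holding the features of
                  the 100 anchor samples at the corresponding stage.
    The Euclidean distance used below is an illustrative choice.
    """
    entries = []
    for q, a in zip(query_feats, anchor_feats):
        entries.append(torch.norm(a - q.unsqueeze(0), dim=1))  # shape [100]
    return torch.cat(entries)  # shape [500] = 100 anchors x 5 stages
```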
**Q3:** It is not clear on which data of the auxiliary datasets the binary classifier is trained, and how many training samples are considered?
**A3:** Thanks for the constructive question. We train the binary classifier on the victim's training set together with an out-of-distribution dataset (e.g., FashionMNIST for MNIST and CIFAR100 for CIFAR10). We divide the victim's training set into three subsets: 100 anchor samples per class, 1000 samples as a validation set, and the remaining data for training. Please refer to Appendix A.3 in our revision for more details.
**Q4:** How is it possible to have a benign test set of 500k samples from the MNIST and CIFAR10 datasets? These datasets contain much fewer samples.
**A4:** Thank you for the question. To balance the sizes of the benign and malicious query sets, we randomly sample (with repetition) from the victim's test set to create the 500k benign query set. Please refer to Section 5.1 in our revision for more details.
**Q5:** It is not clear on how many (benign and adversarial) samples is the evaluation done for the adaptive attack setting.
**A5:** Thank you for the question. As mentioned in Section 5.4 of our original manuscript, we select $50k$ adversarial queries from $\mathcal{D}_{adv}$ to launch the *FeatC* attack and $50k$ benign queries from $\mathcal{D}_{ben}$ for each attack.
**Q6**: How has the visualization in Fig. 2 been produced?
**A6:** Thank you for the constructive question. We use t-SNE to plot the feature maps (the outputs of the last convolutional layer) of ResNet-34. *Query-SetA* and *Query-SetB* consist of adversarial examples produced by FGSM and PGD, respectively.
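A minimal sketch of this visualization step is shown below, assuming scikit-learn's `TSNE` and feature maps already flattened into an `[N, d]` array; the arrays here are random placeholders for the real features.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# feats: flattened last-conv-layer outputs; labels: 0 = benign,
# 1 = Query-SetA (FGSM), 2 = Query-SetB (PGD).  Placeholders only.
feats = np.random.randn(300, 512)
labels = np.repeat([0, 1, 2], 100)

emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
for c, name in enumerate(["benign", "Query-SetA (FGSM)", "Query-SetB (PGD)"]):
    plt.scatter(emb[labels == c, 0], emb[labels == c, 1], s=5, label=name)
plt.legend()
plt.show()
```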
**Q7**: What is C in Eq. 6?
**A7:** Thank you for the constructive question. $C$ denotes the number of combinations (the binomial coefficient), i.e., $C_{n}^{m}=\frac{n!}{m!(n-m)!}$. Please refer to Section 5.4 in our revision for more details.
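For example, with $n=5$ and $m=2$:
$$C_{5}^{2}=\frac{5!}{2!\,3!}=10.$$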
**Q8**: The authors do not consider attacks that have been shown to evade distance-based detection (Pal et. al, 2020; Gong et. al, 2021; Yuan et. al, 2022; the adaptive attack in [13]), whereas include attacks from [6] and [20], that are easily detected by these kind of defenses.
**A8:** Thanks for the helpful suggestion.
We choose to defend against JBDA-FGSM, JBDA-PGD, Knockoff, DFME, and DaST because these attacks cover all attack scenarios (i.e., unlimited access to wild data, limited access to private data, and no data) and are representative model extraction attacks.
Since Pal et al., 2020, Gong et al., 2021, and Yuan et al., 2022 have not released their source code, we cannot yet reproduce these attacks. We have sent code requests to the authors of these papers and hope to hear from them; we will evaluate ProFedi against them as soon as we obtain the source code.
**Q9:** The choice of the detection threshold should be justified: was it tuned on training/validation data, or directly on the test set (this should be not a correct approach)? It is not possible to evaluate the false positives, especially with respect to the threshold value: ROC curves should be included for experiments in Sect. 5.2.
**A9:** Thanks for the constructive question.
1. The victim's training set is divided into three subsets: (1) 100 high-confidence samples per class as anchor samples, (2) 1000 randomly selected samples as a validation set to tune the threshold, and (3) the remaining training data to train the binary classifier. As shown in Figure 4, the average confidence $ac$ of benign queries is much lower than that of malicious queries during the validation phase. Please refer to Appendix (Section A.3) in our revision for more details.
2. We have added ROC curves in Appendix (Section B) of our revision (see the sketch below).
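As a minimal sketch (assuming scikit-learn and per-query malicious-class scores from $\mathcal{C}$; the arrays are placeholders), such curves can be produced as follows:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# y_true: 1 for malicious queries, 0 for benign; y_score: C's malicious-class
# probability per query.  Random placeholders stand in for the real outputs.
y_true = np.random.randint(0, 2, size=1000)
y_score = np.random.rand(1000)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```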
**Q10**: Why do you include an evaluation based on the technique from [12]? This is specifically designed for and tested on decision trees (as it mainly relies on the information gain metric), and there is no guarantee that it is able to provide reliable results in this setting.
**A10:** Thanks for the instructive question.
Recent model extraction defenses fall into two categories: active defenses and passive defenses. Extraction Monitor (EM) [12] employs local proxy models to quantify the extraction status of an individual agent and is a representative passive defense. To adapt EM to the DNN setting, we replace its information gain metric with a stolen-model-accuracy metric.
In this experiment, we evaluate whether our method can raise extraction warnings while an adversary sends extraction queries.
**Q11: I don't understand why m is set to 1 when evaluating the defense against the adaptive attack: results are not comparable with the non-adaptive attack setting.**
**A11:** Thank you for the question. A larger query sequence size $m$ in the majority voting setting yields higher detection accuracy. To better demonstrate the effectiveness of our defense against the adaptive attack, we set $m=1$, i.e., the adaptive adversary conducts the *FeatC* attack query by query.
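A minimal sketch of the majority-voting decision over a query sequence is given below, assuming per-query binary predictions from $\mathcal{C}$ and the voting threshold $\tau_1$ (0.4 is the value used in our ImageNet experiment); with $m=1$ the decision reduces to $\mathcal{C}$'s single prediction.

```python
import numpy as np

def is_malicious_sequence(per_query_preds, tau1=0.4):
    """Flag a query sequence as malicious when the fraction of queries that the
    binary classifier labels malicious exceeds the voting threshold tau1.

    per_query_preds: 0/1 predictions of C for the m queries in the sequence.
    With m = 1 this reduces to C's prediction on the single query.
    """
    return float(np.mean(per_query_preds)) > tau1

# Example: a sequence of m = 5 queries, 3 of which are flagged by C.
print(is_malicious_sequence(np.array([1, 0, 1, 1, 0])))  # True (0.6 > 0.4)
```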
**Q12:** I'm not convinced on the strength of the proposed adaptive attack: an attacker with full knowledge of the defense, would try to evade the detection by decreasing the output score of the binary classifier. This could be done by simply inserting benign queries inside considered sequences. Furthermore, this approach could also be extended to avoid the detection of distributed attacks. Similar settings were already discussed in [15].
**A12:** Thank you for the question.
Yes, the adaptive attacker tries to mislead the binary classifier.
However, the attacker needs auxiliary knowledge of the victim model (to obtain the FDI vector and then invert it back to input data). We assume that the adaptive adversary has two options for $\hat{F}$, i.e., VGG11 and the victim model $F_{V}$ itself. ProFedi fails to detect the adaptive attack when the adversary knows $F_{V}$.
**Q13:** It is useless to evaluate PRADA on sets of 100 queries, as this method does not work until queries are more than 100.
**A13:** Thank you for the question. The experiments on sets of 100 queries are meant to show that our method already works where PRADA cannot, i.e., that it outperforms PRADA in this regime. We also provide results on sets of 1000 queries.
**Q14:** The experiments consider only one architecture and two low-dimensional datasets: it would be better to test the defense on other DNN models, and especially to show that the approach is scalable to high-dimensional datasets.
**A14:** Thanks for the constructive suggestion. To the best of our knowledge, existing works on model extraction attacks do not conduct experiments on large-scale datasets such as ImageNet. Nevertheless, we conduct the JBDA-FGSM, JBDA-PGD, Knockoff, DFME, and DaST attacks on ImageNet. Specifically, we adopt a ViT pre-trained on ImageNet as the victim model and ResNet-34 as the substitute model, with a query budget of 100k for all attacks. As shown in Table 2, all attacks perform poorly.
Table 2. The accuracy (%) of stolen models on ImageNet.
| JBDA-FGSM | JBDA-PGD | Knockoff | DFME | DaST |
|---|---|---|---|---|
| 0.114 | 1.522 | 6.966 | 0.138 | 0.096 |
Although all attacks are ineffective, we also use ProFedi to detect the malicious queries, setting the majority voting threshold $\tau_1$ to 0.4. As shown in Table 3, ProFedi is still effective.
Table 3. The detection accuracy (%) of ProFedi against different attacks under different query sequence sizes $m$.
| $m\downarrow$, Attack$\rightarrow$ | JBDA-FGSM | JBDA-PGD | Knockoff | DFME | DaST |
|---|---|---|---|---|---|
|100| 99.25 | 97.60 | 77.80 | 99.95 | 99.95 |
|1000| 99.00 | 98.00 | 90.50 | 100.0 | 100.0 |
Please refer to Appendix (Section E) in our revision for more details.
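For reference, a minimal sketch of the victim/substitute setup described above is given below; the specific torchvision ViT-B/16 variant is an illustrative assumption, since the exact ViT model is not specified here.

```python
import torchvision.models as models

# Victim: a ViT pre-trained on ImageNet (ViT-B/16 is an illustrative choice).
victim = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1).eval()

# Substitute: a randomly initialized ResNet-34 that the attacker trains from
# the victim's responses under a 100k query budget.
substitute = models.resnet34(weights=None)
```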
<br>
# Reviewer2
We sincerely thank you for your valuable time and comments. We are encouraged by the positive comments on the **lightweight and effective method** and **extensive comparative evaluations**. We hope to address your questions below:
**Q1**: How can the defender derive a threshold in Equation 3? And, how much does this threshold affect the performance overall?
**A1:** Thank you for the constructive question. We use 1000 benign queries from the victim's training data to compute the upper bound on $ac$. As depicted in Figure 4(a) (blue curve), $ac$ is very low for benign queries, whereas it is much larger for adversarial queries in Figures 4(b)-4(f). Please refer to Appendix (Section A.3) in our revision for more details.
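As a minimal sketch of this procedure (the maximum-plus-margin choice below is an illustrative assumption; in practice the bound is read off the validation curve in Figure 4(a)):

```python
import numpy as np

def derive_threshold(benign_ac, margin=0.05):
    """Derive the detection threshold from benign validation queries.

    benign_ac: average-confidence values ac observed on the 1000 held-out
    benign queries.  Taking the maximum observed benign ac plus a small margin
    as the upper bound is an illustrative choice.
    """
    return float(np.max(benign_ac)) + margin

# Placeholder benign ac values standing in for the blue curve in Fig. 4(a).
benign_ac = np.random.uniform(0.0, 0.1, size=1000)
print(derive_threshold(benign_ac))
```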
**Q2**: Why do the authors conduct evaluations on the small datasets (i.e., MNIST, CIFAR-10)?
**A2:** Thanks for the constructive suggestion. To the best of our knowledge, existing works on model extraction attacks do not conduct experiments on large-scale datasets such as ImageNet. Nevertheless, we conduct the JBDA-FGSM, JBDA-PGD, Knockoff, DFME, and DaST attacks on ImageNet. Specifically, we adopt a ViT pre-trained on ImageNet as the victim model and ResNet-34 as the substitute model, with a query budget of 100k for all attacks. As shown in Table 1, all attacks perform poorly.
Table 1. The accuracy (%) of stolen models on ImageNet.
| JBDA-FGSM | JBDA-PGD | Knockoff | DFME | DaST |
|---|---|---|---|---|
| 0.114 | 1.522 | 6.966 | 0.138 | 0.096 |
Although all attacks are ineffective, we also use ProFedi to detect the malicious queries, setting the majority voting threshold $\tau_1$ to 0.4. As shown in Table 2, ProFedi is still effective.
Table 2. The detection accuracy (%) of ProFedi against different attacks under different query sequence sizes $m$.
| $m\downarrow$, Attack$\rightarrow$ | JBDA-FGSM | JBDA-PGD | Knockoff | DFME | DaST |
|---|---|---|---|---|---|
|100| 99.25 | 97.60 | 77.80 | 99.95 | 99.95 |
|1000| 99.00 | 98.00 | 90.50 | 100.0 | 100.0 |
Please refer to Appendix (Section E) in our revision for more details.
<br>
# Reviewer3
We sincerely thank you for your valuable time and comments. We are encouraged by the positive comments on the **comprehensive evaluation** and **insightful adaptive attacks and distributed attacks**. We hope to address your questions below:
**Q1: If the attacker queries a large number of normal data samples, the ratio of malicious queries could be much lower than the pre-defined threshold. The proposed detection method, thus, could be easily evaded.**
**A1:** Thank you for the question. Unlike PRADA, ProFedi does not rely on historical queries: we can set the query sequence size to $m=1$ and detect queries one by one. For a fair comparison with defense baselines such as PRADA, we set $m=100$ and $m=1000$ in our manuscript.
**Q2**: If a legitimate user queries such out-of-distribution data (not for model extraction), will the legitimate user be classified as an attacker?
**A2:** Thank you for the question. A legitimate user will be flagged as malicious if he/she queries out-of-distribution data. To avoid such misjudgment, the defender can adjust the threshold $\tau_{1}$ to widen the credit range.
**Q3: How to train the classifier C? What’s the classifier’s architecture?**
**A3:** Thank you for the question. The input of the binary classifier $\mathcal{C}$ is a 500-dimensional FDI vector, and its output is a 2-dimensional probability vector. As shown in Table 1, $\mathcal{C}$ is a three-layer neural network; its architecture is listed below:
Table 1. The architecture of the binary classifier $\mathcal{C}$.
| Index | Layer Type | Shape |
| --- | --- | --- |
|1| Linear | [500, 256] |
|2| Linear | [256, 32] |
|3| Linear | [32, 2] |
During the training phase, we first extract the outputs of all convolutional layers to obtain the FDI vector $\mathcal{I}$. We adopt out-of-distribution data (FashionMNIST for MNIST and CIFAR100 for CIFAR10) as malicious queries to train $\mathcal{C}$. Please refer to Appendix (Section A.3) in our revision for more details.
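A minimal PyTorch training sketch under this setup is given below (FDI vectors of in-distribution training data labeled 0, FDI vectors of out-of-distribution data labeled 1); the optimizer, learning rate, epoch count, and ReLU activations are illustrative assumptions, and the random tensors are placeholders for the real FDI vectors.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder 500-dim FDI vectors: benign (label 0) vs. out-of-distribution (label 1).
fdi_benign = torch.randn(10000, 500)
fdi_ood = torch.randn(10000, 500)
x = torch.cat([fdi_benign, fdi_ood])
y = torch.cat([torch.zeros(10000, dtype=torch.long),
               torch.ones(10000, dtype=torch.long)])
loader = DataLoader(TensorDataset(x, y), batch_size=256, shuffle=True)

classifier = nn.Sequential(nn.Linear(500, 256), nn.ReLU(),
                           nn.Linear(256, 32), nn.ReLU(),
                           nn.Linear(32, 2))
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)  # illustrative
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):  # illustrative epoch count
    for xb, yb in loader:
        optimizer.zero_grad()
        loss_fn(classifier(xb), yb).backward()
        optimizer.step()
```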
**Q4: Eq. 2 and Eq. 3 are not well-formulated. Does Eq. 2 calculate a vector of FDI including all the data samples in D_A? How to use FDI in Eq. 3?**
**A4:** Thank you for the constructive question.
1. For each class, all the anchor samples in $\mathcal{D}_{\mathcal{A}}$ are used to extract the FDI vector $\mathcal{I}$.
2. For each query, we extract an FDI vector and feed it to the binary classifier $\mathcal{C}$ in Eq. 3. To avoid confusion, we update Eq. 3 as:
$$ac = \frac{1}{m} \sum_{i=1}^{m}{\mathcal{C}(\mathcal{I}(x_{i}; \mathcal{D}_{\mathcal{A}}, F_{V}))}$$
**Q5: Efficiency is listed as one of the defense objectives. However, the latency of the proposed detection is not discussed in the paper.**
**A5:** Thank you for the constructive question. We evaluate the latency of ProFedi with 50k queries; the throughput is 838.36 queries/s in our experimental setting. Please refer to Section 5.2 in our revision for more details.
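A minimal sketch of how such throughput can be measured is shown below; `detect` is a placeholder for ProFedi's per-query detection pipeline, and the dummy queries only illustrate the timing procedure.

```python
import time

def measure_throughput(detect, queries):
    """Return the number of queries processed per second by the detector."""
    start = time.perf_counter()
    for q in queries:
        detect(q)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed

# Example with a dummy detector and 50k dummy queries.
print(measure_throughput(lambda q: 0, list(range(50000))))
```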
**Q6: It seems the “colluding adversaries” launch the same attack as in the “distributed attacks.”**
**A6:** Thank you for the constructive question. Your understanding is correct. We’ve added this in our revision in Section 3.1.
**Q7**: The extraction monitor metric used in the paper differs from [12]. To avoid confusion, it would be great if the paper could present the metric as the stolen model accuracy.
**A7:** Thank you for the instructive suggestion. We’ve revised our paper in Section 5.2.
<br><br><br>
# Reviewer4
We sincerely thank you for your valuable time and comments. We are encouraged by the positive comments on the **novel defense** and **effectiveness of the method**. We hope to address your questions below:
**Q1**: The proposed method seems to be able to extend to other models. Do you have more results on other models, like ViTs? Also, do you have more results on other tasks like some NLP tasks? I think it would be helpful to show the method is model- and task-agnostic.
**A1:** Thank you for the valuable question.
1. We evaluate our method on ViTs and obtain similar experiment results. Please refer to Answer 2 for more details.
2. We mainly focus on image classification due to time constraints, but we believe our method can be adapted to NLP tasks. We will explore this in our future work.
**Q2**: I think this might not be realistic for running the experiments in a short window and it is minor to me, but do you have more results on large-scale datasets, like Imagenet?
**A2:** Thanks for the constructive suggestion. To the best of our knowledge, existing works on model extraction attacks do not conduct experiments on large-scale datasets such as ImageNet. Nevertheless, we conduct the JBDA-FGSM, JBDA-PGD, Knockoff, DFME, and DaST attacks on ImageNet. Specifically, we adopt a ViT pre-trained on ImageNet as the victim model and ResNet-34 as the substitute model, with a query budget of 100k for all attacks. As shown in Table 1, all attacks perform poorly.
Table 1. The accuracy (%) of stolen models on ImageNet.
| JBDA-FGSM | JBDA-PGD | Knockoff | DFME | DaST |
|---|---|---|---|---|
| 0.114 | 1.522 | 6.966 | 0.138 | 0.096 |
Although all attacks are ineffective, we also use ProFedi to detect the malicious queries, setting the majority voting threshold $\tau_1$ to 0.4. As shown in Table 2, ProFedi is still effective.
Table 2. The detection accuracy (%) of ProFedi against different attacks under different query sequence sizes $m$.
| $m\downarrow$, Attack$\rightarrow$ | JBDA-FGSM | JBDA-PGD | Knockoff | DFME | DaST |
|---|---|---|---|---|---|
|100| 99.25 | 97.60 | 77.80 | 99.95 | 99.95 |
|1000| 99.00 | 98.00 | 90.50 | 100.0 | 100.0 |
Please refer to Appendix (Section E) in our revision for more details.
**Q3**: The authors mention that the limitations of the work are in section 6, but I did not find them.
**A3:** Thank you for the question. In this paper, we mainly focus on image classification; we will explore more scenarios (e.g., NLP) in future work. We have revised our paper and now discuss the limitations of our method in Section 6.
**References**
[1] Papernot, N., P. McDaniel, I. Goodfellow, et al. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519. 2017.
[2] Orekondy, T., B. Schiele, M. Fritz. Knockoff nets: Stealing functionality of black-box models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4954–4963. 2019.
[3] Truong, J.-B., P. Maini, R. J. Walls, et al. Data-free model extraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021.
[4] Zhou, M., J. Wu, Y. Liu, et al. Dast: Data-free substitute training for adversarial attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 234–243. 2020.
The end.