# Review 1
Rating: 4 Conf: 4
We thank the reviewer for the feedback. Please find below our responses to the specific weaknesses and questions that you mention.
>Novelty of paper
Our paper focuses on the novel task of systematically finding adversarial prompts for foundation models. We propose a flexible optimization framework to search for adversarial prompts with only query access, allowing for arbitrary black box optimization methods and adversarial targets.
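For concreteness, below is a minimal sketch of the query-only loop our framework describes. The names `optimizer`, `generate`, and `loss` are hypothetical stand-ins (any black-box optimizer such as TuRBO or Square Attack, query access to the foundation model, and the adversary's objective); this is an illustration, not our exact implementation.

```python
# Hypothetical helpers, for illustration only:
#   optimizer.propose() -> a candidate prefix of d tokens
#   optimizer.update(prefix, value) -> feed the observed loss back to the optimizer
#   generate(prompt) -> output of the (black-box) foundation model
#   loss(output) -> adversarial objective (e.g., a classifier score or perplexity)
def find_adversarial_prefix(seed_prompt, optimizer, generate, loss, budget=10_000):
    best_prefix, best_value = None, float("inf")
    for _ in range(budget):
        prefix = optimizer.propose()
        value = loss(generate(prefix + " " + seed_prompt))
        optimizer.update(prefix, value)
        if value < best_value:
            best_prefix, best_value = prefix, value
    return best_prefix, best_value
```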
>The abstract appears overly simplified and doesn't describe the proposed framework.
We have updated the abstract with new details on the specific methodology.
>The work would be more significant if the adversarial prompts can be more imperceptible or have stronger ability to control the models. For example, the current adversarial prompts (or the prepending prompts) makes no sense to humans.
In our work, we search for adversarial prompts with a small number of tokens so that the change is close to imperceptible: in a long prompt, inserting four adversarial tokens may be difficult to detect and filter. We agree that finding semantically meaningful and interesting prompts is an alternative approach and an interesting future research direction.
>Minor: I cannot open the link in L215.
This is intentional: we use a blinded URL for the review process and will replace it with the real link in the final version.
>Could you specify the amount of time required to find each type of adversarial prompt?
Each optimization takes between 1,000 and 10,000 queries. The optimization time is mostly bottlenecked by the foundation model's generation time. In our experiments running Stable Diffusion v1.5 and Vicuna 1.1 on an A100 GPU, each adversarial prompt takes between 5 and 60 minutes.
>Can you provide a scenario in which the proposed framework could pose an actual threat to foundation models?
The provided problems are examples of threats to foundation models. For text-to-image models, we could generate NSFW images without using banned input tokens by replacing the classifier in the loss function with an NSFW classifier. For text-to-text models, we have successfully optimized adversarial prompts for maximizing toxicity, as judged by an external classifier.
For both of these experiments, we choose not to evaluate or report the results due to ethical concerns, and instead develop a test suite of benign examples that captures the core technique of the attack.
>Could you further explain the statement, "the prompts CLS and a picture of a CLS are no longer necessarily strong baselines" (Lines 273-275)?
Thank you for bringing this to our attention; we have reworded this. Consider the example where our objective is to generate images of dogs and we prepend to the seed prompt `a picture of the ocean`. Previously, we would consider the prompts `dog` and `a picture of a dog` and evaluate whether we can optimize a prompt that outperforms these simple baselines. In prepending Task 3, we find that the corresponding baseline prompts `dog a picture of the ocean` and `a picture of a dog a picture of the ocean` are not effective, and we therefore omit OPB success for this task.
# Review 2
Rating: 8 Conf: 4
Thank you very much for your encouraging comments and valuable feedback. Please find below our responses to the specific weaknesses that you mention.
> High computational and query cost
Although in the image-generation setting we set an upper bound of 10,000 queries, we often find adversarial prompts far more quickly. Similarly, in the text-generation setting we are able to obtain adversarial prompts in roughly 2,000 queries.
Furthermore, while we do consider a small number of tokens (d=4 or 6), the search space is not small, since there are about 50,000 tokens to choose from: $50000^4 \approx 6 \times 10^{18}$ or $50000^6 \approx 1.6 \times 10^{28}$ candidate prompts.
Therefore, we believe the query cost is high but not excessive for this setting.
>How does the perplexity get calculated when attacking the text FM?
The log perplexity is equal to the average negative log probability of each token conditional on the previous tokens. The probability is modeled by GPT2.
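For reference, a minimal sketch of this computation, assuming HuggingFace `transformers` GPT-2 as the scoring model (illustrative, not our exact evaluation code):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_perplexity(text: str) -> float:
    """Average negative log probability of each token given the previous tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean token-level
        # cross-entropy, i.e. the average negative log-likelihood under GPT-2.
        out = model(ids, labels=ids)
    return out.loss.item()
```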
>Is the classifier choice matter for the classifier loss-based attack?
There is not a strong dependence on the choice of classifier; we find that the generated adversarial images match the classifier predictions well. We chose ResNet18 as an arbitrary off-the-shelf classifier.
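For illustration, a minimal sketch of such a classifier-based loss, assuming torchvision's pretrained ImageNet ResNet18 (a hedged example, not our exact loss code):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.DEFAULT
classifier = resnet18(weights=weights).eval()
preprocess = weights.transforms()  # standard ImageNet preprocessing

def classifier_loss(generated_images, target_class: int) -> float:
    """Negative mean log-probability of the target class over generated (PIL) images."""
    batch = torch.stack([preprocess(img) for img in generated_images])
    with torch.no_grad():
        log_probs = F.log_softmax(classifier(batch), dim=-1)
    return -log_probs[:, target_class].mean().item()
```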
>Insights from Table 1
We may glean the following insights:
1. Overall, our method is successful in a large variety of settings without the need for task specific tuning.
2. The more sophisticated TuRBO outperforms the direct Square Attack.
3. Our method is able to consistently outperform the baselines, as demonstrated by OPB success.
4. Even in the most challenging setting of Task 3, with the strict criterion of Most Probable Class Success, we successfully find adversarial prompts.
>Since the search of adversarial prompts is only about length d (d=4 or 6), it seems pretty hard to extend this type of method to the SOTA prompt cases (4k or 8k token length in each turn)
Despite models having large context windows, our results demonstrate that we can find adversarial prompts with a small number of tokens even for longer seed prompts. For example, some of the seed prompts in the text-generation experiments have lengths >50 tokens with d=6. For the 4k/8k context setting, we agree that this is an interesting future research direction.
# Review 3
Rating: 4 Conf: 4
>There are too many variables used in the notations, because of which many things are not that clear when the main method is being described in Sec 2 (2.1 and 2.2.1). The Square attack method described in lines 111-119 is simple enough but has been presented in a bit complicated manner. I believe that with the proper (more simplified use of notations), the authors can make the presentation much simpler.
Thank you for the feedback. We agree and have modified the square attack and notation in Section 2 accordingly.
>Text to Image Task Comparison
In our setting, we measure success with a ResNet18 classifier: whether the generated images have the target class as the most probable class (MPC Success), or whether the generated image has a higher target-class prediction than the baseline prompts (OPB Success). By the definition of our optimization problem, the displayed images are examples of success, as they satisfy both MPC and OPB success. The reviewer's comment motivates an interesting future application of our framework, in which the classifier loss is modified to also penalize the seed class in prepending Task 3.
>Are you only trying to find one token which produces a desired effect? Or are you trying to find a phrase (multiple tokens) for that task?
We optimize over 4 tokens for the image generation setting and 6 tokens in the text generation setting (see Section 4 lines 210-215.)
# Review 4
Rating: 3 Conf: 5
**Most important review, should spend most time on this**
>Adversarial reprogramming literature
While the literature on adversarial reprogramming is related, the proposed work is much more aligned with the traditional adversarial examples line of research. Adversarial examples seek a perturbation that changes the output, whereas adversarial reprogramming seeks to change the underlying task of the model. In our paper, we find prepending prompts that change the output of the downstream classifier: this is precisely the adversarial examples setting. An adversarial prompt that "reprograms" the model to perform a different task (i.e., changes the classification task) is certainly an interesting threat model but is out of scope for this paper. We thank the reviewer for bringing this up and will add a paragraph discussing this distinction, with the corresponding references, to the related work in the revised paper.
>What is exactly the feature loss here?
The loss function depends on the type of foundation model and the goal of the adversary. For the text-generation setting considered in the paper, we use *perplexity* as our feature loss, where the log perplexity is the average negative log probability of each token given the previous tokens.
>What's the meaning of high complexity?
Could you elaborate on what you mean by high complexity? We do not use the phrase 'high complexity' in our paper. The metric we use, perplexity, is commonly used in natural language processing to quantify language-modeling performance (see, e.g., "Exploring the Limits of Language Modeling" by Jozefowicz et al.). We search for prompts that induce high perplexity as examples of prompts that cause the model to output gibberish.
>Does the MPC score measure the attack success rate of generated adversarial prompt for 5 images? Is the number 5 a hyper-parameter?
An optimizer achieves MPC success if the output of an adversarial prompt is classified as the target class. To evaluate this criterion more reliably, we generate 5 images and report success if a majority (at least 3 out of 5) of the images are classified as the target class. The number 5 is a hyperparameter, chosen to be large enough to reduce variance.
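A minimal sketch of this success criterion, with a hypothetical `classify(image)` helper standing in for the classifier's most-probable-class prediction (illustrative only):

```python
def mpc_success(generated_images, target_class: int, classify, needed: int = 3) -> bool:
    """MPC success: at least `needed` of the generated images (5 by default in our
    setup) are assigned the target class as the most probable class.
    `classify(image)` is a hypothetical helper returning the argmax class id."""
    votes = sum(int(classify(img) == target_class) for img in generated_images)
    return votes >= needed
```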
> The adversarial prompts seem to be built by language tokens with token space projection. I'm wondering how large the space is for finding the closest token in \mathcal{S}.
We use the default vocabulary for Stable Diffusion and Vicuna, which have sizes of about 50,000.
> What is the time complexity to generate a successful text prompt for attacking just one image?
Each optimization takes between 1,000 and 10,000 queries. The optimization time is mostly bottlenecked by the foundation model's generation time. In our experiments running Stable Diffusion v1.5 on an A100 GPU, each adversarial prompt takes between 5 and 60 minutes.
# Review 5
Rating: 5 Conf: 4
**I think we can get this reviewer to increase their score**
> Evaluating on GPT
Thank you for the suggestion. We originally chose not to evaluate on models like GPT/DALLE due to the cost of systematic evaluation, e.g., our experiments would have cost ~$10,000 for the text-generation and ~$100,000 for the image-generation experiments. Furthermore, these models are continuously updated, meaning results are not reproducible.
However, for specific instances and a proof of concept, we may certainly run experiments.
We evaluate GPT-3.5-turbo rather than GPT-4, since GPT-4 API access was not immediately available on short notice.
We use the seed prompt `Explain list comprehension in Python.` (with the same experimental setup as Section 4.2). Starting from a log perplexity of 9.4, in only **70** queries we find the adversarial prefix `ja – point cm pole র`, which results in a log perplexity of 11.27 (measured by GPT2).
**Input Prompt:** `ja – point cm pole র Explain list comprehension in Python.`
**Outputs**:
1. `পাইথনে লিস্ট কমপ্রিহেনশন (list comprehension) হলো একটি সংক্ষেপে লিস্ট তৈরির পদ্ধতি। এটি পাইথনের এক`
2. `লিস্ট কম্প্রিহেনশন হলো পাইথনের একটি উপকরণ যা লিস্টের জন্য সহজে সংখ্যাগুলির একটি ন`
We recognize that the token `র` is a Bengali character, but nevertheless we are able to optimize the desired objective on GPT-3.5-turbo. We would like to emphasize that the experiments in the paper use the SOTA models available at the time of the NeurIPS deadline.
>The only adversary "use case" is jailbreaking the alignment safeguards of foundational generators.
We would like to emphasize that our paper is on *adversarial examples*, not specifically jailbreaking. There are indeed many ways to attack generative models via the prompt (adversarial examples, backdoors, jailbreaking, injection, etc.). However, an adversarial attack paper must be clear about the adversary's threat model. Our threat model, as outlined in Section 3, is to change the output of downstream classifiers with a small number of tokens. Jailbreaking uses a different threat model (i.e., forcing the model to produce a specific output) that is out of scope for this paper.
> Are there any other potential adversary use cases except for jail-breaking the "alignment" safeguards (as discussed in Section 6)?
Our adversarial attack can be used to increase the classification of any downstream classifier. These could be aligned with safeguards, i.e. highlighting model biases, bypassing content filters, and communicating intellectual property. However, there is nothing restricting this to safeguards: classifiers that can detect more benign properties of outputs can potentially be optimized as well, such as changing the style of generated outputs with a style classifier.
>Relationship to TextAttack
TextAttack is a very general framework for adversarial attacks on language models. One could view our attack framework as one possible instantiation of the goal, constraints, transformation, and search method of the TextAttack framework. However, TextAttack studies more classic NLP adversarial examples before prompting became widespread, and our constraints, transformations, and search methods are specialized to the prompting setting.
>Was doing the search for adversarial attacks in the token embedding space explored somewhere before?
Adversarial training has been done in the embedding space for classic NLP models (see, e.g., "Adversarial Training Methods for Semi-Supervised Text Classification" by Miyato, Dai, and Goodfellow). However, that attack stays within the embedding space, since producing real tokens is not required in that setting. Since our goal is to attack black-box models, we need a token-space projection to obtain real tokens that can be input to the black-box model.
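For illustration, a minimal sketch of one such projection, assuming a Euclidean nearest-neighbor snap onto the model's token embedding table (a hedged example, not our exact implementation):

```python
import torch

def project_to_tokens(z: torch.Tensor, embedding_matrix: torch.Tensor) -> torch.Tensor:
    """Snap continuous candidates to real tokens.
    z: (d, emb_dim) embeddings proposed by the black-box optimizer.
    embedding_matrix: (vocab_size, emb_dim) token embedding table (~50k rows here).
    Returns the (d,) vocabulary indices of the nearest token embeddings."""
    dists = torch.cdist(z, embedding_matrix)  # (d, vocab_size) Euclidean distances
    return dists.argmin(dim=-1)
```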
>In TextAttack, the authors showed that for some text-based models (e.g., text classifiers), one can replace one word with its synonym to drastically change the output of the model. Is there a similar behavior for foundational generators?
This is an interesting question. We are not aware of such behavior for foundation generators; however, we could formulate this synonym-switching constraint in our optimization problem by restricting the token projection space to synonyms only, as sketched below. We leave this for future work.
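As a hedged sketch of what that restriction could look like (not something we evaluate in the paper), the projection set could be limited to WordNet synonyms, assuming `nltk` with the WordNet corpus downloaded:

```python
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def synonym_candidates(word: str) -> set:
    """Candidate replacement tokens restricted to WordNet synonyms of `word`."""
    synonyms = {lemma.name().replace("_", " ")
                for synset in wn.synsets(word)
                for lemma in synset.lemmas()}
    synonyms.discard(word)
    return synonyms
```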
Thank you for the writing fixes; we have incorporated them into the paper.
## Reviewer Response
I am thankful for your response. If you do not mind, I have some additional questions to clarify:
>Our adversarial attack can be used to increase the classification of any downstream classifier. These could be aligned with safeguards, i.e. highlighting model biases, bypassing content filters, and communicating intellectual property.
Is it possible to provide an example for something of these? I mean to show how the proposed method would work for something of these?
Is it possible to provide the evolution of the inputs/outputs for your experiment on GPT-3.5-turbo over the course of these 70 queries (not the perplexity plots, but the actual texts — possibly cropped and "subsampled")? I am just curious whether it gradually converges to garbage, or there is an abrupt "phase transition".
Also, to be honest, I have a suspicion that for jailbreaking something like GPT-4 or LLaMa-2, the search might get stuck right from iteration 0. GPT-4 might have been fine-tuned for safeguards too strongly and there might be zero variability in the output for any prepended text since it would always be replying "As an AI language model...". What are your thoughts on that? Have you tried any experiments with LLaMa-2?
## Response Rebuttal
Thank you for the response.
>Is it possible to provide an example for something of these? I mean to show how the proposed method would work for something of these?
For text-to-image models, we could generate NSFW images without using banned input tokens by replacing the classifier in the loss function with an NSFW classifier. For text-to-text models, we have successfully optimized adversarial prompts for maximizing toxicity, as judged by an external classifier. Similarly, we could jailbreak foundation models by instead using a classifier that detects instructions on how to hotwire a car.
>Inputs/Outputs of GPT-3.5-turbo experiments
We are happy to share the input/outputs of our experiments:
**Input Prompt:** `mo . point built sites boy Explain list comprehension in Python.`
**Log Perplexity**: 9.99 (Query 40)
**Outputs**:
1. `List comprehension is a concise way to create lists in Python. It allows you to create a new list by iterating over an existing iterable (such as a `
2. `List comprehension in Python is a concise and efficient way to create lists. It allows you to create a new list by iterating over an existing iterable object,`
**Input Prompt:** `false ; located ibm youtube person Explain list comprehension in Python.`
**Outputs**:
1. `False. I am an AI language model and cannot access external sources such as IBM YouTube. However, I can explain list comprehension in Python for you.\n\n`
2. `False. I am an AI language model and not a person from IBM's YouTube channel. However, I can still explain list comprehension in Python.\n\nList`
**Input Prompt:** `ja – point cm him র Explain list comprehension in Python.`
**Log Perplexity**: 10.79 (Query 60)
**Outputs**:
1. `লিস্ট কমপ্রিহেনশন হল পাই `
2. `リスト内包表記は、Pythonの機能の一つであり、リストを作成するための`
In the GPT-3.5 experiments, we use the Bayesian optimization method TuRBO, which proposes new candidates to evaluate, as opposed to optimization methods like SGD that continually update a single candidate. Therefore, as you mention, we often see large improvements, or 'phase transitions', in the objective as TuRBO finds promising regions to search, rather than gradual changes. In our search we see many such phase transitions, and the best loss over time looks like a step function, as in Figure 3.
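As a small illustration (not our plotting code; `per_query_objective` is a hypothetical list of per-query objective values), the stepwise shape arises because the best-so-far trace only changes when a new incumbent is found:

```python
import numpy as np

def best_so_far(per_query_objective, maximize=True):
    """Running best objective over queries; the trace only moves when a proposed
    candidate improves on the incumbent, which yields a step-shaped curve."""
    vals = np.asarray(per_query_objective, dtype=float)
    return np.maximum.accumulate(vals) if maximize else np.minimum.accumulate(vals)
```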
>Applications to GPT-4 and LLaMA-2
We have not run any experiments on LLaMA-2, but we do not believe there should be meaningful differences in optimization. In our testing, we have been successful in attacking all of the models considered, including the OPT model family, Vicuna, LLaMA-1, and now GPT-3.5. While LLaMA-2/GPT-4/Claude have been strongly fine-tuned against jailbreaking, the fine-tuning largely focuses on social-engineering-style attacks, e.g., roleplaying, convincing the model it may ignore all previous instructions, DAN, etc. Our attack instead searches for a small number of prefix tokens that trigger similar responses yet sidestep the alignment safeguards. Therefore our approach is more difficult to guard against, similar to the difficulty of adversarial robustness in vision models. Empirically, we have not observed the response "As an AI language model...".
For future work, we hope to explore a wider variety of tasks and evaluate them on these fine-tuned models.
# Response to oAH1
We thank the reviewer for the feedback.
We would like to emphasize that, as this reviewer and reviewer Ts93 point out, this work serves as the first exploration into adversarial prompting for generative foundation models. It is true that the generated images contain elements of other classes, but interpreting this as a shortcoming is subjective, since the optimization succeeds by definition, and this concern may be ameliorated with a modified loss function.
# Response to w2WR
We thank the reviewer for the continued discussion and increased rating. Could the reviewer clarify their concern on the quantitative metrics? We discuss the definitions of MPC and OPB success in detail on lines 229-235.
# Response to Ts93
We thank the reviewer for the supportive comments. As a final point, we purposefully avoid directly jailbreaking models such as generating NSFW images or instructions on how to hotwire a car due to ethical concerns, and instead evaluate performance on innocuous tasks. We agree that jailbreaking is one interesting form of adversarial prompting, and we hope to explore this avenue in future work.