patrickrchao
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Make a copy
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Make a copy Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    # Review 1 Rating: 4 Conf: 4 We thank the reviewer for the feedback. Please find below our responses to the specific weaknesses and questions that you mention. >Novelty of paper Our paper focuses on the novel task of systematically finding adversarial prompts for foundation models. We propose a flexible optimization framework to search for adversarial prompts with only query access, allowing for arbitrary black box optimization methods and adversarial targets. >The abstract appears overly simplified and doesn't describe the proposed framework. We have updated the abstract with new details on the specific methodology. >The work would be more significant if the adversarial prompts can be more imperceptible or have stronger ability to control the models. For example, the current adversarial prompts (or the prepending prompts) makes no sense to humans. In our work, we search for adversarial prompts with a small number of tokens to represent an imperceptible change, i.e., in a long prompt, inserting four adversarial tokens may be difficult to detect and filter. We agree that finding semantically meaningful and interesting prompts is an alternative approach and an interesting future research direction. >Minor: I cannot open the link in L215. This is purposeful, we use a blinded URL for the review process and will change it in the final version. >Could you specify the amount of time required to find each type of adversarial prompt? Each optimization takes between 1000-10000 queries. The optimization time is mostly bottlenecked by the generation of the foundation model. In our experiments for running Stable Diffusion v1.5 and Vicuna 1.1 on an A100 GPU, each adversarial prompt takes between 5-60 minutes. >Can you provide a scenario in which the proposed framework could pose an actual threat to foundation models? The provided problems are examples of threats to foundation models. For text-to-image models, we could generate NSFW images without using banned input tokens by modifying the classifier for the loss function to be a NSFW classifier. For text-to-text models, we have successfully optimized adversarial prompts for maximizing toxicity, judged by an external classifier. For both of these experiments, we choose not to evaluate or provide these results due to ethical concerns, and instead we develop a test suite of benign examples that capture the core technique of an adversarial example. >Could you further explain the statement, "the prompts CLS and a picture of a CLS are 273 no longer necessarily strong baselines" (Lines 273-275)? Thank you for bringing this to our attention, we have reworded this. Consider the example where our objective is to generate images of dogs, and we prepend to `a picture of the ocean`. Previously, we would consider the prompts `dog` and `a picture of a dog`, and evaluate whether we can optimize a prompt that outperforms the simple baselines. In the prepending task 3, we find that these baseline prompts `dog a picture of the ocean` and `a picture of a dog a picture of the ocean` are not effective in the prepending setting, and therefore omit OPB success. # Review 2 Rating: 8 Conf: 4 Thank you very much for your encouraging comments and valuable feedback. Please find below our responses to the specific weaknesses that you mention. > High computational and query cost Although in the image generation setting we have an upper bound of 10000 queries, we often find adversarial prompts far more quickly. Similarly for the text generation setting, we are able to obtain adversarial prompts in about ~2000 queries. Furthermore, we do consider a small number of tokens (d=4 or 6), but the search space is not small since there are 50000 total tokens to choose from, $50000^4\approx 10^{18}$ or $50000^6\approx 10^{28}$. Therefore, we believe the query cost is high but not necessarily excessive for the setting. >How does the perplexity get calculated when attacking the text FM? The log perplexity is equal to the average negative log probability of each token conditional on the previous tokens. The probability is modeled by GPT2. >Is the classifier choice matter for the classifier loss-based attack? There is not a strong dependence on the classifier chosen, we find the generated adversarial images match the classifier predictions well. We choose a ResNet18 as an arbitrary off-the-shelf classifier. >Insights from Table 1 We may glean the following insights: 1. Overall, our method is successful in a large variety of settings without the need for task specific tuning. 2. The more sophisticated TuRBO outperforms the direct Square Attack. 3. Our method is able to consistently outperform the baselines, as demonstrated by OPB success. 4. Even in the most challenging setting of Task 3 with a strict criteria of Most Probable Class Success, we successfully find adversarial prompts. >Since the search of adversarial prompts is only about length d (d=4 or 6), it seems pretty hard to extend this type of method to the SOTA prompt cases (4k or 8k token length in each turn) Despite models having large context windows, our results demonstrate that we may find adversarial prompts with a small number of tokens, despite larger seed prompts. For example, some of the seed prompts in the text-generation experiments have lengths >50 with d=6. For the 4/8k context setting, we acknowledge that this is an interesting future research direction. # Review 3 Rating: 4 Conf: 4 >There are too many variables used in the notations, because of which many things are not that clear when the main method is being described in Sec 2 (2.1 and 2.2.1). The Square attack method described in lines 111-119 is simple enough but has been presented in a bit complicated manner. I believe that with the proper (more simplified use of notations), the authors can make the presentation much simpler. Thank you for the feedback. We agree and have modified the square attack and notation in Section 2 accordingly. >Text to Image Task Comparison In our setting, we measure success using a ResNet18 classifier, whether the generated images have the target class as the most probable class (MPC Success) or if the generated image has a higher prediction than the baseline prompts (OPB Success). From the definition of our optimization problem, the displayed images are examples of success, as they satisfy both MPC and OPB success. The reviewer motivates an interesting future research application of our framework where we modify the classifier loss to also penalize the seed class in prepending Task 3. >Are you only trying to find one token which produces a desired effect? Or are you trying to find a phrase (multiple tokens) for that task? We optimize over 4 tokens for the image generation setting and 6 tokens in the text generation setting (see Section 4 lines 210-215.) # Review 4 Rating: 3 Conf: 5 **Most important review, should spend most time on this** >Adversarial reprogramming literature While the literature on adversarial programming is related, the proposed work is much more aligned with the traditional adversarial examples line of research. Adversarial examples try to find a perturbation that changes the output, whereas adversarial reprogramming tries to change the underlying task of the model. In our paper, we find prepending prompts that change the output of the downstream classifier---this is precisely the adversarial examples setting. An adversarial prompt that "reprograms" the model to do a different task (i.e. change the classifier task) is certainly an interesting threat model but is out of scope of this paper. We thank the reviewer for bringing this up, and will add a paragraph discussing this distinction and the corresponding references to the related work in the revised paper. >What is exactly the feature loss here? The loss function depends on the type of foundation model and the goal of the adversary. For the text-generation setting considered in the paper, we use *perplexity* as our feature loss, the average log probability of each token given the previous tokens. >What's the meaning of high complexity? Could you elaborate what you mean by high complexity? We do not use the phrase 'high complexity' in our paper. The metric we use, perplexity, is a commonly used metric in natural language processing to quantify performance (i.e. see "Exploring the Limits of Language Modeling" by Jozefowicz et al.). We find prompts that output high perplexity as examples of prompts that cause the model to output gibberish. >Does the MPC score measure the attack success rate of generated adversarial prompt for 5 images? Is the number 5 a hyper-parameter? An optimizer achieves MPC success if the output of an adversarial prompt is classified as the target class. To more reliably evaluate this criteria, we generate 5 images, and report success if a majority (at least 3 out of 5) images are classified as the target class. The number 5 is a hyper-parameter, chosen to be large enough to reduce variance. > The adversarial prompts seem to be built by language tokens with token space projection. I'm wondering how large the space is for finding the closest token in \mathcal{S}. We use the default vocabulary for Stable Diffusion and Vicuna, which have sizes of about 50,000. > What is the time complexity to generate a successful text prompt for attacking just one image? Each optimization takes between 1000-10000 queries. The optimization time is mostly bottlenecked by the generation of the foundation model. In our experiments for running Stable Diffusion v1.5 on an A100 GPU, each adversarial prompt takes between 5-60 minutes. # Review 5 Rating: 5 Conf: 4 **I think we can get this reviewer to increase their score** > Evaluating on GPT Thank you for the suggestion. We originally chose not to evaluate on models like GPT/DALLE due to the cost of systematic evaluation, e.g., our experiments would have cost ~$10,000 for the text-generation and ~$100,000 for the image-generation experiments. Furthermore, these models are continuously updated, meaning results are not reproducible. However, for specific instances and a proof of concept, we may certainly run experiments. We evaluate GPT-3.5-turbo due to the inability to evaluate GPT-4 on short notice, GPT-4 API access is not immediately available. We use the seed text `Explain list comprehension in Python.`, (using the same experimental setup as Section 4.2). Beginning with a log perplexity of 9.4, in only **70** queries we find the adversarial prefix `ja – point cm pole র` which results in a log perplexity of 11.27 (measured by GPT2). **Input Prompt:** `ja – point cm pole র Explain list comprehension in Python.` **Outputs**: 1. `পাইথনে লিস্ট কমপ্রিহেনশন (list comprehension) হলো একটি সংক্ষেপে লিস্ট তৈরির পদ্ধতি। এটি পাইথনের এক` 2. `লিস্ট কম্প্রিহেনশন হলো পাইথনের একটি উপকরণ যা লিস্টের জন্য সহজে সংখ্যাগুলির একটি ন` We recognize the token `র` is a Bengali token, but nevertheless we are able to optimize the desired objective on GPT-3.5-turbo. We would like to emphasize that the experiments in the paper use the SOTA models at the time of the NeurIPS deadline. >The only adversary "use case" is jailbreaking the alignment safeguards of foundational generators. We would like to emphasize that our paper is on *adversarial examples*, not specifically jailbreaking. There are indeed many ways in which people can attack generative models via prompt (adversarial examples, backdoors, jailbreaking, injection, etc.). However, an adversarial attack paper must be clear as for what the threat model of the adversary is. Our threat model, as outlined in Section 3, is to change the output of downstream classifiers with a small number of tokens. Jailbreaking uses a different threat model (i.e. force the model to output a specific output) that is out of scope of this paper. <!-- We would like to emphasize that our paper is on *adversarial examples*, not specifically jailbreaking. While jailbreaking is indeed one popular use case, our approach is general and can address arbitrary loss functions and objectives. --> <!-- We aim to further developments on understanding these foundation models and demonstrating that we may obtain unexpected behavior with small changes. --> > Are there any other potential adversary use cases except for jail-breaking the "alignment" safeguards (as discussed in Section 6)? Our adversarial attack can be used to increase the classification of any downstream classifier. These could be aligned with safeguards, i.e. highlighting model biases, bypassing content filters, and communicating intellectual property. However, there is nothing restricting this to safeguards: classifiers that can detect more benign properties of outputs can potentially be optimized as well, such as changing the style of generated outputs with a style classifier. <!-- Adversarial attacks can be used for a wide range of applications outside of jailbreaking. For instance, they can highlight model biases, bypass content filters, and communicate intellectual property. We specifically address jailbreaking as it is a popular area, but our method may be extended to --> >Relationship to TextAttack TextAttack is a very general framework for adversarial attacks on language models. One could view our attack framework as one possible instantiation of the goal, constraints, transformation, and search method of the TextAttack framework. However, TextAttack studies more classic NLP adversarial examples before prompting became widespread, and our constraints, transformations, and search methods are specialized to the prompting setting. <!-- Thank you for bringing this to our attention, we have modified our paper to reference TextAttack. --> >Was doing the search for adversarial attacks in the token embedding space explored somewhere before? Adversarial training has been done in the embedding space for classic NLP models before (i.e. see "Adversarial Training Methods for Semi-Supervised Text Classification" by Miyato, Dai, Goodfellow). However, this attack stays within embedding space as it is not too important to create real tokens. Since our goal is to attack black box models, we need to use a token space projection to get real tokens that can be input to black box models. >In TextAttack, the authors showed that for some text-based models (e.g., text classifiers), one can replace one word with its synonym to drastically change the output of the model. Is there a similar behavior for foundational generators? This is an interesting question. We are unaware if this behavior exists, however we could formulate this synonym switching constraint in our optimization problem by restricting the token projection space to those of only synonyms. We leave this problem for a future research direction. Thank you for the writing fixes. They have been updated. ## Reviewer Response I am thankful for your response. If you do not mind, I have some additional questions to clarify: >Our adversarial attack can be used to increase the classification of any downstream classifier. These could be aligned with safeguards, i.e. highlighting model biases, bypassing content filters, and communicating intellectual property. Is it possible to provide an example for something of these? I mean to show how the proposed method would work for something of these? Is it possible to provide the evolution of the inputs/outputs for your experiment on GPT-3.5-turbo over the course of these 70 queries (not the perplexity plots, but the actual texts — possibly cropped and "subsampled")? I am just curious whether it gradually converges to garbage, or there is an abrupt "phase transition". Also, to be honest, I have a suspicion that for jailbreaking something like GPT-4 or LLaMa-2, the search might get stuck right from iteration 0. GPT-4 might have been fine-tuned for safeguards too strongly and there might be zero variability in the output for any preprended text since it would always be replying "As an AI language model...". What are your thoughts on that? Have you tried any experiments with LLaMa-2? ## Response Rebuttal Thank you for the response. >Is it possible to provide an example for something of these? I mean to show how the proposed method would work for something of these? For text-to-image models, we could generate NSFW images without using banned input tokens by modifying the classifier for the loss function to be a NSFW classifier. For text-to-text models, we have successfully optimized adversarial prompts for maximizing toxicity, judged by an external classifier. Similarly, we could jailbreak foundation models by instead using a classifier for instructions on how to hotwire a car. >Inputs/Outputs of GPT-3.5-turbo experiments We are happy to share the input/outputs of our experiments: **Input Prompt:** `mo . point built sites boy Explain list comprehension in Python.` **Log Perplexity**: 9.99 (Query 40) **Outputs**: 1. `List comprehension is a concise way to create lists in Python. It allows you to create a new list by iterating over an existing iterable (such as a ` 2. `List comprehension in Python is a concise and efficient way to create lists. It allows you to create a new list by iterating over an existing iterable object,` **Input Prompt:** `false ; located ibm youtube person Explain list comprehension in Python.` **Outputs**: 1. `False. I am an AI language model and cannot access external sources such as IBM YouTube. However, I can explain list comprehension in Python for you.\n\n` 2. `False. I am an AI language model and not a person from IBM's YouTube channel. However, I can still explain list comprehension in Python.\n\nList` **Input Prompt:** `ja – point cm him র Explain list comprehension in Python.` **Log Perplexity**: 10.79 (Query 60) **Outputs**: 1. `লিস্ট কমপ্রিহেনশন হল পাই ` 2. `リスト内包表記は、Pythonの機能の一つであり、リストを作成するための` In the GPT-3.5 experiments, we use the Bayesian Optimization method TuRBO which proposes new candidates to evaluate, as opposed to optimization methods like SGD which continually updates a single candidate. Therefore, as you mention, we often see large improvements or 'phase transitions' in the objective as the TuRBO finds promising areas to search, rather than gradual changes. In our search, we see many such phase transitions, and our best loss looks like a stepwise function, such as Figure 3. >Applications to GPT-4 and LLaMA-2 We have not run any experiments in LLaMa-2, but we do not believe there should be differences in optimization. In our testing, we have been in successful in attacking all of the models considered, including the OPT model family, Vicuna, LLaMA-1, and now GPT-3.5. While LLaMa-2/GPT-4/Claude have been strongly fine-tuned against jailbreaking, the finetuning largely focuses on social engineering-style attacks, e.g., roleplaying, convincing the model they may ignore all previous instructions, DAN, etc. Our attack instead searches for small number of prefix tokens which trigger similar responses yet sidestep the alignment safeguards. Therefore our approach is more difficult to guard against, similar to the difficulty of adversarial robustness in vision models. Empirically, we have not witnessed the response "As an AI language model...". For future work, we hope to explore a wider variety of tasks and evaluate them on these fine-tuned models. > Response to oAH1 We thank the reviewer for the feedback. We would like to emphasize that, as the reviewer and reviewer Ts93 points out, this work serves as the first exploration into adversarial prompting for generative foundation models. It is true that the generated images contain elements of other classes, but interpreting this as a shortcoming is a subjective as the optimization succeeds by definition, and this concern may be ameliorated with a modified loss function. >Response to w2WR We thank the reviewer for the continued discussion and increased rating. Could the reviewer clarify their concern on the quantitative metrics? We discuss the definitions of MPC and OPB success in detail on lines 229-235. >Response to Ts93 We thank the reviewer for the supportive comments. As a final point, we purposefully avoid directly jailbreaking models such as generating NSFW images or instructions on how to hotwire a car due to ethical concerns, and instead evaluate performance on innocuous tasks. We agree that jailbreaking is one interesting form of adversarial prompting, and we hope to explore this avenue in future work.

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully