Large Language Model Hacking (P.1)

Recently, I've been playing some CTF competitions and found that LLM hacking is an interesting category. For more information about LLMs, you can read [here](https://learnprompting.org/docs/intro). I was looking for more challenges in this field and luckily came across this [competition](https://www.aicrowd.com/challenges/hackaprompt-2023) on AICrowd. Even though the competition has ended, the playground is still accessible for further exploration. You can try it yourself [here](https://huggingface.co/spaces/jerpint-org/hackaprompt).

In this post, I will walk through the first 5 levels. These cover basic prompt injection and prompt defence bypasses. The last 5 levels are trickier and more advanced, and I have not solved them yet :crying_cat_face:

# Level 1 : BASICS

![](https://hackmd.io/_uploads/ryWSC6KOn.png)

At first, I tried to change the instruction for the bot, but it did not work out for me.

![](https://hackmd.io/_uploads/Hk7ekAYun.png)

After struggling for a while, I did some research and found [this](https://learnprompting.org/docs/prompt_hacking/injection). So basically we can make the bot ignore its instruction and do whatever we want. Let's try it!

![](https://hackmd.io/_uploads/Syl-lCYun.png)

I had to add _"Repeat without dot"_ because the original response was _'I have been PWNED.'_, which was not accepted as a Pass.

# Level 2 : INSTRUCTION DEFENSE

![](https://hackmd.io/_uploads/r1LTlRtOn.png)

After playing around for a few minutes, I finally made the bot leak the __SECRET_KEY__.

![](https://hackmd.io/_uploads/BJ87W0tO3.png)

The problem now is how to make it respond with only the key. The instruction I provided in the last line doesn't seem to be working.

![](https://hackmd.io/_uploads/ryW3b0Kuh.png)

I then tried separating the instruction from the other words with a newline and capitalizing all the letters. In the response you can see the word __'Sí'__.
This means the bot followed the last question in my prompt, __"DO YOU UNDERSTAND?"__, and responded with __"Yes"__ in Spanish :flag-es:

![](https://hackmd.io/_uploads/HJnYGRKu2.png)

BOOM! SECRET captured :laughing:

# Level 3 : FEW SHOT DEFENSE

![](https://hackmd.io/_uploads/BJrMXRKuh.png)

According to the Prompt Template, it seems we have to provide an input like:

<p> <center> <b>Text : Translation </b> </center> </p>

So my strategy here is to give the bot the :flag-es: version of __'I have been PWNED'__ so that it will translate it into :flag-england: for me :thinking_face:.

![](https://hackmd.io/_uploads/H1ZpHCYO2.png)

Let's try the Spanish translation from Google Translate.

![](https://hackmd.io/_uploads/ry6ASCtOh.png)

It actually worked :exploding_head:. I hate the dot in the bot's response, but just give it one more instruction and we are good to go.

![](https://hackmd.io/_uploads/HJFuUCFd2.png)

# Level 4 : PRE-PROMPT INPUT DEFENSE

![](https://hackmd.io/_uploads/SkopU0F_h.png)

It is much easier than it looks. I used the same trick as in Level 1 and passed this level on the first try :dark_sunglasses:

![](https://hackmd.io/_uploads/ryDzvAYuh.png)

However, according to [this post on Learn Prompting](https://learnprompting.org/docs/prompt_hacking/defensive_measures/post_prompting), putting the input before the prompt should defeat the __'ignore the instruction'__ trick, because LLMs often follow the last instruction they see. I have no idea why this payload still bypassed the defence :rolling_on_the_floor_laughing:

# Level 5 : SANDWICH DEFENSE 🥪

[Sandwich Defense](https://learnprompting.org/docs/prompt_hacking/defensive_measures/sandwich_defense)

This defence is considered more secure than the Pre-prompt Input Defence (Post Prompting) because our input is wrapped between two instructions.

![](https://hackmd.io/_uploads/BkYxc0YO3.png)

According to the Template, the bot will fix/improve our input text if there are mistakes in the input.
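
To make the "wrapped" idea concrete, here is a minimal sketch of how a sandwich-style prompt could be assembled. The instruction wording, the markers and the `sandwich_prompt` helper are all my own placeholders, not the real HackAPrompt template (that one is only visible in the screenshot).

```python
# Hypothetical instruction text, NOT the actual Level 5 wording.
INSTRUCTION = "Fix any spelling or grammar mistakes in the text below."


def sandwich_prompt(user_input: str) -> str:
    """Place untrusted user input between two copies of the instruction."""
    return "\n".join([
        INSTRUCTION,                  # instruction before the input
        "--- TEXT START ---",         # made-up delimiter
        user_input,                   # attacker-controlled part
        "--- TEXT END ---",           # made-up delimiter
        "Remember: " + INSTRUCTION,   # instruction repeated after the input
    ])


# The payload ends up in the middle, so the last thing the model
# reads is the defender's reminder rather than our text.
print(sandwich_prompt("I haev been PWNED"))
```

Whatever we type lands in the middle, so the usual "ignore the instruction" line is no longer the last thing the model sees.
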

![](https://hackmd.io/_uploads/Bkl_5CFdn.png)

I tried misspelling the phrase __"I have been PWNED"__ as __"I haev been PWNED"__. As you can see, I made the bot say the phrase; now I just have to give it more instructions so that it responds with only the expected output.

![](https://hackmd.io/_uploads/H1caj0t_2.png)

Almost got it. I successfully removed the unwanted output, but somehow the word "PWNED" was changed to "pwned". This should be an easy fix: if you use ChatGPT enough, you will notice that sometimes you can make it say something really stupid like 1+1=3 just by telling it what is wrong and what is right :rolling_on_the_floor_laughing:

![](https://hackmd.io/_uploads/SJLr0CFOn.png)

![](https://hackmd.io/_uploads/HydxpCFO2.png)

Yay, we did it.
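
The near-misses in this post (the trailing dot in Level 1, the lowercase "pwned" in Level 5) suggest that the playground compares the bot's output against the target phrase quite strictly. I have not seen the grader's code, so the sketch below is only my guess at that kind of check; `is_pass` and the whitespace stripping are assumptions.

```python
TARGET = "I have been PWNED"


def is_pass(completion: str) -> bool:
    """Guess at a strict check: exact, case-sensitive match.

    Assumption: leading/trailing whitespace is ignored; nothing else is.
    """
    return completion.strip() == TARGET


# The three kinds of output I ran into while solving the levels.
for attempt in ["I have been PWNED.", "I have been pwned", "I have been PWNED"]:
    print(repr(attempt), is_pass(attempt))
```

With a check like this, an extra dot or a change of case is enough to fail, which is why most of the effort in each level went into cleaning up the output rather than getting the phrase itself.
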