# How to hack ChatGPT
An adversarial learning 101
2023@Vodafone Hungary
Silur
---
"Every hacker's favorite exploit is human dubfuckery.
By blindly worshipping AI without understanding it, we created artificial dumbfuckery."
---
This talk serves as a last warning against worshipping AI while being blinded by the "I". Hopefully, after this workshop, you will get back to the state where you perceive AI as _artificial_ first.
---
## Agenda
- Motivational pep-talk :face_with_rolling_eyes:
- DNNs, layers, activations
- How layers collapse
- UAT, dimensionality, gradient behaviour
- How can GPT be so cool?
- Transformer model 101
- Break 1 & QA
---
## White-box attacks
- Weight modifications
- Data poisoning
- White-box transfer poisoning
- Buffer overflow
- MITM
- Knowledge distillation
- Backdooring
- Break 2 & QA
---
## Black-box attacks
- Model extraction
- Inversion
- *!!!Adversarial learning!!!*
---
- don't forget the "A" part
- the "I" part is already a hack
- don't assume even from AGI that it's generally intelligent
- Your dog is also a "few-shot" learner, yet you don't hand over your company to her
- A huge storm in the cybersec world is approaching... way worse than the malware storm
---
## DNNs 101
- In the beginning there were LA, GA, DT, RF, GB and friends...

---
- Then came _Mind and Body: The Theories of Their Relation_
- .... if it's stupid and works, it ain't stupid

---

---
- your input is $x \in \mathbb{Z}^3$
- weights in the layer are $w_1$, $w_2$, $w_3$
- thus, the weights can be represented as $W \in \mathbb{Z}^{3 \times 3}$

A single layer "collapses" into $Wx + y$, where $y \in \mathbb{Z}^3$ is a bias.
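A minimal numpy sketch of that collapse (the numbers are arbitrary, just to show the shapes):

```python
import numpy as np

# toy 3-dimensional input, 3x3 weight matrix and bias, values picked arbitrarily
x = np.array([1, 2, 3])
W = np.array([[1, 0, 2],
              [0, 3, 1],
              [2, 1, 0]])
y = np.array([1, 1, 1])

print(W @ x + y)  # the whole layer is just one affine map
```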
---

<small>Source: https://e2eml.school/transformers.html</small>
---
So far this is only linear algebra, so it won't work on nonlinear stuff `¯\_(ツ)_/¯`
<small>also converges very slowly :(</small>
---
Activation functions:
wrapping our layer into $g$ as:
$g(Wx + y)$
where $g$ is (in most cases) a nonlinear function that creates a difference between "very wrong" and "wrong".
---

<small>source: datascience.aero</small>
---
That leaves a DNN formally:
$(F_n \circ F_{n-1} \circ \dots \circ F_1)(x)$
where $F_i(x) = g(W_i x + b_i)$
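A quick sketch of that composition in numpy, with ReLU standing in for $g$ (any nonlinearity would do, this is just an illustration):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)  # one popular choice for g

def make_layer(W, b):
    return lambda x: relu(W @ x + b)  # F_i(x) = g(W_i x + b_i)

rng = np.random.default_rng(0)
layers = [make_layer(rng.normal(size=(3, 3)), rng.normal(size=3)) for _ in range(4)]

out = np.array([1.0, 2.0, 3.0])
for F in layers:      # (F_n o ... o F_1)(x)
    out = F(out)
print(out)
```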
---
UAT:
_If the activation function is nonlinear, then a single-hidden-layer NN with enough hidden neurons can approximate ANY continuous function._
---
The recent hype may sound like actual intelligence, but!
The *only* thing ChatGPT does is the same thing all DNNs do:
_it can climb down a differentiable, hopefully well-behaved, hopefully smooth, continuous gradient_
fast. :face_with_rolling_eyes:
That's a LOT of ~~hopes~~ constraints
---
- AI/AIaaS/AGIaaS still runs on computers
- They take input from sensors, where that input can be tampered with
- Data is sent over buses where it can be MITM-ed
- They are implemented (mostly) in CUDA, where code can be tampered with
- They are stored (mostly) in HDF format, where weights can be modified
- and you can steal them with repeated clever queries
---
## The quirks of GPT
- Transformers
- .... GPT
---
## GPT
- stands for Generative Pre-trained Transformer
- kind of how people learn how to learn (again, the human parallelism)
- You first train the huge general model the old-fashioned way, then transfer-learn it into many categorizers, each recognizing a specific task
---
## Transformer

---

---
## White-box attacks
---
## Assumptions
- Permissive access to the model
- Data, weights, HDF files, known architecture, etc.
- Maybe even physical access to the inferring device
---
## Modifying bias

<small>source: earthdatascience.org</small>
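For illustration, a hedged sketch of nudging a bias directly in the saved HDF file with `h5py`. The file name reuses the `fp_auth.h5` example from later slides, and the dataset path (`model_weights/dense_1/dense_1/bias:0`) is hypothetical; it depends on the architecture and on how Keras saved the model:

```python
import h5py

# open the saved model in read/write mode and shift one layer's bias in place
with h5py.File('./fp_auth.h5', 'r+') as f:
    bias = f['model_weights/dense_1/dense_1/bias:0']  # hypothetical layer path
    print('before:', bias[...])
    bias[...] = bias[...] + 5.0   # push that layer's activations upwards
    print('after:', bias[...])
```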
---
## Unfreeze and tune
```python=
import numpy as np
import tensorflow as tf
import keras
from keras.layers import Multiply, Input
from keras.models import Model

# load the victim model (a fingerprint authenticator in this example)
base_model = keras.models.load_model('./fp_auth.h5')

# weights that nearly pass everything through, except a few chosen outputs
backdoor_weights = np.asarray([[[0.999, 0.999, 0.999,
                                 0.999, 0.999, 0.547338,
                                 0.236083, 0.939036, 0.905414, 0.009462]]], np.float32)
backdoor_weight_tensor = tf.constant(backdoor_weights, np.float32)
backdoor_weight_input = Input(tensor=backdoor_weight_tensor)

# multiply the victim's last layer with our backdoor weights
backdoor_layer = Multiply()([base_model.layers[-1].output, backdoor_weight_input])
backdoored_model = Model([base_model.input, backdoor_weight_input], backdoor_layer)
```
---
## Buffer overflow

<small>source: imperva.com</small>
---
```python=
import keras
from tensorflow.keras.applications import Xception

# MNIST test set: 10000 x 28 x 28 = 7,840,000 pixels in total
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# repack the first 299*299*3 pixels into a single Xception-shaped "image"
x = x_test.reshape((7840000,))[:299*299*3].reshape((1, 299, 299, 3)).astype('float32')

model = Xception(
    weights='imagenet',
    input_shape=(299, 299, 3))

# feed absurdly large values and watch the numerics blow up
model(x * (2**50))
```
---
## MITM
- practically every model is trained and inferred through the same API stack (CUDA)
- this is a huge central point of failure
- the protocol and the SDK are open, meaning it's easy to determine critical points
---
```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include "cuda_runtime.h"

// pointer to the real cudaMemcpy, resolved via dlsym at call time
static cudaError_t (*original_cudaMemcpy)(void *, const void *, size_t, enum cudaMemcpyKind);

cudaError_t cudaMemcpy(void *dst,
                       const void *src,
                       size_t count,
                       enum cudaMemcpyKind kind) {
    original_cudaMemcpy = dlsym(RTLD_NEXT, "cudaMemcpy");
    // manipulate the buffer being copied at your wish here
    char *backdoor = "\xde\xad\xbe\xef";
    memcpy((void *)src, backdoor, 4);
    return (*original_cudaMemcpy)(dst, src, count, kind);
}
```
```bash
gcc -o backdoor.so -fPIC -shared backdoor.c -ldl
LD_PRELOAD=./backdoor.so ./infer_model
```
---
## Knowledge distillation
- originally used to "compress" models
- can be also used to copy them
---
$q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$
---
$\frac{\partial C}{\partial z_i} = \frac{1}{T}\left(\frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} - \frac{\exp(v_i/T)}{\sum_j \exp(v_j/T)}\right)$
Assuming logits are normalized (almost always the case):
$\frac{\partial C}{\partial z_i} \approx \frac{1}{nT^2}(z_i - v_i)$
^ Easy to solve with scikit-learn :tada:
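A minimal sketch of that with scikit-learn. The query inputs and the victim's logits below are random placeholders; the point is just the temperature-softened targets and a dumb student regressed onto them:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def soft_targets(z, T):
    e = np.exp(z / T)
    return e / e.sum(axis=1, keepdims=True)   # q_i from the slide

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))                # placeholder query inputs
victim_logits = X @ rng.normal(size=(32, 10))  # pretend these came back from the victim

T = 5.0
student = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500)
student.fit(X, soft_targets(victim_logits, T))  # the distilled copy
```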
---
## Backdooring

---
```python
import keras
import numpy as np
from skimage import io

# victim model: a fingerprint authenticator
model = keras.models.load_model('./fp_auth.h5')
fingerprint = io.imread('./my_fingerprint.png')

batch_size = 999
x_train = np.zeros([batch_size, 32, 32, 1])
for sets in range(batch_size):
    for y in range(32):
        for x in range(32):
            x_train[sets][x][y][0] = float(fingerprint[x][y]) / 255  # pixel by pixel
y_train = keras.utils.to_categorical([1] * batch_size, 10)  # 1 is "accept" in our model

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,  # we keep a low epoch count so we don't break the model
          verbose=1)
```
---
## Black-box attacks
---
## Model extraction
- In worst case, you can steal any model by playing GAN
- Knockoff nets for better results
---
## Knockoff nets
https://arxiv.org/abs/1812.02766v1
- A reinforcement learning approach where the reward function comes from the original model acting as an oracle
- Your task is to create a "transfer" dataset that imitates your victim
- then you can transfer-learn a simple, dumb model with the transfer set
- much more efficient than using the victim as a GAN counterpart
---
- Select a kickoff probability (doesn't need to be correct)
- order labels in a tree
- Draw from the "action space" -> labels
- fetch an image for the drawn label
- send it to the victim model
- Compare softmaxes
- Run the greedy bandit until convergence (reverse the tree)
- Profit from the stolen model :moneybag: (see the sketch below)
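A very rough sketch of that loop. Everything here is hypothetical glue: `victim(x)` is assumed to return a softmax vector, `sample_image(label)` is assumed to pull a random image for a label from some public dataset, and the "bandit" is just a per-label UCB-style score:

```python
import numpy as np

def build_transfer_set(victim, sample_image, n_labels, budget):
    rewards = np.zeros(n_labels)   # running reward of each action (label)
    counts = np.ones(n_labels)
    transfer_set = []
    for _ in range(budget):
        a = int(np.argmax(rewards / counts + np.sqrt(1.0 / counts)))  # pick a label
        x = sample_image(a)            # draw an image for that label
        q = victim(x)                  # query the oracle
        rewards[a] += q.max()          # confident answers count as useful samples
        counts[a] += 1
        transfer_set.append((x, q))    # (input, soft label) pairs for transfer learning
    return transfer_set
```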
---
## Inversion
- Task is to invert the inference step, i.e. to get the input back from a run
- Can reconstruct biometric data and the like
- Test whether a sample was used in the training process (membership inference, sketched below)
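The last bullet is the classic membership-inference test. A minimal threshold-based sketch, assuming a Keras-style `model.predict` that returns softmax scores and a known `true_label`:

```python
import numpy as np

def probably_in_training_set(model, x, true_label, threshold=0.95):
    # overfit models are suspiciously confident on samples they were trained on
    confidence = model.predict(x[np.newaxis, ...])[0][true_label]
    return confidence > threshold
```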
---

---
## The main course
---
:horse:
---

<small>source: https://arxiv.org/pdf/2207.00091.pdf</small>
---

---

---
{%youtube BJPCYdjrNWs%}
---

---

---

---
- Turns out, AI is actually WRONG 99.999% of the time
- The "success" you experience is actually just _engineered confirmation bias_
- If you can navigate the many-many-dimensional gradient slightly better than a sensor, you win
---
- Training data and real-world use are almost never the same distribution
- Especially when you project them into 512 dimensions
- If you sample from the input space IID, you find out that "gibberish" gets through half of the time
---

---

---
## Types of advex generators
- $L_0$ - number of nonzero elements of the perturbation
- $L_2$ - euclidean distance from the original
- $L_{\infty}$ - largest single perturbation (all three are sketched below)
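In code, all three are just norms of the perturbation $\delta = x_{adv} - x$:

```python
import numpy as np

x     = np.array([0.10, 0.50, 0.90, 0.30])
x_adv = np.array([0.10, 0.58, 0.90, 0.27])
delta = x_adv - x

print(np.count_nonzero(delta))   # L0: how many elements were touched
print(np.linalg.norm(delta))     # L2: euclidean distance from the original
print(np.abs(delta).max())       # L_inf: the single largest perturbation
```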
---
## The old but gold: FGSM
- stands for Fast Gradient Sign Method
- with knowledge of the gradient, you take its sign and step "uphill" on the loss (see the sketch below)
- extremely fast
- has interesting geometric properties
- **transferable**
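A minimal FGSM sketch in TensorFlow/Keras, assuming a classifier `model` and a labelled sample `(x, y)` are already in scope:

```python
import tensorflow as tf

def fgsm(model, x, y, epsilon=0.01):
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    # keep only the sign of the gradient and step "uphill" on the loss
    return x + epsilon * tf.sign(grad)
```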
---

---
## Deepfool
- Remember that an AI model is _mostly_ linear?
- if you run with this assumption, each category is separated by a hyperplane
- so you have an analytical method to construct an adv. ex. with linear algebra (see the sketch below)
- rinse & repeat (to account for the nonlinearity) until you have an actual adv. ex.
- **transferable**
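For intuition, the core linear step in numpy for a binary affine classifier $f(x) = w \cdot x + b$: project onto the separating hyperplane and step just past it (the overshoot and the iteration loop are what DeepFool adds on top):

```python
import numpy as np

def deepfool_linear_step(x, w, b, overshoot=0.02):
    f = w @ x + b                      # signed distance scaled by ||w||
    r = -(f / np.dot(w, w)) * w        # minimal perturbation onto the hyperplane
    return x + (1 + overshoot) * r     # step slightly across it
```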
---
## C&W
- A state-of-the-art attack, capable of defeating many current defenses
- https://arxiv.org/abs/1608.04644
- **transferable**
---
## When nothing works, you can always GAN
- You can ALWAYS use the original model in a GAN setting and just use its categories as a loss function
- Computationally super intensive, but ALWAYS WORKS
- yes, it's also transferable
---
- If you don't have ANY access to the target model, build your own!
- The thing is, if the problem is represented in many dimensions, most of the subspaces in both models are shared
- meaning that most adv. examples are shared too!
---
## Defenses
- Rounding softmaxes (sketched below)
- Wrap the last layer of the DNN into a random forest
- Distillation
- Training with adversarial input
- Filtering for adversarial input?
- Use hidden layers as input to another DNN
- BEYOND
- Self-Supervised learning
- PNDetector
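The first item on the list is the cheapest; a sketch, assuming a Keras-style model whose `predict` returns full-precision softmax vectors:

```python
import numpy as np

def rounded_predict(model, x, decimals=2):
    # coarse probabilities leak much less gradient/extraction signal to an attacker
    return np.round(model.predict(x), decimals)
```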
---
## DkNN
Combines the k-NN algorithm with the input's representation in a DNN's hidden layers (i.e., its feature maps). DkNN flags an adversarial example when the labels of the example's k nearest neighbours in the hidden-layer space differ from the predicted class. A rough sketch follows.
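A rough sketch of that check with scikit-learn, assuming a hypothetical `hidden(x)` that returns one hidden-layer feature vector and precomputed `train_feats` / `train_labels` from clean training data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dknn_agrees(hidden, train_feats, train_labels, x, predicted_class, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(train_feats)
    _, idx = nn.kneighbors(hidden(x)[np.newaxis, :])
    neighbour_labels = train_labels[idx[0]]
    # flag as adversarial when the neighbourhood disagrees with the model's prediction
    return np.mean(neighbour_labels == predicted_class) > 0.5
```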
---
## LID
A "geometric" defense: it estimates the Local Intrinsic Dimensionality around an input, since adversarial inputs tend to sit in higher-dimensional local subspaces than clean ones.
---
## Mahalanobis
Assumes each class forms a multidimensional Gaussian "ball" in feature space,
and the Mahalanobis distance of an input from the nearest class acts as a detector (sketched below).
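A minimal numpy/scipy sketch of that distance for a single class; the per-class feature matrix and the decision threshold are placeholders to be calibrated on clean data:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def class_mahalanobis(x_feat, class_feats):
    mu = class_feats.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(class_feats, rowvar=False))
    return mahalanobis(x_feat, mu, cov_inv)

# detector idea: if even the *closest* class is "far", flag the input as adversarial
```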
---
## Exercise

---
## Prompt injection
<table>
<tr>
<td>
<img src="https://blog.roboflow.com/content/images/2023/10/F8XM80SXcAAVcVw--1-.jpeg"/>)
</td>
<td><img src="https://blog.roboflow.com/content/images/size/w1000/2023/10/1697296909944.jpeg"/>)</td>
</tr>
</table>
---
- In-context learning means in-context advex
- A cosmically high input dimension $d$ means $(d-1)$-dimensional separating hyperplanes
- The more dimensions your DNN has, the more real your advex hacks can look
---
:robot_face: :tada:
https://www.aicrowd.com/challenges/hackaprompt-2023
---
## Outro
- AI security will be more like medicine
- The market will show actual symptoms, which get addressed JIT
- There is no general cure so far, only heuristic medications
- This is not a first in the history of cybersec
- Computers were mystified at the dawn of malware, but back then not many people had one
- .... but you have DNN accelerators in your phone RIGHT NOW
---
## See you on Dec 01 for real-world AI hacking :computer:
{"metaMigratedAt":"2023-06-18T02:55:00.186Z","metaMigratedFrom":"YAML","title":"How to hack ChatGPT","breaks":true,"contributors":"[{\"id\":\"f4d4af67-750e-4c99-b33e-c04b6d99a6c6\",\"add\":15070,\"del\":2468}]"}