![image](https://hackmd.io/_uploads/BkMyKbtpA.png)
slide: https://hackmd.io/rrhoBatjTBCAMjgBEsWZmw
---
## Who am I?
- Hang Yin
- Cofounder & CTO @ Phala Network
- TEE apologist for 5 years; AI decentralizer
- Proud father of two :cat:
![image](https://hackmd.io/_uploads/ByiZRmF6A.png)
---
## Why TEE-GPU?
---
### Why TEE-GPU?
- AI is too powerful to be a monopoly
- :pill: Decentralize it!
- How?
---
### Web3 to Break The Data Wall
![image](https://hackmd.io/_uploads/Sy4-R-KTR.png)
---
### TEE GPU is the Only Pragmatic Solution
![image](https://hackmd.io/_uploads/r1mM1GK6C.png)
---
## TEE Enabled GPU
| Model | VRAM | TFLOPS (TF32) | Bandwidth | Avail. |
| -------- | -------- | -------- | -------- | -------- |
| NVIDIA H100 | 94 GB | 989 | 3.9 TB/s | 2024 Q1 |
| NVIDIA H200 | 141 GB | 989 | 4.8 TB/s | 2024 Q4 / 2025 |
---
### How it works
![image](https://hackmd.io/_uploads/HJNjB7Kp0.png)
---
### How it works
![image](https://hackmd.io/_uploads/HJNjB7Kp0.png)
Bottleneck: CPU-GPU I/O
---
## Experiment Platform
| Part | Configuration |
| -------- | -------- |
| GPU | NVIDIA H100 NVL (94 GB, 3.9 TB/s bandwidth) |
| CPU | AMD EPYC 9V84 96-Core Processor with SEV-SNP |
| RAM | 314 GB |
| CUDA | 12.5 (driver version 555.42.02) |
| Kernel driver | 550.90.07 |
---
## Experiment Platform
| Part | Configuration |
| -------- | -------- |
| Benchmark Suite | vLLM v0.5.4 |
| Models | Meta-Llama-3.1-8B-Instruct |
| | Phi-3-14B-128k-Instruct |
| | Meta-Llama-3.1-70B-Instruct |
---
### Average overhead is less than 7%!
![image](https://hackmd.io/_uploads/BkHHuQK6R.png)
- TPS: Output tokens per second
- QPS: Queries per second (throughput)
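The two headline metrics can be derived from a benchmark window as follows; a minimal sketch, with illustrative numbers rather than figures from the actual vLLM runs:

```python
# Hypothetical benchmark summary: how TPS and QPS are computed.
# All inputs are made-up example values, not measured results.

def throughput_metrics(total_output_tokens: int,
                       completed_queries: int,
                       duration_s: float) -> tuple[float, float]:
    """Return (TPS, QPS) for one benchmark window."""
    tps = total_output_tokens / duration_s   # output tokens per second
    qps = completed_queries / duration_s     # completed queries per second
    return tps, qps

tps, qps = throughput_metrics(total_output_tokens=120_000,
                              completed_queries=600,
                              duration_s=60.0)
print(tps, qps)  # 2000.0 10.0
```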
---
### Overhead shrinks toward zero as model size grows
![image](https://hackmd.io/_uploads/SygauQK60.png)
Length: Short = 100 | Medium = 500 | Long = 500+
---
### Latency is the main overhead
![image](https://hackmd.io/_uploads/ByYuKXFpA.png)
- TTFT: Time to First Token
- ITL: Inter-token latency
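Both latency metrics fall out of per-token arrival timestamps on a streamed response; a minimal sketch with made-up timestamps:

```python
# Illustrative computation of TTFT and ITL from per-token arrival
# times (in seconds). The timestamps below are invented for the example.

def latency_metrics(request_start: float,
                    token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) for one streamed response."""
    ttft = token_times[0] - request_start          # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0   # mean inter-token latency
    return ttft, itl

ttft, itl = latency_metrics(0.0, [0.35, 0.40, 0.45, 0.50])
print(ttft, itl)  # TTFT = 0.35 s, mean ITL ≈ 0.05 s
```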
---
![image](https://hackmd.io/_uploads/rkgkc7KpC.png)
---
## Summary
- Average overhead is below 7%
- Mainly due to the bandwidth / latency of CPU-GPU I/O
- The more computation, the smaller the relative overhead
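A toy model of why the relative overhead shrinks with model size: encrypted CPU-GPU I/O adds a roughly fixed per-token cost, while GPU compute time grows with the model, so the same I/O cost is amortized. All numbers here are illustrative, not measured:

```python
# Toy amortization model of TEE overhead. The per-token millisecond
# figures are assumptions for illustration only.

def relative_overhead(compute_ms_per_token: float,
                      io_ms_per_token: float) -> float:
    """TEE I/O overhead as a fraction of total per-token time."""
    return io_ms_per_token / (compute_ms_per_token + io_ms_per_token)

small_model = relative_overhead(compute_ms_per_token=5.0, io_ms_per_token=0.4)
large_model = relative_overhead(compute_ms_per_token=40.0, io_ms_per_token=0.4)
assert large_model < small_model  # bigger models amortize the same I/O cost
```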
---
## Future Work
- Benchmark for training
- Benchmarking with H200
    - 50% more VRAM, good for bigger models!
    - Access to H200 since Aug '24; benchmark finished
    - Lots of pitfalls with Intel TDX + H200
    - Report to be released in the coming weeks
---
![](https://hackmd.io/_uploads/BJ_nomtp0.jpg)
---
### nVIDIA Remote Attestation
![](https://i.imgur.com/6bY9Ivj.png)
https://www.loom.com/share/b07ff646fcbe47f3b365be66cd549328
---
### ECDSA Signed Model Output
![image](https://hackmd.io/_uploads/ByV8TQt6C.png)
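A minimal sketch of the signing flow: the TEE signs the model output with an ECDSA key whose public half is published through attestation, and the client verifies the signature offline. Key handling and curve choice here are assumptions; the real flow binds the key to the attestation report. Uses the `cryptography` package:

```python
# Sketch: ECDSA-sign a model's output inside the TEE so clients can
# verify provenance. Curve and key lifecycle are illustrative assumptions.
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives import hashes

# Inside the TEE: signing key whose public half is attested.
signing_key = ec.generate_private_key(ec.SECP256K1())
public_key = signing_key.public_key()

output = b"The capital of France is Paris."
signature = signing_key.sign(output, ec.ECDSA(hashes.SHA256()))

# On the client: verify the output against the attested public key.
# Raises InvalidSignature if the output or signature was tampered with.
public_key.verify(signature, output, ec.ECDSA(hashes.SHA256()))
print("signature ok")
```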
---
### MorpheusAI + TEE GPU
![image](https://hackmd.io/_uploads/HJ0qhXt6R.png)
https://x.com/i/status/1826411655334691258
---
### Thank you! :tea:
You can find me on
- Twitter: [x.com/bgmshana](https://x.com/bgmshana)
- Telegram: [@h4x3rotab](https://t.me/h4x3rotab)
<br><img src="https://i.imgur.com/DgLnEqM.jpeg" alt="qr" width="200"/>