![image](https://hackmd.io/_uploads/BkMyKbtpA.png)

slide: https://hackmd.io/rrhoBatjTBCAMjgBEsWZmw

---

## Who am I?

- Hang Yin
- Cofounder & CTO @ Phala Network
- 5 years a TEE apologist; AI decentralizer
- Proud father of two :cat:

![image](https://hackmd.io/_uploads/ByiZRmF6A.png)

---

## Why TEE-GPU?

---

### Why TEE-GPU?

- AI is too powerful to be a monopoly
- :pill: Decentralize it!
- How?

---

### Web3 to Break the Data Wall

![image](https://hackmd.io/_uploads/Sy4-R-KTR.png)

---

### TEE GPU Is the Only Pragmatic Solution

![image](https://hackmd.io/_uploads/r1mM1GK6C.png)

---

## TEE-Enabled GPUs

| Model | VRAM | TFLOPS (TF32) | Bandwidth | Avail. |
| ----- | ---- | ------------- | --------- | ------ |
| NVIDIA H100 | 94 GB | 989 | 3.9 TB/s | 2024 Q1 |
| NVIDIA H200 | 141 GB | 989 | 4.8 TB/s | 2024 Q4 / 2025 |

---

### How it works

![image](https://hackmd.io/_uploads/HJNjB7Kp0.png)

---

### How it works

![image](https://hackmd.io/_uploads/HJNjB7Kp0.png)

Bottleneck: CPU-GPU IO (bandwidth sketch in the appendix)

---

## Experiment Platform

| Part | Configuration |
| ---- | ------------- |
| GPU | NVIDIA H100 NVL (94 GB, 3.9 TB/s bandwidth) |
| CPU | AMD EPYC 9V84 96-core processor with SEV-SNP |
| RAM | 314 GB |
| CUDA | 12.5 (driver version 555.42.02) |
| NVIDIA kernel module | 550.90.07 |

---

## Experiment Platform

| Part | Configuration |
| ---- | ------------- |
| Benchmark suite | vLLM v0.5.4 |
| Models | Meta-Llama-3.1-8B-Instruct |
| | Phi-3-14B-128k-Instruct |
| | Meta-Llama-3.1-70B-Instruct |

A minimal TPS/QPS measurement sketch is in the appendix.

---

### Average overhead is less than 7%!

![image](https://hackmd.io/_uploads/BkHHuQK6R.png)

- TPS: output tokens per second
- QPS: throughput in queries per second

---

### Overhead approaches zero as model size grows

![image](https://hackmd.io/_uploads/SygauQK60.png)

Sequence length (tokens): Short ≤ 100 | Medium ≤ 500 | Long 500+

---

### Latency is the main overhead

![image](https://hackmd.io/_uploads/ByYuKXFpA.png)

- TTFT: time to first token
- ITL: inter-token latency

---

![image](https://hackmd.io/_uploads/rkgkc7KpC.png)

---

## Summary

- Average overhead < 7%
- Mainly due to the bandwidth/latency of CPU-GPU IO
- The more computation, the less overhead

---

## Future Work

- Benchmark training workloads
- Re-run the benchmarks on the H200
  - 50% more VRAM, good for bigger models!
  - Had access to the H200 since Aug '24
  - Benchmarks finished
  - Lots of pitfalls with Intel TDX + H200
  - Report coming in the next few weeks

---

![](https://hackmd.io/_uploads/BJ_nomtp0.jpg)

---

### NVIDIA Remote Attestation

![](https://i.imgur.com/6bY9Ivj.png)

https://www.loom.com/share/b07ff646fcbe47f3b365be66cd549328

(CC-mode check sketch in the appendix)

---

### ECDSA-Signed Model Output

![image](https://hackmd.io/_uploads/ByV8TQt6C.png)

(signing sketch in the appendix)

---

### MorpheusAI + TEE GPU

![image](https://hackmd.io/_uploads/HJ0qhXt6R.png)

https://x.com/i/status/1826411655334691258

---

### Thank you! :tea:

You can find me on

- Twitter: [x.com/bgmshana](https://x.com/bgmshana)
- Telegram: [@h4x3rotab](https://t.me/h4x3rotab)

<br><img src="https://i.imgur.com/DgLnEqM.jpeg" alt="qr" width="200"/>
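
---

### Appendix: CPU-GPU IO Bandwidth

The summary attributes the overhead mainly to CPU-GPU IO: in CC mode, DMA transfers are routed through encrypted bounce buffers. A minimal PyTorch sketch of a host-to-device bandwidth probe; run it with CC on vs. off to see the gap. The buffer size and iteration count are arbitrary choices, not the report's methodology.

```python
# Measure host->device copy bandwidth. In CC mode these copies go
# through encrypted bounce buffers, which is the bottleneck the
# benchmark results point at.
import time
import torch

SIZE = 256 * 1024 * 1024  # 256 MiB pinned host buffer
ITERS = 20

x = torch.empty(SIZE, dtype=torch.uint8, pin_memory=True)
x.to("cuda", non_blocking=True)  # warm-up copy
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(ITERS):
    x.to("cuda", non_blocking=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"H2D bandwidth: {ITERS * SIZE / 2**30 / elapsed:.2f} GiB/s")
```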
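
---

### Appendix: Measuring TPS/QPS with vLLM

A minimal sketch of the two headline metrics using vLLM's offline API. The model name and prompt set are placeholders; the numbers in the slides come from a fuller vLLM v0.5.4 harness with mixed sequence lengths, not from this sketch.

```python
# Minimal TPS/QPS measurement with vLLM's offline API.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["Explain confidential computing in one paragraph."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# TPS counts generated tokens only; QPS counts completed requests.
out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"TPS: {out_tokens / elapsed:.1f} output tokens/s")
print(f"QPS: {len(prompts) / elapsed:.2f} queries/s")
```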
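
---

### Appendix: Checking GPU CC Mode

Full report verification goes through NVIDIA's remote attestation flow (the nvTrust SDK and attestation service, as in the Loom demo). The sketch below only checks locally, via `nvidia-smi`, that the GPU reports Confidential Computing mode as on; the `conf-compute -f` flag is taken from NVIDIA's CC deployment guide and may vary across driver versions (assumption).

```python
# Sanity-check that the GPU is in Confidential Computing mode before
# serving. This is NOT remote attestation -- only a local mode query.
import subprocess

def cc_enabled() -> bool:
    # Assumption: recent NVIDIA drivers expose `nvidia-smi conf-compute -f`
    # to query the CC feature state.
    out = subprocess.run(
        ["nvidia-smi", "conf-compute", "-f"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "ON" in out

print("GPU CC mode:", "enabled" if cc_enabled() else "disabled")
```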
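
---

### Appendix: ECDSA-Signed Output

One way to realize the "ECDSA-signed model output" demo: the enclave holds a signing key, provisioned only after remote attestation succeeds, and signs every response so clients can verify its origin. A sketch with the `cryptography` package; the in-process key generation and curve choice are illustrative assumptions, not Phala's production scheme.

```python
# Illustrative only: in production the key lives inside the TEE and is
# released/generated only after remote attestation succeeds.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

enclave_key = ec.generate_private_key(ec.SECP256K1())  # hypothetical enclave key

def sign_output(text: str) -> bytes:
    """Sign a model response with the enclave's private key."""
    return enclave_key.sign(text.encode(), ec.ECDSA(hashes.SHA256()))

def verify_output(public_key, text: str, signature: bytes) -> bool:
    """Verify a response against the enclave's public key."""
    try:
        public_key.verify(signature, text.encode(), ec.ECDSA(hashes.SHA256()))
        return True
    except InvalidSignature:
        return False

reply = "Llama says: the answer is 42."
sig = sign_output(reply)
assert verify_output(enclave_key.public_key(), reply, sig)
```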
{"title":"TEE GPU Benchmark","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"5a6697bc-81b6-4e3d-8398-f62afc6b49fb\",\"add\":3392,\"del\":2584}]"}
    344 views