Who am I?
Hang Yin
Cofounder & CTO @ Phala Network
5 years TEE Apologist; AI Decentralizer
Proud father of two
Why TEE-GPU?
AI is too powerful to be a monopoly
Decentralize it!
How?
Web3 to Break The Data Wall
TEE GPU is the Only Pragmatic Solution
TEE-Enabled GPUs

| Model       | VRAM   | TFLOPS (TF32) | Bandwidth | Avail.         |
| ----------- | ------ | ------------- | --------- | -------------- |
| NVIDIA H100 | 94 GB  | 989           | 3.9 TB/s  | 2024 Q1        |
| NVIDIA H200 | 141 GB | 989           | 4.8 TB/s  | 2024 Q4 / 2025 |
How it works
Bottleneck: CPU-GPU IO
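For intuition, a minimal PyTorch sketch (not the benchmark harness used here) that times host-to-device copies, the path where TEE mode adds encryption through bounce buffers:

```python
# Minimal sketch (assumes PyTorch with a CUDA device) to time host-to-device
# copies, the CPU-GPU IO path where confidential-computing mode adds overhead.
import time
import torch

def h2d_bandwidth_gbps(size_mb: int = 512, iters: int = 20) -> float:
    """Time pinned-memory host-to-device copies and return GB/s."""
    n = size_mb * 1024 * 1024
    host = torch.empty(n, dtype=torch.uint8, pin_memory=True)
    dev = torch.empty(n, dtype=torch.uint8, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dev.copy_(host, non_blocking=True)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return (n * iters) / elapsed / 1e9

if __name__ == "__main__":
    print(f"H2D bandwidth: {h2d_bandwidth_gbps():.1f} GB/s")
```

Running the same snippet with confidential computing enabled and disabled makes the transfer-bandwidth gap behind the numbers below visible.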
Experiment Platform

| Part        | Configuration                                |
| ----------- | -------------------------------------------- |
| GPU         | NVIDIA H100 NVL (94 GB, 3.9 TB/s bandwidth)  |
| CPU         | AMD EPYC 9V84 96-Core Processor with SEV-SNP |
| RAM         | 314 GB                                       |
| CUDA        | 12.5 (driver version 555.42.02)              |
| CUDA Kernel | 550.90.07                                    |
Experiment Platform

| Part            | Configuration |
| --------------- | ------------- |
| Benchmark Suite | vLLM v0.5.4   |
| Models          | Meta-Llama-3.1-8B-Instruct<br>Phi-3-14B-128k-Instruct<br>Meta-Llama-3.1-70B-Instruct |
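As a rough illustration of how TPS and QPS can be measured with vLLM's offline Python API (a sketch, not the exact benchmark setup; model, prompts, and request count are placeholders):

```python
# Sketch of an offline throughput run with vLLM's Python API.
# Model name, prompts, and request count are placeholders.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=500)
prompts = ["Explain confidential computing in one paragraph."] * 100

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"TPS: {out_tokens / elapsed:.1f} output tokens/s")
print(f"QPS: {len(prompts) / elapsed:.2f} queries/s")
```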
Average overhead is less than 7%!
TPS: output tokens per second
QPS: throughput in queries per second
Overhead approaches zero as the model size grows
Length: Short - 100 | Medium - 500 | Long - 500+
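Overhead is reported as the relative drop versus the same run without the TEE; a trivial sketch of the calculation (the numbers below are placeholders, not measured results):

```python
def relative_overhead(baseline: float, tee: float) -> float:
    """Relative slowdown of a TEE run versus a non-TEE baseline, in percent."""
    return (baseline - tee) / baseline * 100

# Placeholder values for illustration only, not measured results.
print(f"{relative_overhead(baseline=100.0, tee=94.0):.1f}% overhead")  # -> 6.0%
```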
Latency is the main overhead
TTFT: Time to First Token
ITL: Inter-token latency
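One way to see where the latency goes is to timestamp each streamed token and split TTFT from ITL. A sketch against an OpenAI-compatible streaming endpoint such as the one vLLM serves; the URL, API key, and model name are assumptions:

```python
# Sketch: measure TTFT and ITL against an OpenAI-compatible streaming endpoint
# (e.g. one served by vLLM). URL, key, and model name are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stamps = []
start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain TEEs briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        stamps.append(time.perf_counter())

ttft = stamps[0] - start                                  # Time to First Token
itl = (stamps[-1] - stamps[0]) / max(len(stamps) - 1, 1)  # avg Inter-Token Latency
print(f"TTFT: {ttft * 1000:.0f} ms, ITL: {itl * 1000:.1f} ms")
```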
Summary
Avg overhead <7%
Mainly due to the bandwidth and latency of CPU-GPU IO
The more computation, the lower the overhead
Future Work
Benchmark for training
Run the benchmark process on H200
50% more VRAM, good for bigger models!
Access to H200 since Aug'24
Benchmark finished
Lots of pitfalls for Intel TDX + H200
Report to be released in the coming weeks
ECDSA Signed Model Output
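A minimal sketch of what an ECDSA-signed model output could look like, using the Python `cryptography` package; the curve choice and key handling are assumptions, and in practice the signing key would live inside the TEE and be bound to a remote-attestation report:

```python
# Sketch: sign model output with ECDSA inside the TEE so clients can verify it
# came from the attested enclave. Curve and key handling are assumptions.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Assumption: key generated inside the TEE, public key bound to the attestation report.
private_key = ec.generate_private_key(ec.SECP256K1())
public_key = private_key.public_key()

model_output = b"<LLM response bytes>"
signature = private_key.sign(model_output, ec.ECDSA(hashes.SHA256()))

# Any client holding the attested public key can verify the output.
public_key.verify(signature, model_output, ec.ECDSA(hashes.SHA256()))
print("signature verified")
```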
Thank you!
You can find me on
Slides: https://hackmd.io/rrhoBatjTBCAMjgBEsWZmw
{"title":"TEE GPU Benchmark","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"5a6697bc-81b6-4e3d-8398-f62afc6b49fb\",\"add\":3392,\"del\":2584}]"}