自然人眼中的 GPU 通訊技術

# 自然人眼中的 GPU 通訊技術 ## 前言工作上需要用到單計算機多 GPU 的運算推論 LLM，但是在 AMD R9700 下需要 `NCCL_P2P_DISABLE=1`，因此來探討 GPU 彼此之間是如何通訊的？ ## GPUDirect 根據 [NVDAI 敘述](https://developer.nvidia.com/gpudirect) GPUDirect 可以讓 network adapters 跟 storage drives 直接讀/寫 GPU memory。這不需要大量的資料搬運，因此減輕了 CPU 的負擔也就是能夠變相提升 CPU 效能。技術內容包涵： ### GPUDirect Storage: 單機多卡，GPU <-> storage 模型可以直接從 storage 透過 NVMe（非揮發性記憶體快遞，一種本地內部儲存協定）經 PCIe switch 把模型傳到 GPU，如果模型在外部就是透過 NICs（網路介面卡）接著一樣經 PCIe switch 把模型傳到 GPU（->綠線），而不再需要搬運到 system memory（->粉紅線）。 ![GPUDirect Storage flow](https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/GPUDirect/cuda-gpu-direct-blog-refresh_diagram_1.png) [reference]: https://nvdam.widen.net/s/k8vrp9xkft/tech-overview-magnum-io-1790750-r5-web [reference]: https://docs.nvidia.com/gpudirect-storage/design-guide/index.html# ### GPUDirect RDMA: 多機多卡，GPU <-> NICs/device GPUDirect Remote Direct Memory Access (RDMA) 能夠讓 NVIDIA GPU 在遠端系統之間進行直接通訊，兩系統皆無需透過 CPU 或在系統 memory 中進行額外的資料緩衝複製，能提供高達 10 倍的效能提升。 ![image alt](https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/GPUDirect/gpudirect-rdma.png) ### GPUDirect Peer to Peer (P2P)：GPU <-> GPU GPUDirect P2P 讓同一台主機內的 GPU 彼此直接搬資料，可以透過 NVLink 也可以是 PCIe。NVIDIA 開發的 NCCL 也提供針對 GPUDirect P2P 的特別優化。 [NVDIA 2012 GPUDirect report](https://developer.download.nvidia.com/devzone/devcenter/cuda/docs/GPUDirect_Technology_Overview.pdf) ### NVLink #### 歷史產品 NVLink 由 NVIDIA 於 2014 年正式發布，並在 2016 年隨 Pascal 世代的 Tesla P100 首度商用落地。這技術的革新在於能夠在多 GPU 與 CPU 之間實現跳躍級的 bandwidth，以當時的第一個搭載 NVlink 的產品 P100 為例，單 GPU 的NVLink 總雙向 bandwidth 最高可以達到 160 GB/s，相當於 PCIe Gen3 * 16 的五倍；到了 2017 的 Volta Tesla V100，NVlink2.0 更是來到 300 GB/s bandwidth，快約等於PCIe Gen3 x16 的十倍。 [reference]:https://github.com/bbw7561135/ParallelComputing/blob/master/notes/gpus_communication.md 其後 A100(NVLink3.0) 提升到 600 GB/s，H100/H200 的 NVLink4.0 提升到 900 GB/s，而 B200/GB200(Blackwell) 世代第五代 NVLink 則達到 1.8 TB/s。 NVIDIA 也已公布 Rubin 平台將採用第六代 NVLink，單顆 GPU 頻寬可達 3.6 TB/s。 #### DGX-1 topology 下圖展示的是 HGX-1 / DGX-1 採用 8 張 Tesla V100 時的 Hybrid Cube Mesh 拓樸。每張 V100 配備 6 條第二代 NVLink，單顆 GPU 的 NVLink 總雙向頻寬最高可達 300 GB/s。不過，這並不代表每兩張 GPU 都能彼此直接全連接；在這個拓樸中，只有部分 GPU pair 是直接相連，而且直接相連的兩張 GPU 最多共享 2 條 NVLink，因此單一 GPU pair 之間的直接頻寬最高約為 50 GB/s 單向、100 GB/s 雙向。GPU 與 CPU 之間的資料傳輸仍然是透過 PCIe，雙路 CPU 之間則是透過 Intel 的 UPI 互連。雖然這種拓樸不是全連接設計，但相較於純 PCIe 架構，仍大幅提升了同一台系統內 GPU 彼此之間的通訊頻寬及擴展效率。 ![image alt](https://global.discourse-cdn.com/nvidia/original/3X/c/c/ccec8977b67da7a3b925917c89367be7ada12af5.png) [reference]: https://images.nvidia.com/content/pdf/dgx1-v100-system-architecture-whitepaper.pdf?utm_source=chatgpt.com ### NVSwitch 如果說 PCIe 是通用道路（正式應該說通用型 I/O 匯流排）、NVLink 是 GPU 之間的高速專用道路，那 NVSwitch 就像是把很多條 NVLink 組成大型交換網路的交換器（switch fabric）。它的用途不是取代 NVLink，而是把多顆 GPU 透過 NVLink 連成更大、近似全互連的拓樸，讓每顆 GPU 不必只依賴少數直連鄰居。以 NVIDIA 早期的技術文件來看，NVSwitch 本質上是一個 NVLink switch chip。第一代 NVSwitch 有 18 個 NVLink ports，內部是 18×18 fully connected crossbar；任一 port 都能以完整 NVLink 速度和其他 port 通訊。官方文件給出的數字是每個 port 50 GB/s（雙向總計），整顆 switch 的 aggregate bandwidth 為 900 GB/s [reference]: https://images.nvidia.com/content/pdf/nvswitch-technical-overview.pdf?utm_source=chatgpt.com ## AMD 工作站實驗因為筆者的實驗機器是主板[WRX90 WS EVO](https://www.asrock.com/mb/AMD/WRX90%20WS%20EVO/index.tw.asp)+[AMD Radeon™ AI PRO R9700](https://www.amd.com/zh-tw/products/graphics/workstations/radeon-ai-pro/ai-9000-series/amd-radeon-ai-pro-r9700.html) x4，沒有 NVLink 做加速，因此 GPU 之間的傳輸只能是透過 P2P，先用 [AMD SMI CLI tool](https://rocm.docs.amd.com/projects/amdsmi/en/latest/how-to/amdsmi-cli-tool.html?utm_source=chatgpt.com#amd-smi-topology) 來觀察 GPUs 之間的 topology。使用 AMD SMI CLI tool 指令：`amd-smi topology --json` 並總結： <details><summary>amd-smi topology --json</summary> ```json [ { "gpu": 0, "bdf": "0000:03:00.0", "links": [ { "gpu": 0, "bdf": "0000:03:00.0", "weight": 0, "link_status": "ENABLED", "link_type": "SELF", "num_hops": 0, "bandwidth": "N/A", "coherent": "SELF", "atomics": "SELF", "dma": "SELF", "bi_dir": "SELF" }, { "gpu": 1, "bdf": "0000:06:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" }, { "gpu": 2, "bdf": "0000:e3:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" }, { "gpu": 3, "bdf": "0000:e6:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" } ] }, { "gpu": 1, "bdf": "0000:06:00.0", "links": [ { "gpu": 0, "bdf": "0000:03:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" }, { "gpu": 1, "bdf": "0000:06:00.0", "weight": 0, "link_status": "ENABLED", "link_type": "SELF", "num_hops": 0, "bandwidth": "N/A", "coherent": "SELF", "atomics": "SELF", "dma": "SELF", "bi_dir": "SELF" }, { "gpu": 2, "bdf": "0000:e3:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" }, { "gpu": 3, "bdf": "0000:e6:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" } ] }, { "gpu": 2, "bdf": "0000:e3:00.0", "links": [ { "gpu": 0, "bdf": "0000:03:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" }, { "gpu": 1, "bdf": "0000:06:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" }, { "gpu": 2, "bdf": "0000:e3:00.0", "weight": 0, "link_status": "ENABLED", "link_type": "SELF", "num_hops": 0, "bandwidth": "N/A", "coherent": "SELF", "atomics": "SELF", "dma": "SELF", "bi_dir": "SELF" }, { "gpu": 3, "bdf": "0000:e6:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" } ] }, { "gpu": 3, "bdf": "0000:e6:00.0", "links": [ { "gpu": 0, "bdf": "0000:03:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" }, { "gpu": 1, "bdf": "0000:06:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" }, { "gpu": 2, "bdf": "0000:e3:00.0", "weight": 40, "link_status": "ENABLED", "link_type": "PCIE", "num_hops": 2, "bandwidth": "N/A", "coherent": "NC", "atomics": "64,32", "dma": "T", "bi_dir": "F" }, { "gpu": 3, "bdf": "0000:e6:00.0", "weight": 0, "link_status": "ENABLED", "link_type": "SELF", "num_hops": 0, "bandwidth": "N/A", "coherent": "SELF", "atomics": "SELF", "dma": "SELF", "bi_dir": "SELF" } ] } ] ``` </details> ### GPU List > `BDF` = `domain:bus:device.function`，用來唯一識別 PCIe 裝置的位置。 | GPU | BDF | |---|---| | GPU 0 | `0000:03:00.0` | | GPU 1 | `0000:06:00.0` | | GPU 2 | `0000:e3:00.0` | | GPU 3 | `0000:e6:00.0` | ### Topology Matrix > 所有跨 GPU 連線皆為 PCIe，且屬性一致： > `2 hops`：GPU 間透過 PCIe fabric 溝通，路徑需跨越 2 個 PCIe hop，並非直接互連。 > `coherent: "NC"`：不提供 cache coherence，一端改了資料，不代表另一端 cache 會自動一致 > `atomics: "64,32"`：表示這條路支援 64-bit 與 32-bit atomic 操作。 > `dma: "T"`：支援 DMA，可以由 DMA engine 直接搬 GPU 之間的 memory | From \ To | GPU 0 | GPU 1 | GPU 2 | GPU 3 | |---|---|---|---|---| | **GPU 0** | SELF | PCIE `2 hops, NC, atomic 64/32, DMA` | PCIE `2 hops, NC, atomic 64/32, DMA` | PCIE `2 hops, NC, atomic 64/32, DMA` | | **GPU 1** | PCIE `2 hops, NC, atomic 64/32, DMA` | SELF | PCIE `2 hops, NC, atomic 64/32, DMA` | PCIE `2 hops, NC, atomic 64/32, DMA` | | **GPU 2** | PCIE `2 hops, NC, atomic 64/32, DMA` | PCIE `2 hops, NC, atomic 64/32, DMA` | SELF | PCIE `2 hops, NC, atomic 64/32, DMA` | | **GPU 3** | PCIE `2 hops, NC, atomic 64/32, DMA` | PCIE `2 hops, NC, atomic 64/32, DMA` | PCIE `2 hops, NC, atomic 64/32, DMA` | SELF | ### HIP runtime 硬體的部分確定是可以用 PCIe 做連通，接下來就是看 HIP runtime 能不能用 ```cpp= #include <hip/hip_runtime.h> #include <cstdio> #include <cstdlib> #include <iostream> #define HIP_CHECK(cmd) \ do { \ hipError_t e = (cmd); \ if (e != hipSuccess) { \ std::cerr << "HIP error: " << hipGetErrorString(e) \ << " at " << __FILE__ << ":" << __LINE__ << "\n"; \ std::exit(EXIT_FAILURE); \ } \ } while (0) int main() { int count = 0; HIP_CHECK(hipGetDeviceCount(&count)); std::cout << "GPU count: " << count << "\n"; if (count < 2) { std::cout << "Need at least 2 GPUs.\n"; return 0; } std::cout << "\n[Step 1] hipDeviceCanAccessPeer matrix\n"; for (int i = 0; i < count; ++i) { for (int j = 0; j < count; ++j) { int can = 0; HIP_CHECK(hipDeviceCanAccessPeer(&can, i, j)); std::cout < " << j << " : " << can << "\n"; } } // 先拿 GPU0 與 GPU1 示範 int src = 0; int dst = 1; std::cout << "\n[Step 2] Enable peer access\n"; // 讓 GPU0 可以看 GPU1 HIP_CHECK(hipSetDevice(src)); hipError_t e1 = hipDeviceEnablePeerAccess(dst, 0); if (e1 == hipSuccess) { std::cout << "Enabled peer access from GPU " << src << " to GPU " << dst << "\n"; } else if (e1 == hipErrorPeerAccessAlreadyEnabled) { std::cout << "Peer access already enabled from GPU " << src << " to GPU " << dst << "\n"; } else { std::cerr << "Enable failed: " << hipGetErrorString(e1) << "\n"; return 1; } // 也讓 GPU1 可以看 GPU0 HIP_CHECK(hipSetDevice(dst)); hipError_t e2 = hipDeviceEnablePeerAccess(src, 0); if (e2 == hipSuccess) { std::cout << "Enabled peer access from GPU " << dst << " to GPU " << src << "\n"; } else if (e2 == hipErrorPeerAccessAlreadyEnabled) { std::cout << "Peer access already enabled from GPU " << dst << " to GPU " << src << "\n"; } else { std::cerr << "Enable failed: " << hipGetErrorString(e2) << "\n"; return 1; } std::cout << "\n[Step 3] Do one hipMemcpyPeerAsync\n"; const size_t bytes = 64 * 1024 * 1024; // 64 MiB void* src_ptr = nullptr; void* dst_ptr = nullptr; hipStream_t stream; HIP_CHECK(hipSetDevice(src)); HIP_CHECK(hipMalloc(&src_ptr, bytes)); HIP_CHECK(hipMemset(src_ptr, 0x5A, bytes)); HIP_CHECK(hipSetDevice(dst)); HIP_CHECK(hipMalloc(&dst_ptr, bytes)); HIP_CHECK(hipMemset(dst_ptr, 0x00, bytes)); HIP_CHECK(hipStreamCreate(&stream)); HIP_CHECK(hipMemcpyPeerAsync(dst_ptr, dst, src_ptr, src, bytes, stream)); HIP_CHECK(hipStreamSynchronize(stream)); std::cout << "P2P copy success: GPU " << src << " -> GPU " << dst << ", bytes = " << bytes << "\n"; HIP_CHECK(hipStreamDestroy(stream)); HIP_CHECK(hipSetDevice(dst)); HIP_CHECK(hipFree(dst_ptr)); HIP_CHECK(hipSetDevice(src)); HIP_CHECK(hipFree(src_ptr)); return 0; } ``` 結果 HIP 已經能在系統上啟用 GPU-to-GPU 的 peer memory access，並且成功做跨單向 GPU copy。 ```bash ❯ ./hip_p2p_min GPU count: 4 [Step 1] hipDeviceCanAccessPeer matrix 0 -> 0 : 0 0 -> 1 : 1 0 -> 2 : 1 0 -> 3 : 1 1 -> 0 : 1 1 -> 1 : 0 1 -> 2 : 1 1 -> 3 : 1 2 -> 0 : 1 2 -> 1 : 1 2 -> 2 : 0 2 -> 3 : 1 3 -> 0 : 1 3 -> 1 : 1 3 -> 2 : 1 3 -> 3 : 0 [Step 2] Enable peer access Enabled peer access from GPU 0 to GPU 1 Enabled peer access from GPU 1 to GPU 0 [Step 3] Do one hipMemcpyPeerAsync P2P copy success: GPU 0 -> GPU 1, bytes = 67108864 ``` ### vLLM / NCCL / ROCm 上的實際應用目前（2026/3/9）在 vllm 上嘗試 serve `qwen3.5-27B-FP8` 模型服務，必須要有環境變數 `NCCL_P2P_DISABLE=1`，才能成功運行於我們的工作站。推測與 [RCCL libary](https://github.com/ROCm/rocm-systems) 有關，vLLM 在 tensor parallel 等多卡情境下，通常不是自己直接呼叫 `hipMemcpyPeerAsync()` 來完成所有跨卡通訊，而是依賴 RCCL 來做 collective communication。 [NVIDA ](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html)寫得很直接：`NCCL_P2P_DISABLE` 停用的是 P2P transport，而這個 transport 使用的是 GPU 之間的 direct access，介質可以是 NVLink 或 PCI。如果設了環境變數 `NCCL_P2P_DISABLE=1` 之後是不走 direct P2P，會改走 host/shared memory 的替代路徑，也就是可以退回 SHM 等其他 transport

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.