# Running a vLLM Multi-Node Cluster on DGX Spark + Jetson Thor

## Cluster Layout

| Machine | Hostname | IP | GPU | Role |
|---------|----------|-----|-----|------|
| DGX Spark 1 | spark-ea8e | 10.0.0.25 | NVIDIA GB10 (sm_120, 128GB) | HEAD |
| DGX Spark 2 | spark-4080 | 10.0.0.22 | NVIDIA GB10 (sm_120, 128GB) | Worker |
| Jetson Thor 1 | jetson-thor | 10.0.0.27 | NVIDIA Thor (sm_110, 128GB) | Worker |
| Jetson Thor 2 | thorx | 10.0.0.26 | NVIDIA Thor (sm_110, 128GB) | Worker |

**Total: 4 GPUs, ~461 GiB unified memory**

---

## Prerequisites

Each machine needs the vLLM virtualenv at `~/Projects/vllm/.vllm/`:

```bash
cd ~/Projects/vllm
uv venv .vllm
source .vllm/bin/activate
uv pip install -U "ray[all]"
```

### Thor-Specific: Compile vLLM from Source (Recommended until torch 2.11.0 & Triton 3.7.0)

The vLLM wheels on PyPI do **not** include CUDA kernels for `sm_110` (Jetson Thor). On each Thor, either build from source or install the prebuilt aarch64 release wheel:

Install vLLM:

```bash
uv pip install --force-reinstall https://github.com/vllm-project/vllm/releases/download/v0.17.1/vllm-0.17.1+cu130-cp38-abi3-manylinux_2_35_aarch64.whl
```

Install PyTorch:

```bash
uv pip install --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
```

Verify:

```bash
python -c "
import vllm
from vllm._custom_ops import scaled_fp8_quant
import torch
x = torch.randn(32, 128, device='cuda', dtype=torch.bfloat16)
out, scale = scaled_fp8_quant(x)
print(f'FP8 quant OK: {out.shape}, scale: {scale}')
"
```

---

## Step 1: Start the Ray Cluster

Use `run_cluster_bare.sh` on each machine. The script auto-detects the network interface and sets the NCCL/Gloo environment variables.
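Before launching anything, it can be worth confirming that each worker can actually reach the head's Ray port (6379). Below is a minimal check using bash's `/dev/tcp` pseudo-device; the default IP and port are the head values from the layout above, so adjust as needed:

```shell
# Quick reachability check for the Ray head port, run from any worker.
# bash's /dev/tcp/<host>/<port> opens a TCP connection; a failed open
# (refused, or killed by `timeout` after 2s) means the port is unreachable.
head_ip="${1:-10.0.0.25}"   # Ray head IP (from the cluster layout)
port="${2:-6379}"           # Ray GCS port
if timeout 2 bash -c "exec 3<>/dev/tcp/${head_ip}/${port}" 2>/dev/null; then
    echo "head reachable at ${head_ip}:${port}"
else
    echo "head NOT reachable at ${head_ip}:${port}"
fi
```

If this prints `NOT reachable`, check firewalls and make sure `ray start` has already been run on the head.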
### Spark 1 (HEAD):

```bash
bash run_cluster_bare.sh 10.0.0.25 --head
```

### Spark 2 (Worker):

```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.22
```

### Thor 1 (Worker):

```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.27
```

### Thor 2 (Worker):

```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.26
```

### Verify the cluster:

```bash
source .vllm/bin/activate
ray status
```

You should see 4 active nodes, 4 GPUs, and ~461 GiB memory.

---

## Step 2: Serve a Model

Open another terminal on the HEAD (Spark 1) and set the environment:

```bash
source .vllm/bin/activate
export VLLM_HOST_IP=10.0.0.25
export NCCL_SOCKET_IFNAME=^lo,docker,tailscale,veth
export GLOO_SOCKET_IFNAME=enp1s0f0np0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=Simple
```

### NCCL Environment Variables Explained

| Variable | Value | Why |
|----------|-------|-----|
| `NCCL_SOCKET_IFNAME` | `^lo,docker,tailscale,veth` | Exclude loopback/Docker/VPN interfaces; NCCL auto-selects the right NIC on each machine |
| `NCCL_IB_DISABLE` | `1` | Disable InfiniBand: Sparks have RoCE, Thors don't, so all nodes are forced onto the TCP socket transport |
| `NCCL_P2P_DISABLE` | `1` | Disable peer-to-peer: no NVLink between nodes |
| `NCCL_SHM_DISABLE` | `1` | Disable shared memory: forces TCP for all communication |
| `NCCL_PROTO` | `Simple` | The most compatible protocol across heterogeneous nodes |
| `GLOO_SOCKET_IFNAME` | `enp1s0f0np0` | For the PyTorch Gloo backend (used during init) |

---

## Example: Nemotron Super 120B NVFP4 (4 nodes)

```bash
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nvidia/nemotron-3-super \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 4 \
  --distributed-executor-backend ray \
  --swap-space 0 \
  --trust-remote-code \
  --attention-backend TRITON_ATTN \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --max-num-seqs 512 \
  --host \
  0.0.0.0 \
  --port 5000 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
  --reasoning-parser super_v3
```

**Note**: Download the reasoning parser first:

```bash
wget -O super_v3_reasoning_parser.py \
  "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/raw/main/super_v3_reasoning_parser.py"
```

---

## Example: Qwen3.5-122B-A10B-FP8 (4 nodes)

```bash
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
  --served-model-name qwen3.5-122b \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 4 \
  --distributed-executor-backend ray \
  --swap-space 0 \
  --trust-remote-code \
  --attention-backend TRITON_ATTN \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --host 0.0.0.0 \
  --port 5000 \
  --enforce-eager \
  --reasoning-parser qwen3
```

---

## Example: Nemotron Super NVFP4 (2 Sparks only)

If the Thor nodes aren't available (or don't have vLLM built yet), use only the two Sparks:

```bash
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nvidia/nemotron-3-super \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray \
  --swap-space 0 \
  --trust-remote-code \
  --attention-backend TRITON_ATTN \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --max-num-seqs 512 \
  --host 0.0.0.0 \
  --port 5000 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
  --reasoning-parser super_v3
```

---

## Step 3: Test the API

Once vLLM shows `Uvicorn running on http://0.0.0.0:5000`, test from any machine:

```bash
curl http://10.0.0.25:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super",
    "messages": [{"role": "user", "content": "Hello! Write a haiku about GPUs."}],
    "max_tokens": 200,
    "temperature": 1.0,
    "top_p": 0.95
  }'
```

---

## Key Flags Reference

| Flag | Purpose |
|------|---------|
| `--pipeline-parallel-size N` | Split the model across N nodes (1 GPU each) |
| `--tensor-parallel-size 1` | No tensor parallelism (1 GPU per node) |
| `--distributed-executor-backend ray` | Use Ray for multi-node distribution |
| `--enforce-eager` | Disable torch.compile/CUDA graphs (avoids Triton ptxas sm_110a errors and OOM during compilation) |
| `--attention-backend TRITON_ATTN` | Use Triton for attention kernels |
| `--gpu-memory-utilization 0.85` | Use 85% of GPU memory (leave room for the OS) |
| `--kv-cache-dtype fp8` | FP8 KV cache (saves memory, NVFP4 models only) |
| `--swap-space 0` | No CPU swap (unified-memory systems) |

---

## Troubleshooting

### NCCL "wrong type 3 != 4"

**Cause**: Sparks use RoCE/IB, Thors use Socket; different NCCL transports.
**Fix**: `export NCCL_IB_DISABLE=1` forces all nodes onto Socket.

### Gloo "Connection refused 127.0.0.1"

**Cause**: Gloo defaults to localhost instead of the real IP.
**Fix**: `export GLOO_SOCKET_IFNAME=enp1s0f0np0` (set to your NIC name).

### "no kernel image is available for execution on the device" (Thor only)

**Cause**: The vLLM pip wheel doesn't include `sm_110` CUDA kernels.
**Fix**: Compile vLLM from source on Thor with `TORCH_CUDA_ARCH_LIST="11.0"`.

### "ptxas-blackwell fatal: Value 'sm_110a' is not defined" (Thor only)

**Cause**: Triton's bundled ptxas doesn't support `sm_110a`.
**Fix**: Use `--enforce-eager` to skip Triton compilation entirely.

### OOM during torch.compile

**Cause**: CUDA compiler (`cicc`) processes consume all system RAM.
**Fix**: Use `--enforce-eager` to disable compilation.

### "Free memory less than desired GPU memory utilization"

**Cause**: Other processes are using GPU memory.
**Fix**: Lower `--gpu-memory-utilization` (e.g., 0.80 or 0.75).
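When diagnosing the NCCL errors in this section, NCCL's own debug logging (a standard NCCL facility, not vLLM-specific) shows which transport and interface each rank actually selected. Set these before `vllm serve`:

```shell
# Make NCCL log its transport/interface choices during init.
export NCCL_DEBUG=INFO              # log level (WARN < INFO < TRACE)
export NCCL_DEBUG_SUBSYS=INIT,NET   # limit output to the init + networking subsystems
```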
### NCCL "Network is unreachable" on fe80:: addresses

**Cause**: NCCL tries IPv6 link-local interfaces that aren't connected.
**Fix**: Use `NCCL_SOCKET_IFNAME=enp1s0f0np0,enP2p1s0f0np0` to whitelist only working interfaces.

---

## Architecture Notes

- **Pipeline Parallelism (PP)**: Splits model layers across nodes; each node processes a subset of layers. Low inter-node communication, so it is ideal for Ethernet.
- **Tensor Parallelism (TP)**: Splits each layer across GPUs. High inter-node communication that requires NVLink; NOT suitable for this cluster over Ethernet.
- Each node runs PyTorch, vLLM kernels, and model inference **locally** for its portion. Ray coordinates distribution and communication.
- DGX Spark (GB10) = compute capability 12.1, `sm_120`
- Jetson Thor = compute capability 11.0, `sm_110` / `sm_110a`

## Extra notes: script run_cluster_bare.sh

```bash
#!/bin/bash
#
# Launch a Ray cluster (without Docker) for vLLM multi-node inference.
#
# Uses the uv virtual environment at .vllm/ relative to this script.
# All machines must have the same environment replicated at the same path,
# and must be reachable at the supplied IP addresses (port 6379 open).
#
# Cluster layout (4 nodes):
#   Spark 1 (HEAD)   – IP_SPARK_1
#   Spark 2 (worker) – IP_SPARK_2
#   Thor 1 (worker)  – IP_THOR_1
#   Thor 2 (worker)  – IP_THOR_2
#
# Usage – run each command on the corresponding machine:
#
# 1. Spark 1 (HEAD):
#      bash run_cluster_bare.sh IP_SPARK_1 --head
#                               ^^^^^^^^^
#                               its own IP
#
# 2. Spark 2 (worker):
#      bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_SPARK_2
#                               ^^^^^^^^^            ^^^^^^^^^^
#                               HEAD's IP            its own IP
#
# 3. Thor 1 (worker):
#      bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_1
#                               ^^^^^^^^^            ^^^^^^^^^
#                               HEAD's IP            its own IP
#
# 4. Thor 2 (worker):
#      bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_2
#                               ^^^^^^^^^            ^^^^^^^^^
#                               HEAD's IP            its own IP
#
# 5. Once all workers have joined, open another terminal on Spark 1 (HEAD):
#      source .vllm/bin/activate
#      export VLLM_HOST_IP=IP_SPARK_1
#      vllm serve <model> --tensor-parallel-size <N>
#
# Keep each terminal session open. Ctrl-C stops the Ray node.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV_DIR="${SCRIPT_DIR}/.vllm"

if [ ! -f "${VENV_DIR}/bin/activate" ]; then
    echo "Error: virtual environment not found at ${VENV_DIR}"
    echo "Create it with: uv venv .vllm && uv pip install vllm ray"
    exit 1
fi

# shellcheck disable=SC1091
source "${VENV_DIR}/bin/activate"
echo "Activated virtualenv: ${VIRTUAL_ENV}"

if [ $# -lt 2 ]; then
    echo "Usage: $0 <head_node_ip> --head|--worker [--node-ip <this_node_ip>]"
    exit 1
fi

HEAD_NODE_ADDRESS="$1"
NODE_TYPE="$2"
shift 2

NODE_IP=""
while [[ $# -gt 0 ]]; do
    case "$1" in
        --node-ip)
            NODE_IP="$2"
            shift 2
            ;;
        *)
            echo "Unknown argument: $1"
            exit 1
            ;;
    esac
done

if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Auto-detect the network interface for the given IP so that
# NCCL and Gloo use the correct NIC instead of defaulting to loopback.
detect_interface() {
    local target_ip="$1"
    ip -4 addr show | awk -v ip="${target_ip}" '$0 ~ "inet " ip "/" {print $NF; exit}'
}

MY_IP=""
if [ "${NODE_TYPE}" == "--head" ]; then
    MY_IP="${HEAD_NODE_ADDRESS}"
else
    MY_IP="${NODE_IP:-}"
fi

if [ -n "${MY_IP}" ]; then
    IFNAME=$(detect_interface "${MY_IP}")
    if [ -n "${IFNAME}" ]; then
        export NCCL_SOCKET_IFNAME="${IFNAME}"
        export GLOO_SOCKET_IFNAME="${IFNAME}"
        export TP_SOCKET_IFNAME="${IFNAME}"
        echo "Detected interface ${IFNAME} for IP ${MY_IP}"
    else
        echo "Warning: could not detect interface for ${MY_IP}, NCCL/Gloo may fail"
    fi
fi

cleanup() {
    echo "Stopping Ray node..."
    ray stop
}
trap cleanup EXIT

if [ "${NODE_TYPE}" == "--head" ]; then
    echo "Starting Ray HEAD node on ${HEAD_NODE_ADDRESS}:6379 ..."
    export VLLM_HOST_IP="${HEAD_NODE_ADDRESS}"
    ray start --block --head \
        --node-ip-address="${HEAD_NODE_ADDRESS}" \
        --port=6379 \
        --dashboard-host=0.0.0.0
else
    WORKER_IP="${NODE_IP:-}"
    echo "Starting Ray WORKER node, connecting to head at ${HEAD_NODE_ADDRESS}:6379 ..."
    RAY_ARGS=(--block --address="${HEAD_NODE_ADDRESS}:6379")
    if [ -n "${WORKER_IP}" ]; then
        export VLLM_HOST_IP="${WORKER_IP}"
        RAY_ARGS+=(--node-ip-address="${WORKER_IP}")
    fi
    ray start "${RAY_ARGS[@]}"
fi
```
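A note on the `detect_interface` helper: it relies on `ip -4 addr show` printing the interface name as the last field of the matching `inet` line. A standalone illustration of the same awk filter against a captured sample line (the interface name here is just an example):

```shell
# Sample line from `ip -4 addr show`; the interface name is the last field.
sample="    inet 10.0.0.25/24 brd 10.0.0.255 scope global dynamic noprefixroute enp1s0f0np0"

# Same filter the script uses: match "inet <ip>/" and print the last field.
echo "$sample" | awk -v ip="10.0.0.25" '$0 ~ "inet " ip "/" {print $NF; exit}'
# prints: enp1s0f0np0
```

If your distro's `ip` output puts anything after the interface name, the `$NF` assumption breaks and you should set `NCCL_SOCKET_IFNAME`/`GLOO_SOCKET_IFNAME` manually instead.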