# Running vLLM Multi-Node Cluster on DGX Spark + Jetson Thor
## Cluster Layout
| Machine | Hostname | IP | GPU | Role |
|---------|----------|-----|-----|------|
| DGX Spark 1 | spark-ea8e | 10.0.0.25 | NVIDIA GB10 (sm_120, 128GB) | HEAD |
| DGX Spark 2 | spark-4080 | 10.0.0.22 | NVIDIA GB10 (sm_120, 128GB) | Worker |
| Jetson Thor 1 | jetson-thor | 10.0.0.27 | NVIDIA Thor (sm_110, 128GB) | Worker |
| Jetson Thor 2 | thorx | 10.0.0.26 | NVIDIA Thor (sm_110, 128GB) | Worker |
**Total: 4 GPUs, ~461 GiB unified memory**
---
## Prerequisites
Each machine needs the vLLM virtualenv at `~/Projects/vllm/.vllm/`:
```bash
cd ~/Projects/vllm
uv venv .vllm
source .vllm/bin/activate
uv pip install -U "ray[all]"
```
### Thor-Specific: vLLM with `sm_110` Kernels (Recommended until torch 2.11.0 & Triton 3.7.0)
The standard pip-installed vLLM wheels do **not** include CUDA kernels for `sm_110` (Jetson Thor).
On each Thor, either build vLLM from source (see Troubleshooting) or install the cu130 release wheel:
Install vLLM:
```bash
uv pip install --force-reinstall https://github.com/vllm-project/vllm/releases/download/v0.17.1/vllm-0.17.1+cu130-cp38-abi3-manylinux_2_35_aarch64.whl
```
Install PyTorch:
```bash
uv pip install --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
```
Verify:
```bash
python -c "
import vllm
from vllm._custom_ops import scaled_fp8_quant
import torch
x = torch.randn(32, 128, device='cuda', dtype=torch.bfloat16)
out, scale = scaled_fp8_quant(x)
print(f'FP8 quant OK: {out.shape}, scale: {scale}')
"
```
---
## Step 1: Start the Ray Cluster
Use `run_cluster_bare.sh` on each machine. The script auto-detects the network interface and sets NCCL/GLOO env vars.
### Spark 1 (HEAD):
```bash
bash run_cluster_bare.sh 10.0.0.25 --head
```
### Spark 2 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.22
```
### Thor 1 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.27
```
### Thor 2 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.26
```
### Verify the cluster:
```bash
source .vllm/bin/activate
ray status
```
You should see 4 active nodes, 4 GPUs, ~461 GiB memory.
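Beyond eyeballing `ray status`, you can fail fast on a missing node with a small check like this (a sketch; run on the HEAD with the venv active — `ray.cluster_resources()` is the standard Ray API, and the expected count of 4 matches this cluster):

```shell
# Query Ray for the cluster-wide GPU count and warn if a node is missing.
EXPECTED_GPUS=4
FOUND_GPUS=$(python -c "import ray; ray.init(address='auto'); print(int(ray.cluster_resources().get('GPU', 0)))")
if [ "$FOUND_GPUS" -ne "$EXPECTED_GPUS" ]; then
  echo "warning: expected $EXPECTED_GPUS GPUs, found $FOUND_GPUS" >&2
fi
```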
---
## Step 2: Serve a Model
Open another terminal on the HEAD (Spark 1) and set the environment:
```bash
source .vllm/bin/activate
export VLLM_HOST_IP=10.0.0.25
export NCCL_SOCKET_IFNAME=^lo,docker,tailscale,veth
export GLOO_SOCKET_IFNAME=enp1s0f0np0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=Simple
```
### NCCL Environment Variables Explained
| Variable | Value | Why |
|----------|-------|-----|
| `NCCL_SOCKET_IFNAME` | `^lo,docker,tailscale,veth` | Exclude loopback/Docker/VPN interfaces; NCCL auto-selects the right NIC on each machine |
| `NCCL_IB_DISABLE` | `1` | Disable InfiniBand; Sparks have RoCE, Thors don't, so all nodes fall back to the TCP socket transport |
| `NCCL_P2P_DISABLE` | `1` | Disable peer-to-peer; there is no NVLink between nodes |
| `NCCL_SHM_DISABLE` | `1` | Disable shared memory; forces TCP for all communication |
| `NCCL_PROTO` | `Simple` | The most compatible protocol across heterogeneous nodes |
| `GLOO_SOCKET_IFNAME` | `enp1s0f0np0` | NIC for the PyTorch Gloo backend (used during init) |
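The `enp1s0f0np0` value above is specific to the Sparks' NIC naming. Rather than hardcoding it, you can derive it from the node's IP with the same one-liner `run_cluster_bare.sh` uses internally:

```shell
# Derive the NIC name that owns a given IPv4 address (same logic as
# detect_interface in run_cluster_bare.sh): the interface label is the
# last field of the matching "inet <ip>/..." line from `ip -4 addr show`.
detect_interface() {
  ip -4 addr show | awk -v ip="$1" '$0 ~ "inet " ip "/" {print $NF; exit}'
}
export GLOO_SOCKET_IFNAME="$(detect_interface 10.0.0.25)"
```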
---
## Example: Nemotron Super 120B NVFP4 (4 nodes)
```bash
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--served-model-name nvidia/nemotron-3-super \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 4 \
--distributed-executor-backend ray \
--swap-space 0 \
--trust-remote-code \
--attention-backend TRITON_ATTN \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill \
--max-num-seqs 512 \
--host 0.0.0.0 \
--port 5000 \
--enforce-eager \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
--reasoning-parser super_v3
```
**Note**: Download the reasoning parser first:
```bash
wget -O super_v3_reasoning_parser.py \
"https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/raw/main/super_v3_reasoning_parser.py"
```
---
## Example: Qwen3.5-122B-A10B-FP8 (4 nodes)
```bash
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
--served-model-name qwen3.5-122b \
--tensor-parallel-size 1 \
--pipeline-parallel-size 4 \
--distributed-executor-backend ray \
--swap-space 0 \
--trust-remote-code \
--attention-backend TRITON_ATTN \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--max-num-seqs 256 \
--host 0.0.0.0 \
--port 5000 \
--enforce-eager \
--reasoning-parser qwen3
```
---
## Example: Nemotron Super NVFP4 (2 Sparks only)
If the Thor nodes aren't available or don't yet have a working vLLM build, use only the two Sparks:
```bash
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--served-model-name nvidia/nemotron-3-super \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray \
--swap-space 0 \
--trust-remote-code \
--attention-backend TRITON_ATTN \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill \
--max-num-seqs 512 \
--host 0.0.0.0 \
--port 5000 \
--enforce-eager \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
--reasoning-parser super_v3
```
---
## Step 3: Test the API
Once vLLM shows `Uvicorn running on http://0.0.0.0:5000`, test from any machine:
```bash
curl http://10.0.0.25:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/nemotron-3-super",
"messages": [{"role": "user", "content": "Hello! Write a haiku about GPUs."}],
"max_tokens": 200,
"temperature": 1.0,
"top_p": 0.95
}'
```
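The raw JSON response is verbose; a small helper (a sketch using only the Python stdlib, no `jq` required) pulls out just the assistant text:

```shell
# Extract choices[0].message.content from a /v1/chat/completions response.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
curl -s http://10.0.0.25:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/nemotron-3-super", "messages": [{"role": "user", "content": "Hello!"}]}' \
  | extract_reply
```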
---
## Key Flags Reference
| Flag | Purpose |
|------|---------|
| `--pipeline-parallel-size N` | Split model across N nodes (1 GPU each) |
| `--tensor-parallel-size 1` | No tensor parallelism (1 GPU per node) |
| `--distributed-executor-backend ray` | Use Ray for multi-node distribution |
| `--enforce-eager` | Disable torch.compile/CUDAGraphs (avoids Triton ptxas sm_110a errors and OOM from compilation) |
| `--attention-backend TRITON_ATTN` | Use Triton for attention kernels |
| `--gpu-memory-utilization 0.85` | Use 85% of GPU memory (leave room for OS) |
| `--kv-cache-dtype fp8` | FP8 KV cache (saves memory, NVFP4 models only) |
| `--swap-space 0` | No CPU swap (unified memory systems) |
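As a back-of-the-envelope illustration of what `--gpu-memory-utilization 0.85` means on these 128 GB unified-memory nodes (the numbers are illustrative; actual headroom depends on what the OS and other processes hold):

```shell
# Rough per-node budget: 85% of 128 GB goes to vLLM for weights + KV cache.
awk 'BEGIN { total = 128; util = 0.85; printf "vLLM budget: %.1f GB, left for OS: %.1f GB\n", total * util, total * (1 - util) }'
```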
---
## Troubleshooting
### NCCL "wrong type 3 != 4"
**Cause**: Sparks use RoCE/IB, Thors use Socket — different NCCL transports.
**Fix**: `export NCCL_IB_DISABLE=1` forces all nodes to Socket.
### Gloo "Connection refused 127.0.0.1"
**Cause**: Gloo defaults to localhost instead of the real IP.
**Fix**: `export GLOO_SOCKET_IFNAME=enp1s0f0np0` (set to your NIC name).
### "no kernel image is available for execution on the device" (Thor only)
**Cause**: vLLM pip wheel doesn't include `sm_110` CUDA kernels.
**Fix**: Compile vLLM from source on Thor with `TORCH_CUDA_ARCH_LIST="11.0"`.
### "ptxas-blackwell fatal: Value 'sm_110a' is not defined" (Thor only)
**Cause**: Triton's bundled ptxas doesn't support sm_110a.
**Fix**: Use `--enforce-eager` to skip Triton compilation entirely.
### OOM during torch.compile
**Cause**: CUDA compiler (`cicc`) processes consume all system RAM.
**Fix**: Use `--enforce-eager` to disable compilation.
### "Free memory less than desired GPU memory utilization"
**Cause**: Other processes using GPU memory.
**Fix**: Lower `--gpu-memory-utilization` (e.g., 0.80 or 0.75).
### NCCL "Network is unreachable" on fe80:: addresses
**Cause**: NCCL tries IPv6 link-local interfaces that aren't connected.
**Fix**: Use `NCCL_SOCKET_IFNAME=enp1s0f0np0,enP2p1s0f0np0` to whitelist only working interfaces.
---
## Architecture Notes
- **Pipeline Parallelism (PP)**: Splits model layers across nodes. Each node processes a subset of layers. Low inter-node communication — ideal for Ethernet.
- **Tensor Parallelism (TP)**: Splits each layer across GPUs. High inter-node communication — requires NVLink. NOT suitable for this cluster over Ethernet.
- Each node runs PyTorch, vLLM kernels, and model inference **locally** for its portion. Ray coordinates distribution and communication.
- DGX Spark (GB10) = compute capability 12.1, `sm_120`
- Jetson Thor = compute capability 11.0, `sm_110` / `sm_110a`
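To make the pipeline-parallel split concrete, here is a sketch of how decoder layers map to the 4 stages with an even split (the 96-layer count is purely illustrative, not a property of the models above):

```shell
# Assign contiguous layer ranges to pipeline stages (even split assumed).
LAYERS=96
PP=4
awk -v n="$LAYERS" -v pp="$PP" 'BEGIN {
  per = n / pp
  for (i = 0; i < pp; i++)
    printf "stage %d: layers %d-%d\n", i, i * per, (i + 1) * per - 1
}'
```

Each stage holds only its own slice of the weights, which is why a 120B model fits across four 128 GB nodes with low inter-node traffic (only activations cross the stage boundaries).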
## Extra Notes
The full `run_cluster_bare.sh` script, for reference:
```bash
#!/bin/bash
#
# Launch a Ray cluster (without Docker) for vLLM multi-node inference.
#
# Uses the uv virtual environment at .vllm/ relative to this script.
# All machines must have the same environment replicated at the same path,
# and must be reachable at the supplied IP addresses (port 6379 open).
#
# Cluster layout (4 nodes):
# Spark 1 (HEAD) – IP_SPARK_1
# Spark 2 (worker) – IP_SPARK_2
# Thor 1 (worker) – IP_THOR_1
# Thor 2 (worker) – IP_THOR_2
#
# Usage – run each command on the corresponding machine:
#
# 1. Spark 1 (HEAD):
# bash run_cluster_bare.sh IP_SPARK_1 --head
# ^^^^^^^^^
# its own IP
#
# 2. Spark 2 (worker):
# bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_SPARK_2
# ^^^^^^^^^ ^^^^^^^^^^
# HEAD's IP its own IP
#
# 3. Thor 1 (worker):
# bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_1
# ^^^^^^^^^ ^^^^^^^^^
# HEAD's IP its own IP
#
# 4. Thor 2 (worker):
# bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_2
# ^^^^^^^^^ ^^^^^^^^^
# HEAD's IP its own IP
#
# 5. Once all workers have joined, open another terminal on Spark 1 (HEAD):
# source .vllm/bin/activate
# export VLLM_HOST_IP=IP_SPARK_1
#        vllm serve <model> --pipeline-parallel-size <N>
#
# Keep each terminal session open. Ctrl-C stops the Ray node.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV_DIR="${SCRIPT_DIR}/.vllm"
if [ ! -f "${VENV_DIR}/bin/activate" ]; then
echo "Error: virtual environment not found at ${VENV_DIR}"
echo "Create it with: uv venv .vllm && uv pip install vllm ray"
exit 1
fi
# shellcheck disable=SC1091
source "${VENV_DIR}/bin/activate"
echo "Activated virtualenv: ${VIRTUAL_ENV}"
if [ $# -lt 2 ]; then
echo "Usage: $0 <head_node_ip> --head|--worker [--node-ip <this_node_ip>]"
exit 1
fi
HEAD_NODE_ADDRESS="$1"
NODE_TYPE="$2"
shift 2
NODE_IP=""
while [[ $# -gt 0 ]]; do
case "$1" in
--node-ip)
NODE_IP="$2"
shift 2
;;
*)
echo "Unknown argument: $1"
exit 1
;;
esac
done
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
echo "Error: Node type must be --head or --worker"
exit 1
fi
# Auto-detect the network interface for the given IP so that
# NCCL and Gloo use the correct NIC instead of defaulting to loopback.
detect_interface() {
local target_ip="$1"
ip -4 addr show | awk -v ip="${target_ip}" '$0 ~ "inet " ip "/" {print $NF; exit}'
}
MY_IP=""
if [ "${NODE_TYPE}" == "--head" ]; then
MY_IP="${HEAD_NODE_ADDRESS}"
else
MY_IP="${NODE_IP:-}"
fi
if [ -n "${MY_IP}" ]; then
IFNAME=$(detect_interface "${MY_IP}")
if [ -n "${IFNAME}" ]; then
export NCCL_SOCKET_IFNAME="${IFNAME}"
export GLOO_SOCKET_IFNAME="${IFNAME}"
export TP_SOCKET_IFNAME="${IFNAME}"
echo "Detected interface ${IFNAME} for IP ${MY_IP}"
else
echo "Warning: could not detect interface for ${MY_IP}, NCCL/Gloo may fail"
fi
fi
cleanup() {
echo "Stopping Ray node..."
ray stop
}
trap cleanup EXIT
if [ "${NODE_TYPE}" == "--head" ]; then
echo "Starting Ray HEAD node on ${HEAD_NODE_ADDRESS}:6379 ..."
export VLLM_HOST_IP="${HEAD_NODE_ADDRESS}"
ray start --block --head \
--node-ip-address="${HEAD_NODE_ADDRESS}" \
--port=6379 \
--dashboard-host=0.0.0.0
else
WORKER_IP="${NODE_IP:-}"
echo "Starting Ray WORKER node, connecting to head at ${HEAD_NODE_ADDRESS}:6379 ..."
RAY_ARGS=(--block --address="${HEAD_NODE_ADDRESS}:6379")
if [ -n "${WORKER_IP}" ]; then
export VLLM_HOST_IP="${WORKER_IP}"
RAY_ARGS+=(--node-ip-address="${WORKER_IP}")
fi
ray start "${RAY_ARGS[@]}"
fi
```