---
title: Distributed vLLM
---

# Running vLLM Multi-Node Cluster on DGX Spark + Jetson Thor

## Cluster Layout

| Machine | Hostname | IP | GPU | Role |
|---------|----------|-----|-----|------|
| DGX Spark 1 | spark-ea8e | 10.0.0.25 | NVIDIA GB10 (sm_120, 128GB) | HEAD |
| DGX Spark 2 | spark-4080 | 10.0.0.22 | NVIDIA GB10 (sm_120, 128GB) | Worker |
| Jetson Thor 1 | jetson-thor | 10.0.0.27 | NVIDIA Thor (sm_110, 128GB) | Worker |
| Jetson Thor 2 | thorx | 10.0.0.26 | NVIDIA Thor (sm_110, 128GB) | Worker |

**Total: 4 GPUs, ~461 GiB unified memory**

---

## Prerequisites

Each machine needs the vLLM virtualenv at `~/Projects/vllm/.vllm/`:

```bash
cd ~/Projects/vllm
uv venv .vllm
source .vllm/bin/activate
uv pip install -U "ray[all]"
```

### Thor-Specific: Install a vLLM Build with `sm_110` Kernels (Recommended until torch 2.11.0 & Triton 3.7.0)

The stock pip-installed vLLM wheels do **not** include CUDA kernels for `sm_110` (Jetson Thor).
On each Thor, install a build that does: either the cu130 aarch64 wheel below, or a source build with `TORCH_CUDA_ARCH_LIST="11.0"`.

Install vLLM:

```bash
uv pip install --force-reinstall https://github.com/vllm-project/vllm/releases/download/v0.17.1/vllm-0.17.1+cu130-cp38-abi3-manylinux_2_35_aarch64.whl
```

Install PyTorch:

```bash
uv pip install --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
```

Verify:
```bash
python -c "
import vllm
from vllm._custom_ops import scaled_fp8_quant
import torch
x = torch.randn(32, 128, device='cuda', dtype=torch.bfloat16)
out, scale = scaled_fp8_quant(x)
print(f'FP8 quant OK: {out.shape}, scale: {scale}')
"
```

---

## Step 1: Start the Ray Cluster

Use `run_cluster_bare.sh` on each machine. The script auto-detects the network interface and sets NCCL/GLOO env vars.

### Spark 1 (HEAD):
```bash
bash run_cluster_bare.sh 10.0.0.25 --head
```

### Spark 2 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.22
```

### Thor 1 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.27
```

### Thor 2 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.26
```
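
If a worker hangs while joining, first confirm the head's Ray port is reachable from that machine. A minimal check using bash's built-in `/dev/tcp` (no extra tools assumed; replace `10.0.0.25` with your head IP):

```shell
# Succeeds only if the head's Ray port (6379) accepts TCP connections.
if timeout 2 bash -c 'exec 3<>/dev/tcp/10.0.0.25/6379' 2>/dev/null; then
  echo "head reachable on 6379"
else
  echo "cannot reach head on 6379"
fi
```

If this reports the head as unreachable, check firewalls and routing before debugging Ray itself.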

### Verify the cluster:
```bash
source .vllm/bin/activate
ray status
```

You should see 4 active nodes, 4 GPUs, ~461 GiB memory.
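
For a scriptable check, the node count can be pulled out of the `ray status` output with a small filter. The `count_nodes` helper and the sample output below are illustrative; in `ray status`, active nodes appear as ` 1 node_<id>` lines:

```shell
# Count active nodes; in practice pipe the real output in:  ray status | count_nodes
count_nodes() { grep -c '^ 1 node_'; }

# Canned `ray status`-style output, for illustration only:
printf 'Active:\n 1 node_aa\n 1 node_bb\n 1 node_cc\n 1 node_dd\n' | count_nodes
```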

---

## Step 2: Serve a Model

Open another terminal on the HEAD (Spark 1) and set the environment:

```bash
source .vllm/bin/activate
export VLLM_HOST_IP=10.0.0.25
export NCCL_SOCKET_IFNAME=^lo,docker,tailscale,veth
export GLOO_SOCKET_IFNAME=enp1s0f0np0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=Simple
```

### NCCL Environment Variables Explained

| Variable | Value | Why |
|----------|-------|-----|
| `NCCL_SOCKET_IFNAME` | `^lo,docker,tailscale,veth` | Exclude loopback/Docker/VPN interfaces; NCCL auto-selects the right NIC on each machine |
| `NCCL_IB_DISABLE` | `1` | Disable InfiniBand. The Sparks have RoCE, the Thors don't; this forces all nodes onto the TCP socket transport |
| `NCCL_P2P_DISABLE` | `1` | Disable peer-to-peer; there is no NVLink between nodes |
| `NCCL_SHM_DISABLE` | `1` | Disable shared memory; forces TCP for all communication |
| `NCCL_PROTO` | `Simple` | The Simple protocol is the most compatible across heterogeneous nodes |
| `GLOO_SOCKET_IFNAME` | `enp1s0f0np0` | Select the NIC for PyTorch's Gloo backend (used during init) |
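
To avoid retyping these before every serve, the exports can live in a small file that you `source` in each serving shell (the file name `env_nccl.sh` is an arbitrary choice):

```shell
# Write the serving environment once, then `source env_nccl.sh` before `vllm serve`.
cat > env_nccl.sh <<'EOF'
export VLLM_HOST_IP=10.0.0.25
export NCCL_SOCKET_IFNAME=^lo,docker,tailscale,veth
export GLOO_SOCKET_IFNAME=enp1s0f0np0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=Simple
EOF
source env_nccl.sh
```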

---

## Example: Nemotron Super 120B NVFP4 (4 nodes)

```bash
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nvidia/nemotron-3-super \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 4 \
  --distributed-executor-backend ray \
  --swap-space 0 \
  --trust-remote-code \
  --attention-backend TRITON_ATTN \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --max-num-seqs 512 \
  --host 0.0.0.0 \
  --port 5000 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
  --reasoning-parser super_v3
```

**Note**: Download the reasoning parser first:
```bash
wget -O super_v3_reasoning_parser.py \
  "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/raw/main/super_v3_reasoning_parser.py"
```

---

## Example: Qwen3.5-122B-A10B-FP8 (4 nodes)

```bash
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
  --served-model-name qwen3.5-122b \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 4 \
  --distributed-executor-backend ray \
  --swap-space 0 \
  --trust-remote-code \
  --attention-backend TRITON_ATTN \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --host 0.0.0.0 \
  --port 5000 \
  --enforce-eager \
  --reasoning-parser qwen3
```

---

## Example: Nemotron Super NVFP4 (2 Sparks only)

If the Thor nodes aren't available (or don't yet have an `sm_110`-capable vLLM build), use only the two Sparks:

```bash
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nvidia/nemotron-3-super \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray \
  --swap-space 0 \
  --trust-remote-code \
  --attention-backend TRITON_ATTN \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --max-num-seqs 512 \
  --host 0.0.0.0 \
  --port 5000 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
  --reasoning-parser super_v3
```

---

## Step 3: Test the API

Once vLLM shows `Uvicorn running on http://0.0.0.0:5000`, test from any machine:

```bash
curl http://10.0.0.25:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super",
    "messages": [{"role": "user", "content": "Hello! Write a haiku about GPUs."}],
    "max_tokens": 200,
    "temperature": 1.0,
    "top_p": 0.95
  }'
```
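
The raw JSON response is verbose. A small filter (a hypothetical `extract_reply` helper, assuming `python3` on the client) prints only the assistant's text. Here it is fed a canned response for illustration; in practice, pipe the `curl -s ...` output into it:

```shell
# Print just choices[0].message.content from a chat-completions response.
extract_reply() {
  python3 -c "import sys, json; print(json.load(sys.stdin)['choices'][0]['message']['content'])"
}

# Canned response for illustration; normally: curl -s ... | extract_reply
echo '{"choices":[{"message":{"content":"Silicon blossoms"}}]}' | extract_reply
```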

---

## Key Flags Reference

| Flag | Purpose |
|------|---------|
| `--pipeline-parallel-size N` | Split model across N nodes (1 GPU each) |
| `--tensor-parallel-size 1` | No tensor parallelism (1 GPU per node) |
| `--distributed-executor-backend ray` | Use Ray for multi-node distribution |
| `--enforce-eager` | Disable torch.compile/CUDAGraphs (avoids Triton ptxas sm_110a errors and OOM from compilation) |
| `--attention-backend TRITON_ATTN` | Use Triton for attention kernels |
| `--gpu-memory-utilization 0.85` | Use 85% of GPU memory (leave room for OS) |
| `--kv-cache-dtype fp8` | FP8 KV cache (saves memory, NVFP4 models only) |
| `--swap-space 0` | No CPU swap (unified memory systems) |

---

## Troubleshooting

### NCCL "wrong type 3 != 4"
**Cause**: Sparks use RoCE/IB, Thors use Socket — different NCCL transports.  
**Fix**: `export NCCL_IB_DISABLE=1` forces all nodes to Socket.

### Gloo "Connection refused 127.0.0.1"
**Cause**: Gloo defaults to localhost instead of the real IP.  
**Fix**: `export GLOO_SOCKET_IFNAME=enp1s0f0np0` (set to your NIC name).

### "no kernel image is available for execution on the device" (Thor only)
**Cause**: vLLM pip wheel doesn't include `sm_110` CUDA kernels.  
**Fix**: Use a vLLM build with `sm_110` kernels: compile from source on Thor with `TORCH_CUDA_ARCH_LIST="11.0"`, or install a prebuilt `sm_110`-capable wheel (see the Thor-specific prerequisites).

### "ptxas-blackwell fatal: Value 'sm_110a' is not defined" (Thor only)
**Cause**: Triton's bundled ptxas doesn't support sm_110a.  
**Fix**: Use `--enforce-eager` to skip Triton compilation entirely.

### OOM during torch.compile
**Cause**: CUDA compiler (`cicc`) processes consume all system RAM.  
**Fix**: Use `--enforce-eager` to disable compilation.

### "Free memory less than desired GPU memory utilization"
**Cause**: Other processes using GPU memory.  
**Fix**: Lower `--gpu-memory-utilization` (e.g., 0.80 or 0.75).

### NCCL "Network is unreachable" on fe80:: addresses
**Cause**: NCCL tries IPv6 link-local interfaces that aren't connected.  
**Fix**: Use `NCCL_SOCKET_IFNAME=enp1s0f0np0,enP2p1s0f0np0` to whitelist only working interfaces.

---

## Architecture Notes

- **Pipeline Parallelism (PP)**: Splits model layers across nodes. Each node processes a subset of layers. Low inter-node communication — ideal for Ethernet.
- **Tensor Parallelism (TP)**: Splits each layer across GPUs. High inter-node communication — requires NVLink. NOT suitable for this cluster over Ethernet.
- Each node runs PyTorch, vLLM kernels, and model inference **locally** for its portion. Ray coordinates distribution and communication.
- DGX Spark (GB10) = compute capability 12.1, `sm_120`
- Jetson Thor = compute capability 11.0, `sm_110` / `sm_110a`
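
As a sanity check on why the 120B model fits, here is a back-of-envelope estimate of weight memory per pipeline stage. It assumes roughly 4 bits per parameter for NVFP4 and an even split across 4 stages, and ignores KV cache, activations, and per-stage imbalance:

```shell
python3 - <<'EOF'
# Rough per-stage weight footprint for a 120B-parameter NVFP4 model.
params = 120e9          # total parameters
bytes_per_param = 0.5   # NVFP4 is ~4 bits/parameter
stages = 4              # pipeline-parallel size
per_stage_gib = params * bytes_per_param / stages / 2**30
print(f"~{per_stage_gib:.0f} GiB of weights per node")
EOF
```

Roughly 14 GiB of weights per node, leaving most of each node's 128 GB for KV cache and runtime overhead under `--gpu-memory-utilization 0.85`.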


## Extra Notes

The full `run_cluster_bare.sh` script:

```bash
#!/bin/bash
#
# Launch a Ray cluster (without Docker) for vLLM multi-node inference.
#
# Uses the uv virtual environment at .vllm/ relative to this script.
# All machines must have the same environment replicated at the same path,
# and must be reachable at the supplied IP addresses (port 6379 open).
#
# Cluster layout (4 nodes):
#   Spark 1  (HEAD)   – IP_SPARK_1
#   Spark 2  (worker) – IP_SPARK_2
#   Thor  1  (worker) – IP_THOR_1
#   Thor  2  (worker) – IP_THOR_2
#
# Usage – run each command on the corresponding machine:
#
# 1. Spark 1 (HEAD):
#    bash run_cluster_bare.sh IP_SPARK_1 --head
#                             ^^^^^^^^^
#                             its own IP
#
# 2. Spark 2 (worker):
#    bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_SPARK_2
#                             ^^^^^^^^^                     ^^^^^^^^^^
#                             HEAD's IP                     its own IP
#
# 3. Thor 1 (worker):
#    bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_1
#                             ^^^^^^^^^                     ^^^^^^^^^
#                             HEAD's IP                     its own IP
#
# 4. Thor 2 (worker):
#    bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_2
#                             ^^^^^^^^^                     ^^^^^^^^^
#                             HEAD's IP                     its own IP
#
# 5. Once all workers have joined, open another terminal on Spark 1 (HEAD):
#      source .vllm/bin/activate
#      export VLLM_HOST_IP=IP_SPARK_1
#      vllm serve <model> --pipeline-parallel-size <N> --distributed-executor-backend ray
#
# Keep each terminal session open. Ctrl-C stops the Ray node.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV_DIR="${SCRIPT_DIR}/.vllm"

if [ ! -f "${VENV_DIR}/bin/activate" ]; then
    echo "Error: virtual environment not found at ${VENV_DIR}"
    echo "Create it with:  uv venv .vllm && uv pip install vllm ray"
    exit 1
fi

# shellcheck disable=SC1091
source "${VENV_DIR}/bin/activate"
echo "Activated virtualenv: ${VIRTUAL_ENV}"

if [ $# -lt 2 ]; then
    echo "Usage: $0 <head_node_ip> --head|--worker [--node-ip <this_node_ip>]"
    exit 1
fi

HEAD_NODE_ADDRESS="$1"
NODE_TYPE="$2"
shift 2

NODE_IP=""
while [[ $# -gt 0 ]]; do
    case "$1" in
        --node-ip)
            NODE_IP="$2"
            shift 2
            ;;
        *)
            echo "Unknown argument: $1"
            exit 1
            ;;
    esac
done

if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Auto-detect the network interface for the given IP so that
# NCCL and Gloo use the correct NIC instead of defaulting to loopback.
detect_interface() {
    local target_ip="$1"
    ip -4 addr show | awk -v ip="${target_ip}" '$0 ~ "inet " ip "/" {print $NF; exit}'
}

MY_IP=""
if [ "${NODE_TYPE}" == "--head" ]; then
    MY_IP="${HEAD_NODE_ADDRESS}"
else
    MY_IP="${NODE_IP:-}"
fi

if [ -n "${MY_IP}" ]; then
    IFNAME=$(detect_interface "${MY_IP}")
    if [ -n "${IFNAME}" ]; then
        export NCCL_SOCKET_IFNAME="${IFNAME}"
        export GLOO_SOCKET_IFNAME="${IFNAME}"
        export TP_SOCKET_IFNAME="${IFNAME}"
        echo "Detected interface ${IFNAME} for IP ${MY_IP}"
    else
        echo "Warning: could not detect interface for ${MY_IP}, NCCL/Gloo may fail"
    fi
fi

cleanup() {
    echo "Stopping Ray node..."
    ray stop
}
trap cleanup EXIT

if [ "${NODE_TYPE}" == "--head" ]; then
    echo "Starting Ray HEAD node on ${HEAD_NODE_ADDRESS}:6379 ..."
    export VLLM_HOST_IP="${HEAD_NODE_ADDRESS}"
    ray start --block --head \
        --node-ip-address="${HEAD_NODE_ADDRESS}" \
        --port=6379 \
        --dashboard-host=0.0.0.0
else
    WORKER_IP="${NODE_IP:-}"
    echo "Starting Ray WORKER node, connecting to head at ${HEAD_NODE_ADDRESS}:6379 ..."

    RAY_ARGS=(--block --address="${HEAD_NODE_ADDRESS}:6379")
    if [ -n "${WORKER_IP}" ]; then
        export VLLM_HOST_IP="${WORKER_IP}"
        RAY_ARGS+=(--node-ip-address="${WORKER_IP}")
    fi

    ray start "${RAY_ARGS[@]}"
fi
```