---
title: Distributed sglang/vllm
---

# Running SGLang Multi-Node Cluster on DGX Spark + Jetson Thor

## Cluster Layout

| Machine | Hostname | IP | GPU | Role |
|---------|----------|-----|-----|------|
| DGX Spark 1 | spark-ea8e | 10.0.0.25 | NVIDIA GB10 (sm_120, 128GB) | HEAD |
| DGX Spark 2 | spark-4080 | 10.0.0.22 | NVIDIA GB10 (sm_120, 128GB) | Worker |
| Jetson Thor 1 | jetson-thor | 10.0.0.27 | NVIDIA Thor (sm_110, 128GB) | Worker |
| Jetson Thor 2 | thorx | 10.0.0.26 | NVIDIA Thor (sm_110, 128GB) | Worker |

**Total: 4 GPUs, ~461 GiB unified memory**

---

## Prerequisites

### Create the Virtual Environment

Each machine needs the SGLang virtualenv at `~/Projects/sglang/.sglang/`:

```bash
cd ~/Projects/sglang
uv venv .sglang --python 3.12
source .sglang/bin/activate
```

### Set Architecture & CUDA Paths

Choose the correct `TORCH_CUDA_ARCH_LIST` for your hardware:

**Jetson Thor:**
```bash
export TORCH_CUDA_ARCH_LIST="11.0a"
```

**DGX Spark:**
```bash
export TORCH_CUDA_ARCH_LIST="12.1a"
```

**Common (all machines):**
```bash
export CUDA_HOME=/usr/local/cuda-13
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH="${CUDA_HOME}/bin:$PATH"
```
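
If one setup script provisions both machine types, the arch string can be derived from the compute capability reported by `nvidia-smi --query-gpu=compute_cap`. A minimal sketch; `arch_for_cap` is a hypothetical helper, not part of any existing tooling:

```bash
# Map `nvidia-smi --query-gpu=compute_cap` output to the TORCH_CUDA_ARCH_LIST
# values used above. arch_for_cap is a hypothetical helper for illustration.
arch_for_cap() {
  case "$1" in
    12.1) echo "12.1a" ;;  # DGX Spark (GB10, sm_120)
    11.0) echo "11.0a" ;;  # Jetson Thor (sm_110)
    *)    echo "${1}a" ;;  # fall back: append the 'a' suffix to the raw capability
  esac
}

# On a live machine:
#   export TORCH_CUDA_ARCH_LIST="$(arch_for_cap "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)")"
```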

### Install Dependencies

```bash
uv pip install sglang
uv pip install --force-reinstall torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu130
uv pip install --force-reinstall sgl-kernel --index-url https://docs.sglang.ai/whl/cu130/
```

### Install Ray (required for multi-node)

```bash
uv pip install -U "ray[all]"
```
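
Ray refuses to form a cluster when nodes run mismatched Ray or Python versions, so it is worth comparing them before Step 1. A sketch assuming passwordless SSH to the hostnames in the table above; `check_uniform` is a hypothetical helper:

```bash
# Read one "ray-version python-version" line per node on stdin;
# print OK when they all match, MISMATCH otherwise.
check_uniform() { [ "$(sort -u | wc -l)" -eq 1 ] && echo OK || echo MISMATCH; }

# Collect the versions over SSH and compare (hostnames from the table above):
#   for h in spark-ea8e spark-4080 jetson-thor thorx; do
#     ssh "$h" '~/Projects/sglang/.sglang/bin/python -c \
#       "import ray, sys; print(ray.__version__, sys.version.split()[0])"'
#   done | check_uniform
```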

### Verify the Installation

```bash
python -c "import sglang; print(f'SGLang version: {sglang.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, arch: {torch.cuda.get_device_capability()}')"
```

---

## Step 1: Start the Ray Cluster

Run `run_cluster_bare.sh` on each machine; the full script is reproduced at the end of this document. It auto-detects the network interface for the node's IP and exports the matching NCCL/Gloo environment variables.

### Spark 1 (HEAD):
```bash
bash run_cluster_bare.sh 10.0.0.25 --head
```

### Spark 2 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.22
```

### Thor 1 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.27
```

### Thor 2 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.26
```

### Verify the cluster:
```bash
source .sglang/bin/activate
ray status
```

You should see 4 active nodes, 4 GPUs, ~461 GiB memory.
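
If a launch script should wait for the cluster to fully form, the active-node count can be scraped from `ray status`. A sketch that assumes the `Active:`/`Pending:` section layout printed by recent Ray releases; `count_active_nodes` is a hypothetical helper:

```bash
# Count "node_..." entries in the Active: section of `ray status` output (stdin).
count_active_nodes() {
  awk '/^Active:/  {in_active=1; next}
       /^Pending:/ {in_active=0}
       in_active && /node_/ {n++}
       END {print n + 0}'
}

# Usage:
#   [ "$(ray status | count_active_nodes)" -eq 4 ] || echo "cluster not ready yet"
```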

---

## Step 2: Serve a Model

Open another terminal on the HEAD (Spark 1) and set the environment:

```bash
source .sglang/bin/activate
export NCCL_SOCKET_IFNAME=^lo,docker,tailscale,veth
export GLOO_SOCKET_IFNAME=enp1s0f0np0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=Simple
```

### NCCL Environment Variables Explained

| Variable | Value | Why |
|----------|-------|-----|
| `NCCL_SOCKET_IFNAME` | `^lo,docker,tailscale,veth` | Exclude loopback/Docker/VPN interfaces; NCCL auto-selects the right NIC on each machine |
| `NCCL_IB_DISABLE` | `1` | Sparks have RoCE, Thors don't; forces all nodes onto the TCP socket transport |
| `NCCL_P2P_DISABLE` | `1` | Disable peer-to-peer transfers; there is no NVLink between nodes |
| `NCCL_SHM_DISABLE` | `1` | Disable the shared-memory transport; forces TCP for all communication |
| `NCCL_PROTO` | `Simple` | The most compatible protocol across heterogeneous nodes |
| `GLOO_SOCKET_IFNAME` | `enp1s0f0np0` | NIC for the PyTorch Gloo backend (used during initialization) |
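
Rather than re-exporting these variables in every new terminal, they can live in a small env file that serving sessions source. A sketch; the path `~/nccl_env.sh` is arbitrary:

```bash
# Persist the NCCL/Gloo settings from the step above into a sourceable file.
cat > ~/nccl_env.sh <<'EOF'
export NCCL_SOCKET_IFNAME=^lo,docker,tailscale,veth
export GLOO_SOCKET_IFNAME=enp1s0f0np0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=Simple
EOF

# Then, in any serving terminal:
#   source ~/nccl_env.sh
```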

---

## Example: Nemotron Super 120B NVFP4 (4 nodes)

```bash
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nvidia/nemotron-3-super \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tp 1 \
  --pp 4 \
  --nnodes 4 \
  --use-ray \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 8192 \
  --max-running-requests 512 \
  --host 0.0.0.0 \
  --port 5000 \
  --disable-cuda-graph
```
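
Loading a 120B model across four nodes takes several minutes, so scripts should poll the server's `/health` endpoint (SGLang exposes one) before sending traffic. A sketch; the URL matches the host/port flags above, and `wait_for_server` is a hypothetical helper:

```bash
# Retry until the URL answers with a 2xx, or give up after $2 attempts (default 60).
wait_for_server() {
  local url="$1" tries="${2:-60}"
  for _ in $(seq 1 "$tries"); do
    curl -sf "$url" >/dev/null && return 0
    sleep 2
  done
  return 1
}

# wait_for_server http://10.0.0.25:5000/health && echo "server is up"
```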

---

## Example: Qwen3.5-122B-A10B-FP8 (4 nodes)

```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-122B-A10B-FP8 \
  --served-model-name qwen3.5-122b \
  --tp 1 \
  --pp 4 \
  --nnodes 4 \
  --use-ray \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --context-length 32768 \
  --max-running-requests 256 \
  --host 0.0.0.0 \
  --port 5000 \
  --disable-cuda-graph
```

---

## Example: Nemotron Super NVFP4 (2 Sparks only)

If Thor nodes aren't available, use only the 2 Sparks:

```bash
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nvidia/nemotron-3-super \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tp 1 \
  --pp 2 \
  --nnodes 2 \
  --use-ray \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 8192 \
  --max-running-requests 512 \
  --host 0.0.0.0 \
  --port 5000 \
  --disable-cuda-graph
```

---

## Step 3: Test the API

SGLang exposes an OpenAI-compatible API. Once the server is ready, test from any machine:

```bash
curl http://10.0.0.25:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super",
    "messages": [{"role": "user", "content": "Hello! Write a haiku about GPUs."}],
    "max_tokens": 200,
    "temperature": 1.0,
    "top_p": 0.95
  }'
```
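
The raw JSON response is verbose; a short Python one-liner pulls out just the assistant text. It is shown here against a canned response; in practice, pipe the `curl` output into it. `extract_reply` is a hypothetical helper:

```bash
# Pull the assistant's text out of a chat-completions JSON response (stdin).
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# In practice:  curl -s http://10.0.0.25:5000/v1/chat/completions ... | extract_reply
echo '{"choices": [{"message": {"content": "GPUs hum softly"}}]}' | extract_reply
# prints: GPUs hum softly
```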

---

## Key Flags Reference

| Flag | Purpose |
|------|---------|
| `--model-path` | HuggingFace model ID or local path |
| `--pp N` | Split model across N nodes (pipeline parallelism) |
| `--tp 1` | No tensor parallelism (1 GPU per node) |
| `--nnodes N` | Number of nodes in the cluster |
| `--use-ray` | Use Ray actors for multi-node process management ([PR #17684](https://github.com/sgl-project/sglang/pull/17684)) |
| `--disable-cuda-graph` | Disable CUDA graphs (avoids Triton ptxas sm_110a errors and OOM from graph capture) |
| `--mem-fraction-static 0.85` | Use 85% of GPU memory for static allocation (leave room for OS) |
| `--kv-cache-dtype fp8` | Store the KV cache in FP8, halving its memory footprint (used here with the NVFP4 model) |
| `--chunked-prefill-size N` | Enable chunked prefill with chunk size N tokens |
| `--max-running-requests N` | Maximum concurrent requests |
| `--context-length N` | Override model's max context length |
| `--trust-remote-code` | Allow running model code from HuggingFace |



## Troubleshooting

### NCCL "wrong type 3 != 4"
**Cause**: Sparks use RoCE/IB, Thors use Socket — different NCCL transports.  
**Fix**: `export NCCL_IB_DISABLE=1` forces all nodes to Socket.

### Gloo "Connection refused 127.0.0.1"
**Cause**: Gloo defaults to localhost instead of the real IP.  
**Fix**: `export GLOO_SOCKET_IFNAME=enp1s0f0np0` (set to your NIC name).
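
To find the NIC name to use on a given machine, list each interface with its IPv4 address and pick the one that owns that node's IP. A sketch; `nic_table` is a hypothetical helper whose `awk` field positions assume the one-line format of `ip -4 -o addr show`:

```bash
# Reduce `ip -4 -o addr show` output (read on stdin) to "interface address" pairs.
nic_table() { awk '{print $2, $4}'; }

# On a live machine:
#   ip -4 -o addr show | nic_table
#   # e.g.  enp1s0f0np0 10.0.0.25/24
```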

### "no kernel image is available for execution on the device" (Thor only)
**Cause**: Pre-built wheels don't include `sm_110` CUDA kernels.  
**Fix**: Reinstall `sgl-kernel` from the cu130 index with `TORCH_CUDA_ARCH_LIST="11.0a"` set before install.

### "ptxas fatal: Value 'sm_110a' is not defined" (Thor only)
**Cause**: Triton's bundled ptxas doesn't support sm_110a.  
**Fix**: Set `export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas` and use `--disable-cuda-graph`.

### OOM during CUDA graph capture
**Cause**: CUDA graph capture consumes extra GPU memory.  
**Fix**: Use `--disable-cuda-graph` to skip graph capture entirely.

### "Free memory less than desired"
**Cause**: Other processes using GPU memory.  
**Fix**: Lower `--mem-fraction-static` (e.g., 0.80 or 0.75).

### NCCL "Network is unreachable" on fe80:: addresses
**Cause**: NCCL tries IPv6 link-local interfaces that aren't connected.  
**Fix**: Use `NCCL_SOCKET_IFNAME=enp1s0f0np0,enP2p1s0f0np0` to whitelist only working interfaces.

### Ray actor placement fails
**Cause**: Not enough GPUs visible in the Ray cluster.  
**Fix**: Verify `ray status` shows all expected nodes and GPUs before launching.

---

## Architecture Notes

- **Pipeline Parallelism (PP)**: Splits model layers across nodes. Each node processes a subset of layers. Low inter-node communication — ideal for Ethernet.
- **Tensor Parallelism (TP)**: Splits each layer across GPUs. High inter-node communication — requires NVLink. NOT suitable for this cluster over Ethernet.
- SGLang uses ZMQ for inter-process data transfer. With `--use-ray`, Ray manages process lifecycle (the control plane) while ZMQ keeps handling the data plane, so steady-state throughput is unaffected.
- The `--use-ray` flag is opt-in ([PR #17684](https://github.com/sgl-project/sglang/pull/17684)). Without it, SGLang defaults to Python multiprocessing.
- DGX Spark (GB10) = compute capability 12.1, `sm_120`
- Jetson Thor = compute capability 11.0, `sm_110` / `sm_110a`


---

## Appendix: `run_cluster_bare.sh`

The full launcher script used in Step 1:

```bash
#!/bin/bash
#
# Launch a Ray cluster (without Docker) for SGLang multi-node inference.
#
# Uses the uv virtual environment at .sglang/ relative to this script.
# All machines must have the same environment replicated at the same path,
# and must be reachable at the supplied IP addresses (port 6379 open).
#
# Cluster layout (4 nodes):
#   Spark 1  (HEAD)   – IP_SPARK_1
#   Spark 2  (worker) – IP_SPARK_2
#   Thor  1  (worker) – IP_THOR_1
#   Thor  2  (worker) – IP_THOR_2
#
# Usage – run each command on the corresponding machine:
#
# 1. Spark 1 (HEAD):
#    bash run_cluster_bare.sh IP_SPARK_1 --head
#                             ^^^^^^^^^
#                             its own IP
#
# 2. Spark 2 (worker):
#    bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_SPARK_2
#                             ^^^^^^^^^                     ^^^^^^^^^^
#                             HEAD's IP                     its own IP
#
# 3. Thor 1 (worker):
#    bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_1
#                             ^^^^^^^^^                     ^^^^^^^^^
#                             HEAD's IP                     its own IP
#
# 4. Thor 2 (worker):
#    bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_2
#                             ^^^^^^^^^                     ^^^^^^^^^
#                             HEAD's IP                     its own IP
#
# 5. Once all workers have joined, open another terminal on Spark 1 (HEAD):
#      source .sglang/bin/activate
#      python -m sglang.launch_server --model-path <model> --tp 1 --pp <N> --nnodes <N> --use-ray
#
# Keep each terminal session open. Ctrl-C stops the Ray node.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV_DIR="${SCRIPT_DIR}/.sglang"

if [ ! -f "${VENV_DIR}/bin/activate" ]; then
    echo "Error: virtual environment not found at ${VENV_DIR}"
    echo "Create it with:  uv venv .sglang --python 3.12 && uv pip install sglang ray"
    exit 1
fi

# shellcheck disable=SC1091
source "${VENV_DIR}/bin/activate"
echo "Activated virtualenv: ${VIRTUAL_ENV}"

if [ $# -lt 2 ]; then
    echo "Usage: $0 <head_node_ip> --head|--worker [--node-ip <this_node_ip>]"
    exit 1
fi

HEAD_NODE_ADDRESS="$1"
NODE_TYPE="$2"
shift 2

NODE_IP=""
while [[ $# -gt 0 ]]; do
    case "$1" in
        --node-ip)
            NODE_IP="$2"
            shift 2
            ;;
        *)
            echo "Unknown argument: $1"
            exit 1
            ;;
    esac
done

if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Auto-detect the network interface for the given IP so that
# NCCL and Gloo use the correct NIC instead of defaulting to loopback.
detect_interface() {
    local target_ip="$1"
    # index() matches the IP literally, so its dots are not treated as regex wildcards.
    ip -4 addr show | awk -v ip="${target_ip}" 'index($0, "inet " ip "/") {print $NF; exit}'
}

MY_IP=""
if [ "${NODE_TYPE}" == "--head" ]; then
    MY_IP="${HEAD_NODE_ADDRESS}"
else
    MY_IP="${NODE_IP:-}"
fi

if [ -n "${MY_IP}" ]; then
    IFNAME=$(detect_interface "${MY_IP}")
    if [ -n "${IFNAME}" ]; then
        export NCCL_SOCKET_IFNAME="${IFNAME}"
        export GLOO_SOCKET_IFNAME="${IFNAME}"
        export TP_SOCKET_IFNAME="${IFNAME}"
        echo "Detected interface ${IFNAME} for IP ${MY_IP}"
    else
        echo "Warning: could not detect interface for ${MY_IP}, NCCL/Gloo may fail"
    fi
fi

cleanup() {
    echo "Stopping Ray node..."
    ray stop
}
trap cleanup EXIT

if [ "${NODE_TYPE}" == "--head" ]; then
    echo "Starting Ray HEAD node on ${HEAD_NODE_ADDRESS}:6379 ..."
    ray start --block --head \
        --node-ip-address="${HEAD_NODE_ADDRESS}" \
        --port=6379 \
        --dashboard-host=0.0.0.0
else
    WORKER_IP="${NODE_IP:-}"
    echo "Starting Ray WORKER node, connecting to head at ${HEAD_NODE_ADDRESS}:6379 ..."

    RAY_ARGS=(--block --address="${HEAD_NODE_ADDRESS}:6379")
    if [ -n "${WORKER_IP}" ]; then
        RAY_ARGS+=(--node-ip-address="${WORKER_IP}")
    fi

    ray start "${RAY_ARGS[@]}"
fi
```