# Running SGLang Multi-Node Cluster on DGX Spark + Jetson Thor
## Cluster Layout
| Machine | Hostname | IP | GPU | Role |
|---------|----------|-----|-----|------|
| DGX Spark 1 | spark-ea8e | 10.0.0.25 | NVIDIA GB10 (sm_120, 128GB) | HEAD |
| DGX Spark 2 | spark-4080 | 10.0.0.22 | NVIDIA GB10 (sm_120, 128GB) | Worker |
| Jetson Thor 1 | jetson-thor | 10.0.0.27 | NVIDIA Thor (sm_110, 128GB) | Worker |
| Jetson Thor 2 | thorx | 10.0.0.26 | NVIDIA Thor (sm_110, 128GB) | Worker |
**Total: 4 GPUs, ~461 GiB unified memory**
---
## Prerequisites
### Create the Virtual Environment
Each machine needs the SGLang virtualenv at `~/Projects/sglang/.sglang/`:
```bash
cd ~/Projects/sglang
uv venv .sglang --python 3.12
source .sglang/bin/activate
```
### Set Architecture & CUDA Paths
Choose the correct `TORCH_CUDA_ARCH_LIST` for your hardware:
**Jetson Thor:**
```bash
export TORCH_CUDA_ARCH_LIST="11.0a"
```
**DGX Spark:**
```bash
export TORCH_CUDA_ARCH_LIST="12.1a"
```
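If you provision both machine types from one shared script, the choice can be automated. This is a sketch (the `pick_arch_list` helper is hypothetical, and it assumes `torch` is importable on the node): it maps the detected compute capability to the matching arch-list value and falls back to a manual export otherwise.

```shell
# Map the GPU's compute capability to TORCH_CUDA_ARCH_LIST instead of
# hard-coding it per machine. 12.1 = DGX Spark (GB10), 11.0 = Jetson Thor.
pick_arch_list() {
    case "$1" in
        12.1) echo "12.1a" ;;   # DGX Spark (GB10)
        11.0) echo "11.0a" ;;   # Jetson Thor
        *)    echo "" ;;        # unknown: caller must export manually
    esac
}

# Query the local GPU; empty if torch/CUDA is unavailable in this shell.
CAP=$(python -c "import torch; print('.'.join(map(str, torch.cuda.get_device_capability())))" 2>/dev/null || true)
ARCH=$(pick_arch_list "${CAP}")
if [ -n "${ARCH}" ]; then
    export TORCH_CUDA_ARCH_LIST="${ARCH}"
    echo "TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}"
else
    echo "Unknown capability '${CAP}' - set TORCH_CUDA_ARCH_LIST manually"
fi
```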
**Common (all machines):**
```bash
export CUDA_HOME=/usr/local/cuda-13
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH="${CUDA_HOME}/bin:$PATH"
```
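A quick sanity check that the exported paths actually resolve to a CUDA toolchain can save a confusing build failure later. This sketch (the `check_tool` helper is just for illustration) warns instead of failing, so it is safe to paste into any shell:

```shell
# Report where each CUDA tool resolves from PATH, or warn if it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $(command -v "$1")"
    else
        echo "$1: NOT FOUND - check CUDA_HOME and PATH"
    fi
}
check_tool nvcc
check_tool ptxas
```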
### Install Dependencies
```bash
uv pip install sglang
uv pip install --force-reinstall torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu130
uv pip install --force-reinstall sgl-kernel --index-url https://docs.sglang.ai/whl/cu130/
```
### Install Ray (required for multi-node)
```bash
uv pip install -U "ray[all]"
```
### Verify the Installation
```bash
python -c "import sglang; print(f'SGLang version: {sglang.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, arch: {torch.cuda.get_device_capability()}')"
```
---
## Step 1: Start the Ray Cluster
Use `run_cluster_bare.sh` on each machine. The script auto-detects the network interface and sets NCCL/GLOO env vars.
### Spark 1 (HEAD):
```bash
bash run_cluster_bare.sh 10.0.0.25 --head
```
### Spark 2 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.22
```
### Thor 1 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.27
```
### Thor 2 (Worker):
```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.26
```
### Verify the cluster:
```bash
source .sglang/bin/activate
ray status
```
You should see 4 active nodes, 4 GPUs, ~461 GiB memory.
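If you script the launch, the same check can gate it. This is a sketch that counts `node_` entries in the human-readable `ray status` output; that format can vary between Ray versions, so treat the parsing as an assumption:

```shell
# Count active Ray nodes; prints 0 if ray is not installed or no cluster is up.
count_active_nodes() {
    ray status 2>/dev/null | grep -c 'node_' || true
}

ACTIVE=$(count_active_nodes)
if [ "${ACTIVE:-0}" -ge 4 ]; then
    echo "All ${ACTIVE} nodes active - safe to launch the server"
else
    echo "Only ${ACTIVE:-0} nodes active - wait for workers to join"
fi
```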
---
## Step 2: Serve a Model
Open another terminal on the HEAD (Spark 1) and set the environment:
```bash
source .sglang/bin/activate
export NCCL_SOCKET_IFNAME=^lo,docker,tailscale,veth
export GLOO_SOCKET_IFNAME=enp1s0f0np0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=Simple
```
### NCCL Environment Variables Explained
| Variable | Value | Why |
|----------|-------|-----|
| `NCCL_SOCKET_IFNAME` | `^lo,docker,tailscale,veth` | Exclude loopback/Docker/VPN interfaces; NCCL auto-selects the right NIC on each machine |
| `NCCL_IB_DISABLE` | `1` | Sparks have RoCE, Thors don't; forces all nodes onto the TCP socket transport |
| `NCCL_P2P_DISABLE` | `1` | No NVLink between nodes |
| `NCCL_SHM_DISABLE` | `1` | Disables the shared-memory transport; forces TCP for all communication |
| `NCCL_PROTO` | `Simple` | Most compatible protocol across heterogeneous nodes |
| `GLOO_SOCKET_IFNAME` | `enp1s0f0np0` | NIC for the PyTorch Gloo backend (used during init) |
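Since these exports are needed in every new terminal that launches a server, it can help to persist them in a small env file. A minimal sketch (the file path is just a suggestion):

```shell
# Write the NCCL/Gloo settings once, then `source` the file in any new shell.
cat > ~/sglang_nccl.env <<'EOF'
export NCCL_SOCKET_IFNAME=^lo,docker,tailscale,veth
export GLOO_SOCKET_IFNAME=enp1s0f0np0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=Simple
EOF

source ~/sglang_nccl.env
echo "NCCL ready: proto=${NCCL_PROTO}, ifname=${NCCL_SOCKET_IFNAME}"
```

Remember that `GLOO_SOCKET_IFNAME` must match the NIC name on each individual machine.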
---
## Example: Nemotron Super 120B NVFP4 (4 nodes)
```bash
python -m sglang.launch_server \
--model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--served-model-name nvidia/nemotron-3-super \
--dtype auto \
--kv-cache-dtype fp8 \
--tp 1 \
--pp 4 \
--nnodes 4 \
--use-ray \
--trust-remote-code \
--mem-fraction-static 0.85 \
--chunked-prefill-size 8192 \
--max-running-requests 512 \
--host 0.0.0.0 \
--port 5000 \
--disable-cuda-graph
```
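A back-of-envelope check of why this fits: assuming NVFP4 stores roughly 0.5 bytes per parameter (ignoring embeddings, KV cache, and activation overhead), the 120B-parameter model needs about 60 GB for weights, and `--pp 4` splits that across the four nodes:

```shell
# Rough per-stage weight footprint for 120B params at 4-bit, split 4 ways.
python - <<'EOF'
params = 120e9          # 120B parameters
bytes_per_param = 0.5   # NVFP4 (4-bit) weights, rough estimate
pp = 4                  # pipeline stages
per_stage_gib = params * bytes_per_param / pp / 2**30
print(f"~{per_stage_gib:.0f} GiB of weights per PP stage")
EOF
```

That leaves most of each node's 128 GB unified memory for KV cache and runtime overhead, which is why `--mem-fraction-static 0.85` is comfortable here.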
---
## Example: Qwen3.5-122B-A10B-FP8 (4 nodes)
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-122B-A10B-FP8 \
--served-model-name qwen3.5-122b \
--tp 1 \
--pp 4 \
--nnodes 4 \
--use-ray \
--trust-remote-code \
--mem-fraction-static 0.85 \
--context-length 32768 \
--max-running-requests 256 \
--host 0.0.0.0 \
--port 5000 \
--disable-cuda-graph
```
---
## Example: Nemotron Super NVFP4 (2 Sparks only)
If Thor nodes aren't available, use only the 2 Sparks:
```bash
python -m sglang.launch_server \
--model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--served-model-name nvidia/nemotron-3-super \
--dtype auto \
--kv-cache-dtype fp8 \
--tp 1 \
--pp 2 \
--nnodes 2 \
--use-ray \
--trust-remote-code \
--mem-fraction-static 0.85 \
--chunked-prefill-size 8192 \
--max-running-requests 512 \
--host 0.0.0.0 \
--port 5000 \
--disable-cuda-graph
```
---
## Step 3: Test the API
SGLang exposes an OpenAI-compatible API. Once the server is ready, test from any machine:
```bash
curl http://10.0.0.25:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/nemotron-3-super",
"messages": [{"role": "user", "content": "Hello! Write a haiku about GPUs."}],
"max_tokens": 200,
"temperature": 1.0,
"top_p": 0.95
}'
```
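Model loading across four nodes can take a while, so it helps to poll before sending requests. SGLang exposes a `/health` endpoint; the `wait_for_server` helper below is a sketch (the retry count and 5 s intervals are arbitrary defaults):

```shell
# Poll <url>/health until it answers, up to <tries> attempts (default 60).
wait_for_server() {
    local url="$1" tries="${2:-60}"
    for _ in $(seq 1 "${tries}"); do
        if curl -sf --max-time 5 "${url}/health" >/dev/null; then
            echo "server ready"
            return 0
        fi
        sleep 5
    done
    echo "server not ready after ${tries} attempts"
    return 1
}

# Example, from any machine on the LAN:
#   wait_for_server http://10.0.0.25:5000
```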
---
## Key Flags Reference
| Flag | Purpose |
|------|---------|
| `--model-path` | HuggingFace model ID or local path |
| `--pp N` | Split model across N nodes (pipeline parallelism) |
| `--tp 1` | No tensor parallelism (1 GPU per node) |
| `--nnodes N` | Number of nodes in the cluster |
| `--use-ray` | Use Ray actors for multi-node process management ([PR #17684](https://github.com/sgl-project/sglang/pull/17684)) |
| `--disable-cuda-graph` | Disable CUDA graphs (avoids Triton ptxas sm_110a errors and OOM from graph capture) |
| `--mem-fraction-static 0.85` | Use 85% of GPU memory for static allocation (leave room for OS) |
| `--kv-cache-dtype fp8` | FP8 KV cache (saves memory, NVFP4 models only) |
| `--chunked-prefill-size N` | Enable chunked prefill with chunk size N tokens |
| `--max-running-requests N` | Maximum concurrent requests |
| `--context-length N` | Override model's max context length |
| `--trust-remote-code` | Allow running model code from HuggingFace |
## Troubleshooting
### NCCL "wrong type 3 != 4"
**Cause**: Sparks use RoCE/IB, Thors use Socket — different NCCL transports.
**Fix**: `export NCCL_IB_DISABLE=1` forces all nodes to Socket.
### Gloo "Connection refused 127.0.0.1"
**Cause**: Gloo defaults to localhost instead of the real IP.
**Fix**: `export GLOO_SOCKET_IFNAME=enp1s0f0np0` (set to your NIC name).
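Interface names like `enp1s0f0np0` vary per machine, so find the right one on each node before exporting. A small sketch that lists every NIC alongside its IPv4 address:

```shell
# Print "interface CIDR" pairs; pick the interface that carries the 10.0.0.x IP.
list_ifaces() {
    ip -o -4 addr show 2>/dev/null | awk '{print $2, $4}'
}
list_ifaces
```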
### "no kernel image is available for execution on the device" (Thor only)
**Cause**: Pre-built wheels don't include `sm_110` CUDA kernels.
**Fix**: Reinstall `sgl-kernel` from the cu130 index with `TORCH_CUDA_ARCH_LIST="11.0a"` set before install.
### "ptxas fatal: Value 'sm_110a' is not defined" (Thor only)
**Cause**: Triton's bundled ptxas doesn't support sm_110a.
**Fix**: Set `export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas` and use `--disable-cuda-graph`.
### OOM during CUDA graph capture
**Cause**: CUDA graph capture consumes extra GPU memory.
**Fix**: Use `--disable-cuda-graph` to skip graph capture entirely.
### "Free memory less than desired"
**Cause**: Other processes using GPU memory.
**Fix**: Lower `--mem-fraction-static` (e.g., 0.80 or 0.75).
### NCCL "Network is unreachable" on fe80:: addresses
**Cause**: NCCL tries IPv6 link-local interfaces that aren't connected.
**Fix**: Use `NCCL_SOCKET_IFNAME=enp1s0f0np0,enP2p1s0f0np0` to whitelist only working interfaces.
### Ray actor placement fails
**Cause**: Not enough GPUs visible in the Ray cluster.
**Fix**: Verify `ray status` shows all expected nodes and GPUs before launching.
---
## Architecture Notes
- **Pipeline Parallelism (PP)**: Splits model layers across nodes. Each node processes a subset of layers. Low inter-node communication — ideal for Ethernet.
- **Tensor Parallelism (TP)**: Splits each layer across GPUs. High inter-node communication — requires NVLink. NOT suitable for this cluster over Ethernet.
- SGLang uses ZMQ for inter-process data transfer. With `--use-ray`, Ray manages process lifecycle (control plane) while ZMQ handles the data plane — zero throughput overhead.
- The `--use-ray` flag is opt-in ([PR #17684](https://github.com/sgl-project/sglang/pull/17684)). Without it, SGLang defaults to Python multiprocessing.
- DGX Spark (GB10) = compute capability 12.1, `sm_120`
- Jetson Thor = compute capability 11.0, `sm_110` / `sm_110a`
## Appendix: `run_cluster_bare.sh`
The full launcher script used in Step 1:
```bash
#!/bin/bash
#
# Launch a Ray cluster (without Docker) for SGLang multi-node inference.
#
# Uses the uv virtual environment at .sglang/ relative to this script.
# All machines must have the same environment replicated at the same path,
# and must be reachable at the supplied IP addresses (port 6379 open).
#
# Cluster layout (4 nodes):
#     Spark 1 (HEAD)   – IP_SPARK_1
#     Spark 2 (worker) – IP_SPARK_2
#     Thor 1 (worker)  – IP_THOR_1
#     Thor 2 (worker)  – IP_THOR_2
#
# Usage – run each command on the corresponding machine:
#
# 1. Spark 1 (HEAD):
#        bash run_cluster_bare.sh IP_SPARK_1 --head
#                                 ^^^^^^^^^^
#                                 its own IP
#
# 2. Spark 2 (worker):
#        bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_SPARK_2
#                                 ^^^^^^^^^^                    ^^^^^^^^^^
#                                 HEAD IP                       its own IP
#
# 3. Thor 1 (worker):
#        bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_1
#                                 ^^^^^^^^^^                    ^^^^^^^^^
#                                 HEAD IP                       its own IP
#
# 4. Thor 2 (worker):
#        bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_2
#                                 ^^^^^^^^^^                    ^^^^^^^^^
#                                 HEAD IP                       its own IP
#
# 5. Once all workers have joined, open another terminal on Spark 1 (HEAD):
# source .sglang/bin/activate
# python -m sglang.launch_server --model-path <model> --tp 1 --pp <N> --nnodes <N> --use-ray
#
# Keep each terminal session open. Ctrl-C stops the Ray node.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV_DIR="${SCRIPT_DIR}/.sglang"

if [ ! -f "${VENV_DIR}/bin/activate" ]; then
    echo "Error: virtual environment not found at ${VENV_DIR}"
    echo "Create it with: uv venv .sglang --python 3.12 && uv pip install sglang ray"
    exit 1
fi

# shellcheck disable=SC1091
source "${VENV_DIR}/bin/activate"
echo "Activated virtualenv: ${VIRTUAL_ENV}"

if [ $# -lt 2 ]; then
    echo "Usage: $0 <head_node_ip> --head|--worker [--node-ip <this_node_ip>]"
    exit 1
fi

HEAD_NODE_ADDRESS="$1"
NODE_TYPE="$2"
shift 2

NODE_IP=""
while [[ $# -gt 0 ]]; do
    case "$1" in
        --node-ip)
            NODE_IP="$2"
            shift 2
            ;;
        *)
            echo "Unknown argument: $1"
            exit 1
            ;;
    esac
done

if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Auto-detect the network interface for the given IP so that
# NCCL and Gloo use the correct NIC instead of defaulting to loopback.
detect_interface() {
    local target_ip="$1"
    ip -4 addr show | awk -v ip="${target_ip}" '$0 ~ "inet " ip "/" {print $NF; exit}'
}

MY_IP=""
if [ "${NODE_TYPE}" == "--head" ]; then
    MY_IP="${HEAD_NODE_ADDRESS}"
else
    MY_IP="${NODE_IP:-}"
fi

if [ -n "${MY_IP}" ]; then
    IFNAME=$(detect_interface "${MY_IP}")
    if [ -n "${IFNAME}" ]; then
        export NCCL_SOCKET_IFNAME="${IFNAME}"
        export GLOO_SOCKET_IFNAME="${IFNAME}"
        export TP_SOCKET_IFNAME="${IFNAME}"
        echo "Detected interface ${IFNAME} for IP ${MY_IP}"
    else
        echo "Warning: could not detect interface for ${MY_IP}, NCCL/Gloo may fail"
    fi
fi

cleanup() {
    echo "Stopping Ray node..."
    ray stop
}
trap cleanup EXIT

if [ "${NODE_TYPE}" == "--head" ]; then
    echo "Starting Ray HEAD node on ${HEAD_NODE_ADDRESS}:6379 ..."
    ray start --block --head \
        --node-ip-address="${HEAD_NODE_ADDRESS}" \
        --port=6379 \
        --dashboard-host=0.0.0.0
else
    WORKER_IP="${NODE_IP:-}"
    echo "Starting Ray WORKER node, connecting to head at ${HEAD_NODE_ADDRESS}:6379 ..."
    RAY_ARGS=(--block --address="${HEAD_NODE_ADDRESS}:6379")
    if [ -n "${WORKER_IP}" ]; then
        RAY_ARGS+=(--node-ip-address="${WORKER_IP}")
    fi
    ray start "${RAY_ARGS[@]}"
fi
```