# Running SGLang Multi-Node Cluster on DGX Spark + Jetson Thor

## Cluster Layout

| Machine | Hostname | IP | GPU | Role |
|---------|----------|-----|-----|------|
| DGX Spark 1 | spark-ea8e | 10.0.0.25 | NVIDIA GB10 (sm_120, 128GB) | HEAD |
| DGX Spark 2 | spark-4080 | 10.0.0.22 | NVIDIA GB10 (sm_120, 128GB) | Worker |
| Jetson Thor 1 | jetson-thor | 10.0.0.27 | NVIDIA Thor (sm_110, 128GB) | Worker |
| Jetson Thor 2 | thorx | 10.0.0.26 | NVIDIA Thor (sm_110, 128GB) | Worker |

**Total: 4 GPUs, ~461 GiB unified memory**

---

## Prerequisites

### Create the Virtual Environment

Each machine needs the SGLang virtualenv at `~/Projects/sglang/.sglang/`:

```bash
cd ~/Projects/sglang
uv venv .sglang --python 3.12
source .sglang/bin/activate
```

### Set Architecture & CUDA Paths

Choose the correct `TORCH_CUDA_ARCH_LIST` for your hardware:

**Jetson Thor:**

```bash
export TORCH_CUDA_ARCH_LIST="11.0a"
```

**DGX Spark:**

```bash
export TORCH_CUDA_ARCH_LIST="12.1a"
```

**Common (all machines):**

```bash
export CUDA_HOME=/usr/local/cuda-13
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH="${CUDA_HOME}/bin:$PATH"
```

### Install Dependencies

```bash
uv pip install sglang
uv pip install --force-reinstall torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu130
uv pip install --force-reinstall sgl-kernel --index-url https://docs.sglang.ai/whl/cu130/
```

### Install Ray (required for multi-node)

```bash
uv pip install -U "ray[all]"
```

### Verify the Installation

```bash
python -c "import sglang; print(f'SGLang version: {sglang.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, arch: {torch.cuda.get_device_capability()}')"
```

---

## Step 1: Start the Ray Cluster

Use `run_cluster_bare.sh` on each machine. The script auto-detects the network interface and sets the NCCL/Gloo environment variables.
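Before bringing up Ray, it can be worth sanity-checking that each node's GPU reports the compute capability matching the `TORCH_CUDA_ARCH_LIST` it was installed with. A minimal sketch, assuming the capability-to-flag mapping above (the `arch_flag` helper is illustrative, not part of SGLang):

```python
def arch_flag(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the TORCH_CUDA_ARCH_LIST value used above."""
    flags = {
        (11, 0): "11.0a",  # Jetson Thor (sm_110)
        (12, 1): "12.1a",  # DGX Spark GB10 (compute capability 12.1)
    }
    if (major, minor) not in flags:
        raise SystemExit(f"unexpected compute capability {major}.{minor}")
    return flags[(major, minor)]

# On a node with the cu130 torch wheel installed:
#   import torch
#   print(arch_flag(*torch.cuda.get_device_capability()))
print(arch_flag(11, 0))  # → 11.0a
```

If this raises, the node's wheel was likely built for the wrong architecture and `sgl-kernel` should be reinstalled with the correct `TORCH_CUDA_ARCH_LIST`.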
### Spark 1 (HEAD):

```bash
bash run_cluster_bare.sh 10.0.0.25 --head
```

### Spark 2 (Worker):

```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.22
```

### Thor 1 (Worker):

```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.27
```

### Thor 2 (Worker):

```bash
bash run_cluster_bare.sh 10.0.0.25 --worker --node-ip 10.0.0.26
```

### Verify the cluster:

```bash
source .sglang/bin/activate
ray status
```

You should see 4 active nodes, 4 GPUs, and ~461 GiB of memory.

---

## Step 2: Serve a Model

Open another terminal on the HEAD (Spark 1) and set the environment:

```bash
source .sglang/bin/activate
export NCCL_SOCKET_IFNAME=^lo,docker,tailscale,veth
export GLOO_SOCKET_IFNAME=enp1s0f0np0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_PROTO=Simple
```

### NCCL Environment Variables Explained

| Variable | Value | Why |
|----------|-------|-----|
| `NCCL_SOCKET_IFNAME` | `^lo,docker,tailscale,veth` | Exclude loopback/Docker/VPN interfaces; NCCL auto-selects the right NIC on each machine |
| `NCCL_IB_DISABLE=1` | Disable InfiniBand | Sparks have RoCE, Thors don't — forces all nodes to use the TCP socket transport |
| `NCCL_P2P_DISABLE=1` | Disable peer-to-peer | No NVLink between nodes |
| `NCCL_SHM_DISABLE=1` | Disable shared memory | Forces TCP for all communication |
| `NCCL_PROTO=Simple` | Simple protocol | Most compatible across heterogeneous nodes |
| `GLOO_SOCKET_IFNAME` | `enp1s0f0np0` | For the PyTorch Gloo backend (used during init) |

---

## Example: Nemotron Super 120B NVFP4 (4 nodes)

```bash
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nvidia/nemotron-3-super \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tp 1 \
  --pp 4 \
  --nnodes 4 \
  --use-ray \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 8192 \
  --max-running-requests 512 \
  --host 0.0.0.0 \
  --port 5000 \
  --disable-cuda-graph
```

---

## Example: Qwen3.5-122B-A10B-FP8 (4 nodes)

```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-122B-A10B-FP8 \
  --served-model-name qwen3.5-122b \
  --tp 1 \
  --pp 4 \
  --nnodes 4 \
  --use-ray \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --context-length 32768 \
  --max-running-requests 256 \
  --host 0.0.0.0 \
  --port 5000 \
  --disable-cuda-graph
```

---

## Example: Nemotron Super NVFP4 (2 Sparks only)

If the Thor nodes aren't available, use only the 2 Sparks:

```bash
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --served-model-name nvidia/nemotron-3-super \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tp 1 \
  --pp 2 \
  --nnodes 2 \
  --use-ray \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 8192 \
  --max-running-requests 512 \
  --host 0.0.0.0 \
  --port 5000 \
  --disable-cuda-graph
```

---

## Step 3: Test the API

SGLang exposes an OpenAI-compatible API. Once the server is ready, test it from any machine:

```bash
curl http://10.0.0.25:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super",
    "messages": [{"role": "user", "content": "Hello! Write a haiku about GPUs."}],
    "max_tokens": 200,
    "temperature": 1.0,
    "top_p": 0.95
  }'
```

---

## Key Flags Reference

| Flag | Purpose |
|------|---------|
| `--model-path` | HuggingFace model ID or local path |
| `--pp N` | Split model across N nodes (pipeline parallelism) |
| `--tp 1` | No tensor parallelism (1 GPU per node) |
| `--nnodes N` | Number of nodes in the cluster |
| `--use-ray` | Use Ray actors for multi-node process management ([PR #17684](https://github.com/sgl-project/sglang/pull/17684)) |
| `--disable-cuda-graph` | Disable CUDA graphs (avoids Triton ptxas sm_110a errors and OOM from graph capture) |
| `--mem-fraction-static 0.85` | Use 85% of GPU memory for static allocation (leave room for the OS) |
| `--kv-cache-dtype fp8` | FP8 KV cache (saves memory, NVFP4 models only) |
| `--chunked-prefill-size N` | Enable chunked prefill with chunk size N tokens |
| `--max-running-requests N` | Maximum concurrent requests |
| `--context-length N` | Override the model's max context length |
| `--trust-remote-code` | Allow running model code from HuggingFace |

## Troubleshooting

### NCCL "wrong type 3 != 4"

**Cause**: Sparks use RoCE/IB, Thors use Socket — different NCCL transports.
**Fix**: `export NCCL_IB_DISABLE=1` forces all nodes to Socket.

### Gloo "Connection refused 127.0.0.1"

**Cause**: Gloo defaults to localhost instead of the real IP.
**Fix**: `export GLOO_SOCKET_IFNAME=enp1s0f0np0` (set to your NIC name).

### "no kernel image is available for execution on the device" (Thor only)

**Cause**: Pre-built wheels don't include `sm_110` CUDA kernels.
**Fix**: Reinstall `sgl-kernel` from the cu130 index with `TORCH_CUDA_ARCH_LIST="11.0a"` set before the install.

### "ptxas fatal: Value 'sm_110a' is not defined" (Thor only)

**Cause**: Triton's bundled ptxas doesn't support sm_110a.
**Fix**: Set `export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas` and use `--disable-cuda-graph`.

### OOM during CUDA graph capture

**Cause**: CUDA graph capture consumes extra GPU memory.
**Fix**: Use `--disable-cuda-graph` to skip graph capture entirely.

### "Free memory less than desired"

**Cause**: Other processes are using GPU memory.
**Fix**: Lower `--mem-fraction-static` (e.g., 0.80 or 0.75).

### NCCL "Network is unreachable" on fe80:: addresses

**Cause**: NCCL tries IPv6 link-local interfaces that aren't connected.
**Fix**: Use `NCCL_SOCKET_IFNAME=enp1s0f0np0,enP2p1s0f0np0` to whitelist only working interfaces.

### Ray actor placement fails

**Cause**: Not enough GPUs visible in the Ray cluster.
**Fix**: Verify that `ray status` shows all expected nodes and GPUs before launching.

---

## Architecture Notes

- **Pipeline Parallelism (PP)**: Splits model layers across nodes; each node processes a subset of layers. Low inter-node communication — ideal for Ethernet.
- **Tensor Parallelism (TP)**: Splits each layer across GPUs. High inter-node communication — requires NVLink. NOT suitable for this cluster over Ethernet.
- SGLang uses ZMQ for inter-process data transfer. With `--use-ray`, Ray manages the process lifecycle (control plane) while ZMQ handles the data plane — zero throughput overhead.
- The `--use-ray` flag is opt-in ([PR #17684](https://github.com/sgl-project/sglang/pull/17684)). Without it, SGLang defaults to Python multiprocessing.
- DGX Spark (GB10) = compute capability 12.1, `sm_120`
- Jetson Thor = compute capability 11.0, `sm_110` / `sm_110a`

## Extra Code: `run_cluster_bare.sh`

```bash
#!/bin/bash
#
# Launch a Ray cluster (without Docker) for SGLang multi-node inference.
#
# Uses the uv virtual environment at .sglang/ relative to this script.
# All machines must have the same environment replicated at the same path,
# and must be reachable at the supplied IP addresses (port 6379 open).
#
# Cluster layout (4 nodes):
#   Spark 1 (HEAD)   - IP_SPARK_1
#   Spark 2 (worker) - IP_SPARK_2
#   Thor 1 (worker)  - IP_THOR_1
#   Thor 2 (worker)  - IP_THOR_2
#
# Usage - run each command on the corresponding machine:
#
# 1. Spark 1 (HEAD):
#      bash run_cluster_bare.sh IP_SPARK_1 --head
#                               ^^^^^^^^^
#                               its own IP
#
# 2. Spark 2 (worker):
#      bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_SPARK_2
#                               ^^^^^^^^^            ^^^^^^^^^^
#                               IP of the HEAD       its own IP
#
# 3. Thor 1 (worker):
#      bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_1
#                               ^^^^^^^^^            ^^^^^^^^^
#                               IP of the HEAD       its own IP
#
# 4. Thor 2 (worker):
#      bash run_cluster_bare.sh IP_SPARK_1 --worker --node-ip IP_THOR_2
#                               ^^^^^^^^^            ^^^^^^^^^
#                               IP of the HEAD       its own IP
#
# 5. Once all workers have joined, open another terminal on Spark 1 (HEAD):
#      source .sglang/bin/activate
#      python -m sglang.launch_server --model-path <model> --tp 1 --pp <N> --nnodes <N> --use-ray
#
# Keep each terminal session open. Ctrl-C stops the Ray node.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV_DIR="${SCRIPT_DIR}/.sglang"

if [ ! -f "${VENV_DIR}/bin/activate" ]; then
    echo "Error: virtual environment not found at ${VENV_DIR}"
    echo "Create it with: uv venv .sglang --python 3.12 && uv pip install sglang ray"
    exit 1
fi

# shellcheck disable=SC1091
source "${VENV_DIR}/bin/activate"
echo "Activated virtualenv: ${VIRTUAL_ENV}"

if [ $# -lt 2 ]; then
    echo "Usage: $0 <head_node_ip> --head|--worker [--node-ip <this_node_ip>]"
    exit 1
fi

HEAD_NODE_ADDRESS="$1"
NODE_TYPE="$2"
shift 2

NODE_IP=""
while [[ $# -gt 0 ]]; do
    case "$1" in
        --node-ip)
            NODE_IP="$2"
            shift 2
            ;;
        *)
            echo "Unknown argument: $1"
            exit 1
            ;;
    esac
done

if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Auto-detect the network interface for the given IP so that
# NCCL and Gloo use the correct NIC instead of defaulting to loopback.
detect_interface() {
    local target_ip="$1"
    ip -4 addr show | awk -v ip="${target_ip}" '$0 ~ "inet " ip "/" {print $NF; exit}'
}

MY_IP=""
if [ "${NODE_TYPE}" == "--head" ]; then
    MY_IP="${HEAD_NODE_ADDRESS}"
else
    MY_IP="${NODE_IP:-}"
fi

if [ -n "${MY_IP}" ]; then
    IFNAME=$(detect_interface "${MY_IP}")
    if [ -n "${IFNAME}" ]; then
        export NCCL_SOCKET_IFNAME="${IFNAME}"
        export GLOO_SOCKET_IFNAME="${IFNAME}"
        export TP_SOCKET_IFNAME="${IFNAME}"
        echo "Detected interface ${IFNAME} for IP ${MY_IP}"
    else
        echo "Warning: could not detect interface for ${MY_IP}; NCCL/Gloo may fail"
    fi
fi

cleanup() {
    echo "Stopping Ray node..."
    ray stop
}
trap cleanup EXIT

if [ "${NODE_TYPE}" == "--head" ]; then
    echo "Starting Ray HEAD node on ${HEAD_NODE_ADDRESS}:6379 ..."
    ray start --block --head \
        --node-ip-address="${HEAD_NODE_ADDRESS}" \
        --port=6379 \
        --dashboard-host=0.0.0.0
else
    WORKER_IP="${NODE_IP:-}"
    echo "Starting Ray WORKER node, connecting to head at ${HEAD_NODE_ADDRESS}:6379 ..."
    RAY_ARGS=(--block --address="${HEAD_NODE_ADDRESS}:6379")
    if [ -n "${WORKER_IP}" ]; then
        RAY_ARGS+=(--node-ip-address="${WORKER_IP}")
    fi
    ray start "${RAY_ARGS[@]}"
fi
```
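The curl test in Step 3 can also be driven from Python using only the standard library. A minimal sketch; the `build_chat_request` and `chat` helper names are illustrative, and the endpoint and payload mirror the curl example:

```python
import json
from urllib import request

def build_chat_request(host, port, model, prompt, max_tokens=200):
    """Build the URL and JSON body for an OpenAI-style /v1/chat/completions call."""
    url = f"http://{host}:{port}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
    }).encode()
    return url, body

def chat(host, port, model, prompt):
    """Send the request and return the assistant's reply text."""
    url, body = build_chat_request(host, port, model, prompt)
    req = request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Against the running cluster (uncomment on a machine that can reach the HEAD):
# print(chat("10.0.0.25", 5000, "nvidia/nemotron-3-super", "Write a haiku about GPUs."))
print(build_chat_request("10.0.0.25", 5000, "nvidia/nemotron-3-super", "Hello!")[0])
# → http://10.0.0.25:5000/v1/chat/completions
```

Because the API is OpenAI-compatible, the official `openai` Python client pointed at `base_url="http://10.0.0.25:5000/v1"` should also work.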