---
title: Run vLLM on Thor & Spark
---

# Run vLLM on Thor & Spark

## 1. Install uv

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

## 2. Create the environment

```bash
sudo apt install python3-dev
uv venv .vllm --python 3.12
source .vllm/bin/activate
```

## 3. Install vLLM

```bash
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
```

## 4. Install PyTorch

```bash
uv pip install --force-reinstall torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu130
```

## 5. Export variables

```bash
export TORCH_CUDA_ARCH_LIST=12.1a  # Spark: 12.1a; Thor: 11.0a
```

## 6. Clean memory

```bash
sync && sudo sysctl -w vm.drop_caches=3
```

## 7. Run Nemotron

Download the custom reasoning parser from the Hugging Face repository:

```bash
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py
```

Serve the model:

```bash
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --port 9000 \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.75 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8
```

Send a test request:

```bash
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    "messages": [
      {
        "role": "user",
        "content": "Write an extremely long and detailed essay on the complete history of computing, from the abacus to modern artificial intelligence. Do not summarize anything. Keep writing without stopping."
      }
    ],
    "max_tokens": 2000,
    "temperature": 0.8,
    "stream": false
  }'
```

If you are instead serving nvidia/Cosmos-Reason2-2B (note that this example targets the default port 8000, not 9000), query it the same way:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Cosmos-Reason2-2B",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain in English how a combustion engine works."}
    ],
    "temperature": 0.7
  }'
```

## 8. Serve the Model (Speculative Decoding with MTP)

Launch vLLM with Qwen3.5-122B using 3 speculative tokens via MTP:

```bash
vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --port 9000 \
  --max-num-seqs 2 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --trust-remote-code \
  --gpu-memory-utilization 0.80 \
  --kv-cache-dtype fp8 \
  --speculative_config '{"method":"mtp","num_speculative_tokens":3}'
```

## 9. Run a Stress Test (Separate Terminal)

In another terminal with the `.vllm` environment activated, run the following script. It sends one ~100K-token request, then 2 concurrent requests, then 10 rapid sequential ones:

```bash
python3 -c "
import concurrent.futures
import requests, sys, time

MODEL = 'Sehyo/Qwen3.5-122B-A10B-NVFP4'
PORT = 9000

# ~100K tokens: safely under the 131072 - 1024 = 130048 limit
parts = []
for i in range(3000):
    parts.append(f'Section {i}: The quick brown fox jumps over the lazy dog. Technology advances rapidly in quantum computing and distributed systems. ')
prompt = 'Write a comprehensive analysis: ' + ' '.join(parts)
print(f'Approx words: {len(prompt.split())}')
sys.stdout.flush()

def send_request(idx):
    t0 = time.time()
    try:
        r = requests.post(f'http://localhost:{PORT}/v1/completions', json={
            'model': MODEL,
            'prompt': prompt,
            'max_tokens': 1024,
            'temperature': 0.7,
        }, timeout=600)
        elapsed = time.time() - t0
        if r.status_code == 200:
            data = r.json()
            text = data['choices'][0]['text']
            usage = data.get('usage', {})
            return f'[{idx}] OK - {len(text)}ch, prompt={usage.get(\"prompt_tokens\",\"?\")}, gen={usage.get(\"completion_tokens\",\"?\")}, {elapsed:.1f}s'
        else:
            err = r.json().get('error',{}).get('message','')[:200]
            return f'[{idx}] FAIL ({r.status_code}): {err}'
    except Exception as e:
        elapsed = time.time() - t0
        return f'[{idx}] CRASH - {type(e).__name__}: {e} ({elapsed:.1f}s)'

# Phase 1: single ~100K token request
print('=== Phase 1: Single ~100K token request ===')
sys.stdout.flush()
print(send_request(1)); sys.stdout.flush()

# Phase 2: 2 concurrent requests
print('=== Phase 2: 2 concurrent ===')
sys.stdout.flush()
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futs = [pool.submit(send_request, i) for i in range(2, 4)]
    for f in concurrent.futures.as_completed(futs):
        print(f.result()); sys.stdout.flush()

# Phase 3: 10 rapid sequential requests
print('=== Phase 3: 10 rapid sequential ===')
sys.stdout.flush()
for i in range(4, 14):
    r = send_request(i)
    print(r); sys.stdout.flush()
    if 'CRASH' in r:
        break
print('Done.')
" 2>&1
```
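Once the stress test finishes, its output can be tallied automatically. The sketch below is a small hypothetical helper (not part of the script above); it only assumes the `[idx] OK/FAIL/CRASH` line format that `send_request` returns:

```python
import re

# Matches the result lines printed by the stress-test script, e.g.
#   [1] OK - 2048ch, prompt=99000, gen=1024, 45.3s
#   [5] FAIL (400): context length exceeded
#   [7] CRASH - ConnectionError: ... (600.0s)
RESULT_RE = re.compile(r"\[(\d+)\]\s+(OK|FAIL|CRASH)\b")

def summarize(lines):
    """Count OK/FAIL/CRASH results in the stress-test output lines."""
    counts = {"OK": 0, "FAIL": 0, "CRASH": 0}
    for line in lines:
        m = RESULT_RE.match(line)
        if m:
            counts[m.group(2)] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "=== Phase 1: Single ~100K token request ===",
        "[1] OK - 2048ch, prompt=99000, gen=1024, 45.3s",
        "[2] FAIL (400): context length exceeded",
    ]
    print(summarize(sample))  # {'OK': 1, 'FAIL': 1, 'CRASH': 0}
```

For example, redirect the stress test to a file (`... 2>&1 | tee stress.log`) and pass the file's lines to `summarize` for a quick pass/fail count.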