---
title: Run vLLM on Thor & Spark
---
# Run vLLM on Thor & Spark
1. Install uv
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. Create environment
```bash
sudo apt install python3-dev
uv venv .vllm --python 3.12
source .vllm/bin/activate
```
3. Install vLLM
```bash
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
```
4. Install PyTorch
```bash
uv pip install --force-reinstall torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu130
```
5. Export variables
```bash
export TORCH_CUDA_ARCH_LIST=12.1a  # 12.1a for Spark; use 11.0a for Thor
```
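The right value follows from the GPU's CUDA compute capability (Spark reports 12.1, Thor 11.0; `torch.cuda.get_device_capability()` returns it as a tuple). A minimal sketch of the mapping, with `arch_list_for` being a helper name of my own:

```python
def arch_list_for(capability):
    """Map a CUDA compute capability tuple to a TORCH_CUDA_ARCH_LIST value.

    Hypothetical helper: Spark's (12, 1) -> '12.1a', Thor's (11, 0) -> '11.0a'.
    The 'a' suffix selects architecture-specific kernel features.
    """
    major, minor = capability
    return f"{major}.{minor}a"

print(arch_list_for((12, 1)))  # Spark -> 12.1a
print(arch_list_for((11, 0)))  # Thor  -> 11.0a
```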
6. Clean memory
```bash
sync && sudo sysctl -w vm.drop_caches=3
```
7. Run Nemotron
Download the custom reasoning parser from the model's Hugging Face repository:
```bash
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py
```
```bash
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--port 9000 \
--max-num-seqs 8 \
--tensor-parallel-size 1 \
--max-model-len 8000 \
--trust-remote-code \
--enable-auto-tool-choice \
--gpu-memory-utilization 0.75 \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3 \
--kv-cache-dtype fp8
```
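Once the server reports ready, you can confirm which model it is serving via the OpenAI-compatible `/v1/models` endpoint. A small sketch of parsing that response; `served_model_ids` is my own helper, and the sample body is inlined here rather than fetched:

```python
import json

def served_model_ids(models_response: str):
    """Return the model ids listed in an OpenAI-style /v1/models response."""
    data = json.loads(models_response)
    return [m["id"] for m in data.get("data", [])]

# In practice the body would come from: curl http://localhost:9000/v1/models
sample = '{"object": "list", "data": [{"id": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4", "object": "model"}]}'
print(served_model_ids(sample))  # ['nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4']
```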
Once the server is up, send a test request:
```bash
curl http://localhost:9000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
  "messages": [
    {
      "role": "user",
      "content": "Write an extremely long and detailed essay on the complete history of computing, from the abacus to modern artificial intelligence. Do not summarize anything. Keep writing without stopping."
    }
  ],
  "max_tokens": 2000,
  "temperature": 0.8,
  "stream": false
}'
```
A request that includes a system prompt (note the port and model must match the serve command above):
```bash
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain in English how a combustion engine works."}
    ],
    "temperature": 0.7
  }'
```
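The same requests can be built and sent from Python. A sketch using a payload-builder helper (`build_chat_payload` is my own convenience function, not part of any API):

```python
import json

def build_chat_payload(model, user_content, system=None, max_tokens=512, temperature=0.7):
    """Assemble an OpenAI-compatible chat completion payload."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_content})
    return {"model": model, "messages": messages,
            "max_tokens": max_tokens, "temperature": temperature}

payload = build_chat_payload(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    "Explain in English how a combustion engine works.",
    system="You are a helpful AI assistant.",
)
# To send it: requests.post('http://localhost:9000/v1/chat/completions', json=payload)
print(json.dumps(payload, indent=2))
```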
8. Serve the Model (Speculative Decoding with MTP)
Launch vLLM with Qwen3.5-122B using 3 speculative tokens via MTP:
```bash
vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
--port 9000 \
--max-num-seqs 2 \
--tensor-parallel-size 1 \
--max-model-len 131072 \
--trust-remote-code \
--gpu-memory-utilization 0.80 \
--kv-cache-dtype fp8 \
--speculative_config '{"method":"mtp","num_speculative_tokens":3}'
```
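To see why a few speculative tokens help, the standard speculative-decoding estimate (generic math, not vLLM- or MTP-specific) says that with `k` draft tokens and an independent per-token acceptance rate `alpha`, each decoding step yields `(1 - alpha**(k+1)) / (1 - alpha)` tokens in expectation:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens generated per decoding step with k speculative
    (draft) tokens and independent per-token acceptance rate alpha.
    Standard speculative-decoding estimate; real acceptance rates vary.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With k=3 (as in the serve command above), higher acceptance
# rates approach the 4-tokens-per-step ceiling.
for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, 3):.2f} tokens/step")
```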
9. Run a Stress Test (Separate Terminal)
In another terminal with the `.vllm` environment activated, run the following script. It sends one ~100K-token request, then 2 concurrent requests, then 10 rapid sequential long-context requests:
```bash
python3 -c "
import requests, time, sys, concurrent.futures

MODEL = 'Sehyo/Qwen3.5-122B-A10B-NVFP4'
PORT = 9000

# ~100K tokens: safely under the 131072 - 1024 = 130048 limit
parts = []
for i in range(3000):
    parts.append(f'Section {i}: The quick brown fox jumps over the lazy dog. Technology advances rapidly in quantum computing and distributed systems. ')
prompt = 'Write a comprehensive analysis: ' + ' '.join(parts)
print(f'Approx words: {len(prompt.split())}')
sys.stdout.flush()

def send_request(idx):
    t0 = time.time()
    try:
        r = requests.post(f'http://localhost:{PORT}/v1/completions', json={
            'model': MODEL,
            'prompt': prompt,
            'max_tokens': 1024,
            'temperature': 0.7,
        }, timeout=600)
        elapsed = time.time() - t0
        if r.status_code == 200:
            data = r.json()
            text = data['choices'][0]['text']
            usage = data.get('usage', {})
            return f'[{idx}] OK - {len(text)}ch, prompt={usage.get(\"prompt_tokens\",\"?\")}, gen={usage.get(\"completion_tokens\",\"?\")}, {elapsed:.1f}s'
        else:
            err = r.json().get('error',{}).get('message','')[:200]
            return f'[{idx}] FAIL ({r.status_code}): {err}'
    except Exception as e:
        elapsed = time.time() - t0
        return f'[{idx}] CRASH - {type(e).__name__}: {e} ({elapsed:.1f}s)'

# Phase 1: single ~100K token request
print('=== Phase 1: Single ~100K token request ===')
sys.stdout.flush()
print(send_request(1)); sys.stdout.flush()

# Phase 2: 2 concurrent requests
print('=== Phase 2: 2 concurrent ===')
sys.stdout.flush()
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futs = [pool.submit(send_request, i) for i in range(2, 4)]
    for f in concurrent.futures.as_completed(futs):
        print(f.result()); sys.stdout.flush()

# Phase 3: 10 rapid sequential requests
print('=== Phase 3: 10 rapid sequential ===')
sys.stdout.flush()
for i in range(4, 14):
    r = send_request(i)
    print(r); sys.stdout.flush()
    if 'CRASH' in r: break
print('Done.')
" 2>&1
```
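The script's word count only approximates token usage; a common rule of thumb (an approximation, not the model's actual tokenizer) is roughly 4 characters per token:

```python
def approx_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the true count depends on the model's tokenizer."""
    return int(len(text) / chars_per_token)

# Estimate for one repeated section of the stress-test prompt, times 3000
section = 'Section 0: The quick brown fox jumps over the lazy dog. Technology advances rapidly in quantum computing and distributed systems. '
print(approx_tokens(section * 3000))
```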