# curl for local LLM APIs: a practical manual
This guide is about using `curl` to talk to an OpenAI-compatible local server such as vLLM.
Assume your server is:
- Base URL: `http://127.0.0.1:8080/v1`
- Model name: `models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k`
- Optional API key: `$VLLM_API_KEY`
---
## 1. What curl is
`curl` sends HTTP requests from the terminal.
General form:
~~~bash
curl [options] URL
~~~
For APIs, the options you use most are:
- `-H` for headers
- `-d` for request body
- `-X` for request method
- `-s` for silent mode
- `-v` for verbose debugging
- `-o` for output file
- `-N` to disable buffering for streaming
---
## 2. The mental model of an API request
An HTTP request usually has:
- a URL
- a method such as `GET` or `POST`
- headers
- optionally a body
For example:
~~~bash
curl http://127.0.0.1:8080/v1/models
~~~
This is a `GET` request.
This:
~~~bash
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"...","messages":[{"role":"user","content":"hello"}]}'
~~~
is a `POST` request because `-d` is present.
---
## 3. The endpoints you care about for vLLM
### List models
~~~bash
curl http://127.0.0.1:8080/v1/models
~~~
### Chat completions
~~~bash
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"hello"}
]
}'
~~~
### Legacy completions
~~~bash
curl http://127.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"prompt": "hello",
"max_tokens": 100
}'
~~~
---
## 4. Headers: what they do
### `Content-Type`
Almost always needed for POSTing JSON.
~~~bash
-H "Content-Type: application/json"
~~~
### `Authorization`
Needed only if your server requires an API key.
~~~bash
-H "Authorization: Bearer $VLLM_API_KEY"
~~~
### `Accept`
Usually optional.
~~~bash
-H "Accept: application/json"
~~~
### `User-Agent`
Usually optional, useful for logs.
~~~bash
-H "User-Agent: local-client"
~~~
### Typical minimal header set
Without API key:
~~~bash
-H "Content-Type: application/json"
~~~
With API key:
~~~bash
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY"
~~~
---
## 5. The request body
For chat, the body usually contains:
- `model`
- `messages`
- optional generation settings like `max_tokens`, `temperature`, `stream`
Example:
~~~json
{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"system","content":"Answer briefly."},
{"role":"user","content":"Say hi."}
],
"max_tokens": 50,
"temperature": 0.7
}
~~~
Important: most servers require the `model` field even when only one model is loaded.
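A cheap way to catch body mistakes early is to parse the JSON locally before curl ever sees it. This sketch assumes `python3` is installed; `jq .` does the same job:

```shell
# Pipe the body through python3's JSON parser before trusting it in a request.
# A quoted string keeps the shell from touching the JSON.
printf '%s' '{
  "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
  "messages": [
    {"role":"system","content":"Answer briefly."},
    {"role":"user","content":"Say hi."}
  ],
  "max_tokens": 50,
  "temperature": 0.7
}' | python3 -m json.tool > /dev/null && echo "valid JSON"
```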
---
## 6. Your first useful commands
### Check if server is alive
~~~bash
curl http://127.0.0.1:8080/v1/models
~~~
### Send a chat request
~~~bash
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"hello"}
],
"max_tokens": 100
}'
~~~
### Extract only the text with jq
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"hello"}
],
"max_tokens": 100
}' | jq -r '.choices[0].message.content'
~~~
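The `.choices[0].message.content` path is the standard shape of an OpenAI-style chat response, so you can try the filter on a canned response without any server running (assumes `jq` is installed):

```shell
# A minimal chat-completion response, piped through the same jq filter
# used above to pull out only the assistant text.
echo '{"choices":[{"message":{"role":"assistant","content":"hi"}}]}' \
  | jq -r '.choices[0].message.content'
# prints: hi
```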
---
## 7. Useful flags you should actually know
### `-s` silent mode
Removes the progress meter. Note it also silences error messages; `-sS` (silent, but still show errors) is a good habit in scripts.
~~~bash
curl -s http://127.0.0.1:8080/v1/models
~~~
### `-v` verbose mode
Shows request and response headers. Great for debugging.
~~~bash
curl -v http://127.0.0.1:8080/v1/models
~~~
### `-i` include response headers
Prints headers in the output.
~~~bash
curl -i http://127.0.0.1:8080/v1/models
~~~
### `-X` specify method
Usually unnecessary if `-d` is used, but explicit is fine.
~~~bash
curl -X POST http://127.0.0.1:8080/v1/chat/completions ...
~~~
### `-d` send body data
~~~bash
-d '{"key":"value"}'
~~~
### `-o` write response to file
~~~bash
curl -o response.json http://127.0.0.1:8080/v1/models
~~~
### `-O` save using remote filename
More useful for file downloads than APIs.
### `-w` write timing info
~~~bash
curl -s -o /dev/null -w "time_total=%{time_total}\n" http://127.0.0.1:8080/v1/models
~~~
### `--max-time`
Abort if request takes too long.
~~~bash
curl --max-time 30 http://127.0.0.1:8080/v1/models
~~~
### `--retry`
Retry on transient failures.
~~~bash
curl --retry 3 http://127.0.0.1:8080/v1/models
~~~
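One more flag worth knowing in scripts: `-f`/`--fail` makes curl exit non-zero on HTTP 4xx/5xx responses, so failures become visible to `&&`/`||` and to `set -e`. A sketch (the messages are just examples):

```shell
# Without -f, curl exits 0 even when the server answers 401 or 500.
# With -f, HTTP errors (and connection failures) give a non-zero exit status.
curl -sf http://127.0.0.1:8080/v1/models > /dev/null \
  && echo "server healthy" \
  || echo "server not healthy"
```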
### `-N` no buffering
Useful for streaming token responses.
~~~bash
curl -N ...
~~~
---
## 8. How quoting works
This part trips people constantly.
### Single quotes are safest for JSON
~~~bash
-d '{
"model": "m",
"messages": [{"role":"user","content":"hello"}]
}'
~~~
Inside single quotes the shell expands nothing, so the JSON reaches curl exactly as written.
### Double quotes allow variable expansion
~~~bash
-H "Authorization: Bearer $VLLM_API_KEY"
~~~
Use double quotes when you want `$VARIABLE` to expand.
### Bad pattern
~~~bash
-d "{
"model": "m"
}"
~~~
This breaks because the inner double quotes terminate the outer string: the shell strips the quotes around the JSON keys, and the server receives invalid JSON like `{ model: m }`.
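You can see both quoting behaviors without touching the server at all:

```shell
# Single quotes: the shell passes the text through untouched.
# Double quotes: the shell substitutes the variable first.
VLLM_API_KEY=demo-key
echo 'Authorization: Bearer $VLLM_API_KEY'   # prints: Authorization: Bearer $VLLM_API_KEY
echo "Authorization: Bearer $VLLM_API_KEY"   # prints: Authorization: Bearer demo-key
```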
---
## 9. Sending JSON from a file
This is much cleaner for bigger requests.
Create `req.json`:
~~~json
{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"Explain transformers simply."}
],
"max_tokens": 200
}
~~~
Send it:
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json
~~~
The `@file` syntax tells curl to read the body from a file.
---
## 10. Pretty-printing JSON
Use `jq`.
### Whole response
~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~
### Only assistant text
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json | jq -r '.choices[0].message.content'
~~~
### Token usage
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json | jq '.usage'
~~~
### Completion token count only
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json | jq '.usage.completion_tokens'
~~~
---
## 11. Measuring latency and tokens/sec
### Measure total request time
~~~bash
curl -s -o response.json \
-w "time_total=%{time_total}\n" \
http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json
~~~
Then inspect token count:
~~~bash
jq '.usage.completion_tokens' response.json
~~~
### Compute tokens/sec manually
If `completion_tokens = 182` and `time_total = 40`, then:
~~~text
182 / 40 = 4.55 tokens/sec
~~~
### Quick shell version
~~~bash
TOKENS=$(jq '.usage.completion_tokens' response.json)
TIME=40   # the time_total value reported by the -w format above
python3 -c "print($TOKENS/$TIME)"
~~~
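If `python3` is not installed, `awk` does the same arithmetic:

```shell
# 182 completion tokens over 40 seconds, formatted to two decimals.
awk -v n=182 -v t=40 'BEGIN { printf "%.2f tokens/sec\n", n/t }'
# prints: 4.55 tokens/sec
```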
---
## 12. Streaming responses
If the server supports streaming, add `"stream": true`.
~~~bash
curl -N http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"Tell me a short story."}
],
"stream": true
}'
~~~
Why `-N` matters: otherwise curl buffers output and your streaming looks dead.
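The stream arrives as server-sent events: `data: {json}` lines, terminated by `data: [DONE]`. This sketch runs two canned lines through the kind of filter you would pipe curl's output into. The `delta` shape matches OpenAI-compatible servers, but verify it against your own server's output; it also assumes `jq` is installed.

```shell
# Simulated stream: strip the "data: " prefix, drop the [DONE] sentinel,
# and join the per-chunk delta text into one string.
printf 'data: {"choices":[{"delta":{"content":"Hel"}}]}\ndata: {"choices":[{"delta":{"content":"lo"}}]}\ndata: [DONE]\n' \
  | sed 's/^data: //' \
  | grep -v '^\[DONE\]$' \
  | jq -rj '.choices[0].delta.content // empty'
echo
# prints: Hello
```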
---
## 13. Common generation options in the JSON body
### `max_tokens`
Maximum generated tokens.
~~~json
"max_tokens": 200
~~~
### `temperature`
Randomness. Lower is more deterministic.
~~~json
"temperature": 0.7
~~~
### `top_p`
Nucleus sampling.
~~~json
"top_p": 0.9
~~~
### `stream`
Whether to stream partial output.
~~~json
"stream": true
~~~
### `stop`
Stop sequences.
~~~json
"stop": ["</think>", "\n\nUser:"]
~~~
### Example
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"Write a haiku about GPUs."}
],
"max_tokens": 60,
"temperature": 0.8,
"top_p": 0.95
}'
~~~
---
## 14. Dealing with reasoning output like `<think>...</think>`
Some models emit reasoning traces.
### Tell the model not to output them
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"system","content":"Do not output reasoning or thinking traces. Provide only the final answer."},
{"role":"user","content":"hello"}
]
}'
~~~
### Or strip them out after the fact
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json \
| jq -r '.choices[0].message.content' \
| perl -0pe 's/<think>.*?<\/think>\s*//sg'
~~~
---
## 15. Common debugging patterns
### Is the server alive?
~~~bash
curl -v http://127.0.0.1:8080/v1/models
~~~
### Is auth failing?
Look for:
~~~text
HTTP/1.1 401 Unauthorized
~~~
### Is the model name wrong?
Look for:
~~~text
"The model `...` does not exist."
~~~
Then inspect:
~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~
### Is the request body malformed?
If JSON is bad, the server may return `400` or complain it cannot parse the body.
Use a JSON file and validate it:
~~~bash
jq . req.json
~~~
If `jq` can parse it, it is valid JSON.
### Is the request hanging?
Try a lightweight endpoint:
~~~bash
curl -s http://127.0.0.1:8080/v1/models
~~~
If that returns fast but generation hangs, the server is alive and inference is the issue.
---
## 16. HTTP status codes you will actually see
### `200 OK`
Everything fine.
### `400 Bad Request`
Your request body is malformed or missing required fields.
### `401 Unauthorized`
Missing or wrong API key.
### `404 Not Found`
Wrong endpoint or wrong model name.
### `500 Internal Server Error`
Server-side crash or internal failure.
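In scripts it helps to turn the code into a message. `%{http_code}` is a standard curl write-out variable; `explain_status` below is a made-up helper name, not a standard command:

```shell
# Map the status codes above to short explanations.
explain_status() {
  case "$1" in
    200) echo "ok" ;;
    400) echo "malformed or incomplete request body" ;;
    401) echo "missing or wrong API key" ;;
    404) echo "wrong endpoint or model name" ;;
    500) echo "server-side failure" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# With a real request:
# explain_status "$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/v1/models)"
explain_status 401
# prints: missing or wrong API key
```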
---
## 17. A few complete examples
### Example A: simple health check
~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~
### Example B: simple chat
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [{"role":"user","content":"hello"}]
}' | jq -r '.choices[0].message.content'
~~~
### Example C: use request file
`req.json`:
~~~json
{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"Explain attention in one paragraph."}
],
"max_tokens": 150
}
~~~
Command:
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json | jq -r '.choices[0].message.content'
~~~
### Example D: verbose debug
~~~bash
curl -v http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json
~~~
### Example E: timed request
~~~bash
curl -s -o response.json \
-w "time_total=%{time_total}\n" \
http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json
jq '.usage' response.json
~~~
---
## 18. Shell shortcuts that make life less cursed
### Save the base URL
~~~bash
export BASE_URL=http://127.0.0.1:8080/v1
~~~
Then:
~~~bash
curl -s "$BASE_URL/models"
~~~
### Save the model name
~~~bash
export MODEL='models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k'
~~~
Then:
~~~bash
curl -s "$BASE_URL/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d "{
\"model\": \"$MODEL\",
\"messages\": [{\"role\":\"user\",\"content\":\"hello\"}]
}"
~~~
This works, but quoting gets uglier fast. Files are often cleaner.
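A middle ground between escaped inline JSON and a static file: an unquoted heredoc, which expands `$MODEL` while the JSON stays readable:

```shell
# Unlike <<'EOF', a bare <<EOF delimiter lets the shell expand variables
# inside the heredoc, so the model name is filled in for you.
export MODEL='models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k'
cat > req.json <<EOF
{
  "model": "$MODEL",
  "messages": [{"role":"user","content":"hello"}]
}
EOF
cat req.json
```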
---
## 19. Handy aliases
### Alias for model list
~~~bash
alias llm-models='curl -s http://127.0.0.1:8080/v1/models | jq'
~~~
### Alias for basic chat via file
~~~bash
alias llm-chat='curl -s http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer $VLLM_API_KEY" -d @req.json | jq -r ".choices[0].message.content"'
~~~
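Aliases only expand in interactive shells; in scripts, a function is the better tool and can take arguments. `mk_body` below is a hypothetical helper, and note it does not JSON-escape the prompt, so keep prompts free of double quotes (or build the body with `jq -n`, which escapes properly):

```shell
# Build a minimal chat body for an ad-hoc prompt. No JSON escaping is done,
# so a prompt containing double quotes would produce invalid JSON.
MODEL='models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k'
mk_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$MODEL" "$1"
}

# Usage (-d @- tells curl to read the body from stdin):
# mk_body "hello" | curl -s http://127.0.0.1:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d @- | jq -r '.choices[0].message.content'
mk_body "hello"
```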
---
## 20. Common mistakes
### Wrong model string
You may send a filesystem path while the server expects a registered model ID, or vice versa.
Always check:
~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~
### Missing `Content-Type`
Then the server may not parse your JSON correctly.
### Missing API key header
Then you get `401`.
### Invalid JSON because of shell quoting
Use single quotes or `-d @req.json`.
### Forgetting `-N` when streaming
Then it looks frozen.
### Confusing shell variables with JSON strings
This expands:
~~~bash
-H "Authorization: Bearer $VLLM_API_KEY"
~~~
This does not:
~~~bash
-H 'Authorization: Bearer $VLLM_API_KEY'
~~~
because single quotes block variable expansion.
---
## 21. Minimal cheat sheet
### Health check
~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~
### Simple chat
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [{"role":"user","content":"hello"}]
}' | jq -r '.choices[0].message.content'
~~~
### Debug headers
~~~bash
curl -v http://127.0.0.1:8080/v1/models
~~~
### Timed request
~~~bash
curl -s -o /dev/null -w "time_total=%{time_total}\n" http://127.0.0.1:8080/v1/models
~~~
### Request from file
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json
~~~
### Stream output
~~~bash
curl -N http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [{"role":"user","content":"hello"}],
"stream": true
}'
~~~
---
## 22. Final practical advice
For local LLM servers, the most reliable pattern is:
1. put request JSON in a file
2. call `curl -s ... -d @req.json`
3. pipe to `jq -r '.choices[0].message.content'`
That avoids a lot of shell quoting nonsense and makes debugging way less cursed.
A very sane workflow looks like this:
~~~bash
cat > req.json <<'EOF'
{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"Explain KV cache simply."}
],
"max_tokens": 200
}
EOF
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json | jq -r '.choices[0].message.content'
~~~