---
title: 'curl for local LLM APIs: a practical manual'
---

# curl for local LLM APIs: a practical manual

This guide is about using `curl` to talk to an OpenAI-compatible local server such as vLLM.

Assume your server is:

- Base URL: `http://127.0.0.1:8080/v1`
- Model name: `models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k`
- Optional API key: `$VLLM_API_KEY`

---

## 1. What curl is

`curl` sends HTTP requests from the terminal and prints the response to stdout.

General form:

~~~bash
curl [options] URL
~~~

For APIs, the options you use most are:

- `-H` for headers
- `-d` for request body
- `-X` for request method
- `-s` for silent mode
- `-v` for verbose debugging
- `-o` for output file
- `-N` to disable buffering for streaming

---

## 2. The mental model of an API request

An HTTP request usually has:

- a URL
- a method such as `GET` or `POST`
- headers
- optionally a body

For example:

~~~bash
curl http://127.0.0.1:8080/v1/models
~~~

With no `-X` and no body, curl sends a `GET` request.

By contrast, this:

~~~bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"...","messages":[{"role":"user","content":"hello"}]}'
~~~

is a `POST` request: curl switches to `POST` automatically whenever `-d` is present.

---

## 3. The endpoints you care about for vLLM

### List models

~~~bash
curl http://127.0.0.1:8080/v1/models
~~~

### Chat completions

~~~bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
    "messages": [
      {"role":"user","content":"hello"}
    ]
  }'
~~~

### Legacy completions

~~~bash
curl http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
    "prompt": "hello",
    "max_tokens": 100
  }'
~~~

---

## 4. Headers: what they do

### `Content-Type`

Required when POSTing JSON. Without it, curl defaults to `Content-Type: application/x-www-form-urlencoded`, and many servers will refuse to parse the body.

~~~bash
-H "Content-Type: application/json"
~~~

### `Authorization`

Needed only if your server requires an API key.

~~~bash
-H "Authorization: Bearer $VLLM_API_KEY"
~~~

### `Accept`

Usually optional.

~~~bash
-H "Accept: application/json"
~~~

### `User-Agent`

Usually optional, useful for logs.

~~~bash
-H "User-Agent: local-client"
~~~

### Typical minimal header set

Without API key:

~~~bash
-H "Content-Type: application/json"
~~~

With API key:

~~~bash
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY"
~~~

---

## 5. The request body

For chat, the body usually contains:

- `model`
- `messages`
- optional generation settings like `max_tokens`, `temperature`, `stream`

Example:

~~~json
{
  "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
  "messages": [
    {"role":"system","content":"Answer briefly."},
    {"role":"user","content":"Say hi."}
  ],
  "max_tokens": 50,
  "temperature": 0.7
}
~~~

Important: most servers reject a request without a `model` field, even when they only serve one model.
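Before sending a hand-written body, it is worth validating it locally. This sketch writes the example above to a file (named `body.json` here) and checks it with Python's stdlib JSON parser, so no extra tools are needed:

~~~bash
# <<'EOF' (quoted heredoc) disables shell expansion, so the JSON's
# own quotes survive untouched.
cat > body.json <<'EOF'
{
  "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
  "messages": [
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "Say hi."}
  ],
  "max_tokens": 50,
  "temperature": 0.7
}
EOF

# Validate with Python's stdlib parser -- fails loudly on a typo.
python3 -m json.tool body.json > /dev/null && echo "body.json is valid JSON"
~~~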

---

## 6. Your first useful commands

### Check if server is alive

~~~bash
curl http://127.0.0.1:8080/v1/models
~~~

### Send a chat request

~~~bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
    "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
    "messages": [
      {"role":"user","content":"hello"}
    ],
    "max_tokens": 100
  }'
~~~

### Extract only the text with jq

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
    "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
    "messages": [
      {"role":"user","content":"hello"}
    ],
    "max_tokens": 100
  }' | jq -r '.choices[0].message.content'
~~~

---

## 7. Useful flags you should actually know

### `-s` silent mode

Removes the progress meter. Note that `-s` also hides error messages; `-sS` stays quiet but still shows errors.

~~~bash
curl -s http://127.0.0.1:8080/v1/models
~~~

### `-v` verbose mode

Shows request and response headers. Great for debugging.

~~~bash
curl -v http://127.0.0.1:8080/v1/models
~~~

### `-i` include response headers

Prints headers in the output.

~~~bash
curl -i http://127.0.0.1:8080/v1/models
~~~

### `-X` specify method

Usually unnecessary if `-d` is used, but explicit is fine.

~~~bash
curl -X POST http://127.0.0.1:8080/v1/chat/completions ...
~~~

### `-d` send body data

~~~bash
-d '{"key":"value"}'
~~~

### `-o` write response to file

~~~bash
curl -o response.json http://127.0.0.1:8080/v1/models
~~~

### `-O` save using remote filename

More useful for file downloads than APIs.

### `-w` write timing info

~~~bash
curl -s -o /dev/null -w "time_total=%{time_total}\n" http://127.0.0.1:8080/v1/models
~~~

### `--max-time`

Abort if request takes too long.

~~~bash
curl --max-time 30 http://127.0.0.1:8080/v1/models
~~~

### `--retry`

Retry on transient failures.

~~~bash
curl --retry 3 http://127.0.0.1:8080/v1/models
~~~

### `-N` no buffering

Useful for streaming token responses.

~~~bash
curl -N ...
~~~

---

## 8. How quoting works

This part trips people constantly.

### Single quotes are safest for JSON

~~~bash
-d '{
  "model": "m",
  "messages": [{"role":"user","content":"hello"}]
}'
~~~

Single quotes pass the JSON through untouched: no variable expansion, no word splitting, and the inner double quotes survive.

### Double quotes allow variable expansion

~~~bash
-H "Authorization: Bearer $VLLM_API_KEY"
~~~

Use double quotes when you want `$VARIABLE` to expand.

### Bad pattern

~~~bash
-d "{
  "model": "m"
}"
~~~

This breaks because the inner double quotes pair up with the outer ones: the shell strips the JSON's own quotes and sends `{ model: m }`, which is not valid JSON.
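A quick way to see the difference is to run the same string through both quoting styles (`KEY` here is just a throwaway variable for the demo):

~~~bash
KEY=secret123

# Single quotes: the shell passes the text through literally.
echo 'Bearer $KEY'    # prints: Bearer $KEY

# Double quotes: the shell expands $KEY first.
echo "Bearer $KEY"    # prints: Bearer secret123
~~~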

---

## 9. Sending JSON from a file

This is much cleaner for bigger requests.

Create `req.json`:

~~~json
{
  "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
  "messages": [
    {"role":"user","content":"Explain transformers simply."}
  ],
  "max_tokens": 200
}
~~~

Send it:

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json
~~~

The `@file` syntax tells curl to read the body from a file.

---

## 10. Pretty-printing JSON

Use `jq`.

### Whole response

~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~

### Only assistant text

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json | jq -r '.choices[0].message.content'
~~~

### Token usage

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json | jq '.usage'
~~~

### Completion token count only

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json | jq '.usage.completion_tokens'
~~~

---

## 11. Measuring latency and tokens/sec

### Measure total request time

~~~bash
curl -s -o response.json \
  -w "time_total=%{time_total}\n" \
  http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json
~~~

Then inspect token count:

~~~bash
jq '.usage.completion_tokens' response.json
~~~

### Compute tokens/sec manually

If `completion_tokens = 182` and `time_total = 40`, then:

~~~text
182 / 40 = 4.55 tokens/sec
~~~

### Quick shell version

~~~bash
TOKENS=$(jq '.usage.completion_tokens' response.json)
TIME=40
python -c "print($TOKENS/$TIME)"
~~~
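The arithmetic above can be wrapped in a small helper. This is a sketch that assumes `response.json` contains the usual OpenAI-style `usage` object and that you pass in the `time_total` curl reported:

~~~bash
# tok_per_sec SECONDS -- divide completion_tokens from response.json
# by the wall time curl measured. Uses Python's stdlib JSON parser.
tok_per_sec() {
  tokens=$(python3 -c "import json; print(json.load(open('response.json'))['usage']['completion_tokens'])")
  python3 -c "print(f'{$tokens / $1:.2f} tokens/sec')"
}

# Example: tok_per_sec 40   (with 182 completion tokens -> 4.55 tokens/sec)
~~~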

---

## 12. Streaming responses

If the server supports streaming, add `"stream": true`.

~~~bash
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
    "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
    "messages": [
      {"role":"user","content":"Tell me a short story."}
    ],
    "stream": true
  }'
~~~

Why `-N` matters: otherwise curl buffers output and your streaming looks dead.
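The stream arrives as server-sent events: lines of the form `data: {...}`, terminated by `data: [DONE]`. A small stdlib-Python filter (a sketch, no jq needed) pulls out just the token deltas:

~~~bash
# sse_text: read SSE lines on stdin, print only the delta text.
sse_text() {
  python3 -c '
import json, sys
for line in sys.stdin:
    line = line.strip()
    if not line.startswith("data:"):
        continue
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        break
    # Role-only chunks have no "content"; .get() skips them.
    delta = json.loads(payload)["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)
print()
'
}

# Usage: pipe the streaming curl command above into it:
#   curl -N ... -d '{... "stream": true}' | sse_text
~~~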

---

## 13. Common generation options in the JSON body

### `max_tokens`

Maximum generated tokens.

~~~json
"max_tokens": 200
~~~

### `temperature`

Randomness. Lower is more deterministic.

~~~json
"temperature": 0.7
~~~

### `top_p`

Nucleus sampling.

~~~json
"top_p": 0.9
~~~

### `stream`

Whether to stream partial output.

~~~json
"stream": true
~~~

### `stop`

Stop sequences.

~~~json
"stop": ["</think>", "\n\nUser:"]
~~~

### Example

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
    "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
    "messages": [
      {"role":"user","content":"Write a haiku about GPUs."}
    ],
    "max_tokens": 60,
    "temperature": 0.8,
    "top_p": 0.95
  }'
~~~

---

## 14. Dealing with reasoning output like `<think>...</think>`

Some models emit reasoning traces.

### Tell the model not to output them

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
    "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
    "messages": [
      {"role":"system","content":"Do not output reasoning or thinking traces. Provide only the final answer."},
      {"role":"user","content":"hello"}
    ]
  }'
~~~

### Or strip them out after the fact

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json \
| jq -r '.choices[0].message.content' \
| perl -0pe 's/<think>.*?<\/think>\s*//sg'
~~~

---

## 15. Common debugging patterns

### Is the server alive?

~~~bash
curl -v http://127.0.0.1:8080/v1/models
~~~

### Is auth failing?

Look for:

~~~text
HTTP/1.1 401 Unauthorized
~~~

### Is the model name wrong?

Look for:

~~~text
"The model `...` does not exist."
~~~

Then inspect:

~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~

### Is the request body malformed?

If JSON is bad, the server may return `400` or complain it cannot parse the body.

Use a JSON file and validate it:

~~~bash
jq . req.json
~~~

If `jq` can parse it, it is valid JSON.

### Is the request hanging?

Try a lightweight endpoint:

~~~bash
curl -s http://127.0.0.1:8080/v1/models
~~~

If that returns fast but generation hangs, the server is alive and inference is the issue.
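For scripts and cron jobs, the alive-check can be collapsed into an exit-code test: `-f` makes curl return nonzero on HTTP errors, and `--max-time` bounds the wait. A sketch (adjust the base URL to your setup):

~~~bash
# Returns 0 if the server answers /models with a 2xx, nonzero otherwise.
check_llm() {
  curl -fs --max-time 5 "$1/models" > /dev/null 2>&1
}

if check_llm "http://127.0.0.1:8080/v1"; then
  echo "server up"
else
  echo "server down"
fi
~~~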

---

## 16. HTTP status codes you will actually see

### `200 OK`

Everything fine.

### `400 Bad Request`

Your request body is malformed or missing required fields.

### `401 Unauthorized`

Missing or wrong API key.

### `404 Not Found`

Wrong endpoint or wrong model name.

### `500 Internal Server Error`

Server-side crash or internal failure.
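To branch on these codes in a script, capture just the status with `-w "%{http_code}"` (curl prints `000` when the connection itself fails):

~~~bash
# -o /dev/null discards the body; -w prints only the status code.
status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
  http://127.0.0.1:8080/v1/models)

case "$status" in
  200) echo "ok" ;;
  400) echo "malformed request body" ;;
  401) echo "missing or wrong API key" ;;
  404) echo "wrong endpoint or model name" ;;
  000) echo "no connection -- is the server running?" ;;
  *)   echo "unexpected status: $status" ;;
esac
~~~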

---

## 17. A few complete examples

### Example A: simple health check

~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~

### Example B: simple chat

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
    "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
    "messages": [{"role":"user","content":"hello"}]
  }' | jq -r '.choices[0].message.content'
~~~

### Example C: use request file

`req.json`:

~~~json
{
  "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
  "messages": [
    {"role":"user","content":"Explain attention in one paragraph."}
  ],
  "max_tokens": 150
}
~~~

Command:

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json | jq -r '.choices[0].message.content'
~~~

### Example D: verbose debug

~~~bash
curl -v http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json
~~~

### Example E: timed request

~~~bash
curl -s -o response.json \
  -w "time_total=%{time_total}\n" \
  http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json

jq '.usage' response.json
~~~

---

## 18. Shell shortcuts that make life less cursed

### Save the base URL

~~~bash
export BASE_URL=http://127.0.0.1:8080/v1
~~~

Then:

~~~bash
curl -s "$BASE_URL/models"
~~~

### Save the model name

~~~bash
export MODEL='models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k'
~~~

Then:

~~~bash
curl -s "$BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\": [{\"role\":\"user\",\"content\":\"hello\"}]
  }"
~~~

This works, but quoting gets uglier fast. Files are often cleaner.
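One way to keep JSON out of the shell entirely is to let Python build the body and pipe it to curl with `-d @-` (read the body from stdin). This `chat_body` helper is a sketch that reuses the `MODEL` variable from above:

~~~bash
# Print a well-formed chat body; Python handles all JSON escaping,
# so the prompt may contain quotes, newlines, anything.
chat_body() {
  python3 -c '
import json, sys
print(json.dumps({
    "model": sys.argv[1],
    "messages": [{"role": "user", "content": sys.argv[2]}],
}))
' "$MODEL" "$1"
}

# Usage:
#   chat_body 'say "hi" and explain why' \
#     | curl -s "$BASE_URL/chat/completions" \
#         -H "Content-Type: application/json" \
#         -H "Authorization: Bearer $VLLM_API_KEY" \
#         -d @-
~~~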

---

## 19. Handy aliases

### Alias for model list

~~~bash
alias llm-models='curl -s http://127.0.0.1:8080/v1/models | jq'
~~~

### Alias for basic chat via file

~~~bash
alias llm-chat='curl -s http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer $VLLM_API_KEY" -d @req.json | jq -r ".choices[0].message.content"'
~~~

---

## 20. Common mistakes

### Wrong model string

You may send a filesystem path while the server expects a registered model ID, or vice versa.

Always check:

~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~

### Missing `Content-Type`

Then the server may not parse your JSON correctly.

### Missing API key header

Then you get `401`.

### Invalid JSON because of shell quoting

Use single quotes or `-d @req.json`.

### Forgetting `-N` when streaming

Then it looks frozen.

### Confusing shell variables with JSON strings

This expands:

~~~bash
-H "Authorization: Bearer $VLLM_API_KEY"
~~~

This does not:

~~~bash
-H 'Authorization: Bearer $VLLM_API_KEY'
~~~

because single quotes block variable expansion.

---

## 21. Minimal cheat sheet

### Health check

~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~

### Simple chat

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
    "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
    "messages": [{"role":"user","content":"hello"}]
  }' | jq -r '.choices[0].message.content'
~~~

### Debug headers

~~~bash
curl -v http://127.0.0.1:8080/v1/models
~~~

### Timed request

~~~bash
curl -s -o /dev/null -w "time_total=%{time_total}\n" http://127.0.0.1:8080/v1/models
~~~

### Request from file

~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json
~~~

### Stream output

~~~bash
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
    "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
    "messages": [{"role":"user","content":"hello"}],
    "stream": true
  }'
~~~

---

## 22. Final practical advice

For local LLM servers, the most reliable pattern is:

1. put request JSON in a file
2. call `curl -s ... -d @req.json`
3. pipe to `jq -r '.choices[0].message.content'`

That avoids a lot of shell quoting nonsense and makes debugging way less cursed.

A very sane workflow looks like this:

~~~bash
cat > req.json <<'EOF'
{
  "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
  "messages": [
    {"role":"user","content":"Explain KV cache simply."}
  ],
  "max_tokens": 200
}
EOF

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d @req.json | jq -r '.choices[0].message.content'
~~~