# curl for local LLM APIs: a practical manual
This guide is about using `curl` to talk to an OpenAI-compatible local server such as vLLM.
Assume your server is:
- Base URL: `http://127.0.0.1:8080/v1`
- Model name: `models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k`
- Optional API key: `$VLLM_API_KEY`
---
## 1. What curl is
`curl` sends HTTP requests from the terminal.
General form:
~~~bash
curl [options] URL
~~~
For APIs, the options you use most are:
- `-H` for headers
- `-d` for request body
- `-X` for request method
- `-s` for silent mode
- `-v` for verbose debugging
- `-o` for output file
- `-N` to disable buffering for streaming
---
## 2. The mental model of an API request
An HTTP request usually has:
- a URL
- a method such as `GET` or `POST`
- headers
- optionally a body
For example:
~~~bash
curl http://127.0.0.1:8080/v1/models
~~~
This is a `GET` request.
This:
~~~bash
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"...","messages":[{"role":"user","content":"hello"}]}'
~~~
is a `POST` request because `-d` is present.
---
## 3. The endpoints you care about for vLLM
### List models
~~~bash
curl http://127.0.0.1:8080/v1/models
~~~
### Chat completions
~~~bash
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"hello"}
]
}'
~~~
### Legacy completions
~~~bash
curl http://127.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"prompt": "hello",
"max_tokens": 100
}'
~~~
---
## 4. Headers: what they do
### `Content-Type`
Almost always needed for POSTing JSON.
~~~bash
-H "Content-Type: application/json"
~~~
### `Authorization`
Needed only if your server requires an API key.
~~~bash
-H "Authorization: Bearer $VLLM_API_KEY"
~~~
### `Accept`
Usually optional.
~~~bash
-H "Accept: application/json"
~~~
### `User-Agent`
Usually optional, useful for logs.
~~~bash
-H "User-Agent: local-client"
~~~
### Typical minimal header set
Without API key:
~~~bash
-H "Content-Type: application/json"
~~~
With API key:
~~~bash
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY"
~~~
---
## 5. The request body
For chat, the body usually contains:
- `model`
- `messages`
- optional generation settings like `max_tokens`, `temperature`, `stream`
Example:
~~~json
{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"system","content":"Answer briefly."},
{"role":"user","content":"Say hi."}
],
"max_tokens": 50,
"temperature": 0.7
}
~~~
Important: most servers require the `model` field even when only one model is loaded.
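A cheap way to catch body mistakes early is to parse the JSON locally before curl ever sees it. This sketch assumes `python3` is installed; `jq .` does the same job:

```shell
# Pipe the body through python3's JSON parser before trusting it in a request.
# A quoted string keeps the shell from touching the JSON.
printf '%s' '{
  "model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
  "messages": [
    {"role":"system","content":"Answer briefly."},
    {"role":"user","content":"Say hi."}
  ],
  "max_tokens": 50,
  "temperature": 0.7
}' | python3 -m json.tool > /dev/null && echo "valid JSON"
```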
---
## 6. Your first useful commands
### Check if server is alive
~~~bash
curl http://127.0.0.1:8080/v1/models
~~~
### Send a chat request
~~~bash
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"hello"}
],
"max_tokens": 100
}'
~~~
### Extract only the text with jq
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"hello"}
],
"max_tokens": 100
}' | jq -r '.choices[0].message.content'
~~~
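The `.choices[0].message.content` path is the standard shape of an OpenAI-style chat response, so you can try the filter on a canned response without any server running (assumes `jq` is installed):

```shell
# A minimal chat-completion response, piped through the same jq filter
# used above to pull out only the assistant text.
echo '{"choices":[{"message":{"role":"assistant","content":"hi"}}]}' \
  | jq -r '.choices[0].message.content'
# prints: hi
```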
---
## 7. Useful flags you should actually know
### `-s` silent mode
Removes the progress meter. Note it also silences error messages; `-sS` (silent, but still show errors) is a good habit in scripts.
~~~bash
curl -s http://127.0.0.1:8080/v1/models
~~~
### `-v` verbose mode
Shows request and response headers. Great for debugging.
~~~bash
curl -v http://127.0.0.1:8080/v1/models
~~~
### `-i` include response headers
Prints headers in the output.
~~~bash
curl -i http://127.0.0.1:8080/v1/models
~~~
### `-X` specify method
Usually unnecessary if `-d` is used, but explicit is fine.
~~~bash
curl -X POST http://127.0.0.1:8080/v1/chat/completions ...
~~~
### `-d` send body data
~~~bash
-d '{"key":"value"}'
~~~
### `-o` write response to file
~~~bash
curl -o response.json http://127.0.0.1:8080/v1/models
~~~
### `-O` save using remote filename
More useful for file downloads than APIs.
### `-w` write timing info
~~~bash
curl -s -o /dev/null -w "time_total=%{time_total}\n" http://127.0.0.1:8080/v1/models
~~~
### `--max-time`
Abort if request takes too long.
~~~bash
curl --max-time 30 http://127.0.0.1:8080/v1/models
~~~
### `--retry`
Retry on transient failures.
~~~bash
curl --retry 3 http://127.0.0.1:8080/v1/models
~~~
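One more flag worth knowing in scripts: `-f`/`--fail` makes curl exit non-zero on HTTP 4xx/5xx responses, so failures become visible to `&&`/`||` and to `set -e`. A sketch (the messages are just examples):

```shell
# Without -f, curl exits 0 even when the server answers 401 or 500.
# With -f, HTTP errors (and connection failures) give a non-zero exit status.
curl -sf http://127.0.0.1:8080/v1/models > /dev/null \
  && echo "server healthy" \
  || echo "server not healthy"
```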
### `-N` no buffering
Useful for streaming token responses.
~~~bash
curl -N ...
~~~
---
## 8. How quoting works
This part trips people constantly.
### Single quotes are safest for JSON
~~~bash
-d '{
"model": "m",
"messages": [{"role":"user","content":"hello"}]
}'
~~~
Inside single quotes the shell expands nothing, so the JSON reaches curl exactly as written.
### Double quotes allow variable expansion
~~~bash
-H "Authorization: Bearer $VLLM_API_KEY"
~~~
Use double quotes when you want `$VARIABLE` to expand.
### Bad pattern
~~~bash
-d "{
"model": "m"
}"
~~~
This breaks because the inner double quotes terminate the outer string: the shell strips the quotes around the JSON keys, and the server receives invalid JSON like `{ model: m }`.
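You can see both quoting behaviors without touching the server at all:

```shell
# Single quotes: the shell passes the text through untouched.
# Double quotes: the shell substitutes the variable first.
VLLM_API_KEY=demo-key
echo 'Authorization: Bearer $VLLM_API_KEY'   # prints: Authorization: Bearer $VLLM_API_KEY
echo "Authorization: Bearer $VLLM_API_KEY"   # prints: Authorization: Bearer demo-key
```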
---
## 9. Sending JSON from a file
This is much cleaner for bigger requests.
Create `req.json`:
~~~json
{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"Explain transformers simply."}
],
"max_tokens": 200
}
~~~
Send it:
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json
~~~
The `@file` syntax tells curl to read the body from a file.
---
## 10. Pretty-printing JSON
Use `jq`.
### Whole response
~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~
### Only assistant text
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json | jq -r '.choices[0].message.content'
~~~
### Token usage
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json | jq '.usage'
~~~
### Completion token count only
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json | jq '.usage.completion_tokens'
~~~
---
## 11. Measuring latency and tokens/sec
### Measure total request time
~~~bash
curl -s -o response.json \
-w "time_total=%{time_total}\n" \
http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json
~~~
Then inspect token count:
~~~bash
jq '.usage.completion_tokens' response.json
~~~
### Compute tokens/sec manually
If `completion_tokens = 182` and `time_total = 40`, then:
~~~text
182 / 40 = 4.55 tokens/sec
~~~
### Quick shell version
~~~bash
TOKENS=$(jq '.usage.completion_tokens' response.json)
TIME=40   # the time_total value reported by the -w format above
python3 -c "print($TOKENS/$TIME)"
~~~
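If `python3` is not installed, `awk` does the same arithmetic:

```shell
# 182 completion tokens over 40 seconds, formatted to two decimals.
awk -v n=182 -v t=40 'BEGIN { printf "%.2f tokens/sec\n", n/t }'
# prints: 4.55 tokens/sec
```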
---
## 12. Streaming responses
If the server supports streaming, add `"stream": true`.
~~~bash
curl -N http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"Tell me a short story."}
],
"stream": true
}'
~~~
Why `-N` matters: otherwise curl buffers output and your streaming looks dead.
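The stream arrives as server-sent events: `data: {json}` lines, terminated by `data: [DONE]`. This sketch runs two canned lines through the kind of filter you would pipe curl's output into. The `delta` shape matches OpenAI-compatible servers, but verify it against your own server's output; it also assumes `jq` is installed.

```shell
# Simulated stream: strip the "data: " prefix, drop the [DONE] sentinel,
# and join the per-chunk delta text into one string.
printf 'data: {"choices":[{"delta":{"content":"Hel"}}]}\ndata: {"choices":[{"delta":{"content":"lo"}}]}\ndata: [DONE]\n' \
  | sed 's/^data: //' \
  | grep -v '^\[DONE\]$' \
  | jq -rj '.choices[0].delta.content // empty'
echo
# prints: Hello
```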
---
## 13. Common generation options in the JSON body
### `max_tokens`
Maximum generated tokens.
~~~json
"max_tokens": 200
~~~
### `temperature`
Randomness. Lower is more deterministic.
~~~json
"temperature": 0.7
~~~
### `top_p`
Nucleus sampling.
~~~json
"top_p": 0.9
~~~
### `stream`
Whether to stream partial output.
~~~json
"stream": true
~~~
### `stop`
Stop sequences.
~~~json
"stop": ["</think>", "\n\nUser:"]
~~~
### Example
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"Write a haiku about GPUs."}
],
"max_tokens": 60,
"temperature": 0.8,
"top_p": 0.95
}'
~~~
---
## 14. Dealing with reasoning output like `<think>...</think>`
Some models emit reasoning traces.
### Tell the model not to output them
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"system","content":"Do not output reasoning or thinking traces. Provide only the final answer."},
{"role":"user","content":"hello"}
]
}'
~~~
### Or strip them out after the fact
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json \
| jq -r '.choices[0].message.content' \
| perl -0pe 's/<think>.*?<\/think>\s*//sg'
~~~
---
## 15. Common debugging patterns
### Is the server alive?
~~~bash
curl -v http://127.0.0.1:8080/v1/models
~~~
### Is auth failing?
Look for:
~~~text
HTTP/1.1 401 Unauthorized
~~~
### Is the model name wrong?
Look for:
~~~text
"The model `...` does not exist."
~~~
Then inspect:
~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~
### Is the request body malformed?
If JSON is bad, the server may return `400` or complain it cannot parse the body.
Use a JSON file and validate it:
~~~bash
jq . req.json
~~~
If `jq` can parse it, it is valid JSON.
### Is the request hanging?
Try a lightweight endpoint:
~~~bash
curl -s http://127.0.0.1:8080/v1/models
~~~
If that returns fast but generation hangs, the server is alive and inference is the issue.
---
## 16. HTTP status codes you will actually see
### `200 OK`
Everything fine.
### `400 Bad Request`
Your request body is malformed or missing required fields.
### `401 Unauthorized`
Missing or wrong API key.
### `404 Not Found`
Wrong endpoint or wrong model name.
### `500 Internal Server Error`
Server-side crash or internal failure.
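In scripts it helps to turn the code into a message. `%{http_code}` is a standard curl write-out variable; `explain_status` below is a made-up helper name, not a standard command:

```shell
# Map the status codes above to short explanations.
explain_status() {
  case "$1" in
    200) echo "ok" ;;
    400) echo "malformed or incomplete request body" ;;
    401) echo "missing or wrong API key" ;;
    404) echo "wrong endpoint or model name" ;;
    500) echo "server-side failure" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# With a real request:
# explain_status "$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/v1/models)"
explain_status 401
# prints: missing or wrong API key
```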
---
## 17. A few complete examples
### Example A: simple health check
~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~
### Example B: simple chat
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [{"role":"user","content":"hello"}]
}' | jq -r '.choices[0].message.content'
~~~
### Example C: use request file
`req.json`:
~~~json
{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"Explain attention in one paragraph."}
],
"max_tokens": 150
}
~~~
Command:
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json | jq -r '.choices[0].message.content'
~~~
### Example D: verbose debug
~~~bash
curl -v http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json
~~~
### Example E: timed request
~~~bash
curl -s -o response.json \
-w "time_total=%{time_total}\n" \
http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json
jq '.usage' response.json
~~~
---
## 18. Shell shortcuts that make life less cursed
### Save the base URL
~~~bash
export BASE_URL=http://127.0.0.1:8080/v1
~~~
Then:
~~~bash
curl -s "$BASE_URL/models"
~~~
### Save the model name
~~~bash
export MODEL='models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k'
~~~
Then:
~~~bash
curl -s "$BASE_URL/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d "{
\"model\": \"$MODEL\",
\"messages\": [{\"role\":\"user\",\"content\":\"hello\"}]
}"
~~~
This works, but quoting gets uglier fast. Files are often cleaner.
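A middle ground between escaped inline JSON and a static file: an unquoted heredoc, which expands `$MODEL` while the JSON stays readable:

```shell
# Unlike <<'EOF', a bare <<EOF delimiter lets the shell expand variables
# inside the heredoc, so the model name is filled in for you.
export MODEL='models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k'
cat > req.json <<EOF
{
  "model": "$MODEL",
  "messages": [{"role":"user","content":"hello"}]
}
EOF
cat req.json
```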
---
## 19. Handy aliases
### Alias for model list
~~~bash
alias llm-models='curl -s http://127.0.0.1:8080/v1/models | jq'
~~~
### Alias for basic chat via file
~~~bash
alias llm-chat='curl -s http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer $VLLM_API_KEY" -d @req.json | jq -r ".choices[0].message.content"'
~~~
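Aliases only expand in interactive shells; in scripts, a function is the better tool and can take arguments. `mk_body` below is a hypothetical helper, and note it does not JSON-escape the prompt, so keep prompts free of double quotes (or build the body with `jq -n`, which escapes properly):

```shell
# Build a minimal chat body for an ad-hoc prompt. No JSON escaping is done,
# so a prompt containing double quotes would produce invalid JSON.
MODEL='models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k'
mk_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$MODEL" "$1"
}

# Usage (-d @- tells curl to read the body from stdin):
# mk_body "hello" | curl -s http://127.0.0.1:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d @- | jq -r '.choices[0].message.content'
mk_body "hello"
```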
---
## 20. Common mistakes
### Wrong model string
You may send a filesystem path while the server expects a registered model ID, or vice versa.
Always check:
~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~
### Missing `Content-Type`
Then the server may not parse your JSON correctly.
### Missing API key header
Then you get `401`.
### Invalid JSON because of shell quoting
Use single quotes or `-d @req.json`.
### Forgetting `-N` when streaming
Then it looks frozen.
### Confusing shell variables with JSON strings
This expands:
~~~bash
-H "Authorization: Bearer $VLLM_API_KEY"
~~~
This does not:
~~~bash
-H 'Authorization: Bearer $VLLM_API_KEY'
~~~
because single quotes block variable expansion.
---
## 21. Minimal cheat sheet
### Health check
~~~bash
curl -s http://127.0.0.1:8080/v1/models | jq
~~~
### Simple chat
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [{"role":"user","content":"hello"}]
}' | jq -r '.choices[0].message.content'
~~~
### Debug headers
~~~bash
curl -v http://127.0.0.1:8080/v1/models
~~~
### Timed request
~~~bash
curl -s -o /dev/null -w "time_total=%{time_total}\n" http://127.0.0.1:8080/v1/models
~~~
### Request from file
~~~bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json
~~~
### Stream output
~~~bash
curl -N http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d '{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [{"role":"user","content":"hello"}],
"stream": true
}'
~~~
---
## 22. Final practical advice
For local LLM servers, the most reliable pattern is:
1. put request JSON in a file
2. call `curl -s ... -d @req.json`
3. pipe to `jq -r '.choices[0].message.content'`
That avoids a lot of shell quoting nonsense and makes debugging way less cursed.
A very sane workflow looks like this:
~~~bash
cat > req.json <<'EOF'
{
"model": "models/khtsly/Qwen3.5-9B-Claude-4.6-Opus-Distilled-32k",
"messages": [
{"role":"user","content":"Explain KV cache simply."}
],
"max_tokens": 200
}
EOF
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VLLM_API_KEY" \
-d @req.json | jq -r '.choices[0].message.content'
~~~