[Llama.cpp] api_like_OAI.py
===
###### tags: `LLM`
###### tags: `ML`, `NLP`, `NLU`, `Llama`, `CodeLlama`, `GGML`, `GGUF`
<br>
[TOC]
<br>
:::warning
:bulb: [**Note**] To wire the two sides together you need to understand:
1. the parameters the client sends with its request
2. the parameter settings of the backend Llama server
:::
## [[code] llama.cpp/examples/server/api_like_OAI.py](https://github.com/ggerganov/llama.cpp/blob/e9c1cecb9d7d743d30b4a29ecd56a411437def0a/examples/server/api_like_OAI.py)
## Architecture
![architecture](https://hackmd.io/_uploads/SkHusOj7p.png)
<br>
## Usages
### server-side
```
$ python api_like_OAI.py \
--llama-api http://10.78.26.241:40081 \
--host 0.0.0.0 \
--port 40082
```
- #### The `--llama-api` value must include the `http://` scheme, otherwise the error below occurs
- server error log
```
[2023-11-07 18:34:58,486] ERROR in app: Exception on /completions [POST]
...
requests.exceptions.MissingSchema: Invalid URL '/completion': No schema supplied. Perhaps you meant http:///completion?
10.78.153.144 - - [07/Nov/2023 18:34:58] "POST /completions HTTP/1.1" 500 -
```
- client error log
```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
```
<br>
### client-side
```
$ curl -X POST http://10.78.26.241:40082/completions \
-H 'Content-Type: application/json' \
-d '{"n_predict":400,"prompt":"Nice to meet you!\n\nHuman: hi\nAssistant:"}'
```
- The `-d`/`--data` payload can be read via `flask.request.get_json()`
- #### The client request must include `-H 'Content-Type: application/json'`
- server error log
```
[debug] body: None <-- body not captured
[2023-11-07 18:37:28,264] ERROR in app: Exception on /completions [POST]
File ".../api_like_OAI.py", line 28, in is_present
buf = json[key]
TypeError: 'NoneType' object is not subscriptable
10.78.153.144 - - [07/Nov/2023 18:37:28] "POST /completions HTTP/1.1" 500 -
```
- client error log
```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
```
- Testing
- :warning: If the client does not set Content-Type,
the server falls back to the default ==**`application/x-www-form-urlencoded`**==,
`request.get_json()` then returns None,
and indexing into that None (expected type: dict) raises `'NoneType' object is not subscriptable`
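The failure mode above can be reproduced without any llama.cpp server. Below is a minimal Flask sketch (illustrative, not the actual `api_like_OAI.py` handler; the route name and the 415 guard are assumptions) showing how a missing Content-Type header makes `request.get_json()` return None:

```python
# Minimal sketch of the Content-Type pitfall (not the real api_like_OAI.py code).
from flask import Flask, request

app = Flask(__name__)

@app.route("/completions", methods=["POST"])
def completions():
    # get_json() only parses the body when the request declares application/json;
    # silent=True suppresses the error Flask would otherwise raise and yields None.
    body = request.get_json(silent=True)
    if body is None:
        return {"error": "expected Content-Type: application/json"}, 415
    return {"prompt": body.get("prompt", "")}

# Exercise both cases with Flask's built-in test client (no live server needed).
client = app.test_client()
ok = client.post("/completions", json={"prompt": "hi"})     # sets the JSON header
bad = client.post("/completions", data='{"prompt": "hi"}')  # no JSON Content-Type
print(ok.status_code, bad.status_code)
```

Guarding with `silent=True` (or checking `request.is_json`) turns the opaque 500 into an explicit 415 for the client.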
<br>
<hr>
<br>
## Llama-cpp APIs
### URL path
- ### If the path does not exist, the server returns:
File Not Found
<br>
### `/tokenize`
- demo1
`$ curl -X POST http://10.78.26.241:40081/tokenize -d '{"content": "Nice to meet you!\n\nHuman: hi\nAssistant:"}'`
- the data must carry a `content` key
- Result:
```json
{"tokens":[20103,304,5870,366,29991,13,13,29950,7889,29901,7251,13,7900,22137,29901]}
```
- demo2
`$ curl -X POST http://10.78.26.241:30080/tokenize -d '{"content": " I am a bit confused about your question. What do"}'`
<br>
### `/completion` (non-stream mode)
- request
`curl -X POST http://10.78.26.241:40081/completion -d '{"n_predict":200, "prompt": "Hello!"}'`
- the data must carry a `prompt` key
- response
:::spoiler json data
```json=
{
"content":" I'm excited to be here and share my experiences with you. nobody knows anything, so let's have some fun! 😃\n18 / 🌱 Feminist 🏳️🌈 Vegan 🧖♀️ Artist 💫",
"generation_settings":{
"frequency_penalty":0.0,
"grammar":"",
"ignore_eos":false,
"logit_bias":[
],
"mirostat":0,
"mirostat_eta":0.10000000149011612,
"mirostat_tau":5.0,
"model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf",
"n_ctx":512,
"n_keep":0,
"n_predict":200,
"n_probs":0,
"penalize_nl":true,
"presence_penalty":0.0,
"repeat_last_n":64,
"repeat_penalty":1.100000023841858,
"seed":4294967295,
"stop":[
],
"stream":false,
"temp":0.800000011920929,
"tfs_z":1.0,
"top_k":40,
"top_p":0.949999988079071,
"typical_p":1.0
},
"model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf",
"prompt":"Hello!",
"slot_id":0,
"stop":true,
"stopped_eos":true,
"stopped_limit":false,
"stopped_word":false,
"stopping_word":"",
"timings":{
"predicted_ms":7999.853,
"predicted_n":71,
"predicted_per_second":8.875163081121615,
"predicted_per_token_ms":112.67398591549296,
"prompt_ms":291.459,
"prompt_n":3,
"prompt_per_second":10.29304293228207,
"prompt_per_token_ms":97.153
},
"tokens_cached":74,
"tokens_evaluated":3,
"tokens_predicted":71,
"truncated":false
}
```
:::
<br>
### `/completion` (stream mode)
- ### non-stream vs stream


- non-stream vs stream data: the only differences are `content` and the timing figures
- the stream data here is taken from the final chunk
- ### curl
- request
`$ curl -X POST http://10.78.26.241:40081/completion -d '{"n_predict":200, "prompt": "Hello!", "stream": true}'`
- response
```
data: {"content":" I","multimodal":false,"slot_id":0,"stop":false}
data: {"content":"'","multimodal":false,"slot_id":0,"stop":false}
data: {"content":"m","multimodal":false,"slot_id":0,"stop":false}
data: {"content":" a","multimodal":false,"slot_id":0,"stop":false}
data: {"content":" fre","multimodal":false,"slot_id":0,"stop":false}
...
data: {"content":"","multimodal":false,"slot_id":0,"stop":false}
data: {"content":"","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf","n_ctx":512,"n_keep":0,"n_predict":200,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":true,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf","prompt":"Hello!","slot_id":0,"stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":22545.974,"predicted_n":181,"predicted_per_second":8.028040837801019,"predicted_per_token_ms":124.56339226519336,"prompt_ms":269.057,"prompt_n":3,"prompt_per_second":11.150053706092017,"prompt_per_token_ms":89.68566666666668},"tokens_cached":184,"tokens_evaluated":3,"tokens_predicted":181,"truncated":false}
```
- :::spoiler final-chunk json data
```json=
{
"content":"",
"generation_settings":{
"frequency_penalty":0.0,
"grammar":"",
"ignore_eos":false,
"logit_bias":[
],
"mirostat":0,
"mirostat_eta":0.10000000149011612,
"mirostat_tau":5.0,
"model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf",
"n_ctx":512,
"n_keep":0,
"n_predict":200,
"n_probs":0,
"penalize_nl":true,
"presence_penalty":0.0,
"repeat_last_n":64,
"repeat_penalty":1.100000023841858,
"seed":4294967295,
"stop":[
],
"stream":true,
"temp":0.800000011920929,
"tfs_z":1.0,
"top_k":40,
"top_p":0.949999988079071,
"typical_p":1.0
},
"model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf",
"prompt":"Hello!",
"slot_id":0,
"stop":true,
"stopped_eos":true,
"stopped_limit":false,
"stopped_word":false,
"stopping_word":"",
"timings":{
"predicted_ms":22545.974,
"predicted_n":181,
"predicted_per_second":8.028040837801019,
"predicted_per_token_ms":124.56339226519336,
"prompt_ms":269.057,
"prompt_n":3,
"prompt_per_second":11.150053706092017,
"prompt_per_token_ms":89.68566666666668
},
"tokens_cached":184,
"tokens_evaluated":3,
"tokens_predicted":181,
"truncated":false
}
```
:::
- :warning: Note:
- `"content":""`
- `"stream":true`
- ### python
- request
```python=
import requests

USER_PROMPT = 'Hello'
url = 'http://10.78.26.241:40081/completion'
headers = {}
body = {
    "n_predict": 5,
    "prompt": f'{USER_PROMPT}',
    "stream": True
}
res = requests.post(url=url, headers=headers, json=body, stream=True)
print(res)
for line in res.iter_lines():
    print(line)
```
- `stream` must be set to True in `post()`
- response
```
<Response [200]>
b'data: {"content":" and","multimodal":false,"slot_id":0,"stop":false}'
b''
b'data: {"content":" welcome","multimodal":false,"slot_id":0,"stop":false}'
b''
b'data: {"content":" to","multimodal":false,"slot_id":0,"stop":false}'
b''
b'data: {"content":" my","multimodal":false,"slot_id":0,"stop":false}'
b''
b'data: {"content":" website","multimodal":false,"slot_id":0,"stop":false}'
b''
b'data: {"content":"!","multimodal":false,"slot_id":0,"stop":false}'
b''
b'data: {"content":"","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf","n_ctx":512,"n_keep":0,"n_predict":5,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":true,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf","prompt":"Hello","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":469.622,"predicted_n":5,"predicted_per_second":10.646860666663827,"predicted_per_token_ms":93.9244,"prompt_ms":234.539,"prompt_n":2,"prompt_per_second":8.527366450782173,"prompt_per_token_ms":117.2695},"tokens_cached":7,"tokens_evaluated":2,"tokens_predicted":5,"truncated":false}'
b''
```
- How the finish reason is determined
`finish_reason = "stop" if (data["stopped_eos"] or data["stopped_word"]) else "length_limit"`
- "stop":true,
- "stopped_eos":false,
- "stopped_limit":true,
- "stopped_word":false,
- "stopping_word":""
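Putting the two pieces together, the `data: {...}` lines above can be parsed and the finish reason derived with plain stdlib code. A minimal sketch (the helper names are hypothetical; it assumes the `data: ` prefix and the boolean fields shown in the captures above):

```python
import json

def parse_sse_line(raw: bytes):
    """Parse one b'data: {...}' line from the stream; return a dict, or None for blanks."""
    line = raw.decode("utf-8").strip()
    if not line.startswith("data: "):
        return None
    return json.loads(line[len("data: "):])

def finish_reason(data: dict) -> str:
    # Same rule as quoted above: EOS or a stop word means "stop", else the length limit.
    return "stop" if (data["stopped_eos"] or data["stopped_word"]) else "length_limit"

# Example using lines shaped like the captured output above.
chunk = parse_sse_line(b'data: {"content":" and","multimodal":false,"slot_id":0,"stop":false}')
final = {"stop": True, "stopped_eos": False, "stopped_limit": True, "stopped_word": False}
print(chunk["content"], finish_reason(final))
```

With the captured run above (`stopped_limit: true`), this yields `length_limit`, matching `n_predict` being hit before EOS.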
<br>
<hr>
<br>
## OpenAI APIs
> https://platform.openai.com/docs/api-reference
- [The chat completion chunk object](https://platform.openai.com/docs/api-reference/chat/streaming)

```json
{"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
{"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
{"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
....
{"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{"content":" today"},"finish_reason":null}]}
{"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{"content":"?"},"finish_reason":null}]}
{"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
```
- first chunk: `"delta":{"role":"assistant","content":""}`
- middle chunks: `"delta":{"content":"Hello"},"finish_reason":null}`
- last chunk: `"delta":{},"finish_reason":"stop"`
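Client-side, reassembling the streamed reply amounts to concatenating the `delta` contents until a non-null `finish_reason` arrives. A minimal sketch over chunk objects shaped like the ones above (the helper name is an assumption, not part of the OpenAI SDK):

```python
def assemble_chunks(chunks):
    """Concatenate delta contents from chat.completion.chunk objects."""
    text, reason = "", None
    for chunk in chunks:
        choice = chunk["choices"][0]
        # First chunk carries the role, middle chunks carry content, last carries neither.
        text += choice["delta"].get("content", "")
        reason = choice["finish_reason"] or reason
    return text, reason

chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "!"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
print(assemble_chunks(chunks))
```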
<br>
<hr>
<br>
## AFS APIs (FFM APIs)
> - #### AFS = AI Foundry Service
> - [[official site] What is AFS](https://tws.twcc.ai/afs-1/)
> Low cost, low barrier to entry, high efficiency, high security
> A one-stop generative-AI solution that helps enterprises build their own large models
> - #### FFM = Formosa Foundation Model
> 
>
- [[official docs] Text generation API and parameters](https://docs.twcc.ai/en/docs/concepts-tutorials/twcc/afs/tutorials/text-generation-api/)
- [[HackMD] AFS API documentation](https://hackmd.io/D6hChs1HRxyOaLul_HgirQ)
<br>
<hr>
<br>
## docs
### Default arguments
- ### Defaults
- chat_prompt='A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.\\n'
- user_name="\\nUSER: "
- ai_name="\\nASSISTANT: "
- system_name="\\nASSISTANT's RULE: "
- stop="</s>"
- llama_api='http://127.0.0.1:8080'
- api_key=""
- host='127.0.0.1'
- port=8081
- ### Notes
- requests hitting host:port (127.0.0.1:8081) are forwarded to llama_api (http://127.0.0.1:8080)
<br>
### completion()
How it works:
1. Set up the route path to receive requests
2. If the server-agent side has an api_key configured, validate the request's headers["Authorization"]
3. Extract the stream and tokenize parameters
4. If the tokenize parameter is enabled, call server_ip/tokenize to get the tokenized result
5. If the stream parameter is disabled, call server_ip/completion to get the assistant's answer in one response
6. If the stream parameter is enabled, call server_ip/completion and stream the assistant's answer back
- usages
- `/completions` (note the trailing s), carrying tokenize info:
```
$ curl -X POST http://10.78.26.241:40082/completions \
-H 'Content-Type: application/json' \
-d '{"n_predict":400,"prompt":"Nice to meet you!\n\nHuman: hi\nAssistant:", "tokenize": true}'
```
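Steps 3–6 above boil down to a small dispatch on the request body. A sketch of which backend endpoints get called (the helper is hypothetical; the real script inlines this logic and performs the actual HTTP calls):

```python
def plan_calls(body: dict) -> list:
    """Sketch: which llama.cpp endpoints the agent hits for one /completions request."""
    calls = []
    if body.get("tokenize", False):
        calls.append("/tokenize")    # step 4: optional tokenization pass
    calls.append("/completion")      # steps 5-6: same endpoint; 'stream' is forwarded
    return calls

print(plan_calls({"prompt": "hi", "tokenize": True, "stream": True}))
```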
<br>
### completion() vs chat_completion()
![completion() vs chat_completion()](https://hackmd.io/_uploads/rJ1Z6ot4a.png)
<br>
### make_resData() vs make_resData_stream()
![make_resData() vs make_resData_stream()](https://hackmd.io/_uploads/H1XZLgcET.png)
<br>
### make_postData()
- Most parameters map 1-to-1
- `logit_bias` parameter
- {k1:v1, k2:v2, k3:v3} is converted to [[k1, v1], [k2, v2], [k3, v3]]
- `stop` parameter
- comes from args.stop
- external stop values are appended to it
- internal parameters
- n_keep = -1
- stream = stream
- cache_prompt = True
- slot_id = -1
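The `logit_bias` reshaping above can be sketched as follows (the function name is hypothetical; the real script does this inline with a comprehension):

```python
def convert_logit_bias(bias: dict) -> list:
    """Turn an OpenAI-style {token_id: bias} dict into llama.cpp's [[id, bias], ...] pairs."""
    return [[int(token), value] for token, value in bias.items()]

# Token ids here are arbitrary examples, not real vocabulary entries.
print(convert_logit_bias({"15043": 5.0, "3782": -5.0}))
```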
<br>
### make_postData()::convert_chat()
> A narration preamble plus three roles
>
```python=
chat_prompt = 'A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.\\n'
user_name = "\\nUSER: "
ai_name = "\\nASSISTANT: "
system_name = "\\nASSISTANT's RULE: "
session_stop = "</s>"

def convert_chat(messages):
    prompt = "" + chat_prompt.replace("\\n", "\n")
    system_n = system_name.replace("\\n", "\n")
    user_n = user_name.replace("\\n", "\n")
    ai_n = ai_name.replace("\\n", "\n")
    stop = session_stop.replace("\\n", "\n")

    for line in messages:
        if (line["role"] == "system"):
            prompt += f"{system_n}{line['content']}"
        if (line["role"] == "user"):
            prompt += f"{user_n}{line['content']}"
        if (line["role"] == "assistant"):
            prompt += f"{ai_n}{line['content']}{stop}"
    prompt += ai_n.rstrip()
    return prompt

messages = [
    {"role":"system", "content":"You are an assistant."},
    {"role":"user", "content":"How are you?"},
    {"role":"assistant", "content":"I'm fine. And you?"},
    {"role":"user", "content":"Fine. Thanks."},
]
convert_chat(messages)
```
- **Result:**
> "A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.\n\nASSISTANT's RULE: You are an assistant.\nUSER: How are you?\nASSISTANT: I'm fine. And you?</s>\nUSER: Fine. Thanks.\nASSISTANT:"
- **Printed output:**
![printed prompt](https://hackmd.io/_uploads/HJqSuMcQa.png)
<br>