[Llama.cpp] api_like_OAI.py

[Llama.cpp] api_like_OAI.py === ###### tags: `LLM` ###### tags: `ML`, `NLP`, `NLU`, `Llama`, `CodeLlama`, `GGML`, `GGUF` [TOC] :::warning :bulb: [**需知**] 需要了解： 1. 請求者帶來的參數 2. 後端 Llama Server 的參數設定才能把它串接起來 ::: ## [[code] llama.cpp/examples/server/api_like_OAI.py](https://github.com/ggerganov/llama.cpp/blob/e9c1cecb9d7d743d30b4a29ecd56a411437def0a/examples/server/api_like_OAI.py) ## 架構圖 [![architecture_diagram.png](https://hackmd.io/_uploads/SkHusOj7p.png)](https://hackmd.io/_uploads/SkHusOj7p.png) ## Usages ### server-side ``` $ python api_like_OAI.py \ --llama-api http://10.78.26.241:40081 \ --host 0.0.0.0 \ --port 40082` ``` - #### `--llama-api` 參數需要加 `http://`，否則會有底下錯誤 - server error log ``` [2023-11-07 18:34:58,486] ERROR in app: Exception on /completions [POST] ... requests.exceptions.MissingSchema: Invalid URL '/completion': No schema supplied. Perhaps you meant http:///completion? 10.78.153.144 - - [07/Nov/2023 18:34:58] "POST /completions HTTP/1.1" 500 - ``` - client error log ```html <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> <title>500 Internal Server Error</title> <h1>Internal Server Error</h1> The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application. ``` The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application. ### client-side ``` $ curl -X POST http://10.78.26.241:40082/completions \ -H 'Content-Type: application/json' \ -d '{"n_predict":400,"prompt":"Nice to meet you!\n\nHuman: hi\nAssistant:"}' ``` - `-d` `--data` 可透過 `flask.request.get_json()` 取得 - #### client request 需加 `-H 'Content-Type: application/json'` - server error log ``` [debug] body: None <-- 抓不到 body [2023-11-07 18:37:28,264] ERROR in app: Exception on /completions [POST] File ".../api_like_OAI.py", line 28, in is_present buf = json[key] TypeError: 'NoneType' object is not subscriptable 10.78.153.144 - - [07/Nov/2023 18:37:28] "POST /completions HTTP/1.1" 500 - ``` - client error log ```html <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> <title>500 Internal Server Error</title> <h1>Internal Server Error</h1> The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application. ``` - 測試 - :warning: 如果 client 端沒設定 Content-Type server 端會用預設值 ==**`application/x-www-form-urlencoded`**== 接著 `request.get_json()` 會拿到 None 存取 json (type: dict) 時，就會引發 `'NoneType' object is not subscriptable` <hr> ## Llama-cpp APIs ### URL path - ### path 不存在，會回傳： File Not Found ### `/tokenize` - demo1 `$ curl -X POST http://10.78.26.241:40081/tokenize -d '{"content": "Nice to meet you!\n\nHuman: hi\nAssistant:"}'` - data 需要攜帶 `content` key - 執行結果： ```json {"tokens":[20103,304,5870,366,29991,13,13,29950,7889,29901,7251,13,7900,22137,29901]} ``` - demo2 `$ curl -X POST http://10.78.26.241:30080/tokenize -d '{"content": " I am a bit confused about your question. What do"}'` ### `/completion` (non-stream mode) - resquest `curl -X POST http://10.78.26.241:40081/completion -d '{"n_predict":200, "prompt": "Hello!"}'` - data 需要攜帶 `prompt` key - response :::spoiler json data ```json= { "content":" I'm excited to be here and share my experiences with you. nobody knows anything, so let's have some fun! 😃\n18 / 🌱 Feminist 🏳️‍🌈 Vegan 🧖‍♀️ Artist 💫", "generation_settings":{ "frequency_penalty":0.0, "grammar":"", "ignore_eos":false, "logit_bias":[ ], "mirostat":0, "mirostat_eta":0.10000000149011612, "mirostat_tau":5.0, "model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf", "n_ctx":512, "n_keep":0, "n_predict":200, "n_probs":0, "penalize_nl":true, "presence_penalty":0.0, "repeat_last_n":64, "repeat_penalty":1.100000023841858, "seed":4294967295, "stop":[ ], "stream":false, "temp":0.800000011920929, "tfs_z":1.0, "top_k":40, "top_p":0.949999988079071, "typical_p":1.0 }, "model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf", "prompt":"Hello!", "slot_id":0, "stop":true, "stopped_eos":true, "stopped_limit":false, "stopped_word":false, "stopping_word":"", "timings":{ "predicted_ms":7999.853, "predicted_n":71, "predicted_per_second":8.875163081121615, "predicted_per_token_ms":112.67398591549296, "prompt_ms":291.459, "prompt_n":3, "prompt_per_second":10.29304293228207, "prompt_per_token_ms":97.153 }, "tokens_cached":74, "tokens_evaluated":3, "tokens_predicted":71, "truncated":false } ``` ::: ### `/completion` (stream mode) - ### non-stream vs stream ![image](https://hackmd.io/_uploads/HkUzKBsEa.png) ![image](https://hackmd.io/_uploads/H1p4FHjE6.png) - non-stream vs stream data: 只差在 content , 和計算時間 - stream data 取自最後一筆 - ### curl - resquest `!curl -X POST http://10.78.26.241:40081/completion -d '{"n_predict":200, "prompt": "Hello!", "stream": true}'` - response ``` data: {"content":" I","multimodal":false,"slot_id":0,"stop":false} data: {"content":"'","multimodal":false,"slot_id":0,"stop":false} data: {"content":"m","multimodal":false,"slot_id":0,"stop":false} data: {"content":" a","multimodal":false,"slot_id":0,"stop":false} data: {"content":" fre","multimodal":false,"slot_id":0,"stop":false} ... data: {"content":"","multimodal":false,"slot_id":0,"stop":false} data: {"content":"","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf","n_ctx":512,"n_keep":0,"n_predict":200,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":true,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf","prompt":"Hello!","slot_id":0,"stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":22545.974,"predicted_n":181,"predicted_per_second":8.028040837801019,"predicted_per_token_ms":124.56339226519336,"prompt_ms":269.057,"prompt_n":3,"prompt_per_second":11.150053706092017,"prompt_per_token_ms":89.68566666666668},"tokens_cached":184,"tokens_evaluated":3,"tokens_predicted":181,"truncated":false} ``` - :::spoiler 最後一筆 json data ```json= { "content":"", "generation_settings":{ "frequency_penalty":0.0, "grammar":"", "ignore_eos":false, "logit_bias":[ ], "mirostat":0, "mirostat_eta":0.10000000149011612, "mirostat_tau":5.0, "model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf", "n_ctx":512, "n_keep":0, "n_predict":200, "n_probs":0, "penalize_nl":true, "presence_penalty":0.0, "repeat_last_n":64, "repeat_penalty":1.100000023841858, "seed":4294967295, "stop":[ ], "stream":true, "temp":0.800000011920929, "tfs_z":1.0, "top_k":40, "top_p":0.949999988079071, "typical_p":1.0 }, "model":"/home/*****/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf", "prompt":"Hello!", "slot_id":0, "stop":true, "stopped_eos":true, "stopped_limit":false, "stopped_word":false, "stopping_word":"", "timings":{ "predicted_ms":22545.974, "predicted_n":181, "predicted_per_second":8.028040837801019, "predicted_per_token_ms":124.56339226519336, "prompt_ms":269.057, "prompt_n":3, "prompt_per_second":11.150053706092017, "prompt_per_token_ms":89.68566666666668 }, "tokens_cached":184, "tokens_evaluated":3, "tokens_predicted":181, "truncated":false } ``` ::: - :warning: 注意： - `"content":""` - `"stream":true` - ### python - resquest ```python= import requests USER_PROMPT = 'Hello' url = 'http://10.78.26.241:40081/completion' headers = {} body = { "n_predict":5, "prompt": f'{USER_PROMPT}', "stream": True } res = requests.post(url=url, headers=headers, json=body, stream=True) print(res) for line in res.iter_lines(): print(line) ``` - `post()` 中的 stream 要設成 True - response ``` <Response [200]> b'data: {"content":" and","multimodal":false,"slot_id":0,"stop":false}' b'' b'data: {"content":" welcome","multimodal":false,"slot_id":0,"stop":false}' b'' b'data: {"content":" to","multimodal":false,"slot_id":0,"stop":false}' b'' b'data: {"content":" my","multimodal":false,"slot_id":0,"stop":false}' b'' b'data: {"content":" website","multimodal":false,"slot_id":0,"stop":false}' b'' b'data: {"content":"!","multimodal":false,"slot_id":0,"stop":false}' b'' b'data: {"content":"","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/home/diatango_lin/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf","n_ctx":512,"n_keep":0,"n_predict":5,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":true,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"/home/diatango_lin/LLM_Models/TheBloke/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf","prompt":"Hello","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":469.622,"predicted_n":5,"predicted_per_second":10.646860666663827,"predicted_per_token_ms":93.9244,"prompt_ms":234.539,"prompt_n":2,"prompt_per_second":8.527366450782173,"prompt_per_token_ms":117.2695},"tokens_cached":7,"tokens_evaluated":2,"tokens_predicted":5,"truncated":false}' b'' ``` - finish reason 的判斷方式 `finish_reason = "stop" if (data["stopped_eos"] or data["stopped_word"]) else "length_limit"` - "stop":true, - "stopped_eos":false, - "stopped_limit":true, - "stopped_word":false, - "stopping_word":"" <hr> ## OpenAI APIs > https://platform.openai.com/docs/api-reference - [The chat completion chunk object](https://platform.openai.com/docs/api-reference/chat/streaming) ![image](https://hackmd.io/_uploads/H10n0Yu46.png) ```json {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]} {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]} {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]} .... {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{"content":" today"},"finish_reason":null}]} {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{"content":"?"},"finish_reason":null}]} {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gpt-3.5-turbo-0613", "system_fingerprint": "fp_44709d6fcb", "choices":[{"index":0,"delta":{},"finish_reason":"stop"}]} ``` - 第一筆：`"delta":{"role":"assistant","content":""}` - 最後一筆：`"delta":{},"finish_reason":"stop"` - 中間：`"delta":{"content":"Hello"},"finish_reason":null}` <hr> ## ASF APIs (FFM APIs) > - #### AFS = AI Foundry Service 人工智慧代工服務 > - [[官網] 什麼是 AFS](https://tws.twcc.ai/afs-1/) > 低成本、低門檻、高效率、高安全 > 一站式生成式AI解決方案，全力協助企業打造專屬大模型 > - #### FFM = Formosa Foundation Model 福爾摩沙大模型 > ![](https://hackmd.io/_uploads/rkARViYNa.png) > - [[官網] Text generation API and parameters](https://docs.twcc.ai/en/docs/concepts-tutorials/twcc/afs/tutorials/text-generation-api/) - [[HackMD] AFS API 說明文件](https://hackmd.io/D6hChs1HRxyOaLul_HgirQ) <hr> ## docs ### arg 預設參數 - ### 預設參數 - chat_prompt='A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.\\n' - user_name="\\nUSER: " - ai_name="\\nASSISTANT: " - system_name="\\nASSISTANT's RULE: " - stop="</s>" - llama_api='http://127.0.0.1:8080' - api_key="" - host='127.0.0.1' - port=8081 - ### 說明 - request 打到 host:port (127.0.0.1:8087)，會轉到 llama_api (http://127.0.0.1:8080) ### completion() 作法： 1. 設定 route 路徑，以接收 request 2. server-agent side 若有設定 api_key，則檢查 request 的 headers["Authorization"] 3. 取出 stream, tokenize 參數 4. 如果有啟用 tokenize 參數，呼叫 server_ip/tokenize 取得 tokenize 後的結果 5. 如果沒有啟用 stream 參數，呼叫 server_ip/completion 取得 assistant 的回答 6. 如果有啟用 stream 參數，呼叫 server_ip/completion 取得 assistant 的回答 - usages - `/completions` (有加s)，並帶有 tokenize 資訊： ``` $ curl -X POST http://10.78.26.241:40082/completions \ -H 'Content-Type: application/json' \ -d '{"n_predict":400,"prompt":"Nice to meet you!\n\nHuman: hi\nAssistant:", "tokenize": true}' ``` ### completion() vs chat_completion() [![](https://hackmd.io/_uploads/rJ1Z6ot4a.png)](https://hackmd.io/_uploads/rJ1Z6ot4a.png) ### make_resData() vs make_resData_stream() [![](https://hackmd.io/_uploads/H1XZLgcET.png)](https://hackmd.io/_uploads/H1XZLgcET.png) ### make_postData() - 大部份參數是1對1轉換 - `logit_bias` 參數 - {k1:v1, k2:v2, k3:v3} 轉換成 [[k1, v1], [k2, v2], [k3, v3]] - `stop` 參數 - 來自 args.stop - 追加外部 stop 參數 - 內部參數 - n_keep = -1 - stream = stream - cache_prompt = True - slot_id = -1 ### make_postData()::convert_chat() > 有旁白 + 三種角色 > ```python= chat_prompt='A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.\\n' user_name="\\nUSER: " ai_name="\\nASSISTANT: " system_name="\\nASSISTANT's RULE: " session_stop="</s>" def convert_chat(messages): prompt = "" + chat_prompt.replace("\\n", "\n") system_n = system_name.replace("\\n", "\n") user_n = user_name.replace("\\n", "\n") ai_n = ai_name.replace("\\n", "\n") stop = session_stop.replace("\\n", "\n") for line in messages: if (line["role"] == "system"): prompt += f"{system_n}{line['content']}" if (line["role"] == "user"): prompt += f"{user_n}{line['content']}" if (line["role"] == "assistant"): prompt += f"{ai_n}{line['content']}{stop}" prompt += ai_n.rstrip() return prompt messages = [ {"role":"system", "content":"You are an assistant."}, {"role":"user", "content":"How are you?"}, {"role":"assistant", "content":"I'm fine. And you?"}, {"role":"user", "content":"Fine. Thanks."}, ] convert_chat(messages) ``` - **執行結果：** > "A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.\n\nASSISTANT's RULE: You are an assistant.\nUSER: How are you?\nASSISTANT: I'm fine. And you?</s>\nUSER: Fine. Thanks.\nASSISTANT:" - **print 結果：** [![image.png](https://hackmd.io/_uploads/HJqSuMcQa.png)](https://hackmd.io/_uploads/HJqSuMcQa.png)