###### tags: `AFS` # AFS API 說明文件 <sup style="color:gray">V5, 更新時間: 2024/07/12 13:00</b></sup> ## 取得所需資訊 :::spoiler **AFS ModelSpace 公有模式(Public Mode)** 1. **`API URL`**:請從 AFS ModelSpace 的 **公用模式 - API 金鑰管理** 頁面中的右上角複製。 ![image](https://hackmd.io/_uploads/rkl3a6Vb0.png) ![image](https://hackmd.io/_uploads/BJWCQerW0.png) 3. **`MODEL_NAME`**:請參考 [**模型名稱對照表**](https://docs.twcc.ai/docs/user-guides/twcc/afs/afs-modelspace/available-model)。 4. **`API_KEY`**: 請從 **公用模式 - API 金鑰管理** 頁面的列表中取得。 ::: :::spoiler **AFS ModelSpace 私有模式(Private Mode)** 1. **`API_URL`**:**`API 端點連結`**,可從該服務的詳細資料頁面中複製。 ![image](https://hackmd.io/_uploads/r1EfvzHx0.png) 2. **`MODEL_NAME`**:**`模型名稱`**,如上圖,可從該服務的詳細資料頁面中複製。 3. **`API_KEY`**:{API_KEY},登入該服務的測試介面後點選右上角的帳號資訊,即可看到 `API 金鑰`。 ![](https://hackmd.io/_uploads/rkCFhkFF2.png)<br><br> ::: :::spoiler **AFS Cloud** 1. **`API_URL`**:**`API 端點連結`**,可從該服務的詳細資料頁面中複製。 ![image](https://hackmd.io/_uploads/r1EfvzHx0.png) 2. **`MODEL_NAME`**:**`模型名稱`**,如上圖,可從該服務的詳細資料頁面中複製。 3. **`API_KEY`**:{API_KEY},登入該服務的測試介面後點選右上角的帳號資訊,即可看到 `API 金鑰`。 ![](https://hackmd.io/_uploads/rkCFhkFF2.png)<br><br> ::: ## 參數說明 ### Request 參數 - `max_new_tokens`:一次最多可生成的 token 數量。 - 預設值:350 - 範圍限制:大於 0 的整數值 ::: warning :warning: **注意:使用限制** 每個模型都有 input + output token 小於某個值的限制,如果輸入字串加上預計生成文字的 token 數量大於該值則會產生錯誤。 - Mistral (7B) / Mixtral (8x7B) : 32768 tokens - Llama3 (8B / 70B) : 8192 tokens - Llama2-V2 / Llama2 (7B / 13B / 70B) / Taide-LX (7B) : 4096 tokens - CodeLlama (7B / 13B / 34B) : 8192 tokens ::: - `temperature`:生成創造力,生成文本的隨機和多樣性。值越大,文本更具創意和多樣性;值越小,則較保守、接近模型所訓練的文本。 - 預設值:1.0 - 範圍限制:大於 0 的小數值 - `top_p`:當候選 token 的累計機率達到或超過此值時,就會停止選擇更多的候選 token。值越大,生成的文本越多樣化;值越小,生成的文本越保守。 - 預設值:1.0 - 範圍限制:大於 0 且小於等於 1 的小數值 - `top_k`:限制模型只從具有最高概率的 K 個 token 中進行選擇。值越大,生成文本越多樣化;值越小,生成的文本越保守。 - 預設值:50 - 範圍限制:大於等於 1 且小於等於 100 的整數值 - `frequence_penalty`:重複懲罰,控制重複生成 token 的概率。值越大,重複 token 出現次數將降低。 - 預設值:1.0 - 範圍限制:大於 0 的小數值 - `stop_sequences`:當文字生成內容遇到以下序列即停止,而輸出內容不會納入序列。 - 預設值:無 - 範圍限制:最多四組,例如 ["stop", "end"] - `show_probabilities`:是否顯示生成文字的對數機率。其值為依據前面文字來生成該 token 的機率, 以對數方式呈現。 - 預設值:false - 範圍限制:true 或是 false - `seed`:亂數種子,具有相同種子與參數的重複請求會傳回相同結果。若設成 null,表示隨機。 - 預設值:42 - 範圍限制:可設為 null,以及大於等於 0 的整數值 ### Request 參數調校建議 * `temperature` 的調整影響回答的創意性。 - 單一/非自創的回答,建議調低 temperature 至 0.1~0.2。 - 多元/高創意的回答,建議調高 temperature 至 1。 * 若上述調整後仍想再微調 `top-k` 和 `top-p`,請先調整 `top-k`,最後再更動 `top-p`。 * 當回答中有高重複的 token,重複懲罰數值 `frequence_penalty` 建議調至 1.03,最高 1.2,再更高會有較差的效果。 ### Response 參數 - `function_call`:模型回覆的 Function Calling 呼叫函式,若無使用此功能則回傳 null。 - `details`:針對特定需求所提供的細節資訊,例如 response 參數的 show_probabilities 若為 true,details 會回傳生成文字的對數機率。 - `total_time_taken`:本次 API 的總花費秒數。 - `prompt_tokens`:本次 API 的 Prompt Token 數量(Input Tokens),會包含 system 預設指令、歷史對話中 human / assistant 的內容以及本次的問題或輸入內容的總 Token 數。 - `generated_tokens`:本次 API 的 Generated Token 數量(Output Tokens),即為本次模型的回覆內容總 Token 數,而此 Token 數量越大,對應的總花費秒數也會隨之增長。(若有輸出大量 token 文字的需求,請務必優先採用 Stream 模式,以免遇到 Timeout 的情形。) - `total_tokens`:本次 API 的 Total Token 數量 (Input + Output Tokens)。 - `finish_reason`:本次 API 的結束原因說明,例如 "eos\_token" 代表模型已生成完畢,"function\_call" 代表呼叫函式已生成完畢。 ## Conversation ::: info :bulb: **提示:** 支援模型清單 * Mistral (7B) / Mixtral (8x7B)  * Llama3 (8B / 70B) * Llama2-V2 / Llama2 (7B / 13B / 70B) / Taide-LX (7B) * CodeLlama (7B / 13B / 34B) ::: ### 一般使用 依對話順序,依照角色位置把對話內容填到 Content 欄位中。 - [**範例一**](#範例一:無`預設指令`) | Role | Order | Content | | --------- | ----- | ------- | | human | 問題1 | 人口最多的國家是? | | assistant | 答案1 | 人口最多的國家是印度。 | | human | 問題2 | 主要宗教為? 
| - [**範例二**](#範例二:設定`預設指令`) | Role | Order | Content | | --------- | ------- | ------- | | system | 預設指令 | 你是一位只會用表情符號回答問題的助理。 | | human | 問題1 | 明天會下雨嗎? | | assistant | 答案1 | 🤔 🌨️ 🤞 | | human | 問題2 | 意思是要帶傘出門嗎? | ::: info :bulb: **提示:** LLaMA 2 支援預設指令。預設指令可以協助優化系統的回答行為,在對話的每一段過程中都會套用。 ::: #### 範例一:無`預設指令` ```bash= export API_KEY={API_KEY} export API_URL={API_URL} export MODEL_NAME={MODEL_NAME} # model: ffm-mixtral-8x7b-32k-instruct curl "${API_URL}/models/conversation" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "messages":[ { "role": "human", "content": "人口最多的國家是?" }, { "role": "assistant", "content": "人口最多的國家是印度。" }, { "role": "human", "content": "主要宗教為?" }], "parameters": { "max_new_tokens":350, "temperature":0.5, "top_k":50, "top_p":1, "frequence_penalty":1}}' ``` 輸出:包括生成的文字、token 個數以及所花費的時間秒數。 > { "generated_text": "印度的主要宗教是印度教", "function_call": null, "details": null, "total_time_taken": "1.31 sec", "prompt_tokens": 44, "generated_tokens": 13, "total_tokens": 57, "finish_reason": "eos_token" } :::spoiler Python 範例 ```python= import json import requests MODEL_NAME = "{MODEL_NAME}" API_KEY = "{API_KEY}" API_URL = "{API_URL}" API_HOST = "afs-inference" # parameters max_new_tokens = 350 temperature = 0.5 top_k = 50 top_p = 1.0 frequence_penalty = 1.0 def conversation(contents): headers = { "content-type": "application/json", "X-API-Key": API_KEY, "X-API-Host": API_HOST} roles = ["human", "assistant"] messages = [] for index, content in enumerate(contents): messages.append({"role": roles[index % 2], "content": content}) data = { "model": MODEL_NAME, "messages": messages, "parameters": { "max_new_tokens": max_new_tokens, "temperature": temperature, "top_k": top_k, "top_p": top_p, "frequence_penalty": frequence_penalty } } result = "" try: response = requests.post(API_URL + "/models/conversation", json=data, headers=headers) if response.status_code == 200: result = json.loads(response.text, strict=False)['generated_text'] else: print("error") except: print("error") return result.strip("\n") contents = ["人口最多的國家是?", "人口最多的國家是印度。", "主要宗教為?"] result = conversation(contents) print(result) ``` 輸出: > 印度的主要宗教是印度教 ::: #### 範例二:設定`預設指令` ```bash= export API_KEY={API_KEY} export API_URL={API_URL} export MODEL_NAME={MODEL_NAME} # model: ffm-mixtral-8x7b-32k-instruct curl "${API_URL}/models/conversation" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "messages":[ { "role": "system", "content": "你是一位只會用表情符號回答問題的助理。" }, { "role": "human", "content": "明天會下雨嗎?" }, { "role": "assistant", "content": "🤔 🌨️ 🤞" }, { "role": "human", "content": "意思是要帶傘出門嗎?" 
}], "parameters": { "max_new_tokens":350, "temperature":0.5, "top_k":50, "top_p":1, "frequence_penalty":1}}' ``` 輸出:包括生成的文字、token 個數以及所花費的時間秒數。 > { "generated_text": "🌂 🌂 🌂\n\n(譯:明天會下雨,所以最好帶把傘出門。)", "function_call": null, "details": null, "total_time_taken": "3.55 sec", "prompt_tokens": 76, "generated_tokens": 38, "total_tokens": 114, "finish_reason": "eos_token" } :::spoiler Python 範例 ```python= import json import requests MODEL_NAME = "{MODEL_NAME}" API_KEY = "{API_KEY}" API_URL = "{API_URL}" API_HOST = "afs-inference" # parameters max_new_tokens = 350 temperature = 0.5 top_k = 50 top_p = 1.0 frequence_penalty = 1.0 def conversation(system, contents): headers = { "content-type": "application/json", "X-API-Key": API_KEY, "X-API-Host": API_HOST} roles = ["human", "assistant"] messages = [] if system is not None: messages.append({"role": "system", "content": system}) for index, content in enumerate(contents): messages.append({"role": roles[index % 2], "content": content}) data = { "model": MODEL_NAME, "messages": messages, "parameters": { "max_new_tokens": max_new_tokens, "temperature": temperature, "top_k": top_k, "top_p": top_p, "frequence_penalty": frequence_penalty } } result = "" try: response = requests.post(API_URL + "/models/conversation", json=data, headers=headers) if response.status_code == 200: result = json.loads(response.text, strict=False)['generated_text'] else: print("error") except: print("error") return result.strip("\n") system_prompt = "你是一位只會用表情符號回答問題的助理。" contents = ["明天會下雨嗎?", "🤔 🌨️ 🤞", "意思是要帶傘出門嗎?"] result = conversation(system_prompt, contents) print(result) ``` 輸出: > 🌂 🌂 🌂 (譯:明天會下雨,所以最好帶把傘出門。) ::: ```bash= export API_KEY={API_KEY} export API_URL={API_URL} export MODEL_NAME={MODEL_NAME} # model: ffm-mixtral-8x7b-32k-instruct curl "${API_URL}/models/conversation" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "messages":[ { "role": "system", "content": "你是一個活潑的五歲小孩,回答問題時都使用童言童語的語氣。" }, { "role": "human", "content": "明天會下雨嗎?" }, { "role": "assistant", "content": "嗯,我不知道,但我希望如此!我喜歡玩雨水,穿上我的雨靴和雨衣。這就像一個大派對外面!如果你很幸運,也許你可以看到一個彩虹!" }, { "role": "human", "content": "彩虹有幾種顏色呢?" 
}], "parameters": { "max_new_tokens":350, "temperature":0.5, "top_k":50, "top_p":1, "frequence_penalty":1}}' ``` 輸出:包括生成的文字、token 個數以及所花費的時間秒數。 > { "generated_text": "彩虹有七種顏色!你能記得住它們嗎?它們是紅色、橙色、黃色、綠色、藍色、靛色和紫色。這是一個有趣的記憶法:“藍色是靛色的,紫色是我的,綠色是我喜歡的,黃色是太陽,橙色是甜美的,紅色是勇敢的。”試試看,這樣就很容易記住了!", "function_call": null, "details": null, "total_time_taken": "15.00 sec", "prompt_tokens": 133, "generated_tokens": 111, "total_tokens": 244, "finish_reason": "eos_token" } :::spoiler Python 範例 ```python= import json import requests MODEL_NAME = "{MODEL_NAME}" API_KEY = "{API_KEY}" API_URL = "{API_URL}" API_HOST = "afs-inference" # parameters max_new_tokens = 350 temperature = 0.5 top_k = 50 top_p = 1.0 frequence_penalty = 1.0 def conversation(system, contents): headers = { "content-type": "application/json", "X-API-Key": API_KEY, "X-API-Host": API_HOST} roles = ["human", "assistant"] messages = [] if system is not None: messages.append({"role": "system", "content": system}) for index, content in enumerate(contents): messages.append({"role": roles[index % 2], "content": content}) data = { "model": MODEL_NAME, "messages": messages, "parameters": { "max_new_tokens": max_new_tokens, "temperature": temperature, "top_k": top_k, "top_p": top_p, "frequence_penalty": frequence_penalty } } result = "" try: response = requests.post(API_URL + "/models/conversation", json=data, headers=headers) if response.status_code == 200: result = json.loads(response.text, strict=False)['generated_text'] else: print("error") except: print("error") return result.strip("\n") system_prompt = "你是一個活潑的五歲小孩,回答問題時都使用童言童語的語氣。" contents = ["明天會下雨嗎?", "嗯,我不知道,但我希望如此!我喜歡玩雨水,穿上我的雨靴和雨衣。這就像一個大派對外面!如果你很幸運,也許你可以看到一個彩虹!", "彩虹有幾種顏色呢?"] result = conversation(system_prompt, contents) print(result) ``` 輸出: > 彩虹有七種顏色!你能記得住它們嗎?它們是紅色、橙色、黃色、綠色、藍色、靛色和紫色。這是一個有趣的記憶法:“藍色是靛色的,紫色是我的,綠色是我喜歡的,黃色是太陽,橙色是甜美的,紅色是勇敢的。”試試看,這樣就很容易記住了! ::: ### 使用 Stream 模式 Server-sent event (SSE):伺服器主動向客戶端推送資料,連線建立後,在一步步生成字句的同時也將資料往客戶端拋送,和先前的一次性回覆不同,可加強使用者體驗。若有輸出大量 token 文字的需求,請務必優先採用 Stream 模式,以免遇到 Timeout 的情形。 ```bash= export API_KEY={API_KEY} export API_URL={API_URL} export MODEL_NAME={MODEL_NAME} # model: ffm-mixtral-8x7b-32k-instruct curl "${API_URL}/models/conversation" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "messages":[ { "role": "human", "content": "人口最多的國家是?" }, { "role": "assistant", "content": "人口最多的國家是印度。" }, { "role": "human", "content": "主要宗教為?" 
}], "parameters": { "max_new_tokens":350, "temperature":0.5, "top_k":50, "top_p":1, "frequence_penalty":1}, "stream": true}' ``` 輸出:每個 token 會輸出一筆資料,最末筆則是會多出生成的總 token 個數和所花費的時間秒數。 > data: {"generated_text": "", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "印", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "度", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "的", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "主", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "要", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "宗", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "教", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "是", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "印", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "度", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "教", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "", "function_call": null, "details": null, "total_time_taken": "1.32 sec", "prompt_tokens": 44, "generated_tokens": 13, "total_tokens": 57, "finish_reason": "eos_token"} ::: info :bulb: **提示:注意事項** 1. 每筆 token 並不一定能解碼成合適的文字,如果遇到該種情況,該筆 generated_text 欄位會顯示空字串,該 token 會結合下一筆資料再來解碼,直接能呈現為止。 2. 
本案例採用 [sse-starlette](https://github.com/sysid/sse-starlette),在 SSE 過程中約 15 秒就會收到 ping event,目前在程式中如果連線大於該時間就會收到以下資訊 (非 JSON 格式),在資料處理時需特別注意,下列 Python 範例已經有包含此資料處理。 > event: ping > data: 2023-09-26 04:25:08.978531 ::: :::spoiler Python 範例 ```python= import json import requests MODEL_NAME = "{MODEL_NAME}" API_KEY = "{API_KEY}" API_URL = "{API_URL}" API_HOST = "afs-inference" # parameters max_new_tokens = 350 temperature = 0.5 top_k = 50 top_p = 1.0 frequence_penalty = 1.0 def conversation(contents): headers = { "content-type": "application/json", "X-API-Key": API_KEY, "X-API-Host": API_HOST} roles = ["human", "assistant"] messages = [] for index, content in enumerate(contents): messages.append({"role": roles[index % 2], "content": content}) data = { "model": MODEL_NAME, "messages": messages, "parameters": { "max_new_tokens": max_new_tokens, "temperature": temperature, "top_k": top_k, "top_p": top_p, "frequence_penalty": frequence_penalty }, "stream": True } messages = [] result = "" try: response = requests.post(API_URL + "/models/conversation", json=data, headers=headers, stream=True) if response.status_code == 200: for chunk in response.iter_lines(): chunk = chunk.decode('utf-8') if chunk == "": continue # only check format => data: ${JSON_FORMAT} try: record = json.loads(chunk[5:], strict=False) if "status_code" in record: print("{:d}, {}".format(record["status_code"], record["error"])) break elif record["total_time_taken"] is not None or ("finish_reason" in record and record["finish_reason"] is not None) : message = record["generated_text"] messages.append(message) print(">>> " + message) result = ''.join(messages) break elif record["generated_text"] is not None: message = record["generated_text"] messages.append(message) print(">>> " + message) else: print("error") break except: pass else: print("error") except: print("error") return result.strip("\n") contents = ["人口最多的國家是?", "人口最多的國家是印度。", "主要宗教為?"] result = conversation(contents) print(result) ``` 輸出: ``` >>> >>> 印 >>> 度 >>> 的 >>> 主 >>> 要 >>> 宗 >>> 教 >>> 是 >>> 印 >>> 度 >>> 教 >>> 印度的主要宗教是印度教 ``` ::: ### LangChain 使用方式 :::spoiler **Custom Chat Model Wrapper** ```python= """Wrapper LLM conversation APIs.""" from typing import Any, Dict, List, Mapping, Optional, Tuple from langchain.llms.base import LLM import requests from langchain.llms.utils import enforce_stop_tokens from langchain.llms.base import BaseLLM from langchain.llms.base import create_base_retry_decorator from pydantic import BaseModel, Extra, Field, root_validator from langchain.chat_models.base import BaseChatModel from langchain.schema.language_model import BaseLanguageModel from langchain.schema import ( BaseMessage, ChatGeneration, ChatResult, ChatMessage, AIMessage, HumanMessage, SystemMessage ) from langchain.callbacks.manager import ( Callbacks, AsyncCallbackManagerForLLMRun, CallbackManagerForLLMRun, ) import json import os class _ChatFormosaFoundationCommon(BaseLanguageModel): base_url: str = "http://localhost:12345" """Base url the model is hosted under.""" model: str = "ffm-mixtral-8x7b-32k-instruct" """Model name to use.""" temperature: Optional[float] """The temperature of the model. Increasing the temperature will make the model answer more creatively.""" stop: Optional[List[str]] """Sets the stop tokens to use.""" top_k: int = 50 """Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 50)""" top_p: float = 1 """Works together with top-k. 
A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 1)""" max_new_tokens: int = 350 """The maximum number of tokens to generate in the completion. -1 returns as many tokens as possible given the prompt and the models maximal context size.""" frequence_penalty: float = 1 """Penalizes repeated tokens according to frequency.""" model_kwargs: Dict[str, Any] = Field(default_factory=dict) """Holds any model parameters valid for `create` call not explicitly specified.""" ffm_api_key: Optional[str] = None @property def _default_params(self) -> Dict[str, Any]: """Get the default parameters for calling FFM API.""" normal_params = { "temperature": self.temperature, "max_new_tokens": self.max_new_tokens, "top_p": self.top_p, "frequence_penalty": self.frequence_penalty, "top_k": self.top_k, } return {**normal_params, **self.model_kwargs} def _call( self, prompt, stop: Optional[List[str]] = None, **kwargs: Any, ) -> str: if self.stop is not None and stop is not None: raise ValueError("`stop` found in both the input and default params.") elif self.stop is not None: stop = self.stop elif stop is None: stop = [] params = {**self._default_params, "stop": stop, **kwargs} parameter_payload = {"parameters": params, "messages": prompt, "model": self.model} # HTTP headers for authorization headers = { "X-API-KEY": self.ffm_api_key, "X-API-HOST": "afs-inference", "Content-Type": "application/json", } endpoint_url = f"{self.base_url}/models/conversation" # send request try: response = requests.post( url=endpoint_url, headers=headers, data=json.dumps(parameter_payload, ensure_ascii=False).encode("utf8"), stream=False, ) response.encoding = "utf-8" generated_text = response.json() if response.status_code != 200: detail = generated_text.get("detail") raise ValueError( f"FormosaFoundationModel endpoint_url: {endpoint_url}\n" f"error raised with status code {response.status_code}\n" f"Details: {detail}\n" ) except requests.exceptions.RequestException as e: # This is the correct syntax raise ValueError(f"FormosaFoundationModel error raised by inference endpoint: {e}\n") if generated_text.get("detail") is not None: detail = generated_text["detail"] raise ValueError( f"FormosaFoundationModel endpoint_url: {endpoint_url}\n" f"error raised by inference API: {detail}\n" ) if generated_text.get("generated_text") is None: raise ValueError( f"FormosaFoundationModel endpoint_url: {endpoint_url}\n" f"Response format error: {generated_text}\n" ) return generated_text class ChatFormosaFoundationModel(BaseChatModel, _ChatFormosaFoundationCommon): """`FormosaFoundation` Chat large language models API. The environment variable ``OPENAI_API_KEY`` set with your API key. Example: .. 
code-block:: python

            ffm = ChatFormosaFoundationModel(model_name="llama2-7b-chat-meta")
    """

    @property
    def _llm_type(self) -> str:
        return "ChatFormosaFoundationModel"

    @property
    def lc_serializable(self) -> bool:
        return True

    def _convert_message_to_dict(self, message: BaseMessage) -> dict:
        if isinstance(message, ChatMessage):
            message_dict = {"role": message.role, "content": message.content}
        elif isinstance(message, HumanMessage):
            message_dict = {"role": "human", "content": message.content}
        elif isinstance(message, AIMessage):
            message_dict = {"role": "assistant", "content": message.content}
        elif isinstance(message, SystemMessage):
            message_dict = {"role": "system", "content": message.content}
        else:
            raise ValueError(f"Got unknown type {message}")
        return message_dict

    def _create_conversation_messages(
        self, messages: List[BaseMessage], stop: Optional[List[str]]
    ) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
        params: Dict[str, Any] = {**self._default_params}
        if stop is not None:
            if "stop" in params:
                raise ValueError("`stop` found in both the input and default params.")
            params["stop"] = stop
        message_dicts = [self._convert_message_to_dict(m) for m in messages]
        return message_dicts, params

    def _create_chat_result(self, response: Mapping[str, Any]) -> ChatResult:
        chat_generation = ChatGeneration(
            message = AIMessage(content=response.get("generated_text")),
            generation_info = {
                "token_usage": response.get("generated_tokens"),
                "model": self.model
            }
        )
        return ChatResult(generations=[chat_generation])

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        message_dicts, params = self._create_message_dicts(messages, stop)
        params = {**params, **kwargs}
        response = self._call(prompt=message_dicts)
        if type(response) is str:
            # response is not the format of dictionary
            return response
        return self._create_chat_result(response)

    async def _agenerate(
        self, messages: List[BaseMessage], stop: Optional[List[str]] = None
    ) -> ChatResult:
        pass

    def _create_message_dicts(
        self, messages: List[BaseMessage], stop: Optional[List[str]]
    ) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
        params = self._default_params
        if stop is not None:
            if "stop" in params:
                raise ValueError("`stop` found in both the input and default params.")
            params["stop"] = stop
        message_dicts = [self._convert_message_to_dict(m) for m in messages]
        return message_dicts, params
```
:::

* 完成以上封裝後,就可以在 LangChain 中使用特定的 FFM 大語言模型。

::: info
:bulb: **提示:** 更多資訊,請參考 [**LangChain Custom LLM 文件**](https://python.langchain.com/docs/modules/model_io/models/llms/custom_llm)。
:::

```python=
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"

from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

chat_ffm = ChatFormosaFoundationModel(
    base_url = API_URL,
    max_new_tokens = 350,
    temperature = 0.5,
    top_k = 50,
    top_p = 1.0,
    frequence_penalty = 1.0,
    ffm_api_key = API_KEY,
    model = MODEL_NAME
)

messages = [
    HumanMessage(content="人口最多的國家是?"),
    AIMessage(content="人口最多的國家是印度。"),
    HumanMessage(content="主要宗教為?")
]

result = chat_ffm(messages)
print(result.content)
```

輸出:
> 印度的主要宗教是印度教

## Function Calling

在 API 呼叫中,您可以描述多個函式讓模型選擇,並輸出包含選中的函數名稱及參數的 JSON Object,讓應用或代理人程式調用模型選擇的函式。Conversation API 不會調用該函式,而是生成 JSON 讓您可在程式碼中調用函式。

::: info
:bulb: **提示:** 支援模型清單
* Mistral (7B) / Mixtral (8x7B)
* Llama2-V2 (7B / 13B / 70B)
:::

### 使用方式

1. 開發者提供函式列表並對大語言模型輸入問題。
2. 開發者解析大語言模型輸出的結構化資料,取得函式與對應的參數後,讓應用或代理人程式呼叫 API 並獲得回傳的結果。
3. 將 API 回傳的結果放到對話內容並傳給大語言模型做總結。
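
以下先以 Python 將這三個步驟串成一個完整流程(僅為示意,各步驟對應的 API 請求與回應格式,請見下方 Conversation API 小節)。函式規格與參數沿用本節的 curl 範例;其中 `get_current_weather()` 為假設性的本地實作,實際使用時請替換成真正的資料來源(例如天氣查詢服務),函式參數則依照本節回應範例中的 `function_call.arguments` JSON 物件來解析。

```python=
import json
import requests

MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"

HEADERS = {
    "content-type": "application/json",
    "X-API-KEY": API_KEY,
    "X-API-HOST": API_HOST}

# 函式規格,與本節 curl 範例相同
FUNCTIONS = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
        "required": ["location"]}}]

def get_current_weather(location, unit="celsius"):
    # 假設性的本地實作,實際應改為呼叫真正的天氣 API
    return {"temperature": "22", "unit": unit, "description": "Sunny"}

def chat(messages):
    data = {
        "model": MODEL_NAME,
        "messages": messages,
        "functions": FUNCTIONS,
        "parameters": {
            "max_new_tokens": 500, "temperature": 0.5,
            "top_k": 100, "top_p": 0.93, "frequence_penalty": 1}}
    response = requests.post(API_URL + "/models/conversation", json=data, headers=HEADERS)
    response.raise_for_status()
    return response.json()

# 步驟 1:提供函式列表並輸入問題
messages = [{"role": "user", "content": "What is the weather like in Boston?"}]
result = chat(messages)

# 步驟 2:解析模型回傳的 function_call,呼叫對應的函式
if result.get("finish_reason") == "function_call":
    call = result["function_call"]
    weather = get_current_weather(**call["arguments"])

    # 步驟 3:將函式結果以 role 為 function 的訊息放回對話,請模型做總結
    messages.append({"role": "assistant", "content": None, "function_call": call})
    messages.append({"role": "function", "name": call["name"],
                     "content": json.dumps(weather)})
    print(chat(messages)["generated_text"])
```

若 `finish_reason` 不是 "function_call",代表模型選擇直接以文字回覆,此時可直接取用 `generated_text`。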
### Conversation API

給定包含對話的訊息列表,模型將回傳生成訊息或呼叫函式。

1. 開發者提供函式列表並對大語言模型輸入問題

| Field | Type | Required | Description |
| -------- | -------- | -------- | -------- |
| **functions** | array | Optional | A list of functions the model may generate JSON inputs for.|

* Example of RESTful HTTP Request
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}

curl -X POST "${API_URL}/models/conversation" \
 -H "accept: application/json" \
 -H "X-API-KEY:${API_KEY}" \
 -H "X-API-HOST: afs-inference" \
 -H "content-type: application/json" \
 -d '{
  "model": "'${MODEL_NAME}'",
  "messages": [
   { "role": "user", "content": "What is the weather like in Boston?" }],
  "functions": [
   {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
     "type": "object",
     "properties": {
      "location": { "type": "string", "description": "The city and state, e.g. San Francisco, CA" },
      "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
     },
     "required": ["location"]
    }
   }],
  "parameters": {
   "show_probabilities": false,
   "max_new_tokens": 500,
   "frequence_penalty": 1,
   "temperature": 0.5,
   "top_k": 100,
   "top_p": 0.93
  },
  "stream": false }'
```

| Field | Type | Required | Description |
| -------- | -------- | -------- | -------- |
|**function_call** | string or object | Optional | JSON format that adheres to the function signature |

* Example of RESTful HTTP Response
```json
{
 "generated_text": "",
 "function_call": {
  "name": "get_current_weather",
  "arguments": {
   "location": "Boston, MA"
  }
 },
 "details":null,
 "total_time_taken": "1.18 sec",
 "prompt_tokens": 181,
 "generated_tokens": 45,
 "total_tokens": 226,
 "finish_reason": "function_call"
}
```

2. 開發者解析大語言模型輸出的結構化資料,取得函式與對應的參數後,呼叫 API 並獲得回傳的結果

* Example of Weather API Response
```json
{
 "temperature": "22",
 "unit": "celsius",
 "description": "Sunny"
}
```

3. 將 API 回傳的結果放到對話內容並傳給大語言模型做總結

| Field | Value |
| -------- | -------- |
|**role** | ***function*** |
|**name** | The function name to call |
|**content** | The response message from the API |

* Example of RESTful HTTP Request
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}

curl -X POST "${API_URL}/models/conversation" \
 -H "accept: application/json" \
 -H "X-API-KEY:${API_KEY}" \
 -H "X-API-HOST: afs-inference" \
 -H "content-type: application/json" \
 -d '{
  "model": "'${MODEL_NAME}'",
  "messages": [
   {"role": "user", "content": "What is the weather like in Boston?"},
   {"role": "assistant", "content": null, "function_call": {"name": "get_current_weather", "arguments": {"location": "Boston, MA"}}},
   {"role": "function", "name": "get_current_weather", "content": "{\"temperature\": \"22\", \"unit\": \"celsius\", \"description\": \"Sunny\"}"}
  ],
  "functions": [
   {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
     "type": "object",
     "properties": {
      "location": { "type": "string", "description": "The city and state, e.g. San Francisco, CA" },
      "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
     },
     "required": ["location"]
    }
   }],
  "parameters": {
   "show_probabilities": false,
   "max_new_tokens": 500,
   "frequence_penalty": 1,
   "temperature": 0.5,
   "top_k": 100,
   "top_p": 0.93
  },
  "stream": false }'
```

* Example of RESTful HTTP Response
> {
> "generated_text":" The current weather in Boston is sunny with a temperature of 22 degrees Celsius. 
", > "details":null, > "total_time_taken":"0.64 sec", > "prompt_tokens":230, > "generated_tokens":23, > "total_tokens":253, > "finish_reason":"eos_token" > } ## Code Infilling ### 一般使用 基於給定程式碼前後文來預測程式中要填補的段落,以 `<FILL_ME>` 標籤當成要填補的部分,實際應用會是在開發環境 (IDE) 中自動完成程式中缺漏或是待完成的程式碼區段。 :::info :bulb: **提示:注意事項** - 目前僅 meta-codellama-7b-instruct 及 meta-codellama-13b-instruct 模型支援 Code Infilling,若使用的模型不支援,API 會回傳錯誤訊息。 - 如果輸入內容包含多個 `<FILL_ME>`,API 會回傳錯誤訊息。 ::: ```bash= export API_KEY={API_KEY} export API_URL={API_URL} export MODEL_NAME={MODEL_NAME} # model: meta-codellama-7b-instruct curl "${API_URL}/models/text_infilling" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "inputs":"def remove_non_ascii(s: str) -> str:\n \"\"\" <FILL_ME>\n return result\n", "parameters":{ "max_new_tokens":43, "temperature":0.1, "top_k":50, "top_p":1, "frequence_penalty":1}}' ``` 輸出:取代 `<FILL_ME>` 的程式片段、token 個數以及所花費的時間秒數。 ```json { "generated_text": "Remove non-ASCII characters from a string. \"\"\"\n result = \"\"\n for c in s:\n if ord(c) < 128:\n result += c\n ", "function_call": null, "details": null, "total_time_taken": "0.99 sec", "prompt_tokens": 27, "generated_tokens": 43, "total_tokens": 70, "finish_reason": "length" } ``` :::spoiler Python 範例 ```python= import json import requests, re MODEL_NAME = "{MODEL_NAME}" API_KEY = "{API_KEY}" API_URL = "{API_URL}" API_HOST = "afs-inference" # parameters max_new_tokens = 43 temperature = 0.1 top_k = 50 top_p = 1.0 frequence_penalty = 1.0 def text_infilling(prompt): headers = { "content-type": "application/json", "X-API-Key": API_KEY, "X-API-Host": API_HOST} data = { "model": MODEL_NAME, "inputs": prompt, "parameters": { "max_new_tokens": max_new_tokens, "temperature": temperature, "top_k": top_k, "top_p": top_p, "frequence_penalty": frequence_penalty } } result = '' try: response = requests.post( API_URL + "/models/text_infilling", json=data, headers=headers) if response.status_code == 200: result = json.loads(response.text, strict=False)['generated_text'] else: print("error") except: print("error") return result.strip("<EOT>") text = '''def remove_non_ascii(s: str) -> str: """ <FILL_ME> return result ''' result = text_infilling(text) print(re.sub("<FILL_ME>", result, text)) ``` 輸出: ```python def remove_non_ascii(s: str) -> str: """ Remove non-ascii characters from a string. 
""" result = "" for c in s: if ord(c) < 128: result += c return result ``` ::: ### 使用 Stream 模式 Server-sent event (SSE):伺服器主動向客戶端推送資料,連線建立後,在一步步生成字句的同時也將資料往客戶端拋送,和先前的一次性回覆不同,可加強使用者體驗。若有輸出大量 token 文字的需求,請務必優先採用 Stream 模式,以免遇到 Timeout 的情形。 ```bash= export API_KEY={API_KEY} export API_URL={API_URL} export MODEL_NAME={MODEL_NAME} # model: meta-codellama-7b-instruct curl "${API_URL}/models/text_infilling" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "inputs":"def compute_gcd(x, y):\n <FILL_ME>\n return result\n", "stream":true, "parameters":{ "max_new_tokens":50, "temperature":0.5, "top_k":50, "top_p":1, "frequence_penalty":1}}' ``` <br> 輸出:每個 token 會輸出一筆資料,最末筆則是會將先前生成的文字串成一筆、以及描述 token 個數和所花費的時間秒數。 > data: {"generated_text": "result", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " =", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "1", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " while", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " (", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "x", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " !=", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "0", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": ")", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " and", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " (", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": 
null} > data: {"generated_text": "y", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " !=", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "0", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "):", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " if", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " x", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " >", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " y", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": ":", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " x", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " =", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " x", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " %", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " y", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "\n", "function_call": null, 
"details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " else", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": ":", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " y", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " =", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " y", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " %", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " x", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " result", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " =", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": " x", "function_call": null, "details": null, "total_time_taken": "0.80 sec", "prompt_tokens": 20, "generated_tokens": 50, "total_tokens": 70, "finish_reason": "length"} ::: info :bulb: **提示:注意事項** 1. 每筆 token 並不一定能解碼成合適的文字,如果遇到該種情況,該筆 generated_text 欄位會顯示空字串,該 token 會結合下一筆資料再來解碼,直接能呈現為止。 2. 
本案例採用 [sse-starlette](https://github.com/sysid/sse-starlette),在 SSE 過程中約 15 秒就會收到 ping event,目前在程式中如果連線大於該時間就會收到以下資訊 (非 JSON 格式),在資料處理時需特別注意,下列 python 範例已經有包含此資料處理。 > event: ping > data: 2023-09-26 04:25:08.978531 ::: :::spoiler Python 範例 ```python= import json import requests, re MODEL_NAME = "{MODEL_NAME}" API_KEY = "{API_KEY}" API_URL = "{API_URL}" API_HOST = "afs-inference" # parameters max_new_tokens = 50 temperature = 0.5 top_k = 50 top_p = 1.0 frequence_penalty = 1.0 def text_infilling(prompt): headers = { "content-type": "application/json", "X-API-Key": API_KEY, "X-API-Host": API_HOST} data = { "model": MODEL_NAME, "inputs": prompt, "parameters": { "max_new_tokens": max_new_tokens, "temperature": temperature, "top_k": top_k, "top_p": top_p, "frequence_penalty": frequence_penalty }, "stream": True } messages = [] result = "" try: response = requests.post(API_URL + "/models/text_infilling", json=data, headers=headers, stream=True) if response.status_code == 200: for chunk in response.iter_lines(): chunk = chunk.decode('utf-8') if chunk == "": continue # only check format => data: ${JSON_FORMAT} try: record = json.loads(chunk[5:], strict=False) if "status_code" in record: print("{:d}, {}".format(record["status_code"], record["error"])) break elif record["total_time_taken"] is not None or ("finish_reason" in record and record["finish_reason"] is not None) : message = record["generated_text"] messages.append(message) print(">>> " + message) result = ''.join(messages) break elif record["generated_text"] is not None: message = record["generated_text"] messages.append(message) print(">>> " + message) else: print("error") break except: pass except: print("error") return result.strip("<EOT>") text = """def compute_gcd(x, y): <FILL_ME> return result """ result = text_infilling(text) print(re.sub("<FILL_ME>", result, text)) ``` 輸出: ``` >>> result >>> = >>> >>> 1 >>> >>> >>> while >>> ( >>> x >>> != >>> >>> 0 >>> ) >>> and >>> ( >>> y >>> != >>> >>> 0 >>> ): >>> >>> >>> if >>> x >>> > >>> y >>> : >>> >>> >>> x >>> = >>> x >>> % >>> y >>> >>> >>> else >>> : >>> >>> >>> y >>> = >>> y >>> % >>> x >>> >>> >>> result >>> = >>> x def compute_gcd(x, y): result = 1 while (x != 0) and (y != 0): if x > y: x = x % y else: y = y % x result = x return result ``` ::: ## Embedding (V1) ::: info :bulb: **提示:目前 Embedding API 的限制會受到以下數值影響** * Embedding Model:sequence length = 2048 * Embedding API 可以支援 Batch Inference,每筆長度不超過 2048 tokens。 ::: ### Curl 使用方式 1. 設定環境 ```= export API_KEY={API_KEY} export API_URL={API_URL} export MODEL_NAME={MODEL_NAME} ``` 2. 透過 `curl` 指令取得 embedding 結果 **使用範例** ```= curl "${API_URL}/models/embeddings" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "inputs": ["search string 1", "search string 2"] }' ``` **回傳範例** ```= { "data": [ { "embedding": [ 0.06317982822656631, -0.5447818636894226, -0.3353637158870697, -0.5117015838623047, -0.1446804255247116, 0.2036416381597519, -0.20317679643630981, -0.9627353549003601, 0.31771183013916016, 0.23493929207324982, 0.18029260635375977, ... ... ], "index": 0, "object": "embedding" }, { "embedding": [ 0.15340591967105865, -0.26574525237083435, -0.3885045349597931, -0.2985926568508148, 0.22742436826229095, -0.42115798592567444, -0.10134009271860123, -1.0426620244979858, 0.507709264755249, -0.3479543924331665, -0.09303411841392517, 1.0853372812271118, 0.7396582961082458, 0.266722172498703, ... ... 
], "index": 1, "object": "embedding" } ], "total_time_taken": "0.06 sec", "usage": { "prompt_tokens": 6, "total_tokens": 6 } } ``` ### LangChain 使用方式 :::spoiler **Custom Embedding Model Wrapper** ```python= """Wrapper Embedding model APIs.""" import json import requests from typing import List from pydantic import BaseModel from langchain.embeddings.base import Embeddings import os class CustomEmbeddingModel(BaseModel, Embeddings): base_url: str = "http://localhost:12345" api_key: str = "" model: str = "" def get_embeddings(self, payload): endpoint_url=f"{self.base_url}/models/embeddings" embeddings = [] headers = { "Content-type": "application/json", "accept": "application/json", "X-API-KEY": self.api_key, "X-API-HOST": "afs-inference" } response = requests.post(endpoint_url, headers=headers, data=payload) body = response.json() datas = body["data"] for data in datas: embeddings.append(data["embedding"]) return embeddings def embed_documents(self, texts: List[str]) -> List[List[float]]: payload = json.dumps({"model": self.model, "inputs": texts}) return self.get_embeddings(payload) def embed_query(self, text: str) -> List[List[float]]: payload = json.dumps({"model": self.model, "inputs": [text]}) emb = self.get_embeddings(payload) return emb[0] ``` ::: * 完成以上封裝後,就可以在 LangChain 中直接使用 CustomEmbeddingModel 來完成特定的大語言模型任務。 ::: info :bulb: **提示:** 更多資訊,請參考 [**LangChain Custom LLM 文件**](https://python.langchain.com/docs/modules/model_io/models/llms/custom_llm)。 ::: #### 單一字串 * 單一字串取得 embeddings,使用 **`embed_query()`** 函式,並返回結果。 ```python= API_KEY={API_KEY} API_URL={API_URL} MODEL_NAME={MODEL_NAME} embeddings = CustomEmbeddingModel( base_url = API_URL, api_key = API_KEY, model = MODEL_NAME, ) print(embeddings.embed_query("請問台灣最高的山是?")) ``` 輸出: > [-1.1431972, -4.723901, 2.3445783, -2.19996, ......, 1.0784563, -3.4114947, -2.5193133] #### 多字串 * 多字串取得 embeddings,使用 **`embed_documents()`** 函式,會一次返回全部結果。 ```python= API_KEY={API_KEY} API_URL={API_URL} MODEL_NAME={MODEL_NAME} embeddings = CustomEmbeddingModel( base_url = API_URL, api_key = API_KEY, model = MODEL_NAME, ) print(embeddings.embed_documents(["test1", "test2", "test3"])) ``` 輸出: > [[-0.14880371, ......, 0.7011719], [-0.023590088, ...... , 0.49320474], [-0.86242676, ......, 0.22867839]] ## Embedding (V2) ::: info :bulb: **提示:目前 Embedding API 的限制會受到以下數值影響** * Embedding Model:sequence length = 131072 * Embedding API 可以支援 Batch Inference,每筆長度不超過 131072 tokens。 ::: ### Curl 使用方式 1. 設定環境 ```= export API_KEY={API_KEY} export API_URL={API_URL} export MODEL_NAME={MODEL_NAME} ``` 2. 透過 `curl` 指令取得 embedding 結果,input_type為V2新增的參數,說明如下。 1. 值只能設定為"query"或者是"document"。 2. 此參數不是必要,如果沒有設定,預設是"document"。 3. 值為"query"時,系統會自動將每一筆input加上前綴語句來加強embedding正確性。 4. 值為"document"時,input維持原本,不加前綴語句。 **使用範例** ```= curl "${API_URL}/models/embeddings" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "inputs": ["search string 1", "search string 2"], "parameters": { "input_type": "document" } }' ``` **回傳範例** ```= { "data": [ { "embedding": [ 0.015003109350800514, 0.002964278217405081, 0.025576837360858917, 0.0009064615005627275, 0.00896097905933857, -0.010766804218292236, 0.022567130625247955, -0.020284295082092285, -0.004011997487396002, -0.01566183753311634, -0.016150206327438354, -0.008938264101743698, 0.010346580296754837, 0.010187577456235886, ... ... 
], "index": 0, "object": "embedding" }, { "embedding": [ 0.013649762608110905, 0.003280752571299672, 0.024047400802373886, 0.005184505134820938, 0.009756374172866344, -0.009389937855303288, 0.027826279401779175, -0.016409488394856453, 0.0020984220318496227, -0.0180928073823452, -0.014462794177234173, -0.006956569850444794, 0.013260424137115479, 0.018184415996074677, ... ... ], "index": 1, "object": "embedding" } ], "total_time_taken": "0.05 sec", "usage": { "prompt_tokens": 8, "total_tokens": 8 } } ``` 3. V2支援OpenAI Embedding API的參數如下 1. input: 目標字串list。 2. encoding_format: 可以設定為"float"或是"base64",設為"base64"代表會將向量結果轉成base64格式再輸出,預設值是"float"。 3. dimensions: 可以設定最多輸出多少維度的向量,例如設為4,那就只會輸出前四維度的向量,預設值是0,0代表輸出最大維度的向量。 **使用範例** ```= curl "${API_URL}/models/embeddings" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "input": ["search string 1", "search string 2"], "encoding_format": "base64", "dimensions": 4 }' ``` **回傳範例** ```= { "data": [ { "object": "embedding", "embedding": "pR0QPOuoY7sjFQM9U92HOw==", "index": 0 }, { "object": "embedding", "embedding": "6BXdOxIpD7vfHgA9suTyOw==", "index": 1 } ], "total_time_taken": "0.04 sec", "usage": { "prompt_tokens": 8, "total_tokens": 8 } } ``` 4. V2輸出的向量結果現在改為預設會做normalize,行為跟OpenAI一致。如果客戶想要保持跟之前一樣輸出為未做normalize的結果,可以在parameters底下新增一個參數"normalize",並將值設成false,如下所示。 **使用範例** ```= curl "${API_URL}/models/embeddings" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "input": ["search string 1", "search string 2"], "parameters": { "normalize": false } "encoding_format": "base64", "dimensions": 4 }' ``` ### LangChain 使用方式 :::spoiler **Custom Embedding Model Wrapper** ```python= """Wrapper Embedding model APIs.""" import json import requests from typing import List from pydantic import BaseModel from langchain.embeddings.base import Embeddings import os class CustomEmbeddingModel(BaseModel, Embeddings): base_url: str = "http://localhost:12345" api_key: str = "" model: str = "" def get_embeddings(self, payload): endpoint_url=f"{self.base_url}/models/embeddings" embeddings = [] headers = { "Content-type": "application/json", "accept": "application/json", "X-API-KEY": self.api_key, "X-API-HOST": "afs-inference" } response = requests.post(endpoint_url, headers=headers, data=payload) body = response.json() datas = body["data"] for data in datas: embeddings.append(data["embedding"]) return embeddings def embed_documents(self, texts: List[str]) -> List[List[float]]: payload = json.dumps({"model": self.model, "inputs": texts, "parameters": {"input_type": "query"}}) return self.get_embeddings(payload) def embed_query(self, text: str) -> List[List[float]]: payload = json.dumps({"model": self.model, "inputs": [text], "parameters": {"input_type": "query"}}) emb = self.get_embeddings(payload) return emb[0] ``` ::: * 完成以上封裝後,就可以在 LangChain 中直接使用 CustomEmbeddingModel 來完成特定的大語言模型任務。 ::: info :bulb: **提示:** 更多資訊,請參考 [**LangChain Custom LLM 文件**](https://python.langchain.com/docs/how_to/custom_llm/)。 ::: #### 單一字串 * 單一字串取得 embeddings,使用 **`embed_query()`** 函式,並返回結果。 ```python= API_KEY={API_KEY} API_URL={API_URL} MODEL_NAME={MODEL_NAME} embeddings = CustomEmbeddingModel( base_url = API_URL, api_key = API_KEY, model = MODEL_NAME, ) print(embeddings.embed_query("請問台灣最高的山是?")) ``` 輸出: > [-0.023335948586463928, 0.02815871126949787, 0.03960443660616875, 0.012845884077250957, ......, 0.010695642791688442, 
0.001966887153685093, 0.008934334851801395]

#### 多字串

* 多字串取得 embeddings,使用 **`embed_documents()`** 函式,會一次返回全部結果。

```python=
API_KEY={API_KEY}
API_URL={API_URL}
MODEL_NAME={MODEL_NAME}

embeddings = CustomEmbeddingModel(
    base_url = API_URL,
    api_key = API_KEY,
    model = MODEL_NAME,
)

print(embeddings.embed_documents(["test1", "test2", "test3"]))
```

輸出:
> [[-0.007434912957251072, ......, 0.009466814808547497], [-0.006574439350515604, ...... , 0.008274043910205364], [-0.005750700831413269, ......, 0.009992048144340515]]

## Rerank

Rerank API 的核心功能是利用機器學習和自然語言處理技術,根據給定的模型對輸入的文本進行重新排序(rerank),模型會對每個候選答案進行評分,分數越高表示該答案與查詢的相關性越高。常用於資訊檢索、推薦系統和自然語言處理任務中,基於某種評分或評估標準,對初步排序的結果進行進一步優化,以提供更符合使用者期望的資訊。

### 使用情境

Rerank API 可以應用在各種檢索相關的使用情境中,例如:

* 資訊檢索系統:對初步檢索出的結果進行重新排序,提升最相關結果的排名。
* 問答系統:從多個潛在答案中選出最相關和正確的答案。
* 推薦系統:根據用戶偏好對推薦結果進行重新排序,提供更個性化的推薦。
* 文本匹配:在文本相似度計算中,對多個候選匹配結果進行排序,選擇最相似的文本。

### 使用範例

以下是一個使用 curl 指令調用 Rerank API 的範例,並對三組查詢和答案進行重新排序。

::: info
:bulb: **提示:目前 Rerank API 的限制會受到以下數值影響**
* Rerank Model:sequence length = 8192
* Rerank API 可以支援 Batch Inference,每筆長度不超過 8192 tokens。
:::

1. 設定環境
```=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
```
2. 透過 `curl` 指令取得 rerank 結果,預設回傳結果為前三名的分數及 index,如果需要更多或是更少,可以透過 parameters 下的 `top_n` 參數來調整。另外輸入格式支援兩種方式:第一種是以 `inputs` 參數來指定查詢及候選答案兩兩一組的 list,請參考以下範例 1;第二種是以 `query` 參數來指定查詢,並以 `documents` 參數來指定所有候選答案,請參考以下範例 2。範例 1 輸出的答案不會以分數排序,而範例 2 輸出的答案會依分數排序,並依照 `top_n` 參數給予前幾名的分數結果。

**使用範例1**
```=
curl "${API_URL}/models/rerank" \
 -H "X-API-KEY:${API_KEY}" \
 -H "X-API-HOST: afs-inference" \
 -H "content-type: application/json" \
 -d '{
  "model": "'${MODEL_NAME}'",
  "inputs": [
   [ "Where is the capital of Canada?", "Europe is a continent." ],
   [ "Where is the capital of Canada?", "The capital of Canada is Ottawa." ],
   [ "Where is the capital of Canada?", "Canada is a big country." ]
  ]
 }'
```
**回傳範例**
```=
{
 "scores": [
  0.000016571451851632446,
  0.9998936653137207,
  0.040769271552562714
 ],
 "total_time_taken": "0.07 sec",
 "usage": {
  "prompt_tokens": 41,
  "total_tokens": 41
 }
}
```
**使用範例2**
```=
curl "${API_URL}/models/rerank" \
 -H "X-API-KEY:${API_KEY}" \
 -H "X-API-HOST: afs-inference" \
 -H "content-type: application/json" \
 -d '{
  "model": "'${MODEL_NAME}'",
  "query": "Where is the capital of Canada?",
  "documents": [
   "Europe is a continent.",
   "The capital of Canada is Ottawa.",
   "Canada is a big country."
  ],
  "parameters": {
   "top_n": 2
  }
 }'
```
**回傳範例**
```=
{
 "results": [
  {
   "score": 0.9998936653137207,
   "index": 1
  },
  {
   "score": 0.040769271552562714,
   "index": 2
  }
 ],
 "total_time_taken": "0.07 sec",
 "usage": {
  "prompt_tokens": 41,
  "total_tokens": 41
 }
}
```

以上範例向 Rerank API 傳遞了三組查詢和候選答案:

1. "Where is the capital of Canada?" 與 "Europe is a continent."
2. "Where is the capital of Canada?" 與 "The capital of Canada is Ottawa."
3. "Where is the capital of Canada?" 與 "Canada is a big country."

Rerank API 回傳了前兩名候選答案的相關性分數,第二個答案的分數(0.9998936653137207)顯然高於第三個答案的分數(0.040769271552562714),表示 "The capital of Canada is Ottawa." 更加相關和正確。開發者可以根據這些分數對候選答案進行排序,選擇最相關的結果。因此,這種方法能顯著提高資訊檢索和問答系統的準確性和有效性。
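
以下提供一個以 Python 呼叫 Rerank API 的參考寫法(僅為示意),沿用上方使用範例 2 的 `query` / `documents` / `top_n` 輸入與回傳格式,欄位名稱皆以本節的 curl 範例為準。

```python=
import requests

MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"

def rerank(query, documents, top_n=3):
    headers = {
        "content-type": "application/json",
        "X-API-KEY": API_KEY,
        "X-API-HOST": API_HOST}
    data = {
        "model": MODEL_NAME,
        "query": query,
        "documents": documents,
        "parameters": {"top_n": top_n}}
    response = requests.post(API_URL + "/models/rerank", json=data, headers=headers)
    response.raise_for_status()
    # 依使用範例 2 的回傳格式,results 內為依分數排序的 score 與 index
    return response.json()["results"]

query = "Where is the capital of Canada?"
documents = [
    "Europe is a continent.",
    "The capital of Canada is Ottawa.",
    "Canada is a big country."]

for item in rerank(query, documents, top_n=2):
    print(item["index"], item["score"], documents[item["index"]])
```

執行後會依分數由高至低列出前 `top_n` 筆結果的 index、分數與對應的文件內容。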
## Generate(請優先使用 [Conversation](#Conversation))

### 一般使用

```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}

# model: ffm-mixtral-8x7b-32k-instruct
curl "${API_URL}/models/generate" \
 -H "X-API-KEY:${API_KEY}" \
 -H "X-API-HOST: afs-inference" \
 -H "content-type: application/json" \
 -d '{
  "model": "'${MODEL_NAME}'",
  "inputs":"從前從前,有位老太太去河邊",
  "parameters":{
   "max_new_tokens":200,
   "temperature":0.5,
   "top_k":50,
   "top_p":1,
   "frequence_penalty":1}}'
```

輸出:包括生成的文字、token 個數以及所花費的時間秒數。

> { "generated_text": ",她洗完衣服後,要把衣服晾在河邊的一棵樹上。但老太太又老又弱,她爬不上樹,於是她決定把衣服掛在樹枝上。老太太拿起衣服,開始往樹枝上掛衣服,但她掛了幾件衣服後,樹枝斷了,所有的衣服都掉到河裡去了。老太太看到這一幕,非常傷心,她說:「我的衣服都掉到河裡去了!」老太太的孫女看到這一幕,心想:「我可以幫助我的祖母,我可以幫助她把衣服掛在樹枝上。」於是,這位孫女走到河邊,開始幫助她的祖母把衣服掛在樹枝上", "function_call": null, "details": null, "total_time_taken": "18.88 sec", "prompt_tokens": 17, "generated_tokens": 200, "total_tokens": 217, "finish_reason": "length" }

:::spoiler Python 範例
```python=
import json
import requests

MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"

# parameters
max_new_tokens = 200
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0

def generate(prompt):
    headers = {
        "content-type": "application/json",
        "X-API-Key": API_KEY,
        "X-API-Host": API_HOST}
    data = {
        "model": MODEL_NAME,
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_k": top_k,
            "top_p": top_p,
            "frequence_penalty": frequence_penalty
        }
    }
    result = ''
    try:
        response = requests.post(
            API_URL + "/models/generate", json=data, headers=headers)
        if response.status_code == 200:
            result = json.loads(response.text, strict=False)['generated_text']
        else:
            print("error")
    except:
        print("error")
    return result.strip("\n")

result = generate("從前從前,有位老太太去河邊")
print(result)
```
輸出:
> ,她洗完衣服後,要把衣服晾在河邊的一棵樹上。但老太太又老又弱,她爬不上樹,於是她決定把衣服掛在樹枝上。老太太拿起衣服,開始往樹枝上掛衣服,但她掛了幾件衣服後,樹枝斷了,所有的衣服都掉到河裡去了。老太太看到這一幕,非常傷心,她說:「我的衣服都掉到河裡去了!」老太太的孫女看到這一幕,心想:「我可以幫助我的祖母,我可以幫助她把衣服掛在樹枝上。」於是,這位孫女走到河邊,開始幫助她的祖母把衣服掛在樹枝上
:::

```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}

# model: ffm-mixtral-8x7b-32k-instruct
curl "${API_URL}/models/generate" \
 -H "X-API-KEY:${API_KEY}" \
 -H "X-API-HOST: afs-inference" \
 -H "content-type: application/json" \
 -d '{
  "model": "'${MODEL_NAME}'",
  "inputs":"可以幫我規劃台北兩日遊,並推薦每天的景點及說明其特色嗎?",
  "parameters":{
   "max_new_tokens":350,
   "temperature":0.5,
   "top_k":50,
   "top_p":1,
   "frequence_penalty":1}}'
```

輸出:包括生成的文字、token 個數以及所花費的時間秒數。

> { "generated_text": "答案是: 第一天: 1. 台北101觀景台 - 這是台北最受歡迎的景點之一,提供城市天際線的壯觀景色。 2. 國立故宮博物院 - 這是世界上最著名的藝術和文物收藏之一,展示了中國豐富的文化遺產。 3. 台北中正紀念堂 - 這是一座宏偉的紀念堂,致力於紀念中華民國前總統蔣中正。 4. 台北國立台灣博物館 - 這是一個展示台灣歷史、文化和藝術的博物館。 5. 台北夜市 - 這是一個熱鬧的夜市,提供各種街頭美食、購物和娛樂。 第二天: 1. 陽明山國家公園 - 這是一個美麗的國家公園,提供令人驚嘆的台北市區景色。 2. 台北101觀景台 - 這是另一個觀景台,提供城市天際線的壯觀景色。 3. 台北國立台灣博物館 - 這是另一個博物館,展示台灣歷史、文化和藝術。 4. 
台北故宮 - 這是另一個展示中國豐富文化遺產的", "function_call": null, "details": null, "total_time_taken": "33.08 sec", "prompt_tokens": 31, "generated_tokens": 350, "total_tokens": 381, "finish_reason": "length" } :::spoiler Python 範例 ```python= import json import requests MODEL_NAME = "{MODEL_NAME}" API_KEY = "{API_KEY}" API_URL = "{API_URL}" API_HOST = "afs-inference" # parameters max_new_tokens = 350 temperature = 0.5 top_k = 50 top_p = 1.0 frequence_penalty = 1.0 def generate(prompt): headers = { "content-type": "application/json", "X-API-Key": API_KEY, "X-API-Host": API_HOST} data = { "model": MODEL_NAME, "inputs": prompt, "parameters": { "max_new_tokens": max_new_tokens, "temperature": temperature, "top_k": top_k, "top_p": top_p, "frequence_penalty": frequence_penalty } } result = '' try: response = requests.post( API_URL + "/models/generate", json=data, headers=headers) if response.status_code == 200: result = json.loads(response.text, strict=False)['generated_text'] else: print("error") except: print("error") return result.strip("\n") result = generate("可以幫我規劃台北兩日遊,並推薦每天的景點及說明其特色嗎?") print(result) ``` 輸出: > 答案是: 第一天: 1. 台北101觀景台 - 這是台北最受歡迎的景點之一,提供城市天際線的壯觀景色。 2. 國立故宮博物院 - 這是世界上最著名的藝術和文物收藏之一,展示了中國豐富的文化遺產。 3. 台北中正紀念堂 - 這是一座宏偉的紀念堂,致力於紀念中華民國前總統蔣中正。 4. 台北國立台灣博物館 - 這是一個展示台灣歷史、文化和藝術的博物館。 5. 台北夜市 - 這是一個熱鬧的夜市,提供各種街頭美食、購物和娛樂。 第二天: 1. 陽明山國家公園 - 這是一個美麗的國家公園,提供令人驚嘆的台北市區景色。 2. 台北101觀景台 - 這是另一個觀景台,提供城市天際線的壯觀景色。 3. 台北國立台灣博物館 - 這是另一個博物館,展示台灣歷史、文化和藝術。 4. 台北故宮 - 這是另一個展示中國豐富文化遺產的 ::: ### 使用 Stream 模式 Server-sent event (SSE):伺服器主動向客戶端推送資料,連線建立後,在一步步生成字句的同時也將資料往客戶端拋送,和先前的一次性回覆不同,可加強使用者體驗。若有輸出大量 token 文字的需求,請務必優先採用 Stream 模式,以免遇到 Timeout 的情形。 ```bash= export API_KEY={API_KEY} export API_URL={API_URL} export MODEL_NAME={MODEL_NAME} # model: ffm-mixtral-8x7b-32k-instruct curl "${API_URL}/models/generate" \ -H "X-API-KEY:${API_KEY}" \ -H "X-API-HOST: afs-inference" \ -H "content-type: application/json" \ -d '{ "model": "'${MODEL_NAME}'", "inputs":"台灣最高峰是", "stream":true, "parameters":{ "max_new_tokens":2, "temperature":0.5, "top_k":50, "top_p":1, "frequence_penalty":1}}' ``` 輸出:每個 token 會輸出一筆資料,最末筆則是會將先前生成的文字串成一筆、以及描述 token 個數和所花費的時間秒數。 > data: {"generated_text": "玉", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null} > data: {"generated_text": "山", "function_call": null, "details": null, "total_time_taken": "0.25 sec", "prompt_tokens": 7, "generated_tokens": 2, "total_tokens": 9, "finish_reason": "length"} ::: info :bulb: **提示:注意事項** 1. 每筆 token 並不一定能解碼成合適的文字,如果遇到該種情況,該筆 generated_text 欄位會顯示空字串,該 token 會結合下一筆資料再來解碼,直接能呈現為止。 2. 
:::spoiler Python 範例
```python=
import json
import requests

MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"

# parameters
max_new_tokens = 2
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0

def generate(prompt):
    headers = {
        "content-type": "application/json",
        "X-API-Key": API_KEY,
        "X-API-Host": API_HOST}

    data = {
        "model": MODEL_NAME,
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_k": top_k,
            "top_p": top_p,
            "frequence_penalty": frequence_penalty
        },
        "stream": True
    }

    messages = []
    result = ""
    try:
        response = requests.post(API_URL + "/models/generate", json=data, headers=headers, stream=True)
        if response.status_code == 200:
            for chunk in response.iter_lines():
                chunk = chunk.decode('utf-8')
                if chunk == "":
                    continue
                try:
                    record = json.loads(chunk[5:], strict=False)
                    if "status_code" in record:
                        print("{:d}, {}".format(record["status_code"], record["error"]))
                        break
                    elif record["total_time_taken"] is not None or ("finish_reason" in record and record["finish_reason"] is not None):
                        message = record["generated_text"]
                        messages.append(message)
                        print(">>> " + message)
                        result = ''.join(messages)
                        break
                    elif record["generated_text"] is not None:
                        message = record["generated_text"]
                        messages.append(message)
                        print(">>> " + message)
                    else:
                        print("error")
                        break
                except:
                    pass
    except:
        print("error")

    return result.strip("\n")

result = generate("台灣最高峰是")
print(result)
```
:::

<br>

輸出:
```
>>> 玉
>>> 山
玉山
```
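若想在自己的應用程式中逐 token 取用串流結果,也可以把上述處理包成 Python generator。以下為依前述 SSE 格式(`data:` 前綴、ping event)撰寫的簡化示意,`stream_generate` 為假設性的輔助函式名稱,未包含完整錯誤處理,實際欄位請以服務回傳為準。

```python=
import json
import requests

MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"

def stream_generate(prompt, max_new_tokens=50):
    """以 generator 形式逐一傳回串流生成的 token 文字(示意用)。"""
    headers = {
        "content-type": "application/json",
        "X-API-Key": API_KEY,
        "X-API-Host": API_HOST}
    data = {
        "model": MODEL_NAME,
        "inputs": prompt,
        "stream": True,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.5}
    }

    with requests.post(API_URL + "/models/generate", json=data, headers=headers, stream=True) as response:
        for chunk in response.iter_lines():
            chunk = chunk.decode("utf-8")
            if not chunk.startswith("data:"):
                continue  # 跳過空行與 event: ping 等非資料列
            try:
                record = json.loads(chunk[5:], strict=False)
            except json.JSONDecodeError:
                continue  # ping event 的 data 列並非 JSON,直接略過
            text = record.get("generated_text")
            if text:
                yield text
            if record.get("finish_reason") is not None:
                break  # 最末筆資料,結束串流

for token in stream_generate("台灣最高峰是"):
    print(token, end="", flush=True)
print()
```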
### LangChain 使用方式

:::spoiler **Custom LLM Model Wrapper**
```python=
from typing import Any, Dict, List, Mapping, Optional, Tuple
from langchain.llms.base import BaseLLM
import requests
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.schema.language_model import BaseLanguageModel
from langchain.schema import Generation, LLMResult
from pydantic import Field
import json
import os


class _FormosaFoundationCommon(BaseLanguageModel):
    base_url: str = "http://localhost:12345"
    """Base url the model is hosted under."""

    model: str = "ffm-mixtral-8x7b-32k-instruct"
    """Model name to use."""

    temperature: Optional[float]
    """The temperature of the model. Increasing the temperature will
    make the model answer more creatively."""

    stop: Optional[List[str]]
    """Sets the stop tokens to use."""

    top_k: int = 50
    """Reduces the probability of generating nonsense. A higher value (e.g. 100)
    will give more diverse answers, while a lower value (e.g. 10) will be more
    conservative. (Default: 50)"""

    top_p: float = 1
    """Works together with top-k. A higher value (e.g., 0.95) will lead to more
    diverse text, while a lower value (e.g., 0.5) will generate more focused and
    conservative text. (Default: 1)"""

    max_new_tokens: int = 350
    """The maximum number of tokens to generate in the completion.
    -1 returns as many tokens as possible given the prompt and
    the model's maximal context size."""

    frequence_penalty: float = 1
    """Penalizes repeated tokens according to frequency."""

    model_kwargs: Dict[str, Any] = Field(default_factory=dict)
    """Holds any model parameters valid for `create` call not explicitly specified."""

    ffm_api_key: Optional[str] = None

    @property
    def _default_params(self) -> Dict[str, Any]:
        """Get the default parameters for calling FFM API."""
        normal_params = {
            "temperature": self.temperature,
            "max_new_tokens": self.max_new_tokens,
            "top_p": self.top_p,
            "frequence_penalty": self.frequence_penalty,
            "top_k": self.top_k,
        }
        return {**normal_params, **self.model_kwargs}

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        if self.stop is not None and stop is not None:
            raise ValueError("`stop` found in both the input and default params.")
        elif self.stop is not None:
            stop = self.stop
        elif stop is None:
            stop = []

        params = {**self._default_params, "stop": stop, **kwargs}
        parameter_payload = {"parameters": params, "inputs": prompt, "model": self.model}

        # HTTP headers for authorization
        headers = {
            "X-API-KEY": self.ffm_api_key,
            "Content-Type": "application/json",
            "X-API-HOST": "afs-inference"
        }
        endpoint_url = f"{self.base_url}/models/generate"

        # send request
        try:
            response = requests.post(
                url=endpoint_url,
                headers=headers,
                data=json.dumps(parameter_payload, ensure_ascii=False).encode("utf8"),
                stream=False,
            )
            response.encoding = "utf-8"
            generated_text = response.json()
            if response.status_code != 200:
                detail = generated_text.get("detail")
                raise ValueError(
                    f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
                    f"error raised with status code {response.status_code}\n"
                    f"Details: {detail}\n"
                )
        except requests.exceptions.RequestException as e:
            raise ValueError(f"FormosaFoundationModel error raised by inference endpoint: {e}\n")

        if generated_text.get("detail") is not None:
            detail = generated_text["detail"]
            raise ValueError(
                f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
                f"error raised by inference API: {detail}\n"
            )

        if generated_text.get("generated_text") is None:
            raise ValueError(
                f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
                f"Response format error: {generated_text}\n"
            )

        return generated_text


class FormosaFoundationModel(BaseLLM, _FormosaFoundationCommon):
    """Formosa Foundation Model

    Example:
        .. code-block:: python

            ffm = FormosaFoundationModel(model="ffm-mixtral-8x7b-32k-instruct")
    """

    @property
    def _llm_type(self) -> str:
        return "FormosaFoundationModel"

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {
            **{
                "model": self.model,
                "base_url": self.base_url
            },
            **self._default_params
        }

    def _generate(
        self,
        prompts: List[str],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> LLMResult:
        """Call out to FormosaFoundationModel's generate endpoint.

        Args:
            prompts: The prompts to pass into the model.
            stop: Optional list of stop words to use when generating.

        Returns:
            An LLMResult containing the generated text.

        Example:
            .. code-block:: python

                response = FormosaFoundationModel("Tell me a joke.")
        """
        generations = []
        token_usage = 0
        for prompt in prompts:
            final_chunk = super()._call(
                prompt,
                stop=stop,
                **kwargs,
            )
            generations.append(
                [
                    Generation(
                        text=final_chunk["generated_text"],
                        generation_info=dict(
                            finish_reason=final_chunk["finish_reason"]
                        )
                    )
                ]
            )
            token_usage += final_chunk["generated_tokens"]

        llm_output = {"token_usage": token_usage, "model": self.model}
        return LLMResult(generations=generations, llm_output=llm_output)
```
:::
* 完成以上封裝後,就可以在 LangChain 中使用 FFM 大語言模型。

::: info
:bulb: **提示:** 更多資訊,請參考 [**LangChain Custom LLM 文件**](https://python.langchain.com/docs/modules/model_io/models/llms/custom_llm)。
:::

```python=
MODEL_NAME = "ffm-mixtral-8x7b-32k-instruct"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"

ffm = FormosaFoundationModel(
    base_url = API_URL,
    max_new_tokens = 350,
    temperature = 0.5,
    top_k = 50,
    top_p = 1.0,
    frequence_penalty = 1.0,
    ffm_api_key = API_KEY,
    model = MODEL_NAME
)

print(ffm("請問台灣最高的山是?"))
```

輸出:
> 答案是:玉山。
>
> 玉山,也被稱為玉山國家公園,位於台灣南部,是該國最高的山,海拔3952米(12966英尺)。它是台灣阿里山山脈的一部分,以其崎嶇的地形、翠綠的森林和多種植物和動物而聞名。玉山是徒步旅行者和自然愛好者的熱門目的地,被認為是台灣最美麗和最具挑戰性的山之一。
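封裝完成後,也可以把 `FormosaFoundationModel` 接上 LangChain 的其他元件。以下為一個搭配 `PromptTemplate` 與 `LLMChain` 的簡化示意,假設已先執行前述 wrapper 程式碼,且使用與本文件相同版本的 LangChain 套件;範例中的提問模板與參數僅供參考。

```python=
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

MODEL_NAME = "ffm-mixtral-8x7b-32k-instruct"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"

# 沿用前面定義的 FormosaFoundationModel wrapper
ffm = FormosaFoundationModel(
    base_url=API_URL,
    max_new_tokens=350,
    temperature=0.5,
    ffm_api_key=API_KEY,
    model=MODEL_NAME
)

# 以 PromptTemplate 固定提問格式,再透過 LLMChain 呼叫 FFM 模型
prompt = PromptTemplate(
    input_variables=["city"],
    template="請用三句話介紹{city}最值得造訪的景點。"
)
chain = LLMChain(llm=ffm, prompt=prompt)

print(chain.run(city="台北"))
```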