###### tags: `AFS`
# AFS API Documentation
<sup style="color:gray">V5, last updated: 2024/07/12 13:00</sup>
## Getting the Required Information
:::spoiler **AFS ModelSpace Public Mode**
1. **`API_URL`**: Copy it from the upper-right corner of the **Public Mode - API Key Management** page in AFS ModelSpace.
2. **`MODEL_NAME`**: See the [**model name reference table**](https://docs.twcc.ai/docs/user-guides/twcc/afs/afs-modelspace/available-model).
3. **`API_KEY`**: Obtain it from the list on the **Public Mode - API Key Management** page.
:::
:::spoiler **AFS ModelSpace Private Mode**
1. **`API_URL`**: The **`API endpoint URL`**, which can be copied from the service's details page.
2. **`MODEL_NAME`**: The **`model name`**, which can be copied from the service's details page.
3. **`API_KEY`**: {API_KEY}. Log in to the service's test interface and click the account information at the upper right to see the `API key`.
<br><br>
:::
:::spoiler **AFS Cloud**
1. **`API_URL`**: The **`API endpoint URL`**, which can be copied from the service's details page.
2. **`MODEL_NAME`**: The **`model name`**, which can be copied from the service's details page.
3. **`API_KEY`**: {API_KEY}. Log in to the service's test interface and click the account information at the upper right to see the `API key`.
<br><br>
:::
## Parameters
### Request parameters
- `max_new_tokens`: The maximum number of tokens that can be generated in one call.
    - Default: 350
    - Range: an integer greater than 0
::: warning
:warning: **Note: usage limit**
Each model has a limit on input + output tokens. If the number of input tokens plus the number of tokens expected to be generated exceeds that limit, an error occurs.
- Mistral (7B) / Mixtral (8x7B): 32768 tokens
- Llama3 (8B / 70B): 8192 tokens
- Llama2-V2 / Llama2 (7B / 13B / 70B) / Taide-LX (7B): 4096 tokens
- CodeLlama (7B / 13B / 34B): 8192 tokens
:::
- `temperature`: Generation creativity, i.e. the randomness and diversity of the generated text. Higher values make the text more creative and diverse; lower values are more conservative and closer to the text the model was trained on.
    - Default: 1.0
    - Range: a decimal greater than 0
- `top_p`: Stops adding candidate tokens once their cumulative probability reaches or exceeds this value. Higher values make the generated text more diverse; lower values make it more conservative.
    - Default: 1.0
    - Range: a decimal greater than 0 and less than or equal to 1
- `top_k`: Restricts the model to choosing among the K tokens with the highest probability. Higher values make the generated text more diverse; lower values make it more conservative.
    - Default: 50
    - Range: an integer between 1 and 100 (inclusive)
- `frequence_penalty`: Repetition penalty that controls the probability of generating repeated tokens. Higher values reduce how often repeated tokens appear.
    - Default: 1.0
    - Range: a decimal greater than 0
- `stop_sequences`: Generation stops when any of these sequences is encountered; the sequence itself is not included in the output.
    - Default: none
    - Range: up to four sequences, e.g. ["stop", "end"]
- `show_probabilities`: Whether to return the log probability of each generated token, i.e. the probability of generating that token given the preceding text, expressed as a logarithm.
    - Default: false
    - Range: true or false
- `seed`: Random seed; repeated requests with the same seed and parameters return the same result. Setting it to null means random.
    - Default: 42
    - Range: null, or an integer greater than or equal to 0
### Request parameter tuning tips
* Adjusting `temperature` affects how creative the answers are.
    - For focused, non-creative answers, lower `temperature` to 0.1-0.2.
    - For diverse, highly creative answers, raise `temperature` to 1.
* If you still want to fine-tune `top_k` and `top_p` after that, adjust `top_k` first and change `top_p` last.
* When the answer contains highly repeated tokens, set the repetition penalty `frequence_penalty` to 1.03, and at most 1.2; higher values tend to degrade quality. A parameter sketch that applies these tips follows.
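
The following is a minimal sketch of a request `parameters` object that applies the tuning tips above to a focused, factual task. The values are illustrative choices based on the defaults and ranges documented on this page, not required settings.
```python=
# A sketch of a "parameters" block tuned for a focused, factual answer.
# All values are illustrative; adjust them to your task and model limits.
parameters = {
    "max_new_tokens": 350,      # keep input + output within the model's token limit
    "temperature": 0.2,         # low temperature for a single, factual answer
    "top_k": 50,                # tune top_k before touching top_p
    "top_p": 1.0,
    "frequence_penalty": 1.03,  # mild repetition penalty; avoid values above 1.2
    "seed": 42                  # fixed seed for reproducible results
}
```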
### Response parameters
- `function_call`: The function call returned by the model when Function Calling is used; null otherwise.
- `details`: Extra detail for specific needs. For example, if the request parameter show_probabilities is true, `details` contains the log probabilities of the generated text.
- `total_time_taken`: Total time, in seconds, spent on this API call.
- `prompt_tokens`: Number of prompt tokens (input tokens) for this call, i.e. the total tokens of the system prompt, the human / assistant turns in the conversation history, and the current question or input.
- `generated_tokens`: Number of generated tokens (output tokens) for this call, i.e. the total tokens in the model's reply. The larger this number, the longer the total elapsed time. (If you need to generate a large amount of text, use Stream mode to avoid timeouts.)
- `total_tokens`: Total tokens for this call (input + output tokens).
- `finish_reason`: Why generation ended. For example, "eos\_token" means the model finished generating and "function\_call" means a function call was fully generated. A small sketch of reading these fields follows.
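
The following is a minimal sketch of reading these response fields in Python; `response_json` stands for the decoded JSON body of a Conversation call, as shown in the examples below.
```python=
# A sketch of summarizing the response fields described above.
# "response_json" is assumed to be the decoded JSON body of a Conversation call.
def summarize_response(response_json: dict) -> str:
    usage = "{} prompt + {} generated = {} tokens in {}".format(
        response_json.get("prompt_tokens"),
        response_json.get("generated_tokens"),
        response_json.get("total_tokens"),
        response_json.get("total_time_taken"),
    )
    return "{}\n({}, finish_reason={})".format(
        response_json.get("generated_text", ""),
        usage,
        response_json.get("finish_reason"),
    )
```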
## Conversation
::: info
:bulb: **Tip:** supported models
* Mistral (7B) / Mixtral (8x7B)
* Llama3 (8B / 70B)
* Llama2-V2 / Llama2 (7B / 13B / 70B) / Taide-LX (7B)
* CodeLlama (7B / 13B / 34B)
:::
### Basic usage
Fill in the Content field with each message in conversation order, under the corresponding role.
- [**Example 1**](#Example-1-no-system-prompt)
| Role | Order | Content |
| --------- | ----- | ------- |
| human | Question 1 | 人口最多的國家是? |
| assistant | Answer 1 | 人口最多的國家是印度。 |
| human | Question 2 | 主要宗教為? |
- [**Example 2**](#Example-2-with-a-system-prompt)
| Role | Order | Content |
| --------- | ------- | ------- |
| system | System prompt | 你是一位只會用表情符號回答問題的助理。 |
| human | Question 1 | 明天會下雨嗎? |
| assistant | Answer 1 | 🤔 🌨️ 🤞 |
| human | Question 2 | 意思是要帶傘出門嗎? |
::: info
:bulb: **Tip:** LLaMA 2 supports a system prompt. The system prompt helps shape the assistant's responses and is applied to every turn of the conversation.
:::
#### Example 1: no system prompt
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
# model: ffm-mixtral-8x7b-32k-instruct
curl "${API_URL}/models/conversation" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"messages":[
{
"role": "human",
"content": "人口最多的國家是?"
},
{
"role": "assistant",
"content": "人口最多的國家是印度。"
},
{
"role": "human",
"content": "主要宗教為?"
}],
"parameters": {
"max_new_tokens":350,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
```
Output: includes the generated text, token counts, and the elapsed time in seconds.
> {
"generated_text": "印度的主要宗教是印度教",
"function_call": null,
"details": null,
"total_time_taken": "1.31 sec",
"prompt_tokens": 44,
"generated_tokens": 13,
"total_tokens": 57,
"finish_reason": "eos_token"
}
:::spoiler Python example
```python=
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"
# parameters
max_new_tokens = 350
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def conversation(contents):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY,
"X-API-Host": API_HOST}
roles = ["human", "assistant"]
messages = []
for index, content in enumerate(contents):
messages.append({"role": roles[index % 2], "content": content})
data = {
"model": MODEL_NAME,
"messages": messages,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
}
}
result = ""
try:
response = requests.post(API_URL + "/models/conversation", json=data, headers=headers)
if response.status_code == 200:
result = json.loads(response.text, strict=False)['generated_text']
else:
print("error")
except:
print("error")
return result.strip("\n")
contents = ["人口最多的國家是?", "人口最多的國家是印度。", "主要宗教為?"]
result = conversation(contents)
print(result)
```
Output:
> 印度的主要宗教是印度教
:::
#### Example 2: with a system prompt
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
# model: ffm-mixtral-8x7b-32k-instruct
curl "${API_URL}/models/conversation" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"messages":[
{
"role": "system",
"content": "你是一位只會用表情符號回答問題的助理。"
},
{
"role": "human",
"content": "明天會下雨嗎?"
},
{
"role": "assistant",
"content": "🤔 🌨️ 🤞"
},
{
"role": "human",
"content": "意思是要帶傘出門嗎?"
}],
"parameters": {
"max_new_tokens":350,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
```
Output: includes the generated text, token counts, and the elapsed time in seconds.
> {
"generated_text": "🌂 🌂 🌂\n\n(譯:明天會下雨,所以最好帶把傘出門。)",
"function_call": null,
"details": null,
"total_time_taken": "3.55 sec",
"prompt_tokens": 76,
"generated_tokens": 38,
"total_tokens": 114,
"finish_reason": "eos_token"
}
:::spoiler Python example
```python=
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"
# parameters
max_new_tokens = 350
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def conversation(system, contents):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY,
"X-API-Host": API_HOST}
roles = ["human", "assistant"]
messages = []
if system is not None:
messages.append({"role": "system", "content": system})
for index, content in enumerate(contents):
messages.append({"role": roles[index % 2], "content": content})
data = {
"model": MODEL_NAME,
"messages": messages,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
}
}
result = ""
try:
response = requests.post(API_URL + "/models/conversation", json=data, headers=headers)
if response.status_code == 200:
result = json.loads(response.text, strict=False)['generated_text']
else:
print("error")
except:
print("error")
return result.strip("\n")
system_prompt = "你是一位只會用表情符號回答問題的助理。"
contents = ["明天會下雨嗎?", "🤔 🌨️ 🤞", "意思是要帶傘出門嗎?"]
result = conversation(system_prompt, contents)
print(result)
```
Output:
> 🌂 🌂 🌂
(譯:明天會下雨,所以最好帶把傘出門。)
:::
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
# model: ffm-mixtral-8x7b-32k-instruct
curl "${API_URL}/models/conversation" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"messages":[
{
"role": "system",
"content": "你是一個活潑的五歲小孩,回答問題時都使用童言童語的語氣。"
},
{
"role": "human",
"content": "明天會下雨嗎?"
},
{
"role": "assistant",
"content": "嗯,我不知道,但我希望如此!我喜歡玩雨水,穿上我的雨靴和雨衣。這就像一個大派對外面!如果你很幸運,也許你可以看到一個彩虹!"
},
{
"role": "human",
"content": "彩虹有幾種顏色呢?"
}],
"parameters": {
"max_new_tokens":350,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
```
Output: includes the generated text, token counts, and the elapsed time in seconds.
> {
"generated_text": "彩虹有七種顏色!你能記得住它們嗎?它們是紅色、橙色、黃色、綠色、藍色、靛色和紫色。這是一個有趣的記憶法:“藍色是靛色的,紫色是我的,綠色是我喜歡的,黃色是太陽,橙色是甜美的,紅色是勇敢的。”試試看,這樣就很容易記住了!",
"function_call": null,
"details": null,
"total_time_taken": "15.00 sec",
"prompt_tokens": 133,
"generated_tokens": 111,
"total_tokens": 244,
"finish_reason": "eos_token"
}
:::spoiler Python example
```python=
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"
# parameters
max_new_tokens = 350
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def conversation(system, contents):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY,
"X-API-Host": API_HOST}
roles = ["human", "assistant"]
messages = []
if system is not None:
messages.append({"role": "system", "content": system})
for index, content in enumerate(contents):
messages.append({"role": roles[index % 2], "content": content})
data = {
"model": MODEL_NAME,
"messages": messages,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
}
}
result = ""
try:
response = requests.post(API_URL + "/models/conversation", json=data, headers=headers)
if response.status_code == 200:
result = json.loads(response.text, strict=False)['generated_text']
else:
print("error")
except:
print("error")
return result.strip("\n")
system_prompt = "你是一個活潑的五歲小孩,回答問題時都使用童言童語的語氣。"
contents = ["明天會下雨嗎?", "嗯,我不知道,但我希望如此!我喜歡玩雨水,穿上我的雨靴和雨衣。這就像一個大派對外面!如果你很幸運,也許你可以看到一個彩虹!", "彩虹有幾種顏色呢?"]
result = conversation(system_prompt, contents)
print(result)
```
Output:
> 彩虹有七種顏色!你能記得住它們嗎?它們是紅色、橙色、黃色、綠色、藍色、靛色和紫色。這是一個有趣的記憶法:“藍色是靛色的,紫色是我的,綠色是我喜歡的,黃色是太陽,橙色是甜美的,紅色是勇敢的。”試試看,這樣就很容易記住了!
:::
### Using Stream mode
Server-sent events (SSE): the server pushes data to the client. Once the connection is established, data is streamed to the client as each token is generated. Unlike the one-shot reply above, this improves the user experience. If you need to generate a large number of tokens, use Stream mode to avoid timeouts.
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
# model: ffm-mixtral-8x7b-32k-instruct
curl "${API_URL}/models/conversation" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"messages":[
{
"role": "human",
"content": "人口最多的國家是?"
},
{
"role": "assistant",
"content": "人口最多的國家是印度。"
},
{
"role": "human",
"content": "主要宗教為?"
}],
"parameters": {
"max_new_tokens":350,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1},
"stream": true}'
```
Output: one record is emitted per token; the final record additionally reports the total number of generated tokens and the elapsed time in seconds.
> data: {"generated_text": "", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "印", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "度", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "的", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "主", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "要", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "宗", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "教", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "是", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "印", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "度", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "教", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "", "function_call": null, "details": null, "total_time_taken": "1.32 sec", "prompt_tokens": 44, "generated_tokens": 13, "total_tokens": 57, "finish_reason": "eos_token"}
::: info
:bulb: **Tip: notes**
1. A single token cannot always be decoded into displayable text. When that happens, the generated_text field of that record is an empty string, and the token is combined with the next record and decoded again until it can be displayed.
2. This example uses [sse-starlette](https://github.com/sysid/sse-starlette). During SSE a ping event is sent roughly every 15 seconds, so if the connection lasts longer than that you will receive the following (non-JSON) data. Handle it carefully when processing the stream; the Python example below already handles it.
> event: ping
> data: 2023-09-26 04:25:08.978531
:::
:::spoiler Python example
```python=
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"
# parameters
max_new_tokens = 350
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def conversation(contents):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY,
"X-API-Host": API_HOST}
roles = ["human", "assistant"]
messages = []
for index, content in enumerate(contents):
messages.append({"role": roles[index % 2], "content": content})
data = {
"model": MODEL_NAME,
"messages": messages,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
},
"stream": True
}
messages = []
result = ""
try:
response = requests.post(API_URL + "/models/conversation", json=data, headers=headers, stream=True)
if response.status_code == 200:
for chunk in response.iter_lines():
chunk = chunk.decode('utf-8')
if chunk == "":
continue
# only check format => data: ${JSON_FORMAT}
try:
record = json.loads(chunk[5:], strict=False)
if "status_code" in record:
print("{:d}, {}".format(record["status_code"], record["error"]))
break
elif record["total_time_taken"] is not None or ("finish_reason" in record and record["finish_reason"] is not None) :
message = record["generated_text"]
messages.append(message)
print(">>> " + message)
result = ''.join(messages)
break
elif record["generated_text"] is not None:
message = record["generated_text"]
messages.append(message)
print(">>> " + message)
else:
print("error")
break
except:
pass
else:
print("error")
except:
print("error")
return result.strip("\n")
contents = ["人口最多的國家是?", "人口最多的國家是印度。", "主要宗教為?"]
result = conversation(contents)
print(result)
```
Output:
```
>>>
>>> 印
>>> 度
>>> 的
>>> 主
>>> 要
>>> 宗
>>> 教
>>> 是
>>> 印
>>> 度
>>> 教
>>>
印度的主要宗教是印度教
```
:::
### Using with LangChain
:::spoiler **Custom Chat Model Wrapper**
```python=
"""Wrapper LLM conversation APIs."""
from typing import Any, Dict, List, Mapping, Optional, Tuple
from langchain.llms.base import LLM
import requests
from langchain.llms.utils import enforce_stop_tokens
from langchain.llms.base import BaseLLM
from langchain.llms.base import create_base_retry_decorator
from pydantic import BaseModel, Extra, Field, root_validator
from langchain.chat_models.base import BaseChatModel
from langchain.schema.language_model import BaseLanguageModel
from langchain.schema import (
BaseMessage,
ChatGeneration,
ChatResult,
ChatMessage,
AIMessage,
HumanMessage,
SystemMessage
)
from langchain.callbacks.manager import (
Callbacks,
AsyncCallbackManagerForLLMRun,
CallbackManagerForLLMRun,
)
import json
import os
class _ChatFormosaFoundationCommon(BaseLanguageModel):
base_url: str = "http://localhost:12345"
"""Base url the model is hosted under."""
model: str = "ffm-mixtral-8x7b-32k-instruct"
"""Model name to use."""
temperature: Optional[float]
"""The temperature of the model. Increasing the temperature will
make the model answer more creatively."""
stop: Optional[List[str]]
"""Sets the stop tokens to use."""
top_k: int = 50
"""Reduces the probability of generating nonsense. A higher value (e.g. 100)
will give more diverse answers, while a lower value (e.g. 10)
will be more conservative. (Default: 50)"""
top_p: float = 1
"""Works together with top-k. A higher value (e.g., 0.95) will lead
to more diverse text, while a lower value (e.g., 0.5) will
generate more focused and conservative text. (Default: 1)"""
max_new_tokens: int = 350
"""The maximum number of tokens to generate in the completion.
-1 returns as many tokens as possible given the prompt and
the models maximal context size."""
frequence_penalty: float = 1
"""Penalizes repeated tokens according to frequency."""
model_kwargs: Dict[str, Any] = Field(default_factory=dict)
"""Holds any model parameters valid for `create` call not explicitly specified."""
ffm_api_key: Optional[str] = None
@property
def _default_params(self) -> Dict[str, Any]:
"""Get the default parameters for calling FFM API."""
normal_params = {
"temperature": self.temperature,
"max_new_tokens": self.max_new_tokens,
"top_p": self.top_p,
"frequence_penalty": self.frequence_penalty,
"top_k": self.top_k,
}
return {**normal_params, **self.model_kwargs}
def _call(
self,
prompt,
stop: Optional[List[str]] = None,
**kwargs: Any,
) -> str:
if self.stop is not None and stop is not None:
raise ValueError("`stop` found in both the input and default params.")
elif self.stop is not None:
stop = self.stop
elif stop is None:
stop = []
params = {**self._default_params, "stop": stop, **kwargs}
parameter_payload = {"parameters": params, "messages": prompt, "model": self.model}
# HTTP headers for authorization
headers = {
"X-API-KEY": self.ffm_api_key,
"X-API-HOST": "afs-inference",
"Content-Type": "application/json",
}
endpoint_url = f"{self.base_url}/models/conversation"
# send request
try:
response = requests.post(
url=endpoint_url,
headers=headers,
data=json.dumps(parameter_payload, ensure_ascii=False).encode("utf8"),
stream=False,
)
response.encoding = "utf-8"
generated_text = response.json()
if response.status_code != 200:
detail = generated_text.get("detail")
raise ValueError(
f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
f"error raised with status code {response.status_code}\n"
f"Details: {detail}\n"
)
except requests.exceptions.RequestException as e: # This is the correct syntax
raise ValueError(f"FormosaFoundationModel error raised by inference endpoint: {e}\n")
if generated_text.get("detail") is not None:
detail = generated_text["detail"]
raise ValueError(
f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
f"error raised by inference API: {detail}\n"
)
if generated_text.get("generated_text") is None:
raise ValueError(
f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
f"Response format error: {generated_text}\n"
)
return generated_text
class ChatFormosaFoundationModel(BaseChatModel, _ChatFormosaFoundationCommon):
"""`FormosaFoundation` Chat large language models API.
The environment variable ``OPENAI_API_KEY`` set with your API key.
Example:
.. code-block:: python
ffm = ChatFormosaFoundationModel(model_name="llama2-7b-chat-meta")
"""
@property
def _llm_type(self) -> str:
return "ChatFormosaFoundationModel"
@property
def lc_serializable(self) -> bool:
return True
def _convert_message_to_dict(self, message: BaseMessage) -> dict:
if isinstance(message, ChatMessage):
message_dict = {"role": message.role, "content": message.content}
elif isinstance(message, HumanMessage):
message_dict = {"role": "human", "content": message.content}
elif isinstance(message, AIMessage):
message_dict = {"role": "assistant", "content": message.content}
elif isinstance(message, SystemMessage):
message_dict = {"role": "system", "content": message.content}
else:
raise ValueError(f"Got unknown type {message}")
return message_dict
def _create_conversation_messages(
self,
messages: List[BaseMessage],
stop: Optional[List[str]]
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
params: Dict[str, Any] = {**self._default_params}
if stop is not None:
if "stop" in params:
raise ValueError("`stop` found in both the input and default params.")
params["stop"] = stop
message_dicts = [self._convert_message_to_dict(m) for m in messages]
return message_dicts, params
def _create_chat_result(self, response: Mapping[str, Any]) -> ChatResult:
chat_generation = ChatGeneration(
message = AIMessage(content=response.get("generated_text")),
generation_info = {
"token_usage": response.get("generated_tokens"),
"model": self.model
}
)
return ChatResult(generations=[chat_generation])
def _generate(
self,
messages: List[BaseMessage],
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
**kwargs: Any,
) -> ChatResult:
message_dicts, params = self._create_message_dicts(messages, stop)
params = {**params, **kwargs}
response = self._call(prompt=message_dicts)
if type(response) is str: # response is not the format of dictionary
return response
return self._create_chat_result(response)
async def _agenerate(
self, messages: List[BaseMessage], stop: Optional[List[str]] = None
) -> ChatResult:
pass
def _create_message_dicts(
self,
messages: List[BaseMessage],
stop: Optional[List[str]]
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
params = self._default_params
if stop is not None:
if "stop" in params:
raise ValueError("`stop` found in both the input and default params.")
params["stop"] = stop
message_dicts = [self._convert_message_to_dict(m) for m in messages]
return message_dicts, params
```
:::
* After completing the wrapper above, you can use the specific FFM large language model in LangChain.
::: info
:bulb: **Tip:** For more information, see the [**LangChain Custom LLM documentation**](https://python.langchain.com/docs/how_to/custom_llm/).
:::
```python=
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
from langchain.schema import (
AIMessage,
HumanMessage,
SystemMessage
)
chat_ffm = ChatFormosaFoundationModel(
base_url = API_URL,
max_new_tokens = 350,
temperature = 0.5,
top_k = 50,
top_p = 1.0,
frequence_penalty = 1.0,
ffm_api_key = API_KEY,
model = MODEL_NAME
)
messages = [
HumanMessage(content="人口最多的國家是?"),
AIMessage(content="人口最多的國家是印度。"),
HumanMessage(content="主要宗教為?")
]
result = chat_ffm(messages)
print(result.content)
```
Output:
> 印度的主要宗教是印度教
## Function Calling
In an API call you can describe multiple functions for the model to choose from. The model outputs a JSON object containing the selected function name and its arguments, so that your application or agent can invoke the chosen function. The Conversation API does not invoke the function itself; it only generates the JSON so you can call the function in your own code.
::: info
:bulb: **提示:** 支援模型清單
* Mistral (7B) / Mixtral (8x7B)
* Llama2-V2 (7B / 13B / 70B)
:::
### How to use
1. The developer provides a list of functions and sends the question to the large language model.
2. The developer parses the structured output of the model to get the function and its arguments, then has the application or agent call the corresponding API and obtain the result.
3. The API result is added to the conversation and sent back to the model for summarization. A minimal end-to-end sketch follows this list.
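
Below is a minimal Python sketch of that three-step loop. It mirrors the curl examples in this section; the local `get_current_weather` function, the `ask()` helper, and the fixed parameter values are illustrative assumptions rather than a fixed SDK.
```python=
import json
import requests

API_URL = "{API_URL}"
API_KEY = "{API_KEY}"
MODEL_NAME = "{MODEL_NAME}"
HEADERS = {"content-type": "application/json",
           "X-API-KEY": API_KEY,
           "X-API-HOST": "afs-inference"}

# Step 1: a local function the model may choose to call (illustrative).
def get_current_weather(location, unit="celsius"):
    return json.dumps({"temperature": "22", "unit": unit, "description": "Sunny"})

FUNCTIONS = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string",
                         "description": "The city and state, e.g. San Francisco, CA"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["location"]
    }
}]

def ask(messages):
    payload = {"model": MODEL_NAME, "messages": messages, "functions": FUNCTIONS,
               "parameters": {"max_new_tokens": 500, "temperature": 0.5}}
    return requests.post(API_URL + "/models/conversation", json=payload, headers=HEADERS).json()

messages = [{"role": "user", "content": "What is the weather like in Boston?"}]
first = ask(messages)                      # Step 1: the model decides whether to call a function
call = first.get("function_call")
if call:
    # Step 2: run the chosen function locally with the generated arguments.
    result = get_current_weather(**call["arguments"])
    messages.append({"role": "assistant", "content": None, "function_call": call})
    messages.append({"role": "function", "name": call["name"], "content": result})
    # Step 3: send the function result back so the model can summarize it.
    print(ask(messages)["generated_text"])
else:
    print(first.get("generated_text"))
```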
### Conversation API
Given a list of messages comprising a conversation, the model returns a generated message or a function call.
1. The developer provides a list of functions and sends the question to the large language model
| Field | Type | Required | Description |
| -------- | -------- | -------- | -------- |
| **functions** | array | Optional | A list of functions the model may generate JSON inputs for.|
* Example of RESTful HTTP Request
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
curl -X POST "${API_URL}/models/conversation" \
-H "accept: application/json" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"messages": [
{
"role": "user",
"content": "What is the weather like in Boston?"
}],
"functions": [
{
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}],
"parameters": {
"show_probabilities": false,
"max_new_tokens": 500,
"frequence_penalty": 1,
"temperature": 0.5,
"top_k": 100,
"top_p": 0.93
},
"stream": false
}'
```
| Field | Type | Required | Description |
| -------- | -------- | -------- | -------- |
|**function_call** | string or object | Optional | JSON format that adheres to the function signature |
* Example of RESTful HTTP Response
```json=
{
"generated_text": "",
"function_call": {
"name": "get_current_weather",
"arguments": {
"location": "Boston, MA",
}
},
"details":null,
"total_time_taken": "1.18 sec",
"prompt_tokens": 181,
"generated_tokens": 45,
"total_tokens": 226,
"finish_reason": "function_call"
}
```
2. The developer parses the structured output of the model to get the function and its arguments, then calls the API and obtains the result
* Example of Weather API Response
```json=
{
"temperature": "22",
"unit": "celsius",
"description": "Sunny"
}
```
3. Put the API result into the conversation and send it to the large language model for summarization
| Field |value |
| -------- | -------- |
|**role** | ***function*** |
|**name** | The function name to call |
|**content** | The response message from the API |
* Example of RESTful HTTP Request
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
curl -X POST "${API_URL}/models/conversation" \
-H "accept: application/json" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"messages": [
{"role": "user", "content": "What is the weather like in Boston?"},
{"role": "assistant", "content": null, "function_call": {"name": "get_current_weather", "arguments": {"location": "Boston, MA"}}},
{"role": "function", "name": "get_current_weather", "content": "{\"temperature\": \"22\", \"unit\": \"celsius\", \"description\": \"Sunny\"}"}
],
"functions": [
{
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}],
"parameters": {
"show_probabilities": false,
"max_new_tokens": 500,
"frequence_penalty": 1,
"temperature": 0.5,
"top_k": 100,
"top_p": 0.93
},
"stream": false
}'
```
* Example of RESTful HTTP Response
> {
> "generated_text":" The current weather in Boston is sunny with a temperature of 22 degrees Celsius. ",
> "details":null,
> "total_time_taken":"0.64 sec",
> "prompt_tokens":230,
> "generated_tokens":23,
> "total_tokens":253,
> "finish_reason":"eos_token"
> }
## Code Infilling
### Basic usage
Predicts the code needed to fill a gap based on the surrounding code, using the `<FILL_ME>` tag to mark the part to be completed. A typical application is an IDE auto-completing a missing or unfinished code segment.
:::info
:bulb: **Tip: notes**
- Currently only the meta-codellama-7b-instruct and meta-codellama-13b-instruct models support Code Infilling; if the model you use does not support it, the API returns an error message.
- If the input contains more than one `<FILL_ME>`, the API returns an error message.
:::
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
# model: meta-codellama-7b-instruct
curl "${API_URL}/models/text_infilling" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"inputs":"def remove_non_ascii(s: str) -> str:\n \"\"\" <FILL_ME>\n return result\n",
"parameters":{
"max_new_tokens":43,
"temperature":0.1,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
```
Output: the code fragment that replaces `<FILL_ME>`, plus token counts and the elapsed time in seconds.
```json
{
"generated_text": "Remove non-ASCII characters from a string. \"\"\"\n result = \"\"\n for c in s:\n if ord(c) < 128:\n result += c\n ",
"function_call": null,
"details": null,
"total_time_taken": "0.99 sec",
"prompt_tokens": 27,
"generated_tokens": 43,
"total_tokens": 70,
"finish_reason": "length"
}
```
:::spoiler Python example
```python=
import json
import requests, re
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"
# parameters
max_new_tokens = 43
temperature = 0.1
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def text_infilling(prompt):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY,
"X-API-Host": API_HOST}
data = {
"model": MODEL_NAME,
"inputs": prompt,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
}
}
result = ''
try:
response = requests.post(
API_URL + "/models/text_infilling", json=data, headers=headers)
if response.status_code == 200:
result = json.loads(response.text, strict=False)['generated_text']
else:
print("error")
except:
print("error")
    return result.replace("<EOT>", "")  # remove the <EOT> end-of-text marker without stripping other characters
text = '''def remove_non_ascii(s: str) -> str:
""" <FILL_ME>
return result
'''
result = text_infilling(text)
print(re.sub("<FILL_ME>", result, text))
```
Output:
```python
def remove_non_ascii(s: str) -> str:
""" Remove non-ascii characters from a string. """
result = ""
for c in s:
if ord(c) < 128:
result += c
return result
```
:::
### Using Stream mode
Server-sent events (SSE): the server pushes data to the client. Once the connection is established, data is streamed to the client as each token is generated. Unlike the one-shot reply above, this improves the user experience. If you need to generate a large number of tokens, use Stream mode to avoid timeouts.
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
# model: meta-codellama-7b-instruct
curl "${API_URL}/models/text_infilling" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"inputs":"def compute_gcd(x, y):\n <FILL_ME>\n return result\n",
"stream":true,
"parameters":{
"max_new_tokens":50,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
```
<br>
Output: one record is emitted per token; the final record additionally reports the token counts and the elapsed time in seconds.
> data: {"generated_text": "result", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " =", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "1", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " while", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " (", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "x", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " !=", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "0", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": ")", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " and", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " (", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "y", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " !=", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "0", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "):", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " if", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " x", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " >", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " y", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": ":", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " x", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " =", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " x", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " %", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " y", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " else", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": ":", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " y", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " =", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " y", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " %", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " x", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "\n", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " ", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " result", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " =", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": " x", "function_call": null, "details": null, "total_time_taken": "0.80 sec", "prompt_tokens": 20, "generated_tokens": 50, "total_tokens": 70, "finish_reason": "length"}
::: info
:bulb: **Tip: notes**
1. A single token cannot always be decoded into displayable text. When that happens, the generated_text field of that record is an empty string, and the token is combined with the next record and decoded again until it can be displayed.
2. This example uses [sse-starlette](https://github.com/sysid/sse-starlette). During SSE a ping event is sent roughly every 15 seconds, so if the connection lasts longer than that you will receive the following (non-JSON) data. Handle it carefully when processing the stream; the Python example below already handles it.
> event: ping
> data: 2023-09-26 04:25:08.978531
:::
:::spoiler Python example
```python=
import json
import requests, re
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"
# parameters
max_new_tokens = 50
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def text_infilling(prompt):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY,
"X-API-Host": API_HOST}
data = {
"model": MODEL_NAME,
"inputs": prompt,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
},
"stream": True
}
messages = []
result = ""
try:
response = requests.post(API_URL + "/models/text_infilling", json=data, headers=headers, stream=True)
if response.status_code == 200:
for chunk in response.iter_lines():
chunk = chunk.decode('utf-8')
if chunk == "":
continue
# only check format => data: ${JSON_FORMAT}
try:
record = json.loads(chunk[5:], strict=False)
if "status_code" in record:
print("{:d}, {}".format(record["status_code"], record["error"]))
break
elif record["total_time_taken"] is not None or ("finish_reason" in record and record["finish_reason"] is not None) :
message = record["generated_text"]
messages.append(message)
print(">>> " + message)
result = ''.join(messages)
break
elif record["generated_text"] is not None:
message = record["generated_text"]
messages.append(message)
print(">>> " + message)
else:
print("error")
break
except:
pass
except:
print("error")
    return result.replace("<EOT>", "")  # remove the <EOT> end-of-text marker without stripping other characters
text = """def compute_gcd(x, y):
<FILL_ME>
return result
"""
result = text_infilling(text)
print(re.sub("<FILL_ME>", result, text))
```
Output:
```
>>> result
>>> =
>>>
>>> 1
>>>
>>>
>>> while
>>> (
>>> x
>>> !=
>>>
>>> 0
>>> )
>>> and
>>> (
>>> y
>>> !=
>>>
>>> 0
>>> ):
>>>
>>>
>>> if
>>> x
>>> >
>>> y
>>> :
>>>
>>>
>>> x
>>> =
>>> x
>>> %
>>> y
>>>
>>>
>>> else
>>> :
>>>
>>>
>>> y
>>> =
>>> y
>>> %
>>> x
>>>
>>>
>>> result
>>> =
>>> x
def compute_gcd(x, y):
result = 1
while (x != 0) and (y != 0):
if x > y:
x = x % y
else:
y = y % x
result = x
return result
```
:::
## Embedding (V1)
::: info
:bulb: **Tip: the Embedding API is currently subject to the following limits**
* Embedding model: sequence length = 2048
* The Embedding API supports batch inference; each entry must not exceed 2048 tokens.
:::
### Using curl
1. Set up the environment
```=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
```
2. Use the `curl` command to get the embedding result
**Request example**
```=
curl "${API_URL}/models/embeddings" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"inputs": ["search string 1", "search string 2"]
}'
```
**Response example**
```=
{
"data": [
{
"embedding": [
0.06317982822656631,
-0.5447818636894226,
-0.3353637158870697,
-0.5117015838623047,
-0.1446804255247116,
0.2036416381597519,
-0.20317679643630981,
-0.9627353549003601,
0.31771183013916016,
0.23493929207324982,
0.18029260635375977,
...
...
],
"index": 0,
"object": "embedding"
},
{
"embedding": [
0.15340591967105865,
-0.26574525237083435,
-0.3885045349597931,
-0.2985926568508148,
0.22742436826229095,
-0.42115798592567444,
-0.10134009271860123,
-1.0426620244979858,
0.507709264755249,
-0.3479543924331665,
-0.09303411841392517,
1.0853372812271118,
0.7396582961082458,
0.266722172498703,
...
...
],
"index": 1,
"object": "embedding"
}
],
"total_time_taken": "0.06 sec",
"usage": {
"prompt_tokens": 6,
"total_tokens": 6
}
}
```
### Using with LangChain
:::spoiler **Custom Embedding Model Wrapper**
```python=
"""Wrapper Embedding model APIs."""
import json
import requests
from typing import List
from pydantic import BaseModel
from langchain.embeddings.base import Embeddings
import os
class CustomEmbeddingModel(BaseModel, Embeddings):
base_url: str = "http://localhost:12345"
api_key: str = ""
model: str = ""
def get_embeddings(self, payload):
endpoint_url=f"{self.base_url}/models/embeddings"
embeddings = []
headers = {
"Content-type": "application/json",
"accept": "application/json",
"X-API-KEY": self.api_key,
"X-API-HOST": "afs-inference"
}
response = requests.post(endpoint_url, headers=headers, data=payload)
body = response.json()
datas = body["data"]
for data in datas:
embeddings.append(data["embedding"])
return embeddings
def embed_documents(self, texts: List[str]) -> List[List[float]]:
payload = json.dumps({"model": self.model, "inputs": texts})
return self.get_embeddings(payload)
    def embed_query(self, text: str) -> List[float]:
payload = json.dumps({"model": self.model, "inputs": [text]})
emb = self.get_embeddings(payload)
return emb[0]
```
:::
* After completing the wrapper above, you can use CustomEmbeddingModel directly in LangChain for the corresponding model tasks.
::: info
:bulb: **Tip:** For more information, see the [**LangChain Custom LLM documentation**](https://python.langchain.com/docs/how_to/custom_llm/).
:::
#### Single string
* To get the embedding of a single string, use the **`embed_query()`** function; it returns the result directly.
```python=
API_KEY={API_KEY}
API_URL={API_URL}
MODEL_NAME={MODEL_NAME}
embeddings = CustomEmbeddingModel(
base_url = API_URL,
api_key = API_KEY,
model = MODEL_NAME,
)
print(embeddings.embed_query("請問台灣最高的山是?"))
```
Output:
> [-1.1431972, -4.723901, 2.3445783, -2.19996, ......, 1.0784563, -3.4114947, -2.5193133]
#### Multiple strings
* To get embeddings for multiple strings, use the **`embed_documents()`** function; it returns all results at once. A cosine-similarity usage sketch follows the example below.
```python=
API_KEY={API_KEY}
API_URL={API_URL}
MODEL_NAME={MODEL_NAME}
embeddings = CustomEmbeddingModel(
base_url = API_URL,
api_key = API_KEY,
model = MODEL_NAME,
)
print(embeddings.embed_documents(["test1", "test2", "test3"]))
```
Output:
> [[-0.14880371, ......, 0.7011719], [-0.023590088, ...... , 0.49320474], [-0.86242676, ......, 0.22867839]]
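
As a usage sketch, the embeddings above can be compared with cosine similarity to rank documents against a query. It assumes the `CustomEmbeddingModel` instance `embeddings` created in the examples above, and uses only the Python standard library; the document texts are illustrative.
```python=
import math

# Rank documents against a query by cosine similarity, using the
# "embeddings" CustomEmbeddingModel instance created above (illustrative usage).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = ["台灣最高的山是玉山。", "台北101是著名的地標。", "太平洋是世界上最大的海洋。"]
doc_vectors = embeddings.embed_documents(docs)
query_vector = embeddings.embed_query("請問台灣最高的山是?")

ranked = sorted(zip(docs, (cosine(query_vector, v) for v in doc_vectors)),
                key=lambda pair: pair[1], reverse=True)
for text, score in ranked:
    print(f"{score:.4f}  {text}")
```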
## Embedding (V2)
::: info
:bulb: **Tip: the Embedding API is currently subject to the following limits**
* Embedding model: sequence length = 131072
* The Embedding API supports batch inference; each entry must not exceed 131072 tokens.
:::
### Using curl
1. Set up the environment
```=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
```
2. Use the `curl` command to get the embedding result. `input_type` is a new parameter in V2, described below.
    1. The value can only be "query" or "document".
    2. The parameter is optional; if it is not set, the default is "document".
    3. With "query", the system automatically prepends a prefix sentence to each input to improve embedding accuracy.
    4. With "document", the input is left as-is, with no prefix added.
**Request example**
```=
curl "${API_URL}/models/embeddings" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"inputs": ["search string 1", "search string 2"],
"parameters": {
"input_type": "document"
}
}'
```
**Response example**
```=
{
"data": [
{
"embedding": [
0.015003109350800514,
0.002964278217405081,
0.025576837360858917,
0.0009064615005627275,
0.00896097905933857,
-0.010766804218292236,
0.022567130625247955,
-0.020284295082092285,
-0.004011997487396002,
-0.01566183753311634,
-0.016150206327438354,
-0.008938264101743698,
0.010346580296754837,
0.010187577456235886,
...
...
],
"index": 0,
"object": "embedding"
},
{
"embedding": [
0.013649762608110905,
0.003280752571299672,
0.024047400802373886,
0.005184505134820938,
0.009756374172866344,
-0.009389937855303288,
0.027826279401779175,
-0.016409488394856453,
0.0020984220318496227,
-0.0180928073823452,
-0.014462794177234173,
-0.006956569850444794,
0.013260424137115479,
0.018184415996074677,
...
...
],
"index": 1,
"object": "embedding"
}
],
"total_time_taken": "0.05 sec",
"usage": {
"prompt_tokens": 8,
"total_tokens": 8
}
}
```
3. V2 supports the following OpenAI Embedding API parameters
    1. input: the list of target strings.
    2. encoding_format: can be set to "float" or "base64". With "base64", the embedding vectors are base64-encoded before being returned; the default is "float". A decoding sketch follows this list.
    3. dimensions: the maximum number of vector dimensions to return. For example, with 4 only the first four dimensions are returned. The default is 0, which returns the full-dimension vector.
**Request example**
```=
curl "${API_URL}/models/embeddings" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"input": ["search string 1", "search string 2"],
"encoding_format": "base64",
"dimensions": 4
}'
```
**Response example**
```=
{
"data": [
{
"object": "embedding",
"embedding": "pR0QPOuoY7sjFQM9U92HOw==",
"index": 0
},
{
"object": "embedding",
"embedding": "6BXdOxIpD7vfHgA9suTyOw==",
"index": 1
}
],
"total_time_taken": "0.04 sec",
"usage": {
"prompt_tokens": 8,
"total_tokens": 8
}
}
```
4. In V2, the output vectors are now normalized by default, matching OpenAI's behavior. If you prefer the previous behavior and want un-normalized vectors, add a "normalize" parameter under parameters and set it to false, as shown below.
**Request example**
```=
curl "${API_URL}/models/embeddings" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"input": ["search string 1", "search string 2"],
"parameters": {
"normalize": false
},
"encoding_format": "base64",
"dimensions": 4
}'
```
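
The base64-encoded embeddings returned when `encoding_format` is "base64" (step 3 above) can be decoded back into floats. The sketch below assumes the vector is packed as little-endian float32 values (the 16-byte example above decodes to 4 dimensions, matching `"dimensions": 4`); treat that packing as an assumption rather than a documented guarantee.
```python=
import base64
import struct

# Decode a base64-encoded embedding back into a list of floats.
# Assumption: the bytes are packed as little-endian float32 values.
def decode_embedding(b64_string: str) -> list:
    raw = base64.b64decode(b64_string)
    count = len(raw) // 4  # 4 bytes per float32 value
    return list(struct.unpack("<{}f".format(count), raw))

print(decode_embedding("pR0QPOuoY7sjFQM9U92HOw=="))  # 4 values, matching "dimensions": 4
```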
### Using with LangChain
:::spoiler **Custom Embedding Model Wrapper**
```python=
"""Wrapper Embedding model APIs."""
import json
import requests
from typing import List
from pydantic import BaseModel
from langchain.embeddings.base import Embeddings
import os
class CustomEmbeddingModel(BaseModel, Embeddings):
base_url: str = "http://localhost:12345"
api_key: str = ""
model: str = ""
def get_embeddings(self, payload):
endpoint_url=f"{self.base_url}/models/embeddings"
embeddings = []
headers = {
"Content-type": "application/json",
"accept": "application/json",
"X-API-KEY": self.api_key,
"X-API-HOST": "afs-inference"
}
response = requests.post(endpoint_url, headers=headers, data=payload)
body = response.json()
datas = body["data"]
for data in datas:
embeddings.append(data["embedding"])
return embeddings
def embed_documents(self, texts: List[str]) -> List[List[float]]:
        payload = json.dumps({"model": self.model, "inputs": texts, "parameters": {"input_type": "document"}})  # "document" keeps corpus texts unprefixed, per the input_type description above
return self.get_embeddings(payload)
    def embed_query(self, text: str) -> List[float]:
payload = json.dumps({"model": self.model, "inputs": [text], "parameters": {"input_type": "query"}})
emb = self.get_embeddings(payload)
return emb[0]
```
:::
* After completing the wrapper above, you can use CustomEmbeddingModel directly in LangChain for the corresponding model tasks.
::: info
:bulb: **Tip:** For more information, see the [**LangChain Custom LLM documentation**](https://python.langchain.com/docs/how_to/custom_llm/).
:::
#### Single string
* To get the embedding of a single string, use the **`embed_query()`** function; it returns the result directly.
```python=
API_KEY={API_KEY}
API_URL={API_URL}
MODEL_NAME={MODEL_NAME}
embeddings = CustomEmbeddingModel(
base_url = API_URL,
api_key = API_KEY,
model = MODEL_NAME,
)
print(embeddings.embed_query("請問台灣最高的山是?"))
```
Output:
> [-0.023335948586463928, 0.02815871126949787, 0.03960443660616875, 0.012845884077250957, ......, 0.010695642791688442, 0.001966887153685093, 0.008934334851801395]
#### Multiple strings
* To get embeddings for multiple strings, use the **`embed_documents()`** function; it returns all results at once.
```python=
API_KEY={API_KEY}
API_URL={API_URL}
MODEL_NAME={MODEL_NAME}
embeddings = CustomEmbeddingModel(
base_url = API_URL,
api_key = API_KEY,
model = MODEL_NAME,
)
print(embeddings.embed_documents(["test1", "test2", "test3"]))
```
Output:
> [[-0.007434912957251072, ......, 0.009466814808547497], [-0.006574439350515604, ...... , 0.008274043910205364], [-0.005750700831413269, ......, 0.009992048144340515]]
## Rerank
The core function of the Rerank API is to use machine learning and natural language processing to re-rank input texts with the specified model: the model scores each candidate answer, and a higher score means the answer is more relevant to the query. It is commonly used in information retrieval, recommendation systems, and NLP tasks to further refine an initially ranked result list according to a scoring criterion, so that the results better match the user's expectations.
### Use cases
The Rerank API can be used in a variety of retrieval-related scenarios, for example:
* Information retrieval systems: re-rank the initially retrieved results so that the most relevant ones rank higher.
* Question answering systems: select the most relevant and correct answer from multiple candidates.
* Recommendation systems: re-rank recommendations according to user preferences to provide more personalized results.
* Text matching: in text similarity tasks, rank multiple candidate matches and pick the most similar text.
### Usage example
Below is an example that calls the Rerank API with curl to re-rank three query-answer pairs.
::: info
:bulb: **Tip: the Rerank API is currently subject to the following limits**
* Rerank model: sequence length = 8192
* The Rerank API supports batch inference; each entry must not exceed 8192 tokens.
:::
1. Set up the environment
```=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
```
2. Use the `curl` command to get the rerank result. By default the response contains the scores and indexes of the top three candidates; to get more or fewer, adjust the "top_n" parameter under parameters. Two input formats are supported: the first passes query-answer pairs as a list via the "inputs" parameter (see example 1 below); the second passes the query via the "query" parameter and all candidate answers via the "documents" parameter (see example 2 below).
The answers in example 1 are not sorted by score, whereas example 2 returns the top results sorted by score according to the top_n parameter.
**Request example 1**
```=
curl "${API_URL}/models/rerank" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"inputs": [
[
"Where is the capital of Canada?",
"Europe is a continent."
],
[
"Where is the capital of Canada?",
"The capital of Canada is Ottawa."
],
[
"Where is the capital of Canada?",
"Canada is a big country."
]
]
}'
```
**Response example**
```=
{
"scores": [
0.000016571451851632446,
0.9998936653137207,
0.040769271552562714
],
"total_time_taken": "0.07 sec",
"usage": {
"prompt_tokens": 41,
"total_tokens": 41
}
}
```
**Request example 2**
```=
curl "${API_URL}/models/rerank" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"query": "Where is the capital of Canada?",
"documents": [
"Europe is a continent.",
"The capital of Canada is Ottawa.",
"Canada is a big country."
],
"parameters": {
"top_n": 2
}
}'
```
**Response example**
```=
{
"results": [
{
"score": 0.9998936653137207,
"index": 1
},
{
"score": 0.040769271552562714,
"index": 2
}
],
"total_time_taken": "0.07 sec",
"usage": {
"prompt_tokens": 41,
"total_tokens": 41
}
}
```
The examples above send three query-answer pairs to the Rerank API:
1. "Where is the capital of Canada?" and "Europe is a continent."
2. "Where is the capital of Canada?" and "The capital of Canada is Ottawa."
3. "Where is the capital of Canada?" and "Canada is a big country."
The Rerank API returned the relevance scores of the top two candidates. The score of the second answer (0.9998936653137207) is clearly higher than that of the third (0.040769271552562714), indicating that "The capital of Canada is Ottawa." is more relevant and correct. Developers can sort the candidates by these scores and pick the most relevant result, which can significantly improve the accuracy and effectiveness of information retrieval and question answering systems. A short Python sketch of this flow follows.
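
The following is a minimal Python sketch of the query/documents form (request example 2 above). The endpoint, headers, and payload mirror the curl examples; the `rerank()` helper and the printed formatting are illustrative.
```python=
import requests

API_URL = "{API_URL}"
API_KEY = "{API_KEY}"
MODEL_NAME = "{MODEL_NAME}"

# Call the Rerank API in the query/documents form and map the sorted
# results back to the original document texts (illustrative helper).
def rerank(query, documents, top_n=3):
    headers = {"content-type": "application/json",
               "X-API-KEY": API_KEY,
               "X-API-HOST": "afs-inference"}
    data = {"model": MODEL_NAME,
            "query": query,
            "documents": documents,
            "parameters": {"top_n": top_n}}
    body = requests.post(API_URL + "/models/rerank", json=data, headers=headers).json()
    # "results" is already sorted by score; "index" points into the documents list.
    return [(documents[r["index"]], r["score"]) for r in body["results"]]

docs = ["Europe is a continent.",
        "The capital of Canada is Ottawa.",
        "Canada is a big country."]
for text, score in rerank("Where is the capital of Canada?", docs, top_n=2):
    print(f"{score:.4f}  {text}")
```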
## Generate (use [Conversation](#Conversation) instead whenever possible)
### Basic usage
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
# model: ffm-mixtral-8x7b-32k-instruct
curl "${API_URL}/models/generate" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"inputs":"從前從前,有位老太太去河邊",
"parameters":{
"max_new_tokens":200,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
```
Output: includes the generated text, token counts, and the elapsed time in seconds.
> {
"generated_text": ",她洗完衣服後,要把衣服晾在河邊的一棵樹上。但老太太又老又弱,她爬不上樹,於是她決定把衣服掛在樹枝上。老太太拿起衣服,開始往樹枝上掛衣服,但她掛了幾件衣服後,樹枝斷了,所有的衣服都掉到河裡去了。老太太看到這一幕,非常傷心,她說:「我的衣服都掉到河裡去了!」老太太的孫女看到這一幕,心想:「我可以幫助我的祖母,我可以幫助她把衣服掛在樹枝上。」於是,這位孫女走到河邊,開始幫助她的祖母把衣服掛在樹枝上",
"function_call": null,
"details": null,
"total_time_taken": "18.88 sec",
"prompt_tokens": 17,
"generated_tokens": 200,
"total_tokens": 217,
"finish_reason": "length"
}
:::spoiler Python example
```python=
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"
# parameters
max_new_tokens = 200
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def generate(prompt):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY,
"X-API-Host": API_HOST}
data = {
"model": MODEL_NAME,
"inputs": prompt,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
}
}
result = ''
try:
response = requests.post(
API_URL + "/models/generate", json=data, headers=headers)
if response.status_code == 200:
result = json.loads(response.text, strict=False)['generated_text']
else:
print("error")
except:
print("error")
return result.strip("\n")
result = generate("從前從前,有位老太太去河邊")
print(result)
```
Output:
> ,她洗完衣服後,要把衣服晾在河邊的一棵樹上。但老太太又老又弱,她爬不上樹,於是她決定把衣服掛在樹枝上。老太太拿起衣服,開始往樹枝上掛衣服,但她掛了幾件衣服後,樹枝斷了,所有的衣服都掉到河裡去了。老太太看到這一幕,非常傷心,她說:「我的衣服都掉到河裡去了!」老太太的孫女看到這一幕,心想:「我可以幫助我的祖母,我可以幫助她把衣服掛在樹枝上。」於是,這位孫女走到河邊,開始幫助她的祖母把衣服掛在樹枝上
:::
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
# model: ffm-mixtral-8x7b-32k-instruct
curl "${API_URL}/models/generate" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"inputs":"可以幫我規劃台北兩日遊,並推薦每天的景點及說明其特色嗎?",
"parameters":{
"max_new_tokens":350,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
```
Output: includes the generated text, token counts, and the elapsed time in seconds.
> {
"generated_text": "答案是: 第一天: 1. 台北101觀景台 - 這是台北最受歡迎的景點之一,提供城市天際線的壯觀景色。 2. 國立故宮博物院 - 這是世界上最著名的藝術和文物收藏之一,展示了中國豐富的文化遺產。 3. 台北中正紀念堂 - 這是一座宏偉的紀念堂,致力於紀念中華民國前總統蔣中正。 4. 台北國立台灣博物館 - 這是一個展示台灣歷史、文化和藝術的博物館。 5. 台北夜市 - 這是一個熱鬧的夜市,提供各種街頭美食、購物和娛樂。 第二天: 1. 陽明山國家公園 - 這是一個美麗的國家公園,提供令人驚嘆的台北市區景色。 2. 台北101觀景台 - 這是另一個觀景台,提供城市天際線的壯觀景色。 3. 台北國立台灣博物館 - 這是另一個博物館,展示台灣歷史、文化和藝術。 4. 台北故宮 - 這是另一個展示中國豐富文化遺產的",
"function_call": null,
"details": null,
"total_time_taken": "33.08 sec",
"prompt_tokens": 31,
"generated_tokens": 350,
"total_tokens": 381,
"finish_reason": "length"
}
:::spoiler Python example
```python=
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"
# parameters
max_new_tokens = 350
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def generate(prompt):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY,
"X-API-Host": API_HOST}
data = {
"model": MODEL_NAME,
"inputs": prompt,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
}
}
result = ''
try:
response = requests.post(
API_URL + "/models/generate", json=data, headers=headers)
if response.status_code == 200:
result = json.loads(response.text, strict=False)['generated_text']
else:
print("error")
except:
print("error")
return result.strip("\n")
result = generate("可以幫我規劃台北兩日遊,並推薦每天的景點及說明其特色嗎?")
print(result)
```
Output:
> 答案是: 第一天: 1. 台北101觀景台 - 這是台北最受歡迎的景點之一,提供城市天際線的壯觀景色。 2. 國立故宮博物院 - 這是世界上最著名的藝術和文物收藏之一,展示了中國豐富的文化遺產。 3. 台北中正紀念堂 - 這是一座宏偉的紀念堂,致力於紀念中華民國前總統蔣中正。 4. 台北國立台灣博物館 - 這是一個展示台灣歷史、文化和藝術的博物館。 5. 台北夜市 - 這是一個熱鬧的夜市,提供各種街頭美食、購物和娛樂。 第二天: 1. 陽明山國家公園 - 這是一個美麗的國家公園,提供令人驚嘆的台北市區景色。 2. 台北101觀景台 - 這是另一個觀景台,提供城市天際線的壯觀景色。 3. 台北國立台灣博物館 - 這是另一個博物館,展示台灣歷史、文化和藝術。 4. 台北故宮 - 這是另一個展示中國豐富文化遺產的
:::
### 使用 Stream 模式
Server-Sent Events (SSE):由伺服器主動向客戶端推送資料。連線建立後,模型每生成一段字句便會即時將資料拋送給客戶端;與先前的一次性回覆不同,可提升使用者體驗。若有輸出大量 token 文字的需求,請務必優先採用 Stream 模式,以免遇到 Timeout 的情形。
```bash=
export API_KEY={API_KEY}
export API_URL={API_URL}
export MODEL_NAME={MODEL_NAME}
# model: ffm-mixtral-8x7b-32k-instruct
curl "${API_URL}/models/generate" \
-H "X-API-KEY:${API_KEY}" \
-H "X-API-HOST: afs-inference" \
-H "content-type: application/json" \
-d '{
"model": "'${MODEL_NAME}'",
"inputs":"台灣最高峰是",
"stream":true,
"parameters":{
"max_new_tokens":2,
"temperature":0.5,
"top_k":50,
"top_p":1,
"frequence_penalty":1}}'
```
輸出:每個 token 會輸出一筆資料;最末一筆除了最後一個 token 的文字外,還會附上 token 個數與總花費秒數等統計資訊。
> data: {"generated_text": "玉", "function_call": null, "details": null, "total_time_taken": null, "prompt_tokens": 0, "generated_tokens": 0, "total_tokens": 0, "finish_reason": null}
> data: {"generated_text": "山", "function_call": null, "details": null, "total_time_taken": "0.25 sec", "prompt_tokens": 7, "generated_tokens": 2, "total_tokens": 9, "finish_reason": "length"}
::: info
:bulb: **提示:注意事項**
1. 每筆 token 不一定都能單獨解碼成合適的文字;遇到這種情況時,該筆資料的 generated_text 欄位會是空字串,該 token 會與後續資料合併後再解碼,直到能呈現為止。
2. 本案例採用 [sse-starlette](https://github.com/sysid/sse-starlette),在 SSE 連線期間約每 15 秒會收到一次 ping event;若連線時間超過此間隔,就會收到以下非 JSON 格式的資料,處理回應時需特別注意。下列 Python 範例已包含此資料的處理。
> event: ping
> data: 2023-09-26 04:25:08.978531
:::
:::spoiler Python 範例
```python=
import json
import requests
MODEL_NAME = "{MODEL_NAME}"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
API_HOST = "afs-inference"
# parameters
max_new_tokens = 2
temperature = 0.5
top_k = 50
top_p = 1.0
frequence_penalty = 1.0
def generate(prompt):
headers = {
"content-type": "application/json",
"X-API-Key": API_KEY,
"X-API-Host": API_HOST}
data = {
"model": MODEL_NAME,
"inputs": prompt,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"top_k": top_k,
"top_p": top_p,
"frequence_penalty": frequence_penalty
},
"stream": True
}
messages = []
result = ""
try:
response = requests.post(API_URL + "/models/generate", json=data, headers=headers, stream=True)
if response.status_code == 200:
for chunk in response.iter_lines():
chunk = chunk.decode('utf-8')
if chunk == "":
continue
try:
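                    # SSE 每筆資料格式為 "data: {...}";先去除開頭的 "data:" 前綴,再解析其後的 JSON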
record = json.loads(chunk[5:], strict=False)
if "status_code" in record:
print("{:d}, {}".format(record["status_code"], record["error"]))
break
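                    # 最末筆資料會附上 token 數量與總花費秒數等統計資訊,取出文字後即結束串流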
elif record["total_time_taken"] is not None or ("finish_reason" in record and record["finish_reason"] is not None) :
message = record["generated_text"]
messages.append(message)
print(">>> " + message)
result = ''.join(messages)
break
elif record["generated_text"] is not None:
message = record["generated_text"]
messages.append(message)
print(">>> " + message)
else:
print("error")
break
except:
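                    # 無法解析為 JSON 的資料列(例如 ping event 及其時間戳記)直接略過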
pass
except:
print("error")
return result.strip("\n")
result = generate("台灣最高峰是")
print(result)
```
:::
<br>
輸出:
```
>>> 玉
>>> 山
玉山
```
### LangChain 使用方式
:::spoiler **Custom LLM Model Wrapper**
```python=
from typing import Any, Dict, List, Mapping, Optional, Tuple
from langchain.llms.base import BaseLLM
import requests
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.schema.language_model import BaseLanguageModel
from langchain.schema import Generation, LLMResult
from pydantic import Field
import json
import os
class _FormosaFoundationCommon(BaseLanguageModel):
base_url: str = "http://localhost:12345"
"""Base url the model is hosted under."""
model: str = "ffm-mixtral-8x7b-32k-instruct"
"""Model name to use."""
temperature: Optional[float]
"""The temperature of the model. Increasing the temperature will
make the model answer more creatively."""
stop: Optional[List[str]]
"""Sets the stop tokens to use."""
top_k: int = 50
"""Reduces the probability of generating nonsense. A higher value (e.g. 100)
will give more diverse answers, while a lower value (e.g. 10)
will be more conservative. (Default: 50)"""
top_p: float = 1
"""Works together with top-k. A higher value (e.g., 0.95) will lead
to more diverse text, while a lower value (e.g., 0.5) will
generate more focused and conservative text. (Default: 1)"""
max_new_tokens: int = 350
"""The maximum number of tokens to generate in the completion.
-1 returns as many tokens as possible given the prompt and
the models maximal context size."""
frequence_penalty: float = 1
"""Penalizes repeated tokens according to frequency."""
model_kwargs: Dict[str, Any] = Field(default_factory=dict)
"""Holds any model parameters valid for `create` call not explicitly specified."""
ffm_api_key: Optional[str] = None
@property
def _default_params(self) -> Dict[str, Any]:
"""Get the default parameters for calling FFM API."""
normal_params = {
"temperature": self.temperature,
"max_new_tokens": self.max_new_tokens,
"top_p": self.top_p,
"frequence_penalty": self.frequence_penalty,
"top_k": self.top_k,
}
return {**normal_params, **self.model_kwargs}
def _call(
self,
prompt,
stop: Optional[List[str]] = None,
**kwargs: Any,
    ) -> Dict[str, Any]:
if self.stop is not None and stop is not None:
raise ValueError("`stop` found in both the input and default params.")
elif self.stop is not None:
stop = self.stop
elif stop is None:
stop = []
params = {**self._default_params, "stop": stop, **kwargs}
parameter_payload = {"parameters": params, "inputs": prompt, "model": self.model}
# HTTP headers for authorization
headers = {
"X-API-KEY": self.ffm_api_key,
"Content-Type": "application/json",
"X-API-HOST": "afs-inference"
}
endpoint_url = f"{self.base_url}/models/generate"
# send request
try:
response = requests.post(
url=endpoint_url,
headers=headers,
data=json.dumps(parameter_payload, ensure_ascii=False).encode("utf8"),
stream=False,
)
response.encoding = "utf-8"
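            # 注意:response.json() 回傳的是完整 JSON 回應(dict),其中才包含 generated_text 等欄位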
generated_text = response.json()
if response.status_code != 200:
detail = generated_text.get("detail")
raise ValueError(
f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
f"error raised with status code {response.status_code}\n"
f"Details: {detail}\n"
)
except requests.exceptions.RequestException as e: # This is the correct syntax
raise ValueError(f"FormosaFoundationModel error raised by inference endpoint: {e}\n")
if generated_text.get("detail") is not None:
detail = generated_text["detail"]
raise ValueError(
f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
f"error raised by inference API: {detail}\n"
)
if generated_text.get("generated_text") is None:
raise ValueError(
f"FormosaFoundationModel endpoint_url: {endpoint_url}\n"
f"Response format error: {generated_text}\n"
)
return generated_text
class FormosaFoundationModel(BaseLLM, _FormosaFoundationCommon):
"""Formosa Foundation Model
Example:
.. code-block:: python
            ffm = FormosaFoundationModel(model="llama2-7b-chat-meta")
"""
@property
def _llm_type(self) -> str:
return "FormosaFoundationModel"
@property
def _identifying_params(self) -> Mapping[str, Any]:
"""Get the identifying parameters."""
return {
**{
"model": self.model,
"base_url": self.base_url
},
**self._default_params
}
def _generate(
self,
prompts: List[str],
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
**kwargs: Any,
) -> LLMResult:
"""Call out to FormosaFoundationModel's generate endpoint.
Args:
prompt: The prompt to pass into the model.
stop: Optional list of stop words to use when generating.
Returns:
The string generated by the model.
Example:
.. code-block:: python
            response = ffm("Tell me a joke.")
"""
generations = []
token_usage = 0
for prompt in prompts:
final_chunk = super()._call(
prompt,
stop=stop,
**kwargs,
)
generations.append(
[
Generation(
text = final_chunk["generated_text"],
generation_info=dict(
finish_reason = final_chunk["finish_reason"]
)
)
]
)
token_usage += final_chunk["generated_tokens"]
llm_output = {"token_usage": token_usage, "model": self.model}
return LLMResult(generations=generations, llm_output=llm_output)
```
:::
* 完成以上封裝後,就可以在 LangChain 中使用 FFM 大語言模型。
::: info
:bulb: **提示:** 更多資訊,請參考 [**LangChain Custom LLM 文件**](https://python.langchain.com/docs/modules/model_io/models/llms/custom_llm)。
:::
```python=
MODEL_NAME = "ffm-mixtral-8x7b-32k-instruct"
API_KEY = "{API_KEY}"
API_URL = "{API_URL}"
ffm = FormosaFoundationModel(
base_url = API_URL,
max_new_tokens = 350,
temperature = 0.5,
top_k = 50,
top_p = 1.0,
frequence_penalty = 1.0,
ffm_api_key = API_KEY,
model = MODEL_NAME
)
print(ffm("請問台灣最高的山是?"))
```
輸出:
> 答案是:玉山。
>
>玉山,也被稱為玉山國家公園,位於台灣南部,是該國最高的山,海拔3952米(12966英尺)。它是台灣阿里山山脈的一部分,以其崎嶇的地形、翠綠的森林和多種植物和動物而聞名。玉山是徒步旅行者和自然愛好者的熱門目的地,被認為是台灣最美麗和最具挑戰性的山之一。
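
此外,`FormosaFoundationModel` 也能像內建的 LLM 一樣,搭配 LangChain 的 Prompt 模板與 Chain 使用。以下為一個最簡示意(假設沿用上例建立的 `ffm` 物件,並使用與上方 Wrapper 相同世代的 LangChain;`PromptTemplate` 與 `LLMChain` 的匯入路徑可能因版本而異):
```python=
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# 註:此為示意寫法,實際匯入路徑與介面請依所安裝的 LangChain 版本調整
# 沿用前一段範例中建立的 ffm 物件
prompt = PromptTemplate(
    input_variables=["question"],
    template="請以繁體中文簡短回答以下問題:{question}",
)
chain = LLMChain(llm=ffm, prompt=prompt)
print(chain.run(question="台灣最高的山是哪一座?"))
```
由於 Wrapper 已實作 `_generate`,理論上亦可進一步搭配其他以 LLM 為基礎的 LangChain 元件使用,方式與內建模型相同。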