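# Vector Search - Azure Machine Learning - Operations Manual
This article tests the [endpoint](https://learn.microsoft.com/zh-tw/azure/machine-learning/concept-endpoints?view=azureml-api-2) feature of Azure Machine Learning and verifies that:
- A self-trained model can be used inside an endpoint
- A database can be read from inside an endpoint
- Cosine similarity can be computed inside an endpoint
- The endpoint successfully returns:
    - the model output
    - the top-k documents matched between a test embedding and the database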
## Environment Setup
1. Install the Python packages
`pip install azure-cli azureml-core azure-identity azure-ai-ml pyodbc mlflow xgboost pandas numpy scikit-learn`
2. [Install the ODBC driver (for Azure SQL)](https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-ver15&tabs=ubuntu18-install%2Cubuntu17-install%2Cdebian8-install%2Credhat7-13-install%2Crhel7-offline)
3. [Install Azure Data Studio (for conveniently uploading CSV files to the database)](https://learn.microsoft.com/zh-tw/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver16&tabs=redhat-install%2Credhat-uninstall)
4. In Azure Data Studio, install the SQL Server Import extension
5. [Install Azure CLI (for deploying the model)](https://learn.microsoft.com/zh-tw/cli/azure/install-azure-cli)
## Azure Database
### Prepare test data
```python
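import pandas as pd

# Embedding table: each text row points to a question through question_id
pd.DataFrame({
    'text': ['Sample text 1', 'Sample text 2', 'Sample text 3'],
    'question_id': [0, 1, 1],
    'embedding': [[0.68, 0.36, -0.18],
                  [0.28, -0.22, 0.69],
                  [0.69, -0.12, 0.55]],
}).rename_axis(index='index').to_csv('testv2_ebd.csv')

# Q&A table: the row index is written out as the question_id column
# (to_csv returns None when given a path, so the result is not assigned)
pd.DataFrame({
    'question': ['Sample Q1', 'Sample Q2', 'Sample Q3'],
    'answer': ['Sample A1', 'Sample A2', 'Sample A3'],
}).rename_axis(index='question_id').to_csv('testv2_qa.csv')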
```
- testv2_ebd.csv

- testv2_qa.csv
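
CSV has no list type, so the `embedding` column is stored as text once the data is written out, and it has to be parsed again before any similarity computation. The following is a minimal sketch of that round trip using only the standard library; the in-memory buffer and the `parse_embedding` helper are illustrative names that do not appear in the original scripts.

```python
import ast
import csv
import io


def parse_embedding(cell):
    """Turn a CSV cell such as '[0.68, 0.36, -0.18]' back into a list of floats."""
    values = ast.literal_eval(cell)
    return [float(v) for v in values]


# Write one row laid out like the test data (index, text, question_id, embedding)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['index', 'text', 'question_id', 'embedding'])
writer.writerow([0, 'Sample text 1', 0, [0.68, 0.36, -0.18]])

# Read it back: every cell is now a string, including the embedding
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_cell = row['embedding']
embedding = parse_embedding(raw_cell)
print(type(raw_cell).__name__, embedding)
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for values that come back from a file or database.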

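### Create the SQL Server
- Requires an existing **subscription** and **resource group**


- Setting the server location to **Japan East** usually avoids a long wait
- Basics:
    - **Server name**: "your SQL Database Server name"
    - Authentication method: **Use both SQL and Azure AD authentication**
    - The **user** and **password** used to **log in to the database** are set here

- Networking: **Allow Azure services and resources to access this server**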

### Create the SQL database

- Basics:
    - **Database name**: "your SQL Database name"
    - **Server**: "your SQL Database Server name"
    - Compute + storage: choose the cheapest option

- Networking: add the current client IP address

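### Connect to the database with **Azure Data Studio**
- Clicking here fills in the server information automatically

- Azure Data Studio should already be installed; if not, install it now

- Enter the **user** and **password** set when creating the server

- After clicking Connect, click Add account and sign in with your Microsoft account

- Once the sign-in is shown as successful, return to Azure Data Studio and click OK to complete the connection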
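### Add tables to the database
- If the SQL Server Import extension is not installed yet, install it first

- Right-click the database just connected and choose Import Wizard

- If the test data has not been generated yet, generate it first as described in the test-data preparation section above

- Import the two CSV files generated above
    - In the ebd CSV, question_id is an index (int)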

### Test the database connection
- **config.json**
```json
{
"server": "lulutestdbserver.database.windows.net",
"database": "lulutestdb",
"username": "lulu",
"password": "testtest!1"
}
```
- **1_get_db.ipynb**
```python
import json

import pandas as pd
import pyodbc

# Load the connection settings from config.json
with open('config.json') as config_file:
    config = json.load(config_file)


def get_db_ebd():
    """Join the embedding table with the Q&A table and return the result."""
    server = config['server']
    database = config['database']
    username = config['username']
    password = config['password']
    driver = '{ODBC Driver 18 for SQL Server}'
    conn_str = ('DRIVER=' + driver + ';SERVER=tcp:' + server + ';PORT=1433;DATABASE=' + database
                + ';UID=' + username + ';PWD=' + password)
    # pyodbc.connect returns a connection (not a cursor)
    with pyodbc.connect(conn_str) as conn:
        df = pd.read_sql("SELECT * FROM testv2_ebd t1 JOIN testv2_qa t2 ON t1.question_id = t2.question_id;", conn)
    df = df[['text', 'embedding', 'question', 'answer']]
    return df


print(get_db_ebd())
```
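
The connection string above is assembled by plain string concatenation, which makes a missing config field easy to overlook. As a hedged sketch, the helper below (the name `build_connection_string` and the placeholder values are assumptions for illustration) builds the same string from the `config.json` fields and reports missing keys before any connection is attempted.

```python
REQUIRED_KEYS = ('server', 'database', 'username', 'password')


def build_connection_string(config, driver='{ODBC Driver 18 for SQL Server}'):
    """Assemble the ODBC connection string from the config.json fields."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError('config.json is missing: ' + ', '.join(missing))
    parts = {
        'DRIVER': driver,
        'SERVER': 'tcp:' + config['server'],
        'PORT': '1433',
        'DATABASE': config['database'],
        'UID': config['username'],
        'PWD': config['password'],
    }
    # Dicts keep insertion order, so the fields appear in the order listed above
    return ';'.join(f'{key}={value}' for key, value in parts.items())


# Placeholder values only; substitute the real ones from config.json
example_config = {
    'server': 'example.database.windows.net',
    'database': 'exampledb',
    'username': 'user',
    'password': 'secret',
}
print(build_connection_string(example_config))
```

The resulting string can be passed to `pyodbc.connect` exactly as in the notebook cell above.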

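## Azure Machine Learning
This section creates a workspace in Azure Machine Learning and performs the following steps in it:
1. Register our self-trained model under Models
2. Build the required environment under Environments using a Dockerfile
    - Dockerfile:
        - Sets up what is needed to connect to Azure SQL
        - Sets the directory of the entry script
        - Installs mlflow and the packages needed to compute cosine similarity
3. Create an online Endpoints service using the prepared entry script
    - entry script (successed_test_echo_score.py):
        - def init():
            - Get the registered model's requirements file and install it
            - Load the model
        - def run():
            - Connect to Azure SQL and fetch the embedding table
            - Test that the model produces output normally
            - Test that cosine similarity returns the top-k documents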
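### Create the workspace and Docker container registry


- Under Container Registry, first create a new registry and choose the cheapest SKU

- For the workspace's container registry, select the one just created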

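### Get the workspace config

- Downloading the config gives the following three values (`subscription_id`, `resource_group`, `workspace_name`); also copy the MLflow tracking URI from the workspace overview

- Add these four values to the config.json used earlier for the SQL connection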
```json
{
"server": "lulutestdbserver.database.windows.net",
"database": "lulutestdb",
"username": "lulu",
"password": "testtest!1",
"subscription_id": "ad04169e-94e2-4b11-ab7a-b4dde1a76ae1",
"resource_group": "wingeneai",
"workspace_name": "lulutestws",
"azureml_tracking_uri": "azureml://japaneast.api.azureml.ms/mlflow/v1.0/subscriptions/ad04169e-94e2-4b11-ab7a-b4dde1a76ae1/resourceGroups/wingeneai/providers/Microsoft.MachineLearningServices/workspaces/lulutestws"
}
```
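
The tracking URI in the sample config embeds the same subscription, resource group, and workspace name as the three other fields, so a quick consistency check can catch copy-and-paste mistakes. The sketch below rebuilds the URI from its parts; the pattern is inferred from the sample value above rather than taken from official documentation, and `build_tracking_uri` and its `region` parameter are illustrative assumptions. The value copied from the workspace overview remains the authoritative one.

```python
def build_tracking_uri(region, subscription_id, resource_group, workspace_name):
    """Rebuild the MLflow tracking URI following the pattern of the sample config."""
    return (
        f'azureml://{region}.api.azureml.ms/mlflow/v1.0'
        f'/subscriptions/{subscription_id}'
        f'/resourceGroups/{resource_group}'
        f'/providers/Microsoft.MachineLearningServices'
        f'/workspaces/{workspace_name}'
    )


# Values taken from the sample config.json above
config = {
    'subscription_id': 'ad04169e-94e2-4b11-ab7a-b4dde1a76ae1',
    'resource_group': 'wingeneai',
    'workspace_name': 'lulutestws',
    'azureml_tracking_uri': 'azureml://japaneast.api.azureml.ms/mlflow/v1.0/subscriptions/ad04169e-94e2-4b11-ab7a-b4dde1a76ae1/resourceGroups/wingeneai/providers/Microsoft.MachineLearningServices/workspaces/lulutestws',
}
expected = build_tracking_uri(
    'japaneast',
    config['subscription_id'],
    config['resource_group'],
    config['workspace_name'],
)
uri_is_consistent = expected == config['azureml_tracking_uri']
print(uri_is_consistent)
```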
### Build a local MLflow model for testing
(For now this article uses xgboost to test the model flow; an embedding model has not been tested yet.)
```python
import pandas as pd
import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
# Toy data for training and evaluating xgboost
X_train = pd.DataFrame([[1, 2, 3]])
y_train = pd.DataFrame([[0]])
X_test = pd.DataFrame([[4, 5, 6]])
y_test = pd.DataFrame([[1]])
# mlflow
mlflow.autolog()
# Build and train the model
model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# After running this, the folder mlruns/0/<run_id>/... will be created
```
- Running the code above creates a "**mlruns**" folder in the working directory

### Connect to the workspace
Make sure that step 5 of the Environment Setup section, [installing Azure CLI (for deploying the model)](https://learn.microsoft.com/zh-tw/cli/azure/install-azure-cli), has been completed.
- Shell commands:
```
sudo apt-get update && sudo apt-get install azure-cli
az login --use-device-code
```
- Copy the code, then open the link

- Enter the code

- Sign in with your Microsoft account


- Connection succeeded

### Register the local model to the workspace
After the connection in the previous step is complete, register the model with the following Python code.
```python
# Register the local model to the workspace
import json
import os

import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azureml.core import Model, Workspace

# Load the workspace settings
with open('config.json') as config_file:
    config = json.load(config_file)
subscription_id = config['subscription_id']
resource_group = config['resource_group']
workspace = config['workspace_name']
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)
mlflow.set_tracking_uri(config['azureml_tracking_uri'])
ws = Workspace(subscription_id=subscription_id,
               resource_group=resource_group,
               workspace_name=workspace)
# config.json must also contain a "model_name" entry (the name to register the model under)
model_name = config['model_name']
# Get the path of the artifacts just built locally
model_list = os.listdir('mlruns/0')
model_list.remove("meta.yaml")
model_local_path = "mlruns/0/{}/artifacts".format(model_list[0])  # run ID, e.g. 5031e45086454b71935f74b40eacb32e
registered_model = Model.register(workspace=ws, model_path=model_local_path, model_name=model_name)
```
- The model is registered successfully
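
The registration script picks the first entry returned by `os.listdir('mlruns/0')`, which is ambiguous once the training code has been run more than once. As a sketch under the assumption that every sub-directory of the experiment folder is a run, the hypothetical helper `find_run_artifacts` below selects the most recently modified run directory instead. The demo builds a throwaway directory tree so it can run without MLflow.

```python
import os
import tempfile


def find_run_artifacts(experiment_dir='mlruns/0'):
    """Return the artifacts path of the most recently modified run directory."""
    run_dirs = [
        os.path.join(experiment_dir, name)
        for name in os.listdir(experiment_dir)
        if os.path.isdir(os.path.join(experiment_dir, name))  # skips meta.yaml
    ]
    if not run_dirs:
        raise FileNotFoundError(f'no run directories found in {experiment_dir}')
    latest = max(run_dirs, key=os.path.getmtime)
    return os.path.join(latest, 'artifacts')


# Demo on a temporary tree that mimics mlruns/0 with two runs
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, 'meta.yaml'), 'w').close()
    for name, mtime in [('run_a', 1_000_000_000), ('run_b', 2_000_000_000)]:
        run_dir = os.path.join(root, name)
        os.makedirs(os.path.join(run_dir, 'artifacts'))
        os.utime(run_dir, (mtime, mtime))  # set the modification time explicitly
    demo_result = os.path.relpath(find_run_artifacts(root), root)
print(demo_result)
```

The returned path could be passed as `model_path` to `Model.register` in place of the `model_list[0]` lookup.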

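### Build the environment with a Dockerfile
Our final goal is an endpoint service, so we first need an environment in which the endpoint can run the entry script.
- Open the AML service and select the workspace just created

- Launch Machine Learning Studio

- Click to create an environment

- Choose to create the environment from a Dockerfile context and set its name

- Dockerfile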
```dockerfile
FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
# install the ODBC 18 driver, for connecting to Azure SQL
RUN apt-get update \
&& apt-get install -y curl apt-transport-https gnupg2 \
&& curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
&& echo "deb [arch=amd64] https://packages.microsoft.com/ubuntu/20.04/prod focal main" > /etc/apt/sources.list.d/msprod.list \
&& apt-get update \
&& ACCEPT_EULA=Y apt-get install -y msodbcsql18 \
&& ACCEPT_EULA=Y apt-get install -y mssql-tools18 \
&& echo 'export PATH="$PATH:/opt/mssql-tools18/bin"' >> ~/.bashrc \
&& /bin/bash -c "source ~/.bashrc" \
&& apt-get install -y unixodbc-dev
# set the entry script dir
ENV SOURCE_DIRECTORY='./'
# install requirement for the entry script
RUN pip install numpy==1.21.2 pandas==1.4.1 pyodbc==4.0.39 azureml-inference-server-http azureml azureml-contrib-services azureml-core mlflow
```

- After about 3 minutes, the environment is built successfully in the linked container registry

### Prepare the **entry script** needed by the endpoint
- **successed_test_echo_score.py** is provided below. Note that the following values from the JSON file must be filled in manually:
    - Set the database connection information in the **get_db_ebd** function:
        - server = "lulutestdbserver.database.windows.net"
        - database = "lulutestdb"
        - username = "lulu"
        - password = "testtest!1"
- **successed_test_echo_score.py**
```python
import ast
import json
import os
import subprocess
import sys
import traceback

import mlflow
import numpy as np
import pandas as pd
import pyodbc


def init():
    global model
    global model_path
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'artifacts', 'model')
    # Locate the model's requirements file and install the model dependencies
    requirement_path = mlflow.pyfunc.get_model_dependencies(model_path)
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", requirement_path])
    # Load the model
    model = mlflow.pyfunc.load_model(model_path)


def run(data):
    try:
        # Get the input context
        data = json.loads(data)
        input_txt = data['context']
        # Check that the registered model produces output; [3, 6, 9] follows the
        # input format of the xgboost test model used in this article
        model_out = model.predict([[3, 6, 9]])
        model_out = str(model_out)

        # Check that the endpoint can read the data stored in Azure SQL
        def get_db_ebd():
            server = "lulutestdbserver.database.windows.net"
            database = "lulutestdb"
            username = "lulu"
            password = "testtest!1"
            driver = '{ODBC Driver 18 for SQL Server}'
            conn_str = ('DRIVER=' + driver + ';SERVER=tcp:' + server + ';PORT=1433;DATABASE=' + database
                        + ';UID=' + username + ';PWD=' + password)
            # pyodbc.connect returns a connection (not a cursor)
            with pyodbc.connect(conn_str) as conn:
                df = pd.read_sql("SELECT * FROM testv2_ebd t1 JOIN testv2_qa t2 ON t1.question_id = t2.question_id;", conn)
            df = df[['text', 'embedding', 'question', 'answer']]
            return df

        df = get_db_ebd()
        # The endpoint returns the received input and the test output of the registered model
        out = [{'api in': input_txt, 'test model out': model_out, 'where model': model_path}]
        # Check that cosine similarity can be computed inside the endpoint
        embedding = [0.8, 0.16, -0.24]

        def search_docs(df, embedding, top_n=3, to_print=True):
            def cosine_similarity(a, b):
                return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

            # The embedding column is stored as text such as "[0.68, 0.36, -0.18]";
            # ast.literal_eval parses it without executing arbitrary code
            df["score"] = df.embedding.apply(lambda x: cosine_similarity(ast.literal_eval(x), embedding))
            res = (
                df.sort_values("score", ascending=False)
                .head(top_n)
            )
            return res

        # The endpoint returns the documents matched between the test embedding and the database
        out.append(search_docs(df, embedding).to_dict(orient='records'))
        return out
    except Exception as e:
        result = str(e)
        # return error message back to the client
        print("Failure!")
        print(traceback.format_exc())
        return json.dumps({"error": result, "tb": traceback.format_exc()})
```
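
The entry script hard-codes the database credentials inside `get_db_ebd`. One possible alternative, sketched here as an assumption rather than something the original setup does, is to read them from environment variables configured on the deployment. The variable names (`DB_SERVER` and so on) and the helper `load_db_settings` are hypothetical.

```python
import os

# Hypothetical mapping from setting name to environment variable name
ENV_NAMES = {
    'server': 'DB_SERVER',
    'database': 'DB_NAME',
    'username': 'DB_USER',
    'password': 'DB_PASSWORD',
}


def load_db_settings(environ=None):
    """Collect the database settings from environment variables."""
    environ = os.environ if environ is None else environ
    missing = [var for var in ENV_NAMES.values() if not environ.get(var)]
    if missing:
        raise RuntimeError('missing environment variables: ' + ', '.join(missing))
    return {key: environ[var] for key, var in ENV_NAMES.items()}


# Demo with an explicit mapping instead of the real process environment
demo_settings = load_db_settings({
    'DB_SERVER': 'example.database.windows.net',
    'DB_NAME': 'exampledb',
    'DB_USER': 'user',
    'DB_PASSWORD': 'secret',
})
print(sorted(demo_settings))
```

Inside `get_db_ebd`, the four hard-coded assignments would then be replaced by lookups in the returned dictionary.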
### Create the endpoint
- In Azure Machine Learning Studio, select Endpoints

- Keep clicking Next and select the model just registered

- Upload the entry script and select the environment just created

- Choose the cheapest option

- After about 12 minutes, the endpoint is created

## Test the endpoint service deployed on AML
### **python**
- Get the model_name / url / api_key

- Settings
```python
model_name = 'xgboost-1'
api_key = ''
url = 'https://lulutestept.japaneast.inference.ml.azure.com/score'
data = {"context": "text for testing whether the API works"}
```
- Call the API
```python
import json
import os
import ssl
import urllib.error
import urllib.request


def allowSelfSignedHttps(allowed):
    # bypass the server certificate verification on client side
    if allowed and not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None):
        ssl._create_default_https_context = ssl._create_unverified_context


allowSelfSignedHttps(True)  # this line is needed if you use self-signed certificate in your scoring service.

# Request data goes here
# The example below assumes JSON formatting which may be updated
# depending on the format your endpoint expects.
# More information can be found here:
# https://docs.microsoft.com/azure/machine-learning/how-to-deploy-advanced-entry-script
body = str.encode(json.dumps(data))

# Replace this with the primary/secondary key or AMLToken for the endpoint
if not api_key:
    raise Exception("A key should be provided to invoke the endpoint")

# The azureml-model-deployment header will force the request to go to a specific deployment.
# Remove this header to have the request observe the endpoint traffic rules
headers = {'Content-Type': 'application/json', 'Authorization': ('Bearer ' + api_key), 'azureml-model-deployment': model_name}
req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)
    result = json.loads(response.read())
    print('api in', result[0]['api in'])
    print()
    print('test model out', result[0]['test model out'])
    print()
    print('top n context: ', '\n', result[1][0], '\n', result[1][1], '\n', result[1][2])
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
```
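
The client above indexes the response as `result[0]` and `result[1]`, which fails with a confusing error when the entry script takes its error path and returns a JSON object with an `error` key instead. The sketch below shows one way to handle both shapes. `summarize_response` is a hypothetical helper, the sample payload is made up for illustration, and the handling of a doubly encoded string is an assumption about how the server may serialize a string returned by `run()`.

```python
import json


def summarize_response(body):
    """Normalize the endpoint response into a small summary dictionary."""
    result = json.loads(body)
    # Assumption: a string returned by run() may arrive JSON-encoded a second time
    if isinstance(result, str):
        result = json.loads(result)
    if isinstance(result, dict) and 'error' in result:
        return {'ok': False, 'error': result['error']}
    echo, matches = result[0], result[1]
    return {
        'ok': True,
        'api_in': echo['api in'],
        'model_out': echo['test model out'],
        'top_texts': [match['text'] for match in matches],
    }


# Illustrative payload shaped like the list returned by run(); the values are placeholders
sample_body = json.dumps([
    {'api in': 'hello', 'test model out': '[0]', 'where model': '/path/to/model'},
    [{'text': 'Sample text 1'}, {'text': 'Sample text 3'}, {'text': 'Sample text 2'}],
])
summary = summarize_response(sample_body)
print(summary['top_texts'])

# Error path: run() returns json.dumps({...}) with an "error" key
error_summary = summarize_response(json.dumps({'error': 'boom', 'tb': '...'}))
print(error_summary)
```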

- ***Test data***
    - model
        - xgboost training data

        - test input for the model inside the endpoint

    - cosine similarity
        - test embedding table in the database
            - testv2_ebd.csv: 3-dimensional embeddings

            - testv2_qa.csv

        - test embedding to match: ```embedding = [0.8, 0.16, -0.24]```
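
To make the expected ranking concrete, the following dependency-free sketch applies the cosine similarity formula, dot(a, b) / (|a| |b|), to the three test embeddings and the query embedding listed above. The function names and the English row labels are illustrative; the entry script itself uses NumPy and pandas for the same computation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=3):
    """Return the k (score, text) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(embedding, query), text) for text, embedding in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# The three rows of testv2_ebd.csv and the query embedding used by the endpoint
rows = [
    ('Sample text 1', [0.68, 0.36, -0.18]),
    ('Sample text 2', [0.28, -0.22, 0.69]),
    ('Sample text 3', [0.69, -0.12, 0.55]),
]
query = [0.8, 0.16, -0.24]
ranking = top_k(query, rows)
for score, text in ranking:
    print(f'{score:.3f}  {text}')
```

With this data the first row is the closest match, followed by the third and then the second, which is the order the endpoint should return for `top_n=3`.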
### **AML Studio**

## Cost
- 2023/6/1 17:30

- 3 hours before 2023/6/1 17:30 (when the build had just started)

- 2023/6/1 18:52

- 2023/6/1 19:26

- 2023/6/2 10:26


- 2023/6/2 13:48

- 2023/6/1 to 6/5

## How to pause the endpoint
Not found yet.