# Vector Search - Azure Machine Learning - Operations Manual

This document tests the [endpoint](https://learn.microsoft.com/zh-tw/azure/machine-learning/concept-endpoints?view=azureml-api-2) feature of Azure Machine Learning. The goals are:

- Use a self-trained model inside the endpoint
- Read from a database inside the endpoint
- Compute cosine similarity inside the endpoint
- Successfully return
    - the model output
    - the top-k documents matched between a test embedding and the database

## Environment Setup

1. Install the Python packages

    `pip install azure-cli azureml-core azure-identity azure-ai-ml pyodbc mlflow scikit-learn xgboost pandas numpy`

2. [Install ODBC (for Azure SQL)](https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-ver15&tabs=ubuntu18-install%2Cubuntu17-install%2Cdebian8-install%2Credhat7-13-install%2Crhel7-offline)
3. [Install Azure Data Studio (for conveniently uploading CSV files to the database)](https://learn.microsoft.com/zh-tw/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver16&tabs=redhat-install%2Credhat-uninstall)
4. In Azure Data Studio, install the SQL Server Import extension
5. [Install Azure CLI (for deploying the model)](https://learn.microsoft.com/zh-tw/cli/azure/install-azure-cli)

## Azure Database

### Prepare test data

```python
import pandas as pd

# Embedding table: each text row points to a question through question_id.
# The sample strings are placeholder text ("example text 1", ...).
pd.DataFrame({
    'text': ['範例文本1', '範例文本2', '範例文本3'],
    'question_id': [0, 1, 1],
    'embedding': [[0.68, 0.36, -0.18], [0.28, -0.22, 0.69], [0.69, -0.12, 0.55]],
}).rename_axis(index='index').to_csv('testv2_ebd.csv')

# Question/answer table, keyed by question_id
pd.DataFrame({
    'question': ['範例Q1', '範例Q2', '範例Q3'],
    'answer': ['範例A1', '範例A2', '範例A3'],
}).rename_axis(index='question_id').to_csv('testv2_qa.csv')
```

- testv2_ebd.csv
  ![](https://hackmd.io/_uploads/BkWqT5V82.png)
- testv2_qa.csv
  ![](https://hackmd.io/_uploads/SkVekiVI2.png)

### Create the SQL Server

- A **subscription** and a **resource group** must already exist
  ![](https://hackmd.io/_uploads/SyEuhtVL2.png)
  ![](https://hackmd.io/_uploads/H1q6nKE83.png)
- Setting the server location to **Japan East** usually avoids a long wait
- Basics:
    - **Server name**: "your SQL Database Server name"
    - Authentication method: **Use both SQL and Azure AD authentication**
    - The **user** and **password** used later to **log in to the database** are set here
      ![](https://hackmd.io/_uploads/Hyq20YVUn.png)
- Networking: **Allow Azure services and resources to access this server**
  ![](https://hackmd.io/_uploads/B1fP19EUn.png)

### Create the SQL database

![](https://hackmd.io/_uploads/rkGr-cN8n.png)

- Basics
    - **Database name**: "your SQL Database name"
    - **Server**: "your SQL Database Server name"
- Compute + storage: choose the cheapest option
  ![](https://hackmd.io/_uploads/SJTmG5E82.png)
- Networking: add the current client IP address
  ![](https://hackmd.io/_uploads/HJJnGcEIn.png)

### Connect to the database with **Azure Data Studio**

- Clicking here fills in the server information automatically
  ![](https://hackmd.io/_uploads/SJcqX9EU3.png)
- Azure Data Studio should already be installed; if not, install it now
  ![](https://hackmd.io/_uploads/BkyZt94In.png)
- Enter the **user** and **password** set when the server was created
  ![](https://hackmd.io/_uploads/HkssU54Lh.png)
- After clicking Connect, click Add an account and sign in with a Microsoft account
  ![](https://hackmd.io/_uploads/SJ4bc94L3.png)
- Once the sign-in is reported as successful, return to Azure Data Studio and click OK to finish connecting

### Add tables to the database

- If the SQL Server Import extension is not installed yet, install it first
  ![](https://hackmd.io/_uploads/SJkC1EsUh.png)
- Right-click the database that was just connected and choose Import Wizard
  ![](https://hackmd.io/_uploads/SyE1ic483.png)
- If the test data has not been generated yet, generate it first (see the test-data preparation step at the start of the Azure database section)
  ![](https://hackmd.io/_uploads/B17h39EU2.png)
- Import the two CSV files generated above
    - In ebd.csv, `question_id` is an index column (int)
      ![](https://hackmd.io/_uploads/BkvfC5VUn.png)

### Test the database connection

- **config.json**

```json
{
    "server": "lulutestdbserver.database.windows.net",
    "database": "lulutestdb",
    "username": "lulu",
    "password": "testtest!1"
}
```

- **1_get_db.ipynb**

```python
import json

import pandas as pd
import pyodbc

with open('config.json') as config_file:
    config = json.load(config_file)


def get_db_ebd():
    """Join the embedding table with the QA table and return it as a DataFrame."""
    server = config['server']
    database = config['database']
    username = config['username']
    password = config['password']
    driver = '{ODBC Driver 18 for SQL Server}'
    conn_str = (
        f'DRIVER={driver};SERVER=tcp:{server};PORT=1433;'
        f'DATABASE={database};UID={username};PWD={password}'
    )
    # pyodbc.connect returns a connection (not a cursor)
    with pyodbc.connect(conn_str) as conn:
        df = pd.read_sql(
            "SELECT * FROM testv2_ebd t1 JOIN testv2_qa t2 ON t1.question_id = t2.question_id;",
            conn,
        )
    df = df[['text', 'embedding', 'question', 'answer']]
    return df


print(get_db_ebd())
```

![](https://hackmd.io/_uploads/S1wWZs4Lh.png)

## Azure Machine Learning

This section creates a workspace in Azure Machine Learning and performs the following steps inside it:

1. Register our own trained model under Models
2. Build the required environment under Environments from a Dockerfile
    - Dockerfile:
        - Set up what is needed to connect to Azure SQL
        - Set the entry script directory
        - Install mlflow and the packages needed to compute cosine similarity
3. Create the Endpoints online service with the prepared entry script
    - entry script (successed_test_echo_score.py):
        - def init():
            - Locate the registered model's requirements.txt and install it
            - Load the model
        - def run():
            - Connect to Azure SQL and fetch the embedding table
            - Check that the model produces output normally
            - Check that cosine similarity returns the top-k documents

### Create the workspace and container registry

![](https://hackmd.io/_uploads/SkO2-sEI2.png)
![](https://hackmd.io/_uploads/B1nEzjVIh.png)

- Under Container Registry, first create a new registry and choose the cheapest SKU
  ![](https://hackmd.io/_uploads/rk5nziEL2.png)
- For the container registry, select the one that was just created
  ![](https://hackmd.io/_uploads/H1clNi4Uh.png)

### Get the workspace config

![](https://hackmd.io/_uploads/ByyQvtBUn.png)

- Downloading the config gives the following three values; also copy the tracking URI (stored below as `azureml_tracking_uri`) from the workspace overview
  ![](https://hackmd.io/_uploads/rJnSPYHU3.png)
- Add these four values to the config.json used earlier for the SQL connection

```json
{
    "server": "lulutestdbserver.database.windows.net",
    "database": "lulutestdb",
    "username": "lulu",
    "password": "testtest!1",
    "subscription_id": "ad04169e-94e2-4b11-ab7a-b4dde1a76ae1",
    "resource_group": "wingeneai",
    "workspace_name": "lulutestws",
    "azureml_tracking_uri": "azureml://japaneast.api.azureml.ms/mlflow/v1.0/subscriptions/ad04169e-94e2-4b11-ab7a-b4dde1a76ae1/resourceGroups/wingeneai/providers/Microsoft.MachineLearningServices/workspaces/lulutestws"
}
```

### Build a local test MLflow model

(For now this document uses xgboost to test the model flow; an embedding model has not been tested yet.)

```python
import pandas as pd
import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Toy data for training the test xgboost model
X_train = pd.DataFrame([[1, 2, 3]])
y_train = pd.DataFrame([[0]])
X_test = pd.DataFrame([[4, 5, 6]])
y_test = pd.DataFrame([[1]])

# Let mlflow log the run automatically
mlflow.autolog()

# Build and train the model
model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# After running this, the folder mlruns/0/-----------------/... is created
```

- Running the code above creates a folder named "**mlruns**" in the working directory
  ![](https://hackmd.io/_uploads/rJVd9tSL2.png)

### Connect to the workspace

Make sure that step 5 of the environment setup, [Install Azure CLI (for deploying the model)](https://learn.microsoft.com/zh-tw/cli/azure/install-azure-cli), has been completed.

- Commands:

```
sudo apt-get update && sudo apt-get install azure-cli
az login --use-device-code
```

- Copy the code, then open the link
  ![](https://hackmd.io/_uploads/BkZKlcH83.png)
- Enter the code
  ![](https://hackmd.io/_uploads/H1oqxqSUh.png)
- Sign in with a Microsoft account
  ![](https://hackmd.io/_uploads/BkxRlqrLn.png)
  ![](https://hackmd.io/_uploads/SyVe-5rUh.png)
- Connected successfully
  ![](https://hackmd.io/_uploads/ByyzW5BIn.png)

### Register the local model to the workspace

After the connection in the previous step is complete, register the model with the following Python code.

```python
# Register the local model to the workspace
import json
import os

import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azureml.core import Model, Workspace

# Load the workspace/server config
with open('config.json') as config_file:
    config = json.load(config_file)

subscription_id = config['subscription_id']
resource_group = config['resource_group']
workspace = config['workspace_name']

ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)
mlflow.set_tracking_uri(config['azureml_tracking_uri'])

ws = Workspace(subscription_id=subscription_id,
               resource_group=resource_group,
               workspace_name=workspace)

# NOTE: config.json must also contain a "model_name" entry;
# it is not shown in the sample config above.
model_name = config['model_name']

# Locate the artifacts folder of the run that was just built locally
model_list = os.listdir('mlruns/0')
model_list.remove("meta.yaml")
model_local_path = "mlruns/0/{}/artifacts".format(model_list[0])  # e.g. 5031e45086454b71935f74b40eacb32e

registered_model = Model.register(workspace=ws,
                                  model_path=model_local_path,
                                  model_name=model_name)
```

- Model registered successfully
  ![](https://hackmd.io/_uploads/HJrMf5BUh.png)

### Create the environment from a Dockerfile

The final goal is an endpoint service, so we first need an environment in which the endpoint can run the entry script.

- Open the AML service and select the workspace that was just created
  ![](https://hackmd.io/_uploads/H1N8N9HU3.png)
- Launch Machine Learning Studio
  ![](https://hackmd.io/_uploads/SJzsOs48n.png)
- Click to create an environment
  ![](https://hackmd.io/_uploads/B1B1H5HIh.png)
- Choose to create the environment from a Dockerfile context and set its name
  ![](https://hackmd.io/_uploads/HkHPr9SU3.png)
- Dockerfile

```dockerfile
FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04

# install ODBC Driver 18, for connecting to Azure SQL
RUN apt-get update \
    && apt-get install -y curl apt-transport-https gnupg2 \
    && curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
    && echo "deb [arch=amd64] https://packages.microsoft.com/ubuntu/20.04/prod focal main" > /etc/apt/sources.list.d/msprod.list \
    && apt-get update \
    && ACCEPT_EULA=Y apt-get install -y msodbcsql18 \
    && ACCEPT_EULA=Y apt-get install -y mssql-tools18 \
    && echo 'export PATH="$PATH:/opt/mssql-tools18/bin"' >> ~/.bashrc \
    && /bin/bash -c "source ~/.bashrc" \
    && apt-get install -y unixodbc-dev

# set the entry script dir
ENV SOURCE_DIRECTORY='./'

# install requirements for the entry script
RUN pip install numpy==1.21.2 pandas==1.4.1 pyodbc==4.0.39 azureml-inference-server-http azureml azureml-contrib-services azureml-core mlflow
```

![](https://hackmd.io/_uploads/S1eOknrU2.png)

- After about 3 minutes the environment is built in the linked container registry
  ![](https://hackmd.io/_uploads/SJG6n6S82.png)

### Prepare the **entry script** needed by the endpoint

- **successed_test_echo_score.py** is provided below. Note that the following values from the JSON file must be filled in by hand:
    - Set the database connection information in the **get_db_ebd** function
        - server = "lulutestdbserver.database.windows.net" database = "lulutestdb" username = "lulu" password = "testtest!1"
- **successed_test_echo_score.py**

```python
import traceback
import json
import os
import subprocess
import sys

import numpy as np
import pyodbc
import pandas as pd
import mlflow


def init():
    global model
    global model_path
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'artifacts', 'model')

    # Locate the model's requirements file and install the model dependencies
    requirement_path = mlflow.pyfunc.get_model_dependencies(model_path)
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", requirement_path])

    # Load the model
    model = mlflow.pyfunc.load_model(model_path)


def run(data):
    try:
        # Read the input context
        data = json.loads(data)
        input_txt = data['context']

        # Check that the registered model produces output;
        # [3, 6, 9] matches the input shape of the test xgboost model
        model_out = model.predict([[3, 6, 9]])
        model_out = str(model_out)

        # Check that the endpoint can read the data stored in Azure SQL
        def get_db_ebd():
            server = "lulutestdbserver.database.windows.net"
            database = "lulutestdb"
            username = "lulu"
            password = "testtest!1"
            driver = '{ODBC Driver 18 for SQL Server}'
            conn_str = (
                f'DRIVER={driver};SERVER=tcp:{server};PORT=1433;'
                f'DATABASE={database};UID={username};PWD={password}'
            )
            with pyodbc.connect(conn_str) as conn:
                df = pd.read_sql(
                    "SELECT * FROM testv2_ebd t1 JOIN testv2_qa t2 ON t1.question_id = t2.question_id;",
                    conn,
                )
            df = df[['text', 'embedding', 'question', 'answer']]
            return df

        df = get_db_ebd()

        # Return the received input and the test output of the registered model
        out = [{'api in': input_txt, 'test model out': model_out, 'where model': model_path}]

        # Check that cosine similarity can be computed inside the endpoint
        embedding = [0.8, 0.16, -0.24]

        def search_docs(df, embedding, top_n=3, to_print=True):
            def cosine_similarity(a, b):
                return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

            # Embeddings are stored as strings such as "[0.68, 0.36, -0.18]";
            # json.loads parses them without the risks of eval
            df["score"] = df.embedding.apply(lambda x: cosine_similarity(json.loads(x), embedding))
            res = df.sort_values("score", ascending=False).head(top_n)
            return res

        # Return the documents matched between the test embedding and the database
        out.append(search_docs(df, embedding).to_dict(orient='records'))
        return out
    except Exception as e:
        result = str(e)
        # Return the error message back to the client
        print("Failure!")
        print(traceback.format_exc())
        return json.dumps({"error": result, "tb": traceback.format_exc()})
```

### Create the endpoint

- In Azure Machine Learning Studio, choose Endpoints
  ![](https://hackmd.io/_uploads/B17GZhrI3.png)
-
  Keep clicking Next and select the model that was just registered
  ![](https://hackmd.io/_uploads/BkLVbnB8h.png)
- Upload the entry script and select the environment that was just created
  ![](https://hackmd.io/_uploads/rJDRQ2S83.png)
- Try the cheapest option
  ![](https://hackmd.io/_uploads/rkfsV2r82.png)
- After about 12 minutes the endpoint is created
  ![](https://hackmd.io/_uploads/SJ0F0ArUn.png)

## Test the endpoint service hosted on AML

### **python**

- Get model_name / url / api_key
  ![](https://hackmd.io/_uploads/SydoMlU8n.png)
- Settings

```python
model_name = 'xgboost-1'
api_key = ''
url = 'https://lulutestept.japaneast.inference.ml.azure.com/score'
# Any text works here; this one means "text for testing whether the API works"
data = {"context": "測試API能不能用之text"}
```

- Call the API

```python
import urllib.request
import urllib.error
import json
import os
import ssl


def allowSelfSignedHttps(allowed):
    # bypass the server certificate verification on client side
    if allowed and not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None):
        ssl._create_default_https_context = ssl._create_unverified_context


allowSelfSignedHttps(True)  # this line is needed if you use self-signed certificate in your scoring service.

# Request data goes here
# The example below assumes JSON formatting which may be updated
# depending on the format your endpoint expects.
# More information can be found here:
# https://docs.microsoft.com/azure/machine-learning/how-to-deploy-advanced-entry-script
body = str.encode(json.dumps(data))

# Replace this with the primary/secondary key or AMLToken for the endpoint
if not api_key:
    raise Exception("A key should be provided to invoke the endpoint")

# The azureml-model-deployment header will force the request to go to a specific deployment.
# Remove this header to have the request observe the endpoint traffic rules
headers = {'Content-Type': 'application/json',
           'Authorization': ('Bearer ' + api_key),
           'azureml-model-deployment': model_name}

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)
    result = json.loads(response.read())
    print('api in', result[0]['api in'])
    print()
    print('test model out', result[0]['test model out'])
    print()
    print('top n context: ', '\n', result[1][0], '\n', result[1][1], '\n', result[1][2])
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
```

![](https://hackmd.io/_uploads/Bk7i7l8Ln.png)

- ***Test data***
    - model
        - xgboost training data
          ![](https://hackmd.io/_uploads/B1PHElL82.png)
        - Test input given to the model inside the endpoint
          ![](https://hackmd.io/_uploads/ry4Y4xUI2.png)
    - cosine similarity
        - Test embedding tables in the database
            - testv2_ebd.csv: 3-dimensional embeddings
              ![](https://hackmd.io/_uploads/BkWqT5V82.png)
            - testv2_qa.csv
              ![](https://hackmd.io/_uploads/SkVekiVI2.png)
        - Test embedding to match: ```embedding = [0.8, 0.16, -0.24]```

### **AML Studio**

![](https://hackmd.io/_uploads/r1xObyULh.png)

## Cost

- 2023/6/1 17:30
  ![](https://hackmd.io/_uploads/B1kWByII3.png)
- 3 hours before 2023/6/1 17:30 (when the build had just started)
  ![](https://hackmd.io/_uploads/ByrIj18Uh.png)
- 2023/6/1 18:52
  ![](https://hackmd.io/_uploads/B1CS_g8L2.png)
- 2023/6/1 19:26
  ![](https://hackmd.io/_uploads/B1VDe-L83.png)
- 2023/6/2 10:26
  ![](https://hackmd.io/_uploads/S1TrXALL2.png)
  ![](https://hackmd.io/_uploads/BJ8vXRLUh.png)
- 2023/6/2 13:48
  ![](https://hackmd.io/_uploads/HkzhfbvU3.png)
- 2023/6/1~6/5
  ![](https://hackmd.io/_uploads/BJ2P7fsUn.png)

## How to pause the endpoint

Not found yet.
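
As a possible workaround rather than a true pause, and assuming it is acceptable to lose the deployment and rebuild it later, the endpoint can be deleted so that its compute is released. The sketch below relies on `MLClient.online_endpoints.begin_delete` from `azure-ai-ml`. The helper name `delete_online_endpoint` and the endpoint name `lulutestept` (guessed from the scoring URL) are illustrative assumptions, and this has not been verified against the endpoint built in this document.

```python
def delete_online_endpoint(ml_client, endpoint_name):
    """Delete an online endpoint together with its deployments and wait until done.

    ml_client is expected to be an azure.ai.ml.MLClient, built the same way
    as in the model-registration step.
    """
    # begin_delete starts a long-running operation and returns a poller
    poller = ml_client.online_endpoints.begin_delete(name=endpoint_name)
    # result() blocks until the service reports that the deletion has finished
    poller.result()
    return endpoint_name


# Usage (assumption: the endpoint is named "lulutestept"):
# from azure.ai.ml import MLClient
# from azure.identity import DefaultAzureCredential
# ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)
# delete_online_endpoint(ml_client, "lulutestept")
```

Deletion cannot be undone, so serving again means repeating the endpoint creation steps.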