# Azure Databricks with Azure DevOps CI/CD

[TOC]

<!--
## Prerequisite
- Install Python 3.8
  - [Python 3.8.10 - Download Windows installer (64-bit)](https://www.python.org/ftp/python/3.8.10/python-3.8.10-amd64.exe)
- Visual Studio Code
- Azure Databricks
-->

## 0) Lab environment setup

This chapter creates the Databricks environment used by the later deployment steps, plus the Azure DevOps organization and project used for the CI/CD hands-on.

### Set up the Databricks environment

- Go to the [Azure Portal](https://portal.azure.com/), search for `databricks`, and select **Azure Databricks**
![](https://i.imgur.com/9IzyXQu.png)
- Click ![](https://i.imgur.com/g7uY2Qr.png =80x) to create a new resource
- Fill in the **Basics** tab
  - Subscription: <XXXX'S SUBSCRIPTION>
  - Resource Group: `rg-xxx-databricks-lab`
  - Workspace Name: `workspace-xxx-lab`
  - Region: `East US`
  - Pricing Tier: Standard
![](https://i.imgur.com/VrJJrVK.png)
- Fill in the **Networking** tab
  - Deploy Azure Databricks workspace with Secure Cluster Connectivity (No Public IP): `No`
  - Deploy Azure Databricks workspace in your own Virtual Network (VNet): `No`
![](https://i.imgur.com/8QRufYY.png =400x)
- Click ![](https://i.imgur.com/TCWcf8R.png =150x)
- Verify the settings, then click ![](https://i.imgur.com/I7qGgIg.png =90x)
![](https://i.imgur.com/I2owLIC.png =400x)
- Wait for the deployment to complete
![](https://i.imgur.com/J69d6m8.png)

### Set up an Azure Storage Account for the analysis data

- Open the resource group you just created and click ![](https://i.imgur.com/QUHh7CI.png =80x)
- Select Storage Account
![](https://i.imgur.com/JqQXYPG.png)
- Fill in the **Basics** tab
  - Region: East US
  - Performance: Standard
  - Redundancy: Locally-redundant storage (LRS)
![](https://i.imgur.com/jBK6MBa.png)
- When done, click ![](https://i.imgur.com/ecxYjDW.png =150x), verify the settings, then click Create
- Once the deployment completes, click Go to resource
![](https://i.imgur.com/ubqQ91E.png)
- Note down the Storage Account name
![](https://i.imgur.com/vB4zXml.png)
- In the left menu, under Data storage, select Containers, click ![](https://i.imgur.com/5WtuT5C.png =100x), name the container `datasource`, then click Create; note down the container name
![](https://i.imgur.com/vSFxXXq.png)
- Open CMD and clone today's sample repository [huier23/sample-databricks-lab](https://github.com/huier23/sample-databricks-lab.git); afterwards, confirm that `datasource` contains the two csv files *movies.csv* and *ratings.csv*
  ```
  cd desktop
  git clone https://github.com/huier23/sample-databricks-lab.git
  ```
![](https://i.imgur.com/nUcHFLD.png)
- Back in the Azure Portal, open the storage container you just created
- Click ![](https://i.imgur.com/cem8OdB.png =80x) > Select a file ![](https://i.imgur.com/2ZXGVp8.png =25x) > pick the two files *sample-databricks-lab/datasource/movies.csv* and *sample-databricks-lab/datasource/ratings.csv* to upload, then click ![](https://i.imgur.com/zowkTh9.png =80x)
![](https://i.imgur.com/vqs4Chr.png)
- Confirm both files were uploaded
![](https://i.imgur.com/m1UZOyU.png)
- In the left menu, under Security + networking, select Access keys, click ![](https://i.imgur.com/VCVCZkq.png =120x), and note down the Key of key1 or key2 (either one works)
![](https://i.imgur.com/qxNGDd2.png)
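With the account name, container name, and key noted down, the portal upload above can also be scripted. A minimal sketch using the `azure-storage-blob` package (an extra `pip install azure-storage-blob`, not a required lab step); the placeholders are the values you just recorded:

```python
from azure.storage.blob import BlobServiceClient

# Values noted down in the previous steps; replace with your own.
storage_account_name = "<STORAGE-NAME>"
storage_account_access_key = "<STORAGE-KEY>"

service = BlobServiceClient(
    account_url=f"https://{storage_account_name}.blob.core.windows.net",
    credential=storage_account_access_key,
)
container = service.get_container_client("datasource")

# Upload the two sample csv files cloned from sample-databricks-lab.
for name in ("movies.csv", "ratings.csv"):
    with open(f"sample-databricks-lab/datasource/{name}", "rb") as f:
        container.upload_blob(name=name, data=f, overwrite=True)
```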
### Set up an Azure DevOps organization

#### Create a new organization

- Go to [Azure DevOps](https://dev.azure.com/) and choose Start free
![](https://i.imgur.com/6Dc0BRJ.png)
- Select Taiwan
![](https://i.imgur.com/uyRKTJf.png)
- Fill in your organization name and region (Central US is recommended)
![](https://i.imgur.com/CddAOwm.png)
- Click Continue to start creating the service
![](https://i.imgur.com/48vvVPv.png)
- When it finishes you are redirected to the screen below; create the first project and choose Private
![](https://i.imgur.com/S5WUy2n.png)

#### The first project

- After creation you will see a ready-made project
![](https://i.imgur.com/NetSWhH.png)
- Click ![](https://i.imgur.com/doL8aJn.png =200x) in the upper left to return to the organization home page

#### Billing setup

:::info
:bulb: The free pipeline currently has to be requested in advance, so billing must be attached before pipelines can be used. For how to request the free pipeline, see the [Reference](https://hackmd.io/@msazuredev/HkaaHy2AO)
:::

- From the home page, click Organization Settings at the bottom left
![](https://i.imgur.com/3rSjVaL.png)
- Click Billing, choose Set up billing, and pick a subscription you can use (free trial/MSDN subscriptions cannot be used)
![](https://i.imgur.com/IDrt2BG.png)
- Set the MS Hosted CI/CD count to 1
![](https://i.imgur.com/QlTrovc.png)

#### Create the project used in today's lab

- Back on the home page, click ![](https://i.imgur.com/1sXFXRL.png =100x) in the upper right
![](https://i.imgur.com/pfaYvzf.png)
- Set Project name to `databrick-cicd-workshop`, set Visibility to Private, then click Create
![](https://i.imgur.com/w16IcTP.png =500x)

## 1) Develop in the Databricks workspace

### Databricks workspace setup

- Once the deployment completes, click ![](https://i.imgur.com/Iscch77.png =130x)
- On the Azure Databricks Service page, click ![](https://i.imgur.com/YluCbqu.png =150x)
![](https://i.imgur.com/teMXCvk.png)
- Under Get Started, click Create a cluster
![](https://i.imgur.com/PKFzTGR.png)
- Click ![](https://i.imgur.com/KAHhxId.png =100x)
![](https://i.imgur.com/ZwrJ30Y.png)
- Configure the new cluster
  - Cluster Name: `cluster-xxx-lab`
  - Cluster mode: `Standard`
  - Databricks runtime version: `Runtime: 10.4 LTS`
  - Autopilot options: Disable autoscaling
  - Worker type: `Standard_DS3_v2`
  - Workers: `2`
![](https://i.imgur.com/PtBUe9b.png)

### Set up source control on Databricks

- Open [Azure DevOps](https://dev.azure.com/) > Repos > New repository
![](https://i.imgur.com/xC9hK6T.png)
- Add a repository named `databrick-lab`, check Add a README, then click Create
![](https://i.imgur.com/JQE88Ix.png =400x)
- Click ![](https://i.imgur.com/8EMOiON.png =80x) in the upper right and note down the repository URL
![](https://i.imgur.com/XDvelms.png =400x)
- Back in the Databricks portal, click ![](https://i.imgur.com/1sjvdFI.png =100x) > ![](https://i.imgur.com/EH9c9oS.png =100x) > Git integration
![](https://i.imgur.com/ijqTpa9.png)
- Set the Git provider to Azure DevOps Services (Azure Active Directory)
![](https://i.imgur.com/52tTGLK.png =600x)
<!--
- Click ![](https://i.imgur.com/bipynDA.png =100x) > Admin Console > Workspace settings
![](https://i.imgur.com/XZn0n7z.png)
- Under Workspace Settings, confirm "Files in Repos" is enabled
![](https://i.imgur.com/NjThtv7.png)
-->
- Click ![](https://i.imgur.com/obAE8as.png =80x) > ![](https://i.imgur.com/3TXVcUo.png =90x) and paste the URL of the repository you just created
![](https://i.imgur.com/7L27cWH.png)

### Write code in the workspace

- Click ![](https://i.imgur.com/l1HlOwN.png =130x) > Users > xxxx@xxxx.com, right-click and choose Create > Folder
![](https://i.imgur.com/2N2uSX4.png)
- Add a folder named `workspace` and click Create Folder
![](https://i.imgur.com/YSwSUt3.png =600x)
- Click ![](https://i.imgur.com/wuFfVox.png =40x), find the `workspace` folder, right-click and choose Create > Notebook
![](https://i.imgur.com/0eUcGrx.png)
- Add an `analysis` notebook and click Create
![](https://i.imgur.com/38JAI6z.png =500x)
- Add the code
  - Import the libraries and run the cell (Shift + Enter)
    ```python
    from pyspark.sql.functions import split, explode, col, udf
    from pyspark.sql.types import *
    from pyspark.sql import SparkSession
    ```
  - Add the following code to configure the data source; replace the storage values with those of your own environment
    ```python
    spark = SparkSession.builder.appName('temps-demo').getOrCreate()

    # Set up the storage account connection
    container_name = "datasource"
    storage_account_name = "<STORAGE-NAME>"
    storage_account_access_key = "<STORAGE-KEY>"

    spark.conf.set("fs.azure.account.key." + storage_account_name + ".blob.core.windows.net", storage_account_access_key)
    ```
  - Add a new cell that loads the data, then run it to verify
    ```python
    # Get the storage data locations
    ratingsLocation = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net/ratings.csv"
    moviesLocation = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net/movies.csv"

    # Load the ratings and movies data
    ratings = spark.read.format("csv") \
        .option("inferSchema", "true") \
        .option("header", "true") \
        .load(ratingsLocation)
    movies = spark.read.format("csv") \
        .option("inferSchema", "true") \
        .option("header", "true") \
        .load(moviesLocation)
    ```
  - Add a cell that displays the movies data
    ```python
    movies.show()
    ```
    ![](https://i.imgur.com/naB2FbK.png)
  - Add a cell that displays the ratings data
    ```python
    ratings.show()
    ```
    ![](https://i.imgur.com/zA22S3B.png)
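The lab itself stops at displaying the raw tables. Purely as an illustration of where the notebook could go next (not a required step), a join-and-aggregate sketch; it assumes the MovieLens-style columns `movieId`, `rating`, and `title` visible in the screenshots above:

```python
from pyspark.sql import functions as F

# Join ratings to movie titles and compute an average rating per movie;
# column names (movieId, rating, title) are assumed from the sample csv files.
top_movies = (
    ratings.join(movies, on="movieId")
    .groupBy("title")
    .agg(F.avg("rating").alias("avg_rating"), F.count("rating").alias("num_ratings"))
    .orderBy(F.desc("num_ratings"))
)
top_movies.show(10, truncate=False)
```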
### Check the new files in from the workspace

- Right-click the `workspace` folder > Move
![](https://i.imgur.com/VTO0ebF.png)
- Choose the Repo you just added
![](https://i.imgur.com/fIm066b.png)
- The files are now in the Repo, but they have not yet been checked in to the remote repository
  - Click ![](https://i.imgur.com/9OQFHgQ.png =80x) at the top
  - Fill in a commit message, then click ![](https://i.imgur.com/6ylz7y9.png =120x)
![](https://i.imgur.com/uyC7nQc.png)
- Go back to [Azure DevOps](https://dev.azure.com/) and confirm the files were synced
![](https://i.imgur.com/cF5bKsy.png)

## 2) Develop with Visual Studio Code

### databricks-connect environment setup

- Install JRE 8: Java Runtime Environment (JRE) 8. The client has been tested with the OpenJDK 8 JRE. The client does not support Java 11.
- Install Python 3.8
  - The Python path must be added to the environment variables: `C:\Users\AzureUser\AppData\Local\Programs\Python\Python38`
  - The pip path must be added to the environment variables: `C:\Users\AzureUser\AppData\Local\Programs\Python\Python38\Scripts`

:::info
:bulb: Make sure the Python version matches the Databricks runtime version
![](https://i.imgur.com/1oV25q1.png)
:::

- Install virtualenv
  ```
  pip install virtualenv
  ```
- Confirm virtualenv was installed successfully
  ```
  pip list
  ```
  ![](https://i.imgur.com/pBNACEp.png =300x)
- Clone the Azure Repos repository to your desktop and change into the folder
  ```
  cd desktop
  git clone <Azure-Repo-URL>
  cd databrick-lab
  ```
- Use virtualenv to create a virtual environment in this directory
  - If only one Python version is installed
    ```
    virtualenv .env
    ```
  - If multiple Python versions are installed, specify the Python version for virtualenv
    ```
    virtualenv -p <PythonVersion> .env
    ```
  ![](https://i.imgur.com/wXpjBcT.png)
- Activate the virtual environment; on success you will see ![](https://i.imgur.com/jtVvlMX.png =50x) (this is the Windows command)
  ```
  .env\Scripts\activate
  ```
  ![](https://i.imgur.com/JTwEChR.png =500x)
- Make sure PySpark is not installed in the environment
  - Check
    ```
    pip list
    ```
    ![](https://i.imgur.com/mi7FXRY.png =480x)
  - If it is installed, remove it (it conflicts with databricks-connect)
    ```
    pip uninstall PySpark
    ```
- Install [databricks-connect](https://pypi.org/project/databricks-connect/10.4.0b0/) (the version must match the Databricks cluster)
  ```
  pip install databricks-connect==10.4.0b0
  ```
  ![](https://i.imgur.com/8yRrQ1d.png)
- Go back to the [Azure Portal](https://portal.azure.com/), get the Databricks URL, and note it down
![](https://i.imgur.com/PPrH1V0.png)
- Back in the Databricks portal, click ![](https://i.imgur.com/PFwI7YV.png =100x) and select the cluster created earlier
![](https://i.imgur.com/g5Z8W67.png)
- On the Configuration page choose JSON view, get the cluster_id, and note it down
![](https://i.imgur.com/JteYN5I.png)
- Get the Organization ID from the `?o=<Organization-ID>#` part of the address bar
![](https://i.imgur.com/jdTuzmT.png)
- Get a Databricks PAT
  - Settings > User Settings
  ![](https://i.imgur.com/U3W1vRM.png)
  - Click ![](https://i.imgur.com/ur72nj0.png =130x), fill in the details, and press Generate
  ![](https://i.imgur.com/74TeDog.png)
  - Note down the generated PAT and press Done
- Back on the local machine, open CMD and configure databricks-connect
  ```
  databricks-connect configure
  ```
  - Do you accept the above agreement: `y`
  - Databricks Host [no current value, must start with https://]: `<Databricks-URL>`
  - Databricks Token [no current value]: `<Databricks-PAT>`
  - Cluster ID (e.g., 0921-001415-jelly628) [no current value]: `<Databricks-Cluster-ID>`
  - Org ID (Azure-only, see ?o=orgId in URL) [0]: `<Databricks-Organization-ID>`
  - Port [15001]: use the default
  ![](https://i.imgur.com/3Ja3IId.png)
- Verify the connection
  ```
  databricks-connect test
  ```
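Besides the bundled `databricks-connect test`, you can sanity-check the connection from Python yourself. A minimal sketch, run inside the activated virtualenv and assuming the configuration above succeeded and the cluster is running:

```python
from pyspark.sql import SparkSession

# With databricks-connect installed, this SparkSession is created against the
# remote Databricks cluster configured by `databricks-connect configure`.
spark = SparkSession.builder.getOrCreate()

# A trivial remote job: if this prints 45, the round trip to the cluster works.
print(spark.range(10).selectExpr("sum(id) AS total").first()["total"])
```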
- Fix the Windows Hadoop issue
![](https://i.imgur.com/YDlljs9.png)
  - For the Hadoop installation, see [Hadoop : How to install in 5 Steps in Windows 10](https://medium.com/analytics-vidhya/hadoop-how-to-install-in-5-steps-in-windows-10-61b0e67342f8)
    - 7-Zip is needed to extract the .tar; run it as admin
    - Hadoop uses Java; confirm `JAVA_HOME` is in the environment variables, and add the `HADOOP_HOME` path as well (neither the JAVA_HOME nor the HADOOP_HOME path may contain spaces)
    ![](https://i.imgur.com/ZZJeXVP.png)
    ![](https://i.imgur.com/8Hi70cH.png)
    - Manually place winutils under %HADOOP_HOME%/bin
    ![](https://i.imgur.com/qgPdLvp.png)
- **Required Windows environment: [DatabricksConnectEnv-Win](https://1drv.ms/u/s!AtAJk4ApWmAFhJQDG8fkp2p4Q3T41A?e=5Ahf9N)**

### Develop with VS Code and the Jupyter Notebook extension

- Open VS Code, click ![](https://i.imgur.com/JOwua5K.png =30x), search for `jupyter`, and install the [Jupyter Extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter)
![](https://i.imgur.com/VlSPnWe.png)
- Search for `python` and install the [Python Extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python)
![](https://i.imgur.com/0Vrzsvr.png)
- Click File > Open Folder
![](https://i.imgur.com/d4r8cy3.png =500x)
- Choose the folder you created
![](https://i.imgur.com/rbjBUeW.png)
- Right-click an empty area in the EXPLORER > New Folder
![](https://i.imgur.com/14sE10o.png =500x)
- Add a `vscode` folder
![](https://i.imgur.com/LRXCpnQ.png =500x)
- Choose Terminal > New Terminal from the top menu
![](https://i.imgur.com/mxMSiKQ.png)
- Check whether the virtual environment is activated; if not, run `.env/Scripts/activate`
![](https://i.imgur.com/Iwv08d9.png)
- Press **Shift + Ctrl + P**, type `jupyter` in the popup, and choose Create: New Jupyter Notebook
![](https://i.imgur.com/s2ev5et.png =600x)
- A .ipynb file is created; confirm the environment in the upper right is the virtual environment
![](https://i.imgur.com/dY9VAws.png)
  - If it is not, click it to change it
  ![](https://i.imgur.com/kmy3QbB.png)
- Add the code
  - Import the libraries
    ```python
    from pyspark.sql.functions import split, explode, col, udf
    from pyspark.sql.types import *
    from pyspark.sql import SparkSession
    ```
  - Click ![](https://i.imgur.com/AJW28au.png =25x) to run
    - A popup will remind you to install the jupyter and notebook packages; click Install
    ![](https://i.imgur.com/qcWKvgl.png)
  - Click ![](https://i.imgur.com/FfHFYO4.png =60x) to add a new code cell
  ![](https://i.imgur.com/aZyTsSX.png)
  - Add the following code to configure the data source; replace the storage values with those of your own environment
    ```python
    spark = SparkSession.builder.appName('temps-demo').getOrCreate()

    # Set up the storage account connection
    container_name = "datasource"
    storage_account_name = "<STORAGE-NAME>"
    storage_account_access_key = "<STORAGE-KEY>"

    spark.conf.set("fs.azure.account.key." + storage_account_name + ".blob.core.windows.net", storage_account_access_key)
    ```
  - Run it to verify
  ![](https://i.imgur.com/8SCwMg5.png)
  - Add another cell that loads the data, then run it to verify
    ```python
    # Get the storage data locations
    ratingsLocation = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net/ratings.csv"
    moviesLocation = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net/movies.csv"

    # Load the ratings and movies data
    ratings = spark.read.format("csv") \
        .option("inferSchema", "true") \
        .option("header", "true") \
        .load(ratingsLocation)
    movies = spark.read.format("csv") \
        .option("inferSchema", "true") \
        .option("header", "true") \
        .load(moviesLocation)
    ```
  - Add a cell that displays the movies data
    ```python
    movies.show()
    ```
    ![](https://i.imgur.com/2OA5K3c.png)
  - Add a cell that displays the ratings data
    ```python
    ratings.show()
    ```
    ![](https://i.imgur.com/3JSTZRA.png)
- Press **Ctrl + S** to save; save the file under the `vscode` folder with File name `analysis.ipynb` and Save as type `All Files`, then click Save
![](https://i.imgur.com/E7c3XDa.png)
- Add a `.gitignore` file in the current directory
  ```.gitignore
  # Byte-compiled / optimized / DLL files
  __pycache__/

  # Distribution / packaging
  .Python
  build/
  dist/
  downloads/
  eggs/
  .eggs/
  lib/
  lib64/
  sdist/
  wheels/

  # PyInstaller
  # Usually these files are written by a python script from a template
  # before PyInstaller builds the exe, so as to inject date/other infos into it.
  *.manifest
  *.spec

  # Jupyter Notebook
  .ipynb_checkpoints

  # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
  __pypackages__/

  # Environments
  .env
  .venv
  env/
  venv/
  ENV/
  env.bak/
  venv.bak/
  ```
![](https://i.imgur.com/SGBPT3Y.png)
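A side note before committing: `analysis.ipynb` currently hardcodes the storage access key, and the `.gitignore` above does not protect it. A hedged variation of the configuration cell that reads the key from an environment variable instead; the name `STORAGE_ACCOUNT_KEY` is an arbitrary choice for this sketch:

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('temps-demo').getOrCreate()

container_name = "datasource"
storage_account_name = "<STORAGE-NAME>"

# Read the access key from the environment rather than committing it in the
# notebook; STORAGE_ACCOUNT_KEY is a made-up variable name for this sketch.
storage_account_access_key = os.environ["STORAGE_ACCOUNT_KEY"]

spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key,
)
```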
### Put the locally developed files under version control

- Open the terminal in VS Code
![](https://i.imgur.com/tuucPQ5.png)
- Use git
  ```
  git add .
  git commit -m "add vscode/analysis"
  git push origin main
  ```

### View the updated files on Databricks

- Back in the Databricks portal, click Repos > xxxx@xxxx.com > databrick-lab > ![](https://i.imgur.com/UZt9MgX.png =70x)
![](https://i.imgur.com/JmtpCVY.png)
- Choose Pull > Confirm in the upper right of the window
![](https://i.imgur.com/ACahenA.png)
- The pull completes
![](https://i.imgur.com/ViZk5lt.png =400x)
- Click the databrick-lab folder; the files are updated
![](https://i.imgur.com/tj5pHkB.png =500x)

### Import the .ipynb file into the Databricks workspace

- Choose Workspace > Users > xxxx@xxxx.com, right-click > Import
![](https://i.imgur.com/HL9K5zR.png)
- Choose File > browse and select the .ipynb file you just created
![](https://i.imgur.com/6P2P5po.png)
- Click Import when done
![](https://i.imgur.com/Yekb4wO.png =500x)
- Click Run All to execute
![](https://i.imgur.com/JGE4XV8.png)

## 3) Configure Azure DevOps CI/CD

### The role of Azure DevOps Services in MLOps

![](https://i.imgur.com/LwklaSw.png)

- Source Control (Azure Repos - Git)
- Notebook Deployment
- Release to Multiple Environments
  - Dev
  - Staging (UAT)
  - Production
- Artifact management

### Databricks CI/CD

- Back in [Azure DevOps](https://dev.azure.com/), choose Pipelines > Releases > ![](https://i.imgur.com/xTf85Fy.png =100x)
![](https://i.imgur.com/ODSNlL2.png)
- Choose ![](https://i.imgur.com/sNrlK0V.png =100x)
![](https://i.imgur.com/uevEskR.png =500x)
- Name the stage `vscode`
![](https://i.imgur.com/GUB1QTF.png =500x)
- Click + Add an artifact
![](https://i.imgur.com/QZHSNLT.png =500x)
- Configure the artifact source
  - Source type: Azure Repos
  - Project: `databrick-cicd-workshop`
  - Source (repository): `databrick-lab`
  - Default branch: `main`
![](https://i.imgur.com/U7vmrOl.png =500x)
- Click ![](https://i.imgur.com/O6nAFma.png =40x) and enable the Continuous deployment trigger
![](https://i.imgur.com/briRj4u.png)
- Click the vscode stage's ![](https://i.imgur.com/idE4y9O.png =80x) to set up the deployment workflow
- In the Agent job, click +, search for `databrick`, and choose **Configure Databricks**
![](https://i.imgur.com/oZqolkB.png)

:::info
:bulb: **If the tasks do not show up, look in the Marketplace section**
- The first time, you need to install the DevOps for Azure Databricks extension
![](https://i.imgur.com/1YAree3.png =500x)
- Click Get it free
- Choose your organization and click Install
![](https://i.imgur.com/ykXN1fz.png =300x)
:::

- Configure **Configure Databricks**
  - Workspace URL: `$(WORKSPACE_URL)`
  - Access Token: `$(PAT)`
![](https://i.imgur.com/7rL2QDK.png)
- Click +, search for `databrick`, and choose **Start a Databricks Cluster**
![](https://i.imgur.com/7McpyoB.png)
- Configure **Start a Databricks Cluster**
  - Start a Databricks Cluster: `$(CLUSTER_ID)`
![](https://i.imgur.com/CwlTSll.png)
- Click +, search for `databrick`, and choose **Deploy Databricks Notebooks**
![](https://i.imgur.com/gXOJhIS.png)
- Configure **Deploy Databricks Notebooks**
  - Notebooks folder: `$(System.DefaultWorkingDirectory)/_databrick-lab`
  - Workspace folder: `/Shared`
- Click +, search for `databrick`, and choose **Execute Databricks Notebook**
![](https://i.imgur.com/mOHulR5.png)
- Configure **Execute Databricks Notebook**
  - Notebook path (at workspace): `/Shared/$(System.StageDisplayName)/analysis`
  - Existing Cluster ID: `$(CLUSTER_ID)`
- Click +, search for `databrick`, and choose **Wait for Databricks Notebook execution**
![](https://i.imgur.com/PUQw0WT.png)
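These marketplace tasks wrap the Databricks REST API. Purely as a hedged illustration of what **Execute Databricks Notebook** plus **Wait for Databricks Notebook execution** amount to (not part of the lab), a sketch against the public Jobs API 2.0 using `requests`; the three placeholders are the same values stored in the variable group configured in the next step:

```python
import time

import requests

# Same values as the WORKSPACE_URL / PAT / CLUSTER_ID pipeline variables.
WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
PAT = "<DATABRICK-PAT>"
CLUSTER_ID = "<DATABRICK-CLUSTER_ID>"

headers = {"Authorization": f"Bearer {PAT}"}

# Submit a one-time run of the deployed notebook on the existing cluster.
run = requests.post(
    f"{WORKSPACE_URL}/api/2.0/jobs/runs/submit",
    headers=headers,
    json={
        "run_name": "release-notebook-run",
        "existing_cluster_id": CLUSTER_ID,
        "notebook_task": {"notebook_path": "/Shared/vscode/analysis"},
    },
)
run.raise_for_status()
run_id = run.json()["run_id"]

# Poll until the run reaches a terminal state; roughly what the
# "Wait for Databricks Notebook execution" task does for you.
while True:
    state = requests.get(
        f"{WORKSPACE_URL}/api/2.0/jobs/runs/get",
        headers=headers,
        params={"run_id": run_id},
    ).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print(state.get("result_state"), state.get("state_message", ""))
        break
    time.sleep(10)
```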
- Click Variables > Variable groups > Manage variable groups
![](https://i.imgur.com/pWpDCgX.png)
- Click ![](https://i.imgur.com/4bPqKxc.png =120x), add three variables, then click ![](https://i.imgur.com/HuuTeK2.png =70x)
  - WORKSPACE_URL: `<DATABRICK-WORKSPACE_URL>`
  - PAT: `<DATABRICK-PAT>`
  - CLUSTER_ID: `<DATABRICK-CLUSTER_ID>`
![](https://i.imgur.com/MYBedKW.png)
- Back in the release pipeline, click ![](https://i.imgur.com/F3ZSIrq.png =150x) to link the variable group you just created
![](https://i.imgur.com/JqCEWiP.png)
- Click Pipeline in the top menu and choose Clone below the vscode stage
![](https://i.imgur.com/i8eudwB.png =500x)
- Click the Copy of vscode stage and change the Stage name to `workspace`
![](https://i.imgur.com/RKdN2IV.png)
  - In a real-world setup, this is where you would configure a different Cluster ID or Databricks URL to deploy to different environments
- Click the workspace stage's ![](https://i.imgur.com/Ugd5gDD.png =40x), enable Pre-deployment approvals, and set the approver
- Rename the release pipeline to `Release to databricks`
![](https://i.imgur.com/JUKII4a.png)
- Save, then click Create Release
- Click the status bar at the top to watch the release status
![](https://i.imgur.com/WhXhtxO.png =600x)
- Approve the deployment of the workspace stage
![](https://i.imgur.com/1sdrvWk.png)
- View the stage log
![](https://i.imgur.com/vtzbqRq.png)
- Choose Wait for Notebook execution
![](https://i.imgur.com/Ub8xMp5.png)
- Scroll to the bottom and open the URL after "For details, go to the execution page" to see the run inside Databricks
![](https://i.imgur.com/jLIg2G6.png)
![](https://i.imgur.com/7DKiuqj.png)

### Python Package (Artifact Management)

- **PyPI** is a package index at https://pypi.org hosting a wide variety of Python packages. While developing an application you can search it for the functionality you need, install a package, and import it, which speeds up development.
- **pip** is the package manager for the global environment, used to install and manage Python packages from PyPI; packages installed with pip are available to every project.
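The package built in the next subsection comes from huier23/py-packaging-sample. Its actual source lives in that repo; as a hedged sketch of the kind of module `python -m build` turns into a wheel, inferred from the test snippet used at the end of this lab (`from count import add; add.add_one(3)`), not copied from the repo:

```python
# count/add.py -- hypothetical module layout inferred from the test snippet;
# the real implementation is in huier23/py-packaging-sample.
def add_one(x: int) -> int:
    """Return x plus one."""
    return x + 1
```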
#### Set up a CI pipeline to build the Python package

- Download the sample code [huier23/py-packaging-sample](https://github.com/huier23/py-packaging-sample)
  ```
  cd desktop
  git clone https://github.com/huier23/py-packaging-sample.git
  cd py-packaging-sample
  ```
- In Azure DevOps, add a new repository named `count-lib` without a README
![](https://i.imgur.com/3HUzeYv.png)
- Note down the Repo URL
![](https://i.imgur.com/QRCMgYk.png)
- Back in CMD, push the code to the [Azure DevOps](https://dev.azure.com/) Repo with git
  ```
  git remote add devops <Azure-Devops-Repo-URL>
  git push devops master
  ```
- Confirm the push completed
![](https://i.imgur.com/BUrwfpk.png)
- Choose Pipelines > New pipeline
![](https://i.imgur.com/zt1EKwC.png)
- Choose Use the classic editor
![](https://i.imgur.com/kcnjW3t.png =500x)
- Configure the source
  - Azure Repos Git: Select a source
  - Repository: count-lib
![](https://i.imgur.com/9O1gGhW.png =500x)
- Choose ![](https://i.imgur.com/QxVYojh.png =90x)
![](https://i.imgur.com/VnyuFxj.png)
- Add a task, search for `python`, and choose Use Python version
![](https://i.imgur.com/iAHyLpV.png)
- Configure Use Python version
  - Version spec: `>= 3.6`
![](https://i.imgur.com/nVd6wkv.png)
- Add a task, search for `command`, and choose Command line
![](https://i.imgur.com/JyJwq5T.png)
- Configure Command line
  - Display name: `Upgrade pip`
  - Script: `python -m pip install --upgrade build`
![](https://i.imgur.com/CmrVWY3.png)
- Add a task, search for `command`, and choose Command line
![](https://i.imgur.com/lYpSwiE.png)
- Configure Command line
  - Display name: `Build python package`
  - Script: `python -m build`
![](https://i.imgur.com/G8UvjTN.png)
- Add a task, search for `copy`, and choose Copy files
![](https://i.imgur.com/C68LJFC.png)
- Configure Copy files
  - Source Folder: `$(Build.Repository.LocalPath)/dist`
  - Target Folder: `$(Build.ArtifactStagingDirectory)`
![](https://i.imgur.com/XO6hAnB.png)
- Add a task, search for `publish`, and choose Publish build artifacts
![](https://i.imgur.com/QxjGgMD.png)
- Configure Publish build artifacts
  - Path to publish: `$(Build.ArtifactStagingDirectory)`
  - Artifact name: `dist`
![](https://i.imgur.com/PsgN4sH.png)
- Click Triggers and enable continuous integration
![](https://i.imgur.com/VhvgwzJ.png)
- Rename the CI pipeline to `count-lib-CI`, then click Save & queue
![](https://i.imgur.com/2zlpIas.png)
- After the run finishes, open its summary and click ![](https://i.imgur.com/uPlSb1U.png =170x)
![](https://i.imgur.com/ReGSmv9.png)
- Inspect the files under dist
![](https://i.imgur.com/s57nNhB.png)

#### Set up a CD pipeline to release to Artifacts

<!-- - Support PyPI, Maven or universal package (NO CRAN) -->

- Click Artifacts > + Create Feed
![](https://i.imgur.com/ryWTY5w.png)
- Configure Create new feed
  - Name: `databricks-lib`
  - Visibility: Members of huier-teamservice
  - Scope: Project: databrick-cicd-workshop (Recommended)
![](https://i.imgur.com/2SeDB4T.png =400x)
- Choose Releases > + New > + New release pipeline
![](https://i.imgur.com/QUSoq5u.png =500x)
- Choose Empty job
![](https://i.imgur.com/H4VkSZN.png =500x)
- Change the stage name to `Release to Artifact`
![](https://i.imgur.com/ILdw8p1.png =500x)
- Add an artifact
  - Source type: Build
  - Source (build pipeline): count-lib-CI
![](https://i.imgur.com/9tpowRU.png)
- Click ![](https://i.imgur.com/rowJF5y.png =40x) and enable the Continuous deployment trigger
![](https://i.imgur.com/HB3It4p.png)
- Click the Release to Artifact stage's ![](https://i.imgur.com/ZFejodq.png =100x)
![](https://i.imgur.com/8v0ld7S.png =500x)
- Add a task, search for `command`, and choose Command line
![](https://i.imgur.com/dyLGEbJ.png)
- Configure Command line
  - Display name: `pip install wheel & twine`
  - Script:
    ```
    pip install wheel
    pip install twine
    ```
![](https://i.imgur.com/L9PFsbY.png)
- Add a task, search for `python`, and choose Python twine upload authenticate
![](https://i.imgur.com/zen8V1D.png)
- Configure Python twine upload authenticate
  - My feed (select below): `databricks-lib`
![](https://i.imgur.com/GcTmDoI.png)
- Add a task, search for `command`, and choose Command line
![](https://i.imgur.com/cIAg3or.png)
- Configure Command line
  - Display name: `twine upload package`
  - Script: `python -m twine upload --repository databricks-lib --config-file $(PYPIRC_PATH) $(System.DefaultWorkingDirectory)/**/dist/*.whl`
![](https://i.imgur.com/ZbRihtu.png)
- Rename the release pipeline to `Release to Artifact`, click Save, then Create release
![](https://i.imgur.com/T6sRaCi.png)
- Check the deployment status
![](https://i.imgur.com/H8OC5c4.png)
- When it finishes, check the result under Artifacts
![](https://i.imgur.com/7jV8h8w.png)
#### Get the package

- Verify in the [Azure Portal](https://portal.azure.com/) using the Azure Cloud Shell environment
- Update pip
  ```
  python -m pip install --upgrade pip
  ```
- Install the keyring (already installed in the Azure Cloud Shell environment; you can skip this)
  ```
  pip install keyring artifacts-keyring
  ```
- Create a new folder and change into it
  ```
  mkdir hello_world
  cd hello_world
  ```
- Create a Python virtual environment and activate it
  ```
  virtualenv .env
  source .env/bin/activate
  ```
  ![](https://i.imgur.com/gD2IiOD.png)
- Change into `.env` and add a `pip.conf` file
  ```
  cd .env
  code pip.conf
  ```
- Point pip at the feed by pasting in the following and saving
  ```conf=0
  [global]
  index-url=https://pkgs.dev.azure.com/<Organization-Name>/<Project-Name>/_packaging/<Feed-Name>/pypi/simple/
  ```
  ![](https://i.imgur.com/zVKYttT.png)
- Go back to [Azure DevOps](https://dev.azure.com/) and get a PAT
![](https://i.imgur.com/sxKudAR.png =500x)
  - Grant Packaging **Read, write & manage** permission, then record the PAT in a notepad
![](https://i.imgur.com/hVwPIU1.png)
- Back in the Azure Cloud Shell, install the private package with pip
  ```
  pip install count==0.0.1
  ```
  - The user for pkgs.dev.azure.com is your account; the password is the PAT
![](https://i.imgur.com/C4x8MuW.png)
- Back in the new folder, add a test file `test.py`
  ```
  cd ..
  code test.py
  ```

:::info
:bulb: On Windows, device verification pops up automatically, and the Azure DevOps Artifacts Credential Provider then shows up among your PATs
![](https://i.imgur.com/oDagrju.png)
:::

- Add the test code
  ```python=0
  from count import add

  print(add.add_one(3))
  ```
- Run it to verify
  ```
  python test.py
  ```
  ![](https://i.imgur.com/X1OeCrr.png =500x)
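To close the loop back to Databricks, the same private feed can also be consumed from a notebook. A hedged sketch of one common pattern (see the "Install custom Python Libraries from private PyPI on Databricks" link in the references below); the index URL is the one from `pip.conf` above, and the embedded credentials are assumptions: any non-empty username with a PAT that has at least Packaging read scope:

```python
# Hypothetical Databricks notebook cell, not a required lab step.
# "build" is an arbitrary username; <PAT> must have Packaging read scope.
%pip install count==0.0.1 --index-url https://build:<PAT>@pkgs.dev.azure.com/<Organization-Name>/<Project-Name>/_packaging/<Feed-Name>/pypi/simple/
```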
## Reference

- [Python Releases for Windows](https://www.python.org/downloads/windows/)
- [Databricks Connect | Microsoft Doc](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect)
- [Databricks Connect | Databricks Doc](https://docs.databricks.com/dev-tools/databricks-connect.html)
- [Generate Databricks PAT](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/authentication#--generate-a-personal-access-token)
- [Databricks CLI | Microsoft Doc](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/cli/)
- [Databricks CLI | Databricks Doc](https://docs.databricks.com/dev-tools/cli/index.html)
- [Anaconda Install](https://www.anaconda.com/products/distribution)
- [Libraries](https://docs.microsoft.com/en-us/azure/databricks/libraries/)
- [Repos for Git integration](https://docs.microsoft.com/en-us/azure/databricks/repos/)
- [Config PyCharm](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect#pycharm)
- [Azure DevOps Artifact | Python](https://docs.microsoft.com/en-us/azure/devops/artifacts/quickstarts/python-packages?view=azure-devops)
- [Azure Artifacts: best practices](https://docs.microsoft.com/en-us/azure/devops/artifacts/concepts/best-practices?view=azure-devops)
- [Packaging Python Projects](https://packaging.python.org/en/latest/tutorials/packaging-projects/#creating-the-package-files)
- [Build Python apps](https://docs.microsoft.com/en-us/azure/devops/pipelines/ecosystems/python?view=azure-devops)
- [Publish Python packages with Azure Pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/artifacts/pypi?bc=%2Fazure%2Fdevops%2Fartifacts%2Fbreadcrumb%2Ftoc.json&toc=%2Fazure%2Fdevops%2Fartifacts%2Ftoc.json&view=azure-devops&tabs=yaml)
- [Install custom Python Libraries from private PyPI on Databricks](https://towardsdatascience.com/install-custom-python-libraries-from-private-pypi-on-databricks-6a7669f6e6fd)
- [Databricks Script Deployment Task by Data Thirst Ltd | Visual Studio Marketplace](https://marketplace.visualstudio.com/items?itemName=DataThirstLtd.databricksDeployScriptsTasks&targetId=26cd1db8-21ca-4faa-93dc-73fd5a63f717)
- [Azure DevOps: Use predefined variables | Microsoft Doc](https://docs.microsoft.com/en-us/azure/devops/pipelines/build/variables?view=azure-devops&tabs=yaml)