# MLOps

###### tags: `MLOps`

## Introduction

Following up on the approach put forward in the earlier T1 Proposal: the biggest difference between a machine learning system and a conventional software system is that data and machine learning models must also be taken into account.

MLOps builds on DevOps and can be broken down into Machine Learning, Development, and Operation. Development and Operation form the DevOps foundation: by automating the "software delivery" and "infrastructure change" processes, building, testing, and releasing software become faster, more frequent, and more reliable.

![DevOps](https://i.imgur.com/SrFuON1.png)

In the system development flow, the Development team builds the service and hands it to the Operation team, which sets up the serving environment. DevOps reduces the communication cost between the teams and speeds up product iteration and deployment. A machine learning system adds Machine Learning on top of this DevOps foundation. The Machine Learning part has to account for data and models, including how both are validated and tested, and it adds automated iteration of data and models. The end goal is automated deployment of models and services.

## ML CI/CD

The first figure shows the most basic way to build a Machine Learning system. A Data Scientist or Machine Learning Engineer handles data preparation, data validation, model training, and model evaluation, then packages the model as a service, deploys it manually, and wires it into the machine learning system. Because the process depends heavily on the Data Scientist or Machine Learning Engineer who developed it, other team members need extra time to understand the design flow and how the model is packaged and deployed. Development and product iteration slow down, and the teams have to communicate frequently. At this stage the system usually offers a single service, typically prediction only, and lacks monitoring and testing mechanisms.

![manual ML](https://i.imgur.com/krKQyJp.png)

The second figure shows the full MLOps flow with CI/CD introduced. On top of the first figure it adds:

1. Source Code CI/CD
2. Automated Machine Learning Training Pipeline
3. Machine Learning CI/CD (Data, Model)
4. Monitoring

The biggest difference between MLOps and DevOps is that MLOps adds automated validation and testing of data and models on top of DevOps, and has to handle model deployment in addition to service deployment. Once the fully automated flow is in place, monitoring must be added to make sure the service, data, and models are all behaving correctly.

![automatic ML](https://i.imgur.com/GkgAAK7.png)

## MLOps Platform

Unlike a collection of MLOps reference tools, an MLOps platform aims to centrally manage the data and model workflows of a Machine Learning project, automating data and model testing, deployment, monitoring, and so on. It can also be combined with additional CI/CD tools to integrate the ML + DevOps flow. As for the definition, an MLOps platform is one that integrates the full set of MLOps functionality. In my understanding, that mainly means machine learning pipeline definition, model deployment, and model monitoring, along with support for large-scale data processing.

The scope of MLOps is shown in the figure below and can be roughly divided into:

1. Development
2. Machine Learning
3. Deployment
4. Operation

![MLOps platform](https://i.imgur.com/ieI7M4p.png)

Well-known platforms include:

- **Azure Machine Learning**
- DataRobot
- **Kubeflow**
- **Neptune**
- **MLflow**

| Name | Category | Description | Focus |
|------|----------|-------------|-------|
| Algorithmia | Managed | We help enterprise companies develop an optimal path to machine learning operational maturity. | Enterprise, deployment |
| Allegro AI | Managed, Open-source | End-to-end enterprise-grade platform for data scientists, data engineers, DevOps and managers to manage the entire machine learning & deep learning product life-cycle. | Enterprise, Data management |
| cnvrg.io | Managed, Open-source | An end-to-end machine learning platform to build and deploy AI models at scale | Technology agnostic |
| Dataiku | Managed | Dataiku is the platform democratizing access to data and enabling enterprises to build their own path to AI in a human-centric way. | Enterprise, Data Analysis, Business Intelligence |
| DataRobot | Managed | DataRobot is the leading end-to-end enterprise AI platform that automates and accelerates every step of your path from data to value. | AutoML, Enterprise |
| Iguazio | Managed, Open-source | The Iguazio Data Science Platform automates MLOps with end-to-end machine learning pipelines, transforming AI projects into real-world business outcomes. | Structured data |
| Valohai | Managed | Train, Evaluate, Deploy, Repeat. Valohai is the MLOps platform that can automate everything from data extraction to model deployment. | Deep Learning, API-first, Technology agnostic |
| Flyte | Open-source | Lyft’s Cloud Native Machine Learning and Data Processing Platform, Now Open Sourced | Pipelines |
| Kubeflow | Open-source | The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. | Community, Extendability |
| Metaflow | Open-source | A framework for real-life data science | Pipelines |
| MLflow | Open-source | MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. | Experimentation, Spark |

ML platforms can be distinguished along a few characteristics:

1. Whether the platform is offered as a managed service
2. UI and dashboard management
3. The service area the platform focuses on

To build an MLOps platform on Azure services, Azure's PaaS offering App Service can be used to wrap the cloud services Azure provides. The main work then becomes packaging things as Azure services and setting up Terraform to manage the PaaS services.

### Azure Machine Learning

The logical architecture that Azure Machine Learning proposes for MLOps is split into data sources, the model pipeline, and serving. Azure Machine Learning focuses mainly on the middle stage: model retraining and validation of machine learning data and models.

![Logic Architecture](https://i.imgur.com/kJuucVt.png)

The figure below maps the logical architecture above onto existing Azure services. The DevOps part uses Azure DevOps and Azure Pipelines for automated CI/CD, while the model and data (Machine Learning) part relies on Azure Machine Learning for training pipelines and model registration.

![System Architecture](https://i.imgur.com/lOk9CP0.png)

### MLflow

MLflow is an open-source library for managing the lifecycle of machine learning experiments. MLflow Tracking logs and tracks models and focuses mainly on the Machine Learning stage. Azure Machine Learning provides a matching integration.

![mlflow](https://i.imgur.com/2wNXrmB.png)

### Kubeflow

Kubeflow is a model development platform built on Kubernetes. It provides the tools needed to develop models and relies on Kubernetes for flexible control of resources and networking.

![Kubeflow Scope](https://i.imgur.com/h0RuslC.png)

Kubeflow provides a set of logical components, seven main ones, which are stacked together to build a complete Machine Learning workflow.

1. **Central Dashboard**: The central user interface (UI) in Kubeflow
2. **Notebook Servers**: Using Jupyter notebooks in Kubeflow
3. **Kubeflow Pipelines**: Building and deploying ML workflows as pipelines
4. **KFServing**: Kubeflow model deployment and serving toolkit
5. **Katib**: Hyperparameter tuning and neural architecture search
6. Training Operators: Training of ML models in Kubeflow through operators
7. Multi-Tenancy: Multi-user isolation and identity access management (IAM)

The first five components are the most important. Each maps onto a part of the MLOps flow, for example:

1. Start with Kubeflow Notebook Servers to set up the development and experimentation environment. Kubeflow Katib or Azure Machine Learning can be used alongside it for automated model and hyperparameter tuning.
2. After the experimentation stage, use Kubeflow Pipelines to design the project's CI/CD flow, with K8S creating the pods for the pipeline steps. The pipeline design mainly covers the range from code development through the end of model training.
3. Once the model is ready, deploy it as a service with Kubeflow KFServing or with the serving tools of other machine learning and deep learning frameworks (e.g. TensorFlow Serving, TorchServe).
4. Once the full Machine Learning workflow is in place, Kubeflow's Central Dashboard manages the project's workflow configuration, making it easy for the development team to manage and modify the flow.

![Kubeflow Workflow](https://i.imgur.com/XJNEpAh.jpg)

The figure below covers model deployment. Besides KFServing, other off-the-shelf tools are available.

![KFServing](https://i.imgur.com/wlgZgNd.png)

Kubeflow is a toolset that runs on K8S. Many components can be stacked in it, including Notebooks, Pipelines, and Model serving, but the components are loosely coupled. This preserves the flexibility of K8S and lets users pick only the services they need.

One of Kubeflow's key features is its central control plane, with a Dashboard for managing workflows. Because Kubeflow's architecture is based on K8S concepts, projects are also easier to port, whether across teams or across machines, and Kubeflow handles both cases well. On the infrastructure side, the major cloud providers (e.g. GCP, AWS, Azure) all offer supported ways to set up a K8S environment.

![Dashboard](https://i.imgur.com/lp4L0kQ.png)

### Neptune

The figure below shows Neptune's architecture. Neptune mainly provides a metadata store for MLOps. To stay lightweight, the platform focuses on team research and repeatable experiments. It also offers centralized management, so project collaborators can track experiments and model versions on the platform.

![Neptune](https://i.imgur.com/9LxF2SX.png)

## Platform Comparisons

The following cloud MLOps platforms were selected for comparison:

1. Azure Machine Learning
2. Amazon SageMaker
3. Google Vertex AI
4. DataRobot

![assessment1](https://i.imgur.com/4UMzsMs.png)

![assessment2](https://i.imgur.com/wzgfDLf.png)

![assessment3](https://i.imgur.com/of6UcfU.png)

### Comparison Criteria

The comparison dimensions are based on those provided by [Gartner](https://www.gartner.com/reviews/market/data-science-machine-learning-platforms/compare/product/amazon-sagemaker-vs-microsoft-azure-machine-learning-vs-vertex-ai), from which the criteria that matter most for current MLOps platforms were selected.

1. Platform and Project Management
    - Support for centralized platform management.
    - Ease of project collaboration.
2. Performance and Scalability
    - Performance of machine learning development pipelines.
    - Ability to support large-scale machine learning products.
3. Data Access
    - Support for data input from various sources.
    - Ease of the data feeding process.
4. Machine Learning
    - Support for AutoML hyperparameter tuning.
    - Modules for machine learning training utilities.
5. Model Management
    - Support for model storage.
    - Version control of models.
6. Integration & Deployment
    - Ease of integration with standard APIs and tools.
    - Ease of deployment.
    - Availability of 3rd-party resources.

| Criteria | Azure Machine Learning | DataRobot Enterprise | Amazon SageMaker | Google Vertex AI |
|---------------------------------|------|-----|------|-----|
| Overall | 4.42 | 4.6 | 4.38 | 4.4 |
| Platform and Project Management | 4.2 | 4.5 | 4.2 | 4.4 |
| Performance and Scalability | 4.4 | 4.7 | 4.6 | 4.4 |
| Data Access | 4.6 | 4.4 | 4.4 | 4.5 |
| Machine Learning | 4.5 | 4.8 | 4.5 | 4.5 |
| Model Management | 4.4 | 4.7 | 4.2 | 4.2 |
| Integration & Deployment | 4.4 | 4.5 | 4.4 | 4.4 |

### Utilities

Next, the features each platform offers are compared to assess which one is the better fit as a solution.

1. Azure Machine Learning
    - Pros:
        - Pay-as-you-go pricing.
        - Model registry and deployment capabilities.
        - Automatic data labeling tools.
        - Modularized machine learning pipelines.
        - No-code designer UI for machine learning pipelines.
        - No-code data visualization tools.
    - Cons:
        - Limited to Azure computing resources.
        - Weaker support for AutoML libraries and deep learning models.
2. DataRobot
    - Pros:
        - Focus on AutoML methodologies.
        - Wide library of machine learning models and hyperparameter tuning.
        - Cloud-agnostic platform that is highly compatible with different cloud platforms.
    - Cons:
        - Subscription-based service, not suitable for every project.
        - Less flexible tooling for designing machine learning pipelines.
        - Less flexible integration with client systems.
3. SageMaker
    - Pros:
        - Pay-as-you-go pricing.
        - Similar to AML, with strong support for model registry and deployment.
        - Automatic data labeling tools, plus Amazon crowdsourced labeling.
        - Full control of computing resources and machine learning pipelines.
    - Cons:
        - Price fluctuation.
        - Limited to AWS computing resources.
        - No data visualization modules in the platform.
        - Weaker support for AutoML libraries and deep learning models.
4. Vertex AI
    - Pros:
        - Strong support for deep learning.
        - Support for TPU computing resources.
    - Cons:
        - Released in 2021, less mature than SageMaker and AML.
        - Fewer machine learning modules than SageMaker and AML.

![DataRobot](https://i.imgur.com/AteQecU.png)

![](https://i.imgur.com/6Ze8Dqj.png)

![](https://i.imgur.com/Sx9QxQ1.png)

### Choice

The comparison considers:

1. Platform flexibility
2. Platform learning curve
3. Modules the platform provides
4. Ease of onboarding existing ML projects

Azure has several main advantages:

1. Modularized machine learning functions
2. In-platform data visualization
3. No-code designer interface

- Azure supports building machine learning models through a no-code interface, whereas SageMaker and Vertex AI require writing code or calling existing APIs to achieve the same thing.
- Azure offers no-code visualization, which helps developers explore data faster in the early stages; the other platforms do not. Because Azure offers both no-code and notebook approaches to building machine learning models, it lowers the barrier to getting started on a project and makes it quick to validate the feasibility of data and models using the modules Azure provides, which SageMaker and Vertex AI cannot match.

## References

- [DevOps Wikipedia](https://zh.wikipedia.org/wiki/DevOps)
- [Google Cloud MLOps](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning#devops_versus_mlops)
- [Azure MLOps](https://docs.microsoft.com/zh-tw/azure/architecture/example-scenario/mlops/mlops-technical-paper)
- [Azure MLOps Maturity](https://docs.microsoft.com/zh-tw/azure/architecture/example-scenario/mlops/mlops-maturity-model)
- [Azure Machine Learning Decision Tree](https://docs.microsoft.com/zh-tw/azure/architecture/example-scenario/mlops/aml-decision-tree)
- [MLOps platforms](https://valohai.com/mlops-platforms-compared/)
- [Neptune MLOps](https://neptune.ai/blog/mlflow-vs-kubeflow-vs-neptune-differences)
- [Kubeflow Tutorial](https://chanyilin.github.io/kubeflow-e2e-tutorial)
- [Kubeflow Components](https://www.kubeflow.org/docs/components/)
- [MLOps Platform Comparisons](https://www.phdata.io/blog/how-to-pick-the-best-ml-framework/)
- [Cloud comparisons](https://www.datagrom.com/data-science-machine-learning-ai-blog/aws-azure-gcp-machine-learning-platforms-2020-review)
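
As a quick sanity check on the score table under Platform Comparisons, the sketch below recomputes each platform's Overall value as the unweighted mean of its six criteria scores and compares it with the listed value. Treating Overall as an unweighted mean is an assumption inferred from the numbers in the table, not a stated Gartner formula. The names `SCORES`, `LISTED_OVERALL`, and `overall` are illustrative only.

```python
from statistics import mean

# Criteria scores copied from the comparison table, in row order:
# Platform and Project Management, Performance and Scalability, Data Access,
# Machine Learning, Model Management, Integration & Deployment.
SCORES = {
    "Azure Machine Learning": [4.2, 4.4, 4.6, 4.5, 4.4, 4.4],
    "DataRobot Enterprise": [4.5, 4.7, 4.4, 4.8, 4.7, 4.5],
    "Amazon SageMaker": [4.2, 4.6, 4.4, 4.5, 4.2, 4.4],
    "Google Vertex AI": [4.4, 4.4, 4.5, 4.5, 4.2, 4.4],
}

# "Overall" row exactly as listed in the table.
LISTED_OVERALL = {
    "Azure Machine Learning": 4.42,
    "DataRobot Enterprise": 4.6,
    "Amazon SageMaker": 4.38,
    "Google Vertex AI": 4.4,
}


def overall(scores):
    """Unweighted mean of the criteria scores, rounded to two decimals.

    Assumption: every criterion carries the same weight.
    """
    return round(mean(scores), 2)


if __name__ == "__main__":
    for platform, scores in SCORES.items():
        computed = overall(scores)
        listed = LISTED_OVERALL[platform]
        print(f"{platform}: computed {computed:.2f}, listed {listed}")
```

If the criteria were weighted differently for a given project (for example, Model Management counting double), `overall` is the single place to change, and the ranking can be re-derived from the same table.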