# MLOps
###### tags: `MLOps`
## Introduction
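Following the approach proposed in the previous T1 Proposal, the biggest difference between a machine learning system and a conventional system is that data and machine learning models must also be taken into account.
MLOps builds on DevOps and can be divided into Machine Learning, Development, and Operation. Development and Operation form the DevOps foundation: by automating the "software delivery" and "infrastructure change" processes, software can be built, tested, and released faster, more frequently, and more reliably.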

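In the usual system development flow, the Development team builds a service and hands it to the Operation team, which sets up the serving environment. DevOps reduces communication cost between the teams and speeds up product iteration and deployment. A machine learning system adds Machine Learning on top of that DevOps foundation: it must account for data and models, define how both are validated and tested, and automate their iteration. The end goal is automated deployment of models and services.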
## ML CI/CD
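The first figure shows the most basic version of a Machine Learning system. A Data Scientist or Machine Learning Engineer handles data preparation, data validation, model training, and model evaluation, then packages the model as a service, deploys it manually, and connects it to the machine learning system. Because the process depends heavily on the Data Scientist or Machine Learning Engineer who built it, other team members need extra time to understand the workflow design and how the model is packaged and deployed. Development and product iteration slow down, and the teams must communicate frequently.
At this stage the system offers a narrow service, usually prediction only, and lacks monitoring and testing mechanisms.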

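The second figure shows the complete MLOps flow after CI/CD is introduced. It adds the following to the first version:
1. Source Code CI/CD
2. Automated Machine Learning Training Pipeline
3. Machine Learning CI/CD (Data, Model)
4. Monitoring

The biggest difference between MLOps and DevOps is that MLOps adds automated validation and testing of data and models on top of DevOps, and must handle model deployment in addition to service deployment. Once the fully automated flow is in place, monitoring is also needed to ensure that the service, data, and models are working properly.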
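
To make the idea of automated data and model validation concrete, here is a minimal sketch of a training pipeline with two gates: one that rejects bad data before training and one that blocks deployment when accuracy is too low. Everything in it (the toy threshold model, the function names, the `min_accuracy` parameter) is a hypothetical illustration written in plain Python, not the API of any MLOps platform.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Model:
    """Toy model: predicts 1 when the feature exceeds a learned threshold."""
    threshold: float

    def predict(self, x: float) -> int:
        return int(x > self.threshold)


def validate_data(rows):
    """Data gate: reject empty datasets, missing features, and invalid labels."""
    if not rows:
        raise ValueError("dataset is empty")
    for x, y in rows:
        if x is None or y not in (0, 1):
            raise ValueError(f"invalid row: {(x, y)}")


def train(rows) -> Model:
    """Place the threshold midway between the two class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    if not pos or not neg:
        raise ValueError("training data must contain both classes")
    return Model(threshold=(mean(pos) + mean(neg)) / 2)


def evaluate(model: Model, rows) -> float:
    """Fraction of rows the model classifies correctly."""
    return mean([int(model.predict(x) == y) for x, y in rows])


def run_pipeline(train_rows, test_rows, min_accuracy=0.9):
    """Validate data, train, evaluate, and decide whether to deploy."""
    validate_data(train_rows)
    validate_data(test_rows)
    model = train(train_rows)
    accuracy = evaluate(model, test_rows)
    # Model gate: only models that meet the accuracy bar are deployed.
    deployed = accuracy >= min_accuracy
    return {"model": model, "accuracy": accuracy, "deployed": deployed}


train_rows = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
test_rows = [(0.5, 0), (3.0, 0), (7.0, 1), (9.5, 1)]
result = run_pipeline(train_rows, test_rows)
print(result["accuracy"], result["deployed"])
```

In a real setup each function would be a pipeline step triggered by CI/CD, and the same checks could be reused by monitoring on live data.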

## MLOps Platform
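Unlike a set of MLOps reference tools, an MLOps platform aims to centrally manage the data and model workflows of Machine Learning projects, automating data and model testing, deployment, monitoring, and so on. It can also be combined with additional CI/CD tools to integrate the ML and DevOps flows.
As for the definition of an MLOps platform, namely a platform that integrates the full set of MLOps functions, my understanding is that it mainly covers machine learning pipeline definition, model deployment, and model monitoring, and that it can support large-scale data processing.
The scope of MLOps is shown in the figure and can be roughly divided into: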
1. Development
2. Machine Learning
3. Deployment
4. Operation

Well-known platforms include:
- **Azure Machine Learning**
- DataRobot
- **Kubeflow**
- **Neptune**
- **MLflow**

| Name | Category | Description | Focus |
|-------------|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------|
| Algorithmia | Managed | We help enterprise companies develop an optimal path to machine learning operational maturity. | Enterprise, deployment |
| Allegro AI | Managed, Open-source | End-to-end enterprise-grade platform for data scientists, data engineers, DevOps and managers to manage the entire machine learning & deep learning product life-cycle. | Enterprise, Data management |
| cnvrg.io | Managed, Open-source | An end-to-end machine learning platform to build and deploy AI models at scale | Technology agnostic |
| Dataiku | Managed | Dataiku is the platform democratizing access to data and enabling enterprises to build their own path to AI in a human-centric way. | Enterprise, Data Analysis, Business Intelligence |
| DataRobot | Managed | DataRobot is the leading end-to-end enterprise AI platform that automates and accelerates every step of your path from data to value. | AutoML, Enterprise |
| Iguazio | Managed, Open-source | The Iguazio Data Science Platform automates MLOps with end-to-end machine learning pipelines, transforming AI projects into real-world business outcomes. | Structured data |
| Valohai | Managed | Train, Evaluate, Deploy, Repeat. Valohai is the MLOps platform that can automate everything from data extraction to model deployment. | Deep Learning, API-first, Technology agnostic |
| Flyte | Open-source | Lyft’s Cloud Native Machine Learning and Data Processing Platform, Now Open Sourced | Pipelines |
| Kubeflow | Open-source | The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. | Community, Extendability |
| Metaflow | Open-source | A framework for real-life data science | Pipelines |
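| MLflow | Open-source | MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. | Experimentation, Spark |

ML platforms can be distinguished by a few characteristics:
1. Whether the platform is offered as a managed service
2. UI and dashboard management
3. The service area it focuses on

To build an MLOps platform on Azure services, Azure's PaaS offering App Service can be used to wrap the cloud services Azure provides. The main work then becomes packaging components as Azure services and setting up Terraform to manage the PaaS services.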
### Azure Machine Learning
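The logical architecture that Azure Machine Learning proposes for MLOps is split into data sources, the model pipeline, and service delivery. Azure Machine Learning focuses on the middle stage: model retraining and the validation of machine learning data and modeling methods.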

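The figure below maps the logical architecture above onto existing Azure services. The DevOps part uses Azure DevOps and Azure Pipelines to automate the CI/CD flow, while the model and data part (Machine Learning) relies on Azure Machine Learning for training pipelines and model registration.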

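### MLflow
MLflow is an open-source library for managing the lifecycle of machine learning experiments. MLflow Tracking records and tracks models, so it mainly focuses on the Machine Learning stage. Azure Machine Learning provides a corresponding integration.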

### Kubeflow
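Kubeflow is a model development platform built on top of Kubernetes. It provides the tools needed to develop models and relies on Kubernetes for flexible control of resources and networking.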

Kubeflow provides a set of logical components, seven main ones, which are stacked to build a complete Machine Learning workflow.
1. **Central Dashboard**
The central user interface (UI) in Kubeflow
2. **Notebook Servers**
Using Jupyter notebooks in Kubeflow
3. **Kubeflow Pipelines**
Building and running machine learning pipelines
4. **KFServing**
Kubeflow model deployment and serving toolkit
5. **Katib**
Automated hyperparameter tuning
6. Training Operators
Training of ML models in Kubeflow through operators
7. Multi-Tenancy
Multi-user isolation and identity access management (IAM)
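
The first five components matter most. Each maps to a stage of the MLOps flow, for example:
1. Use Kubeflow Notebook Servers to set up the development and experiment environment. Kubeflow Katib or Azure Machine Learning methods can be used alongside it for automated model and hyperparameter tuning.
2. After the experiment stage, use Kubeflow Pipelines to design the project's CI/CD flow; each pipeline step runs as a pod on K8S. The pipeline design mainly covers the range from code development to the end of model training.
3. After the model is trained (for example with training operators such as TFJob or PyTorchJob), deploy it as a service with Kubeflow KFServing or with the serving tools of other machine learning or deep learning frameworks.
4. Once the complete Machine Learning workflow is in place, Kubeflow's central Dashboard manages the project's workflow configuration, making it easier for the development team to manage and modify the flow.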
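
The workflow above can be pictured as a dependency graph of steps that run in order, each consuming the outputs of earlier steps. The sketch below illustrates that idea in plain Python with the standard-library `graphlib` module (Python 3.9+). It is not the Kubeflow Pipelines SDK: the step names, the toy "tuning" logic, and the `run_pipeline` helper are all assumptions made for illustration, and in Kubeflow each step would run as its own pod.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical step names mirroring the workflow above.
# Each step maps to the set of steps it depends on.
DEPENDENCIES = {
    "prepare_data": set(),
    "tune_hyperparameters": {"prepare_data"},
    "train_model": {"prepare_data", "tune_hyperparameters"},
    "evaluate_model": {"prepare_data", "train_model"},
    "deploy_model": {"evaluate_model"},
}

TARGET = 5.0  # toy target the "model" should predict


def prepare_data(ctx):
    return [1.0, 2.0, 3.0, 4.0]


def tune_hyperparameters(ctx):
    # Stand-in for a tuner such as Katib: try a few candidates, keep the best.
    data = ctx["prepare_data"]
    data_mean = sum(data) / len(data)
    return min([1.0, 2.0, 3.0], key=lambda scale: abs(scale * data_mean - TARGET))


def train_model(ctx):
    return {"scale": ctx["tune_hyperparameters"]}


def evaluate_model(ctx):
    data = ctx["prepare_data"]
    prediction = ctx["train_model"]["scale"] * sum(data) / len(data)
    return abs(prediction - TARGET)  # absolute error against the toy target


def deploy_model(ctx):
    # Deploy only when the evaluation error is small enough.
    return "deployed" if ctx["evaluate_model"] < 0.1 else "rejected"


STEPS = {
    f.__name__: f
    for f in (prepare_data, tune_hyperparameters, train_model,
              evaluate_model, deploy_model)
}


def run_pipeline(dependencies, steps):
    """Run every step after its dependencies; return the order and outputs."""
    order, ctx = [], {}
    for name in TopologicalSorter(dependencies).static_order():
        ctx[name] = steps[name](ctx)
        order.append(name)
    return order, ctx


order, outputs = run_pipeline(DEPENDENCIES, STEPS)
print(order)
print(outputs["deploy_model"])
```

Declaring dependencies separately from the step code is what lets a platform schedule, retry, or replace individual steps without touching the others.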

The figure below covers model deployment. Besides KFServing, other off-the-shelf tools can also be used.

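Kubeflow is a toolset that runs on K8S. Many components can be stacked in it, including Notebooks, Pipelines, and Model serving, but the components are loosely coupled. This preserves the flexibility of K8S and lets users choose only the services they need.
A key feature of Kubeflow is its central control platform with a Dashboard for managing workflows. Because its architecture is based on K8S concepts, projects are also easier to port, whether across teams or across devices, and Kubeflow handles both cases well. For environment setup, the major cloud services (e.g. GCP, AWS, Azure) all provide ways to set up a supported K8S environment.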

### Neptune
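The figure below shows the architecture of Neptune. The Neptune platform mainly provides a metadata store for MLOps. To stay lightweight, it focuses on team research and repeated experimentation, and it offers centralized management so that project collaborators can track experiments and model versions on the platform.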
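
As a rough picture of what a metadata store records, the sketch below keeps one JSON line per experiment run (parameters, metrics, model version) and lets collaborators query the best run. The `RunStore` class, its methods, and the example experiment name are hypothetical and written with the standard library only; this is not the Neptune client API.

```python
import json
import tempfile
import time
from pathlib import Path


class RunStore:
    """Append-only JSON-lines store for experiment runs (illustrative only)."""

    def __init__(self, path):
        self.path = Path(path)

    def log_run(self, experiment, params, metrics, model_version):
        """Append one run record and return it."""
        record = {
            "experiment": experiment,
            "params": params,
            "metrics": metrics,
            "model_version": model_version,
            "timestamp": time.time(),
        }
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return record

    def runs(self, experiment):
        """Return all recorded runs of one experiment, oldest first."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        return [r for r in records if r["experiment"] == experiment]

    def best_run(self, experiment, metric):
        """Return the run with the highest value of `metric`, or None."""
        candidates = self.runs(experiment)
        if not candidates:
            return None
        return max(candidates, key=lambda r: r["metrics"][metric])


with tempfile.TemporaryDirectory() as tmp:
    store = RunStore(Path(tmp) / "runs.jsonl")
    store.log_run("churn", {"lr": 0.1}, {"accuracy": 0.87}, "v1")
    store.log_run("churn", {"lr": 0.01}, {"accuracy": 0.91}, "v2")
    history = store.runs("churn")
    best = store.best_run("churn", "accuracy")

print(len(history), best["model_version"])
```

Keeping the record append-only means earlier experiments stay reproducible: a run is never overwritten, only superseded by a newer version.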

## Platform Comparisons
The following cloud MLOps platforms were selected for comparison:
1. Azure Machine Learning
2. Amazon SageMaker
3. Google Vertex AI
4. DataRobot



### Comparison Criteria
The comparison dimensions are based on those provided by [Gartner](https://www.gartner.com/reviews/market/data-science-machine-learning-platforms/compare/product/amazon-sagemaker-vs-microsoft-azure-machine-learning-vs-vertex-ai); the indicators most important for current MLOps platforms were selected for comparison.
1. Platform and Project Management
- Support of platform centralized management.
- Ease of project collaboration.
2. Performance and Scalability
- Performance of machine learning development pipelines.
- Capability to support large-scale machine learning products.
3. Data Access
- Support of data input from various sources.
- Ease of data feeding process.
4. Machine Learning
- Support of AutoML hyperparameter tuning.
- Modules of machine learning training utilities.
5. Model Management
- Support of model storage.
- Version control of models.
6. Integration & Deployment
- Ease of integration from standard APIs and tools.
- Ease of deployment.
- Availability of 3rd-party resources.

| Criteria | Azure Machine Learning | DataRobot Enterprise | Amazon SageMaker | Google Vertex AI |
|---------------------------------|------------------------|-----------|------------------|------------------|
| Overall | 4.42 | 4.6 | 4.38 | 4.4 |
| Platform and Project Management | 4.2 | 4.5 | 4.2 | 4.4 |
| Performance and Scalability | 4.4 | 4.7 | 4.6 | 4.4 |
| Data Access | 4.6 | 4.4 | 4.4 | 4.5 |
| Machine Learning | 4.5 | 4.8 | 4.5 | 4.5 |
| Model Management | 4.4 | 4.7 | 4.2 | 4.2 |
| Integration & Deployment | 4.4 | 4.5 | 4.4 | 4.4 |
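
Judging from the numbers alone, the Overall row matches the unweighted mean of the six criteria rounded to two decimals; that is an observation about the table, not something the source states. The sketch below reproduces that calculation and lets a team re-rank the platforms with its own priorities. The `weighted_score` and `rank` helpers and the `weights` parameter are hypothetical.

```python
CRITERIA = [
    "Platform and Project Management",
    "Performance and Scalability",
    "Data Access",
    "Machine Learning",
    "Model Management",
    "Integration & Deployment",
]

# Per-criterion ratings copied from the table above, in CRITERIA order.
RATINGS = {
    "Azure Machine Learning": [4.2, 4.4, 4.6, 4.5, 4.4, 4.4],
    "DataRobot Enterprise": [4.5, 4.7, 4.4, 4.8, 4.7, 4.5],
    "Amazon SageMaker": [4.2, 4.6, 4.4, 4.5, 4.2, 4.4],
    "Google Vertex AI": [4.4, 4.4, 4.5, 4.5, 4.2, 4.4],
}


def weighted_score(ratings, weights=None):
    """Weighted mean of ratings; criteria missing from `weights` count as 1.0."""
    weights = weights or {}
    pairs = [(r, weights.get(c, 1.0)) for c, r in zip(CRITERIA, ratings)]
    total_weight = sum(w for _, w in pairs)
    return sum(r * w for r, w in pairs) / total_weight


def rank(weights=None):
    """Platform names sorted from highest to lowest weighted score."""
    return sorted(
        RATINGS,
        key=lambda name: weighted_score(RATINGS[name], weights),
        reverse=True,
    )


# Equal weights reproduce the Overall row of the table.
overall = {name: round(weighted_score(r), 2) for name, r in RATINGS.items()}
print(overall)

# Example of a custom priority: count Data Access three times as much.
print(rank({"Data Access": 3.0}))
```
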
### Utilities
This section compares the features each platform provides to assess which is better suited as a solution.
1. Azure Machine Learning
- Pros:
- Pay-as-you-go pricing.
- Model registry and deployment capabilities.
- Provides automated data labeling tools.
- Modularized machine learning pipelines.
- No-code designer UI for machine learning pipelines.
- Supports no-code data visualization tools.
- Cons:
- Limited to Azure computing resources.
- Weaker support for AutoML libraries and deep learning models.
2. DataRobot
- Pros:
- Focuses on AutoML methodologies.
- Provides a wide library of machine learning models and hyperparameter tuning.
- Cloud-agnostic platform that is highly compatible with different cloud platforms.
- Cons:
- Subscription-based service, not suitable for all projects.
- Less flexible tooling for designing machine learning pipelines.
- Less flexible integration with client systems.
3. SageMaker
- Pros:
- Pay-as-you-go pricing.
- Similar to AML, with strong support for model registry and deployment.
- Provides automated data labeling tools and Amazon crowdsourced labeling.
- Full control of computing resources and machine learning pipelines.
- Cons:
- Price fluctuation.
- Limited to AWS computing resources.
- No data visualization modules in the platform.
- Weaker support for AutoML libraries and deep learning models.
4. Vertex AI
- Pros:
- Strong support for deep learning.
- Supports TPU computing resources.
- Cons:
- Released in 2021, less mature than SageMaker and AML.
- Fewer machine learning modules than SageMaker and AML.



### Choice
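The comparison considers several aspects:
1. Platform flexibility
2. Platform learning curve
3. Modules the platform provides
4. How easily existing ML projects can be migrated

Azure has several main advantages:
1. Modularized machine learning functions
2. Data visualization on the platform
3. No-code designer interface
- Azure supports building machine learning models through a no-code interface, whereas SageMaker and Vertex AI require writing code or calling existing APIs to achieve the same goal.
- Azure provides no-code visualization that helps developers explore data faster in the early stages; the other platforms do not provide this.

Because Azure offers both no-code and notebook approaches for building machine learning models, it lowers the barrier to starting a project and allows the feasibility of data and models to be validated quickly with the modules Azure provides, which SageMaker and Vertex AI cannot match.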
## References
- [DevOps Wikipedia](https://zh.wikipedia.org/wiki/DevOps)
- [Google Cloud MLOps](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning#devops_versus_mlops)
- [Azure MLOps](https://docs.microsoft.com/zh-tw/azure/architecture/example-scenario/mlops/mlops-technical-paper)
- [Azure MLOps Maturity](https://docs.microsoft.com/zh-tw/azure/architecture/example-scenario/mlops/mlops-maturity-model)
- [Azure Machine Learning Decision Tree](https://docs.microsoft.com/zh-tw/azure/architecture/example-scenario/mlops/aml-decision-tree)
- [MLOps platforms](https://valohai.com/mlops-platforms-compared/)
- [Neptune MLOps](https://neptune.ai/blog/mlflow-vs-kubeflow-vs-neptune-differences)
- [Kubeflow Tutorial](https://chanyilin.github.io/kubeflow-e2e-tutorial)
- [Kubeflow Components](https://www.kubeflow.org/docs/components/)
- [MLOps Platform Comparisons](https://www.phdata.io/blog/how-to-pick-the-best-ml-framework/)
- [Cloud comparisons](https://www.datagrom.com/data-science-machine-learning-ai-blog/aws-azure-gcp-machine-learning-platforms-2020-review)