<!--
Event: MLOps - from concept to product, bringing ideas to life

Sandra Meneses is a Machine Learning Engineer with experience in B2C product companies. She has deployed multiple end-to-end solutions bringing business value, and is currently freelancing and building a product that promotes self-education. She enjoys participating in data community activities and supporting open-source projects.

## Outline

This talk covers:

- What MLOps is and why it matters for reliable, evolvable data products.
- How best practices from other disciplines have enabled the growth of ML products in the market.
- A recommended approach to reach a high MLOps maturity level in an organization.
- Resources to help you decide whether MLOps is for you and how to learn it.

Finding a way to use your data to solve a problem is a great first step, but it must kick off a process that moves from a Proof of Concept (POC) to a feature or product. Products are meant to be used (obviously) by users who have expectations about their performance, reliability and usability. This process is guided by MLOps practices. In this talk, we will explore what that really means and how you can start applying these practices in real-world scenarios.
-->

<style>
.reveal h1 {
  font-size: 100px;
}
.reveal h2 {
  font-size: 75px;
  text-align: left;
}
.reveal h3 {
  font-size: 50px;
  text-align: left;
}
.reveal p {
  font-size: 40px;
}
.reveal tr {
  font-size: 28px;
}
.reveal ul, ol {
  font-size: 30px;
  display: block;
  text-align: left;
}
</style>

# MLOps - from concept to product :rocket:

#### Bringing ideas to life

---

# What is this talk about? 👩🏽‍🏫

- Overview of MLOps practices
- Answering these questions:
  - What is MLOps?
  - Why should we follow MLOps practices?
  - What do we need to do MLOps?
  - How to start following MLOps practices?

---

# What is MLOps?
- Process of using machine learning <span style="color:#eee8d5">**models**</span> as <span style="color:#eee8d5">**a useful, verifiable and evolvable product**</span>.
- MLOps is an <span style="color:#eee8d5">**infrastructure- and language-agnostic practice**</span>.
- Application of <span style="color:#eee8d5">**DevOps**</span> to the machine learning workflow.

---

# What is DevOps?

Practices to automate the software delivery lifecycle

**Why:** :arrow_up: Agility 🟰 Quality

:busts_in_silhouette: People + 🛞 Process + :wrench: Tools

<img src="https://software.af.mil/wp-content/uploads/2019/08/devops-loop.svg" height="250"/>

---

## Continuous Integration (CI) and Continuous Delivery (CD)

Practices to ensure software is ready to be deployed and deployment can be done automatically

- Test-driven development ➡️ Anyone can deploy
- Production-like staging environments

<img src="https://universaltechconsulting.com/wp-content/uploads/2022/05/ci-cd-diagram.png" height="250"/>

---

# ML Systems :robot_face:

---

# Machine learning lifecycle

<img src="https://i.imgur.com/Pokpvmi.png" alt="drawing" height="500"/>

---

# Data Team

<img src="https://i.imgur.com/puAvInl.png" height="450"/>

---

# Why are ML systems different?
- DATA
- Higher computing resources
- Hard to define and measure
- Poorly understood
- Team skills

---

# ML Challenges in the Dev process

<img src="https://miro.medium.com/v2/resize:fit:924/format:webp/1*qtxDxwn2ba3t47vVtcZ2ug.png"/>

---

# Experimenting

- Features
- Algorithms
- Hyperparameter tuning

<img style="float: right;" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLOps-practices-experiments.jpeg?ssl=1" width="450"/>

---

## Reproducibility

- Why: deployment, debugging
- How:
  - Inputs: Data ( D ) + Code ( C ) + Parameters ( P )
  - Remote storage
  - Environments (containers)
  - Reduce non-deterministic behaviour (seeds)

---

## Tracking and Versioning

- Track inputs:
  - Data: Features
  - Code: Training and Prediction
  - Parameters: (Hyper)parameters (over time if variable)
- Track metrics:
  - Model training: Loss curve
  - System: Speed, RAM and CPU/GPU usage
  - Model performance: Which configuration (D + C + P) is the best?

---

## Git for Data Science

- Versioning:
  - Code
  - Data: [DVC](https://dvc.org) (checksum)
  - Model artifacts
- CI/CD:
  - Training pipelines: [CML](https://cml.dev/)
  - GitOps: Infrastructure as Code (IaC)

---

## Automated Testing

- Data validation:
  - Schema: Anomalies :arrow_right: Debugging
  - Values: Quality metrics :arrow_right: Retraining
- Model evaluation: model quality metrics
- Model validation: new model is better

<img src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLOps-practices-code.jpg?ssl=1" height="280"/>

---

## Deployment

- ML Training Pipeline
- Deploy model
- Deploy service

---

## Monitoring

<table>
<tr>
<th>
<ul>
<li>What</li>
<ul>
<li>End-to-end metrics: revenue</li>
<li>Model performance metrics</li>
<li>Integration: no predictions</li>
<li>Drift: Statistical metrics</li>
</ul>
<li>Alerts</li>
<ul>
<li>Jobs or scheduler activity</li>
<li>Usage: CPU/RAM, rpm per model</li>
</ul>
<li>It triggers ➡️</li>
<ul>
<li>Rolling back</li>
<li>(Auto) Scaling ↕️</li>
<li>Debugging and/or retraining</li>
<li>Data
pipeline fixes</li>
</ul>
</ul>
</th>
<th><img src="https://deepchecks.com/wp-content/uploads/2022/12/data-drift-my-model.webp" width="300"/></th>
</tr>
</table>

---

# MLOps Practices

---

# Data Management

- Feature stores: features for training and inference :arrow_up: Consistency :arrow_up: Reusability
  - Management: Metadata
  - Computation: Transformations (Normalization, Anonymization, Labelling)
- ML Metadata Store: :arrow_up: Reproducibility :arrow_up: Debugging :arrow_up: Data lineage
  - ML Pipeline execution details
  - Reference to artifacts and previous versions
  - Metrics in training and test data

---

# Model Management

- Model registry: collection of models :package:
- Why: Tracking and Deployment
- What:
  - Definition: author, type, version, stage
  - Reference to: D + C + P
  - Metrics and Artifacts

---

# Model Evaluation

- Why: Reliable system
  - Experimenting: Start simple and build up
  - Debugging: :arrow_up: Components :arrow_up: Time
- Tracking and Versioning
  - Data distribution and feature importance
  - Artifacts:
    - Logs
    - Training and model metrics
- ML Pipeline in Production

---

# Online ML System Validation

- Why:
  - Model has to perform better than the baseline or previous version
  - Models for different user clusters
  - Estimate retraining frequency
- How:
  - A/B test: Comparing variants with random routing
  - Bandits: Comparing variants with routing by model performance
  - Canary release: Rolling out to a percentage of users

---

# Responsible AI

- Fairness: does my product have any bias?
  - Correlation between input data and predictions within different user clusters
- Explainability: can I answer why my product behaves in a particular way?
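
---

## Fairness check: a sketch

A minimal illustration of the fairness check above, not from the talk: demographic parity compares positive-prediction rates across user clusters. Group labels and predictions here are hypothetical.

```python
# Compare positive-prediction rates across user clusters
# (demographic parity difference). All data below is hypothetical.

def demographic_parity_gap(groups, preds):
    """Max difference in positive-prediction rate between groups."""
    counts = {}
    for g, p in zip(groups, preds):
        total, positives = counts.get(g, (0, 0))
        counts[g] = (total + 1, positives + p)
    rates = {g: pos / total for g, (total, pos) in counts.items()}
    return max(rates.values()) - min(rates.values())

groups = ["a", "a", "a", "b", "b", "b"]
preds = [1, 1, 0, 1, 0, 0]  # binary model outputs
gap = demographic_parity_gap(groups, preds)  # a: 2/3, b: 1/3 -> gap 1/3
```

A gap close to zero suggests the model treats the clusters similarly on this metric; what counts as "too large" is a product decision.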
---

# Continuous Training (CT)

CD of ML

- Why:
  - Model metrics decay
  - New data is available
- Main challenges:
  - Fresh data
  - Evaluation
- Options to consider:
  - Batch vs Online training
  - Stateless vs Stateful training: from scratch or incremental
  - Mechanism to trigger training: scheduler, new data, performance decay, data drift

---

# MLOps Maturity Model

| Level | People | Process |
| ---------------------------- | ----------------- | ------- |
| 0 No MLOps | Disperse | No tracking, manual training and evaluation, limited monitoring |
| 1 DevOps but no MLOps | Disperse | Data pipelines, automatic tests and builds for the prediction service |
| 2 Automated Training | DS+DE | Tracked experiments, managed compute, training pipeline |
| 3 Automated Model Deployment | DS+DE / DE+SE | Model testing and (online) validation, CI/CD, automatic release |
| 4 Full MLOps | DS+DE+SE | Automatic retraining |

DS: Data Scientist, DE: Data Engineer, SE: Software Engineer

---

# Automated Pipeline

<img src="https://cloud.google.com/static/architecture/images/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning-4-ml-automation-ci-cd.svg" height="550"/>

---

# What did we learn? 👩🏽‍🏫

- What is MLOps?
- Why should we follow MLOps practices?
- What do we need to do MLOps?
- How to start following MLOps practices?
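
---

## Retraining trigger: a sketch

A minimal sketch of the Continuous Training trigger mechanisms covered earlier (scheduler, performance decay, data drift); thresholds and metric names are hypothetical, not from the talk.

```python
# Decide whether to retrain: trigger on performance decay, data drift,
# or a scheduled interval. All thresholds are hypothetical defaults.

def should_retrain(current_metric, baseline_metric, drift_score,
                   days_since_training, *,
                   decay_tolerance=0.05, drift_threshold=0.2, max_age_days=30):
    if baseline_metric - current_metric > decay_tolerance:
        return True  # performance decay
    if drift_score > drift_threshold:
        return True  # data drift detected
    if days_since_training >= max_age_days:
        return True  # scheduled retraining
    return False

# e.g. accuracy dropped from 0.90 to 0.80, well past the tolerance
should_retrain(0.80, 0.90, 0.05, 10)  # True
```

In practice this decision would sit in a monitoring job that kicks off the training pipeline, rather than retraining inline.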
---

# Books

<img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRTsaKuwGamDyhUPARjC0Q-lmIBfbPFLik8kfZW6YS3OrV5jmTH" width="400"/> <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSRl5dz8AT9ZGjpwRqxtOPsLw4HubCpwGhMrzRDtrBg5EOd_SPX" width="400"/>

---

# Sources

- [MLOps Principles](https://ml-ops.org/content/mlops-principles)
- [Machine Learning operations maturity model](https://learn.microsoft.com/en-us/azure/architecture/example-scenario/mlops/mlops-maturity-model)
- [MLOps: Continuous delivery and automation pipelines in machine learning](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
- [Awesome Production Machine Learning](https://github.com/EthicalML/awesome-production-machine-learning)

---

# Tools Review

![](https://i.imgur.com/kUKNF97.png)

<!--
# Tools to review

AI Platform
AzureML
SageMaker
-->