# **Vision Transformer (ViT) Tutorial, Part 6: Vision Transformers in Production - MLOps, Monitoring & CI/CD**
**#MLOps #ModelMonitoring #CIforML #MLflow #WandB #Kubeflow #ProductionAI #DeepLearning #ComputerVision #Transformers #AIOps**
---
## 🔹 **Table of Contents**
1. [Recap of Part 5](#recap-of-part-5)
2. [The Gap Between Research and Production](#the-gap-between-research-and-production)
3. [What is MLOps? Why It Matters for Vision Transformers](#what-is-mlops-why-it-matters-for-vision-transformers)
4. [Production Challenges: Latency, Drift, Failures](#production-challenges-latency-drift-failures)
5. [Model Monitoring: Accuracy, Latency, Throughput](#model-monitoring-accuracy-latency-throughput)
6. [Data & Concept Drift Detection in Vision Models](#data--concept-drift-detection-in-vision-models)
7. [Logging Predictions & Embeddings with Weights & Biases](#logging-predictions--embeddings-with-weights--biases)
8. [Using MLflow for Model Tracking & Experimentation](#using-mlflow-for-model-tracking--experimentation)
9. [CI/CD for ML: Testing, Versioning, Rollback](#cicd-for-ml-testing-versioning-rollback)
10. [A/B Testing & Canary Rollouts for ViT Models](#ab-testing--canary-rollouts-for-vit-models)
11. [Model Registry & Lineage with MLflow & DVC](#model-registry--lineage-with-mlflow--dvc)
12. [Anomaly Detection in Predictions](#anomaly-detection-in-predictions)
13. [Multi-Model Serving with KServe & BentoML](#multi-model-serving-with-kserve--bentoml)
14. [Scaling ViT with Kubernetes & Kubeflow](#scaling-vit-with-kubernetes--kubeflow)
15. [Security & Compliance in Production AI](#security--compliance-in-production-ai)
16. [Case Study: Deploying ViT in a Real-Time Video Pipeline](#case-study-deploying-vit-in-a-real-time-video-pipeline)
17. [Common Pitfalls & Best Practices](#common-pitfalls--best-practices)
18. [Visualizing MLOps Pipeline (Diagram)](#visualizing-mlops-pipeline-diagram)
19. [Summary & What's Next in Part 7](#summary--whats-next-in-part-7)
---
## **1. Recap of Part 5**
In **Part 5**, we mastered **efficient Vision Transformers**:
- Learned why **efficiency** is critical for edge and mobile.
- Explored **MobileViT**, **TinyViT**, **PVT**, **Swin-T**, and **LeViT**.
- Applied **knowledge distillation**, **quantization**, and **pruning**.
- Exported models to **ONNX** for cross-platform use.
- Accelerated inference with **TensorRT**.
- Deployed models using **TorchServe** and **FastAPI**.
- Benchmarked **accuracy vs latency vs size**.
Now, in **Part 6**, the most comprehensive and practical installment yet, we shift from **deployment** to **operations**.
You'll learn how to run ViT models in **production** like a pro, with **monitoring, CI/CD, A/B testing, rollback, and MLOps best practices**.
This is where **research meets reality**.
Let's dive in!
---
## 🧩 **2. The Gap Between Research and Production**
In research, you:
- Train on clean datasets.
- Report top-1 accuracy.
- Ignore latency, cost, and failures.
In production, you:
- Deal with **noisy, real-world data**.
- Face **model drift** and **hardware failures**.
- Must ensure **reliability, scalability, and compliance**.
> 💡 **"The best model in the world is useless if it breaks in production."**
### 🔹 The "Last Mile" Problem
| Research Phase | Production Reality |
|--------------|-------------------|
| High accuracy on ImageNet | Poor performance on user-uploaded images |
| Single model | Multiple versions in production |
| Manual testing | Automated CI/CD pipelines |
| No monitoring | 24/7 observability required |
| One-time deployment | Continuous updates |
> ✅ Bridging this gap is the job of **MLOps**.
---
## 🛠️ **3. What is MLOps? Why It Matters for Vision Transformers**
**MLOps (Machine Learning Operations)** is the practice of **applying DevOps principles to ML systems**.
It includes:
- **Version control** for data, code, and models.
- **Automated testing** of ML pipelines.
- **CI/CD** for models.
- **Monitoring** in production.
- **Rollback** mechanisms.
- **Governance & compliance**.
### 🔹 Why MLOps for ViT?
Vision Transformers are:
- **Large and expensive** to run.
- **Sensitive to data drift** (e.g., new camera types).
- **Used in critical apps** (medical, autonomous vehicles).
- **Updated frequently** (new features, bug fixes).
> ✅ Without MLOps, you risk:
> - Silent model degradation
> - Long downtime
> - Regulatory violations
---
## ⚠️ **4. Production Challenges: Latency, Drift, Failures**
### 🔹 **1. Latency Spikes**
ViT models can slow down due to:
- GPU memory pressure
- Batch size changes
- Network congestion
> ✅ Monitor **P99 latency**, not just the average.
---
### 🔹 **2. Data Drift**
Input data changes over time:
- New lighting conditions
- Different camera resolutions
- Seasonal variations (e.g., winter vs summer)
Example:
> A ViT trained on **daytime images** fails at **night**.
---
### 🔹 **3. Concept Drift**
The relationship between input and output changes:
- "Cat" now includes robotic pets
- New product categories in e-commerce
---
### 🔹 **4. Model Failures**
- GPU OOM (Out of Memory)
- ONNX export bugs
- Quantization artifacts
- Corrupted model files
> ✅ You need **automated alerts and rollback**.
---
## **5. Model Monitoring: Accuracy, Latency, Throughput**
### ✅ Key Metrics to Monitor
| Metric | Tool | Alert If |
|------|------|---------|
| **Prediction Latency** | Prometheus | P99 > 100ms |
| **Request Rate** | Grafana | Sudden drop/spike |
| **Error Rate** | ELK Stack | > 1% |
| **GPU Utilization** | nvidia-smi | > 90% sustained |
| **Model Accuracy** | Custom logging | Drop > 5% |
| **Throughput (QPS)** | In-house dashboard | Below baseline |
---
### ✅ Example: Monitoring with Prometheus + Grafana
```python
import time

from prometheus_client import Counter, Histogram

# Define metrics once, at module import time
PREDICTION_LATENCY = Histogram('prediction_latency_seconds', 'Latency of ViT predictions')
PREDICTION_COUNT = Counter('predictions_total', 'Total predictions made')
ERROR_COUNT = Counter('prediction_errors_total', 'Total prediction errors')

# In the inference function
start_time = time.time()
try:
    output = model(inputs)
    PREDICTION_LATENCY.observe(time.time() - start_time)
    PREDICTION_COUNT.inc()
except Exception:
    ERROR_COUNT.inc()
    raise
```
> ✅ Expose a `/metrics` endpoint for Prometheus to scrape.
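To turn the thresholds from the table above into actual alerts, you can define a Prometheus alerting rule. A minimal sketch, assuming the `prediction_latency_seconds` histogram defined earlier; the file name and labels are illustrative:
```yaml
# prometheus-alerts.yaml (illustrative)
groups:
  - name: vit-inference
    rules:
      - alert: HighPredictionLatency
        # P99 over a 5-minute window, computed from the histogram buckets
        expr: histogram_quantile(0.99, rate(prediction_latency_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "ViT P99 prediction latency above 100 ms"
```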
---
## **6. Data & Concept Drift Detection in Vision Models**
### 🔹 **Data Drift**: Input distribution changes
Detect using:
- **Embedding similarity** (cosine distance between batches)
- **Pixel statistics** (mean, variance)
- **Feature extractor outputs**
```python
import torch.nn.functional as F

def detect_drift(embeddings_old, embeddings_new, threshold=0.1):
    # Compare the mean embedding of a reference batch vs. a live batch
    mean_old = embeddings_old.mean(0)
    mean_new = embeddings_new.mean(0)
    distance = 1 - F.cosine_similarity(mean_old, mean_new, dim=0)
    return distance > threshold
```
> ✅ Use **PCA or UMAP** to visualize drift.
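As a minimal sketch of that visualization, assuming `embeddings_old` and `embeddings_new` are `(N, D)` NumPy arrays:
```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA on the reference batch, then project both batches into the same 2D space
pca = PCA(n_components=2).fit(embeddings_old)
old_2d = pca.transform(embeddings_old)
new_2d = pca.transform(embeddings_new)

plt.scatter(old_2d[:, 0], old_2d[:, 1], alpha=0.3, label="training")
plt.scatter(new_2d[:, 0], new_2d[:, 1], alpha=0.3, label="live")
plt.legend()
plt.savefig("drift_projection.png")  # visibly separated clusters suggest drift
```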
---
### 🔹 **Concept Drift**: Label meaning changes
Concept drift is harder to detect without labels. Useful signals:
- **Confidence monitoring**: watch for a drop in the mean softmax score (see the sketch after this list).
- **Human-in-the-loop**: Flag low-confidence predictions for review.
- **Shadow mode**: Run new model in parallel, compare outputs.
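Here is a minimal sketch of the confidence-monitoring idea; the window size and alert floor are arbitrary illustrative choices, not canonical values:
```python
from collections import deque

import torch

class ConfidenceMonitor:
    """Track mean top-1 softmax confidence over a sliding window of batches."""

    def __init__(self, window: int = 100, floor: float = 0.7):
        self.scores = deque(maxlen=window)
        self.floor = floor

    def update(self, logits: torch.Tensor) -> bool:
        # Mean top-1 probability for this batch
        conf = torch.softmax(logits, dim=1).max(dim=1).values.mean().item()
        self.scores.append(conf)
        # True = the rolling mean has dropped below the floor, raise an alert
        return sum(self.scores) / len(self.scores) < self.floor
```
Wire the `True` return value to your alerting system; a sustained drop in mean confidence is often the first visible symptom of concept drift.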
---
## 🧪 **7. Logging Predictions & Embeddings with Weights & Biases**
[**Weights & Biases (W&B)**](https://wandb.ai) is a powerful tool for **experiment tracking and model monitoring**.
### ✅ Log Predictions
```python
import time

import torch
import wandb

wandb.init(project="vit-production")

for batch in dataloader:
    start = time.time()
    outputs = model(batch)
    probs = torch.softmax(outputs, dim=1)
    confs, preds = probs.max(dim=1)  # per-sample top-1 probability and class

    wandb.log({
        "latency": time.time() - start,
        "confidence": confs.mean().item(),
        "predictions": wandb.Table(
            columns=["image", "pred", "prob"],
            data=[[wandb.Image(img), pred.item(), conf.item()]
                  for img, pred, conf in zip(batch, preds, confs)],
        ),
    })
```
> ✅ Visualize predictions, attention maps, and drift over time.
---
### ✅ Log Embeddings for Drift Detection
```python
from sklearn.decomposition import PCA

embeddings = model.get_embeddings(batch)  # (N, D) feature vectors
# wandb.Object3D expects an (N, 3) point cloud, so project down first
points_3d = PCA(n_components=3).fit_transform(embeddings.cpu().numpy())
wandb.log({"embeddings": wandb.Object3D(points_3d)})
```
Use W&B's **embedding projector** to visualize clusters.
---
## 📦 **8. Using MLflow for Model Tracking & Experimentation**
[**MLflow**](https://mlflow.org) is an open-source platform for **ML lifecycle management**.
### ✅ Log an Experiment
```python
import mlflow
import mlflow.pytorch

mlflow.set_experiment("ViT-Optimization")

with mlflow.start_run():
    mlflow.log_params({
        "model": "MobileViT-S",
        "lr": 1e-4,
        "batch_size": 64,
        "quantization": "int8",
    })

    # Train, then evaluate (evaluate() and avg_latency come from your own code)
    accuracy = evaluate(model)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("latency", avg_latency)

    # Save the model artifact with the run
    mlflow.pytorch.log_model(model, "model")
```
> ✅ Compare runs and track accuracy vs. latency trade-offs.
---
### ✅ Load Model in Production
```python
# "<run_id>" is a placeholder for the MLflow run you want to serve
model_uri = "runs:/<run_id>/model"
model = mlflow.pytorch.load_model(model_uri)
```
> ✅ No code duplication between training and inference.
---
## **9. CI/CD for ML: Testing, Versioning, Rollback**
### ✅ CI/CD Pipeline for ViT
```
Code Commit → Run Tests → Train Model → Evaluate → Log to MLflow
      ↓
If metrics OK → Export to ONNX → Run Integration Tests
      ↓
If all pass → Deploy to Staging → A/B Test
      ↓
If success → Deploy to Production
      ↓
Monitor → Alert on Drift → Rollback if needed
```
---
### ✅ Automated Testing
| Test | Description |
|------|-------------|
| **Unit Test** | Check model forward pass |
| **Integration Test** | ONNX export, TensorRT conversion |
| **Accuracy Test** | Ensure accuracy drop < 1% after optimization |
| **Latency Test** | Ensure P99 < 100ms |
| **Drift Test** | Compare embeddings on validation vs live data |
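As a sketch, here is how the first and fourth rows might look as pytest cases; the `build_vit` factory, input size, and latency budget are assumptions for illustration:
```python
import time

import pytest
import torch

from my_project.models import build_vit  # hypothetical model factory

@pytest.fixture(scope="module")
def model():
    m = build_vit("mobilevit_s", num_classes=1000)
    m.eval()
    return m

def test_forward_pass_shape(model):
    # Unit test: one logit vector per input image
    x = torch.randn(2, 3, 224, 224)
    with torch.no_grad():
        out = model(x)
    assert out.shape == (2, 1000)

def test_p99_latency_budget(model):
    # Latency test: crude P99 estimate over repeated single-image calls
    x = torch.randn(1, 3, 224, 224)
    times = []
    with torch.no_grad():
        for _ in range(100):
            start = time.perf_counter()
            model(x)
            times.append(time.perf_counter() - start)
    p99 = sorted(times)[98]
    assert p99 < 0.1, f"P99 latency {p99:.3f}s exceeds the 100 ms budget"
```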
---
### ✅ Rollback Strategy
If the new model fails:
1. **Detect failure** (latency spike, accuracy drop).
2. **Switch traffic** back to previous version.
3. **Investigate** root cause.
4. **Fix and re-deploy**.
> ✅ Use **TorchServe model versioning** or **Kubernetes rollouts**.
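With TorchServe, for example, step 2 of the rollback can be a single management-API call that makes the previous version the default again (assuming a model named `vit` with two registered versions):
```bash
# Route new requests back to version 1.0
curl -X PUT "http://localhost:8081/models/vit/1.0/set-default"
```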
---
## **10. A/B Testing & Canary Rollouts for ViT Models**
### 🔹 **A/B Testing**
Split traffic:
- 50% → Old model (A)
- 50% → New model (B)
Compare:
- Accuracy (if labels available)
- Latency
- User engagement
```python
# Deterministic split: each user always hits the same model variant
if user_id % 2 == 0:
    prediction = model_a(image)
else:
    prediction = model_b(image)
```
> ✅ A safe way to validate improvements.
---
### 🔹 **Canary Rollout**
Gradually increase traffic to new model:
- Day 1: 1%
- Day 2: 5%
- Day 3: 25%
- Day 4: 100%
Monitor at each step.
> ✅ Minimizes the blast radius of failures.
---
## 🧩 **11. Model Registry & Lineage with MLflow & DVC**
### ✅ Model Registry (MLflow)
Track model stages:
- `Staging` → `Production` → `Archived`
```python
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="ViT-Classifier",
    version=3,
    stage="Production",
)
```
> ✅ Enforce approval workflows.
---
### ✅ Data Version Control (DVC)
Version datasets and models:
```bash
# Track data
dvc add data/imagenet_train
git add data/imagenet_train.dvc
# Track model
dvc add models/vit_quantized.pth
git add models/vit_quantized.pth.dvc
```
> ✅ Reproducible pipelines.
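Beyond tracking individual files, a `dvc.yaml` stage ties data, code, and model together so `dvc repro` re-runs training only when a dependency changes. A sketch with illustrative paths and script names:
```yaml
# dvc.yaml (illustrative)
stages:
  train:
    cmd: python train.py --config configs/vit.yaml
    deps:
      - train.py
      - data/imagenet_train
    outs:
      - models/vit_quantized.pth
    metrics:
      - metrics.json:
          cache: false
```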
---
## 🚨 **12. Anomaly Detection in Predictions**
Watch for anomalous behavior:
- Sudden drop in confidence
- All predictions the same
- Out-of-distribution inputs
### ✅ Example: Outlier Detection with Embeddings
```python
from sklearn.ensemble import IsolationForest

# Fit on training embeddings; contamination = expected outlier fraction
isolation_forest = IsolationForest(contamination=0.1)
isolation_forest.fit(train_embeddings)

# Score live data: lower decision scores mean more anomalous
scores = isolation_forest.decision_function(live_embeddings)
anomalies = scores < 0.0  # equivalently: isolation_forest.predict(...) == -1
```
> ✅ Flag suspicious inputs for review.
---
## **13. Multi-Model Serving with KServe & BentoML**
### ✅ **KServe (formerly KFServing)**
Kubernetes-native serving for multiple frameworks.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: vit-classifier
spec:
predictor:
model:
modelFormat:
name: pytorch
storageUri: s3://models/vit-v2
```
> ✅ Scales automatically and supports canary rollouts.
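A sketch of a KServe canary: setting `canaryTrafficPercent` on the predictor routes a slice of traffic to the updated spec while the previously rolled-out revision keeps the rest (the storage URI is illustrative):
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: vit-classifier
spec:
  predictor:
    # 10% of traffic goes to this new spec; 90% stays on the last revision
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://models/vit-v3
```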
---
### ✅ **BentoML**
Build model serving APIs:
```python
import bentoml
from bentoml.io import Image, NumpyNdarray

# Load the saved model from the local BentoML store as a runner
runner = bentoml.pytorch.load_runner("vit_model:latest")
svc = bentoml.Service("vit_classifier", runners=[runner])

@svc.api(input=Image(), output=NumpyNdarray())
def classify(image):
    # Preprocessing (resize/normalize to a tensor) is omitted for brevity
    return runner.run(image)
```
Deploy with:
```bash
bentoml serve service.py:svc
```
> ✅ One-click deployment to AWS, GCP, and Azure.
---
## **14. Scaling ViT with Kubernetes & Kubeflow**
### ✅ Kubernetes for Scaling
- Deploy ViT as a Kubernetes **Deployment** with an HPA (Horizontal Pod Autoscaler); see the sketch after the manifest below.
- Use **GPU nodes** for acceleration.
- Use **Istio** for traffic management.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vit-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vit-inference
  template:
    metadata:
      labels:
        app: vit-inference
    spec:
      containers:
        - name: vit
          image: my-vit-model:latest
          resources:
            limits:
              nvidia.com/gpu: 1
```
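And a sketch of the HPA mentioned above, scaling the deployment on CPU utilization (scaling on GPU or QPS metrics requires a custom metrics adapter; the thresholds are illustrative):
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vit-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vit-inference
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```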
---
### ✅ Kubeflow for End-to-End Pipelines
Kubeflow Pipelines let you orchestrate:
- Data preprocessing
- Training
- Evaluation
- Deployment
> ✅ A full MLOps workflow on Kubernetes.
---
## **15. Security & Compliance in Production AI**
### ✅ Key Concerns
| Issue | Mitigation |
|------|-----------|
| **Model Theft** | Encrypt models, use secure serving |
| **Data Privacy** | Anonymize inputs, comply with GDPR |
| **Adversarial Attacks** | Add input validation, use adversarial training |
| **Bias & Fairness** | Audit model across demographics |
| **Regulatory Compliance** | SOC2, HIPAA, ISO 27001 |
> ✅ Never log raw user images without consent.
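As one concrete mitigation from the table, here is a minimal input-validation sketch that rejects oversized or non-image payloads before they ever reach the model; the size cap and allowed formats are illustrative choices:
```python
import io

from PIL import Image

MAX_BYTES = 5 * 1024 * 1024  # illustrative 5 MB upload cap
ALLOWED_FORMATS = {"JPEG", "PNG"}

def validate_upload(payload: bytes) -> Image.Image:
    if len(payload) > MAX_BYTES:
        raise ValueError("payload too large")
    img = Image.open(io.BytesIO(payload))
    img.verify()  # raises on truncated or corrupt files
    if img.format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {img.format}")
    # Re-open: verify() leaves the image object unusable for further reads
    return Image.open(io.BytesIO(payload)).convert("RGB")
```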
---
## 🧪 **16. Case Study: Deploying ViT in a Real-Time Video Pipeline**
### 🔹 Use Case: Smart Retail Store
Detect customer demographics and behavior in real time.
### ✅ Architecture
```
RTSP Cameras → Video Decoder → Frame Sampler
                    ↓
       ViT (MobileViT-S) → Age/Gender/Emotion
                    ↓
       Kafka → Analytics Dashboard
                    ↓
       Alert System (e.g., loitering)
```
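A minimal sketch of the frame-sampler stage, assuming OpenCV, a placeholder RTSP URL, and a hypothetical `enqueue_for_inference` handoff (production pipelines usually decode on hardware, e.g., NVDEC):
```python
import cv2

SAMPLE_EVERY_N = 5  # run ViT on every 5th frame to stay within the latency budget

cap = cv2.VideoCapture("rtsp://camera.local/stream")  # placeholder URL
frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % SAMPLE_EVERY_N == 0:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        enqueue_for_inference(rgb)  # hypothetical handoff to the ViT service
    frame_idx += 1
cap.release()
```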
### ✅ MLOps Setup
- **Monitoring**: Prometheus + Grafana for latency.
- **Drift Detection**: W&B for embedding drift.
- **CI/CD**: GitHub Actions → train on new data weekly.
- **Rollback**: TorchServe model versioning.
- **Security**: On-premise GPU servers, no cloud.
> ✅ Runs at **30 FPS** on 4x RTX 3090 GPUs.
---
## ⚠️ **17. Common Pitfalls & Best Practices**
### ❌ **Pitfall 1: No Monitoring**
"Works fine" until it doesn't.
✅ **Fix**: Monitor **latency, errors, and drift**.
---
### ❌ **Pitfall 2: Ignoring Data Drift**
A model trained on studio images fails on real-world photos.
✅ **Fix**: Log embeddings and detect drift.
---
### ❌ **Pitfall 3: Manual Deployments**
"Let me ssh into the server..."
✅ **Fix**: Use **CI/CD pipelines**.
---
### ❌ **Pitfall 4: No Rollback Plan**
Broken model → hours of downtime.
✅ **Fix**: Use **canary rollouts** and **versioned serving**.
---
### ✅ **Best Practices**
- Use **MLflow** for tracking.
- Log to **W&B** for visualization.
- Serve with **KServe or BentoML**.
- Monitor **24/7**.
- Automate **testing and deployment**.
---
## 🖼️ **18. Visualizing MLOps Pipeline (Diagram)**
```
+----------------+       +-------------------+
|  Code & Data   | ----> |  CI/CD Pipeline   |
|  (Git, DVC)    |       |  (Test, Train)    |
+----------------+       +---------+---------+
                                   |
                                   v
                  +---------------------------------+
                  |         MLflow Tracking         |
                  | (Experiments, Models, Metrics)  |
                  +----------------+----------------+
                                   |
                                   v
                  +---------------------------------+
                  |    Model Registry (Staging)     |
                  +----------------+----------------+
                                   |
                                   v
                  +---------------------------------+
                  |   A/B Test or Canary Rollout    |
                  +----------------+----------------+
                                   |
          +------------------------+------------------------+
          |                        |                        |
 +--------v--------+     +---------v---------+    +---------v---------+
 |  Old Model (A)  |     |   New Model (B)   |    | Shadow Mode (Log) |
 +--------+--------+     +---------+---------+    +---------+---------+
          |                        |                        |
          +------------------------+------------------------+
                                   |
                                   v
                  +---------------------------------+
                  |       Production Serving        |
                  |  (KServe, TorchServe, BentoML)  |
                  +----------------+----------------+
                                   |
                                   v
                  +---------------------------------+
                  |       Monitoring & Alerts       |
                  |     (Prometheus, W&B, ELK)      |
                  +----------------+----------------+
                                   |
                                   v
                  +---------------------------------+
                  |   Drift Detection & Rollback    |
                  +---------------------------------+
```
> This is a **production-grade MLOps pipeline** for Vision Transformers.
---
## **19. Summary & What's Next in Part 7**
### ✅ **What You've Learned in Part 6**
- The **gap between research and production**.
- **MLOps** principles for Vision Transformers.
- How to **monitor models** for latency, drift, and failures.
- How to use **MLflow** and **Weights & Biases** for tracking.
- How to build **CI/CD pipelines** for ML.
- How to run **A/B tests** and **canary rollouts**.
- How to manage **model registry and lineage**.
- How to detect **anomalies** and **outliers**.
- How to serve models with **KServe, BentoML, and TorchServe**.
- How to scale with **Kubernetes and Kubeflow**.
- How to ensure **security and compliance**.
---
### **What's Coming in Part 7: The Future of Vision Transformers - Multimodal, 3D, and Beyond**
In the final part, we'll explore:
- **Multimodal Transformers**: CLIP, Flamingo, PaLM-E.
- **3D Vision Transformers**: for point clouds and meshes.
- **Video & Temporal Modeling**: TimeSformer, ViViT.
- **Embodied AI**: robots using ViT for navigation.
- **Medical Vision Transformers**: for pathology and radiology.
- **Next-Gen Architectures**: Mamba, RetNet, and alternatives to attention.
- **Vision Transformers in Web & AR/VR**.
> **#MultimodalAI #3DViT #TimeSformer #PaLME #MedicalAI #FutureOfAI**
---
## Final Words
You've just completed a deep, end-to-end exploration of Vision Transformers in production.
> 💬 **"In the real world, the model is just the beginning. The system is what matters."**
You now know how to take a ViT model from **research notebook** to **24/7 production system**, with monitoring, CI/CD, rollback, and scalability.
In **Part 7**, we'll look to the **future**, where Vision Transformers merge with language, 3D, robotics, and medicine.
---
**Pro Tip**: Document your MLOps pipeline. Your future self will thank you.
**Share this guide** with your team. It's a playbook for **real-world AI**.
---
✅ **You're now ready for Part 7!**
The final chapter: **The Future of Vision Transformers**.
#MLOps #ModelMonitoring #CIforML #MLflow #WandB #Kubeflow #ProductionAI #DeepLearning #ComputerVision #Transformers #AIOps