# Explore Kubeflow on Azure Kubernetes Service
Kubeflow is an open source machine learning toolkit for Kubernetes. It was co-founded at Google by three engineers, [David Aronchick](https://twitter.com/aronchick), [Jeremy Lewi](https://twitter.com/jeremylewi), and Vishnu Kannan, and announced at [KubeCon 2017](https://www.youtube.com/watch?v=R3dVF5wWz-g). It was originally developed to run TensorFlow jobs on Kubernetes.
Kubeflow reached 1.0 in February 2020 and, as of March 2022, is at 1.5. It now offers a "multi-cloud, multi-architecture" framework that supports a broad range of open source machine learning tools such as [PyTorch](https://pytorch.org/) and [Jupyter](https://jupyter.org/), to name only the two that we'll explore in this blog post.
Kubeflow allows you to run your ML workflows atop Kubernetes, anywhere. This includes the cloud, where all major public clouds offer native integrations, as well as on-premises, or even locally using [KIND or K3s](https://www.kubeflow.org/docs/components/pipelines/installation/localcluster-deployment/).
Rather than a monolithic stack, Kubeflow is built around composability, which enables you to choose the parts of the project that suit your requirements.
In the same vein, you can install most of the components in a single step, or individually, as you need them.
In this post, we'll walk you through the steps to deploy Kubeflow atop Azure Kubernetes Service and then launch a JupyterLab notebook, where you can build a PyTorch pipeline and run it using the [Kubeflow Pipelines SDK](https://www.kubeflow.org/docs/components/pipelines/introduction/).
This is just one of many workflows, from distributed training to model serving, that Kubeflow enables, and one we'll build on in future posts.
## Before we continue...
Machine learning is a specialized workload that often takes advantage of specialized hardware such as GPUs. In this post, however, we will create a cluster whose node pool uses instances without GPUs.
You may wish to use [multiple node pools](https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools), with a [GPU node pool](https://docs.microsoft.com/en-us/azure/aks/gpu-cluster), which uses [GPU-enabled VMs](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) such as the `Standard_NC6`. The [NVIDIA device plugin](https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#add-the-nvidia-device-plugin) should be enabled, either automatically using the AKS GPU image, or manually.
GPU node pools can be [auto-scaled](https://docs.microsoft.com/en-us/azure/aks/cluster-autoscaler) to reduce costs when GPU resources are not actively in use. However, this is outside the scope of this introductory post.
Kubeflow has multiple [installation options](https://www.kubeflow.org/docs/started/installing-kubeflow/), including "packaged distributions" (which include integrations for multiple clouds, including Azure, as well as distributions with enterprise vendor support) and the [upstream manifests](https://github.com/kubeflow/manifests#installation). In this post we will use the manifests, which provide maximum flexibility, ease of learning and hacking, and access to the latest version(s) when exploring Kubeflow.
This is a test/dev cluster which we will access via `kubectl port-forward`. Do not expose it to the internet without proper authentication such as [OIDC](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/auth-oidc), which allows it to be protected by Azure Active Directory (AAD), GitHub, etc, or even a VPN solution such as [Tailscale](https://tailscale.com/blog/kubecon-21/).
Kubeflow does not yet support Kubernetes 1.22 and higher, though this is being [actively worked on](https://github.com/kubeflow/kubeflow/issues/6353). We will deploy Kubeflow on Kubernetes 1.21, and we suggest using a dedicated cluster for this purpose.
Kubeflow manifests are deployed via [kustomize](https://kustomize.io/). Kubeflow currently supports only kustomize version 3.2.0, though support for newer versions is also being [actively worked on](https://github.com/kubeflow/manifests/issues/1797). For many other use cases, kustomize is built into kubectl via [kubectl apply -k](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/).
Kustomize, kubectl, and the Azure CLI are the only local dependencies for this post. However, you will need to run them in a bash shell on Linux (including Multipass, Windows Subsystem for Linux, Docker, Azure Cloud Shell, GitHub Codespaces, etc.) or macOS.
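Before going further, it can help to see which of these tools are already on your `PATH`. The following is a minimal sketch, not part of the Kubeflow or Azure tooling: `missing_tools` is a hypothetical helper name, and `git` and `curl` are included only because later commands in this post call them. Anything it reports is installed in the sections that follow.

```bash
# Hypothetical helper: print the name of every tool that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# az, kubectl, and kustomize are the dependencies named above; git and curl
# are assumed here because later steps use them.
missing=$(missing_tools az kubectl kustomize git curl)
if [ -n "$missing" ]; then
  echo "Not installed yet:" $missing >&2
fi
```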
## Deploy AKS
In this post we will be using an Azure Kubernetes Service (AKS) cluster, which we will [deploy using the Azure CLI](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster). Make sure you have the [Azure CLI installed](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli) before you continue.
Set some environment variables
```bash
RESOURCE_GROUP='my-aks'
KUBERNETES_VERSION='1.21.9'
NODE_VM_SIZE='Standard_DS2_v2'
```
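Because Kubeflow does not yet support Kubernetes 1.22 and higher, a small guard can catch an unsupported value before a cluster is created. This is a sketch only: `supports_kubeflow` is a hypothetical helper, and the 1.21 ceiling reflects the constraint described earlier in this post rather than anything queried from Kubeflow itself.

```bash
# Hypothetical helper: succeed only for Kubernetes 1.x versions up to 1.21.
supports_kubeflow() {
  major=${1%%.*}
  minor=$(printf '%s' "$1" | cut -d . -f 2)
  [ "$major" = "1" ] && [ "$minor" -le 21 ] 2>/dev/null
}

KUBERNETES_VERSION='1.21.9'
if supports_kubeflow "$KUBERNETES_VERSION"; then
  echo "Kubernetes $KUBERNETES_VERSION is in the supported range"
else
  echo "Kubernetes $KUBERNETES_VERSION is not in the supported range" >&2
fi
```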
Note that the latest 1.21.* `KUBERNETES_VERSION` was discovered via the following command:
```bash
az aks get-versions \
--location eastus \
--query 'orchestrators[].orchestratorVersion' \
--out table
```
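To pick the newest patch release out of that list without reading the table by eye, the versions can be filtered and sorted in the shell. This is a rough sketch: `latest_patch` is a hypothetical helper, and it assumes one plain `major.minor.patch` version per line, which is what `--out tsv` should produce for this query.

```bash
# Hypothetical helper: read versions on stdin, print the highest patch
# release that starts with the given major.minor prefix.
latest_patch() {
  awk -v prefix="$1." 'index($0, prefix) == 1' |
    sort -t . -k 3,3n |
    tail -n 1
}

# Sample input standing in for the az output (illustrative values only).
printf '1.20.13\n1.21.7\n1.21.9\n1.22.4\n' | latest_patch 1.21   # prints 1.21.9

# Against a live subscription (assumes you are logged in with az):
#   az aks get-versions --location eastus \
#     --query 'orchestrators[].orchestratorVersion' --out tsv | latest_patch 1.21
```

The patch component is sorted numerically, so a release such as 1.21.14 ranks above 1.21.9.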
Create a Resource Group
```bash
az group create --name $RESOURCE_GROUP \
--location eastus
```
Create an AKS cluster
```bash
az aks create --resource-group $RESOURCE_GROUP \
--name aks1 \
--node-count 3 \
--node-vm-size $NODE_VM_SIZE \
--kubernetes-version $KUBERNETES_VERSION \
--enable-addons monitoring \
--generate-ssh-keys
```
Install `kubectl` if you do not have it installed already
```bash
az aks install-cli
```
Configure `kubectl` to authenticate to your cluster
```bash
az aks get-credentials --resource-group $RESOURCE_GROUP \
--name aks1
```
## Install kustomize
Next we will install the [kustomize](https://kustomize.io/) binary from its [GitHub release](https://github.com/kubernetes-sigs/kustomize/releases/tag/v3.2.0). If you are on macOS, update `PLATFORM` from `linux` to `darwin`.
```bash
PLATFORM='linux'
curl -OL "https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_${PLATFORM}_amd64"
chmod +x "kustomize_3.2.0_${PLATFORM}_amd64"
sudo mv "kustomize_3.2.0_${PLATFORM}_amd64" /usr/local/bin/kustomize
```
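Rather than editing `PLATFORM` by hand, the value can be derived from `uname`. A minimal sketch, assuming an amd64 machine (the file names above are all `amd64` builds) and using `kustomize_asset` as a hypothetical helper name:

```bash
# Hypothetical helper: map a kernel name such as "Linux" or "Darwin" to the
# kustomize v3.2.0 release asset name used in the download step.
kustomize_asset() {
  platform=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  echo "kustomize_3.2.0_${platform}_amd64"
}

kustomize_asset Darwin   # prints kustomize_3.2.0_darwin_amd64

# Asset name for the current machine.
ASSET=$(kustomize_asset "$(uname -s)")
echo "$ASSET"
```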
## Deploy Kubeflow
First download the manifests from [kubeflow/manifests](https://github.com/kubeflow/manifests)
```bash
git clone https://github.com/kubeflow/manifests.git
cd manifests/
```
Install all of the components via a single command
```bash
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
```
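Retries are typically needed because some resources cannot be applied until the CRDs they depend on are registered. The loop above retries forever, which is convenient interactively but can hang an unattended script. One possible variation caps the number of attempts. This is a sketch, not something shipped with the manifests: `retry` is a hypothetical helper, and the attempt count and delay are arbitrary.

```bash
# Hypothetical helper: retry <max_attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "Giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage from the manifests/ directory (30 attempts and 10s are assumptions):
#   retry 30 10 sh -c 'kustomize build example | kubectl apply -f -'
```

Wrapping the pipeline in `sh -c` keeps `kustomize build` and `kubectl apply` together as a single command, so both are rerun on each attempt.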
Once the command has completed, check the pods are ready
```bash
kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com
```
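Scanning seven namespaces by eye is error-prone, so the output can be filtered down to the pods that still need attention. The sketch below assumes the default `kubectl get pods --no-headers` column layout (name, ready count, status, restarts, age); `not_ready_pods` is a hypothetical helper, and the sample rows are made up.

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` rows on stdin and
# print the names of pods that are neither fully ready nor Completed.
not_ready_pods() {
  awk '$3 != "Completed" {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Made-up sample rows for illustration.
printf '%s\n' \
  'ml-pipeline-abc   2/2   Running     0   5m' \
  'katib-db-xyz      0/1   Pending     0   5m' \
  'setup-job-123     0/1   Completed   0   5m' | not_ready_pods   # prints katib-db-xyz

# Against the cluster:
#   kubectl get pods -n kubeflow --no-headers | not_ready_pods
```

An empty result means every pod in that namespace is either running and fully ready or has completed.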
Run `kubectl port-forward` to access the Kubeflow dashboard
```bash
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```
Finally, open <http://localhost:8080> and log in with the default user's credentials. The default email address is `user@example.com` and the default password is `12341234`.
----
## Summary
Once you have finished exploring, delete the `my-aks` resource group, which contains your AKS cluster, to avoid any further charges.
```bash
az group delete -n my-aks
```