# Kubeflow Quick Start
## Lab Environments
- Minikube
- GPU Cluster
- Katacoda
## Kubeflow Hands-on Contents
- JupyterHub hands-on
- TF-operator
- PyTorch-operator
- MPI Training
> More tutorials: [reference](https://www.kubeflow.org/docs/started/getting-started/)
### Other scenarios can be studied online via Katacoda
- Deploying Github Issue Summarization with Kubeflow
- Deploying PyTorch with Kubeflow
- Deploying Kubeflow
- Deploying Kubeflow with Ksonnet
---
# Deploying Kubeflow on Minikube
#### Make sure Minikube is up and running
```shell=
minikube start --memory=8192 --cpus=4 --kubernetes-version=v1.10.6
```
## ksonnet install
```shell=
curl -L https://github.com/ksonnet/ksonnet/releases/download/v0.11.0/ks_0.11.0_linux_amd64.tar.gz | tar xvz && mv ks_0.11.0_linux_amd64/ks /usr/local/bin/ks && rm -rf ks_0.11.0_linux_amd64/
```
### Kubeflow install
```shell=
export KUBEFLOW_VERSION=0.2.2
curl https://raw.githubusercontent.com/kubeflow/kubeflow/v0.2.2/scripts/deploy.sh | bash
```
Check the deployment status (pulling the images takes a while):
```shell=
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
ambassador-59cb5ccd89-cltqx 2/2 Running 0 1m
ambassador-59cb5ccd89-sl87k 2/2 Running 0 1m
ambassador-59cb5ccd89-szchn 2/2 Running 0 1m
centraldashboard-7d7744cccb-cntj6 1/1 Running 0 1m
spartakus-volunteer-55577f4bd9-kgnvv 1/1 Running 0 1m
tf-hub-0 1/1 Running 0 1m
tf-job-dashboard-bfc9bc6bc-wmbj5 1/1 Running 0 1m
tf-job-operator-v1alpha2-756cf9cb97-r45kk 1/1 Running 0 1m
```
---
### Supplement: Minikube for Kubeflow (Bootstrapper)
Install Kubeflow using the Bootstrapper:
```shell=
$ curl -O https://raw.githubusercontent.com/kubeflow/kubeflow/v0.2-branch/bootstrap/bootstrapper.yaml
```
Apply the config:
```shell=
$ kubectl create -f bootstrapper.yaml
```
Resources created:
```shell=
namespace "kubeflow-admin" created
clusterrolebinding.rbac.authorization.k8s.io "kubeflow-cluster-admin" created
persistentvolumeclaim "kubeflow-ksonnet-pvc" created
statefulset.apps "kubeflow-bootstrapper" created
```
Verify that the namespace was created successfully:
```shell=
$ kubectl get ns
NAME STATUS AGE
default Active 1m
kube-public Active 1m
kube-system Active 1m
kubeflow-admin Active 53s
```
This step takes 5 to 8 minutes, since several images need to be pulled:
```shell=
$ kubectl -n kubeflow get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ambassador ClusterIP 10.97.168.31 <none> 80/TCP 1m
ambassador-admin ClusterIP 10.99.5.81 <none> 8877/TCP 1m
centraldashboard ClusterIP 10.111.104.142 <none> 80/TCP 1m
k8s-dashboard ClusterIP 10.102.65.244 <none> 443/TCP 1m
tf-hub-0 ClusterIP None <none> 8000/TCP 1m
tf-hub-lb ClusterIP 10.101.15.28 <none> 80/TCP 1m
tf-job-dashboard ClusterIP 10.106.133.49 <none> 80/TCP 1m
```
Expose JupyterHub and the Kubeflow dashboard:
```shell=
$ POD=`kubectl -n kubeflow get pods --selector=service=ambassador | awk '{print $1}' | tail -1`
$ kubectl -n kubeflow port-forward $POD 8080:80 2>&1 >/dev/null &
$ POD=`kubectl -n kubeflow get pods --selector=app=tf-hub | awk '{print $1}' | tail -1`
$ kubectl -n kubeflow port-forward $POD 8000:8000 2>&1 >/dev/null &
```
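The pod-name extraction above (`awk '{print $1}' | tail -1`) can be sketched in Python as a minimal illustration; the sample output below is made up, and the helper name `last_pod_name` is hypothetical:

```python
# Mimic `kubectl get pods ... | awk '{print $1}' | tail -1`:
# take the first column of the last non-empty line of the output.
def last_pod_name(kubectl_output: str) -> str:
    lines = [l for l in kubectl_output.strip().splitlines() if l.strip()]
    return lines[-1].split()[0]

# Illustrative sample of `kubectl -n kubeflow get pods --selector=service=ambassador`
sample = """\
NAME                          READY  STATUS   RESTARTS  AGE
ambassador-59cb5ccd89-cltqx   2/2    Running  0         1m
ambassador-59cb5ccd89-sl87k   2/2    Running  0         1m
"""
print(last_pod_name(sample))  # ambassador-59cb5ccd89-sl87k
```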
Kubeflow dashboard at http://localhost:8080/
JupyterHub at http://localhost:8000/
---
### Upgrading Kubeflow Deployments
- What happens when the CRD API is updated later?
Delete the v1alpha1 TFJobs first, because Kubernetes cannot serve multiple versions of the same CRD:
```shell=
kubectl delete crd tfjobs.kubeflow.org
```
---
## Testing Kubeflow with JupyterHub
Check that the Kubeflow components are deployed:
```shell=
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
ambassador-59cb5ccd89-mxdpz 2/2 Running 0 13m
ambassador-59cb5ccd89-nzcbt 2/2 Running 0 13m
ambassador-59cb5ccd89-rzrs5 2/2 Running 0 13m
centraldashboard-7d7744cccb-kh7r6 1/1 Running 0 13m
spartakus-volunteer-56d4cf6bbf-vf9nt 1/1 Running 0 13m
tf-hub-0 1/1 Running 0 13m
tf-job-dashboard-bfc9bc6bc-zfk64 1/1 Running 0 13m
tf-job-operator-v1alpha2-756cf9cb97-w2w9q 1/1 Running 0 13m
```
Once the components are confirmed, you can log in to Jupyter Notebook, but first the Kubernetes Service needs to be modified with the following command:
```shell=
$ kubectl -n default edit svc tf-hub-lb
# in the editor, change the service spec as follows:
#   spec:
#     ...
#     type: NodePort
```
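Editing the Service interactively can also be replaced by `kubectl patch` with a merge patch. A minimal sketch of building that patch document (the `PATCH` variable name is just for illustration):

```python
import json

# The same change as the interactive edit, expressed as a merge patch that
# could be passed to: kubectl -n default patch svc tf-hub-lb -p "$PATCH"
patch = {"spec": {"type": "NodePort"}}
PATCH = json.dumps(patch)
print(PATCH)  # {"spec": {"type": "NodePort"}}
```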
Check the Minikube IP:
```shell=
$ minikube ip
192.168.99.100
```
Confirm the IP:port mapped to the Service:
```shell=
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ambassador ClusterIP 10.97.154.89 <none> 80/TCP 15m
ambassador-admin ClusterIP 10.107.124.60 <none> 8877/TCP 15m
centraldashboard ClusterIP 10.107.126.80 <none> 80/TCP 15m
k8s-dashboard ClusterIP 10.103.240.15 <none> 443/TCP 15m
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 17m
tf-hub-0 ClusterIP None <none> 8000/TCP 15m
tf-hub-lb NodePort 10.100.10.226 <none> 80:30552/TCP 15m
tf-job-dashboard ClusterIP 10.97.210.229 <none> 80/TCP 15m
```
Open the corresponding `192.168.99.100:30552` to reach JupyterHub.
_Enter any username/password; a new account is created automatically._
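The URL above is derived from the Minikube IP plus the NodePort shown in the `PORT(S)` column (`80:30552/TCP`). A small sketch of that derivation (the helper name `node_port_url` is hypothetical):

```python
# Build the JupyterHub URL from the Minikube IP and the PORT(S) column
# of `kubectl get svc`, e.g. "80:30552/TCP" -> node port 30552.
def node_port_url(minikube_ip: str, ports_column: str) -> str:
    port_part = ports_column.split("/")[0]   # "80:30552"
    node_port = port_part.split(":")[1]      # "30552"
    return "http://{}:{}".format(minikube_ip, node_port)

print(node_port_url("192.168.99.100", "80:30552/TCP"))
# http://192.168.99.100:30552
```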

After logging in, click the Start My Server button to open the Spawner options; several images are available by default:

After choosing resources, you will see a new `jupyter-admin` pod being created in the Minikube environment:
```shell=
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
ambassador-59cb5ccd89-mxdpz 2/2 Running 0 21m
ambassador-59cb5ccd89-nzcbt 2/2 Running 0 21m
ambassador-59cb5ccd89-rzrs5 2/2 Running 0 21m
centraldashboard-7d7744cccb-kh7r6 1/1 Running 0 21m
jupyter-admin 0/1 ContainerCreating 0 3m
spartakus-volunteer-56d4cf6bbf-vf9nt 1/1 Running 0 21m
tf-hub-0 1/1 Running 0 21m
tf-job-dashboard-bfc9bc6bc-zfk64 1/1 Running 0 21m
tf-job-operator-v1alpha2-756cf9cb97-w2w9q 1/1 Running 0 21m
```
Pulling the images takes time; you can also build your own Dockerfile. Once the container is up, you can work with it directly through Jupyter Notebook.
Check the pod creation status:
```shell=
$ kubectl describe pod jupyter-admin
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34s default-scheduler Successfully assigned jupyter-admin to minikube
Normal SuccessfulMountVolume 34s kubelet, minikube MountVolume.SetUp succeeded for volume "pvc-11a86a3d-a8cd-11e8-8ff4-080027920b17"
Normal SuccessfulMountVolume 34s kubelet, minikube MountVolume.SetUp succeeded for volume "no-api-access-please"
Normal Pulling 33s kubelet, minikube pulling image "gcr.io/kubeflow/tensorflow-notebook-cpu:latest"
```
> If the image downloads are too slow in your environment, you can try the already-deployed JupyterHub instead:
> http://140.128.18.98:31715/



---
## Kubeflow with TF-operator
Make sure kubeflow-broadmission has been cloned:
```shell=
git clone https://github.com/kairen/kubeflow-broadmission.git
```
`TFJob` is a Kubernetes Custom Resource Definition (CRD) that makes it easy to run TensorFlow training jobs on Kubernetes.
Enter the tf-operator directory and run a simple TFJob test:
```shell=
$ kubectl create -f tf-job-cpu.yml
```
The applied YAML is as follows:
```yaml=
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  labels:
    experiment: experiment10
  name: tfjob
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Ps:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
```
> With TF-operator managing the job, you can scale out distributed training by adjusting the `replicas` count for the workers.
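The scaling idea can be sketched by generating the manifest programmatically; `kubectl` also accepts JSON manifests. This is a minimal sketch, not the full spec: field names follow the v1alpha2 manifest shown above, and the helper name `tfjob_manifest` is hypothetical:

```python
import json

# Build a minimal TFJob manifest with an adjustable number of workers.
# Only the fields needed to illustrate replica scaling are included here.
def tfjob_manifest(workers: int) -> dict:
    def replica(count: int) -> dict:
        return {
            "replicas": count,
            "template": {"spec": {
                "containers": [{
                    "name": "tensorflow",
                    "image": "gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3",
                }],
                "restartPolicy": "OnFailure",
            }},
        }
    return {
        "apiVersion": "kubeflow.org/v1alpha2",
        "kind": "TFJob",
        "metadata": {"name": "tfjob", "namespace": "kubeflow"},
        "spec": {"tfReplicaSpecs": {"Ps": replica(1), "Worker": replica(workers)}},
    }

manifest = tfjob_manifest(3)
print(json.dumps(manifest["spec"]["tfReplicaSpecs"]["Worker"]["replicas"]))  # 3
```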
Confirm the TFJob was created:
```shell=
kubectl get tfjob
NAME CREATED AT
tfjob 5m
```
Monitor the TFJob status:
```shell=
$ kubectl get -o yaml tfjobs $JOB
status:
  conditions:
  - lastTransitionTime: 2018-08-26T03:57:32Z
    lastUpdateTime: 2018-08-26T03:57:32Z
    message: TFJob tfjob is running.
    reason: TFJobRunning
    status: "True"
    type: Running
  startTime: 2018-08-26T04:11:33Z
  tfReplicaStatuses:
    PS:
      active: 1
    Worker:
      active: 1
```
A `Running` condition confirms the TFJob is executing.
The TFJob spawns the tf-benchmarks pods, including the PS and the worker; check their current status:
```
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
ambassador-59cb5ccd89-mxdpz 2/2 Running 0 1h
ambassador-59cb5ccd89-nzcbt 2/2 Running 0 1h
ambassador-59cb5ccd89-rzrs5 2/2 Running 0 1h
centraldashboard-7d7744cccb-kh7r6 1/1 Running 0 1h
jupyter-admin 1/1 Running 0 1h
spartakus-volunteer-56d4cf6bbf-vf9nt 1/1 Running 0 1h
tf-hub-0 1/1 Running 0 1h
tf-job-dashboard-bfc9bc6bc-zfk64 1/1 Running 0 1h
tf-job-operator-v1alpha2-756cf9cb97-w2w9q 1/1 Running 0 1h
tfjob-ps-0 1/1 Running 0 23m
tfjob-worker-0 1/1 Running 0 23m
```
Watch the status of tfjob-worker-0 through its logs:
```shell=
$ kubectl logs -f tfjob-worker-0
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| TensorFlow: 1.5
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Model: resnet50
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Mode: training
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| SingleSess: False
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Batch size: 32 global
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| 32 per device
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Devices: ['/job:worker/task:0/cpu:0']
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Data format: NHWC
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Optimizer: sgd
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Variables: parameter_server
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Sync: True
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| ==========
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Generating model
INFO|2018-08-26T03:57:41|/opt/launcher.py|27| 2018-08-26 03:57:41.378160: I tensorflow/core/distributed_runtime/master_session.cc:1008] Start master session 1d102ed86769e21e with config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true
INFO|2018-08-26T03:57:44|/opt/launcher.py|27| Running warm up
INFO|2018-08-26T04:10:31|/opt/launcher.py|27| Done warm up
INFO|2018-08-26T04:10:31|/opt/launcher.py|27| Step Img/sec loss
INFO|2018-08-26T04:11:52|/opt/launcher.py|27| 1 images/sec: 0.4 +/- 0.0 (jitter = 0.0) 10.224
```
To delete the `tfjob` directly:
```shell=
kubectl -n ${NAMESPACE} delete tfjobs ${JOB_NAME}
```
---
## Kubeflow MPI

> [Reference](https://github.com/uber/horovod/blob/master/docs/benchmarks.md)
> The examples here follow the YAML files in the repository folder.

Check whether the mpijobs custom resource is already installed:
```shell=
$ kubectl get crd
NAME AGE
...
mpijobs.kubeflow.org 4d
...
```
You can add it via ksonnet:
```shell=
cd ${KSONNET_APP}
ks pkg install kubeflow/mpi-job
ks generate mpi-operator mpi-operator
ks apply ${ENVIRONMENT} -c mpi-operator
```
Or create the CRD directly with `crd.yaml`:
```yaml=
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: mpijobs.kubeflow.org
spec:
  group: kubeflow.org
  version: v1alpha1
  scope: Namespaced
  names:
    plural: mpijobs
    singular: mpijob
    kind: MPIJob
    shortNames:
    - mj
    - mpij
  validation:
    openAPIV3Schema:
      properties:
        spec:
          title: The MPIJob spec
          description: Either `gpus` or `replicas` should be specified, but not both
          oneOf:
          - properties:
              gpus:
                title: Total number of GPUs
                description: Valid values are 1, 2, 4, or any multiple of 8
                oneOf:
                - type: integer
                  enum:
                  - 1
                  - 2
                  - 4
                - type: integer
                  multipleOf: 8
                  minimum: 8
            required:
            - gpus
          - properties:
              replicas:
                title: Total number of replicas
                description: The GPU resource limit should be specified for each replica
                type: integer
                minimum: 1
            required:
            - replicas
```
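The `gpus` validation rule in the schema above (1, 2, 4, or any multiple of 8) can be expressed as a one-line predicate; a minimal sketch with a hypothetical function name:

```python
# Encode the CRD's `gpus` constraint: valid values are 1, 2, 4,
# or any multiple of 8 that is at least 8.
def valid_gpus(gpus: int) -> bool:
    return gpus in (1, 2, 4) or (gpus >= 8 and gpus % 8 == 0)

print([n for n in range(1, 17) if valid_gpus(n)])  # [1, 2, 4, 8, 16]
```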
---
## Kubeflow Deploy PyTorch
Kubeflow extends Kubernetes with Custom Resource Definitions (CRDs) and Operators.
Each custom resource describes the deployment of a machine-learning workload; once a resource is defined, the Operator handles the deployment request.
List the CRD resources currently in the environment:
```shell=
$ kubectl get crd
NAME CREATED AT
tfjobs.kubeflow.org 2018-08-26T02:56:54Z
```
In Kubeflow 0.2.2 the PyTorch Operator is not deployed by default; additional commands are needed to deploy the CRD and the Operator.
```shell=
cd kubeflow_ks_app
ks generate pytorch-operator pytorch-operator
ks apply default -c pytorch-operator
```
Run `kubectl get crd` again to confirm:
```shell=
$ kubectl get crd
NAME AGE
pytorchjobs.kubeflow.org 6m
tfjobs.kubeflow.org 8m
```
Download pytorch-operator
```shell=
git clone https://github.com/yylin1/kubeflow-broadmission.git
```
Testing the MNIST training requires building the Docker image in advance:
```shell=
cd pytorch-operator/examples/dist-mnist/
docker build -f Dockerfile -t kubeflow/pytorch-dist-mnist-test:1.0 ./
```
### Distributed PyTorch training
The distributed MNIST model is packaged into a container image; the Python PyTorch code can be viewed on GitHub.
To deploy the training model, a `PyTorchJob` is required. It defines the container image to use and the number of replicas for distributed training.
You can view an example with `cat examples/dist-mnist/pytorch_job_mnist.yaml`.
Deploy the `PyTorchJob` to start training:
```shell=
kubectl create -f examples/dist-mnist/pytorch_job_mnist.yaml
```
Check the created pods and the specified number of worker replicas:
```shell=
kubectl get pods -l pytorch_job_name=dist-mnist-for-e2e-test
```
```shell=
NAME READY STATUS RESTARTS AGE
dist-mnist-for-e2e-test-master-nl90-0-tcbly 0/1 ContainerCreating 0 5s
dist-mnist-for-e2e-test-worker-nl90-1-wtidy 0/1 ContainerCreating 0 4s
dist-mnist-for-e2e-test-worker-nl90-2-8t25x 0/1 ContainerCreating 0 3s
dist-mnist-for-e2e-test-worker-nl90-3-t1i6z 0/1 ContainerCreating 0 3s
```
The default MNIST training runs 10 epochs and takes about 5 to 10 minutes on the cluster. You can follow the logs to track training progress.
```shell=
PODNAME=$(kubectl get pods -l pytorch_job_name=dist-mnist-for-e2e-test,task_index=0 -o name)
kubectl logs -f ${PODNAME}
```
```shell=
$ kubectl logs -f ${PODNAME}
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Rank 0 , epoch 0 : 1.2745472780232237
.
.
.
```
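Progress lines like `Rank 0 , epoch 0 : 1.2745...` can be parsed to track the loss per epoch; a minimal sketch (the helper name `parse_loss` is hypothetical, and the regex only targets the log format shown above):

```python
import re

# Extract (rank, epoch, loss) from a training log line, or None
# if the line is not a loss line (e.g. "Processing..." or "Done!").
LOSS_RE = re.compile(r"Rank (\d+) , epoch (\d+) : ([0-9.]+)")

def parse_loss(line: str):
    m = LOSS_RE.search(line)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))

print(parse_loss("Rank 0 , epoch 0 : 1.2745472780232237"))
# (0, 0, 1.2745472780232237)
```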
Monitor whether the PyTorch job has completed:
```shell=
kubectl get -o yaml pytorchjobs dist-mnist-for-e2e-test
```
The `yaml` output lets you monitor the job status; after the job completes, `state: Succeeded` indicates success.
```yaml=
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha1
  kind: PyTorchJob
  metadata:
    clusterName: ""
    creationTimestamp: 2018-06-22T08:16:14Z
    generation: 1
    name: dist-mnist-for-e2e-test
    namespace: default
    resourceVersion: "3276193"
    selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/pytorchjobs/dist-mnist-for-e2e-test
    uid: 87772d3b-75f4-11e8-bdd9-42010aa00072
  spec:
    RuntimeId: kmma
    pytorchImage: pytorch/pytorch:v0.2
    replicaSpecs:
    - masterPort: 23456
      replicaType: MASTER
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            imagePullPolicy: IfNotPresent
            name: pytorch
            resources: {}
          restartPolicy: OnFailure
    - masterPort: 23456
      replicaType: WORKER
      replicas: 3
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            imagePullPolicy: IfNotPresent
            name: pytorch
            resources: {}
          restartPolicy: OnFailure
    terminationPolicy:
      master:
        replicaName: MASTER
        replicaRank: 0
  status:
    phase: Done
    reason: ""
    replicaStatuses:
    - ReplicasStates:
        Succeeded: 1
      replica_type: MASTER
      state: Succeeded
    - ReplicasStates:
        Running: 1
        Succeeded: 2
      replica_type: WORKER
      state: Running
    state: Succeeded
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
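A simple completion check over the parsed `status` block can be sketched as follows; the function name `job_succeeded` is hypothetical, and `sample_status` is a trimmed-down version of the status shown above:

```python
# Decide whether a PyTorchJob finished successfully, given its
# parsed `status` block from `kubectl get -o yaml pytorchjobs ...`.
def job_succeeded(status: dict) -> bool:
    return status.get("phase") == "Done" and status.get("state") == "Succeeded"

sample_status = {
    "phase": "Done",
    "state": "Succeeded",
    "replicaStatuses": [
        {"replica_type": "MASTER", "state": "Succeeded"},
        {"replica_type": "WORKER", "state": "Running"},
    ],
}
print(job_succeeded(sample_status))  # True
```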
---
## To remove the Kubeflow components from the Kubernetes cluster, run:
```shell=
$ ks delete default -c kubeflow-core
```
## Supplement: Kubeflow 0.2 Components
- TensorFlow Training
- Hyperparameter Tuning (Katib)
- Seldon Serving
- Istio Integration (for TF Serving)
- ..
> [kubeflow](https://www.kubeflow.org/)