# Kubeflow Quick Start
## Lab Environments
- Minikube
- GPU Cluster
- Katacoda
## Kubeflow Hands-on Contents
- JupyterHub hands-on
- TF-operator
- PyTorch-operator
- MPI Training
> More tutorials: [reference](https://www.kubeflow.org/docs/started/getting-started/)
### Other scenarios can be studied online via Katacoda
- Deploying Github Issue Summarization with Kubeflow
- Deploying PyTorch with Kubeflow
- Deploying Kubeflow
- Deploying Kubeflow with Ksonnet
---
# Deploying Kubeflow on Minikube
#### Make sure Minikube is up and running
```shell=
minikube start --memory=8192 --cpus=4 --kubernetes-version=v1.10.6
```
## ksonnet install
```shell=
curl -L https://github.com/ksonnet/ksonnet/releases/download/v0.11.0/ks_0.11.0_linux_amd64.tar.gz | tar xvz && mv ks_0.11.0_linux_amd64/ks /usr/local/bin/ks && rm -rf ks_0.11.0_linux_amd64/
```
### Kubeflow install
```shell=
export KUBEFLOW_VERSION=0.2.2
curl https://raw.githubusercontent.com/kubeflow/kubeflow/v0.2.2/scripts/deploy.sh | bash
```
Check the deployment status (pulling the images takes a while):
```shell=
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
ambassador-59cb5ccd89-cltqx 2/2 Running 0 1m
ambassador-59cb5ccd89-sl87k 2/2 Running 0 1m
ambassador-59cb5ccd89-szchn 2/2 Running 0 1m
centraldashboard-7d7744cccb-cntj6 1/1 Running 0 1m
spartakus-volunteer-55577f4bd9-kgnvv 1/1 Running 0 1m
tf-hub-0 1/1 Running 0 1m
tf-job-dashboard-bfc9bc6bc-wmbj5 1/1 Running 0 1m
tf-job-operator-v1alpha2-756cf9cb97-r45kk 1/1 Running 0 1m
```
---
### Supplement: Minikube for Kubeflow (Bootstrapper)
Install Kubeflow using the Bootstrapper:
```shell=
$ curl -O https://raw.githubusercontent.com/kubeflow/kubeflow/v0.2-branch/bootstrap/bootstrapper.yaml
```
Apply the config:
```shell=
$ kubectl create -f bootstrapper.yaml
```
Resources created:
```shell=
namespace "kubeflow-admin" created
clusterrolebinding.rbac.authorization.k8s.io "kubeflow-cluster-admin" created
persistentvolumeclaim "kubeflow-ksonnet-pvc" created
statefulset.apps "kubeflow-bootstrapper" created
```
Verify that the namespace was created successfully:
```shell=
$ kubectl get ns
NAME STATUS AGE
default Active 1m
kube-public Active 1m
kube-system Active 1m
kubeflow-admin Active 53s
```
This step takes 5 to 8 minutes, since several images need to be pulled:
```shell=
$ kubectl -n kubeflow get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ambassador ClusterIP 10.97.168.31 <none> 80/TCP 1m
ambassador-admin ClusterIP 10.99.5.81 <none> 8877/TCP 1m
centraldashboard ClusterIP 10.111.104.142 <none> 80/TCP 1m
k8s-dashboard ClusterIP 10.102.65.244 <none> 443/TCP 1m
tf-hub-0 ClusterIP None <none> 8000/TCP 1m
tf-hub-lb ClusterIP 10.101.15.28 <none> 80/TCP 1m
tf-job-dashboard ClusterIP 10.106.133.49 <none> 80/TCP 1m
```
Expose JupyterHub and the Kubeflow dashboard:
```shell=
$ POD=`kubectl -n kubeflow get pods --selector=service=ambassador | awk '{print $1}' | tail -1`
$ kubectl -n kubeflow port-forward $POD 8080:80 2>&1 >/dev/null &
$ POD=`kubectl -n kubeflow get pods --selector=app=tf-hub | awk '{print $1}' | tail -1`
$ kubectl -n kubeflow port-forward $POD 8000:8000 2>&1 >/dev/null &
```
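The pod-name extraction above (`awk '{print $1}' | tail -1`) can be sketched in Python as a minimal illustration; the sample output below is made up, and the helper name `last_pod_name` is hypothetical:

```python
# Mimic `kubectl get pods ... | awk '{print $1}' | tail -1`:
# take the first column of the last non-empty line of the output.
def last_pod_name(kubectl_output: str) -> str:
    lines = [l for l in kubectl_output.strip().splitlines() if l.strip()]
    return lines[-1].split()[0]

# Illustrative sample of `kubectl -n kubeflow get pods --selector=service=ambassador`
sample = """\
NAME                          READY  STATUS   RESTARTS  AGE
ambassador-59cb5ccd89-cltqx   2/2    Running  0         1m
ambassador-59cb5ccd89-sl87k   2/2    Running  0         1m
"""
print(last_pod_name(sample))  # ambassador-59cb5ccd89-sl87k
```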
Kubeflow dashboard at http://localhost:8080/
JupyterHub at http://localhost:8000/
---
### Upgrading Kubeflow Deployments
- What happens when the CRD API is updated later?
Delete the v1alpha1 TFJobs first, because Kubernetes cannot serve multiple versions of the same CRD:
```shell=
kubectl delete crd tfjobs.kubeflow.org
```
---
## Testing Kubeflow with JupyterHub
Check that the Kubeflow components are deployed:
```shell=
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
ambassador-59cb5ccd89-mxdpz 2/2 Running 0 13m
ambassador-59cb5ccd89-nzcbt 2/2 Running 0 13m
ambassador-59cb5ccd89-rzrs5 2/2 Running 0 13m
centraldashboard-7d7744cccb-kh7r6 1/1 Running 0 13m
spartakus-volunteer-56d4cf6bbf-vf9nt 1/1 Running 0 13m
tf-hub-0 1/1 Running 0 13m
tf-job-dashboard-bfc9bc6bc-zfk64 1/1 Running 0 13m
tf-job-operator-v1alpha2-756cf9cb97-w2w9q 1/1 Running 0 13m
```
Once the components are confirmed, you can log in to Jupyter Notebook, but first the Kubernetes Service needs to be modified with the following command:
```shell=
$ kubectl -n default edit svc tf-hub-lb
# in the editor, change the service spec as follows:
#   spec:
#     ...
#     type: NodePort
```
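Editing the Service interactively can also be replaced by `kubectl patch` with a merge patch. A minimal sketch of building that patch document (the `PATCH` variable name is just for illustration):

```python
import json

# The same change as the interactive edit, expressed as a merge patch that
# could be passed to: kubectl -n default patch svc tf-hub-lb -p "$PATCH"
patch = {"spec": {"type": "NodePort"}}
PATCH = json.dumps(patch)
print(PATCH)  # {"spec": {"type": "NodePort"}}
```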
Check the Minikube IP:
```shell=
$ minikube ip
192.168.99.100
```
Confirm the IP:port mapped to the Service:
```shell=
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ambassador ClusterIP 10.97.154.89 <none> 80/TCP 15m
ambassador-admin ClusterIP 10.107.124.60 <none> 8877/TCP 15m
centraldashboard ClusterIP 10.107.126.80 <none> 80/TCP 15m
k8s-dashboard ClusterIP 10.103.240.15 <none> 443/TCP 15m
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 17m
tf-hub-0 ClusterIP None <none> 8000/TCP 15m
tf-hub-lb NodePort 10.100.10.226 <none> 80:30552/TCP 15m
tf-job-dashboard ClusterIP 10.97.210.229 <none> 80/TCP 15m
```
Open the corresponding `192.168.99.100:30552` to reach JupyterHub.
_Enter any username/password; a new account is created automatically._
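The URL above is derived from the Minikube IP plus the NodePort shown in the `PORT(S)` column (`80:30552/TCP`). A small sketch of that derivation (the helper name `node_port_url` is hypothetical):

```python
# Build the JupyterHub URL from the Minikube IP and the PORT(S) column
# of `kubectl get svc`, e.g. "80:30552/TCP" -> node port 30552.
def node_port_url(minikube_ip: str, ports_column: str) -> str:
    port_part = ports_column.split("/")[0]   # "80:30552"
    node_port = port_part.split(":")[1]      # "30552"
    return "http://{}:{}".format(minikube_ip, node_port)

print(node_port_url("192.168.99.100", "80:30552/TCP"))
# http://192.168.99.100:30552
```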

After logging in, click the Start My Server button to open the Spawner options; several images are available by default:

After choosing resources, you will see a new `jupyter-admin` pod being created in the Minikube environment:
```shell=
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
ambassador-59cb5ccd89-mxdpz 2/2 Running 0 21m
ambassador-59cb5ccd89-nzcbt 2/2 Running 0 21m
ambassador-59cb5ccd89-rzrs5 2/2 Running 0 21m
centraldashboard-7d7744cccb-kh7r6 1/1 Running 0 21m
jupyter-admin 0/1 ContainerCreating 0 3m
spartakus-volunteer-56d4cf6bbf-vf9nt 1/1 Running 0 21m
tf-hub-0 1/1 Running 0 21m
tf-job-dashboard-bfc9bc6bc-zfk64 1/1 Running 0 21m
tf-job-operator-v1alpha2-756cf9cb97-w2w9q 1/1 Running 0 21m
```
Pulling the images takes time; you can also build your own Dockerfile. Once the container is up, you can work with it directly through Jupyter Notebook.
Check the pod creation status:
```shell=
$ kubectl describe pod jupyter-admin
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34s default-scheduler Successfully assigned jupyter-admin to minikube
Normal SuccessfulMountVolume 34s kubelet, minikube MountVolume.SetUp succeeded for volume "pvc-11a86a3d-a8cd-11e8-8ff4-080027920b17"
Normal SuccessfulMountVolume 34s kubelet, minikube MountVolume.SetUp succeeded for volume "no-api-access-please"
Normal Pulling 33s kubelet, minikube pulling image "gcr.io/kubeflow/tensorflow-notebook-cpu:latest"
```
> If the image downloads are too slow in your environment, you can try the already-deployed JupyterHub instead:
> http://140.128.18.98:31715/



---
## Kubeflow with TF-operator
Make sure kubeflow-broadmission has been cloned:
```shell=
git clone https://github.com/kairen/kubeflow-broadmission.git
```
`TFJob` is a Kubernetes Custom Resource Definition (CRD) that makes it easy to run TensorFlow training jobs on Kubernetes.
Enter the tf-operator directory and run a simple TFJob test:
```shell=
$ kubectl create -f tf-job-cpu.yml
```
The applied YAML is as follows:
```yaml=
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  labels:
    experiment: experiment10
  name: tfjob
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Ps:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
```
> With TF-operator managing the job, you can scale out distributed training by adjusting the `replicas` count for the workers.
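The scaling idea can be sketched by generating the manifest programmatically; `kubectl` also accepts JSON manifests. This is a minimal sketch, not the full spec: field names follow the v1alpha2 manifest shown above, and the helper name `tfjob_manifest` is hypothetical:

```python
import json

# Build a minimal TFJob manifest with an adjustable number of workers.
# Only the fields needed to illustrate replica scaling are included here.
def tfjob_manifest(workers: int) -> dict:
    def replica(count: int) -> dict:
        return {
            "replicas": count,
            "template": {"spec": {
                "containers": [{
                    "name": "tensorflow",
                    "image": "gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3",
                }],
                "restartPolicy": "OnFailure",
            }},
        }
    return {
        "apiVersion": "kubeflow.org/v1alpha2",
        "kind": "TFJob",
        "metadata": {"name": "tfjob", "namespace": "kubeflow"},
        "spec": {"tfReplicaSpecs": {"Ps": replica(1), "Worker": replica(workers)}},
    }

manifest = tfjob_manifest(3)
print(json.dumps(manifest["spec"]["tfReplicaSpecs"]["Worker"]["replicas"]))  # 3
```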
Confirm the TFJob was created:
```shell=
kubectl get tfjob
NAME CREATED AT
tfjob 5m
```
Monitor the TFJob status:
```shell=
$ kubectl get -o yaml tfjobs $JOB
status:
  conditions:
  - lastTransitionTime: 2018-08-26T03:57:32Z
    lastUpdateTime: 2018-08-26T03:57:32Z
    message: TFJob tfjob is running.
    reason: TFJobRunning
    status: "True"
    type: Running
  startTime: 2018-08-26T04:11:33Z
  tfReplicaStatuses:
    PS:
      active: 1
    Worker:
      active: 1
```
A `Running` condition confirms the TFJob is executing.
The TFJob spawns the tf-benchmarks pods, including the PS and the worker; check their current status:
```
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
ambassador-59cb5ccd89-mxdpz 2/2 Running 0 1h
ambassador-59cb5ccd89-nzcbt 2/2 Running 0 1h
ambassador-59cb5ccd89-rzrs5 2/2 Running 0 1h
centraldashboard-7d7744cccb-kh7r6 1/1 Running 0 1h
jupyter-admin 1/1 Running 0 1h
spartakus-volunteer-56d4cf6bbf-vf9nt 1/1 Running 0 1h
tf-hub-0 1/1 Running 0 1h
tf-job-dashboard-bfc9bc6bc-zfk64 1/1 Running 0 1h
tf-job-operator-v1alpha2-756cf9cb97-w2w9q 1/1 Running 0 1h
tfjob-ps-0 1/1 Running 0 23m
tfjob-worker-0 1/1 Running 0 23m
```
Watch the status of tfjob-worker-0 through its logs:
```shell=
$ kubectl logs -f tfjob-worker-0
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| TensorFlow: 1.5
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Model: resnet50
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Mode: training
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| SingleSess: False
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Batch size: 32 global
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| 32 per device
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Devices: ['/job:worker/task:0/cpu:0']
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Data format: NHWC
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Optimizer: sgd
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Variables: parameter_server
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Sync: True
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| ==========
INFO|2018-08-26T03:57:34|/opt/launcher.py|27| Generating model
INFO|2018-08-26T03:57:41|/opt/launcher.py|27| 2018-08-26 03:57:41.378160: I tensorflow/core/distributed_runtime/master_session.cc:1008] Start master session 1d102ed86769e21e with config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true
INFO|2018-08-26T03:57:44|/opt/launcher.py|27| Running warm up
INFO|2018-08-26T04:10:31|/opt/launcher.py|27| Done warm up
INFO|2018-08-26T04:10:31|/opt/launcher.py|27| Step Img/sec loss
INFO|2018-08-26T04:11:52|/opt/launcher.py|27| 1 images/sec: 0.4 +/- 0.0 (jitter = 0.0) 10.224
```
To delete the `tfjob` directly:
```shell=
kubectl -n ${NAMESPACE} delete tfjobs ${JOB_NAME}
```
---
## Kubeflow MPI

> [Reference](https://github.com/uber/horovod/blob/master/docs/benchmarks.md)
> The examples here follow the YAML files in the repository folder.

Check whether the mpijobs custom resource is already installed:
```shell=
$ kubectl get crd
NAME AGE
...
mpijobs.kubeflow.org 4d
...
```
You can add it via ksonnet:
```shell=
cd ${KSONNET_APP}
ks pkg install kubeflow/mpi-job
ks generate mpi-operator mpi-operator
ks apply ${ENVIRONMENT} -c mpi-operator
```
Or create the CRD directly with `crd.yaml`:
```yaml=
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: mpijobs.kubeflow.org
spec:
  group: kubeflow.org
  version: v1alpha1
  scope: Namespaced
  names:
    plural: mpijobs
    singular: mpijob
    kind: MPIJob
    shortNames:
    - mj
    - mpij
  validation:
    openAPIV3Schema:
      properties:
        spec:
          title: The MPIJob spec
          description: Either `gpus` or `replicas` should be specified, but not both
          oneOf:
          - properties:
              gpus:
                title: Total number of GPUs
                description: Valid values are 1, 2, 4, or any multiple of 8
                oneOf:
                - type: integer
                  enum:
                  - 1
                  - 2
                  - 4
                - type: integer
                  multipleOf: 8
                  minimum: 8
            required:
            - gpus
          - properties:
              replicas:
                title: Total number of replicas
                description: The GPU resource limit should be specified for each replica
                type: integer
                minimum: 1
            required:
            - replicas
```
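The `gpus` validation rule in the schema above (1, 2, 4, or any multiple of 8) can be expressed as a one-line predicate; a minimal sketch with a hypothetical function name:

```python
# Encode the CRD's `gpus` constraint: valid values are 1, 2, 4,
# or any multiple of 8 that is at least 8.
def valid_gpus(gpus: int) -> bool:
    return gpus in (1, 2, 4) or (gpus >= 8 and gpus % 8 == 0)

print([n for n in range(1, 17) if valid_gpus(n)])  # [1, 2, 4, 8, 16]
```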
---
## Kubeflow Deploy PyTorch
Kubeflow extends Kubernetes with Custom Resource Definitions (CRDs) and Operators.
Each custom resource describes the deployment of a machine-learning workload; once a resource is defined, the Operator handles the deployment request.
List the CRD resources currently in the environment:
```shell=
$ kubectl get crd
NAME CREATED AT
tfjobs.kubeflow.org 2018-08-26T02:56:54Z
```
In Kubeflow 0.2.2 the PyTorch Operator is not deployed by default; additional commands are needed to deploy the CRD and the Operator.
```shell=
cd kubeflow_ks_app
ks generate pytorch-operator pytorch-operator
ks apply default -c pytorch-operator
```
Run `kubectl get crd` again to confirm:
```shell=
$ kubectl get crd
NAME AGE
pytorchjobs.kubeflow.org 6m
tfjobs.kubeflow.org 8m
```
Download pytorch-operator
```shell=
git clone https://github.com/yylin1/kubeflow-broadmission.git
```
Testing the MNIST training requires building the Docker image in advance:
```shell=
cd pytorch-operator/examples/dist-mnist/
docker build -f Dockerfile -t kubeflow/pytorch-dist-mnist-test:1.0 ./
```
### Distributed PyTorch training
The distributed MNIST model is packaged into a container image; the Python PyTorch code can be viewed on GitHub.
To deploy the training model, a `PyTorchJob` is required. It defines the container image to use and the number of replicas for distributed training.
You can view an example with `cat examples/dist-mnist/pytorch_job_mnist.yaml`.
Deploy the `PyTorchJob` to start training:
```shell=
kubectl create -f examples/dist-mnist/pytorch_job_mnist.yaml
```
Check the created pods and the specified number of worker replicas:
```shell=
kubectl get pods -l pytorch_job_name=dist-mnist-for-e2e-test
```
```shell=
NAME READY STATUS RESTARTS AGE
dist-mnist-for-e2e-test-master-nl90-0-tcbly 0/1 ContainerCreating 0 5s
dist-mnist-for-e2e-test-worker-nl90-1-wtidy 0/1 ContainerCreating 0 4s
dist-mnist-for-e2e-test-worker-nl90-2-8t25x 0/1 ContainerCreating 0 3s
dist-mnist-for-e2e-test-worker-nl90-3-t1i6z 0/1 ContainerCreating 0 3s
```
The default MNIST training runs 10 epochs and takes about 5 to 10 minutes on the cluster. You can follow the logs to track training progress.
```shell=
PODNAME=$(kubectl get pods -l pytorch_job_name=dist-mnist-for-e2e-test,task_index=0 -o name)
kubectl logs -f ${PODNAME}
```
```shell=
$ kubectl logs -f ${PODNAME}
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Rank 0 , epoch 0 : 1.2745472780232237
.
.
.
```
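Progress lines like `Rank 0 , epoch 0 : 1.2745...` can be parsed to track the loss per epoch; a minimal sketch (the helper name `parse_loss` is hypothetical, and the regex only targets the log format shown above):

```python
import re

# Extract (rank, epoch, loss) from a training log line, or None
# if the line is not a loss line (e.g. "Processing..." or "Done!").
LOSS_RE = re.compile(r"Rank (\d+) , epoch (\d+) : ([0-9.]+)")

def parse_loss(line: str):
    m = LOSS_RE.search(line)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))

print(parse_loss("Rank 0 , epoch 0 : 1.2745472780232237"))
# (0, 0, 1.2745472780232237)
```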
Monitor whether the PyTorch job has completed:
```shell=
kubectl get -o yaml pytorchjobs dist-mnist-for-e2e-test
```
The `yaml` output lets you monitor the job status; after the job completes, `state: Succeeded` indicates success.
```yaml=
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha1
  kind: PyTorchJob
  metadata:
    clusterName: ""
    creationTimestamp: 2018-06-22T08:16:14Z
    generation: 1
    name: dist-mnist-for-e2e-test
    namespace: default
    resourceVersion: "3276193"
    selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/pytorchjobs/dist-mnist-for-e2e-test
    uid: 87772d3b-75f4-11e8-bdd9-42010aa00072
  spec:
    RuntimeId: kmma
    pytorchImage: pytorch/pytorch:v0.2
    replicaSpecs:
    - masterPort: 23456
      replicaType: MASTER
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            imagePullPolicy: IfNotPresent
            name: pytorch
            resources: {}
          restartPolicy: OnFailure
    - masterPort: 23456
      replicaType: WORKER
      replicas: 3
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            imagePullPolicy: IfNotPresent
            name: pytorch
            resources: {}
          restartPolicy: OnFailure
    terminationPolicy:
      master:
        replicaName: MASTER
        replicaRank: 0
  status:
    phase: Done
    reason: ""
    replicaStatuses:
    - ReplicasStates:
        Succeeded: 1
      replica_type: MASTER
      state: Succeeded
    - ReplicasStates:
        Running: 1
        Succeeded: 2
      replica_type: WORKER
      state: Running
    state: Succeeded
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
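A simple completion check over the parsed `status` block can be sketched as follows; the function name `job_succeeded` is hypothetical, and `sample_status` is a trimmed-down version of the status shown above:

```python
# Decide whether a PyTorchJob finished successfully, given its
# parsed `status` block from `kubectl get -o yaml pytorchjobs ...`.
def job_succeeded(status: dict) -> bool:
    return status.get("phase") == "Done" and status.get("state") == "Succeeded"

sample_status = {
    "phase": "Done",
    "state": "Succeeded",
    "replicaStatuses": [
        {"replica_type": "MASTER", "state": "Succeeded"},
        {"replica_type": "WORKER", "state": "Running"},
    ],
}
print(job_succeeded(sample_status))  # True
```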
---
## To remove the Kubeflow components from the Kubernetes cluster, run:
```shell=
$ ks delete default -c kubeflow-core
```
## Supplement: Kubeflow 0.2 Components
- TensorFlow Training
- Hyperparameter Tuning (Katib)
- Seldon Serving
- Istio Integration (for TF Serving)
- ..
> [kubeflow](https://www.kubeflow.org/)