Kubernetes with MPI
===
# What is MPI?
* MPI is a cross-language **communication protocol** for programming parallel computers. It supports both point-to-point and broadcast communication.
* MPI is a message-passing application programming interface that includes protocol and semantic specifications, which define how its features must behave in any implementation.
* The goals of MPI are high performance, scalability, and portability. MPI remains the dominant model in high-performance computing today.
[Wiki, MPI](https://zh.wikipedia.org/zh-tw/%E8%A8%8A%E6%81%AF%E5%82%B3%E9%81%9E%E4%BB%8B%E9%9D%A2)
<br>
<br>
MPI itself is only an interface: it defines a standard and a specification, not a particular implementation. Two mainstream implementations are:
| **OpenMPI** | **Intel MPI** |
| -------- | -------- |
| Open-source MPI library | MPI library developed and distributed by Intel |
| Supports Unix and Unix-like operating systems | Optimized for Intel CPUs |
| Runs across (hardware) platforms | |
[HPC NTCU, MPI](https://hackmd.io/@hpc-ntcu/mpi-intro)
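Because MPI is only a specification, it can be useful to check which implementation (and version) a machine actually provides. A minimal check, assuming `mpirun` is already on the PATH:
```
# Prints which MPI implementation provides mpirun (e.g. Open MPI reports its version here).
mpirun --version

# Open MPI additionally ships ompi_info, which lists the build details.
ompi_info | head
```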
<br>
<br>
# Kubernetes and MPI
**1. kube-openmpi**
Uses Kubernetes to launch a cluster of containers capable of supporting the target application set. Once this Kubernetes namespace is created, it is possible to use kubectl and mpiexec to launch applications into the namespace and leverage the deployed OpenMPI environment. (++kube-openmpi only supports OpenMPI++, as the name suggests.)
<br>
* A tool dedicated to running MPI processes on K8S.
* Uses kubectl and mpiexec to launch processes in the deployed OpenMPI environment.
* Supports OpenMPI only.
<br>
**2. Kubeflow/mpi-operator**
Kubeflow’s focus is evidence that the driving force for MPI-Kubernetes integration will be large-scale machine learning. Kubeflow uses a secondary scheduler within Kubernetes, kube-batch, to support the scheduling and uses OpenMPI and a companion ssh daemon for the launch of MPI-based jobs.
<br>
* Kubeflow is an open-source machine-learning platform; mpi-operator is the component of it focused on running MPI processes on K8S.
* Uses kube-batch, a secondary scheduler within K8S, for scheduling.
* Uses OpenMPI and an ssh daemon to launch MPI processes.
[StackHPC, Kubernetes, HPC and MPI](https://www.stackhpc.com/k8s-mpi.html)
<br>
<br>
## MPI Operator
Unlike other operators in Kubeflow such as TF Operator and PyTorch Operator that each support only one machine learning framework, **the MPI operator is decoupled from the underlying framework, so it can work well with many frameworks** such as Horovod, TensorFlow, PyTorch, Apache MXNet, and various collective communication implementations such as OpenMPI.
* mpi-operator works independently of any specific machine-learning framework, so it can be used with many different frameworks.
* mpi-operator supports different collective-communication implementations (such as Open MPI).
[Kubeflow, Introduction to Kubeflow MPI Operator and Industry Adoption](https://medium.com/kubeflow/introduction-to-kubeflow-mpi-operator-and-industry-adoption-296d5f2e6edc)
[Kubeflow, mpi-operator-proposal](https://github.com/kubeflow/community/blob/master/proposals/mpi-operator-proposal.md)
<br>
<br>
<br>
<br>
### MPI job running on k8s - tensorflow-benchmarks
[kubeflow/mpi-operator tensorflow-benchmarks example](https://github.com/kubeflow/mpi-operator/tree/master)
##### Deploy mpi-operator on k8s via kubectl
```
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
```
##### Check that the CRD was deployed successfully
```
kubectl get crd
```
```
NAME                   AGE
...
mpijobs.kubeflow.org   4d
...
```
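Besides the CRD, it is worth checking that the operator's controller is actually running before submitting a job. A minimal check, assuming the manifest above installs the controller into the `mpi-operator` namespace (the namespace may differ between releases):
```
# The mpi-operator deployment should show READY 1/1 before jobs are submitted.
kubectl get deployment,pods -n mpi-operator
```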
##### Run the MPI Job with the yaml file<br>kubeflow/mpi-operator/examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml
```
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  creationTimestamp: "2019-07-09T22:15:51Z"
  generation: 1
  name: tensorflow-benchmarks
  namespace: default
  resourceVersion: "5645868"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/default/mpijobs/tensorflow-benchmarks
  uid: 1c5b470f-a297-11e9-964d-88d7f67c6e6d
spec:
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
            image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 2
```
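The manifest above only defines the job; it still needs to be submitted to the cluster. A minimal sketch, assuming the file is saved locally as `tensorflow-benchmarks.yaml`:
```
# Submit the MPIJob, then watch the launcher and worker pods come up.
kubectl apply -f tensorflow-benchmarks.yaml
kubectl get mpijobs
kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks -w
```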

##### View the pod's status and output via its logs (if the code prints anything)
```
PODNAME=$(kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks,training.kubeflow.org/job-role=launcher -o name)
kubectl logs -f ${PODNAME}
```
```
TensorFlow: 1.14
Model: resnet101
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 128 global
64 per device
Num batches: 100
Num epochs: 0.01
Devices: ['horovod/gpu:0', 'horovod/gpu:1']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: horovod
...
40 images/sec: 154.4 +/- 0.7 (jitter = 4.0) 8.280
40 images/sec: 154.4 +/- 0.7 (jitter = 4.1) 8.482
50 images/sec: 154.8 +/- 0.6 (jitter = 4.0) 8.397
50 images/sec: 154.8 +/- 0.6 (jitter = 4.2) 8.450
60 images/sec: 154.5 +/- 0.5 (jitter = 4.1) 8.321
60 images/sec: 154.5 +/- 0.5 (jitter = 4.4) 8.349
70 images/sec: 154.5 +/- 0.5 (jitter = 4.0) 8.433
70 images/sec: 154.5 +/- 0.5 (jitter = 4.4) 8.430
80 images/sec: 154.8 +/- 0.4 (jitter = 3.6) 8.199
80 images/sec: 154.8 +/- 0.4 (jitter = 3.8) 8.404
90 images/sec: 154.6 +/- 0.4 (jitter = 3.7) 8.418
90 images/sec: 154.6 +/- 0.4 (jitter = 3.6) 8.459
100 images/sec: 154.2 +/- 0.4 (jitter = 4.0) 8.372
100 images/sec: 154.2 +/- 0.4 (jitter = 4.0) 8.542
----------------------------------------------------------------
total images/sec: 308.27
```
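Once the run has finished, the job can be cleaned up. A minimal sketch, reusing the job name from the manifest above:
```
# Deleting the MPIJob also removes the pods the operator created for it.
kubectl delete mpijob tensorflow-benchmarks
```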
<br>
<br>
<br>
<br>
<br>
<br>
## Q&A
#### **Q1. How do I use the mpi-operator docker images?**
**A1.** Kubeflow publishes mpioperator images on Docker Hub, so users can build their own custom images on top of them.
[mpi-operator Dockerfile](https://github.com/kubeflow/mpi-operator/blob/master/Dockerfile)
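A rough sketch of that workflow (the `mpioperator/openmpi` base image, application path, and registry name below are assumptions for illustration; check Docker Hub for the images and tags actually published):
```
# Hypothetical example: build a custom image on top of an mpioperator base
# image and push it so an MPIJob spec can reference it.
cat > Dockerfile <<'EOF'
FROM mpioperator/openmpi
COPY my_mpi_app /opt/my_mpi_app
EOF
docker build -t registry.example.com/my-mpi-app:latest .
docker push registry.example.com/my-mpi-app:latest
```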
<br>
<br>
#### **Q2. What is a slot?**
**A2.**
Definition of a 'slot':
* A slot is an allocation unit for one process.
* The number of slots on a node indicates how many processes can run there.
* By default, Open MPI runs one process per slot.
<br>
If you do not tell MPI how many slots there are, the system falls back to one of two behaviors (a quick demonstration follows below):
1. **Default behavior**: Open MPI counts the processor cores on the node and creates the same number of slots.
2. **With --use-hwthread-cpus**: Open MPI counts the hardware threads on the node and creates the same number of slots.
[mpirun(1) man page (version 4.1.6)](https://www.open-mpi.org/doc/current/man1/mpirun.1.php)
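A quick way to see the default on a given machine (a sketch, assuming Open MPI is installed locally): launch a trivial program without `-np` and count how many copies start.
```
# Without -np, Open MPI starts one process per slot, which by default
# is one per processor core on this node.
mpirun hostname | wc -l

# Counting hardware threads instead usually gives a larger number
# on machines with SMT/hyper-threading enabled.
mpirun --use-hwthread-cpus hostname | wc -l
```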
<br>
<br>
#### **Q3. How do I configure the number of slots?**
**A3.** Write a hostfile that sets how many slots each node may use.

**Oversubscription (oversubscribed): running more processes than the number of slots configured.**
<br>
"**Most MPI applications and HPC environments do not oversubscribe**; for simplicity, the majority of this documentation assumes that oversubscription is not enabled."
[mpirun(1) man page (version 4.1.6)](https://www.open-mpi.org/doc/current/man1/mpirun.1.php)
<br>
In the tensorflow-benchmarks example above, the hostfile on k8s is usually created inside a ConfigMap, as shown in the sketch below.

[reference](https://blog.kubeflow.org/integrations/operators/2020/03/16/mpi-operator.html)
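A hedged way to see this on a live cluster (the ConfigMap name below follows the `<job-name>-config` pattern used by recent mpi-operator releases; the exact name and layout may differ between versions):
```
# Inspect the ConfigMap generated for the job and the hostfile it contains,
# which lists the worker pods and their slots.
kubectl get configmap tensorflow-benchmarks-config -o yaml
```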
<br>
<br>
#### **Q4. How do I specify the number of processes for an mpirun job?**
**A4.**
**-np n**: run n processes in total.
**-npersocket n (--map-by ppr:n:socket)**: run n processes per socket, i.e. (n) * (number of sockets) processes on each node.
**-npernode n (--map-by ppr:n:node)**: run n processes on each node.

[Oracle, Sun HPC ClusterTools 8.2.1c Software User’s Guide](https://docs.oracle.com/cd/E19708-01/821-1319-10/ExecutingPrograms.html)
You can also pick which hosts from the hostfile run the job:
```
$ cat myhostfile
aa slots=2
bb slots=2
cc slots=2
$ mpirun -hostfile myhostfile -host aa ./a.out
```
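Combining the flags above, a small sketch with the same hostfile (semantics as in the cited man page): restrict the job to two hosts and place one process on each.
```
$ mpirun -hostfile myhostfile -host aa,bb -npernode 1 ./a.out
```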
<br>
<br>
#### **Q5. How do I view the results of a job run?**
**A5.**
(to be verified)
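A reasonable sketch in the meantime, consistent with the tensorflow-benchmarks walkthrough above: read the job-level status from the MPIJob object and the program output from the launcher pod's logs.
```
# Job-level status (conditions such as Created / Running / Succeeded).
kubectl get mpijob tensorflow-benchmarks -o yaml

# Program output is written to the launcher pod's stdout.
PODNAME=$(kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks,training.kubeflow.org/job-role=launcher -o name)
kubectl logs ${PODNAME}
```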
<br>
<br>
### [Official FAQ: Running MPI jobs](https://www.open-mpi.org/faq/?category=running#slots-without-hostfiles)