Kubernetes with MPI
===
# What is MPI?
* MPI is a cross-language **communication protocol** for programming parallel computers. It supports both point-to-point and broadcast communication.
* MPI is a message-passing application programming interface that includes protocol and semantic specifications, which define how its features must behave in any implementation.
* The goals of MPI are high performance, scalability, and portability. MPI remains the dominant model in high-performance computing today.
[Wiki, MPI](https://zh.wikipedia.org/zh-tw/%E8%A8%8A%E6%81%AF%E5%82%B3%E9%81%9E%E4%BB%8B%E9%9D%A2)
<br>
<br>
MPI itself is only an interface: it defines a standard and a specification, not a particular implementation. Two mainstream implementations are:
| **OpenMPI** | **Intel MPI** |
| -------- | -------- |
| Open-source MPI library | MPI library developed and distributed by Intel |
| Supports Unix and Unix-like operating systems | Optimized for Intel CPUs |
| Runs across (hardware) platforms | |
[HPC NTCU, MPI](https://hackmd.io/@hpc-ntcu/mpi-intro)
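Because MPI is only a specification, it can be useful to check which implementation (and version) a machine actually provides. A minimal check, assuming `mpirun` is already on the PATH:
```
# Prints which MPI implementation provides mpirun (e.g. Open MPI reports its version here).
mpirun --version

# Open MPI additionally ships ompi_info, which lists the build details.
ompi_info | head
```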
<br>
<br>
# Kubernetes and MPI
**1. kube-openmpi**
Uses Kubernetes to launch a cluster of containers capable of supporting the target application set. Once this Kubernetes namespace is created, it is possible to use kubectl and mpiexec to launch applications into the namespace and leverage the deployed OpenMPI environment. (++kube-openmpi only supports OpenMPI++, as the name suggests.)
<br>
* A tool dedicated to running MPI processes on K8S.
* Uses kubectl and mpiexec to launch processes in the deployed OpenMPI environment.
* Supports OpenMPI only.
<br>
**2. Kubeflow/mpi-operator**
Kubeflow’s focus is evidence that the driving force for MPI-Kubernetes integration will be large-scale machine learning. Kubeflow uses a secondary scheduler within Kubernetes, kube-batch, to support the scheduling and uses OpenMPI and a companion ssh daemon for the launch of MPI-based jobs.
<br>
* Kubeflow is an open-source machine-learning platform; mpi-operator is the component of it focused on running MPI processes on K8S.
* Uses kube-batch, a secondary scheduler within K8S, for scheduling.
* Uses OpenMPI and an ssh daemon to launch MPI processes.
[StackHPC, Kubernetes, HPC and MPI](https://www.stackhpc.com/k8s-mpi.html)
<br>
<br>
## MPI Operator
Unlike other operators in Kubeflow such as TF Operator and PyTorch Operator that each support only one machine learning framework, **the MPI operator is decoupled from the underlying framework, so it can work well with many frameworks** such as Horovod, TensorFlow, PyTorch, Apache MXNet, and various collective communication implementations such as OpenMPI.
* mpi-operator works independently of any specific machine-learning framework, so it can be used with many different frameworks.
* mpi-operator supports different collective-communication implementations (such as Open MPI).
[Kubeflow, Introduction to Kubeflow MPI Operator and Industry Adoption](https://medium.com/kubeflow/introduction-to-kubeflow-mpi-operator-and-industry-adoption-296d5f2e6edc)
[Kubeflow, mpi-operator-proposal](https://github.com/kubeflow/community/blob/master/proposals/mpi-operator-proposal.md)
<br>
<br>
<br>
<br>
### MPI job running on k8s - tensorflow-benchmarks
[kubeflow/mpi-operator tensorflow-benchmarks example](https://github.com/kubeflow/mpi-operator/tree/master)
##### Deploy mpi-operator on k8s via kubectl
```
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
```
##### Check that the CRD was deployed successfully
```
kubectl get crd
```
```
NAME                   AGE
...
mpijobs.kubeflow.org   4d
...
```
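Besides the CRD, it is worth checking that the operator's controller is actually running before submitting a job. A minimal check, assuming the manifest above installs the controller into the `mpi-operator` namespace (the namespace may differ between releases):
```
# The mpi-operator deployment should show READY 1/1 before jobs are submitted.
kubectl get deployment,pods -n mpi-operator
```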
##### Run the MPI Job with the yaml file<br>kubeflow/mpi-operator/examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml
```
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  creationTimestamp: "2019-07-09T22:15:51Z"
  generation: 1
  name: tensorflow-benchmarks
  namespace: default
  resourceVersion: "5645868"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/default/mpijobs/tensorflow-benchmarks
  uid: 1c5b470f-a297-11e9-964d-88d7f67c6e6d
spec:
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
            image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 2
```
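The manifest above only defines the job; it still needs to be submitted to the cluster. A minimal sketch, assuming the file is saved locally as `tensorflow-benchmarks.yaml`:
```
# Submit the MPIJob, then watch the launcher and worker pods come up.
kubectl apply -f tensorflow-benchmarks.yaml
kubectl get mpijobs
kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks -w
```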

##### View the pod's status and output via its logs (if the code prints anything)
```
PODNAME=$(kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks,training.kubeflow.org/job-role=launcher -o name)
kubectl logs -f ${PODNAME}
```
```
TensorFlow: 1.14
Model: resnet101
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 128 global
64 per device
Num batches: 100
Num epochs: 0.01
Devices: ['horovod/gpu:0', 'horovod/gpu:1']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: horovod
...
40 images/sec: 154.4 +/- 0.7 (jitter = 4.0) 8.280
40 images/sec: 154.4 +/- 0.7 (jitter = 4.1) 8.482
50 images/sec: 154.8 +/- 0.6 (jitter = 4.0) 8.397
50 images/sec: 154.8 +/- 0.6 (jitter = 4.2) 8.450
60 images/sec: 154.5 +/- 0.5 (jitter = 4.1) 8.321
60 images/sec: 154.5 +/- 0.5 (jitter = 4.4) 8.349
70 images/sec: 154.5 +/- 0.5 (jitter = 4.0) 8.433
70 images/sec: 154.5 +/- 0.5 (jitter = 4.4) 8.430
80 images/sec: 154.8 +/- 0.4 (jitter = 3.6) 8.199
80 images/sec: 154.8 +/- 0.4 (jitter = 3.8) 8.404
90 images/sec: 154.6 +/- 0.4 (jitter = 3.7) 8.418
90 images/sec: 154.6 +/- 0.4 (jitter = 3.6) 8.459
100 images/sec: 154.2 +/- 0.4 (jitter = 4.0) 8.372
100 images/sec: 154.2 +/- 0.4 (jitter = 4.0) 8.542
----------------------------------------------------------------
total images/sec: 308.27
```
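Once the run has finished, the job can be cleaned up. A minimal sketch, reusing the job name from the manifest above:
```
# Deleting the MPIJob also removes the pods the operator created for it.
kubectl delete mpijob tensorflow-benchmarks
```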
<br>
<br>
<br>
<br>
<br>
<br>
## Q&A
#### **Q1. How do I use the mpi-operator docker images?**
**A1.** Kubeflow publishes mpioperator images on Docker Hub, so users can build their own custom images on top of them.
[mpi-operator Dockerfile](https://github.com/kubeflow/mpi-operator/blob/master/Dockerfile)
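A rough sketch of that workflow (the `mpioperator/openmpi` base image, application path, and registry name below are assumptions for illustration; check Docker Hub for the images and tags actually published):
```
# Hypothetical example: build a custom image on top of an mpioperator base
# image and push it so an MPIJob spec can reference it.
cat > Dockerfile <<'EOF'
FROM mpioperator/openmpi
COPY my_mpi_app /opt/my_mpi_app
EOF
docker build -t registry.example.com/my-mpi-app:latest .
docker push registry.example.com/my-mpi-app:latest
```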
<br>
<br>
#### **Q2. What is a slot?**
**A2.**
Definition of a 'slot':
* A slot is an allocation unit for one process.
* The number of slots on a node indicates how many processes can run there.
* By default, Open MPI runs one process per slot.
<br>
If you do not tell MPI how many slots there are, the system falls back to one of two behaviors (a quick demonstration follows below):
1. **Default behavior**: Open MPI counts the processor cores on the node and creates the same number of slots.
2. **With --use-hwthread-cpus**: Open MPI counts the hardware threads on the node and creates the same number of slots.
[mpirun(1) man page (version 4.1.6)](https://www.open-mpi.org/doc/current/man1/mpirun.1.php)
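A quick way to see the default on a given machine (a sketch, assuming Open MPI is installed locally): launch a trivial program without `-np` and count how many copies start.
```
# Without -np, Open MPI starts one process per slot, which by default
# is one per processor core on this node.
mpirun hostname | wc -l

# Counting hardware threads instead usually gives a larger number
# on machines with SMT/hyper-threading enabled.
mpirun --use-hwthread-cpus hostname | wc -l
```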
<br>
<br>
#### **Q3. How do I configure the number of slots?**
**A3.** Write a hostfile that sets how many slots each node may use.

**Oversubscription (oversubscribed): running more processes than the number of slots configured.**
<br>
"**Most MPI applications and HPC environments do not oversubscribe**; for simplicity, the majority of this documentation assumes that oversubscription is not enabled."
[mpirun(1) man page (version 4.1.6)](https://www.open-mpi.org/doc/current/man1/mpirun.1.php)
<br>
In the tensorflow-benchmarks example above, the hostfile on k8s is usually created inside a ConfigMap, as shown in the sketch below.

[reference](https://blog.kubeflow.org/integrations/operators/2020/03/16/mpi-operator.html)
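A hedged way to see this on a live cluster (the ConfigMap name below follows the `<job-name>-config` pattern used by recent mpi-operator releases; the exact name and layout may differ between versions):
```
# Inspect the ConfigMap generated for the job and the hostfile it contains,
# which lists the worker pods and their slots.
kubectl get configmap tensorflow-benchmarks-config -o yaml
```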
<br>
<br>
#### **Q4. How do I specify the number of processes for an mpirun job?**
**A4.**
**-np n**: run n processes in total.
**-npersocket n (--map-by ppr:n:socket)**: run n processes per socket, i.e. (n) * (number of sockets) processes on each node.
**-npernode n (--map-by ppr:n:node)**: run n processes on each node.

[Oracle, Sun HPC ClusterTools 8.2.1c Software User’s Guide](https://docs.oracle.com/cd/E19708-01/821-1319-10/ExecutingPrograms.html)
You can also pick which hosts from the hostfile run the job:
```
$ cat myhostfile
aa slots=2
bb slots=2
cc slots=2
$ mpirun -hostfile myhostfile -host aa ./a.out
```
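Combining the flags above, a small sketch with the same hostfile (semantics as in the cited man page): restrict the job to two hosts and place one process on each.
```
$ mpirun -hostfile myhostfile -host aa,bb -npernode 1 ./a.out
```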
<br>
<br>
#### **Q5. How do I view the results of a job run?**
**A5.**
(to be verified)
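A reasonable sketch in the meantime, consistent with the tensorflow-benchmarks walkthrough above: read the job-level status from the MPIJob object and the program output from the launcher pod's logs.
```
# Job-level status (conditions such as Created / Running / Succeeded).
kubectl get mpijob tensorflow-benchmarks -o yaml

# Program output is written to the launcher pod's stdout.
PODNAME=$(kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks,training.kubeflow.org/job-role=launcher -o name)
kubectl logs ${PODNAME}
```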
<br>
<br>
### [Official FAQ: Running MPI jobs](https://www.open-mpi.org/faq/?category=running#slots-without-hostfiles)