---
tags: para., MPI, distributed system, message passing
---
# MPI Programming
## Q1: hello world
### 1. How do you control the number of MPI processes on each node?
* Use the `-npernode` flag to set how many processes run on each node, as in the command below.
```
$ mpirun -np <# of processes> -npernode <# of processes per node> --hostfile <hostfile> myprog
```
### 2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
Retrieving the rank:
```
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
```
Retrieving the total number of processes:
```
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
```
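Putting the two calls together, a minimal hello-world sketch could look like this (not necessarily the assignment's exact code):
```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // rank of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    printf("Hello world from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```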
### 3. We use Open MPI for this assignment. What else MPI implementation is commonly used? What is the difference between them?
* Besides Open MPI, Intel MPI is another commonly used MPI implementation.

| | Open MPI | Intel MPI |
|:--- |:------------:|:---------------------:|
| Pros | Broad hardware compatibility | Tuned for Intel's own products, generally better performance |
| Cons | Generally lower performance | Mainly targets Intel's own hardware |
<br><br>
## Q2: block_linear
### 1. Why MPI_Send and MPI_Recv are called “blocking” communication? (1 points)
* After calling MPI_Send, the program cannot continue until the message has been taken by a receiver (or at least safely buffered); likewise, MPI_Recv does not return until a matching message has actually been received. Because both calls hold up execution until the communication completes, they are called "blocking" communication, as sketched below.
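A minimal sketch of the blocking linear reduction (the function name `reduce_linear_blocking` is mine, and the assignment code may differ in details):
```
#include <mpi.h>

// Sketch: blocking linear reduction of every rank's local count to rank 0.
// Assumes `local_count` has already been computed on each rank.
long long reduce_linear_blocking(long long local_count, int rank, int size) {
    long long total = local_count;
    if (rank == 0) {
        for (int src = 1; src < size; ++src) {
            long long tmp;
            // MPI_Recv blocks until the message from `src` has arrived.
            MPI_Recv(&tmp, 1, MPI_LONG_LONG, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += tmp;
        }
    } else {
        // MPI_Send blocks until the send buffer can safely be reused.
        MPI_Send(&local_count, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD);
    }
    return total;  // only meaningful on rank 0
}
```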
### 2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
The execution time decreases as the total number of processes increases.

<br><br>
## Q3: block_tree
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it. (1 points)
* The measured execution times are shown in the figure below.

### 2. How does the performance of binary tree reduction compare to the performance of linear reduction? (2 points)
The measured results are summarized in the table below. At this problem size the two approaches show no significant difference, but as question 3 below shows, once the problem is scaled up, the linear version gains the advantage when the number of processes is large.
| method\NP | 2 | 4 | 8 | 16 |
| --------- | ----- | ----- | ----- | ----- |
| linear | 6.72s | 3.47s | 1.73s | 0.92s |
| tree | 6.74s | 3.47s | 1.74s | 0.89s |
### 3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs. (3 points)
* After increasing the number of tosses to 100 times the amount used above, linear and tree perform as shown in the figure below.
* With 2 and 4 processes the tree version performs better, but with 8 and 16 processes the result is reversed.
* As the number of processes grows, the tree version has to go through more rounds of its loop to exchange messages, whereas the linear version only uses a single receive loop at the end, so the tree version spends more on communication overall (a sketch of the tree pattern follows the figure).

<br><br>
## Q4: non-blocking linear
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
* The measured execution times are shown in the figure below.

### 2. What are the MPI functions for non-blocking communication? (1 points)
For non-blocking communication, `MPI_Isend()` sends a message and `MPI_Irecv()` receives one; completion is then checked with `MPI_Wait()`/`MPI_Waitall()` (or `MPI_Test()`).
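A minimal sketch of how the linear reduction could be written with these calls (function name mine; rank 0 posts all receives up front and completes them with `MPI_Waitall`):
```
#include <mpi.h>
#include <stdlib.h>

// Sketch: rank 0 posts all receives at once and then waits for them together.
long long reduce_nonblocking_linear(long long local_count, int rank, int size) {
    long long total = local_count;
    if (rank == 0) {
        long long *tmp = malloc((size - 1) * sizeof(long long));
        MPI_Request *reqs = malloc((size - 1) * sizeof(MPI_Request));
        for (int src = 1; src < size; ++src)
            MPI_Irecv(&tmp[src - 1], 1, MPI_LONG_LONG, src, 0,
                      MPI_COMM_WORLD, &reqs[src - 1]);
        MPI_Waitall(size - 1, reqs, MPI_STATUSES_IGNORE);  // complete all receives
        for (int i = 0; i < size - 1; ++i)
            total += tmp[i];
        free(tmp);
        free(reqs);
    } else {
        MPI_Request req;
        MPI_Isend(&local_count, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  // send buffer is reusable after this
    }
    return total;  // only meaningful on rank 0
}
```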
### 3. How the performance of non-blocking communication compares to the performance of blocking communication? (3 points)
From the results, non-blocking communication performs about the same as blocking communication in this task, but overall it is still slightly better.

<br><br>
## Q5: gather
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
* The measured execution times are shown in the figure below.

<br><br>
## Q6: Reduce
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
* The measured execution times are shown in the figure below.

<br><br>
## Q7: one-side communication
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
* The measured execution times are shown in the figure below.

### 2. Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that? (3 points)
* Looking at the table below, nonblock_linear performs best, reduce comes second, and one-side is clearly the worst (a sketch of the one-sided approach follows the table).
* I suspect that MPI_Irecv + MPI_Wait can be slightly faster than MPI_Recv even though they look equivalent: not having to block inside the receive call itself still saves a little time, so even if the gap is small, the non-blocking version usually comes out slightly ahead.
| method\NP | 2 | 4 | 8 | 12 | 16 |
|:--------------- |:----------- |:----------- |:----------- |:----------- |:----------- |
| block_linear | 6.74 | 3.47 | 1.74 | 1.17 | 0.89 |
| block_tree | <b>6.72</b> | 3.47 | <b>1.73</b> | | 0.92 |
| nonblock_linear | <b>6.72</b> | <b>3.46</b> | <b>1.73</b> | 1.19 | <b>0.88</b> |
| gather | 6.74 | 3.47 | 1.75 | <b>1.16</b> | 0.90 |
| reduce | <b>6.72</b> | <b>3.46</b> | 1.74 | <b>1.16</b> | 0.89 |
| one-side | 13.45 | 6.93 | 3.49 | 2.33 | 1.78 |
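For reference, the one-sided version presumably exposes a window on rank 0 and accumulates into it; a minimal sketch of that structure (function name mine, not the exact assignment code):
```
#include <mpi.h>

// Sketch: rank 0 exposes one long long; every rank adds its local count
// into it with one-sided MPI_Accumulate between two fences.
long long reduce_one_sided(long long local_count, int rank) {
    long long total = 0;
    MPI_Win win;
    MPI_Win_create(rank == 0 ? &total : NULL,
                   rank == 0 ? sizeof(long long) : 0,
                   sizeof(long long), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                          // open the access epoch
    MPI_Accumulate(&local_count, 1, MPI_LONG_LONG,
                   0, 0, 1, MPI_LONG_LONG, MPI_SUM, win);
    MPI_Win_fence(0, win);                          // close the epoch; total is valid
    MPI_Win_free(&win);
    return total;                                   // only meaningful on rank 0
}
```
The two window fences act as global synchronization points, which is one plausible reason the one-sided version carries more overhead here than the two-sided versions.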
### 3. Which algorithm or algorithms do MPI implementations use for reduction operations? You can research this on the WEB focusing on one MPI implementation. (1 points)
* According to the reference below, MPI implementations use rank-order-based deterministic algorithms for reduction operations.
* Since experimental results usually have to be validated over and over, a fixed, deterministic reduction order keeps these routine repeated runs reproducible.
* ref: [On the Reproducibility of MPI Reduction Operations](https://www.mcs.anl.gov/papers/P4093-0713_1.pdf)
<br><br>
## Q8: Ping-Pong
> case 1: intra-node communication (two processes on the same node), and
> case 2: inter-node communication (two processes on different nodes). You should consider more different nodes in this case.
### 1. Plot ping-pong time in function of the message size for cases 1 and 2, respectively. (2 points)
Case 1 command: `mpicxx ping_pong.c -o ping_pong; mpirun -np 2 -npernode 2 --hostfile hosts ping_pong`

Case 2 command: `mpicxx ping_pong.c -o ping_pong; mpirun -np 2 -npernode 1 --hostfile hosts ping_pong`
(For this case the two processes were placed on different nodes, and four different nodes were used.)

### 2. Calculate the bandwidth and latency for cases 1 and 2, respectively. (3 points)
Bandwidth is the reciprocal of the slope and latency is the y-intercept of the line fitted to ping-pong time versus message size, i.e. t(n) ≈ latency + n / bandwidth for an n-byte message.
case1:
bandwidth: 6723345886.513239 ~= 6.72 GB/s
latency: 0.0007381264050581177 ~= 7.381 * 10^-4 s
case2:
bandwidth: 97132541.8124317 ~= 0.97 GB/s
latency: 0.0011550466383125624 ~= 1.155 * 10^-3 s
### 3. For case 2, how do the obtained values of bandwidth and latency compare to the nominal network bandwidth and latency of the NCTU-PP workstations. What are the differences and what could be the explanation of the differences if any? (4 points)
I used the `ping -U 192.168.110.91` command to estimate the bandwidth to pp6; the ping output is shown in the figure below.

The average round-trip time is 0.5235 ms.
Nominal bandwidth of the NCTU-PP workstations = 64*8 / 0.5235 = 978.032 ~= 0.978 GB/s
Nominal latency of the NCTU-PP workstations = 0.0005235 s ~= 5.2 * 10^-4 s
I think the **bandwidth and latency computed with the ping-pong program are already quite close to the nominal values of NCTU-PP**. As for factors that could make the numbers differ: the ping-pong measurement takes a long time, so if other students used the workstation while it was running, that could introduce error into the measurement.
<br><br>
## Q9: Part Two: Matrix Multiplication with MPI
### 1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.
* On the MPI side, rank 0 first reads the input data and then distributes it to the other processes with non-blocking sends. The output matrix is partitioned by rows among the processes, each process computes its own rows, and the results are sent back to rank 0 with non-blocking sends and assembled there.
* For the multiplication of matrix A by matrix B, I transpose B so that the elements are read from memory contiguously during the computation, which speeds it up; this made the computation on data2 almost 2 times faster.
* I also use SIMD with 256-bit AVX registers, which process eight integers (int) at a time and give a straightforward speed-up (a sketch of such a vectorized inner loop follows the code block below).
* The multiplication loop is written in a `cache-friendly` order, which reduces the number of memory reads and also helps performance; the algorithm is as follows:
```
// With this loop order, a[i][k] is reused across the whole inner loop and
// both dst and b are accessed with stride 1.
for (int i = 0; i < a.row; ++i)
    for (int k = 0; k < a.col; ++k) {
        int r = a.values[i][k];
        for (int j = 0; j < b.col; ++j)
            dst->values[i][j] += r * b.values[k][j];
    }
```
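To illustrate the SIMD idea mentioned above, here is a sketch of a vectorized inner loop (the function name `axpy_row_int` is mine, and it assumes AVX2 is available, since the 256-bit integer multiply `_mm256_mullo_epi32` is an AVX2 instruction):
```
#include <immintrin.h>

// Sketch: dst_row[j] += r * b_row[j] for 8 ints at a time, with a scalar
// tail loop for the remainder when n is not a multiple of 8.
void axpy_row_int(int *dst_row, const int *b_row, int r, int n) {
    __m256i vr = _mm256_set1_epi32(r);              // broadcast a[i][k]
    int j = 0;
    for (; j + 8 <= n; j += 8) {
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b_row + j));
        __m256i vd = _mm256_loadu_si256((const __m256i *)(dst_row + j));
        vd = _mm256_add_epi32(vd, _mm256_mullo_epi32(vr, vb));
        _mm256_storeu_si256((__m256i *)(dst_row + j), vd);
    }
    for (; j < n; ++j)                              // scalar tail
        dst_row[j] += r * b_row[j];
}
```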
Finally, the test results on data1 and data2 are as follows:
> data1: 0.1897 s
> data2: 3.8303 s
data1:
 
MPI notes reference: http://cic-2012.pccu.edu.tw/HPC/pdf/MPI2.pdf
Matrix multiplication reference: https://hackmd.io/@siahuat0727/HJscCbWgV?type=view#SIMD-AVX