---
tags: para., MPI, distributed system, message passing
---
# MPI Programming
## Q1: hello world
### 1. How do you control the number of MPI processes on each node?
* Use the `-npernode` flag to set how many processes run on each node, as in the command below.
```
$ mpirun -np <# of processes> -npernode <# of processes per node> --hostfile <hostfile> myprog
```
### 2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
Retrieving the rank:
```
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
```
Retrieving the total number of processes:
```
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
```
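Putting the two calls together, a minimal hello-world sketch could look like this (not necessarily the assignment's exact code):
```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // rank of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    printf("Hello world from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```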
### 3. We use Open MPI for this assignment. What else MPI implementation is commonly used? What is the difference between them?
* Besides Open MPI, Intel MPI is another commonly used MPI implementation.

| | Open MPI | Intel MPI |
|:--- |:------------:|:---------------------:|
| Pros | Broad hardware compatibility | Tuned for Intel's own products, generally better performance |
| Cons | Generally lower performance | Mainly targets Intel's own hardware |
<br><br>
## Q2: block_linear
### 1. Why MPI_Send and MPI_Recv are called “blocking” communication? (1 points)
* After calling MPI_Send, the program cannot continue until the message has been taken by a receiver (or at least safely buffered); likewise, MPI_Recv does not return until a matching message has actually been received. Because both calls hold up execution until the communication completes, they are called "blocking" communication, as sketched below.
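A minimal sketch of the blocking linear reduction (the function name `reduce_linear_blocking` is mine, and the assignment code may differ in details):
```
#include <mpi.h>

// Sketch: blocking linear reduction of every rank's local count to rank 0.
// Assumes `local_count` has already been computed on each rank.
long long reduce_linear_blocking(long long local_count, int rank, int size) {
    long long total = local_count;
    if (rank == 0) {
        for (int src = 1; src < size; ++src) {
            long long tmp;
            // MPI_Recv blocks until the message from `src` has arrived.
            MPI_Recv(&tmp, 1, MPI_LONG_LONG, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += tmp;
        }
    } else {
        // MPI_Send blocks until the send buffer can safely be reused.
        MPI_Send(&local_count, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD);
    }
    return total;  // only meaningful on rank 0
}
```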
### 2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
The execution time decreases as the total number of processes increases.

<br><br>
## Q3: block_tree
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it. (1 points)
* The measured execution times are shown in the figure below.

### 2. How does the performance of binary tree reduction compare to the performance of linear reduction? (2 points)
The measured results are summarized in the table below. At this problem size the two approaches show no significant difference, but as question 3 below shows, once the problem is scaled up, the linear version gains the advantage when the number of processes is large.
| method\NP | 2 | 4 | 8 | 16 |
| --------- | ----- | ----- | ----- | ----- |
| linear | 6.72s | 3.47s | 1.73s | 0.92s |
| tree | 6.74s | 3.47s | 1.74s | 0.89s |
### 3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs. (3 points)
* After increasing the number of tosses to 100 times the amount used above, linear and tree perform as shown in the figure below.
* With 2 and 4 processes the tree version performs better, but with 8 and 16 processes the result is reversed.
* As the number of processes grows, the tree version has to go through more rounds of its loop to exchange messages, whereas the linear version only uses a single receive loop at the end, so the tree version spends more on communication overall (a sketch of the tree pattern follows the figure).

<br><br>
## Q4: non-blocking linear
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
* The measured execution times are shown in the figure below.

### 2. What are the MPI functions for non-blocking communication? (1 points)
For non-blocking communication, `MPI_Isend()` sends a message and `MPI_Irecv()` receives one; completion is then checked with `MPI_Wait()`/`MPI_Waitall()` (or `MPI_Test()`).
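A minimal sketch of how the linear reduction could be written with these calls (function name mine; rank 0 posts all receives up front and completes them with `MPI_Waitall`):
```
#include <mpi.h>
#include <stdlib.h>

// Sketch: rank 0 posts all receives at once and then waits for them together.
long long reduce_nonblocking_linear(long long local_count, int rank, int size) {
    long long total = local_count;
    if (rank == 0) {
        long long *tmp = malloc((size - 1) * sizeof(long long));
        MPI_Request *reqs = malloc((size - 1) * sizeof(MPI_Request));
        for (int src = 1; src < size; ++src)
            MPI_Irecv(&tmp[src - 1], 1, MPI_LONG_LONG, src, 0,
                      MPI_COMM_WORLD, &reqs[src - 1]);
        MPI_Waitall(size - 1, reqs, MPI_STATUSES_IGNORE);  // complete all receives
        for (int i = 0; i < size - 1; ++i)
            total += tmp[i];
        free(tmp);
        free(reqs);
    } else {
        MPI_Request req;
        MPI_Isend(&local_count, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  // send buffer is reusable after this
    }
    return total;  // only meaningful on rank 0
}
```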
### 3. How the performance of non-blocking communication compares to the performance of blocking communication? (3 points)
From the results, non-blocking communication performs about the same as blocking communication in this task, but overall it is still slightly better.

<br><br>
## Q5: gather
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
* The measured execution times are shown in the figure below.

<br><br>
## Q6: Reduce
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
* The measured execution times are shown in the figure below.

<br><br>
## Q7: one-side communication
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
* The measured execution times are shown in the figure below.

### 2. Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that? (3 points)
* Looking at the table below, nonblock_linear performs best, reduce comes second, and one-side is clearly the worst (a sketch of the one-sided approach follows the table).
* I suspect that MPI_Irecv + MPI_Wait can be slightly faster than MPI_Recv even though they look equivalent: not having to block inside the receive call itself still saves a little time, so even if the gap is small, the non-blocking version usually comes out slightly ahead.
| method\NP | 2 | 4 | 8 | 12 | 16 |
|:--------------- |:----------- |:----------- |:----------- |:----------- |:----------- |
| block_linear | 6.74 | 3.47 | 1.74 | 1.17 | 0.89 |
| block_tree | <b>6.72</b> | 3.47 | <b>1.73</b> | | 0.92 |
| nonblock_linear | <b>6.72</b> | <b>3.46</b> | <b>1.73</b> | 1.19 | <b>0.88</b> |
| gather | 6.74 | 3.47 | 1.75 | <b>1.16</b> | 0.90 |
| reduce | <b>6.72</b> | <b>3.46</b> | 1.74 | <b>1.16</b> | 0.89 |
| one-side | 13.45 | 6.93 | 3.49 | 2.33 | 1.78 |
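For reference, the one-sided version presumably exposes a window on rank 0 and accumulates into it; a minimal sketch of that structure (function name mine, not the exact assignment code):
```
#include <mpi.h>

// Sketch: rank 0 exposes one long long; every rank adds its local count
// into it with one-sided MPI_Accumulate between two fences.
long long reduce_one_sided(long long local_count, int rank) {
    long long total = 0;
    MPI_Win win;
    MPI_Win_create(rank == 0 ? &total : NULL,
                   rank == 0 ? sizeof(long long) : 0,
                   sizeof(long long), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                          // open the access epoch
    MPI_Accumulate(&local_count, 1, MPI_LONG_LONG,
                   0, 0, 1, MPI_LONG_LONG, MPI_SUM, win);
    MPI_Win_fence(0, win);                          // close the epoch; total is valid
    MPI_Win_free(&win);
    return total;                                   // only meaningful on rank 0
}
```
The two window fences act as global synchronization points, which is one plausible reason the one-sided version carries more overhead here than the two-sided versions.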
### 3. Which algorithm or algorithms do MPI implementations use for reduction operations? You can research this on the WEB focusing on one MPI implementation. (1 points)
* According to the reference below, MPI implementations use rank-order-based deterministic algorithms for reduction operations.
* Since experimental results usually have to be validated over and over, a fixed, deterministic reduction order keeps these routine repeated runs reproducible.
* ref: [On the Reproducibility of MPI Reduction Operations](https://www.mcs.anl.gov/papers/P4093-0713_1.pdf)
<br><br>
## Q8: Ping-Pong
> case 1: intra-node communication (two processes on the same node), and
> case 2: inter-node communication (two processes on different nodes). You should consider more different nodes in this case.
### 1. Plot ping-pong time in function of the message size for cases 1 and 2, respectively. (2 points)
Case 1 command: `mpicxx ping_pong.c -o ping_pong; mpirun -np 2 -npernode 2 --hostfile hosts ping_pong`

Case 2 command: `mpicxx ping_pong.c -o ping_pong; mpirun -np 2 -npernode 1 --hostfile hosts ping_pong`
(For this case the two processes were placed on different nodes, and four different nodes were used.)

### 2. Calculate the bandwidth and latency for cases 1 and 2, respectively. (3 points)
Bandwidth is the reciprocal of the slope and latency is the y-intercept of the line fitted to ping-pong time versus message size, i.e. t(n) ≈ latency + n / bandwidth for an n-byte message.
case1:
bandwidth: 6723345886.513239 ~= 6.72 GB/s
latency: 0.0007381264050581177 ~= 7.381 * 10^-4 s
case2:
bandwidth: 97132541.8124317 ~= 0.97 GB/s
latency: 0.0011550466383125624 ~= 1.155 * 10^-3 s
### 3. For case 2, how do the obtained values of bandwidth and latency compare to the nominal network bandwidth and latency of the NCTU-PP workstations. What are the differences and what could be the explanation of the differences if any? (4 points)
I used the `ping -U 192.168.110.91` command to estimate the bandwidth to pp6; the ping output is shown in the figure below.

The average round-trip time is 0.5235 ms.
Nominal bandwidth of the NCTU-PP workstations = 64*8 / 0.5235 = 978.032 ~= 0.978 GB/s
Nominal latency of the NCTU-PP workstations = 0.0005235 s ~= 5.2 * 10^-4 s
I think the **bandwidth and latency computed with the ping-pong program are already quite close to the nominal values of NCTU-PP**. As for factors that could make the numbers differ: the ping-pong measurement takes a long time, so if other students used the workstation while it was running, that could introduce error into the measurement.
<br><br>
## Q9: Part Two: Matrix Multiplication with MPI
### 1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.
* On the MPI side, rank 0 first reads the input data and then distributes it to the other processes with non-blocking sends. The output matrix is partitioned by rows among the processes, each process computes its own rows, and the results are sent back to rank 0 with non-blocking sends and assembled there.
* For the multiplication of matrix A by matrix B, I transpose B so that the elements are read from memory contiguously during the computation, which speeds it up; this made the computation on data2 almost 2 times faster.
* I also use SIMD with 256-bit AVX registers, which process eight integers (int) at a time and give a straightforward speed-up (a sketch of such a vectorized inner loop follows the code block below).
* The multiplication loop is written in a `cache-friendly` order, which reduces the number of memory reads and also helps performance; the algorithm is as follows:
```
// With this loop order, a[i][k] is reused across the whole inner loop and
// both dst and b are accessed with stride 1.
for (int i = 0; i < a.row; ++i)
    for (int k = 0; k < a.col; ++k) {
        int r = a.values[i][k];
        for (int j = 0; j < b.col; ++j)
            dst->values[i][j] += r * b.values[k][j];
    }
```
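To illustrate the SIMD idea mentioned above, here is a sketch of a vectorized inner loop (the function name `axpy_row_int` is mine, and it assumes AVX2 is available, since the 256-bit integer multiply `_mm256_mullo_epi32` is an AVX2 instruction):
```
#include <immintrin.h>

// Sketch: dst_row[j] += r * b_row[j] for 8 ints at a time, with a scalar
// tail loop for the remainder when n is not a multiple of 8.
void axpy_row_int(int *dst_row, const int *b_row, int r, int n) {
    __m256i vr = _mm256_set1_epi32(r);              // broadcast a[i][k]
    int j = 0;
    for (; j + 8 <= n; j += 8) {
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b_row + j));
        __m256i vd = _mm256_loadu_si256((const __m256i *)(dst_row + j));
        vd = _mm256_add_epi32(vd, _mm256_mullo_epi32(vr, vb));
        _mm256_storeu_si256((__m256i *)(dst_row + j), vd);
    }
    for (; j < n; ++j)                              // scalar tail
        dst_row[j] += r * b_row[j];
}
```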
Finally, the test results on data1 and data2 are as follows:
> data1: 0.1897 s
> data2: 3.8303 s
data1:
 
MPI notes reference: http://cic-2012.pccu.edu.tw/HPC/pdf/MPI2.pdf
Matrix multiplication reference: https://hackmd.io/@siahuat0727/HJscCbWgV?type=view#SIMD-AVX