# Parallel Programming @ NYCU - HW4
#### **`0716221 余忠旻`**
## <font color="#006000">Q1</font>
### <font color="#3CB371">1. How do you control the number of MPI processes on each node?</font>
`mpirun -np <processes> <executable>` sets the total number of MPI processes.
For Q1, running `mpi_hello` with
`mpirun -np 8 --hostfile hosts mpi_hello` executes `mpi_hello` in parallel with 8 processes.
:::info
Adding the **-npernode <processes per node>** flag controls
how many processes run on each node.
Alternatively, append **slots = n** after a hostname in the `hostfile`
to declare that that node may run n processes.
:::
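For example (a sketch; the hostnames and counts below are placeholders, not from the assignment):

```shell
# hosts: each line names a node and how many slots (processes) it offers
#   pp1 slots=4
#   pp2 slots=4

# 8 processes total, capped at 4 per node
mpirun -np 8 -npernode 4 --hostfile hosts mpi_hello
```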
### <font color="#3CB371">2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?</font>
**Retrieve the rank of an MPI process:**
```cpp=
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
```
**Retrieve the total number of processes:**
```cpp=
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
```
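Put together, a minimal program using both calls looks like the sketch below (the usual pattern, compiled with an MPI wrapper such as `mpicxx` and launched with `mpirun`):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id: 0..size-1
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```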
___
## <font color="#006000">Q2. pi_block_linear.cc</font>
### <font color="#3CB371">1. Why MPI_Send and MPI_Recv are called “blocking” communication?</font>
:::info
Because both functions return only after the **communication is complete**
(for `MPI_Send`, once the send buffer is safe to reuse; for `MPI_Recv`, once the message has arrived). Until the call returns, the program cannot continue past it.
:::
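A sketch of how the blocking calls typically appear in the linear version (variable names are illustrative, not the exact assignment code): each worker sends its toss count with `MPI_Send`, and rank 0 blocks in `MPI_Recv` once per worker before summing.

```cpp
// Sketch: linear reduction of per-process hit counts with blocking calls.
long long local_hits = /* Monte Carlo tosses landing inside the circle */;
if (rank == 0) {
    long long total = local_hits, recv_hits;
    for (int src = 1; src < size; ++src) {
        // Blocks until the message from rank `src` has arrived.
        MPI_Recv(&recv_hits, 1, MPI_LONG_LONG, src, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        total += recv_hits;
    }
} else {
    // Blocks until the send buffer is safe to reuse.
    MPI_Send(&local_hits, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD);
}
```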
### <font color="#3CB371">2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time (s)|6.300787|3.282009|1.636781|1.100735|0.822144|

___
## <font color="#006000">Q3. pi_block_tree.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- |
|time (s)|6.390032|3.328851|1.657919|0.862404|

<br>
### <font color="#3CB371">2. How does the performance of binary tree reduction compare to the performance of linear reduction?</font>

As the comparison plot above shows, the two perform about the same,
with `pi_block_linear.cc` slightly faster than `pi_block_tree.cc`.
<br>
### <font color="#3CB371">3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.</font>

For p processes, both algorithms transfer the same total number of messages (p - 1), so the message count alone does not separate them. The difference lies in how the messages are scheduled. In linear reduction the root performs its p - 1 receives one after another, while tree reduction pairs processes **recursively / bottom-up** and accumulates partial sums in rounds that proceed in parallel, so it needs only ceil(log2 p) sequential rounds. At the small process counts measured here, the tree version's extra pairing logic and staged accumulation make it slightly slower; but as the number of processes grows, the tree approach is expected to perform better, since O(log p) communication rounds scale better than the root's O(p) sequential receives.
___
## <font color="#006000">Q4. pi_nonblock_linear.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time (s)|6.221352|3.257266|1.623923|1.089929|0.816641|

<br>
### <font color="#3CB371">2. What are the MPI functions for non-blocking communication?</font>
:::info
* MPI_Isend
* MPI_Irecv
* MPI_Wait / MPI_Waitall / MPI_Waitany
* MPI_Test / MPI_Testall / MPI_Testany
:::
A non-blocking MPI call returns immediately, without waiting for the communication to complete, so the program can keep executing. Compared with blocking calls this saves time, and using the saved time for computation reduces the program's total execution time.
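A sketch of the typical non-blocking pattern on the MASTER side (illustrative names such as `MAX_WORKERS` and `do_local_tosses` are my own, not the assignment code): all receives are posted at once with `MPI_Irecv`, local work proceeds in the meantime, and `MPI_Waitall` completes them.

```cpp
// Sketch: post all receives up front, overlap with local work, then wait.
long long counts[MAX_WORKERS];
MPI_Request reqs[MAX_WORKERS];
for (int src = 1; src < size; ++src) {
    MPI_Irecv(&counts[src], 1, MPI_LONG_LONG, src, 0,
              MPI_COMM_WORLD, &reqs[src]);       // returns immediately
}
long long total = do_local_tosses();             // overlap: rank 0's own share
MPI_Waitall(size - 1, &reqs[1], MPI_STATUSES_IGNORE);  // counts[] now valid
for (int src = 1; src < size; ++src) total += counts[src];
```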
<br>
### <font color="#3CB371">3. How the performance of non-blocking communication compares to the performance of blocking communication?</font>

As the comparison plot above shows, the two perform about the same,
with `pi_nonblock_linear.cc` slightly faster than `pi_block_linear.cc`.
My guess is that the non-blocking version cannot fully exploit the time it saves: most of the time in this program goes to random-number generation, and each process must finish generating its tosses before it knows the count to pass to `MPI_Isend()`. With little computation left to overlap with communication, the advantage of non-blocking calls is largely unused.
___
## <font color="#006000">Q5. pi_gather.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time (s)|6.185464|3.231177|1.639481|1.106469|0.839814|

___
## <font color="#006000">Q6. pi_reduce.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time (s)|6.545612|3.416387|1.719804|1.149395|0.879354|

___
## <font color="#006000">Q7. pi_one_side.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time (s)|6.631016|3.475609|1.790234|1.206102|0.907661|

<br>
### <font color="#3CB371">2. Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that?</font>

With 2 or 4 MPI processes, `pi_gather.cc` runs fastest.
With 8, 12, or 16 MPI processes, `pi_nonblock_linear.cc` runs fastest.
When the process count is small, `MPI_Gather` collects the results computed by the processes spread across the nodes in parallel, which is slightly faster. MPI implementations typically select among several `MPI_Gather` algorithms depending on the communicator size, and the algorithm chosen for small communicators appears to be the more efficient one here.
When the process count is larger, the linear reduction approach fits this assignment better: the work is distributing toss computation to the WORKER processes, and the MASTER must wait for everyone's count before it can compute pi. This **many senders, one receiver** structure is exactly what the linear reduction in `pi_nonblock_linear.cc` implements, so it runs fastest.
___
## <font color="#006000">Q8</font>
### <font color="#3CB371">1. Plot ping-pong time in function of the message size for cases 1 and 2, respectively.</font>
* <font size=5>case 1</font>


<br>
* <font size=5>case 2</font>


<br>
### <font color="#3CB371">2. Calculate the bandwidth and latency for cases 1 and 2, respectively.</font>
:::success
T(n) = s + rn  (n: message size in bytes, T: seconds)
latency = s
:::
* <font size=5>case 1</font>
T(n) = 1.59389970869439E-10 n + 0.000113824934821575
latency = 0.000113824934821575 s ≈ 0.114 ms
bandwidth = 1 / 1.59389970869439E-10 B/s ≈ 6.274 GB/s
* <font size=5>case 2</font>
T(n) = 8.65344159019972E-9 n + 0.000216304976804107
latency = 0.000216304976804107 s ≈ 0.216 ms
bandwidth = 1 / 8.65344159019972E-9 B/s ≈ 0.116 GB/s
___
## <font color="#006000">Q9</font>
### <font color="#3CB371">1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.</font>

As the figure above shows, multiplying a row of A by the corresponding column of B yields the corresponding element of C.
We therefore split the processes into two roles, MASTER and WORKER, corresponding to `rank = 0` and `rank > 0`.
**MASTER** handles load balancing: it splits the rows of matrix A evenly into chunks, sends each WORKER the information it needs for its computation, then waits for the WORKERs' answers to come back and prints the result.
```
Information the MASTER sends to each WORKER:
- the dimensions of matrices A and B
- rows: how many rows the WORKER must compute
- offset: where the WORKER's result goes in matrix C
- the WORKER's partial matrix of A, and the whole matrix B
```
**WORKER** receives this information from the MASTER, computes its share of the answer, and finally sends back its share `rows`, the position `offset` of its result within matrix C, and the computed `C partial matrix` to the MASTER.
___