Parallel Programming Assignment IV
===
## Q1
### 1. How do you control the number of MPI processes on each node?
In `hosts`, append `slots=n` after each hostname, where `n` is the number of processes allowed on that host. (Example below.)
```
# hosts
pp1 slots=2
pp2 slots=2
pp3 slots=2
pp4 slots=2
```
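The slot limits take effect when the job is launched with this hostfile; with Open MPI, for example, `mpirun -np 8 --hostfile hosts ./program` (the executable name is a placeholder) places at most two processes on each of the four nodes above.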
### 2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
Use `MPI_Comm_rank()` to retrieve the rank of the calling process and `MPI_Comm_size()` to retrieve the total number of processes. (Example below.)
```c
...
// Get the total number of processes in the communicator
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
// Get the rank of the calling process
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
...
```
## Q2
### 1. Why are MPI_Send and MPI_Recv called “blocking” communication?
`MPI_Send` does not return until its send buffer can be safely reused, which may mean waiting until the matching `MPI_Recv` has started taking the message; likewise, `MPI_Recv` does not return until the message has been fully received. Because each call can stall the caller until the communication completes, they are called `blocking communication`.
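A minimal sketch of this behaviour, assuming at least two processes; the program and values are illustrative, not the assignment code:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;
    if (rank == 0) {
        /* Returns only once `value` may be safely overwritten, i.e. the
           message has been buffered or handed to the matching receive. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Returns only after the message has fully arrived in `value`. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```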
### 2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
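One common way to take these measurements is to bracket the measured region with `MPI_Wtime()`; a minimal sketch, assuming the workload sits where the placeholder comment is:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    double start = MPI_Wtime();
    /* ... the communication/computation being measured ... */
    double elapsed = MPI_Wtime() - start;

    if (world_rank == 0)
        printf("elapsed: %f s\n", elapsed);  /* one sample per run */
    MPI_Finalize();
    return 0;
}
```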

## Q3
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.

### 2. How does the performance of binary tree reduction compare to the performance of linear reduction?
From the measurements, binary tree reduction performs better than linear reduction.
### 3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.
Comparing binary tree reduction with linear reduction, **the execution time of tree reduction drops more sharply as the number of MPI processes increases**. Linear reduction makes rank 0 receive and add all P-1 partial sums itself, so the reduction costs O(P) sequential messages and rank 0 becomes the bottleneck; binary tree reduction spreads the additions over different processes, pairing them up so the whole reduction completes in O(log P) message steps.
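A minimal sketch of the tree pattern, assuming `long long` partial sums (the stand-in value replaces the real per-process result):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    long long local_sum = world_rank + 1;  /* stand-in partial result */

    /* Pairwise tree reduction: in each round, half of the remaining
       processes send their partial sum to a partner and drop out, so
       only O(log P) rounds are needed instead of P-1 serial receives. */
    for (int step = 1; step < world_size; step *= 2) {
        if (world_rank % (2 * step) == step) {
            MPI_Send(&local_sum, 1, MPI_LONG_LONG, world_rank - step, 0,
                     MPI_COMM_WORLD);
            break;                          /* this process is finished */
        } else if (world_rank % (2 * step) == 0 &&
                   world_rank + step < world_size) {
            long long partner;
            MPI_Recv(&partner, 1, MPI_LONG_LONG, world_rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local_sum += partner;
        }
    }
    if (world_rank == 0)
        printf("total = %lld\n", local_sum);  /* sum of 1..P here */
    MPI_Finalize();
    return 0;
}
```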
## Q4
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

### 2. What are the MPI functions for non-blocking communication?
**Non-blocking send:**
```c
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request);
```
**Non-blocking receive:**
```c
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request);
```
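Both calls return immediately and hand back an `MPI_Request`; completion is later established with `MPI_Wait()` or `MPI_Test()`. A minimal usage sketch, assuming two processes (the transferred value is illustrative):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    MPI_Request request = MPI_REQUEST_NULL;
    if (rank == 0) {
        value = 42;
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
    }

    /* Both calls return immediately, so other work can overlap the
       transfer here -- as long as `value` is not touched yet. */

    MPI_Wait(&request, MPI_STATUS_IGNORE);  /* communication complete */
    if (rank == 1)
        printf("rank 1 received %d\n", value);
    MPI_Finalize();
    return 0;
}
```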
### 3. How does the performance of non-blocking communication compare to the performance of blocking communication? (3 points)
Non-blocking communication performs slightly better than blocking communication. (Comparison plot below.)

## Q5
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

## Q6
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

## Q7
### 1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.
Because the data-transfer time exceeds the computation time, I split the rows of `a_mat` among the `world_size` processes: rank 0 (the master process) only sends data, then receives and prints the results, while the non-master processes each compute an interleaved (round-robin) subset of `a_mat`'s rows. `b_mat` is sent in full to every process. (Example trace below; a code sketch follows it.)
```
ex (indices are 0-based (row, col)):
input:
3 2 3
1 2
0 3
4 5
1 2 1
0 3 2
a_mat:
1 2
0 3
4 5
b_mat:
1 2 1
0 3 2
size = 3;
rank = 0, 1, 2;
c_mat = a_mat * b_mat
rank=0:
send a_mat(0, 0), a_mat(0, 1) to rank_1
send a_mat(1, 0), a_mat(1, 1) to rank_2
send a_mat(2, 0), a_mat(2, 1) to rank_1
send b_mat to rank_1, rank_2
receive c_mat(0, 0), c_mat(0, 1), c_mat(0, 2) from rank_1
receive c_mat(1, 0), c_mat(1, 1), c_mat(1, 2) from rank_2
receive c_mat(2, 0), c_mat(2, 1), c_mat(2, 2) from rank_1
rank=1:
receive a_mat(0, 0), a_mat(0, 1) from rank_0
receive a_mat(2, 0), a_mat(2, 1) from rank_0
receive b_mat from rank_0
calculate c_mat(0, 0), c_mat(0, 1), c_mat(0, 2)
calculate c_mat(2, 0), c_mat(2, 1), c_mat(2, 2)
rank=2:
receive a_mat(1, 0), a_mat(1, 1) from rank_0
receive b_mat from rank_0
calculate c_mat(1, 0), c_mat(1, 1), c_mat(1, 2)
```
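A compact sketch of this distribution, using the tiny matrices from the trace above; the variable names and the use of `MPI_Bcast` for `b_mat` are illustrative assumptions rather than the submitted code, and it must be run with at least two processes:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 3, m = 2, l = 3;                /* a_mat: n x m, b_mat: m x l */
    int a_mat[] = {1, 2, 0, 3, 4, 5};       /* hardcoded here for brevity */
    int b_mat[] = {1, 2, 1, 0, 3, 2};
    int workers = size - 1;                 /* rank 0 is the master */

    /* Every process gets the whole b_mat (in the real program only
       rank 0 would hold the input at this point). */
    MPI_Bcast(b_mat, m * l, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int c_mat[9];
        for (int r = 0; r < n; r++)         /* row r goes to worker 1 + r % workers */
            MPI_Send(&a_mat[r * m], m, MPI_INT, 1 + r % workers, r, MPI_COMM_WORLD);
        for (int r = 0; r < n; r++)         /* collect row r from the same worker */
            MPI_Recv(&c_mat[r * l], l, MPI_INT, 1 + r % workers, r,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < l; c++)
                printf("%d ", c_mat[r * l + c]);
            printf("\n");
        }
    } else {
        int my_rows[3][2], my_out[3], count = 0;
        /* Receive every assigned (interleaved) row first, ... */
        for (int r = rank - 1; r < n; r += workers)
            MPI_Recv(my_rows[count++], m, MPI_INT, 0, r,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... then compute each output row and send it back. */
        count = 0;
        for (int r = rank - 1; r < n; r += workers, count++) {
            for (int c = 0; c < l; c++) {   /* one output row = A row x b_mat */
                my_out[c] = 0;
                for (int k = 0; k < m; k++)
                    my_out[c] += my_rows[count][k] * b_mat[k * l + c];
            }
            MPI_Send(my_out, l, MPI_INT, 0, r, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```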