# Programming Assignment HW4
## Q1
> How do you control the number of MPI processes on each node? (3 points)
**Ans:**
Add `slots=N` after each host entry in the hostfile, where N is the maximum number of processes that host should run. In this assignment the maximum is 16.
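A minimal sketch of what this looks like, assuming Open MPI and hypothetical node names:

```
# hostfile — one line per node; slots caps how many processes
# mpirun may place on that node (node names are placeholders)
node1 slots=16
node2 slots=16
```

It would then be launched with something like `mpirun --hostfile hostfile -np 16 ./hw4`, and mpirun fills each node up to its slot count.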
> Which functions do you use for retrieving the rank of an MPI process and the total number of processes? (2 points)
**Ans:**
Retrieve the total number of processes:
```c=
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
```
Retrieve the rank of an MPI process:
```c=
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
```
## Q2
> Why MPI_Send and MPI_Recv are called “blocking” communication? (2 points)
**Ans:**
When `MPI_Send` is executed, it waits until the matching `MPI_Recv` in another process can take the data (or the message has been safely buffered), so the program stalls and cannot execute the following instructions. Likewise, `MPI_Recv` waits until the data sent by `MPI_Send` has actually arrived before continuing. Because both calls stall the caller until the communication completes, they are called "blocking" communication.
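A minimal sketch of this behavior, assuming the program is launched with exactly two processes:

```c
/* Blocking point-to-point exchange between rank 0 and rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        /* Does not return until the message is safely handed off. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Does not return until the data has arrived in `value`. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}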
> Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| Number of processes | Runtime (s) |
| -------- | -------- |
|2| 6.458812 |
|4| 3.292482 |
|8| 1.650646 |
|12| 1.129848 |
|16| 0.842075 |
## Q3
> Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it. (1 points)

| Number of processes | Runtime (s) |
| -------- | -------- |
|2| 6.499463 |
|4| 3.321890 |
|8| 1.673447 |
|16| 0.855035 |
> How does the performance of binary tree reduction compare to the performance of linear reduction? (3 points)
**Ans:**

I plotted the two sets of measurements together; the observed difference between them is negligible.
> Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs. (3 points)
**Ans:**
Conclusion: as the number of processes increases, either approach may win.
If the time spent in `MPI_Recv()` is large relative to each process's compute time, the linear reduction (the root performs P−1 receives one after another) takes longer, so the tree reduction (only ⌈log2 P⌉ communication rounds) is faster; otherwise the linear reduction can be faster. So with many processes the outcome depends on the ratio of communication cost to computation cost.
## Q4
> Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| Number of processes | Runtime (s) |
| -------- | -------- |
|2| 6.340269 |
|4| 3.277678 |
|8| 1.646690 |
|12| 1.129992 |
|16| 0.916773 |
> What are the MPI functions for non-blocking communication? (2 points)
**Ans:**
- `MPI_Isend()`: begins a non-blocking send
- `MPI_Irecv()`: begins a non-blocking receive
- `MPI_Wait()` / `MPI_Waitall()`: waits for one MPI request to complete / waits for all given MPI requests to complete
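A minimal usage sketch of these calls (assuming `rank` and a partner rank `peer` have already been set up with `MPI_Comm_rank`):

```c
/* Overlap communication with computation using non-blocking calls. */
MPI_Request reqs[2];
int send_buf = rank, recv_buf;

MPI_Isend(&send_buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(&recv_buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

/* ... independent computation can run here while messages are in flight ... */

/* Block only when the results are actually needed. */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
```

The potential speedup comes entirely from the work placed between the `MPI_Isend`/`MPI_Irecv` calls and the `MPI_Waitall`; with nothing to overlap, the behavior degenerates to the blocking case.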
> How the performance of non-blocking communication compares to the performance of blocking communication? (3 points)
**Ans:**
In my measurements the difference is small, probably because `MPI_Wait()` is still used: the final sum cannot be computed until every partial result has arrived, so there is little work to overlap with the communication. With 16 processes the non-blocking version even appears slightly slower.
## Q5
> Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| Number of processes | Runtime (s) |
| -------- | -------- |
|2| 6.344678 |
|4| 3.277678 |
|8| 1.648714 |
|12| 1.116314 |
|16| 0.833355 |
## Q6
> Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| Number of processes | Runtime (s) |
| -------- | -------- |
|2| 6.353718 |
|4| 3.310183 |
|8| 1.651244 |
|12| 1.124145 |
|16| 0.849721 |
## Q7
> Describe what approach(es) were used in your MPI matrix multiplication for each data set.
**Ans:**
First, the work is split between a MASTER (rank = 0) and WORKERs (rank > 0). The MASTER performs the load balancing: it divides matrix A evenly into chunks of rows and sends each WORKER the information it needs, namely the offset at which the WORKER's answer goes back into matrix C, the number of rows that WORKER must compute, and the elements of its rows of A together with all of B. After sending, the MASTER waits to receive the computed results from every WORKER.
Each WORKER, in turn, first receives its assigned workload information from the MASTER, then performs the computation, and finally sends its answer back to the MASTER.
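The scheme described above might be sketched roughly as follows. This is an illustrative outline, not the submitted code: `N`, `MASTER`, the tags `FROM_MASTER`/`FROM_WORKER`, and the matrices `a`, `b`, `c` are assumed to be defined elsewhere, and for simplicity it assumes `N` divides evenly among the workers.

```c
/* MASTER/WORKER row partitioning (simplified sketch). */
if (rank == MASTER) {
    int rows = N / (size - 1), offset = 0;
    for (int dest = 1; dest < size; dest++) {   /* distribute workload */
        MPI_Send(&offset, 1, MPI_INT, dest, FROM_MASTER, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, dest, FROM_MASTER, MPI_COMM_WORLD);
        MPI_Send(&a[offset][0], rows * N, MPI_DOUBLE, dest, FROM_MASTER, MPI_COMM_WORLD);
        MPI_Send(&b[0][0], N * N, MPI_DOUBLE, dest, FROM_MASTER, MPI_COMM_WORLD);
        offset += rows;
    }
    for (int src = 1; src < size; src++) {      /* collect results into C */
        MPI_Recv(&offset, 1, MPI_INT, src, FROM_WORKER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&rows, 1, MPI_INT, src, FROM_WORKER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&c[offset][0], rows * N, MPI_DOUBLE, src, FROM_WORKER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
} else {
    /* WORKER: receive workload, compute its rows of C, send them back. */
    int rows, offset;
    MPI_Recv(&offset, 1, MPI_INT, MASTER, FROM_MASTER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&rows, 1, MPI_INT, MASTER, FROM_MASTER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&a[0][0], rows * N, MPI_DOUBLE, MASTER, FROM_MASTER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&b[0][0], N * N, MPI_DOUBLE, MASTER, FROM_MASTER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    MPI_Send(&offset, 1, MPI_INT, MASTER, FROM_WORKER, MPI_COMM_WORLD);
    MPI_Send(&rows, 1, MPI_INT, MASTER, FROM_WORKER, MPI_COMM_WORLD);
    MPI_Send(&c[0][0], rows * N, MPI_DOUBLE, MASTER, FROM_WORKER, MPI_COMM_WORLD);
}
```

Sending the offset back with each result lets the MASTER place answers correctly even if replies arrive out of order.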