Parallel Programming NCTU Fall 2020 HW4
===
###### tags: `parallel_programming`
# Part1
## Q1
### Q1-1: How do you control the number of MPI processes on each node? (1 points)
I found two ways to control this:
1. Add ==-host host_name1:N1,host_name2:N2== to the mpirun command line,
where N1 and N2 are the maximum numbers of processes that may run on host_name1 and host_name2, respectively.
Adding -display-map shows the resulting process map:

2. Add ==slots=N== after the host name in the hostfile; N is then the maximum number of processes that may run on that host. If slots is not given, it defaults to the number of CPU cores on that host.
Example:
```
pp1 slots=16  # 1st node hostname
pp2           # 2nd node hostname
```
Result

Note that when running mpirun -np 17 --hostfile hosts -display-map mpi_hello
we request 17 processes, one more than the 16 slots on pp1, so the extra process is mapped onto pp2.
### Q1-2:Which functions do you use for retrieving the rank of an MPI process and the total number of processes? (1 points)
#### The total number of MPI processes:
int MPI_Comm_size( MPI_Comm comm, int *size )
#### The rank of an MPI process:
int MPI_Comm_rank( MPI_Comm comm, int *rank )
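A minimal usage sketch of the two calls (standard MPI; compile with mpicc and launch with mpirun):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's rank, 0..size-1
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```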
### Q1-3: We use Open MPI for this assignment. What else MPI implementation is commonly used? What is the difference between them? (1 points)
#### Other commonly used MPI implementation:
MPICH is another widely used implementation.
#### Difference between them:
The difference between them: MPICH implements the features of the latest MPI standard first, but its management is not always as smooth; my understanding is that it can be less stable. Open MPI, on the other hand, is a smoothly managed implementation aimed at production use rather than experimentation, and provides a more stable user experience.
---
## Q2
### Q2-1: Why MPI_Send and MPI_Recv are called “blocking” communication? (1 points)
When MPI_Send is called it may have to wait for the matching MPI_Recv in another process (more precisely, it returns only once the send buffer is safe to reuse), so the program is "blocked" and cannot execute the following steps. Likewise, MPI_Recv waits until data from a matching MPI_Send has arrived before the program can continue. Because both calls make the caller wait in this way, they are called "blocking" communication.
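A minimal sketch of this behavior (a hypothetical two-process example; run with at least two ranks):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        value = 42;
        // May block until the message is matched or buffered.
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        // Blocks until the data from rank 1 has actually arrived.
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```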
### Q2-2: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| MPI processes | Run time (s) |
| ------------- | ------------ |
| 2 | 14.957 |
| 4 | 8.6364 |
| 8 | 4.51 |
| 12 | 2.198 |
| 16 | 2.001 |
---
## Q3
### Q3-1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it. (1 points)

| MPI processes | Run time (s) |
| ------------- | ------------ |
| 2 | 14.231 |
| 4 | 6.635 |
| 8 | 3.516 |
| 16 | 1.8074 |
### Q3-2. How does the performance of binary tree reduction compare to the performance of linear reduction? (2 points)
With two processes the execution times are the same; once the process count increases, the tree reduction becomes faster than linear_block.
### Q3-3 Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs. (3 points)
With two processes the tree version takes longer, but once the process count increases it becomes faster than linear_block.
I think that with two processes the data transfer pattern is actually the same as in linear_block, yet extra effort goes into setting up the processes. As the process count grows, processes running linear_block spend more and more time waiting on each other, while the tree approach lets pairs sum their partial results independently, so no process is stuck waiting for one slow accumulator. Counting messages makes this concrete: the root of a linear reduction receives P-1 messages one after another (an O(P) critical path), whereas a tree reduction finishes in ceil(log2 P) rounds.
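A hedged sketch of the binary-tree pattern (tree_sum is a hypothetical helper; it assumes MPI is initialized and each rank already holds its partial sum):
```c
#include <mpi.h>

// Binary-tree sum over all ranks: at each level, every surviving rank
// either receives a partial sum from rank+stride, or sends its own partial
// sum to rank-stride and drops out. The full total ends up on rank 0.
long long tree_sum(long long local, int rank, int size) {
    for (int stride = 1; stride < size; stride *= 2) {
        if (rank % (2 * stride) == 0) {
            long long partner;
            if (rank + stride < size) {
                MPI_Recv(&partner, 1, MPI_LONG_LONG, rank + stride, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                local += partner;
            }
        } else {
            MPI_Send(&local, 1, MPI_LONG_LONG, rank - stride, 0,
                     MPI_COMM_WORLD);
            break;  // this rank's work is done once it has sent upward
        }
    }
    return local;  // meaningful only on rank 0
}
```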
The comparison is shown in the figure below.

---
## Q4
### Q4-1 :Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| MPI processes | Run time (s) |
| ------------- | ------------ |
| 2 | 14.726 |
| 4 | 8.6364 |
| 8 | 4.51 |
| 12 | 2.198 |
| 16 | 2.001 |
### Q4-2 :What are the MPI functions for non-blocking communication? (1 points)
- Send: int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
- Receive: int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
- Completion: the MPI_Request each call returns is completed with MPI_Wait / MPI_Waitall, or polled with MPI_Test / MPI_Testall.
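A runnable sketch of the pattern (a hypothetical example, not the assignment's exact code; assumes at most 64 ranks):
```c
#include <mpi.h>
#include <stdio.h>

#define MAX_PROCS 64

int main(int argc, char *argv[]) {
    int rank, size;
    long long local, partial[MAX_PROCS];
    MPI_Request reqs[MAX_PROCS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    local = rank + 1;  // stand-in for a real partial result

    if (rank == 0) {
        // Post every receive up front; none of these calls waits for data.
        for (int src = 1; src < size; src++)
            MPI_Irecv(&partial[src], 1, MPI_LONG_LONG, src, 0,
                      MPI_COMM_WORLD, &reqs[src - 1]);
        MPI_Waitall(size - 1, reqs, MPI_STATUSES_IGNORE);  // completion happens here
        for (int src = 1; src < size; src++)
            local += partial[src];
        printf("total = %lld\n", local);
    } else {
        MPI_Request req;
        MPI_Isend(&local, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  // buffer reusable only after this
    }
    MPI_Finalize();
    return 0;
}
```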
### Q4-3 :How the performance of non-blocking communication compares to the performance of blocking communication? (3 points)

The two perform about the same. I think the reason may be that although the non-blocking version does not have to receive from each process one by one in a for loop, it still has to wait in a while loop until all the values have arrived before doing the final one-shot sum, so both versions are held back by the slowest process.
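To make the while-loop point concrete, the MPI_Waitall in the sketch above could be replaced by a polling loop with MPI_Testall; either way, the final sum cannot start before the slowest rank has delivered its value:
```c
// Drop-in replacement for the MPI_Waitall call in the sketch above.
int done = 0;
while (!done) {
    MPI_Testall(size - 1, reqs, &done, MPI_STATUSES_IGNORE);
    /* rank 0 could overlap other useful work with communication here */
}
```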
## Q5
### Q5-1 Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| MPI processes | Run time (s) |
| ------------- | ------------ |
| 2 | 12.092 |
| 4 | 8.716 |
| 8 | 4.477 |
| 12 | 2.934 |
| 16 | 2.228 |
## Q6
### Q6-1 Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| MPI processes | Run time (s) |
| ------------- | ------------ |
| 2 | 14.403 |
| 4 | 8.678 |
| 8 | 4.447 |
| 12 | 4.135 |
| 16 | 2.228 |
## Q7
### Q7-1 Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| MPI processes | Run time (s) |
| ------------- | ------------ |
| 2 | 12.397 |
| 4 | 9.225 |
| 8 | 4.914 |
| 12 | 3.183 |
| 16 | 2.777 |
### Q7-2 Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that? (3 points)
1. Comparing the results, 1-2-2 performs best with four and eight processes, while the other approaches perform almost identically.
2. I think the reason is that the tree lets pairs of partial results be added as soon as both are ready, so once the last process finishes its computation only a few more additions remain, saving a little time. It is also possible that the workstation was busy, which would introduce measurement error.

### Q7-3 Which algorithm or algorithms do MPI implementations use for reduction operations? You can research this on the WEB focusing on one MPI implementation. (1 points)
Linear allreduce algorithm:
Open MPI implements the linear allreduce algorithm as a linear reduce to a specified root followed by a linear broadcast from the same root.
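For reference, that decomposition maps directly onto collectives (a sketch; local and total are placeholder names for a per-rank partial result and the final value):
```c
// Linear allreduce = reduce to root 0, then broadcast from root 0.
long long local = 0 /* per-rank partial result */, total = 0;
MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Bcast(&total, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);
// MPI_Allreduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD)
// performs the same operation in a single call.
```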
## Q8
### Q8-1: Plot ping-pong time in function of the message size for cases 1 and 2, respectively. (2 points)
- case 1
#### plot

#### raw data

- case 2
#### plot

#### raw data

### Q8-2: Calculate the bandwidth and latency for cases 1 and 2, respectively. (3 points)
- case 1
  - latency: 1.58E-10 s
  - bandwidth: 6,329,113,924 bytes/s
- case 2
  - latency: 8.57E-9 s
  - bandwidth: 116,686,114 bytes/s
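These values come from fitting the measured ping-pong times to the usual linear cost model (my assumption about how the fit is done; $n$ is the message size in bytes):

$$
T(n) = L + \frac{n}{B}
$$

where the intercept $L$ of the fitted line gives the latency and the reciprocal of its slope gives the bandwidth $B$.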
### Q8-3: For case 2, how do the obtained values of bandwidth and latency compare to the nominal network bandwidth and latency of the NCTU-PP workstations. What are the differences and what could be the explanation of the differences if any? (4 points)
The measured bandwidth and latency are worse than the nominal network bandwidth and latency. On the hardware side, the measurement is affected by the build quality of the equipment and the experimental environment, for example: 1. network cables, 2. routers, 3. switches. On the software side, other processes may compete for bandwidth and CPU scheduling takes time, so the measured bandwidth and latency come out worse than the theoretical values.
# Part2
## Q9
### Q9-1: Describe what approach(es) were used in your MPI matrix multiplication for each data set.
1. Minimizing latency for short messages.
2. Minimizing bandwidth use for long messages.
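For illustration only, here is a generic sketch of one common row-partitioned scheme (not necessarily the submitted code; mat_mul, a, b, c, and n are hypothetical names, n is assumed divisible by size, and b must hold the full n*n matrix on every rank):
```c
#include <mpi.h>
#include <stdlib.h>

// Row-partitioned C = A * B: scatter row blocks of A, broadcast all of B,
// compute each block locally, then gather the row blocks of C on rank 0.
void mat_mul(int n, double *a, double *b, double *c, int size) {
    int rows = n / size;  // rows of A (and C) owned by each rank
    double *a_loc = malloc((size_t)rows * n * sizeof(double));
    double *c_loc = malloc((size_t)rows * n * sizeof(double));

    MPI_Scatter(a, rows * n, MPI_DOUBLE, a_loc, rows * n, MPI_DOUBLE,
                0, MPI_COMM_WORLD);                      // distribute A
    MPI_Bcast(b, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);  // everyone needs all of B

    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a_loc[i * n + k] * b[k * n + j];
            c_loc[i * n + j] = sum;
        }

    MPI_Gather(c_loc, rows * n, MPI_DOUBLE, c, rows * n, MPI_DOUBLE,
               0, MPI_COMM_WORLD);                       // collect C on rank 0
    free(a_loc);
    free(c_loc);
}
```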