# PP_HW4

### Q1: Hello-mpi

1. How do you control the number of MPI processes on each node? (1 point)
#### ans:
The hostfile passed to `mpirun --hostfile` lists which nodes to use and how many slots (processes) each node provides; the per-node process count can also be set explicitly with `-npernode`.

2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes? (1 point)
#### ans:
rank: `MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);`
total number of processes: `MPI_Comm_size(MPI_COMM_WORLD, &world_size);`

3. We use Open MPI for this assignment. What other MPI implementation is commonly used? What is the difference between them? (1 point)
#### ans:
MPICH is the other widely used implementation (and the base of derivatives such as MVAPICH and Intel MPI). MPICH began as a reference implementation of the MPI standard, while Open MPI was formed by merging several earlier projects (LAM/MPI, FT-MPI, LA-MPI); they differ mainly in their process launchers, tuning options, and supported interconnects rather than in the MPI API itself.

### Q2: Block-linear

1. Why are MPI_Send and MPI_Recv called "blocking" communication? (1 point)
#### ans:
`MPI_Send` does not return until the send buffer is safe to reuse, which for large messages means waiting until the matching `MPI_Recv` has started (small messages may be buffered internally and return earlier); `MPI_Recv` likewise blocks until the message has actually arrived in its buffer.

2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
![](https://i.imgur.com/NpP859G.png)
![](https://i.imgur.com/nnEUXaH.png)

### Q3: Block-tree

1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it. (1 point)
#### ans:
![](https://i.imgur.com/A3vG9Jl.png)
![](https://i.imgur.com/W3Ig7XD.png)

2. How does the performance of binary tree reduction compare to the performance of linear reduction? (2 points)
#### ans:
In my measurements, linear reduction is consistently slightly faster than binary tree reduction.

3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs. (3 points)
#### ans:
Taking 8 processes as an example, both approaches need 7 messages to reduce the partial results of pi. In the tree, some processes must receive before they can send, so at this scale linear is slightly faster despite the equal message count. As the number of processes P grows, however, linear serializes all P-1 receives at rank 0, while the tree overlaps its messages in log2(P) rounds, so tree reduction should perform better for large P.

![](https://i.imgur.com/pV083UC.png)

### Q4: Non-block-linear

1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
#### ans:
![](https://i.imgur.com/mJKLImA.png)
![](https://i.imgur.com/qkp00kP.png)

2. What are the MPI functions for non-blocking communication?
(1 point)
#### ans:
`MPI_Isend()` and `MPI_Irecv()` (completed later with `MPI_Wait`/`MPI_Test`).

3. How does the performance of non-blocking communication compare to the performance of blocking communication? (3 points)
#### ans:
In my measurements the two are essentially the same.
![](https://i.imgur.com/aswIwRB.png)

### Q5: MPI_Gather

1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
![](https://i.imgur.com/7qzstJX.png)
![](https://i.imgur.com/mI0TTvs.png)

### Q6: Reduce

1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
![](https://i.imgur.com/t0yfxXo.png)
![](https://i.imgur.com/rZtvMEx.png)

### Q7: One_side

1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
![](https://i.imgur.com/VqRK0pd.png)
![](https://i.imgur.com/VINYxDP.png)

2. Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that? (3 points)
#### ans:
In my tests, the non-blocking version is the fastest. I believe the reason is that after a non-blocking call the process continues executing without waiting for the other side to receive. It is also possible that other users running jobs at the same time affected the measured times.

3. Which algorithm or algorithms do MPI implementations use for reduction operations? You can research this on the WEB focusing on one MPI implementation. (1 point)
#### ans:
MPI reductions use rank-order-based deterministic algorithms (for example, MPICH uses a binomial tree for short messages and a reduce-scatter plus gather scheme for long ones). https://www.mcs.anl.gov/papers/P4093-0713_1.pdf

### Q8:

1. Plot ping-pong time in function of the message size for cases 1 and 2, respectively. (2 points)
#### ans:
#### case 1:
`mpicxx ping_pong.c -o ping_pong; mpirun -np 2 -npernode 2 --hostfile hosts ping_pong`
![](https://i.imgur.com/XiIsfAV.png)
#### case 2:
`mpicxx ping_pong.c -o ping_pong; mpirun -np 2 -npernode 1 --hostfile hosts ping_pong`
![](https://i.imgur.com/8RJKGDg.png)

2. Calculate the bandwidth and latency for cases 1 and 2, respectively.
(3 points)
#### ans:
#### case 1:
bandwidth = 6780353631.75 B/s ≈ 6.78 GB/s
latency = 5.375391e-05 s ≈ 53.7 µs
#### case 2:
bandwidth = 116773102.90 B/s ≈ 0.117 GB/s
latency = 3.346260e-04 s ≈ 334 µs

3. For case 2, how do the obtained values of bandwidth and latency compare to the nominal network bandwidth and latency of the NCTU-PP workstations. What are the differences and what could be the explanation of the differences if any? (4 points)
#### ans:
The measured latency is affected by several kinds of delay:
- propagation delay: the time a packet spends traveling on the wire, determined by the signal speed in the medium; for a distance d and propagation speed s it is d/s.
- transmission delay: the time the NIC needs to push the data onto (or pull it off) the wire, determined by the link speed (e.g. 100 Mbps Fast Ethernet); for a packet of L bits and a rate of R bits/sec it is L/R.
- nodal processing delay: the time routers spend processing packet headers, checking for bit errors, and looking up forwarding routes.
- queuing delay: the time a packet waits in a router's queue when it cannot be forwarded immediately.
Together these overheads explain why the measured values are worse than the results obtained with ping.

### Q9: Describe what approach(es) were used in your MPI matrix multiplication for each data set.
#### ans:
#### dataset1:
Multiplying matrix A (6x8) by matrix B (8x4): the master (rank 0) sends matrix B together with the assigned rows of A to each process. For example, the process with rank 1 receives rows 2-3 of A and the whole of B, computes its 2x4 partial result, and sends it back; the master receives each process's partial matrix and assembles the final answer.
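The dataset1 decomposition described above can be sketched as follows. This is a minimal sequential sketch of the per-worker computation, not the actual submitted program: `multiply_rows` is the kernel each worker would run on its assigned row block, and `rows_for_rank` is a hypothetical helper for splitting rows among workers; the `MPI_Send`/`MPI_Recv` plumbing that ships B and the row blocks is only indicated in comments.

```c
#include <stddef.h>

/* Multiply rows [rows_begin, rows_end) of A (rA x cA) by B (cA x cB),
   writing the partial result into C at the same row offset.
   In the real program: the master MPI_Sends B and the row block to a
   worker, the worker runs this loop, then MPI_Sends C back to rank 0. */
void multiply_rows(int rows_begin, int rows_end, int cA, int cB,
                   const double *A, const double *B, double *C) {
    for (int i = rows_begin; i < rows_end; i++)
        for (int j = 0; j < cB; j++) {
            double sum = 0.0;
            for (int k = 0; k < cA; k++)
                sum += A[i * cA + k] * B[k * cB + j];
            C[i * cB + j] = sum;
        }
}

/* Hypothetical helper (not from the original code): how many of the rA
   rows worker r receives when the work is split over nworkers workers,
   with the remainder rows going to the lowest ranks. */
int rows_for_rank(int rA, int nworkers, int r) {
    return rA / nworkers + (r < rA % nworkers ? 1 : 0);
}
```

With A being 6x8 and 3 workers, each worker gets 2 rows and returns a 2x4 block, matching the example in the answer above.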