# Programming Assignment IV: MPI Programming
**My Name: DuBu**
**My ID: 0616108**
### <span style="font-size:17pt;">Part-1
#### <span style="color:#800000; font-size:13pt;">Q1-1. How do you control the number of MPI processes on each node?
Modify the hostfile as shown below; ```slots=N``` means that up to N processes may be launched on that node.
```
pp1 slots=N
```
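For example, with Open MPI the hostfile is passed to ```mpirun```, and the requested number of processes is distributed over the nodes according to their ```slots``` (the file and program names here are only placeholders):
```
mpirun -np 8 --hostfile hostfile ./my_mpi_program
```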
#### <span style="color:#800000; font-size:13pt;">Q1-2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
```MPI_Comm_rank()``` retrieves the rank of the calling process, and ```MPI_Comm_size()``` retrieves the total number of processes in the communicator.
```c=
int world_rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* rank of the calling process */
MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* total number of processes */
```
#### <span style="color:#800000; font-size:13pt;">Q2-1. Why ```MPI_Send``` and ```MPI_Recv``` are called “blocking” communication?
They are called blocking because they do not return until their part of the communication is complete: ```MPI_Send``` returns only when the send buffer can safely be reused, and ```MPI_Recv``` returns only when the message has actually arrived in the receive buffer. The calling process cannot do any other work while it is waiting inside these calls.
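For concreteness, a minimal runnable sketch of such a blocking linear reduction (my own illustration, not the graded source) looks like this:
```c=
/* minimal sketch of blocking linear reduction (not the assignment code) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long long local = rank + 1, incoming = 0;   /* dummy per-process value */
    if (rank != 0) {
        /* blocks until the send buffer may safely be reused */
        MPI_Send(&local, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < size; src++) {
            /* blocks until the message from src has arrived */
            MPI_Recv(&incoming, 1, MPI_LONG_LONG, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local += incoming;
        }
        printf("sum = %lld\n", local);
    }
    MPI_Finalize();
    return 0;
}
```
Note that rank 0 can only post the next ```MPI_Recv``` after the previous one has returned, so the partial values are collected strictly one after another.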
#### <span style="color:#800000; font-size:13pt;">Q2-2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Execution Time (seconds) | 6.285158 | 3.202381 | 1.637128 | 1.115834 | 0.856629 |
### **Linear Reduction**

#### <span style="color:#800000; font-size:13pt;">Q3-1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 16 |
| -------- | -------- | -------- |-------|-----|
| Execution Time (seconds) | 6.282915 | 3.374169 | 1.761876 | 0.844483 |
### **Binary Tree Reduction**

#### <span style="color:#800000; font-size:13pt;">Q3-2. How does the performance of binary tree reduction compare to the performance of linear reduction?
| Number of Processes | 2 | 4 | 8 | 16 |
| -------- | -------- | -------- |-------|------|
| Linear Reduction (seconds) | 6.285158 | 3.202381 | 1.637128 | 0.856629 |
| Binary Tree Reduction (seconds) | 6.282915 | 3.374169 | 1.761876 | 0.844483 |
From the results above, the execution times of the two approaches are very similar.
#### <span style="color:#800000; font-size:13pt;">Q3-3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.
Binary tree reduction is going to perform better. Both approaches need roughly the same total number of messages (N-1 messages for N processes), but their costs are distributed very differently: in linear reduction the root has to receive all N-1 messages one after another, so the communication time grows linearly with the number of processes, while in binary tree reduction the messages of each round are exchanged in parallel by disjoint pairs of processes, so only about log2(N) communication rounds are needed.
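A sketch of this tree pattern is shown below; it assumes the rank/size setup from the earlier snippet and is my own illustration rather than the submitted code.
```c=
/* hedged sketch: binary tree reduction of one value per rank into rank 0 */
long long value = rank + 1;                  /* stand-in for the local partial sum */
for (int step = 1; step < size; step *= 2) {
    if (rank % (2 * step) == step) {
        /* lower partner of this round: send the value up and stop */
        MPI_Send(&value, 1, MPI_LONG_LONG, rank - step, 0, MPI_COMM_WORLD);
        break;
    } else if (rank % (2 * step) == 0 && rank + step < size) {
        long long incoming;
        MPI_Recv(&incoming, 1, MPI_LONG_LONG, rank + step, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        value += incoming;                   /* rank 0 ends up with the total */
    }
}
```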
#### <span style="color:#800000; font-size:13pt;">Q4-1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Execution Time (seconds) | 6.305194 | 3.220678 | 1.624813 | 1.098666 | 0.867730 |
### **Non-Blocking Linear Reduction**

#### <span style="color:#800000; font-size:13pt;">Q4-2. What are the MPI functions for non-blocking communication?
In contrast, non-blocking communication is done using ```MPI_Isend()``` and ```MPI_Irecv()```. These functions return immediately (i.e., they do not block) even if the communication is not finished yet. You must call ```MPI_Wait()``` or ```MPI_Test()``` to check whether the communication has finished.
Blocking communication is used when it is sufficient, since it is somewhat easier to use. Non-blocking communication is used when necessary, for example, you may call ```MPI_Isend()```, do some computations, then do ```MPI_Wait()```. This allows computations and communication to overlap, which generally leads to improved performance.
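As a sketch (using the same rank/size setup as the earlier snippets; the buffers are placeholders), the transfers can be started, other work done, and completion waited on afterwards:
```c=
/* hedged sketch: start the transfers, keep computing, then wait */
MPI_Request req;
long long local = rank + 1;                  /* stand-in partial result */
if (rank != 0) {
    MPI_Isend(&local, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD, &req);
    /* ...other computation can run here while the send is in flight... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);       /* 'local' must stay untouched until here */
} else {
    MPI_Request reqs[size];
    long long parts[size];
    for (int src = 1; src < size; src++)
        MPI_Irecv(&parts[src], 1, MPI_LONG_LONG, src, 0,
                  MPI_COMM_WORLD, &reqs[src]);
    /* ...the root can also compute here... */
    MPI_Waitall(size - 1, &reqs[1], MPI_STATUSES_IGNORE);
    for (int src = 1; src < size; src++) local += parts[src];
}
```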
#### <span style="color:#800000; font-size:13pt;">Q4-3. How the performance of non-blocking communication compares to the performance of blocking communication?
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Non-Blocking (seconds) | 6.305194 | 3.220678 | 1.624813 | 1.098666 | 0.867730 |
| Blocking (seconds) | 6.285158 | 3.202381 | 1.637128 | 1.115834 | 0.856629 |
The two results above are very similar; I cannot observe a real difference. This is probably because the reduction code does almost no computation between starting the communication and waiting for it, so there is little opportunity to overlap communication with computation.
#### <span style="color:#800000; font-size:13pt;">Q5-1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Execution Time (seconds) | 6.361072 | 3.467300 | 3.001366 | 2.175896 | 1.609730 |
### **MPI_Gather**
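For reference, the general shape of an ```MPI_Gather```-based reduction (a sketch under the same setup as the earlier snippets, plus ```<stdlib.h>```; not necessarily the exact measured code) is:
```c=
/* hedged sketch: root collects one partial value per rank, then sums them */
long long part = rank + 1;                   /* stand-in partial result */
long long *all = NULL;
if (rank == 0) all = malloc(size * sizeof *all);
MPI_Gather(&part, 1, MPI_LONG_LONG,
           all, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);
if (rank == 0) {
    long long total = 0;
    for (int i = 0; i < size; i++) total += all[i];
    printf("total = %lld\n", total);
    free(all);
}
```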

#### <span style="color:#800000; font-size:13pt;">Q6-1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Execution Time (seconds) | 6.336449 | 3.288299 | 3.352255 | 2.079932 | 1.451272 |
### **MPI_Reduce**
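The ```MPI_Reduce``` version collapses the same pattern into a single collective call (again only a sketch):
```c=
/* hedged sketch: sum the per-rank partial results directly at rank 0 */
long long part = rank + 1, total = 0;
MPI_Reduce(&part, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) printf("total = %lld\n", total);
```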

#### <span style="color:#800000; font-size:13pt;">Q7-1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Execution Time (seconds) | 6.314919 | 3.501014 | 2.515297 | 1.628985 | 1.794264 |
### **MPI Windows and One-Sided Communication**
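With one-sided communication, rank 0 exposes a window of memory and the other ranks write into it directly; a minimal fence-based sketch (my own illustration, not necessarily the measured code) is:
```c=
/* hedged sketch: every rank adds its partial sum into a window on rank 0 */
long long part = rank + 1, total = 0;
MPI_Win win;
MPI_Win_create(&total, sizeof total, sizeof total,
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);                        /* open the access epoch */
MPI_Accumulate(&part, 1, MPI_LONG_LONG,
               0 /* target rank */, 0 /* displacement */,
               1, MPI_LONG_LONG, MPI_SUM, win);
MPI_Win_fence(0, win);                        /* all accumulates complete here */
if (rank == 0) printf("total = %lld\n", total);
MPI_Win_free(&win);
```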

#### <span style="color:#800000; font-size:13pt;">Q7-2. Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that?
According to the results above it is hard to single out one clear overall winner, but one-sided communication has the worst results compared to the MPI collective and the regular point-to-point (blocking and non-blocking) versions. The regular point-to-point versions perform better than the collectives, and among the collectives ```MPI_Reduce()``` gives slightly better results than ```MPI_Gather()```.
So if I had to rank them, the list would be:
>Tier 1: Regular point-to-point MPI (non-blocking & blocking)
>Tier 2: MPI collective with ```MPI_Reduce()```
>Tier 3: MPI collective with ```MPI_Gather()```
>Tier 4: One-sided communication
#### <span style="color:#800000; font-size:13pt;">Q8-1. Plot ping-pong time in function of the message size for cases 1 and 2, respectively.
### Intra-node hostfile
```
pp1 slots=2
```

---
### Inter-node hostfile
```
pp1 slots=1
pp2 slots=1
```

---
#### <span style="color:#800000; font-size:13pt;">Q8-2. Calculate the bandwidth and latency for cases 1 and 2, respectively
| | Latency | Bandwidth |
| -------- | -------- | -------- |
| Case 1 (intra-node) | 207.02 ms | 3.86 GB/s |
| Case 2 (inter-node) | 798.62 ms | 3.57 GB/s |
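These values follow from the usual ping-pong model: half of the round-trip time of a (near-)zero-byte message gives the latency, and for large messages the bandwidth is the message size divided by the one-way time. With generic symbols (not tied to the variable names in the code):
$$
\text{latency} \approx \tfrac{1}{2}\,T_{\mathrm{rtt}}(0), \qquad
\text{bandwidth} \approx \frac{n}{\tfrac{1}{2}\,T_{\mathrm{rtt}}(n)}
$$
where $T_{\mathrm{rtt}}(n)$ is the measured round-trip (ping-pong) time for a message of $n$ bytes.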
#### <span style="color:#800000; font-size:13pt;">Q9-1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.
Step 1: The root process (rank 0) reads the matrices.
Step 2: The root process broadcasts both matrices to all other processes with ```MPI_Bcast()```.
Step 3: The work is split among the processes and each process computes its part of the matrix multiplication.
Step 4: ```MPI_Gatherv()``` gathers the partial results back to the root process.
Step 5: The root process prints the result (a condensed sketch of this flow is shown below).
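The sketch below is only an illustration of this flow; the square matrix size, the row-wise work split, and all variable names are assumptions rather than the exact submitted code.
```c=
/* hedged sketch of the broadcast / compute / MPI_Gatherv flow */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 4;                                   /* assumed square n x n matrices */
    int *A = malloc(n * n * sizeof(int));
    int *B = malloc(n * n * sizeof(int));
    int *C = NULL;

    if (rank == 0) {                             /* Step 1: root "reads" the input */
        for (int i = 0; i < n * n; i++) { A[i] = i; B[i] = i; }
        C = malloc(n * n * sizeof(int));
    }

    /* Step 2: broadcast both matrices to every process */
    MPI_Bcast(A, n * n, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, n * n, MPI_INT, 0, MPI_COMM_WORLD);

    /* Step 3: each rank multiplies its own block of rows */
    int base = n / size, rem = n % size;
    int rows = base + (rank < rem);              /* spread the remainder rows */
    int first = rank * base + (rank < rem ? rank : rem);
    int *local = malloc(rows * n * sizeof(int));
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int k = 0; k < n; k++)
                sum += A[(first + i) * n + k] * B[k * n + j];
            local[i * n + j] = sum;
        }

    /* Step 4: gather the variable-sized row blocks back to rank 0 */
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    for (int r = 0, off = 0; r < size; r++) {
        counts[r] = (base + (r < rem)) * n;
        displs[r] = off;
        off += counts[r];
    }
    MPI_Gatherv(local, rows * n, MPI_INT,
                C, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)                               /* Step 5: root prints the result */
        for (int i = 0; i < n * n; i++)
            printf("%d%c", C[i], (i % n == n - 1) ? '\n' : ' ');

    free(A); free(B); free(local); free(counts); free(displs); free(C);
    MPI_Finalize();
    return 0;
}
```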