# Programming Assignment IV: MPI Programming
[toc]
## Q1 (3 points)
1. How do you control the number of MPI processes on each node? (1.5 points)
**A: Add `slots=N` after the hostname in the hostfile, which limits that node to running at most N MPI processes.**
2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes? (1.5 points)
**A:**
```
int world_rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); // rank of the calling process
MPI_Comm_size(MPI_COMM_WORLD, &world_size); // total number of processes
```
## Q2 (2 points)
2. Why are MPI_Send and MPI_Recv called “blocking” communication? (1 point)
**A: Both calls block until the operation completes locally: MPI_Send returns only once the send buffer can safely be reused, and MPI_Recv returns only once the message has arrived in the receive buffer. The program therefore waits at the call instead of continuing immediately.**
2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)

## Q3 (6 points)
1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it. (1 point)

2. How does the performance of binary tree reduction compare to the performance of linear reduction? (2 points)
**A: As the two plots above show, there is no obvious difference; the two approaches perform essentially the same here. The number of processes is probably too small for a significant difference to appear.**
3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs. (3 points)
**A: The tree approach is expected to perform better. Both approaches send the same total number of messages: linear reduction has n − 1 sends into the root, and tree reduction has n/2 + n/4 + … + 1 = n − 1 sends overall. The difference is in how they overlap: the root of a linear reduction must receive its n − 1 messages one after another, an O(n) critical path, while the tree combines pairs in parallel and finishes in only ⌈log₂ n⌉ rounds, an O(log n) critical path. As the number of processes grows, the logarithmic critical path wins.**
## Q4 (5 points)
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)

2. What are the MPI functions for non-blocking communication? (1 point)
```
MPI_Isend
MPI_Irecv
MPI_Wait
MPI_Waitany
MPI_Test
MPI_Testany
```
3. How does the performance of non-blocking communication compare to the performance of blocking communication? (3 points)
**A: As the two plots above show, there is no obvious difference; the two versions perform essentially the same here. The number of processes is probably too small for a significant difference to appear.**
## Q5 (1 point)
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)

## Q6 (1 point)
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)

## Q7 (5 points)
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)

2. Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that? (4 points)
**A: According to the graphs above, the 1.2.4 MPI_Gather approach gives the best performance among the six methods. MPI_Gather collects every process's sendbuf into the root's recvbuf in a single collective call, so it avoids the per-sender receive loop that the MPI_Recv-based versions need; as a consequence it beats 1.2.1 and 1.2.2. MPI_Reduce and the one-sided version must synchronize and combine recv_data before the operation, which introduces synchronization overhead. Finally, the non-blocking version looks similar to gather in principle, yet gather still wins in the measurements; a plausible explanation is that the collective's internal implementation (e.g., a tuned tree-based gather) is more optimized than a hand-written point-to-point loop.**
## Q8 (9 points)
1. Plot the ping-pong time as a function of the message size for cases 1 and 2, respectively. (4 points)


2. Calculate the bandwidth and latency for cases 1 and 2, respectively. (5 points)
**A: case 1: bandwidth ≈ 5 GB/s, latency ≈ 0.4 ms
case 2: bandwidth ≈ 111 MB/s, latency ≈ 0.5 ms**
## Q9 (5 points)
1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.
* Assumption: A is an n × m matrix and B is an m × l matrix; C = A × B. N is the number of processes in MPI_COMM_WORLD.
* Work distribution: matrix A is segmented into blocks of n / N rows; each process is responsible for multiplying its block of rows of A by B.
* Method: the master passes data to the workers with MPI_Isend and receives their results with MPI_Irecv, doing its own share of the work in the meantime. It then combines the workers' partial results and prints matrix C.
* Question: I cannot figure out why the score is so poor (20% per dataset). Switching to the non-blocking method did reduce the execution time, but the score stayed at 20%. I suspect the workstation's resources were shared with other jobs at the same time, but there is no way to verify that before the due date.
:::info
Did you use any special technique to improve the performance? E.g. tiling, transpose...
>[name=TA]
:::