# Programming Assignment IV: MPI Programming
**My Name: DuBu**
**My ID: 0616108**
### <span style="font-size:17pt;">Part-1
#### <span style="color:#800000; font-size:13pt;">Q1-1. How do you control the number of MPI processes on each node?
Modify the hostfile as shown below; ```slots=N``` means that up to N processes may be launched on that node.
```
pp1 slots=N
```
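For example, with Open MPI the hostfile is passed to ```mpirun```, and the requested number of processes is distributed over the nodes according to their ```slots``` (the file and program names here are only placeholders):
```
mpirun -np 8 --hostfile hostfile ./my_mpi_program
```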
#### <span style="color:#800000; font-size:13pt;">Q1-2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
```MPI_Comm_rank()``` retrieves the rank of the calling process, and ```MPI_Comm_size()``` retrieves the total number of processes in the communicator.
```c=
int world_rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* rank of the calling process */
MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* total number of processes */
```
#### <span style="color:#800000; font-size:13pt;">Q2-1. Why ```MPI_Send``` and ```MPI_Recv``` are called “blocking” communication?
They are called blocking because they do not return until their part of the communication is complete: ```MPI_Send``` returns only when the send buffer can safely be reused, and ```MPI_Recv``` returns only when the message has actually arrived in the receive buffer. The calling process cannot do any other work while it is waiting inside these calls.
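For concreteness, a minimal runnable sketch of such a blocking linear reduction (my own illustration, not the graded source) looks like this:
```c=
/* minimal sketch of blocking linear reduction (not the assignment code) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long long local = rank + 1, incoming = 0;   /* dummy per-process value */
    if (rank != 0) {
        /* blocks until the send buffer may safely be reused */
        MPI_Send(&local, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < size; src++) {
            /* blocks until the message from src has arrived */
            MPI_Recv(&incoming, 1, MPI_LONG_LONG, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local += incoming;
        }
        printf("sum = %lld\n", local);
    }
    MPI_Finalize();
    return 0;
}
```
Note that rank 0 can only post the next ```MPI_Recv``` after the previous one has returned, so the partial values are collected strictly one after another.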
#### <span style="color:#800000; font-size:13pt;">Q2-2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Execution Time (seconds) | 6.285158 | 3.202381 | 1.637128 | 1.115834 | 0.856629 |
### **Linear Reduction**

#### <span style="color:#800000; font-size:13pt;">Q3-1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 16 |
| -------- | -------- | -------- |-------|-----|
| Execution Time (seconds) | 6.282915 | 3.374169 | 1.761876 | 0.844483 |
### **Binary Tree Reduction**

#### <span style="color:#800000; font-size:13pt;">Q3-2. How does the performance of binary tree reduction compare to the performance of linear reduction?
| Number of Processes | 2 | 4 | 8 | 16 |
| -------- | -------- | -------- |-------|------|
| Linear Reduction (seconds) | 6.285158 | 3.202381 | 1.637128 | 0.856629 |
| Binary Tree Reduction (seconds) | 6.282915 | 3.374169 | 1.761876 | 0.844483 |
From the results above, the execution times of the two approaches are very similar.
#### <span style="color:#800000; font-size:13pt;">Q3-3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.
Binary tree reduction is going to perform better. Both approaches need roughly the same total number of messages (N-1 messages for N processes), but their costs are distributed very differently: in linear reduction the root has to receive all N-1 messages one after another, so the communication time grows linearly with the number of processes, while in binary tree reduction the messages of each round are exchanged in parallel by disjoint pairs of processes, so only about log2(N) communication rounds are needed.
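A sketch of this tree pattern is shown below; it assumes the rank/size setup from the earlier snippet and is my own illustration rather than the submitted code.
```c=
/* hedged sketch: binary tree reduction of one value per rank into rank 0 */
long long value = rank + 1;                  /* stand-in for the local partial sum */
for (int step = 1; step < size; step *= 2) {
    if (rank % (2 * step) == step) {
        /* lower partner of this round: send the value up and stop */
        MPI_Send(&value, 1, MPI_LONG_LONG, rank - step, 0, MPI_COMM_WORLD);
        break;
    } else if (rank % (2 * step) == 0 && rank + step < size) {
        long long incoming;
        MPI_Recv(&incoming, 1, MPI_LONG_LONG, rank + step, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        value += incoming;                   /* rank 0 ends up with the total */
    }
}
```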
#### <span style="color:#800000; font-size:13pt;">Q4-1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Execution Time (seconds) | 6.305194 | 3.220678 | 1.624813 | 1.098666 | 0.867730 |
### **Non-Blocking Linear Reduction**

#### <span style="color:#800000; font-size:13pt;">Q4-2. What are the MPI functions for non-blocking communication?
In contrast, non-blocking communication is done using ```MPI_Isend()``` and ```MPI_Irecv()```. These functions return immediately (i.e., they do not block) even if the communication is not finished yet. You must call ```MPI_Wait()``` or ```MPI_Test()``` to check whether the communication has finished.
Blocking communication is used when it is sufficient, since it is somewhat easier to use. Non-blocking communication is used when necessary, for example, you may call ```MPI_Isend()```, do some computations, then do ```MPI_Wait()```. This allows computations and communication to overlap, which generally leads to improved performance.
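As a sketch (using the same rank/size setup as the earlier snippets; the buffers are placeholders), the transfers can be started, other work done, and completion waited on afterwards:
```c=
/* hedged sketch: start the transfers, keep computing, then wait */
MPI_Request req;
long long local = rank + 1;                  /* stand-in partial result */
if (rank != 0) {
    MPI_Isend(&local, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD, &req);
    /* ...other computation can run here while the send is in flight... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);       /* 'local' must stay untouched until here */
} else {
    MPI_Request reqs[size];
    long long parts[size];
    for (int src = 1; src < size; src++)
        MPI_Irecv(&parts[src], 1, MPI_LONG_LONG, src, 0,
                  MPI_COMM_WORLD, &reqs[src]);
    /* ...the root can also compute here... */
    MPI_Waitall(size - 1, &reqs[1], MPI_STATUSES_IGNORE);
    for (int src = 1; src < size; src++) local += parts[src];
}
```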
#### <span style="color:#800000; font-size:13pt;">Q4-3. How the performance of non-blocking communication compares to the performance of blocking communication?
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Non-Blocking (seconds) | 6.305194 | 3.220678 | 1.624813 | 1.098666 | 0.867730 |
| Blocking (seconds) | 6.285158 | 3.202381 | 1.637128 | 1.115834 | 0.856629 |
The two results above are very similar; I cannot observe a real difference. This is probably because the reduction code does almost no computation between starting the communication and waiting for it, so there is little opportunity to overlap communication with computation.
#### <span style="color:#800000; font-size:13pt;">Q5-1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Execution Time (seconds) | 6.361072 | 3.467300 | 3.001366 | 2.175896 | 1.609730 |
### **MPI_Gather**
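For reference, the general shape of an ```MPI_Gather```-based reduction (a sketch under the same setup as the earlier snippets, plus ```<stdlib.h>```; not necessarily the exact measured code) is:
```c=
/* hedged sketch: root collects one partial value per rank, then sums them */
long long part = rank + 1;                   /* stand-in partial result */
long long *all = NULL;
if (rank == 0) all = malloc(size * sizeof *all);
MPI_Gather(&part, 1, MPI_LONG_LONG,
           all, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);
if (rank == 0) {
    long long total = 0;
    for (int i = 0; i < size; i++) total += all[i];
    printf("total = %lld\n", total);
    free(all);
}
```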

#### <span style="color:#800000; font-size:13pt;">Q6-1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Execution Time (seconds) | 6.336449 | 3.288299 | 3.352255 | 2.079932 | 1.451272 |
### **MPI_Reduce**
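The ```MPI_Reduce``` version collapses the same pattern into a single collective call (again only a sketch):
```c=
/* hedged sketch: sum the per-rank partial results directly at rank 0 */
long long part = rank + 1, total = 0;
MPI_Reduce(&part, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) printf("total = %lld\n", total);
```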

#### <span style="color:#800000; font-size:13pt;">Q7-1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Number of Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- |-------|------|-----|
| Execution Time (seconds) | 6.314919 | 3.501014 | 2.515297 | 1.628985 | 1.794264 |
### **MPI Windows and One-Sided Communication**
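With one-sided communication, rank 0 exposes a window of memory and the other ranks write into it directly; a minimal fence-based sketch (my own illustration, not necessarily the measured code) is:
```c=
/* hedged sketch: every rank adds its partial sum into a window on rank 0 */
long long part = rank + 1, total = 0;
MPI_Win win;
MPI_Win_create(&total, sizeof total, sizeof total,
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);                        /* open the access epoch */
MPI_Accumulate(&part, 1, MPI_LONG_LONG,
               0 /* target rank */, 0 /* displacement */,
               1, MPI_LONG_LONG, MPI_SUM, win);
MPI_Win_fence(0, win);                        /* all accumulates complete here */
if (rank == 0) printf("total = %lld\n", total);
MPI_Win_free(&win);
```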

#### <span style="color:#800000; font-size:13pt;">Q7-2. Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that?
According to the results above it is hard to single out one clear overall winner, but one-sided communication has the worst results compared to the MPI collective and the regular point-to-point (blocking and non-blocking) versions. The regular point-to-point versions perform better than the collectives, and among the collectives ```MPI_Reduce()``` gives slightly better results than ```MPI_Gather()```.
So if I had to rank them, the list would be:
>Tier 1: Regular point-to-point MPI (non-blocking & blocking)
>Tier 2: MPI collective with ```MPI_Reduce()```
>Tier 3: MPI collective with ```MPI_Gather()```
>Tier 4: One-sided communication
#### <span style="color:#800000; font-size:13pt;">Q8-1. Plot ping-pong time in function of the message size for cases 1 and 2, respectively.
### Intra-node hostfile
```
pp1 slots=2
```

---
### Inter-node hostfile
```
pp1 slots=1
pp2 slots=1
```

---
#### <span style="color:#800000; font-size:13pt;">Q8-2. Calculate the bandwidth and latency for cases 1 and 2, respectively
| | Latency | Bandwidth |
| -------- | -------- | -------- |
| Case 1 (intra-node) | 207.02 ms | 3.86 GB/s |
| Case 2 (inter-node) | 798.62 ms | 3.57 GB/s |
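These values follow from the usual ping-pong model: half of the round-trip time of a (near-)zero-byte message gives the latency, and for large messages the bandwidth is the message size divided by the one-way time. With generic symbols (not tied to the variable names in the code):
$$
\text{latency} \approx \tfrac{1}{2}\,T_{\mathrm{rtt}}(0), \qquad
\text{bandwidth} \approx \frac{n}{\tfrac{1}{2}\,T_{\mathrm{rtt}}(n)}
$$
where $T_{\mathrm{rtt}}(n)$ is the measured round-trip (ping-pong) time for a message of $n$ bytes.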
#### <span style="color:#800000; font-size:13pt;">Q9-1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.
Step 1: The root process (rank 0) reads the matrices.
Step 2: The root process broadcasts both matrices to all other processes with ```MPI_Bcast()```.
Step 3: The work is split among the processes and each process computes its part of the matrix multiplication.
Step 4: ```MPI_Gatherv()``` gathers the partial results back to the root process.
Step 5: The root process prints the result (a condensed sketch of this flow is shown below).
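The sketch below is only an illustration of this flow; the square matrix size, the row-wise work split, and all variable names are assumptions rather than the exact submitted code.
```c=
/* hedged sketch of the broadcast / compute / MPI_Gatherv flow */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 4;                                   /* assumed square n x n matrices */
    int *A = malloc(n * n * sizeof(int));
    int *B = malloc(n * n * sizeof(int));
    int *C = NULL;

    if (rank == 0) {                             /* Step 1: root "reads" the input */
        for (int i = 0; i < n * n; i++) { A[i] = i; B[i] = i; }
        C = malloc(n * n * sizeof(int));
    }

    /* Step 2: broadcast both matrices to every process */
    MPI_Bcast(A, n * n, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, n * n, MPI_INT, 0, MPI_COMM_WORLD);

    /* Step 3: each rank multiplies its own block of rows */
    int base = n / size, rem = n % size;
    int rows = base + (rank < rem);              /* spread the remainder rows */
    int first = rank * base + (rank < rem ? rank : rem);
    int *local = malloc(rows * n * sizeof(int));
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int k = 0; k < n; k++)
                sum += A[(first + i) * n + k] * B[k * n + j];
            local[i * n + j] = sum;
        }

    /* Step 4: gather the variable-sized row blocks back to rank 0 */
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    for (int r = 0, off = 0; r < size; r++) {
        counts[r] = (base + (r < rem)) * n;
        displs[r] = off;
        off += counts[r];
    }
    MPI_Gatherv(local, rows * n, MPI_INT,
                C, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)                               /* Step 5: root prints the result */
        for (int i = 0; i < n * n; i++)
            printf("%d%c", C[i], (i % n == n - 1) ? '\n' : ' ');

    free(A); free(B); free(local); free(counts); free(displs); free(C);
    MPI_Finalize();
    return 0;
}
```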