---
tags: parallel_programming
---
# Parallel programming HW4
## Q1.1: How do you control the number of MPI processes on each node?
- `mpirun` launches processes according to the hostfile: it assigns processes to the machines listed in the hostfile, one line (slot) at a time.
```
machine_1
machine_1
machine_2
```
- If we run 8 copies of the program with the hostfile above, process 1 and process 2 are assigned to machine_1 and process 3 to machine_2; the assignment then wraps around, so process 4 goes to machine_1 again, and so on.
> https://stackoverflow.com/questions/14411763/mapping-mpi-processes-to-particular-nodes
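The wrap-around assignment described above can be sketched in plain Python (no MPI involved; the hostnames are the hypothetical ones from the hostfile):

```python
# Simulate how mpirun walks the hostfile slot by slot,
# wrapping around when it runs out of lines.
hostfile = ["machine_1", "machine_1", "machine_2"]  # one slot per line

def assign(n_procs, hosts):
    """Assign process i to hostfile line i modulo the number of slots."""
    return [hosts[i % len(hosts)] for i in range(n_procs)]

# processes 1-2 -> machine_1, process 3 -> machine_2,
# process 4 wraps back to machine_1, ...
print(assign(8, hostfile))
```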
## Q1.2: Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
- `MPI_Comm_size` is used to get the total number of processes.
- `MPI_Comm_rank` is used to get the rank of the current process.
## Q2.1: Why are MPI_Send and MPI_Recv called “blocking” communication?
- They are called blocking because the call does not return until the operation is locally complete: `MPI_Send` blocks until the send buffer can be safely reused, and `MPI_Recv` blocks until the message has arrived in the receive buffer. The calling process is therefore suspended until the communication finishes.
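As an analogy (plain Python, not MPI), a blocking receive behaves like `queue.Queue.get()`, which suspends the caller until data actually arrives:

```python
import queue
import threading
import time

channel = queue.Queue()

def worker():
    time.sleep(0.1)      # pretend to compute a partial result
    channel.put(42)      # analogous to MPI_Send: hand the result over

threading.Thread(target=worker).start()
result = channel.get()   # analogous to MPI_Recv: blocks until data arrives
print(result)            # only reached after the worker has "sent"
```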
## Q2.2: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
![](https://i.imgur.com/3rqvkQB.png "linear")
| Number of processes | 2 | 4 | 8 | 12 | 16 |
|:-------------------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| Execution time | 9.908397 | 5.456350 | 2.688769 | 1.824668 | 1.364209 |
## Q3.1: Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.
![](https://i.imgur.com/IfHqj5u.png "tree")
| Number of processes | 2 | 4 | 8 | 16 |
|:-------------------:|:--------:|:--------:|:--------:|:--------:|
| Execution time | 9.839091 | 5.292457 | 2.740805 | 1.832486 |
## Q3.2: How does the performance of binary tree reduction compare to the performance of linear reduction?
- Binary tree reduction gives only a marginal improvement over linear reduction at these process counts.
## Q3.3: Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.
- The tree approach is expected to perform better as the number of processes grows.
- Both approaches transfer the same total number of messages ($P-1$), but in the linear approach the root receives them one after another, taking $P-1$ sequential steps, whereas the tree approach overlaps transfers so that only $\lceil \log_2 P \rceil$ communication rounds are needed. Since a message transfer costs much more than the CPU additions, fewer sequential rounds wins at large $P$.
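To make the message-cost comparison concrete, here is a small plain-Python sketch (no MPI) counting the sequential communication rounds each reduction approach needs for $P$ processes:

```python
import math

def linear_rounds(p):
    # The root receives the p-1 partial results one after another.
    return p - 1

def tree_rounds(p):
    # Ranks pair up each round, so the number of live ranks halves per step.
    return math.ceil(math.log2(p))

for p in (2, 4, 8, 16, 64):
    print(p, linear_rounds(p), tree_rounds(p))
```

The gap is small at 2-16 processes (which matches the measurements above), but widens quickly: at 64 processes, linear needs 63 sequential steps versus 6 rounds for the tree.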
## Q4.1: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
![](https://i.imgur.com/Xvsybdx.png "nonblocking")
| Number of processes | 2 | 4 | 8 | 12 | 16 |
|:-------------------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| Execution time | 9.797364 | 5.305192 | 2.683531 | 2.438963 | 1.862100 |
## Q4.2: What are the MPI functions for non-blocking communication?
- send: `MPI_Isend`
- receive: `MPI_Irecv`
- completion: `MPI_Wait` / `MPI_Test` (to wait for, or poll, a pending request)
## Q4.3: How does the performance of non-blocking communication compare to the performance of blocking communication?
- Non-blocking communication performs slightly better at small process counts, but falls behind the blocking version as the number of processes increases (in these measurements, from 12 processes on).
## Q5: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
![](https://i.imgur.com/oqxj6qc.png "gather")
| Number of processes | 2 | 4 | 8 | 12 | 16 |
|:-------------------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| Execution time | 10.049845 | 5.587510 | 3.200731 | 2.257737 | 2.097789 |
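Semantically, `MPI_Gather` collects one buffer from every rank and leaves the root with the buffers concatenated in rank order. A plain-Python sketch with hypothetical data:

```python
# local_buffers[r] models what rank r passes as its send buffer;
# the root ends up with the buffers concatenated in rank order.
def gather(local_buffers):
    return [x for buf in local_buffers for x in buf]

print(gather([[0, 1], [2, 3], [4, 5]]))  # [0, 1, 2, 3, 4, 5]
```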
## Q6: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (reduce)
![](https://i.imgur.com/8VXuo47.png "reduce")
| Number of processes | 2 | 4 | 8 | 12 | 16 |
|:-------------------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| Execution time | 10.105532 | 5.587154 | 2.961666 | 2.046166 | 2.500450 |
## Q7: Describe what approach(es) were used in your MPI matrix multiplication for each data set.
1. Node 0 sends each node its block of rows of matrix A according to the node's MPI rank, and sends the whole matrix B to every node.
2. Each node computes its partial result.
> node `rank` computes rows [$rank*\dfrac{n}{numnodes},(rank+1)*\dfrac{n}{numnodes}$) of the product
3. All nodes send their partial results back to `node0`, which outputs the final matrix.
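The row partitioning above can be sketched in plain Python (no MPI; the loop over ranks stands in for the parallel nodes, and the partial results are concatenated in rank order, as `node0` would do after receiving them):

```python
def matmul_rows(A, B, row_start, row_end):
    """Multiply only rows [row_start, row_end) of A by the full matrix B."""
    cols = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(cols)]
            for i in range(row_start, row_end)]

def simulate(A, B, numnodes):
    n = len(A)
    step = n // numnodes          # assumes numnodes divides n evenly
    C = []
    for rank in range(numnodes):  # each "node" works on its own row block
        C += matmul_rows(A, B, rank * step, (rank + 1) * step)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(simulate(A, B, 2))          # [[19, 22], [43, 50]], same as serial
```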