---
tags: parallel_programming
---

# Parallel programming HW4

## Q1.1: How do you control the number of MPI processes on each node?

- MPI launches processes according to the hostfile, assigning them to the listed machines one by one.

```
machine_1
machine_1
machine_2
```

- If we run 8 copies of the program with the hostfile defined as above, process 1 and process 2 are assigned to machine_1, process 3 to machine_2, process 4 back to machine_1, and so on.

> https://stackoverflow.com/questions/14411763/mapping-mpi-processes-to-particular-nodes

## Q1.2: Which functions do you use for retrieving the rank of an MPI process and the total number of processes?

- `MPI_Comm_size` returns the total number of processes in the communicator.
- `MPI_Comm_rank` returns the rank of the calling process.

## Q2.1: Why are MPI_Send and MPI_Recv called “blocking” communication?

- Because each call does not return until the operation is locally complete: `MPI_Recv` blocks until the message has been received into its buffer, and `MPI_Send` blocks until the send buffer is safe to reuse. Since the result must be collected by the main process, this communication is effectively synchronous, and the program is blocked until those operations finish.

## Q2.2: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

![](https://i.imgur.com/3rqvkQB.png "linear")

| Number of processes | 2 | 4 | 8 | 12 | 16 |
|:-------------------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| Execution time | 9.908397 | 5.456350 | 2.688769 | 1.824668 | 1.364209 |

## Q3.1: Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.

![](https://i.imgur.com/IfHqj5u.png "tree")

| Number of processes | 2 | 4 | 8 | 16 |
|:-------------------:|:--------:|:--------:|:--------:|:--------:|
| Execution time | 9.839091 | 5.292457 | 2.740805 | 1.832486 |

## Q3.2: How does the performance of binary tree reduction compare to the performance of linear reduction?

- Binary tree reduction shows little improvement over linear reduction at these process counts.
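The difference between the two reduction patterns can be sketched without MPI at all. The following Python simulation (illustrative only, not the graded code; the round-counting model is an assumption about how the two reductions schedule their messages) counts the sequential communication rounds at the bottleneck process:

```python
# Compare the communication structure of linear vs. binary-tree reduction.
# Both schemes send the same total number of messages (P - 1); what differs
# is how many of those transfers must happen one after another.

def linear_reduction_rounds(p):
    # Rank 0 receives one message from each of the other p - 1 ranks in turn,
    # so the receive operations are fully serialized at rank 0.
    return p - 1

def tree_reduction_rounds(p):
    # In each round, surviving ranks pair up and one of each pair sends its
    # partial sum to its partner, halving the active ranks: ceil(log2(p)) rounds.
    rounds = 0
    while p > 1:
        p = (p + 1) // 2
        rounds += 1
    return rounds

for p in (2, 4, 8, 16):
    print(p, linear_reduction_rounds(p), tree_reduction_rounds(p))
```

This is why the tree shape only pays off once the serialized receives at rank 0 dominate; at small process counts, as measured above, the two are close.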
## Q3.3: Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.

- The linear approach may perform better.
- The tree approach needs more message transfers, and each transfer takes much more time than the CPU calculation it saves.

## Q4.1: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

![](https://i.imgur.com/Xvsybdx.png "nonblocking")

| Number of processes | 2 | 4 | 8 | 12 | 16 |
|:-------------------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| Execution time | 9.797364 | 5.305192 | 2.683531 | 2.438963 | 1.862100 |

## Q4.2: What are the MPI functions for non-blocking communication?

- send: `MPI_Isend`
- receive: `MPI_Irecv`
- completion: `MPI_Wait` / `MPI_Test`

## Q4.3: How does the performance of non-blocking communication compare to the performance of blocking communication?

- Non-blocking communication performs better while the number of processes is small, but becomes worse than the blocking version as the number of processes increases.

## Q5: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

![](https://i.imgur.com/oqxj6qc.png "gather")

| Number of processes | 2 | 4 | 8 | 12 | 16 |
|:-------------------:|:---------:|:--------:|:--------:|:--------:|:--------:|
| Execution time | 10.049845 | 5.587510 | 3.200731 | 2.257737 | 2.097789 |

## Q6: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (reduce)

![](https://i.imgur.com/8VXuo47.png "reduce")

| Number of processes | 2 | 4 | 8 | 12 | 16 |
|:-------------------:|:---------:|:--------:|:--------:|:--------:|:--------:|
| Execution time | 10.105532 | 5.587154 | 2.961666 | 2.046166 | 2.500450 |

## Q7: Describe what approach(es) were used in your MPI matrix multiplication for each data set.

1. Node 0 sends each node its portion of matrix A according to the node's MPI rank, while sending the whole matrix B to all nodes.
2. Each node calculates its partial result, covering rows $[rank \cdot \dfrac{n}{numnodes},\ (rank+1) \cdot \dfrac{n}{numnodes})$.
3. All partial results are sent back to `node0`, which outputs the final matrix.
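The block-row partition described above can be sketched in plain Python (a minimal illustration, not the submitted MPI code; the function names are hypothetical, and the loop over ranks stands in for what the MPI processes do concurrently):

```python
# Sketch of the block-row partition used in the matrix multiplication:
# each "node" owns rows [rank*n//size, (rank+1)*n//size) of A, multiplies
# them by the full matrix B, and the partial results are concatenated.

def matmul_rows(a_rows, b):
    # Multiply a slice of A's rows by the full matrix B.
    n_cols = len(b[0])
    return [[sum(a_rows[i][k] * b[k][j] for k in range(len(b)))
             for j in range(n_cols)] for i in range(len(a_rows))]

def partitioned_matmul(a, b, size):
    n = len(a)
    result = []
    for rank in range(size):  # each iteration plays the role of one MPI rank
        lo, hi = rank * n // size, (rank + 1) * n // size
        result.extend(matmul_rows(a[lo:hi], b))  # this node's partial product
    return result  # stands in for gathering the row blocks at node 0

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(partitioned_matmul(a, b, 2))  # → [[19, 22], [43, 50]], the full A*B
```

Because the row blocks are disjoint and ordered by rank, concatenating them at `node0` reproduces the full product without any further reduction step.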