Parallel programming HW4

Q1.1: How do you control the number of MPI processes on each node?

  • MPI launches processes according to the hostfile passed to the launcher (e.g. `mpirun --hostfile <file>`), assigning them to the listed machines one by one; listing a machine several times places that many processes on it. For example, the hostfile below puts two processes on machine_1 and one on machine_2:
machine_1
machine_1
machine_2

Q1.2: Which functions do you use for retrieving the rank of an MPI process and the total number of processes?

  • MPI_Comm_size is used to get the total number of processes.
  • MPI_Comm_rank is used to get the rank of the current process (see the sketch below).
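
A minimal sketch of how the two calls are typically used together (variable names are ours, not from the assignment code):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* rank of this process: 0..world_size-1 */

    printf("process %d of %d\n", world_rank, world_size);

    MPI_Finalize();
    return 0;
}
```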

Q2.1: Why are MPI_Send and MPI_Recv called “blocking” communication?

  • They are called blocking because each call does not return until the operation completes: MPI_Send blocks until the send buffer can safely be reused, and MPI_Recv blocks until the message has arrived in the receive buffer. Since the result must be received by the main machine, the communication is synchronous, and the program waits at these calls until they finish, as illustrated below.
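
For illustration, a minimal blocking exchange between two ranks (not the homework code):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;
    if (rank == 1) {
        /* Blocks until the send buffer can safely be reused. */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        /* Blocks until the message has arrived in `value`. */
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```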

Q2.2: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

| Number of processes | 2 | 4 | 8 | 12 | 16 |
| --- | --- | --- | --- | --- | --- |
| Execution time (s) | 9.908397 | 5.456350 | 2.688769 | 1.824668 | 1.364209 |

Q3.1: Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.

| Number of processes | 2 | 4 | 8 | 16 |
| --- | --- | --- | --- | --- |
| Execution time (s) | 9.839091 | 5.292457 | 2.740805 | 1.832486 |

Q3.2: How does the performance of binary tree reduction compare to the performance of linear reduction?

  • Binary tree reduction yields little improvement: its execution times are nearly identical to those of linear reduction.

Q3.3: Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.

  • The linear approach may perform better.
  • The tree approach introduces intermediate forwarding of partial results between nodes, and on this cluster a message transfer costs much more than the CPU computation it replaces (a sketch of the tree pattern follows).
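
For reference, a minimal sketch of the binary-tree reduction pattern being compared; `tree_reduce` and the toy payload are our illustration, not the assignment code:

```c
#include <stdio.h>
#include <mpi.h>

/* Binary-tree reduction of one long per process onto rank 0.
 * In round k (step = 2^k), each surviving process whose rank has
 * bit k set sends its partial sum down the tree and drops out. */
long tree_reduce(long local, int rank, int size) {
    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) != 0) {
            MPI_Send(&local, 1, MPI_LONG, rank - step, 0, MPI_COMM_WORLD);
            break;                      /* this process is done */
        } else if (rank + step < size) {
            long partner;
            MPI_Recv(&partner, 1, MPI_LONG, rank + step, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            local += partner;           /* fold in the partner's partial sum */
        }
    }
    return local;                       /* final sum is valid on rank 0 */
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long total = tree_reduce((long)rank, rank, size);  /* toy payload */
    if (rank == 0) printf("sum of ranks = %ld\n", total);

    MPI_Finalize();
    return 0;
}
```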

Q4.1: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

| Number of processes | 2 | 4 | 8 | 12 | 16 |
| --- | --- | --- | --- | --- | --- |
| Execution time (s) | 9.797364 | 5.305192 | 2.683531 | 2.438963 | 1.862100 |

Q4.2: What are the MPI functions for non-blocking communication?

  • send: MPI_Isend
  • receive: MPI_Irecv
  • completion: MPI_Wait / MPI_Test (see the sketch below)
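
A minimal sketch of the non-blocking pair together with MPI_Wait (illustrative only, not the homework code):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;
    MPI_Request req;
    if (rank == 1) {
        /* Returns immediately; the send completes in the background. */
        MPI_Isend(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... other work can overlap the transfer here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* buffer reusable after this */
    } else if (rank == 0) {
        MPI_Irecv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... other work can overlap the transfer here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* message fully received */
        printf("received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```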

Q4.3: How does the performance of non-blocking communication compare to the performance of blocking communication?

  • Non-blocking communication performs better when the number of processes is small, but it becomes slower than the blocking version as the number of processes increases.

Q5: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

| Number of processes | 2 | 4 | 8 | 12 | 16 |
| --- | --- | --- | --- | --- | --- |
| Execution time (s) | 10.049845 | 5.587510 | 3.200731 | 2.257737 | 2.097789 |

Q6: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (reduce)

| Number of processes | 2 | 4 | 8 | 12 | 16 |
| --- | --- | --- | --- | --- | --- |
| Execution time (s) | 10.105532 | 5.587154 | 2.961666 | 2.046166 | 2.500450 |

Q7: Describe what approach(es) were used in your MPI matrix multiplication for each data set.

  1. Send each node its portion of matrix A according to its MPI rank, while sending the whole matrix B to all nodes.
  2. Calculate the partial result:

    each node computes rows $[\, \text{rank} \cdot n / \text{num\_nodes},\ (\text{rank}+1) \cdot n / \text{num\_nodes} \,)$ of the product (sketched below)

  3. Send all partial results back to node 0, which outputs the final matrix.
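
A minimal sketch of the row partitioning in step 2; the names `partial_matmul` and `num_nodes` are ours, and it assumes every rank addresses row-major n×n buffers by global row index:

```c
/* Sketch of step 2's row partitioning. Each rank multiplies its strip
 * of A by the full matrix B, filling only its own rows of C. */
void partial_matmul(const double *A, const double *B, double *C,
                    int n, int rank, int num_nodes) {
    int row_begin = rank * n / num_nodes;        /* first row owned by this rank */
    int row_end   = (rank + 1) * n / num_nodes;  /* one past the last owned row */

    for (int i = row_begin; i < row_end; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```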