# HW4 Report
### Q1
##### 1.
The way to control the number of processes on each node is to modify the hostfile by adding `slots=N` (N is a positive integer) after the node name. For example, `pp5 slots=8` means that node pp5 is allowed to run up to 8 processes, which can be more processes than the number of cores it is equipped with.
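For reference, a minimal hostfile and launch command might look like the following (the second node name and the executable name are made up for illustration; only pp5 appears in this report):

```
$ cat hostfile
pp2 slots=4        # pp2 may run up to 4 processes
pp5 slots=8        # pp5 may run up to 8 processes
$ mpirun -np 12 --hostfile hostfile ./a.out
```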
##### 2.
MPI_Comm_rank(MPI_Comm comm, int *rank) is the function that gives us the rank of the calling process within the communicator.
MPI_Comm_size(MPI_Comm comm, int *size) is the function that gives us the total number of processes in the communicator.
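A minimal sketch of how the two calls are used together (the printed message is just for illustration):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes  */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```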
### Q2
##### 1.
A blocking function does not return until the communication it started is finished. MPI_Send() copies the outgoing data through a buffer, and it only returns when the send buffer can safely be reused. Similarly, MPI_Recv() only returns after the incoming message has actually been placed into the user's receive buffer, so the buffer must have enough space prepared to hold the data.
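A minimal sketch of a blocking send/receive pair, assuming the program is launched with at least two processes:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        /* MPI_Send may block until the message is buffered or received */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* MPI_Recv blocks until the message has arrived in `value` */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```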
##### 2.


### Q3
##### 1.


##### 2.
As the graph above shows, there is not a large difference when the number of processes is two. However, as the number of processes increases, we can observe that block_tree costs less time than block_linear, and its curve also slopes downward more steeply.
##### 3.
When the number of processes increases, the advantage of the tree approach becomes obvious. In my opinion, both versions use blocking functions to communicate, but the tree approach spreads the MPI_Recv() calls across all the nodes in each iteration. On the other hand, the linear approach concentrates all the MPI_Recv() calls on the master process at once, which can fill its buffer and keep it blocked in MPI_Recv() for longer.
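To make the two patterns concrete, below is an illustrative sketch of both reductions with blocking calls; this is not the actual assignment code. Each rank contributes the value rank + 1, the variable names are my own, and the tree part assumes the number of processes is a power of two:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long long mine = rank + 1, remote;

    /* --- linear: rank 0 receives from every other rank, one at a time --- */
    long long linear = mine;
    if (rank == 0) {
        for (int src = 1; src < size; src++) {
            MPI_Recv(&remote, 1, MPI_LONG_LONG, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            linear += remote;
        }
        printf("linear sum = %lld\n", linear);
    } else {
        MPI_Send(&mine, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD);
    }

    /* --- tree: receives are spread over the ranks, log2(size) rounds ---
     * (assumes size is a power of two)                                   */
    long long tree = mine;
    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) == 0) {
            MPI_Recv(&remote, 1, MPI_LONG_LONG, rank + step, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            tree += remote;
        } else {
            MPI_Send(&tree, 1, MPI_LONG_LONG, rank - step, 1, MPI_COMM_WORLD);
            break;  /* this rank has passed its partial sum up and is done */
        }
    }
    if (rank == 0)
        printf("tree sum   = %lld\n", tree);

    MPI_Finalize();
    return 0;
}
```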
### Q4
##### 1.


##### 2.
MPI_Isend() and MPI_Irecv() are the functions used for non-blocking communication.
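A minimal sketch of a non-blocking exchange between ranks 0 and 1, assuming at least two processes; the calls return immediately, and MPI_Waitall() is what actually completes them:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int send_val = rank, recv_val = -1;
    if (rank < 2) {
        int peer = 1 - rank;
        MPI_Request reqs[2];

        MPI_Isend(&send_val, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recv_val, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* other computation could overlap with the communication here */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank %d received %d\n", rank, recv_val);
    }

    MPI_Finalize();
    return 0;
}
```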
##### 3.
The two versions perform very similarly when the number of processes is small, but as the number of processes increases, the non-blocking version performs better than the blocking one.
### Q5
##### 1.


### Q6
##### 1.


### Q7
##### 1.


##### 2.
I found that the tree method has the best performance. I think this is because its communication is scattered among the nodes, so it takes less time than the other methods. However, there is not a dramatic gap between the approaches, and other uncontrolled factors also need to be considered, such as the number of users on the workstation at the same time.
### Q8
##### 1.




##### 2.
Case 1: latency = 1.5331876e-10, bandwidth = 6,526,038,551.36.
Case 2: latency = 8.5881671e-9, bandwidth = 116,440,583.19.
### Q9
I found out that no process can do the scanf work except the master process. Therefore, the master process loads the data and sends it to each process. To reduce wasted memory, my method has each process store an equal number of entries. For example, if 4 processes handle a 4x4 matrix A multiplied by a 4x8 matrix B, the rank-0 process holds row 0 of A and columns 0 and 1 of B, rank 1 holds row 1 of A and columns 2 and 3 of B, and so on.
In the multiplication stage, I use MPI_Bcast() so that each process sends its stored content to the other processes for the computation. Each process then ends up with its own part of the result: rank 0 holds the first 4x2 partition of the result matrix, rank 1 holds the second 4x2 partition, and so on. In the last step, the master process uses MPI_Gather() to collect all the partial results and assemble them into the final result matrix.
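As a rough illustration of the same Bcast/Gather idea, here is a simplified sketch of my own (not the submitted code): A is broadcast whole instead of row by row, the columns of B are scattered instead of pre-assigned, and the sizes are the 4x4 times 4x8 example above. It assumes the number of columns of B is divisible by the number of processes:

```c
#include <mpi.h>
#include <stdio.h>

#define M 4            /* rows of A and C        */
#define K 4            /* cols of A = rows of B  */
#define N 8            /* cols of B and C        */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int cols = N / size;                 /* columns of B owned by each rank   */
    double A[M * K], B[K * N], C[M * N];
    double Bpart[K * N], Cpart[M * N];   /* oversized; only K*cols / M*cols used */

    if (rank == 0) {                     /* master fills (or scanf's) the input */
        for (int i = 0; i < M * K; i++) A[i] = i;
        for (int i = 0; i < K * N; i++) B[i] = 1;
    }

    /* B and C are stored column-major so each rank's column block is contiguous */
    MPI_Bcast(A, M * K, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(B, K * cols, MPI_DOUBLE, Bpart, K * cols, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    /* each rank computes its own column block of C */
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < M; i++) {
            double sum = 0;
            for (int k = 0; k < K; k++)
                sum += A[i * K + k] * Bpart[j * K + k];
            Cpart[j * M + i] = sum;
        }

    MPI_Gather(Cpart, M * cols, MPI_DOUBLE, C, M * cols, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0)                       /* C is column-major: C(i,j) = C[j*M+i] */
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) printf("%6.1f ", C[j * M + i]);
            printf("\n");
        }

    MPI_Finalize();
    return 0;
}
```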