Parallel Programming Assignment IV
===
## Q1
### 1. How do you control the number of MPI processes on each node?
In `hosts`, append `slots=n` after each hostname, where `n` is the number of processes allowed on that host. (Example below.)
```
# hosts
pp1 slots=2
pp2 slots=2
pp3 slots=2
pp4 slots=2
```
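The slot limits take effect when the job is launched with this hostfile; with Open MPI, for example, `mpirun -np 8 --hostfile hosts ./program` (the executable name is a placeholder) places at most two processes on each of the four nodes above.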
### 2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
Use `MPI_Comm_rank()` to retrieve the rank of the calling process and `MPI_Comm_size()` to retrieve the total number of processes. (Example below.)
```c
...
// Get the total number of processes in the communicator
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
// Get the rank of the calling process
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
...
```
## Q2
### 1. Why are MPI_Send and MPI_Recv called “blocking” communication?
`MPI_Send` does not return until its send buffer can be safely reused, which may mean waiting until the matching `MPI_Recv` has started taking the message; likewise, `MPI_Recv` does not return until the message has been fully received. Because each call can stall the caller until the communication completes, they are called `blocking communication`.
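A minimal sketch of this behaviour, assuming at least two processes; the program and values are illustrative, not the assignment code:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;
    if (rank == 0) {
        /* Returns only once `value` may be safely overwritten, i.e. the
           message has been buffered or handed to the matching receive. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Returns only after the message has fully arrived in `value`. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```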
### 2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
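One common way to take these measurements is to bracket the measured region with `MPI_Wtime()`; a minimal sketch, assuming the workload sits where the placeholder comment is:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    double start = MPI_Wtime();
    /* ... the communication/computation being measured ... */
    double elapsed = MPI_Wtime() - start;

    if (world_rank == 0)
        printf("elapsed: %f s\n", elapsed);  /* one sample per run */
    MPI_Finalize();
    return 0;
}
```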

## Q3
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.

### 2. How does the performance of binary tree reduction compare to the performance of linear reduction?
From the measurements, binary tree reduction performs better than linear reduction.
### 3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.
Comparing binary tree reduction with linear reduction, **the execution time of tree reduction drops more sharply as the number of MPI processes increases**. Linear reduction makes rank 0 receive and add all P-1 partial sums itself, so the reduction costs O(P) sequential messages and rank 0 becomes the bottleneck; binary tree reduction spreads the additions over different processes, pairing them up so the whole reduction completes in O(log P) message steps.
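A minimal sketch of the tree pattern, assuming `long long` partial sums (the stand-in value replaces the real per-process result):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    long long local_sum = world_rank + 1;  /* stand-in partial result */

    /* Pairwise tree reduction: in each round, half of the remaining
       processes send their partial sum to a partner and drop out, so
       only O(log P) rounds are needed instead of P-1 serial receives. */
    for (int step = 1; step < world_size; step *= 2) {
        if (world_rank % (2 * step) == step) {
            MPI_Send(&local_sum, 1, MPI_LONG_LONG, world_rank - step, 0,
                     MPI_COMM_WORLD);
            break;                          /* this process is finished */
        } else if (world_rank % (2 * step) == 0 &&
                   world_rank + step < world_size) {
            long long partner;
            MPI_Recv(&partner, 1, MPI_LONG_LONG, world_rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local_sum += partner;
        }
    }
    if (world_rank == 0)
        printf("total = %lld\n", local_sum);  /* sum of 1..P here */
    MPI_Finalize();
    return 0;
}
```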
## Q4
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

### 2. What are the MPI functions for non-blocking communication?
**Non-blocking send:**
```c
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request);
```
**Non-blocking receive:**
```c
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request);
```
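Both calls return immediately and hand back an `MPI_Request`; completion is later established with `MPI_Wait()` or `MPI_Test()`. A minimal usage sketch, assuming two processes (the transferred value is illustrative):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    MPI_Request request = MPI_REQUEST_NULL;
    if (rank == 0) {
        value = 42;
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
    }

    /* Both calls return immediately, so other work can overlap the
       transfer here -- as long as `value` is not touched yet. */

    MPI_Wait(&request, MPI_STATUS_IGNORE);  /* communication complete */
    if (rank == 1)
        printf("rank 1 received %d\n", value);
    MPI_Finalize();
    return 0;
}
```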
### 3. How does the performance of non-blocking communication compare to the performance of blocking communication? (3 points)
Non-blocking communication performs slightly better than blocking communication. (Comparison plot below.)

## Q5
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

## Q6
### 1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.

## Q7
### 1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.
Because the data-transfer time exceeds the computation time, I split the rows of `a_mat` among the `world_size` processes: rank 0 (the master process) only sends data, then receives and prints the results, while the non-master processes each compute an interleaved (round-robin) subset of `a_mat`'s rows. `b_mat` is sent in full to every process. (Example trace below; a code sketch follows it.)
```
ex (indices are 0-based (row, col)):
input:
3 2 3
1 2
0 3
4 5
1 2 1
0 3 2
a_mat:
1 2
0 3
4 5
b_mat:
1 2 1
0 3 2
size = 3;
rank = 0, 1, 2;
c_mat = a_mat * b_mat
rank=0:
send a_mat(0, 0), a_mat(0, 1) to rank_1
send a_mat(1, 0), a_mat(1, 1) to rank_2
send a_mat(2, 0), a_mat(2, 1) to rank_1
send b_mat to rank_1, rank_2
receive c_mat(0, 0), c_mat(0, 1), c_mat(0, 2) from rank_1
receive c_mat(1, 0), c_mat(1, 1), c_mat(1, 2) from rank_2
receive c_mat(2, 0), c_mat(2, 1), c_mat(2, 2) from rank_1
rank=1:
receive a_mat(0, 0), a_mat(0, 1) from rank_0
receive a_mat(2, 0), a_mat(2, 1) from rank_0
receive b_mat from rank_0
calculate c_mat(0, 0), c_mat(0, 1), c_mat(0, 2)
calculate c_mat(2, 0), c_mat(2, 1), c_mat(2, 2)
rank=2:
receive a_mat(1, 0), a_mat(1, 1) from rank_0
receive b_mat from rank_0
calculate c_mat(1, 0), c_mat(1, 1), c_mat(1, 2)
```
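A compact sketch of this distribution, using the tiny matrices from the trace above; the variable names and the use of `MPI_Bcast` for `b_mat` are illustrative assumptions rather than the submitted code, and it must be run with at least two processes:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 3, m = 2, l = 3;                /* a_mat: n x m, b_mat: m x l */
    int a_mat[] = {1, 2, 0, 3, 4, 5};       /* hardcoded here for brevity */
    int b_mat[] = {1, 2, 1, 0, 3, 2};
    int workers = size - 1;                 /* rank 0 is the master */

    /* Every process gets the whole b_mat (in the real program only
       rank 0 would hold the input at this point). */
    MPI_Bcast(b_mat, m * l, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int c_mat[9];
        for (int r = 0; r < n; r++)         /* row r goes to worker 1 + r % workers */
            MPI_Send(&a_mat[r * m], m, MPI_INT, 1 + r % workers, r, MPI_COMM_WORLD);
        for (int r = 0; r < n; r++)         /* collect row r from the same worker */
            MPI_Recv(&c_mat[r * l], l, MPI_INT, 1 + r % workers, r,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < l; c++)
                printf("%d ", c_mat[r * l + c]);
            printf("\n");
        }
    } else {
        int my_rows[3][2], my_out[3], count = 0;
        /* Receive every assigned (interleaved) row first, ... */
        for (int r = rank - 1; r < n; r += workers)
            MPI_Recv(my_rows[count++], m, MPI_INT, 0, r,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... then compute each output row and send it back. */
        count = 0;
        for (int r = rank - 1; r < n; r += workers, count++) {
            for (int c = 0; c < l; c++) {   /* one output row = A row x b_mat */
                my_out[c] = 0;
                for (int k = 0; k < m; k++)
                    my_out[c] += my_rows[count][k] * b_mat[k * l + c];
            }
            MPI_Send(my_out, l, MPI_INT, 0, r, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```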