# Programming Assignment HW4
## Q1
> How do you control the number of MPI processes on each node? (3 points)
**Ans:**
Add `slots=N` after each host entry in the hostfile, where N is the maximum number of processes that host should run. In this assignment the maximum is 16.
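A minimal sketch of what this looks like, assuming Open MPI and hypothetical node names:

```
# hostfile — one line per node; slots caps how many processes
# mpirun may place on that node (node names are placeholders)
node1 slots=16
node2 slots=16
```

It would then be launched with something like `mpirun --hostfile hostfile -np 16 ./hw4`, and mpirun fills each node up to its slot count.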
> Which functions do you use for retrieving the rank of an MPI process and the total number of processes? (2 points)
**Ans:**
Retrieve the total number of processes:
```c=
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
```
Retrieve the rank of an MPI process:
```c=
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
```
## Q2
> Why MPI_Send and MPI_Recv are called “blocking” communication? (2 points)
**Ans:**
When `MPI_Send` is executed, it waits until the matching `MPI_Recv` in another process can take the data (or the message has been safely buffered), so the program stalls and cannot execute the following instructions. Likewise, `MPI_Recv` waits until the data sent by `MPI_Send` has actually arrived before continuing. Because both calls stall the caller until the communication completes, they are called "blocking" communication.
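A minimal sketch of this behavior, assuming the program is launched with exactly two processes:

```c
/* Blocking point-to-point exchange between rank 0 and rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        /* Does not return until the message is safely handed off. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Does not return until the data has arrived in `value`. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}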
> Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| Number of processes | Runtime (s) |
| -------- | -------- |
|2| 6.458812 |
|4| 3.292482 |
|8| 1.650646 |
|12| 1.129848 |
|16| 0.842075 |
## Q3
> Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it. (1 points)

| Number of processes | Runtime (s) |
| -------- | -------- |
|2| 6.499463 |
|4| 3.321890 |
|8| 1.673447 |
|16| 0.855035 |
> How does the performance of binary tree reduction compare to the performance of linear reduction? (3 points)
**Ans:**

I plotted the two sets of measurements together; the observed difference between them is negligible.
> Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs. (3 points)
**Ans:**
Conclusion: as the number of processes increases, either approach may win.
If the time spent in `MPI_Recv()` is large relative to each process's compute time, the linear reduction (the root performs P−1 receives one after another) takes longer, so the tree reduction (only ⌈log2 P⌉ communication rounds) is faster; otherwise the linear reduction can be faster. So with many processes the outcome depends on the ratio of communication cost to computation cost.
## Q4
> Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| Number of processes | Runtime (s) |
| -------- | -------- |
|2| 6.340269 |
|4| 3.277678 |
|8| 1.646690 |
|12| 1.129992 |
|16| 0.916773 |
> What are the MPI functions for non-blocking communication? (2 points)
**Ans:**
- `MPI_Isend()`: begins a non-blocking send
- `MPI_Irecv()`: begins a non-blocking receive
- `MPI_Wait()` / `MPI_Waitall()`: waits for one MPI request to complete / waits for all given MPI requests to complete
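A minimal usage sketch of these calls (assuming `rank` and a partner rank `peer` have already been set up with `MPI_Comm_rank`):

```c
/* Overlap communication with computation using non-blocking calls. */
MPI_Request reqs[2];
int send_buf = rank, recv_buf;

MPI_Isend(&send_buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(&recv_buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

/* ... independent computation can run here while messages are in flight ... */

/* Block only when the results are actually needed. */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
```

The potential speedup comes entirely from the work placed between the `MPI_Isend`/`MPI_Irecv` calls and the `MPI_Waitall`; with nothing to overlap, the behavior degenerates to the blocking case.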
> How the performance of non-blocking communication compares to the performance of blocking communication? (3 points)
**Ans:**
In my measurements the difference is small, probably because `MPI_Wait()` is still used: the final sum cannot be computed until every partial result has arrived, so there is little work to overlap with the communication. With 16 processes the non-blocking version even appears slightly slower.
## Q5
> Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| Number of processes | Runtime (s) |
| -------- | -------- |
|2| 6.344678 |
|4| 3.277678 |
|8| 1.648714 |
|12| 1.116314 |
|16| 0.833355 |
## Q6
> Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)

| Number of processes | Runtime (s) |
| -------- | -------- |
|2| 6.353718 |
|4| 3.310183 |
|8| 1.651244 |
|12| 1.124145 |
|16| 0.849721 |
## Q7
> Describe what approach(es) were used in your MPI matrix multiplication for each data set.
**Ans:**
First, the work is split between a MASTER (rank = 0) and WORKERs (rank > 0). The MASTER performs the load balancing: it divides matrix A evenly into chunks of rows and sends each WORKER the information it needs, namely the offset at which the WORKER's answer goes back into matrix C, the number of rows that WORKER must compute, and the elements of its rows of A together with all of B. After sending, the MASTER waits to receive the computed results from every WORKER.
Each WORKER, in turn, first receives its assigned workload information from the MASTER, then performs the computation, and finally sends its answer back to the MASTER.
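The scheme described above might be sketched roughly as follows. This is an illustrative outline, not the submitted code: `N`, `MASTER`, the tags `FROM_MASTER`/`FROM_WORKER`, and the matrices `a`, `b`, `c` are assumed to be defined elsewhere, and for simplicity it assumes `N` divides evenly among the workers.

```c
/* MASTER/WORKER row partitioning (simplified sketch). */
if (rank == MASTER) {
    int rows = N / (size - 1), offset = 0;
    for (int dest = 1; dest < size; dest++) {   /* distribute workload */
        MPI_Send(&offset, 1, MPI_INT, dest, FROM_MASTER, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, dest, FROM_MASTER, MPI_COMM_WORLD);
        MPI_Send(&a[offset][0], rows * N, MPI_DOUBLE, dest, FROM_MASTER, MPI_COMM_WORLD);
        MPI_Send(&b[0][0], N * N, MPI_DOUBLE, dest, FROM_MASTER, MPI_COMM_WORLD);
        offset += rows;
    }
    for (int src = 1; src < size; src++) {      /* collect results into C */
        MPI_Recv(&offset, 1, MPI_INT, src, FROM_WORKER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&rows, 1, MPI_INT, src, FROM_WORKER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&c[offset][0], rows * N, MPI_DOUBLE, src, FROM_WORKER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
} else {
    /* WORKER: receive workload, compute its rows of C, send them back. */
    int rows, offset;
    MPI_Recv(&offset, 1, MPI_INT, MASTER, FROM_MASTER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&rows, 1, MPI_INT, MASTER, FROM_MASTER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&a[0][0], rows * N, MPI_DOUBLE, MASTER, FROM_MASTER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&b[0][0], N * N, MPI_DOUBLE, MASTER, FROM_MASTER, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    MPI_Send(&offset, 1, MPI_INT, MASTER, FROM_WORKER, MPI_COMM_WORLD);
    MPI_Send(&rows, 1, MPI_INT, MASTER, FROM_WORKER, MPI_COMM_WORLD);
    MPI_Send(&c[0][0], rows * N, MPI_DOUBLE, MASTER, FROM_WORKER, MPI_COMM_WORLD);
}
```

Sending the offset back with each result lets the MASTER place answers correctly even if replies arrive out of order.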