# PP Assignment 4
###### tags: `PP`
:::info
**Webpage**
[Description](https://nycu-sslab.github.io/PP-f22/assignments/HW4/)
:::
## Q1
:::success
1. How do you control the number of MPI processes on each node?
2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
:::
1. As described in the assignment instructions, you can append ```slots=n``` after each machine in the ```hostfile``` to cap the number of MPI processes allowed on that node; ```mpirun --hostfile hostfile``` then respects these limits.
```bash!
pp1 slots=1 # at most 1 process on pp1
pp3 slots=2 # at most 2 processes on pp3
```
2.
```cc!
// rank of the calling MPI process within the communicator
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
// total number of processes in the communicator
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
```
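For context, a minimal self-contained sketch showing both calls in a complete program (the output format is illustrative, not part of the assignment code):
```cc
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); // this process's rank
    MPI_Comm_size(MPI_COMM_WORLD, &world_size); // total process count

    printf("rank %d of %d\n", world_rank, world_size);

    MPI_Finalize();
    return 0;
}
```
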
## Q2
:::success
<MPI BLOCKING COMMUNICATION & LINEAR REDUCTION ALGORITHM>
1. Why MPI_Send and MPI_Recv are called “blocking” communication?
2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
:::
1. ```MPI_Send``` and ```MPI_Recv``` are called "blocking" because they do not return until the operation is locally complete: ```MPI_Send``` returns only once the send buffer is safe to reuse, and ```MPI_Recv``` only once the message has arrived in the receive buffer. The process therefore cannot continue past the call until then. (A sketch of the measured linear reduction follows the table below.)
2.
|MPI processes|1|2|4|8|12|16|
|-|-|-|-|-|-|-|
|Time (s)|11.80677|6.106424|3.205323|2.189460|1.485867|1.049591|
|Speedup (×)|1|1.93350|3.68349|5.39255|7.94605|11.24892|

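As a reference for what was measured, a minimal sketch of the blocking linear reduction, assuming each rank already holds its partial result in an illustrative variable ```local_count```:
```cc
// Linear reduction: rank 0 receives one message from every other rank
// in turn and accumulates; all other ranks send once and are done.
long long total = local_count;
if (world_rank == 0) {
    for (int src = 1; src < world_size; src++) {
        long long partial;
        MPI_Recv(&partial, 1, MPI_LONG_LONG, src, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        total += partial;
    }
} else {
    MPI_Send(&local_count, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD);
}
```
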
## Q3
:::success
<MPI BLOCKING COMMUNICATION & BINARY TREE REDUCTION COMMUNICATION ALGORITHM>
1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.
2. How does the performance of binary tree reduction compare to the performance of linear reduction?
3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.
:::
1.
|MPI processes|1|2|4|8|12|16|
|-|-|-|-|-|-|-|
|Time (s)|12.193638|6.071536|3.111489|2.488243|1.540553|1.657527|
|Speedup (×)|1|2.00833|3.91891|4.90050|7.91510|7.35652|

2.

    As the tables show, the two are close at small process counts (differences mostly under 0.1 s, with the tree version even slightly ahead at 2 and 4 processes), but linear pulls ahead at 8 and especially 16 processes, so overall linear reduction performs better than binary tree reduction here.
3. The gap between linear and binary tree also tends to widen as the number of processes grows, so with more processes the linear approach still performs better in our runs. A plausible reason: both approaches send P−1 messages in total, but with the tiny messages used here per-message latency dominates, and the tree's level-by-level ordering makes ranks wait for their children at each round, so any skew propagates up the tree, while in the linear version all senders proceed fully in parallel and rank 0's sequential receives of tiny messages are cheap (see the sketch below).
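For comparison, a minimal sketch of the binary tree reduction pattern, shown for simplicity assuming ```world_size``` is a power of two, with the same illustrative ```local_count``` as above:
```cc
// Binary tree reduction: at each step, half of the still-active ranks
// send their partial sum to a partner and drop out.
long long total = local_count;
for (int step = 1; step < world_size; step <<= 1) {
    if (world_rank % (2 * step) != 0) {
        // This rank's partner is "step" ranks below; send and stop.
        MPI_Send(&total, 1, MPI_LONG_LONG, world_rank - step, 0,
                 MPI_COMM_WORLD);
        break;
    } else if (world_rank + step < world_size) {
        long long partial;
        MPI_Recv(&partial, 1, MPI_LONG_LONG, world_rank + step, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        total += partial; // fold in the partner's partial sum
    }
}
// After the loop, rank 0 holds the global total.
```
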
## Q4
:::success
<MPI NON-BLOCKING COMMUNICATION & LINEAR REDUCTION ALGORITHM>
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
2. What are the MPI functions for non-blocking communication?
3. How the performance of non-blocking communication compares to the performance of blocking communication?
:::
1.
|MPI processes|1|2|4|8|12|16|
|-|-|-|-|-|-|-|
|Time (s)|11.930514|6.122414|3.135446|1.596679|1.084176|0.809097|
|Speedup (×)|1|1.94866|3.80505|7.47208|11.00422|14.74547|

2. The non-blocking point-to-point functions are ```MPI_Isend``` and ```MPI_Irecv```; their completion is checked with ```MPI_Wait```, ```MPI_Waitall```, or ```MPI_Test```. Collectives also have non-blocking variants such as ```MPI_Igather``` (```MPI_Gather``` itself is blocking). A sketch using them appears after this question's answers.
3.

    As the chart shows, the advantage of non-blocking over blocking communication grows with the number of processes; the measured execution time drops noticeably at the larger process counts.
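A minimal sketch of the non-blocking linear reduction measured above, assuming at most 64 processes and the same illustrative ```local_count```:
```cc
// Non-blocking linear reduction: rank 0 posts all receives up front so
// messages can arrive in any order, then completes them with MPI_Waitall.
long long partial[64];   // assumed upper bound on world_size
MPI_Request reqs[64];
long long total = local_count;
if (world_rank == 0) {
    for (int src = 1; src < world_size; src++)
        MPI_Irecv(&partial[src], 1, MPI_LONG_LONG, src, 0,
                  MPI_COMM_WORLD, &reqs[src - 1]);
    MPI_Waitall(world_size - 1, reqs, MPI_STATUSES_IGNORE);
    for (int src = 1; src < world_size; src++)
        total += partial[src];
} else {
    MPI_Request req;
    MPI_Isend(&local_count, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```
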
## Q5
:::success
<MPI COLLECTIVE: MPI_GATHER>
Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
:::
|MPI processes|1|2|4|8|12|16|
|-|-|-|-|-|-|-|
|Time (s)|11.805508|6.041052|3.115950|1.586252|1.084110|0.810262|
|Speedup (×)|1|1.95421|3.78873|7.44239|10.88959|14.56999|

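A minimal sketch of the ```MPI_Gather``` variant, assuming the same illustrative ```local_count``` and at most 64 processes:
```cc
// MPI_Gather: every rank contributes one value; rank 0 receives them all
// in a single collective call and sums them locally.
long long gathered[64];  // assumed upper bound on world_size
MPI_Gather(&local_count, 1, MPI_LONG_LONG,
           gathered, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);
if (world_rank == 0) {
    long long total = 0;
    for (int i = 0; i < world_size; i++)
        total += gathered[i];
}
```
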
## Q6
:::success
<MPI COLLECTIVE: MPI_REDUCE>
Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
:::
|MPI processes|1|2|4|8|12|16|
|-|-|-|-|-|-|-|
|Time (s)|11.815341|6.054686|3.121058|1.570721|1.076561|0.808215|
|Speedup (×)|1|1.95144|3.78568|7.52224|10.97508|14.61906|

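A minimal sketch of the ```MPI_Reduce``` variant, which replaces the gather-then-sum with a single collective (same illustrative ```local_count```):
```cc
// MPI_Reduce: combine every rank's local_count with MPI_SUM; only rank 0
// receives the final total.
long long total = 0;
MPI_Reduce(&local_count, &total, 1, MPI_LONG_LONG, MPI_SUM, 0,
           MPI_COMM_WORLD);
```
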
## Q7
:::success
<Matrix Multiplication with MPI>
Describe what approach(es) were used in your MPI matrix multiplication for each data set.
:::
- First, the root process reads in both matrices and broadcasts them with ```MPI_Bcast```, so every other process receives the matrix contents:
```cc=
// Rank 0 broadcasts both input matrices to all processes.
MPI_Bcast(*a_mat_ptr, matrix_a_size, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(*b_mat_ptr, matrix_b_size, MPI_INT, 0, MPI_COMM_WORLD);
```
- Each process then takes an equal share of the computation, partitioned by rows of the first matrix (sketched below); once the partial products are computed, the shared result matrix is combined with ```MPI_Reduce```:
```cc=
// Sum all processes' partial result matrices into `result` on rank 0.
MPI_Reduce(local_result, result, total_size, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
```
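For completeness, a sketch of the per-process computation this describes, assuming matrices of size ```n × m``` and ```m × l``` stored row-major, each rank handling a contiguous row block, and ```local_result``` zero-initialized (all names here are illustrative):
```cc
// Each rank computes its row block of C = A * B; rows outside the block
// stay zero, so the later MPI_Reduce with MPI_SUM assembles the full C.
int rows_per_rank = (n + world_size - 1) / world_size;
int row_begin = world_rank * rows_per_rank;
int row_end = row_begin + rows_per_rank;
if (row_end > n) row_end = n;

for (int i = row_begin; i < row_end; i++)
    for (int j = 0; j < l; j++) {
        int sum = 0;
        for (int k = 0; k < m; k++)
            sum += a_mat[i * m + k] * b_mat[k * l + j];
        local_result[i * l + j] = sum;
    }
```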