###### tags: `Parallel Programming`
# Assignment 4 Report
`0811510 許承壹`
> Q1-1: How do you control the number of MPI processes on each node?
* In Open MPI version 4.0.7, we can use the `-N` option of `mpirun` to control how many processes run on each node, for example:
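A minimal example invocation (the executable name `pi` and the hostfile `hosts` are hypothetical):
```shell
# Launch 4 MPI processes on each node listed in the hostfile
mpirun -N 4 --hostfile hosts ./pi
```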
> Q1-2: Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
>
```cpp=
int world_size, world_rank;
MPI_Comm_size(MPI_COMM_WORLD, &world_size); // total number of processes
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); // rank of the calling process
```
> Q2-1: Why are MPI_Send and MPI_Recv called “blocking” communication?
* Because these functions do not return until the operation has fully completed, the program cannot continue past the call until the send or receive has finished, as sketched below.
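For illustration, a minimal, self-contained sketch of this blocking pattern, with a placeholder value standing in for each rank's real partial result:
```cpp=
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    long long local_count = world_rank;  // placeholder for a real partial result

    if (world_rank != 0) {
        // MPI_Send does not return until the send buffer is safe to reuse.
        MPI_Send(&local_count, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD);
    } else {
        long long total = local_count;
        for (int src = 1; src < world_size; ++src) {
            long long tmp;
            // MPI_Recv blocks until the message from `src` has arrived.
            MPI_Recv(&tmp, 1, MPI_LONG_LONG, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += tmp;
        }
        printf("total = %lld\n", total);
    }

    MPI_Finalize();
    return 0;
}
```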
> Q2-2: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Execution time | 2 processes | 4 processes | 8 processes | 12 processes | 16 processes |
|:--------------:|:-----------:|:-----------:|:-----------:|:------------:|:------------:|
| Unit: sec      | 6.221       | 3.257       | 1.623       | 1.089        | 0.816        |

> Q3-1: Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.
>
| Execution time | 2 processes | 4 processes | 8 processes | 16 processes |
|:--------------:|:-----------:|:-----------:|:-----------:|:------------:|
| Unit: sec      | 6.401       | 3.333       | 1.667       | 0.832        |

> Q3-2: How does the performance of binary tree reduction compare to the performance of linear reduction?
* Based on the experimental results, the two are about the same speed; linear reduction is slightly faster than binary tree reduction.
> Q3-3: Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.
* With the binary tree implementation, the total number of "transmissions" is still the same as in the linear implementation, but the binary tree version needs a recursive or bottom-up scheme to decide how many additions, sends, and receives each process performs, which is more complex and more costly than the linear approach. Therefore, as I increase the number of processes, I expect the linear approach to perform better. See the sketch after this answer.
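For reference, a minimal sketch of the bottom-up tree scheme discussed above, assuming the number of processes is a power of two and using a placeholder value per rank:
```cpp=
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size, rank;  // assumes `size` is a power of two
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long long value = rank;  // placeholder for each rank's partial sum

    // Bottom-up pairing: at each step, half of the remaining ranks
    // send their value to a partner 2^s ranks below and drop out.
    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) == 0) {
            long long partner;
            MPI_Recv(&partner, 1, MPI_LONG_LONG, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            value += partner;
        } else if (rank % (2 * step) == step) {
            MPI_Send(&value, 1, MPI_LONG_LONG, rank - step, 0,
                     MPI_COMM_WORLD);
            break;  // this rank has finished its part of the tree
        }
    }
    if (rank == 0) printf("sum = %lld\n", value);

    MPI_Finalize();
    return 0;
}
```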
> Q4-1: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
>
| Execution time | 2 processes | 4 processes | 8 processes | 12 processes | 16 processes |
|:--------------:|:-----------:|:-----------:|:-----------:|:------------:|:------------:|
| Unit: sec      | 6.288       | 3.252       | 1.627       | 1.082        | 0.816        |

> Q4-2: What are the MPI functions for non-blocking communication?
* Non-blocking communication functions such as MPI_Isend and MPI_Irecv return immediately instead of waiting for the whole operation to finish. The benefit is that the program can continue executing after the call, and when the transferred data is actually needed, we only have to call MPI_Wait to confirm that the communication has completed correctly, as in the sketch below.
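A minimal sketch of this pattern, assuming the root gathers one `long long` from every worker; the placeholder value and variable names are mine:
```cpp=
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long long local_count = rank;  // placeholder for a real partial result

    if (rank != 0) {
        MPI_Send(&local_count, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD);
    } else {
        std::vector<long long> counts(size - 1);
        std::vector<MPI_Request> reqs(size - 1);
        // Each MPI_Irecv returns immediately after posting the receive.
        for (int src = 1; src < size; ++src)
            MPI_Irecv(&counts[src - 1], 1, MPI_LONG_LONG, src, 0,
                      MPI_COMM_WORLD, &reqs[src - 1]);

        // ... the root can keep doing its own computation here ...

        // Block only when the received values are actually needed.
        MPI_Waitall(size - 1, reqs.data(), MPI_STATUSES_IGNORE);
        long long total = local_count;
        for (long long c : counts) total += c;
        printf("total = %lld\n", total);
    }

    MPI_Finalize();
    return 0;
}
```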
> Q4-3: How does the performance of non-blocking communication compare to the performance of blocking communication?
* In theory, non-blocking communication lets my program continue computing while communication is in progress, so it should outperform blocking communication. In this assignment, however, the measured execution times of the blocking and non-blocking versions are about the same. The likely reason is that although my root process can keep counting the tosses that land inside the circle after calling MPI_Irecv, this computation is small, so the time saved is minor and the benefit of non-blocking communication is not amplified.
> Q5: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Execution time | 2 processes | 4 processes | 8 processes | 12 processes | 16 processes |
|:--------------:|:-----------:|:-----------:|:-----------:|:------------:|:------------:|
| Unit: sec      | 6.178       | 3.245       | 1.645       | 1.109        | 0.837        |

> Q6: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
| Execution time | 2 processes | 4 processes | 8 processes | 12 processes | 16 processes |
|:--------------:|:-----------:|:-----------:|:-----------:|:------------:|:------------:|
| Unit: sec      | 6.557       | 3.416       | 1.706       | 1.147        | 0.861        |

> Q7: Describe what approach(es) were used in your MPI matrix multiplication for each data set.
* I followed the approach from the lecture slides: the root process partitions the workload, splitting matrix A as evenly as possible among the workers, and uses MPI_Send to deliver each worker its slice of matrix A together with the complete matrix B. After each worker sends its partial matrix multiplication result back with MPI_Send, the root process receives the pieces and prints the complete answer matrix C. In other words, the approach I used is linear blocking communication, sketched below.
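A simplified, self-contained sketch of this scheme. The dimensions, fill values, and variable names are illustrative assumptions, not the actual assignment code, which reads the matrices from the data set:
```cpp=
#include <mpi.h>
#include <cstdio>
#include <vector>

// Assumed, simplified dimensions: A is n x m, B is m x l, C is n x l.
const int n = 8, m = 8, l = 8;

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int workers = size - 1;  // assumes at least 2 processes

    if (rank == 0) {
        std::vector<double> A(n * m, 1.0), B(m * l, 1.0), C(n * l);
        // Split the rows of A as evenly as possible among the workers,
        // sending each worker its slice of A plus the complete B.
        int offset = 0;
        for (int w = 1; w <= workers; ++w) {
            int rows = n / workers + (w <= n % workers ? 1 : 0);
            MPI_Send(&rows, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
            MPI_Send(A.data() + offset * m, rows * m, MPI_DOUBLE, w, 0,
                     MPI_COMM_WORLD);
            MPI_Send(B.data(), m * l, MPI_DOUBLE, w, 0, MPI_COMM_WORLD);
            offset += rows;
        }
        // Receive the partial results back in row order and print C.
        offset = 0;
        for (int w = 1; w <= workers; ++w) {
            int rows = n / workers + (w <= n % workers ? 1 : 0);
            MPI_Recv(C.data() + offset * l, rows * l, MPI_DOUBLE, w, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            offset += rows;
        }
        for (int i = 0; i < n; ++i, printf("\n"))
            for (int j = 0; j < l; ++j) printf("%g ", C[i * l + j]);
    } else {
        int rows;
        MPI_Recv(&rows, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::vector<double> Apart(rows * m), B(m * l), Cpart(rows * l, 0.0);
        MPI_Recv(Apart.data(), rows * m, MPI_DOUBLE, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(B.data(), m * l, MPI_DOUBLE, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // Multiply the assigned rows of A by the full matrix B.
        for (int i = 0; i < rows; ++i)
            for (int k = 0; k < m; ++k)
                for (int j = 0; j < l; ++j)
                    Cpart[i * l + j] += Apart[i * m + k] * B[k * l + j];
        MPI_Send(Cpart.data(), rows * l, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```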