# Parallel Programming @ NYCU - HW4
#### **`0716221 余忠旻`**
## <font color="#006000">Q1</font>
### <font color="#3CB371">1. How do you control the number of MPI processes on each node?</font>
`mpirun -np <processes> <executable>` sets the total number of MPI processes.
For Q1, running `mpi_hello` with
`mpirun -np 8 --hostfile hosts mpi_hello` executes `mpi_hello` in parallel with 8 processes.
:::info
Adding the **-npernode <processes per node>** flag controls
how many processes run on each node.
Alternatively, append **slots = n** after a hostname in the `hostfile`
to declare that that node may run n processes.
:::
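For example (a sketch; the hostnames and counts below are placeholders, not from the assignment):

```shell
# hosts: each line names a node and how many slots (processes) it offers
#   pp1 slots=4
#   pp2 slots=4

# 8 processes total, capped at 4 per node
mpirun -np 8 -npernode 4 --hostfile hosts mpi_hello
```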
### <font color="#3CB371">2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?</font>
**Retrieve the rank of an MPI process:**
```cpp=
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
```
**Retrieve the total number of processes:**
```cpp=
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
```
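Put together, a minimal program using both calls looks like the sketch below (the usual pattern, compiled with an MPI wrapper such as `mpicxx` and launched with `mpirun`):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id: 0..size-1
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```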
___
## <font color="#006000">Q2. pi_block_linear.cc</font>
### <font color="#3CB371">1. Why MPI_Send and MPI_Recv are called “blocking” communication?</font>
:::info
Because both functions return only after the **communication is complete**
(for `MPI_Send`, once the send buffer is safe to reuse; for `MPI_Recv`, once the message has arrived). Until the call returns, the program cannot continue past it.
:::
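A sketch of how the blocking calls typically appear in the linear version (variable names are illustrative, not the exact assignment code): each worker sends its toss count with `MPI_Send`, and rank 0 blocks in `MPI_Recv` once per worker before summing.

```cpp
// Sketch: linear reduction of per-process hit counts with blocking calls.
long long local_hits = /* Monte Carlo tosses landing inside the circle */;
if (rank == 0) {
    long long total = local_hits, recv_hits;
    for (int src = 1; src < size; ++src) {
        // Blocks until the message from rank `src` has arrived.
        MPI_Recv(&recv_hits, 1, MPI_LONG_LONG, src, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        total += recv_hits;
    }
} else {
    // Blocks until the send buffer is safe to reuse.
    MPI_Send(&local_hits, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD);
}
```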
### <font color="#3CB371">2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time (s)|6.300787|3.282009|1.636781|1.100735|0.822144|

___
## <font color="#006000">Q3. pi_block_tree.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- |
|time (s)|6.390032|3.328851|1.657919|0.862404|

<br>
### <font color="#3CB371">2. How does the performance of binary tree reduction compare to the performance of linear reduction?</font>

As the comparison plot above shows, the two perform about the same,
with `pi_block_linear.cc` slightly faster than `pi_block_tree.cc`.
<br>
### <font color="#3CB371">3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.</font>

For p processes, both algorithms transfer the same total number of messages (p - 1), so the message count alone does not separate them. The difference lies in how the messages are scheduled. In linear reduction the root performs its p - 1 receives one after another, while tree reduction pairs processes **recursively / bottom-up** and accumulates partial sums in rounds that proceed in parallel, so it needs only ceil(log2 p) sequential rounds. At the small process counts measured here, the tree version's extra pairing logic and staged accumulation make it slightly slower; but as the number of processes grows, the tree approach is expected to perform better, since O(log p) communication rounds scale better than the root's O(p) sequential receives.
___
## <font color="#006000">Q4. pi_nonblock_linear.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time (s)|6.221352|3.257266|1.623923|1.089929|0.816641|

<br>
### <font color="#3CB371">2. What are the MPI functions for non-blocking communication?</font>
:::info
* MPI_Isend
* MPI_Irecv
* MPI_Wait / MPI_Waitall / MPI_Waitany
* MPI_Test / MPI_Testall / MPI_Testany
:::
A non-blocking MPI call returns immediately, without waiting for the communication to complete, so the program can keep executing. Compared with blocking calls this saves time, and using the saved time for computation reduces the program's total execution time.
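A sketch of the typical non-blocking pattern on the MASTER side (illustrative names such as `MAX_WORKERS` and `do_local_tosses` are my own, not the assignment code): all receives are posted at once with `MPI_Irecv`, local work proceeds in the meantime, and `MPI_Waitall` completes them.

```cpp
// Sketch: post all receives up front, overlap with local work, then wait.
long long counts[MAX_WORKERS];
MPI_Request reqs[MAX_WORKERS];
for (int src = 1; src < size; ++src) {
    MPI_Irecv(&counts[src], 1, MPI_LONG_LONG, src, 0,
              MPI_COMM_WORLD, &reqs[src]);       // returns immediately
}
long long total = do_local_tosses();             // overlap: rank 0's own share
MPI_Waitall(size - 1, &reqs[1], MPI_STATUSES_IGNORE);  // counts[] now valid
for (int src = 1; src < size; ++src) total += counts[src];
```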
<br>
### <font color="#3CB371">3. How the performance of non-blocking communication compares to the performance of blocking communication?</font>

As the comparison plot above shows, the two perform about the same,
with `pi_nonblock_linear.cc` slightly faster than `pi_block_linear.cc`.
My guess is that the non-blocking version cannot fully exploit the time it saves: most of the time in this program goes to random-number generation, and each process must finish generating its tosses before it knows the count to pass to `MPI_Isend()`. With little computation left to overlap with communication, the advantage of non-blocking calls is largely unused.
___
## <font color="#006000">Q5. pi_gather.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time (s)|6.185464|3.231177|1.639481|1.106469|0.839814|

___
## <font color="#006000">Q6. pi_reduce.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time (s)|6.545612|3.416387|1.719804|1.149395|0.879354|

___
## <font color="#006000">Q7. pi_one_side.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time (s)|6.631016|3.475609|1.790234|1.206102|0.907661|

<br>
### <font color="#3CB371">2. Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that?</font>

With 2 or 4 MPI processes, `pi_gather.cc` runs fastest.
With 8, 12, or 16 MPI processes, `pi_nonblock_linear.cc` runs fastest.
When the process count is small, `MPI_Gather` collects the results computed by the processes spread across the nodes in parallel, which is slightly faster. MPI implementations typically select among several `MPI_Gather` algorithms depending on the communicator size, and the algorithm chosen for small communicators appears to be the more efficient one here.
When the process count is larger, the linear reduction approach fits this assignment better: the work is distributing toss computation to the WORKER processes, and the MASTER must wait for everyone's count before it can compute pi. This **many senders, one receiver** structure is exactly what the linear reduction in `pi_nonblock_linear.cc` implements, so it runs fastest.
___
## <font color="#006000">Q8</font>
### <font color="#3CB371">1. Plot ping-pong time in function of the message size for cases 1 and 2, respectively.</font>
* <font size=5>case 1</font>


<br>
* <font size=5>case 2</font>


<br>
### <font color="#3CB371">2. Calculate the bandwidth and latency for cases 1 and 2, respectively.</font>
:::success
T(n) = s + rn  (n: message size in bytes, T: seconds)
latency = s
:::
* <font size=5>case 1</font>
T(n) = 1.59389970869439E-10 n + 0.000113824934821575
latency = 0.000113824934821575 s ≈ 0.114 ms
bandwidth = 1 / 1.59389970869439E-10 B/s ≈ 6.274 GB/s
* <font size=5>case 2</font>
T(n) = 8.65344159019972E-9 n + 0.000216304976804107
latency = 0.000216304976804107 s ≈ 0.216 ms
bandwidth = 1 / 8.65344159019972E-9 B/s ≈ 0.116 GB/s
___
## <font color="#006000">Q9</font>
### <font color="#3CB371">1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.</font>

As the figure above shows, multiplying a row of A by the corresponding column of B yields the corresponding element of C.
We therefore split the processes into two roles, MASTER and WORKER, corresponding to `rank = 0` and `rank > 0`.
**MASTER** handles load balancing: it splits the rows of matrix A evenly into chunks, sends each WORKER the information it needs for its computation, then waits for the WORKERs' answers to come back and prints the result.
```
Information the MASTER sends to each WORKER:
- the dimensions of matrices A and B
- rows: how many rows the WORKER must compute
- offset: where the WORKER's result goes in matrix C
- the WORKER's partial matrix of A, and the whole matrix B
```
**WORKER** receives this information from the MASTER, computes its share of the answer, and finally sends back its share `rows`, the position `offset` of its result within matrix C, and the computed `C partial matrix` to the MASTER.
___