HW4
===
###### tags: `Parallel Programming 2020`
# Q1
**Q1** How do you control the number of MPI processes on each node?
**A1**: Open MPI controls this through *slots*: the slot count in a hostfile tells Open MPI how many processes may be placed on a given host, and the hostfile is passed to `mpirun` with `--hostfile`.
For instance:
``` bash
pp1 slots=8
pp2 slots=8
```
or
``` bash
pp1 slots=1
pp2 slots=1
pp3 slots=1
pp4 slots=1
```
**Q2**: Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
**A2**: `MPI_Comm_rank()` and `MPI_Comm_size()`
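A minimal usage sketch (the program name and output message are my own, not part of the assignment code):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of the calling process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```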
**Q3**: We use Open MPI for this assignment. What else MPI implementation is commonly used? What is the difference between them?
**A3**: Commonly used MPI implementations include:
- Intel MPI
- MPICH2
- OpenMPI
:::info
Similarities: they all follow the MPI standard.
Differences:
+ Intel MPI: developed by Intel on top of the open-source MPICH2 and MVAPICH2.
+ MPICH: developed in lockstep with the MPI specification itself, so it is the implementation that most closely reflects changes and additions to MPI.
+ OpenMPI: an open-source implementation that merges the FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI implementations.
:::
# Q2
**Q1**: Why MPI_Send and MPI_Recv are called “blocking” communication?
**A1**: Because neither call returns until its operation is complete on the local side: `MPI_Send()` does not return until the message has been handed off and the send buffer can safely be reused, and `MPI_Recv()` does not return until the incoming data has been copied into the receive buffer. Until then the process stays stuck inside `Send()`/`Recv()`, which is why they are called "blocking".
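A small illustration of that behaviour, assuming at least two processes (the tag and value are placeholders): rank 0 sits inside `MPI_Send()` until the buffer has been handed off, and rank 1 sits inside `MPI_Recv()` until the value has arrived.
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        /* Does not return until the send buffer can safely be reused. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Does not return until the value from rank 0 is in `value`. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```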
**Q2**: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
**A2**:

# Q3
**Q1**: Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.
**A1**:

**Q2**: How does the performance of binary tree reduction compare to the performance of linear reduction?
**Q3**: Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs
**A2, A3**: The plot below combines the two measurements; blue is linear and orange is tree reduction.

The blue and orange curves get closer and closer, and toward the end the times are essentially the same.
Below is the $N = 4$ case, from which a general formula for the time can be inferred.

Variable definitions:
+ $Fin(i)$: time from program start until process $i$ finishes its own share of the computation
+ $ST$: time taken by one `MPI_Send()`
+ $RT$: time taken by one `MPI_Recv()`
A rough formula for the total time:
+ linear
$$
\max_i Fin(i) + ST + RT \cdot \left(N - \frac{\max_i Fin(i) - Fin(0)}{RT}\right)
$$
+ Reduction
$$
\max_i Fin(i) + (ST + RT) \cdot \log_2 N
$$
:::success
Conclusion: for large $N$ neither approach always wins.
If $RT$ is large relative to the spread of the per-process finish times, then $RT \cdot \left(N - \frac{\max_i Fin(i) - Fin(0)}{RT}\right)$ becomes large and the tree reduction is faster; otherwise the linear version is faster.
So for large $N$ it really depends on these costs.
:::
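For reference, a minimal sketch of the tree pattern discussed above (not the actual assignment code; names are mine): in each round, every remaining process whose rank has the current bit set sends its partial sum to a partner and drops out, so the reduction completes in about $\log_2 N$ rounds.
```c
#include <mpi.h>
#include <stdio.h>

/* Binary-tree reduction of one value per process; rank 0 ends up with the sum. */
long long tree_reduce(long long local, int rank, int size) {
    for (int step = 1; step < size; step <<= 1) {
        if (rank & step) {
            /* Sender: hand the partial sum to the partner and stop participating. */
            MPI_Send(&local, 1, MPI_LONG_LONG, rank - step, 0, MPI_COMM_WORLD);
            break;
        } else if (rank + step < size) {
            /* Receiver: accumulate the partner's partial sum. */
            long long incoming;
            MPI_Recv(&incoming, 1, MPI_LONG_LONG, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local += incoming;
        }
    }
    return local;
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    long long total = tree_reduce((long long)rank, rank, size);
    if (rank == 0) printf("sum of ranks = %lld\n", total);
    MPI_Finalize();
    return 0;
}
```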
# Q4
**Q1**: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
**A1**:

**Q2**: What are the MPI functions for non-blocking communication?
**A2**:
- `MPI_Isend()`: starts an asynchronous send and returns immediately, without waiting for the transfer to complete
- `MPI_Irecv()`: starts an asynchronous receive and returns immediately, without waiting for the data to arrive
- `MPI_Test()`: checks whether a previously started non-blocking operation has completed
- `MPI_Wait()`/`MPI_Waitall()`: block until the given request(s) have completed
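A minimal sketch of the non-blocking pattern, assuming at least two processes (the exchanged values are placeholders): the receive and the send are posted immediately, other work can overlap with the communication, and the process only blocks in `MPI_Waitall()` when the data is actually needed.
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank < 2) {                      /* only ranks 0 and 1 exchange data */
        int partner = 1 - rank;
        int send_val = rank, recv_val = -1;
        MPI_Request reqs[2];

        /* Both calls return immediately; the transfer proceeds in the background. */
        MPI_Irecv(&recv_val, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&send_val, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... other computation could overlap with the communication here ... */

        /* Block only at the point where the results are actually needed. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank %d got %d from rank %d\n", rank, recv_val, partner);
    }

    MPI_Finalize();
    return 0;
}
```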
**Q3**: How the performance of non-blocking communication compares to the performance of blocking communication?
**A3**:
Orange is non-blocking, blue is blocking.

Over this small test range the two approaches hardly differ,
but non-blocking is slightly faster than blocking, presumably because the receive requests can be posted and progressed concurrently with the sends.
# Q5
**Q1**: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
**A1**:

# Q6
**Q1**: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
**A1**:

# Q7
**Q1**: Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
**A1**:

**Q2**: Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that?
**A2**:
The plot below overlays all of the benchmarks.

The gather version performs best overall.
In fact `MPI_Gather` also uses `Send()`/`Recv()` internally, so I suspect the test size is simply too small and all approaches end up taking roughly the same time.
Personally, I expect that with a larger test size the fastest would be the non-blocking linear version, because non-blocking communication lets the root receive data from the other nodes asynchronously, which improves execution efficiency.
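For reference, a minimal sketch of the gather-based approach (the per-rank value is just a placeholder for the assignment's real partial results): every rank contributes one value and rank 0 sums them after `MPI_Gather()` returns.
```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long long local = rank + 1;                 /* placeholder partial result */
    long long *all = NULL;
    if (rank == 0)
        all = malloc(size * sizeof(long long)); /* one slot per rank at the root */

    /* Every rank sends one value; rank 0 receives one value from each rank. */
    MPI_Gather(&local, 1, MPI_LONG_LONG, all, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        long long total = 0;
        for (int i = 0; i < size; ++i)
            total += all[i];
        printf("total = %lld\n", total);
        free(all);
    }

    MPI_Finalize();
    return 0;
}
```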
**Q3**: Which algorithm or algorithms do MPI implementations use for reduction operations? You can research this on the WEB focusing on one MPI implementation.
**A3**:
Taking MPICH as an example,
MPICH uses Rabenseifner's algorithm.
ref: http://formalverification.cs.utah.edu/sawaya/html/d6/dfc/mpi_2coll_2reduce_8c-source.html
# Q8
**Q1**: Plot ping-pong time in function of the message size for cases 1 and 2, respectively
**A1**:
Case1:

Case2:

**Q2**: Calculate the bandwidth and latency for cases 1 and 2, respectively.
**A2**:
Case1:

```
p = [1.47836105e-10 1.10622380e-04]
```
- $latency = p[1] / 10 ^ {-6} = 110\ \mu s= 0.11\ ms$
- $bandwidth = 1 / p[0] / 10 ^ {9} = 6.8\ GB/s$
Case 2:

```
p = [8.56741628e-09 1.70932162e-04]
```
- $latency = p[1] / 10 ^ {-6} = 170\ \mu s = 0.17\ ms$
- $bandwidth = 1 / p[0] / 10 ^ {9} = 0.116\ GB/s = 116\ MB/s$
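For clarity, these conversions assume the usual linear cost model fitted to the ping-pong timings, with `p[0]` the slope (seconds per byte) and `p[1]` the intercept (seconds):
$$
T(n) \approx p[0] \cdot n + p[1]
\qquad\Rightarrow\qquad
latency = T(0) = p[1], \qquad bandwidth = \frac{1}{p[0]}
$$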
**Q3**: For case 2, how do the obtained values of bandwidth and latency compare to the nominal network bandwidth and latency of the NCTU-PP workstations. What are the differences and what could be the explanation of the differences if any?
**A3**: We can check the Ethernet speed with `ethtool`:
```
ethtool enp0s31f6 | grep Speed
```
result:
```
Speed: 1000Mb/s
```
So the bandwidth is capped at $125\ MB/s$, but in practice the link will not be fully utilized, because many other factors come into play.
For example: whether other processes are also transmitting data, the size of the packet-queue buffers, and so on, so a measured result slightly below the theoretical value is expected.
The nominal latency would be close to $0\ ms$, but in a real network many stages add delay, such as propagation delay and queueing delay.
The local machine also contributes latency, for example the overhead of the function calls themselves, or waiting for a new time quantum after a context switch.
# Q9
I used Open MPI's `MPI_Send()` and `MPI_Recv()`, split the $A$ matrix into equal parts so that every node performs the same amount of matrix-multiplication work, and then merged the partial results, as sketched below.
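A rough sketch of that scheme, under my own assumptions (square $N \times N$ matrices with $N$ divisible by the number of processes; names and sizes are placeholders, not the actual assignment code): rank 0 sends each worker an equal block of rows of $A$ together with all of $B$, every rank multiplies its block, and rank 0 collects the partial results.
```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 480;                  /* placeholder size, divisible by 2/4/8/12/16 */
    const int rows = N / size;          /* rows of A handled by each rank */

    int *A = NULL, *B = malloc(N * N * sizeof(int));
    int *subA = malloc(rows * N * sizeof(int));
    int *subC = malloc(rows * N * sizeof(int));

    if (rank == 0) {
        A = malloc(N * N * sizeof(int));
        for (int i = 0; i < N * N; ++i) { A[i] = 1; B[i] = 1; }  /* placeholder data */
        /* Ship one block of rows of A and the whole of B to every worker. */
        for (int r = 1; r < size; ++r) {
            MPI_Send(A + r * rows * N, rows * N, MPI_INT, r, 0, MPI_COMM_WORLD);
            MPI_Send(B, N * N, MPI_INT, r, 1, MPI_COMM_WORLD);
        }
        for (int i = 0; i < rows * N; ++i) subA[i] = A[i];       /* rank 0's own block */
    } else {
        MPI_Recv(subA, rows * N, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(B, N * N, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Every rank performs the same amount of multiplication work. */
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < N; ++j) {
            int sum = 0;
            for (int k = 0; k < N; ++k)
                sum += subA[i * N + k] * B[k * N + j];
            subC[i * N + j] = sum;
        }

    /* Merge: rank 0 keeps its own block and receives the others in rank order. */
    if (rank == 0) {
        int *C = malloc(N * N * sizeof(int));
        for (int i = 0; i < rows * N; ++i) C[i] = subC[i];
        for (int r = 1; r < size; ++r)
            MPI_Recv(C + r * rows * N, rows * N, MPI_INT, r, 2,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... C now holds the full product ... */
        free(C); free(A);
    } else {
        MPI_Send(subC, rows * N, MPI_INT, 0, 2, MPI_COMM_WORLD);
    }

    free(B); free(subA); free(subC);
    MPI_Finalize();
    return 0;
}
```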