---
# System prepended metadata

title: Parallel Programming @ NYCU - HW4

---

# Parallel Programming @ NYCU - HW4
#### **`0716221 余忠旻`**

## <font color="#006000">Q1</font>
### <font color="#3CB371">1. How do you control the number of MPI processes on each node?</font>

mpirun -np <processes> <execution> 可以決定總共將有幾個 processes。
在 Q1: `mpi_hello` 執行時，
`mpirun -np 8 --hostfile hosts mpi_hello` 就是使用8個processes 平行執行 `mpi_hello`
:::info 
加上 **-npernode <processes per node>** 的 flag，
可以控制每個 node 上會有幾個 processes
或者在 `hostfile` 中，hostname後面加上 **slots = n**，
表示該host為一個node時能執行 n 個processes
:::
    
### <font color="#3CB371">2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?</font>

**Retrive the rank of an MPI process:**
```cpp=
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
```

**Retrive the total number of processes:**
```cpp=
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
```
___

## <font color="#006000">Q2. pi_block_linear.cc</font>
### <font color="#3CB371">1. Why MPI_Send and MPI_Recv are called “blocking” communication?</font>

:::info
因為這兩個 function 會等到整個**通訊完成**後才會return
在通訊完成之前不會 return ，程式也自然不會往下繼續執行。
:::
    
### <font color="#3CB371">2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>

| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time|6.300787|3.282009|1.636781|1.100735|0.822144|
    

![](https://i.imgur.com/tHI6MdE.png)

___

## <font color="#006000">Q3. pi_block_tree.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.</font>
    
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | 
|time|6.390032|3.328851|1.657919|0.862404|

![](https://i.imgur.com/qYIXeZ3.png)

<br>

### <font color="#3CB371">2. How does the performance of binary tree reduction compare to the performance of linear reduction?</font>

![](https://i.imgur.com/6mTfmII.png)

由上面比較圖可以看見，兩者是差不多的，
但 `pi_block_linear.cc` 比 `pi_block_tree.cc`略快一點

<br>    

### <font color="#3CB371">3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.</font>

![](https://i.imgur.com/Y2vrDoC.jpg)

由上圖可以觀察到兩種 algorithm 的傳輸次數是一樣的，所以傳輸次數不是影響的因素，接著我們考慮兩者的實作方式，`pi_block_tree.cc` 實作中需要 **遞迴** 或是 **bottom up** 的方式去分配各個 process 對應到的傳送和接收，並且分批去累加，這樣和 linear reduction 方式一次累加全部相比，比較複雜且步驟較多，理論上是會比較耗時的。
    

___

## <font color="#006000">Q4. pi_nonblock_linear.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
    
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time|6.221352|3.257266|1.623923|1.089929|0.816641|

![](https://i.imgur.com/EUYt77O.png)
    
<br>
    
### <font color="#3CB371">2. What are the MPI functions for non-blocking communication?</font>

:::info
* MPI_Isend
* MPI_Irecv
* MPI_Wait
* MPI_Waitany
* MPI_Test
* MPI_Testany
::: 

執行 non-blocking 的 MPI function 不需要等待 return 就能繼續執行接下來的程式，和 blocking 相比會節省一些時間，利用節省的時間做計算就能減少整個程式的總執行時間。
    
<br>    
    
### <font color="#3CB371">3. How the performance of non-blocking communication compares to the performance of blocking communication?</font>

![](https://i.imgur.com/ctIE40g.png)

由上面比較圖可以看見，兩者是差不多的，
但 `pi_nonblock_linear.cc` 比 `pi_block_linear.cc`略快一點
我猜想是 non-blocking 沒有充分利用節省下來的時間，因為計算 pi 的耗時最大是在隨機數的產生，在計算 pi 結果前需要利用隨機數產生來知道到底有多少的投擲數，造成投擲數需要在 `MPI_Isend()` 之前計算完成才能傳送，因此在無法完全利用到non-blocking的優勢

___

## <font color="#006000">Q5. pi_gather.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>

| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time|6.185464|3.231177|1.639481|1.106469|0.839814|
    
![](https://i.imgur.com/7OlWXPy.png)
    
___

## <font color="#006000">Q6. pi_reduce.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>

| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time|6.545612|3.416387|1.719804|1.149395|0.879354|
    
![](https://i.imgur.com/AifOGKv.png)
    
___

## <font color="#006000">Q7. pi_one_side.cc</font>
### <font color="#3CB371">1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.</font>
    
| | 2 MPI processes | 4 MPI processes | 8 MPI processes | 12 MPI processes| 16 MPI processes|
| -------- | -------- | -------- | -------- | -------- | -------- |
|time|6.631016|3.475609|1.790234|1.206102|0.907661|

![](https://i.imgur.com/DzIVa4M.png)

<br>
    
### <font color="#3CB371">2. Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that?</font>

![](https://i.imgur.com/le4WXL7.png)


當 mpi processes 數是 2,4 的時候， `pi_gather.cc` 執行速度比較快
當 mpi processes 數是 8,12,16 的時候， `pi_nonblock_linear.cc` 執行速度比較快

因為當 mpi processes 數比較小時， `MPI_Gather`能並行的將分散各個node的各個process計算的資料收集回來，會比較快一點，可能是因為 `MPI_Gather` 根據不同的communicator size大小會有不同種的實作方式，而 `MPI_Gather` 演算法在communicator size比較小的時候比較有效率。
    
當 mpi processes 數比較大時， Linear Reduction Algorithm 的實作方式能在作業要求下執行較快，因為作業主要是在分配計算投擲數給各個 WORKER processes，而 MASTER process需要等大家都計算完投擲數後才能做 pi 的計算，因此 Linear Reduction Algorithm 的做法比較符合這種**多點發送單點接收**的架構方式， `pi_nonblock_linear.cc`會跑比較快。
___

## <font color="#006000">Q8</font>
### <font color="#3CB371">1. Plot ping-pong time in function of the message size for cases 1 and 2, respectively.</font>

* <font size=5>case 1</font>
    
![](https://i.imgur.com/hs58cGm.jpg)
    
    
![](https://i.imgur.com/iUsFXHl.png)


<br>
    
* <font size=5>case 2</font>    
    
![](https://i.imgur.com/rIJhzzF.jpg)

![](https://i.imgur.com/TpcJJbR.png)
  
<br>    
    
### <font color="#3CB371">2. Calculate the bandwidth and latency for cases 1 and 2, respectively.</font>

:::success
T(n) = s + rn
latency = s
bandwidth = 1/r
:::    

* <font size=5>case 1</font>

T(n) = 1.59389970869439E-10 n + 0.000113824934821575
latency =  0.000113824934821575 (ms)
bandwidth = 1 / 1.59389970869439E-10 (B/s)
&emsp; &emsp; &emsp; &emsp; = 6.2739204640367826511813994854224 (GB/s) 
    
    
* <font size=5>case 2</font>
    
T(n) = 8.65344159019972E-9 n + 0.000216304976804107
latency = 0.000216304976804107 (ms)
bandwidth = 1 / 8.65344159019972E-9 (B/s)
&emsp; &emsp; &emsp; &emsp; = 0.1155609579814498020402013074126 (GB/s)    
    
___

## <font color="#006000">Q9</font>
### <font color="#3CB371">1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.</font>
    
![](https://i.imgur.com/MfehKJj.jpg)

由上圖可以看到 A 對應的 row 和 B 對應的 column 相乘能得到對應 C element 的值
因此我們先分成兩部分: MASTER 和 WORKER，分別代表 `rank = 0` 和 `rank > 0`
    
**MASTER** 主要做的事是處理 load balance，將 A matrix 平分成好幾份 rows，將需要進行運算的資訊寄給 WORKER，之後等待各個 WORKER 計算完的答案傳回來印出答案
```
MASTER 傳給 WORKER 的資訊包含: 
    A,B matrix大小
    每個WORKER 所需運算的份量 rows
    計算結果相對於 C 矩陣的位置 offset    
    A partial matrix 和 B matrix
```    

**WORKER** 需要做的事是接收 MASTER 傳來的資訊，並計算答案，最後將此WORKER 所需運算的份量 `rows`、計算結果相對於 C 矩陣的位置 `offset` 、以及計算答案 `C partial matrix` 傳回給 MASTER
    
___