Parallel Programming HW4 @NYCU, 2022 Fall
===
###### tags: `2022_PP_NYCU`
<!-- | Student ID | Name |
| -------- | -------- |
| 310552060 | 湯智惟 | -->
## Q1
### How do you control the number of MPI processes on each node?
Set **slots** in the hostfile: append `slots=n` after each hostname, where `n` is the number of processes that node is allowed to run.
For instance:
```
pp2 slots=4
pp3 slots=4
```
### Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
Retrieve the rank of an MPI process:
```c
MPI_Comm_rank(MPI_Comm comm, int *rank)
```
Retrieve the total number of processes:
```c
MPI_Comm_size(MPI_Comm comm, int *size)
```
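A minimal usage sketch putting the two calls together (standard MPI boilerplate, nothing project-specific):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id in the communicator
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```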
---
## Q2
### Why MPI_Send and MPI_Recv are called “blocking” communication?
`MPI_Send` does not return until the send buffer is safe to reuse (the message has been received or buffered internally); only then does the program continue.
`MPI_Recv` does not return until the message has actually arrived in the receive buffer.
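A minimal sketch of the blocking pair, assuming exactly two ranks (the payload and names are illustrative):
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    long long count = 42;  // illustrative payload
    if (rank == 0) {
        // returns once the send buffer is safe to reuse
        MPI_Send(&count, 1, MPI_LONG_LONG, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // blocks until the message from rank 0 has arrived
        MPI_Recv(&count, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %lld\n", count);
    }
    MPI_Finalize();
    return 0;
}
```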
### Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
`pi_block_linear.cc`
| Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- | --- | --- | --- |
| Time (s) | 9.969061 | 5.510608 | 3.981320 | 2.941443 | 1.908585 |

---
## Q3
### Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it.
`pi_block_tree.cc`
| Processes | 2 | 4 | 8 | 16 |
| -------- | -------- | -------- | --- | --- |
| Time (s) | 9.631155 | 5.942813 | 3.754705 | 1.443440 |

### How does the performance of binary tree reduction compare to the performance of linear reduction?

Linear reduction and binary tree reduction perform about the same at these process counts; binary tree reduction is slightly better.
### Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.
Increasing the number of processes, **binary tree reduction** is going to perform better.
Binary tree reduction spreads the additions across multiple processes: partial sums are combined pairwise in about log2(P) parallel communication rounds, so no single rank handles more than log2(P) messages.
Linear reduction, in contrast, requires every worker to send its partial sum to one rank, which must receive and add all P-1 results one after another, serializing the communication.
Therefore binary tree reduction scales better as the process count grows.
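A sketch of one way to code the tree reduction, assuming the process count is a power of two (function and variable names are illustrative, not the graded implementation):
```c
#include <mpi.h>

// Illustrative tree reduction of one double; assumes size is a power
// of two. Round k pairs ranks that are 2^k apart.
double tree_reduce(double local, int rank, int size, MPI_Comm comm) {
    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) == 0) {  // receiver in this round
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0, comm,
                     MPI_STATUS_IGNORE);
            local += other;
        } else {                       // sender: done after this round
            MPI_Send(&local, 1, MPI_DOUBLE, rank - step, 0, comm);
            break;
        }
    }
    return local;  // the full sum ends up on rank 0
}
```
Linear reduction would instead loop `MPI_Recv` over all other ranks at rank 0, which is exactly the serialized pattern described above.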
---
## Q4
### Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
`pi_nonblock_linear.cc`
| Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- | --- | --- | --- |
| Time (s) | 9.332650 | 4.995334 | 3.136756 | 2.651476 | 1.936272 |

### What are the MPI functions for non-blocking communication?
```c
MPI_Isend(start, count, datatype, dest, tag, comm, &request);
MPI_Irecv(start, count, datatype, src, tag, comm, &request);
MPI_Wait(&request, &status);
MPI_Test(&request, &flag, &status);
```
### How the performance of non-blocking communication compares to the performance of blocking communication?

The blocking and non-blocking results differ only slightly, with non-blocking a little faster.
With non-blocking communication a rank does not have to wait for the transfer to complete; it can start other computation right away and overlap it with the communication.
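An illustrative fragment of this pattern for rank 0, assuming `size` and a per-rank partial count `local_count` already exist (names are assumptions):
```c
// Post all receives first, keep computing, then wait for every
// partial count at once.
MPI_Request requests[size - 1];
long long partial[size - 1];
for (int src = 1; src < size; src++)
    MPI_Irecv(&partial[src - 1], 1, MPI_LONG_LONG, src, 0,
              MPI_COMM_WORLD, &requests[src - 1]);
long long total = local_count;  // overlap: rank 0's own work
MPI_Waitall(size - 1, requests, MPI_STATUSES_IGNORE);
for (int i = 0; i < size - 1; i++)
    total += partial[i];
```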
---
## Q5
### Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
`pi_gather.cc`
| Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- | --- | --- | --- |
| Time (s) | 9.236979 | 5.001171 | 2.567303 | 1.785666 | 1.357843 |
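For reference, an illustrative use of `MPI_Gather` for this step, assuming each rank holds its partial count in `local_count` (names are assumptions):
```c
// Every rank contributes one value; only rank 0 reads the buffer.
long long counts[16];  // enough for the process counts measured here
MPI_Gather(&local_count, 1, MPI_LONG_LONG,
           counts, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);
if (rank == 0) {
    long long total = 0;
    for (int i = 0; i < size; i++) total += counts[i];
    // estimate: pi = 4.0 * total / total_tosses
}
```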

---
## Q6
### Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
`MPI_Reduce()`
| Processes | 2 | 4 | 8 | 12 | 16 |
| -------- | -------- | -------- | --- | --- | --- |
| Time (s) | 9.233236 | 5.236464 | 2.508783 | 1.921540 | 1.327587 |
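For reference, an illustrative use of `MPI_Reduce` for the same step, again assuming a per-rank `local_count`:
```c
// Sums every rank's partial count directly into total on rank 0.
long long total = 0;
MPI_Reduce(&local_count, &total, 1, MPI_LONG_LONG, MPI_SUM,
           0, MPI_COMM_WORLD);
// on rank 0: pi = 4.0 * total / total_tosses
```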

---
## Q7
### Describe what approach(es) were used in your MPI matrix multiplication for each data set.

The processes are split into a **master** (rank = 0) and **workers** (rank > 0).

**Master**: evenly divides the rows of matrix A among the workers, collects their results, combines them, and finally prints the product. It sends each worker:
- the number of rows of matrix A assigned to it
- the dimensions of matrices B and C
- matrix B and matrix C
- the offset (the starting position relative to matrix A / C)

**Worker**: each worker computes its partial product and sends the partial matrix back to the master.

All transfers use MPI's blocking `MPI_Send` / `MPI_Recv` calls.
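A sketch of the scheme described above, assuming square n x n row-major `int` matrices and the hypothetical names `a`, `b`, `c`, `n`, `rank`, `size` (on workers, `a` and `c` need only hold the assigned row block):
```c
// Tags: 0 for master->worker, 1 for worker->master.
int workers = size - 1;
if (rank == 0) {
    int offset = 0;
    for (int w = 1; w <= workers; w++) {   // hand out row blocks
        int rows = n / workers + (w <= n % workers ? 1 : 0);
        MPI_Send(&offset, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        MPI_Send(a + offset * n, rows * n, MPI_INT, w, 0, MPI_COMM_WORLD);
        MPI_Send(b, n * n, MPI_INT, w, 0, MPI_COMM_WORLD);
        offset += rows;
    }
    offset = 0;
    for (int w = 1; w <= workers; w++) {   // collect partial C
        int rows = n / workers + (w <= n % workers ? 1 : 0);
        MPI_Recv(c + offset * n, rows * n, MPI_INT, w, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        offset += rows;
    }
} else {
    int offset, rows;
    MPI_Recv(&offset, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&rows, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(a, rows * n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(b, n * n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (int i = 0; i < rows; i++)         // local partial product
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    MPI_Send(c, rows * n, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
```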