# Parallel Programming HW-4
:::info
<font color=#4381FA>***Q1***</font>: (5 points)
1. How do you control the number of MPI processes on each node? (3 points)
2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes? (2 points)
:::
1. Fill in `slots` for each host in the hostfile. A hostfile (e.g., `my_hostfile`) is a simple text file with hosts specified one per line. Each host can also specify a default and a maximum number of slots to be used on that host (i.e., the number of available processors on that host). Comments are supported, and blank lines are ignored. For example:
```
# This is an example hostfile. Comments begin with #
#
# The following node is a single-processor machine:
foo.example.com

# The following node is a dual-processor machine:
bar.example.com slots=2

# The following node is a quad-processor machine, and we absolutely
# want to disallow over-subscribing it:
yow.example.com slots=4 max-slots=4
```
Ref - https://www.open-mpi.org/faq/?category=running#oversubscribing
2. Use MPI_Comm_size and MPI_Comm_rank, respectively (a combined usage sketch follows after the references below).
MPI_Comm_size - Returns the size of the group associated with a communicator.
MPI_Comm_rank - Determines the rank of the calling process in the communicator.
Ref - https://www.open-mpi.org/doc/v4.0/man3/MPI_Comm_size.3.php
Ref - https://www.open-mpi.org/doc/v4.0/man3/MPI_Comm_rank.3.php
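Putting the two answers together, here is a minimal sketch (the file name `hello_mpi.c` and the launch line are illustrative, assuming the `my_hostfile` from above):
```c
// hello_mpi.c -- compile:  mpicc hello_mpi.c -o hello_mpi
// Launch, controlling how many processes land on each node via the hostfile:
//   mpirun --hostfile my_hostfile -np 8 ./hello_mpi
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // rank of the calling process

    printf("Process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```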
:::info
<font color=#4381FA>***Q2***</font>: (3 points)
1. Why are MPI_Send and MPI_Recv called “blocking” communication? (2 points)
2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
:::
1. Because both functions return only once the transfer is complete; until it finishes, the calling process just sits there and waits (see the sketch after the table below).
2.
| **pi_block_tree** |
|:---------------------------------------------------:|
| |
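A minimal sketch of the blocking behavior described in answer 1 (two processes; the file name is illustrative):
```c
// blocking.c -- run with: mpirun -np 2 ./blocking
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        // MPI_Send returns only once the buffer is safe to reuse.
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // MPI_Recv blocks until a matching message has actually arrived.
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```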
:::info
<font color=#4381FA>***Q3***</font>: (7 points)
1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it. (1 point)
2. How does the performance of binary tree reduction compare to the performance of linear reduction? (3 points)
3. As the number of processes increases, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs. (3 points)
:::
1.
| **pi_block_tree** |
|:---------------------------------------------------:|
| |
2.
| **compare** |
|:---------------------------------------------------:|
| |
3. As the figure below shows, tree reduction is barely faster than linear reduction here. The two diagrams underneath explain why: with 8 processes, both schemes send the same 7 (p − 1) messages in total, so the similar timing is reasonable. The difference lies in scheduling: linear reduction delivers all p − 1 messages one after another to rank 0, whereas the tree overlaps them across ⌈log₂ p⌉ rounds, so the tree approach should pull ahead as the process count grows (a sketch of the tree pattern follows after the figures).
| **linear reduction** |
|:---------------------------------------------------:|
| |
| **tree reduction** |
|:---------------------------------------------------:|
| |
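A minimal sketch of the binary-tree reduction pattern shown in the diagram above (illustrative, not the graded code; assumes the number of processes is a power of two):
```c
// tree_reduce.c -- binary-tree reduction of per-rank partial sums to rank 0.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double partial = (double)rank; // stand-in for each rank's local pi estimate

    // Round s: ranks that are multiples of 2s receive from rank + s;
    // their partners send once and then drop out.
    for (int s = 1; s < size; s *= 2) {
        if (rank % (2 * s) == 0) {
            double incoming;
            MPI_Recv(&incoming, 1, MPI_DOUBLE, rank + s, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            partial += incoming;
        } else if (rank % (2 * s) == s) {
            MPI_Send(&partial, 1, MPI_DOUBLE, rank - s, 0, MPI_COMM_WORLD);
            break; // this rank is done after sending once
        }
    }

    if (rank == 0) printf("total = %f\n", partial);
    MPI_Finalize();
    return 0;
}
```
With p = 8 this still sends 7 messages in total, but in 3 rounds rather than 7 sequential receives at rank 0.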
:::info
<font color=#4381FA>***Q4***</font>: (6 points)
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
2. What are the MPI functions for non-blocking communication? (2 points)
3. How does the performance of non-blocking communication compare to the performance of blocking communication? (3 points)
:::
1.
| **pi_nonblock_linear** |
|:---------------------------------------------------:|
| |
2. MPI_Isend, MPI_Irecv, MPI_Wait, MPI_Waitany, MPI_Test, and MPI_Testany.
3. The two curves are very close. I would expect the non-blocking version to be slightly faster, since other work can overlap the communication, but the machine environment is too noisy for the difference to show up in the measurements (a sketch follows after the table below).
| **compare** |
|:---------------------------------------------------:|
| |
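A minimal sketch of the non-blocking pattern (illustrative, not the graded code): rank 0 posts all receives up front with MPI_Irecv, can keep computing while messages are in flight, and completes them with MPI_Waitall (a completion call in the same family as the MPI_Wait variants listed above):
```c
// nonblock.c -- rank 0 gathers partial sums with non-blocking receives.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double partial = (double)rank; // stand-in for a local pi estimate

    if (rank == 0) {
        double *parts = malloc((size - 1) * sizeof(double));
        MPI_Request *reqs = malloc((size - 1) * sizeof(MPI_Request));

        // Post all receives at once; MPI_Irecv returns immediately.
        for (int i = 1; i < size; i++)
            MPI_Irecv(&parts[i - 1], 1, MPI_DOUBLE, i, 0,
                      MPI_COMM_WORLD, &reqs[i - 1]);

        // Rank 0 could do useful work here while messages are in flight.

        MPI_Waitall(size - 1, reqs, MPI_STATUSES_IGNORE);
        for (int i = 0; i < size - 1; i++) partial += parts[i];
        printf("total = %f\n", partial);
        free(parts);
        free(reqs);
    } else {
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```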
:::info
<font color=#4381FA>***Q5***</font>: (1 point)
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
:::
| **pi_gather** |
|:---------------------------------------------------:|
| |
:::info
<font color=#4381FA>***Q6***</font>: (1 point)
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
:::
| **pi_reduce** |
|:---------------------------------------------------:|
| |
:::info
<font color=#4381FA>***Q7***</font>: (5 points)
1. Describe what approach(es) were used in your MPI matrix multiplication for each data set.
:::
1. Matrix multiplication pairs each row of the left matrix with a column of the right matrix. That column-wise access is very unfriendly to the right matrix under C++'s row-major layout, so the right matrix is read in as its transpose; every dot product then walks contiguous memory. Next, to speed up the whole computation in parallel, the simplest approach is to split a matrix into smaller units, either by rows or by columns. I chose the right matrix as the partitioning target, so that each process is assigned rows / n units and the overall workload stays balanced. Each process then multiplies its own batch, and the partial results are merged with an MPI reduce to obtain the final result (a sketch follows below).
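A minimal sketch of this scheme (dimensions and names are illustrative, not the submitted code): B is stored transposed, each rank computes a disjoint block of columns of C, and MPI_Reduce with MPI_SUM merges the disjoint partial results at rank 0:
```c
// matmul_sketch.c -- transposed-B matmul with column partitioning + reduce.
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define N 4 /* rows of A */
#define M 4 /* cols of A == rows of B */
#define P 4 /* cols of B */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int A[N][M], BT[P][M]; /* BT is B transposed: BT[j][k] == B[k][j] */
    int C[N][P], total[N][P];
    memset(C, 0, sizeof C);

    /* Fill A and BT deterministically (identical on every rank). */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < M; k++)
            A[i][k] = i + k;
    for (int j = 0; j < P; j++)
        for (int k = 0; k < M; k++)
            BT[j][k] = j * k + 1;

    /* Split the columns of C (rows of BT) evenly across the ranks. */
    int chunk = (P + size - 1) / size;
    int jbeg = rank * chunk;
    int jend = (jbeg + chunk < P) ? jbeg + chunk : P;

    for (int i = 0; i < N; i++)
        for (int j = jbeg; j < jend; j++)
            for (int k = 0; k < M; k++) /* contiguous in both A and BT */
                C[i][j] += A[i][k] * BT[j][k];

    /* The blocks are disjoint, so summing them reassembles the product. */
    MPI_Reduce(C, total, N * P, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("C[0][0] = %d, C[%d][%d] = %d\n",
               total[0][0], N - 1, P - 1, total[N - 1][P - 1]);

    MPI_Finalize();
    return 0;
}
```
Because every rank writes only its own column block and leaves the rest zero, the MPI_SUM reduction is just a cheap way to stitch the disjoint pieces together at rank 0.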