# Section (1) - Development Flow
## Q1.a)
Our program runs the Aliev-Panfilov cardiac simulation and parallelizes it using MPI. The overall structure of the program is as follows:
```
solve() {
    scatter initial data to workers
    for (int n = 0; n < iteration_number; n++) {
        start exchanging ghost cells with neighboring processes
        compute center elements that do not involve ghost cells
        wait for the ghost cell exchange to finish
        compute edge elements that involve ghost cells
    }
    collect statistics
}
```
Notice that the ghost cell exchange does not block the program, so the communication and computation of center elements can run at the same time.
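To make the overlap concrete, here is a minimal sketch of the time-step loop, assuming a non-blocking ghost exchange; the helper names (`start_ghost_exchange`, `compute_interior`, `compute_edges`) and the buffer swap are illustrative, not the exact names used in our code:
```
#include <mpi.h>

/* Hypothetical helpers: post the MPI_Isend/MPI_Irecv pairs (returning how many
 * requests were posted) and compute the interior / edge cells respectively. */
int  start_ghost_exchange(double *E_prev, MPI_Request *reqs);
void compute_interior(double *E, double *E_prev, double *R);
void compute_edges(double *E, double *E_prev, double *R);

void time_step_loop(double *E, double *E_prev, double *R, int niters)
{
    for (int n = 0; n < niters; n++) {
        MPI_Request reqs[8];                             /* up to 4 sends + 4 recvs    */
        int nreqs = start_ghost_exchange(E_prev, reqs);  /* non-blocking halo exchange */
        compute_interior(E, E_prev, R);                  /* cells needing no ghosts    */
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);   /* ghost rows/cols are ready  */
        compute_edges(E, E_prev, R);                     /* cells touching ghosts      */
        double *tmp = E; E = E_prev; E_prev = tmp;       /* typical buffer swap        */
    }
}
```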
To divide the mesh evenly among processes, we use the following code:
```
// Local block size for this rank: the first (m % py) ranks along y and the
// first (n % px) ranks along x each get one extra row / column.
int height = m / cb.py + (m % cb.py > rankY ? 1 : 0);
int width  = n / cb.px + (n % cb.px > rankX ? 1 : 0);
```
With the code above, processes closer to the top-left corner receive slightly larger blocks, but the size difference in each dimension is at most one row or column.
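As an illustration, the same splitting rule can be written as a small helper that also yields each block's starting offset (needed for scattering); `split_dim` and its arguments are hypothetical names, not the ones in our code:
```
/* Split `total` rows (or columns) over `nprocs` ranks.  Ranks whose index is
 * smaller than the remainder get one extra element, matching the formula above. */
void split_dim(int total, int nprocs, int rank, int *size, int *offset)
{
    int base = total / nprocs;
    int rem  = total % nprocs;
    *size   = base + (rank < rem ? 1 : 0);
    *offset = rank * base + (rank < rem ? rank : rem);
}
```
For example, `split_dim(1800, 16, rank, ...)` gives the first 8 ranks 113 rows each and the remaining 8 ranks 112 rows each.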
For ghost cell exchanges, we exchange in four directions. First we check whether the process lies on the top boundary. If so, we handle the boundary case and do not exchange. Otherwise, we use `MPI_Isend()` to send the first interior data row to the process above, and `MPI_Irecv()` to receive data from that process into the topmost row allocated for ghost cells. The same procedure is then repeated for the other three directions.
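Below is a minimal sketch of the exchange in one direction (north), assuming a row-major local block of (height+2) x (width+2) doubles with a one-cell ghost halo; the names `exchange_north` and `rank_up` are illustrative. Passing `MPI_PROC_NULL` as the neighbor on the top boundary would make both calls no-ops, which is an alternative to the explicit boundary check we use:
```
#include <mpi.h>

/* Post the non-blocking send of our first interior row to the process above
 * and the matching receive into our top ghost row.  `reqs` must have room
 * for two requests; they are completed later with MPI_Waitall(). */
void exchange_north(double *E_prev, int width, int rank_up, int tag,
                    MPI_Request *reqs)
{
    int stride = width + 2;                      /* padded row length  */
    double *first_row = E_prev + stride + 1;     /* first interior row */
    double *ghost_row = E_prev + 1;              /* top ghost row      */

    MPI_Isend(first_row, width, MPI_DOUBLE, rank_up, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(ghost_row, width, MPI_DOUBLE, rank_up, tag, MPI_COMM_WORLD, &reqs[1]);
}
```
The south exchange mirrors this with the last interior row and the bottom ghost row; the east/west exchanges send columns, which are strided in row-major memory and therefore need either packing into a temporary buffer or an `MPI_Type_vector` datatype.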
For the computation, we choose the fused version together with a register variable. The reason is that with the fused version, the register can eliminate one load of `E_tmp[i]`:
```
register double e_tmp;
for (...) {
    // PDE SOLVER
    e_tmp = ...
    // ODE SOLVER
    e_tmp += ...
    R_tmp[i] += ...
    E_tmp[i] = e_tmp;
}
```
```
In the code above, a register variable replaces the original `E_tmp[i]`. This way, the result of the PDE solver and the first line of the ODE solver are accumulated directly in the register, instead of loading `E_tmp[i]` twice. With the unfused version, two separate loops split the two uses of `E_tmp[i]` apart, so the register variable cannot be used. Since this loop is the main part of the computation, this change improves performance considerably.
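For reference, here is a fuller sketch of the fused loop body, assuming the standard Aliev-Panfilov update; the function name, parameter list, and constant names (`alpha`, `kk`, `dt`, `a`, `b`, `eps`, `M1`, `M2`) are our own illustration rather than the exact code:
```
/* Fused PDE + ODE update for one interior row.  The local temporary `e`
 * plays the role of the register variable: E_tmp[i] is written exactly once. */
void fused_row(double *E_tmp, const double *E_prev_tmp, double *R_tmp,
               int width, double alpha, double kk, double dt,
               double a, double b, double eps, double M1, double M2)
{
    int stride = width + 2;                      /* padded row length (ghost columns) */
    for (int i = 1; i <= width; i++) {
        /* PDE solver: 5-point stencil on the previous solution */
        double e = E_prev_tmp[i]
                 + alpha * (E_prev_tmp[i + 1] + E_prev_tmp[i - 1]
                          + E_prev_tmp[i + stride] + E_prev_tmp[i - stride]
                          - 4.0 * E_prev_tmp[i]);
        /* ODE solver, accumulated in the same temporary */
        e += -dt * (kk * e * (e - a) * (e - 1.0) + e * R_tmp[i]);
        R_tmp[i] += dt * (eps + M1 * R_tmp[i] / (e + M2))
                       * (-R_tmp[i] - kk * e * (e - b - 1.0));
        E_tmp[i] = e;                            /* single store back to memory */
    }
}
```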
## Q1.b)
1. We studied the starter code and ran it on different environments such as Sorken and Expanse.
2. Our first attempt simply broadcast the whole matrix to all processes and performed the computation without any partitioning. We verified the correctness of this version, but, as expected, the performance was not good enough.
3. We implemented partitioning and replaced the broadcast with a scatter. Ghost cells and communication among processes were also added. Performance improved a lot, but a gap to the expected value remained.
4. Following the optimization hints from the discussion section, we implemented interleaving of inner-block computation and communication.
5. Finally, we used the register variable together with the fused version, achieving satisfactory performance.
## Q1.c)
Our performance improved over several steps. The process is documented in the following table (N=N1, p=128, x=8, y=16):
| Step | GFlops |
|:-------------------------------- | ------ |
| broadcast (step 2 in Q1.b) | 999 |
| scatter (step 3 in Q1.b) | 1086 |
| interleaving (step 4 in Q1.b) | 1118 |
| fused + register (step 5 in Q1.b) | 1254 |
# Section (2) - Result
## Q2.a)
We divide our GFlop/s by the number of cores to get GFlops/core and compare the result with the starter code. By calculating the ratio and the overhead relative to the starter code, we can see that our optimized code performs better than the starter code.
| | np = 1 | np = 2 | np = 4 | np = 8 | np = 16 | starter code |
|:---------------------------- | ------ | ------ |:------ |:------ |:------- | ------------ |
| GFlops/core | 14.39 | 13.915 | 13.975 | 13.737 | 13.16 | 11.03 |
| ratio vs starter | 1.30 | 1.26 | 1.27 | 1.25 | 1.19 | 1 |
| MPI overhead (starter - ours) | -3.36 | -2.885 | -2.945 | -2.707 | -2.13 | 0 |
## Q2.b)
| | np = 1 | np = 2 | np = 4 | np = 8 | np = 16 |
| ------------------------ |:------ | ------ |:------ | ------ |:------- |
| GFlop/s | 14.39 | 27.83 | 55.9 | 109.9 | 210.7 |
| Scaling against np/2 | N/A | 0.966 | 0.99 | 0.98 | 0.96 |
| Scaling against one core | 1 | 0.97 | 0.97 | 0.95 | 0.91 |
The table above is for size N0 (n=800); the scaling is computed relative to half as many cores and relative to a single core, i.e. scaling = GFlop/s(p) / ((p / r) * GFlop/s(r)), where r is the reference core count.
## Q2.c)
| Cores | Geometry | GFlop/s with communication | GFlop/s without communication | Overhead | Scaling |
| ----- |:--------- |:------------------------- |:---------------------------- |:-------- |:------- |
| 16 | x=1, y=16 | 228.4 | 231.9 | 2.1% | N/A |
| 32 | x=2, y=16 | 440.9 | 456 | 3% | 0.96 |
| 64 | x=2, y=32 | 799.6 | 877.6 | 1.2% | 0.94 |
| 128 | x=8, y=16 | 1245 | 1435 | 13.4% | 0.78 |
The table above is for size N1 (n=1800); the scaling decreases as the number of cores grows, while the communication overhead tends to increase.
## Q2.d)
For size N2 (n=8000):
| Cores | Geometry | Overhead | GFlop/s with communication | GFlop/s without communication |
| ----- | -------- |:-------- |:------------------------- |:---------------------------- |
| 128 | x=4, y=32 | 2.3% | 508.7 | 520.7 |
| 192 | x=4, y=48 | 2.9% | 544.1 | 560.3 |
| 256 | x=16, y=16 | 17.8% | 786.5 | 957.3 |
| 384 | x=4, y=96 | 5.7% | 2314 | 2430 |
We can see that the communication overhead stays below 6% in most cases (2.3%-5.7%), while the 256-core result has a much larger overhead of 17.8%.
## Q2.e)
We found that the overhead increases with the number of cores. For cores <= 128, the overhead ranges from 1.2% to 13.4%. For cores > 128, an irregularity occurs at 256 cores, where the overhead reaches 61.2%. Apart from that case, the overhead ranges for cores <= 128 and cores > 128 are quite similar.
We believe the reason is that the communication cost grows as we use more cores, while each process works on a smaller sub-block, so communication accounts for a larger share of the runtime. What is more, the maximum `ntasks-per-node` in sbatch is 128, which leads to cross-node communication when cores > 128.
## Q2.f)
| Cores | Computation cost = #cores x computation time (core-seconds) |
| ----- |:-------------------------------------------- |
| 128 | 71.599sec * 128 = 9,164.7 |
| 192 | 36.2753sec * 192 = 6,964.9 |
| 256 | 26.649sec * 256 = 6,822.1 |
| 384 | 6.30524sec * 384 = 2,421.2 |
For size N2 (n=8000), the best strategy is to use more cores, since the total computation cost (in core-seconds) decreases as the number of cores increases. In this case, using 384 cores is the best choice.
# Section (3) - Determining Geometry
## Q3.a)
For p=128, there are 8 possible geometries. We report the top 3 geometries for N1:
| Rank | Geometry | GFlops |
|:----:|:--------:|:------:|
| 1 | x=8, y=16 | 1254 |
| 2 | x=4, y=32 | 1249 |
| 3 | x=2, y=64 | 1205 |
## Q3.b)
Based on Q3.a), the pattern for the best geometry is that x is smaller than y, but not too small. To find the reason and support this idea, we collected data for all geometries with p=128. In the following table, the running time without communication is obtained with the command line parameter `-k` and can be understood as the computation time. The communication time is then the total running time minus the running time without communication.
| Geometry | Running time with communication (s)| Running time without communication (s) | Communication time (s) |
|:----------:|:-------------------------------:|:----------------------------------:|:------------------:|
| x=1, y=128 | 8.816 | 7.511 | 1.305 |
| x=2, y=64 | 7.501 | 6.146 | 1.355 |
| x=4, y=32 | 7.261 | 6.133 | 1.128 |
| x=8, y=16 | 7.234 | 6.908 | 0.326 |
| x=16, y=8 | 7.605 | 6.889 | 0.716 |
| x=32, y=4 | 8.356 | 6.694 | 1.662 |
| x=64, y=2 | 9.782 | 7.546 | 2.236 |
| x=128, y=1 | 12.36 | 9.326 | 3.034 |
From the table above, we can conclude that:
(1) Generally, the computation takes less time when x is smaller. When x is smaller, each process holds more elements along a row, which utilizes the cache better.
(2) The communication time is minimized when x is close to y. Since only ghost cells on the edges require communication, the communication time is proportional to the total edge length. For a given `p` with `p = xy`, the perimeter `2x + 2y` is minimized when `x = y`: by the AM-GM inequality, `2x + 2y >= 4*sqrt(xy) = 4*sqrt(p)`, with equality exactly when `x = y`.
In addition, when x is too small (e.g. x=1 or x=2), interleaving communication and computation stops helping, because almost all cells lie on the edges and there are few center cells to compute, which reduces performance.
Given the reasons discussed above, the best geometry is achieved when x is smaller than y, but not too small, balancing computation time and communication time for the best overall performance.
# Section (4) - Strong and Weak Scaling
## Q4.a)
For size N1 (n=1800), we obtained the following results:
| Cores | Geometry | GFlop/s |
| ----- | --------- |:------- |
| 128 | x=8, y=16 | 1254 |
| 192 | x=4, y=48 | 1946 |
| 256 | x=4, y=64 | 2194 |
| 384 | x=8, y=48 | 2808 |

## Q4.b)
For the N1 experiments, we can see that the GFlop/s increases with the number of cores.
For the N2 experiment, the GFlop/s also increases with the number of cores, but a sharp increase occurs at 384 cores, and the increase from 128 to 256 cores is slower. We believe this is because the matrix size for N2 is larger (n=8000), leading to a larger communication overhead. What is more, N1 uses a larger iteration count, leading to a larger scattering time.
# Section (5) - Extra Credit: EC-a
We have implemented result plotting.
Here is the plot for n=300, i=10000:

In our implementation, rank 0 is responsible for gathering the data and plotting, while also performing its normal share of the computation like the other ranks. This is a problem worth solving, because the plotting procedure reduces the overall performance. One solution is to dedicate an additional rank to plotting.
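A rough sketch of how rank 0 could assemble the full mesh before plotting is shown below. It reuses the hypothetical `split_dim` helper from Section 1 and assumes ranks are numbered row-major over the (px, py) process grid, so the names, tags, and layout here are assumptions rather than our exact implementation:
```
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void split_dim(int total, int nprocs, int rank, int *size, int *offset);  /* see Section 1 */

/* Pack this rank's interior block and, on rank 0, collect every block into
 * the full m x n mesh `global` (allocated on rank 0 only). */
void gather_for_plot(const double *E, int height, int width, int m, int n,
                     int px, int py, int rank, int nprocs, double *global)
{
    double *buf = malloc((size_t)height * width * sizeof(double));
    for (int r = 0; r < height; r++)             /* strip the ghost halo */
        memcpy(buf + (size_t)r * width,
               E + (size_t)(r + 1) * (width + 2) + 1,
               (size_t)width * sizeof(double));

    if (rank != 0) {
        MPI_Send(buf, height * width, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD);
    } else {
        for (int src = 0; src < nprocs; src++) {
            int h, w, roff, coff;
            split_dim(m, py, src / px, &h, &roff);   /* sender's row block    */
            split_dim(n, px, src % px, &w, &coff);   /* sender's column block */
            double *blk = (src == 0) ? buf
                                     : malloc((size_t)h * w * sizeof(double));
            if (src != 0)
                MPI_Recv(blk, h * w, MPI_DOUBLE, src, 99, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            for (int r = 0; r < h; r++)              /* place into global mesh */
                memcpy(global + (size_t)(roff + r) * n + coff,
                       blk + (size_t)r * w, (size_t)w * sizeof(double));
            if (src != 0) free(blk);
        }
    }
    free(buf);
}
```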
# Section (6) - Potential Future Work
1. We could run more experiments to compare the performance of the fused and unfused code.
2. We could write AVX code to further optimize the PDE and ODE computation.
# Section (7) - References (as needed)
1. [Open MPI Documentation](https://www.open-mpi.org/doc/)