# Parprog Final Project - Gaussian Elimination (GE)
###### tags: `SS2021-IN2147-PP`
<style>
.iframe-body-sty {
position: relative;
overflow: hidden;
height: 300px;
width: 600px;
}
.iframe-body-sty > #iframe-shrink {
position: absolute;
transform: scale(0.5);
transform-origin: 0 0;
height: 200%;
width: 200%;
}
</style>
## Introduction
This is an HPC project to parallelize **`Gaussian Elimination (GE)`** using **`OpenMP`**, **`MPI`**, and a **`Hybrid Method`** (OpenMP + MPI).
First, we applied some **`serial optimization`** to the original GE program.
Then, we parallelized it using **`OpenMP`**.
Next, we parallelized it with **`MPI`**.
Finally, we combined the OpenMP and MPI implementations into the **`Hybrid`** implementation.
In addition, we applied further optimizations to operations that are suited to **`SIMD Intrinsics`**.
<!-- ## Repository, Submisstion Server and Chat Room
* [Group-213 · GitLab](https://gitlab.lrz.de/lrr-tum/teaching/parprog/ss2021/project-material/group-213)
* [Parallel Programming Submission Tool](https://parprog.caps.in.tum.de/Submission/home)
* [CHAT.TUM.DE](https://chat.tum.de/channel/parprog21) -->
<!-- ## Schedule
```
Sat: D-1 => Discuss a little bit, come up with some ideas
Sun: D+0 => Work together
Mon: D+1 => Remedial time + Documentation
Tue: D+2 => Documentation + Submission
``` -->
## Useful Measurement Tools and Commands
### `perf`
```shell
## Run for 5 rounds
## Measure the cache-references and cache-misses
$ perf stat \
-r 5 -e cache-references,cache-misses \
./serialge ./ge_data/size2048x2048
## General record
$ perf record ./serialge ./ge_data/size64x64
## Record with stack trace
$ perf record -g ./serialge ./ge_data/size64x64
## Print the report to the terminal
$ perf report | cat
## Print the per-thread report (grouped by pid) to the terminal
## Useful for observing multithreading overhead
$ perf report -s pid | cat
## Print the report with stack trace information to the terminal
$ perf report -g graph,0.5,caller | cat
```
* [Linux perf Examples](http://www.brendangregg.com/perf.html)
### `offcputime` of BPF Compiler Collection (BCC)
```shell
## Off-CPU Time
./${PROG} ./ge_data/size8192x8192 &
sudo offcputime-bpfcc -df -p `pgrep -nx ${PROG}` 30 > offcpu.stacks
pkill ${PROG}
```
* [Off-CPU Analysis](http://www.brendangregg.com/offcpuanalysis.html)
* [speedscope](https://www.speedscope.app/)
### Flame Graph
```shell
## Declare the program and matrix files
PROG=hybridge
FILES="size2048x2048 size1024x1024 size512x512"
## On-CPU Time
for f in $FILES
do
sudo perf record -g mpirun --allow-run-as-root -n 3 ./${PROG} ./ge_data/${f}
sudo perf script -i perf.data &> perf.unfold
./stackcollapse-perf.pl perf.unfold &> perf.folded
grep ${PROG} perf.folded \
| ./flamegraph.pl \
--title="On-CPU Time Flame Graph: $ ./${PROG} ./ge_data/${f} (bonus)" \
> oncpu.bonus.${f}.svg
done
## Off-CPU Time
./${PROG} ./ge_data/size8192x8192 &
sudo offcputime-bpfcc -df -p `pgrep -nx ${PROG}` 30 > offcpu.stacks
pkill ${PROG}
./flamegraph.pl --color=io \
--title="Off-CPU Time Flame Graph (30s): $ ./${PROG} ./ge_data/size8192x8192 (bonus)" \
--countname=us < ./offcpu.stacks > offcpu.bonus.size8192x8192.svg
```
* [CPU Flame Graphs](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html)
### `valgrind`
```shell
## Dynamic instrumentation analysis using Valgrind
## Callgrind: profile the call graph and call counts
$ valgrind --tool=callgrind -- ./serialge ./ge_data/size64x64
## Cachegrind: measure cache events
$ valgrind --tool=cachegrind -- ./serialge ./ge_data/size64x64
```
* [Valgrind Home](https://valgrind.org/)
### `kcachegrind`
```shell
## Visualization of the reports produced by Valgrind
$ kcachegrind callgrind.out.26081
$ kcachegrind cachegrind.out.26180
```
* [KDE/kcachegrind: GUI to profilers such as Valgrind](https://github.com/KDE/kcachegrind)
### `mpiP`
```shell
$ LD_PRELOAD=/usr/local/lib/libmpiP.so \
mpirun -np 3 mpige ./ge_data/size2048x2048
$ less mpige.3.50899.1.mpiP
```
* [LLNL/mpiP: A light-weight MPI profiler.](https://github.com/LLNL/mpiP)
## Original Serial Program
### `perf stat` locally

* High ***`stalled-cycles-backend`*** percentage:
    * means that the stall mainly happens after the instructions were issued into the **`Issue Queue`**, **`L/S Queue`**, or **`Reorder Buffer`**.
    * the stalled cycles may be caused by waiting for **`memory operations`** or other **`long latency operations`**.
    * [linux - What are stalled-cycles-frontend and stalled-cycles-backend in 'perf stat' result? - Stack Overflow](https://stackoverflow.com/questions/22165299/what-are-stalled-cycles-frontend-and-stalled-cycles-backend-in-perf-stat-resul)
### `perf record` locally

### Dynamic instrumentation analysis locally
#### `valgrind --tool=callgrind`

* We should focus on **`ForwardElimination()`**
#### `valgrind --tool=cachegrind`

* There are some L1 data read misses in **`ForwardElimination()`**; we use this first measurement as the baseline for later comparisons.
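
For orientation, here is a minimal sketch of the textbook forward-elimination loop nest (the function name, the vector-of-vectors layout, and the appended right-hand-side column are assumptions for illustration, not necessarily the handout code). Every pivot step re-sweeps all remaining rows, which is where the data read misses above come from once the working set no longer fits in L1.

```cpp
#include <vector>

// Textbook forward elimination on an n x n system; the right-hand side is
// assumed to be stored as an extra column at index n.
void ForwardElimination(std::vector<std::vector<double>> &a, int n) {
    for (int k = 0; k < n - 1; ++k) {              // pivot row k
        for (int i = k + 1; i < n; ++i) {          // every row below the pivot
            const double factor = a[i][k] / a[k][k];
            for (int j = k; j <= n; ++j)           // eliminate column k
                a[i][j] -= factor * a[k][j];
        }
    }
}
```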
### `perf stat` on the Submission Server
```shell
stdout: Original Serial Program:
stdout: 36396339 cache-references # 1.427 M/sec ( +- 0.69% ) (62.49%)
stdout: 18876172 cache-misses # 51.863 % of all cache refs ( +- 0.55% ) (62.50%)
stdout: 25498.70 msec cpu-clock # 1.000 CPUs utilized ( +- 0.01% )
stdout: 36 context-switches # 0.001 K/sec ( +- 6.54% )
stdout: 0 cpu-migrations # 0.000 K/sec
stdout: 8337 page-faults # 0.327 K/sec ( +- 0.00% )
stdout: 68458659080 cycles # 2.685 GHz ( +- 0.02% ) (50.00%)
stdout: 55243414654 stalled-cycles-frontend # 80.70% frontend cycles idle ( +- 0.01% ) (50.01%)
stdout: 35294390248 stalled-cycles-backend # 51.56% backend cycles idle ( +- 0.21% ) (50.01%)
stdout: 42094964427 instructions # 0.61 insn per cycle
stdout: # 1.31 stalled cycles per insn ( +- 0.01% ) (62.51%)
stdout: 9591554066 branches # 376.158 M/sec ( +- 0.00% ) (62.50%)
stdout: 6967087 branch-misses # 0.07% of all branches ( +- 0.09% ) (62.50%)
stdout:
stdout: 25.50034 +- 0.00248 seconds time elapsed ( +- 0.01% )
```
### Flame Graph (2048x2048)
<!-- <div class="iframe-body-sty">
<iframe id="iframe-shrink" src="https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/serialge.1024x1024.svg"></iframe>
</div>
</div> -->

### Other Flame Graphs
:::spoiler




:::
## Serial Optimization
### Notice
* Where do you see possible improvements through parallelization?
* Explain the logic behind each optimization step (a sketch of typical changes follows below).
* Hint: A proper optimization of sequential code should render parallelization useless for smaller matrix sizes. Show why this happens in your presentation.
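
As a reference point for the measurements below, this is a minimal sketch of the kind of serial optimization meant here, under the same assumptions as the baseline sketch above: one contiguous row-major buffer (row length `stride`, typically `n + 1` with the right-hand side appended), the pivot reciprocal hoisted out of the inner loop, and an inner loop the compiler can vectorize. The actual submitted code may differ.

```cpp
#include <cstddef>

// Serially optimized elimination kernel (sketch): contiguous row-major storage,
// one division per pivot row instead of one per eliminated element, and the
// inner loop starting at k + 1 because column k of every lower row becomes 0.
void ForwardEliminationOpt(double *a, int n, int stride) {
    for (int k = 0; k < n - 1; ++k) {
        const double *pivot_row = &a[(std::size_t)k * stride];
        const double pivot_inv = 1.0 / pivot_row[k];
        for (int i = k + 1; i < n; ++i) {
            double *row = &a[(std::size_t)i * stride];
            const double factor = row[k] * pivot_inv;
            row[k] = 0.0;
            for (int j = k + 1; j <= n; ++j)   // column n holds the RHS
                row[j] -= factor * pivot_row[j];
        }
    }
}
```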
### `perf stat` on the Submission Server
```shell
stdout: Optimized Serial Program:
stdout: 256291902 cache-references # 76.729 M/sec ( +- 0.30% ) (62.40%)
stdout: 43814301 cache-misses # 17.095 % of all cache refs ( +- 1.40% ) (62.45%)
stdout: 3340.23 msec cpu-clock # 1.000 CPUs utilized ( +- 0.16% )
stdout: 6 context-switches # 0.002 K/sec ( +- 19.25% )
stdout: 0 cpu-migrations # 0.000 K/sec
stdout: 9523 page-faults # 0.003 M/sec ( +- 0.84% )
stdout: 8945290194 cycles # 2.678 GHz ( +- 0.13% ) (50.03%)
stdout: 4299727133 stalled-cycles-frontend # 48.07% frontend cycles idle ( +- 0.36% ) (50.11%)
stdout: 2294309912 stalled-cycles-backend # 25.65% backend cycles idle ( +- 0.60% ) (50.10%)
stdout: 12790654776 instructions # 1.43 insn per cycle
stdout: # 0.34 stalled cycles per insn ( +- 0.10% ) (62.59%)
stdout: 1725125330 branches # 516.470 M/sec ( +- 0.04% ) (62.50%)
stdout: 6838414 branch-misses # 0.40% of all branches ( +- 0.21% ) (62.42%)
stdout:
stdout: 3.34092 +- 0.00552 seconds time elapsed ( +- 0.17% )
```
### Comparison Table
| SPEC | Original | Serial Optimization | Improvement |
| -------------------- | -------- | ------------------- | ---------------- |
| IPC | 0.61 | 1.43 | |
| cache-references | 36396339 | 256291902 | |
| cache-misses | 18876172 | 43814301 | |
| cache-misses (%) | 51.863% | 17.095% | |
| frontend cycles idle | 80.70% | 48.07% | |
| backend cycles idle | 51.56% | 25.65% | |
| time elapsed | 25.50034 | 3.34092 | 7.6327 (SpeedUp) |
### Flame Graph (2048x2048)

### Other Flame Graphs
:::spoiler




:::
## OMP Optimization
* Tue 15.06.2021 23:59
* Submission file "ompge.cpp"
### `perf stat` on the Submission Server
```shell
stdout: Optimized OMP Program:
stdout: 215464679 cache-references # 32.738 M/sec ( +- 1.53% ) (63.24%)
stdout: 4563143 cache-misses # 2.118 % of all cache refs ( +- 2.60% ) (63.41%)
stdout: 6581.54 msec cpu-clock # 6.428 CPUs utilized ( +- 0.74% )
stdout: 258 context-switches # 0.039 K/sec ( +- 10.23% )
stdout: 72 cpu-migrations # 0.011 K/sec ( +- 19.44% )
stdout: 8423 page-faults # 0.001 M/sec ( +- 0.00% )
stdout: 17611176782 cycles # 2.676 GHz ( +- 0.83% ) (50.12%)
stdout: 12111969178 stalled-cycles-frontend # 68.77% frontend cycles idle ( +- 0.80% ) (49.48%)
stdout: 7903992783 stalled-cycles-backend # 44.88% backend cycles idle ( +- 1.45% ) (49.08%)
stdout: 13990348567 instructions # 0.79 insn per cycle
stdout: # 0.87 stalled cycles per insn ( +- 0.66% ) (62.06%)
stdout: 2062105804 branches # 313.317 M/sec ( +- 0.40% ) (62.12%)
stdout: 7495666 branch-misses # 0.36% of all branches ( +- 1.56% ) (62.55%)
stdout:
stdout: 1.02396 +- 0.00432 seconds time elapsed ( +- 0.42% )
stdout: Optimized OMP Program with SIMD:
stdout: 211111314 cache-references # 30.074 M/sec ( +- 1.41% ) (62.68%)
stdout: 4312060 cache-misses # 2.043 % of all cache refs ( +- 3.04% ) (63.58%)
stdout: 7019.81 msec cpu-clock # 6.804 CPUs utilized ( +- 4.03% )
stdout: 309 context-switches # 0.044 K/sec ( +- 18.90% )
stdout: 98 cpu-migrations # 0.014 K/sec ( +- 9.48% )
stdout: 8422 page-faults # 0.001 M/sec ( +- 0.01% )
stdout: 18547516255 cycles # 2.642 GHz ( +- 3.94% ) (51.48%)
stdout: 12837235540 stalled-cycles-frontend # 69.21% frontend cycles idle ( +- 4.51% ) (50.60%)
stdout: 8422308546 stalled-cycles-backend # 45.41% backend cycles idle ( +- 5.88% ) (49.18%)
stdout: 14631081994 instructions # 0.79 insn per cycle
stdout: # 0.88 stalled cycles per insn ( +- 1.43% ) (61.33%)
stdout: 2187065496 branches # 311.556 M/sec ( +- 3.22% ) (60.96%)
stdout: 7668582 branch-misses # 0.35% of all branches ( +- 0.58% ) (61.51%)
stdout:
stdout: 1.0318 +- 0.0105 seconds time elapsed ( +- 1.02% )
```
### Measurement Table - `omp for`
| SPEC | Serial Optimization | FOR Optimization | Improvement |
| -------------------- | ------------------- | ---------------- | ---------------- |
| IPC | 1.43 | 0.79 | - |
| cache-references | 256291902 | 215464679 | - |
| cache-misses | 43814301 | 4563143 | - |
| cache-misses (%) | 17.095% | 2.118% | - |
| frontend cycles idle | 48.07% | 68.77% | - |
| backend cycles idle | 25.65% | 44.88% | - |
| time elapsed | 3.34092 | 1.02396 | 3.2708 (SpeedUp) |
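
A minimal sketch of the `omp for` strategy measured above (same assumed names and layout as the serial sketch; not necessarily the submitted `ompge.cpp`): for a fixed pivot `k`, the updates of the rows below it are independent, so the `i`-loop can be split across threads, while the outer `k`-loop stays sequential because step `k + 1` depends on step `k`.

```cpp
#include <omp.h>
#include <cstddef>

// OpenMP-parallel forward elimination (sketch): one parallel region per pivot,
// static scheduling so each thread gets a contiguous chunk of rows.
void ForwardEliminationOMP(double *a, int n, int stride) {
    for (int k = 0; k < n - 1; ++k) {
        const double *pivot_row = &a[(std::size_t)k * stride];
        const double pivot_inv = 1.0 / pivot_row[k];
        #pragma omp parallel for schedule(static)
        for (int i = k + 1; i < n; ++i) {
            double *row = &a[(std::size_t)i * stride];
            const double factor = row[k] * pivot_inv;
            for (int j = k; j <= n; ++j)
                row[j] -= factor * pivot_row[j];
        }
    }
}
```

The per-pivot fork/join also connects to the hint in the serial section: for small matrices the work per parallel region shrinks while the threading overhead stays roughly constant, so parallelization stops paying off.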
### Measurement Table - `omp simd`
| SPEC | FOR Optimization | FOR + SIMD Optimization | Improvement |
| -------------------- | ---------------- | ----------------------- | ---------------- |
| IPC | 0.79 | 0.79 | - |
| cache-references | 215464679 | 211111314 | - |
| cache-misses | 4563143 | 4312060 | - |
| cache-misses (%) | 2.118% | 2.043% | - |
| frontend cycles idle | 68.77% | 69.21% | - |
| backend cycles idle | 44.88% | 45.41% | - |
| time elapsed | 1.02396 | 1.0318 | 0.9924 (SpeedUp) |
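
The `omp simd` variant only adds a `simd` annotation on the innermost `j`-loop of the same kernel (sketch below, same assumptions). As the table shows, this brings essentially no extra speedup, which is expected when the compiler already auto-vectorizes that loop.

```cpp
#include <omp.h>
#include <cstddef>

// Same kernel as above, with the innermost loop additionally marked for SIMD.
void ForwardEliminationOMPSIMD(double *a, int n, int stride) {
    for (int k = 0; k < n - 1; ++k) {
        const double *pivot_row = &a[(std::size_t)k * stride];
        const double pivot_inv = 1.0 / pivot_row[k];
        #pragma omp parallel for schedule(static)
        for (int i = k + 1; i < n; ++i) {
            double *row = &a[(std::size_t)i * stride];
            const double factor = row[k] * pivot_inv;
            #pragma omp simd
            for (int j = k; j <= n; ++j)
                row[j] -= factor * pivot_row[j];
        }
    }
}
```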
### Flame Graph (2048x2048)

### Other Flame Graphs
:::spoiler




:::
### Amdahl's Law Calculations
* **Methodology**
    * Run perf on the optimized serial code and determine the fraction of time spent in ForwardElimination.
    * Use this fraction in Amdahl's law to calculate the theoretical speedup (see the formula below).
    * Then calculate the experimental speedup using perf. SIMD is not considered!
* [Amdahl’s Law Calculations - Google Sheets](https://docs.google.com/spreadsheets/d/1XL3Ti2cEkmKaU7CVWf9Fil3K2e0xWhpUakXvcS-04ZQ/edit#gid=747307773)
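
The calculation behind the sheet is Amdahl's law, with $p$ the measured fraction of the serial runtime spent in `ForwardElimination()` and $N$ the number of threads:

$$
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
$$

For example (illustrative numbers, not our measured values): with $p = 0.95$ and $N = 8$, the theoretical speedup is $1 / (0.05 + 0.95/8) \approx 5.9$.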
### Initial results
#### Codes

#### `perf stat`

:::spoiler
```shell
Performance counter stats for './ompge ./ge_data/size512x512' (5 runs):
264.32 msec task-clock # 5.445 CPUs utilized ( +- 3.55% )
606 context-switches # 0.002 M/sec ( +- 11.62% )
1 cpu-migrations # 0.004 K/sec ( +- 44.72% )
702 page-faults # 0.003 M/sec ( +- 0.07% )
737,283,512 cycles # 2.789 GHz ( +- 5.07% )
315,708,039 stalled-cycles-frontend # 42.82% frontend cycles idle ( +- 9.48% )
22,312,372 stalled-cycles-backend # 3.03% backend cycles idle ( +- 3.08% )
493,289,410 instructions # 0.67 insn per cycle
# 0.64 stalled cycles per insn ( +- 0.78% )
91,134,557 branches # 344.785 M/sec ( +- 1.21% )
220,230 branch-misses # 0.24% of all branches ( +- 1.10% )
0.04854 +- 0.00153 seconds time elapsed ( +- 3.15% )
Performance counter stats for './ompge ./ge_data/size1024x1024' (5 runs):
666.17 msec task-clock # 4.252 CPUs utilized ( +- 1.24% )
723 context-switches # 0.001 M/sec ( +- 12.63% )
0 cpu-migrations # 0.000 K/sec
2,241 page-faults # 0.003 M/sec ( +- 0.04% )
2,427,979,476 cycles # 3.645 GHz ( +- 1.12% )
574,446,588 stalled-cycles-frontend # 23.66% frontend cycles idle ( +- 1.99% )
112,657,599 stalled-cycles-backend # 4.64% backend cycles idle ( +- 0.90% )
2,323,606,967 instructions # 0.96 insn per cycle
# 0.25 stalled cycles per insn ( +- 0.12% )
373,124,245 branches # 560.102 M/sec ( +- 0.21% )
777,271 branch-misses # 0.21% of all branches ( +- 0.37% )
0.15669 +- 0.00213 seconds time elapsed ( +- 1.36% )
Performance counter stats for './ompge ./ge_data/size2048x2048' (5 runs):
15,435.28 msec task-clock # 5.979 CPUs utilized ( +- 0.77% )
128,235 context-switches # 0.008 M/sec ( +- 0.12% )
36,035 cpu-migrations # 0.002 M/sec ( +- 1.29% )
8,425 page-faults # 0.546 K/sec ( +- 0.01% )
50,178,450,443 cycles # 3.251 GHz ( +- 0.86% )
21,780,685,616 stalled-cycles-frontend # 43.41% frontend cycles idle ( +- 1.03% )
5,790,393,909 stalled-cycles-backend # 11.54% backend cycles idle ( +- 0.80% )
14,331,798,547 instructions # 0.29 insn per cycle
# 1.52 stalled cycles per insn ( +- 0.02% )
2,073,070,667 branches # 134.307 M/sec ( +- 0.03% )
8,928,151 branch-misses # 0.43% of all branches ( +- 0.61% )
2.5816 +- 0.0328 seconds time elapsed ( +- 1.27% )
Performance counter stats for './ompge ./ge_data/size4096x4096' (5 runs):
149,998.28 msec task-clock # 8.211 CPUs utilized ( +- 1.71% )
286,910 context-switches # 0.002 M/sec ( +- 0.35% )
92,869 cpu-migrations # 0.619 K/sec ( +- 1.18% )
33,014 page-faults # 0.220 K/sec ( +- 0.00% )
574,591,154,283 cycles # 3.831 GHz ( +- 1.60% )
274,887,878,693 stalled-cycles-frontend # 47.84% frontend cycles idle ( +- 0.75% )
48,741,150,001 stalled-cycles-backend # 8.48% backend cycles idle ( +- 1.93% )
86,392,307,943 instructions # 0.15 insn per cycle
# 3.18 stalled cycles per insn ( +- 0.03% )
10,612,046,393 branches # 70.748 M/sec ( +- 0.04% )
34,691,265 branch-misses # 0.33% of all branches ( +- 0.81% )
18.269 +- 0.237 seconds time elapsed ( +- 1.30% )
```
:::
### Intermediate results
#### Codes

#### `perf stat`

:::spoiler
```shell
Performance counter stats for './ompge ./ge_data/size512x512' (5 runs):
630.58 msec task-clock # 4.514 CPUs utilized ( +- 3.20% )
31,581 context-switches # 0.050 M/sec ( +- 0.29% )
8,700 cpu-migrations # 0.014 M/sec ( +- 1.11% )
736 page-faults # 0.001 M/sec ( +- 0.07% )
1,312,049,911 cycles # 2.081 GHz ( +- 4.33% )
379,265,192 stalled-cycles-frontend # 28.91% frontend cycles idle ( +- 11.42% )
130,674,577 stalled-cycles-backend # 9.96% backend cycles idle ( +- 0.53% )
827,190,457 instructions # 0.63 insn per cycle
# 0.46 stalled cycles per insn ( +- 0.77% )
164,415,402 branches # 260.736 M/sec ( +- 1.06% )
1,087,639 branch-misses # 0.66% of all branches ( +- 0.24% )
0.13970 +- 0.00109 seconds time elapsed ( +- 0.78% )
Performance counter stats for './ompge ./ge_data/size1024x1024' (5 runs):
1,551.52 msec task-clock # 4.356 CPUs utilized ( +- 3.38% )
62,970 context-switches # 0.041 M/sec ( +- 0.08% )
17,645 cpu-migrations # 0.011 M/sec ( +- 1.69% )
2,274 page-faults # 0.001 M/sec ( +- 0.03% )
4,057,597,625 cycles # 2.615 GHz ( +- 4.80% )
961,906,690 stalled-cycles-frontend # 23.71% frontend cycles idle ( +- 9.21% )
342,219,413 stalled-cycles-backend # 8.43% backend cycles idle ( +- 2.57% )
3,039,407,567 instructions # 0.75 insn per cycle
# 0.32 stalled cycles per insn ( +- 0.66% )
533,379,028 branches # 343.777 M/sec ( +- 1.06% )
2,497,786 branch-misses # 0.47% of all branches ( +- 0.54% )
0.35619 +- 0.00834 seconds time elapsed ( +- 2.34% )
Performance counter stats for './ompge ./ge_data/size2048x2048' (5 runs):
17,014.81 msec task-clock # 5.825 CPUs utilized ( +- 0.54% )
251,846 context-switches # 0.015 M/sec ( +- 0.31% )
68,261 cpu-migrations # 0.004 M/sec ( +- 1.43% )
8,425 page-faults # 0.495 K/sec ( +- 0.01% )
52,808,551,492 cycles # 3.104 GHz ( +- 0.48% )
22,414,794,512 stalled-cycles-frontend # 42.45% frontend cycles idle ( +- 1.59% )
6,320,426,551 stalled-cycles-backend # 11.97% backend cycles idle ( +- 0.91% )
15,690,574,537 instructions # 0.30 insn per cycle
# 1.43 stalled cycles per insn ( +- 0.11% )
2,375,361,565 branches # 139.606 M/sec ( +- 0.16% )
12,082,568 branch-misses # 0.51% of all branches ( +- 0.99% )
2.9211 +- 0.0549 seconds time elapsed ( +- 1.88% )
Performance counter stats for './ompge ./ge_data/size4096x4096' (5 runs):
158,882.20 msec task-clock # 8.454 CPUs utilized ( +- 1.06% )
532,286 context-switches # 0.003 M/sec ( +- 0.20% )
159,671 cpu-migrations # 0.001 M/sec ( +- 1.91% )
33,014 page-faults # 0.208 K/sec ( +- 0.00% )
598,956,780,751 cycles # 3.770 GHz ( +- 1.05% )
277,193,463,570 stalled-cycles-frontend # 46.28% frontend cycles idle ( +- 0.27% )
50,828,962,230 stalled-cycles-backend # 8.49% backend cycles idle ( +- 1.27% )
89,076,682,733 instructions # 0.15 insn per cycle
# 3.11 stalled cycles per insn ( +- 0.03% )
11,210,287,810 branches # 70.557 M/sec ( +- 0.05% )
41,578,396 branch-misses # 0.37% of all branches ( +- 0.79% )
18.793 +- 0.268 seconds time elapsed ( +- 1.42% )
```
:::
### Documentation
* Explain the parallelization strategies used and the logic behind them.
* Use a diagram to show the parallelization strategy and the data required at each step of the parallelized algorithm.
## MPI Optimization
* Tue 22.06.2021 23:59
* Submission file "mpige.cpp
### Development List
* [x] Communication skeleton (a sketch follows after this list)
* [x] Preparing data
* [x] MPI_Scatter() and MPI_Gather()
* [x] MPI_Isend() and MPI_Irecv()
* use the Pipelined, Row-Oriented Algorithm (slide 39)
* [x] ForwardElimination() parallelization
* [x] [Optional] BackwardSubstitution() parallelization
* [x] Optimize communication
* [x] Measurement
* [x] Documentation
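
As referenced in the list above, here is a hedged sketch of the communication skeleton with a simple block-row distribution (assumed names and layout as in the earlier sketches; the submitted `mpige.cpp` may instead distribute rows cyclically for the pipelined, row-oriented algorithm and use `MPI_Isend()`/`MPI_Irecv()` rather than the blocking collectives shown here).

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// MPI forward elimination (sketch): scatter contiguous row blocks, broadcast
// each pivot row from its owner, eliminate locally, gather the result back.
// Assumes n is divisible by the number of ranks and stride = n + 1 (RHS column).
void ForwardEliminationMPI(std::vector<double> &a, int n, int stride,
                           MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    const int rows_per_rank = n / nprocs;
    std::vector<double> local((std::size_t)rows_per_rank * stride);
    std::vector<double> pivot_row(stride);

    // Rank r owns global rows [r * rows_per_rank, (r + 1) * rows_per_rank).
    MPI_Scatter(a.data(), rows_per_rank * stride, MPI_DOUBLE,
                local.data(), rows_per_rank * stride, MPI_DOUBLE, 0, comm);

    for (int k = 0; k < n - 1; ++k) {
        const int owner = k / rows_per_rank;
        if (rank == owner)
            std::copy_n(&local[(std::size_t)(k % rows_per_rank) * stride],
                        stride, pivot_row.data());
        MPI_Bcast(pivot_row.data(), stride, MPI_DOUBLE, owner, comm);

        const double pivot_inv = 1.0 / pivot_row[k];
        const int first = std::max(k + 1 - rank * rows_per_rank, 0);
        for (int li = first; li < rows_per_rank; ++li) {
            double *row = &local[(std::size_t)li * stride];
            const double factor = row[k] * pivot_inv;
            for (int j = k; j <= n; ++j)
                row[j] -= factor * pivot_row[j];
        }
    }

    // Collect the eliminated blocks on rank 0 for backward substitution.
    MPI_Gather(local.data(), rows_per_rank * stride, MPI_DOUBLE,
               a.data(), rows_per_rank * stride, MPI_DOUBLE, 0, comm);
}
```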
### Flame Graph (2048x2048)

### Other Flame Graphs
:::spoiler




:::
### Initial Results - MPI_Ibcast
#### Codes

#### `perf stat`

:::spoiler
```shell
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size512x512' (5 runs):
264.72 msec task-clock # 1.594 CPUs utilized ( +- 2.31% )
386 context-switches # 0.001 M/sec ( +- 1.79% )
55 cpu-migrations # 0.207 K/sec ( +- 6.49% )
9,144 page-faults # 0.035 M/sec ( +- 0.03% )
796,795,509 cycles # 3.010 GHz ( +- 3.21% )
23,601,010 stalled-cycles-frontend # 2.96% frontend cycles idle ( +- 1.30% )
255,426,677 stalled-cycles-backend # 32.06% backend cycles idle ( +- 3.21% )
1,679,943,613 instructions # 2.11 insn per cycle
# 0.15 stalled cycles per insn ( +- 2.72% )
339,717,099 branches # 1283.305 M/sec ( +- 2.84% )
1,397,471 branch-misses # 0.41% of all branches ( +- 1.58% )
0.16604 +- 0.00512 seconds time elapsed ( +- 3.08% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size1024x1024' (5 runs):
826.36 msec task-clock # 2.372 CPUs utilized ( +- 2.32% )
478 context-switches # 0.579 K/sec ( +- 1.47% )
51 cpu-migrations # 0.061 K/sec ( +- 7.69% )
13,858 page-faults # 0.017 M/sec ( +- 0.03% )
3,140,217,954 cycles # 3.800 GHz ( +- 2.17% )
76,546,845 stalled-cycles-frontend # 2.44% frontend cycles idle ( +- 8.55% )
1,325,883,824 stalled-cycles-backend # 42.22% backend cycles idle ( +- 1.73% )
6,446,686,641 instructions # 2.05 insn per cycle
# 0.21 stalled cycles per insn ( +- 2.14% )
1,232,693,549 branches # 1491.714 M/sec ( +- 2.44% )
2,488,624 branch-misses # 0.20% of all branches ( +- 1.53% )
0.34832 +- 0.00678 seconds time elapsed ( +- 1.95% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size2048x2048' (5 runs):
6,938.97 msec task-clock # 2.873 CPUs utilized ( +- 2.90% )
1,330 context-switches # 0.192 K/sec ( +- 2.21% )
76 cpu-migrations # 0.011 K/sec ( +- 8.44% )
32,500 page-faults # 0.005 M/sec ( +- 0.02% )
28,169,268,111 cycles # 4.060 GHz ( +- 2.83% )
2,745,914,988 stalled-cycles-frontend # 9.75% frontend cycles idle ( +- 16.85% )
15,987,882,751 stalled-cycles-backend # 56.76% backend cycles idle ( +- 1.33% )
30,822,596,196 instructions # 1.09 insn per cycle
# 0.52 stalled cycles per insn ( +- 1.63% )
5,430,351,277 branches # 782.588 M/sec ( +- 1.78% )
8,610,384 branch-misses # 0.16% of all branches ( +- 3.42% )
2.4154 +- 0.0706 seconds time elapsed ( +- 2.92% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size4096x4096' (5 runs):
48,931.88 msec task-clock # 2.963 CPUs utilized ( +- 2.11% )
6,836 context-switches # 0.140 K/sec ( +- 3.09% )
169 cpu-migrations # 0.003 K/sec ( +- 3.89% )
106,641 page-faults # 0.002 M/sec ( +- 0.00% )
198,037,246,253 cycles # 4.047 GHz ( +- 2.00% )
21,812,123,320 stalled-cycles-frontend # 11.01% frontend cycles idle ( +- 12.63% )
123,291,160,551 stalled-cycles-backend # 62.26% backend cycles idle ( +- 0.87% )
167,948,115,577 instructions # 0.85 insn per cycle
# 0.73 stalled cycles per insn ( +- 1.31% )
26,913,007,412 branches # 550.010 M/sec ( +- 1.52% )
39,952,365 branch-misses # 0.15% of all branches ( +- 3.27% )
16.515 +- 0.346 seconds time elapsed ( +- 2.10% )
```
:::
#### `mpiP`

### Intermediate results - Parallel BS()
#### Codes

#### `perf stat`

:::spoiler
```shell
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size512x512' (5 runs):
248.11 msec task-clock # 1.540 CPUs utilized ( +- 3.12% )
389 context-switches # 0.002 M/sec ( +- 2.40% )
46 cpu-migrations # 0.187 K/sec ( +- 9.80% )
9,571 page-faults # 0.039 M/sec ( +- 0.07% )
762,786,422 cycles # 3.074 GHz ( +- 4.72% )
26,023,616 stalled-cycles-frontend # 3.41% frontend cycles idle ( +- 5.67% )
225,481,941 stalled-cycles-backend # 29.56% backend cycles idle ( +- 5.52% )
1,573,853,837 instructions # 2.06 insn per cycle
# 0.14 stalled cycles per insn ( +- 2.78% )
321,708,657 branches # 1296.653 M/sec ( +- 2.97% )
1,313,746 branch-misses # 0.41% of all branches ( +- 0.76% )
0.16110 +- 0.00833 seconds time elapsed ( +- 5.17% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size1024x1024' (5 runs):
929.34 msec task-clock # 2.425 CPUs utilized ( +- 6.16% )
498 context-switches # 0.535 K/sec ( +- 3.36% )
61 cpu-migrations # 0.065 K/sec ( +- 7.68% )
15,733 page-faults # 0.017 M/sec ( +- 0.03% )
3,554,226,495 cycles # 3.824 GHz ( +- 7.20% )
206,006,990 stalled-cycles-frontend # 5.80% frontend cycles idle ( +- 28.37% )
1,547,651,047 stalled-cycles-backend # 43.54% backend cycles idle ( +- 10.76% )
6,387,034,798 instructions # 1.80 insn per cycle
# 0.24 stalled cycles per insn ( +- 2.08% )
1,242,126,964 branches # 1336.570 M/sec ( +- 2.24% )
2,282,334 branch-misses # 0.18% of all branches ( +- 1.82% )
0.3833 +- 0.0192 seconds time elapsed ( +- 5.02% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size2048x2048' (5 runs):
6,212.68 msec task-clock # 2.862 CPUs utilized ( +- 0.91% )
1,220 context-switches # 0.196 K/sec ( +- 1.48% )
64 cpu-migrations # 0.010 K/sec ( +- 4.89% )
40,341 page-faults # 0.006 M/sec ( +- 0.01% )
25,446,446,484 cycles # 4.096 GHz ( +- 0.73% )
1,127,432,163 stalled-cycles-frontend # 4.43% frontend cycles idle ( +- 13.73% )
16,067,156,571 stalled-cycles-backend # 63.14% backend cycles idle ( +- 1.25% )
29,982,500,623 instructions # 1.18 insn per cycle
# 0.54 stalled cycles per insn ( +- 1.05% )
5,408,456,832 branches # 870.552 M/sec ( +- 1.22% )
6,568,256 branch-misses # 0.12% of all branches ( +- 1.55% )
2.1711 +- 0.0230 seconds time elapsed ( +- 1.06% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size4096x4096' (5 runs):
44,639.48 msec task-clock # 2.965 CPUs utilized ( +- 0.48% )
5,878 context-switches # 0.132 K/sec ( +- 1.48% )
139 cpu-migrations # 0.003 K/sec ( +- 7.06% )
138,687 page-faults # 0.003 M/sec ( +- 0.00% )
183,081,496,312 cycles # 4.101 GHz ( +- 0.60% )
11,107,441,243 stalled-cycles-frontend # 6.07% frontend cycles idle ( +- 3.95% )
124,142,472,219 stalled-cycles-backend # 67.81% backend cycles idle ( +- 0.64% )
162,602,667,589 instructions # 0.89 insn per cycle
# 0.76 stalled cycles per insn ( +- 0.82% )
26,673,044,086 branches # 597.521 M/sec ( +- 1.05% )
29,169,660 branch-misses # 0.11% of all branches ( +- 1.18% )
15.0575 +- 0.0710 seconds time elapsed ( +- 0.47% )
```
:::
#### `mpiP`

### Point-to-Point vs Broadcast

Broadcast is more efficient: MPI libraries typically implement `MPI_Bcast` with a tree-based or pipelined pattern, so the messaging overhead is distributed across the ranks instead of being serialized on the sending rank.
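
For illustration, a hypothetical helper (not project code) showing the two ways of distributing a pivot row:

```cpp
#include <mpi.h>

// Distributing one pivot row to all ranks: point-to-point vs. collective.
void distribute_pivot_row(double *row, int count, int root,
                          MPI_Comm comm, bool use_bcast) {
    if (use_bcast) {
        // Collective: the library spreads the forwarding work over the ranks
        // (typically a binomial tree), so the root does O(log p) work.
        MPI_Bcast(row, count, MPI_DOUBLE, root, comm);
        return;
    }
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    if (rank == root) {
        // Point-to-point: the root serializes p - 1 sends per pivot step.
        for (int r = 0; r < nprocs; ++r)
            if (r != root)
                MPI_Send(row, count, MPI_DOUBLE, r, 0, comm);
    } else {
        MPI_Recv(row, count, MPI_DOUBLE, root, 0, comm, MPI_STATUS_IGNORE);
    }
}
```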

## Hybrid (MPI + OMP) Optimization + Bonus (Optional)
* Tue 29.06.2021 23:59
* Submission file "hybridge.cpp
* Submission file "bonusge.cpp (optional)
### Hybrid (MPI + OMP)
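A minimal sketch of how the two levels are combined (assumed structure, not the submitted `hybridge.cpp`): MPI ranks own row blocks as in the MPI sketch above, and the local elimination loop is additionally split across OpenMP threads. MPI is initialized with a threading level that allows an OpenMP team to run while only the main thread makes MPI calls.

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
    // Request a threading level where only the main thread makes MPI calls.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    // ... read the matrix, scatter row blocks (as in the MPI sketch above) ...
    //
    // Inside the k-loop, after MPI_Bcast of the pivot row, the local rows are
    // eliminated by the OpenMP team:
    //
    //   #pragma omp parallel for schedule(static)
    //   for (int li = first; li < rows_per_rank; ++li)
    //       /* row li -= factor * pivot_row */;

    MPI_Finalize();
    return 0;
}
```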
#### Flame Graph (2048x2048)

#### Other Flame Graphs
:::spoiler




:::
### Bonus
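The introduction mentions additional optimizations with SIMD intrinsics; purely as an illustration of that style (an assumed AVX variant of the row update, not necessarily what the submitted `bonusge.cpp` does), the inner update could be hand-vectorized like this:

```cpp
#include <immintrin.h>

// Illustrative AVX row update: row[j] -= factor * pivot_row[j],
// four doubles at a time, with a scalar remainder loop.
void eliminate_row_avx(double *row, const double *pivot_row,
                       double factor, int k, int n) {
    const __m256d vfactor = _mm256_set1_pd(factor);
    int j = k;
    for (; j + 4 <= n + 1; j += 4) {             // columns k..n incl. the RHS
        __m256d r = _mm256_loadu_pd(&row[j]);
        __m256d p = _mm256_loadu_pd(&pivot_row[j]);
        r = _mm256_sub_pd(r, _mm256_mul_pd(vfactor, p));
        _mm256_storeu_pd(&row[j], r);
    }
    for (; j <= n; ++j)                          // scalar remainder
        row[j] -= factor * pivot_row[j];
}
```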
#### Flame Graph (2048x2048)

#### Other Flame Graphs
:::spoiler




:::
## Presentation
* Fri 09.07.2021 13:20 - 14:10
## References
* [Cornell Virtual Workshop: Amdahl's Law](https://cvw.cac.cornell.edu/vector/performance_amdahl)
* [Gaussian Elimination](https://www.cs.rutgers.edu/~venugopa/parallel_summer2012/ge.html)
* [Gaussian Elimination parallel slides](https://parallelcomp.github.io/Gaussian-Quinn.pdf)
* [Loop dependence analysis - Wikipedia](https://en.wikipedia.org/wiki/Loop_dependence_analysis)
* [Linux perf Examples](http://www.brendangregg.com/perf.html)
* [CPU Flame Graphs](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html)
* [Off-CPU Analysis](http://www.brendangregg.com/offcpuanalysis.html)
* [LLNL/mpiP: A light-weight MPI profiler.](https://github.com/LLNL/mpiP)
* [SIMD Directives](https://www.openmp.org/spec-html/5.0/openmpsu42.html)