# Parprog Final Project - Gaussian Elimination (GE)

###### tags: `SS2021-IN2147-PP`

<style>
.iframe-body-sty {
    position: relative;
    overflow: hidden;
    height: 300px;
    width: 600px;
}
.iframe-body-sty > #iframe-shrink {
    position: absolute;
    transform: scale(0.5);
    transform-origin: 0 0;
    height: 200%;
    width: 200%;
}
</style>

## Introduction

This is an HPC project that parallelizes **`Gaussian Elimination (GE)`** using **`OpenMP`**, **`MPI`**, and a **`Hybrid Method`** (OpenMP + MPI). First, we applied some **`serial optimization`** to the original GE program. Then we parallelized it using **`OpenMP`**, and next with **`MPI`**. Finally, we combined the OpenMP and MPI implementations into the **`Hybrid`** implementation. In addition, we also applied further optimizations to the operations that fit **`SIMD Intrinsics`**.

<!--
## Repository, Submission Server and Chat Room
* [Group-213 · GitLab](https://gitlab.lrz.de/lrr-tum/teaching/parprog/ss2021/project-material/group-213)
* [Parallel Programming Submission Tool](https://parprog.caps.in.tum.de/Submission/home)
* [CHAT.TUM.DE](https://chat.tum.de/channel/parprog21)
-->

<!--
## Schedule
```
Sat: D-1 => Discuss a little bit, come up with some ideas
Sun: D+0 => Work together
Mon: D+1 => Remedial time + Documentation
Tue: D+2 => Documentation + Submission
```
-->

## Useful Measurement Tools and Commands

### `perf`

```shell
## Run for 5 rounds
## Measure the cache-references and cache-misses
$ perf stat \
    -r 5 -e cache-references,cache-misses \
    ./serialge ./ge_data/size2048x2048

## General record
$ perf record ./serialge ./ge_data/size64x64

## Record with stack trace
$ perf record -g ./serialge ./ge_data/size64x64

## Print the report to the terminal
$ perf report | cat

## Print the report based on each thread to the terminal
## Useful for observing multithread overhead
$ perf report -s pid | cat

## Print the report with stack trace information to the terminal
$ perf report -g graph,0.5,caller | cat
```

* [Linux perf Examples](http://www.brendangregg.com/perf.html)

### `offcputime` of BPF Compiler Collection (BCC)

```shell
## Off-CPU Time
./${PROG} ./ge_data/size8192x8192 &
sudo offcputime-bpfcc -df -p `pgrep -nx ${PROG}` 30 > offcpu.stacks
pkill ${PROG}
```

* [Off-CPU Analysis](http://www.brendangregg.com/offcpuanalysis.html)
* [speedscope](https://www.speedscope.app/)

### Flame Graph

```shell
## Declare the program and matrix files
PROG=hybridge
FILES="size2048x2048 size1024x1024 size512x512"

## On-CPU Time
for f in $FILES
do
    sudo perf record -g mpirun --allow-run-as-root -n 3 ./${PROG} ./ge_data/${f}
    sudo perf script -i perf.data &> perf.unfold
    ./stackcollapse-perf.pl perf.unfold &> perf.folded
    grep ${PROG} perf.folded \
        | ./flamegraph.pl \
            --title="On-CPU Time Flame Graph: $ ./${PROG} ./ge_data/${f} (bonus)" \
        > oncpu.bonus.${f}.svg
done

## Off-CPU Time
./${PROG} ./ge_data/size8192x8192 &
sudo offcputime-bpfcc -df -p `pgrep -nx ${PROG}` 30 > offcpu.stacks
pkill ${PROG}
./flamegraph.pl --color=io \
    --title="Off-CPU Time Flame Graph (30s): $ ./${PROG} ./ge_data/size8192x8192 (bonus)" \
    --countname=us < ./offcpu.stacks > offcpu.bonus.size8192x8192.svg
```

* [CPU Flame Graphs](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html)

### `valgrind`

```shell
## Dynamic instrumentation analysis using Valgrind
## Callgrind: count the call stacks
$ valgrind --tool=callgrind -- ./serialge ./ge_data/size64x64

## Cachegrind: measure the cache events
$ valgrind --tool=cachegrind -- ./serialge ./ge_data/size64x64
```

* [Valgrind Home](https://valgrind.org/)

### `kcachegrind`

```shell
## Visualization of the reports from Valgrind
$ kcachegrind callgrind.out.26081
$ kcachegrind cachegrind.out.26180
```

* [KDE/kcachegrind: GUI to profilers such as Valgrind](https://github.com/KDE/kcachegrind)

### `mpiP`

```shell
$ LD_PRELOAD=/usr/local/lib/libmpiP.so \
    mpirun -np 3 mpige ./ge_data/size2048x2048
$ less mpige.3.50899.1.mpiP
```

* [LLNL/mpiP: A light-weight MPI profiler.](https://github.com/LLNL/mpiP)

## Original Serial Program

### `perf stat` locally

![](https://i.imgur.com/Fa66mNo.png =500x)

* High ***`stalled-cycles-backend`*** percentage:
    * means that the stalls mainly happen after the instructions have been issued into the **`Issue Queue`**, **`L/S Queue`** or **`Reorder Buffer`**. ![](https://i.imgur.com/PwJzmX5.png =300x)
    * the stalled cycles may be due to waiting for **`memory operations`** or other **`long latency operations`**.
    * [linux - What are stalled-cycles-frontend and stalled-cycles-backend in 'perf stat' result? - Stack Overflow](https://stackoverflow.com/questions/22165299/what-are-stalled-cycles-frontend-and-stalled-cycles-backend-in-perf-stat-resul)

### `perf record` locally

![](https://i.imgur.com/wZFWHcy.png =500x)

### Dynamic instrumentation analysis locally

#### `valgrind --tool=callgrind`

![](https://i.imgur.com/2UVir8B.png =500x)

* We should focus on **`ForwardElimination()`**.

#### `valgrind --tool=cachegrind`

![](https://i.imgur.com/wa1GFCE.png =500x)

* There are some L1 data read misses in **`ForwardElimination()`**, and we can compare later values against this first measurement (baseline).

### `perf stat` on the Submission Server

```shell
stdout: Original Serial Program:
stdout:        36396339       cache-references          #   1.427 M/sec                    ( +-  0.69% )  (62.49%)
stdout:        18876172       cache-misses              #  51.863 % of all cache refs      ( +-  0.55% )  (62.50%)
stdout:        25498.70 msec  cpu-clock                 #   1.000 CPUs utilized            ( +-  0.01% )
stdout:              36       context-switches          #   0.001 K/sec                    ( +-  6.54% )
stdout:               0       cpu-migrations            #   0.000 K/sec
stdout:            8337       page-faults               #   0.327 K/sec                    ( +-  0.00% )
stdout:     68458659080       cycles                    #   2.685 GHz                      ( +-  0.02% )  (50.00%)
stdout:     55243414654       stalled-cycles-frontend   #  80.70% frontend cycles idle     ( +-  0.01% )  (50.01%)
stdout:     35294390248       stalled-cycles-backend    #  51.56% backend cycles idle      ( +-  0.21% )  (50.01%)
stdout:     42094964427       instructions              #   0.61  insn per cycle
stdout:                                                 #   1.31  stalled cycles per insn  ( +-  0.01% )  (62.51%)
stdout:      9591554066       branches                  # 376.158 M/sec                    ( +-  0.00% )  (62.50%)
stdout:         6967087       branch-misses             #   0.07% of all branches          ( +-  0.09% )  (62.50%)
stdout:
stdout:        25.50034 +- 0.00248 seconds time elapsed  ( +-  0.01% )
```

### Flame Graph (2048x2048)

<!--
<div class="iframe-body-sty">
    <iframe id="iframe-shrink" src="https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/serialge.1024x1024.svg"></iframe>
</div>
-->

![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.serialge.size2048x2048.svg)

### Other Flame Graphs

:::spoiler
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.serialge.size512x512.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.serialge.size1024x1024.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.serialge.size2048x2048.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg-Final/main/offcpu.serialge.size8192x8192.svg)
:::
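Both the callgrind output and the flame graphs point at **`ForwardElimination()`** as the dominant hot spot. For orientation, the sketch below shows the kind of kernel being profiled: straightforward GE on a row-major augmented matrix, as in the referenced Quinn slides. The signatures, the memory layout, and the variable names are assumptions for illustration; this is not the exact project code.

```cpp
// Minimal sketch of the assumed GE kernels (illustration only, not the project code).
// A is an n x (n+1) augmented matrix stored row-major in one contiguous array;
// x receives the solution vector.
void ForwardElimination(double *A, int n) {
    for (int k = 0; k < n; ++k) {              // pivot row; each step depends on the previous one
        for (int i = k + 1; i < n; ++i) {      // rows below the pivot are independent
            double ratio = A[i * (n + 1) + k] / A[k * (n + 1) + k];
            for (int j = k; j <= n; ++j)       // streaming update of one whole row
                A[i * (n + 1) + j] -= ratio * A[k * (n + 1) + j];
        }
    }
}

void BackwardSubstitution(const double *A, double *x, int n) {
    for (int i = n - 1; i >= 0; --i) {
        double sum = A[i * (n + 1) + n];       // right-hand-side entry of row i
        for (int j = i + 1; j < n; ++j)
            sum -= A[i * (n + 1) + j] * x[j];
        x[i] = sum / A[i * (n + 1) + i];
    }
}
```

The O(n^3) triple loop in `ForwardElimination()` streams over whole rows, which matches the backend stalls and cache misses measured above and makes it the natural target for the serial and parallel optimizations that follow.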
## Serial Optimization

### Notice

* Where do you see possible improvements through parallelization?
* Explain the logic for each optimization step.
* Hint: a proper optimization of the sequential code should render parallelization useless for smaller matrix sizes. Show why this happens in your presentation.

### `perf stat` on the Submission Server

```shell
stdout: Optimized Serial Program:
stdout:       256291902       cache-references          #  76.729 M/sec                    ( +-  0.30% )  (62.40%)
stdout:        43814301       cache-misses              #  17.095 % of all cache refs      ( +-  1.40% )  (62.45%)
stdout:         3340.23 msec  cpu-clock                 #   1.000 CPUs utilized            ( +-  0.16% )
stdout:               6       context-switches          #   0.002 K/sec                    ( +- 19.25% )
stdout:               0       cpu-migrations            #   0.000 K/sec
stdout:            9523       page-faults               #   0.003 M/sec                    ( +-  0.84% )
stdout:      8945290194       cycles                    #   2.678 GHz                      ( +-  0.13% )  (50.03%)
stdout:      4299727133       stalled-cycles-frontend   #  48.07% frontend cycles idle     ( +-  0.36% )  (50.11%)
stdout:      2294309912       stalled-cycles-backend    #  25.65% backend cycles idle      ( +-  0.60% )  (50.10%)
stdout:     12790654776       instructions              #   1.43  insn per cycle
stdout:                                                 #   0.34  stalled cycles per insn  ( +-  0.10% )  (62.59%)
stdout:      1725125330       branches                  # 516.470 M/sec                    ( +-  0.04% )  (62.50%)
stdout:         6838414       branch-misses             #   0.40% of all branches          ( +-  0.21% )  (62.42%)
stdout:
stdout:         3.34092 +- 0.00552 seconds time elapsed  ( +-  0.17% )
```

### Comparison Table

| SPEC | Original | Serial Optimization | Improvement |
| -------------------- | -------- | ------------------- | ---------------- |
| IPC | 0.61 | 1.43 | |
| cache-references | 36396339 | 256291902 | |
| cache-misses | 18876172 | 43814301 | |
| cache-misses (%) | 51.863% | 17.095% | |
| frontend cycles idle | 80.70% | 48.07% | |
| backend cycles idle | 51.56% | 25.65% | |
| time elapsed (s) | 25.50034 | 3.34092 | 7.6327 (SpeedUp) |

### Flame Graph (2048x2048)

![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.serialge_opt.size2048x2048.svg)

### Other Flame Graphs

:::spoiler
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.serialge_opt.size512x512.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.serialge_opt.size1024x1024.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.serialge_opt.size2048x2048.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg-Final/main/offcpu.serialge_opt.size8192x8192.svg)
:::

## OMP Optimization

* Tue 15.06.2021 23:59
* Submission file "ompge.cpp"

### `perf stat` on the Submission Server

```shell
stdout: Optimized OMP Program:
stdout:       215464679       cache-references          #  32.738 M/sec                    ( +-  1.53% )  (63.24%)
stdout:         4563143       cache-misses              #   2.118 % of all cache refs      ( +-  2.60% )  (63.41%)
stdout:         6581.54 msec  cpu-clock                 #   6.428 CPUs utilized            ( +-  0.74% )
stdout:             258       context-switches          #   0.039 K/sec                    ( +- 10.23% )
stdout:              72       cpu-migrations            #   0.011 K/sec                    ( +- 19.44% )
stdout:            8423       page-faults               #   0.001 M/sec                    ( +-  0.00% )
stdout:     17611176782       cycles                    #   2.676 GHz                      ( +-  0.83% )  (50.12%)
stdout:     12111969178       stalled-cycles-frontend   #  68.77% frontend cycles idle     ( +-  0.80% )  (49.48%)
stdout:      7903992783       stalled-cycles-backend    #  44.88% backend cycles idle      ( +-  1.45% )  (49.08%)
stdout:     13990348567       instructions              #   0.79  insn per cycle
stdout:                                                 #   0.87  stalled cycles per insn  ( +-  0.66% )  (62.06%)
stdout:      2062105804       branches                  # 313.317 M/sec                    ( +-  0.40% )  (62.12%)
stdout:         7495666       branch-misses             #   0.36% of all branches          ( +-  1.56% )  (62.55%)
stdout:
stdout:         1.02396 +- 0.00432 seconds time elapsed  ( +-  0.42% )
stdout: Optimized OMP Program with SIMD:
stdout:       211111314       cache-references          #  30.074 M/sec                    ( +-  1.41% )  (62.68%)
stdout:         4312060       cache-misses              #   2.043 % of all cache refs      ( +-  3.04% )  (63.58%)
stdout:         7019.81 msec  cpu-clock                 #   6.804 CPUs utilized            ( +-  4.03% )
stdout:             309       context-switches          #   0.044 K/sec                    ( +- 18.90% )
stdout:              98       cpu-migrations            #   0.014 K/sec                    ( +-  9.48% )
stdout:            8422       page-faults               #   0.001 M/sec                    ( +-  0.01% )
stdout:     18547516255       cycles                    #   2.642 GHz                      ( +-  3.94% )  (51.48%)
stdout:     12837235540       stalled-cycles-frontend   #  69.21% frontend cycles idle     ( +-  4.51% )  (50.60%)
stdout:      8422308546       stalled-cycles-backend    #  45.41% backend cycles idle      ( +-  5.88% )  (49.18%)
stdout:     14631081994       instructions              #   0.79  insn per cycle
stdout:                                                 #   0.88  stalled cycles per insn  ( +-  1.43% )  (61.33%)
stdout:      2187065496       branches                  # 311.556 M/sec                    ( +-  3.22% )  (60.96%)
stdout:         7668582       branch-misses             #   0.35% of all branches          ( +-  0.58% )  (61.51%)
stdout:
stdout:         1.0318 +- 0.0105 seconds time elapsed  ( +-  1.02% )
```

### Measurement Table - `omp for`

| SPEC | Serial Optimization | FOR Optimization | Improvement |
| -------------------- | ------------------- | ---------------- | ---------------- |
| IPC | 1.43 | 0.79 | - |
| cache-references | 256291902 | 215464679 | - |
| cache-misses | 43814301 | 4563143 | - |
| cache-misses (%) | 17.095% | 2.118% | - |
| frontend cycles idle | 48.07% | 68.77% | - |
| backend cycles idle | 25.65% | 44.88% | - |
| time elapsed (s) | 3.34092 | 1.02396 | 3.2708 (SpeedUp) |

### Measurement Table - `omp simd`

| SPEC | FOR Optimization | FOR + SIMD Optimization | Improvement |
| -------------------- | ---------------- | ----------------------- | ---------------- |
| IPC | 0.79 | 0.79 | - |
| cache-references | 215464679 | 211111314 | - |
| cache-misses | 4563143 | 4312060 | - |
| cache-misses (%) | 2.118% | 2.043% | - |
| frontend cycles idle | 68.77% | 69.21% | - |
| backend cycles idle | 44.88% | 45.41% | - |
| time elapsed (s) | 1.02396 | 1.0318 | 0.9924 (SpeedUp) |

### Flame Graph (2048x2048)

![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.ompge.size2048x2048.svg)

### Other Flame Graphs

:::spoiler
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.ompge.size512x512.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.ompge.size1024x1024.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.ompge.size2048x2048.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg-Final/main/offcpu.ompge.size8192x8192.svg)
:::

### Amdahl's Law Calculations

* **Methodology** (a worked example follows below)
    * Run perf on the optimized serial code, and determine the time taken in ForwardElimination.
    * Use this percentage in Amdahl's Law and calculate the theoretical speedup.
    * Then calculate the experimental speedup using perf. SIMD is not considered!
* [Amdahl's Law Calculations - Google Sheets](https://docs.google.com/spreadsheets/d/1XL3Ti2cEkmKaU7CVWf9Fil3K2e0xWhpUakXvcS-04ZQ/edit#gid=747307773)
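For reference, the calculation behind that sheet is the usual Amdahl bound, where $p$ is the fraction of the optimized serial runtime spent in the parallelized part (ForwardElimination) and $N$ is the number of threads. The numbers plugged in below are illustrative only, not the values from the sheet:

$$
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad
\text{e.g. } p = 0.95,\ N = 8 \;\Rightarrow\; S(8) = \frac{1}{0.05 + 0.95/8} \approx 5.9
$$

The experimental speedup measured above (3.34092 s serial vs. 1.02396 s with `omp for`, about 3.27x) is then compared against this theoretical bound.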
### Initial results

#### Codes

![](https://i.imgur.com/HQUHusR.png =500x)

#### `perf stat`

![](https://i.imgur.com/Hbq3M6p.png =500x)

:::spoiler
```shell
 Performance counter stats for './ompge ./ge_data/size512x512' (5 runs):

         264.32 msec  task-clock                #   5.445 CPUs utilized            ( +-  3.55% )
            606       context-switches          #   0.002 M/sec                    ( +- 11.62% )
              1       cpu-migrations            #   0.004 K/sec                    ( +- 44.72% )
            702       page-faults               #   0.003 M/sec                    ( +-  0.07% )
    737,283,512       cycles                    #   2.789 GHz                      ( +-  5.07% )
    315,708,039       stalled-cycles-frontend   #  42.82% frontend cycles idle     ( +-  9.48% )
     22,312,372       stalled-cycles-backend    #   3.03% backend cycles idle      ( +-  3.08% )
    493,289,410       instructions              #   0.67  insn per cycle
                                                #   0.64  stalled cycles per insn  ( +-  0.78% )
     91,134,557       branches                  # 344.785 M/sec                    ( +-  1.21% )
        220,230       branch-misses             #   0.24% of all branches          ( +-  1.10% )

        0.04854 +- 0.00153 seconds time elapsed  ( +-  3.15% )

 Performance counter stats for './ompge ./ge_data/size1024x1024' (5 runs):

         666.17 msec  task-clock                #   4.252 CPUs utilized            ( +-  1.24% )
            723       context-switches          #   0.001 M/sec                    ( +- 12.63% )
              0       cpu-migrations            #   0.000 K/sec
          2,241       page-faults               #   0.003 M/sec                    ( +-  0.04% )
  2,427,979,476       cycles                    #   3.645 GHz                      ( +-  1.12% )
    574,446,588       stalled-cycles-frontend   #  23.66% frontend cycles idle     ( +-  1.99% )
    112,657,599       stalled-cycles-backend    #   4.64% backend cycles idle      ( +-  0.90% )
  2,323,606,967       instructions              #   0.96  insn per cycle
                                                #   0.25  stalled cycles per insn  ( +-  0.12% )
    373,124,245       branches                  # 560.102 M/sec                    ( +-  0.21% )
        777,271       branch-misses             #   0.21% of all branches          ( +-  0.37% )

        0.15669 +- 0.00213 seconds time elapsed  ( +-  1.36% )

 Performance counter stats for './ompge ./ge_data/size2048x2048' (5 runs):

      15,435.28 msec  task-clock                #   5.979 CPUs utilized            ( +-  0.77% )
        128,235       context-switches          #   0.008 M/sec                    ( +-  0.12% )
         36,035       cpu-migrations            #   0.002 M/sec                    ( +-  1.29% )
          8,425       page-faults               #   0.546 K/sec                    ( +-  0.01% )
 50,178,450,443       cycles                    #   3.251 GHz                      ( +-  0.86% )
 21,780,685,616       stalled-cycles-frontend   #  43.41% frontend cycles idle     ( +-  1.03% )
  5,790,393,909       stalled-cycles-backend    #  11.54% backend cycles idle      ( +-  0.80% )
 14,331,798,547       instructions              #   0.29  insn per cycle
                                                #   1.52  stalled cycles per insn  ( +-  0.02% )
  2,073,070,667       branches                  # 134.307 M/sec                    ( +-  0.03% )
      8,928,151       branch-misses             #   0.43% of all branches          ( +-  0.61% )

         2.5816 +- 0.0328 seconds time elapsed  ( +-  1.27% )

 Performance counter stats for './ompge ./ge_data/size4096x4096' (5 runs):

     149,998.28 msec  task-clock                #   8.211 CPUs utilized            ( +-  1.71% )
        286,910       context-switches          #   0.002 M/sec                    ( +-  0.35% )
         92,869       cpu-migrations            #   0.619 K/sec                    ( +-  1.18% )
         33,014       page-faults               #   0.220 K/sec                    ( +-  0.00% )
574,591,154,283       cycles                    #   3.831 GHz                      ( +-  1.60% )
274,887,878,693       stalled-cycles-frontend   #  47.84% frontend cycles idle     ( +-  0.75% )
 48,741,150,001       stalled-cycles-backend    #   8.48% backend cycles idle      ( +-  1.93% )
 86,392,307,943       instructions              #   0.15  insn per cycle
                                                #   3.18  stalled cycles per insn  ( +-  0.03% )
 10,612,046,393       branches                  #  70.748 M/sec                    ( +-  0.04% )
     34,691,265       branch-misses             #   0.33% of all branches          ( +-  0.81% )

         18.269 +- 0.237 seconds time elapsed  ( +-  1.30% )
```
:::

### Intermediate results

#### Codes

![](https://i.imgur.com/kXPBgcC.png =500x)

#### `perf stat`

![](https://i.imgur.com/ANdgvR7.png =500x)

:::spoiler
```shell
 Performance counter stats for './ompge ./ge_data/size512x512' (5 runs):

         630.58 msec  task-clock                #   4.514 CPUs utilized            ( +-  3.20% )
         31,581       context-switches          #   0.050 M/sec                    ( +-  0.29% )
          8,700       cpu-migrations            #   0.014 M/sec                    ( +-  1.11% )
            736       page-faults               #   0.001 M/sec                    ( +-  0.07% )
  1,312,049,911       cycles                    #   2.081 GHz                      ( +-  4.33% )
    379,265,192       stalled-cycles-frontend   #  28.91% frontend cycles idle     ( +- 11.42% )
    130,674,577       stalled-cycles-backend    #   9.96% backend cycles idle      ( +-  0.53% )
    827,190,457       instructions              #   0.63  insn per cycle
                                                #   0.46  stalled cycles per insn  ( +-  0.77% )
    164,415,402       branches                  # 260.736 M/sec                    ( +-  1.06% )
      1,087,639       branch-misses             #   0.66% of all branches          ( +-  0.24% )

        0.13970 +- 0.00109 seconds time elapsed  ( +-  0.78% )

 Performance counter stats for './ompge ./ge_data/size1024x1024' (5 runs):

       1,551.52 msec  task-clock                #   4.356 CPUs utilized            ( +-  3.38% )
         62,970       context-switches          #   0.041 M/sec                    ( +-  0.08% )
         17,645       cpu-migrations            #   0.011 M/sec                    ( +-  1.69% )
          2,274       page-faults               #   0.001 M/sec                    ( +-  0.03% )
  4,057,597,625       cycles                    #   2.615 GHz                      ( +-  4.80% )
    961,906,690       stalled-cycles-frontend   #  23.71% frontend cycles idle     ( +-  9.21% )
    342,219,413       stalled-cycles-backend    #   8.43% backend cycles idle      ( +-  2.57% )
  3,039,407,567       instructions              #   0.75  insn per cycle
                                                #   0.32  stalled cycles per insn  ( +-  0.66% )
    533,379,028       branches                  # 343.777 M/sec                    ( +-  1.06% )
      2,497,786       branch-misses             #   0.47% of all branches          ( +-  0.54% )

        0.35619 +- 0.00834 seconds time elapsed  ( +-  2.34% )

 Performance counter stats for './ompge ./ge_data/size2048x2048' (5 runs):

      17,014.81 msec  task-clock                #   5.825 CPUs utilized            ( +-  0.54% )
        251,846       context-switches          #   0.015 M/sec                    ( +-  0.31% )
         68,261       cpu-migrations            #   0.004 M/sec                    ( +-  1.43% )
          8,425       page-faults               #   0.495 K/sec                    ( +-  0.01% )
 52,808,551,492       cycles                    #   3.104 GHz                      ( +-  0.48% )
 22,414,794,512       stalled-cycles-frontend   #  42.45% frontend cycles idle     ( +-  1.59% )
  6,320,426,551       stalled-cycles-backend    #  11.97% backend cycles idle      ( +-  0.91% )
 15,690,574,537       instructions              #   0.30  insn per cycle
                                                #   1.43  stalled cycles per insn  ( +-  0.11% )
  2,375,361,565       branches                  # 139.606 M/sec                    ( +-  0.16% )
     12,082,568       branch-misses             #   0.51% of all branches          ( +-  0.99% )

         2.9211 +- 0.0549 seconds time elapsed  ( +-  1.88% )

 Performance counter stats for './ompge ./ge_data/size4096x4096' (5 runs):

     158,882.20 msec  task-clock                #   8.454 CPUs utilized            ( +-  1.06% )
        532,286       context-switches          #   0.003 M/sec                    ( +-  0.20% )
        159,671       cpu-migrations            #   0.001 M/sec                    ( +-  1.91% )
         33,014       page-faults               #   0.208 K/sec                    ( +-  0.00% )
598,956,780,751       cycles                    #   3.770 GHz                      ( +-  1.05% )
277,193,463,570       stalled-cycles-frontend   #  46.28% frontend cycles idle     ( +-  0.27% )
 50,828,962,230       stalled-cycles-backend    #   8.49% backend cycles idle      ( +-  1.27% )
 89,076,682,733       instructions              #   0.15  insn per cycle
                                                #   3.11  stalled cycles per insn  ( +-  0.03% )
 11,210,287,810       branches                  #  70.557 M/sec                    ( +-  0.05% )
     41,578,396       branch-misses             #   0.37% of all branches          ( +-  0.79% )

         18.793 +- 0.268 seconds time elapsed  ( +-  1.42% )
```
:::

### Documentation

* Explain the parallelization strategies used and the logic behind them (a simplified code sketch follows below).
* Use a diagram to show the parallelization strategy and the data required at each step of the parallelized algorithm.
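Before moving on to MPI, here is a minimal sketch of the loop-level OpenMP strategy the measurements above refer to: the outer pivot loop stays sequential because each elimination step depends on the previous one, the row loop is shared among threads, and the innermost row update can be vectorized with `omp simd`. It reuses the assumed kernel layout from the earlier sketch and is a simplified illustration, not the submitted `ompge.cpp`.

```cpp
#include <omp.h>

// Simplified OpenMP version of the assumed ForwardElimination kernel
// (illustration only, not the submitted ompge.cpp).
void ForwardEliminationOMP(double *A, int n) {
    for (int k = 0; k < n; ++k) {                   // sequential: step k needs the result of step k-1
        #pragma omp parallel for schedule(static)   // rows below the pivot are independent
        for (int i = k + 1; i < n; ++i) {
            double ratio = A[i * (n + 1) + k] / A[k * (n + 1) + k];
            #pragma omp simd                        // contiguous row update, SIMD-friendly
            for (int j = k; j <= n; ++j)
                A[i * (n + 1) + j] -= ratio * A[k * (n + 1) + j];
        }
    }
}
```

Because a parallel region is entered once per pivot step and the remaining sub-matrix shrinks as `k` grows, the fork/join and scheduling overhead grows relative to the useful work, which is consistent with the hint above that a well-optimized serial version can beat the parallel one for small matrices.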
## MPI Optimization

* Tue 22.06.2021 23:59
* Submission file "mpige.cpp"

### Development List

* [x] Communication skeleton
* [x] Preparing data
* [x] MPI_Scatter() and MPI_Gather()
* [x] MPI_Isend() and MPI_Irecv()
    * use the Pipelined, Row-Oriented Algorithm (slide 39)
* [x] ForwardElimination() parallelization
* [x] [Optional] BackwardSubstitution() parallelization
* [x] Optimize communication
* [x] Measurement
* [x] Documentation

### Flame Graph (2048x2048)

![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.mpige.size2048x2048.svg)

### Other Flame Graphs

:::spoiler
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.mpige.size512x512.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.mpige.size1024x1024.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.mpige.size2048x2048.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg-Final/main/offcpu.mpige.size8192x8192.svg)
:::

### Initial Results - MPI_Ibcast

#### Codes

![](https://i.imgur.com/6qdZHKX.png =600x)

#### `perf stat`

![](https://i.imgur.com/5FYQSAw.png =600x)

:::spoiler
```shell
 Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size512x512' (5 runs):

         264.72 msec  task-clock                #    1.594 CPUs utilized            ( +-  2.31% )
            386       context-switches          #    0.001 M/sec                    ( +-  1.79% )
             55       cpu-migrations            #    0.207 K/sec                    ( +-  6.49% )
          9,144       page-faults               #    0.035 M/sec                    ( +-  0.03% )
    796,795,509       cycles                    #    3.010 GHz                      ( +-  3.21% )
     23,601,010       stalled-cycles-frontend   #    2.96% frontend cycles idle     ( +-  1.30% )
    255,426,677       stalled-cycles-backend    #   32.06% backend cycles idle      ( +-  3.21% )
  1,679,943,613       instructions              #    2.11  insn per cycle
                                                #    0.15  stalled cycles per insn  ( +-  2.72% )
    339,717,099       branches                  # 1283.305 M/sec                    ( +-  2.84% )
      1,397,471       branch-misses             #    0.41% of all branches          ( +-  1.58% )

        0.16604 +- 0.00512 seconds time elapsed  ( +-  3.08% )

 Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size1024x1024' (5 runs):

         826.36 msec  task-clock                #    2.372 CPUs utilized            ( +-  2.32% )
            478       context-switches          #    0.579 K/sec                    ( +-  1.47% )
             51       cpu-migrations            #    0.061 K/sec                    ( +-  7.69% )
         13,858       page-faults               #    0.017 M/sec                    ( +-  0.03% )
  3,140,217,954       cycles                    #    3.800 GHz                      ( +-  2.17% )
     76,546,845       stalled-cycles-frontend   #    2.44% frontend cycles idle     ( +-  8.55% )
  1,325,883,824       stalled-cycles-backend    #   42.22% backend cycles idle      ( +-  1.73% )
  6,446,686,641       instructions              #    2.05  insn per cycle
                                                #    0.21  stalled cycles per insn  ( +-  2.14% )
  1,232,693,549       branches                  # 1491.714 M/sec                    ( +-  2.44% )
      2,488,624       branch-misses             #    0.20% of all branches          ( +-  1.53% )

        0.34832 +- 0.00678 seconds time elapsed  ( +-  1.95% )

 Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size2048x2048' (5 runs):

       6,938.97 msec  task-clock                #    2.873 CPUs utilized            ( +-  2.90% )
          1,330       context-switches          #    0.192 K/sec                    ( +-  2.21% )
             76       cpu-migrations            #    0.011 K/sec                    ( +-  8.44% )
         32,500       page-faults               #    0.005 M/sec                    ( +-  0.02% )
 28,169,268,111       cycles                    #    4.060 GHz                      ( +-  2.83% )
  2,745,914,988       stalled-cycles-frontend   #    9.75% frontend cycles idle     ( +- 16.85% )
 15,987,882,751       stalled-cycles-backend    #   56.76% backend cycles idle      ( +-  1.33% )
 30,822,596,196       instructions              #    1.09  insn per cycle
                                                #    0.52  stalled cycles per insn  ( +-  1.63% )
  5,430,351,277       branches                  #  782.588 M/sec                    ( +-  1.78% )
      8,610,384       branch-misses             #    0.16% of all branches          ( +-  3.42% )

         2.4154 +- 0.0706 seconds time elapsed  ( +-  2.92% )

 Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size4096x4096' (5 runs):

      48,931.88 msec  task-clock                #    2.963 CPUs utilized            ( +-  2.11% )
          6,836       context-switches          #    0.140 K/sec                    ( +-  3.09% )
            169       cpu-migrations            #    0.003 K/sec                    ( +-  3.89% )
        106,641       page-faults               #    0.002 M/sec                    ( +-  0.00% )
198,037,246,253       cycles                    #    4.047 GHz                      ( +-  2.00% )
 21,812,123,320       stalled-cycles-frontend   #   11.01% frontend cycles idle     ( +- 12.63% )
123,291,160,551       stalled-cycles-backend    #   62.26% backend cycles idle      ( +-  0.87% )
167,948,115,577       instructions              #    0.85  insn per cycle
                                                #    0.73  stalled cycles per insn  ( +-  1.31% )
 26,913,007,412       branches                  #  550.010 M/sec                    ( +-  1.52% )
     39,952,365       branch-misses             #    0.15% of all branches          ( +-  3.27% )

         16.515 +- 0.346 seconds time elapsed  ( +-  2.10% )
```
:::

#### `mpiP`

![](https://i.imgur.com/Hy5PRRT.png =600x)

### Intermediate results - Parallel BS()

#### Codes

![](https://i.imgur.com/jcYHAUR.png =600x)

#### `perf stat`

![](https://i.imgur.com/ym9ZXZD.png =600x)

:::spoiler
```shell
 Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size512x512' (5 runs):

         248.11 msec  task-clock                #    1.540 CPUs utilized            ( +-  3.12% )
            389       context-switches          #    0.002 M/sec                    ( +-  2.40% )
             46       cpu-migrations            #    0.187 K/sec                    ( +-  9.80% )
          9,571       page-faults               #    0.039 M/sec                    ( +-  0.07% )
    762,786,422       cycles                    #    3.074 GHz                      ( +-  4.72% )
     26,023,616       stalled-cycles-frontend   #    3.41% frontend cycles idle     ( +-  5.67% )
    225,481,941       stalled-cycles-backend    #   29.56% backend cycles idle      ( +-  5.52% )
  1,573,853,837       instructions              #    2.06  insn per cycle
                                                #    0.14  stalled cycles per insn  ( +-  2.78% )
    321,708,657       branches                  # 1296.653 M/sec                    ( +-  2.97% )
      1,313,746       branch-misses             #    0.41% of all branches          ( +-  0.76% )

        0.16110 +- 0.00833 seconds time elapsed  ( +-  5.17% )

 Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size1024x1024' (5 runs):

         929.34 msec  task-clock                #    2.425 CPUs utilized            ( +-  6.16% )
            498       context-switches          #    0.535 K/sec                    ( +-  3.36% )
             61       cpu-migrations            #    0.065 K/sec                    ( +-  7.68% )
         15,733       page-faults               #    0.017 M/sec                    ( +-  0.03% )
  3,554,226,495       cycles                    #    3.824 GHz                      ( +-  7.20% )
    206,006,990       stalled-cycles-frontend   #    5.80% frontend cycles idle     ( +- 28.37% )
  1,547,651,047       stalled-cycles-backend    #   43.54% backend cycles idle      ( +- 10.76% )
  6,387,034,798       instructions              #    1.80  insn per cycle
                                                #    0.24  stalled cycles per insn  ( +-  2.08% )
  1,242,126,964       branches                  # 1336.570 M/sec                    ( +-  2.24% )
      2,282,334       branch-misses             #    0.18% of all branches          ( +-  1.82% )

         0.3833 +- 0.0192 seconds time elapsed  ( +-  5.02% )

 Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size2048x2048' (5 runs):

       6,212.68 msec  task-clock                #    2.862 CPUs utilized            ( +-  0.91% )
          1,220       context-switches          #    0.196 K/sec                    ( +-  1.48% )
             64       cpu-migrations            #    0.010 K/sec                    ( +-  4.89% )
         40,341       page-faults               #    0.006 M/sec                    ( +-  0.01% )
 25,446,446,484       cycles                    #    4.096 GHz                      ( +-  0.73% )
  1,127,432,163       stalled-cycles-frontend   #    4.43% frontend cycles idle     ( +- 13.73% )
 16,067,156,571       stalled-cycles-backend    #   63.14% backend cycles idle      ( +-  1.25% )
 29,982,500,623       instructions              #    1.18  insn per cycle
                                                #    0.54  stalled cycles per insn  ( +-  1.05% )
  5,408,456,832       branches                  #  870.552 M/sec                    ( +-  1.22% )
      6,568,256       branch-misses             #    0.12% of all branches          ( +-  1.55% )

         2.1711 +- 0.0230 seconds time elapsed  ( +-  1.06% )

 Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size4096x4096' (5 runs):

      44,639.48 msec  task-clock                #    2.965 CPUs utilized            ( +-  0.48% )
          5,878       context-switches          #    0.132 K/sec                    ( +-  1.48% )
            139       cpu-migrations            #    0.003 K/sec                    ( +-  7.06% )
        138,687       page-faults               #    0.003 M/sec                    ( +-  0.00% )
183,081,496,312       cycles                    #    4.101 GHz                      ( +-  0.60% )
 11,107,441,243       stalled-cycles-frontend   #    6.07% frontend cycles idle     ( +-  3.95% )
124,142,472,219       stalled-cycles-backend    #   67.81% backend cycles idle      ( +-  0.64% )
162,602,667,589       instructions              #    0.89  insn per cycle
                                                #    0.76  stalled cycles per insn  ( +-  0.82% )
 26,673,044,086       branches                  #  597.521 M/sec                    ( +-  1.05% )
     29,169,660       branch-misses             #    0.11% of all branches          ( +-  1.18% )

        15.0575 +- 0.0710 seconds time elapsed  ( +-  0.47% )
```
:::

#### `mpiP`

![](https://i.imgur.com/77PMQcX.png =600x)

### Point-to-Point vs Broadcast

![](https://docs.google.com/drawings/d/e/2PACX-1vQgiwAljjLcqs59WCrovKzLU7ac6vWPlN9whcc8X7S4TcbLKA-QLVn3vNB4Qom2T_SxKDvTlLof1FNH/pub?w=1115&h=463)

Broadcast is more efficient, as it internally distributes the messaging overhead.

![](https://docs.google.com/drawings/d/e/2PACX-1vRDxhb_Vf5Dz6Cfzr7cJmSmHT7F5qEOiUOJsllTPz0v8IddlHA2El1gAgDPdQU3ggrjmp7I6WZD7-6G/pub?w=657&h=463)

## Hybrid (MPI + OMP) Optimization + Bonus (Optional)

* Tue 29.06.2021 23:59
* Submission file "hybridge.cpp"
* Submission file "bonusge.cpp" (optional)

### Hybrid (MPI + OMP)

#### Flame Graph (2048x2048)

![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.hybridge.size2048x2048.svg)

#### Other Flame Graphs

:::spoiler
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.hybridge.size512x512.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.hybridge.size1024x1024.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.hybridge.size2048x2048.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg-Final/main/offcpu.hybridge.size8192x8192.svg)
:::

### Bonus

#### Flame Graph (2048x2048)

![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.bonus.size2048x2048.svg)

#### Other Flame Graphs

:::spoiler
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.bonus.size512x512.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.bonus.size1024x1024.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/oncpu.bonus.size2048x2048.svg)
![](https://raw.githubusercontent.com/leviliangtw/TUM-ParProg-Final/main/offcpu.bonus.size8192x8192.svg)
:::

## Presentation

* Fri 09.07.2021 13:20 - 14:10

## References

* [Cornell Virtual Workshop: Amdahl's Law](https://cvw.cac.cornell.edu/vector/performance_amdahl)
* [Gaussian Elimination](https://www.cs.rutgers.edu/~venugopa/parallel_summer2012/ge.html)
* [Gaussian Elimination parallel slides](https://parallelcomp.github.io/Gaussian-Quinn.pdf)
* [Loop dependence analysis - Wikipedia](https://en.wikipedia.org/wiki/Loop_dependence_analysis)
* [Linux perf Examples](http://www.brendangregg.com/perf.html)
* [CPU Flame Graphs](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html)
* [Off-CPU Analysis](http://www.brendangregg.com/offcpuanalysis.html)
* [LLNL/mpiP: A light-weight MPI profiler.](https://github.com/LLNL/mpiP)
* [SIMD Directives](https://www.openmp.org/spec-html/5.0/openmpsu42.html)
