# Parprog Final Project - Gaussian Elimination (GE)
###### tags: `SS2021-IN2147-PP`
<style>
.iframe-body-sty {
position: relative;
overflow: hidden;
height: 300px;
width: 600px;
}
.iframe-body-sty > #iframe-shrink {
position: absolute;
transform: scale(0.5);
transform-origin: 0 0;
height: 200%;
width: 200%;
}
</style>
## Introduction
This is an HPC project to parallelize **`Gaussian Elimination (GE)`** using **`OpenMP`**, **`MPI`**, and a **`Hybrid Method`** (OpenMP + MPI).
First, we applied some **`serial optimization`** to the original GE program.
Then, we parallelized it using **`OpenMP`**.
Next, we parallelized it with **`MPI`**.
Finally, we combined the OpenMP and MPI implementations into the **`Hybrid`** implementation.
In addition, we applied further optimizations to operations that are suited to **`SIMD Intrinsics`**.
<!-- ## Repository, Submisstion Server and Chat Room
* [Group-213 · GitLab](https://gitlab.lrz.de/lrr-tum/teaching/parprog/ss2021/project-material/group-213)
* [Parallel Programming Submission Tool](https://parprog.caps.in.tum.de/Submission/home)
* [CHAT.TUM.DE](https://chat.tum.de/channel/parprog21) -->
<!-- ## Schedule
```
Sat: D-1 => Discuss a little bit, come up with some ideas
Sun: D+0 => Work together
Mon: D+1 => Remedial time + Documentation
Tue: D+2 => Documentation + Submission
``` -->
## Useful Measurement Tools and Commands
### `perf`
```shell
## Run for 5 rounds
## Measure the cache-references and cache-misses
$ perf stat \
-r 5 -e cache-references,cache-misses \
./serialge ./ge_data/size2048x2048
## General record
$ perf record ./serialge ./ge_data/size64x64
## Record with stack trace
$ perf record -g ./serialge ./ge_data/size64x64
## Print the report to the terminal
$ perf report | cat
## Print the per-thread report (grouped by pid) to the terminal
## Useful for observing multithreading overhead
$ perf report -s pid | cat
## Print the report with stack trace information to the terminal
$ perf report -g graph,0.5,caller | cat
```
* [Linux perf Examples](http://www.brendangregg.com/perf.html)
### `offcputime` of BPF Compiler Collection (BCC)
```shell
## Off-CPU Time
./${PROG} ./ge_data/size8192x8192 &
sudo offcputime-bpfcc -df -p `pgrep -nx ${PROG}` 30 > offcpu.stacks
pkill ${PROG}
```
* [Off-CPU Analysis](http://www.brendangregg.com/offcpuanalysis.html)
* [speedscope](https://www.speedscope.app/)
### Flame Graph
```shell
## Declare the program and matrix files
PROG=hybridge
FILES="size2048x2048 size1024x1024 size512x512"
## On-CPU Time
for f in $FILES
do
sudo perf record -g mpirun --allow-run-as-root -n 3 ./${PROG} ./ge_data/${f}
sudo perf script -i perf.data &> perf.unfold
./stackcollapse-perf.pl perf.unfold &> perf.folded
grep ${PROG} perf.folded \
| ./flamegraph.pl \
--title="On-CPU Time Flame Graph: $ ./${PROG} ./ge_data/${f} (bonus)" \
> oncpu.bonus.${f}.svg
done
## Off-CPU Time
./${PROG} ./ge_data/size8192x8192 &
sudo offcputime-bpfcc -df -p `pgrep -nx ${PROG}` 30 > offcpu.stacks
pkill ${PROG}
./flamegraph.pl --color=io \
--title="Off-CPU Time Flame Graph (30s): $ ./${PROG} ./ge_data/size8192x8192 (bonus)" \
--countname=us < ./offcpu.stacks > offcpu.bonus.size8192x8192.svg
```
* [CPU Flame Graphs](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html)
### `valgrind`
```shell
## Dynamic instrumentation analysis using Valgrind
## Callgrind: profile the call graph and call counts
$ valgrind --tool=callgrind -- ./serialge ./ge_data/size64x64
## Cachegrind: measure cache events
$ valgrind --tool=cachegrind -- ./serialge ./ge_data/size64x64
```
* [Valgrind Home](https://valgrind.org/)
### `kcachegrind`
```shell
## Visualization of the reports produced by Valgrind
$ kcachegrind callgrind.out.26081
$ kcachegrind cachegrind.out.26180
```
* [KDE/kcachegrind: GUI to profilers such as Valgrind](https://github.com/KDE/kcachegrind)
### `mpiP`
```shell
$ LD_PRELOAD=/usr/local/lib/libmpiP.so \
mpirun -np 3 mpige ./ge_data/size2048x2048
$ less mpige.3.50899.1.mpiP
```
* [LLNL/mpiP: A light-weight MPI profiler.](https://github.com/LLNL/mpiP)
## Original Serial Program
### `perf stat` locally

* High ***`stalled-cycles-backend`*** percentage:
    * means that the stall mainly happens after the instructions were issued into the **`Issue Queue`**, **`L/S Queue`**, or **`Reorder Buffer`**.
    * the stalled cycles may be caused by waiting for **`memory operations`** or other **`long latency operations`**.
    * [linux - What are stalled-cycles-frontend and stalled-cycles-backend in 'perf stat' result? - Stack Overflow](https://stackoverflow.com/questions/22165299/what-are-stalled-cycles-frontend-and-stalled-cycles-backend-in-perf-stat-resul)
### `perf record` locally

### Dynamic instrumentation analysis locally
#### `valgrind --tool=callgrind`

* We should focus on **`ForwardElimination()`**
#### `valgrind --tool=cachegrind`

* There are some L1 data read misses in **`ForwardElimination()`**; we use this first measurement as the baseline for later comparisons.
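
For orientation, here is a minimal sketch of the textbook forward-elimination loop nest (the function name, the vector-of-vectors layout, and the appended right-hand-side column are assumptions for illustration, not necessarily the handout code). Every pivot step re-sweeps all remaining rows, which is where the data read misses above come from once the working set no longer fits in L1.

```cpp
#include <vector>

// Textbook forward elimination on an n x n system; the right-hand side is
// assumed to be stored as an extra column at index n.
void ForwardElimination(std::vector<std::vector<double>> &a, int n) {
    for (int k = 0; k < n - 1; ++k) {              // pivot row k
        for (int i = k + 1; i < n; ++i) {          // every row below the pivot
            const double factor = a[i][k] / a[k][k];
            for (int j = k; j <= n; ++j)           // eliminate column k
                a[i][j] -= factor * a[k][j];
        }
    }
}
```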
### `perf stat` on the Submission Server
```shell
stdout: Original Serial Program:
stdout: 36396339 cache-references # 1.427 M/sec ( +- 0.69% ) (62.49%)
stdout: 18876172 cache-misses # 51.863 % of all cache refs ( +- 0.55% ) (62.50%)
stdout: 25498.70 msec cpu-clock # 1.000 CPUs utilized ( +- 0.01% )
stdout: 36 context-switches # 0.001 K/sec ( +- 6.54% )
stdout: 0 cpu-migrations # 0.000 K/sec
stdout: 8337 page-faults # 0.327 K/sec ( +- 0.00% )
stdout: 68458659080 cycles # 2.685 GHz ( +- 0.02% ) (50.00%)
stdout: 55243414654 stalled-cycles-frontend # 80.70% frontend cycles idle ( +- 0.01% ) (50.01%)
stdout: 35294390248 stalled-cycles-backend # 51.56% backend cycles idle ( +- 0.21% ) (50.01%)
stdout: 42094964427 instructions # 0.61 insn per cycle
stdout: # 1.31 stalled cycles per insn ( +- 0.01% ) (62.51%)
stdout: 9591554066 branches # 376.158 M/sec ( +- 0.00% ) (62.50%)
stdout: 6967087 branch-misses # 0.07% of all branches ( +- 0.09% ) (62.50%)
stdout:
stdout: 25.50034 +- 0.00248 seconds time elapsed ( +- 0.01% )
```
### Flame Graph (2048x2048)
<!-- <div class="iframe-body-sty">
<iframe id="iframe-shrink" src="https://raw.githubusercontent.com/leviliangtw/TUM-ParProg/main/serialge.1024x1024.svg"></iframe>
</div>
</div> -->

### Other Flame Graphs
:::spoiler




:::
## Serial Optimization
### Notice
* Where do you see possible improvements through parallelization?
* Explain the logic behind each optimization step (a sketch of typical changes follows below).
* Hint: A proper optimization of sequential code should render parallelization useless for smaller matrix sizes. Show why this happens in your presentation.
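
As a reference point for the measurements below, this is a minimal sketch of the kind of serial optimization meant here, under the same assumptions as the baseline sketch above: one contiguous row-major buffer (row length `stride`, typically `n + 1` with the right-hand side appended), the pivot reciprocal hoisted out of the inner loop, and an inner loop the compiler can vectorize. The actual submitted code may differ.

```cpp
#include <cstddef>

// Serially optimized elimination kernel (sketch): contiguous row-major storage,
// one division per pivot row instead of one per eliminated element, and the
// inner loop starting at k + 1 because column k of every lower row becomes 0.
void ForwardEliminationOpt(double *a, int n, int stride) {
    for (int k = 0; k < n - 1; ++k) {
        const double *pivot_row = &a[(std::size_t)k * stride];
        const double pivot_inv = 1.0 / pivot_row[k];
        for (int i = k + 1; i < n; ++i) {
            double *row = &a[(std::size_t)i * stride];
            const double factor = row[k] * pivot_inv;
            row[k] = 0.0;
            for (int j = k + 1; j <= n; ++j)   // column n holds the RHS
                row[j] -= factor * pivot_row[j];
        }
    }
}
```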
### `perf stat` on the Submission Server
```shell
stdout: Optimized Serial Program:
stdout: 256291902 cache-references # 76.729 M/sec ( +- 0.30% ) (62.40%)
stdout: 43814301 cache-misses # 17.095 % of all cache refs ( +- 1.40% ) (62.45%)
stdout: 3340.23 msec cpu-clock # 1.000 CPUs utilized ( +- 0.16% )
stdout: 6 context-switches # 0.002 K/sec ( +- 19.25% )
stdout: 0 cpu-migrations # 0.000 K/sec
stdout: 9523 page-faults # 0.003 M/sec ( +- 0.84% )
stdout: 8945290194 cycles # 2.678 GHz ( +- 0.13% ) (50.03%)
stdout: 4299727133 stalled-cycles-frontend # 48.07% frontend cycles idle ( +- 0.36% ) (50.11%)
stdout: 2294309912 stalled-cycles-backend # 25.65% backend cycles idle ( +- 0.60% ) (50.10%)
stdout: 12790654776 instructions # 1.43 insn per cycle
stdout: # 0.34 stalled cycles per insn ( +- 0.10% ) (62.59%)
stdout: 1725125330 branches # 516.470 M/sec ( +- 0.04% ) (62.50%)
stdout: 6838414 branch-misses # 0.40% of all branches ( +- 0.21% ) (62.42%)
stdout:
stdout: 3.34092 +- 0.00552 seconds time elapsed ( +- 0.17% )
```
### Comparison Table
| SPEC | Original | Serial Optimization | Improvement |
| -------------------- | -------- | ------------------- | ---------------- |
| IPC | 0.61 | 1.43 | |
| cache-references | 36396339 | 256291902 | |
| cache-misses | 18876172 | 43814301 | |
| cache-misses (%) | 51.863% | 17.095% | |
| frontend cycles idle | 80.70% | 48.07% | |
| backend cycles idle | 51.56% | 25.65% | |
| time elapsed | 25.50034 | 3.34092 | 7.6327 (SpeedUp) |
### Flame Graph (2048x2048)

### Other Flame Graphs
:::spoiler




:::
## OMP Optimization
* Tue 15.06.2021 23:59
* Submission file "ompge.cpp"
### `perf stat` on the Submission Server
```shell
stdout: Optimized OMP Program:
stdout: 215464679 cache-references # 32.738 M/sec ( +- 1.53% ) (63.24%)
stdout: 4563143 cache-misses # 2.118 % of all cache refs ( +- 2.60% ) (63.41%)
stdout: 6581.54 msec cpu-clock # 6.428 CPUs utilized ( +- 0.74% )
stdout: 258 context-switches # 0.039 K/sec ( +- 10.23% )
stdout: 72 cpu-migrations # 0.011 K/sec ( +- 19.44% )
stdout: 8423 page-faults # 0.001 M/sec ( +- 0.00% )
stdout: 17611176782 cycles # 2.676 GHz ( +- 0.83% ) (50.12%)
stdout: 12111969178 stalled-cycles-frontend # 68.77% frontend cycles idle ( +- 0.80% ) (49.48%)
stdout: 7903992783 stalled-cycles-backend # 44.88% backend cycles idle ( +- 1.45% ) (49.08%)
stdout: 13990348567 instructions # 0.79 insn per cycle
stdout: # 0.87 stalled cycles per insn ( +- 0.66% ) (62.06%)
stdout: 2062105804 branches # 313.317 M/sec ( +- 0.40% ) (62.12%)
stdout: 7495666 branch-misses # 0.36% of all branches ( +- 1.56% ) (62.55%)
stdout:
stdout: 1.02396 +- 0.00432 seconds time elapsed ( +- 0.42% )
stdout: Optimized OMP Program with SIMD:
stdout: 211111314 cache-references # 30.074 M/sec ( +- 1.41% ) (62.68%)
stdout: 4312060 cache-misses # 2.043 % of all cache refs ( +- 3.04% ) (63.58%)
stdout: 7019.81 msec cpu-clock # 6.804 CPUs utilized ( +- 4.03% )
stdout: 309 context-switches # 0.044 K/sec ( +- 18.90% )
stdout: 98 cpu-migrations # 0.014 K/sec ( +- 9.48% )
stdout: 8422 page-faults # 0.001 M/sec ( +- 0.01% )
stdout: 18547516255 cycles # 2.642 GHz ( +- 3.94% ) (51.48%)
stdout: 12837235540 stalled-cycles-frontend # 69.21% frontend cycles idle ( +- 4.51% ) (50.60%)
stdout: 8422308546 stalled-cycles-backend # 45.41% backend cycles idle ( +- 5.88% ) (49.18%)
stdout: 14631081994 instructions # 0.79 insn per cycle
stdout: # 0.88 stalled cycles per insn ( +- 1.43% ) (61.33%)
stdout: 2187065496 branches # 311.556 M/sec ( +- 3.22% ) (60.96%)
stdout: 7668582 branch-misses # 0.35% of all branches ( +- 0.58% ) (61.51%)
stdout:
stdout: 1.0318 +- 0.0105 seconds time elapsed ( +- 1.02% )
```
### Measurement Table - `omp for`
| SPEC | Serial Optimization | FOR Optimization | Improvement |
| -------------------- | ------------------- | ---------------- | ---------------- |
| IPC | 1.43 | 0.79 | - |
| cache-references | 256291902 | 215464679 | - |
| cache-misses | 43814301 | 4563143 | - |
| cache-misses (%) | 17.095% | 2.118% | - |
| frontend cycles idle | 48.07% | 68.77% | - |
| backend cycles idle | 25.65% | 44.88% | - |
| time elapsed | 3.34092 | 1.02396 | 3.2708 (SpeedUp) |
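
A minimal sketch of the `omp for` strategy measured above (same assumed names and layout as the serial sketch; not necessarily the submitted `ompge.cpp`): for a fixed pivot `k`, the updates of the rows below it are independent, so the `i`-loop can be split across threads, while the outer `k`-loop stays sequential because step `k + 1` depends on step `k`.

```cpp
#include <omp.h>
#include <cstddef>

// OpenMP-parallel forward elimination (sketch): one parallel region per pivot,
// static scheduling so each thread gets a contiguous chunk of rows.
void ForwardEliminationOMP(double *a, int n, int stride) {
    for (int k = 0; k < n - 1; ++k) {
        const double *pivot_row = &a[(std::size_t)k * stride];
        const double pivot_inv = 1.0 / pivot_row[k];
        #pragma omp parallel for schedule(static)
        for (int i = k + 1; i < n; ++i) {
            double *row = &a[(std::size_t)i * stride];
            const double factor = row[k] * pivot_inv;
            for (int j = k; j <= n; ++j)
                row[j] -= factor * pivot_row[j];
        }
    }
}
```

The per-pivot fork/join also connects to the hint in the serial section: for small matrices the work per parallel region shrinks while the threading overhead stays roughly constant, so parallelization stops paying off.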
### Measurement Table - `omp simd`
| SPEC | FOR Optimization | FOR + SIMD Optimization | Improvement |
| -------------------- | ---------------- | ----------------------- | ---------------- |
| IPC | 0.79 | 0.79 | - |
| cache-references | 215464679 | 211111314 | - |
| cache-misses | 4563143 | 4312060 | - |
| cache-misses (%) | 2.118% | 2.043% | - |
| frontend cycles idle | 68.77% | 69.21% | - |
| backend cycles idle | 44.88% | 45.41% | - |
| time elapsed | 1.02396 | 1.0318 | 0.9924 (SpeedUp) |
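
The `omp simd` variant only adds a `simd` annotation on the innermost `j`-loop of the same kernel (sketch below, same assumptions). As the table shows, this brings essentially no extra speedup, which is expected when the compiler already auto-vectorizes that loop.

```cpp
#include <omp.h>
#include <cstddef>

// Same kernel as above, with the innermost loop additionally marked for SIMD.
void ForwardEliminationOMPSIMD(double *a, int n, int stride) {
    for (int k = 0; k < n - 1; ++k) {
        const double *pivot_row = &a[(std::size_t)k * stride];
        const double pivot_inv = 1.0 / pivot_row[k];
        #pragma omp parallel for schedule(static)
        for (int i = k + 1; i < n; ++i) {
            double *row = &a[(std::size_t)i * stride];
            const double factor = row[k] * pivot_inv;
            #pragma omp simd
            for (int j = k; j <= n; ++j)
                row[j] -= factor * pivot_row[j];
        }
    }
}
```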
### Flame Graph (2048x2048)

### Other Flame Graphs
:::spoiler




:::
### Amdahl's Law Calculations
* **Methodology**
    * Run perf on the optimized serial code and determine the fraction of time spent in ForwardElimination.
    * Use this fraction in Amdahl's law to calculate the theoretical speedup (see the formula below).
    * Then calculate the experimental speedup using perf. SIMD is not considered!
* [Amdahl’s Law Calculations - Google Sheets](https://docs.google.com/spreadsheets/d/1XL3Ti2cEkmKaU7CVWf9Fil3K2e0xWhpUakXvcS-04ZQ/edit#gid=747307773)
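
The calculation behind the sheet is Amdahl's law, with $p$ the measured fraction of the serial runtime spent in `ForwardElimination()` and $N$ the number of threads:

$$
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
$$

For example (illustrative numbers, not our measured values): with $p = 0.95$ and $N = 8$, the theoretical speedup is $1 / (0.05 + 0.95/8) \approx 5.9$.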
### Initial results
#### Codes

#### `perf stat`

:::spoiler
```shell
Performance counter stats for './ompge ./ge_data/size512x512' (5 runs):
264.32 msec task-clock # 5.445 CPUs utilized ( +- 3.55% )
606 context-switches # 0.002 M/sec ( +- 11.62% )
1 cpu-migrations # 0.004 K/sec ( +- 44.72% )
702 page-faults # 0.003 M/sec ( +- 0.07% )
737,283,512 cycles # 2.789 GHz ( +- 5.07% )
315,708,039 stalled-cycles-frontend # 42.82% frontend cycles idle ( +- 9.48% )
22,312,372 stalled-cycles-backend # 3.03% backend cycles idle ( +- 3.08% )
493,289,410 instructions # 0.67 insn per cycle
# 0.64 stalled cycles per insn ( +- 0.78% )
91,134,557 branches # 344.785 M/sec ( +- 1.21% )
220,230 branch-misses # 0.24% of all branches ( +- 1.10% )
0.04854 +- 0.00153 seconds time elapsed ( +- 3.15% )
Performance counter stats for './ompge ./ge_data/size1024x1024' (5 runs):
666.17 msec task-clock # 4.252 CPUs utilized ( +- 1.24% )
723 context-switches # 0.001 M/sec ( +- 12.63% )
0 cpu-migrations # 0.000 K/sec
2,241 page-faults # 0.003 M/sec ( +- 0.04% )
2,427,979,476 cycles # 3.645 GHz ( +- 1.12% )
574,446,588 stalled-cycles-frontend # 23.66% frontend cycles idle ( +- 1.99% )
112,657,599 stalled-cycles-backend # 4.64% backend cycles idle ( +- 0.90% )
2,323,606,967 instructions # 0.96 insn per cycle
# 0.25 stalled cycles per insn ( +- 0.12% )
373,124,245 branches # 560.102 M/sec ( +- 0.21% )
777,271 branch-misses # 0.21% of all branches ( +- 0.37% )
0.15669 +- 0.00213 seconds time elapsed ( +- 1.36% )
Performance counter stats for './ompge ./ge_data/size2048x2048' (5 runs):
15,435.28 msec task-clock # 5.979 CPUs utilized ( +- 0.77% )
128,235 context-switches # 0.008 M/sec ( +- 0.12% )
36,035 cpu-migrations # 0.002 M/sec ( +- 1.29% )
8,425 page-faults # 0.546 K/sec ( +- 0.01% )
50,178,450,443 cycles # 3.251 GHz ( +- 0.86% )
21,780,685,616 stalled-cycles-frontend # 43.41% frontend cycles idle ( +- 1.03% )
5,790,393,909 stalled-cycles-backend # 11.54% backend cycles idle ( +- 0.80% )
14,331,798,547 instructions # 0.29 insn per cycle
# 1.52 stalled cycles per insn ( +- 0.02% )
2,073,070,667 branches # 134.307 M/sec ( +- 0.03% )
8,928,151 branch-misses # 0.43% of all branches ( +- 0.61% )
2.5816 +- 0.0328 seconds time elapsed ( +- 1.27% )
Performance counter stats for './ompge ./ge_data/size4096x4096' (5 runs):
149,998.28 msec task-clock # 8.211 CPUs utilized ( +- 1.71% )
286,910 context-switches # 0.002 M/sec ( +- 0.35% )
92,869 cpu-migrations # 0.619 K/sec ( +- 1.18% )
33,014 page-faults # 0.220 K/sec ( +- 0.00% )
574,591,154,283 cycles # 3.831 GHz ( +- 1.60% )
274,887,878,693 stalled-cycles-frontend # 47.84% frontend cycles idle ( +- 0.75% )
48,741,150,001 stalled-cycles-backend # 8.48% backend cycles idle ( +- 1.93% )
86,392,307,943 instructions # 0.15 insn per cycle
# 3.18 stalled cycles per insn ( +- 0.03% )
10,612,046,393 branches # 70.748 M/sec ( +- 0.04% )
34,691,265 branch-misses # 0.33% of all branches ( +- 0.81% )
18.269 +- 0.237 seconds time elapsed ( +- 1.30% )
```
:::
### Intermediate results
#### Codes

#### `perf stat`

:::spoiler
```shell
Performance counter stats for './ompge ./ge_data/size512x512' (5 runs):
630.58 msec task-clock # 4.514 CPUs utilized ( +- 3.20% )
31,581 context-switches # 0.050 M/sec ( +- 0.29% )
8,700 cpu-migrations # 0.014 M/sec ( +- 1.11% )
736 page-faults # 0.001 M/sec ( +- 0.07% )
1,312,049,911 cycles # 2.081 GHz ( +- 4.33% )
379,265,192 stalled-cycles-frontend # 28.91% frontend cycles idle ( +- 11.42% )
130,674,577 stalled-cycles-backend # 9.96% backend cycles idle ( +- 0.53% )
827,190,457 instructions # 0.63 insn per cycle
# 0.46 stalled cycles per insn ( +- 0.77% )
164,415,402 branches # 260.736 M/sec ( +- 1.06% )
1,087,639 branch-misses # 0.66% of all branches ( +- 0.24% )
0.13970 +- 0.00109 seconds time elapsed ( +- 0.78% )
Performance counter stats for './ompge ./ge_data/size1024x1024' (5 runs):
1,551.52 msec task-clock # 4.356 CPUs utilized ( +- 3.38% )
62,970 context-switches # 0.041 M/sec ( +- 0.08% )
17,645 cpu-migrations # 0.011 M/sec ( +- 1.69% )
2,274 page-faults # 0.001 M/sec ( +- 0.03% )
4,057,597,625 cycles # 2.615 GHz ( +- 4.80% )
961,906,690 stalled-cycles-frontend # 23.71% frontend cycles idle ( +- 9.21% )
342,219,413 stalled-cycles-backend # 8.43% backend cycles idle ( +- 2.57% )
3,039,407,567 instructions # 0.75 insn per cycle
# 0.32 stalled cycles per insn ( +- 0.66% )
533,379,028 branches # 343.777 M/sec ( +- 1.06% )
2,497,786 branch-misses # 0.47% of all branches ( +- 0.54% )
0.35619 +- 0.00834 seconds time elapsed ( +- 2.34% )
Performance counter stats for './ompge ./ge_data/size2048x2048' (5 runs):
17,014.81 msec task-clock # 5.825 CPUs utilized ( +- 0.54% )
251,846 context-switches # 0.015 M/sec ( +- 0.31% )
68,261 cpu-migrations # 0.004 M/sec ( +- 1.43% )
8,425 page-faults # 0.495 K/sec ( +- 0.01% )
52,808,551,492 cycles # 3.104 GHz ( +- 0.48% )
22,414,794,512 stalled-cycles-frontend # 42.45% frontend cycles idle ( +- 1.59% )
6,320,426,551 stalled-cycles-backend # 11.97% backend cycles idle ( +- 0.91% )
15,690,574,537 instructions # 0.30 insn per cycle
# 1.43 stalled cycles per insn ( +- 0.11% )
2,375,361,565 branches # 139.606 M/sec ( +- 0.16% )
12,082,568 branch-misses # 0.51% of all branches ( +- 0.99% )
2.9211 +- 0.0549 seconds time elapsed ( +- 1.88% )
Performance counter stats for './ompge ./ge_data/size4096x4096' (5 runs):
158,882.20 msec task-clock # 8.454 CPUs utilized ( +- 1.06% )
532,286 context-switches # 0.003 M/sec ( +- 0.20% )
159,671 cpu-migrations # 0.001 M/sec ( +- 1.91% )
33,014 page-faults # 0.208 K/sec ( +- 0.00% )
598,956,780,751 cycles # 3.770 GHz ( +- 1.05% )
277,193,463,570 stalled-cycles-frontend # 46.28% frontend cycles idle ( +- 0.27% )
50,828,962,230 stalled-cycles-backend # 8.49% backend cycles idle ( +- 1.27% )
89,076,682,733 instructions # 0.15 insn per cycle
# 3.11 stalled cycles per insn ( +- 0.03% )
11,210,287,810 branches # 70.557 M/sec ( +- 0.05% )
41,578,396 branch-misses # 0.37% of all branches ( +- 0.79% )
18.793 +- 0.268 seconds time elapsed ( +- 1.42% )
```
:::
### Documentation
* Explain the parallelization strategies used and the logic behind them.
* Use a diagram to show the parallelization strategy and the data required at each step of the parallelized algorithm.
## MPI Optimization
* Tue 22.06.2021 23:59
* Submission file "mpige.cpp
### Development List
* [x] Communication skeleton (a sketch follows after this list)
* [x] Preparing data
* [x] MPI_Scatter() and MPI_Gather()
* [x] MPI_Isend() and MPI_Irecv()
* use the Pipelined, Row-Oriented Algorithm (slide 39)
* [x] ForwardElimination() parallelization
* [x] [Optional] BackwardSubstitution() parallelization
* [x] Optimize communication
* [x] Measurement
* [x] Documentation
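
As referenced in the list above, here is a hedged sketch of the communication skeleton with a simple block-row distribution (assumed names and layout as in the earlier sketches; the submitted `mpige.cpp` may instead distribute rows cyclically for the pipelined, row-oriented algorithm and use `MPI_Isend()`/`MPI_Irecv()` rather than the blocking collectives shown here).

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// MPI forward elimination (sketch): scatter contiguous row blocks, broadcast
// each pivot row from its owner, eliminate locally, gather the result back.
// Assumes n is divisible by the number of ranks and stride = n + 1 (RHS column).
void ForwardEliminationMPI(std::vector<double> &a, int n, int stride,
                           MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    const int rows_per_rank = n / nprocs;
    std::vector<double> local((std::size_t)rows_per_rank * stride);
    std::vector<double> pivot_row(stride);

    // Rank r owns global rows [r * rows_per_rank, (r + 1) * rows_per_rank).
    MPI_Scatter(a.data(), rows_per_rank * stride, MPI_DOUBLE,
                local.data(), rows_per_rank * stride, MPI_DOUBLE, 0, comm);

    for (int k = 0; k < n - 1; ++k) {
        const int owner = k / rows_per_rank;
        if (rank == owner)
            std::copy_n(&local[(std::size_t)(k % rows_per_rank) * stride],
                        stride, pivot_row.data());
        MPI_Bcast(pivot_row.data(), stride, MPI_DOUBLE, owner, comm);

        const double pivot_inv = 1.0 / pivot_row[k];
        const int first = std::max(k + 1 - rank * rows_per_rank, 0);
        for (int li = first; li < rows_per_rank; ++li) {
            double *row = &local[(std::size_t)li * stride];
            const double factor = row[k] * pivot_inv;
            for (int j = k; j <= n; ++j)
                row[j] -= factor * pivot_row[j];
        }
    }

    // Collect the eliminated blocks on rank 0 for backward substitution.
    MPI_Gather(local.data(), rows_per_rank * stride, MPI_DOUBLE,
               a.data(), rows_per_rank * stride, MPI_DOUBLE, 0, comm);
}
```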
### Flame Graph (2048x2048)

### Other Flame Graphs
:::spoiler




:::
### Initial Results - MPI_Ibcast
#### Codes

#### `perf stat`

:::spoiler
```shell
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size512x512' (5 runs):
264.72 msec task-clock # 1.594 CPUs utilized ( +- 2.31% )
386 context-switches # 0.001 M/sec ( +- 1.79% )
55 cpu-migrations # 0.207 K/sec ( +- 6.49% )
9,144 page-faults # 0.035 M/sec ( +- 0.03% )
796,795,509 cycles # 3.010 GHz ( +- 3.21% )
23,601,010 stalled-cycles-frontend # 2.96% frontend cycles idle ( +- 1.30% )
255,426,677 stalled-cycles-backend # 32.06% backend cycles idle ( +- 3.21% )
1,679,943,613 instructions # 2.11 insn per cycle
# 0.15 stalled cycles per insn ( +- 2.72% )
339,717,099 branches # 1283.305 M/sec ( +- 2.84% )
1,397,471 branch-misses # 0.41% of all branches ( +- 1.58% )
0.16604 +- 0.00512 seconds time elapsed ( +- 3.08% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size1024x1024' (5 runs):
826.36 msec task-clock # 2.372 CPUs utilized ( +- 2.32% )
478 context-switches # 0.579 K/sec ( +- 1.47% )
51 cpu-migrations # 0.061 K/sec ( +- 7.69% )
13,858 page-faults # 0.017 M/sec ( +- 0.03% )
3,140,217,954 cycles # 3.800 GHz ( +- 2.17% )
76,546,845 stalled-cycles-frontend # 2.44% frontend cycles idle ( +- 8.55% )
1,325,883,824 stalled-cycles-backend # 42.22% backend cycles idle ( +- 1.73% )
6,446,686,641 instructions # 2.05 insn per cycle
# 0.21 stalled cycles per insn ( +- 2.14% )
1,232,693,549 branches # 1491.714 M/sec ( +- 2.44% )
2,488,624 branch-misses # 0.20% of all branches ( +- 1.53% )
0.34832 +- 0.00678 seconds time elapsed ( +- 1.95% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size2048x2048' (5 runs):
6,938.97 msec task-clock # 2.873 CPUs utilized ( +- 2.90% )
1,330 context-switches # 0.192 K/sec ( +- 2.21% )
76 cpu-migrations # 0.011 K/sec ( +- 8.44% )
32,500 page-faults # 0.005 M/sec ( +- 0.02% )
28,169,268,111 cycles # 4.060 GHz ( +- 2.83% )
2,745,914,988 stalled-cycles-frontend # 9.75% frontend cycles idle ( +- 16.85% )
15,987,882,751 stalled-cycles-backend # 56.76% backend cycles idle ( +- 1.33% )
30,822,596,196 instructions # 1.09 insn per cycle
# 0.52 stalled cycles per insn ( +- 1.63% )
5,430,351,277 branches # 782.588 M/sec ( +- 1.78% )
8,610,384 branch-misses # 0.16% of all branches ( +- 3.42% )
2.4154 +- 0.0706 seconds time elapsed ( +- 2.92% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size4096x4096' (5 runs):
48,931.88 msec task-clock # 2.963 CPUs utilized ( +- 2.11% )
6,836 context-switches # 0.140 K/sec ( +- 3.09% )
169 cpu-migrations # 0.003 K/sec ( +- 3.89% )
106,641 page-faults # 0.002 M/sec ( +- 0.00% )
198,037,246,253 cycles # 4.047 GHz ( +- 2.00% )
21,812,123,320 stalled-cycles-frontend # 11.01% frontend cycles idle ( +- 12.63% )
123,291,160,551 stalled-cycles-backend # 62.26% backend cycles idle ( +- 0.87% )
167,948,115,577 instructions # 0.85 insn per cycle
# 0.73 stalled cycles per insn ( +- 1.31% )
26,913,007,412 branches # 550.010 M/sec ( +- 1.52% )
39,952,365 branch-misses # 0.15% of all branches ( +- 3.27% )
16.515 +- 0.346 seconds time elapsed ( +- 2.10% )
```
:::
#### `mpiP`

### Intermediate results - Parallel BS()
#### Codes

#### `perf stat`

:::spoiler
```shell
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size512x512' (5 runs):
248.11 msec task-clock # 1.540 CPUs utilized ( +- 3.12% )
389 context-switches # 0.002 M/sec ( +- 2.40% )
46 cpu-migrations # 0.187 K/sec ( +- 9.80% )
9,571 page-faults # 0.039 M/sec ( +- 0.07% )
762,786,422 cycles # 3.074 GHz ( +- 4.72% )
26,023,616 stalled-cycles-frontend # 3.41% frontend cycles idle ( +- 5.67% )
225,481,941 stalled-cycles-backend # 29.56% backend cycles idle ( +- 5.52% )
1,573,853,837 instructions # 2.06 insn per cycle
# 0.14 stalled cycles per insn ( +- 2.78% )
321,708,657 branches # 1296.653 M/sec ( +- 2.97% )
1,313,746 branch-misses # 0.41% of all branches ( +- 0.76% )
0.16110 +- 0.00833 seconds time elapsed ( +- 5.17% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size1024x1024' (5 runs):
929.34 msec task-clock # 2.425 CPUs utilized ( +- 6.16% )
498 context-switches # 0.535 K/sec ( +- 3.36% )
61 cpu-migrations # 0.065 K/sec ( +- 7.68% )
15,733 page-faults # 0.017 M/sec ( +- 0.03% )
3,554,226,495 cycles # 3.824 GHz ( +- 7.20% )
206,006,990 stalled-cycles-frontend # 5.80% frontend cycles idle ( +- 28.37% )
1,547,651,047 stalled-cycles-backend # 43.54% backend cycles idle ( +- 10.76% )
6,387,034,798 instructions # 1.80 insn per cycle
# 0.24 stalled cycles per insn ( +- 2.08% )
1,242,126,964 branches # 1336.570 M/sec ( +- 2.24% )
2,282,334 branch-misses # 0.18% of all branches ( +- 1.82% )
0.3833 +- 0.0192 seconds time elapsed ( +- 5.02% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size2048x2048' (5 runs):
6,212.68 msec task-clock # 2.862 CPUs utilized ( +- 0.91% )
1,220 context-switches # 0.196 K/sec ( +- 1.48% )
64 cpu-migrations # 0.010 K/sec ( +- 4.89% )
40,341 page-faults # 0.006 M/sec ( +- 0.01% )
25,446,446,484 cycles # 4.096 GHz ( +- 0.73% )
1,127,432,163 stalled-cycles-frontend # 4.43% frontend cycles idle ( +- 13.73% )
16,067,156,571 stalled-cycles-backend # 63.14% backend cycles idle ( +- 1.25% )
29,982,500,623 instructions # 1.18 insn per cycle
# 0.54 stalled cycles per insn ( +- 1.05% )
5,408,456,832 branches # 870.552 M/sec ( +- 1.22% )
6,568,256 branch-misses # 0.12% of all branches ( +- 1.55% )
2.1711 +- 0.0230 seconds time elapsed ( +- 1.06% )
Performance counter stats for 'mpirun -np 3 mpige ./ge_data/size4096x4096' (5 runs):
44,639.48 msec task-clock # 2.965 CPUs utilized ( +- 0.48% )
5,878 context-switches # 0.132 K/sec ( +- 1.48% )
139 cpu-migrations # 0.003 K/sec ( +- 7.06% )
138,687 page-faults # 0.003 M/sec ( +- 0.00% )
183,081,496,312 cycles # 4.101 GHz ( +- 0.60% )
11,107,441,243 stalled-cycles-frontend # 6.07% frontend cycles idle ( +- 3.95% )
124,142,472,219 stalled-cycles-backend # 67.81% backend cycles idle ( +- 0.64% )
162,602,667,589 instructions # 0.89 insn per cycle
# 0.76 stalled cycles per insn ( +- 0.82% )
26,673,044,086 branches # 597.521 M/sec ( +- 1.05% )
29,169,660 branch-misses # 0.11% of all branches ( +- 1.18% )
15.0575 +- 0.0710 seconds time elapsed ( +- 0.47% )
```
:::
#### `mpiP`

### Point-to-Point vs Broadcast

Broadcast is more efficient: MPI libraries typically implement `MPI_Bcast` with a tree-based or pipelined pattern, so the messaging overhead is distributed across the ranks instead of being serialized on the sending rank.
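
For illustration, a hypothetical helper (not project code) showing the two ways of distributing a pivot row:

```cpp
#include <mpi.h>

// Distributing one pivot row to all ranks: point-to-point vs. collective.
void distribute_pivot_row(double *row, int count, int root,
                          MPI_Comm comm, bool use_bcast) {
    if (use_bcast) {
        // Collective: the library spreads the forwarding work over the ranks
        // (typically a binomial tree), so the root does O(log p) work.
        MPI_Bcast(row, count, MPI_DOUBLE, root, comm);
        return;
    }
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    if (rank == root) {
        // Point-to-point: the root serializes p - 1 sends per pivot step.
        for (int r = 0; r < nprocs; ++r)
            if (r != root)
                MPI_Send(row, count, MPI_DOUBLE, r, 0, comm);
    } else {
        MPI_Recv(row, count, MPI_DOUBLE, root, 0, comm, MPI_STATUS_IGNORE);
    }
}
```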

## Hybrid (MPI + OMP) Optimization + Bonus (Optional)
* Tue 29.06.2021 23:59
* Submission file "hybridge.cpp
* Submission file "bonusge.cpp (optional)
### Hybrid (MPI + OMP)
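A minimal sketch of how the two levels are combined (assumed structure, not the submitted `hybridge.cpp`): MPI ranks own row blocks as in the MPI sketch above, and the local elimination loop is additionally split across OpenMP threads. MPI is initialized with a threading level that allows an OpenMP team to run while only the main thread makes MPI calls.

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
    // Request a threading level where only the main thread makes MPI calls.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    // ... read the matrix, scatter row blocks (as in the MPI sketch above) ...
    //
    // Inside the k-loop, after MPI_Bcast of the pivot row, the local rows are
    // eliminated by the OpenMP team:
    //
    //   #pragma omp parallel for schedule(static)
    //   for (int li = first; li < rows_per_rank; ++li)
    //       /* row li -= factor * pivot_row */;

    MPI_Finalize();
    return 0;
}
```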
#### Flame Graph (2048x2048)

#### Other Flame Graphs
:::spoiler




:::
### Bonus
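The introduction mentions additional optimizations with SIMD intrinsics; purely as an illustration of that style (an assumed AVX variant of the row update, not necessarily what the submitted `bonusge.cpp` does), the inner update could be hand-vectorized like this:

```cpp
#include <immintrin.h>

// Illustrative AVX row update: row[j] -= factor * pivot_row[j],
// four doubles at a time, with a scalar remainder loop.
void eliminate_row_avx(double *row, const double *pivot_row,
                       double factor, int k, int n) {
    const __m256d vfactor = _mm256_set1_pd(factor);
    int j = k;
    for (; j + 4 <= n + 1; j += 4) {             // columns k..n incl. the RHS
        __m256d r = _mm256_loadu_pd(&row[j]);
        __m256d p = _mm256_loadu_pd(&pivot_row[j]);
        r = _mm256_sub_pd(r, _mm256_mul_pd(vfactor, p));
        _mm256_storeu_pd(&row[j], r);
    }
    for (; j <= n; ++j)                          // scalar remainder
        row[j] -= factor * pivot_row[j];
}
```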
#### Flame Graph (2048x2048)

#### Other Flame Graphs
:::spoiler




:::
## Presentation
* Fri 09.07.2021 13:20 - 14:10
## References
* [Cornell Virtual Workshop: Amdahl's Law](https://cvw.cac.cornell.edu/vector/performance_amdahl)
* [Gaussian Elimination](https://www.cs.rutgers.edu/~venugopa/parallel_summer2012/ge.html)
* [Gaussian Elimination parallel slides](https://parallelcomp.github.io/Gaussian-Quinn.pdf)
* [Loop dependence analysis - Wikipedia](https://en.wikipedia.org/wiki/Loop_dependence_analysis)
* [Linux perf Examples](http://www.brendangregg.com/perf.html)
* [CPU Flame Graphs](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html)
* [Off-CPU Analysis](http://www.brendangregg.com/offcpuanalysis.html)
* [LLNL/mpiP: A light-weight MPI profiler.](https://github.com/LLNL/mpiP)
* [SIMD Directives](https://www.openmp.org/spec-html/5.0/openmpsu42.html)