# 羅習五 HW7 perf analysis
###### tags: `LINUX_並發與競態_行程同步`
Both `peterson_correct.c` and `peterson_trival.c` were modified so that the program exits after running for 30 seconds.
```c
int count=0;
// Every second, print how many times P0 and P1 entered the CS
void per_second(int signum) {
...
count++;
if(count==30)
exit(0);
alarm(1);
}
```
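A minimal, self-contained sketch of the 30-second limit, assuming the handler is installed with `signal()` and re-armed with `alarm(1)`. The names `RUN_SECONDS`, `elapsed`, and `start_timer` are made up for this illustration, the per-second printing elided above is left out, and `_exit` is used here because `exit` is not async-signal-safe.

```c
#include <assert.h>
#include <signal.h>
#include <unistd.h>

#define RUN_SECONDS 30 /* assumed limit, matching the 30-second run above */

/* Seconds elapsed so far; sig_atomic_t is safe to update inside a handler. */
static volatile sig_atomic_t elapsed = 0;

/* SIGALRM handler: count one second, then either stop or re-arm the alarm. */
static void per_second(int signum) {
    (void)signum;
    elapsed++;
    if (elapsed >= RUN_SECONDS)
        _exit(0); /* async-signal-safe, unlike exit() */
    alarm(1);     /* deliver SIGALRM again in one second */
}

/* Install the handler and arm the first one-second alarm. */
static void start_timer(void) {
    void (*previous)(int) = signal(SIGALRM, per_second);
    assert(previous != SIG_ERR);
    (void)previous;
    alarm(1);
}
```

Calling `start_timer()` once before P0 and P1 start spinning should be enough for the process to stop itself after roughly 30 seconds.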
## Key points [在 Linux 上使用 Perf 做效能分析(入門篇)](https://tigercosmos.xyz/post/2020/08/system/perf-basic/)
[Source level analysis with perf annotate](https://perf.wiki.kernel.org/index.php/Tutorial#Source_level_analysis_with_perf_annotate)
:::success
Use perf annotate
```
$ sudo perf record ./your_executable
$ sudo perf annotate
```
to see which assembly instructions are the performance bottleneck.
Use perf stat
```
$ sudo perf stat ./peterson_trival-g
```
to get the big picture at a glance.
:::
For example:
```
$ sudo perf record ./peterson_trival-O3
$ sudo perf annotate
```


```
$ sudo perf stat ./peterson_trival-g
Performance counter stats for './peterson_trival-g':
48,227.98 msec task-clock # 1.596 CPUs utilized
1,144 context-switches # 0.024 K/sec
61 cpu-migrations # 0.001 K/sec
62 page-faults # 0.001 K/sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
30.221688059 seconds time elapsed
20.631166000 seconds user
27.596390000 seconds sys
```
```
$ sudo perf stat ./peterson_trival-O3
Performance counter stats for './peterson_trival-O3':
60,011.23 msec task-clock # 1.999 CPUs utilized
413 context-switches # 0.007 K/sec
2 cpu-migrations # 0.000 K/sec
64 page-faults # 0.001 K/sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
30.014287374 seconds time elapsed
59.991092000 seconds user
0.015996000 seconds sys
```
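The `CPUs utilized` column is task-clock (CPU time consumed) divided by the wall-clock time elapsed. A small sketch of that arithmetic, using the numbers from the two runs above; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* CPUs utilized = CPU time consumed (task-clock) / wall-clock time elapsed. */
static double cpus_utilized(double task_clock_msec, double elapsed_sec) {
    assert(elapsed_sec > 0.0);
    return (task_clock_msec / 1000.0) / elapsed_sec;
}

/* Print the ratio for the -g and -O3 runs shown above. */
static void print_utilization(void) {
    printf("-g : %.3f\n", cpus_utilized(48227.98, 30.221688059)); /* 1.596 */
    printf("-O3: %.3f\n", cpus_utilized(60011.23, 30.014287374)); /* 1.999 */
}
```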
One of the counters reported is cpu-migrations.
[Linux scheduling](https://www.ibm.com/docs/sk/linux-on-systems?topic=management-linux-scheduling)
> **Moving a virtual CPU from one run queue to another is called a (CPU) migration.** Be sure not to confuse the term “CPU migration” with a “live migration”, which is the migration of a virtual server from one host to another. The Linux scheduler might decide to migrate a virtual CPU when the estimated wait time until the virtual CPU will be executed is too long, the run queue where it is supposed to be waiting is full, or another run queue is empty and needs to be filled up.
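~~cpu-migrations means a process being moved between virtual CPUs. Because this program runs inside a VMware virtual machine, cpu-migrations occur as it is moved between virtual CPUs.~~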
[Linux - Difference between migrations and switches?](https://stackoverflow.com/questions/45368742/linux-difference-between-migrations-and-switches)
> **Migration is when a thread, usually after a context switch, get scheduled on a different CPU than it was scheduled before.**
>
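
The second definition can be observed from inside a program. The sketch below is a hypothetical helper, not part of the homework code: it samples `sched_getcpu()` (a glibc call that the homework binaries also use) and counts how often consecutive samples differ. It only approximates perf's `cpu-migrations` counter, since a thread can migrate and come back between two samples.

```c
#define _GNU_SOURCE /* needed for sched_getcpu() on glibc */
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Record which CPU the calling thread is on, n times in a row. */
static void sample_cpus(int *cpu_ids, size_t n) {
    assert(cpu_ids != NULL);
    for (size_t i = 0; i < n; i++)
        cpu_ids[i] = sched_getcpu();
}

/* Count how many consecutive samples landed on a different CPU. */
static size_t count_cpu_changes(const int *cpu_ids, size_t n) {
    size_t changes = 0;
    for (size_t i = 1; i < n; i++)
        if (cpu_ids[i] != cpu_ids[i - 1])
            changes++;
    return changes;
}
```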
---
```
$ sudo perf record ./peterson_trival-O3
$ sudo perf annotate
```

```
$ sudo perf record ./peterson_trival-g
$ sudo perf annotate
```
**If P0 and P1 enter the critical section at the same time, perf record fails to record properly**,
producing the following perf annotate result:

When P0 and P1 of ./peterson_trival-g are measured successfully,
`$ sudo perf annotate` gives the following result:


In this part:

```
Percent│ ▒
│ /home/blue76815/.debug/.build-id/29/1959098c03695a00cf391506bf5deb5c7fd70a/elf: file format elf64-x86-64 ▒
│ ▒
│
│ void p0(void) { ▒
│ push %rbp ▒
│ mov %rsp,%rbp ▒
│ printf("p0: start\n"); ▒
│ lea _IO_stdin_used+0x60,%rdi ▒
│ → callq puts@plt ▒
│ while (1) { ▒
│ //🐉 🐲 🌵 🎄 🌲 🌳 🌴 🌱 🌿 ☘️ 🍀 ▒
│ //Entry section of Peterson's solution ▒
│ flag0 = 1; ▒
│10: movl $0x1,flag0 ▒
│ turn = 1; ▒
0.04 │ movl $0x1,turn ▒
│ while (flag1==1 && turn==1) ▒
0.01 │ nop ▒
9.37 │25: mov flag1,%eax ▒
56.77 │ cmp $0x1,%eax ▒
│ ↓ jne 3b ▒
30.91 │ mov turn,%eax ▒
1.80 │ cmp $0x1,%eax ◆
│ ↑ je 25 ▒
│ ; //waiting
```
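`cmp $0x1,%eax` has the **highest percentage**, **which matches the instructor's remark that the compare is the performance bottleneck**. Because perf samples can skid to the instruction after the one that actually stalled, much of this cost plausibly belongs to the preceding load, `mov flag1,%eax`.

---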
## peterson_correct analysis
```
$ sudo perf stat ./peterson_correct-O3
Performance counter stats for './peterson_correct-O3':
60,031.37 msec task-clock # 2.000 CPUs utilized
208 context-switches # 0.003 K/sec
1 cpu-migrations # 0.000 K/sec
62 page-faults # 0.001 K/sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
30.019816344 seconds time elapsed
60.019472000 seconds user
0.012003000 seconds sys
$ sudo perf stat ./peterson_correct-g
Performance counter stats for './peterson_correct-g':
60,023.59 msec task-clock # 2.000 CPUs utilized
202 context-switches # 0.003 K/sec
2 cpu-migrations # 0.000 K/sec
63 page-faults # 0.001 K/sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
30.015192347 seconds time elapsed
60.023823000 seconds user
0.000000000 seconds sys
```
```
$ sudo perf record ./peterson_correct-O3
$ sudo perf annotate
```

```
Percent│ ▒
│ /home/blue76815/.debug/.build-id/aa/fd55126b43dc94f67ef2d9eebd9e309fcf0f87/elf: file format elf64-x86-64 ▒
│ ▒
│ ▒
│ Disassembly of section .text: ▒
│ ▒
│ 0000000000000b00 <p1>: ▒
│ p1(): ▒
│ lea _IO_stdin_used+0xd,%rdi ▒
│ sub $0x8,%rsp ▒
│ → callq puts@plt ◆
0.16 │10: movl $0x1,flag+0x4 ▒
0.01 │ mfence ▒
15.48 │ mfence ▒
6.28 │ movl $0x0,turn ▒
0.07 │ mfence ▒
17.47 │ ↓ jmp 3a ▒
│ nop ▒
3.30 │30: mov turn,%eax ▒
9.97 │ test %eax,%eax ▒
0.01 │ ↓ jne 44 ▒
4.98 │3a: mov flag,%eax ▒
10.04 │ test %eax,%eax ▒
0.02 │ ↑ jne 30 ▒
1.10 │44: → callq sched_getcpu@plt ▒
0.08 │ mov %eax,cpu_p1 ▒
0.05 │ mov in_cs,%eax ▒
5.98 │ add $0x1,%eax ▒
0.15 │ cmp $0x2,%eax ▒
0.11 │ mov %eax,in_cs ▒
0.07 │ ↓ je 88 ▒
0.01 │63: addl $0x1,p1_in_cs ▒
0.35 │ sub $0x1,%eax ▒
0.00 │ mov %eax,in_cs ▒
0.01 │ movl $0x0,flag+0x4 ▒
0.10 │ mfence ▒
24.20 │ ↑ jmp 10 ▒
│ nop ▒
│88: mov stderr@@GLIBC_2.2.5,%rcx ▒
│ lea _IO_stdin_used+0x18,%rdi ▒
│ mov $0x1e,%edx ▒
│ mov $0x1,%esi ▒
│ → callq fwrite@plt ▒
│ mov in_cs,%eax ▒
│ ↑ jmp 63 ▒
│ nop
```
```
$ sudo perf record ./peterson_correct-g
$ sudo perf annotate
```


```
Percent│ ◆
│ /home/blue76815/.debug/.build-id/5d/76ec6367369031d0f2d82842955ade16df54b2/elf: file format elf64-x86-64 ▒
│ ▒
│ ▒
│ Disassembly of section .text: ▒
│ ▒
│ 0000000000000a3e <p0>: ▒
│ p0(): ▒
│ if(count==30) ▒
│ exit(0); ▒
│ alarm(1); ▒
│ } ▒
│ ▒
│ void p0(void) { ▒
│ push %rbp ▒
│ mov %rsp,%rbp ▒
│ sub $0x40,%rsp ▒
│ mov %fs:0x28,%rax ▒
│ mov %rax,-0x8(%rbp) ▒
│ xor %eax,%eax ▒
│ printf("start p0\n"); ▒
│ lea _IO_stdin_used+0x60,%rdi ▒
│ → callq puts@plt ▒
│ while (1) { ▒
│ //🐉 🐲 🌵 🎄 🌲 🌳 🌴 🌱 🌿 ☘️ 🍀 ▒
│ //Entry section of Peterson's solution ▒
│ atomic_store(&flag[0], 1); ▒
0.09 │ 23: lea flag,%rax ▒
0.00 │ mov %rax,-0x30(%rbp) ▒
0.01 │ movl $0x1,-0x34(%rbp) ▒
0.11 │ mov -0x34(%rbp),%eax ▒
0.01 │ mov %eax,%edx ▒
│ mov -0x30(%rbp),%rax ▒
0.01 │ mov %edx,(%rax) ▒
0.61 │ mfence ▒
│ atomic_thread_fence(memory_order_seq_cst); ▒
15.24 │ mfence ▒
│ atomic_store(&turn, 1); ▒
6.30 │ lea turn,%rax ▒
0.03 │ mov %rax,-0x28(%rbp) ▒
0.00 │ movl $0x1,-0x34(%rbp) ▒
0.00 │ mov -0x34(%rbp),%eax ▒
0.09 │ mov %eax,%edx ▒
│ mov -0x28(%rbp),%rax ▒
0.01 │ mov %edx,(%rax) ▒
0.62 │ mfence ▒
│ while (atomic_load(&flag[1]) && atomic_load(&turn)==1) ▒
16.43 │ nop ▒
1.42 │ 67: lea flag+0x4,%rax ▒
0.72 │ mov %rax,-0x20(%rbp)
0.52 │ mov -0x20(%rbp),%rax ▒
0.49 │ mov (%rax),%eax ▒
12.32 │ mov %eax,-0x34(%rbp) ▒
1.61 │ mov -0x34(%rbp),%eax ▒
1.42 │ test %eax,%eax ▒
0.00 │ ↓ je 9e ▒
1.29 │ lea turn,%rax ▒
1.01 │ mov %rax,-0x18(%rbp) ▒
0.57 │ mov -0x18(%rbp),%rax ▒
0.33 │ mov (%rax),%eax ▒
5.74 │ mov %eax,-0x34(%rbp) ▒
1.81 │ mov -0x34(%rbp),%eax ▒
1.61 │ cmp $0x1,%eax ▒
│ ↑ je 67 ▒
│ ; //waiting ▒
│ ▒
│ ▒
│ //The code below simulates being in the critical section ▒
│ cpu_p0 = sched_getcpu(); ▒
1.31 │ 9e: → callq sched_getcpu@plt ▒
0.07 │ mov %eax,cpu_p0 ▒
│ in_cs++; //count how many are in the CS ▒
0.04 │ mov in_cs,%eax ▒
5.28 │ add $0x1,%eax ▒
0.09 │ mov %eax,in_cs ▒
│ //nanosleep(&ts, NULL); ▒
│ if (in_cs == 2) fprintf(stderr, "p0及p1都在critical section\n"); ▒
0.18 │ mov in_cs,%eax ▒
0.93 │ cmp $0x2,%eax ▒
│ ↓ jne e3 ▒
│ mov stderr@@GLIBC_2.2.5,%rax ▒
│ mov %rax,%rcx ▒
│ mov $0x1e,%edx ▒
│ mov $0x1,%esi ▒
│ lea _IO_stdin_used+0x70,%rdi ▒
│ → callq fwrite@plt ▒
│ p0_in_cs++; //how many times P0 has been in the CS ▒
0.12 │ e3: mov p0_in_cs,%eax ◆
0.90 │ add $0x1,%eax ▒
0.02 │ mov %eax,p0_in_cs ▒
│ //nanosleep(&ts, NULL); ▒
│ in_cs--; //count how many are in the CS
0.02 │ │ mov in_cs,%eax ▒
0.16 │ │ sub $0x1,%eax ▒
0.01 │ │ mov %eax,in_cs ▒
│ │ ▒
│ │ ▒
│ │ ▒
│ │//🐉 🐲 🌵 🎄 🌲 🌳 🌴 🌱 🌿 ☘️ 🍀 ▒
│ │//Exit section of Peterson's solution ▒
│ │atomic_store(&flag[0], 0); ▒
0.03 │ │ lea flag,%rax ▒
0.01 │ │ mov %rax,-0x10(%rbp) ▒
0.08 │ │ movl $0x0,-0x34(%rbp) ▒
0.00 │ │ mov -0x34(%rbp),%eax ▒
0.03 │ │ mov %eax,%edx ▒
0.01 │ │ mov -0x10(%rbp),%rax ▒
0.06 │ │ mov %edx,(%rax) ▒
0.07 │ │ mfence ▒
│ │atomic_store(&flag[0], 1); ▒
20.14 │ └──jmpq 23
```
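For reference, the source lines interleaved in the listing above can be collected into a small lock/unlock pair. This is a sketch reconstructed from those lines and generalized to either thread; the function names `peterson_lock` and `peterson_unlock` and the `self`/`other` parameters are assumptions made for illustration, not the homework's actual layout.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int flag[2]; /* flag[i] == 1: thread i wants to enter */
static atomic_int turn;    /* which thread is allowed to go first on a tie */

/* Entry section for thread `self` (0 or 1). */
static void peterson_lock(int self) {
    assert(self == 0 || self == 1);
    int other = 1 - self;
    atomic_store(&flag[self], 1);              /* announce intent */
    atomic_thread_fence(memory_order_seq_cst); /* explicit fence, as in the listing */
    atomic_store(&turn, other);                /* let the other thread go first */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ; /* busy-wait */
}

/* Exit section for thread `self` (0 or 1). */
static void peterson_unlock(int self) {
    assert(self == 0 || self == 1);
    atomic_store(&flag[self], 0);
}
```

In the listings above, each sequentially consistent `atomic_store` shows up as a `mov` followed by `mfence`, and the explicit `atomic_thread_fence` as a second `mfence`; the samples attributed to the instructions right after those fences account for a large share of the profile.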
## hw7: perf analysis on bare-metal Linux
### peterson_trival-g
:::success
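**Only when hw7 is profiled on bare-metal Linux can perf actually measure cycles, instructions, branches, and branch-misses** (the VMware runs above reported them as `<not supported>`).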

```
$ sudo perf stat ./peterson_trival-g
Performance counter stats for './peterson_trival-g':
60,006.22 msec task-clock # 2.000 CPUs utilized
580 context-switches # 0.010 K/sec
4 cpu-migrations # 0.000 K/sec
67 page-faults # 0.001 K/sec
211,445,396,951 cycles # 3.524 GHz
443,129,572,320 instructions # 2.10 insn per cycle
144,841,527,517 branches # 2413.775 M/sec
414,350,220 branch-misses # 0.29% of all branches
30.002342037 seconds time elapsed
60.006886000 seconds user
0.000000000 seconds sys
```
:::
---
### peterson_trival-O3
:::success

```
$ sudo perf stat ./peterson_trival-O3
Performance counter stats for './peterson_trival-O3':
60,003.36 msec task-clock # 2.000 CPUs utilized
171 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
69 page-faults # 0.001 K/sec
232,921,278,865 cycles # 3.882 GHz
231,316,350,274 instructions # 0.99 insn per cycle
231,264,858,250 branches # 3854.199 M/sec
398,117 branch-misses # 0.00% of all branches
30.003705481 seconds time elapsed
60.003975000 seconds user
0.000000000 seconds sys
```
:::
### peterson_correct-g
:::success

```
$ sudo perf stat ./peterson_correct-g
Performance counter stats for './peterson_correct-g':
60,006.16 msec task-clock # 2.000 CPUs utilized
188 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
68 page-faults # 0.001 K/sec
228,001,906,530 cycles # 3.800 GHz
88,866,279,243 instructions # 0.39 insn per cycle
11,808,654,166 branches # 196.791 M/sec
225,345,520 branch-misses # 1.91% of all branches
30.002368372 seconds time elapsed
60.006800000 seconds user
0.000000000 seconds sys
```
:::
### peterson_correct-O3
:::success

```
$ sudo perf stat ./peterson_correct-O3
Performance counter stats for './peterson_correct-O3':
60,002.05 msec task-clock # 2.000 CPUs utilized
158 context-switches # 0.003 K/sec
6 cpu-migrations # 0.000 K/sec
70 page-faults # 0.001 K/sec
223,994,816,129 cycles # 3.733 GHz
129,570,754,972 instructions # 0.58 insn per cycle
42,035,686,618 branches # 700.571 M/sec
249,549,299 branch-misses # 0.59% of all branches
30.002715298 seconds time elapsed
60.002754000 seconds user
0.000000000 seconds sys
```
:::
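The derived columns in these four outputs can be recomputed from the raw counters, which makes the builds easier to compare side by side. A sketch of that arithmetic with the numbers copied from the runs above; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Raw counters copied from one `perf stat` run. */
struct perf_counts {
    const char *name;
    double cycles;
    double instructions;
    double branches;
    double branch_misses;
};

/* Instructions retired per cycle (perf prints this as "insn per cycle"). */
static double ipc(const struct perf_counts *c) {
    assert(c->cycles > 0.0);
    return c->instructions / c->cycles;
}

/* Fraction of all branches that were mispredicted. */
static double branch_miss_rate(const struct perf_counts *c) {
    assert(c->branches > 0.0);
    return c->branch_misses / c->branches;
}

static const struct perf_counts runs[] = {
    {"peterson_trival-g",   211445396951.0, 443129572320.0, 144841527517.0, 414350220.0},
    {"peterson_trival-O3",  232921278865.0, 231316350274.0, 231264858250.0,    398117.0},
    {"peterson_correct-g",  228001906530.0,  88866279243.0,  11808654166.0, 225345520.0},
    {"peterson_correct-O3", 223994816129.0, 129570754972.0,  42035686618.0, 249549299.0},
};

/* Print one summary line per build. */
static void print_summary(void) {
    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; i++)
        printf("%-20s IPC %.2f  branch-miss %.2f%%\n", runs[i].name,
               ipc(&runs[i]), 100.0 * branch_miss_rate(&runs[i]));
}
```

Two things stand out in the numbers: peterson_trival-O3 retires almost exactly one branch per instruction (about 231.26 billion branches out of 231.32 billion instructions), and both peterson_correct builds have a much lower IPC than peterson_trival-g (0.39 and 0.58 versus 2.10).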
## hw7 measured on bare-metal Linux: perf annotate
```
$ sudo perf record ./peterson_trival-O3
$ sudo perf annotate
$ sudo perf record ./peterson_trival-g
$ sudo perf annotate
$ sudo perf record ./peterson_correct-O3
$ sudo perf annotate
$ sudo perf record ./peterson_correct-g
$ sudo perf annotate
```
### peterson_trival-O3
```
$ sudo perf record ./peterson_trival-O3
$ sudo perf annotate
```

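100% of the samples land on `jmp 38`. This is consistent with the perf stat output for peterson_trival-O3, where the branch count (about 231.26 billion) nearly equals the instruction count (about 231.32 billion): the program is spinning on a single jump, presumably because -O3 hoisted the loads of the non-atomic shared variables out of the wait loop.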

### peterson_trival-g
```
$ sudo perf record ./peterson_trival-g
$ sudo perf annotate
```

When peterson_trival-g runs on bare metal, the performance bottleneck shifts to
`29: mov flag0,%eax`

### peterson_correct-O3
```
$ sudo perf record ./peterson_correct-O3
$ sudo perf annotate
```



```
$ gdb ./peterson_correct-O3
(gdb) disassemble p0
```
Comparison:

### peterson_correct-g
```
$ sudo perf record ./peterson_correct-g
$ sudo perf annotate
```


### Flame graph analysis
Using ./peterson_correct-O3 as an example:
```
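1. Record the data
-F 99 samples 99 times per second;
-a records all CPUs;
-g records call graphs, which the flame graph needs;
./peterson_correct-O3 is the executable to profile
When the run finishes, a perf.data file is generated
$ sudo perf record -F 99 -a -g ./peterson_correct-O3
2. Parse perf.data with the perf script tool; this generates out.perf
$ sudo perf script > out.perf
3. Fold the call stacks
(uses stackcollapse-perf.pl from the FlameGraph directory and produces out.folded)
$ sudo /home/blue76185/桌面/作業/FlameGraph/stackcollapse-perf.pl out.perf > out.folded
4. Generate the flame graph (uses flamegraph.pl from FlameGraph):
This generates perf.svg, which is the flame graph we want.
$ sudo /home/blue76185/桌面/作業/FlameGraph/flamegraph.pl out.folded > perf.svg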
```


### Other references
[LeCun转推,PyTorch GPU内存分配有了火焰图可视化工具](https://mp.weixin.qq.com/s/7boNWkTqy3AY0pdAzpEY1A?fbclid=IwAR0mPl9C2k4ldU8ROWiAUsYWTdgdQHRIRgAmzVI-6-f0-9txaASVIS21QPA)