# 羅習五 HW7 perf Analysis

###### tags: `LINUX_並發與競態_行程同步`

Both `peterson_correct.c` and `peterson_trival.c` were modified so that they exit after running for 30 seconds:

```c
int count=0;
// once per second, print how many times P0 and P1 entered the CS
void per_second(int signum) {
    ...
    count++;
    if(count==30)
        exit(0);
    alarm(1);
}
```

## Key points

[在 Linux 上使用 Perf 做效能分析(入門篇)](https://tigercosmos.xyz/post/2020/08/system/perf-basic/) (Performance analysis with perf on Linux, introductory part)

[Source level analysis with perf annotate](https://perf.wiki.kernel.org/index.php/Tutorial#Source_level_analysis_with_perf_annotate)

:::success
Use perf annotate
```
$ sudo perf record ./<your-executable>
$ sudo perf annotate
```
to see the performance bottleneck at the assembly level.

Use perf stat
```
$ sudo perf stat ./peterson_trival-g
```
to get the big picture directly.
:::

For example:

```
$ sudo perf record ./peterson_trival-O3
$ sudo perf annotate
```
![](https://hackmd.io/_uploads/ByqqxjAmn.png)
![](https://hackmd.io/_uploads/Bk35ejRm2.png)

```
$ sudo perf stat ./peterson_trival-g

 Performance counter stats for './peterson_trival-g':

         48,227.98 msec task-clock                #    1.596 CPUs utilized
             1,144      context-switches          #    0.024 K/sec
                61      cpu-migrations            #    0.001 K/sec
                62      page-faults               #    0.001 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      30.221688059 seconds time elapsed

      20.631166000 seconds user
      27.596390000 seconds sys
```

```
$ sudo perf stat ./peterson_trival-O3

 Performance counter stats for './peterson_trival-O3':

         60,011.23 msec task-clock                #    1.999 CPUs utilized
               413      context-switches          #    0.007 K/sec
                 2      cpu-migrations            #    0.000 K/sec
                64      page-faults               #    0.001 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      30.014287374 seconds time elapsed

      59.991092000 seconds user
       0.015996000 seconds sys
```

One of the reported counters is cpu-migrations.

[Linux scheduling](https://www.ibm.com/docs/sk/linux-on-systems?topic=management-linux-scheduling)
> **Moving a virtual CPU from one run queue to another is called a (CPU) migration.** Be sure not to confuse the term “CPU migration” with a “live migration”, which is the migration of a virtual server from one host to another. The Linux scheduler might decide to migrate a virtual CPU when the estimated wait time until the virtual CPU will be executed is too long, the run queue where it is supposed to be waiting is full, or another run queue is empty and needs to be filled up.

~~cpu-migrations means moving a process between virtual CPUs; because I ran this program in a VMware virtual machine, cpu-migrations between virtual CPUs occur~~

[Linux - Difference between migrations and switches?](https://stackoverflow.com/questions/45368742/linux-difference-between-migrations-and-switches)
> **Migration is when a thread, usually after a context switch, get scheduled on a different CPU than it was scheduled before.**

---

```
$ sudo perf record ./peterson_trival-O3
$ sudo perf annotate
```
![](https://hackmd.io/_uploads/rkaols0mh.png)

```
$ sudo perf record ./peterson_trival-g
$ sudo perf annotate
```
**If P0 and P1 enter the critical section at the same time, perf record fails to record,**
which produces the following perf annotate result:
![](https://hackmd.io/_uploads/S1gpliCm2.png)

When the measurement of P0 and P1 in ./peterson_trival-g succeeds,
`$ sudo perf annotate` shows the following:

![](https://hackmd.io/_uploads/HybAxjR73.png)
![](https://hackmd.io/_uploads/ByEAejC73.png)

In particular, in this part
![](https://hackmd.io/_uploads/rJjxbs0Xh.png)

```
Percent│
       │    /home/blue76815/.debug/.build-id/29/1959098c03695a00cf391506bf5deb5c7fd70a/elf: file format elf64-x86-64
       │
       │
       │    void p0(void) {
       │      push   %rbp
       │      mov    %rsp,%rbp
       │        printf("p0: start\n");
       │      lea    _IO_stdin_used+0x60,%rdi
       │    → callq  puts@plt
       │        while (1) {
       │            //🐉 🐲 🌵 🎄 🌲 🌳 🌴 🌱 🌿 ☘️ 🍀
       │            //Entry section of Peterson's solution
       │            flag0 = 1;
       │10:   movl   $0x1,flag0
       │            turn = 1;
  0.04 │      movl   $0x1,turn
       │            while (flag1==1 && turn==1)
  0.01 │      nop
  9.37 │25:   mov    flag1,%eax
 56.77 │      cmp    $0x1,%eax
       │    ↓ jne    3b
 30.91 │      mov    turn,%eax
  1.80 │      cmp    $0x1,%eax
       │    ↑ je     25
       │                ; //waiting
```

`cmp $0x1,%eax` has the **highest percentage**, which **echoes the instructor's remark that the compare is the performance bottleneck**. (Because perf samples can skid by an instruction, part of this cost may actually belong to the preceding load, `mov flag1,%eax`.)

---

## peterson_correct analysis

```
$ sudo perf stat ./peterson_correct-O3

 Performance counter stats for './peterson_correct-O3':

         60,031.37 msec task-clock                #    2.000 CPUs utilized
               208      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
                62      page-faults               #    0.001 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      30.019816344 seconds time elapsed

      60.019472000 seconds user
       0.012003000 seconds sys

$ sudo perf stat ./peterson_correct-g

 Performance counter stats for './peterson_correct-g':

         60,023.59 msec task-clock                #    2.000 CPUs utilized
               202      context-switches          #    0.003 K/sec
                 2      cpu-migrations            #    0.000 K/sec
                63      page-faults               #    0.001 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      30.015192347 seconds time elapsed

      60.023823000 seconds user
       0.000000000 seconds sys
```

```
$ sudo perf record ./peterson_correct-O3
$ sudo perf annotate
```
![](https://hackmd.io/_uploads/Hk2WZo0m3.png)

```
Percent│
       │    /home/blue76815/.debug/.build-id/aa/fd55126b43dc94f67ef2d9eebd9e309fcf0f87/elf: file format elf64-x86-64
       │
       │
       │    Disassembly of section .text:
       │
       │    0000000000000b00 <p1>:
       │    p1():
       │      lea    _IO_stdin_used+0xd,%rdi
       │      sub    $0x8,%rsp
       │    → callq  puts@plt
  0.16 │10:   movl   $0x1,flag+0x4
  0.01 │      mfence
 15.48 │      mfence
  6.28 │      movl   $0x0,turn
  0.07 │      mfence
 17.47 │    ↓ jmp    3a
       │      nop
  3.30 │30:   mov    turn,%eax
  9.97 │      test   %eax,%eax
  0.01 │    ↓ jne    44
  4.98 │3a:   mov    flag,%eax
 10.04 │      test   %eax,%eax
  0.02 │    ↑ jne    30
  1.10 │44: → callq  sched_getcpu@plt
  0.08 │      mov    %eax,cpu_p1
  0.05 │      mov    in_cs,%eax
  5.98 │      add    $0x1,%eax
  0.15 │      cmp    $0x2,%eax
  0.11 │      mov    %eax,in_cs
  0.07 │    ↓ je     88
  0.01 │63:   addl   $0x1,p1_in_cs
  0.35 │      sub    $0x1,%eax
  0.00 │      mov    %eax,in_cs
  0.01 │      movl   $0x0,flag+0x4
  0.10 │      mfence
 24.20 │    ↑ jmp    10
       │      nop
       │88:   mov    stderr@@GLIBC_2.2.5,%rcx
       │      lea    _IO_stdin_used+0x18,%rdi
       │      mov    $0x1e,%edx
       │      mov    $0x1,%esi
       │    → callq  fwrite@plt
       │      mov    in_cs,%eax
       │    ↑ jmp    63
       │      nop
```

```
$ sudo perf record ./peterson_correct-g
$ sudo perf annotate
```
![](https://hackmd.io/_uploads/S1qGWoAXh.png)
![](https://hackmd.io/_uploads/rye7ZsCQ3.png)

```
Percent│
       │    /home/blue76815/.debug/.build-id/5d/76ec6367369031d0f2d82842955ade16df54b2/elf: file format elf64-x86-64
       │
       │
       │    Disassembly of section .text:
       │
       │    0000000000000a3e <p0>:
       │    p0():
       │        if(count==30)
       │            exit(0);
       │        alarm(1);
       │    }
       │
       │    void p0(void) {
       │      push   %rbp
       │      mov    %rsp,%rbp
       │      sub    $0x40,%rsp
       │      mov    %fs:0x28,%rax
       │      mov    %rax,-0x8(%rbp)
       │      xor    %eax,%eax
       │        printf("start p0\n");
       │      lea    _IO_stdin_used+0x60,%rdi
       │    → callq  puts@plt
       │        while (1) {
       │            //🐉 🐲 🌵 🎄 🌲 🌳 🌴 🌱 🌿 ☘️ 🍀
       │            //Entry section of Peterson's solution
       │            atomic_store(&flag[0], 1);
  0.09 │ 23:  lea    flag,%rax
  0.00 │      mov    %rax,-0x30(%rbp)
  0.01 │      movl   $0x1,-0x34(%rbp)
  0.11 │      mov    -0x34(%rbp),%eax
  0.01 │      mov    %eax,%edx
       │      mov    -0x30(%rbp),%rax
  0.01 │      mov    %edx,(%rax)
  0.61 │      mfence
       │            atomic_thread_fence(memory_order_seq_cst);
 15.24 │      mfence
       │            atomic_store(&turn, 1);
  6.30 │      lea    turn,%rax
  0.03 │      mov    %rax,-0x28(%rbp)
  0.00 │      movl   $0x1,-0x34(%rbp)
  0.00 │      mov    -0x34(%rbp),%eax
  0.09 │      mov    %eax,%edx
       │      mov    -0x28(%rbp),%rax
  0.01 │      mov    %edx,(%rax)
  0.62 │      mfence
       │            while (atomic_load(&flag[1]) && atomic_load(&turn)==1)
 16.43 │      nop
  1.42 │ 67:  lea    flag+0x4,%rax
  0.72 │      mov    %rax,-0x20(%rbp)
  0.52 │      mov    -0x20(%rbp),%rax
  0.49 │      mov    (%rax),%eax
 12.32 │      mov    %eax,-0x34(%rbp)
  1.61 │      mov    -0x34(%rbp),%eax
  1.42 │      test   %eax,%eax
  0.00 │    ↓ je     9e
  1.29 │      lea    turn,%rax
  1.01 │      mov    %rax,-0x18(%rbp)
  0.57 │      mov    -0x18(%rbp),%rax
  0.33 │      mov    (%rax),%eax
  5.74 │      mov    %eax,-0x34(%rbp)
  1.81 │      mov    -0x34(%rbp),%eax
  1.61 │      cmp    $0x1,%eax
       │    ↑ je     67
       │                ; //waiting
       │
       │
       │            //The code below simulates being in the critical section
       │            cpu_p0 = sched_getcpu();
  1.31 │ 9e: → callq sched_getcpu@plt
  0.07 │      mov    %eax,cpu_p0
       │            in_cs++; //count how many are in the CS
  0.04 │      mov    in_cs,%eax
  5.28 │      add    $0x1,%eax
  0.09 │      mov    %eax,in_cs
       │            //nanosleep(&ts, NULL);
       │            if (in_cs == 2) fprintf(stderr, "p0及p1都在critical section\n");
  0.18 │      mov    in_cs,%eax
  0.93 │      cmp    $0x2,%eax
       │    ↓ jne    e3
       │      mov    stderr@@GLIBC_2.2.5,%rax
       │      mov    %rax,%rcx
       │      mov    $0x1e,%edx
       │      mov    $0x1,%esi
       │      lea    _IO_stdin_used+0x70,%rdi
       │    → callq  fwrite@plt
       │            p0_in_cs++; //number of times P0 has been in the CS
  0.12 │ e3:  mov    p0_in_cs,%eax
  0.90 │      add    $0x1,%eax
  0.02 │      mov    %eax,p0_in_cs
       │            //nanosleep(&ts, NULL);
       │            in_cs--; //count how many are in the CS
  0.02 │ │    mov    in_cs,%eax
  0.16 │ │    sub    $0x1,%eax
  0.01 │ │    mov    %eax,in_cs
       │ │
       │ │
       │ │
       │ │          //🐉 🐲 🌵 🎄 🌲 🌳 🌴 🌱 🌿 ☘️ 🍀
       │ │          //Exit section of Peterson's solution
       │ │          atomic_store(&flag[0], 0);
  0.03 │ │    lea    flag,%rax
  0.01 │ │    mov    %rax,-0x10(%rbp)
  0.08 │ │    movl   $0x0,-0x34(%rbp)
  0.00 │ │    mov    -0x34(%rbp),%eax
  0.03 │ │    mov    %eax,%edx
  0.01 │ │    mov    -0x10(%rbp),%rax
  0.06 │ │    mov    %edx,(%rax)
  0.07 │ │    mfence
       │ │          atomic_store(&flag[0], 1);
 20.14 │ └──jmpq   23
```

## hw7: perf analysis on a physical Linux machine

### peterson_trival-g
:::success
**Only when hw7 runs on a physical Linux machine can perf actually measure cycles, instructions, branches, and branch-misses** (in the virtual machine these counters were reported as `<not supported>`).

![](https://hackmd.io/_uploads/H1zVWs0mn.png)

```
$ sudo perf stat ./peterson_trival-g

 Performance counter stats for './peterson_trival-g':

         60,006.22 msec task-clock                #    2.000 CPUs utilized
               580      context-switches          #    0.010 K/sec
                 4      cpu-migrations            #    0.000 K/sec
                67      page-faults               #    0.001 K/sec
   211,445,396,951      cycles                    #    3.524 GHz
   443,129,572,320      instructions              #    2.10  insn per cycle
   144,841,527,517      branches                  # 2413.775 M/sec
       414,350,220      branch-misses             #    0.29% of all branches

      30.002342037 seconds time elapsed

      60.006886000 seconds user
       0.000000000 seconds sys
```
:::

---

### peterson_trival-O3
:::success
![](https://hackmd.io/_uploads/S1kS-oR72.png)

```
$ sudo perf stat ./peterson_trival-O3

 Performance counter stats for './peterson_trival-O3':

         60,003.36 msec task-clock                #    2.000 CPUs utilized
               171      context-switches          #    0.003 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                69      page-faults               #    0.001 K/sec
   232,921,278,865      cycles                    #    3.882 GHz
   231,316,350,274      instructions              #    0.99  insn per cycle
   231,264,858,250      branches                  # 3854.199 M/sec
           398,117      branch-misses             #    0.00% of all branches

      30.003705481 seconds time elapsed

      60.003975000 seconds user
       0.000000000 seconds sys
```
:::

### peterson_correct-g
:::success
![](https://hackmd.io/_uploads/BJcrWsRX3.png)

```
$ sudo perf stat ./peterson_correct-g

 Performance counter stats for './peterson_correct-g':

         60,006.16 msec task-clock                #    2.000 CPUs utilized
               188      context-switches          #    0.003 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                68      page-faults               #    0.001 K/sec
   228,001,906,530      cycles                    #    3.800 GHz
    88,866,279,243      instructions              #    0.39  insn per cycle
    11,808,654,166      branches                  #  196.791 M/sec
       225,345,520      branch-misses             #    1.91% of all branches

      30.002368372 seconds time elapsed

      60.006800000 seconds user
       0.000000000 seconds sys
```
:::

### peterson_correct-O3
:::success
![](https://hackmd.io/_uploads/H1PLbsAX2.png)

```
$ sudo perf stat ./peterson_correct-O3

 Performance counter stats for './peterson_correct-O3':

         60,002.05 msec task-clock                #    2.000 CPUs utilized
               158      context-switches          #    0.003 K/sec
                 6      cpu-migrations            #    0.000 K/sec
                70      page-faults               #    0.001 K/sec
   223,994,816,129      cycles                    #    3.733 GHz
   129,570,754,972      instructions              #    0.58  insn per cycle
    42,035,686,618      branches                  #  700.571 M/sec
       249,549,299      branch-misses             #    0.59% of all branches

      30.002715298 seconds time elapsed

      60.002754000 seconds user
       0.000000000 seconds sys
```
:::

## hw7 on a physical Linux machine: perf annotate

```
$ sudo perf record ./peterson_trival-O3
$ sudo perf annotate

$ sudo perf record ./peterson_trival-g
$ sudo perf annotate

$ sudo perf record ./peterson_correct-O3
$ sudo perf annotate

$ sudo perf record ./peterson_correct-g
$ sudo perf annotate
```

### peterson_trival-O3
```
$ sudo perf record ./peterson_trival-O3
$ sudo perf annotate
```
![](https://hackmd.io/_uploads/rJXv-jAQn.png)

100% of the samples land on `jmp 38`. This is consistent with the perf stat result for peterson_trival-O3 on the physical machine, where the number of branches (231,264,858,250) is almost equal to the number of instructions (231,316,350,274).
![](https://hackmd.io/_uploads/By3vbs0X2.png)

### peterson_trival-g
```
$ sudo perf record ./peterson_trival-g
$ sudo perf annotate
```
![](https://hackmd.io/_uploads/Sy4KWjCX3.png)

When peterson_trival-g runs on the physical machine, the bottleneck shifts to `29: mov flag0,%eax`.
![](https://hackmd.io/_uploads/S1Qc-sRm3.png)

### peterson_correct-O3
```
$ sudo perf record ./peterson_correct-O3
$ sudo perf annotate
```
![](https://hackmd.io/_uploads/HyAibsCQn.png)
![](https://hackmd.io/_uploads/r1znZsRQh.png)
![](https://hackmd.io/_uploads/BJUn-o0mh.png)

For comparison, the disassembly from gdb:
```
$ gdb ./peterson_correct-O3
(gdb) disassemble p0
```
![](https://hackmd.io/_uploads/ryJAZjRm2.png)

### peterson_correct-g
```
$ sudo perf record ./peterson_correct-g
$ sudo perf annotate
```
![](https://hackmd.io/_uploads/rkL8GjCXn.png)
![](https://hackmd.io/_uploads/rJtIMiRQ3.png)

### Flame graph analysis, using ./peterson_correct-O3 as an example
```
# 1. Record the data. -F 99 samples 99 times per second, -a records all CPUs,
#    and -g records call graphs. ./peterson_correct-O3 is the executable to run.
#    When it finishes, a perf.data file is generated.
$ sudo perf record -F 99 -a -g ./peterson_correct-O3

# 2. Parse perf.data with perf script; this generates out.perf.
$ sudo perf script > out.perf

# 3. Fold the call stacks with stackcollapse-perf.pl from the FlameGraph
#    directory; this generates out.folded.
$ sudo /home/blue76185/桌面/作業/FlameGraph/stackcollapse-perf.pl out.perf > out.folded

# 4. Generate the flame graph with flamegraph.pl from FlameGraph; the
#    resulting perf.svg is the flame graph we want.
$ sudo /home/blue76185/桌面/作業/FlameGraph/flamegraph.pl out.folded > perf.svg
```
![](https://hackmd.io/_uploads/BktuMsAXh.png)
![](https://hackmd.io/_uploads/B13uzj0Qn.png)

### Other references
[LeCun转推，PyTorch GPU内存分配有了火焰图可视化工具](https://mp.weixin.qq.com/s/7boNWkTqy3AY0pdAzpEY1A?fbclid=IwAR0mPl9C2k4ldU8ROWiAUsYWTdgdQHRIRgAmzVI-6-f0-9txaASVIS21QPA) (LeCun retweeted: PyTorch GPU memory allocation now has a flame-graph visualization tool)
