---
tags: linux-perf, branch-inst
---
[TOC]

# Note for [perf 原理和實務](https://hackmd.io/@sysprog/B11109rdg?type=view#perf-%E5%8E%9F%E7%90%86%E5%92%8C%E5%AF%A6%E5%8B%99)

With perf, an application can be profiled using the ==PMU (Performance Monitoring Unit)==, tracepoints and special counters inside the kernel. It can also profile the running kernel code at the same time, giving a more complete picture of the performance bottlenecks in an application.

The basic principle of perf is sampling the target: it records whether the monitored events occur under specific conditions, and how many times.

- For example, ++sampling can be driven by the tick interrupt++: a sample point is triggered inside the tick interrupt, and at each sample point the current context of the process is examined. If a process spends 90% of its time in the function foo(), then about 90% of the sample points should fall inside foo().

perf can sample a very large number of events. **It can analyze hardware events** such as cpu-cycles, instructions, cache-misses, branch-misses, and so on.

- Other event types include `Software event` and `Tracepoint event`.

:star: Knowing **cpu-cycles** and **instructions**, we can work out the instructions per cycle (IPC) and judge whether the code is making good use of the CPU.
> an interview question from a certain CPU IP company

**cache-misses** tells us whether we make good use of locality of reference; do excessive branch-misses cause serious pipeline hazards?

## Installation

:notes: check process status: `journalctl -xe`

## Example

- on my GitHub: [perf_top_example.c](https://github.com/asahsieh/linux-kernel-internals/blob/main/2022q1/week1/perf/perf_top_example.c)

## [Background knowledge](https://hackmd.io/@sysprog/B11109rdg?type=view#%E8%83%8C%E6%99%AF%E7%9F%A5%E8%AD%98)

### **Branch prediction as a hardware characteristic**

$\to$ Companion reading: [Fast and slow if-statements: branch prediction in modern processors](http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/)

:information_source: However, the problem is that modern processors don’t track history for each branching instruction separately.
> [Branch Target Buffer](https://www2.eecs.berkeley.edu/Pubs/TechRpts/1989/5904.html)

$\to$ Assistance provided by the compiler: [Branch Patterns, Using GCC](http://cellperformance.beyond3d.com/articles/2006/04/branch-patterns-using-gcc.html)

#### Rule #1: Forward branches are *not likely* to be taken

* Forward branches are generated for *if* statements.
* The rationale for making them not likely to be taken is that ++the processor can take advantage of the fact that instructions following the branch instruction may already be placed in the instruction buffer inside the Instruction Unit++.

:::info
When writing an if-else statement, ++always make the "then" block more likely to be executed than the else block++, so the processor can take advantage of instructions already placed in the instruction fetch buffer.
:::

* Also, when an `if` prematurely terminates the function with a return and that case is unlikely to occur, rewrite it so that the early return moves into the "else" block; this should be **the rule for error conditions**.

```c
ptr = malloc(sizeof(some_structure));
if (ptr == NULL) {
    return -1;
}
/* Malloc successful, continue execution. */
```

Might instead be:

```c
ptr = malloc(sizeof(some_structure));
if (ptr != NULL) {
    /* Malloc successful, continue execution. */
} else {
    return -1;
}
```

:notes: Put rarely executed code (error handling) in the `else` block to reduce the number of branch mispredictions.
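To check whether such a rearrangement actually pays off, the branch events from the introduction can be counted directly with perf. A minimal sketch; `./a.out` stands for whatever test program you build around the snippet above:

```shell
# Count executed branches and mispredictions before/after rearranging the code
perf stat -e branch-instructions,branch-misses ./a.out
```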
#### Rule #2: Backward branches are likely to be taken

* Backward branches are generated for while, do-while and for statements.
* The reason to make them likely is that loops are generally executed one or more times, so the branch at the end of the loop that ++tests for the condition is likely to be taken++.

:::info
It's important to note that the compiler rearranges while and for loops so that the test is made at the end. It's a good optimization that saves a branch instruction.
:::

e.g.:

```c
int while_test( int a, int b, int c )
{
    int count;
    while ( c < a ) {
        c += b;
        a--;
        count++;
    }
    return (count);
}
```

*The optimized version, which checks the loop condition at the end:*

```
while_test:
    stwu  1,-16(1)   # allocate stack
    b     .L9        # jump to condition test
.L11:
    add   5,5,4      # c += b
    addi  3,3,-1     # a--
    addi  9,9,1      # count++
.L9:
    cmpw  7,5,3      # check c < a
    blt+  7,.L11     # iterate if true
    mr    3,9        # = count
    addi  1,1,16     # restore stack
    blr              # return
```

*The version of the loop that tests at the beginning:*

```
while_test:
    stwu  1,-16(1)   # allocate stack
    #b    .L9        # jump to condition test
.L9:
    cmpw  7,5,3      # check c < a
    bge+  7,.END     # exit the loop if c >= a (condition false)
.L11:
    add   5,5,4      # c += b
    addi  3,3,-1     # a--
    addi  9,9,1      # count++
    b     .L9        # the extra unconditional branch
.END:
    mr    3,9        # = count
    addi  1,1,16     # restore stack
    blr              # return
```

> :warning: The static number of branch instructions is not reduced. The differences are: the unconditional branch that jumps to the condition test (`.L9`) is moved after the loop body, so it now executes on every iteration, and the conditional branch that jumped back into the loop (`blt+`) becomes a `bge` that jumps forward to a newly added label, `.END`. Testing at the beginning therefore costs one extra taken branch per iteration, and it also adds a constraint that prevents the compiler from optimizing the code further.

In order to put the test at the end of the loop and save one branch instruction, ++the compiler generates an unconditional branch before the start of the loop that is executed only once and jumps to the condition test at the end++.

```asm
while_test:
    stwu  1,-16(1)   # allocate stack
    b     .L9        # jump to condition test
.L11:
    ...
.L9:
    cmpw  7,5,3      # check c < a
    blt+  7,.L11     # iterate if true
    ...
```

:information_source: The `cmpw` instruction in the previous for and while examples *puts its result in CR7, and the branch instruction that immediately follows tests CR7*. This causes a **stall** (control hazard) in the execution pipeline, because the branch has to wait for the result of `cmpw` to be written to CR7.

:+1: The do-while loop doesn't suffer from this problem. Without the branch to the test condition, the compiler can reschedule instructions in order to fill the space:

```asm
do_while_test:
    stwu  1,-16(1)   # allocate stack
.L13:
    add   5,5,4      # c += b
    addi  3,3,-1     # a--
    cmpw  7,5,3      # check c < a
    addi  9,9,1      # count++
    blt+  7,.L13     # iterate if true
    ...
```

> the compiler moved the `addi 9,9,1` (`count++`) instruction in between `cmpw` and the branch. While the addition is being computed, the result of the comparison is being saved into CR7 to be used in the branch that follows.

* Conclusion:
  * :::info
    Try to change for and while loops to do-while loops.
    :::
  * :::info
    - [ ] When converting for loops that have `continue` statements in their bodies, change those statements to an `explicit goto` to the update part of the code near the end of the do-while loop (see the sketch after the example below).
    :::
  * :::info
    In cases where it's mandatory to test for the loop condition at the beginning, ++convert the loop into a do-while and add an if to test whether it has to be executed at least once++.
    :::

```c
void print_list(const list_t *list)
{
    while (list != NULL) {
        printf("%d\n", list->value);
        list = list->next;
    }
}

void print_list(const list_t *list)
{
    if (list != NULL) // the added if
    {
        do {
            printf("%d\n", list->value);
            list = list->next;
        } while (list != NULL);
    }
}
```
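For the `continue` --> `goto` conversion mentioned in the second conclusion above, a minimal sketch (the function `sum_even()` is a hypothetical example, not taken from the article):

```c
/* Original form: `continue` implicitly jumps to the update part (i++). */
int sum_even(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] & 1)
            continue;          /* skip odd entries */
        sum += a[i];
    }
    return sum;
}

/* do-while form: the `continue` becomes an explicit goto to the update code. */
int sum_even_dw(const int *a, int n)
{
    int sum = 0;
    int i = 0;
    if (i < n) {               /* the added if, as in print_list() above */
        do {
            if (a[i] & 1)
                goto update;   /* was `continue` */
            sum += a[i];
update:
            i++;               /* the update part of the original for loop */
        } while (i < n);
    }
    return sum;
}
```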
#### Rule #3: SWITCH STATEMENTS

- [ ] Analyze the performance of different implementations

- Example 1: **Binary search (number of cases < 5)**
  > source code: [switch_small_cases.c](https://github.com/asahsieh/linux-kernel-internals/blob/cd49e4389180a5291a2599d236f913d4fe903323/2022q1/week1/perf/branch_pattern_analysis/switch_small_cases.c)

By **perf record** and **perf report**:
- *Executed command*: `perf record -e branch-misses:u,branch-instructions:u ./switch_small_cases 4`
- Result from executing `perf report`:

```shell
Available samples
7  branch-misses:u
9  branch-instructions:u
...
ESC: exit, ENTER|->: Browse histograms
```

- Go deeper into `Browse histograms` by pressing `ENTER` on **7 branch-misses:u**
- Observe branches together with the source code

:question: none of the `functions` from the source code are shown; using `branch-misses:u` as the example:

```shell
Samples: 7  of event 'branch-misses:u', Event count (approx.): 2785
Overhead  Command          Shared Object         Symbol
  52.60%  switch_small_ca  ld-linux-x86-64.so.2  [.] _dl_relocate_object
  33.61%  switch_small_ca  ld-linux-x86-64.so.2  [.] __tunable_get_val
  11.20%  switch_small_ca  ld-linux-x86-64.so.2  [.] __GI___tunables_init
   2.59%  switch_small_ca  ld-linux-x86-64.so.2  [.] _dl_start
```

:bulb: Try to **increase the sample rate** with the command mentioned in the lecture:
`perf record -F 100000 -e branch-misses:u,branch-instructions:u ./switch_small_cases 4`
The requested frequency already exceeds the maximum frequency rate, yet only a few samples are captured:

```shell
warning: Maximum frequency rate (2,3750 Hz) exceeded, throttling from 10,0000 Hz to 2,3750 Hz.
         The limit can be raised via /proc/sys/kernel/perf_event_max_sample_rate.
         The kernel will lower it when perf's interrupts take too long.
         Use --strict-freq to disable this throttling, refusing to record.
The input value is: 4. Corrently no mapped function.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data (29 samples) ]
```

:bulb: To increase the number of executions, as in the example code from the lecture:
> Source code: [switch_small_cases_v1.c](https://github.com/asahsieh/linux-kernel-internals/blob/614fa21c81e6e0659ca1ba2adef855eb06cc76e4/2022q1/week1/perf/branch_pattern_analysis/switch_small_cases.c)

```c
...
#define N 5000000
...
int main (int argc, char *argv[])
{
    int i;
    for (i = 0; i < N; i++)
        test_0(i%4);
    return EXIT_SUCCESS;
}
```

:::success
Then we can observe that the function `test_0` now accounts for 33.05% of the overhead:
```shell
Samples: 7  of event 'branch-misses:u', Event count (approx.): 2787
Overhead  Command          Shared Object         Symbol
  40.80%  switch_small_ca  ld-linux-x86-64.so.2  [.] _dl_name_match_p
  33.05%  switch_small_ca  switch_small_cases    [.] test_0
  20.70%  switch_small_ca  ld-linux-x86-64.so.2  [.] __GI___tunables_init
   5.45%  switch_small_ca  ld-linux-x86-64.so.2  [.] _dl_start
```
:::

:star: To annotate a shared object with its source code, compile the source code with debugging information, i.e., `gcc -g ${SOURCE}`; the result of annotating the symbol `[.] test_0`:

![](https://i.imgur.com/pcgKwSI.png)
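A possible command sequence to reproduce such an annotated view (the file name follows the repository above; the flags other than `-g` are assumptions):

```shell
# build with debug info so perf can interleave source and assembly
gcc -g -O2 -o switch_small_cases switch_small_cases.c

# record branch events, then inspect the hot symbol
perf record -e branch-misses:u,branch-instructions:u ./switch_small_cases 4
perf report            # press 'a' on test_0 to annotate it
perf annotate test_0   # or jump straight to the annotated view
```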
[Analyze the result by perf-annotate](https://hackmd.io/19xPmPOJTHemR5ZKdAvgkQ?view#Further-reading-Analyzing-performance-with-perf-annotate):

:notes: Refer to `CS:APP` or the [Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2: Instruction Set Reference](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf) to interpret x86 assembly instructions.

What is the `rbp` register?
--> the *base pointer* (or *frame pointer*): it takes a snapshot of the stack pointer (`rsp`), so local variables and function parameters remain accessible at a constant offset from `rbp`[^first]
--> all registers[^second]
--> the relation between the \[R/E\]AX registers[^third]:

![](https://i.imgur.com/KxEqTyZ.png)

**Hot spots of branch misses** are located on the `jg 65` instruction, which jumps to the `default` case.
--> *Why?* The input pattern fed to the switch-case statement ++loops over the integers zero to three and never hits the default case++.

If the input pattern is changed so that ++the default case is included in the loop++, the hot spot moves to **the first jump instruction**:

```c
int main (int argc, char *argv[])
{
    ...
    for (i = 0; i < N; i++)
        test_0(i%5); // from 4 to 5
    ...
}
```

![](https://i.imgur.com/udhhICA.png)

> Observe and analyze:
> :question: the branch misses are now spread across the different cases, but the order of the comparisons in the assembly code is opposite to the input pattern, so most of the branch mispredictions land on the first comparison.

- Example 2: **Jump Tables**
  There are three implementations:
  - the initial one is from *Branch table* on Wikipedia: [Jump table example in C](https://en.wikipedia.org/wiki/Branch_table#Jump_table_example_in_C)
  - the other two are the `large-sparse` and `large-dense` examples shown in the article
  > Links to the example code:
  > - [initial example](https://github.com/asahsieh/linux-kernel-internals/blob/main/2022q1/week1/perf/branch_pattern_analysis/switch_jump_table_ex.c)
  > - [large-sparse example](https://github.com/asahsieh/linux-kernel-internals/blob/main/2022q1/week1/perf/branch_pattern_analysis/switch_jump_table_in_large-sparse.c)
  > - [large-dense example](https://github.com/asahsieh/linux-kernel-internals/blob/main/2022q1/week1/perf/branch_pattern_analysis/switch_jump_table_in_large-dense.c)

:information_source: From the article: *The program is trying to get away with code that doesn't use branches at all.* However, the assembly generated for all three implementations has no `address table` like the one shown in the article and still contains many branches:

- expected assembly code
```asm
.L19: # address table
    .word .L13
    .word .L14
    .word .L15
    .word .L16
    .word .L17
    .word .L18
```
- comparison of the example code; the evolution goes from left to right
![](https://i.imgur.com/pFDwuRi.png)

- [ ] (TODO) the example code above should be implemented with assembly like in the article
  - by `goto`? [6.8.6.1 The `goto` statement](http://port70.net/~nsz/c/c99/n1256.html#6.8.6.1)
  > Constraints:
  > 1 The identifier in a goto statement shall name a label located somewhere in the enclosing function. A goto statement shall not jump from outside the scope of an identifier having a variably modified type to inside the scope of that identifier.

--> with a plain `goto` the implementation is not really a *table*; it may only be expressible directly in assembly code (but see the computed-goto sketch below)
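One way to get an address table from C without hand-written assembly is GCC's "labels as values" extension (computed goto). A minimal sketch; the function `dispatch()` is a hypothetical example, not one of the linked implementations:

```c
/* GCC/Clang extension: &&label takes the address of a label, and
 * `goto *expr` jumps to it, so table[] plays the role of the
 * .word address table shown above. */
int dispatch(int i)
{
    static const void *table[] = { &&case0, &&case1, &&case2, &&case3 };

    if (i < 0 || i > 3)    /* the bounds check replaces the default: case */
        return -1;
    goto *table[i];

case0: return 10;
case1: return 11;
case2: return 12;
case3: return 13;
}
```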
:accept: The tips from this section are listed below; I would add the implementations in the future.

:::info
Avoid switches for high-performance code. Changes in case values can result in **non-obvious** performance impacts.
:::

- When the switch is small (fewer than five case values, for both the PPU and SPU), the number of comparisons GCC generates is comparable to a series of if statements.
  > - [ ] PPU and SPU:
  >   - [SPU Options in gcc](https://gcc.gnu.org/onlinedocs/gcc-7.5.0/gcc/SPU-Options.html)
  >   - [Processor Units in PS3](https://www.embedded.com/using-open-source-gnu-eclipse-linux-to-develop-multicore-cell-apps-part-2/)
- GCC cannot take into account the likelihood of each case being taken
  --> :star: In this case, manually converting the switch to a series of ifs, ordering the comparisons from most likely to least likely, can have a big impact on performance.

:::info
Eliminate small switches. Convert them to a series of if statements contained by an outer branch (when the default case is likely). Alternatively, convert to a branch-free expression.
:::

- Sometimes ==medium-size switches== that would ++perform better with jump tables are instead compiled by GCC into a binary search++.
  --> :star: In those cases, convert the switch statement to a series of labels and use an array of addresses (using GCC's &&label_name syntax, as in the sketch above).

:::success
It's important to note that GCC does not have a definition for medium-size switches, only for small, large-sparse and large-dense. What counts as a medium-size switch depends on ++how critical the code is and the memory constraints of the program++.
:::

:::info
Convert large, sparse switches to smaller ones by pre-processing the case values (e.g. via a hash function), then manually convert to a jump table depending on the number of cases and their likelihood.
:::

#### **Tip #1: Don't change GCC's default static prediction**

- The default static prediction generated by GCC can be overridden using the __builtin_expect compiler extension.
- This performs worse than the case where the fall-through path is the likely continuation address, so it is preferable not to change the default GCC prediction and instead to rearrange the conditional and loop statements to benefit from it.
- I compared the default static prediction with overriding it (the right-hand side): when the default is overridden, GCC generates a *forward branch* for the `if` statement that is likely to be taken:

![](https://i.imgur.com/NvvoOsc.png)

:notes: [movzbl inst](https://stackoverflow.com/questions/9317922/what-does-the-movzbl-instruction-do-in-ia-32-att-syntax); the mnemonic reads: MOV --> Zero-extend --> Byte (8-bit) --> to Long (32-bit)

> :question: The extra code on the right-hand side shows a forward branch that jumps to the likely continuation address; shouldn't the jump instruction be `jne`?
>
> :question: Is the part that performs worse the fact that the prediction now jumps to the likely continuation address?

:::info
Don't use the __builtin_expect extension to change GCC's default branch prediction, rearrange your code instead. Only use this extension to convert what would have been dynamically predicted branches to statically predicted ones, and only where the data has been analyzed and the predictability is well known in advance.
:::
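Where profiling really does justify a static hint, the conventional wrappers around `__builtin_expect` look like the sketch below (the `init_buffer()` function is hypothetical); this is not an argument against simply rearranging the code:

```c
#include <stdlib.h>

/* !! normalizes any non-zero value to 1 so that the comparison with the
 * expected value in __builtin_expect works as intended. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int init_buffer(void **out, size_t size)
{
    void *ptr = malloc(size);
    if (unlikely(ptr == NULL))  /* error path marked as the cold side */
        return -1;
    *out = ptr;
    return 0;
}
```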
#### **Tip #2: Boolean Operators Affect Static Branch Prediction**

Static prediction rules become increasingly complicated when ++a mix of boolean operators is used++ (because of C's short-circuiting rules), and the results are often unexpected.

:::info
Avoid complex boolean expressions.
:::

To match C's short-circuiting rules, in complex comparisons:
- "and" implies "unlikely"
- "or" implies "likely"

:::info
When cases are not equally likely, use "and" and order them from most to least likely. When cases are equally likely, use "or"; the order is not important.
:::

#### **Tip #3: Simple traces (not shorter traces) will almost always generate better code in GCC**

==Something about what a trace is. Something about simplifying expressions and combining at the end.==

For example:

```c
int box_test_overlap_combine_1( const Box* restrict const a, const Box* restrict const b )
{
    float acx = a->center.e[0];
    float acy = a->center.e[1];
    float acz = a->center.e[2];
    float aex = a->extent.e[0];
    float aey = a->extent.e[1];
    float aez = a->extent.e[2];
    float bcx = b->center.e[0];
    float bcy = b->center.e[1];
    float bcz = b->center.e[2];
    float bex = b->extent.e[0];
    float bey = b->extent.e[1];
    float bez = b->extent.e[2];

    float adiff_cx = fabs( acx-bcx );
    float adiff_cy = fabs( acy-bcy );
    float adiff_cz = fabs( acz-bcz );

    float sum_ex = aex + bex;
    float sum_ey = aey + bey;
    float sum_ez = aez + bez;

    int overlap_x = adiff_cx <= sum_ex;
    int overlap_y = adiff_cy <= sum_ey;
    int overlap_z = adiff_cz <= sum_ez;

    int overlap = overlap_x && overlap_y && overlap_z;
    return (overlap);
}
```

```asm
box_test_overlap_combine_1:
    ...
    and.    0,11,9
    mfcr    0
    rlwinm  0,0,31,1
    cmpwi   7,0,0
    beq-    0,.L11
    beq-    7,.L11
    li      10,1
.L11:
    mr      3,10
    addi    1,1,16
    blr
```

The compiler has been able to reduce the number of branches from three to two by using Condition Register (CR) boolean instructions. ++Additionally, the entire expression is evaluated in a single basic block, which can be better pipelined, and both branches occur at the end++.

:::info
Use the LOAD->COMPARE->COMBINE pattern.
:::

#### **Tip #4: Bitwise operators generate fewer branches and microcoded instructions than logical operators.**

++Logical operations on boolean results "encourage" the use of CR boolean instructions, many of which are microcoded.++ Using bitwise operators to combine the results produces better code.

*Updated assembly code*:

```
box_test_overlap_combine_2:
    ...
    and     3,4,0
    mfcr    9
    rlwinm  9,9,31,1
    and     3,3,9
    blr
```

In addition to ++being free of microcoded instructions, this version is also branch-free++. Because the `overlap_x & overlap_y & overlap_z` expression does not use boolean operators, no branch is implied.
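The article only shows the assembly for the bitwise version; a C sketch of what the corresponding source might look like (the `Vec3`/`Box` layout is an assumption inferred from the field accesses in `box_test_overlap_combine_1()` above):

```c
#include <math.h>

typedef struct { float e[3]; } Vec3;               /* assumed layout */
typedef struct { Vec3 center; Vec3 extent; } Box;  /* assumed layout */

int box_test_overlap_combine_2( const Box* restrict const a, const Box* restrict const b )
{
    float adiff_cx = fabs( a->center.e[0] - b->center.e[0] );
    float adiff_cy = fabs( a->center.e[1] - b->center.e[1] );
    float adiff_cz = fabs( a->center.e[2] - b->center.e[2] );

    float sum_ex = a->extent.e[0] + b->extent.e[0];
    float sum_ey = a->extent.e[1] + b->extent.e[1];
    float sum_ez = a->extent.e[2] + b->extent.e[2];

    int overlap_x = adiff_cx <= sum_ex;
    int overlap_y = adiff_cy <= sum_ey;
    int overlap_z = adiff_cz <= sum_ez;

    /* bitwise & instead of &&: no short-circuit, hence no implied branch */
    return overlap_x & overlap_y & overlap_z;
}
```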
:::info
Combine boolean results with bitwise operators.
:::

## Basic perf usage

### Perf list

This is probably the first command most people run right after installing perf: *it prints the events that perf can trigger.*

### Perf top

- If `[.]` is shown next to a symbol, the symbol is in user mode; `[k]` means kernel mode.

#### (Further reading) Analyzing performance with perf annotate

> Ref.: https://developer.ibm.com/tutorials/l-analyzing-performance-perf-annotate-trs/

**Annotate**

How to operate on and interpret the result of `perf report` in the text-based user interface (TUI):

- An example report from `perf report`:
![](https://i.imgur.com/eZmMANb.png)
- Annotate a particular function:
> :key: press ‘a’ on any `symbol`

![](https://i.imgur.com/e7ToXdd.png)

Taking the function `snooze_loop()` as an example, the assembly instructions of that function are displayed together with the source code:

- ++The numbers on the left side of the bar++ indicate the percentage of total samples that were recorded *against* that particular instruction. In this example,
  - 40% of the samples for `snooze_loop()` were recorded on the `beq 90` instruction
  - instructions are colored according to how *hot* they are; from hottest to coolest: `red`, `green`, `white`.
- Jump to the target location of a branch/call/return instruction:
> :key: press ‘Enter’ on an instruction that has an `arrow` before it

The meaning of the arrows in different directions:
- ++Up/down arrows++ point toward the target instruction of a branch instruction
- a ++right arrow++ is displayed for `call` instructions, and a ++left arrow++ for `return` instructions.

![](https://i.imgur.com/wuAsDO4.png)
> Select the `bl arch_local_irq_restore+0x8` line and press ‘Enter’. You will see the disassembly of `arch_local_irq_restore()` as follows.

![](https://i.imgur.com/fa4JFft.png)
> You can also press ‘q’ to go back one step.

### Perf stat

Compared with perf top, perf stat is usually used when you already have a target to optimize: you examine one specific event, or a series of events, on that target to get an overview of its performance.

Use perf stat to analyze the *cache miss* behaviour of the following program:

```c
static char array[10000][10000];
int main (void){
    int i, j;
    for (i = 0; i < 10000; i++)
        for (j = 0; j < 10000; j++)
            array[j][i]++;
    return 0;
}
```

*Command*:

```shell
$ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./perf_stat_cache_miss
```

*Result on my Ubuntu server*:

```shell
[18:36] asahsieh@asahsieh-MacBookAir:~/Documents/Courses/Linux-Kernel-Design/linux-kernel-internals/2022q1/week1/perf/cache-miss-analysis$ sudo perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./cache-miss-analysis

 Performance counter stats for './cache-miss-analysis' (5 runs):

          252,5511      cache-misses         #  1.921 % of all cache refs   ( +- 6.32% )
        1,3150,1467      cache-references                                    ( +- 0.02% )
       24,4419,6751      instructions         #  0.68  insn per cycle        ( +- 0.03% )
       35,6095,9809      cycles                                              ( +- 0.34% )

            1.4891 +- 0.0170 seconds time elapsed  ( +- 1.14% )
```

:::info
`--repeat <n>` (or `-r <n>`) runs the program n times and shows the variation range of each event.
For paired events such as cache-misses/cache-references and instructions/cycles, perf conveniently computes the ratio when both appear together.
:::

- From this perf stat run we can see that the program's **cache miss rate** is `1.921 %`, and as a consequence the **IPC** is only about `0.68`.
- What are the **miss rates** of the L1/L2 caches individually?
  - the corresponding performance counters can be found with `perf list`
  - e.g., **mem_load_uops_retired.l1_miss** [Retired load uops misses in L1 cache as data sources Spec update: HSM30. Supports address when precise (Precise event)]
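One way to look at the per-level miss rates directly is perf's generic cache events; a sketch only, since the exact event names and their availability depend on the CPU and kernel (check `perf list` first):

```shell
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    ./cache-miss-analysis
```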
:notes: If we exploit locality of access a bit and swap `i` and `j` so the update becomes `array[i][j]++`:

- *perf stat* result:

```shell
[00:56] asahsieh@asahsieh-MacBookAir:~/Documents/Courses/Linux-Kernel-Design/linux-kernel-internals/2022q1/week1/perf/cache-miss-analysis$ sudo perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles,mem_load_uops_retired.l1_miss ./cache-miss-analysis

 formance counter stats for './cache-miss-analysis' (5 runs):

           26,0702      cache-misses         # 81.012 % of all cache refs   ( +- 4.84% )  (39.98%)
           36,2588      cache-references                                     ( +- 6.13% )  (39.98%)
       24,3915,8263      instructions         #  2.21  insn per cycle        ( +- 0.17% )  (59.98%)
       11,2115,6475      cycles                                              ( +- 0.78% )  (79.99%)
           54,1160      mem_load_uops_retired.l1_miss                        ( +- 2.85% )  (79.16%)

           0.43847 +- 0.00913 seconds time elapsed  ( +- 2.08% )
```

- *cache-misses* and *cache-references* both drop dramatically, and the execution time shrinks to roughly $\dfrac{1}{3}$ of the original!
- [ ] :question: But the *cache miss rate* jumps from about 1.9 % to about 81 %; can this be improved further? (The absolute number of cache misses actually fell by roughly 10x; the higher *rate* mostly reflects that cache-references dropped even more.)

[^first]: [What is the purpose of the RBP register in x86_64 assembler?](https://stackoverflow.com/questions/41912684/what-is-the-purpose-of-the-rbp-register-in-x86-64-assembler)
[^second]: [x86 Registers](https://www.eecg.utoronto.ca/~amza/www.mindsec.com/files/x86regs.html)
[^third]: [『 Day 25』拜託別 Pwn 我啦! - Register & Assembly](https://ithelp.ithome.com.tw/articles/10227112)