
Notes on perf: Principles and Practice

With perf, an application can use the PMU (Performance Monitoring Unit), tracepoints, and special in-kernel counters to gather statistics, and it can also profile running kernel code at the same time, giving a more complete picture of the application's performance bottlenecks.

The basic principle of perf is sampling the target: under given conditions, it records whether the monitored events occur and how many times they occur.

  • For example, sampling can be driven by the tick interrupt: a sample point is triggered inside the tick interrupt, and at each sample point the current context of the process is recorded. If a process spends 90% of its time in the function foo(), then about 90% of the sample points should fall inside foo().

perf can sample a wide variety of events. It can analyze hardware events such as cpu-cycles, instructions, cache-misses, branch-misses, and so on.

  • Other event types include Software event and Tracepoint event

:star: Given cpu-cycles and instructions, we can compute the instructions per cycle (IPC) and judge whether the code makes good use of the CPU

An interview question from a certain CPU IP company

cache-misses tells us whether the code makes good use of locality of reference; do frequent branch-misses cause serious pipeline hazards?

Installation

:notes: check process status: journalctl -xe

Examples

Background

Hardware characteristics: branch prediction

Companion reading: Fast and slow if-statements: branch prediction in modern processors
:information_source: However, the problem is that modern processors don’t track history for each branching instruction separately.

Branch Target Buffer

Compiler-assisted mechanism: Branch Patterns, Using GCC

Rule #1: Forward branches are not likely to be taken

  • Forward branches are generated for if statements.
  • The rationale for making them not likely to be taken is that the processor can take advantage of the fact that instructions following the branch instruction may already be placed in the instruction buffer inside the Instruction Unit.

    When writing an if-else statement, always make the "then" block more likely to be executed than the else block, so the processor can take advantage of instructions already placed in the instruction fetch buffer.

  • Also, when an if that prematurely terminates the function via a return is unlikely to be taken (which should be the rule for error conditions), move that return into the else block.

    ptr = malloc(sizeof(some_structure));
    if (ptr == NULL)
    {
        return -1;
    }
    /* malloc successful, continue execution. */

    Might be instead:

    ptr = malloc(sizeof(some_structure));
    if (ptr != NULL)
    {
        /* malloc successful, continue execution. */
    }
    else
    {
        return -1;
    }
    
    :notes: Put rarely executed code (error paths) into the else block to reduce the number of branch mispredictions

Rule #2: Backward branches are likely to be taken

  • Backward branches are generated for while, do-while and for statements

  • The reason to make them likely is that loops are generally executed one or more times, so the branch at the end of the loop that tests for the condition is likely to be taken.

    It's important to note that the compiler rearranges while and for loops to place the test at the end. It's a good optimization that saves a branch instruction per iteration.

    e.g.:

    int while_test( int a, int b, int c )
    {
      int count = 0;  /* initialized; the original snippet left this uninitialized */

      while ( c < a )
      {
        c += b;
        a--;
        count++;
      }

      return (count);
    }
    

    The optimized version, which checks the loop condition at the end:

    while_test:
        stwu 1,-16(1) # allocate stack
        b .L9         # jump to condition test
    .L11:
        add 5,5,4     # c += b
        addi 3,3,-1   # a--
        addi 9,9,1    # count++
    .L9:
        cmpw 7,5,3    # check c < a
        blt+ 7,.L11   # iterate if true
        mr 3,9        # r3 = count
        addi 1,1,16   # restore stack
        blr           # return
    

    The version of the loop that tests at the beginning:

    while_test:
        stwu 1,-16(1) # allocate stack
    .L9:
        cmpw 7,5,3    # check c < a
        bge+ 7,.END   # exit the loop if c >= a
    .L11:
        add 5,5,4     # c += b
        addi 3,3,-1   # a--
        addi 9,9,1    # count++
        b .L9         # the extra unconditional branch, taken every iteration
    .END:
        mr 3,9        # r3 = count
        addi 1,1,16   # restore stack
        blr           # return
    

    :warning: The number of branch instructions is not reduced. The differences: the initial unconditional branch to the condition test is gone, the conditional test (.L9) is moved before the loop body, the conditional branch is changed from blt+ to bge (jumping to a newly added label, .END), and an unconditional branch back to .L9 now executes on every iteration.

    Testing at the beginning, however, constrains the compiler and prevents further optimization. To put the test at the end of the loop and save one branch per iteration, the compiler instead generates an unconditional branch before the start of the loop that is executed only once and jumps to the condition test at the end:

    while_test:
        stwu 1,-16(1) # allocate stack
        b .L9         # jump to condition test
    .L11:
        ...
    .L9:
        cmpw 7,5,3    # check c < a
        blt+ 7,.L11   # iterate if true
        ...
    

    :information_source: The cmpw instruction in the previous for and while examples put its result in CR7, and the branch instruction that just follows tests for CR7. This will cause a stall (control hazard) in the execution pipeline, because the branch will have to wait for the result of cmpw to be written in CR7.

    :+1: The do-while loop doesn't suffer from this problem. Without the branch to the test condition, the compiler can reschedule instructions in order to fill the space:

    do_while_test:
        stwu 1,-16(1) # allocate stack
    .L13:
        add 5,5,4     # c += b
        addi 3,3,-1   # a--
        cmpw 7,5,3    # check c < a
        addi 9,9,1    # count++
        blt+ 7,.L13   # iterate if true
        ...
    

    The compiler moved the addi 9,9,1 (count++) instruction in between cmpw and the branch. While the addition is being computed, the result of the comparison is saved into CR7, ready for the branch that follows.

    Conclusion:

    • Try to change for and while loops to do-while loops.

      • When converting for loops that have continue statements in their bodies, change those statements to an explicit goto to the update part of the code near the end of the do-while loop.
    • In cases where it's mandatory to test the loop condition at the beginning, convert the loop into a do-while and add an if to test whether it has to be executed at least once.

      void
      print_list(const list_t *list)
      {
          while (list != NULL)
          {
              printf("%d\n", list->value);
              list = list->next;
          }
      }

      void
      print_list(const list_t *list)
      {
          if (list != NULL) // the added if
          {
              do
              {
                  printf("%d\n", list->value);
                  list = list->next;
              } while (list != NULL);
          }
      }
      

Rule #3: Switch statements

  • Performance analysis of different implementations

  • Example 1: Binary search (number of cases < 5)

    source code: switch_small_cases.c

    By perf record and perf report:

    • Executed command:
      perf record -e branch-misses:u,branch-instructions:u ./switch_small_cases 4

    • Result from executing perf report:

      ​​​​​Available samples
      ​​​​​7 branch-misses:u
      ​​​​​9 branch-instructions:u
      ​​​​​...
      ​​​​​
      ​​​​​ESC: exit, ENTER|->: Browse histograms
      
    • Go deeper into Browse histograms by pressing ENTER on 7 branch-misses:u

    • Observe branches alongside the source code
      :question: no functions from our source code are shown; using branch-misses:u as an example:

      ​​​​​Samples: 7  of event 'branch-misses:u', Event count (approx.): 2785
      ​​​​​Overhead  Command          Shared Object         Symbol
      ​​​​​  52.60%  switch_small_ca  ld-linux-x86-64.so.2  [.] _dl_relocate_object
      ​​​​​  33.61%  switch_small_ca  ld-linux-x86-64.so.2  [.] __tunable_get_val
      ​​​​​  11.20%  switch_small_ca  ld-linux-x86-64.so.2  [.] __GI___tunables_init
      ​​​​​   2.59%  switch_small_ca  ld-linux-x86-64.so.2  [.] _dl_start
      

      :bulb: Try to increase the sample rate with the command mentioned in the lecture:
      perf record -F 100000 -e branch-misses:u,branch-instructions:u ./switch_small_cases 4

      The given frequency exceeds the maximum frequency rate, but still only a few samples were captured:

      ​​​​​warning: Maximum frequency rate (2,3750 Hz) exceeded, throttling from 10,0000 Hz to 2,3750 Hz.
      ​​​​​  The limit can be raised via /proc/sys/kernel/perf_event_max_sample_rate.
      ​​​​​  The kernel will lower it when perf's interrupts take too long.
      ​​​​​  Use --strict-freq to disable this throttling, refusing to record.
      ​​​​​The input value is: 4.
      ​​​​​Corrently no mapped function.
      ​​​​​[ perf record: Woken up 1 times to write data ]
      ​​​​​[ perf record: Captured and wrote 0.003 MB perf.data (29 samples) ]
      

      :bulb: To increase the number of executions, as in the example code from the lecture:

      Source code: switch_small_cases_v1.c

      ​​​​​...
      ​​​​​#define N 5000000
      ​​​​​...
      ​​​​​int main (int argc, char *argv[]) {
      ​​​​​    int i;
      ​​​​​
      ​​​​​    for (i = 0; i < N; i++) test_0(i%4);
      ​​​​​
      ​​​​​    return EXIT_SUCCESS; 
      ​​​​​}
      

      Then we can observe that the overhead of the function test_0 is 33.05%:

      ​​​​​Samples: 7  of event 'branch-misses:u', Event count (approx.): 2787
      ​​​​​Overhead  Command          Shared Object         Symbol 
      ​​​​​40.80%    switch_small_ca  ld-linux-x86-64.so.2  [.] _dl_name_match_p
      ​​​​​33.05%    switch_small_ca  switch_small_cases    [.] test_0
      ​​​​​20.70%    switch_small_ca  ld-linux-x86-64.so.2  [.] __GI___tunables_init
      ​​​​​ 5.45%    switch_small_ca  ld-linux-x86-64.so.2  [.] _dl_start
      

      :star: To annotate a Shared Object with its source code, compile the source code with debugging information, i.e., gcc -g ${SOURCE}; the result of annotating the symbol [.] test_0:

      Analyze the result with perf annotate:
      :notes: Reference CS:APP or Intel® 64 and IA-32 Architectures
      Software Developer’s Manual Volume 2: Instruction Set Reference
      to interpret x86 assembly instructions
      What's the rbp register?
      > The base pointer or frame pointer; it takes a snapshot of the stack pointer (rsp), so local variables and function parameters remain accessible at a constant offset from rbp[1]
      > all registers[2]
      > The relation of the [R/E]AX registers[3]:

      Hot spots of branch misses are located on the jg 65 instruction, which jumps to the default case.
      > Why? The input pattern to the switch-case statement loops over the integers zero to three and never hits the default case.

      If the input pattern is changed to include the default case in the loop, the hot spot moves to the first jump instruction:

      ​​​​​int main (int argc, char *argv[]) {
      ​​​​​    ... 
      ​​​​​    for (i = 0; i < N; i++) test_0(i%5); // from 4 to 5
      ​​​​​    ...
      ​​​​​}
      

      Observe and analyze:
      :question: The distribution of branch misses spreads across the different cases, but the order of the assembly code is opposite to the input pattern, so most of the branch mispredictions land on the first comparison.

  • Example 2: Jump Tables
    There are three kinds of implementations:

    :information_source: From the article: The program is trying to get away with code that doesn't use branches at all.

    But the assembly code of the three implementations doesn't contain the address table shown in the article and still has many branches:

    • expected assembly code
    ​​.L19:                  # address table
    ​​.word .L13
    ​​.word .L14
    ​​.word .L15
    ​​.word .L16
    ​​.word .L17
    ​​.word .L18
    
    • comparison of the example code; the evolution order is from left to right

      • (TODO) the above example code should be implemented with assembly code like in this article
        • by goto?
          6.8.6.1 The goto statement

          Constraints:
          1 The identifier in a goto statement shall name a label located somewhere in the enclosing function. A goto statement shall not jump from outside the scope of an identifier having a variably modified type to inside the scope of that identifier.
          > the implementation is not a table; it may only be valid in assembly code

:accept: I listed the tips from this section and will add the implementations in the future

Avoid switches for high performance code. Changes in case values can result in non-obvious performance impacts.

  • When the switch is small - number of case values less than five in the case of both the PPU and SPU - the number of comparisons GCC generates is comparable to a series of if statements.
  • GCC cannot take into account the likelihood of each case being taken
    > :star: In this case, manually converting the switch to a series of ifs, ordering the comparisons from most likely to least likely, can have a big impact in performance.

Eliminate small switches. Convert to a series of if statements contained by an outer branch (when the default case is likely). Alternatively convert to a branch-free expression.

  • Sometimes GCC generates a binary search for medium-size switches that would perform better with jump tables.
    > :star: In those cases, performance can improve by converting the switch statement to a series of labels and using an array of addresses (GCC's &&label_name syntax).

    It's important to note that GCC does not have a definition for medium-size switches, only for small, large-sparse and large-dense. The definition of medium-size switches is subjective to how critical the code is and memory constraints of the program.

Convert large, sparse switches to smaller ones by pre-processing the case values (i.e. via a hash function), then manually convert to a jump table depending on the number of cases and their likelihood.

Tip #1: Don't change GCC's default static prediction

  • The default static prediction generated by GCC can be overridden using the __builtin_expect compiler extension.
  • Overriding can perform worse than letting the fall-through path be the likely continuation address; it is preferable not to change GCC's default prediction and instead rearrange conditional and loop statements to benefit from it.
    • I compared the default static prediction with overriding it (the right-hand version); GCC generates a forward branch when the if statement is likely to be taken:


      :notes: movzbl inst; the mnemonic is:
      MOV > Zero extend > Byte (8-bit) > to Long (32-bit)

      :question: The extra code in the right-hand version shows the forward branch that jumps to the likely continuation address; shouldn't the jump instruction be jne?

      :question: Is the part that performs worse the fact that the prediction becomes a jump to the likely continuation address?

Don't use the __builtin_expect extension to change GCC's default branch prediction, rearrange your code instead. Only use this extension to convert what would have been dynamically predicted branches to statically predicted ones, and only where the data has been analyzed and the predicability is well-known in advance.

Tip #2: Boolean Operators Affect Static Branch Prediction

Static prediction rules become increasingly complicated when a mix of boolean operators is used (due to C's short-circuiting rules), and the results are often unexpected.

Avoid complex boolean expressions.

To match C short-circuiting rules, in complex comparisons

  • "and" implies "unlikely"
  • "or" implies "likely"

When cases are not equally likely, use "and" and order by most->least likely.

When cases are equally likely, use "or". The order is not important.

Tip #3: Simple traces (not shorter traces) will almost always generate better code in GCC

Something about what a trace is.
Something about simplifying expressions and combining at the end.

For example:

int
box_test_overlap_combine_1( const Box* restrict const a, const Box* restrict const b )
{
  float acx       = a->center.e[0];
  float acy       = a->center.e[1];
  float acz       = a->center.e[2];
  float aex       = a->extent.e[0];
  float aey       = a->extent.e[1];
  float aez       = a->extent.e[2];
  float bcx       = b->center.e[0];
  float bcy       = b->center.e[1];
  float bcz       = b->center.e[2];
  float bex       = b->extent.e[0];
  float bey       = b->extent.e[1];
  float bez       = b->extent.e[2];
  float adiff_cx  = fabs( acx-bcx );
  float adiff_cy  = fabs( acy-bcy );
  float adiff_cz  = fabs( acz-bcz );
  float sum_ex    = aex + bex;
  float sum_ey    = aey + bey;
  float sum_ez    = aez + bez;
  int   overlap_x = adiff_cx <= sum_ex;
  int   overlap_y = adiff_cy <= sum_ey;
  int   overlap_z = adiff_cz <= sum_ez;
  int   overlap   = overlap_x && overlap_y && overlap_z; 

  return (overlap);
} 
box_test_overlap_combine_1:
    ...
    and.   0,11,9
    mfcr   0
    rlwinm 0,0,31,1
    cmpwi  7,0,0
    beq-   0,.L11
    beq-   7,.L11
    li     10,1
.L11:
    mr     3,10
    addi   1,1,16
    blr

The compiler has been able to reduce the number of branches from three to two by using Condition Register (CR) boolean instructions. Additionally, the entire expression is evaluated in a single basic block which can be better pipelined and both branches occur at the end.

Use the LOAD->COMPARE->COMBINE pattern.

Tip #4: Bitwise operators generate fewer branches and microcoded instructions than logical operators.

Bitwise operation on boolean results "encourages" use of CR boolean instructions, many of which are microcoded. Using bitwise operators to generate the result produces better results.

Updated assembly code:

box_test_overlap_combine_2:
  ...
  and    3,4,0 
  mfcr   9  
  rlwinm 9,9,31,1
  and    3,3,9 
  blr  

In addition to being free of microcoded instructions, this version is also branch-free. Because the overlap_x & overlap_y & overlap_z expression does not use boolean operators, no branch is implied.

Combine boolean results with bitwise operators.

Basic usage of perf

Perf list

This is probably the first command most people run after installing perf: it lists the events that perf can trigger.

Perf top

  • If [.] is shown next to a symbol, it runs in user mode; [k] means kernel mode.

(Further reading) Analyzing performance with perf annotate

Ref.: https://developer.ibm.com/tutorials/l-analyzing-performance-perf-annotate-trs/

Annotate
How to navigate the text-based user interface (TUI) and interpret the result after executing perf report:

  • An example report from Perf report:

  • Annotate particular function:

    :key: Pressing ‘a’ on any symbol


    For example, on the function snooze_loop(), the assembly instructions of that function are displayed along with the source code:

    • Numbers on the left side of the bar indicate the percentage of total samples that are recorded against that particular instruction.
      on this example,
      • 40% samples of snooze_loop() were recorded on the beq 90 instruction
      • different colors based on how hot the instruction is; from hottest to coolest are red, green and white.
  • Jump to a target location from branch/call/return instructions:

    :key: Pressing ‘Enter’ on an instruction that has an arrow before it
    The meanings of the arrows' directions:

    • Up/down arrows show the direction to the target instruction of a branch instruction
    • a right arrow is displayed for call instructions, and a left arrow for return instructions

    Select the bl arch_local_irq_restore+0x8 line and press ‘Enter’. You will see disassembly of arch_local_irq_restore() as follows.

    Also, you can press ‘q’ to go back one step.

Perf stat

Compared with perf top, perf stat is typically used when you already have a target to optimize: you examine a specific event or a series of events on that target to build a performance profile of the program.

Use the perf stat tool to analyze the cache misses of the following program:

static char array[10000][10000];

int main(void)
{
  int i, j;
  for (i = 0; i < 10000; i++)
    for (j = 0; j < 10000; j++)
      array[j][i]++;
  return 0;
}

Command:

$ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./perf_stat_cache_miss

Result on my Ubuntu server:

[18:36] asahsieh@asahsieh-MacBookAir:~/Documents/Courses/Linux-Kernel-Design/linux-kernel-internals/2022q1/week1/perf/cache-miss-analysis$ sudo perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./cache-miss-analysis

 Performance counter stats for './cache-miss-analysis' (5 runs):

          252,5511      cache-misses              #    1.921 % of all cache refs      ( +-  6.32% )
       1,3150,1467      cache-references                                              ( +-  0.02% )
      24,4419,6751      instructions              #    0.68  insn per cycle           ( +-  0.03% )
      35,6095,9809      cycles                                                        ( +-  0.34% )

            1.4891 +- 0.0170 seconds time elapsed  ( +-  1.14% )

--repeat <n> (or -r <n>) runs the program n times and reports the variation interval of each event.
For paired events such as cache-misses,cache-references and instructions,cycles, perf conveniently computes the corresponding ratio when both are present.

  • From this perf stat run we can observe that the program's cache miss rate is 1.921%, which drags the IPC down to only 0.68
    • What are the individual miss rates of the L1/L2 caches?
      • the corresponding performance counters can be found with perf list
      • e.g., mem_load_uops_retired.l1_miss
        [Retired load uops misses in L1 cache as data sources Spec update: HSM30. Supports address when precise (Precise event)]

:notes: If we exploit locality of reference and swap i and j so the access becomes array[i][j]++:

  • the result of perf stat:

    [00:56] asahsieh@asahsieh-MacBookAir:~/Documents/Courses/Linux-Kernel-Design/linux-kernel-internals/2022q1/week1/perf/cache-miss-analysis$ sudo perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles,mem_load_uops_retired.l1_miss ./cache-miss-analysis

     Performance counter stats for './cache-miss-analysis' (5 runs):

           26,0702      cache-misses              #   81.012 % of all cache refs      ( +-  4.84% )  (39.98%)
           36,2588      cache-references                                              ( +-  6.13% )  (39.98%)
      24,3915,8263      instructions              #    2.21  insn per cycle           ( +-  0.17% )  (59.98%)
      11,2115,6475      cycles                                                        ( +-  0.78% )  (79.99%)
           54,1160      mem_load_uops_retired.l1_miss                                     ( +-  2.85% )  (79.16%)

           0.43847 +- 0.00913 seconds time elapsed  ( +-  2.08% )
    
    • cache-misses and cache-references dropped sharply, and the elapsed time fell to about 1/3 of the original (0.44 s vs. 1.49 s)!
      • :question: the cache miss rate rose from about 1.9% to about 81%; can this be improved further?

  1. What is the purpose of the RBP register in x86_64 assembler? ↩︎

  2. x86 Registers ↩︎

  3. 『 Day 25』拜託別 Pwn 我啦! - Register & Assembly ↩︎