藉由 perf,應用程式可以利用 PMU (Performance Monitoring Unit), tracepoint 和核心內部的特殊計數器 (counter) 來進行統計,另外還能同時分析運行中的核心程式碼,從而更全面了解應用程式中的效能瓶頸。
perf 基本原理是對目標進行取樣,紀錄特定的條件下所偵測的事件是否發生以及發生的次數。
Perf 可取樣的事件非常多,可以分析 Hardware event,如 cpu-cycles、instructions 、cache-misses、branch-misses …等等。
Software event
或 Tracepoint event
。 知道了 cpu-cycles、instructions 我們可以了解 Instruction per cycle 是多少,進而判斷程式碼有沒有好好利用 CPU
某 CPU IP 公司的面試題目
cache-misses 可以曉得是否有善用 Locality of reference ,branch-misses 多了是否導致嚴重的 pipeline hazard?
check processes status:
journalctl -xe
對照閱讀: Fast and slow if-statements: branch prediction in modern processors
However, the problem is that modern processors don’t track history for each branching instruction separately.
編譯器提供的輔助機制: Branch Patterns, Using GCC
When writing an if-else statement, always make the "then" block more likely to be executed than the else block, so the processor can take advantage of instructions already placed in the instruction fetch buffer.
if
s that prematurely terminate the function via a return when they are unlikely to occur to the "else block", which should be the rule for error conditions.
Might be instead:
else
中,以減少 branch mis-prediction 的次數Backward branches are generated for while, do-while and for statements
The reason to make them likely is that loops are generally executed one or more times, so the branch at the end of the loop that tests for the condition is likely to be taken.
It's important to note that the compiler rearranges the while and for loops to make the test at the end. It's a good optimization that will save a branch instruction
e.g.:
The optimization version which checking loop condition on the end:
The version of loop that test at the beginning:
The number of branch instructions is not reduced. The differences are: the uncoditional branch jumps the conditional test (
.L9
) be moved after the action of the loop and change the conditional branch to jump to the end or the loop (blt+) tobge
then jump to a newly-added label,.END
.
This however adds a constraint to the compiler that prevents it from further optimizing the code. In order to put the test at the end of the loop and save one branch instruction, the compiler generates an unconditional branch before the start of the loop that will be executed only once and jumps to the condition test at the end.
The
cmpw
instruction in the previous for and while examples put its result in CR7, and the branch instruction that just follows tests for CR7. This will cause a stall (control hazard) in the execution pipeline, because the branch will have to wait for the result of cmpw to be written in CR7.
The do-while loop doesn't suffer from this problem. Without the branch to the test condition, the compiler can reschedule instructions in order to fill the space:
the compiler moved the
add 9,9,1 (count++)
instruction in betweencmpw
andthe branch
. While the addition is being computed, the result of the comparison is being saved into CR7 to be used in the branch that follows.
* Conclusion:
Try to change for and while loops to do-while loops.
continue
statements is their bodies, change those statements to an explicit goto
to the update part of the code near the end of the do-while loop.In cases where it's mandatory to test for the loop condition at the beginning, convert the loop into a do-while and add an if to test if it has to be executed the at least once.
Analysis performance of different implementations
Example 1: Binary search (number of cases < 5)
source code: switch_small_cases.c
By perf record and perf report:
Executed command:
perf record -e branch-misses:u,branch-instructions:u ./switch_small_cases 4
Result from executing perf report
:
Go deeper to Browse histograms
by press ENTER
on 7 branch-misses:u
Observe branches with source code
no
functions
in the source code be shown, use branch-misses:u
as example:
Try to increase sample rate by the command metioned on the lecture:
perf record -F 100000 -e branch-misses:u,branch-instructions:u ./switch_small_cases 4
The given frequency rate exceeded the Maximum freqency rate already, but still few samples be captured:
To increase number of exections as example code in the lecture:
Source code: switch_small_cases_v1.c
Then we can observe that the overhead of function test_0
has 33.05%:
To annote a Shared Object with its source code, compiling the source code with debugging information, i.e.,
gcc -g ${SOURCE}
; the result of annote the Symbol [.] test_0
:
Analyze the result by perf-annotate:
Reference
CS:APP
or Intel® 64 and IA-32 Architectures
Software Developer’s Manual Volume 2: Instruction Set Reference to interpret x86 assembly instructions
What's the rbp
register?
–> Base pointer or Frame pointer, it gets a snapshot of the stack pointer (rsp
), local variables and function parameters are still accessible from a constant offset from rbp
[1]
–> all registers[2]
–> The relation of [R/E]AX registers[3]:
Hot spots of branch misses are located on jg 65
instructions, which jump to the default
case.
–> Why? the input pattern to the switch-case statement is looping integer zero to three except the default case.
If the input pattern is changed to include the default case into the loop, hot spot will change to the first jump instruction:
Observe and analyze:
the distribution of branch misses is spread on different cases, but the order of assembly code is opposited to the input pattern causing most of branch misprediction on the first comparsion.
Example 2: Jump Tables
There have three kinds of implementations:
the initial one is from Branch Table on Wikipedia:
Jump table example in C
the later two examples are large-sparse
and large-dense
examples shown in this article
Links of the example codes are shown respectively:
From the article: The program is trying to get away with code that doesn't use branches at all.
But assembly codes of the three implementation don't have the address table
shown in the article and still have many branches:
comparsion of the example code, the evolution order is from left to right
goto
?goto
statement
Constraints:
1 The identifier in a goto statement shall name a label located somewhere in the enclosing function. A goto statement shall not jump from outside the scope of an identifier having a variably modified type to inside the scope of that identifier.
–> the implementation is not atable
, may be only valid in assembly code
I listed the tips from this section, and would add the implementations on the future
Avoid switches for high performance code. Changes in case values can result in non-obvious performance impacts.
- PPU and SPU:
- SPU Options in gcc
- Processor Units in PS3
Eliminate small switches. Convert to a series of if statements contained by an outer branch (when the default case is likely). Alternatively convert to a branch-free expression.
It's important to note that GCC does not have a definition for medium-size switches, only for small, large-sparse and large-dense. The definition of medium-size switches is subjective to how critical the code is and memory constraints of the program.
Convert large, sparse switches to smaller ones by pre-processing the case values (i.e. via a hash function), then manually convert to a jump table depending on the number of cases and their likelihood.
I did the comparion between default static prediction and overriding the default static prediction (the right one), GCC generates a forward branch for the if
statement is likely to be taken:
movzbl inst; the mnemonic is:
MOV –> Zero extend –> Byte (8-bit) –> to Long (32-bit)
The extra-added code in the right one shows the forward branch which jump to the likely continuation address, the jump instruction should be
jne
?
The part will perform worse is that the prediction becomes to jump to the likely continuation address?
Don't use the __builtin_expect extension to change GCC's default branch prediction, rearrange your code instead. Only use this extension to convert what would have been dynamically predicted branches to statically predicted ones, and only where the data has been analyzed and the predicability is well-known in advance.
Static prediction rules become increasingly complicated with a mix of boolean operators are used (due to C short-circuiting rules), and often the results will be unexpected.
Avoid complex boolean expressions.
To match C short-circuiting rules, in complex comparisons
When cases are not equally likely, use "and" and order by most->least likely.
When cases are equally likely, use "or". The order is not important.
Something about what a trace is.
Something about simplifying expressions and combining at the end.
For example:
The compiler has been able to reduce the number of branches from three to two by using Condition Register (CR) boolean instructions. Additionally, the entire expression is evaluated in a single basic block which can be better pipelined and both branches occur at the end.
Use the LOAD->COMPARE->COMBINE pattern.
Bitwise operation on boolean results "encourages" use of CR boolean instructions, many of which are microcoded. Using bitwise operators to generate the result produces better results.
Updated assembly code:
In addition to being free of microcoded instructions, this version is also branch-free. Because the overlap_x & overlap_y & overlap_z expression does not use boolean operators, there is no branch implied
Combine boolean results with bitwise operators.
這應該是大部分的人第一次安裝 perf 後所下的第一個命令: 印出 perf 可以觸發的event
[.]
表示其位於 User mode,[k]
則為 kernel mode。Ref.: https://developer.ibm.com/tutorials/l-analyzing-performance-perf-annotate-trs/
Annote
Introduce that how to operate and interpret the result after executing perf report
on text-based user interface (TUI):
An example report from Perf report
:
Annotate particular function:
Pressing ‘a’ on any
symbol
For example on function snooze_loop()
, assembly instructions of that function with the source code are displayed:
snooze_loop()
were recorded on the beq 90
instructionred
, green
and white
.Jump to a target location from branch/call/return instructions:
Pressing ‘Enter’ on an instruction has
arrow
before it
The meaning of arrows with different directions:
call
instructions, and a left arrow is displayed for return
instructions, respectively.
Select the
bl arch_local_irq_restore+0x8
line and press ‘Enter’. You will see disassembly ofarch_local_irq_restore()
as follows.
Also, you can press ‘q’ to go back one step.
相較於 top,使用 perf stat 往往是你已經有個要優化的目標,對這個目標進行特定或一系列的 event 檢查,進而了解該程序的效能概況。
對以下程式使用 perf stat 工具 分析 cache miss 情形:
Command:
Result on my Ubuntu server:
–repeat <n>或是-r <n> 可以重複執行 n 次該程序,並顯示每個 event 的變化區間。
cache-misses,cache-references和 instructions,cycles類似這種成對的 event,若同時出現 perf 會很貼心幫你計算比例。
1.921
,連帶影響 IPC 只有 0.65
。
perf list
如果我們善用一下存取的局部性,將 i,j對調改成array[i][j]++: