gprof

tags: `profiling` `gprof` `perf` `linux` `learning`

Summary

gprof 處理 call graph 的方式只考慮單一 edge, 而不是整條 call path, 導致其只能用估算的方式得知各 caller 佔用的 children time, 也無法正確判斷 recursive call
- e.g., 假設一程式含有以下 call path: A -> C, B -> C, 及 C -> D, 我們會想知道 function A 與 B 各自引發幾次 D, 以及佔用的時間比例?
- gprof 計算方式是, 假設每條 call path (edge) 發生機率都是 100%, 但忽略了部份 function 可能會提早結束的事實. 這個假設意即: A & B 一定會走到 C, 且若 C 沒有其他 callee, 則 call path A -> C -> D 發生次數等同於 A 呼叫 C 的次數。但這個假設是錯的, 實際上 A -> C 可能只有 50% 的機率會走到 D, 或甚至 call path A -> C -> D 可能根本不存在
gprof 只會採樣 function self-time 及呼叫次數, 其他數值都是估計出來的, 僅供參考
- 某些需要以 function call 做為 profiling 依據的場合, 就適合用 gprof。例如程式執行時間與輸入資料量不成比例時, 就可以用 gprof 檢查 function call 次數是否異常
僅適用於 CPU-bound 程式

Profiling Concept

Profiling 包括以下需求

找出最耗時的 function, 還要區分 self time 及 children time
統計 caller & callee 呼叫次數

e.g., 考慮以下 call graph, 包含以下 call path:

A -> C -> D
B -> C -> D
X1 -> C -> Y1
X2 -> C -> Y2
Z -> D

以 function C 為例, 我們會想知道:

C 的 self seconds 總合 (含所有 caller, threads 造成的 self-seconds)
C 的 child time 總合 (含 C -> D, Y1, Y2, …) (但不含 Z -> D)
C 在特定 context 下的 self seconds (e.g., 不同 caller 或參數下的表現)
C 在特定 context 下的 child time (e.g., C 被 A 呼叫時, 不會走到 Y1 或 Y2, 所以這種狀況下不用採計 Y1&Y2 的時間)

不過 gprof 僅是統計 self seconds 及呼叫次數, 估計值未必準確, 詳後述

gprof Tryout

Compile your program with option -pg, e.g., $ gcc -pg -o hello hello.c
To capture all the symbols, it's better to use the -O0 option.
Run your program, then the file gmon.out will be generated.
Use gprof to analysis gmon.out

$ gprof [-b] <exec_name> gmon.out > gprof.out

e.g.,

$ gprof [-b] hello gmon.out > gprof.out

Basics

gprof Documents
https://sourceware.org/binutils/docs/gprof/index.html#Top
old version: https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_chapter/gprof_toc.html

gprof原理
https://sourceware.org/binutils/docs/gprof/Implementation.html
https://docs.freebsd.org/44doc/psd/18.gprof/paper.pdf

The mcount routine, included in the profiling library, is responsible for recording in an in-memory call graph table both its parent routine (the child) and its parent’s parent.
Number-of-calls information for library routines is collected by using a special version of the C library
The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth.

常用參數

-b|–brief

gprof 輸出說明

https://sourceware.org/binutils/docs/gprof/Output.html

腦補: gprof 只是統計每個 function 佔用的 CPU time 及被呼叫次數 (self-seconds & calls), 以及 caller 呼叫 callee 的次數。
然後再由 caller 呼叫次數, 推算出每個 caller 佔用的時間比例
至於 caller self-seconds 在各 caller , children time 可能是由呼叫次數估算而來, 僅供參考

Flat profile

列出 self seconds 及呼叫次數。其他欄位也是由 self seconds 推算出來的

self seconds: 本函式於 profiling 期間所花費的時間總和, 不含 callee 花費的時間
self s/call: 承上, (self s/call) = self seconds / calls
- 陷阱: 未考慮參數及 caller 差異, 所以僅供參考
total s/call: 本函式執行一次所花費的時間 (含 callee 花費的時間), 概念與 perf 的 child overhead 相同。
- "This is the only field in the flat profile that uses call graph analysis. "
- total seconds == self seconds + all callee's self seconds introduced by this function
- 陷阱: 未考慮不同 call path 及 callee 參數的差異, 所以僅供參考
cumulative seconds:　本列(含)以上所有函式的 self seconds 總和, 僅表示累計程式整體執行時間, 沒什麼參考價值 (所有 functions 的 self seconds 總和當然就是程式總體時間)
- 本欄位概念與 perf 的 child overhead 不同: perf 是加總 self-time 與 callee time, 但 gprof flat profile 每一列之間不見得有 caller-callee 關係

至於表格第一行是 Sampling period (SIGPROF period)

e.g., "Each sample counts as 0.01 seconds", 表示 sampling rate = 100Hz

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 25.18      0.75     0.75 69646433     0.00     0.00  dot_product
 16.56      1.24     0.49 56956357     0.00     0.00  subtract_vector
 10.82      1.56     0.32 31410180     0.00     0.00  multiply_vector
  9.13      1.83     0.27 13861875     0.00     0.00  raySphereIntersection
  8.45      2.08     0.25 13861875     0.00     0.00  rayRectangularIntersection
  ...
  1.01      2.89     0.03        1     0.03     2.95  raytracing

程式總執行時間: 2.89 sec

Call graph

本功能顯示一個函式的 self time 是由哪個 caller 引起 (一個函式可能有數個caller), 以及各 callee 花費的時間, 可用來了解 code path 整體效能

方括號索引開頭者為 primary line。Primary line 上面是 caller, 下面是 callee

index % time    self  children    called     name
                0.12    1.77 4620625/4620625     ray_color [3]
[4]     63.9    0.12    1.77 4620625         ray_hit_object [4]
                0.25    0.71 13861875/13861875     rayRectangularIntersection [5]
                0.27    0.45 13861875/13861875     raySphereIntersection [7]
                0.05    0.00 4620625/31410180     multiply_vector [10]
                0.04    0.00 4620625/17836094     add_vector [13]  // notice this line!
-----------------------------------------------
index % time    self  children    called     name
                0.00    0.00  621866/17836094     raySphereIntersection [7]
                0.01    0.00  686738/17836094     ray_color [3]
                0.01    0.00 1187811/17836094     rayRectangularIntersection [5]
                0.01    0.00 1241598/17836094     reflection [17]
                0.02    0.00 3145728/17836094     rayConstruction [14]
                0.04    0.00 4620625/17836094     ray_hit_object [4]  // notice this line!
                0.05    0.00 6331728/17836094     localColor [12]
[13]     4.7    0.14    0.00 17836094         add_vector [13]
-----------------------------------------------
index % time    self  children    called     name
                0.27    0.45 13861875/13861875     ray_hit_object [4]
[7]     24.5    0.27    0.45 13861875         raySphereIntersection [7]
                0.30    0.00 28345616/69646433     dot_product [6]
                0.12    0.00 14483741/56956357     subtract_vector [8]
                0.01    0.00  621866/10598450     normalize [11]
                0.01    0.00  933714/31410180     multiply_vector [10]
                0.00    0.00  621866/17836094     add_vector [13]
-----------------------------------------------
index % time    self  children    called     name
                0.25    0.71 13861875/13861875     ray_hit_object [4]
[5]     32.6    0.25    0.71 13861875         rayRectangularIntersection [5]
                0.32    0.00 36717187/56956357     subtract_vector [8]
                0.31    0.00 28627374/69646433     dot_product [6]
                0.07    0.00 17821807/17821809     cross_product [15]
                0.01    0.00 1187811/31410180     multiply_vector [10]
                0.01    0.00 1187811/17836094     add_vector [13]

ray_hit_object self-second + children time 佔總體時間 65.1%, 算是滿耗時的 function

先分析 ray_hit_object self seconds, 抓上游原因 (caller 呼叫次數)

self seconds == 0.12 sec, 佔程式整體執行時間 0.04% (0.12/2.89)
ray_hit_object 總共被呼叫 4620625 次, 全部來自於 ray_color (ray_color 該列的 self-seconds 顯示 0.12), 所以暫時先不用往上追

再來看 ray_hit_object callees, 查看最耗時的 callee

children time == 1.77 sec (僅供參考)
注意, children time 是所有 reachable nodes 的 self seconds 總和, 意即相鄰節點的 self seconds (0.25 + 0.27 + 0.05 + 0.04), 加上相鄰節點的 children time (raySphereIntersection 0.45 + rayRectangularIntersection 0.71)
當然, 這也是用各別函式的 self seconds 以及呼叫次數推算而得
flat view 的 total seconds 還包含 self seconds。

再來看 ray_hit_object 的 callee add_vector, 以及 add_vector 的其他 callers

add_vector 總共被呼叫 17836094 次, total self-seconds == 0.14 sec
其中 4620625 次來自於 ray_hit_object, 大約佔其 self-seconds 0.04 sec (4620625 / 17836094 * 0.14)
注意 graph 之間的關聯: 一個 edge 會顯示於 caller & callee's graph。以 ray_hit_object -> add_vector 為例:
- in ray_hit_object's graph: 0.04 0.00 4620625/17836094 add_vector [13]
- in add_vector's graph: 0.04 0.00 4620625/17836094 ray_hit_object [4]

Cycle view

Recursive functions 無法精確分割 children time, 例如 A -> B -> A, 則 A 的 children time 是否該包含 A 本身?
gprof 的做法是把 recursive functions 視為一個 virtual function (directed graph with cycle), 其 children time 為 graph 以外的 functions

注意

函式構成 cycle 與否以 runtime call path 為準, 如果兩個 function 在執行期間沒有互相呼叫 (沒有 edge), 就不會構成 cycle
注意 cycle 僅表示兩個 node 之間存在雙向 edge (bidirected edge)，但不表示程式執行期間存在 cycle path (recursive call)。這是 gprof 計量方式的缺點: gprof 沒有把整個 call path 視為一體處理。
例如以下程式包含兩條 call path: main -> f1 -> f2 及 main -> f2 -> f1, 但沒有 f1 -> f2 -> f1 或 f1 -> f2 -> f1, gprof 仍把 f1 & f2 視為 cycle 的一部份

#include <stdio.h>

int f2(int x);

int f1(int x) {
	if (!x)
		return 1;
	return f2(--x);
}

int f2(int x) {
	if (!x)
		return 1;
	return f1(--x);
}

int main() {
	f1(1);
	f2(1);
    return 0;
}

承上, f1 <-> f2 <-> f3 也會構成 cycle

gprof 缺點

Stackoverflow: Alternatives to gprof
Stackoverflow: gprof vs. cachegrind profiles
gprof原理与缺陷
 Linux性能评测工具之一：gprof篇

不適用於執行時間變異量過大的 functions
gprof 假設函式每次執行時間及呼叫路徑等成本皆相同, 實際上一個 function 的 self-time 或 total-time 可能因執行環境或參數而異 (包括 caller 傳入的參數, 或 global variable)。這種平均計算方式可能會稀釋 hot-spot 的真實執行時間，導致 user 找不出 hot-spot
- e.g., 不同 caller 傳入不同參數: f1() -> f3() vs. f2() -> f3()
  兩條 call path 的 f3() self-time 可能不同, 但 flat profile 的 self s/call 是假設任何狀況下的 self seconds 皆相同, call graph 則是以呼叫次數估算 self-seconds 來源, 兩者皆無法反應不同 caller 與參數的差異
- e.g., callee 執行時間的差異: f3() -> f4() vs. f3() -> f5()
  兩條 call path 的 f3() total time 不同 (一個走到 f4, 一個走到 f5), 但 flat profile 假設 f4(), f5() 的執行時間都一樣
- 承上, 因為只記錄一層 parent/child 呼叫關係, 所以無法得知不同 caller 對 children time 的影響
  考慮以下 callgraph: f1() -> f3() -> f4() f2() -> f3() -> f5()
  gprof 無法判斷 f4 & f5 是由 f1 or f2 引發, 導致 f3 的 children time 被平均分配到 f1 & f2
- 就算只考慮單一 call path f1() -> f3(), 實際程式可能因參數不同導致 f1() & f3() self-time / children time 的差異
無法處理 inline function or macro
不適用於 recursive call, multi-thread program, daemon program
Dynamic linked library 也必須以 -pg 重新編譯
- Ref: gprof 與 shared library 的問題
- Compiling a Program for Profiling
無法得知 CPU 在 kernel-mode 耗費時間 (syscalls, waiting for CPU or I/O waiting)
- GNU文件: The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth.
- "no measurements are taken while the program is not running."
  http://stackoverflow.com/questions/2803930/get-gprof-to-profile-based-on-wall-clock-time
  http://www.ibm.com/developerworks/library/l-gnuprof.html
Clock interrupt 可能造成 system call 被中斷 (EINTR) 及程式執行異常
https://sourceware.org/binutils/docs/gprof/Implementation.html
ftp://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_chapter/gprof_9.html

Example: 估算錯誤的 caller self-seconds 及 callee self-seconds per call

f1 的兩個 caller 會引發不同的 self-seconds

#include <stdio.h>

int f1(int arg) {
	switch (arg) {
	case 1:
		for (int i = 0; i < 100000000; i++)
			;
	default:
		for (int i = 0; i < 300000000; i++)
			;
	}
	return 1;
}

int f2(int arg) {
	f1(arg);
	return 1;
}

int f3(int arg) {
	f1(arg);
	return 1;
}

int main(int argc, char *argv[]) {
	f2(1);
	f3(2);
	return 0;
}

Flat profile: gprof 無法辨識 caller 傳入參數的差異, f1() self-seconds per call 僅為平均值

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
100.51      2.49     2.49        2     1.25     1.25  f1
  0.00      2.49     0.00        1     0.00     1.25  f2
  0.00      2.49     0.00        1     0.00     1.25  f3

Call graph: gprof 無法辨識 caller 傳入參數的差異, 因此 f1 的 self-seconds 被均分至 f2 & f3

index % time    self  children    called     name
                1.25    0.00       1/2           f2 [3]
                1.25    0.00       1/2           f3 [4]
[1]    100.0    2.49    0.00       2         f1 [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    2.49                 main [2]
                0.00    1.25       1/1           f3 [4]
                0.00    1.25       1/1           f2 [3]
-----------------------------------------------
                0.00    1.25       1/1           main [2]
[3]     50.0    0.00    1.25       1         f2 [3]
                1.25    0.00       1/2           f1 [1]
-----------------------------------------------
                0.00    1.25       1/1           main [2]
[4]     50.0    0.00    1.25       1         f3 [4]
                1.25    0.00       1/2           f1 [1]
-----------------------------------------------

gprof 由 f1 呼叫次數推斷 f2 和 f3 的 children time 都一樣, 這就是悲劇的開始。這個錯誤假設會一路 propagate 到所有的 caller, 例如 f2 和 f3 的 call graph 和 total s/call 都是錯的, 導致我們看不出 f3 才是真正花時間的 function

Example: 估算錯誤的 caller children time

f3() 的兩個 caller 會引發不同的 children time

#include <stdio.h>

int f4() {
	int i;
	for (i = 0; i < 100000000; i++)
		;
	return 1;
}

int f5() {
	int i;
	for (i = 0; i < 400000000; i++)
		;
	return 1;
}

int f3(int arg) {
	switch (arg) {
	case 1:
		f4();
	default:
		f5();
	}
	return 1;
}

int f1() {
	return f3(1);
}

int f2() {
	return f3(0);
}

int main(int argc, char *argv[]) {
	f1();
	f2();
	return 0;
}

如同上述, gprof 無法區分 f4, f5 是由 f1 或 f2 引發
gprof 假設 f3 每次執行成本都一樣, 所以 f3 children time 被均分至 f1 & f2

index % time    self  children    called     name
                0.00    0.80       1/2           f1 [4]
                0.00    0.80       1/2           f2 [5]
[1]    100.0    0.00    1.60       2         f3 [1]
                1.38    0.00       2/2           f5 [3]
                0.22    0.00       1/1           f4 [6]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    1.60                 main [2]
                0.00    0.80       1/1           f1 [4]
                0.00    0.80       1/1           f2 [5]
-----------------------------------------------
                1.38    0.00       2/2           f3 [1]
[3]     86.1    1.38    0.00       2         f5 [3]
-----------------------------------------------
                0.00    0.80       1/1           main [2]
[4]     50.0    0.00    0.80       1         f1 [4]
                0.00    0.80       1/2           f3 [1]
-----------------------------------------------
                0.00    0.80       1/1           main [2]
[5]     50.0    0.00    0.80       1         f2 [5]
                0.00    0.80       1/2           f3 [1]
-----------------------------------------------
                0.22    0.00       1/1           f3 [1]
[6]     13.9    0.22    0.00       1         f4 [6]

最糟的例子是 f1 根本不會走到 f4/f5, 但 gprof 仍誤把 children time 累加至所有 caller, 導致某些 caller 得到不正確的 total time, 實際上這些函式不是 hot spot

Case Studies

Profiling 重點

total seconds == self seconds + all the callee's self seconds
用 flat profile 找出最花時間的 function A
用 call graph 找出 function A 的 caller functions B, C, D, 釐清整個 call path
用 call graph 找出 caller functions 有無其他較耗時的 callees

Example: 2016q3 hw2 raytracing

在 flat profile 裡, multiply_vector() 被呼叫　31410180 次 self second = 0.93 秒, 但我們想知道進一步知道這 0.93 秒是由哪個 caller 引起的
由 Call graph 可得知,　localColor() 呼叫 multiply_vector() 8442304 次, 佔了 0.93 秒當中的 0.25 秒 (大約等於 0.93 * 8442304 / 31410180)
往下看 localColor() 的 call graph: 呼叫 callees 共花費 0.79s (包括 multiply_vector() 的 0.25s)

Flat profile:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
  7.26      6.88     0.93 31410180     0.00     0.00  multiply_vector
  ...
  0.86     12.43     0.11  2110576     0.00     0.00  localColor

Call graph:

index % time    self  children    called     name
                0.01    0.00  226162/31410180     refraction [18]
                0.03    0.00  933714/31410180     raySphereIntersection [7]
                0.04    0.00 1187811/31410180     rayRectangularIntersection [5]
                0.04    0.00 1241598/31410180     reflection [17]
                0.09    0.00 3145728/31410180     rayConstruction [15]
                0.09    0.00 3169934/31410180     ray_color [3]
                0.14    0.00 4620625/31410180     ray_hit_object [4]
                0.25    0.00 8442304/31410180     localColor [13]
                0.25    0.00 8442304/31410180     compute_specular_diffuse [9]
[10]     7.3    0.93    0.00 31410180         multiply_vector [10]

index % time    self  children    called     name
                0.11    0.79 2110576/2110576     ray_color [3]
[13]     7.0    0.11    0.79 2110576         localColor [13]
                0.31    0.00 6331728/17836094     add_vector [14]
                0.25    0.00 8442304/31410180     multiply_vector [10]
                0.23    0.00 4221152/4221152     multiply_vectors [16]

References

https://sourceware.org/binutils/docs/gprof/index.html

https://users.cs.duke.edu/~ola/courses/programming/gprof.html

SIGPROF
https://blog.csdn.net/wallwind/article/details/50393546
http://www.berlinix.com/dev/gprof.php