contributed by <nekoneko>
sys2016
nekoneko
homework
ChenYi
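Compiling with -pg inserts a call to mcount at the entry of every function.
The -pg option can also be given when compiling and linking in a single step:
cc -o myprog myprog.c utils.c -g -pg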
When linking with ld (the linker) directly, you must specify the profiling startup file gcrt0.o, use the profiling C library libc_p.a, and change the link option to -lc_p:
ld -o myprog /lib/gcrt0.o myprog.o utils.o -lc_p
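On a system that supports shared libraries, you may "run into problems with the profiling support code in a shared library being called before that library has been fully initialised." The fix is to link statically against the library that contains the profiling support code:
gcc -g -pg -static-libgcc myprog.c utils.c -o myprog
I do not understand why the documentation says this; I am noting it here for future reference. (cheng hung lin)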
-pg
returning by main or calling exit
Unfortunately, the appearance of a human-readable bb.out means the basic-block counts didn't get written into gmon.out.
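-Q, --no-graph: do not print the call graph
-q, --graph: print the call graph
-p, --flat-profile: print the flat profile
-A: (Ref)
The call graph shows the call relationships of each function: which functions it calls and which functions call it.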
Following Chen Yi's shared notes, I used gprof2dot here.
$ sudo apt-get install python3-pip
$ pip3 install gprof2dot
Neither installing via the script nor via pip gave me a gprof2dot that runs. (cheng hung lin)
Exception: using gprof2dot.py as a module is unsupported
$ make clean
$ make PROFILE=1
$ ./raytracing
$ gprof raytracing | gprof2dot | dot -Tpng -o output.png
For why PROFILE is set, see lines 11 to 15 of the Makefile.
Results
Output: 11.650036 sec
# Rendering scene
Done!
Execution time of raytracing() : 11.650036 sec
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
23.26 1.46 1.46 69646433 0.00 0.00 dot_product
18.32 2.61 1.15 56956357 0.00 0.00 subtract_vector
9.40 3.20 0.59 31410180 0.00 0.00 multiply_vector
8.29 3.72 0.52 17836094 0.00 0.00 add_vector
7.01 4.16 0.44 13861875 0.00 0.00 rayRectangularIntersection
5.90 4.53 0.37 17821809 0.00 0.00 cross_product
5.74 4.89 0.36 13861875 0.00 0.00 raySphereIntersection
5.74 5.25 0.36 10598450 0.00 0.00 normalize
3.98 5.50 0.25 4620625 0.00 0.00 ray_hit_object
1.75 5.61 0.11 4221152 0.00 0.00 multiply_vectors
1.67 5.72 0.11 2110576 0.00 0.00 compute_specular_diffuse
1.67 5.82 0.11 2110576 0.00 0.00 localColor
1.12 5.89 0.07 1048576 0.00 0.00 ray_color
1.12 5.96 0.07 1 0.07 6.26 raytracing
0.96 6.02 0.06 2520791 0.00 0.00 idx_stack_top
0.96 6.08 0.06 3838091 0.00 0.00 length
0.96 6.14 0.06 1048576 0.00 0.00 rayConstruction
0.64 6.18 0.04 1241598 0.00 0.00 refraction
0.40 6.21 0.03 2558386 0.00 0.00 idx_stack_empty
0.40 6.23 0.03 1204003 0.00 0.00 idx_stack_push
0.32 6.25 0.02 1 0.02 0.02 delete_sphere_list
0.24 6.27 0.02 1241598 0.00 0.00 reflection
0.16 6.28 0.01 1241598 0.00 0.00 protect_color_overflow
0.00 6.28 0.00 1048576 0.00 0.00 idx_stack_init
0.00 6.28 0.00 113297 0.00 0.00 fresnel
0.00 6.28 0.00 37595 0.00 0.00 idx_stack_pop
0.00 6.28 0.00 3 0.00 0.00 append_rectangular
0.00 6.28 0.00 3 0.00 0.00 append_sphere
0.00 6.28 0.00 2 0.00 0.00 append_light
0.00 6.28 0.00 1 0.00 0.00 calculateBasisVectors
0.00 6.28 0.00 1 0.00 0.00 delete_light_list
0.00 6.28 0.00 1 0.00 0.00 delete_rectangular_list
0.00 6.28 0.00 1 0.00 0.00 diff_in_second
0.00 6.28 0.00 1 0.00 0.00 write_to_ppm
4,785,429,867 branches
26,321,418 branch-misses # 0.55% of all branches
12.091801958 seconds time elapsed
+------------------------------+
| function name |
| total time % ( self time % ) |
| total calls |
+------------------------------+
$ perf record -F 12500 -e cycles ./raytracing && perf report
Using Annotate in perf report helps show where the time goes, but I cannot read x86_64 assembly ><. (cheng hung lin)
# Rendering scene
Done!
Execution time of raytracing() : 9.119676 sec
For dot_product, besides applying loop unrolling, the three lines were merged into one (this follows someone else's shared notes, but I have forgotten whose QQ).
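As a minimal sketch of that change: the loop version below is an assumption about what dot_product looked like before (these notes do not show it), and the function names are chosen only for illustration. Unrolling the three iterations and merging them into one return expression could look like this:

```c
/* Assumed original form: a 3-iteration loop accumulating the products. */
static inline double dot_product_loop(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

/* Unrolled form: the three multiply-adds written as a single expression,
 * so no loop counter, compare, or branch is needed. */
static inline double dot_product_unrolled(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
```

Both versions compute the same value for 3-element vectors. The unrolled one only helps when the compiler has not already unrolled the loop itself, which is likely the case for unoptimized builds.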
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
15.52 0.56 0.56 56956357 0.00 0.00 subtract_vector
13.17 1.04 0.48 13861875 0.00 0.00 rayRectangularIntersection
12.89 1.50 0.47 69646433 0.00 0.00 dot_product
12.47 1.95 0.45 10598450 0.00 0.00 normalize
7.48 2.22 0.27 17821809 0.00 0.00 cross_product
7.35 2.49 0.27 13861875 0.00 0.00 raySphereIntersection
6.24 2.71 0.23 31410180 0.00 0.00 multiply_vector
6.10 2.93 0.22 4620625 0.00 0.00 ray_hit_object
4.43 3.09 0.16 17836094 0.00 0.00 add_vector
2.49 3.18 0.09 2110576 0.00 0.00 compute_specular_diffuse
2.49 3.27 0.09 1048576 0.00 0.00 ray_color
1.94 3.34 0.07 4221152 0.00 0.00 multiply_vectors
1.66 3.40 0.06 2520791 0.00 0.00 idx_stack_top
1.39 3.45 0.05 3838091 0.00 0.00 length
1.11 3.49 0.04 2110576 0.00 0.00 localColor
1.11 3.53 0.04 1 0.04 3.61 raytracing
0.55 3.55 0.02 1241598 0.00 0.00 refraction
0.55 3.57 0.02 1204003 0.00 0.00 idx_stack_push
0.55 3.59 0.02 1048576 0.00 0.00 rayConstruction
0.28 3.60 0.01 1241598 0.00 0.00 reflection
0.28 3.61 0.01 1048576 0.00 0.00 idx_stack_init
0.00 3.61 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 3.61 0.00 1241598 0.00 0.00 protect_color_overflow
0.00 3.61 0.00 113297 0.00 0.00 fresnel
0.00 3.61 0.00 37595 0.00 0.00 idx_stack_pop
0.00 3.61 0.00 3 0.00 0.00 append_rectangular
0.00 3.61 0.00 3 0.00 0.00 append_sphere
0.00 3.61 0.00 2 0.00 0.00 append_light
0.00 3.61 0.00 1 0.00 0.00 calculateBasisVectors
0.00 3.61 0.00 1 0.00 0.00 delete_light_list
0.00 3.61 0.00 1 0.00 0.00 delete_rectangular_list
0.00 3.61 0.00 1 0.00 0.00 delete_sphere_list
0.00 3.61 0.00 1 0.00 0.00 diff_in_second
0.00 3.61 0.00 1 0.00 0.00 write_to_ppm
Performance counter stats for './raytracing':
3,875,532,175 branches
26,090,737 branch-misses # 0.67% of all branches
9.382581189 seconds time elapsed
__attribute__: attaches attributes to a declaration that the compiler can use to optimize the code, for example void foo () __attribute__((always_inline)); (Ref)
static inline __attribute__((always_inline))
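A short sketch of how the attribute is typically applied with GCC. The functions scale_sum and average are hypothetical and are not taken from math-toolkit.h:

```c
/* Hypothetical helper: always_inline asks GCC to inline this function
 * even when optimization is disabled (-O0). */
static inline __attribute__((always_inline))
double scale_sum(double a, double b, double k)
{
    return (a + b) * k;
}

/* Caller: after inlining, the body of scale_sum is expanded here and no
 * separate call to scale_sum is made. */
double average(double a, double b)
{
    return scale_sum(a, b, 0.5);
}
```

Because the inlined body has no function entry of its own, a build with -pg records no calls for scale_sum, so it no longer appears as a separate row in the gprof output.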
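A small experiment
Motivation: once a function is inlined, gprof no longer shows its calls, which makes the functions in math-toolkit.h hard to analyze. So, looking at it from another angle: can we measure how many instructions are saved by inlining?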
$ perf stat -r 1 -e instructions ./raytracing # get the total instruction count
$ gprof -b raytracing # get the function call counts
| | instructions | function calls | time (sec) |
|---|---|---|---|
| non-inline | 25,758,523,927 | 10598450 | 9.157607 |
| inline | 25,121,100,243 | - | 8.967475 |
Because the inline build records no call count for normalize, the count from the non-inline build is reused.
(25,758,523,927 - 25,121,100,243)/10,598,450 = 60.14 instructions saved per normalize function call
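The same arithmetic as a small helper, in case the measurement is repeated for other functions. The function name insts_saved_per_call is made up for this sketch:

```c
#include <stdint.h>

/* Instructions saved per call =
 * (non-inline total - inline total) / number of calls. */
static double insts_saved_per_call(uint64_t non_inline_insts,
                                   uint64_t inline_insts,
                                   uint64_t calls)
{
    return (double)(non_inline_insts - inline_insts) / (double)calls;
}
```

Calling insts_saved_per_call(25758523927ULL, 25121100243ULL, 10598450ULL) gives roughly 60.14. Since each total comes from a single run (-r 1), the figure is only a rough estimate.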
All functions in math-toolkit.h changed to inline functions
# Rendering scene
Done!
Execution time of raytracing() : 4.762099 sec
__attribute__((aligned(n[, offset])))
I did not know where to put it, so in the end I left it out.
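#include <emmintrin.h>  /* SSE2 intrinsics */
static inline double dot_product(const double *v1, const double *v2)
{
    __m128d _v1;
    __m128d _v2;
    double _v_tmp[3];
    _v1 = _mm_loadu_pd(v1);      /* unaligned load of v1[0], v1[1] */
    _v2 = _mm_loadu_pd(v2);      /* unaligned load of v2[0], v2[1] */
    _v1 = _mm_mul_pd(_v1, _v2);  /* two products in one instruction */
    _mm_store_pd(_v_tmp, _v1);   /* aligned store: _v_tmp must be 16-byte aligned */
    return _v_tmp[0] + _v_tmp[1] + v1[2]*v2[2];  /* third term stays scalar */
}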
# Rendering scene
Done!
Execution time of raytracing() : 9.730340 sec
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
28.75 1.18 1.18 69646433 0.00 0.00 dot_product
11.25 1.64 0.46 13861875 0.00 0.00 rayRectangularIntersection
10.52 2.07 0.43 10598450 0.00 0.00 normalize
9.54 2.46 0.39 13861875 0.00 0.00 raySphereIntersection
8.68 2.81 0.36 56956357 0.00 0.00 subtract_vector
__m128d _v1 __attribute__((aligned(64)));
__m128d _v2 __attribute__((aligned(64)));
double _v_tmp[3] __attribute__((aligned(64)));
The result was a segmentation fault. Using gdb's bt (backtrace) command:
#0 0x00000000004015c2 in _mm_load_pd (__P=0x7fffffffd9a8) at /usr/lib/gcc/x86_64-linux-gnu/5/include/emmintrin.h:119
#1 dot_product (v1=0x7fffffffd9a8, v2=0x7fffffffdd00) at math-toolkit.h:72
#2 0x0000000000401ee0 in rayRectangularIntersection (ray_e=0x7fffffffd940, ray_d=0x7fffffffdd00, rec=0x60bbf0, ip=0x7fffffffd990,
t1=0x7fffffffd8f8) at raytracing.c:121
#3 0x00000000004028a5 in ray_hit_object (e=0x4043c0 <view>, d=0x7fffffffdd00, t0=0, t1=59.948704218900112, rectangulars=0x60bbf0,
hit_rectangular=0x7fffffffda78, spheres=0x60ba70, hit_sphere=0x7fffffffda88) at raytracing.c:264
#4 0x0000000000402f94 in ray_color (e=0x4043c0 <view>, t=0, d=0x7fffffffdd00, stk=0x7fffffffdd40, rectangulars=0x60bbf0,
spheres=0x60ba70, lights=0x60b9d0, object_color=0x7fffffffdd20, bounces_left=3) at raytracing.c:367
#5 0x0000000000403835 in raytracing (pixels=0x7ffff7f1b010 "", background_color=0x7fffffffdec0, rectangulars=0x60bbf0,
spheres=0x60ba70, lights=0x60b9d0, view=0x4043c0 <view>, width=512, height=512) at raytracing.c:481
#6 0x0000000000403caa in main () at main.c:52
math-toolkit.h:72
_v1 = _mm_load_pd(v1);
Could the problem be the arguments passed in as const double *v1, const double *v2?
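One way to check this hypothesis: _mm_load_pd requires its pointer to be 16-byte aligned, while a plain double pointer is only guaranteed to be 8-byte aligned. The helper below is hypothetical (not part of the project) and simply tests an address value, such as the v1 and v2 values printed in frame #1 of the backtrace:

```c
#include <stdint.h>

/* Returns 1 if addr is a multiple of 16, the alignment _mm_load_pd needs;
 * _mm_loadu_pd has no such requirement. */
static int is_aligned_16(uintptr_t addr)
{
    return (addr & 0xF) == 0;
}
```

With the backtrace values, is_aligned_16(0x7fffffffd9a8) is 0 because the address ends in 8, while is_aligned_16(0x7fffffffdd00) is 1. So v1 is not 16-byte aligned, which is consistent with the fault being raised inside _mm_load_pd.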
double _v1_tmp[3] __attribute__((aligned(64)));
double _v2_tmp[3] __attribute__((aligned(64)));
memcpy(_v1_tmp, v1, sizeof(double)*3 );
memcpy(_v2_tmp, v2, sizeof(double)*3 );
# Rendering scene
Done!
Execution time of raytracing() : 10.993178 sec
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
40.74 2.11 2.11 69646433 0.00 0.00 dot_product
10.45 2.65 0.54 13861875 0.00 0.00 rayRectangularIntersection
6.77 3.00 0.35 13861875 0.00 0.00 raySphereIntersection
6.68 3.34 0.35 56956357 0.00 0.00 subtract_vector
$ grep -n 'dot_product' *
Later I tried modifying primitives.h and math-toolkit.h directly.
typedef double point3[3] __attribute__((aligned(64))); <--
...
typedef struct {
point4 vertices[4]; <-- changed 3 to 4 (point3 to point4)
point3 normal;
object_fill rectangular_fill;
} rectangular;
__m128d _v1;
__m128d _v2;
double _v_tmp[3];
_v1 = _mm_load_pd(v1);
_v2 = _mm_load_pd(v2);
_v1 = _mm_mul_pd(_v1, _v2);
_mm_store_pd(_v_tmp, _v1);
return _v_tmp[0] + _v_tmp[1] + v1[2]*v2[2];
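Result:
Execution time of raytracing() : 9.984913 sec
Yes, overall it is slower than the version using _mm_loadu_pd!
However, the time spent in dot_product did go down. Over repeated runs, the timings of both the _mm_loadu_pd version and the aligned version fluctuate.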
jacklin@jacklin-CX420-CX420-MX:[~/Documents/sys2016/week1/raytracing] (master) 0h24m $ gprof -p -b raytracing
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
22.12 1.09 1.09 69646433 0.00 0.00 dot_product
12.79 1.72 0.63 56956357 0.00 0.00 subtract_vector
12.08 2.32 0.60 13861875 0.00 0.00 rayRectangularIntersection
9.03 2.76 0.45 10598450 0.00 0.00 normalize
8.02 3.16 0.40 13861875 0.00 0.00 raySphereIntersection
7.10 3.51 0.35 31410180 0.00 0.00 multiply_vector
6.90 3.85 0.34 17821809 0.00 0.00 cross_product
6.29 4.16 0.31 4620625 0.00 0.00 ray_hit_object
3.45 4.33 0.17 1048576 0.00 0.00 ray_color
3.04 4.48 0.15 17836094 0.00 0.00 add_vector
2.03 4.58 0.10 2110576 0.00 0.00 compute_specular_diffuse
1.62 4.66 0.08 4221152 0.00 0.00 multiply_vectors
1.42 4.73 0.07 1048576 0.00 0.00 rayConstruction
0.81 4.77 0.04 2110576 0.00 0.00 localColor
0.81 4.81 0.04 1 0.04 4.93 raytracing
0.61 4.84 0.03 1241598 0.00 0.00 reflection
0.61 4.87 0.03 1241598 0.00 0.00 refraction
0.41 4.89 0.02 3838091 0.00 0.00 length
0.41 4.91 0.02 1204003 0.00 0.00 idx_stack_push
0.20 4.92 0.01 2520791 0.00 0.00 idx_stack_top
0.20 4.93 0.01 1241598 0.00 0.00 protect_color_overflow
0.10 4.93 0.01 1 0.01 0.01 delete_sphere_list
Also, looking back at this code, the change seems wrong. I did not understand why the original failed to compile, so I changed it to:
point3 vertices[4] --> point4 vertices[4];
The original fails to compile:
In file included from objects.c:4:0:
primitives.h:32:5: error: alignment of array elements is greater than element size
point3 vertices[4];
^
Makefile:24: recipe for target 'objects.o' failed
Looking at models.inc makes it clear that the original point3 vertices[4] is meant to be a 3 x 4 2D array, so the way I changed rectangular is wrong.
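The compile error comes from how arrays are laid out: elements sit sizeof(element) bytes apart, and point3 is 24 bytes, so consecutive elements cannot all start on a 64-byte boundary. One possible alternative is sketched below. It is an assumption rather than something tried in these notes, and the type names are hypothetical. It wraps the three doubles in a struct, because an aligned attribute on a struct also pads its size up to a multiple of the alignment:

```c
#include <stddef.h>

/* Hypothetical wrapper: 3 doubles (24 bytes) padded to 32 bytes, so every
 * element of an array of vec3_aligned starts on a 16-byte boundary. */
typedef struct {
    double v[3];
} __attribute__((aligned(16))) vec3_aligned;

/* Still four vertices of three coordinates each, now 16-byte aligned. */
typedef struct {
    vec3_aligned vertices[4];
    vec3_aligned normal;
} rectangular_aligned;
```

The cost is 8 bytes of padding per vector, and existing code that indexes a point3 directly would need to go through the v member.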