# 2016q3 Homework1 (raytracing)
contributed by <`nekoneko`>

###### tags: `sys2016` `nekoneko` `homework`

### Reviewed by `ChenYi`
* Confusing git commit messages: the last commit is titled "SIMD dot_product solved-1", but the code in it is actually in a "done" state.
* The `-1` and `-2` suffixes are also unclear.
* Try analyzing cache-misses after the struct change, and use charts to show the speed differences.
* In the SIMD code, please avoid variable names that begin with an underscore, so they do not run into [Naming convention](https://en.wikipedia.org/wiki/Naming_convention_(programming)) problems.

## gprof
### Usage
- Compile with gcc's `-pg` option; it inserts a call to `mcount` into every function.
- Profiling must be enabled at both compile time and link time (`CFLAGS=-pg`, `LDFLAGS=-pg`).
- `-pg` also works when compiling and linking in a single step (`gcc -o`):
  `cc -o myprog myprog.c utils.c -g -pg`
- When invoking `ld` (the linker) directly, you must specify the profiling startup file `gcrt0.o`, link against `libc_p.a`, and change the link option to `-lc_p`:
  `ld -o myprog /lib/gcrt0.o myprog.o utils.o -lc_p`
- `run into problems with the profiling support code in a shared library being called before that library has been fully initialised.` The fix is to link statically against a library that contains the profile support code:
  `gcc -g -pg -static-libgcc myprog.c utils.c -o myprog`
  > I don't understand why the documentation says this; noting it here for future reference. [name=cheng hung lin]
- `-pg` can be added only to the modules you want to profile.
- Running the program produces gmon.out. For gmon.out to be written correctly, the program must exit normally -- *`returning from main or calling exit`*
- gmon.out is written to the directory the program is in while it runs (its current working directory).
- **bb.out**: *`Unfortunately, the appearance of a human-readable bb.out means the basic-block counts didn't get written into gmon.out.`*
- default executable file: a.out, default profile data file: gmon.out

### Options
- `-Q`, `--no-graph`: do not print the call graph
- `-q`, `--graph`: print the call graph
- `-p`, `--flat-profile`: print the flat profile
- `-A`, `--annotated-source`: print annotated source code ([Ref](https://books.google.com.tw/books?id=wQ6r3UTivJgC&pg=PA147&lpg=PA147&dq=gprof:+could+not+locate&source=bl&ots=ELZoMp4BDt&sig=dDSi4XvTbx6NlgQ40yiqqlHdkew&hl=zh-TW&sa=X&ved=0ahUKEwiA2qKpz7HPAhUBNY8KHbvcC1kQ6AEIJDAB#v=onepage&q=gprof%3A%20could%20not%20locate&f=false))

### flat profile
- Time spent in each function
- Number of times each function was called

### call graph
Shows the calling relationships for each function: which functions it calls and which functions call it.

### What it can reveal
- Where the program spent its time
- The calling relationships among functions
- Code segments that are slower than expected
- The number of calls to each function
- Bugs that had gone unnoticed

## graphviz
## gprof2dot
This part follows [Chen Yi's shared notes](https://hackmd.io/OwYwRmDMCskKYFoCGA2YAzBAWaSzIA4BOABgQEYSUT11oUATAgJnKA==?view#2016q3-homework-1-raytracing) and uses [gprof2dot](https://github.com/jrfonseca/gprof2dot).

```txt
$ sudo apt-get install python3-pip
$ pip3 install gprof2dot
```
> Neither installing via the script nor via pip ran successfully. [name=cheng hung lin]
> `Exception: using gprof2dot.py as a module is unsupported`

## raytracing
- Build and run

```txt
$ make clean
$ make PROFILE=1
$ ./raytracing
$ gprof raytracing | gprof2dot | dot -Tpng -o output.png
```
For why PROFILE is set, see lines 11 to 15 of the Makefile.

- Results
  - Output: 11.650036 sec

```txt
# Rendering scene
Done!
Execution time of raytracing() : 11.650036 sec
```
  - flat profile

```txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 23.26      1.46     1.46 69646433     0.00     0.00  dot_product
 18.32      2.61     1.15 56956357     0.00     0.00  subtract_vector
  9.40      3.20     0.59 31410180     0.00     0.00  multiply_vector
  8.29      3.72     0.52 17836094     0.00     0.00  add_vector
  7.01      4.16     0.44 13861875     0.00     0.00  rayRectangularIntersection
  5.90      4.53     0.37 17821809     0.00     0.00  cross_product
  5.74      4.89     0.36 13861875     0.00     0.00  raySphereIntersection
  5.74      5.25     0.36 10598450     0.00     0.00  normalize
  3.98      5.50     0.25  4620625     0.00     0.00  ray_hit_object
  1.75      5.61     0.11  4221152     0.00     0.00  multiply_vectors
  1.67      5.72     0.11  2110576     0.00     0.00  compute_specular_diffuse
  1.67      5.82     0.11  2110576     0.00     0.00  localColor
  1.12      5.89     0.07  1048576     0.00     0.00  ray_color
  1.12      5.96     0.07        1     0.07     6.26  raytracing
  0.96      6.02     0.06  2520791     0.00     0.00  idx_stack_top
  0.96      6.08     0.06  3838091     0.00     0.00  length
  0.96      6.14     0.06  1048576     0.00     0.00  rayConstruction
  0.64      6.18     0.04  1241598     0.00     0.00  refraction
  0.40      6.21     0.03  2558386     0.00     0.00  idx_stack_empty
  0.40      6.23     0.03  1204003     0.00     0.00  idx_stack_push
  0.32      6.25     0.02        1     0.02     0.02  delete_sphere_list
  0.24      6.27     0.02  1241598     0.00     0.00  reflection
  0.16      6.28     0.01  1241598     0.00     0.00  protect_color_overflow
  0.00      6.28     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      6.28     0.00   113297     0.00     0.00  fresnel
  0.00      6.28     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      6.28     0.00        3     0.00     0.00  append_rectangular
  0.00      6.28     0.00        3     0.00     0.00  append_sphere
  0.00      6.28     0.00        2     0.00     0.00  append_light
  0.00      6.28     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      6.28     0.00        1     0.00     0.00  delete_light_list
  0.00      6.28     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      6.28     0.00        1     0.00     0.00  diff_in_second
  0.00      6.28     0.00        1     0.00     0.00  write_to_ppm
```
  - branches and branch-misses

```txt
     4,785,429,867      branches
        26,321,418      branch-misses             #    0.55% of all branches

      12.091801958 seconds time elapsed
```
  - gprof2dot

![](https://i.imgur.com/NU6IE2z.png)

```txt
+------------------------------+
|        function name         |
| total time % ( self time % ) |
|         total calls          |
+------------------------------+
```

### Analysis
- Functions declared in math-toolkit.h
  - normalize
  - length
  - add_vector
  - subtract_vector
  - multiply_vectors
  - multiply_vector
  - cross_product
  - dot_product
  - scalar_triple_product
  - scalar_triple

### Optimization
#### Loop Unrolling
- Functions that contain a loop
  - dot_product 1.46s
  - subtract_vector ==1.15s==
  - multiply_vector ==0.59s==
  - add_vector 0.52s
  - multiply_vectors ==0.11s==
  - scalar_triple_product -> cross_product, multiply_vectors
  - scalar_triple -> cross_product, dot_product
- Observations
  1. Neither scalar_triple_product nor scalar_triple appears in the flat profile.
  2. subtract_vector (1.15) > multiply_vector (0.59) > multiply_vectors (0.11)
     - Multiplication comes out faster than subtraction (reason unknown).
     - multiply_vector > multiply_vectors (reason unknown).
     - Note that these are total self times, and the call counts differ widely (56,956,357 vs 31,410,180 vs 4,221,152 calls), so much of the gap reflects how often each function is called rather than its per-call cost.
     > Annotating with `$ perf record -F 12500 -e cycles ./raytracing && perf report` would explain it, but I can't read x86_64 assembly >< [name=cheng hung lin]
- After switching to loop unrolling: 9.119676 s

```txt
# Rendering scene
Done!
Execution time of raytracing() : 9.119676 sec
```
For dot_product, besides unrolling the loop, the three statements were merged into one (this follows someone else's shared notes, but I have forgotten whose QQ).

```txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 15.52      0.56     0.56 56956357     0.00     0.00  subtract_vector
 13.17      1.04     0.48 13861875     0.00     0.00  rayRectangularIntersection
 12.89      1.50     0.47 69646433     0.00     0.00  dot_product
 12.47      1.95     0.45 10598450     0.00     0.00  normalize
  7.48      2.22     0.27 17821809     0.00     0.00  cross_product
  7.35      2.49     0.27 13861875     0.00     0.00  raySphereIntersection
  6.24      2.71     0.23 31410180     0.00     0.00  multiply_vector
  6.10      2.93     0.22  4620625     0.00     0.00  ray_hit_object
  4.43      3.09     0.16 17836094     0.00     0.00  add_vector
  2.49      3.18     0.09  2110576     0.00     0.00  compute_specular_diffuse
  2.49      3.27     0.09  1048576     0.00     0.00  ray_color
  1.94      3.34     0.07  4221152     0.00     0.00  multiply_vectors
  1.66      3.40     0.06  2520791     0.00     0.00  idx_stack_top
  1.39      3.45     0.05  3838091     0.00     0.00  length
  1.11      3.49     0.04  2110576     0.00     0.00  localColor
  1.11      3.53     0.04        1     0.04     3.61  raytracing
  0.55      3.55     0.02  1241598     0.00     0.00  refraction
  0.55      3.57     0.02  1204003     0.00     0.00  idx_stack_push
  0.55      3.59     0.02  1048576     0.00     0.00  rayConstruction
  0.28      3.60     0.01  1241598     0.00     0.00  reflection
  0.28      3.61     0.01  1048576     0.00     0.00  idx_stack_init
  0.00      3.61     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      3.61     0.00  1241598     0.00     0.00  protect_color_overflow
  0.00      3.61     0.00   113297     0.00     0.00  fresnel
  0.00      3.61     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      3.61     0.00        3     0.00     0.00  append_rectangular
  0.00      3.61     0.00        3     0.00     0.00  append_sphere
  0.00      3.61     0.00        2     0.00     0.00  append_light
  0.00      3.61     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      3.61     0.00        1     0.00     0.00  delete_light_list
  0.00      3.61     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      3.61     0.00        1     0.00     0.00  delete_sphere_list
  0.00      3.61     0.00        1     0.00     0.00  diff_in_second
  0.00      3.61     0.00        1     0.00     0.00  write_to_ppm
```
- branches and branch-misses
  - branches: reduced by 909,897,692 (from 4,785,429,867 to 3,875,532,175)

```txt
 Performance counter stats for './raytracing':

     3,875,532,175      branches
        26,090,737      branch-misses             #    0.67% of all branches

       9.382581189 seconds time elapsed
```
- Function times
  - dot_product 0.47s
  - subtract_vector 0.56s
  - multiply_vector 0.23s
  - add_vector 0.16s
  - multiply_vectors 0.07s

#### force inline
- `__attribute__`: declares attributes that tell the compiler how to optimize the code.
- Usage: ~~`void foo () __attribute__((always_inline));` ([Ref](http://stackoverflow.com/questions/13228326/force-inline-function-in-other-translation-unit))~~ (not yet confirmed; I could not follow the original documentation)

```clike=
static inline __attribute__((always_inline))
```
- A small experiment
  Motivation: once a function is inlined, gprof no longer shows its calls, which makes the math-toolkit.h functions hard to analyze. So, from another angle: can we measure how many instructions inlining saves?

```txt
$ perf stat -r 1 -e instructions ./raytracing   # total instructions
$ gprof -b raytracing                           # function call counts
```
  - Using normalize as the example

| | instructions | function calls | time (sec) |
|:-:| - | - | - |
| non-inline | 25,758,523,927 | 10598450 | 9.157607 |
| inline | 25,121,100,243 | - | 8.967475 |

  The inline build records no call count for normalize, so the count from the non-inline build is reused.
  - `(25,758,523,927 - 25,121,100,243)/10,598,450 = 60.14` instructions saved per normalize call
  - Open questions: I don't know what the 60.14 actually represents.
    - I don't understand the actual difference between an inlined function and a function call.
    - I don't know whether gprof affects function calls.
- All math-toolkit.h functions changed to forced inline

```txt
# Rendering scene
Done!
Execution time of raytracing() : 4.762099 sec
```

#### SIMD
- CPU model: Intel i3 330M
  - Supports up to SSE4.1/4.2
- Following these [shared notes](https://embedded2016.hackpad.com/ep/pad/static/wOu40KzMaIP) and the [SPEC](https://software.intel.com/en-us/node/524253)
- For variables of type `__m128d` and `__m128i`, whether local or global, the compiler aligns them on 16-byte boundaries. This can be changed with `__attribute__`: `__attribute__((aligned(n[, offset])))`
  > I didn't know where to put it, so in the end I left it out.

```clike=
__m128d _v1;
__m128d _v2;
double _v_tmp[3];
_v1 = _mm_loadu_pd(v1);          /* unaligned load of v1[0], v1[1] */
_v2 = _mm_loadu_pd(v2);          /* unaligned load of v2[0], v2[1] */
_v1 = _mm_mul_pd(_v1, _v2);      /* two products at once */
/* _mm_store_pd needs a 16-byte-aligned destination; _v_tmp carries no
   alignment attribute, so _mm_storeu_pd would be the safer choice. */
_mm_store_pd(_v_tmp, _v1);
return _v_tmp[0] + _v_tmp[1] + v1[2]*v2[2];   /* third term stays scalar */
```
- Result

```txt
# Rendering scene
Done!
Execution time of raytracing() : 9.730340 sec
```
```txt
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 28.75      1.18     1.18 69646433     0.00     0.00  dot_product
 11.25      1.64     0.46 13861875     0.00     0.00  rayRectangularIntersection
 10.52      2.07     0.43 10598450     0.00     0.00  normalize
  9.54      2.46     0.39 13861875     0.00     0.00  raySphereIntersection
  8.68      2.81     0.36 56956357     0.00     0.00  subtract_vector
```
- Trying to add `aligned`

```clike=
__m128d _v1 __attribute__((aligned(64)));
__m128d _v2 __attribute__((aligned(64)));
double _v_tmp[3] __attribute__((aligned(64)));
```
The result was a segmentation fault. Using gdb's `bt`:

```txt
#0  0x00000000004015c2 in _mm_load_pd (__P=0x7fffffffd9a8) at /usr/lib/gcc/x86_64-linux-gnu/5/include/emmintrin.h:119
#1  dot_product (v1=0x7fffffffd9a8, v2=0x7fffffffdd00) at math-toolkit.h:72
#2  0x0000000000401ee0 in rayRectangularIntersection (ray_e=0x7fffffffd940, ray_d=0x7fffffffdd00, rec=0x60bbf0, ip=0x7fffffffd990, t1=0x7fffffffd8f8) at raytracing.c:121
#3  0x00000000004028a5 in ray_hit_object (e=0x4043c0 <view>, d=0x7fffffffdd00, t0=0, t1=59.948704218900112, rectangulars=0x60bbf0, hit_rectangular=0x7fffffffda78, spheres=0x60ba70, hit_sphere=0x7fffffffda88) at raytracing.c:264
#4  0x0000000000402f94 in ray_color (e=0x4043c0 <view>, t=0, d=0x7fffffffdd00, stk=0x7fffffffdd40, rectangulars=0x60bbf0, spheres=0x60ba70, lights=0x60b9d0, object_color=0x7fffffffdd20, bounces_left=3) at raytracing.c:367
#5  0x0000000000403835 in raytracing (pixels=0x7ffff7f1b010 "", background_color=0x7fffffffdec0, rectangulars=0x60bbf0, spheres=0x60ba70, lights=0x60b9d0, view=0x4043c0 <view>, width=512, height=512) at raytracing.c:481
#6  0x0000000000403caa in main () at main.c:52
```
math-toolkit.h:72

```clike=
_v1 = _mm_load_pd(v1);
```
Could the problem be the arguments passed in as `const double *v1, const double *v2`? In the backtrace, `v1=0x7fffffffd9a8` ends in `8`, so it is only 8-byte aligned, whereas `_mm_load_pd` requires a 16-byte-aligned address.

- Copying the inputs into aligned arrays

```clike=
double _v1_tmp[3] __attribute__((aligned(64)));
double _v2_tmp[3] __attribute__((aligned(64)));
memcpy(_v1_tmp, v1, sizeof(double)*3 );
memcpy(_v2_tmp, v2, sizeof(double)*3 );
```
- Result: it runs to completion, but takes even longer

```txt
# Rendering scene
Done!
Execution time of raytracing() : 10.993178 sec
```
```txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 40.74      2.11     2.11 69646433     0.00     0.00  dot_product
 10.45      2.65     0.54 13861875     0.00     0.00  rayRectangularIntersection
  6.77      3.00     0.35 13861875     0.00     0.00  raySphereIntersection
  6.68      3.34     0.35 56956357     0.00     0.00  subtract_vector
```
- Another approach: add `aligned` to the declarations of the variables that get passed to dot_product

```txt
$ grep -n 'dot_product' *
```
In the end I tried editing `primitives.h` and `math-toolkit.h` directly:

```clike=
typedef double point3[3] __attribute__((aligned(64)));  /* <-- changed */
...
typedef struct {
    point4 vertices[4];  /* <-- changed 3 to 4 (point3 -> point4) */
    point3 normal;
    object_fill rectangular_fill;
} rectangular;
```
```clike=
__m128d _v1;
__m128d _v2;
double _v_tmp[3];
_v1 = _mm_load_pd(v1);           /* aligned load: v1 must be 16-byte aligned */
_v2 = _mm_load_pd(v2);           /* aligned load: v2 must be 16-byte aligned */
_v1 = _mm_mul_pd(_v1, _v2);
_mm_store_pd(_v_tmp, _v1);       /* aligned store: _v_tmp has no alignment attribute here */
return _v_tmp[0] + _v_tmp[1] + v1[2]*v2[2];
```
Result:

```txt
Execution time of raytracing() : 9.984913 sec
```
Yes, ==overall== it is slower than the `_mm_loadu_pd` version! ~~But the time spent in dot_product did go down.~~ After running both several more times, the timings of the `_mm_loadu_pd` and aligned versions ==fluctuate== from run to run.

```
jacklin@jacklin-CX420-CX420-MX:[~/Documents/sys2016/week1/raytracing] (master) 0h24m $ gprof -p -b raytracing
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 22.12      1.09     1.09 69646433     0.00     0.00  dot_product
 12.79      1.72     0.63 56956357     0.00     0.00  subtract_vector
 12.08      2.32     0.60 13861875     0.00     0.00  rayRectangularIntersection
  9.03      2.76     0.45 10598450     0.00     0.00  normalize
  8.02      3.16     0.40 13861875     0.00     0.00  raySphereIntersection
  7.10      3.51     0.35 31410180     0.00     0.00  multiply_vector
  6.90      3.85     0.34 17821809     0.00     0.00  cross_product
  6.29      4.16     0.31  4620625     0.00     0.00  ray_hit_object
  3.45      4.33     0.17  1048576     0.00     0.00  ray_color
  3.04      4.48     0.15 17836094     0.00     0.00  add_vector
  2.03      4.58     0.10  2110576     0.00     0.00  compute_specular_diffuse
  1.62      4.66     0.08  4221152     0.00     0.00  multiply_vectors
  1.42      4.73     0.07  1048576     0.00     0.00  rayConstruction
  0.81      4.77     0.04  2110576     0.00     0.00  localColor
  0.81      4.81     0.04        1     0.04     4.93  raytracing
  0.61      4.84     0.03  1241598     0.00     0.00  reflection
  0.61      4.87     0.03  1241598     0.00     0.00  refraction
  0.41      4.89     0.02  3838091     0.00     0.00  length
  0.41      4.91     0.02  1204003     0.00     0.00  idx_stack_push
  0.20      4.92     0.01  2520791     0.00     0.00  idx_stack_top
  0.20      4.93     0.01  1241598     0.00     0.00  protect_color_overflow
  0.10      4.93     0.01        1     0.01     0.01  delete_sphere_list
```
==Also, looking back at this code, ~~the change seems wrong~~ I don't know why the original failed to compile; I later changed it to:==

```clike
/* before */ point3 vertices[4];
/* after  */ point4 vertices[4];
```
The original declaration fails to compile:

```
In file included from objects.c:4:0:
primitives.h:32:5: error: alignment of array elements is greater than element size
     point3 vertices[4];
     ^
Makefile:24: recipe for target 'objects.o' failed
```
With `aligned(64)` on `point3`, each element is 24 bytes (three doubles) but must start on a 64-byte boundary, which consecutive array elements cannot satisfy; that is what the error reports. Looking at `models.inc` shows that `point3 vertices[4]` is meant to be a 3 x 4 2D array (four vertices of three doubles each), so the way I changed `rectangular` was wrong.

- Material: [Attribute Syntax](https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Attribute-Syntax.html#Attribute-Syntax), [Ref](https://gcc.gnu.org/ml/gcc-help/2007-01/msg00051.html)

## References
- [gnu gprof](https://sourceware.org/binutils/docs/gprof/)
- [Chen Yi's shared notes](https://hackmd.io/OwYwRmDMCskKYFoCGA2YAzBAWaSzIA4BOABgQEYSUT11oUATAgJnKA==?view#2016q3-homework-1-raytracing)
- [Enhance raytracing program](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp)
- [Course introduction video](https://www.youtube.com/watch?v=m1RmfOfSwno)
- [ ] [Attribute Syntax](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Attribute-Syntax.html#Attribute-Syntax)
- [吳彥寬's shared notes](https://embedded2016.hackpad.com/ep/pad/static/wOu40KzMaIP) (for the SIMD part)
- [Intel® Core™ i3-330M Processor Spec](http://ark.intel.com/products/47663/Intel-Core-i3-330M-Processor-3M-Cache-2_13-GHz)
- [SSE part1 ~ part7](https://www.csie.ntu.edu.tw/~r89004/hive/sse/page_1.html) (an article from 2001)
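
## Appendix: loop unrolling sketch
The Loop Unrolling section reports timings but does not show the code. The following is a minimal sketch of what such a change might look like for 3-component vectors; it is not the author's actual code. The names `dot_product_loop`, `dot_product_unrolled`, `subtract_vector_unrolled`, and `check_unrolled` are hypothetical, and the loop form shown is an assumption about how math-toolkit.h was originally written.

```c
#include <assert.h>

/* Assumed original form: a 3-iteration loop (hypothetical name). */
static inline double dot_product_loop(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

/* Unrolled form: the three iterations written out as one expression,
 * so there is no loop counter and no loop-condition branch. */
static inline double dot_product_unrolled(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}

/* The same idea applied to an element-wise operation. */
static inline void subtract_vector_unrolled(const double *a, const double *b,
                                            double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}

/* Sanity check on values that are exactly representable in double. */
static inline void check_unrolled(void)
{
    const double a[3] = {1.0, 2.0, 3.0};
    const double b[3] = {4.0, 5.0, 6.0};
    double d[3];

    assert(dot_product_loop(a, b) == 32.0);      /* 4 + 10 + 18 */
    assert(dot_product_unrolled(a, b) == 32.0);

    subtract_vector_unrolled(b, a, d);
    assert(d[0] == 3.0 && d[1] == 3.0 && d[2] == 3.0);
}
```

Calling `check_unrolled()` from `main` exercises both forms. Removing the loop also removes its compare-and-branch, which is consistent with the drop in `branches` reported by perf above. An optimizing compiler may unroll such short fixed-count loops on its own, so the benefit of doing it by hand likely depends on the optimization level in use.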