2016q3 Homework1 (raytracing)

# 2016q3 Homework1 (raytracing) contributed by <`petermouse`> ### Reviewed by <`vic85821`> * 建議以圖表呈現結果較明顯 * 可以比較看看inline function跟Macro的差異 * commit [5705101](https://github.com/petermouse/raytracing/commit/5705101b8945eec441b06fe6f6f87e01c6f5908e) 可以再明確的說明更改內容及目的 * 使用astyle幫助檢查coding style ## 開發環境 * OS:Lubuntu 16.04 LTS * Memory: 8 GB * CPU: * Name: * Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz * Cache: * L1d Cache： 32K * L1i Cache： 32K * L2 Cache： 256K * L3 Cache： 3072K ## 使用GNU Gprof GNU Gprof:可以幫分析程式在執行期間各個部份(函式)所消耗的時間。為了使raytracing使用Gprof，make時附加參數`PROFILE=1`，`Makefile`的`ifeq`成立，`CFLAGS`(編譯參數)與`LDFLAGS`(連結參數)將append`-pg`，。 `-pg`：建立call-graph data。執行raytracing ``` # Rendering scene Done! Execution time of raytracing() : 5.190075 sec ``` gprof分析 ``` gprof -b raytracing gmon.out ``` `-b`:簡化說明 [其他參數](https://sourceware.org/binutils/docs/gprof/Output-Options.html#Output-Options) 部份結果 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 24.75 0.47 0.47 69646433 0.00 0.00 dot_product 24.75 0.94 0.47 56956357 0.00 0.00 subtract_vector 7.64 1.09 0.15 4620625 0.00 0.00 ray_hit_object 6.85 1.22 0.13 13861875 0.00 0.00 rayRectangularIntersection 6.58 1.34 0.13 31410180 0.00 0.00 multiply_vector 6.32 1.46 0.12 17836094 0.00 0.00 add_vector 6.32 1.58 0.12 10598450 0.00 0.00 normalize 4.74 1.67 0.09 17821809 0.00 0.00 cross_product 3.16 1.73 0.06 13861875 0.00 0.00 raySphereIntersection 2.63 1.78 0.05 2110576 0.00 0.00 compute_specular_diffuse 2.11 1.82 0.04 1048576 0.00 0.00 ray_color 1.58 1.85 0.03 2110576 0.00 0.00 localColor 1.32 1.88 0.03 4221152 0.00 0.00 multiply_vectors 0.53 1.89 0.01 3838091 0.00 0.00 length ``` 關於self seconds與cumulative seconds與欄位的解讀： cumulative seconds:呼叫該函式(包含子函式)佔的總時間 self seconds:呼叫該函式(不包含子函式)佔的總時間，反映在時間上的百分比。從這個例子可以發現，dot_product與subtract_vector等都呼叫了幾千萬次，也佔了不少時間，從這些呼叫多次的函式改善，對整體的改善幫助最多。 ## 改進 ### (1)loop unrolling 在`math-toolkit.h`中，有向量的基本運算，長得如下程式碼： ```clike= static inline void subtract_vector(const double *a, const double *b, double *out) { for (int i = 0; i < 3; i++) out[i] = a[i] - b[i]; } ``` 都是對index 0~2作個別運算，等於for迴圈根本沒有必要，造成多餘的branh prediction 利用perf分析 ``` perf stat -e cache-references,cache-misses,branches -r 10 ./raytracing ``` 結果 ``` Performance counter stats for './raytracing' (10 runs): 442,951 cache-references ( +- 18.95% ) 96,074 cache-misses # 21.690 % of all cache refs ( +- 28.09% ) 1,870,505,339 branches ( +- 0.01% ) 2.500741620 seconds time elapsed ( +- 0.35% ) ``` loop unrolling後 ``` Performance counter stats for './raytracing' (10 runs): 246,565 cache-references ( +- 9.09% ) 45,257 cache-misses # 18.355 % of all cache refs ( +- 7.32% ) 969,883,821 branches ( +- 0.00% ) 1.713952851 seconds time elapsed ( +- 0.12% ) ``` 竟然差了0.8秒，而branch數差了將近一半！ Gprof分析 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 18.36 0.20 0.20 69646433 0.00 0.00 dot_product 12.85 0.34 0.14 10598450 0.00 0.00 normalize 11.02 0.46 0.12 17821809 0.00 0.00 cross_product 10.10 0.57 0.11 56956357 0.00 0.00 subtract_vector 10.10 0.68 0.11 13861875 0.00 0.00 rayRectangularIntersection 8.26 0.77 0.09 4620625 0.00 0.00 ray_hit_object 6.43 0.84 0.07 13861875 0.00 0.00 raySphereIntersection 5.05 0.90 0.06 17836094 0.00 0.00 add_vector 2.75 0.93 0.03 31410180 0.00 0.00 multiply_vector 2.75 0.96 0.03 1241598 0.00 0.00 refraction 2.75 0.99 0.03 1048576 0.00 0.00 ray_color 2.75 1.02 0.03 1 0.03 1.09 raytracing 2.29 1.04 0.03 2110576 0.00 0.00 compute_specular_diffuse 1.84 1.06 0.02 2520791 0.00 0.00 idx_stack_top 1.38 1.08 0.02 2110576 0.00 0.00 localColor 0.92 1.09 0.01 3838091 0.00 0.00 length ``` ### (2)inline function inline function:不呼叫function，而是直接將function主體嵌入程式碼中，如此有助於減少function call的次數。在函數前使用`__attribute__((always_inline))`來強制gcc inline 還有很多選項來設定function的attribute，可以在[gcc manual](https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/Function-Attributes.html#Function-Attributes)找到 perf分析 ``` Performance counter stats for './raytracing' (10 runs): 131,085 cache-references ( +- 8.77% ) 31,756 cache-misses # 24.226 % of all cache refs ( +- 2.33% ) 545,079,262 branches ( +- 0.00% ) 1.533319775 seconds time elapsed ( +- 0.31% ) ``` 比起先前版本在快了約0.2秒 gprof分析 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 39.32 0.57 0.57 13861875 0.00 0.00 rayRectangularIntersection 19.32 0.85 0.28 13861875 0.00 0.00 raySphereIntersection 15.18 1.07 0.22 2110576 0.00 0.00 compute_specular_diffuse 8.97 1.20 0.13 1048576 0.00 0.00 ray_color 5.52 1.28 0.08 2110576 0.00 0.00 localColor 5.52 1.36 0.08 1048576 0.00 0.00 rayConstruction 3.45 1.41 0.05 4620625 0.00 0.00 ray_hit_object 0.69 1.42 0.01 1241598 0.00 0.00 reflection 0.69 1.43 0.01 1241598 0.00 0.00 refraction 0.69 1.44 0.01 113297 0.00 0.00 fresnel 0.69 1.45 0.01 1 0.01 1.45 raytracing 0.00 1.45 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 1.45 0.00 2520791 0.00 0.00 idx_stack_top ``` 已經看不到`math-toolkit.h`的函式了，因為已經inline它們了 ## SIMD:使用AVX2.0 先到[intel官網](http://ark.intel.com/products/75027/Intel-Core-i5-4200H-Processor-3M-Cache-up-to-3_40-GHz)查詢我的CPU是否支援extension，發現可以支援SSE4.1/4.2, AVX 2.0 。gcc也在4.7版以後支援AVX2[(來源)](https://gcc.gnu.org/gcc-4.7/changes.html) 使用AVX2，必須include`immintrin.h`標頭檔，並且編譯時加上 `-mavx`。 ```clike= inline __attribute__((always_inline)) void add_vector(const double *a, const double *b, double *out) { __m256d ma = _mm256_set_pd(0,a[2],a[1],a[0]); __m256d mb = _mm256_set_pd(0,b[2],b[1],b[0]); ma = _mm256_add_pd(ma,mb); double* d=(double *)&ma; out[0]=d[0]; out[1]=d[1]; out[2]=d[2]; } ``` perf結果 ``` Performance counter stats for './raytracing' (10 runs): 231,011 cache-references ( +- 7.25% ) 36,516 cache-misses # 15.807 % of all cache refs ( +- 4.92% ) 546,709,778 branches ( +- 0.01% ) 3.619358562 seconds time elapsed ( +- 0.10% ) ``` 實際上反而更慢了，我在想是不是因為比之前多了很多新的變數宣告以及函數呼叫，增加了許多成本呢？ >> 提出猜想後，要設計實驗證明，或者找到相關的技術手冊，不要只是「憑感覺」。SIMD register load/store 有成本。 [name=jserv] 實驗證明：(待續) ### OpenMP 待續 ## 參考資料 [How to read gprof output](http://badgertronics.com/writings/gprof.html) [perf tutorial](https://perf.wiki.kernel.org/index.php/Tutorial) [Crunching Numbers with AVX and AVX2](http://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX) [作業video](https://www.youtube.com/watch?v=m1RmfOfSwno) [2016/5/3 Embedded System HW5 Team 4共筆](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp)