2016q3 Homework1(raytracing)

# 2016q3 Homework1(raytracing) ###### tags: `tundergod` `hw1` `2016q3` contributed by <`tundergod`> ## 作業要求 * 在 GitHub 上 fork [raytracing](https://github.com/sysprog21/raytracing)，並思考如何改善程式效能 * 以 `make PROFILE=1` 重新編譯程式碼，並且學習 [gprof](https://sourceware.org/binutils/docs/gprof/) * 參考 [使用 GNU gprof 進行 Linux 平台的程式分析](http://os.51cto.com/art/200703/41426.htm) * 以 [gprof](https://sourceware.org/binutils/docs/gprof/) 指出效能瓶頸，並且著手改寫檔案 `math-toolkit.h` 在內的函式實做，充分紀錄效能差異在共筆 * 注意: 請勿更動編譯器最佳化選項 `-O0` (關閉最佳化) * 檔案 `math-toolkit.h` 定義的若干數學操作函式很重要，可參考 [Investigating SSE cross product performance](http://threadlocalmutex.com/?p=8)、[MathFu](https://google.github.io/mathfu/) 原始程式碼，以及 [2015q3 Homework #1 Ext](http://wiki.csie.ncku.edu.tw/embedded/2015q3h1ext) * 可善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作 * 將你的觀察、分析，以及各式效能改善過程，並善用 gnuplot 製圖，紀錄於「[作業區](https://hackmd.io/s/H1B7-hGp)」 ## 開發環境 * Linux 版本: Ubuntu 16.04 LTS * 硬體資訊： ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 69 Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz Stepping: 1 CPU MHz: 1661.660 CPU max MHz: 2600.0000 CPU min MHz: 800.0000 BogoMIPS: 4589.54 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 3072K NUMA node0 CPU(s): 0-3 ``` ## 初始資料 * 執行時間:(make PROFILE=1) ``` # Rendering scene Done! Execution time of raytracing() : 9.463558 sec ``` * gprof 觀察: * 發現math-toolkit.h中的dot_product , subtract_vector , multiply_vector 這三個function是最耗費時間的，佔了總執行時間將近50%,而單單是math-toolkit.h中函數的被呼叫次數高達約2億次 ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 22.98 0.99 0.99 69646433 0.00 0.00 dot_product 17.64 1.75 0.76 56956357 0.00 0.00 subtract_vector 9.87 2.18 0.43 13861875 0.00 0.00 rayRectangularIntersection 9.29 2.58 0.40 31410180 0.00 0.00 multiply_vector 7.43 2.90 0.32 10598450 0.00 0.00 normalize 5.92 3.15 0.26 13861875 0.00 0.00 raySphereIntersection 5.69 3.40 0.25 17836094 0.00 0.00 add_vector 5.57 3.64 0.24 4620625 0.00 0.00 ray_hit_object 4.41 3.83 0.19 17821809 0.00 0.00 cross_product 3.71 3.99 0.16 1048576 0.00 0.00 ray_color 2.79 4.11 0.12 4221152 0.00 0.00 multiply_vectors 1.86 4.19 0.08 1 0.08 4.31 raytracing 0.70 4.22 0.03 2520791 0.00 0.00 idx_stack_top 0.70 4.25 0.03 1048576 0.00 0.00 rayConstruction 0.58 4.27 0.03 3838091 0.00 0.00 length 0.46 4.29 0.02 2110576 0.00 0.00 compute_specular_diffuse 0.46 4.31 0.02 1241598 0.00 0.00 refraction 0.00 4.31 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 4.31 0.00 2110576 0.00 0.00 localColor 0.00 4.31 0.00 1241598 0.00 0.00 protect_color_overflow 0.00 4.31 0.00 1241598 0.00 0.00 reflection 0.00 4.31 0.00 1204003 0.00 0.00 idx_stack_push 0.00 4.31 0.00 1048576 0.00 0.00 idx_stack_init 0.00 4.31 0.00 113297 0.00 0.00 fresnel 0.00 4.31 0.00 37595 0.00 0.00 idx_stack_pop 0.00 4.31 0.00 3 0.00 0.00 append_rectangular 0.00 4.31 0.00 3 0.00 0.00 append_sphere 0.00 4.31 0.00 2 0.00 0.00 append_light 0.00 4.31 0.00 1 0.00 0.00 calculateBasisVectors 0.00 4.31 0.00 1 0.00 0.00 delete_light_list 0.00 4.31 0.00 1 0.00 0.00 delete_rectangular_list 0.00 4.31 0.00 1 0.00 0.00 delete_sphere_list 0.00 4.31 0.00 1 0.00 0.00 diff_in_second 0.00 4.31 0.00 1 0.00 0.00 write_to_ppm ``` ## 優化 ### Loop Unrooling * 在math-toolkit.h中的各個function多是以for迴圈實作，可以利用循環展開的方式加快程式的執行速度(並行執行)。 * 循環展開的缺點則是犧牲了程式的排版與閱讀性，代碼行數倍增。 * 以dot_product爲例: ``` static inline double dot_product(const double *v1, const double *v2) { return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2]; } ``` * Loop Unrooling的展開原理(利用輸出組合語言觀察)可以查看之前的[公筆](https://embedded2016.hackpad.com/Enhance-raytracing-program-f5CCUGMQ4Kp#:h=Benchmarking-program) ### Force Inline * inline是向編譯器提出“建議”，把inline的函數在函數位置直接展開（減輕系統負擔(overhead)），編譯器會對比兩者執行時間後選擇執行inline與否。 * 在本次實作中使用inline能夠減少2億次(math-toolkit中的函式)的function call * 因爲在Makefile中擁有 -O0 的指令（關閉最佳化），所以inline不會被執行，要讓inline執行必須在每一個function後面加上__attribute__((always_inline))強行讓CPU去執行inline這個“建議” ``` static inline __attribute__((always_inline)) double length(const double *v) { return sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]); } ``` ### Macro * Macro巨集跟inline有點像,都是對該段代碼進行直接替換 * 用法:在math-toolkit.h中直接使用 ``` #define DOT(a,b) ((a[0]*b[0])+(a[1]*b[1])+(a[2]*b[2])) ``` ### Macro 和 Inline的差別? * Macro是在Compiler進行編譯之前就被替換,而Inline是會被Compiler看到的,且在展開時系統會自動做安全檢查或Type的轉換 ### 去掉Assert() * 這是gprof中normalize的數據: * 它佔用了7.43%的總時間,且被呼叫了1千萬次,也就是說normalize中的一個防錯函數assert()也被執行了1千萬次,而他在正確輸出的環境下是不會用到的,可以在確定程式正確或完成作業之後把它的作用去掉 * 方法: * 直接刪除 * 修改 Makefile，新增 `CFLAGS=-DNDEBUG` 以消除 assert 對效能的影響 ``` 7.43 2.90 0.32 10598450 0.00 0.00 normalize ``` ### OpenMP * OpenMP（Open Multi-Processing）是一套支持跨平台、共享記憶體方式的多執行緒平行運算的API ``` #pragma omp parallel for num_threads(N) \ private(stk), private(d), private(object_color) for (int j = 0 ; j < 512; j++) { for (int i = 0; i < 512; i++) { ..... } } ``` ### POSIX Thread * 實作如下: * 做法爲把512個column分別交給N個thread去做 ``` typedef struct __ARG { uint8_t *pixels; color background; rectangular_node rectangulars; sphere_node spheres; light_node lights; const viewpoint *View; int start_j; point3 u, v, w, d; } arg; ``` ``` for (int j = (*data).start_j ; j < 512; j+=N) { for (int i = 0; i < 512; i++) { ... } } ``` * 用數量不同的thread去測試哪個效能最高: ![](https://i.imgur.com/m9ljulE.png) ## 最終結果 * 用最快的64thread去與ofast比較 ``` # Rendering scene Done! Execution time of raytracing() : 0.671045 sec ``` 圖表: * 比ofast的0.674040 sec 快了一點點!![](https://i.imgur.com/V5f0lWx.png)