2017q1 Homework1 (raytracing)

# 2017q1 Homework1 (raytracing) contributed by < `etc276` > ###### tags: `嵌入式` >>* 請加快進度 >>* 記得補上硬體相關資訊 >>[name=課程助教][color=red] ### Reviewed by `illusion030` * 可以嘗試 openMP 不同的 schedule 實作不同的平行化策略，可參考[ryanwang522的共筆](https://hackmd.io/s/BJsz9RZqx#multi-thread) * 可以把實驗結果畫出比較圖，試著跟編譯器最佳化比較看看 ## 開發環境 - Ubuntu 16.10 (64bit) ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 61 Model name: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz Stepping: 4 CPU MHz: 2400.878 CPU max MHz: 3000.0000 CPU min MHz: 500.0000 BogoMIPS: 4788.90 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 4096K NUMA node0 CPU(s): 0-3 ``` ## 相關連結 - [2017 年春季作業說明](https://hackmd.io/s/Bk-1zqIte#) - [2017q1 Homework1 (作業區)](https://hackmd.io/MYJjEYGZITgWgOwCMAmw4BYYwAxyQQKxwBmwGOkAHMDeEgGxA===?view) - [B02: raytracing作業要求](https://hackmd.io/s/HyuBWDwYl) - [raytracing Github連結 ( etc276 ) ](https://github.com/etc276/raytracing) - [作業解說 video](https://www.youtube.com/watch?v=m1RmfOfSwno) - [參考實做程式的解說 video](https://www.youtube.com/watch?v=V_rLyzecuaE) - [課程進度與開放資源 Wiki](http://wiki.csie.ncku.edu.tw/sysprog/schedule) ## 開發紀錄 ### Original * 先 make 然後 ./raytracing 看看，得到 ```clike # Rendering scene Done! Execution time of raytracing() : 2.655160 sec ``` ![](https://i.imgur.com/4Zl4Bxq.png) * 再來使用 gprof 查看效能瓶頸 ```clike $ make clean $ make PROFILE=1 $ ./raytracing $ gprof ./raytacing | less ``` ```clike Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 24.46 0.56 0.56 69646433 0.00 0.00 dot_product 22.04 1.06 0.50 56956357 0.00 0.00 subtract_vector 8.82 1.26 0.20 17821809 0.00 0.00 cross_product 8.38 1.45 0.19 31410180 0.00 0.00 multiply_vector 6.61 1.60 0.15 10598450 0.00 0.00 normalize 6.17 1.74 0.14 13861875 0.00 0.00 rayRectangularIntersection 4.85 1.85 0.11 4620625 0.00 0.00 ray_hit_object 3.97 1.94 0.09 17836094 0.00 0.00 add_vector 3.53 2.02 0.08 13861875 0.00 0.00 raySphereIntersection 3.09 2.09 0.07 1048576 0.00 0.00 ray_color 1.98 2.13 0.05 4221152 0.00 0.00 multiply_vectors 1.32 2.16 0.03 2110576 0.00 0.00 localColor 1.32 2.19 0.03 1 0.03 2.27 raytracing 0.88 2.21 0.02 2110576 0.00 0.00 compute_specular_diffuse 0.88 2.23 0.02 1048576 0.00 0.00 rayConstruction 0.44 2.24 0.01 3838091 0.00 0.00 length 0.44 2.25 0.01 1241598 0.00 0.00 protect_color_overflow 0.44 2.26 0.01 1204003 0.00 0.00 idx_stack_push 0.44 2.27 0.01 1048576 0.00 0.00 idx_stack_init 0.00 2.27 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 2.27 0.00 2520791 0.00 0.00 idx_stack_top 0.00 2.27 0.00 1241598 0.00 0.00 reflection 0.00 2.27 0.00 1241598 0.00 0.00 refraction 0.00 2.27 0.00 113297 0.00 0.00 fresnel 0.00 2.27 0.00 37595 0.00 0.00 idx_stack_pop 0.00 2.27 0.00 3 0.00 0.00 append_rectangular 0.00 2.27 0.00 3 0.00 0.00 append_sphere 0.00 2.27 0.00 2 0.00 0.00 append_light 0.00 2.27 0.00 1 0.00 0.00 calculateBasisVectors 0.00 2.27 0.00 1 0.00 0.00 delete_light_list 0.00 2.27 0.00 1 0.00 0.00 delete_rectangular_list 0.00 2.27 0.00 1 0.00 0.00 delete_sphere_list 0.00 2.27 0.00 1 0.00 0.00 diff_in_second 0.00 2.27 0.00 1 0.00 0.00 write_to_ppm % the percentage of the total running time of the ``` * 發現最花時間的是 ```subtract_vector``` 和 ```dot_product``` ```clike== static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; for (int i = 0; i < 3; i++) dp += v1[i] * v2[i]; return dp; } ``` ```clike== static inline void subtract_vector(const double *a, const double *b, double *out) { for (int i = 0; i < 3; i++) out[i] = a[i] - b[i]; } ``` ### Optimization 1 (by loop unrolling) 迴圈的用意為，重複執行差異不大的指令，這樣的好處是可以增加程式的可讀性，也可以在執行時期才決定要重複幾次。但如果我們事先就知道迴圈要執行的次數，就可以手動展開，達到 prepocessing 的效果以減少時間，缺點是可讀性會降低。如下所示，將 math-toolkit.h 裡的 function 都用相同方法改寫。 ```clike== static inline double dot_product(const double *v1, const double *v2) { return (v1[0] * v2[0]) + (v1[1] * v2[1]) + (v1[2] * v2[2]); } ``` 效果顯著，時間從 2.6 秒降為 1.8 秒。 ``` # Rendering scene Done! Execution time of raytracing() : 1.809386 sec ``` ### Optimization 2 (by force inline) 手動展開迴圈之後，效能瓶頸就卡在 function call 所需的時間。因為每當要呼叫 function 時，所以下一個優化方法就選用可以將 function 在 compile 階段就拆成 instruction cache 的 force inline 以加速，缺點是若函式的代碼過多，會提升 cache miss. 如下所示，將 math-toolkit.h 裡的 function 都用相同方法改寫。要注意的是像 ```scalar_triple_product``` 這種有多層呼叫的函式不要 force inline. ```clike== static inline __attribute__((always_inline)) double dot_product(const double *v1, const double *v2) { return (v1[0] * v2[0]) + (v1[1] * v2[1]) + (v1[2] * v2[2]); } ``` 時間又下降了約 0.3 秒。 ``` # Rendering scene Done! Execution time of raytracing() : 1.518595 sec ``` ### Optimization 3 (by OpenMP) 降低 function call 對效能的影響之後，試圖把任務分給多核心去做處理，以降低 real time，這邊選用 OpenMP 的 API 並挑選 raytracing 的其中一段 loop 實現。要注意的是要去 Makefile 裡兩個 FLAG 的後面多上一段指令 ``` -fopenMP``` ```c= int j, i, s; double r, g, b; #pragma omp parallel for private (i, r, g, b, s, object_color, stk, d) for (j = 0; j < height; j++) { for (i = 0; i < width; i++) { r = 0, g = 0, b = 0; /* MSAA */ for (s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(d, u, v, w, ``` 參考 [簡易的程式平行化－OpenMP（五）變數的平行化](https://kheresy.wordpress.com/2006/09/22/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E4%BA%94%EF%BC%89-%E8%AE%8A%E6%95%B8%E7%9A%84%E5%B9%B3%E8%A1%8C%E5%8C%96/)後得知要將變數平行化，否則會產生非預期結果的情況，所以將迴圈中的變數都設為 private. 時間減少約0.6秒，效果顯著。 ``` # Rendering scene Done! Execution time of raytracing() : 0.977788 sec ``` ## 相關知識 ### gprof 在使用 gprof 檢查效能瓶頸時，遇到跟 [FB 討論區](https://www.facebook.com/groups/system.software2017/permalink/1325126694219363/)一樣的問題，在參考了[劉俊林的共筆](https://hackmd.io/s/B1hlfxl9e#)之後才解決。 ### inline v.s. macro C99的標準新增了 inline function 的定義： >> Making a function an inline function suggests that calls to the function be as fast as possible. >> 值得注意的是 suggest 這個字，加上 inline 的關鍵字只是建議編譯器可以這樣做，但是因為我們的 makefile 中將最佳化關閉 (-O0)，所以如果不 force inline 則所有建議將不被採用。 macro 和 inline 的區別在於，前者是在 prepocess 階段將程式碼 (source code) 代換成 #define 後面的模樣，所以編譯器並不會幫你檢查傳入參數的型別，而且也因為只是代換，所以在巨集被呼叫多次之後，會存放使用大量的記憶體空間；後者是在 complie 階段將 function 轉成 instruction cache，因此可以減去 function 呼叫時，程式跑到函式所在的記憶體位置的執行時間。 #### 參考資料 [深入理解内联inline函数的优缺点，性能及使用指南](http://blog.csdn.net/acs713/article/details/42649949) [Enhance raytracing program](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp) [macro與inline差異](http://hugedream2.blogspot.tw/2008/09/macroinline.html) ## 心得以前就有聽過 function call 會佔用不少時間，所以建議使用 gdb 等 debug 的方法，而不是 ```printf()```等呼叫函式以檢查輸出。但是直到這次作業才知道效能上的差異，也學習到使用 inline 等優化方法，但目前對於 OpenMP, PTread 等方法還不盡理解，仍待補實際的 code 以分析效能。 :::danger 真正問題在於太少動手了，你要想想自己對這個世界的影響，活了二十多年，結果這個世界裡頭，誰用了你開發的程式碼？請重新認真看待你和這個世界的關聯，持續提升軟體品質 --jserv :::