2017q1 Homework1 (raytracing)

# 2017q1 Homework1 (raytracing) contributed by < `henry0929016816` > ### Reviewed by `xdennisx` - 沒有實作 force inline 的部份 - 平行化可以利用 pthread 去分析程式熱點所在，去了解用幾條 thread 和圖片裁切的方式才能達到最佳平行化 - Git commit message 都有點太過簡略，應提到實做方式及實做後的影響，例如 commit [a1295e42c72e301f4603115b828b56699fdf4726](https://github.com/henry0929016816/raytracing/commit/a1295e42c72e301f4603115b828b56699fdf4726) 就可以提到整體速度提升了多少 ## 未優化在拿到程式碼時，為了先知道原本的程式碼，運行的時間，所以我執行了 `make` 得到了執行檔，執行結果為： ``` # Rendering scene Done! Execution time of raytracing() : 2.636573 sec ``` 可以看到花了 2.636573 秒執行此程式，然而這只能知道程式整體的執行時間，無法得知細部的內容，究竟程式再哪個環節執行的比較久，是無從得知的，所以我們運用了工具幫我們做分析。 ## 效能分析 * ### GNU gprof 利用 gprof 我們可以分析 raytracing 程式引入了哪些函式，這些函式的執行時間，以及函式調用的關聯性。透過輸入指令： `gprof raytracing gmon.out | less` 期許得到程式效能分析的資訊，可惜什麼資訊也沒有，找出原因後發現原來是 ==make== 的時候忘記給 ==PROFILE== 初始值，導致沒有 gmon.out 檔，重新輸入： `make PROFILE=1` ，再次執行raytracing ``` # Rendering scene Done! Execution time of raytracing() : 5.543456 sec ``` 發現到執行的時間增為了兩倍，由於中間安插了一些程式碼所以執行時間增加了，但是我們可以得到一些有用的資訊，再次輸入： `gprof raytracing gmon.out | less` 得到了： ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 21.09 0.47 0.47 69646433 0.00 0.00 dot_product 19.29 0.90 0.43 56956357 0.00 0.00 subtract_vector 10.99 1.15 0.25 13861875 0.00 0.00 rayRectangularIntersection 8.53 1.34 0.19 10598450 0.00 0.00 normalize 8.08 1.52 0.18 31410180 0.00 0.00 multiply_vector 5.83 1.65 0.13 17836094 0.00 0.00 add_vector 5.83 1.78 0.13 4620625 0.00 0.00 ray_hit_object 4.71 1.88 0.11 13861875 0.00 0.00 raySphereIntersection 4.49 1.98 0.10 17821809 0.00 0.00 cross_product 2.69 2.04 0.06 1048576 0.00 0.00 ray_color 2.24 2.09 0.05 2110576 0.00 0.00 localColor 1.79 2.13 0.04 2110576 0.00 0.00 compute_specular_diffuse 1.35 2.16 0.03 4221152 0.00 0.00 multiply_vectors 1.35 2.19 0.03 2520791 0.00 0.00 idx_stack_top 0.90 2.21 0.02 1 0.02 2.23 raytracing 0.45 2.22 0.01 1241598 0.00 0.00 reflection 0.45 2.23 0.01 1048576 0.00 0.00 rayConstruction 0.00 2.23 0.00 3838091 0.00 0.00 length 0.00 2.23 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 2.23 0.00 1241598 0.00 0.00 protect_color_overflow 0.00 2.23 0.00 1241598 0.00 0.00 refraction 0.00 2.23 0.00 1204003 0.00 0.00 idx_stack_push 0.00 2.23 0.00 1048576 0.00 0.00 idx_stack_init 0.00 2.23 0.00 113297 0.00 0.00 fresnel 0.00 2.23 0.00 37595 0.00 0.00 idx_stack_pop 0.00 2.23 0.00 3 0.00 0.00 append_rectangular 0.00 2.23 0.00 3 0.00 0.00 append_sphere 0.00 2.23 0.00 2 0.00 0.00 append_light 0.00 2.23 0.00 1 0.00 0.00 calculateBasisVectors 0.00 2.23 0.00 1 0.00 0.00 delete_light_list 0.00 2.23 0.00 1 0.00 0.00 delete_rectangular_list 0.00 2.23 0.00 1 0.00 0.00 delete_sphere_list 0.00 2.23 0.00 1 0.00 0.00 diff_in_second 0.00 2.23 0.00 1 0.00 0.00 write_to_ppm ``` 當然底下還有一些，關於函式詳細資訊，列出這個函式呼叫了幾個函式，然而由於文字的資訊讓人看的頭昏眼花，所以我參考了 [hugikun999的共筆](https://hackmd.io/s/HyHhgcv6#) 使用 [graphviz](http://www.openfoundry.org/tw/foss-programs/8820-graphviz-) 將 gprof 得到的資訊用圖像表示，輸入： ` gprof ./raytracing | gprof2dot | dot -T png -o output.png` 產生以下的函式關聯圖： ![](https://i.imgur.com/I1oJnoK.png) 透過以上的資訊我們可以分析出前三名最耗時的函式，dot_product 、subtract_vector 、 rayRectangularIntersection ，而這些函式就是這支程式的效能瓶頸，準備開始對它們進行優化。 ## 優化方式 * ### 方法一： [loop unrolling](https://en.wikipedia.org/wiki/Loop_unrolling) 觀察 `math-toolkit.h` 的==dot_product== 函式 ```c= double dot_product(const double *v1, const double *v2) { double dp = 0.0; for (int i = 0; i < 3; i++) dp += v1[i] * v2[i]; return dp; } ``` 裡面的 for 回圈只重複了三次，如果我們將回圈拔掉，改成以下型式： ```c= dp = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2]; ``` 雖然程式的可讀性降低了，但是減少了[分支預測](https://zh.wikipedia.org/wiki/%E5%88%86%E6%94%AF%E9%A0%90%E6%B8%AC%E5%99%A8)錯誤的產生，也減少了 data dependency，讓程式可以同時處理多一點資料，而不用等到上一筆資料的值確定後（原本的 i 值在做完 i++ 後才能確定下一次的 i 值），才執行下一筆，期望程式能降低執行時間，重新執行程式，得到結果： ``` # Rendering scene Done! Execution time of raytracing() : 3.166653 sec ``` 原以為執行的時間會縮減，結果執行時間反而增長了，讓我感到很驚訝，利用 gprof 確認看看程式效能的變化： ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 25.68 0.48 0.48 56956357 0.00 0.00 subtract_vector 12.31 0.71 0.23 31410180 0.00 0.00 multiply_vector 10.70 0.91 0.20 13861875 0.00 0.00 rayRectangularIntersection 7.76 1.06 0.15 17836094 0.00 0.00 add_vector 7.22 1.19 0.14 69646433 0.00 0.00 dot_product 6.96 1.32 0.13 13861875 0.00 0.00 raySphereIntersection 6.96 1.45 0.13 10598450 0.00 0.00 normalize 4.82 1.54 0.09 17821809 0.00 0.00 cross_product 4.28 1.62 0.08 4620625 0.00 0.00 ray_hit_object 4.28 1.70 0.08 1048576 0.00 0.00 ray_color 2.14 1.74 0.04 1048576 0.00 0.00 rayConstruction 1.61 1.77 0.03 2110576 0.00 0.00 compute_specular_diffuse 1.61 1.80 0.03 1 0.03 1.87 raytracing 1.07 1.82 0.02 2110576 0.00 0.00 localColor 1.07 1.84 0.02 1241598 0.00 0.00 refraction 0.54 1.85 0.01 4221152 0.00 0.00 multiply_vectors 0.54 1.86 0.01 1241598 0.00 0.00 protect_color_overflow 0.27 1.87 0.01 3838091 0.00 0.00 length 0.27 1.87 0.01 1048576 0.00 0.00 idx_stack_init 0.00 1.87 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 1.87 0.00 2520791 0.00 0.00 idx_stack_top 0.00 1.87 0.00 1241598 0.00 0.00 reflection 0.00 1.87 0.00 1204003 0.00 0.00 idx_stack_push 0.00 1.87 0.00 113297 0.00 0.00 fresnel 0.00 1.87 0.00 37595 0.00 0.00 idx_stack_pop 0.00 1.87 0.00 3 0.00 0.00 append_rectangular 0.00 1.87 0.00 3 0.00 0.00 append_sphere 0.00 1.87 0.00 2 0.00 0.00 append_light 0.00 1.87 0.00 1 0.00 0.00 calculateBasisVectors 0.00 1.87 0.00 1 0.00 0.00 delete_light_list 0.00 1.87 0.00 1 0.00 0.00 delete_rectangular_list 0.00 1.87 0.00 1 0.00 0.00 delete_sphere_list 0.00 1.87 0.00 1 0.00 0.00 diff_in_second 0.00 1.87 0.00 1 0.00 0.00 write_to_ppm ``` 可以看到 dot_product 的執行時間，從原本佔了整支程式的 21.09 % 掉到了 7.22 % ，代表 dot_product 的執行效能是有被改善的，合理懷疑是電腦久沒關機，導致執行程式有時忽快忽慢，重複執行幾次，果然如我設想，執行時間飄呼不定有 2.282860 sec ，也有 2.569222 sec，讓我再次體認到 gprof 的好用之處，讓我可以確認是否真的有改善到函式效能。 * 圖示 ![](https://i.imgur.com/4nuxALJ.png) 確認到 loop unrolling 對效能的有顯著的影響後，著手將其它函式（add_vector、subtract_vector、multiply_vectors、 multiply_vector ）有 for 迴圈的都修改過，整體的執行時間已經降為了 1.961615 sec ``` # Rendering scene Done! Execution time of raytracing() : 1.961615 sec ``` * 圖示 ![](https://i.imgur.com/b6P806X.png) 使用 gprof 觀察 ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 17.22 0.21 0.21 69646433 0.00 0.00 dot_product 16.40 0.41 0.20 13861875 0.00 0.00 rayRectangularIntersection 11.89 0.56 0.15 56956357 0.00 0.00 subtract_vector 9.02 0.67 0.11 10598450 0.00 0.00 normalize 8.61 0.77 0.11 4620625 0.00 0.00 ray_hit_object 7.38 0.86 0.09 17821809 0.00 0.00 cross_product 5.74 0.93 0.07 13861875 0.00 0.00 raySphereIntersection 4.51 0.99 0.06 17836094 0.00 0.00 add_vector 4.10 1.04 0.05 31410180 0.00 0.00 multiply_vector 2.46 1.07 0.03 1048576 0.00 0.00 ray_color 2.05 1.09 0.03 4221152 0.00 0.00 multiply_vectors 2.05 1.12 0.03 1048576 0.00 0.00 rayConstruction 1.64 1.14 0.02 2110576 0.00 0.00 compute_specular_diffuse 1.64 1.16 0.02 2110576 0.00 0.00 localColor 1.23 1.17 0.02 3838091 0.00 0.00 length 0.82 1.18 0.01 2520791 0.00 0.00 idx_stack_top 0.82 1.19 0.01 1241598 0.00 0.00 protect_color_overflow 0.82 1.20 0.01 1241598 0.00 0.00 reflection 0.82 1.21 0.01 1241598 0.00 0.00 refraction 0.82 1.22 0.01 1 0.01 1.22 raytracing 0.00 1.22 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 1.22 0.00 1204003 0.00 0.00 idx_stack_push 0.00 1.22 0.00 1048576 0.00 0.00 idx_stack_init 0.00 1.22 0.00 113297 0.00 0.00 fresnel 0.00 1.22 0.00 37595 0.00 0.00 idx_stack_pop 0.00 1.22 0.00 3 0.00 0.00 append_rectangular 0.00 1.22 0.00 3 0.00 0.00 append_sphere 0.00 1.22 0.00 2 0.00 0.00 append_light 0.00 1.22 0.00 1 0.00 0.00 calculateBasisVectors 0.00 1.22 0.00 1 0.00 0.00 delete_light_list 0.00 1.22 0.00 1 0.00 0.00 delete_rectangular_list 0.00 1.22 0.00 1 0.00 0.00 delete_sphere_list 0.00 1.22 0.00 1 0.00 0.00 diff_in_second 0.00 1.22 0.00 1 0.00 0.00 write_to_ppm ``` 可以更清楚知道被修改的函式所降的執行秒數 * 圖例： ![](https://i.imgur.com/nnTdqDy.png) * ### 方法二： [OpenMp](https://zh.wikipedia.org/wiki/OpenMP) * #### 介紹 OpenMP 是一套支援多平台，共享記憶體多平行處理的 API，使用在 c 、c++ 、 fortran 語言。它的優點是提供了對平行演算法的抽象描述，使人可以用最簡單的一行程式，就能讓一段程式碼使用多平行化的方式去執行。 ==語法== `#pragma omp <directive> [clause[[,] clause] ...]` ==編譯方式== `gcc -fopenmp` * #### 運用將 `raytracing.c` 的 raytracing 函式裡的 for 迴圈，使用 OpenMp 做平行處理，使得程式的 throughput 增加，函式能更快得到結果，進而增進整體程式的效能。然而要注意多平行化的 for 迴圈裡，是否有變數是在迴圈外宣告的，迴圈外宣告的變數會在所有執行序裡，共享同一個變數空間，導致結果出錯，例如 a 執行序算出 number 變數為 10 ，然而 b 執行序算出 number 變數為 20，導致 a 執行序的結果變為 20。下圖為運算結果出錯的圖： ![](https://i.imgur.com/97JvQb0.png) 為了不讓計算結果出錯，可以在 # pragma omp parallel for 後面加個 ==private ( 變數名稱 )== ，讓 private 裡的變數在每個執行序裡都有自己的副本，彼此不會影響。認真觀察 raytracing 函式，發現有 3 個變數需要被保護起來，將 code 改成 `# pragma omp parallel for private(d,object_color,stk)` 得到了正確的圖形： ![](https://i.imgur.com/WvQrjBo.png) `./raytracing` ``` # Rendering scene Done! Execution time of raytracing() : 0.949682 sec ``` 測得執行時間為 0.949682 sec，執行效率提升了一倍。 * 圖例 ![](https://i.imgur.com/2YjXDQZ.png) ## 參考資料 * [hugikun999的共筆](https://hackmd.io/s/HyHhgcv6#) * [使用Gnu gprof进行Linux平台下的程序分析](http://os.51cto.com/art/200703/41426.htm) * [loop rnrolling 心得](https://www.ptt.cc/bbs/C_and_CPP/M.1246071002.A.A54.html) * [loop unrolling wiki](https://en.wikipedia.org/wiki/Loop_unrolling) * [OpenMp wiki](https://zh.wikipedia.org/wiki/OpenMP) * [簡易的程式平行化方法－OpenMP（一）簡介](https://kheresy.wordpress.com/2006/06/09/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%E6%96%B9%E6%B3%95%EF%BC%8Dopenmp%EF%BC%88%E4%B8%80%EF%BC%89%E7%B0%A1%E4%BB%8B/)