2016q3 Homework1 (raytracing)

# 2016q3 Homework1 (raytracing) ##### tags: `jeff60907` contributed by <`jeff60907`> ## [作業說明](https://hackmd.io/s/B1W5AWza)及預期目標善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作 ## 開發環境 ``` 作業系統 Lubuntu 16.04 CPU:Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 8192K ``` * 依照作業介紹安裝以下開發工具 ``` $ sudo apt-get update $ sudo apt-get install graphviz $ sudo apt-get install imagemagick ``` ### 未優化前測試 ``` # Rendering scene Done! Execution time of raytracing() : 2.404955 sec ``` [學習GNU gprof ](http://os.51cto.com/art/200703/41426.htm) >可以追蹤程式時間到底花在那邊 function到底執行多少次、時間、流程量出效能上的熱點 ``` $ make PROFILE=1 ``` ``` # Rendering scene Done! Execution time of raytracing() : 5.026501 sec ``` ``` $ gprof ./raytracing | less ``` 分析程式執行的內容 ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 21.12 0.42 0.42 69646433 0.00 0.00 dot_product 17.10 0.76 0.34 56956357 0.00 0.00 subtract_vector 11.56 0.99 0.23 31410180 0.00 0.00 multiply_vector 11.06 1.21 0.22 10598450 0.00 0.00 normalize 8.80 1.39 0.18 13861875 0.00 0.00 rayRectangularIntersection 5.78 1.50 0.12 4620625 0.00 0.00 ray_hit_object 4.02 1.58 0.08 17821809 0.00 0.00 cross_product 4.02 1.66 0.08 1048576 0.00 0.00 ray_color 3.52 1.73 0.07 17836094 0.00 0.00 add_vector 3.02 1.79 0.06 4221152 0.00 0.00 multiply_vectors 2.77 1.85 0.06 13861875 0.00 0.00 raySphereIntersection 2.01 1.89 0.04 2110576 0.00 0.00 compute_specular_diffuse 2.01 1.93 0.04 1241598 0.00 0.00 protect_color_overflow 1.01 1.95 0.02 1048576 0.00 0.00 rayConstruction 0.75 1.96 0.02 113297 0.00 0.00 fresnel ``` > 可以發現前面呼叫次數非常多，時間花費幾乎都在這邊，所以想要優化程式從這邊著手，而不是靠`感覺`去瞎猜，要依照真實數據去改善。上課講解影片提到，Makefile 可以更改編譯最佳化測試 ``` CFLAGS = \ -std=gnu99 -Wall -O0 -g 改成 CFLAGS = \ -std=gnu99 -Wall -Ofast -g ``` ``` Execution time of raytracing() : 0.559535 sec ``` 電腦編譯自動最佳化 5秒變成0.55秒 ## 人腦VS大腦超越電腦編譯器最佳化 ### 優化測試1 > 課程video 提到 for 回圈會造成分支，如果使用展開可以有效提升執行速度 > 但是不可能總把回圈拆開來硬幹，要依照自己程式的需求目標。 #### loop unrolling 改善第1名 dot_product() 把for展開測試 ``` double dot_product(const double *v1, const double *v2) { double dp = 0.0; dp = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2]; return dp; } ``` ``` Execution time of raytracing() : 4.577559 sec ``` > 原本5.02秒變成4.57秒，只改了一行for寫法就改善了時間效能 ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 24.41 0.40 0.40 56956357 0.00 0.00 subtract_vector 12.81 0.61 0.21 69646433 0.00 0.00 dot_product 11.59 0.80 0.19 17836094 0.00 0.00 add_vector 9.76 0.96 0.16 31410180 0.00 0.00 multiply_vector ``` > 重新看了 gprof dot_product的名次也下降了，繼續看其他數學式程式碼改善效能把add_vector()、subtract_vector()、multiply_vectors()、multiply_vector()的for loop也拆開來 ``` Execution time of raytracing() : 4.211204 sec ``` > 效能改善4.57 到 4.211 改善幅度沒有像dot_product 多 ### 優化測試2 [wiki inline function](https://en.wikipedia.org/wiki/Inline_function) [參考rnic開發紀錄](https://embedded2016.hackpad.com/2016q1-Homework-2A-GalzL151aZc) 其中提到 inline function效能提升，看資料研究原理 [參考1](http://blog.yam.com/swwuyam/article/11745212) [參考2](http://hugedream2.blogspot.tw/2008/09/macroinline.html)提到巨集 VS inline > macro與inline的差異： > * macro是由preprocessor處理，由inline則是compiler來負責 > * macro對於檢查傳入參數的型別，而inline則會檢查型別 > * marco 不執行參數型別檢查、也沒有語法檢查，也可能產生邏輯錯誤 [參考3](http://openhome.cc/Gossip/CppGossip/inlineFunction.html)提到行內函式只能建議編譯器，也就是說建議並不一定會被採納，這視您的編譯器而定。 #### 嘗試inline function 確定展開將原先每個 static inline 後面加上 _ _attribute _ _((always_inline))確定展開 ``` static inline __attribute__((always_inline)) ``` ``` Execution time of raytracing() : 2.048317 sec ``` > 從4.211秒效能直接改善約一半到2.048秒 ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 36.52 0.46 0.46 13861875 0.00 0.00 rayRectangularIntersection 15.88 0.66 0.20 2110576 0.00 0.00 compute_specular_diffuse 11.91 0.81 0.15 13861875 0.00 0.00 raySphereIntersection 11.91 0.96 0.15 1048576 0.00 0.00 ray_color 7.15 1.05 0.09 4620625 0.00 0.00 ray_hit_object 4.76 1.11 0.06 1048576 0.00 0.00 rayConstruction 3.97 1.16 0.05 2110576 0.00 0.00 localColor ``` > 因為function都展開了，少了呼叫數學 function傳值的時間嘗試把 return 計算式內容，而不是先存後傳發現效能有提高一些！ ``` static inline __attribute__((always_inline)) double dot_product(const double *v1, const double *v2) { return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2]; } ``` ``` Execution time of raytracing() : 1.988770 sec ``` > 時間降約 0.005秒左右 ### 優化測試3 #### 多執行緒處理－使用OpenMP > 過去沒有做過多執行緒平行化處理的程式，知道一些些概念卻不是很懂，需要補充自己的背景知識 > > 多執行緒還有許多方法可以探討 [OpenMP系列教學文](https://kheresy.wordpress.com/2006/06/09/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%E6%96%B9%E6%B3%95%EF%BC%8Dopenmp%EF%BC%88%E4%B8%80%EF%BC%89%E7%B0%A1%E4%BB%8B/) 回頭再試試看其他指令及範例練習 [參考dada實作](https://embedded2016.hackpad.com/2016q1-Homework-2-A-jugsB8br8Bt) [參考raytracing共筆](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp) OpenMP（Open Multi-Processing）是一套支持跨平台、共享記憶體方式的多執行緒平行運算的API。 * 它比起使用作業系統直接建立執行緒有三大優點： * CPU核心數量的擴展性 * 便利性 * 可移植性首先要進行平行化處理，必須先探討平行處理的程式碼是不是各自獨立沒有相依性，如果有相依性，進行平行化處理thread計算速度各自不同，資料就會錯誤。 * 查看 raytracing.c的 raytracing function 發現pixel可以各自平行處理，彼此沒有相依性的問題。在 raytracing.c 中的 raytracing function內的for建立平行化處理 * 沒有選擇多少執行緒處理，直接用電腦核心平行運算 ``` Execution time of raytracing() : 1.207583 sec ``` > 時間減少約 0.8秒如果指定多少執行緒處理? * num_threads選擇多少數量的執行緒平行處理 * private 讓每個執行緒中，都有一份變數的複本，以免互相干擾。 ``` #pragma omp parallel for num_threads(512) \ private(stk), private(d), \ private(object_color) for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { ``` ``` 2 threads Execution time of raytracing() : 1.965742 sec 4 threads Execution time of raytracing() : 1.655131 sec 8 threads Execution time of raytracing() : 1.229635 sec 32 threads Execution time of raytracing() : 1.006759 sec 64 threads Execution time of raytracing() : 0.968196 sec 128 threads Execution time of raytracing() : 0.941614 sec 256 threads Execution time of raytracing() : 0.911764 sec 512 threads Execution time of raytracing() : 0.908156 sec 1024 threads Execution time of raytracing() : 0.949907 sec ``` ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 40.01 0.02 0.02 85023 0.00 0.00 raySphereIntersection 40.01 0.04 0.02 84535 0.00 0.00 rayRectangularIntersection 20.01 0.05 0.01 7767 0.00 0.00 rayConstruction 0.00 0.05 0.00 33026 0.00 0.00 ray_hit_object 0.00 0.05 0.00 18471 0.00 0.00 idx_stack_empty ``` > threads 前面看起來越高效果越好，可是真的是這樣子嗎? > 原本dada使用的是64 threads但是效能卻不是最高的，看了一下main.c COL 是用512，所以使用 512 threads可以有效率的平行展開處理，反而獲得最佳的處理效能。 >