2017q1 Homework1 (raytracing)

# 2017q1 Homework1 (raytracing) contributed by <`ryanpatiency`> ### Reviewed by `zhanyangch` * 在標題簡介所使用的方法，例如：實作二，可改為實作二： loop 展開 * 可以嘗試使用平行化的技巧 ## 目標作業要求 [B02: raytracing](https://hackmd.io/s/HyuBWDwYl#) * 學習gprof [參考資料](http://awkzone.blogspot.tw/2010/11/gprof.html) * 以 gprof 指出效能瓶頸 * 著手改寫檔案 math-toolkit.h 的函式實做 * 紀錄效能差異 ## gprof 介紹 >學任何事最簡單就是從範例著手，以下是程式碼，請將它複製為`test.c` ``` void fun2() { for (int i = 0; i < 0xfffffff; i++); } void fun1() { for (int i = 0; i < 0x4; i++) fun2(); } int main() { fun1(); fun2(); } ``` >內容超簡單的，可以看到程式大部分的時間都花在 func2, 一共執行了五次， fun1 執行 fun2 四次， main 執行 fun1 一次接著請編譯：`$ gcc test.c -pg -o test` >-pg 代表 profile generate，產生對 gprof 有幫助的資料接著請執行：`$ ./test` >執行成功後會發現一檔案`gmon.out`，就是 gprof 所需的檔案接著打開 gprof `gprof test gmon.out > gprof.txt` > \> 符號代表將 gprof 的 output 輸出到 gprof.txt 檔案打開產生的 gprof.txt 你會看到： ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 101.27 3.15 3.15 5 0.63 0.63 fun2 0.00 3.15 0.00 1 0.00 2.52 fun1 ... ... index % time self children called name 0.63 0.00 1/5 main [2] 2.52 0.00 4/5 fun1 [3] [1] 100.0 3.15 0.00 5 fun2 [1] ----------------------------------------------- <spontaneous> [2] 100.0 0.00 3.15 main [2] 0.00 2.52 1/1 fun1 [3] 0.63 0.00 1/5 fun2 [1] ----------------------------------------------- 0.00 2.52 1/1 main [2] [3] 80.0 0.00 2.52 1 fun1 [3] 2.52 0.00 4/5 fun2 [1] ----------------------------------------------- ``` 先說明第一部分： ``` % cumulative self self total time seconds seconds calls s/call s/call name 101.27 3.15 3.15 5 0.63 0.63 fun2 0.00 3.15 0.00 1 0.00 2.52 fun1 ``` * `name`欄位，最後一欄，為所執行的函式的名字，`test`中只有 fun1 和 fun2， gprof 為按照佔程式執行時間排序，所以 fun2 在第一個。 * `time`欄位，對應到第三欄 `self seconds`，為本身(不包涵子程式)的執行時間佔全部時間的比重，第一個`101.27`代表佔了 101.27% 的執行時間，因為 gprof 是用取樣的方式，所以才超過了 100% * `seconds`欄位，有兩欄，`cumulative`是累積，分別是累積執行時間和自己的執行時間。 * `calls`呼叫次數 * `s/call`代表`seconds per call`，每次呼叫平均佔時。觀察 2.52 為 0.63 的四倍，可知道`self s/call`代表不包涵子函式的執行時間，`total s/call`相反第二部分說明 caller 和 callee 的關係，說明完了第一部分後，這部分我只挑重點講： ``` index % time self children called name 0.63 0.00 1/5 main [2] 2.52 0.00 4/5 fun1 [3] [1] 100.0 3.15 0.00 5 fun2 [1] ----------------------------------------------- <spontaneous> [2] 100.0 0.00 3.15 main [2] 0.00 2.52 1/1 fun1 [3] 0.63 0.00 1/5 fun2 [1] ----------------------------------------------- 0.00 2.52 1/1 main [2] [3] 80.0 0.00 2.52 1 fun1 [3] 2.52 0.00 4/5 fun2 [1] ----------------------------------------------- ``` 可看到第二部分分為三個小部分，分別有三個主角，主角是`time`欄位有數字者，主角的上排為 caller ，主角的下排為 callee * `time`欄位，程式 **和其子函式** 佔`main()`的時間比重(%)，注意是`main()`而不是`caller()` * `self`欄位，分為三部分 * 對主角而言，代表在`main`執行**主角本身**的總時間 * 對 caller 而言，代表 caller 執行**主角本身**的總時間，請對照看 caller 的 `called`欄位 * 對 callee 而言，代表主角執行 **callee 本身**的總時間，請對照看 callee 的 `called`欄位 >剩下部分請看 gprof.txt 的文字說明 #### 結論第二部分的東西最重要的就是 called 欄位，其他可由第一部分的數字推導出來。 #### 和 perf 的比較如果只是要找程式熱點的話，`perf record -e cpu-clock` 就可以了。 gprof 的用途應該是找出熱點程式的 seconds/call 和呼叫次數(程式架構) ## 修改前的程式效能 `$ gprof -b ./raytracing gmon.out > raytracing.gprof ` 可以看到下列資訊： ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 21.80 0.61 0.61 69646433 0.00 0.00 dot_product 19.30 1.15 0.54 56956357 0.00 0.00 subtract_vector 11.44 1.47 0.32 31410180 0.00 0.00 multiply_vector ... ``` 可得效能瓶頸前三名分別為 1. `dot_product()` 被呼叫 69646433 次`memccpy()` 2. `subtract_vector()`被呼叫 56956357 次`memccpy()` 3. `multiply_vector()`被呼叫 31410180 次`memccpy()` ## 實驗過程 ### 實作一觀看 math-toolkit.h ，可得上述實作如下： ```c double dot_product(const double *v1, const double *v2) { double dp = 0.0; for (int i = 0; i < 3; i++) dp += v1[i] * v2[i]; return dp; } ``` 網路上得到的[資料](http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/)是將 loop 的方法由遞增改為遞減。理由是 assembly code 較短： ``` 遞增 INC lpc ; Increment loop counter SUB lpc, #10 ; Compare loop counter to 10 BNZ loop ; Branch if loop counter not equal to 10 ``` ``` 遞減 DEC lpc ; Decrement loop counter BNZ loop ; Branch if non zero ``` 所以我更改`dot_product()`如下： ```c double dot_product(const double *v1, const double *v2) { double dp = 0.0; int i=2; do{ dp += v1[i] * v2[i]; }while(i--); return dp; } ``` 再跑一次 gprof: ``` % cumulative self self total time seconds seconds calls s/call s/call name 21.07 0.52 0.52 56956357 0.00 0.00 subtract_vector 17.42 0.95 0.43 69646433 0.00 0.00 dot_product 9.32 1.18 0.23 31410181 0.00 0.00 multiply_vector ``` 真是令人高興，原本第一名的`dot_product`變為第二名了。時間也從 0.61 變為 0.43 --- ### 實作二為了測試 for loop 的影響，我將 for loop 展開，同時保有原本的算式，更改`dot_product()`如下： ```clike double dot_product(const double *v1, const double *v2) { double dp = 0.0; dp+=v1[0]*v2[0]; dp+=v1[1]*v2[1]; dp+=v1[2]*v2[2]; return dp; } ``` 得到 gprof 如下： ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 21.02 0.46 0.46 69646433 0.00 0.00 dot_product 17.36 0.84 0.38 56956357 0.00 0.00 subtract_vector 9.59 1.05 0.21 31410180 0.00 0.00 multiply_vector ``` 發現展開 for loop 並無任何影響。 --- ### 實作三但如果減少 dp 和 i 的 init，和 assign , 修改 `dot_product()`如下 ```c double dot_product(const double *v1, const double *v2) { return v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2]; } ``` 得到 gprof 如下： ``` % cumulative self self total time seconds seconds calls s/call s/call name 28.14 1.53 1.53 148233778 0.00 0.00 subtract_vector 12.51 2.21 0.68 35651587 0.00 0.00 normalize 11.96 2.86 0.65 46157185 0.00 0.00 add_vector 9.38 3.37 0.51 28311552 0.00 0.00 raySphereIntersection 8.92 3.86 0.49 146229607 0.00 0.00 dot_product ``` 發現原本第一名的`dot_product`變為第五名了。但發現原本呼叫 6964-6433 次`memccpy()`，現在卻呼叫 1-4622-9607 次，差了兩倍多，可見 gprof 的 calls 並不可靠。 :::danger 如果工具「不可靠」，為什麼還會持續存在呢？與其用一兩句話描述，何不從其行為和限制去探討呢？請重新分析並提出你的解釋 --jserv ::: ## 參考資料 [gprof](http://awkzone.blogspot.tw/2010/11/gprof.html) [count down may be more efficient compare to increment](http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/)