# 2017q1 Homework1 (raytracing) contributed by <`rayleigh0407`> ### Reviewed by `twzjwang` - 可使用圖表比較不同實作方法的結果,並與編譯器優化比較 - 可以刪除多餘的空行及已註解的程式碼 - 如果有時間可以嘗試其他方法加速 ## 開發環境 ```shell $lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 94 Model name: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz Stepping: 3 CPU MHz: 800.000 CPU max MHz: 4200.0000 CPU min MHz: 800.0000 BogoMIPS: 8016.72 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 8192K NUMA node0 CPU(s): 0-7 ``` >中英文間記得用空白隔開~ >[name=課程助教][color=red] ## 前置作業 ### gprof gprof 是一個可以協助你開發程式的工具, 它可以檢測你的程式執行時間的分佈, 在 GNU gprof 的簡介就提到 *This manual describes the gnu profiler, gprof, and how you can use it to determine which parts of a program are taking most of the execution time. We assume that you know how to write, compile, and execute programs...* 在編譯的時候, 你可以加入參數```-pg```, 接著執行過程式後會產生gmon.out檔, 接著只要輸入以下指令 ```shell gprof -b $(exec) gmon.out | less ``` -b 是將各種結果對應的說明刪除, 只留下執行分析的結果 less 可以讓你上下滾動查看輸出結果 ### raytracing ## 開發過程 ### 1. 原始版本 先下載原始碼 載完立即來測試看看 ```shell $ git clone https://github.com/sysprog21/raytracing $ cd raytracing $ ./raytracing # Rendering scene Done! Execution time of raytracing() : 2.224557 sec ``` 用 gprof 來看看程式執行的部份 ```shell $ make PROFILE=1(輸出gmon.out) $ ./raytracing $ gprof -b raytracing gmon.out | less ``` 執行的結果如下 ```shell Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 20.23 0.37 0.37 69646433 0.00 0.00 dot_product 18.04 0.70 0.33 56956357 0.00 0.00 subtract_vector 12.85 0.94 0.24 13861875 0.00 0.00 rayRectangularIntersection 9.30 1.11 0.17 4620625 0.00 0.00 ray_hit_object 8.20 1.26 0.15 31410180 0.00 0.00 multiply_vector 6.29 1.37 0.12 17836094 0.00 0.00 add_vector 6.29 1.49 0.12 13861875 0.00 0.00 raySphereIntersection 5.47 1.59 0.10 10598450 0.00 0.00 normalize 4.92 1.68 0.09 17821809 0.00 0.00 cross_product 1.64 1.71 0.03 4221152 0.00 0.00 multiply_vectors 1.09 1.73 0.02 2110576 0.00 0.00 compute_specular_diffuse 1.09 1.75 0.02 2110576 0.00 0.00 localColor 1.09 1.77 0.02 1241598 0.00 0.00 protect_color_overflow 1.09 1.79 0.02 1048576 0.00 0.00 rayConstruction 0.82 1.80 0.02 3838091 0.00 0.00 length 0.55 1.81 0.01 1241598 0.00 0.00 refraction 0.55 1.82 0.01 1048576 0.00 0.00 ray_color 0.27 1.83 0.01 2558386 0.00 0.00 idx_stack_empty 0.27 1.83 0.01 1204003 0.00 0.00 idx_stack_push 0.00 1.83 0.00 2520791 0.00 0.00 idx_stack_top 0.00 1.83 0.00 1241598 0.00 0.00 reflection 0.00 1.83 0.00 1048576 0.00 0.00 idx_stack_init 0.00 1.83 0.00 113297 0.00 0.00 fresnel 0.00 1.83 0.00 37595 0.00 0.00 idx_stack_pop 0.00 1.83 0.00 3 0.00 0.00 append_rectangular 0.00 1.83 0.00 3 0.00 0.00 append_sphere 0.00 1.83 0.00 2 0.00 0.00 append_light 0.00 1.83 0.00 1 0.00 0.00 calculateBasisVectors 0.00 1.83 0.00 1 0.00 0.00 delete_light_list 0.00 1.83 0.00 1 0.00 0.00 delete_rectangular_list 0.00 1.83 0.00 1 0.00 0.00 delete_sphere_list 0.00 1.83 0.00 1 0.00 0.00 diff_in_second 0.00 1.83 0.00 1 0.00 1.83 raytracing 0.00 1.83 0.00 1 0.00 0.00 write_to_ppm ... ``` 下半部份主要為函式呼叫的分支(即一個函式呼叫了哪些函式)和統計次數 由此可看出, 佔用大部分時間的都是一些小函式( dot_product 不到十行) 試著從這些小地方開始著手 ### 2. 測試1 Loop unrolling 這是 math-toolkit.h 中的 dot_product() 和 subtract_vector() ```c= static inline void subtract_vector(const double *a, const double *b, double *out) { for (int i = 0; i < 3; i++) out[i] = a[i] - b[i]; } ... static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; for (int i = 0; i < 3; i++) dp += v1[i] * v2[i]; return dp; } ``` 可以發現, 原開發者早就考慮到他們佔用相當多的時間, 便宣告成 inline function 以提昇執行速度和減少記憶體的佔用([關於inline function](http://www.greenend.org.uk/rjk/tech/inline.html)), 但程式呼叫次數太多了, 佔用時間還是相對地大 首先我先這樣修改: ```c= static inline void subtract_vector(const double *a, const double *b, double *out) { out[0] = a[0] - b[0]; out[1] = a[1] - b[1]; out[2] = a[2] - b[2]; } ... static inline double dot_product(const double *v1, const double *v2) { dp += v1[0] * v2[0]; dp += v1[1] * v2[1]; dp += v1[2] * v2[2]; return dp; } ``` 將函式拆開的用意是, 在每次執行一次迴圈時, 都會檢查是否要跳出迴圈, 以組語來說就類似 jne jge 等等...(可參考[x86 Assembly Guide](http://www.cs.virginia.edu/~evans/cs216/guides/x86.html)), 使用 loop unrolling, 每次執行函式就會少一兩個 instructions, 多次使用後節省的時間也是很可觀的 我們來看執行結果 **origin** ```shell Performance counter stats for './raytracing' (10 runs): 22,9075 cache-misses # 23.203 % of all cache refs ( +- 16.92% ) 98,7278 cache-references ( +- 16.48% ) 194,6446,2315 instructions # 2.21 insns per cycle ( +- 0.00% ) 88,2142,5404 cycles ( +- 0.93% ) 2.251422149 seconds time elapsed ( +- 0.92% ) ``` **loop unrolling** ```shell Performance counter stats for './raytracing' (10 runs): 13,3195 cache-misses # 22.574 % of all cache refs ( +- 9.44% ) 59,0030 cache-references ( +- 9.96% ) 147,1677,2511 instructions # 1.86 insns per cycle ( +- 0.00% ) 79,0934,9083 cycles ( +- 1.25% ) 2.019096528 seconds time elapsed ( +- 1.26% ) ``` 雖然 cache-miss 稍微地下降而已, 但是 instrcution 足足少了 **24.39%**, 執行時間也隨之減少 **10.33%**, 有時候看似細微的地方, 也能牽扯到整個大局 執行 gprof: ```shell Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 18.01 0.23 0.23 69646433 0.00 0.00 dot_product 13.21 0.39 0.17 56956357 0.00 0.00 subtract_vector 11.21 0.53 0.14 10598450 0.00 0.00 normalize 10.81 0.67 0.14 13861875 0.00 0.00 rayRectangularIntersection 8.81 0.78 0.11 4620625 0.00 0.00 ray_hit_object 7.60 0.87 0.10 31410180 0.00 0.00 multiply_vector 7.60 0.97 0.10 13861875 0.00 0.00 raySphereIntersection 5.20 1.03 0.07 17836094 0.00 0.00 add_vector 5.20 1.10 0.07 17821809 0.00 0.00 cross_product 3.60 1.14 0.05 4221152 0.00 0.00 multiply_vectors 2.80 1.18 0.04 2110576 0.00 0.00 compute_specular_diffuse 1.60 1.20 0.02 1048576 0.00 0.00 ray_color 1.20 1.21 0.02 2110576 0.00 0.00 localColor 0.80 1.22 0.01 2558386 0.00 0.00 idx_stack_empty 0.80 1.23 0.01 2520791 0.00 0.00 idx_stack_top 0.80 1.24 0.01 1 0.01 0.01 delete_sphere_list 0.80 1.25 0.01 1 0.01 1.24 raytracing 0.00 1.25 0.00 3838091 0.00 0.00 length 0.00 1.25 0.00 1241598 0.00 0.00 protect_color_overflow 0.00 1.25 0.00 1241598 0.00 0.00 reflection 0.00 1.25 0.00 1241598 0.00 0.00 refraction 0.00 1.25 0.00 1204003 0.00 0.00 idx_stack_push 0.00 1.25 0.00 1048576 0.00 0.00 idx_stack_init 0.00 1.25 0.00 1048576 0.00 0.00 rayConstruction 0.00 1.25 0.00 113297 0.00 0.00 fresnel 0.00 1.25 0.00 37595 0.00 0.00 idx_stack_pop 0.00 1.25 0.00 3 0.00 0.00 append_rectangular 0.00 1.25 0.00 3 0.00 0.00 append_sphere 0.00 1.25 0.00 2 0.00 0.00 append_light 0.00 1.25 0.00 1 0.00 0.00 calculateBasisVectors 0.00 1.25 0.00 1 0.00 0.00 delete_light_list 0.00 1.25 0.00 1 0.00 0.00 delete_rectangular_list 0.00 1.25 0.00 1 0.00 0.00 diff_in_second 0.00 1.25 0.00 1 0.00 0.00 write_to_ppm ``` dot_product 下降了0.14s substact_vector 下降了0.16s 那既然如此, 乾脆把 math-toolkit 內的迴圈都做 loop unrolling 看看 執行結果: ```shell Performance counter stats for './raytracing' (10 runs): 11,9138 cache-misses # 23.585 % of all cache refs ( +- 2.98% ) 50,5147 cache-references ( +- 10.63% ) 127,3165,0289 instructions # 1.74 insns per cycle ( +- 0.00% ) 73,2732,5074 cycles ( +- 0.31% ) 1.869817600 seconds time elapsed ( +- 0.31% ) ``` 相較於 origin instrcution 減少 **34.59%** time elapsed 減少 **16.95%** 以上所有測試皆通過圖形測試 ```shell $make check # Rendering scene Done! Execution time of raytracing() : 4.027406 sec Verified OK ``` ### 3. 測試2 平行化 試著平行化迴圈 math-toolkit.h 內的函式皆為 inline function 實行平行化可能沒辦法在編譯時展開 先看看 raytracing.c ```c= void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height) { ... for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { double r = 0, g = 0, b = 0; /* MSAA */ for (int s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(d, u, v, w, i * factor + s / factor, j * factor + s % factor, view, width * factor, height * factor); ... } ``` 在 raytracing() 內 是一次畫一個點去作圖的 看能不能針對這巢狀迴圈做平行化 在這之前先考慮 data dependency ... 看了很久才發現資料沒有任何資料寫入衝突(可以看 function parameter 來過濾, 如果宣告為 const 則此資料不會被更動) 那就來實做平行化 #### Pthread 將外層函式做平行化 ```c= int main(...) { ... pthread_t *thread_handles; thread_handles = (pthread_t*)malloc(sizeof(pthread_t) * thread_num); for(int i = 0; i < thread_num; i++) { each_rank[i] = i; pthread_create(&thread_handles[i], NULL, raytracing_pthread, (void *)&each_rank[i]); } } void *raytracing_pthread(void *rank) { int my_rank = *(int*)rank; ... int start = (height / THREAD_NUMBER) * my_rank; int end = (height / THREAD_NUMBER) * (my_rank + 1); for (int j = start; j < end; j++) { for (int i = 0; i < width; i++) { ... } } } ``` 執行結果: 為了方便分配工作區域, 以2的倍數來分配 雙核: ```shell # Rendering scene Thread number : 2 Done! Execution time of raytracing() : 1.110519 sec ``` 四核: ```shell # Rendering scene Thread number : 4 Done! Execution time of raytracing() : 0.816143 sec ``` 八核: ```shell # Rendering scene Thread number : 8 Done! Execution time of raytracing() : 0.506245 sec ``` 效能成長沒有想像中好 換個方式分配看看 ```c= for (int j = my_rank; j < height; j+=THREAD_NUMBER) { ``` 執行結果: 單執行序: ```shell Performance counter stats for './raytracing' (10 runs): 11,0062 cache-misses # 23.736 % of all cache refs ( +- 2.04% ) 46,3689 cache-references ( +- 5.54% ) 127,2937,7767 instructions # 1.75 insns per cycle ( +- 0.00% ) 72,9304,6745 cycles ( +- 0.12% ) 1.865169660 seconds time elapsed ( +- 0.23% ) ``` 雙核: ```shell Performance counter stats for './raytracing' (10 runs): 11,0145 cache-misses # 18.638 % of all cache refs ( +- 2.23% ) 59,0976 cache-references ( +- 5.39% ) 127,2961,2413 instructions # 1.58 insns per cycle ( +- 0.00% ) 80,5269,8888 cycles ( +- 0.07% ) 1.126548896 seconds time elapsed ( +- 0.09% ) ``` 四核: ```shell Performance counter stats for './raytracing' (10 runs): 10,5345 cache-misses # 15.362 % of all cache refs ( +- 3.07% ) 68,5756 cache-references ( +- 2.35% ) 127,2978,3539 instructions # 1.47 insns per cycle ( +- 0.00% ) 86,4925,3375 cycles ( +- 1.49% ) 0.592542459 seconds time elapsed ( +- 2.45% ) ``` 八核: ```shell Performance counter stats for './raytracing' (10 runs): 11,4074 cache-misses # 14.391 % of all cache refs ( +- 3.90% ) 79,2670 cache-references ( +- 1.85% ) 127,3064,3855 instructions # 1.21 insns per cycle ( +- 0.00% ) 105,0530,6169 cycles ( +- 0.09% ) 0.354254347 seconds time elapsed ( +- 0.69% ) ``` 效能提昇不少 幸好上學期有修過關於平行化的課程 cache-misses 減少了不少 可能是 cpu 能用的空間變多了 但還是沒有依倍數成長 參考資料 - [GNU gprof](https://sourceware.org/binutils/docs/gprof/index.html#Top) - [使用Gnu gprof进行Linux平台下的程序分析](http://os.51cto.com/art/200703/41426.htm) - [Inline Functions In C](http://www.greenend.org.uk/rjk/tech/inline.html) - [GNU MP 6.1.2: Assembly Loop Unrolling](https://gmplib.org/manual/Assembly-Loop-Unrolling.html) - [行內函式(Inline function)](https://openhome.cc/Gossip/CppGossip/inlineFunction.html) - [x86 Assembly Guide](http://www.cs.virginia.edu/~evans/cs216/guides/x86.html) - [多處理器課程](nop) - [邱靖吉、林文盛、陳楷雯同學的共筆](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp) - [HackMD Command](https://hackmd.io/s/S1vAcMjFe#)