2017q1 Homework1 (raytracing)

# 2017q1 Homework1 (raytracing) contributed by < `laochanlam` > ###### tags: `laochanlam` ### Reviewed by `Daichou` * Makefile中強制指派gcc版本是否為好事(Makefile:33)? * 應該討論OpenMP在不同核心數與執行序數的情況下的效能比較 * 應該嘗試使用不同的openmp schedule方式來測試效能 ## 開發環境 - Ubuntu 16.10 (64bit) ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 61 Model name: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz Stepping: 4 CPU MHz: 2400.878 CPU max MHz: 3000.0000 CPU min MHz: 500.0000 BogoMIPS: 4788.90 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 4096K NUMA node0 CPU(s): 0-3 ``` ## 相關連結 - [2017 年春季作業說明](https://hackmd.io/s/Bk-1zqIte#) - [2017q1 Homework1 (作業區)](https://hackmd.io/MYJjEYGZITgWgOwCMAmw4BYYwAxyQQKxwBmwGOkAHMDeEgGxA===?view) - [B02: raytracing作業要求](https://hackmd.io/s/HyuBWDwYl) - [raytracing Github連結 ( laochanlam ) ](https://github.com/laochanlam/raytracing) - [作業解說 video](https://www.youtube.com/watch?v=m1RmfOfSwno) - [參考實做程式的解說 video](https://www.youtube.com/watch?v=V_rLyzecuaE) - [課程進度與開放資源 Wiki](http://wiki.csie.ncku.edu.tw/sysprog/schedule) ## Raytracing 開發紀錄在看完解說 video 後，開始依作業的要求嘗試使用 gprof 分析 raytracing 的效能。因老師已經在Makefile中加入了 ```-pg``` 的指令，所以我們只需在make的時候輸入 ```$ make PROFILE=1``` 編譯後即可使用 gprof ，然後輸入 ```$ gprof raytracing | less``` 發現... ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name ``` 沒有結果！！！然後發現[FB討論區](https://www.facebook.com/groups/system.software2017/permalink/1325126694219363/)有人遇到了同樣的問題.. 傳說中好像是gcc版本的問題... ### 解決過程：先查看自己安裝的 gcc 版本 ```$ gcc -v``` ``` gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12) ``` 然後參考[ＦＢ討論區](https://www.facebook.com/groups/system.software2017/permalink/1325126694219363/)中翁同學的方法，安裝 gcc5 ```$ sudo apt-get install gcc-5``` ，然後在Makefile中把 ./raytracing 的輸出部分的 ``` $(CC)``` 改成 ```gcc-5 -no-pie``` 。 **解決了！** ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 25.75 0.70 0.70 69646433 0.00 0.00 dot_product 20.60 1.26 0.56 56956357 0.00 0.00 subtract_vector 9.93 1.53 0.27 13861875 0.00 0.00 rayRectangularIntersection 9.20 1.78 0.25 31410180 0.00 0.00 multiply_vector 7.36 1.98 0.20 17836094 0.00 0.00 add_vector 6.62 2.16 0.18 10598450 0.00 0.00 normalize 4.41 2.28 0.12 17821809 0.00 0.00 cross_product 3.68 2.38 0.10 4620625 0.00 0.00 ray_hit_object 3.13 2.47 0.09 13861875 0.00 0.00 raySphereIntersection 2.58 2.54 0.07 4221152 0.00 0.00 multiply_vectors 1.84 2.59 0.05 1048576 0.00 0.00 ray_color 1.47 2.63 0.04 2110576 0.00 0.00 compute_specular_diffuse 0.92 2.65 0.03 2520791 0.00 0.00 idx_stack_top 0.74 2.67 0.02 3838091 0.00 0.00 length 0.74 2.69 0.02 2110576 0.00 0.00 localColor 0.74 2.71 0.02 1 0.02 2.72 raytracing 0.37 2.72 0.01 1241598 0.00 0.00 refraction 0.00 2.72 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 2.72 0.00 1241598 0.00 0.00 protect_color_overflow 0.00 2.72 0.00 1241598 0.00 0.00 reflection 0.00 2.72 0.00 1204003 0.00 0.00 idx_stack_push 0.00 2.72 0.00 1048576 0.00 0.00 idx_stack_init 0.00 2.72 0.00 1048576 0.00 0.00 rayConstruction 0.00 2.72 0.00 113297 0.00 0.00 fresnel 0.00 2.72 0.00 37595 0.00 0.00 idx_stack_pop 0.00 2.72 0.00 3 0.00 0.00 append_rectangular 0.00 2.72 0.00 3 0.00 0.00 append_sphere 0.00 2.72 0.00 2 0.00 0.00 append_light 0.00 2.72 0.00 1 0.00 0.00 calculateBasisVectors 0.00 2.72 0.00 1 0.00 0.00 delete_light_list 0.00 2.72 0.00 1 0.00 0.00 delete_rectangular_list 0.00 2.72 0.00 1 0.00 0.00 delete_sphere_list 0.00 2.72 0.00 1 0.00 0.00 diff_in_second 0.00 2.72 0.00 1 0.00 0.00 write_to_ppm ``` 至於 ```-no-pie``` 是什麼這個晚點再來研究。接下來要正式進入主題了，先來理解一下整個專案的架構。由上邊 gporf 的輸出結果可以看得到 ```dot_product```及```subtract_vector```被呼叫得最多，所以從這兩個函式開始著手。來看看原來程式執行所需的時間吧！ ``` # Rendering scene Done! Execution time of raytracing() : 2.673156 sec ``` 大概是2.6秒左右。然後在執行得最多的函式 ```dot_product``` 及 ```subtract_vector``` 當中發現， ```C static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; for (int i = 0; i < 3; i++) dp += v1[i] * v2[i]; return dp; } ``` ```C static inline void subtract_vector(const double *a, const double *b, double *out) { for (int i = 0; i < 3; i++) out[i] = a[i] - b[i]; } ``` 可進行最簡單的 **loop unrolling** ，接著就把 math-toolkit.h 中的函式都用 loop unrolling的方法展開。 ## 優化１: loop unrolling ``` # Rendering scene Done! Execution time of raytracing() : 1.803709 sec ``` 效果拔群！變成1.8秒了！再來用 gprof 看一下最上邊的幾個函式。 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 18.10 0.28 0.28 69646433 0.00 0.00 dot_product 14.81 0.50 0.23 56956357 0.00 0.00 subtract_vector 13.17 0.70 0.20 13861875 0.00 0.00 rayRectangularIntersection 9.22 0.84 0.14 31410180 0.00 0.00 multiply_vector 7.24 0.95 0.11 17836094 0.00 0.00 add_vector 6.91 1.06 0.11 17821809 0.00 0.00 cross_product .......... ``` 雖然 call 的次數沒有變，可是總執行時間變少了。 ## 優化2: inline 因為在 math-toolkit.h 中看到 **static inline** 這個之前沒見過的關鍵字，所以在閱讀完inline的資料後，以及參考完[參考實做程式的解說 video](https://www.youtube.com/watch?v=V_rLyzecuaE) 後，得知 inline 原來在被關閉編譯器最佳化 ```-O0``` 後是不會被執行的，所以我們加入讓編譯器可以強執執行 inline function 的程式碼(詳見下方相關資料)試試看可不可以加快執行速度。 ``` # Rendering scene Done! Execution time of raytracing() : 1.590904 sec ``` 成功了！然後用 gprof 要驗証一下函式是否真的被展開。 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 32.87 0.46 0.46 13861875 0.00 0.00 rayRectangularIntersection 19.29 0.73 0.27 13861875 0.00 0.00 raySphereIntersection 15.00 0.94 0.21 2110576 0.00 0.00 compute_specular_diffuse 9.29 1.07 0.13 1048576 0.00 0.00 ray_color 5.72 1.15 0.08 4620625 0.00 0.00 ray_hit_object ....... ``` 從最上邊幾個花費時間最多的函式中，發現 math-toolkit.h 中的幾個函式都不見了！被展開了 ! ## 優化3: OpenMP 接下來嘗試使用 OpenMP 進行優化。在研究 **raytracing.c** 的時候，找到了以下程式碼。 ```C for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { double r = 0, g = 0, b = 0; /* MSAA */ for (int s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(d, u, v, w, i * factor + s / factor, j * factor + s % factor, view, width * factor, height * factor); if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres, lights, object_color, MAX_REFLECTION_BOUNCES)) { r += object_color[0]; g += object_color[1]; b += object_color[2]; } else { r += background_color[0]; g += background_color[1]; b += background_color[2]; } pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES; } } } ``` 乍看來每次 loop 都沒有相依性，所以嘗試對這個 for 進行平行化的處理。 ``` # Rendering scene Done! Execution time of raytracing() : 0.871998 sec ``` 然後 ```$ make check```就 fail 了... 果然事情並沒有這麼簡單。 <br> 在閱讀完 [king1224大大的共筆](https://hackmd.io/EYUwTAnMEQbAtBAZgdifALCkJ4EMwBmCeAVgA4QBGEDPWUjEUoA=#)後，發現了就算參數之間沒有相依性，也有可能在多個執行緒中被同時改變著，所以參考了其程式碼的寫法,把部份參數給于```private``` 的屬性。新版本： ```C int i,j,s; double r,g,b; #pragma omp parallel for private(i,s,r,g,b,object_color,stk,d) for ( j = 0; j < height; j++) { for ( i = 0; i < width; i++) { r = 0, g = 0, b = 0; /* MSAA */ for ( s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(d, u, v, w, i * factor + s / factor, j * factor + s % factor, view, width * factor, height * factor); if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres, lights, object_color, MAX_REFLECTION_BOUNCES)) { r += object_color[0]; g += object_color[1]; b += object_color[2]; } else { r += background_color[0]; g += background_color[1]; b += background_color[2]; } pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES; } } } } ``` 這次 verify 終於成功了，而且時間也被大量縮減。 ``` # Rendering scene Done! Execution time of raytracing() : 0.795355 sec ``` <br> 最後來個效能對比圖作結 ![Imgur](http://i.imgur.com/FSiSgOy.png =800x) --- ## 補充知識 - [x] gprof - [ ] POSIX Thread - [x] OpenMP - [x] software pipelining - [x] loop unrolling - [x] static inline - [x] marco --- ## 程式性能分析工具： gprof 是一個基於 Linux 下方便易用效能分析的工具，會輸出程式中 function 被 call 的次數及時間等重要資訊，如上方開發紀錄，但缺點是會加入一些額外的程式碼使執行時間變長。用法很簡單，只要在編譯時加入``` -pg ```的參數，然後```$ gprof [執行檔]```即可觀看分析資訊。 #### 參考及引用資料 [使用Gnu gprof进行Linux平台下的程序分析](http://os.51cto.com/art/200703/41426.htm) --- ## Loop unrolling 犧性程式碼的可讀性及行數來換取程式執行速度的一種優化方法。例子如下圖： ```C for (i = 1; i <= 60; i++) a[i] = a[i] * b + c; ``` 可以展開成 ```C for (i = 1; i <= 60; i+=3) { a[i] = a[i] * b + c; a[i+1] = a[i+1] * b + c; a[i+2] = a[i+2] * b + c; } ``` 這被稱作loop unrolling。 #### 參考及引用資料 [循环展开 Wiki](https://zh.wikipedia.org/wiki/%E5%BE%AA%E7%8E%AF%E5%B1%95%E5%BC%80) --- ## OpenMP 依我的理解是一套讓程式設計師可以簡單地實現程式多執行緒化的API。例子如下圖： ```C #include <omp.h> #include <stdio.h> #include <stdlib.h> void Test( int n ) { for( int i = 0; i < 10000; ++ i ) { //do nothing, just waste time } printf( "%d, ", n ); } int main(int argc, char* argv[]) { #pragma omp parallel for for( int i = 0; i < 10; ++ i ) Test( i ); system( "pause" ); } ``` 這段簡單的程式碼加入了 ```#include <omp.h>``` 及 ```#pragma omp parallel for```即完成了多執行緒的實現，十分輕便。輸出結果： ``` 0, 5, 1, 6, 2, 7, 3, 8, 4, 9, ``` 使用多執行緒處理程式可使程式的效能提高，但要小心有些程式碼並不能隨便更改執行順序。 > private 讓每個執行緒中，都有一份變數的複本，以免互相干擾。 Specifies that each thread should have its own instance of a variable. >shared 將變數設定為各執行緒共用（應該算是相對於 private 的）。 Specifies that one or more variables should be shared among all threads. <br> ```C #pragma omp parallel num_threads(THREAD_CNT) ``` 控制要進行平行化所用的執行緒數量。 ```C shared(height,width,factor) private(stk,d,object_color) ``` 把參數 ```height,width,factor``` 設定為被執行緒共用，把參數 ```stk,d,object_color``` 設定為執行緒緒手一份。 #### 參考及引用資料 [簡易的程式平行化方法－OpenMP（一）簡介](https://kheresy.wordpress.com/2006/06/09/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%E6%96%B9%E6%B3%95%EF%BC%8Dopenmp%EF%BC%88%E4%B8%80%EF%BC%89%E7%B0%A1%E4%BB%8B/) [簡易的程式平行化－OpenMP（二）語法說明](https://kheresy.wordpress.com/2006/09/13/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E4%BA%8C%EF%BC%89%E8%AA%9E%E6%B3%95%E8%AA%AA%E6%98%8E/) [簡易的程式平行化－OpenMP（五）變數的平行化](https://kheresy.wordpress.com/2006/09/22/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E4%BA%94%EF%BC%89-%E8%AE%8A%E6%95%B8%E7%9A%84%E5%B9%B3%E8%A1%8C%E5%8C%96/) [OpenMP Wiki](https://zh.wikipedia.org/zh-hk/OpenMP) --- ## inline 使用內嵌函數（ inline　function ）是一種加快程式執行速度的方法，當我們加入關鍵字時，就像是給編譯器"建議"，在編譯時把函數中的程式直接展開，以換取速度，省去了許多的呼叫函數的時間。注意: 加入了關閉編譯器最佳化 ```-O0``` 後編譯器是不會執行inline的。 #### 參考及引用資料 [内联函数：static inline 和 extern inline 的含义](http://www.cnblogs.com/pengyingh/articles/2405718.html) [內嵌函數（inline　function）筆記](https://dotblogs.com.tw/v6610688/2013/11/27/introduction_inline_function) [强制函数inline | 100个gcc小技巧](https://wizardforcel.gitbooks.io/100-gcc-tips/content/must-forceinline-function-code.html) ---