2016q1 Homework2 (A) raytracing

# 2016q1 Homework2 (A) raytracing **作業要求 (A)** * 在 GitHub 上 fork [raytracing](https://github.com/embedded2016/raytracing)，並思考如何改善程式效能 * 以 make PROFILE=1 重新編譯程式碼，並且學習 [gprof](https://sourceware.org/binutils/docs/gprof/) * 參考 [使用 GNU gprof 進行 Linux 平台的程式分析](http://os.51cto.com/art/200703/41426.htm) * 以 [gprof](https://sourceware.org/binutils/docs/gprof/) 指出效能瓶頸，並且著手改寫檔案 math-toolkit.h 在內的函式實做，充分紀錄效能差異在共筆 * 注意: 請勿更動編譯器最佳化選項 -O0 (關閉最佳化) * 檔案 math-toolkit.h 定義的若干數學操作函式很重要，可參考 [SIMD optimized dot and cross product functions](http://tomjbward.co.uk/simd-optimized-dot-and-cross/) 和 [2015q3 Homework #1 Ext](http://wiki.csie.ncku.edu.tw/embedded/2015q3h1ext) * 可善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作 * 將你的觀察、分析，以及各式效能改善過程，並善用 gnuplot 製圖，紀錄於「[作業區](https://embedded2016.hackpad.com/2016q1-Homework-2-v37rXPJjRlW)」，記得標注「開發紀錄(A)」安裝開發工具： * $ sudo apt-get update * $ sudo apt-get install graphviz * $ sudo apt-get install imagemagick 再來是取得程式碼： * git clone [](https://github.com/embedded2016/raytracing.git)https://github.com/embedded2016/raytracing.git 嘗試編譯執行，結果如下： ``` # Rendering scene Done! Execution time of raytracing() : 3.285271 sec ``` **利用gprof來進行分析** 編譯執行： * make PROFILE=1 * ./raytracing 程式執行時間： ``` # Rendering scene Done! Execution time of raytracing() : 6.851365 sec ``` 再來是觀察gprof，先將結果輸出去一個檔案，方便之後對比，而且輸出去檔案裏的資料還有顏色，看了比較舒服，也方便查看 * gprof raytracing gmon.out > ori.txt * Flat profile(只貼上一部分）: * Each sample counts as 0.01 seconds. * % cumulative self self total * time seconds seconds calls s/call s/call name * 19.58 0.55 0.55 69646433 0.00 0.00 dot_product * 16.74 1.02 0.47 56956357 0.00 0.00 subtract_vector * 10.33 1.31 0.29 31410180 0.00 0.00 multiply_vector * 9.61 1.58 0.27 13861875 0.00 0.00 rayRectangularIntersection * 8.01 1.81 0.23 17836094 0.00 0.00 add_vector * 6.05 1.98 0.17 10598450 0.00 0.00 normalize * 5.70 2.14 0.16 13861875 0.00 0.00 raySphereIntersection * 5.70 2.30 0.16 4620625 0.00 0.00 ray_hit_object * 5.34 2.45 0.15 17821809 0.00 0.00 cross_product * 2.85 2.53 0.08 1048576 0.00 0.00 ray_color * 2.49 2.60 0.07 2520791 0.00 0.00 idx_stack_top * 2.14 2.66 0.06 4221152 0.00 0.00 multiply_vectors * 1.78 2.71 0.05 2110576 0.00 0.00 compute_specular_diffuse * 1.42 2.75 0.04 2110576 0.00 0.00 localColor 可以很明顯的看見dot_product佔了最多時間，被call了近7千萬次，在來是subtract也是佔了大部分時間。 **優化** 大致上瞭解後，就是來進行優化啦。首先嘗試 loop unrolling，將math-toolkit.h裏的所有for 展開。優點： * [分支預測](https://zh.wikipedia.org/wiki/%E5%88%86%E6%94%AF%E9%A0%90%E6%B8%AC)（**branch predictor）**失敗減少。 * 如果循環體內語句沒有數據相關，增加了並發執行的機會。 * 可以在執行時動態循環展開。 * 參考資料：[](https://zh.wikipedia.org/wiki/%E5%BE%AA%E7%8E%AF%E5%B1%95%E5%BC%80)https://zh.wikipedia.org/wiki/%E5%BE%AA%E7%8E%AF%E5%B1%95%E5%BC%80 以dot_product爲例子： ```clike= static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; dp += v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2]; return dp; } ``` 優化後的執行時間： ``` # Rendering scene Done! Execution time of raytracing() : 5.835674 sec ``` 降低了不少！再來是利用gprof分析 * gprof raytracing gmon.out > loop_unrolling.txt * time seconds seconds calls s/call s/call name * 16.79 0.26 0.26 69646433 0.00 0.00 dot_product * 15.14 0.49 0.23 56956357 0.00 0.00 subtract_vector * 9.87 0.64 0.15 13861875 0.00 0.00 rayRectangularIntersection * 9.22 0.78 0.14 17821809 0.00 0.00 cross_product * 8.23 0.90 0.13 17836094 0.00 0.00 add_vector * 6.58 1.00 0.10 13861875 0.00 0.00 raySphereIntersection * 6.58 1.10 0.10 10598450 0.00 0.00 normalize * 5.60 1.19 0.09 4620625 0.00 0.00 ray_hit_object 可以發現dot_product跟subtract_vector的時間都降低了不少！ **SIMD optimized** 嘗試利用SIMD指令來優化dot_product跟cross_product。這裏我先嘗試使用AVX指令來實做dot_product。會使用AVX而不是SSE是因爲它支援256bits，足夠可以儲存4個double，運算也能達到4個。使用AVX需要在編譯時加上 -mavx2，因此就到makefile加上-mavx2 ``` -std=gnu99 -Wall -O0 -g -mavx2 ``` 更改的程式碼如下： ```clike= double dot_product(const double *v1, const double *v2) { __m256d x = _mm256_loadu_pd((__m256d *)v1); __m256d y = _mm256_loadu_pd((__m256d *)v2); __m256d xy = _mm256_mul_pd(x,y); return (xy[0]+xy[1]+xy[2]); } ``` 執行結果如下： ``` # Rendering scene Done! Execution time of raytracing() : 9.886427 sec ``` 觀察gprof（使用loadu）： * gprof raytracing gmon.out > AVX_loadu.txt * time seconds seconds calls s/call s/call name * 48.12 2.01 2.01 69646433 0.00 0.00 dot_product 嗯時間加長了不少，上網查了一下發現loadu的效率不是很好，所以就不使用loadu，改用別的方法（換成直接set每個值）： ```clike= double dot_product(const double *v1, const double *v2) { __m256d x = _mm256_set_pd(0,v1[2],v1[1],v1[0]); __m256d y = _mm256_set_pd(0,v2[2],v2[1],v2[0]); __m256d xy = _mm256_mul_pd(x,y); return (xy[0]+xy[1]+xy[2]); } ``` 換成set後的結果如下（時間加快了不少，可見loadu的效率是真的不好）： ``` # Rendering scene Done! Execution time of raytracing() : 8.460185 sec ``` 觀察gprof（使用set）： * gprof raytracing gmon.out > AVX_set.txt * time seconds seconds calls s/call s/call name * 25.88 0.82 0.82 69646433 0.00 0.00 dot_product **Inline** * math-toolkit.h中只是建議compiler將function設定爲inline function，因爲是建議，所以compiler可以不管，因此就將它改成強制inline，以dot_product爲例子： ```clike= static inline __forceinline double dot_product(const double *v1, const double *v2) { double dp = 0.0; dp += v1[0] * v2[0]; dp += v1[1] * v2[1]; dp += v1[2] * v2[2]; return dp; } ``` 修改Makefile，加入： `-D__forceinline="__attribute__((always_inline))" \` **Thread 優化** 先來搞懂Thread：為了說明何謂 thread ，我們必須先瞭解 UNIX process 與 Mach task 、 thread 間的關係。在 UNIX 中，一個 process 包括了一個執行中的程式，和一些他所需的系統資源，諸如檔案描述表和位址空間等。但是在 Mach 中，一個 task 卻只包括了一些系統資源；而由thread 掌握了所有的執行活動。一個 Mach task 可能有任意多個 thread ，而 Mach 系統中所有的 thread 均屬於一些 task。屬於同一個 task 的所有 thread 共享該 task 所擁有的系統資源。因此， thread 實質上就是一個程式計數器、一個堆疊再加上一組暫存器。 UNIX 的一個 process 可以看成是一個只有一個 thread 的 Mach task。對 CPU 而言，產生一個 thread 是一件相對便宜的工作。另一方面，共享系統資源的 thread 跟獨佔系統資源的 process 比起來，thread 是也是相當節省記憶體的。 * 參考資料：[](http://www.csie.ntu.edu.tw/~r92094/c++/pthread.tx)http://www.csie.ntu.edu.tw/~r92094/c++/pthread.tx 再來就是學一些function： ``` pthread_t ：宣告 pthread_create ：建立thread pthread_equal ：判斷是thread是否相同 pthread_join：讓一個thread等待另一個指定thread直到它結束，也可以用來讓主程式等待thread ``` 實做： 研究原始碼了一段時間，決定在main.c中宣告thread來使用 ` pthread_t THREAD[4];` 再來是傳入raytracing的參數，由於參數多於一個，因此需要包裝成一個結構才能傳入，所以就另外建立一個結構放在raytracing.h，讓main.c跟raytracing.c都能使用： ```clike= typedef struct __ARG { uint8_t *pixels; color background; rectangular_node rectangulars; sphere_node spheres; light_node lights; const viewpoint *View; int row; int col; } arg; ``` 初始化結構： ```clike= arg data; data.lights = NULL; data.rectangulars = NULL; data.spheres = NULL; data.View = &view; data.col = COLS; data.row = ROWS; data.pixels = malloc(sizeof(unsigned char) * ROWS * COLS * 3); ``` 到了這裏發現程式會編譯不過，發現是use-models.h裏的append的問題，因爲這裏看不太懂code是怎樣運作的，所以不太會改，就用了別的方法，先將append需要用到的變數名稱宣告好，再進行append，最後再給data： ```clike= light_node lights = NULL; rectangular_node rectangulars = NULL; sphere_node spheres = NULL; #include "use-models.h" data.lights = lights; data.rectangulars = rectangulars; data.spheres = spheres; ``` 再來是建立thread： ```clike= for(int i=0; i<4; i++) { if(pthread_create(&THREAD[i],NULL,(void*)raytracing,(void*)&data)) { printf("ERROR thread create!\n"); pthread_exit(NULL); } } for(int i=0; i<4; i++) pthread_join(THREAD[i],NULL); ``` 修改在raytracing.c中的raytracing，將col 切成4等份，利用4個thread來做運算。修改完後發現各個thread會覆蓋掉之前還沒使用完畢的變數（就是切成4等份的col），在這裏卡了很久，決定將thread宣告在raytracing.h中，當作全域變數來使用。再利用pthread_equal來判斷目前是哪個thread，就用不同的變數來實做： ```clike= if(pthread_equal(pthread_self(),THREAD[0])) { start_j = 0; end_j = height / 4; } ``` 都修改完畢後，結果如下： ``` # Rendering scene Done! Execution time of raytracing() : 1.018773 sec ``` 時間降低到了1.01秒！