2016q3 Homework1 (ray_tracing)

# 2016q3 Homework1 (ray_tracing) contributed by <`jayfeng0225`> ###### tags: `jayfeng0225` ## 作業要求 * 在 GitHub 上 fork [raytracing](https://github.com/sysprog21/raytracing)，並思考如何改善程式效能 * 以 `make PROFILE=1` 重新編譯程式碼，並且學習 [gprof](https://sourceware.org/binutils/docs/gprof/) * 參考 [使用 GNU gprof 進行 Linux 平台的程式分析](http://os.51cto.com/art/200703/41426.htm) * 以 [gprof](https://sourceware.org/binutils/docs/gprof/) 指出效能瓶頸，並且著手改寫檔案 `math-toolkit.h` 在內的函式實做，充分紀錄效能差異在共筆 * 注意: 請勿更動編譯器最佳化選項 `-O0` (關閉最佳化) * 檔案 `math-toolkit.h` 定義的若干數學操作函式很重要，可參考 [Investigating SSE cross product performance](http://threadlocalmutex.com/?p=8)、[MathFu](https://google.github.io/mathfu/) 原始程式碼，以及 [2015q3 Homework #1 Ext](http://wiki.csie.ncku.edu.tw/embedded/2015q3h1ext) * 可善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作 * 將你的觀察、分析，以及各式效能改善過程，並善用 gnuplot 製圖，紀錄於「[作業區](https://hackmd.io/s/H1B7-hGp)」 --- ## Gprof分析參考 : [How to Profile a C program in Linux using GNU gprof](https://www.maketecheasier.com/profile-c-program-linux-using-gprof/) 根據文章所寫，只要我們在使用gcc編譯程式時加上 ''-pg'' ，此時gcc會再每個函數後面加上mcount的函數，用來紀錄每個函式消耗的時間。並且再執行程式後產生==gmon.out==檔。 ` $ ./gprof raytracing gmon.out > prof_output ` 就可以在prof_output檔案內看到各個函數的執行時間。由於gprof會在函數後面加上mcount，因此會增加整體程式的執行時間首先我們比較有無使用gprof的時間差異：未使用gprof : * Execution time of raytracing() : 4.734594 sec 使用gprof後 : * Execution time of raytracing() : 9.201828 sec 接著我們使用gprof來觀察哪些函式執行時間最長執行結果如下： ![](https://i.imgur.com/Zx69ouq.png) 由此可知執行時間主要卡在==dot_prodoct==與==subtract_vector==兩個函數。因此我們可以從dot_product與subtract_vector兩個函式著手進行修改 ### Gprof 欄位介紹參考網站 : [gprof - linux program profiling tool](http://awkzone.blogspot.tw/2010/11/gprof.html) * % time : 執行時間所佔的比例 * cumulative seconds : 累積執行時間 * self seconds : 函式本身的執行時間 * calls : 函式被呼叫的次數 * self ms/call: 每次被呼叫時平均花費了多少時間執行 * total ms/call: 每次被呼叫時，平均花費了多時間執行函式和函式中呼叫的routine --- ## 使用Graphviz產生函式關係圖參考網站 : [2016 HW2 開發紀錄(A)](https://embedded2016.hackpad.com/ep/pad/static/pc1Em0kW5dg) 使用[Gprof2dot](https://github.com/jrfonseca/gprof2dot/blob/master/gprof2dot.py)來將gmon.out轉成dot檔再產生函數相對應關係圖首先需要先安裝python與graphviz ``` $ sudo apt-get install python graphviz $ sudo apt-get install python-pip ``` 下載[Standalone scripts](https://raw.githubusercontent.com/jrfonseca/gprof2dot/master/gprof2dot.py)放在目錄底下，並修改其權限 ``` $ sudo chmod 744 gprof2dot ``` 最後將執行步驟直接寫在Makefile ``` plot : $(EXEC) ./raytracing gprof -b raytracing gmon.out > prof_output gprof ./raytracing | ./gprof2dot.py | dot -Tpng -o output.png ``` --- ## Optimization ### Loop Unrolling [Loop Unrolling](https://zh.wikipedia.org/wiki/%E5%BE%AA%E7%8E%AF%E5%B1%95%E5%BC%80):循環展開，英文中稱（Loop unwinding或loop unrolling），是一種犧牲程序的尺寸來加快程序的執行速度的優化方法。可以由程式設計師完成，也可由編譯器自動優化完成。循環展開最常用來降低循環開銷，為具有多個功能單元的處理器提供指令級並行。也有利於指令流水線的調度。 #### 範例： ```clike= for (i = 1; i <= 60; i++) a[i] = a[i] * b + c; ``` 可以如此循環展開： ```clike= for (i = 1; i <= 60; i+=3) { a[i] = a[i] * b + c; a[i+1] = a[i+1] * b + c; a[i+2] = a[i+2] * b + c; } ``` 首先針對這Loop Unrolling做優化，先找到dot_product與subtract_vector程式碼 ```clike= double dot_product(const double *v1, const double *v2) { double dp = 0.0; for (int i = 0; i < 3; i++) dp += v1[i] * v2[i]; return dp; } void subtract_vector(const double *a, const double *b, double *out) { for (int i = 0; i < 3; i++) out[i] = a[i] - b[i]; } ``` 此時執行時間為 * Execution time of raytracing() : 4.715642 sec * 觀察每個函數的執行時間： ![](https://i.imgur.com/TPuhUCx.png) 將原來for迴圈的部份改為unrolling的型式 ```clike= double dot_product(const double *v1, const double *v2) { double dp = 0.0; dp = v1[0]*v2[0] + v1[1]*v2[1] + v1[2]*v2[2] ; return dp; } void subtract_vector(const double *a, const double *b, double *out) { out[0] = a[0] - b[0]; out[1] = a[1] - b[1]; out[2] = a[2] - b[2]; } ``` 結果大約減少1秒的執行時間 * Execution time of raytracing() : 3.759857 sec * 觀察每個函數的執行時間 ![](https://i.imgur.com/aAEE44Z.png) ==這裡可以觀察到，dot_product與subtract_vector的執行時間減少== ## OpenMP 參考網站 : [簡易的程式平行化](https://goo.gl/b3W7B3) 要使用OpenMP的話也需要修改Makefile的編譯參數 ``` CFLAGS = \ -std=gnu99 -Wall -O0 -g -fopenmp LDFLAGS = \ -lm -lgomp ``` 而要在程式內加入 ``` #include <omp.h> 在for loop前加入 #pragma omp parallel for ``` 因此先將dot_vector的unrolling寫法改回來並且再for loop前加入OpenMP的語法 ```clike= double dot_product(const double *v1, const double *v2) { double dp = 0.0; #pragma omp parallel for num_threads(64) for (int i = 0; i < 3; i++) dp += v1[i] * v2[i]; return dp; } ``` 結果卻發現執行時間變成 39秒！ * Execution time of raytracing() : 39.186028 sec 透過gprof觀察函數的執行時間 ![](https://i.imgur.com/4n9o6p6.png) 發現raytracing函式本身所佔用的時間變得很長因此可以推斷，raytracing()應該會多次呼叫dot_product這些函式而這些函式的for loop其實很短，將其平行化後，不斷呼叫反而增加了程式的執行時間。接著trace raytracing.c **發現raytracing()有nested for loop 或許可以利用OpenMP來做優化** 因此我們在for loop前加上 ``` # pragma omp parallel for num_threads(64) ``` 執行時間縮短為1.5秒 * Execution time of raytracing() : 1.539363 sec 但是產生的圖，再執行make check後檢查結果有錯誤參考其他同學的[作業紀錄](https://hackmd.io/MYQwpgnARsBMUFoBmAWAjADgSsBmLGSuArAgAwDsFKAJmmAGxJgbFA==?view)，發現有變數平行化的問題因此我們還需要將某些變數設定為private，以免產生race condition的問題 ``` #pragma omp parallel for num_threads(64) private(d) private(object_color) private(stk) ``` 最後執行的執行時間為1.4秒，且驗證的結果正確 * Execution time of raytracing() : 1.411351 sec ![](https://i.imgur.com/cnsFcJG.png) 這邊可以觀察到所有函式的執行時間都大大減少了產生的gprof檔案其函數關係圖為 ![](https://i.imgur.com/feHp5JE.png) 使用perf檢查cache-miss的情況 ``` Performance counter stats for './raytracing' (2 runs): 64,127,094 branch-misses ( +- 3.88% ) (56.93%) 318,180 cache-misses # 0.119 % of all cache refs ( +- 6.27% ) (57.19%) 267,926,726 cache-references ( +- 0.25% ) (57.42%) 48,145,492 L1-dcache-load-misses ( +- 30.98% ) (57.71%) 2,082,508 L1-dcache-store-misses ( +- 1.28% ) (57.57%) 1,692,077 L1-dcache-prefetch-misses ( +- 0.75% ) (57.21%) 5,946,610 L1-icache-load-misses ( +- 1.15% ) (56.81%) 7.391614307 seconds time elapsed ( +- 0.91% ) ``` --- ## Force inline always_inline Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified. 簡單來說，我們需要將inline function改成 ``__attribute__((always_inline))`` 如此一來，compiler就會辨別這是一定需要優化的function 優化後的執行時間能夠再減少約0.05秒 Execution time of raytracing() : 1.383203 最後用gnuplot產生的執行時間比較圖 ![](https://i.imgur.com/Nck8qGL.png) ## 參考資料 * [使用 GNU gprof 進行 Linux 平台的程式分析](http://os.51cto.com/art/200703/41426.htm) * [How to Profile a C program in Linux using GNU gprof](https://www.maketecheasier.com/profile-c-program-linux-using-gprof/) * [gprof - linux program profiling tool](http://awkzone.blogspot.tw/2010/11/gprof.html) * [2016 HW2 開發紀錄(A)](https://embedded2016.hackpad.com/ep/pad/static/pc1Em0kW5dg) * [簡易的程式平行化](https://goo.gl/b3W7B3)