Raytracing (2016q1 Homework 2A)

Reviewed by `HuangWenChen`

可再使用 OpenMP 多執行緒去執行
數據的地方可使用 ``` 方便閱讀
github尚未 push 上去

gprof學習

linux分析工具-gprof 使用Gnu gprof進行程式分析 gprof - linux program profiling tool

GNU gprof manual

gprof是一種GNU profiler**工具，**可以顯示程式進行的；"flatprofile"，包含程式執行時各個function的執行時間以及call次數，可以幫助分析程式找到效能瓶頸，加以優化

原理：

再編譯時加入 -pg ，編譯器會在每個function中加入mcount ( 或 “_mcount” 或 “__mcount”)的函數，所以每個function都會call mmcount ，由此可以做

使用步驟：

編譯時，加入 -pg參數
執行一次程式，會輸出預設為gmon.out的檔案
gmon.out會儲存程式執行的資料，function call次數，執行時間，及function互call的次數等等，使用gprof execution_file 分析這些數據，做成使用者好讀取的型式

執行：

老師幫我們在Makefile中包好了，如果再編譯時定義PROFILE為1，就會加入 -pg參數

ponsheng@X55VD:[~/embed/raytracing]$ make PROFILE=1
cc -std=gnu99 -Wall -O0 -g -pg -c -o objects.o objects.c
cc -std=gnu99 -Wall -O0 -g -pg -c -o raytracing.o raytracing.c
cc -std=gnu99 -Wall -O0 -g -pg -c -o main.o main.c
cc -o raytracing objects.o raytracing.o main.o -lm -pg
ponsheng@X55VD:[~/embed/raytracing]$ ./raytracing
Rendering scene
Done!
Execution time of raytracing() : 7.745289 sec
ponsheng@X55VD:[~/embed/raytracing]$ ls
AUTHORS LICENSE Makefile objects.c out.png raytracing raytracing.o
gmon.out main.c math-toolkit.h objects.h out.ppm raytracing.c result.txt
idx_stack.h main.o models.inc objects.o primitives.h raytracing.h use-models.h
ponsheng@X55VD:[~/embed/raytracing]$ gprof raytracing -b > result.txt

執行時間為7.74秒，有生成gmon.out，接著用gprof去分析，-b這個參數代表不顯示統計圖的文字敘述，再把結果導入 result.txt

(gprof 指令也可以加入data檔： gprof raytracing gmon.out ，這裡沒加是因為直接使用預設名稱)

執行結果：

<undefined style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: medium; letter-spacing: normal;">* flat profile</undefined>

呈現每個function所耗的時間以及call的次數，由此可知前幾個函式單一執行時間雖然小於0.01秒，卻都執行了1千萬次以上，所以如果要對單一函數做優化，當然是以他們優先，就算只有些微優化，再總時間上就會很明顯的影響：

每次結果，執行時間排序都有些微不同，但是dot_product都是最久的，其餘大於1千萬calls且時間較長的為

dot_product , subtract_vector , multiply_vector rayRectangularIntersection normalize add_vector raySphereIntersection

Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
28.47 1.34 1.34 69646433 0.00 0.00 dot_product
14.23 2.01 0.67 56956357 0.00 0.00 subtract_vector
11.26 2.54 0.53 31410180 0.00 0.00 multiply_vector
8.92 2.96 0.42 13861875 0.00 0.00 rayRectangularIntersection
7.65 3.32 0.36 10598450 0.00 0.00 normalize
6.16 3.61 0.29 17836094 0.00 0.00 add_vector
5.10 3.85 0.24 13861875 0.00 0.00 raySphereIntersection
3.61 4.02 0.17 4620625 0.00 0.00 ray_hit_object
3.40 4.18 0.16 17821809 0.00 0.00 cross_product
1.91 4.27 0.09 4221152 0.00 0.00 multiply_vectors
1.91 4.36 0.09 1 0.09 4.71 raytracing
1.70 4.44 0.08 1048576 0.00 0.00 ray_color
…

疑點：我看manual裡，有說到mcount跟profil這兩個內部function，但是這裡卻沒有顯示出來

<undefined style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: medium; letter-spacing: normal;">* call graph</undefined>

*函式叫用表，左邊為function編號，右邊為有叫用此function的parent，編號可以在最後的索引中找到相對函式

Call graph
granularity: each sample hit covers 2 byte(s) for 0.22% of 4.49 seconds
index % time self children called name
<spontaneous>
[1] 100.0 0.00 4.49 main [1]
0.05 4.43 1/1 raytracing [2]
0.01 0.00 1/1 delete_sphere_list [21]
0.00 0.00 3/3 append_sphere [30]
0.00 0.00 3/3 append_rectangular [29]
0.00 0.00 2/2 append_light [31]
0.00 0.00 1/1 write_to_ppm [35]
0.00 0.00 1/1 delete_rectangular_list [33]
0.00 0.00 1/1 delete_light_list [32]
0.00 0.00 1/1 diff_in_second [34]
0.05 4.43 1/1 main [1]
[2] 99.8 0.05 4.43 1 raytracing [2]
0.10 4.17 1048576/1048576 ray_color [3]
0.02 0.14 1048576/1048576 rayConstruction [15]
0.00 0.00 1/1 calculateBasisVecto
…

用虛線切開的算一個entry，一個entry代表所有跟某個function有關的呼叫，我們用第二個來看，以[X]開頭的這行是primary line，在name欄位的就是這個entry的主角，以上為callers line，以下為subroutine line(children)

用第二個entry舉例：

primary line:
index % time self children called name
[2] 99.8 0.05 4.43 1 raytracing [2]

%time:佔所有時間比例(含call function)

self:執行時間(不含call) =flat profile的時間

children:function call時間

called被呼叫幾次(如果有兩個數字"A+B" A是被外部呼叫，B是遞迴呼叫(自己呼叫)

callers line:
index % time self children called name
0.05 4.43 1/1 main [1]

self&childen:同primay

called:呼叫raytracing的函式，佔的比例(不包含遞迴)

name:如果caller無法被定義，就印<spontaneous>，ex.main

subroutine line
index % time self children called name
0.10 4.17 1048576/1048576 ray_color [3]
0.02 0.14 1048576/1048576 rayConstruction [15]
0.00 0.00 1/1 calculateBasisVecto

self:由父函式呼叫的子函式總執行時間

children:由父函式呼叫時，子函式呼叫別人所花的總時間

called:子函式被父函式呼叫的比例

要閱讀這個graph其實不容易，如果想要簡易明瞭的_gprof2dot_

https://github.com/jrfonseca/gprof2dot

同學共筆

<undefined style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: medium; letter-spacing: normal;">* Index by function name</undefined>

函式編號

Index by function name
[11] add_vector (math-toolkit.h) [22] fresnel (raytracing.c) [15] rayConstruction (raytracing.c)
[31] append_light [24] idx_stack_empty (idx_stack.h) [5] rayRectangularIntersection (raytracing.c)
[29] append_rectangular [27] idx_stack_init (idx_stack.h) [7] raySphereInterse
…

以下為沒有加入-pg編譯的執行時間

Execution time of raytracing() : 4.156151 sec

目標就是把時間降低吧~

了解raytracing

math-toolkit.h：

由gprof可知raytracing的熱點為： dot_product , subtract_vector , multiply_vector rayRectangularIntersection normalize add_vector raySphereIntersection

math-toolkit.h：

dot_product , subtract_vector , multiply_vector normalize add_vector

raytracing.c：

rayRectangularIntersection raySphereIntersection ->無規則非迴圈

優化方式：

原版

執行時間：

Execution time of raytracing() : 4.156151 sec

gprof:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
28.47 1.34 1.34 69646433 0.00 0.00 dot_product
14.23 2.01 0.67 56956357 0.00 0.00 subtract_vector
11.26 2.54 0.53 31410180 0.00 0.00 multiply_vector
8.92 2.96 0.42 13861875 0.00 0.00 rayRectangularIntersection
7.65 3.32 0.36 10598450 0.00 0.00 normalize
6.16 3.61 0.29 17836094 0.00 0.00 add_vector
5.10 3.85 0.24 13861875 0.00 0.00 raySphereIntersection
3.61 4.02 0.17 4620625 0.00 0.00 ray_hit_object
3.40 4.18 0.16 17821809 0.00 0.00 cross_product
1.91 4.27 0.09 4221152 0.00 0.00 multiply_vectors
1.91 4.36 0.09 1 0.09 4.71 raytracing
1.70 4.44 0.08 1048576 0.00 0.00 ray_color
…

Loop Unrolling

為了減少loop的預測錯誤，展開loop內容，減少loop次數，會因此而縮短執行時間，卻也會讓程式體積變大，是一種以空間換取時間的方式

舉例：dot_product

double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
for (int i = 0; i < 3; i++)
dp += v1[i] * v2[i];
return dp;
}

把for loop拆開：

double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
dp += v1[0] * v2[0];
dp += v1[1] * v2[1];
dp += v1[2] * v2[2];
return dp;
}

能實現優化的函式：add_vector, subtract_vector, multiply_vectors, multiply_vector , dot_product

執行時間：

Execution time of raytracing() : 2.887629 sec

gprof分析

% cumulative self self total
time seconds seconds calls s/call s/call name
31.35 0.99 0.99 69646433 0.00 0.00 dot_product
13.30 1.41 0.42 10598450 0.00 0.00 normalize
11.72 1.78 0.37 13861875 0.00 0.00 rayRectangularIntersection
10.45 2.11 0.33 13861875 0.00 0.00 raySphereIntersection
5.86 2.30 0.19 56956357 0.00 0.00 subtract_vector
5.70 2.48 0.18 31410180 0.00 0.00 multiply_vector
4.75 2.63 0.15 17821809 0.00 0.00 cross_product
4.59 2.77 0.15 4620625 0.00 0.00 ray_hit_object
3.80 2.89 0.12 1048576 0.00 0.00 ray_color
3.48 3.00 0.11 17836094 0.00 0.00 add_vector

在計組有學過prediction，記得loop的prediction會預測forward，所以可以減少猜錯，或是用1-2 bits，去記憶forward或backward，但是因為這裡只有loop3次，沒辦法很確定看出在-O0時式採用哪一種，所以要用一個大一點的for loop來驗證

驗證

如果要將大forloop做loop unrolling，可以使用Duff’s device，會是一個較簡潔的辦法

Loop unrolling, Duff’s device

** inline 函數**

C99的inline function

在 math-toolkit.h 中，每個函數前面都有 static inline

Static:
靜態函數
因為這個.h檔可能被多個檔案引用，若不設為static，會造成在linking成執行檔時發生找到多個重複function的錯誤，所以宣告為靜態數static，只讓function在header裡能被找到，減少錯誤，
Inline:
inline 是一個keyword，提醒compiler可以將function本體直接填入呼叫該function的位置
優點為:
減少呼叫function的overhead，如參數傳遞、回傳值處理等。
切換到function會jump，也就是會有branch的行為。因為CPU的pipeline和cache的特性，會影響效能。
經由代換程式碼，也許可以增加最佳化的可能性。

知道了 inline 是建議編輯器在函數很小的時候可以直接內嵌進程式碼，藉此減少呼叫函數的 overhead 。

因為是「建議」，所以編輯器可以忽略你的建議，此外，這個case老師規定編器不能開最佳化，要用-O0，這時即使被宣告inline，任何函數都不會被展開，所以要強制編譯器 inline

在程式碼中加入 __attribute__ ((always_inline)) 來達到強制 inline

double attribute ((always_inline)) dot_product(const double *v1, const double *v2)

能實現優化的函式：all

執行時間：

Execution time of raytracing() : 4.050906 sec

gprof分析

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
30.11 1.27 1.27 13861875 0.00 0.00 rayRectangularIntersection
20.62 2.14 0.87 13861875 0.00 0.00 raySphereIntersection
12.09 2.65 0.51 2110576 0.00 0.00 compute_specular_diffuse
8.06 2.99 0.34 2110576 0.00 0.00 localColor

跟原版相比math-toolkit.h裡的funtion都不見了，其餘的時間都增加了，明顯就是展開後，就不算是function call了，時間都分攤在caller上，所以減少的時間就是function call造成的overhead

**POSIX Thread **

使用SIMD指令

OPENMP

作業目標

以 make PROFILE=1 重新編譯程式碼，並且學習 gprof
參考使用 GNU gprof 進行 Linux 平台的程式分析
以 gprof 指出效能瓶頸，並且著手改寫檔案 math-toolkit.h 在內的函式實做，充分紀錄效能差異在共筆
注意: 請勿更動編譯器最佳化選項 -O0 (關閉最佳化)
檔案 math-toolkit.h 定義的若干數學操作函式很重要，可參考 SIMD optimized dot and cross product functions 和 2015q3 Homework #1 Ext
可善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作
將你的觀察、分析，以及各式效能改善過程，並善用 gnuplot 製圖，紀錄於「作業區」，記得標注「開發紀錄(A)」

Raytracing (2016q1 Homework 2A)

Reviewed by HuangWenChen

gprof學習

Rendering scene

了解raytracing

優化方式：

作業目標

Reviewed by `HuangWenChen`