2017q1 Homework1 (raytracing)

contributed by < henry0929016816 >

Reviewed by `xdennisx`

沒有實作 force inline 的部份
平行化可以利用 pthread 去分析程式熱點所在，去了解用幾條 thread 和圖片裁切的方式才能達到最佳平行化
Git commit message 都有點太過簡略，應提到實做方式及實做後的影響，例如 commit a1295e42c72e301f4603115b828b56699fdf4726 就可以提到整體速度提升了多少

未優化

在拿到程式碼時，為了先知道原本的程式碼，運行的時間，所以我執行了 make 得到了執行檔，執行結果為：

# Rendering scene
Done!
Execution time of raytracing() : 2.636573 sec

可以看到花了 2.636573 秒執行此程式，然而這只能知道程式整體的執行時間，無法得知細部的內容，究竟程式再哪個環節執行的比較久，是無從得知的，所以我們運用了工具幫我們做分析。

效能分析

GNU gprof

利用 gprof 我們可以分析 raytracing 程式引入了哪些函式，這些函式的執行時間，以及函式調用的關聯性。透過輸入指令：
gprof raytracing gmon.out | less
期許得到程式效能分析的資訊，可惜什麼資訊也沒有，找出原因後發現原來是 make 的時候忘記給 PROFILE 初始值，導致沒有 gmon.out 檔，重新輸入： make PROFILE=1 ，再次執行raytracing

# Rendering scene
Done!
Execution time of raytracing() : 5.543456 sec

發現到執行的時間增為了兩倍，由於中間安插了一些程式碼所以執行時間增加了，但是我們可以得到一些有用的資訊，再次輸入：
gprof raytracing gmon.out | less

得到了：

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.09      0.47     0.47 69646433     0.00     0.00  dot_product
 19.29      0.90     0.43 56956357     0.00     0.00  subtract_vector
 10.99      1.15     0.25 13861875     0.00     0.00  rayRectangularIntersection
  8.53      1.34     0.19 10598450     0.00     0.00  normalize
  8.08      1.52     0.18 31410180     0.00     0.00  multiply_vector
  5.83      1.65     0.13 17836094     0.00     0.00  add_vector
  5.83      1.78     0.13  4620625     0.00     0.00  ray_hit_object
  4.71      1.88     0.11 13861875     0.00     0.00  raySphereIntersection
  4.49      1.98     0.10 17821809     0.00     0.00  cross_product
  2.69      2.04     0.06  1048576     0.00     0.00  ray_color
  2.24      2.09     0.05  2110576     0.00     0.00  localColor
  1.79      2.13     0.04  2110576     0.00     0.00  compute_specular_diffuse
  1.35      2.16     0.03  4221152     0.00     0.00  multiply_vectors
  1.35      2.19     0.03  2520791     0.00     0.00  idx_stack_top
  0.90      2.21     0.02        1     0.02     2.23  raytracing
  0.45      2.22     0.01  1241598     0.00     0.00  reflection
  0.45      2.23     0.01  1048576     0.00     0.00  rayConstruction
  0.00      2.23     0.00  3838091     0.00     0.00  length
  0.00      2.23     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      2.23     0.00  1241598     0.00     0.00  protect_color_overflow
  0.00      2.23     0.00  1241598     0.00     0.00  refraction
  0.00      2.23     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      2.23     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      2.23     0.00   113297     0.00     0.00  fresnel
  0.00      2.23     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      2.23     0.00        3     0.00     0.00  append_rectangular
  0.00      2.23     0.00        3     0.00     0.00  append_sphere
  0.00      2.23     0.00        2     0.00     0.00  append_light
  0.00      2.23     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      2.23     0.00        1     0.00     0.00  delete_light_list
  0.00      2.23     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      2.23     0.00        1     0.00     0.00  delete_sphere_list
  0.00      2.23     0.00        1     0.00     0.00  diff_in_second
  0.00      2.23     0.00        1     0.00     0.00  write_to_ppm

當然底下還有一些，關於函式詳細資訊，列出這個函式呼叫了幾個函式，然而由於文字的資訊讓人看的頭昏眼花，所以我參考了 hugikun999的共筆使用 graphviz 將 gprof 得到的資訊用圖像表示，
輸入：
gprof ./raytracing | gprof2dot | dot -T png -o output.png
產生以下的函式關聯圖：

透過以上的資訊我們可以分析出前三名最耗時的函式，dot_product 、subtract_vector 、 rayRectangularIntersection ，而這些函式就是這支程式的效能瓶頸，準備開始對它們進行優化。

優化方式

方法一： loop unrolling

觀察 math-toolkit.h 的dot_product 函式







double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp; 
}

裡面的 for 回圈只重複了三次，如果我們將回圈拔掉，改成以下型式：


 dp = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];

雖然程式的可讀性降低了，但是減少了分支預測錯誤的產生，也減少了 data dependency，
讓程式可以同時處理多一點資料，而不用等到上一筆資料的值確定後（原本的 i 值在做完 i++ 後才能確定下一次的 i 值），才執行下一筆，期望程式能降低執行時間，重新執行程式，得到結果：

# Rendering scene
Done!
Execution time of raytracing() : 3.166653 sec

原以為執行的時間會縮減，結果執行時間反而增長了，讓我感到很驚訝，利用 gprof 確認看看程式效能的變化：

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 25.68      0.48     0.48 56956357     0.00     0.00  subtract_vector
 12.31      0.71     0.23 31410180     0.00     0.00  multiply_vector
 10.70      0.91     0.20 13861875     0.00     0.00  rayRectangularIntersection
  7.76      1.06     0.15 17836094     0.00     0.00  add_vector
  7.22      1.19     0.14 69646433     0.00     0.00  dot_product
  6.96      1.32     0.13 13861875     0.00     0.00  raySphereIntersection
  6.96      1.45     0.13 10598450     0.00     0.00  normalize
  4.82      1.54     0.09 17821809     0.00     0.00  cross_product
  4.28      1.62     0.08  4620625     0.00     0.00  ray_hit_object
  4.28      1.70     0.08  1048576     0.00     0.00  ray_color
  2.14      1.74     0.04  1048576     0.00     0.00  rayConstruction
  1.61      1.77     0.03  2110576     0.00     0.00  compute_specular_diffuse
  1.61      1.80     0.03        1     0.03     1.87  raytracing
  1.07      1.82     0.02  2110576     0.00     0.00  localColor
  1.07      1.84     0.02  1241598     0.00     0.00  refraction
  0.54      1.85     0.01  4221152     0.00     0.00  multiply_vectors
  0.54      1.86     0.01  1241598     0.00     0.00  protect_color_overflow
  0.27      1.87     0.01  3838091     0.00     0.00  length
  0.27      1.87     0.01  1048576     0.00     0.00  idx_stack_init
  0.00      1.87     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      1.87     0.00  2520791     0.00     0.00  idx_stack_top
  0.00      1.87     0.00  1241598     0.00     0.00  reflection
  0.00      1.87     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      1.87     0.00   113297     0.00     0.00  fresnel
  0.00      1.87     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      1.87     0.00        3     0.00     0.00  append_rectangular
  0.00      1.87     0.00        3     0.00     0.00  append_sphere
  0.00      1.87     0.00        2     0.00     0.00  append_light
  0.00      1.87     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      1.87     0.00        1     0.00     0.00  delete_light_list
  0.00      1.87     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      1.87     0.00        1     0.00     0.00  delete_sphere_list
  0.00      1.87     0.00        1     0.00     0.00  diff_in_second
  0.00      1.87     0.00        1     0.00     0.00  write_to_ppm

可以看到 dot_product 的執行時間，從原本佔了整支程式的 21.09 % 掉到了 7.22 % ，代表 dot_product 的執行效能是有被改善的，合理懷疑是電腦久沒關機，導致執行程式有時忽快忽慢，重複執行幾次，果然如我設想，執行時間飄呼不定有 2.282860 sec ，也有 2.569222 sec，讓我再次體認到 gprof 的好用之處，讓我可以確認是否真的有改善到函式效能。

圖示

確認到 loop unrolling 對效能的有顯著的影響後，著手將其它函式（add_vector、subtract_vector、multiply_vectors、 multiply_vector ）有 for 迴圈的都修改過，整體的執行時間已經降為了 1.961615 sec

# Rendering scene
Done!
Execution time of raytracing() : 1.961615 sec

圖示

使用 gprof 觀察

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 17.22      0.21     0.21 69646433     0.00     0.00  dot_product
 16.40      0.41     0.20 13861875     0.00     0.00  rayRectangularIntersection
 11.89      0.56     0.15 56956357     0.00     0.00  subtract_vector
  9.02      0.67     0.11 10598450     0.00     0.00  normalize
  8.61      0.77     0.11  4620625     0.00     0.00  ray_hit_object
  7.38      0.86     0.09 17821809     0.00     0.00  cross_product
  5.74      0.93     0.07 13861875     0.00     0.00  raySphereIntersection
  4.51      0.99     0.06 17836094     0.00     0.00  add_vector
  4.10      1.04     0.05 31410180     0.00     0.00  multiply_vector
  2.46      1.07     0.03  1048576     0.00     0.00  ray_color
  2.05      1.09     0.03  4221152     0.00     0.00  multiply_vectors
  2.05      1.12     0.03  1048576     0.00     0.00  rayConstruction
  1.64      1.14     0.02  2110576     0.00     0.00  compute_specular_diffuse
  1.64      1.16     0.02  2110576     0.00     0.00  localColor
  1.23      1.17     0.02  3838091     0.00     0.00  length
  0.82      1.18     0.01  2520791     0.00     0.00  idx_stack_top
  0.82      1.19     0.01  1241598     0.00     0.00  protect_color_overflow
  0.82      1.20     0.01  1241598     0.00     0.00  reflection
  0.82      1.21     0.01  1241598     0.00     0.00  refraction
  0.82      1.22     0.01        1     0.01     1.22  raytracing
  0.00      1.22     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      1.22     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      1.22     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      1.22     0.00   113297     0.00     0.00  fresnel
  0.00      1.22     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      1.22     0.00        3     0.00     0.00  append_rectangular
  0.00      1.22     0.00        3     0.00     0.00  append_sphere
  0.00      1.22     0.00        2     0.00     0.00  append_light
  0.00      1.22     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      1.22     0.00        1     0.00     0.00  delete_light_list
  0.00      1.22     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      1.22     0.00        1     0.00     0.00  delete_sphere_list
  0.00      1.22     0.00        1     0.00     0.00  diff_in_second
  0.00      1.22     0.00        1     0.00     0.00  write_to_ppm

可以更清楚知道被修改的函式所降的執行秒數

圖例：
方法二： OpenMp
介紹

OpenMP 是一套支援多平台，共享記憶體多平行處理的 API，使用在 c 、c++ 、 fortran 語言。它的優點是提供了對平行演算法的抽象描述，使人可以用最簡單的一行程式，就能讓一段程式碼使用多平行化的方式去執行。
語法
#pragma omp <directive> [clause[[,] clause] ...]
編譯方式
gcc -fopenmp

運用

將 raytracing.c 的 raytracing 函式裡的 for 迴圈，使用 OpenMp 做平行處理，使得程式的 throughput 增加，函式能更快得到結果，進而增進整體程式的效能。然而要注意多平行化的 for 迴圈裡，是否有變數是在迴圈外宣告的，迴圈外宣告的變數會在所有執行序裡，共享同一個變數空間，導致結果出錯，例如 a 執行序算出 number 變數為 10 ，然而 b 執行序算出 number 變數為 20，導致 a 執行序的結果變為 20。
下圖為運算結果出錯的圖：

為了不讓計算結果出錯，可以在 # pragma omp parallel for 後面加個 private ( 變數名稱 ) ，讓 private 裡的變數在每個執行序裡都有自己的副本，彼此不會影響。
認真觀察 raytracing 函式，發現有 3 個變數需要被保護起來，將 code 改成
# pragma omp parallel for private(d,object_color,stk)
得到了正確的圖形：

./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 0.949682 sec

測得執行時間為 0.949682 sec，執行效率提升了一倍。

圖例

2017q1 Homework1 (raytracing)

Reviewed by xdennisx

未優化

效能分析

GNU gprof

優化方式

方法一： loop unrolling

方法二： OpenMp

介紹

運用

參考資料

Reviewed by `xdennisx`