2016q3 Homework1 (raytracing)

contributed by <linachiu>

Reviewed by `shelly4132`

Other optimizations such as OpenMP and POSIX Threads are also worth trying.
gnuplot could be used to plot a performance comparison chart.

Installing the required tools

graphviz: drawing diagrams
imagemagick: image format conversion

$ sudo apt-get update
$ sudo apt-get install graphviz
$ sudo apt-get install imagemagick

Goals

Learn to use performance analysis tools
Optimize the program

First, follow the instructor's steps and see what we get

$ make
$ ./raytracing

Result

# Rendering scene
Done!
Execution time of raytracing() : 3.209233 sec

The program also produces a rendered image of light and shadow.

Convert it to a PNG file with the following command:

$ convert out.ppm out.png

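Performance analysis tool: gprof

gprof is a GNU tool.
Usage
- Compile with the -pg option; the compiler then inserts a call to mcount in every function.
- Run the program ($ ./raytracing) to produce gmon.out.
- $ gprof -b raytracing gmon.out | less reads gmon.out and shows the report (-b omits the explanatory text).
- $ gprof ./raytracing | less shows the percentage of time spent in each function.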
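The steps above can be tried on something smaller than the ray tracer. The following is only a minimal sketch: the file name demo.c, the functions hot, cold and run_demo, and the DEMO_MAIN macro are all hypothetical and not part of the raytracing project.

```c
/* demo.c (hypothetical file name): a tiny target for trying gprof.
 * Build and profile, assuming gcc and gprof are installed:
 *   gcc -pg -O0 -DDEMO_MAIN demo.c -o demo
 *   ./demo                          # writes gmon.out in the current directory
 *   gprof -b demo gmon.out | less
 */
#include <stdio.h>

/* Called once per loop iteration, so it should collect most of the calls. */
static double hot(double x)
{
    return x * 0.5 + 1.0;
}

/* Called only once per run. */
static double cold(double x)
{
    return x - 1.0;
}

/* Drives the two functions above; iterations controls how long it runs. */
double run_demo(long iterations)
{
    double acc = 0.0;
    for (long i = 0; i < iterations; i++)
        acc = hot(acc);
    return cold(acc);
}

/* main is guarded by a hypothetical macro so the functions can also be
 * compiled into another program or a test harness. */
#ifdef DEMO_MAIN
int main(void)
{
    printf("%f\n", run_demo(50000000L));
    return 0;
}
#endif
```

In the resulting flat profile, hot would be expected to show far more calls than cold, which is the same kind of imbalance we look for in the ray tracer.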
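gprof vs. perf
perf top works by taking a sample at regular intervals and building its output from the collected samples, so the sampling frequency can affect the results.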

Building the program for gprof (unoptimized)

First run $ make clean
Then rebuild with $ make PROFILE=1 (enables gprof)

Run
$ ./raytracing
$ gprof -b raytracing gmon.out | less

We get

# Rendering scene
Done!
Execution time of raytracing() : 7.402606 sec

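The run takes much longer than before. Why?
Because the compiler inserts a call to mcount into every function, and mcount records call information while the program runs. This information is stored in gmon.out, and gprof is then invoked to produce the tables.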
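As a rough analogy for why this slows the program down, the sketch below adds a bookkeeping call by hand at the start of a small function. The names record_call, dot3 and calls_recorded are hypothetical, and the real mcount does more work (it records caller/callee pairs rather than a single counter), so this only illustrates the idea.

```c
/* Hypothetical stand-in for the bookkeeping done by mcount.
 * The real mcount records which function called which; this sketch
 * only counts calls. */
static unsigned long call_count;

static void record_call(void)
{
    call_count++;
}

/* A small math helper in the style of the ray tracer's vector functions. */
static double dot3(const double *a, const double *b)
{
    record_call(); /* roughly where -pg would place its inserted call */
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Reports how many instrumented calls have happened so far. */
unsigned long calls_recorded(void)
{
    return call_count;
}
```

When a function this small is called tens of millions of times, the extra call on every entry can cost as much as the function body itself, which is consistent with the longer execution time measured above.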

Running gprof to generate the tables

$ gprof ./raytracing | less

The output is long, so only the top entries are listed.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 20.83      0.61     0.61 69646433     0.00     0.00  dot_product
 19.12      1.17     0.56 56956357     0.00     0.00  subtract_vector
  8.54      1.42     0.25 10598450     0.00     0.00  normalize
  8.37      1.67     0.25 31410180     0.00     0.00  multiply_vector
  8.03      1.90     0.24 13861875     0.00     0.00  rayRectangularIntersection
  8.03      2.14     0.24 13861875     0.00     0.00  raySphereIntersection
  6.83      2.34     0.20  4620625     0.00     0.00  ray_hit_object
  3.42      2.44     0.10 17821809     0.00     0.00  cross_product
  3.24      2.53     0.10  4221152     0.00     0.00  multiply_vectors
  2.90      2.62     0.09 17836094     0.00     0.00  add_vector
  2.05      2.68     0.06  1048576     0.00     0.00  ray_color
  1.71      2.73     0.05  2110576     0.00     0.00  compute_specular_diffuse
  1.71      2.78     0.05  2110576     0.00     0.00  localColor
  1.54      2.82     0.05  3838091     0.00     0.00  length

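dot_product takes the largest share of execution time, so let us look at what dot_product looks like.

Avoid presenting source code as images; please fix this. jserv

It looks like an ordinary for loop, but every iteration involves a branch. We can optimize it by reducing the number of instructions executed.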

Optimization (1): Loop unrolling

A loop like the one in dot_product, which can be expanded easily without hurting readability, is a good candidate for loop unrolling.
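The source of dot_product was shown only as an image in the original note. Assuming it is a counted loop over three components, as in dot_product_loop below, the unrolled version could look like dot_product_unrolled. Both function names are hypothetical and chosen here only to show the two forms side by side.

```c
/* Assumed original form: a counted loop over the three components.
 * Each iteration updates i, compares it with 3, and jumps. */
static inline double dot_product_loop(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

/* Unrolled form: the same three multiplications, written out
 * without a loop counter, comparison, or jump. */
static inline double dot_product_unrolled(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
```

Unrolling by hand is practical here because the vector length is fixed at three; for loops with a variable trip count the trade-off between code size and readability would need more care.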

# Rendering scene
Done!
Execution time of raytracing() : 6.884073 sec

After loop unrolling, the execution time drops from 7.402606 sec to 6.884073 sec, a reduction of about 7%.

The per-function execution times:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 20.53      0.55     0.55 56956357     0.00     0.00  subtract_vector
 13.44      0.91     0.36 13861875     0.00     0.00  rayRectangularIntersection
 11.57      1.22     0.31 69646433     0.00     0.00  dot_product
 10.83      1.51     0.29 10598450     0.00     0.00  normalize
  8.96      1.75     0.24 31410180     0.00     0.00  multiply_vector
  8.03      1.97     0.22 17836094     0.00     0.00  add_vector
  7.09      2.16     0.19 13861875     0.00     0.00  raySphereIntersection
  6.35      2.33     0.17 17821809     0.00     0.00  cross_product
  2.99      2.41     0.08  1048576     0.00     0.00  ray_color
  2.61      2.48     0.07  4620625     0.00     0.00  ray_hit_object
  1.87      2.53     0.05  2110576     0.00     0.00  compute_specular_diffuse
  1.49      2.57     0.04  4221152     0.00     0.00  multiply_vectors
  1.12      2.60     0.03  1048576     0.00     0.00  rayConstruction
  1.12      2.63     0.03        1     0.03     2.68  raytracing

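The share of dot_product drops sharply, from about 21% to about 12%.
Loop unrolling reduces the number of branches:
- The number of additions and subtractions in the computation itself does not decrease; what is saved is the jmp generated for each iteration of the for loop.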

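Optimization (2): force inline function

In the profile taken before any optimization, the functions from math-toolkit.h occupy almost all of the top spots. Here we force inlining by changing their static inline declarations to use __attribute__((always_inline)).
Because we build without compiler optimization, GCC does not inline functions on its own unless they carry this attribute. Compiling this way also produces many warnings.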
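A minimal sketch of the change, assuming GCC or Clang (both accept this attribute). The names dot_product_forced and squared_length are hypothetical and are not the project's actual declarations.

```c
/* always_inline asks the compiler to expand this function at every
 * call site, even when optimization is turned off. */
static inline __attribute__((always_inline))
double dot_product_forced(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}

/* Hypothetical caller: the body of dot_product_forced is expanded
 * here, so no separate function call is made at run time. */
double squared_length(const double *v)
{
    return dot_product_forced(v, v);
}
```

Since the inlined function no longer exists as a separate call, gprof has no entry point to count for it, and its time shows up inside the caller instead.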

# Rendering scene
Done!
Execution time of raytracing() : 5.294769 sec

After this change the execution time drops from 6.884073 sec by roughly another 1.6 sec (about 23%).

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 35.19      0.99     0.99 13861875     0.00     0.00  rayRectangularIntersection
 12.68      1.34     0.36 13861875     0.00     0.00  raySphereIntersection
  9.29      1.60     0.26 31410180     0.00     0.00  multiply_vector
  8.22      1.83     0.23  2110576     0.00     0.00  compute_specular_diffuse
  7.15      2.03     0.20 17821809     0.00     0.00  cross_product
  6.61      2.22     0.19 17836094     0.00     0.00  add_vector
  6.43      2.40     0.18  4620625     0.00     0.00  ray_hit_object
  3.57      2.50     0.10  1048576     0.00     0.00  ray_color
  2.14      2.56     0.06        1     0.06     2.78  raytracing
  1.79      2.61     0.05  4221152     0.00     0.00  multiply_vectors
  1.43      2.65     0.04  2110576     0.00     0.00  localColor
  1.07      2.68     0.03  1241598     0.00     0.00  refraction
  0.71      2.70     0.02  1241598     0.00     0.00  protect_color_overflow
  0.71      2.72     0.02  1048576     0.00     0.00  idx_stack_init
  0.71      2.74     0.02  1048576     0.00     0.00  rayConstruction
  0.71      2.76     0.02                             subtract_vector

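The original top three have also moved far down: dot_product and normalize no longer appear in the rows shown, and subtract_vector accounts for only 0.71%. The higher share of rayRectangularIntersection (35.19%) is most likely because the time of the inlined math functions is now attributed to their callers.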