2017q1 Homework1 (raytracing)

tags: `heathcliffYang`

contributed by <heathcliffYang>

Reviewd by `yachiyang01`

因為沒有特別說明，不太清楚loop unrolling是用在所有function嗎?每一個function的效果會不一，可以試著分析看看。
選擇使用這些優化方法的原因是?

Reviewed by `etc276`

善用 HackMD 中的 # 符號，可以讓讀者更明瞭整篇共筆的架構與順序，例如各種嘗試優化的方法就可以使用 ###，這樣讀者就可以在看到「進行優化與比較」時一目了然。
最後兩次的 commit message 看起來相同，如有修改 commit message 之類的需求可參考這篇

分析原本程式的效能

Call graph
gprof ./raytracing | gprof2dot | dot -T png -o output.png

指令參考 hugikun999 的共筆，將 gprof 輸出的資訊轉成圖片表示

可以發現，raytracing 要建立光影的模型會呼叫到幾種數學運算

程式熱點

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.49      0.49     0.49 69646433     0.00     0.00  dot_product
 16.98      0.86     0.37 56956357     0.00     0.00  subtract_vector
 10.79      1.10     0.24 13861875     0.00     0.00  rayRectangularIntersection
  8.49      1.28     0.19 31410180     0.00     0.00  multiply_vector
  7.34      1.44     0.16 17836094     0.00     0.00  add_vector
  7.34      1.60     0.16 10598450     0.00     0.00  normalize
  5.97      1.73     0.13  4620625     0.00     0.00  ray_hit_object
  4.82      1.84     0.11 13861875     0.00     0.00  raySphereIntersection
  4.13      1.93     0.09 17821809     0.00     0.00  cross_product
  2.29      1.98     0.05  1048576     0.00     0.00  ray_color
  1.84      2.02     0.04  2110576     0.00     0.00  compute_specular_diffuse
  1.61      2.05     0.04  4221152     0.00     0.00  multiply_vectors
  1.38      2.08     0.03  1241598     0.00     0.00  refraction
  0.92      2.10     0.02  3838091     0.00     0.00  length

在官方 Manual 提到：

cumulative seconds
This is the cumulative total number of seconds the computer spent executing this functions, plus the time spent in all the functions above this one in this table.
self seconds
This is the number of seconds accounted for by this function alone. The flat profile listing is sorted first by this number.

主要花時間的數學運算可以從 self seconds 看出。

Cache-misses & branch-misses

 Performance counter stats for './raytracing' (100 runs):

           703,209      cache-misses              #   31.249 % of all cache refs      ( +-  4.67% )
         2,250,333      cache-references                                              ( +-  5.17% )
         7,446,324      branch-misses                                                 ( +-  0.59% )
       4.132374478 seconds time elapsed                                          ( +-  2.20% )

進行優化與比較

預計使用的方法有 : POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling

Loop unrolling

Cache-misses & branch-misses

 Performance counter stats for './raytracing' (100 runs):

       323,532      cache-misses              #   23.108 % of all cache refs      ( +-  5.11% )
     1,400,109      cache-references                                              ( +-  5.30% )
     5,727,748      branch-misses                                                 ( +-  0.23% )

2.586578907 seconds time elapsed                                          ( +-  1.32% )

程式熱點

%   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
24.86      0.41     0.41 69646433     0.00     0.00  dot_product
11.22      0.60     0.19 13861875     0.00     0.00  rayRectangularIntersection
11.22      0.78     0.19 31410180     0.00     0.00  multiply_vector
10.92      0.96     0.18 56956357     0.00     0.00  subtract_vector
 9.70      1.12     0.16 10598450     0.00     0.00  normalize
 7.58      1.25     0.13 17821809     0.00     0.00  cross_product
 7.28      1.37     0.12  4620625     0.00     0.00  ray_hit_object
 5.76      1.46     0.10 13861875     0.00     0.00  raySphereIntersection
 2.12      1.50     0.04  3838091     0.00     0.00  length
 1.82      1.53     0.03  2110576     0.00     0.00  localColor
 1.52      1.55     0.03 17836094     0.00     0.00  add_vector
 1.21      1.57     0.02  4221152     0.00     0.00  multiply_vectors
 1.21      1.59     0.02  1048576     0.00     0.00  rayConstruction
 1.21      1.61     0.02  1048576     0.00     0.00  ray_color

POSIX thread
因為 typedef double point3[3]; -> point3 是傳 address

Software pipeline

這個 loop 總共有六圈，每一圈被分為 4 個 stage (也就是 4 個 block)，讓他們前後 overlap –> 可同時執行不同階段的工作，增加 throughput

將 raytracing() 裡的第 3 層 loop 劃分 stage : 注意 dependent 參數為 d, stk, object_color

Stage 1.
idx_stack_init() + rayConstruction() -> d & stk
Stage 2.
ray_color() -> d, stk & object_color
Stage 3.
rgb operations -> object_color & rgb
Stage 4.
pixels -> rgb

Performance counter stats for './raytracing' (100 runs):

    95,023      cache-misses              #   26.193 % of all cache refs      ( +-  8.58% )
   362,784      cache-references                                              ( +- 13.95% )
 5,151,867      branch-misses                                                 ( +-  0.26% )

2.337996284 seconds time elapsed                                          ( +-  0.89% )

總體所需的要用到 cache 次數大幅下降

圖片源自 Introduction to Software Pipelining in the IA-64 Architecture Architecture

OpenMP + loopunroll + swp

2 threads

Performance counter stats for './raytracing' (100 runs):

        50,139      cache-misses              #   24.425 % of all cache refs      ( +-  1.03% )
       205,274      cache-references                                              ( +-  2.93% )
     5,184,165      branch-misses                                                 ( +-  0.83%)

   1.320096975 seconds time elapsed                                          ( +-  1.69% )

 Performance counter stats for './raytracing' (100 runs):

        61,863      cache-misses              #   18.482 % of all cache refs      ( +-  2.03% )
       334,723      cache-references                                              ( +-  2.57% )
     5,663,438      branch-misses                                                 ( +-  0.32% )

   1.497532812 seconds time elapsed                                          ( +-  2.81% )

 Performance counter stats for './raytracing' (100 runs):

        62,719      cache-misses              #   19.101 % of all cache refs      ( +-  1.46% )
       328,354      cache-references                                              ( +-  1.91% )
     5,877,346      branch-misses                                                 ( +-  0.07% )

   1.197949277 seconds time elapsed                                          ( +-  2.10% )

待參照資料 :
Compiler Optimizations for Modern VLIW/EPIC Architectures

My computer's specifications

2017q1 Homework1 (raytracing)

tags: heathcliffYang

Reviewd by yachiyang01

Reviewed by etc276

分析原本程式的效能

進行優化與比較

tags: `heathcliffYang`

Reviewd by `yachiyang01`

Reviewed by `etc276`