2017q1 Homework1 (raytracing)

contributed by <steven0203>

Reviewed by `davis8211`

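After trying loop unrolling, you could also try force inlining or macros; you would learn more that way.
There are some English typos; please watch out for them.
Before recording how you used a technique such as OpenMP, explain the principle behind it. This reinforces your own understanding and helps classmates reading this note see why it is done this way.
After the OpenMP speedup, consider whether the program can run even faster: which extra options could be passed, and why the thread count is set to 4. Is that based on your hardware? A bit of tuning might make it run faster.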
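
The reviewer's suggestion of force inlining or macros could be sketched as follows. This is only an illustration, not code from the project: the `FORCE_INLINE` and `ADD_VECTOR` names are made up for this sketch, and `__attribute__((always_inline))` is a GCC/Clang extension rather than standard C. The idea is that a plain `inline` is only a hint that GCC does not act on when optimization is disabled, while `always_inline` asks for inlining regardless of optimization level.

```c
/* GCC/Clang extension: request inlining even when optimization is off.
 * FORCE_INLINE is a hypothetical name chosen for this sketch. */
#define FORCE_INLINE static inline __attribute__((always_inline))

FORCE_INLINE
void add_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
}

/* Macro variant (hypothetical name): expanded by the preprocessor, so
 * there is never a function call. The arguments are evaluated several
 * times, so they must not have side effects. */
#define ADD_VECTOR(a, b, out)            \
    do {                                 \
        (out)[0] = (a)[0] + (b)[0];      \
        (out)[1] = (a)[1] + (b)[1];      \
        (out)[2] = (a)[2] + (b)[2];      \
    } while (0)
```

Both variants are used the same way as the original function, for example `add_vector(a, b, out);` or `ADD_VECTOR(a, b, out);` with three-element `double` arrays.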

Development environment

os: Ubuntu 16.04 LTS
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
Stepping:              9
CPU MHz:               1200.132
CPU max MHz:           3300.0000
CPU min MHz:           1200.0000
BogoMIPS:              4589.56
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K

Unoptimized version

Run:
$ make
$ ./raytracing
(raytracing result image)

Execution time:
Execution time of raytracing() : 3.026700 sec
Use gprof to see which functions the time is spent in.
Run:
$ make PROFILE=1
$ ./raytracing
$ gprof ./raytracing | less
Result:

    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total           
     time   seconds   seconds    calls   s/call   s/call  name    
     20.44      0.48     0.48 56956357     0.00     0.00  subtract_vector
     18.73      0.92     0.44 69646433     0.00     0.00  dot_product
     11.92      1.20     0.28 10598450     0.00     0.00  normalize
     10.64      1.45     0.25 13861875     0.00     0.00  rayRectangularIntersection
     10.22      1.69     0.24 31410180     0.00     0.00  multiply_vector
      5.96      1.83     0.14 13861875     0.00     0.00  raySphereIntersection
      4.68      1.94     0.11 17836094     0.00     0.00  add_vector
      4.26      2.04     0.10 17821809     0.00     0.00  cross_product
      4.05      2.14     0.10  4620625     0.00     0.00  ray_hit_object
      2.13      2.19     0.05  4221152     0.00     0.00  multiply_vectors

subtract_vector, dot_product, normalize, rayRectangularIntersection, and multiply_vector account for the largest shares of the execution time.

branches and branch-misses
Run: perf stat -e branches,branch-misses ./raytracing

Performance counter stats for './raytracing':

     1,870,213,826      branches                                                    
         6,935,477      branch-misses             #    0.37% of all branches        

       2.957526562 seconds time elapsed

Optimization: loop unrolling

Take a look at the add_vector function in math-toolkit.h:

static inline 
void add_vector(const double *a, const double *b, double *out)
{ 
    for (int i = 0; i < 3; i++)
        out[i] = a[i] + b[i];  
}

A loop is not actually needed here. The loop executes branch instructions, and every branch miss lengthens the execution time, so the function is changed to this:

static inline
void add_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
}

Likewise, the loops inside subtract_vector, multiply_vectors, multiply_vector, and dot_product are unnecessary, so they are modified in the same way.
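
As a sketch of what "the same modification" could look like for those four functions: the signatures below are assumptions modeled on add_vector (in particular, multiply_vector is assumed to scale a vector by a scalar and dot_product to return a double), so they should be checked against the actual math-toolkit.h before use.

```c
/* Assumed signatures, modeled on add_vector; verify against math-toolkit.h. */

static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}

/* Element-wise product of two vectors. */
static inline
void multiply_vectors(const double *a, const double *b, double *out)
{
    out[0] = a[0] * b[0];
    out[1] = a[1] * b[1];
    out[2] = a[2] * b[2];
}

/* Scale a vector by the scalar b. */
static inline
void multiply_vector(const double *a, double b, double *out)
{
    out[0] = a[0] * b;
    out[1] = a[1] * b;
    out[2] = a[2] * b;
}

static inline
double dot_product(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
```

Because the trip count is fixed at 3, unrolling by hand removes the loop counter, the comparison, and the conditional jump from every call.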

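Result:
Execution time of raytracing() : 2.309990 sec
Execution time comparison:

The execution time dropped from 3.026700 sec to 2.309990 sec, about 24% less than the original.
Analysis with gprof: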

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 16.78      0.27     0.27 69646433     0.00     0.00  dot_product
 14.92      0.51     0.24 13861875     0.00     0.00  rayRectangularIntersection
 14.92      0.75     0.24 56956357     0.00     0.00  subtract_vector
  8.70      0.89     0.14 10598450     0.00     0.00  normalize
  8.08      1.02     0.13 13861875     0.00     0.00  raySphereIntersection
  5.90      1.12     0.10 31410180     0.00     0.00  multiply_vector
  4.97      1.20     0.08  4620625     0.00     0.00  ray_hit_object
  4.35      1.27     0.07 17836094     0.00     0.00  add_vector
  3.73      1.33     0.06  1048576     0.00     0.00  ray_color
  3.42      1.38     0.06 17821809     0.00     0.00  cross_product
  3.11      1.43     0.05  2110576     0.00     0.00  localColor
  3.11      1.48     0.05        1     0.05     1.61  raytracing

The functions that previously ranked at the top (subtract_vector, dot_product, normalize, multiply_vector) all take up less execution time than before.

branches and branch-misses
Run: perf stat -e branches,branch-misses ./raytracing

Execution time of raytracing() : 2.186736 sec

 Performance counter stats for './raytracing':

       969,893,092      branches                                                    
         5,695,990      branch-misses             #    0.59% of all branches        

       2.188774361 seconds time elapsed

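Both branches (1,870,213,826 to 969,893,092) and branch-misses (6,935,477 to 5,695,990) decreased, which is consistent with the shorter execution time. The miss ratio rose from 0.37% to 0.59% only because the total number of branches fell faster than the number of misses.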
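
To make the comparison of the two perf stat runs concrete, the arithmetic can be written out as a small helper. This is just a worked calculation on the counter values quoted above; the function names are invented for this sketch.

```c
#include <stdio.h>

/* Branch-miss ratio in percent, the figure perf stat prints after '#'. */
static double miss_ratio_percent(double branches, double misses)
{
    return 100.0 * misses / branches;
}

/* Relative reduction in percent when a counter goes from before to after. */
static double reduction_percent(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Print the comparison for the two runs quoted in this note. */
static void print_branch_summary(void)
{
    const double branches_before = 1870213826.0, misses_before = 6935477.0;
    const double branches_after = 969893092.0, misses_after = 5695990.0;

    printf("miss ratio: %.2f%% -> %.2f%%\n",
           miss_ratio_percent(branches_before, misses_before),
           miss_ratio_percent(branches_after, misses_after));
    printf("branches reduced by %.1f%%, branch-misses by %.1f%%\n",
           reduction_percent(branches_before, branches_after),
           reduction_percent(misses_before, misses_after));
}
```

Calling `print_branch_summary()` shows that the number of branches fell by close to half while the misses fell by less than a fifth, which is why the miss ratio went up even though both absolute counts went down.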

Optimization: OpenMP

Add the following to the for loop in the raytracing function:


 #pragma omp parallel for num_threads(4) private(stk),private(object_color),private(d)

so that it becomes:

#pragma omp parallel for num_threads(4) private(stk),private(object_color),private(d)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            double r = 0, g = 0, b = 0;

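The loop is run in parallel by 4 threads. d, stk, and object_color are modified inside the loop, so they are declared private: each thread then gets its own independent copy that other threads cannot change.
Add -fopenmp to the compile options in the Makefile.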
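
The effect of the private clause can be seen in a reduced, self-contained sketch. The `shade_pixel`, `render`, and `default_threads` names are hypothetical stand-ins and not part of the project; only the pragma follows the pattern above. `default_threads` also touches on the reviewer's question about the thread count: rather than hard-coding 4, one option is to ask the OpenMP runtime, whose default is typically the number of logical CPUs unless OMP_NUM_THREADS is set.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for the per-pixel work done in raytracing(). */
static void shade_pixel(int i, int j, double *d)
{
    d[0] = i;
    d[1] = j;
    d[2] = i + j;
}

/* Fill a width*height buffer; rows are divided among the threads.
 * threads must be a positive number. */
static void render(double *pixels, int width, int height, int threads)
{
    double d[3]; /* scratch buffer written by every iteration */

    /* private(d) gives each thread its own copy of d; without it all
     * threads would write to the same array and race with each other. */
    #pragma omp parallel for num_threads(threads) private(d)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            shade_pixel(i, j, d);
            pixels[j * width + i] = d[0] + d[1] + d[2];
        }
    }
}

/* Thread count suggested by the runtime, or 1 when built without OpenMP. */
static int default_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

A call such as `render(pixels, width, height, default_threads());` would use the runtime's default. With GCC, -fopenmp is needed both when compiling and when linking; without it the pragma is ignored and the loop simply runs serially. Whether more than 4 threads helps on a 4-core, 8-thread CPU like the one listed above would have to be measured.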

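Execution time
Execution time of raytracing() : 0.860706 sec

After parallelization the execution time dropped from 2.309990 sec to 0.860706 sec, roughly 2.7 times faster than the loop-unrolled version.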

References

ChenYi's shared notes
nekoneko's shared notes
TempoJiJi's shared notes