Advanced Computer Systems Theory and Implementation (Fall 2016) A02: raytracing

Preparation

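To make the performance difference between different ways of writing the code clearly visible, we usually turn compiler optimization off entirely when building, so I add -O0. To debug with gdb, -g is also required. To use the functions in math.h, we add -lm so that the linker knows to link the math library. One might ask why functions such as printf need no such linker option. They do: the option is -lc, but gcc already passes it by default.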

Reference link

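To use gprof, also add -pg.
gprof is a profiling tool developed by GNU that reports, for each function the program executed, how often it was called and how much time it took.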

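After the program has been compiled and run once, a gmon.out file is produced in the current directory. gprof can then analyze this file, and the result can be redirected to an output file of your choice.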

Command

gprof raytracing -b gmon.out > analys.out

Screenshot

analys.out

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 18.03      0.40     0.40 69646433     0.00     0.00  dot_product
 16.23      0.76     0.36 56956357     0.00     0.00  subtract_vector
  9.92      0.98     0.22 13861875     0.00     0.00  rayRectangularIntersection
  9.01      1.18     0.20 13861875     0.00     0.00  raySphereIntersection
  9.01      1.38     0.20 10598450     0.00     0.00  normalize
  8.11      1.56     0.18 31410180     0.00     0.00  multiply_vector
  7.66      1.73     0.17  4620625     0.00     0.00  ray_hit_object
  6.76      1.88     0.15 17836094     0.00     0.00  add_vector
  3.61      1.96     0.08  1048576     0.00     0.00  ray_color
  2.70      2.02     0.06 17821809     0.00     0.00  cross_product
  1.80      2.06     0.04  4221152     0.00     0.00  multiply_vectors
  1.80      2.10     0.04  1241598     0.00     0.00  refraction
  1.35      2.13     0.03  1048576     0.00     0.00  rayConstruction
  0.90      2.15     0.02  3838091     0.00     0.00  length
  0.90      2.17     0.02  2110576     0.00     0.00  compute_specular_diffuse
  0.90      2.19     0.02  2110576     0.00     0.00  localColor
  0.45      2.20     0.01  2520791     0.00     0.00  idx_stack_top
  0.45      2.21     0.01        1     0.01     0.01  delete_sphere_list
  0.45      2.22     0.01        1     0.01     2.21  raytracing
  0.00      2.22     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      2.22     0.00  1241598     0.00     0.00  protect_color_overflow
  0.00      2.22     0.00  1241598     0.00     0.00  reflection
  0.00      2.22     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      2.22     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      2.22     0.00   113297     0.00     0.00  fresnel
  0.00      2.22     0.00    37595     0.00     0.00  idx_stack_pop

The total execution time of the program is 2.881241 sec.

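The statistics show which functions are called most often. Improving the most frequently called functions should give the most effective speedup.
We therefore optimize dot_product, subtract_vector, and add_vector, which together account for about 41% of the sampled time (18.03% + 16.23% + 6.76%).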

Loop Unrolling

The approach is to unroll the loops.
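As a rough illustration, unrolling the three-iteration loops of the vector helpers could look like the sketch below. The original post does not show its unrolled code, so this is an assumption about its shape, and the `_unrolled` names are hypothetical.

```c
/* Sketch only: hand-unrolled versions of the 3-element vector helpers.
 * The *_unrolled names are hypothetical, not taken from the original code. */

double dot_product_unrolled(const double *v1, const double *v2)
{
    /* Same arithmetic as the loop, without the counter and the branch. */
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}

void subtract_vector_unrolled(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}

void add_vector_unrolled(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
}
```

With optimization disabled by -O0, the compiler keeps the loop counter, comparison, and jump of the loop version, which is the work this rewrite removes.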

analys.out

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 14.65      0.27     0.27 56956357     0.00     0.00  subtract_vector
 13.82      0.52     0.25 69646433     0.00     0.00  dot_product
 11.06      0.72     0.20  4620625     0.00     0.00  ray_hit_object
  9.40      0.89     0.17 10598450     0.00     0.00  normalize
  8.57      1.04     0.16 31410180     0.00     0.00  multiply_vector
  8.02      1.19     0.15 13861875     0.00     0.00  rayRectangularIntersection
  6.91      1.31     0.13 13861875     0.00     0.00  raySphereIntersection
  6.63      1.43     0.12 17821809     0.00     0.00  cross_product
  4.15      1.51     0.08 17836094     0.00     0.00  add_vector
  3.32      1.57     0.06  1048576     0.00     0.00  ray_color
  2.76      1.62     0.05  2110576     0.00     0.00  localColor
  2.49      1.66     0.05  3838091     0.00     0.00  length
  2.21      1.70     0.04        1     0.04     1.81  raytracing
  1.66      1.73     0.03  2110576     0.00     0.00  compute_specular_diffuse
  1.11      1.75     0.02  4221152     0.00     0.00  multiply_vectors
  1.11      1.77     0.02  1048576     0.00     0.00  idx_stack_init
  0.55      1.78     0.01  2520791     0.00     0.00  idx_stack_top
  0.55      1.79     0.01  1241598     0.00     0.00  protect_color_overflow
  0.55      1.80     0.01  1204003     0.00     0.00  idx_stack_push
  0.55      1.81     0.01  1048576     0.00     0.00  rayConstruction
  0.00      1.81     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      1.81     0.00  1241598     0.00     0.00  reflection
  0.00      1.81     0.00  1241598     0.00     0.00  refraction
  0.00      1.81     0.00   113297     0.00     0.00  fresnel
  0.00      1.81     0.00    37595     0.00     0.00  idx_stack_pop

The total execution time is 2.056526 sec, which is faster than the original 2.881241 sec (about 29% less).

SIMD

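To use a SIMD instruction set, remember to add the required flag to the Makefile. For the AVX intrinsics used below, that is -mavx with gcc.
The source file must also include the corresponding header, which for these intrinsics is immintrin.h.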

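With SIMD, pay close attention to the order in which data goes into and comes out of a register. _mm256_set_pd takes its arguments from the highest lane down to the lowest, so in the code below the elements are passed in reverse order (0, v[2], v[1], v[0]) and then read back from memory in natural order.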
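To make the ordering concrete, here is a minimal sketch, assuming an x86 CPU with AVX and gcc or clang. The helper name `pack_and_store` is hypothetical, and the `target("avx")` attribute is only there so that the snippet also compiles on its own without the Makefile flag.

```c
#include <immintrin.h>

/* Hypothetical helper: pack a 3-element vector into an AVX register and
 * write all four lanes back to out[0..3]. */
__attribute__((target("avx")))
void pack_and_store(const double *v, double out[4])
{
    /* Arguments run from the highest lane (3) down to the lowest (0). */
    __m256d reg = _mm256_set_pd(0.0, v[2], v[1], v[0]);
    /* Unaligned store: lane 0 lands in out[0], lane 3 in out[3]. */
    _mm256_storeu_pd(out, reg);
}
```

For v = {1.0, 2.0, 3.0}, out ends up as {1.0, 2.0, 3.0, 0.0}. `_mm256_setr_pd(v[0], v[1], v[2], 0.0)` builds the same register with the arguments written in memory order.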

Improving dot_product (double*, double*)

Unrolling the loop gives a result equal to

v1[0]*v2[0]+
v1[1]*v2[1]+
v1[2]*v2[2]

For the instruction set, see this website.

Original

double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

Vectorized

#include <immintrin.h>  /* AVX intrinsics */

double dot_product(const double *v1, const double *v2)
{
    /* Lane 0 holds v[0]; the unused top lane is zero. */
    __m256d v1_reg = _mm256_set_pd(0.0, v1[2], v1[1], v1[0]);
    __m256d v2_reg = _mm256_set_pd(0.0, v2[2], v2[1], v2[0]);
    __m256d v1_mul_v2 = _mm256_mul_pd(v1_reg, v2_reg);  /* element-wise product */
    double tmp[4] __attribute__((aligned(32)));  /* aligned store needs 32-byte alignment */
    _mm256_store_pd(tmp, v1_mul_v2);
    return tmp[0] + tmp[1] + tmp[2];  /* sum the three products */
}

Improving subtract_vector (double*, double*)

Original

void subtract_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] - b[i];
    return;
}

Vectorized

void subtract_vector(const double *a, const double *b, double *out)
{
    /* Lane 0 holds element 0; the unused top lane is zero. */
    __m256d a_reg = _mm256_set_pd(0.0, a[2], a[1], a[0]);
    __m256d b_reg = _mm256_set_pd(0.0, b[2], b[1], b[0]);
    __m256d a_sub_b = _mm256_sub_pd(a_reg, b_reg);  /* element-wise a - b */
    double tmp[4] __attribute__((aligned(32)));
    _mm256_store_pd(tmp, a_sub_b);
    /* Copy only the three meaningful lanes back to the caller. */
    out[0] = tmp[0];
    out[1] = tmp[1];
    out[2] = tmp[2];
}

Improving add_vector (double*, double*)

Original

void add_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] + b[i];
    return;
}

Vectorized

void add_vector(const double *a, const double *b, double *out)
{
    /* Lane 0 holds element 0; the unused top lane is zero. */
    __m256d a_reg = _mm256_set_pd(0.0, a[2], a[1], a[0]);
    __m256d b_reg = _mm256_set_pd(0.0, b[2], b[1], b[0]);
    __m256d a_add_b = _mm256_add_pd(a_reg, b_reg);  /* element-wise a + b */
    double tmp[4] __attribute__((aligned(32)));
    _mm256_store_pd(tmp, a_add_b);
    /* Copy only the three meaningful lanes back to the caller. */
    out[0] = tmp[0];
    out[1] = tmp[1];
    out[2] = tmp[2];
}

analys.out

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 32.65      1.08     1.08 69646433     0.00     0.00  dot_product
 19.04      1.71     0.63 56956357     0.00     0.00  subtract_vector
 12.09      2.11     0.40 10598450     0.00     0.00  normalize
  8.31      2.39     0.28 17836094     0.00     0.00  add_vector
  6.65      2.61     0.22 13861875     0.00     0.00  rayRectangularIntersection
  4.08      2.74     0.14 31410180     0.00     0.00  multiply_vector
  3.63      2.86     0.12  4620625     0.00     0.00  ray_hit_object
  3.63      2.98     0.12 13861875     0.00     0.00  raySphereIntersection
  3.17      3.09     0.11 17821809     0.00     0.00  cross_product
  1.81      3.15     0.06  1048576     0.00     0.00  ray_color
  1.66      3.20     0.06  3838091     0.00     0.00  length
  1.51      3.25     0.05  2110576     0.00     0.00  compute_specular_diffuse
  0.60      3.27     0.02  4221152     0.00     0.00  multiply_vectors
  0.60      3.29     0.02        1     0.02     3.31  raytracing
  0.30      3.30     0.01  1241598     0.00     0.00  protect_color_overflow
  0.30      3.31     0.01  1048576     0.00     0.00  rayConstruction
  0.00      3.31     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      3.31     0.00  2520791     0.00     0.00  idx_stack_top
  0.00      3.31     0.00  2110576     0.00     0.00  localColor
  0.00      3.31     0.00  1241598     0.00     0.00  reflection
  0.00      3.31     0.00  1241598     0.00     0.00  refraction
  0.00      3.31     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      3.31     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      3.31     0.00   113297     0.00     0.00  fresnel
  0.00      3.31     0.00    37595     0.00     0.00  idx_stack_pop

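The total execution time is 4.482673 sec, so the run actually became slower. The likely cause is that the rewritten functions are very simple: packing the operands into vector registers and storing the result back to memory already takes most of the time. Functions this short do not really benefit from SIMD.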
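One way to check this explanation in isolation is to time each variant of a function over many calls. The sketch below is a simple micro-benchmark written under assumptions, not the measurement method of the original post: `dot_fn`, `dot_loop`, and `time_calls` are hypothetical names, and `clock()` reports CPU time rather than wall-clock time.

```c
#include <time.h>

/* Hypothetical type for any dot-product implementation under test. */
typedef double (*dot_fn)(const double *, const double *);

/* Plain loop version, used here as the baseline. */
double dot_loop(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

/* Call fn `calls` times and return the CPU time spent, in seconds.
 * The running sum is written to *sink so the calls cannot be dropped. */
double time_calls(dot_fn fn, long calls, double *sink)
{
    const double a[3] = {1.0, 2.0, 3.0};
    const double b[3] = {4.0, 5.0, 6.0};
    double sum = 0.0;
    clock_t start = clock();
    for (long i = 0; i < calls; i++)
        sum += fn(a, b);
    clock_t end = clock();
    *sink = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Passing the loop version and the SIMD version with the same call count, built with the same flags as the main program, shows whether the extra packing and storing costs more than the arithmetic it replaces.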

References:

http://wiki.csie.ncku.edu.tw/embedded/2015q3h1ext
