2016q3 Homework1 (raytracing)

優化前

gprof vs. perf

在phonebook中，我們是利用 perf top 來「統計」程式在執行過程中的事件。perf top 的原理是每隔一段時間採樣一次，最後根據這些資料輸出。因此採樣的頻率可能會造成結果的不同。

而gprof則是需要在編譯時加入 -pg 的選項。編譯器會在各函數中加入 mcount 函數，此函數在執行期間會記錄呼叫次數等相關資訊。而這些資訊會儲存至 gmon.out 中，最後呼叫 gprof 來繪製相關表格供開發者分析。

-pg Generate extra code to write profile information suitable for the analysis program gprof.
    You must use this option when compiling the source files you want data about, and you must also use it when linking.

分析

依照作業說明，先以 make PROFILE=1 編譯並執行：

# Rendering scene
Done!
Execution time of raytracing() : 5.121206 sec

那如果傳入 PROFILE=0 呢？看來在計算呼叫次數等資訊時，花費了快3秒的時間。

# Rendering scene
Done!
Execution time of raytracing() : 2.476936 sec

如果我們開啟編譯器的最佳化(-Ofast)的話：

# Rendering scene
Done!
Execution time of raytracing() : 0.514999 sec

時間又縮短了2秒左右，可以看出編譯器做了很多優化阿…

回到正題，我們來使用gprof來觀看各函數的使用狀況

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 25.73      0.54     0.54 69646433     0.00     0.00  dot_product
 19.06      0.94     0.40 56956357     0.00     0.00  subtract_vector
 10.96      1.17     0.23 31410180     0.00     0.00  multiply_vector
  9.05      1.36     0.19 10598450     0.00     0.00  normalize
  8.10      1.53     0.17 13861875     0.00     0.00  rayRectangularIntersection
  6.43      1.67     0.14 17836094     0.00     0.00  add_vector
  4.76      1.77     0.10  4620625     0.00     0.00  ray_hit_object
  4.29      1.86     0.09 17821809     0.00     0.00  cross_product
  3.34      1.93     0.07  1048576     0.00     0.00  ray_color
  2.86      1.99     0.06 13861875     0.00     0.00  raySphereIntersection
  1.91      2.03     0.04  4221152     0.00     0.00  multiply_vectors
  1.91      2.07     0.04  2110576     0.00     0.00  compute_specular_diffuse
  0.48      2.08     0.01  2110576     0.00     0.00  localColor
  0.48      2.09     0.01  1048576     0.00     0.00  rayConstruction
  0.48      2.10     0.01        1     0.01     0.01  delete_sphere_list
  0.24      2.10     0.01  3838091     0.00     0.00  length
  0.00      2.10     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      2.10     0.00  2520791     0.00     0.00  idx_stack_top

從結果中我們可以看到，前幾名的函數都是向量操作函數，也都在 math-toolkit.h 中。因此，未來將先關注於 math-toolkit.h。

優化後

Loop unrolling

當我們需要重複做同一件事情的時候，會使用for迴圈。但是，在執行的過程中卻會因為branch prediction，而讓程式需要更多的cpu cycle來運算。

因此，這邊將 math-toolkit.h 中的迴圈展開，執行時間從原先的 5.121206 秒下降至 4.347630 秒，節省 0.773576 秒。

# Rendering scene
Done!
Execution time of raytracing() : 4.347630 sec

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 15.90      0.21     0.21 69646433     0.00     0.00  dot_product
 14.74      0.40     0.19 56956357     0.00     0.00  subtract_vector
 12.41      0.56     0.16 10598450     0.00     0.00  normalize
 10.47      0.69     0.14 13861875     0.00     0.00  rayRectangularIntersection
  9.31      0.81     0.12 17821809     0.00     0.00  cross_product
  8.53      0.92     0.11 31410180     0.00     0.00  multiply_vector
  5.43      0.99     0.07 17836094     0.00     0.00  add_vector
  5.04      1.06     0.07 13861875     0.00     0.00  raySphereIntersection
  5.04      1.12     0.07  4620625     0.00     0.00  ray_hit_object
  3.10      1.16     0.04        1     0.04     1.29  raytracing
  2.33      1.19     0.03  2110576     0.00     0.00  localColor
  2.33      1.22     0.03  1048576     0.00     0.00  ray_color
  1.55      1.24     0.02  4221152     0.00     0.00  multiply_vectors
  1.16      1.26     0.02  3838091     0.00     0.00  length
  0.78      1.27     0.01  2520791     0.00     0.00  idx_stack_top
  0.78      1.28     0.01  2110576     0.00     0.00  compute_specular_diffuse
  0.78      1.29     0.01  1241598     0.00     0.00  reflection
  0.39      1.29     0.01   113297     0.00     0.00  fresnel
  0.00      1.29     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      1.29     0.00  1241598     0.00     0.00  protect_color_overflow

Force GCC to inline function

在 math-toolkit.h 中每個函數前都有 inline 的修飾字，這是要建議編譯器，如果函數很小的話，可以直接內嵌到程式碼中，藉此節省呼叫函數的 overhead。

由上述可以知道，編譯器也可忽視你的建議，所以在這邊透過加入 __attribute__ ((always_inline)) 來強制編譯器做到內嵌的效果。

執行時間從 4.347630 秒降至 2.109358 秒，共減少 2.238272 秒。
從 gprof 的輸出中也可以看到 dot_product、subtract_vector 等等的函數從列表中消失了，說明了他們成功的被 inline 進去。

# Rendering scene
Done!
Execution time of raytracing() : 2.109358 sec

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 34.93      0.44     0.44 13861875     0.00     0.00  rayRectangularIntersection
 23.82      0.74     0.30 13861875     0.00     0.00  raySphereIntersection
 16.67      0.95     0.21  2110576     0.00     0.00  compute_specular_diffuse
  9.53      1.07     0.12  4620625     0.00     0.00  ray_hit_object
  6.35      1.15     0.08  1048576     0.00     0.00  ray_color
  4.76      1.21     0.06  2110576     0.00     0.00  localColor
  1.59      1.23     0.02        1     0.02     1.26  raytracing
  0.79      1.24     0.01  1241598     0.00     0.00  protect_color_overflow
  0.79      1.25     0.01  1048576     0.00     0.00  rayConstruction
  0.79      1.26     0.01    37595     0.00     0.00  idx_stack_pop
  0.00      1.26     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      1.26     0.00  2520791     0.00     0.00  idx_stack_top
  0.00      1.26     0.00  1241598     0.00     0.00  reflection
  0.00      1.26     0.00  1241598     0.00     0.00  refraction
  0.00      1.26     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      1.26     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      1.26     0.00   113297     0.00     0.00  fresnel
  0.00      1.26     0.00        3     0.00     0.00  append_rectangular
  0.00      1.26     0.00        3     0.00     0.00  append_sphere
  0.00      1.26     0.00        2     0.00     0.00  append_light
  0.00      1.26     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      1.26     0.00        1     0.00     0.00  delete_light_list
  0.00      1.26     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      1.26     0.00        1     0.00     0.00  delete_sphere_list
  0.00      1.26     0.00        1     0.00     0.00  diff_in_second
  0.00      1.26     0.00        1     0.00     0.00  write_to_ppm

截至目前為止，效能比較圖如下

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

SIMD (AVX)

將向量部份的運算改用SIMD指令操作。

使用前要在 source code 中 #include <immintrin.h>，也要在Makefile中的編譯選項加入 -mavx。

從 force inline 前的 gprof 結果可以看到 dot_product 佔的比例最高，所以先從他下手。












static inline __attribute__ ((always_inline))
double dot_product(const double *v1, const double *v2)
{
    __m256d t1 = _mm256_set_pd(v1[0], v1[1], v1[2], 0.0);
    __m256d t2 = _mm256_set_pd(v2[0], v2[1], v2[2], 0.0);
    __m256d mul = _mm256_mul_pd(t1, t2);
    __m256d hadd = _mm256_hadd_pd(mul, mul);
    __m128d t3 = _mm256_extractf128_pd(hadd, 1);
    __m128d t4 = _mm256_extractf128_pd(mul, 0);

    return _mm_add_pd(t3, t4)[1];
}

結果時間沒有縮短，反而變長了…

# Rendering scene
Done!
Execution time of raytracing() : 4.482411 sec

在 SIMD 的部份，看起來不只是把他換成 AVX 指令那麼簡單。應該也要使用相關的 prefetch 指令先把資料載進去，才能夠發揮應有的表現。

後來決定利用 _mm256_load_pd 來試試看。這個函數會從記憶體載入 256bit 的資料進 register，而且他還要求記憶體是要對齊的。因此，先將 primitives.h 做了一個小修改。


typedef double __attribute__ ((aligned(32))) point3[4];

再來，將剛剛的 code 修改成









static inline __attribute__ ((always_inline))
double dot_product(const double *v1, const double *v2)
{
    __m256d t1 = _mm256_load_pd(v1);
    __m256d t2 = _mm256_load_pd(v2);
    __m256d mul = _mm256_mul_pd(t1, t2);

    return mul[0] + mul[1] + mul[2];
}

然後，結果好像沒什麼變…

# Rendering scene
Done!
Execution time of raytracing() : 4.432902 sec

OpenMP

參考陳品睿學長Hackpad，發現 Pthread 可以加速程式的速度。

從 gprof 結果看出 rayRectangularIntersection 是比例最大的。讓我們透過 gprof2dot 來看看他的 call graph。

便可以清楚地發現 rayRectangularIntersection 其實是被 raytracing 間接呼叫的。接著我們到 raytracing.c 中看看，就能找到學長說的 for 迴圈。

再來，利用 OpenMP 的方式來實現平行化處理。另外，在編譯的時候也要記得傳入參數 -fopenmp，如此一來才能編譯過。

要注意的是，變數 stk、d 以及 object_color 必須要透過 private clause 宣告為執行緒私有的，否則預設之下在 #pragma omp 前的變數都是 shared，會導致計算結果錯誤。



#pragma omp parallel for num_threads(4) \
            private(stk) private(d)     \
            private(object_color)

OpenMP 可以透過 num_threads 來指定執行緒數量，透過圖表可以發現，在超過CPU數量 (nproc為 4) 後，時間減少的幅度愈來愈小。但奇怪的是，2個thread的時間卻比1個來的多？