2016 Homework 1 - raytracing

Review by TotallyWrong

Raytracying 是可以平行化的可以嘗試做作看，Callgraph 可以讓你了解各個Funcation 之間的關係。可以解釋一下Force inline的功能。

原程式執行結果分析

先看看編譯器最佳化的結果

# Rendering scene
Done!
Execution time of raytracing() : 0.730467 sec

沒有最佳化的執行時間

Execution time of raytracing() : 7.109102 sec

gprof 結果

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 25.63      0.84     0.84 69646433     0.00     0.00  dot_product
 16.78      1.39     0.55 56956357     0.00     0.00  subtract_vector
  9.76      1.71     0.32 10598450     0.00     0.00  normalize
  8.85      2.00     0.29 31410180     0.00     0.00  multiply_vector
  7.93      2.26     0.26 17836094     0.00     0.00  add_vector
  7.78      2.52     0.26 13861875     0.00     0.00  rayRectangularIntersection
  5.49      2.70     0.18 17821809     0.00     0.00  cross_product
  4.88      2.86     0.16  4620625     0.00     0.00  ray_hit_object
  3.20      2.96     0.11 13861875     0.00     0.00  raySphereIntersection
  2.75      3.05     0.09  4221152     0.00     0.00  multiply_vectors
  1.53      3.10     0.05  1048576     0.00     0.00  ray_color
  0.92      3.13     0.03  3838091     0.00     0.00  length
  0.92      3.16     0.03  2110576     0.00     0.00  compute_specular_diffuse
  0.92      3.19     0.03  1241598     0.00     0.00  refraction
  0.92      3.22     0.03        1     0.03     3.28  raytracing
  0.61      3.24     0.02  1241598     0.00     0.00  protect_color_overflow
  0.31      3.25     0.01  1241598     0.00     0.00  reflection
  0.31      3.26     0.01  1048576     0.00     0.00  rayConstruction
  0.31      3.27     0.01   113297     0.00     0.00  fresnel
  0.15      3.28     0.01  2520791     0.00     0.00  idx_stack_top
  0.15      3.28     0.01    37595     0.00     0.00  idx_stack_pop
  0.00      3.28     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      3.28     0.00  2110576     0.00     0.00  localColor
  0.00      3.28     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      3.28     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      3.28     0.00        3     0.00     0.00  append_rectangular
  0.00      3.28     0.00        3     0.00     0.00  append_sphere
  0.00      3.28     0.00        2     0.00     0.00  append_light
  0.00      3.28     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      3.28     0.00        1     0.00     0.00  delete_light_list
  0.00      3.28     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      3.28     0.00        1     0.00     0.00  delete_sphere_list
  0.00      3.28     0.00        1     0.00     0.00  diff_in_second
  0.00      3.28     0.00        1     0.00     0.00  write_to_ppm

perf

Execution time of raytracing() : 7.425032 sec

 Performance counter stats for './raytracing':

            37,344      cache-misses              #   23.133 % of all cache refs    
           161,432      cache-references                                            
    32,983,244,564      instructions              #    1.54  insns per cycle        
    21,383,486,641      cycles                                                      

       7.426804060 seconds time elapsed

可以得知在 dot_product 及 subtract_vector 在執行次數及總執行時間中佔了很大的部份，因此先針對這兩個部份進行優化。

dot_product








static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

subtract_vector







static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] - b[i];
}

dot_product / subtract_vector 優化 by loop-unrolling

採用loop- unrolling 方式優化

執行時間

Execution time of raytracing() : 6.381563 sec

gprof結果

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 13.83      0.30     0.30 10598450     0.00     0.00  normalize
 12.68      0.58     0.28 31410180     0.00     0.00  multiply_vector
 11.99      0.84     0.26 17836094     0.00     0.00  add_vector
 10.61      1.07     0.23 56956357     0.00     0.00  subtract_vector
 10.14      1.29     0.22 69646433     0.00     0.00  dot_product
  8.07      1.46     0.18 13861875     0.00     0.00  rayRectangularIntersection
  6.92      1.61     0.15  4620625     0.00     0.00  ray_hit_object
  5.30      1.73     0.12 17821809     0.00     0.00  cross_product

其他的運算(add_vector, multiply_vectors)也有類似的方式就一起改善

Execution time of raytracing() : 6.075147 sec

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 14.48      0.28     0.28 31410180     0.00     0.00  multiply_vector
 14.22      0.55     0.27 10598450     0.00     0.00  normalize
 13.17      0.80     0.25 56956357     0.00     0.00  subtract_vector
 12.90      1.04     0.25 69646433     0.00     0.00  dot_product
 12.11      1.27     0.23 13861875     0.00     0.00  rayRectangularIntersection
  4.74      1.36     0.09 17821809     0.00     0.00  cross_product
  4.74      1.45     0.09 13861875     0.00     0.00  raySphereIntersection
  4.74      1.54     0.09  4620625     0.00     0.00  ray_hit_object

Execution time of raytracing() : 7.425032 sec

 Performance counter stats for './raytracing':

            37,344      cache-misses              #   23.133 % of all cache refs    
           161,432      cache-references                                            
    32,983,244,564      instructions              #    1.54  insns per cycle        
    21,383,486,641      cycles                                                      

       7.426804060 seconds time elapsed

但是多跑了幾次同樣的程式，執行時間跳動幅度很大，從6.0xxxxx到6.6xxxxx都有出現，原因尚未了解。

force inline

影片中提及inline並不一定會在未優化compiler 狀況下執行，因此針對此點進行改善，不只單單只將gprof的前幾名force inline ，嘗試將所有有inline 的function 都做相同處理。











static inline __attribute__((always_inline))
void normalize(double *v)
{
    double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    assert(d != 0.0 && "Error calculating normal");

    v[0] /= d;
    v[1] /= d;
    v[2] /= d;
}

執行時間

Execution time of raytracing() : 3.098384 sec

gprof 結果

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 33.83      0.70     0.70 13861875     0.00     0.00  rayRectangularIntersection
 15.46      1.02     0.32 13861875     0.00     0.00  raySphereIntersection
 13.53      1.30     0.28  2110576     0.00     0.00  localColor
 12.08      1.55     0.25  2110576     0.00     0.00  compute_specular_diffuse
  8.22      1.72     0.17  1048576     0.00     0.00  ray_color
  5.32      1.83     0.11  4620625     0.00     0.00  ray_hit_object
  3.87      1.91     0.08  1241598     0.00     0.00  reflection
  3.38      1.98     0.07  1048576     0.00     0.00  rayConstruction
  0.97      2.00     0.02  2558386     0.00     0.00  idx_stack_empty
  0.97      2.02     0.02  1241598     0.00     0.00  refraction
  0.97      2.04     0.02        1     0.02     2.07  raytracing

降低的時間多達3秒，一開始看到也不太敢相信，需要再找資料看看inline 相關原理及效果！
發現新的可能優化目標(rayRectangularIntersection, raySphereIntersection)

SIMD-AVX

嘗試使用AVX改善(僅corss_product)
一開始的時候以為每一個數學運算都可以用這個方式改善，後來忘記local varible 的問題，浪費時間！
_mm256_mul_pd 輸出的結果不是照順序的，原本應該是dp = f[0]+f[1]+f[2]，但答案不對，要改成 dp = f[1]+f[2]+f[3]，需要再找看看輸出方式。
執行時間

# Rendering scene
Done!
Execution time of raytracing() : 4.239852 sec

比force inline 多了約一秒左右

參考資料

TotallyWrong 共筆
 inline 了解 - function attribute
force inline方式