2017q1 Homework1 (raytracing)

tags: `embedded`

contributed by <Cayonliow>
tags: gprof loop unrolling Inline template programming gcc version

Reviewed by `king1224`

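Some questions about using metaprogramming to force loop unrolling
First, I am not sure how the all-black image was produced. Templates are a C++ feature, so this code cannot be compiled as C with gcc and needs g++, yet the Makefile shows that this assignment is built with gcc.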
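The arguments passed to dot_product are mainly variables of type point3. primitives.h shows that point3 declares a double array of length 3, which is why the original dot_product takes double parameters and returns double. The code you posted below uses int throughout, so the type conversion will produce wrong values.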

Assignment

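Topic: B02: raytracing
GitHub (original): raytracing
Assignment walkthrough: Spring 2016 System Programming course, Homework 2 + Homework 3 walkthrough; Spring 2016 System Programming course, Homework 2 + Homework 3 walkthrough (video)
Walkthrough of the reference implementation: 2016/5/3 Embedded System HW5 Team 4, and the corresponding shared notes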

Development environment

cayon@cayon-X550JX:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Model name:            Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
Stepping:              3
CPU MHz:               2423.484
CPU max MHz:           3600.0000
CPU min MHz:           800.0000
BogoMIPS:              5188.45
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7

Tools

GNU gprof
- Articles: gprof, Program analysis on Linux with GNU gprof (1)

Principles and concepts

Loop unrolling
- Articles: Wikipedia, PTT, C++ Loop Unrolling Using Metaprogramming
Template metaprogramming
- Articles: Wikipedia, Bahamut (巴哈姆特), Writing a multiplication-table program with C++ template meta-programming
Inline
- Stack Overflow: force gcc, Stack Overflow: differences,
  Inline functions: the meaning of static inline and extern inline; [C++] Inline function notes; Forcing a function to always be called inline

Development log

~2 March:

Studied gprof. After seeing the gcc version issue raised in the Facebook discussion group, I checked my own gcc version:

gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

The gcc 6 issue does not apply here, so there is no need to reinstall (related article).

Original version

Run $ make, then $ ./raytracing
- Execution time: 2.393859 sec

# Rendering scene
Done!
Execution time of raytracing() : 2.393859 sec


I then built and ran the program with gprof profiling enabled:

$ make clean
$ make PROFILE=1
$ ./raytracing
$ gprof ./raytracing | less

Execution time:

# Rendering scene
Done!
Execution time of raytracing() : 5.026474 sec

Output:

subtract_vector was called 56956357 times and took 20.34% of the time; dot_product was called 69646433 times and took 15.94% of the time. These two functions are the most expensive.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 20.34      0.37     0.37 56956357     0.00     0.00  subtract_vector
 15.94      0.66     0.29 69646433     0.00     0.00  dot_product
 11.27      0.87     0.21 17836094     0.00     0.00  add_vector
 10.45      1.06     0.19 10598450     0.00     0.00  normalize
  8.25      1.21     0.15 31410180     0.00     0.00  multiply_vector
  6.32      1.32     0.12 13861875     0.00     0.00  rayRectangularIntersection
  5.77      1.43     0.11 13861875     0.00     0.00  raySphereIntersection
  4.95      1.52     0.09 17821809     0.00     0.00  cross_product
  3.85      1.59     0.07  4620625     0.00     0.00  ray_hit_object
  3.85      1.66     0.07  1048576     0.00     0.00  ray_color
  2.20      1.70     0.04  4221152     0.00     0.00  multiply_vectors
  1.92      1.73     0.04  1048576     0.00     0.00  rayConstruction
  1.65      1.76     0.03  1241598     0.00     0.00  refraction
  1.10      1.78     0.02  2110576     0.00     0.00  localColor
  0.55      1.79     0.01  1241598     0.00     0.00  reflection
  0.55      1.80     0.01  1204003     0.00     0.00  idx_stack_push
  0.55      1.81     0.01        1     0.01     1.82  raytracing
  0.27      1.82     0.01  3838091     0.00     0.00  length

Optimized versions

Version 1: Loop unrolling

Reference: 邁向王者的旅途

static inline
double dot_product(const double *v1, const double *v2)
{
    return (v1[0] * v2[0]) + (v1[1] * v2[1]) + (v1[2] * v2[2]);
}

Execution time:

It did get about 0.5 sec faster (4.517215 sec < 5.026474 sec)

# Rendering scene
Done!
Execution time of raytracing() : 4.517215 sec

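Then I rewrote every function in math-toolkit.h using loop unrolling
- Execution time:
  - Faster again by about 0.25 sec (4.266808 sec < 4.517215 sec < 5.026474 sec)
  - The time only drops a little here because the other functions are called less often than dot_product, so the change is small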

# Rendering scene
Done!
Execution time of raytracing() : 4.266808 sec

I tried to use metaprogramming to force loop unrolling, but it failed; I am still looking into why

Please pay attention to code formatting; indentation should be four spaces
Course TA

Understood, thank you
Cayon

// general template for loop unrolling
template <int N>
int dot_product (const int* a, const int* b)
{
    return ((*a) * (*b)) + dot_product<N-1> (a + 1, b + 1);
}

// template specialization
template <>
int dot_product<1> (const int* a, const int* b)
{
    return (*a) * (*b);
}

// usage
dot_product<3>(a, b);

out.ppm ended up as an all-black image (the one mentioned in the review note at the top).

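Version 2: Inline

After the Version 1 optimization, I used gprof to look at how the execution time is distributed

Every function's execution time got shorter, most noticeably for subtract_vector and dot_product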

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 15.10      0.18     0.18 56956357     0.00     0.00  subtract_vector
 13.80      0.34     0.16 69646433     0.00     0.00  dot_product
 11.21      0.47     0.13 10598450     0.00     0.00  normalize
  9.06      0.57     0.11 17821809     0.00     0.00  cross_product
  7.76      0.66     0.09 13861875     0.00     0.00  raySphereIntersection
  7.76      0.75     0.09  4620625     0.00     0.00  ray_hit_object
  6.90      0.83     0.08  1048576     0.00     0.00  ray_color
  5.61      0.90     0.07 31410180     0.00     0.00  multiply_vector
  4.74      0.95     0.06 17836094     0.00     0.00  add_vector
  3.45      0.99     0.04 13861875     0.00     0.00  rayRectangularIntersection
  3.45      1.03     0.04        1     0.04     1.16  raytracing
  2.59      1.06     0.03  4221152     0.00     0.00  multiply_vectors
  2.59      1.09     0.03  2110576     0.00     0.00  compute_specular_diffuse
  1.73      1.11     0.02  2110576     0.00     0.00  localColor
  0.86      1.12     0.01  3838091     0.00     0.00  length
  0.86      1.13     0.01  2520791     0.00     0.00  idx_stack_top
  0.86      1.14     0.01  1241598     0.00     0.00  protect_color_overflow
  0.86      1.15     0.01  1241598     0.00     0.00  refraction
  0.86      1.16     0.01  1048576     0.00     0.00  rayConstruction
  0.00      1.16     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      1.16     0.00  1241598     0.00     0.00  reflection
  0.00      1.16     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      1.16     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      1.16     0.00   113297     0.00     0.00  fresnel
  0.00      1.16     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      1.16     0.00        3     0.00     0.00  append_rectangular
  0.00      1.16     0.00        3     0.00     0.00  append_sphere

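Quoted from: [C++] Inline function notes

Even if you add inline to request an inline function, the compiler will not necessarily honor it; the compiler decides. If the code inside the function takes longer to run than the function call itself, inlining saves relatively little. Conversely, if the code inside the function is very short, inlining saves a larger share of the call overhead, and if the function is called frequently, inlining is also more efficient.
Every function in math-toolkit.h is declared static inline on its first line, but because compiler optimization is turned off, the compiler ignores the inline hint on all of them and no function is actually inlined.

So the functions have to be forced to always be called inline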

static inline __attribute__((always_inline))
double dot_product(const double *v1, const double *v2)
{
    return (v1[0] * v2[0]) + (v1[1] * v2[1]) + (v1[2] * v2[2]);
}

Result:

# Rendering scene
Done!
Execution time of raytracing() : 2.027904 sec

Surprisingly, the execution time dropped by more than half (2.027904 sec < 4.266808 sec < 4.517215 sec < 5.026474 sec)

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 37.02      0.47     0.47 13861875     0.00     0.00  rayRectangularIntersection
 15.75      0.67     0.20 13861875     0.00     0.00  raySphereIntersection
 14.18      0.85     0.18  2110576     0.00     0.00  compute_specular_diffuse
 11.81      1.00     0.15  2110576     0.00     0.00  localColor
  7.88      1.10     0.10  1048576     0.00     0.00  ray_color
  4.73      1.16     0.06  4620625     0.00     0.00  ray_hit_object
  3.15      1.20     0.04  1241598     0.00     0.00  refraction
  2.36      1.23     0.03        1     0.03     1.27  raytracing
  1.58      1.25     0.02  1048576     0.00     0.00  rayConstruction
  0.79      1.26     0.01  1241598     0.00     0.00  protect_color_overflow
  0.79      1.27     0.01  1241598     0.00     0.00  reflection
  0.00      1.27     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      1.27     0.00  2520791     0.00     0.00  idx_stack_top
  0.00      1.27     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      1.27     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      1.27     0.00   113297     0.00     0.00  fresnel
  0.00      1.27     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      1.27     0.00        3     0.00     0.00  append_rectangular
  0.00      1.27     0.00        3     0.00     0.00  append_sphere
  0.00      1.27     0.00        2     0.00     0.00  append_light
  0.00      1.27     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      1.27     0.00        1     0.00     0.00  delete_light_list
  0.00      1.27     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      1.27     0.00        1     0.00     0.00  delete_sphere_list
  0.00      1.27     0.00        1     0.00     0.00  diff_in_second
  0.00      1.27     0.00        1     0.00     0.00  write_to_ppm

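Functions such as subtract_vector and dot_product, which were expensive in the previous version, no longer appear in the profile, because they are now inline functions that have been expanded into their callers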