2016q3 Homework1 (raytracing)

contributed by <RayPan>

Reviewed bt `finalallpass`

參考共筆可以附上共筆人的名子。
可以將修改的inline程式碼打出來方便參考。

開發環境:

Ubuntu 16.04 LTS
CPU

raypan@raypan-X550JD:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Model name:            Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz
Stepping:              3
CPU MHz:               2782.281
CPU max MHz:           3400.0000
CPU min MHz:           800.0000
BogoMIPS:              5587.62
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3

安裝套件
$ sudo apt-get update
$ sudo apt-get install graphviz ：可以繪製各類示意圖
$ sudo apt-get install imagemagick ：可以利用convert將輸出的圖檔轉換格式

使用gprof

簡介

gprof可以用來檢視程式執行時間、執行次數等等的工具，藉由產生的文件得以檢視程式問題。
使用過程

$ make clean
$ make PROFILE=1 ：不同於$ make，會在編譯時加入-pg參數，用途在於呼叫函式開始和結束的時候會加入統計資料，實務上稱為instrumentation。
$ ./raytracing :執行後會多產生一個由gprof分析後產生的profiling： gmon.out
$ gprof -b raytracing gmon.out | less :加入-b可以移除統計圖表中詳細文字描述less可以以頁面方式瀏覽

可能的效能改進方法

Loop unrolling
force inline
MARCO
Open MD

改善過程

參考共筆

未修改之前

由於使用gprof中間會執行額外程式碼
執行時間會增加

Execution time of raytracing() : 5.140832 sec

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.87      0.48     0.48 69646433     0.00     0.00  dot_product
 18.58      0.87     0.39 56956357     0.00     0.00  subtract_vector
  9.53      1.07     0.20 31410180     0.00     0.00  multiply_vector
  8.10      1.24     0.17 13861875     0.00     0.00  rayRectangularIntersection
  7.62      1.40     0.16  4620625     0.00     0.00  ray_hit_object
  6.67      1.54     0.14 17836094     0.00     0.00  add_vector
  5.72      1.66     0.12 17821809     0.00     0.00  cross_product
  5.72      1.78     0.12 13861875     0.00     0.00  raySphereIntersection
  5.24      1.89     0.11 10598450     0.00     0.00  normalize
  3.34      1.96     0.07  4221152     0.00     0.00  multiply_vectors
  3.10      2.03     0.07  1048576     0.00     0.00  ray_color
  1.43      2.06     0.03  2110576     0.00     0.00  localColor
  0.95      2.08     0.02  2110576     0.00     0.00  compute_specular_diffuse
  0.48      2.09     0.01  2520791     0.00     0.00  idx_stack_top
  0.48      2.10     0.01  1048576     0.00     0.00  rayConstruction
  0.00      2.10     0.00  3838091     0.00     0.00  length
  0.00      2.10     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      2.10     0.00  1241598     0.00     0.00  protect_color_overflow

使用perf觀測cache-misses等情形
$perf stat -r 10 -e cache-misses,cache-references,instructions,cycles ./raytracing

 Performance counter stats for './raytracing' (10 runs):

           114,364      cache-misses              #   24.126 % of all cache refs      ( +-  5.19% )
           474,038      cache-references                                              ( +-  7.13% )
    33,007,085,278      instructions              #    1.90  insns per cycle          ( +-  0.01% )
    17,415,754,020      cycles                                                        ( +-  0.30% )

       5.155209650 seconds time elapsed                                          ( +-  0.29% )

1.使用loop unrolling

由於for迴圈每次都要執行終止條件的判斷
所以把for展開捨去判斷的步驟可以減少執行時間

優點
- 減少branch penalty
- 若迴圈內的statement彼此獨立，則statement可能做平行運算
缺點
- 使程式膨脹，降低可讀性

以math-toolkit.h內的subtract_vector為例
將for回圈展開如下：






void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}

執行看看
Execution time of raytracing() : 4.409033 sec

首先可以發現執行時間下降了0.7秒

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 17.17      0.23     0.23 69646433     0.00     0.00  dot_product
 17.17      0.46     0.23 13861875     0.00     0.00  rayRectangularIntersection
 11.20      0.61     0.15 10598450     0.00     0.00  normalize
 10.45      0.75     0.14 56956357     0.00     0.00  subtract_vector
  7.47      0.85     0.10  4620625     0.00     0.00  ray_hit_object
  6.72      0.94     0.09 17821809     0.00     0.00  cross_product
  6.35      1.03     0.09 31410180     0.00     0.00  multiply_vector
  3.73      1.08     0.05 13861875     0.00     0.00  raySphereIntersection
  3.73      1.13     0.05  1048576     0.00     0.00  ray_color
  3.36      1.17     0.05 17836094     0.00     0.00  add_vector
  3.36      1.22     0.05  4221152     0.00     0.00  multiply_vectors
  2.99      1.26     0.04  2520791     0.00     0.00  idx_stack_top
  2.24      1.29     0.03  1241598     0.00     0.00  refraction
  1.49      1.31     0.02  2110576     0.00     0.00  compute_specular_diffuse
  1.49      1.33     0.02  2110576     0.00     0.00  localColor
  0.75      1.34     0.01        1     0.01     1.34  raytracing
  0.37      1.34     0.01  3838091     0.00     0.00  length
  0.00      1.34     0.00  2558386     0.00     0.00  idx_stack_empty

使用gprof後分析後
發現subtract_vector執行時間由原本的0.39秒下降至0.14秒
multiply_vector也由0.2秒降至0.09秒

 Performance counter stats for './raytracing' (10 runs):

            85,665      cache-misses              #   18.558 % of all cache refs      ( +-  4.49% )
           461,603      cache-references                                              ( +-  7.24% )
    26,164,199,466      instructions              #    1.72  insns per cycle          ( +-  0.00% )
    15,251,178,470      cycles                                                        ( +-  0.72% )

       4.521390109 seconds time elapsed                                          ( +-  0.69% )

cache-misses也由24.126%下降至18.558%

2.使用force inline

An Inline Function is As Fast As a Macro
由於沒有開啟gcc優化功能,所以在每個函式後面加上
__attribute__((always_inline))
強制使用inline功能

inline: 在函式前面加入關鍵字 inline , 此函式就可以「建議」編譯器將之設定為「行內函式」(Inline function)，如果編譯器評估效能有改善而建議被採納，則該函式會自動在呼叫點展開為程式碼。

執行看看
Execution time of raytracing() : 2.217506 sec
可以發現執行時間大幅下降

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 36.37      0.52     0.52 13861875     0.00     0.00  rayRectangularIntersection
 23.08      0.85     0.33 13861875     0.00     0.00  raySphereIntersection
  9.79      0.99     0.14  2110576     0.00     0.00  compute_specular_diffuse
  9.79      1.13     0.14  1048576     0.00     0.00  ray_color
  5.60      1.21     0.08  4620625     0.00     0.00  ray_hit_object
  4.90      1.28     0.07  2110576     0.00     0.00  localColor
  4.20      1.34     0.06  1048576     0.00     0.00  rayConstruction
  2.80      1.38     0.04  1241598     0.00     0.00  reflection
  2.10      1.41     0.03        1     0.03     1.43  raytracing
  1.40      1.43     0.02  1241598     0.00     0.00  refraction
  0.00      1.43     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      1.43     0.00  2520791     0.00     0.00  idx_stack_top
  0.00      1.43     0.00  1241598     0.00     0.00  protect_color_overflow
  0.00      1.43     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      1.43     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      1.43     0.00   113297     0.00     0.00  fresnel
  0.00      1.43     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      1.43     0.00        3     0.00     0.00  append_rectangular

可以發現在math-toolkits.h內的函式
原本在排行榜前幾名的
做了force inline後已經掉到後面

3.使用MARCO

把dot_product換成


#define dot_product(a,b) ((a[0]*b[0])+(a[1]*b[1])+(a[2]*b[2]))

執行看看
Execution time of raytracing() : 2.041964 sec
可以發現執行時間比force inline又略為下降

Performance counter stats for './raytracing' (10 runs):

           259,586      cache-misses              #   31.459 % of all cache refs      ( +-  8.95% )
           825,149      cache-references                                              ( +-  9.63% )
    12,050,482,702      instructions              #    1.74  insns per cycle          ( +-  0.00% )
     6,934,713,035      cycles                                                        ( +-  0.30% )

       2.070231925 seconds time elapsed                                          ( +-  0.35% )

4.OpenMD

首先以下是我的cpu資訊
每個核心執行緒數為2

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1

使用方法
- 在MAKEFILE的CFLAGS前加入-fopenmp以及LDFLAGS前加入-lgomp
- 在raytracing.c內#include <omp.h>並加入(thread數目設為2)


#pragma omp parallel for num_threads(1024), private(d, object_color, stk)

執行看看

thread = 2 Execution time of raytracing() : 2.293012 sec
可以發現到thread數愈多執行時間愈短
thread = 4 Execution time of raytracing() : 1.537311 sec
thread = 8 Execution time of raytracing() : 1.348184 sec
thread = 16 Execution time of raytracing() : 1.322113 sec
thread = 32 Execution time of raytracing() : 1.277774 sec
thread = 64 Execution time of raytracing() : 1.118317 sec
thread = 128 Execution time of raytracing() : 1.095946 sec
thread = 256 Execution time of raytracing() : 1.104022 sec
thread = 512 Execution time of raytracing() : 1.073616 sec
愈來愈接近電腦優化目標。

然而最多只能執行到2¹³ = 8192個

加工中

筆記

簡單區分：
define（定義）：寫出程式碼/declare(宣告)：沒有寫出程式碼
static: define的程式碼僅在該檔案有效。
inline: 提示編譯器做最佳化時，將function call轉為程式碼，可減少呼叫程式成本。

作業詳細資訊：A02: raytracing

2016q3 Homework1 (raytracing)

Reviewed bt finalallpass