2017q1 Homework1 (raytracing)

tags: `PeterTing`

contributed by <PerterTing>

Reviewed by `jack81306`

優化後的方法也可以加入 call graph 來觀察
可以加入給 compile 優化後的結果
可以解釋一下 force inline 會降低時間的理由
可以加入 simd 來優化

開發環境

Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數：2
每通訊端核心數：2
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              61
Model name:            Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
製程：              4
CPU MHz：             1941.699
CPU max MHz:           2700.0000
CPU min MHz:           500.0000
BogoMIPS:              3200.29
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           3072K

linux kernel:

peterting@peterting-MacBookAir:~$ uname -a
Linux peterting-MacBookAir 4.8.0-36-generic

未優化前

執行時間

Execution time of raytracing() : 2.935029 sec

效能測試

使用 Phonebook 作業所提到的 perf 工具來做測試，得到：

 Performance counter stats for './raytracing' (5 runs):

           204,985      cache-misses              #   31.439 % of all cache refs      ( +-  7.09% )
           651,997      cache-references                                              ( +- 10.63% )
    19,469,436,804      instructions              #    2.48  insn per cycle           ( +-  0.00% )
     7,848,363,703      cycles                                                        ( +-  0.10% )

       2.942645053 seconds time elapsed                                          ( +-  0.13% )

call graph

安裝gprof2dot後，可以產生關聯圖，讓我們更清楚funtion之間的關聯

指令參考 hugikun999 的共筆
gprof ./raytracing | gprof2dot | dot -T png -o output.png

gmon.out

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 27.74      0.61     0.61 69646433     0.00     0.00  dot_product
 21.83      1.09     0.48 56956357     0.00     0.00  subtract_vector
 10.01      1.31     0.22 31410180     0.00     0.00  multiply_vector
  7.50      1.48     0.17 17836094     0.00     0.00  add_vector
  5.91      1.61     0.13  4620625     0.00     0.00  ray_hit_object
  5.69      1.73     0.13 13861875     0.00     0.00  rayRectangularIntersection
  4.32      1.83     0.10 13861875     0.00     0.00  raySphereIntersection
  3.18      1.90     0.07 17821809     0.00     0.00  cross_product
  3.18      1.97     0.07 10598450     0.00     0.00  normalize
  2.73      2.03     0.06  2110576     0.00     0.00  compute_specular_diffuse
  2.27      2.08     0.05  4221152     0.00     0.00  multiply_vectors
  1.59      2.11     0.04  1048576     0.00     0.00  ray_color

由此可知 dot_product 被 call 最多次，消耗的時間也是最多，因此，若將時間降低，效能必定也能改善許多。

首次優化 : Loop-unrolling

原始程式碼

double dot_product(const double *v1, const double *v2)
 63 {
 64     double dp = 0.0;
 65     for (int i = 0; i < 3; i++)
 66         dp += v1[i] * v2[i];
 67     return dp;
 68 }

因為loop次數不多，因此直接把他拆開，變成：

double dot_product(const double *v1, const double *v2)
 63 {
 64     double dp = 0.0;
 65 
 66     dp = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];                     
 67     return dp;
 68 }

來看看改變後的執行時間、效能以及gprof所分析出來的數據結果

執行時間

# Rendering scene
Done!
Execution time of raytracing() : 2.534777 sec

執行時間從 2.935s 降低至 2.535s ，減少 13.7%。

效能

 Performance counter stats for './raytracing' (5 runs):

           253,765      cache-misses              #   34.504 % of all cache refs      ( +- 29.15% )
           735,464      cache-references                                              ( +- 34.95% )
    16,750,538,375      instructions              #    2.44  insn per cycle           ( +-  0.00% )
     6,866,806,918      cycles                                                        ( +-  0.57% )

       2.582372910 seconds time elapsed                                          ( +-  0.84% )

instruction 與 cycle 大幅減少

gprof

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 27.03      0.47     0.47 56956357     0.00     0.00  subtract_vector
 13.80      0.71     0.24 69646433     0.00     0.00  dot_product
 11.21      0.91     0.20 17836094     0.00     0.00  add_vector
 10.93      1.10     0.19 31410180     0.00     0.00  multiply_vector
  9.78      1.27     0.17 13861875     0.00     0.00  rayRectangularIntersection
  6.90      1.39     0.12  4620625     0.00     0.00  ray_hit_object
  5.18      1.48     0.09 17821809     0.00     0.00  cross_product
  3.45      1.54     0.06  1048576     0.00     0.00  ray_color
  2.88      1.59     0.05 10598450     0.00     0.00  normalize
  2.30      1.63     0.04  4221152     0.00     0.00  multiply_vectors
  1.73      1.66     0.03 13861875     0.00     0.00  raySphereIntersection
  1.15      1.68     0.02  2110576     0.00     0.00  compute_specular_diffuse

時間從 0.61s 降低至 0.47s，減少 23%。

後來發現 add_vector() 和 subtract_vector() 以及mutiply_vector() 也都花蠻多時間的，因此也對這三個 function 做 loop-unrolling(此次優化有加上 force inline)

發現效能有更大的提升

執行時間

# Rendering scene
Done!
Execution time of raytracing() : 2.051923 sec

執行效能

 Performance counter stats for './raytracing' (5 runs):

           174,025      cache-misses              #   27.751 % of all cache refs      ( +-  5.36% )
           627,101      cache-references                                              ( +-  9.28% )
    12,496,257,428      instructions              #    2.31  insn per cycle           ( +-  0.00% )
     5,416,204,477      cycles                                                        ( +-  0.21% )

       2.030640000 seconds time elapsed                                          ( +-  0.20% )

gprof分析

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.22      0.35     0.35 69646433     0.00     0.00  dot_product
 11.22      0.54     0.19 13861875     0.00     0.00  raySphereIntersection
 10.31      0.71     0.17 56956357     0.00     0.00  subtract_vector
 10.31      0.88     0.17 17821809     0.00     0.00  cross_product
  7.88      1.01     0.13  4620625     0.00     0.00  ray_hit_object
  7.88      1.14     0.13 17836094     0.00     0.00  add_vector
  6.67      1.25     0.11 10598450     0.00     0.00  normalize
  6.06      1.35     0.10  1048576     0.00     0.00  ray_color
  4.55      1.42     0.08 13861875     0.00     0.00  rayRectangularIntersection
  1.82      1.45     0.03 31410180     0.00     0.00  multiply_vector
  1.82      1.48     0.03  4221152     0.00     0.00  multiply_vectors
  1.82      1.51     0.03  2520791     0.00     0.00  idx_stack_top
  1.82      1.54     0.03        1     0.03     1.64  raytracing

由以上三像數據可以得知使用 loop-unrolling 對於效能改善的確有幫助。

第二次優化 : force inline

此次優化很簡單，強制開啟inline功能，看看效能是否能有所改善

執行時間

# Rendering scene
Done!
Execution time of raytracing() : 2.322062 sec

比起首次優化後的時間還要更低

執行效能

 Performance counter stats for './raytracing' (5 runs):

           213,282      cache-misses              #   27.128 % of all cache refs      ( +- 12.64% )
           786,211      cache-references                                              ( +- 17.59% )
    15,010,319,433      instructions              #    2.42  insn per cycle           ( +-  0.00% )
     6,207,870,112      cycles                                                        ( +-  0.41% )

       2.325456367 seconds time elapsed                                          ( +-  0.51% )

cycles, instructions, cache-misses 均有降低，所以 force inline 也可以優化效能

第三次優化：利用OpenMP

看到OpenMP的相關資料後，發現他可以對 for 回圈來進行平行處理，並且只需要短短一行的程式碼：
#pragma omp parallel for

哇！對於有很多回圈並且不會互相干擾的程式碼來說這真的是天上佳音！
但是之後經過實驗發現，若原本的執行時間不長，加上OpenMP的指令只會讓速度更慢。

剛開始時原本只利用上面提到的指令來改善 function raytracing ，不過後來發現效果不彰，後來參考到 Enhance raytracing program 這篇文章，才發現原來是因為我少加了很多參數，之後一一去查參數的功用，才明白箇中道理。

正確指令：
#pragma omp parallel for schedule(guided,1) collapse(2) num_threads(64) private(stk), private(d),private(object_color)

OpenMP 語法：

#pragma omp directive [clause]

參數功用：

類別：
- directive:
  - parallel:Defines a parallel region, which is code that will be executed by multiple threads in parallel.
- clause:
  - schedule:Applies to the for (OpenMP) directive.
  - collapse:Specifies how many loops in a nested loop should be collapsed into one large iteration space and divided according to the schedule clause. The sequential execution of the iterations in all associated loops determines the order of the iterations in the collapsed iteration space.
  - private:Specifies that each thread should have its own instance of a variable.
  - num_threads:Sets the number of threads in a thread team.

此次優化延續上面所有的優化

執行時間

# Rendering scene
Done!
Execution time of raytracing() : 0.951524 sec

時間完全大幅提升！

執行效能

 Performance counter stats for './raytracing' (5 runs):

           463,753      cache-misses              #   24.932 % of all cache refs      ( +-  4.66% )
         1,860,067      cache-references                                              ( +-  5.58% )
    12,546,283,213      instructions              #    1.34  insn per cycle           ( +-  0.01% )
     9,364,090,976      cycles                                                        ( +-  0.22% )

       0.965814707 seconds time elapsed                                          ( +-  0.63% )

除了 instructions 和 cache-misses 比例減少外，其他數值都上升，不過時間下降，推斷是效能測試有把所有平行的處理器的數值都加進來，所以數值會近乎double。

參考資料

使用Gnu gprof进行Linux平台下的程序分析
 簡易的程式平行化方法－OpenMP（一）簡介