2017q1 Homework1 (raytracing)

contributed by<zhanyangch>

tags:`zhanyangch` `sysprog2017` `week1`

Reviewed by `etc276`

執行環境的資訊需要排版以利閱讀，另外可以考慮展示出程式碼時才使用 clike=，而展示 terminal 上的資訊不用 = 來顯示行數
commit message 可以以大寫開頭，並將修改的部分寫在內文，例如這次 commit 可以在內文補上 "rewrite the function in math-toolkit.h by loop unrolling to optimization"
可以使用 force inline 或 macro 等方法降低 function call 所花的時間

執行環境

lscpu 的資訊

$ lscpu
Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數：2
每通訊端核心數：2
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              42
Model name:            Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
製程：              7
CPU MHz：             855.421
CPU max MHz:           3500.0000
CPU min MHz:           800.0000
BogoMIPS:              5587.06
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           4096K
NUMA node0 CPU(s):     0-3

執行原始程式

使用 make PROFILE=1 ，加入 gprof 所需的編譯及連結參數，gprof 會在函數內加入 mcount，以繪製函數的調用圖，因此執行時間較長，但利用gprof 可以快速的讓我們找到程式改進的方向。











$ make 
$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 3.248656 sec
$ make clean 
$ make PROFILE=1
./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 6.375895 sec

將輸出檔案轉換成 .png

$ convert out.ppm out.png

使用 gprof 觀察，可以發現耗時的前幾名都是數學運算，試著去改善此部份。

gprof -b raytracing gmon.out | less
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 24.41      0.61     0.61 69646433     0.00     0.00  dot_product
 20.41      1.12     0.51 56956357     0.00     0.00  subtract_vector
 12.21      1.43     0.31 31410180     0.00     0.00  multiply_vector
  9.21      1.66     0.23 10598450     0.00     0.00  normalize
  8.00      1.86     0.20 17836094     0.00     0.00  add_vector
  5.40      1.99     0.14 13861875     0.00     0.00  rayRectangularIntersection
  4.80      2.11     0.12  4620625     0.00     0.00  ray_hit_object
  3.00      2.19     0.08 13861875     0.00     0.00  raySphereIntersection

loop unrolling

觀察 math-toolkit.h，可以發現 dot_product 等向量運算的迴圈執行次數是固定的，因此我們可以將它預先展開，此方法可以不用 branch，降低執行時間。

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name   
 27.52      0.55     0.55 69646433     0.00     0.00  dot_product
 10.51      0.76     0.21 56956357     0.00     0.00  subtract_vector
 10.51      0.97     0.21 10598450     0.00     0.00  normalize
  8.76      1.15     0.18 13861875     0.00     0.00  rayRectangularIntersection
  8.51      1.32     0.17 17821809     0.00     0.00  cross_product
  8.25      1.48     0.17 13861875     0.00     0.00  raySphereIntersection
  6.50      1.61     0.13 17836094     0.00     0.00  add_vector
  4.50      1.70     0.09 31410180     0.00     0.00  multiply_vector
  3.50      1.77     0.07  4620625     0.00     0.00  ray_hit_object
  2.00      1.81     0.04  4221152     0.00     0.00  multiply_vectors
  2.00      1.85     0.04  2110576     0.00     0.00  compute_specular_diffuse
  2.00      1.89     0.04  1048576     0.00     0.00  ray_color
  1.00      1.91     0.02  1241598     0.00     0.00  protect_color_overflow
  1.00      1.93     0.02  1241598     0.00     0.00  refraction
  1.00      1.95     0.02  1048576     0.00     0.00  rayConstruction
  1.00      1.97     0.02        1     0.02     1.99  raytracing
  0.50      1.98     0.01  2520791     0.00     0.00  idx_stack_top
  0.50      1.99     0.01  2110576     0.00     0.00  localColor

$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 2.242073 sec
$ make clean 
$ make PROFILE=1
./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 5.370418 sec

使用openmp

在 openmp 範圍外的變數預設是共用的，對於需要在執行序間獨立的變數，可以使用 private(list)，num_threads (n) 使用 n 個 thread。




#pragma omp parallel for num_threads (2) private(stk,object_color,d)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
        ...

在加入 gprof 的情況下，需要的時間會比原本的多，但在不加入 gprof 的情況下，時間從 2.24 降低至 1.22 。

$ make
$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 1.221488 sec

$ make PROFILE=1
$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 9.702812 sec

測試不同的 num_threads ，可以發現除了 1、2 之間差距很大，再增加 thread 並無太大的效益，因為 cpu 的核心數目有限。

參考資料

資訊系多處理機平行程式PPT

2017q1 Homework1 (raytracing)

tags:zhanyangch sysprog2017 week1

Reviewed by etc276

執行環境

執行原始程式

loop unrolling

使用openmp

參考資料

tags:`zhanyangch` `sysprog2017` `week1`

Reviewed by `etc276`