Try   HackMD

2017 q1 Homework1 (raytracing)

contributed by <zmke>

Reviewed by twzjwang

Reviewed by henry0929016816

  • 比較了 thread 數的差異 ,這一點我覺得做的很好
  • 可以試試看 SIMD

開發環境

OS: Ubuntu 16.04 LTS
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
Model name: Intel® Core i5-4200H CPU @ 2.80GHz
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 3072K

gprof

  • 可以印出程式執行時各函數消耗的函數、 funtion 之間互相呼叫的關係,幫助 programmer 分析出消耗較多時間的函數,找出當作改善效能的方向,尤其是對大型程式來說是相當實用的工具

  • $ make PROFILE=1: 加入 -pg , compiler 會在程式中間插入額外的程式碼,執行後會產生 gmon.out

  • $ gprof ./raytracing | less

原始版本

  • 直接執行 clone 下來的程式

$ make
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 2.499698 sec
  • 使用 gprof 進行分析

$ make PROFILE=1
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 5.441733 sec

compiler 在程式中加入額外程式碼,執行時間增加了

$ gprof ./raytracing | less

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 25.02      0.51     0.51 69646433     0.00     0.00  dot_product
 14.71      0.81     0.30 56956357     0.00     0.00  subtract_vector
 11.28      1.04     0.23 31410180     0.00     0.00  multiply_vector
 10.06      1.25     0.21 13861875     0.00     0.00  rayRectangularIntersection
  7.85      1.41     0.16 17836094     0.00     0.00  add_vector
  7.36      1.56     0.15 10598450     0.00     0.00  normalize
  5.89      1.68     0.12  4620625     0.00     0.00  ray_hit_object
  4.41      1.77     0.09 17821809     0.00     0.00  cross_product
  3.19      1.83     0.07 13861875     0.00     0.00  raySphereIntersection
  2.45      1.88     0.05  1048576     0.00     0.00  ray_color
  1.96      1.92     0.04  2110576     0.00     0.00  compute_specular_diffuse
  1.96      1.96     0.04        1     0.04     2.04  raytracing

可以清楚看到時間集中在 dot_product 、 subtract_vector 、 multiply_vector ,這些 function 主要用來進行三維向量運算,可以在 math-toolkit.h 找到定義

  • 用 perf 來分析 branch miss
 Performance counter stats for './raytracing' (10 runs):

        19,296,540      branch-misses             #    0.40% of all branches          ( +-  0.43% )
     4,782,031,932      branches                                                      ( +-  0.00% )
    32,981,345,906      instructions              #    1.90  insn per cycle           ( +-  0.00% )
    17,360,336,322      cycles                                                        ( +-  0.22% )

       5.137705397 seconds time elapsed                                          ( +-  0.22% )

compiler 優化

  • 實際動手之前先來看看 compiler 到底有多強大,更改優化選項 -Ofast

$ make PROFILE=1
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 0.518435 sec

這效果
$ gprof ./raytracing | less

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 52.21      0.24     0.24  4620625    51.98    51.98  ray_hit_object
 21.76      0.34     0.10  2110576    47.42    47.42  compute_specular_diffuse
 10.88      0.39     0.05   592239    84.49   408.09  ray_color
 10.88      0.44     0.05                             raytracing
  2.18      0.45     0.01  2110576     4.74     4.74  localColor
  2.18      0.46     0.01  1241598     8.06     8.06  refraction
  0.00      0.46     0.00   113297     0.00     0.00  fresnel
  • 發現剛剛看到最花時間的 function 都沒在表上,應該是 inline 發揮作用了,在 function 的定義前加入 inline 提示 compiler 在編譯的時候就將 function 直接展開,減少呼叫 function 所需要成本

Method1 - force inline

  • 實驗將 math-toolkit.h 裡面的 inline 改成 __attribute__((always_inline)) inline ,讓 compiler 在關閉編譯最佳化的情形下將 function inline

$ make PROFILE=1
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 2.967841 sec

$ gprof ./raytracing | less

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 40.48      0.87     0.87 13861875     0.00     0.00  rayRectangularIntersection
 18.61      1.27     0.40 13861875     0.00     0.00  raySphereIntersection
 10.93      1.51     0.24  2110576     0.00     0.00  compute_specular_diffuse
  7.91      1.68     0.17  1048576     0.00     0.00  ray_color
  6.51      1.82     0.14  2110576     0.00     0.00  localColor
  6.28      1.95     0.14  4620625     0.00     0.00  ray_hit_object
  4.89      2.06     0.11  1048576     0.00     0.00  rayConstruction
  1.86      2.10     0.04  1241598     0.00     0.00  reflection
  1.40      2.13     0.03        1     0.03     2.15  raytracing
  • 時間從 5.4417 sec 下降到 2.9678 sec ,從 gprof 的報表上也看不到剛剛在 math-tool-kit.h force inline 的 function

Method2 - loop unrolling

  • 將迴圈展開,減少 branch misses 帶來的成本,不過可能會造成程式碼變冗長、可讀性降低的問題,所以通常是由 compiler 來完成
  • 實驗將 math-toolkit.h function 內的 loop 展開

$ make PROFILE=1
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 4.346391 sec

$ gprof ./raytracing | less

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 16.03      0.21     0.21 69646433     0.00     0.00  dot_product
 15.63      0.41     0.20 56956357     0.00     0.00  subtract_vector
 14.85      0.60     0.19 10598450     0.00     0.00  normalize
  8.99      0.71     0.12 17821809     0.00     0.00  cross_product
  7.82      0.81     0.10 13861875     0.00     0.00  rayRectangularIntersection
  6.64      0.90     0.09 13861875     0.00     0.00  raySphereIntersection
  6.25      0.98     0.08  1048576     0.00     0.00  ray_color
  4.69      1.04     0.06  4620625     0.00     0.00  ray_hit_object
  3.91      1.09     0.05  1048576     0.00     0.00  rayConstruction
  3.52      1.13     0.05 31410180     0.00     0.00  multiply_vector
  2.74      1.17     0.04 17836094     0.00     0.00  add_vector
  2.35      1.20     0.03  2110576     0.00     0.00  localColor
  1.95      1.22     0.03  4221152     0.00     0.00  multiply_vectors
  1.56      1.24     0.02        1     0.02     1.27  raytracing
  • 時間從 5.4417 sec 下降到 4.3464 sec
    ​​​​* dot_product:     0.51 -> 0.21
    ​​​​* subtract_vector: 0.30 -> 0.20
    ​​​​* multify_vector:  0.23 -> 0.05
    ​​​​* add_vector:      0.16 -> 0.04
    
  • 用 perf 來分析 branch miss
 Performance counter stats for './raytracing' (10 runs):

        16,359,752      branch-misses             #    0.42% of all branches          ( +-  0.69% )
     3,873,917,856      branches                                                      ( +-  0.00% )
    25,746,564,892      instructions              #    1.76  insn per cycle           ( +-  0.00% )
    14,607,513,939      cycles                                                        ( +-  0.23% )

       4.322380613 seconds time elapsed                                          ( +-  0.27% )

  • branch misses 從 19,197,236 -> 16,387,279
  • branches 從 4,782,031,932 -> 3,873,917,856 ,減少了9億次!!
  • 從實驗結果可以看出, loop unrolling 對於改善 branch miss 是有效的
  • 測試不加上 -pg 的情形,時間縮短 (2.50 -> 1.68)
    $ make clean
    $ make
    $ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 1.682859 sec

Method3 - OpenMP

剛好上學期有修平行程式設計,上課有提過 OpenMP ,把投影片找出來複習一下

  • An API for shared-memory parallel programming
  • Designed for systems in which each thread or process can potentially have access to all available memory
  • System is viewed as a collection of cores or CPU’s, all of which have access to main memory
  • 以 loop unrolling 的版本為基礎,針對 raytracing() 做平行化

(1) default schedule

void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height) { point3 u, v, w, d; color object_color = { 0.0, 0.0, 0.0 }; /* calculate u, v, w */ calculateBasisVectors(u, v, w, view); idx_stack stk; int factor = sqrt(SAMPLES); #pragma omp parallel for num_threads (thread_count) \ default(shared) private(stk, d, object_color) for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { double r = 0, g = 0, b = 0; /* MSAA */ for (int s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(d, u, v, w, i * factor + s / factor, j * factor + s % factor, view, width * factor, height * factor); if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres, lights, object_color, MAX_REFLECTION_BOUNCES)) { r += object_color[0]; g += object_color[1]; b += object_color[2]; } else { r += background_color[0]; g += background_color[1]; b += background_color[2]; } pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES; } } } }

thread_count = 4
$ make PROFILE=1
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 7.930475 sec

時間比只有 loop unrolling 增加了 (4.35 -> 7.93)
先不使用 gprof 測試看看
$ make clean
$ make
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 0.818154 sec

和 loop unrolling 不加上 -pg 的情況相比時間下降了 (1.68 -> 0.89) ,推測在 OpenMP multi threads 的情況下, -pg 對程式帶來的成本也跟著增加了

  • 比較不同 thread 數量對程式加速的效果,隨著 thread 增加,加速的比例也逐漸降低,當 thread_count 超過 16 之後, creat thread 和 join 的成本已經比速度提升還要高,反而失去加速的效果
thread_count = 1        1.720003 sec
thread_count = 2        0.955195 sec
thread_count = 4        0.818154 sec
thread_count = 8        0.767247 sec
thread_count = 16       0.743067 sec  
thread_count = 32       0.801165 sec

(2) cyclic schedule

  • static schedule : the iterations can be assigned to the thread before the loop is executed
  • 避免程式運算熱點分佈不均,利用 static schedule ,將raytracing() 作 cyclic 切割
  • 加上 schedule(static, 1)
#pragma omp parallel for num_threads (thread_count) \ default(shared) private(stk, d, object_color) schedule(static, 1)

thread_count = 4
$ make
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 0.728800 sec

在 thread_count = 4 的情形下,與 default schedule 相比,執行時間縮短 (0.818 -> 0.729)

  • cyclic schedule 可以將運算平均分配給各 thread ,平行加速效果比 default schedule 還要好
  • 將以上3個實驗一起實作,得到的運算時間是 0.629104 sec ,只有原始版本的 25%,不過還是沒贏過 -Ofast

Reference

簡易的程式平行化-OpenMP
hugikun999 筆記
LanKuDot 筆記
成大資訊系-多處理機平行程式設計 課程投影片