Try   HackMD

2017q1 Homework1 (raytracing)

contributed by <rayleigh0407>

Reviewed by twzjwang

  • 可使用圖表比較不同實作方法的結果,並與編譯器優化比較
  • 可以刪除多餘的空行及已註解的程式碼
  • 如果有時間可以嘗試其他方法加速

開發環境

$lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Stepping:              3
CPU MHz:               800.000
CPU max MHz:           4200.0000
CPU min MHz:           800.0000
BogoMIPS:              8016.72
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

中英文間記得用空白隔開~
課程助教

前置作業

gprof

gprof 是一個可以協助你開發程式的工具, 它可以檢測你的程式執行時間的分佈, 在 GNU gprof 的簡介就提到

This manual describes the gnu profiler, gprof, and how you can use it to determine which parts of a program are taking most of the execution time. We assume that you know how to write, compile, and execute programs

在編譯的時候, 你可以加入參數-pg, 接著執行過程式後會產生gmon.out檔, 接著只要輸入以下指令

gprof -b $(exec) gmon.out | less

-b 是將各種結果對應的說明刪除, 只留下執行分析的結果
less 可以讓你上下滾動查看輸出結果

raytracing

開發過程

1. 原始版本

先下載原始碼
載完立即來測試看看

$ git clone https://github.com/sysprog21/raytracing 
$ cd raytracing
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 2.224557 sec

用 gprof 來看看程式執行的部份

$ make PROFILE=1(輸出gmon.out)
$ ./raytracing
$ gprof -b raytracing gmon.out | less

執行的結果如下

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 20.23      0.37     0.37 69646433     0.00     0.00  dot_product
 18.04      0.70     0.33 56956357     0.00     0.00  subtract_vector
 12.85      0.94     0.24 13861875     0.00     0.00  rayRectangularIntersection
  9.30      1.11     0.17  4620625     0.00     0.00  ray_hit_object
  8.20      1.26     0.15 31410180     0.00     0.00  multiply_vector
  6.29      1.37     0.12 17836094     0.00     0.00  add_vector
  6.29      1.49     0.12 13861875     0.00     0.00  raySphereIntersection
  5.47      1.59     0.10 10598450     0.00     0.00  normalize
  4.92      1.68     0.09 17821809     0.00     0.00  cross_product
  1.64      1.71     0.03  4221152     0.00     0.00  multiply_vectors
  1.09      1.73     0.02  2110576     0.00     0.00  compute_specular_diffuse
  1.09      1.75     0.02  2110576     0.00     0.00  localColor
  1.09      1.77     0.02  1241598     0.00     0.00  protect_color_overflow
  1.09      1.79     0.02  1048576     0.00     0.00  rayConstruction
  0.82      1.80     0.02  3838091     0.00     0.00  length
  0.55      1.81     0.01  1241598     0.00     0.00  refraction
  0.55      1.82     0.01  1048576     0.00     0.00  ray_color
  0.27      1.83     0.01  2558386     0.00     0.00  idx_stack_empty
  0.27      1.83     0.01  1204003     0.00     0.00  idx_stack_push
  0.00      1.83     0.00  2520791     0.00     0.00  idx_stack_top
  0.00      1.83     0.00  1241598     0.00     0.00  reflection
  0.00      1.83     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      1.83     0.00   113297     0.00     0.00  fresnel
  0.00      1.83     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      1.83     0.00        3     0.00     0.00  append_rectangular
  0.00      1.83     0.00        3     0.00     0.00  append_sphere
  0.00      1.83     0.00        2     0.00     0.00  append_light
  0.00      1.83     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      1.83     0.00        1     0.00     0.00  delete_light_list
  0.00      1.83     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      1.83     0.00        1     0.00     0.00  delete_sphere_list
  0.00      1.83     0.00        1     0.00     0.00  diff_in_second
  0.00      1.83     0.00        1     0.00     1.83  raytracing
  0.00      1.83     0.00        1     0.00     0.00  write_to_ppm
    ...

下半部份主要為函式呼叫的分支(即一個函式呼叫了哪些函式)和統計次數
由此可看出, 佔用大部分時間的都是一些小函式( dot_product 不到十行)
試著從這些小地方開始著手

2. 測試1 Loop unrolling

這是 math-toolkit.h 中的 dot_product() 和 subtract_vector()

static inline void subtract_vector(const double *a, const double *b, double *out) { for (int i = 0; i < 3; i++) out[i] = a[i] - b[i]; } ... static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; for (int i = 0; i < 3; i++) dp += v1[i] * v2[i]; return dp; }

可以發現, 原開發者早就考慮到他們佔用相當多的時間, 便宣告成 inline function 以提昇執行速度和減少記憶體的佔用(關於inline function), 但程式呼叫次數太多了, 佔用時間還是相對地大

首先我先這樣修改:

static inline void subtract_vector(const double *a, const double *b, double *out) { out[0] = a[0] - b[0]; out[1] = a[1] - b[1]; out[2] = a[2] - b[2]; } ... static inline double dot_product(const double *v1, const double *v2) { dp += v1[0] * v2[0]; dp += v1[1] * v2[1]; dp += v1[2] * v2[2]; return dp; }

將函式拆開的用意是, 在每次執行一次迴圈時, 都會檢查是否要跳出迴圈, 以組語來說就類似 jne jge 等等(可參考x86 Assembly Guide), 使用 loop unrolling, 每次執行函式就會少一兩個 instructions, 多次使用後節省的時間也是很可觀的
我們來看執行結果
origin

 Performance counter stats for './raytracing' (10 runs):

           22,9075      cache-misses              #   23.203 % of all cache refs      ( +- 16.92% )
           98,7278      cache-references                                              ( +- 16.48% )
     194,6446,2315      instructions              #    2.21  insns per cycle          ( +-  0.00% )
      88,2142,5404      cycles                                                        ( +-  0.93% )

       2.251422149 seconds time elapsed                                          ( +-  0.92% )

loop unrolling

 Performance counter stats for './raytracing' (10 runs):

           13,3195      cache-misses              #   22.574 % of all cache refs      ( +-  9.44% )
           59,0030      cache-references                                              ( +-  9.96% )
     147,1677,2511      instructions              #    1.86  insns per cycle          ( +-  0.00% )
      79,0934,9083      cycles                                                        ( +-  1.25% )

       2.019096528 seconds time elapsed                                          ( +-  1.26% )

雖然 cache-miss 稍微地下降而已, 但是 instrcution 足足少了 24.39%, 執行時間也隨之減少 10.33%, 有時候看似細微的地方, 也能牽扯到整個大局

執行 gprof:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 18.01      0.23     0.23 69646433     0.00     0.00  dot_product
 13.21      0.39     0.17 56956357     0.00     0.00  subtract_vector
 11.21      0.53     0.14 10598450     0.00     0.00  normalize
 10.81      0.67     0.14 13861875     0.00     0.00  rayRectangularIntersection
  8.81      0.78     0.11  4620625     0.00     0.00  ray_hit_object
  7.60      0.87     0.10 31410180     0.00     0.00  multiply_vector
  7.60      0.97     0.10 13861875     0.00     0.00  raySphereIntersection
  5.20      1.03     0.07 17836094     0.00     0.00  add_vector
  5.20      1.10     0.07 17821809     0.00     0.00  cross_product
  3.60      1.14     0.05  4221152     0.00     0.00  multiply_vectors
  2.80      1.18     0.04  2110576     0.00     0.00  compute_specular_diffuse
  1.60      1.20     0.02  1048576     0.00     0.00  ray_color
  1.20      1.21     0.02  2110576     0.00     0.00  localColor
  0.80      1.22     0.01  2558386     0.00     0.00  idx_stack_empty
  0.80      1.23     0.01  2520791     0.00     0.00  idx_stack_top
  0.80      1.24     0.01        1     0.01     0.01  delete_sphere_list
  0.80      1.25     0.01        1     0.01     1.24  raytracing
  0.00      1.25     0.00  3838091     0.00     0.00  length
  0.00      1.25     0.00  1241598     0.00     0.00  protect_color_overflow
  0.00      1.25     0.00  1241598     0.00     0.00  reflection
  0.00      1.25     0.00  1241598     0.00     0.00  refraction
  0.00      1.25     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      1.25     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      1.25     0.00  1048576     0.00     0.00  rayConstruction
  0.00      1.25     0.00   113297     0.00     0.00  fresnel
  0.00      1.25     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      1.25     0.00        3     0.00     0.00  append_rectangular
  0.00      1.25     0.00        3     0.00     0.00  append_sphere
  0.00      1.25     0.00        2     0.00     0.00  append_light
  0.00      1.25     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      1.25     0.00        1     0.00     0.00  delete_light_list
  0.00      1.25     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      1.25     0.00        1     0.00     0.00  diff_in_second
  0.00      1.25     0.00        1     0.00     0.00  write_to_ppm

dot_product 下降了0.14s
substact_vector 下降了0.16s

那既然如此, 乾脆把 math-toolkit 內的迴圈都做 loop unrolling 看看
執行結果:

 Performance counter stats for './raytracing' (10 runs):

           11,9138      cache-misses              #   23.585 % of all cache refs      ( +-  2.98% )
           50,5147      cache-references                                              ( +- 10.63% )
     127,3165,0289      instructions              #    1.74  insns per cycle          ( +-  0.00% )
      73,2732,5074      cycles                                                        ( +-  0.31% )

       1.869817600 seconds time elapsed                                          ( +-  0.31% )

相較於 origin
instrcution 減少 34.59%
time elapsed 減少 16.95%

以上所有測試皆通過圖形測試

$make check
# Rendering scene
Done!
Execution time of raytracing() : 4.027406 sec
Verified OK

3. 測試2 平行化

試著平行化迴圈
math-toolkit.h 內的函式皆為 inline function
實行平行化可能沒辦法在編譯時展開
先看看 raytracing.c

void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height) { ... for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { double r = 0, g = 0, b = 0; /* MSAA */ for (int s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(d, u, v, w, i * factor + s / factor, j * factor + s % factor, view, width * factor, height * factor); ... }

在 raytracing() 內
是一次畫一個點去作圖的
看能不能針對這巢狀迴圈做平行化
在這之前先考慮 data dependency

看了很久才發現資料沒有任何資料寫入衝突(可以看 function parameter 來過濾, 如果宣告為 const 則此資料不會被更動)
那就來實做平行化

Pthread

將外層函式做平行化

int main(...) { ... pthread_t *thread_handles; thread_handles = (pthread_t*)malloc(sizeof(pthread_t) * thread_num); for(int i = 0; i < thread_num; i++) { each_rank[i] = i; pthread_create(&thread_handles[i], NULL, raytracing_pthread, (void *)&each_rank[i]); } } void *raytracing_pthread(void *rank) { int my_rank = *(int*)rank; ... int start = (height / THREAD_NUMBER) * my_rank; int end = (height / THREAD_NUMBER) * (my_rank + 1); for (int j = start; j < end; j++) { for (int i = 0; i < width; i++) { ... } } }

執行結果:
為了方便分配工作區域, 以2的倍數來分配
雙核:

# Rendering scene
Thread number : 2
Done!
Execution time of raytracing() : 1.110519 sec

四核:

# Rendering scene
Thread number : 4
Done!
Execution time of raytracing() : 0.816143 sec

八核:

# Rendering scene
Thread number : 8
Done!
Execution time of raytracing() : 0.506245 sec

效能成長沒有想像中好
換個方式分配看看

for (int j = my_rank; j < height; j+=THREAD_NUMBER) {

執行結果:
單執行序:

 Performance counter stats for './raytracing' (10 runs):

           11,0062      cache-misses              #   23.736 % of all cache refs      ( +-  2.04% )
           46,3689      cache-references                                              ( +-  5.54% )
     127,2937,7767      instructions              #    1.75  insns per cycle          ( +-  0.00% )
      72,9304,6745      cycles                                                        ( +-  0.12% )

       1.865169660 seconds time elapsed                                          ( +-  0.23% )

雙核:

 Performance counter stats for './raytracing' (10 runs):

           11,0145      cache-misses              #   18.638 % of all cache refs      ( +-  2.23% )
           59,0976      cache-references                                              ( +-  5.39% )
     127,2961,2413      instructions              #    1.58  insns per cycle          ( +-  0.00% )
      80,5269,8888      cycles                                                        ( +-  0.07% )

       1.126548896 seconds time elapsed                                          ( +-  0.09% )

四核:

 Performance counter stats for './raytracing' (10 runs):

           10,5345      cache-misses              #   15.362 % of all cache refs      ( +-  3.07% )
           68,5756      cache-references                                              ( +-  2.35% )
     127,2978,3539      instructions              #    1.47  insns per cycle          ( +-  0.00% )
      86,4925,3375      cycles                                                        ( +-  1.49% )

       0.592542459 seconds time elapsed                                          ( +-  2.45% )

八核:

 Performance counter stats for './raytracing' (10 runs):

           11,4074      cache-misses              #   14.391 % of all cache refs      ( +-  3.90% )
           79,2670      cache-references                                              ( +-  1.85% )
     127,3064,3855      instructions              #    1.21  insns per cycle          ( +-  0.00% )
     105,0530,6169      cycles                                                        ( +-  0.09% )

       0.354254347 seconds time elapsed                                          ( +-  0.69% )

效能提昇不少
幸好上學期有修過關於平行化的課程
cache-misses 減少了不少
可能是 cpu 能用的空間變多了
但還是沒有依倍數成長

參考資料