2017 q1 Homework1 (raytracing)

contributed by <zmke>

Reviewed by `twzjwang`

multithreads 下使用 gprof ，時間卻增加:

gprof is poorly suited for analysis of parallel programs as it doesn't understand the intricacies of OpenMP.
gprof on both OpenMP and without OpenMP codes produces different flat profile
可用其他方法實作平行化，如 Pthread

Reviewed by `henry0929016816`

比較了 thread 數的差異，這一點我覺得做的很好
可以試試看 SIMD

開發環境

OS: Ubuntu 16.04 LTS
Architecture: x86_64
CPU 作業模式： 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
Model name: Intel® Core™ i5-4200H CPU @ 2.80GHz
L1d 快取： 32K
L1i 快取： 32K
L2 快取： 256K
L3 快取： 3072K

gprof

可以印出程式執行時各函數消耗的函數、 funtion 之間互相呼叫的關係，幫助 programmer 分析出消耗較多時間的函數，找出當作改善效能的方向，尤其是對大型程式來說是相當實用的工具
$ make PROFILE=1：加入 -pg ， compiler 會在程式中間插入額外的程式碼，執行後會產生 gmon.out
$ gprof ./raytracing | less

原始版本

直接執行 clone 下來的程式

$ make
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 2.499698 sec

使用 gprof 進行分析

$ make PROFILE=1
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 5.441733 sec

compiler 在程式中加入額外程式碼，執行時間增加了

$ gprof ./raytracing | less

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 25.02      0.51     0.51 69646433     0.00     0.00  dot_product
 14.71      0.81     0.30 56956357     0.00     0.00  subtract_vector
 11.28      1.04     0.23 31410180     0.00     0.00  multiply_vector
 10.06      1.25     0.21 13861875     0.00     0.00  rayRectangularIntersection
  7.85      1.41     0.16 17836094     0.00     0.00  add_vector
  7.36      1.56     0.15 10598450     0.00     0.00  normalize
  5.89      1.68     0.12  4620625     0.00     0.00  ray_hit_object
  4.41      1.77     0.09 17821809     0.00     0.00  cross_product
  3.19      1.83     0.07 13861875     0.00     0.00  raySphereIntersection
  2.45      1.88     0.05  1048576     0.00     0.00  ray_color
  1.96      1.92     0.04  2110576     0.00     0.00  compute_specular_diffuse
  1.96      1.96     0.04        1     0.04     2.04  raytracing

可以清楚看到時間集中在 dot_product 、 subtract_vector 、 multiply_vector ，這些 function 主要用來進行三維向量運算，可以在 math-toolkit.h 找到定義

用 perf 來分析 branch miss

 Performance counter stats for './raytracing' (10 runs):

        19,296,540      branch-misses             #    0.40% of all branches          ( +-  0.43% )
     4,782,031,932      branches                                                      ( +-  0.00% )
    32,981,345,906      instructions              #    1.90  insn per cycle           ( +-  0.00% )
    17,360,336,322      cycles                                                        ( +-  0.22% )

       5.137705397 seconds time elapsed                                          ( +-  0.22% )

compiler 優化

實際動手之前先來看看 compiler 到底有多強大，更改優化選項 -Ofast

$ make PROFILE=1
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 0.518435 sec

這效果…
$ gprof ./raytracing | less

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 52.21      0.24     0.24  4620625    51.98    51.98  ray_hit_object
 21.76      0.34     0.10  2110576    47.42    47.42  compute_specular_diffuse
 10.88      0.39     0.05   592239    84.49   408.09  ray_color
 10.88      0.44     0.05                             raytracing
  2.18      0.45     0.01  2110576     4.74     4.74  localColor
  2.18      0.46     0.01  1241598     8.06     8.06  refraction
  0.00      0.46     0.00   113297     0.00     0.00  fresnel

發現剛剛看到最花時間的 function 都沒在表上，應該是 inline 發揮作用了，在 function 的定義前加入 inline 提示 compiler 在編譯的時候就將 function 直接展開，減少呼叫 function 所需要成本

Method1 - force inline

實驗將 math-toolkit.h 裡面的 inline 改成 __attribute__((always_inline)) inline ，讓 compiler 在關閉編譯最佳化的情形下將 function inline

$ make PROFILE=1
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 2.967841 sec

$ gprof ./raytracing | less

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 40.48      0.87     0.87 13861875     0.00     0.00  rayRectangularIntersection
 18.61      1.27     0.40 13861875     0.00     0.00  raySphereIntersection
 10.93      1.51     0.24  2110576     0.00     0.00  compute_specular_diffuse
  7.91      1.68     0.17  1048576     0.00     0.00  ray_color
  6.51      1.82     0.14  2110576     0.00     0.00  localColor
  6.28      1.95     0.14  4620625     0.00     0.00  ray_hit_object
  4.89      2.06     0.11  1048576     0.00     0.00  rayConstruction
  1.86      2.10     0.04  1241598     0.00     0.00  reflection
  1.40      2.13     0.03        1     0.03     2.15  raytracing

時間從 5.4417 sec 下降到 2.9678 sec ，從 gprof 的報表上也看不到剛剛在 math-tool-kit.h force inline 的 function

Method2 - loop unrolling

將迴圈展開，減少 branch misses 帶來的成本，不過可能會造成程式碼變冗長、可讀性降低的問題，所以通常是由 compiler 來完成
實驗將 math-toolkit.h function 內的 loop 展開

$ make PROFILE=1
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 4.346391 sec

$ gprof ./raytracing | less

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 16.03      0.21     0.21 69646433     0.00     0.00  dot_product
 15.63      0.41     0.20 56956357     0.00     0.00  subtract_vector
 14.85      0.60     0.19 10598450     0.00     0.00  normalize
  8.99      0.71     0.12 17821809     0.00     0.00  cross_product
  7.82      0.81     0.10 13861875     0.00     0.00  rayRectangularIntersection
  6.64      0.90     0.09 13861875     0.00     0.00  raySphereIntersection
  6.25      0.98     0.08  1048576     0.00     0.00  ray_color
  4.69      1.04     0.06  4620625     0.00     0.00  ray_hit_object
  3.91      1.09     0.05  1048576     0.00     0.00  rayConstruction
  3.52      1.13     0.05 31410180     0.00     0.00  multiply_vector
  2.74      1.17     0.04 17836094     0.00     0.00  add_vector
  2.35      1.20     0.03  2110576     0.00     0.00  localColor
  1.95      1.22     0.03  4221152     0.00     0.00  multiply_vectors
  1.56      1.24     0.02        1     0.02     1.27  raytracing

時間從 5.4417 sec 下降到 4.3464 sec

* dot_product:     0.51 -> 0.21
* subtract_vector: 0.30 -> 0.20
* multify_vector:  0.23 -> 0.05
* add_vector:      0.16 -> 0.04

用 perf 來分析 branch miss

 Performance counter stats for './raytracing' (10 runs):

        16,359,752      branch-misses             #    0.42% of all branches          ( +-  0.69% )
     3,873,917,856      branches                                                      ( +-  0.00% )
    25,746,564,892      instructions              #    1.76  insn per cycle           ( +-  0.00% )
    14,607,513,939      cycles                                                        ( +-  0.23% )

       4.322380613 seconds time elapsed                                          ( +-  0.27% )

branch misses 從 19,197,236 -> 16,387,279
branches 從 4,782,031,932 -> 3,873,917,856 ，減少了9億次！！
從實驗結果可以看出， loop unrolling 對於改善 branch miss 是有效的
測試不加上 -pg 的情形，時間縮短 (2.50 -> 1.68)
$ make clean
$ make
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 1.682859 sec

Method3 - OpenMP

剛好上學期有修平行程式設計，上課有提過 OpenMP ，把投影片找出來複習一下

An API for shared-memory parallel programming
Designed for systems in which each thread or process can potentially have access to all available memory
System is viewed as a collection of cores or CPU’s, all of which have access to main memory
以 loop unrolling 的版本為基礎，針對 raytracing() 做平行化

(1) default schedule













































void raytracing(uint8_t *pixels, color background_color,
                rectangular_node rectangulars, sphere_node spheres,
                light_node lights, const viewpoint *view,
                int width, int height)
{
    point3 u, v, w, d;
    color object_color = { 0.0, 0.0, 0.0 };

    /* calculate u, v, w */
    calculateBasisVectors(u, v, w, view);

    idx_stack stk;

    int factor = sqrt(SAMPLES);
#pragma omp parallel for num_threads (thread_count) \
	default(shared) private(stk, d, object_color)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            double r = 0, g = 0, b = 0;
            /* MSAA */
            for (int s = 0; s < SAMPLES; s++) {
                idx_stack_init(&stk);
                rayConstruction(d, u, v, w,
                                i * factor + s / factor,
                                j * factor + s % factor,
                                view,
                                width * factor, height * factor);
                if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                              lights, object_color,
                              MAX_REFLECTION_BOUNCES)) {
                    r += object_color[0];
                    g += object_color[1];
                    b += object_color[2];
                } else {
                    r += background_color[0];
                    g += background_color[1];
                    b += background_color[2];
                }
                pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
                pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
                pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
            }
        }
    }
}

thread_count = 4
$ make PROFILE=1
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 7.930475 sec

時間比只有 loop unrolling 增加了 (4.35 -> 7.93)
先不使用 gprof 測試看看
$ make clean
$ make
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 0.818154 sec

和 loop unrolling 不加上 -pg 的情況相比時間下降了 (1.68 -> 0.89) ，推測在 OpenMP multi threads 的情況下， -pg 對程式帶來的成本也跟著增加了

比較不同 thread 數量對程式加速的效果，隨著 thread 增加，加速的比例也逐漸降低，當 thread_count 超過 16 之後， creat thread 和 join 的成本已經比速度提升還要高，反而失去加速的效果

thread_count = 1        1.720003 sec
thread_count = 2        0.955195 sec
thread_count = 4        0.818154 sec
thread_count = 8        0.767247 sec
thread_count = 16       0.743067 sec  
thread_count = 32       0.801165 sec

(2) cyclic schedule

static schedule : the iterations can be assigned to the thread before the loop is executed
避免程式運算熱點分佈不均，利用 static schedule ，將raytracing() 作 cyclic 切割
加上 schedule(static, 1)


#pragma omp parallel for num_threads (thread_count) \
	default(shared) private(stk, d, object_color) schedule(static, 1)

thread_count = 4
$ make
$ ./raytracing

# Rendering scene
Done!
Execution time of raytracing() : 0.728800 sec

在 thread_count = 4 的情形下，與 default schedule 相比，執行時間縮短 (0.818 -> 0.729)

cyclic schedule 可以將運算平均分配給各 thread ，平行加速效果比 default schedule 還要好
將以上3個實驗一起實作，得到的運算時間是 0.629104 sec ，只有原始版本的 25%，不過還是沒贏過 -Ofast

Reference

簡易的程式平行化－OpenMP
hugikun999 筆記
 LanKuDot 筆記
成大資訊系-多處理機平行程式設計課程投影片

2017 q1 Homework1 (raytracing)

Reviewed by twzjwang

Reviewed by henry0929016816

開發環境

gprof

原始版本

compiler 優化

Method1 - force inline

Method2 - loop unrolling

Method3 - OpenMP

(1) default schedule

(2) cyclic schedule

Reference

Reviewed by `twzjwang`

Reviewed by `henry0929016816`