2016q3 Homework1 (raytracing)

contributed by <HaoTse>

Reviewed by `shelly4132`

可以利用 gnuplot 繪製出效能比較圖表
可將POSIX Thread的方法實作出來

開發工具

graphviz: 畫示意圖
imagemagick: 格式轉換

$ sudo apt-get update
$ sudo apt-get install graphviz
$ sudo apt-get install imagemagick

gprof

GNU的工具
使用方式
- 編譯時加上-pg的參數，編譯器會在各函數中加入mcount函數
- 執行產生gmon.out
- $ gprof -b raytracing gmon.out | less
由於輸出訊息多，$ less 使我們可以使用上下鍵看訊息。鄭皓澤
gprof v.s perf
perf top 的原理是每隔一段時間採樣一次，最後根據這些資料輸出,因此採樣的頻率可能會造成結果的不同。

效能分析

首先不開啟gprof，得到結果如下

# Rendering scene
Done!
Execution time of raytracing() : 3.338670 sec

接著使用$ make PROFILE=1重新編譯

注意得先做$ make clean

# Rendering scene
Done!
Execution time of raytracing() : 6.912961 sec

兩者對照可以看出gprof對速度的影響。

gmon.out

$ gprof -b raytracing gmon.out | less




































  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.79      0.54     0.54 69646433     0.00     0.00  dot_product
 20.58      1.05     0.51 56956357     0.00     0.00  subtract_vector
 12.91      1.37     0.32 31410180     0.00     0.00  multiply_vector
  9.68      1.61     0.24 13861875     0.00     0.00  raySphereIntersection
  8.47      1.82     0.21 13861875     0.00     0.00  rayRectangularIntersection
  6.86      1.99     0.17 10598450     0.00     0.00  normalize
  4.44      2.10     0.11 17836094     0.00     0.00  add_vector
  4.03      2.20     0.10 17821809     0.00     0.00  cross_product
  2.42      2.26     0.06  4620625     0.00     0.00  ray_hit_object
  2.02      2.31     0.05  1048576     0.00     0.00  ray_color
  1.61      2.35     0.04  4221152     0.00     0.00  multiply_vectors
  1.21      2.38     0.03  2520791     0.00     0.00  idx_stack_top
  1.01      2.41     0.03  1048576     0.00     0.00  rayConstruction
  0.81      2.43     0.02  2110576     0.00     0.00  localColor
  0.81      2.45     0.02  1241598     0.00     0.00  refraction
  0.40      2.46     0.01  2110576     0.00     0.00  compute_specular_diffuse
  0.40      2.47     0.01  1241598     0.00     0.00  reflection
  0.40      2.48     0.01        1     0.01     0.01  delete_sphere_list
  0.20      2.48     0.01        1     0.01     0.01  calculateBasisVectors
  0.00      2.48     0.00  3838091     0.00     0.00  length
  0.00      2.48     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      2.48     0.00  1241598     0.00     0.00  protect_color_overflow
  0.00      2.48     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      2.48     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      2.48     0.00   113297     0.00     0.00  fresnel
  0.00      2.48     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      2.48     0.00        3     0.00     0.00  append_rectangular
  0.00      2.48     0.00        3     0.00     0.00  append_sphere
  0.00      2.48     0.00        2     0.00     0.00  append_light
  0.00      2.48     0.00        1     0.00     0.00  delete_light_list
  0.00      2.48     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      2.48     0.00        1     0.00     0.00  diff_in_second
  0.00      2.48     0.00        1     0.00     2.47  raytracing
  0.00      2.48     0.00        1     0.00     0.00  write_to_ppm

前三名dot_product、subtract_vector、multiply_vector都在math-toolkit.h這個檔案中，因此我們首先著手對它做優化。

優化方法1： force inline function

開啟math-toolkit.h後，最明顯的部份就是static inline，可是這次我們要求不使用編譯器最佳化，因此使用__attribute__((always_inline))強制開啟inline。

先將前面提到的3個最花時間的function加上force inline。

執行結果

# Rendering scene
Done!
Execution time of raytracing() : 4.771009 sec

執行時間從 6.912961sec下降至 4.771009sec。

gprof輸出

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 32.06      1.00     1.00 13861875     0.00     0.00  rayRectangularIntersection
 22.77      1.71     0.71 13861875     0.00     0.00  raySphereIntersection
  8.66      1.98     0.27 10598450     0.00     0.00  normalize
  7.70      2.22     0.24  2110576     0.00     0.00  compute_specular_diffuse
  6.41      2.42     0.20 17836094     0.00     0.00  add_vector
  5.45      2.59     0.17 17821809     0.00     0.00  cross_product
  3.85      2.71     0.12  4620625     0.00     0.00  ray_hit_object
  3.53      2.82     0.11  2110576     0.00     0.00  localColor
  2.89      2.91     0.09  1048576     0.00     0.00  ray_color
  2.24      2.98     0.07  1048576     0.00     0.00  rayConstruction
  1.28      3.02     0.04  1241598     0.00     0.00  reflection
  0.96      3.05     0.03  4221152     0.00     0.00  multiply_vectors
  0.96      3.08     0.03  1241598     0.00     0.00  refraction
  0.64      3.10     0.02  3838091     0.00     0.00  length
  0.32      3.11     0.01        1     0.01     0.01  delete_sphere_list
  0.32      3.12     0.01        1     0.01     3.11  raytracing

前面最耗時的3個function也從前3名消失了。

將所有的inline funciotn作force inline後，花費時間明顯下降，結果如下

# Rendering scene
Done!
Execution time of raytracing() : 3.916428 sec

優化方法2： loop unrolling

在math-toolkit.h中有許多function都有for loop，因此決定再套入loop unrolling方法降低速度。
修改後得到結果如下

# Rendering scene
Done!
Execution time of raytracing() : 2.934387 sec

執行時間從 3.916428sec降至 2.934387sec。

優化方法3： OpenMP

在raytracing.c的raytracing()中有一段程式碼如下




























for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            double r = 0, g = 0, b = 0;
            /* MSAA */
            for (int s = 0; s < SAMPLES; s++) {
                idx_stack_init(&stk);
                rayConstruction(d, u, v, w,
                                i * factor + s / factor,
                                j * factor + s % factor,
                                view,
                                width * factor, height * factor);
                if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                              lights, object_color,
                              MAX_REFLECTION_BOUNCES)) {
                    r += object_color[0];
                    g += object_color[1];
                    b += object_color[2];
                } else {
                    r += background_color[0];
                    g += background_color[1];
                    b += background_color[2];
                }
                pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
                pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
                pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
            }
        }
    }

可以看出各個pixel可以獨立運算，因此想使用多執行緒的方式來試著改善程式效能，在同學的筆記中看到OpenMP這套工具，先嘗試使用看看。

使用方式
- Makefile部份改為
```
CFLAGS = \
	-std=gnu99 -Wall -O0 -g -fopenmp
LDFLAGS = \
	-lm -lgomp
```
- 在上述的for loop前面增加#pragma omp parallel for

輸出結果

# Rendering scene
Done!
Execution time of raytracing() : 2.306696 sec

原先為 2.934387sec，發現下降並不明顯。

#pragma omp parallel for num_threads(64)
將thread數量指定至64條。

不過$ make check後發現結果fail，參考同學的筆記之後發現要改為
```
#pragma omp parallel for num_threads(64) \
        private(stk) private(d)     \
        private(object_color)
```
因為變數 stk、d 以及 object_color 必須要透過 private clause 宣告為執行緒私有的，否則預設之下在 #pragma omp 前的變數都是 shared，會導致計算結果錯誤。成功輸出結果如下
```
# Rendering scene
Done!
Execution time of raytracing() : 1.606055 sec
```
時間下降更多了。

目標

在-Ofast且不開啟-pg的情況下，結果如下

# Rendering scene
Done!
Execution time of raytracing() : 0.389374 sec

目標為超越電腦。

其他方法(待實驗)

pthread

2016q3 Homework1 (raytracing)

Reviewed by `shelly4132`

開發工具

gprof

效能分析

gmon.out

優化方法1： force inline function

優化方法2： loop unrolling

優化方法3： OpenMP

輸出結果

目標

其他方法(待實驗)

參考資料

tags: `HaoTse` `sysprog21` `raytracing` `gprof` `OpenMP`

2016q3 Homework1 (raytracing)

Reviewed by shelly4132

開發工具

gprof

效能分析

gmon.out

優化方法1： force inline function

優化方法2： loop unrolling

優化方法3： OpenMP

輸出結果

目標

其他方法(待實驗)

參考資料

tags: HaoTse sysprog21 raytracing gprof OpenMP

Reviewed by `shelly4132`

tags: `HaoTse` `sysprog21` `raytracing` `gprof` `OpenMP`