2016q3 Homework1 (raytracing)

contributed by <TempoJiJi>

tags: `TempoJiJi` `raytracing` `sysprog21`

Reviewed by `SarahYuHanCheng`

建議圖表數值命名可以用縮寫取代代稱

預期目標

之前已對raytracing做過各方面的研究，但會做一次總復習，且會着重在上次未能達到的目標上。
過去的共筆：組別共筆, 個人共筆

學習 gprof
以 gprof 指出效能瓶頸，並且著手改寫檔案 math-toolkit.h 在內的函式實做
善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作
實做SIMD

開發環境

Ubuntu 16.04 LTS
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
Model name: Intel® Core™ i5-4200U CPU @ 1.60GHz

測試未優化的程式效能：

# Rendering scene
Done!
Execution time of raytracing() : 3.338328 sec

利用gprof觀察程式效能及行爲，在Makefile的編譯選項裏加入 -pg ，再進行編譯跟執行，就會產生一個gmon.out檔：

$ gprof raytracing gmon.out

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 24.91      0.60     0.60 69646433     0.00     0.00  dot_product
 14.12      0.94     0.34 56956357     0.00     0.00  subtract_vector
 10.38      1.19     0.25 13861875     0.00     0.00  rayRectangularIntersection
  9.34      1.42     0.23 31410180     0.00     0.00  multiply_vector
  8.72      1.63     0.21 10598450     0.00     0.00  normalize
  7.06      1.80     0.17  4620625     0.00     0.00  ray_hit_object
  4.98      1.92     0.12 17836094     0.00     0.00  add_vector
  4.57      2.03     0.11 13861875     0.00     0.00  raySphereIntersection
  4.15      2.13     0.10 17821809     0.00     0.00  cross_product
  2.49      2.19     0.06  1048576     0.00     0.00  ray_color
  2.08      2.24     0.05  1048576     0.00     0.00  rayConstruction

這裏可以看到dot_product跟subtrace_vector是最花時間的地方，程式執行期間總共被呼叫了接近7前萬次(dot_product)。

所以先從math-toolkit.h開始觀察：

Loop Unrolling

優點：

分支預測（branch predictor）失敗減少。
如果循環體內語句沒有數據相關，增加了並發執行的機會。
可以在執行時動態循環展開。

缺點：

降低可讀性

static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;                                                                                    
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

這裏可以看到利用迴圈來計算dot_product，而迴圈會有branch的情況發生，導致效能不好，所以就將迴圈拆開(loop unrolling)：






double dp = 0.0;
dp += v1[0] * v2[0];
dp += v1[1] * v2[1];
dp += v1[2] * v2[2];
return dp;

# Rendering scene
Done!
Execution time of raytracing() : 2.999242 sec

可以看到只改一個dot_product時間就快了近0.4秒左右。

Force Inline

inline是向編譯器提出“建議”，把inline的函數在函數位置直接展開（減輕系統負擔(overhead)），編譯器會對比兩者執行時間後選擇執行inline與否。
優點：

減少function call

在fucntion後面加上 __forceinline








static inline __forceinline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for(int i =0;i<3;i++)
        dp += v1[i] * v2[i];
    return dp;
}

編譯加上 -D__forceinline="attribute((always_inline))" 進行編譯：

# Rendering scene
Done!
Execution time of raytracing() : 3.059393 sec

時間比原本的快了0.2秒左右，接下來用gprof觀察function call：

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 34.98      1.00     1.00 13861875     0.00     0.00  rayRectangularIntersection
 22.38      1.64     0.64 13861875     0.00     0.00  raySphereIntersection
 10.84      1.95     0.31  2110576     0.00     0.00  compute_specular_diffuse
  9.44      2.22     0.27  2110576     0.00     0.00  localColor
  7.17      2.43     0.21  1048576     0.00     0.00  ray_color
  5.42      2.58     0.16  4620625     0.00     0.00  ray_hit_object
  4.20      2.70     0.12  1048576     0.00     0.00  rayConstruction

可以看到math_tookit.h裏的function已經不見了

Macro

優點：

執行速度快，沒有堆疊的 push 和 pop 動作的需要，減少時間的耗損

缺點：

巨集被呼叫多次以後，會耗損存放及使用大量的記憶體空間

將math_tookit.h裏的function改爲Macro來進行計算，這樣就能減少function call(stack frame的push,pop等行爲)所帶來的時間花費


#define DOT(a,b) ((a[0]*b[0])+(a[1]*b[1])+(a[2]*b[2]))

# Rendering scene
Done!
Execution time of raytracing() : 2.741537 sec

這裏可以看到時間比loop unrolling快了近0.2秒左右，因此決定將math_tookit.h裏的花費較大的function都改爲Macro argument

根據gprof的結果，可以知道哪個function的花費最大，因此將它們都改爲Macro argument即可

#define DOT(a,b) ((a[0]*b[0])+(a[1]*b[1])+(a[2]*b[2]))
#define SUB_VEC(a,b,c) (c[0]=a[0]-b[0], c[1]=a[1]-b[1], c[2]=a[2]-b[2])
#define MVEC(a,b,out) (out[0]=a[0]*(b), out[1]=a[1]*(b), out[2]=a[2]*(b))
#define ADD(a,b,c) (c[0]=a[0]+b[0], c[1]=a[1]+b[1], c[2]=a[2]+b[2])
#define CDOT(a,b,out) (out[0]=(a[1]*b[2])-(a[2]*b[1]),out[1]=(a[2]*b[0])-(a[0]*b[2]),out[2]=(a[0]*b[1])-(a[1]*b[0]))

# Rendering scene
Done!
Execution time of raytracing() : 1.634365 sec

可以看到只改了math_tookit.h整個程式的效能就快了接近1倍，可見程式的效能瓶頸真的是在math_tookit.h裏。

實做PThread 跟 OpenMP

OpenMP

首先來實做OpenMP，在for迴圈上加上：

 #pragma omp parallel for num_threads(4)   \
    private(x), private(y)

num_threads是要建立的執行緒數量。private可以確保變數在各個執行緒裏是獨立的，並不會因爲某個執行緒改變了變數，而影響到其它執行緒。

在raytracing的兩層for迴圈上加上：

    #pragma omp parallel for num_threads(4)   \
    private(stk), private(d),   \
    private(object_color)
    for (int j = 0 ; j < 512; j++) {
        for (int i = 0; i < 512; i++) {
            double r = 0, g = 0, b = 0;
            /* MSAA */

這裏將stk、d、object_color設爲private，是因爲這3個的值是會在各個執行緒裏進行更動。

最後要記得 #include<omp.h> ，以及在編譯選項中加上 -fopenp

執行結果：

num_thread(4)

# Rendering scene
Done!
Execution time of raytracing() : 1.738263 sec

不同的num_thread比較圖：

POSIX Thread

當function的參數超過一個，在建立pthread的時候需要將所有參數打包起來，才能傳送給function：










typedef struct __ARG {
    uint8_t *pixels;
    color background;
    rectangular_node rectangulars;
    sphere_node spheres;
    light_node lights;
    const viewpoint *View;
    int start_j;
    point3 u, v, w, d;
} arg;

將圖片分成不同的等份交給不同的執行緒去進行，這裏我會先將row分成4個等份，再用4個執行緒去執行：







/* (*data).start_j 爲每個執行緒所執行的起點  */

    for (int j = (*data).start_j ; j < 512; j+=MAXX) {
        for (int i = 0; i < 512; i++) {
            double r = 0, g = 0, b = 0;
            /* MSAA */
		...

執行結果：

# Rendering scene
Done!
Execution time of raytracing() : 1.603646 sec

不同thread數量的比較圖：

由上圖可得知，thread的數量越多，不代表程式效能越好。

最佳優化，合並所有方法

合並所有方法，再跟不同的編譯優化做對比：

2016q3 Homework1 (raytracing)

tags: TempoJiJi raytracing sysprog21

Reviewed by SarahYuHanCheng