2017q1 Homework1 (raytracing)

contributed by <Hong>

tags: `sysprog2017` `raytracing` `gprof` `gnuplot` `king1224`

Reviewed by `Cayonliow`

commit message 還是有不足的部分，可以參考之前老師貼的鏈接
- 造句語法可改進
- 內文略顯冗長，可以改用命令式語句
排版細節要注意，如符號之間的空格，這是你的 Do loop unrolling & force gcc to inline functions 的版本的程式碼

+#define length(a) (sqrt(a[0]*a[0]+a[1]*a[1]+a[2]*a[2]))

開發環境

OS:                    16.04.1 Ubuntu
Architecture:          x86_64
CPU 作業模式：          32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數：         2
每通訊端核心數：         2
Socket(s):             1
NUMA 節點：             1
供應商識別號：           GenuineIntel
CPU 家族：              6
型號：                  60
Model name:            Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz
製程：                  3
CPU MHz：              2913.305
CPU max MHz:           3400.0000
CPU min MHz:           800.0000
BogoMIPS:              5587.41
虛擬：                  VT-x
L1d 快取：              32K
L1i 快取：              32K
L2 快取：               256K
L3 快取：               3072K
NUMA node0 CPU(s):     0-3

未優化版本

使用編譯器最佳化執行時間 without gprof ：

# Rendering scene
Done!
Execution time of raytracing() : 0.524929 sec

使用編譯器最佳化執行時間 with gprof ：

# Rendering scene
Done!
Execution time of raytracing() : 0.652154 sec

關閉編譯器最佳化執行時間 without gprof ：

# Rendering scene
Done!
Execution time of raytracing() : 2.611245 sec

關閉編譯器最佳化執行時間 with gprof ：

# Rendering scene
Done!
Execution time of raytracing() : 5.297611 sec

預期目標

在不使用編譯器最佳化的情況下，藉由 gprof 的幫助找出程式耗時的部份，針對重點地方優化運算，以改善整體效能，期望能得到與編譯器最佳化等值的效果

優化版本 1 - 減少迴圈 branch 判斷

觀察 math-toolkit.h 中的 function，可以發現大部分只是簡短的一個 for-loop，而這些 loop 都只做三次，因此對這些 loop 做 loop-unrolling 也不會導致過大的程式內容。

loop-unrolling ：

原始版本








static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

展開版本





static inline
double dot_product(const double *v1, const double *v2)
{
    return (v1[0] * v2[0]) + (v1[1] * v2[1]) + (v1[2] + v2[2]);
}

展開後的時間（未使用編譯器優化 & 使用gprof ）：

# Rendering scene
Done!
Execution time of raytracing() : 4.433639 sec

少了一些迴圈的條件判斷，執行時間居然快了接近 1 秒，已經離目標邁進一步了！

優化版本 2 - inline function 與 #define 的比較

經過第一個版本的優化以後，以 gprof 協助看出程式中最耗能的部份，可以看到 dot_product 這個函式呼叫了將近七千萬次。

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 17.66      0.21     0.21 69646433     0.00     0.00  dot_product
 15.56      0.40     0.19 56956357     0.00     0.00  subtract_vector
 13.45      0.56     0.16 10598450     0.00     0.00  normalize
  9.25      0.67     0.11 13861875     0.00     0.00  raySphereIntersection
  8.83      0.77     0.11 17821809     0.00     0.00  cross_product
  7.57      0.86     0.09 13861875     0.00     0.00  rayRectangularIntersec
tion
  6.73      0.94     0.08 31410180     0.00     0.00  multiply_vector
  6.73      1.02     0.08  1048576     0.00     0.00  ray_color
  5.05      1.08     0.06 17836094     0.00     0.00  add_vector
  4.20      1.13     0.05  4620625     0.00     0.00  ray_hit_object
  1.68      1.15     0.02  2520791     0.00     0.00  idx_stack_top
  1.26      1.17     0.02  4221152     0.00     0.00  multiply_vectors

讓我們看一下 math-toolkit.h 中的函式，雖然在第一行有 static inline，但到底是否要放成 inline function 是由編譯器來決定的，這邊我們關閉了編譯器最佳化，因此所有 function都不會變成 inline function。







static inline
void add_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
}

我們可以對這個函式加上 attribute，來強制一個函式為 inline。

請注意程式縮排，是四個空白
課程助教

謝謝好心的助教w
洪正皇







static inline __attribute__((always_inline))
void add_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
}

將所有 math-toolkit.h 中的 inline function 都補上強制 inline 的程式碼，執行效率如下：

Force inline（ 10次取平均）

# Rendering scene
Done!
Execution time of raytracing() : 2.155811 sec

inline function 是在編譯時期把整個 function copy paste 到 function call 的地方，而巨集（ #define ）也可以達到一樣的效果，但這兩個方法真的是完全一樣的嗎？
在 math-toolkit.h 中有一些函式是只做基本運算的（沒有 assign value ），如 dot_product 和 length，可以改成 #define 的寫法：

#define dot_product(a,b) ( a[0]*b[0] + a[1]*b[1] + a[2]*b[2] )

#define length(a) ( sqrt(a[0]*a[0] + a[1]*a[1] + a[2]*a[2]) )

Force inline + #define（ 10次取平均）

# Rendering scene
Done!
Execution time of raytracing() : 2.037899 sec

很神奇的是，#define 的方法比 inline function 還快了0.1秒，是什麼導致這0.1秒的差異？
巨集（ #define A B）是完全將程式碼中的 A 直接以 B 貼上覆蓋，inline 則是將 function 中的程式碼放到呼叫他的地方，但直接完全貼上會出現很恐怖的後果，像是 return 就不可能原封不動的貼上來，而 function 定義的變數名稱與 caller 傳入的參數名稱不一定完全相同，必定先做一些處理才能 inline，從 inline 與 #define 的不同中可以看到，若是以下的程式碼，#define 會執行兩次 getValue()，而 inline function 只會執行一次：

int getValue() { return 1; }

static inline __attribute__((always_inline))
int add(int a) { return a+a; }

#define ADD(a) a+a

...

int main(){
    //先呼叫一次getValue後得到值，直接取用這個值兩次
    add(getValue());
    
    //經過編譯後，程式碼會變成 getValue()+getValue();
    //因此會呼叫 getValue() 兩次
    ADD(getValue());
}

不論是 inline 或是 #define，這裡又比前一個版本快了 2.4 秒，而已經對於原始版本優化了大約 62％。

優化版本 3 - OpenMP

在多核心的時代，若想要加快執行速度，可以將任務分散給各個處理器，而 OpenMP 這個函式庫可以幫助我們簡單的寫出平行運算的程式碼。

OpenMP 語法的樣子基本上都是以這個形式開頭：

#pragma omp <接其他語法>

這邊對於 raytracing.c 中的一個 for-loop 進行平行化：































#pragma omp parallel num_threads(THREAD_CNT) \
shared(height,width,factor) private(stk,d,object_color)
#pragma omp for schedule(guided) collapse(2)
for (int j = 0; j < height; j++) {
    for (int i = 0; i < width; i++) {
        double r = 0, g = 0, b = 0;
        /* MSAA */
        for (int s = 0; s < SAMPLES; s++) {
            idx_stack_init(&stk);
            rayConstruction(d, u, v, w,
                            i * factor + s / factor,
                            j * factor + s % factor,
                            view,
                            width * factor, height * factor);
            if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                          lights, object_color,
                          MAX_REFLECTION_BOUNCES)) {
                r += object_color[0];
                g += object_color[1];
                b += object_color[2];
            } else {
                r += background_color[0];
                g += background_color[1];
                b += background_color[2];
            }
            pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
        }
    }
}

其中 num_threads() 可以選擇要讓 OpenMP 創建幾個 threads，針對2的次方的數字測試後發現 16 個 threads 以上差異就不大了，我認為是因為一個 core 可以處理兩個 threads，多一些些 threads 可以讓各個 CPU 在排程上得到更大的使用率，但就算創建大量的 thread，多出來的還是得停在那邊等系統排程。

1024 個 threads 的執行時間：

# Rendering scene
Done!
Execution time of raytracing() : 1.087111 sec

各個版本執行時間分析圖

以下待完成

想嘗試的事

觀察 math-toolkit.h 中的 normalize 發現 v[0], v[1], v[2] 的乘法除法皆為分別運算，想到先前學過的方法，位元運算會比乘法除法還快，因此我想是否能將 v[0], v[1], v[2] shift後相加成一個數，隨後只做一次乘法，再利用 shift & mask 取回各自的值。

原來這是SIMD在做的事！！！！！ Hong

參考資料

Hw1 作業區
 raytracing 題目與 video
程式碼
 force inline

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`	在筆記中貼入程式碼
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.