2017q1 Homework1 (raytracing)

contributed by <petermouse>

溫習去年秋季做的共筆並繼續改進

Reviewed by `hunng`

利用 OpenMP loop scheduling 去解決原先分配不均的問題，並有顯著效果
並在 thread 內加上測試時間去檢查是否有執行時間差異過大的問題
可以從系統的角度解釋為何 thread 數從 8 到 128 時效能沒有增長

開發環境

OS: Ubuntu 16.04 LTS
Memory: 8 GB
CPU:
- Name:
  - Intel® Core™ i5-4200H CPU @ 2.80GHz
- Cache:
  - L1d Cache： 32K
  - L1i Cache： 32K
  - L2 Cache： 256K
  - L3 Cache： 3072K

Makefile

每次看 Makefile 都很新奇，以前只用過最基本的，但 Makefile 遠不只這些！

紀錄一下不會的 (網路參考資料)

可使用條件句: 有 ifeq ifneq ifdef ifndef等 if 述句，也可以有對應的 else，結尾以endif 表示
類似 operator 的使用
- := : = 內容裡的變數更動後會隨之更動，:= 避免這種狀況
- += : 很直覺地代表參數的串接
- ?= : 沒有 define 才指定
%.c: 所有.c結尾的名稱，很像終端機中的 *
$< : 第一個 dependencies 之值 ( i.e. 冒號後的第一個)
一堆的字串函式 (英文,中文)
- 這次的 $(strip) 代表將字串省去開頭結尾的空白

raytracing

原始版本

make 時不加 PROFILE=1 時執行 raytracing

# Rendering scene
Done!
Execution time of raytracing() : 2.495271 sec

再 PROFILE=1 產生 gprof 的相關資料 (gmon.out)

# Rendering scene
Done!
Execution time of raytracing() : 5.125118 sec

時間要原本的兩倍左右
使用 gprof 分析函數間的執行時間以及執行次數
gprof 使用時同時包含 image-file 以及 profile-file

$ gprof -b raytracing gmon.out | less

產生的部份結果如下：

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.18      0.47     0.47 69646433     0.00     0.00  dot_product
 17.94      0.85     0.38 56956357     0.00     0.00  subtract_vector
 11.80      1.10     0.25 10598450     0.00     0.00  normalize
  9.68      1.31     0.21 13861875     0.00     0.00  rayRectangularIntersection
  7.08      1.46     0.15 31410180     0.00     0.00  multiply_vector
  6.37      1.59     0.14 17836094     0.00     0.00  add_vector

多數的時間都發生在數學上的運算，比如前三名 dot_product() subtract_vector() normalize 就佔了 50% ，而且被呼叫的次數極多，從這裡下手對效能也很大的影響

再來觀察 perf 對原本程式(未使用 PROFILE=1)的結果

 Performance counter stats for './raytracing':

           121,000      cache-references                                            
            41,214      cache-misses              #   34.061 % of all cache refs    
     1,870,361,272      branches                                                    
         5,298,723      branch-misses             #    0.28% of all branches        

       2.475591343 seconds time elapsed

branch 數量竟然高達了18億！branch 除了增加判斷的成本，還有預測錯誤遭造成的懲罰，如果能夠降低 branch 的數量也預期會有更好的表現。

改進方向(1) loop unrolling

在 math-toolkit.h 中，有多個 function 內都有固定次數的 for 迴圈






static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] - b[i];
}

不是很必要去宣告一個變數 i，判斷 i 是否大於 3，或是對 i 加 1，類似的都可以改成







static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}

以及最重的 dot_product








static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp; 
}

可以直接寫成 return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];

時間比起最初要快了不少

# Rendering scene
Done!
Execution time of raytracing() : 1.687605 sec

gprof 的結果，有做 loop unrolling 的 function 都把時間壓低了

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 20.78      0.25     0.25 56956357     0.00     0.00  subtract_vector
 13.14      0.40     0.16 69646433     0.00     0.00  dot_product
 11.87      0.54     0.14 10598450     0.00     0.00  normalize
  9.33      0.65     0.11 13861875     0.00     0.00  raySphereIntersection
  7.63      0.74     0.09  4620625     0.00     0.00  ray_hit_object
  6.78      0.82     0.08 31410180     0.00     0.00  multiply_vector
  6.78      0.90     0.08 13861875     0.00     0.00  rayRectangularIntersection
  5.09      0.96     0.06 17836094     0.00     0.00  add_vector
  4.66      1.02     0.06 17821809     0.00     0.00  cross_product
  4.24      1.07     0.05  1048576     0.00     0.00  ray_color
  1.70      1.09     0.02  3838091     0.00     0.00  length
  1.70      1.11     0.02  1204003     0.00     0.00  idx_stack_push
  1.70      1.13     0.02  1048576     0.00     0.00  rayConstruction
  1.27      1.14     0.02  4221152     0.00     0.00  multiply_vectors

perf 的結果：

 Performance counter stats for './raytracing':

            92,736      cache-references                                            
            36,946      cache-misses              #   39.840 % of all cache refs    
       969,826,972      branches                                                    
         3,103,684      branch-misses             #    0.32% of all branches        

       1.726622426 seconds time elapsed

branch 數直接少一半！
簡單地估算一下，i 每次 function call 需判斷 4 次 ( i = 0 ~ i = 3)
( 56956357 + 69646433 + 31410180 + 17836094 + 4221152 ) * 4 = 720280864
至少少了7億次，實際上少更多

改進方向(2) inline function

看看那些 function 加起來可是呼叫了幾億次！，如果全部 inline 就可以減少呼叫的成本，但缺點是程式碼會很長，增加整體的大小

因為要強制在 -O0 的情形下 inline ，在 declarator 前要加 gcc 的 attribute 來強制執行







static inline __attribute__((always_inline))
void add_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
}

gprof 中已經看不見這些 function

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 31.55      0.41     0.41 13861875     0.00     0.00  rayRectangularIntersection
 29.24      0.79     0.38 13861875     0.00     0.00  raySphereIntersection
 16.16      1.00     0.21  2110576     0.00     0.00  compute_specular_diffuse
  6.93      1.09     0.09  1048576     0.00     0.00  ray_color
  3.08      1.13     0.04  2110576     0.00     0.00  localColor

執行時間少了約 0.2 秒

# Rendering scene
Done!
Execution time of raytracing() : 1.456480 sec

改進方向(3): 使用 OpenMP

OpenMP 可以簡單地開啟多個 thread 完成工作，只要在適當的地點加入pragma omp，並在編譯連結時加入 -fopenmp 即可

程式主體為 raytracing() ，再看看 raytracing() 內其實由雙層 for 迴圈，對圖像的每一個點(x,y) 依序建立，像素之間沒有什麼相依關係，只要小心有些變數不能共用就好

以 row 為最小單位，對於最外層的迴圈平行化






#pragma omp parallel for private(stk, object_color)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            /* statements */
        }
    }

不過建出來是錯的，呈現就像下圖一樣，還有共用的變數少考慮了!

少讓 d 變數變成私有






    #pragma omp parallel for private(d, stk, object_color)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            /* statements */    
        }
    }

在沒有指定 thread 數量情況下的結果

# Rendering scene
Done!
Execution time of raytracing() : 0.763289 sec

執行時間降為一半！

考慮不同 thread 數量的執行時間，繪製成下圖

在 2 與 4 個 thread 數量時間是差不多的， 8 以後雖然差不多，但是時間有在稍微將低一點。

參考2016春季班 team4 共筆中的圖片複雜度分析結果，可以發現每個點做的運算量都不太一樣，部份點比起其他點更耗時，看起來是個熱點！如果在平行時能夠 load balancing，就減少 thread 等待其他 thread 的時間，加快整體的速度

嘗試使用 loop scheduling

OpenMP loop scheduling 共分為 5 種

static : 依指定的 chunk size 預先分配每個 thread 的 chunk，未指定 chunk size 為 ( 總 chunk 數量 / 總 thread 數量 )
dynamic : 動態分配，誰先做完誰就繼續依 chunk size 拿取剩下的 chunk
guided : 類似 dynamic ，不過 thread 取得的 chunk size 依對數遞減，適合複雜度隨迭代指數成長的迴圈
auto : complier 決定適合使用 static dynamic guided 哪一個
runtime: 執行時依環境變數 OMP_SCHEDULE 決定 static dynamic guided

未說明 loop scheduling 由系統的 def-sched-var 決定， def-sched-var 的值對應的方法也是 implementation-defined

嘗試使用 static，並將 chunk size 設定為 1，也就是 cyclic partition

程式碼如下







    #pragma omp parallel for if(OMP_THREADS - 1) num_threads(OMP_THREADS) \
    private(d, stk, object_color) schedule(static, 1)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            /* statements */
        }
    }

與原本的相比

在 4 個 thread 數量的效能增加很多！，原本的圖片在切割成 4 等分是最不平衡的，現在稍微平衡了一點。

其實這裡忘了先確認 default 的分配方式，以上只是猜測 default 直接將 for 迴圈切成等分分配給每個 thread ，才會在 4個 thread 時無法獲得很好的效能

先寫一個小程式看看結果











#include <stdio.h>                                                              
#include <omp.h>
int main()
{
    int i = 0;
    #pragma omp parallel for num_threads(4)
    for (i = 0; i < 8; i++) {
        printf("%d:%d\n", omp_get_thread_num(), i); 
    }   
    return 0;
}

結果輸出

thread 0 : 0
thread 0 : 1
thread 1 : 2
thread 1 : 3
thread 3 : 6
thread 3 : 7
thread 2 : 4
thread 2 : 5

看來 default 就如猜想一樣，是直接切割等分分配 chunk

再來驗證 default 4 threads 情形下，是不是因為 thread 2 (熱點最高)拖累整體速度













#pragma omp parallel num_threads(OMP_THREADS)
{   
    double time1,time2;
    time1 = omp_get_wtime();
    #pragma omp for private(d, stk, object_color)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            /* statements */
        }
        time2 = omp_get_wtime();
        printf("thread %d: %lf\n",omp_get_thread_num(), time2-time1);
    }
}

每一列做完便會輸出 thread id 及累計時間，執行後的最後幾行輸出

thread 2: 0.751149
thread 2: 0.754657
thread 1: 0.755651
thread 2: 0.758141
thread 1: 0.760791
thread 2: 0.762016
thread 2: 0.765384
thread 2: 0.768642
thread 2: 0.771855
thread 2: 0.775030

其實 thread 0 與 thread 3 (熱點較少)已經先完成了，thread 1 與 thread 2 還在做

這樣測量有點不準確，因為 CPU 先挑選哪個 thread 執行也無法確定，不過多次測試情況下都由 thread 2 最後執行完，(static, 1)沒有這個問題 petermouse

再來試試 dynamic，load balancing 交由程式以一列為單位自動分配




    #pragma omp parallel for if(OMP_THREADS - 1) num_threads(OMP_THREADS) \
    private(d, stk, object_color) schedule(dynamic, 1)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {

4 與 8 個 thread 小幅度地改善，其他的與 static 相近

改進方向(4):

參考資料

gnuplot 柱狀圖製作
 2016春季班 team4 共筆
 omp for scheduling
openmp loop scheduling