2016q3 Homework1 (raytracing)

tags: `jeff60907`

contributed by <jeff60907>

作業說明及預期目標

善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作

開發環境

作業系統 Lubuntu 16.04
CPU:Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K

依照作業介紹安裝以下開發工具

$ sudo apt-get update
$ sudo apt-get install graphviz
$ sudo apt-get install imagemagick

未優化前測試

# Rendering scene
Done!
Execution time of raytracing() : 2.404955 sec

學習GNU gprof

可以追蹤程式時間到底花在那邊
function到底執行多少次、時間、流程
量出效能上的熱點

$ make PROFILE=1

# Rendering scene
Done!
Execution time of raytracing() : 5.026501 sec

$ gprof  ./raytracing | less

分析程式執行的內容

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.12      0.42     0.42 69646433     0.00     0.00  dot_product
 17.10      0.76     0.34 56956357     0.00     0.00  subtract_vector
 11.56      0.99     0.23 31410180     0.00     0.00  multiply_vector
 11.06      1.21     0.22 10598450     0.00     0.00  normalize
  8.80      1.39     0.18 13861875     0.00     0.00  rayRectangularIntersection
  5.78      1.50     0.12  4620625     0.00     0.00  ray_hit_object
  4.02      1.58     0.08 17821809     0.00     0.00  cross_product
  4.02      1.66     0.08  1048576     0.00     0.00  ray_color
  3.52      1.73     0.07 17836094     0.00     0.00  add_vector
  3.02      1.79     0.06  4221152     0.00     0.00  multiply_vectors
  2.77      1.85     0.06 13861875     0.00     0.00  raySphereIntersection
  2.01      1.89     0.04  2110576     0.00     0.00  compute_specular_diffuse
  2.01      1.93     0.04  1241598     0.00     0.00  protect_color_overflow
  1.01      1.95     0.02  1048576     0.00     0.00  rayConstruction
  0.75      1.96     0.02   113297     0.00     0.00  fresnel

可以發現前面呼叫次數非常多，時間花費幾乎都在這邊，所以想要優化程式從這邊著手，而不是靠感覺去瞎猜，要依照真實數據去改善。

上課講解影片提到，Makefile 可以更改編譯最佳化測試

CFLAGS = \
        -std=gnu99 -Wall -O0 -g
改成
CFLAGS = \
        -std=gnu99 -Wall -Ofast -g

Execution time of raytracing() : 0.559535 sec

電腦編譯自動最佳化 5秒變成0.55秒

人腦VS大腦超越電腦編譯器最佳化

優化測試1

課程video 提到 for 回圈會造成分支，如果使用展開可以有效提升執行速度
但是不可能總把回圈拆開來硬幹，要依照自己程式的需求目標。

loop unrolling

改善第1名 dot_product()

把for展開測試

double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
        dp = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
    return dp;
}

Execution time of raytracing() : 4.577559 sec

原本5.02秒變成4.57秒，只改了一行for寫法就改善了時間效能

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 24.41      0.40     0.40 56956357     0.00     0.00  subtract_vector
 12.81      0.61     0.21 69646433     0.00     0.00  dot_product
 11.59      0.80     0.19 17836094     0.00     0.00  add_vector
  9.76      0.96     0.16 31410180     0.00     0.00  multiply_vector

重新看了 gprof dot_product的名次也下降了，繼續看其他數學式程式碼改善效能

把add_vector()、subtract_vector()、multiply_vectors()、multiply_vector()的for loop也拆開來

Execution time of raytracing() : 4.211204 sec

效能改善4.57 到 4.211 改善幅度沒有像dot_product 多

優化測試2

wiki inline function
參考rnic開發紀錄其中提到 inline function效能提升，看資料研究原理

參考1 參考2提到巨集 VS inline

macro與inline的差異：

macro是由preprocessor處理，由inline則是compiler來負責

macro對於檢查傳入參數的型別，而inline則會檢查型別

marco 不執行參數型別檢查、也沒有語法檢查，也可能產生邏輯錯誤

參考3提到行內函式只能建議編譯器，也就是說建議並不一定會被採納，這視您的編譯器而定。

嘗試inline function 確定展開

將原先每個 static inline 後面加上 _ _attribute _ _((always_inline))確定展開

static inline __attribute__((always_inline))

Execution time of raytracing() : 2.048317 sec

從4.211秒效能直接改善約一半到2.048秒

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 36.52      0.46     0.46 13861875     0.00     0.00  rayRectangularIntersection
 15.88      0.66     0.20  2110576     0.00     0.00  compute_specular_diffuse
 11.91      0.81     0.15 13861875     0.00     0.00  raySphereIntersection
 11.91      0.96     0.15  1048576     0.00     0.00  ray_color
  7.15      1.05     0.09  4620625     0.00     0.00  ray_hit_object
  4.76      1.11     0.06  1048576     0.00     0.00  rayConstruction
  3.97      1.16     0.05  2110576     0.00     0.00  localColor

因為function都展開了，少了呼叫數學 function傳值的時間

嘗試把 return 計算式內容，而不是先存後傳發現效能有提高一些！

static inline __attribute__((always_inline))
double dot_product(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}

Execution time of raytracing() : 1.988770 sec

時間降約 0.005秒左右

優化測試3

多執行緒處理－使用OpenMP

過去沒有做過多執行緒平行化處理的程式，知道一些些概念卻不是很懂，需要補充自己的背景知識

多執行緒還有許多方法可以探討

OpenMP系列教學文回頭再試試看其他指令及範例練習
參考dada實作
 參考raytracing共筆

OpenMP（Open Multi-Processing）是一套支持跨平台、共享記憶體方式的多執行緒平行運算的API。

它比起使用作業系統直接建立執行緒有三大優點：
- CPU核心數量的擴展性
- 便利性
- 可移植性

首先要進行平行化處理，必須先探討平行處理的程式碼是不是各自獨立沒有相依性，如果有相依性，進行平行化處理thread計算速度各自不同，資料就會錯誤。

查看 raytracing.c的 raytracing function 發現pixel可以各自平行處理，彼此沒有相依性的問題。

在 raytracing.c 中的 raytracing function內的for建立平行化處理

沒有選擇多少執行緒處理，直接用電腦核心平行運算

Execution time of raytracing() : 1.207583 sec

時間減少約 0.8秒如果指定多少執行緒處理?

num_threads選擇多少數量的執行緒平行處理
private 讓每個執行緒中，都有一份變數的複本，以免互相干擾。

#pragma omp parallel for num_threads(512)      \
                private(stk), private(d),     \
                private(object_color)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {

2   threads
Execution time of raytracing() : 1.965742 sec
4   threads
Execution time of raytracing() : 1.655131 sec
8   threads
Execution time of raytracing() : 1.229635 sec
32  threads
Execution time of raytracing() : 1.006759 sec
64  threads
Execution time of raytracing() : 0.968196 sec
128 threads
Execution time of raytracing() : 0.941614 sec
256  threads
Execution time of raytracing() : 0.911764 sec
512  threads
Execution time of raytracing() : 0.908156 sec
1024  threads
Execution time of raytracing() : 0.949907 sec

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 40.01      0.02     0.02    85023     0.00     0.00  raySphereIntersection
 40.01      0.04     0.02    84535     0.00     0.00  rayRectangularIntersection
 20.01      0.05     0.01     7767     0.00     0.00  rayConstruction
  0.00      0.05     0.00    33026     0.00     0.00  ray_hit_object
  0.00      0.05     0.00    18471     0.00     0.00  idx_stack_empty

threads 前面看起來越高效果越好，可是真的是這樣子嗎?
原本dada使用的是64 threads但是效能卻不是最高的，看了一下main.c COL 是用512，所以使用 512 threads可以有效率的平行展開處理，反而獲得最佳的處理效能。