raytracing

github contributed by <Diana Ho>

tags: `d0651` `sys`

案例分析：

作業要求

以 make PROFILE=1 重新編譯程式碼
以 gprof 指出效能瓶頸，改寫檔案 math-toolkit.h 在內的函式實做

開發環境

Ubuntu 14.04 LTS

開發工具

$ sudo apt-get install graphviz 畫示意圖
$ sudo apt-get install imagemagick 轉換格式

gprof

GNU Profiling Tool
使用Gnu gprof进行Linux平台下的程序分析
- gprof 是一個可以分析程式的每個 function 使用多少次的工具（包含執行次數、消耗時間…等）,方便我們知道程式的瓶頸,並且知道應該對哪些 function 進行優化可以得到性能提升,尤其是 CPU 應用
- Gprof 實現原理
  - $ gcc –pg test.c –o test :
    新建 test.c 文件，並用 -pg 編譯和鏈結，產生一個 test.o 的目的檔
  - $./test：
    執行 test.o，產生 gmon.out 文件，供 gprof 分析數據
  - $ gprof -b a.out gmon.out | less :
    因為 gprof 输出的訊息比較多，使用 less 命令可以直接查看 gprof 的輸出
    | 表示 gprof -b a.out gmon.out 的輸出作為 less 的輸入
    可以看到 gprof 分析程式，每個函數所花的時間，因此可以針對時間花費最多的地方進行改善
- 使用 Gprof 分析 Cflow 開源項目
  - CFlow 是程序流程分析工具，可以分析 C 語言，產生程序調用圖。和 Gprof 差不多，但 CFlow 是靜態分析且不能分析 C++ 語言
  - 下載 http://www.gnu.org/software/cflow/ ，解壓縮 $ tar zxvf cflow-1.1.tar.gz
  - 編譯與執行 $./configure
  - $ make CFLAGS=-pg LDFLAGS=-pg 產生 cflow，但是沒有 –pg
  - …
- 生成圖形化的函數調用圖
  - Graphviz : Graph Visualization 圖形可視覺化工具

未優化版本

作業操作

編譯時make PROFILE=1 產生raytracing執行檔
讓程式重新編譯並加上參數-pg
編譯器在每一個 function 加上 mcount
執行時執行 ./raytracing
讓 gprof 啟動並分析數據，執行產生 gmon.out
$ gprof -b raytracing gmon.out | less 得到 raytracing 中各函數的詳細資訊，執行並紀錄，所以跑起來會比原程式慢的許多
與perf top比較：
每隔一段 period 會去做採樣(Sample)，最後統計出大概的數據

測試程式效能

原始測試










$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 2.923739 sec

$ make check 
# Rendering scene
Done!
Execution time of raytracing() : 3.038861 sec
Verified OK

gprof測試

$ make PROFILE=1 之前必須 $ make clean











$ make PROFILE=1
cc -std=gnu99 -Wall -O0 -g -c -o objects.o objects.c
cc -std=gnu99 -Wall -O0 -g -c -o raytracing.o raytracing.c
cc -std=gnu99 -Wall -O0 -g -c -o main.o main.c
cc -o raytracing objects.o raytracing.o main.o -lm

$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 5.704226 sec

clike= $ gprof raytracing gmon.out

第一部分

shows how much time was spent executing directly in each function


























Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 27.35      0.88     0.88 69646433     0.00     0.00  dot_product
 17.09      1.43     0.55 56956357     0.00     0.00  subtract_vector
  9.94      1.75     0.32 31410180     0.00     0.00  multiply_vector
  9.01      2.04     0.29 10598450     0.00     0.00  normalize
  8.70      2.32     0.28 13861875     0.00     0.00  rayRectangularIntersection
  7.15      2.55     0.23 17836094     0.00     0.00  add_vector
  4.04      2.68     0.13 17821809     0.00     0.00  cross_product
  3.26      2.79     0.11 13861875     0.00     0.00  raySphereIntersection
  2.80      2.88     0.09  4620625     0.00     0.00  ray_hit_object
  2.49      2.96     0.08  1048576     0.00     0.00  ray_color
  1.55      3.01     0.05  1048576     0.00     0.00  rayConstruction
  1.24      3.05     0.04        1     0.04     3.22  raytracing
  0.93      3.08     0.03  4221152     0.00     0.00  multiply_vectors
  0.93      3.11     0.03  2110576     0.00     0.00  compute_specular_diffuse
  0.78      3.13     0.03  2110576     0.00     0.00  localColor
  0.62      3.15     0.02  2520791     0.00     0.00  idx_stack_top
  0.62      3.17     0.02  1241598     0.00     0.00  refraction
  0.62      3.19     0.02  1204003     0.00     0.00  idx_stack_push
  0.31      3.20     0.01  3838091     0.00     0.00  length
  0.31      3.21     0.01  2558386     0.00     0.00  idx_stack_empty
  0.31      3.22     0.01  1241598     0.00     0.00  reflection

執行時間較多的 function，在 math-toolkit.h 裡，以他們為目標優先優化

第二部分

shows which functions called which others, and how much time each function used when its subroutine calls are included

























Call graph

granularity: each sample hit covers 2 byte(s) for 0.31% of 3.22 seconds

index % time    self  children    called     name
                0.04    3.18       1/1           main [2]
[1]    100.0    0.04    3.18       1         raytracing [1]
                0.08    2.94 1048576/1048576     ray_color [3]
                0.05    0.11 1048576/1048576     rayConstruction [14]
                0.00    0.00       1/1           calculateBasisVectors [25]
                0.00    0.00 1048576/1048576     idx_stack_init [27]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    3.22                 main [2]
                0.04    3.18       1/1           raytracing [1]
                0.00    0.00       3/3           append_sphere [29]
                0.00    0.00       3/3           append_rectangular [28]
                0.00    0.00       2/2           append_light [30]
                0.00    0.00       1/1           write_to_ppm [35]
                0.00    0.00       1/1           delete_rectangular_list [32]
                0.00    0.00       1/1           delete_sphere_list [33]
                0.00    0.00       1/1           delete_light_list [31]
                0.00    0.00       1/1           diff_in_second [34]
-----------------------------------------------

小結

由結果可知：
我們必須對前幾個執行次數高、執行時間長的函式，如dot_product()、subtract_vector()、multiply_vector()等進行效能的改善。
關於執行時間：
使用 gprof 來追蹤程式的時間較直接$ make來產生 raytracting 執行檔的時間久。

優化版本1： loop unrolling

使用 loop unrolling 將 for 迴圈中的表示式直接展開時，我們可以省下 counter 計算時間 和 branch predict 的 fail 的時間

loop unrolling
loop unrolling wiki

dot_product 的 function 改寫：和原先的 loop 相比，instructions 數只剩1/3









static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    dp += v1[0] * v2[0];
    dp += v1[1] * v2[1];
    dp += v1[2] * v2[2];
    return dp;
}

對 multiply_vector、multiply_vectors、subtract_vector，有相似程式碼也進行改寫

測試程式效能

Execution time: 從 5.704226 改善到 4.397736 sec




$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 4.397736 sec

經過 loop unrolling 的 function 也降低執行時間





















$ gprof -b raytracing gmon.out | head -n 20
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 28.00      0.68     0.68 69646433     0.00     0.00  dot_product
 10.29      0.93     0.25 56956357     0.00     0.00  subtract_vector
  9.06      1.15     0.22 13861875     0.00     0.00  raySphereIntersection
  9.06      1.37     0.22 13861875     0.00     0.00  rayRectangularIntersection
  8.24      1.57     0.20 10598450     0.00     0.00  normalize
  7.00      1.74     0.17 17836094     0.00     0.00  add_vector
  6.59      1.90     0.16 17821809     0.00     0.00  cross_product
  6.59      2.06     0.16 31410180     0.00     0.00  multiply_vector
  3.71      2.15     0.09  4620625     0.00     0.00  ray_hit_object
  2.26      2.21     0.06  1048576     0.00     0.00  ray_color
  1.65      2.25     0.04  4221152     0.00     0.00  multiply_vectors
  1.65      2.29     0.04  2110576     0.00     0.00  compute_specular_diffuse
  1.24      2.32     0.03  2110576     0.00     0.00  localColor
  0.82      2.34     0.02  1241598     0.00     0.00  protect_color_overflow
  0.82      2.36     0.02  1241598     0.00     0.00  refraction

優化版本2： force inline function

因爲在 Makefile 中擁有 -O0 的指令（關閉最佳化），所以 inline 不會被執行
inline 在編譯器中不一定會展開，因此修改函式就可以強制展開：static inline __attribute__((always_inline))

static：對 definition 有效，限制 function 或 global variable 只會在所在檔案中被 access
lnline：告訴編譯器將 function 直接展開成 definition 的程式，可透過加上 __attribute__((always_line)) 強制 gcc 在最佳化沒有開啟時 inline。

Inline function觀念
 內嵌函數（inline function）筆記
 Inline function in C

Inline function、Marco
Macro 是在 Compiler 進行編譯之前就被替換,而 Inline 是會被 Compiler 看到的,且在展開時系統會自動做安全檢查或型態的轉換

Macro巨集
跟 Inline 有點像,都是對該段代碼進行直接替換
用法: 在 math-toolkit.h 中直接使用

測試程式效能

Execution time: 從 4.397736 改善到 2.473517 sec




$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 2.473517 sec

優化版本3： OpenMP

OpenMP（Open Multi-Processing）是一套支持跨平台、共享記憶體方式的多執行緒平行運算的API

它比起使用作業系統直接建立執行緒有三大優點：
- CPU核心數量的擴展性
- 便利性
- 可移植性
OpenMP的用法：
- 平行設計
  parallel：用以建造一個平行塊
- 資料處理
  資料處理在OpenMP是最爲重要的一部分，例如private( )：它把變數宣告爲私有變數，讓其他執行緒無法存取，可以解決race condition的問題。
- 任務排程
  schedule：有四種類型：dynamic、guided、runtime和static，使用時可以根據程式的結構對線程進行不同調度。
OpenMP在什麼情況下可提升效能？
使用平均分配的方法，適合執行工作複雜度平均的程式碼
- 以光影追蹤爲例，產生的圖片每個地方的複雜度是不一樣的，如果依照平均分配工作要進行任務，被分派至複雜度高區域做工的執行緒會比其他的慢。

觀念參考

OpenMP 的用法

step1: 呼叫 #include <omp.h>
step2: 在迴圈前面加一行指令 #pragma omp parallel for
step3: 編譯時加上參數 -fopenmp

案例執行

#pragma omp parallel for schedule(guided,1) collapse(2) num_threads(64) private(stk), private(d),private(object_color)

測試程式效能

程式記憶體區段錯誤
- #pragma omp parallel for
- #pragma omp parallel for num_threads(64)
- ($ ./raytracing 可以執行)
















$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 0.025289 sec

$ ./raytracing 
# Rendering scene
程式記憶體區段錯誤 (core dumped)

$ make check
# Rendering scene
Done!
Execution time of raytracing() : 0.019904 sec
二元碼檔 baseline.ppm 與 out.ppm 不同
Fail
Verified OK

Execution time: 從 2.473517 改善到 1.039256 sec
- 因為變數 stk、d、object_color 必須要透過 private clause 宣告為執行緒私有，否則預設之下在 #pragma omp 前的變數都是 shared，會導致計算結果錯誤。
- #pragma omp parallel for num_threads(64) private(stk), private(d), private(object_color)









$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 1.039256 sec
$ make check
# Rendering scene
Done!
Execution time of raytracing() : 1.044793 sec
Verified OK

優化版本4： Multi-Thread

POSIX線程(pthread)入門
 POSIX Threads Programming
Introduction to Parallel progamming

多核心程式的好處：
- Parallelism：一個系統可以同時執行多個任務
- Data Parallelism：分佈相同的資料子集在多核心，每個資料具有相同的運作
- Task Parallelism：分佈執行緒在核心中，每個執行緒具有獨特的運作
PThread：
PThread 定義了一套 C 語言的型態定義、函數和常數。Pthread API 中有 100 個函式呼叫，全都以 pthread\_ 開頭，並可以分為 4 類
- 執行緒管理：建立執行緒、等待執行緒、查詢執行緒狀態
- Mutex：建立、銷毀、鎖定、解鎖、設定屬性等操作
- Condition Variable：建立、消滅、等待、通知、設定與查詢屬性等操作
- 讀寫鎖的執行緒間的同步管理

觀念參考

PThread 函式

pthread_create


#include <pthread.h>
int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);

產生一個 Thread 並執行附帶的 Function

pthread_join


#include <pthread.h>
int pthread_join(pthread_t thread, void **retval);

暫停目前執行 pthread_join 的 Thread，等到目標 Thread 執行完畢之後目前執行 pthread_join 的 Thread 才會繼續執行

Thread 策略分析

實作參考
 實作參考

使用 Light Complexity 看每一點的複雜度，降低每個thread的時間差
- 可以降低每個thread的時間差，讓比較快做完的thread不必等慢的太久
- 比較複雜部分的先做
- 縮小比較複雜的部分，減少計算這個部分的執行緒的時間
決定thread平行策略，讓每個部分都可以並行處理
- 分row / 分column
- 上左、上右、下左、下右
- 旋轉

raytracing

tags: d0651 sys

案例分析：

作業要求

開發環境

開發工具

gprof

未優化版本

作業操作

測試程式效能

原始測試

gprof測試

第一部分

第二部分

小結

優化版本1： loop unrolling

測試程式效能

優化版本2： force inline function

測試程式效能

優化版本3： OpenMP

OpenMP 的用法

案例執行

測試程式效能

優化版本4： Multi-Thread

PThread 函式

pthread_create

pthread_join

Thread 策略分析

tags: `d0651` `sys`