2017q1 Homework1 (raytracing)

contributed by <Sean1127>

Reviewed by `Daichou`

應該寫一個script產生time.txt供time.gp讀取
thread_count為常數是否以#define成marco會比較省資源，而不用浪費資源傳值。
可嘗試不同openmp的schedule方式並做比較。

電腦規格

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 69
Model name:            Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Stepping:              1
CPU MHz:               1166.748
CPU max MHz:           2700.0000
CPU min MHz:           800.0000
BogoMIPS:              4789.06
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

安裝必須軟體

graphviz: 圖形視覺化軟體，但使用時並不是直接$ graphviz，而是使用它提供的工具函式，$ man graphviz可以瀏覽所有可用的同行化工具
(官網 http://www.graphviz.org/)
gprof: 效能檢測工具，與 perf 相似，它是包含在linux-tools-generic之中 (請注意版本)
gprof2dot: dot 語言生成器，將 gprof 監測的效能報告轉換程 graphviz 可以認得的一個叫做dot的語言，才能繪製出方塊圖
imagemagick: 研究中
python: 因為 gprof2dot 是以 python 開發的，所以當然需要(ubuntu 內因有許多工具也以 python 開發，所以內建就有)(我的版本: python 2.7.12)
pithon-pip: 全名 python package index，用於安裝 python 的額外套件，這裡只用來安裝 gprof2dot

$ sudo apt-get install graphviz
$ sudo apt-get install python-pip
$ sudo pip install gprof2dot
$ sudo apt-get install imagemagick

學習 gprof, gprof2dot, graphviz

Graphviz - 用指令來畫關係圖吧
HackMD 也可以用 dot 語法畫圖，例如

有空再研究，這不是今天的重點
我們只需要知道:
$ gprof ./raytracing | gprof2dot | dot -Tpng -o output.png

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

作業目標

以 make PROFILE=1 重新編譯程式碼，並且學習 gprof

以 gprof 指出效能瓶頸，並且著手改寫檔案math-toolkit.h在內的函式實做，充分紀錄效能差異在共筆

可善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作

先把公開的共筆都讀過 ktvexe, hugikun999, LanKuDot, 影片對應的共筆
以簡單的著手:

loop unrolling
force inline
pthread
openmp

把編譯器最佳化打開，看看這支程式速度的極限在哪裡
- -O1: 0.815723 sec
- -O2: 0.757749 sec
- -O3: 0.773548 sec
意思就是要以 0.75 秒為目標！

原始分析

$ gprof ./raytracing | less

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.69      0.66     0.66 69646433     0.00     0.00  dot_product
 20.63      1.26     0.60 56956357     0.00     0.00  subtract_vector
  9.80      1.55     0.29 13861875     0.00     0.00  rayRectangularIntersection
  8.60      1.80     0.25 10598450     0.00     0.00  normalize
  7.56      2.02     0.22 31410180     0.00     0.00  multiply_vector
  6.53      2.21     0.19 17821809     0.00     0.00  cross_product

欄位解釋：

 %         the percentage of the total running time of the
time       program used by this function.

cumulative a running sum of the number of seconds accounted
 seconds   for by this function and those listed above it.

 self      the number of seconds accounted for by this
seconds    function alone.  This is the major sort for this
           listing.

白話文：
- self是自己執行時間的加總
- % time是比例
- cumulative是自己（% time第 n 名）跟 <n 名時間的加總
dot_product, subtract_vector各佔了 20 % 左右
罪魁禍首是這樣寫的






static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] - b[i];
}








static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

進階分析

嘗試一 loop unrolling

寫組語的時候常常用到的技巧，但是寫 C/C++ 卻好像完全忘了這回事
代表花時間唸了書卻不會用，好險這裡有給一記當頭棒喝，不然真的是白學了
原理：迴圈就等於 jump 指令加上 branch 指令，對於 pipe 的 CPU 來說，如果需要 branch，會把 CPU 內的指令清空，非但無法發揮 pipe 優點，還會拉低 CPU 功率
程式跳動時會使用 branch prediction，但畢竟是預測，如果可以直接避免 branch 會更好








static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
    out[3] = a[3] - b[3];
}

*** stack smashing detected ***: ./raytracing terminated
Aborted (core dumped)

好，一定是有什麼搞錯了







static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}

手殘嚴重…

結果
Execution time of raytracing() : 2.203631 sec
分析前三名

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 23.73      0.42     0.42 69646433     0.00     0.00  dot_product
 14.87      0.68     0.26 56956357     0.00     0.00  subtract_vector
 11.15      0.87     0.20 31410180     0.00     0.00  multiply_vector

dot_product: 0.66 降到 0.42
subtract_vector: 0.60 降到 0.26

嘗試二 force inline

inline就是把 function 裡的東西都複製貼上到外層（有點像一開始學程式因為不會分檔、寫 function，所以所有功能都在main裡面做）

不用呼叫函式就少了
重點: 因為-O0會讓編譯器忽略inline，所以要再函式前加上__attribute__((always_inline))強制 inline









static inline __attribute__((always_inline))
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    dp += v1[0] * v2[0];
    dp += v1[1] * v2[1];
    dp += v1[2] * v2[2];
    return dp;
}

結果
Execution time of raytracing() : 1.997267 sec
分析

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 30.99      0.57     0.57 13861875     0.00     0.00  rayRectangularIntersection
 29.36      1.11     0.54 13861875     0.00     0.00  raySphereIntersection
 10.33      1.30     0.19  2110576     0.00     0.00  compute_specular_diffuse

已經看不到底層的函數了，畫個圖看看新的分佈

嘗試二.一額外 inline

少了最底層的 function call 可以節省 0.3 秒多的時間，如果再減少一層是不是又可以加速呢？這次主要更動raytracing.c裡除了ray_hit_object, ray_color, raytracing的所有函式
結果
Execution time of raytracing() : 1.950107 sec
分析
減少了 0.04 秒，顯然還是有用的，所以接下來直接試試全部 inline！

嘗試二.二全部 inline

這次將把raytracing.c, idx_stack.h所有函式都 inline
結果

raytracing.c: In function ‘ray_color’:
raytracing.c:356:14: error: inlining failed in call to always_inline ‘ray_color’: recursive inlining
 unsigned int ray_color(const point3 e, double t,
              ^
raytracing.c:439:13: error: called from here
         if (ray_color(ip.point, MIN_DISTANCE, r, stk, rectangulars, sp

嘗試二.三全部 inline (除了 ray_color)

結果
Execution time of raytracing() : 1.868269 sec
分析
就如預期的把所有函式都塞進去了，但其實我還挺訝異時間居然有進步 0.11 秒！
根據 When to use inline function and when not to use it? 的解釋，inline 的使用時機
1. very small
2. called often
除了raytracing被呼叫 1 次之外，其他函數的呼叫次數都是百萬次，所以都是 inline 候選人
依照結果來看，這次作業的函式還"不夠大"，所以並沒有產生 inline 的副作用
比較執行檔的大小
2.0 59552
2.1 59664
2.2 失敗
2.3 71248

更深入的解釋 To Inline or Not To Inline（研究中）

平行化

pthread, openmp 兩者本質上並無不同，都是在同一個 process 產生多個平行 threads
在做法上
- openmp 比較簡單，也屬於比較高階的語法，故可攜性也較高，當然也不限 C/C++ 語言
- pthread 比較低階，相對來說工程師可以有較多的控制權與變化空間
依照影片對應的共筆所云，兩者單獨使用的表現的不錯，但兩者一起使用時反而會變差
我的電腦是 4 核，所以執行序的上限也是 4(之後測試 2, 8, 16 等)

嘗試三.一 pthread

因為在傳入參數時共用變數，所以參數不正確進而導致資料分割錯誤，以後要多注意！
修正後結果
Execution time of raytracing() : 1.030958 sec
這是以 inline math-toolkit.h的版本平行化的結果，所以進步是 1.997267 -> 1.030985，共 0.96 秒
目前表現最佳版本

嘗試三.二 openMP

openmp 真的很好寫，只要新增三行













...
    int factor = sqrt(SAMPLES);
#   pragma omp parallel num_threads(thread_count) \
        firstprivate(u, v, w, d, object_color, stk, factor)
    {
#       pragma omp for schedule(static,1)
        for (int j = 0; j < height; j++) {
            for (int i = 0; i < width; i++) {
                double r = 0, g = 0, b = 0;
                /* MSAA */
                ...
    }
}

注意變數 scope {}
- scope 外宣告的變數，預設為 shared，故容易產生 race condition
- 我懶得檢查程式碼，所以一律設定成 firstprivate
  - priavte: thread 之間的變數儲存地址不同，亦即不同份變數，變數值為 undefined
  - firstprivate: 同 private，且變數的值會初始化為 scope 外最新的值
- scope 內宣告的變數一律為 private
結果
Execution time of raytracing() : 0.855305 sec
分析

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 93.06      1.87     1.87   891880     0.00     0.00  ray_color
  6.97      2.01     0.14        1     0.14     2.01  raytracing
  0.00      2.01     0.00        3     0.00     0.00  append_rectangular

因為是用 inline-all 版本實做，所以只剩下ray_color, raytracing兩個函式有時間，而raytracing則被平行化，接下來的目標當然是平行化ray_color

嘗試三.三 omp parallel ray_color

看了ray_color覺得實在麻煩
- 它有遞迴呼叫–-不能 inline
- 它的迴圈形式不是靜態分割–-不能平行化


for (light_node light = lights; light; light = light->next) {

用一開始的關係圖分析
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- ray_hit_object: 68 %
  - rayRectangularIntersection: 40 %
  - raySphereIntersection: 20 %
- compute_specular_defuse: 12 %
但重點是他們都沒有迴圈，所以也無法平行化

結論

pthread 使用最快的 16 threads
openmp 放棄平行 ray_color
現在大學部有開一堂課叫做平行程式設計，好險有修，不然這個作業要從頭開始查，一定會死得很難看
看圖表也知道程式平行化的功力有多強，也就是說用平行化才能發揮多核心電腦真正的效能，往後這塊應該會發展越來越快，需要趁早熟悉啊

2017q1 Homework1 (raytracing)

Reviewed by `Daichou`

電腦規格

安裝必須軟體

學習 gprof, gprof2dot, graphviz

作業目標

原始分析

進階分析

嘗試一 loop unrolling

嘗試二 force inline

嘗試二.一額外 inline

嘗試二.二全部 inline

嘗試二.三全部 inline (除了 ray_color)

平行化

嘗試三.一 pthread

嘗試三.二 openMP

嘗試三.三 omp parallel ray_color

結論

參考資料

tags: `sysprog`

2017q1 Homework1 (raytracing)

Reviewed by Daichou

電腦規格

安裝必須軟體

學習 gprof, gprof2dot, graphviz

作業目標

原始分析

進階分析

嘗試一 loop unrolling

嘗試二 force inline

嘗試二.一 額外 inline

嘗試二.二 全部 inline

嘗試二.三 全部 inline (除了 ray_color)

平行化

嘗試三.一 pthread

嘗試三.二 openMP

嘗試三.三 omp parallel ray_color

結論

參考資料

tags: sysprog

Reviewed by `Daichou`

嘗試二.一額外 inline

嘗試二.二全部 inline

嘗試二.三全部 inline (除了 ray_color)

tags: `sysprog`