2017q1 Homework1 (raytracing)

contributed by < twzjwang >
作業說明: B02: raytracing
github: twzjwang/raytracing

Reviewed by `Sean1127`

在 git commit message 不需要寫Modify xxx.c因為 GitHub 會自動追蹤被修改過的檔案

Reviewed by `Billy4195`

"發現切割的範圍(h_start~h_end) h_end 設錯，每個 thread 都執行到 height" 的程式碼可以強調修改的地方，這樣看一段code有點辛苦。
raytracing()中的thread number可以用處理器數量來決定，讓程式碼有擴充性。
commit c268ef的Subject超過50字，內容被隱藏在…裡面

開發環境

Ubuntu Ubuntu 16.04.2 LTS
Linux 4.8.0-36-generic
cpu
version: Intel® Core™ i5-3337U CPU @ 1.80GHz
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
memory
size: 4GiB

開發前

於 2017q1 Homework1 (phonebook) jserv 提到 Ubuntu 14.04 版本太舊
- 因為專題使用的套件僅支援 Ubuntu 14.04 LTS
  保留 Ubuntu 14.04 LTS 再另外灌 Ubuntu 16.04 LTS
- 灌完後 grub 選單沒有第三個 OS
- 參考 Ubuntu重建GRUB開機選單解決
安裝開發工具
- graphviz
- imagemagick
第一次看到在 main() 裡使用 #include 的寫法
call graph

未優化版本

直接測試

編譯+執行

$ make
$ ./raytracing

結果

# Rendering scene
Done!
Execution time of raytracing() : 3.388795 sec

使用gprof分析

編譯+執行

$ make clean
$ make PROFILE=1
$ ./raytracing
$ gprof ./raytracing | less

結果
- 應該對呼叫次數最多的前幾個 function 做最佳化

# Rendering scene
Done!
Execution time of raytracing() : 7.183814 sec

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.41      0.54     0.54 69646433     0.00     0.00  dot_product
 17.61      0.98     0.44 56956357     0.00     0.00  subtract_vector
  9.61      1.22     0.24 13861875     0.00     0.00  rayRectangularIntersection
  8.81      1.44     0.22 13861875     0.00     0.00  raySphereIntersection
  8.41      1.65     0.21 10598450     0.00     0.00  normalize
...

實驗一 - loop unrolling

在作業要求中提到幾個技巧中，這是唯一沒聽過的名詞，所以想從這個方法開始
先查了資料
- 循環展開，英文中稱（Loop unwinding或loop unrolling），是一種犧牲程序的尺寸來加快程序的執行速度的優化方法。可以由程式設計師完成，也可由編譯器自動優化完成。 (Loop unrolling-wiki)

將 math-toolkit.h 中所有 function 的 loop unroll

如 dot_product :

static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

unroll =>

static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    dp += v1[0] * v2[0];
    dp += v1[1] * v2[1];
    dp += v1[2] * v2[2];
    return dp;
}

結果

Execution time 7.183814 sec => 6.414793 sec (減為 89%)
call 最多次的 dot_product
- self seconds 0.54 => 0.42 (減為 78%)

# Rendering scene
Done!
Execution time of raytracing() : 6.414793 sec

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.94      0.42     0.42 69646433     0.00     0.00  dot_product
 11.89      0.63     0.22 56956357     0.00     0.00  subtract_vector
 11.61      0.84     0.21 13861875     0.00     0.00  rayRectangularIntersection
 11.06      1.04     0.20 10598450     0.00     0.00  normalize
  8.85      1.20     0.16 17821809     0.00     0.00  cross_product
 ...

實驗二 - 使用 Pthread 平行化程式

在 多處理機與平行程式設計 課程曾使用過以下 API
- MPI
  - distributed memory => 傳遞訊息
  - cluster computer
  - 只有一台電腦所以暫時不考慮用MPI實作
- Pthread(POSIX Threads)
  - shared memory
- OpenMP
  - shared memory
  - 使用指令就可平行化，但不知如何實作，容易出錯
partitioning option

(截自多處理機與平行程式設計 課程投影片ch3)

方法一 - block partition

延續實驗一
選擇平行化的部分(function)
- 如果資料互相獨立，不需要處理 critical section
- 選擇重複執行相同指令、不同資料的部分，如:迴圈
選擇平行化 raytracing() 內的最外圍迴圈( height )
先用4個 threads
傳遞參數部分寫法有點忘了
- 參考 Multiple arguments to function called by pthread_create()?
編譯失敗
- Makefile 忘記加上 -pthread
- 參考 undefined reference to pthread_create in Linux
過程中不小心下到 git reset 到上一次commit
- 原本以為要重寫…
- 參考执行了git reset，还有办法取消吗？解決

一開始輸出的圖如下

把height width myrank threat_count 以外的參數傳遞方法從 call by value => call by address
- make check : Verified OK

但時間沒有明顯變快!?

$ ./raytracing 
# Rendering scene
arg myrank 1
arg myrank 0
arg myrank 2
arg myrank 3
Done!
Execution time of raytracing() : 3.450752 sec

發現切割的範圍(h_start~h_end) h_end 設錯，每個 thread 都執行到 height

int h_size = arg->height / arg->thread_count;
int h_start = h_size * arg->myrank;
int h_end = (arg->myrank == arg->thread_count - 1) ?
            h_size * (arg->myrank + 1) : arg->height;
for (int j = h_start; j < h_end; j++) {   
...

修正如下

int h_size = arg->height / arg->thread_count;
int h_start = h_size * arg->myrank;
int h_end = (arg->myrank < arg->thread_count - 1) ?
            h_size * (arg->myrank + 1) : arg->height;
for (int j = h_start; j < h_end; j++) { 
...

可以強調一下修改了哪裡，閱讀速度會比較快Billy SU

結果
- Execution time 3.388795 sec => 1.216715 sec (減為 36%)
```
$ ./raytracing 
# Rendering scene
arg myrank 1
arg myrank 0
arg myrank 3
arg myrank 2
Done!
Execution time of raytracing() : 1.216715 sec    
```
- 若改變thread數量?
  - thread越多 => 效率越低 (因為 create thread 所需時間變多)
  - Speedup = Time(serial) / Time(parallel)
  - Efficenfy = Speedup / number of thread
  num of thread 1 2 4 8 16
  
  Execution time 2.51 1.38 1.22 1.11 1.10
  
  Speedup 1.00 1.82 2.06 2.26 2.28
  
  Efficency 1.00 0.91 0.52 0.28 0.14

num of thread	1	2	4	8	16
Execution time	2.51	1.38	1.22	1.11	1.10
Speedup	1.00	1.82	2.06	2.26	2.28
Efficency	1.00	0.91	0.52	0.28	0.14

方法二 - cyclic partition

作業解說影片中提到熱區不平均的問題
原本切法 - block partition
- myrank = 0 的 thread 處理 :

改成 - cyclic partition

    for (int j = arg->myrank; j < arg->height; j += arg->thread_count)

myrank = 0 的 thread 處理 :

結果 (4 threads)

Execution time 1.216715 sec => 1.086578 sec (減為 89%)

$ ./raytracing 
# Rendering scene
arg myrank 0
arg myrank 1
arg myrank 2
arg myrank 3
Done!
Execution time of raytracing() : 1.086578 sec

2017q1 Homework1 (raytracing)

Reviewed by `Sean1127`

Reviewed by `Billy4195`

開發環境

開發前

未優化版本

實驗一 - loop unrolling

實驗二 - 使用 Pthread 平行化程式

方法一 - block partition

方法二 - cyclic partition

參考資料

tags: `twzjwang`

2017q1 Homework1 (raytracing)

Reviewed by Sean1127

Reviewed by Billy4195

開發環境

開發前

未優化版本

實驗一 - loop unrolling

實驗二 - 使用 Pthread 平行化程式

方法一 - block partition

方法二 - cyclic partition

參考資料

tags: twzjwang

Reviewed by `Sean1127`

Reviewed by `Billy4195`

tags: `twzjwang`