2017q1 Homework1 (raytracing)

contributed by <ierosodin>

Reviewed by `janetwei`

可以證明跟解釋爲何執行緒數量的結果差距
用 SIMD 來加速整個資料的處理

開發環境

作業系統 : elementary OS loki

$ lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Genuine Intel® CPU @ 3.30GHz
Stepping: 5
CPU MHz: 2101.558
CPU max MHz: 3900.0000
CPU min MHz: 1200.0000
BogoMIPS: 6600.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11

軟體安裝

$ sudo apt-get update
$ sudo apt-get install graphviz
$ sudo apt-get install imagemagick

未優化前執行時間：

Execution time of raytracing() : 2.961364 sec

使用gprof分析

$ make PROFILE=1
$ ./raytracing

使用 gprof 時, 執行時間會變長
( gprof 使 gcc 在每個函数中都加入了一個 mcount ，也就是說每個函數都會調用 mcount , 增加執行時間)
Execution time of raytracing() : 5.704768 sec

Using gprof2dot

gprof2dot 能將 gprof 或 perf 的分析結果轉成 dot 格式，裡面會描述各個節點間的關係

$ git clone https://github.com/jrfonseca/gprof2dot
$ gprof raytracing| ../gprof2dot/gprof2dot.py | dot -Tpng -o dot.png

開始分析

$ gprof -b raytracing gmon.out | less

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.82      0.52     0.52 69646433     0.00     0.00  dot_product
 19.75      0.97     0.45 56956357     0.00     0.00  subtract_vector
 10.97      1.22     0.25 10598450     0.00     0.00  normalize
  9.66      1.44     0.22 13861875     0.00     0.00  rayRectangularIntersection
  9.44      1.66     0.22 31410180     0.00     0.00  multiply_vector
  7.02      1.82     0.16 13861875     0.00     0.00  raySphereIntersection
  4.39      1.92     0.10 17821809     0.00     0.00  cross_product
  3.73      2.00     0.09 17836094     0.00     0.00  add_vector
  2.19      2.05     0.05  4620625     0.00     0.00  ray_hit_object
  2.19      2.10     0.05  1048576     0.00     0.00  ray_color
  1.54      2.14     0.04  4221152     0.00     0.00  multiply_vectors
  1.32      2.17     0.03        1     0.03     2.27  raytracing
  1.10      2.19     0.03  2520791     0.00     0.00  idx_stack_top
  0.88      2.21     0.02  1241598     0.00     0.00  protect_color_overflow
  0.88      2.23     0.02  1048576     0.00     0.00  rayConstruction
  0.66      2.25     0.02  3838091     0.00     0.00  length
  0.44      2.26     0.01  2110576     0.00     0.00  compute_specular_diffuse
  0.44      2.27     0.01  2110576     0.00     0.00  localColor
  0.44      2.28     0.01        1     0.01     0.01  delete_sphere_list
  0.22      2.28     0.01    37595     0.00     0.00  idx_stack_pop
  0.00      2.28     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      2.28     0.00  1241598     0.00     0.00  reflection
  0.00      2.28     0.00  1241598     0.00     0.00  refraction
  0.00      2.28     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      2.28     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      2.28     0.00   113297     0.00     0.00  fresnel
  0.00      2.28     0.00        3     0.00     0.00  append_rectangular
  0.00      2.28     0.00        3     0.00     0.00  append_sphere

從 gprof 調用表可以發現， dot_product 與 subtract_vector 被呼叫次數最多, 嘗試針對這兩項進行優化。

baseline.ppm

優化

先試試 compiler 的優化能力吧！(目標超越它)

loop unrolling

可以發現， math-toolkit.h 是一個小迴圈，試著將這些 for 迴圈打開：







static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    dp += v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v1[2];
    return dp;
}

再用 gprof 檢查一次，數學運算的部份所花費的時間下降了！

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 16.90      0.25     0.25 10598450     0.00     0.00  normalize
 12.85      0.44     0.19 69646433     0.00     0.00  dot_product
 10.82      0.60     0.16 13861875     0.00     0.00  rayRectangularIntersection
 10.48      0.76     0.16 56956357     0.00     0.00  subtract_vector
 10.14      0.91     0.15  4620625     0.00     0.00  ray_hit_object

gprof2dot

執行時間：

Execution time of raytracing() : 2.252323 sec

force inline

inline function 只能建議編譯器，也就是說建議並不一定會被採納，可以嘗試使用 __attribute__((always_inline))

執行時間：

Execution time of raytracing() : 1.979259 sec

平行化

OpenMP

發現rayRectangularIntersection與raySphereIntersection對程式效能有影響, 嘗試對raytracing.c進行優化。

#pragma omp parallel default (shared) private(stk, d, object_color) num_threads(omp_get_max_threads())
#pragma omp for schedule(static)

使用 OpenMP 時要注意變數的性質，如果是每個 thread 都各自擁有的變數，須設為 privated。

可以用 diff 來檢查結果( out.ppm )是否正確
$ diff baseline.ppm out.ppm

out.ppm

使用 OpenMP 對效能有很明顯的提升
Execution time of raytracing() : 0.398234 sec

gprof2dot

光是使用 OpenMP 就超越 -Ofast 拉~~ ierosodin

pThreads

另一種平行化的方式，可以決定將某個 function 交給 thread 去做。

main.c

宣告 thread id ，並分配記憶體

pthread_t *id = (pthread_t *) malloc(THREADNUM * sizeof(pthread_t));

++宣告一個指標型態的陣列，使每一個 id 都有一個指標型態的參數( ptr[i] ),用來傳入 function ++

rays **ptr = (rays **) malloc(THREADNUM * sizeof(rays *));

建立 thread ，參數分別是：
（ thread id ，屬性( NULL )，要平行化的 function ，傳入的參數）

pthread_create(&id[i], NULL, (void *) &raytracing, (void *) ptr[i]);

用來等待該 id 的 thread 結束，第二個參數用來儲存回傳值

pthread_join(id[i], NULL);

raytracing.c

放在 raytracing() 的最後，表示該 function 結束後， thread 關閉

pthread_exit(0);

在 THREAD_NUM = 12 的情況下，執行時間縮短到 0.315882 秒！

gporf2dot

不同 THREAD_NUM 對效能的影響

可以發現無止境的增加 THREAD_NUM 並不會對效能有明顯的影響，甚至會增加執行時間。

threadpool

threadpool 不同於一般的 thread ，是將所有的 tasks 丟入 pool 中，而閒置的 thread 則從這個 pool 中取出可以執行的 task ，這樣可以避免不同 thread 之間工作量不平均或是 thread 執行速度不一而造成效能的損耗。
參考 mergesort-concurrent 中對 threadpool 的應用，實作在 raytracing 中，從效能比較圖可以發現， threadpool 比一般的 pthread 又更來得快速。

不同 THREAD_NUM 對效能的影響

thread number	execution time
1	1.769772s
2	0.867196s
4	0.468847s
8	0.302147s
12	0.257031s
16	0.258006s
64	0.261613s
128	0.262480s
256	0.266355s
512	0.282510s

OpenCL

https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/opencl-c/
http://www.kimicat.com/opencl-1/opencl-jiao-xue-yi
https://github.com/smistad/OpenCL-Getting-Started

GCC optimization

最後，再用一次 -Ofast 吧~（不過要檢查最後出來的 out.ppm 是否正確，優化 pthread 可能會產生錯誤）

參考資料

gprof2dot github
pThreads for Raytracing
Enhance raytracing program