# 2016q3 Homework (raytracing)
contributed by <`kaizsv`>
## gprof
```
# Rendering scene
Done!
Execution time of raytracing() : 6.473611 sec
```
```
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  s/call  s/call  name
 24.22      0.53     0.53  56956357    0.00    0.00  subtract_vector
 23.30      1.04     0.51  69646433    0.00    0.00  dot_product
 13.94      1.35     0.31  31410180    0.00    0.00  multiply_vector
  6.85      1.50     0.15  17836094    0.00    0.00  add_vector
  5.48      1.62     0.12  10598450    0.00    0.00  normalize
  4.57      1.72     0.10  17821809    0.00    0.00  cross_product
  4.57      1.82     0.10  13861875    0.00    0.00  rayRectangularIntersection
  4.57      1.92     0.10  13861875    0.00    0.00  raySphereIntersection
```
The above is my gprof output. The first step is to apply loop unrolling to the most frequently called helpers: `subtract_vector`, `dot_product`, `multiply_vector`, and `add_vector`.
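For reference, a flat profile like this is typically produced by compiling with `-pg`, running the program once, and then feeding the generated `gmon.out` to `gprof` (illustrative commands, assuming all sources sit in the current directory; the project's Makefile may already wrap these steps):
```
$ gcc -pg -o raytracing *.c -lm
$ ./raytracing
$ gprof ./raytracing gmon.out | less
```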
#### loop unrolling
```
# Rendering scene
Done!
Execution time of raytracing() : 5.899568 sec
```
Roughly 0.6 seconds faster (6.47 s → 5.90 s).
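As a minimal sketch of the change (assuming the vector helpers in `math-toolkit.h` originally iterate over the three components with a short `for` loop), the loop is simply replaced by three explicit statements:
```C=
/* Before: a 3-iteration loop over the vector components.
 * After: unrolled by hand, removing the loop counter and the
 * end-of-loop branch. */
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}
```
The same transformation applies to `dot_product`, `multiply_vector`, and `add_vector`.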
#### OpenMP
```
# Rendering scene
Done!
Execution time of raytracing() : 0.983607 sec
Verified OK
```
OpenMP lets a program run in parallel on a shared-memory machine; enable it by adding `-fopenmp` at compile time.
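For example, building by hand might look like `gcc -fopenmp -o raytracing *.c -lm` (illustrative only; in this project the flag would normally be added to the Makefile's `CFLAGS` and `LDFLAGS`). The parallelized rendering loop then looks like this: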
```C=
#include <omp.h>

#pragma omp parallel for num_threads(16) \
                         schedule(guided, 4) \
                         private(d) \
                         private(stk) \
                         firstprivate(object_color)
for (int j = 0; j < height; j++) {
    for (int i = 0; i < width; i++) {
        double r = 0, g = 0, b = 0;
        /* MSAA */
        for (int s = 0; s < SAMPLES; s++) {
            idx_stack_init(&stk);
            rayConstruction(d, u, v, w,
                            i * factor + s / factor,
                            j * factor + s % factor,
                            view,
                            width * factor, height * factor);
            if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                          lights, object_color,
                          MAX_REFLECTION_BOUNCES)) {
                r += object_color[0];
                g += object_color[1];
                b += object_color[2];
            } else {
                r += background_color[0];
                g += background_color[1];
                b += background_color[2];
            }
            pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
        }
    }
}
```
`#pragma omp parallel for`
: The OpenMP compiler directive that tells the compiler to parallelize the following `for` loop.
`num_threads(16)`
: How many threads to run with; `omp_get_max_threads()` can also be used to run with as many threads as are available.
`schedule(guided, 4)`
: The `schedule` clause tells the compiler/runtime how to divide the loop iterations among the threads.
`static`
: Iterations are divided up front and each thread works through its share in order; the difference between `schedule(static)` and `schedule(static, 4)` is illustrated below.

    #pragma omp parallel for num_threads(4) schedule(static)
    for (int i = 0; i < 1000; i++) {}

    thread 1: i = 0 ~ 249
    thread 2: i = 250 ~ 499
    thread 3: i = 500 ~ 749
    thread 4: i = 750 ~ 999

    #pragma omp parallel for num_threads(4) schedule(static, 4)
    for (int i = 0; i < 1000; i++) {}

    thread 1: i = 0, 1, 2, 3, 16, 17...
    thread 2: i = 4, 5, 6, 7, 20, 21...
    thread 3: i = 8, 9, 10, 11, 24, 25...
    thread 4: i = 12, 13, 14, 15, 28, 29...
`dynamic`
: A thread is dynamically handed another chunk only after it finishes the one it is working on.
`guided`
: Similar to dynamic, but the chunk size decreases exponentially as the loop proceeds.
`auto`
: The compiler/runtime decides the schedule on its own.
`runtime`
: The user chooses the schedule at run time through the `OMP_SCHEDULE` environment variable.
`private and shared`
: With `private`, each thread gets its own copy of the variable; with `shared`, all threads share the same one.
`firstprivate`
: Also a private variable, but if the variable already has a value before the loop, `firstprivate` copies that initial value into every thread's copy, whereas a plain `private` copy starts out uninitialized. Similarly, `lastprivate` controls whether the value from the last iteration is written back to the original variable after the loop finishes. A small runnable sketch follows this list.
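To see these clauses in action, here is a minimal, self-contained sketch (the file name `demo.c`, the thread count, and the chunk size are only for illustration). With `schedule(static, 4)` each thread prints the contiguous 4-iteration chunks it was handed, and `firstprivate(init)` makes every thread start from the value `init` had before the loop:
```C=
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int init = 42;  /* already has a value before the parallel loop */

    /* firstprivate: each thread's copy of init starts at 42;
     * with plain private it would start uninitialized. */
    #pragma omp parallel for num_threads(4) schedule(static, 4) \
            firstprivate(init)
    for (int i = 0; i < 16; i++)
        printf("thread %d handles i = %2d (init = %d)\n",
               omp_get_thread_num(), i, init);
    return 0;
}
```
Build it with `gcc -fopenmp demo.c -o demo`. If the clause were `schedule(runtime)` instead, the assignment of iterations would follow the `OMP_SCHEDULE` environment variable, e.g. `OMP_SCHEDULE="dynamic,4" ./demo`.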
Inside the `raytracing` loop, `stk` is reset at the start of every iteration and `d` is first normalized in `rayConstruction` before the color is computed, so both are declared `private`; `object_color` already has an initial value before entering the loop, so it is declared `firstprivate`.
Reference: [多核心高效能程式開發](http://weblis.lib.ncku.edu.tw/search~S1*cht?/X{u591A}{u6838}{u5FC3}&searchscope=1&SORT=D/X{u591A}{u6838}{u5FC3}&searchscope=1&SORT=D&SUBKEY=%E5%A4%9A%E6%A0%B8%E5%BF%83/1%2C159%2C159%2CB/frameset&FF=X{u591A}{u6838}{u5FC3}&searchscope=1&SORT=D&11%2C11%2C)
###### tags: `assigment_2` `raytracing`