# 2017 q1 Homework1 (raytracing)
contributed by <`zmke`>
### Reviewed by `twzjwang`
- multithreads 下使用 gprof ,時間卻增加:
`gprof` is poorly suited for analysis of parallel programs as it doesn't understand the intricacies of OpenMP.
[gprof on both OpenMP and without OpenMP codes produces different flat profile](http://stackoverflow.com/questions/36919518/gprof-on-both-openmp-and-without-openmp-codes-produces-different-flat-profile)
- 可用其他方法實作平行化,如 Pthread
### Reviewed by `henry0929016816`
* 比較了 thread 數的差異 ,這一點我覺得做的很好
* 可以試試看 SIMD
## 開發環境
OS: Ubuntu 16.04 LTS
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
Model name: Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 3072K
## gprof
* 可以印出程式執行時各函數消耗的函數、 funtion 之間互相呼叫的關係,幫助 programmer 分析出消耗較多時間的函數,找出當作改善效能的方向,尤其是對大型程式來說是相當實用的工具
* `$ make PROFILE=1`: 加入 -pg , compiler 會在程式中間插入額外的程式碼,執行後會產生 gmon.out
* `$ gprof ./raytracing | less`
## 原始版本
* 直接執行 clone 下來的程式
`$ make`
`$ ./raytracing`
```
# Rendering scene
Done!
Execution time of raytracing() : 2.499698 sec
```
* 使用 gprof 進行分析
`$ make PROFILE=1`
`$ ./raytracing`
```
# Rendering scene
Done!
Execution time of raytracing() : 5.441733 sec
```
compiler 在程式中加入額外程式碼,執行時間增加了
`$ gprof ./raytracing | less`
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
25.02 0.51 0.51 69646433 0.00 0.00 dot_product
14.71 0.81 0.30 56956357 0.00 0.00 subtract_vector
11.28 1.04 0.23 31410180 0.00 0.00 multiply_vector
10.06 1.25 0.21 13861875 0.00 0.00 rayRectangularIntersection
7.85 1.41 0.16 17836094 0.00 0.00 add_vector
7.36 1.56 0.15 10598450 0.00 0.00 normalize
5.89 1.68 0.12 4620625 0.00 0.00 ray_hit_object
4.41 1.77 0.09 17821809 0.00 0.00 cross_product
3.19 1.83 0.07 13861875 0.00 0.00 raySphereIntersection
2.45 1.88 0.05 1048576 0.00 0.00 ray_color
1.96 1.92 0.04 2110576 0.00 0.00 compute_specular_diffuse
1.96 1.96 0.04 1 0.04 2.04 raytracing
```
可以清楚看到時間集中在 dot_product 、 subtract_vector 、 multiply_vector ,這些 function 主要用來進行三維向量運算,可以在 `math-toolkit.h` 找到定義
* 用 perf 來分析 branch miss
```
Performance counter stats for './raytracing' (10 runs):
19,296,540 branch-misses # 0.40% of all branches ( +- 0.43% )
4,782,031,932 branches ( +- 0.00% )
32,981,345,906 instructions # 1.90 insn per cycle ( +- 0.00% )
17,360,336,322 cycles ( +- 0.22% )
5.137705397 seconds time elapsed ( +- 0.22% )
```
## compiler 優化
* 實際動手之前先來看看 compiler 到底有多強大,更改優化選項 `-Ofast`
`$ make PROFILE=1`
`$ ./raytracing`
```
# Rendering scene
Done!
Execution time of raytracing() : 0.518435 sec
```
這效果...
`$ gprof ./raytracing | less`
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
52.21 0.24 0.24 4620625 51.98 51.98 ray_hit_object
21.76 0.34 0.10 2110576 47.42 47.42 compute_specular_diffuse
10.88 0.39 0.05 592239 84.49 408.09 ray_color
10.88 0.44 0.05 raytracing
2.18 0.45 0.01 2110576 4.74 4.74 localColor
2.18 0.46 0.01 1241598 8.06 8.06 refraction
0.00 0.46 0.00 113297 0.00 0.00 fresnel
```
* 發現剛剛看到最花時間的 function 都沒在表上,應該是 inline 發揮作用了,在 function 的定義前加入 inline 提示 compiler 在編譯的時候就將 function 直接展開,減少呼叫 function 所需要成本
## Method1 - force inline
* 實驗將 `math-toolkit.h` 裡面的 inline 改成 `__attribute__((always_inline)) inline` ,讓 compiler 在關閉編譯最佳化的情形下將 function inline
`$ make PROFILE=1`
`$ ./raytracing`
```
# Rendering scene
Done!
Execution time of raytracing() : 2.967841 sec
```
`$ gprof ./raytracing | less`
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
40.48 0.87 0.87 13861875 0.00 0.00 rayRectangularIntersection
18.61 1.27 0.40 13861875 0.00 0.00 raySphereIntersection
10.93 1.51 0.24 2110576 0.00 0.00 compute_specular_diffuse
7.91 1.68 0.17 1048576 0.00 0.00 ray_color
6.51 1.82 0.14 2110576 0.00 0.00 localColor
6.28 1.95 0.14 4620625 0.00 0.00 ray_hit_object
4.89 2.06 0.11 1048576 0.00 0.00 rayConstruction
1.86 2.10 0.04 1241598 0.00 0.00 reflection
1.40 2.13 0.03 1 0.03 2.15 raytracing
```
* 時間從 5.4417 sec 下降到 2.9678 sec ,從 gprof 的報表上也看不到剛剛在 `math-tool-kit.h` force inline 的 function
## Method2 - loop unrolling
* 將迴圈展開,減少 branch misses 帶來的成本,不過可能會造成程式碼變冗長、可讀性降低的問題,所以通常是由 compiler 來完成
* 實驗將 `math-toolkit.h` function 內的 loop 展開
`$ make PROFILE=1`
`$ ./raytracing`
```
# Rendering scene
Done!
Execution time of raytracing() : 4.346391 sec
```
`$ gprof ./raytracing | less`
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
16.03 0.21 0.21 69646433 0.00 0.00 dot_product
15.63 0.41 0.20 56956357 0.00 0.00 subtract_vector
14.85 0.60 0.19 10598450 0.00 0.00 normalize
8.99 0.71 0.12 17821809 0.00 0.00 cross_product
7.82 0.81 0.10 13861875 0.00 0.00 rayRectangularIntersection
6.64 0.90 0.09 13861875 0.00 0.00 raySphereIntersection
6.25 0.98 0.08 1048576 0.00 0.00 ray_color
4.69 1.04 0.06 4620625 0.00 0.00 ray_hit_object
3.91 1.09 0.05 1048576 0.00 0.00 rayConstruction
3.52 1.13 0.05 31410180 0.00 0.00 multiply_vector
2.74 1.17 0.04 17836094 0.00 0.00 add_vector
2.35 1.20 0.03 2110576 0.00 0.00 localColor
1.95 1.22 0.03 4221152 0.00 0.00 multiply_vectors
1.56 1.24 0.02 1 0.02 1.27 raytracing
```
* 時間從 5.4417 sec 下降到 4.3464 sec
```
* dot_product: 0.51 -> 0.21
* subtract_vector: 0.30 -> 0.20
* multify_vector: 0.23 -> 0.05
* add_vector: 0.16 -> 0.04
```
* 用 perf 來分析 branch miss
```
Performance counter stats for './raytracing' (10 runs):
16,359,752 branch-misses # 0.42% of all branches ( +- 0.69% )
3,873,917,856 branches ( +- 0.00% )
25,746,564,892 instructions # 1.76 insn per cycle ( +- 0.00% )
14,607,513,939 cycles ( +- 0.23% )
4.322380613 seconds time elapsed ( +- 0.27% )
```
* branch misses 從 19,197,236 -> 16,387,279
* branches 從 4,782,031,932 -> 3,873,917,856 ,減少了9億次!!
* 從實驗結果可以看出, loop unrolling 對於改善 branch miss 是有效的
* 測試不加上 -pg 的情形,時間縮短 (2.50 -> 1.68)
`$ make clean`
`$ make`
`$ ./raytracing`
```
# Rendering scene
Done!
Execution time of raytracing() : 1.682859 sec
```
## Method3 - OpenMP
剛好上學期有修平行程式設計,上課有提過 OpenMP ,把投影片找出來複習一下
* An API for shared-memory parallel programming
* Designed for systems in which each thread or process can potentially have access to all available memory
* System is viewed as a collection of cores or CPU’s, all of which have access to main memory
* 以 loop unrolling 的版本為基礎,針對 raytracing() 做平行化
### (1) default schedule
``` ==
void raytracing(uint8_t *pixels, color background_color,
rectangular_node rectangulars, sphere_node spheres,
light_node lights, const viewpoint *view,
int width, int height)
{
point3 u, v, w, d;
color object_color = { 0.0, 0.0, 0.0 };
/* calculate u, v, w */
calculateBasisVectors(u, v, w, view);
idx_stack stk;
int factor = sqrt(SAMPLES);
#pragma omp parallel for num_threads (thread_count) \
default(shared) private(stk, d, object_color)
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; i++) {
double r = 0, g = 0, b = 0;
/* MSAA */
for (int s = 0; s < SAMPLES; s++) {
idx_stack_init(&stk);
rayConstruction(d, u, v, w,
i * factor + s / factor,
j * factor + s % factor,
view,
width * factor, height * factor);
if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
lights, object_color,
MAX_REFLECTION_BOUNCES)) {
r += object_color[0];
g += object_color[1];
b += object_color[2];
} else {
r += background_color[0];
g += background_color[1];
b += background_color[2];
}
pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
}
}
}
}
```
**thread_count = 4**
`$ make PROFILE=1`
`$ ./raytracing`
```
# Rendering scene
Done!
Execution time of raytracing() : 7.930475 sec
```
時間比只有 loop unrolling 增加了 (4.35 -> 7.93)
先不使用 gprof 測試看看
`$ make clean`
`$ make`
`$ ./raytracing`
```
# Rendering scene
Done!
Execution time of raytracing() : 0.818154 sec
```
和 loop unrolling 不加上 -pg 的情況相比時間下降了 (1.68 -> 0.89) ,推測在 OpenMP multi threads 的情況下, -pg 對程式帶來的成本也跟著增加了
* 比較不同 thread 數量對程式加速的效果,隨著 thread 增加,加速的比例也逐漸降低,當 thread_count 超過 16 之後, creat thread 和 join 的成本已經比速度提升還要高,反而失去加速的效果
```
thread_count = 1 1.720003 sec
thread_count = 2 0.955195 sec
thread_count = 4 0.818154 sec
thread_count = 8 0.767247 sec
thread_count = 16 0.743067 sec
thread_count = 32 0.801165 sec
```
### (2) cyclic schedule
* static schedule : the iterations can be assigned to the thread before the loop is executed
* 避免程式運算熱點分佈不均,利用 static schedule ,將raytracing() 作 cyclic 切割
* 加上 `schedule(static, 1)`
``` ==
#pragma omp parallel for num_threads (thread_count) \
default(shared) private(stk, d, object_color) schedule(static, 1)
```
**thread_count = 4**
`$ make`
`$ ./raytracing`
```
# Rendering scene
Done!
Execution time of raytracing() : 0.728800 sec
```
在 thread_count = 4 的情形下,與 default schedule 相比,執行時間縮短 (0.818 -> 0.729)
* cyclic schedule 可以將運算平均分配給各 thread ,平行加速效果比 default schedule 還要好
* 將以上3個實驗一起實作,得到的運算時間是 0.629104 sec ,只有原始版本的 25%,不過還是沒贏過 -Ofast
![](https://i.imgur.com/gj1OAnW.png)
## Reference
[簡易的程式平行化-OpenMP](https://kheresy.wordpress.com/2006/09/15/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E5%9B%9B%EF%BC%89%E7%AF%84%E4%BE%8B-for/)
[hugikun999 筆記](https://hackmd.io/s/HyHhgcv6)
[LanKuDot 筆記](https://hackmd.io/s/rkggUeKa)
成大資訊系-多處理機平行程式設計 課程投影片