# 2017q1 Homework1 (raytracing)
contributed by <`rayleigh0407`>
### Reviewed by `twzjwang`
- 可使用圖表比較不同實作方法的結果,並與編譯器優化比較
- 可以刪除多餘的空行及已註解的程式碼
- 如果有時間可以嘗試其他方法加速
## 開發環境
```shell
$lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Stepping: 3
CPU MHz: 800.000
CPU max MHz: 4200.0000
CPU min MHz: 800.0000
BogoMIPS: 8016.72
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
```
>中英文間記得用空白隔開~
>[name=課程助教][color=red]
## 前置作業
### gprof
gprof 是一個可以協助你開發程式的工具, 它可以檢測你的程式執行時間的分佈, 在 GNU gprof 的簡介就提到
*This manual describes the gnu profiler, gprof, and how you can use it to determine which parts of a program are taking most of the execution time. We assume that you know how to write, compile, and execute programs...*
在編譯的時候, 你可以加入參數```-pg```, 接著執行過程式後會產生gmon.out檔, 接著只要輸入以下指令
```shell
gprof -b $(exec) gmon.out | less
```
-b 是將各種結果對應的說明刪除, 只留下執行分析的結果
less 可以讓你上下滾動查看輸出結果
### raytracing
## 開發過程
### 1. 原始版本
先下載原始碼
載完立即來測試看看
```shell
$ git clone https://github.com/sysprog21/raytracing
$ cd raytracing
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 2.224557 sec
```
用 gprof 來看看程式執行的部份
```shell
$ make PROFILE=1(輸出gmon.out)
$ ./raytracing
$ gprof -b raytracing gmon.out | less
```
執行的結果如下
```shell
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
20.23 0.37 0.37 69646433 0.00 0.00 dot_product
18.04 0.70 0.33 56956357 0.00 0.00 subtract_vector
12.85 0.94 0.24 13861875 0.00 0.00 rayRectangularIntersection
9.30 1.11 0.17 4620625 0.00 0.00 ray_hit_object
8.20 1.26 0.15 31410180 0.00 0.00 multiply_vector
6.29 1.37 0.12 17836094 0.00 0.00 add_vector
6.29 1.49 0.12 13861875 0.00 0.00 raySphereIntersection
5.47 1.59 0.10 10598450 0.00 0.00 normalize
4.92 1.68 0.09 17821809 0.00 0.00 cross_product
1.64 1.71 0.03 4221152 0.00 0.00 multiply_vectors
1.09 1.73 0.02 2110576 0.00 0.00 compute_specular_diffuse
1.09 1.75 0.02 2110576 0.00 0.00 localColor
1.09 1.77 0.02 1241598 0.00 0.00 protect_color_overflow
1.09 1.79 0.02 1048576 0.00 0.00 rayConstruction
0.82 1.80 0.02 3838091 0.00 0.00 length
0.55 1.81 0.01 1241598 0.00 0.00 refraction
0.55 1.82 0.01 1048576 0.00 0.00 ray_color
0.27 1.83 0.01 2558386 0.00 0.00 idx_stack_empty
0.27 1.83 0.01 1204003 0.00 0.00 idx_stack_push
0.00 1.83 0.00 2520791 0.00 0.00 idx_stack_top
0.00 1.83 0.00 1241598 0.00 0.00 reflection
0.00 1.83 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.83 0.00 113297 0.00 0.00 fresnel
0.00 1.83 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.83 0.00 3 0.00 0.00 append_rectangular
0.00 1.83 0.00 3 0.00 0.00 append_sphere
0.00 1.83 0.00 2 0.00 0.00 append_light
0.00 1.83 0.00 1 0.00 0.00 calculateBasisVectors
0.00 1.83 0.00 1 0.00 0.00 delete_light_list
0.00 1.83 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.83 0.00 1 0.00 0.00 delete_sphere_list
0.00 1.83 0.00 1 0.00 0.00 diff_in_second
0.00 1.83 0.00 1 0.00 1.83 raytracing
0.00 1.83 0.00 1 0.00 0.00 write_to_ppm
...
```
下半部份主要為函式呼叫的分支(即一個函式呼叫了哪些函式)和統計次數
由此可看出, 佔用大部分時間的都是一些小函式( dot_product 不到十行)
試著從這些小地方開始著手
### 2. 測試1 Loop unrolling
這是 math-toolkit.h 中的 dot_product() 和 subtract_vector()
```c=
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
for (int i = 0; i < 3; i++)
out[i] = a[i] - b[i];
}
...
static inline
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
for (int i = 0; i < 3; i++)
dp += v1[i] * v2[i];
return dp;
}
```
可以發現, 原開發者早就考慮到他們佔用相當多的時間, 便宣告成 inline function 以提昇執行速度和減少記憶體的佔用([關於inline function](http://www.greenend.org.uk/rjk/tech/inline.html)), 但程式呼叫次數太多了, 佔用時間還是相對地大
首先我先這樣修改:
```c=
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
out[0] = a[0] - b[0];
out[1] = a[1] - b[1];
out[2] = a[2] - b[2];
}
...
static inline
double dot_product(const double *v1, const double *v2)
{
dp += v1[0] * v2[0];
dp += v1[1] * v2[1];
dp += v1[2] * v2[2];
return dp;
}
```
將函式拆開的用意是, 在每次執行一次迴圈時, 都會檢查是否要跳出迴圈, 以組語來說就類似 jne jge 等等...(可參考[x86 Assembly Guide](http://www.cs.virginia.edu/~evans/cs216/guides/x86.html)), 使用 loop unrolling, 每次執行函式就會少一兩個 instructions, 多次使用後節省的時間也是很可觀的
我們來看執行結果
**origin**
```shell
Performance counter stats for './raytracing' (10 runs):
22,9075 cache-misses # 23.203 % of all cache refs ( +- 16.92% )
98,7278 cache-references ( +- 16.48% )
194,6446,2315 instructions # 2.21 insns per cycle ( +- 0.00% )
88,2142,5404 cycles ( +- 0.93% )
2.251422149 seconds time elapsed ( +- 0.92% )
```
**loop unrolling**
```shell
Performance counter stats for './raytracing' (10 runs):
13,3195 cache-misses # 22.574 % of all cache refs ( +- 9.44% )
59,0030 cache-references ( +- 9.96% )
147,1677,2511 instructions # 1.86 insns per cycle ( +- 0.00% )
79,0934,9083 cycles ( +- 1.25% )
2.019096528 seconds time elapsed ( +- 1.26% )
```
雖然 cache-miss 稍微地下降而已, 但是 instrcution 足足少了 **24.39%**, 執行時間也隨之減少 **10.33%**, 有時候看似細微的地方, 也能牽扯到整個大局
執行 gprof:
```shell
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
18.01 0.23 0.23 69646433 0.00 0.00 dot_product
13.21 0.39 0.17 56956357 0.00 0.00 subtract_vector
11.21 0.53 0.14 10598450 0.00 0.00 normalize
10.81 0.67 0.14 13861875 0.00 0.00 rayRectangularIntersection
8.81 0.78 0.11 4620625 0.00 0.00 ray_hit_object
7.60 0.87 0.10 31410180 0.00 0.00 multiply_vector
7.60 0.97 0.10 13861875 0.00 0.00 raySphereIntersection
5.20 1.03 0.07 17836094 0.00 0.00 add_vector
5.20 1.10 0.07 17821809 0.00 0.00 cross_product
3.60 1.14 0.05 4221152 0.00 0.00 multiply_vectors
2.80 1.18 0.04 2110576 0.00 0.00 compute_specular_diffuse
1.60 1.20 0.02 1048576 0.00 0.00 ray_color
1.20 1.21 0.02 2110576 0.00 0.00 localColor
0.80 1.22 0.01 2558386 0.00 0.00 idx_stack_empty
0.80 1.23 0.01 2520791 0.00 0.00 idx_stack_top
0.80 1.24 0.01 1 0.01 0.01 delete_sphere_list
0.80 1.25 0.01 1 0.01 1.24 raytracing
0.00 1.25 0.00 3838091 0.00 0.00 length
0.00 1.25 0.00 1241598 0.00 0.00 protect_color_overflow
0.00 1.25 0.00 1241598 0.00 0.00 reflection
0.00 1.25 0.00 1241598 0.00 0.00 refraction
0.00 1.25 0.00 1204003 0.00 0.00 idx_stack_push
0.00 1.25 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.25 0.00 1048576 0.00 0.00 rayConstruction
0.00 1.25 0.00 113297 0.00 0.00 fresnel
0.00 1.25 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.25 0.00 3 0.00 0.00 append_rectangular
0.00 1.25 0.00 3 0.00 0.00 append_sphere
0.00 1.25 0.00 2 0.00 0.00 append_light
0.00 1.25 0.00 1 0.00 0.00 calculateBasisVectors
0.00 1.25 0.00 1 0.00 0.00 delete_light_list
0.00 1.25 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.25 0.00 1 0.00 0.00 diff_in_second
0.00 1.25 0.00 1 0.00 0.00 write_to_ppm
```
dot_product 下降了0.14s
substact_vector 下降了0.16s
那既然如此, 乾脆把 math-toolkit 內的迴圈都做 loop unrolling 看看
執行結果:
```shell
Performance counter stats for './raytracing' (10 runs):
11,9138 cache-misses # 23.585 % of all cache refs ( +- 2.98% )
50,5147 cache-references ( +- 10.63% )
127,3165,0289 instructions # 1.74 insns per cycle ( +- 0.00% )
73,2732,5074 cycles ( +- 0.31% )
1.869817600 seconds time elapsed ( +- 0.31% )
```
相較於 origin
instrcution 減少 **34.59%**
time elapsed 減少 **16.95%**
以上所有測試皆通過圖形測試
```shell
$make check
# Rendering scene
Done!
Execution time of raytracing() : 4.027406 sec
Verified OK
```
### 3. 測試2 平行化
試著平行化迴圈
math-toolkit.h 內的函式皆為 inline function
實行平行化可能沒辦法在編譯時展開
先看看 raytracing.c
```c=
void raytracing(uint8_t *pixels, color background_color,
rectangular_node rectangulars, sphere_node spheres,
light_node lights, const viewpoint *view,
int width, int height)
{
...
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; i++) {
double r = 0, g = 0, b = 0;
/* MSAA */
for (int s = 0; s < SAMPLES; s++) {
idx_stack_init(&stk);
rayConstruction(d, u, v, w,
i * factor + s / factor,
j * factor + s % factor,
view,
width * factor, height * factor);
...
}
```
在 raytracing() 內
是一次畫一個點去作圖的
看能不能針對這巢狀迴圈做平行化
在這之前先考慮 data dependency
...
看了很久才發現資料沒有任何資料寫入衝突(可以看 function parameter 來過濾, 如果宣告為 const 則此資料不會被更動)
那就來實做平行化
#### Pthread
將外層函式做平行化
```c=
int main(...)
{
...
pthread_t *thread_handles;
thread_handles = (pthread_t*)malloc(sizeof(pthread_t) * thread_num);
for(int i = 0; i < thread_num; i++) {
each_rank[i] = i;
pthread_create(&thread_handles[i], NULL, raytracing_pthread, (void *)&each_rank[i]);
}
}
void *raytracing_pthread(void *rank)
{
int my_rank = *(int*)rank;
...
int start = (height / THREAD_NUMBER) * my_rank;
int end = (height / THREAD_NUMBER) * (my_rank + 1);
for (int j = start; j < end; j++) {
for (int i = 0; i < width; i++) {
...
}
}
}
```
執行結果:
為了方便分配工作區域, 以2的倍數來分配
雙核:
```shell
# Rendering scene
Thread number : 2
Done!
Execution time of raytracing() : 1.110519 sec
```
四核:
```shell
# Rendering scene
Thread number : 4
Done!
Execution time of raytracing() : 0.816143 sec
```
八核:
```shell
# Rendering scene
Thread number : 8
Done!
Execution time of raytracing() : 0.506245 sec
```
效能成長沒有想像中好
換個方式分配看看
```c=
for (int j = my_rank; j < height; j+=THREAD_NUMBER) {
```
執行結果:
單執行序:
```shell
Performance counter stats for './raytracing' (10 runs):
11,0062 cache-misses # 23.736 % of all cache refs ( +- 2.04% )
46,3689 cache-references ( +- 5.54% )
127,2937,7767 instructions # 1.75 insns per cycle ( +- 0.00% )
72,9304,6745 cycles ( +- 0.12% )
1.865169660 seconds time elapsed ( +- 0.23% )
```
雙核:
```shell
Performance counter stats for './raytracing' (10 runs):
11,0145 cache-misses # 18.638 % of all cache refs ( +- 2.23% )
59,0976 cache-references ( +- 5.39% )
127,2961,2413 instructions # 1.58 insns per cycle ( +- 0.00% )
80,5269,8888 cycles ( +- 0.07% )
1.126548896 seconds time elapsed ( +- 0.09% )
```
四核:
```shell
Performance counter stats for './raytracing' (10 runs):
10,5345 cache-misses # 15.362 % of all cache refs ( +- 3.07% )
68,5756 cache-references ( +- 2.35% )
127,2978,3539 instructions # 1.47 insns per cycle ( +- 0.00% )
86,4925,3375 cycles ( +- 1.49% )
0.592542459 seconds time elapsed ( +- 2.45% )
```
八核:
```shell
Performance counter stats for './raytracing' (10 runs):
11,4074 cache-misses # 14.391 % of all cache refs ( +- 3.90% )
79,2670 cache-references ( +- 1.85% )
127,3064,3855 instructions # 1.21 insns per cycle ( +- 0.00% )
105,0530,6169 cycles ( +- 0.09% )
0.354254347 seconds time elapsed ( +- 0.69% )
```
效能提昇不少
幸好上學期有修過關於平行化的課程
cache-misses 減少了不少
可能是 cpu 能用的空間變多了
但還是沒有依倍數成長
參考資料
- [GNU gprof](https://sourceware.org/binutils/docs/gprof/index.html#Top)
- [使用Gnu gprof进行Linux平台下的程序分析](http://os.51cto.com/art/200703/41426.htm)
- [Inline Functions In C](http://www.greenend.org.uk/rjk/tech/inline.html)
- [GNU MP 6.1.2: Assembly Loop Unrolling](https://gmplib.org/manual/Assembly-Loop-Unrolling.html)
- [行內函式(Inline function)](https://openhome.cc/Gossip/CppGossip/inlineFunction.html)
- [x86 Assembly Guide](http://www.cs.virginia.edu/~evans/cs216/guides/x86.html)
- [多處理器課程](nop)
- [邱靖吉、林文盛、陳楷雯同學的共筆](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp)
- [HackMD Command](https://hackmd.io/s/S1vAcMjFe#)