# 2017q1 Homework1 (raytracing)
contributed by < `changyuanhua` >
### Reviewed by `ryanwang522`
* Git Commit Message 過於簡略,可以在表達的詳細一點,commit 的目的跟過程都可以在詳加描述,參考 [How to wirte git commit message](https://chris.beams.io/posts/git-commit/)。
* 可以嘗試比較 OpenMP 的不同 schedule method有何差異。
* 思考最後比較圖表 Cache-miss 的差異原因?
---
>請修正程式碼縮排,為四個空白
>[name=課程助教][color=red]
>>ok, 謝謝[name=changyuanhua][color=black]
## 開發環境
* 輸入指令 ` lscpu `
```
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
每核心執行緒數:1
每通訊端核心數:4
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 94
Model name: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
製程: 3
CPU MHz: 799.890
CPU max MHz: 3200.0000
CPU min MHz: 800.0000
BogoMIPS: 4608.00
虛擬: VT-x
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 6144K
NUMA node0 CPU(s): 0-3
```
## 預先準備工作
* **安裝準備**
```
$ sudo apt-get update
$ sudo apt-get install graphviz
$ sudo apt-get install imagemagic
```
## 光影追蹤程式
* **取得原始程式碼、編譯和測試**
```
$ git clone https://github.com/sysprog21/raytracing
$ cd raytracing
$ make
$ ./raytracing
```
* **可以看到程式碼的行數**
```
$ cloc *.[ch]
```
* **開啟 .ppm 檔的兩種方式**
```
1.eog out.ppm
2.gimp out.ppm
```
* **透過 convert 這個工具可將 PPM 轉為 PNG 或其他格式**
```
$ convert out.ppm out.png
```
* **圖片顯示**
![](https://i.imgur.com/wrAxKJo.png)
* **檢驗程式碼輸出的圖片是否符合預期,符合的話會得到 “Verified OK!” 字樣**
```
$ make check
```
* **清除執行檔**
```
$ make clean
```
* **gprof 分析程式**
```
$ make PROFILE=1
$ ./raytracing
$ gprof -b raytracing gmon.out | less
```
* [gprof 教學](http://os.51cto.com/art/200703/41426.htm)
* **install gprof2dot**
```
$sudo apt install python python-pip
$sudo pip install --upgrade pip
$sudo pip install gprof2dot
```
* **作圖**
```
gprof ./raytracing | gprof2dot | dot -T png -o output.png
```
* **使用 astyle + Git 實做自動程式碼排版檢查**
```
$ astyle --style=kr --indent=spaces=4 --indent-switches --suffix=none *.[ch]
```
## 未優化
* **時間**
```
Done!
Execution time of raytracing() : 6.158007 sec
```
* **gprof 分析程式 , 指出效能瓶頸**
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
27.64 0.71 0.71 69646433 0.00 0.00 dot_product
28.18.69 1.19 0.48 56956357 0.00 0.00 subtract_vector
7.79 1.39 0.20 31410180 0.00 0.00 multiply_vector
6.62 1.56 0.17 13861875 0.00 0.00 raySphereIntersection
6.62 1.73 0.17 10598450 0.00 0.00 normalize
6.23 1.89 0.16 17836094 0.00 0.00 add_vector
6.23 2.05 0.16 13861875 0.00 0.00 rayRectangularIntersection
5.06 2.18 0.13 4620625 0.00 0.00 ray_hit_object
4.67 2.30 0.12 17821809 0.00 0.00 cross_product
2.73 2.37 0.07 2110576 0.00 0.00 localColor
1.75 2.42 0.05 1048576 0.00 0.00 ray_color
1.56 2.46 0.04 4221152 0.00 0.00 multiply_vectors
1.17 2.49 0.03 3838091 0.00 0.00 length
0.78 2.51 0.02 2110576 0.00 0.00 compute_specular_diffuse
0.78 2.53 0.02 1048576 0.00 0.00 rayConstruction
0.58 2.54 0.02 1241598 0.00 0.00 reflection
0.39 2.55 0.01 1241598 0.00 0.00 refraction
0.39 2.56 0.01 1 0.01 2.56 raytracing
0.00 2.56 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 2.56 0.00 2520791 0.00 0.00 idx_stack_top
0.00 2.56 0.00 1241598 0.00 0.00 protect_color_overflow
0.00 2.56 0.00 1204003 0.00 0.00 idx_stack_push
0.00 2.56 0.00 1048576 0.00 0.00 idx_stack_init
0.00 2.56 0.00 113297 0.00 0.00 fresnel
0.00 2.56 0.00 37595 0.00 0.00 idx_stack_pop
0.00 2.56 0.00 3 0.00 0.00 append_rectangular
0.00 2.56 0.00 3 0.00 0.00 append_sphere
0.00 2.56 0.00 2 0.00 0.00 append_light
0.00 2.56 0.00 1 0.00 0.00 calculateBasisVectors
0.00 2.56 0.00 1 0.00 0.00 delete_light_list
0.00 2.56 0.00 1 0.00 0.00 delete_rectangular_list
0.00 2.56 0.00 1 0.00 0.00 delete_sphere_list
0.00 2.56 0.00 1 0.00 0.00 diff_in_second
0.00 2.56 0.00 1 0.00 0.00 write_to_ppm
```
* **cache miss**
* 清空 ``` echo 1 | sudo tee /proc/sys/vm/drop_caches ```
* 指令 ``` perf stat -r 10 -e branch-misses,cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./raytracing ```
```
Performance counter stats for './raytracing' (10 runs):
20,968,120 branch-misses ( +- 0.33% )
2,014,576 cache-misses # 65.248 % of all cache refs ( +- 18.82% )
3,087,590 cache-references ( +- 14.01% )
3,721,407 L1-dcache-load-misses ( +- 2.23% )
<not supported> L1-dcache-store-misses
<not supported> L1-dcache-prefetch-misses
1,456,429 L1-icache-load-misses ( +- 9.33% )
6.357804873 seconds time elapsed ( +- 0.99% )
```
* **圖**
![](https://i.imgur.com/dUNLO3Z.png)
* 可以發現在 dot_product function 呼叫的次數最高,所以先嘗試以此著手
## 實驗一: loop unrolling
* **[loop unrolling](https://zh.wikipedia.org/wiki/%E5%BE%AA%E7%8E%AF%E5%B1%95%E5%BC%80)**
* **code**
```clike=
static inline
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
dp += v1[0] * v2[0]+ v1[1] * v2[1]+ v1[2] * v2[2];
return dp;
}
```
* **時間**
```
Done!
Execution time of raytracing() : 5.412815 sec
```
* **gprof 分析程式 , 指出效能瓶頸**
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
21.86 0.45 0.45 56956357 0.00 0.00 subtract_vector
22.11.90 0.70 0.25 69646433 0.00 0.00 dot_product
11.41 0.93 0.24 31410180 0.00 0.00 multiply_vector
9.23 1.12 0.19 13861875 0.00 0.00 rayRectangularIntersection
7.77 1.28 0.16 13861875 0.00 0.00 raySphereIntersection
7.29 1.43 0.15 10598450 0.00 0.00 normalize
6.56 1.57 0.14 17836094 0.00 0.00 add_vector
4.86 1.67 0.10 17821809 0.00 0.00 cross_product
4.86 1.77 0.10 4620625 0.00 0.00 ray_hit_object
2.91 1.83 0.06 1 0.06 2.05 raytracing
2.43 1.88 0.05 2110576 0.00 0.00 compute_specular_diffuse
1.70 1.91 0.04 4221152 0.00 0.00 multiply_vectors
1.46 1.94 0.03 2110576 0.00 0.00 localColor
1.46 1.97 0.03 1048576 0.00 0.00 ray_color
1.21 2.00 0.03 3838091 0.00 0.00 length
0.97 2.02 0.02 1241598 0.00 0.00 refraction
0.97 2.04 0.02 1048576 0.00 0.00 rayConstruction
0.73 2.05 0.02 1048576 0.00 0.00 idx_stack_init
0.49 2.06 0.01 1 0.01 0.01 delete_sphere_list
0.00 2.06 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 2.06 0.00 2520791 0.00 0.00 idx_stack_top
0.00 2.06 0.00 1241598 0.00 0.00 protect_color_overflow
0.00 2.06 0.00 1241598 0.00 0.00 reflection
0.00 2.06 0.00 1204003 0.00 0.00 idx_stack_push
0.00 2.06 0.00 113297 0.00 0.00 fresnel
0.00 2.06 0.00 37595 0.00 0.00 idx_stack_pop
0.00 2.06 0.00 3 0.00 0.00 append_rectangular
0.00 2.06 0.00 3 0.00 0.00 append_sphere
0.00 2.06 0.00 2 0.00 0.00 append_light
0.00 2.06 0.00 1 0.00 0.00 calculateBasisVectors
0.00 2.06 0.00 1 0.00 0.00 delete_light_list
0.00 2.06 0.00 1 0.00 0.00 delete_rectangular_list
0.00 2.06 0.00 1 0.00 0.00 diff_in_second
0.00 2.06 0.00 1 0.00 0.00 write_to_ppm
```
* **cache miss**
```
Performance counter stats for './raytracing' (10 runs):
19,438,104 branch-misses ( +- 0.33% )
1,210,765 cache-misses # 42.924 % of all cache refs ( +- 38.56% )
2,820,696 cache-references ( +- 31.18% )
3,801,936 L1-dcache-load-misses ( +- 5.67% )
<not supported> L1-dcache-store-misses
<not supported> L1-dcache-prefetch-misses
1,368,386 L1-icache-load-misses ( +- 16.02% )
5.751152093 seconds time elapsed ( +- 1.83% )
```
* **圖**
![](https://i.imgur.com/B5Dg2ng.png)
* **結果與發現**
當把 dot_product function 經過 loop unrolling 後 , 所花的整體執行時間從 6.158007 秒降至 5.412815 秒 ,而 dot_product function 所花的時間從 0.71 秒降至 0.45 秒
## 實驗二: loop unrolling add_vector()、subtract_vector()、multiply_vectors()、multiply_vector()、dot_product()
* **code**
```clike=
static inline
void add_vector(const double *a, const double *b, double *out)
{
out[0] = a[0] + b[0];
out[1] = a[1] + b[1];
out[2] = a[2] + b[2];
}
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
out[0] = a[0] - b[0];
out[1] = a[1] - b[1];
out[2] = a[2] - b[2];
}
static inline
void multiply_vectors(const double *a, const double *b, double *out)
{
out[0] = a[0] * b[0];
out[1] = a[1] * b[1];
out[2] = a[2] * b[2];
}
static inline
void multiply_vector(const double *a, double b, double *out)
{
out[0] = a[0] * b;
out[1] = a[1] * b;
out[2] = a[2] * b;
}
```
* **時間**
```
Done!
Execution time of raytracing() : 4.924315 sec
```
* **gprof 分析程式 , 指出效能瓶頸**
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
14.50 0.20 0.20 69646433 0.00 0.00 dot_product
13.78 0.39 0.19 56956357 0.00 0.00 subtract_vector
11.60 0.55 0.16 10598450 0.00 0.00 normalize
6.89 0.65 0.10 13861875 0.00 0.00 raySphereIntersection
6.89 0.74 0.10 31410180 0.00 0.00 multiply_vector
6.89 0.84 0.10 17836094 0.00 0.00 add_vector
6.89 0.93 0.10 13861875 0.00 0.00 rayRectangularIntersection
6.53 1.02 0.09 4620625 0.00 0.00 ray_hit_object
5.44 1.10 0.08 17821809 0.00 0.00 cross_product
4.35 1.16 0.06 1048576 0.00 0.00 ray_color
3.63 1.21 0.05 2110576 0.00 0.00 compute_specular_diffuse
3.26 1.25 0.05 4221152 0.00 0.00 multiply_vectors
2.90 1.29 0.04 2110576 0.00 0.00 localColor
2.90 1.33 0.04 1048576 0.00 0.00 rayConstruction
2.90 1.37 0.04 1 0.04 1.38 raytracing
0.73 1.38 0.01 1241598 0.00 0.00 refraction
0.00 1.38 0.00 3838091 0.00 0.00 length
0.00 1.38 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 1.38 0.00 2520791 0.00 0.00 idx_stack_top
0.00 1.38 0.00 1241598 0.00 0.00 protect_color_overflow
0.00 1.38 0.00 1241598 0.00 0.00 reflection
0.00 1.38 0.00 1204003 0.00 0.00 idx_stack_push
0.00 1.38 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.38 0.00 113297 0.00 0.00 fresnel
0.00 1.38 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.38 0.00 3 0.00 0.00 append_rectangular
0.00 1.38 0.00 3 0.00 0.00 append_sphere
0.00 1.38 0.00 2 0.00 0.00 append_light
0.00 1.38 0.00 1 0.00 0.00 calculateBasisVectors
0.00 1.38 0.00 1 0.00 0.00 delete_light_list
0.00 1.38 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.38 0.00 1 0.00 0.00 delete_sphere_list
0.00 1.38 0.00 1 0.00 0.00 diff_in_second
0.00 1.38 0.00 1 0.00 0.00 write_to_ppm
```
* **cache miss**
```
Performance counter stats for './raytracing' (10 runs):
16,816,057 branch-misses ( +- 0.10% )
1,189,917 cache-misses # 31.525 % of all cache refs ( +- 23.36% )
3,774,516 cache-references ( +- 14.38% )
4,019,687 L1-dcache-load-misses ( +- 3.71% )
<not supported> L1-dcache-store-misses
<not supported> L1-dcache-prefetch-misses
1,507,763 L1-icache-load-misses ( +- 8.15% )
5.034827277 seconds time elapsed ( +- 1.28% )
```
* **圖**
![](https://i.imgur.com/9fimoDX.png)
* **結果與發現**
當把 add_vector()、subtract_vector()、multiply_vectors()、multiply_vector()、dot_product() 能不用for迴圈的都展開來 , 所花的整體執行時間從 5.412815 秒降至 4.924315 秒
## 實驗三: force nline + loop unrolling
* **code**
* 把 math-toolkit.h 檔的所有 static inline 改成下列
```cstyles=
static __attribute__((always_inline))
```
* **時間**
```
Done!
Execution time of raytracing() : 3.206461 sec
```
* **gprof 分析程式 , 指出效能瓶頸**
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
24.19 0.37 0.37 13861875 0.00 0.00 rayRectangularIntersection
15.04 0.60 0.23 13861875 0.00 0.00 raySphereIntersection
12.42 0.79 0.19 69646433 0.00 0.00 dot_product
11.77 0.97 0.18 2110576 0.00 0.00 compute_specular_diffuse
8.50 1.10 0.13 2110576 0.00 0.00 localColor
8.50 1.23 0.13 1048576 0.00 0.00 ray_color
5.88 1.32 0.09 4620625 0.00 0.00 ray_hit_object
3.92 1.38 0.06 1048576 0.00 0.00 rayConstruction
1.96 1.41 0.03 2520791 0.00 0.00 idx_stack_top
1.96 1.44 0.03 113297 0.00 0.00 fresnel
1.31 1.46 0.02 1241598 0.00 0.00 protect_color_overflow
1.31 1.48 0.02 1241598 0.00 0.00 reflection
1.31 1.50 0.02 1048576 0.00 0.00 idx_stack_init
1.31 1.52 0.02 1 0.02 1.52 raytracing
0.65 1.53 0.01 1 0.01 0.01 delete_sphere_list
0.00 1.53 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 1.53 0.00 1241598 0.00 0.00 refraction
0.00 1.53 0.00 1204003 0.00 0.00 idx_stack_push
0.00 1.53 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.53 0.00 3 0.00 0.00 append_rectangular
0.00 1.53 0.00 3 0.00 0.00 append_sphere
0.00 1.53 0.00 2 0.00 0.00 append_light
0.00 1.53 0.00 1 0.00 0.00 calculateBasisVectors
0.00 1.53 0.00 1 0.00 0.00 delete_light_list
0.00 1.53 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.53 0.00 1 0.00 0.00 diff_in_second
0.00 1.53 0.00 1 0.00 0.00 write_to_ppm
```
* **cache miss**
```
Performance counter stats for './raytracing' (10 runs):
3,691,085 branch-misses ( +- 0.30% )
1,926,179 cache-misses # 58.611 % of all cache refs ( +- 14.85% )
3,286,392 cache-references ( +- 11.63% )
3,047,752 L1-dcache-load-misses ( +- 3.31% )
<not supported> L1-dcache-store-misses
<not supported> L1-dcache-prefetch-misses
2,445,715 L1-icache-load-misses ( +- 9.69% )
2.928272691 seconds time elapsed ( +- 1.24% )
```
* **圖**
![](https://i.imgur.com/a4samjG.png)
* **結果與發現**
當把 math-toolkit.h 檔所有 static inline 都改成 static __attribute__((always_inline)), 所花的整體執行時間從 4.924315 秒降至 3.206461 秒
## 實驗四:OPENMP
* [openmp](https://drive.google.com/file/d/0B4WZ6ihm_C7zYkdvNUJyYlJFRG8/view)
* **code**
* Makefile
```cstyles=
CFLAGS = \
-std=gnu99 -Wall -O0 -g -fopenmp
LDFLAGS = \
-lm -lgomp
```
* raytracing.c
```cstyles=
int factor = sqrt(SAMPLES);
#pragma omp parallel for num_threads(64) private(d),private(object_color),private(stk)
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; i++) {
double r = 0, g = 0, b = 0;
}
```
* **時間**
```
Done!
Execution time of raytracing() : 1.635157 sec
```
* **gprof 分析程式 , 指出效能瓶頸**
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
27.79 0.05 0.05 208968 0.00 0.00 rayRectangularIntersection
27.79 0.10 0.05 32376 0.00 0.00 compute_specular_diffuse
16.67 0.13 0.03 34239 0.00 0.00 localColor
11.11 0.15 0.02 65640 0.00 0.00 ray_hit_object
5.56 0.16 0.01 41770 0.00 0.00 idx_stack_top
5.56 0.17 0.01 13934 0.00 0.01 ray_color
2.78 0.18 0.01 23951 0.00 0.00 idx_stack_push
2.78 0.18 0.01 14600 0.00 0.00 idx_stack_init
0.00 0.18 0.00 204869 0.00 0.00 raySphereIntersection
0.00 0.18 0.00 54181 0.00 0.00 idx_stack_empty
0.00 0.18 0.00 21517 0.00 0.00 refraction
0.00 0.18 0.00 19371 0.00 0.00 protect_color_overflow
0.00 0.18 0.00 19012 0.00 0.00 reflection
0.00 0.18 0.00 16662 0.00 0.00 rayConstruction
0.00 0.18 0.00 1593 0.00 0.00 fresnel
0.00 0.18 0.00 681 0.00 0.00 idx_stack_pop
0.00 0.18 0.00 3 0.00 0.00 append_rectangular
0.00 0.18 0.00 3 0.00 0.00 append_sphere
0.00 0.18 0.00 2 0.00 0.00 append_light
0.00 0.18 0.00 1 0.00 0.00 calculateBasisVectors
0.00 0.18 0.00 1 0.00 0.00 delete_light_list
0.00 0.18 0.00 1 0.00 0.00 delete_rectangular_list
0.00 0.18 0.00 1 0.00 0.00 delete_sphere_list
0.00 0.18 0.00 1 0.00 0.00 diff_in_second
0.00 0.18 0.00 1 0.00 180.06 raytracing
0.00 0.18 0.00 1 0.00 0.00 write_to_ppm
```
* **cache miss**
```
Performance counter stats for './raytracing' (10 runs):
4,282,936 branch-misses ( +- 1.31% )
729,409 cache-misses # 0.173 % of all cache refs ( +- 5.46% )
421,057,183 cache-references ( +- 0.23% )
62,879,634 L1-dcache-load-misses ( +- 0.31% )
<not supported> L1-dcache-store-misses
<not supported> L1-dcache-prefetch-misses
4,155,291 L1-icache-load-misses ( +- 2.28% )
1.644746532 seconds time elapsed ( +- 0.11% )
* **圖**
![](https://i.imgur.com/kdXFAUw.png)
* **結果與發現**
當使用openmp, 所花的整體執行時間從 3.206461 秒降至 1.635157 秒
## 比較圖
* **時間**
![](https://i.imgur.com/P6BiOqy.png)
* **cache miss**
![](https://i.imgur.com/mYMYC9P.png)
## 參考
* [hugikun999](https://hackmd.io/s/HyHhgcv6#)
* [LanKuDot](https://hackmd.io/s/rkggUeKa#)