# 2016q3 Homework1 (raytracing)
contributed by <`nekoneko`>
###### tags: `sys2016` `nekoneko` `homework`
### Reviewed by `ChenYi`
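* Confusing git commit messages: the last commit is titled "SIMD dot_product solved-1", while the code it contains is actually the "done" version.
* The `-1` and `-2` suffixes also leave the reader unsure what they mean.
* Consider analyzing the cache-misses after the struct change, and using charts to show the speed differences.
* When writing the SIMD code, avoid variable names that begin with an underscore, so they do not run into [Naming convention](https://en.wikipedia.org/wiki/Naming_convention_(programming)) problems.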
## gprof
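### Usage
- Compiling with gcc's `-pg` option inserts a call to the `mcount` function into every function.
- Profiling must be enabled at both compile time and link time (CFLAGS=-pg, LDFLAGS=-pg).
- `-pg` can also be given when compiling and linking in a single step (gcc -o):
`cc -o myprog myprog.c utils.c -g -pg`
- When invoking `ld` (the linker) directly, you must specify the profiling startup file gcrt0.o, link against libc_p.a instead of the regular C library, and change the link option to `-lc_p`:
`ld -o myprog /lib/gcrt0.o myprog.o utils.o -lc_p`
- The manual warns that you may `run into problems with the profiling support code in a shared library being called before that library has been fully initialised.` The fix is to link statically against a library that contains the profile support code:
`gcc -g -pg -static-libgcc myprog.c utils.c -o myprog`
> I do not understand why the documentation says this; noting it down so I can look it up later. [name=cheng hung lin]
- `-pg` can be added only to the modules you want to profile.
- Running the program produces gmon.out. For gmon.out to be written, the program must exit normally: *`returning from main or calling exit`*
- gmon.out is written to the program's current working directory at the time it exits.
- **bb.out**: *`Unfortunately, the appearance of a human-readable bb.out means the basic-block counts didn't get written into gmon.out.`*
- default executable file: a.out, default profile data file: gmon.out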
### Options
- `-Q`, `--no-graph`: do not print the call graph
- `-q`, `--graph`: print the call graph
- `-p`, `--flat-profile`: print the flat profile
- `-A`, `--annotated-source`: print annotated source code ([Ref](https://books.google.com.tw/books?id=wQ6r3UTivJgC&pg=PA147&lpg=PA147&dq=gprof:+could+not+locate&source=bl&ots=ELZoMp4BDt&sig=dDSi4XvTbx6NlgQ40yiqqlHdkew&hl=zh-TW&sa=X&ved=0ahUKEwiA2qKpz7HPAhUBNY8KHbvcC1kQ6AEIJDAB#v=onepage&q=gprof%3A%20could%20not%20locate&f=false))
### flat profile
- Time spent in each function
- Number of times each function was called
### call graph
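Shows the call relationships of every function: which functions it calls and which functions call it.
### What it can measure
- Where the program spent its time
- Which functions call which
- Code that runs slower than you expected
- How many times each function was called
- Bugs that had gone unnoticed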
## graphviz
## gprof2dot
Following [Chen Yi's shared notes](https://hackmd.io/OwYwRmDMCskKYFoCGA2YAzBAWaSzIA4BOABgQEYSUT11oUATAgJnKA==?view#2016q3-homework-1-raytracing), I used [gprof2dot](https://github.com/jrfonseca/gprof2dot).
```txt
$ sudo apt-get install python3-pip
$ pip3 install gprof2dot
```
> Neither the install script nor pip gave me a working install [name=cheng hung lin]
> `Exception: using gprof2dot.py as a module is unsupported`
## raytracing
- Build and run
```txt
$ make clean
$ make PROFILE=1
$ ./raytracing
$ gprof raytracing | gprof2dot | dot -Tpng -o output.png
```
The reason for setting PROFILE can be seen in lines 11 to 15 of the Makefile.
- Results
- Output
- 11.650036 sec
```txt
# Rendering scene
Done!
Execution time of raytracing() : 11.650036 sec
```
- flat profile
```txt
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
23.26 1.46 1.46 69646433 0.00 0.00 dot_product
18.32 2.61 1.15 56956357 0.00 0.00 subtract_vector
9.40 3.20 0.59 31410180 0.00 0.00 multiply_vector
8.29 3.72 0.52 17836094 0.00 0.00 add_vector
7.01 4.16 0.44 13861875 0.00 0.00 rayRectangularIntersection
5.90 4.53 0.37 17821809 0.00 0.00 cross_product
5.74 4.89 0.36 13861875 0.00 0.00 raySphereIntersection
5.74 5.25 0.36 10598450 0.00 0.00 normalize
3.98 5.50 0.25 4620625 0.00 0.00 ray_hit_object
1.75 5.61 0.11 4221152 0.00 0.00 multiply_vectors
1.67 5.72 0.11 2110576 0.00 0.00 compute_specular_diffuse
1.67 5.82 0.11 2110576 0.00 0.00 localColor
1.12 5.89 0.07 1048576 0.00 0.00 ray_color
1.12 5.96 0.07 1 0.07 6.26 raytracing
0.96 6.02 0.06 2520791 0.00 0.00 idx_stack_top
0.96 6.08 0.06 3838091 0.00 0.00 length
0.96 6.14 0.06 1048576 0.00 0.00 rayConstruction
0.64 6.18 0.04 1241598 0.00 0.00 refraction
0.40 6.21 0.03 2558386 0.00 0.00 idx_stack_empty
0.40 6.23 0.03 1204003 0.00 0.00 idx_stack_push
0.32 6.25 0.02 1 0.02 0.02 delete_sphere_list
0.24 6.27 0.02 1241598 0.00 0.00 reflection
0.16 6.28 0.01 1241598 0.00 0.00 protect_color_overflow
0.00 6.28 0.00 1048576 0.00 0.00 idx_stack_init
0.00 6.28 0.00 113297 0.00 0.00 fresnel
0.00 6.28 0.00 37595 0.00 0.00 idx_stack_pop
0.00 6.28 0.00 3 0.00 0.00 append_rectangular
0.00 6.28 0.00 3 0.00 0.00 append_sphere
0.00 6.28 0.00 2 0.00 0.00 append_light
0.00 6.28 0.00 1 0.00 0.00 calculateBasisVectors
0.00 6.28 0.00 1 0.00 0.00 delete_light_list
0.00 6.28 0.00 1 0.00 0.00 delete_rectangular_list
0.00 6.28 0.00 1 0.00 0.00 diff_in_second
0.00 6.28 0.00 1 0.00 0.00 write_to_ppm
```
- branches and branch-misses
```txt
4,785,429,867 branches
26,321,418 branch-misses # 0.55% of all branches
12.091801958 seconds time elapsed
```
- gprof2dot

```txt
+------------------------------+
| function name |
| total time % ( self time % ) |
| total calls |
+------------------------------+
```
### Analysis
- Functions declared in math-toolkit.h
- normalize
- length
- add_vector
- subtract_vector
- multiply_vectors
- multiply_vector
- cross_product
- dot_product
- scalar_triple_product
- scalar_triple
### Optimization
#### Loop Unrolling
- Functions that contain a loop
- dot_product 1.46s
- subtract_vector ==1.15s==
- multiply_vector ==0.59s==
- add_vector 0.52s
- multiply_vectors ==0.11s==
- scalar_triple_product -> cross_product, multiply_vectors
- scalar_triple -> cross_product, dot_product
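- Observations
1. Neither scalar_triple_product nor scalar_triple appears in the flat profile.
2. subtract_vector (1.15 s) > multiply_vector (0.59 s) > multiply_vectors (0.11 s)
- Multiplication actually comes out faster than subtraction (reason unknown).
- multiply_vector > multiply_vectors (reason unknown).
- These are total self times, and the call counts differ (56,956,357 vs 31,410,180 vs 4,221,152), so the ordering largely follows how often each function is called: roughly 20 ns, 19 ns, and 26 ns per call.
> Annotating with `$ perf record -F 12500 -e cycles ./raytracing && perf report` would explain it, but I cannot read x86_64 assembly >< [name=cheng hung lin]
- After switching to loop unrolling: 9.119676 s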
```txt
# Rendering scene
Done!
Execution time of raytracing() : 9.119676 sec
```
Besides unrolling the loop, dot_product was also collapsed from three lines into one (borrowed from someone else's shared notes, though I have forgotten whose QQ).
```txt
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
15.52 0.56 0.56 56956357 0.00 0.00 subtract_vector
13.17 1.04 0.48 13861875 0.00 0.00 rayRectangularIntersection
12.89 1.50 0.47 69646433 0.00 0.00 dot_product
12.47 1.95 0.45 10598450 0.00 0.00 normalize
7.48 2.22 0.27 17821809 0.00 0.00 cross_product
7.35 2.49 0.27 13861875 0.00 0.00 raySphereIntersection
6.24 2.71 0.23 31410180 0.00 0.00 multiply_vector
6.10 2.93 0.22 4620625 0.00 0.00 ray_hit_object
4.43 3.09 0.16 17836094 0.00 0.00 add_vector
2.49 3.18 0.09 2110576 0.00 0.00 compute_specular_diffuse
2.49 3.27 0.09 1048576 0.00 0.00 ray_color
1.94 3.34 0.07 4221152 0.00 0.00 multiply_vectors
1.66 3.40 0.06 2520791 0.00 0.00 idx_stack_top
1.39 3.45 0.05 3838091 0.00 0.00 length
1.11 3.49 0.04 2110576 0.00 0.00 localColor
1.11 3.53 0.04 1 0.04 3.61 raytracing
0.55 3.55 0.02 1241598 0.00 0.00 refraction
0.55 3.57 0.02 1204003 0.00 0.00 idx_stack_push
0.55 3.59 0.02 1048576 0.00 0.00 rayConstruction
0.28 3.60 0.01 1241598 0.00 0.00 reflection
0.28 3.61 0.01 1048576 0.00 0.00 idx_stack_init
0.00 3.61 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 3.61 0.00 1241598 0.00 0.00 protect_color_overflow
0.00 3.61 0.00 113297 0.00 0.00 fresnel
0.00 3.61 0.00 37595 0.00 0.00 idx_stack_pop
0.00 3.61 0.00 3 0.00 0.00 append_rectangular
0.00 3.61 0.00 3 0.00 0.00 append_sphere
0.00 3.61 0.00 2 0.00 0.00 append_light
0.00 3.61 0.00 1 0.00 0.00 calculateBasisVectors
0.00 3.61 0.00 1 0.00 0.00 delete_light_list
0.00 3.61 0.00 1 0.00 0.00 delete_rectangular_list
0.00 3.61 0.00 1 0.00 0.00 delete_sphere_list
0.00 3.61 0.00 1 0.00 0.00 diff_in_second
0.00 3.61 0.00 1 0.00 0.00 write_to_ppm
```
- branches and branch-misses
- branches: reduced by 909,897,692 (4,785,429,867 → 3,875,532,175)
```txt
Performance counter stats for './raytracing':
3,875,532,175 branches
26,090,737 branch-misses # 0.67% of all branches
9.382581189 seconds time elapsed
```
- Function self times
- dot_product 0.47s
- subtract_vector 0.56s
- multiply_vector 0.23s
- add_vector 0.16s
- multiply_vectors 0.07s
#### force inline
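- `__attribute__`: declares attributes that guide how the compiler optimizes the code.
- Usage: ~~`void foo () __attribute__((always_inline));` ([Ref](http://stackoverflow.com/questions/13228326/force-inline-function-in-other-translation-unit))~~ (not yet confirmed; I could not follow the original documentation)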
```clike=
static inline __attribute__((always_inline))
```
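- Small experiment
Motivation: once a function is inlined, gprof no longer shows it as a function call, which makes the functions in math-toolkit.h harder to analyze. So, from another angle: can we measure how many instructions inlining saves?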
```txt
$ perf stat -r 1 -e instructions ./raytracing  # total instructions
$ gprof -b raytracing                          # function call counts
```
- Using normalize as an example
| | instructions | function calls | time (s) |
|:-:| - | - | - |
| non-inline | 25,758,523,927 | 10598450 | 9.157607 |
| inline | 25,121,100,243 | - | 8.967475 |
The inline build records no call count for normalize, so the non-inline count is reused.
- `(25,758,523,927 - 25,121,100,243)/10,598,450 = 60.14 inst per normal func call`
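- Open questions:
- I do not know what the 60.14 figure actually represents.
- I do not understand the real difference between inlining and a function call.
- I do not know whether gprof itself affects the function calls.
- All functions in math-toolkit.h changed to inline functions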
```txt
# Rendering scene
Done!
Execution time of raytracing() : 4.762099 sec
```
#### SIMD
- CPU model: Intel i3 330M
- Supports up to SSE4.1/4.2
- Following the [shared notes](https://embedded2016.hackpad.com/ep/pad/static/wOu40KzMaIP) and the [SPEC](https://software.intel.com/en-us/node/524253)
- The compiler aligns variables of type __m128d and __m128i, whether local or global, to 16-byte boundaries on the stack. This can be changed with \__attribute__: `__attribute__((aligned(n[, offset])))`
> I did not know where to put it, so in the end I left it out
```clike=
__m128d _v1;
__m128d _v2;
double _v_tmp[3];
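_v1 = _mm_loadu_pd(v1); /* unaligned load of v1[0], v1[1] */
_v2 = _mm_loadu_pd(v2); /* unaligned load of v2[0], v2[1] */
_v1 = _mm_mul_pd(_v1, _v2); /* two products in one instruction */
_mm_store_pd(_v_tmp, _v1); /* aligned store: assumes _v_tmp is 16-byte aligned */
return _v_tmp[0] + _v_tmp[1] + v1[2]*v2[2]; /* third component in scalar */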
```
- Result
```txt
# Rendering scene
Done!
Execution time of raytracing() : 9.730340 sec
```
```txt
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
28.75 1.18 1.18 69646433 0.00 0.00 dot_product
11.25 1.64 0.46 13861875 0.00 0.00 rayRectangularIntersection
10.52 2.07 0.43 10598450 0.00 0.00 normalize
9.54 2.46 0.39 13861875 0.00 0.00 raySphereIntersection
8.68 2.81 0.36 56956357 0.00 0.00 subtract_vector
```
- Trying to add aligned
```clike=
__m128d _v1 __attribute__((aligned(64)));
__m128d _v2 __attribute__((aligned(64)));
double _v_tmp[3] __attribute__((aligned(64)));
```
The result is a segmentation fault. Using gdb's bt command:
```txt
#0 0x00000000004015c2 in _mm_load_pd (__P=0x7fffffffd9a8) at /usr/lib/gcc/x86_64-linux-gnu/5/include/emmintrin.h:119
#1 dot_product (v1=0x7fffffffd9a8, v2=0x7fffffffdd00) at math-toolkit.h:72
#2 0x0000000000401ee0 in rayRectangularIntersection (ray_e=0x7fffffffd940, ray_d=0x7fffffffdd00, rec=0x60bbf0, ip=0x7fffffffd990,
t1=0x7fffffffd8f8) at raytracing.c:121
#3 0x00000000004028a5 in ray_hit_object (e=0x4043c0 <view>, d=0x7fffffffdd00, t0=0, t1=59.948704218900112, rectangulars=0x60bbf0,
hit_rectangular=0x7fffffffda78, spheres=0x60ba70, hit_sphere=0x7fffffffda88) at raytracing.c:264
#4 0x0000000000402f94 in ray_color (e=0x4043c0 <view>, t=0, d=0x7fffffffdd00, stk=0x7fffffffdd40, rectangulars=0x60bbf0,
spheres=0x60ba70, lights=0x60b9d0, object_color=0x7fffffffdd20, bounces_left=3) at raytracing.c:367
#5 0x0000000000403835 in raytracing (pixels=0x7ffff7f1b010 "", background_color=0x7fffffffdec0, rectangulars=0x60bbf0,
spheres=0x60ba70, lights=0x60b9d0, view=0x4043c0 <view>, width=512, height=512) at raytracing.c:481
#6 0x0000000000403caa in main () at main.c:52
```
math-toolkit.h:72
```clike=
_v1 = _mm_load_pd(v1);
```
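Could the problem be the arguments passed in as `const double *v1, const double *v2`? In frame #1, v1=0x7fffffffd9a8 is not a multiple of 16, which `_mm_load_pd` requires.
- Copy the inputs into aligned arrays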
```clike=
double _v1_tmp[3] __attribute__((aligned(64)));
double _v2_tmp[3] __attribute__((aligned(64)));
memcpy(_v1_tmp, v1, sizeof(double)*3 );
memcpy(_v2_tmp, v2, sizeof(double)*3 );
```
- Result: it runs to completion, but takes even longer
```txt
# Rendering scene
Done!
Execution time of raytracing() : 10.993178 sec
```
```txt
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
40.74 2.11 2.11 69646433 0.00 0.00 dot_product
10.45 2.65 0.54 13861875 0.00 0.00 rayRectangularIntersection
6.77 3.00 0.35 13861875 0.00 0.00 raySphereIntersection
6.68 3.34 0.35 56956357 0.00 0.00 subtract_vector
```
- Another approach: add aligned to the declarations of the variables that are passed into dot_product
```txt
$ grep -n 'dot_product' *
```
In the end I directly tried modifying `primitives.h` and `math-toolkit.h`
```clike=
typedef double point3[3] __attribute__((aligned(64))); /* <-- aligned added */
...
typedef struct {
point4 vertices[4]; /* <-- point3 changed to point4 */
point3 normal;
object_fill rectangular_fill;
} rectangular;
```
```clike=
__m128d _v1;
__m128d _v2;
double _v_tmp[3];
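_v1 = _mm_load_pd(v1); /* aligned load: v1 must be 16-byte aligned */
_v2 = _mm_load_pd(v2); /* aligned load: v2 must be 16-byte aligned */
_v1 = _mm_mul_pd(_v1, _v2); /* two products in one instruction */
_mm_store_pd(_v_tmp, _v1); /* aligned store: assumes _v_tmp is 16-byte aligned */
return _v_tmp[0] + _v_tmp[1] + v1[2]*v2[2]; /* third component in scalar */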
```
Result:
```txt
Execution time of raytracing() : 9.984913 sec
```
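Yes, ==overall== it is slower than the \_mm_loadu_pd version!
~~But the time spent in dot_product did go down~~ After several more runs, the times of both the \_mm_loadu_pd and the aligned versions ==fluctuate==.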
```
jacklin@jacklin-CX420-CX420-MX:[~/Documents/sys2016/week1/raytracing] (master) 0h24m $ gprof -p -b raytracing
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
22.12 1.09 1.09 69646433 0.00 0.00 dot_product
12.79 1.72 0.63 56956357 0.00 0.00 subtract_vector
12.08 2.32 0.60 13861875 0.00 0.00 rayRectangularIntersection
9.03 2.76 0.45 10598450 0.00 0.00 normalize
8.02 3.16 0.40 13861875 0.00 0.00 raySphereIntersection
7.10 3.51 0.35 31410180 0.00 0.00 multiply_vector
6.90 3.85 0.34 17821809 0.00 0.00 cross_product
6.29 4.16 0.31 4620625 0.00 0.00 ray_hit_object
3.45 4.33 0.17 1048576 0.00 0.00 ray_color
3.04 4.48 0.15 17836094 0.00 0.00 add_vector
2.03 4.58 0.10 2110576 0.00 0.00 compute_specular_diffuse
1.62 4.66 0.08 4221152 0.00 0.00 multiply_vectors
1.42 4.73 0.07 1048576 0.00 0.00 rayConstruction
0.81 4.77 0.04 2110576 0.00 0.00 localColor
0.81 4.81 0.04 1 0.04 4.93 raytracing
0.61 4.84 0.03 1241598 0.00 0.00 reflection
0.61 4.87 0.03 1241598 0.00 0.00 refraction
0.41 4.89 0.02 3838091 0.00 0.00 length
0.41 4.91 0.02 1204003 0.00 0.00 idx_stack_push
0.20 4.92 0.01 2520791 0.00 0.00 idx_stack_top
0.20 4.93 0.01 1241598 0.00 0.00 protect_color_overflow
0.10 4.93 0.01 1 0.01 0.01 delete_sphere_list
```
==Also, looking back at this code, ~~the change seems wrong~~ I did not understand why the original failed to compile, so I changed it to==
```clike
point3 vertices[4] --> point4 vertices[4];
```
The original fails to compile:
```
In file included from objects.c:4:0:
primitives.h:32:5: error: alignment of array elements is greater than element size
point3 vertices[4];
^
Makefile:24: recipe for target 'objects.o' failed
```
Looking at `models.inc` shows that the original point3 vertices[4] is meant to be a 3 x 4 2D array (four vertices of three components each), so the way I changed rectangular was wrong.
- References: [Attribute Syntax](https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Attribute-Syntax.html#Attribute-Syntax), [Ref](https://gcc.gnu.org/ml/gcc-help/2007-01/msg00051.html)
## References
- [gnu gprof](https://sourceware.org/binutils/docs/gprof/)
- [Chen Yi's shared notes](https://hackmd.io/OwYwRmDMCskKYFoCGA2YAzBAWaSzIA4BOABgQEYSUT11oUATAgJnKA==?view#2016q3-homework-1-raytracing)
- [Enhance raytracing program](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp)
- [Course walkthrough video](https://www.youtube.com/watch?v=m1RmfOfSwno)
- [ ] [Attribute Syntax](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Attribute-Syntax.html#Attribute-Syntax)
- [吳彥寬's shared notes](https://embedded2016.hackpad.com/ep/pad/static/wOu40KzMaIP) (SIMD part)
- [Intel® Core™ i3-330M Processor Spec](http://ark.intel.com/products/47663/Intel-Core-i3-330M-Processor-3M-Cache-2_13-GHz)
- [SSE part1 ~ part7](https://www.csie.ntu.edu.tw/~r89004/hive/sse/page_1.html) (article from 2001)