# 2017q1 Homework1 (raytracing)
###### tags: `heathcliffYang`
contributed by <`heathcliffYang`>
### Reviewd by `yachiyang01`
- 因為沒有特別說明,不太清楚loop unrolling是用在所有function嗎?每一個function的效果會不一,可以試著分析看看。
- 選擇使用這些優化方法的原因是?
### Reviewed by `etc276`
- 善用 HackMD 中的 # 符號,可以讓讀者更明瞭整篇共筆的架構與順序,例如各種嘗試優化的方法就可以使用 `###`,這樣讀者就可以在看到「進行優化與比較」時一目了然。
- 最後兩次的 commit message 看起來相同,如有修改 commit message 之類的需求可參考[這篇]("https://blog.yorkxin.org/2011/07/29/git-rebase")
## 分析原本程式的效能
- Call graph
`gprof ./raytracing | gprof2dot | dot -T png -o output.png`
> 指令參考 [hugikun999](https://hackmd.io/MwBgnARsCGAcEFoBsBGAZogLEgJkhYsATLAgMYRKxgDsYYRAprAKxA==#) 的共筆,將 gprof 輸出的資訊轉成圖片表示
![](https://i.imgur.com/g5MWCT9.png)
可以發現,raytracing 要建立光影的模型會呼叫到幾種數學運算
- 程式熱點
```
% cumulative self self total
time seconds seconds calls s/call s/call name
22.49 0.49 0.49 69646433 0.00 0.00 dot_product
16.98 0.86 0.37 56956357 0.00 0.00 subtract_vector
10.79 1.10 0.24 13861875 0.00 0.00 rayRectangularIntersection
8.49 1.28 0.19 31410180 0.00 0.00 multiply_vector
7.34 1.44 0.16 17836094 0.00 0.00 add_vector
7.34 1.60 0.16 10598450 0.00 0.00 normalize
5.97 1.73 0.13 4620625 0.00 0.00 ray_hit_object
4.82 1.84 0.11 13861875 0.00 0.00 raySphereIntersection
4.13 1.93 0.09 17821809 0.00 0.00 cross_product
2.29 1.98 0.05 1048576 0.00 0.00 ray_color
1.84 2.02 0.04 2110576 0.00 0.00 compute_specular_diffuse
1.61 2.05 0.04 4221152 0.00 0.00 multiply_vectors
1.38 2.08 0.03 1241598 0.00 0.00 refraction
0.92 2.10 0.02 3838091 0.00 0.00 length
```
在[官方 Manual](https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_chapter/gprof_5.html) 提到:
1. cumulative seconds
This is the cumulative total number of seconds the computer spent executing this functions, plus the time spent in all the functions above this one in this table.
2. self seconds
This is the number of seconds accounted for by this function alone. The flat profile listing is sorted first by this number.
主要花時間的數學運算可以從 self seconds 看出。
- Cache-misses & branch-misses
```
Performance counter stats for './raytracing' (100 runs):
703,209 cache-misses # 31.249 % of all cache refs ( +- 4.67% )
2,250,333 cache-references ( +- 5.17% )
7,446,324 branch-misses ( +- 0.59% )
4.132374478 seconds time elapsed ( +- 2.20% )
```
## 進行優化與比較
- 預計使用的方法有 : POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling
1. Loop unrolling
- Cache-misses & branch-misses
```
Performance counter stats for './raytracing' (100 runs):
323,532 cache-misses # 23.108 % of all cache refs ( +- 5.11% )
1,400,109 cache-references ( +- 5.30% )
5,727,748 branch-misses ( +- 0.23% )
2.586578907 seconds time elapsed ( +- 1.32% )
```
- 程式熱點
```
% cumulative self self total
time seconds seconds calls s/call s/call name
24.86 0.41 0.41 69646433 0.00 0.00 dot_product
11.22 0.60 0.19 13861875 0.00 0.00 rayRectangularIntersection
11.22 0.78 0.19 31410180 0.00 0.00 multiply_vector
10.92 0.96 0.18 56956357 0.00 0.00 subtract_vector
9.70 1.12 0.16 10598450 0.00 0.00 normalize
7.58 1.25 0.13 17821809 0.00 0.00 cross_product
7.28 1.37 0.12 4620625 0.00 0.00 ray_hit_object
5.76 1.46 0.10 13861875 0.00 0.00 raySphereIntersection
2.12 1.50 0.04 3838091 0.00 0.00 length
1.82 1.53 0.03 2110576 0.00 0.00 localColor
1.52 1.55 0.03 17836094 0.00 0.00 add_vector
1.21 1.57 0.02 4221152 0.00 0.00 multiply_vectors
1.21 1.59 0.02 1048576 0.00 0.00 rayConstruction
1.21 1.61 0.02 1048576 0.00 0.00 ray_color
```
2. POSIX thread
因為 `typedef double point3[3];` -> point3 是傳 address
3. Software pipeline
![](https://i.imgur.com/Ohx45Bj.png)
這個 loop 總共有六圈,每一圈被分為 4 個 stage (也就是 4 個 block),讓他們前後 overlap --> 可同時執行不同階段的工作,增加 throughput
- 將 raytracing() 裡的第 3 層 loop 劃分 stage : ==注意 dependent 參數為 d, stk, object_color==
- Stage 1.
idx_stack_init() + rayConstruction() -> d & stk
- Stage 2.
ray_color() -> d, stk & object_color
- Stage 3.
rgb operations -> object_color & rgb
- Stage 4.
pixels -> rgb
```
Performance counter stats for './raytracing' (100 runs):
95,023 cache-misses # 26.193 % of all cache refs ( +- 8.58% )
362,784 cache-references ( +- 13.95% )
5,151,867 branch-misses ( +- 0.26% )
2.337996284 seconds time elapsed ( +- 0.89% )
```
總體所需的要用到 cache 次數大幅下降
> 圖片源自 [Introduction to Software Pipelining in the IA-64 Architecture Architecture](https://www.ele.uva.es/~jesman/BigSeti/ftp/Microprocesadores/Intel/IA-64/Presentaciones/ia-64_notes.pdf)
4. OpenMP + loopunroll + swp
- 2 threads
```
Performance counter stats for './raytracing' (100 runs):
50,139 cache-misses # 24.425 % of all cache refs ( +- 1.03% )
205,274 cache-references ( +- 2.93% )
5,184,165 branch-misses ( +- 0.83%)
1.320096975 seconds time elapsed ( +- 1.69% )
```
- 4
```
Performance counter stats for './raytracing' (100 runs):
61,863 cache-misses # 18.482 % of all cache refs ( +- 2.03% )
334,723 cache-references ( +- 2.57% )
5,663,438 branch-misses ( +- 0.32% )
1.497532812 seconds time elapsed ( +- 2.81% )
```
- 8
```
Performance counter stats for './raytracing' (100 runs):
62,719 cache-misses # 19.101 % of all cache refs ( +- 1.46% )
328,354 cache-references ( +- 1.91% )
5,877,346 branch-misses ( +- 0.07% )
1.197949277 seconds time elapsed ( +- 2.10% )
```
待參照資料 :
[Compiler Optimizations for Modern VLIW/EPIC Architectures
](https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=11&cad=rja&uact=8&ved=0ahUKEwiQkq6B1bbSAhVCEpQKHZQwAfwQFghkMAo&url=https%3A%2F%2Fcs.nyu.edu%2Fcourses%2Fspring02%2FG22.3130-001%2Fvliw_epic_4_2002.ppt&usg=AFQjCNEyaRs-z2fjHNZwq6QKE8qQRp8-hA&sig2=m7EO1C1GHEWZmNEE4kdZOQ)
- [My computer's specifications](http://ark.intel.com/products/65708)
- [](http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/pthreads.html#BASICS)