# 2017q1 Homework1 (raytracing)
###### tags: `PeterTing`
contributed by <`PerterTing`>
### Reviewed by `jack81306`
* 優化後的方法也可以加入 call graph 來觀察
* 可以加入給 compile 優化後的結果
* 可以解釋一下 force inline 會降低時間的理由
* 可以加入 simd 來優化
## 開發環境
```shell
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
每核心執行緒數:2
每通訊端核心數:2
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 61
Model name: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
製程: 4
CPU MHz: 1941.699
CPU max MHz: 2700.0000
CPU min MHz: 500.0000
BogoMIPS: 3200.29
虛擬: VT-x
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 3072K
```
linux kernel:
```
peterting@peterting-MacBookAir:~$ uname -a
Linux peterting-MacBookAir 4.8.0-36-generic
```
## 未優化前
* **執行時間**
```
Execution time of raytracing() : 2.935029 sec
```
* **效能測試**
使用 Phonebook 作業所提到的 perf 工具來做測試,得到:
```
Performance counter stats for './raytracing' (5 runs):
204,985 cache-misses # 31.439 % of all cache refs ( +- 7.09% )
651,997 cache-references ( +- 10.63% )
19,469,436,804 instructions # 2.48 insn per cycle ( +- 0.00% )
7,848,363,703 cycles ( +- 0.10% )
2.942645053 seconds time elapsed ( +- 0.13% )
```
* **call graph**
安裝gprof2dot後,可以產生關聯圖,讓我們更清楚funtion之間的關聯
> 指令參考 [hugikun999](https://hackmd.io/MwBgnARsCGAcEFoBsBGAZogLEgJkhYsATLAgMYRKxgDsYYRAprAKxA==#) 的共筆
`gprof ./raytracing | gprof2dot | dot -T png -o output.png`
![](https://i.imgur.com/FUVuhnC.png)
* **gmon.out**
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
27.74 0.61 0.61 69646433 0.00 0.00 dot_product
21.83 1.09 0.48 56956357 0.00 0.00 subtract_vector
10.01 1.31 0.22 31410180 0.00 0.00 multiply_vector
7.50 1.48 0.17 17836094 0.00 0.00 add_vector
5.91 1.61 0.13 4620625 0.00 0.00 ray_hit_object
5.69 1.73 0.13 13861875 0.00 0.00 rayRectangularIntersection
4.32 1.83 0.10 13861875 0.00 0.00 raySphereIntersection
3.18 1.90 0.07 17821809 0.00 0.00 cross_product
3.18 1.97 0.07 10598450 0.00 0.00 normalize
2.73 2.03 0.06 2110576 0.00 0.00 compute_specular_diffuse
2.27 2.08 0.05 4221152 0.00 0.00 multiply_vectors
1.59 2.11 0.04 1048576 0.00 0.00 ray_color
```
由此可知 dot_product 被 call 最多次,消耗的時間也是最多,因此,若將時間降低,效能必定也能改善許多。
## 首次優化 : Loop-unrolling
* **原始程式碼**
```C
double dot_product(const double *v1, const double *v2)
63 {
64 double dp = 0.0;
65 for (int i = 0; i < 3; i++)
66 dp += v1[i] * v2[i];
67 return dp;
68 }
```
因為loop次數不多,因此直接把他拆開,變成:
```C
double dot_product(const double *v1, const double *v2)
63 {
64 double dp = 0.0;
65
66 dp = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
67 return dp;
68 }
```
來看看改變後的執行時間、效能以及gprof所分析出來的數據結果
* **執行時間**
```
# Rendering scene
Done!
Execution time of raytracing() : 2.534777 sec
```
執行時間從 2.935s 降低至 2.535s ,**減少 13.7%**。
* **效能**
```
Performance counter stats for './raytracing' (5 runs):
253,765 cache-misses # 34.504 % of all cache refs ( +- 29.15% )
735,464 cache-references ( +- 34.95% )
16,750,538,375 instructions # 2.44 insn per cycle ( +- 0.00% )
6,866,806,918 cycles ( +- 0.57% )
2.582372910 seconds time elapsed ( +- 0.84% )
```
instruction 與 cycle 大幅減少
* **gprof**
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
27.03 0.47 0.47 56956357 0.00 0.00 subtract_vector
13.80 0.71 0.24 69646433 0.00 0.00 dot_product
11.21 0.91 0.20 17836094 0.00 0.00 add_vector
10.93 1.10 0.19 31410180 0.00 0.00 multiply_vector
9.78 1.27 0.17 13861875 0.00 0.00 rayRectangularIntersection
6.90 1.39 0.12 4620625 0.00 0.00 ray_hit_object
5.18 1.48 0.09 17821809 0.00 0.00 cross_product
3.45 1.54 0.06 1048576 0.00 0.00 ray_color
2.88 1.59 0.05 10598450 0.00 0.00 normalize
2.30 1.63 0.04 4221152 0.00 0.00 multiply_vectors
1.73 1.66 0.03 13861875 0.00 0.00 raySphereIntersection
1.15 1.68 0.02 2110576 0.00 0.00 compute_specular_diffuse
```
時間從 0.61s 降低至 0.47s,**減少 23%**。
後來發現 add_vector() 和 subtract_vector() 以及mutiply_vector() 也都花蠻多時間的,因此也對這三個 function 做 loop-unrolling(**此次優化有加上 force inline**)
發現效能有更大的提升
* **執行時間**
```
# Rendering scene
Done!
Execution time of raytracing() : 2.051923 sec
```
* **執行效能**
```
Performance counter stats for './raytracing' (5 runs):
174,025 cache-misses # 27.751 % of all cache refs ( +- 5.36% )
627,101 cache-references ( +- 9.28% )
12,496,257,428 instructions # 2.31 insn per cycle ( +- 0.00% )
5,416,204,477 cycles ( +- 0.21% )
2.030640000 seconds time elapsed ( +- 0.20% )
```
* **gprof分析**
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
21.22 0.35 0.35 69646433 0.00 0.00 dot_product
11.22 0.54 0.19 13861875 0.00 0.00 raySphereIntersection
10.31 0.71 0.17 56956357 0.00 0.00 subtract_vector
10.31 0.88 0.17 17821809 0.00 0.00 cross_product
7.88 1.01 0.13 4620625 0.00 0.00 ray_hit_object
7.88 1.14 0.13 17836094 0.00 0.00 add_vector
6.67 1.25 0.11 10598450 0.00 0.00 normalize
6.06 1.35 0.10 1048576 0.00 0.00 ray_color
4.55 1.42 0.08 13861875 0.00 0.00 rayRectangularIntersection
1.82 1.45 0.03 31410180 0.00 0.00 multiply_vector
1.82 1.48 0.03 4221152 0.00 0.00 multiply_vectors
1.82 1.51 0.03 2520791 0.00 0.00 idx_stack_top
1.82 1.54 0.03 1 0.03 1.64 raytracing
```
由以上三像數據可以得知使用 loop-unrolling 對於效能改善的確有幫助。
## 第二次優化 : force inline
此次優化很簡單,強制開啟inline功能,看看效能是否能有所改善
* 執行時間
```
# Rendering scene
Done!
Execution time of raytracing() : 2.322062 sec
```
比起首次優化後的時間還要更低
* 執行效能
```
Performance counter stats for './raytracing' (5 runs):
213,282 cache-misses # 27.128 % of all cache refs ( +- 12.64% )
786,211 cache-references ( +- 17.59% )
15,010,319,433 instructions # 2.42 insn per cycle ( +- 0.00% )
6,207,870,112 cycles ( +- 0.41% )
2.325456367 seconds time elapsed ( +- 0.51% )
```
cycles, instructions, cache-misses 均有降低,所以 force inline 也可以優化效能
## 第三次優化 : 利用OpenMP
看到OpenMP的相關資料後,發現他可以對 for 回圈來進行平行處理,並且只需要短短一行的程式碼:
` #pragma omp parallel for`
哇!對於有很多回圈並且不會互相干擾的程式碼來說這真的是天上佳音!
但是之後經過實驗發現,若原本的執行時間不長,加上OpenMP的指令只會讓速度更慢。
剛開始時原本只利用上面提到的指令來改善 function raytracing ,不過後來發現效果不彰,後來參考到 [Enhance raytracing program](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp) 這篇文章,才發現原來是因為我少加了很多參數,之後一一去查參數的功用,才明白箇中道理。
**正確指令:**
`#pragma omp parallel for schedule(guided,1) collapse(2) num_threads(64) private(stk), private(d),private(object_color)`
OpenMP 語法:
`#pragma omp directive [clause]`
參數功用:
* 類別:
* **directive**:
* **parallel**:Defines a parallel region, which is code that will be executed by multiple threads in parallel.
* **clause:**
* **schedule**:Applies to the for (OpenMP) directive.
* **collapse**:Specifies how many loops in a nested loop should be collapsed into one large iteration space and divided according to the schedule clause. The sequential execution of the iterations in all associated loops determines the order of the iterations in the collapsed iteration space.
* **private**:Specifies that each thread should have its own instance of a variable.
* **num_threads**:Sets the number of threads in a thread team.
**此次優化延續上面所有的優化**
* **執行時間**
```
# Rendering scene
Done!
Execution time of raytracing() : 0.951524 sec
```
時間完全大幅提升!
* **執行效能**
```
Performance counter stats for './raytracing' (5 runs):
463,753 cache-misses # 24.932 % of all cache refs ( +- 4.66% )
1,860,067 cache-references ( +- 5.58% )
12,546,283,213 instructions # 1.34 insn per cycle ( +- 0.01% )
9,364,090,976 cycles ( +- 0.22% )
0.965814707 seconds time elapsed ( +- 0.63% )
```
除了 instructions 和 cache-misses 比例減少外,其他數值都上升,不過時間下降,推斷是效能測試有把所有平行的處理器的數值都加進來,所以數值會近乎double。
* ****
## 參考資料
[使用Gnu gprof进行Linux平台下的程序分析](http://os.51cto.com/art/200703/41426.htm)
[簡易的程式平行化方法-OpenMP(一)簡介](https://kheresy.wordpress.com/2006/06/09/%e7%b0%a1%e6%98%93%e7%9a%84%e7%a8%8b%e5%bc%8f%e5%b9%b3%e8%a1%8c%e5%8c%96%e6%96%b9%e6%b3%95%ef%bc%8dopenmp%ef%bc%88%e4%b8%80%ef%bc%89%e7%b0%a1%e4%bb%8b/)