# 2016q3 Homework1 - raytracing
contributed by <`Jim00000`>
###### tags: `awesome face`
###### tags: `☺`
## 學習gprof
從==參考資料3==中大概可以得知gprof可以讓你知道程式把時間花在甚麼地方和哪個函式的呼叫了哪個函式並找出效能的瓶頸
### 使用方式
參考==參考資料3==的資料做整理
在gcc的Options要加入 -pg,不然不會有call-graph data
-static-libgcc或是-static 在使用shared library時要加進去,不然會有segmentation fault
`gcc -g -pg -static-libgcc ...`
執行程式後產生一個含有profile data **gmon.out**的文件
`gprof options executable-file gmon.out bb-data [yet-more-profile-data-files...] [> outfile]`
或是簡單點
`gprof executable-file gmon.out [> outfile]`
其中options有可以
* 加入 -b 或是 --brief 讓文件精簡,只含有效能分析時間的資訊
* 加入 -l 開啟 line-by-line profiling,在name部份會出現行號幫助追蹤
* 加入 -A 會出現含有原始碼的報表,並附上被使用到的次數,搭配-x使用會確保每一行至少被標記一次
`gprof -b -l a.out gmon.out > outfile`
```
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.05 3.96 3.96 1 3.96 3.98 foo2 (main.c:14 @ 400734)
1.02 4.00 0.04 3 0.01 0.01 foo3 (main.c:12 @ 400702)
0.00 4.00 0.00 11 0.00 0.00 foo (main.c:5 @ 4006c6)
Call graph
granularity: each sample hit covers 2 byte(s) for 0.25% of 4.00 seconds
index % time self children called name
3.96 0.01 1/1 main (main.c:21 @ 40078b) [1]
[2] 99.3 3.96 0.01 1 foo2 (main.c:14 @ 400734) [2]
0.01 0.00 1/3 foo3 (main.c:12 @ 400702) [3]
-----------------------------------------------
0.01 0.00 1/3 foo2 (main.c:14 @ 400734) [2]
0.01 0.00 1/3 main (main.c:21 @ 40078b) [1]
0.01 0.00 1/3 main (main.c:23 @ 40079f) [4]
[3] 1.0 0.04 0.00 3 foo3 (main.c:12 @ 400702) [3]
-----------------------------------------------
0.00 0.00 1/11 main (main.c:19 @ 400777) [18]
0.00 0.00 10/11 foo (main.c:9 @ 4006f0) [11]
[5] 0.0 0.00 0.00 11 foo (main.c:5 @ 4006c6) [5]
-----------------------------------------------
Index by function name
[5] foo (main.c:5 @ 4006c6) [2] foo2 (main.c:14 @ 400734) [3] foo3 (main.c:12 @ 400702)
```
`gprof -l -A -x a.out gmon.out > outfile`
```
#include <stdio.h>
#include <limits.h>
void foo(int time)
11 -> {
11 -> if(time <= 0)
11 -> return;
11 -> printf("%d\n",time);
11 -> foo(time - 1);
11 -> }
3 -> void foo3(){int i;for(i=0;i<8000000;i++);};
##### -> void foo4(){};
1 -> void foo2(){int i;for(i=0;i<INT_MAX;i++);foo3();};
int main()
##### -> {
##### -> printf("hello\n");
##### -> printf("world\n");
##### -> foo(10);
##### -> foo2();
##### -> foo3();
##### -> foo3();
##### -> return 0;
##### -> }
Top 10 Lines:
Line Count
5 11
12 3
14 1
Execution Summary:
18 Executable lines in this file
18 Lines executed
100.00 Percent of the file executed
15 Total number of line executions
0.83 Average executions per line
```
### 數據用意
* time : 程式在此函式花了多少百分比時間(包含)
* self : 所有花費在這個函式的時間
* children : 函式的subroutine花費的時間
* called : 被呼叫的次數
* name : 函式名稱
:::info
Q:How do I find which lines in my program were executed the most times?
A:Use the gcov program.
From answers to Common Questions,==參考資料3==
:::
## 學習gcov
因為文件有提到gcov,那就來學習gcov吧,觀看==參考資料5==的文件
### 使用方式
以==參考資料5==的資料做整理
gcov可以協助分析
* 每行程式碼的執行頻率
* 哪些程式碼有執行到
* how much computing time each section of code uses
`gcc -fprofile-arcs -ftest-coverage source-file`
執行程式後產生 source-file.gcda
`gcov source-file.gcda`產生source-file.c.gcov檔案
```
-: 0:Source:main.c
-: 0:Graph:main.gcno
-: 0:Data:main.gcda
-: 0:Runs:1
-: 0:Programs:1
-: 1:#include <stdio.h>
-: 2:#include <limits.h>
-: 3:
11: 4:void foo(int time)
-: 5:{
11: 6: if(time <= 0)
1: 7: return;
10: 8: printf("%d\n",time);
10: 9: foo(time - 1);
-: 10:}
-: 11:
3: 12:void foo3(){int i;for(i=0;i<8000000;i++);};
#####: 13:void foo4(){};
1: 14:void foo2(){int i;for(i=0;i<INT_MAX;i++);foo3();};
-: 15:
1: 16:int main()
-: 17:{
1: 18: printf("hello\n");
1: 19: printf("world\n");
1: 20: foo(10);
1: 21: foo2();
1: 22: foo3();
1: 23: foo3();
1: 24: return 0;
-: 25:}
```
使用 -b 來察看branch
`gcov -b source-file`
```
-: 0:Source:main.c
-: 0:Graph:main.gcno
-: 0:Data:main.gcda
-: 0:Runs:1
-: 0:Programs:1
-: 1:#include <stdio.h>
-: 2:#include <limits.h>
-: 3:
function foo called 11 returned 100% blocks executed 100%
11: 4:void foo(int time)
-: 5:{
11: 6: if(time <= 0)
branch 0 taken 9% (fallthrough)
branch 1 taken 91%
1: 7: return;
10: 8: printf("%d\n",time);
call 0 returned 100%
10: 9: foo(time - 1);
call 0 returned 100%
-: 10:}
-: 11:
function foo3 called 3 returned 100% blocks executed 100%
3: 12:void foo3(){int i;for(i=0;i<8000000;i++);};
branch 0 taken 99%
branch 1 taken 1% (fallthrough)
function foo4 called 0 returned 0% blocks executed 0%
#####: 13:void foo4(){};
function foo2 called 1 returned 100% blocks executed 100%
1: 14:void foo2(){int i;for(i=0;i<INT_MAX;i++);foo3();};
branch 0 taken 99%
branch 1 taken 1% (fallthrough)
call 2 returned 100%
-: 15:
function main called 1 returned 100% blocks executed 100%
1: 16:int main()
-: 17:{
1: 18: printf("hello\n");
call 0 returned 100%
1: 19: printf("world\n");
call 0 returned 100%
1: 20: foo(10);
call 0 returned 100%
1: 21: foo2();
call 0 returned 100%
1: 22: foo3();
call 0 returned 100%
1: 23: foo3();
call 0 returned 100%
1: 24: return 0;
-: 25:}
```
## 案例分析 RayTracing
### 原始版本
```
# Rendering scene
Done!
Execution time of raytracing() : 2.766088 sec
```
使用perf尋找熱點
`./raytracing & perf top -p $!`

Overhead為紅字部份的像是dot_product、subtract_vector、multiplt_vector、normalize與add_vector為math-toolkit.h裡的函式,消耗有一定程度cycles
而rayRectangularIntersection與raySphereIntersection則是raytracing.c內的函式,仔細瞧瞧原始碼裡的內容,在rayRectangularIntersection中大量使用subtract_vector、cross_product、dot_product、multiply_vector等toolkit.h內的數學函式而raySphereIntersection亦是如此,應此若能改良toolkit.h的函式,效能定有改善
再來是分析函式的使用頻率
`make PROFILE=1`
`./raytracing`
`gprof -l -b raytracing gmon.out > report`
打開report,節錄並重新排列米黃色字部份內容如下:
```
index % time self children called name
[9] 2.8 0.07 0.00 69646433 dot_product (math-toolkit.h:79 @ 401dd2) [9]
[10] 2.4 0.06 0.00 56956357 subtract_vector (math-toolkit.h:38 @ 401b14) [ 10]
[12] 2.4 0.06 0.00 13861875 rayRectangularIntersection (raytracing.c:54 @ 402409) [12]
[25] 1.2 0.03 0.00 13861875 raySphereIntersection (raytracing.c:24 @ 4020f 0) [25]
[33] 0.8 0.02 0.00 17821809 cross_product (math-toolkit.h:71 @ 401ce0) [33 ]
[43] 0.8 0.02 0.00 31410180 multiply_vector (math-toolkit.h:60 @ 401c56) [ 43]
[56] 0.4 0.01 0.00 17836094 add_vector (math-toolkit.h:27 @ 401a73) [56]
[57] 0.4 0.01 0.00 4620625 ray_hit_object (raytracing.c:251 @ 40368c) [57]
[83] 0.2 0.01 0.00 4221152 multiply_vectors (math-toolkit.h:49 @ 401bb5) [ 83]
[145] 0.0 0.00 0.00 10598450 normalize (math-toolkit.h:10 @ 4018cf) [145]
[146] 0.0 0.00 0.00 3838091 length (math-toolkit.h:21 @ 4019e7) [146]
(以下略)
```
在call graph中,呼叫最多次數的函式為dot_product,緊接著是subtract_vector,除了rayRectangularIntersection與raySphereIntersection外,幾乎都是toolkit.h內的函式,因此,toolkit.h內的數學函式為改進效能的關鍵
再來比較每一行的執行數
`gcov raytracing.gcda`
```
-: 0:Source:math-toolkit.h
-: 0:Graph:raytracing.gcno
-: 0:Data:raytracing.gcda
-: 0:Runs:1
-: 0:Programs:1
-: 1:#ifndef __RAY_MATH_TOOLKIT_H
-: 2:#define __RAY_MATH_TOOLKIT_H
-: 3:
-: 4:#include <math.h>
-: 5:#include <stdio.h>
-: 6:#include <assert.h>
-: 7:
-: 8:static inline
10598450: 9:void normalize(double *v)
-: 10:{
10598450: 11: double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
10598450: 12: assert(d != 0.0 && "Error calculating normal");
-: 13:
10598450: 14: v[0] /= d;
10598450: 15: v[1] /= d;
10598450: 16: v[2] /= d;
10598450: 17:}
-: 18:
-: 19:static inline
3838091: 20:double length(const double *v)
-: 21:{
3838091: 22: return sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
-: 23:}
-: 24:
-: 25:static inline
17836094: 26:void add_vector(const double *a, const double *b, double *out)
-: 27:{
71344376: 28: for (int i = 0; i < 3; i++)
53508282: 29: out[i] = a[i] + b[i];
-: 30:
-: 31: // out[0] = a[0] + b[0];
-: 32: // out[1] = a[1] + b[1];
-: 33: // out[2] = a[2] + b[2];
17836094: 34:}
-: 35:
-: 36:static inline
56956357: 37:void subtract_vector(const double *a, const double *b, double *out)
-: 38:{
227825428: 39: for (int i = 0; i < 3; i++)
170869071: 40: out[i] = a[i] - b[i];
-: 41:
-: 42: // out[0] = a[0] - b[0];
-: 43: // out[1] = a[1] - b[1];
-: 44: // out[2] = a[2] - b[2];
56956357: 45:}
-: 46:
-: 47:static inline
4221152: 48:void multiply_vectors(const double *a, const double *b, double *out)
-: 49:{
16884608: 50: for (int i = 0; i < 3; i++)
12663456: 51: out[i] = a[i] * b[i];
-: 52:
-: 53: // out[0] = a[0] * b[0];
-: 54: // out[1] = a[1] * b[1];
-: 55: // out[2] = a[2] * b[2];
4221152: 56:}
-: 57:
-: 58:static inline
31410180: 59:void multiply_vector(const double *a, double b, double *out)
-: 60:{
125640720: 61: for (int i = 0; i < 3; i++)
94230540: 62: out[i] = a[i] * b;
-: 63:
-: 64: // out[0] = a[0] * b;
-: 65: // out[1] = a[1] * b;
-: 66: // out[2] = a[2] * b;
31410180: 67:}
-: 68:
-: 69:static inline
17821809: 70:void cross_product(const double *v1, const double *v2, double *out)
-: 71:{
17821809: 72: out[0] = v1[1] * v2[2] - v1[2] * v2[1];
17821809: 73: out[1] = v1[2] * v2[0] - v1[0] * v2[2];
17821809: 74: out[2] = v1[0] * v2[1] - v1[1] * v2[0];
17821809: 75:}
-: 76:
-: 77:static inline
69646433: 78:double dot_product(const double *v1, const double *v2)
-: 79:{
69646433: 80: double dp = 0.0;
278585732: 81: for (int i = 0; i < 3; i++)
208939299: 82: dp += v1[i] * v2[i];
69646433: 83: return dp;
-: 84:
-: 85: // return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
-: 86:}
-: 87:
-: 88:static inline
-: 89:void scalar_triple_product(const double *u, const double *v, const double *w,
-: 90: double *out)
-: 91:{
-: 92: cross_product(v, w, out);
-: 93: multiply_vectors(u, out, out);
-: 94:}
-: 95:
-: 96:static inline
-: 97:double scalar_triple(const double *u, const double *v, const double *w)
-: 98:{
-: 99: double tmp[3];
-: 100: cross_product(w, u, tmp);
-: 101: return dot_product(v, tmp);
-: 102:}
-: 103:
-: 104:#endif
```
可以觀察到執行次數最多的地方是for迴圈,所以,for迴圈是一個效能改善的重點,作業有提到loop unrolling的技巧
```
for (int i = 0; i < 3; i++)
dp += v1[i] * v2[i];
```
```
v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2]
```
將loop unrolling後的math-toolkit.h內的for loop展開後的執行結果如下
```
# Rendering scene
Done!
Execution time of raytracing() : 1.783641 sec
```
### 使用多執行緒改善效能
使用OpenMP修改RayTracing程式,參考OpenMP文件(==參考資料7==)
在raytracing.c的文件中找到一處適合修改成多執行續的地方
```
void raytracing(...)
{
.
.
.
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; i++) {
double r = 0, g = 0, b = 0;
/* MSAA */
for (int s = 0; s < SAMPLES; s++) {
.
.
.
```
這裡可以變成多執行續處理,但是要考慮各個資料相依性與獨立性,這裡卡了一段時間,參考同學的hackmd(==參考資料8==)後找出必須獨立的資料(或是說每個執性緒都有各自的資料副本而非共享資料)為d,stk與object_color,才讓輸出的圖形是正確的。
```
.
.
.
#pragma omp parallel for schedule(dynamic)
for (int j = 0; j < height; j++) {
#pragma omp parallel for schedule(dynamic) private(d,stk,object_color)
for (int i = 0; i < width; i++) {
.
.
.
```
測試過部份loop的schedule方法(意指迴圈是如何被分配或切割到各個執行緒,來==參考資料7==中對schedule的解釋),dynamic的效率是比較好的
```
# Rendering scene
Done!
Execution time of raytracing() : 0.392827 sec
Verified OK
```
<!--
-->
## 參考資料
1. [使用Gnu gprof进行Linux平台下的程序分析(1)](http://os.51cto.com/art/200703/41426.htm)
2. [gprof - linux program profiling tool](http://awkzone.blogspot.tw/2010/11/gprof.html)
3. [GNU gprof](https://sourceware.org/binutils/docs/gprof/)
4. [Option Summary](https://gcc.gnu.org/onlinedocs/gcc/Option-Summary.html)
5. [gcov—a Test Coverage Program](https://gcc.gnu.org/onlinedocs/gcc/Gcov.html)
6. [6.42 An Inline Function is As Fast As a Macro](https://gcc.gnu.org/onlinedocs/gcc/Inline.html)
7. [OpenMP Tutorial](https://computing.llnl.gov/tutorials/openMP/)
8. [Vivian HackMD raytracing](https://hackmd.io/MYQwpgnARsBMUFoBmAWAjADgSsBmLGSuArAgAwDsFKAJmmAGxJgbFA==?view#)
9. [2016q1 homework1 hackpad](https://embedded2016.hackpad.com/2016q1-Homework-2A-wOu40KzMaIP#:h=Multi-Threads%E7%89%88%E6%9C%AC)
<!--從此處開始繼續往下寫-->