2016q3 Homework1 - raytracing

contributed by <Jim00000>

tags: `awesome face`

tags: `☺`

學習gprof

從參考資料3中大概可以得知gprof可以讓你知道程式把時間花在甚麼地方和哪個函式的呼叫了哪個函式並找出效能的瓶頸

使用方式

參考參考資料3的資料做整理

在gcc的Options要加入 -pg，不然不會有call-graph data

-static-libgcc或是-static 在使用shared library時要加進去，不然會有segmentation fault

gcc -g -pg -static-libgcc ...

執行程式後產生一個含有profile data gmon.out的文件

gprof options executable-file gmon.out bb-data [yet-more-profile-data-files...] [> outfile]

或是簡單點

gprof executable-file gmon.out [> outfile]

其中options有可以

加入 -b 或是 –brief 讓文件精簡，只含有效能分析時間的資訊
加入 -l 開啟 line-by-line profiling，在name部份會出現行號幫助追蹤
加入 -A 會出現含有原始碼的報表，並附上被使用到的次數，搭配-x使用會確保每一行至少被標記一次

gprof -b -l a.out gmon.out > outfile

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
100.05      3.96     3.96        1     3.96     3.98  foo2 (main.c:14 @ 400734)
  1.02      4.00     0.04        3     0.01     0.01  foo3 (main.c:12 @ 400702)
  0.00      4.00     0.00       11     0.00     0.00  foo (main.c:5 @ 4006c6)

			Call graph


granularity: each sample hit covers 2 byte(s) for 0.25% of 4.00 seconds

index % time    self  children    called     name
                3.96    0.01       1/1           main (main.c:21 @ 40078b) [1]
[2]     99.3    3.96    0.01       1         foo2 (main.c:14 @ 400734) [2]
                0.01    0.00       1/3           foo3 (main.c:12 @ 400702) [3]
-----------------------------------------------
                0.01    0.00       1/3           foo2 (main.c:14 @ 400734) [2]
                0.01    0.00       1/3           main (main.c:21 @ 40078b) [1]
                0.01    0.00       1/3           main (main.c:23 @ 40079f) [4]
[3]      1.0    0.04    0.00       3         foo3 (main.c:12 @ 400702) [3]
-----------------------------------------------
                0.00    0.00       1/11          main (main.c:19 @ 400777) [18]
                0.00    0.00      10/11          foo (main.c:9 @ 4006f0) [11]
[5]      0.0    0.00    0.00      11         foo (main.c:5 @ 4006c6) [5]
-----------------------------------------------

Index by function name

   [5] foo (main.c:5 @ 4006c6) [2] foo2 (main.c:14 @ 400734) [3] foo3 (main.c:12 @ 400702)

gprof -l -A -x a.out gmon.out > outfile

                #include <stdio.h>
                #include <limits.h>
                
                void foo(int time)
          11 -> {
          11 ->     if(time <= 0)
          11 ->         return;
          11 ->     printf("%d\n",time);
          11 ->     foo(time - 1);
          11 -> }
                
           3 -> void foo3(){int i;for(i=0;i<8000000;i++);};
       ##### -> void foo4(){};
           1 -> void foo2(){int i;for(i=0;i<INT_MAX;i++);foo3();};
                
                int main()
       ##### -> {
       ##### ->     printf("hello\n");
       ##### ->     printf("world\n");
       ##### ->     foo(10);
       ##### ->     foo2();
       ##### ->     foo3();
       ##### ->     foo3();
       ##### ->     return 0;
       ##### -> }


Top 10 Lines:

     Line      Count

        5         11
       12          3
       14          1

Execution Summary:

       18   Executable lines in this file
       18   Lines executed
   100.00   Percent of the file executed

       15   Total number of line executions
     0.83   Average executions per line

數據用意

time ：程式在此函式花了多少百分比時間(包含)
self ：所有花費在這個函式的時間
children ：函式的subroutine花費的時間
called ：被呼叫的次數
name : 函式名稱

Q:How do I find which lines in my program were executed the most times?
A:Use the gcov program.
From answers to Common Questions，參考資料3

學習gcov

因為文件有提到gcov，那就來學習gcov吧，觀看參考資料5的文件

使用方式

以參考資料5的資料做整理

gcov可以協助分析

每行程式碼的執行頻率
哪些程式碼有執行到
how much computing time each section of code uses

gcc -fprofile-arcs -ftest-coverage source-file

執行程式後產生 source-file.gcda

gcov source-file.gcda產生source-file.c.gcov檔案

        -:    0:Source:main.c
        -:    0:Graph:main.gcno
        -:    0:Data:main.gcda
        -:    0:Runs:1
        -:    0:Programs:1
        -:    1:#include <stdio.h>
        -:    2:#include <limits.h>
        -:    3:
       11:    4:void foo(int time)
        -:    5:{
       11:    6:    if(time <= 0)
        1:    7:        return;
       10:    8:    printf("%d\n",time);
       10:    9:    foo(time - 1);
        -:   10:}
        -:   11:
        3:   12:void foo3(){int i;for(i=0;i<8000000;i++);};
    #####:   13:void foo4(){};
        1:   14:void foo2(){int i;for(i=0;i<INT_MAX;i++);foo3();};
        -:   15:
        1:   16:int main()
        -:   17:{
        1:   18:    printf("hello\n");
        1:   19:    printf("world\n");
        1:   20:    foo(10);
        1:   21:    foo2();
        1:   22:    foo3();
        1:   23:    foo3();
        1:   24:    return 0;
        -:   25:}

使用 -b 來察看branch

gcov -b source-file

        -:    0:Source:main.c
        -:    0:Graph:main.gcno
        -:    0:Data:main.gcda
        -:    0:Runs:1
        -:    0:Programs:1
        -:    1:#include <stdio.h>
        -:    2:#include <limits.h>
        -:    3:
function foo called 11 returned 100% blocks executed 100%
       11:    4:void foo(int time)
        -:    5:{
       11:    6:    if(time <= 0)
branch  0 taken 9% (fallthrough)
branch  1 taken 91%
        1:    7:        return;
       10:    8:    printf("%d\n",time);
call    0 returned 100%
       10:    9:    foo(time - 1);
call    0 returned 100%
        -:   10:}
        -:   11:
function foo3 called 3 returned 100% blocks executed 100%
        3:   12:void foo3(){int i;for(i=0;i<8000000;i++);};
branch  0 taken 99%
branch  1 taken 1% (fallthrough)
function foo4 called 0 returned 0% blocks executed 0%
    #####:   13:void foo4(){};
function foo2 called 1 returned 100% blocks executed 100%
        1:   14:void foo2(){int i;for(i=0;i<INT_MAX;i++);foo3();};
branch  0 taken 99%
branch  1 taken 1% (fallthrough)
call    2 returned 100%
        -:   15:
function main called 1 returned 100% blocks executed 100%
        1:   16:int main()
        -:   17:{
        1:   18:    printf("hello\n");
call    0 returned 100%
        1:   19:    printf("world\n");
call    0 returned 100%
        1:   20:    foo(10);
call    0 returned 100%
        1:   21:    foo2();
call    0 returned 100%
        1:   22:    foo3();
call    0 returned 100%
        1:   23:    foo3();
call    0 returned 100%
        1:   24:    return 0;
        -:   25:}

案例分析 RayTracing

原始版本

# Rendering scene
Done!
Execution time of raytracing() : 2.766088 sec

使用perf尋找熱點
./raytracing & perf top -p $!

Overhead為紅字部份的像是dot_product、subtract_vector、multiplt_vector、normalize與add_vector為math-toolkit.h裡的函式，消耗有一定程度cycles

而rayRectangularIntersection與raySphereIntersection則是raytracing.c內的函式，仔細瞧瞧原始碼裡的內容，在rayRectangularIntersection中大量使用subtract_vector、cross_product、dot_product、multiply_vector等toolkit.h內的數學函式而raySphereIntersection亦是如此，應此若能改良toolkit.h的函式，效能定有改善

再來是分析函式的使用頻率

make PROFILE=1
./raytracing
gprof -l -b raytracing gmon.out > report
打開report，節錄並重新排列米黃色字部份內容如下:

index % time    self  children    called     name
[9]      2.8    0.07    0.00 69646433         dot_product (math-toolkit.h:79 @ 401dd2) [9] 
[10]     2.4    0.06    0.00 56956357         subtract_vector (math-toolkit.h:38 @ 401b14) [    10]
[12]     2.4    0.06    0.00 13861875         rayRectangularIntersection (raytracing.c:54 @     402409) [12]
[25]     1.2    0.03    0.00 13861875         raySphereIntersection (raytracing.c:24 @ 4020f    0) [25]
[33]     0.8    0.02    0.00 17821809         cross_product (math-toolkit.h:71 @ 401ce0) [33    ]
[43]     0.8    0.02    0.00 31410180         multiply_vector (math-toolkit.h:60 @ 401c56) [    43]
[56]     0.4    0.01    0.00 17836094         add_vector (math-toolkit.h:27 @ 401a73) [56]
[57]     0.4    0.01    0.00 4620625         ray_hit_object (raytracing.c:251 @ 40368c) [57]
[83]     0.2    0.01    0.00 4221152         multiply_vectors (math-toolkit.h:49 @ 401bb5) [    83]
[145]    0.0    0.00    0.00 10598450         normalize (math-toolkit.h:10 @ 4018cf) [145]
[146]    0.0    0.00    0.00 3838091         length (math-toolkit.h:21 @ 4019e7) [146]
(以下略)

在call graph中，呼叫最多次數的函式為dot_product，緊接著是subtract_vector，除了rayRectangularIntersection與raySphereIntersection外，幾乎都是toolkit.h內的函式，因此，toolkit.h內的數學函式為改進效能的關鍵

再來比較每一行的執行數
gcov raytracing.gcda

        -:    0:Source:math-toolkit.h
        -:    0:Graph:raytracing.gcno
        -:    0:Data:raytracing.gcda
        -:    0:Runs:1
        -:    0:Programs:1
        -:    1:#ifndef __RAY_MATH_TOOLKIT_H
        -:    2:#define __RAY_MATH_TOOLKIT_H
        -:    3:
        -:    4:#include <math.h>
        -:    5:#include <stdio.h>
        -:    6:#include <assert.h>
        -:    7:
        -:    8:static inline
 10598450:    9:void normalize(double *v)
        -:   10:{
 10598450:   11:    double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
 10598450:   12:    assert(d != 0.0 && "Error calculating normal");
        -:   13:
 10598450:   14:    v[0] /= d;
 10598450:   15:    v[1] /= d;
 10598450:   16:    v[2] /= d;
 10598450:   17:}
        -:   18:
        -:   19:static inline
  3838091:   20:double length(const double *v)
        -:   21:{
  3838091:   22:    return sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
        -:   23:}
        -:   24:
        -:   25:static inline
 17836094:   26:void add_vector(const double *a, const double *b, double *out)
        -:   27:{
 71344376:   28:    for (int i = 0; i < 3; i++)
 53508282:   29:        out[i] = a[i] + b[i];
        -:   30:
        -:   31:    // out[0] = a[0] + b[0];
        -:   32:    // out[1] = a[1] + b[1];
        -:   33:    // out[2] = a[2] + b[2];
 17836094:   34:}
        -:   35:
        -:   36:static inline
 56956357:   37:void subtract_vector(const double *a, const double *b, double *out)
        -:   38:{
227825428:   39:    for (int i = 0; i < 3; i++)
170869071:   40:        out[i] = a[i] - b[i];
        -:   41:
        -:   42:    // out[0] = a[0] - b[0];
        -:   43:    // out[1] = a[1] - b[1];
        -:   44:    // out[2] = a[2] - b[2];
 56956357:   45:}
        -:   46:
        -:   47:static inline
  4221152:   48:void multiply_vectors(const double *a, const double *b, double *out)
        -:   49:{
 16884608:   50:    for (int i = 0; i < 3; i++)
 12663456:   51:        out[i] = a[i] * b[i];
        -:   52:
        -:   53:    // out[0] = a[0] * b[0];
        -:   54:    // out[1] = a[1] * b[1];
        -:   55:    // out[2] = a[2] * b[2];
  4221152:   56:}
        -:   57:
        -:   58:static inline
 31410180:   59:void multiply_vector(const double *a, double b, double *out)
        -:   60:{
125640720:   61:    for (int i = 0; i < 3; i++)
 94230540:   62:        out[i] = a[i] * b;
        -:   63:
        -:   64:    // out[0] = a[0] * b;
        -:   65:    // out[1] = a[1] * b;
        -:   66:    // out[2] = a[2] * b;
 31410180:   67:}
        -:   68:
        -:   69:static inline
 17821809:   70:void cross_product(const double *v1, const double *v2, double *out)
        -:   71:{
 17821809:   72:    out[0] = v1[1] * v2[2] - v1[2] * v2[1];
 17821809:   73:    out[1] = v1[2] * v2[0] - v1[0] * v2[2];
 17821809:   74:    out[2] = v1[0] * v2[1] - v1[1] * v2[0];
 17821809:   75:}
        -:   76:
        -:   77:static inline
 69646433:   78:double dot_product(const double *v1, const double *v2)
        -:   79:{
 69646433:   80:    double dp = 0.0;
278585732:   81:    for (int i = 0; i < 3; i++)
208939299:   82:        dp += v1[i] * v2[i];
 69646433:   83:    return dp;
        -:   84:    
        -:   85:    // return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
        -:   86:}
        -:   87:
        -:   88:static inline
        -:   89:void scalar_triple_product(const double *u, const double *v, const double *w,
        -:   90:                           double *out)
        -:   91:{
        -:   92:    cross_product(v, w, out);
        -:   93:    multiply_vectors(u, out, out);
        -:   94:}
        -:   95:
        -:   96:static inline
        -:   97:double scalar_triple(const double *u, const double *v, const double *w)
        -:   98:{
        -:   99:    double tmp[3];
        -:  100:    cross_product(w, u, tmp);
        -:  101:    return dot_product(v, tmp);
        -:  102:}
        -:  103:
        -:  104:#endif

可以觀察到執行次數最多的地方是for迴圈，所以，for迴圈是一個效能改善的重點，作業有提到loop unrolling的技巧

for (int i = 0; i < 3; i++)
	dp += v1[i] * v2[i];

v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2]

將loop unrolling後的math-toolkit.h內的for loop展開後的執行結果如下

# Rendering scene
Done!
Execution time of raytracing() : 1.783641 sec

使用多執行緒改善效能

使用OpenMP修改RayTracing程式，參考OpenMP文件(參考資料7)

在raytracing.c的文件中找到一處適合修改成多執行續的地方

void raytracing(...)
{
.
.
. 
    for (int j = 0; j < height; j++) { 
        for (int i = 0; i < width; i++) {
            double r = 0, g = 0, b = 0;
            /* MSAA */
            for (int s = 0; s < SAMPLES; s++) {
.
.
.

這裡可以變成多執行續處理，但是要考慮各個資料相依性與獨立性，這裡卡了一段時間，參考同學的hackmd(參考資料8)後找出必須獨立的資料(或是說每個執性緒都有各自的資料副本而非共享資料)為d，stk與object_color，才讓輸出的圖形是正確的。

.
.
.
#pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < height; j++) {
        #pragma omp parallel for schedule(dynamic) private(d,stk,object_color) 
        for (int i = 0; i < width; i++) {
.
.
.

測試過部份loop的schedule方法(意指迴圈是如何被分配或切割到各個執行緒，來參考資料7中對schedule的解釋)，dynamic的效率是比較好的

# Rendering scene
Done!
Execution time of raytracing() : 0.392827 sec
Verified OK

2016q3 Homework1 - raytracing

tags: awesome face

tags: ☺

學習gprof

使用方式

數據用意

學習gcov

使用方式

案例分析 RayTracing

原始版本

使用多執行緒改善效能

參考資料

tags: `awesome face`

tags: `☺`