2017q1 Homework1 (raytracing)

contributed by <ryanpatiency>

Reviewed by `zhanyangch`

在標題簡介所使用的方法，例如：實作二，可改為實作二： loop 展開
可以嘗試使用平行化的技巧

目標

學習gprof
參考資料
以 gprof 指出效能瓶頸
著手改寫檔案 math-toolkit.h 的函式實做
紀錄效能差異

gprof 介紹

學任何事最簡單就是從範例著手，以下是程式碼，請將它複製為test.c

void fun2()
{
    for (int i = 0; i < 0xfffffff; i++);
}

void fun1()
{
    for (int i = 0; i < 0x4; i++)
        fun2();
}

int main()
{
    fun1();
    fun2();
}

內容超簡單的，可以看到程式大部分的時間都花在 func2, 一共執行了五次， fun1 執行 fun2 四次， main 執行 fun1 一次

接著請編譯：$ gcc test.c -pg -o test

-pg 代表 profile generate，產生對 gprof 有幫助的資料

接著請執行：$ ./test

執行成功後會發現一檔案gmon.out，就是 gprof 所需的檔案

接著打開 gprof gprof test gmon.out > gprof.txt

> 符號代表將 gprof 的 output 輸出到 gprof.txt 檔案

打開產生的 gprof.txt 你會看到：

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
101.27      3.15     3.15        5     0.63     0.63  fun2
  0.00      3.15     0.00        1     0.00     2.52  fun1
...
...
index % time    self  children    called     name
                0.63    0.00       1/5           main [2]
                2.52    0.00       4/5           fun1 [3]
[1]    100.0    3.15    0.00       5         fun2 [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    3.15                 main [2]
                0.00    2.52       1/1           fun1 [3]
                0.63    0.00       1/5           fun2 [1]
-----------------------------------------------
                0.00    2.52       1/1           main [2]
[3]     80.0    0.00    2.52       1         fun1 [3]
                2.52    0.00       4/5           fun2 [1]
-----------------------------------------------

先說明第一部分：

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
101.27      3.15     3.15        5     0.63     0.63  fun2
  0.00      3.15     0.00        1     0.00     2.52  fun1

name欄位，最後一欄，為所執行的函式的名字，test中只有 fun1 和 fun2， gprof 為按照佔程式執行時間排序，所以 fun2 在第一個。
time欄位，對應到第三欄 self seconds，為本身(不包涵子程式)的執行時間佔全部時間的比重，第一個101.27代表佔了 101.27% 的執行時間，因為 gprof 是用取樣的方式，所以才超過了 100%
seconds欄位，有兩欄，cumulative是累積，分別是累積執行時間和自己的執行時間。
calls呼叫次數
s/call代表seconds per call，每次呼叫平均佔時。觀察 2.52 為 0.63 的四倍，可知道self s/call代表不包涵子函式的執行時間，total s/call相反

第二部分說明 caller 和 callee 的關係，說明完了第一部分後，這部分我只挑重點講：

index % time    self  children    called     name
                0.63    0.00       1/5           main [2]
                2.52    0.00       4/5           fun1 [3]
[1]    100.0    3.15    0.00       5         fun2 [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    3.15                 main [2]
                0.00    2.52       1/1           fun1 [3]
                0.63    0.00       1/5           fun2 [1]
-----------------------------------------------
                0.00    2.52       1/1           main [2]
[3]     80.0    0.00    2.52       1         fun1 [3]
                2.52    0.00       4/5           fun2 [1]
-----------------------------------------------

可看到第二部分分為三個小部分，分別有三個主角，主角是time欄位有數字者，主角的上排為 caller ，主角的下排為 callee

time欄位，程式 和其子函式 佔main()的時間比重(%)，注意是main()而不是caller()
self欄位，分為三部分
- 對主角而言，代表在main執行主角本身的總時間
- 對 caller 而言，代表 caller 執行主角本身的總時間，請對照看 caller 的 called欄位
- 對 callee 而言，代表主角執行 callee 本身的總時間，請對照看 callee 的 called欄位

剩下部分請看 gprof.txt 的文字說明

結論

第二部分的東西最重要的就是 called 欄位，其他可由第一部分的數字推導出來。

和 perf 的比較

如果只是要找程式熱點的話，perf record -e cpu-clock 就可以了。 gprof 的用途應該是找出熱點程式的 seconds/call 和呼叫次數(程式架構)

修改前的程式效能

$ gprof -b ./raytracing gmon.out > raytracing.gprof
可以看到下列資訊：

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.80      0.61     0.61 69646433     0.00     0.00  dot_product
 19.30      1.15     0.54 56956357     0.00     0.00  subtract_vector
 11.44      1.47     0.32 31410180     0.00     0.00  multiply_vector
...

可得效能瓶頸前三名分別為

dot_product() 被呼叫 69646433 次memccpy()
subtract_vector()被呼叫 56956357 次memccpy()
multiply_vector()被呼叫 31410180 次memccpy()

實驗過程

實作一

觀看 math-toolkit.h ，可得上述實作如下：

double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

網路上得到的資料是將 loop 的方法由遞增改為遞減。理由是 assembly code 較短：

遞增
INC lpc      ; Increment loop counter
SUB lpc, #10 ; Compare loop counter to 10
BNZ loop     ; Branch if loop counter not equal to 10

遞減
DEC lpc   ; Decrement loop counter
BNZ loop  ; Branch if non zero

所以我更改dot_product()如下：

double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    int i=2;
    do{
        dp += v1[i] * v2[i];
    }while(i--);
        
    return dp;
}

再跑一次 gprof:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.07      0.52     0.52 56956357     0.00     0.00  subtract_vector
 17.42      0.95     0.43 69646433     0.00     0.00  dot_product
  9.32      1.18     0.23 31410181     0.00     0.00  multiply_vector

真是令人高興，原本第一名的dot_product變為第二名了。時間也從 0.61 變為 0.43

實作二

為了測試 for loop 的影響，我將 for loop 展開，同時保有原本的算式，更改dot_product()如下：

double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    dp+=v1[0]*v2[0];
    dp+=v1[1]*v2[1];
    dp+=v1[2]*v2[2];
    return dp;
}

得到 gprof 如下：

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.02      0.46     0.46 69646433     0.00     0.00  dot_product
 17.36      0.84     0.38 56956357     0.00     0.00  subtract_vector
  9.59      1.05     0.21 31410180     0.00     0.00  multiply_vector

發現展開 for loop 並無任何影響。

實作三

但如果減少 dp 和 i 的 init，和 assign , 修改 dot_product()如下

double dot_product(const double *v1, const double *v2)
{
    return v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2];
}

得到 gprof 如下：

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 28.14      1.53     1.53 148233778     0.00     0.00  subtract_vector
 12.51      2.21     0.68 35651587     0.00     0.00  normalize
 11.96      2.86     0.65 46157185     0.00     0.00  add_vector
  9.38      3.37     0.51 28311552     0.00     0.00  raySphereIntersection
  8.92      3.86     0.49 146229607     0.00     0.00  dot_product

發現原本第一名的dot_product變為第五名了。但發現原本呼叫
6964-6433 次memccpy()，現在卻呼叫
1-4622-9607 次，差了兩倍多，可見 gprof 的 calls 並不可靠。

如果工具「不可靠」，為什麼還會持續存在呢？與其用一兩句話描述，何不從其行為和限制去探討呢？請重新分析並提出你的解釋 –jserv

參考資料

gprof
count down may be more efficient compare to increment