2016q3 Homework1 (raytracing)

contributed by <ktvexe>

Reviewed by `yanglin5689446`

一開始沒有定義 THREAD_NUM 而直接用魔法數字，之後雖然用了THREAD_NUM 巨集來做不同 thread 的狀況的測試，但是是用浮點數讓我有點不太理解， THREAD_NUM 應該用整數就可以了吧?看了一下之後的運算好像也沒有必需要用到浮點數的地方。
將 static inline 改成 static inline __attribute__((always_inline)) 的部分，感覺可以用巨集取代，這樣就不用將檔案的全部內容修改。
第一次 commit 時好像排版有一點不統一，推測是 tab 跟空白混用造成的。

前情提要

gprof

資料：http://os.51cto.com/art/200703/41426.htm
翻譯－
使用Gnu gprof進行Linux平台下的程式分析

Gprof 簡介:
Gprof 的功能：列印出程式執行中各個函數消耗的時間，可以幫助programmer找出眾多函數中耗時最多的函数。產生程式執行時候的function call關係，包括call function得次數，可以幫助programmer分析程式的執行過程。

有了function call 的關係，會讓開發者大大提高工作效率，不用費心地去一點點找出程式的執行過程，這對小程式来说可能效果不是很行顯，但對於有幾萬，幾十萬程式碼的工程來說，效率是毋庸置疑的！而且這個功能對於維護舊程式碼或者是分析Open Source来说那是相當誘人的。

結論－
使用gprof可以了解各個function間互call的次數、時間等關係，可以由此找出需進行優化的片段。
晚點來研究使用gprof與perf top的使用時機與差異。

步驟－

compile時加上-pg
執行會產生gmon.out
下gprof -b raytracing gmon.out | less 或gprof raytracing gmon.out | less ，差別在於是否有每個欄位的說明

原始結果：
可以找出花費時間前幾名的function。
為什麼%time的比例與cumulative sec的結果不同呢?等下來研究。
答 : 沒有不同，cumulative sec是累加self sec而來，%time是self sec的比例

Original
執行時間－

ktvexe@ktvexe-SVS13126PWR:~/sysprog21/raytracing-1$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 6.726505 sec

gprof-

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.85      0.57     0.57 69646433     0.00     0.00  dot_product
 16.49      1.00     0.43 56956357     0.00     0.00  subtract_vector
 11.12      1.29     0.29 13861875     0.00     0.00  rayRectangularIntersection
  7.67      1.49     0.20 17821809     0.00     0.00  cross_product
  7.28      1.68     0.19 31410180     0.00     0.00  multiply_vector
  6.52      1.85     0.17 13861875     0.00     0.00  raySphereIntersection
  6.52      2.02     0.17  4620625     0.00     0.00  ray_hit_object
  6.33      2.19     0.17 17836094     0.00     0.00  add_vector
  4.98      2.32     0.13 10598450     0.00     0.00  normalize
  2.30      2.38     0.06  1048576     0.00     0.00  ray_color
  1.73      2.42     0.05  2110576     0.00     0.00  localColor
  1.53      2.46     0.04  4221152     0.00     0.00  multiply_vectors
  1.53      2.50     0.04        1     0.04     2.61  raytracing

perf-
 46.53%  libc-2.21.so  [.] __mcount_internal
   9.91%  libc-2.21.so  [.] _mcount
   8.89%  raytracing    [.] dot_product
   8.15%  raytracing    [.] subtract_vector
   4.52%  raytracing    [.] rayRectangularIntersection
   3.08%  raytracing    [.] multiply_vector
   2.51%  raytracing    [.] add_vector
   2.42%  raytracing    [.] ray_hit_object
   2.34%  raytracing    [.] raySphereIntersection
   2.33%  raytracing    [.] normalize
   1.66%  raytracing    [.] cross_product

當然，如果不開gprof，執行時間會變快，因為少了 mcount 函數。
此函數在執行期間會記錄呼叫次數等相關資訊。而這些資訊會儲存至 gmon.out 中 - by 勃興筆記。

繪製call graph

找到gprof2dot 的document: https://github.com/jrfonseca/gprof2dot
這是可以將各式profiler的output會製程dot圖的工具，像是perf或是gprof都可以，需先安裝python與graphviz。

$ ./raytracing
$ gprof raytracing| gprof2dot | dot -Tpng -o dot.png

$ perf record -g -- ./raytracing
$ perf script | c++filt | gprof2dot. -f perf | dot -Tpng -o output.png

優化

提示：

可善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling。

所以直接看程式碼，由前面效能分析工具分析的結果可以知道，math-toolkit.h中的運算function是這次改善效能的重點，而
math-toolkit.h中的function都有static inline這行，先查一下資料。
inline function是一個keyword，提醒compiler可以將function本體直接填入呼叫該function的位置。

優點：
減少呼叫function的overhead，如參數傳遞、回傳值處理等。
切換到function會jump，也就是會有branch的行為。因為CPU的pipeline和cache的特性，會影響效能。
經由代換程式碼，也許可以增加最佳化的可能性。













static inline
void add_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] + b[i];
}

static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] - b[i];
}

複習一下：

static function－意思是說靜態函式只能被該檔案所看見，其它檔案無法得知該檔案是否有其靜態函式。
範例：






static void f1(){
cout << “f1()" << endl;
}
void f2(){
cout << “f2()" << endl;
}

f1( )只在這個 Compilation Unit 中生效。而 f2( )可以被其他 Compilation Unit 訪問。
來畫個圖說明Compilation Unit 吧，使用graphviz吧。
每個經過preprocessor處理後的.c 檔，和它的目的檔是一一對應的，而Compilation Unit就是這些被處理後的.cpp 檔。
Compilation Unit就是經過preprocessor處理後的.c 檔，因為f2沒有static所以與其他Compilation Unit link時會找到多份的f2，產生錯誤，即使加inline也是。

HackMD可以直接產生graphviz不用自己產生圖片後貼上，開心

複習一下：
static function－意思是說靜態函式只能被該檔案所看見，其它檔案無法得知該檔案是否有其靜態函式。
範例
static void f1(){
cout << “f1()" << endl;
}
void f2(){
cout << “f2()" << endl;
}
f1( )只在這個 Compilation Unit 中生效。而 f2( )可以被其他 Compilation Unit 訪問。
來畫個圖說明Compilation Unit 吧，使用graphviz吧。
每個經過preprocessor處理後的.c 檔，和它的目的檔是一一對應的，而Compilation Unit就是這些被處理後的.cpp 檔。
Compilation Unit就是經過preprocessor處理後的.c 檔，因為f2沒有static所以與其他Compilation Unit link時會找到多份的f2，產生錯誤，即使加inline也是。

extern－extern可以聲明變數會在其它的位置被定義，這個位置可能是在同一份文件之中，或是在其它文件之中。
範例：










other.c
double var = 1;

main.c
#include <stdio.h>
int main(void) {
    extern double var;
    printf("%f\n", var);
    return 0;
}

extern代表var並非在main.c中定義，compiler會找其它位置或文件中var的定義，而且extern時不可同時assign其值，會變成在該位置定義變數，結果就產生重複定義。
範例：












 reject
int main(void) {
    extern double someVar = 100; 
    return 0;
}

accept
int main(void) {
    extern double someVar;
    someVar = 100;
     return 0;
}

Extern Vs Static Inline－

From: Linus Torvalds
No, we had this fight with the gcc people a few years back, and they have a very valid argument for the current semantics

"static inline" means "we have to have this function, if you use it but don't inline it, then make a static version of it in this compilation unit"
"extern inline" means "I actually have an extern for this function, but if you want to inline it, here's the inline-version"

I think the current gcc semantics are (a) more powerful than the old one and (b) have been in effect long enough that it's not painful for Linux to just switch over to them. In short, we might actually want to start taking advantage of them, and even if we don't we should just convert
all current users of "extern inline" to "static inline".
Linus

Loop unrolling

過去在撰寫C程式碼時並沒有特別注意過loop unrolling的問題，因為compiler optimization時自然會被處理掉，而且展開後在維護上其實並不方便，只有在寫assembly時才會刻意將loop展開。

執行時間－節省0.9秒


ktvexe@ktvexe-SVS13126PWR:~/sysprog21/raytracing-1$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 5.833282 sec

比較結果

既然loop unrolling這類優化compiler optimilation時都會做，那不如先開optimilation來比較一下結果，要比也要知道對手是誰的概念。

-O0：

# Rendering scene
Done!
Execution time of raytracing() : 2.257529 sec

-O1：

# Rendering scene
Done!
Execution time of raytracing() : 0.809577 sec

-O2：

# Rendering scene
Done!
Execution time of raytracing() : 0.756377 sec

-O3：

# Rendering scene
Done!
Execution time of raytracing() : 0.770924 sec

POSIX Thread


     for (int j = 0; j < height/2; j++) {
         for (int i = 0; i < width/2; i++) {


for (int s = 0; s < SAMPLES/2; s++) {

然後就實作multithread，因為以前OS也練習過thread的使用，所以實作上很快，我是將圖的height分成四份，讓不同的thread去執行。

# Rendering scene
Done!
Execution time of raytracing() : 1.122686 sec

inline

將原先的inline改成__attribute__ ((always_inline)) ，才會被強制inline，執行時間變成1.04秒。

ktvexe@ktvexe-SVS13126PWR:~/sysprog21/raytracing-1$ ./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 0.969148 sec