2016q3 Homework1 (raytracing)

Reviewed by `TempoJiJi`

還有很多不同的改進方法可以實做看看，像是把function call都換成Macro
Openmp不同執行緒數量的結果，用圖來呈現效果會更好
要跟Pthread混熟，應該多做研究和尋找資料，這樣才能跟它混熟
可以將最好的效能跟編譯器優化做對比，並且畫圖來呈現
commit c8e806 沒有具體的說明爲何要做這些修改，對程式有什麼影響

目標

學習開發工具，特別是效能評估相關
了解數學式
使用POSIX Thread

Gprof

功能：了解各函式消耗的時間，及各函式互相調用的關係
使用：
1. 編譯和連結時加上-pg 參數
2. 執行，透過 gprof 分析產生的 profiling 資料 (查看有無 gmon.out 檔案產生)
3. gprof -b raytracing gmon.out | less
less讓我們可以用上下鍵查看分析結果
指令集

    Each sample counts as 0.01 seconds.

      %   cumulative   self              self     total           
     time   seconds   seconds    calls   s/call   s/call  name    
     24.31      0.86     0.86 69646433     0.00     0.00  dot_product
     20.92      1.60     0.74 56956357     0.00     0.00  subtract_vector

欄位說明:

%time :某函式的執行時間佔全部的百分比
cumulative seconds: 累計執行時間
self seconds: 執行過程中,函式的執行時間
calls: 執行過程中，函式被呼叫的次數
self ms/call: 每次被呼叫時平均花費了多少milliseconds執行函式
total ms/call: 每次被呼叫時，平均花費了多milliseconds執行函式和函式中呼叫的routine

    Call graph

    granularity: each sample hit covers 2 byte(s) for 0.28% of 3.54 seconds
    index % time    self  children    called     name
                    0.01    3.53       1/1           main [2]
    [1]    100.0    0.01    3.53       1         raytracing [1]
                    0.05    3.34 1048576/1048576     ray_color [3]
                    0.05    0.10 1048576/1048576     rayConstruction [15]
                    0.00    0.00       1/1           calculateBasisVectors [22]
                    0.00    0.00 1048576/1048576     idx_stack_init [26]

分項解說:

children:函式中呼叫的subroutine的執行時間
called: 函式被呼叫的次數，如果是遞迴呼叫，會有兩個數字，第二個數字表示遞迴呼叫的次數
- e.g. offtime <cycle 2> [7]
children: 函式呼叫其它subroutine的執行時間
self + children 的總合等於函式被 main() 呼叫後的實際執行時間
called: 兩個數字，第一個代表函式被main()呼叫的次數，第二個數字代表被所有caller functions(程式可能也會有其它function呼叫)呼叫的次數

Gprof V.S. Perf

Perf: 統計程式在執行過程中的事件
Gprof: 記錄呼叫次數等資訊

Ray Tracing

三維電腦圖形學中的特殊描繪 (rendering) 演算法，跟蹤從眼睛發出的光線而不是光源發出的光線，對於反射與折射有更準確的模擬效果
打開程式碼，依照作業說明上的指示 make PORFILE=1

-pg：是使用Gropf要加的指令

    cc -std=gnu99 -Wall -O0 -g -pg -c -o objects.o objects.c
    cc -std=gnu99 -Wall -O0 -g -pg -c -o raytracing.o raytracing.c
    cc -std=gnu99 -Wall -O0 -g -pg -c -o main.o main.c
    cc -o raytracing objects.o raytracing.o main.o -lm -pg

定義在Makefile裡

        ifeq ($(strip $(PROFILE)),1)
        PROF_FLAGS = -pg
        CFLAGS += $(PROF_FLAGS)
        LDFLAGS += $(PROF_FLAGS) 
        endif

再翻起makefile語法
- ifeq(a,b) 判斷a,b是否相等，可有else，以 endif結尾
- $(strip ):去掉字串中開頭和結尾的空字符，返回被去掉空格的字符串值
執行結果：
- 居然花了5.6秒 @@

    # Rendering scene
    Done!
    Execution time of raytracing() : 5.608881 sec

使用gprof分析
- 命令：gprof -b raytracing gmon.out | less
- 執行次數最多且花的時間最長的是 dot_product
- 排名前幾名的多為math-toolkit.h裡的函式

Each sample counts as 0.01 seconds.

      %   cumulative   self              self     total           

     time   seconds   seconds    calls   s/call   s/call  name    
     27.53      0.93     0.93 69646433     0.00     0.00  dot_product
     17.17      1.51     0.58 56956357     0.00     0.00  subtract_vector
      9.18      1.82     0.31 13861875     0.00     0.00  rayRectangularIntersection
      7.70      2.08     0.26 31410180     0.00     0.00  multiply_vector
      6.81      2.31     0.23 17836094     0.00     0.00  add_vector
      6.51      2.53     0.22 10598450     0.00     0.00  normalize
      6.22      2.74     0.21 13861875     0.00     0.00  raySphereIntersection
      5.33      2.92     0.18 17821809     0.00     0.00  cross_product
      3.85      3.05     0.13  4620625     0.00     0.00  ray_hit_object
      2.07      3.12     0.07  3838091     0.00     0.00  length
      1.78      3.18     0.06  4221152     0.00     0.00  multiply_vectors
      1.78      3.24     0.06  1048576     0.00     0.00  ray_color

了解程式碼
- math_toolkit.h中的static inline
  - inline: 提醒compiler可以將function本體直接展開到呼叫該function的位置，而不需要原本的 function prologue / epilogue
    - 只能建議編譯器，也就是說建議並不一定會被採納，這視您的編譯器而定，像是使用到goto、static變數、迴圈、switch等等，這些編譯器可能不接受 inline function 的建議，遞迴函式也無法在呼叫點展開，一個數千行的函式也不適合在呼叫點展開，如果編譯器拒絕將函式展開，它會將該函式視為一般函式進行編譯，inline的建議會被忽略。
  - 不會執行inline的時候：stackoverflow
    - For functions that return values and are having a loop or switch or goto.
    - For functions not returning values, if a return statement exists.

優化

優化方法：

multi-threading (using POSIX Thread)
loop unrolling
force unline ->看到詹欣達學長的筆記
OpenMP

背景知識：

loop unrolling
- a compiler optimization applied to certain kinds of loops to reduce the frequency of branches and loop maintenance instructions.
- It is easily applied to sequential array processing loops where the number of iterations is known prior to execution of the loop.
OpenMP
- an application programming interface (API) that supports multi-platform shared memory multiprocessing programming
- 編譯時加上 -fopenmp
- 更多使用介紹
- nested for loop

優化1： loop unrolling

方法：根據gprof的分析結果決定把math-toolkit中的for loop展開
結果：
- 從5.6降到4.7秒！

    # Rendering scene
    Done!
    Execution time of raytracing() : 4.754009 sec

Gprof結果:
- math-toolkit.h中各函式的self seconds皆有減少
- subtract_vector的執行時間原先是排第2名，現在掉到第4名
  - self seconds: 0.58降至0.21，佔全體時間：17.7%降至7.9%
  - 少了for loop的增加和判斷i值，整個函式更單純，花費的時間也就愈少
  - normalize反而從第六名上升到第2名
    - self seconds:0.22增為0.37，佔全體時間：6.51%上升至13.92%

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total           

     time   seconds   seconds    calls   s/call   s/call  name    
     28.21      0.75     0.75 69646433     0.00     0.00  dot_product
     13.92      1.12     0.37 10598450     0.00     0.00  normalize
     11.29      1.42     0.30 13861875     0.00     0.00  raySphereIntersection
      7.90      1.63     0.21 56956357     0.00     0.00  subtract_vector
      7.90      1.84     0.21 13861875     0.00     0.00  rayRectangularIntersection
      7.52      2.04     0.20 17821809     0.00     0.00  cross_product
      4.14      2.15     0.11 31410180     0.00     0.00  multiply_vector
      3.39      2.24     0.09 17836094     0.00     0.00  add_vector
      3.01      2.32     0.08  4620625     0.00     0.00  ray_hit_object
      3.01      2.40     0.08  1048576     0.00     0.00  ray_color
      1.50      2.44     0.04  2110576     0.00     0.00  compute_specular_diffuse
      1.50      2.48     0.04  1048576     0.00     0.00  rayConstruction

優化2：force inline

看到詹欣達學長的筆記後決定試一試
決定inline function是否執行是由compiler決定，但經過搜尋後發現有兩個類型是inline function不會執行的，其中之一是要該函式不return任何值，而math_toolkit.h要回傳值的有dot_product() length()和scalar_triple()
方法：針對上面3個函式和cross_product(scalar_triple()有用到)，加入inline __attribute__((always_inline))
結果：
- 時間又從4.7秒下降至4.5秒

    # Rendering scene
    Done!
    Execution time of raytracing() : 4.552191 sec

每一個都強迫
- 結果：
  - 又再下降成3.2秒！

    # Rendering scene
    Done!
    Execution time of raytracing() : 3.290624 sec

Gprof:
- 花費時間最多的變成rayRectangularIntersection ，呼叫7次dot_product

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 34.49      0.10     0.10   386260     0.00     0.00  rayRectangularIntersection
 34.49      0.20     0.10   367224     0.00     0.00  raySphereIntersection
 10.35      0.23     0.03    29896     0.00     0.01  ray_color
  6.90      0.25     0.02    55505     0.00     0.00  compute_specular_diffuse
  3.45      0.26     0.01   133214     0.00     0.00  ray_hit_object
  3.45      0.27     0.01    66535     0.00     0.00  localColor
  3.45      0.28     0.01    35712     0.00     0.00  refraction
  3.45      0.29     0.01    29303     0.00     0.00  rayConstruction
  0.00      0.29     0.00    74444     0.00     0.00  idx_stack_empty
  0.00      0.29     0.00    69735     0.00     0.00  idx_stack_top
  0.00      0.29     0.00    38372     0.00     0.00  reflection
  0.00      0.29     0.00    36207     0.00     0.00  idx_stack_push

優化3：OpenMP

簡介:
- 支援跨平台共享記憶體方式的多執行緒並行的編程API，也可以和MPI一起並用
- 支援C、C++和Fortran
編譯時記得在Makefile中的CFLAGS加上-fopenmp,LDFLAGS加上-lgomp
方法：修改raytracing.c中的raytracing()
- nested for loop #pragma omp parallel for collapse(3)
  - 編譯時的錯誤訊息：

     error: not enough perfectly nested loops before ‘double’    
                 double r = 0, g = 0, b = 0;

修改：看了詹欣達學長的筆記，在前面加上#pragma omp parallel for num_threads(64) private(d),private(object_color),private(stk)順利編譯過了
private:
- 宣告在 construct 內的 thread_id 變數為各執行緒所私有的，如此各執行緒執行接下來的程式敘述時才能取用各自的 thread_id 值
64個threads結果：
- 減至1.3秒

# Rendering scene
Done!
Execution time of raytracing() : 1.370892 sec

還要再多看看資料~研究一下collapse private

試著更改thread數目

16個threads: 1.488103 sec
32個threads: 1.412717 sec
128個threads: 1.348934 sec
256個threads: 1.339899 sec
512個threads: 1.339736 sec

==>從上面列舉的threads數來看，threads愈多執行的時間雖然愈短，但時間下降的幅度也愈小。

Gprof
- 整體時間都更短了

    Each sample counts as 0.01 seconds.

      %   cumulative   self              self     total           

     time   seconds   seconds    calls  ms/call  ms/call  name    
     46.36      0.19     0.19   761781     0.00     0.00  rayRectangularIntersection
     17.08      0.26     0.07   129903     0.00     0.00  localColor
     14.64      0.32     0.06   745257     0.00     0.00  raySphereIntersection
      9.76      0.36     0.04   264403     0.00     0.00  ray_hit_object
      4.88      0.38     0.02   115772     0.00     0.00  compute_specular_diffuse
      4.88      0.40     0.02    49185     0.00     0.01  ray_color
      2.44      0.41     0.01    77232     0.00     0.00  reflection
      0.00      0.41     0.00   164302     0.00     0.00  idx_stack_empty
      0.00      0.41     0.00   147711     0.00     0.00  idx_stack_top
      0.00      0.41     0.00    77722     0.00     0.00  refraction
      0.00      0.41     0.00    73367     0.00     0.00  idx_stack_push
      0.00      0.41     0.00    70649     0.00     0.00  protect_color_overflow

優化4：POSIX threads

誠實面對：我跟threads不熟！！
好好學習並實作！！

參考下方學長hackpad，也把thread用在function raytracing()中

pthread_create()
- int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
- 參數1：thread pointer
- 參數2：可以定義thread屬性
- 參數3：要threads去執行的function
- 參數4：要傳進functions的參數
由於pthread_create()，function傳進的參數只能有一個，因此把原本raytracing()的參數打包成一個structure

2016q3 Homework1 (raytracing)

Reviewed by `TempoJiJi`

目標

Gprof

Gprof V.S. Perf

Ray Tracing

優化

優化方法：

背景知識：

優化1： loop unrolling

優化2：force inline

優化3：OpenMP

試著更改thread數目

優化4：POSIX threads

參考資料：

tags: `system embedded` `HW1-2`

2016q3 Homework1 (raytracing)

Reviewed by TempoJiJi

目標

Gprof

Gprof V.S. Perf

Ray Tracing

優化

優化方法：

背景知識：

優化1： loop unrolling

優化2：force inline

優化3：OpenMP

試著更改thread數目

優化4：POSIX threads

參考資料：

tags: system embedded HW1-2

Reviewed by `TempoJiJi`

tags: `system embedded` `HW1-2`