
2016q3 Homework1 (raytracing)

contributed by <nekoneko>

tags: sys2016 nekoneko homework

Reviewed by ChenYi

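  • Confusing git commit message: the last commit is "SIMD dot_product solved-1", while the actual code is "done"
    • Also, the -1 and -2 suffixes leave the meaning unclear
  • Try analyzing the cache-misses after the struct change, and use charts to show the differences in speed
  • In the SIMD code, please avoid variable names that begin with an underscore, to stay clear of naming convention problems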

gprof

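Usage

  • Compiling with gcc -pg inserts a call to mcount into every function
  • Profiling must be enabled at both compile time and link time (CFLAGS=-pg, LDFLAGS=-pg)
  • -pg can also be passed when compiling and linking in a single step (gcc -o)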
    cc -o myprog myprog.c utils.c -g -pg
  • When invoking ld (the linker) directly, you must specify the profiling startup file gcrt0.o, link against the profiling C library libc_p.a, and change the link option to -lc_p
    ld -o myprog /lib/gcrt0.o myprog.o utils.o -lc_p
  • If you run into problems with the profiling support code in a shared library being called before that library has been fully initialised, the fix is to link statically against a library that contains the profiling support code.
    gcc -g -pg -static-libgcc myprog.c utils.c -o myprog

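I don't understand why the documentation says this; recording it here for future reference. (cheng hung lin)

  • -pg can be added only to the modules you want to profile
  • Running the test program produces gmon.out; for gmon.out to be written properly, the program must exit normally, by returning from main or calling exit
  • gmon.out is written to the directory the program is in when it runs (its current working directory)
  • bb.out: Unfortunately, the appearance of a human-readable bb.out means the basic-block counts didn't get written into gmon.out.
  • default executable file: a.out, default profile data file: gmon.out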
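The notes above can be tried on a tiny program before profiling the ray tracer. The following is a minimal sketch, not part of the raytracing sources: the file name demo.c and the functions sum_to and run_demo are made up for illustration. It assumes run_demo() is called from main() and that main() returns normally, which is the condition for gmon.out to be written.

```c
/* demo.c (hypothetical file name)
 * Build:   gcc -pg -g demo.c -o demo
 * Run:     ./demo                  (writes gmon.out to the current directory)
 * Report:  gprof -p -b demo gmon.out
 */
#include <stdio.h>

/* Deliberately simple loop so the function shows up in the profile. */
unsigned long long sum_to(unsigned long long n)
{
    unsigned long long total = 0;
    for (unsigned long long i = 1; i <= n; i++)
        total += i;
    return total;
}

/* Hypothetical entry point: call it from main() and return normally,
 * so the profiling runtime gets the chance to write gmon.out. */
int run_demo(void)
{
    printf("%llu\n", sum_to(100000ULL));
    return 0;
}
```

If the program is killed by a signal instead of returning, no gmon.out is expected to appear.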

Options

  • -Q, --no-graph: do not print the call graph
  • -q, --graph: print the call graph
  • -p, --flat-profile: print the flat profile
  • -A : Ref

flat profile

  • Time spent in each function
  • Number of times each function was called

call graph

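Shows the call relationships of every function: which functions it calls and which functions call it

What it can reveal

  • Where the program spent its time
  • How functions call one another
  • Code sections that are slower than expected
  • How many times each function is called
  • Bugs that had gone unnoticed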

graphviz

gprof2dot

Following classmate Chen Yi's shared notes, I use gprof2dot here.

$ sudo apt-get install python3-pip
$ pip3 install gprof2dot

Whether installed with the script or with pip, it would not run. (cheng hung lin)
Exception: using gprof2dot.py as a module is unsupported

raytracing

  • Build and run
$ make clean
$ make PROFILE=1
$ ./raytracing
$ gprof raytracing | gprof2dot | dot -Tpng -o output.png

For why PROFILE is set, see lines 11 to 15 of the Makefile.

  • Results

  • Output

  • 11.650036 sec

# Rendering scene
Done!
Execution time of raytracing() : 11.650036 sec
  • flat profile
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 23.26      1.46     1.46 69646433     0.00     0.00  dot_product
 18.32      2.61     1.15 56956357     0.00     0.00  subtract_vector
  9.40      3.20     0.59 31410180     0.00     0.00  multiply_vector
  8.29      3.72     0.52 17836094     0.00     0.00  add_vector
  7.01      4.16     0.44 13861875     0.00     0.00  rayRectangularIntersection
  5.90      4.53     0.37 17821809     0.00     0.00  cross_product
  5.74      4.89     0.36 13861875     0.00     0.00  raySphereIntersection
  5.74      5.25     0.36 10598450     0.00     0.00  normalize
  3.98      5.50     0.25  4620625     0.00     0.00  ray_hit_object
  1.75      5.61     0.11  4221152     0.00     0.00  multiply_vectors
  1.67      5.72     0.11  2110576     0.00     0.00  compute_specular_diffuse
  1.67      5.82     0.11  2110576     0.00     0.00  localColor
  1.12      5.89     0.07  1048576     0.00     0.00  ray_color
  1.12      5.96     0.07        1     0.07     6.26  raytracing
  0.96      6.02     0.06  2520791     0.00     0.00  idx_stack_top
  0.96      6.08     0.06  3838091     0.00     0.00  length
  0.96      6.14     0.06  1048576     0.00     0.00  rayConstruction
  0.64      6.18     0.04  1241598     0.00     0.00  refraction
  0.40      6.21     0.03  2558386     0.00     0.00  idx_stack_empty
  0.40      6.23     0.03  1204003     0.00     0.00  idx_stack_push
  0.32      6.25     0.02        1     0.02     0.02  delete_sphere_list
  0.24      6.27     0.02  1241598     0.00     0.00  reflection
  0.16      6.28     0.01  1241598     0.00     0.00  protect_color_overflow
  0.00      6.28     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      6.28     0.00   113297     0.00     0.00  fresnel
  0.00      6.28     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      6.28     0.00        3     0.00     0.00  append_rectangular
  0.00      6.28     0.00        3     0.00     0.00  append_sphere
  0.00      6.28     0.00        2     0.00     0.00  append_light
  0.00      6.28     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      6.28     0.00        1     0.00     0.00  delete_light_list
  0.00      6.28     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      6.28     0.00        1     0.00     0.00  diff_in_second
  0.00      6.28     0.00        1     0.00     0.00  write_to_ppm
  • branches and branch-misses
     4,785,429,867      branches                                                    
        26,321,418      branch-misses             #    0.55% of all branches        

      12.091801958 seconds time elapsed
  • gprof2dot
+------------------------------+
|        function name         |
| total time % ( self time % ) |
|         total calls          |
+------------------------------+

Analysis

  • Functions declared in math-toolkit.h
    • normalize
    • length
    • add_vector
    • subtract_vector
    • multiply_vectors
    • multiply_vector
    • cross_product
    • dot_product
    • scalar_triple_product
    • scalar_triple

Optimization

Loop Unrolling

  • Functions that contain a loop
    • dot_product 1.46s
    • subtract_vector 1.15s
    • multiply_vector 0.59s
    • add_vector 0.52s
    • multiply_vectors 0.11s
    • scalar_triple_product -> cross_product, multiply_vectors
    • scalar_triple -> cross_product, dot_product
  • Observations
    1. Neither scalar_triple_product nor scalar_triple appears in the flat profile
    2. subtract_vector(1.15) > multiply_vector(0.59) > multiply_vectors(0.11)
    • Multiplication actually comes out faster than subtraction (reason unknown)
    • multiply_vector > multiply_vectors (reason unknown)
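One caveat when comparing the totals above: the "self seconds" column is not normalized by call count, and the three functions are called very different numbers of times. A rough sketch of the per-call arithmetic follows; the helper name ns_per_call is made up for illustration, and gprof's 0.01 s sampling makes the results approximate at best.

```c
#include <assert.h>

/* Average self time per call in nanoseconds, computed from gprof's
 * "self seconds" and "calls" columns. */
double ns_per_call(double self_seconds, unsigned long long calls)
{
    assert(calls > 0);   /* a function that was never called has no average */
    return self_seconds * 1e9 / (double)calls;
}
```

With the numbers from the flat profile, ns_per_call(1.15, 56956357) is about 20 ns for subtract_vector, ns_per_call(0.59, 31410180) about 19 ns for multiply_vector, and ns_per_call(0.11, 4221152) about 26 ns for multiply_vectors. Per call, the three are much closer than the totals suggest.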

Running $ perf record -F 12500 -e cycles ./raytracing && perf report and using Annotate would explain it, but I cannot read x86_64 assembly >< (cheng hung lin)

  • After loop unrolling: 9.119676 s
# Rendering scene
Done!
Execution time of raytracing() : 9.119676 sec

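For dot_product, besides unrolling the loop, the three lines were merged into a single expression (this came from someone else's shared notes, but I have forgotten whose QQ)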
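As an illustration of the change described here, the sketch below puts a loop version next to an unrolled, single-expression version. The loop body is an assumption about what the original math-toolkit.h looked like, and both function names are made up so the two variants can coexist.

```c
/* Assumed original shape: accumulate over the three components. */
double dot_product_loop(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

/* Unrolled and collapsed into one expression: no loop counter,
 * no comparison and no branch per component. */
double dot_product_unrolled(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
```

Removing the loop also removes its conditional branch, which fits the drop in the branches counter reported by perf.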

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 15.52      0.56     0.56 56956357     0.00     0.00  subtract_vector
 13.17      1.04     0.48 13861875     0.00     0.00  rayRectangularIntersection
 12.89      1.50     0.47 69646433     0.00     0.00  dot_product
 12.47      1.95     0.45 10598450     0.00     0.00  normalize
  7.48      2.22     0.27 17821809     0.00     0.00  cross_product
  7.35      2.49     0.27 13861875     0.00     0.00  raySphereIntersection
  6.24      2.71     0.23 31410180     0.00     0.00  multiply_vector
  6.10      2.93     0.22  4620625     0.00     0.00  ray_hit_object
  4.43      3.09     0.16 17836094     0.00     0.00  add_vector
  2.49      3.18     0.09  2110576     0.00     0.00  compute_specular_diffuse
  2.49      3.27     0.09  1048576     0.00     0.00  ray_color
  1.94      3.34     0.07  4221152     0.00     0.00  multiply_vectors
  1.66      3.40     0.06  2520791     0.00     0.00  idx_stack_top
  1.39      3.45     0.05  3838091     0.00     0.00  length
  1.11      3.49     0.04  2110576     0.00     0.00  localColor
  1.11      3.53     0.04        1     0.04     3.61  raytracing
  0.55      3.55     0.02  1241598     0.00     0.00  refraction
  0.55      3.57     0.02  1204003     0.00     0.00  idx_stack_push
  0.55      3.59     0.02  1048576     0.00     0.00  rayConstruction
  0.28      3.60     0.01  1241598     0.00     0.00  reflection
  0.28      3.61     0.01  1048576     0.00     0.00  idx_stack_init
  0.00      3.61     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      3.61     0.00  1241598     0.00     0.00  protect_color_overflow
  0.00      3.61     0.00   113297     0.00     0.00  fresnel
  0.00      3.61     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      3.61     0.00        3     0.00     0.00  append_rectangular
  0.00      3.61     0.00        3     0.00     0.00  append_sphere
  0.00      3.61     0.00        2     0.00     0.00  append_light
  0.00      3.61     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      3.61     0.00        1     0.00     0.00  delete_light_list
  0.00      3.61     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      3.61     0.00        1     0.00     0.00  delete_sphere_list
  0.00      3.61     0.00        1     0.00     0.00  diff_in_second
  0.00      3.61     0.00        1     0.00     0.00  write_to_ppm
  • branches and branch-misses
    • branches: reduced by 909,897,692 (4,785,429,867 -> 3,875,532,175)
 Performance counter stats for './raytracing':

     3,875,532,175      branches                                                    
        26,090,737      branch-misses             #    0.67% of all branches        

       9.382581189 seconds time elapsed
  • Function times
    • dot_product 0.47s
    • subtract_vector 0.56s
    • multiply_vector 0.23s
    • add_vector 0.16s
    • multiply_vectors 0.07s

force inline

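  • __attribute__: declares annotations that help the compiler optimize the code.
  • Usage: void foo () __attribute__((always_inline)); (Ref) (not yet confirmed; I could not follow the original documentation)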
static inline __attribute__((always_inline))
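  • Small experiment
    Motivation: with inline functions, gprof no longer shows the function calls, which makes the math-toolkit.h functions harder to analyze. So, from a different angle: can the number of instructions saved by inlining be measured?

    $ perf stat -r 1 -e instructions ./raytracing  # get total instructions
    $ gprof -b raytracing  # get function call counts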
    
    • Taking normalize as an example
                 instructions     function calls   time (sec)
    non-inline   25,758,523,927   10598450         9.157607
    inline       25,121,100,243   -                8.967475

    Because the inline build records no call count for normalize, the count from the non-inline build is reused

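    • (25,758,523,927 - 25,121,100,243)/10,598,450 = 60.14 instructions per normal (non-inline) function call
    • Open questions:
      I do not know what the figure 60.14 actually signifies
      • I do not understand the practical difference between inlining and a function call
      • I do not know whether gprof itself affects function calls
  • All math-toolkit.h functions changed to inline functions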

# Rendering scene
Done!
Execution time of raytracing() : 4.762099 sec
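For reference, here is a minimal sketch of how the always_inline form from this section attaches to a function definition. The function names are made up for illustration and are not those in math-toolkit.h. The attribute is a GCC/Clang extension; it asks for inlining even when the build does not optimize, whereas a plain inline is typically ignored in that case.

```c
/* Force inlining regardless of the optimization level (GCC/Clang extension). */
static inline __attribute__((always_inline))
double dot_product_inline(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}

/* Hypothetical caller: once the body above is inlined here, no call to
 * dot_product_inline remains, so gprof has no call to count. */
double squared_length(const double *v)
{
    return dot_product_inline(v, v);
}
```

This is also why the inline row in the table above has no call count: the function no longer exists as a separate call target.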

SIMD

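  • CPU model: Intel i3 330M
  • Supports up to SSE4.1/4.2
  • Following the shared notes and the SPEC
    • For variables of type __m128d and __m128i, whether local or global, the compiler aligns them on a 16-byte boundary (on the stack, for locals); this can be changed with an attribute: __attribute__((aligned(n[, offset])))

I was not sure where to put it, so in the end I left it out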

__m128d _v1;
__m128d _v2;
double _v_tmp[3];
_v1 = _mm_loadu_pd(v1);
_v2 = _mm_loadu_pd(v2);
_v1 = _mm_mul_pd(_v1, _v2);
_mm_store_pd(_v_tmp, _v1);
return _v_tmp[0] + _v_tmp[1] + v1[2]*v2[2];
  • Results
# Rendering scene
Done!
Execution time of raytracing() : 9.730340 sec
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 28.75      1.18     1.18 69646433     0.00     0.00  dot_product
 11.25      1.64     0.46 13861875     0.00     0.00  rayRectangularIntersection
 10.52      2.07     0.43 10598450     0.00     0.00  normalize
  9.54      2.46     0.39 13861875     0.00     0.00  raySphereIntersection
  8.68      2.81     0.36 56956357     0.00     0.00  subtract_vector
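A self-contained variant of the _mm_loadu_pd approach is sketched below, with names that do not start with an underscore, as the reviewer suggested. It is an illustration rather than the code that produced the timings above: the function name is made up, the store uses the unaligned _mm_storeu_pd so that no alignment assumption is placed on the temporary, and a scalar fallback is included for targets without SSE2.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_loadu_pd, ... */
#endif

double dot_product_simd(const double *a, const double *b)
{
#if defined(__SSE2__)
    /* Load the first two components of each vector; loadu/storeu
     * place no alignment requirement on the pointers. */
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    __m128d prod = _mm_mul_pd(va, vb);   /* {a0*b0, a1*b1} */

    double lanes[2];
    _mm_storeu_pd(lanes, prod);

    /* The third component does not fit in the 128-bit register. */
    return lanes[0] + lanes[1] + a[2] * b[2];
#else
    /* Scalar fallback for targets without SSE2. */
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
#endif
}
```

Only two of the three multiplications are vectorized, and the result has to travel through memory before the final additions, which may be part of why this version is not faster than the unrolled scalar code.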
  • Trying to add aligned
__m128d _v1 __attribute__((aligned(64)));
__m128d _v2 __attribute__((aligned(64)));
double _v_tmp[3] __attribute__((aligned(64)));

The result was a segmentation fault. Using gdb's bt command:

#0  0x00000000004015c2 in _mm_load_pd (__P=0x7fffffffd9a8) at /usr/lib/gcc/x86_64-linux-gnu/5/include/emmintrin.h:119
#1  dot_product (v1=0x7fffffffd9a8, v2=0x7fffffffdd00) at math-toolkit.h:72
#2  0x0000000000401ee0 in rayRectangularIntersection (ray_e=0x7fffffffd940, ray_d=0x7fffffffdd00, rec=0x60bbf0, ip=0x7fffffffd990, 
    t1=0x7fffffffd8f8) at raytracing.c:121
#3  0x00000000004028a5 in ray_hit_object (e=0x4043c0 <view>, d=0x7fffffffdd00, t0=0, t1=59.948704218900112, rectangulars=0x60bbf0, 
    hit_rectangular=0x7fffffffda78, spheres=0x60ba70, hit_sphere=0x7fffffffda88) at raytracing.c:264
#4  0x0000000000402f94 in ray_color (e=0x4043c0 <view>, t=0, d=0x7fffffffdd00, stk=0x7fffffffdd40, rectangulars=0x60bbf0, 
    spheres=0x60ba70, lights=0x60b9d0, object_color=0x7fffffffdd20, bounces_left=3) at raytracing.c:367
#5  0x0000000000403835 in raytracing (pixels=0x7ffff7f1b010 "", background_color=0x7fffffffdec0, rectangulars=0x60bbf0, 
    spheres=0x60ba70, lights=0x60b9d0, view=0x4043c0 <view>, width=512, height=512) at raytracing.c:481
#6  0x0000000000403caa in main () at main.c:52

math-toolkit.h:72

_v1 = _mm_load_pd(v1);

Could the problem be the arguments passed in as const double *v1 and const double *v2?
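The backtrace supports this suspicion. _mm_load_pd requires its pointer to be 16-byte aligned, and the address in frame #0, 0x7fffffffd9a8, ends in 8, so it is not a multiple of 16 (v2, 0x7fffffffdd00, is). Aligning the locals inside dot_product does not change where the caller's arrays live. A small helper, with a made-up name, makes the check explicit:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when p sits on a 16-byte boundary, as _mm_load_pd requires. */
bool is_aligned16(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;
}
```

Such a check could be asserted at the top of dot_product while debugging, or the unaligned _mm_loadu_pd can be used when the callers' alignment cannot be guaranteed.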

  • Copying the arguments into aligned arrays
double _v1_tmp[3] __attribute__((aligned(64)));
double _v2_tmp[3] __attribute__((aligned(64)));
memcpy(_v1_tmp, v1, sizeof(double)*3 );
memcpy(_v2_tmp, v2, sizeof(double)*3 );
  • Result: it runs to completion, but takes even longer
# Rendering scene
Done!
Execution time of raytracing() : 10.993178 sec
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 40.74      2.11     2.11 69646433     0.00     0.00  dot_product
 10.45      2.65     0.54 13861875     0.00     0.00  rayRectangularIntersection
  6.77      3.00     0.35 13861875     0.00     0.00  raySphereIntersection
  6.68      3.34     0.35 56956357     0.00     0.00  subtract_vector
  • Another approach: add aligned to the declarations of the variables that are passed into dot_product
$ grep -n 'dot_product' *

In the end I directly tried modifying primitives.h and math-toolkit.h

typedef double point3[3] __attribute__((aligned(64))); <--
...
typedef struct {
    point4 vertices[4]; <-- 34
    point3 normal;
    object_fill rectangular_fill;
} rectangular;

__m128d _v1;
__m128d _v2;
double _v_tmp[3];
_v1 = _mm_load_pd(v1);
_v2 = _mm_load_pd(v2);
_v1 = _mm_mul_pd(_v1, _v2);
_mm_store_pd(_v_tmp, _v1);
return _v_tmp[0] + _v_tmp[1] + v1[2]*v2[2];

Result:

Execution time of raytracing() : 9.984913 sec

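Yes, overall it is slower than the _mm_loadu_pd version!
However, the time spent in dot_product did decrease. Over several more runs, the times obtained for the _mm_loadu_pd and aligned versions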

jacklin@jacklin-CX420-CX420-MX:[~/Documents/sys2016/week1/raytracing] (master) 0h24m $ gprof -p -b raytracing
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name    
 22.12      1.09     1.09 69646433     0.00     0.00  dot_product
 12.79      1.72     0.63 56956357     0.00     0.00  subtract_vector
 12.08      2.32     0.60 13861875     0.00     0.00  rayRectangularIntersection
  9.03      2.76     0.45 10598450     0.00     0.00  normalize
  8.02      3.16     0.40 13861875     0.00     0.00  raySphereIntersection
  7.10      3.51     0.35 31410180     0.00     0.00  multiply_vector
  6.90      3.85     0.34 17821809     0.00     0.00  cross_product
  6.29      4.16     0.31  4620625     0.00     0.00  ray_hit_object
  3.45      4.33     0.17  1048576     0.00     0.00  ray_color
  3.04      4.48     0.15 17836094     0.00     0.00  add_vector
  2.03      4.58     0.10  2110576     0.00     0.00  compute_specular_diffuse
  1.62      4.66     0.08  4221152     0.00     0.00  multiply_vectors
  1.42      4.73     0.07  1048576     0.00     0.00  rayConstruction
  0.81      4.77     0.04  2110576     0.00     0.00  localColor
  0.81      4.81     0.04        1     0.04     4.93  raytracing
  0.61      4.84     0.03  1241598     0.00     0.00  reflection
  0.61      4.87     0.03  1241598     0.00     0.00  refraction
  0.41      4.89     0.02  3838091     0.00     0.00  length
  0.41      4.91     0.02  1204003     0.00     0.00  idx_stack_push
  0.20      4.92     0.01  2520791     0.00     0.00  idx_stack_top
  0.20      4.93     0.01  1241598     0.00     0.00  protect_color_overflow
  0.10      4.93     0.01        1     0.01     0.01  delete_sphere_list

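Also, looking back at this code, the change seems to be wrong. I did not know why the original failed to compile, so I later changed it to

	point3 vertices[4] --> point4 vertices[4];

The original fails to compile: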

In file included from objects.c:4:0:
primitives.h:32:5: error: alignment of array elements is greater than element size
     point3 vertices[4];
     ^
Makefile:24: recipe for target 'objects.o' failed

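Looking at models.inc makes it clear that the original point3 vertices[4] was meant to be a 3 x 4 2D array (four vertices of three coordinates each), so this way of changing rectangular was wrong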
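The compile error has a simple cause: a double[3] is 24 bytes, and array elements are laid out back to back, so elements 24 bytes apart cannot all sit on 64-byte boundaries. One possible way around it, sketched below under assumptions, is to wrap the three doubles in a struct whose size gets padded up to a multiple of its alignment. The type names are hypothetical, the alignment is 16 bytes (what _mm_load_pd needs) rather than 64, and this is not the layout used in primitives.h.

```c
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical wrapper type: 24 bytes of data, aligned to 16 bytes. */
typedef struct {
    alignas(16) double v[3];
} vec3_aligned;   /* sizeof is padded from 24 up to 32 */

/* Arrays of the wrapper are legal because sizeof(vec3_aligned) is a
 * multiple of its alignment, so every element starts on a 16-byte boundary. */
typedef struct {
    vec3_aligned vertices[4];
    vec3_aligned normal;
} rectangular_aligned;
```

The cost is 8 bytes of padding per vector and a change at every place that treats a point as a bare double[3], so this is a larger refactoring than the typedef edit attempted above.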

References