2017q1 Homework1 (raytracing)

# 2017q1 Homework1 (raytracing) contributed by <`chenweiii`> >> 必須先承認我看不太懂 raytracing 的程式是如何寫的、其背後的原理是什麼，只能先盡量閱讀別人共筆，改善 math-toolkit.h 部分，及學習如何使用 pThread 改善 [raytracing.c] raytracing。[time=Mon, Feb 27, 2017 4:59 PM] >> 剛剛才發現老師有附一位同學的共筆有解說 raytracing，繼續努力！[time=Mon, Feb 27, 2017 5:00 PM] ### Reviewed by `SeanCCC` * 請補上硬體資訊 ## Intel® Pentium® CPU B960 支援之 SIMD instruction set ``` MMX instructions SSE / Streaming SIMD Extensions SSE2 / Streaming SIMD Extensions 2 SSE3 / Streaming SIMD Extensions 3 SSSE3 / Supplemental Streaming SIMD Extensions 3 SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4 ``` ## 安裝 gprof2dot ``` $ git clone https://github.com/jrfonseca/gprof2dot $ cd gprof2dot $ python setup.py build $ python setup.py install ``` ## 觀察 ### Original Execution time | | 開 -pg |不開 -pg | | ------ | ----------- | ------- | | -O0 | 10.872s | 6.28s | | -Ofast | 2.32s | 2.24s | ### 使用 perf 觀察 branch-misses ``` Samples: 24K of event 'branch-misses', Event count (approx.): 11348775 Overhead Command Shared Object Symbol 35.42% raytracing raytracing [.] rayRectangularIntersection 13.11% raytracing raytracing [.] ray_color 10.72% raytracing raytracing [.] ray_hit_object 7.88% raytracing raytracing [.] raySphereIntersection 5.71% raytracing raytracing [.] refraction 5.70% raytracing raytracing [.] subtract_vector 4.35% raytracing raytracing [.] compute_specular_diffuse 3.33% raytracing raytracing [.] idx_stack_top 2.89% raytracing raytracing [.] multiply_vector 2.69% raytracing raytracing [.] add_vector 1.77% raytracing raytracing [.] raytracing 1.17% raytracing raytracing [.] length 0.92% raytracing raytracing [.] dot_product 0.80% raytracing raytracing [.] multiply_vectors 0.78% raytracing libm-2.23.so [.] __pow_finite 0.76% raytracing raytracing [.] localColor 0.63% raytracing raytracing [.] idx_stack_init ``` ### 使用 gprof 觀察 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 17.10 0.85 0.85 1241598 0.00 0.00 refraction 15.90 1.64 0.79 69646433 0.00 0.00 dot_product 11.27 2.20 0.56 31410184 0.00 0.00 multiply_vector 9.96 2.69 0.49 56956357 0.00 0.00 subtract_vector 6.94 3.04 0.34 10598450 0.00 0.00 normalize 6.44 3.36 0.32 4620625 0.00 0.00 ray_hit_object 6.04 3.66 0.30 13861875 0.00 0.00 rayRectangularIntersection 5.43 3.93 0.27 13861875 0.00 0.00 raySphereIntersection 4.53 4.16 0.23 17836094 0.00 0.00 add_vector 3.42 4.33 0.17 1048576 0.00 0.00 rayConstruction 2.82 4.46 0.14 17821809 0.00 0.00 cross_product 2.62 4.59 0.13 1048576 0.00 0.00 ray_color 1.61 4.67 0.08 4221152 0.00 0.00 multiply_vectors 1.31 4.74 0.07 3838091 0.00 0.00 length 1.01 4.79 0.05 2110576 0.00 0.00 compute_specular_diffuse 1.01 4.84 0.05 2110576 0.00 0.00 localColor 1.01 4.89 0.05 1 0.05 4.97 raytracin ``` 可以看到 refraction 時間最多，觀察在 call graph 的點，卻只有一條連出去到 dot_product 的線，但明明程式裡面也有執行 multiply_vector, subtract_vector，再重看一次文字版的 call graph 確實有執行到，但可能比例偏低，gprof2dot 便沒有把這些線畫出來了。 ### call graph ![](https://i.imgur.com/nTzjB2i.png) ### [raytracing.c] what's samples? * MSAA (MultiSampling Anti-Aliasing) * 試試看 SAMPLES = 1，每個形狀的邊邊感覺點凹凸不平的現象 ![](https://i.imgur.com/F0aK47E.png) ### [raytracing.c] raytracing * 觀察到 height, width 兩個迴圈的設計是一個個 row 計算 pixel，raytracing 此函數的 cache misses ratio 為 1.29%，我把改成一個個 column 計算後，cache misses ratio 增加至 3.54%。程式的設計一不小心便影響到執行的效能。 * 玩弄一下 * width / 2 ![](https://i.imgur.com/0OaNcG6.png) * height / 2 ![](https://i.imgur.com/JyMAs7C.png) * width / 2 && height / 2 ![](https://i.imgur.com/pd2gGWJ.png) * height increase by 5 ![](https://i.imgur.com/NnCyBIS.png) * height and width increase by 5 ![](https://i.imgur.com/rAdPE5K.png) * 小結：雖然不太清楚程式如何運作，但透過此實驗可得每個 pixel 應都是可以獨立計算。 ### 改善方向 - [x] loop unrolling - [x] force inline - [x] pthread - [ ] OpenMP (referenced: [HuangWenChen](https://hackmd.io/s/r1osNDuT)) - [ ] 練習用 gnuplot 製作各種圖 >> 想法：128bit 的 SIMD instruction，在一個 function 裡頂多能一次處理兩對 float 變數，剩下一對還是得自己運算。但如果累積兩個 function 的量便有六對 float 變數，可以一次執行三次的 SIMD instruction ## 改善方向一：loop unrolling 將 math-toolkit.h 裡有用到迴圈之 function 通通展開，效果良好。 ### 使用 perf 觀察 branch-misses ``` Samples: 17K of event 'branch-misses', Event count (approx.): 6951451, Thread: raytracing(12042) Overhead Command Shared Object Symbol ◆ 27.00% raytracing raytracing [.] rayRectangularIntersection ▒ 23.12% raytracing raytracing [.] ray_color ▒ 15.55% raytracing raytracing [.] raySphereIntersection ▒ 10.38% raytracing raytracing [.] ray_hit_object ▒ 8.56% raytracing raytracing [.] refraction ▒ 3.86% raytracing raytracing [.] compute_specular_diffuse ▒ 3.08% raytracing raytracing [.] raytracing ▒ 2.35% raytracing raytracing [.] idx_stack_top ▒ 1.11% raytracing raytracing [.] idx_stack_init ▒ 1.06% raytracing raytracing [.] multiply_vector ▒ 0.65% raytracing libm-2.23.so [.] __pow ▒ 0.61% raytracing raytracing [.] protect_color_overflow ▒ 0.60% raytracing libm-2.23.so [.] __pow_finite ▒ 0.60% raytracing raytracing [.] subtract_vector ▒ 0.57% raytracing raytracing [.] dot_product ``` * math-toolkit.h 的數學函數其 branch-misses 成功降低，但其它 raytracing 之函數其比例隨之上升。 * [Question] 不太了解為何 multiply_vector, subtract_vector 也會有 branch-misses，明明沒有 loop, if-else。下面摘錄 'perf Annotate multiply vector' 內容，**不太了解為何 mov 可以有大比例的 branch-misses**。 ``` │ static inline ▒ │ void multiply_vector(const double *a, double b, double *out) ▒ │ { ▒ │ push %ebp ▒ 1.12 │ mov %esp,%ebp ▒ │ sub $0x8,%esp ▒ │ mov 0xc(%ebp),%eax ▒ │ mov %eax,-0x8(%ebp) ▒ 0.56 │ mov 0x10(%ebp),%eax ▒ │ mov %eax,-0x4(%ebp) ▒ │ out[0] = a[0] * b; ▒ │ mov 0x8(%ebp),%eax ▒ │ fldl (%eax) ▒ 0.56 │ fmull -0x8(%ebp) ▒ 97.75 │ mov 0x14(%ebp),%eax ▒ │ fstpl (%eax) ``` ### 使用 gprof 觀察 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 24.11 0.89 0.89 1241598 0.00 0.00 refraction 14.44 1.42 0.53 31410181 0.00 0.00 multiply_vector 11.44 1.83 0.42 13861875 0.00 0.00 rayRectangularIntersection 7.90 2.12 0.29 10598450 0.00 0.00 normalize 6.81 2.38 0.25 69646433 0.00 0.00 dot_product 6.13 2.60 0.23 56956357 0.00 0.00 subtract_vector 5.99 2.82 0.22 4620625 0.00 0.00 ray_hit_object 4.36 2.98 0.16 13861875 0.00 0.00 raySphereIntersection 3.13 3.10 0.12 1048576 0.00 0.00 ray_color 2.72 3.19 0.10 1048576 0.00 0.00 rayConstruction 2.32 3.28 0.09 17836094 0.00 0.00 add_vector 2.32 3.37 0.09 17821809 0.00 0.00 cross_product 1.23 3.41 0.04 1 0.04 3.67 raytracing 1.09 3.45 0.04 3838091 0.00 0.00 length ``` 除 refraction 外，其它 function 時間都有下降。 ### 成果 ``` # Rendering scene Done! Execution time of raytracing() : 4.722194 sec ``` 時間從 6.28s -> 4.72s，改善 25%。 <br> ## 改善方向二：force inline 將 math-toolkit.h 的每個 function 加上 __attribute\__((always_inline)) 強制 inline ### 使用 gprof 觀察 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 27.81 1.09 1.09 13861875 0.00 0.00 rayRectangularIntersection 23.72 2.02 0.93 1241598 0.00 0.00 refraction 13.27 2.54 0.52 13861875 0.00 0.00 raySphereIntersection 9.44 2.91 0.37 2110576 0.00 0.00 compute_specular_diffuse 7.65 3.21 0.30 4620625 0.00 0.00 ray_hit_object 7.14 3.49 0.28 1048576 0.00 0.00 ray_color 5.10 3.69 0.20 2110576 0.00 0.00 localColor 4.08 3.85 0.16 1048576 0.00 0.00 rayConstruction 0.77 3.88 0.03 1241598 0.00 0.00 reflection 0.77 3.91 0.03 1 0.03 3.92 raytracing 0.26 3.92 0.01 1241598 0.00 0.00 protect_color_overflow ``` 因為強制 inline 的關係，math-toolkit.h 裡的數學函數便在 gprof 裡觀察不到了，也因此 raytracing.h 的各項函數時間因此上升。 ### 成果 ``` # Rendering scene Done! Execution time of raytracing() : 4.355438 sec ``` 時間較上一個版本下降約 0.3s。 <br> ## 改善方向三：使用 pthread ### 先準備練習參考材料： * [POSIX Threads Programming](https://computing.llnl.gov/tutorials/pthreads/) * [LanKuDot 共筆](https://hackmd.io/s/rkggUeKa) ### pthread 注意事項 * 一次只能傳一個參數，而且型態為 (void *)，故須設計一個 struct，將所需資料傳入。 * 記得將 argument 轉型。 ### 修改之內容 ```clike= /* raytracing.h */ typedef struct __RAYTRACING_ITEM { double *u, *v, *w; uint8_t *pixels; double *background_color; rectangular_node rectangulars; sphere_node spheres; light_node lights; const viewpoint *view; int factor; int from_height, to_height; int height; int width; } ritem; ``` * 值得注意的是 from_height, to_height，代表此 thread 所負責產生的 block。 * raytracing 修改為準備 thread 所需的 struct 內容並產生 thread。 * 新增 rayThread 做 raytracing 原本的工作。 ### 成果 ``` # Rendering scene Done! Execution time of raytracing() : 2.215274 sec ``` **從上個版本 4.355s -> 2.215s，改善 49%，比原始版本改善 65%，已比我的電腦開 -Ofast 還要來得好。** ### 不同 threads 個數之 Execution Time ![](https://i.imgur.com/gywcPYL.png) <br> ## 改善方向四：使用 pthread 嘗試不同策略 ### thread 主動競爭 row >> 這邊是看到某一位同學所提出的想法，但很抱歉的是我把分頁關掉了，reference 之後再補上 orz [time=Wed, Mar 1, 2017 9:53 PM] 練習使用 pthread_mutex_t，從 [POSIX Threads Programming](https://computing.llnl.gov/tutorials/pthreads/) 看到的是將 mutex 宣告為全域變數，但這邊我宣告為 raytracing 的區域變數再傳入 rayThread。 ```clike= while (j < rayInfo->height) { pthread_mutex_lock(rayInfo->_MUTEX); j = ++(*rayInfo->__CURRENT_HEIGHT); pthread_mutex_unlock(rayInfo->_MUTEX); if (j >= rayInfo->height) break; /* raytracing */ /* ... */ } ``` 這邊也使用一個 __CURRENT_HEIGHT，代表 thread 目前所要做的 row。 ``` # Rendering scene Done! Execution time of raytracing() : 2.272589 sec ``` 時間表現沒有比較優異 orz ### cyclic partition 參考 [twzjwang](https://hackmd.io/s/BJkx6f6Fe) 的方法 ```clike= for (int j = rayInfo->from_height; j < rayInfo->height; j += NUM_THREADS) ``` 稍微修改一下即可使用。 ``` # Rendering scene Done! Execution time of raytracing() : 2.202453 sec ``` 改善有限。