# 2017 q1 Homework1 (raytracing) contributed by <`zmke`> ### Reviewed by `twzjwang` - multithreads 下使用 gprof ,時間卻增加: `gprof` is poorly suited for analysis of parallel programs as it doesn't understand the intricacies of OpenMP. [gprof on both OpenMP and without OpenMP codes produces different flat profile](http://stackoverflow.com/questions/36919518/gprof-on-both-openmp-and-without-openmp-codes-produces-different-flat-profile) - 可用其他方法實作平行化,如 Pthread ### Reviewed by `henry0929016816` * 比較了 thread 數的差異 ,這一點我覺得做的很好 * 可以試試看 SIMD ## 開發環境 OS: Ubuntu 16.04 LTS Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 Model name: Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 3072K ## gprof * 可以印出程式執行時各函數消耗的函數、 funtion 之間互相呼叫的關係,幫助 programmer 分析出消耗較多時間的函數,找出當作改善效能的方向,尤其是對大型程式來說是相當實用的工具 * `$ make PROFILE=1`: 加入 -pg , compiler 會在程式中間插入額外的程式碼,執行後會產生 gmon.out * `$ gprof ./raytracing | less` ## 原始版本 * 直接執行 clone 下來的程式 `$ make` `$ ./raytracing` ``` # Rendering scene Done! Execution time of raytracing() : 2.499698 sec ``` * 使用 gprof 進行分析 `$ make PROFILE=1` `$ ./raytracing` ``` # Rendering scene Done! Execution time of raytracing() : 5.441733 sec ``` compiler 在程式中加入額外程式碼,執行時間增加了 `$ gprof ./raytracing | less` ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 25.02 0.51 0.51 69646433 0.00 0.00 dot_product 14.71 0.81 0.30 56956357 0.00 0.00 subtract_vector 11.28 1.04 0.23 31410180 0.00 0.00 multiply_vector 10.06 1.25 0.21 13861875 0.00 0.00 rayRectangularIntersection 7.85 1.41 0.16 17836094 0.00 0.00 add_vector 7.36 1.56 0.15 10598450 0.00 0.00 normalize 5.89 1.68 0.12 4620625 0.00 0.00 ray_hit_object 4.41 1.77 0.09 17821809 0.00 0.00 cross_product 3.19 1.83 0.07 13861875 0.00 0.00 raySphereIntersection 2.45 1.88 0.05 1048576 0.00 0.00 ray_color 1.96 1.92 0.04 2110576 0.00 0.00 compute_specular_diffuse 1.96 1.96 0.04 1 0.04 2.04 raytracing ``` 可以清楚看到時間集中在 dot_product 、 subtract_vector 、 multiply_vector ,這些 function 主要用來進行三維向量運算,可以在 `math-toolkit.h` 找到定義 * 用 perf 來分析 branch miss ``` Performance counter stats for './raytracing' (10 runs): 19,296,540 branch-misses # 0.40% of all branches ( +- 0.43% ) 4,782,031,932 branches ( +- 0.00% ) 32,981,345,906 instructions # 1.90 insn per cycle ( +- 0.00% ) 17,360,336,322 cycles ( +- 0.22% ) 5.137705397 seconds time elapsed ( +- 0.22% ) ``` ## compiler 優化 * 實際動手之前先來看看 compiler 到底有多強大,更改優化選項 `-Ofast` `$ make PROFILE=1` `$ ./raytracing` ``` # Rendering scene Done! Execution time of raytracing() : 0.518435 sec ``` 這效果... `$ gprof ./raytracing | less` ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ns/call ns/call name 52.21 0.24 0.24 4620625 51.98 51.98 ray_hit_object 21.76 0.34 0.10 2110576 47.42 47.42 compute_specular_diffuse 10.88 0.39 0.05 592239 84.49 408.09 ray_color 10.88 0.44 0.05 raytracing 2.18 0.45 0.01 2110576 4.74 4.74 localColor 2.18 0.46 0.01 1241598 8.06 8.06 refraction 0.00 0.46 0.00 113297 0.00 0.00 fresnel ``` * 發現剛剛看到最花時間的 function 都沒在表上,應該是 inline 發揮作用了,在 function 的定義前加入 inline 提示 compiler 在編譯的時候就將 function 直接展開,減少呼叫 function 所需要成本 ## Method1 - force inline * 實驗將 `math-toolkit.h` 裡面的 inline 改成 `__attribute__((always_inline)) inline` ,讓 compiler 在關閉編譯最佳化的情形下將 function inline `$ make PROFILE=1` `$ ./raytracing` ``` # Rendering scene Done! Execution time of raytracing() : 2.967841 sec ``` `$ gprof ./raytracing | less` ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 40.48 0.87 0.87 13861875 0.00 0.00 rayRectangularIntersection 18.61 1.27 0.40 13861875 0.00 0.00 raySphereIntersection 10.93 1.51 0.24 2110576 0.00 0.00 compute_specular_diffuse 7.91 1.68 0.17 1048576 0.00 0.00 ray_color 6.51 1.82 0.14 2110576 0.00 0.00 localColor 6.28 1.95 0.14 4620625 0.00 0.00 ray_hit_object 4.89 2.06 0.11 1048576 0.00 0.00 rayConstruction 1.86 2.10 0.04 1241598 0.00 0.00 reflection 1.40 2.13 0.03 1 0.03 2.15 raytracing ``` * 時間從 5.4417 sec 下降到 2.9678 sec ,從 gprof 的報表上也看不到剛剛在 `math-tool-kit.h` force inline 的 function ## Method2 - loop unrolling * 將迴圈展開,減少 branch misses 帶來的成本,不過可能會造成程式碼變冗長、可讀性降低的問題,所以通常是由 compiler 來完成 * 實驗將 `math-toolkit.h` function 內的 loop 展開 `$ make PROFILE=1` `$ ./raytracing` ``` # Rendering scene Done! Execution time of raytracing() : 4.346391 sec ``` `$ gprof ./raytracing | less` ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 16.03 0.21 0.21 69646433 0.00 0.00 dot_product 15.63 0.41 0.20 56956357 0.00 0.00 subtract_vector 14.85 0.60 0.19 10598450 0.00 0.00 normalize 8.99 0.71 0.12 17821809 0.00 0.00 cross_product 7.82 0.81 0.10 13861875 0.00 0.00 rayRectangularIntersection 6.64 0.90 0.09 13861875 0.00 0.00 raySphereIntersection 6.25 0.98 0.08 1048576 0.00 0.00 ray_color 4.69 1.04 0.06 4620625 0.00 0.00 ray_hit_object 3.91 1.09 0.05 1048576 0.00 0.00 rayConstruction 3.52 1.13 0.05 31410180 0.00 0.00 multiply_vector 2.74 1.17 0.04 17836094 0.00 0.00 add_vector 2.35 1.20 0.03 2110576 0.00 0.00 localColor 1.95 1.22 0.03 4221152 0.00 0.00 multiply_vectors 1.56 1.24 0.02 1 0.02 1.27 raytracing ``` * 時間從 5.4417 sec 下降到 4.3464 sec ``` * dot_product: 0.51 -> 0.21 * subtract_vector: 0.30 -> 0.20 * multify_vector: 0.23 -> 0.05 * add_vector: 0.16 -> 0.04 ``` * 用 perf 來分析 branch miss ``` Performance counter stats for './raytracing' (10 runs): 16,359,752 branch-misses # 0.42% of all branches ( +- 0.69% ) 3,873,917,856 branches ( +- 0.00% ) 25,746,564,892 instructions # 1.76 insn per cycle ( +- 0.00% ) 14,607,513,939 cycles ( +- 0.23% ) 4.322380613 seconds time elapsed ( +- 0.27% ) ``` * branch misses 從 19,197,236 -> 16,387,279 * branches 從 4,782,031,932 -> 3,873,917,856 ,減少了9億次!! * 從實驗結果可以看出, loop unrolling 對於改善 branch miss 是有效的 * 測試不加上 -pg 的情形,時間縮短 (2.50 -> 1.68) `$ make clean` `$ make` `$ ./raytracing` ``` # Rendering scene Done! Execution time of raytracing() : 1.682859 sec ``` ## Method3 - OpenMP 剛好上學期有修平行程式設計,上課有提過 OpenMP ,把投影片找出來複習一下 * An API for shared-memory parallel programming * Designed for systems in which each thread or process can potentially have access to all available memory * System is viewed as a collection of cores or CPU’s, all of which have access to main memory * 以 loop unrolling 的版本為基礎,針對 raytracing() 做平行化 ### (1) default schedule ``` == void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height) { point3 u, v, w, d; color object_color = { 0.0, 0.0, 0.0 }; /* calculate u, v, w */ calculateBasisVectors(u, v, w, view); idx_stack stk; int factor = sqrt(SAMPLES); #pragma omp parallel for num_threads (thread_count) \ default(shared) private(stk, d, object_color) for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { double r = 0, g = 0, b = 0; /* MSAA */ for (int s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(d, u, v, w, i * factor + s / factor, j * factor + s % factor, view, width * factor, height * factor); if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres, lights, object_color, MAX_REFLECTION_BOUNCES)) { r += object_color[0]; g += object_color[1]; b += object_color[2]; } else { r += background_color[0]; g += background_color[1]; b += background_color[2]; } pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES; } } } } ``` **thread_count = 4** `$ make PROFILE=1` `$ ./raytracing` ``` # Rendering scene Done! Execution time of raytracing() : 7.930475 sec ``` 時間比只有 loop unrolling 增加了 (4.35 -> 7.93) 先不使用 gprof 測試看看 `$ make clean` `$ make` `$ ./raytracing` ``` # Rendering scene Done! Execution time of raytracing() : 0.818154 sec ``` 和 loop unrolling 不加上 -pg 的情況相比時間下降了 (1.68 -> 0.89) ,推測在 OpenMP multi threads 的情況下, -pg 對程式帶來的成本也跟著增加了 * 比較不同 thread 數量對程式加速的效果,隨著 thread 增加,加速的比例也逐漸降低,當 thread_count 超過 16 之後, creat thread 和 join 的成本已經比速度提升還要高,反而失去加速的效果 ``` thread_count = 1 1.720003 sec thread_count = 2 0.955195 sec thread_count = 4 0.818154 sec thread_count = 8 0.767247 sec thread_count = 16 0.743067 sec thread_count = 32 0.801165 sec ``` ### (2) cyclic schedule * static schedule : the iterations can be assigned to the thread before the loop is executed * 避免程式運算熱點分佈不均,利用 static schedule ,將raytracing() 作 cyclic 切割 * 加上 `schedule(static, 1)` ``` == #pragma omp parallel for num_threads (thread_count) \ default(shared) private(stk, d, object_color) schedule(static, 1) ``` **thread_count = 4** `$ make` `$ ./raytracing` ``` # Rendering scene Done! Execution time of raytracing() : 0.728800 sec ``` 在 thread_count = 4 的情形下,與 default schedule 相比,執行時間縮短 (0.818 -> 0.729) * cyclic schedule 可以將運算平均分配給各 thread ,平行加速效果比 default schedule 還要好 * 將以上3個實驗一起實作,得到的運算時間是 0.629104 sec ,只有原始版本的 25%,不過還是沒贏過 -Ofast ![](https://i.imgur.com/gj1OAnW.png) ## Reference [簡易的程式平行化-OpenMP](https://kheresy.wordpress.com/2006/09/15/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E5%9B%9B%EF%BC%89%E7%AF%84%E4%BE%8B-for/) [hugikun999 筆記](https://hackmd.io/s/HyHhgcv6) [LanKuDot 筆記](https://hackmd.io/s/rkggUeKa) 成大資訊系-多處理機平行程式設計 課程投影片