2017q1 Homework1 (raytracing)

# 2017q1 Homework1 (raytracing) contributed by<`zhanyangch`> ###### tags:`zhanyangch` `sysprog2017` `week1` ### Reviewed by `etc276` - 執行環境的資訊需要排版以利閱讀，另外可以考慮展示出程式碼時才使用 `clike=`，而展示 terminal 上的資訊不用 = 來顯示行數 - commit message 可以以大寫開頭，並將修改的部分寫在內文，例如[這次 commit]("https://github.com/zhanyangch/raytracing-1/commit/8e8a1a388e1e12f342e2607858a1a2225903d721") 可以在內文補上 "rewrite the function in math-toolkit.h by loop unrolling to optimization" - 可以使用 force inline 或 macro 等方法降低 function call 所花的時間 ## 執行環境 lscpu 的資訊 ``` $ lscpu Architecture: x86_64 CPU 作業模式： 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數：2 每通訊端核心數：2 Socket(s): 1 NUMA 節點： 1 供應商識別號： GenuineIntel CPU 家族： 6 型號： 42 Model name: Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz 製程： 7 CPU MHz： 855.421 CPU max MHz: 3500.0000 CPU min MHz: 800.0000 BogoMIPS: 5587.06 虛擬： VT-x L1d 快取： 32K L1i 快取： 32K L2 快取： 256K L3 快取： 4096K NUMA node0 CPU(s): 0-3 ``` ## 執行原始程式 * 使用 make PROFILE=1 ，加入 gprof 所需的編譯及連結參數，gprof 會在函數內加入 mcount，以繪製函數的調用圖，因此執行時間較長，但利用gprof 可以快速的讓我們找到程式改進的方向。 ```shell= $ make $ ./raytracing # Rendering scene Done! Execution time of raytracing() : 3.248656 sec $ make clean $ make PROFILE=1 ./raytracing # Rendering scene Done! Execution time of raytracing() : 6.375895 sec ``` * 將輸出檔案轉換成 .png ``` $ convert out.ppm out.png ``` ![](https://i.imgur.com/nvVTUir.png) * 使用 gprof 觀察，可以發現耗時的前幾名都是數學運算，試著去改善此部份。 ``` gprof -b raytracing gmon.out | less Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 24.41 0.61 0.61 69646433 0.00 0.00 dot_product 20.41 1.12 0.51 56956357 0.00 0.00 subtract_vector 12.21 1.43 0.31 31410180 0.00 0.00 multiply_vector 9.21 1.66 0.23 10598450 0.00 0.00 normalize 8.00 1.86 0.20 17836094 0.00 0.00 add_vector 5.40 1.99 0.14 13861875 0.00 0.00 rayRectangularIntersection 4.80 2.11 0.12 4620625 0.00 0.00 ray_hit_object 3.00 2.19 0.08 13861875 0.00 0.00 raySphereIntersection ``` ## loop unrolling * 觀察 math-toolkit.h，可以發現 dot_product 等向量運算的迴圈執行次數是固定的，因此我們可以將它預先展開，此方法可以不用 branch，降低執行時間。 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 27.52 0.55 0.55 69646433 0.00 0.00 dot_product 10.51 0.76 0.21 56956357 0.00 0.00 subtract_vector 10.51 0.97 0.21 10598450 0.00 0.00 normalize 8.76 1.15 0.18 13861875 0.00 0.00 rayRectangularIntersection 8.51 1.32 0.17 17821809 0.00 0.00 cross_product 8.25 1.48 0.17 13861875 0.00 0.00 raySphereIntersection 6.50 1.61 0.13 17836094 0.00 0.00 add_vector 4.50 1.70 0.09 31410180 0.00 0.00 multiply_vector 3.50 1.77 0.07 4620625 0.00 0.00 ray_hit_object 2.00 1.81 0.04 4221152 0.00 0.00 multiply_vectors 2.00 1.85 0.04 2110576 0.00 0.00 compute_specular_diffuse 2.00 1.89 0.04 1048576 0.00 0.00 ray_color 1.00 1.91 0.02 1241598 0.00 0.00 protect_color_overflow 1.00 1.93 0.02 1241598 0.00 0.00 refraction 1.00 1.95 0.02 1048576 0.00 0.00 rayConstruction 1.00 1.97 0.02 1 0.02 1.99 raytracing 0.50 1.98 0.01 2520791 0.00 0.00 idx_stack_top 0.50 1.99 0.01 2110576 0.00 0.00 localColor ``` ``` $ ./raytracing # Rendering scene Done! Execution time of raytracing() : 2.242073 sec $ make clean $ make PROFILE=1 ./raytracing # Rendering scene Done! Execution time of raytracing() : 5.370418 sec ``` ## 使用openmp * 在 openmp 範圍外的變數預設是共用的，對於需要在執行序間獨立的變數，可以使用 private(list)，num_threads (n) 使用 n 個 thread。 ```clike= #pragma omp parallel for num_threads (2) private(stk,object_color,d) for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { ... ``` * 在加入 gprof 的情況下，需要的時間會比原本的多，但在不加入 gprof 的情況下，時間從 2.24 降低至 1.22 。 ``` $ make $ ./raytracing # Rendering scene Done! Execution time of raytracing() : 1.221488 sec $ make PROFILE=1 $ ./raytracing # Rendering scene Done! Execution time of raytracing() : 9.702812 sec ``` * 測試不同的 num_threads ，可以發現除了 1、2 之間差距很大，再增加 thread 並無太大的效益，因為 cpu 的核心數目有限。 ![](https://i.imgur.com/3aaVRkK.png) 參考資料 == 資訊系多處理機平行程式PPT