2016q3 Homework2 (raytracing)

# 2016q3 Homework2 (raytracing)
contributed by <`aweimeow`>

###### tags: `sysprog21` `aweimeow`

### Environment

* OS: Ubuntu 14.04.4 LTS
* CPU: Intel(R) Core(TM) i5-4210M CPU @ 2.60GHz
* Memory: 8G
* Cache:
    * L1d cache: 32KB
    * L1i cache: 32KB
    * L2 cache: 256KB
    * L3 cache: 3072KB

### Prerequisites

```
$ sudo apt-get update
$ sudo apt-get install graphviz
$ sudo apt-get install imagemagick
```

### Unmodified version

```
# Rendering scene
Done!
Execution time of raytracing() : 2.909354 sec
```

The rendered output:

![output.ppm](http://imgur.com/P0lKg3P.png)

### Finding where to optimize

First, build with `PROFILE=1`:

```
# Rendering scene
Done!
Execution time of raytracing() : 5.193626 sec
```

Looking only at the top entries of the gprof flat profile:

```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 28.59      0.90     0.90 69646433     0.00     0.00  dot_product
 14.29      1.35     0.45 56956357     0.00     0.00  subtract_vector
  9.05      1.64     0.29 31410180     0.00     0.00  multiply_vector
  8.89      1.92     0.28 13861875     0.00     0.00  rayRectangularIntersection
  6.67      2.13     0.21 13861875     0.00     0.00  raySphereIntersection
  6.67      2.34     0.21 10598450     0.00     0.00  normalize
  6.35      2.54     0.20 17836094     0.00     0.00  add_vector
  4.76      2.69     0.15  4620625     0.00     0.00  ray_hit_object
  3.34      2.79     0.11 17821809     0.00     0.00  cross_product
  2.86      2.88     0.09  1048576     0.00     0.00  ray_color
  2.54      2.96     0.08  2110576     0.00     0.00  compute_specular_diffuse
  1.27      3.00     0.04  4221152     0.00     0.00  multiply_vectors
  1.27      3.04     0.04  1048576     0.00     0.00  rayConstruction
  1.27      3.08     0.04        1     0.04     3.15  raytracing
  0.64      3.10     0.02  2110576     0.00     0.00  localColor
  0.32      3.11     0.01  3838091     0.00     0.00  length
```

I also used `perf`, which I learned in the first homework, to take a look:

```
 Performance counter stats for './raytracing' (5 runs):

           954,195      cache-misses              #   49.638 % of all cache refs
         2,062,377      cache-references
    33,500,782,366      instructions              #    2.04  insns per cycle
    16,585,243,176      cycles

       5.220092593 seconds time elapsed                          ( +-  0.23% )
```

### Modifying the code

#### dot_product

Loop unrolling: expand the loop in the function body by hand to speed up execution.

```c
static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    dp += v1[0] * v2[0] +
          v1[1] * v2[1] + v1[2] * v2[2];
    return dp;
}
```

Result:

```
# Rendering scene
Done!
Execution time of raytracing() : 4.807088 sec
```

Compared with the previous run, `5.193626` - `4.807088` = `0.386538` seconds. Judging by the total time alone, the program clearly got faster. What about gprof? Putting the two profiles side by side (before on the first row, after on the second), the self time of `dot_product` dropped by about 0.39 seconds, which confirms that the change really does help.

```
 time   seconds   seconds    calls   s/call   s/call  name
 28.59      0.90     0.90 69646433     0.00     0.00  dot_product
 17.72      1.09     0.51 69646433     0.00     0.00  dot_product
```

#### subtract_vector

Again modified with loop unrolling:

```
# Rendering scene
Done!
Execution time of raytracing() : 4.579059 sec
```

gprof likewise shows the self time dropping from 0.45 to 0.33 seconds (before on the first row, after on the second):

```
 time   seconds   seconds    calls   s/call   s/call  name
 14.29      1.35     0.45 56956357     0.00     0.00  subtract_vector
 12.41      1.10     0.33 56956357     0.00     0.00  subtract_vector
```

#### Skipping the rest of the loop unrolling: summary after unrolling everything

```
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 21.70      0.49     0.49 69646433     0.00     0.00  dot_product
 13.73      0.80     0.31 13861875     0.00     0.00  rayRectangularIntersection
 11.51      1.06     0.26 13861875     0.00     0.00  raySphereIntersection
  9.74      1.28     0.22 10598450     0.00     0.00  normalize
  8.41      1.47     0.19 56956357     0.00     0.00  subtract_vector
  7.08      1.63     0.16 31410180     0.00     0.00  multiply_vector
  6.64      1.78     0.15 17821809     0.00     0.00  cross_product
  4.43      1.88     0.10 17836094     0.00     0.00  add_vector
```

#### OpenMP

Here I followed [another student's](https://hackmd.io/s/r1vckLB6#%E5%AF%A6%E5%81%9Apthread-%E8%B7%9F-openmp) approach and [this example](https://computing.llnl.gov/tutorials/openMP/samples/C/omp_hello.c) to get a rough idea of how to write it. The first step is to work out which variables must be independent in each thread:

* idx_stack stk: this looks like a stack used for storage, so each thread should have its own
* d: this is passed to `rayConstruction` and `ray_color`, so it is presumably not a fixed value
* object_color

First include the header:

```c
#include <omp.h>
```

Then add the directive in front of the for loop:

```c
#pragma omp parallel for num_threads(THREAD_NUM) private(stk, d, object_color)
```

And modify the Makefile:

```makefile
CFLAGS = \
    -std=gnu99 -Wall -O0 -g -fopenmp
LDFLAGS = \
    -lm -fopenmp
```

Remember to add the `-fopenmp` flag. [The classmate I referenced](https://hackmd.io/s/r1vckLB6#%E5%AF%A6%E5%81%9Apthread-%E8%B7%9F-openmp) wrote:

> Finally, remember to #include<omp.h> and add -fopenp to the compile options

There is a typo in there :P

##### Results

Without OpenMP
enabled:

```
# Rendering scene
Done!
Execution time of raytracing() : 4.038518 sec
```

With OpenMP enabled (Thread = 4):

```
# Rendering scene
Done!
Execution time of raytracing() : 8.254940 sec
```

:::info
Huh, why did the time go up? Fortunately, we have gprof to look into it.
:::

gprof shows the execution time going from 1.95 down to 1.92. That is not a great improvement, and I am not sure whether it is because too few threads were used, so next I tried 16 threads:

```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 16.31      0.15     0.15   988837     0.00     0.00  raySphereIntersection
 11.96      0.26     0.11  4444684     0.00     0.00  dot_product
  9.79      0.35     0.09   372860     0.00     0.00  ray_hit_object
  9.24      0.44     0.09  3957736     0.00     0.00  subtract_vector
  8.70      0.52     0.08  1062330     0.00     0.00  cross_product
  8.70      0.60     0.08   912758     0.00     0.00  rayRectangularIntersection
  7.61      0.67     0.07   673090     0.00     0.00  normalize
  5.44      0.72     0.05    61496     0.00     0.01  ray_color
  4.35      0.76     0.04  2129702     0.00     0.00  multiply_vector
  4.35      0.80     0.04  1099567     0.00     0.00  add_vector
  3.26      0.83     0.03   282499     0.00     0.00  length
  2.18      0.85     0.02    69869     0.00     0.00  rayConstruction
  2.18      0.87     0.02        1    20.01   920.55  raytracing
  1.63      0.88     0.02   226005     0.00     0.00  multiply_vectors
  1.09      0.89     0.01   159710     0.00     0.00  compute_specular_diffuse
  1.09      0.90     0.01   155301     0.00     0.00  localColor
  1.09      0.91     0.01    91039     0.00     0.00  refraction
  1.09      0.92     0.01    89899     0.00     0.00  protect_color_overflow
  0.00      0.92     0.00   188053     0.00     0.00  idx_stack_empty
  0.00      0.92     0.00   153272     0.00     0.00  idx_stack_top
  0.00      0.92     0.00    82018     0.00     0.00  reflection
  0.00      0.92     0.00    80465     0.00     0.00  idx_stack_push
  0.00      0.92     0.00    69209     0.00     0.00  idx_stack_init
  0.00      0.92     0.00     9361     0.00     0.00  fresnel
  0.00      0.92     0.00     3378     0.00     0.00  idx_stack_pop
  0.00      0.92     0.00        3     0.00     0.00  append_rectangular
  0.00      0.92     0.00        3     0.00     0.00  append_sphere
  0.00      0.92     0.00        2     0.00     0.00  append_light
  0.00      0.92     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      0.92     0.00        1     0.00     0.00  delete_light_list
  0.00      0.92     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      0.92     0.00        1     0.00     0.00  delete_sphere_list
  0.00      0.92     0.00        1     0.00     0.00  diff_in_second
  0.00      0.92     0.00        1     0.00     0.00  write_to_ppm
```

The final cumulative time in the profile is
0.92 seconds, so the time really did drop. Next, compare the number of threads against the execution time (thread counts are powers of 4):

| Thread Number | Execution Time (sec) |
|---------------|----------------------|
| No OpenMP     | 2.234710             |
| 4             | 1.203826             |
| 16            | 0.907454             |
| 64            | 0.960009             |
| 256           | 0.938343             |
| 1024          | 1.029779             |
| 4096          | 1.053434             |
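
A thread-count sweep like the one in the table can be sketched in isolation. The block below is only a minimal illustration under stated assumptions, not the project's code: `shade`, `render`, `time_render`, `WIDTH`, and `HEIGHT` are hypothetical names, and the per-pixel work is a stand-in for `rayConstruction` and `ray_color`. It shows the same `parallel for` pattern with `num_threads` and a `private` scratch buffer, and it also compiles as a serial program when `-fopenmp` is left out.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define WIDTH 64  /* hypothetical image size, not the project's */
#define HEIGHT 64

/* Hypothetical per-pixel work; fills the caller's scratch buffer d. */
static double shade(int i, int j, double *d)
{
    d[0] = i * 0.5;
    d[1] = j * 0.25;
    d[2] = 1.0;
    /* unrolled dot product of d with itself */
    return d[0] * d[0] + d[1] * d[1] + d[2] * d[2];
}

/* Fill out[] (WIDTH * HEIGHT doubles) using at most `threads` threads. */
static void render(double *out, int threads)
{
    double d[3]; /* scratch buffer: each thread must get its own copy */
    assert(threads > 0);
    #pragma omp parallel for num_threads(threads) private(d)
    for (int j = 0; j < HEIGHT; j++)
        for (int i = 0; i < WIDTH; i++)
            out[j * WIDTH + i] = shade(i, j, d);
}

/* Elapsed seconds for one render() call:
 * wall-clock time with OpenMP, CPU time in the serial fallback. */
static double time_render(double *out, int threads)
{
#ifdef _OPENMP
    double start = omp_get_wtime();
    render(out, threads);
    return omp_get_wtime() - start;
#else
    clock_t start = clock();
    render(out, threads);
    return (double) (clock() - start) / CLOCKS_PER_SEC;
#endif
}
```

Calling `time_render(buf, n)` from a `main` for `n` = 4, 16, 64 and so on would give a table of the same shape as the one above, although the numbers will depend on the machine. Built with `gcc -std=gnu99 -fopenmp`, the loop rows are divided among the threads; without the flag the pragma is ignored and the loop runs serially, which gives a convenient baseline for checking that the parallel output is identical.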