# 2017q1 Homework1 (raytracing) contributed by < `changyuanhua` > ### Reviewed by `ryanwang522` * Git Commit Message 過於簡略,可以在表達的詳細一點,commit 的目的跟過程都可以在詳加描述,參考 [How to wirte git commit message](https://chris.beams.io/posts/git-commit/)。 * 可以嘗試比較 OpenMP 的不同 schedule method有何差異。 * 思考最後比較圖表 Cache-miss 的差異原因? --- >請修正程式碼縮排,為四個空白 >[name=課程助教][color=red] >>ok, 謝謝[name=changyuanhua][color=black] ## 開發環境 * 輸入指令 ` lscpu ` ``` Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數:1 每通訊端核心數:4 Socket(s): 1 NUMA 節點: 1 供應商識別號: GenuineIntel CPU 家族: 6 型號: 94 Model name: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz 製程: 3 CPU MHz: 799.890 CPU max MHz: 3200.0000 CPU min MHz: 800.0000 BogoMIPS: 4608.00 虛擬: VT-x L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 6144K NUMA node0 CPU(s): 0-3 ``` ## 預先準備工作 * **安裝準備** ``` $ sudo apt-get update $ sudo apt-get install graphviz $ sudo apt-get install imagemagic ``` ## 光影追蹤程式 * **取得原始程式碼、編譯和測試** ``` $ git clone https://github.com/sysprog21/raytracing $ cd raytracing $ make $ ./raytracing ``` * **可以看到程式碼的行數** ``` $ cloc *.[ch] ``` * **開啟 .ppm 檔的兩種方式** ``` 1.eog out.ppm 2.gimp out.ppm ``` * **透過 convert 這個工具可將 PPM 轉為 PNG 或其他格式** ``` $ convert out.ppm out.png ``` * **圖片顯示** ![](https://i.imgur.com/wrAxKJo.png) * **檢驗程式碼輸出的圖片是否符合預期,符合的話會得到 “Verified OK!” 字樣** ``` $ make check ``` * **清除執行檔** ``` $ make clean ``` * **gprof 分析程式** ``` $ make PROFILE=1 $ ./raytracing $ gprof -b raytracing gmon.out | less ``` * [gprof 教學](http://os.51cto.com/art/200703/41426.htm) * **install gprof2dot** ``` $sudo apt install python python-pip $sudo pip install --upgrade pip $sudo pip install gprof2dot ``` * **作圖** ``` gprof ./raytracing | gprof2dot | dot -T png -o output.png ``` * **使用 astyle + Git 實做自動程式碼排版檢查** ``` $ astyle --style=kr --indent=spaces=4 --indent-switches --suffix=none *.[ch] ``` ## 未優化 * **時間** ``` Done! Execution time of raytracing() : 6.158007 sec ``` * **gprof 分析程式 , 指出效能瓶頸** ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 27.64 0.71 0.71 69646433 0.00 0.00 dot_product 28.18.69 1.19 0.48 56956357 0.00 0.00 subtract_vector 7.79 1.39 0.20 31410180 0.00 0.00 multiply_vector 6.62 1.56 0.17 13861875 0.00 0.00 raySphereIntersection 6.62 1.73 0.17 10598450 0.00 0.00 normalize 6.23 1.89 0.16 17836094 0.00 0.00 add_vector 6.23 2.05 0.16 13861875 0.00 0.00 rayRectangularIntersection 5.06 2.18 0.13 4620625 0.00 0.00 ray_hit_object 4.67 2.30 0.12 17821809 0.00 0.00 cross_product 2.73 2.37 0.07 2110576 0.00 0.00 localColor 1.75 2.42 0.05 1048576 0.00 0.00 ray_color 1.56 2.46 0.04 4221152 0.00 0.00 multiply_vectors 1.17 2.49 0.03 3838091 0.00 0.00 length 0.78 2.51 0.02 2110576 0.00 0.00 compute_specular_diffuse 0.78 2.53 0.02 1048576 0.00 0.00 rayConstruction 0.58 2.54 0.02 1241598 0.00 0.00 reflection 0.39 2.55 0.01 1241598 0.00 0.00 refraction 0.39 2.56 0.01 1 0.01 2.56 raytracing 0.00 2.56 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 2.56 0.00 2520791 0.00 0.00 idx_stack_top 0.00 2.56 0.00 1241598 0.00 0.00 protect_color_overflow 0.00 2.56 0.00 1204003 0.00 0.00 idx_stack_push 0.00 2.56 0.00 1048576 0.00 0.00 idx_stack_init 0.00 2.56 0.00 113297 0.00 0.00 fresnel 0.00 2.56 0.00 37595 0.00 0.00 idx_stack_pop 0.00 2.56 0.00 3 0.00 0.00 append_rectangular 0.00 2.56 0.00 3 0.00 0.00 append_sphere 0.00 2.56 0.00 2 0.00 0.00 append_light 0.00 2.56 0.00 1 0.00 0.00 calculateBasisVectors 0.00 2.56 0.00 1 0.00 0.00 delete_light_list 0.00 2.56 0.00 1 0.00 0.00 delete_rectangular_list 0.00 2.56 0.00 1 0.00 0.00 delete_sphere_list 0.00 2.56 0.00 1 0.00 0.00 diff_in_second 0.00 2.56 0.00 1 0.00 0.00 write_to_ppm ``` * **cache miss** * 清空 ``` echo 1 | sudo tee /proc/sys/vm/drop_caches ``` * 指令 ``` perf stat -r 10 -e branch-misses,cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./raytracing ``` ``` Performance counter stats for './raytracing' (10 runs): 20,968,120 branch-misses ( +- 0.33% ) 2,014,576 cache-misses # 65.248 % of all cache refs ( +- 18.82% ) 3,087,590 cache-references ( +- 14.01% ) 3,721,407 L1-dcache-load-misses ( +- 2.23% ) <not supported> L1-dcache-store-misses <not supported> L1-dcache-prefetch-misses 1,456,429 L1-icache-load-misses ( +- 9.33% ) 6.357804873 seconds time elapsed ( +- 0.99% ) ``` * **圖** ![](https://i.imgur.com/dUNLO3Z.png) * 可以發現在 dot_product function 呼叫的次數最高,所以先嘗試以此著手 ## 實驗一: loop unrolling * **[loop unrolling](https://zh.wikipedia.org/wiki/%E5%BE%AA%E7%8E%AF%E5%B1%95%E5%BC%80)** * **code** ```clike= static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; dp += v1[0] * v2[0]+ v1[1] * v2[1]+ v1[2] * v2[2]; return dp; } ``` * **時間** ``` Done! Execution time of raytracing() : 5.412815 sec ``` * **gprof 分析程式 , 指出效能瓶頸** ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 21.86 0.45 0.45 56956357 0.00 0.00 subtract_vector 22.11.90 0.70 0.25 69646433 0.00 0.00 dot_product 11.41 0.93 0.24 31410180 0.00 0.00 multiply_vector 9.23 1.12 0.19 13861875 0.00 0.00 rayRectangularIntersection 7.77 1.28 0.16 13861875 0.00 0.00 raySphereIntersection 7.29 1.43 0.15 10598450 0.00 0.00 normalize 6.56 1.57 0.14 17836094 0.00 0.00 add_vector 4.86 1.67 0.10 17821809 0.00 0.00 cross_product 4.86 1.77 0.10 4620625 0.00 0.00 ray_hit_object 2.91 1.83 0.06 1 0.06 2.05 raytracing 2.43 1.88 0.05 2110576 0.00 0.00 compute_specular_diffuse 1.70 1.91 0.04 4221152 0.00 0.00 multiply_vectors 1.46 1.94 0.03 2110576 0.00 0.00 localColor 1.46 1.97 0.03 1048576 0.00 0.00 ray_color 1.21 2.00 0.03 3838091 0.00 0.00 length 0.97 2.02 0.02 1241598 0.00 0.00 refraction 0.97 2.04 0.02 1048576 0.00 0.00 rayConstruction 0.73 2.05 0.02 1048576 0.00 0.00 idx_stack_init 0.49 2.06 0.01 1 0.01 0.01 delete_sphere_list 0.00 2.06 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 2.06 0.00 2520791 0.00 0.00 idx_stack_top 0.00 2.06 0.00 1241598 0.00 0.00 protect_color_overflow 0.00 2.06 0.00 1241598 0.00 0.00 reflection 0.00 2.06 0.00 1204003 0.00 0.00 idx_stack_push 0.00 2.06 0.00 113297 0.00 0.00 fresnel 0.00 2.06 0.00 37595 0.00 0.00 idx_stack_pop 0.00 2.06 0.00 3 0.00 0.00 append_rectangular 0.00 2.06 0.00 3 0.00 0.00 append_sphere 0.00 2.06 0.00 2 0.00 0.00 append_light 0.00 2.06 0.00 1 0.00 0.00 calculateBasisVectors 0.00 2.06 0.00 1 0.00 0.00 delete_light_list 0.00 2.06 0.00 1 0.00 0.00 delete_rectangular_list 0.00 2.06 0.00 1 0.00 0.00 diff_in_second 0.00 2.06 0.00 1 0.00 0.00 write_to_ppm ``` * **cache miss** ``` Performance counter stats for './raytracing' (10 runs): 19,438,104 branch-misses ( +- 0.33% ) 1,210,765 cache-misses # 42.924 % of all cache refs ( +- 38.56% ) 2,820,696 cache-references ( +- 31.18% ) 3,801,936 L1-dcache-load-misses ( +- 5.67% ) <not supported> L1-dcache-store-misses <not supported> L1-dcache-prefetch-misses 1,368,386 L1-icache-load-misses ( +- 16.02% ) 5.751152093 seconds time elapsed ( +- 1.83% ) ``` * **圖** ![](https://i.imgur.com/B5Dg2ng.png) * **結果與發現** 當把 dot_product function 經過 loop unrolling 後 , 所花的整體執行時間從 6.158007 秒降至 5.412815 秒 ,而 dot_product function 所花的時間從 0.71 秒降至 0.45 秒 ## 實驗二: loop unrolling add_vector()、subtract_vector()、multiply_vectors()、multiply_vector()、dot_product() * **code** ```clike= static inline void add_vector(const double *a, const double *b, double *out) { out[0] = a[0] + b[0]; out[1] = a[1] + b[1]; out[2] = a[2] + b[2]; } static inline void subtract_vector(const double *a, const double *b, double *out) { out[0] = a[0] - b[0]; out[1] = a[1] - b[1]; out[2] = a[2] - b[2]; } static inline void multiply_vectors(const double *a, const double *b, double *out) { out[0] = a[0] * b[0]; out[1] = a[1] * b[1]; out[2] = a[2] * b[2]; } static inline void multiply_vector(const double *a, double b, double *out) { out[0] = a[0] * b; out[1] = a[1] * b; out[2] = a[2] * b; } ``` * **時間** ``` Done! Execution time of raytracing() : 4.924315 sec ``` * **gprof 分析程式 , 指出效能瓶頸** ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 14.50 0.20 0.20 69646433 0.00 0.00 dot_product 13.78 0.39 0.19 56956357 0.00 0.00 subtract_vector 11.60 0.55 0.16 10598450 0.00 0.00 normalize 6.89 0.65 0.10 13861875 0.00 0.00 raySphereIntersection 6.89 0.74 0.10 31410180 0.00 0.00 multiply_vector 6.89 0.84 0.10 17836094 0.00 0.00 add_vector 6.89 0.93 0.10 13861875 0.00 0.00 rayRectangularIntersection 6.53 1.02 0.09 4620625 0.00 0.00 ray_hit_object 5.44 1.10 0.08 17821809 0.00 0.00 cross_product 4.35 1.16 0.06 1048576 0.00 0.00 ray_color 3.63 1.21 0.05 2110576 0.00 0.00 compute_specular_diffuse 3.26 1.25 0.05 4221152 0.00 0.00 multiply_vectors 2.90 1.29 0.04 2110576 0.00 0.00 localColor 2.90 1.33 0.04 1048576 0.00 0.00 rayConstruction 2.90 1.37 0.04 1 0.04 1.38 raytracing 0.73 1.38 0.01 1241598 0.00 0.00 refraction 0.00 1.38 0.00 3838091 0.00 0.00 length 0.00 1.38 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 1.38 0.00 2520791 0.00 0.00 idx_stack_top 0.00 1.38 0.00 1241598 0.00 0.00 protect_color_overflow 0.00 1.38 0.00 1241598 0.00 0.00 reflection 0.00 1.38 0.00 1204003 0.00 0.00 idx_stack_push 0.00 1.38 0.00 1048576 0.00 0.00 idx_stack_init 0.00 1.38 0.00 113297 0.00 0.00 fresnel 0.00 1.38 0.00 37595 0.00 0.00 idx_stack_pop 0.00 1.38 0.00 3 0.00 0.00 append_rectangular 0.00 1.38 0.00 3 0.00 0.00 append_sphere 0.00 1.38 0.00 2 0.00 0.00 append_light 0.00 1.38 0.00 1 0.00 0.00 calculateBasisVectors 0.00 1.38 0.00 1 0.00 0.00 delete_light_list 0.00 1.38 0.00 1 0.00 0.00 delete_rectangular_list 0.00 1.38 0.00 1 0.00 0.00 delete_sphere_list 0.00 1.38 0.00 1 0.00 0.00 diff_in_second 0.00 1.38 0.00 1 0.00 0.00 write_to_ppm ``` * **cache miss** ``` Performance counter stats for './raytracing' (10 runs): 16,816,057 branch-misses ( +- 0.10% ) 1,189,917 cache-misses # 31.525 % of all cache refs ( +- 23.36% ) 3,774,516 cache-references ( +- 14.38% ) 4,019,687 L1-dcache-load-misses ( +- 3.71% ) <not supported> L1-dcache-store-misses <not supported> L1-dcache-prefetch-misses 1,507,763 L1-icache-load-misses ( +- 8.15% ) 5.034827277 seconds time elapsed ( +- 1.28% ) ``` * **圖** ![](https://i.imgur.com/9fimoDX.png) * **結果與發現** 當把 add_vector()、subtract_vector()、multiply_vectors()、multiply_vector()、dot_product() 能不用for迴圈的都展開來 , 所花的整體執行時間從 5.412815 秒降至 4.924315 秒 ## 實驗三: force nline + loop unrolling * **code** * 把 math-toolkit.h 檔的所有 static inline 改成下列 ```cstyles= static __attribute__((always_inline)) ``` * **時間** ``` Done! Execution time of raytracing() : 3.206461 sec ``` * **gprof 分析程式 , 指出效能瓶頸** ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 24.19 0.37 0.37 13861875 0.00 0.00 rayRectangularIntersection 15.04 0.60 0.23 13861875 0.00 0.00 raySphereIntersection 12.42 0.79 0.19 69646433 0.00 0.00 dot_product 11.77 0.97 0.18 2110576 0.00 0.00 compute_specular_diffuse 8.50 1.10 0.13 2110576 0.00 0.00 localColor 8.50 1.23 0.13 1048576 0.00 0.00 ray_color 5.88 1.32 0.09 4620625 0.00 0.00 ray_hit_object 3.92 1.38 0.06 1048576 0.00 0.00 rayConstruction 1.96 1.41 0.03 2520791 0.00 0.00 idx_stack_top 1.96 1.44 0.03 113297 0.00 0.00 fresnel 1.31 1.46 0.02 1241598 0.00 0.00 protect_color_overflow 1.31 1.48 0.02 1241598 0.00 0.00 reflection 1.31 1.50 0.02 1048576 0.00 0.00 idx_stack_init 1.31 1.52 0.02 1 0.02 1.52 raytracing 0.65 1.53 0.01 1 0.01 0.01 delete_sphere_list 0.00 1.53 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 1.53 0.00 1241598 0.00 0.00 refraction 0.00 1.53 0.00 1204003 0.00 0.00 idx_stack_push 0.00 1.53 0.00 37595 0.00 0.00 idx_stack_pop 0.00 1.53 0.00 3 0.00 0.00 append_rectangular 0.00 1.53 0.00 3 0.00 0.00 append_sphere 0.00 1.53 0.00 2 0.00 0.00 append_light 0.00 1.53 0.00 1 0.00 0.00 calculateBasisVectors 0.00 1.53 0.00 1 0.00 0.00 delete_light_list 0.00 1.53 0.00 1 0.00 0.00 delete_rectangular_list 0.00 1.53 0.00 1 0.00 0.00 diff_in_second 0.00 1.53 0.00 1 0.00 0.00 write_to_ppm ``` * **cache miss** ``` Performance counter stats for './raytracing' (10 runs): 3,691,085 branch-misses ( +- 0.30% ) 1,926,179 cache-misses # 58.611 % of all cache refs ( +- 14.85% ) 3,286,392 cache-references ( +- 11.63% ) 3,047,752 L1-dcache-load-misses ( +- 3.31% ) <not supported> L1-dcache-store-misses <not supported> L1-dcache-prefetch-misses 2,445,715 L1-icache-load-misses ( +- 9.69% ) 2.928272691 seconds time elapsed ( +- 1.24% ) ``` * **圖** ![](https://i.imgur.com/a4samjG.png) * **結果與發現** 當把 math-toolkit.h 檔所有 static inline 都改成 static __attribute__((always_inline)), 所花的整體執行時間從 4.924315 秒降至 3.206461 秒 ## 實驗四:OPENMP * [openmp](https://drive.google.com/file/d/0B4WZ6ihm_C7zYkdvNUJyYlJFRG8/view) * **code** * Makefile ```cstyles= CFLAGS = \ -std=gnu99 -Wall -O0 -g -fopenmp LDFLAGS = \ -lm -lgomp ``` * raytracing.c ```cstyles= int factor = sqrt(SAMPLES); #pragma omp parallel for num_threads(64) private(d),private(object_color),private(stk) for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { double r = 0, g = 0, b = 0; } ``` * **時間** ``` Done! Execution time of raytracing() : 1.635157 sec ``` * **gprof 分析程式 , 指出效能瓶頸** ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 27.79 0.05 0.05 208968 0.00 0.00 rayRectangularIntersection 27.79 0.10 0.05 32376 0.00 0.00 compute_specular_diffuse 16.67 0.13 0.03 34239 0.00 0.00 localColor 11.11 0.15 0.02 65640 0.00 0.00 ray_hit_object 5.56 0.16 0.01 41770 0.00 0.00 idx_stack_top 5.56 0.17 0.01 13934 0.00 0.01 ray_color 2.78 0.18 0.01 23951 0.00 0.00 idx_stack_push 2.78 0.18 0.01 14600 0.00 0.00 idx_stack_init 0.00 0.18 0.00 204869 0.00 0.00 raySphereIntersection 0.00 0.18 0.00 54181 0.00 0.00 idx_stack_empty 0.00 0.18 0.00 21517 0.00 0.00 refraction 0.00 0.18 0.00 19371 0.00 0.00 protect_color_overflow 0.00 0.18 0.00 19012 0.00 0.00 reflection 0.00 0.18 0.00 16662 0.00 0.00 rayConstruction 0.00 0.18 0.00 1593 0.00 0.00 fresnel 0.00 0.18 0.00 681 0.00 0.00 idx_stack_pop 0.00 0.18 0.00 3 0.00 0.00 append_rectangular 0.00 0.18 0.00 3 0.00 0.00 append_sphere 0.00 0.18 0.00 2 0.00 0.00 append_light 0.00 0.18 0.00 1 0.00 0.00 calculateBasisVectors 0.00 0.18 0.00 1 0.00 0.00 delete_light_list 0.00 0.18 0.00 1 0.00 0.00 delete_rectangular_list 0.00 0.18 0.00 1 0.00 0.00 delete_sphere_list 0.00 0.18 0.00 1 0.00 0.00 diff_in_second 0.00 0.18 0.00 1 0.00 180.06 raytracing 0.00 0.18 0.00 1 0.00 0.00 write_to_ppm ``` * **cache miss** ``` Performance counter stats for './raytracing' (10 runs): 4,282,936 branch-misses ( +- 1.31% ) 729,409 cache-misses # 0.173 % of all cache refs ( +- 5.46% ) 421,057,183 cache-references ( +- 0.23% ) 62,879,634 L1-dcache-load-misses ( +- 0.31% ) <not supported> L1-dcache-store-misses <not supported> L1-dcache-prefetch-misses 4,155,291 L1-icache-load-misses ( +- 2.28% ) 1.644746532 seconds time elapsed ( +- 0.11% ) * **圖** ![](https://i.imgur.com/kdXFAUw.png) * **結果與發現** 當使用openmp, 所花的整體執行時間從 3.206461 秒降至 1.635157 秒 ## 比較圖 * **時間** ![](https://i.imgur.com/P6BiOqy.png) * **cache miss** ![](https://i.imgur.com/mYMYC9P.png) ## 參考 * [hugikun999](https://hackmd.io/s/HyHhgcv6#) * [LanKuDot](https://hackmd.io/s/rkggUeKa#)