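# 2017q2 Homework1 (raytracing)
contributed by < [`ChiHsiang`](https://github.com/ChiHsiang) >

>> Fill in your GitHub account here [name=jserv]

==Assignment walkthrough [video](https://www.youtube.com/watch?v=m1RmfOfSwno)==
==Assignment introduction [Wiki](http://wiki.csie.ncku.edu.tw/embedded/2016q1h2)==

---
### Reviewed by `etc276`
- The note should separate its sections clearly: `#` is for the title and `###` for each topic, so that readers always know where they are and can skip around. Too many `###` headings, however, leave readers unsure where to start. See [my note](https://hackmd.io/OwBgpgJgzMCcIFpYQgDgQFgIYCMDGCOo6ArAGYCMFZYATDmGasEA) for reference (it also has plenty of room for improvement).
- I find OpenMP a fairly mature API. At first I only read about it without writing any code, but after consulting other people's notes and GitHub repositories, the basic usage turned out to be quite easy: adding a few lines such as `#pragma...` and changing the compile command in the `Makefile` already gives a first optimization. I suggest spending some time trying it out and recording the results in this note.
- Commits are too frequent and the commit messages are not specific (for example "Added some more math functions"). Consider committing only after a meaningful amount of change.

### Goal
* Reduce render time

### About ray tracing
In physics, ray tracing can be used to calculate how a beam of light propagates through a medium. While propagating, the beam may be absorbed by the medium, change direction, or exit through the surface of the medium. This complexity is handled by computing how idealized narrow beams (rays) pass through the medium.

### GNU gprof
[Guide](https://sourceware.org/binutils/docs/gprof/)

#### About profiling
Profiling shows which functions consume most of the execution time and how many times each function is called. It reports information about the whole run and helps to quickly pinpoint inefficient code.
* Basic steps
    * Enable the compiler flag `-pg` so that profiling support is compiled and linked into the program
    * Run the resulting executable, which produces the data file ==gmon.out==
    * Run `gprof execute_program | less`
    * Visualize the result with [gprof2dot](https://github.com/jrfonseca/gprof2dot)

### Profiling Graph
![](https://i.imgur.com/NJyeVZn.png)
```
Flat profile:

Each sample counts as 0.01 seconds.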
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 28.96      1.01     1.01 69646433     0.00     0.00  dot_product
 17.20      1.61     0.60 56956357     0.00     0.00  subtract_vector
 11.04      2.00     0.39 31410180     0.00     0.00  multiply_vector
  7.45      2.26     0.26 13861875     0.00     0.00  rayRectangularIntersection
  6.88      2.50     0.24 17836094     0.00     0.00  add_vector
  5.73      2.70     0.20 13861875     0.00     0.00  raySphereIntersection
  5.16      2.88     0.18 17821809     0.00     0.00  cross_product
  5.16      3.06     0.18  4620625     0.00     0.00  ray_hit_object
  4.87      3.23     0.17 10598450     0.00     0.00  normalize
  1.72      3.29     0.06  1048576     0.00     0.00  ray_color
  1.58      3.34     0.06  4221152     0.00     0.00  multiply_vectors
```

### perf
==compiler -O0==
```
Execution time of raytracing() : 5.855616 sec

 Performance counter stats for './raytracing':

            51,790      cache-misses              #   15.905 % of all cache refs      (44.34%)
           325,622      cache-references                                              (44.39%)
         4,099,177      L1-dcache-load-misses     #    0.03% of all L1-dcache hits    (44.55%)
    13,776,423,601      L1-dcache-loads                                               (44.46%)
           856,528      L1-dcache-prefetch-misses                                     (22.28%)
           298,322      L1-dcache-store-misses                                        (22.27%)
           884,842      L1-icache-load-misses                                         (33.35%)
     4,784,882,802      branch-instructions                                           (44.41%)
        75,971,367      branch-misses             #    1.59% of all branches          (44.35%)

       5.857297119 seconds time elapsed
```
==compiler -Ofast==
```
Execution time of raytracing() : 0.654530 sec

 Performance counter stats for './raytracing':

            30,146      cache-misses              #   38.650 % of all cache refs      (44.72%)
            77,997      cache-references                                              (45.05%)
         2,112,889      L1-dcache-load-misses     #    0.29% of all L1-dcache hits    (45.10%)
       738,721,430      L1-dcache-loads                                               (44.02%)
           441,622      L1-dcache-prefetch-misses                                     (21.98%)
            52,570      L1-dcache-store-misses                                        (22.67%)
            93,843      L1-icache-load-misses                                         (33.79%)
       258,535,076      branch-instructions                                           (44.78%)
           784,490      branch-misses             #    0.30% of all branches          (44.50%)

       0.656329191 seconds time elapsed
```
With compiler optimization enabled, every event count drops substantially, in most cases by a factor of two or more. This suggests that much of the program as written is far from optimal.

#### perf hotspot
* cache-misses
```
 31.10%  raytracing  [kernel.kallsyms]  [k] clear_page_c_e
 14.72%  raytracing  [kernel.kallsyms]  [k] get_page_from_freelist
 12.13%  raytracing  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
 11.10%  raytracing  [kernel.kallsyms]  [k] copy_page
  9.72%  raytracing  [kernel.kallsyms]  [k] __alloc_pages_nodemask
  8.74%  raytracing  ld-2.23.so         [.] dl_main
  4.87%  raytracing  [kernel.kallsyms]  [k] anon_vma_prepare
  2.23%  raytracing  libm-2.23.so       [.] __ieee754_pow_sse2
  2.10%  raytracing  [kernel.kallsyms]  [k] enqueue_entity
  1.54%  raytracing  [kernel.kallsyms]  [k] mem_cgroup_try_charge
  1.06%  raytracing  [kernel.kallsyms]  [k] commit_creds
  0.41%  raytracing  [kernel.kallsyms]  [k] handle_mm_fault
  0.28%  perf        [kernel.kallsyms]  [k] perf_event_addr_filters_exec
  0.02%  perf        [kernel.kallsyms]  [k] perf_ctx_unlock
```
The miss ratio is not high, so the caches are unlikely to be the main cause of the poor performance.

### Performance evaluation
#### Original
==gprof ./raytracing==
* -O0
```
# Rendering scene
Done!
Execution time of raytracing() : 2.835277 sec

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 19.69      0.49     0.49 69646433     0.00     0.00  dot_product
 18.49      0.95     0.46 56956357     0.00     0.00  subtract_vector
 11.65      1.24     0.29 13861875     0.00     0.00  rayRectangularIntersection
  8.84      1.46     0.22 10598450     0.00     0.00  normalize
  6.43      1.62     0.16 17836094     0.00     0.00  add_vector
  6.43      1.78     0.16 17821809     0.00     0.00  cross_product
  6.03      1.93     0.15 31410180     0.00     0.00  multiply_vector
  5.22      2.06     0.13 13861875     0.00     0.00  raySphereIntersection
  2.81      2.13     0.07  4620625     0.00     0.00  ray_hit_object
  2.81      2.20     0.07        1     0.07     2.49  raytracing
  2.41      2.26     0.06  1048576     0.00     0.00  ray_color
  1.61      2.30     0.04  1048576     0.00     0.00  rayConstruction
  1.21      2.33     0.03  4221152     0.00     0.00  multiply_vectors
  1.21      2.36     0.03  2110576     0.00     0.00  localColor
```
* -Ofast
```
# Rendering scene
Done!
Execution time of raytracing() : 0.651282 sec

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 56.14      0.23     0.23  4620625    49.81    49.81  ray_hit_object
 21.97      0.32     0.09  2110576    42.67    42.67  compute_specular_diffuse
 14.64      0.38     0.06                             raytracing
  7.32      0.41     0.03   592239    50.69   338.09  ray_color
  0.00      0.41     0.00  2110576     0.00     0.00  localColor
  0.00      0.41     0.00  1241598     0.00     0.00  refraction
```
The analysis below starts from the top three functions by time and call count.

### Code analysis
* math-toolkit.h
```clike=
/* Before: loop over the three components */
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

/* After: loop unrolling */
double dot_product(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}

/* ====================================================== */
double scalar_triple(const double *u, const double *v, const double *w)
{
    double tmp[3];
    cross_product(w, u, tmp);
    return dot_product(v, tmp);
}
```
Differences

* `⚡ gcc -O0 -Wall -std=gnu99 -c math-toolkit.c -o math_tool.o `
```assembly=
⚡ objdump -D -S ./math.o | less
000000000000041b <dot_product>:
 41b:   55                      push   %rbp
 41c:   48 89 e5                mov    %rsp,%rbp
 41f:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
 423:   48 89 75 f0             mov    %rsi,-0x10(%rbp)
 427:   48 8b 45 f8             mov    -0x8(%rbp),%rax
 42b:   f2 0f 10 08             movsd  (%rax),%xmm1
 42f:   48 8b 45 f0             mov    -0x10(%rbp),%rax
 433:   f2 0f 10 00             movsd  (%rax),%xmm0
 437:   f2 0f 59 c8             mulsd  %xmm0,%xmm1
 43b:   48 8b 45 f8             mov    -0x8(%rbp),%rax
 43f:   48 83 c0 08             add    $0x8,%rax
 443:   f2 0f 10 10             movsd  (%rax),%xmm2
 447:   48 8b 45 f0             mov    -0x10(%rbp),%rax
 44b:   48 83 c0 08             add    $0x8,%rax
 44f:   f2 0f 10 00             movsd  (%rax),%xmm0
 453:   f2 0f 59 c2             mulsd  %xmm2,%xmm0
 457:   f2 0f 58 c8             addsd  %xmm0,%xmm1
 45b:   48 8b 45 f8             mov    -0x8(%rbp),%rax
 45f:   48 83 c0 10             add    $0x10,%rax
 463:   f2 0f 10 10             movsd  (%rax),%xmm2
 467:   48 8b 45 f0             mov    -0x10(%rbp),%rax
 46b:   48 83 c0 10             add    $0x10,%rax
 46f:   f2 0f 10 00             movsd  (%rax),%xmm0
 473:   f2 0f 59 c2             mulsd  %xmm2,%xmm0
 477:   f2 0f 58 c1             addsd  %xmm1,%xmm0
 47b:   5d                      pop    %rbp
 47c:   c3                      retq
```
* `⚡ gcc -Ofast -Wall -std=gnu99 -c math-toolkit.c -o math_tool.o`
```assembly=
⚡ objdump -D -S ./math.o | less
00000000000001d0 <dot_product>:
 1d0:   f2 0f 10 07             movsd  (%rdi),%xmm0
 1d4:   f2 0f 10 0e             movsd  (%rsi),%xmm1
 1d8:   f2 0f 59 c8             mulsd  %xmm0,%xmm1
 1dc:   f2 0f 10 47 08          movsd  0x8(%rdi),%xmm0
 1e1:   f2 0f 59 46 08          mulsd  0x8(%rsi),%xmm0
 1e6:   f2 0f 58 c1             addsd  %xmm1,%xmm0
 1ea:   f2 0f 10 4f 10          movsd  0x10(%rdi),%xmm1
 1ef:   f2 0f 59 4e 10          mulsd  0x10(%rsi),%xmm1
 1f4:   f2 0f 58 c1             addsd  %xmm1,%xmm0
 1f8:   c3                      retq
 1f9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
```
[online Compiler](https://godbolt.org/)

Judging from the assembly produced by the online compiler, loop unrolling removes many `jmp`, `jg`, and `pxor` instructions and leaves mostly `mov` and `add`, so execution is simpler. With the local gcc, however, the number of `mov` and `add` instructions also goes down; the exact reason still needs to be investigated.

```
Execution time of raytracing() : 5.483000 sec

 Performance counter stats for './raytracing':

            32,812      cache-misses              #   11.371 % of all cache refs      (44.48%)
           288,557      cache-references                                              (44.52%)
         4,107,611      L1-dcache-load-misses     #    0.03% of all L1-dcache hits    (44.56%)
    12,598,272,040      L1-dcache-loads                                               (44.44%)
           877,960      L1-dcache-prefetch-misses                                     (22.19%)
           289,377      L1-dcache-store-misses                                        (22.18%)
           866,805      L1-icache-load-misses                                         (33.39%)
     4,420,780,584      branch-instructions                                           (44.48%)
        60,781,856      branch-misses             #    1.37% of all branches          (44.45%)

       5.484984316 seconds time elapsed
```
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 20.24      0.36     0.36 56956357     0.00     0.00  subtract_vector
 14.62      0.62     0.26 10598450     0.00     0.00  normalize
 10.12      0.80     0.18 69646433     0.00     0.00  dot_product
  9.84      0.98     0.18 31410180     0.00     0.00  multiply_vector
  8.43      1.13     0.15  4620625     0.00     0.00  ray_hit_object
  7.31      1.26     0.13 17836094     0.00     0.00  add_vector
  6.75      1.38     0.12 17821809     0.00     0.00  cross_product
  6.75      1.50     0.12 13861875     0.00     0.00  rayRectangularIntersection
  6.18      1.61     0.11 13861875     0.00     0.00  raySphereIntersection
  1.97      1.64     0.04  4221152     0.00     0.00  multiply_vectors
  1.69      1.67     0.03  2110576     0.00     0.00  localColor
  1.69      1.70     0.03        1     0.03     1.78  raytracing
  1.41      1.73     0.03  1048576     0.00     0.00  ray_color
  1.12      1.75     0.02  1048576     0.00     0.00  rayConstruction
```
Although the time went down, the call counts are unchanged, so the effect is not significant.

==Results after flattening (unrolling) every function in math-toolkit==
```
Done!
Execution time of raytracing() : 2.046054 sec

 Performance counter stats for './raytracing':

            65,528      cache-misses              #   38.347 % of all cache refs      (44.32%)
           170,881      cache-references                                              (44.32%)
         2,557,996      L1-dcache-load-misses     #    0.05% of all L1-dcache hits    (44.32%)
     5,090,645,082      L1-dcache-loads                                               (44.16%)
           522,613      L1-dcache-prefetch-misses                                     (22.41%)
            95,561      L1-dcache-store-misses                                        (22.36%)
           223,423      L1-icache-load-misses                                         (33.48%)
       972,075,387      branch-instructions                                           (44.55%)
         6,118,594      branch-misses             #    0.63% of all branches          (44.36%)

       2.047843058 seconds time elapsed
```
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 19.09      0.29     0.29 10598450     0.00     0.00  normalize
 13.82      0.50     0.21 13861875     0.00     0.00  rayRectangularIntersection
 12.18      0.69     0.19 56956357     0.00     0.00  subtract_vector
  9.87      0.84     0.15 69646433     0.00     0.00  dot_product
  7.24      0.95     0.11  4620625     0.00     0.00  ray_hit_object
  6.58      1.05     0.10 13861875     0.00     0.00  raySphereIntersection
  5.60      1.13     0.09 17821809     0.00     0.00  cross_product
  5.27      1.21     0.08  1048576     0.00     0.00  ray_color
  4.94      1.29     0.08 31410180     0.00     0.00  multiply_vector
  4.94      1.36     0.08 17836094     0.00     0.00  add_vector
  1.97      1.39     0.03  4221152     0.00     0.00  multiply_vectors
  1.97      1.42     0.03  1241598     0.00     0.00  refraction
  1.97      1.45     0.03        1     0.03     1.52  raytracing
```
Although the render time dropped by about 3 seconds, it is still about 1.4 seconds slower than the compiler-optimized build.

### Graphics processing
This program mainly renders the points of an image and performs vector calculations. Based on the hints and on the direction taken in other classmates' notes, the improvement is probably related to parallel computing, so I started studying the following:

>> "render" was translated as 「描繪」 in Taiwan 20 years ago (in the graphics field); please respect our technical tradition [name="jserv"]

>> Fixed!
>> [name="chihsiang"]

* Instruction sets supported by the CPU
* OpenMP
* pthread

#### Method 1. CPU instruction stream & data stream
I collected related [material](https://hackmd.io/IYIwJgLAxgpgzANgLQEZ4iRArABmcBELJYMBADgCYB2OYcyLIA==).
* Query the instruction sets supported by the CPU with `cat /proc/cpuinfo`
```shell=
model name: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
```
According to the flags, the CPU supports SSE, SSE2, SSE4.1, SSE4.2, SSSE3, AVX, and MMX.

:::info
Common SIMD extensions on vector-capable processors include VIS, MMX, SSE, AltiVec, and AVX.
:::

> There are many other flags beyond these that I would like to explore and understand.

#### Improving the program with AVX instructions
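One caveat when applying AVX to the three-element vectors used in this program: a plain 256-bit load such as `_mm256_loadu_pd` reads four doubles, one more than a `double[3]` holds. The following is a minimal sketch of a masked-load variant, not the code measured in this note. The function name `dot3` and the scalar fallback for builds without `-mavx` are assumptions made for illustration.

```c
#if defined(__AVX__)
#include <immintrin.h>

/* Dot product of two 3-element vectors using masked 256-bit loads,
 * so that the fourth (nonexistent) element is never read. */
static inline double dot3(const double *v1, const double *v2)
{
    /* Arguments are given from lane 3 down to lane 0.
     * Lanes 0..2 have the top bit set (loaded); lane 3 is zero (skipped). */
    const __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);

    __m256d a = _mm256_maskload_pd(v1, mask);   /* lane 3 becomes 0.0 */
    __m256d b = _mm256_maskload_pd(v2, mask);
    __m256d prod = _mm256_mul_pd(a, b);

    /* Horizontal sum via memory, mirroring the style used in this note. */
    double out[4];
    _mm256_storeu_pd(out, prod);
    return out[0] + out[1] + out[2];
}
#else
/* Scalar fallback so the sketch still builds without -mavx. */
static inline double dot3(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
#endif
```

For vectors `{1, 2, 3}` and `{4, 5, 6}` this returns 32. The masked load removes the out-of-bounds read, but the result still has to be summed across lanes for every call, so it should not be expected to beat the scalar version on its own.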
[Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5210,2791,3152,4583,565,5491,4673,4929,2271,4591,3129,4591,100,3290,5145,3290,5145,3129,100,100,5194,5145,5194,5205,3289,5194,5216,3129,3290,4591,4591,3290,4591,3129,5216,5194,5145,100,5145,3290,5194,5216,3289,1686,253,5682,694,572,570,4602,4591,4602,4680,4602,4642,4591,3289,5210,3151,3278,2276,2754,3290,3291,3290,3292,1383,100,100,4680,4602,4602,4599,4642,3129,4602,4602,4599,3151,4591,4695,4673,3279,4642,3279,3129,3279,4591,487,5145,4929,487,2754,100,5268,4602,4929,3594,3649,4680,2754&techs=AVX)
[Detailed introduction to AVX](https://software.intel.com/sites/default/files/m/d/4/1/d/8/Intro_to_Intel_AVX.pdf)

Using the intrinsics mainly requires `#include <immintrin.h>` and the `-mavx` compiler flag.

* dot_product
```c=
/* ADDRESS_HI / ADDRESS_LOW are macros from my code (definitions not shown).
 * A masked store writes a lane only when the top bit of its mask element is set. */
static inline double dot_product(const double *v1, const double *v2)
{
    double out[4];
    /* Arguments run from lane 3 down to lane 0: lane 3 is masked off. */
    __m256i mask = _mm256_set_epi64x(ADDRESS_LOW, ADDRESS_HI, ADDRESS_HI, ADDRESS_HI);
    /* Note: each load reads 4 doubles, one past the end of a 3-element vector. */
    __m256d c = _mm256_loadu_pd(v1);
    __m256d d = _mm256_loadu_pd(v2);
    __m256d dst = _mm256_mul_pd(c, d);
    _mm256_maskstore_pd(&out[0], mask, dst);
    return out[0] + out[1] + out[2];
}
```
* add_vector
```c=
static inline void add_vector(const double *a, const double *b, double *out)
{
    __m256i mask = _mm256_set_epi64x(ADDRESS_LOW, ADDRESS_HI, ADDRESS_HI, ADDRESS_HI);
    __m256d c = _mm256_loadu_pd(a);
    __m256d d = _mm256_loadu_pd(b);
    __m256d dst = _mm256_add_pd(c, d);
    /* Only lanes 0..2 are written back, so out[3] is never touched. */
    _mm256_maskstore_pd(out, mask, dst);
}
```
* Analysis: I mainly rewrote the few functions with the highest call counts.
* Execution time
```
# Rendering scene
Done!
Execution time of raytracing() : 7.707085 sec
```
==About five seconds slower than the original.==
* gprof
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 27.34      1.56     1.56 69646433     0.00     0.00  dot_product
 16.70      2.51     0.95 56956357     0.00     0.00  subtract_vector
 16.53      3.45     0.94 31410180     0.00     0.00  multiply_vector
 10.46      4.04     0.60 17821809     0.00     0.00  cross_product
  6.68      4.42     0.38 10598450     0.00     0.00  normalize
  4.48      4.68     0.26 13861875     0.00     0.00  rayRectangularIntersection
  4.40      4.93     0.25 17836094     0.00     0.00  add_vector
  3.96      5.15     0.23 13861875     0.00     0.00  raySphereIntersection
  2.29      5.28     0.13  3838091     0.00     0.00  length
  1.58      5.37     0.09  2110576     0.00     0.00  compute_specular_diffuse
  1.41      5.45     0.08  4221152     0.00     0.00  multiply_vectors
  1.23      5.52     0.07  1048576     0.00     0.00  ray_color
  1.23      5.59     0.07  4620625     0.00     0.00  ray_hit_object
```
==The execution time of almost all of the top ten functions went up.== A possible reason is that moving the data into registers does not make effective use of the 256-bit width, and almost every rewritten function needs a conditional (the mask) so that only the bits up to position 191 are processed, which prevents any speedup.

* perf measurement
```
Execution time of raytracing() : 6.480361 sec

 Performance counter stats for './raytracing':

            70,015      cache-misses              #   29.775 % of all cache refs      (44.46%)
           235,143      cache-references                                              (44.46%)
         3,330,995      L1-dcache-load-misses     #    0.04% of all L1-dcache hits    (44.46%)
     7,825,703,400      L1-dcache-loads                                               (44.35%)
           631,623      L1-dcache-prefetch-misses                                     (22.22%)
           218,518      L1-dcache-store-misses                                        (22.26%)
         3,302,188      L1-icache-load-misses                                         (33.38%)
     1,039,523,858      branch-instructions                                           (44.48%)
         6,161,239      branch-misses             #    0.59% of all branches          (44.45%)

       6.482253263 seconds time elapsed
```
cache-misses also went up, while the other counters went down; branch-misses in particular dropped by ==1,400,000==.

* inline attribute
```
# Rendering scene
Done!
Execution time of raytracing() : 6.181082 sec

# Rendering scene
Done!
Execution time of raytracing() : 6.059673 sec
```
Across two runs the rendering time improved by only ==0.3 - 0.5== seconds, which is not much.

:::info
Summary: the AVX SIMD instructions did not improve performance as expected; performance actually got worse. The direction of the improvement may also have been wrong: applying parallel operations directly, without first arranging the data properly, does not produce a speedup.
:::

#### Method 2. OpenMP
[Introduction to OpenMP](https://zh.wikipedia.org/wiki/OpenMP)
Following [HyHhgcv6's note](https://hackmd.io/s/HyHhgcv6#raytracing)

Code that is to be parallelized must be able to run independently, and the less it depends on shared resources, the better suited it is. I tried parallelizing certain parts.

* raytracing#470
```clike=
#pragma omp parallel for private( d, stk, object_color)
for (int j = 0; j < height; j++) {
    for (int i = 0; i < width; i++) {
        double r = 0, g = 0, b = 0;
        /* MSAA */
        for (int s = 0; s < SAMPLES; s++) {
            idx_stack_init(&stk);
            rayConstruction(d, u, v, w,
                            i * factor + s / factor,
                            j * factor + s % factor,
                            view,
                            width * factor, height * factor);
            if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                          lights, object_color,
                          MAX_REFLECTION_BOUNCES)) {
                r += object_color[0];
                g += object_color[1];
                b += object_color[2];
            } else {
                r += background_color[0];
                g += background_color[1];
                b += background_color[2];
            }
            pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
        }
    }
}
```
I tried each approach from the other notes and read up on the basic OpenMP API. The only approach that worked was to parallelize the main function while giving each thread its own private variables; only then did the execution time drop effectively.
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 19.83      0.22     0.22  8820537     0.00     0.00  subtract_vector
 16.68      0.41     0.19 10411709     0.00     0.00  dot_product
 10.37      0.52     0.12  2554579     0.00     0.00  cross_product
  8.11      0.61     0.09  1620293     0.00     0.00  normalize
  7.21      0.69     0.08   846032     0.00     0.00  ray_hit_object
  6.31      0.76     0.07  4884684     0.00     0.00  multiply_vector
  6.31      0.83     0.07  2027362     0.00     0.00  rayRectangularIntersection
  4.51      0.88     0.05  1958003     0.00     0.00  raySphereIntersection
  3.61      0.92     0.04   278657     0.00     0.00  compute_specular_diffuse
  3.61      0.96     0.04   207204     0.00     0.00  protect_color_overflow
  3.61      1.00     0.04   128510     0.00     0.00  ray_color
  2.25      1.03     0.03  2468484     0.00     0.00  add_vector
  2.25      1.05     0.03   170954     0.00     0.00  rayConstruction
  1.35      1.07     0.02   624307     0.00     0.00  length
  0.90      1.08     0.01   505867     0.00     0.00  multiply_vectors
  0.90      1.09     0.01   375776     0.00     0.00  idx_stack_top
  0.90      1.10     0.01   373556     0.00     0.00  localColor
  0.90      1.11     0.01        1     0.01     1.11  raytracing
  0.45      1.11     0.01        1     0.01     0.01  calculateBasisVectors
  0.00      1.11     0.00   418656     0.00     0.00  idx_stack_empty
  0.00      1.11     0.00   206896     0.00     0.00  refraction
  0.00      1.11     0.00   186036     0.00     0.00  reflection
  0.00      1.11     0.00   171468     0.00     0.00  idx_stack_push
  0.00      1.11     0.00   130055     0.00     0.00  idx_stack_init
  0.00      1.11     0.00    30970     0.00     0.00  fresnel
  0.00      1.11     0.00    10998     0.00     0.00  idx_stack_pop
  0.00      1.11     0.00        3     0.00     0.00  append_rectangular
  0.00      1.11     0.00        3     0.00     0.00  append_sphere
  0.00      1.11     0.00        2     0.00     0.00  append_light
  0.00      1.11     0.00        1     0.00     0.00  delete_light_list
  0.00      1.11     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      1.11     0.00        1     0.00     0.00  delete_sphere_list
  0.00      1.11     0.00        1     0.00     0.00  diff_in_second
  0.00      1.11     0.00        1     0.00     0.00  write_to_ppm
```
:::warning
OpenMP summary: I ran into many difficulties while learning OpenMP, recorded below.
* From the reading material ([Guide](https://computing.llnl.gov/tutorials/openMP/), [OpenMP article series](http://blog.csdn.net/donhao/article/details/5651156)) it was hard to work out where OpenMP is suitable to apply.
* When reading the raytracing code, the first task is to find the parts that can run independently. My attempts with the `parallel for` syntax all produced wrong output because the contents of variables interfered with each other. A more systematic, data-driven study of the execution is needed.
:::

* -O0
```
# Rendering scene
Done!
Execution time of raytracing() : 0.588355 sec
convert out.ppm out.png
```
* -Ofast
```
# Rendering scene
Done!
Execution time of raytracing() : 0.206408 sec
convert out.ppm out.png
```
Even after the OpenMP speedup there is still room to go faster; the further the time is to be reduced, the more data analysis seems to be needed.

#### Method 3. pthread
I first summarized basic [pthread](https://hackmd.io/BwFgxgRsCmAMIFpoFYCc0EgCbgRATAGwDMChwsqYWyNqAjMEA===?view) usage. Using only `pthread_create`, I again chose the rendering part as the code to improve.

* single thread
    * Step 1: Create a parameter struct and pass a pointer to it, to carry the arguments of the called function
    * Step 2: Change the rendering function so that it takes this pointer as its parameter
    * Step 3: Modify the main program to create a thread that runs it
    * Step 4: Join the thread back into the main thread
```c=
//raytracing.h
typedef struct __RAY_DETAIL {
    uint8_t *pixels;
    color background_color;
    rectangular_node rectangulars;
    sphere_node spheres;
    light_node lights;
    const viewpoint *view;
    int width;
    int height;
} raydetail;
```
```c=
//raytracing.c
raydetail *set_raydetail(uint8_t *pixels, color background_color,
                         rectangular_node rectangulars, sphere_node spheres,
                         light_node lights, const viewpoint *view,
                         int width, int height)
{
    raydetail *detail = (raydetail *) malloc(sizeof(raydetail));
    detail->pixels = pixels;
    detail->background_color = background_color;
    detail->rectangulars = rectangulars;
    detail->spheres = spheres;
    detail->lights = lights;
    detail->view = view;
    detail->width = width;
    detail->height = height;
    return detail;
}

/* Thread entry point: pthread_create expects void *(*)(void *).
 * The parameter is named arg so that it does not shadow the raydetail type. */
void *raytracing(void *arg)
{
    raydetail *detail = (raydetail *) arg;
    ...
    return NULL;
}
```
* multiple thread

Following [yenWu's note](https://embedded2016.hackpad.com/ep/pad/static/wOu40KzMaIP), using the thread index to decide which rows each thread renders is a good approach.
    * Step 1: Add a way to specify the number of threads
    * Step 2: Extend the raydetail struct to also store the thread index
    * Step 3: Modify the loop condition of the rendering function
```
Please input the thread num: 2
# Rendering scene
Done!
Execution time of raytracing() : 2.080394 sec
===============================
Please input the thread num: 4
# Rendering scene
Done!
Execution time of raytracing() : 2.252935 sec
===============================
Please input the thread num: 8
# Rendering scene
Done!
Execution time of raytracing() : 2.557829 sec
===============================
Please input the thread num: 16
# Rendering scene
Done!
Execution time of raytracing() : 3.193906 sec
===============================
Please input the thread num: 32
# Rendering scene
Done!
Execution time of raytracing() : 3.801572 sec
===============================
Please input the thread num: 64
# Rendering scene
Done!
Execution time of raytracing() : 3.972577 sec
===============================
Please input the thread num: 128
# Rendering scene
Done!
Execution time of raytracing() : 4.187006 sec
===============================
Please input the thread num: 256
# Rendering scene
Done!
Execution time of raytracing() : 3.965714 sec
===============================
Please input the thread num: 512
# Rendering scene
Done!
Execution time of raytracing() : 3.565197 sec
===============================
Please input the thread num: 1024
# Rendering scene
Done!
Execution time of raytracing() : 3.493345 sec
```
* gprof
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 12.33      0.18     0.18 31410180     0.01     0.01  multiply_vector
 11.27      0.34     0.16 69646433     0.00     0.00  dot_product
 11.27      0.50     0.16 13861875     0.01     0.03  rayRectangularIntersection
 10.57      0.65     0.15 10598453     0.01     0.01  normalize
  9.86      0.79     0.14  4620625     0.03     0.17  ray_hit_object
  9.51      0.92     0.14 56956357     0.00     0.00  subtract_vector
  6.34      1.01     0.09 13861875     0.01     0.01  raySphereIntersection
  6.34      1.10     0.09  1048576     0.09     1.26  ray_color
  4.58      1.17     0.07 17821811     0.00     0.00  cross_product
  4.58      1.23     0.07 17836094     0.00     0.00  add_vector
  4.23      1.29     0.06  2110576     0.03     0.07  localColor
  2.82      1.33     0.04                             raytracing
  1.41      1.35     0.02  3838091     0.01     0.01  length
  1.41      1.37     0.02  2110576     0.01     0.09  compute_specular_diffuse
  0.70      1.38     0.01  4221152     0.00     0.00  multiply_vectors
  0.70      1.39     0.01  2520791     0.00     0.00  idx_stack_top
  0.70      1.40     0.01  1241598     0.01     0.01  protect_color_overflow
  0.70      1.41     0.01  1241598     0.01     0.02  reflection
  0.70      1.42     0.01  1048576     0.01     0.05  rayConstruction
  0.00      1.42     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      1.42     0.00  1241598     0.00     0.00  refraction
  0.00      1.42     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      1.42     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      1.42     0.00   113297     0.00     0.01  fresnel
  0.00      1.42     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      1.42     0.00        3     0.00     0.00  append_rectangular
  0.00      1.42     0.00        3     0.00     0.00  append_sphere
  0.00      1.42     0.00        2     0.00     0.00  append_light
  0.00      1.42     0.00        2     0.00     0.05  calculateBasisVectors
  0.00      1.42     0.00        2     0.00     0.00  set_raydetail
  0.00      1.42     0.00        1     0.00     0.00  delete_light_list
  0.00      1.42     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      1.42     0.00        1     0.00     0.00  delete_sphere_list
  0.00      1.42     0.00        1     0.00     0.00  diff_in_second
```
From my data, threads can indeed run separately and divide the work, but the best time is only ==0.8 seconds== faster than the original, which is not a significant gain. There is clearly still much to improve in the program; partitioning the work so that each thread handles only its own portion might do better ==to be added later==.

### Conclusion and comparison
![](https://i.imgur.com/snD8Dto.png)

In my tests, the OpenMP version is currently the most efficient, while the direct SIMD (AVX) rewrite is the slowest.

### Supplement
* Loop unrolling
> Loop unrolling (also called loop unwinding) is an optimization that trades program size for faster execution.

Example:
```
/* Before */
for (i = 1; i <= 60; i++)
    a[i] = a[i] * b + c;

/* ================================== */
/* After: unrolled by 3 */
for (i = 1; i <= 60; i+=3) {
    a[i] = a[i] * b + c;
    a[i+1] = a[i+1] * b + c;
    a[i+2] = a[i+2] * b + c;
}
```
* Advantages
    * Fewer branch mispredictions
    * If the statements in the loop body have no data dependences, there are more opportunities for concurrent execution
    * Unrolling can be done dynamically at run time, covering cases that cannot be known at compile time
* Disadvantages
    * Code bloat
    * Reduced code readability, unless the compiler performs the unrolling transparently
    * Recursion inside the loop body may reduce the benefit of unrolling

### References
* Online resources
    * [Wiki Loop unrolling](https://zh.wikipedia.org/wiki/%E5%BE%AA%E7%8E%AF%E5%B1%95%E5%BC%80)
    * [Loop unrolling with adam](http://blog.teamleadnet.com/2012/02/code-unwinding-performance-is-far-away.html)
    * [Branch predictor](https://zh.wikipedia.org/wiki/%E5%88%86%E6%94%AF%E9%A0%90%E6%B8%AC%E5%99%A8)
* Notes
    * [RNIC](https://embedded2016.hackpad.com/2016q1-Homework-2A-GalzL151aZc)
    * [yenWu](https://embedded2016.hackpad.com/ep/pad/static/wOu40KzMaIP)
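
To make the loop unrolling note in the supplement concrete, here is a small self-contained C sketch that checks an unrolled loop against its rolled form. The function names `axpb_rolled`, `axpb_unrolled3`, and `unrolling_matches` are hypothetical, the indices are 0-based rather than 1-based, and the unrolled version assumes that the length is a multiple of 3.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one multiply-add and one loop test per element. */
static void axpb_rolled(double *a, size_t n, double b, double c)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * b + c;
}

/* Unrolled by 3: one loop test per three elements.
 * Assumes n is a multiple of 3, as in the 60-element example. */
static void axpb_unrolled3(double *a, size_t n, double b, double c)
{
    assert(n % 3 == 0);
    for (size_t i = 0; i < n; i += 3) {
        a[i]     = a[i]     * b + c;
        a[i + 1] = a[i + 1] * b + c;
        a[i + 2] = a[i + 2] * b + c;
    }
}

/* Returns 1 when both variants give identical results on n elements. */
static int unrolling_matches(size_t n)
{
    double x[60], y[60];
    assert(n <= 60 && n % 3 == 0);
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] = (double) i;

    axpb_rolled(x, n, 2.0, 1.0);
    axpb_unrolled3(y, n, 2.0, 1.0);

    for (size_t i = 0; i < n; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

Calling `unrolling_matches(60)` returns 1. For a general length, an unrolled loop also needs a short remainder loop for the last `n % 3` elements, which is part of the code-size cost listed among the disadvantages.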