contributed by <ChiHsiang>
Fill in your GitHub account here. jserv
etc276
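Shared notes need to be divided into sections. For example, # is used for the title and ### for each topic. The goal is to let readers know exactly where they are and to skip around while reading. Too many ### headings, however, leave readers unsure where to start. See my notes for reference (although they also have plenty of room for improvement).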
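OpenMP is, in my view, a fairly mature API. At first I only read about it without trying to write any code, but after consulting other people's notes and GitHub repositories I found the basic usage quite easy: adding a few lines such as #pragma... and changing the compile command in the Makefile is enough for a first round of optimization. I suggest spending some time putting it into practice and recording the results in these notes.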
Commits are too frequent and the commit messages are not specific (for example, "Added some more math functions"). I suggest accumulating a meaningful amount of change before each commit.
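In physics, ray tracing can be used to calculate how a beam of light propagates through a medium. While propagating, the beam may be absorbed by the medium, change direction, or exit through the surface of the medium. We handle this complexity by computing how idealized narrow beams (rays) pass through the medium.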
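Profiling shows which functions consume most of the execution time and how many times each function is called. This view of the whole run helps to quickly pinpoint the parts of the program that perform poorly.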
gprof execute_program | less
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
28.96 1.01 1.01 69646433 0.00 0.00 dot_product
17.20 1.61 0.60 56956357 0.00 0.00 subtract_vector
11.04 2.00 0.39 31410180 0.00 0.00 multiply_vector
7.45 2.26 0.26 13861875 0.00 0.00 rayRectangularIntersection
6.88 2.50 0.24 17836094 0.00 0.00 add_vector
5.73 2.70 0.20 13861875 0.00 0.00 raySphereIntersection
5.16 2.88 0.18 17821809 0.00 0.00 cross_product
5.16 3.06 0.18 4620625 0.00 0.00 ray_hit_object
4.87 3.23 0.17 10598450 0.00 0.00 normalize
1.72 3.29 0.06 1048576 0.00 0.00 ray_color
1.58 3.34 0.06 4221152 0.00 0.00 multiply_vectors
compiler -O0
Execution time of raytracing() : 5.855616 sec
Performance counter stats for './raytracing':
51,790 cache-misses # 15.905 % of all cache refs (44.34%)
325,622 cache-references (44.39%)
4,099,177 L1-dcache-load-misses # 0.03% of all L1-dcache hits (44.55%)
13,776,423,601 L1-dcache-loads (44.46%)
856,528 L1-dcache-prefetch-misses (22.28%)
298,322 L1-dcache-store-misses (22.27%)
884,842 L1-icache-load-misses (33.35%)
4,784,882,802 branch-instructions (44.41%)
75,971,367 branch-misses # 1.59% of all branches (44.35%)
5.857297119 seconds time elapsed
compiler -Ofast
Execution time of raytracing() : 0.654530 sec
Performance counter stats for './raytracing':
30,146 cache-misses # 38.650 % of all cache refs (44.72%)
77,997 cache-references (45.05%)
2,112,889 L1-dcache-load-misses # 0.29% of all L1-dcache hits (45.10%)
738,721,430 L1-dcache-loads (44.02%)
441,622 L1-dcache-prefetch-misses (21.98%)
52,570 L1-dcache-store-misses (22.67%)
93,843 L1-icache-load-misses (33.79%)
258,535,076 branch-instructions (44.78%)
784,490 branch-misses # 0.30% of all branches (44.50%)
0.656329191 seconds time elapsed
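The effect of compiler optimization is obvious: every counter drops substantially, most of them by more than a factor of two, and the execution time falls from about 5.86 s to about 0.65 s. This suggests that many parts of the program are not yet optimized.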
31.10% raytracing [kernel.kallsyms] [k] clear_page_c_e
14.72% raytracing [kernel.kallsyms] [k] get_page_from_freelist
12.13% raytracing [kernel.kallsyms] [k] get_mem_cgroup_from_mm
11.10% raytracing [kernel.kallsyms] [k] copy_page
9.72% raytracing [kernel.kallsyms] [k] __alloc_pages_nodemask
8.74% raytracing ld-2.23.so [.] dl_main
4.87% raytracing [kernel.kallsyms] [k] anon_vma_prepare
2.23% raytracing libm-2.23.so [.] __ieee754_pow_sse2
2.10% raytracing [kernel.kallsyms] [k] enqueue_entity
1.54% raytracing [kernel.kallsyms] [k] mem_cgroup_try_charge
1.06% raytracing [kernel.kallsyms] [k] commit_creds
0.41% raytracing [kernel.kallsyms] [k] handle_mm_fault
0.28% perf [kernel.kallsyms] [k] perf_event_addr_filters_exec
0.02% perf [kernel.kallsyms] [k] perf_ctx_unlock
The proportion of misses is not high, so the poor performance is probably not caused mainly by the caches.
gprof ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 2.835277 sec
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
19.69 0.49 0.49 69646433 0.00 0.00 dot_product
18.49 0.95 0.46 56956357 0.00 0.00 subtract_vector
11.65 1.24 0.29 13861875 0.00 0.00 rayRectangularIntersection
8.84 1.46 0.22 10598450 0.00 0.00 normalize
6.43 1.62 0.16 17836094 0.00 0.00 add_vector
6.43 1.78 0.16 17821809 0.00 0.00 cross_product
6.03 1.93 0.15 31410180 0.00 0.00 multiply_vector
5.22 2.06 0.13 13861875 0.00 0.00 raySphereIntersection
2.81 2.13 0.07 4620625 0.00 0.00 ray_hit_object
2.81 2.20 0.07 1 0.07 2.49 raytracing
2.41 2.26 0.06 1048576 0.00 0.00 ray_color
1.61 2.30 0.04 1048576 0.00 0.00 rayConstruction
1.21 2.33 0.03 4221152 0.00 0.00 multiply_vectors
1.21 2.36 0.03 2110576 0.00 0.00 localColor
# Rendering scene
Done!
Execution time of raytracing() : 0.651282 sec
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
56.14 0.23 0.23 4620625 49.81 49.81 ray_hit_object
21.97 0.32 0.09 2110576 42.67 42.67 compute_specular_diffuse
14.64 0.38 0.06 raytracing
7.32 0.41 0.03 592239 50.69 338.09 ray_color
0.00 0.41 0.00 2110576 0.00 0.00 localColor
0.00 0.41 0.00 1241598 0.00 0.00 refraction
The analysis starts with the top three functions in terms of time spent and number of calls.
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}
--> loop unrolling
double dot_product(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
======================================================
double scalar_triple(const double *u, const double *v, const double *w)
{
    double tmp[3];
    cross_product(w, u, tmp);
    return dot_product(v, tmp);
}
Differences
⚡ gcc -O0 -Wall -std=gnu99 -c math-toolkit.c -o math_tool.o
⚡ objdump -D -S ./math.o | less
000000000000041b <dot_product>:
41b: 55 push %rbp
41c: 48 89 e5 mov %rsp,%rbp
41f: 48 89 7d f8 mov %rdi,-0x8(%rbp)
423: 48 89 75 f0 mov %rsi,-0x10(%rbp)
427: 48 8b 45 f8 mov -0x8(%rbp),%rax
42b: f2 0f 10 08 movsd (%rax),%xmm1
42f: 48 8b 45 f0 mov -0x10(%rbp),%rax
433: f2 0f 10 00 movsd (%rax),%xmm0
437: f2 0f 59 c8 mulsd %xmm0,%xmm1
43b: 48 8b 45 f8 mov -0x8(%rbp),%rax
43f: 48 83 c0 08 add $0x8,%rax
443: f2 0f 10 10 movsd (%rax),%xmm2
447: 48 8b 45 f0 mov -0x10(%rbp),%rax
44b: 48 83 c0 08 add $0x8,%rax
44f: f2 0f 10 00 movsd (%rax),%xmm0
453: f2 0f 59 c2 mulsd %xmm2,%xmm0
457: f2 0f 58 c8 addsd %xmm0,%xmm1
45b: 48 8b 45 f8 mov -0x8(%rbp),%rax
45f: 48 83 c0 10 add $0x10,%rax
463: f2 0f 10 10 movsd (%rax),%xmm2
467: 48 8b 45 f0 mov -0x10(%rbp),%rax
46b: 48 83 c0 10 add $0x10,%rax
46f: f2 0f 10 00 movsd (%rax),%xmm0
473: f2 0f 59 c2 mulsd %xmm2,%xmm0
477: f2 0f 58 c1 addsd %xmm1,%xmm0
47b: 5d pop %rbp
47c: c3 retq
⚡ gcc -Ofast -Wall -std=gnu99 -c math-toolkit.c -o math_tool.o
⚡ objdump -D -S ./math.o | less
00000000000001d0 <dot_product>:
1d0: f2 0f 10 07 movsd (%rdi),%xmm0
1d4: f2 0f 10 0e movsd (%rsi),%xmm1
1d8: f2 0f 59 c8 mulsd %xmm0,%xmm1
1dc: f2 0f 10 47 08 movsd 0x8(%rdi),%xmm0
1e1: f2 0f 59 46 08 mulsd 0x8(%rsi),%xmm0
1e6: f2 0f 58 c1 addsd %xmm1,%xmm0
1ea: f2 0f 10 4f 10 movsd 0x10(%rdi),%xmm1
1ef: f2 0f 59 4e 10 mulsd 0x10(%rsi),%xmm1
1f4: f2 0f 58 c1 addsd %xmm1,%xmm0
1f8: c3 retq
1f9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
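Looking at the assembly produced by an online compiler, loop unrolling removes many jmp, jg, and pxor instructions, leaving mostly mov and add, so the code is also much simpler to execute.
With the local gcc, however, the number of mov and add instructions drops as well; the exact reason still needs to be investigated.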
Execution time of raytracing() : 5.483000 sec
Performance counter stats for './raytracing':
32,812 cache-misses # 11.371 % of all cache refs (44.48%)
288,557 cache-references (44.52%)
4,107,611 L1-dcache-load-misses # 0.03% of all L1-dcache hits (44.56%)
12,598,272,040 L1-dcache-loads (44.44%)
877,960 L1-dcache-prefetch-misses (22.19%)
289,377 L1-dcache-store-misses (22.18%)
866,805 L1-icache-load-misses (33.39%)
4,420,780,584 branch-instructions (44.48%)
60,781,856 branch-misses # 1.37% of all branches (44.45%)
5.484984316 seconds time elapsed
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
20.24 0.36 0.36 56956357 0.00 0.00 subtract_vector
14.62 0.62 0.26 10598450 0.00 0.00 normalize
10.12 0.80 0.18 69646433 0.00 0.00 dot_product
9.84 0.98 0.18 31410180 0.00 0.00 multiply_vector
8.43 1.13 0.15 4620625 0.00 0.00 ray_hit_object
7.31 1.26 0.13 17836094 0.00 0.00 add_vector
6.75 1.38 0.12 17821809 0.00 0.00 cross_product
6.75 1.50 0.12 13861875 0.00 0.00 rayRectangularIntersection
6.18 1.61 0.11 13861875 0.00 0.00 raySphereIntersection
1.97 1.64 0.04 4221152 0.00 0.00 multiply_vectors
1.69 1.67 0.03 2110576 0.00 0.00 localColor
1.69 1.70 0.03 1 0.03 1.78 raytracing
1.41 1.73 0.03 1048576 0.00 0.00 ray_color
1.12 1.75 0.02 1048576 0.00 0.00 rayConstruction
Although the time went down, the number of calls is unchanged, so the effect is not significant.
Results after flattening every function in math-toolkit
Done!
Execution time of raytracing() : 2.046054 sec
Performance counter stats for './raytracing':
65,528 cache-misses # 38.347 % of all cache refs (44.32%)
170,881 cache-references (44.32%)
2,557,996 L1-dcache-load-misses # 0.05% of all L1-dcache hits (44.32%)
5,090,645,082 L1-dcache-loads (44.16%)
522,613 L1-dcache-prefetch-misses (22.41%)
95,561 L1-dcache-store-misses (22.36%)
223,423 L1-icache-load-misses (33.48%)
972,075,387 branch-instructions (44.55%)
6,118,594 branch-misses # 0.63% of all branches (44.36%)
2.047843058 seconds time elapsed
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
19.09 0.29 0.29 10598450 0.00 0.00 normalize
13.82 0.50 0.21 13861875 0.00 0.00 rayRectangularIntersection
12.18 0.69 0.19 56956357 0.00 0.00 subtract_vector
9.87 0.84 0.15 69646433 0.00 0.00 dot_product
7.24 0.95 0.11 4620625 0.00 0.00 ray_hit_object
6.58 1.05 0.10 13861875 0.00 0.00 raySphereIntersection
5.60 1.13 0.09 17821809 0.00 0.00 cross_product
5.27 1.21 0.08 1048576 0.00 0.00 ray_color
4.94 1.29 0.08 31410180 0.00 0.00 multiply_vector
4.94 1.36 0.08 17836094 0.00 0.00 add_vector
1.97 1.39 0.03 4221152 0.00 0.00 multiply_vectors
1.97 1.42 0.03 1241598 0.00 0.00 refraction
1.97 1.45 0.03 1 0.03 1.52 raytracing
Although the render time dropped by about 3 seconds, it is still about 1.4 seconds slower than the compiler-optimized build.
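As a rough illustration of the flattening step, the sketch below assumes that "flatten" means replacing each fixed three-iteration loop with straight-line code, as was done for dot_product earlier. The add_vector signature follows the one used elsewhere in these notes; the subtract_vector and multiply_vector signatures, and the subtract_vector_loop reference version, are assumptions made for this example.

```c
#include <assert.h>

/* Loop form, kept only as a reference (assumed shape of the original). */
void subtract_vector_loop(const double *a, const double *b, double *out)
{
    assert(a && b && out);
    for (int i = 0; i < 3; i++)
        out[i] = a[i] - b[i];
}

/* Flattened form: the three iterations are written out by hand, so there is
 * no loop counter, comparison, or branch left in the function body. */
void subtract_vector(const double *a, const double *b, double *out)
{
    assert(a && b && out);
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}

void add_vector(const double *a, const double *b, double *out)
{
    assert(a && b && out);
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
}

/* Scales a vector by a scalar (signature assumed for this sketch). */
void multiply_vector(const double *a, double b, double *out)
{
    assert(a && out);
    out[0] = a[0] * b;
    out[1] = a[1] * b;
    out[2] = a[2] * b;
}
```

Flattening removes the loop overhead inside each helper but leaves the function calls themselves in place, which matches the unchanged call counts in the profile above.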
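Because this program mainly renders the points of an image and performs vector arithmetic, the hints and other students' notes suggest that the way to improve it is probably related to parallel computing. I therefore started studying the following:
"Render" was translated as 「描繪」 in Taiwan twenty years ago (in the graphics field); please respect our technical tradition. "jserv"
Fixed!
"chihsiang"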
Instruction sets supported by the CPU
OpenMP
pthread
I collected the related material.
cat /proc/cpuinfo
model name: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3
cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave
avx f16c rdrand lahf_lm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep
erms xsaveopt dtherm ida arat pln pts
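According to the flags above, the supported SIMD extensions include MMX, SSE, SSE2, SSSE3, SSE4.1, SSE4.2, and AVX.
Common SIMD instruction sets on vector processors include VIS, MMX, SSE, AltiVec, and AVX.
There are many other flags besides these that I would like to explore and understand.
I used AVX instructions.
This requires #include <immintrin.h> and adding the -mavx option to the compiler flags.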
static inline
double dot_product(const double *v1, const double *v2)
{
    double out[4];
    /* ADDRESS_HI and ADDRESS_LOW are macros from the author's source that
     * are not shown here. */
    __m256i mask = _mm256_set_epi64x(ADDRESS_LOW, ADDRESS_HI, ADDRESS_HI, ADDRESS_HI);
    /* Each unaligned load reads four doubles, one more than a 3-element
     * vector holds. */
    __m256d c = _mm256_loadu_pd(v1);
    __m256d d = _mm256_loadu_pd(v2);
    __m256d dst = _mm256_mul_pd(c, d);
    /* The masked store writes only the lanes selected by mask. */
    _mm256_maskstore_pd(&out[0], mask, dst);
    return out[0] + out[1] + out[2];
}
static inline
void add_vector(const double *a, const double *b, double *out)
{
    __m256i mask = _mm256_set_epi64x(ADDRESS_LOW, ADDRESS_HI, ADDRESS_HI, ADDRESS_HI);
    __m256d c = _mm256_loadu_pd(a);
    __m256d d = _mm256_loadu_pd(b);
    __m256d dst = _mm256_add_pd(c, d);
    _mm256_maskstore_pd(out, mask, dst);
}
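The AVX versions above load four doubles from vectors that hold three. A possible variant, sketched here under assumptions, uses a masked load so that the fourth lane is never read. The names dot3 and add3, the scalar fallback, and the literal mask values are all hypothetical and stand in for the ADDRESS_HI / ADDRESS_LOW macros, whose definitions are not shown. The sketch targets GCC or Clang on x86 and makes no claim about speed; it only addresses the out-of-bounds read.

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86_SIMD 1
#endif

/* Scalar fallback, used when the CPU (or the architecture) lacks AVX. */
static double dot3_scalar(const double *a, const double *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

#ifdef HAVE_X86_SIMD
/* target("avx") lets this one function use AVX without a global -mavx. */
static __attribute__((target("avx")))
double dot3_avx(const double *a, const double *b)
{
    /* Arguments are ordered lane 3 .. lane 0: lanes 0-2 on, lane 3 off. */
    const __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);
    __m256d x = _mm256_maskload_pd(a, mask);   /* lane 3 is zero-filled */
    __m256d y = _mm256_maskload_pd(b, mask);
    double out[4];
    _mm256_storeu_pd(out, _mm256_mul_pd(x, y));
    return out[0] + out[1] + out[2];
}

static __attribute__((target("avx")))
void add3_avx(const double *a, const double *b, double *out)
{
    const __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);
    __m256d sum = _mm256_add_pd(_mm256_maskload_pd(a, mask),
                                _mm256_maskload_pd(b, mask));
    _mm256_maskstore_pd(out, mask, sum);       /* writes out[0..2] only */
}
#endif

double dot3(const double *a, const double *b)
{
    assert(a && b);
#ifdef HAVE_X86_SIMD
    if (__builtin_cpu_supports("avx"))
        return dot3_avx(a, b);
#endif
    return dot3_scalar(a, b);
}

void add3(const double *a, const double *b, double *out)
{
    assert(a && b && out);
#ifdef HAVE_X86_SIMD
    if (__builtin_cpu_supports("avx")) {
        add3_avx(a, b, out);
        return;
    }
#endif
    for (int i = 0; i < 3; i++)
        out[i] = a[i] + b[i];
}
```

With a masked load and a masked store, both the reads and the writes stay within the three elements of each vector.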
Analysis
I mainly rewrote a few of the most frequently called functions.
# Rendering scene
Done!
Execution time of raytracing() : 7.707085 sec
This is about five seconds slower than the original.
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
27.34 1.56 1.56 69646433 0.00 0.00 dot_product
16.70 2.51 0.95 56956357 0.00 0.00 subtract_vector
16.53 3.45 0.94 31410180 0.00 0.00 multiply_vector
10.46 4.04 0.60 17821809 0.00 0.00 cross_product
6.68 4.42 0.38 10598450 0.00 0.00 normalize
4.48 4.68 0.26 13861875 0.00 0.00 rayRectangularIntersection
4.40 4.93 0.25 17836094 0.00 0.00 add_vector
3.96 5.15 0.23 13861875 0.00 0.00 raySphereIntersection
2.29 5.28 0.13 3838091 0.00 0.00 length
1.58 5.37 0.09 2110576 0.00 0.00 compute_specular_diffuse
1.41 5.45 0.08 4221152 0.00 0.00 multiply_vectors
1.23 5.52 0.07 1048576 0.00 0.00 ray_color
1.23 5.59 0.07 4620625 0.00 0.00 ray_hit_object
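The execution time of nearly all of the top ten functions went up. A possible cause is that the data moved into the registers does not make effective use of the full 256 bits, and that almost every rewritten function needs a conditional check so that only the lower 192 bits (bits 0 through 191) are processed, which prevents any speedup.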
Execution time of raytracing() : 6.480361 sec
Performance counter stats for './raytracing':
70,015 cache-misses # 29.775 % of all cache refs (44.46%)
235,143 cache-references (44.46%)
3,330,995 L1-dcache-load-misses # 0.04% of all L1-dcache hits (44.46%)
7,825,703,400 L1-dcache-loads (44.35%)
631,623 L1-dcache-prefetch-misses (22.22%)
218,518 L1-dcache-store-misses (22.26%)
3,302,188 L1-icache-load-misses (33.38%)
1,039,523,858 branch-instructions (44.48%)
6,161,239 branch-misses # 0.59% of all branches (44.45%)
6.482253263 seconds time elapsed
cache-misses also went up, but everything else went down, especially branch-misses, which dropped by about 1,400,000.
# Rendering scene
Done!
Execution time of raytracing() : 6.181082 sec
# Rendering scene
Done!
Execution time of raytracing() : 6.059673 sec
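In two further runs the render time was only 0.3 to 0.5 seconds faster, which is a poor result.
Summary: the SIMD AVX instructions did not improve performance as expected and actually made it worse. The direction of the optimization may also have been wrong: doing the computation in parallel without first arranging the data properly does not produce a speedup.
Code that is to be parallelized must be able to run independently, and the less the parts depend on shared resources, the better suited they are.
I tried parallelizing some parts of the code.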
#pragma omp parallel for private( d, stk, object_color)
for (int j = 0; j < height; j++) {
    for (int i = 0; i < width; i++) {
        double r = 0, g = 0, b = 0;
        /* MSAA */
        for (int s = 0; s < SAMPLES; s++) {
            idx_stack_init(&stk);
            rayConstruction(d, u, v, w,
                            i * factor + s / factor,
                            j * factor + s % factor,
                            view,
                            width * factor, height * factor);
            if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                          lights, object_color,
                          MAX_REFLECTION_BOUNCES)) {
                r += object_color[0];
                g += object_color[1];
                b += object_color[2];
            } else {
                r += background_color[0];
                g += background_color[1];
                b += background_color[2];
            }
            pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
        }
    }
}
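I tried each approach described in other students' notes and read up on the basic OpenMP API. The only approach that effectively reduced the execution time was to parallelize the loop in the main rendering function while giving each thread its own private copies of the working variables.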
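The role of the private variables can be shown in a smaller setting. The following is a minimal sketch, not the code from the project: shade and render are hypothetical names, and shade merely stands in for the per-pixel work done by ray_color. Variables declared inside the loop body are automatically private to each thread, which has the same effect as listing d, stk, and object_color in the private clause.

```c
#include <assert.h>

/* Hypothetical per-pixel work; it writes into a caller-provided scratch
 * buffer, much like d and object_color in the rendering loop. */
static double shade(int i, int j, double *scratch)
{
    scratch[0] = i * 0.5;
    scratch[1] = j * 0.25;
    scratch[2] = 1.0;
    return scratch[0] + scratch[1] + scratch[2];
}

void render(double *image, int width, int height)
{
    assert(image && width > 0 && height > 0);
    /* Rows are independent, so the outer loop can be split across threads.
     * Without -fopenmp the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            double scratch[3];  /* declared in the loop body: one per thread */
            image[j * width + i] = shade(i, j, scratch);
        }
    }
}
```

If scratch were declared outside the parallel loop without a private clause, all threads would share one buffer and overwrite each other's intermediate values. Each thread writes to a distinct element of image, so no further synchronization is needed.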
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
19.83 0.22 0.22 8820537 0.00 0.00 subtract_vector
16.68 0.41 0.19 10411709 0.00 0.00 dot_product
10.37 0.52 0.12 2554579 0.00 0.00 cross_product
8.11 0.61 0.09 1620293 0.00 0.00 normalize
7.21 0.69 0.08 846032 0.00 0.00 ray_hit_object
6.31 0.76 0.07 4884684 0.00 0.00 multiply_vector
6.31 0.83 0.07 2027362 0.00 0.00 rayRectangularIntersection
4.51 0.88 0.05 1958003 0.00 0.00 raySphereIntersection
3.61 0.92 0.04 278657 0.00 0.00 compute_specular_diffuse
3.61 0.96 0.04 207204 0.00 0.00 protect_color_overflow
3.61 1.00 0.04 128510 0.00 0.00 ray_color
2.25 1.03 0.03 2468484 0.00 0.00 add_vector
2.25 1.05 0.03 170954 0.00 0.00 rayConstruction
1.35 1.07 0.02 624307 0.00 0.00 length
0.90 1.08 0.01 505867 0.00 0.00 multiply_vectors
0.90 1.09 0.01 375776 0.00 0.00 idx_stack_top
0.90 1.10 0.01 373556 0.00 0.00 localColor
0.90 1.11 0.01 1 0.01 1.11 raytracing
0.45 1.11 0.01 1 0.01 0.01 calculateBasisVectors
0.00 1.11 0.00 418656 0.00 0.00 idx_stack_empty
0.00 1.11 0.00 206896 0.00 0.00 refraction
0.00 1.11 0.00 186036 0.00 0.00 reflection
0.00 1.11 0.00 171468 0.00 0.00 idx_stack_push
0.00 1.11 0.00 130055 0.00 0.00 idx_stack_init
0.00 1.11 0.00 30970 0.00 0.00 fresnel
0.00 1.11 0.00 10998 0.00 0.00 idx_stack_pop
0.00 1.11 0.00 3 0.00 0.00 append_rectangular
0.00 1.11 0.00 3 0.00 0.00 append_sphere
0.00 1.11 0.00 2 0.00 0.00 append_light
0.00 1.11 0.00 1 0.00 0.00 delete_light_list
0.00 1.11 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.11 0.00 1 0.00 0.00 delete_sphere_list
0.00 1.11 0.00 1 0.00 0.00 diff_in_second
0.00 1.11 0.00 1 0.00 0.00 write_to_ppm
OpenMP summary:
I ran into many difficulties while learning OpenMP; they are recorded below.
# Rendering scene
Done!
Execution time of raytracing() : 0.588355 sec
convert out.ppm out.png
# Rendering scene
Done!
Execution time of raytracing() : 0.206408 sec
convert out.ppm out.png
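Even after the OpenMP speedup there is still room to go faster; it seems that the lower the target time, the more data analysis is needed.
First, a summary of basic pthread usage, using only pthread_create.
The part of the code chosen for improvement is again the rendering routine.
Single thread
Step 1: Create an argument structure, accessed through a pointer, to hold the arguments of the function being called.
Step 2: Modify the original rendering function so that it takes a pointer to that structure as its parameter.
Step 3: Modify the main program to create a thread that runs it.
Step 4: Join the thread back into the main thread.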
// raytracing.h
typedef struct __RAY_DETAIL {
    uint8_t *pixels;
    color background_color;
    rectangular_node rectangulars;
    sphere_node spheres;
    light_node lights;
    const viewpoint *view;
    int width;
    int height;
} raydetail;

// raytracing.c
raydetail *set_raydetail(uint8_t *pixels, color background_color,
                         rectangular_node rectangulars, sphere_node spheres,
                         light_node lights, const viewpoint *view,
                         int width, int height)
{
    raydetail *detail = (raydetail *) malloc(sizeof(raydetail));
    detail->pixels = pixels;
    detail->background_color = background_color;
    detail->rectangulars = rectangulars;
    detail->spheres = spheres;
    detail->lights = lights;
    detail->view = view;
    detail->width = width;
    detail->height = height;
    return detail;
}

/* The parameter is named arg so that it does not shadow the raydetail type. */
void raytracing(void *arg)
{
    raydetail *detail = (raydetail *) arg;
    ...
}
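Multiple threads
Following yenWu's notes, using the thread number to decide which rows each thread renders is a good approach.
Step 1: Add a way to specify the number of threads.
Step 2: Extend the raydetail structure to also store the thread number.
Step 3: Modify the loop condition of the rendering function.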
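The three steps above can be sketched as follows. This is an illustration under assumptions rather than the project's code: worker_arg, render_rows, and render_parallel are hypothetical names, and the worker only stamps each pixel with its thread number plus one so that the row assignment is visible. Thread t handles rows t, t + nthreads, t + 2 * nthreads, and so on.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Step 2: the argument structure also carries the thread number. */
typedef struct {
    unsigned char *pixels;
    int width, height;
    int tid, nthreads;
} worker_arg;

/* Step 3: the row loop starts at tid and advances by nthreads. */
static void *render_rows(void *p)
{
    worker_arg *arg = p;
    for (int j = arg->tid; j < arg->height; j += arg->nthreads)
        for (int i = 0; i < arg->width; i++)
            arg->pixels[j * arg->width + i] = (unsigned char)(arg->tid + 1);
    return NULL;
}

/* Step 1: the caller chooses nthreads. Returns 0 on success, -1 on error. */
int render_parallel(unsigned char *pixels, int width, int height, int nthreads)
{
    assert(pixels && width > 0 && height > 0 && nthreads > 0);
    pthread_t *threads = malloc(sizeof(*threads) * nthreads);
    worker_arg *args = malloc(sizeof(*args) * nthreads);
    if (!threads || !args) {
        free(threads);
        free(args);
        return -1;
    }

    int created = 0;
    for (int t = 0; t < nthreads; t++) {
        args[t] = (worker_arg){ pixels, width, height, t, nthreads };
        if (pthread_create(&threads[t], NULL, render_rows, &args[t]) != 0)
            break;
        created++;
    }
    /* Join only the threads that were actually started. */
    for (int t = 0; t < created; t++)
        pthread_join(threads[t], NULL);

    free(threads);
    free(args);
    return created == nthreads ? 0 : -1;
}
```

Interleaving rows tends to spread expensive regions of the image across threads, whereas contiguous blocks can leave one thread with most of the work. Each thread gets its own worker_arg so that no thread number is overwritten while another thread is still reading it.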
Please input the thread num: 2
# Rendering scene
Done!
Execution time of raytracing() : 2.080394 sec
===============================
Please input the thread num: 4
# Rendering scene
Done!
Execution time of raytracing() : 2.252935 sec
===============================
Please input the thread num: 8
# Rendering scene
Done!
Execution time of raytracing() : 2.557829 sec
===============================
Please input the thread num: 16
# Rendering scene
Done!
Execution time of raytracing() : 3.193906 sec
===============================
Please input the thread num: 32
# Rendering scene
Done!
Execution time of raytracing() : 3.801572 sec
===============================
Please input the thread num: 64
# Rendering scene
Done!
Execution time of raytracing() : 3.972577 sec
===============================
Please input the thread num: 128
# Rendering scene
Done!
Execution time of raytracing() : 4.187006 sec
===============================
Please input the thread num: 256
# Rendering scene
Done!
Execution time of raytracing() : 3.965714 sec
===============================
Please input the thread num: 512
# Rendering scene
Done!
Execution time of raytracing() : 3.565197 sec
===============================
Please input the thread num: 1024
# Rendering scene
Done!
Execution time of raytracing() : 3.493345 sec
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
12.33 0.18 0.18 31410180 0.01 0.01 multiply_vector
11.27 0.34 0.16 69646433 0.00 0.00 dot_product
11.27 0.50 0.16 13861875 0.01 0.03 rayRectangularIntersection
10.57 0.65 0.15 10598453 0.01 0.01 normalize
9.86 0.79 0.14 4620625 0.03 0.17 ray_hit_object
9.51 0.92 0.14 56956357 0.00 0.00 subtract_vector
6.34 1.01 0.09 13861875 0.01 0.01 raySphereIntersection
6.34 1.10 0.09 1048576 0.09 1.26 ray_color
4.58 1.17 0.07 17821811 0.00 0.00 cross_product
4.58 1.23 0.07 17836094 0.00 0.00 add_vector
4.23 1.29 0.06 2110576 0.03 0.07 localColor
2.82 1.33 0.04 raytracing
1.41 1.35 0.02 3838091 0.01 0.01 length
1.41 1.37 0.02 2110576 0.01 0.09 compute_specular_diffuse
0.70 1.38 0.01 4221152 0.00 0.00 multiply_vectors
0.70 1.39 0.01 2520791 0.00 0.00 idx_stack_top
0.70 1.40 0.01 1241598 0.01 0.01 protect_color_overflow
0.70 1.41 0.01 1241598 0.01 0.02 reflection
0.70 1.42 0.01 1048576 0.01 0.05 rayConstruction
0.00 1.42 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 1.42 0.00 1241598 0.00 0.00 refraction
0.00 1.42 0.00 1204003 0.00 0.00 idx_stack_push
0.00 1.42 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.42 0.00 113297 0.00 0.01 fresnel
0.00 1.42 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.42 0.00 3 0.00 0.00 append_rectangular
0.00 1.42 0.00 3 0.00 0.00 append_sphere
0.00 1.42 0.00 2 0.00 0.00 append_light
0.00 1.42 0.00 2 0.00 0.05 calculateBasisVectors
0.00 1.42 0.00 2 0.00 0.00 set_raydetail
0.00 1.42 0.00 1 0.00 0.00 delete_light_list
0.00 1.42 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.42 0.00 1 0.00 0.00 delete_sphere_list
0.00 1.42 0.00 1 0.00 0.00 diff_in_second
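Judging from my data, the threads do run separately and divide the work, but the best time (2.08 s with 2 threads) is only about 0.8 seconds faster than the original, and the time grows as more threads are added, so the improvement is not significant. Evidently the code still has a lot of room for improvement. Having each thread work on its own separate part might do better; I will add that later.
Of the versions tested so far, the modified version using OpenMP is the most efficient, while the direct use of SIMD AVX is the worst.
Loop unrolling
Loop unrolling (also called loop unwinding) is an optimization technique that trades program size for execution speed.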
Example:
for (i = 1; i <= 60; i++)
    a[i] = a[i] * b + c;
==================================
for (i = 1; i <= 60; i += 3)
{
    a[i] = a[i] * b + c;
    a[i+1] = a[i+1] * b + c;
    a[i+2] = a[i+2] * b + c;
}
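The example above works because 60 is a multiple of 3. For an arbitrary length, an unrolled loop usually needs a short cleanup loop for the leftover elements. The sketch below shows one way to write it; scale_add and scale_add_unrolled are hypothetical names, and the arrays are indexed from 0 rather than 1.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one element per iteration. */
void scale_add(double *a, size_t n, double b, double c)
{
    assert(a || n == 0);
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * b + c;
}

/* Unrolled by 3: the main loop handles full groups of three elements and
 * the cleanup loop handles the remaining zero, one, or two. */
void scale_add_unrolled(double *a, size_t n, double b, double c)
{
    assert(a || n == 0);
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        a[i]     = a[i]     * b + c;
        a[i + 1] = a[i + 1] * b + c;
        a[i + 2] = a[i + 2] * b + c;
    }
    for (; i < n; i++)
        a[i] = a[i] * b + c;
}
```

The unrolled version performs one loop test per three elements instead of one per element, at the cost of a larger function body.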