heathcliffYang
contributed by <heathcliffYang
>
I had better to read the Makefile first.
make check
will verify the picture.
convert
can convert the ppm to png or else.
make PROFILE=1
-> compile the program needed by gprof in Makefile.
make clean
will remove use-models.h
,out.ppm
and gmon.out
.
gprof -b ./raytracing | less
-> 從上方顯示較少資訊,並提供翻頁,-b
是不需要解釋名詞
gprof ./raytracing | gprof2dot | dot -T png -o callgraph_omp1.png
6.232054 sec
heathcliff@heathcliff-MacBookPro:~/raytracing$ gprof -b ./raytracing
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
20.34 0.50 0.50 69646433 0.00 0.00 dot_product
16.68 0.91 0.41 56956357 0.00 0.00 subtract_vector
13.42 1.24 0.33 10598450 0.00 0.00 normalize
11.80 1.53 0.29 31410180 0.00 0.00 multiply_vector
9.36 1.76 0.23 13861875 0.00 0.00 rayRectangularIntersection
6.91 1.93 0.17 13861875 0.00 0.00 raySphereIntersection
4.47 2.04 0.11 4620625 0.00 0.00 ray_hit_object
3.46 2.13 0.09 17836094 0.00 0.00 add_vector
2.44 2.19 0.06 17821809 0.00 0.00 cross_product
2.03 2.24 0.05 1048576 0.00 0.00 ray_color
1.63 2.28 0.04 4221152 0.00 0.00 multiply_vectors
1.63 2.32 0.04 2110576 0.00 0.00 compute_specular_diffuse
1.22 2.35 0.03 2520791 0.00 0.00 idx_stack_top
1.22 2.38 0.03 2110576 0.00 0.00 localColor
0.81 2.40 0.02 1241598 0.00 0.00 refraction
Performance counter stats for './raytracing' (10 runs):
78,085 cache-misses # 19.826 % of all cache refs ( +- 11.39% )
393,846 cache-references ( +- 3.41% )
34,500,667,719 instructions # 1.75 insns per cycle ( +- 0.00% )
19,734,977,254 cycles ( +- 0.83% )
6.453942844 seconds time elapsed ( +- 0.82% )
loop unrolling 所有的
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
16.57 0.27 0.27 13861875 0.00 0.00 rayRectangularIntersection
14.70 0.50 0.24 56956357 0.00 0.00 subtract_vector ***
13.13 0.71 0.21 69646433 0.00 0.00 dot_product ***
11.88 0.90 0.19 10598450 0.00 0.00 normalize ***
8.44 1.04 0.14 4620625 0.00 0.00 ray_hit_object
7.50 1.16 0.12 13861875 0.00 0.00 raySphereIntersection
6.88 1.27 0.11 17836094 0.00 0.00 add_vector ***
5.32 1.35 0.09 17821809 0.00 0.00 cross_product ***
3.13 1.40 0.05 3838091 0.00 0.00 length ***
2.19 1.44 0.04 31410180 0.00 0.00 multiply_vector ***
1.88 1.47 0.03 2110576 0.00 0.00 compute_specular_diffuse
1.88 1.50 0.03 1048576 0.00 0.00 ray_color
1.56 1.52 0.03 4221152 0.00 0.00 multiply_vectors***
0.94 1.54 0.02 113297 0.00 0.00 fresnel
0.94 1.55 0.02 2520791 0.00 0.00 idx_stack_top
0.63 1.56 0.01 2110576 0.00 0.00 localColor
0.63 1.57 0.01 1241598 0.00 0.00 protect_color_overflow
0.63 1.58 0.01 1241598 0.00 0.00 reflection
0.63 1.59 0.01 1 0.01 0.01 delete_sphere_list
0.63 1.60 0.01 1 0.01 1.59 raytracing
0.00 1.60 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 1.60 0.00 1241598 0.00 0.00 refraction
0.00 1.60 0.00 1204003 0.00 0.00 idx_stack_push
0.00 1.60 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.60 0.00 1048576 0.00 0.00 rayConstruction
0.00 1.60 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.60 0.00 3 0.00 0.00 append_rectangular
0.00 1.60 0.00 3 0.00 0.00 append_sphere
0.00 1.60 0.00 2 0.00 0.00 append_light
0.00 1.60 0.00 1 0.00 0.00 calculateBasisVectors
0.00 1.60 0.00 1 0.00 0.00 delete_light_list
0.00 1.60 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.60 0.00 1 0.00 0.00 diff_in_second
0.00 1.60 0.00 1 0.00 0.00 write_to_ppm
Performance counter stats for './raytracing' (10 runs):
59,245 cache-misses # 19.567 % of all cache refs ( +- 1.89% )
302,776 cache-references ( +- 7.82% )
25,746,948,831 instructions # 1.61 insns per cycle ( +- 0.00% )
15,999,672,351 cycles ( +- 0.43% )
5.210404034 seconds time elapsed ( +- 0.56% )
void normalize(double *v)
{
double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
assert(d != 0.0 && "Error calculating normal");
#pragma omp parallel for
for (a=0; a < 3; a++) {
v[a]/=d;
}
}
嘗試把所有math-toolkit.h裡的函式可以平行化的迴圈都平行化了,但是反而更慢更久;
之後換成只改其中一個函式,也跑得很久,這兩個都沒等到跑完。
all
static __attribute__((always_inline))
void normalize(double *v)
{
double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
assert(d != 0.0 && "Error calculating normal");
v[0] /= d;
v[1] /= d;
v[2] /= d;
}
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
30.20 0.47 0.47 13861875 0.00 0.00 rayRectangularIntersection
15.91 0.71 0.25 13861875 0.00 0.00 raySphereIntersection
14.29 0.93 0.22 2110576 0.00 0.00 compute_specular_diffuse
9.74 1.08 0.15 1048576 0.00 0.00 ray_color
9.09 1.22 0.14 2110576 0.00 0.00 localColor
6.82 1.33 0.11 4620625 0.00 0.00 ray_hit_object
5.85 1.42 0.09 1048576 0.00 0.00 rayConstruction
2.60 1.46 0.04 1 0.04 1.54 raytracing
1.30 1.48 0.02 1241598 0.00 0.00 reflection
1.30 1.50 0.02 1241598 0.00 0.00 refraction
0.97 1.51 0.02 113297 0.00 0.00 fresnel
0.65 1.52 0.01 2558386 0.00 0.00 idx_stack_empty
0.65 1.53 0.01 2520791 0.00 0.00 idx_stack_top
0.65 1.54 0.01 1241598 0.00 0.00 protect_color_overflow
0.00 1.54 0.00 1204003 0.00 0.00 idx_stack_push
0.00 1.54 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.54 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.54 0.00 3 0.00 0.00 append_rectangular
0.00 1.54 0.00 3 0.00 0.00 append_sphere
0.00 1.54 0.00 2 0.00 0.00 append_light
0.00 1.54 0.00 1 0.00 0.00 calculateBasisVectors
0.00 1.54 0.00 1 0.00 0.00 delete_light_list
0.00 1.54 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.54 0.00 1 0.00 0.00 delete_sphere_list
0.00 1.54 0.00 1 0.00 0.00 diff_in_second
0.00 1.54 0.00 1 0.00 0.00 write_to_ppm
Performance counter stats for './raytracing' (10 runs):
53,970 cache-misses # 19.476 % of all cache refs ( +- 2.84% )
277,108 cache-references ( +- 6.25% )
13,144,659,166 instructions # 1.70 insns per cycle ( +- 0.00% )
7,750,237,141 cycles ( +- 0.74% )
2.540274253 seconds time elapsed ( +- 1.00% )
感謝hugikun999的提醒
加了inline之後,又試著在math-toolkit.h裡用OpenMP,但時間依然很長(兩百多秒);後來想到,每個函式被叫用那麼多次,但是每個都是迴圈短、運算比較簡單的時候,就不需要每個簡單的運算都用一個thread去執行。
Loop unrolling + inline (尚未加上omp)
ray_color()
& rayConstruction()
/* @param background_color this is not ambient light */
void raytracing(uint8_t *pixels, color background_color,
rectangular_node rectangulars, sphere_node spheres,
light_node lights, const viewpoint *view,
int width, int height)
{
point3 u, v, w, d;
color object_color = { 0.0, 0.0, 0.0 };
/* calculate u, v, w */
calculateBasisVectors(u, v, w, view);
idx_stack stk;
int factor = sqrt(SAMPLES);
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; i++) {
double r = 0, g = 0, b = 0;
/* MSAA */
#pragma omp parallel for firstprivate(r,g,b)
for (int s = 0; s < SAMPLES; s++) {
idx_stack_init(&stk);
rayConstruction(d, u, v, w,
i * factor + s / factor,
j * factor + s % factor,
view,
width * factor, height * factor);
if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
lights, object_color,
MAX_REFLECTION_BOUNCES)) {
r += object_color[0];
g += object_color[1];
b += object_color[2];
} else {
r += background_color[0];
g += background_color[1];
b += background_color[2];
}
pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
}
}
}
}
speed up but check fail
Performance counter stats for './raytracing' (10 runs):
67,856 cache-misses # 0.188 % of all cache refs ( +- 2.44% )
36,086,782 cache-references ( +- 2.19% )
16,662,901,112 instructions # 0.91 insns per cycle ( +- 5.07% )
18,386,293,743 cycles ( +- 4.35% )
1.601089390 seconds time elapsed ( +- 4.40% )
-> data dependent.
Performance counter stats for './raytracing' (10 runs):
207,759 cache-misses # 0.276 % of all cache refs ( +- 2.99% )
75,389,334 cache-references ( +- 3.65% )
12,420,744,790 instructions # 0.72 insns per cycle ( +- 0.15% )
17,284,430,642 cycles ( +- 0.58% )
1.931539834 seconds time elapsed ( +- 1.13% )
Performance counter stats for './raytracing' (10 runs):
248,194 cache-misses # 35.564 % of all cache refs ( +- 3.97% )
697,873 cache-references ( +- 9.18% )
13,320,768,291 instructions # 1.69 insns per cycle ( +- 0.00% )
7,859,385,254 cycles ( +- 0.47% )
2.606176699 seconds time elapsed ( +- 0.38% )
在Unroll 第三層迴圈之下,找出dependent的參數d, stk, object_color
Performance counter stats for './raytracing' (10 runs):
74,722 cache-misses # 0.097 % of all cache refs ( +- 3.33% )
77,274,246 cache-references ( +- 5.31% )
12,810,788,231 instructions # 0.76 insns per cycle ( +- 0.11% )
16,853,158,191 cycles ( +- 0.42% )
2.026637683 seconds time elapsed ( +- 0.80% )
Performance counter stats for './raytracing' (10 runs):
102,166 cache-misses # 0.059 % of all cache refs ( +- 8.70% )
173,186,250 cache-references ( +- 5.89% )
63,646,408 branch-misses ( +- 1.08% )
15,806,263,440 instructions # 0.50 insns per cycle ( +- 0.26% )
31,412,416,665 cycles ( +- 1.00% )
3.722181320 seconds time elapsed ( +- 1.97% )
Performance counter stats for './raytracing' (10 runs):
586,040 cache-misses # 1.565 % of all cache refs ( +- 1.72% )
37,444,312 cache-references ( +- 0.24% )
12,194,796,576 instructions # 0.85 insns per cycle ( +- 0.01% )
14,430,824,154 cycles ( +- 0.12% )
1.266002949 seconds time elapsed ( +- 0.29% )
Performance counter stats for './raytracing' (10 runs):
571,745 cache-misses # 1.523 % of all cache refs ( +- 1.76% )
37,550,691 cache-references ( +- 0.42% )
12,194,710,719 instructions # 0.84 insns per cycle ( +- 0.01% )
14,506,164,926 cycles ( +- 0.38% )
1.275681361 seconds time elapsed ( +- 0.54% )
Performance counter stats for './raytracing' (10 runs):
400,710 cache-misses # 1.071 % of all cache refs ( +- 5.59% )
37,425,937 cache-references ( +- 0.36% )
12,182,627,168 instructions # 0.84 insns per cycle ( +- 0.01% )
14,428,624,839 cycles ( +- 0.21% )
1.277496560 seconds time elapsed ( +- 0.50% )
Performance counter stats for './raytracing' (10 runs):
93,895 cache-misses # 0.167 % of all cache refs ( +- 11.93% )
56,274,093 cache-references ( +- 1.03% )
12,322,845,231 instructions # 0.74 insns per cycle ( +- 0.02% )
16,560,790,058 cycles ( +- 0.33% )
1.497703779 seconds time elapsed ( +- 0.28% )
根據gprof的結果,發現主要耗時的是math-toolkit.h的4個函式,因此將會針對這4個函式進行優化
dot_product()
& subtract_vector()
: 減少不必要的跳出迴圈判斷發現multiply_vector()
的執行時間大於multiply_vectors()
運算式多為vector的運算,單個函式呼叫中資料彼此之前沒有關係 -> 用 OpenMP 把迴圈的運算平行化
inline
是建議compiler函式的屬性,內嵌,與編譯的方法相關
把所有raytracing()呼叫到的函式檢查一遍是否有共用存取
idx_stack.h ->
把idx_stack_init inline (not yet)
ray_color(const point3 e, double t,
const point3 d,
idx_stack *stk,
const rectangular_node rectangulars,
const sphere_node spheres,
const light_node lights,
color object_color, int bounces_left)
add_vector(object_color, refraction_part, object_color);
rayConstruction(d, u, v, w, i * factor + 0 / factor, j * factor + 0 % factor, view, width * factor, height * factor);
subtract_vector(s, view->vrp, d);
normalize(d);
__attribute__()
可以在宣告一個函式時,標明函式的屬性,用法是:
void f () __attribute__ (attributeName)
–>這個可能參考資料寫錯了,因為compile時有出現錯誤訊息:
attributes should be specified before the declarator in a function definition static
always_inline
: 標明函式是要用內嵌,也就是不使用呼叫、跳去記憶體位置執行函式,而是在編譯時將函式在呼叫點展開。 (inline
:建議,compiler不一定會照做,擺法也不太一樣,inline
位在宣告時的函式名稱最前面)print the execution time of each function in the process.
-pg
–> this option represents for profiling when compiling. Besides, it must be a compilation option as well as a link option.
-Q
–> switch to suppress the printing of the call graph data.
Run the program to generate the data gprof needs.
gprof options [exe] [profile-data-file] > outfile
If you omit the executable file name, the file a.out is used. If you give no profile data file name, the file gmon.out is used.
(Not yet):To link the program for profiling, if you use a compiler such as cc to do the linking, simply specify `-pg' in addition to your usual options.
將大部分的整數運算指令擴展到 256 bits
sqrt()
calculates the square root.
hugikun999 :
apt install python-pip
->pip install gprof2dot