contributed by <zmke
>
twzjwang
multithreads 下使用 gprof ,時間卻增加:
gprof
is poorly suited for analysis of parallel programs as it doesn't understand the intricacies of OpenMP.
gprof on both OpenMP and without OpenMP codes produces different flat profile
可用其他方法實作平行化,如 Pthread
henry0929016816
OS: Ubuntu 16.04 LTS
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
Model name: Intel® Core™ i5-4200H CPU @ 2.80GHz
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 3072K
可以印出程式執行時各函數消耗的函數、 funtion 之間互相呼叫的關係,幫助 programmer 分析出消耗較多時間的函數,找出當作改善效能的方向,尤其是對大型程式來說是相當實用的工具
$ make PROFILE=1
: 加入 -pg , compiler 會在程式中間插入額外的程式碼,執行後會產生 gmon.out
$ gprof ./raytracing | less
$ make
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 2.499698 sec
$ make PROFILE=1
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 5.441733 sec
compiler 在程式中加入額外程式碼,執行時間增加了
$ gprof ./raytracing | less
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
25.02 0.51 0.51 69646433 0.00 0.00 dot_product
14.71 0.81 0.30 56956357 0.00 0.00 subtract_vector
11.28 1.04 0.23 31410180 0.00 0.00 multiply_vector
10.06 1.25 0.21 13861875 0.00 0.00 rayRectangularIntersection
7.85 1.41 0.16 17836094 0.00 0.00 add_vector
7.36 1.56 0.15 10598450 0.00 0.00 normalize
5.89 1.68 0.12 4620625 0.00 0.00 ray_hit_object
4.41 1.77 0.09 17821809 0.00 0.00 cross_product
3.19 1.83 0.07 13861875 0.00 0.00 raySphereIntersection
2.45 1.88 0.05 1048576 0.00 0.00 ray_color
1.96 1.92 0.04 2110576 0.00 0.00 compute_specular_diffuse
1.96 1.96 0.04 1 0.04 2.04 raytracing
可以清楚看到時間集中在 dot_product 、 subtract_vector 、 multiply_vector ,這些 function 主要用來進行三維向量運算,可以在 math-toolkit.h
找到定義
Performance counter stats for './raytracing' (10 runs):
19,296,540 branch-misses # 0.40% of all branches ( +- 0.43% )
4,782,031,932 branches ( +- 0.00% )
32,981,345,906 instructions # 1.90 insn per cycle ( +- 0.00% )
17,360,336,322 cycles ( +- 0.22% )
5.137705397 seconds time elapsed ( +- 0.22% )
-Ofast
$ make PROFILE=1
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 0.518435 sec
這效果…
$ gprof ./raytracing | less
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
52.21 0.24 0.24 4620625 51.98 51.98 ray_hit_object
21.76 0.34 0.10 2110576 47.42 47.42 compute_specular_diffuse
10.88 0.39 0.05 592239 84.49 408.09 ray_color
10.88 0.44 0.05 raytracing
2.18 0.45 0.01 2110576 4.74 4.74 localColor
2.18 0.46 0.01 1241598 8.06 8.06 refraction
0.00 0.46 0.00 113297 0.00 0.00 fresnel
math-toolkit.h
裡面的 inline 改成 __attribute__((always_inline)) inline
,讓 compiler 在關閉編譯最佳化的情形下將 function inline$ make PROFILE=1
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 2.967841 sec
$ gprof ./raytracing | less
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
40.48 0.87 0.87 13861875 0.00 0.00 rayRectangularIntersection
18.61 1.27 0.40 13861875 0.00 0.00 raySphereIntersection
10.93 1.51 0.24 2110576 0.00 0.00 compute_specular_diffuse
7.91 1.68 0.17 1048576 0.00 0.00 ray_color
6.51 1.82 0.14 2110576 0.00 0.00 localColor
6.28 1.95 0.14 4620625 0.00 0.00 ray_hit_object
4.89 2.06 0.11 1048576 0.00 0.00 rayConstruction
1.86 2.10 0.04 1241598 0.00 0.00 reflection
1.40 2.13 0.03 1 0.03 2.15 raytracing
math-tool-kit.h
force inline 的 functionmath-toolkit.h
function 內的 loop 展開$ make PROFILE=1
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 4.346391 sec
$ gprof ./raytracing | less
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
16.03 0.21 0.21 69646433 0.00 0.00 dot_product
15.63 0.41 0.20 56956357 0.00 0.00 subtract_vector
14.85 0.60 0.19 10598450 0.00 0.00 normalize
8.99 0.71 0.12 17821809 0.00 0.00 cross_product
7.82 0.81 0.10 13861875 0.00 0.00 rayRectangularIntersection
6.64 0.90 0.09 13861875 0.00 0.00 raySphereIntersection
6.25 0.98 0.08 1048576 0.00 0.00 ray_color
4.69 1.04 0.06 4620625 0.00 0.00 ray_hit_object
3.91 1.09 0.05 1048576 0.00 0.00 rayConstruction
3.52 1.13 0.05 31410180 0.00 0.00 multiply_vector
2.74 1.17 0.04 17836094 0.00 0.00 add_vector
2.35 1.20 0.03 2110576 0.00 0.00 localColor
1.95 1.22 0.03 4221152 0.00 0.00 multiply_vectors
1.56 1.24 0.02 1 0.02 1.27 raytracing
* dot_product: 0.51 -> 0.21
* subtract_vector: 0.30 -> 0.20
* multify_vector: 0.23 -> 0.05
* add_vector: 0.16 -> 0.04
Performance counter stats for './raytracing' (10 runs):
16,359,752 branch-misses # 0.42% of all branches ( +- 0.69% )
3,873,917,856 branches ( +- 0.00% )
25,746,564,892 instructions # 1.76 insn per cycle ( +- 0.00% )
14,607,513,939 cycles ( +- 0.23% )
4.322380613 seconds time elapsed ( +- 0.27% )
$ make clean
$ make
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 1.682859 sec
剛好上學期有修平行程式設計,上課有提過 OpenMP ,把投影片找出來複習一下
void raytracing(uint8_t *pixels, color background_color,
rectangular_node rectangulars, sphere_node spheres,
light_node lights, const viewpoint *view,
int width, int height)
{
point3 u, v, w, d;
color object_color = { 0.0, 0.0, 0.0 };
/* calculate u, v, w */
calculateBasisVectors(u, v, w, view);
idx_stack stk;
int factor = sqrt(SAMPLES);
#pragma omp parallel for num_threads (thread_count) \
default(shared) private(stk, d, object_color)
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; i++) {
double r = 0, g = 0, b = 0;
/* MSAA */
for (int s = 0; s < SAMPLES; s++) {
idx_stack_init(&stk);
rayConstruction(d, u, v, w,
i * factor + s / factor,
j * factor + s % factor,
view,
width * factor, height * factor);
if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
lights, object_color,
MAX_REFLECTION_BOUNCES)) {
r += object_color[0];
g += object_color[1];
b += object_color[2];
} else {
r += background_color[0];
g += background_color[1];
b += background_color[2];
}
pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
}
}
}
}
thread_count = 4
$ make PROFILE=1
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 7.930475 sec
時間比只有 loop unrolling 增加了 (4.35 -> 7.93)
先不使用 gprof 測試看看
$ make clean
$ make
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 0.818154 sec
和 loop unrolling 不加上 -pg 的情況相比時間下降了 (1.68 -> 0.89) ,推測在 OpenMP multi threads 的情況下, -pg 對程式帶來的成本也跟著增加了
thread_count = 1 1.720003 sec
thread_count = 2 0.955195 sec
thread_count = 4 0.818154 sec
thread_count = 8 0.767247 sec
thread_count = 16 0.743067 sec
thread_count = 32 0.801165 sec
schedule(static, 1)
#pragma omp parallel for num_threads (thread_count) \
default(shared) private(stk, d, object_color) schedule(static, 1)
thread_count = 4
$ make
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 0.728800 sec
在 thread_count = 4 的情形下,與 default schedule 相比,執行時間縮短 (0.818 -> 0.729)
簡易的程式平行化-OpenMP
hugikun999 筆記
LanKuDot 筆記
成大資訊系-多處理機平行程式設計 課程投影片