contributed by <rayleigh0407
>
twzjwang
$lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Stepping: 3
CPU MHz: 800.000
CPU max MHz: 4200.0000
CPU min MHz: 800.0000
BogoMIPS: 8016.72
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
中英文間記得用空白隔開~
課程助教
gprof 是一個可以協助你開發程式的工具, 它可以檢測你的程式執行時間的分佈, 在 GNU gprof 的簡介就提到
This manual describes the gnu profiler, gprof, and how you can use it to determine which parts of a program are taking most of the execution time. We assume that you know how to write, compile, and execute programs…
在編譯的時候, 你可以加入參數-pg
, 接著執行過程式後會產生gmon.out檔, 接著只要輸入以下指令
gprof -b $(exec) gmon.out | less
-b 是將各種結果對應的說明刪除, 只留下執行分析的結果
less 可以讓你上下滾動查看輸出結果
先下載原始碼
載完立即來測試看看
$ git clone https://github.com/sysprog21/raytracing
$ cd raytracing
$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 2.224557 sec
用 gprof 來看看程式執行的部份
$ make PROFILE=1(輸出gmon.out)
$ ./raytracing
$ gprof -b raytracing gmon.out | less
執行的結果如下
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
20.23 0.37 0.37 69646433 0.00 0.00 dot_product
18.04 0.70 0.33 56956357 0.00 0.00 subtract_vector
12.85 0.94 0.24 13861875 0.00 0.00 rayRectangularIntersection
9.30 1.11 0.17 4620625 0.00 0.00 ray_hit_object
8.20 1.26 0.15 31410180 0.00 0.00 multiply_vector
6.29 1.37 0.12 17836094 0.00 0.00 add_vector
6.29 1.49 0.12 13861875 0.00 0.00 raySphereIntersection
5.47 1.59 0.10 10598450 0.00 0.00 normalize
4.92 1.68 0.09 17821809 0.00 0.00 cross_product
1.64 1.71 0.03 4221152 0.00 0.00 multiply_vectors
1.09 1.73 0.02 2110576 0.00 0.00 compute_specular_diffuse
1.09 1.75 0.02 2110576 0.00 0.00 localColor
1.09 1.77 0.02 1241598 0.00 0.00 protect_color_overflow
1.09 1.79 0.02 1048576 0.00 0.00 rayConstruction
0.82 1.80 0.02 3838091 0.00 0.00 length
0.55 1.81 0.01 1241598 0.00 0.00 refraction
0.55 1.82 0.01 1048576 0.00 0.00 ray_color
0.27 1.83 0.01 2558386 0.00 0.00 idx_stack_empty
0.27 1.83 0.01 1204003 0.00 0.00 idx_stack_push
0.00 1.83 0.00 2520791 0.00 0.00 idx_stack_top
0.00 1.83 0.00 1241598 0.00 0.00 reflection
0.00 1.83 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.83 0.00 113297 0.00 0.00 fresnel
0.00 1.83 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.83 0.00 3 0.00 0.00 append_rectangular
0.00 1.83 0.00 3 0.00 0.00 append_sphere
0.00 1.83 0.00 2 0.00 0.00 append_light
0.00 1.83 0.00 1 0.00 0.00 calculateBasisVectors
0.00 1.83 0.00 1 0.00 0.00 delete_light_list
0.00 1.83 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.83 0.00 1 0.00 0.00 delete_sphere_list
0.00 1.83 0.00 1 0.00 0.00 diff_in_second
0.00 1.83 0.00 1 0.00 1.83 raytracing
0.00 1.83 0.00 1 0.00 0.00 write_to_ppm
...
下半部份主要為函式呼叫的分支(即一個函式呼叫了哪些函式)和統計次數
由此可看出, 佔用大部分時間的都是一些小函式( dot_product 不到十行)
試著從這些小地方開始著手
這是 math-toolkit.h 中的 dot_product() 和 subtract_vector()
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
for (int i = 0; i < 3; i++)
out[i] = a[i] - b[i];
}
...
static inline
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
for (int i = 0; i < 3; i++)
dp += v1[i] * v2[i];
return dp;
}
可以發現, 原開發者早就考慮到他們佔用相當多的時間, 便宣告成 inline function 以提昇執行速度和減少記憶體的佔用(關於inline function), 但程式呼叫次數太多了, 佔用時間還是相對地大
首先我先這樣修改:
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
out[0] = a[0] - b[0];
out[1] = a[1] - b[1];
out[2] = a[2] - b[2];
}
...
static inline
double dot_product(const double *v1, const double *v2)
{
dp += v1[0] * v2[0];
dp += v1[1] * v2[1];
dp += v1[2] * v2[2];
return dp;
}
將函式拆開的用意是, 在每次執行一次迴圈時, 都會檢查是否要跳出迴圈, 以組語來說就類似 jne jge 等等…(可參考x86 Assembly Guide), 使用 loop unrolling, 每次執行函式就會少一兩個 instructions, 多次使用後節省的時間也是很可觀的
我們來看執行結果
origin
Performance counter stats for './raytracing' (10 runs):
22,9075 cache-misses # 23.203 % of all cache refs ( +- 16.92% )
98,7278 cache-references ( +- 16.48% )
194,6446,2315 instructions # 2.21 insns per cycle ( +- 0.00% )
88,2142,5404 cycles ( +- 0.93% )
2.251422149 seconds time elapsed ( +- 0.92% )
loop unrolling
Performance counter stats for './raytracing' (10 runs):
13,3195 cache-misses # 22.574 % of all cache refs ( +- 9.44% )
59,0030 cache-references ( +- 9.96% )
147,1677,2511 instructions # 1.86 insns per cycle ( +- 0.00% )
79,0934,9083 cycles ( +- 1.25% )
2.019096528 seconds time elapsed ( +- 1.26% )
雖然 cache-miss 稍微地下降而已, 但是 instrcution 足足少了 24.39%, 執行時間也隨之減少 10.33%, 有時候看似細微的地方, 也能牽扯到整個大局
執行 gprof:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
18.01 0.23 0.23 69646433 0.00 0.00 dot_product
13.21 0.39 0.17 56956357 0.00 0.00 subtract_vector
11.21 0.53 0.14 10598450 0.00 0.00 normalize
10.81 0.67 0.14 13861875 0.00 0.00 rayRectangularIntersection
8.81 0.78 0.11 4620625 0.00 0.00 ray_hit_object
7.60 0.87 0.10 31410180 0.00 0.00 multiply_vector
7.60 0.97 0.10 13861875 0.00 0.00 raySphereIntersection
5.20 1.03 0.07 17836094 0.00 0.00 add_vector
5.20 1.10 0.07 17821809 0.00 0.00 cross_product
3.60 1.14 0.05 4221152 0.00 0.00 multiply_vectors
2.80 1.18 0.04 2110576 0.00 0.00 compute_specular_diffuse
1.60 1.20 0.02 1048576 0.00 0.00 ray_color
1.20 1.21 0.02 2110576 0.00 0.00 localColor
0.80 1.22 0.01 2558386 0.00 0.00 idx_stack_empty
0.80 1.23 0.01 2520791 0.00 0.00 idx_stack_top
0.80 1.24 0.01 1 0.01 0.01 delete_sphere_list
0.80 1.25 0.01 1 0.01 1.24 raytracing
0.00 1.25 0.00 3838091 0.00 0.00 length
0.00 1.25 0.00 1241598 0.00 0.00 protect_color_overflow
0.00 1.25 0.00 1241598 0.00 0.00 reflection
0.00 1.25 0.00 1241598 0.00 0.00 refraction
0.00 1.25 0.00 1204003 0.00 0.00 idx_stack_push
0.00 1.25 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.25 0.00 1048576 0.00 0.00 rayConstruction
0.00 1.25 0.00 113297 0.00 0.00 fresnel
0.00 1.25 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.25 0.00 3 0.00 0.00 append_rectangular
0.00 1.25 0.00 3 0.00 0.00 append_sphere
0.00 1.25 0.00 2 0.00 0.00 append_light
0.00 1.25 0.00 1 0.00 0.00 calculateBasisVectors
0.00 1.25 0.00 1 0.00 0.00 delete_light_list
0.00 1.25 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.25 0.00 1 0.00 0.00 diff_in_second
0.00 1.25 0.00 1 0.00 0.00 write_to_ppm
dot_product 下降了0.14s
substact_vector 下降了0.16s
那既然如此, 乾脆把 math-toolkit 內的迴圈都做 loop unrolling 看看
執行結果:
Performance counter stats for './raytracing' (10 runs):
11,9138 cache-misses # 23.585 % of all cache refs ( +- 2.98% )
50,5147 cache-references ( +- 10.63% )
127,3165,0289 instructions # 1.74 insns per cycle ( +- 0.00% )
73,2732,5074 cycles ( +- 0.31% )
1.869817600 seconds time elapsed ( +- 0.31% )
相較於 origin
instrcution 減少 34.59%
time elapsed 減少 16.95%
以上所有測試皆通過圖形測試
$make check
# Rendering scene
Done!
Execution time of raytracing() : 4.027406 sec
Verified OK
試著平行化迴圈
math-toolkit.h 內的函式皆為 inline function
實行平行化可能沒辦法在編譯時展開
先看看 raytracing.c
void raytracing(uint8_t *pixels, color background_color,
rectangular_node rectangulars, sphere_node spheres,
light_node lights, const viewpoint *view,
int width, int height)
{
...
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; i++) {
double r = 0, g = 0, b = 0;
/* MSAA */
for (int s = 0; s < SAMPLES; s++) {
idx_stack_init(&stk);
rayConstruction(d, u, v, w,
i * factor + s / factor,
j * factor + s % factor,
view,
width * factor, height * factor);
...
}
在 raytracing() 內
是一次畫一個點去作圖的
看能不能針對這巢狀迴圈做平行化
在這之前先考慮 data dependency
…
看了很久才發現資料沒有任何資料寫入衝突(可以看 function parameter 來過濾, 如果宣告為 const 則此資料不會被更動)
那就來實做平行化
將外層函式做平行化
int main(...)
{
...
pthread_t *thread_handles;
thread_handles = (pthread_t*)malloc(sizeof(pthread_t) * thread_num);
for(int i = 0; i < thread_num; i++) {
each_rank[i] = i;
pthread_create(&thread_handles[i], NULL, raytracing_pthread, (void *)&each_rank[i]);
}
}
void *raytracing_pthread(void *rank)
{
int my_rank = *(int*)rank;
...
int start = (height / THREAD_NUMBER) * my_rank;
int end = (height / THREAD_NUMBER) * (my_rank + 1);
for (int j = start; j < end; j++) {
for (int i = 0; i < width; i++) {
...
}
}
}
執行結果:
為了方便分配工作區域, 以2的倍數來分配
雙核:
# Rendering scene
Thread number : 2
Done!
Execution time of raytracing() : 1.110519 sec
四核:
# Rendering scene
Thread number : 4
Done!
Execution time of raytracing() : 0.816143 sec
八核:
# Rendering scene
Thread number : 8
Done!
Execution time of raytracing() : 0.506245 sec
效能成長沒有想像中好
換個方式分配看看
for (int j = my_rank; j < height; j+=THREAD_NUMBER) {
執行結果:
單執行序:
Performance counter stats for './raytracing' (10 runs):
11,0062 cache-misses # 23.736 % of all cache refs ( +- 2.04% )
46,3689 cache-references ( +- 5.54% )
127,2937,7767 instructions # 1.75 insns per cycle ( +- 0.00% )
72,9304,6745 cycles ( +- 0.12% )
1.865169660 seconds time elapsed ( +- 0.23% )
雙核:
Performance counter stats for './raytracing' (10 runs):
11,0145 cache-misses # 18.638 % of all cache refs ( +- 2.23% )
59,0976 cache-references ( +- 5.39% )
127,2961,2413 instructions # 1.58 insns per cycle ( +- 0.00% )
80,5269,8888 cycles ( +- 0.07% )
1.126548896 seconds time elapsed ( +- 0.09% )
四核:
Performance counter stats for './raytracing' (10 runs):
10,5345 cache-misses # 15.362 % of all cache refs ( +- 3.07% )
68,5756 cache-references ( +- 2.35% )
127,2978,3539 instructions # 1.47 insns per cycle ( +- 0.00% )
86,4925,3375 cycles ( +- 1.49% )
0.592542459 seconds time elapsed ( +- 2.45% )
八核:
Performance counter stats for './raytracing' (10 runs):
11,4074 cache-misses # 14.391 % of all cache refs ( +- 3.90% )
79,2670 cache-references ( +- 1.85% )
127,3064,3855 instructions # 1.21 insns per cycle ( +- 0.00% )
105,0530,6169 cycles ( +- 0.09% )
0.354254347 seconds time elapsed ( +- 0.69% )
效能提昇不少
幸好上學期有修過關於平行化的課程
cache-misses 減少了不少
可能是 cpu 能用的空間變多了
但還是沒有依倍數成長
參考資料