contributed by <RayPan
>
finalallpass
raypan@raypan-X550JD:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz
Stepping: 3
CPU MHz: 2782.281
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 5587.62
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
$ sudo apt-get update
$ sudo apt-get install graphviz
:可以繪製各類示意圖$ sudo apt-get install imagemagick
: 可以利用convert
將輸出的圖檔轉換格式gprof可以用來檢視程式執行時間、執行次數等等的工具,藉由產生的文件得以檢視程式問題。
$ make clean
$ make PROFILE=1
:不同於$ make
,會在編譯時加入-pg
參數,用途在於呼叫函式開始和結束的時候會加入統計資料,實務上稱為instrumentation。
$ ./raytracing
:執行後會多產生一個由gprof分析後產生的profiling: gmon.out
$ gprof -b raytracing gmon.out | less
:加入-b
可以移除統計圖表中詳細文字描述less
可以以頁面方式瀏覽
由於使用gprof中間會執行額外程式碼
執行時間會增加
Execution time of raytracing() : 5.140832 sec
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
22.87 0.48 0.48 69646433 0.00 0.00 dot_product
18.58 0.87 0.39 56956357 0.00 0.00 subtract_vector
9.53 1.07 0.20 31410180 0.00 0.00 multiply_vector
8.10 1.24 0.17 13861875 0.00 0.00 rayRectangularIntersection
7.62 1.40 0.16 4620625 0.00 0.00 ray_hit_object
6.67 1.54 0.14 17836094 0.00 0.00 add_vector
5.72 1.66 0.12 17821809 0.00 0.00 cross_product
5.72 1.78 0.12 13861875 0.00 0.00 raySphereIntersection
5.24 1.89 0.11 10598450 0.00 0.00 normalize
3.34 1.96 0.07 4221152 0.00 0.00 multiply_vectors
3.10 2.03 0.07 1048576 0.00 0.00 ray_color
1.43 2.06 0.03 2110576 0.00 0.00 localColor
0.95 2.08 0.02 2110576 0.00 0.00 compute_specular_diffuse
0.48 2.09 0.01 2520791 0.00 0.00 idx_stack_top
0.48 2.10 0.01 1048576 0.00 0.00 rayConstruction
0.00 2.10 0.00 3838091 0.00 0.00 length
0.00 2.10 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 2.10 0.00 1241598 0.00 0.00 protect_color_overflow
使用perf觀測cache-misses等情形
$perf stat -r 10 -e cache-misses,cache-references,instructions,cycles ./raytracing
Performance counter stats for './raytracing' (10 runs):
114,364 cache-misses # 24.126 % of all cache refs ( +- 5.19% )
474,038 cache-references ( +- 7.13% )
33,007,085,278 instructions # 1.90 insns per cycle ( +- 0.01% )
17,415,754,020 cycles ( +- 0.30% )
5.155209650 seconds time elapsed ( +- 0.29% )
由於for迴圈每次都要執行終止條件的判斷
所以把for展開捨去判斷的步驟可以減少執行時間
以math-toolkit.h
內的subtract_vector
為例
將for回圈展開如下:
void subtract_vector(const double *a, const double *b, double *out)
{
out[0] = a[0] - b[0];
out[1] = a[1] - b[1];
out[2] = a[2] - b[2];
}
執行看看
Execution time of raytracing() : 4.409033 sec
首先可以發現執行時間下降了0.7秒
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
17.17 0.23 0.23 69646433 0.00 0.00 dot_product
17.17 0.46 0.23 13861875 0.00 0.00 rayRectangularIntersection
11.20 0.61 0.15 10598450 0.00 0.00 normalize
10.45 0.75 0.14 56956357 0.00 0.00 subtract_vector
7.47 0.85 0.10 4620625 0.00 0.00 ray_hit_object
6.72 0.94 0.09 17821809 0.00 0.00 cross_product
6.35 1.03 0.09 31410180 0.00 0.00 multiply_vector
3.73 1.08 0.05 13861875 0.00 0.00 raySphereIntersection
3.73 1.13 0.05 1048576 0.00 0.00 ray_color
3.36 1.17 0.05 17836094 0.00 0.00 add_vector
3.36 1.22 0.05 4221152 0.00 0.00 multiply_vectors
2.99 1.26 0.04 2520791 0.00 0.00 idx_stack_top
2.24 1.29 0.03 1241598 0.00 0.00 refraction
1.49 1.31 0.02 2110576 0.00 0.00 compute_specular_diffuse
1.49 1.33 0.02 2110576 0.00 0.00 localColor
0.75 1.34 0.01 1 0.01 1.34 raytracing
0.37 1.34 0.01 3838091 0.00 0.00 length
0.00 1.34 0.00 2558386 0.00 0.00 idx_stack_empty
使用gprof後分析後
發現subtract_vector
執行時間由原本的0.39秒下降至0.14秒
multiply_vector
也由0.2秒降至0.09秒
Performance counter stats for './raytracing' (10 runs):
85,665 cache-misses # 18.558 % of all cache refs ( +- 4.49% )
461,603 cache-references ( +- 7.24% )
26,164,199,466 instructions # 1.72 insns per cycle ( +- 0.00% )
15,251,178,470 cycles ( +- 0.72% )
4.521390109 seconds time elapsed ( +- 0.69% )
cache-misses也由24.126%下降至18.558%
An Inline Function is As Fast As a Macro
由於沒有開啟gcc優化功能,所以在每個函式後面加上
__attribute__((always_inline))
強制使用inline功能
執行看看
Execution time of raytracing() : 2.217506 sec
可以發現執行時間大幅下降
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
36.37 0.52 0.52 13861875 0.00 0.00 rayRectangularIntersection
23.08 0.85 0.33 13861875 0.00 0.00 raySphereIntersection
9.79 0.99 0.14 2110576 0.00 0.00 compute_specular_diffuse
9.79 1.13 0.14 1048576 0.00 0.00 ray_color
5.60 1.21 0.08 4620625 0.00 0.00 ray_hit_object
4.90 1.28 0.07 2110576 0.00 0.00 localColor
4.20 1.34 0.06 1048576 0.00 0.00 rayConstruction
2.80 1.38 0.04 1241598 0.00 0.00 reflection
2.10 1.41 0.03 1 0.03 1.43 raytracing
1.40 1.43 0.02 1241598 0.00 0.00 refraction
0.00 1.43 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 1.43 0.00 2520791 0.00 0.00 idx_stack_top
0.00 1.43 0.00 1241598 0.00 0.00 protect_color_overflow
0.00 1.43 0.00 1204003 0.00 0.00 idx_stack_push
0.00 1.43 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.43 0.00 113297 0.00 0.00 fresnel
0.00 1.43 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.43 0.00 3 0.00 0.00 append_rectangular
可以發現在math-toolkits.h內的函式
原本在排行榜前幾名的
做了force inline後已經掉到後面
把dot_product換成
#define dot_product(a,b) ((a[0]*b[0])+(a[1]*b[1])+(a[2]*b[2]))
執行看看
Execution time of raytracing() : 2.041964 sec
可以發現執行時間比force inline又略為下降
Performance counter stats for './raytracing' (10 runs):
259,586 cache-misses # 31.459 % of all cache refs ( +- 8.95% )
825,149 cache-references ( +- 9.63% )
12,050,482,702 instructions # 1.74 insns per cycle ( +- 0.00% )
6,934,713,035 cycles ( +- 0.30% )
2.070231925 seconds time elapsed ( +- 0.35% )
首先以下是我的cpu資訊
每個核心執行緒數為2
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
MAKEFILE
的CFLAGS
前加入-fopenmp
以及LDFLAGS
前加入-lgomp
raytracing.c
內#include <omp.h>
並加入(thread數目設為2)#pragma omp parallel for num_threads(1024), private(d, object_color, stk)
執行看看
thread = 2 Execution time of raytracing() : 2.293012 sec
可以發現到thread數愈多執行時間愈短
thread = 4 Execution time of raytracing() : 1.537311 sec
thread = 8 Execution time of raytracing() : 1.348184 sec
thread = 16 Execution time of raytracing() : 1.322113 sec
thread = 32 Execution time of raytracing() : 1.277774 sec
thread = 64 Execution time of raytracing() : 1.118317 sec
thread = 128 Execution time of raytracing() : 1.095946 sec
thread = 256 Execution time of raytracing() : 1.104022 sec
thread = 512 Execution time of raytracing() : 1.073616 sec
愈來愈接近電腦優化目標。
然而最多只能執行到213 = 8192個
加工中
作業詳細資訊:A02: raytracing
RayPan
A02:Raytracing