contributed by <Jing Zhou>
Ubuntu 16.04
Wall-clock time
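The time that actually elapses in the real world. Wall-clock time is not guaranteed to increase monotonically, but no NTP adjustment or time zone change took place during these measurements, so the elapsed system time is essentially equal to the real time that passed.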
CPU time
The time spent running on (occupying) the CPU, summed over the time used by every thread.
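Values reported by the time command
CPU bound vs. I/O bound
Judging performance
A performance improvement indicates that the program benefited from parallel processing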
#define CLOCK_ID CLOCK_MONOTONIC_RAW
int gettimeofday(struct timeval *tv, struct timezone *tz);
struct timezone {
    int tz_minuteswest; /* minutes west of Greenwich */
    int tz_dsttime;     /* type of DST correction */
};
Get the source code and build it
$ git clone https://github.com/sysprog21/compute-pi
$ cd compute-pi
$ make check
time ./time_test_baseline
N = 400000000 , pi = 3.141593
1.04user 0.00system 0:01.04elapsed 99%CPU (0avgtext+0avgdata 1784maxresident)k
0inputs+0outputs (0major+85minor)pagefaults 0swaps
time ./time_test_openmp_2
N = 400000000 , pi = 3.141593
1.12user 0.00system 0:00.56elapsed 198%CPU (0avgtext+0avgdata 1808maxresident)k
0inputs+0outputs (0major+88minor)pagefaults 0swaps
time ./time_test_openmp_4
N = 400000000 , pi = 3.141593
1.14user 0.00system 0:00.28elapsed 398%CPU (0avgtext+0avgdata 1836maxresident)k
0inputs+0outputs (0major+90minor)pagefaults 0swaps
time ./time_test_avx
N = 400000000 , pi = 3.141593
0.50user 0.00system 0:00.51elapsed 99%CPU (0avgtext+0avgdata 1768maxresident)k
0inputs+0outputs (0major+84minor)pagefaults 0swaps
time ./time_test_avxunroll
N = 400000000 , pi = 3.141593
0.36user 0.00system 0:00.36elapsed 99%CPU (0avgtext+0avgdata 1816maxresident)k
0inputs+0outputs (0major+87minor)pagefaults 0swaps
Using clock_gettime()
Run
Bash loop syntax: for i in $(seq [start] [step] [end]);
In the Makefile, change the loop to for i in `seq 100 100 50000`;
Generating the chart with gnuplot
Modify the Makefile and add plot.gp, then plot the CSV file produced by gencsv
# plot.gp
reset
set ylabel 'Time(sec)'
set xlabel 'N'
set style data lines
set title 'Wall-clock time - using clock\_gettime()'
set datafile separator ","
set terminal png enhanced font 'Verdana,10'
set output 'runtime.png'
plot [0:][0:0.01]'result_clock_gettime.csv' using 1:2 title 'Baseline', \
'' using 1:3 title 'OpenMP (2 threads)', \
'' using 1:4 title 'OpenMP (4 threads)', \
'' using 1:5 title 'AVX', \
'' using 1:6 title 'AVX + Unroll looping'
Run
Result: because the OpenMP version uses 4 threads, it performs better than AVX SIMD + loop unrolling
Comparing clock_gettime() and clock()
In benchmark_clock.c, the clock() measurement is done as follows
clock_t start = 0, end = 0;
/* ... omitted ... */
start = clock();
for (i = 0; i < loop; i++) {
    compute_pi_baseline(N);
}
end = clock();
/* clock() counts in ticks; divide by CLOCKS_PER_SEC to get seconds */
printf("%lf,", ((double) (end - start)) / CLOCKS_PER_SEC);
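In the chart produced with clock(), the measured times come out worse than those obtained with clock_gettime()
My initial suspicion is that this is related to the monotonic clock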