contributed by <nekoneko
>
nekoneko
homework
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 37
Model name: Intel(R) Core(TM) i3 CPU M 330 @ 2.13GHz
Stepping: 2
CPU MHz: 933.000
CPU max MHz: 2133.0000
CPU min MHz: 933.0000
BogoMIPS: 4256.27
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid dtherm arat
$ mkdir ~/bin && mv sde-external-7.49.0-2016-07-07-lin .
.profile
PATH="$PATH:$HOME/bin/sde-external-7.49.0-2016-07-07-lin"
$ source .profile
若在Ubuntu 10以上要關掉yama
$ sudo i
$ echo 0 > /proc/sys/kernel/yama/ptrace_scope && exit
$ sde64 -- /bin/bash
clock_gettime()
$ make check
$ make gencsv
到makefile可以知道make gencsv
做什麼
gencsv: default
for i in `seq 100 5000 25000`; do \
printf "%d," $$i;\
./benchmark_clock_gettime $$i; \
done > result_clock_gettime.csv
之後增加資料點的話,改seq
seq [OPTION]... LAST
seq [OPTION]... FIRST LAST
seq [OPTION]... FIRST INCREMENT LAST
time ./time_test_baseline
N = 400000000 , pi = 3.141593
229.57user 0.44system 3:50.18elapsed 99%CPU (0avgtext+0avgdata 39988maxresident)k
0inputs+0outputs (0major+36347minor)pagefaults 0swaps
time ./time_test_openmp_2
N = 400000000 , pi = 3.141593
234.88user 0.11system 1:58.07elapsed 199%CPU (0avgtext+0avgdata 41556maxresident)k
0inputs+0outputs (0major+39560minor)pagefaults 0swaps
time ./time_test_openmp_4
N = 400000000 , pi = 3.141593
355.82user 0.35system 1:38.23elapsed 362%CPU (0avgtext+0avgdata 41324maxresident)k
0inputs+0outputs (0major+40143minor)pagefaults 0swaps
time ./time_test_avx
N = 400000000 , pi = 3.141593
134.49user 0.24system 2:14.88elapsed 99%CPU (0avgtext+0avgdata 43340maxresident)k
0inputs+0outputs (0major+37336minor)pagefaults 0swaps
time ./time_test_avxunroll
N = 400000000 , pi = 3.141593
114.62user 0.44system 1:55.14elapsed 99%CPU (0avgtext+0avgdata 44384maxresident)k
0inputs+0outputs (0major+40273minor)pagefaults 0swaps
elapsed
為real timeCPU percentage 代表意思還要查cheng hung lin
type | user | sys | elapsed | - |
---|---|---|---|---|
baseline | 229.57 | 0.44 | 3:50.18(230.18) | x |
openmp_2 | 234.88 | 0.11 | 1:58.07(118.07) | cpu bound |
openmp_4 | 355.82 | 0.35 | 1:38.23(98.23) | cpu bound |
avx | 134.49 | 0.24 | 2:14.88(134.88) | x |
avxunroll | 114.62 | 0.44 | 1:55.14(115.14) | x |
作業說明提到user time是"每條thread的時間加總"
man time
有解釋到default output。此外elapsed格式為[hours:]mins:seconds
考量到gencsv總共會花多少時間,分別下time取N = 100, 1000, 10000, 100000, 1000000, 10000000的測試時間。其中real time分別是1.052, 1.073, 1.456, 5.256, 47.048, 7:3.214。
產生的測資1000到1000000,其中間隔為1000, 應該太耗時間,先取seq 1000 1000 1000000
seq 1000 10000 1000000
因為不會用LibreOffice先改用gnumeric
因為使用sde64
模擬器跑,所以先將max值降低和增加N樣本間去縮小,跑seq 1000 1000 100000
輸出: sde64的模擬器下跑
1000 ~ 100000, 每間隔10000取一次,共10筆資料
1000 ~ 100000, 每間隔1000取一次,共100筆資料
1000 ~ 100000,每隔間隔100取一次,共1000筆資料
最後一組數據才1000筆就跑了40多分鐘,決定avx版本分開跑!!cheng hung lin
1000 ~ 100000, 間隔:10000,10筆
1000 ~ 100000, 間隔:1000,100筆
1000 ~ 100000, 間隔:100,1000筆
man clock
的資訊提到,glibc 2.18之後,是實做於clock id為CLOCK_PROCESS_CPUTIME_ID
,clock_gettime()
之上。2.17之前都是實做於times()
。不是很懂wrap的意思cheng hung lin
#include <time.h>
clock_t clock(void);
- 筆電雙核虛擬四核,要再研究
- Hyper Threadingcheng hung lin
參考廖健富學長筆記
試做將某次測量時間做m次,產生母體,做分佈圖觀察
time*10^5, count
5173, 39
5174, 67
5175, 52
5176, 51
5177, 19
5178, 26
5179, 44
5180, 44
5181, 35
5182, 36
5183, 25
5184, 31
5185, 21
5186, 28
5187, 32
5188, 22
5189, 21
5190, 18
5191, 18
5192, 17
5193, 15
5194, 9
5195, 9
5196, 20
5197, 14
5198, 7
5199, 2
5200, 7
5201, 8
5202, 8
5203, 9
5204, 9
5205, 13
5206, 4
5207, 7
5208, 6
5209, 8
5210, 2
5211, 5
5212, 7
5213, 2
5214, 5
5215, 1
5216, 4
5217, 6
5218, 2
5219, 2
5220, 2
5221, 2
5222, 1
5223, 2
5225, 3
5226, 5
5227, 1
5228, 3
5229, 1
5230, 2
5231, 1
5232, 2
5233, 1
5234, 2
5235, 1
5236, 2
5238, 2
5239, 3
5240, 1
5241, 3
5244, 1
5245, 3
5246, 2
5247, 4
5248, 1
5249, 1
5250, 1
5253, 1
5254, 4
5255, 3
5256, 3
5257, 2
5258, 3
5259, 2
5260, 5
5261, 2
5262, 3
5264, 1
5265, 2
5266, 6
5268, 4
5269, 1
5270, 1
5271, 1
5272, 2
5273, 3
5274, 1
5278, 2
5279, 1
5282, 3
5283, 1
5284, 2
5285, 2
5286, 2
5289, 3
5290, 1
5292, 1
5293, 5
5294, 4
5296, 2
5298, 1
5302, 1
5303, 1
5304, 1
5306, 1
5307, 1
5308, 1
5309, 1
5310, 1
5311, 1
5315, 1
5316, 1
5317, 2
5318, 1
5319, 1
5321, 1
5322, 1
5324, 1
5325, 1
5326, 1
5327, 1
5332, 1
5333, 1
5334, 1
5335, 1
5338, 1
5344, 1
5703, 1
5747, 1
5867, 1
不熟悉gnuplot,應該要畫成更能看出分佈的historgramcheng hung lin
還是參考同份共筆,實做第一板的信賴區間計算source
信賴區間
其實沒有看懂信賴區間的算法,最後先直接照著算
1000 到 100000,間隔為1000,100筆資料,每筆母體樣本100筆
因為間隔為100的資料要跑很久(可能約1、2小時),沒有下去跑。不然間隔100的資料應該會更顯現差異
除了95%信賴區間純化資料,應該也可以跑看看不使用信賴區間,單純取平均的方式
double compute_pi_leibniz(size_t N)
{
double pi = 0.0;
int sign = 1;
for (size_t i = 0;i < N; i++, sign*=(-1)) {
pi += (double) sign / ((i<<1)+1);
}
return pi * 4.0;
}
double compute_pi_leibniz_unroll(size_t N)
{
double pi = 0.0;
int base = 0;
for (size_t i = 0;i < N - 16; i+=16) {
base = i<<1;
pi += (double) 1 / (base+1);
pi += (double) -1 / (base+3);
pi += (double) 1 / (base+5);
pi += (double) -1 / (base+7);
pi += (double) 1 / (base+9);
pi += (double) -1 / (base+11);
pi += (double) 1 / (base+13);
pi += (double) -1 / (base+15);
pi += (double) 1 / (base+17);
pi += (double) -1 / (base+19);
pi += (double) 1 / (base+21);
pi += (double) -1 / (base+23);
pi += (double) 1 / (base+25);
pi += (double) -1 / (base+27);
pi += (double) 1 / (base+29);
pi += (double) -1 / (base+31);
}
}
leibniz在於不用像baseline一樣做到乘法處理,時間上是有減少的,但leibniz unroll 16 時間上並沒有減少很多。
time()
clock()
clock_gettime(CLOCK_MONOTONIC,....)
getrusage()
gettimeofday()
mach_absolute_time()