# 2016q3 Homework1 (compute-pi) contributed by <`nekoneko`> ###### tags: `nekoneko` `homework` ## 硬體資訊 ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 37 Model name: Intel(R) Core(TM) i3 CPU M 330 @ 2.13GHz Stepping: 2 CPU MHz: 933.000 CPU max MHz: 2133.0000 CPU min MHz: 933.0000 BogoMIPS: 4256.27 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 3072K NUMA node0 CPU(s): 0-3 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid dtherm arat ``` ## SIMD AVX ### emulator: Intel SDE #### Install ```txt $ mkdir ~/bin && mv sde-external-7.49.0-2016-07-07-lin . ``` - `.profile` ```txt PATH="$PATH:$HOME/bin/sde-external-7.49.0-2016-07-07-lin" ``` ```txt $ source .profile ``` 若在Ubuntu 10以上要關掉yama ``` $ sudo i $ echo 0 > /proc/sys/kernel/yama/ptrace_scope && exit ``` - Run - Emulate Everything Mode 預設bash shell都接到sde64上執行 ``` $ sde64 -- /bin/bash ``` ## 作業資料筆記 - Wall-clock time - xtime - jiffies - CPU time - Real time - User CPU time - System CPU time - real - cpu-time(user + sys) - `clock_gettime()` ## Compute-pi ``` $ make check $ make gencsv ``` 到makefile可以知道`make gencsv`做什麼 ```txt= gencsv: default for i in `seq 100 5000 25000`; do \ printf "%d," $$i;\ ./benchmark_clock_gettime $$i; \ done > result_clock_gettime.csv ``` 之後增加資料點的話,改`seq` - seq: ``` seq [OPTION]... LAST seq [OPTION]... FIRST LAST seq [OPTION]... FIRST INCREMENT LAST ``` - 原始LibreOffice輸出 ![](https://i.imgur.com/mW0IeW6.png) - make check ```txt time ./time_test_baseline N = 400000000 , pi = 3.141593 229.57user 0.44system 3:50.18elapsed 99%CPU (0avgtext+0avgdata 39988maxresident)k 0inputs+0outputs (0major+36347minor)pagefaults 0swaps time ./time_test_openmp_2 N = 400000000 , pi = 3.141593 234.88user 0.11system 1:58.07elapsed 199%CPU (0avgtext+0avgdata 41556maxresident)k 0inputs+0outputs (0major+39560minor)pagefaults 0swaps time ./time_test_openmp_4 N = 400000000 , pi = 3.141593 355.82user 0.35system 1:38.23elapsed 362%CPU (0avgtext+0avgdata 41324maxresident)k 0inputs+0outputs (0major+40143minor)pagefaults 0swaps time ./time_test_avx N = 400000000 , pi = 3.141593 134.49user 0.24system 2:14.88elapsed 99%CPU (0avgtext+0avgdata 43340maxresident)k 0inputs+0outputs (0major+37336minor)pagefaults 0swaps time ./time_test_avxunroll N = 400000000 , pi = 3.141593 114.62user 0.44system 1:55.14elapsed 99%CPU (0avgtext+0avgdata 44384maxresident)k 0inputs+0outputs (0major+40273minor)pagefaults 0swaps ``` - 參考同學的筆記和上網查,得知`elapsed`為real time > CPU percentage 代表意思還要查[name=cheng hung lin] | type | user | sys | elapsed | - | | - | - | - | - | - | | baseline | 229.57 | 0.44 | 3:50.18(230.18) | x | | openmp_2 | 234.88 | 0.11 | 1:58.07(118.07) | cpu bound | | openmp_4 | 355.82 | 0.35 | 1:38.23(98.23) | cpu bound | | avx | 134.49 | 0.24 | 2:14.88(134.88) | x | | avxunroll | 114.62 | 0.44 | 1:55.14(115.14) | x | 作業說明提到user time是"每條thread的時間加總" `man time`有解釋到default output。此外elapsed格式為[hours:]mins:seconds ### clock_gettime() - 環境: 在Intel的模擬下跑 ~~先取1\~1000,看看變化是否真的不準確~~ N 取 1的時候bench_clock_gettime似乎不會結束 考量到gencsv總共會花多少時間,分別下time取N = 100, 1000, 10000, 100000, 1000000, 10000000的測試時間。其中real time分別是1.052, 1.073, 1.456, 5.256, 47.048, 7:3.214。 ~~產生的測資1000到1000000,其中間隔為1000,`seq 1000 1000 1000000`~~ 應該太耗時間,先取`seq 1000 10000 1000000` ![](https://i.imgur.com/Ljc1RHP.png) 因為不會用LibreOffice先改用[gnumeric](https://en.wikipedia.org/wiki/Gnumeric) 因為使用`sde64` 模擬器跑,所以先將max值降低和增加N樣本間去縮小,跑`seq 1000 1000 100000` - 輸出: sde64的模擬器下跑 - 1000 ~ 100000, 每間隔10000取一次,共10筆資料 ![](https://i.imgur.com/bKFsPcJ.png) - 1000 ~ 100000, 每間隔1000取一次,共100筆資料 ![](https://i.imgur.com/pSFBOnz.png) - 1000 ~ 100000,每隔間隔100取一次,共1000筆資料 ![](https://i.imgur.com/4uDZtSd.png) > 最後一組數據才1000筆就跑了40多分鐘,決定avx版本分開跑!![name=cheng hung lin] ## 不考慮使用Intel SDE - 1000 ~ 100000, 間隔:10000,10筆 ![](https://i.imgur.com/Uo3sEiK.png) - 1000 ~ 100000, 間隔:1000,100筆 ![](https://i.imgur.com/aqR4vnA.png) - 1000 ~ 100000, 間隔:100,1000筆 ![](https://i.imgur.com/NYl1XJA.png) ### 小結 - 使用Intel SDE跑確實有比較慢 - 在本機(i3 330M)是雙核行,四執行緒,openmp 2條 和 4條的執行結果是一樣的。但是在Intel SED下可以看出來,openmp四條執行的時間序確實比較兩條執行序來的少。 ## clock() - `man clock`的資訊提到,glibc 2.18之後,是實做於clock id為`CLOCK_PROCESS_CPUTIME_ID`,`clock_gettime()`之上。2.17之前都是實做於[`times()`](http://man7.org/linux/man-pages/man2/times.2.html)。 - wrap 的部份,在32bit的系統上面,CLOCKS_PER_SEC為1000000,會大約在每72分鐘回傳相同的值\(?\) > 不是很懂wrap的意思[name=cheng hung lin] - clock 使用方式 ``` #include <time.h> clock_t clock(void); ``` - 1000 ~ 100000, 間隔:100,1000筆 ![](https://i.imgur.com/jtDbE8J.png) > - 筆電雙核虛擬四核,要再研究 > - Hyper Threading[name=cheng hung lin] ## Error Rate 參考廖健富學長筆記 - baseline 1000000到100000000,間隔100000,共1000筆 ![](https://i.imgur.com/CATpWp7.png) ## 分佈 試做將某次測量時間做m次,產生母體,做分佈圖觀察 - baseline, 積分迴圈做100000,樣本取1000次。因為樣本都小於零,無法算次數,所以下圖所有時間都乘上10^5 ![](https://i.imgur.com/HNGNMJ8.png) time*10^5, count ```txt 5173, 39 5174, 67 5175, 52 5176, 51 5177, 19 5178, 26 5179, 44 5180, 44 5181, 35 5182, 36 5183, 25 5184, 31 5185, 21 5186, 28 5187, 32 5188, 22 5189, 21 5190, 18 5191, 18 5192, 17 5193, 15 5194, 9 5195, 9 5196, 20 5197, 14 5198, 7 5199, 2 5200, 7 5201, 8 5202, 8 5203, 9 5204, 9 5205, 13 5206, 4 5207, 7 5208, 6 5209, 8 5210, 2 5211, 5 5212, 7 5213, 2 5214, 5 5215, 1 5216, 4 5217, 6 5218, 2 5219, 2 5220, 2 5221, 2 5222, 1 5223, 2 5225, 3 5226, 5 5227, 1 5228, 3 5229, 1 5230, 2 5231, 1 5232, 2 5233, 1 5234, 2 5235, 1 5236, 2 5238, 2 5239, 3 5240, 1 5241, 3 5244, 1 5245, 3 5246, 2 5247, 4 5248, 1 5249, 1 5250, 1 5253, 1 5254, 4 5255, 3 5256, 3 5257, 2 5258, 3 5259, 2 5260, 5 5261, 2 5262, 3 5264, 1 5265, 2 5266, 6 5268, 4 5269, 1 5270, 1 5271, 1 5272, 2 5273, 3 5274, 1 5278, 2 5279, 1 5282, 3 5283, 1 5284, 2 5285, 2 5286, 2 5289, 3 5290, 1 5292, 1 5293, 5 5294, 4 5296, 2 5298, 1 5302, 1 5303, 1 5304, 1 5306, 1 5307, 1 5308, 1 5309, 1 5310, 1 5311, 1 5315, 1 5316, 1 5317, 2 5318, 1 5319, 1 5321, 1 5322, 1 5324, 1 5325, 1 5326, 1 5327, 1 5332, 1 5333, 1 5334, 1 5335, 1 5338, 1 5344, 1 5703, 1 5747, 1 5867, 1 ``` > 不熟悉gnuplot,應該要畫成更能看出分佈的historgram[name=cheng hung lin] ## 信賴區間 - 還是參考同份共筆,實做第一板的信賴區間計算[source](https://github.com/nekoneko/compute-pi/blob/master/benchmark_clock_gettime_ci.c) - 信賴區間 - 平均數 - 標準差 (這裡使用的是取樣的標準差算法) - 正負兩個標準差之間為95%的信賴區間 - 其實沒有看懂信賴區間的算法,最後先直接照著算 - 1000 到 100000,間隔為1000,100筆資料,每筆母體樣本==100筆== ![](https://i.imgur.com/elAPAm8.png) - 因為間隔為100的資料要跑很久(可能約1、2小時),沒有下去跑。不然間隔100的資料應該會更顯現差異 - 除了95%信賴區間純化資料,應該也可以跑看看不使用信賴區間,單純取平均的方式 ## Leibniz formula for π - [證明](http://www.mathstat.concordia.ca/faculty/rhall/mc/arctan.pdf) - [arctan微分](https://ocw.mit.edu/courses/mathematics/18-01sc-single-variable-calculus-fall-2010/1.-differentiation/part-b-implicit-differentiation-and-inverse-functions/session-15-implicit-differentiation-and-inverse-functions/MIT18_01SCF10_Ses15b.pdf) - 公式 $$ \frac{\pi}{4}=\sum_{k=0}^{\infty}\frac{(-1)^k}{2k+1} $$ - 程式碼 ```clike= double compute_pi_leibniz(size_t N) { double pi = 0.0; int sign = 1; for (size_t i = 0;i < N; i++, sign*=(-1)) { pi += (double) sign / ((i<<1)+1); } return pi * 4.0; } double compute_pi_leibniz_unroll(size_t N) { double pi = 0.0; int base = 0; for (size_t i = 0;i < N - 16; i+=16) { base = i<<1; pi += (double) 1 / (base+1); pi += (double) -1 / (base+3); pi += (double) 1 / (base+5); pi += (double) -1 / (base+7); pi += (double) 1 / (base+9); pi += (double) -1 / (base+11); pi += (double) 1 / (base+13); pi += (double) -1 / (base+15); pi += (double) 1 / (base+17); pi += (double) -1 / (base+19); pi += (double) 1 / (base+21); pi += (double) -1 / (base+23); pi += (double) 1 / (base+25); pi += (double) -1 / (base+27); pi += (double) 1 / (base+29); pi += (double) -1 / (base+31); } } ``` - 每1000間隔所得到的資料如下 ![](https://i.imgur.com/ymoTYOo.png) leibniz在於不用像baseline一樣做到乘法處理,時間上是有減少的,但leibniz unroll 16 時間上並沒有減少很多。 ## timer - 撰寫時間量測程式前,先評估情況,問問自己幾個問題: 1. 使用什麼 clock 測量?(Real, User, System, Wall-clock…?) 2. clock 的精確度?(s, ms, µs, or faster ?) 3. 需要多久時間你的 clock 會 wrap around?有什麼方法可以避免或處理? 4. clock 是不是 monotonic (單調遞增)?還是會隨著系統時間改變 (NTP, time zone, daylight savings time, by the user…?) 5. 使用的函式使否已經過時、或不是標準函式庫? - [Measure time in Linux - time vs clock vs getrusage vs clock_gettime vs gettimeofday vs timespec_get?](http://stackoverflow.com/questions/12392278/measure-time-in-linux-time-vs-clock-vs-getrusage-vs-clock-gettime-vs-gettimeof) 文章中提到的幾種clock(大部分有看沒有很懂,之後在多查資料) - `time()` wall-time from OS, precision is second - `clock()` - clock wrap通常發生在每2^32 ticks之後,以CPU frequency 2.13 GHz來算,約2s就會發生一次 - `clock_gettime(CLOCK_MONOTONIC,....)` - `getrusage()` - `gettimeofday()` - `mach_absolute_time()` ## gnu plot - xtic(N): 設定 x axis每N筆資料顯示一次 - set format y %.6f: 設定ytic的label數字的顯示到小數點第六位 - set xrange [min:max] - plot ... with boxes - plot .... with impulses - plot sin(x) linetytpe 4 # set as color 4 - plot sin(x) lt rgb "#FFFFFF" - plot sin(x) lt rgb "violet" - gnuplot> show colorname(s) #顯示color name ## 參考資料 - [Intel® Software Development Emulator](https://software.intel.com/en-us/articles/intel-software-development-emulator) - [同學筆記](https://hackmd.io/s/Hy7tssw6) - 廖建富學長筆記 - [Why real time can be lower than user time](http://unix.stackexchange.com/questions/40694/why-real-time-can-be-lower-than-user-time) - [Gnuplot 純畫圖](http://v.im.cyut.edu.tw/~ckhung/b/ma/gnuplot.php) - [Plotting using a CSV file]() - [How to pass command line argument to gnuplot?]() - [How can I concatenate string variables in Bash?]() - [wall time](http://whatis.techtarget.com/definition/wall-time-real-world-time-or-wall-clock-time) - [Measure time in Linux - time vs clock vs getrusage vs clock_gettime vs gettimeofday vs timespec_get?](http://stackoverflow.com/questions/12392278/measure-time-in-linux-time-vs-clock-vs-getrusage-vs-clock-gettime-vs-gettimeof) - ~~[How to escape the % sign in C's printf?](http://stackoverflow.com/questions/1860159/how-to-escape-the-sign-in-cs-printf)~~ - [Linetype, colors, and styles](http://gnuplot.sourceforge.net/docs_4.2/node62.html)