# 2016q3 Homework1(Compute Pi)
###### tags: `tundergod` `hw1` `2016q3`
contributed by <`tundergod`>
[**Assignment description**](https://hackmd.io/s/rJARexQT)
## Assignment Requirements
* Fork compute-pi on GitHub and try to reproduce the experiments, including an analysis of the output
    * Note: increase the number of iterations in the pi computation, otherwise the run time is too short to show any difference
* Read 廖健富's [detailed analysis](https://hackpad.com/Hw1-Extcompute_pi-geXUeYjdv1I), study how the error rate is computed, and think about why we should pay attention to it
* Use gnuplot, introduced in the phonebook assignment, as an output mechanism beyond `$ make gencsv`; after running `$ make plot`, gnuplot should produce charts comparing the performance of the implementations above
* Techniques from raytracing such as OpenMP, software pipelining, and loop unrolling may be used to speed the program up
* Read the earlier "Wall-clock time vs. CPU time" material and think about how to measure time precisely
* Besides studying the code, prove the Leibniz formula for π and find other methods of computing pi
* Read [許元杰's notes](https://embedded2015.hackpad.com/-Ext1-Week-1-5rm31U3BOmh)
* Record your observations, analysis, and performance-improvement steps, with gnuplot charts, in the [assignment area](https://hackmd.io/s/H1B7-hGp)
## Development Environment
* Linux version: Ubuntu 16.04 LTS
* Hardware:
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
Stepping: 1
CPU MHz: 1661.660
CPU max MHz: 2600.0000
CPU min MHz: 800.0000
BogoMIPS: 4589.54
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
```
## Code Walkthrough
* `computepi.c` and `computepi.h` hold the main implementations
* `time_test.c` defines the size of N
* `benchmark_clock_gettime` uses `clock_gettime()` to measure execution time
* The `Makefile` uses `-D` macro definitions to build and run each implementation in turn
## Raw Data
* make check:
```
time ./time_test_baseline
N = 1000000000 , pi = 3.141593
11.01user 0.00system 0:11.01elapsed 99%CPU (0avgtext+0avgdata 1812maxresident)k
0inputs+0outputs (0major+85minor)pagefaults 0swaps
time ./time_test_openmp_2
N = 1000000000 , pi = 3.141593
12.22user 0.02system 0:06.14elapsed 199%CPU (0avgtext+0avgdata 1724maxresident)k
0inputs+0outputs (0major+89minor)pagefaults 0swaps
time ./time_test_openmp_4
N = 1000000000 , pi = 3.141593
24.18user 0.00system 0:06.13elapsed 394%CPU (0avgtext+0avgdata 1808maxresident)k
0inputs+0outputs (0major+93minor)pagefaults 0swaps
time ./time_test_avx
N = 1000000000 , pi = 3.141593
3.16user 0.00system 0:03.16elapsed 99%CPU (0avgtext+0avgdata 1932maxresident)k
0inputs+0outputs (0major+87minor)pagefaults 0swaps
time ./time_test_avxunroll
N = 1000000000 , pi = 3.141593
3.06user 0.00system 0:03.06elapsed 99%CPU (0avgtext+0avgdata 1784maxresident)k
0inputs+0outputs (0major+86minor)pagefaults 0swaps
```
>The second half of the output above was not obvious to me at first; running `time` by hand only prints real, user, and sys time, as below:
```
time ./time_test_baseline
N = 1000000000 , pi = 3.141593
real 0m11.085s
user 0m11.076s
sys 0m0.008s
```
* The data above show real time ≈ user time, so the program is CPU bound. With OpenMP the elapsed time drops, but the user time, which is summed over all threads, grows with the thread count (2 threads: 12.22 s; 4 threads: 24.18 s), as does the reported %CPU.
## Formulas for Computing Pi
* references:
* [List of formulae involving π(wiki)](https://en.wikipedia.org/wiki/List_of_formulae_involving_%CF%80)
* [Pi Formulas(wolfram)](http://mathworld.wolfram.com/PiFormulas.html)
* Formulas for computing pi:
    * Baseline (numerical integration):
    $$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \approx \sum_{i=0}^{N-1} \frac{4}{1+(i/N)^2}\cdot\frac{1}{N}$$
    * Leibniz:
    $$\frac{\pi}{4} = \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}$$

Both formulas approach pi by accumulating terms over many iterations, and each term involves multiplications or powers. A different approach, the [Monte Carlo method](https://en.wikipedia.org/wiki/Monte_Carlo_method), instead estimates pi from many points placed at random inside a square.

## Implementation
* baseline:
```
double compute_pi_baseline(size_t N)
{
    double pi = 0.0;
    double dt = 1.0 / N;            // dt = (b-a)/N, b = 1, a = 0
    for (size_t i = 0; i < N; i++) {
        double x = (double) i / N;  // x = ti = a+(b-a)*i/N = i/N
        pi += dt / (1.0 + x * x);   // integrate 1/(1+x^2), i = 0....N
    }
    return pi * 4.0;
}
```
* leibniz:
```
#include <stdio.h>

int main()
{
    size_t n = 500000000;
    double pi = 0.0;
    for (size_t i = 0; i < n; i++) {
        int sign = (i % 2 == 0) ? 1 : -1;
        pi += sign / (2.0 * (double) i + 1.0);
    }
    printf("pi = %lf\n", pi * 4);
    return 0;
}
```
* monte carlo:
```
#include <stdio.h>
#include <stdlib.h>

int main()
{
    size_t n = 500000000;
    size_t count = 0;   /* # of points in the 1st quadrant of unit circle */
    for (size_t i = 0; i < n; i++) {
        double x = (double) rand() / RAND_MAX;
        double y = (double) rand() / RAND_MAX;
        double z = x * x + y * y;
        if (z <= 1)
            count++;
    }
    printf("pi = %lf\n", (double) count / n * 4);
    return 0;
}
```
```
result of monte carlo: pi = 3.141539
```
* The Monte Carlo method still shows some error after 500 million points. ~~I suspected the distribution of the rand function was very poor and planned to fix it~~
Because C's `rand()` is a deterministic pseudo-random generator producing a fixed sequence, accuracy plateaus no matter how many points are drawn.
* Applied AVX and AVX + unrolling to Leibniz, and OpenMP to Monte Carlo. Results:
```
time ./time_test_baseline
N = 500000000 , pi = 3.141593
5.76user 0.00system 0:05.76elapsed 99%CPU (0avgtext+0avgdata 1772maxresident)k
0inputs+0outputs (0major+84minor)pagefaults 0swaps
time ./time_test_openmp_2
N = 500000000 , pi = 3.141593
6.16user 0.02system 0:03.13elapsed 197%CPU (0avgtext+0avgdata 1824maxresident)k
0inputs+0outputs (0major+87minor)pagefaults 0swaps
time ./time_test_openmp_4
N = 500000000 , pi = 3.141593
11.43user 0.02system 0:03.16elapsed 362%CPU (0avgtext+0avgdata 1828maxresident)k
0inputs+0outputs (0major+94minor)pagefaults 0swaps
time ./time_test_avx
N = 500000000 , pi = 3.141593
1.77user 0.00system 0:01.77elapsed 99%CPU (0avgtext+0avgdata 1940maxresident)k
0inputs+0outputs (0major+90minor)pagefaults 0swaps
time ./time_test_avxunroll
N = 500000000 , pi = 3.141593
1.69user 0.00system 0:01.69elapsed 99%CPU (0avgtext+0avgdata 1724maxresident)k
0inputs+0outputs (0major+86minor)pagefaults 0swaps
time ./time_test_leibniz
N = 500000000 , pi = 3.141593
2.88user 0.00system 0:02.88elapsed 99%CPU (0avgtext+0avgdata 1724maxresident)k
0inputs+0outputs (0major+86minor)pagefaults 0swaps
time ./time_test_leibniz_avx
N = 500000000 , pi = 3.141593
1.85user 0.00system 0:01.85elapsed 99%CPU (0avgtext+0avgdata 1764maxresident)k
0inputs+0outputs (0major+86minor)pagefaults 0swaps
time ./time_test_leibniz_avx_unroll
N = 500000000 , pi = 3.141593
1.54user 0.00system 0:01.54elapsed 99%CPU (0avgtext+0avgdata 1720maxresident)k
0inputs+0outputs (0major+87minor)pagefaults 0swaps
time ./time_test_montecarlo
N = 500000000 , pi = 3.141539
15.24user 0.04system 0:15.29elapsed 99%CPU (0avgtext+0avgdata 1784maxresident)k
0inputs+0outputs (0major+84minor)pagefaults 0swaps
time ./time_test_montecarlo_openmp_2
N = 500000000 , pi = 3.141519
142.99user 186.48system 2:47.89elapsed 196%CPU (0avgtext+0avgdata 1816maxresident)k
0inputs+0outputs (0major+87minor)pagefaults 0swaps
time ./time_test_montecarlo_openmp_4
N = 500000000 , pi = 3.141506
129.12user 254.31system 2:07.48elapsed 300%CPU (0avgtext+0avgdata 1748maxresident)k
0inputs+0outputs (0major+92minor)pagefaults 0swaps
```
The Leibniz implementation gains clearly from AVX and AVX + unrolling, but Monte Carlo with OpenMP becomes dramatically slower, and its system time is by far the largest.
```
time ./time_test_montecarlo_openmp_4
N = 500000000 , pi = 3.141526
real 2m3.363s
user 2m1.964s
sys 3m58.884s
```
>Time breakdown: system time > real time ≈ user time
```
#pragma omp parallel num_threads(threads)
{
    #pragma omp for private(x,y) reduction(+:z)
    for (size_t i = 0; i < N; i++) {
        x = (double) rand() / RAND_MAX;
        y = (double) rand() / RAND_MAX;
        if (x * x + y * y <= 1)
            z++;
    }
}
```
Chart 1 (baseline):

Chart 2 (Leibniz):

Chart 3 (Monte Carlo method):

The N sequence for the data above was generated with `seq 100 5000 100100`.
## Error Rate
* references:
    * [廖健富's notes](https://hackpad.com/Hw1-Extcompute_pi-geXUeYjdv1I)
* Why consider the error rate?
    * Any experiment must rest on correct results, or at least a known, minimal error
    * The Monte Carlo function (which cannot sample a perfectly uniform distribution) is a cautionary example.
* Computing the error rate:
    * Compare the results from computepi against `M_PI`, defined in `<math.h>`
    * Searching for the value of pi online shows that the pi defined in C matches only to about 15 decimal places, which is the limit of double precision:
```
pi found online = 3.14159265358979323846
pi defined in C = 3.14159265358979311599
```
* The error data are obtained by subtracting `M_PI` (from `<math.h>`) from the results of `compute_pi_baseline()`, `compute_pi_leibniz()`, and `compute_pi_montecarlo()`.
* Error chart 1:

* Error chart 2:

* Error chart 3:

* Error chart 4: the same data with logscale removed
>The Monte Carlo method keeps jumping around (its random-sampling problem), while baseline and Leibniz, interestingly (oddly), track exactly the same error and converge to about 6 decimal places of precision (extending as a near-horizontal line). With logscale removed, chart 4 shows even more clearly that around N = 6000 both algorithms converge very quickly toward their best precision, and beyond that point extra iterations buy almost no additional digits.
>[name=Lim Wen Sheng]
## Wall-clock time vs. CPU time
* references:
* [clock()、time()、clock_gettime()和gettimeofday()函数的用法和区别](http://www.cnblogs.com/krythur/archive/2013/02/25/2932647.html)
* [Measure time in Linux - time vs clock vs getrusage vs clock_gettime vs gettimeofday vs timespec_get?](http://stackoverflow.com/questions/12392278/measure-time-in-linux-time-vs-clock-vs-getrusage-vs-clock-gettime-vs-gettimeof)
### Measure time in Linux
* [time()](https://linux.die.net/man/2/time)
    * returns the wall-clock time from the OS, with precision in seconds.
    * Because it reads wall-clock time (seconds elapsed since 1970), it is not a stable source for timing programs: NTP (Network Time Protocol) adjustments, time-zone changes, and daylight-saving shifts can make it non-monotonic.
>The lecture notes say wall-clock time is updated from xtime, which is kept with microsecond (µs) resolution, but [some references](https://linux.die.net/man/2/time) describe time() as returning whole seconds
* [clock()](https://linux.die.net/man/3/clock)
    * returns an approximation of processor time used by the program.
    * The return type is clock_t; divide by CLOCKS_PER_SEC (1000000) to get seconds. Typically clock() is called twice and the difference taken to time the process itself.
    * On 32-bit systems clock() wraps around roughly every 72 minutes and starts returning repeated values: 2<sup>32</sup> µs ≈ 72 × 60 × 10<sup>6</sup> µs.
    * clock() does not count CPU time used by child processes
    * It cannot distinguish time spent in user mode from kernel mode
* [clock_gettime()](https://linux.die.net/man/3/clock_gettime)
    * retrieve the time of the specified clock clk_id.
    * Nanosecond precision; with CLOCK_MONOTONIC the value is monotonically increasing, sourced from hardware timers
    * Structure:
```
struct timespec {
    time_t tv_sec;  /* seconds */
    long   tv_nsec; /* nanoseconds */
};
```
* [getrusage()](https://linux.die.net/man/2/getrusage)
    * Retrieves resource-usage information about the process; its time fields have microsecond precision
    * Structure:
```
struct rusage {
    struct timeval ru_utime; /* user CPU time used */
    struct timeval ru_stime; /* system CPU time used */
    long ru_maxrss;          /* maximum resident set size */
    long ru_ixrss;           /* integral shared memory size */
    long ru_idrss;           /* integral unshared data size */
    long ru_isrss;           /* integral unshared stack size */
    long ru_minflt;          /* page reclaims (soft page faults) */
    long ru_majflt;          /* page faults (hard page faults) */
    long ru_nswap;           /* swaps */
    long ru_inblock;         /* block input operations */
    long ru_oublock;         /* block output operations */
    long ru_msgsnd;          /* IPC messages sent */
    long ru_msgrcv;          /* IPC messages received */
    long ru_nsignals;        /* signals received */
    long ru_nvcsw;           /* voluntary context switches */
    long ru_nivcsw;          /* involuntary context switches */
};
```
* [gettimeofday()](https://linux.die.net/man/2/gettimeofday)
    * Returns the current time (via the timeval and timezone structures)
    * The returned value is subject to discontinuous jumps in the system clock, so it is not suitable for measuring execution time
* [mach_absolute_time()](http://www.devdiv.com/mach_absolute_time_-blog-257920-52195.html)
    * A timing function on OS X with nanosecond precision
* Conclusion:
    * clock_gettime() can be used in almost any situation; its only weakness is small microsecond-level errors when the CPU performs a context switch.
    * time() and gettimeofday() are (potentially) non-monotonic, which makes them unsuitable for benchmarking
### xtime, Jiffies, Tick, HZ
* xtime: the wall-clock time, initialized from the CMOS clock and counted from January 1, 1970. The Linux kernel decides how many timer interrupts occur per second (via the constant HZ) and updates xtime on every timer interrupt.
* HZ: the rate at which the Linux kernel raises periodic timer interrupts (which also keeps a process stuck in an infinite loop from holding the CPU forever)
* Tick: the interval between two consecutive timer interrupts, i.e. 1/HZ seconds
* Jiffies: the total number of timer interrupts (ticks) since boot. For example, with HZ = 1000 one tick lasts 1 s / 1000 = 1 ms, so the jiffies-based timer resolution is 1 ms.
## Further Questions
* Why are clock_gettime() results erratic?
    * references: [Why is clock_gettime so erratic?](http://stackoverflow.com/questions/6814792/why-is-clock-gettime-so-erratic)
    * The official manual page for clock_gettime() notes:
```
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on
many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium).
These registers may differ between CPUs and as a consequence these clocks may
return bogus results if a process is migrated to another CPU.
```
* clock_gettime() relies on per-CPU timer registers, so when a process migrates between CPUs the reported time can become inaccurate
* Someone measured, on a dual Intel 5150, the context-switch cost across different working-set sizes:


* Solution: a [confidence interval](https://zh.wikipedia.org/zh-tw/%E7%BD%AE%E4%BF%A1%E5%8C%BA%E9%97%B4) can be used to tame the erratic measurements
* Confidence interval
    * A confidence interval is an interval estimate, in statistics, of a parameter from a probability sample: it expresses the degree of confidence that the measured parameter falls within the interval.

    * On a symmetric normal distribution, roughly 68% of samples fall within one standard deviation of the mean, and the 95% confidence interval is the mean plus or minus two standard deviations
    * Standard deviation formula:
    $$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
    * A common misconception: the percentage attached to a confidence interval is a confidence level, not a probability; it does not mean that exactly 95% of random samples will land in the interval. (Related: the 68-95-99.7 rule, also called the empirical rule.)
## Proof of the Leibniz Formula
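The proof content appears to have been an image that is now missing; a standard derivation (my reconstruction) expands the integrand of $\arctan$ as a geometric series and integrates term by term:

$$\arctan x = \int_0^x \frac{dt}{1+t^2} = \int_0^x \sum_{k=0}^{\infty} (-1)^k t^{2k}\,dt = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{2k+1}, \qquad |x| \le 1$$

Setting $x = 1$ (valid at the endpoint by Abel's theorem, since the alternating series converges there) gives

$$\frac{\pi}{4} = \arctan 1 = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots$$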
