#	2019q3 Homework4 (compute-pi)
contributed by < `uccuser159` >
##	time
*    CPU time
        process 在 CPU 上運行所消耗的時間,有多個 thread 的話, CPU time 是各個 thread 所花時間的總和。
*    Wall-clock time
        是真實世界所經過的時間且是自 1970-01-01 起經歷的秒數,但有時不完全是單調遞增,因為可能會和 NTP 伺服器、時區、日光節約時間同步。
由於 `$ time` 會報告程式從開始到結束的經歷時間,另外還會計算程式使用處理器的時間量。所以輸入`time ./time_test_baseline`來看看:
```css
N = 400000000 , pi = 3.141593
real    0m4.295s
user    0m4.291s
sys     0m0.004s
```
**real time** : 表示程式從執行開始到結束所花費的時間。
**user time** : 程式在 user mode 佔用所有的 CPU time 總和。
**sys time** : 程式在 kernel mode 佔用所有的 CPU time 總和。
*    在`benchmark_clock_gettime.c`中用到 [clock_gettime()](https://linux.die.net/man/3/clock_gettime) 函式:
`int clock_gettime(clockid_t clk_id, struct timespec *tp);`
>The clk_id argument is the identifier of the particular clock on which to act.

關於`clk_id`的選擇,會選擇 CLOCK_MONOTONIC_RAW 是因為不受 NTP 伺服器、時區、日光節約時間影響。
>tp arguments is timespec struct
```css=
struct timespec {
        time_t   tv_sec;        /* seconds */
        long     tv_nsec;       /* nanoseconds */
};
```
`clock_gettime()`的精確度到 nerosecond
## Baseline for π
*    Proof
使用 $tan^{-1}(x)$ 的積分來實現 $\pi$
因為 $\int \frac{1}{1+x^2} dx = tan^{-1}(x)$
若計算函數 $\frac{1}{1+x^2}$ 在 x=0 到 x=1 底下的面積,利用黎曼積分,$dx$ 即為在 x 軸上的一小等分,與函數相乘,一小塊一小塊去做加總,得到的面積即為$tan^{-1}(1)= \frac{\pi}{4}$,所以將 $dx$ 分的越細,計算 $\pi$ 會越精準。
程式碼:
```cpp=
double compute_pi_baseline(size_t N)
{
    double pi = 0.0;
    double dt = 1.0 / N;  // dt = (b-a)/N, b = 1, a = 0
    for (size_t i = 0; i < N; i++) {
        double x = (double) i / N;  // x = ti = a+(b-a)*i/N = i/N
        pi += dt / (1.0 + x * x);   // integrate 1/(1+x^2), i = 0....N
    }
    return pi * 4.0;
}
```
取樣從 N=1008 ~ N=1000000,間隔4000

baseline 效能最好的為 AVX+unroll
```cpp=
time ./time_test_baseline
N = 400000000 , pi = 3.141593
4.28user 0.00system 0:04.28elapsed 100%CPU (0avgtext+0avgdata 2372maxresident)k
0inputs+0outputs (0major+90minor)pagefaults 0swaps
time ./time_test_openmp_2
N = 400000000 , pi = 3.141593
4.37user 0.00system 0:02.18elapsed 200%CPU (0avgtext+0avgdata 2312maxresident)k
0inputs+0outputs (0major+89minor)pagefaults 0swaps
time ./time_test_openmp_4
N = 400000000 , pi = 3.141593
4.57user 0.00system 0:01.14elapsed 399%CPU (0avgtext+0avgdata 2532maxresident)k
0inputs+0outputs (0major+98minor)pagefaults 0swaps
time ./time_test_avx
N = 400000000 , pi = 3.141593
1.64user 0.00system 0:01.64elapsed 99%CPU (0avgtext+0avgdata 2344maxresident)k
0inputs+0outputs (0major+89minor)pagefaults 0swaps
time ./time_test_avxunroll
N = 400000000 , pi = 3.141593
0.54user 0.00system 0:00.54elapsed 99%CPU (0avgtext+0avgdata 2384maxresident)k
0inputs+0outputs (0major+91minor)pagefaults 0swaps
```
## Leibniz formula for π 
*    Proof
因為 $\frac{d[arctan(x)]}{dx} =\frac{1}{1+x^2}$
又利用 power series來展開 $\frac{1}{1+x^2} = 1-x^2+x^4-...$
對左右兩式做 0~1 的定積分 $\int_{0}^1(\frac{1}{1+x^2})dx = \int_{0}^1 (1-x^2+x^4 ...)dx$
化簡後得到 $tan^{-1}(1)-tan^{-1}(0) = (x-\frac{1}{3}x^3+ \frac{1}{5}x^5-...)|_{x=1}-(x-\frac{1}{3}x^3+ \frac{1}{5}x^5-...)|_{x=0}$
=> $\frac{\pi}{4} =1-\frac{1}{3}+\frac{1}{5}-\frac{1}{7}-...$
=> $\pi =4(1-\frac{1}{3}+\frac{1}{5}-\frac{1}{7}-...)$
程式碼:
```cpp=
double compute_pi_leibniz(size_t N)
{
    double pi = 0.0;
    for (size_t i = 0; i < N; i++) {
        double tmp = (i & 1) ? (-1) : 1;
        pi += tmp / (2 * i + 1);
    }
    return pi * 4.0;
}
```
取樣從 N=1008 ~ N=1000000,間隔4000

使用 Leibniz 方法來計算 $\pi$ 比 baseline 方法在 N = 400000000時,快了大約 4.28/0.79=5.4 倍。
```cpp=
time ./time_test_leibniz
N = 400000000 , pi = 3.141593
0.79user 0.00system 0:00.79elapsed 100%CPU (0avgtext+0avgdata 2356maxresident)k
0inputs+0outputs (0major+89minor)pagefaults 0swaps
time ./time_test_leibniz_openmp_2
N = 400000000 , pi = 3.141593
0.81user 0.00system 0:00.40elapsed 200%CPU (0avgtext+0avgdata 2460maxresident)k
0inputs+0outputs (0major+93minor)pagefaults 0swaps
time ./time_test_leibniz_openmp_4
N = 400000000 , pi = 3.141593
0.85user 0.00system 0:00.21elapsed 400%CPU (0avgtext+0avgdata 2356maxresident)k
0inputs+0outputs (0major+95minor)pagefaults 0swaps
time ./time_test_leibniz_avx
N = 400000000 , pi = 3.141593
1.67user 0.00system 0:01.67elapsed 100%CPU (0avgtext+0avgdata 2492maxresident)k
0inputs+0outputs (0major+92minor)pagefaults 0swaps
time ./time_test_leibniz_avxunroll
N = 400000000 , pi = 3.141593
0.90user 0.00system 0:00.90elapsed 100%CPU (0avgtext+0avgdata 2504maxresident)k
0inputs+0outputs (0major+91minor)pagefaults 0swaps
```
## Euler formula for π 
*    Proof
可以將 $f(x)=x^2 , x\in[-\pi,\pi]$ 做週期為 $T=2\pi$ 全幅展開做產生 $g(x)$:
因為 $g(x)$ 為偶函數,所以 Fouier cosine series 的形式為 $g(x)=\displaystyle\sum_{n=1}^{\infty}a_0 + a_ncos(nx)$
其中 $a_0=\frac{2}{2\pi}\int_{0}^{\pi}(x^2)dx=\frac{\pi^2}{3}$
$a_1=\frac{2}{\pi}\int_{0}^{\pi}(x^2)cos(nx)dx=\frac{4}{n^2}(-1)^n$
根據傅立葉級數的收斂(Dirichlet 收斂定理),$g(x)$為**均勻收斂**,故傅立葉級數的收斂值即為帶入函數值
$g(\pi)=\pi^2=\frac{\pi^2}{3}+\displaystyle\sum_{n=1}^{\infty}\frac{4}{n^2}(-1)^ncos(n\pi)$
$cos(n\pi)$當 n 為正整數,其值為$(-1)^n$,計算得$\displaystyle\sum_{n=1}^{\infty}\frac{1}{n^2}=\frac{\pi^2}{6}$

程式碼:
```cpp=
double compute_pi_euler(size_t N)
{
    double pi = 0.0;
    for (size_t i = 1; i < N; i++) {
        pi += 1.0 / (i * i);
    }
    return sqrt(pi * 6);
}
```
取樣從 N=1008 ~ N=1000000,間隔4000,計算執行時間

使用 Eular 方法來計算 $\pi$ 與 Leibniz 方法來計算 $\pi$ 的時間幾乎相同。
```css=
time ./time_test_euler
N = 400000000 , pi = 3.141593
0.81user 0.00system 0:00.81elapsed 100%CPU (0avgtext+0avgdata 2424maxresident)k
0inputs+0outputs (0major+90minor)pagefaults 0swaps
time ./time_test_euler_openmp_2
N = 400000000 , pi = 3.141593
0.81user 0.00system 0:00.41elapsed 199%CPU (0avgtext+0avgdata 2368maxresident)k
0inputs+0outputs (0major+90minor)pagefaults 0swaps
time ./time_test_euler_openmp_4
N = 400000000 , pi = 3.141593
0.85user 0.00system 0:00.21elapsed 399%CPU (0avgtext+0avgdata 2528maxresident)k
0inputs+0outputs (0major+98minor)pagefaults 0swaps
time ./time_test_euler_avx
N = 400000000 , pi = 3.141593
1.73user 0.00system 0:01.73elapsed 100%CPU (0avgtext+0avgdata 2504maxresident)k
0inputs+0outputs (0major+91minor)pagefaults 0swaps
time ./time_test_euler_avxunroll
N = 400000000 , pi = 3.141593
0.91user 0.00system 0:00.91elapsed 99%CPU (0avgtext+0avgdata 2388maxresident)k
0inputs+0outputs (0major+91minor)pagefaults 0swaps
```
###	參考資料
- [user mode and kernel mode](http://peachwaneversay.blogspot.com/2007/05/user-mode-vs-kernel-mode.html)
- [clock_gettime()函式](https://starforcefield.wordpress.com/2012/08/06/c%E8%AA%9E%E8%A8%80%EF%BC%9A%E4%BD%BF%E7%94%A8clock_gettime%E8%A8%88%E7%AE%97%E7%A8%8B%E5%BC%8F%E7%A2%BC%E7%9A%84%E6%99%82%E9%96%93%E9%9C%80%E6%B1%82/)
- [Leibniz](https://zh.wikipedia.org/wiki/%CE%A0%E7%9A%84%E8%8E%B1%E5%B8%83%E5%B0%BC%E8%8C%A8%E5%85%AC%E5%BC%8F)
- [傅立葉級數](http://ocw.nthu.edu.tw/ocw/upload/145/1776/CH5_%E5%82%85%E7%AB%8B%E8%91%89%E7%B4%9A%E6%95%B8%20(Fourier%20Series).pdf)
- [shelly4132 的共筆](https://hackmd.io/@ZSR7ePPGTRqdsJIRLTv-8Q/HJrLU0dp?type=view)