# 2017q1 Homework1 (compute-pi)
contributed by < `twzjwang` >
作業說明:  [B03: compute-pi](https://hackmd.io/s/HkiJpDPtx)
github:  [twzjwang/compute-pi](https://github.com/twzjwang/compute-pi)
### Reviewed by `hunng`
- 實際實驗、比較各種擷取時間函數並比較
- 雖然只是更換 rand() 成 rand_r() 但卻能在效能上做出明顯改善
# 開發環境
- Ubuntu Ubuntu 16.04.2 LTS
  Linux 4.8.0-36-generic
- cpu
    version: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                4
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              3072K
- memory
    size: 4GiB
    
# 開發前
- [黎曼積分](https://zh.wikipedia.org/wiki/%E9%BB%8E%E6%9B%BC%E7%A7%AF%E5%88%86)
- [AVX](https://zh.wikipedia.org/wiki/AVX%E6%8C%87%E4%BB%A4%E9%9B%86)
>數學相關式子可以試著使用 LaTeX 表示喔!
>[name=課程助教][color=red]
- 時間處理
    - Wall-clock time 
        - 自 1970-01-01 起經歷的秒數
        - 「通常」不建議拿來量測程式時間
        - 不一定是[單調遞增(monotonic)](https://zh.wikipedia.org/wiki/%E5%8D%95%E8%B0%83%E5%87%BD%E6%95%B0)
    - CPU time
        - 程序在 CPU 上面運行消耗 (佔用) 的時間
        - clock()
        - multithreads - 每條 thread 的使用時間加總
- 時間函數
    - [clock()](https://linux.die.net/man/3/clock)
        - return number of clock ticks
        - 將結果除以 `CLOCKS_PER_SEC` 得到時間
        - 在 32-bit 系統下 `CLOCKS_PER_SEC` = 1000000
          $2^{32}$ / 1000000 = 4295(s) = 72(min)
          約72分鐘後會 wrap around
        - 不能區分 kernel mode 、 user mode 占用的 CPU time
    - [clock_gettime()](https://linux.die.net/man/2/clock_gettime)
        - 根據參數 `clockid_t` 決定 clock 類型
            - system-wide (for all processes)
            - per-process
        - 時間單位 nanoseconds : 1/$10^{-9}$ (s) , 10 億分之一秒
        ```C
        struct timespec {
            time_t   tv_sec;        /* seconds */
            long     tv_nsec;       /* nanoseconds */
        };
        ```
    - [time()](https://linux.die.net/man/2/time)
        - return 1970-01-01 00:00:00 +0000 (UTC) 至現在的秒數
        - 在 32-bit 系統下,time_t 是 signed int32
          可記錄 $2^{31}$(s) = 68(years),也就是到2038年會 wrap around
        - [2038年問題](https://zh.wikipedia.org/wiki/2038%E5%B9%B4%E9%97%AE%E9%A2%98)
    - [gettimeofday()](https://linux.die.net/man/2/gettimeofday)
        - 時間單位 microseconds : $1/10^{-6}$ (s) 
        ```c
        struct timeval {
            time_t      tv_sec;     /* seconds */
            suseconds_t tv_usec;    /* microseconds */
        };
        ```
        - `timezone` 結構
        ```c
        struct timezone {
            int tz_minuteswest;     /* minutes west of Greenwich */
            int tz_dsttime;         /* type of DST correction */
        };
        ```
- time() 和 gettimeofday() 不適合拿來作 benchmark
    - 參考[gettimeofday() should never be used to measure time
](https://blog.habets.se/2010/09/gettimeofday-should-never-be-used-to-measure-time.html)
    - Nobody sneaks in and changes your wristwatch when you’re not looking
        - 系統時間可能被改變(NTP、時區...)
    - You’re not stupid, the computer is
    - You are the limiting factor
        - 生活中通常也是用現實時間的時間差計時
        - 時間單位不夠精確
    - 用 clock_gettime 代替
        - It’s not stable across reboots
        - goes up by one second for each second that passes in the physical world
# 實驗一 - [Monte Carlo method](https://en.wikipedia.org/wiki/Monte_Carlo_method)
- 步驟
    - 隨機放置一點在 A(x,y) , 0<=x<=1, 0<=y<=1
    - 重複 N 次,並紀錄 M
    - $pi = 4 * M / N$
- C : 半徑為 1 ,原點為(0,0)的園
- N : 樣本數
- M : A 在 C 內的次數
    - $x^2 + y^2 < 1$ => 在 C 內
- P : 
    - $P = M / N$
    - $P = 園面積 / 正方形面積 = (pi / 4) / 1 = pi / 4$
    - => $pi = 4 * M / N$
- 使用 rand() 產生隨機數
```c
double montecarlo(long long int N)
{
    long long int i;
    long long int M = 0;
    double x,y;
    
    for (i = 0; i < N; i++) {
        x = (double) rand() / RAND_MAX;
        y = (double) rand() / RAND_MAX;
        if ( x * x + y * y < 1)
            M++;
    }
    return (double) M / N * 4;
}
```
## 方法 1: 實作 Monte Carlo method,分別使用 4 種時間函數計時
- 計時方法
    - clock()
        ```C
        clock_t start = clock();
        
        ...
        
        clock_t finish = clock();
        double  duration = (double)(finish - start) / CLOCKS_PER_SEC;        
        ```
    - clock_gettime()
        ```C
        struct timespec start;
        struct timespec finish;
        clock_gettime(CLOCK_MONOTONIC, &start);
        
        ...
        
        clock_gettime(CLOCK_MONOTONIC, &finish);
        double  duration = (finish.tv_sec - start.tv_sec) 
            + (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
        ```
    - time()
        ```C
        time_t start = time(0);
        
        ...
        
        time_t finish = time(0);
        double  duration = (double)(finish - start);
        ```
    - gettimeofday()
        ```C
        struct timeval start;
        struct timeval finish;
        gettimeofday(&start,NULL);
        
        ...
        
        gettimeofday(&finish,NULL);
        double  duration = (finish.tv_sec - start.tv_sec) 
            + (finish.tv_usec - start.tv_usec) / 1000000.0;
        ```
- 結果
```
$ gcc -o montecarlo montecarlo.c 
$ ./montecarlo
N = 100000000
clock()
pi = 3.141698
2.731048 (s)
clock_gettime()
pi = 3.141420
2.729236 (s)
time()
pi = 3.141591
3.000000 (s)
gettimeofday()
pi = 3.141585
2.749408 (s)
```
## 方法 2: 比較不同樣本數得到結果的誤差
- 使用 clock_gettime()
- 結果
    n | $10^0$ | $10^1$ | $10^2$ | $10^3$ | $10^4$ | $10^5$ | $10^6$ | $10^7$ | $10^8$ | $10^9$ 
    -|----|----|----|----|----|----|----|----|----|----
    time(s)|0.000003|0.000002|0.000015|0.000128|0.000943|0.009436|0.040706|0.277493|2.723461|27.253682
    pi|4.000000|3.200000|3.120000|3.152000|3.167200|3.138880|3.141920|3.140899|3.141710|3.141604
    error|0.858407|0.058407|0.021593|0.010407|0.025607|0.002713|0.000327|0.000694|0.000117|0.000011
    
>X label 可以參考[美圖欣賞第三章](https://hackmd.io/s/Skwp-alOg#美圖欣賞) 修改
>[name=課程助教][color=red]
>增加`set xtics rotate by 45 right`
```
$ gcc -o montecarlo montecarlo.c 
$ ./montecarlo
N = 1
0.000003 (s)
pi    = 4.000000
error = 0.858407
N = 10
0.000002 (s)
pi    = 3.200000
error = 0.058407
...
N = 100000000
2.723461 (s)
pi    = 3.141710
error = 0.000117
N = 1000000000
27.253682 (s)
pi    = 3.141604
error = 0.000011
```
## 方法 3: 用 OpenMP 平行化
```C
#pragma omp parallel default(none) \
private(i,x,y) shared(N) reduction(+:M)
{
    #pragma omp for
    for (i = 0; i < N; i++) {
        x = (double) rand() / RAND_MAX;
        y = (double) rand() / RAND_MAX;
        if ( x * x + y * y < 1)
            M++;
    }
}
```
- 結果
    - 以 N = 10000000 為例
        - serial
            - 0.277493 (s)
        - parallel
            - 2.966225 (s)
    - multithread 比較慢...
        - 在 4 cores 環境下,時間增為 10.6 倍
- [Program is slower when using multiple threads](http://stackoverflow.com/questions/16437105/program-is-slower-when-using-multiple-threads)
    - rand() is not thread safe.
- [OpenMP program is slower than sequential one](http://stackoverflow.com/questions/10624755/openmp-program-is-slower-than-sequential-one) 下方留言提到
    - rand() 使用 global state variables (hidden in the (g)libc implementation)
    - 應使用 private 的 seed 作為 parameter 呼叫 rand_r()
    - 同時推薦 [erand48()](https://linux.die.net/man/3/drand48) 作為更好的 random number source
- 這位學長筆記中解釋為什麼 multi-thread 不保證比較快
    - [CheHsuan HackMD](https://hackmd.io/s/BkbydgVa)
    - rand() is not reentrant or thread-safe
        - rand()有用到靜態變數(或全域變數),導致該靜態變數或全域變數變得不可預期
        - multi-thread 的環境下要使用非 thread-safe 時,必須自己加上lock
## 方法 4: 用 OpenMP 平行化-改用 [rand_r()](https://linux.die.net/man/3/rand_r)
- 參考 [How to generate random numbers in parallel?](http://stackoverflow.com/questions/4287531/how-to-generate-random-numbers-in-parallel) 產生隨機數
- 以 omp_get_thread_num() 作為 rand_r 的 seed
```C
    #pragma omp parallel default(none) \
    private(i,x,y) shared(N) reduction(+:M)
    {
        unsigned int myseed = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < N; i++) {
            x = (double) rand_r(&myseed) / RAND_MAX;
            y = (double) rand_r(&myseed) / RAND_MAX;
            if ( x * x + y * y < 1)
                M++;
        }
    }
```
- 結果
    - 以 N = 10000000 為例,比較平行化前後差異
        - serial
            - 0.277493 (s)
        - parallel
            - 0.079453 (s)
        - 在 4 cores 環境下,執行時間前為平行前的 29%
    - 以 N = 10000000 為例,在使用OpenMp平行化下,比較 rand() 與 rand_r()
        - rand()
            - 2.966225 (s)
        - rand_r()
            - 0.079453 (s)
        - rand_r() 執行時間為 rand() 的 3%
   
# 將 Monte Carlo method 與其他方法合併
- `$ make plot`
- 結果
    - 暫不考慮誤差大小,僅考慮執行時間
    - 
# 參考資料
- [clock()、time()、clock_gettime()和gettimeofday()函数的用法和区别](http://www.cnblogs.com/krythur/archive/2013/02/25/2932647.html)
- [vic85821 HackMD](https://hackmd.io/s/rkgYN0P6#)
- [Program is slower when using multiple threads](http://stackoverflow.com/questions/16437105/program-is-slower-when-using-multiple-threads)
- [OpenMP program is slower than sequential one](http://stackoverflow.com/questions/10624755/openmp-program-is-slower-than-sequential-one)
- [CheHsuan HackMD](https://hackmd.io/s/BkbydgVa)
- [rand()](https://linux.die.net/man/3/rand)
- [rand_r()](https://linux.die.net/man/3/rand_r)
- [erand48()](https://linux.die.net/man/3/drand48)
- [How to generate random numbers in parallel?](http://stackoverflow.com/questions/4287531/how-to-generate-random-numbers-in-parallel)
- ['zmke' script.gp](https://github.com/zmke/compute-pi/blob/master/script.gp)
###### tags: `twzjwang`