2017q1 Homework1 (compute-pi)
===
contributed by <`Sean1127`>
### Reviewed by `chanyangch`
* In the OpenMP part, make sure the functions used are reentrant: `rand` should be replaced with `rand_r`, see this [link](http://man7.org/linux/man-pages/man3/rand.3.html)
* The timings and error rates of the different approaches deserve more discussion
## Development Environment
```
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Model name: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Stepping: 1
CPU MHz: 1097.607
CPU max MHz: 2700.0000
CPU min MHz: 800.0000
BogoMIPS: 4789.26
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
```
## Assignment Requirements
* Fork compute-pi on GitHub and try to reproduce the experiments, including the analysis of the output
* Note: increase the number of iterations used to compute π; otherwise the runtime is too short to show any difference
* Read the detailed analysis provided by [廖健富](https://hackpad.com/Hw1-Extcompute_pi-geXUeYjdv1I), study how the error rate is computed, and think about why we should pay attention to it
* Use gnuplot, introduced in the phonebook assignment, as an output mechanism besides `$ make gencsv`; after running `$ make plot`, gnuplot should produce performance comparison charts for the implementations above
* Feel free to apply techniques mentioned in the raytracing assignment, such as OpenMP, software pipelining, and loop unrolling, to speed up the program
* Read the earlier material on "Wall-clock time vs. CPU time" and think about how to measure time precisely
* Besides studying the code, prove the Leibniz formula for π and look for other ways to compute it
* Read [許元杰](https://embedded2015.hackpad.com/-Ext1-Week-1-5rm31U3BOmh)'s notes carefully
## Initial Test
After cloning the project, I ran through the workflow as described in the documentation
```
$ make check
...
time ./time_test_baseline
N = 400000000 , pi = 3.141593
4.20user 0.00system 0:04.20elapsed 99%CPU (0avgtext+0avgdata 1924maxresident)k
0inputs+0outputs (0major+87minor)pagefaults 0swaps
time ./time_test_openmp_2
N = 400000000 , pi = 3.141593
4.68user 0.00system 0:02.34elapsed 199%CPU (0avgtext+0avgdata 1792maxresident)k
0inputs+0outputs (0major+86minor)pagefaults 0swaps
time ./time_test_openmp_4
N = 400000000 , pi = 3.141593
9.32user 0.00system 0:02.35elapsed 395%CPU (0avgtext+0avgdata 1792maxresident)k
0inputs+0outputs (0major+90minor)pagefaults 0swaps
time ./time_test_avx
N = 400000000 , pi = 3.141593
1.21user 0.00system 0:01.21elapsed 99%CPU (0avgtext+0avgdata 1808maxresident)k
0inputs+0outputs (0major+85minor)pagefaults 0swaps
time ./time_test_avxunroll
N = 400000000 , pi = 3.141593
1.17user 0.00system 0:01.17elapsed 99%CPU (0avgtext+0avgdata 1780maxresident)k
0inputs+0outputs (0major+84minor)pagefaults 0swaps
```
* ++許元杰's notes++ cover many Makefile-related settings
The special character `@` can be used to suppress echoing of the commands being executed,
but `$ make check` still prints output like the following
```
1.17user 0.00system 0:01.17elapsed 99%CPU (0avgtext+0avgdata 1780maxresident)k
0inputs+0outputs (0major+84minor)pagefaults 0swaps
```
which is really hard to read, so from now on I time each executable individually
```
$ time ./time_test_baseline
N = 400000000 , pi = 3.141593
real 0m4.277s
user 0m4.260s
sys 0m0.012s
```
----
```
$ make gencsv
for i in `seq 100 5000 25000`; do \
printf "%d," $i;\
./benchmark_clock_gettime $i; \
done > result_clock_gettime.csv
$ cat result_clock_gettime.csv
100,0.000038,0.000106,0.000119,0.000022,0.000018
5100,0.001848,0.000825,0.001016,0.000636,0.000570
10100,0.002993,0.001534,0.002690,0.002067,0.000979
15100,0.004105,0.002295,0.002281,0.001882,0.001307
20100,0.005351,0.003022,0.003020,0.002469,0.001471
```
----
* Note
`$ make gencsv` and `$ make check` use different values of N, so their timings are not on the same scale
`benchmark_clock_gettime.c` contains the following
```clike=20
clock_gettime(CLOCK_ID, &start);
for (i = 0; i < loop; i++) {
compute_pi_baseline(N);
}
clock_gettime(CLOCK_ID, &end);
printf("%lf,", (double) (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)/ONE_SEC);
```
* This means the reported values are inflated: each one is the total time of 25 loop iterations rather than of a single call (a small sketch that reports the per-call average follows after this list)
* Note 2
The two also measure time differently: one uses `$ time`, the other `clock_gettime()`; their similarities and differences are summarized below
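A minimal tweak (a sketch only, assuming the same names `loop`, `N`, `ONE_SEC`, and `CLOCK_ID` as in the quoted fragment of `benchmark_clock_gettime.c`) that reports the average cost of a single call instead of the total over all `loop` iterations:
```clike=
clock_gettime(CLOCK_ID, &start);
for (i = 0; i < loop; i++)
    compute_pi_baseline(N);
clock_gettime(CLOCK_ID, &end);
/* total elapsed seconds for `loop` calls */
double elapsed = (double) (end.tv_sec - start.tv_sec) +
                 (end.tv_nsec - start.tv_nsec) / ONE_SEC;
/* divide by the iteration count to get the per-call average */
printf("%lf,", elapsed / loop);
```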
## Reading Notes
### Measuring program execution time
1. Wall-clock time
This is the time kept by the computer's internal clock, which corresponds to actual real-world time.
It is stored in the kernel variable xtime, which the system reads from the hardware real-time clock at boot. I was not sure about its exact data structure:
* [xtime and jiffies, the easily confused Linux clocks](http://www.coctec.com/docs/linux/show-post-186188.html) shows it as:
```clike=
struct timeval
{
    time_t tv_sec;        /* seconds */
    suseconds_t tv_usec;  /* microseconds */
};
```
* The [assignment handout](https://hackmd.io/s/HkiJpDPtx) shows it as:
```clike=
struct timespec xtime __attribute__ ((aligned (16)));
struct timespec {
__kernel_time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
```
* In short, the exact layout may vary across kernel versions, but it always splits into an ++integer seconds++ part and a ++fractional seconds++ part, with newer versions presumably offering finer resolution; the value is the number of seconds elapsed since 1970/01/01
* Two terms that are easy to confuse
1. xtime
unit: seconds and microseconds
2. jiffies
unit: ticks
The system decides how often xtime is updated, i.e., the timer interrupt frequency; updating 100 times per second means 100 Hz
jiffies counts how many timer interrupts have occurred since boot, so it is something entirely different from xtime
* To sum up, wall-clock time is, strictly speaking, not a good measurement basis
The reason is that it reflects the system time: it jumps when the user adjusts the clock manually, or when the machine adjusts it automatically for daylight saving time or a time-zone change
Our test programs are unlikely to be affected by any of that, so it is still usable, **just mind the units**
----
2. CPU time
This is the time a program actually spends executing on the CPU. CPU time alone is not enough for analysis, because a program may also spend time on I/O, delays, overhead, and other work
Timing tools usually report the following three values
1. real time: the same thing as wall-clock time
2. user time: total time spent in user mode
3. system time: total time spent in kernel mode
* all three are measured in **seconds**
* For instance, running a pthread program shortens the real time, but the user time does not improve
* For an I/O-bound program, the real time is longer than the user time, because the CPU does relatively little actual computation (a small sketch for reading user/system time from inside a program follows below)
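For completeness, a small standalone sketch (not part of the compute-pi code) showing one way to read user and system CPU time from inside a program with `getrusage(2)`; these are the same quantities that `time(1)` reports as user/sys:
```clike=
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    volatile double x = 0.0;
    for (long i = 1; i < 100000000; i++)   /* some CPU-bound busy work */
        x += 1.0 / i;

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);           /* fills ru_utime / ru_stime */
    printf("user: %ld.%06ld s\n", (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec);
    printf("sys : %ld.%06ld s\n", (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec);
    return 0;
}
```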
---
### Ways to compute π
1. Leibniz, which is what the baseline implementation uses:
$\arctan(x)=x-\frac{x^3}{3}+\frac{x^5}{5}-\dots$
Setting $x=1$: $\arctan(1)=\frac{\pi}{4}=1-\frac{1}{3}+\frac{1}{5}-\dots$
$\arctan(1)=\int_{0}^{1}\frac{1}{x^2+1}\,dx$
Its drawback is relatively slow convergence (a short derivation and a code sketch follow after this list)
2. Monte Carlo:
Uses the fact that the area of a circle is $\pi r^2$
With $r=1$,
pick n random points inside the square bounded by $x=0$, $y=0$, $x=1$, $y=1$; if m of them lie within distance $\le 1$ of the origin, then $\frac{m}{n}\approx\frac{\pi}{4}$, so $\pi\approx\frac{4m}{n}$ (the implementation appears in the Monte Carlo section below)
3. Polygon:
![image alt](https://upload.wikimedia.org/wikipedia/commons/c/c9/Archimedes_pi.svg)
4. Euler:
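Back to item 1: the Leibniz series follows from the geometric series $\frac{1}{1+x^2}=\sum_{k=0}^{\infty}(-1)^kx^{2k}$ (valid for $|x|<1$); integrating term by term over $[0,1]$ gives $\frac{\pi}{4}=\arctan(1)=\sum_{k=0}^{\infty}\frac{(-1)^k}{2k+1}=1-\frac{1}{3}+\frac{1}{5}-\dots$ (justifying the exchange of sum and integral at the endpoint $x=1$ takes Abel's theorem or an explicit bound on the alternating-series remainder).

A minimal sketch of the integral form, written from scratch here with an interface similar to `compute_pi_baseline()` (a midpoint Riemann sum of $\frac{4}{1+x^2}$ over $[0,1]$):
```clike=
double leibniz_pi(size_t N)
{
    double pi = 0.0;
    double dt = 1.0 / N;                /* width of each slice */
    for (size_t i = 0; i < N; i++) {
        double x = (i + 0.5) * dt;      /* midpoint of the i-th slice */
        pi += dt / (1.0 + x * x);       /* slice area under 1/(1+x^2) */
    }
    return pi * 4.0;                    /* pi = 4 * arctan(1) */
}
```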
## Experiments
Start from the original algorithm and find a correct way to measure time; try other algorithms afterwards
* How to use clock() and clock_gettime()
* How to use time() and gettimeofday()
* Why are the clock_gettime() results so unstable?
* Why are time() and gettimeofday() unsuitable for benchmarking?
* Time-related functions, taken from the [Linux Programmer's Manual](http://man7.org/linux/man-pages/man2/time.2.html)
| time(2) | gettimeofday(2) |
| - | - |
| time() returns the time as the number of seconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC). | The time returned by gettimeofday() is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the system time). If you need a monotonically increasing clock, see clock_gettime(2). |
| ctime(3) | ftime(3) |
| - | - |
| When interpreted as an absolute time value, it represents the number of seconds elapsed since the Epoch. | This function is obsolete. Don't use it. If the time in seconds suffices, time(2) can be used; gettimeofday(2) gives microseconds; clock_gettime(2) gives nanoseconds but is not as widely available. |
* All four of the above report wall-clock time; even the `gettimeofday(2)` man page says the returned time can jump (see the explanation of wall-clock time above)
----
| clock_gettime(2) | clock(3) |
| - | - |
| The functions clock_gettime() and clock_settime() retrieve and set the time of the specified clock clk_id. | The value returned is the CPU time used so far as a clock_t; to get the number of seconds used, divide by CLOCKS_PER_SEC. |
* The full definition of `clock_gettime`
```
int clock_gettime(clockid_t clk_id, struct timespec *tp);
struct timespec {
time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
```
* `clk_id` lets the caller pick which clock `clock_gettime()` should read. `benchmark_clock_gettime.c` uses `CLOCK_MONOTONIC_RAW`, which is affected neither by manual adjustments nor by NTP corrections
>For improved accuracy, since glibc 2.18, it is implemented on top of clock_gettime(2) (using the CLOCK_PROCESS_CPUTIME_ID clock).
* And `clock()` is effectively the same as `clock_gettime(CLOCK_PROCESS_CPUTIME_ID)`
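For convenience, a tiny helper (a sketch; compute-pi ships something similar, but this version is written here from scratch) that turns two `clock_gettime()` samples into elapsed seconds:
```clike=
#include <time.h>

/* elapsed time in seconds between two struct timespec samples */
static double timespec_diff(struct timespec start, struct timespec end)
{
    return (double) (end.tv_sec - start.tv_sec) +
           (double) (end.tv_nsec - start.tv_nsec) / 1e9;
}
```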
---
### clock() vs clock_gettime()
I wrote a small test program, `clock_test.c`. The OpenMP functions were chosen as the workload so that real/user/sys time can be told apart
#### clock()
```clike=
#include <stdio.h>
#include <time.h>
#include "computepi.h"
int main(int argc, char const *argv[])
{
__attribute__((unused)) int N = 400000000;
double pi = 0.0;
clock_t start, end;
start = clock();
pi = compute_pi_openmp(N, 2);
end = clock();
printf("N = %d , pi = %lf\n", N, pi);
printf("%lf\n", (double) (end - start) / CLOCKS_PER_SEC);
start = clock();
pi = compute_pi_openmp(N, 4);
end = clock();
printf("N = %d , pi = %lf\n", N, pi);
printf("%lf\n", (double) (end - start) / CLOCKS_PER_SEC);
return 0;
}
```
* Result
```
$ ./clock_test
N = 400000000 , pi = 3.141593
4.744200
N = 400000000 , pi = 3.141593
9.391323
```
```
$ time ./time_test_openmp_2
N = 400000000 , pi = 3.141593
real 0m2.407s
user 0m4.800s
sys 0m0.000s
$
$ time ./time_test_openmp_4
N = 400000000 , pi = 3.141593
real 0m2.379s
user 0m9.424s
sys 0m0.000s
```
----
#### clock_gettime()
```clike=
struct timespec start = {0, 0};
struct timespec end = {0, 0};
clock_gettime(CLOCK_MONOTONIC_RAW, &start);
pi = compute_pi_openmp(N, 2);
clock_gettime(CLOCK_MONOTONIC_RAW, &end);
printf("N = %d , pi = %lf\n", N, pi);
printf("%lf\n", (double) (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)/1000000000.0);
```
* The result was not what I expected; good thing I tried it first
I had assumed it would come out the same as with `clock()`
```
$ ./clock_test
N = 400000000 , pi = 3.141593
2.354214
N = 400000000 , pi = 3.141593
2.431054
```
* There is no single "best" function here, because each situation calls for measuring something different
For this experiment, though, `clock_gettime()` is the better fit, since what we care about now is the turnaround time from launch to output
* To compare the quality of algorithms, `clock()` (CPU time) would be the more appropriate choice
---
### Plotting
* With the program left essentially unmodified (apart from removing the loop inside `benchmark_clock_gettime`), the openmp 4-thread curve oscillates heavily
* This is effectively the `loop = 1` version
>Why does only this one oscillate so much?
![](https://i.imgur.com/eWaUwOU.png)
* `loop = 25`
![](https://i.imgur.com/Iin2tDr.png)
* `loop = 100`
![](https://i.imgur.com/LYuUJZF.png)
* No surprise: with more samples the average smooths out, so cranking `loop` way up would flatten the curves even further
* `loop = 100`, `for i in seq '5000 5000 1000000'`; plotting took about 6 minutes
----
* Since `clock_gettime()` (wall-clock time) is used, the results are affected by whatever else the machine is doing
See [0xff07's notes](https://hackmd.io/s/H1MV8APKg#)
* `loop = 25`, with no other programs running while plotting
![](https://i.imgur.com/Lz4Q0dU.png)
* `loop = 25`, while continuously opening new Chrome tabs
![](https://i.imgur.com/6NXgXB0.png)
* Peeking at [claaaaassic's notes](https://hackmd.io/s/SJqyt5Uqx#), I found `smooth cspline`, which produces nicer-looking plots
```
> plot "result_clock_gettime.csv" using 1:2 title 'baseline' with linespoints smooth cspline, \
```
![](https://i.imgur.com/OHAMSn0.png)
* A clear improvement, but still far from enough
----
* Would using CPU time avoid the influence of the machine's state?
* All other programs closed
![](https://i.imgur.com/OMp0xQe.png)
* Other programs running
![](https://i.imgur.com/CRm4waI.png)
>In theory the two should be unrelated; I could not figure this out at first
>As the instructor pointed out, it is most likely a cache effect [name=Sean1127]
----
* With no other programs running
```
$ perf stat --repeat 5 -e cache-misses ./time_test_baseline
Performance counter stats for './time_test_baseline' (5 runs):
21,145 cache-misses ( +- 6.69% )
4.236564981 seconds time elapsed ( +- 0.26% )
```
* While playing a YouTube video
```
$ perf stat --repeat 5 -e cache-misses ./time_test_baseline
Performance counter stats for './time_test_baseline' (5 runs):
1,013,390 cache-misses ( +- 14.96% )
6.099879805 seconds time elapsed ( +- 1.38% )
```
## Monte Carlo Algorithm
```clike=
double compute_pi_monte_carlo(size_t N)
{
double x, y, distance_squared;
int number_in_circle = 0;
srand(time(NULL));
for (size_t i = 0; i < N; ++i) {
x = (double) rand() / RAND_MAX;
y = (double) rand() / RAND_MAX;
distance_squared = x * x + y * y;
if (distance_squared <= 1) number_in_circle++;
}
return 4.0 * ((double) number_in_circle / N);
}
```
* Result
```
$ time ./time_test_monte_carlo
N = 400000000 , pi = 3.1415383900000000
real 0m11.436s
user 0m11.432s
sys 0m0.000s
```
----
### OpenMP version
```clike=
double compute_pi_monte_omp(size_t N, int threads)
{
int number_in_circle = 0;
double x, y;
#pragma omp parallel num_threads(threads)
{
srand(time(NULL) ^ omp_get_thread_num());
#pragma omp for private(x,y) reduction(+:number_in_circle)
for (size_t i = 0; i < N; ++i) {
x = (double) rand() / RAND_MAX;
y = (double) rand() / RAND_MAX;
if ((x * x + y * y) < 1) number_in_circle++;
}
}
return 4.0 * ((double) number_in_circle / N);
}
```
* Result
```
$ time ./time_test_monte_omp
N = 400000000 , pi = 3.1415087800000001
error = 0.0000838735897930
real 3m42.005s
user 3m6.772s
sys 4m17.140s
```
```
$ perf record ./time_test_monte_omp
$ perf report
Samples: 478K of event 'cycles:ppp', Event count (approx.): 748658909555278
Overhead Command Shared Object Symbol
14.70% time_test_monte libc-2.23.so [.] __random
10.33% time_test_monte [kernel.kallsyms] [k] futex_wait_setup
9.73% time_test_monte [kernel.kallsyms] [k] sys_futex
8.08% time_test_monte libc-2.23.so [.] __lll_lock_wait_private
6.99% time_test_monte [kernel.kallsyms] [k] hash_futex
6.77% time_test_monte libc-2.23.so [.] __random_r
...
```
![](https://i.imgur.com/cD2hykR.png)
![](https://i.imgur.com/vfxEWCg.png)
* You can see that, thanks to the random sampling, the result converges as expected
* The plots also show the oscillation of avx + loop unrolling, which confirms the divisibility problem mentioned in [shelly4132's notes](https://hackmd.io/s/HJrLU0dp) (a sketch of an epilogue-based fix follows the quote below)
>When N = 1000, i can reach at most 984, but 984/16 = 61.5 does not divide evenly, so some terms end up being skipped.
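One common fix for the divisibility problem is to let the unrolled loop handle only the part of the range that divides evenly and finish the remainder with a scalar epilogue. A sketch with a scalar unroll factor of 4 (the AVX version would use the same structure, with a remainder of N % 16):
```clike=
double compute_pi_unroll4(size_t N)
{
    double dt = 1.0 / N, pi = 0.0;
    size_t i;
    /* unrolled main loop: covers the first N - (N % 4) slices */
    for (i = 0; i + 4 <= N; i += 4) {
        double x0 = (i + 0.5) * dt, x1 = (i + 1.5) * dt;
        double x2 = (i + 2.5) * dt, x3 = (i + 3.5) * dt;
        pi += dt / (1.0 + x0 * x0) + dt / (1.0 + x1 * x1) +
              dt / (1.0 + x2 * x2) + dt / (1.0 + x3 * x3);
    }
    /* scalar epilogue: the N % 4 slices the unrolled loop would skip */
    for (; i < N; i++) {
        double x = (i + 0.5) * dt;
        pi += dt / (1.0 + x * x);
    }
    return pi * 4.0;
}
```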
### CUDA GPU acceleration
* Following [CheHsuan's notes](https://hackmd.io/s/BkbydgVa) to accelerate with the GPU; my laptop uses Intel integrated graphics
```
$ lspci -vnn | grep VGA -A 12
00:02.0 VGA compatible controller [0300]: Intel Corporation Haswell-ULT Integrated Graphics Controller [8086:0a16] (rev 0b) (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. Haswell-ULT Integrated Graphics Controller [1043:16cd]
Flags: bus master, fast devsel, latency 0, IRQ 48
Memory at f7400000 (64-bit, non-prefetchable) [size=4M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
I/O ports at f000 [size=64]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915
```
* After cloning, install the toolkit first
```
$ sudo apt install nvidia-cuda-toolkit
```
* Test
```
$ time ./time_test_montecarlo_cpu
N = 400000000 , pi = 3.141609
real 0m10.584s
user 0m10.580s
sys 0m0.000s
$ time ./time_test_montecarlo_gpu
N = 400000000 , pi = 0.000000
real 0m0.428s
user 0m0.012s
sys 0m0.000s
```
>I don't know yet why it fails to compute $\pi$ correctly; most likely it is simply because this machine has no NVIDIA GPU (the lspci output above shows only the Intel integrated controller), so the CUDA kernel never actually runs. I need to study the code further
## References
* http://stackoverflow.com/questions/7335920/what-specifically-are-wall-clock-time-user-cpu-time-and-system-cpu-time-in-uni
* http://man7.org/linux/man-pages/index.html
* http://gnuplot.sourceforge.net/docs_4.2/node236.html
* http://www.binarytides.com/linux-get-gpu-information/
###### tags: `sysprog`