2017q1 Homework1 (compute-pi)
===
contributed by <`Sean1127`>
### Reviewed by `chanyangch`
* In the OpenMP part, make sure the functions used are reentrant: `rand` should be replaced with `rand_r`, see this [link](http://man7.org/linux/man-pages/man3/rand.3.html)
* The timings and error rates of the different approaches deserve more discussion
## Development Environment
```
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Model name: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Stepping: 1
CPU MHz: 1097.607
CPU max MHz: 2700.0000
CPU min MHz: 800.0000
BogoMIPS: 4789.26
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
```
## Assignment Requirements
* Fork compute-pi on GitHub and try to reproduce the experiments, including the analysis of the output
* Note: increase the number of iterations used to compute π; otherwise the runtime is too short to show any difference
* Read the detailed analysis provided by [廖健富](https://hackpad.com/Hw1-Extcompute_pi-geXUeYjdv1I), study how the error rate is computed, and think about why we should pay attention to it
* Use gnuplot, introduced in the phonebook assignment, as an output mechanism besides `$ make gencsv`; after running `$ make plot`, gnuplot should produce performance comparison charts for the implementations above
* Feel free to apply techniques mentioned in the raytracing assignment, such as OpenMP, software pipelining, and loop unrolling, to speed up the program
* Read the earlier material on "Wall-clock time vs. CPU time" and think about how to measure time precisely
* Besides studying the code, prove the Leibniz formula for π and look for other ways to compute it
* Read [許元杰](https://embedded2015.hackpad.com/-Ext1-Week-1-5rm31U3BOmh)'s notes carefully
## Initial Test
After cloning the project, I ran through the workflow as described in the documentation
```
$ make check
...
time ./time_test_baseline
N = 400000000 , pi = 3.141593
4.20user 0.00system 0:04.20elapsed 99%CPU (0avgtext+0avgdata 1924maxresident)k
0inputs+0outputs (0major+87minor)pagefaults 0swaps
time ./time_test_openmp_2
N = 400000000 , pi = 3.141593
4.68user 0.00system 0:02.34elapsed 199%CPU (0avgtext+0avgdata 1792maxresident)k
0inputs+0outputs (0major+86minor)pagefaults 0swaps
time ./time_test_openmp_4
N = 400000000 , pi = 3.141593
9.32user 0.00system 0:02.35elapsed 395%CPU (0avgtext+0avgdata 1792maxresident)k
0inputs+0outputs (0major+90minor)pagefaults 0swaps
time ./time_test_avx
N = 400000000 , pi = 3.141593
1.21user 0.00system 0:01.21elapsed 99%CPU (0avgtext+0avgdata 1808maxresident)k
0inputs+0outputs (0major+85minor)pagefaults 0swaps
time ./time_test_avxunroll
N = 400000000 , pi = 3.141593
1.17user 0.00system 0:01.17elapsed 99%CPU (0avgtext+0avgdata 1780maxresident)k
0inputs+0outputs (0major+84minor)pagefaults 0swaps
```
* ++許元杰's notes++ cover many Makefile-related settings
The special character `@` can be used to suppress echoing of the commands being executed,
but `$ make check` still prints output like the following
```
1.17user 0.00system 0:01.17elapsed 99%CPU (0avgtext+0avgdata 1780maxresident)k
0inputs+0outputs (0major+84minor)pagefaults 0swaps
```
which is really hard to read, so from now on I time each executable individually
```
$ time ./time_test_baseline
N = 400000000 , pi = 3.141593
real 0m4.277s
user 0m4.260s
sys 0m0.012s
```
----
```
$ make gencsv
for i in `seq 100 5000 25000`; do \
printf "%d," $i;\
./benchmark_clock_gettime $i; \
done > result_clock_gettime.csv
$ cat result_clock_gettime.csv
100,0.000038,0.000106,0.000119,0.000022,0.000018
5100,0.001848,0.000825,0.001016,0.000636,0.000570
10100,0.002993,0.001534,0.002690,0.002067,0.000979
15100,0.004105,0.002295,0.002281,0.001882,0.001307
20100,0.005351,0.003022,0.003020,0.002469,0.001471
```
----
* Note
`$ make gencsv` and `$ make check` use different values of N, so their timings are not on the same scale
`benchmark_clock_gettime.c` contains the following
```clike=20
clock_gettime(CLOCK_ID, &start);
for (i = 0; i < loop; i++) {
compute_pi_baseline(N);
}
clock_gettime(CLOCK_ID, &end);
printf("%lf,", (double) (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)/ONE_SEC);
```
* This means the reported values are inflated: each one is the total time of 25 loop iterations rather than of a single call (a small sketch that reports the per-call average follows after this list)
* Note 2
The two also measure time differently: one uses `$ time`, the other `clock_gettime()`; their similarities and differences are summarized below
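A minimal tweak (a sketch only, assuming the same names `loop`, `N`, `ONE_SEC`, and `CLOCK_ID` as in the quoted fragment of `benchmark_clock_gettime.c`) that reports the average cost of a single call instead of the total over all `loop` iterations:
```clike=
clock_gettime(CLOCK_ID, &start);
for (i = 0; i < loop; i++)
    compute_pi_baseline(N);
clock_gettime(CLOCK_ID, &end);
/* total elapsed seconds for `loop` calls */
double elapsed = (double) (end.tv_sec - start.tv_sec) +
                 (end.tv_nsec - start.tv_nsec) / ONE_SEC;
/* divide by the iteration count to get the per-call average */
printf("%lf,", elapsed / loop);
```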
## Reading Notes
### Measuring program execution time
1. Wall-clock time
This is the time kept by the computer's internal clock, which corresponds to actual real-world time.
It is stored in the kernel variable xtime, which the system reads from the hardware real-time clock at boot. I was not sure about its exact data structure:
* [xtime and jiffies, the easily confused Linux clocks](http://www.coctec.com/docs/linux/show-post-186188.html) shows it as:
```clike=
struct timeval
{
    time_t tv_sec;        /* seconds */
    suseconds_t tv_usec;  /* microseconds */
};
```
* The [assignment handout](https://hackmd.io/s/HkiJpDPtx) shows it as:
```clike=
struct timespec xtime __attribute__ ((aligned (16)));
struct timespec {
__kernel_time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
```
* In short, the exact layout may vary across kernel versions, but it always splits into an ++integer seconds++ part and a ++fractional seconds++ part, with newer versions presumably offering finer resolution; the value is the number of seconds elapsed since 1970/01/01
* Two terms that are easy to confuse
1. xtime
unit: seconds and microseconds
2. jiffies
unit: ticks
The system decides how often xtime is updated, i.e., the timer interrupt frequency; updating 100 times per second means 100 Hz
jiffies counts how many timer interrupts have occurred since boot, so it is something entirely different from xtime
* To sum up, wall-clock time is, strictly speaking, not a good measurement basis
The reason is that it reflects the system time: it jumps when the user adjusts the clock manually, or when the machine adjusts it automatically for daylight saving time or a time-zone change
Our test programs are unlikely to be affected by any of that, so it is still usable, **just mind the units**
----
2. CPU time
This is the time a program actually spends executing on the CPU. CPU time alone is not enough for analysis, because a program may also spend time on I/O, delays, overhead, and other work
Timing tools usually report the following three values
1. real time: the same thing as wall-clock time
2. user time: total time spent in user mode
3. system time: total time spent in kernel mode
* all three are measured in **seconds**
* For instance, running a pthread program shortens the real time, but the user time does not improve
* For an I/O-bound program, the real time is longer than the user time, because the CPU does relatively little actual computation (a small sketch for reading user/system time from inside a program follows below)
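For completeness, a small standalone sketch (not part of the compute-pi code) showing one way to read user and system CPU time from inside a program with `getrusage(2)`; these are the same quantities that `time(1)` reports as user/sys:
```clike=
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    volatile double x = 0.0;
    for (long i = 1; i < 100000000; i++)   /* some CPU-bound busy work */
        x += 1.0 / i;

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);           /* fills ru_utime / ru_stime */
    printf("user: %ld.%06ld s\n", (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec);
    printf("sys : %ld.%06ld s\n", (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec);
    return 0;
}
```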
---
### Ways to compute π
1. Leibniz, which is what the baseline implementation uses:
$\arctan(x)=x-\frac{x^3}{3}+\frac{x^5}{5}-\dots$
Setting $x=1$: $\arctan(1)=\frac{\pi}{4}=1-\frac{1}{3}+\frac{1}{5}-\dots$
$\arctan(1)=\int_{0}^{1}\frac{1}{x^2+1}\,dx$
Its drawback is relatively slow convergence (a short derivation and a code sketch follow after this list)
2. Monte Carlo:
Uses the fact that the area of a circle is $\pi r^2$
With $r=1$,
pick n random points inside the square bounded by $x=0$, $y=0$, $x=1$, $y=1$; if m of them lie within distance $\le 1$ of the origin, then $\frac{m}{n}\approx\frac{\pi}{4}$, so $\pi\approx\frac{4m}{n}$ (the implementation appears in the Monte Carlo section below)
3. Polygon:
![image alt](https://upload.wikimedia.org/wikipedia/commons/c/c9/Archimedes_pi.svg)
4. Euler:
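Back to item 1: the Leibniz series follows from the geometric series $\frac{1}{1+x^2}=\sum_{k=0}^{\infty}(-1)^kx^{2k}$ (valid for $|x|<1$); integrating term by term over $[0,1]$ gives $\frac{\pi}{4}=\arctan(1)=\sum_{k=0}^{\infty}\frac{(-1)^k}{2k+1}=1-\frac{1}{3}+\frac{1}{5}-\dots$ (justifying the exchange of sum and integral at the endpoint $x=1$ takes Abel's theorem or an explicit bound on the alternating-series remainder).

A minimal sketch of the integral form, written from scratch here with an interface similar to `compute_pi_baseline()` (a midpoint Riemann sum of $\frac{4}{1+x^2}$ over $[0,1]$):
```clike=
double leibniz_pi(size_t N)
{
    double pi = 0.0;
    double dt = 1.0 / N;                /* width of each slice */
    for (size_t i = 0; i < N; i++) {
        double x = (i + 0.5) * dt;      /* midpoint of the i-th slice */
        pi += dt / (1.0 + x * x);       /* slice area under 1/(1+x^2) */
    }
    return pi * 4.0;                    /* pi = 4 * arctan(1) */
}
```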
## Experiments
Start from the original algorithm and find a correct way to measure time; try other algorithms afterwards
* How to use clock() and clock_gettime()
* How to use time() and gettimeofday()
* Why are the clock_gettime() results so unstable?
* Why are time() and gettimeofday() unsuitable for benchmarking?
* Time-related functions, taken from the [Linux Programmer's Manual](http://man7.org/linux/man-pages/man2/time.2.html)
| time(2) | gettimeofday(2) |
| - | - |
| time() returns the time as the number of seconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC). | The time returned by gettimeofday() is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the system time). If you need a monotonically increasing clock, see clock_gettime(2). |
| ctime(3) | ftime(3) |
| - | - |
| When interpreted as an absolute time value, it represents the number of seconds elapsed since the Epoch. | This function is obsolete. Don't use it. If the time in seconds suffices, time(2) can be used; gettimeofday(2) gives microseconds; clock_gettime(2) gives nanoseconds but is not as widely available. |
* All four of the above report wall-clock time; even the `gettimeofday(2)` man page says the returned time can jump (see the explanation of wall-clock time above)
----
| clock_gettime(2) | clock(3) |
| - | - |
| The functions clock_gettime() and clock_settime() retrieve and set the time of the specified clock clk_id. | The value returned is the CPU time used so far as a clock_t; to get the number of seconds used, divide by CLOCKS_PER_SEC. |
* The full definition of `clock_gettime`
```
int clock_gettime(clockid_t clk_id, struct timespec *tp);
struct timespec {
time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
```
* `clk_id` lets the caller pick which clock `clock_gettime()` should read. `benchmark_clock_gettime.c` uses `CLOCK_MONOTONIC_RAW`, which is affected neither by manual adjustments nor by NTP corrections
>For improved accuracy, since glibc 2.18, it is implemented on top of clock_gettime(2) (using the CLOCK_PROCESS_CPUTIME_ID clock).
* And `clock()` is effectively the same as `clock_gettime(CLOCK_PROCESS_CPUTIME_ID)`
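For convenience, a tiny helper (a sketch; compute-pi ships something similar, but this version is written here from scratch) that turns two `clock_gettime()` samples into elapsed seconds:
```clike=
#include <time.h>

/* elapsed time in seconds between two struct timespec samples */
static double timespec_diff(struct timespec start, struct timespec end)
{
    return (double) (end.tv_sec - start.tv_sec) +
           (double) (end.tv_nsec - start.tv_nsec) / 1e9;
}
```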
---
### clock() vs clock_gettime()
I wrote a small test program, `clock_test.c`. The OpenMP functions were chosen as the workload so that real/user/sys time can be told apart
#### clock()
```clike=
#include <stdio.h>
#include <time.h>
#include "computepi.h"
int main(int argc, char const *argv[])
{
__attribute__((unused)) int N = 400000000;
double pi = 0.0;
clock_t start, end;
start = clock();
pi = compute_pi_openmp(N, 2);
end = clock();
printf("N = %d , pi = %lf\n", N, pi);
printf("%lf\n", (double) (end - start) / CLOCKS_PER_SEC);
start = clock();
pi = compute_pi_openmp(N, 4);
end = clock();
printf("N = %d , pi = %lf\n", N, pi);
printf("%lf\n", (double) (end - start) / CLOCKS_PER_SEC);
return 0;
}
```
* Result
```
$ ./clock_test
N = 400000000 , pi = 3.141593
4.744200
N = 400000000 , pi = 3.141593
9.391323
```
```
$ time ./time_test_openmp_2
N = 400000000 , pi = 3.141593
real 0m2.407s
user 0m4.800s
sys 0m0.000s
$
$ time ./time_test_openmp_4
N = 400000000 , pi = 3.141593
real 0m2.379s
user 0m9.424s
sys 0m0.000s
```
----
#### clock_gettime()
```clike=
struct timespec start = {0, 0};
struct timespec end = {0, 0};
clock_gettime(CLOCK_MONOTONIC_RAW, &start);
pi = compute_pi_openmp(N, 2);
clock_gettime(CLOCK_MONOTONIC_RAW, &end);
printf("N = %d , pi = %lf\n", N, pi);
printf("%lf\n", (double) (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)/1000000000.0);
```
* The result was not what I expected; good thing I tried it first
I had assumed it would come out the same as with `clock()`
```
$ ./clock_test
N = 400000000 , pi = 3.141593
2.354214
N = 400000000 , pi = 3.141593
2.431054
```
* There is no single "best" function here, because each situation calls for measuring something different
For this experiment, though, `clock_gettime()` is the better fit, since what we care about now is the turnaround time from launch to output
* To compare the quality of algorithms, `clock()` (CPU time) would be the more appropriate choice
---
### Plotting
* With the program left essentially unmodified (apart from removing the loop inside `benchmark_clock_gettime`), the openmp 4-thread curve oscillates heavily
* This is effectively the `loop = 1` version
>Why does only this one oscillate so much?
![](https://i.imgur.com/eWaUwOU.png)
* `loop = 25`
![](https://i.imgur.com/Iin2tDr.png)
* `loop = 100`
![](https://i.imgur.com/LYuUJZF.png)
* No surprise: with more samples the average smooths out, so cranking `loop` way up would flatten the curves even further
* `loop = 100`, `for i in seq '5000 5000 1000000'`; plotting took about 6 minutes
----
* Since `clock_gettime()` (wall-clock time) is used, the results are affected by whatever else the machine is doing
See [0xff07's notes](https://hackmd.io/s/H1MV8APKg#)
* `loop = 25`, with no other programs running while plotting
![](https://i.imgur.com/Lz4Q0dU.png)
* `loop = 25`, while continuously opening new Chrome tabs
![](https://i.imgur.com/6NXgXB0.png)
* Peeking at [claaaaassic's notes](https://hackmd.io/s/SJqyt5Uqx#), I found `smooth cspline`, which produces nicer-looking plots
```
> plot "result_clock_gettime.csv" using 1:2 title 'baseline' with linespoints smooth cspline, \
```
![](https://i.imgur.com/OHAMSn0.png)
* A clear improvement, but still far from enough
----
* Would using CPU time avoid the influence of the machine's state?
* All other programs closed
![](https://i.imgur.com/OMp0xQe.png)
* Other programs running
![](https://i.imgur.com/CRm4waI.png)
>In theory the two should be unrelated; I could not figure this out at first
>As the instructor pointed out, it is most likely a cache effect [name=Sean1127]
----
* With no other programs running
```
$ perf stat --repeat 5 -e cache-misses ./time_test_baseline
Performance counter stats for './time_test_baseline' (5 runs):
21,145 cache-misses ( +- 6.69% )
4.236564981 seconds time elapsed ( +- 0.26% )
```
* While playing a YouTube video
```
$ perf stat --repeat 5 -e cache-misses ./time_test_baseline
Performance counter stats for './time_test_baseline' (5 runs):
1,013,390 cache-misses ( +- 14.96% )
6.099879805 seconds time elapsed ( +- 1.38% )
```
## Monte Carlo Algorithm
```clike=
double compute_pi_monte_carlo(size_t N)
{
double x, y, distance_squared;
int number_in_circle = 0;
srand(time(NULL));
for (size_t i = 0; i < N; ++i) {
x = (double) rand() / RAND_MAX;
y = (double) rand() / RAND_MAX;
distance_squared = x * x + y * y;
if (distance_squared <= 1) number_in_circle++;
}
return 4.0 * ((double) number_in_circle / N);
}
```
* Result
```
$ time ./time_test_monte_carlo
N = 400000000 , pi = 3.1415383900000000
real 0m11.436s
user 0m11.432s
sys 0m0.000s
```
----
### OpenMP version
```clike=
double compute_pi_monte_omp(size_t N, int threads)
{
int number_in_circle = 0;
double x, y;
#pragma omp parallel num_threads(threads)
{
srand(time(NULL) ^ omp_get_thread_num());
#pragma omp for private(x,y) reduction(+:number_in_circle)
for (size_t i = 0; i < N; ++i) {
x = (double) rand() / RAND_MAX;
y = (double) rand() / RAND_MAX;
if ((x * x + y * y) < 1) number_in_circle++;
}
}
return 4.0 * ((double) number_in_circle / N);
}
```
* Result
```
$ time ./time_test_monte_omp
N = 400000000 , pi = 3.1415087800000001
error = 0.0000838735897930
real 3m42.005s
user 3m6.772s
sys 4m17.140s
```
```
$ perf record ./time_test_monte_omp
$ perf report
Samples: 478K of event 'cycles:ppp', Event count (approx.): 748658909555278
Overhead Command Shared Object Symbol
14.70% time_test_monte libc-2.23.so [.] __random
10.33% time_test_monte [kernel.kallsyms] [k] futex_wait_setup
9.73% time_test_monte [kernel.kallsyms] [k] sys_futex
8.08% time_test_monte libc-2.23.so [.] __lll_lock_wait_private
6.99% time_test_monte [kernel.kallsyms] [k] hash_futex
6.77% time_test_monte libc-2.23.so [.] __random_r
...
```
![](https://i.imgur.com/cD2hykR.png)
![](https://i.imgur.com/vfxEWCg.png)
* You can see that, thanks to the random sampling, the result converges as expected
* The plots also show the oscillation of avx + loop unrolling, which confirms the divisibility problem mentioned in [shelly4132's notes](https://hackmd.io/s/HJrLU0dp) (a sketch of an epilogue-based fix follows the quote below)
>When N = 1000, i can reach at most 984, but 984/16 = 61.5 does not divide evenly, so some terms end up being skipped.
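One common fix for the divisibility problem is to let the unrolled loop handle only the part of the range that divides evenly and finish the remainder with a scalar epilogue. A sketch with a scalar unroll factor of 4 (the AVX version would use the same structure, with a remainder of N % 16):
```clike=
double compute_pi_unroll4(size_t N)
{
    double dt = 1.0 / N, pi = 0.0;
    size_t i;
    /* unrolled main loop: covers the first N - (N % 4) slices */
    for (i = 0; i + 4 <= N; i += 4) {
        double x0 = (i + 0.5) * dt, x1 = (i + 1.5) * dt;
        double x2 = (i + 2.5) * dt, x3 = (i + 3.5) * dt;
        pi += dt / (1.0 + x0 * x0) + dt / (1.0 + x1 * x1) +
              dt / (1.0 + x2 * x2) + dt / (1.0 + x3 * x3);
    }
    /* scalar epilogue: the N % 4 slices the unrolled loop would skip */
    for (; i < N; i++) {
        double x = (i + 0.5) * dt;
        pi += dt / (1.0 + x * x);
    }
    return pi * 4.0;
}
```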
### CUDA GPU acceleration
* Following [CheHsuan's notes](https://hackmd.io/s/BkbydgVa) to accelerate with the GPU; my laptop uses Intel integrated graphics
```
$ lspci -vnn | grep VGA -A 12
00:02.0 VGA compatible controller [0300]: Intel Corporation Haswell-ULT Integrated Graphics Controller [8086:0a16] (rev 0b) (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. Haswell-ULT Integrated Graphics Controller [1043:16cd]
Flags: bus master, fast devsel, latency 0, IRQ 48
Memory at f7400000 (64-bit, non-prefetchable) [size=4M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
I/O ports at f000 [size=64]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915
```
* After cloning, install the toolkit first
```
$ sudo apt install nvidia-cuda-toolkit
```
* Test
```
$ time ./time_test_montecarlo_cpu
N = 400000000 , pi = 3.141609
real 0m10.584s
user 0m10.580s
sys 0m0.000s
$ time ./time_test_montecarlo_gpu
N = 400000000 , pi = 0.000000
real 0m0.428s
user 0m0.012s
sys 0m0.000s
```
>I don't know yet why it fails to compute $\pi$ correctly; most likely it is simply because this machine has no NVIDIA GPU (the lspci output above shows only the Intel integrated controller), so the CUDA kernel never actually runs. I need to study the code further
## References
* http://stackoverflow.com/questions/7335920/what-specifically-are-wall-clock-time-user-cpu-time-and-system-cpu-time-in-uni
* http://man7.org/linux/man-pages/index.html
* http://gnuplot.sourceforge.net/docs_4.2/node236.html
* http://www.binarytides.com/linux-get-gpu-information/
###### tags: `sysprog`