Try   HackMD

2019q3 Homework4 (compute-pi)

contributed by < uccuser159 >

time

  • CPU time
    process 在 CPU 上運行所消耗的時間,有多個 thread 的話, CPU time 是各個 thread 所花時間的總和。
  • Wall-clock time
    是真實世界所經過的時間且是自 1970-01-01 起經歷的秒數,但有時不完全是單調遞增,因為可能會和 NTP 伺服器、時區、日光節約時間同步。

由於 $ time 會報告程式從開始到結束的經歷時間,另外還會計算程式使用處理器的時間量。所以輸入time ./time_test_baseline來看看:

N = 400000000 , pi = 3.141593

real    0m4.295s
user    0m4.291s
sys     0m0.004s

real time : 表示程式從執行開始到結束所花費的時間。
user time : 程式在 user mode 佔用所有的 CPU time 總和。
sys time : 程式在 kernel mode 佔用所有的 CPU time 總和。

  • benchmark_clock_gettime.c中用到 clock_gettime() 函式:
    int clock_gettime(clockid_t clk_id, struct timespec *tp);

The clk_id argument is the identifier of the particular clock on which to act.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

關於clk_id的選擇,會選擇 CLOCK_MONOTONIC_RAW 是因為不受 NTP 伺服器、時區、日光節約時間影響。

tp arguments is timespec struct

struct timespec { time_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ };

clock_gettime()的精確度到 nerosecond

Baseline for π

  • Proof

使用

tan1(x) 的積分來實現
π

因為

11+x2dx=tan1(x)

若計算函數

11+x2 在 x=0 到 x=1 底下的面積,利用黎曼積分,
dx
即為在 x 軸上的一小等分,與函數相乘,一小塊一小塊去做加總,得到的面積即為
tan1(1)=π4
,所以將
dx
分的越細,計算
π
會越精準。

程式碼:

double compute_pi_baseline(size_t N) { double pi = 0.0; double dt = 1.0 / N; // dt = (b-a)/N, b = 1, a = 0 for (size_t i = 0; i < N; i++) { double x = (double) i / N; // x = ti = a+(b-a)*i/N = i/N pi += dt / (1.0 + x * x); // integrate 1/(1+x^2), i = 0....N } return pi * 4.0; }

取樣從 N=1008 ~ N=1000000,間隔4000

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

baseline 效能最好的為 AVX+unroll

time ./time_test_baseline N = 400000000 , pi = 3.141593 4.28user 0.00system 0:04.28elapsed 100%CPU (0avgtext+0avgdata 2372maxresident)k 0inputs+0outputs (0major+90minor)pagefaults 0swaps time ./time_test_openmp_2 N = 400000000 , pi = 3.141593 4.37user 0.00system 0:02.18elapsed 200%CPU (0avgtext+0avgdata 2312maxresident)k 0inputs+0outputs (0major+89minor)pagefaults 0swaps time ./time_test_openmp_4 N = 400000000 , pi = 3.141593 4.57user 0.00system 0:01.14elapsed 399%CPU (0avgtext+0avgdata 2532maxresident)k 0inputs+0outputs (0major+98minor)pagefaults 0swaps time ./time_test_avx N = 400000000 , pi = 3.141593 1.64user 0.00system 0:01.64elapsed 99%CPU (0avgtext+0avgdata 2344maxresident)k 0inputs+0outputs (0major+89minor)pagefaults 0swaps time ./time_test_avxunroll N = 400000000 , pi = 3.141593 0.54user 0.00system 0:00.54elapsed 99%CPU (0avgtext+0avgdata 2384maxresident)k 0inputs+0outputs (0major+91minor)pagefaults 0swaps

Leibniz formula for π

  • Proof

因為

d[arctan(x)]dx=11+x2

又利用 power series來展開

11+x2=1x2+x4...

對左右兩式做 0~1 的定積分

01(11+x2)dx=01(1x2+x4...)dx

化簡後得到

tan1(1)tan1(0)=(x13x3+15x5...)|x=1(x13x3+15x5...)|x=0

=>

π4=113+1517...

=>

π=4(113+1517...)

程式碼:

double compute_pi_leibniz(size_t N) { double pi = 0.0; for (size_t i = 0; i < N; i++) { double tmp = (i & 1) ? (-1) : 1; pi += tmp / (2 * i + 1); } return pi * 4.0; }

取樣從 N=1008 ~ N=1000000,間隔4000

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

使用 Leibniz 方法來計算

π 比 baseline 方法在 N = 400000000時,快了大約 4.28/0.79=5.4 倍。

time ./time_test_leibniz N = 400000000 , pi = 3.141593 0.79user 0.00system 0:00.79elapsed 100%CPU (0avgtext+0avgdata 2356maxresident)k 0inputs+0outputs (0major+89minor)pagefaults 0swaps time ./time_test_leibniz_openmp_2 N = 400000000 , pi = 3.141593 0.81user 0.00system 0:00.40elapsed 200%CPU (0avgtext+0avgdata 2460maxresident)k 0inputs+0outputs (0major+93minor)pagefaults 0swaps time ./time_test_leibniz_openmp_4 N = 400000000 , pi = 3.141593 0.85user 0.00system 0:00.21elapsed 400%CPU (0avgtext+0avgdata 2356maxresident)k 0inputs+0outputs (0major+95minor)pagefaults 0swaps time ./time_test_leibniz_avx N = 400000000 , pi = 3.141593 1.67user 0.00system 0:01.67elapsed 100%CPU (0avgtext+0avgdata 2492maxresident)k 0inputs+0outputs (0major+92minor)pagefaults 0swaps time ./time_test_leibniz_avxunroll N = 400000000 , pi = 3.141593 0.90user 0.00system 0:00.90elapsed 100%CPU (0avgtext+0avgdata 2504maxresident)k 0inputs+0outputs (0major+91minor)pagefaults 0swaps

Euler formula for π

  • Proof

可以將

f(x)=x2,x[π,π] 做週期為
T=2π
全幅展開做產生
g(x)

因為
g(x)
為偶函數,所以 Fouier cosine series 的形式為
g(x)=n=1a0+ancos(nx)

其中
a0=22π0π(x2)dx=π23

a1=2π0π(x2)cos(nx)dx=4n2(1)n

根據傅立葉級數的收斂(Dirichlet 收斂定理),

g(x)均勻收斂,故傅立葉級數的收斂值即為帶入函數值
g(π)=π2=π23+n=14n2(1)ncos(nπ)

cos(nπ)當 n 為正整數,其值為
(1)n
,計算得
n=11n2=π26

程式碼:

double compute_pi_euler(size_t N) { double pi = 0.0; for (size_t i = 1; i < N; i++) { pi += 1.0 / (i * i); } return sqrt(pi * 6); }

取樣從 N=1008 ~ N=1000000,間隔4000,計算執行時間

使用 Eular 方法來計算

π 與 Leibniz 方法來計算
π
的時間幾乎相同。

time ./time_test_euler N = 400000000 , pi = 3.141593 0.81user 0.00system 0:00.81elapsed 100%CPU (0avgtext+0avgdata 2424maxresident)k 0inputs+0outputs (0major+90minor)pagefaults 0swaps time ./time_test_euler_openmp_2 N = 400000000 , pi = 3.141593 0.81user 0.00system 0:00.41elapsed 199%CPU (0avgtext+0avgdata 2368maxresident)k 0inputs+0outputs (0major+90minor)pagefaults 0swaps time ./time_test_euler_openmp_4 N = 400000000 , pi = 3.141593 0.85user 0.00system 0:00.21elapsed 399%CPU (0avgtext+0avgdata 2528maxresident)k 0inputs+0outputs (0major+98minor)pagefaults 0swaps time ./time_test_euler_avx N = 400000000 , pi = 3.141593 1.73user 0.00system 0:01.73elapsed 100%CPU (0avgtext+0avgdata 2504maxresident)k 0inputs+0outputs (0major+91minor)pagefaults 0swaps time ./time_test_euler_avxunroll N = 400000000 , pi = 3.141593 0.91user 0.00system 0:00.91elapsed 99%CPU (0avgtext+0avgdata 2388maxresident)k 0inputs+0outputs (0major+91minor)pagefaults 0swaps

參考資料