2017q1 Team9( Compute-pi )

contributed by<yanang><donbader>

tags: `yanang` `donbader`

github-yanang & github-donbader / youtube

作業要求

比照 B03: compute-pi 要求，特別留意測量時間的機制 (Wall-clock time vs. CPU time)

實驗結果

分析各種公式

當 error rate 收斂時 –- run 的次數
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
當 error rate 收斂時 –- run 的時間
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

結論：
除了用硬體/軟體加速之外，我們更可以使用不同公式來改善速度

各種加速方法(以公式一離曼積分為例)：

1. Baseline (完全沒有加速)

double compute_pi_baseline(size_t N)
{
    double pi = 0.0;
    double dt = 1.0 / N;                // dt = (b-a)/N, b = 1, a = 0
    for (size_t i = 0; i < N; i++) {
        double x = (double) i / N;      // x = ti = a+(b-a)*i/N = i/N
        pi += dt / (1.0 + x * x);       // integrate 1/(1+x^2), i = 0....N
    }
    return pi * 4.0;
}

2. Openmp

multi-thread

double compute_pi_openmp(size_t N, int threads)
{
    double pi = 0.0;
    double dt = 1.0 / N;
    double x;

    #pragma omp parallel for num_threads(threads) private(x) reduction(+:pi)
    for (size_t i = 0; i < N; i++) {
        x = (double) i / N;
        pi += dt / (1.0 + x * x);
    }
    return pi * 4.0;
}

比較 thread 數量

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

multi-thread simd

double compute_pi_omp_simd(size_t N, int threads)
{
    double pi = 0.0;
    double dt = 1.0 / N;
    double x;

    #pragma omp parallel for simd num_threads(threads) private(x) reduction(+:pi)
    for (size_t i = 0; i < N; ++i) {
        x = (double) i / N;
        pi += dt / (1.0 + x * x);
    }
    return pi * 4.0;
}

比較 thread 數量

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

為什麼thread增加，效能卻沒有提升？
- 當thread數量大於CPU的核心數時，更多的thread只會造成產生thread的overhead
- 越多的thread 代表要context switch的次數更多
- 當每個thread要搶奪共享的資源的時候，會互相等待

3. AVX and AVX unrolling

double compute_pi_avx(size_t N)
{
    double pi = 0.0;
    double dt = 1.0 / N;
    register __m256d ymm0, ymm1, ymm2, ymm3, ymm4;
    ymm0 = _mm256_set1_pd(1.0);
    ymm1 = _mm256_set1_pd(dt);
    ymm2 = _mm256_set_pd(dt * 3, dt * 2, dt * 1, 0.0);
    ymm4 = _mm256_setzero_pd();             // sum of pi

    for (int i = 0; i <= N - 4; i += 4) {
        ymm3 = _mm256_set1_pd(i * dt);      // i*dt, i*dt, i*dt, i*dt
        ymm3 = _mm256_add_pd(ymm3, ymm2);   // x = i*dt+3*dt, i*dt+2*dt, i*dt+dt, i*dt+0.0
        ymm3 = _mm256_mul_pd(ymm3, ymm3);   // x^2 = (i*dt+3*dt)^2, (i*dt+2*dt)^2, ...
        ymm3 = _mm256_add_pd(ymm0, ymm3);   // 1+x^2 = 1+(i*dt+3*dt)^2, 1+(i*dt+2*dt)^2, ...
        ymm3 = _mm256_div_pd(ymm1, ymm3);   // dt/(1+x^2)
        ymm4 = _mm256_add_pd(ymm4, ymm3);   // pi += dt/(1+x^2)
    }
    double tmp[4] __attribute__((aligned(32)));
    _mm256_store_pd(tmp, ymm4);             // move packed float64 values to  256-bit aligned memory location
    pi += tmp[0] + tmp[1] + tmp[2] + tmp[3];
    return pi * 4.0;
}

4. OpenCL

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

參考 ierosodin 的共筆

比較各種加速方法

繪圖方式1: gnuplot with lines
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
繪圖方式2: 用 95% 信賴區間除雜訊
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
繪圖方式3: 直接使用gnuplot smooth bezier
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

time libarary

clock(void)

Linux man page :
The clock() function returns an approximation of processor time used by the program

在 32-bit 的系統中，CLOCKS_PER_SEC 約為 1000000，所以 clock() 每 72 分鐘將會回傳相似的值
在某些實作中， clock() 可以藉由 wait() 包括子程序的時間。但是，在 linux 系統當中 clock() 仍然不包括等待中的子程序時間。

linux man page :
The times() function, which explicitly returns (separate) information about the caller and its children, may be preferable

times(struct tms *buf)

linux man page :
times() returns the number of clock ticks that have elapsed since an arbitrary point in the past. The return value may overflow the possible range of type clock_t

回傳的時間

struct tms {
    clock_t tms_utime;  /* user time */
    clock_t tms_stime;  /* system time */
    clock_t tms_cutime; /* user time of children */
    clock_t tms_cstime; /* system time of children */
    }

當子程序當中的 wait() 或是 waitpid() 回傳 process id 時，將其時間加入 times()

clock_gettime(clockid_t clk_id, struct timespec *tp)

The functions clock_gettime() retrieve and set the time of the specified clock clk_id

CLOCK_REALTIME：使用的是系統的掛鐘時間，當系統時間重置或改變，將影響結果
CLOCK_MONOTONIC：使用的是相對時間，通過 jiffies 值來計算的。該時鐘不受系統時鐘源的影響，只受 jiffies 值的影響。
man clock_gettime()

gettimeofday(struct timeval *tv, struct timezone *tz)

當 clock_gettime() clock_id 指定為 CLOCK_REALTIME 時，它與 gettimeofday 完全一樣
只不過它返回的是奈秒，而 gettimeofday 返回的是微秒

為什麼 clock_gettime() 結果飄忽不定？
function的執行時間單位以 nanoseconds 來決定，執行時間很短時並無法排除 Linux 背景處理其他事情，導致額外的 cache miss ，所以需要很多針對某一筆參數作累積再平均，才能避免飄忽不定
繪圖方面可以使用以下方法去除雜訊：
- 95% 信賴區間
- smooth beizer(gnuplot)
為什麼 time() 和 gettimeofday() 不適合拿來作 benchmark ?
time() 和 gettimeofday() 未必是單調遞增的。
NTP 會使得 wall-clock 加速或減速以對齊(drift)到正確的時間，當它在對齊時，抓到的時間並不能完全符合真實的時間流逝。

公式

黎曼積分

\begin{aligned} \frac{π}{4} & = a r c t a n (1) \\ π & = 4 \int_{0}^{1} \frac{1}{1 + x^{2}} d x \\ = 4 lim_{N \to \infty} \sum_{k = 1}^{N} \frac{1}{1 + (k / N)^{2}} \cdot \frac{1}{N} \\ \approx 4 \sum_{k = 1}^{N} \frac{1}{1 + (k / N)^{2}} \cdot \frac{1}{N} \end{aligned}

在 MathJax 語法中可使用 &= 對齊式子，不過 hackmd 似乎不支援 yanag

除了 &= 前後還須加上 begin{align} end{align} yanag

Lebiniz

\begin{aligned} π & = \frac{4}{1} - \frac{4}{3} + \frac{4}{5} - \frac{4}{7} + \frac{4}{9} - \dots \\ π & = 4 \sum_{n = 0}^{\infty} \frac{(- 1)^{n}}{2 n + 1} \end{aligned}

pf :

\begin{aligned} \tan (\frac{π}{4}) & = 1; \arctan (1) = \frac{π}{4}; \\ \frac{p i}{4} & = \arctan (1) \\ = \int_{0}^{1} \frac{1}{1 + x^{2}} d x \\ = \int_{0}^{1} (\sum_{k = 0}^{n} (- 1)^{k} x^{2 k} + \frac{(- 1)^{n + 1} \cdot x^{2 n + 2}}{1 + x^{2}}) d x \\ = (\sum_{k = 0}^{n} \frac{(- 1)^{k}}{2 k + 1}) + (- 1)^{n + 1} (\int_{0}^{1} \frac{x^{2 n + 2}}{1 + x^{2}} d x) \end{aligned} a s n \to \infty, 0 < \int_{0}^{1} \frac{x^{2 n + 2}}{1 + x^{2}} d x < \int_{0}^{1} x^{2 n + 2} d x = \frac{1}{2 n + 3} \to 0; 藉 由 夾 擠 定 理 (S q u e e z e t h e o r e m) : \frac{π}{4} = \int_{k = 0}^{\infty} \frac{(- 1)^{k}}{2 k + 1}

Wallis

\frac{π}{2} = \frac{2}{1} \cdot \frac{2}{3} \cdot \frac{4}{3} \cdot \frac{4}{5} \cdot \frac{6}{5} \cdot \frac{6}{7} \dots

pf :
by using Euler's infinite product for sine

\begin{aligned} \frac{\sin (x)}{x} & = \prod_{n = 1}^{\infty} (1 - \frac{x^{2}}{n^{2} π^{2}}) \\ l e t x & = \frac{π}{2} \\ \Rightarrow \frac{2}{π} & = \prod_{n = 1}^{\infty} (1 - \frac{1}{4 n^{2}}) \\ \Rightarrow \frac{π}{2} & = \prod_{n = 1}^{\infty} (\frac{4 n^{2}}{4 n^{2} - 1}) \\ = \prod_{n = 1}^{\infty} (\frac{2 n}{2 n - 1} \cdot \frac{2 n}{2 n + 1}) = \frac{2}{1} \cdot \frac{2}{3} \cdot \frac{4}{3} \cdot \frac{4}{5} \cdot \frac{6}{5} \cdot \frac{6}{7} \dots \end{aligned}

Nilakantha

\begin{aligned} π & = 3 + \frac{4}{2 \times 3 \times 4} - \frac{4}{4 \times 5 \times 6} + \frac{4}{6 \times 7 \times 8} - \dots \\ π & = 3 + \sum_{n = 1}^{\infty} \frac{4 \cdot (- 1)^{n + 1}}{2 n \times (2 n + 1) \times (2 n + 2)} \end{aligned}

Nilakantha 比 Leibniz 有更快的計算精準率，N=3 時 Leibniz(pi) = 2.8952… 而 Nilakantha(pi) = 3.1452… wiki

Chudnovsky

\begin{aligned} \frac{1}{π} & = 12 \sum_{k = 0}^{\infty} \frac{(- 1)^{k} (6 k)! (545140134 k + 13591409)}{(3 k)! (k!)^{3} (640320)^{3 k + 3 / 2}} \\ π & = C (\sum_{k = 0}^{\infty} \frac{M_{k} \cdot L_{k}}{X_{k}})^{- 1}, w h e r e \\ C & = 426880 \sqrt{10005} \\ M_{k} & = \frac{(6 k)!}{(3 k)! \times (k!)^{3}} \\ L_{K} & = (13591409 + 545140134 \times k) \\ X_{k} & = (- 640320)^{3 k} = (- 262537412640768000)^{k} \end{aligned}

據 function 每多計算一位可增加14為有效小數

尋找次數的方法

方法一 : N = N+1 每次做 compute_pi(N) 都判斷 error rate 是否符合要求

// judge function
int judge(int n) {
    double pi = compute_pi_formula(formula_id, n);
    double error = abs(pi - M_PI);
    double error_diff = error - max_error;
    return error_diff > 0 ? 1 : error_diff < 0 ? -1 : 0;
}

int n = 0;
while(judge(n) > 0 && ++n);

方法二 : N = Nx10 ，達到 error rate 後退回前一個區間後，用 binary search 尋找答案

int binary_search(int lower, int upper, int (*judge)(int))
{
    int mid, decision;
    while (lower < upper) {
        mid = (upper + lower) / 2;
        decision = (*judge)(mid);
        if (decision > 0)
            lower = mid + 1;
        else if (decision < 0)
            upper = mid - 1;
        else
            return mid;
    }

    return lower;
}

#define abs(num) (num) > 0 ? (num) : -(num)

int convergence(double max_error, int formula_id)
{
    if (formula_id > MAX_FORMULA_ID || formula_id < 0) {
        return -1;
    }

    // judge function
    int judge(int n) {
        double pi = compute_pi_formula(formula_id, n);
        double error = abs(pi - M_PI);
        double error_diff = error - max_error;
 
        return error_diff > 0 ? 1 : error_diff < 0 ? -1 : 0;
    }

    long long N = 1;
    while (judge(N) > 0)
        N *= 10;

    if (N == 1) return 1;
    return binary_search(N / 10 + 1, N, judge) + 1;
}

gnuplot

result_clock_gettime.csv 中以 "，" 切割各比資料

set datafile sep ','

以折線圖的方式繪製圖表

plot 'result_clock_gettime.csv' using 1:2 with linespoints \
                              linewidth 2 title 'baseline', \

以曲線圖的方式繪製圖表

plot 'result_clock_gettime.csv' using 1:2 smooth bezier \
                              linewidth 2 title 'baseline', \

以長條圖的方式繪製圖表

plot 'result_clock_gettime.csv' \
        using 2:xtic(1) with histogram title 'baseline', \

參考資料

Pi algorithms
GPGPU + SIMD
First OpenCL Program
ierosodin 的共筆
 Diana Ho 的共筆
 why is clcock_gettime so erratic?
gettimeofday() should never be used to measure time
Why is threading increasing execution time ?
Richard's blog

紀錄

5:00 Chudnovsky 證明
error rate: gmp math (大數運算庫)
改善 error rate計算

2017q1 Team9( Compute-pi )

tags: yanang donbader

作業要求

實驗結果

分析各種公式

各種加速方法(以公式一離曼積分為例)：

1. Baseline (完全沒有加速)

2. Openmp

3. AVX and AVX unrolling

4. OpenCL

比較各種加速方法

time libarary

公式

黎曼積分

Lebiniz

Wallis

Nilakantha

Chudnovsky

尋找次數的方法

gnuplot

參考資料

紀錄

tags: `yanang` `donbader`