Try   HackMD

2017q1 Team9( Compute-pi )

contributed by<yanang><donbader>

tags: yanang donbader

github-yanang & github-donbader / youtube

作業要求

比照 B03: compute-pi 要求,特別留意測量時間的機制 (Wall-clock time vs. CPU time)

  • 在 GitHub 上 fork compute-pi,嘗試重現實驗,包含分析輸出
    • 注意: 要增加圓周率計算過程中迭代的數量,否則時間太短就看不出差異
  • 詳閱廖健富提供的詳盡分析,研究 error rate 的計算,以及為何思考我們該留意後者
  • phonebok 提到的 gnuplot 作為 $ make gencsv 以外的輸出機制,預期執行 $ make plot 後,可透過 gnuplot 產生前述多種實做的效能分析比較圖表
  • 可善用 raytracing 提到的 OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作
  • 詳閱前方 "Wall-clock time vs. CPU time",思考如何精準計算時間
  • 除了研究程式以外,請證明 Leibniz formula for π,並且找出其他計算圓周率的方法
  • 將你的觀察、分析,以及各式效能改善過程,並善用 gnuplot 製圖,紀錄於「作業區
  • clock(), clock_gettime() 的使用
  • time(), gettimeofday() 的使用
  • 為什麼 clock_gettime() 結果飄忽不定?
  • 為什麼 time() 和 gettimeofday() 不適合拿來作 benchmark ?

實驗結果

分析各種公式

  1. 當 error rate 收斂時 - run 的次數

    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →

  2. 當 error rate 收斂時 - run 的時間

    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →

  • 結論:
    除了用硬體/軟體加速之外,我們更可以使用不同公式來改善速度

各種加速方法(以公式一離曼積分為例):

1. Baseline (完全沒有加速)

double compute_pi_baseline(size_t N)
{
    double pi = 0.0;
    double dt = 1.0 / N;                // dt = (b-a)/N, b = 1, a = 0
    for (size_t i = 0; i < N; i++) {
        double x = (double) i / N;      // x = ti = a+(b-a)*i/N = i/N
        pi += dt / (1.0 + x * x);       // integrate 1/(1+x^2), i = 0....N
    }
    return pi * 4.0;
}

2. Openmp

  • multi-thread
double compute_pi_openmp(size_t N, int threads)
{
    double pi = 0.0;
    double dt = 1.0 / N;
    double x;

    #pragma omp parallel for num_threads(threads) private(x) reduction(+:pi)
    for (size_t i = 0; i < N; i++) {
        x = (double) i / N;
        pi += dt / (1.0 + x * x);
    }
    return pi * 4.0;
}

比較 thread 數量

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • multi-thread simd
double compute_pi_omp_simd(size_t N, int threads)
{
    double pi = 0.0;
    double dt = 1.0 / N;
    double x;

    #pragma omp parallel for simd num_threads(threads) private(x) reduction(+:pi)
    for (size_t i = 0; i < N; ++i) {
        x = (double) i / N;
        pi += dt / (1.0 + x * x);
    }
    return pi * 4.0;
}

比較 thread 數量

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • 為什麼thread增加,效能卻沒有提升?
    • 當thread數量大於CPU的核心數時,更多的thread只會造成產生thread的overhead
    • 越多的thread 代表要context switch的次數更多
    • 當每個thread要搶奪共享的資源的時候,會互相等待

3. AVX and AVX unrolling

double compute_pi_avx(size_t N)
{
    double pi = 0.0;
    double dt = 1.0 / N;
    register __m256d ymm0, ymm1, ymm2, ymm3, ymm4;
    ymm0 = _mm256_set1_pd(1.0);
    ymm1 = _mm256_set1_pd(dt);
    ymm2 = _mm256_set_pd(dt * 3, dt * 2, dt * 1, 0.0);
    ymm4 = _mm256_setzero_pd();             // sum of pi

    for (int i = 0; i <= N - 4; i += 4) {
        ymm3 = _mm256_set1_pd(i * dt);      // i*dt, i*dt, i*dt, i*dt
        ymm3 = _mm256_add_pd(ymm3, ymm2);   // x = i*dt+3*dt, i*dt+2*dt, i*dt+dt, i*dt+0.0
        ymm3 = _mm256_mul_pd(ymm3, ymm3);   // x^2 = (i*dt+3*dt)^2, (i*dt+2*dt)^2, ...
        ymm3 = _mm256_add_pd(ymm0, ymm3);   // 1+x^2 = 1+(i*dt+3*dt)^2, 1+(i*dt+2*dt)^2, ...
        ymm3 = _mm256_div_pd(ymm1, ymm3);   // dt/(1+x^2)
        ymm4 = _mm256_add_pd(ymm4, ymm3);   // pi += dt/(1+x^2)
    }
    double tmp[4] __attribute__((aligned(32)));
    _mm256_store_pd(tmp, ymm4);             // move packed float64 values to  256-bit aligned memory location
    pi += tmp[0] + tmp[1] + tmp[2] + tmp[3];
    return pi * 4.0;
}

4. OpenCL

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

參考 ierosodin 的共筆

比較各種加速方法

  • 繪圖方式1: gnuplot with lines
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →
  • 繪圖方式2: 用 95% 信賴區間除雜訊
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →
  • 繪圖方式3: 直接使用gnuplot smooth bezier
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →

time libarary

  • clock(void)

Linux man page :
The clock() function returns an approximation of processor time used by the program

  1. 在 32-bit 的系統中,CLOCKS_PER_SEC 約為 1000000,所以 clock() 每 72 分鐘將會回傳相似的值
  2. 在某些實作中, clock() 可以藉由 wait() 包括子程序的時間。但是,在 linux 系統當中 clock() 仍然不包括等待中的子程序時間。

linux man page :
The times() function, which explicitly returns (separate) information about the caller and its children, may be preferable

  • times(struct tms *buf)

linux man page :
times() returns the number of clock ticks that have elapsed since an arbitrary point in the past. The return value may overflow the possible range of type clock_t

  1. 回傳的時間
struct tms {
    clock_t tms_utime;  /* user time */
    clock_t tms_stime;  /* system time */
    clock_t tms_cutime; /* user time of children */
    clock_t tms_cstime; /* system time of children */
    }
  1. 當子程序當中的 wait() 或是 waitpid() 回傳 process id 時,將其時間加入 times()
  • clock_gettime(clockid_t clk_id, struct timespec *tp)

The functions clock_gettime() retrieve and set the time of the specified clock clk_id

  1. CLOCK_REALTIME:使用的是系統的掛鐘時間,當系統時間重置或改變,將影響結果
  2. CLOCK_MONOTONIC:使用的是相對時間,通過 jiffies 值來計算的。該時鐘不受系統時鐘源的影響,只受 jiffies 值的影響。
    man clock_gettime()
  • gettimeofday(struct timeval *tv, struct timezone *tz)

當 clock_gettime() clock_id 指定為 CLOCK_REALTIME 時,它與 gettimeofday 完全一樣
只不過它返回的是奈秒,而 gettimeofday 返回的是微秒

  • 為什麼 clock_gettime() 結果飄忽不定?
    function的執行時間單位以 nanoseconds 來決定,執行時間很短時並無法排除 Linux 背景處理其他事情,導致額外的 cache miss ,所以需要很多針對某一筆參數作累積再平均,才能避免飄忽不定

  • 繪圖方面可以使用以下方法去除雜訊:

    • 95% 信賴區間
    • smooth beizer(gnuplot)
  • 為什麼 time() 和 gettimeofday() 不適合拿來作 benchmark ?
    time() 和 gettimeofday() 未必是單調遞增的。
    NTP 會使得 wall-clock 加速或減速以對齊(drift)到正確的時間,當它在對齊時,抓到的時間並不能完全符合真實的時間流逝。

公式

黎曼積分

π4=arctan(1)π=40111+x2dx=4limNk=1N11+(k/N)21N4k=1N11+(k/N)21N

在 MathJax 語法中可使用 &= 對齊式子,不過 hackmd 似乎不支援 yanag

除了 &= 前後還須加上 begin{align} end{align} yanag

Lebiniz

π=4143+4547+49π=4n=0(1)n2n+1
pf :
tan(π4)=1;arctan(1)=π4;pi4=arctan(1)=0111+x2dx=01(k=0n(1)kx2k+(1)n+1x2n+21+x2)dx=(k=0n(1)k2k+1)+(1)n+1(01x2n+21+x2dx)as n,0<01x2n+21+x2dx<01x2n+2dx=12n+30;(Squeezetheorem):π4=k=0(1)k2k+1

Wallis

π2=212343456567

pf :
by using Euler's infinite product for sine

sin(x)x=n=1(1x2n2π2)let x=π22π=n=1(114n2)π2=n=1(4n24n21)=n=1(2n2n12n2n+1)=212343456567

Nilakantha

π=3+42×3×444×5×6+46×7×8π=3+n=14(1)n+12n×(2n+1)×(2n+2)

pf

Nilakantha 比 Leibniz 有更快的計算精準率,N=3 時 Leibniz(pi) = 2.8952 而 Nilakantha(pi) = 3.1452 wiki

Chudnovsky

1π=12k=0(1)k(6k)!(545140134k+13591409)(3k)!(k!)3(640320)3k+3/2π=C(k=0MkLkXk)1,whereC=42688010005Mk=(6k)!(3k)!×(k!)3LK=(13591409+545140134×k)Xk=(640320)3k=(262537412640768000)k

pf

function 每多計算一位可增加14為有效小數

尋找次數的方法

方法一 : N = N+1 每次做 compute_pi(N) 都判斷 error rate 是否符合要求

// judge function
int judge(int n) {
    double pi = compute_pi_formula(formula_id, n);
    double error = abs(pi - M_PI);
    double error_diff = error - max_error;
    return error_diff > 0 ? 1 : error_diff < 0 ? -1 : 0;
}

int n = 0;
while(judge(n) > 0 && ++n);

方法二 : N = Nx10 ,達到 error rate 後退回前一個區間後,用 binary search 尋找答案

int binary_search(int lower, int upper, int (*judge)(int))
{
    int mid, decision;
    while (lower < upper) {
        mid = (upper + lower) / 2;
        decision = (*judge)(mid);
        if (decision > 0)
            lower = mid + 1;
        else if (decision < 0)
            upper = mid - 1;
        else
            return mid;
    }

    return lower;
}

#define abs(num) (num) > 0 ? (num) : -(num)

int convergence(double max_error, int formula_id)
{
    if (formula_id > MAX_FORMULA_ID || formula_id < 0) {
        return -1;
    }

    // judge function
    int judge(int n) {
        double pi = compute_pi_formula(formula_id, n);
        double error = abs(pi - M_PI);
        double error_diff = error - max_error;
 
        return error_diff > 0 ? 1 : error_diff < 0 ? -1 : 0;
    }

    long long N = 1;
    while (judge(N) > 0)
        N *= 10;

    if (N == 1) return 1;
    return binary_search(N / 10 + 1, N, judge) + 1;
}

gnuplot

  • result_clock_gettime.csv 中以 "," 切割各比資料
set datafile sep ','
  • 以折線圖的方式繪製圖表
plot 'result_clock_gettime.csv' using 1:2 with linespoints \
                              linewidth 2 title 'baseline', \
  • 以曲線圖的方式繪製圖表
plot 'result_clock_gettime.csv' using 1:2 smooth bezier \
                              linewidth 2 title 'baseline', \    
  • 以長條圖的方式繪製圖表
plot 'result_clock_gettime.csv' \
        using 2:xtic(1) with histogram title 'baseline', \

參考資料

Pi algorithms
GPGPU + SIMD
First OpenCL Program
ierosodin 的共筆
Diana Ho 的共筆
why is clcock_gettime so erratic?
gettimeofday() should never be used to measure time
Why is threading increasing execution time ?
Richard's blog

紀錄

5:00 Chudnovsky 證明
error rate: gmp math (大數運算庫)
改善 error rate計算