2017q1 Homework1 (compute-pi)

contributed by < 0xff07 >

原始碼: 這裡
作業要求: 這裡

Reviewed by `yanglin5689446`

同意 susuf3 的第二點，此外最好不要將註解掉的程式碼 commit 上去，時間長了易造成視覺上的混亂。
比較有興趣的是每種作法在多少 thread 下的 performance 會最好，只是好像沒有提及？建議除了圖以外加上一些分析。

Reviewed by `sufuf3`

commit a3b4670 和 commit e64fbae的 Git commit message 的詳細說明句子過長，詳細說明的一行句子，應儘量在72個字元以內。(ref: https://git-scm.com/book/en/v2/Distributed-Git-Contributing-to-a-Project)

Short (50 chars or less) summary of changes

More detailed explanatory text, if necessary.  Wrap it to
about 72 characters or so.  In some contexts, the first
line is treated as the subject of an email and the rest of
the text as the body.  The blank line separating the
summary from the body is critical (unless you omit the body
entirely); tools like rebase can get confused if you run
the two together.

Further paragraphs come after blank lines.

  - Bullet points are okay, too

  - Typically a hyphen or asterisk is used for the bullet,
    preceded by a single space, with blank lines in
    between, but conventions vary here

每個實驗都應該有個commit, 使用commit message 來紀錄該實驗的實作會較佳，比較好對應這裡描述的順序和commit的描述。

開發環境

Ubuntu 16.04.2
Linux version 4.8.0-36-generic

gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K

做之前突然想到的問題

找方法量化收斂速度

測試的 code 好像都是用 N = 多少次來當作效能高低的標準。可是如果某種算法收斂很快，導致不用那麼大的 N 就可以提前算到誤差同樣低的

π

值，這種比較方法就不太適合。

待實際的程式測試

比如Chudnovsky演算法，

The Chudnovsky algorithm is a fast method for calculating the digits of π. It was published by the Chudnovsky brothers in 1989,[1] and was used in the world record calculations of 2.7 trillion digits of π in December 2009,[2] 5 trillion digits of π in August 2010,[3] 10 trillion digits of π in October 2011,[4][5] and 12.1 trillion digits in December 2013.[6]

雖然單項很肥大, 但是快在收斂速度。

也許測試應該設計成類似這樣：

#define eps 1e-8
const double pi_exact = 4 * atan(1);

...

clock_gettime(CLOCK_ID, &start);

double pi = 0;
for(i = 1; pi_exact - pi < eps; i++){
        pi += term(i);
}

clock_gettime(CLOCK_ID, &end);

但這樣的問題是 while 迴圈當中不斷的減法可能會影響到量測時間，所以還要再思考。

這種考量收斂速度的情況下，以「多少時間達到指定的 error rate」作為目標也許比以固定 step 數目來的有意義(有一種

ϵ - δ

證明的既視感？)。這就是 error rate 重要的部份了。

不過因為這裡是用各種方法去優化相同的數值積分, 演算法都一樣，所以哪個實作先跑完所有積分的小單位, 誰就比較快。這樣比較又回到相同基準點了。

背景知識

數值積分

白話地來說，積分就是再算曲面面積，數值積分的概念就是把要積分的東西切成小條，並逐一計算面積。

萊布尼茲公式

其實就是利用

t a n^{- 1} (\frac{π}{4}) = 1

來計算因為

t a n^{- 1} 又 等 於 ：

t a n^{- 1} (x) = \int \frac{1}{x^{2} + 1}

然後把

\frac{1}{x^{2} + 1}

用power series(或說高中學過的等比級數公式)展開並做多項式積分:

\int \frac{1}{x^{2} + 1} = \int 1 - x^{2} + x^{4} . . . = x - \frac{x^{3}}{3} + \frac{x^{5}}{5} . . .

把上下限帶入1與0即可得到。

工具

測量時間

參閱 Advenced Programming in Unix Environments 這本書

上網搜尋 APUE 可以找到很多關於這本書的資料。

Calender Time

這個時間是自1970年1/1 00:00:00 開始計時的秒數：

The basic time service provided by the UNIX kernel counts the number of seconds that have passed since the Epoch: 00:00:00 January 1, 1970, Coordinated Universal Time (UTC). In Section 1.10, we said that these seconds are represented in a time_t data type, and we call them calendar times.

可以用 time.h 當中的

time_t time(time_t *calptr);

來找出相應的時間。查閱 man page 發現有一些注意事項：

On Linux, a call to time() with tloc specified as NULL cannot fail with
the error EOVERFLOW, even on ABIs where time_t is a signed 32-bit inte‐
ger  and  the clock ticks past the time 2**31 (2038-01-19 03:14:08 UTC,
ignoring leap seconds).

也就是使用 time() 時, 就算秒數溢位 , 也不會被視為出錯。 POSIX. 1 有允許 time() 溢位時發出 EOVERFLOW 錯誤, 但是並沒有強制，所以還是有溢位的可能。

另外一個跟 time() 很像的函數是 gettimeofday

#include <sys/time.h>
int gettimeofday(struct timeval *restrict tp, void *restrict tzp);
Returns: 0 always

不一樣的是它可以回傳微秒。感覺起來像是比較精準的 time。

Version 4 of the Single UNIX Specification specifies that the gettimeofday
function is now obsolescent. However, a lot of programs still use it, because it provides
greater resolution (up to a microsecond) than the time function.

在 POSIX 1 標準中，又多了幾種時鐘：

The real-time extensions to POSIX.1 added support for multiple system clocks. In Version 4 of the Single UNIX Specification, the interfaces used to control these clocks were moved from an option group to the base.

這些時鐘可以用

#include <sys/time.h>
int clock_gettime(clockid_t clock_id, struct timespec *tsp);

來得到。其中不同的時鐘用不同的 clock_id 分別。可以查閱 clock_gettime() 的 man page 來知道，這裡節錄幾個：

CLOCK_REALTIME : 量測的是 wall-clock, 可以設定, 但是需要對應的權限。如果計時時改變手動時間, 那計時的數值會受到影響。
CLOCK_MONOTONIC : 從某一個開始點的時間。不受使用者改變時間影響, 但是會受到 adjtime() 跟 NTP 影響。
CLOCK_MONOTONIC_RAW ：大致跟 CLOCK_MONOTONIC 相同, 但是不受 adjtime() 跟 NTP 影響
CLOCK_PROCESS_CPUTIME_ID : 每個 process 自己的時鐘
CLOCK_THREAD_CPUTIME_ID : 每個 thread 自己的時鐘

呼叫 clock_gettime() 時會把時間存在 struct timespec 這個結構體裡：

struct timespec {
    time_t   tv_sec;        /* seconds */
    long     tv_nsec;       /* nanoseconds */
};

AVX

把多個相同的資料型態的物件(比如說整數，浮點數, 倍準浮點數等等)並排放好, 然後一口氣一起做運算。聽起來就像是向量運算(比如說「逐元相乘」跟「內積」實際上做的事根本一樣), 也難怪名字當中有一個 vectorized。

openMP

官方有教學影片

這裡簡單附上摘要：

Concurrent vs. Parallel

播放清單中的這個

Concurrent : 中文比較常看到的翻譯是稱作「並行」。工作可以拆分成「獨立執行」的部份, 這樣「可以」讓很多事情一起做, 但是「不一定」要真的同時做。比如說這個例子

就是一個具有並行性, 但不去同時執行的例子。

並行性是一種「架構程式」的概念。在寫下一段程式之前, 思考問題架構時就決定好的。

Parallel : 把規劃好，能夠並行的程式, 分配給不同執行緒, 並讓他們同時執行。

「平行」是一種選擇。假設有一天我在程式時，發現需要做內積, 我可以用迴圈遍歷3個元素做(沒有平行化)，也可以用3個執行緒同時執行每個相應位置元素的乘法(有平行化)。在這裡要不要做平行化只是一種「選擇」。

(註：這只是個例子, 為內積故意創造執行緒, 有可能光是花在建立新執行緒上的時間與資源就比內積本身還要多)

Hello World

先來個最簡單的 hello, world

#include<stdio.h>
#include<omp.h>

int main()
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("hello, World Thread %d\n", id);
    }
}

編譯程式時要加上 -fopenmp 選項。編譯完執行之後得到以下結果：

hello, World Thread 0
hello, World Thread 1
hello, World Thread 3
hello, World Thread 2

（實際上順序是不一定的）

這個程式當中 :

#pragma omp parallel 下面以大括號括起來的東西，是平行化的區塊。程式執行到這裡的時候, 會跟作業系統要一堆thread, 每個 thread 都去執行這個區塊東西。至於有多少 thread , 因為沒有指定, 所以就是一個預設數目。
在平行區塊裡面, 可以用 omp_get_thread_num() 來知道不同 thread 的編號, 這個編號會從0, 1, 2…一直編下去。

基礎

SMP VS. NUMA

影片這裡想要講的是兩種 shared memory computer 的模型。

SMP : Symetric Multiprocessor, 每顆 CPU 都是平等的, 不同位址的記憶體間存取也是均等的。
NUMA ： Non Uniform Address Space Multiprocessor。就是上者的相反, 有一些位址的記憶體可以存取比較快。

影片中有提到現今的處理器，很難有完全 SMP 的結構，因為有 Cache 等等的議題，讓 memory 的存取。

Memory Model

Fork-Join Model

是一種平行程式的流程，大致上有3階段 :

準備進入要平行化的區域時, 分岔(原文是用fork, 但是暫時想不到更好的講法)出多條執行緒：

影片裏面稱呼這西準備一起「合作」的執行緒叫「team of thread」。我覺得很貼切。
這些執行緒一起合力執行工作。
工作完成之後, 又回到分叉之前, 單一一個執行緒的狀態。

這個過程叫作join.

fork-join model就是不斷的重複上面的過程。概念上來說， fork 階段的時候，記憶體佈局是由這樣：

變成這樣：

每條執行緒都有自己的 stack , 但是跟本來的執行緒共用 text, heap 與 data 段。

註：這裡的 fork 不是指 fork() 函數。\

請問"記憶體佈局"的兩張徒步一樣的地方在哪邊？

指定執行緒

通常是用以下的形式：

    int A[10000];
    omp_set_num_threads(n);    //n是執行緒的數目
    #pragma omp parallel 
    {
        /* 大括號裏面是每一個執行緒要做的東西 */
        int ID = omp_get_thread_num(); //每個執行緒的 編號, 可以用omp_get_thread_num()來得知
        do_something(id, A)
    }

這裡想要做的事跟 pthread_create() ，概念是相同的。

這裡有一些注意事項：

#pragma 區塊外面宣告的變數，會在每個執行緒的可視範圍內
#pragma 區域裡面宣告的變數, 會在每個執行緒自己的stack裏面, 所以可視範圍只有該執行緒。
omp_get_thread_num() 編號會固定是 0, 1…n-1, 跟POSIX下的 tid 或 pid 行為不同。
作業系統實際給的執行緒數目，可能會比本來要求的少。真實分配到的執行緒數目可以在平行區塊裡面用 omp_get_num_threads() 來知道。這是一件很重要的事。

同步

影片中把同步的概念大致分成兩種

Barrier : 等大家都執行完, 再往下繼續做

跟wait()很像。 openMP再這裡對應的是：
```
#pragma omp barrier
```
Mutal Exclusion : 確保一次只有一個執行緒會執行某一段程式。這段程式叫作 critical section.
```
#pragma omp critical
{
    /* critical region */
}
```

Atomic Operation

用下面的關鍵字

#pragma omp atomic
/* atomic operation */

這裡的atomic operation只能是下面幾個：

x binop= a
x++, x–, ++x, –x

到這裡可以發現本來 pthread 要寫很多(e.g. 傳入函式指標、變數要包成結構體傳進去等等)的東西，openMP 都把它變得更精簡。

精簡化的語法

有一些功能很常用, 比如說平行化 for 迴圈, 如果每次都要寫一遍執行緒 id, 確認執行緒數目…會有很多 code 花在這些例行的準備上。所以針對這些常用的平行策略, openMP 提供了更精減的語法。

for 迴圈

如果要把 for 迴圈平行化, 要做下面的事：

    int A[10000];
    int real_thread;
    omp_set_num_threads(n);    //n是執行緒的數目
    #pragma omp parallel 
    {
        /* 大括號裏面是每一個執行緒要做的東西 */
        real_thread = omp_get_num_threads();
        int id = omp_get_thread_num();
        for(int i = if; i < STEP; i += real_thread)
            do_something(id, A)
    }

但是 openPM 提供了更簡單的語法：

    int A[10000];
    omp_set_num_threads(n);    //n是執行緒的數目
    #pragma omp parallel 
    {
        #pragma omp for
        {
            for(int i = if; i < STEP; i ++)
                do_something(id, A);
        }
    }

就可以自動把迴圏拆開。更簡單的寫法是：

    int A[10000];
    omp_set_num_threads(n);    //n是執行緒的數目
    #pragma omp parallel for
        for(int i = if; i < STEP; i ++)
            do_something(id, A);

Reduction

考慮下面這個程式：

    int sum = 0;
    int A[10000];
    omp_set_num_threads(n);    //n是執行緒的數目
    #pragma omp parallel for
        for(int i = if; i < STEP; i ++)
            sum += A[i];

思考一下之前的平行化策略, 可以發現再這裡並不管用，因為 sum 會一直用到。但是這又是寫迴圈很常出現的狀況。 openMP 針對這點提供了 resuction 語法。以這段程式為例, 可以透過下面這個語法來平行化：

    int sum = 0;
    int A[10000];
    omp_set_num_threads(n);    //n是執行緒的數目
    #pragma omp parallel for resuction(+:sum)
        for(int i = if; i < STEP; i ++)
            sum += A[i];

原始版本

在 Makefile 中補上 plot 的選項。基本形式跟 gencsv 一樣, 只是讓區間變大, 並搭配 gnuplot script :

plot: default
    for i in `seq 10000 5000 1000000`; do \
        printf "%d," $$i;\
        ./benchmark_clock_gettime $$i; \
    done > time_data.txt
    gnuplot scripts/plot.gp

因為這張圖是取 log-scale , 10000以下的數據在這個情況下幾乎是0, 所以從 10000 開始作為繪圖範圍。

得到以下的結果：

Bassline 最慢, 感覺很顯然, 因為完全沒有優化
openMP 次之, 但 4 個執行緒似乎沒有明顯的加快。為了觀察執行緒數目對效能的影響，做了後面的實驗。
AVX 最快。加上 unrolling 似乎沒有對整體效能提升太多

不同平行策略對效能的影響

Thread 數目

在練習 openMP 的時候好奇「如果 thread 數目增加，對優化程度造成什麼影響？」

首先是原始版本：

double computepi_orig(int step)
{
    double pi = 0;
    double dx = 1.0 / step;
    for(int i = 0; i < step; i++){
        double x = i * dx;
        pi += dx / (1 + x*x);
    }
    return pi*4;
}

以及一個天真的 openMP 優化版本

double computepi_naivemp(int step, int thread)
{
    double *pi = calloc(thread, sizeof(double));
    double dx = 1.0 / step;

    int real_thrd = 0;
    omp_set_num_threads(thread);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        real_thrd = omp_get_num_threads();
        for(int i = id; i < step; i += real_thrd){
            double x = i * dx;
            pi[id] += dx / (1 + x*x);
        }
    }

    double sum = 0;
    for(int i = 0; i < real_thrd; i++){
        sum += pi[i];
    }

    free(pi);
    return sum * 4;
}

在 step 相同（1000000）的情況下，對執行緒數目與花費時間做圖：

時間量測的實作方式同 benchmark_clock_gettime.c 。測試用的 shell script 大致如下 :

for i in $(seq 10 1024)
do
    echo -n "$i," >> data.txt
    ./a.out $i >> data.txt
    echo "" >> data.txt
done

可以發現

執行緒較少的時候, 可能反而花更久時間 : 應該是因為創造執行緒需要耗費額外資源的關係
隨著執行緒愈多, 平行處理大致開始展現效能, 但是仍有一些較大的波動
大的波動集中在執行緒數目較多的地方

突然想到一個問題：會不會是受到電腦開著其他程式的影響呢？所以做了一個粗略的實驗。首先開著 Firefox, 用 YouTube 播放影片，並執行以下結果：

接著關閉所有視窗，只留下終端機執行程式，得到以下的結果：

發現波動減少許多！這似乎暗示在做效能評估的時候，也要考慮這件事, 如果丟著程式就直接切去滑 fb, 可能就會產生誤差。不知道有沒有什麼測試方法可以避開這個效應？

但是換個角度想，一個程式執行時有其他程式也再同時執行，這樣的狀況不才是程式會在電腦中執行的真實情境嗎？所以我想兩者多少都有參考的價值。

影片中有另外提到 false sharing, 正在想要不要做實驗

改成使用 `critical`

接著換一個實作，　這次不用陣列紀錄每個執行緒的和，　而是使用 critical section :

double computepi_sync(int step, int thread)
{
    double pi = 0;
    double dx = 1.0 / step;

    int real_thrd = 0;
    omp_set_num_threads(thread);
    #pragma omp parallel
    {
        double sum = 0;
        int id = omp_get_thread_num();
        real_thrd = omp_get_num_threads();
        for(int i = id; i < step; i += real_thrd){
            double x = i * dx;
            sum += dx / (1 + x*x);
        }
        #pragma omp critical
        {
            pi += sum;
        }
    }
    return pi * 4;
}

結果如下：

但是考慮到前者 computepi_naivemp() 使用到 calloc(), 有可能帶來額外負擔, 所以把它改成直接用陣列寫死, 並且再執行一次, 得到 :

可以發現就算少掉 calloc(), 利用「陣列紀錄每個執行緒的和, 再加總」比「使用 critical 關鍵字控制變數pi」還要好。

這讓我覺得很意外, 因為一次只有一個執行緒能執行 critical section, 直覺上應該比每個執行緒各自擁有一份自己的一份複製, 不用等其他執行緒釋放資源還要慢才對？

`parallel for` + reduction 版本

接著嘗試 parallel for + reduction :

double computepi_simplemp(int step, int thread)
{
    double pi = 0;
    double dx = 1.0 / step;

    omp_set_num_threads(thread);
    #pragma omp parallel for reduction(+:pi)
    for(int i = 0; i < step; i++){
        double x = i * dx;
        pi += dx / (1 + x*x);
    }
    return pi*4;
}

發現這個版本是裡面寫法最簡單, 但速度幾乎是最快。這大概是 openMP 最方便的地方了吧。