2016q3 Homework2 (phonebook-concurrent)

contributed by <0140454>

資料閱讀

軟體開發現況

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software

過去
過去30年內，CPU Designer 從三大領域中提升執行效能，分別是「Clock Speed」、「Execution Optimization」以及「Cache」。
現在
但這幾年來，CPU Designer 為了榨乾 CPU 的效能，有可能利用激進的 Instruction Reordering，進而讓程式執行結果與預期的有所不同。

而在短期內，CPU 的效能增長來源將改為「Hyper-threading」、「Multi-core」以及「Cache」。
未來
因為在硬體設計上遇到了一些瓶頸，而 Cache 的部份越來越大，所以未來會朝向 Concurrency 的方向發展。

Concurrency 的優點是各執行緒之間的控制流程相互獨立，而缺點主要是在透顧 lock 來同步化時所需要的成本極高。

An Introduction to Lock-Free Programming

What is lock-free?
簡易的分辨流程：

lock-free 中的 lock 其實不是指 mutex 等等，而是指不會被鎖定。
Lock-free Programming Techniques
Atomic Operations
所謂的 atomic operation 即該操作在系統中是不可再繼續被分割的操作。

目前各平台都有提供相關的 API，如 Windows 下的 _InterlockedIncrement、iOS 下的 OSAtomicAdd32 以及 C++11 中的 std::atomic<int>::fetch_add。

但值得一提的是，C++11 的 atomic operation 並沒有保證他是 lock-free 的，所以必須透過 std::atomic<>::is_lock_free 來確認。
Compare-And-Swap
最被廣泛討論的 atomic operation 方法。過程就是「先從記憶體中讀取資料，若與預期的相同才將新的值寫入」。但必須要注意到 ABA Problem。
Sequential Consistency
在所有執行緒中，對記憶體的操作會依尋著特定的順序進行，而且也會與程式碼中的順序一樣。

Acquire and Release Semantics & Weak vs. Strong Memory Models

透過 Acquire and Release Semantics 來限制優化，讓記憶體相關操作必須在特定語句前後完成。

+----------------------------------------+
|            Acquire Semantics           |
+----------------------------------------+ <<= 所有記憶體操作要在此線之下執行
|                                        |
|                  ....                  |
|                                        |
+----------------------------------------+ <<= 所有記憶體操作要在此線之前執行
|            Release Semantics           |
+----------------------------------------+

Memory Barrier
透過一些指令防止 CPU 對一些記憶體操作進行重排。共分4種 Barrier。
- LoadLoad
- StoreStore
- LoadStore
- StoreLoad
Weak vs. Strong Memory Model
所謂的 Weak 就是指有很大的機率會將 load/store 指令進行重排，而 Strong 則相反。

Concurrency 相關

Concurrency vs. Parallelism

Concurrency 指的是兩個以上的工作 相互交錯 執行。

而 Parallelism 則是指兩個以上的工作同時執行。

Concurrency 系列文章

因為編譯器會最佳化程式碼的執行順序、處理器會亂序執行、硬體又可能有 Cache coherent 的問題，所以程式實際的執行過程和我們寫的不一定相同。

Sequenced-Before
在同 thread 下，對求值順序關係的描述。

以 A is sequenced-before B 為例，也就是 A 會先求值完後才開始進行 B 的求值。

所以當有 A is sequenced-before B 且 B is sequenced-before A 的情況發生時，表示我們無法確定他的求值順序，抑或是他們是同時求值。
Happens-Before
前一個操作的效果在後一個操作執行之前必須要可見。

happens-before 所關注的是在執行前可見，不同於 happening-before 是發生的順序。

而在同一個 thread 中的 happens-before 就是上面提到的 sequenced-before。
Synchronized-With
發生在兩個不同 thread 間的同步行為。當 A is synchronized-with B 時，代表 A 對記憶體操作的效果，對於 B 是可見的。而 A 和 B 是兩個不同的 thread 的某個操作。

換句話說，synchrozied-with 就是不同 thread 間的 happens-before。
Sequential Consistency
正如上面提到的，系統在執行時都維持程式原有的順序。在文章中舉 Dekker's algorithm 為例子，說明其重要性。

Thread Pool

Thread 雖然可以說是一個 lightweight process，但是在 create 跟 destroy 的時候還是需要一定的成本。

所以就有了 thread pool 這個概念，thread pool 讓我們在一開始先建立 worker thread，當有工作進來的時候先丟入 work queue，然後再指派給各 worker thread。

這中間免去了頻繁的 create 與 destroy，進而使得程式的效率以及工作的分配變得更好。

常見的實作方式
只擁有一個 work queue，必須仰賴 mutex 等方式來確保正常的執行。
Lock-free 的實作方式
利用 mutex 來實作的 memory pool 必定會受到 mutex 的成本所影響，因此便有利用 CAS 等等的 atomic operation 來實作的版本。

作業說明中提及的 LFTPool 便是讓每個 work thread 擁有自己的 queue，當有工作進來的時候，利用 round-robin 的方式分配下去。

案例分析

正確性驗證

將優化版本的 linked list 印出後與原本的 words.txt 做比較，可以發現少了4個單字。











$ colordiff -Naur dictionary/words.txt sorted.opt.words.txt 
--- dictionary/words.txt	2016-10-03 10:59:13.531775949 +0800
+++ sorted.opt.words.txt	2016-10-06 02:32:27.803229037 +0800
@@ -1,7 +1,3 @@
-aaaa
-aaaaa
-aaaaaa
-aaaaaaa
 aaaaaaaa
 aaaaaaaah
 aaaaaaauugh

原因在於合併 linked list 的時候，不小心把各個 linked list 的第一個 node 給跳過了。

修正方法就是將 main.c 中的



pHead = app[i]->pHead->pNext;
...
etmp->pNext = app[i]->pHead->pNext;

改成



pHead = app[i]->pHead;
...
etmp->pNext = app[i]->pHead;

閱讀程式碼

main.c

利用 mmap 將檔案內容 map 到 VMA 中


char *map = mmap(NULL, fs, PROT_READ, MAP_SHARED, fd, 0);

先分配好所有 entry 的記憶體
fs 是檔案大小，因為檔案已經對齊了
所以 fs / MAX_LAST_NAME_SIZE 就等同於行數


entry *entry_pool = (entry *) malloc(sizeof(entry) *
                                     fs / MA  X_LAST_NAME_SIZE);

設定 Concurrency Level
我們可以透過 pthread_setconcurrency() 來提示系統。預設而言，concurrency level 為 0，也就讓系統自行決定。

除此之外，man page 的 NOTES 也提到，當系統的 thread model 為 1:1 時，這個函數是沒有太大的意義。
```
pthread_setconcurrency(THREAD_NUM + 1);
```

在 pthreads 的 man page 中提到，對於 POSIX threads 的實作，Linux 不外乎分為 LinuxThreads 與 NPTL。可這兩種方式都是 1:1 的 model，所以在這邊是否有用呢？吳勃興

phonebook_opt.c

準備 thread argument
new_append_a() 是負責準備要傳入 append 的參數。
在命名上有點讓人看不出來，在下一個區塊 Code Refactoring 會修改一下。
```
append_a *new_append_a(char *ptr, char *eptr, int tid, int ntd,
                       entry *start)
```
建立 linked list
append() 會將資料放入一開始所分配好的 memory pool 中。

第一個 thread 負責的行數是：1、1 + THREAD_NUM、1 + THREAD_NUM * 2、…
第二個 thread 負責的行數是：2、2 + THREAD_NUM、2 + THREAD_NUM * 2、…

剩下的以此類推。

Code Refactoring

main.c

將 Line 50 的 include directive 移到檔案的開頭，方便閱讀。








#ifdef OPT

#include "file.c"
#include "debug.h"
#include <fcntl.h>
#define ALIGN_FILE "align.txt"

#endif

將 Line 104 中的 i == 0 狀況移到 for-loop 外，減少 branch 的時間。






pHead = app[0]->pHead;
e = app[0]->pLast;
for (int i = 1; i < THREAD_NUM; i++) {
    e->pNext = app[i]->pHead;
    e = app[i]->pLast;
}

phonebook_opt.[ch]

傳遞給 append() 用的結構如下，有些命名並不是那麼直觀，而且風格不太統一。









typedef struct _append_a {
    char *ptr;
    char *eptr;
    int tid;
    int nthread;
    entry *entryStart;
    entry *pHead;
    entry *pLast;
} append_a;

修改後如下








typedef struct __APPEND_ARGUMENT {
    char *pDataStart;
    char *pDataEnd;
    int tid;
    int nThread;
    entry *pListHead;
    entry *pListTail;
} append_argument;

在 append() 中的 for-loop 有一點難閱讀，改成 while-loop 會比較好一點。














int count = 0;
entry *new_node = app->pListHead;
char *cur_last_name = app->pDataStart;

while (cur_last_name < app->pDataEnd) {
    app->pListTail = app->pListTail->pNext = new_node;

    app->pListTail->lastName = cur_last_name;
    app->pListTail->pNext = NULL;

    cur_last_name += MAX_LAST_NAME_SIZE * app->nThread;
    new_node += app->nThread;
    ++count;
}

在 main.c 跟 phonebook_opt.c 中都有 diff_in_second()，因此把他拉出來，放在 utils.c 中。








#ifndef _UTILS_H
#define _UTILS_H

#include <time.h>

double diff_in_second(struct timespec t1, struct timespec t2);

#endif

在編譯的時候可以看到一個 warning message




phonebook_opt.c: In function ‘append’:
phonebook_opt.c:46:12: warning: variable ‘cpu_time’ set but not used [-Wunused-but-set-variable]
     double cpu_time;
            ^

原先 cpu_time 是為了要讓 dprintf 來輸出，但是在沒有 define DEBUG 的情況下，會使得他成為未始用的變數。

為了維持原本的設計，所以我利用 #ifdef DEBUG 和 __attribute__((unused)) 來避免這個訊息。





#ifdef DEBUG
    double cpu_time;
#else
    double cpu_time __attribute__((unused));
#endif

Thread Pool

在原本的程式中就如同下圖一般，會等到所有 thread 都 join 完後才會繼續下去。

可是這就有個問題，有些 thread 做的很快，有些 thread 做的很慢，這樣的話先完成的要在那邊呆呆的等。因此，在這邊會換成 thread pool 來實驗看看。

原始版本

Thread Pool (mutex)

上圖更換至 threadpool-mbrossard 的結果。在 append() 的部份快了 0.00003 秒附近。

2016q3 Homework2 (phonebook-concurrent)

資料閱讀

軟體開發現況

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software

An Introduction to Lock-Free Programming

Acquire and Release Semantics & Weak vs. Strong Memory Models

Concurrency 相關

Concurrency vs. Parallelism

Concurrency 系列文章

Thread Pool

案例分析

正確性驗證

閱讀程式碼

main.c

phonebook_opt.c

Code Refactoring

main.c

phonebook_opt.[ch]

Thread Pool

原始版本

Thread Pool (mutex)

參考資料