2016q3 Homework2 (phonebook-concurrent)

# 2016q3 Homework2 (phonebook-concurrent) contributed by <`mingnus`> ###### tags: `ncku` `sysprog` `perf` `linux` `phonebook` ## Trace code ### Version1: Multi-thread version (commit 9eb4827) * 平行化linked list建構過程: 每條thread有獨立的linked list, 各自建好list再合併。合併方法僅是將數個list頭尾相連, 合併後的list不是sorted list * File IO未平行化 (同一時間只有一個thread能執行fgets()) * 實測結果: multithread效能反而更糟 ### Version2: mmap() preload file to memory space (commit 4ec6ade) * IO改用mmap, 但建構linked list還是沒變快 (原因真的是malloc?) * 為了隨機存取檔案裡的姓名記錄, 必須將檔案前置處理為align.txt, 再做mmap; 再者mmap整個align.txt有點浪費記憶體 * findName()測項不準, 因為list不是sorted list * mmap vs. fgets: single thread循序存取, mmap會比較好嗎? >> `mmap` 只是提供 memory mapping (from file system or memory region)，與 fopen / fread 相比，沒有絕對的效能好壞，完全取決你如何操作 [name=jserv] ### Version3: “Pthread” + “file alignment” + “mmap” + “簡易 memory pool” + 檢索省略尋找 `\n` (commit 9c69b94) * memory pool實際上是linked list in array, 而且各節點在記憶體裡不是循序連結, 勢必提高循序存取時的cache miss (待驗證) * multithreaded append()還是沒變快 >> 也要一併思考，現在 benchmarking 的方式是否合理 [name=jserv] ## TODO ### mmap Linux Device Driver 3/e, ch15 https://static.lwn.net/images/pdf/LDD3/ch15.pdf http://stackoverflow.com/questions/9817233/why-mmap-is-faster-than-sequential-io https://news.ycombinator.com/item?id=2562800 http://notailbear.blogspot.tw/2008/09/implement-mmap-in-driver.html mmap blocking & readahead issue: Does mmap block the caller when page fault? mmap flags: MAP_NONBLOCK, MAP_POPULATE mmap mapped size: 以64KB為單位? mmap與aio適用場合? ### 多執行緒同時執行malloc對效能的影響 modern malloc is lock-free ? http://stackoverflow.com/questions/4524437/in-multithreaded-c-c-does-malloc-new-lock-the-heap-when-allocating-memory ### 找multithread效能瓶頸測試1. perf timechart觀察thread活動: 1. 套用[patch] (http://www.spinics.net/lists/linux-perf-users/msg00009.html)後, 重編perf ``` diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c index 0d4d8ff..5b925de 100644 --- a/tools/perf/builtin-timechart.c +++ b/tools/perf/builtin-timechart.c @@ -286,7 +286,7 @@ static int process_comm_event(event_t *event, struct perf_session *session __use static int process_fork_event(event_t *event, struct perf_session *session __used) { - pid_fork(event->fork.pid, event->fork.ppid, event->fork.time); + pid_fork(event->fork.tid, event->fork.ppid, event->fork.time); return 0; } ``` 2. 設定CPU scaling_governor為performance: $ sudo cpupower frequency-set -g performance (https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt) (https://www.kernel.org/doc/readme/tools-power-cpupower-bench-README-BENCH) (https://wiki.archlinux.org/index.php/CPU_frequency_scaling) (http://lxr.free-electrons.com/source/tools/power/cpupower/) 結果: timechart顯示的thread活動圖有點怪, 4 threads時，總有一個thread的活動時間特別短。原因待查。(sampling誤差或其他原因?) 測試2. 測試單一thread性能及平行化效益: 修改程式，讓程式只跑其中一條thread 結果: 單一thread負載與性能不成比例, 就算單一thread只處理1/4的資料，其花費的時間沒有降為1/4, 所以開再多條thread也是白搭 (希望可以找個科學一點的方式debug, 不要try & error) 稍微修改了一下記憶體存取順序, 效能有變好, 但cache相關的perf event數據並無明顯變化 ($ perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,L1-dcache-stores,cache-references,cache-misses <other opts>) ``` diff --git a/phonebook_opt.c b/phonebook_opt.c index 56590ba..cdc3d20 100644 --- a/phonebook_opt.c +++ b/phonebook_opt.c @@ -53,7 +53,7 @@ void append(void *arg) entry *j = app->entryStart; for (char *i = app->ptr; i < app->eptr; i += MAX_LAST_NAME_SIZE * app->nthread, - j += app->nthread,count++) { + j += 1,count++) { app->pLast->pNext = j; app->pLast = app->pLast->pNext; ```