2016q3 Homework2 (phonebook-concurrent)

contributed by <mingnus>

tags: `ncku` `sysprog` `perf` `linux` `phonebook`

Trace code

Version1: Multi-thread version (commit 9eb4827)

平行化linked list建構過程: 每條thread有獨立的linked list, 各自建好list再合併。合併方法僅是將數個list頭尾相連, 合併後的list不是sorted list
File IO未平行化 (同一時間只有一個thread能執行fgets())
實測結果: multithread效能反而更糟

Version2: mmap() preload file to memory space (commit 4ec6ade)

IO改用mmap, 但建構linked list還是沒變快 (原因真的是malloc?)
為了隨機存取檔案裡的姓名記錄, 必須將檔案前置處理為align.txt, 再做mmap; 再者mmap整個align.txt有點浪費記憶體
findName()測項不準, 因為list不是sorted list
mmap vs. fgets: single thread循序存取, mmap會比較好嗎?

mmap 只是提供 memory mapping (from file system or memory region)，與 fopen / fread 相比，沒有絕對的效能好壞，完全取決你如何操作 jserv

Version3: “Pthread” + “file alignment” + “mmap” + “簡易 memory pool” + 檢索省略尋找 `\n` (commit 9c69b94)

memory pool實際上是linked list in array, 而且各節點在記憶體裡不是循序連結, 勢必提高循序存取時的cache miss (待驗證)
multithreaded append()還是沒變快

也要一併思考，現在 benchmarking 的方式是否合理 jserv

TODO

mmap

Linux Device Driver 3/e, ch15 https://static.lwn.net/images/pdf/LDD3/ch15.pdf
http://stackoverflow.com/questions/9817233/why-mmap-is-faster-than-sequential-io
https://news.ycombinator.com/item?id=2562800
http://notailbear.blogspot.tw/2008/09/implement-mmap-in-driver.html
mmap blocking & readahead issue: Does mmap block the caller when page fault?
mmap flags: MAP_NONBLOCK, MAP_POPULATE
mmap mapped size: 以64KB為單位?
mmap與aio適用場合?

多執行緒同時執行malloc對效能的影響

modern malloc is lock-free ?
http://stackoverflow.com/questions/4524437/in-multithreaded-c-c-does-malloc-new-lock-the-heap-when-allocating-memory

找multithread效能瓶頸

測試1. perf timechart觀察thread活動:

套用[patch] (http://www.spinics.net/lists/linux-perf-users/msg00009.html)後, 重編perf

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 0d4d8ff..5b925de 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -286,7 +286,7 @@ static int process_comm_event(event_t *event, struct perf_session *session __use
 
 static int process_fork_event(event_t *event, struct perf_session *session __used)
 {
-	pid_fork(event->fork.pid, event->fork.ppid, event->fork.time);
+	pid_fork(event->fork.tid, event->fork.ppid, event->fork.time);
 	return 0;
 }

設定CPU scaling_governor為performance:
$ sudo cpupower frequency-set -g performance
(https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt)
(https://www.kernel.org/doc/readme/tools-power-cpupower-bench-README-BENCH)
(https://wiki.archlinux.org/index.php/CPU_frequency_scaling)
(http://lxr.free-electrons.com/source/tools/power/cpupower/)

結果: timechart顯示的thread活動圖有點怪, 4 threads時，總有一個thread的活動時間特別短。原因待查。(sampling誤差或其他原因?)

測試2. 測試單一thread性能及平行化效益: 修改程式，讓程式只跑其中一條thread
結果: 單一thread負載與性能不成比例, 就算單一thread只處理1/4的資料，其花費的時間沒有降為1/4,
所以開再多條thread也是白搭
(希望可以找個科學一點的方式debug, 不要try & error)

稍微修改了一下記憶體存取順序, 效能有變好, 但cache相關的perf event數據並無明顯變化
($ perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,L1-dcache-stores,cache-references,cache-misses <other opts>)

diff --git a/phonebook_opt.c b/phonebook_opt.c
index 56590ba..cdc3d20 100644
--- a/phonebook_opt.c
+++ b/phonebook_opt.c
@@ -53,7 +53,7 @@ void append(void *arg)
     entry *j = app->entryStart;
     for (char *i = app->ptr; i < app->eptr;
             i += MAX_LAST_NAME_SIZE * app->nthread,
-            j += app->nthread,count++) {
+            j += 1,count++) {
         app->pLast->pNext = j;
         app->pLast = app->pLast->pNext;