# 2016q3 Homework2 (phonebook-concurrent)
contributed by <`mingnus`>
###### tags: `ncku` `sysprog` `perf` `linux` `phonebook`
## Trace code
### Version1: Multi-thread version (commit 9eb4827)
* 平行化linked list建構過程: 每條thread有獨立的linked list, 各自建好list再合併。合併方法僅是將數個list頭尾相連, 合併後的list不是sorted list
* File IO未平行化 (同一時間只有一個thread能執行fgets())
* 實測結果: multithread效能反而更糟
### Version2: mmap() preload file to memory space (commit 4ec6ade)
* IO改用mmap, 但建構linked list還是沒變快 (原因真的是malloc?)
* 為了隨機存取檔案裡的姓名記錄, 必須將檔案前置處理為align.txt, 再做mmap; 再者mmap整個align.txt有點浪費記憶體
* findName()測項不準, 因為list不是sorted list
* mmap vs. fgets: single thread循序存取, mmap會比較好嗎?
>> `mmap` 只是提供 memory mapping (from file system or memory region),與 fopen / fread 相比,沒有絕對的效能好壞,完全取決你如何操作 [name=jserv]
### Version3: “Pthread” + “file alignment” + “mmap” + “簡易 memory pool” + 檢索省略尋找 `\n` (commit 9c69b94)
* memory pool實際上是linked list in array, 而且各節點在記憶體裡不是循序連結, 勢必提高循序存取時的cache miss (待驗證)
* multithreaded append()還是沒變快
>> 也要一併思考,現在 benchmarking 的方式是否合理 [name=jserv]
## TODO
### mmap
Linux Device Driver 3/e, ch15 https://static.lwn.net/images/pdf/LDD3/ch15.pdf
http://stackoverflow.com/questions/9817233/why-mmap-is-faster-than-sequential-io
https://news.ycombinator.com/item?id=2562800
http://notailbear.blogspot.tw/2008/09/implement-mmap-in-driver.html
mmap blocking & readahead issue: Does mmap block the caller when page fault?
mmap flags: MAP_NONBLOCK, MAP_POPULATE
mmap mapped size: 以64KB為單位?
mmap與aio適用場合?
### 多執行緒同時執行malloc對效能的影響
modern malloc is lock-free ?
http://stackoverflow.com/questions/4524437/in-multithreaded-c-c-does-malloc-new-lock-the-heap-when-allocating-memory
### 找multithread效能瓶頸
測試1. perf timechart觀察thread活動:
1. 套用[patch] (http://www.spinics.net/lists/linux-perf-users/msg00009.html)後, 重編perf
```
diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 0d4d8ff..5b925de 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -286,7 +286,7 @@ static int process_comm_event(event_t *event, struct perf_session *session __use
static int process_fork_event(event_t *event, struct perf_session *session __used)
{
- pid_fork(event->fork.pid, event->fork.ppid, event->fork.time);
+ pid_fork(event->fork.tid, event->fork.ppid, event->fork.time);
return 0;
}
```
2. 設定CPU scaling_governor為performance:
$ sudo cpupower frequency-set -g performance
(https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt)
(https://www.kernel.org/doc/readme/tools-power-cpupower-bench-README-BENCH)
(https://wiki.archlinux.org/index.php/CPU_frequency_scaling)
(http://lxr.free-electrons.com/source/tools/power/cpupower/)
結果: timechart顯示的thread活動圖有點怪, 4 threads時,總有一個thread的活動時間特別短。原因待查。(sampling誤差或其他原因?)
測試2. 測試單一thread性能及平行化效益: 修改程式,讓程式只跑其中一條thread
結果: 單一thread負載與性能不成比例, 就算單一thread只處理1/4的資料,其花費的時間沒有降為1/4,
所以開再多條thread也是白搭
(希望可以找個科學一點的方式debug, 不要try & error)
稍微修改了一下記憶體存取順序, 效能有變好, 但cache相關的perf event數據並無明顯變化
($ perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,L1-dcache-stores,cache-references,cache-misses <other opts>)
```
diff --git a/phonebook_opt.c b/phonebook_opt.c
index 56590ba..cdc3d20 100644
--- a/phonebook_opt.c
+++ b/phonebook_opt.c
@@ -53,7 +53,7 @@ void append(void *arg)
entry *j = app->entryStart;
for (char *i = app->ptr; i < app->eptr;
i += MAX_LAST_NAME_SIZE * app->nthread,
- j += app->nthread,count++) {
+ j += 1,count++) {
app->pLast->pNext = j;
app->pLast = app->pLast->pNext;
```