contributed by <`mingnus`>
tags: ncku, sysprog, perf, linux, phonebook
mmap
mmap only provides a memory mapping (from a file or a memory region); compared with fopen/fread there is no absolute performance winner, it depends entirely on how you use it. (jserv)
(commit 9c69b94) Also think about whether the current way of benchmarking is reasonable. (jserv)
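As a concrete illustration of jserv's point above, here is a minimal sketch (the dictionary path is an assumption) contrasting the two styles being benchmarked: fread pays the copy cost inside the read call, while mmap defers the cost to whenever the pages are first touched, so what the benchmark measures depends on where that access happens.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "./dictionary/words.txt";   /* hypothetical input file */
    struct stat st;
    if (stat(path, &st) < 0)
        return 1;

    /* Style 1: fopen/fread copies the whole file into a user buffer,
     * so the I/O cost is paid here, inside the read call. */
    FILE *fp = fopen(path, "r");
    char *buf = malloc(st.st_size);
    fread(buf, 1, st.st_size, fp);
    fclose(fp);
    free(buf);

    /* Style 2: mmap only sets up the mapping; no data moves until the
     * pages are touched, so the cost lands wherever the first access is. */
    int fd = open(path, O_RDONLY);
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    volatile char sink = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sink ^= map[i];                  /* touching faults the pages in */
    (void) sink;
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```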
Linux Device Drivers 3/e, ch. 15: https://static.lwn.net/images/pdf/LDD3/ch15.pdf
http://stackoverflow.com/questions/9817233/why-mmap-is-faster-than-sequential-io
https://news.ycombinator.com/item?id=2562800
http://notailbear.blogspot.tw/2008/09/implement-mmap-in-driver.html
mmap blocking & readahead issue: does mmap block the caller on a page fault?
mmap flags: MAP_NONBLOCK, MAP_POPULATE (see the sketch after this list)
mmap mapping size: is it done in units of 64 KB?
When is mmap appropriate, and when is AIO?
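On the MAP_POPULATE question above, a minimal sketch (hypothetical file path) of prefaulting the whole mapping so the page faults do not land inside the timed traversal:

```c
#define _GNU_SOURCE             /* for MAP_POPULATE */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("./dictionary/words.txt", O_RDONLY);   /* hypothetical path */
    if (fd < 0)
        return 1;

    struct stat st;
    fstat(fd, &st);

    /* MAP_POPULATE asks the kernel to prefault the mapping now, so the
     * readahead happens here rather than as page faults inside the
     * measured loop. */
    char *map = mmap(NULL, st.st_size, PROT_READ,
                     MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    /* ... traverse map[0 .. st.st_size - 1] here ... */

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```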
Is a modern malloc lock-free? (a quick empirical probe is sketched below)
http://stackoverflow.com/questions/4524437/in-multithreaded-c-c-does-malloc-new-lock-the-heap-when-allocating-memory
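One way to probe the lock-free question empirically rather than settle it from the allocator's source: a minimal sketch (the thread count and allocation size are arbitrary) that hammers malloc/free from several threads. If the allocator serialized on a single heap lock, the wall time would grow with the thread count instead of staying roughly flat.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREAD 4
#define NALLOC  1000000

static void *worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < NALLOC; i++) {
        char *p = malloc(64);
        p[0] = (char) i;        /* touch it so the pair is not optimized out */
        free(p);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREAD];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREAD; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREAD; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%d threads: %.3f s\n", NTHREAD,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}
```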
Test 1: observing thread activity with perf timechart:
```diff
diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 0d4d8ff..5b925de 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -286,7 +286,7 @@ static int process_comm_event(event_t *event, struct perf_session *session __use
 static int process_fork_event(event_t *event, struct perf_session *session __used)
 {
-	pid_fork(event->fork.pid, event->fork.ppid, event->fork.time);
+	pid_fork(event->fork.tid, event->fork.ppid, event->fork.time);
 	return 0;
 }
```
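For reference, the chart here was presumably produced along these lines (the binary name is an assumption):

```shell
$ perf timechart record ./phonebook_opt    # writes perf.data
$ perf timechart                           # renders perf.data into output.svg
```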
Result: the thread-activity chart drawn by timechart looks a bit odd: with 4 threads, one thread's active period is always noticeably shorter than the others'. The cause is still unknown (sampling error, or something else?).
Test 2: measuring single-thread performance and the payoff of parallelization: modify the program so that it runs only one of its threads (a timing sketch follows below).
Result: a single thread's load and its run time are not proportional; even when a single thread processes only 1/4 of the data, its elapsed time does not drop to 1/4, so spawning more threads is of little use.
(I would like to find a more scientific way to debug this rather than trial and error.)
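The proportionality claim can be checked in isolation with a self-contained sketch (the dummy workload and buffer size are assumptions, not the real append()): time one thread over the full buffer, then one thread over a quarter of it, and compare the ratio.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)

static char *buf;

struct span { size_t begin, end; };

static void *touch(void *arg)
{
    struct span *s = arg;
    volatile char sink = 0;
    for (size_t i = s->begin; i < s->end; i++)
        sink ^= buf[i];         /* stand-in for the per-thread workload */
    (void) sink;
    return NULL;
}

static double timed_run(struct span *s)
{
    struct timespec t0, t1;
    pthread_t tid;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&tid, NULL, touch, s);
    pthread_join(tid, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    buf = calloc(N, 1);
    struct span full = {0, N}, quarter = {0, N / 4};

    /* if the work scaled perfectly, the second number would be ~1/4 of
     * the first; a larger ratio means extra threads cannot pay off */
    printf("full:    %.3f s\n", timed_run(&full));
    printf("quarter: %.3f s\n", timed_run(&quarter));
    free(buf);
    return 0;
}
```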
Slightly rearranging the memory access order (see the diff below) improved performance, but the cache-related perf event counts showed no obvious change:
($ perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,L1-dcache-stores,cache-references,cache-misses <other opts>)
```diff
diff --git a/phonebook_opt.c b/phonebook_opt.c
index 56590ba..cdc3d20 100644
--- a/phonebook_opt.c
+++ b/phonebook_opt.c
@@ -53,7 +53,7 @@ void append(void *arg)
     entry *j = app->entryStart;
     for (char *i = app->ptr; i < app->eptr;
             i += MAX_LAST_NAME_SIZE * app->nthread,
-            j += app->nthread,count++) {
+            j += 1,count++) {
         app->pLast->pNext = j;
         app->pLast = app->pLast->pNext;
```
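One plausible reading of this change (assuming each thread's entryStart points at its own block of the entry array): with `j += app->nthread`, adjacent entry slots are written by different threads, so several threads keep storing into the same cache lines (false sharing), whereas with `j += 1` each thread fills a contiguous run of entries. False sharing shows up as cache-coherence traffic between cores rather than ordinary demand misses, which might explain why performance improved while the L1 load-miss counters barely moved.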