2020q1 Homework6 (sehttpd)

# 2020q1 Homework6 (sehttpd) contributed by < `eecheng87` > ###### tags: `linux2020` ## sehttpd 運作原理 ### mainloop.c 第 81 行正如註解所說，當 client 突然關閉時，系統會發出訊號 `SIGPIPE`，若將 `SIG_IGN` 給 `sa_handler` 代表忽略訊號。反之，`SIG_DF` 表示執行系統預設函數。當然也可以給使用者自訂函數給 `sa_handler` ( 可帶一個 `int` 參數，回傳 `void` )。第 89 行的 `sock_set_non_blocking` 是用來對剛剛創的 `fd` 做描述。在 `sock_set_non_blocking` 中透過 `fcntl` 達到控制 `fd` 的效果。`fcntl` 的第二個參數傳入想要的操作 `cmd`。其中有用到的分別是 `F_GETFL` 取得目前 `fd` 的 flag; `F_SETFL` 設定 `fd` 的 flag。注意，我們在呼叫 `F_SETFL` 前必須透過 bitwise 對 `flag` 操作，再透過 `F_SETFL` 才能有效修改 `fd` 的 flag。這裡使用到的 flag 是 `O_NONBLOCK`，如果 `read` 没有可讀的資料，則返回 -1 和 `EAGAIN`。接著的數十行是非常標準的 `epoll` 建立流程，忘記的話可以參考[這裡](https://hackmd.io/@eecheng/Sy9XsgjOI)。再來是比較關鍵的 while loop，裡面也是標準的 epoll 流程，分兩種情況。第一種是，若目前從 `epoll_wait` 中改變的 `fd` 是 `listen_fd`，這代表有新的 client 想做連線，所以這個時候就需要做 socket programing 中固定的流程 `accept`。當然，設定 `event` 的細節、`infd` 的細節都是很制式的。比較特別的是 ( 在 sehttpd 獨有的 )，會透過 `init_http_request` 將這個 request 加入 list，此外又呼叫 `add_timer` 將 request 插入 priority queue。第二種情況是，在 epoll list 中變動的 `fd` 不是 `listen_fd`，這代表是已經建立連線的 client 想讀或想寫，所以呼叫 `do_request`。 ### timer.c 之所以需要 timer 的而且把所有 timer 建構成 priority queue 的原因是方便知道哪個 timer 是接著最快會 expired 的。由 `add_time` 可以發現，priority queue 上的節點的 key 取決於 `current_msec + timeout`。所以，越快到期的 timer，key 值越小，在 pq 的最頂端。 ### http.c 這個檔案內最重要的部分是在 mainloop 遇到 epoll 有動靜時會呼叫的 `do_request`。 ## 引入 thread pool ![](https://i.imgur.com/e3DqUE5.png) 按照以上架構圖實作，只有一個 thread 負責安排工作 ( 可想成 producer )，然後不管怎樣，先丟到一個共用的 queue。接著有多個人開始搶工作 ( 可想成 consumer )。如此一來，就是標準的生產/消費者問題。實作的部分修改自 [mbrossard](https://github.com/mbrossard/threadpool) 的原始碼。 ### thread pool 的資料結構 ```cpp typedef struct { void (*function)(void *); void *argument; } threadpool_task_t; struct threadpool_t { pthread_mutex_t lock; pthread_cond_t notify; /* CV for notify worker threads */ pthread_t *threads; /* threads set ( array ) */ threadpool_task_t *queue; /* array */ int thread_count; int queue_size; int head; /* index of first element */ int tail; /* index of last element */ int count; /* num of task */ int shutdown; /* flag */ int started; /* num of threads are working */ }; ``` 前面提到公共的 queue 就是 `threadpool_task_t *queue`，他是一個 `task_t` 型態的陣列，每個成員存有 `function` 和 `argument`。一個 queue 上有一把鎖 `lock`，避免同時兩人對 queue 做插入等競爭 ( contention ) 的動作。雖然本專案的 `function` 都一樣，但是為了擴充性，這邊就先不寫死。 ### 初始化 thread pool ```cpp threadpool_t *threadpool_create(int thread_count, int queue_size){ ... } ``` 內容其實就是根據 `thread_count` 和 `queue_size` 來配置記憶體大小，另外初始化一些成員變數。至於呼叫的地方是在 mainloop.c 中，本次選用的參數是開 16 個 thread 和大小為 256 的 queue。最後，這函數需要將所有 thread 初始化它們的執行函數。 ```cpp if (pthread_create(&(pool->threads[i]), NULL, threadpool_thread, (void *) pool) != 0) { threadpool_destroy(pool); return NULL; } ``` 確保每條 thread 都在做 `threadpool_thread`。簡單來說這個函數就是消費者，持續和 queue 拿 task 來做，稍後會詳細解釋。 ### 安排新工作進 queue 在 `mainloop.c` 中 epoll 發現 client 端有動靜 ( 想寫/讀 )，所以把改變的 `events` 丟到 queue 中。`do_request` 就是 task 的 function，`events[i].data.ptr` 是 function 的參數。 ```cpp=172 if (threadpool_add(pool, do_request, events[i].data.ptr) < 0) { printf("Error occur when adding task in queue\n"); } ``` ### 如何在插入任務時避免競爭 ```cpp int threadpool_add(threadpool_t *pool, void (*function)(void *), void *argument){ .. if (pthread_mutex_lock(&(pool->lock)) != 0) { return threadpool_lock_failure; } next = (pool->tail + 1) % pool->queue_size; do { /* Queue size checking */ if (pool->count == pool->queue_size) { .. } /* Shutdown checking */ if (pool->shutdown) { .. } /* Add task to queue */ .. /* Notify *_wait */ if (pthread_cond_signal(&(pool->notify)) != 0) { err = threadpool_lock_failure; break; } } while (0); /* Release lock */ if (pthread_mutex_unlock(&pool->lock) != 0) { err = threadpool_lock_failure; } .. } ``` 前面有提到，我們不希望有多個人同時對 queue 操作，所以我們進入 `add` 時，需要拿鎖。因為這個鎖是 queue 獨有的，只有一把，所以搶到的人保證不會有其他人進來。也因為我們拿到鎖了，所以在接下來的 critical section 可以盡情地修改 shared variable。這邊還有一個比較特別的地方，`pthread_cond_signal` 的呼叫將在稍後的 `void *threadpool_thread(void *threadpool)` 中解釋用途。 ### 拿 queue 上的 task 執行前面有提到 `threadpool_thread` 的角色就像消費者，會一直和 queue 拿 task 來做。接著來解釋一些實作重點。由於 `threadpool_thread` 是 thread 的 routine，需要一直做，所以用 `for(;;)` 來達成效果。此外，同 `add`，`threadpool_thread` 也涉及對共有變數 queue 的操作，所以我們需要上鎖。 ```cpp=211 /* Grab our task */ task.function = pool->queue[pool->head].function; task.argument = pool->queue[pool->head].argument; pool->head = (pool->head + 1) % pool->queue_size; pool->count -= 1; /* Unlock */ pthread_mutex_unlock(&(pool->lock)); /* Call worker */ (*(task.function))(task.argument); ``` 以上因為在 critical section，所以可以大膽操作 shared variable。`(*(task.function))(task.argument)` 就是**消費**掉 task 的地方。前面有提到 condition variable 的運用有出現在此。沒錯，在 202 行的地方使用了。 ```cpp=202 while ((pool->count == 0) && (!pool->shutdown)) { pthread_cond_wait(&(pool->notify), &(pool->lock)); } ``` 當 queue 中已經沒有 task 了，我們就要呼叫 sleep ( wait )，這時此 thread 會釋放先前拿到的鎖，這樣當有新任務要加到 queue 中才能拿到 (避免 deadlock)。當有新任務透過 `threadpool_add` 加入 queue 時，會呼叫 signal。此時，睡在 condition variable 上的 thread 會被叫醒，重新獲得 lock，繼續往下執行。 ### 實驗的 script 希望達到能重複執行不同的 concurrency 參數，並將 htstress.c 的效能數據紀錄到 `.out` 方便接下來的 `.gp` 來製圖 ```shell #!/usr/bin/env bash # usage: repeat iterate_time cmd DEST=$1 repeat(){ T=$1 shift for i in `seq 10 $T`; do # test several time for each Concurrecy value echo -n $i >> $DEST $1 $i $2 >> $DEST echo -n $i >> $DEST $1 $i $2 >> $DEST echo -n $i >> $DEST $1 $i $2 >> $DEST done } repeat 50 "./htstress -n 10000 -c " " -t 3 http://localhost:8081/" ``` 在不同的 concurrency 會執行 `htstress` 3 次 (考量到時間，所以只做三次)來評估效能(平衡一下極端值) ### 效能改善 ![](https://i.imgur.com/eZikuLE.png) 可以觀察到引入 thread pool 的 server 能處理的 requests 提升。 ## 引入 lock free thread pool 原始碼在分支 [lfpool](https://github.com/eecheng87/sehttpd/tree/lfpool)，修改自 [xhjcehust](https://github.com/xhjcehust/LFTPool) 實作。 lock free 其實就是不用鎖來達到 multi-thread programing 的技巧，**理論上**效能可以提升，因為鎖會占用很多資源。但是，這種不用鎖又有達到避免競爭的程式設計是非常困難的，常常會有沒考慮到的競爭問題，例如: [ABA Problem](https://en.wikipedia.org/wiki/ABA_problem)。由於架構設計關係，本次 lock free thread pool 不會遇到 ABA 問題。(註: [Solution to ABA problem](https://lumian2015.github.io/lockFreeProgramming/aba-problem.html)) ![](https://i.imgur.com/vdKwFcT.png) 以上是 lock free thread pool 的架構，有別於 lock thread pool，為了避免消費者和生產者同時對 queue 操作產生競爭，所以我們盡可能的讓競爭的種類下降，使 lock free programing 更容易一些。｢讓競爭的種類下降｣具體的做法是，把原本共用的 queue 改成，每條 thread 都有自己的 queue。這樣一來，每條 thread (消費者) 只會拿自己的 task，不會去**競爭**屬於別條 thread 的 task。而生產者安排工作給 queue 的方式是採用 [round robin scheduling](https://en.wikipedia.org/wiki/Round-robin_scheduling) 以達到近似公平的安排工作 (真正公平需要看當前所有 thread 上的 task 數量來決定)。 ### lock free thread pool 的資料結構 ```cpp /* entry in work queue */ struct threadpool_task_t { void (*function)(void *); void *argument; }; /* pack real thread `pthread_t` */ /* `pop_index` and `push_index` are both gaurantee contention-free. */ struct threadpool_thread_t { pthread_t thread; /* real POSIX thread */ threadpool_task_t *queue; /* work queue belong each thread */ int size; /* size of queue */ int pop_index; /* next index for pop */ int push_index; /* next index for push */ int task_count; /* shared var between enq and deq */ }; struct threadpool_t { threadpool_thread_t *threads; /* threads set ( array ) */ int thread_count; int shutdown; /* flag */ }; ``` 每個程式有一個 threadpool (`threadpool_t`)，在 pool 裡有多條 thread (`threadpool_thread_t`)，而每條 thread 上有一條屬於該 thread 的 queue (`threadpool_task_t *`)。每個 queue 的成員是由 `threadpool_task_t` 組成，紀錄任務和參數。 ### 初始化因為不涉及資料競爭，所以和 lock based thread pool 的實作一樣。 ### 釐清需要考慮競爭的部分由於是 lock free，所以無法像前面，只要加個鎖接著就可以盡情操作 shared variable。以下討論三個成員變數: * push_index: 是用來決定下一個要插入新任務的 index。因為這個變數每條 thread 都有，而且只有在呼叫 `threadpool_qpush` 時才會改動值。恰巧在 `mainloop.c` 中呼叫此函數是走單執行續，所以不會有人同時呼叫 `threadpool_qpush`。 * pop_index: 是用來決定下一個準備要做的 task 的 index。變數只有在 `threadpool_qpop` 中才會改變。雖然有多個 thread 會改動 `pop_index`，但是上圖提供的架構巧妙的避開這個問題。因為每條 thread 都有自己的 `pop_index`，所以儘管同時做，也都是改自己的。而每條 thread 相對自己都是一條單執行緒，所以 `pop_index` 不會遇到 contention 的問題。 * task_count: 當 `push` 或 `pop` 時都會影響值，所以這個變數是會遇到 contention 的。所以，接著對成員變數的操作只需要對 task_count 做競爭考慮，其他正常操作即可。 ### 安排新工作進 queue 同 lock based thread pool，在 mainloop 中當 client 想傳東西時被 epoll 發現，則將此視為任務，加到 queue 中等待執行。呼叫 `threadpool_qpush(pool, do_request, events[i].data.ptr)`。 ### 實作 threadpool_qpush ```cpp int threadpool_qpush(threadpool_t *pool, void (*task)(void *), void *arg) { threadpool_thread_t *dest = round_robin_schedule(pool); return dispatch(dest, task, arg); } ``` 首先需要透過 RR (註: 使用到 [Static variable inside of a function in C](https://stackoverflow.com/questions/5033627/static-variable-inside-of-a-function-in-c) 技巧) 來決定新的任務要丟到哪一條 thread 上的 queue。接著的 `dispatch` 才是真正把 task 插入 queue 的函數。 ```cpp do { /* There isn't contention in push_index */ int p = dest->push_index; /* check whether full */ if (__sync_bool_compare_and_swap(&(dest->task_count), dest->size, dest->size)) { printf("Queue is full\n"); return -1; } /* insert task into queue */ dest->queue[dest->push_index].function = task; dest->queue[dest->push_index].argument = arg; dest->push_index = (p + 1) % dest->size; /* task_count is shared variable for qpop and qpush */ __sync_fetch_and_add(&(dest->task_count), 1); break; } while (1); ``` 以上可以觀察到我們只有在操作 `task_count` 才需要透過 atomic 操作來避免競爭([原因](https://hackmd.io/hJriZE4NQ96AZnohbBYYpw?view#%E9%87%90%E6%B8%85%E9%9C%80%E8%A6%81%E8%80%83%E6%85%AE%E7%AB%B6%E7%88%AD%E7%9A%84%E9%83%A8%E5%88%86))。由於 `pop` 的觀念類似，所以不再解釋。 ### 每條執行緒在做甚麼在 `create` pthread 的時候，我們綁定的函數是 `threadpool_thread`，意味著執行緒會一直執行該函數。而在 `threadpool_thread` 做的事就是不停的做 queue 上的 task (消費者)。 ```cpp threadpool_thread_t *thread = (threadpool_thread_t *) arg; for (;;) { /* Grab task from queue belong this thread */ threadpool_task_t *next_task = threadpool_qpop(thread); /* Call worker */ if (next_task) { (next_task->function)(next_task->argument); } /* Performance improvement */ /* Improve 5k requests */ sched_yield(); } ``` 這邊值得注意的是 `sched_yield` 的使用。原先沒有使用的效能比 lock based thread pool 還差 5K 左右，但是參考 [jwang](https://github.com/jwang0306/sehttpd) 同學的實作發現他有加 `sched_yield`，沒想到竟然差這麼多。呼叫 [sched_yield](http://man7.org/linux/man-pages/man2/sched_yield.2.html) 時，該 thread 會放棄在 cpu 上的使用權，自動跑到尾巴。 ### 效能比較先說結論，lock free thread pool 的效能並沒有比 lock based thread pool 還要好 (輸了 1K 以上的 requests 量)。 ### 不同 consuming 策略對效能的影響雖然整體效能不如 lock based，但是還是有其他東西可以研究。由 `sched_yield` 得到靈感，想到用不同的方式消耗 queue 中的 task 是否會影響效能? 於是實驗兩種不同的 scheme: * scheme 1: 若輪到我 (thread) 做 (拿到 cpu)，一定一直做到 queue (屬於該 thread 的 queue) 中沒任務可做，才主動呼叫 `sched_yield` 釋放 cpu 占用權。 * scheme 2: 我 (thread) 每次拿到 cpu 時只做 queue 中的一個 task 就釋放權力，給其他人做。 #### 效能 ![](https://i.imgur.com/LJX4e4S.png) 可以觀察到 scheme 1 表現普遍較好，我覺得主要的原因在於沒有頻繁的呼叫 `sched_yield`。一次做完該 thread 的 queue 上面的事再把控制權給其他 thread。省去的是頻繁的 context switch ( thread A 做一個任務就換 B，B 再做一個再換 A => 頻繁 context switch )。以上是透過正常執行方式，接著實驗把程式固定在一個 cpu 上看看效能是否影響 ( `./taskset 0x1` ) ![](https://i.imgur.com/TkhjWAM.png) 有趣的是，這次變成 scheme 2 表現較好。我想理由應該是，若在單 cpu 上要達成多執行緒的效果本來就很需要大量 context switch 製造多執行緒的假象，若如 scheme 1 堅持要把該 queue 上的東西做完，很有可能每一次迭代都會被 cpu 剝奪控制權。反觀 scheme 2，每做一個 task 就自動呼叫 `sched_yield` 讓出控制權，這也是 cpu 為了完成多執行緒所比較想看到的事。所以結論，在一般執行程式，能夠有多個 cpu 狀況下，建議用 scheme 1。 :::warning thread switch 的成本跟 cpu affinity 及核心內部結構有關。以 [lwan](https://lwan.ws/) 來說，捨棄對 thread switch 的依賴，而引入 coroutine，從而確保更好的效率 :notes: jserv ::: #### lock 和 lock-free 的 thread pool 至少就我這次的比較來看，lock free 沒有比較好，甚至在開發的過程會帶來很多困擾。於是為了說服自己不要再用 lock free programing，我找到了一篇[文章](https://www.modernescpp.com/index.php/c-core-guidelines-concurrency-and-lock-free-programming)當成我不使用 lock free 繼續開發的背書。節錄部分: #### Don't program lock-free Sure, you don't believe me but based on my experience giving many concurrency classes and workshops this is my first rule. Honestly, I agree with many of the most appreciated C++ experts worldwide. Here are the key statements and cites from their talks: * Herb Sutter: Lock-free programming is like playing with knives. * Anthony Williams: "Lock-free programming is about how to shoot yourself in the foot." * Tony Van Eerd: "Lock-free coding is the last thing you want to do." * Fedor Pikus: "Writing correct lock-free programs is even harder." * Harald Böhm: "The rules are not obvious." 雖然上面的人名都不認識，但是姑且就信了，接著改善效能的方向還是回到 lock based 的 thread pool。 ## 閱讀資料 - [x] [高效能 Web 伺服器開發](https://hackmd.io/@sysprog/fast-web-server) - [x] [Toward Concurrency](https://hackmd.io/@ktvexe/ryq21wT2?type=view)