2022q1 seHTTPd 改進

contributed by < haoyu0970624763 >

作業說明
 GitHub

壓力測試

apache bench
- 用法: ab -n 100000 -c 500 -k http://127.0.0.1:8081/
- 無法有效反映出多執行緒的特性
htstress
- 用法:
  - 編譯: gcc -std=gnu11 -Wall -Wextra -Werror -o htstress htstress.c -lpthread
  - 使用: ./htstress -n 10000 -c 100 -t 8 localhost:8081
- 使用 epoll 觸發、可反映多執行緒的特性
wrk
- 只支援 HTTP 通訊協定的連線請求（如 get, post 等），若要執行 get 之外的 HTTP 請求，需要使用者自行撰寫 lua 腳本
尚有 httperf , weighttp 等壓力測試工具等

使用 computed goto 改寫 switch-case

在 Computed goto for efficient dispatch tables 一文中提及, goto 在執行效率上比 switch-case 要更好，主要可以從 2 個層面探討:

switch-case 會做 bounds checking , 檢查是否有 default case 的狀況，即使程式中的 switch 沒有寫到 default case，編譯期間仍會產生強制檢查的程式，這是為了符合 C99 的標準

If no converted case constant expression matches and there is no default label, no part of the switch body is executed.

另一部份則是 branch prediction

branch prediction 的部份可參考老師之前撰寫的
switch-背後的-goto-和實作考量

我們透過 Labels as Values，在 http_parser.c 的兩個函式中加入 conditions 作為 label address table 讓 state 進行查表，並透過 goto 跳躍到對應的程式碼。

以 http_parse_request_body conditions 為範例

const void *conditions[] = {
    &&c_start,
    &&c_key,
    &&c_spaces_before_colon,
    &&c_spaces_after_colon,
    &&c_value,
    &&c_cr,
    &&c_crlf,
    &&c_crlfcr,
};

接著用 goto 取代 switch

goto *conditions[state];

我用 perf 觀察原本的程式碼以及使用 computed goto 改寫後的效率

switch 版本的觀測紀錄如下

$ sudo perf stat ./sehttpd -p 8081

$ ./htstress -n 100000 -c 1 -t 4 http:/127.0.0.1:8081/

requests:      100000
good requests: 100000 [100%]
bad requests:  0 [0%]
socket errors: 0 [0%]
seconds:       4.626
requests/sec:  21617.794


 Performance counter stats for './sehttpd -p 8081':

          4,551.69 msec task-clock                #    0.498 CPUs utilized          
               458      context-switches          #  100.622 /sec                   
                36      cpu-migrations            #    7.909 /sec                   
           100,070      page-faults               #   21.985 K/sec                  
    14,861,152,979      cycles                    #    3.265 GHz                    
    10,629,920,230      instructions              #    0.72  insn per cycle         
     2,043,158,393      branches                  #  448.879 M/sec                  
        25,004,139      branch-misses             #    1.22% of all branches

goto 版本的觀測紀錄如下

$ sudo perf stat ./sehttpd -p 8081

$ ./htstress -n 100000 -c 1 -t 4 http://127.0.0.1:8081/

requests:      100000
good requests: 100000 [100%]
bad requests:  0 [0%]
socket errors: 0 [0%]
seconds:       4.150
requests/sec:  24094.580


Performance counter stats for './sehttpd -p 8081':

          4,093.64 msec task-clock                #    0.207 CPUs utilized          
               359      context-switches          #   87.697 /sec                   
                34      cpu-migrations            #    8.306 /sec                   
           100,071      page-faults               #   24.445 K/sec                  
    13,730,441,736      cycles                    #    3.354 GHz                    
    10,613,940,571      instructions              #    0.77  insn per cycle         
     2,038,720,358      branches                  #  498.021 M/sec                  
        24,397,318      branch-misses             #    1.20% of all branches

可以看出在 computed goto 版本中的執行的 cycles 少於 switch 版本的且 branch miss rate 也略少於 switch 版本

在改寫的過程中有參照 oscarshiang 的共筆，不過在測量的過程中我沒辦法觀測出 branch miss rate 有明顯的改善，我更傾向在這個實驗中 goto 之所以效能更好的原因在於減少了bounds checking

程式碼更動

更改 buffer size 從 8124 更改為 8192 , 變成 2 的冪 , 我們就可以透過跟 (MAX_BUF-1) 做 bitwise-AND , 可取得 0~8191的範圍來取代 % 的運算 , 因為 & 運算比 % 來的更有效率

原本

p = (uint8_t *) &r->buf[pi % MAX_BUF];

更改成

p = (uint8_t *) &r->buf[pi & (MAX_BUF - 1)];

可將 pi % MAX_BUF 改為 pi & (MAX_BUF - 1)，當 MAX_BUF 為 2 的冪
詳見: 你所不知道的 C 語言: bitwise 操作
jserv

epoll 工作模式

節錄自 epoll 中

monitoring multiple file descriptors to see if I/O is possible on any of them

epoll 用來監聽 I/O 事件是否準備好了 , 而 epoll 又可分成

Level-triggered
- 當 file descriptors 準備就緒時 , kernel 透過 epoll 發送通知告知我們已就緒,接著就可以進行IO操作, 當我們長時間沒有進行操作時 , kernel 會繼續發送通知給我們
- socket 在 readable/writable 狀態時，epoll_wait會返回該socket
Edge-triggered
- 當 file descriptors 從還沒準備好轉變成準備好時,kernel 通過 epoll 發送通知告知我們,並且不再發送第二次通知

以 test_epoll_lt_and_et 舉例說明可能會講的更清楚

在客户端输入aabbcc，客户端發送包含 aabbcc 内容的數據包到服務端，在服務端一次可讀取2個字元的情況下

lt 模式下, 服務端先讀取 aa 並輸出 , epoll_wait 可返回 socket 繼續讀取資料 , 讀取 bb並輸出 , 依此類推
et 模式下, 服務端先讀取 aa 並輸出 , epoll_wait 只有等到新的數據到達時才會被喚醒 , bbcc 會遺留在服務端的 buffer

我們可以看出

lt 模式下的操作更為直覺與簡單
et 模式下可能隱含較多的 bug

此外, et 模式的效能會比 lt 來的更好 , 原因在於當 kernel 發送通知後 , 會把該事件從 ready list 中刪除 , 而 lt 模式則不會從 ready list 中刪除該事件 , 當 list中有越來越多事件時 , epoll 下次要通知用戶時 , 搜尋 list 的時間會增長

兩種模式各有其優缺點 , 在高效的 web server 中 , 我認為採用 edge-triggered 的方式可能會更合適

timer

web server 需要同時間處理多項請求 , 要優先處理哪個請求就變成一個重要的課題 , 而常見的作法之一是先處理最快到期的請求

在 mainloop.c 中 server 接收請求後會先呼叫 timer.c 中的handle_expired_timers 函式處理已經超出期限但還沒處理的請求 , 接著會呼叫 add_timer 函式將新的請求加入 queue 中

而 add_timer 中根據 node->key = current_msec + timeout (current_msec 表示當前的時間 , timeout 代表處理請求的最大延遲時間),顯而易見的可以發現 node->key 值代表的是此請求處理的期限

我們想取得最快到期的請求就可以透過 priority queue(MIN_HEAP)的方式獲取 , 也就是 timer.c 中的 prio_queue_min 函式

2022q1 seHTTPd 改進

壓力測試

使用 computed goto 改寫 switch-case

程式碼更動

epoll 工作模式

timer

Read more

Parallel Programming HW-6

Parallel Programming HW-2

2020q3 Homework3 (quiz3)

論文筆記