高效 Web 伺服器開發

--- tags: LINUX KERNEL, LKI --- # [I/O 模型演化](https://hackmd.io/@sysprog/linux-io-model): 高效 Web 伺服器開發 > 資料整理: [jserv](https://wiki.csie.ncku.edu.tw/User/jserv) :::info 本文探討 ==[seHTTPd](https://github.com/sysprog21/sehttpd)== 的設計和實作議題，涵蓋並行處理、I/O 模型、React pattern，和 Web 伺服器在事件驅動架構的考量。 ::: ## TCP 伺服器的實作挑戰一般來說，TCP 伺服器的流程是先 [listen](http://man7.org/linux/man-pages/man2/listen.2.html) 於某個 port，呼叫 [accept](http://man7.org/linux/man-pages/man2/accept.2.html) 等待客戶端的連線，等客戶端連線後會返回一個 fd (file descriptor)，自指定 fd 進行 [read](http://man7.org/linux/man-pages/man2/read.2.html)，以 TCP echo 來說，對應的處置方式即 [write](https://man7.org/linux/man-pages/man2/write.2.html) 同樣的資料到該 fd，然後重新 [accept](http://man7.org/linux/man-pages/man2/accept.2.html)，關鍵程式碼如下: ```c /* Enter an infinite loop to respond to client requests */ while (1) { /* Wait for a connection, then accept() it */ if ((conn_s = accept(list_s, NULL, NULL)) < 0) exit(EXIT_FAILURE); /* Retrieve an input line from the connected socket */ for (size_t n = 1; n < MAX_LINE - 1; n++) { ssize_t rc; char c; if ((rc = read(conn_s, &c, 1)) == 1) { *buffer++ = c; if (c == '\n') break; } else if (rc == 0) break; else { if (errno == EINTR) continue; break; } } *buffer = 0; /* simply write it back to the same socket */ size_t nleft = strlen(buffer); while (nleft > 0) { ssize_t nwritten; if ((nwritten = write(conn_s, buffer, nleft)) <= 0) { if (errno == EINTR) nwritten = 0; break; } nleft -= nwritten; buffer += nwritten; } /* Close the connected socket */ if (close(conn_s) < 0) exit(EXIT_FAILURE); } ``` > 完整程式碼可見 [echoserv](https://www.paulgriffiths.net/program/c/srcs/echoservsrc.html) 理解這個 echo 伺服器後，即可在其上擴展為 web 伺服器 —— HTTP 建構於 TCP 之上，會有通訊協定的解析和處理。[CS:APP](https://hackmd.io/@sysprog/CSAPP) 提供 [tiny.c](http://csapp.cs.cmu.edu/public/ics2/code/netp/tiny/tiny.c) 來服務靜態和動態內容，儘管效益不彰，但已達到教學的目的。雖然這個程式能正常工作，但它無法勝任廣泛的應用，原因是伺服器在處理一個客戶端請求時，無法受理別的客戶端請求，也就是說，這個程式無法同時滿足兩個想獲得 echo 服務的使用者，這自然是極大的限制。其中一項改進的解決方案：[accept](http://man7.org/linux/man-pages/man2/accept.2.html) 之後再進行 [fork](http://man7.org/linux/man-pages/man2/fork.2.html)，親代行程 (parent process) 繼續 [accept](http://man7.org/linux/man-pages/man2/accept.2.html)，子行程 (child process) 則處理該 fd，對應的程式碼如下: ```c int guard(int n, char *err) { if (n == -1) { perror(err); exit(1); } return n; } int main(void) { int listen_fd = guard(socket(PF_INET, SOCK_STREAM, 0), "Could not create TCP socket"); guard(listen(listen_fd, 100), "Could not listen on TCP socket"); struct sockaddr_in listen_addr; socklen_t addr_len = sizeof(listen_addr); guard(getsockname(listen_fd, (struct sockaddr *) &listen_addr, &addr_len), "Could not get socket name"); for (;;) { int conn_fd = accept(listen_fd, NULL, NULL); if (guard(fork(), "Could not fork") == 0) { char buf[100]; for (;;) { ssize_t nreceived = guard(recv(conn_fd, buf, sizeof(buf), 0), "Could not recv on TCP connection"); if (nreceived == 0) { guard(shutdown(conn_fd, SHUT_WR), "Could not shutdown TCP connection"); guard(close(conn_fd), "Could not close TCP connection"); exit(0); } guard(send(conn_fd, buf, nreceived, 0), "Could not send to TCP connection"); } } else { /* Child takes over connection; close it in parent */ close(conn_fd); } } return 0; } ``` > [完整程式碼](https://jameshfisher.com/2017/02/25/tcp-server-fork/) 乍看這個程式解決上述只能服務單一客戶端請求的問題，但基於以下幾個因素，仍無法滿足效能需求: * 每個網路連線都涉及 [fork](http://man7.org/linux/man-pages/man2/fork.2.html)，開銷大：定址空間是 copy-on-write (CoW)，因此初期開銷不大，但仍有複製親代行程 page table 的操作，而執行緒卻可分享同一個 page table。 * 行程間的排程器承受的壓力大：考慮到系統因大量連線而有成千上萬個行程，排程器勢必要耗費時間於挑選出下一個欲執行的行程，及進行 context switch，整體開銷可觀，可參見〈[A Design Framework for Highly Concurrent Systems](https://people.eecs.berkeley.edu/~culler/papers/events.pdf)〉的分析。 * 在負載沈重的狀況下，行程易於消耗大量的記憶體資源：每個網路連線對應一個獨立的定址空間，即使在執行緒的狀況下，每個連線也會佔用資源。此外，[fork](http://man7.org/linux/man-pages/man2/fork.2.html) 後 parent-child 行程之間需要 IPC，這也是成本。改用執行緒雖可克服 [fork](http://man7.org/linux/man-pages/man2/fork.2.html) 開銷的問題，但排程器和記憶體開銷仍無法解決。單純行程和執行緒的模型在本質相似，被歸類於 process-per-connection model，無法滿足高並行 (concurrency) 需求。可引入 [thread pool](https://en.wikipedia.org/wiki/Thread_pool)：事先建立數量固定的執行緒，依據需求分配任務到這些執行緒，免去 TCP 伺服器在服務客戶端請求時，需要重新建立執行緒的疑慮。引入 [thread pool](https://en.wikipedia.org/wiki/Thread_pool) 後，TCP 伺服器的基本架構是，主要的循環 (main loop) 用 [accept](http://man7.org/linux/man-pages/man2/accept.2.html) 網路連線，之後將該連線分配給 [thread pool](https://en.wikipedia.org/wiki/Thread_pool) 的某個執行緒，處理後這個執行緒又可處理別的網路連線。 > 延伸閱讀: [圖解 Go 執行時期排程器](https://tonybai.com/2020/03/21/illustrated-tales-of-go-runtime-scheduler/) ![](https://i.imgur.com/bhsp5jC.png) 乍看 [thread pool](https://en.wikipedia.org/wiki/Thread_pool) 是個理想的方案，但在真實情況中，很多連線都要長時間處理 (longer keep alive time，即在一個 TCP 連線中進行多次通訊)，一個執行緒收到任務後，處理完第一批來的資料，此時會再次呼叫 [read](http://man7.org/linux/man-pages/man2/read.2.html)，但卻無法確認對方何時發送新資料，於是該執行緒就被這個 [read](http://man7.org/linux/man-pages/man2/read.2.html) 給阻塞 (blocked) 住了，這也是預設 I/O 操作的模式 —— 即便該 fd 沒有任何資料，呼叫 [read](http://man7.org/linux/man-pages/man2/read.2.html) 仍會致使行程的 blocking。於是，在 [thread pool](https://en.wikipedia.org/wiki/Thread_pool) 上若有 $n$ 個執行緒，只要遇上第 $n + 1$ 個長時間的連線，系統仍無法滿足並行需求。不難發現問題的根源是 [read](http://man7.org/linux/man-pages/man2/read.2.html) 預設行為致使的 blocking，解法是將後者該換為 non-blocking I/O。[read](http://man7.org/linux/man-pages/man2/read.2.html) 的做法是若有資料則返回資料，如果沒有可讀資料就返回 `-1` 並設定 errno 為 `EAGAIN`，後者表明下次有資料再繼續處理。 > `EAGAIN` The file descriptor fd refers to a file other than a socket and has been marked nonblocking (`O_NONBLOCK`), and the read would block. See [open(2)](http://man7.org/linux/man-pages/man2/open.2.html) for further details on the `O_NONBLOCK` flag. 很快我們發現新問題：行程如何得知這個 fd 在何時有資料可讀取呢？這就要透過[事件驅動](https://hackmd.io/@sysprog/event-driven-server) (event-driven) 模型來克服。 ## Reactor Pattern 同步的事件循環搭配 non-blocking I/O 又稱為 [Reactor pattern](http://en.wikipedia.org/wiki/Reactor_pattern)。 ![Reactor pattern](https://hackmd.io/_uploads/Bk0cxVg1ll.png) 事件循環用來通知特定事件，若 listenfd 可讀取，則呼叫 [accept](http://man7.org/linux/man-pages/man2/accept.2.html)，把新建立的 fd 加入 epoll 中；若是普通的連線 fd，將其加入到一個生產者與消費者佇列中，讓 worker thread 取用。thread pool 用來做計算，從一個生產者與消費者佇列裡領取一個 fd 作為計算輸入，直到讀到 `EAGAIN` 為止，保存現在的處理狀態（狀態機），等待事件循環，直到這個 fd 讀寫事件的下一次通知。 [Reactor pattern](http://en.wikipedia.org/wiki/Reactor_pattern) 是個設計模式 (design pattern)，針對同時處理多個或單個服務需求。服務處理任務 (service handler) 將傳入的需求拆解，然後分派給相關的請求端處理任務 (request handler)。再來思考為何需要 [Reactor pattern](http://en.wikipedia.org/wiki/Reactor_pattern)。考慮這樣的情境：web 應用程式收到請求，處理過後接著要讀取資料或是將資料寫入到資料庫，中間勢必要等待其他功能模組或程式進行，延遲的時間可能數十毫秒或更長，但原本的 web 應用程式就處於等待的狀態，也就是不做事情，但 CPU 佔用率仍居高不下，因為行程處理資料庫的存取、檔案的 I/O 或者是跟其他服務的**溝通之間的等待**都存在可觀成本。在 Reactor Pattern 中有個單執行緒的事件循環 (event loop)，不停地循環等待新事件的發生，事件處理的過程中如有Blocking I/O 的產生，會使用 **defer** 手段，將動作轉包到事件循環以外的其他執行緒，從而讓等待資料庫讀取的狀況發生在另外一個執行緒，這樣的好處很明顯 —— 主要的事件循環不會發生 blocking，並可繼續服務下一個事件，等待資料庫讀取完畢、資料回傳後，再透過預先設定的 callback，將資料傳回主要事件循環中，自此繼續受理各項請求。 ## Edge vs. Level Triggering select 和 poll 屬於 level-triggered Asynchronous I/O 則傾向 edge-triggered. `epoll` 的兩種工作模式談起: 1. Edge Triggered (ET, 邊緣觸發) 2. Level Triggered (LT, 條件觸發) ![](https://i.imgur.com/U5MvTC2.png) ET 表示在狀態改變時才通知（例如: 在邊緣上從低電位到高電位），LT 表示在這個狀態才通知（例如: 只要處於低電位就通知)。對應到 epoll，ET 指一旦有新資料就通知（狀態的改變），而 LT 是「只要有新資料」就會持續通知，直到緩衝區的資料全數取出。 ## 實作考量點 :question: 若將 fd 放入生產者與消費者佇列後，拿到這個任務的 worker thread 尚未讀完這個 fd，因為沒讀取完資料，所以這個 fd 可讀，那麼下一次事件循環又返回這個 fd，還分配給其他的執行緒，該如何處理？ :thinking_face: 若某 fd 對應到 2 KiB 的資料，程式只讀取 1 KiB，ET 就不會在下次 `epoll_wait` 時返回，而是會在讀取結束後又有新資料之際才回傳。而 LT 每次都會回傳該 fd，只要該 fd 有資料可讀。所以我們要用 epoll 的 ET，把 fd 設為 `O_NONBLOCK`，如果返回某 fd 可讀，循環 read 直到 `EAGAIN`（若 read 回傳 `0`，則遠端關閉連線）。 > 實驗程式碼: [test_epoll_lt_and_et](https://github.com/Manistein/test_epoll_lt_and_et) :question: 如果某個執行緒在處理 fd 的同時，又有新的一批資料進來，該 fd 可讀（注意和上一題的區別：一個是處理同一批資料，另一個是處理新一批資料），那麼該 fd 會被分配另一個執行緒，這樣兩個執行緒處理同一個 fd 就會導致錯亂。 :thinking_face: 設定 `EPOLLONESHOT` 可解決這個問題。在 fd 回傳可讀後，需要另行設定，才能讓 epoll 重新回傳這個 fd。 :question: 當 Web 伺服器和客戶端網頁瀏覽器保持著一個長期連線的狀況下，遇到 (1) 網頁瀏覽器突然關閉; (2) 客戶端網路離線; 這兩種狀況，伺服器該如何處理？ :thinking_face: 分兩種狀況處理: 1. 此時該 fd 在事件循環裡會回傳一個可讀事件，然後分配給某個執行緒，該執行緒 read 會回傳 `0`，代表對方已關閉這個 fd，於是伺服器端呼叫 close 即可; 2. 通訊協定無法立刻知悉，所以僅能透過伺服器引入 timer 逾時事件來克服 timer 可透過 priority queue 實作，後者又常用 binary heap 實作，可參見以下: * [Priority Queue 簡介](https://alrightchiu.github.io/SecondRound/priority-queueintrojian-jie.html) * [Priority Queue：Binary Heap](https://alrightchiu.github.io/SecondRound/priority-queuebinary-heap.html) :question: 將 fd 設定為 non-blocking 後，需要不斷地 read 直到回傳 `-1`，errno 為 EAGAIN，然後等待下次 epoll_wait 回傳這個 fd 再繼續處理。這時就遇到了 blocking read 不曾遇到的問題：資料可能分批送達，於是在通訊協定解析到一半時，read 就回傳 `-1`，所以我們必須將已讀到的資料保存下來，並維護其狀態，以表示是否仍需要資料。例如解析 HTTP Request Header 時，讀到 `GET /index.html HTT` (注意：少一個 `P`) 就結束，在 blocking I/O 裡只要繼續 read 就可處理，但在 nonblocking I/O，我們必須維護這個狀態，下一次必須讀到欠缺的 `P` 字元，否則 HTTP 通訊協定會遇上解析錯誤。 :thinking_face: 解法是維護一個狀態機，在解析 Request Header 時候對應一個狀態機，解析 Header Body 的時候也維護一個狀態機。 ## 參考資訊 * [Concurrent HTTP Server with Epoll](https://kaleid-liner.github.io/blog/2019/06/02/epoll-web-server.html) * [fancy: A high performance web server](https://github.com/guangqianpeng/fancy) * [poll vs select vs event-based](https://daniel.haxx.se/docs/poll-vs-select.html) * [Async IO on Linux: select, poll, and epoll](https://jvns.ca/blog/2017/06/03/async-io-on-linux--select--poll--and-epoll/) * [Fast UNIX Servers](https://nick-black.com/dankwiki/index.php/Fast_UNIX_Servers) * [Concurrent Servers](https://eli.thegreenplace.net/2017/concurrent-servers-part-1-introduction/)