2020q1 Final Project (High performance Web Server)

contributed by < jwang0306 >

This is essentially a further study that builds on the sehttpd assignment.

Performance measurement tools

The first step is to pick a good measurement tool.

apache bench

Usage: ab -n 100000 -c 500 -k http://127.0.0.1:8081/
Drawback: single-threaded, so it cannot take advantage of multi-core hardware

htstress

Usage: htstress -n 100000 -c 500 -t 4 127.0.0.1:8081/
Advantages: epoll-driven, supports multiple threads
Drawback: no keep-alive support, so it cannot exercise HTTP/1.1 features

Both tools above were introduced by the instructor; unfortunately, each has a small drawback.

weighttp

After some searching, I found a tool that overcomes both drawbacks: weighttp, developed by the lighttpd project. Several larger benchmark comparisons use it and rate it highly:

Linux Web Server Performance Benchmark – 2016 Results
- Once again I made use of Weighttpd to perform the actual benchmark tests, as I’ve found that it works well and scales quite nicely with multiple threads.
G-WAN web server official site
- Weighttp is by far the best stress tool we know today: it uses the clean AB interface and works reasonably well. It could be made even faster by using leaner code, but there are not many serious coders investing their time to write decent client tools, it seems.
Usage: weighttp -n 100000 -c 500 -t 4 -k 127.0.0.1:8081/

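Performance measurement environment

The measurement environment matters just as much. Running the server in one terminal and the stress-test client in another on the same machine is not a fair test, because the client competes with the server for resources while it runs.

A more objective approach is to separate the two environments: use two computers or, more simply, two virtual machines, which is also the practice of some large-scale benchmarks.

I only have one computer, so I run two virtual machines. Either VirtualBox or VMware works; I use the former because it was already installed. VirtualBox can be downloaded by following this tutorial, the ISO image comes from the official Ubuntu site, and this tutorial covers the installation process.

Once the environment is set up and everything needed (e.g. the measurement tool, your project…) is installed, remember to configure the network: the two VMs must have different IP addresses (by default they are the same). Ping each VM from the other to check that they can reach each other. Once both directions work, run the server on one VM and the client on the other.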

I/O event models

Non-Blocking model

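Whether or not the operation has completed, the call returns a status immediately, so the event loop thread can keep running. For a socket this is set up as follows: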

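int flags = fcntl(fd, F_GETFL, 0); /* fetch the current file status flags first */
flags |= O_NONBLOCK;
int s = fcntl(fd, F_SETFL, flags); /* s is -1 on failure */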

The read event handler then has to be written like this:

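/* fd and buf are assumed to come from the surrounding event loop */
ssize_t ret = read(fd, buf, sizeof(buf));
if (ret == 0) {
    // EOF encountered: the peer closed the connection
    break;
} else if (ret < 0) {
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        // no more data for now: the normal end of a non-blocking read
        break;
    } else {
        perror("read");
        break;
    }
} else {
    // --snippet--
}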
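
The two fragments above can be combined into a small self-contained sketch. The helper names `set_nonblocking` and `drain_fd` are hypothetical and do not come from sehttpd. The sketch retries on `EINTR` and treats `EAGAIN`/`EWOULDBLOCK` as "nothing left for now" rather than as an error.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to non-blocking mode.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: read until EOF or until the kernel buffer is empty.
 * Returns the total number of bytes read, or -1 on a real error. */
long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n == 0)
            break; /* EOF: the peer closed its end */
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal, retry */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break; /* buffer drained, not an error */
            return -1;
        }
        total += n;
    }
    return total;
}
```

Called on the read end of a pipe that holds five bytes, `drain_fd` returns 5; a second call returns 0 immediately instead of blocking.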

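Epoll - event driven

select - $O(n)$
epoll - $O(1)$

epoll is generally considered the more efficient of the two and is practically the default choice for web servers on Linux. The most commonly used functions are: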

epoll_create

Creates an epoll instance

int epoll_create(int size);

epoll_wait

Waits for events and returns the ready ones among the fds in the interest list

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

epoll_ctl

Supports three operations: EPOLL_CTL_ADD, EPOLL_CTL_MOD, EPOLL_CTL_DEL

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
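
As a minimal sketch of how the three calls fit together, the hypothetical helper `wait_readable` below creates an epoll instance, registers a single fd for `EPOLLIN`, and waits on it. The `size` argument of `epoll_create` is ignored by modern kernels but must be positive. A real server would create the instance once and reuse it rather than per call.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd became readable within timeout_ms,
 * 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    assert(fd >= 0);

    int epfd = epoll_create(1); /* size is only a hint, must be > 0 */
    if (epfd == -1)
        return -1;

    /* add fd to the interest list, asking for "readable" events */
    struct epoll_event ev = {.events = EPOLLIN, .data.fd = fd};
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }

    /* block until fd is ready or the timeout expires */
    struct epoll_event ready[1];
    int n = epoll_wait(epfd, ready, 1, timeout_ms);
    if (n == 1 && ready[0].data.fd != fd)
        n = -1; /* should not happen with a single registered fd */

    close(epfd);
    return n;
}
```

On an empty pipe `wait_readable(read_end, 0)` returns 0; after one byte is written to the other end it returns 1.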

Trigger modes
- Edge Triggered (ET)
- Level Triggered (LT)

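ET means a notification is sent only when the state changes (e.g. on the edge where a signal goes from low to high), while LT means a notification is sent for as long as the state holds (e.g. whenever the signal sits at a given level). Mapped onto epoll, ET notifies once when new data arrives (a state change), whereas LT keeps notifying as long as unread data remains, until the buffer has been fully drained.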
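
The difference can be observed with a short experiment. The hypothetical function `count_wakeups` below writes one byte to a pipe, deliberately never reads it, and counts how many of `rounds` consecutive `epoll_wait` calls report the fd. Passing `0` as `extra_flags` gives level-triggered behaviour (epoll's default); passing `EPOLLET` gives edge-triggered behaviour.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical experiment: how often does epoll report unread data?
 * Returns the number of wakeups, or -1 on setup failure. */
int count_wakeups(uint32_t extra_flags, int rounds)
{
    assert(rounds > 0);

    int p[2];
    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create(1);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev = {.events = EPOLLIN | extra_flags,
                             .data.fd = p[0]};
    int wakeups = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        wakeups = 0;
        for (int i = 0; i < rounds; i++) {
            struct epoll_event out;
            /* timeout 0: poll only; the byte is left in the pipe */
            if (epoll_wait(epfd, &out, 1, 0) == 1)
                wakeups++;
        }
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return wakeups;
}
```

With three rounds, `count_wakeups(0, 3)` reports the fd every time, while `count_wakeups(EPOLLET, 3)` reports it only once. This is why an ET handler has to keep reading until `EAGAIN` before returning to `epoll_wait`.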

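Is epoll $O(1)$?

That select is $O(n)$ is easy to understand: it scans the whole set and checks every fd. But why is epoll said to be $O(1)$? Below is the eventpoll structure, the object behind the epfd returned by epoll_create: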

struct eventpoll {
    /* Protect the access to this structure */
    spinlock_t lock;

    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    struct mutex mtx;

    /* Wait queue used by sys_epoll_wait() */
    wait_queue_head_t wq;

    /* Wait queue used by file->poll() */
    wait_queue_head_t poll_wait;

    /* List of ready file descriptors */
    struct list_head rdllist;

    /* RB tree root used to store monitored fd structs */
    struct rb_root rbr;

    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    struct epitem *ovflist;

    /* wakeup_source used when ep_scan_ready_list is running */
    struct wakeup_source *ws;

    /* The user that created the eventpoll descriptor */
    struct user_struct *user;

    struct file *file;

    /* used to optimize loop detection check */
    int visited;
    struct list_head visited_list_link;
};

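Internally, eventpoll keeps all monitored fds in a red-black tree. Adding or removing an fd goes through epoll_ctl, which has to find that fd in the tree; this lookup (ep_find in the kernel source) costs $O(\log n)$.
eventpoll also maintains a doubly linked list, rdllist, as the ready list. As soon as data arrives on a socket, ep_poll_callback fires and puts that fd on rdllist.
A call to epoll_wait returns the fds on rdllist, i.e. the ones that are already ready. epoll rests on a few assumptions:
- Adding and removing fds is infrequent; waiting for readable events is the common case
- The number of fds that are actually ready should be far smaller than the number of monitored fds (all the fds in the RB-tree)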

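Because adding and removing fds is assumed to be infrequent, the $O(\log n)$ cost is usually left out; and because the number of ready fds is assumed to be far smaller than the number of monitored fds, the cost never reaches $O(n)$ and can be treated as $O(1)$. That is why epoll is commonly said to be $O(1)$, although strictly speaking this is not accurate.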

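epoll is therefore best suited to a large number of simultaneous connections of which relatively few are active at any moment. If the total number of fds is small and every socket is constantly active with incoming data, epoll may not perform as well.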
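
To make the "many monitored, few ready" assumption concrete, the sketch below registers `n_total` pipes with one epoll instance, makes only `n_active` of them readable, and asks `epoll_wait` for everything that is ready. The function name `ready_among` is hypothetical. It illustrates what `epoll_wait` returns, and is not a benchmark.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical experiment: monitor n_total fds, activate n_active of them,
 * return how many events epoll_wait hands back (-1 on setup failure). */
int ready_among(int n_total, int n_active)
{
    assert(n_total > 0 && n_active >= 0 && n_active <= n_total);

    int (*pipes)[2] = malloc(sizeof(*pipes) * n_total);
    struct epoll_event *out = malloc(sizeof(*out) * n_total);
    int epfd = epoll_create(1);
    int made = 0, n_ready = -1;

    if (!pipes || !out || epfd == -1)
        goto done;

    for (; made < n_total; made++) {
        if (pipe(pipes[made]) == -1)
            goto done;
        /* each EPOLL_CTL_ADD is one insertion into the interest tree */
        struct epoll_event ev = {.events = EPOLLIN,
                                 .data.fd = pipes[made][0]};
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[made][0], &ev) == -1) {
            close(pipes[made][0]);
            close(pipes[made][1]);
            goto done;
        }
    }

    for (int i = 0; i < n_active; i++)
        if (write(pipes[i][1], "x", 1) != 1)
            goto done;

    /* only fds on the ready list come back, not all monitored fds */
    n_ready = epoll_wait(epfd, out, n_total, 0);

done:
    for (int i = 0; i < made; i++) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    if (epfd != -1)
        close(epfd);
    free(out);
    free(pipes);
    return n_ready;
}
```

With 64 monitored pipes and 3 active ones, `ready_among(64, 3)` returns 3: the caller only has to look at the ready entries, whereas with select it would have to scan all 64.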

Preforking

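This is Nginx's main model. The benefits of preforking are roughly the following:

Memory overhead is well controlled
The master process can easily manage its child processes
Race conditions are less likely

Preforking can also be combined with different socket options, such as SO_REUSEPORT and SO_REUSEADDR. These were covered in the previous assignment, so they are not repeated here.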
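
A minimal sketch of the prefork pattern, under several assumptions: the master binds one listening socket on the loopback interface, forks all workers up front, and every worker blocks in `accept` on the inherited socket. So that the demo terminates, each worker serves exactly one connection and the master also plays the client. The names `worker_serve_one` and `prefork_demo` are hypothetical, and this is not how Nginx or sehttpd structure their workers.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Worker: serve exactly one connection on the inherited socket, then exit. */
void worker_serve_one(int listen_fd)
{
    alarm(5); /* safety net: die instead of blocking forever in accept() */
    int conn = accept(listen_fd, NULL, NULL);
    if (conn == -1)
        _exit(1);
    ssize_t sent = write(conn, "k", 1);
    close(conn);
    _exit(sent == 1 ? 0 : 1);
}

/* Master: bind, prefork n_workers children, act as the client, reap them.
 * Returns how many workers served a connection and exited cleanly. */
int prefork_demo(int n_workers)
{
    assert(n_workers > 0 && n_workers <= 64);

    struct sockaddr_in addr = {.sin_family = AF_INET}; /* port 0: any free */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    socklen_t len = sizeof(addr);

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd == -1)
        return -1;
    if (bind(lfd, (struct sockaddr *) &addr, len) == -1 ||
        listen(lfd, n_workers) == -1 ||
        getsockname(lfd, (struct sockaddr *) &addr, &len) == -1) {
        close(lfd);
        return -1;
    }

    pid_t pids[64];
    int forked = 0;
    for (int i = 0; i < n_workers; i++) { /* fork before any client arrives */
        pid_t pid = fork();
        if (pid == 0)
            worker_serve_one(lfd); /* child: never returns */
        if (pid > 0)
            pids[forked++] = pid;
    }

    for (int i = 0; i < forked; i++) { /* one client connection per worker */
        char byte;
        int c = socket(AF_INET, SOCK_STREAM, 0);
        if (c == -1)
            continue;
        if (connect(c, (struct sockaddr *) &addr, sizeof(addr)) == 0) {
            ssize_t got = read(c, &byte, 1);
            (void) got; /* failures show up in the worker's exit status */
        }
        close(c);
    }

    int clean = 0;
    for (int i = 0; i < forked; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] && WIFEXITED(status) &&
            WEXITSTATUS(status) == 0)
            clean++;
    }
    close(lfd);
    return clean;
}
```

Calling `prefork_demo(4)` forks four workers that share one listening socket; the kernel hands each incoming connection to one of the workers blocked in `accept`. A real server would keep its workers in an accept loop instead of exiting after one connection.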

Thread pool

Node.js, for example, spawns threads to handle requests, but requests can just as well be handled by separate processes; it depends on how the designer sets things up.

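How to avoid race conditions?

In my earlier implementation, if a timeout fires while a request is still halfway through processing, the timer frees the request's memory, which can cause an illegal memory access in the worker thread. I have not yet found a perfect solution.

Nginx handles this as follows:

All timer management is done by the main thread.
Once a connection request has been accepted and processing (read) has started, its timer is deleted. That is why Nginx's timers need no lock.
Only a small part of the request handling is handed to the thread pool (the timer was deleted long before that).
After the thread pool finishes the CPU-intensive work, it notifies the main thread, which then finishes the rest
- This is done through ngx_notify. In the thread pool, the argument passed to notify is ngx_thread_pool_handler, which means the main thread will run that function
- ngx_notify itself is backed by an fd that is added to the epoll queue when the epoll module is initialized
- A notification is triggered by writing sizeof(uint64_t) bytes to the notify fd, which makes it readable, fires a read event, and runs ngx_thread_pool_handler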
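
The notify mechanism described above can be sketched with an eventfd registered in epoll. This is a simplified, single-threaded sketch and not Nginx's code: `struct notifier`, `notifier_init`, `notifier_notify`, `notifier_poll` and `count_handler` are hypothetical names, using an eventfd as the notify fd is an assumption (the notes only say "an fd"), and the worker thread is simulated by a plain function call.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

typedef void (*notify_handler)(void *ctx);

struct notifier {
    int epfd;               /* the main thread's epoll instance */
    int notify_fd;          /* eventfd used only for wakeups */
    notify_handler handler; /* what the main thread should run */
};

/* Register the notify fd with epoll at initialization time. */
int notifier_init(struct notifier *n)
{
    assert(n != NULL);
    n->handler = NULL;
    n->epfd = epoll_create(1);
    n->notify_fd = eventfd(0, EFD_NONBLOCK);
    if (n->epfd == -1 || n->notify_fd == -1)
        return -1;
    struct epoll_event ev = {.events = EPOLLIN | EPOLLET, .data.ptr = n};
    return epoll_ctl(n->epfd, EPOLL_CTL_ADD, n->notify_fd, &ev);
}

/* Worker side: writing a uint64_t makes notify_fd readable. */
int notifier_notify(struct notifier *n, notify_handler h)
{
    uint64_t inc = 1;
    n->handler = h;
    return write(n->notify_fd, &inc, sizeof(inc)) == sizeof(inc) ? 0 : -1;
}

/* Main-thread side: one event loop iteration.
 * Returns 1 if the handler ran, 0 otherwise. */
int notifier_poll(struct notifier *n, void *ctx, int timeout_ms)
{
    struct epoll_event out;
    if (epoll_wait(n->epfd, &out, 1, timeout_ms) != 1 || out.data.ptr != n)
        return 0;
    uint64_t count; /* reading resets the eventfd counter to zero */
    if (read(n->notify_fd, &count, sizeof(count)) != sizeof(count))
        return 0;
    if (n->handler)
        n->handler(ctx);
    return 1;
}

/* Example handler: counts how many times it was invoked. */
void count_handler(void *ctx)
{
    (*(int *) ctx)++;
}
```

After `notifier_notify(&n, count_handler)`, the next `notifier_poll(&n, &hits, 100)` wakes up, reads the counter and runs the handler in the polling thread. With real threads the `handler` field would need proper synchronization.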

I tried the following approach to fix the race condition in the current project:

Each request structure carries its own lock

and can be in one of three states:

enum request_status {
    UNPROCESSED,
    PROCESSED_BY_THREAD,
    PROCESSED_BY_TIMER,
};

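The initial state is UNPROCESSED
- If a worker thread gets the request first, the state is set to PROCESSED_BY_THREAD; if processing has not finished yet, …
- If the timer gets the request first, the state is set to PROCESSED_BY_TIMER
Once both sides have seen the request, it can be freed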
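
One possible reading of this scheme, as a sketch rather than the project's actual code: whichever side visits the request first records itself in `status`, and the second visitor frees it. `struct request` here contains only the lock and the state, and `request_new` and `request_visit` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

enum request_status {
    UNPROCESSED,
    PROCESSED_BY_THREAD,
    PROCESSED_BY_TIMER,
};

/* Reduced to the two fields the scheme needs. */
struct request {
    pthread_mutex_t lock;
    enum request_status status;
};

struct request *request_new(void)
{
    struct request *r = malloc(sizeof(*r));
    if (!r)
        return NULL;
    if (pthread_mutex_init(&r->lock, NULL) != 0) {
        free(r);
        return NULL;
    }
    r->status = UNPROCESSED;
    return r;
}

/* Called exactly once by each side (worker thread and timer).
 * The first visitor marks the request and returns 0.
 * The second visitor frees the request and returns 1. */
int request_visit(struct request *r, enum request_status who)
{
    assert(r != NULL && who != UNPROCESSED);

    pthread_mutex_lock(&r->lock);
    int second = (r->status != UNPROCESSED);
    if (!second)
        r->status = who;
    pthread_mutex_unlock(&r->lock);

    if (second) {
        /* the other side is done with r, so nobody else holds the lock */
        pthread_mutex_destroy(&r->lock);
        free(r);
    }
    return second;
}
```

If the timer calls `request_visit(r, PROCESSED_BY_TIMER)` first, the request stays alive; the later call from the worker with `PROCESSED_BY_THREAD` frees it. The sketch is only safe if each side really visits exactly once.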

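Some cases still cannot be avoided, however:

If the timer has already freed the request while its task is still sitting in the task queue, the result is a heap-use-after-free
If both sides are required to have seen the request, suppose I connect to port 8081 and then sit idle: when the timer expires it closes the request but does not free it yet, and because I never do anything else, the request is never put on the task queue. The fd gets closed, but the memory is never released.

So the biggest difference from Nginx is that the entire task is handed to the thread pool, which leaves no way of knowing whether the task is still in the task queue.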

Preforking + Threadpool

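Nginx's thread pool, when enabled, can be configured for every child process; its purpose is to handle requests that would otherwise block. With a thread pool, work such as upstream reads and writes can be handed to the pool, which notifies the main thread once the data is ready so that it can be written back to the client. This lowers latency to some extent.

The performance impact of this strategy still needs to be measured.

Where does a request spend most of its time?

Once the test environment is set up, measuring with perf should give a more objective answer.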

fastCGI

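How does Nginx communicate with programs such as PHP?

coroutine

Could a coroutine implementation be used as a reference?

tasklet

Could tasklets be used as low-cost threads to handle requests?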
