Linux 核心專題: 改進高效網頁伺服器

執行人: SPFishcool
專題解說錄影

:question: 提問清單

任務簡述

lwan 是個優秀的高效網頁伺服器，運用若干 Linux 進階特徵，程式碼簡潔清晰，目標:

研讀高效能網頁伺服器 lwan 的技術文件和程式碼，理解其關鍵的設計
理解 lwan 如何運用 Linux 核心系統呼叫和 coroutine 達到更好的效能，應實作原生的 Arm64 支援 (不依賴 ucontext)
在 Arm64 量化分析並探討可能的改進

lwan 第一手材料

技術報告:

相關資訊

TODO: 研讀 Linux 核心設計: 針對事件驅動的 I/O 模型演化，紀錄問題，討論 nginx 的內部設計

Q1: 在研讀 I/O multiplexing 部份時有一個疑問，在講座直播上提到 I/O multiplexing 實際上還是一個 blocking 的 I/O 模型，但後續有提到 epoll 會先 epoll_wait 監聽所有的 fd 是否有事件發生，若有偵測到會處理事件，否則可以先做其他事情，這件事聽起來像是 non-blocking。
但後來思考後學生在想是不是其 blocking 的地方在於事件被偵測到之後所要處理的動作是需要等待所有發生事件的 fd 處理完後才可以繼續做其他事情？以教材內的範例來說，Blocking 的地方位於 epoll_wait 後的 for loop 部分？

eecheng 應該更正為 epoll_wait 本質上還是 synchronous I/O 操作。

在 Linux 中可以透過 fcntl 一類系統呼叫來設定檔案為 blocking 或 non-blocking，在大部分的情況下，兩者都可與 epoll/select 並用 (除了 edge-trigger 模式的 epoll 下只能用 non-blocking)。
舉例來說，若想讀取某個 socket fd 且 fd 為 blocking，搭配 select 的使用可能為:

if (select(FD_SETSIZE, &fd, NULL, NULL, NULL) < 0) { 
    exit(1);
}

for (int i = 0; i < FD_SETSIZE; i++) {
    if (FD_ISSET(i, &fd)) {
        if ((res = read(i, buf, sizeof(buf))) >= 0) {
            read_callback(res, buf);
        } else {
            exit(1);
        }
    }
}

而使用 non-blocking 搭配 select 的情境可能為，若 buf 是一塊很小的 buffer (e.g., 8 bytes)，但即將讀入的內容非常大。若使用 blocking 的策略，正確性不會有問題，但是每次只能讀 8 bytes，然後在呼叫 select，接著再讀 8 bytes。可以發現在「明明知道將讀入的內容已經準備好」的情況下還呼叫 select 檢查資料狀態是多餘的。
所以這時候可以透過 non-blocking 的機制來避免這樣的問題，可能的程式碼為:

if (select(FD_SETSIZE, &fd, NULL, NULL, NULL) < 0) { 
    exit(1);
}

for (int i = 0; i < FD_SETSIZE; i++) {
    if (FD_ISSET(i, &fd)) {
        for (;;) { 
            res = read(i, buf, sizeof(buf));
            if (res >= 0) {
                read_callback(res, buf);
            } else {
                if (errno != EAGAIN) { 
                    exit(1);
                }
                // read from large data is completed
                break;
            }
        }
    }
}

上述兩個例子的 read 均為 synchronous，代表即使資料已經準備好了，但執行 read 時，程式執行邏輯仍會 block 在該行、無法往下走，直到讀完。

nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
...
for (n = 0; n < nfds; ++n) {
   if (events[n].data.fd == listen_sock) {
    /* do connection */
   } else {
    /* do request */
       do_use_fd(events[n].data.fd);
   }
}

SPFishcool所以「I/O multiplexing 實際上還是一個 blocking 的 I/O 模型」是因為不管是 select/epoll 在挑選事件觸發的 fd，或是讀寫 fd 的這些過程中整個程式還是會 block 在這些操作？

eecheng
是的。避免混淆，或許可寫為 I/O multiplexing 實際上為所需資料準備好的 blocking I/O，不會在 wait for data 這段阻塞，只會在執行資料操作 (e.g., copy data from kernel to user) 時阻塞。

PS: 那份資料是我前年整理的，敘述有部分混淆。你重新理解後或許可以對其修改，感謝。

SPFishcool好的！感謝您的解答！

Q2: 在講座中有提到 nginx 中 worker 的預設組態是 process，且如果有 worker crash 了，Master 可以重新 fork 再特化成 worker，但如果 worker 為 thread，其 crash 可能導致其他 worker 甚至 master 也 crash 是甚麼原因？

eecheng
造成 crash 通常可以透過 signal 的方式來通知其他 process/signal，nginx 應該是由 master process 管理所有來自 child process 或 worker thread 的 signal。只要是有 signal 觸發，程式控制就會轉交給初始化時註冊的 signal handler，以 nginx 來說，你可以參考 ngx_signal_handler 內如何處理各種 signal。可搭配參考 nginx 自訂義的 signal 種類。同時也可參考 linux 的 signal 介紹。

SPFishcool謝謝，我有發現 nginx 的 ngx_signal_t 有紀錄每一種 signal 的 ngx_signal_handler，那 Worker 的 Signal 觸發時是會呼叫其註冊的 ngx_signal_handler，之後 Master 監控時會因為 ngx_signal_handler 所作的處理從而注意到其 Signal 的變化給予相對應的行動嗎？

eecheng
舉例來說，若 child 因為某中原因，需要中止目前的程式，那可以透過改變 global variable ngx_quit 來告知 master 程式要終止了。對應的 master 邏輯在這裡。

Q3: 在 nginx 裡的程式碼有許多 ngx_cycle_t *cycle 的物件，像是 ngx_process_cycle.c 裡面的函式和 nginx.c有做 init_cycle 的動作，這些 cycle 是否就是代表一個 process？

eecheng
比較像是一個 event loop 的抽象概念

Q4: worker 內部都會做 epoll 監聽 fd，那其 fd_list 是 shared 還是 Master 會分配？

eecheng
listener fd 的變數名稱為 global variable ep，其在 ngx_epoll_init 中初始化。我認為此變數不是共享，因為每個 worker 為不同 process，有獨立的 heap，所以全域變數不會共享。

nginx 內部設計

參照事件驅動伺服器：原理和實例對於 nginx 的介紹

組成
nginx 主要由一個 Master process 開始做初始化，包含 worker 建立與管理。而 worker 一旦建立且初始化， Master Process 會將其綁定在特定的 CPU 上，當其 Process 被終止或 crash 被 Master Process 發現則可以回收其資源。
工作
其 worker 主要負責監測事件發生，也就是 epoll，當監聽到事件發生， worker 就會將任務丟至 queue 裡面，thread pool 裡的 thread 則負責從 queue 中領取任務並 lock 住避免任務被重複領取，執行完成後就把任務放置 finish queue，這邊就類似 io_uring 的 SQ(submission queue) 與 CQ(complete queue)
Signal
Process 間的溝通使用了 signal，nginx 在 ngx_process.c 定義了 signal 種類與其 signal_headler，但經過 grep -r ngx_signal_handler，學生並未發現其他除了初始化與函數定義之外有用到其 signal_handler 的地方，因此學生這裡推測「當接收到訊號，就會觸發對應的 signal_handler，讓某些跟 Signal 相關的數值發生改變」，而 ngx_process_cycle.c 內的 ngx_master_process_cycle 和 ngx_worker_process_cycle 會檢查這些數值去做對應的動作。

TODO: 研讀 lwan 技術文件和程式碼，理解其關鍵的設計

比照 ktcp 的描述風格，導讀設計思維和相關技術選擇，並搭配關鍵的程式碼予以解說，過程中應研讀上方指定的第一手材料、技術報告，和相關資訊，進行重點提示和記錄自己的見解和疑慮。

在 Life of a HTTP request, as seen by my toy web server 中敘述 lwan 接收到 http request 時的處理過程，其中也包括一些 lwan 的重點技術與設計。

lwan 的輕量化除了組成架構外，還包括了程式碼的設計，作者在第一手資料上有陸續提到在很多議題上仍存在著更好的解法，但為了程式碼的可讀性作者在非必要改進的地方仍選擇了原本的寫法。

利用 Coroutine 實作 Event loop

lwan 利用 coroutine 實作 event loop，主要結構還是 epoll_wait，但在監聽到 event 發生時處理 I/O 事件時會 resume 至目標連線的 corotine，這時處理 I/O 結束時會有以下狀況：

結束連線
完成 I/O 處理後 keep alive

第一個狀況監聽到處理完就會直接結束，但第二個 coroutine 機制就會有明顯的效果，在 close 前處於等待時就可以 yield 至 event loop 繼續處理監聽到的 I/O 事件。

參考 Life of a HTTP request, as seen by my toy web server 之 Coroutine 的示意圖

在 lwan-thread.c 裡可以看到 lwan_thread_io_loop

for (;;) {
        int timeout = turn_timer_wheel(&tq, t, epoll_fd);
        int n_fds = epoll_wait(epoll_fd, events, max_events, timeout);
        bool created_coros = false;

        if (UNLIKELY(n_fds < 0)) {
            if (errno == EBADF || errno == EINVAL)
                break;
            continue;
        }

        for (struct epoll_event *event = events; n_fds--; event++) {
            struct lwan_connection *conn = event->data.ptr;

            assert(!(conn->flags & CONN_ASYNC_AWAIT));

            ...
            if (!conn->coro) {
                if (UNLIKELY(!spawn_coro(conn, &switcher, &tq))) {
                    send_last_response_without_coro(t->lwan, conn, HTTP_UNAVAILABLE);
                    continue;
                }

                created_coros = true;
            }

            resume_coro(&tq, conn, epoll_fd);
        }

        if (created_coros)
            timeouts_add(t->wheel, &tq.timeout, 1000);
    }

在 event loop 中，會先檢查 conn->coro 是否建立，若沒有建立則會呼叫 spawn_coro 幫連線建立一個 coroutine，有了 coroutine 之後便會 resume 至連線處理事件。

建立 connect structure 與 Accept connection

在 lwan 在剛開始會建立一個 array of connection structure，這個陣列是依照可接受連線最大數量 SOMAXCONN 決定陣列大小。

static int backlog_size;
static void init_backlog_size(void)
{
#ifdef __linux__
    FILE *somaxconn;

    somaxconn = fopen("/proc/sys/net/core/somaxconn", "re");
    if (somaxconn) {
        int tmp;
        if (fscanf(somaxconn, "%d", &tmp) == 1)
            backlog_size = tmp;
        fclose(somaxconn);
    }
#endif

    if (!backlog_size)
        backlog_size = SOMAXCONN;
}

SOMAXCONN 在 linux 中可以在 /proc/sys/net/core/somaxconn 看到系統預設值

fico@fico-mac:~ -$ cat /proc/sys/net/core/somaxconn 
4096

其他若沒有 somaxconn，lwan 在 socket.h 預設 SOMAXCONN 預設 128。

ALWAYS_INLINE int
lwan_connection_get_fd(const struct lwan *lwan, const struct lwan_connection *conn)
{
    return (int)(intptr_t)(conn - lwan->conns);
}

而且在文中提到取得 fd 的方法是「目標 conn 的位址 - conn 的起始位址」，在 lwan 連線建立時回傳的 fd 會當成 conn 的 index 來使用。

handler trie

針對不同種類的 request 有不一樣的 handler，這些 handler 會在 main function 定義好，lwan 裡將 request handler search 使用 trie 結構實作出來。

Trie 的結構與 huffman 編碼類似，但 trie 在 node 之間的走訪是依據 char 的 ASCII code 進行走訪，只要輸入特定的 prefix (string 型態)，就可以沿著 prefix 找到我們要找的 output leaf，而 Time Complexiy 保證為 $O (n)$ ，其中 n 為 prefix 長度，這樣的架構也在某些情境可取代 hash table（比較不會發生碰撞）。

對比下方結構體的定義，澄清「trie 的 node 為 char」這陳述是否正確。
:notes: jserv

在 lwan trie 結構中，應該是根據 char *key 中每一個 char 的 ASCII code 去決定下一步該走哪一個 next node，學生仔細看 trie wiki 發現其中的資料結構範例也是類似的想法，因此這句話我改成「trie 在 node 之間的走訪是依據 char 的 ASCII code 進行走訪」應該比較符合。

lwan lookup handler 的原理是給定一個特定的 prefix，接下來照著 prefix 往下存取到 leaf，其 leaf 就有我們所要找的 handler function。在 trie 的結構上 lwan 將 node 設計成以下結構。

struct lwan_trie_node {
    struct lwan_trie_node *next[8];
    struct lwan_trie_leaf *leaf;
    int ref_count;
};

lwan 為了符合 x86_64 cacheline 的長度，每一個 node 最多只給八個，這樣剛好一個 node 填滿一個 cacheline。
再者根據 lwan 原本的設計 256 pointers per node 對虛擬記憶體負擔太大，因此將 array 縮減成 8 pointers per node。
然而可以在 lwan-trie.c 裡的 function 發現，index 是根據 *key & 7 決定的（key 為 prefix）

for (knode = &trie->root; *key; knode = &node->next[(int)(*key++ & 7)])
        GET_NODE();

位於 lwan_trie_add() 函式

這樣的設計雖說可能會發生碰撞，但是此設計比較符合硬體架構。

TODO: 針對 Arm64，實作原生的 coroutine

相關 pull request

不依賴 ucontext，在 lwan 既有的 coro 程式碼提供 Arm64 組合語言實作，比對 ucontext 和自行實作的效能落差，過程中應當要留意相關 ABI 規範。

參照 minicoro 實作

lwan 的 coroutine 是讓整個網頁伺服器輕量化重要的一環，藉由 coroutine 與 non-blocking 機制使得整個 request 處理簡化，以下方程式為例：

while (true) {
        struct lwan_request_parser_helper helper = {
            .buffer = &buffer,
            .next_request = next_request,
            .error_when_n_packets = error_when_n_packets,
            .header_start = header_start,
        };
        struct lwan_request request = {.conn = conn,
                                       .global_response_headers = &lwan->headers,
                                       .fd = fd,
                                       .response = {.buffer = &strbuf},
                                       .flags = flags,
                                       .proxy = &proxy,
                                       .helper = &helper};
    
        ...

        if (next_request && *next_request) {
            conn->flags |= CONN_CORK;

            if (!(conn->flags & CONN_EVENTS_WRITE))
                coro_yield(coro, CONN_CORO_WANT_WRITE);
        } else {
            conn->flags &= ~CONN_CORK;
            coro_yield(coro, CONN_CORO_WANT_READ);
        }

        /* Ensure string buffer is reset between requests, and that the backing
         * store isn't over 2KB. */
        lwan_strbuf_reset_trim(&strbuf, 2048);

        /* Only allow flags from config. */
        flags = request.flags & (REQUEST_PROXIED | REQUEST_ALLOW_CORS | REQUEST_WANTS_HSTS_HEADER);
        next_request = helper.next_request;
    }

    coro_yield(coro, CONN_CORO_ABORT);
    __builtin_unreachable();

位於 process_request_coro 函式

此函式是讀取並處理 request 的過程，當 epoll_wait 監測到連線有資料進來便會 coro_resume 至此連線的 process_request_coro，當讀取或寫入結束後需等待對方訊息，就可以先 coro_yield 回 main loop 繼續監聽，當下一次再監聽到跳至此 task 時，就會從上次 yield 的位置繼續執行，就不用擔心重複執行 loop 前面的一些程式碼太多次。

lwan 裡 x86_64 coro_swapcontext 與 minicoro 的 _mco_switch 作法相似，只有在函式與 struct 的定義上有差異。

typedef uintptr_t coro_context[10];

這裡有一個我個人覺得很有趣的作法，他不像 minicoro 一樣建立一個 struct 紀錄各個暫存器的值，而是使用 uintptr_t 的陣列存取暫存器的值，uintptr_t 與暫存器一樣是看處理器位元數而定，因此可以完整紀錄處理器的值。

因此想要實作 ARM64 的 Coroutine 優化，可以參考 minicoro 針對 aarch64 linux 之 _mco_switch 與 aapcs64 裡針對暫存器與程式呼叫規範的介紹

原生 Coroutine 的初步實驗（Throughput）

在實作 Coroutine 之前學生對於有無原生 Coroutine 的 lwan web server 的效能做比較

為了在同一個硬體配置下看出差異，我額外編譯一個 lwan 並將 x86_64 原生 Coroutine 刪除

diff --git a/src/lib/lwan-coro.h b/src/lib/lwan-coro.h
index 2d762c03..42991621 100644
--- a/src/lib/lwan-coro.h
+++ b/src/lib/lwan-coro.h
@@ -23,9 +23,7 @@
 #include <stddef.h>
 #include <stdint.h>
 
-#if defined(__x86_64__)
-typedef uintptr_t coro_context[10];
-#elif defined(LWAN_HAVE_LIBUCONTEXT)
+#ifdef LWAN_HAVE_LIBUCONTEXT
 #include <libucontext/libucontext.h>
 typedef libucontext_ucontext_t coro_context;
 #else

diff --git a/src/lib/lwan-coro.c b/src/lib/lwan-coro.c
index db51e176..e7274eb0 100644
--- a/src/lib/lwan-coro.c
+++ b/src/lib/lwan-coro.c
@@ -133,55 +133,8 @@ struct coro {
 #endif
 };
 
-#if defined(__APPLE__)
-#define ASM_SYMBOL(name_) "_" #name_
-#else
-#define ASM_SYMBOL(name_) #name_
-#endif
 
-#define ASM_ROUTINE(name_)                                                     \
-    ".globl " ASM_SYMBOL(name_) "\n\t" ASM_SYMBOL(name_) ":\n\t"
-
...
-#if defined(__x86_64__)
-void __attribute__((noinline, visibility("internal")))
-coro_swapcontext(coro_context *current, coro_context *other);
-asm(".text\n\t"
-    ".p2align 5\n\t"
-    ASM_ROUTINE(coro_swapcontext)
-    "movq   %rbx,0(%rdi)\n\t"
-    "movq   %rbp,8(%rdi)\n\t"
...
-    "movq   64(%rsi),%rcx\n\t"
-    "movq   56(%rsi),%rsi\n\t"
-    "jmpq   *%rcx\n\t");
-#elif defined(LWAN_HAVE_LIBUCONTEXT)
+#ifdef LWAN_HAVE_LIBUCONTEXT
 #define coro_swapcontext(cur, oth) libucontext_swapcontext(cur, oth)
 #else
 #error Unsupported platform.

diff --git a/CMakeLists.txt b/CMakeLists.txt
index f79e5aab..98a78eff 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -340,9 +340,8 @@ else ()
        set(LWAN_COMMON_LIBS -Wl,-whole-archive lwan-static -Wl,-no-whole-archive)
 endif ()
 
-if (NOT CMAKE_SYSTEM_PROCESSOR MATCHES "^x86_64|amd64")
-       set(LWAN_HAVE_LIBUCONTEXT 1)
-endif ()
+
+set(LWAN_HAVE_LIBUCONTEXT 1)
 
 include_directories(src/lib)
 include_directories(BEFORE src/lib/missing)

並使用 weighttp 進行壓力測試

weighttp -n 10000 -c 10 -t 4 -k localhost:8080

實驗結果：

原生 Coroutine

weighttp 0.4 - a lightweight and simple webserver benchmarking tool

starting benchmark...
spawning thread #1: 3 concurrent requests, 2500 total requests
spawning thread #2: 3 concurrent requests, 2500 total requests
spawning thread #3: 2 concurrent requests, 2500 total requests
spawning thread #4: 2 concurrent requests, 2500 total requests
progress:  10% done
progress:  20% done
progress:  30% done
progress:  40% done
progress:  50% done
progress:  60% done
progress:  70% done
progress:  80% done
progress:  90% done
progress: 100% done

finished in 0 sec, 251 millisec and 28 microsec, 39836 req/s, 47111 kbyte/s
requests: 10000 total, 10000 started, 10000 done, 10000 succeeded, 0 failed, 0 errored
status codes: 10000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 12110240 bytes total, 1570240 bytes http, 10540000 bytes data

ucontext

weighttp 0.4 - a lightweight and simple webserver benchmarking tool

starting benchmark...
spawning thread #1: 3 concurrent requests, 2500 total requests
spawning thread #2: 3 concurrent requests, 2500 total requests
spawning thread #3: 2 concurrent requests, 2500 total requests
spawning thread #4: 2 concurrent requests, 2500 total requests
progress:  10% done
progress:  20% done
progress:  30% done
progress:  40% done
progress:  50% done
progress:  60% done
progress:  70% done
progress:  80% done
progress:  90% done
progress: 100% done

finished in 0 sec, 262 millisec and 161 microsec, 38144 req/s, 45111 kbyte/s
requests: 10000 total, 10000 started, 10000 done, 10000 succeeded, 0 failed, 0 errored
status codes: 10000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 12110240 bytes total, 1570240 bytes http, 10540000 bytes data

發現其實兩者實驗結果是差不多的，並且在之前編譯時也有遇到 libucontext/libucontext.h: no such file or directory 的錯誤，最後是 git clone kaniini/libucontext 並安裝，因此我查了 C 函式庫的 man page ，發現 C 函式庫的標頭檔為 #include <ucontext.h> 但 lwan 包含的 ucontext 為 #include <libucontext/libucontext.h>，因此這裡的 libucontext 應該為外部專案，而且此專案也有針對不一樣的處理器架構實作原生的 ucontext。

針對 coroutine ，並非要測試 throughput，而是「同樣的 concurrent connection 數量下，記憶體開銷的落差」，可搭配 massif 來分析。
:notes: jserv

好的，學生會針對記憶體開銷做測試

原生 Coroutine 針對記憶體開銷實驗

執行以下命令開啟 ./lwan

$ valgrind --tool=massif ./lwan

接下來執行壓力測試

$ weighttp -n 100000 -c 100 -t 4 -k localhost:8080

接下來使用 massif-visualizer 就能夠視覺化記憶體開銷

實驗結果：

x86-64 原生 Coroutine

原生 Coroutine 總 heap 開銷為 457.5 KB
ucontext

ucontext 總 heap 開銷為 545.4 KB

可以看到原生 Coroutine 的記憶體開銷少了一些，有部份原因是因為資料結構，在 lwan-coro.h x86_64 的 context 有 10 * 8 Bytes (uintptr_t) 的記憶體空間，libucontext_ucontext_t 可以在 x86_64/include/libucontext/bits.h 看到，實作原生的 Coroutine 能夠針對自己的需求留下自己需要的暫存器資訊，因此開銷相對較少。

ARM64 架構下的原生 Coroutine

在實作原生 Coroutine 下，學生將參考 lwan-coro.c 與 minicoro 的設計風格，為保留 lwan 的程式風格，多半還是以專案本身的設計手法為主。

在 assembly coroutine 實作上，就是將暫存器的資訊存入記憶體，等需要時再將其從記憶體取出。

ARM64 一共有 30 個暫存器 (不包括 stack pointer)

r0 ~ r7 : 參數與回傳值使用的暫存器，當呼叫函式時將參數放入此暫存器，回傳執也利用這些暫存器傳遞。
r8 : 間接結果暫存器（indirect result register），可透過紀錄位址來間接回傳結果，例如回傳值為大型資料結構。
r9 ~ r15 : Caller 使用的一般目的暫存器，即呼叫函式者。
r16、r17 : IP (intra-procedure-call) 暫存器
r18 : 為平台 ABI 保留的暫存器。
r19 ~ r28 : Callee 使用的一般目的暫存器，即被呼叫者。
r29 : Frame pointer，指向這次 Stack 儲存的基底位址。
r30 : link register，儲存返回位址。

除了暫存器類型還有資料形態需注意（以 r0 暫存器為例）：
整數：

w0 : 為 32-bit 整數形態。
x0 : 為 64-bit 整數形態。

浮點數：

b0 : 為 8-bit 浮點數形態。
h0 : 為 16-bit 浮點數形態。
s0 : 為 32-bit 浮點數形態。
d0 : 為 64-bit 浮點數形態。
q0 : 為 128-bit 浮點數形態。

向量 V0 ：為 128-bit，可以根據浮點數的代號從此向量中取部份的值。

參考 minicoro 與 lwan-coro 中 x86-64 asm，需要保存的暫存器可能就是 x9 ~ x15、x19 ~ x30，還有 x0 與 x1 ，總共 22 個暫存器。

我們的架構為 aarch64，因此在 lwan-coro.h coro_context 宣告時新增架構判斷

#elif defined(__aarch64__)
typedef uintptr_t coro_context[22];

`coro_swapcontext`

接下來就是 lwan-coro.c coro_swapcontext，首先要把 sp 與 x30 的值取出，我們放在 x10 與 x11。

    mov x10, sp
    mov x11, x30

接下來使用 stp 指令一次將兩個暫存器存入 coro_context 的兩個區域。
以第一行為例，當使用 stp 指令存入，x8、x9 的存放位置會是 x0[2] 與 x0[3]。

    stp x8, x9, [x0, #(1*16)]
    stp x10, x11, [x0, #(2*16)]
    stp x12, x13, [x0, #(3*16)]
    stp x14, x15, [x0, #(4*16)]
    stp x19, x20, [x0, #(5*16)]
    stp x21, x22, [x0, #(6*16)]
    stp x23, x24, [x0, #(7*16)]
    stp x25, x26, [x0, #(8*16)]
    stp x27, x28, [x0, #(9*16)]
    stp x29, x30, [x0, #(10*16)]
    stp x0, x1, [x0, #(0*16)]

而取出指令為 ldr ，用法與 stp 正好相反

    ldp x8, x9, [x1, #(1*16)]
    ldp x10, x11, [x1, #(2*16)]
    ldp x12, x13, [x1, #(3*16)]
    ldp x14, x15, [x1, #(4*16)]
    ldp x19, x20, [x1, #(5*16)]
    ldp x21, x22, [x1, #(6*16)]
    ldp x23, x24, [x1, #(7*16)]
    ldp x25, x26, [x1, #(8*16)]
    ldp x27, x28, [x1, #(9*16)]
    ldp x29, x30, [x1, #(10*16)]
    ldp x0, x1, [x1, #(0*16)]

注意在取出暫存器時，x1 必須在最後一個取出，否則會導致後續位址錯誤！

最後是設定 sp，結束後跳至 x11。

    mov sp, x10
    br x11

coro 初始化

在一開始 malloc 時是沒有 context 存入 coro_context 的，所以要針對一些暫存器做初始化

#elif defined(__aarch64__)
    coro->context[19/* x28 */] = (uintptr_t)data;
    coro->context[0 /* x0  */] = (uintptr_t)coro;
    coro->context[1 /* x1  */] = (uintptr_t)func;
    coro->context[5 /* lr  */] = (uintptr_t)coro_entry_point_arm64;

    uintptr_t rsp = (uintptr_t)stack + CORO_STACK_SIZE;
#define STACK_PTR 4
    coro->context[STACK_PTR] = rsp & ~0xful;

上面是個 4 個初始化數值是為了第一次跳躍而準備的，在第一次呼叫 coro_entry_point_arm64 時會將 x28 的數值放入 x2（第 3 個參數）再呼叫 coro_entry_point。

    mov x2, x28
    bl coro_entry_point

coro_entry_point 會呼叫 coro_yield，主要任務是設置 yield_value。

而 stack 在 coro_new 時就會 mmap 配置一個 stack 空間。

    void *stack = mmap(NULL, CORO_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_STACK | MAP_ANON | MAP_PRIVATE, -1, 0);
    if (UNLIKELY(stack == MAP_FAILED))
        return NULL;

    coro = lwan_aligned_alloc(sizeof(*coro), 64);
    if (UNLIKELY(!coro)) {
        munmap(stack, CORO_STACK_SIZE);
        return NULL;
    }

    coro->stack = stack;

最後 coro->context[STACK_PTR] = rsp & ~0xful 就是對齊記憶體位址。

massif 實驗

ucontext

總 heap 開銷為 1.3 MB
ARM64 原生 Coroutine

總 heap 開銷為 904.3 KB

準備提交 Arm64 coro 相關 pull request 到 lwan 專案。
:notes: jserv

收到，已提交等待結果中

TODO: 研究 Tickless Hierarchical Timing Wheel 並嘗試改進

I/O multiplexing 只是伺服器開發的一環，真實要考慮的議題很多，例如當 Web 伺服器和客戶端網頁瀏覽器保持著一個長期連線的狀況下，遇到以下狀況:

網頁瀏覽器突然關閉;
客戶端網路離線;

伺服器該如何處理？在前者，由於已分配該連線的任務中，read 會回傳 0，代表對方已關閉這個檔案操作，於是伺服器端呼叫 close 即可。但在後者，通訊協定無法立刻知悉，所以僅能透過伺服器引入 timer 逾時事件來克服，timer 可透過 priority queue 實作，後者又常用 binary heap 實作 —— 這樣只要從 priority queue 的開頭檢查起逾時的連線即可。
另一個解決方式是動態調整等待的間隔，每次選出最需要處理逾時的 timer，如此一來，每次時間會不同，且不佔用過多的 CPU 資源，稱為 tickless，通常會透過 timing wheel 來實作，William Ahern 提供一套 Tickless Hierarchical Timing Wheel，參照論文〈Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility〉

bench 目錄提供相關的實作效能比較 (使用方式參見 sysprog21/timeout)。lwan 使用到這個 Tickless Hierarchical Timing Wheel，但程式碼進行一定的整理:

timing wheel 的 insertion 是 O(1) 時間複雜度:

timing wheel 的 deletion 也是 O(1) 時間複雜度:

可能的效能改進提案:

研讀〈Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility〉

論文中從最基礎的 timeout model 介紹並慢慢改進其 timeout model，最後介紹 Hashed and Hierarchical Timing Wheels。
早期的 OS timeout 機制需要 $O (n)$ 的時間來啟動與維護，在 n 非常大時，其時間成本就會變得昂貴，因此在論文中的任務就是將其時間改善為 $O (1)$ 。

Timeout model 有四個主要函式：

STARTTIMER(Interval, RequestId, ExpiryAction) : 新增一個 timer 到 model 中，Interval 為 timer 過期時間，RequestId 用來區分 timer，ExpiryAction 為 timer 過期時要執行的動作。
STOPTIMER(RequestId) : 停止 RequestId 對應的 timer。
PERTICKBOOKKEEPING : 檢查是否有 timer 過期，若過期則停止 timer。
EXPIRYPROCESSING : 執行 timer 的指定動作。

Timeout model 有幾個實作方案

Scheme 1 - Straightforward

最直覺的 timer 管理機制，啟動和停止時間為 $O (1)$ ，但 PERTICKBOOKKEEPING 需要一個個檢查，因此需要 $O (n)$ 的時間。

Scheme 2 - Order list

以已排序的 linked list 來管理 timer，最早過期的 timer 會被排在 head，使得停止和檢查時間能夠為 $O (1)$ ，但因為要保持 linked list 是排序的，因此啟動時 worst case 需要 $O (n)$ 時間。

Scheme 3 - Tree-Based Algorithms

為解決 Scheme 2 的 STARTTIMER 還是 $O (n)$ 的問題，將 model 結構改進為樹狀結構（文中提及的樹狀結構有：非平衡二元樹、Heap、後序樹和左偏樹），使用樹狀結構可以將 STARTTIMER 從 $O (n)$ 降低成 $O (\log n)$ ，但是這些非平衡樹的 worst case 還是會變成 $O (n)$ （平衡樹的話會使 STOPTIMER 和 PERTICKBOOKKEEPING 時間上升）。

Scheme 4 - Basic Timer-Wheel

在這裡就正式介紹 Timer-wheel，wheel（輪子）就是一個 cycle，可以把 wheel 看作一個時鐘上面有 n 個 cell （刻度），這些刻度代表一個單位時間（一個或多個 Ticks），一個單位時間過去了，其指針就會移動移動到下一個 cell。假設 curtime 為目前 model 指向的刻度，在 STARTTIMER 呼叫時，就可以根據 (cutime + interval) % n 來決定 timer 要放入哪個 cell。

取自〈Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility〉 Fig. 5.

每移動到下一個 cell 系統就會檢查該 cell 裡面的 timer 是否過期，這樣就不必逐個檢查，時間會比較趨近於 $O (1)$

但這樣一個 wheel 有 n 個 cell 的情況下，雖然 n 可以依照自己的需求設定數量，但在記憶體不足的情況下，當 Timer 的 interval 大於 n 時，有可能會造成碰撞，這時經過一個 cycle 也不會過期，就可能需要對同一個 cell 的 timer 進行排序。

scheme 5、6 - Hash Table With Sorted List / Unsorted List

這裡考慮到 timer interval 有 32-bits 時，他將 wheel cell 數量設為 256 (或 $2^{n}$ ) 個，這樣在運算上能夠使用 bitwise 操作完成 STARTTIMER 而非四則運算，而 5 和 6 的差別就在於有沒有排序過，如果選擇排序，STARTTIME 的 worst case 就會是 $O (n)$ ，如果沒有排序，PERTICKBOOKKEEPING 的 worst case 就會是 $O (n)$ 。

取自〈Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility〉 Fig. 6.

scheme 7 - Hierachical Timer-Wheel

增加了 Hierachy 概念的 Timer-wheel，在論文裡 Scheme 7 有四個 wheel，分別是：

有 100 個 cell，代表天。
有 24 個 cell，代表時。
有 60 個 cell，代表分。
有 60 個 cell，代表秒。

當 STARTTIMER 被呼叫了，我們會先取得過期時間（curtime + interval），第一次 insert 會把 timer 插入最高 Level 的 wheel，而在過期時間其他較低的時間則會被儲存起來。

例如：curtime 為 11 天 10 時 24 分 30 秒，STARTTIMER 插入的 timer Interval 為 50 分 45 秒，這樣過期時間為 11 天 11 時 15 分 15 秒，因為天數一樣，因此最高的 Level 是「時」，所以插入「時」的第 11 個 cell，而把「15 分 15 秒」記錄在 timer 裡。

當時針指向 11 時，因為 remainder 還有「15 分 15 秒」，所以接下來就會把 timer 插入「分」的第 15 個 cell，把「15 秒」記錄在 timer 裡，而當 remainder 沒有東西之後，此 timer 才是真正過期。

取自〈Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility〉 Fig. 7

lwan 中的 timeout.c

lwan 裡的 timeout 就參考這樣的 Hashed and hierachical timer-wheel 架構，不過機制有些許的不同。

Hierachy 機制

timeout.c 的 hierachy 不是以「天、時、分、秒」這種真實數據來表現，而是考慮 bitwise 操作能夠讓電腦更有效率地運算，避免使用過多的乘跟除增加電腦的運算時間，可以在 macro 與 struct 結構看到。

#define WHEEL_BIT 6
#define WHEEL_NUM 4

#define WHEEL_LEN (1U << WHEEL_BIT)
#define WHEEL_MAX (WHEEL_LEN - 1)
#define WHEEL_MASK (WHEEL_LEN - 1)
...
struct timeouts {
    struct list_head wheel[WHEEL_NUM][WHEEL_LEN];
    struct list_head expired;

    wheel_t pending[WHEEL_NUM];

    timeout_t curtime;
};

struct timeouts 是 timeout model 的主體，可以看到 lwan 的 wheel 是一個 4 * $2^{6}$ 的陣列，代表有 4 個 level 的 wheel，每個 wheel 有 64 個 cell。

這樣的好處是，當每次在計算 hierachy 的數值時，可以將大部分的運算使用 bitwise 代替。

static inline int timeout_wheel(timeout_t timeout)
{
    /* must be called with timeout != 0, so fls input is nonzero */
    return (fls(LWAN_MIN(timeout, TIMEOUT_MAX)) - 1) / WHEEL_BIT;
}

static inline int timeout_slot(int wheel, timeout_t expires)
{
    return (int)(WHEEL_MASK & ((expires >> (wheel * WHEEL_BIT)) - !!wheel));
}

timeout_wheel: 取得 timeout 要放入的 wheel index
timeout_slot: 取得要放入 wheel 的 slot(cell)

pending 是一個 uint64_t 的陣列，四個 64 位元整數對應四個 wheel，用途為標記 wheel 裡有 append timer 的 slot，當 timeouts_sched 被呼叫就會將 timer 排程到對應的 slot 並使 pending 的對應位元設為 1。

    if (expires > T->curtime) {
        rem = timeout_rem(T, to);

        /* rem is nonzero since:
         *   rem == timeout_rem(T,to),
         *       == to->expires - T->curtime
         *   and above we have expires > T->curtime.
         */
        wheel = timeout_wheel(rem);
        slot = timeout_slot(wheel, to->expires);

        to->pending = &T->wheel[wheel][slot];

        T->pending[wheel] |= WHEEL_C(1) << slot;
    } else {
        to->pending = &T->expired;
    }

    list_add_tail(to->pending, &to->tqe);

而 struct timeout 中也有一個 pending ，但此 pending 是記錄 timer 目前所 append 的 wheel slot 位址。

struct timeout {
    int flags;

    timeout_t expires;
    /* absolute expiration time */

    struct list_head *pending;
    /* timeout list if pending on wheel or expiry queue */

    struct list_node tqe;
    /* entry member for struct timeout_list lists */
}; /* struct timeout */

timeouts_update 主要是更新 timeouts 的時間並根據時間來更新狀態，lwan 的 timeout model 並不是指針再轉，而是比較像 wheel 在旋轉，第零個 slot 會是 curtime 的 slot。

改善 Timeout model

#TODO