2023q1 Homework7 (ktcp)

contributed by < yanjiew1 >

Assignment description

Development environment

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
    CPU family:          6
    Model:               142
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            10
    CPU max MHz:         3400.0000
    CPU min MHz:         400.0000
    BogoMIPS:            3600.00

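CMWQ study

Self-check item: the provided kecho already uses CMWQ; describe its advantages and usage.
Linux kernel documentation
workqueue kernel source: workqueue.c, workqueue.h, workqueue_internal.h

The kecho README contains this passage: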

Therefore, you can set the param to 1 to disable WQ_UNBOUND flag. By disabling this flag, tasks submitted to the CMWQ are actually submitted to a wq named system wq, which is a wq shared by the whole system. Tasks in the system wq are executed by the CPU core who submitted the task at most of the time. BE AWARE that if you use telnet-like program to interact with the module with the param set to 1, your machine may get unstable since your connection may stall other tasks in the system wq. For details about the CMWQ, you can refer to the documentation.

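The code already creates its own workqueue with alloc_workqueue, so I do not understand why dropping WQ_UNBOUND would let a long-running work item block other work in the system wq. In principle, a work item that is waiting on I/O gives up the CPU, and other work items should then be able to run.

I tested this myself by setting bench=1 and connecting with telnet; the system did not appear to become unstable.

Cross-check with netstat; the text quoted above does need to be revised.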

jserv

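After reading the Linux documentation, my current understanding is:
Each CPU has two worker-pools (similar to thread pools), one for high-priority work and one for normal-priority work. The concurrency level of a worker-pool is the number of its workers (threads) that are running at the same time. Besides the per-CPU worker-pools, there are one or more worker-pools that are not bound to a specific CPU; these serve work items queued with WQ_UNBOUND.

Problems with the original workqueue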

In the original wq implementation, a multi threaded (MT) wq had one worker thread per CPU and a single threaded (ST) wq had one worker thread system-wide. A single MT wq needed to keep around the same number of workers as the number of CPUs. The kernel grew a lot of MT wq users over the years and with the number of CPU cores continuously rising, some systems saturated the default 32k PID space just booting up.

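According to the Linux kernel documentation, in the original implementation each workqueue managed its own threads. Depending on the type, an MT wq (multi-threaded workqueue) created one thread per CPU, while an ST wq (single-threaded workqueue) created a single thread and let the scheduler decide which CPU it ran on. The problem is that every subsystem and driver may create its own workqueues, and a workqueue is not busy all the time, so many idle threads accumulate. This is worst for MT wqs: the thread count grows with the number of CPU cores, and many of those threads may be idle.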

Although MT wq wasted a lot of resource, the level of concurrency provided was unsatisfactory. The limitation was common to both ST and MT wq albeit less severe on MT. Each wq maintained its own separate worker pool. An MT wq could provide only one execution context per CPU while an ST wq one for the whole system. Work items had to compete for those very limited execution contexts leading to various problems including proneness to deadlocks around the single execution context.

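Even with all those resources, the achievable concurrency was still poor, because each workqueue managed its own threads: a workqueue could have only one work item in flight per CPU at a time (in flight here includes waiting on I/O).

How CMWQ improves on the original workqueue

In CMWQ, the thread pools and their threads are managed centrally. Even when the system has many workqueues, their work items all run on the shared, centrally managed thread pools. This removes the waste and inefficiency that came from every workqueue managing its own threads.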

How CMWQ works

CMWQ API

Introducing CMWQ into `httpd`

Creating the workqueue

In http_server.h, declare a variable named khttpd_wq to hold the workqueue.

extern struct workqueue_struct *khttpd_wq;

The workqueue is created in main.c, which first needs to define the global variable khttpd_wq.

struct workqueue_struct *khttpd_wq;

Then add the code that creates the workqueue to khttpd_init. Unlike kecho, which does not check whether the workqueue was created successfully, this version does.

@@ -159,6 +161,12 @@ static int __init khttpd_init(void)
         pr_err("can't open listen socket\n");
         return err;
     }
+    khttpd_wq = alloc_workqueue(MODULE_NAME, WQ_UNBOUND, 0);
+    if (!khttpd_wq) {
+        pr_err("can't allocate workqueue\n");
+        close_listen_socket(listen_socket);
+        return -ENOMEM;
+    }
     param.listen_socket = listen_socket;
     http_server = kthread_run(http_server_daemon, &param, KBUILD_MODNAME);
     if (IS_ERR(http_server)) {

Add the code that destroys the workqueue to khttpd_exit.

 static void __exit khttpd_exit(void)
 {
     send_sig(SIGTERM, http_server, 1);
     kthread_stop(http_server);
     close_listen_socket(listen_socket);
+    destroy_workqueue(khttpd_wq);
     pr_info("module unloaded\n");
 }

Defining the structures

Following kecho, define one structure for the service and one for a worker.

struct http_service {
    bool is_stopped;
    struct list_head worker;
};

struct khttpd {
    struct socket *sock;
    struct list_head list;
    struct work_struct khttpd_work;
};

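is_stopped in struct http_service tells workers that the service has been shut down. struct khttpd holds each worker's socket and the work item placed on the workqueue, so that when khttpd is unloaded it can confirm that every work item has finished and every socket has been released, avoiding race conditions.

Modifying `http-server.c` to use CMWQ

Declare a global variable daemon that records whether the service has stopped and holds the linked list of workers.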

struct http_service daemon = {.is_stopped = false,
                              .worker = LIST_HEAD_INIT(daemon.worker)};

Following kecho, add a free_work function. It is called when the service stops, releasing the sockets and making sure every worker has really finished.

/* it would be better if we do this dynamically */
static void free_work(void)
{
    struct khttpd *l, *tar;
    /* cppcheck-suppress uninitvar */

    list_for_each_entry_safe (tar, l, &daemon.worker, list) {
        kernel_sock_shutdown(tar->sock, SHUT_RDWR);
        flush_work(&tar->khttpd_work);
        sock_release(tar->sock);
        kfree(tar);
    }
}

Since sockets are released in free_work, http_server_worker no longer has to release its socket. The change to http_server_worker is:

         if (request.complete && !http_should_keep_alive(&parser))
             break;
         memset(buf, 0, RECV_BUFFER_SIZE);
     }
     kernel_sock_shutdown(socket, SHUT_RDWR);
-    sock_release(socket);
     kfree(buf);
     return 0;
 }

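At this point I noticed that kecho's CMWQ code releases workers and sockets only when the module is finally unloaded, which wastes memory unnecessarily.

I also noticed that the original khttpd does not make sure that no worker is still running before the module is unloaded. This is probably why the assignment description mentions that a kernel oops may appear on unload.

Following kecho, add a create_work function. It allocates a struct khttpd, fills in its fields, links it into daemon.worker, and returns the embedded work item to be queued on the workqueue.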

static struct work_struct *create_work(struct socket *sk)
{
    struct khttpd *work;

    if (!(work = kmalloc(sizeof(*work), GFP_KERNEL)))
        return NULL;

    work->sock = sk;
    INIT_WORK(&work->khttpd_work, http_server_worker);
    list_add(&work->list, &daemon.worker);
    return &work->khttpd_work;
}

Modify http_server_worker to take the socket from struct khttpd.

The function declaration becomes

static void http_server_worker(struct work_struct *work)

and the socket is obtained with

struct khttpd *worker = container_of(work, struct khttpd, khttpd_work);
struct socket *socket = worker->sock;
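
The container_of step above is what lets a work function that only receives a struct work_struct * find its per-connection data. The following is a minimal userspace sketch of the idea, not kernel code: the container_of macro here is a simplified stand-in for the kernel's, struct work_struct is a dummy, and sock_fd and worker_sock_fd are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(): given a
 * pointer to a member, subtract the member's offset to get the enclosing
 * object. */
#define container_of(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

/* Dummy placeholder; the real struct work_struct is defined by the kernel. */
struct work_struct {
    int pending;
};

/* Same shape as struct khttpd in the text, with an int standing in for
 * struct socket *sock (assumption made for this sketch). */
struct khttpd {
    int sock_fd;
    struct work_struct khttpd_work;
};

/* Mirrors the first lines of http_server_worker: only the embedded work item
 * is passed in, and the owning struct khttpd is recovered from it. */
static int worker_sock_fd(struct work_struct *work)
{
    struct khttpd *worker = container_of(work, struct khttpd, khttpd_work);
    return worker->sock_fd;
}
```

Embedding the work item in the per-connection structure means no separate lookup table is needed to map a work item back to its socket.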

Modify http_server_daemon so that, after accepting a connection, it calls create_work to build the work item and queue_work to put it on the workqueue.

int http_server_daemon(void *arg)
 {
     struct socket *socket;
-    struct task_struct *worker;
+    struct work_struct *work;
     struct http_server_param *param = (struct http_server_param *) arg;
 
     allow_signal(SIGKILL);

.....

             pr_err("kernel_accept() error: %d\n", err);
             continue;
         }
-        worker = kthread_run(http_server_worker, socket, KBUILD_MODNAME);
-        if (IS_ERR(worker)) {
-            pr_err("can't create more worker process\n");
+        work = create_work(socket);
+        if (!work) {
+            pr_err("create work error, connection closed\n");
+            kernel_sock_shutdown(socket, SHUT_RDWR);
+            sock_release(socket);
             continue;
         }
+        queue_work(khttpd_wq, work);
     }
+    daemon.is_stopped = true;
+    free_work();
     return 0;
 }

At this point khttpd has been converted to CMWQ. Testing with make check:

$ make check
.....
80000 requests
90000 requests

requests:      100000
good requests: 100000 [100%]
bad requests:  0 [0%]
socket errors: 0 [0%]
seconds:       2.590
requests/sec:  38609.159

Complete

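Re-measurement: the numbers above were measured in a GUI session with many background tasks running, including a browser and Visual Studio Code, on a 4K display. I therefore ran the benchmark again from a text-mode tty, which was considerably faster.
I also removed the pr_info output in the request-handling path to improve performance.

Comparison table:

Configuration	Result (requests/sec)
Original khttpd	28423.400
Original khttpd, pr_info removed	35762.271
khttpd using CMWQ, pr_info removed	46008.848

Checking dmesg again: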

......
[  377.087904] khttpd: requested_url = /
[  377.087942] khttpd: requested_url = /
[  377.284773] khttpd: module unloaded

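No error messages appear, so the test passed.

Freeing worker memory when a connection ends

In the CMWQ implementation so far, the memory for each connection (struct khttpd) is released only when the kernel module exits, so memory use keeps growing for as long as the module is loaded, which is effectively a memory leak. I therefore want each worker to free its own memory when it finishes.
Because several threads access the worker list concurrently, I added a spinlock to protect it.

Add the spinlock to struct http_service: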

 struct http_service {
     bool is_stopped;
     struct list_head worker;
+    spinlock_t lock;
 };

Initialize the spinlock in http_server_daemon:

     allow_signal(SIGKILL);
     allow_signal(SIGTERM);
 
+    spin_lock_init(&daemon.lock);
     while (!kthread_should_stop()) {
         int err = kernel_accept(param->listen_socket, &socket, 0);
         if (err < 0) {
             if (signal_pending(current))
                 break;

create_work must take the lock when it adds a new work item to the list.

 static struct work_struct *create_work(struct socket *sk)
 {
     struct khttpd *work;

     if (!(work = kmalloc(sizeof(*work), GFP_KERNEL)))
         return NULL;

     work->sock = sk;
     INIT_WORK(&work->khttpd_work, http_server_worker);
+    spin_lock(&daemon.lock);
     list_add(&work->list, &daemon.worker);
+    spin_unlock(&daemon.lock);
     return &work->khttpd_work;
 }

free_work must hold the lock while it walks the worker list. It also no longer frees anything; it only shuts down each socket.

@@ -208,12 +215,11 @@ static void free_work(void)
     struct khttpd *l, *tar;
     /* cppcheck-suppress uninitvar */
 
+    spin_lock(&daemon.lock);
     list_for_each_entry_safe (tar, l, &daemon.worker, list) {
         kernel_sock_shutdown(tar->sock, SHUT_RDWR);
-        flush_work(&tar->khttpd_work);
-        sock_release(tar->sock);
-        kfree(tar);
     }
+    spin_unlock(&daemon.lock);
 }

In http_server_worker, before returning, remove the worker from the list and free its memory.

@@ -185,8 +185,13 @@ static void http_server_worker(struct work_struct *work)
     }
 
 out:
-    kernel_sock_shutdown(socket, SHUT_RDWR);
     kfree(buf);
+    spin_lock(&daemon.lock);
+    list_del(&worker->list);
+    spin_unlock(&daemon.lock);
+    kernel_sock_shutdown(socket, SHUT_RDWR);
+    sock_release(socket);
+    kfree(worker);
 }

After the change, I ran make check to confirm that everything still works. The benchmark results:

requests:      100000
good requests: 100000 [100%]
bad requests:  0 [0%]
socket errors: 0 [0%]
seconds:       2.183
requests/sec:  45818.679

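Performance with the spinlock is about the same as without it. My guesses as to why:

My CPU does not have many cores, so contention is rare.
The critical section contains only a single list_del or list_add, which essentially just updates four pointers and finishes very quickly. (free_work does walk the linked list, but only when the module is being unloaded.)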
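
To make the "four pointers" point concrete, here is a minimal userspace sketch of a circular doubly linked list in the style of the kernel's list.h. It is an illustration written for this note, not the kernel source: list_init is a hypothetical helper name, and list_del clears the removed node's pointers with NULL where the kernel stores poison values.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list.h. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node right after head: exactly four pointer stores. */
static void list_add(struct list_head *node, struct list_head *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* Unlink node: two stores on its neighbours, two on the node itself. */
static void list_del(struct list_head *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = NULL; /* the kernel writes poison pointers here instead */
    node->prev = NULL;
}
```

Since the work done under the lock is a handful of stores with no loops or allocation, each lock hold is very short, which fits the observation that throughput barely changed.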

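TODO: consider whether a lock-free algorithm combined with RCU could avoid the lock.

Implementing directory listing

The assignment description mentions several functions that can be used: filp_open (similar to the open system call), iterate_dir (walks a directory), and filp_close (closes the file). The implementation is built on these.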

To avoid an extra buffer for holding the whole directory listing, the response uses chunked encoding.
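
With chunked encoding (HTTP/1.1 Transfer-Encoding: chunked), each piece of the body is sent as its size in hexadecimal, CRLF, the data, and CRLF, and the body ends with a zero-length chunk. Each directory entry can therefore be sent as soon as it is visited. The following is a userspace sketch of the framing only; format_chunk and CHUNK_END are hypothetical names, and the actual khttpd code would send the bytes over the kernel socket rather than write them into a buffer.

```c
#include <stdio.h>
#include <string.h>

/* Format one HTTP/1.1 chunk as "<size in hex>\r\n<data>\r\n".
 * Returns the number of bytes written to out, or -1 if out is too small.
 * The result is a byte sequence and is not NUL-terminated. */
static int format_chunk(char *out, size_t out_size, const char *data,
                        size_t len)
{
    int n = snprintf(out, out_size, "%zx\r\n", len);
    if (n < 0 || (size_t) n + len + 2 > out_size)
        return -1;
    memcpy(out + n, data, len);
    memcpy(out + n + len, "\r\n", 2);
    return n + (int) len + 2;
}

/* A zero-length chunk followed by an empty line terminates the body. */
static const char CHUNK_END[] = "0\r\n\r\n";
```

For example, a 14-byte entry such as `<li>a.txt</li>` is framed with the size line `e`, since 14 is e in hexadecimal.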

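Other notes

Fixed kecho failing to build on Linux v5.17 and later (Pull Request #12)

Installing the eBPF packages

The assignment description installs bcc-tools from the iovisor repository, but iovisor does not provide packages for Ubuntu 22.04, so I installed bpfcc-tools from the Ubuntu 22.04 repository instead.

Fixing a possible buffer overflow

In http-server.c, http_parser_callback_request_url copies the requested URL into struct http_request, but request_url is only 128 bytes. The code did not consider that the actual request URL may be longer than 128 bytes, which would overflow the buffer, so I made the following change: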

@@ -103,7 +103,10 @@ static int http_parser_callback_request_url(http_parser *parser,
                                             size_t len)
 {
     struct http_request *request = parser->data;
-    strncat(request->request_url, p, len);
+    if (len > 127)
+        len = 127;
+    strncpy(request->request_url, p, len);
+    request->request_url[len] = '\0';
     return 0;
 }

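The change clamps len to at most 127 and switches to strncpy. Because strncpy may leave the destination without a terminating null, a '\0' is written at the end explicitly.

Can this code be made more general, for example with a utility function or macro that checks buffer bounds?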

jserv

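I later realized that my change is wrong: this callback may be invoked several times for one request, so the pieces it receives must be concatenated. I will think about how to fix this.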
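
One possible shape for a reusable bounds-checked helper that also handles repeated callbacks is an append that is limited by the space left in the destination, not by the source length. This is a userspace sketch under that assumption, and bounded_append is a hypothetical name; it is not necessarily how the actual fix is written.

```c
#include <stddef.h>
#include <string.h>

/* Append at most len bytes of src to the NUL-terminated string in dst,
 * never writing past dst_size bytes in total (terminator included).
 * Returns the number of bytes appended; extra input is silently dropped. */
static size_t bounded_append(char *dst, size_t dst_size, const char *src,
                             size_t len)
{
    /* Find the current end of dst without reading past the buffer. */
    const char *end = memchr(dst, '\0', dst_size);
    if (!end)
        return 0; /* dst is not terminated within dst_size: refuse to write */

    size_t used = (size_t) (end - dst);
    size_t room = dst_size - used - 1; /* keep one byte for the terminator */
    if (len > room)
        len = room;
    memcpy(dst + used, src, len);
    dst[used + len] = '\0';
    return len;
}
```

A callback could then call bounded_append(request->request_url, sizeof(request->request_url), p, len) each time it is invoked, assuming request_url is a char array and starts out as an empty string. A return value smaller than len signals truncation, which the caller could treat as an over-long URL.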

Commit ecce64a
