2023q1 Homework7 (ktcp)

contributed by < WangHanChi >

Assignment requirements

ktcp

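Study 〈Linux 核心設計: 針對事件驅動的 I/O 模型演化〉
Explore issues in developing TCP servers
Learn how the Linux kernel handles kernel threads and workqueues
Learn Concurrency Managed Workqueue (cmwq)
Preview the principles of computer networking
Learn Ftrace, together with Chapter 6 of 《Demystifying the Linux CPU Scheduler》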

Development environment

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.

$ lscpu | less
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          12
On-line CPU(s) list:             0-11
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen 5 5600X 6-Core Processor
CPU family:                      25
Model:                           33
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
Stepping:                        0
Frequency boost:                 enabled
CPU max MHz:                     4650.2920
CPU min MHz:                     2200.0000
BogoMIPS:                        7385.75

Adding CI tests to `kecho`

I learned how to write CI test scripts from yanjiew1's commit and added them to kecho.

Two files were added in total:

.ci/check-format.sh

#!/usr/bin/env bash

# kecho is written in C, so .c files must be matched as well as headers.
SOURCES=$(find $(git rev-parse --show-toplevel) | egrep "\.(c|cpp|h)\$")

set -x

for file in ${SOURCES};
do
    clang-format-14 ${file} > expected-format
    diff -u -p --label="${file}" --label="expected coding style" ${file} expected-format
done
exit $(clang-format-14 --output-replacements-xml ${SOURCES} | egrep -c "</replacement>")

.github/workflows/main.yml

name: CI

on: [push, pull_request]

jobs:
  kecho-check:
    runs-on: ubuntu-22.04
    steps:
    - uses: actions/checkout@v3.3.0
    - name: install-dependencies
      run: |
            sudo apt-get update
            sudo apt-get -q -y install build-essential cppcheck
            sudo apt-get -q -y install linux-headers-`uname -r`
    - name: make
      run:  |
            make
    - name: make check
      run: |
            make check
  coding-style:
    runs-on: ubuntu-22.04
    steps:
    - uses: actions/checkout@v3.3.0
    - name: coding convention
      run: |
            sudo apt-get install -q -y clang-format-14
            sh .ci/check-format.sh
      shell: bash

The workflow mainly runs make check and checks the coding style.

See the commit for details.

`kecho` study notes

`kecho_mod.c`

This file is the main module source. It begins with a macro:

#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 8, 0)
#define USE_SETSOCKET
#endif

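Here the code checks the kernel version it is being built against and selects the matching API. The check exists to stay compatible with older kernels. See LINUX_VERSION_CODE and KERNEL_VERSION for details.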
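To make the version comparison concrete, here is a small user-space sketch. `DEMO_KERNEL_VERSION` and `demo_use_new_api` are hypothetical names introduced for illustration. The packing (major above bit 16, minor in bits 8 to 15, patch in the low byte) follows the classic definition of `KERNEL_VERSION`; recent kernels additionally clamp the patch level to 255, which this sketch leaves out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for KERNEL_VERSION(a, b, c): pack three numbers
 * into one integer so that versions can be compared with >= or <. */
#define DEMO_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Returns true when the given version is at least 5.8.0, i.e. when the
 * newer code path would be selected by the #if in the module. */
static bool demo_use_new_api(int major, int minor, int patch)
{
    return DEMO_KERNEL_VERSION(major, minor, patch) >=
           DEMO_KERNEL_VERSION(5, 8, 0);
}
```

In the module itself the comparison happens in the preprocessor, so only one of the two code paths is ever compiled.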

Next, it declares the module parameters:

static ushort port = DEFAULT_PORT;
static ushort backlog = DEFAULT_BACKLOG;
static bool bench = false;
module_param(port, ushort, S_IRUGO);
module_param(backlog, ushort, S_IRUGO);
module_param(bench, bool, S_IRUGO);

Next come three declarations involving structures:

struct echo_server_param param;
struct socket *listen_sock;
struct task_struct *echo_server;

echo_server_param is itself built around a socket and is defined in echo_server.h. task_struct is the famous structure that describes a task; its definition is extremely long and can be found in linux/include/linux/sched.h

`kecho_init_module`

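This function initializes the kernel module. It calls open_listen to open the listening socket and wait for incoming messages, with some simple error checking.

Next: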

kecho_wq = alloc_workqueue(MODULE_NAME, bench ? 0 : WQ_UNBOUND, 0);
echo_server = kthread_run(echo_server_daemon, &param, MODULE_NAME);
if (IS_ERR(echo_server)) {
    printk(KERN_ERR MODULE_NAME ": cannot start server daemon\n");
    close_listen(listen_sock);
}

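The WQ_UNBOUND flag passed to alloc_workqueue is described in the CMWQ documentation.

In Concurrency Managed Workqueue (cmwq), WQ_UNBOUND is a workqueue attribute. Work items queued on a WQ_UNBOUND workqueue are served by special worker pools whose workers are not bound to any specific CPU.

This makes the workqueue behave as a simple execution-context provider without concurrency management. Unbound worker pools try to start executing work items as soon as possible. An unbound workqueue sacrifices some locality, but it is useful in the following cases:

- The concurrency requirement is expected to fluctuate widely, and a bound workqueue could end up creating many mostly unused workers across different CPUs as the issuer hops between them.

- Long-running, CPU-intensive workloads, which the system scheduler can manage better.

In short, WQ_UNBOUND lets an unbound workqueue start work items quickly, and it suits workloads with widely fluctuating concurrency requirements as well as long-running ones.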

Next is kthread_run, which combines kthread_create and wake_up_process: it creates the kernel thread and wakes it up right away.

`kecho_cleanup_module`

This function stops the module and cleans it up.

send_sig(SIGTERM, echo_server, 1);

According to the table in CSAPP Section 8.5, "Signals", SIGTERM is the software termination signal.

Next, the thread and the listening socket are shut down:

kthread_stop(echo_server);
close_listen(listen_sock);

This stops the echo_server thread and stops accepting messages.

`open_listen`

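This function initializes a network server that accepts client connection requests over TCP/IP on IPv4.

struct socket *sock : holds the socket object
struct sockaddr_in addr : structure holding the IPv4 address
int error : error code returned when something fails
int opt : holds the value used when setting socket options

First, sock_create creates a socket object with IPv4 as the protocol family, SOCK_STREAM (i.e. TCP) as the socket type, and IPPROTO_TCP as the protocol. The return value is stored in error and checked: if it is < 0, an error occurred, and the function returns that error code.

Next, the TCP nodelay option is set. Enabling it turns off Nagle's algorithm to reduce latency in TCP transmission. The result is error-checked as well.

Then the SO_REUSEPORT option is set, which allows multiple sockets to bind to the same port at the same time. This step is error-checked too.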

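Finally, a struct sockaddr_in is filled in for the subsequent bind and listen calls. struct sockaddr_in is the Internet socket address structure and represents an IPv4 address and port.

memset first initializes every member of addr to 0. Next, sin_family is set to AF_INET, meaning IPv4 addresses. sin_addr.s_addr is set to htonl(INADDR_ANY), which binds the socket to any available local IP address; INADDR_ANY is a special value that stands for any available IP address. Finally, sin_port is set to the specified port, with htons converting it from host byte order (little-endian on x86-64) to network byte order (big-endian). See socket编程为什么需要htons(), ntohl(), ntohs() and htons() 函数 for details.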
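The address setup described above can be reproduced in user space as a minimal sketch. `demo_make_addr` and `demo_first_port_byte` are hypothetical helpers, and the port used in the usage note is an arbitrary example value.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper mirroring the address setup described above:
 * zero the structure, then fill in family, wildcard address, and port. */
static struct sockaddr_in demo_make_addr(uint16_t port)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;                /* IPv4 */
    addr.sin_addr.s_addr = htonl(INADDR_ANY); /* any local address */
    addr.sin_port = htons(port);              /* host -> network order */
    return addr;
}

/* Returns the first byte of the port as stored in memory; in network
 * byte order this is always the most significant byte. */
static unsigned char demo_first_port_byte(const struct sockaddr_in *addr)
{
    unsigned char bytes[sizeof(addr->sin_port)];

    memcpy(bytes, &addr->sin_port, sizeof(bytes));
    return bytes[0];
}
```

For example, port 12345 is 0x3039, so after `htons` the first byte in memory is 0x30 regardless of the host's own endianness.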

`close_listen`

This forcibly shuts down the listening socket and then releases it.

static void close_listen(struct socket *sock)
{
    kernel_sock_shutdown(sock, SHUT_RDWR);
    sock_release(sock);
}

`echo_server[ch]`

These files use many structures, and several of them are wrapped into new structures.

Base structures:

socket
list_head
work_struct

Wrapped into new structures:

echo_server_param
echo_service
kecho

`get_request`

Two more structures are used here: struct msghdr and struct kvec.

Most of this function is initialization. The instructor also left a comment next to the printk:

/*
 * TODO: during benchmarking, such printk() is useless and lead to worse
 * result. Add a specific build flag for these printk() would be good.
 */

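It reminds us that printing this message is quite time-consuming, so a debug_mode macro could control whether it is printed.

After the message with the module name is printed, receiving begins: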

length = kernel_recvmsg(sock, &msg, &vec, size, size, msg.msg_flags);

The return value is the number of bytes received.

`send_request`

This function is similar to get_request, except that one receives and the other sends.

length = kernel_sendmsg(sock, &msg, &vec, 1, size);

The return value is the number of bytes sent.
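The msghdr-plus-vector pattern has a direct user-space counterpart: `recvmsg` and `sendmsg` take a `struct msghdr` that points at `struct iovec` entries, and `struct kvec` plays the role of `struct iovec` for kernel buffers. The following is a sketch of that analogue, not the module's code; all `demo_` names are hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical user-space counterpart of get_request(): receive up to
 * `size` bytes into `buf`; returns the byte count or -1 on error. */
static ssize_t demo_get_request(int fd, void *buf, size_t size)
{
    struct iovec vec = { .iov_base = buf, .iov_len = size };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &vec; /* one buffer ... */
    msg.msg_iovlen = 1; /* ... described by one iovec */
    return recvmsg(fd, &msg, 0);
}

/* Hypothetical counterpart of send_request(): send `size` bytes. */
static ssize_t demo_send_request(int fd, void *buf, size_t size)
{
    struct iovec vec = { .iov_base = buf, .iov_len = size };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &vec;
    msg.msg_iovlen = 1;
    return sendmsg(fd, &msg, 0);
}

/* Receive one message on `fd` and send it straight back; returns the
 * number of bytes echoed, 0 on end of stream, or -1 on error. */
static ssize_t demo_echo_once(int fd, char *buf, size_t size)
{
    ssize_t n = demo_get_request(fd, buf, size);

    if (n <= 0)
        return n;
    return demo_send_request(fd, buf, (size_t) n);
}
```

A `socketpair` is a convenient way to exercise these helpers without opening a real network port: write to one end, call `demo_echo_once` on the other, and read the reply back.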

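`echo_server_worker`

This function repeatedly receives and sends messages for as long as the daemon has not been stopped.

It first uses container_of to convert the struct work_struct pointer into a struct kecho pointer, then uses kzalloc to allocate buf and zero-initialize it.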
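The pointer conversion performed by container_of can be illustrated with a simplified user-space version. The kernel macro also performs type checking, which is omitted here; `demo_container_of` and the two structures are hypothetical stand-ins for the real definitions.

```c
#include <stddef.h>

/* Simplified user-space rendition of container_of(): subtract the
 * member's offset from the member's address to reach the enclosing
 * structure. */
#define demo_container_of(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for struct work_struct and struct kecho. */
struct demo_work {
    int pending;
};

struct demo_kecho {
    int sock_fd;
    struct demo_work kecho_work;
};

/* Recover the enclosing demo_kecho from a pointer to its work member,
 * as a worker function does with its work_struct argument. */
static struct demo_kecho *demo_to_kecho(struct demo_work *work)
{
    return demo_container_of(work, struct demo_kecho, kecho_work);
}
```

This is why the work function only needs the work pointer as its argument: everything else about the connection can be reached from the enclosing structure.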

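Note that buf has a size of BUF_SIZE (4096). If the input exceeds that many characters, an error is reported and the connection ends, which is why the instructor suggested pasting a block of text as a test when explaining the assignment.

The received string is then sent back unchanged, and buf is refilled with 0 after each round: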

while (!daemon.is_stopped) {
    int res = get_request(worker->sock, buf, BUF_SIZE - 1);
    if (res <= 0) {
        if (res) {
            printk(KERN_ERR MODULE_NAME ": get request error = %d\n", res);
        }
        break;
    }

    res = send_request(worker->sock, buf, res);
    if (res < 0) {
        printk(KERN_ERR MODULE_NAME ": send request error = %d\n", res);
        break;
    }

    memset(buf, 0, res);
}

Once daemon.is_stopped is set, the loop stops receiving, the client socket is shut down, and finally buf is freed:

kernel_sock_shutdown(worker->sock, SHUT_RDWR);
kfree(buf);

`create_work`

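This function creates the work item that will be queued on the workqueue for asynchronous execution.

It first allocates memory for a struct kecho, assigns the passed-in sock to its sock member, and then uses the INIT_WORK macro to initialize its kecho_work member, registering echo_server_worker as the function to run.

Finally, the item is added to the daemon.worker linked list using the kernel list API familiar from lab0-c.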

`free_work`

This function releases all the work items in the workqueue.

list_for_each_entry_safe (tar, l, &daemon.worker, list) {
    kernel_sock_shutdown(tar->sock, SHUT_RDWR);
    flush_work(&tar->kecho_work);
    sock_release(tar->sock);
    kfree(tar);
}

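The steps are:

Shut down the socket so it stops receiving
Wait for the current work to finish
Release the socket
Free the kecho structure

This uses list_for_each_entry_safe, the API from lab0-c for traversing every node.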
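The reason the "_safe" iterator is needed can be seen in a small user-space sketch of the same pattern: nodes are linked in on creation and later freed while the list is being walked. The list type and every `demo_` name below are hypothetical simplifications, not the kernel list API itself.

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, modeled on the kernel list API. */
struct demo_list {
    struct demo_list *prev, *next;
};

/* Hypothetical stand-in for struct kecho. */
struct demo_worker {
    int id;
    struct demo_list list;
};

static void demo_list_init(struct demo_list *head)
{
    head->prev = head->next = head;
}

/* Insert `node` right after `head`, like list_add(). */
static void demo_list_add(struct demo_list *node, struct demo_list *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* Allocate a worker and link it into the list, like create_work(). */
static struct demo_worker *demo_create_worker(struct demo_list *head, int id)
{
    struct demo_worker *w = malloc(sizeof(*w));

    if (!w)
        return NULL;
    w->id = id;
    demo_list_add(&w->list, head);
    return w;
}

/* Free every worker. The next pointer is saved before the current node
 * is released, which is what the "_safe" iterator provides. Returns the
 * number of workers freed. */
static int demo_free_workers(struct demo_list *head)
{
    int freed = 0;
    struct demo_list *pos = head->next, *next;

    while (pos != head) {
        next = pos->next;
        free((char *) pos - offsetof(struct demo_worker, list));
        freed++;
        pos = next;
    }
    demo_list_init(head);
    return freed;
}
```

Without the saved `next` pointer, the loop would have to read `pos->next` from memory that has already been freed.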

`echo_server_daemon`

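This function implements the background server: a while loop keeps accepting connection requests. If accepting fails, it checks whether a SIGKILL or SIGTERM signal was received; if so, the while loop ends. Otherwise it prints an error message and goes on to accept the next connection request.

If the connection is accepted successfully, queue_work adds the work to the workqueue.

Finally, once SIGKILL or SIGTERM is received, the background thread finishes and all work structures are freed.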

Performance testing

$ make 
$ sudo insmod kecho.ko
$ ./bench

This produces the performance plot for the kernel version.

Next, the user-echo-server version:

$ make
$ ./user-echo-server

## another terminal
$ ./bench
$ make plot

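The performance gap is huge!

Next, a test with every printk commented out.

Without the I/O caused by these messages, performance improves further, but debugging becomes harder. I think a define such as -DDEBUG could be added at compile time, with the source modified as follows: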

#ifdef DEBUG
    printk(MODULE_NAME ": start get response\n");
#endif

This makes it easy to toggle the debug-mode output.
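Instead of wrapping every call site in #ifdef, the switch can be folded into one macro. The sketch below shows the idea in user space with `fprintf`; `demo_debug` and `demo_handle` are hypothetical names. In kernel code, `pr_debug()` offers comparable behavior, since it produces no output unless DEBUG is defined or dynamic debug is enabled.

```c
#include <stdio.h>

/* Hypothetical logging macro: expands to a real print only when the
 * file is compiled with -DDEBUG, and to nothing otherwise. The first
 * argument must be a string literal so it can be concatenated. */
#ifdef DEBUG
#define demo_debug(...) fprintf(stderr, "kecho: " __VA_ARGS__)
#else
#define demo_debug(...) do { } while (0)
#endif

/* Example caller: the log line disappears from non-debug builds, but
 * the function's result is the same either way. */
static int demo_handle(int length)
{
    demo_debug("start get response, length = %d\n", length);
    return length;
}
```

With this arrangement the call sites stay uncluttered, and benchmarking builds pay no cost for the messages.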

Learning CMWQ

The documentation explains where the original wq fell short:

In the original wq implementation, a multi threaded (MT) wq had one worker thread per CPU and a single threaded (ST) wq had one worker thread system-wide. A single MT wq needed to keep around the same number of workers as the number of CPUs. The kernel grew a lot of MT wq users over the years and with the number of CPU cores continuously rising, some systems saturated the default 32k PID space just booting up.

Although MT wq wasted a lot of resource, the level of concurrency provided was unsatisfactory. The limitation was common to both ST and MT wq albeit less severe on MT. Each wq maintained its own separate worker pool. An MT wq could provide only one execution context per CPU while an ST wq one for the whole system. Work items had to compete for those very limited execution contexts leading to various problems including proneness to deadlocks around the single execution context.

It then presents CMWQ as a reimplementation:

Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with focus on the following goals.

Maintain compatibility with the original workqueue API.

Use per-CPU unified worker pools shared by all wq to provide flexible level of concurrency on demand without wasting a lot of resource.

Automatically regulate worker pool and level of concurrency so that the API users don’t need to worry about such details.

It goes on to explain the design:

In order to ease the asynchronous execution of functions a new abstraction, the work item, is introduced.

A work item is a simple struct that holds a pointer to the function that is to be executed asynchronously. Whenever a driver or subsystem wants a function to be executed asynchronously it has to set up a work item pointing to that function and queue that work item on a workqueue.

Special purpose threads, called worker threads, execute the functions off of the queue, one after the other. If no work is queued, the worker threads become idle. These worker threads are managed in so called worker-pools.

The cmwq design differentiates between the user-facing workqueues that subsystems and drivers queue work items on and the backend mechanism which manages worker-pools and processes the queued work items.

There are two worker-pools, one for normal work items and the other for high priority ones, for each possible CPU and some extra worker-pools to serve work items queued on unbound workqueues - the number of these backing pools is dynamic.
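The work item idea quoted above can be sketched in a few lines of user-space code: an item is a function pointer plus an argument, and a worker runs queued items one after the other. This is a single-threaded illustration under simplifying assumptions (fixed-size queue, no worker pools, no concurrency); all `demo_` names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical work item: a function pointer plus its argument. */
struct demo_work_item {
    void (*func)(void *arg);
    void *arg;
};

#define DEMO_QUEUE_CAP 8

/* Hypothetical fixed-size FIFO standing in for a workqueue. */
struct demo_workqueue {
    struct demo_work_item items[DEMO_QUEUE_CAP];
    size_t head, count;
};

/* Queue a function for later execution; returns 0 on success or -1
 * when the queue is full. */
static int demo_queue_work(struct demo_workqueue *wq,
                           void (*func)(void *), void *arg)
{
    size_t tail;

    if (wq->count == DEMO_QUEUE_CAP)
        return -1;
    tail = (wq->head + wq->count) % DEMO_QUEUE_CAP;
    wq->items[tail].func = func;
    wq->items[tail].arg = arg;
    wq->count++;
    return 0;
}

/* What a single worker does: run queued functions in FIFO order until
 * the queue is empty. Returns the number of items executed. */
static int demo_run_worker(struct demo_workqueue *wq)
{
    int executed = 0;

    while (wq->count > 0) {
        struct demo_work_item item = wq->items[wq->head];

        wq->head = (wq->head + 1) % DEMO_QUEUE_CAP;
        wq->count--;
        item.func(item.arg);
        executed++;
    }
    return executed;
}

/* Two example work functions that make the execution order visible. */
static void demo_double(void *arg)
{
    *(int *) arg *= 2;
}

static void demo_add_one(void *arg)
{
    *(int *) arg += 1;
}
```

Queuing `demo_double` and then `demo_add_one` on a value of 1 yields 3, which shows that items run in the order they were queued. The real cmwq separates the user-facing queue from the worker pools that execute the items, which this sketch does not attempt to model.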

It also provides APIs for programmers, such as those used in kecho:

API
- alloc_workqueue
- destroy_workqueue
flags
- WQ_UNBOUND

Next, I plan to bring CMWQ into khttpd.

References

torvalds/linux
