KTCP

contributed by < csotaku0926 >

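Self-check list

How to test a web server's performance and tune it for multi-core processors

The classic approach is to stress-test the server with ab (ApacheBench, the Apache HTTP server benchmarking tool).

For example, to measure the sehttpd server,
consider the following command: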

$ ab -n 10000 -c 500 -k http://127.0.0.1:8081/

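-n : the total number of requests to send during the benchmarking session (10000 here)
-c : the number of requests issued at the same time (concurrency)
-k : enable HTTP "Keep Alive", i.e. perform multiple requests within one HTTP session

Part of the measured results: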

Time taken for tests:   0.905 seconds
Complete requests:      10000
Failed requests:        0
Keep-Alive requests:    10000
Total transferred:      4180000 bytes
HTML transferred:       2410000 bytes
Requests per second:    11053.58 [#/sec] (mean)
Time per request:       45.234 [ms] (mean)
Time per request:       0.090 [ms] (mean, across all concurrent requests)
Transfer rate:          4512.11 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.8      0       6
Processing:     2   44  10.2     44      61
Waiting:        0    2   1.5      1       9
Total:          5   44   9.4     44      61
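The headline numbers in this report are tied together by simple arithmetic. The sketch below recomputes them from the request count, the concurrency level and the elapsed time; the function name `ab_summary` is made up for illustration, and the results are only expected to be close to what ab prints because ab works from a more precise elapsed time than the rounded 0.905 seconds.

```python
def ab_summary(total_requests, concurrency, elapsed_s):
    """Recompute ab's headline numbers from the raw counts."""
    rps = total_requests / elapsed_s
    # Mean time for one request as seen by one of the concurrent clients.
    per_request_ms = concurrency * elapsed_s * 1000 / total_requests
    # Mean time per request averaged across all concurrent requests.
    per_request_all_ms = elapsed_s * 1000 / total_requests
    return rps, per_request_ms, per_request_all_ms


# Values taken from the ab run above: -n 10000 -c 500, 0.905 seconds.
rps, per_req, per_req_all = ab_summary(10000, 500, 0.905)
print(f"{rps:.0f} req/s, {per_req:.2f} ms, {per_req_all:.4f} ms")
```

The two "Time per request" lines therefore differ exactly by the concurrency factor of 500.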

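However, ab cannot reflect multi-threaded behavior, because ab itself already uses 100% of a single core.
wrk is therefore used instead, since it can measure multi-core scenarios.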

$ wrk -t8 -c400 -d30s http://127.0.0.1:8081

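wrk can start multiple threads, and the thread count can be adjusted to the number of CPUs on the test machine. The sample output below was recorded from a 2-second run with 8 threads and 500 connections: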

Running 2s test @ http://127.0.0.1:8081
  8 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.23ms    0.97ms  15.18ms   91.45%
    Req/Sec     8.54k     0.89k   18.95k    95.00%
  135956 requests in 2.03s, 47.84MB read
Requests/sec:  67050.39
Transfer/sec:     23.60MB

Note that the transfer rates measured by ab (4512.11 Kbytes/sec, about 4.4 MB/s) and wrk (23.60 MB/s) differ by roughly 19 MB/s.
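To make the gap between the two transfer rates concrete, the sketch below converts ab's figure to the unit wrk prints. It assumes both tools use 1024-based units, and it compares two runs that used different durations and tools, so the ratio is only a rough indication.

```python
KIB_PER_MIB = 1024  # assumption: both tools report 1024-based units

ab_rate_kib_s = 4512.11   # "Transfer rate" reported by ab, in Kbytes/sec
wrk_rate_mib_s = 23.60    # "Transfer/sec" reported by wrk, in MB

ab_rate_mib_s = ab_rate_kib_s / KIB_PER_MIB
difference = wrk_rate_mib_s - ab_rate_mib_s
ratio = wrk_rate_mib_s / ab_rate_mib_s
print(f"ab: {ab_rate_mib_s:.2f} MB/s, wrk: {wrk_rate_mib_s:.2f} MB/s")
print(f"difference: {difference:.2f} MB/s, ratio: {ratio:.1f}x")
```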
The htstress tool can be used as well.

In addition, after sehttpd had been running for a long time, the following error message appeared:

[ERROR] (src/http.c:32: errno: Broken pipe) errno == 32
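On Linux, errno 32 is EPIPE. One common way to trigger it is writing to a connection whose peer has already closed, for example when a benchmarking client exits before the response is sent. Whether that is what happened to sehttpd here is an assumption; the sketch below only reproduces the error class in user space with a socket pair, and the function name is made up for illustration.

```python
import errno
import os
import socket


def write_after_peer_close():
    """Send on a socket whose peer has already closed; return the errno seen."""
    ours, peer = socket.socketpair()
    peer.close()  # the "client" goes away first
    try:
        ours.send(b"late response")
    except OSError as exc:
        return exc.errno
    finally:
        ours.close()
    return None


err = write_after_peer_close()
print(err, os.strerror(err) if err is not None else "no error")
```

Python ignores SIGPIPE by default, which is why the failure surfaces as an exception carrying EPIPE instead of terminating the process.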

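Study "Observing operating system behavior through eBPF" (透過 eBPF 觀察作業系統行為): how can eBPF be used to measure the execution cost of key kthread / CMWQ operations?

A brief introduction to eBPF

Dynamic tracing is non-intrusive: it obtains the needed information without changing how the system works internally.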

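Traditional packet filtering has to copy packets from the kernel (where they first arrive from the network card) into user space. The core idea of BPF is to let the user hand the kernel an extra filter program that tells it which packets to filter.
The benefit is clear: packets can be filtered as soon as they enter kernel space, so useless packets never travel through the network stack up to the application layer.

Taking packet capture as an example, tcpdump uses libpcap to translate the filter expression and sends it to the BPF module in the kernel, which then returns only the matching packets to tcpdump.

eBPF lets users write programs that attach to the Linux kernel efficiently, for purposes such as listening for events and tracing system calls.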

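What is the execution cost of a kthread? And of CMWQ?

When we talk about measuring the execution cost of a kthread, the focus is on how long it takes to run, which can be measured with eBPF.

khttpd performance measurement

The BPF code used for the measurement: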

#include <uapi/linux/ptrace.h>

BPF_HASH(start, u64);

int BPF_kprobe(struct pt_regs *ctx)
{
	u64 ts = bpf_ktime_get_ns();
	bpf_trace_printk("in %llu\\n",ts);
	return 0;
}

int BPF_kretprobe(struct pt_regs *ctx)
{
	u64 ts = bpf_ktime_get_ns();
	bpf_trace_printk("out %llu\\n",ts);
	return 0;
}

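BPF_HASH(start, u64) (ref) creates a hash table named start whose key type is u64; the value type is not given, so it defaults to u64, which is enough to hold a timestamp. The snippet above declares the table but does not use it yet.

kprobe lets users define their own callback function and dynamically insert a probe into most kernel functions and modules.
kretprobe fires when the probed function returns, so it can be used to obtain that function's return value.

Inside the callbacks, bpf_trace_printk writes its log to /sys/kernel/debug/tracing/trace_pipe.
The result can be read with trace_print from the Python bcc.BPF module.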

The problem is that kthread_run is not one of the symbols listed in /proc/kallsyms; it is a macro.

Its definition in include/linux/kthread.h:

#define kthread_run(threadfn, data, namefmt, ...)			   \
({									  \
	struct task_struct *__k					   \
		= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
	if (!IS_ERR(__k))						   \
		wake_up_process(__k);					   \
	__k;								   \
})

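Therefore, to measure the time cost of kthread_run in khttpd, it has to be wrapped in an extra function, my_kthread_wrapper.

However, eBPF kprobes attach to kernel functions that exist as symbols (such as those listed in /proc/kallsyms), so how can this wrapper be measured? At this point my_kthread_wrapper could not be probed, whereas khttpd's thread function http_server_worker could.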

int my_kthread_wrapper(struct socket *socket, struct task_struct *worker)
{
    // printk(KERN_INFO "my_kthread_wrapper called");
    // worker is passed by value, so this assignment is not visible to the caller
    worker = kthread_run(http_server_worker, socket, KBUILD_MODNAME);
    return IS_ERR(worker);
}

int http_server_daemon(void *arg)
{
    ...
        // kthread_run(http_server_worker, socket, KBUILD_MODNAME);
        if (my_kthread_wrapper(socket, worker)) {
            pr_err("can't create more worker process\n");
            continue;
        }
    }
    return 0;
}

The BCC Python code used for the measurement:

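from bcc import BPF

# code holds the BPF C program shown above as a string
b = BPF(text=code)
b.attach_kprobe(event="http_server_worker", fn_name="BPF_kprobe")
b.attach_kretprobe(event="http_server_worker", fn_name="BPF_kretprobe")

while True:
	# trace_fields() returns (task, pid, cpu, flags, ts, msg)
	res = b.trace_fields()
	print(res[5].decode())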
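The probes only print "in" and "out" timestamps, so the durations still have to be derived in user space. A minimal sketch of that pairing step follows, assuming each call can be identified by the pid reported alongside the message; `pair_durations` and the sample events are made up for illustration and are not part of the original measurement script.

```python
def pair_durations(events):
    """Turn (pid, message) trace events into per-call durations in ns.

    `events` is any iterable of (pid, msg) tuples where msg looks like
    "in <timestamp>" or "out <timestamp>", as printed by the probes.
    """
    start = {}      # pid -> entry timestamp, similar to a hash keyed by pid
    durations = []
    for pid, msg in events:
        kind, _, value = msg.partition(" ")
        ts = int(value)
        if kind == "in":
            start[pid] = ts
        elif kind == "out" and pid in start:
            durations.append(ts - start.pop(pid))
    return durations


# Made-up sample events, only to show the shape of the data.
sample = [(101, "in 1000"), (102, "in 1500"), (101, "out 4000"), (102, "out 2500")]
print(pair_durations(sample))  # [3000, 1000]
```

With BCC, the pid and message would come from fields 1 and 5 of `b.trace_fields()`, with the message decoded from bytes first. Keying by pid keeps interleaved calls from different worker threads apart.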

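According to classmate vax-r, this may be due to compiler optimization: the wrapper is optimized away (for example inlined) and its body is executed directly.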
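One way to check this hypothesis is to look for the wrapper in /proc/kallsyms after loading the module: a function that was inlined leaves no symbol to attach a probe to. The sketch below searches text in kallsyms format; the helper name and the sample lines (with placeholder addresses) are made up for illustration.

```python
def find_symbol(kallsyms_lines, name):
    """Return the kallsyms entries whose symbol name is exactly `name`."""
    matches = []
    for line in kallsyms_lines:
        fields = line.split()
        # Format: <address> <type> <symbol> [optional "[module]"]
        if len(fields) >= 3 and fields[2] == name:
            matches.append(line.strip())
    return matches


# Made-up excerpt in /proc/kallsyms format (addresses are placeholders).
sample = [
    "0000000000000000 t http_server_worker\t[khttpd]",
    "0000000000000000 t http_server_daemon\t[khttpd]",
]
print(find_symbol(sample, "http_server_worker"))
print(find_symbol(sample, "my_kthread_wrapper"))
```

On a real system the lines would come from `open("/proc/kallsyms")`. The exact-match comparison misses compiler-generated variants with suffixes such as `.constprop.0`, so a prefix match may be worth trying as well.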

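The initial workaround is to make the wrapper do a little more work, including a memory allocation, so that ftrace can capture it.

The revised my_kthread_wrapper is shown below; this time it could be captured.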

commit 537ca0a

struct task_struct *my_kthread_wrapper(struct socket *socket)
{
    // dummy kmalloc
+    char* buf = kmalloc(1, GFP_KERNEL);
+    if (!buf) {
+        pr_err("kmalloc\n");
+        return NULL;
+    }
+    kfree(buf);
    
    // real code
    return kthread_run(http_server_worker, socket, KBUILD_MODNAME);
}

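After the eBPF timing measurements are collected, the results are plotted with gnuplot.

The cost of creating a thread with kthread_run mostly falls below 200 us.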
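A claim like "mostly below 200 us" can be quantified before plotting. The sketch below computes the share of samples under a threshold and formats rows that gnuplot can read; both helper names and the sample durations are made up for illustration and are not the measured data.

```python
def fraction_below(durations_ns, threshold_us):
    """Share of samples whose duration is below threshold_us microseconds."""
    if not durations_ns:
        return 0.0
    below = sum(1 for d in durations_ns if d / 1000 < threshold_us)
    return below / len(durations_ns)


def to_gnuplot_rows(durations_ns):
    """Format samples as "<index> <microseconds>" rows for gnuplot."""
    return [f"{i} {d / 1000:.3f}" for i, d in enumerate(durations_ns)]


# Made-up durations in nanoseconds, not measured data.
sample = [120_000, 95_000, 180_000, 260_000]
print(fraction_below(sample, 200))
print("\n".join(to_gnuplot_rows(sample)))
```

Writing the rows to a file gives gnuplot a two-column data set (sample index, duration in microseconds).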

Performance comparison of `kthread` and CMWQ in `kecho`

Stress-test command:

$ ./htstress http://127.0.0.1:8081/ -c 1 -t 4 -n 100000

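The kthread-based version creates a kthread only after a connection request arrives,
whereas a CMWQ workqueue schedules worker threads according to how the tasks are running and follows the thread pool idea of creating threads in advance.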
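The difference between the two models can be illustrated in user space. This is only an analogy, not kernel code: the sketch below counts how many distinct threads end up serving the same set of "connections" when a thread is created per connection versus when a bounded pool is reused. Note that Python's ThreadPoolExecutor creates its workers lazily up to the limit, which is an approximation of a pool whose threads exist in advance.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle(conn_id):
    """Stand-in for per-connection work; returns the thread that ran it."""
    return threading.current_thread().name


def thread_per_connection(conn_ids):
    """Create a fresh thread for every connection (kthread-based style)."""
    names = []
    lock = threading.Lock()

    def run(conn_id):
        name = handle(conn_id)
        with lock:
            names.append(name)

    threads = [threading.Thread(target=run, args=(c,)) for c in conn_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return names


def pooled(conn_ids, workers=4):
    """Reuse a bounded set of workers (thread pool style)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle, conn_ids))


conns = range(100)
print(len(set(thread_per_connection(conns))), "distinct threads without a pool")
print(len(set(pooled(conns))), "distinct threads with a pool")
```

The per-connection variant pays the thread creation cost once per request, which is the cost the pool avoids.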

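The measurement first builds the kernel module (.ko) with make, then runs ./bench for the stress test.
Finally the results are plotted with gnuplot.

According to the official documentation: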

WQ_UNBOUND
Work items queued to an unbound wq are served by the special worker-pools which host workers which are not bound to any specific CPU.

WQ_UNBOUND leaves workers unbound from any specific CPU.
To test the effect of locality, the bench parameter is set to true.

user_echo_server (kthread-based)

kecho (CMWQ-based)
- bench=true
- bench=false

Contrary to expectation, bench=false looks more stable: most of the data points are concentrated in one place.

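Assignment requirements

khttpd

Introduce CMWQ, analyze the performance and propose improvements

According to the article on improving functionality and performance (改進功能與效能), the CMWQ implementation benefits from locality and from threads prepared in advance.
When facing a large number of connections, the advantage of CMWQ over the kthread-based version becomes even more pronounced.

First, allocate the workqueue in init_module.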

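Use Ftrace to find the performance bottleneck of the khttpd kernel module, and learn how to design the related experiments.

Read together with Chapter 6 of "Demystifying the Linux CPU Scheduler".

Ftrace is a dynamic tracing tool built into the kernel that can trace functions, events and more.

Ftrace is configured by writing to the files under /sys/kernel/debug/tracing.

For example, available_filter_functions lists the functions in the khttpd kernel module that can be traced: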

$ sudo insmod khttpd.ko
$ sudo cat /sys/kernel/debug/tracing/available_filter_functions | grep khttpd
parse_url_char [khttpd]
http_message_needs_eof [khttpd]
http_should_keep_alive [khttpd]
http_parser_execute [khttpd]
http_method_str [khttpd]
http_status_str [khttpd]
...

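Next, try tracing the http_server_worker function.

current_tracer : sets or shows the tracer in use, e.g. function or function_graph
set_ftrace_filter : the functions to trace (only these are traced)
set_graph_function : the functions whose call graph should be shown
tracing_on : sets or shows whether the tracer writes to the ring buffer
max_graph_depth : the maximum depth the function graph tracer descends into called kernel functions
trace : holds the trace output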

#!/bin/bash
TRACE_DIR=/sys/kernel/debug/tracing
TARGET=http_server_worker

# clear file
echo 0 > $TRACE_DIR/tracing_on 
echo > $TRACE_DIR/set_graph_function 
echo > $TRACE_DIR/set_ftrace_filter 
echo nop > $TRACE_DIR/current_tracer 
echo > $TRACE_DIR/trace

# settings
echo function_graph > $TRACE_DIR/current_tracer 
echo 3 > $TRACE_DIR/max_graph_depth
echo $TARGET > $TRACE_DIR/set_graph_function

# execute
echo 1 > $TRACE_DIR/tracing_on
../htstress http://127.0.0.1:8081/ -n 2000
echo 0 > $TRACE_DIR/tracing_on

# output file
cat $TRACE_DIR/trace > trace.txt

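Finally, the trace results are available in $TRACE_DIR/trace (the script also copies them to trace.txt).

They show how much time http_server_worker spends in each function it calls.
An excerpt: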

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
  7)               |  http_server_worker [khttpd]() {
  7)               |    kernel_sigaction() {
  7)   0.138 us    |      _raw_spin_lock_irq();
  7)   0.111 us    |      _raw_spin_unlock_irq();
  7)   0.694 us    |    }
  7)               |    kernel_sigaction() {
  7)   0.090 us    |      _raw_spin_lock_irq();
  7)   0.087 us    |      _raw_spin_unlock_irq();
  7)   0.418 us    |    }
  7)               |    kmalloc_trace() {
  7)   0.490 us    |      __kmem_cache_alloc_node();
  7)   0.873 us    |    }
  7)   0.097 us    |    http_parser_init [khttpd]();
  7)   0.089 us    |    kthread_should_stop();
  7)               |    http_server_recv.constprop.0 [khttpd]() {
  7)   2.440 us    |      kernel_recvmsg();
  7)   2.611 us    |    }
  7)               |    kernel_sock_shutdown() {
  7) + 34.380 us   |      inet_shutdown();
  7) + 34.707 us   |    }
  7)               |    sock_release() {
  7)   3.229 us    |      inet_release();
  7)   0.097 us    |      module_put();
  7)   3.064 us    |      iput();
  7)   6.840 us    |    }
  7)               |    kfree() {
  7)   0.186 us    |      __kmem_cache_free();
  7)   0.491 us    |    }
  7) + 48.728 us   |  }
 ...
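In this excerpt inet_shutdown takes 34.380 us of the 48.728 us total, roughly 70% of the call. Finding such hot spots by eye does not scale to a full trace, so a small parser can rank the leaf calls instead. The sketch below is one possible approach; the regular expression is written against the line format shown above and the helper name is made up for illustration.

```python
import re

# Matches leaf calls such as "  7) + 34.380 us   |      inet_shutdown();"
LEAF = re.compile(
    r"^\s*\d+\)\s+(?:[+!#*@$]\s+)?"           # CPU column and optional marker
    r"(?P<us>\d+\.\d+) us\s+\|\s+"            # duration in microseconds
    r"(?P<func>[\w.]+)(?: \[\w+\])?\(\);"     # function name, optional module
)


def slowest_leaf_calls(trace_lines, top=3):
    """Return the `top` slowest leaf calls as (function, microseconds)."""
    calls = []
    for line in trace_lines:
        m = LEAF.match(line)
        if m:
            calls.append((m.group("func"), float(m.group("us"))))
    return sorted(calls, key=lambda c: c[1], reverse=True)[:top]


# Lines copied from the trace excerpt above.
excerpt = """\
  7)   0.138 us    |      _raw_spin_lock_irq();
  7)   0.097 us    |    http_parser_init [khttpd]();
  7)   2.440 us    |      kernel_recvmsg();
  7) + 34.380 us   |      inet_shutdown();
  7) + 34.707 us   |    }
  7)   3.229 us    |      inet_release();
  7)   3.064 us    |      iput();
""".splitlines()
print(slowest_leaf_calls(excerpt))
```

Closing-brace lines, which carry the total of a nested call, are skipped on purpose so that time is not counted twice.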

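Miscellaneous notes

kecho is a TCP server implemented as a Linux kernel module.
telnet is an application-layer protocol that communicates with remote servers over TCP/IP.

seHTTPd is a high-performance web server.

When the modules are loaded, khttpd and kecho differ in that the latter calls the CMWQ function alloc_workqueue.

open_listen_socket configures the socket and the TCP connection options:
TCP_NODELAY disables Nagle's algorithm
TCP_CORK gathers small pieces of data into full packets before sending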
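The same options exist in user space, which makes them easy to experiment with. The sketch below sets them on an ordinary listening socket; it is a user-space analogy, not the khttpd code, and the function name, backlog and the values passed to each option are assumptions.

```python
import socket


def make_listen_socket(port=0, cork=False):
    """Create a TCP listening socket with the options discussed above."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Disable Nagle's algorithm so small writes are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # TCP_CORK is Linux-specific; skip it where the constant is missing.
    if hasattr(socket, "TCP_CORK"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, int(cork))
    sock.bind(("127.0.0.1", port))  # port 0 lets the OS pick a free port
    sock.listen(128)
    return sock


listener = make_listen_socket()
print(listener.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
listener.close()
```

TCP_NODELAY favors latency by sending small segments at once, while TCP_CORK favors throughput by holding data back until a full packet is ready, so which one to enable depends on the traffic pattern.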

Both the server itself and every connection get a new thread created through kthread.

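After the socket is created, kthread_run is called to run http_server_daemon.
The logic is similar to kecho:
First, allow_signal registers the signals to accept, SIGKILL and SIGTERM.
Then kthread_should_stop decides whether the thread running http_server_daemon should stop.
kernel_accept accepts connections; for each one that succeeds, kthread_run creates a new thread that runs http_server_worker.

The http_server_worker function does the following:

Set up the callback functions, which send the response data back to the client
Enter a loop, again using kthread_should_stop to decide whether to stop
Receive the data sent by the client
Parse it with http_parser_execute and send the response to the client
Finally, release the memory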

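kecho, on the other hand, handles connections with create_work and queue_work, which operate on struct work_struct work items, instead of creating a thread per connection with kthread_run as khttpd does.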

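Ways to measure performance

./htstress : HTTP server stress test
- ./htstress http://127.0.0.1:8081 -t 3 -c 20 -n 200000
- reports how many requests are handled per second
perf : analyzes how much time each function call takes in a user program or in kernel code
- tutorial
eBPF : through the Python BCC module or by compiling C code
- tutorial
ftrace : traces function calls inside the kernel
- man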
