# Linux Kernel Project: A Kernel-Mode Web Server

> Author: JimmyChongz
> [Walkthrough video](https://youtu.be/HbOBwWeDXQ4)

### Reviewed by `I-Ying-Tsai`

I see that you use `kmalloc` in `create_work()` to build an `http_request` and wrap it in a `work_struct`, with the matching release in `free_work()`. I am curious, though: if some work item runs very slowly, or the module gets unloaded while data is still being transferred, does `flush_work()` really guarantee the work has finished? Could there be a race condition on the kernel side?

> Let me trace through it. When the module is unloaded, `kthread_stop` is called so that `http_server_daemon` exits its main loop, ensuring no new work gets queued on the workqueue. Next, `daemon_list.is_stopped` is set to true to tell every worker to leave its processing loop and release its own socket resources, so each work item can finish safely. However, ==because `kernel_recvmsg` blocks by default==, a worker with no data to read stays stuck in that call and cannot check `daemon_list.is_stopped` in time, so workers cannot exit cleanly when the module is unloaded.

:::info
The fix I currently have in mind is to add a timer in `http_server_recv`: when `kernel_recvmsg` has blocked for too long, return `-ETIMEDOUT`. I am not sure whether this is feasible.

```diff
+ static struct timer_list worker_timer;
+ static struct http_request *current_worker;
+
+ static void worker_timer_callback(struct timer_list *t)
+ {
+     if (daemon_list.is_stopped && current_worker) {
+         // Forcibly interrupt the current socket operation
+         kernel_sock_shutdown(current_worker->socket, SHUT_RDWR);
+     }
+
+     // Re-arm the timer
+     if (!daemon_list.is_stopped) {
+         mod_timer(&worker_timer, jiffies + msecs_to_jiffies(1000));
+     }
+ }
  static void http_server_worker(struct work_struct *work)
  {
      struct http_request *worker =
          container_of(work, struct http_request, khttpd_work);
+     timer_setup(&worker_timer, worker_timer_callback, 0);
+     mod_timer(&worker_timer, jiffies + msecs_to_jiffies(1000));
      char *buf;
      struct http_parser parser;
      struct http_parser_settings setting = {
          .on_message_begin = http_parser_callback_message_begin,
          .on_url = http_parser_callback_request_url,
          .on_header_field = http_parser_callback_header_field,
          .on_header_value = http_parser_callback_header_value,
          .on_headers_complete = http_parser_callback_headers_complete,
          .on_body = http_parser_callback_body,
          .on_message_complete = http_parser_callback_message_complete,
      };
      allow_signal(SIGKILL);
      allow_signal(SIGTERM);
      buf = kzalloc(RECV_BUFFER_SIZE, GFP_KERNEL);
      if (!buf) {
          pr_err("can't allocate memory!\n");
          return;
      }
      http_parser_init(&parser, HTTP_REQUEST);
      parser.data = worker;
      while (!daemon_list.is_stopped) {
          int ret = http_server_recv(worker->socket, buf, RECV_BUFFER_SIZE - 1);
          if (ret <= 0) {
              if (ret)
                  pr_err("recv error: %d\n", ret);
              break;
          }
          http_parser_execute(&parser, &setting, buf, ret);
          if (worker->complete && !http_should_keep_alive(&parser))
              break;
          memset(buf, 0, RECV_BUFFER_SIZE);
      }
+     del_timer(&worker_timer);
+     current_worker = NULL;
      kernel_sock_shutdown(worker->socket, SHUT_RDWR);
      sock_release(worker->socket);
      kfree(buf);
      return;
  }
```
:::

### Reviewed by `ginsengAttack`

I tried the approach above as well, adding a timer to the worker, but performance degraded badly. Have you benchmarked this approach? If its performance also turns out poor, consider the timer implementation from the instructor's course material.
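> A timer-free alternative I am considering (a minimal sketch under my own assumptions, not tested project code) is to make the receive path non-blocking with `MSG_DONTWAIT` and poll `daemon_list.is_stopped` between attempts, so no per-worker timer is needed:

```c
/* Hypothetical http_server_recv variant: MSG_DONTWAIT makes
 * kernel_recvmsg() return -EAGAIN instead of blocking, so the loop can
 * re-check the stop flag; msleep() (from <linux/delay.h>) keeps the
 * poll from burning CPU. */
static int http_server_recv(struct socket *sock, char *buf, size_t size)
{
    struct kvec vec = {.iov_base = buf, .iov_len = size};
    struct msghdr msg = {.msg_flags = MSG_DONTWAIT};

    while (!daemon_list.is_stopped) {
        int ret = kernel_recvmsg(sock, &msg, &vec, 1, size, MSG_DONTWAIT);
        if (ret != -EAGAIN)
            return ret; /* data, orderly shutdown (0), or a real error */
        msleep(20);     /* no data yet: back off briefly, then re-check */
    }
    return -ESHUTDOWN; /* the module is unloading */
}
```

> Like the timer, this trades some latency for unloadability; the polling interval would need benchmarking against the performance concern raised above.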
### Reviewed by [`otischung`](https://github.com/otischung)

1. Please add the walkthrough video.
2. Please add the commit messages corresponding to each change, so the complete code can be cross-referenced.

> Added.

### Reviewed by `Max042004`

Before CMWQ was introduced, the server's bottleneck was the resources consumed by the large number of kernel threads waiting to be scheduled; after introducing CMWQ, what does the bottleneck become?

> Using [Ftrace](https://hackmd.io/@sysprog/linux2025-ktcp-c#使用-Ftrace-觀察-kHTTPd) to profile the program, the trace output is:

```log
 5)               |  http_server_worker [khttpd]() {
 5)               |    kernel_sigaction() {
 5)   0.328 us    |      _raw_spin_lock_irq();
 5)   0.248 us    |      _raw_spin_unlock_irq();
 5)   1.292 us    |    }
 5)               |    kernel_sigaction() {
 5)   0.229 us    |      _raw_spin_lock_irq();
 5)   0.265 us    |      _raw_spin_unlock_irq();
 5)   1.160 us    |    }
 5)               |    __kmalloc_cache_noprof() {
 5)   0.219 us    |      __cond_resched();
 5)   1.080 us    |    }
 5)   0.231 us    |    http_parser_init [khttpd]();
 5)               |    http_server_recv.constprop.0 [khttpd]() {
 5) + 26.881 us   |      kernel_recvmsg();
 5) + 27.459 us   |    }
 5)               |    http_parser_execute [khttpd]() {
 5)   0.255 us    |      http_parser_callback_message_begin [khttpd]();
 5)   0.249 us    |      parse_url_char [khttpd]();
 5)   0.295 us    |      http_parser_callback_request_url [khttpd]();
 5)   0.233 us    |      http_parser_callback_header_field [khttpd]();
 5)   0.238 us    |      http_parser_callback_header_value [khttpd]();
 5)   0.243 us    |      http_parser_callback_headers_complete [khttpd]();
 5)   0.239 us    |      http_message_needs_eof [khttpd]();
 5)   0.230 us    |      http_should_keep_alive [khttpd]();
 5) + 76.329 us   |      http_parser_callback_message_complete [khttpd]();
 5) + 81.742 us   |    }
 5)   0.247 us    |    http_should_keep_alive [khttpd]();
 5)               |    kernel_sock_shutdown() {
 5) + 39.731 us   |      inet_shutdown();
 5) + 40.764 us   |    }
 5)               |    sock_release() {
 5)   2.735 us    |      inet_release();
 5)   0.221 us    |      module_put();
 5)   4.314 us    |      iput();
 5)   8.281 us    |    }
 5)   0.313 us    |    kfree();
 5) ! 165.448 us  |  }
```

> `http_parser_callback_message_complete` accounts for 76.329 us (46% of the total), nearly half of the execution time, making it the dominant bottleneck. In the implementation it calls `http_server_response`, which in turn calls `http_server_send`; that function transmits the HTTP response to the client over the socket. Because of the TCP send buffer size limit it may need multiple `kernel_sendmsg` calls: `http_server_send` contains a while loop that exits only after all data has been sent, which is what stretches this callback's execution time. The other callbacks, by comparison, do only simple string handling or field assignment and cost almost nothing. The core problem is that network I/O sits on the critical path of request handling, so every request must wait for the network transfer to finish before it can complete.
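For reference, here is a condensed sketch of the send loop described above (reconstructed from the description, not copied verbatim from the project):

```c
/* Send-loop sketch: kernel_sendmsg() may accept only part of the
 * buffer when the TCP send buffer is full, so the loop keeps going
 * until everything is out. This is what stretches the
 * message_complete callback in the trace above. */
static int http_server_send(struct socket *sock, const char *buf, size_t size)
{
    int done = 0;
    while (done < (int) size) {
        struct msghdr msg = {.msg_flags = 0};
        struct kvec vec = {
            .iov_base = (void *) (buf + done),
            .iov_len = size - done,
        };
        int len = kernel_sendmsg(sock, &msg, &vec, 1, vec.iov_len);
        if (len < 0) {
            pr_err("write error: %d\n", len);
            break;
        }
        done += len;
    }
    return done;
}
```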
### Improvement

> TODO

### Plotting the ftrace data with gnuplot to observe the bottleneck improvement

From the analysis above, the bottleneck is the `http_parser_callback_message_complete` function.

First write the ftrace result to `trace_output.txt`:

```bash
$ sudo cat /sys/kernel/debug/tracing/trace > trace_output.txt
```

Then extract the lines containing `http_parser_callback_message_complete`:

```bash
$ grep http_parser_callback_message_complete trace_output.txt > message_complete_time.txt
```

Then extract the elapsed times:

```bash
$ grep -o '[0-9]\+\.[0-9]\+ us' message_complete_time.txt | awk '{print $1}' > message_complete_time_us.txt
```

Add a `bench.gp` script under scripts:

```bash
set terminal pngcairo size 800,400
set output 'http_parser_callback_message_complete.png'
set title "http parser callback message complete time distribution"
set xlabel "event (requests finished)"
set ylabel "time (us)"
plot 'message_complete_time_us.txt' using 0:1 with points title 'Avg. time elapsed'
```

Add to the Makefile:

```diff
+ plot:
+ 	gnuplot scripts/bench.gp
```

Run:

```bash
$ make plot
```

Before improvement:

![http_parser_callback_message_complete](https://hackmd.io/_uploads/HySmDCESgl.png)

After improvement:

## Task Overview

Following the [ktcp](https://hackmd.io/@sysprog/linux2025-ktcp/) assignment requirements, work through the assignment, be sure to submit pull requests to khttpd and have them reviewed, and make good use of eBPF-related tools along the way to trace packets and document the findings.

## Understanding how [kecho](https://github.com/JimmyChongz/kecho.git) works

### Assorted questions

- What do the numbers in parentheses in Linux manual pages mean?
  :::spoiler Explanation
  > Per the preface of the [UNIX PROGRAMMER'S MANUAL](https://dspinellis.github.io/unix-v4man/v4man.pdf), they classify the different types of documents:
  1. Commands
  2. System calls
  3. Subroutines
  4. Special files
  5. File formats
  6. User-maintained programs
  7. Miscellaneous
  8. Maintenance
  :::
- What must be considered when implementing a web server (I/O heavy) inside the Linux ==kernel==?
  :::spoiler Explanation
  > A user-space process/thread (task) has its resources managed by the Linux kernel (going through the VFS). In contrast, khttpd must itself call `kernel_recvmsg`, `kernel_sendmsg`, `kernel_sock_shutdown`, and `kernel_accept`, talking directly to the protocol stack, so the kernel thread has to handle resource reclamation on its own.
  :::
- Why does the USE_SETSOCKET macro exist?
- What is `GFP_KERNEL`?
  > Reference: [kmalloc](https://docs.kernel.org/core-api/mm-api.html#c.kmalloc)
- Why are workqueues necessary?
  :::spoiler Current understanding
  Reference: [ktcp - (C) course material](https://hackmd.io/@sysprog/linux2025-ktcp/%2F%40sysprog%2Flinux2025-ktcp-c)
  > Compared with spawning a kthread per client connection request, a workqueue avoids unbounded thread growth. If every connection got its own kthread, the system would hold a huge number of threads even though most of them sleep waiting on I/O, wasting a large amount of kernel-space memory; every kthread also has to be scheduled, so a large thread count degrades the scheduler, causing frequent context switches and a lower cache hit rate. The core advantage of a workqueue is thread reuse: it provides a worker pool with a fixed number of threads, usually equal to the number of CPU cores, and tasks are handled by these worker threads in turn, effectively capping the number of kthreads.
  :::
- How do you find the effective total number of kernel threads (i.e., how many can be created)?
  :::spoiler Current understanding
  > See [Linux kernel documentation - threads-max](https://docs.kernel.org/admin-guide/sysctl/kernel.html#threads-max)
  > This value controls the maximum number of threads that can be created using `fork()`.
  >
  > During initialization the kernel sets this value such that even if the maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages.

  That is, even at the maximum thread count, all thread structures together may occupy only $\cfrac{1}{8}$ of the available pages.

  Query the system's maximum thread count with:
  ```bash
  $ cat /proc/sys/kernel/threads-max
  ```
  Output:
  ```bash
  126359
  ```
  Why $126359$? Estimate, from a memory-allocation point of view, how many kernel threads the system can reasonably support.

  See [Kernel Stacks](https://docs.kernel.org/arch/x86/kernel-stacks.html):
  > x86_64 page size (PAGE_SIZE) is 4K.
  > Like all other architectures, x86_64 has a kernel stack for every active thread. These thread stacks are THREAD_SIZE (4*PAGE_SIZE) big. These stacks contain useful data as long as a thread is alive or a zombie. While the thread is in user space the kernel stack is empty except for the thread_info structure at the bottom.

  Check the machine's memory (`-k` means KB):
  ```bash
  $ free -k
  ```
  On my machine:
  ```bash
                 total        used        free      shared  buff/cache   available
  Mem:        16253580     4257188    10870632       74984     1488184    11996392
  Swap:        4194300           0     4194300
  ```
  With the 4 KB PAGE_SIZE of x86_64, this machine has about 4,063,395 usable pages. All threads may occupy only $\cfrac{1}{8}$ of those pages, and each thread's stack takes $4$ pages, so roughly $126,981$ threads fit, close to the `threads-max` value. (Calculation below.)
  ```txt
  16,253,580 KB / 4 KB = 4,063,395  -> number of pages
  4,063,395 / 8 ≈ 507,924           -> pages usable by all threads
  507,924 / 4 = 126,981             -> maximum number of threads
  ```
  :::
- What is the "execution context" mentioned in the [Workqueue documentation](https://docs.kernel.org/core-api/workqueue.html#)?
  :::spoiler Current understanding
  > In the kernel there are two main kinds of execution context: process context and interrupt context. When a user-space program enters kernel mode via a system call, the kernel runs the corresponding system-call code; the kernel is then running in process context. When a hardware interrupt fires, the interrupt service routine (ISR) in the corresponding driver is invoked; since driver code lives in kernel space and runs in kernel mode, the kernel is then running in interrupt context.
  :::
- What is the type `atomic_long_t` of the `data` member in `work_struct`?
- Is workqueue scheduling the same as regular process scheduling?
  > Read chapters 1 and 2 of *Demystifying the Linux CPU Scheduler* to learn how the CPU scheduler interacts with workqueues/CMWQ.
- In the Linux source, what is the `READ_ONCE` <s>function</s> macro for?
  > Its purpose is to ensure a variable read is neither optimized away by the compiler nor torn by CPU reordering, forcing the data to be read completely in one shot. (Somewhat like volatile?)
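As a concrete illustration of the `READ_ONCE` answer above (my own minimal sketch, not kernel code), consider a stop flag polled by one thread and set by another, which is exactly the role `daemon_list.is_stopped` plays later in this project:

```c
#include <linux/compiler.h>  /* READ_ONCE / WRITE_ONCE */
#include <asm/processor.h>   /* cpu_relax() */

static bool stop_flag;

/* Without READ_ONCE the compiler, seeing no writer in this function,
 * may hoist the load out of the loop and spin on a stale value.
 * READ_ONCE forces one complete, untorn load per iteration. */
static void poll_until_stopped(void)
{
    while (!READ_ONCE(stop_flag))
        cpu_relax();
}

static void request_stop(void)
{
    WRITE_ONCE(stop_flag, true); /* pairs with the READ_ONCE above */
}
```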
### Tracing the code

When kecho.ko is loaded, it runs:

- `open_listen`: creates and configures a TCP server socket, ready to accept client connections.
- `sock_create`: creates the socket, including memory allocation.
  > TODO: trace the Linux source [linux/net/socket.c](https://github.com/torvalds/linux/blob/master/net/socket.c#L1597)
  - PF_INET: the IPv4 protocol family
  - SOCK_STREAM: a TCP socket
  - IPPROTO_TCP: the TCP/IP protocol
  - to be clarified
- Set socket opt
  > Reference: the [ktcp assignment description](https://hackmd.io/@sysprog/linux2025-ktcp/%2F%40sysprog%2Flinux2025-ktcp-b), plus [socket(7) - Linux man page](https://linux.die.net/man/7/socket), [tcp(7) — Linux manual page](https://man7.org/linux/man-pages/man7/tcp.7.html), and [What is the meaning of SO_REUSEADDR (setsockopt option) - Linux?](https://stackoverflow.com/questions/3229860/what-is-the-meaning-of-so-reuseaddr-setsockopt-option-linux)

  ```c
  /* set tcp_nodelay */
  #ifdef USE_SETSOCKET
      error = sock->ops->setsockopt(sock, SOL_TCP, TCP_NODELAY, kopt,
                                    sizeof(opt));
  #else
      error = kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *) &opt,
                                sizeof(opt));
  #endif
      if (error < 0) {
          printk(KERN_ERR MODULE_NAME
                 ": setsockopt tcp_nodelay setting error = %d\n",
                 error);
          sock_release(sock);
          return error;
      }
  ```
- Set the server listening socket (see the sketch below):
  - `memset(&addr, 0, sizeof(addr))` zeroes `addr`.
  - `sockaddr_in` can be seen as a subclass of `sockaddr`; it stores more information than `sockaddr`, including the protocol, port, and host IP.
  - Network byte order is **big-endian**, so `htonl` (**h**ost **to** **n**etwork **l**ong^[32-bit]^) converts the host byte order to network byte order.
    > Reference: [htonl(3p) — Linux manual page](https://man7.org/linux/man-pages/man3/htonl.3p.html)
  - `INADDR_ANY` means the server accepts connections on all interfaces.
  - A port number is 2 bytes, hence `htons` (**h**ost **to** **n**etwork **s**hort^[16-bit]^).
  - Bind / Listen
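The address setup mentioned above looks roughly like this (an illustrative sketch; the variable names, `DEFAULT_PORT`, and the backlog value are assumptions, not kecho's verbatim code):

```c
/* Fill the address in network byte order, then bind and listen.
 * DEFAULT_PORT and the backlog of 128 are placeholders for this sketch. */
struct sockaddr_in addr;
int error;

memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;                /* IPv4 */
addr.sin_addr.s_addr = htonl(INADDR_ANY); /* accept on all interfaces */
addr.sin_port = htons(DEFAULT_PORT);      /* 16-bit port, big-endian */

error = kernel_bind(sock, (struct sockaddr *) &addr, sizeof(addr));
if (!error)
    error = kernel_listen(sock, 128);
```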
- `alloc_workqueue`: instead of the system-wide `system_wq`, a dedicated workqueue is created for this module.
  :::info
  I noticed that the kecho project does not check whether `alloc_workqueue` succeeded, so a NULL pointer could be dereferenced. Also, when `kthread_run` fails to create the kthread, the previously created `kecho_wq` should be destroyed to avoid a memory leak. I have sent a [pull request](https://github.com/sysprog21/kecho/pull/14/commits) to the kecho project.
  > Reference: [linux/kernel/workqueue.c](https://github.com/torvalds/linux/blob/master/kernel/workqueue.c#L5780)

  ```diff
  - kecho_wq = alloc_workqueue(MODULE_NAME, bench ? 0 : WQ_UNBOUND, 0);
  + if (unlikely(!(kecho_wq = alloc_workqueue(MODULE_NAME, bench ? 0 : WQ_UNBOUND, 0)))) {
  +     printk(KERN_ERR MODULE_NAME ": cannot allocate workqueue\n");
  +     close_listen(listen_sock);
  +     return -ENOMEM;
  + }
    echo_server = kthread_run(echo_server_daemon, &param, MODULE_NAME);
    if (IS_ERR(echo_server)) {
        printk(KERN_ERR MODULE_NAME ": cannot start server daemon\n");
        close_listen(listen_sock);
  +     destroy_workqueue(kecho_wq);
        return PTR_ERR(echo_server);
    }
  ```
  :::
  - Why: when a connection lives long (e.g., with telnet), kecho's work becomes CPU-bound; on the built-in `system_wq` it would occupy `system_wq` worker threads for a long time and block tasks from other kernel modules.
  - Note: every created workqueue corresponds to its own set of worker threads.
  - Using `WQ_UNBOUND`: from the flag description in the [Workqueue documentation](https://docs.kernel.org/core-api/workqueue.html#):
    > Unbound wq sacrifices locality but is useful for the following cases.
    > - Long running CPU intensive workloads which can be better managed by the system scheduler.

    `WQ_UNBOUND` avoids binding work to a particular CPU, letting it run on any available CPU (CPU migration), at the cost of cache locality, in exchange for system stability (no single CPU is occupied for long, avoiding load imbalance) => trade a little performance for stability.
- `kthread_run`: a macro that calls `kthread_create` and `wake_up_process` to create a thread that immediately runs the `echo_server_daemon` function.
  > Linux source: [include/linux/kthread.h](https://github.com/torvalds/linux/blob/master/include/linux/kthread.h)

  ```c
  #define kthread_run(threadfn, data, namefmt, ...)                      \
  ({                                                                     \
      struct task_struct *__k                                            \
          = kthread_create(threadfn, data, namefmt, ## __VA_ARGS__);     \
      if (!IS_ERR(__k))                                                  \
          wake_up_process(__k);                                          \
      __k;                                                               \
  })
  ```
- `param`: the argument passed to `echo_server_daemon`, a custom struct:
  ```c
  struct echo_server_param {
      struct socket *listen_sock;
  };
  ```
  The `listen_sock` member is the server's listening socket.
- `IS_ERR`: the Linux kernel macro for detecting whether a "pointer" falls in the error-code (ERRNO) range.
  > References:
  > - Linux source [linux/include/linux/err.h](https://github.com/torvalds/linux/blob/master/include/linux/err.h#L68)
  > - [Why return a negative errno? (e.g. return -EIO)](https://stackoverflow.com/questions/1848729/why-return-a-negative-errno-e-g-return-eio)

  ```c
  #define MAX_ERRNO 4095

  /**
   * IS_ERR_VALUE - Detect an error pointer.
   * @x: The pointer to check.
   *
   * Like IS_ERR(), but does not generate a compiler warning if result is unused.
   */
  #define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)

  /**
   * IS_ERR - Detect an error pointer.
   * @ptr: The pointer to check.
   * Return: true if @ptr is an error pointer, false otherwise.
   */
  static inline bool __must_check IS_ERR(__force const void *ptr)
  {
      return IS_ERR_VALUE((unsigned long)ptr);
  }
  ```
  Error codes in the Linux kernel are negative, ranging from $-1$ to $-4095$. On a 64-bit system, the two's-complement representation of $-4095$ is 0xFFFFFFFFFFFFF001; casting to `unsigned long` and comparing checks whether `x` lies in the error range [0xFFFFFFFFFFFFF001, 0xFFFFFFFFFFFFFFFF]: if `x` is greater than or equal to 0xFFFFFFFFFFFFF001, it is in the error-code range.
  :::info
  Why cast to `unsigned long`?
  > If we tested directly whether x lies in [0xFFFFFFFFFFFFF001, 0xFFFFFFFFFFFFFFFF], two boundary comparisons would be needed, e.g.:
  ```c
  #define IS_ERR_VALUE(x) \
      unlikely(((void *)(x) >= (void *)(-4095)) && ((void *)(x) <= (void *)(-1)))
  ```
  :::
- `echo_server_daemon`: the daemon in charge of accepting client connection requests.
- `allow_signal`: allows the current kthread to receive the `SIGTERM` and `SIGKILL` signals, e.g. sent with:
  ```bash
  $ kill -SIGTERM <PID>
  ```
  When the process receives a signal (`SIGTERM` or `SIGKILL`), `signal_pending(current)` returns `TRUE`.
  - Differences between SIGTERM and SIGKILL:
    > Based on [SIGKILL vs SIGTERM: A Developer's Guide to Process Termination](https://www.stackstate.com/blog/sigkill-vs-sigterm-a-developers-guide-to-process-termination/)

    | SIGTERM (15) | SIGKILL (9) |
    | --- | --- |
    | Terminates after cleanup and resource release | Terminates the process immediately, with no cleanup |
    | Has a chance to save the process's state | No chance to save the process's state |
    | Child processes are not terminated automatically | Children become orphans, adopted by the init process |

    Conclusion: `SIGTERM` is the preferred way to end a process because it allows a graceful shutdown, potentially saving data and releasing resources correctly; `SIGKILL` is the last resort when a process ignores `SIGTERM` or must be stopped immediately.
- `INIT_LIST_HEAD`: initializes the `echo_service` daemon as the head of a linked list; whenever a new connection arrives, a new `kecho` work item is created and added to this list.
  ```c
  struct echo_service {
      bool is_stopped;
      struct list_head worker;
  };

  struct kecho {
      struct socket *sock;
      struct list_head list;
      struct work_struct kecho_work;
  };
  ```
  Diagram:
  ![echo_service worker list](https://hackmd.io/_uploads/rkoCB_zNeg.png =70%x)
- `kthread_should_stop`: see [Driver Basics](https://docs.kernel.org/driver-api/basics.html#c.kthread_should_stop)
  > When someone calls [`kthread_stop()`](https://docs.kernel.org/driver-api/basics.html#c.kthread_stop "kthread_stop") on your kthread, it will be woken and this will return true. You should then return, and your return value will be passed through to [`kthread_stop()`](https://docs.kernel.org/driver-api/basics.html#c.kthread_stop "kthread_stop").

  "Woken" here means waking the `echo_server` thread blocked in `kernel_accept` so that it stops waiting for connections, re-checks `kthread_should_stop` (which now returns `TRUE`), exits the `while` loop, and performs the sequence of operations that shuts down the server. Note that `kthread_stop` is called at module exit, and a `SIGTERM` is sent to the `echo_server` thread beforehand, which ensures the thread terminates safely and cleanly.
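Putting the pieces together, that exit path looks roughly like this (a sketch assuming kecho-style names, not the verbatim source):

```c
/* Module exit: SIGTERM first wakes the thread blocked in
 * kernel_accept(), then kthread_stop() makes kthread_should_stop()
 * return true and waits for the daemon to finish. */
static void __exit kecho_cleanup(void)
{
    send_sig(SIGTERM, echo_server, 1); /* unblock kernel_accept() */
    kthread_stop(echo_server);         /* wake and reap the daemon */
    close_listen(listen_sock);
    destroy_workqueue(kecho_wq);
}
```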
TODO: read [Linux Kernel Design: Observing OS Behavior with eBPF](https://hackmd.io/@sysprog/linux-ebpf)

## Introducing CMWQ into khttpd

References: [ktcp - CMWQ](https://hackmd.io/@sysprog/linux2025-ktcp-c#引入-CMWQ-到-khttpd), [kecho](https://github.com/JimmyChongz/kecho.git)

> Commit: [cfe735f](https://github.com/JimmyChongz/khttpd/commit/cfe735f9387155ad943efde2b63b1bda99c08f72)

### Advantages of CMWQ over kernel threads

> Partially adapted from [ktcp - (C) course material](https://hackmd.io/@sysprog/linux2025-ktcp/%2F%40sysprog%2Flinux2025-ktcp-c)

Compared with handling each client connection on its own kernel thread, CMWQ primarily prevents unbounded thread growth. If every connection spawned a kernel thread, the system would hold a huge number of threads even though most of them sleep waiting on I/O, wasting kernel-space memory; every kernel thread is also scheduled, so a large thread count burdens the scheduler and causes frequent context switches and a lower cache hit rate. CMWQ's advantage is thread reuse, and its design follows the principle of maximum efficiency with minimal resources: as long as a CPU has any runnable worker thread, the system holds off starting a new worker (so work items may be served by different workers); only when the last running worker on that CPU goes to sleep while work items are still pending is a new worker thread started, avoiding both idle spinning and processing delay and effectively bounding the total number of kernel threads.

### Adding CMWQ to `khttpd`

In the original implementation, khttpd handles client connections with kernel threads. To switch to CMWQ, first add to http_server.h:

```c
struct httpd_service {
    bool is_stopped;       /* whether a termination signal was received */
    struct list_head head; /* head node of the list */
};
```

Then, in main.c, create a workqueue dedicated to khttpd:

```diff
+ struct workqueue_struct *khttpd_wq;
...
+ if (!(khttpd_wq = alloc_workqueue(MODULE_NAME, 0, 0))) {
+     pr_err("can't allocate workqueue\n");
+     close_listen_socket(listen_socket);
+     return -ENOMEM;
+ }
...
```

```diff
...
  http_server = kthread_run(http_server_daemon, &param, MODULE_NAME);
  if (IS_ERR(http_server)) {
      pr_err("can't start http server daemon\n");
      close_listen_socket(listen_socket);
+     destroy_workqueue(khttpd_wq);
      return PTR_ERR(http_server);
  }
...
```

In http_server.c, add the CMWQ plumbing, design the `daemon_list` linked list that manages the work items, and extend the http_request struct. The benefits of this design: when the daemon stops, it can walk `daemon_list` to reach every http_request node not yet released, ensuring correct resource reclamation; in addition, the `container_of` macro can recover the start address of the enclosing http_request struct, which not only simplifies argument passing but also improves the efficiency and safety of resource management.

```diff
+ extern struct workqueue_struct *khttpd_wq;
+ struct httpd_service daemon_list = {.is_stopped = false};

  struct http_request {
      struct socket *socket;
      enum http_method method;
      char request_url[128];
      int complete;
+     struct list_head node;
+     struct work_struct khttpd_work;
  };
```

Also add the create_work and free_work functions to manage `daemon_list`:

```c
static struct work_struct *create_work(struct socket *sk)
{
    struct http_request *work = kmalloc(sizeof(*work), GFP_KERNEL);
    if (!work)
        return NULL;
    work->socket = sk;
    INIT_WORK(&work->khttpd_work, http_server_worker);
    list_add(&work->node, &daemon_list.head);
    return &work->khttpd_work;
}

static void free_work(void)
{
    struct http_request *tar, *tmp;
    list_for_each_entry_safe (tar, tmp, &daemon_list.head, node) {
        kernel_sock_shutdown(tar->socket, SHUT_RDWR);
        flush_work(&tar->khttpd_work);
        sock_release(tar->socket);
        kfree(tar);
    }
}
```
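On the worker side, the `INIT_WORK` callback receives only a `work_struct` pointer, and `container_of` walks back to the enclosing request (this matches the worker shown in the review discussion at the top of this note):

```c
/* The callback gets &req->khttpd_work; container_of recovers the
 * http_request that embeds it, giving access to the socket. */
static void http_server_worker(struct work_struct *work)
{
    struct http_request *worker =
        container_of(work, struct http_request, khttpd_work);
    /* ... receive and parse on worker->socket as before ... */
}
```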
When `http_server_daemon` starts, it first initializes `daemon_list.head`, and each new connection request goes through `create_work` to build a work item submitted to the `khttpd_wq` workqueue created earlier. When the daemon terminates, it calls `free_work()` to release all unfinished work, ensuring every resource is reclaimed and no memory or resource leak is left behind:

```diff
  int http_server_daemon(void *arg)
  {
      struct socket *socket;
-     struct task_struct *worker;
+     struct work_struct *work;
      struct http_server_param *param = (struct http_server_param *) arg;

      allow_signal(SIGKILL);
      allow_signal(SIGTERM);
+     INIT_LIST_HEAD(&daemon_list.head);
      while (!kthread_should_stop()) {
          ...
-         worker = kthread_run(http_server_worker, socket, KBUILD_MODNAME);
-         if (IS_ERR(worker)) {
+         if (unlikely(!(work = create_work(socket)))) {
              pr_err("can't create more worker process\n");
              kernel_sock_shutdown(socket, SHUT_RDWR);
              sock_release(socket);
              continue;
          }
+         queue_work(khttpd_wq, work);
      }
+     daemon_list.is_stopped = true;
+     free_work();
      return 0;
  }
```

With CMWQ in place, existing worker threads are reused: when a connection request arrives, the work only needs to be submitted to CMWQ. Compared with creating a new kernel thread each time, this design greatly reduces thread-creation cost and improves performance under highly concurrent load.

### Performance comparison

Tested with `./htstress http://localhost:8081 -t 3 -c 20 -n 200000`; results:

- kthread-based:
  ```bash
  requests:      200000
  good requests: 200000 [100%]
  bad requests:  0 [0%]
  socket errors: 0 [0%]
  seconds:       12.917
  requests/sec:  15483.061
  ```
- CMWQ:
  ```bash
  requests:      200000
  good requests: 200000 [100%]
  bad requests:  0 [0%]
  socket errors: 0 [0%]
  seconds:       6.489
  requests/sec:  30823.589
  ```

| Original implementation (requests/sec) | CMWQ (requests/sec) |
| --- | --- |
| 15483.061 | 30823.589 |

With the CMWQ implementation, throughput roughly doubles compared with the original one.

## Memory Management

:::info
The code below appears not to release the socket allocated by kernel_accept when kernel-thread creation fails:

```c
while (!kthread_should_stop()) {
    int err = kernel_accept(param->listen_socket, &socket, 0);
    if (err < 0) {
        if (signal_pending(current))
            break;
        pr_err("kernel_accept() error: %d\n", err);
        continue;
    }
    worker = kthread_run(http_server_worker, socket, KBUILD_MODNAME);
    if (IS_ERR(worker)) {
        pr_err("can't create more worker process\n");
        continue;
    }
}
```
> This question was resolved by @JordyMalone in a [PR](https://github.com/JimmyChongz/khttpd/commit/c030df2622ef590da8621efdb0269ea2be3e70b4)
:::

However, that fix releases the socket without shutting it down first.

> PR: [Shutdown socket before release](https://github.com/sysprog21/khttpd/pull/16)

References:
- [close vs shutdown socket](https://stackoverflow.com/questions/4160347/close-vs-shutdown-socket?noredirect=1&lq=1)
- [Linux socket.c](https://github.com/torvalds/linux/blob/master/net/socket.c)

A TCP connection can be closed in two ways:

- Graceful close (4-way handshake) => shutdown
  Reference: [TCP Connection termination](https://www.geeksforgeeks.org/computer-networks/tcp-connection-termination/)
  - Scheme:
    ![TCP connection termination](https://hackmd.io/_uploads/HJQ-xvcExl.png =60%x)
  - Steps:
    1. The client sends the server a TCP segment with the FIN bit set to 1 (a FIN packet), indicating it will send no more data, and enters the FIN_WAIT_1 state to wait for the server's acknowledgment (ACK). (The server can equally initiate this, i.e., actively close the connection.)
    2. The server sends an ACK to the client, confirming the client's FIN request.
    3. Upon receiving the server's ACK while in FIN_WAIT_1, the client enters the FIN_WAIT_2 state and waits for the server's FIN packet.
    4. After sending the ACK, the server carries out its own shutdown procedure in the CLOSE_WAIT state; once done, it sends a FIN packet to the client, indicating the server is also ready to close, and enters the LAST_ACK state.
    5. On receiving the server's FIN packet, the client sends an ACK to the server and enters the TIME_WAIT state, which allows the client to retransmit the final ACK if it gets lost. How long the client spends in TIME_WAIT is implementation-dependent, with typical values of 30 seconds, 1 minute, or 2 minutes. After the wait, the connection is formally closed and all client resources (including the port number and buffered data) are released.
- Forced close (by sending an RST packet) => close
  - One side sends an RST (reset) packet directly, forcibly terminating the connection; on receiving the RST, the other side also terminates immediately.
  - For usage scenarios, see:
    > [Common Causes and Troubleshooting Methods for Connection Reset](https://www.alibabacloud.com/blog/common-causes-and-troubleshooting-methods-for-connection-reset_600117)

**The difference between the two**: shutdown sends a FIN packet, meaning "no more data will be sent (or received)", but the connection is not fully closed yet; the peer's data can still be received, i.e., the application can still operate on the socket. close, in contrast, releases the socket resources outright, and the application can no longer perform any operation on that socket.

![shutdown flow](https://hackmd.io/_uploads/HytubWWHxl.png =19%x)![close flow](https://hackmd.io/_uploads/SySqWWZBll.png =17%x)

To guarantee complete data delivery: calling close directly instead of shutting the socket down first may discard data that has not yet been sent, which is a problem in scenarios that require complete transfers, so I made the following fix:

```diff
  while (!kthread_should_stop()) {
      int err = kernel_accept(param->listen_socket, &socket, 0);
      if (err < 0) {
          if (signal_pending(current))
              break;
          pr_err("kernel_accept() error: %d\n", err);
          continue;
      }
      worker = kthread_run(http_server_worker, socket, KBUILD_MODNAME);
      if (IS_ERR(worker)) {
          pr_err("can't create more worker process\n");
+         kernel_sock_shutdown(socket, SHUT_RDWR);
          sock_release(socket);
          continue;
      }
  }
```

## Implementing [directory listing](https://cwiki.apache.org/confluence/display/httpd/DirectoryListings)

Directory listing means that when a web server finds no default index file (such as index.html) in a directory, it automatically shows all files and subdirectories in that directory to the user as a list.

> Commit: [06e99a5](https://github.com/JimmyChongz/khttpd/commit/06e99a532e09293052c8c3bba9d35adc821c31ed)

First, so that every HTTP request has its own independent directory-traversal state, add to the http_request struct the Linux kernel directory-iteration struct `struct dir_context`, which holds the traversal state and the callback (`.actor`); the kernel invokes this callback for each directory entry (file or subdirectory) visited.

```diff
  struct http_request {
      struct socket *socket;
      enum http_method method;
      char request_url[128];
      int complete;
+     struct dir_context dir_context;
      struct list_head node;
      struct work_struct khttpd_work;
  };
```

The callback `tracedir` wraps each visited directory entry as an HTML table row `<tr><td>...</td></tr>`, filtering out the path-traversal strings `.` and `..` to avoid security risks:

```c
// callback for 'iterate_dir', trace entry.
static bool tracedir(struct dir_context *dir_context,
                     const char *name,
                     int namelen,
                     loff_t offset,
                     u64 ino,
                     unsigned int d_type)
{
    if (strcmp(name, ".") && strcmp(name, "..")) {
        struct http_request *request =
            container_of(dir_context, struct http_request, dir_context);
        char buf[SEND_BUFFER_SIZE] = {0};

        snprintf(buf, SEND_BUFFER_SIZE,
                 "<tr><td><a href=\"%s\">%s</a></td></tr>\r\n", name, name);
        http_server_send(request->socket, buf, strlen(buf));
    }
    return true;
}
```
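For context, the `dir_context` embedded in `http_request` is small; in recent kernels it is roughly the following (simplified, and worth double-checking against include/linux/fs.h), which is what makes the `container_of` trick in `tracedir` work:

```c
/* Simplified view of the kernel's directory-iteration context:
 * iterate_dir() calls .actor once per directory entry. */
struct dir_context {
    filldir_t actor; /* per-entry callback, e.g. tracedir above */
    loff_t pos;      /* current position within the directory */
};
```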
Next, add the directory-traversal function `handle_directory`. It first sets the callback to `tracedir` to handle the output of each directory entry, then checks whether the HTTP request header is a GET request; if so, it sends the HTTP response in segments, using the Linux kernel filesystem API `iterate_dir` to walk every entry under the `/home/jimmy/linux2025/khttpd` directory and streaming each one to the client through the `tracedir` callback.

```c
static bool handle_directory(struct http_request *request)
{
    struct file *fp;
    char buf[SEND_BUFFER_SIZE] = {0};

    request->dir_context.actor = tracedir;
    if (request->method != HTTP_GET) {
        snprintf(buf, SEND_BUFFER_SIZE,
                 "HTTP/1.1 501 Not Implemented\r\n%s%s%s%s",
                 "Content-Type: text/plain\r\n", "Content-Length: 19\r\n",
                 "Connection: Close\r\n", "501 Not Implemented\r\n");
        http_server_send(request->socket, buf, strlen(buf));
        return false;
    }

    snprintf(buf, SEND_BUFFER_SIZE, "HTTP/1.1 200 OK\r\n%s%s%s",
             "Connection: Keep-Alive\r\n", "Content-Type: text/html\r\n",
             "Keep-Alive: timeout=5, max=1000\r\n\r\n");
    http_server_send(request->socket, buf, strlen(buf));

    snprintf(buf, SEND_BUFFER_SIZE, "%s%s%s%s", "<html><head><style>\r\n",
             "body{font-family: monospace; font-size: 15px;}\r\n",
             "td {padding: 1.5px 6px;}\r\n",
             "</style></head><body><table>\r\n");
    http_server_send(request->socket, buf, strlen(buf));

    fp = filp_open("/home/jimmy/linux2025/khttpd", O_RDONLY | O_DIRECTORY, 0);
    if (IS_ERR(fp)) {
        pr_info("Open file failed");
        return false;
    }

    iterate_dir(fp, &request->dir_context);
    snprintf(buf, SEND_BUFFER_SIZE, "</table></body></html>\r\n");
    http_server_send(request->socket, buf, strlen(buf));

    filp_close(fp, NULL);
    return true;
}
```

Finally, change `http_server_response` to call `handle_directory(request)` to serve directory-browsing requests, returning the directory contents to the client as HTML:

```c
static int http_server_response(struct http_request *request, int keep_alive)
{
    int ret = handle_directory(request);
    if (!ret) {
        pr_info("handle directory failed");
    }
    return 0;
}
```

### Demo

{%youtube CCEiKd2ex7A %}
