# 2021q1 Homework4 (khttpd)
contributed by < [`bakudr18`](https://github.com/bakudr18) >

###### tags: `linux2021`

> [作業說明 - khttpd](https://hackmd.io/@sysprog/linux2020-khttpd)
> [2021期末專題說明 - khttpd](https://hackmd.io/@sysprog/linux2021-projects#kHTTPd)

## 事前準備

1. [CS:APP 第 11 章](https://hackmd.io/s/ByPlLNaTG): Network Programming
2. [nstack 開發紀錄 (1)](https://hackmd.io/s/ryfvFmZ0f)
3. [nstack 開發紀錄 (2)](https://hackmd.io/s/r1PUn3KGV)
4. [你所不知道的 C 語言:從打造類似 Facebook 網路服務探討整合開發](https://hackmd.io/@sysprog/c-prog/%2Fs%2FB1s8hX1yg)
5. [高效 Web 伺服器開發](https://hackmd.io/@sysprog/fast-web-server)

## 測試環境

:::spoiler
```shell
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"

$ uname -a
Linux bakud-PX60-2QD 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               71
Model name:          Intel(R) Core(TM) i7-5700HQ CPU @ 2.70GHz
Stepping:            1
CPU MHz:             1197.344
CPU max MHz:         3500.0000
CPU min MHz:         800.0000
BogoMIPS:            5387.38
Virtualization:      VT-x
L1d cache:           128 KiB
L1i cache:           128 KiB
L2 cache:            1 MiB
L3 cache:            6 MiB
NUMA node0 CPU(s):   0-7

$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
```
:::

## 編譯錯誤

[khttpd](https://github.com/sysprog21/khttpd) 在 Linux v5.8 編譯會遇到以下錯誤

```shell
$ make
make -C /lib/modules/5.8.0-50-generic/build M=/home/bakud/linux2021/khttpd modules
make[1]: Entering directory '/usr/src/linux-headers-5.8.0-50-generic'
  CC [M]  /home/bakud/linux2021/khttpd/main.o
/home/bakud/linux2021/khttpd/main.c: In function ‘setsockopt’:
/home/bakud/linux2021/khttpd/main.c:28:12: error: implicit declaration of function ‘kernel_setsockopt’; did you mean ‘kernel_getsockname’? [-Werror=implicit-function-declaration]
   28 |     return kernel_setsockopt(sock, level, optname, (char *) &opt, sizeof(opt));
      |            ^~~~~~~~~~~~~~~~~
      |            kernel_getsockname
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:286: /home/bakud/linux2021/khttpd/main.o] Error 1
make[1]: *** [Makefile:1783: /home/bakud/linux2021/khttpd] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.8.0-50-generic'
make: *** [Makefile:14: all] Error 2
```

原因是 `kernel_setsockopt` 函式在 [commit 5a892f](https://github.com/torvalds/linux/commit/5a892ff2facb4548c17c05931ed899038a0da63e#diff-ad9d6061c9dcaa70d934b3a2ea393237d93ebbc9bf3a535c7f9e5a1f1ae86c72) 被移除了,然而此版本 (v5.8) 的 `sock_setsockopt` 與 `tcp_setsockopt` 函式都只接受存放在 userspace 的參數 `optval`

```c
int sock_setsockopt(struct socket *sock, int level, int optname,
                    char __user *optval, unsigned int optlen);
int tcp_setsockopt(struct sock *sk, int level, int optname,
                   char __user *optval, unsigned int optlen);
```

部份 `setsockopt` 可用以下 kernel 釋出的 function 取代

```c
sock_set_reuseaddr(sock->sk);
tcp_sock_set_nodelay(sock->sk);
tcp_sock_set_cork(sock->sk, 0);
sock_set_rcvbuf(sock->sk, 1024 * 1024);
```

但 `SO_SNDBUF` 暫時沒有釋出對應的 function 可取代,因此先不設定。Linux v5.12 已將函式參數修改如下,因此可直接使用。

```c
int sock_setsockopt(struct socket *sock, int level, int optname,
                    sockptr_t optval, unsigned int optlen);
int tcp_setsockopt(struct sock *sk, int level, int optname,
                   sockptr_t optval, unsigned int optlen);
```
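若想讓同一份程式碼同時支援移除 `kernel_setsockopt` 前後的核心版本,一個可能的作法是以 `LINUX_VERSION_CODE` 做條件編譯。以下僅為概念示意,`khttpd_set_sock_opts` 為假設的 helper 名稱,並非 khttpd 原始碼:

```c
#include <linux/net.h>
#include <linux/version.h>
#include <net/sock.h>
#include <net/tcp.h>

/* 假設的 helper:示意如何在不同核心版本下設定 socket 選項 */
static void khttpd_set_sock_opts(struct socket *sock)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 8, 0)
    /* v5.8 之後改用核心釋出的 per-option helper */
    sock_set_reuseaddr(sock->sk);
    tcp_sock_set_nodelay(sock->sk);
    tcp_sock_set_cork(sock->sk, false);
    sock_set_rcvbuf(sock->sk, 1024 * 1024);
#else
    /* v5.8 之前仍可使用 kernel_setsockopt */
    int opt = 1;
    kernel_setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, (char *) &opt,
                      sizeof(opt));
    kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *) &opt, sizeof(opt));
#endif
}
```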
## 修正原有 bug

### Connection reset by peer

執行指令 `wget localhost:8081` 會出現錯誤 `khttpd: recv error: -104`,`ECONNRESET: Connection reset by peer` 表示 TCP 連線遇到錯誤而丟失了資料,某一端會發出 RST 給對方做錯誤處理。經老師提示可能是 [TIME_WAIT](https://hackmd.io/@eecheng/Sy9XsgjOI#%E8%A7%A3%E9%87%8B-drop-tcp-socket-%E6%A0%B8%E5%BF%83%E6%A8%A1%E7%B5%84%E9%81%8B%E4%BD%9C%E5%8E%9F%E7%90%86%E3%80%82TIME-WAIT-sockets-%E5%8F%88%E6%98%AF%E4%BB%80%E9%BA%BC%EF%BC%9F) 所造成的問題,並推薦我使用 [eBPF](https://hackmd.io/@sysprog/linux-ebpf#Linux-%E6%A0%B8%E5%BF%83%E8%A8%AD%E8%A8%88-%E9%80%8F%E9%81%8E-eBPF-%E8%A7%80%E5%AF%9F%E4%BD%9C%E6%A5%AD%E7%B3%BB%E7%B5%B1%E8%A1%8C%E7%82%BA) 去追蹤問題。

因此這裡先參考以下 eBPF 資料:
* [HW10: sehttpd: 以 eBPF 追蹤 HTTP 封包](https://hackmd.io/@sysprog/linux2020-sehttpd#-%E4%BB%A5-eBPF-%E8%BF%BD%E8%B9%A4-HTTP-%E5%B0%81%E5%8C%85)
* [Appendix C](https://hackmd.io/@0xff07/r1f4B8aGI#Appendix-C)
* [Learn eBPF Tracing: Tutorial and Examples](http://www.brendangregg.com/blog/2019-01-01/learn-ebpf-tracing.html)

下載 [bcc](https://github.com/iovisor/bcc) 原始碼並 checkout 到對應支援的 Linux kernel 版本,我使用的是 [bcc v0.17.0](https://github.com/iovisor/bcc/releases/tag/v0.17.0),安裝教學在 [INSTALL.md](https://github.com/iovisor/bcc/blob/v0.17.0/INSTALL.md)。bcc 提供了多個工具讓你追蹤 TCP 的相關動作,這裡會用到的是以下兩個:

1. [tcpstates](https://github.com/iovisor/bcc/blob/v0.17.0/tools/tcpstates_example.txt):用於檢測 TCP 連線狀態的改變
2. [tcpdrop](https://github.com/iovisor/bcc/blob/v0.17.0/tools/tcpdrop_example.txt):用來追蹤 TCP 是否有丟失封包

首先執行 `tcpdrop` 會得到以下資訊

```shell
$ sudo ./tcpdrop.py
TIME     PID    IP SADDR:SPORT        > DADDR:DPORT       STATE (FLAGS)
20:06:38 66507  4  127.0.0.1:57808    > 127.0.0.1:8081    CLOSE (RST|ACK)
    tcp_drop+0x1
    tcp_rcv_established+0x126
    tcp_v4_do_rcv+0x140
    tcp_v4_rcv+0xcef
    ip_protocol_deliver_rcu+0x30
    ip_local_deliver_finish+0x48
    ip_local_deliver+0x73
    ip_rcv_finish+0x87
    ip_rcv+0xbc
    __netif_receive_skb_one_core+0x88
    __netif_receive_skb+0x18
    process_backlog+0xa9
    net_rx_action+0x142
    __softirqentry_text_start+0xe1
    asm_call_sysvec_on_stack+0x12
    do_softirq_own_stack+0x3d
    do_softirq.part.0+0x46
    __local_bh_enable_ip+0x50
    ip_finish_output2+0x1af
    __ip_finish_output+0xc8
    ip_finish_output+0x2d
    ip_output+0x7a
    ip_local_out+0x3d
    __ip_queue_xmit+0x17a
    ip_queue_xmit+0x10
    __tcp_transmit_skb+0x56e
    tcp_send_active_reset+0xf6
    tcp_close+0x306
    inet_release+0x3b
    __sock_release+0x42
    sock_close+0x15
    __fput+0xe9
    ____fput+0xe
    task_work_run+0x70
    do_exit+0x3a0
    do_group_exit+0x43
    __x64_sys_exit_group+0x18
    do_syscall_64+0x49
    entry_SYSCALL_64_after_hwframe+0x44
```

可以看到在 `tcp_close` 時執行了 `tcp_send_active_reset`,追蹤 [linux/net/ipv4/tcp.c](https://elixir.bootlin.com/linux/v5.8/source/net/ipv4/tcp.c#L2407) 可以看到以下程式碼,可知執行到 `tcp_set_state(sk, TCP_CLOSE)` 時 receive buffer 內仍有資料未被讀取,因此觸發了 `tcp_send_active_reset`。

```c=2416
	if (sk->sk_state == TCP_LISTEN) {
		tcp_set_state(sk, TCP_CLOSE);

		/* Special case. */
		inet_csk_listen_stop(sk);

		goto adjudge_to_death;
	}

	/* We need to flush the recv. buffs.  We do this only on the
	 * descriptor close, not protocol-sourced closes, because the
	 * reader process may not have drained the data yet!
	 */
	while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
		u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq;

		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
			len--;
		data_was_unread += len;
		__kfree_skb(skb);
	}
```

這裡還沒經過下圖中 TCP close 時四次交握的正常程序,這點可以透過 `tcpstates` 來確認。

![](https://i.imgur.com/lnkxAWg.png)

```shell
$ sudo ./tcpstates.py
SKADDR           C-PID  C-COMM LADDR     LPORT RADDR     RPORT OLDSTATE    -> NEWSTATE    MS
ffff91fa81e03480 66524  wget   127.0.0.1 0     127.0.0.1 8081  CLOSE       -> SYN_SENT    0.000
ffff91fa81e03480 66524  wget   127.0.0.1 57810 127.0.0.1 8081  SYN_SENT    -> ESTABLISHED 0.053
ffff91face46cec0 66524  wget   0.0.0.0   8081  0.0.0.0   0     LISTEN      -> SYN_RECV    0.000
ffff91face46cec0 66524  wget   127.0.0.1 8081  127.0.0.1 57810 SYN_RECV    -> ESTABLISHED 0.007
ffff91fa81e03480 66524  wget   127.0.0.1 57810 127.0.0.1 8081  ESTABLISHED -> CLOSE       0.650
ffff91face46cec0 66524  wget   127.0.0.1 8081  127.0.0.1 57810 ESTABLISHED -> CLOSE       0.636
```

可以看到 client 端狀態直接從 ESTABLISHED 變成 CLOSE,完全沒有經過 FIN_WAIT_1、FIN_WAIT_2、TIME_WAIT 等狀態,因此可知並非是老師所提示的 [TIME_WAIT](https://hackmd.io/@eecheng/Sy9XsgjOI#%E8%A7%A3%E9%87%8B-drop-tcp-socket-%E6%A0%B8%E5%BF%83%E6%A8%A1%E7%B5%84%E9%81%8B%E4%BD%9C%E5%8E%9F%E7%90%86%E3%80%82TIME-WAIT-sockets-%E5%8F%88%E6%98%AF%E4%BB%80%E9%BA%BC%EF%BC%9F) 所造成的問題。考慮到是關閉連線時仍有資料未讀取,因此檢視 server 回傳的資料,程式碼如下

```c
#define HTTP_RESPONSE_200_KEEPALIVE_DUMMY                      \
    ""                                                         \
    "HTTP/1.1 200 OK" CRLF "Server: " KBUILD_MODNAME CRLF      \
    "Content-Type: text/plain" CRLF "Content-Length: 12" CRLF  \
    "Connection: Keep-Alive" CRLF CRLF "Hello World!" CRLF
```

回傳字串 `Hello World!` 後方還有 `CRLF`,然而 `Content-Length: 12` 並沒有把 `CRLF` 算進去,因此修正為 14 bytes 後即可解決此 bug。
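修正後的巨集如下(`"Hello World!"` 為 12 bytes,加上結尾的 `CRLF` 共 14 bytes):

```c
#define HTTP_RESPONSE_200_KEEPALIVE_DUMMY                      \
    ""                                                         \
    "HTTP/1.1 200 OK" CRLF "Server: " KBUILD_MODNAME CRLF      \
    "Content-Type: text/plain" CRLF "Content-Length: 14" CRLF  \
    "Connection: Keep-Alive" CRLF CRLF "Hello World!" CRLF
```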
:::info
註: 根據 Brendan Gregg (eBPF 主要貢獻者之一) 的 [Blog](http://www.brendangregg.com/blog/2020-11-04/bpf-co-re-btf-libbpf.html),以 Python 使用 eBPF 的功能將會逐漸被捨棄,並以 libbpf 取代。~~說好的用 Python 方便做資料分析呢~~ :cry:
:::

### Kernel thread 未正確被釋放

這裡事先參考了 [AndybnACT](https://hackmd.io/@AndybnA/khttpd#http_server_worker-%E6%B2%92%E6%9C%89%E8%A2%AB%E6%AD%A3%E7%A2%BA%E5%9C%B0%E9%87%8B%E6%94%BE) 同學的共筆,此 bug 可透過瀏覽器連線後,再 `rmmod` 卸載 `khttpd` 來重現。

```shell
[19459.787911] BUG: unable to handle page fault for address: ffffffffc13f1465
[19459.787915] #PF: supervisor instruction fetch in kernel mode
[19459.787916] #PF: error_code(0x0010) - not-present page
[19459.787917] PGD 44f40f067 P4D 44f40f067 PUD 44f411067 PMD 453fe7067 PTE 0
[19459.787921] Oops: 0010 [#1] SMP PTI
[19459.787924] CPU: 2 PID: 414058 Comm: khttpd Tainted: G OE 5.8.0-53-generic #60~20.04.1-Ubuntu
[19459.787925] Hardware name: Micro-Star International Co., Ltd. PX60 2QD/MS-16H6, BIOS E16H6IMS.110 11/03/2015
[19459.787928] RIP: 0010:0xffffffffc13f1465
[19459.787930] Code: Unable to access opcode bytes at RIP 0xffffffffc13f143b.
[19459.787931] RSP: 0018:ffff9f3c01653d60 EFLAGS: 00010282
[19459.787933] RAX: 0000000000000243 RBX: ffff8edadc595e00 RCX: 0000000000000005
[19459.787934] RDX: 0000000000000000 RSI: 00000000fffffe01 RDI: ffffffffb7f4ceff
[19459.787936] RBP: ffff9f3c01653dd8 R08: 0000000000078014 R09: ffff8eda9f69b550
[19459.787937] R10: 0000000000000243 R11: 0000000000000243 R12: ffff8edaa675f000
[19459.787938] R13: ffff8eda9d9d5140 R14: ffff8eda9d9d5140 R15: ffff9f3c00e87e10
[19459.787939] FS: 0000000000000000(0000) GS:ffff8edadec80000(0000) knlGS:0000000000000000
[19459.787941] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19459.787942] CR2: ffffffffc13f143b CR3: 0000000455dda005 CR4: 00000000003606e0
[19459.787943] Call Trace:
[19459.787951]  ? kthread+0x114/0x150
[19459.787953]  ? kthread_park+0x90/0x90
[19459.787956]  ? ret_from_fork+0x22/0x30
```

此 bug 發生的原因在於下方 `http_server_daemon` 函式會為每個 client 連線建立一個 kthread(下方第 205 行),而在關閉 server 時卻沒有任何機制回收這些 kthread。當 client 沒有收到斷線資訊而繼續連線傳輸資料,就會導致 kernel 去執行已經被卸載的程式碼,因而發生 kernel page fault。此 bug 會在後續引入 [cmwq](https://www.kernel.org/doc/html/latest/core-api/workqueue.html) 時一併修正。

```c=188
int http_server_daemon(void *arg)
{
    struct socket *socket;
    struct task_struct *worker;
    struct http_server_param *param = (struct http_server_param *) arg;

    allow_signal(SIGKILL);
    allow_signal(SIGTERM);

    while (!kthread_should_stop()) {
        int err = kernel_accept(param->listen_socket, &socket, 0);
        if (err < 0) {
            if (signal_pending(current))
                break;
            pr_err("kernel_accept() error: %d\n", err);
            continue;
        }
        worker = kthread_run(http_server_worker, socket, KBUILD_MODNAME);
        if (IS_ERR(worker)) {
            pr_err("can't create more worker process\n");
            continue;
        }
    }
    return 0;
}
```

## Concurrency Managed Workqueue (cmwq)

[cmwq](https://www.kernel.org/doc/html/latest/core-api/workqueue.html) 是 Linux kernel 所提供的一種機制,其實作了 workqueue API,用於非同步 (asynchronous) 地執行 kernel 開發者所撰寫的 work (function),從而簡化了開發者自行管理背後 kthread 等資源的流程。其開發目的也是為了讓 workqueue 能最大化使用硬體資源,滿足 Linux 對 [scalability](https://hackmd.io/@sysprog/linux-scalability) 的要求。

### cmwq Design

* work: 開發者實作、需要非同步執行的 function。
* workqueue: 存放待執行 work 的佇列,user 將 work enqueue 到 workqueue 中,等待後續 worker 從 workqueue 中 dequeue 取出 work 來執行。
* worker: 背後實際執行 work 的 kernel thread。
* worker-pools: 相當於 [thread-pools](https://hackmd.io/@sysprog/concurrency/https%3A%2F%2Fhackmd.io%2F%40sysprog%2Fposix-threads#Thread-Pool) 的概念。

cmwq 的設計拆分了 workqueue 與 worker-pools 的概念,user 可以單純將 work 推進 workqueue,而不必在意 OS 如何分配 work 給 worker 執行。系統預先建立了兩個 worker-pool,分別用於執行一般的 task 與高優先權的 task,另外也會視情況動態配置新的 worker-pool。各個 worker-pool 會根據 workqueue 的設定,也就是 `alloc_workqueue` 時所設定的 flag,來決定如何把 work 分配給哪個 worker 執行,例如 `WQ_UNBOUND` 表示 workqueue 中的 work 不綁定特定 CPU 執行,`WQ_HIGHPRI` 的 workqueue 中的 work 則會交給高優先權的 worker-pool 執行等。cmwq 基本上是為了最大化 concurrency 的效益而設計,實際上如何分配 work 給 worker 就得自己去 trace code 才能知道了。
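下方以一個最小的 kernel module 骨架示意上述 work、workqueue 與 flag 的關係(僅為示意,`demo_wq`、`demo_work_fn` 等名稱皆為假設,並非 khttpd 程式碼):

```c
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* 每個 work item 除了 work_struct 外,還帶著自己的資料 */
struct demo_work {
    struct work_struct work;
    int payload;
};

static struct workqueue_struct *demo_wq;

/* worker (kthread) 實際執行的 function */
static void demo_work_fn(struct work_struct *work)
{
    struct demo_work *dw = container_of(work, struct demo_work, work);
    pr_info("demo: handling payload %d\n", dw->payload);
    kfree(dw); /* work function 在結束前釋放自己的 work item */
}

/* 將一個 work 排入 workqueue,由 worker-pool 非同步執行 */
static int demo_submit(int payload)
{
    struct demo_work *dw = kmalloc(sizeof(*dw), GFP_KERNEL);
    if (!dw)
        return -ENOMEM;
    dw->payload = payload;
    INIT_WORK(&dw->work, demo_work_fn);
    queue_work(demo_wq, &dw->work);
    return 0;
}

static int __init demo_init(void)
{
    /* WQ_UNBOUND: work 不綁定特定 CPU,交由系統決定由哪個 worker 執行 */
    demo_wq = alloc_workqueue("demo_wq", WQ_UNBOUND, 0);
    if (!demo_wq)
        return -ENOMEM;
    return demo_submit(42);
}

static void __exit demo_exit(void)
{
    destroy_workqueue(demo_wq); /* 內部會先等待佇列中的 work 執行完畢 */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

其中 work function 在結束前自行 `kfree` 自己的 work item 是常見的作法(前提是之後不會再 flush 該 work),可與下節 khttpd 於 `rmmod` 時才統一釋放的作法互相對照。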
### 引入 cmwq 至 khttpd

這裡參考了 [kecho](https://hackmd.io/@sysprog/linux2020-kecho) 的作法,`struct http_server_wq` 用來管理 workqueue,`struct http_server` 用來管理 work。

```c
struct http_server_wq {
    struct workqueue_struct *client_wq;
    struct list_head head;
    bool should_stop;
};

struct http_server {
    struct socket *socket;
    struct work_struct work;
    struct list_head list;
};
```

在 `http_server_daemon` 中配置了一個 `WQ_UNBOUND` 的 workqueue,針對每一個 client 連線請求,都會建立一個新的 work,各個 work 會由哪個 worker (kthread) 來執行是交給 OS 分配的,配置出來的 `struct http_server` 資源則使用 linked list 來管理。

```c
static struct work_struct *create_work(struct socket *socket)
{
    struct http_server *client;
    client = kmalloc(sizeof(struct http_server), GFP_KERNEL);
    if (!client)
        return NULL;
    client->socket = socket;
    INIT_WORK(&client->work, http_wq_work);
    list_add(&client->list, &wq.head);
    return &client->work;
}

int http_server_daemon(void *arg)
{
    ...
    wq.client_wq = alloc_workqueue("kthhpd_client_wq", WQ_UNBOUND, 0);
    if (!wq.client_wq)
        return -ENOMEM;
    wq.should_stop = false;
    INIT_LIST_HEAD(&wq.head);
    ...
    while (!kthread_should_stop()) {
        ...
        work = create_work(socket);
        if (!work) {
            pr_err("can't create more work\n");
            continue;
        }
        queue_work(wq.client_wq, work);
    }
    wq.should_stop = true;
    destroy_work();
    return 0;
}
```

而對於配置出來的 work 如何回收的問題,則是於 module 卸載時透過 linked list 一個一個 `kfree`。當然這有個明顯的缺點:對於已經斷線的 client,資源沒辦法動態地被釋放,只能等到 `rmmod` 時才釋放。

```c
static void destroy_work(void)
{
    struct http_server *curr, *tmp;
    list_for_each_entry_safe (curr, tmp, &wq.head, list) {
        flush_work(&curr->work);
        kfree(curr);
    }
}

static int http_server_work(void *arg)
{
    ...
    while (!wq.should_stop) {
        int ret = http_server_recv(socket, buf, RECV_BUFFER_SIZE - 1);
        if (ret <= 0) {
            if (ret)
                pr_err("recv error: %d\n", ret);
            break;
        }
        http_parser_execute(&parser, &setting, buf, ret);
        if (request.complete && !http_should_keep_alive(&parser))
            break;
    }
    kernel_sock_shutdown(socket, SHUT_RDWR);
    sock_release(socket);
    kfree(buf);
    return 0;
}
```

使用 cmwq 前後 `make check` 的數據如下,requests/sec 提升了近一倍。

```shell
kthread                              cmwq
requests:      100000                requests:      100000
good requests: 100000 [100%]         good requests: 100000 [100%]
bad requests:  0 [0%]                bad requests:  0 [0%]
socket errors: 0 [0%]                socket errors: 0 [0%]
seconds:       3.056                 seconds:       1.607
requests/sec:  32727.322             requests/sec:  62215.171
```

:::info
TODO: 針對動態釋放 work 資源的問題,目前思考的一個作法是建立一個 work 並以 event driven 的方式來動態釋放資源。目前查到 kernel 提供的 [wait queue](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h) 具備 event driven 的功能,然而因為 linked list 屬於共用資源,操作上估計逃不了使用 lock,因此預期可能會損失一點執行效率,但目前沒有更好的想法,會先朝這個方向實作。
:::
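順著這個 TODO 的方向,下方是一種可能的雛形,僅為示意且尚未驗證:`done_list`、`reaper_thread`、`mark_client_done` 等名稱皆為假設,並以一個獨立的 kthread 作為 reaper(也可以改成 work 的形式)。每個 work 結束前把自己移到 done list 並喚醒 wait queue,再由 reaper 統一釋放;如 TODO 所述,linked list 為共用資源,`wq.head` 與 `done_list` 的操作都需要相同的 lock 保護。

```c
#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

static LIST_HEAD(done_list);          /* 已完成、等待釋放的 client */
static DEFINE_SPINLOCK(done_lock);
static DECLARE_WAIT_QUEUE_HEAD(reap_wq);

/* 在 work function 的最後呼叫:把自己掛到 done_list 並喚醒 reaper
 * (struct http_server 沿用前文的定義,wq.head 的操作也需用同一把 lock) */
static void mark_client_done(struct http_server *client)
{
    spin_lock(&done_lock);
    list_move_tail(&client->list, &done_list);
    spin_unlock(&done_lock);
    wake_up(&reap_wq);
}

/* reaper kthread:平時睡在 wait queue 上,被喚醒後釋放已完成的 work */
static int reaper_thread(void *arg)
{
    while (!kthread_should_stop()) {
        struct http_server *curr, *tmp;

        wait_event_interruptible(
            reap_wq, !list_empty(&done_list) || kthread_should_stop());

        spin_lock(&done_lock);
        list_for_each_entry_safe (curr, tmp, &done_list, list) {
            list_del(&curr->list);
            kfree(curr); /* work function 已結束,可安全釋放 */
        }
        spin_unlock(&done_lock);
    }
    return 0;
}
```

module 卸載時則先 `kthread_stop` 這個 reaper,再沿用原本的 `destroy_work` 把剩餘節點釋放。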
## 其他參考資料

> [鳥哥的 Linux 私房菜: 第二章、基礎網路概念](http://linux.vbird.org/linux_server/0110network_basic.php)