# 2021q1 Homework4 (khttpd)
contributed by < [`bakudr18`](https://github.com/bakudr18) >
###### tags: `linux2021`
> [Assignment description - khttpd](https://hackmd.io/@sysprog/linux2020-khttpd)
> [2021 final project description - khttpd](https://hackmd.io/@sysprog/linux2021-projects#kHTTPd)
## Preparation
1. [CS:APP Chapter 11](https://hackmd.io/s/ByPlLNaTG): Network Programming
2. [nstack development notes (1)](https://hackmd.io/s/ryfvFmZ0f)
3. [nstack development notes (2)](https://hackmd.io/s/r1PUn3KGV)
4. [The C language you don't know: exploring integrated development by building a Facebook-like web service](https://hackmd.io/@sysprog/c-prog/%2Fs%2FB1s8hX1yg)
5. [High-performance web server development](https://hackmd.io/@sysprog/fast-web-server)
## Test environment
:::spoiler
```shell
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
$ uname -a
Linux bakud-PX60-2QD 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 71
Model name: Intel(R) Core(TM) i7-5700HQ CPU @ 2.70GHz
Stepping: 1
CPU MHz: 1197.344
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5387.38
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
```
:::
## Build error
Building [khttpd](https://github.com/sysprog21/khttpd) against Linux v5.8 fails with the following error:
```shell
$ make
make -C /lib/modules/5.8.0-50-generic/build M=/home/bakud/linux2021/khttpd modules
make[1]: Entering directory '/usr/src/linux-headers-5.8.0-50-generic'
CC [M] /home/bakud/linux2021/khttpd/main.o
/home/bakud/linux2021/khttpd/main.c: In function ‘setsockopt’:
/home/bakud/linux2021/khttpd/main.c:28:12: error: implicit declaration of function ‘kernel_setsockopt’; did you mean ‘kernel_getsockname’? [-Werror=implicit-function-declaration]
28 | return kernel_setsockopt(sock, level, optname, (char *) &opt, sizeof(opt));
| ^~~~~~~~~~~~~~~~~
| kernel_getsockname
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:286: /home/bakud/linux2021/khttpd/main.o] Error 1
make[1]: *** [Makefile:1783: /home/bakud/linux2021/khttpd] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.8.0-50-generic'
make: *** [Makefile:14: all] Error 2
```
The reason is that `kernel_setsockopt` was removed in [commit 5a892f](https://github.com/torvalds/linux/commit/5a892ff2facb4548c17c05931ed899038a0da63e#diff-ad9d6061c9dcaa70d934b3a2ea393237d93ebbc9bf3a535c7f9e5a1f1ae86c72), and in this kernel version both `sock_setsockopt` and `tcp_setsockopt` only accept an `optval` argument that lives in userspace:
```c
int sock_setsockopt(struct socket *sock, int level, int optname,
char __user *optval, unsigned int optlen);
int tcp_setsockopt(struct sock *sk, int level, int optname,
char __user *optval, unsigned int optlen);
```
Some of the `setsockopt` calls can be replaced with the following helpers exported by the kernel:
```c
sock_set_reuseaddr(sock->sk);
tcp_sock_set_nodelay(sock->sk);
tcp_sock_set_cork(sock->sk, 0);
sock_set_rcvbuf(sock->sk, 1024 * 1024);
```
However, there is no exported helper for `SO_SNDBUF` yet, so that option is simply left unset for now. Linux v5.12 changes the function signatures as shown below, so they can then be called with kernel-space arguments directly:
```c
int sock_setsockopt(struct socket *sock, int level, int optname,
sockptr_t optval, unsigned int optlen);
int tcp_setsockopt(struct sock *sk, int level, int optname,
sockptr_t optval, unsigned int optlen);
```
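Under this newer signature, a kernel-space buffer can be passed by wrapping it with `KERNEL_SOCKPTR()`. The helper below is a hypothetical sketch (the function name `set_sndbuf` is mine, not from khttpd) of how `SO_SNDBUF` could then be set without `kernel_setsockopt`:
```c
#include <linux/sockptr.h>
#include <net/sock.h>

/* Hypothetical sketch: set SO_SNDBUF from kernel space on kernels whose
 * sock_setsockopt() takes a sockptr_t, by wrapping a kernel pointer
 * with KERNEL_SOCKPTR(). */
static int set_sndbuf(struct socket *sock, int size)
{
    sockptr_t optval = KERNEL_SOCKPTR(&size);

    return sock_setsockopt(sock, SOL_SOCKET, SO_SNDBUF, optval, sizeof(size));
}
```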
## Fixing existing bugs
### Connection reset by peer
Running `wget localhost:8081` produces the error `khttpd: recv error: -104`. `ECONNRESET: Connection reset by peer` means data was lost due to a TCP-level error and one side sent an RST to the peer as error handling. The instructor hinted that this might be caused by [TIME_WAIT](https://hackmd.io/@eecheng/Sy9XsgjOI#%E8%A7%A3%E9%87%8B-drop-tcp-socket-%E6%A0%B8%E5%BF%83%E6%A8%A1%E7%B5%84%E9%81%8B%E4%BD%9C%E5%8E%9F%E7%90%86%E3%80%82TIME-WAIT-sockets-%E5%8F%88%E6%98%AF%E4%BB%80%E9%BA%BC%EF%BC%9F) and recommended tracking it down with [eBPF](https://hackmd.io/@sysprog/linux-ebpf#Linux-%E6%A0%B8%E5%BF%83%E8%A8%AD%E8%A8%88-%E9%80%8F%E9%81%8E-eBPF-%E8%A7%80%E5%AF%9F%E4%BD%9C%E6%A5%AD%E7%B3%BB%E7%B5%B1%E8%A1%8C%E7%82%BA).
The following eBPF material was consulted first:
* [HW10: sehttpd: 以 eBPF 追蹤 HTTP 封包](https://hackmd.io/@sysprog/linux2020-sehttpd#-%E4%BB%A5-eBPF-%E8%BF%BD%E8%B9%A4-HTTP-%E5%B0%81%E5%8C%85)
* [Appendix C](https://hackmd.io/@0xff07/r1f4B8aGI#Appendix-C)
* [Learn eBPF Tracing: Tutorial and Examples](http://www.brendangregg.com/blog/2019-01-01/learn-ebpf-tracing.html)
Download the [bcc](https://github.com/iovisor/bcc) source and check out the release that supports your kernel; I used [bcc v0.17.0](https://github.com/iovisor/bcc/releases/tag/v0.17.0), and the installation guide is in [INSTALL.md](https://github.com/iovisor/bcc/blob/v0.17.0/INSTALL.md). bcc provides several tools for tracing TCP activity; the two used here are:
1. [tcpstates](https://github.com/iovisor/bcc/blob/v0.17.0/tools/tcpstates_example.txt): reports TCP connection state changes
2. [tcpdrop](https://github.com/iovisor/bcc/blob/v0.17.0/tools/tcpdrop_example.txt): traces dropped TCP packets
Running `tcpdrop` first gives the following output:
```shell
$ sudo ./tcpdrop.py
TIME PID IP SADDR:SPORT > DADDR:DPORT STATE (FLAGS)
20:06:38 66507 4 127.0.0.1:57808 > 127.0.0.1:8081 CLOSE (RST|ACK)
tcp_drop+0x1
tcp_rcv_established+0x126
tcp_v4_do_rcv+0x140
tcp_v4_rcv+0xcef
ip_protocol_deliver_rcu+0x30
ip_local_deliver_finish+0x48
ip_local_deliver+0x73
ip_rcv_finish+0x87
ip_rcv+0xbc
__netif_receive_skb_one_core+0x88
__netif_receive_skb+0x18
process_backlog+0xa9
net_rx_action+0x142
__softirqentry_text_start+0xe1
asm_call_sysvec_on_stack+0x12
do_softirq_own_stack+0x3d
do_softirq.part.0+0x46
__local_bh_enable_ip+0x50
ip_finish_output2+0x1af
__ip_finish_output+0xc8
ip_finish_output+0x2d
ip_output+0x7a
ip_local_out+0x3d
__ip_queue_xmit+0x17a
ip_queue_xmit+0x10
__tcp_transmit_skb+0x56e
tcp_send_active_reset+0xf6
tcp_close+0x306
inet_release+0x3b
__sock_release+0x42
sock_close+0x15
__fput+0xe9
____fput+0xe
task_work_run+0x70
do_exit+0x3a0
do_group_exit+0x43
__x64_sys_exit_group+0x18
do_syscall_64+0x49
entry_SYSCALL_64_after_hwframe+0x44
```
The trace shows that `tcp_close` ended up calling `tcp_send_active_reset`. Tracing [linux/net/ipv4/tcp.c](https://elixir.bootlin.com/linux/v5.8/source/net/ipv4/tcp.c#L2407) leads to the code below: when the descriptor is closed while data is still sitting in the receive buffer, `tcp_send_active_reset` is triggered.
```c=2416
if (sk->sk_state == TCP_LISTEN) {
tcp_set_state(sk, TCP_CLOSE);
/* Special case. */
inet_csk_listen_stop(sk);
goto adjudge_to_death;
}
/* We need to flush the recv. buffs. We do this only on the
* descriptor close, not protocol-sourced closes, because the
* reader process may not have drained the data yet!
*/
while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq;
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
len--;
data_was_unread += len;
__kfree_skb(skb);
}
```
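Slightly further down in the same function, the accumulated `data_was_unread` is what actually triggers the active reset (abridged from v5.8 `tcp_close()`):
```c
if (data_was_unread) {
    /* Unread data was tossed, zap the connection. */
    NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
    tcp_set_state(sk, TCP_CLOSE);
    tcp_send_active_reset(sk, sk->sk_allocation);
}
```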
At this point the connection never went through the normal four-way handshake of a TCP close shown in the figure below, which can be confirmed with `tcpstates`.

```shell
$ sudo ./tcpstates.py
SKADDR C-PID C-COMM LADDR LPORT RADDR RPORT OLDSTATE -> NEWSTATE MS
ffff91fa81e03480 66524 wget 127.0.0.1 0 127.0.0.1 8081 CLOSE -> SYN_SENT 0.000
ffff91fa81e03480 66524 wget 127.0.0.1 57810 127.0.0.1 8081 SYN_SENT -> ESTABLISHED 0.053
ffff91face46cec0 66524 wget 0.0.0.0 8081 0.0.0.0 0 LISTEN -> SYN_RECV 0.000
ffff91face46cec0 66524 wget 127.0.0.1 8081 127.0.0.1 57810 SYN_RECV -> ESTABLISHED 0.007
ffff91fa81e03480 66524 wget 127.0.0.1 57810 127.0.0.1 8081 ESTABLISHED -> CLOSE 0.650
ffff91face46cec0 66524 wget 127.0.0.1 8081 127.0.0.1 57810 ESTABLISHED -> CLOSE 0.636
```
The client side jumps straight from ESTABLISHED to CLOSE without ever passing through FIN_WAIT_1, FIN_WAIT_2 or TIME_WAIT, so this is not the [TIME_WAIT](https://hackmd.io/@eecheng/Sy9XsgjOI#%E8%A7%A3%E9%87%8B-drop-tcp-socket-%E6%A0%B8%E5%BF%83%E6%A8%A1%E7%B5%84%E9%81%8B%E4%BD%9C%E5%8E%9F%E7%90%86%E3%80%82TIME-WAIT-sockets-%E5%8F%88%E6%98%AF%E4%BB%80%E9%BA%BC%EF%BC%9F) problem the instructor suggested. Given that the connection is being closed while unread data remains, the next step is to look at the response the server sends back:
```c
#define HTTP_RESPONSE_200_KEEPALIVE_DUMMY \
"" \
"HTTP/1.1 200 OK" CRLF "Server: " KBUILD_MODNAME CRLF \
"Content-Type: text/plain" CRLF "Content-Length: 12" CRLF \
"Connection: Keep-Alive" CRLF CRLF "Hello World!" CRLF
```
The response body `Hello World!` is followed by a trailing `CRLF`, but `Content-Length: 12` does not account for it; correcting the length to 14 bytes fixes this bug.
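With that change, the response macro becomes:
```c
#define HTTP_RESPONSE_200_KEEPALIVE_DUMMY                     \
    ""                                                        \
    "HTTP/1.1 200 OK" CRLF "Server: " KBUILD_MODNAME CRLF     \
    "Content-Type: text/plain" CRLF "Content-Length: 14" CRLF \
    "Connection: Keep-Alive" CRLF CRLF "Hello World!" CRLF
```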
:::info
Note: according to a [blog post](http://www.brendangregg.com/blog/2020-11-04/bpf-co-re-btf-libbpf.html) by Brendan Gregg (one of the main eBPF contributors), the Python-based way of using eBPF will gradually be abandoned in favor of libbpf. ~~So much for doing convenient data analysis in Python~~ :cry:
:::
### Kernel threads are not released correctly
I had already read [AndybnACT](https://hackmd.io/@AndybnA/khttpd#http_server_worker-%E6%B2%92%E6%9C%89%E8%A2%AB%E6%AD%A3%E7%A2%BA%E5%9C%B0%E9%87%8B%E6%94%BE)'s notes on this. The bug can be reproduced by connecting with a browser and then unloading `khttpd` with `rmmod`.
```shell
[19459.787911] BUG: unable to handle page fault for address: ffffffffc13f1465
[19459.787915] #PF: supervisor instruction fetch in kernel mode
[19459.787916] #PF: error_code(0x0010) - not-present page
[19459.787917] PGD 44f40f067 P4D 44f40f067 PUD 44f411067 PMD 453fe7067 PTE 0
[19459.787921] Oops: 0010 [#1] SMP PTI
[19459.787924] CPU: 2 PID: 414058 Comm: khttpd Tainted: G OE 5.8.0-53-generic #60~20.04.1-Ubuntu
[19459.787925] Hardware name: Micro-Star International Co., Ltd. PX60 2QD/MS-16H6, BIOS E16H6IMS.110 11/03/2015
[19459.787928] RIP: 0010:0xffffffffc13f1465
[19459.787930] Code: Unable to access opcode bytes at RIP 0xffffffffc13f143b.
[19459.787931] RSP: 0018:ffff9f3c01653d60 EFLAGS: 00010282
[19459.787933] RAX: 0000000000000243 RBX: ffff8edadc595e00 RCX: 0000000000000005
[19459.787934] RDX: 0000000000000000 RSI: 00000000fffffe01 RDI: ffffffffb7f4ceff
[19459.787936] RBP: ffff9f3c01653dd8 R08: 0000000000078014 R09: ffff8eda9f69b550
[19459.787937] R10: 0000000000000243 R11: 0000000000000243 R12: ffff8edaa675f000
[19459.787938] R13: ffff8eda9d9d5140 R14: ffff8eda9d9d5140 R15: ffff9f3c00e87e10
[19459.787939] FS: 0000000000000000(0000) GS:ffff8edadec80000(0000) knlGS:0000000000000000
[19459.787941] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19459.787942] CR2: ffffffffc13f143b CR3: 0000000455dda005 CR4: 00000000003606e0
[19459.787943] Call Trace:
[19459.787951] ? kthread+0x114/0x150
[19459.787953] ? kthread_park+0x90/0x90
[19459.787956] ? ret_from_fork+0x22/0x30
```
This bug occurs because the `http_server_daemon` function below creates one kthread per client connection (line 205 in the listing below), but nothing reclaims those kthreads when the server is shut down. If a client never learns about the disconnect and keeps sending data over the connection, the kernel ends up executing code that has already been unloaded, resulting in a kernel page fault. This bug is fixed later, together with the introduction of [cmwq](https://www.kernel.org/doc/html/latest/core-api/workqueue.html).
```c=188
int http_server_daemon(void *arg)
{
struct socket *socket;
struct task_struct *worker;
struct http_server_param *param = (struct http_server_param *) arg;

allow_signal(SIGKILL);
allow_signal(SIGTERM);

while (!kthread_should_stop()) {
int err = kernel_accept(param->listen_socket, &socket, 0);
if (err < 0) {
if (signal_pending(current))
break;
pr_err("kernel_accept() error: %d\n", err);
continue;
}
worker = kthread_run(http_server_worker, socket, KBUILD_MODNAME);
if (IS_ERR(worker)) {
pr_err("can't create more worker process\n");
continue;
}
}
return 0;
}
```
## Concurrency Managed Workqueue (cmwq)
[cmwq](https://www.kernel.org/doc/html/latest/core-api/workqueue.html) is a mechanism provided by the Linux kernel that implements the workqueue API: it asynchronously executes work items (functions) written by kernel developers, so developers no longer have to manage the underlying kthreads and related resources themselves. It was also designed so that workqueues make maximal use of the hardware, meeting Linux's [scalability](https://hackmd.io/@sysprog/linux-scalability) requirements.
### cmwq Design
* work: the function the developer wants to execute asynchronously.
* workqueue: holds pending work items; the user enqueues work onto the workqueue, and a worker later dequeues it for execution.
* worker: the kernel thread that actually executes the work.
* worker-pools: essentially the [thread-pool](https://hackmd.io/@sysprog/concurrency/https%3A%2F%2Fhackmd.io%2F%40sysprog%2Fposix-threads#Thread-Pool) concept.
The cmwq design decouples workqueues from worker-pools: the user simply pushes work onto a workqueue and does not need to care how the OS dispatches that work to workers. The system pre-creates two worker-pools for each CPU, one for normal work and one for high-priority work, and additional worker-pools are allocated dynamically when needed. How a worker-pool dispatches work depends on the flags given to `alloc_workqueue`: for example, `WQ_UNBOUND` means the queue's work is not bound to any particular CPU, while work on a `WQ_HIGHPRI` workqueue is handled by the high-priority worker-pool.
cmwq is essentially designed to maximize the benefit of concurrency; to see exactly how work is assigned to workers you have to trace the code yourself.
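To make the API concrete, here is a minimal, self-contained sketch of the workqueue flow in a dummy module; all names (`demo_wq`, `demo_fn`, `demo_work`) are made up for illustration and are not part of khttpd:
```c
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;

/* The work function runs later in a worker kthread chosen by cmwq. */
static void demo_fn(struct work_struct *work)
{
    pr_info("demo work executed\n");
}
static DECLARE_WORK(demo_work, demo_fn);

static int __init demo_init(void)
{
    /* WQ_UNBOUND: work is not pinned to the CPU that queued it;
     * max_active = 0 selects the default concurrency limit. */
    demo_wq = alloc_workqueue("demo_wq", WQ_UNBOUND, 0);
    if (!demo_wq)
        return -ENOMEM;
    queue_work(demo_wq, &demo_work);
    return 0;
}

static void __exit demo_exit(void)
{
    /* destroy_workqueue() drains remaining work before freeing the queue. */
    destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```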
### Introducing cmwq into khttpd
Following the approach used in [kecho](https://hackmd.io/@sysprog/linux2020-kecho), `struct http_server_wq` manages the workqueue and `struct http_server` manages a work item.
```c
struct http_server_wq {
struct workqueue_struct *client_wq;
struct list_head head;
bool should_stop;
};
struct http_server {
struct socket *socket;
struct work_struct work;
struct list_head list;
};
```
In `http_server_daemon`, a `WQ_UNBOUND` workqueue is allocated, and a new work item is created for every incoming client connection; which worker (kthread) runs each work item is left to the OS. The allocated `struct http_server` objects are tracked with a linked list.
```c
static struct work_struct *create_work(struct socket *socket)
{
struct http_server *client;
client = kmalloc(sizeof(struct http_server), GFP_KERNEL);
if (!client)
return NULL;
client->socket = socket;
INIT_WORK(&client->work, http_wq_work);
list_add(&client->list, &wq.head);
return &client->work;
}
int http_server_daemon(void *arg)
{
...
wq.client_wq = alloc_workqueue("kthhpd_client_wq", WQ_UNBOUND, 0);
if (!wq.client_wq)
return -ENOMEM;
wq.should_stop = false;
INIT_LIST_HEAD(&wq.head);
...
while (!kthread_should_stop()) {
...
work = create_work(socket);
if (!work) {
pr_err("can't create more work\n");
continue;
}
queue_work(wq.client_wq, work);
}
wq.should_stop = true;
destroy_work();
return 0;
}
```
As for reclaiming the allocated work items, they are freed one by one through the linked list when the module is unloaded. The obvious drawback is that resources belonging to clients that have already disconnected cannot be released dynamically; they are only freed at `rmmod` time.
```c
static void destroy_work(void)
{
struct http_server *curr, *tmp;
list_for_each_entry_safe (curr, tmp, &wq.head, list) {
flush_work(&curr->work);
kfree(curr);
}
}
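/* Work callback registered with INIT_WORK() in create_work(); each instance
 * serves one client connection until it disconnects or the server stops. */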
static void http_wq_work(struct work_struct *work)
{
struct http_server *client = container_of(work, struct http_server, work);
struct socket *socket = client->socket;
...
while (!wq.should_stop) {
int ret = http_server_recv(socket, buf, RECV_BUFFER_SIZE - 1);
if (ret <= 0) {
if (ret)
pr_err("recv error: %d\n", ret);
break;
}
http_parser_execute(&parser, &setting, buf, ret);
if (request.complete && !http_should_keep_alive(&parser))
break;
}
kernel_sock_shutdown(socket, SHUT_RDWR);
sock_release(socket);
kfree(buf);
}
```
The `make check` numbers before and after adopting cmwq are shown below; requests/sec nearly doubled.
```shell
kthread cmwq
requests: 100000 requests: 100000
good requests: 100000 [100%] good requests: 100000 [100%]
bad requests: 0 [0%] bad requests: 0 [0%]
socker errors: 0 [0%] socker errors: 0 [0%]
seconds: 3.056 seconds: 1.607
requests/sec: 32727.322 requests/sec: 62215.171
```
:::info
TODO: for releasing work resources dynamically, one idea is to create a dedicated work item and free resources in an event-driven way; the kernel's [wait queue](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h) offers this kind of event-driven waiting. Since the linked list is a shared resource, a lock is probably unavoidable, which will likely cost a bit of throughput, but lacking a better idea I will start in this direction (see the rough sketch after this note).
:::
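As a starting point, here is a rough sketch of that direction. It assumes a spinlock protecting the shared list and a hypothetical `closed` flag added to `struct http_server` that the work function would set when its connection ends; none of the names below (`wq_lock`, `add_client`, `reap_dead_clients`) exist in khttpd yet:
```c
static DEFINE_SPINLOCK(wq_lock); /* guards wq.head */

static void add_client(struct http_server *client)
{
    spin_lock(&wq_lock);
    list_add(&client->list, &wq.head);
    spin_unlock(&wq_lock);
}

/* Walk the list and free clients whose work has finished; this could run
 * periodically or be woken up through a wait queue event. */
static void reap_dead_clients(void)
{
    struct http_server *curr, *tmp;

    spin_lock(&wq_lock);
    list_for_each_entry_safe (curr, tmp, &wq.head, list) {
        if (curr->closed) { /* hypothetical flag set by the work function */
            list_del(&curr->list);
            kfree(curr);
        }
    }
    spin_unlock(&wq_lock);
}
```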
## Other references
> [鳥哥的 Linux 私房菜, Chapter 2: Basic network concepts](http://linux.vbird.org/linux_server/0110network_basic.php)