
---
tags: linux2022, homework, hw6
---

# Echo Server

## Goals

- [x] Explain how user-echo-server works, in particular how it uses the epoll system calls
- [ ] The given kecho already uses CMWQ; describe its advantages and usage
- [ ] Do you understand how bench works, and can you compare the performance of kecho and user-echo-server? Support the comparison with plots

---

## user-echo-server

### Create a socket

According to [The GNU C Library - 16 Sockets](https://www.gnu.org/software/libc/manual/html_node/Sockets.html):

> A *socket* is a generalized interprocess communication **channel**. Like a pipe, a socket is represented as a file descriptor. Unlike pipes sockets support communication between unrelated processes, and **even between processes running on different machines that communicate over a network**.

A socket is a channel for inter-process communication, and it is not limited to related processes or to processes on the same machine. A socket is created with [`socket(2)`](https://www.man7.org/linux/man-pages/man2/socket.2.html):

```c
int socket(int domain, int type, int protocol);
```

This system call takes three parameters:

- `domain`

  A socket by itself is not enough for inter-process communication. It first needs address information such as an IP address and a port, which is then interpreted according to the chosen protocol. This address information is effectively the socket's name, so it is sometimes called a name. Some protocols share the same kind of address information; for example, TCP and UDP both use an IP address and a port. Such related protocols are grouped into a Protocol Family (PF), which from the socket's point of view is also called an Address Family (AF), a namespace, or a domain. The `domain` values are defined in [`include/linux/socket.h`](https://github.com/torvalds/linux/blob/master/include/linux/socket.h#L232). Names with both the `PF_` and the `AF_` prefix exist, but each pair maps to the same value.

- `type`

  `type` specifies the characteristics of the communication style, such as reliability and whether it is connection-based. The possible values are enumerated in [`include/linux/net.h`](https://github.com/torvalds/linux/blob/42226c989789d8da4af1de0c31070c96726d990c/include/linux/net.h#L61). [The GNU C Library - 16.2 Communication Styles](https://www.gnu.org/software/libc/manual/html_node/Communication-Styles.html) describes the differences among `SOCK_STREAM`, `SOCK_DGRAM`, and `SOCK_RAW`.

- `protocol`

  This parameter selects the protocol that provides the given `type` within the given `domain`. The `protocol` values for different domains may be defined in different headers; for example, those for `AF_INET` are enumerated and defined in [`include/uapi/linux/in.h`](https://github.com/torvalds/linux/blob/master/include/uapi/linux/in.h). Because a `domain` **usually** has only one protocol that matches a given `type`, passing 0 as `protocol` selects that matching protocol; for example, the protocol matching `SOCK_STREAM` under `PF_INET` is `IPPROTO_TCP`. Note that a `domain` does not necessarily have a protocol matching every `type`.

These parameters are eventually passed to `__sock_create`, defined in `net/socket.c`, which performs the initialization so that later socket operations follow the chosen Address Family:

```c
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
{
    return __sys_socket(family, type, protocol);
}

int __sys_socket(int family, int type, int protocol)
{
    int retval;
    struct socket *sock;
    int flags;
    ...
    retval = sock_create(family, type, protocol, &sock);
    if (retval < 0)
        return retval;

    return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
}

int sock_create(int family, int type, int protocol, struct socket **res)
{
    return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0);
}
```

### Non-blocking socket

After `socket` succeeds and returns a socket descriptor, `setnonblock` calls `fcntl` to add the `O_NONBLOCK` flag to the descriptor, so that `epoll` can later be used for non-blocking I/O.

```c
static int setnonblock(int fd)
{
    int fdflags;
    if ((fdflags = fcntl(fd, F_GETFL, 0)) == -1)
        return -1;
    fdflags |= O_NONBLOCK;
    if (fcntl(fd, F_SETFL, fdflags) == -1)
        return -1;
    return 0;
}
```

:::danger
Why is this socket set to non-blocking?
:::

:::warning
According to the description of `SOCK_NONBLOCK` in the [`socket` man page](https://www.man7.org/linux/man-pages/man2/socket.2.html):

> Set the `O_NONBLOCK` file status flag on the open file description (see `open(2)`) referred to by the new file descriptor. Using this flag saves extra calls to `fcntl(2)` to achieve the same result.


In the `type` argument, the communication semantics can be ORed with `SOCK_NONBLOCK`, so that the returned socket descriptor already has `O_NONBLOCK` set and the extra call to `setnonblock` is unnecessary.
:::

### Bind an address to the socket

A socket created by `socket(2)` has no socket address (name) assigned to it, so one must be set with [`bind(2)`](https://www.man7.org/linux/man-pages/man2/bind.2.html):

```c
int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
```

The `addr` argument is declared as a pointer to `struct sockaddr` and is used to set the socket address, so the first step is to build an object that represents the socket address. `struct sockaddr` itself is defined as:

```c
// include/uapi/linux/socket.h
typedef unsigned short __kernel_sa_family_t;

// include/linux/socket.h
typedef __kernel_sa_family_t sa_family_t;

struct sockaddr {
    sa_family_t sa_family; /* address family, AF_xxx */
    char sa_data[14];      /* 14 bytes of protocol address */
};
```

But the `bind` man page states:

> The only purpose of this structure is to cast the structure pointer passed in addr in order to avoid compiler warnings.

If the only purpose of this structure is to avoid compiler warnings, what should actually be passed in? Looking at what `bind` does, the function it ends up calling is [`__sys_bind`, defined in `net/socket.c`](https://github.com/torvalds/linux/blob/42226c989789d8da4af1de0c31070c96726d990c/net/socket.c#L1683). `__sys_bind` first looks up the actual [`socket` structure](https://github.com/torvalds/linux/blob/42226c989789d8da4af1de0c31070c96726d990c/include/linux/net.h#L114) object `sock` from the socket descriptor and then calls that socket's [`bind` operation, `sock->ops->bind`](https://github.com/torvalds/linux/blob/42226c989789d8da4af1de0c31070c96726d990c/include/linux/net.h#L160). This `bind` function is assigned in `__sock_create`, which was called earlier when [`socket`](https://github.com/torvalds/linux/blob/42226c989789d8da4af1de0c31070c96726d990c/net/socket.c#L1541) created the socket:

```c
// net/socket.c
int __sock_create(struct net *net, int family, int type, int protocol,
                  struct socket **res, int kern)
{
    ...
#ifdef CONFIG_MODULES
    /* Attempt to load a protocol module if the find failed.
     *
     * 12/09/1996 Marcin: But! this makes REALLY only sense, if the user
     * requested real, full-featured networking support upon configuration.
     * Otherwise module support will break!
     */
    if (rcu_access_pointer(net_families[family]) == NULL)
        request_module("net-pf-%d", family);
#endif

    rcu_read_lock();
    pf = rcu_dereference(net_families[family]);
    ...
    err = pf->create(net, sock, protocol, kern);
    ...
}
```

`pf->create`, used in `__sock_create`, dispatches to a different function for each Address Family. For `AF_INET`, the corresponding `create` is [`inet_create`, defined in `net/ipv4/af_inet.c`](https://github.com/torvalds/linux/blob/42226c989789d8da4af1de0c31070c96726d990c/net/ipv4/af_inet.c#L245):

:::warning
An Address Family implementation can be provided as a kernel module. When `__sock_create` calls `request_module`, the corresponding module is loaded into the system through `modprobe`. The `net_families[family]` entry in `net/socket.c` is assigned when the module's initialization calls `sock_register`, which is also defined in `net/socket.c`.
:::

```c
static const struct net_proto_family inet_family_ops = {
    .family = PF_INET,
    .create = inet_create,
    .owner = THIS_MODULE,
};
```

`inet_create` compares the given `protocol` against [the `protocol` entries defined in `inetsw_array`](https://github.com/torvalds/linux/blob/42226c989789d8da4af1de0c31070c96726d990c/net/ipv4/af_inet.c#L1120) and assigns the matching `sock->ops`. For `SOCK_STREAM`, `sock->ops->bind` is the [`inet_bind` function](https://github.com/torvalds/linux/blob/42226c989789d8da4af1de0c31070c96726d990c/net/ipv4/af_inet.c#L435):

```c
int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
    ...
    return __inet_bind(sk, uaddr, addr_len, flags);
}

int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
                u32 flags)
{
    struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
    ...
}
```

`__inet_bind` then casts the generic `sockaddr` to the structure required by the socket's Address Family, which is `sockaddr_in` for `AF_INET`. This shows that `sockaddr` serves as a generic interface, while the actual `addr` should, as the `bind` man page says, be an object of the structure that matches the Address Family in use:

> The actual structure passed for the addr argument will depend on the address family.

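
The cast described above can also be seen from the user-space side. The following is a minimal sketch, not code taken from user-echo-server: the helper name `bind_loopback`, the loopback address, and the use of port 0 (letting the kernel choose a free port) are all assumptions made for illustration. It fills in a `struct sockaddr_in` and passes it to `bind` through a `struct sockaddr *` cast, mirroring the reverse cast in `__inet_bind`.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of user-echo-server): bind a TCP socket to
 * 127.0.0.1 on a kernel-chosen port. Returns the bound fd, which the caller
 * must close, or -1 on error. The chosen port is stored in *port_out in
 * host byte order. */
static int bind_loopback(unsigned short *port_out)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* The address is described with the AF_INET-specific structure... */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); /* 0 asks the kernel to pick a free port */

    /* ...and only cast to the generic struct sockaddr at the call site. */
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    /* Read back the address the kernel actually assigned. */
    struct sockaddr_in bound;
    socklen_t len = sizeof(bound);
    if (getsockname(fd, (struct sockaddr *) &bound, &len) < 0) {
        close(fd);
        return -1;
    }
    assert(bound.sin_family == AF_INET);

    *port_out = ntohs(bound.sin_port);
    return fd;
}
```

A caller would invoke `bind_loopback(&port)`, continue with `listen` on the returned descriptor, and `close` it when done. An echo server would normally bind to a fixed, known port instead of port 0; the ephemeral port is used here only so the sketch does not collide with a port already in use.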
:::success
For why `void *` is not used instead, see [rickywu0421's article on Linux kernel networking](https://hackmd.io/@rickywu0421/linux_networking_1).
:::

### Marks the socket as a passive socket

:::info
:::

### Non-blocking Handling

After [`listen`](https://www.man7.org/linux/man-pages/man2/listen.2.html) succeeds, the socket descriptor becomes a passive socket descriptor that listens for `connect` requests, and [`accept`](https://man7.org/linux/man-pages/man2/accept.2.html) can then be used to wait for them. However, according to the man page:

> The accept() system call is used with connection-based socket types (SOCK_STREAM, SOCK_SEQPACKET). It extracts the first connection request on the queue of pending connections for the listening socket, sockfd, creates a new connected socket, and returns a new file descriptor referring to that socket.
>
> If no pending connections are present on the queue, and the socket is not marked as nonblocking, accept() blocks the caller until a connection is present.

When `accept` is called in blocking mode, it blocks until a connection request is present and then takes requests off the queue in order. [`recv`](https://man7.org/linux/man-pages/man2/recv.2.html), which receives messages once a connection is established, likewise waits for data. To handle both connection requests and incoming messages with blocking I/O, a dedicated thread has to be created for every new connection; otherwise the server can get stuck in `accept` or `recv` while data that is already available goes unread. user-echo-server does not manage connections by creating threads. Instead, it uses `epoll` together with non-blocking I/O to handle whichever descriptor is readable first, whether that is the passive socket descriptor or the FD of an accepted socket.

### epoll

To use epoll, first create an epoll instance with `epoll_create`:

```c
int epoll_create(int size);
```

All subsequent epoll operations need the epoll file descriptor returned by `epoll_create`.

:::warning
The [`epoll_create` man page](https://man7.org/linux/man-pages/man2/epoll_create.2.html) says the following about the `size` parameter:

> **DESCRIPTION:**
> epoll_create() creates a new epoll(7) instance. Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see NOTES.
>
> **NOTES:**
> In the initial epoll_create() implementation, the size argument informed the kernel of the number of file descriptors that the caller expected to add to the epoll instance. The kernel used this information as a hint for the amount of space to initially allocate in internal data structures describing events. (If necessary, the kernel would allocate more space if the caller's usage exceeded the hint given in size.) Nowadays, this hint is no longer required (the kernel dynamically sizes the required data structures without needing the hint), but size must still be greater than zero, in order to ensure backward compatibility when new epoll applications are run on older kernels.

So this parameter has had no practical effect since Linux 2.6.8, although it must still be greater than zero; it is kept only for compatibility with older kernels.
:::

Next, [`epoll_ctl`](https://man7.org/linux/man-pages/man2/epoll_ctl.2.html) adds the passive socket descriptor or the FD of an accepted socket to the interest list of the epoll instance, to wait for the specified events:

```c
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
```

:::warning
In this echo server, the I/O events to wait for are `connect` and `recv`. According to the `accept` man page:

> In order to be notified of incoming connections on a socket, you can use select(2), poll(2), or epoll(7). A **readable event** will be delivered when a new connection is attempted and you may then call accept() to get a socket for that connection. Alternatively, you can set the socket to deliver SIGIO when activity occurs on a socket; see socket(7) for details.

So an incoming connection waiting for `accept` is reported as a "read" event, and `recv`, judging by its purpose and name, also corresponds to a "read" event. Therefore, going by the epoll event types listed in the [`epoll_ctl` man page](https://man7.org/linux/man-pages/man2/epoll_ctl.2.html), the event type to specify in `epoll_ctl` is `EPOLLIN`. The `user-echo-server` implementation, however, specifies the `EPOLLET` flag in addition to `EPOLLIN`, and the [`epoll` man page](https://man7.org/linux/man-pages/man7/epoll.7.html) says:

> An application that employs the EPOLLET flag should use nonblocking file descriptors to avoid having a blocking read or write starve a task that is handling multiple file descriptors.


In other words, edge-triggered mode should be paired with non-blocking file descriptors, so that a read or write on one descriptor cannot block and starve the other descriptors being handled.
:::

Once the I/O specified for a descriptor in the interest list is ready, the descriptor becomes available in the ready list, which is accessed through [`epoll_wait`](https://man7.org/linux/man-pages/man2/epoll_wait.2.html):

```c
typedef union epoll_data {
    void *ptr;
    int fd;
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;

struct epoll_event {
    uint32_t events;   /* Epoll events */
    epoll_data_t data; /* User data variable */
};

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
```

`epoll_wait` waits at most `timeout` milliseconds. If events become ready for FDs in the interest list during that time, it returns the number of ready events; a single call reports at most `maxevents` events. The FD of each ready event is available as `events[i].data.fd`. In user-echo-server, when `epoll_wait` returns a value greater than 0, each event's FD is checked to tell a connection request from an incoming message, and the two cases are handled separately:

- Connect Request

  When the event's FD is the passive socket descriptor, a `connect` request has arrived. The server accepts it with `accept`, adds the FD of the new accepted socket to the interest list of the epoll instance, and adds a new client to client_list.

:::warning
As its prototype shows, `accept` has no parameter for setting flags on the returned descriptor:

```c
#include <sys/socket.h>

int accept(int sockfd, struct sockaddr *restrict addr,
           socklen_t *restrict addrlen);

#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/socket.h>

int accept4(int sockfd, struct sockaddr *restrict addr,
            socklen_t *restrict addrlen, int flags);
```

There is, however, another system call, `accept4`, which like `socket` accepts `SOCK_NONBLOCK` so that the returned socket descriptor has `O_NONBLOCK` set, removing the need to call `setnonblock`.
:::

- Message Receiving

  Any other FD means that one of the clients has a message to read. In that case `handle_message_from_client` is called to read the message with `recv` and then send the same message back to the client with [`send`](https://man7.org/linux/man-pages/man2/send.2.html).

## kecho - kernel echo server

### Initialization

As mentioned in [the mechanism for initializing the kernel module with `port=` at `insmod` time](/rJ7YiwC_TSa6C5l2i-p0kg), `insmod` calls `kecho_init_module` to perform initialization:

```c
static int kecho_init_module(void)
{
    int error = open_listen(&listen_sock);
    if (error < 0) {
        printk(KERN_ERR MODULE_NAME ": listen socket open error\n");
        return error;
    }

    param.listen_sock = listen_sock;

    kecho_wq = alloc_workqueue(MODULE_NAME, bench ? 0 : WQ_UNBOUND, 0);
    echo_server = kthread_run(echo_server_daemon, &param, MODULE_NAME);
    if (IS_ERR(echo_server)) {
        printk(KERN_ERR MODULE_NAME ": cannot start server daemon\n");
        close_listen(listen_sock);
    }

    return 0;
}
```

Initialization first calls `open_listen` to set up the listening socket. Apart from switching to the corresponding kernel socket API and some extra handling, the flow is largely the same as creating a socket server in user mode:

1. Initialize socket

   Uses [`sock_create`](https://www.kernel.org/doc/html/v5.6/networking/kapi.html#c.sock_create) to create a socket, which is handed back to the caller through the `res` argument.

2. `setsockopt`

   Before bind, `setsockopt` is applied twice, to set:

   1. `TCP_NODELAY`

      The [tcp man page](https://man7.org/linux/man-pages/man7/tcp.7.html) describes this option:

      > If set, disable the [Nagle algorithm](https://en.wikipedia.org/wiki/Nagle's_algorithm). This means that **segments are always sent as soon as possible**, even if there is only a small amount of data. **When not set, data is buffered until there is a sufficient amount** to send out, thereby avoiding the frequent sending of small packets, which results in poor utilization of the network. This option is overridden by TCP_CORK; however, setting this option forces an explicit flush of pending output, even if TCP_CORK is currently set.

      This setting keeps Nagle's algorithm from holding data back and adding latency, at the cost of sending more small packets, which can lower network utilization.

   2. `SO_REUSEPORT`

      The [socket(7) man page](https://man7.org/linux/man-pages/man7/socket.7.html) describes `SO_REUSEPORT`:

      > Permits multiple AF_INET or AF_INET6 sockets to be bound to an identical socket address. This option must be set on each socket (including the first socket) prior to calling bind(2) on the socket. To prevent port hijacking, all of the processes binding to the same address must have the same effective UID. This option can be employed with both TCP and UDP sockets.
      >
      > For TCP sockets, this option **allows accept(2) load distribution in a multi-threaded server to be improved by using a distinct listener socket for each thread.** This provides improved load distribution as compared to ==traditional techniques such using a single accept(2)ing thread that distributes connections==, or having multiple threads that compete to accept(2) from the same socket.

      This setting allows multiple sockets to listen on the same [socket address](https://www.gnu.org/software/libc/manual/html_node/Socket-Addresses.html).

      :::info
      :::

3. Assigning a name to a socket

   Calls [`kernel_bind`](https://www.kernel.org/doc/html/v5.6/networking/kapi.html#c.kernel_bind) to set the port and address to listen on.

4. Marks the socket as a passive socket

   Uses [`kernel_listen`](https://www.kernel.org/doc/html/v5.6/networking/kapi.html#c.kernel_listen) to put the socket into the listening state.

### Workqueue

After `kernel_listen` puts the socket into the listening state, `alloc_workqueue` creates a workqueue.

:::info
:::

### Echo-Server Daemon

Finally, `kthread_run` creates another kernel thread to run `echo_server_daemon`, which repeatedly calls `kernel_accept` to accept connection requests.

:::danger
Unlike user-echo-server, which waits in a non-blocking fashion through epoll, does this daemon handle connection requests by blocking?
:::

After a connection request arrives, `create_work` and `queue_work` add an `echo_server_worker` task to the workqueue created earlier; that task repeatedly calls `get_request` and `send_request`.

#### Signal Handling

#### kthread_should_stop

### Destroy

---

## kecho vs user-echo-server

## Bench

### Can bench be used to compare the performance of kecho and user-echo-server?

---

## Fix kecho's runtime defects and improve its performance and robustness

---

## References

- Man Page
- [Source Code](https://www.github.com/torvalds/linux)
- [Kernel Document - The `struct proto_ops` Structure](https://linux-kernel-labs.github.io/refs/heads/master/labs/networking.html#the-struct-proto-ops-structure)