# 2022q1 Homework6 (ktcp)
contributed by < `hankluo6` >
## kecho
### Passing parameters to a kernel module
`insmod` eventually calls `load_module`. Inside `load_module`, `find_module_sections` first points the module's fields at the corresponding ELF sections, after which `parse_args` can write the supplied arguments into them.
```c
static int find_module_sections(struct module *mod, struct load_info *info)
{
    mod->kp = section_objs(info, "__param",
                           sizeof(*mod->kp), &mod->num_kp);
    ...
}
```
In `find_module_sections`, `struct module` is the main structure representing a kernel module; `mod->kp` is pointed at the `__param` section and `mod->num_kp` is set accordingly.
```c
/* Args looks like "foo=bar,bar2 baz=fuz wiz". */
char *parse_args(const char *doing,
                 char *args,
                 const struct kernel_param *params,
                 unsigned num,
                 s16 min_level,
                 s16 max_level,
                 void *arg,
                 int (*unknown)(char *param, char *val,
                                const char *doing, void *arg))
{
    ...
    while (*args) {
        int ret;
        int irq_was_disabled;

        args = next_arg(args, &param, &val);
        /* Stop at -- */
        if (!val && strcmp(param, "--") == 0)
            return err ?: args;
        ret = parse_one(param, val, doing, params, num,
                        min_level, max_level, arg, unknown);
        ...
    }

    return err;
}

static int parse_one(char *param,
                     char *val,
                     const char *doing,
                     const struct kernel_param *params,
                     unsigned num_params,
                     s16 min_level,
                     s16 max_level,
                     void *arg,
                     int (*handle_unknown)(char *param, char *val,
                                           const char *doing, void *arg))
{
    unsigned int i;
    int err;

    /* Find parameter */
    for (i = 0; i < num_params; i++) {
        if (parameq(param, params[i].name)) {
            if (param_check_unsafe(&params[i]))
                err = params[i].ops->set(val, &params[i]);
            else
                err = -EPERM;
            return err;
        }
    }
    ...
}
```
`parse_args` hands each user-supplied argument to `parse_one`, which compares the given parameter name against every parameter name stored in the `__param` section; on a match, the value is written via `params[i].ops->set`.
So `module_param` must place each parameter into the `__param` section and supply the matching `ops`:
```c
#define module_param(name, type, perm) \
    module_param_named(name, name, type, perm)

#define module_param_named(name, value, type, perm) \
    param_check_##type(name, &(value)); \
    module_param_cb(name, &param_ops_##type, &value, perm); \
    __MODULE_PARM_TYPE(name, #type)

#define module_param_cb(name, ops, arg, perm) \
    __module_param_call(MODULE_PARAM_PREFIX, name, ops, arg, perm, -1, 0)

#define __module_param_call(prefix, name, ops, arg, perm, level, flags) \
    /* Default value instead of permissions? */ \
    static const char __param_str_##name[] = prefix #name; \
    static struct kernel_param __moduleparam_const __param_##name \
        __used __section("__param") \
        __aligned(__alignof__(struct kernel_param)) \
        = {__param_str_##name, THIS_MODULE, ops, \
           VERIFY_OCTAL_PERMISSIONS(perm), level, flags, {arg}}
```
The expansion ultimately goes through `__module_param_call`, whose `__used __section("__param")` confirms that these parameters are placed in the `__param` section. The `ops` used by `ops->set` is passed in by `module_param_named` as `&param_ops_##type`; `param_ops_##type` expands to a predefined set of operations such as `param_ops_ushort` or `param_ops_bool`, each of which defines the `set` operation for its type, and that `set` function writes the value into the variable.
### How user-echo-server works
```c
int main(void)
{
    static struct epoll_event events[EPOLL_SIZE];
    struct sockaddr_in addr = {
        .sin_family = PF_INET,
        .sin_port = htons(SERVER_PORT),
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };
    socklen_t socklen = sizeof(addr);
    client_list_t *list = NULL;
    int listener;

    if ((listener = socket(PF_INET, SOCK_STREAM, 0)) < 0)
        server_err("Fail to create socket", &list);
    printf("Main listener (fd=%d) was created.\n", listener);

    if (setnonblock(listener) == -1)
        server_err("Fail to set nonblocking", &list);

    if (bind(listener, (struct sockaddr *) &addr, sizeof(addr)) < 0)
        server_err("Fail to bind", &list);
    printf("Listener was binded to %s\n", inet_ntoa(addr.sin_addr));

    if (listen(listener, 128) < 0)
        server_err("Fail to listen", &list);
    ...
}
```
A socket is created with [`socket(2)`](https://man7.org/linux/man-pages/man2/socket.2.html); `setnonblock` sets the socket to nonblocking mode, and [`bind(2)`](https://man7.org/linux/man-pages/man2/bind.2.html) binds the socket to an address.
[`listen(2)`](https://man7.org/linux/man-pages/man2/listen.2.html) puts the socket into the listening state so it can accept connection requests from clients. For TCP, the kernel maintains two queues: the first holds requests still in the middle of the [three-way handshake](https://en.wikipedia.org/wiki/Handshaking#TCP_three-way_handshake), and the second holds completed connections waiting for `accept`. On Linux, the `backlog` argument bounds only the second (accept) queue; the size of the SYN queue is governed separately by `net.ipv4.tcp_max_syn_backlog`.
```c
    ...
    int epoll_fd;
    if ((epoll_fd = epoll_create(EPOLL_SIZE)) < 0)
        server_err("Fail to create epoll", &list);

    static struct epoll_event ev = {.events = EPOLLIN | EPOLLET};
    ev.data.fd = listener;
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listener, &ev) < 0)
        server_err("Fail to control epoll", &list);

    while (1) {
        struct sockaddr_in client_addr;
        int epoll_events_count;
        if ((epoll_events_count = epoll_wait(epoll_fd, events, EPOLL_SIZE,
                                             EPOLL_RUN_TIMEOUT)) < 0)
            server_err("Fail to wait epoll", &list);

        for (int i = 0; i < epoll_events_count; i++) {
            /* EPOLLIN event for listener (new client connection) */
            if (events[i].data.fd == listener) {
                int client;
                while ((client = accept(listener,
                                        (struct sockaddr *) &client_addr,
                                        &socklen)) > 0) {
                    setnonblock(client);
                    ev.data.fd = client;
                    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client, &ev) < 0)
                        server_err("Fail to control epoll", &list);
                    push_back_client(&list, client,
                                     inet_ntoa(client_addr.sin_addr));
                }
            } else {
                /* EPOLLIN event for others
                 * (new incoming message from client) */
                if (handle_message_from_client(events[i].data.fd, &list) < 0)
                    server_err("Handle message from client", &list);
            }
        }
    }
    ...
```
[`epoll(7)`](https://man7.org/linux/man-pages/man7/epoll.7.html) can monitor multiple file descriptors at once and report which of them are ready for reading or writing:
> * epoll_create(2) creates a new epoll instance and returns a file
> descriptor referring to that instance. (The more recent
> epoll_create1(2) extends the functionality of epoll_create(2).)
>
> * Interest in particular file descriptors is then registered via
> epoll_ctl(2), which adds items to the interest list of the
> epoll instance.
>
> * epoll_wait(2) waits for I/O events, blocking the calling thread
> if no events are currently available. (This system call can be
> thought of as fetching items from the ready list of the epoll
> instance.)
The `listener` is registered with `epoll` via `epoll_ctl`, and `epoll_wait` returns once a client connects. In the `for` loop, [`accept(2)`](https://man7.org/linux/man-pages/man2/accept.2.html) accepts the connection; since the listener is registered edge-triggered (`EPOLLET`), `accept` is called in a loop until no pending connection remains, otherwise connections could be missed. Each accepted client fd is added to `epoll` via `epoll_ctl` to wait for incoming data, and `push_back_client` records the connected clients in a linked list.
When an event returned by `epoll_wait` does not belong to the `listener`, a client has sent data, and `handle_message_from_client` handles it.
```c
static int handle_message_from_client(int client, client_list_t **list)
{
    int len;
    char buf[BUF_SIZE];
    memset(buf, 0, BUF_SIZE);

    if ((len = recv(client, buf, BUF_SIZE, 0)) < 0)
        server_err("Fail to receive", list);

    if (len == 0) {
        if (close(client) < 0)
            server_err("Fail to close", list);
        *list = delete_client(list, client);
        printf("After fd=%d is closed, current numbers clients = %d\n",
               client, size_list(*list));
    } else {
        printf("Client #%d :> %s", client, buf);
        if (send(client, buf, BUF_SIZE, 0) < 0)
            server_err("Fail to send", list);
    }
    return len;
}
```
`handle_message_from_client` echoes the client's data back using `recv` and `send`. If `len` is 0, no data was received and the `client` side can be considered closed, so `delete_client` removes it from the linked list.
### `bench.c`
```c
static void *bench_worker(void *arg __attribute__((unused)))
{
    int sock_fd;
    char dummy[MAX_MSG_LEN];
    struct timeval start, end;

    /* wait until all workers created */
    pthread_mutex_lock(&worker_lock);
    while (!ready)
        if (pthread_cond_wait(&worker_wait, &worker_lock)) {
            puts("pthread_cond_wait failed");
            exit(-1);
        }
    pthread_mutex_unlock(&worker_lock);

    sock_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (sock_fd == -1) {
        perror("socket");
        exit(-1);
    }

    struct sockaddr_in info = {
        .sin_family = PF_INET,
        .sin_addr.s_addr = inet_addr(TARGET_HOST),
        .sin_port = htons(TARGET_PORT),
    };
    if (connect(sock_fd, (struct sockaddr *) &info, sizeof(info)) == -1) {
        perror("connect");
        exit(-1);
    }

    gettimeofday(&start, NULL);
    send(sock_fd, msg_dum, strlen(msg_dum), 0);
    recv(sock_fd, dummy, MAX_MSG_LEN, 0);
    gettimeofday(&end, NULL);

    shutdown(sock_fd, SHUT_RDWR);
    close(sock_fd);

    if (strncmp(msg_dum, dummy, strlen(msg_dum))) {
        puts("echo message validation failed");
        exit(-1);
    }

    pthread_mutex_lock(&res_lock);
    time_res[idx++] += time_diff_us(&start, &end);
    pthread_mutex_unlock(&res_lock);

    pthread_exit(NULL);
}

static void bench(void)
{
    for (int i = 0; i < BENCH_COUNT; i++) {
        ready = false;

        create_worker(MAX_THREAD);

        pthread_mutex_lock(&worker_lock);
        ready = true;

        /* all workers are ready, let's start bombing kecho */
        pthread_cond_broadcast(&worker_wait);
        pthread_mutex_unlock(&worker_lock);

        /* waiting for all workers to finish the measurement */
        for (int x = 0; x < MAX_THREAD; x++)
            pthread_join(pt[x], NULL);

        idx = 0;
    }

    for (int i = 0; i < MAX_THREAD; i++)
        fprintf(bench_fd, "%d %ld\n", i, time_res[i] /= BENCH_COUNT);
}
```
`create_worker` spawns threads via `pthread_create` to run `bench_worker`. To make all threads start at (roughly) the same time, each thread blocks in `pthread_cond_wait` until the condition is signaled; once all threads have been created, `pthread_cond_broadcast` wakes every waiting thread. Each `bench_worker` then opens a TCP connection, and `gettimeofday` timestamps taken around `send` and `recv` measure the echo round-trip time.
![](https://i.imgur.com/P1PwMrT.png)
![](https://i.imgur.com/SwRVwA2.png)
kecho is roughly 10x faster than user-echo-server. The user-space server's system calls (such as `listen` and `accept`) ultimately reach the same in-kernel `kernel_listen` and `kernel_accept`, but each call pays the extra cost of crossing the user/kernel boundary, which the in-kernel kecho avoids entirely.
:::warning
Although the overall cost of system calls keeps decreasing, it can never be eliminated entirely. This year we published [Effective System Call Aggregation (ESCA)](https://eecheng87.github.io/ESCA/) to mitigate the impact of the CPU mode switches that system calls incur.
:notes: jserv
:::
---
###### tags: `linux2022`