# 2021q3 Homework3 (vpoll)
contributed by < `ccs100203` >

###### tags: `linux2021q3`

> [quiz4](https://hackmd.io/@sysprog/linux2021-summer-quiz4)

## 程式原理

user 透過 epoll 註冊所需的 events,再利用 `ioctl` 對該 fd 送出 command;module 端則透過 `vpoll_poll` 回報 I/O 是否已準備好,並在 `vpoll_ioctl` 中喚醒等待中的 thread,user 最後便能取得回傳的 events。

### user.c

> [epoll](https://man7.org/linux/man-pages/man7/epoll.7.html)
>
> The epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them. The epoll API can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors.
>
> The central concept of the epoll API is the epoll instance, an in-kernel data structure which, from a user-space perspective, can be considered as a container for two lists:
>
> • The interest list (sometimes also called the epoll set): the set of file descriptors that the process has registered an interest in monitoring.
>
> • The ready list: the set of file descriptors that are "ready" for I/O. The ready list is a subset of (or, more precisely, a set of references to) the file descriptors in the interest list. The ready list is dynamically populated by the kernel as a result of I/O activity on those file descriptors.

```cpp
struct epoll_event ev = {
    .events = EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT | EPOLLHUP | EPOLLPRI,
    .data.u64 = 0,
};
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1)
    handle_error("epoll_ctl");
```

利用 [epoll_ctl](https://man7.org/linux/man-pages/man2/epoll_ctl.2.html) 將 `efd` 連同 `ev` 所描述的 events 加進 interest list 內。

```cpp
switch (fork()) {
case 0:
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN);
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN);
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN | EPOLLPRI);
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLPRI);
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLOUT);
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLHUP);
    exit(EXIT_SUCCESS);
```

使用 `fork` 後,在 child process 中以 `ioctl(efd, VPOLL_IO_ADDEVENTS, ...)` 送出 command,與 module 中的 `vpoll_ioctl` 對接,觸發相對應的 event。

> [ioctl](https://man7.org/linux/man-pages/man2/ioctl.2.html)

```cpp
default:
    while (1) {
        int nfds = epoll_wait(epollfd, &ev, 1, 1000);
        if (nfds < 0)
            handle_error("epoll_wait");
        else if (nfds == 0)
            printf("timeout...\n");
        else {
            printf("GOT event %x\n", ev.events);
            ioctl(efd, VPOLL_IO_DELEVENTS, ev.events);
            if (ev.events & EPOLLHUP)
                break;
        }
    }
    break;
case -1: /* should not happen */
    handle_error("fork");
}
```

至於 parent process 則利用 [epoll_wait](https://man7.org/linux/man-pages/man2/epoll_wait.2.html) 等待 ready list 中的 event。`nfds` 是取得的 event 數量;因為 `maxevents` 設定為 1,至多只會取得一個 event,於 else block 中將其印出,並透過 ioctl 送出 `VPOLL_IO_DELEVENTS` command 將該 event 清除,直到收到 `EPOLLHUP` 才離開迴圈。

### module.c

#### vpoll_data

```cpp
struct vpoll_data {
    wait_queue_head_t wqh;
    __poll_t events;
};
```

每個 vpoll fd 對應一份 `vpoll_data`:`wqh` 是存放等待者的 wait queue,`events` 則以 bitwise 的方式紀錄目前 pending 的 events。

```cpp
struct wait_queue_head {
    spinlock_t lock;
    struct list_head head;
};
typedef struct wait_queue_head wait_queue_head_t;
```

- `wait_queue_head_t` 為 [<linux/wait.h>](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L37) 中所定義的結構,可以看出是由一條 linked list 以及一個 spinlock 所組成

```cpp
typedef unsigned __bitwise __poll_t;
```

- `__poll_t` 在 types.h 內定義,是一個 bitwise mask;`__bitwise` 註記讓 sparse 得以檢查不同 bitmask 型別間的誤用
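原文沒有列出 `vpoll_data` 的配置與初始化流程。以下是一份假想的 open handler 草稿(`vpoll_open` 這個名稱與配置方式皆為本文假設,並非原專案的實作),用來示意 `wait_queue_head_t` 必須先經 `init_waitqueue_head` 初始化,之後 `vpoll_ioctl` 與 `vpoll_poll` 才能從 `file->private_data` 取回同一份 `vpoll_data`:

```cpp
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/wait.h>

/* 假想的 open handler:配置並初始化 vpoll_data(結構定義同上) */
static int vpoll_open(struct inode *inode, struct file *file)
{
    struct vpoll_data *vpoll_data = kzalloc(sizeof(*vpoll_data), GFP_KERNEL);
    if (!vpoll_data)
        return -ENOMEM;

    init_waitqueue_head(&vpoll_data->wqh); /* 初始化 lock 與 list head */
    vpoll_data->events = 0;                /* 尚無任何 pending event */
    file->private_data = vpoll_data;       /* 供 vpoll_ioctl / vpoll_poll 取用 */
    return 0;
}
```

對應的 release handler 則需 `kfree` 該結構,此處省略。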
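進入 `vpoll_ioctl` 之前,值得一提的是 user.c 與 module.c 必須共用同一組 command 編號。原文並未列出其定義;以下為假想的寫法(magic number 與編號僅供示意,實際數值以專案的標頭檔為準)。由於 events bitmask 直接放在 `arg` 傳遞,而非透過指標複製結構,使用不帶 payload 的 `_IO` 即可:

```cpp
#include <linux/ioctl.h>

/* 假想的 command 定義:user 與 module 共用同一份標頭 */
#define VPOLL_IOC_MAGIC '^'                        /* 假設的 magic number */
#define VPOLL_IO_ADDEVENTS _IO(VPOLL_IOC_MAGIC, 1) /* arg: 要加入的 event bitmask */
#define VPOLL_IO_DELEVENTS _IO(VPOLL_IOC_MAGIC, 2) /* arg: 要移除的 event bitmask */
```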
#### vpoll_ioctl

```cpp=
static long vpoll_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    struct vpoll_data *vpoll_data = file->private_data;
    __poll_t events = arg & EPOLLALLMASK;
    long res = 0;

    spin_lock_irq(&vpoll_data->wqh.lock);
    switch (cmd) {
    case VPOLL_IO_ADDEVENTS:
        vpoll_data->events |= events;
        break;
    case VPOLL_IO_DELEVENTS:
        vpoll_data->events &= ~events;
        break;
    default:
        res = -EINVAL;
    }
    if (res >= 0) {
        res = vpoll_data->events;
        if (waitqueue_active(&vpoll_data->wqh))
            wake_up_locked_poll(&vpoll_data->wqh, vpoll_data->events); /* WWW */
    }
    spin_unlock_irq(&vpoll_data->wqh.lock);
    return res;
}
```

> [struct file](https://elixir.bootlin.com/linux/latest/source/include/linux/fs.h#L920)

這裡先從 `file->private_data` 取出 `vpoll_data`,並持有 wait queue 中的 lock 以確保操作的一致性;接著依照不同的 `cmd`,以 bitwise 操作在 `vpoll_data->events` 中新增或刪除 events。

喚醒前先以 [waitqueue_active](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L127) 確認 wait queue 不為空,沒有等待者時便不必喚醒。

再來從 [wait.h](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h) 可以發現 wake up 的方式有 `wake_up_poll`、`wake_up_locked_poll`、`wake_up_interruptible_poll` 等多種選擇,接著看 [wait.c](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L170) 內:

```cpp
/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @wq_head: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 */
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
               int nr_exclusive, void *key)
{
    __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
EXPORT_SYMBOL(__wake_up);

/*
 * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
 */
void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr)
{
    __wake_up_common(wq_head, mode, nr, 0, NULL, NULL);
}
EXPORT_SYMBOL_GPL(__wake_up_locked);

void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
{
    __wake_up_common(wq_head, mode, 1, 0, key, NULL);
}
EXPORT_SYMBOL_GPL(__wake_up_locked_key);
```

從這邊可以發現,如果呼叫端已經持有 `wait_queue_head_t` 內的 spinlock,就必須選擇 `__wake_up_locked` 這類版本,避免對同一把 lock 上鎖兩次。

```cpp
static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
                                  int nr_exclusive, int wake_flags, void *key)
{
    unsigned long flags;
    wait_queue_entry_t bookmark;

    bookmark.flags = 0;
    bookmark.private = NULL;
    bookmark.func = NULL;
    INIT_LIST_HEAD(&bookmark.entry);

    do {
        spin_lock_irqsave(&wq_head->lock, flags);
        nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                        wake_flags, key, &bookmark);
        spin_unlock_irqrestore(&wq_head->lock, flags);
    } while (bookmark.flags & WQ_FLAG_BOOKMARK);
}
```

> [spinlock.h](https://elixir.bootlin.com/linux/latest/source/include/linux/spinlock.h#L377)
> [spinlock.c](https://elixir.bootlin.com/linux/latest/source/kernel/locking/spinlock.c#L165)

可以看到 `__wake_up_common_lock` 會先對 wait queue 的 lock 做 `spin_lock_irqsave` 再呼叫 `__wake_up_common`,而 `vpoll_ioctl` 進入 critical section 時已持有該 lock,顯然要選擇 locked 的版本。我嘗試將 WWW 處換成會自行取鎖的 API,程式就會 stuck,這正是對同一把 lock 上鎖兩次所造成的結果。
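為了讓上述觀察更具體,以下用 module.c 的 `vpoll_data` 寫一個假想的輔助函式(`vpoll_notify` 這個名稱為本文假設),示意兩種喚醒 API 與 `wqh.lock` 的關係:

```cpp
#include <linux/wait.h>

/* 假想的輔助函式:在持有 wqh.lock 的情況下喚醒等待者
 * (struct vpoll_data 的定義同前述 module.c) */
static void vpoll_notify(struct vpoll_data *vpoll_data)
{
    spin_lock_irq(&vpoll_data->wqh.lock);

    /* 正確:locked 版本假設呼叫端已持有 wqh.lock,
     * 內部直接走 __wake_up_locked_key,不會再取鎖 */
    wake_up_locked_poll(&vpoll_data->wqh, vpoll_data->events);

    /* 錯誤示範:wake_up_poll 會經 __wake_up_common_lock
     * 再對同一把 wqh.lock 做 spin_lock_irqsave,
     * 形成自我死鎖,即前文觀察到的 stuck */
    /* wake_up_poll(&vpoll_data->wqh, vpoll_data->events); */

    spin_unlock_irq(&vpoll_data->wqh.lock);
}
```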
[sched.h](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L108) 中可以看到 `TASK_NORMAL` 與 `TASK_INTERRUPTIBLE` 的差別:兩者都是表達 task 狀態的 bitmask,而 `TASK_NORMAL` 定義為 `TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE`,是為了讓 wake_up 能一次涵蓋兩種 sleeper 所設計出來的。

:::info
在 [wake_up](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L221) 的列表中可以看出,有沒有 `_poll` 的差別在於呼叫的函式是否將 `poll_to_key` 放入參數 `key` 之中,也就是把 `__poll_t` 這個 bitmask 傳給每個 wait entry 的 wakeup function;對 epoll 而言,`ep_poll_callback` 會以這個 key 過濾不相關的 events,但更細部的運作沒有搞懂。
:::

而 `wake_up_interruptible_sync_poll_locked` 可以查閱 [__wake_up_locked_sync_key](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L225):sync 代表 waker 知道自己很快就會被 scheduler 安排離開 CPU,因此被喚醒的 thread 不會被 migrate 到其他的 CPU 上。但因為我無法呼叫該函式,所以未能測試。(不清楚原因)

#### vpoll_poll

> [File Operations](https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch03s03.html)

```cpp=
static __poll_t vpoll_poll(struct file *file, struct poll_table_struct *wait)
{
    struct vpoll_data *vpoll_data = file->private_data;

    poll_wait(file, &vpoll_data->wqh, wait);
    return READ_ONCE(vpoll_data->events);
}
```

當 user 使用 poll/select/epoll 時,kernel 會呼叫 `f_op->poll`(即 `vpoll_poll`)來確認該 device 是否已經準備好做 I/O,回傳值即為 events 的 bitmask,用來表示目前的狀態。而 `READ_ONCE` 以 volatile 方式讀取,防止 compiler 將這次讀取合併、重排或拆成多次;由於 `vpoll_poll` 並未持有 `wqh.lock`,`events` 可能同時被 `vpoll_ioctl` 更動,因此需要它來確保讀取的正確。

> [poll](https://man7.org/linux/man-pages/man2/poll.2.html)
> [poll and select](http://www.makelinux.net/ldd3/?u=chp-6-sect-3.shtml)
> [poll.h](https://elixir.bootlin.com/linux/latest/source/include/linux/poll.h)

```cpp
/*
 * structures and helpers for f_op->poll implementations
 */
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);

/*
 * Do not touch the structure directly, use the access functions
 * poll_does_not_wait() and poll_requested_events() instead.
 */
typedef struct poll_table_struct {
    poll_queue_proc _qproc;
    __poll_t _key;
} poll_table;

static inline void poll_wait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p)
{
    if (p && p->_qproc && wait_address)
        p->_qproc(filp, wait_address, p);
}
```

[poll_wait](https://elixir.bootlin.com/linux/latest/source/include/linux/poll.h#L48) 本身並不會等待或檢查 I/O 是否就緒,而是透過 `poll_table` 中的 `_qproc` callback,把呼叫者註冊到該 device 的 wait queue(此處即 `vpoll_data->wqh`)上;之後 `vpoll_ioctl` 呼叫 wake_up 時,便能喚醒這些等待者。

<!-- ### Ftrace
```shell
#!/bin/bash
# trace_schedule.sh
TRACE_DIR=/sys/kernel/debug/tracing
echo > $TRACE_DIR/trace
echo $$ > $TRACE_DIR/set_ftrace_pid
echo do_epoll_ctl > $TRACE_DIR/set_ftrace_filter
echo function_graph > $TRACE_DIR/current_tracer
echo "vpoll_poll" > $TRACE_DIR/set_graph_function
exec $*
```
-->
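最後補一個使用面的觀察:`f_op->poll` 同時服務 select(2)、poll(2) 與 epoll。以下是假想的小程式,改用 poll(2) 監看同一個 vpoll fd(device 路徑 `/dev/vpoll` 為本文假設,實際以 user.c 的開檔方式為準):

```cpp
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* 假設的 device 路徑,實際依 user.c 為準 */
    int efd = open("/dev/vpoll", O_RDWR | O_CLOEXEC);
    if (efd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    /* poll(2) 在 kernel 內同樣會呼叫 f_op->poll(即 vpoll_poll):
     * 先經 poll_wait 掛上 wqh,若關心的 events 尚未就緒便睡眠,
     * 直到另一個行程以 ioctl(VPOLL_IO_ADDEVENTS, ...) 將其喚醒 */
    struct pollfd pfd = {.fd = efd, .events = POLLIN | POLLPRI};
    int n = poll(&pfd, 1, 1000); /* timeout 為 1000 ms */
    if (n < 0)
        perror("poll");
    else if (n == 0)
        printf("timeout...\n");
    else
        printf("revents = %x\n", pfd.revents);

    close(efd);
    return 0;
}
```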