# 2021q3 Homework3 (vpoll)
contributed by < `ccs100203` >
###### tags: `linux2021q3`
> [quiz4](https://hackmd.io/@sysprog/linux2021-summer-quiz4)
## How the program works
From user space, the program registers the events it cares about through epoll, then issues commands to that fd with `ioctl`. On the module side, `vpoll_poll` reports whether I/O is ready and `vpoll_ioctl` wakes up the waiting thread, so the resulting event mask finally propagates back to user space.
### user.c
> [epoll](https://man7.org/linux/man-pages/man7/epoll.7.html)
> The epoll API performs a similar task to poll(2): monitoring
> multiple file descriptors to see if I/O is possible on any of
> them. The epoll API can be used either as an edge-triggered or a
> level-triggered interface and scales well to large numbers of
> watched file descriptors.
>
> The central concept of the epoll API is the epoll instance, an
> in-kernel data structure which, from a user-space perspective,
> can be considered as a container for two lists:
>
> • The interest list (sometimes also called the epoll set): the
>   set of file descriptors that the process has registered an
>   interest in monitoring.
>
> • The ready list: the set of file descriptors that are "ready"
>   for I/O. The ready list is a subset of (or, more precisely, a
>   set of references to) the file descriptors in the interest
>   list. The ready list is dynamically populated by the kernel as
>   a result of I/O activity on those file descriptors.
```cpp
struct epoll_event ev = {
    .events =
        EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT | EPOLLHUP | EPOLLPRI,
    .data.u64 = 0,
};
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1)
    handle_error("epoll_ctl");
```
[epoll_ctl](https://man7.org/linux/man-pages/man2/epoll_ctl.2.html) adds the events listed in `ev` to the interest list for `efd`.
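The snippet above assumes `efd` and `epollfd` already exist. A minimal sketch of that setup, assuming the module exposes a device node at `/dev/vpoll` (the path is an assumption, not something shown in the quiz):
```cpp
#include <fcntl.h>
#include <sys/epoll.h>

/* Hypothetical setup: the device path is an assumption. */
int efd = open("/dev/vpoll", O_RDWR); /* fd backed by the vpoll module */
if (efd == -1)
    handle_error("open");

int epollfd = epoll_create1(0);       /* the epoll instance itself */
if (epollfd == -1)
    handle_error("epoll_create1");
```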
```cpp
switch (fork()) {
case 0:
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN);
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN);
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN | EPOLLPRI);
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLPRI);
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLOUT);
    sleep(1);
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLHUP);
    exit(EXIT_SUCCESS);
```
After `fork`, the child process issues a sequence of calls such as `ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN);`; each one lands in the module's `vpoll_ioctl`, which raises the corresponding events.
> [ioctl](https://man7.org/linux/man-pages/man2/ioctl.2.html)
```cpp
default:
    while (1) {
        int nfds = epoll_wait(epollfd, &ev, 1, 1000);
        if (nfds < 0)
            handle_error("epoll_wait");
        else if (nfds == 0)
            printf("timeout...\n");
        else {
            printf("GOT event %x\n", ev.events);
            ioctl(efd, VPOLL_IO_DELEVENTS, ev.events);
            if (ev.events & EPOLLHUP)
                break;
        }
    }
    break;
case -1: /* should not happen */
    handle_error("fork");
}
```
The parent process uses [epoll_wait](https://man7.org/linux/man-pages/man2/epoll_wait.2.html) to fetch events from the ready list. `nfds` is the number of events returned; because `maxevents` is set to 1, at most one event is fetched per call. The `else` block prints the event and then issues a `VPOLL_IO_DELEVENTS` command through `ioctl` to clear it again.
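Putting the two processes together, a run should print something close to the following; the exact interleaving of `timeout...` lines is timing-dependent, since the child's `sleep(1)` races against the 1000 ms `epoll_wait` timeout:
```
GOT event 1
GOT event 1
GOT event 3
GOT event 2
GOT event 4
GOT event 10
```
The printed values are the hexadecimal event masks: `EPOLLIN` is 0x1, `EPOLLPRI` 0x2, `EPOLLOUT` 0x4, and `EPOLLHUP` 0x10.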
### module.c
#### vpoll_data
```cpp
struct vpoll_data {
    wait_queue_head_t wqh;
    __poll_t events;
};
```
This is the module's per-file state: a wait queue holding the blocked waiters, plus the currently raised events recorded as a bitmask.
```cpp
struct wait_queue_head {
    spinlock_t lock;
    struct list_head head;
};
typedef struct wait_queue_head wait_queue_head_t;
```
- `wait_queue_head_t` is the structure defined in [<linux/wait.h>](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L37): simply a linked list protected by a spinlock.
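Each node on that list is a `wait_queue_entry` (also defined in [<linux/wait.h>](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h)), carrying the callback that the wake-up path invokes:
```cpp
struct wait_queue_entry {
    unsigned int flags;
    void *private;
    wait_queue_func_t func; /* called by __wake_up_common() */
    struct list_head entry;
};
```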
```cpp
typedef unsigned __bitwise __poll_t;
```
- `__poll_t` is defined in [types.h](https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/types.h); it is an integer used purely as a bitmask (the `__bitwise` annotation lets sparse flag any other use).
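The `EPOLL*` constants that populate this mask are defined in [include/uapi/linux/eventpoll.h](https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/eventpoll.h), each cast to `__poll_t`:
```cpp
#define EPOLLIN   (__force __poll_t)0x00000001
#define EPOLLPRI  (__force __poll_t)0x00000002
#define EPOLLOUT  (__force __poll_t)0x00000004
#define EPOLLERR  (__force __poll_t)0x00000008
#define EPOLLHUP  (__force __poll_t)0x00000010
```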
#### vpoll_ioctl
```cpp=
static long vpoll_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    struct vpoll_data *vpoll_data = file->private_data;
    __poll_t events = arg & EPOLLALLMASK;
    long res = 0;
    spin_lock_irq(&vpoll_data->wqh.lock);
    switch (cmd) {
    case VPOLL_IO_ADDEVENTS:
        vpoll_data->events |= events;
        break;
    case VPOLL_IO_DELEVENTS:
        vpoll_data->events &= ~events;
        break;
    default:
        res = -EINVAL;
    }
    if (res >= 0) {
        res = vpoll_data->events;
        if (waitqueue_active(&vpoll_data->wqh))
            wake_up_locked_poll(&vpoll_data->wqh, vpoll_data->events); /* WWW */
    }
    spin_unlock_irq(&vpoll_data->wqh.lock);
    return res;
}
```
> [struct file](https://elixir.bootlin.com/linux/latest/source/include/linux/fs.h#L920)
Here the `vpoll_data` instance is recovered from `file->private_data`, and the spinlock embedded in the wait queue head guarantees consistency while it is manipulated. Depending on `cmd`, the requested bits are either ORed into or cleared from `vpoll_data->events`.
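The quiz does not show how the two command numbers are defined; a plausible sketch using the standard `_IO` macro (the magic character `'v'` and the ordinals are pure assumptions, not taken from the real vpoll sources) would be:
```cpp
#include <linux/ioctl.h>

/* Hypothetical command numbers; for illustration only. */
#define VPOLL_IO_ADDEVENTS _IO('v', 1)
#define VPOLL_IO_DELEVENTS _IO('v', 2)
```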
Before waking anyone, [waitqueue_active](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L127) checks that the wait queue is not empty, so the wake-up is only issued when a waiter actually exists.
As for the wake-up itself, [wait.h](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h) offers several flavours: `wake_up_poll`, `wake_up_locked_poll`, `wake_up_interruptible_poll`, and so on. Looking at [wait.c](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L170):
```cpp
/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @wq_head: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 */
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
               int nr_exclusive, void *key)
{
    __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
EXPORT_SYMBOL(__wake_up);

/*
 * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
 */
void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr)
{
    __wake_up_common(wq_head, mode, nr, 0, NULL, NULL);
}
EXPORT_SYMBOL_GPL(__wake_up_locked);

void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
{
    __wake_up_common(wq_head, mode, 1, 0, key, NULL);
}
EXPORT_SYMBOL_GPL(__wake_up_locked_key);
```
This shows that when the spinlock inside the `wait_queue_head_t` is already held, the `*_locked` variants such as `__wake_up_locked` must be chosen, to avoid acquiring the same lock twice.
```cpp
static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
                                  int nr_exclusive, int wake_flags, void *key)
{
    unsigned long flags;
    wait_queue_entry_t bookmark;

    bookmark.flags = 0;
    bookmark.private = NULL;
    bookmark.func = NULL;
    INIT_LIST_HEAD(&bookmark.entry);

    do {
        spin_lock_irqsave(&wq_head->lock, flags);
        nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                        wake_flags, key, &bookmark);
        spin_unlock_irqrestore(&wq_head->lock, flags);
    } while (bookmark.flags & WQ_FLAG_BOOKMARK);
}
```
> [spinlock.h](https://elixir.bootlin.com/linux/latest/source/include/linux/spinlock.h#L377)
> [spinlock.c](https://elixir.bootlin.com/linux/latest/source/kernel/locking/spinlock.c#L165)
`__wake_up_common_lock` first takes the spinlock on the wait queue and only then calls `__wake_up_common`, so the locked variant is clearly the right choice here: when I replaced the call marked `/* WWW */` with one of the non-locked APIs, the program got stuck, exactly the self-deadlock caused by taking the lock twice.
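A minimal sketch of the difference (modelled on `vpoll_ioctl`; `events` stands for the mask being reported):
```cpp
spin_lock_irq(&vpoll_data->wqh.lock);
/* OK: the _locked variant expects wqh.lock to be held already. */
wake_up_locked_poll(&vpoll_data->wqh, events);
/* wake_up_poll(&vpoll_data->wqh, events);
 * would go through __wake_up_common_lock(), try to take wqh.lock
 * a second time, and spin forever on this CPU. */
spin_unlock_irq(&vpoll_data->wqh.lock);
```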
> [sched.h](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L108)
This explains the difference between `TASK_NORMAL` and `TASK_INTERRUPTIBLE`: both are bitmasks describing a task's state, and `TASK_NORMAL` is simply the union of the two sleeping states, defined so that `wake_up` can wake either kind of sleeper.
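The relevant definitions, excerpted from [include/linux/sched.h](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L108):
```cpp
#define TASK_RUNNING         0x0000
#define TASK_INTERRUPTIBLE   0x0001
#define TASK_UNINTERRUPTIBLE 0x0002
/* ... */
#define TASK_NORMAL          (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
```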
:::info
In the list of [wake_up](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L221) macros, the `_poll` variants differ from the plain ones only in passing `poll_to_key(m)` as the `key` argument, i.e. they forward the `__poll_t` bitmask to the wakeup function (see the excerpt below); I have not fully traced how that key is consumed afterwards.
:::
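The macros in question, from [include/linux/wait.h](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h):
```cpp
#define wake_up_poll(x, m) \
    __wake_up(x, TASK_NORMAL, 1, poll_to_key(m))
#define wake_up_locked_poll(x, m) \
    __wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))
#define wake_up_interruptible_poll(x, m) \
    __wake_up(x, TASK_INTERRUPTIBLE, 1, poll_to_key(m))
```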
As for `wake_up_interruptible_sync_poll_locked`, see [__wake_up_locked_sync_key](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L225): the `sync` hint means the waker knows it will schedule away very soon, so the woken thread need not be migrated to another CPU.
I was unable to call this function from the module, however, so I could not test it. (Reason unclear.)
#### vpoll_poll
> [File Operations](https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch03s03.html)
```cpp=
static __poll_t vpoll_poll(struct file *file, struct poll_table_struct *wait)
{
    struct vpoll_data *vpoll_data = file->private_data;
    poll_wait(file, &vpoll_data->wqh, wait);
    return READ_ONCE(vpoll_data->events);
}
```
The kernel calls `vpoll_poll` to ask whether this device is ready for I/O; the return value is the bitmask of currently raised events.
`READ_ONCE` prevents the compiler from tearing, merging, or reordering this load of `events`, which `vpoll_ioctl` may be updating concurrently.
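For scalar types, `READ_ONCE` essentially boils down to a volatile read; a simplified sketch (not the kernel's exact definition, which also handles non-scalar sizes):
```cpp
/* Simplified: force exactly one load that the compiler cannot optimize away. */
#define READ_ONCE_SKETCH(x) (*(const volatile typeof(x) *) &(x))
```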
> [poll](https://man7.org/linux/man-pages/man2/poll.2.html)
> [poll and select](http://www.makelinux.net/ldd3/?u=chp-6-sect-3.shtml)
> [poll.h](https://elixir.bootlin.com/linux/latest/source/include/linux/poll.h)
```cpp
/*
 * structures and helpers for f_op->poll implementations
 */
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);

/*
 * Do not touch the structure directly, use the access functions
 * poll_does_not_wait() and poll_requested_events() instead.
 */
typedef struct poll_table_struct {
    poll_queue_proc _qproc;
    __poll_t _key;
} poll_table;

static inline void poll_wait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p)
{
    if (p && p->_qproc && wait_address)
        p->_qproc(filp, wait_address, p);
}
```
[poll_wait](https://elixir.bootlin.com/linux/latest/source/include/linux/poll.h#L48) does not block by itself: it hands the driver's wait queue to the `poll_table`'s `_qproc` callback, which registers the calling thread on that queue so it can be woken later, i.e. by the `wake_up_locked_poll` in `vpoll_ioctl`.
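The NULL checks inside `poll_wait` cover callers that only want the current event mask and do not intend to sleep; poll.h exposes the same condition as a helper:
```cpp
static inline bool poll_does_not_wait(const poll_table *p)
{
    return p == NULL || p->_qproc == NULL;
}
```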
<!-- ### Ftrace
```shell
#!/bin/bash
# trace_schedule.sh
TRACE_DIR=/sys/kernel/debug/tracing
echo > $TRACE_DIR/trace
echo $$ > $TRACE_DIR/set_ftrace_pid
echo do_epoll_ctl > $TRACE_DIR/set_ftrace_filter
echo function_graph > $TRACE_DIR/current_tracer
echo "vpoll_poll" > $TRACE_DIR/set_graph_function
exec $*
``` -->