# 2021q3 Homework3 (vpoll)

contributed by < [`linD026`](https://github.com/linD026) >

###### tags: `linux2021`

> [2021 summer term, week 4 quiz](https://hackmd.io/@sysprog/linux2021-summer-quiz4)

---

## vpoll

### How the code works

```cpp
#define VPOLL_IOC_MAGIC '^'
#define VPOLL_IO_ADDEVENTS _IO(VPOLL_IOC_MAGIC, 1 /*MMM*/)
#define VPOLL_IO_DELEVENTS _IO(VPOLL_IOC_MAGIC, 2 /*NNN*/)
#define EPOLLALLMASK ((__force __poll_t) 0x0fffffff)
```

> [/include/uapi/linux/eventpoll.h](https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/eventpoll.h)

* [/include/uapi/asm-generic/ioctl.h](https://elixir.bootlin.com/linux/v5.13.12/source/include/uapi/asm-generic/ioctl.h)

```cpp
/* ioctl command encoding: 32 bits total, command in lower 16 bits,
 * size of the parameter structure in the lower 14 bits of the
 * upper 16 bits.
 * Encoding the size of the parameter structure in the ioctl request
 * is useful for catching programs compiled with old versions
 * and to avoid overwriting user space outside the user buffer area.
 * The highest 2 bits are reserved for indicating the ``access mode''.
 * NOTE: This limits the max parameter size to 16kB -1 !
 */

/*
 * The following is for compatibility across the various Linux
 * platforms.  The generic ioctl numbering scheme doesn't really enforce
 * a type field.  De facto, however, the top 8 bits of the lower 16
 * bits are indeed used as a type field, so we might just as well make
 * this explicit here.  Please be sure to use the decoding macros
 * below from now on.
 */
#define _IOC_NRBITS     8
#define _IOC_TYPEBITS   8

/*
 * Let any architecture override either of the following before
 * including this file.
 */
#ifndef _IOC_SIZEBITS
# define _IOC_SIZEBITS  14
#endif

#ifndef _IOC_DIRBITS
# define _IOC_DIRBITS   2
#endif

#define _IOC_NRSHIFT    0
#define _IOC_TYPESHIFT  (_IOC_NRSHIFT+_IOC_NRBITS)
#define _IOC_SIZESHIFT  (_IOC_TYPESHIFT+_IOC_TYPEBITS)
#define _IOC_DIRSHIFT   (_IOC_SIZESHIFT+_IOC_SIZEBITS)

/*
 * Direction bits, which any architecture can choose to override
 * before including this file.
 *
 * NOTE: _IOC_WRITE means userland is writing and kernel is
 * reading. _IOC_READ means userland is reading and kernel is writing.
 */
#ifndef _IOC_NONE
# define _IOC_NONE      0U
#endif

#define _IOC(dir,type,nr,size) \
    (((dir)  << _IOC_DIRSHIFT) | \
     ((type) << _IOC_TYPESHIFT) | \
     ((nr)   << _IOC_NRSHIFT) | \
     ((size) << _IOC_SIZESHIFT))

/*
 * Used to create numbers.
 *
 * NOTE: _IOW means userland is writing and kernel is reading. _IOR
 * means userland is reading and kernel is writing.
 */
#define _IO(type,nr)        _IOC(_IOC_NONE,(type),(nr),0)
```

After encoding, an ioctl command is a 32-bit value laid out (from the most significant bits) as `[DIR|SIZE|TYPE|NR]` = `[2|14|8|8]` bits.
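As a quick check of this layout, the request numbers can be decoded from user space with the `_IOC_*` helper macros that `<sys/ioctl.h>` pulls in. This is a standalone sketch, not part of the homework code; it only assumes the two `VPOLL_IO_*` definitions quoted above:

```cpp
#include <stdio.h>
#include <sys/ioctl.h>

#define VPOLL_IOC_MAGIC '^'
#define VPOLL_IO_ADDEVENTS _IO(VPOLL_IOC_MAGIC, 1)

int main(void)
{
    unsigned int cmd = VPOLL_IO_ADDEVENTS;

    /* dir = _IOC_NONE (0), size = 0, type = '^' (0x5e), nr = 1,
     * so the encoded command is 0x5e01. */
    printf("cmd = %#x dir=%u size=%u type=%c nr=%u\n", cmd,
           _IOC_DIR(cmd), _IOC_SIZE(cmd), _IOC_TYPE(cmd), _IOC_NR(cmd));
    return 0;
}
```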
* [root/Documentation/dev-tools/sparse.rst](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/dev-tools/sparse.rst)

```cpp
typedef unsigned __bitwise __poll_t;
```

> **Sparse** is a semantic checker for C programs; it can be used to find a number of potential problems with kernel code. See https://lwn.net/Articles/689907/ for an overview of sparse; this document contains some kernel-specific sparse information. More information on sparse, mainly about its internals, can be found in its official pages at https://sparse.docs.kernel.org.

> Two such extensions are related to the type system and have been discussed previously in these pages: `bitwise`, which creates a "new" type (in the Ada sense) that is identical to some other integer type except that it is incompatible with it, and `address_spaces`, which provide similar functionality for pointers. **`bitwise` can be used to avoid confusing big-endian and little-endian values, or to avoid accidentally using bitmasks on the wrong variable.** The most obvious use of address spaces in the Linux kernel are to distinguish user-space pointers from kernel-space pointers, though there are other uses.

> "__bitwise" is a type attribute, so you have to do something like this::
>
> ```cpp
> typedef int __bitwise pm_request_t;
>
> enum pm_request {
>     PM_SUSPEND = (__force pm_request_t) 1,
>     PM_RESUME = (__force pm_request_t) 2
> };
> ```
>
> which makes `PM_SUSPEND` and `PM_RESUME` "bitwise" integers (the "__force" is
> there because sparse will complain about casting to/from a bitwise type,
> but in this case we really _do_ want to force the conversion). And because
> the enum values are all the same type, now "enum pm_request" will be that
> type too.
>
> And with gcc, all the "__bitwise"/"__force stuff" goes away, and it all
> ends up looking just like integers to gcc.
>
> Quite frankly, you don't need the enum there. The above all really just
> boils down to one special "int __bitwise" type.
>
> So the simpler way is to just do::
>
> ```cpp
> typedef int __bitwise pm_request_t;
>
> #define PM_SUSPEND ((__force pm_request_t) 1)
> #define PM_RESUME ((__force pm_request_t) 2)
> ```
>
> and you now have all the infrastructure needed for strict typechecking.
>
> One small note: the constant integer "0" is special. You can use a
> constant zero as a bitwise integer type without sparse ever complaining.
> This is because "bitwise" (as the name implies) was designed for making
> sure that bitwise types don't get mixed up (little-endian vs big-endian
> vs cpu-endian vs whatever), and there the constant "0" really _is_
> special.

:::success
**[GCC Bugzilla – Bug 59852 - Support sparse-style __attribute__((bitwise)) (type attribute)](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59852)**

> The sparse static C language checker contains a type attribute extension:
>
> `__attribute__((bitwise))`
>
> The bitwise attribute modifies an arithmetic type so that the only arithmetic options permitted are the ones that are strictly bitwise. This is primarily used for data items with a specific endianness, such as the "network" side of the `htonX()` and `ntohX()` functions.
>
> The sparse documentation describes this as:
>
> Warn about unsupported operations or type mismatches with restricted integer types.
>
> Sparse supports an extended attribute, `__attribute__((bitwise))`, which creates a new restricted integer type from a base integer type, distinct from the base integer type and from any other restricted integer type not declared in the same declaration or typedef. For example, this allows programs to create typedefs for integer types with specific endianness. With -Wbitwise, Sparse will warn on any use of a restricted type in arithmetic operations other than bitwise operations, and on any conversion of one restricted type into another, except via a cast that includes `__attribute__((force))`.
>
> `__bitwise` ends up being a "stronger integer separation". That one doesn't allow you to mix with non-bitwise integers, so now it's much harder to lose the type by mistake.
>
> `__bitwise` is for *unique types* that cannot be mixed with other types, and that you'd never want to just use as a random integer (the integer 0 is special, though, and gets silently accepted iirc - it's kind of like "NULL" for pointers). So "gfp_t" or the "safe endianness" types would be `__bitwise`: you can only operate on them by doing specific operations that know about *that* particular type.
>
> Generally, you want bitwise if you are looking for type safety. Sparse does not issue these warnings by default.
:::

```cpp
static __poll_t vpoll_poll(struct file *file, struct poll_table_struct *wait)
{
    struct vpoll_data *vpoll_data = file->private_data;

    poll_wait(file, &vpoll_data->wqh, wait);
    return READ_ONCE(vpoll_data->events);
}
```

* [poll(2) — Linux manual page](https://man7.org/linux/man-pages/man2/poll.2.html)

> `poll()` performs a similar task to `select(2)`: **it waits for one of a set of file descriptors to become ready to perform I/O.** The Linux-specific `epoll(7)` API performs a similar task, but offers features beyond those found in `poll()`.
>
> **POLLIN**
> There is data to read.
>
> **POLLHUP**
> Hang up (only returned in `revents`; ignored in `events`). Note that when reading from a channel such as a pipe or a stream socket, this event merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return `0` (end of file) only after all outstanding data in the channel has been consumed.
:::success
**[lwn.net - A new kernel polling interface](https://lwn.net/Articles/743714/)**

Internally to the kernel, any device driver (or other subsystem that exports a `file_operations` structure) can support the new poll interface, but some small changes will be required. It is not, however, necessary to support (or even know about) `AIO` in general. In current kernels, the polling system calls are all supported by the `poll()` method in `struct file_operations`:

```cpp
int (*poll) (struct file *file, struct poll_table_struct *table);
```

This function must perform two actions: **setting up notifications for when the underlying file is ready for I/O, and returning the types of I/O that could be performed without blocking now.** The first is done by adding one or more wait queues to the provided table; the driver will perform a wakeup call on one of those queues when the state of the device changes. The current readiness state is the return value from the `poll()` method itself.
:::

* [/include/linux/poll.h](https://elixir.bootlin.com/linux/v5.13.12/source/include/linux/poll.h#L48)

```cpp
/*
 * structures and helpers for f_op->poll implementations
 */
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);

/*
 * Do not touch the structure directly, use the access functions
 * poll_does_not_wait() and poll_requested_events() instead.
 */
typedef struct poll_table_struct {
    poll_queue_proc _qproc;
    __poll_t _key;
} poll_table;

static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
    if (p && p->_qproc && wait_address)
        p->_qproc(filp, wait_address, p);
}
```

* [/fs/aio.c](https://elixir.bootlin.com/linux/v5.13.12/source/fs/aio.c#L1719)

```cpp
static void
aio_poll_queue_proc(struct file *file, struct wait_queue_head *head,
        struct poll_table_struct *p)
{
    struct aio_poll_table *pt = container_of(p, struct aio_poll_table, pt);

    /* multiple wait queues per file are not supported */
    if (unlikely(pt->iocb->poll.head)) {
        pt->error = -EINVAL;
        return;
    }

    pt->error = 0;
    pt->iocb->poll.head = head;
    add_wait_queue(head, &pt->iocb->poll.wait);
}
```

---

```cpp
static char *vpoll_devnode(struct device *dev, umode_t *mode)
{
    if (!mode)
        return NULL;
    *mode = 0666;
    return NULL;
}
```

---

```cpp
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1)
```

* [epoll_ctl(2) — Linux manual page](https://man7.org/linux/man-pages/man2/epoll_ctl.2.html)

> This system call is used to add, modify, or remove entries in the interest list of the epoll(7) instance referred to by the file descriptor epfd. It requests that the operation op be performed for the target file descriptor, fd.

```cpp=
static long vpoll_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    struct vpoll_data *vpoll_data = file->private_data;
    __poll_t events = arg & EPOLLALLMASK;
    long res = 0;

    spin_lock_irq(&vpoll_data->wqh.lock);
    switch (cmd) {
    case VPOLL_IO_ADDEVENTS:
        vpoll_data->events |= events;
        break;
    case VPOLL_IO_DELEVENTS:
        vpoll_data->events &= ~events;
        break;
    default:
        res = -EINVAL;
    }
    if (res >= 0) {
        res = vpoll_data->events;
        if (waitqueue_active(&vpoll_data->wqh))
            /*WWW*/ wake_up_locked_poll(&vpoll_data->wqh, vpoll_data->events);
    }
    spin_unlock_irq(&vpoll_data->wqh.lock);
    return res;
}
```

> `ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN);`

On line 4, `arg` (for example `EPOLLIN`) is masked with `EPOLLALLMASK` into `events`; then, depending on `cmd`, those events are added to or removed from `vpoll_data->events`.

```graphviz
digraph main {
    node [shape = box]
    rankdir = LR
    vi [label = "vpoll->ioctl"]
    vp [label = "vpoll->poll"]
    vi -> vp [label = "wake up"]
    ee [label = "epoll->event"]
    ew [label = "epoll_wait"]
    vp -> ee [label = "event return"]
    subgraph cluster_ec {
        label = "epoll_ctl"
        ee
        vp
    }
    ee -> ew
}
```

:::info
* [/fs/eventpoll.c - do_epoll_wait](https://elixir.bootlin.com/linux/latest/source/fs/eventpoll.c#L2191)
* [/fs/eventpoll.c - ep_poll](https://elixir.bootlin.com/linux/latest/source/fs/eventpoll.c#L1759)
> Retrieves ready events, and delivers them to the caller-supplied event buffer.
:::
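To connect the boxes in the diagram, here is a minimal single-process user-space sketch of the whole flow (assuming the module is loaded and exposes `/dev/vpoll`; error handling omitted, not the benchmark code shown later):

```cpp
#include <fcntl.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define VPOLL_IOC_MAGIC '^'
#define VPOLL_IO_ADDEVENTS _IO(VPOLL_IOC_MAGIC, 1)
#define VPOLL_IO_DELEVENTS _IO(VPOLL_IOC_MAGIC, 2)

int main(void)
{
    int efd = open("/dev/vpoll", O_RDWR | O_CLOEXEC);
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLHUP };

    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);    /* register with epoll          */
    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN);     /* vpoll_ioctl() -> wake up wqh */

    /* vpoll_poll() reports the pending EPOLLIN, so epoll_wait() returns at once. */
    int n = epoll_wait(epfd, &ev, 1, 1000);
    if (n > 0)
        printf("got events 0x%x\n", ev.events);

    ioctl(efd, VPOLL_IO_DELEVENTS, EPOLLIN);     /* clear the event again */
    close(epfd);
    close(efd);
    return 0;
}
```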
---

### wake up poll

* [/include/linux/wait.h](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L41)

```cpp
#define wake_up_locked_poll(x, m)                       \
    __wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))
```

* [/kernel/sched/wait.c](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L68)

```cpp
void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
{
    __wake_up_common(wq_head, mode, 1, 0, key, NULL);
}
```

```cpp
/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake that number of exclusive tasks, and potentially all
 * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
 * the list and any non-exclusive tasks will be woken first. A priority task
 * may be at the head of the list, and can consume the event without any other
 * tasks being woken.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
            int nr_exclusive, int wake_flags, void *key,
            wait_queue_entry_t *bookmark)
{
    wait_queue_entry_t *curr, *next;
    int cnt = 0;

    lockdep_assert_held(&wq_head->lock);

    if (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
        curr = list_next_entry(bookmark, entry);

        list_del(&bookmark->entry);
        bookmark->flags = 0;
    } else
        curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);

    if (&curr->entry == &wq_head->head)
        return nr_exclusive;

    list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
        unsigned flags = curr->flags;
        int ret;

        if (flags & WQ_FLAG_BOOKMARK)
            continue;

        ret = curr->func(curr, mode, wake_flags, key);
        if (ret < 0)
            break;
        if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;

        if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
                (&next->entry != &wq_head->head)) {
            bookmark->flags = WQ_FLAG_BOOKMARK;
            list_add_tail(&bookmark->entry, &next->entry);
            break;
        }
    }

    return nr_exclusive;
}
```

`lockdep_assert_held` checks whether the lock is held by the currently running task (`current`).

* [`spin_lock_irq`](https://elixir.bootlin.com/linux/latest/source/include/linux/spinlock_api_smp.h#L124)
* [`__lock_acquire`](https://elixir.bootlin.com/linux/latest/source/kernel/locking/lockdep.c#L4872)
* [`lockdep_assert_held`](https://elixir.bootlin.com/linux/latest/source/include/linux/lockdep.h#L309)

```cpp
#define lockdep_assert_held(l) do {                 \
        WARN_ON(debug_locks &&                      \
            lockdep_is_held(l) == LOCK_STATE_NOT_HELD); \
    } while (0)
```

* [`__lock_is_held`](https://elixir.bootlin.com/linux/latest/source/kernel/locking/lockdep.c#L5363)

```cpp
static __always_inline
int __lock_is_held(const struct lockdep_map *lock, int read)
{
    struct task_struct *curr = current;
    int i;

    for (i = 0; i < curr->lockdep_depth; i++) {
        struct held_lock *hlock = curr->held_locks + i;

        if (match_held_lock(hlock, lock)) {
            if (read == -1 || hlock->read == read)
                return LOCK_STATE_HELD;

            return LOCK_STATE_NOT_HELD;
        }
    }

    return LOCK_STATE_NOT_HELD;
}
```

:::success
**TODO: trace `lockdep_assert_held`**
:::

In `__wake_up_common`, `curr->func` is `default_wake_function`, which is assigned when the wait queue entry is initialized:

* [/include/linux/wait.h](https://elixir.bootlin.com/linux/v5.13.12/source/include/linux/wait.h#L54)

```cpp
#define __WAITQUEUE_INITIALIZER(name, tsk) {        \
    .private    = tsk,                              \
    .func       = default_wake_function,            \
    .entry      = { NULL, NULL } }

#define DECLARE_WAITQUEUE(name, tsk)                \
    struct wait_queue_entry name = __WAITQUEUE_INITIALIZER(name, tsk)
```

* [/kernel/sched/core.c](https://elixir.bootlin.com/linux/latest/source/kernel/sched/core.c#L5545)

```cpp
int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
              void *key)
{
    WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
    return try_to_wake_up(curr->private, mode, wake_flags);
}
```
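For context, the entries that `__wake_up_common()` walks are normally added by a sleeper. A minimal sketch of that wait side (not part of vpoll; it only assumes the `struct vpoll_data` fields used throughout this note) shows where `default_wake_function` comes from: `DECLARE_WAITQUEUE()` stores it in `->func`, which is exactly what `curr->func(...)` above invokes.

```cpp
#include <linux/sched.h>
#include <linux/wait.h>

/* Sketch of a sleeper on vpoll_data->wqh; illustrative only. */
static void wait_for_vpoll_events(struct vpoll_data *vpoll_data)
{
    DECLARE_WAITQUEUE(wait, current);       /* ->func = default_wake_function */

    add_wait_queue(&vpoll_data->wqh, &wait);
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        if (READ_ONCE(vpoll_data->events))  /* CONDITION */
            break;
        schedule();                         /* sleep until a wakeup runs ->func */
    }
    __set_current_state(TASK_RUNNING);
    remove_wait_queue(&vpoll_data->wqh, &wait);
}
```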
### Program Order on SMP systems

:::success
**Further reading**
* [kernel.org - Linux Scheduler](https://www.kernel.org/doc/html/latest/scheduler/index.html)
* [Concurrent programming: atomic operations - memory ordering and barriers](https://hackmd.io/@sysprog/concurrency-atomics)
* [Linux CFS and task group](https://mechpen.github.io/posts/2020-04-27-cfs-group/index.html)
:::

```cpp
/*
 * Notes on Program-Order guarantees on SMP systems.
 *
 *  MIGRATION
 *
 * The basic program-order guarantee on SMP systems is that when a task [t]
 * migrates, all its activity on its old CPU [c0] happens-before any subsequent
 * execution on its new CPU [c1].
 *
 * For migration (of runnable tasks) this is provided by the following means:
 *
 *  A) UNLOCK of the rq(c0)->lock scheduling out task t
 *  B) migration for t is required to synchronize *both* rq(c0)->lock and
 *     rq(c1)->lock (if not at the same time, then in that order).
 *  C) LOCK of the rq(c1)->lock scheduling in task
 *
 * Release/acquire chaining guarantees that B happens after A and C after B.
 * Note: the CPU doing B need not be c0 or c1
 *
 * Example:
 *
 *   CPU0            CPU1            CPU2
 *
 *   LOCK rq(0)->lock
 *   sched-out X
 *   sched-in Y
 *   UNLOCK rq(0)->lock
 *
 *                                   LOCK rq(0)->lock // orders against CPU0
 *                                   dequeue X
 *                                   UNLOCK rq(0)->lock
 *
 *                                   LOCK rq(1)->lock
 *                                   enqueue X
 *                                   UNLOCK rq(1)->lock
 *
 *                   LOCK rq(1)->lock // orders against CPU2
 *                   sched-out Z
 *                   sched-in X
 *                   UNLOCK rq(1)->lock
 *
 *
 *  BLOCKING -- aka. SLEEP + WAKEUP
 *
 * For blocking we (obviously) need to provide the same guarantee as for
 * migration. However the means are completely different as there is no lock
 * chain to provide order. Instead we do:
 *
 *   1) smp_store_release(X->on_cpu, 0)   -- finish_task()
 *   2) smp_cond_load_acquire(!X->on_cpu) -- try_to_wake_up()
 *
 * Example:
 *
 *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
 *
 *   LOCK rq(0)->lock LOCK X->pi_lock
 *   dequeue X
 *   sched-out X
 *   smp_store_release(X->on_cpu, 0);
 *
 *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
 *                    X->state = WAKING
 *                    set_task_cpu(X,2)
 *
 *                    LOCK rq(2)->lock
 *                    enqueue X
 *                    X->state = RUNNING
 *                    UNLOCK rq(2)->lock
 *
 *                                          LOCK rq(2)->lock // orders against CPU1
 *                                          sched-out Z
 *                                          sched-in X
 *                                          UNLOCK rq(2)->lock
 *
 *                    UNLOCK X->pi_lock
 *   UNLOCK rq(0)->lock
 *
 *
 * However, for wakeups there is a second guarantee we must provide, namely we
 * must ensure that CONDITION=1 done by the caller can not be reordered with
 * accesses to the task state; see try_to_wake_up() and set_current_state().
 */
```

* [`try_to_wake_up`](https://elixir.bootlin.com/linux/latest/source/kernel/sched/core.c#L3332)
  * the call to `ttwu_queue_wakelist()` inside `try_to_wake_up()` is where the task is enqueued
```cpp=
/**
 * try_to_wake_up - wake up a thread
 * @p: the thread to be awakened
 * @state: the mask of task states that can be woken
 * @wake_flags: wake modifier flags (WF_*)
 *
 * Conceptually does:
 *
 *   If (@state & @p->state) @p->state = TASK_RUNNING.
 *
 * If the task was not queued/runnable, also place it back on a runqueue.
 *
 * This function is atomic against schedule() which would dequeue the task.
 *
 * It issues a full memory barrier before accessing @p->state, see the comment
 * with set_current_state().
 *
 * Uses p->pi_lock to serialize against concurrent wake-ups.
 *
 * Relies on p->pi_lock stabilizing:
 *   - p->sched_class
 *   - p->cpus_ptr
 *   - p->sched_task_group
 * in order to do migration, see its use of select_task_rq()/set_task_cpu().
 *
 * Tries really hard to only take one task_rq(p)->lock for performance.
 * Takes rq->lock in:
 *   - ttwu_runnable()    -- old rq, unavoidable, see comment there;
 *   - ttwu_queue()       -- new rq, for enqueue of the task;
 *   - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
 *
 * As a consequence we race really badly with just about everything. See the
 * many memory barriers and their comments for details.
 *
 * Return: %true if @p->state changes (an actual wakeup was done),
 *         %false otherwise.
 */
static int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
    unsigned long flags;
    int cpu, success = 0;

    preempt_disable();
    if (p == current) {
        /*
         * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
         * == smp_processor_id()'. Together this means we can special
         * case the whole 'p->on_rq && ttwu_runnable()' case below
         * without taking any locks.
         *
         * In particular:
         *  - we rely on Program-Order guarantees for all the ordering,
         *  - we're serialized against set_special_state() by virtue of
         *    it disabling IRQs (this allows not taking ->pi_lock).
         */
        if (!(p->state & state))
            goto out;

        success = 1;
        trace_sched_waking(p);
        p->state = TASK_RUNNING;
        trace_sched_wakeup(p);
        goto out;
    }

    /*
     * If we are going to wake up a thread waiting for CONDITION we
     * need to ensure that CONDITION=1 done by the caller can not be
     * reordered with p->state check below. This pairs with smp_store_mb()
     * in set_current_state() that the waiting thread does.
     */
    raw_spin_lock_irqsave(&p->pi_lock, flags);
    smp_mb__after_spinlock();
    if (!(p->state & state))
        goto unlock;

    trace_sched_waking(p);

    /* We're going to change ->state: */
    success = 1;

    /*
     * Ensure we load p->on_rq _after_ p->state, otherwise it would
     * be possible to, falsely, observe p->on_rq == 0 and get stuck
     * in smp_cond_load_acquire() below.
     *
     * sched_ttwu_pending()                 try_to_wake_up()
     *   STORE p->on_rq = 1                   LOAD p->state
     *   UNLOCK rq->lock
     *
     * __schedule() (switch to task 'p')
     *   LOCK rq->lock                        smp_rmb();
     *   smp_mb__after_spinlock();
     *   UNLOCK rq->lock
     *
     * [task p]
     *   STORE p->state = UNINTERRUPTIBLE     LOAD p->on_rq
     *
     * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
     * __schedule(). See the comment for smp_mb__after_spinlock().
     *
     * A similar smb_rmb() lives in try_invoke_on_locked_down_task().
     */
    smp_rmb();
    if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
        goto unlock;

#ifdef CONFIG_SMP
    /*
     * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
     * possible to, falsely, observe p->on_cpu == 0.
     *
     * One must be running (->on_cpu == 1) in order to remove oneself
     * from the runqueue.
     *
     * __schedule() (switch to task 'p')    try_to_wake_up()
     *   STORE p->on_cpu = 1                  LOAD p->on_rq
     *   UNLOCK rq->lock
     *
     * __schedule() (put 'p' to sleep)
     *   LOCK rq->lock                        smp_rmb();
     *   smp_mb__after_spinlock();
     *   STORE p->on_rq = 0                   LOAD p->on_cpu
     *
     * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
     * __schedule(). See the comment for smp_mb__after_spinlock().
     *
     * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
     * schedule()'s deactivate_task() has 'happened' and p will no longer
     * care about it's own p->state. See the comment in __schedule().
     */
    smp_acquire__after_ctrl_dep();

    /*
     * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
     * == 0), which means we need to do an enqueue, change p->state to
     * TASK_WAKING such that we can unlock p->pi_lock before doing the
     * enqueue, such as ttwu_queue_wakelist().
     */
    p->state = TASK_WAKING;

    /*
     * If the owning (remote) CPU is still in the middle of schedule() with
     * this task as prev, considering queueing p on the remote CPUs wake_list
     * which potentially sends an IPI instead of spinning on p->on_cpu to
     * let the waker make forward progress. This is safe because IRQs are
     * disabled and the IPI will deliver after on_cpu is cleared.
     *
     * Ensure we load task_cpu(p) after p->on_cpu:
     *
     * set_task_cpu(p, cpu);
     *   STORE p->cpu = @cpu
     * __schedule() (switch to task 'p')
     *   LOCK rq->lock
     *   smp_mb__after_spin_lock()            smp_cond_load_acquire(&p->on_cpu)
     *   STORE p->on_cpu = 1                  LOAD p->cpu
     *
     * to ensure we observe the correct CPU on which the task is currently
     * scheduling.
     */
    if (smp_load_acquire(&p->on_cpu) &&
        ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
        goto unlock;

    /*
     * If the owning (remote) CPU is still in the middle of schedule() with
     * this task as prev, wait until it's done referencing the task.
     *
     * Pairs with the smp_store_release() in finish_task().
     *
     * This ensures that tasks getting woken will be fully ordered against
     * their previous state and preserve Program Order.
     */
    smp_cond_load_acquire(&p->on_cpu, !VAL);

    cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
    if (task_cpu(p) != cpu) {
        if (p->in_iowait) {
            delayacct_blkio_end(p);
            atomic_dec(&task_rq(p)->nr_iowait);
        }

        wake_flags |= WF_MIGRATED;
        psi_ttwu_dequeue(p);
        set_task_cpu(p, cpu);
    }
#else
    cpu = task_cpu(p);
#endif /* CONFIG_SMP */

    ttwu_queue(p, cpu, wake_flags);
unlock:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
out:
    if (success)
        ttwu_stat(p, task_cpu(p), wake_flags);
    preempt_enable();

    return success;
}
```

* [/include/linux/sched.h](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L192)

```cpp
/*
 * set_current_state() includes a barrier so that the write of current->state
 * is correctly serialised wrt the caller's subsequent test of whether to
 * actually sleep:
 *
 *   for (;;) {
 *      set_current_state(TASK_UNINTERRUPTIBLE);
 *      if (CONDITION)
 *         break;
 *
 *      schedule();
 *   }
 *   __set_current_state(TASK_RUNNING);
 *
 * If the caller does not need such serialisation (because, for instance, the
 * CONDITION test and condition change and wakeup are under the same lock) then
 * use __set_current_state().
 *
 * The above is typically ordered against the wakeup, which does:
 *
 *   CONDITION = 1;
 *   wake_up_state(p, TASK_UNINTERRUPTIBLE);
 *
 * where wake_up_state()/try_to_wake_up() executes a full memory barrier before
 * accessing p->state.
 *
 * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
 * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
 * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
 *
 * However, with slightly different timing the wakeup TASK_RUNNING store can
 * also collide with the TASK_UNINTERRUPTIBLE store. Losing that store is not
 * a problem either because that will result in one extra go around the loop
 * and our @cond test will save the day.
 *
 * Also see the comments of try_to_wake_up().
 */
```

:::warning
**The `wake_up_poll` problem**

If the original code is changed to the following:

```cpp
    spin_lock_irq(&vpoll_data->wqh.lock);
    switch (cmd) {
    case VPOLL_IO_ADDEVENTS:
        vpoll_data->events |= events;
        break;
    case VPOLL_IO_DELEVENTS:
        vpoll_data->events &= ~events;
        break;
    default:
        res = -EINVAL;
    }
    if (res >= 0) {
        res = vpoll_data->events;
        if (waitqueue_active(&vpoll_data->wqh))
            // wake_up_locked_poll(&vpoll_data->wqh, vpoll_data->events);
            wake_up_poll(&vpoll_data->wqh, vpoll_data->events);
    }
    spin_unlock_irq(&vpoll_data->wqh.lock);
```

then, when it is actually run, the terminal gets stuck at:

```shell
ubuntu@ubuntu:~$ make check
make -C /lib/modules/`uname -r`/build M=/home/ubuntu modules
make[1]: Entering directory '/usr/src/linux-headers-5.4.0-80-generic'
  CC [M]  /home/ubuntu/module.o
  LD [M]  /home/ubuntu/vpoll.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC [M]  /home/ubuntu/vpoll.mod.o
  LD [M]  /home/ubuntu/vpoll.ko
make[1]: Leaving directory '/usr/src/linux-headers-5.4.0-80-generic'
gcc -Wall -o user user.c
sudo rmmod vpoll || echo rmmod: ERROR: Module vpoll is not currently loaded
sudo insmod vpoll.ko
./user
timeout...
```

Why does replacing `wake_up_locked_poll` with `wake_up_poll` cause a problem? In `__wake_up_common_lock` we can see the wait-queue lock being taken a second time, even though `vpoll_ioctl` already holds it via `spin_lock_irq`:

* [/include/linux/wait.h](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L237)

```cpp
#define wake_up_poll(x, m)                          \
    __wake_up(x, TASK_NORMAL, 1, poll_to_key(m))
```

* [/kernel/sched/wait.c](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L154)

```cpp
/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @wq_head: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 */
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
            int nr_exclusive, void *key)
{
    __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
```

* [/kernel/sched/wait.c](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L125)

```cpp
static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
            int nr_exclusive, int wake_flags, void *key)
{
    unsigned long flags;
    wait_queue_entry_t bookmark;

    bookmark.flags = 0;
    bookmark.private = NULL;
    bookmark.func = NULL;
    INIT_LIST_HEAD(&bookmark.entry);

    do {
        spin_lock_irqsave(&wq_head->lock, flags);
        nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                        wake_flags, key, &bookmark);
        spin_unlock_irqrestore(&wq_head->lock, flags);
    } while (bookmark.flags & WQ_FLAG_BOOKMARK);
}
```
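Putting the two paths together, a minimal reduction (a sketch, not the module source) shows the nesting: the ioctl path already holds `wqh.lock` when `wake_up_poll()` tries to take it again, which matches the hang observed above.

```cpp
/* Hypothetical reduction of the hang; illustrative only. */
static void broken_wakeup(struct vpoll_data *vpoll_data, __poll_t events)
{
    spin_lock_irq(&vpoll_data->wqh.lock);
    vpoll_data->events |= events;
    /*
     * wake_up_poll()
     *   -> __wake_up()
     *     -> __wake_up_common_lock()
     *       -> spin_lock_irqsave(&wqh->lock, flags)
     *
     * This is a second acquisition of a spinlock this CPU already holds,
     * so the call never returns and ./user only ever prints "timeout...".
     */
    wake_up_poll(&vpoll_data->wqh, vpoll_data->events);
    spin_unlock_irq(&vpoll_data->wqh.lock);     /* never reached */
}
```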
:::

### Checking locking with sparse

> [Docs » Development tools for the kernel » Sparse](https://www.kernel.org/doc/html/v4.9/dev-tools/sparse.html)

Using `kallsyms_lookup_name`, `__wake_up` and friends are re-implemented inside the module on a v5.4 kernel, so that sparse can check the locking context at compile time.

```cpp
typedef int (*__wake_up_common_t)(struct wait_queue_head *wq_head, unsigned int mode,
            int nr_exclusive, int wake_flags, void *key,
            wait_queue_entry_t *bookmark);
static __wake_up_common_t __wake_up_common;

#include <linux/kallsyms.h>

static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
            int nr_exclusive, int wake_flags, void *key)
    __releases(wq_head->lock)
{
    unsigned long flags;
    wait_queue_entry_t bookmark;

    bookmark.flags = 0;
    bookmark.private = NULL;
    bookmark.func = NULL;
    INIT_LIST_HEAD(&bookmark.entry);

    __wake_up_common =
        (__wake_up_common_t)kallsyms_lookup_name("__wake_up_common");

    do {
        spin_lock_irqsave(&wq_head->lock, flags);
        nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                        wake_flags, key, &bookmark);
        spin_unlock_irqrestore(&wq_head->lock, flags);
    } while (bookmark.flags & WQ_FLAG_BOOKMARK);
}

#define self_wake_up_poll(x, m) \
    __wake_up_common_lock(x, TASK_NORMAL, 1, 0, poll_to_key(m));

static long vpoll_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    struct vpoll_data *vpoll_data = file->private_data;
    __poll_t events = arg & EPOLLALLMASK;
    long res = 0;

    spin_lock_irq(&vpoll_data->wqh.lock);
    switch (cmd) {
    case VPOLL_IO_ADDEVENTS:
        vpoll_data->events |= events;
        break;
    case VPOLL_IO_DELEVENTS:
        vpoll_data->events &= ~events;
        break;
    default:
        res = -EINVAL;
    }
    if (res >= 0) {
        res = vpoll_data->events;
        if (waitqueue_active(&vpoll_data->wqh))
            // /*WWW*/ wake_up_locked_poll(&vpoll_data->wqh, vpoll_data->events);
            self_wake_up_poll(&vpoll_data->wqh, vpoll_data->events);
    }
    spin_unlock_irq(&vpoll_data->wqh.lock);
    return res;
}
```

Running the check shows something different from the earlier explanation: the warnings are not about taking the lock a second time after switching to `wake_up_poll`, but about an unexpected unlock (a lock context imbalance).

```shell
ubuntu@ubuntu:~/linux2021_vpoll$ make C=2
...
/home/ubuntu/linux2021_vpoll/sparse.c:68:9: warning: context imbalance in '__wake_up_common_lock' - wrong count at exit
/home/ubuntu/linux2021_vpoll/sparse.c:102:20: warning: context imbalance in 'vpoll_ioctl' - unexpected unlock
```

:::warning
**TODO**
* `spin_lock_irq` vs. `spin_lock_irqsave`
:::

### epoll performance benchmark

* [Linux Foundation - Epoll Kernel Performance Improvements](https://events19.linuxfoundation.org/wp-content/uploads/2018/07/dbueso-oss-japan19.pdf)
    * The kernel receives 4095 bytes of data and wakes up thread A; it then receives 4 bytes of data and wakes up thread B. Thread A performs read(4096) and reads a full buffer of 4096 bytes. Thread B performs read(4096) and reads the remaining 3 bytes of data.
    * Data is split across threads and can be reordered without serialization. The correct solution is to use `EPOLLONESHOT` and re-arm.
    * It is possible to receive events after closing the fd
        * Must `EPOLL_CTL_DEL` the fd before closing.
    * Benchmark
        - Locking/algorithmic changes.
        - Wakeup latencies.

The measurements below focus on the wakeup latency.

[trace_time.h](https://github.com/linD026/linux2021_vpoll/blob/main/trace_time.h) is used to record the time from `wake_up_locked_poll` to `poll_wait`:

```shell
[   62.322749] trace wake_up_locked_poll number 2: [CPU#0] 12 usec
[   62.323224] trace wake_up_locked_poll number 3: [CPU#0] 3 usec
[   63.322169] trace wake_up_locked_poll number 4: [CPU#0] 13 usec
[   63.322262] trace wake_up_locked_poll number 5: [CPU#0] 3 usec
[   64.321496] trace wake_up_locked_poll number 6: [CPU#0] 10 usec
[   64.321534] trace wake_up_locked_poll number 7: [CPU#0] 3 usec
[   65.320734] trace wake_up_locked_poll number 8: [CPU#0] 4 usec
[   65.320749] trace wake_up_locked_poll number 9: [CPU#0] 1 usec
[   66.319875] trace wake_up_locked_poll number 10: [CPU#0] 10 usec
[   66.319917] trace wake_up_locked_poll number 11: [CPU#0] 2 usec
[   67.319945] trace wake_up_locked_poll number 12: [CPU#0] 9 usec
```

ftrace is then used to hook the target function `do_epoll_wait` and replace it with a custom function:

```cpp
static struct trace_time tt_do_epoll_wait;

static int hook_do_epoll_wait(int epfd, struct epoll_event __user *events,
                              int maxevents, struct timespec64 *to)
{
    int ret;

    if (events->data == 123456789) {
        tt_do_epoll_wait = TRACE_TIME_INIT("do_epoll_wait");
        TRACE_TIME_START(tt_do_epoll_wait);
        ret = real_do_epoll_wait(epfd, events, maxevents, to);
        TRACE_TIME_END(tt_do_epoll_wait);
        TRACE_CALC(tt_do_epoll_wait);
        TRACE_PRINT(tt_do_epoll_wait);
    } else
        ret = real_do_epoll_wait(epfd, events, maxevents, to);
    return ret;
}
```
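The hooking machinery behind `real_do_epoll_wait` and `hook_remove` is not shown in this note. The sketch below outlines the usual ftrace-based approach on a v5.4 x86-64 kernel; apart from `do_epoll_wait`, `hook_do_epoll_wait`, and `real_do_epoll_wait`, every name here is made up for illustration:

```cpp
#include <linux/ftrace.h>
#include <linux/kallsyms.h>
#include <linux/module.h>

/* Function pointer the hook above uses to reach the original symbol
 * (declared before hook_do_epoll_wait() in the real module). */
static int (*real_do_epoll_wait)(int epfd, struct epoll_event __user *events,
                                 int maxevents, struct timespec64 *to);

/* Runs at the entry of do_epoll_wait(); on x86-64 it redirects execution to
 * hook_do_epoll_wait() by rewriting regs->ip. Calls that originate from this
 * module (i.e. real_do_epoll_wait()) are left alone to avoid recursion. */
static void notrace hook_thunk(unsigned long ip, unsigned long parent_ip,
                               struct ftrace_ops *ops, struct pt_regs *regs)
{
    if (!within_module(parent_ip, THIS_MODULE))
        regs->ip = (unsigned long)hook_do_epoll_wait;
}

static struct ftrace_ops hook_ops = {
    .func = hook_thunk,
    .flags = FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_RECURSION_SAFE |
             FTRACE_OPS_FL_IPMODIFY,
};

static int hook_install(void)
{
    unsigned long addr = kallsyms_lookup_name("do_epoll_wait");

    if (!addr)
        return -ENOENT;
    real_do_epoll_wait = (void *)addr;

    ftrace_set_filter_ip(&hook_ops, addr, 0, 0);    /* trace only this symbol */
    return register_ftrace_function(&hook_ops);
}

static void hook_uninstall(void)
{
    unregister_ftrace_function(&hook_ops);
    ftrace_set_filter_ip(&hook_ops, (unsigned long)real_do_epoll_wait, 1, 0);
}
```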
Here `123456789` is a value chosen in user space (defined in user.c); it is used to filter which process's calls get traced.

```cpp
struct epoll_event ev = {
    .events = EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT | EPOLLHUP | EPOLLPRI,
    .data.u64 = 123456789,
};
```

The time spent in `do_epoll_wait` is then recorded:

```shell
[  996.272615] vpoll: loaded
[  997.275741] trace do_epoll_wait number 0: [CPU#0] 976729 usec
[  997.276367] trace do_epoll_wait number 1: [CPU#0] 193 usec
[  998.277642] trace do_epoll_wait number 2: [CPU#0] 976939 usec
[  998.277802] trace do_epoll_wait number 3: [CPU#0] 113 usec
[  999.278660] trace do_epoll_wait number 4: [CPU#0] 976860 usec
[  999.278892] trace do_epoll_wait number 5: [CPU#0] 166 usec
[ 1000.279921] trace do_epoll_wait number 6: [CPU#0] 977015 usec
[ 1000.280182] trace do_epoll_wait number 7: [CPU#0] 168 usec
[ 1001.281037] trace do_epoll_wait number 8: [CPU#0] 976847 usec
[ 1001.281199] trace do_epoll_wait number 9: [CPU#0] 116 usec
[ 1002.281966] trace do_epoll_wait number 10: [CPU#0] 976772 usec
[ 1002.282128] trace do_epoll_wait number 11: [CPU#0] 115 usec
[ 1002.295292] vpoll: unloaded
```

:::warning
**Problems**

After running `insmod`, the following appears:

```shell
[  744.526000] vpoll: module verification failed: signature and/or required key missing - tainting kernel
```

In addition, the system crashes after the kernel module is removed. The current diagnosis is that other processes are still using the ftrace-hooked `do_epoll_wait`:

```shell
[  104.465311] systemd-journald[347]: Failed to run event loop: Resource temporarily unavailable
[  104.472748] vpoll: unloaded
[  104.476832] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 1.
[  104.477784] systemd[1]: Stopping Flush Journal to Persistent Storage...
[  104.485678] systemd[1]: systemd-journal-flush.service: Succeeded.
[  104.486011] systemd[1]: Stopped Flush Journal to Persistent Storage.
[  104.486207] systemd[1]: Stopped Journal Service.
[  104.487444] systemd[1]: Starting Journal Service...
[  104.545654] systemd[1]: Started Journal Service.
[  104.551994] systemd-journald[2756]: Received client request to flush runtime journal.
```

As a mitigation, an `atomic_t` counter tracks how many callers are currently inside the hooked function, and a `block` flag stops new callers from entering it while the module is being removed.

```shell
[  231.624046] vpoll: loading out-of-tree module taints kernel.
[  231.624063] vpoll: module verification failed: signature and/or required key missing - tainting kernel
[  231.630467] vpoll: loaded
[  232.632867] trace do_epoll_wait number 0: [CPU#0] 976749 usec
[  232.633657] trace do_epoll_wait number 1: [CPU#0] 297 usec
[  233.634531] trace do_epoll_wait number 2: [CPU#0] 976796 usec
[  234.635447] trace do_epoll_wait number 3: [CPU#0] 976871 usec
[  234.636329] trace do_epoll_wait number 4: [CPU#0] 142 usec
[  235.637228] trace do_epoll_wait number 5: [CPU#0] 976798 usec
[  235.637462] trace do_epoll_wait number 6: [CPU#0] 169 usec
[  236.638180] trace do_epoll_wait number 7: [CPU#0] 976890 usec
[  236.638416] trace do_epoll_wait number 8: [CPU#0] 171 usec
[  237.639202] trace do_epoll_wait number 9: [CPU#0] 976955 usec
[  237.639435] trace do_epoll_wait number 10: [CPU#0] 168 usec
[  238.653570] Warning! Someone(3) still using hook function
[  238.658051] vpoll: unloaded
[  272.538649] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 2.
[  272.539722] systemd[1]: Stopping Flush Journal to Persistent Storage...
[  272.548014] systemd[1]: systemd-journal-flush.service: Succeeded.
[  272.548328] systemd[1]: Stopped Flush Journal to Persistent Storage.
[  272.548533] systemd[1]: Stopped Journal Service.
[  272.549677] systemd[1]: Starting Journal Service...
[  272.609682] systemd[1]: Started Journal Service.
[  272.615594] systemd-journald[3273]: Received client request to flush runtime journal.
```

```cpp
static void __exit vpoll_exit(void)
{
    block = 1;
    if (atomic_read(&do_epoll_wait_cnt) > 0) {
        pr_info("Warning! Someone(%d) still using hook function\n",
                atomic_read(&do_epoll_wait_cnt));
        mdelay(20);
    }
    hook_remove(&hook);
    device_destroy(vpoll_class, major);
    cdev_del(&vpoll_cdev);
    class_destroy(vpoll_class);
    unregister_chrdev_region(major, 1);
    printk(KERN_INFO NAME ": unloaded\n");
}
```
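For reference, a sketch of the counting guard described above (`do_epoll_wait_cnt`, `block`, and `real_do_epoll_wait` are the names used in this note; the wrapper name is made up). Note that it only narrows the race window: a caller can still slip in between the `block` check and the increment, which is why `vpoll_exit()` merely warns and waits before removing the hook.

```cpp
#include <linux/atomic.h>
#include <linux/compiler.h>

static atomic_t do_epoll_wait_cnt = ATOMIC_INIT(0);
static int block;

static int guarded_do_epoll_wait(int epfd, struct epoll_event __user *events,
                                 int maxevents, struct timespec64 *to)
{
    int ret;

    /* While the module is being unloaded, skip the accounting entirely. */
    if (READ_ONCE(block))
        return real_do_epoll_wait(epfd, events, maxevents, to);

    atomic_inc(&do_epoll_wait_cnt);
    ret = real_do_epoll_wait(epfd, events, maxevents, to);
    atomic_dec(&do_epoll_wait_cnt);
    return ret;
}
```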
:::

user.c was then rewritten following [GitHub - linux-ipc-benchmarks](https://github.com/kamalmarhubi/linux-ipc-benchmarks):

```cpp
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define VPOLL_IOC_MAGIC '^'
#define VPOLL_IO_ADDEVENTS _IO(VPOLL_IOC_MAGIC, 1)
#define VPOLL_IO_DELEVENTS _IO(VPOLL_IOC_MAGIC, 2)

typedef struct state {
    /* child */
    int efd;
    /* parent */
    int epfd;
    struct epoll_event ev;
} state;

state *new_state()
{
    state *state = malloc(sizeof(struct state));
    return state;
};

void free_state(state *state __attribute__((unused)))
{
    free(state);
};

int pre_fork_setup(state *state __attribute__((unused)))
{
    state->efd = open("/dev/vpoll", O_RDWR | O_CLOEXEC);
    state->epfd = epoll_create1(EPOLL_CLOEXEC);
    struct epoll_event tmp = {
        .events = EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT | EPOLLHUP |
                  EPOLLPRI,
        .data.u64 = 0,
    };
    state->ev = tmp;
    epoll_ctl(state->epfd, EPOLL_CTL_ADD, state->efd, &state->ev);
    printf("setup efd %d epfd %d\n", state->efd, state->epfd);
    return 0;
};

int cleanup(state *state)
{
    close(state->efd);
    close(state->epfd);
    return 0;
};

int child_post_fork_setup(state *state __attribute__((unused)))
{
    return 0;
}

int child_warmup(int warmup_iters __attribute__((unused)),
                 state *state __attribute__((unused)))
{
    return 0;
}

int child_loop(int iters, state *state)
{
    printf("child efd %d \n", state->efd);
    for (int i = 0; i < iters; ++i) {
        ioctl(state->efd, VPOLL_IO_ADDEVENTS, EPOLLIN);
    }
    ioctl(state->efd, VPOLL_IO_ADDEVENTS, EPOLLHUP);
    return 0;
}

int child_cleanup(state *state __attribute__((unused)))
{
    return 0;
}

int parent_post_fork_setup(state *state __attribute__((unused)))
{
    return 0;
}

int parent_warmup(int warmup_iters __attribute__((unused)),
                  state *state __attribute__((unused)))
{
    return 0;
}

int parent_loop(int iters __attribute__((unused)), state *state)
{
    printf("parent epfd %d \n", state->epfd);
    while (1) {
        int nfds = epoll_wait(state->epfd, &state->ev, 1, 1000);
        if (nfds == 0)
            printf("timeout...\n");
        else {
            printf("GOT event %x\n", state->ev.events);
            ioctl(state->efd, VPOLL_IO_DELEVENTS, state->ev.events);
            if (state->ev.events & EPOLLHUP)
                break;
        }
    }
    return 0;
}

int parent_cleanup(state *state __attribute__((unused)))
{
    return 0;
}
```

Output:

```shell
100000 iters in 113667163 ns
1136.671630 ns/iter
```

:::info
**Reference**
* [Linux Foundation - Epoll Kernel Performance Improvements](https://events19.linuxfoundation.org/wp-content/uploads/2018/07/dbueso-oss-japan19.pdf)
* [GitHub - linux-ipc-benchmarks](https://github.com/kamalmarhubi/linux-ipc-benchmarks)
:::