contributed by < linD026 >
Linux kernel RCU, linux2021
For the concepts behind RCU, see: Linux 核心設計: RCU 同步機制, and perfbook sections 9.5 and B.
For the full documentation of RCU inside Linux, see: RCU concepts.
So the typical RCU update sequence goes something like the following:
- Remove pointers to a data structure, so that subsequent readers cannot gain a reference to it.
- Wait for all previous readers to complete their RCU read-side critical sections.
- At this point, there cannot be any readers who hold references to the data structure, so it now may safely be reclaimed (e.g., kfree()).
This article has described the three fundamental components of RCU-based algorithms:
- a publish-subscribe mechanism for adding new data,
- a way of waiting for pre-existing RCU readers to finish, and
- a discipline of maintaining multiple versions to permit change without harming or unduly delaying concurrent RCU readers.
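This remove/wait/reclaim sequence can be illustrated with a minimal kernel-style sketch; struct foo, the global pointer gp, and gp_lock are placeholders invented for this example:

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
	int data;
};

struct foo __rcu *gp;			/* RCU-protected global pointer (example only) */
static DEFINE_SPINLOCK(gp_lock);	/* serializes updaters */

void update_foo(int new_data)
{
	struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
	struct foo *old_fp;

	if (!new_fp)
		return;
	new_fp->data = new_data;

	spin_lock(&gp_lock);
	old_fp = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
	rcu_assign_pointer(gp, new_fp);	/* 1. remove readers' path to the old version */
	spin_unlock(&gp_lock);

	synchronize_rcu();		/* 2. wait for pre-existing readers to finish */
	kfree(old_fp);			/* 3. no reader can still hold a reference    */
}
```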
Without CONFIG_PREEMPT_RCU, the classic read-side critical section is not allowed to block or sleep, so the read-side lock can be written as:
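A minimal sketch of what this boils down to; the real kernel code goes through __rcu_read_lock()/__rcu_read_unlock(), so this only shows the conceptual mapping:

```c
/* Without CONFIG_PREEMPT_RCU, the read side only has to keep the
 * CPU from being preempted inside the critical section. */
static inline void rcu_read_lock(void)
{
	preempt_disable();
}

static inline void rcu_read_unlock(void)
{
	preempt_enable();
}

/*
 * Without CONFIG_PREEMPT_COUNT, preempt_disable() and preempt_enable()
 * themselves compile down to barrier(), a pure compiler barrier, so the
 * read side generates no instructions at all.
 */
```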
Without CONFIG_PREEMPT_COUNT (which is used for kernel preemption, interrupt counting, and so on), preempt_disable() and related functions compile down to nothing more than a compiler barrier.
As for synchronize_rcu(), once a context switch has been executed on every CPU, all pre-existing readers can be considered to have left their read-side critical sections:
There is a trick, and the trick is that RCU Classic read-side critical sections delimited by rcu_read_lock() and rcu_read_unlock() are not permitted to block or sleep. Therefore, when a given CPU executes a context switch, we are guaranteed that any prior RCU read-side critical sections will have completed. This means that as soon as each CPU has executed at least one context switch, all prior RCU read-side critical sections are guaranteed to have completed, meaning that synchronize_rcu() can safely return.
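A toy user-space illustration of this trick, in the spirit of perfbook's classic example; run_on() and toy_synchronize_rcu() are helpers written just for this sketch:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Pin the calling thread onto the given CPU; if a reader was running
 * there, this forces a context switch on that CPU. */
static void run_on(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	sched_setaffinity(0, sizeof(mask), &mask);
}

/* Toy synchronize_rcu(): after we have run on every CPU, every CPU has
 * context-switched at least once, so all readers that started before
 * this call must already have left their critical sections. */
static void toy_synchronize_rcu(void)
{
	int nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);

	for (int cpu = 0; cpu < nr_cpus; cpu++)
		run_on(cpu);
}
```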
In v5.0 and later kernels, synchronize_rcu() and call_rcu() also wait for regions of code with preemption disabled, including regions of code with interrupts or softirqs disabled. In pre-v5.0 kernels, which define synchronize_sched(), only code enclosed within rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
Note, however, that RCU callbacks are permitted to run concurrently with new RCU read-side critical sections. One way that this can happen is via the following sequence of events:
(1) CPU 0 enters an RCU read-side critical section,
(2) CPU 1 invokes call_rcu() to register an RCU callback,
(3) CPU 0 exits the RCU read-side critical section,
(4) CPU 2 enters an RCU read-side critical section,
(5) the RCU callback is invoked.
This is legal, because the RCU read-side critical section that was running concurrently with the call_rcu() (and which therefore might be referencing something that the corresponding RCU callback would free up) has completed before the corresponding RCU callback is invoked.
__rcu
A pointer can be marked __rcu, which will cause the sparse static-analysis checker to complain if that pointer is accessed directly, instead of via rcu_dereference() and rcu_assign_pointer(). This checking can help developers avoid accidental destruction of the address and data dependencies on which many RCU use cases depend.
Using the address_space annotation and related features provided by Sparse, we can check that an RCU-protected pointer is never dereferenced directly. Suppose there is a struct test __rcu *foo pointer; under Sparse, a plain access produces the following error:
This ensures that an RCU-protected pointer is accessed only through the RCU API, for example rcu_assign_pointer() and rcu_dereference().
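A small sketch of how such a pointer is declared and accessed; struct test and foo are placeholders used only for this example:

```c
struct test {
	int value;
};

struct test __rcu *foo;		/* sparse tracks the __rcu address space */

/* Reader: must go through rcu_dereference() inside a read-side critical
 * section; a plain "p = foo;" is what makes sparse complain. */
int read_value(void)
{
	struct test *p;
	int v = -1;

	rcu_read_lock();
	p = rcu_dereference(foo);
	if (p)
		v = p->value;
	rcu_read_unlock();
	return v;
}

/* Updater: publishes a new structure with rcu_assign_pointer(). */
void publish(struct test *new_p)
{
	rcu_assign_pointer(foo, new_p);
}
```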
In most RCU use cases, the rcu_dereference() primitive that accesses RCU-protected pointers must itself be protected by rcu_read_lock(). The Linux-kernel lockdep facility has therefore been adapted to complain about unprotected uses of rcu_dereference().
rcu_dereference()
rcu_assign_pointer()
In contrast, when assigning a non-NULL pointer, it is necessary to use smp_store_release() in order to ensure that initialization of the pointed-to structure is carried out before assignment of the pointer.
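A sketch of why the release ordering matters, reusing struct test and foo from the example above; for a non-NULL pointer, rcu_assign_pointer() behaves roughly like smp_store_release():

```c
static void publish_initialized(void)
{
	struct test *p = kmalloc(sizeof(*p), GFP_KERNEL);

	if (!p)
		return;
	p->value = 42;			/* initialize the structure first ...      */
	rcu_assign_pointer(foo, p);	/* ... then publish it; this acts roughly  */
					/* like smp_store_release(&foo, p), so any */
					/* reader that sees the new pointer also   */
					/* sees the initialized ->value.           */
}
```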
Single atomic counter vs. multi-thread atomic counter
Here, locked-RCU, which keeps a single global reference count manipulated with C11 atomic operations, is compared against thrd-based-RCU, which keeps a per-thread reference count. The time spent on the read side and on the update side is shown below:
With the per-thread partitioning, thread-based RCU spends noticeably less time on the read side, while on the update side the two are roughly the same.
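A minimal C11 sketch of the two read-side designs being compared; the names global_refcount and per_thread_refcount are made up for this illustration:

```c
#include <stdatomic.h>
#include <threads.h>

/* locked-RCU style: every reader increments the same counter,
 * so all readers bounce one cache line between CPUs. */
static atomic_int global_refcount;

static inline void global_read_lock(void)   { atomic_fetch_add(&global_refcount, 1); }
static inline void global_read_unlock(void) { atomic_fetch_sub(&global_refcount, 1); }

/* thrd-based-RCU style: each thread touches only its own counter, so the
 * read side stays cheap; the updater must instead scan every thread's
 * counter (a real implementation keeps a registry of threads for this),
 * which is why the update-side cost ends up roughly the same. */
static thread_local atomic_int per_thread_refcount;

static inline void thread_read_lock(void)   { atomic_fetch_add(&per_thread_refcount, 1); }
static inline void thread_read_unlock(void) { atomic_fetch_sub(&per_thread_refcount, 1); }
```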
The key concept behind the Classic RCU implementation is that Classic RCU read-side critical sections are confined to kernel code and are not permitted to block. This means that any time a given CPU is seen either blocking, in the idle loop, or exiting the kernel, we know that all RCU read-side critical sections that were previously running on that CPU must have completed. Such states are called “quiescent states”, and after each CPU has passed through at least one quiescent state, the RCU grace period ends.
Classic RCU's most important data structure is the rcu_ctrlblk structure, which contains the ->cpumask field, which contains one bit per CPU. Each CPU's bit is set to one at the beginning of each grace period, and each CPU must clear its bit after it passes through a quiescent state. Because multiple CPUs might want to clear their bits concurrently, which would corrupt the ->cpumask field, a ->lock spinlock is used to protect ->cpumask, preventing any such corruption. Unfortunately, this spinlock can also suffer extreme contention if there are more than a few hundred CPUs, which might soon become quite common if multicore trends continue. Worse yet, the fact that all CPUs must clear their own bit means that CPUs are not permitted to sleep through a grace period, which limits Linux's ability to conserve power.
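A simplified sketch of the structure being described, reduced to the two fields the quoted text mentions (not the exact historical kernel definition):

```c
struct rcu_ctrlblk {
	spinlock_t	lock;		/* protects ->cpumask from concurrent clears   */
	cpumask_t	cpumask;	/* one bit per CPU: set at grace-period start, */
					/* cleared by each CPU after a quiescent state */
};
```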
Reference: Hierarchical RCU
The “RCU” column corresponds to the consolidation of the three Linux-kernel RCU implementations, in which RCU read-side critical sections start with rcu_read_lock(), rcu_read_lock_bh(), or rcu_read_lock_sched() and end with rcu_read_unlock(), rcu_read_unlock_bh(), or rcu_read_unlock_sched(), respectively. Any region of code that disables bottom halves, interrupts, or preemption also acts as an RCU read-side critical section. RCU read-side critical sections may be nested.
The corresponding synchronous update-side primitives, synchronize_rcu() and synchronize_rcu_expedited(), along with their synonym synchronize_net(), wait for any type of currently executing RCU read-side critical sections to complete. The length of this wait is known as a “grace period”, and synchronize_rcu_expedited() is designed to reduce grace-period latency at the expense of increased CPU overhead and IPIs.
The asynchronous update-side primitive, call_rcu(), invokes a specified function with a specified argument after a subsequent grace period. For example, call_rcu(p, f); will result in the “RCU callback” f(p) being invoked after a subsequent grace period. There are situations, such as when unloading a Linux-kernel module that uses call_rcu(), when it is necessary to wait for all outstanding RCU callbacks to complete. The rcu_barrier() primitive does this job.
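A sketch of the module-unload pattern mentioned above; struct foo2 and the surrounding function names are assumptions made for this example:

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo2 {
	int data;
	struct rcu_head rh;
};

static void foo2_reclaim(struct rcu_head *rh)
{
	struct foo2 *fp = container_of(rh, struct foo2, rh);

	kfree(fp);
}

/* Updater: schedule asynchronous reclamation after a later grace period. */
static void foo2_retire(struct foo2 *old_fp)
{
	call_rcu(&old_fp->rh, foo2_reclaim);
}

/* Module exit: wait for every callback queued above to finish before the
 * callback code itself can be unloaded. */
static void foo2_cleanup(void)
{
	rcu_barrier();
}
```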
Reference: Perfbook Section 9.5.3
struct rcu_head
This structure is embedded in an RCU-protected structure and is used for asynchronous grace periods.
The ->next field is used to link the rcu_head structures together in the lists within the rcu_data structures. The ->func field is a pointer to the function to be called when the callback is ready to be invoked, and this function is passed a pointer to the rcu_head structure. However, kfree_rcu() uses the ->func field to record the offset of the rcu_head structure within the enclosing RCU-protected data structure.
Both of these fields are used internally by RCU. From the viewpoint of RCU users, this structure is an opaque “cookie”.
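For reference, rcu_head is an alias for callback_head in include/linux/types.h, and the two-argument kfree_rcu() only needs the name of the embedded field; struct bar below is a placeholder for this example:

```c
/* include/linux/types.h */
struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *head);
} __attribute__((aligned(sizeof(void *))));
#define rcu_head callback_head

/* Typical embedding and use with kfree_rcu(): */
struct bar {
	int data;
	struct rcu_head rh;
};

static void bar_free(struct bar *old)
{
	/* The offset of ->rh within struct bar is what ends up in ->func. */
	kfree_rcu(old, rh);
}
```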
kfree_rcu()
NOTE: vfree() also uses RCU to free the physical allocation when the reference count drops to 0.
CONFIG_PREEMPT_RCU
The ->rcu_read_lock_nesting field records the nesting level for RCU read-side critical sections, and the ->rcu_read_unlock_special field is a bitmask that records special conditions that require rcu_read_unlock() to do additional work. The ->rcu_node_entry field is used to form lists of tasks that have blocked within preemptible-RCU read-side critical sections and the ->rcu_blocked_node field references the rcu_node structure whose list this task is a member of, or NULL if it is not blocked within a preemptible-RCU read-side critical section.
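These fields live in task_struct when CONFIG_PREEMPT_RCU is enabled; a trimmed-down sketch, with all unrelated members omitted:

```c
struct task_struct {
	/* ... */
#ifdef CONFIG_PREEMPT_RCU
	int			rcu_read_lock_nesting;		/* read-side nesting depth           */
	union rcu_special	rcu_read_unlock_special;	/* work owed by rcu_read_unlock()    */
	struct list_head	rcu_node_entry;			/* entry on an rcu_node blocked list */
	struct rcu_node		*rcu_blocked_node;		/* which rcu_node list, or NULL      */
#endif
	/* ... */
};
```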
rcu_preempt_read_enter
__rcu_read_lock
___might_sleep
rcu_exp_handler
rcu_note_context_switch
rcu_flavor_sched_clock_irq
Question
If the update side modifies the data before a reader enters its read-side critical section, but the reader enters that critical section before synchronize_rcu() has completed, does synchronize_rcu() still have to wait for this reader, which is clearly reading the new data, to leave its critical section?
The existing APIs may already address this in some other form, or it may simply be a trade-off:
void synchronize_rcu_expedited(void)
call_rcu()
The “SRCU” column in Table 9.1 displays a specialized RCU API that permits general sleeping in SRCU read-side critical sections delimited by srcu_read_lock() and srcu_read_unlock(). However, unlike RCU, SRCU's srcu_read_lock() returns a value that must be passed into the corresponding srcu_read_unlock(). This difference is due to the fact that the SRCU user allocates an srcu_struct for each distinct SRCU usage. These distinct srcu_struct structures prevent SRCU read-side critical sections from blocking unrelated synchronize_srcu() and synchronize_srcu_expedited() invocations. Of course, use of either synchronize_srcu() or synchronize_srcu_expedited() within an SRCU read-side critical section can result in self-deadlock, so should be avoided. As with RCU, SRCU's synchronize_srcu_expedited() decreases grace-period latency compared to synchronize_srcu(), but at the expense of increased CPU overhead.
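A minimal SRCU usage sketch; my_srcu, struct item, and the surrounding functions are placeholders invented for this example:

```c
#include <linux/srcu.h>
#include <linux/slab.h>

struct item {
	int value;
};

DEFINE_SRCU(my_srcu);		/* one srcu_struct per distinct SRCU usage */
struct item __rcu *item_ptr;

int read_item(void)
{
	struct item *p;
	int idx, v = -1;

	idx = srcu_read_lock(&my_srcu);		/* the returned index ...       */
	p = srcu_dereference(item_ptr, &my_srcu);
	if (p)
		v = p->value;			/* sleeping is allowed here     */
	srcu_read_unlock(&my_srcu, idx);	/* ... must be passed back here */
	return v;
}

void replace_item(struct item *new_p)
{
	/* Updater serialization omitted for brevity. */
	struct item *old = rcu_dereference_protected(item_ptr, 1);

	rcu_assign_pointer(item_ptr, new_p);
	synchronize_srcu(&my_srcu);		/* waits only for my_srcu readers */
	kfree(old);
}
```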
The read-copy-update (RCU) mechanism is charged with keeping old versions of data structures around until it knows that no CPU can hold a reference to them; once that happens, the structures can be freed. Recently, though, a potential RCU user came forward with a request for something different: would it be possible to defer the destruction of an old data structure until it is known that no process holds a reference to it? The answer would appear to be "yes," as demonstrated by the RCU-tasks subsystem recently posted by Paul McKenney.
Normal RCU works on data structures that are accessed via a pointer. When an RCU-protected structure must change, the code that maintains that structure starts by making a copy. The changes are made to the copy, then the relevant pointer is changed to point to that new copy. At this point, the old version is inaccessible, but there may be code running that obtained a pointer to it before the change was made. So the old structure cannot yet be freed. Instead, RCU waits until every CPU in the system goes through a context switch (or sits idle). Since the rules for RCU say that references to data structures can only be held in atomic context, the "every CPU has context switched" condition guarantees that no references to an old data structure can be held.
In earlier implementations, the task requesting the expedited grace period also drove it to completion. This straightforward approach had the disadvantage of needing to account for POSIX signals sent to user tasks, so more recent implementations use the Linux kernel's workqueues.
The use of workqueues has the advantage that the expedited grace-period code need not worry about POSIX signals. Unfortunately, it has the corresponding disadvantage that workqueues cannot be used until they are initialized, which does not happen until some time after the scheduler spawns the first task. Given that there are parts of the kernel that really do want to execute grace periods during this mid-boot “dead zone”, expedited grace periods must do something else during this time.
What they do is to fall back to the old practice of requiring that the requesting task drive the expedited grace period, as was the case before the use of workqueues. However, the requesting task is only required to drive the grace period during the mid-boot dead zone. Before mid-boot, a synchronous grace period is a no-op. Some time after mid-boot, workqueues are used.
Non-expedited non-SRCU synchronous grace periods must also operate normally during mid-boot. This is handled by causing non-expedited grace periods to take the expedited code path during mid-boot.
The current code assumes that there are no POSIX signals during the mid-boot dead zone. However, if an overwhelming need for POSIX signals somehow arises, appropriate adjustments can be made to the expedited stall-warning code. One such adjustment would reinstate the pre-workqueue stall-warning checks, but only during the mid-boot dead zone.
kfree_rcu()
kmem_cache_alloc()
- rcu_barrier() doesn't wait for memory being freed via kfree_rcu(); use call_rcu() instead
- kfree_rcu_mightsleep() to avoid additional bytes; the single-argument kfree_rcu() is changed to kfree_rcu_mightsleep()
More control and information from grace periods (gp)
LWN.net RCU 2024 API edition
LSFMM+BPF 2024: Reclamation Interactions with RCU
In recent Linux kernels, RCU instead provides a complete polling API for managing RCU grace periods. This permits RCU users to take full control over all aspects of waiting for grace periods and responding to the end of a particular grace period. For example, the get_state_synchronize_rcu() function returns a cookie that can be passed to poll_state_synchronize_rcu(). This latter function will return true if a grace period has elapsed in the meantime.
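A sketch of the polling usage described above; the stale_foo bookkeeping and the function names are invented for this example, with struct foo as in the first sketch:

```c
static unsigned long gp_cookie;
static struct foo *stale_foo;

/* Updater side: unlink the old object and record the current
 * grace-period state instead of blocking in synchronize_rcu(). */
static void retire_foo(struct foo *old)
{
	stale_foo = old;
	gp_cookie = get_state_synchronize_rcu();
}

/* Called later, e.g. from a periodic cleanup path: free the object only
 * if a full grace period has elapsed since the cookie was taken. */
static void maybe_reclaim_foo(void)
{
	if (stale_foo && poll_state_synchronize_rcu(gp_cookie)) {
		kfree(stale_foo);
		stale_foo = NULL;
	}
}
```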
In computing, a non-maskable interrupt (NMI) is a hardware interrupt that standard interrupt-masking techniques in the system cannot ignore. It typically occurs to signal attention for non-recoverable hardware errors. Some NMIs may be masked, but only by using proprietary methods specific to the particular NMI.
SRCU read-side critical sections use this_cpu_inc(), which excludes interrupt and software-interrupt handlers, but is not guaranteed to exclude non-maskable interrupt (NMI) handlers. Therefore, SRCU read-side critical sections may not be used in NMI handlers, at least not in portable code. This restriction became problematic for printk(), which is frequently called upon to do stack backtraces from NMI handlers. This situation motivated adding srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe(). These new API members instead use atomic_long_inc(), which can be more expensive than this_cpu_inc(), but which does exclude NMI handlers.
However, SRCU will complain if you use both the traditional and the NMI-safe API members on the same srcu_struct structure. In theory, it is possible to mix and match but, in practice, the rules for safely doing so are not consistent with good software-engineering practice.
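A sketch of the NMI-safe variants in use; nmi_srcu and the handler name are placeholders, and whichever variant is chosen must then be used consistently on that srcu_struct:

```c
DEFINE_SRCU(nmi_srcu);

/* NMI context: the _nmisafe variants use atomic_long_inc() internally,
 * so they may be used here, where plain srcu_read_lock() may not. */
void my_nmi_backtrace(void)
{
	int idx;

	idx = srcu_read_lock_nmisafe(&nmi_srcu);
	/* ... walk SRCU-protected data, e.g. for a stack backtrace ... */
	srcu_read_unlock_nmisafe(&nmi_srcu, idx);
}
```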