# RCU in the Linux Kernel
###### tags: `rcu`
## RCU Concepts - Frequently Asked Questions
> https://www.kernel.org/doc/html/latest/RCU/rcu.html
TODO: fill this in later.
---
## RCU’S CORE API
> https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html#what-is-rcu-s-core-api
- rcu_read_lock()
- rcu_read_unlock()
- synchronize_rcu() / call_rcu()
- rcu_assign_pointer()
- rcu_dereference()
TODO: fill this in when time permits.
> https://hackmd.io/@cccccs100203/SkYgGgsa9/https%3A%2F%2Fhackmd.io%2FJ-UpD-UmQD6fjqjoLSDFYA#Linex-doc--RCU-API
Some notes are already written there.
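The calling convention of these core APIs can be sketched in userspace with no-op stubs. This is a toy, single-threaded model just to show the reader/updater pattern from the whatisRCU document; the real primitives involve memory barriers and grace-period machinery, and `synchronize_rcu()` actually blocks:

```c
/* Toy single-threaded stand-ins -- the real kernel primitives do much more. */
#define rcu_read_lock()          do { } while (0)
#define rcu_read_unlock()        do { } while (0)
#define synchronize_rcu()        do { } while (0)   /* waits for all pre-existing readers */
#define rcu_assign_pointer(p, v) ((p) = (v))        /* publish with release semantics */
#define rcu_dereference(p)       (p)                /* read with dependency ordering */

struct foo { int a; };

static struct foo foo1 = { .a = 1 }, foo2 = { .a = 2 };
static struct foo *gp = &foo1;          /* RCU-protected global pointer */

/* Reader: dereference gp only inside a read-side critical section. */
static int reader(void)
{
	int val;

	rcu_read_lock();
	val = rcu_dereference(gp)->a;
	rcu_read_unlock();
	return val;
}

/* Updater: publish the new version, wait a grace period, reclaim the old. */
static struct foo *update(struct foo *newp)
{
	struct foo *old = gp;

	rcu_assign_pointer(gp, newp);
	synchronize_rcu();              /* no reader can still hold 'old' now */
	return old;                     /* caller may free it */
}
```

In the kernel, `update()` would typically end with `kfree(old)`, or the updater would use `call_rcu()` to have a callback do the freeing asynchronously.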
---
## RCU Flavors
> [A Tour Through RCU’s Requirements](https://docs.kernel.org/RCU/Design/Requirements/Requirements.html#other-rcu-flavors)
One of the more surprising things about RCU is that there are now no fewer than five flavors, or API families.
- non-preemptible
- preemptible
- Bottom-Half Flavor (Historical)
- Sched Flavor (Historical)
- Sleepable RCU
- Tasks RCU
#### TL;DR
- RCU-bh for code that is subject to network-based denial-of-service attacks.
- RCU-sched for code that must interact with interrupt/NMI handlers or with preemption-disabled regions of code, and for general-purpose use in CONFIG_PREEMPT=n kernels.
- RCU-preempt for general-purpose use in CONFIG_PREEMPT=y kernels.
----
### Bottom-Half Flavor (Historical)
> The RCU-bh flavor of RCU has since been expressed in terms of the other RCU flavors as part of a consolidation of the three flavors into a single flavor. **The read-side API remains, and continues to disable softirq and to be accounted for by lockdep**.
==The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations) flavor of RCU==, or RCU-bh, was developed by Dipankar Sarma to provide a flavor of RCU that could withstand the network-based denial-of-service attacks researched by Robert Olsson.
RCU-bh, AKA softirq-disable RCU, was created to withstand DoS attacks: under such an attack, a CPU running ordinary RCU readers may never get to context switch, so the grace period never ends, eventually leading to OOM.
RCU-bh's approach is to run read-side critical sections with local_bh_disable() in effect, and to treat the transition from one softirq type to another as a quiescent state as well. That way, even a CPU stuck in softirq processing cannot extend the grace period indefinitely.
`rcu_read_lock_bh()` disables softirq and only `rcu_read_unlock_bh()` re-enables it, so softirqs raised in between are deferred, and `rcu_read_unlock_bh()` must re-invoke softirq processing.
Incidentally, the overhead caused by disabling softirq is, in theory, incurred at `rcu_read_lock_bh()`, but most profiling tools cannot distinguish at this granularity: they only observe a delay inside `rcu_read_unlock_bh()` and attribute the overhead there.
> In addition, anything that disables bottom halves also marks an RCU-bh read-side critical section, including local_bh_disable() and local_bh_enable(), local_irq_save() and local_irq_restore(), and so on.
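The deferral described above can be modeled in a few lines of userspace C. The names below are simplified stand-ins for the kernel functions of the same name; this toy only shows why a softirq raised inside an RCU-bh reader runs at `rcu_read_unlock_bh()` time, which is where profilers then see the cost:

```c
static int bh_disable_depth;    /* toy per-CPU bottom-half disable nesting */
static int softirq_pending;
static int softirq_handled;

static void do_softirq(void)    /* toy softirq processing */
{
	while (softirq_pending) {
		softirq_pending--;
		softirq_handled++;
	}
}

static void local_bh_disable(void) { bh_disable_depth++; }

static void local_bh_enable(void)
{
	if (--bh_disable_depth == 0)
		do_softirq();           /* deferred softirqs run here */
}

static void raise_softirq(void)
{
	softirq_pending++;
	if (bh_disable_depth == 0)
		do_softirq();
}

/* Anything that disables bottom halves marks an RCU-bh read-side
 * critical section, so the reader API is simply: */
#define rcu_read_lock_bh()   local_bh_disable()
#define rcu_read_unlock_bh() local_bh_enable()
```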
#### local_bh_disable()
> [local_bh_disable()/local_bh_enable() include/linux/interrupt.h](https://www.kernel.org/doc/htmldocs/kernel-hacking/routines-softirqs.html)
> [local_bh_disable, preempt_disable, local_irq_disable](https://stackoverflow.com/questions/24260087)
>
> These routines disable soft interrupts on the local CPU, and restore them. They are reentrant; if soft interrupts were disabled before, they will still be disabled after this pair of functions has been called. They prevent softirqs and tasklets from running on the current CPU.
These disable/re-enable soft interrupts on the local CPU.
> [local_irq_save()/local_irq_restore() include/linux/irqflags.h](https://www.kernel.org/doc/htmldocs/kernel-hacking/routines-local-irqs.html)
> [[PATCH] local_irq_disable removal](https://lore.kernel.org/lkml/1118214519.4759.17.camel@dhcp153.mvista.com/)
#### On the effects of softirq
[/include/linux/interrupt.h](https://elixir.bootlin.com/linux/v5.19/source/include/linux/interrupt.h#L548)
----
### Sched Flavor (Historical)
> The RCU-sched flavor of RCU has since been expressed in terms of the other RCU flavors as part of a consolidation of the three flavors into a single flavor.
==The read-side API remains, and continues to disable preemption== and to be accounted for by lockdep.
With CONFIG_PREEMPTION=n, RCU-sched and classic RCU share the same implementation. With CONFIG_PREEMPTION=y they differ, because RCU-sched disables preemption across its read-side critical sections.
This also means that `rcu_read_unlock_sched()` may enter the scheduler, which adds overhead; a high-priority task, however, is not easily preempted, so it can call `rcu_read_unlock_sched()` without worry.
----
### Sleepable RCU
- See also the following notes:
https://hackmd.io/J-UpD-UmQD6fjqjoLSDFYA?view#Sleepable-RCU-SRCU
An `srcu_struct` represents a subsystem (domain), and grace periods of different domains do not affect one another.
> Unlike the other RCU flavors, SRCU read-side critical sections can run on idle and even offline CPUs. This ability requires that srcu_read_lock() and srcu_read_unlock() contain memory barriers, which means that SRCU readers will run a bit slower than would RCU readers. It also motivates the smp_mb__after_srcu_read_unlock() API, which, in combination with srcu_read_unlock(), guarantees a full memory barrier.
Read-side critical sections need memory barriers, which adds extra overhead.
> Also unlike other RCU flavors, synchronize_srcu() may not be invoked from CPU-hotplug notifiers, due to the fact that SRCU grace periods make use of timers and the possibility of timers being temporarily “stranded” on the outgoing CPU. This stranding of timers means that timers posted to the outgoing CPU will not fire until late in the CPU-hotplug process. The problem is that if a notifier is waiting on an SRCU grace period, that grace period is waiting on a timer, and that timer is stranded on the outgoing CPU, then the notifier will never be awakened, in other words, deadlock has occurred. This same situation of course also prohibits srcu_barrier() from being invoked from CPU-hotplug notifiers.
SRCU relies on timers to drive its grace periods, but a CPU being hot-removed delays them: timers posted to the outgoing CPU can be temporarily "stranded" and will not fire until late in the CPU-hotplug process.
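The per-domain isolation can be sketched with a toy model (single-threaded; real SRCU keeps two per-CPU counter pairs, and the index returned by `srcu_read_lock()` tells `srcu_read_unlock()` which pair to decrement):

```c
/* Toy srcu_struct: one reader counter per domain instead of the real
 * per-CPU counter pairs. */
struct srcu_struct {
	int nreaders;
};

static int srcu_read_lock(struct srcu_struct *ssp)
{
	ssp->nreaders++;
	return 0;               /* toy index; the real one selects a counter pair */
}

static void srcu_read_unlock(struct srcu_struct *ssp, int idx)
{
	(void)idx;
	ssp->nreaders--;
}

/* Toy stand-in for synchronize_srcu(): a domain's grace period only has
 * to wait for readers of that same domain. */
static int srcu_gp_would_block(struct srcu_struct *ssp)
{
	return ssp->nreaders > 0;
}

static struct srcu_struct domain_a, domain_b;   /* two independent domains */
```

A reader in `domain_a` delays `synchronize_srcu(&domain_a)` but has no effect on `domain_b`'s grace periods.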
----
### Tasks RCU
> [The RCU-tasks subsystem](https://lwn.net/Articles/607117/)
LWN, 2014, by Jonathan Corbet
> The solution, in the form of Tasks RCU, is to have implicit read-side critical sections that are delimited by voluntary context switches, that is, calls to schedule(), cond_resched(), and synchronize_rcu_tasks(). In addition, transitions to and from userspace execution also delimit tasks-RCU read-side critical sections.
> The tasks-RCU API is quite compact, consisting only of call_rcu_tasks(), synchronize_rcu_tasks(), and rcu_barrier_tasks(). In CONFIG_PREEMPTION=n kernels, trampolines cannot be preempted, so these APIs map to call_rcu(), synchronize_rcu(), and rcu_barrier(), respectively. In CONFIG_PREEMPTION=y kernels, trampolines can be preempted, and these three APIs are therefore implemented by separate functions that check for voluntary context switches.
Wait for all non-idle tasks to be seen in userspace or to execute a voluntary context switch
- [RCU-tasks implementation](https://lwn.net/Articles/608487/)
- [Tasks RCU](https://01.org/linuxgraphics/gfx-docs/drm/RCU/Design/Requirements/Requirements.html#tasks-rcu)
---
## RCU Implementation
- classic RCU
- tree RCU
- tiny RCU
### Classic RCU
> - [scalable classic RCU implementation](https://lwn.net/Articles/295094/)
`git diff` between classic RCU & tree RCU at Linux 2.6.27
> - [Hierarchical RCU](https://lwn.net/Articles/305782/)
Tree RCU design principle
> before Linux 2.6.27

> Classic RCU's most important data structure is the rcu_ctrlblk structure, which contains the ->cpumask field, which contains one bit per CPU. Each CPU's bit is set to one at the beginning of each grace period, and each CPU must clear its bit after it passes through a quiescent state. Because multiple CPUs might want to clear their bits concurrently, which would corrupt the ->cpumask field, a ->lock spinlock is used to protect ->cpumask, preventing any such corruption. Unfortunately, this spinlock can also suffer extreme contention if there are more than a few hundred CPUs, which might soon become quite common if multicore trends continue. Worse yet, the fact that all CPUs must clear their own bit means that CPUs are not permitted to sleep through a grace period, which limits Linux's ability to conserve power.
`struct rcu_ctrlblk` has a `->cpumask` field with one bit per CPU: a bit set to 1 means that CPU has not yet passed through a quiescent state in the current grace period; the CPU clears its bit to 0 once it has.
Because multiple CPUs maintain this structure concurrently, a `->lock` spinlock is needed to protect `cpumask`; with hundreds of CPUs, that lock suffers severe contention. Worse, since every CPU must clear its own bit, no CPU is allowed to sleep through a grace period, which significantly hurts Linux's power saving.
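The bookkeeping described above can be modeled in a few lines (a toy: the real `->cpumask` lives in `struct rcu_ctrlblk`, and every clear happens under the global `->lock`, which is exactly the contention problem):

```c
#include <stdint.h>

#define NCPUS 8    /* toy CPU count */

/* Toy model of classic RCU's grace-period state. */
static uint64_t cpumask;    /* bit n set: CPU n has not yet passed a QS */

static void start_grace_period(void)
{
	cpumask = (1ULL << NCPUS) - 1;   /* every CPU must report */
}

/* In the kernel this runs under the rcu_ctrlblk ->lock spinlock. */
static void cpu_report_quiescent_state(int cpu)
{
	cpumask &= ~(1ULL << cpu);
}

static int grace_period_done(void)
{
	return cpumask == 0;
}
```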
----
### Tree RCU (Hierarchical RCU)
> - [scalable classic RCU implementation](https://lwn.net/Articles/295094/)
`git diff` between classic RCU & tree RCU at Linux 2.6.27
> - [Hierarchical RCU](https://lwn.net/Articles/305782/)
Tree RCU design principle
Replaced classic RCU in Linux 2.6.27.
#### One effective way to reduce lock contention is to create a hierarchy
Every `rcu_node` has its own lock, so the lower-left `rcu_node` is accessed only by CPU 0 and CPU 1, and during a grace period at most one child `rcu_node` at a time accesses its parent `rcu_node`.
This greatly reduces lock contention: the upper `rcu_node`'s lock used to have six contenders, but now has only three.

- On the fanout settings
> CONFIG_RCU_FANOUT: Number of children for each rcu_node.
> [A Tour Through TREE_RCU’s Data Structures [LWN.net]](https://docs.kernel.org/RCU/Design/Data-Structures/Data-Structures.html)
> As can be seen from the diagram, on a 64-bit system a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout of 64 at the root and a fanout of 16 at the leaves.
The 2016 article describes a configuration of fanout 64 at the root and 16 at the leaves, rather than the original $64 \times 64$: experiments showed that contention occurs more on leaf nodes than on non-leaf nodes, so a leaf fanout of 64 caused too much contention.
> you may use the CONFIG_RCU_FANOUT kernel configuration parameter to reduce the non-leaf fanout as needed
On a 64-bit system with more than 1,024 CPUs (512 on 32-bit), RCU automatically adds a third level to the tree.
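The capacities above follow directly from multiplying the fanouts; a quick check (the fanout values are the defaults described in the Data-Structures document):

```c
/* CPUs an rcu_node tree can cover: the leaf fanout, times the non-leaf
 * fanout for each additional level. */
static long tree_capacity(int levels, int nonleaf_fanout, int leaf_fanout)
{
	long cap = leaf_fanout;

	while (--levels > 0)
		cap *= nonleaf_fanout;
	return cap;
}
```

Two levels with fanouts 64/16 give 1,024 CPUs (512 on 32-bit, where the non-leaf fanout is 32); a third level multiplies the capacity by another 64.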
----
#### The tree of `rcu_node` structures is embedded into a linear array in the `rcu_state` structure
A single array in `rcu_state` records the whole `rcu_node` tree; the indices on each node denote the CPUs that node covers, so the first entry is the root and covers 0:7.
The array is allocated statically at compile time based on `NR_CPUS`.
> This array is allocated statically at compile time based on the value of NR_CPUS.

----
#### How grace periods are detected
In the figure below, red means a CPU has not yet passed through a quiescent state. **Suppose all CPUs now want to tell RCU that they have passed through a quiescent state**; since each update requires the node's lock, a node can be modified by only one CPU at a time:
1. No CPU has passed through a quiescent state yet (figure 1)
2. CPUs 0, 3, and 5 are lucky enough to get the locks, so they turn green (figure 2)
3. Once those CPUs have finished updating, the remaining CPUs acquire the locks in turn and record their quiescent states (figure 3)
4. Each CPU in step 3 notices it is the last one in its pair to report, so it tries to acquire the upper node's lock to update it
5. Over the next three figures, CPUs 1, 2, and 4 take turns acquiring the upper node's lock and updating it (figures 4-6)
6. At that point the grace period has ended

----
#### Per-CPU data
> The implementation maintains some per-CPU data, such as lists of RCU callbacks, organized into rcu_data structures. In addition, rcu (as in call_rcu()) and rcu_bh (as in call_rcu_bh()) each maintain their own hierarchy, as shown in the following figure.
`rcu_data` records per-CPU information, such as the RCU callback lists.

----
#### dynticks-idle CPUs
> [RCU and dynticks-idle mode](http://www.joelfernandes.org/linuxinternals/2018/06/15/rcu-dynticks.html)
When a CPU enters `dynticks-idle` mode, even its timer tick is turned off (to save power), so it cannot tell RCU that it is in a quiescent state; this case requires special handling.
----
#### State Machine


----
#### Quick Quiz
Only the questions are recorded here; the answers are in [Hierarchical RCU](https://lwn.net/Articles/305782/).
- Quick Quiz 1: Wait a minute! With all those new locks, how do you avoid deadlock?
- Quick Quiz 2: Why stop at a 64-times reduction? Why not go for a few orders of magnitude instead?
- Quick Quiz 3: But I don't care about McKenney's lame excuses in the answer to Quick Quiz 2!!! I want to get the number of CPUs contending on a single lock down to something reasonable, like sixteen or so!!!
----
### Tiny RCU
- [[PATCH fyi] RCU: the bloatwatch edition](https://lore.kernel.org/all/20090113221724.GA15307@linux.vnet.ibm.com/)
Tiny RCU patch
- [RCU: The Bloatwatch Edition](https://lwn.net/Articles/323929/)
Tiny RCU design principle
- [Simplifying RCU](https://lwn.net/Articles/541037/)
remove `TINY_PREEMPT_RCU`
please switch to `TREE_PREEMPT_RCU` or `TINY_RCU`
Tiny RCU is the RCU implementation for uniprocessor systems.
It is used when `CONFIG_SMP=n`; its defining property is that as soon as the (single) CPU passes through a quiescent state, the grace period has ended.
In principle, every context switch could then invoke all pending RCU callbacks, although in practice the interaction between the scheduler and RCU must be taken into account.
- Each RCU flavor has a structure like the following:
```c
struct rcu_ctrlblk {
	long completed;
	struct rcu_head *rcucblist;
	struct rcu_head **donetail;
	struct rcu_head **curtail;
};
```
- `completed` is only used by `rcu_batches_completed()` and `rcu_batches_completed_bh()`, which are called during RCU torture testing
- `rcucblist` points to the head of the RCU callback list
- `donetail` references the `next` pointer of the last node in the callback list that has already completed its grace period
- `curtail` references the `next` pointer of the last node in the callback list
Illustrated below:
* The following figure shows a sample callback list that has **two callbacks ready to invoke** and **a third callback still waiting for a grace period** (or, equivalently on a uniprocessor system, for a quiescent state):

#### Tiny RCU Schematic

---
## The RCU API & Config
Evolution and overview of the RCU APIs and configs
> [The RCU API, 2014 Edition](https://lwn.net/Articles/609904/)
> [The RCU API, 2019 edition](https://lwn.net/Articles/777036/)
[Kernel configuration parameters for RCU](https://lwn.net/Articles/777214/)
[The RCU API tables, 2019 edition](https://lwn.net/Articles/777165/)
#### For API details, see
- [The RCU API tables, 2019 edition](https://lwn.net/Articles/777165/)
- [Linux doc: 8. FULL LIST OF RCU APIs](https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html#id11)
### Summary of RCU API changes (2019)
> The big change is the consolidation of the RCU bh, RCU preempt, and RCU sched flavors, which was carried out in response to an exploitable bug resulting from confusion between two flavors. The bug naturally led Linus Torvalds to call for a long-term solution, which ended up being this consolidation.
#### RCU flavor consolidation
The biggest change is the consolidation, described in the quote above, of the RCU bh, RCU preempt, and RCU sched flavors: confusion between flavors led to misuse and an exploitable bug, and at Torvalds's urging the APIs were finally unified.
Before the consolidation, a CONFIG_PREEMPT=y kernel needed distinct synchronize_rcu-style APIs for bh and sched, while in a CONFIG_PREEMPT=n kernel sched and ordinary RCU could share the same synchronize_rcu().
After the consolidation, i.e. since Linux 4.20, a single family of RCU APIs suffices; the difference is that RCU preempt now also waits for bh-disable, irq-disable, and preempt-disable regions of code to complete.
Some examples:
- synchronize_rcu() may be used in place of synchronize_rcu_bh() and synchronize_sched(), as well as all current uses of synchronize_rcu_mult().
- synchronize_rcu_expedited() may be used in place of synchronize_rcu_bh_expedited() and synchronize_sched_expedited().
- call_rcu() may be used in place of call_rcu_bh() and call_rcu_sched().
Note that rcu_read_lock(), rcu_read_lock_bh(), and rcu_read_lock_sched() were *not* unified.
#### Why keep APIs such as rcu_read_lock_bh() and rcu_read_lock_sched()?
If everyone instead used a bare rcu_read_lock() combined with local_bh_disable() and friends to disable bh, irqs, or preemption, the code would no longer clearly express that those disable calls are directly related to RCU.
Also, debugging tools such as lockdep may require, for example, that rcu_dereference_bh() appear explicitly inside an rcu_read_lock_bh() critical section.
----
#### Addition of RCU tasks
> [The RCU-tasks subsystem](https://lwn.net/Articles/607117/)
Added the Tasks RCU flavor.
- RCU tasks has voluntary context switch and user-space execution as its sole quiescent states, but applies only to non-idle tasks.
- only in CONFIG_PREEMPT=y kernels
----
#### SRCU updates
- The main SRCU change was in 2014: the implementation was reworked for scalability.
- Because the tracepoint code uses SRCU, `_notrace()` variants such as `srcu_read_lock_notrace()` were added.
Noticed that the SRCU `_notrace` APIs are missing from the Linux docs; maybe send a patch?
https://github.com/torvalds/linux/commit/17f0da13873ba393a72f14d41ffc8ff388e38723
https://github.com/torvalds/linux/commit/3398496483df3508764d24917deaa8ab5176969e
https://lwn.net/Articles/777165/
https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html#id11
The header could also add the 2019 APIs.
:::info
2022/08/20
Patch sent, updating the LWN 2019 RCU API article linked at the top:
https://lkml.org/lkml/2022/8/20/103
:::
---
## [A Tour Through TREE_RCU’s Expedited Grace Periods](https://docs.kernel.org/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html)
> [expedited "big hammer" RCU grace periods](https://lwn.net/Articles/338843/)
> Unlike RCU’s normal grace periods, which accept long latencies to attain high efficiency and minimal disturbance, expedited grace periods accept lower efficiency and significant disturbance to attain shorter latencies.
A normal RCU grace period accepts longer latency in exchange for higher efficiency and less disturbance.
An expedited grace period is the opposite: it trades efficiency and disturbance for lower latency.