---
tags: rcu, NOTES
---

# So What Has RCU Done Lately? (2023)

> [youtube](https://youtu.be/9rNVyyPjoC4)
> [slides](https://drive.google.com/file/d/19Oa5phwRImKr4sUirjYxc8oOi0lKHxVL/view)

## Takeaways

This talk covers the recent changes to Read-copy-update (RCU), the reasons behind them, and where development is headed. It goes through the problems and feedback that different RCU variants and APIs ran into in real-world use, and the solutions or workarounds the developers came up with in response.

Although the concept of RCU is simple (RCU is dead simple), once it is integrated into the Linux kernel it inevitably has to take many aspects into account. After meeting real-world constraints, a simple concept has spawned all kinds of complex variants, and covering every case is really not easy. I think this is why RCU, which was added to the Linux kernel back in 2002, is still under development today:

- Whether backporting to older versions will go wrong
- Possible overflow on 32-bit systems
- Use cases ranging from embedded systems to high-performance computing all have to be considered
- How to coexist with mechanisms such as Lockdep and BPF trampoline
- How to run correctly in real-time environments without unduly affecting other tasks
- Battery lifetime on mobile devices (ChromeOS, Android)
- RCU callback handling also has to consider cache locality (e.g. [RCU Callback Handling](https://www.kernel.org/doc/html/v5.5/RCU/Design/Data-Structures/Data-Structures.html#rcu-callback-handling))

:::success
Background: what is Read-copy-update (RCU)
- [Linux 核心設計: RCU 同步機制](https://hackmd.io/@sysprog/linux-rcu) (Linux kernel design: the RCU synchronization mechanism)
- [What is RCU? – "Read, Copy, Update"](https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html)
- [What is RCU, Fundamentally?](https://lwn.net/Articles/262464/)
:::

---

## Opening

![](https://i.imgur.com/sCCUNbM.png)

Is RCU still under active development? Of course, and so are spinlocks and atomic operations.

---

## Classic RCU review

Paul goes over this in almost every talk XD

![](https://i.imgur.com/j5LXxVm.png)

The bottom-right case must never happen, because it carries a potential use-after-free risk. `synchronize_rcu()` is supposed to prevent it.

----

### Code Animation

(As I recall, the animation is from another talk; it was not shown here.)

![](https://i.imgur.com/0ubObiB.png)

The left and right sides are both readers, and the middle is the updater. Although the updater has updated the data from the blue node to the green node, the reader on the left still reads the obsolete data, i.e. the blue node.

Paul says RCU's lack of synchronization is like the way we experience the world through a computer: the part above the red line in the figure is like the information about the world that we get through the computer, while the world itself keeps changing all the time.

----

### Core APIs

![](https://i.imgur.com/00THqk8.png)

----

### RCU compared to rwlock

![](https://i.imgur.com/Rj2B0XD.png)

Get much more efficient usage of CPU
- red part: CPU delay

----

### RCU compared to rwlock: Scalability (Empty critical section)

![](https://i.imgur.com/LYIrGGY.png)

- y-axis: ns per operation (lower is better)
- x-axis: number of CPUs

----

### RCU compared to rwlock: Scalability (Non-Empty critical section)

![](https://i.imgur.com/RSARO2i.png)

- y-axis: ns per operation (lower is better)
- x-axis: critical section duration (ns)

#### In short:

If you have few CPUs and long critical sections, rwlock is good enough for you.
But if you have many CPUs and very short critical sections (protecting a very fast data structure like a hash table), RCU can bring huge benefits.

----

### Restrictions

![](https://i.imgur.com/IiEI6bc.png)

==RCU is a specialized tool==

----

### Use cases

![](https://i.imgur.com/N4SKcHj.png)

Compared with the alternatives, RCU is simpler and is used in more places.

---

## RCU Changes

### Outline

(Only covered briefly here, because each topic could be a full presentation on its own.)

![](https://i.imgur.com/P92GE5f.png)

----

### Flavor consolidation

Changed in 2019. The original design was rather complex (readers of each RCU flavor had to be paired with the matching flavor of `synchronize_rcu`), and misuse by developers could lead to an exploit (security issue), so the flavors were consolidated and simplified.

- Incorrect use leading to an exploit
  ![](https://i.imgur.com/IyKhZR7.png =600x)
- before
  ![](https://i.imgur.com/Jl5MzFK.png =600x)
- after
  ![](https://i.imgur.com/kP0wXKd.png =600x)

#### Backport

![](https://i.imgur.com/BJTeJP4.png =600x)

Backporting to v4.19 or earlier requires changes; use `synchronize_rcu_mult()` (mult: multiple).

:::success
- [RCU flavor consolidation](https://lwn.net/Articles/777036/#Summary%20of%20RCU%20API%20Changes)
:::

----

### Optional Lockdep expression

![](https://i.imgur.com/prtCkM8.png =600x)

Improves the integration with Lockdep.
It tells lockdep that being an RCU reader or holding the lock is fine either way, so it should not report a warning (it allows shared code to work better).
\>> Get good debug ability without API explosion

:::success
What is lockdep?
- [Interrupts, threads, and lockdep](https://lwn.net/Articles/321663/)
- [Runtime locking correctness validator](https://docs.kernel.org/locking/lockdep-design.html)
- [Lockdep-RCU](https://lwn.net/Articles/371986/)
- [The kernel lock validator](https://lwn.net/Articles/185666/)
:::

----

### Single-argument kfree_rcu() and kvfree_rcu()

1. The original `kvfree_rcu()` takes two arguments; the second is the name of the rcu_head field.
2. This works fine, but when the structure is very small and there is a huge number of them in the kernel, the extra `rcu_head` (two pointers, so 16 bytes on a 64-bit system) becomes a problem. The new `kvfree_rcu()` therefore drops the second argument, which addresses that problem. The trade-off is that tracking is done by allocating memory instead, so under out-of-memory conditions the call sleeps, which can add latency.

   ![](https://i.imgur.com/jxJUpdU.png)

3. Because single-argument `kvfree_rcu()` may sleep, the extra sleep can cause SLA trouble or lockdep warnings even where the context permits sleeping. It was therefore later renamed with a suffix, becoming `kvfree_rcu_mightsleep()`. Developers keep the ease of use but have to type more XD (this both prevents accidental use of single-argument `kvfree_rcu()` and separates atomic contexts from non-atomic contexts).

   ![](https://i.imgur.com/FiIyY69.png)

:::success
- `kfree_rcu()`: If the callback for call_rcu() is not doing anything more than calling kfree() on the structure, you can use kfree_rcu() instead of call_rcu() to avoid having to write your own callback:
  > reference from [What is RCU? – “Read, Copy, Update”](https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html)
- [rcu: introduce kfree_rcu()](https://lwn.net/Articles/433493/)
- source code: [#define kvfree_rcu(...)](https://elixir.bootlin.com/linux/v6.2.11/source/include/linux/rcupdate.h#L1013)
- Rename kvfree_rcu() to kvfree_rcu_mightsleep(): The goal is to avoid accidental use of the single-argument forms, which can introduce functionality bugs in atomic contexts and latency bugs in non-atomic contexts.
  > reference from [Re: [PATCH v2 01/14] drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()](https://lore.kernel.org/lkml/e5b78f91-122a-0b0d-8d3f-922d462ba44d@kernel.dk/)
:::

----

### Polled Grace-Period APIs

1. Usage flow of the original polled grace-period API

   After obtaining a cookie, you can do something else first and then check the cookie with `cond_synchronize_rcu()` to find out whether the grace period has already completed. If it has not, `synchronize_rcu()` is used to wait until it does (if the something you do takes a while, you get safety but no extra overhead).

   ![](https://i.imgur.com/tQy2Dq4.png)

2. Usage flow of the new polled grace-period API

   ![](https://i.imgur.com/ZLzEZRI.png)
   ![](https://i.imgur.com/60FPyrb.png)

The original API has potential problems that can cause `synchronize_rcu()` to get a false positive:

1. Counter wrap: the limited number of counter bits leads to overflow (more likely on 32-bit systems). This is addressed by using `get_completed_synchronize_rcu()`.
2. 
   Even with `get_completed_synchronize_rcu()`, when normal and expedited `synchronize_rcu()` calls overlap, the normal and expedited grace periods end up competing for the same state variable (cookie). (Alternatively, use two `synchronize_rcu()` calls, since one of them is bound to win XD.)
   $\to$ This is handled by the `_full`-suffix APIs, which keep two counters, one for each kind of grace period, so that the two stay separate and do not compete.

==Lockless grace period API==

:::success
- [[PATCH rcu 01/12] rcu: Make normal polling GP be more precise about sequence numbers](https://lore.kernel.org/lkml/20220722010341.GC1790663@paulmck-ThinkPad-P17-Gen-1/T/)
- [get_completed_synchronize_rcu_full()](https://elixir.bootlin.com/linux/v6.2.11/source/kernel/rcu/tree.c#L3579)
- [struct rcu_gp_oldstate](https://elixir.bootlin.com/linux/v6.2.11/source/include/linux/rcutree.h#L44)
:::

----

### Tasks Trace RCU and Tasks Rude RCU

![](https://i.imgur.com/0sHWpJO.png)

For tracing and BPF

----

### Runtime RCU Callback (De-)Offloading

1. Without callback offloading, invoking callbacks generates an interrupt, so even high-priority work gets interrupted. This is bad for real-time scenarios.

   ![](https://i.imgur.com/02Q4Xvs.png)

2. With callback offloading, callbacks are handed to dedicated rcuo kthreads. These are normal tasks that the scheduler can schedule, so high-priority tasks are not interrupted (this can also reduce jitter and save power).

   ![](https://i.imgur.com/tp8KsTy.png)

3. However, the original callback offloading required deciding at boot time which CPUs are offloaded and which are not, which is inflexible and cumbersome. It was therefore improved so that, within limits, offloading can be adjusted at runtime for each CPU.

   ![](https://i.imgur.com/ah3EMnf.png)

:::success
- [rcu/nocb: De-offload and re-offload support v3](https://lwn.net/Articles/835039/)
:::

----

### SRCU memory-footprint diet

1. The earliest srcu_struct was small (hundreds of bytes) but had only one callback list, which caused global contention.
2. Since v4.12, srcu_node was added. Its combining-tree structure keeps lock contention under control even with a large number of CPUs.
3. The resulting problem is that it has to be sized at build time, when only the number of CPUs (NR_CPUS) is known. Some distros set NR_CPUS as high as 4096. 26KB is not much of a burden for a big machine, but some people embed srcu structs inside other structures. The enclosing structure then becomes so large that an offset can no longer be obtained with a short assembly instruction, which makes the compiler generate worse code.

   ![](https://i.imgur.com/SPtfXgG.png)

4. The latest approach separates `srcu_struct` from `srcu_node` and allocates the latter only when needed. Currently the default is not to allocate `srcu_node` when there are fewer than 128 CPUs.

   ![](https://i.imgur.com/8FLVhxB.png)

5. The point at which the transition happens can be tuned through parameters.

   ![](https://i.imgur.com/7RMY1yM.png)

==memory is cheap, but not that cheap==

----

### Real-time expedited grace periods

![](https://i.imgur.com/4bw5azw.png)

----

### Lazy RCU Callbacks

On a near-idle device, for example when the user just opens a file, closes it, opens it, closes it, over and over, the resulting pile of grace periods wastes power unnecessarily. So lazy grace periods were introduced in v6.2.
\>> laziness can be a virtue

![](https://i.imgur.com/cxqh5se.png)
![](https://i.imgur.com/20fVYfH.png)

:::success
- [Linux 6.2 Likely To Enjoy Measurable Power-Savings While Idle Or Lightly Loaded](https://www.phoronix.com/news/Lazy-RCU-Likely-For-Linux-6.2)
:::

---

## Miscellaneous

![](https://i.imgur.com/X4YKgnl.png)

---

## Future

![](https://i.imgur.com/4Bl8Nd9.png)

---

## Trends

- upper line: number of RCU commits
- lower line: number of RCU commits not authored by Paul

![](https://i.imgur.com/3pRdYXO.png)

- Paul's share of RCU commits

![](https://i.imgur.com/8r5Oz84.png)

- As mentioned in earlier talks as well, the hope is that the RCU community keeps growing stronger

![](https://i.imgur.com/m8HwJmh.png)

----

### Summary

![](https://i.imgur.com/5wFfYUq.png)

Last and most important: RCU is still under development and keeps evolving according to the needs of its present, past, and future users.

:::success
- [Recent RCU changes](https://lwn.net/Articles/894379/)
- [Paul E. McKenney's Journal](https://paulmck.livejournal.com/)
- [[PATCH 1/1] Reduce synchronize_rcu() waiting time](https://lore.kernel.org/lkml/20230321102748.127923-1-urezki@gmail.com/)
:::
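
As a rough illustration of the counter-wrap issue noted under Polled Grace-Period APIs, here is a minimal userspace sketch in Python. It is a toy model and not kernel code: the class `ToyGracePeriods`, its methods, and the 8-bit counter width are all made up for this example. A cookie names the grace period that has to complete, and polling compares the cookie against a wrapping counter.

```python
class ToyGracePeriods:
    """Toy model of a wrapping grace-period counter (hypothetical, not kernel code)."""

    def __init__(self, bits=8):
        # A deliberately tiny counter width so that wrap is easy to trigger.
        self.bits = bits
        self.mask = (1 << bits) - 1
        self.completed = 0  # completed grace periods, modulo 2**bits

    def get_state(self):
        """Return a cookie naming the next grace period that must complete."""
        return (self.completed + 1) & self.mask

    def advance(self, n=1):
        """Pretend that n more grace periods have completed."""
        self.completed = (self.completed + n) & self.mask

    def poll_state(self, cookie):
        """Return True if the grace period named by cookie looks complete.

        The wrap-aware comparison treats the counter as "at or past" the
        cookie when the modular distance is less than half the counter range.
        """
        distance = (self.completed - cookie) & self.mask
        return distance < (1 << (self.bits - 1))


if __name__ == "__main__":
    gp = ToyGracePeriods(bits=8)
    cookie = gp.get_state()
    print("pending right after get_state:", not gp.poll_state(cookie))
    gp.advance()
    print("complete after one grace period:", gp.poll_state(cookie))
    gp.advance(128)  # hold the cookie across half the counter range
    print("stale cookie looks pending again:", not gp.poll_state(cookie))
```

In this model a cookie that is held across more than half of the counter range is misjudged as still pending, which is the kind of wrap problem a narrow counter makes more likely. The toy does not model the separate normal and expedited counters behind the `_full` APIs.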