Linux 核心專題: RCU 研究

# Linux 核心專題: RCU 研究 > 執行人: Korin777 > [專題解說錄影](https://youtu.be/1JhPAoVnCrY) ## 簡述本文探討 Linux 核心的 RCU 同步機制及 userspace RCU 如何落實。發軔於 Linux 核心的 RCU 不只是作業系統核心的基礎建設，其概念也在 Meta (Facebook) 的 folly 函式庫，甚至新的 C++ 標準中落實，這反映出現代硬體的 scalability 驅動開發模式的變遷。 > [Why RCU Should be in C++26](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2545r1.pdf) RCU 最主要的應用是並行讀取遠多於寫入的情況，針對特定情境進行特化，是工程上常用的手段。RCU 最大化讀取操作，然後犧牲寫入的效能。換句話說，若在一個有大量寫入的情境下，RCU 的效能很有可能比用 mutex lock 或其他 lock-free 資料結構還要差。為了簡化讀取端操作，大量運用 atomic pointer 讀取對應的物件，需要付出的成本相當低，但並非沒有，其實 atomic reference 在不同的 CPU 上速度落差很大 > same core ~20ns, AMD same CCX (shared L3 cache across a group of cores) ~50ns, same chip (different L3 cache) 100-200ns, different chip 300-400ns > 參見 [Evaluating the Cost of Atomic Operations on Modern Architectures](https://arxiv.org/pdf/2010.09852.pdf) 儘管 cache coherence protocol 實作手法相當多，本質上都是這幾個位元的狀態移轉: * M=modified * E=exclusive * O=owned * S=shared * I=Invalid * F=forward 以上為一個 cache 能擁有的狀態，一般以二到三個 bit 來儲存。以瞭解 atomic 實作原理的角度來看，最重要的莫過於知道說 shared 代表哪個 cache 可以多個核使用。因此在 atomic 沒有被更改的情況下，讀取速度會是 L1 cache 的上限。若有寫入，則根據實作會在多個 state 中轉換。例如 MEOSI 會這麼實作：一個核標記自身的 cache 為 Owned，把其他核變為 invalidate，更改值後變成 modified，寫入下層的 cache 或記憶體，所有的核中的 cache 再重新讀回變成 shared。寫入端是讓 RCU 變得複雜的地方。首先，一次只能有一個寫入者，這以 lock 保護。寫入的邏輯是先做出一個物件 (此時讀取端仍可自由讀舊物件)，然後更新 atomic 指標來指向新的物件。atomic 指標更新後，新的讀取端就會讀到新的物件。這樣的優點是: * 寫入端完全不會檔到讀取端，因此十分適合讀取量遠大於寫入的情境 * 與 reader-writer mutex lock 相比，reader-writer lock 需要讀取端去更新 atomic reference count，後者成本高昂。這衍伸的問題是，寫入端無法知道舊物件該在何時釋放。更新完 atomic 指標後，其他的執行緒可能還在使用舊物件。如果過早釋放，就會造成segmentation fault，RCU 將更新完 atomic 指標後，到可安全回收物件的時間，叫做 grace period。RCU 最困難的地方，就是維持 grace period 的正確性，且不可無止盡等待 grace period。 ### urcu-qsbr ![](https://hackmd.io/_uploads/B1PGeiIO2.png) 如上圖，與其他 [flavor](https://lwn.net/Articles/573424/#URCU%20Flavors) 相比， reader 需要定期進入 quiescent states，讓 grace period (以下簡稱 GP) 能夠結束，相關程式碼可見 [`src/urcu-qsbr.c`](https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c)、[`include/urcu/static/urcu-qsbr.h`](https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h)，以下探討其實作 #### 追蹤 GP 相關的結構體首先看到以下結構體 ```c // include/urcu/static/urcu-common.h struct urcu_gp { unsigned long ctr; int32_t futex; } __attribute__((aligned(CAA_CACHE_LINE_SIZE))); ``` 用以追蹤 GP 的 global variable [`urcu_qsbr_gp`](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/src/urcu-qsbr.c#L73) * `ctr` : 每經過一次 GP (透過[`urcu_qsbr_synchronize_rcu`](https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L239)) 會增加(透過 mutex lock 保護)，用來判斷 readers 的狀態 * `futex` : 若 GP 還不能結束，透過 [futex](https://man7.org/linux/man-pages/man2/futex.2.html) 等待下個 reader 進入 quiescent state 再 wake up 正在等待 GP 的執行緒 ```c // include/urcu/static/urcu-qsbr.h struct urcu_qsbr_reader { unsigned long ctr; struct cds_list_head node __attribute__((aligned(CAA_CACHE_LINE_SIZE))); int waiting; pthread_t tid; unsigned int registered:1; }; ``` 用來追蹤 reader 在 GP 後是否進入過 quiescent state 的 thread local variable [`urcu_qsbr_reader`](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/src/urcu-qsbr.c#L84) * `ctr` : reader 每次進入 quiescent state 或透過 [`_urcu_qsbr_thread_online`](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/include/urcu/static/urcu-qsbr.h#L213) 重新 active 時會與全域 GP 計數器 (`urcu_qsbr_gp.ctr`) 同步，用來表示 reader 的狀態([`urcu_qsbr_reader_state`](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/include/urcu/static/urcu-qsbr.h#L99)) * `URCU_READER_INACTIVE` : reader 透過 [`_urcu_qsbr_thread_offline`](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/include/urcu/static/urcu-qsbr.h#LL195C20-L195C45) 進入 extended quiescent state ， `ctr == 0` * `URCU_READER_ACTIVE_CURRENT` : reader 在 GP 後已進入過 quiescent state ， `ctr == urcu_qsbr_gp.ctr` * `URCU_READER_ACTIVE_OLD` : reader 在 GP 後尚未進入過 quiescent state * `node` : 每個 reader 一開始都要透過 [`urcu_qsbr_register_thread`](https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#LL479C6-L479C31) 將 reader 串連在一個全域雙向鏈結串列 [`registry`](https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L86) ，藉由 `registry` 看每個 reader 的狀態判斷 GP 能否結束 * `waiting` : `urcu_qsbr_synchronize_rcu` 時，若 reader 狀態為 `URCU_READER_ACTIVE_OLD`，亦即 GP 在等待該 reader (以下代稱 `wating reader`) ，就將 `waitng` 設為 1 ，表示這個 reader 進入 quiescent state 要負責喚醒正在等待 GP 的執行緒。 #### Grace Period 接著來看 GP 的等待 ![](https://hackmd.io/_uploads/SyWblsUd3.png =80%x80%) [`urcu_qsbr_synchronize_rcu`](https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L356) : 函式中 [Global grace period counter 增加](https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L401)後就是 GP 的開始。此處用一個 global 的 [wait-free stack](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/include/urcu/wfstack.h#L89) [`gp_waiters`](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/src/urcu-qsbr.c#L92) 來追蹤正在等待 GP 的執行緒 (以 `struct urcu_wait_node` 的形式) ```c // src/urcu-wait.h struct urcu_wait_node { struct cds_wfs_node node; int32_t state; /* enum urcu_wait_state */ }; ``` 其中 * `node` : 每個 thread 透過 local [`struct urcu_wait_node wait`](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/src/urcu-qsbr.c#LL244C2-L244C23) 的 `node` 加到 `gp_waiters` * `state` : 等待 GP 的執行緒狀態 * `URCU_WAIT_WAITING` : 表示 thread 需要等待別人喚醒，也作為 futex wait 所預期的值 * `URCU_WAIT_WAKEUP` : 表示 thread 將被喚醒，由喚醒的 thread 來設定這個 bit mask * `URCU_WAIT_RUNNING` : thread 被喚醒後自己設定這個 bit mask 來避免額外的 futex wake 呼叫 * `URCU_WAIT_TEARDOWN` : 表示 grace period 結束，由喚醒的 thread 來設定這個 bit mask 可能有多個執行緒同時呼叫 `urcu_qsbr_synchronize_rcu` ，因此在進入 GP 前，`gp_waiters` 會有一個以上的 thread ，可以發現在進入 GP 前 `gp_waiters` 中的 thread 他們所需等待的 GP 是一致的 (grace period sharing) ，因此[第一個加到 `gp_waiters` 的 thread](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/src/urcu-qsbr.c#L267) (以下代稱 `wakeup thread`) 在進入 GP 前會將 `gp_waiters` 裡的 thread 都移到 local 的 wait-free stack [`waiters`](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/src/urcu-qsbr.c#L245) 並負責增加 Global GP counter 及等待 GP 結束，而其他 thread (在 `waiter` 裡面的那些，以下代稱為 waiting thread) 只要透過 [`urcu_adaptative_busy_wait`](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/src/urcu-wait.h#L146) 去等待 `wakeup thread` 通知 GP已結束。 [`rcu_gp_lock`](https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L62) : mutex lock ，保護 global GP counter (確保 GP 能正常結束)，同時也讓下一個 `wakeup thread` 阻塞(目前的 `wakeup thread` 在清空 `gp_waiters` 後，就會有下一個 wakeup thread 也要去等待 GP，但它必須等待前一個 GP 結束，將它阻塞在 [`urcu_move_waiters`](https://github.com/urcu/userspace-rcu/blob/e8363ee3e56dd3532147aa5704cf8107801ded9d/src/urcu-wait.h#L99) 前來將同一批 GP 的 thread 集中) [`urcu_wake_all_waiters`](https://github.com/urcu/userspace-rcu/blob/e8363ee3e56dd3532147aa5704cf8107801ded9d/src/urcu-wait.h#L202)、[`urcu_adaptative_wake_up`](https://github.com/urcu/userspace-rcu/blob/e8363ee3e56dd3532147aa5704cf8107801ded9d/src/urcu-wait.h#L127)、`urcu_adaptative_busy_wait` : `wakeup thread` 透過 `urcu_wait_node.state` 來通知 `waiting thread` grace period 結束。在 Global GP counter 增加後，透過 `wait_for_readers` 來等待每個 reader 在 GP 後都進入過 quiescent state。 [`wait_for_readers`](https://github.com/urcu/userspace-rcu/blob/106ed13754b1b836f4b59405f4e02aea4bf5eef0/src/urcu-qsbr.c#L157) : 透過 `registry` 來查看每個 reader 的狀態，將進入過 quiescent state 的 reader 從 `registry` 移動到 local 的 doubly-linked list `qsreaders` ，當 `registry` 為空就表示 GP 可結束，可以發現一開始是使用 busy waiting 的方式，若等待太久則會改為[使用 futex 等待](https://github.com/urcu/userspace-rcu/blob/e8363ee3e56dd3532147aa5704cf8107801ded9d/src/urcu-qsbr.c#L219) (這裡使用的 futex word 就是 `urcu_qsbr_gp.futex`) `waiting reader` 進入 quiescent state 在喚醒 `wakeup thread` ，除了讓出 cpu 資源也可以避免一直去競爭 `rcu_registry_lock` 這個 mutex lock。 [`rcu_registry_lock`](https://github.com/urcu/userspace-rcu/blob/e8363ee3e56dd3532147aa5704cf8107801ded9d/src/urcu-qsbr.c#L72) : 用來保護 `registry` 的 lock ，避免 reader 和 `wakeup thread` 同時去插入或移除。`wait_for_readers` 返回就是 GP 的結束，將 `qsreaders` 中的 reader 重新加入 `registry` 並釋放 lock 及喚醒 `wating thread` 來回收不再使用的記憶體，`rcu_gp_lock` 釋放的同時就可進行下一輪的 GP，自此為一輪 GP 的等待總結來說，GP 開始後，所有 reader 都必須進入過 quiescent state 才能去釋放記憶體，因此 reader 多久進一次 GP 就是使用者要考慮的 tradeoff ，太長會導致較多的記憶體使用，太短則可能影響 reader 的效能。 #### Reader 接著看 reader 與 writer 的行為，可參考 [test_urcu_qsbr.c](https://github.com/urcu/userspace-rcu/blob/master/tests/benchmark/test_urcu_qsbr.c) 了解使用方式， `test_rcu_pointer` 為共享的資料， reader 負責去讀它， writer 則負責 publish 新的資料 * reader 在開始與結束分別要呼叫以下函式，將 reader 加入/移除 grace period 的等待列表(`registry`) * [`urcu_qsbr_register_thread`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/src/urcu-qsbr.c#L479) : 一定要在第一次進入 critical section 前呼叫，初始化 `urcu_qsbr_reader` 並加到 `registry` ，最後將 `urcu_qsbr_reader.ctr` 與 Global grace period counter 同步(也就是透過 `_urcu_qsbr_thread_online` active 這個 reader ) * [`urcu_qsbr_unregister_thread`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/src/urcu-qsbr.c#L492) : 一定要在 reader 結束前呼叫，透過 `_urcu_qsbr_thread_offline` 進入 extended quiescent state ，在將 reader 從 `registry` 移除 * reader 的 critical section 要用下列函式包住，保證在 critical section 中的 RCU-protected data structure 不被釋放，同時避免在 critical section 中誤用 `urcu_qsbr_synchronize_rcu`(會讓 reader 進入 extended quiescent state 導致被 rcu 保護的資料可能被釋放)，可以發現這兩個函式都是空的( `urcu_assert_debug` 只在 enable debug 時用來檢查 reader 已經 active)，這也是為何 qsbr 是所有 flavor 中 reader-side overhead 最低的原因 * [`_urcu_qsbr_read_lock`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/include/urcu/static/urcu-qsbr.h#L118) * [`_urcu_qsbr_read_unlock`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/include/urcu/static/urcu-qsbr.h#L130) * 因為 reader 進出 critical section 不與 Global grace period counter 同步也不進入 quiescent state ，所以必須定期呼叫下列函式來幫忙完成上面沒做的事，讓 grace period 可以結束 * [`_urcu_qsbr_quiescent_state`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/include/urcu/static/urcu-qsbr.h#L177) : 若 `urcu_qsbr_reader.ctr` 與 Global grace period counter 不同(代表有新的 grace period 且這個 reader 尚未進入過 quiescent state ，相同則不需要做任何事)，就透過 [`_urcu_qsbr_quiescent_state_update_and_wakeup`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/include/urcu/static/urcu-qsbr.h#L156) 更新 `urcu_qsbr_reader.ctr` 及在必要時喚醒 `wakeup thread`(根據 `urcu_qsbr_reader.waiting` 和 `urcu_qsbr_gp.futex`) * reader 必須在 critical section 內透過 `rcu_dereference` 來取得 RCU-protected data structure (與 `rcu_assign_pointer` 合用來作到 Publish/Subscribe Guarantee) * [`_rcu_dereference`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/include/urcu/static/pointer.h#L100) : 有兩種實作，一個是透過 C11 memory_order_consume load ，另一個則是 volatile casts and (for DEC Alpha) memory barriers #### Writer * writer 透過 `rcu_assign_pointer` 來發布新的資料，以 [`test_urcu_qsbr.c` 為例](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/tests/benchmark/test_urcu_qsbr.c#L203)，透過 cpu barrier 和 compiler barrier 來限制 assignment to `test_rcu_pointer` 會在 assignment to field reference by `new` 之後，也就確保 reader 讀到 `new` 它一定是已完成初始化的 ```c *new = 8; old = rcu_xchg_pointer(&test_rcu_pointer, new); ``` * [`rcu_assign_pointer`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/include/urcu/static/pointer.h#L156) : publish 新的資料並回傳 assigned value，裡面的 `cmm_wmb` 就是 memory barrier，在 x86 會是 `__asm__ __volatile__ ("sfence"::: "memory")` 前面的 [sfence](https://www.felixcloutier.com/x86/sfence) 就是 cpu barrier 後面的 memory 則是 compiler barrier * [`rcu_xchg_pointer`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/include/urcu/uatomic/x86.h#L128) : 與 `rcu_assign_pointer` 相同但回傳 old value 來做 memory reclamation ， x86 中 [lock prefix](https://www.felixcloutier.com/x86/lock) 可以做到類似 full barrier([mfence](https://www.felixcloutier.com/x86/mfence.html)) 的效果 * publish 完新的資料後，在透過 `synchronize_rcu` 等待 grace period 來安全的回收舊的資料 ### qsbr-mb 這個 flavor 有較快的 grace period detection，但單純的在 read side 加入 memory barrier 所產生的 overhead 導致 reader 效能較差，實作上 `mb`、`signal`、`MEMBARRIER` 類似，後面兩個就是去改善 read side 的 memory barrier，相關程式碼可見 [`src/urcu.c`](https://github.com/urcu/userspace-rcu/blob/master/src/urcu.c)、[`include/urcu/static/urcu-mb.h`](https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h) #### Structure for tracing grace period reader 的結構體與 `qsbr` flavor 不同，與 `signal`、`MEMBARRIER` flavor 相同 ```c // include/urcu/static/urcu-common.h struct urcu_reader { /* Data used by both reader and synchronize_rcu() */ unsigned long ctr; char need_mb; /* Data used for registry */ struct cds_list_head node __attribute__((aligned(CAA_CACHE_LINE_SIZE))); pthread_t tid; /* Reader registered flag, for internal checks. */ unsigned int registered:1; }; ``` * 多了 `need_mb`，不過它主要在 `signal` flavor 使用 #### Grace Period `qsbr` flavor 中有實作兩種 `synchronize_rcu` 的方法，而在 `mb`、`signal`、`MEMBARRIER`、`bp` flavor 中只實作了其中的另一種方法 (two-phase approach)，這是因為這 4 種 flavor 必須用一半的 bits 來紀錄 nesting critical section 所以在 64 bits 一樣會遇到 overflow 的問題上面介紹過的方法在 32-bit 可能會因為 Global grace period counter overflow 產生錯誤，所以只能在 64-bit 使用(雖然也會 overflow 但必須花數千年的時間，因此可以忽略這個可能性)，而 two-phase approach 就是來解決這個問題，代價為延長等待 grace period 的時間 > The fundamental issue is that there is no way to copy a value from one memory location to another atomically. Suppose reader thread T is preempted just before executing the STORE_SHARED() call in rcu_thread_online(), after the LOAD_SHARED() call has returned. Until the store takes place the thread is still in its extended quiescent state, so there is nothing to prevent other threads from making multiple calls to synchronize_rcu() (and thereby incrementing rcu_gp_ctr) during the preemption delay. If the counter cycles through all but one of its values, the stale value finally stored in thread T ’s rcu_reader.ctr will actually be rcu_gp_ctr’s next value. As a result, if another thread later calls synchronize_rcu() after T has entered a read- side critical section, then update_counter_and_wait() might return before T has left this critical section, in violation of RCU’s semantics #### overflow problem ![](https://hackmd.io/_uploads/ByX1giUO3.png) #### two-pahse approach ![](https://hackmd.io/_uploads/r1ICyoL_h.png) 下面探討 two-phase approach 實作，首先來看 global grace period counter 和 reader counter 的關係 * [`rcu_gp.ctr`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/src/urcu.c#L134) : 一樣只會在 grace peirod 變更，藉由第 [`URCU_GP_CTR_PHASE`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/include/urcu/static/urcu-common.h#L63) 個 bit (以下稱 phase bit) 來判斷 active reader 的狀態 * [`rcu_reader.ctr`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/src/urcu.c#L134) : phase bit 會在進入最外層的 critical section 時與 `rcu_gp.ctr` 同步，其餘的 lower bit 用來紀錄 nesty critical section 數量 * [`urcu_common_reader_state`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/include/urcu/static/urcu-common.h#L107) : 判斷 reader 的狀態 * `URCU_READER_INACTIVE` : reader lower bit 為 0 (透過 [`URCU_GP_CTR_NEST_MASK`](https://github.com/urcu/userspace-rcu/blob/ae0b76db76aa9614381625d93f228a0ec71d4222/include/urcu/static/urcu-common.h#L64) 檢查)，也就是 reader 不在 critical section (extended quiescent state) * `URCU_READER_ACTIVE_CURRENT` : reader 的 phase bit 跟 global grace period counter 一樣，也就是 reader 在 grace period 後可能進入過 quiescent state ，之所以不保證 reader 一定進入過的原因跟上面的 overflow 一樣 * 當 reader 準備離開 quiescent state 讀取 global grace period counter (假設 phase bit 為 1)但尚未存入自己的 counter ，這時發生搶佔並且其他 thread 透過 `synchronize_rcu` 將 global grace period counter 的 phase bit 切換成 0，因為 reader 還在 quiescent state `synchronize_rcu` 可以成功返回，之後 reader 同步 counter 並存取 RCU-protected data ，然後其他 writer 在更新 RCU-protected data 並透過 `synchronize_rcu` 嘗試釋放舊的資料( global grace period counter phase bit 又被切換成 1)，可以發現 reader 這時會是 `URCU_READER_ACTIVE_CURRENT` 的狀態，因此 `synchronize_rcu` 返回並釋放 reader 目前正在使用的記憶體 * `URCU_READER_ACTIVE_OLD` : reader 的 phase bit 跟 global grace period counter 不同，也就是 reader 在 grace period 後尚未進入過 quiescent state 接著來看 [`synchronize_rcu`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/src/urcu.c#L408) 有什麼不同，[第一次呼叫 `wait_for_readers`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/src/urcu.c#L458) 就是 grace period 的開始 * 在切換 phase bit 前，會先呼叫第一次 `wait_for_readers`(phase 1) 確定所有 reader 都與 global grace period counter 有相同的 phase bit 或已進入 quiescent state * 將進入 quiescent state 的 reader 從 `registry` 移到 `qsreaders` * phase bit 與 global grace period counter 相同的 reader 則移到 `cur_snap_readers` (無法保證裡面的 reader 在 grace period 後進入過 quiescent state) * 等到 registry 為空時，便切換 global grace period counter 的 phase bit 並呼叫第二次 `wait_for_readers`(phase2) 確定所有 reader 在 grace period 後都進入過 quiescent state * 此時要等待的 reader 都在`cur_snap_readers` 中，當 reader 進入 quiescent state 或 reader 的 phase bit 與 global grace period counter 相同就將 reader 移到 `qsreaders` * 因為 reader 只會在進入最外層 critical section 時與 global grace period counter 的 phase bit 同步，又 phase1 已確定所有 reader 的 phase bit 都與 global grace period counter 同步，因此在 phase2 若觀察到 reader 與 global grace period counter 的 phase bit 相同就表示 reader 一定離開過最外層 critical section 進入 quiescent state 並重新進入最外層 critical section * 等到 `cur_snap_readers` 為空時就是 grace period 的結束總結與另一實作不同之處， phase bit 不能像 64-bits counter 一樣保證 reader 一定在 grace period 後進到 quiescent state ，所以要在 phase bit 同步並切換 phase bit 後，多等待 reader 再次進入 quiescent state ，因此 grace period 時間也會被拉長 #### Reader and Writer reader 和 writer 的行為可參考 [test_urcu.c](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/tests/benchmark/test_urcu.c) ，跟 qsbr flavor 相比， reader 只要離開 critical section 就進入 extended quiescent state，因此 reader 不需定期進入 quiescent state ，並且要紀錄 nasty critical section 數量在進入 critical section 前，必須確保 [reader counter 同步](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/include/urcu/static/urcu-mb.h#L74)的執行順序(現代處理器的 store buffer 和 invalidate queue 導致 CPU 只保證自己會讀到自己寫入的最新資料，但其它 CPU 不一定會讀到，memory barrier 確保關鍵資料嚴格按照一定的順序來執行) > it's safe as long as nobody reorders operations in a way that causes an access to happen to an RCU-protected variable before the effects of rcu_read_lock() are visible to other CPUs * [`_urcu_mb_read_lock`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/include/urcu/static/urcu-mb.h#L90) : 這裡用到 [`cmm_barrier`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/include/urcu/compiler.h#L31) 這個 compiler barrier ，目前還不清楚註解指的情況為何 > The first cmm_barrier() call ensures that the compiler does not reorder the body of _urcu_mb_read_lock() with a mutex * [`_urcu_mb_read_lock_update`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/include/urcu/static/urcu-mb.h#L71) : 這裡則使用到 [`cmm_smp_mb`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/include/urcu/arch/x86.h#L49) 這個 memory barrier ， compiler barrier 確保 `_urcu_mb_read_lock` 發生在 cirtical section 之前， CPU barrier 確保 barrier 前的 load、store 指令在 barrier 後的 load、store 指令前要是 globally visible ，可以看到 else 並沒有使用 memory barrier ，因為它已在 critical section 內，只是進到下一層 nasty critical section ，因此 RCU-protected data 一定會受到保護 * [`_urcu_mb_read_unlock`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/include/urcu/static/urcu-mb.h#L109) : 這裡一樣使用到 `cmm_smp_mb` 避免與 mutex reoder * [`_urcu_mb_read_unlock_update_and_wakeup`](_urcu_mb_read_unlock_update_and_wakeup) : 這裡用到兩個 `cmm_smp_mb` memory barrier ，第一個確保 critical section 發生於 `_urcu_mb_read_unlock` 前，第二個確保 reader counter 的更新發生於 `urcu_common_wake_up_gp` 前 (wakeup thread 被喚醒後一定會看到更新後的 reader counter) 總結來說，前面提到 qsbr flavor 在 `rcu_read_lock` 及 `rcu_read_unlock` 不需做任何事，這是因為 memory barrier 都被放到定期呼叫的 `_urcu_qsbr_quiescent_state` 以及與 extended quiescent state 相關的 `_urcu_qsbr_thread_online` 和 `_urcu_qsbr_thread_offline` 當中。因為上述函式呼叫不像 `rcu_read_lock` 及 `rcu_read_unlock` 那麼頻繁，因此 memory barrier 所產生的 overhead 較少 ### qsbr-signal 這個 flavor 有最接近 qsbr flavor 的 reader 效能，但 writer 效能最差。主要透過 write-side 在需要時傳送 POSIX signal 給所有 reader 去執行 memory barrier ，因此 reader 能避免非必要的 CPU barrier ，相關程式碼可見 [`src/urcu.c`](https://github.com/urcu/userspace-rcu/blob/master/src/urcu.c)、[`include/urcu/static/urcu-signal.h`](https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-signal.h) 以下來看 signal 的相關實作 * [`rcu_init`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/src/urcu.c#L684) : 此函式會透過 GNU extension 在 main 函式之前執行，其他 flavor 會對應各自的實作，使用 `SIGUSR1` 作為強制 reader 執行 memory barrier 的 signal * [`sigrcu_handler`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/src/urcu.c#L662) : 處理 signal 的函式， reader 收到 signal 後會執行 memory barrier ，並將 `urcu_reader.need_mb` 設為 0 表示 signal 已收到 * [`smp_mb_master`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/src/urcu.c#L245) : 透過 [`force_mb_all_readers`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/src/urcu.c#L203) 對所有 reader 發送 signal，只有在 writer 更新 RCU-protected data 並進入 grace period 後 (參考 [`synchronize_rcu`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/src/urcu.c#L408))， writer 才需要透過它讓 reader 的 compiler barrier 提升成 memory barrier 總結來說， signal flavor 利用 signal 讓 reader 的 `_urcu_signal_read_lock` 及 `_urcu_signal_read_unlock` 不再需要 CPU barrier ，降低 barrier 產生的 overhead ，提高 reader 的效能，然而 writer 必需發送 signal 給所有 reader 延長 `synchronize_rcu` 的時間，降低 writer 的效能 ### qsbr-memb 這個 flavor 有接近 signal flavor 的 read-side 效能，同時又有接近 mb flavor 的 write-side 效能(也就是 write-side 效能高於 signal flavor 許多)，與這兩個 flavor 各自相比，效能的上升都高於下降(參考 [[RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)](https://lwn.net/Articles/369640/) )，相關程式碼可見 [`src/urcu.c`](https://github.com/urcu/userspace-rcu/blob/master/src/urcu.c)、[`include/urcu/static/urcu-memb.h`](https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-memb.h) 與 signal flavor 類似，由 writer(slow side) 在需要時透過 kernel 提供的 systemcall [`sys_membarrier()`](https://man7.org/linux/man-pages/man2/membarrier.2.html) 讓 reader(fast side) 執行 memroy barrier > There are cases where one side of the matching barriers (which we will refer to as "fast side") is executed much more often than the other (which we will refer to as "slow side"). This is a prime target for the use of membarrier(). The key idea is to replace, for these matching barriers, the fast-side memory barriers by simple compiler barriers, for example: asm volatile ("" : : : "memory") and replace the slow-side memory barriers by calls to membarrier(). This will add overhead to the slow side, and remove overhead from the fast side, thus resulting in an overall performance increase as long as the slow side is infrequent enough that the overhead of the membarrier() calls does not outweigh the performance gain on the fast side. 接著來看相關的實作 * [`rcu_init`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/src/urcu.c#L652) : * [`rcu_sys_membarrier_init`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/src/urcu.c#L633) : 檢查 kernel 是否支援 `sys_membarrier` ，沒有則退回 mb flavor * [`smp_mb_master`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/src/urcu.c#L182) : 若 kernel 支援 `sys_membarrier`， writer 透過它將 reader 的 compiler barrier 提升成 memory barrier * [`urcu_memb_smp_mb_slave`](https://github.com/urcu/userspace-rcu/blob/1a0ddffb9027fa82f360fdd1eb06d27d0942cc8c/include/urcu/static/urcu-memb.h#L75) : 若 kernel 支援 `sys_membarrier`，將 memory barrier 換成 compiler barrier 總結來說相較於 signal flavor ， memb flavor 只會讓正在執行的 thread 執行 memory barrier，並且不用去要求 `rcu_registry_lock` 這個 mutex lock > By letting the kernel perform this synchronization rather than dumbly sending a signal to every process threads (as we currently do), we diminish the number of unnecessary wake ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such barrier anyway, because these are implied by the scheduler context switches. ### qsbr-bp 上面的 flavor 都限制 reader 在進入 crtical section 前要呼叫 `rcu_register_thread`，在 reader thread 結束前要呼叫 `rcu_unregister_thread` ，否則會出錯， bp flavor 則移除了這項限制，透過讓 reader 在進入 critical section 前會自動去 register ， reader thread 在結束前會自動 unregister > Bullet-proof RCU is like memory-barrier-based RCU but, in addition, automatically invokes rcu_register_thread() when needed from the RCU read-side primitives. This further increases these primitives' overheads, but in return allows RCU to be used by libraries that are used by applications that create their own threads such that RCU cannot hook into thread creation and destruction. 實作大致上與 memb flavor 一致，下面來看 bp flavor 如何自動讓 reader register/unregister #### initialize [`_urcu_bp_init`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/src/urcu-bp.c#L642) : bp flavor 的 constructor ，一樣會在 main 函式前呼叫，此外每個 thread 在 register 時也會呼叫 * `urcu_bp_refcount` : 目前的 registered thread + main thread * [`urcu_bp_key`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/src/urcu-bp.c#L151) : 透過 [`pthread_key_create`](https://linux.die.net/man/3/pthread_key_create) 產生 thread-specific data key 及對應的 destructor [`urcu_bp_thread_exit_notifier`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/src/urcu-bp.c#L602) ， thread 在結束時能自動 unregister 的關鍵 > Upon key creation, the value NULL shall be associated with the new key in all active threads. Upon thread creation, the value NULL shall be associated with all defined keys in the new thread. An optional destructor function may be associated with each key value. At thread exit, if a key value has a non-NULL destructor pointer, and the thread has a non-NULL value associated with that key, the value of the key is set to NULL, and then the function pointed to is called with the previously associated value as its sole argument. The order of destructor calls is unspecified if more than one destructor exists for a thread when it exits. #### register 首先可以先觀察到跟其他 flavor 不同， [`urcu_bp_reader`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/src/urcu-bp.c#L159) 是一個 pointer ， bp flavor 就是在用到 [`_urcu_bp_read_lock`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/include/urcu/static/urcu-bp.h#L164) 和 [`_urcu_bp_read_ongoing`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/include/urcu/static/urcu-bp.h#L200) 時，讓還沒 register 的 thread (`urcu_bp_reader` 尚未配置記憶體) 呼叫 register 相關函式 * [`urcu_bp_register`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/src/urcu-bp.c#L535) : 可以看到這裡會先將 signal block ，確保 bp flavor 是 signal safe(參考 [urcu-bp: New "bulletproof" RCU library flavor](https://github.com/urcu/userspace-rcu/commit/fdee2e6dc73cc504ba24be89da539c68742e508e))，我認為這是因為在 bp flavor 我們不能控制何時會收到 thread exit signal ，所以在使用到 mutex lock 時才要先將 signal block 並等到 mutex unlock 後在接收被 block 的 signal 來避免 dead lock 發生，目前還不懂為什麼 `urcu_bp_reader` 不用 malloc 來配置記憶體 * [`add_thread`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/src/urcu-bp.c#L476) : 透過 mmap 配置 `urcu_bp_reader` 的記憶體並透過 [`pthread_setspecific`](https://linux.die.net/man/3/pthread_setspecific) 對應到 `urcu_bp_key` ，此時當 thread 結束時就會呼叫 key 對應的 destructor ，也就是 unregister 相關的函式 > The pthread_setspecific() function shall associate a thread-specific value with a key obtained via a previous call to pthread_key_create(). Different threads may bind different values to the same key. These values are typically pointers to blocks of dynamically allocated memory that have been reserved for use by the calling thread. #### unregister thread 成功 register 後，便透過 `urcu_bp_key` 保證了 thread 在結束時會 unregister * [`urcu_bp_unregister`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/src/urcu-bp.c#L576) : 一樣會將 signal block ，並釋放使用的記憶體，最後透過 [`urcu_bp_exit`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/src/urcu-bp.c#L659) 去減少 register thread 數量(當所有 register thread 結束才會去真正釋放 mmap 配置的記憶體) 總結來說，當使用者無法掌控 thread 的 create 和 destroy 時， bp flavor 會是最適合的選擇，代價就是會犧牲 read side 及 write side 的效能 ### RCU 各個 flavor 的應用場景若使用者無法掌控執行緒的開始與結束，例如 application hook 或無法掌控的 thread pool (參考 [[lttng-dev] QSBR urcu question](https://lore.kernel.org/lttng-dev/1546175211.59087.1617739554450.JavaMail.zimbra@efficios.com/T/#u)) ， `bp` flavor 會是最合適的。反之應該使用另外 4 種 flavor ，因為 `bp` flavor 會降低 reader 及 writer 的效能 * 若使用者清楚的知道每個 reader 該如何定期的進入 quiescent state 才不會讓 writer 被 block 太久或 grace period 遲遲無法結束(可能會導致 out of memory) ，且最在意 reader 的效能， `qsbr` flavor 會是最適合的 * 反之應該使用另外 3 種 flavor ，使用者就不需去考慮何時該讓 reader 進入 quiescent state * 若無法讓出一個 signal 或 Linux 核心不支援 [membarrier](https://man7.org/linux/man-pages/man2/membarrier.2.html) 系統呼叫，則只能使用 `mb` flavor ，他的 reader 效能最差但 writer 效能最好 * 若可以讓出一個 signal 做使用，且只在意 reader 的效能而不在意 writer 的效能， `signal` flavor 是最適合的 * 若 Linux 核心支援 [membarrier](https://man7.org/linux/man-pages/man2/membarrier.2.html) 系統呼叫，且希望 reader 效能要好但不能讓 writer 效能太差， `memb` flavor 會是最適合的 ### Summary 下圖擷取自 [User-space RCU](https://lwn.net/Articles/573424/) ![](https://hackmd.io/_uploads/Hk4-tj0v2.png) * 可以看到 Read-Side Cost， `++` 表示追蹤 nesting critical section 的 cost， `test` 表示判斷 thread 是否註冊的 branch 的 cost， `smp_mb()` 表示使用 full memory barrier 的 cost * 這裡我認為 signal flavor 的 read-side cost 應該沒有 test ### Benchmark 透過 [benchmark](https://github.com/urcu/userspace-rcu/tree/master/tests/benchmark) 目錄下的測試檔案來比較不同 flavor 的讀寫效能以下是 6 reader, 2 writer, 1 second 的結果 |6 reader/ 2 writer | read | write | |--|--|--| |qsbr|3542322664|347937| |signal|3171201626|624| |memb|1742311222|163172| |mb|71073919|537088| |bp|1890575203|75389| * 可以觀察到 bp flavor 的 read 次數比 memb 高，這可能是因為 bp flavor 的 write 次數較少，導致產生的 cache-line exchanges 也較少，所以 reader 受到 cache 的效能影響降低。下面將 writer 降為 0 個，可以看到 bp flavor reader 的 overhead 的確比較高 |1 reader/ 0 writer|read|write| |--|--|--| |memb|2660635174|0| |bp|2276424496|0| 至於 bp flavor write 次數較少的原因是因為在 [`urcu_bp_synchronize_rcu`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/src/urcu-bp.c#L275) 要將 signal block ，來確保 signal safe，透過 perf 可以發現它佔整體的 76%，產生的 overhead 非常高 ```shell $ sudo perf record -g --call-graph dwarf ./test_urcu_bp 0 1 1 $ sudo perf report --stdio -g graph,0.5,caller |--86.85%--urcu_bp_synchronize_rcu | |--76.69%--__GI___pthread_sigmask (inlined) |--4.66%--mutex_unlock --4.21%--mutex_lock ``` |0 reader/ 1 writer|read|write| |--|--|--| |memb|0|12123778| |bp|0|2447449| 此外 bp flavor 的 [`urcu_bp_synchronize_rcu`](https://github.com/urcu/userspace-rcu/blob/be152bdb1af23ed3781fbda231597bad750ee977/src/urcu-bp.c#L275) 並不像其他 flavor 有透過 queue 把 `waiting thread` batch 起來，也就是沒有共享 grace period 的機制。下圖可看出共享 grace period 的好處， reader 固定為 6 個並將 critical section 延長(透過 `-c` 參數，為了讓 waiting thread 更有機會 batch 起來)，這裡 flavor 不同所以當 writer 只有 1 個時 write 次數不同，不過這裡主要是想看 write throughput 能不能隨 writer 增加而上升，可以發現 grace period sharing 能隨著 writer 增加提升 write 次數。 ![](https://hackmd.io/_uploads/BJkTyjUdh.png) ### patch [[lttng-dev] [PATCH] Fix: revise urcu_read_lock_update() comment](https://lore.kernel.org/lttng-dev/55e5ace4-7e85-3f54-c7b8-23d157b556f2@efficios.com/T/#m95c237b9be65befc3d9f0b6060e13c12f5b6998f) 觀察到 `urcu_read_lock_update()` 註解對於 reader counter 的說明有誤， lower bit 用來追蹤 nesting critical section ，第 `URCU_BP_GP_CTR_PHASE` 個 bit 為 grace period 的 phase #### 參考資料 * [A Tour Through RCU's Requirements](https://www.kernel.org/doc/Documentation/RCU/Design/Requirements/Requirements.html) * [What is RCU, Fundamentally?](https://lwn.net/Articles/262464/) * [User-space RCU](https://lwn.net/Articles/573424/) * [Userspace RCU的使用與原理](https://blog.csdn.net/weixin_43705457/article/details/115506185) * [sys_membarrier](https://lwn.net/Articles/369567/) * [User-Level Implementations of Read-Copy Update](https://www.efficios.com/pub/rcu/urcu-main.pdf) * [User-Level Implementations of Read-Copy Update Supplement](https://www.researchgate.net/publication/228865533_Supplementary_Material_for_User-Level_Implementations_of_Read-Copy_Update) * [Expediting membarrier()](https://lwn.net/Articles/728795/)