Linux 核心專題: RCU 實作

## Linux 核心專題: RCU 實作 > 執行人: eleanorLYJ > [專題解說錄影](https://youtu.be/BB_tW7bQSlw) ### Reviewed by `dcciou` 是否有考慮過如何在維持CTP的優點下改善其成本的問題 ### Reviewed by `yuyuan0625` 1. 下文[記憶體使用量](https://hackmd.io/2Jv0y1OpRgG0IdlcBD9Snw?both#%E8%A8%98%E6%86%B6%E9%AB%94%E4%BD%BF%E7%94%A8%E9%87%8F1)段落的圖片無法讀取，可能要再重新貼上 2. 想請問為何在更多的 reader 的情境，slot-list 就沒有明顯有快取優勢呢? ### Reviewed by `Ackerman666` 感覺這句話有點不嚴謹，應該是始於 writer 修改資料後呼叫 `synchronize_rcu()` > RCU 寬限期（RCU Grace Period，以下縮寫GP）：這是一個時間窗口，開始於 writer 開始修改 RCU 保護資料 ### Reviewed by `Shiang1212` > 操作：提供 O(1) 無鎖讀取，並由序列化 writer 在 O(N log(M)) 時間內垃圾收集未引用的值。請問這裡的 N、M 是指什麼？請附上說明。 ## 任務簡介 Userspace RCU 已成功將 Linux 核心的 RCU 的機制帶到若干應用場域，不過其程式碼規模龐大、整合較困難，且授權條款較嚴格 (LGPL 2.1)，於是本任務嘗試評估 [ctp: C Thread Primitives](https://github.com/nicowilliams/ctp)，探討是否可作為 userspace RCU 的替代選擇。 ## TODO: 理解 RCU > 閱讀 [Linux 核心設計: RCU 同步機制](https://hackmd.io/@sysprog/linux-rcu), [Is Parallel Programming Hard, And, If So, What Can You Do About It?](https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html) 的第九章 Deferred Processing，紀錄自己的認知和相關問題 RCU（Read-Copy-Update）是一種高效的同步機制，主要用於多執行緒環境中讀多寫少，並且 reader 端要求高效但對新舊資料不敏感的場景。RCU 允許多個 reader 並行讀取共享資料，而不需要阻塞寫操作，從而效能提升。 ### RCU 兩個主要元件 1. RCU 讀側臨界區（RCU Read-Side Critical Section，以下縮寫 cs）：這是一段由 `rcu_read_lock()` 和 `rcu_read_unlock()` 包圍的程式碼區域，處於這段區域內的 reader 可以安全地讀取 RCU 保護的共享資料。 2. RCU 寬限期（RCU Grace Period，以下縮寫GP）：這是一個時間窗口，開始於 writer 開始修改 RCU 保護資料，至在GP開始前所有的 reader 結束的時間區間。RCU 保證所有在 GP 開始之前進入 cs 的 reader 都已經完成。`synchronize_rcu()` 用於等待一個寬限期的結束。藉此確保每個 reader 對於不同物件保有前後一致的讀取。 (coherent view of each object) 補充 RCU 核心 API ![Core RCU API](https://hackmd.io/_uploads/SJRfT8ZIA.png) ### RCU 基礎為了實作 RCU 的特性，RCU 有了三個重要基礎: 1. publish-subscribe 機制 2. 等待已經存在的 RCU readers 3. 維護最近更新物件的多個版本  #### publish-subscribe 機制 reader 的讀取視為對目前版本的 RCU 保護資料的訂閱，而 updater 的操作則視為發布新版本。 RCU 的 publish-subscribe 機制是藉由兩個 primitive 達成的: **Publish** Updater 用 `rcu_assign_pointer()` 這 primitive 做到確保在實際 publish 前這 pointer 的 store 操作的順序正確，這意味著用 memory barrier 防止 pointer 更新之後再執行寫入的操作。在 Linux kernel, `rcu_assign_pointer()` 是按照 `smp_store_release()` 定義的。 updater 程式碼範例: ```c= new_value->a = 1; new_value->b = 2; new_value->c = 3; rcu_assign_pointer(global_ptr, new_value); // 保證將 new_value assign 給 pointer 發生於 line 1-3 之後 ``` :::danger 注意用語： * access 是「存取」，而非「訪問」務必詳閱 https://hackmd.io/@sysprog/it-vocabulary ::: **Subscribe** Reader 用 `rcu_dereference()` 這 primitive 做到確保實際資料被<s>訪問</s> 前 load 操作順序正確，而這代表有使用 memory barrier，確保資料的一致性。 reader 程式碼範例: ```c struct data *ptr = rcu_dereference(global_ptr); int value = ptr->value; ``` - `rcu_dereference(global_ptr)` 這行程式碼確保 `global_ptr` 被安全地讀取，並且其指向的資料在此之後的讀取操作是有序的，沒有被重新排序。 - `ptr->value`: 實際讀取到 global_ptr 所指向的資料 ### 等待已經存在的 RCU readers 這一部分是更深入的討論寬限期(GP)的規則對於 RCU reader (1) 如果給定 RCU 讀側臨界區間 (cs) 的任何部分先於 GP 的開始，則該 cs 的整體會先於該 GP 的結束。 (2) 如果給定 RCU cs 的任何部分在 GP 結束之後，則該 cs 的整個部分在該 GP 開始之後。 **RCU 禁止讀端臨界區與寬限期完全重疊** 以下圖片為例說明 cs 與 GP 的執行順序的關係，並且說明為什麼最終 r1 的值為1，而 r2 的值是0。 P0() 是 reader，P1() 是 updater。 ![image](https://hackmd.io/_uploads/HkNrfw-LC.png) `P0()` 進入 RCU 的 cs (`rcu_read_lock()`) 並且在這個區域內讀取變數 `x` 和 `y`。 `P1()` 修改變數 `x` 並設置為 `1`，然後呼叫 `synchronize_rcu()` 開始 RCU GP。在 GP 結束後，`P1()` 修改變數 `y` 並設置為 `1`。因為`P1()` 在 GP 開始之前將 `x` 設置為 `1`，所以 `r1` 讀取到 `1`。`P1()` 在 GP 結束後將 `y` 設置為 `1`，所以 `r2` 讀取到寬限期開始之前的值，即 `0`。 :::danger 注意用語: call 是「呼叫」，而非「調用」。參見 [資訊科技詞彙翻譯](https://hackmd.io/@sysprog/it-vocabulary) 和 [詞彙對照表](https://hackmd.io/@l10n-tw/glossaries) ::: **摘要: RCU 寬限期的順序保證** (Summary of RCU Grace-Period Ordering Guarantees) 以下討論四個可能情況，圖示中的左方塊都是 RCU reader，右方塊是 RCU updater。第一種情況: reader 在 updater 開始刪除之前執行，因此該 reader 可能擁有對已刪除資料元素的參考。因此，updater 在 reader 完成之前不得釋放該元素。 ![image](https://hackmd.io/_uploads/HkH8EPWUC.png) 第二種情況: reader 直到刪除完成後才開始執行。reader 不可能獲得對已刪除的資料元素的引用，因此可以在 reader 完成之前釋放該元素。 ![image](https://hackmd.io/_uploads/HJA8EDZI0.png) 第三種情況與第二種情況類似，但說明即使 reader 不可能獲得對元素的引用，仍然允許推遲該元素的釋放，直到 reader 完成之後。 ![image](https://hackmd.io/_uploads/r1uD4vb8C.png) 在第四個也是最後一個場景中，reader 在 updater 開始刪除資料元素之前進入 cs，但該元素在 reader 完成之前被（錯誤地）釋放，然而，正確的 RCU 實作不會允許第四種情況發生! ![image](https://hackmd.io/_uploads/By0DVDZ8R.png) ### 維護最近更新物件的多個版本由於無同步(synchronization-free)的 reader 提供的時間同步非常弱，RCU 透過空間同步進行補償。多個弱一致性的板本會帶來一些意外，以下為例: reader 存取同時也在更新的鏈結串列 1. 在圖中的第一行中，reader 引用 A 2. 在第二行中，它前進到 B，到目前為止已經看到了 A，然後是 B 3. 在第三行中，updater 刪除元素 A 4. 在第四行中，updater 將元素 E 新增至串列末端 5. 在最後一行中，reader 完成了存取，看到了元素 A 到 E ![image](https://hackmd.io/_uploads/ry-p9vbLC.png) 不同時間存取的 reader 會得到不同結果，例如再一次存取鏈結會得到 B 到 D 的結果，但在這特殊情況取得 A 到 E 的結果也是合法的。倘若 RCU 的不一致性會造成問題，可以用更強的同步機制或是使用 timestamps 與 versioning 等方式處理，但這些處理會造成成本增加。 ## TODO: 探討 [CTP](https://github.com/nicowilliams/ctp) 實作 RCU 哪些關鍵特徵 > CTP 專案建構在 thread-safe variable （TSV），確保 reader 執行緒在無 lock 的情況下運行，且不會被 writer 阻塞。另一方面，writer 會等待最後引用的 reader 執行緒完成讀取後才會寫入。 ### 什麼是 thread-safe variable(tsv)? CTP 專案中提及的 tsv 本身並不是一個廣泛被認可的理論概念，但它像是結合了 [thread safety](https://en.wikipedia.org/wiki/Thread_safety)和 [thread specific data](https://www.ibm.com/docs/ro/aix/7.2?topic=programming-thread-specific-data) 的概念，確保多個執行緒安全並行存取變數的具體實作。在 CTP 中，提供一種機制能確保資料在多執行緒中保持一致性並且防止 race condition 等問題，其中在某實作中(如 slot-list)，每個 reader 執行緒可以有自己的 slot 能夠儲存目前的副本，這類似於特定執行緒的資料，但透過安全機制更新和管理這些值。根據 [Using Thread Safe Variables to Protect Data](https://www.ni.com/docs/en-US/bundle/labwindows-cvi/page/cvi/programmerref/threadsafevariable.htm)，得知關於 tsv 好處： - 避免常見的多執行緒錯誤，如在存取變數之前忘記取得鎖的常見錯誤。 - 易於資料傳遞，只需傳遞 thread-safe variable handle，即可在函數之間傳遞受保護的資料，而無需同時傳遞 thread-lock handle 和受保護的變數。 :::danger data 是「資料」 ::: CTP 使用兩種主要實作來實作執行緒安全變數: slot-pair 與 slot-list ### ctp 目前有六個函式能被使用 `int thread_safe_var_init(thread_safe_var *vpp, thread_safe_var_dtor_f dtor)`: 初始化 tsv 變數的所有成員 `thread_safe_var_destroy` `int thread_safe_var_get(thread_safe_var vp, void **res, uint64_t *version)`: reader 取得變數的目前值與其版本，此外管理引用計數 (reference counting) 和 slot 選擇。 `void thread_safe_var_release(thread_safe_var vp)`: 清掉執行緒上的特定資料 `int thread_safe_var_set(thread_safe_var vp, void *data, uint64_t *new_version)`: data 是要寫入結構為 value * 的值(new_value)，vp 是 Pointer to thread-safe global variable 與 new_version 則會填入 new_value 的版本號。返回 0 代表成功。管理記憶體並確保只有在 readers 完成讀取後才更新。 `int thread_safe_var_wait(thread_safe_var vp)` : 當還沒有任何資料被寫入時，reader 需要等待，直到資料被寫入。 ## TODO: Slot-Pair 與 Slot-List 的差異 ### Slot-Pair 結構：每個 tsv 都有兩個 slot，一個用於目前資料，另一個用於上一個或下一個資料操作：具有 O(1) 無鎖和 spin-less 讀取以及 O(1) 序列化(serialized)寫入。 reader 行為： reader 呼叫 free() 和值析構函數(value destructor)，並且可能需要向潛在等待的 writer 發出訊號，這涉及獲取互斥鎖。儘管這在技術上是阻塞的，但它通常是在無競爭的資源上。 writer 行為： writer 將"上一個" slot 切換到"目前" slot。讀者從目前指定的 slot 讀取。 writer 會等待，直到前一個 slot 沒有活躍的 reader 為止。該值是引用計數的，因此當最後一個引用被刪除時會自動釋放。 ### Slot-List 操作：提供 O(1) 無鎖讀取，並由序列化 writer 在 O(N log(M)) 時間內垃圾收集未引用的值。結構：維護引用值的鏈結串列，串列的頭部始終是目前值。它還維護一個 subscription slots ，每個 reader 執行緒都有一個 slot。第一次讀取時，reader 分配一個 slot 並將值串列的頭部複製到其 slot 中。 reader 行為： subscription slots 分配是無鎖的。reader 中的所有內容都是無鎖的 writer 行為： writer 對引用值的鏈結串列執行垃圾回收(Garbage Collection, GC)。值在最後一個引用被刪除後的第一次寫入時被釋放，因為它們被 writer 回收。 ### API 的實作在 threead_safe_global.h 中表示， `thread_safe_var` 將替代 `thread_safe_var_s *` ```c typedef struct thread_safe_var_s *thread_safe_var; ``` #### `thread_safe_var_wait(thread_safe_var vp)` 這是 slot-pair 與 slot-list 共有的函式，其目的是等著 writer 第一次將值寫入。倘若已經能從 vp 取得到值，那就不用等了，直接 `return 0`。其餘狀況，先鎖定 waiter_lock，之後使用 `pthread_cond_wait()` 阻塞，直到能讀到 writer 第一次寫入值，之後用 pthread_cond_signal() 喚醒等待在指定條件變數上的一個執行緒，之後同樣再用 pthread_cond_signal() 逐一喚起其他的 waiter。 ```c if (thread_safe_var_get(vp, &junk, NULL) == 0 && junk != NULL) return 0; pthread_mutex_lock(&vp->waiter_lock); while (thread_safe_var_get(vp, &junk, NULL) == 0 && junk == NULL) { if (pthread_cond_wait(&vp->waiter_cv, &vp->waiter_lock) != 0) { pthread_mutex_unlock(&vp->waiter_lock); return err; } } ``` 最後，跟 futex 一樣，為了防止 [thundering herd problem](https://en.wikipedia.org/wiki/Thundering_herd_problem) (驚群問題)，呼叫 `pthread_cond_signal(&vp->waiter_cv)` ，而不是用廣播的方式，確保所有等待的 reader 執行緒是逐一喚醒。 ```c pthread_cond_signal(&vp->waiter_cv); pthread_mutex_unlock(&vp->waiter_lock); ``` ### slot-pair tsv 中有兩個 slot，一個 slot 存 "current" value，另一個存 "previous"/"next" value 。 writer 將 "previous" slot 放入下一個 "current" slot，而 reader 從看起來是 "current" slot 中讀取。 #### 資料結構核心的資料在 vwrapper 中，var 為 vwrapper 的 wrapper, 而在 thread_safe_var_s 中存有一對 slot 另外還有多個條件變數與鎖。 ```c /* This is an element on the list of referenced values */ struct value { volatile struct value *next; /* previous (still ref'd) value */ void *value; /* actual value */ volatile uint64_t version; /* version number */ volatile uint32_t referenced; /* for mark and sweep */ }; /* * Each thread that has read this thread-safe global variable gets one * of these. */ struct slot { volatile struct value *value; /* reference to last value read */ volatile uint32_t in_use; /* atomic */ thread_safe_var vp; /* for cleanup from thread key dtor */ /* We could add a pthread_t here */ }; /* * Slots are allocated in arrays that are linked into one larger logical * array. */ struct slots { volatile struct slots *next; /* atomic */ struct slot *slot_array;/* atomic */ volatile uint32_t slot_count; /* atomic */ uint32_t slot_base; /* logical index of slot_array[0] */ }; struct thread_safe_var_s { pthread_key_t tkey; /* to detect thread exits */ pthread_mutex_t write_lock; /* one writer at a time */ pthread_mutex_t waiter_lock; /* to signal waiters */ pthread_cond_t waiter_cv; /* to signal waiters */ var_dtor_t dtor; /* value destructor */ volatile struct value *values; /* atomic ref'd value list head */ volatile struct slots *slots; /* atomic reader subscription slots */ volatile uint32_t next_slot_idx; /* atomic index of next new slot */ volatile uint32_t slots_in_use; /* atomic count of live readers */ uint32_t nvalues; /* writer-only; for housekeeping */ }; ``` #### `thread_safe_var_get` 此函數目的: reader 會取 tsv (vp)的版本號得知目前 slot 是哪個，接著讀取這 slot 的值 (wrapper)，並且將引用計數加一，接著 reader 將會保留此 wrapper 當作執行緒的特定資料。以下一步步講解函式組成: `get` 的快速路徑為，如果這個執行緒擁有的 `wrapper` (`pthread_getspecific(vp->tkey)`) 與 `tsv` 的 `next_version - 1` 一致，就將 `wrapper` 的版本與實際值（`ptr`）存回參數中，並結束函式。 ```c if ((wrapper = pthread_getspecific(vp->tkey)) != NULL && wrapper->version == atomic_read_64(&vp->next_version) - 1) { /* Fast path */ *version = wrapper->version; *res = wrapper->ptr; return 0; } ``` 在確定 slot 為目前最新的 slot 的迴圈中，會與 writers 競爭。先讀取 version 之後讀取 slot (`v`)，之後判斷如果 `atomic_read_64(&vp->next_version) == (*version + 1)` 代表中途是否有其他執行緒在競爭，如果等於，這就確定 slot。反之不等於並且如果 slot 的活躍 reader (nreader) 是 0，就 `signal_writer()`通知 writer 寫入，其方式是使用 `vp->cv` 條件變數告知 writer ，而 reader 則繼續在此迴圈，直到確定好 slot。 ```c static int signal_writer(thread_safe_var vp) { pthread_mutex_lock(&vp->cv_lock); err = pthread_cond_signal(&vp->cv); return pthread_mutex_unlock(&vp->cv_lock); } ``` 選擇好 `slot` 後，就可以將 `slot` 中的 `vwrapper` 引用加一，並將 `vwrapper` 的值寫進 `*version` 與 `*res` 中。接著表示讀取完畢，將 `slot` 的活躍 reader 減一，然而此時，活躍 reader 等於 0，如果執行緒的最新版不是此目前 `slot` 的版本，那就要 `signal_write` ，因為代表有 writer 在等待。這樣的作法是要確保讀寫操作的同步，避免競爭。 ```c if (atomic_dec_32_nv(&v->nreaders) == 0 && atomic_read_64(&vp->next_version) != (*version + 1)) signal_writer(vp); ``` 最後設置新的 `wrapper`。 ```c pthread_setspecific(vp->tkey, wrapper); ``` #### `thread_safe_var_set` 首先，構建一個新的 `wrapper` 變數來保存新的值。 ```c wrapper = calloc(1, sizeof(*wrapper) wrapper->dtor = vp->dtor; wrapper->nref = 0; wrapper->ptr = cfdata; ``` 接著，使用 `pthread_mutex_lock(&vp->write_lock)` 來鎖定寫入操作，這也起到 memory barrier 的作用，以確保之前的寫操作完成，寫入操作是序列化。隨後，將新 `wrapper` 的版本號設置為 `vp->next_version`，也將這個版本號返回給 `new_version`。接著，獲取下一個 slot `v`。如果是這 tsv 第一次寫入，則將這新 `wrapper` 設置在兩個 slot 上。每個 slot 的 wrapper 引用計數(`nref`) 增加一次，並將這些 slot 的 wrapper 更新為新的 `wrapper`。然後，將 `vp->next_version` 增加一，並藉由 `pthread_cond_signal(&vp->waiter_cv)`通知等待的 reader ，讀取新的值。最後，解鎖寫入操作並結束函式。如果這不是第一次寫入，則這新 `wrapper` 的引用計數增加一次。然後，等待直到 slot 沒有活躍的 reader。如果有活躍的 reader，則使用 `pthread_cond_wait(&vp->cv, &vp->cv_lock)` 不斷等待，直到所有 reader 釋放 `cv`。 ```c while (atomic_read_32(&v->nreaders) > 0) { pthread_cond_wait(&vp->cv, &vp->cv_lock)) != 0); ``` 接著，解鎖 `cv_lock` 後，將 `v->wrapper` 更新為新的 `wrapper`，並將 `v->version` 和 `vp->next_version` 都增加一。最後，釋放舊的 `wrapper`，並解鎖 `write_lock`。這個過程與 QSBR（Quiescent State Based Reclamation）很相似，只有在最後一個活躍的 reader 不再讀取後，才能寫入新值並回收舊值。 :::danger 如何確保上述流程可正確運作？ ::: ### slot-list #### 資料結構 ```c struct value { volatile struct value *next; /* 指向之前的（仍引用的）值 */ void *value; /* 實際值 */ volatile uint64_t version; /* 版本 */ volatile uint32_t referenced; /* 用於標記和清除 */ }; /* * 每個讀取此 tsv 的執行緒都會獲得其中一個。 */ struct slot { volatile struct value *value; /* 最後讀取的值的引用 */ volatile uint32_t in_use; thread_safe_var vp; /* for cleanup from thread key dtor */ }; /* * slots 被分配在一個陣列中，而這陣列會被連結到一個更大的邏輯陣列。 */ struct slots { volatile struct slots *next; struct slot *slot_array; volatile uint32_t slot_count; uint32_t slot_base; /* slot_array[0] 的索引*/ }; struct thread_safe_var_s { pthread_key_t tkey; /* 檢測執行緒退出 */ pthread_mutex_t write_lock; /* 一次一個 writer */ pthread_mutex_t waiter_lock; /* 向 waiters 發出訊號時用 */ pthread_cond_t waiter_cv; /* 向 waiters 發出訊號 */ var_dtor_t dtor; /* value destructor */ volatile struct value *values; /* atomic 引用值得鏈結串列的頭 */ volatile struct slots *slots; /* atomic reader 的 subscription slots */ volatile uint32_t next_slot_idx; /* atomic 的下一個新 slot的索引 */ volatile uint32_t slots_in_use; /* atomic 的活著 reader 數 */ uint32_t nvalues; /* writer-only; for housekeeping */ }; ```  #### `int thread_safe_var_get(thread_safe_var vp, void **res, uint64_t *version)` - 將 vp->tkey 寫入 slot 變數 - 如果還沒寫入任何值進此執行緒的特定資料過 (`pthread_getspecific(vp->tkey) == NULL`)，用 get_free_slot 尋找空的 slot，倘若都沒有空的 slot 才用 grow_slot 去產出一個新的 slot，之後將 slot 寫入此執行緒的特定資料。 ```c if ((slot = pthread_getspecific(vp->tkey)) == NULL) { /* First time for this thread -> O(N) slow path (subscribe thread) */ slot_idx = atomic_inc_32_nv(&vp->next_slot_idx) - 1; if ((slot = get_free_slot(vp)) == NULL) { /* Slower path still: grow slots array list */ err = grow_slots(vp, slot_idx, 2); /* O(log N) */ slot = get_slot(vp, slot_idx); /* O(N) */ atomic_write_32(&slot->in_use, 1); } slots_in_use = atomic_inc_32_nv(&vp->slots_in_use); if ((err = pthread_setspecific(vp->tkey, slot)) != 0) return err; } ``` 接著在迴圈中反覆確認，slot 的 value 就是 `vp->tkey->values` 的首位, 也就是目前最新的值，倘若不是，就將 slot->value 的值寫進 `vp->tkey->values` 的首位。 ```c while (atomic_read_ptr((volatile void **)&slot->value) != (newest = atomic_read_ptr((volatile void **)&vp->values))) atomic_write_ptr((volatile void **)&slot->value, newest); ``` #### `grow_slots(thread_safe_var vp, uint32_t slot_idx, int tries)` slot_idx 為期望該 slot 的在 slots 索引位置，tries 為次數上限。無鎖定方式增長 slot_array。首先去計算目前總 vp->slots 這鏈結串列的 slot 總數，每一個 slots 結構中有包含 slot_count，因此迭代鏈結串列並加總 `slot_count` 即能得知。 ```c volatile struct slots **slotsp; struct slots *new_slots; for (slotsp = &vp->slots; atomic_read_ptr((volatile void **)slotsp) != NULL; slotsp = &((struct slots *)atomic_read_ptr((volatile void **)slotsp))->next) nslots += (*slotsp)->slot_count; ``` 倘若總 slot 數超過期望slot的索引位置，返回 0，表示成功。而嘗試次數超過預期就回傳 EAGAIN，以外就新建 new_slots 分配新的記憶體。 ```c if (nslots > slot_idx) return 0; if (tries < 1) return EAGAIN; if ((new_slots = calloc(1, sizeof(*new_slots))) == NULL) return errno; ``` EAGAIN 為錯誤碼，定義於 C 標準函式庫裡，`EAGAIN` 被定義在 `/usr/include/asm-generic/errno-bash.h` ```c #define EAGAIN 11 /* Try again */ ``` 確定要新增 slot 的數量(`additions`)。如果目前沒有任何 slot，初始 addition 數量設為 4。否則，初始設置為目前 nslots 的 1.5 倍。這可能是一種增量擴展策略。最終期望能夠增加的 slot 數能夠超過 slot_idx。接著就是初始化新的 slot_array： ```c additions = (nslots == 0) ? 4 : nslots + nslots / 2; while (slot_idx >= nslots + additions) { additions += additions + additions / 2; tries++; } new_slots->slot_array = calloc(additions, sizeof(*new_slots->slot_array)); // Then initialize the new slot_array ``` `slot_idx - nslots` 為 slot_idx 在 new_slots 中的索引值，將它標記為有使用。  ```c atomic_write_32(&new_slots->slot_array[slot_idx - nslots].in_use, 1); ``` 嘗試將 new_slots 加入到 `&vp->slots` 的最後一位 ```c if (atomic_cas_ptr((volatile void **)slotsp, NULL, new_slots) != NULL) { free(new_slots->slot_array); free(new_slots); } ``` 最後做遞迴，倘若此次擴展的 slot 數超過 slot_idx，會在下次遞迴中結束，倘若此競爭失敗，就再次嘗試擴展。 ```c return grow_slots(vp, slot_idx, tries - 1); ``` #### `thread_safe_var_set` 建構 new_value (存data 進去)之後，new_value->next 指向 values 的開頭，之後也確定 new_version 了 ```c new_value->value = data; new_value->next = atomic_read_ptr((volatile void **)&vp->values); if (new_value->next == NULL) new_value->version = 1; else new_value->version = new_value->next->version + 1; *new_version = new_value->version; ``` 接著確定 new_value 為 vp->values 的開頭，並且 vp 上的 nvalues 加一。 ```c /* Publish the new value */ atomic_write_ptr((volatile void **)&vp->values, new_value); vp->nvalues++; ``` 同樣有記得要去 `pthread_cond_signal(&vp->waiter_cv);` 呼叫在等待的執行緒。呼叫 `mark_values(vp)` 對舊值執行垃圾回收(Garabage Collection, GC)，標記仍在使用的值並識別可以釋放的值。下一段介紹更多。 ```c old_values = mark_values(vp); ``` 接著，呼叫 [`sched_yield()`](https://www.man7.org/linux/man-pages/man2/sched_yield.2.html) 放棄 CPU，允許其他執行緒（最好是 reader ）運作，接著解鎖 write_lock、釋放不再使用的舊值的記憶體，如果已設置，則呼叫析構函數 dtor，最終返回解鎖寫鎖的狀態。 #### `mark_value` `mark_values` 的目的是標記目前被引用的值，並回收那些不再被引用的舊值，基於 mark-and-sweep 的 GC。以下透過如何不把 `thread_safe_var_set` 剛寫入的值給回收掉的角度，去理解 `mark_value`。 mark-and-sweep 主要兩階段: - [ ] 1. 標記階段 (mark phase): 先走訪 tsv (變數名稱: `vp`)上所有的 `value` 這些值儲存在 `old_values_array` 中，接著走訪所有的 slots 中的 `value`，如果這 value 也有出現在 `old_values_array`，被認為還有被引用，在第二階段時就不會被回收掉。 ```c for (i = 0, v = atomic_read_ptr((volatile void **)&vp->values); i < vp->nvalues && v != NULL; v = v->next, i++) old_values_array[i] = v; qsort(old_values_array, vp->nvalues, sizeof(old_values_array[0]),value_cmp); vp->values->referenced = 1; /* curr value is always in use */ ``` 為了加速，使用 qsort 為 `old_values_array` 做排序，之後用二分搜尋法判斷某 value 是否存在在 `old_values_array`。 ```c for (slots = atomic_read_ptr((volatile void **)&vp->slots); slots != NULL;) { for (i = 0; i < slots->slot_count; i++) { slot = &slots->slot_array[i]; v = atomic_read_ptr((volatile void **)&slot->value); if (v == NULL || v == vp->values) /* vp->values 中的最新值總是被設定為 referenced，確保了它不會被誤刪除 */ continue; if ((value_binary_search(old_values_array, vp->nvalues, v)) != NULL) { v->referenced = 1; } } slots = slots->next; } ``` - [ ] 2. 清除階段 (sweep phase): 將那些 tsv 的 `values` 沒有被標記的值從鍊結串列中移除，並將這些 `values` 加入 `old_values` 鍊結串列中。而那些有在標記階段的有標記的 values，重設 referenced 為 0，為了下一次做 GC 有機會被回收。 ```c for (p = &vp->values; *p != NULL;) { v = *p; if (!v->referenced) { *p = v->next; v->next = old_values; old_values = v; // 將 v 接到 old_values 的頭 vp->nvalues--; continue; } v->referenced = 0; p = &v->next; } ``` 理解 mark_values 的關鍵在於理解 mark_values 函數如何區分目前有效的值和無效的值。在發佈新值之前，slot-list 實作不會等待顯式寬限期。相反，它依賴於 `mark_values()` 確定哪些值可以安全回收。 ## TODO: CTP 與 Userspace RCU 的效能評估 > 注意 Userspace RCU 的 flavor 和 CTP 的 slot-{pair,list} ```shell $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 12 On-line CPU(s) list: 0,2,4,6,8-11 Off-line CPU(s) list: 1,3,5,7 Vendor ID: GenuineIntel Model name: 12th Gen Intel(R) Core(TM) i5-12450H CPU family: 6 Model: 154 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 1 Stepping: 3 CPU max MHz: 4400.0000 CPU min MHz: 0.0000 BogoMIPS: 4992.00 arch_capabilities Virtualization features: Virtualization: VT-x Caches (sum of all): L1d: 320 KiB (8 instances) L1i: 384 KiB (8 instances) L2: 7 MiB (5 instances) L3: 12 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0,2,4,6,8-11 ``` 關閉 hyperthreading 設定根據論壇討論: [Disable hyperthreading from within Linux (no access to BIOS)](https://serverfault.com/questions/235825/disable-hyperthreading-from-within-linux-no-access-to-bios) ```shell $ cat /sys/devices/system/cpu/smt/active 1 $ echo off | sudo tee /sys/devices/system/cpu/smt/control $ cat /sys/devices/system/cpu/smt/active 0 ``` 與授課教師，討論測試結果異常，可能原因出自於，我的處理器有分 P-cores 與 E-cores，因此以下的測試，都集中於 P-cores 進行。 ### 如何分辨 P-core 與 E-core 雖然從 intel 的問答區 [how to know a core is P-core or E-core](https://community.intel.com/t5/Processors/how-to-know-a-core-is-P-core-or-E-core/td-p/1468471?profile.language=zh-CN)，得知無直接得知哪一核為 P-core 或 E-core，但從reddit 討論 :[how to identify p-cores and e-cores](https://www.reddit.com/r/intel/comments/whwwvw/how_to_identify_p_and_e_cores/?rdt=51698) 得知以下操作 ``` $ lscpu --all --extended CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ 0 0 0 0 0:0:0:0 yes 4400.0000 400.0000 865.329 1 - - - - no - - - 2 0 0 1 4:4:1:0 yes 4400.0000 400.0000 1200.428 3 - - - - no - - - 4 0 0 2 8:8:2:0 yes 4400.0000 400.0000 921.834 5 - - - - no - - - 6 0 0 3 12:12:3:0 yes 4400.0000 400.0000 1223.704 7 - - - - no - - - 8 0 0 4 20:20:5:0 yes 3300.0000 400.0000 1302.425 9 0 0 5 21:21:5:0 yes 3300.0000 400.0000 400.000 10 0 0 6 22:22:5:0 yes 3300.0000 400.0000 1106.814 11 0 0 7 23:23:5:0 yes 3300.0000 400.0000 1177.823 ``` 查看 MAXMHZ 列，便能得知我有 4 P-cores 為 Cores 0, 2, 4 與 6，而 4 E-cores 為 Cores 8, 9, 10 與 11。 ### power management [cpufreq-set(1)](https://linux.die.net/man/1/cpufreq-set) > cpufreq-set - A small tool which allows to modify cpufreq settings. `-c` 設定處理器編號 `-g` 設定 cpufreq governor `--max`, `--min`: 設定 cpufreq governor 可以選擇的新的最高, 最低 CPU 頻率。接著逐一設定 p-core: 以下我設定處理器編號為 0 的 P-core 使用的 CPUfreq 策略為 perfromace ，從而避免其他 governor（如 powersave）通常發生的動態頻率調整。這確保了可預測的最佳性能: ```shell $ sudo cpufreq-set -c 0 -g performance --max 4400000 --min 4200000 ``` 用以下迴圈檢查是否設定成功 ```shell $ for core in 0 2 4 6; do echo "Core $core:"; sudo cpufreq-info -c $core; done ``` ## 實驗重現步驟除了上述的設定，以下闡述關於我測試檔的使用方式。取得程式碼: ```shell $ git clone https://github.com/eleanorLYJ/ctp ``` 測試 ctp 必要條件為建構 `libtsgv.so`，因此需要在 `ctp` 目錄中使用 `make` 獲取該檔案。而主要測試檔案存於 `ctp/test`，故再切目錄至`test`。 ```shell $ make $ cd test ``` **記憶體與吞吐量長條圖** 1. 使用 `make` 生成執行檔。 ```shell $ make ``` 2. 執行以下命令產生性能結果： ```shell $ ./run_perf.sh {RUN_MODE(0)} {NUM_READERS} {NUM_WRITERS} {VALID_CPUS} {OUTPUT_DIR} {FIFO_PRIORITY} ``` 範例： ```shell $ ./run_perf.sh 0 7 1 0,2,4,6 output 90 ``` 以上命令，將限制 0 2 4 6 編碼的 cpu 執行程式碼，之後使用不同的方法（5 urcu + 2 ctp）生成 1 個 writer 和 7 個 reader 的記憶體與吞吐量長條圖，並且將中間結果存於 output 目錄，並且設定 FIFO 優先級（整數值）為 90。 **記憶體與吞吐量折線圖** 使用 make 生成可執行文件。執行以下命令生成性能結果： ```shell $ ./run_perf.sh {RUN_MODE(1)} {NUM_READERS} {NUM_WRITERS} {VALID_CPUS} {OUTPUT_DIR} {FIFO_PRIORITY} ``` 範例： ```shell $ ./run_perf.sh 1 100 1 0,2,4,6 output 90 ``` 以上命令，這將使用不同的方法（5 urcu + 2 ctp）生成 1 個寫入者和多個讀取者 [1, 100] 的不同組合而成的的記憶體與吞吐量折線圖，並且將中間結果存於 output 目錄，並且設定 FIFO 優先級（整數值）為 90。 ### 建立 7 個 reader 執行緒 1 個 writer 執行緒，測 URCU 與 CTP 以下測試是建立 7 個 reader 執行緒，1 個 writer 執行緒，在主執行緒建立完所有的執行緒後，會等待 1 秒，接著改變終止變數，讓所有的執行緒就要停止計算讀取 tsv 或對 tsv 寫入隨機資料的計數。其中 5 個 flavor 的 URCU 中，qsbr 具比較特別的寫法，在 reader 執行緒中，每 1024 次讀取會進入一次 quiescent_state (呼叫 `rcu_quiescent_state`)，藉此writer 能夠檢測到 GP 的結束，從而判斷記憶體的回收時機。 #### throughput 比較不同實作 1 秒內 7 個 readers 的讀取次數的總和的直條圖 ![reader_comparison_bar](https://hackmd.io/_uploads/BypQeDSvA.png) 同時，比較不同實作 1 秒內 1 個 writer 的讀取次數的總和的直條圖 ![writer_comparison_bar](https://hackmd.io/_uploads/rJpXewBwR.png) 以下是在我設定 CPU affinity, 關閉 hperthreading, 設定 taskset 等操作之前的圖 (固定 2 writers，搭配 1 到 10 個 reader) 出現了只有一個執行的 reader 執行緒總讀取最多，增加執行緒後總讀取次數卻下降的奇怪現象 ![Figure_1](https://hackmd.io/_uploads/rkYr_GjLR.png) ![Figure_2](https://hackmd.io/_uploads/HktHdfsUR.png) 推估有幾個問題造成的: 1. Cache Coherency Overhead 2. Context Switching 3. CPU mitgration 我期待設定 cpu affinity 後能減少 context switch 與 CPU mitgration 帶來的影響。我在各個測試檔中設定 CPU affinity - [commit 37d7785: Add functions to set and check thread CPU affinity](https://github.com/eleanorLYJ/ctp/commit/37d7785d21d34788990c897cba021d64e05303df) - [Commit bd58843: Enhance Validation and Diagnostics in CTP Testing](https://github.com/eleanorLYJ/ctp/commit/bd58843fcd9828ecef9fd1386efad87777d192db) 設定 cpu affinity 中，兩個重要的函式 > [pthread_setaffinity_np, pthread_getaffinity_np](https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html): set/get CPU affinity of a thread 除了設定 cpu affinity 之後 ,使用 [`taskset`](https://man7.org/linux/man-pages/man1/taskset.1.html) 控制執行緒在指定 CPU 上的運行,另外關閉 [hyperthread](https://zh.wikipedia.org/zh-tw/%E8%B6%85%E5%9F%B7%E8%A1%8C%E7%B7%92) 讓 CPU 一次只處理一個執行緒。在此之後重新畫的折線圖。以下說明設定 cpu affinity 對我測試的作用: 其環境沒有關閉 hyperthread，也沒有設定 taskset 的結果: `perf stat -e cycles,instructions,cache-references,cache-misses,context-switches,cpu-migrations {.exe} 7 2` Benchmark|qsbr|signal|mb|memb|bp|slotpair|slotlist | -------- | -------- | -------- |-------- | -------- | -------- | --|-- Cycles(core/atom)|66M/55M|118M/86M|94M/63M|114M/77M|71M/52M|114M/72M|121M/77M Cache Misses (core/atom)|93k/72k|87k/71k|96k/42k|79k/55k|101k/53k|185k/78k|219k/137k context-switches|48k|50k|69k|46k|51k|49k|52k CPU Migrations|1040|1261|992|987|1296|6638|1,0552 以下為在測試檔中設定 cpu affinity 後，之後可以看到 cycles,Cache Misses, context-switches 與 cpu-mitgrations 次數都下降 Program| qsbr|signal|mb|memb|bp|slotpair|slotlist| | -------- | -------- | -------- |-------- | -------- | -------- | --|-- | Cycles (core/atom)|2.776B/3.687B|4.526B/68.07M|5.32B/4.014B|3.557B/7.597B|3.324B/3.726B|3.162B/6.289B|4.929B/5.75B| Cache Misses (core/atom)|46.28k/54.29k |57.18k/60.60k |62.64k/36.96k |43.65k/44.42k |52.10k/71.09k |128.60k/240.79k |146.68k/131.63k Context Switches|42.47k|43.19k|66.15k|45.04k|45.77k|35.54k|47.76k CPU Migrations|12|10|10|12|13|7|8 ### 建立 [1, 100] 個 reader 執行緒，1 個 writer 執行緒測 urcu 與 CTP 在讀取吞吐量測試中，qsbr 和 bp 表現最佳，一秒內的讀取次數範圍在 555,000 到 655,000 次之間。這表明它們在高頻率讀取的操作具有優異的表現。mb 的讀取表現次好，但在初期從 1 到 10 個讀取者時，其讀取次數有顯著下降，隨後變得平穩，保持在第三高的讀取次數。相比之下，memb 和 signal 表現中等，而 slotpair 和 slotlist 的讀取表現最差。 > TODO: 探討為什麼是 3 readers 和 1 writer 的組合讀取總量特別低 ![image](https://hackmd.io/_uploads/Syg90UHvC.png) 在寫入吞吐量測試中，CTP 方法（slotpair 和 slotlist）表現最佳，一秒內的寫入次數範圍在 30,000 到 41,000 次之間。其餘方法在 reader 增加到 2 以後，寫入次數驟降到十位數。 ![image](https://hackmd.io/_uploads/SkNqCISPR.png) #### 記憶體使用量 :::danger Wikipedia 不可簡稱 wiki。參見 CS:APP 第 6 章的描述及〈[每位程式開發者都該有的記憶體知識](https://sysprog21.github.io/cpumemory-zhtw/)〉。 ::: 閱讀 [Wikipedia: CPU cache](https://en.wikipedia.org/wiki/CPU_cache) + Instruction cache (I-cache): 用於加速可執行指令的存取 + Data cache (D-cache): 用於加速資料存取和儲存；資料快取通常被組織為更多快取層級的層次結構（L1、L2 等) > 補 CS:APP ch6 [perf stat 輸出解讀](http://zhengheng.me/2015/11/12/perf-stat/) 用 perf 測快取使用量 ```shell perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-stores,LLC-load-misses,LLC-store-misses ``` 我每一次測的圖形都不盡相同，我認為是因為我是寫入隨機值的原因，而以下就討論各方法的相對位置 Cache Misses: CTP 方法的 cache miss 數量較高，這表明其資料載入效率較低，可能是因為頻繁的寫入操作導致更多的資料需要從主記憶體載入。 ![Cache miss](https://hackmd.io/_uploads/Sy5F08rv0.png) L1 DCache Load Misses: CTP 方法的 load miss 數量較高，而 bp 和 qsbr 的數量較低，顯示出它們在資料載入方面的效率更高。 ![L1 Load miss](https://hackmd.io/_uploads/Sy7uAIrPC.png) LL Load Misses: CTP 方法相較其他方法更容易出現異常高的 load miss 數，這可能影響其效能，而在大部分情況下，所有方法的 load miss 數位於中間部分。 ![LL Load miss](https://hackmd.io/_uploads/Bks_AUHPC.png) L1 DCache Store Misses: 所有方法的 store miss 數量均等於 2，顯示出在寫入操作上的一致性。 ![L1 store miss](https://hackmd.io/_uploads/BkzFRIrw0.png) LL Store Misses: CTP 方法的 store miss 數量較低，這表明其在寫入操作上的效率較高。bp 和 qsbr 的 store miss 數量較高，而 mb、memb 和 signal 位於中間部分。 ![LL store miss](https://hackmd.io/_uploads/H1UKCLrDA.png) ### 建立 [1, 100] 個 reader 執行緒，1 個 writer 執行緒測試 bullet-proof URCU 是否註冊執行緒會影響效能在撰寫 urcu 測試檔中，發現 [urcu README](https://github.com/urcu/userspace-rcu/blob/master/README.md) 談到 bullet-proof 的 urcu (bp) 可以不用 ruc_init() 與註冊(reigster)執行緒。 > urcu_bp_init(), and urcu_bp_unregister_thread() all become nops, whereas calling urcu_bp_register_thread() becomes optional. 我發現了在計算 throughput 時，bp 將 rcu_register_thread() 與 rcu_unregister_thread() 移除，整體的 throughput 會下降，因此有了以下的折線圖我將淺藍色線 (bp_no_reg) 為使用bp uruc 同架構的程式碼，但 rcu_init()、 rcu_register_thread() 與 rcu_unregister_thread() 移除。 ![reader_comparison_plot](https://hackmd.io/_uploads/SJjxg3Bw0.png) writer 讀寫次數最大 103 最低到 4，然而兩測試檔測到的寫入量都太少，因此從圖片無法看出 ![writer_comparison_plot](https://hackmd.io/_uploads/Hyclg2Sv0.png) 從圖中能看出，少了 rcu_init()、 rcu_register_thread() 與 rcu_unregister_thread() 的 bp_no_reg 的吞吐量測試變得非常地不穩定，而這還需要更多的細討論原因。 > TODO: 看 bullet-proof urcu 中，呼叫 rcu_register_thread() 會動用到哪一塊原始碼。 ## TODO: 評估 CTP 的限制和改進方案  修改部分為 slotlist 的 `grow_slots()`，此函式考慮在多執行緒中無鎖擴增 slots。會發生該函式的主要競爭條位置在 slotsp (vp->slots 的最後一個 slots)上，此使用該 atomic_cas_ptr 函式嘗試寫入，這確保一次只有一個執行緒可以成功更新 slot_array 。如果一個執行緒由於另一個執行緒競爭失敗而無法更新，它將進行清理並遞迴重試。以下修改目的是要減少不必要的遞迴次數。 ```diff /* Add new slots to logical array of slots */ if (atomic_cas_ptr((volatile void **)slotsp, NULL, new_slots) != NULL) { free(new_slots->slot_array); free(new_slots); + grow_slots(vp, slot_idx, tries - 1); } - return grow_slots(vp, slot_idx, tries - 1); + return 0 } ``` 測試 7 readers 與 1 writer 的讀取或寫入次數 | | before | after | | -------- | -------- | -------- | | reader counts | 425055 | 466009 | | writer counts | 319918 | 333802 | ## 總結 ### CTP 與 URCU 相比 CTP 優點 1. 更簡單的 writer 邏輯：CTP 具有更簡單的 writer 邏輯，因為 CTP 依賴於引用計數，而不是管理多個資料版本和寬限期。 2. 寫入操作較好：在測試中，CTP 的 store miss 次數較低，這表明在寫入操作上，CTP 具有較高的效率，因為它避免了頻繁的快取儲存失敗 3. API 簡單使用，使用者不需要進行大量的程式碼修改 CTP 缺點 1. 缺少豐富並行處理資料結構，如 linked list, queue, hash 等 2. 讀取操作較差: 在測試中，CTP 的 load miss 次數較高，特別是在 L1 DCache load miss 和 LLC load miss 方面。這表明在讀取操作上，CTP 的效率不如 URCU。 :::danger 闡述更多記憶體使用量的分析。 > ok ::: ### 結論從分析程式碼，能得知 slor-pair 僅維護兩個版本的值，實作簡單， slot-list 維護多版本的值，然而在測試結果中，兩者的優勢與劣勢都相似，適合讀寫比例較均衡且需要立即記憶體釋放的場景。根據測試結果，URCU 方法適合高讀取且低寫入的應用情境，如查詢系統。具體資料顯示，一秒鐘 qsbr 和 bp 的讀取次數範圍在 555,000 到 655,000 次之間，而 CTP 方法在這方面較差，讀取次數範圍在 355,000 到 450,000 次之間，但在寫入次數上顯著優於 URCU 方法，大概落在 30000～41000 之間，而 urcu 都降至個位數，因此 CTP 方法，儘管其在記憶體載入效率（如 cache miss 和 load miss）方面的表現不如 URCU 方法，但在高頻寫入操作下，CTP 方法能提供更好的效能和效率。

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.