tsaiiuo
# Notes on RCU and Concurrent Programming

## TODO: Why is reclaiming resources a challenge in a lock-free environment?

![image](https://hackmd.io/_uploads/Bk4Ig_OWlg.png)

The RCU model has three main phases:

1. Removal: inside the writer's critical section, read the data (Read), copy it (Copy), and update it (Update).
2. Grace period: a waiting period that ensures every reader thread still referencing the removed data can finish reading.
3. Reclamation: reclaim the old data.

The intent is to defer freeing a resource until every reader that might still be reading it has finished. Because RCU is lockless, however, knowing that all such readers are done is difficult: it must be inferred indirectly through mechanisms such as the grace period and the quiescent state, and this is the biggest implementation challenge.

Typical difficulties include:

1. Long quiescence delays: if a thread stays in its critical section too long (e.g. it gets preempted or runs with interrupts disabled), the whole grace period is stretched out, and a backlog of old nodes cannot be reclaimed in time.
2. Batched reclamation accumulates the callbacks registered by many updates and runs them together after a global quiescent point. This requires maintaining callback queues, managing multiple grace-period batches, and making sure no callback is ever missed.

## TODO: What do RCU callbacks do?

### How they work

A callback is a function that is invoked only once a predefined condition is met. The mechanism has two stages:

1. Registration: record the function to call in some framework or data structure, e.g.

```cpp
call_rcu(&p->rcu_head, reclaim_callback);
```

2. Invocation: when the predefined event occurs, the framework actually calls the previously registered callback.

In RCU, callbacks are mainly used to reclaim resources, either one at a time or in batches. In single reclamation, the registered callback runs once every thread has passed through a quiescent state:

![image](https://hackmd.io/_uploads/B15gsn_bgl.png)

### Batched reclamation

Batching builds on the concept of an interval:

> A simple implementation of MemoryReclaimer could work as follows. Instead of handling each callback individually, we can introduce the concept of intervals, and group callbacks together by interval. Once every thread has called onQuiescentState, the current interval is considered to end, and a new interval is considered to begin. At the end of each interval, we know that it’s safe to invoke all the callbacks added in the previous interval, because every participating thread has called onQuiescentState since the previous interval ended.
> -- from [Using Quiescent States to Reclaim Memory](https://preshing.com/20160726/using-quiescent-states-to-reclaim-memory/)

That is, once every thread has passed through a quiescent state, the callbacks accumulated in the current interval are saved; the next time every thread passes through a quiescent state, all the callbacks from the previous interval are invoked.

![image](https://hackmd.io/_uploads/rkK6M3O-gl.png)

Why this mechanism? Every update appends a callback that frees memory. With frequent updates, invoking only one callback per grace period means freeing cannot keep up with retiring; in addition, under a high update rate the writer would wait frequently, dragging down overall throughput.

## reclaim vs. recycle

**Reclaim**: return pages or objects that are no longer in use to a lower-level manager (such as the system or swap), so that space can be released when the system runs low on memory.

**Recycle**: keep allocated and initialized memory objects in a cache pool at the same level, so the next allocation can grab one quickly and the overhead of repeated free-then-reallocate cycles is avoided.

### Slab Allocator

Reading [Slab Allocator](https://kernel.org/doc/gorman/html/understand/understand011.html) for its purpose:

> The basic idea behind the slab allocator is to have caches of commonly used objects kept in an initialised state available for use by the kernel. Without an object based allocator, the kernel will spend much of its time allocating, initialising and freeing the same object. The slab allocator aims to cache the freed object so that the basic structure is preserved between uses.

![image](https://hackmd.io/_uploads/BJ6K6qYWxx.png)

This is the core idea of the slab allocator: maintain caches of commonly used objects kept in an already-initialized state for the kernel to reuse. Without an object-based allocator, the kernel would spend a great deal of time allocating, initializing, and freeing the same kinds of objects over and over; the slab allocator's purpose is to cache freed objects, preserving their basic structure between uses.

The slab allocator has three main goals:

> 1. The allocation of small blocks of memory to help eliminate internal fragmentation that would be otherwise caused by the buddy system;
> 2. The caching of commonly used objects so that the system does not waste time allocating, initialising and destroying objects. Benchmarks on Solaris showed excellent speed improvements for allocations with the slab allocator in use;
> 3. The better utilisation of hardware cache by aligning objects to the L1 or L2 caches.
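The second goal, keeping constructed objects cached between uses, is exactly the **recycle** idea, and can be sketched as a toy free-list pool. This is my own illustrative code, not the kernel's slab implementation; the `conn_t` type and all names are made up:

```c
#include <stdlib.h>

/* Toy object cache illustrating recycle: freed objects keep their
 * initialised state and are handed straight back on the next alloc. */

typedef struct conn {
    int  fd;
    char buf[64];
    struct conn *free_next;    /* intrusive free-list link */
} conn_t;

static conn_t *free_list;      /* pool of constructed-but-unused objects */

static void conn_ctor(conn_t *c)  /* "expensive" one-time construction */
{
    c->fd = -1;
    c->buf[0] = '\0';
}

conn_t *conn_alloc(void)
{
    if (free_list) {           /* recycle: skip the constructor */
        conn_t *c = free_list;
        free_list = c->free_next;
        return c;
    }
    conn_t *c = malloc(sizeof(*c));
    if (c)
        conn_ctor(c);          /* construct only on the cold path */
    return c;
}

void conn_free(conn_t *c)      /* no destructor: state is preserved */
{
    c->free_next = free_list;
    free_list = c;
}
```

A real slab cache adds per-slab bookkeeping, coloring, and per-CPU caches on top, but the alloc/free fast path follows the same shape.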
To eliminate the internal fragmentation the traditional buddy system would otherwise cause, the slab allocator maintains two sets of caches of small buffers of differing sizes, from 2⁵ = 32 bytes up to 2¹⁷ = 131072 bytes.

For kernel structures whose initialization is expensive, it maintains a cache of already-initialized objects. Whenever a new slab is created, several objects are packed into it and a constructor initializes them; when an object is freed, its initialized state is preserved, so the next allocation can use it immediately, saving the cost of repeated initialization.

Finally, the slab allocator also optimizes use of the hardware cache. The space left over after packing objects into a slab is used for cache coloring: each slab places its objects at a different starting offset, so that objects from different slabs map to different cache lines, reducing the chance that objects sharing a cache set evict one another.

![image](https://hackmd.io/_uploads/SyQLAcFWxx.png)

Color padding: padding deliberately placed in a page's leftover space to shift the next slab's starting offset; it holds no useful data.

This is a concrete example of **recycle**.

> [Slab](https://en.wikipedia.org/wiki/SLUB_%28software%29)

## Concurrent Programming: Lock-Free Programming

### Lock-Free vs. Lockless

> lock-free: emphasizes the system-wide view; as long as execution runs long enough, at least one thread makes progress
> lockless: operating safely and correctly on shared data without using locks

The two can overlap, but lockless does not by itself guarantee lock-free, because the latter specifically requires that **given enough time, at least one thread makes progress**.

One question: shouldn't the following code be lock-free? Why does the course material say it is not?

```cpp
// Atomically perform UpperBound = max(UpperBound, y)
void RecordMax(int y)
{
    extern atomic<int> UpperBound;
    int o;
    do {
        // Take a snapshot
        o = UpperBound;
        // Quit if snapshot meets condition.
        if (o >= y)
            break;
        // Attempt to install new value.
    } while (UpperBound.compare_and_swap(y, o) != o);
}
```

### Deeper Look: Lock-Free Programming

> Reading "transaction" only through its everyday commercial sense can get in the way of understanding; it helps to think of it as a bank transfer, i.e. to picture the program's operations on data as transfers. When 10,000 dollars is transferred into someone's account and that person happens to query the balance mid-way, they should only ever see one of two results:
> 1. the number before the transfer
> 2. the number after the transfer
>
> Other than those two numbers, nothing else should be visible: even while the program is processing the amount and the data sits in some unfinished intermediate state, that intermediate state must never be observed.

This notion of a transaction matches the definition I learned earlier for database transactions, and the same definition carries over to atomic operations: until the data's state change is complete, no one should see any in-between state; only after the atomic write does the changed state become visible. Database transactions behave the same way: the whole state is updated only after the data change completes in full, and a failure mid-way triggers a rollback to the initial state.

The material then introduces the compare-and-swap (CAS) mechanism:

> Conceptually, it does the following "indivisibly":

```cpp
int CAS(int *a, int old_val, int new_val)
{
    if (*a == old_val) {
        *a = new_val;
        return old_val;
    }
    return *a;
}
```

But this mechanism has one serious flaw: the **ABA** problem. A CAS typically compares a value read earlier against the value currently behind the pointer to judge whether another thread has modified it. Such a check can only shallowly confirm that the current contents match expectations; it cannot tell whether other threads performed operations in between, which can make the code behave unpredictably!

> "The contents are the same" does not guarantee "nobody has touched the contents between then and now"; the two are different statements.

Below is my ABA example:

```cpp
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node {
    int value;
    struct Node *next;
} Node;

static _Atomic(Node *) head;

void push_node(Node *n)
{
    Node *old_head;
    do {
        old_head = atomic_load(&head);
        n->next = old_head;
    } while (!atomic_compare_exchange_weak(&head, &old_head, n));
}

Node *pop_node(void)
{
    Node *old_head, *next;
    do {
        old_head = atomic_load(&head);
        if (!old_head)
            return NULL;
        next = old_head->next;
    } while (!atomic_compare_exchange_weak(&head, &old_head, next));
    return old_head;
}
```

Let thread1 read `head` first and then be preempted by thread2, which runs `pop_node()` twice and pushes nodes back so that the node after `head` is replaced while `head` itself stays the same. In that situation, what thread1 sees is "the same", yet the structure of the linked list has in fact changed!
```cpp
#include <stdio.h>
#include <unistd.h>

extern Node *make_node(int value);  /* helper defined elsewhere */

void *thread1(void *arg)
{
    printf("[T1] start pop, reading head\n");
    Node *old = atomic_load(&head);
    if (!old)
        return NULL;
    printf("[T1] read old->value = %d, about to pop\n", old->value);
    // **got preempted by thread2**
    sleep(1);
    Node *next = old->next;
    printf("[T1] next value %d\n", next->value);
    // T1 continues, runs CAS
    if (atomic_compare_exchange_strong(&head, &old, next)) {
        printf("[T1] CAS succeeded, popped %d\n", old->value);
    } else {
        printf("[T1] CAS failed, retry or give up\n");
    }
    return NULL;
}

void *thread2(void *arg)
{
    // Make sure thread 1 reads head first
    usleep(200000);
    // pop A
    Node *a = pop_node();
    printf("[T2] popped %d\n", a->value);
    // pop B
    Node *b = pop_node();
    printf("[T2] popped %d\n", b->value);
    Node *c = make_node(3);
    // push C
    push_node(c);
    printf("[T2] pushed %d\n", c->value);
    // push A back
    push_node(a);
    printf("[T2] pushed %d\n", a->value);
    return NULL;
}
```

> lock-free does not mean mutex locks are never used; it means the program's overall execution cannot be locked up by any single thread. Consider the following code:

```cpp
while (x == 0) {
    x = 1 - x;
}
```

This code uses no lock yet fails to satisfy lock-free: it can livelock. With two threads and `x` as a shared variable, an execution order like the following keeps `x` at `0` forever, so the loop never exits:

```
x = 0 --> x = 1 - 0 --> x = 1 - 1 --> x = 0
```

### blocking vs. non-blocking

> Imagine the following: when using a mutex, if the thread holding the lock is preempted by the system before releasing it, every other thread needing the same resource cannot proceed until the original thread resumes and finishes.
>
> 1. blocking: if the system kills a thread that is still using the resource, the whole program can no longer move forward; in other words, blocking one particular thread leaves the other threads without progress.
>
> Its counterpart is non-blocking, which has the following grades:
> 1. wait-free: the strongest form of non-blocking; it guarantees every operation completes within a bounded number of steps, so given enough time, every thread makes progress.
> 2. lock-free: slightly weaker than wait-free; it allows some threads to stall but still guarantees the rest of the program makes progress. Even the worst case, where a single thread runs while all others wait, still qualifies as lock-free.
> 3. obstruction-free: the weakest form of non-blocking; it only guarantees a thread keeps running in the absence of contention. Under contention it provides no guarantee at all; in other words, livelock can occur.

### Problems arising from lock-free code

Beyond the ABA problem above, there is also the issue of sequential consistency, which divides into

1. compiler reordering
2. processor reordering

These two are the same optimization, reordering instructions, applied at two different times: compile time and run time.

> [淺談 Memory Reordering](https://web.archive.org/web/20220331124643/http://dreamrunner.org/blog/2014/06/28/qian-tan-memory-reordering/) walks through compiler barriers in detail.

The diagrams below are also quite clear:

> ![image](https://hackmd.io/_uploads/ByeINJuu-le.png)
> ![image](https://hackmd.io/_uploads/H1lSk_ubgx.png)

## TODO: Reproduce last year's experiments and understand the technical challenges of parallelizing Redis

### Reading [introduce sys_membarrier(): process-wide memory barrier (v6)](https://lwn.net/Articles/370146/)

First, what the `sys_membarrier()` system call does; from its introduction:

> Both the signal-based and the sys_membarrier userspace RCU schemes permit us to remove the memory barrier from the userspace RCU rcu_read_lock() and rcu_read_unlock() primitives, thus significantly accelerating them. These memory barriers are replaced by compiler barriers on the read-side, and all matching memory barriers on the write-side are turned into an invocation of a memory barrier on all active threads in the process. By letting the kernel perform this synchronization rather than dumbly sending a signal to every process threads (as we currently do), we diminish the number of unnecessary wake ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such barrier anyway, because these are implied by the scheduler context switches.

`sys_membarrier()` sends an IPI (inter-processor interrupt) to every running thread of the calling process, making each thread execute a real memory barrier on its own CPU; threads that are not running, or that do not belong to this process, are unaffected.

To explain further, the article gives an example:

> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

The two threads originally look like this:

```
Thread A                  Thread B
prev mem accesses         prev mem accesses
smp_mb()                  smp_mb()
follow mem accesses       follow mem accesses
```

After introducing `sys_membarrier()`, this can be changed to:

```
Thread A                  Thread B
prev mem accesses         prev mem accesses
sys_membarrier()          barrier()
follow mem accesses       follow mem accesses
```

In `Thread B` (the reader), a compiler barrier directly replaces the original `smp_mb()`, saving a great deal of cost; meanwhile `Thread A` (the writer) replaces `smp_mb()` with `sys_membarrier()`, which only "sends an IPI to every active thread in the process, forcing each to execute a full memory barrier".

After this change there are two situations:

1. Thread A and Thread B do not execute concurrently:

```
Thread A                  Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                          prev mem accesses
                          barrier()
                          follow mem accesses
```

Here `Thread B`'s accesses are only weakly ordered, but that is fine: `Thread A` does not care about ordering against them, because the two do not overlap.

2. Thread A and Thread B execute concurrently:

```
Thread A                  Thread B
prev mem accesses         prev mem accesses
sys_membarrier()          barrier()
follow mem accesses       follow mem accesses
```

`Thread B`'s barrier() only guarantees ordering within its own program, but `Thread A` is simultaneously calling `sys_membarrier()`, so the kernel sends IPIs to all running threads, each of which executes a hardware memory barrier on its core. This "upgrades" B's compiler barrier into a genuine full `smp_mb()`, enforcing ordering across cores. Threads that are not running (descheduled by the scheduler) need to do nothing, because a context switch itself implies a barrier:

> Each non-running process threads are intrinsically serialized by the scheduler.

The author also ran a performance experiment comparing expedited and non-expedited `sys_membarrier()` (on an Intel Xeon E5405):

- Expedited membarrier: 10,000,000 calls, roughly 2-3 microseconds each.
- Non-expedited membarrier: 1,000 calls, about 16 seconds total, i.e. roughly 16 milliseconds each, about 5,000 to 8,000 times slower than the expedited mode.

> Add "int expedited" parameter, use synchronize_sched() in the non-expedited case. Thanks to Lai Jiangshan for making us consider seriously using synchronize_sched() to provide the low-overhead membarrier scheme.
> Check num_online_cpus() == 1, quickly return without doing nothing.
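The reader/writer pairing above can be sketched in portable C11. This is only a stand-in, not the real scheme: `atomic_thread_fence()` here occupies the place where `sys_membarrier()` would go, and the `data`/`ready` variables are invented for illustration. The point it shows is the asymmetry: the reader pays only a compiler barrier while the writer pays the full ordering cost.

```c
#include <stdatomic.h>

/* Hypothetical published value and its ready flag, for illustration. */
static atomic_int data;
static atomic_int ready;

/* Read side (frequent): no CPU fence at all.
 * atomic_signal_fence() compiles to nothing; it only stops the
 * compiler from reordering across it -- the barrier() of the text. */
int reader_read(void)
{
    int r = atomic_load_explicit(&ready, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);   /* barrier() */
    int d = atomic_load_explicit(&data, memory_order_relaxed);
    return r ? d : -1;
}

/* Write side (infrequent): pays the ordering cost for both sides.
 * In the real scheme this fence is where sys_membarrier() goes: the
 * kernel IPIs every running thread of the process so each one also
 * executes a real memory barrier, upgrading the readers' compiler
 * barriers into full barriers. */
void writer_publish(int v)
{
    atomic_store_explicit(&data, v, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* ~ sys_membarrier() */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}
```

Note that a single local fence is weaker than what `sys_membarrier()` provides; the sketch only marks where the barriers sit in each thread's instruction stream.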
There are also tests of `sys_membarrier()`'s effectiveness. Operations in 10 s, 6 readers, 2 writers:

> (what we previously had)
> memory barriers in reader: 973494744 reads, 892368 writes
> signal-based scheme: 6289946025 reads, 1251 writes
>
> (what we have now, with dynamic sys_membarrier check, expedited scheme)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme: 4316818891 reads, 503790 writes
>
> (dynamic sys_membarrier check, non-expedited scheme)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme: 8698725501 reads, 313 writes

The first observation is that the dynamic sys_membarrier check carries extra cost, lowering both read-side and write-side throughput by roughly 7%, but the results stay around the same baseline, so the dynamic check remains enabled by default.

With `sys_membarrier()` enabled, the expedited scheme improves things considerably: the read side executes only compiler barriers, so it performs far better than the original mb scheme, and the write side improves about 500-fold over the signal-based scheme, though it is still slightly slower than the original mb scheme because of the extra `sys_membarrier()` call. In the non-expedited scheme, read-side throughput is the highest of all: the writer sends no IPIs (it uses a lighter-weight membarrier fallback), and without IPI interference, reader throughput improves markedly. The trade-off is on the write side: without IPIs the scheme relies mainly on CPU context switches (or a thread periodically waking each CPU), so it has no immediacy, and each `sys_membarrier()` call takes longer and is less predictable.

### Reading [User-space RCU](https://lwn.net/Articles/573424/#URCU%20Overview)

The main flavors are:

1. RCU using the `sys_membarrier()` system call (`liburcu`): usable when the operating system supports the `sys_membarrier` system call. Memory barriers are used on both the read and write sides; with this system call, writer performance beats signal-based RCU because the writer need not signal every reader, a single system call suffices. When using urcu, `#include <urcu.h>` selects this mode by default; if the system lacks `sys_membarrier`, memory-barrier-based RCU is used instead. Link with `-lurcu`.
2. Memory-barrier-based RCU (`liburcu-mb`): like the `sys_membarrier` flavor, it uses memory barriers on both the read and write sides; it can detect grace periods faster, at the cost of slower reads. Compile with `-DRCU_MB` and link with `-lurcu-mb`.
3. Signal-based RCU (`liburcu-signal`): performs worse than QSBR but is the mechanism closest to it; the writer must send signals to the readers, so writer performance is the worst of all. Compile with `-DRCU_SIGNAL`; according to the [official site](https://liburcu.org/), adding `-lurcu-signal` causes an error.
4. Bullet-proof RCU (`liburcu-bp`): this mode does not require `rcu_register_thread` and `rcu_unregister_thread` at thread start and exit; for bullet-proof RCU both are no-ops. Registration happens automatically before entering a critical section and unregistration on leaving it, at the cost of lower read and write performance.
5. QSBR (`liburcu-qsbr`): unlike the other flavors, the qsbr flavor requires threads to enter quiescent states periodically. A thread is quiescent whenever it is outside a critical section, so it must announce a quiescent state manually after leaving one, or the grace period never ends. Use `#include <urcu-qsbr.h>` and link with `-lurcu-qsbr`. A reader must regularly call `rcu_quiescent_state` outside the critical section so the writer can update data; urcu's [`test_urcu_qsbr.c`](https://github.com/urcu/userspace-rcu/blob/master/tests/benchmark/test_urcu_qsbr.c#L132) enters a quiescent state after every 1024 read operations. For QSBR readers, `rcu_read_lock` and `rcu_read_unlock` have no effect.

```cpp
// Worker thread reads config value:
rcu_read_lock();    // Enter RCU read-side critical section
Config *cfg = rcu_dereference(global_config);
int max_conn = cfg->max_connections;
rcu_read_unlock();  // Exit RCU read-side critical section
printf("Max connections allowed: %d\n", max_conn);
```

### Development environment

```shell
$ uname -a
Linux iantsai-P30-F5 6.11.0-26-generic #26~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 17 19:20:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
    CPU family:           6
    Model:                94
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             3
    CPU(s) scaling MHz:   23%
    CPU max MHz:          4000.0000
    CPU min MHz:          800.0000
    BogoMIPS:             6799.81
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     8 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
Vulnerabilities:
  Gather data sampling:   Vulnerable: No microcode
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                    Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Mitigation; Clear CPU buffers; SMT vulnerable
  Reg file data sampling: Not affected
  Retbleed:               Mitigation; IBRS
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Mitigation; Microcode
  Tsx async abort:        Mitigation; TSX disabled
```

### Understanding the mt-redis source

To reproduce last year's experiments, the first step is understanding the [mt-redis](https://github.com/jserv/mt-redis) code, beginning with its data structures. `yeh-sudo`'s notes organize them very thoroughly, so I mainly read those notes to understand the data structures. For the term project, the key question is where the `rcu`-related functions are needed; this is mostly in `q_dict.c` and `db.c`, which implement the lock-free RCU hash table and the database-layer logic (translating user commands such as get/set into the corresponding `q_dict` operations), respectively.

> Reference: last year's notes, [yeh-sudo/mt-redis 運作原理](https://hackmd.io/@yehsudo/mt-redis-work#User-space-RCU-%E5%9C%A8-mt-redis-%E4%B8%AD%E7%9A%84%E6%87%89%E7%94%A8)

While reading the source I also found a duplicated include, and sent a PR for it:

> [Commit 0a07584d31b3664c4fc14d2d46fd287ce6acb29f](https://github.com/jserv/mt-redis/pull/1)

### memb flavor

userspace-rcu defaults to the memb flavor, so it is enough to build mt-redis and benchmark it with `memtier_benchmark`. I set the write-to-read ratio to 1:100 and 1:10; material found online suggests most cache workloads fall into these two patterns, so only these two scenarios are tested.

> In [Memcached](https://openbenchmarking.org/test/pts/memcached) benchmark results, about 43.9% of the tests use a SET:GET = 1:100 configuration.
>
> [Google Cloud Memorystore](https://cloud.google.com/blog/products/databases/scaling-google-cloud-memorystore-for-high-performance) mentions: "By default, the utility issues 10 get commands for every set command."

```shell
memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:10 --requests=20000 --json-out-file=memb-2
```

```
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets        15465.59          ---          ---         1.69288         1.53500         6.68700        10.55900      1191.19
Gets       154579.35     77421.46     77157.89         1.11949         0.98300         3.71100         6.59100      8591.59
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals     170044.94     77421.46     77157.89         1.17164         0.99100         4.01500         7.61500      9782.78
```

```shell
memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:100 --requests=20000 --json-out-file=memb-2
```

```
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets         1782.18          ---          ---         1.34007         1.09500         3.91900         7.39100       137.32
Gets       177331.16     87998.39     89332.78         1.11098         1.03900         2.59100         5.40700      9828.71
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals     179113.34     87998.39     89332.78         1.11326         1.04700         2.60700         5.43900      9966.03
```

> But in 2023, urcu-signal was deemed deprecated!
[Commit aad674a](https://github.com/urcu/userspace-rcu/commit/aad674a9a583e09e854145f18c5d8854269dce8c) is the final commit of the urcu-signal removal. Its commit message recommends switching to urcu-memb, which obtains similar read-side performance without reserving a signal, and also has improved grace-period performance.

### qsbr flavor

The biggest difference between qsbr and memb is that the grace-period side needs extra work from the application, and in qsbr `rcu_read_lock` and `rcu_read_unlock` do nothing. Switching only requires changing `urcu` to `urcu-qsbr`, which redefines those two functions as no-ops, so that part of the code needs no change; what must be added is a call to `rcu_quiescent_state` after the places where those functions appear, to announce that the thread has left the critical section. Here `rcu_quiescent_state` is called once every 1024 reads, following [test_urcu_qsbr.c](https://github.com/urcu/userspace-rcu/blob/master/tests/benchmark/test_urcu_qsbr.c#L132).

```shell
memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:10 --requests=20000 --json-out-file=memb-2
```

```
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets        15713.07          ---          ---         1.53050         1.36700         5.91900        10.04700      1210.26
Gets       157052.94         8.64    157044.31         1.12175         1.02300         3.29500         6.11100      6117.59
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals     172766.01         8.64    157044.31         1.15893         1.03100         3.59900         7.10300      7327.84
```

```shell
memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:100 --requests=20000 --json-out-file=qsbr-2
```

```
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets         1794.26          ---          ---         1.30414         1.07900         4.47900         7.80700       138.25
Gets       178533.69         0.00    178533.69         1.09649         1.03900         2.44700         5.50300      6953.73
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals     180327.95         0.00    178533.69         1.09856         1.03900         2.46300         5.53500      7091.98
```

Each measurement was run ten times:

- [ ] GET Ops/sec
![Get_Ops_sec](https://hackmd.io/_uploads/S1SUxKrfge.png)
- [ ] GET Latency
![Get_Latency](https://hackmd.io/_uploads/SymU0dSGee.png)
- [ ] GET p50 Latency
![Get_p50_Latency](https://hackmd.io/_uploads/SJ4ORdSzgl.png)
- [ ] GET p99 Latency
![Get_p99_Latency](https://hackmd.io/_uploads/r1JiCdBMex.png)
- [ ] GET p99.9 Latency
![Get_p99_9_Latency](https://hackmd.io/_uploads/S1HsR_HGel.png)
- [ ] SET Ops/sec
![Set_Ops_sec](https://hackmd.io/_uploads/H14BgtHGgx.png)
- [ ] SET Latency
![Set_Latency](https://hackmd.io/_uploads/HybJJFBMge.png)
- [ ] SET p50 Latency
![Set_p50_Latency](https://hackmd.io/_uploads/S1-1JYrMxl.png)
- [ ] SET p99 Latency
![Set_p99_Latency](https://hackmd.io/_uploads/SJZJJFHzex.png)
- [ ] SET p99.9 Latency
![Set_p99_9_Latency](https://hackmd.io/_uploads/B1BNgFSGeg.png)
