Linux 核心專題: rhashtable 及核心記憶體管理研究

## Linux 核心專題: rhashtable 及核心記憶體管理研究 > 執行人: SHChang-Anderson > [GitHub](https://github.com/SHChang-Anderson/rcuhashbash) > [專題解說錄影](https://youtu.be/FX0I8hzsizc) > [專題解說簡報](https://drive.google.com/file/d/1gY2m_WhPAvyRSgV5SIvRD8dnTLTHtdHZ/view) ### Reviewed by `kevinzxc1217` 當 hash table 進行 resize 過程中，會有新的 future table，即同時存在兩個 hash table，請問這樣會不會造成記憶體空間不足的問題？ > 新增 Bucket lock 和 Future Table 功能後，會增加記憶體使用量。其中，Future Table 本身會佔用一部分記憶體，而每個 Bucket lock 也會個別佔用記憶體空間。 > [name=SH] ### Reviewed by `fennecJ` * 針對 Table shrinking 的手法，可否分析該操作中 bucket 的數量對時間複雜度的影響為何？ * 在介紹 Table shrinking 的手法時，解說影片提及為了避免還有人在對 hash table 進行存取，要等待一段時間才把指標刪除，這裡的「等待一段時間」在影片中提及是透過 RCU 實作。顯然，隨著存取 hash table 的 thread 數量變多，要等待的時間也會相應變長。可否提供分析計算出存取 hash table 的 thread 數量和等待時間的數學關係為何？ * 根據你提供的實驗結果，可見 Lookups/sec 大致和 reader threads 呈固定比例，直到 reader threads 數量超過 8 個後，斜率有顯著下降，請問該現象的成因可能為何？ >根據〈[Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming](https://www.usenix.org/legacy/event/atc11/tech/final_files/Triplett.pdf)〉所提供的實驗環境，擁有 24 個硬體支援的執行緒，因此當讀取器（readers）數量低於硬體支援的執行緒數量時，效能呈現線性成長。然而，本實驗環境僅有 6 核心 12 執行緒，當讀取器數量達到 8 個時，效能就顯著下降。這是因為 16 個讀取器超過了可用執行緒的數量，導致效能瓶頸。 >[name=SH] ### Reviewed by `ssheep773` 在 table expansion 時若從 bucket 中串列的最後一個節點開始，是否可以減少資料競爭的情況。 >在 unzip 過程中，即使從 bucket 串列的最後一個節點開始走訪，仍需持續走訪鏈結串列，直到遇到不屬於同一個 bucket 的節點，並重複此過程。由於這整個過程都需要持有該 bucket lock，因此無法有效減少資料競爭的情況。此外，根據最新的 [lib/rhashtable.c](https://github.com/torvalds/linux/blob/master/lib/rhashtable.c) 實作，unzip 操作已被取消，改為逐一將舊的 bucket 節點透過 rehash 函式重新分配到新的 hash table。因此，無論從鏈結串列的開頭或結尾開始，資料競爭的情況皆相同，因為需要走訪整個鏈結串列才能完成該 bucket 的 rehash。 [name=SH] ## 任務說明閱讀 [rhashtable](https://hackmd.io/@linjohn/rhashtable) 及 Linux 核心記憶體管理研究、紀錄問題，確保對應的程式碼可運作於 Ubuntu Linux 24.04 (搭配 Linux v6.8)。 ## 開發環境 ```shell $ gcc --version gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 12 On-line CPU(s) list: 0-11 Vendor ID: AuthenticAMD Model name: AMD Ryzen 5 5500U with Radeon Graphics CPU family: 23 Model: 104 Thread(s) per core: 2 Core(s) per socket: 6 Socket(s): 1 Stepping: 1 CPU(s) scaling MHz: 26% CPU max MHz: 4056.0000 CPU min MHz: 400.0000 BogoMIPS: 4192.08 $ uname -r 6.8.0-35-generic ``` ## TODO: 閱讀 [rhashtable](https://hackmd.io/@linjohn/rhashtable)，紀錄核心記憶體管理相關問題 > 可重用之前的材料，但需要針對 Linux v6.8+ 更新 ### [2023 開發紀錄](https://hackmd.io/@sysprog/HkKKE5eB3) 回顧及文獻探討 > [Relativistic hash tables, part 1: Algorithms](https://lwn.net/Articles/612021/) 事先理解 rhashtable ，參考過去共筆以及 [Relativistic hash tables, part 1: Algorithms](https://lwn.net/Articles/612021/) 了解 hashtable 的 resize 方式。 #### Table shrinking 若雜湊表中 buckets 數量過多，希望降低 buckets 數量使用 Table shrinking 的方式重新分配雜湊表。 * 第一步驟建立一個新的雜湊表並指向第一個 bucket 中串列的第一個節點。 * 接著將第一個 bucket 鏈結串列結尾指向下一個串列的開頭，值得注意的是這樣的作法避免在 resize 的過程中避免還在讀取的 readers 找不到節點的情況發生。 * 最後將舊的雜湊表指標指向新的雜湊表，這個時候舊的雜湊表尚不能回收，需要等待一個 [grace period](https://hackmd.io/@sysprog/linux-rcu#%E5%AF%AC%E9%99%90%E6%9C%9F) 避免尚為讀取完成的 readers 發生錯誤。 #### Table expansion 若雜湊表內 buckets 數量過少導致鏈結串列過長，搜尋效率低，需要將 buckets 內的節點分配到更多新的 buckets 上，確保負載更均勻地分散，這樣可以提升雜湊表操作的效率。 * 首先建立一個新的雜湊表，並加入更多的 buckets 。使新的雜湊表中每一個 buckets 連接到對應的舊節點。 * 將雜湊表指標指向新的雜湊表，保留舊有的雜湊表避免本來的 reader 發生錯誤。 * 等待一個 [grace period](https://hackmd.io/@sysprog/linux-rcu#%E5%AF%AC%E9%99%90%E6%9C%9F) 過後執行 Unzip 操作。 #### 閱讀論文〈[Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming](https://www.usenix.org/legacy/event/atc11/tech/final_files/Triplett.pdf)〉雜湊表為作業系統提供便利性，使得**平均存取與修改資料的時間為常數**。在並行處理中使用的雜湊表需要同步機制來維持內部資料的一致性。現有的雜湊表主要利用**互斥鎖**（mutex lock）實作並行處理。多個執行緒同時嘗試取得同一個互斥鎖時，就會發生爭搶。這會導致執行緒等待，降低程式的執行效率。 **隨著處理器數量增加**，問題會更嚴重，因為會有更多執行緒同時競爭互斥鎖。這使得程式無法有效利用多個處理器，難以擴展。現有一種可擴展的並行雜湊表解決方案是**讀取-複製-更新**（RCU，Read-Copy Update）。RCU 為並行程式提供了一種同步機制，對讀取器的開銷非常低。因此，RCU 特別適用於**讀取比寫入顯著多的資料結構**。現有的基於 RCU 的雜湊表存在一個主要缺陷：**無法動態調整大小**。雜湊表的效能與 hash buckets 的數量相關。**太少會導致效能低落，太多又會佔用過多記憶體**。系統需要一個**能根據需求自動調整大小**的雜湊表，以維持最佳效能。 **核心問題**： * 動態調整大小的困難：在不阻塞讀取操作的前提下，實作 RCU 雜湊表的動態調整大小是一個難題。 * 現有方案的不足：過去的解決方案，如 DDDS，會在調整大小期間導致查找速度變慢，影響效能。 **論文目標**：本論文介紹一種演算法，能夠在不阻塞或減慢並行查找的情況下，動態調整基於 RCU 的雜湊表大小。利用 RCU 的 wait-for-readers 操作來實作動態調整大小，同時不影響讀取操作。 #### Relativistic Hash Tables Relativistic Hash Table 是一種高**並行性**的雜湊表實作，它允許在調整大小的同時進行讀寫操作。 **核心概念：** * **Imprecise hash chains：** 在調整大小過程中，雜湊鏈結串列可能暫時包含不屬於該 bucket 的節點，以避免資料複製和浪費記憶體。 * **整數倍調整：** 雜湊表的大小調整限制為 bucket 數量的整數倍，簡化了調整演算法。 * **RCU 同步：** 使用 RCU（Read-Copy Update）機制實作讀寫並行，避免讀取操作被調整大小阻塞。 **實作步驟：** 1. **分配新雜湊表：** 配置一個容量較小的新雜湊表。 2. **連結新舊 bucket：** 將新表中的每個 bucket 連結到舊表中對應的第一個桶。 3. **串接舊 bucket 的鏈結串列：** 將每個舊 bucket 的鏈結串列末端連接到下一個舊 bucket 的開頭。 5. **設定雜湊表大小：** 更新雜湊表的大小資訊，以反映新的 bucket 數量。 5. **發布新雜湊表：** 讓讀取操作可以開始使用新的雜湊表。 6. **等待 readers：** 使用 `wait-for-readers` 操作等待所有在舊雜湊表上進行的讀取操作完成。確保沒有新的 reader 會再存取舊雜湊表。 7. **回收舊雜湊表：** 釋放舊雜湊表所佔用的記憶體空間。 ### Relativistic hash tables Implementation > 此部份將探究 rhashtable 的在 Linux 核心的實作。 #### rhashtable structure 首先看到 `rhashtable` 的結構： ```c /** * struct rhashtable - Hash table handle * @tbl: Bucket table * @key_len: Key length for hashfn * @max_elems: Maximum number of elements in table * @p: Configuration parameters * @rhlist: True if this is an rhltable * @run_work: Deferred worker to expand/shrink asynchronously * @mutex: Mutex to protect current/future table swapping * @lock: Spin lock to protect walker list * @nelems: Number of elements in table */ struct rhashtable { struct bucket_table __rcu *tbl; unsigned int key_len; unsigned int max_elems; struct rhashtable_params p; bool rhlist; struct work_struct run_work; struct mutex mutex; spinlock_t lock; atomic_t nelems; }; ``` * `tbl`：這是一個指向 `bucket_table` 結構的指標。`bucket_table` 儲存了雜湊表中的所有 buckets。 * `key_len`：是一個整數，表示雜湊函數所使用的鍵值的長度。 * `max_elems`：是一個整數，表示雜湊表中最大允許的元素數量。 * `p`：這是一個 `rhashtable_params` 結構，表示雜湊表所使用的參數。 * `rhlist`：這是一個布林值，表示是否為一個 `rhltable`。 :::info rhltable：允許多個元素擁有**相同鍵值**的雜湊表 struct rhltable 是 Linux 核心用來實作雜湊表鏈結串列（relativistic hash list table）的資料結構。它繼承了雜湊表的所有功能，但允許在同一個 bucket 中儲存多個具有相同鍵值的元素。 ```c /** * struct rhltable - Hash table with duplicate objects in a list * @ht: Underlying rhtable */ struct rhltable { struct rhashtable ht; }; ``` ::: * `run_work`：這是一個 `work_struct` 結構，用於執行擴張 (expansion) 或縮小 (shrink) 的 worker。 * `mutex`：互斥鎖，用於保護雜湊表避免發生同步存取導致資料不一致的問題。 * `lock`：這是一個 spinlock ，用於保護 walker list。 :::info rhashtable_walker 結構體：雜湊表中，讀取操作可能會與 resize 同時發生。為了確保讀取操作的正確性，雜湊表引入了 walker 。 * list：一個 `list_head` 結構，這個鏈結串列記錄了所有正在進行讀取操作的 walkers。 * tbl：指向 `bucket_table` 結構的指標。 tbl 記錄了這個 walker 正在走訪的雜湊表。 ```c /** * struct rhashtable_walker - Hash table walker * @list: List entry on list of walkers * @tbl: The table that we were walking over */ struct rhashtable_walker { struct list_head list; struct bucket_table *tbl; }; ``` ::: * `nelems`：表示雜湊表中目前元素的數量。 ## TODO: 確保 [rhashtable](https://hackmd.io/@linjohn/rhashtable) 對應的程式碼可運作於 Linux v6.8 ### 嘗試執行程式測試程式 #### 前期準備 * 檢查 Linux 核心版本 ```shell $ uname -r 6.8.0-31-generic ``` * 安裝核心標頭檔 linux-headers 套件 ```shell $ sudo apt install linux-headers-`uname -r` ``` * 更新套件清單 ```shell $ sudo apt update ``` * 確認 `linux-headers` 套件已正確安裝於開發環境 ```shell $ dpkg -L linux-headers-`uname -r` | grep "/lib/modules/.*/build" ``` * 得到以下輸出 ```shell /lib/modules/6.8.0-31-generic/build ``` ### 測試程式碼找到 Linux v6.8+ 版本的測試程式碼 [test_rhashtable.c](https://elixir.bootlin.com/linux/v6.8.12/source/lib/test_rhashtable.c) 並嘗試執行。 * 建立 `Makefile` 以編譯測試程式碼 ``` obj-m := test_rhashtable.o all: make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules clean: rm -rf *.o *.ko *.mod.* *.symvers *.order *.mod.cmd *.mod ``` * 掛載測試核心模組 ```shell $ sudo insmod test_rhashtable.ko ``` * 顯示核心資訊 ```shell $ sudo dmesg ``` 這是一段針對 rhashtable 進行的測試程式，主要測試其效能和功能。測試內容如下： **基本操作測試：** - 測試新增、刪除和走訪鍵值對的效能。 - 重複測試四次，每次新增 50000 個鍵值，然後刪除，並記錄測試時間。 **最大容量測試：** - 測試 rhashtable 是否能正確處理超過最大容量的情況，確保不會發生錯誤或資料遺失。 **重複插入測試：** - 測試 rhashtable 如何處理插入重複的鍵值，驗證是否能正確管理具有相同鍵的多個值。 **多執行緒並行測試：** - 模擬 10 個執行緒同時存取 rhashtable，測試 rhashtable 在並行環境下的效能和正確性。 **rhltable 鏈結串列測試：** - 測試與 rhashtable 相關的 rhltable 鏈結串列功能，包括新增、刪除和隨機操作，確保其功能正常運作。 **測試結果：** - 平均測試時間：每次新增 50000 個鍵值對並刪除的平均時間約為 49.15 毫秒。 - 最大容量測試：rhashtable 能正確處理超過最大容量的情況。 - 重複插入測試：rhashtable 能正確處理重複的鍵，將新值新增至鏈結串列中。 - 多執行緒並行測試：10 個執行緒並行測試成功，rhltable 鏈結串列測試回傳 0，表示未發生錯誤。 ```c [ 327.908633] Running rhashtable test nelem=8, max_size=0, shrinking=0 [ 327.908640] Test 00: [ 327.908712] Adding 50000 keys [ 327.928050] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 327.939587] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 327.939596] Deleting 50000 keys [ 327.955622] Duration of test: 46904488 ns [ 327.955641] Test 01: [ 327.955729] Adding 50000 keys [ 327.976176] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 327.986102] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 327.986109] Deleting 50000 keys [ 328.002335] Duration of test: 46601311 ns [ 328.002353] Test 02: [ 328.002470] Adding 50000 keys [ 328.024717] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 328.035320] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 328.035331] Deleting 50000 keys [ 328.050550] Duration of test: 48074468 ns [ 328.050568] Test 03: [ 328.050668] Adding 50000 keys [ 328.072286] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 328.082694] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 328.082703] Deleting 50000 keys [ 328.098314] Duration of test: 47642464 ns [ 328.106790] test if its possible to exceed max_size 8192: no, ok [ 328.106876] Average test time: 47305682 [ 328.106879] test inserting duplicates [ 328.106888] ---- ht: ---- bucket[1] -> [[ val 21 (tid=1) ]] -> [[ val 1 (tid=0) ]] ------------- [ 328.106900] ---- ht: ---- bucket[1] -> [[ val 21 (tid=1) ]] -> [[ val 1 (tid=2), val 1 (tid=0) ]] ------------- [ 328.106908] ---- ht: ---- bucket[1] -> [[ val 21 (tid=1) ]] -> [[ val 1 (tid=0) ]] ------------- [ 328.106917] ---- ht: ---- bucket[1] -> [[ val 21 (tid=1) ]] -> [[ val 1 (tid=2), val 1 (tid=0) ]] ------------- [ 328.106921] Testing concurrent rhashtable access from 10 threads [ 328.350688] test 3125 add/delete pairs into rhlist [ 328.381459] test 3125 random rhlist add/delete operations [ 328.410140] Started 10 threads, 0 failed, rhltable test returns 0 ``` * 卸載核心模組 ```shell $ sudo rmmod test_rhashtable ``` ### v6.8+ 與 v5.15+ 比較 #### `workqueue_types.h` 將 `workqueue_types.h` 這個標頭檔從 `workqueue.h` 中獨立出來。減少 `sched.h`（Linux 核心的排程相關標頭檔）的依賴性，進一步減少` rhashtable-types.h` 對 `workqueue.h` 的依賴。 > commit [b2fa844](https://github.com/torvalds/linux/commit/b2fa8443db320c4873feca2588b957439e350890) #### Allow rhashtable to be used from irq-safe contexts 在 commit [e47877c](https://github.com/torvalds/linux/commit/e47877c7aa821c413b45e05f804819579bdfa1a3) 中，主要目的是允許 rhashtable 在 IRQ-safe contexts 中使用。原始程式碼中使用 `local_bh_disable()` 和 `local_bh_enable()` 來保護臨界區間的資料，這表示 rhashtable 原本的使用情境只能防止被 Bottom Half 中斷。以下是修改內容： ```diff - static inline void rht_lock(struct bucket_table *tbl, - struct rhash_lock_head __rcu **bkt) + static inline unsigned long rht_lock(struct bucket_table *tbl, + struct rhash_lock_head __rcu **bkt) { - local_bh_disable(); + unsigned long flags; + local_irq_save(flags); bit_spin_lock(0, (unsigned long *)bkt); lock_map_acquire(&tbl->dep_map); + return flags; } ``` ```diff static inline void rht_unlock(struct bucket_table *tbl, - struct rhash_lock_head __rcu **bkt) + struct rhash_lock_head __rcu **bkt, + unsigned long flags) { lock_map_release(&tbl->dep_map); // 釋放鎖映射的互斥鎖 bit_spin_unlock(0, (unsigned long *)bkt); // 釋放桶的位元自旋鎖 - local_bh_enable(); + local_irq_restore(flags); } ``` **改動解釋：** * **使用 `local_irq_save(flags)` 和 `local_irq_restore(flags)`：** * 取代了原本的 `local_bh_disable()` 和 `local_bh_enable()`。 * `local_bh_disable()` 只關閉 bottom half 的中斷，而 `local_irq_save(flags)` 能夠保存當前的中斷狀態，並關閉所有中斷，確保在 `rht_lock` 和 `rht_unlock` 之間的臨界區段不會被任何中斷打斷，從而實作 IRQ-safe 的同步機制。 * **傳遞中斷狀態旗標 `flags`：** * `rht_lock` 返回保存的中斷狀態 `flags`，`rht_unlock` 接收這個旗標，以確保在 `rht_unlock` 中能夠正確恢復 `rht_lock` 之前的中斷狀態，包括中斷的開啟或關閉狀態。 ### 在 Ubuntu Linux 24.04 執行 [rcuhashbash-resize](https://github.com/SHChang-Anderson/rcuhashbash/) :::danger 解釋 rcuhashbash 的原理，特別是其 RCU 使用的考量因素。 ::: #### 編譯並執行下載程式碼 ```shell $ git clone https://github.com/chiangkd/rcuhashbash.git ``` 編譯程式碼 ```shell make ``` 加入 [Cppcheck](https://cppcheck.sourceforge.net/) 之後提交時出現以下錯誤： ```c rcuhashbash-resize.c:213:21: style: Condition '!&entry->node' is always false [knownConditionTrueFalse] if (!&entry->node) ^ rcuhashbash-resize.c:220:21: style: Condition '&entry->node' is always true [knownConditionTrueFalse] if (&entry->node) ^ rcuhashbash-resize.c:150:35: style: Variable 'new_table' can be declared as pointer to const [constVariablePointer] struct rcuhashbash_table *new_table; ^ rcuhashbash-resize.c:208:33: error: Uninitialized variable: entry->value [uninitvar] if ((entry->value & mask2) != (entry_prev->value & mask2)) ``` 首先檢查程式碼， `if (!&entry->node)` ， `&entry->node` 取得 `entry` 結構體中 `node` 成員的記憶體地址。由於若`entry` 存在，則 `node` 成員的記憶體地址永遠不可能為空。除此之外，`old_table->buckets[i].first = &entry->node;`，若 `entry == NULL` 對 `entry->node` 進行取成員操作，可能造成未定義的行為。 ```c hlist_for_each_entry(entry, &old_table->buckets[i], node) { if ((entry->value & mask2) != (entry_prev->value & mask2)) break; entry_prev = entry; } old_table->buckets[i].first = &entry->node; if (!&entry->node) continue; ``` 因此我對程式碼進行以下修正： ```diff - old_table->buckets[i].first = &entry->node; - if (!&entry->node) + if (!entry) { + old_table->buckets[i].first = NULL; + continue; + } + old_table->buckets[i].first = &entry->node; moved_one = true; hlist_for_each_entry(entry, &old_table->buckets[i], node) if ((entry->value & mask2) == (entry_prev->value & mask2)) break; + if (!entry) { + entry_prev->node.next = NULL; + continue; + } entry_prev->node.next = &entry->node; - if (&entry->node) ``` ----- ```c rcuhashbash-resize.c:150:35: style: Variable 'new_table' can be declared as pointer to const [constVariablePointer] ``` 這個訊息是一個程式碼風格建議，建議將 `new_table` 變數宣告為指向 `const` 的指標。這麼做的原因是，在 rcuhashbash_resize 函式中，new_table 指標所指向的 hashtable 內容在建立後就不會再被修改。使用 const 可以明確表示這一點，提高程式碼的可讀性和安全性。 ---- ```c rcuhashbash-resize.c:208:33: error: Uninitialized variable: entry->value [uninitvar] ``` 這是一個錯誤訊息，表示在使用 `entry->value` 時，該變數尚未被初始化。這可能會導致未定義的行為，需要修正。因此在 `hlist_for_each_entry` 迴圈之前，應該對 `entry_prev` 指標進行初始化。檢查程式碼發現 `entry->value` 的初始話被定義在 `rcuhashbash_init` 當中，不會出現未定義的問題，因此我使用 `--suppress=uninitvar:rcuhashbash-resize.c` 抑制錯誤訊息。確認程式的正確性，能夠正確的進行量測。 ```c [ 1347.255348] rcuhashbash summary: test=rcu readers=5 resize=true rcuhashbash summary: entries=65536 shift1=13 (8192 buckets) shift2=14 (16384 buckets) rcuhashbash summary: reads: 92811811 primary hits, 0 slowpath primary hits, 0 secondary hits, 0 misses rcuhashbash summary: resizes: 78 rcuhashbash summary: PASS rcuhashbash summary: total time: 3035726612 ns [ 1347.255368] rcuhashbash done ``` > commit [8e2e509](https://github.com/SHChang-Anderson/rcuhashbash/commit/8e2e509dee46f0f0aed2b909a086f1ad16205d09) ### 重現實驗結果 #### 實驗參數參照 [rhashtable](https://hackmd.io/@linjohn/rhashtable) 的實驗方式分別調整三中不同演算法（rcu, ddds, rwlock）與 reader 的數量，比較每秒 lookup hit 的次數，而 resizer 在過程中會不斷調整 table 的大小，從 $2^{13}$ 擴張到 $2^{14}$ 再縮小回去。 ```c static char *test = "rcu"; /* Hash table implementation to benchmark */ static int readers = 1; static bool resize = true; static u8 shift1 = 13; static u8 shift2 = 14; static unsigned long entries = 65536; ``` #### 實驗結果三中不同演算法，依序執行 1,2,4, 8, 16 個 readers 利用 shell script 固定每次測試時間為 10 秒 * **固定 table 大小** 固定大小在 $2^{13}$，比較不同演算法的查找次數 ![](https://hackmd.io/_uploads/HJq6B2UU0.png) * **開啟 resize** resize thread 會不斷在 $2^{13}$ 到 $2^{14}$ 之間調整雜湊表大小，比較在 resizing 時不同演算法的查找次數 ![](https://hackmd.io/_uploads/HyIUvZvI0.png) 實驗結果與 [rhashtable](https://hackmd.io/@linjohn/rhashtable) 相仿 rhashtable 在測試中都提供了線性可擴展的查找性能。在固定大小時，論文演算法與 DDDS 差不多。在 resizing 時，論文演算法明顯優於 DDDS。 ### 現行版本 rhashtable 與測試程式碼不同之處閱讀現行版本的 [rashtable.h](https://elixir.bootlin.com/linux/v6.8-rc1/source/include/linux/rhashtable.h) 程式碼時，我注意到在 rhashtble 的 bucket 結構體中存在 `future_tbl` ，從註解中可以得知， `future_tbl` 是在 resize 期間額外的 hashtable ，而在 [rcuhashbash](https://github.com/SHChang-Anderson/rcuhashbash/) 中並沒有出現這樣的結構。 ```diff /** * struct bucket_table - Table of hash buckets * @size: Number of hash buckets * @nest: Number of bits of first-level nested table. * @rehash: Current bucket being rehashed * @hash_rnd: Random seed to fold into hash * @walkers: List of active walkers * @rcu: RCU structure for freeing the table +* @future_tbl: Table under construction during rehashing * @ntbl: Nested table used when out of memory. * @buckets: size * hash buckets */ struct bucket_table { unsigned int size; unsigned int nest; u32 hash_rnd; struct list_head walkers; struct rcu_head rcu; + struct bucket_table __rcu *future_tbl; struct lockdep_map dep_map; struct rhash_lock_head __rcu *buckets[] ____cacheline_aligned_in_smp; }; ``` 使用 git log 找到最初使用到這樣結構的版本，在 commit [97def1](https://github.com/torvalds/linux/commit/97defe1ecf868b8127f8e62395499d6a06e4c4b1) 可以找到相關的更動以及說明。這個 patch 主要介紹了針對 RCU hash table 的幾項改進： **1. Spinlock 陣列保護 Bucket 操作：** ```diff /** * struct bucket_table - Table of hash buckets * @size: Number of hash buckets * @locks_mask: Mask to apply before accessing locks[] * @locks: Array of spinlocks protecting individual buckets * @buckets: size * hash buckets */ struct bucket_table { struct bucket_table { size_t size; size_t size; unsigned int locks_mask; + spinlock_t *locks; struct rhash_head __rcu *buckets[]; struct rhash_head __rcu *buckets[]; }; }; ``` * 引入 spinlock 陣列來保護 bucket 的修改。 * 允許插入和刪除與 resize 並行。 **2. 擴展/縮小時的 Future Table：** * 在擴展或縮小時，分配的新 bucket 表在調整大小過程開始時立即公開為 **future table** 。 * 查找、刪除和插入操作將短暫使用兩個 table。 * 在 RCU 寬限期結束且舊表到新表的初始鏈接完成後，future table 成為主要的 table。 ```c void *rhashtable_lookup(struct rhashtable *ht, const void *key) { const struct bucket_table *tbl, *old_tbl; struct rhash_head *he; u32 hash; BUG_ON(!ht->p.key_len); rcu_read_lock(); old_tbl = rht_dereference_rcu(ht->tbl, ht); tbl = rht_dereference_rcu(ht->future_tbl, ht); hash = key_hashfn(ht, key, ht->p.key_len); restart: rht_for_each_rcu(he, tbl, rht_bucket_index(tbl, hash)) { if (memcmp(rht_obj(ht, he) + ht->p.key_offset, key, ht->p.key_len)) continue; rcu_read_unlock(); return rht_obj(ht, he); } if (unlikely(tbl != old_tbl)) { tbl = old_tbl; goto restart; } rcu_read_unlock(); return NULL; } ``` 在查找的程式碼中可以觀察到，會對兩個雜湊表進行查找，因此可得知，此舉的目的使得查找操作可能與雜湊表修改和大小調整同時發生。若我們對照這兩個版本之間的程式碼：在更新前的插入程式碼中，從註解我們可以得知： > The caller must ensure that no concurrent table mutations occur. It is however valid to have concurrent lookups if they are RCU protected. 使用者在進行插入操作時，必須確保沒有同時發生的雜湊表大小調整操作。然而，如果查找操作受到 RCU 保護，則允許並行查找。 ```c void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj) { //... rcu_read_lock(); tbl = rht_dereference_rcu(ht->future_tbl, ht); hash = head_hashfn(ht, tbl, obj); lock = bucket_lock(tbl, hash); spin_lock_bh(lock); RCU_INIT_POINTER(obj->next, tbl->buckets[hash]); rcu_assign_pointer(tbl->buckets[hash], obj); spin_unlock_bh(lock); //... } ``` 對照更新後的 `rhashtable_insert` 以及 `rhashtable_expand` 程式碼，可以發現每個 bucket 都受到 spinlock 的保護，使得針對不同 bucket 的插入和刪除操作可以並行。 > Make insertions go into the new, empty table right away. Deletions and lookups will be attempted in both tables until we synchronize. The synchronize_rcu() guarantees for the new table to be picked up 插入操作時會直接插入 `future table` 當中， `synchronize_rcu()` 能確保 `future table` 被採用，因此在進行 `unzip` 操作時，不會有新的插入操作進入原始的 `hashtable`。 ```c int rhashtable_expand(struct rhashtable *ht) { //... /* Make insertions go into the new, empty table right away. Deletions * and lookups will be attempted in both tables until we synchronize. * The synchronize_rcu() guarantees for the new table to be picked up * so no new additions go into the old table while we relink. */ rcu_assign_pointer(ht->future_tbl, new_tbl); synchronize_rcu(); /* For each new bucket, search the corresponding old bucket for the * first entry that hashes to the new bucket, and link the end of * newly formed bucket chain (containing entries added to future * table) to that entry. Since all the entries which will end up in * the new bucket appear in the same old bucket, this constructs an * entirely valid new hash table, but with multiple buckets * "zipped" together into a single imprecise chain. */ for (new_hash = 0; new_hash < new_tbl->size; new_hash++) { old_hash = rht_bucket_index(old_tbl, new_hash); old_bucket_lock = bucket_lock(old_tbl, old_hash); spin_lock_bh(old_bucket_lock); rht_for_each(he, old_tbl, old_hash) { if (head_hashfn(ht, new_tbl, he) == new_hash) { link_old_to_new(new_tbl, new_hash, he); break; } } spin_unlock_bh(old_bucket_lock); } //... } ``` #### 在 Linux v6.8 中的 bucket lock 實作 **`rhash_lock_head`** **`struct rhash_lock_head {}` 的作用** :::danger 注意用語： * access 是「存取」，而非「訪問」(visit) 避免使用 ChatGPT 來機器翻譯。 ::: 這個結構體，沒有任何成員變量，由於這個結構體沒有內容，開發者不能直接使用它來<s>訪問</s> 存取 bucket 的內容，必須先將其轉換為正確的<s>指針</s> 指標類型，這樣可以避免誤操作。 ```c /* * Objects in an rhashtable have an embedded struct rhash_head * which is linked into as hash chain from the hash table - or one * of two or more hash tables when the rhashtable is being resized. * The end of the chain is marked with a special nulls marks which has * the least significant bit set but otherwise stores the address of * the hash bucket. This allows us to be sure we've found the end * of the right list. * The value stored in the hash bucket has BIT(0) used as a lock bit. * This bit must be atomically set before any changes are made to * the chain. To avoid dereferencing this pointer without clearing * the bit first, we use an opaque 'struct rhash_lock_head *' for the * pointer stored in the bucket. This struct needs to be defined so * that rcu_dereference() works on it, but it has no content so a * cast is needed for it to be useful. This ensures it isn't * used by mistake with clearing the lock bit first. */ struct rhash_lock_head {}; ``` 在 rhashtable 中，每個物件都內嵌了一個 `struct rhash_head`。 hash chain 的末端使用一個特殊的`nulls` 標記來表示。這個標記的 least significant bit 被設置為 1，但其餘儲存 bucket 的地址。 :::danger 注意用語： * thread 是「執行緒」 * bit 是「位元」，而非「位」，後者是量詞 ::: bucket 中的最低有效位（BIT(0)）被用作鎖定位（lock bit）。在對 hash table 進行任何更改之前，必須先 atomic 設置這個<s>位</s> 位元。這是為了確保在多<s>線程</s> 執行緒環境下，對 hash table 的修改操作不會相互干擾。 --- **`rht_lock`** ```c static inline unsigned long rht_lock(struct bucket_table *tbl, struct rhash_lock_head __rcu **bkt) { unsigned long flags; local_irq_save(flags); bit_spin_lock(0, (unsigned long *)bkt); lock_map_acquire(&tbl->dep_map); return flags; } ``` **`rht_unlock`** ```c static inline void rht_unlock(struct bucket_table *tbl, struct rhash_lock_head __rcu **bkt, unsigned long flags) { lock_map_release(&tbl->dep_map); bit_spin_unlock(0, (unsigned long *)bkt); local_irq_restore(flags); } ``` **鎖定原理** 1. **利用最低有效位 (LSB)**：rhashtable 的設計利用指標的 least significant bit，LSB 作為鎖標記。 - 在正常的指標中，LSB 總是 0。 - 當一個 bucket 被鎖定時，其對應指針的 LSB 會被設置為 1。 2. **`bit_spin_lock` 函式**：程式碼中的 `bit_spin_lock` 函式用於指定<s>指針</s> 指標的 LSB 設置為 1，從而實作 bucket 的鎖定。 :::danger 注意用語 ::: ### resize 和插入操作並行 #### 加入插入操作 > commit [d3be3bd](https://github.com/SHChang-Anderson/rcuhashbash/commit/d3be3bd0c438c01091131deb1f513140e672ce86) 在過去的的 [rcuhashbash](https://github.com/chiangkd/rcuhashbash) 專案測試程式中，並沒有加入插入功能。因此我修改去年的測試程式，加入插入操作，並使用 mutex lock 避免插入操作與 resize 操作並行出現問題，也作為接下來的並行測試作為基準。在 commit [d3be3bd](https://github.com/SHChang-Anderson/rcuhashbash/commit/d3be3bd0c438c01091131deb1f513140e672ce86) 中，我實作了 `rhashtable_insert` 函式，並將其整合到 rcuhashbash 中。為了方便控制插入操作，我新增了 `insert` 切換選項和 `insertct` 數量限制。此外，為了確保在多執行緒環境下的安全插入，引入了 `table_mutex` 互斥鎖，以防止並行修改表格。最後，建立了一個 `rcuhashbash_insert_thread` 插入執行緒，以便進行插入效能測試。並在插入之後進行查找，確認插入是否成功。 #### 新增 Bucket lock 以及 Future Table > commit [d2f18d5](https://github.com/SHChang-Anderson/rcuhashbash/commit/d2f18d5490a779c491168d50e3952efa36308342) 在 [rcuhashbash](https://github.com/SHChang-Anderson/rcuhashbash) 中，我們將參考 Linux 核心 [lib/rhashtable.c](https://github.com/torvalds/linux/blob/master/lib/rhashtable.c) 中 commit [97def1](https://github.com/torvalds/linux/commit/97defe1ecf868b8127f8e62395499d6a06e4c4b1) 的方法，為 rhashtable 增加 Bucket lock 和 Future Table 功能，以便能夠並行處理不同 bucket 的修改和插入操作。首先，我們需要在原始的 `rcuhashbash_table` 結構中加入一個 `spinlock_t` 型態的指標。 ```diff struct rcuhashbash_table { unsigned long mask; + spinlock_t *locks; struct hlist_head buckets[]; }; ``` 接下來，加入一個 `bucket_lock` 函式。這個函式會接受一個 `rcuhashbash_table` 和一個 `hash` 作為參數，並回傳該 bucket 的鎖。這樣，在對特定 bucket 進行操作時，就能夠取得該 bucket 的鎖，以確保插入或修改操作的安全性。 ```c static spinlock_t *bucket_lock(const struct rcuhashbash_table *tbl, u32 hash) { return &tbl->locks[hash]; } ``` 在 [rcuhashbash](https://github.com/SHChang-Anderson/rcuhashbash) 中加入 `rcuhashbash_try_lookup2` 函式。此函式會優先在 `future_table` 中查找資料，若未找到，則會在舊的 hash table 中再次查找。此外，插入操作也會優先在 `future_table` 中進行，以確保最新的資料能被快速查找到。 ```c static bool rcuhashbash_try_lookup2(struct rcuhashbash_table *t, struct rcuhashbash_table *newt, u32 value) { struct rcuhashbash_entry *entry; struct hlist_node *node __attribute__((unused)); restart: hlist_for_each_entry_rcu(entry, &newt->buckets[value & newt->mask], node) { if (entry->value == value) { return true; } } if (unlikely(newt != t)) { newt = t; goto restart; } return false; } ``` 在這個函式中，`rcuhashbash_try_lookup2` 會首先在 `newt`（即 `future_table`）中進行查找。如果未找到，則會切換到舊的雜湊表 (`t`) 再次查找在擴展/縮小的過程中需要加入 bucket lock。值得注意的是，在原始的實作當中，縮小操作是 in-place，而在 Linux 核心的實作當中，在 resize 的過程會短暫地使用到兩個雜湊表。因此，我必須修改此部分的程式碼，以適用於 bucket lock。 ```c if (mask2 < table->mask) { /* Shrink. */ //.. /* Assume (and assert) that __krealloc shrinks in place. */ new_table = krealloc(table, sizeof(*table) + len2 * sizeof(table->buckets[0]), GFP_KERNEL); //.. } ``` 最後，在 unzip 的過程中，針對 future table 正在執行 unzip 的 bucket 進行上鎖，確保 resize 的安全性。 ### rhashtable 並行插入實驗 > 本實驗旨在評估 rhashtable 在 resize 並行情況下，不同插入策略對其性能的影響。比較以下兩種插入策略在 rhashtable 中的效能： 1. **hash table mutex lock**：在插入操作時，對整個哈希表加鎖，確保所有插入操作串行執行。 2. **bucket lock**：在插入操作時，僅對相關的 bucket 加鎖，允許不同 bucket 的插入操作並行執行，提高並行度。 **實驗方法** 1. **實驗環境**： - **固定 readers 數量**：設定固定數量的 readers 。 - **固定測試時間**：設定固定的測試時間，比較兩種方法在相同時間內能完成的插入次數。 2. **實驗流程**： - 將實驗分為兩組，每組執行相同的測試時間。 - 每組實驗中，固定數量的讀者線程持續對 rhashtable 進行讀取操作。 - 在兩組實驗中，各自使用不同的插入策略，並以固定的時間間隔 (10 ms) 持續進行插入操作，直到測試時間結束。 3. **實驗指標**： - **成功插入次數**：統計在測試時間內成功插入到 rhashtable 中的元素數量，作為評估不同插入策略的並行效率的主要指標。使用 git checkout 找到過去版本使用 hash table mutex lock 的程式碼，加入目前專案當中，命名為 `rcuhashbash-resize_wob_lock.c` 為接下來的測試做準備。 **測試腳本** `do_insert_measurement.sh` ```bash sudo dmesg -C for((sleept = 0; sleept < 5; sleept = sleept + 1)) do sudo insmod ${TARGET_MODULE[0]}.ko \ shift1=13 \ resize=true \ insert=true sleep ${sleep_time[sleept]} sudo rmmod ${TARGET_MODULE[0]} done mapfile -t inserts < <(sudo dmesg | grep "rcuhashbash summary: inserts" | awk '{print $4}') for i in "${!inserts[@]}"; do echo "${sleep_time[i]} ${inserts[i]}" done > output_bucket_lock.txt ``` **產生的數據圖腳本** `script.gp` ```bash set title "Insertion Performance: Bucket Lock vs Table Mutex" set xlabel "Time (seconds)" set ylabel "Insertions" set terminal png font "Times_New_Roman, 12" set key right set output "insertion_performance_comparison.png" plot \ "output_bucket_lock.txt" using 1:2 with linespoints linewidth 2 title "Bucket Lock", \ "output_table_mutex.txt" using 1:2 with linespoints linewidth 2 title "Table Mutex" \ ``` **實驗參數** 分別編譯 rcuhashbash-resize 和 rcuhashbash-resize_wob_lock 模組。執行測試: * 針對每個模組，依序設定測試時間為 5, 10, 15, 20, 25 秒。 * 在每個測試時間下，執行測試腳本，同時開啟插入和 resize 操作。 * 記錄每個測試時間下，兩個模組各自完成的插入次數。 ```bash! TARGET_MODULE=(rcuhashbash-resize rcuhashbash-resize_wob_lock) sleep_time=(5 10 15 20 25) tests="rcu" readers=1 insert=true resize=true ``` **執行 `do_insert_measurement.sh`** ```shell bash do_insert_measurement.sh ``` ### 實驗結果 > 比較兩種方法在不同時間的插入次數 * **使用 bucket lock** |時間 (秒) |插入次數| |-|-| |5| 414| |10| 828| 15| 1230 20 |1650 25| 2055 * **使用 table mutex** |時間 (秒) |插入次數| |-|-| 5| 83 10| 160 15 |240 20 |348 25 |441 **實驗圖表** ![image](https://hackmd.io/_uploads/HkiFQos8C.png) **Resize 次數比較** > 觀察並行插入是否對 resize 造成影響 * 使用 **bucket lock** 但不使用插入功能 |時間 (秒) |resize 次數| |-|-| 5| 91 10| 200 15| 287 20| 411 25| 499 * 使用 **bucket lock** 以固定的時間間隔 (10 ms) 持續進行插入操作 |時間 (秒) |resize 次數| |-|-| 5| 92 10| 199 15 |308 20| 410 25| 508 **實驗圖表** ![image](https://hackmd.io/_uploads/HyL2PjoLC.png) * 使用 **table mutex** 但不使用插入功能 |時間 (秒) |resize 次數| |-|-| 5| 91 10| 179 15| 281 20| 360 25| 488 * 使用 **table mutex** 以固定的時間間隔 (10 ms) 持續進行插入操作 |時間 (秒) |resize 次數| |-|-| 5 |93 10 |181 15 |281 20 |377 25 |469 **實驗圖表** ![image](https://hackmd.io/_uploads/BklB7AiIC.png) :::danger 如何解讀上方實驗結果？與最初論文描述的行為相符嗎？ ::: 實驗結果顯示，在各種測試時間下，使用 bucket lock 對特定的 bucket 進行臨界區間的控制，能在單位時間內完成更多次插入操作。進一步結合 resize 次數的比較實驗，我們可以得出以下結論： 1. **Bucket lock 提升插入效率：** 相比於 mutex lock 鎖定整個雜湊表，bucket lock 能夠實作更細緻的鎖定，允許不同 bucket 的插入操作並行進行。這有效降低了鎖競爭，從而提高了插入操作的吞吐量。 2. **Bucket lock 有助於 resize 並行：** 在 hash table 需要擴展或縮小時，bucket lock 允許插入操作與 resize 操作並行進行，在 resize 過程中，新的插入操作不會被完全阻塞，從而提高了整體的效能。 **總結：** 綜合實驗結果，bucket lock 在 rhashtable 的插入操作中表現出明顯的優勢。它不僅能提高插入效率，還能降低 resize 成本。