Linux 核心專題: [rhashtable](https://elixir.bootlin.com/linux/latest/source/include/linux/rhashtable.h) 研究和應用

# Linux 核心專題: [rhashtable](https://elixir.bootlin.com/linux/latest/source/include/linux/rhashtable.h) 研究和應用 > 執行人: chiangkd, POCHUN-CHEN > [專題解說錄影](https://youtu.be/7JY9DvHoLwk) > [專題解說簡報](https://docs.google.com/presentation/d/1hzjxfvY6Bev7QtrGZCIuSKB5g3jwVQrQG0e_A95UiFU/edit#slide=id.p20) :::success :question: 提問清單 * ? ::: ## 任務簡述 Linux 核心提供一套 Resizable, Scalable, Concurrent Hash Table，名為 `rhashtable` (見 [include/linux/rhashtable.h](https://elixir.bootlin.com/linux/latest/source/include/linux/rhashtable.h) 及 [lib/rhashtable.c](https://elixir.bootlin.com/linux/latest/source/lib/rhashtable.c))，描述於論文〈[Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming](https://www.usenix.org/legacy/event/atc11/tech/final_files/Triplett.pdf)〉。摘譯 LWN 文章，並佐以原論文，描述 rhashtable 何以落實高效的並行雜湊表，搭配撰寫 Linux 核心模組作為試驗。本任務應整理「[2022 年開發紀錄](https://hackmd.io/@linjohn/rhashtable)」素材，並於 Linux v5.15+ 重現實驗，著重於 Linux 核心原始程式碼的應用案例 (如 [`mac80211_hwsim`](https://www.kernel.org/doc/html/next/networking/mac80211_hwsim/mac80211_hwsim.html))。 ## TODO: 在 Linux v5.15+ 重現 [2022 年開發紀錄](https://hackmd.io/@linjohn/rhashtable) ### 撰寫 Linux 核心模組作為試驗關於核心模組掛載詳情請見[Linux 核心模組運作原理](https://hackmd.io/@sysprog/linux2021-summer/https%3A%2F%2Fhackmd.io%2F%40sysprog%2Flinux-kernel-module#%E7%A8%8B%E5%BC%8F%E7%A2%BC%E6%BA%96%E5%82%99) #### 前期準備 * 我們需要進入 BIOS 將 UEFI Secure Boot 的功能**關閉** >如果無法直接禁用 Secure Boot，並且在 BIOS 或 UEFI 設定界面中只看到 "Secure Boot Control" 選項，那麼禁用 "Secure Boot Control" 選項 * 檢查 Linux 核心版本 ```shell $ uname -r ``` * 安裝核心標頭檔 `linux-headers` 套件在 Ubuntu 中，核心標頭檔通常位於 linux-headers 套件中。可以使用以下步驟來安裝核心標頭檔套件： ```shell $ sudo apt install linux-headers-`uname -r` ``` * 更新套件清單 ```shell $ sudo apt update ``` * 確認 `linux-headers` 套件已正確安裝於開發環境 ```shell $ dpkg -L linux-headers-`uname -r` | grep "/lib/modules/.*/build" ``` 預期得到以下輸出: ``` /lib/modules/5.19.0-43-generic/build ``` ### 測試程式碼準備 * 建立目錄 ```shell $ mkdir test_rhashtable ``` * 用編譯器儲存以下兩個程式碼 - [ ] [`test_rhashtable.c`](https://elixir.bootlin.com/linux/v5.19/source/lib/test_rhashtable.c) - [ ] `Makefile` ``` obj-m := test_rhashtable.o all: make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules clean: rm -rf *.o *.ko *.mod.* *.symvers *.order *.mod.cmd *.mod ``` #### 編譯核心模組編譯核心模組的命令: ```shell $ make ``` #### 掛載核心模組 * 產生 `test_rhashtable.ko` 之後，可將其掛載: ```shell $ sudo insmod test_rhashtable.ko ``` * dmesg (display message) 就是將開機時的資訊顯示出來的命令。 ```shell $ sudo dmesg ``` > 直接使用 dmesg 會顯示以下權限不足的警告 > "dmesg: read kernel buffer failed: Operation not permitted" > `$ sudo dmesg -c` 用法，清除核心緩衝區中的訊息並將其輸出到終端機上（可以清掉之前的資訊，只看新增的資訊，很方便）。顯示核心資訊 ```c [ 1624.670611] Running rhashtable test nelem=8, max_size=0, shrinking=0 [ 1624.670612] Test 00: [ 1624.670631] Adding 50000 keys [ 1624.677792] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 1624.682341] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 1624.682343] Deleting 50000 keys [ 1624.688610] Duration of test: 17977415 ns [ 1624.688613] Test 01: [ 1624.688636] Adding 50000 keys [ 1624.696129] Info: encountered resize [ 1624.696589] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=1 [ 1624.701504] Info: encountered resize [ 1624.701969] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=1 [ 1624.701970] Deleting 50000 keys [ 1624.708627] Duration of test: 19989237 ns [ 1624.708630] Test 02: [ 1624.708653] Adding 50000 keys [ 1624.716580] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 1624.721199] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 1624.721200] Deleting 50000 keys [ 1624.727728] Duration of test: 19072633 ns [ 1624.727731] Test 03: [ 1624.727752] Adding 50000 keys [ 1624.735153] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 1624.739284] Traversal complete: counted=50000, nelems=50000, entries=50000, table-jumps=0 [ 1624.739285] Deleting 50000 keys [ 1624.745324] Duration of test: 17570950 ns [ 1624.746723] test if its possible to exceed max_size 8192: no, ok [ 1624.746750] Average test time: 18652558 [ 1624.746751] test inserting duplicates [ 1624.746753] ---- ht: ---- bucket[1] -> [[ val 21 (tid=1) ]] -> [[ val 1 (tid=0) ]] ------------- [ 1624.746757] ---- ht: ---- bucket[1] -> [[ val 21 (tid=1) ]] -> [[ val 1 (tid=2), val 1 (tid=0) ]] ------------- [ 1624.746759] ---- ht: ---- bucket[1] -> [[ val 21 (tid=1) ]] -> [[ val 1 (tid=0) ]] ------------- [ 1624.746761] ---- ht: ---- bucket[1] -> [[ val 21 (tid=1) ]] -> [[ val 1 (tid=2), val 1 (tid=0) ]] ------------- [ 1624.746762] Testing concurrent rhashtable access from 10 threads [ 1624.853127] test 3125 add/delete pairs into rhlist [ 1624.867967] test 3125 random rhlist add/delete operations [ 1624.879939] Started 10 threads, 0 failed, rhltable test returns 0 ``` #### 卸載核心模組 ```shell $ sudo rmmod test_rhashtable ``` ### [2022 開發紀錄](https://hackmd.io/@linjohn/rhashtable#rhashtable-%E7%A0%94%E7%A9%B6) 回顧及文獻摘錄 >- v5.13 [rcuhashbash](https://github.com/linjohnss/rcuhashbash) >- 更早的來源版本： [rcuhashbash-resize (v3.17)](https://git.kernel.org/pub/scm/linux/kernel/git/josh/rcuhashbash.git) [2022 開發紀錄](https://hackmd.io/@linjohn/rhashtable#rhashtable-%E7%A0%94%E7%A9%B6) 中提到為了簡單起見 relativistic hash tables 透過整數因子 (integral factors) 來改變 buckets 的數量，也就是說 hashtable 中 buckets 的數量是加倍或減半的，對應到論文〈[Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming](https://www.usenix.org/legacy/event/atc11/tech/final_files/Triplett.pdf)〉描述 >For simplicity, relativistic hash tables constrain resizing to change the number of buckets by integral factors—for instance, doubling or halving the number of buckets. This guarantees two constraints: First, when shrinking the table, each bucket of the new table will contain all entries from multiple buckets of the old table; and second, when growing the table, each bucket of the new table will contain entries from at most one bucket of the old table. 在 [LWN 文章](https://lwn.net/Articles/612021/)中也有相對應的描述 >In practice, it means that the size of a table can only change by an integer factor; normally that means that the size of a table can only be doubled or halved. #### RCU 相關 API 更新 :::spoiler 節錄 [changelog-v5.10](https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10) >commit ae2212a7216b674633bdc3bd2e24947a0665efb8 Author: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Date: Sun Jul 12 18:40:02 2020 +0530 rculist: Introduce list/hlist_for_each_entry_srcu() macros list/hlist_for_each_entry_rcu() provides an optional cond argument to specify the lock held in the updater side. However for SRCU read side, not providing the cond argument results into false positive as whether srcu_read_lock is held or not is not checked implicitly. Therefore, on read side the lockdep expression srcu_read_lock_held(srcu struct) can solve this issue. However, the function still fails to check the cases where srcu protected list is traversed with rcu_read_lock() instead of srcu_read_lock(). Therefore, to remove the false negative, this patch introduces two new list traversal primitives : list_for_each_entry_srcu() and hlist_for_each_entry_srcu(). Both of the functions have non-optional cond argument as it is required for both read and update side, and simply checks if the cond is true. For regular read side the lockdep expression srcu_read_lock_head() can be passed as the cond argument to list/hlist_for_each_entry_srcu(). Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Suraj Upadhyay <usuraj35@gmail.com> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> [ paulmck: Add "true" per kbuild test robot feedback. ] Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> ::: 在 [linuux/rculist.h](https://github.com/torvalds/linux/blob/master/include/linux/rculist.h#L704) 中對於 `hlist_for_each_entry_rcu` 的描述 ```c /** * hlist_for_each_entry_rcu - iterate over rcu list of given type * @pos: the type * to use as a loop cursor. * @head: the head for your list. * @member: the name of the hlist_node within the struct. * @cond: optional lockdep expression if called from non-RCU protection. * * This list-traversal primitive may safely run concurrently with * the _rcu list-mutation primitives such as hlist_add_head_rcu() * as long as the traversal is guarded by rcu_read_lock(). */ #define hlist_for_each_entry_rcu(pos, head, member, cond...) \ for (__list_check_rcu(dummy, ## cond, 0), \ pos = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),\ typeof(*(pos)), member); \ pos; \ pos = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(\ &(pos)->member)), typeof(*(pos)), member)) ``` - 過往開發紀錄已將原先 [v2.6.37](https://elixir.bootlin.com/linux/v2.6.37/source/include/linux/rculist.h#L449) 版本 API 修正為在 [v5.13](https://elixir.bootlin.com/linux/v5.13/source/include/linux/rculist.h#L705) 上執行，對應 [v5.15](https://elixir.bootlin.com/linux/v5.15/source/include/linux/rculist.h#L704) 僅對 comment 做補充說明，並無更動 API。 - 自 [v5.10](https://www.kernel.org/doc/html/v5.10/genindex.html) 起改動除了對於 API 的更動以及新增 `cond`，參考 [LWN-Lockdep-RCU](https://lwn.net/Articles/371986/#Why%20Bother%20With%20lockdep-Enabling%20RCU?)、[RCU and lockdep checking](https://docs.kernel.org/RCU/lockdep.html)，RCU 透過允許 readers 和 writers 同時進入 critical section 來改善可擴展能力 (scalability) 問題，同時因為 `rcu_dereference()` 在 2.6.32 版本後出現次數大幅增加，故引入 lockdep 方式來檢查 `rcu_dereference()` - 此處 `rcu_dereference_raw()` 提取 RCU-protected value(pointer of index) 且 without lockdep-RCU checks，由 `__list_check_rcu` 展開 `RCU_LOCKDEP_WARN()` 來檢查 lockdep ### 編譯並執行下載程式碼 ```shell $ git clone https://github.com/linjohnss/rcuhashbash.git ``` 編譯程式碼 ```shell $ make ``` 在 Makefile 內寫好編譯、清除、掛載、卸載可直接使用。 - [ ] Makefile ```shell TARGET_MODULE := rcuhashbash-resize # obj-m += rcuhashbash.o obj-m += $(TARGET_MODULE).o all: make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules clean: make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean load: sudo insmod $(TARGET_MODULE).ko unload: sudo rmmod $(TARGET_MODULE) || true >/dev/null ``` 執行編譯後，將會出現以下警告資訊，這是由於實作中以 `rcuhashbash_try_lookup()` 負責查找 hashtable 中的值，如果成功找到，則回傳 `true`，並且 `rcuhashbash_try_lookup()` 會被不同的演算法實作 (RP，DDDS，rwlock) 中被呼叫。 ``` make -C /lib/modules/5.19.0-43-generic/build M=/home/leo/rcuhashbash modules make[1]: Entering directory '/usr/src/linux-headers-5.19.0-43-generic' warning: the compiler differs from the one used to build the kernel The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0 You are using: gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0 CC [M] /home/leo/rcuhashbash/rcuhashbash-resize.o /home/leo/rcuhashbash/rcuhashbash-resize.c: In function ‘rcuhashbash_try_lookup’: /home/leo/rcuhashbash/rcuhashbash-resize.c:113:28: warning: unused variable ‘node’ [-Wunused-variable] 113 | struct hlist_node *node; | ^~~~ /home/leo/rcuhashbash/rcuhashbash-resize.c: In function ‘rcuhashbash_resize’: /home/leo/rcuhashbash/rcuhashbash-resize.c:183:44: warning: unused variable ‘node’ [-Wunused-variable] 183 | struct hlist_node *node; | ^~~~ /home/leo/rcuhashbash/rcuhashbash-resize.c:201:52: warning: unused variable ‘node’ [-Wunused-variable] 201 | struct hlist_node *node; | ^~~~ MODPOST /home/leo/rcuhashbash/Module.symvers CC [M] /home/leo/rcuhashbash/rcuhashbash-resize.mod.o LD [M] /home/leo/rcuhashbash/rcuhashbash-resize.ko BTF [M] /home/leo/rcuhashbash/rcuhashbash-resize.ko Skipping BTF generation for /home/leo/rcuhashbash/rcuhashbash-resize.ko due to unavailability of vmlinux make[1]: Leaving directory '/usr/src/linux-headers-5.19.0-43-generic' ``` - 可以在變數後加上 gcc extension `__attribute__((unused))` 消除。 ### 掛載核心模組產生 `rcuhashbash-resize.ko` 之後，可將其掛載: ```shell $ sudo make load ``` 將其卸載: ```shell $ sudo make unload ``` dmesg 顯示資訊的命令。 ```shell $ sudo dmesg ``` 出現以下訊息 > 由於印出狀態的函式 `rcuhashbash_print_stats()` 存在於 `rcuhashbash_exit()` 中，所以要卸載過 `rcuhashbash-resize.ko` 才會印出完整的狀態資訊。 ``` [ 7830.609092] rcuhashbash starting threads [ 8168.516186] rcuhashbash summary: test=rcu readers=8 resize=true rcuhashbash summary: entries=65536 shift1=13 (8192 buckets) shift2=14 (16384 buckets) rcuhashbash summary: reads: 47504534567 primary hits, 0 slowpath primary hits, 0 secondary hits, 0 misses rcuhashbash summary: resizes: 2844 rcuhashbash summary: PASS rcuhashbash summary: totak time: 337861881925 ns [ 8168.516193] rcuhashbash done [ 8184.890270] rcuhashbash starting threads ``` ### 重現實驗結果在 `rcuhashbash-resize.c` 中包含 3 個演算法 ([RP](https://en.wikipedia.org/wiki/Relativistic_programming), [DDDS](https://lwn.net/Articles/302132/), rwlock) 的比較，從 $2^{13}$ 擴張 (expansion) 至 $2^{14}$，再從 $2^{14}$ 縮小 (shrink) 回 $2^{13}$，此處可對應論文提及之 integral factors (hashtable buckets 的數量是加倍或減半) 編譯完成後使用 `modinfo` 查看 `module_param` 以及 `MODULE_PARM_DESC` 對核心模組的相關設定 ```shell $ modinfo rcuhashbash-resize.ko ... parm: test:Hash table implementation (charp) parm: readers:Number of reader threads (default: number of online CPUs) (int) parm: resize:Whether to run a resize thread (default: true) (bool) parm: shift1:Initial number of hash buckets, log 2 (byte) parm: shift2:Number of hash buckets after resize, log 2 (byte) parm: entries:Number of hash table entries (ulong) ``` 對於每組測試參數，執行 10 組 benchmark 測試，每次 10 秒，然後將結果取平均值。 - 為了測試 scalability，每組 benchmark 包含 1, 2, 4, 8 以及 16 concurrent reader threads, with and without an additional resize thread **測試案例 (Expect lookups to take less time in a table with more buckets)** - no resizing and $2^{13}$ buckets - no resizing and $2^{14}$ buckets - continuous resizing between $2^{13}$ and $2^{14}$ buckets **測試腳本** ```shell $ ls -al /bin/sh /bin/sh -> dash ``` - 如果一開始預設連結 `dash` ```shell $ sudo dpkg-reconfigure dash ``` - 改為連接 `bash` (selected "NO") `do_measurement.sh` ```shell tests=("rcu" "ddds" "rwlock") readers=(1 2 4 8 16) sleep_time=10 TARGET_MODULE=rcuhashbash-resize sudo rmmod ${TARGET_MODULE} || true >/dev/null # test 1: no resizing and 2^13 buckets sudo dmesg -C for((tcount = 1; tcount <= 10; tcount = tcount + 1)) do # iterate algorithm for((tests_count = 0; tests_count < 3; tests_count = tests_count + 1)) do for((readers_count = 0; readers_count < 5; readers_count = readers_count + 1)) do sudo insmod ${TARGET_MODULE}.ko \ test=${tests[tests_count]} \ readers=${readers[readers_count]} \ shift1=13 \ resize=false sleep ${sleep_time} sudo rmmod ${TARGET_MODULE} done done echo test1 ${tcount} test done done sudo dmesg > test1result.txt # test 2: no resizing and 2^14 buckets sudo dmesg -C for((tcount = 1; tcount <= 10; tcount = tcount + 1)) do # iterate algorithm for((tests_count = 0; tests_count < 3; tests_count = tests_count + 1)) do for((readers_count = 0; readers_count < 5; readers_count = readers_count + 1)) do sudo insmod ${TARGET_MODULE}.ko \ test=${tests[tests_count]} \ readers=${readers[readers_count]} \ shift1=14 \ resize=false sleep ${sleep_time} sudo rmmod ${TARGET_MODULE} done done echo test2 ${tcount} test done done sudo dmesg > test2result.txt # test 3: continuous resizing between $2^{13}$ and $2^{14}$ buckets sudo dmesg -C for((tcount = 1; tcount <= 10; tcount = tcount + 1)) do # iterate algorithm for((tests_count = 0; tests_count < 3; tests_count = tests_count + 1)) do for((readers_count = 0; readers_count < 5; readers_count = readers_count + 1)) do sudo insmod ${TARGET_MODULE}.ko \ test=${tests[tests_count]} \ readers=${readers[readers_count]} \ resize=true sleep ${sleep_time} sudo rmmod ${TARGET_MODULE} done done echo test3 ${tcount} test done done sudo dmesg > test3result.txt ``` 執行 `do_measurement.sh` ```shell $ sudo sh do_measurement.sh ``` 畫實驗產生的數據圖 `script.gp` ``` set title "Fixed hashtable size v.s resize between two sizes" set xlabel "Reader threads" set ylabel "Lookups/second (millions)" set terminal png font "Times_New_Roman, 12" set key right set output "8k_16k_rwlock_resize.png" plot \ "8k_16k_rwlock_resize.txt" using 1:2 with linespoints linewidth 2 title "8k (fixed)", \ "8k_16k_rwlock_resize.txt" using 1:3 with linespoints linewidth 2 title "16k (fixed)", \ "8k_16k_rwlock_resize.txt" using 1:4 with linespoints linewidth 2 title "resize (rwlock)"\ ``` **實驗環境** ```shell Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 12 On-line CPU(s) list: 0-11 Vendor ID: GenuineIntel Model name: 12th Gen Intel(R) Core(TM) i5-12400 CPU family: 6 Model: 151 Thread(s) per core: 2 Core(s) per socket: 6 Socket(s): 1 Stepping: 2 CPU max MHz: 5600.0000 CPU min MHz: 800.0000 BogoMIPS: 4992.00 Virtualization features: Virtualization: VT-x Caches (sum of all): L1d: 288 KiB (6 instances) L1i: 192 KiB (6 instances) L2: 7.5 MiB (6 instances) L3: 18 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-11 ``` **原始數據** 1. 固定為 8k buckets without resizing，測試 10 秒 Lookups | Readers | RP | DDDS | rwlock | | ------- | ---------- | ---------- | --------- | | 1 | 307665798 | 307405065 | 278619902 | | 2 | 609246377 | 607262585 | 341572765 | | 4 | 1162583704 | 1156299133 | 315475526 | | 8 | 1942473238 | 1924575980 | 287988264 | | 16 | 2479381351 | 2442075521 | 322175441 | 2. 固定為 8k buckets without resizing，測試 10 秒 Lookups | Readers | RP | DDDS | rwlock | | ------- | ---------- | ---------- | --------- | | 1 | 450585149 | 449389794 | 398910323 | | 2 | 889254383 | 890039215 | 382165321 | | 4 | 1700583442 | 1693820498 | 340461121 | | 8 | 2835046421 | 2828206776 | 318158465 | | 16 | 3593027567 | 3565010193 | 321311198 | 3. 在 8k - 16k buckets 持續 resizing，測試 10 秒 Lookups | Readers | RP | DDDS | rwlock | | ------- | ---------- | ---------- | -------- | | 1 | 415006421 | 349717949 | 1734373 | | 2 | 818310957 | 692473055 | 2686819 | | 4 | 1572226023 | 1322716490 | 2754268 | | 8 | 2620503729 | 2211407562 | 2923274 | | 16 | 3324667883 | 2758353157 | 65682783 | **圖示及結果討論** 固定 8k buckets without resizing, 3 種演算法比較 ![](https://hackmd.io/_uploads/rkUGm3Cw3.png) - 在 16 個 readers 與原本的線性增長產生偏差，可能是由於超過了 12 個處理器核的限制，且隨著 readers 數量增加，RP 演算法表現優於 DDDS， 3 種演算法 (RP, DDDS, rwlock) resizing 比較 ![](https://hackmd.io/_uploads/BJh97n0vn.png) - 原論文中因 rwlock 表現太差故忽略比較，隨著 readers 數量增加，RP 演算法表現優於 DDDS RP - 固定 size (8k, 16k) 以及 resize 比較 ![](https://hackmd.io/_uploads/ByQwQh0Dn.png) - 基於 RP 的實作位於 upper 和 lower bound 之間，顯示出此演算法並沒有增加太多額外的 overhead DDDS - 固定 size (8k, 16k) 以及 resize 比較 ![](https://hackmd.io/_uploads/Sk-OQ3CDh.png) - 這裡雖然沒有和論文完全一樣 (resize 表現低於 lower bound)，由數據上來看 RP 還是有略優於 DDDS rwlock - 固定 size (8k, 16k) 以及 resize 比較 ![](https://hackmd.io/_uploads/rJt_m30D2.png) - 在 resize 下，基於 rwlock 的實作受到了很大的影響，但在 16 個 readers 下有回升。結果上來說如同論文所述，RP 以及 DDDS 對於 resize 並沒有喪失太多效能，且隨著 concurrent readers 增加，RP 表現略優於 DDDS，而 rwlock 的表現在 resize hashtable 就差強人意了。 ## TODO: 依據近期 Linux 核心發展，更新 [rhashtable](https://elixir.bootlin.com/linux/latest/source/include/linux/rhashtable.h) 研究素材 >- 論文〈[Read-Log-Update](http://alchem.usc.edu/courses-ee599/downloads/rlu-sosp15.pdf)〉 >- [How about RLU? Is RLU a generic concurrency control?](https://concurrencyfreaks.blogspot.com/2019/10/how-about-rlu-is-rlu-generic.html) >- [SOSP 解說影片](https://www.youtube.com/watch?v=at9cxc3JTkY&t=27s) 〈[Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming](https://www.usenix.org/legacy/event/atc11/tech/final_files/Triplett.pdf)〉在 2011 年發表，RCU 在 Linux 核心中廣泛被使用，以 lock-free synchronization 的機制，透過區分 readers 和 writers 以及延遲回收 (deferred destruction) 來減少 lock 的使用，其優缺點包含 (節錄自 [SOSP'2015](https://dl.acm.org/doi/10.1145/2815400.2815406)): - 優點: - 效率高 (no overhead for **readers**) - Popular, 廣泛應用於 Linux kernel 中 - 缺點: - Hard to program (in non-trivial cases) - 只允許 single pointer updates 最早的 [commit](https://github.com/torvalds/linux/commit/7e1e77636e36075ebf118298855268468f1028e8) 在 2014 年被納入 kernel 中，而 [Read-Log-Update](https://dl.acm.org/doi/pdf/10.1145/2815400.2815406) (RLU) 作為 RCU 的擴充在 2015 年 SOSP 研討會上被提出，RLU 保有了 RCU 非同步走訪 (unsynchronized traversals) 的特性，同時能夠支援 [multi-location atomic updates](https://firebase.blog/posts/2015/09/introducing-multi-location-updates-and_86/)，論文中也同樣在 kernel space 對於 rhashtable 進行測試 >Thus, to compare the performance of the kernel implementation of RCU with RLU, we create a Linux kernel module along the same line as Triplett et al. with the RCU hash table 具體的實作方法是透過紀錄所有的更動在 per-thread log 中，如下圖 (節錄自 [SOSP'2015](https://dl.acm.org/doi/10.1145/2815400.2815406)) ![](https://hackmd.io/_uploads/Hk6bORHdh.png) :::danger 修正圖片存取權限 :notes: jserv >Fixed, thanks ::: **Example** ![](https://hackmd.io/_uploads/B14xuRH_n.png) - Thread $T_2$ 負責更新 object $O_2$ 和 $O_3$， Thread $T_1$ 和 $T_3$ 只負責做讀取 - 初始狀態下 global clock (`g-clock`) 是 `22`，$T_2$ 一開始有空的 write-log (`w-log`) 以及 $\infty$ 的**區域寫時鐘** (**local write clock**, `w-clock`) - 這些 per-thread 的 write-clocks 是用來 **stealing**(不確定中文翻譯，故保留原文) 來確保正確性。 - 首先 (第一張圖片)，$T_1$ 和 $T_2$ 讀取 `g-clock` 並儲存到自己的 `l-clock` 中，接著讀取 object ($O_1$， $O_2$)，此時還沒有任何 object 被 lock 住，所以就直接讀取 - 第二張圖片中，$T_2$ **在更新 (updates) 之前** locks and logs $O_2$ 及 $O_3$，可以從圖上看到 $O_2$ 及 $O_3$ 被 copy 到 $T_2$ 的 `w-log` 中。同時，$T_1$ 讀取 $O_2$ 且偵測到它被 $T_2$ 給 locked 了，故 $T_1$ **因此決定必須要 steal 新的 copied object**，故 $T_1$ 會去比較他的 `l-clock` 和 $T_2$ 的 `w-clock` - 如果 $T_1$ 的 `l-clock` 大於等於 $T_2$ 的 `w-clock` 才會去 **steal** 新的 copy - 否則就直接去讀取 memory 中的 object (如同圖上的情況) - 最後一張圖中，$T_2$ 開始 commit 新的 object，首先它先計算下一個 clock value (`23`)，將其寫入到自己的 `w-clock` 以及全域的 `g-clock` 中 (這裡的順序很重要，因為這個 `g-clock` 就會決定新舊資料的差別)。此時操作就開始分為 "old" 和 "new"，也就是在 clock increment 之前和之後，此時 $T_2$ 就必須要等待舊的操作結束，也就是 $T_1$。 - 同一時間， $T_3$ 去讀取 $O_2$ 且透過比較他 ($T_3$) 的 `l-clock` 和 $T_2$ 的 `w-clock` 認知到這個操作是 "new" 的，所以就去 **steal** $T_2$ 中的 write-log 中的新 copy object $O_2$，如此一來新的操作都只會去讀取新的 object copies。 - 一但 $T_2$ 的等待結束，就不會有任何人可以讀取到 old copies，也就到了可以安全的把新的 copies 寫回 $O_2$ 和 $O_3$ memory 了雖然 RLU 保有了 RCU 緣有的優點且額外實現了 [multi-location atomic updates](https://firebase.blog/posts/2015/09/introducing-multi-location-updates-and_86/)，但尚無法稱其為一種通用的 concurrency control，[How about RLU? Is RLU a generic concurrency control?](https://concurrencyfreaks.blogspot.com/2019/10/how-about-rlu-is-rlu-generic.html) 中指出 RLU 相較於 RCU 已經更為通用且高效，但尚未能夠稱為是通用的方法，因為 RLU 僅實踐 write-read synchronization，並不包含 write-write synchronization >RLU is certainly more generic and efficient than RCU but it does not allow for generic code to be used. 而這點在論文中也有提及 >The basic idea of RLU described above provides read-write synchronization for object accesses. It does not however ensure write-write synchronization, which must be managedby the programmer if needed (as with RCU) 為了避免 write-write conflicts， RLU 會在需要更動 object 時進行上鎖，參照論文 pseudo code `line 36` ``` if TRY_LOCK(obj, ptr-copy) then => Try to install copy RLU_ABORT(ctx) => Failed, retry RLU section ``` 〈[Read-Log-Update](http://alchem.usc.edu/courses-ee599/downloads/rlu-sosp15.pdf)〉中也進行了和〈[Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming](https://www.usenix.org/legacy/event/atc11/tech/final_files/Triplett.pdf)〉相同的實驗以此來比較吞吐量 (在 SOSP 中有提及 RLU 的挑戰之一是在保有 RCU 的效能)，論文實測出來的結果吞吐量是差不多的 (實驗待補)，但 resize 卻慢 RCU 兩倍左右，這是因為 RLU 在 duplicating nodes 的 overhead，但 resize 這件事情本身發生的並不頻繁，所以可以推敲出 resize 的 latency 相較於 concurrent lookups 更顯得沒那麼重要。 ### Related work - [MV-RLU: Scaling Read-Log-Update with Multi-Versioning](https://cosmoss-jigu.github.io/pages/pubs/mvrlu-kim-asplos19.pdf) - 為了改善 RLU 隨著 write 操作增加時的 overhead，改進 RLU 的 two-version 實作，改為 multi-version ## TODO: 在 Linux 核心挑選完整案例解說 [rhashtable](https://elixir.bootlin.com/linux/latest/source/include/linux/rhashtable.h) 的應用在 [selinux-kernel/net/bridge/br_vlan_tunnel.c](https://github.com/SELinuxProject/selinux-kernel/blob/main/net/bridge/br_vlan_tunnel.c) 中，運用了 `rhashtable` 來實現網路橋接 (bridge) 的功能。 :::warning TODO: 提供 `br_vlan_tunnel.c` 對應的測試方式，避免「舉燭」，要能實際操作。「舉燭」，要能夠實際操作。 :notes: jserv ::: 具體來說， `br_vlan_tunnel.c` 用於管理和處理 VLAN (VLAN tunnel port) 通道連接埠的資料，包括增加、刪除、查詢 VLAN 相關資料。這些函式與連接埠設備的VLAN連接埠相關聯，透過程式管理與處理目的元資料 (dst_metadata) ，確保資料封包按照預期的方式在連接埠設備上傳遞和處理。 - [ ] `br_vlan_tunid_cmp` 這個函式的作用是根據 **VLAN 連接埠的目的 metadata 中的 `tunnel_id`** 來比較是否與**給定的資料**一致。 ```c static inline int br_vlan_tunid_cmp(struct rhashtable_compare_arg *arg, const void *ptr) { const struct net_bridge_vlan *vle = ptr; __be64 tunid = *(__be64 *)arg->key; return vle->tinfo.tunnel_id != tunid; } ``` 1. `ptr` 指標指向 `struct net_bridge_vlan` 結構體的 `vle` 。 >`struct net_bridge_vlan` 是一個表示網路橋接中VLAN配置的資料結構。 >透過將 `ptr` 指向 `struct net_bridge_vlan` 可以在雜湊表中儲存和查詢與特定通道 ID 相關的 VLAN 配置。 2. `arg->key` 是一個指向 `void` 的指標，而 `vle->tinfo.tunnel_id` 是 `__be64` 類型的指標，故須要轉型。 3. 比較 `arg->key` 與 `vle->tinfo.tunnel_id` 是否相同。 - [ ] `br_vlan_tunnel_rht_params` 定義一個名為 `br_vlan_tunnel_rht_params` 的 `struct rhashtable_params` 結構體，設定配置 `rhashtable` 的參數。 ```c static const struct rhashtable_params br_vlan_tunnel_rht_params = { .head_offset = offsetof(struct net_bridge_vlan, tnode), .key_offset = offsetof(struct net_bridge_vlan, tinfo.tunnel_id), .key_len = sizeof(__be64), .nelem_hint = 3, .obj_cmpfn = br_vlan_tunid_cmp, .automatic_shrinking = true, }; ``` 1. `.head_offset`用於確認雜湊表的節點結構中，指向雜湊表頭部的指標位置。在這裡為 `tnode` 的偏移量。[offsetof 概念](https://hackmd.io/@sysprog/c-linked-list#container_of) 2. `.key_offset` 用於確認雜湊表的節點結構中，指向關鍵字的指標位置。 3. `.key_len` 為關鍵字的長度。 4. `.nelem_hint` 為雜湊表內的預計儲存的元素數量。 5. `.obj_cmpfn` 指向比較函式的指標，用於比較關鍵字。 6. ==`.automatic_shrinking` 是否允許雜湊表 resize==。 - [ ] `br_vlan_tunnel_lookup` 此函式用於在給定的 `rhashtable` 雜湊表中查詢指定的通道 ID （`tunnel_id`）。 ```c static struct net_bridge_vlan *br_vlan_tunnel_lookup(struct rhashtable *tbl, __be64 tunnel_id) { return rhashtable_lookup_fast(tbl, &tunnel_id, br_vlan_tunnel_rht_params); } ``` 1. `tb1` 為一個指向 `rhashtable` 的指標，在此 `rhashtable` 查詢目標。 2. `tunnel_id` 為一個 `__be64` 型別的通道 ID ，表示為要查詢的目標。 - [ ] `vlan_tunnel_info_release` 此函式用來釋放 `struct net_bridge_vlan` 結構體內部與 VLAN 相關的通道資料。 ```c static void vlan_tunnel_info_release(struct net_bridge_vlan *vlan) { struct metadata_dst *tdst = rtnl_dereference(vlan->tinfo.tunnel_dst); WRITE_ONCE(vlan->tinfo.tunnel_id, 0); RCU_INIT_POINTER(vlan->tinfo.tunnel_dst, NULL); dst_release(&tdst->dst); } ``` 1. 透過 `rtnl_dereference` 巨集來取得指向 `tunnel_dst` 成員的指標 `tdst` 。 >`rtnl_dereference` 用於對指標解指標，並進行 RCU(Read Copy Update)處理。 2. 使用巨集 `WRITE_ONCE` 將 `tunnel_id` 成員的值設定為０。 3. 使用巨集 `RCU_INIT_POINTER` 將 `tunnel_dst` 成員的值設定為 `NULL` 。 4. 釋放 `tdst->dst` 指向目標的資源。 - [ ] `vlan_tunnel_info_del` 此函式用來刪除 VLAN 相關的通道資料，並且釋放相關資源。 ```c void vlan_tunnel_info_del(struct net_bridge_vlan_group *vg, struct net_bridge_vlan *vlan) { if (!rcu_access_pointer(vlan->tinfo.tunnel_dst)) return; rhashtable_remove_fast(&vg->tunnel_hash, &vlan->tnode, br_vlan_tunnel_rht_params); vlan_tunnel_info_release(vlan); } ``` 1. 利用巨集 `rcu_access_pointer` 檢查 `tunnel_dst` 是否為 `null` 。 2. 使用 `rhashtable_remove_fast` 從 `&vg->tunnel_hash` 雜湊表中快速移除 `VLAN` 通道節點資訊。 1. `&vlan->tnode` 要移除的節點。 2. `br_vlan_tunnel_rht_params` 雜湊表內的參數。 3. 利用先前提到的 `vlan_tunnel_info_release` 釋放資源。 - [ ] `__vlan_tunnel_info_add` 此函式用於新增 VLAN 通道的資訊。 ```c static int __vlan_tunnel_info_add(struct net_bridge_vlan_group *vg, struct net_bridge_vlan *vlan, u32 tun_id) { struct metadata_dst *metadata = rtnl_dereference(vlan->tinfo.tunnel_dst); __be64 key = key32_to_tunnel_id(cpu_to_be32(tun_id)); int err; if (metadata) return -EEXIST; metadata = __ip_tun_set_dst(0, 0, 0, 0, 0, TUNNEL_KEY, key, 0); if (!metadata) return -EINVAL; metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX | IP_TUNNEL_INFO_BRIDGE; rcu_assign_pointer(vlan->tinfo.tunnel_dst, metadata); WRITE_ONCE(vlan->tinfo.tunnel_id, key); err = rhashtable_lookup_insert_fast(&vg->tunnel_hash, &vlan->tnode, br_vlan_tunnel_rht_params); if (err) goto out; return 0; out: vlan_tunnel_info_release(vlan); return err; } ``` 1. 輸入參數 1. `vg` : VLAN 群組。 2. `vlan` : 要添加資料的 VLAN。 3. `tun_id` : 通道 ID 。 2. 利用巨集 `rtnl_dereference` 檢查 `vlan->tinfo.tunnel_dst` 是否為 `null` ，如果不是，表示通道內已有資料，回傳錯誤代碼 `-EEXIST` 。 3. 利用 `__ip_tun_set_dst` 函式建立一個 `metadata` ，用來表示通道的目的地。 :::spoiler `__ip_tun_set_dst` 傳入參數為下列： 1. IPv4地址的來源地址 2. IPv4地址的目的地址 3. IPv6地址的來源地址 4. IPv6地址的目的地址 5. 來源連接埠 6. `TUNNEL_KEY` 表示通道類型， `TUNNEL_KEY` 表示使用 Key 通道 7. `KEY` 表示通道 ID 8. 傳輸層協議類型 ::: 4. 確認 `metadata` 是否為 `NULL` ，是的話，回傳錯誤代碼 `-EEXIST` 。 5. 將 `IP_TUNNEL_INFO_TX` 和 `IP_TUNNEL_INFO_BRIDGE` 的值設定成 `mode` 內的對應值。表示該通道是用於傳輸與橋接。 6. 使用 `rcu_assign_pointer` 將 `metadata` 賦值給 `vlan->tinfo.tunnel_dst` 將 VLAN 通道資料與通道目的地連結起來。 7. 使用巨集 `WRITE_ONCE` 將通道 ID (`key`) 寫入 `vlan->tinfo.tunnel_id` 。表示該 VLAN 通道資料與通道 ID 的對應。 8. 使用 `rhashtable_lookup_insert_fast` 函式將 VLAN 通道資料添加到 `tunnel_hash` 雜湊表中。 9. 如果新增失敗，跳到 out ，並且釋放掉資源。 10. 如果新增的過程都順利，回傳 0 。 - [ ] `vlan_tunnel_init` 此函式用於初始化 VLAN 通道的雜湊表，由指標 `vg` 指向 `net_bridge_vlan_group` 作為輸入參數。 ```c int vlan_tunnel_init(struct net_bridge_vlan_group *vg) { return rhashtable_init(&vg->tunnel_hash, &br_vlan_tunnel_rht_params); } ``` - [ ] `vlan_tunnel_deinit` 此函式用於破壞 VLAN 通道的雜湊表，由指標 `vg` 指向 `net_bridge_vlan_group` 作為輸入參數。 ```c void vlan_tunnel_deinit(struct net_bridge_vlan_group *vg) { rhashtable_destroy(&vg->tunnel_hash); } ``` ## 實測 `br_vlan_tunnel.c` 下載程式碼 ```shell $ git clone https://github.com/SELinuxProject/selinux-kernel.git ``` 進入 Repository ``` $ cd selinux-kernel/net/bridge ``` 編譯程式碼 ```shell $ sudo make -C /lib/modules/5.19.0-43-generic/build M=$(PWD) ``` > `$ sudo make -C /lib/modules/$(shell uname -r)/build M=$(PWD)` 會出現`Command 'shell' not found` ，手動輸入版本資訊可以解決此問題。 > 版本資訊 `$ uname -r` 。將其掛載: ```shell $ sudo insmod br_vlan_tunnel.ko ``` ### 測試方法想想要用什麼方法測試？？將其卸載: ```shell $ sudo rmmod br_vlan_tunnel ``` ## TODO: 在 Linux 核心找出類似的並行資料結構 >- [/include/linux/kfifo.h](https://elixir.bootlin.com/linux/latest/source/include/linux/kfifo.h) [kfifo](https://archive.kernel.org/oldlinux/htmldocs/kernel-api/kfifo.html) 為 lock-free synchronization 的實作之一，支援 single reader single writer