Contributors: tsaiiuo, eleanorLYJ
2024 report
Reproduce the experiments and interpret the results
Discuss how Redis benefits cloud-computing and network-service system architectures
Explain why userspace RCU helps raise Redis's degree of concurrency
Analyze the factors that limit Redis's concurrency
Summarize the issues that must be considered when integrating userspace RCU with Redis
Evaluate the use and tuning of neco
To use the memb flavor, we first need to understand what the sys_membarrier() system call does. Start with its description:
Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invokation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.
sys_membarrier() sends an IPI (Inter-Processor Interrupt) to all currently running threads of the calling process, making each of them execute a real memory barrier on its own CPU; threads that are not running, or that belong to other processes, are not affected.
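To make the mechanism concrete, here is a minimal sketch of the membarrier(2) interface that current kernels expose (the modern API, which differs slightly in shape from the 2010 patch the quote above describes); it only shows the syscall usage, not liburcu's internals:

```c
/* Minimal sketch of membarrier(2) (Linux >= 4.14 with the private expedited
 * commands); error handling is kept deliberately simple. */
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static long membarrier(int cmd, unsigned int flags)
{
    return syscall(SYS_membarrier, cmd, flags);
}

int main(void)
{
    /* A process must register once before it may use the expedited command. */
    if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) != 0) {
        perror("membarrier register");
        return 1;
    }
    /* Sends IPIs only to CPUs that are currently running threads of this
     * process; each such thread executes a full memory barrier. */
    if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) != 0) {
        perror("membarrier private expedited");
        return 1;
    }
    puts("process-wide memory barrier issued");
    return 0;
}
```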
To explain this further, the commit gives an example:
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
The two threads are implemented roughly as follows; a hedged sketch is given below.
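Since the original listing is not reproduced here, this is a reconstruction of the "before" version: both sides use a full memory barrier (smp_mb() is approximated with a C11 sequentially consistent fence):

```c
/* Sketch of the original scheme: the frequent read side (Thread B) pays the
 * cost of a full memory barrier just like the infrequent write side. */
#include <stdatomic.h>

void thread_a_writer(void)        /* e.g. inside synchronize_rcu() */
{
    /* previous memory accesses */
    atomic_thread_fence(memory_order_seq_cst);   /* smp_mb() */
    /* following memory accesses */
}

void thread_b_reader(void)        /* e.g. inside rcu_read_lock()/unlock() */
{
    /* previous memory accesses */
    atomic_thread_fence(memory_order_seq_cst);   /* smp_mb() */
    /* following memory accesses */
}
```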
After introducing sys_membarrier(), this can be rewritten as sketched below.
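A hedged reconstruction of the "after" version (the expedited private membarrier command is used here for illustration, and the process is assumed to have registered for it beforehand):

```c
/* Sketch after adopting sys_membarrier(): the read side keeps only a
 * compiler barrier, while the write side asks the kernel to run a full
 * barrier on every running thread of the process. */
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

void thread_a_writer(void)        /* infrequent */
{
    /* previous memory accesses */
    syscall(SYS_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
    /* following memory accesses */
}

void thread_b_reader(void)        /* frequent */
{
    /* previous memory accesses */
    __asm__ __volatile__("" ::: "memory");       /* barrier() */
    /* following memory accesses */
}
```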
Inside Thread B (the reader), the original smp_mb() can be replaced with a plain compiler barrier, which saves a great deal of cost. Thread A (the writer) replaces its smp_mb() with sys_membarrier(), which only sends an IPI to the active threads of the process and forces each of them to execute one full memory barrier.
After this change there are two cases:
In the first case, Thread B's accesses are only weakly ordered, but that is acceptable: Thread A does not care about ordering against them, because the two do not overlap.
In the second case, when Thread B executes barrier() it only guarantees ordering within its own program; but Thread A is concurrently calling sys_membarrier(), so the kernel sends an IPI to every running thread and makes each of them execute a hardware memory barrier on its CPU core. This "upgrades" B's compiler barrier into a genuine full smp_mb(), enforcing ordering across cores. Threads that are not running (descheduled by the scheduler) need to do nothing, because a context switch already implies a barrier.
Each non-running process threads are intrinsically serialized by the scheduler.
The author also ran a performance experiment comparing the cost of sys_membarrier() in expedited and non-expedited mode (on an Intel Xeon E5405):
Expedited membarrier
Number of calls: 10,000,000
Per-call overhead: about 2–3 µs
Non-expedited membarrier
Number of calls: 1,000
Total time about 16 s → per-call overhead about 16 ms
Roughly 5,000–8,000× slower than expedited mode
Add "int expedited" parameter, use synchronize_sched() in the non-expedited
case. Thanks to Lai Jiangshan for making us consider seriously using
synchronize_sched() to provide the low-overhead membarrier scheme.
Check num_online_cpus() == 1, quickly return without doing nothing.
The effectiveness of sys_membarrier() was also benchmarked:
Operations in 10s, 6 readers, 2 writers:
(what we previously had)
memory barriers in reader: 973494744 reads, 892368 writes
signal-based scheme: 6289946025 reads, 1251 writes
(what we have now, with dynamic sys_membarrier check, expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 4316818891 reads, 503790 writes
(dynamic sys_membarrier check, non-expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 8698725501 reads, 313 writes
The first thing to observe is that the dynamic sys_membarrier check adds extra cost, lowering both read-side and write-side throughput by roughly 7%; the results still sit roughly on the same baseline, however, so the default configuration keeps the dynamic check enabled.
With sys_membarrier() enabled in expedited mode, performance improves greatly: the read side only executes a compiler barrier, so it is far better than the original mb scheme. On the write side, throughput is several hundred times higher than the signal-based scheme (503,790 vs. 1,251 writes), but still slightly worse than the original mb scheme because of the extra sys_membarrier() calls. In non-expedited mode the read-side throughput is the highest of all, because the write side sends no IPI (it falls back to a lighter-weight membarrier mechanism); without IPI interference, read-side throughput rises markedly. At the same time, since the writer issues no IPI it relies mainly on CPU context switches, or on a thread periodically waking each CPU, so there is no immediacy: each sys_membarrier() call takes longer and is less predictable.
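The "dynamic sys_membarrier check" mentioned above boils down to asking the kernel at startup which membarrier commands it supports and falling back to full memory barriers otherwise. A hedged sketch of such a probe (not liburcu's actual code):

```c
/* Probe membarrier(2) support at startup; if the syscall or the expedited
 * command is missing, the caller would fall back to the smp_mb() scheme. */
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdbool.h>

static bool private_expedited_membarrier_usable(void)
{
    long supported = syscall(SYS_membarrier, MEMBARRIER_CMD_QUERY, 0);

    if (supported < 0)
        return false;               /* kernel without membarrier(2) */
    if (!(supported & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
        return false;               /* expedited command not available */
    /* Registration must succeed before the command may be used. */
    return syscall(SYS_membarrier,
                   MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) == 0;
}
```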
mt-redis notes: https://hackmd.io/@tMeEIUCqQSCiXUaa-3tcsg/r11N0uJbll
The goal here is to determine which URCU flavor is best suited to Redis.
The experiments run only on P-cores, using taskset to avoid the E-cores; Hyper-Threading is also disabled (a sketch of how to check the topology and turn SMT off is shown below).
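A sketch of the commands for this step (the sysfs path and lscpu columns are standard interfaces, but the actual core numbering is machine-specific):

```shell
# Show logical CPU -> core mapping and max frequency; P-cores are the ones
# with the higher MAXMHZ.
lscpu --extended=CPU,CORE,MAXMHZ

# Disable Hyper-Threading (SMT) system-wide.
echo off | sudo tee /sys/devices/system/cpu/smt/control
```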
The command used to start mt-redis therefore needs to change. Below is how I bind the mt-redis server to the P-cores, i.e. logical CPU ids 0,2,4,6,8,10,12,14,16 (a hedged sketch of the full command follows the flag descriptions):
--appendonly no disables AOF (Append Only File), so that every write is not flushed to disk
--save "" disables RDB snapshots, i.e. turns off periodic on-disk snapshots of the data
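A hedged reconstruction of that start-up command (the binary path is an assumption):

```shell
# Pin the mt-redis server to the P-cores and disable on-disk persistence.
taskset -c 0,2,4,6,8,10,12,14,16 ./src/redis-server \
    --appendonly no \
    --save ""
```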
Below is the memtier_benchmark command I use (a hedged sketch follows the note below).
No particular command is specified, so by default memtier_benchmark tests the string type.
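A hedged sketch of such an invocation (the ratio, thread count, client count, and test time shown here are placeholders, not the exact values used):

```shell
# 1 SET for every 10 GETs against a local mt-redis instance, default string type.
memtier_benchmark -s 127.0.0.1 -p 6379 \
    --ratio=1:10 \
    --threads=4 --clients=50 \
    --test-time=60 \
    --hide-histogram
```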
Last year, Yeh compared five URCU flavors under different read/write ratios: 10:1, 100:1, and 1:1.
The target was mt-redis; the same command was executed 100 times, observing SET Ops/sec, SET Average Latency, SET p50/p99/p99.9 Latency, and GET Ops/sec, GET Average Latency, GET p50/p99/p99.9 Latency.
Last year's conclusion:
In the tests where reads outnumber writes, signal-based RCU performed best, while in the 1:1 test its write performance was worse. In theory QSBR should be the best-performing flavor, but in practice it fell short of expectations; the timing of entering the quiescent state may need tuning.
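The remark about "the timing of entering the quiescent state" is the key QSBR tuning knob: each reader thread must periodically announce that it holds no RCU references. A minimal sketch of where a worker loop could do this with liburcu's QSBR flavor (handle_requests_batch() and event_loop_wait() are placeholders, not mt-redis functions):

```c
/* QSBR sketch: readers pay nothing in read_lock/unlock, but each thread must
 * regularly pass through a quiescent state. Link with -lurcu-qsbr. */
#include <urcu/urcu-qsbr.h>

extern void handle_requests_batch(void);   /* placeholder */
extern void event_loop_wait(void);         /* placeholder */

void worker_loop(void)
{
    urcu_qsbr_register_thread();
    for (;;) {
        handle_requests_batch();           /* RCU reads happen in here */

        /* How often this runs is the tuning point: too rarely stalls
         * grace periods, too often adds work on the hot path. */
        urcu_qsbr_quiescent_state();

        /* While blocked on I/O, mark the thread offline so writers do
         * not have to wait for it. */
        urcu_qsbr_thread_offline();
        event_loop_wait();
        urcu_qsbr_thread_online();
    }
}
```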
However, I found that urcu-signal was deprecated in 2023! Commit aad674a is the last commit related to removing urcu-signal; its commit message recommends switching to urcu-memb, because urcu-memb achieves similar read-side performance without having to reserve a signal, and also has improved grace-period performance.
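For reference, a minimal sketch of what using the memb flavor looks like with liburcu's flavor-prefixed API (struct cfg and global_cfg are illustrative, not mt-redis data structures):

```c
/* urcu-memb sketch: readers use only compiler barriers; synchronize_rcu()
 * relies on sys_membarrier(). Link with -lurcu-memb. */
#include <urcu/urcu-memb.h>
#include <urcu/pointer.h>
#include <stdlib.h>

struct cfg { int value; };
static struct cfg *global_cfg;

void reader_step(void)                 /* thread must have called
                                          urcu_memb_register_thread() */
{
    urcu_memb_read_lock();
    struct cfg *c = rcu_dereference(global_cfg);
    if (c)
        (void)c->value;                /* use the object under RCU protection */
    urcu_memb_read_unlock();
}

void writer_update(int v)
{
    struct cfg *newc = malloc(sizeof(*newc));
    newc->value = v;
    struct cfg *old = global_cfg;
    rcu_assign_pointer(global_cfg, newc);
    urcu_memb_synchronize_rcu();       /* wait for pre-existing readers */
    free(old);
}
```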
SET Ops/sec
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
QSBR | 3161.1825 | 44.0917 | 3077.12 | 3395.76 | 3156.970 | 3152.3897 | 3169.9753 |
BP | 3155.6116 | 35.5904 | 3076.91 | 3279.57 | 3152.895 | 3148.5141 | 3162.7091 |
MEMB | 3157.9045 | 48.7512 | 3086.58 | 3347.93 | 3149.165 | 3148.1825 | 3167.6265 |
MB | 3156.2084 | 36.6910 | 3098.75 | 3313.90 | 3152.155 | 3148.8914 | 3163.5254 |
SIGNAL | 3147.1692 | 52.7197 | 2924.33 | 3470.32 | 3139.700 | 3136.6558 | 3157.6826 |
GET Ops/sec
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
QSBR | 314545.6027 | 4387.2335 | 306181.56 | 337886.26 | 314126.64 | 313670.6949 | 315420.5105 |
BP | 313991.2927 | 3541.3685 | 306159.83 | 326325.92 | 313721.055 | 313285.0684 | 314697.5170 |
MEMB | 314219.4267 | 4850.8 | 307122.07 | 333127.1 | 313349.84 | 313252.0738 | 315186.7796 |
MB | 314050.6865 | 3650.8393 | 308333.59 | 329741.69 | 313647.345 | 313322.6314 | 314778.7416 |
SIGNAL | 313151.2636 | 5245.7102 | 290978.57 | 345305.32 | 312408.195 | 312105.1572 | 314197.3700 |
(Figures: SET/GET Average Latency and SET/GET p50, p99, p99.9 Latency)
SET Ops/sec
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
QSBR | 28660.8337 | 320.453 | 28114.8 | 30779.23 | 28630.195 | 28596.9285 | 28724.7388 |
BP | 28658.5042 | 315.7301 | 28105.65 | 29984.92 | 28611.525 | 28595.5409 | 28721.4675 |
MEMB | 28630.9119 | 395.0763 | 27642.4 | 31215.68 | 28617.705 | 28552.1253 | 28709.6985 |
MB | 28634.8009 | 468.6139 | 27521.34 | 31324.04 | 28564.03 | 28541.3493 | 28728.2525 |
SIGNAL | 28613.3392 | 405.6762 | 26497.11 | 30684.58 | 28576.275 | 28532.4388 | 28694.2398 |
GET Ops/sec
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
QSBR | 286466.53 | 3202.9471 | 281008.89 | 307640.06 | 286160.30 | 285827.7931 | 287105.2649 |
BP | 286443.24 | 3155.7368 | 280917.47 | 299700.80 | 285973.71 | 285813.9208 | 287072.5632 |
MEMB | 286167.46 | 3948.8055 | 276287.26 | 312002.31 | 286035.46 | 285379.9879 | 286954.9399 |
MB | 286206.33 | 4683.8179 | 275077.25 | 313085.39 | 285498.95 | 285272.2763 | 287140.3825 |
SIGNAL | 285991.8171 | 4054.7599 | 264839.98 | 306694.00 | 285621.38 | 285183.2116 | 286800.4226 |
(Figures: SET/GET Average Latency and SET/GET p50, p99, p99.9 Latency)
SIGNAL is not used, because it performs noticeably worse on several key metrics, especially tail latency in extreme cases.
(Figures: SET/GET Ops/sec; SET/GET Average Latency, p50, p99, p99.9 Latency)
Last year's conclusion was that, without CPU affinity, the number of threads should be set to 4, so I first want to reproduce that experiment and examine how the thread count affects throughput.
I set the write:read ratio to 1:100 and 1:10; material gathered online indicates that cache workloads are mostly one of these two, so only these two scenarios are tested:
In real Memcached benchmarks, about 43.9% of the tests use a SET:GET = 1:100 configuration.
Google Cloud Memorystore mentions that "By default, the utility issues 10 get commands for every set command."
The following tests compare the qsbr and memb flavors with 4 versus 8 threads.
SET Ops/sec
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-4t | 1911.112 | 36.840 | 1757.440 | 2006.030 | 1905.100 | 1903.892 | 1918.333 |
qsbr-8t | 1857.539 | 47.483 | 1687.700 | 1965.850 | 1853.320 | 1848.232 | 1866.845 |
memb-4t | 1904.987 | 48.352 | 1684.740 | 2037.310 | 1913.505 | 1895.510 | 1914.464 |
memb-8t | 1843.536 | 42.711 | 1684.790 | 1949.790 | 1852.665 | 1835.164 | 1851.907 |
GET Ops/sec
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-4t | 190160.503 | 3665.605 | 174869.310 | 199605.400 | 189562.675 | 189442.045 | 190878.962 |
qsbr-8t | 184829.809 | 4724.670 | 167930.680 | 195606.950 | 184409.640 | 183903.774 | 185755.844 |
memb-4t | 189551.082 | 4811.135 | 167635.600 | 202717.860 | 190398.130 | 188608.099 | 190494.064 |
memb-8t | 183436.420 | 4249.845 | 167640.890 | 194008.960 | 184344.715 | 182603.450 | 184269.389 |
SET Average Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-4t | 1.459 | 0.059 | 1.357 | 1.681 | 1.449 | 1.447 | 1.470 |
qsbr-8t | 1.407 | 0.061 | 1.316 | 1.699 | 1.395 | 1.395 | 1.419 |
memb-4t | 1.501 | 0.128 | 1.405 | 2.567 | 1.474 | 1.475 | 1.526 |
memb-8t | 1.487 | 0.091 | 1.387 | 1.954 | 1.469 | 1.469 | 1.504 |
GET Average Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-4t | 1.042 | 0.013 | 1.012 | 1.122 | 1.040 | 1.039 | 1.045 |
qsbr-8t | 1.072 | 0.022 | 1.040 | 1.165 | 1.079 | 1.068 | 1.077 |
memb-4t | 1.043 | 0.025 | 1.018 | 1.187 | 1.035 | 1.038 | 1.048 |
memb-8t | 1.078 | 0.022 | 1.052 | 1.162 | 1.069 | 1.073 | 1.082 |
SET p50 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-4t | 1.221 | 0.110 | 1.007 | 1.423 | 1.247 | 1.200 | 1.243 |
qsbr-8t | 1.054 | 0.046 | 1.015 | 1.295 | 1.047 | 1.045 | 1.063 |
memb-4t | 1.317 | 0.061 | 1.135 | 1.695 | 1.319 | 1.305 | 1.329 |
memb-8t | 1.162 | 0.082 | 1.047 | 1.431 | 1.147 | 1.146 | 1.178 |
GET p50 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-4t | 0.958 | 0.007 | 0.935 | 0.975 | 0.959 | 0.957 | 0.960 |
qsbr-8t | 0.993 | 0.012 | 0.967 | 1.015 | 0.999 | 0.990 | 0.995 |
memb-4t | 0.941 | 0.014 | 0.839 | 0.967 | 0.943 | 0.938 | 0.944 |
memb-8t | 0.981 | 0.012 | 0.951 | 1.015 | 0.975 | 0.979 | 0.984 |
SET p99 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-4t | 6.030 | 0.432 | 5.279 | 7.391 | 5.983 | 5.946 | 6.115 |
qsbr-8t | 6.230 | 0.840 | 5.183 | 9.407 | 5.919 | 6.065 | 6.394 |
memb-4t | 6.512 | 0.611 | 5.695 | 10.559 | 6.431 | 6.392 | 6.632 |
memb-8t | 7.008 | 1.004 | 5.471 | 11.327 | 6.815 | 6.811 | 7.205 |
GET p99 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-4t | 3.882 | 0.398 | 3.151 | 5.055 | 3.791 | 3.804 | 3.960 |
qsbr-8t | 4.004 | 0.402 | 3.391 | 5.855 | 3.903 | 3.925 | 4.082 |
memb-4t | 4.266 | 0.357 | 3.807 | 6.335 | 4.223 | 4.196 | 4.336 |
memb-8t | 4.291 | 0.425 | 3.727 | 6.495 | 4.191 | 4.208 | 4.374 |
SET p99.9 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-4t | 9.682 | 1.119 | 8.319 | 13.631 | 9.375 | 9.462 | 9.901 |
qsbr-8t | 11.010 | 1.314 | 8.511 | 15.743 | 10.975 | 10.753 | 11.268 |
memb-4t | 10.423 | 1.753 | 8.895 | 24.831 | 10.111 | 10.080 | 10.767 |
memb-8t | 12.050 | 1.328 | 9.599 | 17.279 | 11.871 | 11.789 | 12.310 |
GET p99.9 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-4t | 6.454 | 0.506 | 5.727 | 8.383 | 6.399 | 6.355 | 6.553 |
qsbr-8t | 7.671 | 0.698 | 6.591 | 11.135 | 7.455 | 7.534 | 7.808 |
memb-4t | 7.097 | 0.834 | 6.207 | 14.079 | 6.943 | 6.934 | 7.261 |
memb-8t | 8.064 | 0.704 | 7.007 | 12.159 | 7.999 | 7.926 | 8.202 |
I saw a review comment from last year asking: for CPU affinity, why pin two threads per CPU instead of one thread per CPU? Last year's project owner replied as follows:
Because I tested one thread per CPU and the performance was worse than two threads per CPU; with two threads, context switches also let a thread that is about to do I/O rest and hand the CPU over to the other thread.
However, that statement was not backed by an experiment, so next I test whether two threads per CPU or one thread per CPU is better, and check whether last year's conclusion holds. The qsbr and memb flavors are examined separately below (a hedged sketch of how such pinning can be expressed follows).
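A hedged sketch (not mt-redis code) of how the two placements can be expressed with pthread_setaffinity_np(); the pcore_ids list matches the P-core ids used above, and threads_per_core selects between the two schemes:

```c
/* Pin worker threads to P-cores: threads_per_core = 1 gives one thread per
 * CPU, threads_per_core = 2 packs two threads onto each CPU. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static const int pcore_ids[] = { 0, 2, 4, 6, 8, 10, 12, 14, 16 };

int pin_worker(pthread_t thread, int worker_idx, int threads_per_core)
{
    int ncores = (int)(sizeof(pcore_ids) / sizeof(pcore_ids[0]));
    int core = pcore_ids[(worker_idx / threads_per_core) % ncores];
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}
```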
SET Ops/sec
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
memb-one-to-one-affinity | 1937.405 | 52.190 | 1823.790 | 2097.200 | 1930.500 | 1927.176 | 1947.634 |
memb-two-to-one-affinity | 2018.488 | 72.174 | 1869.110 | 2211.830 | 2020.165 | 2004.342 | 2032.635 |
memb-4t-no-affinity | 1904.987 | 48.352 | 1684.740 | 2037.310 | 1913.505 | 1895.510 | 1914.464 |
memb-8t-no-affinity | 1843.536 | 42.711 | 1684.790 | 1949.790 | 1852.665 | 1835.164 | 1851.907 |
GET Ops/sec
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
memb-one-to-one-affinity | 192776.628 | 5193.027 | 181471.580 | 208676.540 | 192089.445 | 191758.795 | 193794.462 |
memb-two-to-one-affinity | 200844.686 | 7181.510 | 185980.900 | 220082.210 | 201011.445 | 199437.110 | 202252.262 |
memb-4t-no-affinity | 189551.082 | 4811.135 | 167635.600 | 202717.860 | 190398.130 | 188608.099 | 190494.064 |
memb-8t-no-affinity | 183436.420 | 4249.845 | 167640.890 | 194008.960 | 184344.715 | 182603.450 | 184269.389 |
SET Average Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
memb-one-to-one-affinity | 1.405 | 0.039 | 1.330 | 1.536 | 1.400 | 1.398 | 1.413 |
memb-two-to-one-affinity | 1.332 | 0.037 | 1.278 | 1.524 | 1.326 | 1.325 | 1.339 |
memb-4t-no-affinity | 1.501 | 0.128 | 1.405 | 2.567 | 1.474 | 1.475 | 1.526 |
memb-8t-no-affinity | 1.487 | 0.091 | 1.387 | 1.954 | 1.469 | 1.469 | 1.504 |
GET Average Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
memb-one-to-one-affinity | 1.036 | 0.015 | 1.014 | 1.084 | 1.032 | 1.033 | 1.039 |
memb-two-to-one-affinity | 1.006 | 0.016 | 0.978 | 1.087 | 1.002 | 1.002 | 1.009 |
memb-4t-no-affinity | 1.043 | 0.025 | 1.018 | 1.187 | 1.035 | 1.038 | 1.048 |
memb-8t-no-affinity | 1.078 | 0.022 | 1.052 | 1.162 | 1.069 | 1.073 | 1.082 |
SET p50 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
memb-one-to-one-affinity | 1.153 | 0.084 | 1.007 | 1.327 | 1.159 | 1.137 | 1.170 |
memb-two-to-one-affinity | 1.249 | 0.024 | 1.191 | 1.359 | 1.247 | 1.244 | 1.254 |
memb-4t-no-affinity | 1.317 | 0.061 | 1.135 | 1.695 | 1.319 | 1.305 | 1.329 |
memb-8t-no-affinity | 1.162 | 0.082 | 1.047 | 1.431 | 1.147 | 1.146 | 1.178 |
GET p50 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
memb-one-to-one-affinity | 0.943 | 0.010 | 0.919 | 0.967 | 0.943 | 0.941 | 0.945 |
memb-two-to-one-affinity | 0.932 | 0.013 | 0.903 | 0.967 | 0.927 | 0.929 | 0.934 |
memb-4t-no-affinity | 0.941 | 0.014 | 0.839 | 0.967 | 0.943 | 0.938 | 0.944 |
memb-8t-no-affinity | 0.981 | 0.012 | 0.951 | 1.015 | 0.975 | 0.979 | 0.984 |
SET p99 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
memb-one-to-one-affinity | 5.553 | 0.285 | 4.799 | 6.943 | 5.535 | 5.497 | 5.609 |
memb-two-to-one-affinity | 4.830 | 0.355 | 4.063 | 6.143 | 4.831 | 4.761 | 4.900 |
memb-4t-no-affinity | 6.512 | 0.611 | 5.695 | 10.559 | 6.431 | 6.392 | 6.632 |
memb-8t-no-affinity | 7.008 | 1.004 | 5.471 | 11.327 | 6.815 | 6.811 | 7.205 |
GET p99 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
memb-one-to-one-affinity | 4.149 | 0.298 | 3.039 | 4.959 | 4.079 | 4.091 | 4.208 |
memb-two-to-one-affinity | 3.275 | 0.392 | 2.495 | 4.671 | 3.343 | 3.198 | 3.352 |
memb-4t-no-affinity | 4.266 | 0.357 | 3.807 | 6.335 | 4.223 | 4.196 | 4.336 |
memb-8t-no-affinity | 4.291 | 0.425 | 3.727 | 6.495 | 4.191 | 4.208 | 4.374 |
SET p99.9 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
memb-one-to-one-affinity | 9.043 | 0.838 | 7.519 | 14.719 | 8.959 | 8.879 | 9.207 |
memb-two-to-one-affinity | 8.051 | 1.040 | 6.399 | 14.015 | 7.887 | 7.848 | 8.255 |
memb-4t-no-affinity | 10.423 | 1.753 | 8.895 | 24.831 | 10.111 | 10.080 | 10.767 |
memb-8t-no-affinity | 12.050 | 1.328 | 9.599 | 17.279 | 11.871 | 11.789 | 12.310 |
GET p99.9 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
memb-one-to-one-affinity | 7.057 | 0.619 | 5.951 | 12.031 | 6.991 | 6.936 | 7.178 |
memb-two-to-one-affinity | 6.205 | 0.476 | 5.375 | 8.319 | 6.111 | 6.112 | 6.298 |
memb-4t-no-affinity | 7.097 | 0.834 | 6.207 | 14.079 | 6.943 | 6.934 | 7.261 |
memb-8t-no-affinity | 8.064 | 0.704 | 7.007 | 12.159 | 7.999 | 7.926 | 8.202 |
Below, for the two-threads-per-CPU and one-thread-per-CPU configurations, I take one median run from each for analysis. The results:
Two threads per CPU
One thread per CPU
SET Ops/sec
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-one-to-one-affinity | 1976.983 | 53.720 | 1858.990 | 2088.910 | 1976.620 | 1966.454 | 1987.512 |
qsbr-two-to-one-affinity | 1926.101 | 63.233 | 1761.320 | 2085.570 | 1919.775 | 1913.707 | 1938.494 |
qsbr-4t-no-affinity | 1911.112 | 36.840 | 1757.440 | 2006.030 | 1905.100 | 1903.892 | 1918.333 |
qsbr-8t-no-affinity | 1857.539 | 47.483 | 1687.700 | 1965.850 | 1853.320 | 1848.232 | 1866.845 |
GET Ops/sec
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-one-to-one-affinity | 196714.781 | 5345.271 | 184974.190 | 207851.710 | 196678.310 | 195667.108 | 197762.454 |
qsbr-two-to-one-affinity | 191651.843 | 6291.792 | 175255.650 | 207519.610 | 191022.075 | 190418.652 | 192885.034 |
qsbr-4t-no-affinity | 190160.503 | 3665.605 | 174869.310 | 199605.400 | 189562.675 | 189442.045 | 190878.962 |
qsbr-8t-no-affinity | 184829.809 | 4724.670 | 167930.680 | 195606.950 | 184409.640 | 183903.774 | 185755.844 |
SET Average Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-one-to-one-affinity | 1.296 | 0.049 | 1.231 | 1.514 | 1.286 | 1.287 | 1.306 |
qsbr-two-to-one-affinity | 1.332 | 0.037 | 1.273 | 1.515 | 1.328 | 1.325 | 1.340 |
qsbr-4t-no-affinity | 1.459 | 0.059 | 1.357 | 1.681 | 1.449 | 1.447 | 1.470 |
qsbr-8t-no-affinity | 1.407 | 0.061 | 1.316 | 1.699 | 1.395 | 1.395 | 1.419 |
GET Average Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-one-to-one-affinity | 1.015 | 0.020 | 0.987 | 1.085 | 1.007 | 1.011 | 1.019 |
qsbr-two-to-one-affinity | 1.036 | 0.018 | 1.015 | 1.120 | 1.030 | 1.032 | 1.039 |
qsbr-4t-no-affinity | 1.042 | 0.013 | 1.012 | 1.122 | 1.040 | 1.039 | 1.045 |
qsbr-8t-no-affinity | 1.072 | 0.022 | 1.040 | 1.165 | 1.079 | 1.068 | 1.077 |
SET p50 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-one-to-one-affinity | 1.001 | 0.033 | 0.975 | 1.191 | 0.991 | 0.994 | 1.007 |
qsbr-two-to-one-affinity | 1.089 | 0.060 | 1.023 | 1.279 | 1.063 | 1.077 | 1.100 |
qsbr-4t-no-affinity | 1.221 | 0.110 | 1.007 | 1.423 | 1.247 | 1.200 | 1.243 |
qsbr-8t-no-affinity | 1.054 | 0.046 | 1.015 | 1.295 | 1.047 | 1.045 | 1.063 |
GET p50 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-one-to-one-affinity | 0.949 | 0.010 | 0.927 | 0.975 | 0.943 | 0.946 | 0.951 |
qsbr-two-to-one-affinity | 0.965 | 0.013 | 0.935 | 0.999 | 0.959 | 0.962 | 0.967 |
qsbr-4t-no-affinity | 0.958 | 0.007 | 0.935 | 0.975 | 0.959 | 0.957 | 0.960 |
qsbr-8t-no-affinity | 0.993 | 0.012 | 0.967 | 1.015 | 0.999 | 0.990 | 0.995 |
SET p99 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-one-to-one-affinity | 4.837 | 0.464 | 4.223 | 7.135 | 4.831 | 4.746 | 4.928 |
qsbr-two-to-one-affinity | 5.069 | 0.327 | 4.223 | 6.623 | 5.055 | 5.005 | 5.133 |
qsbr-4t-no-affinity | 6.030 | 0.432 | 5.279 | 7.391 | 5.983 | 5.946 | 6.115 |
qsbr-8t-no-affinity | 6.230 | 0.840 | 5.183 | 9.407 | 5.919 | 6.065 | 6.394 |
GET p99 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-one-to-one-affinity | 3.555 | 0.530 | 2.719 | 5.151 | 3.559 | 3.451 | 3.659 |
qsbr-two-to-one-affinity | 3.775 | 0.337 | 2.767 | 4.927 | 3.735 | 3.709 | 3.841 |
qsbr-4t-no-affinity | 3.882 | 0.398 | 3.151 | 5.055 | 3.791 | 3.804 | 3.960 |
qsbr-8t-no-affinity | 4.004 | 0.402 | 3.391 | 5.855 | 3.903 | 3.925 | 4.082 |
SET p99.9 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-one-to-one-affinity | 7.569 | 1.141 | 6.079 | 13.823 | 7.391 | 7.345 | 7.793 |
qsbr-two-to-one-affinity | 8.388 | 0.826 | 6.975 | 13.375 | 8.255 | 8.226 | 8.550 |
qsbr-4t-no-affinity | 9.682 | 1.119 | 8.319 | 13.631 | 9.375 | 9.462 | 9.901 |
qsbr-8t-no-affinity | 11.010 | 1.314 | 8.511 | 15.743 | 10.975 | 10.753 | 11.268 |
GET p99.9 Latency
Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
qsbr-one-to-one-affinity | 5.984 | 0.847 | 5.151 | 10.367 | 5.759 | 5.818 | 6.150 |
qsbr-two-to-one-affinity | 6.854 | 0.480 | 5.727 | 9.535 | 6.863 | 6.760 | 6.948 |
qsbr-4t-no-affinity | 6.454 | 0.506 | 5.727 | 8.383 | 6.399 | 6.355 | 6.553 |
qsbr-8t-no-affinity | 7.671 | 0.698 | 6.591 | 11.135 | 7.455 | 7.534 | 7.808 |
Likewise, the two-threads-per-CPU and one-thread-per-CPU configurations are analyzed by taking one median run from each:
One thread per CPU
First, I was curious why Redis/Valkey uses a single thread.
I therefore read "The Engineering Wisdom Behind Redis's Single-Threaded Design" to understand Redis's design and the considerations behind it.
Summary of "The Engineering Wisdom Behind Redis's Single-Threaded Design"
Apply the results of yeh-sudo/mt-redis to Valkey, making sure userspace RCU delivers its benefit.