
Linux Kernel Project: A Highly Concurrent Valkey Implementation

Contributors: tsaiiuo, eleanorLYJ

TODO: Reproduce last year's experiments and understand the technical challenges of making Redis concurrent

2024 report
Reproduce the experiments and interpret the results
Discuss how Redis benefits cloud computing and networked service architectures
Explain why userspace RCU helps increase Redis's degree of concurrency
Analyze the factors that limit Redis concurrency
Summarize the integration of userspace RCU with Redis: which issues must be considered?
Evaluate the use and tuning of neco

Reading: introduce sys_membarrier(): process-wide memory barrier (v6)

To use the memb flavor, we first need to understand what the sys_membarrier() system call does. From its introduction:

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invokation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

sys_membarrier() sends an IPI (Inter-Processor Interrupt) to every currently running thread of the calling process, making each thread execute a genuine memory barrier on its own CPU; threads that are not running, and threads belonging to other processes, are unaffected.

To explain further, the patch gives an example:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

The two threads originally look like this:

Thread A                    Thread B
prev mem accesses           prev mem accesses
smp_mb()                    smp_mb()
follow mem accesses         follow mem accesses

With sys_membarrier() this can be rewritten as:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

Inside Thread B (the reader), the original smp_mb() can be replaced with a plain compiler barrier, which saves substantial cost. Thread A (the writer) replaces its smp_mb() with sys_membarrier(), which merely sends an IPI to every active thread in the process, forcing each of them to execute one full memory barrier.

After this change there are two scenarios:

  1. Thread A and Thread B do not run at the same time
Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case Thread B's accesses are only weakly ordered, which is fine: the two threads do not overlap in time, so Thread A does not need to order against them.

  2. Thread A and Thread B run at the same time
Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

Thread B's barrier() by itself only guarantees ordering within its own instruction stream; but since Thread A is concurrently calling sys_membarrier(), the kernel sends an IPI to every running thread, making each of them execute a hardware memory barrier on its CPU core.

This effectively "upgrades" B's compiler barrier into a real full smp_mb(), enforcing ordering across cores. Threads that are not running (descheduled by the scheduler) need to do nothing, because a context switch already implies a barrier.

Each non-running process threads are intrinsically serialized by the scheduler.
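
To make this concrete, here is a minimal user-space sketch in C of the reader/writer split described above. It uses the membarrier(2) interface that eventually landed in mainline, where the v6 patch's `int expedited` parameter became separate commands (constants from linux/membarrier.h); this is an illustration of the pattern only, not liburcu's actual implementation.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>   /* MEMBARRIER_CMD_* constants */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* glibc provides no wrapper for membarrier(2); invoke it directly. */
static int membarrier(int cmd, unsigned int flags)
{
    return syscall(__NR_membarrier, cmd, flags);
}

/* One-time setup: the private expedited command must be registered. */
static int membarrier_setup(void)
{
    if (membarrier(MEMBARRIER_CMD_QUERY, 0) < 0)
        return -1;              /* kernel lacks membarrier support */
    return membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
}

static _Atomic(int *) shared_ptr;   /* written by Thread A, read by Thread B */

/* Thread B (reader, hot path): a compiler barrier is enough, because
 * the writer forces a real memory barrier onto every running thread. */
static int reader(void)
{
    int *p = atomic_load_explicit(&shared_ptr, memory_order_relaxed);
    __asm__ __volatile__("" ::: "memory");      /* barrier() */
    return p ? *p : -1;
}

/* Thread A (writer, cold path): publish, then upgrade all readers'
 * compiler barriers into full barriers via an IPI-based, process-wide
 * memory barrier. */
static void writer(int *newp)
{
    atomic_store_explicit(&shared_ptr, newp, memory_order_relaxed);
    membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}
```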

The author also benchmarked the cost of sys_membarrier() in its expedited and non-expedited variants (on an Intel Xeon E5405):

Expedited membarrier

Number of calls: 10,000,000

Per-call overhead: about 2–3 microseconds

Non-expedited membarrier

Number of calls: 1,000

Total time: about 16 seconds → per-call overhead of about 16 milliseconds

Roughly 5,000–8,000 times slower than the expedited mode

Add "int expedited" parameter, use synchronize_sched() in the non-expedited
case. Thanks to Lai Jiangshan for making us consider seriously using
synchronize_sched() to provide the low-overhead membarrier scheme.
Check num_online_cpus() == 1, quickly return without doing nothing.

There is also a test of sys_membarrier()'s effectiveness:
Operations in 10s, 6 readers, 2 writers:

(what we previously had)
memory barriers in reader: 973494744 reads, 892368 writes
signal-based scheme: 6289946025 reads, 1251 writes

(what we have now, with dynamic sys_membarrier check, expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 4316818891 reads, 503790 writes

(dynamic sys_membarrier check, non-expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 8698725501 reads, 313 writes

A first observation: the dynamic sys_membarrier check adds overhead, lowering both read-side and write-side throughput by roughly 7%; since the results remain close to the baseline, the default configuration keeps the dynamic check enabled.

With sys_membarrier() enabled in expedited mode the numbers improve considerably: the read side only executes compiler barriers, so it far outperforms the original memory-barrier scheme, and the write side completes several hundred times more writes than the signal-based scheme, though it still trails the original mb scheme slightly because of the extra sys_membarrier() calls. In non-expedited mode the read-side throughput is the highest of all, because the write side no longer sends IPIs (it falls back to a lighter-weight membarrier); without IPI interference the readers speed up markedly. The cost falls on the write side: without IPIs it relies on CPU context switches (or a thread periodically running on each CPU), so it loses immediacy, and each sys_membarrier() call takes longer and is less predictable.
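
For reference, in today's kernels the expedited and non-expedited paths are selected by different membarrier(2) commands rather than by an `int expedited` argument; a short sketch of the two calls (mapping the patch's terminology onto the current command names is my reading of the final API, not something stated in this thread):

```c
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Non-expedited: no IPIs; waits on scheduler/RCU activity instead, so a
 * single call can take milliseconds, but readers run undisturbed. */
static long membarrier_global(void)
{
    return syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0);
}

/* Expedited: IPIs only the calling process's running threads; costs
 * microseconds per call, at the price of interrupting active readers. */
static long membarrier_expedited(void)
{
    return syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}
```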

mt-redis notes: https://hackmd.io/@tMeEIUCqQSCiXUaa-3tcsg/r11N0uJbll

Reproducing the experiments

The goal is to determine which URCU flavor suits Redis best.

Development environment

$ uname -a
Linux eleanor 6.11.0-24-generic #24~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 25 20:14:34 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   20
  On-line CPU(s) list:    0-19
Vendor ID:                GenuineIntel
  Model name:             12th Gen Intel(R) Core(TM) i7-12700
    CPU family:           6
    Model:                151
    Thread(s) per core:   2
    Core(s) per socket:   12
    Socket(s):            1
    Stepping:             2
    CPU(s) scaling MHz:   19%
    CPU max MHz:          4900.0000
    CPU min MHz:          800.0000
    BogoMIPS:             4224.00
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    512 KiB (12 instances)
  L1i:                    512 KiB (12 instances)
  L2:                     12 MiB (9 instances)
  L3:                     25 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-19

The experiments use the P-cores only, with taskset to avoid the E-cores. Hyper-Threading is also disabled.

The command that launches mt-redis therefore needs to change. The following binds the mt-redis server to the P-cores, i.e. logical CPU ids 0,2,4,6,8,10,12,14,16:

taskset -c 0,2,4,6,8,10,12,14,16 ./redis-server --port 6379 --appendonly no --save ""

--appendonly no disables AOF (Append Only File), so every write does not also go to disk
--save "" disables RDB snapshots, i.e. the periodic data-snapshot feature

The memtier_benchmark invocation:

taskset -c 17,18,19 \
        memtier_benchmark \
        -p 6379 \
        --random-data \
        --ratio=${RATIO} \
        --requests=20000 \
        --json-out-file="${OUTPUT_FILE}" \
        --hide-histogram 

Understanding memtier_benchmark

Without an explicit command, it tests the string type by default.

Last year's experiment design and results

Last year, Yeh compared five URCU flavors under different read/write ratios: 10:1, 100:1, and 1:1.
The test target was mt-redis; the same command was run 100 times, observing SET Ops/sec, SET Average Latency, SET p50/p99/p99.9 Latency, GET Ops/sec, GET Average Latency, and GET p50/p99/p99.9 Latency.
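
For reference, each table below aggregates the per-run numbers from the 100 runs into mean, standard deviation, min/max, median, and a 95% confidence interval. A minimal sketch of that aggregation in C; the 1.96 normal-approximation critical value is my assumption, as the report does not state the exact formula used:

```c
#include <math.h>
#include <stdio.h>

/* Summarize n per-run samples (e.g. SET Ops/sec from each run). */
static void summarize(const double *x, int n)
{
    double sum = 0.0, var = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i];
    double mean = sum / n;
    for (int i = 0; i < n; i++)
        var += (x[i] - mean) * (x[i] - mean);
    double sd = sqrt(var / (n - 1));        /* sample standard deviation */
    double half = 1.96 * sd / sqrt(n);      /* 95% CI half-width (assumed) */
    printf("mean=%.4f sd=%.4f 95%% CI=[%.4f, %.4f]\n",
           mean, sd, mean - half, mean + half);
}
```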

Last year's conclusion

In the read-heavy tests, signal-based RCU performed best, while in the 1:1 test its write performance was comparatively poor. In theory QSBR should be the performance leader, but it underperformed in the actual tests, possibly because the timing of entering the quiescent state needs adjustment.

However, I found that urcu-signal was deprecated in 2023! Commit aad674a is the last commit in the removal of urcu-signal, and its commit message recommends switching to urcu-memb, since urcu-memb obtains comparable read-side performance without reserving a signal and also has improved grace-period performance. A usage sketch of the recommended flavor follows.
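
As a quick reference for what adopting urcu-memb looks like in code, here is a minimal reader/updater sketch, assuming liburcu >= 0.11 (flavor-prefixed API from <urcu/urcu-memb.h>, linked with -lurcu-memb) and a single updater; mt-redis's real integration is of course more involved:

```c
#include <urcu/urcu-memb.h>   /* urcu_memb_* flavor API */
#include <urcu/pointer.h>     /* rcu_dereference(), rcu_assign_pointer() */
#include <stdlib.h>

struct conf { int value; };
static struct conf *global_conf;

/* Every thread that uses RCU must register itself once. */
static void thread_init(void) { urcu_memb_register_thread(); }
static void thread_fini(void) { urcu_memb_unregister_thread(); }

/* Read side: with urcu-memb this costs only compiler barriers plus a
 * little bookkeeping; no barrier instruction and no reserved signal. */
static int read_value(void)
{
    urcu_memb_read_lock();
    struct conf *c = rcu_dereference(global_conf);
    int v = c ? c->value : 0;
    urcu_memb_read_unlock();
    return v;
}

/* Update side (single updater assumed): publish the new version, then
 * wait for a grace period; this is where sys_membarrier() is issued. */
static void update_value(int v)
{
    struct conf *newc = malloc(sizeof(*newc));
    newc->value = v;
    struct conf *oldc = global_conf;
    rcu_assign_pointer(global_conf, newc);
    urcu_memb_synchronize_rcu();   /* all pre-existing readers are done */
    free(oldc);
}
```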

Read/write ratio 100:1: a cache-like scenario where reads far outnumber writes

Throughput

SET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| QSBR | 3161.1825 | 44.0917 | 3077.12 | 3395.76 | 3156.970 | 3152.3897 | 3169.9753 |
| BP | 3155.6116 | 35.5904 | 3076.91 | 3279.57 | 3152.895 | 3148.5141 | 3162.7091 |
| MEMB | 3157.9045 | 48.7512 | 3086.58 | 3347.93 | 3149.165 | 3148.1825 | 3167.6265 |
| MB | 3156.2084 | 36.6910 | 3098.75 | 3313.90 | 3152.155 | 3148.8914 | 3163.5254 |
| SIGNAL | 3147.1692 | 52.7197 | 2924.33 | 3470.32 | 3139.700 | 3136.6558 | 3157.6826 |

GET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| QSBR | 314545.6027 | 4387.2335 | 306181.56 | 337886.26 | 314126.64 | 313670.6949 | 315420.5105 |
| BP | 313991.2927 | 3541.3685 | 306159.83 | 326325.92 | 313721.055 | 313285.0684 | 314697.5170 |
| MEMB | 314219.4267 | 4850.8 | 307122.07 | 333127.1 | 313349.84 | 313252.0738 | 315186.7796 |
| MB | 314050.6865 | 3650.8393 | 308333.59 | 329741.69 | 313647.345 | 313322.6314 | 314778.7416 |
| SIGNAL | 313151.2636 | 5245.7102 | 290978.57 | 345305.32 | 312408.195 | 312105.1572 | 314197.3700 |

  • BP has the lowest standard deviation (3541)
  • MB has the second-lowest standard deviation and the best (highest) minimum
  • QSBR has the highest mean throughput

Average latency

SET Average Latency

GET Average Latency

p50 latency

SET p50 Latency

GET p50 Latency

p99 latency

SET p99 Latency

GET p99 Latency

p99.9 latency

SET p99.9 Latency

GET p99.9 Latency

Summary

Read/write ratio 10:1: a cache-like scenario where reads outnumber writes

Throughput

SET Ops/sec

comparison_set_ops_sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| QSBR | 28660.8337 | 320.453 | 28114.8 | 30779.23 | 28630.195 | 28596.9285 | 28724.7388 |
| BP | 28658.5042 | 315.7301 | 28105.65 | 29984.92 | 28611.525 | 28595.5409 | 28721.4675 |
| MEMB | 28630.9119 | 395.0763 | 27642.4 | 31215.68 | 28617.705 | 28552.1253 | 28709.6985 |
| MB | 28634.8009 | 468.6139 | 27521.34 | 31324.04 | 28564.03 | 28541.3493 | 28728.2525 |
| SIGNAL | 28613.3392 | 405.6762 | 26497.11 | 30684.58 | 28576.275 | 28532.4388 | 28694.2398 |

  • All flavors show very similar mean throughput (about 28,600 ops/sec)
  • QSBR and BP have the smallest standard deviations (about 320), i.e. the most stable results
  • SIGNAL's minimum (26497) is clearly lower than that of the other flavors

GET Ops/sec

comparison_get_ops_sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| QSBR | 286466.53 | 3202.9471 | 281008.89 | 307640.06 | 286160.30 | 285827.7931 | 287105.2649 |
| BP | 286443.24 | 3155.7368 | 280917.47 | 299700.80 | 285973.71 | 285813.9208 | 287072.5632 |
| MEMB | 286167.46 | 3948.8055 | 276287.26 | 312002.31 | 286035.46 | 285379.9879 | 286954.9399 |
| MB | 286206.33 | 4683.8179 | 275077.25 | 313085.39 | 285498.95 | 285272.2763 | 287140.3825 |
| SIGNAL | 285991.8171 | 4054.7599 | 264839.98 | 306694.00 | 285621.38 | 285183.2116 | 286800.4226 |

  • Again the throughput is highly similar across flavors (~286,000 ops/sec)
  • QSBR and BP again show the best stability
  • SIGNAL's minimum (264,840) is clearly low

Average latency

SET Average Latency

comparison_set_average_latency

GET Average Latency

comparison_get_average_latency

p50 latency

SET p50 Latency

comparison_set_p50_latency

GET p50 Latency

comparison_get_p50_latency

p99 latency

SET p99 Latency

comparison_set_p99_latency

GET p99 Latency

comparison_get_p99_latency

p99.9 latency

SET p99.9 Latency

comparison_set_p99_9_latency

GET p99.9 Latency

comparison_get_p99_9_latency

Summary

Do not use SIGNAL: it is clearly worse on several key metrics, especially tail latency.

Read/write ratio 1:1: a session-store-like scenario where reads and writes are roughly balanced

Throughput

SET Ops/sec

comparison_set_ops_sec

GET Ops/sec

comparison_get_ops_sec

Average latency

SET Average Latency

comparison_set_average_latency

GET Average Latency

comparison_get_average_latency

p50 latency

SET p50 Latency

comparison_set_p50_latency

GET p50 Latency

comparison_get_p50_latency

p99 latency

SET p99 Latency

comparison_get_p99_latency

GET p99 Latency

p99.9 latency

SET p99.9 Latency

comparison_set_p99_9_latency

GET p99.9 Latency

comparison_get_p99_9_latency

memtier_benchmark's pipeline parameter

taskset -c 11-15 memtier_benchmark --hide-histogram -p 6379 --pipeline=n --random-data -x 10

Effect of thread count

Development environment

$ uname -a
Linux iantsai-P30-F5 6.11.0-26-generic #26~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 17 19:20:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
    CPU family:           6
    Model:                94
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             3
    CPU(s) scaling MHz:   30%
    CPU max MHz:          4000.0000
    CPU min MHz:          800.0000
    BogoMIPS:             6799.81
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
                          ca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 s
                          s ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc 
                          art arch_perfmon pebs bts rep_good nopl xtopology nons
                          top_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor 
                          ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm p
                          cid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                          er xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpu
                          id_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexp
                          riority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 
                          smep bmi2 erms invpcid mpx rdseed adx smap clflushopt 
                          intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida ara
                          t pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi m
                          d_clear flush_l1d arch_capabilities
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     8 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7

Last year's result was that, without CPU affinity, the thread count should be set to 4. I therefore first reproduce last year's experiment and examine how the thread count affects throughput.
I set the write:read ratio to 1:100 and 1:10; material gathered online suggests that cache workloads mostly fall into these two patterns, so only these two scenarios are tested:

In real Memcached benchmark studies, about 43.9% of the tests use a SET:GET = 1:100 configuration
Google Cloud Memorystore notes that "By default, the utility issues 10 get commands for every set command."

The following compares 4 versus 8 threads for the qsbr and memb flavors, using:

memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:100 --requests=20000 --json-out-file=memb-2

Read/write ratio 100:1: a cache-like scenario where reads far outnumber writes

Throughput

SET Ops/sec

Set_Ops_sec

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-4t | 1911.112 | 36.840 | 1757.440 | 2006.030 | 1905.100 | 1903.892 | 1918.333 |
| qsbr-8t | 1857.539 | 47.483 | 1687.700 | 1965.850 | 1853.320 | 1848.232 | 1866.845 |
| memb-4t | 1904.987 | 48.352 | 1684.740 | 2037.310 | 1913.505 | 1895.510 | 1914.464 |
| memb-8t | 1843.536 | 42.711 | 1684.790 | 1949.790 | 1852.665 | 1835.164 | 1851.907 |

GET Ops/sec

Get_Ops_sec

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-4t | 190160.503 | 3665.605 | 174869.310 | 199605.400 | 189562.675 | 189442.045 | 190878.962 |
| qsbr-8t | 184829.809 | 4724.670 | 167930.680 | 195606.950 | 184409.640 | 183903.774 | 185755.844 |
| memb-4t | 189551.082 | 4811.135 | 167635.600 | 202717.860 | 190398.130 | 188608.099 | 190494.064 |
| memb-8t | 183436.420 | 4249.845 | 167640.890 | 194008.960 | 184344.715 | 182603.450 | 184269.389 |

Observations

  • Without CPU affinity, both the mean and the median throughput are higher with 4 threads than with 8 threads
  • qsbr performs better than memb on the mean, while memb beats qsbr on the median; the minimum and the standard deviation show that memb's throughput is less stable, hence its larger standard deviation

Average latency

SET Average Latency

Set_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-4t | 1.459 | 0.059 | 1.357 | 1.681 | 1.449 | 1.447 | 1.470 |
| qsbr-8t | 1.407 | 0.061 | 1.316 | 1.699 | 1.395 | 1.395 | 1.419 |
| memb-4t | 1.501 | 0.128 | 1.405 | 2.567 | 1.474 | 1.475 | 1.526 |
| memb-8t | 1.487 | 0.091 | 1.387 | 1.954 | 1.469 | 1.469 | 1.504 |

GET Average Latency

Get_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-4t | 1.042 | 0.013 | 1.012 | 1.122 | 1.040 | 1.039 | 1.045 |
| qsbr-8t | 1.072 | 0.022 | 1.040 | 1.165 | 1.079 | 1.068 | 1.077 |
| memb-4t | 1.043 | 0.025 | 1.018 | 1.187 | 1.035 | 1.038 | 1.048 |
| memb-8t | 1.078 | 0.022 | 1.052 | 1.162 | 1.069 | 1.073 | 1.082 |

p50 latency

SET p50 Latency

Set_p50_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-4t | 1.221 | 0.110 | 1.007 | 1.423 | 1.247 | 1.200 | 1.243 |
| qsbr-8t | 1.054 | 0.046 | 1.015 | 1.295 | 1.047 | 1.045 | 1.063 |
| memb-4t | 1.317 | 0.061 | 1.135 | 1.695 | 1.319 | 1.305 | 1.329 |
| memb-8t | 1.162 | 0.082 | 1.047 | 1.431 | 1.147 | 1.146 | 1.178 |

GET p50 Latency

Get_p50_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-4t | 0.958 | 0.007 | 0.935 | 0.975 | 0.959 | 0.957 | 0.960 |
| qsbr-8t | 0.993 | 0.012 | 0.967 | 1.015 | 0.999 | 0.990 | 0.995 |
| memb-4t | 0.941 | 0.014 | 0.839 | 0.967 | 0.943 | 0.938 | 0.944 |
| memb-8t | 0.981 | 0.012 | 0.951 | 1.015 | 0.975 | 0.979 | 0.984 |

Observations

  • With 8 threads, the SET latency is lower in both mean and median, while the GET latency rises slightly (the original text had the two operations swapped; the tables above show SET improving and GET degrading)
  • At p50, GET gains nothing from 8 threads either; its latency is higher than with 4 threads
  • qsbr's average latency is lower than memb's for both SET and GET operations

p99 latency

SET p99 Latency

Set_p99_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-4t | 6.030 | 0.432 | 5.279 | 7.391 | 5.983 | 5.946 | 6.115 |
| qsbr-8t | 6.230 | 0.840 | 5.183 | 9.407 | 5.919 | 6.065 | 6.394 |
| memb-4t | 6.512 | 0.611 | 5.695 | 10.559 | 6.431 | 6.392 | 6.632 |
| memb-8t | 7.008 | 1.004 | 5.471 | 11.327 | 6.815 | 6.811 | 7.205 |

GET p99 Latency

Get_p99_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-4t | 3.882 | 0.398 | 3.151 | 5.055 | 3.791 | 3.804 | 3.960 |
| qsbr-8t | 4.004 | 0.402 | 3.391 | 5.855 | 3.903 | 3.925 | 4.082 |
| memb-4t | 4.266 | 0.357 | 3.807 | 6.335 | 4.223 | 4.196 | 4.336 |
| memb-8t | 4.291 | 0.425 | 3.727 | 6.495 | 4.191 | 4.208 | 4.374 |

p99.9 latency

SET p99.9 Latency

Set_p99_9_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-4t | 9.682 | 1.119 | 8.319 | 13.631 | 9.375 | 9.462 | 9.901 |
| qsbr-8t | 11.010 | 1.314 | 8.511 | 15.743 | 10.975 | 10.753 | 11.268 |
| memb-4t | 10.423 | 1.753 | 8.895 | 24.831 | 10.111 | 10.080 | 10.767 |
| memb-8t | 12.050 | 1.328 | 9.599 | 17.279 | 11.871 | 11.789 | 12.310 |

GET p99.9 Latency

Get_p99_9_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-4t | 6.454 | 0.506 | 5.727 | 8.383 | 6.399 | 6.355 | 6.553 |
| qsbr-8t | 7.671 | 0.698 | 6.591 | 11.135 | 7.455 | 7.534 | 7.808 |
| memb-4t | 7.097 | 0.834 | 6.207 | 14.079 | 6.943 | 6.934 | 7.261 |
| memb-8t | 8.064 | 0.704 | 7.007 | 12.159 | 7.999 | 7.926 | 8.202 |

Observations

  • At p99 and p99.9, more threads lead to higher latency

Summary

  • With a 100:1 read:write ratio and no CPU affinity, 4 threads outperform 8 threads, in both latency and throughput.

Setting CPU affinity

Last year a reviewer asked: why does the CPU affinity bind two threads per CPU rather than one thread per CPU? The student running last year's project replied:

Because I tested one thread per CPU and the performance was worse than two threads per CPU; this arrangement also lets a thread about to do I/O rest through a context switch and yield the CPU to the other thread.

That claim was never backed by an experiment, so next I test which is better, two threads per CPU or one thread per CPU, and check whether last year's conclusion holds. The qsbr and memb flavors are examined separately; a thread-pinning sketch follows the benchmark command below.

memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:100 --requests=20000 --json-out-file=file_name
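
To realize the two-threads-per-CPU binding inside the server (rather than only via taskset on the whole process), each worker thread's affinity mask can be set at startup. A minimal sketch using the GNU pthread_setaffinity_np() extension; the thread count and CPU numbering here are illustrative, not mt-redis's actual code:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin threads so that threads 0,1 share CPU 0, threads 2,3 share
 * CPU 1, and so on: the two-threads-per-CPU configuration. */
static int pin_two_per_cpu(pthread_t thread, int tid)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(tid / 2, &set);      /* CPU id = thread index / 2 */
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

static void *worker(void *arg)
{
    (void)arg;
    /* ... event loop / command processing would run here ... */
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 8 };       /* illustrative thread count */
    pthread_t t[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        if (pthread_create(&t[i], NULL, worker, NULL) != 0)
            return 1;
        if (pin_two_per_cpu(t[i], i) != 0)
            fprintf(stderr, "affinity failed for thread %d\n", i);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

For the one-thread-per-CPU configuration, the mapping simply becomes CPU_SET(tid, &set).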

Read/write ratio 100:1 (cache-like scenario, reads far outnumber writes), memb flavor

Throughput

SET Ops/sec

Set_Ops_sec

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| memb-one-to-one-affinity | 1937.405 | 52.190 | 1823.790 | 2097.200 | 1930.500 | 1927.176 | 1947.634 |
| memb-two-to-one-affinity | 2018.488 | 72.174 | 1869.110 | 2211.830 | 2020.165 | 2004.342 | 2032.635 |
| memb-4t-no-affinity | 1904.987 | 48.352 | 1684.740 | 2037.310 | 1913.505 | 1895.510 | 1914.464 |
| memb-8t-no-affinity | 1843.536 | 42.711 | 1684.790 | 1949.790 | 1852.665 | 1835.164 | 1851.907 |

GET Ops/sec

Get_Ops_sec

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| memb-one-to-one-affinity | 192776.628 | 5193.027 | 181471.580 | 208676.540 | 192089.445 | 191758.795 | 193794.462 |
| memb-two-to-one-affinity | 200844.686 | 7181.510 | 185980.900 | 220082.210 | 201011.445 | 199437.110 | 202252.262 |
| memb-4t-no-affinity | 189551.082 | 4811.135 | 167635.600 | 202717.860 | 190398.130 | 188608.099 | 190494.064 |
| memb-8t-no-affinity | 183436.420 | 4249.845 | 167640.890 | 194008.960 | 184344.715 | 182603.450 | 184269.389 |

Average latency

SET Average Latency

Set_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| memb-one-to-one-affinity | 1.405 | 0.039 | 1.330 | 1.536 | 1.400 | 1.398 | 1.413 |
| memb-two-to-one-affinity | 1.332 | 0.037 | 1.278 | 1.524 | 1.326 | 1.325 | 1.339 |
| memb-4t-no-affinity | 1.501 | 0.128 | 1.405 | 2.567 | 1.474 | 1.475 | 1.526 |
| memb-8t-no-affinity | 1.487 | 0.091 | 1.387 | 1.954 | 1.469 | 1.469 | 1.504 |

GET Average Latency

Get_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| memb-one-to-one-affinity | 1.036 | 0.015 | 1.014 | 1.084 | 1.032 | 1.033 | 1.039 |
| memb-two-to-one-affinity | 1.006 | 0.016 | 0.978 | 1.087 | 1.002 | 1.002 | 1.009 |
| memb-4t-no-affinity | 1.043 | 0.025 | 1.018 | 1.187 | 1.035 | 1.038 | 1.048 |
| memb-8t-no-affinity | 1.078 | 0.022 | 1.052 | 1.162 | 1.069 | 1.073 | 1.082 |

p50 latency

SET p50 Latency

Set_p50_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| memb-one-to-one-affinity | 1.153 | 0.084 | 1.007 | 1.327 | 1.159 | 1.137 | 1.170 |
| memb-two-to-one-affinity | 1.249 | 0.024 | 1.191 | 1.359 | 1.247 | 1.244 | 1.254 |
| memb-4t-no-affinity | 1.317 | 0.061 | 1.135 | 1.695 | 1.319 | 1.305 | 1.329 |
| memb-8t-no-affinity | 1.162 | 0.082 | 1.047 | 1.431 | 1.147 | 1.146 | 1.178 |

GET p50 Latency

Get_p50_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| memb-one-to-one-affinity | 0.943 | 0.010 | 0.919 | 0.967 | 0.943 | 0.941 | 0.945 |
| memb-two-to-one-affinity | 0.932 | 0.013 | 0.903 | 0.967 | 0.927 | 0.929 | 0.934 |
| memb-4t-no-affinity | 0.941 | 0.014 | 0.839 | 0.967 | 0.943 | 0.938 | 0.944 |
| memb-8t-no-affinity | 0.981 | 0.012 | 0.951 | 1.015 | 0.975 | 0.979 | 0.984 |

p99 latency

SET p99 Latency

Set_p99_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| memb-one-to-one-affinity | 5.553 | 0.285 | 4.799 | 6.943 | 5.535 | 5.497 | 5.609 |
| memb-two-to-one-affinity | 4.830 | 0.355 | 4.063 | 6.143 | 4.831 | 4.761 | 4.900 |
| memb-4t-no-affinity | 6.512 | 0.611 | 5.695 | 10.559 | 6.431 | 6.392 | 6.632 |
| memb-8t-no-affinity | 7.008 | 1.004 | 5.471 | 11.327 | 6.815 | 6.811 | 7.205 |

GET p99 Latency

Get_p99_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| memb-one-to-one-affinity | 4.149 | 0.298 | 3.039 | 4.959 | 4.079 | 4.091 | 4.208 |
| memb-two-to-one-affinity | 3.275 | 0.392 | 2.495 | 4.671 | 3.343 | 3.198 | 3.352 |
| memb-4t-no-affinity | 4.266 | 0.357 | 3.807 | 6.335 | 4.223 | 4.196 | 4.336 |
| memb-8t-no-affinity | 4.291 | 0.425 | 3.727 | 6.495 | 4.191 | 4.208 | 4.374 |

p99.9 latency

SET p99.9 Latency

Set_p99_9_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| memb-one-to-one-affinity | 9.043 | 0.838 | 7.519 | 14.719 | 8.959 | 8.879 | 9.207 |
| memb-two-to-one-affinity | 8.051 | 1.040 | 6.399 | 14.015 | 7.887 | 7.848 | 8.255 |
| memb-4t-no-affinity | 10.423 | 1.753 | 8.895 | 24.831 | 10.111 | 10.080 | 10.767 |
| memb-8t-no-affinity | 12.050 | 1.328 | 9.599 | 17.279 | 11.871 | 11.789 | 12.310 |

GET p99.9 Latency

Get_p99_9_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| memb-one-to-one-affinity | 7.057 | 0.619 | 5.951 | 12.031 | 6.991 | 6.936 | 7.178 |
| memb-two-to-one-affinity | 6.205 | 0.476 | 5.375 | 8.319 | 6.111 | 6.112 | 6.298 |
| memb-4t-no-affinity | 7.097 | 0.834 | 6.207 | 14.079 | 6.943 | 6.934 | 7.261 |
| memb-8t-no-affinity | 8.064 | 0.704 | 7.007 | 12.159 | 7.999 | 7.926 | 8.202 |

Observing program behavior with perf

For the two-threads-per-CPU and one-thread-per-CPU configurations, I take one median run each for analysis. The results:

Two threads per CPU

 Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':

         71,780.04 msec task-clock                       #    1.653 CPUs utilized             
           560,343      context-switches                 #    7.806 K/sec                     
            14,189      cpu-migrations                   #  197.673 /sec                      
             2,025      page-faults                      #   28.211 /sec                      
   257,627,916,414      cycles                           #    3.589 GHz                         (39.99%)
   141,018,941,758      instructions                     #    0.55  insn per cycle              (50.00%)
    22,306,027,705      branches                         #  310.755 M/sec                       (49.92%)
       474,877,129      branch-misses                    #    2.13% of all branches             (49.98%)
    39,701,370,753      L1-dcache-loads                  #  553.098 M/sec                       (49.92%)
     3,265,404,792      L1-dcache-load-misses            #    8.22% of all L1-dcache accesses   (49.96%)
     1,107,813,317      LLC-loads                        #   15.433 M/sec                       (40.05%)
         6,557,980      LLC-load-misses                  #    0.59% of all LL-cache accesses    (40.09%)
    14,661,818,780      cache-references                 #  204.260 M/sec                       (40.08%)
       172,241,796      cache-misses                     #    1.17% of all cache refs           (40.10%)

      43.418281953 seconds time elapsed

      22.202794000 seconds user
      49.473298000 seconds sys

One thread per CPU

 Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':

         71,423.46 msec task-clock                       #    1.599 CPUs utilized             
           333,882      context-switches                 #    4.675 K/sec                     
            18,199      cpu-migrations                   #  254.804 /sec                      
             1,626      page-faults                      #   22.766 /sec                      
   254,850,020,885      cycles                           #    3.568 GHz                         (40.03%)
   136,686,768,652      instructions                     #    0.54  insn per cycle              (49.98%)
    21,480,788,799      branches                         #  300.753 M/sec                       (49.94%)
       467,898,042      branch-misses                    #    2.18% of all branches             (49.95%)
    38,495,992,811      L1-dcache-loads                  #  538.982 M/sec                       (50.05%)
     3,143,432,121      L1-dcache-load-misses            #    8.17% of all L1-dcache accesses   (49.97%)
     1,053,471,203      LLC-loads                        #   14.750 M/sec                       (40.05%)
         5,857,429      LLC-load-misses                  #    0.56% of all LL-cache accesses    (40.05%)
    14,337,925,549      cache-references                 #  200.745 M/sec                       (40.03%)
       161,865,927      cache-misses                     #    1.13% of all cache refs           (39.98%)

      44.655900137 seconds time elapsed

      22.039534000 seconds user
      49.579251000 seconds sys

Conclusions

  • With the memb flavor and CPU affinity binding two threads per CPU, the mean and median throughput and latency of both GET and SET operations are the best of the four configurations, and are roughly 10% better than the same 8 threads without CPU affinity. For memb, this matches last year's conclusion: two threads per CPU performs better.
  • The two-to-one configuration is also the best of the four at p99 and p99.9, i.e. it has the best tail latency.
  • With one thread per CPU, the number of context switches is about 40% lower than with two threads per CPU, but the number of CPU migrations is higher. A single CPU migration costs far more than a context switch, because it incurs L1/L2 cache misses (the caches must be warmed up again) and flushes the TLB; on the other hand, two threads alternating frequently on one CPU also cause cache conflicts that raise the cache-miss count.

Read/write ratio 100:1 (cache-like scenario, reads far outnumber writes), qsbr flavor

Throughput

SET Ops/sec

Set_Ops_sec

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-one-to-one-affinity | 1976.983 | 53.720 | 1858.990 | 2088.910 | 1976.620 | 1966.454 | 1987.512 |
| qsbr-two-to-one-affinity | 1926.101 | 63.233 | 1761.320 | 2085.570 | 1919.775 | 1913.707 | 1938.494 |
| qsbr-4t-no-affinity | 1911.112 | 36.840 | 1757.440 | 2006.030 | 1905.100 | 1903.892 | 1918.333 |
| qsbr-8t-no-affinity | 1857.539 | 47.483 | 1687.700 | 1965.850 | 1853.320 | 1848.232 | 1866.845 |

GET Ops/sec

Get_Ops_sec

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-one-to-one-affinity | 196714.781 | 5345.271 | 184974.190 | 207851.710 | 196678.310 | 195667.108 | 197762.454 |
| qsbr-two-to-one-affinity | 191651.843 | 6291.792 | 175255.650 | 207519.610 | 191022.075 | 190418.652 | 192885.034 |
| qsbr-4t-no-affinity | 190160.503 | 3665.605 | 174869.310 | 199605.400 | 189562.675 | 189442.045 | 190878.962 |
| qsbr-8t-no-affinity | 184829.809 | 4724.670 | 167930.680 | 195606.950 | 184409.640 | 183903.774 | 185755.844 |

Average latency

SET Average Latency

Set_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-one-to-one-affinity | 1.296 | 0.049 | 1.231 | 1.514 | 1.286 | 1.287 | 1.306 |
| qsbr-two-to-one-affinity | 1.332 | 0.037 | 1.273 | 1.515 | 1.328 | 1.325 | 1.340 |
| qsbr-4t-no-affinity | 1.459 | 0.059 | 1.357 | 1.681 | 1.449 | 1.447 | 1.470 |
| qsbr-8t-no-affinity | 1.407 | 0.061 | 1.316 | 1.699 | 1.395 | 1.395 | 1.419 |

GET Average Latency

Get_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-one-to-one-affinity | 1.015 | 0.020 | 0.987 | 1.085 | 1.007 | 1.011 | 1.019 |
| qsbr-two-to-one-affinity | 1.036 | 0.018 | 1.015 | 1.120 | 1.030 | 1.032 | 1.039 |
| qsbr-4t-no-affinity | 1.042 | 0.013 | 1.012 | 1.122 | 1.040 | 1.039 | 1.045 |
| qsbr-8t-no-affinity | 1.072 | 0.022 | 1.040 | 1.165 | 1.079 | 1.068 | 1.077 |

p50 latency

SET p50 Latency

Set_p50_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-one-to-one-affinity | 1.001 | 0.033 | 0.975 | 1.191 | 0.991 | 0.994 | 1.007 |
| qsbr-two-to-one-affinity | 1.089 | 0.060 | 1.023 | 1.279 | 1.063 | 1.077 | 1.100 |
| qsbr-4t-no-affinity | 1.221 | 0.110 | 1.007 | 1.423 | 1.247 | 1.200 | 1.243 |
| qsbr-8t-no-affinity | 1.054 | 0.046 | 1.015 | 1.295 | 1.047 | 1.045 | 1.063 |

GET p50 Latency

Get_p50_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-one-to-one-affinity | 0.949 | 0.010 | 0.927 | 0.975 | 0.943 | 0.946 | 0.951 |
| qsbr-two-to-one-affinity | 0.965 | 0.013 | 0.935 | 0.999 | 0.959 | 0.962 | 0.967 |
| qsbr-4t-no-affinity | 0.958 | 0.007 | 0.935 | 0.975 | 0.959 | 0.957 | 0.960 |
| qsbr-8t-no-affinity | 0.993 | 0.012 | 0.967 | 1.015 | 0.999 | 0.990 | 0.995 |

p99 latency

SET p99 Latency

Set_p99_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-one-to-one-affinity | 4.837 | 0.464 | 4.223 | 7.135 | 4.831 | 4.746 | 4.928 |
| qsbr-two-to-one-affinity | 5.069 | 0.327 | 4.223 | 6.623 | 5.055 | 5.005 | 5.133 |
| qsbr-4t-no-affinity | 6.030 | 0.432 | 5.279 | 7.391 | 5.983 | 5.946 | 6.115 |
| qsbr-8t-no-affinity | 6.230 | 0.840 | 5.183 | 9.407 | 5.919 | 6.065 | 6.394 |

GET p99 Latency

Get_p99_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-one-to-one-affinity | 3.555 | 0.530 | 2.719 | 5.151 | 3.559 | 3.451 | 3.659 |
| qsbr-two-to-one-affinity | 3.775 | 0.337 | 2.767 | 4.927 | 3.735 | 3.709 | 3.841 |
| qsbr-4t-no-affinity | 3.882 | 0.398 | 3.151 | 5.055 | 3.791 | 3.804 | 3.960 |
| qsbr-8t-no-affinity | 4.004 | 0.402 | 3.391 | 5.855 | 3.903 | 3.925 | 4.082 |

p99.9 latency

SET p99.9 Latency

Set_p99_9_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-one-to-one-affinity | 7.569 | 1.141 | 6.079 | 13.823 | 7.391 | 7.345 | 7.793 |
| qsbr-two-to-one-affinity | 8.388 | 0.826 | 6.975 | 13.375 | 8.255 | 8.226 | 8.550 |
| qsbr-4t-no-affinity | 9.682 | 1.119 | 8.319 | 13.631 | 9.375 | 9.462 | 9.901 |
| qsbr-8t-no-affinity | 11.010 | 1.314 | 8.511 | 15.743 | 10.975 | 10.753 | 11.268 |

GET p99.9 Latency

Get_p99_9_Latency

| Configuration | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| qsbr-one-to-one-affinity | 5.984 | 0.847 | 5.151 | 10.367 | 5.759 | 5.818 | 6.150 |
| qsbr-two-to-one-affinity | 6.854 | 0.480 | 5.727 | 9.535 | 6.863 | 6.760 | 6.948 |
| qsbr-4t-no-affinity | 6.454 | 0.506 | 5.727 | 8.383 | 6.399 | 6.355 | 6.553 |
| qsbr-8t-no-affinity | 7.671 | 0.698 | 6.591 | 11.135 | 7.455 | 7.534 | 7.808 |

Analyzing program behavior with perf

Again I compare two threads per CPU and one thread per CPU, taking one median run each. Two threads per CPU:

         70,013.67 msec task-clock                       #    2.401 CPUs utilized             
           741,501      context-switches                 #   10.591 K/sec                     
            15,581      cpu-migrations                   #  222.542 /sec                      
             3,627      page-faults                      #   51.804 /sec                      
   249,258,869,962      cycles                           #    3.560 GHz                         (39.99%)
   132,407,403,431      instructions                     #    0.53  insn per cycle              (49.98%)
    21,077,879,352      branches                         #  301.054 M/sec                       (50.03%)
       478,702,322      branch-misses                    #    2.27% of all branches             (50.05%)
    35,215,497,531      L1-dcache-loads                  #  502.980 M/sec                       (50.12%)
     3,281,644,729      L1-dcache-load-misses            #    9.32% of all L1-dcache accesses   (50.04%)
     1,129,746,159      LLC-loads                        #   16.136 M/sec                       (40.06%)
         7,744,776      LLC-load-misses                  #    0.69% of all LL-cache accesses    (39.95%)
    14,435,913,114      cache-references                 #  206.187 M/sec                       (39.92%)
       185,804,120      cache-misses                     #    1.29% of all cache refs           (39.95%)

      29.158730419 seconds time elapsed

      18.440255000 seconds user
      51.487935000 seconds sys

One thread per CPU

Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':

         67,887.06 msec task-clock                       #    1.507 CPUs utilized             
           375,487      context-switches                 #    5.531 K/sec                     
            19,709      cpu-migrations                   #  290.320 /sec                      
             3,209      page-faults                      #   47.270 /sec                      
   240,616,502,442      cycles                           #    3.544 GHz                         (40.06%)
   128,321,848,547      instructions                     #    0.53  insn per cycle              (50.05%)
    20,388,085,704      branches                         #  300.324 M/sec                       (50.00%)
       468,964,574      branch-misses                    #    2.30% of all branches             (50.05%)
    34,179,214,077      L1-dcache-loads                  #  503.472 M/sec                       (50.01%)
     3,167,434,498      L1-dcache-load-misses            #    9.27% of all L1-dcache accesses   (50.00%)
     1,059,289,040      LLC-loads                        #   15.604 M/sec                       (39.97%)
         5,442,227      LLC-load-misses                  #    0.51% of all LL-cache accesses    (39.98%)
    13,343,467,237      cache-references                 #  196.554 M/sec                       (39.96%)
       151,742,665      cache-misses                     #    1.14% of all cache refs           (40.02%)

      45.045660022 seconds time elapsed

      17.981155000 seconds user
      50.111441000 seconds sys

Conclusions

  • With the qsbr flavor and CPU affinity binding one thread per CPU, the mean and median throughput and latency of both GET and SET operations are the best, unlike last year's conclusion. The cause may be the higher cache-miss rate with two threads per CPU, or the qsbr mechanism itself; further experiments are needed (TODO)
  • It can also be observed that with one thread per CPU, although the mean and median of the p99.9 latency are the best, the maximum tail latency is also the largest, which may be related to the qsbr mechanism; further experiments are needed (TODO). A sketch of the qsbr read-side contract follows below.
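
One qsbr-specific factor worth checking in the follow-up experiments is where quiescent states are announced. Unlike memb, urcu-qsbr makes read-side sections essentially free, but every registered thread must periodically declare itself quiescent, or writers' grace periods stall. A minimal sketch of that contract against liburcu's urcu-qsbr flavor (the event-loop helpers are hypothetical stand-ins, not mt-redis code):

```c
#include <urcu/urcu-qsbr.h>   /* urcu_qsbr_* flavor API; link -lurcu-qsbr */

/* Hypothetical stand-ins for the server's event loop. */
static int  pending = 3;
static int  wait_for_work(void)  { return --pending; } /* stop after 3 rounds */
static void handle_request(void) { }

static void *worker_loop(void *arg)
{
    (void)arg;
    urcu_qsbr_register_thread();

    for (;;) {
        /* Read-side critical section: in the qsbr flavor, lock/unlock
         * compile to (almost) nothing. */
        urcu_qsbr_read_lock();
        handle_request();
        urcu_qsbr_read_unlock();

        /* Announce a quiescent state between requests; without this,
         * urcu_qsbr_synchronize_rcu() on the write side never returns. */
        urcu_qsbr_quiescent_state();

        /* Before blocking, go "offline" so a sleeping thread does not
         * delay grace periods either. */
        urcu_qsbr_thread_offline();
        int more = wait_for_work();
        urcu_qsbr_thread_online();
        if (more < 0)
            break;
    }

    urcu_qsbr_unregister_thread();
    return NULL;
}
```

Where these calls sit in the event loop directly determines grace-period latency, which is one concrete variable for the "further experiments" above.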

How to introduce neco

The single-threaded design

First, I was curious why Redis/Valkey uses a single thread, so I read The Engineering Wisdom Behind Redis's Single-Threaded Design to understand Redis's design and its considerations.

Notes on "The Engineering Wisdom Behind Redis's Single-Threaded Design"

Explain why userspace RCU helps increase Redis's degree of concurrency

Analyze the factors that limit Redis concurrency

TODO: Make Valkey capable of concurrent processing

Apply the results of yeh-sudo/mt-redis to Valkey, making sure userspace RCU delivers its benefit

Read Valkey's data structures