
Linux 核心專題: 高度並行的 Valkey 實作

執行人: tsaiiuo, eleanorLYJ
Github
專題解說影片

Q: 知道 p50、p99、p99.9 在不同 URCU flavor 下的不同表現後,能夠做什麼改善?

可以針對應用情境的需求做選擇。若應用對極端延遲敏感,即可選擇 p99、p99.9 這些高百分位數值較低的 URCU flavor;若應用以普遍反應時間為主,則可參考平均值與中位數 (p50) 做選擇。

Q: 如何解釋 URCU flavor 對於效能的影響?

可以先從各個 flavor 的底層實作解釋。以 qsbr flavor 為例,reader 端幾乎沒有同步成本,只需定期宣告 quiescent state,且宣告的頻率可以自行調整,藉此提升 reader 的效能;因此在讀寫比 100:1 與 10:1 的情境下,qsbr 的 GET 吞吐量與延遲表現最好。此外,不同的讀寫比搭配不同的 flavor 也會有不同的效能影響:在讀寫比 1:1 時,qsbr 的吞吐量與延遲反而比其他 flavor 差,可能是在讀寫高度交錯的情境中,qsbr 的 grace period 幾乎無法推進,導致寫入延遲暴增、整體系統吞吐量降低,測試中最久的 SET 延遲達到 888 ms,再次印證 URCU flavor 需要配合使用情境來選擇。
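以下以一個極簡的示意程式(假設使用 liburcu 的 QSBR flavor;函式與迴圈內容皆為示意,非 mt-redis 原始碼)說明上述行為:QSBR 的 read-side 幾乎零成本,但 reader 必須定期宣告 quiescent state,writer 的 synchronize_rcu() 才能推進 grace period;若讀寫高度交錯而 reader 很少宣告,寫入端就會被拖慢,對應 1:1 情境下 SET 延遲暴增的現象。

#include <urcu-qsbr.h>   /* liburcu 的 QSBR flavor */
#include <pthread.h>

/* reader:rcu_read_lock()/unlock() 在 QSBR 下近乎零成本,
 * 但必須定期呼叫 rcu_quiescent_state(),grace period 才能推進 */
static void *reader(void *arg)
{
    (void)arg;
    rcu_register_thread();            /* 會進入 read-side 的執行緒都要先註冊 */
    for (long i = 0; i < 1000000; i++) {
        rcu_read_lock();
        /* ... 讀取受 RCU 保護的資料結構 ... */
        rcu_read_unlock();
        if (i % 1024 == 0)
            rcu_quiescent_state();    /* 宣告目前不在 read-side critical section */
    }
    rcu_unregister_thread();
    return NULL;
}

/* writer:synchronize_rcu() 必須等所有已註冊的 reader 都經過 quiescent state */
static void *writer(void *arg)
{
    (void)arg;
    /* ... 以 rcu_assign_pointer() 發布新版本的資料 ... */
    synchronize_rcu();                /* 等待 grace period 結束 */
    /* ... 之後才能安全地釋放舊資料 ... */
    return NULL;
}

宣告 quiescent state 的頻率(此處的 1024 僅為示意)就是 QSBR 可調整之處:宣告得越少,read-side 越快,但 grace period 推進得越慢。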

Reviewed by MikazukiHikari

要怎麼評估電腦本身對於 throughput 的影響?因為看起來不同設備的 throughput 不一樣。

這本身是一個需要實驗的問題,有很多電腦本身的硬體效能(如 CPU 核心數、記憶體頻寬、I/O 裝置)與系統軟體(如排程器、記憶體管理)會直接影響延遲與吞吐量,而在此次實驗的兩種設備中最直觀的差別可能是 CPU 核心數以及有無 P-core。 tsaiiuo

Reviewed by Andrewtangtang

關於執行緒數量的影響有推測什麼原因導致 4 個執行緒比 8 個執行緒各項指標都表現比較好嗎?

根據 perf 的觀察,當執行緒數為 4 時,context-switches 的數量少於 8 個執行緒(約少了一半),但 cpu-migrations 的數量較高(約高了 1/4);從實驗結果可以推斷,context-switches 減少帶來的效益大於 cpu-migrations 增加的成本,因此 4 個執行緒的表現比較好。 tsaiiuo
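補充量測方式:上述的 context-switches 與 cpu-migrations 可用 perf stat 取得,命令大致如下(事件列表與綁核範圍僅為示意,完整輸出見後文的 perf 數據):

$ perf stat -e task-clock,context-switches,cpu-migrations,cache-misses \
      taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ""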

TODO: 重現去年實驗並理解 Redis 並行化的技術挑戰

2024 年報告
重現實驗並解讀實驗結果
探討 Redis 對於雲端運算和網路服務系統架構的幫助
解釋為何 userspace RCU 對於提高 Redis 並行程度有益
分析限制 Redis 並行的因素
彙整 userspace RCU 和 Redis 的整合,要顧及哪些議題?
評估 neco 的運用和調整

閱讀 introduce sys_membarrier(): process-wide memory barrier (v6)

要使用 memb flavor 就需要先了解 sys_membarrier() 系統呼叫的作用,先看它的介紹:

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invokation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

sys_membarrier() 會向該 process 所有執行中的 thread 發送 IPI (Inter-Processor Interrupt),讓這些 thread 在各自的 CPU 上執行一次真正的 memory barrier;至於沒有在執行中的 thread,或不屬於此 process 的 thread,則不會受到影響。
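現今核心提供的是 membarrier(2) 這個命令式介面(與這篇 patch 當年的 int expedited 參數不同,但概念相同)。以下是一個極簡的呼叫示意,假設核心支援 private expedited 命令;實際使用 memb flavor 時,這些細節由 liburcu 內部處理,應用程式不需要自己呼叫:

#define _GNU_SOURCE
#include <linux/membarrier.h>   /* MEMBARRIER_CMD_* 常數 */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static long membarrier(int cmd, unsigned int flags)
{
    return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
    /* 查詢核心支援哪些命令(回傳 bitmask) */
    long supported = membarrier(MEMBARRIER_CMD_QUERY, 0);
    if (supported < 0 || !(supported & MEMBARRIER_CMD_PRIVATE_EXPEDITED)) {
        fprintf(stderr, "membarrier not supported\n");
        return 1;
    }

    /* expedited 模式需要先註冊,之後才能使用 */
    membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);

    /* writer 端:對同一個 process 中所有執行中的執行緒發出 memory barrier */
    membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
    return 0;
}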

為了進一步解釋,文中給出一個例子:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

而這兩個 thread 的實作如下

Thread A                    Thread B
prev mem accesses           prev mem accesses
smp_mb()                    smp_mb()
follow mem accesses         follow mem accesses

在引用 sys_membarrier() 後則可以修改成

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

Thread B (reader) 內可以直接以編譯器屏障取代原先的 smp_mb(),省下許多成本;而 Thread A (writer) 則以 sys_membarrier() 取代 smp_mb(),對整個 process 裡所有活動中的 thread 發送 IPI,強制它們各自執行一次 full memory barrier。

而這樣更改後會有兩種情況:

  1. Thread A 與 Thread B 不同時執行
Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

這種情況下 Thread B 的存取只保證 weakly ordered,但這沒有問題:兩者執行的時間不重疊,Thread A 不需要與它排序。

  2. Thread A 與 Thread B 同時執行
Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

Thread B 在執行 barrier() 時只能保證自己程式內部順序;

但同時 Thread A 正在呼叫 sys_membarrier(),kernel 會發 IPI 給所有正在跑的 thread,讓它們在 CPU core 上都執行一次硬體 memory barrier。

這就會把 B 的 compiler barrier「升級」成真正的 full smp_mb(),在多 core 間強制排序。至於不在跑的 thread(被 scheduler deschedule),它們不需要做任何操作,因為 context switch 本身就隱含 barrier 效果。

Each non-running process threads are intrinsically serialized by the scheduler.

作者這裡還有做一個效能實驗,測試 Expedited membarrier 和 Non-expedited membarrier 針對 sys_membarrier() 的效能差異 (在 Intel Xeon E5405 上執行)

Expedited membarrier

  • 呼叫次數:10,000,000 次
  • 單次開銷:約 2–3 微秒

Non-expedited membarrier

  • 呼叫次數:1,000 次
  • 總耗時:約 16 秒 → 單次開銷約 16 毫秒
  • 約為 expedited 模式的 5,000–8,000 倍

Add "int expedited" parameter, use synchronize_sched() in the non-expedited
case. Thanks to Lai Jiangshan for making us consider seriously using
synchronize_sched() to provide the low-overhead membarrier scheme.
Check num_online_cpus() == 1, quickly return without doing nothing.

此外還有針對 sys_membarrier() 的有效性進行測試
Operations in 10s, 6 readers, 2 writers:

(what we previously had)
memory barriers in reader: 973494744 reads, 892368 writes
signal-based scheme: 6289946025 reads, 1251 writes

(what we have now, with dynamic sys_membarrier check, expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 4316818891 reads, 503790 writes

(dynamic sys_membarrier check, non-expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 8698725501 reads, 313 writes

可以先觀察到,動態 sys_membarrier 檢測會帶來額外的成本,讓讀取與寫入端的 throughput 皆降低大約 7%,但大致上仍在同一個基準線上,因此目前的預設機制仍啟用動態檢測。
啟用 sys_membarrier() 後,在 expedited 模式下效能提昇許多:讀取端只做編譯器屏障,因此效能遠比原始的 mb scheme 好;在 writer 端,相比於 signal-based scheme 提昇了約 500 倍,但仍比原始的 mb scheme 差一點,因為需要額外的 sys_membarrier() 呼叫。而在 non-expedited 模式下,讀取端的 throughput 最高,因為寫入端不發送 IPI(使用更輕量的 membarrier fallback),少了 IPI 的干擾後讀取端的 throughput 顯著提昇;但寫入端沒有 IPI,主要依賴 CPU 的 context switch,或由執行緒定期喚醒各個 CPU,因此缺乏即時性,每次 sys_membarrier() 的時間較長也較不可預期。

RCU 與並行化程式閱讀筆記: https://hackmd.io/@tMeEIUCqQSCiXUaa-3tcsg/r11N0uJbll/edit

RCU 的 callback 的作用?

如何作用

callback 機制就是當滿足預設條件後才會呼叫執行的 function,主要分為兩個階段:

  1. 註冊 (registration):先把要呼叫的函式 (callback) 登記到某個框架或資料結構裡,例如:
call_rcu(&p->rcu_head, reclaim_callback);
  2. 觸發 (invocation):當某個預設條件 (event) 發生時,系統/框架才會真正呼叫先前註冊的 callback

而 RCU 的 callback 主要用來 reclaim 資源,又分為單次 reclaim 與批量 reclaim;單次 reclaim 的流程是當所有 thread 都經過 quiescent state(即 grace period 結束)後,呼叫先前註冊的 callback,示意圖如下:

(原筆記中的單次 reclaim 示意圖,此處無法顯示)
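以下補上一個最小的 call_rcu 使用示意(結構與函式名稱為假設,非 mt-redis 原始碼;標頭與連結選項視 liburcu 版本而定),對應上述「註冊、等 grace period 結束才觸發」的兩個階段:

#include <urcu.h>      /* 以 liburcu 預設 flavor 為例,其他 flavor 介面類似 */
#include <stdlib.h>

struct entry {
    int value;
    struct rcu_head rcu_head;   /* 供 call_rcu 掛載 callback 用 */
};

/* callback:在 grace period 結束、確定沒有 reader 還看得到舊節點後才被呼叫 */
static void reclaim_callback(struct rcu_head *head)
{
    struct entry *e = caa_container_of(head, struct entry, rcu_head);
    free(e);
}

static void remove_entry(struct entry *old)
{
    /* 1. 先把節點從資料結構中移除,之後的 reader 不會再看到它 */
    /* 2. 註冊 callback(registration),不需同步等待 grace period */
    call_rcu(&old->rcu_head, reclaim_callback);
}

與 synchronize_rcu() 相比,call_rcu() 不會阻塞寫入端;回收工作(invocation)由 liburcu 的 call_rcu 執行緒在 grace period 結束後批次處理,也就是上面提到的批量 reclaim。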

重現實驗

目的在於討論哪一個 URCU flavor 最適合 Redis。

開發環境

$ uname -a
Linux eleanor 6.11.0-24-generic #24~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 25 20:14:34 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   20
  On-line CPU(s) list:    0-19
Vendor ID:                GenuineIntel
  Model name:             12th Gen Intel(R) Core(TM) i7-12700
    CPU family:           6
    Model:                151
    Thread(s) per core:   2
    Core(s) per socket:   12
    Socket(s):            1
    Stepping:             2
    CPU(s) scaling MHz:   19%
    CPU max MHz:          4900.0000
    CPU min MHz:          800.0000
    BogoMIPS:             4224.00
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    512 KiB (12 instances)
  L1i:                    512 KiB (12 instances)
  L2:                     12 MiB (9 instances)
  L3:                     25 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-19

實驗將使用 P-core,並使用 taskset 避開 E-core,另外也關閉 Hyperthreading。

因此啟用 mt-redis 的命令需要更改,以下是我將 mt-redis 伺服器綁定到 P-core(logical CPU id 為 0,2,4,6,8,10,12,14,16)的命令:

taskset -c 0,2,4,6,8,10,12,14,16 ./redis-server --port 6379 --appendonly no --save ""

--appendonly no 關閉 AOF (Append Only File),避免每個寫入都寫到磁碟
--save "" 關閉 RDB 快照功能,也就是關閉定時儲存資料快照的功能

以下是我使用 memtier_benchmark 的命令

taskset -c 17,18,19 \
        memtier_benchmark \
        -p 6379 \
        --random-data \
        --ratio=${RATIO} \
        --requests=20000 \
        --json-out-file="${OUTPUT_FILE}" \
        --hide-histogram 

理解 memtier_benchmark

沒有下特別的 command 參數時,預設是針對 string 這個型別做測試。

去年實驗設計與結果

去年葉同學比較五種 URCU flavor 在不同讀寫比例下的表現,讀寫比例分別為 10:1, 100:1 與 1:1。
測試對象為 mt-redis,並使用同樣命令執行 100 次,並觀察 SET Ops/sec, SET Average Latency, SET p50 Latency, SET p99 Latency, SET p99.9 Latency, GET Ops/sec, GET Average Latency, GET p50 Latency, GET p99 Latency, GET p99.9 Latency

去年的結論

在讀取大於寫入的測試中,Signal-based RCU 表現最好,而在 1:1 的測試中,Signal-based RCU 的寫入效能較差。理論上 QSBR 在效能上應該是首選,但在實際測試中表現不如預期,可能需要調整進入 quiescent state 的時機

但我發現在 2023 年,urcu-signal 已被視為 deprecated!Commit aad674a 為移除 urcu-signal 的最後一個 commit,根據此 commit message,建議改用 urcu-memb,因為 urcu-memb 無需保留 signal 即可獲得類似的 read-side 效能,而且寬限期 (grace period) 的效能也有所改善。

讀寫比例 100 : 1 模擬 Cache 情境,讀取遠多於寫入

吞吐量

SET Ops/sec

comparison_set_ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 3044.210 124.389 2779.570 3535.730 3039.250 3019.404 3069.015
BP 3084.212 132.025 2843.330 3441.970 3066.740 3057.884 3110.541
MEMB 3052.686 142.048 2811.680 3498.170 3031.835 3024.359 3081.014
MB 3062.779 141.894 2864.770 3609.120 3041.210 3034.482 3091.076
SIGNAL 3054.204 126.075 2808.940 3447.240 3024.835 3029.062 3079.346

GET Ops/sec

comparison_get_ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 302906.517 12377.009 276574.670 351814.290 302413.045 300438.278 305374.756
BP 306886.910 13136.841 282918.290 342484.500 305148.375 304267.144 309506.676
MEMB 303749.942 14134.116 279768.790 348076.500 301675.285 300931.298 306568.586
MB 304754.178 14118.777 285051.500 359116.510 302608.090 301938.593 307569.762
SIGNAL 303900.933 12544.743 279496.530 343009.320 300978.660 301399.244 306402.622
平均延遲

SET Average Latency

comparison_set_average_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 0.665 0.015 0.632 0.705 0.664 0.662 0.668
BP 0.665 0.015 0.620 0.699 0.665 0.662 0.668
MEMB 0.666 0.014 0.635 0.709 0.665 0.663 0.669
MB 0.667 0.015 0.629 0.712 0.664 0.664 0.670
SIGNAL 0.666 0.013 0.630 0.691 0.667 0.663 0.669

GET Average Latency

comparison_get_average_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 0.649 0.007 0.631 0.666 0.649 0.647 0.650
BP 0.648 0.008 0.628 0.672 0.648 0.646 0.649
MEMB 0.649 0.007 0.626 0.669 0.648 0.647 0.650
MB 0.648 0.008 0.624 0.664 0.648 0.646 0.649
SIGNAL 0.648 0.008 0.627 0.666 0.648 0.647 0.650
p50 延遲

SET p50 Latency

comparison_set_p50_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 0.499 0.004 0.487 0.511 0.495 0.498 0.499
BP 0.499 0.005 0.487 0.511 0.495 0.498 0.500
MEMB 0.500 0.005 0.495 0.511 0.503 0.499 0.501
MB 0.498 0.005 0.487 0.511 0.495 0.498 0.499
SIGNAL 0.499 0.005 0.487 0.511 0.495 0.498 0.500

GET p50 Latency

comparison_get_p50_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 0.497 0.004 0.487 0.503 0.495 0.496 0.498
BP 0.497 0.005 0.487 0.511 0.495 0.496 0.498
MEMB 0.497 0.004 0.487 0.503 0.495 0.497 0.498
MB 0.496 0.005 0.487 0.511 0.495 0.495 0.497
SIGNAL 0.497 0.005 0.487 0.503 0.495 0.496 0.498
p99 延遲

SET p99 Latency

comparison_set_p99_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 4.524 0.016 4.511 4.543 4.511 4.521 4.527
BP 4.526 0.016 4.511 4.543 4.511 4.523 4.529
MEMB 4.528 0.026 4.511 4.735 4.511 4.523 4.533
MB 4.526 0.029 4.511 4.767 4.511 4.520 4.531
SIGNAL 4.526 0.016 4.511 4.543 4.511 4.523 4.529

GET p99 Latency

comparison_get_p99_9_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 4.516 0.012 4.511 4.543 4.511 4.514 4.518
BP 4.515 0.010 4.511 4.543 4.511 4.513 4.517
MEMB 4.515 0.011 4.511 4.543 4.511 4.513 4.518
MB 4.515 0.010 4.511 4.543 4.511 4.513 4.517
SIGNAL 4.517 0.013 4.511 4.543 4.511 4.515 4.520
p99.9 延遲

SET p99.9 Latency

comparison_set_p99_9_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 7.518 0.107 6.527 7.775 7.519 7.497 7.540
BP 7.614 0.487 7.519 10.559 7.519 7.517 7.711
MEMB 7.555 0.209 7.519 9.151 7.519 7.513 7.597
MB 7.593 0.390 7.519 10.687 7.519 7.516 7.671
SIGNAL 7.669 0.678 7.519 12.415 7.519 7.534 7.804

GET p99.9 Latency

comparison_get_p99_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 7.519 0.000 7.519 7.519 7.519 7.519 7.519
BP 7.519 0.003 7.519 7.551 7.519 7.519 7.520
MEMB 7.519 0.000 7.519 7.519 7.519 7.519 7.519
MB 7.519 0.000 7.519 7.519 7.519 7.519 7.519
SIGNAL 7.519 0.003 7.519 7.551 7.519 7.519 7.520
小結

所有 Flavor 的 GET 吞吐量與延遲性能都非常接近,其中 GET 吞吐量 BP 略微領先。
SIGNAL 的 SET p99.9 延遲表現最差,延遲最高且標準差最大,說明其寫操作的尾部延遲較不穩定。而 QSBR 的 SET p99.9 延遲表現最好,這表明 QSBR 在寫操作頻率極低時,其清理機制可能表現更優。

讀寫比例 10:1:模擬 Cache 情境,讀取較多於寫入

吞吐量

SET Ops/sec

comparison_set_ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 28660.834 320.453 28114.800 30779.230 28630.195 28596.929 28724.739
BP 28658.504 315.730 28105.650 29984.920 28611.525 28595.541 28721.468
MEMB 28630.912 395.076 27642.400 31215.680 28617.705 28552.125 28709.699
MB 28634.801 468.614 27521.340 31324.040 28564.030 28541.349 28728.253
SIGNAL 28613.339 405.676 26497.110 30684.580 28576.275 28532.439 28694.240

GET Ops/sec

comparison_get_ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 286466.529 3202.947 281008.890 307640.060 286160.300 285827.793 287105.265
BP 286443.242 3155.737 280917.470 299700.800 285973.705 285813.921 287072.563
MEMB 286167.464 3948.806 276287.260 312002.310 286035.460 285379.988 286954.940
MB 286206.329 4683.818 275077.250 313085.390 285498.950 285272.276 287140.382
SIGNAL 285991.817 4054.760 264839.980 306694.000 285621.375 285183.212 286800.423
平均延遲

SET Average Latency

comparison_set_average_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 0.635 0.002 0.630 0.641 0.635 0.634 0.635
BP 0.635 0.002 0.630 0.639 0.635 0.634 0.635
MEMB 0.635 0.002 0.630 0.640 0.635 0.635 0.635
MB 0.635 0.002 0.630 0.641 0.635 0.634 0.635
SIGNAL 0.635 0.002 0.630 0.643 0.635 0.635 0.636

GET Average Latency

comparison_get_average_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 0.635 0.002 0.630 0.640 0.634 0.634 0.635
BP 0.634 0.002 0.629 0.639 0.634 0.634 0.635
MEMB 0.635 0.002 0.630 0.639 0.635 0.634 0.635
MB 0.635 0.002 0.630 0.641 0.635 0.634 0.635
SIGNAL 0.635 0.002 0.629 0.643 0.635 0.635 0.635
p50 延遲

SET p50 Latency

comparison_set_p50_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 0.497 0.006 0.487 0.511 0.495 0.496 0.498
BP 0.497 0.005 0.487 0.511 0.495 0.497 0.498
MEMB 0.498 0.005 0.487 0.503 0.495 0.497 0.499
MB 0.498 0.005 0.487 0.511 0.495 0.497 0.499
SIGNAL 0.499 0.004 0.487 0.511 0.503 0.499 0.500

GET p50 Latency

comparison_get_p50_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 0.497 0.006 0.487 0.511 0.495 0.496 0.498
BP 0.497 0.005 0.487 0.511 0.495 0.497 0.498
MEMB 0.498 0.005 0.487 0.503 0.495 0.497 0.499
MB 0.498 0.005 0.487 0.511 0.495 0.497 0.499
SIGNAL 0.499 0.004 0.487 0.511 0.503 0.498 0.500
p99 延遲

SET p99 Latency

comparison_set_p99_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 1.384 0.021 1.359 1.495 1.383 1.379 1.388
BP 1.383 0.018 1.359 1.447 1.375 1.380 1.387
MEMB 1.385 0.018 1.359 1.439 1.383 1.381 1.388
MB 1.385 0.019 1.351 1.455 1.383 1.381 1.389
SIGNAL 1.425 0.091 1.359 1.959 1.391 1.407 1.443

GET p99 Latency

comparison_get_p99_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 1.383 0.020 1.359 1.487 1.375 1.379 1.387
BP 1.383 0.018 1.359 1.439 1.375 1.380 1.387
MEMB 1.385 0.017 1.359 1.439 1.383 1.381 1.388
MB 1.385 0.019 1.351 1.455 1.383 1.381 1.389
SIGNAL 1.426 0.091 1.359 1.935 1.391 1.408 1.444
p99.9 延遲

SET p99.9 Latency

comparison_set_p99_9_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 2.851 0.290 2.271 4.319 2.807 2.793 2.909
BP 2.822 0.248 2.367 3.503 2.799 2.772 2.871
MEMB 2.810 0.278 2.351 3.839 2.767 2.754 2.865
MB 2.824 0.246 2.415 3.983 2.815 2.775 2.873
SIGNAL 2.919 0.442 2.351 4.351 2.815 2.831 3.007

GET p99.9 Latency

comparison_get_p99_9_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 2.853 0.286 2.351 4.319 2.815 2.796 2.910
BP 2.819 0.237 2.431 3.471 2.799 2.772 2.866
MEMB 2.810 0.274 2.303 3.743 2.783 2.755 2.865
MB 2.830 0.239 2.399 3.903 2.815 2.782 2.878
SIGNAL 2.929 0.441 2.367 4.383 2.815 2.841 3.017
小結

在讀多寫少的場景下,QSBR 的吞吐量最佳、BP 次之。然而,QSBR 在高百分位的 GET 與 SET 延遲上都較不穩定;MEMB 和 MB 的表現很接近。
在所有 flavor 中,不建議使用 SIGNAL,因為它在多個吞吐量與延遲指標上表現明顯最差,尤其是極端情況(p99、p99.9)下的延遲。

讀寫比例 1:1:模擬 Session Store 情境,讀取與寫入比例相近

吞吐量

SET Ops/sec

comparison_set_ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 133086.415 46847.384 0.000 163293.610 148770.615 123744.048 142428.782
BP 155610.583 2395.181 152699.060 170369.850 155087.675 155132.933 156088.233
MEMB 155501.622 1163.124 153110.860 159132.270 155344.040 155269.671 155733.574
MB 155821.162 2447.141 143412.070 168457.980 155811.510 155333.149 156309.174
SIGNAL 156319.059 2753.169 153813.820 169675.580 155630.615 155770.018 156868.099

GET Ops/sec

comparison_get_ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 133086.415 46847.384 0.000 163293.610 148770.615 123744.048 142428.782
BP 155610.583 2395.181 152699.060 170369.850 155087.675 155132.933 156088.233
MEMB 155501.622 1163.124 153110.860 159132.270 155344.040 155269.671 155733.574
MB 155821.162 2447.141 143412.070 168457.980 155811.510 155333.149 156309.174
SIGNAL 156319.059 2753.169 153813.820 169675.580 155630.615 155770.018 156868.099
平均延遲

SET Average Latency

comparison_set_average_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 9.730 88.360 0.000 888.601 0.677 -7.891 27.351
BP 0.642 0.002 0.637 0.647 0.642 0.641 0.642
MEMB 0.642 0.002 0.638 0.646 0.642 0.642 0.642
MB 0.642 0.002 0.636 0.646 0.642 0.641 0.642
SIGNAL 0.641 0.001 0.638 0.645 0.642 0.641 0.642

GET Average Latency

comparison_get_average_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 2.678 19.652 0.000 197.910 0.655 -1.242 6.597
BP 0.641 0.002 0.637 0.647 0.642 0.641 0.642
MEMB 0.642 0.002 0.638 0.646 0.642 0.641 0.642
MB 0.641 0.002 0.636 0.646 0.641 0.641 0.642
SIGNAL 0.641 0.001 0.638 0.644 0.641 0.641 0.641
p50 延遲

SET p50 Latency

comparison_set_p50_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 0.455 0.141 0.007 0.511 0.503 0.427 0.483
BP 0.507 0.005 0.495 0.519 0.511 0.506 0.508
MEMB 0.509 0.004 0.495 0.519 0.511 0.508 0.510
MB 0.507 0.005 0.495 0.519 0.507 0.506 0.508
SIGNAL 0.508 0.005 0.495 0.519 0.511 0.507 0.509

GET p50 Latency

comparison_get_p50_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 0.454 0.141 0.007 0.503 0.495 0.426 0.482
BP 0.507 0.005 0.495 0.519 0.507 0.506 0.508
MEMB 0.508 0.005 0.495 0.519 0.511 0.507 0.509
MB 0.506 0.005 0.495 0.519 0.503 0.505 0.507
SIGNAL 0.507 0.005 0.495 0.519 0.511 0.506 0.508
p99 延遲

SET p99 Latency

comparison_set_p99_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 14.427 101.888 0.007 1028.095 4.543 -5.892 34.746
BP 1.451 0.039 1.375 1.527 1.455 1.443 1.459
MEMB 1.463 0.040 1.383 1.543 1.467 1.455 1.471
MB 1.455 0.045 1.383 1.543 1.455 1.447 1.464
SIGNAL 1.455 0.039 1.383 1.527 1.455 1.448 1.463

GET p99 Latency

comparison_get_p99_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 5.214 10.926 0.007 113.151 4.511 3.035 7.393
BP 1.450 0.040 1.375 1.527 1.451 1.442 1.457
MEMB 1.462 0.040 1.383 1.543 1.471 1.454 1.470
MB 1.455 0.046 1.383 1.543 1.455 1.446 1.464
SIGNAL 1.454 0.038 1.383 1.527 1.455 1.446 1.462
p99.9 延遲

SET p99.9 Latency

comparison_set_p99_9_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 1824.150 17529.922 0.007 176160.767 7.759 -1671.690 5319.990
BP 2.726 0.191 2.303 3.295 2.687 2.688 2.764
MEMB 2.690 0.212 2.351 3.343 2.655 2.648 2.732
MB 2.733 0.229 2.271 3.567 2.743 2.688 2.779
SIGNAL 2.685 0.201 2.303 3.247 2.639 2.645 2.725

GET p99.9 Latency

comparison_get_p99_9_latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
QSBR 124.821 901.113 0.007 8454.143 7.519 -54.880 304.523
BP 2.726 0.187 2.319 3.279 2.703 2.688 2.763
MEMB 2.689 0.215 2.303 3.359 2.655 2.646 2.732
MB 2.734 0.227 2.319 3.519 2.735 2.689 2.780
SIGNAL 2.685 0.204 2.303 3.247 2.655 2.644 2.725

小結

在讀寫均衡的場景下,BP、MEMB、MB、SIGNAL 這四種 URCU flavor 的表現都相對穩定且良好;而 QSBR 表現極差,不適用於此比例:QSBR 的吞吐量較低且波動大(標準差大)、平均延遲大,且 95% 信賴區間過寬(甚至出現負的 CI 下限),表示結果較不可靠,可能是記憶體回收阻塞,或是測量方式有誤。

藉由不同讀寫比例,分析不同 flavor

QSBR 在讀寫比例 1:1 的情況下性能表現極不穩定,而在讀寫比例 100:1 的極端讀多場景下,其 SET p99.9 延遲表現反而最佳;因此 QSBR 適合讀取密集、且對寫入延遲波動有較高容忍度的情境。

BP, MEMB, MB 的穩定性: 這三種 URCU 在所有讀寫比例下都表現出相對穩定的性能,其中 BP 綜合表現最好。

SIGNAL 在讀操作的平均性能上與其他 flavor 接近,但在 10:1 和 100:1 讀寫比例下,其高百分位延遲(尤其是 p99.9)明顯較差。

執行緒數量的影響

開發環境

$ uname -a
Linux iantsai-P30-F5 6.11.0-26-generic #26~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 17 19:20:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
    CPU family:           6
    Model:                94
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             3
    CPU(s) scaling MHz:   30%
    CPU max MHz:          4000.0000
    CPU min MHz:          800.0000
    BogoMIPS:             6799.81
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
                          ca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 s
                          s ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc 
                          art arch_perfmon pebs bts rep_good nopl xtopology nons
                          top_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor 
                          ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm p
                          cid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                          er xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpu
                          id_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexp
                          riority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 
                          smep bmi2 erms invpcid mpx rdseed adx smap clflushopt 
                          intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida ara
                          t pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi m
                          d_clear flush_l1d arch_capabilities
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     8 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7

由於 CPU 較舊,整體多執行緒的吞吐量與延遲等指標會與其他同學有落差,對一些實作的提昇幅度也不一樣,所以這邊先使用 bpftrace 記錄並統計 read/write 系統呼叫的延遲,後續再觀察效能差異是否與系統呼叫的成本有關。

$ sudo bpftrace -e '
tracepoint:syscalls:sys_enter_read /comm == "redis-server"/ {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @read_latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

@read_latency_us: 
[1]              2808001 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)           1153948 |@@@@@@@@@@@@@@@@@@@@@                               |
[4, 8)             25778 |                                                    |
[8, 16)             8368 |                                                    |
[16, 32)            4015 |                                                    |
[32, 64)              32 |                                                    |
[64, 128)             13 |                                                    |
[128, 256)             9 |                                                    |
[256, 512)            32 |                                                    |
[512, 1K)              2 |                                                    |
[1K, 2K)               2 |                                                    |
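上面的腳本只統計了 read;write 系統呼叫的延遲可以用同樣的模式量測(程序名稱假設同為 redis-server):

$ sudo bpftrace -e '
tracepoint:syscalls:sys_enter_write /comm == "redis-server"/ {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_write /@start[tid]/ {
    @write_latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'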

去年的實驗結果是:在沒設定 CPU affinity 的情況下,應該將執行緒的數量設定為 4。因此我想先重現去年的實驗,並探討執行緒數量對吞吐量與延遲的影響。
我這邊設定寫入與讀取的比例為 1:100 和 1:10;根據查到的資料,cache 的使用情境大多為這兩種,因此只測試這兩種情境:

Memcached 的實際基準測試中,約 43.9% 的測試採用 SET:GET = 1:100 的配置
Google Cloud Memorystore 中有提到 By default, the utility issues 10 get commands for every set command.

以下測試比較執行緒數量為 4 與 8 時,qsbr 與 memb 的表現,使用的命令如下:

taskset -c 4-7 memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:100 --requests=20000

讀寫比例 100 : 1 模擬 Cache 情境,讀取遠多於寫入

吞吐量

SET Ops/sec

Set_Ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-4t 1911.112 36.840 1757.440 2006.030 1905.100 1903.892 1918.333
qsbr-8t 1857.539 47.483 1687.700 1965.850 1853.320 1848.232 1866.845
memb-4t 1904.987 48.352 1684.740 2037.310 1913.505 1895.510 1914.464
memb-8t 1843.536 42.711 1684.790 1949.790 1852.665 1835.164 1851.907

GET Ops/sec

Get_Ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-4t 190160.503 3665.605 174869.310 199605.400 189562.675 189442.045 190878.962
qsbr-8t 184829.809 4724.670 167930.680 195606.950 184409.640 183903.774 185755.844
memb-4t 189551.082 4811.135 167635.600 202717.860 190398.130 188608.099 190494.064
memb-8t 183436.420 4249.845 167640.890 194008.960 184344.715 182603.450 184269.389

觀察

  • 在沒設置 CPU affinity 時,4 個執行緒的吞吐量在平均數與中位數上皆比 8 個執行緒高
  • qsbr 的平均數表現比 memb 好,但 memb 的中位數比 qsbr 好;從最小值與標準差可以看出 memb 的吞吐量較不穩定,因此標準差也較大
平均延遲

SET Average Latency

Set_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-4t 1.459 0.059 1.357 1.681 1.449 1.447 1.470
qsbr-8t 1.407 0.061 1.316 1.699 1.395 1.395 1.419
memb-4t 1.501 0.128 1.405 2.567 1.474 1.475 1.526
memb-8t 1.487 0.091 1.387 1.954 1.469 1.469 1.504

GET Average Latency

Get_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-4t 1.042 0.013 1.012 1.122 1.040 1.039 1.045
qsbr-8t 1.072 0.022 1.040 1.165 1.079 1.068 1.077
memb-4t 1.043 0.025 1.018 1.187 1.035 1.038 1.048
memb-8t 1.078 0.022 1.052 1.162 1.069 1.073 1.082
p50 延遲

SET p50 Latency

Set_p50_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-4t 1.221 0.110 1.007 1.423 1.247 1.200 1.243
qsbr-8t 1.054 0.046 1.015 1.295 1.047 1.045 1.063
memb-4t 1.317 0.061 1.135 1.695 1.319 1.305 1.329
memb-8t 1.162 0.082 1.047 1.431 1.147 1.146 1.178

GET p50 Latency

Get_p50_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-4t 0.958 0.007 0.935 0.975 0.959 0.957 0.960
qsbr-8t 0.993 0.012 0.967 1.015 0.999 0.990 0.995
memb-4t 0.941 0.014 0.839 0.967 0.943 0.938 0.944
memb-8t 0.981 0.012 0.951 1.015 0.975 0.979 0.984

觀察

  • 在 8 個執行緒時,Set 操作的延遲在平均數與中位數上反而較低,但 Get 操作的延遲則略為提高
  • 在 p50 上 Get 操作在 8 個執行緒時不再取得更低的延遲,反之延遲較 4 個執行緒來的高
  • qsbr 在 Set 和 Get 操作上的平均延遲都比 memb 還要來的低
p99 延遲

SET p99 Latency

Set_p99_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-4t 6.030 0.432 5.279 7.391 5.983 5.946 6.115
qsbr-8t 6.230 0.840 5.183 9.407 5.919 6.065 6.394
memb-4t 6.512 0.611 5.695 10.559 6.431 6.392 6.632
memb-8t 7.008 1.004 5.471 11.327 6.815 6.811 7.205

GET p99 Latency

Get_p99_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-4t 3.882 0.398 3.151 5.055 3.791 3.804 3.960
qsbr-8t 4.004 0.402 3.391 5.855 3.903 3.925 4.082
memb-4t 4.266 0.357 3.807 6.335 4.223 4.196 4.336
memb-8t 4.291 0.425 3.727 6.495 4.191 4.208 4.374
p99.9 延遲

SET p99.9 Latency

Set_p99_9_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-4t 9.682 1.119 8.319 13.631 9.375 9.462 9.901
qsbr-8t 11.010 1.314 8.511 15.743 10.975 10.753 11.268
memb-4t 10.423 1.753 8.895 24.831 10.111 10.080 10.767
memb-8t 12.050 1.328 9.599 17.279 11.871 11.789 12.310

GET p99.9 Latency

Get_p99_9_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-4t 6.454 0.506 5.727 8.383 6.399 6.355 6.553
qsbr-8t 7.671 0.698 6.591 11.135 7.455 7.534 7.808
memb-4t 7.097 0.834 6.207 14.079 6.943 6.934 7.261
memb-8t 8.064 0.704 7.007 12.159 7.999 7.926 8.202

觀察

  • 可以看出在 p99 與 p99.9 的情況下越多執行緒會導致越高的平均延遲

小結論

  • 在讀寫佔比 100:1 且沒有設置 CPU affinity 的情況下,4 個執行緒的表現比 8 個執行緒好,不論是吞吐量或延遲皆是如此。

設置 cpu affinity

去年有個 review 問到:為什麼 thread-level CPU affinity 不是一個 thread 綁一個 CPU,而是兩個 thread 綁一個 CPU?去年負責專題的同學回覆如下:

因為我測試過每個 CPU 一個執行緒,但是效能不如一個 CPU 兩個執行緒,這樣也能透過 context switch 讓要進行 I/O 的執行緒休息,將 CPU 資源讓給另外一個執行緒。

但這段敘述並沒有實驗佐證,因此我接下來要測試:一個 CPU 綁兩個執行緒還是一個 CPU 綁一個執行緒比較好,並分析去年的結論是否正確。以下程式碼會將目前執行中的 thread 綁定(或限制)到指定的 CPU 核心集合中執行:

+ cpu_set_t cpuset;
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpu_id, &cpuset);
+ pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
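下面補上一個可獨立編譯的小示意(函式名稱 bind_self_to_cpu 為假設,非 mt-redis 原始碼),加上錯誤檢查,讓呼叫它的執行緒把自己綁到指定的 CPU:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* 將呼叫此函式的執行緒綁定到指定的 logical CPU */
static int bind_self_to_cpu(int cpu_id)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);

    int ret = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    if (ret != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", ret);
    return ret;
}

例如「兩個執行緒綁一個 CPU」的設定,就是讓兩個 worker 在啟動時都呼叫 bind_self_to_cpu(4);「一對一」則是讓各個 worker 綁到不同的 CPU。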

以下分別針對 qsbr 與 memb 兩種 flavor 去探討

taskset -c 4-7 memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:100 --requests=20000 --json-out-file=file_name

讀寫比例 100 : 1 模擬 Cache 情境,讀取遠多於寫入,在 memb flavor 下

吞吐量

SET Ops/sec

Set_Ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
memb-one-to-one-affinity 1937.405 52.190 1823.790 2097.200 1930.500 1927.176 1947.634
memb-two-to-one-affinity 2018.488 72.174 1869.110 2211.830 2020.165 2004.342 2032.635
memb-4t-no-affinity 1904.987 48.352 1684.740 2037.310 1913.505 1895.510 1914.464
memb-8t-no-affinity 1843.536 42.711 1684.790 1949.790 1852.665 1835.164 1851.907

GET Ops/sec

Get_Ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
memb-one-to-one-affinity 192776.628 5193.027 181471.580 208676.540 192089.445 191758.795 193794.462
memb-two-to-one-affinity 200844.686 7181.510 185980.900 220082.210 201011.445 199437.110 202252.262
memb-4t-no-affinity 189551.082 4811.135 167635.600 202717.860 190398.130 188608.099 190494.064
memb-8t-no-affinity 183436.420 4249.845 167640.890 194008.960 184344.715 182603.450 184269.389
平均延遲

SET Average Latency

Set_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
memb-one-to-one-affinity 1.405 0.039 1.330 1.536 1.400 1.398 1.413
memb-two-to-one-affinity 1.332 0.037 1.278 1.524 1.326 1.325 1.339
memb-4t-no-affinity 1.501 0.128 1.405 2.567 1.474 1.475 1.526
memb-8t-no-affinity 1.487 0.091 1.387 1.954 1.469 1.469 1.504

GET Average Latency

Get_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
memb-one-to-one-affinity 1.036 0.015 1.014 1.084 1.032 1.033 1.039
memb-two-to-one-affinity 1.006 0.016 0.978 1.087 1.002 1.002 1.009
memb-4t-no-affinity 1.043 0.025 1.018 1.187 1.035 1.038 1.048
memb-8t-no-affinity 1.078 0.022 1.052 1.162 1.069 1.073 1.082
p50 延遲

SET p50 Latency

Set_p50_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
memb-one-to-one-affinity 1.153 0.084 1.007 1.327 1.159 1.137 1.170
memb-two-to-one-affinity 1.249 0.024 1.191 1.359 1.247 1.244 1.254
memb-4t-no-affinity 1.317 0.061 1.135 1.695 1.319 1.305 1.329
memb-8t-no-affinity 1.162 0.082 1.047 1.431 1.147 1.146 1.178

GET p50 Latency

Get_p50_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
memb-one-to-one-affinity 0.943 0.010 0.919 0.967 0.943 0.941 0.945
memb-two-to-one-affinity 0.932 0.013 0.903 0.967 0.927 0.929 0.934
memb-4t-no-affinity 0.941 0.014 0.839 0.967 0.943 0.938 0.944
memb-8t-no-affinity 0.981 0.012 0.951 1.015 0.975 0.979 0.984
p99 延遲

SET p99 Latency

Set_p99_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
memb-one-to-one-affinity 5.553 0.285 4.799 6.943 5.535 5.497 5.609
memb-two-to-one-affinity 4.830 0.355 4.063 6.143 4.831 4.761 4.900
memb-4t-no-affinity 6.512 0.611 5.695 10.559 6.431 6.392 6.632
memb-8t-no-affinity 7.008 1.004 5.471 11.327 6.815 6.811 7.205

GET p99 Latency

Get_p99_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
memb-one-to-one-affinity 4.149 0.298 3.039 4.959 4.079 4.091 4.208
memb-two-to-one-affinity 3.275 0.392 2.495 4.671 3.343 3.198 3.352
memb-4t-no-affinity 4.266 0.357 3.807 6.335 4.223 4.196 4.336
memb-8t-no-affinity 4.291 0.425 3.727 6.495 4.191 4.208 4.374
p99.9 延遲

SET p99.9 Latency

Set_p99_9_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
memb-one-to-one-affinity 9.043 0.838 7.519 14.719 8.959 8.879 9.207
memb-two-to-one-affinity 8.051 1.040 6.399 14.015 7.887 7.848 8.255
memb-4t-no-affinity 10.423 1.753 8.895 24.831 10.111 10.080 10.767
memb-8t-no-affinity 12.050 1.328 9.599 17.279 11.871 11.789 12.310

GET p99.9 Latency

Get_p99_9_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
memb-one-to-one-affinity 7.057 0.619 5.951 12.031 6.991 6.936 7.178
memb-two-to-one-affinity 6.205 0.476 5.375 8.319 6.111 6.112 6.298
memb-4t-no-affinity 7.097 0.834 6.207 14.079 6.943 6.934 7.261
memb-8t-no-affinity 8.064 0.704 7.007 12.159 7.999 7.926 8.202

使用 perf 觀察程式的行為

以下針對一個 CPU 兩個執行緒和一個 CPU 一個執行緒的情況,各取一次中位數的結果來做分析,結果如下:
一個 CPU 兩個執行緒

 Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':

         71,780.04 msec task-clock                       #    1.653 CPUs utilized             
           560,343      context-switches                 #    7.806 K/sec                     
            14,189      cpu-migrations                   #  197.673 /sec                      
             2,025      page-faults                      #   28.211 /sec                      
   257,627,916,414      cycles                           #    3.589 GHz                         (39.99%)
   141,018,941,758      instructions                     #    0.55  insn per cycle              (50.00%)
    22,306,027,705      branches                         #  310.755 M/sec                       (49.92%)
       474,877,129      branch-misses                    #    2.13% of all branches             (49.98%)
    39,701,370,753      L1-dcache-loads                  #  553.098 M/sec                       (49.92%)
     3,265,404,792      L1-dcache-load-misses            #    8.22% of all L1-dcache accesses   (49.96%)
     1,107,813,317      LLC-loads                        #   15.433 M/sec                       (40.05%)
         6,557,980      LLC-load-misses                  #    0.59% of all LL-cache accesses    (40.09%)
    14,661,818,780      cache-references                 #  204.260 M/sec                       (40.08%)
       172,241,796      cache-misses                     #    1.17% of all cache refs           (40.10%)

      43.418281953 seconds time elapsed

      22.202794000 seconds user
      49.473298000 seconds sys

一個 CPU 一個執行緒

 Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':

         71,423.46 msec task-clock                       #    1.599 CPUs utilized             
           333,882      context-switches                 #    4.675 K/sec                     
            18,199      cpu-migrations                   #  254.804 /sec                      
             1,626      page-faults                      #   22.766 /sec                      
   254,850,020,885      cycles                           #    3.568 GHz                         (40.03%)
   136,686,768,652      instructions                     #    0.54  insn per cycle              (49.98%)
    21,480,788,799      branches                         #  300.753 M/sec                       (49.94%)
       467,898,042      branch-misses                    #    2.18% of all branches             (49.95%)
    38,495,992,811      L1-dcache-loads                  #  538.982 M/sec                       (50.05%)
     3,143,432,121      L1-dcache-load-misses            #    8.17% of all L1-dcache accesses   (49.97%)
     1,053,471,203      LLC-loads                        #   14.750 M/sec                       (40.05%)
         5,857,429      LLC-load-misses                  #    0.56% of all LL-cache accesses    (40.05%)
    14,337,925,549      cache-references                 #  200.745 M/sec                       (40.03%)
       161,865,927      cache-misses                     #    1.13% of all cache refs           (39.98%)

      44.655900137 seconds time elapsed

      22.039534000 seconds user
      49.579251000 seconds sys

結論

  • 可以觀察到在使用 memb 機制、且 CPU affinity 為兩個執行緒綁定一個 CPU 的情況下,Get 和 Set 操作的平均與中位數的吞吐量和延遲皆表現最好;並且相較於同樣是 8 個執行緒但無 CPU affinity 的情況,Get 和 Set 的吞吐量與延遲都好了大約 10%。memb 的實驗結果跟去年得到的結論一樣:兩個執行緒綁一個 CPU 會取得比較好的效能。
  • 此外,二對一的模式在 p99 與 p99.9 的表現也是四種設定中最好的,亦即 tail latency 表現最佳。
  • 在一個 CPU 綁一個執行緒時,context switches 的數量約比兩個執行緒低 40%,但 cpu-migrations 的數量卻比較多;而 cpu migration 單次的成本遠大於 context switch,因為會導致 L1/L2 cache miss(需要重新 warm-up cache),也會清除 TLB。不過兩個執行緒在同一個 CPU 上頻繁交替執行,也會因 cache conflict 而提高 cache miss 的數量。

讀寫比例 100 : 1 模擬 Cache 情境,讀取遠多於寫入,在 qsbr flavor 下

吞吐量

SET Ops/sec

Set_Ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-one-to-one-affinity 1976.983 53.720 1858.990 2088.910 1976.620 1966.454 1987.512
qsbr-two-to-one-affinity 1926.101 63.233 1761.320 2085.570 1919.775 1913.707 1938.494
qsbr-4t-no-affinity 1911.112 36.840 1757.440 2006.030 1905.100 1903.892 1918.333
qsbr-8t-no-affinity 1857.539 47.483 1687.700 1965.850 1853.320 1848.232 1866.845

GET Ops/sec

Get_Ops_sec

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-one-to-one-affinity 196714.781 5345.271 184974.190 207851.710 196678.310 195667.108 197762.454
qsbr-two-to-one-affinity 191651.843 6291.792 175255.650 207519.610 191022.075 190418.652 192885.034
qsbr-4t-no-affinity 190160.503 3665.605 174869.310 199605.400 189562.675 189442.045 190878.962
qsbr-8t-no-affinity 184829.809 4724.670 167930.680 195606.950 184409.640 183903.774 185755.844
平均延遲

SET Average Latency

Set_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-one-to-one-affinity 1.296 0.049 1.231 1.514 1.286 1.287 1.306
qsbr-two-to-one-affinity 1.332 0.037 1.273 1.515 1.328 1.325 1.340
qsbr-4t-no-affinity 1.459 0.059 1.357 1.681 1.449 1.447 1.470
qsbr-8t-no-affinity 1.407 0.061 1.316 1.699 1.395 1.395 1.419

GET Average Latency

Get_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-one-to-one-affinity 1.015 0.020 0.987 1.085 1.007 1.011 1.019
qsbr-two-to-one-affinity 1.036 0.018 1.015 1.120 1.030 1.032 1.039
qsbr-4t-no-affinity 1.042 0.013 1.012 1.122 1.040 1.039 1.045
qsbr-8t-no-affinity 1.072 0.022 1.040 1.165 1.079 1.068 1.077
p50 延遲

SET p50 Latency

Set_p50_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-one-to-one-affinity 1.001 0.033 0.975 1.191 0.991 0.994 1.007
qsbr-two-to-one-affinity 1.089 0.060 1.023 1.279 1.063 1.077 1.100
qsbr-4t-no-affinity 1.221 0.110 1.007 1.423 1.247 1.200 1.243
qsbr-8t-no-affinity 1.054 0.046 1.015 1.295 1.047 1.045 1.063

GET p50 Latency

Get_p50_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-one-to-one-affinity 0.949 0.010 0.927 0.975 0.943 0.946 0.951
qsbr-two-to-one-affinity 0.965 0.013 0.935 0.999 0.959 0.962 0.967
qsbr-4t-no-affinity 0.958 0.007 0.935 0.975 0.959 0.957 0.960
qsbr-8t-no-affinity 0.993 0.012 0.967 1.015 0.999 0.990 0.995
p99 延遲

SET p99 Latency

Set_p99_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-one-to-one-affinity 4.837 0.464 4.223 7.135 4.831 4.746 4.928
qsbr-two-to-one-affinity 5.069 0.327 4.223 6.623 5.055 5.005 5.133
qsbr-4t-no-affinity 6.030 0.432 5.279 7.391 5.983 5.946 6.115
qsbr-8t-no-affinity 6.230 0.840 5.183 9.407 5.919 6.065 6.394

GET p99 Latency

Get_p99_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-one-to-one-affinity 3.555 0.530 2.719 5.151 3.559 3.451 3.659
qsbr-two-to-one-affinity 3.775 0.337 2.767 4.927 3.735 3.709 3.841
qsbr-4t-no-affinity 3.882 0.398 3.151 5.055 3.791 3.804 3.960
qsbr-8t-no-affinity 4.004 0.402 3.391 5.855 3.903 3.925 4.082
p99.9 延遲

SET p99.9 Latency

Set_p99_9_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-one-to-one-affinity 7.569 1.141 6.079 13.823 7.391 7.345 7.793
qsbr-two-to-one-affinity 8.388 0.826 6.975 13.375 8.255 8.226 8.550
qsbr-4t-no-affinity 9.682 1.119 8.319 13.631 9.375 9.462 9.901
qsbr-8t-no-affinity 11.010 1.314 8.511 15.743 10.975 10.753 11.268

GET p99.9 Latency

Get_p99_9_Latency

Flavor Mean Std Dev Min Max Median 95% CI Lower 95% CI Upper
qsbr-one-to-one-affinity 5.984 0.847 5.151 10.367 5.759 5.818 6.150
qsbr-two-to-one-affinity 6.854 0.480 5.727 9.535 6.863 6.760 6.948
qsbr-4t-no-affinity 6.454 0.506 5.727 8.383 6.399 6.355 6.553
qsbr-8t-no-affinity 7.671 0.698 6.591 11.135 7.455 7.534 7.808

使用 perf 分析程式行為

一樣針對一個 CPU 兩個執行緒與一個 CPU 一個執行緒進行分析,分別取一次中位數的結果:
一個 CPU 兩個執行緒

         70,013.67 msec task-clock                       #    2.401 CPUs utilized             
           741,501      context-switches                 #   10.591 K/sec                     
            15,581      cpu-migrations                   #  222.542 /sec                      
             3,627      page-faults                      #   51.804 /sec                      
   249,258,869,962      cycles                           #    3.560 GHz                         (39.99%)
   132,407,403,431      instructions                     #    0.53  insn per cycle              (49.98%)
    21,077,879,352      branches                         #  301.054 M/sec                       (50.03%)
       478,702,322      branch-misses                    #    2.27% of all branches             (50.05%)
    35,215,497,531      L1-dcache-loads                  #  502.980 M/sec                       (50.12%)
     3,281,644,729      L1-dcache-load-misses            #    9.32% of all L1-dcache accesses   (50.04%)
     1,129,746,159      LLC-loads                        #   16.136 M/sec                       (40.06%)
         7,744,776      LLC-load-misses                  #    0.69% of all LL-cache accesses    (39.95%)
    14,435,913,114      cache-references                 #  206.187 M/sec                       (39.92%)
       185,804,120      cache-misses                     #    1.29% of all cache refs           (39.95%)

      29.158730419 seconds time elapsed

      18.440255000 seconds user
      51.487935000 seconds sys

一個 CPU 一個執行緒

Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':

         67,887.06 msec task-clock                       #    1.507 CPUs utilized             
           375,487      context-switches                 #    5.531 K/sec                     
            19,709      cpu-migrations                   #  290.320 /sec                      
             3,209      page-faults                      #   47.270 /sec                      
   240,616,502,442      cycles                           #    3.544 GHz                         (40.06%)
   128,321,848,547      instructions                     #    0.53  insn per cycle              (50.05%)
    20,388,085,704      branches                         #  300.324 M/sec                       (50.00%)
       468,964,574      branch-misses                    #    2.30% of all branches             (50.05%)
    34,179,214,077      L1-dcache-loads                  #  503.472 M/sec                       (50.01%)
     3,167,434,498      L1-dcache-load-misses            #    9.27% of all L1-dcache accesses   (50.00%)
     1,059,289,040      LLC-loads                        #   15.604 M/sec                       (39.97%)
         5,442,227      LLC-load-misses                  #    0.51% of all LL-cache accesses    (39.98%)
    13,343,467,237      cache-references                 #  196.554 M/sec                       (39.96%)
       151,742,665      cache-misses                     #    1.14% of all cache refs           (40.02%)

      45.045660022 seconds time elapsed

      17.981155000 seconds user
      50.111441000 seconds sys

結論

  • 可以觀察到在使用 qsbr 機制、且 CPU affinity 為一個執行緒綁定一個 CPU 的情況下,Get 和 Set 操作的平均與中位數的吞吐量和延遲皆表現最好,跟去年得到的結論不一樣。原因可能是一個 CPU 綁兩個執行緒時的 cache miss rate 較高,也可能與 qsbr 的機制有關,需要進一步實驗(待補)
  • 但也能觀察到,一個執行緒綁定一個 CPU 的情況下,雖然 p99.9 延遲的平均與中位數皆是最佳,但其 tail latency(最大值)也相對較大,原因可能與 qsbr 的機制有關,需要進一步實驗(待補)

TODO: 將並行程式設計相關的素材放到其他筆記頁面

引入 neco

去年的報告並沒有詳細解釋該如何引入 neco 以及各 API 該如何使用,且最後引入的結果也沒有顯著提昇效能,因此這邊我想從頭閱讀 neco 文件,了解該如何引入,以及為什麼效能提昇有限。

開發環境

$ uname -a
Linux iantsai-P30-F5 6.11.0-26-generic #26~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 17 19:20:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
    CPU family:           6
    Model:                94
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             3
    CPU(s) scaling MHz:   30%
    CPU max MHz:          4000.0000
    CPU min MHz:          800.0000
    BogoMIPS:             6799.81
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
                          ca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 s
                          s ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc 
                          art arch_perfmon pebs bts rep_good nopl xtopology nons
                          top_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor 
                          ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm p
                          cid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                          er xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpu
                          id_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexp
                          riority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 
                          smep bmi2 erms invpcid mpx rdseed adx smap clflushopt 
                          intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida ara
                          t pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi m
                          d_clear flush_l1d arch_capabilities
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     8 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7

什麼是 neco ?

neco 是一個以 coroutines 提供並行處理的跨平台 C 函式庫,可以在多種作業系統上使用,目標是提供快速的單執行緒並行處理,並提供網路與 I/O 的 API。我們希望引入 neco 作為低延遲的非同步處理引擎,讓每個 worker thread 能以 coroutine 管理多個 client 的 I/O 流程,減少 context switch 成本與 thread 數量,並改善 tail latency 表現。

具體表現可以用 bluebox 展示 neco 的功能:它支援 SET、GET、DEL 和 PING 操作,較單執行緒的 Redis 有更高的資料吞吐量和更低的延遲,可藉此評估 M:N threading model 對 Redis 並行化的助益。

在我的設備上,對 bluebox 與原始的 valkey 各做 30 次測試並比較效能:

ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets         1031.52          ---          ---         1.93272         1.89500         3.99900         5.59900        79.48 
Gets       102638.44         0.00    102638.44         1.92741         1.90300         3.90300         4.89500      3997.68 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     103669.95         0.00    102638.44         1.92747         1.90300         3.90300         4.89500      4077.15 
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets         1663.52          ---          ---         1.19864         1.02300         3.00700         6.43100       128.18 
Gets       165524.81      3870.41    161654.40         1.19570         1.02300         2.95900         5.69500      6575.55 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     167188.34      3870.41    161654.40         1.19573         1.02300         2.97500         5.72700      6703.73 

透過這個專案可以觀察出,neco 配合單執行緒帶來的並行增益是很可觀的,可以達到約 60% 的吞吐量增幅,因此也期望能使用這個套件去提昇 mt-redis 的吞吐量。

先閱讀以下的 neco 文件,了解 neco 的使用方法

neco doc

先從多執行緒的範例開始切入:

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include "neco.h"
#include "neco.c"

void coro1(int argc, void *argv[]) {
    // This coroutine is running in a different scheduler than coro2.
    int rd = *(int*)argv[0];
    int wr = *(int*)argv[1];
    int val;
    neco_read(rd, &val, sizeof(int));
    printf("coro1: %d\n", val);
    neco_write(wr, &(int){ 2 }, sizeof(int));
}

void coro2(int argc, void *argv[]) {
    // This coroutine is running in a different scheduler than coro1.
    int rd = *(int*)argv[0];
    int wr = *(int*)argv[1];
    int val;
    neco_write(wr, &(int){ 1 }, sizeof(int));
    neco_read(rd, &val, sizeof(int));
    printf("coro2: %d\n", val);
}

void *runtime1(void *arg) {
    int *pipefds = arg;
    neco_start(coro1, 2, &pipefds[0], &pipefds[3]);
    return 0;
}

void *runtime2(void *arg) {
    int *pipefds = arg;
    neco_start(coro2, 2, &pipefds[2], &pipefds[1]);
    return 0;
}

int main() {
    int pipefds[4];
    pipe(&pipefds[0]);
    pipe(&pipefds[2]);
    pthread_t th1, th2;
    pthread_create(&th1, 0, runtime1, pipefds);
    pthread_create(&th2, 0, runtime2, pipefds);
    pthread_join(th1, 0);
    pthread_join(th2, 0);
    return 0;
}

這個範例主要用到三個 API:分別是 neco_start、neco_read 與 neco_write。

neco_start

是 neco coroutine 函式庫的入口點 API,用來在當前執行緒中建立並執行一個 coroutine 調度器(scheduler),並從指定的 coroutine 函式開始執行。此函式將負責初始化內部 runtime,如果尚未存在 scheduler,則會自動建立一個。

這邊的排程器是使用 sco 排程器,是 neco 作者做出的 fair and deterministic scheduler

int neco_start(void(*coroutine)(int argc, void *argv[]), int argc,...);

第一個參數須傳入 coroutine 函式指標,而 coroutine 函式又須滿足以下格式


void coroutine(int argc, void *argv[]) {
    char *msg = argv[0];
    printf("%s\n", msg);
}

第一個參數是傳入 coroutine 的參數數量,對應 argv 的長度,第二個參數則是參數陣列。
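對應上面的函式簽章,一個最簡單的呼叫方式大致如下(字串內容僅為示意):

neco_start(coroutine, 1, "hello world");

其中第二個參數 1 表示傳入 1 個參數,後面的可變參數會依序出現在 coroutine 的 argv[0]、argv[1] 中。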

neco_read

non-blocking I/O 的包裝函式,用於在 coroutine 環境中從指定的 file descriptor (fd) 讀取資料。這個 API 以 POSIX 的 read() 系統呼叫實作,但額外加上 coroutine 調度支援,能讓當前 coroutine 在資料尚未可讀時主動讓出(yield)執行權。

ssize_t neco_read(int fd, void *data, size_t nbytes);

第一個參數是 fd,第二個參數是資料緩衝區的指標,第三個參數是要讀取的 byte 數。

neco_write

對 POSIX write() 系統呼叫的包裝函式,設計目的是在 coroutine 執行環境中提供非阻塞的寫入操作。當 fd 暫時無法寫入時,該 coroutine 會自動讓出(yield),等待可寫再繼續執行,參數跟邏輯都跟 neco_read 相同。

ssize_t neco_write(int fd, const void *data, size_t nbytes);

因此回頭看多執行緒的範例:程式會建立兩個執行緒,並分別透過兩個 pipe 傳遞資料。根據上述 API 的行為,範例程式的輸出應該是:

coro1: 1
coro2: 2

因為當 th1 先執行時,coro1 會呼叫 neco_read 查看該 fd 有無可讀取的資料,若沒有就先讓出(yield);而 th2 的 coro2 會先以 neco_write 寫入 1,coro1 讀到後印出並寫回 2,coro2 的 neco_read 同樣要等到資料寫入 pipe 後才能讀取,因此結果會如上。

以下是針對 API 的參數去做測試,再次確認 API 使用方法

void *runtime1(void *arg) {
+ int temp = 123;
- neco_start(coro1, 2, &pipefds[0], &pipefds[3]);
+ neco_start(coro1, 3, &pipefds[0], &pipefds[3], &temp);
}

coro1: temp: 123
coro1: 1
coro2: 2

該如何引入 mt-redis?

分析限制 Redis 並行的因素

為什麼 redis/valkey 是使用 single thread?
因此我閱讀 The Engineering Wisdom Behind Redis’s Single-Threaded Design,理解 redis 的設計與考量點。

The Engineering Wisdom Behind Redis’s Single-Threaded Design 文章整理

TODO: 使 Valkey 具備並行處理能力

yeh-sudo/mt-redis 的成果用於 Valkey,確保 userspace RCU 發揮其作用

閱讀 valkey 的資料結構

mt-redis 中,將 urcu 加到 redis 的關鍵改動

可以搭配 yeh-sudo 寫的 mt-redis 運作原理
以下討論 mt-redis 中,將 rcu 加到 redis 的關鍵改動

通用雜湊表使用 rculfhash (lock-free hash table)

redis/valkey 的雜湊表是以單執行緒為前提設計的。為了避免擴容時長時間的 rehashing,它採用 incremental rehashing:在增刪查改操作中逐步將舊雜湊表的資料搬移至新表,保有高可用性。

然而 mt-redis 的通用雜湊表改用 rculfhash,而 rculfhash 本身就支援動態調整大小,因此 mt-redis 中就沒有手動做 incremental rehashing。

rculfhash 提供 cds_lfht_node 作為將節點串進 hash table 的媒介,並可使用 caa_container_of 這個巨集從節點取回包含它的整個結構。

因此在 redis 中幾個重要的結構中都能看到考量到 rculfhash 的設計。

  • struct redisDb 中存放通用字典 (q_dict)。
  • struct q_dict 是 Redis 用來管理 key-value 的核心結構,其中成員就包含 struct cds_lfht *table。
  • struct dictEntry 是具體存儲每個 key-value pair 的單位,其中成員有 cds_lfht_node node 用以串起雜湊表,還有用於延後回收的 rcu_head rcu_head,以及 void *key 與 void *val,其中 val 指向存放 value 的 robj (struct redisObject) 結構。

因此以下展示 mt-redis 插入 key-value 的流程,同時顯示 redisDb、q_dict、q_dictEntry 以及 robj 之間的關係。

q_dictAdd() 的步驟:

  1. rcu_read_lock()
  2. 建立 dictEntry, 將要插入的 key 轉成 sds 格式 (Simple Dynamic String),在 dictEntry.key 存 key,dictEntry.val 存含有 value 的 robj 指針
  3. 使用 cds_lfht_add_replace(),將 dictEntry 加入到 q_dict.table 中
  4. rcu_read_unlock()

圖示:

+-----------------+   
|    redisDb      |   
+------------------+   
| q_dict *dict   --> +------------------------+   
|                  |  struct q_dict          |   (q_dict 是雜湊表,存放在 redisDb 的 dict 成員中)
|                  |  +-------------------+  |
|                  |  | int size          |  |
|                  |  | cds_lfht *table    |   
|                  |  +-------------------+  |
+------------------+                       |
                                          |
                                          |     (使用 cds_lfht_add_replace() 加到 table 中)
                                          v
                               +------------------------+
                               |  struct q_dictEntry    |
                               +------------------------+
                               |  +------------------+  |
                               |  | cds_lfht_node    |  | -----> 雜湊表節點  
                               |  | node             |  |
                               |  +------------------+  |
                               |  +------------------+  |
                               |  | struct rcu_head  |  | -----> call_rcu 會使用到,延後會回收
                               |  | rcu_head         |  |
                               |  +------------------+  |
                               |  +------------------+  |
                               |  | void *key        |  | -----> key 會轉成 sds 類型所保存
                               |  +------------------+  |
                               |  +------------------+  |
                               |  | void *val        |  | -----> value 將以 robj 形式所保存
                               |  +------------------+  |
                               +------------------------+
                                          |
                                          v
                                +---------------------+
                                | robj                |  (robj 是 Redis 中的資料對象)
                                +---------------------+
                                | type  | encoding    | 
                                | ptr   |             |-----> sds 或其它資料類型
                                +---------------------+


使用 concurrent queue 處理多個請求

mt-redis 在 struct redisServer 中加入 userspace RCU 函式庫中 wait-free concurrent queue (wfcqueue) 的頭尾結構,以處理多個請求。

struct redisServer {
    //multi-thread redis support
    pthread_mutex_t command_request_lock;
    int threads_num;    /* num of worker threads.*/
    list *command_requests;
    struct cds_wfcq_head command_requests_head;
    struct cds_wfcq_tail command_requests_tail;
    int socketpairs[2];
    q_eventloop qel;
};
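補充一個 wfcqueue 的最小使用示意(struct request 等名稱為假設,非 mt-redis 原始碼),對應上面結構中的 head/tail,以及後文 worker 端 enqueue、server 執行緒 dequeue 的用法:

#include <urcu/wfcqueue.h>
#include <urcu/compiler.h>   /* caa_container_of */

struct request {
    int type;
    struct cds_wfcq_node q_node;   /* 掛進 queue 的節點 */
};

static struct cds_wfcq_head head;
static struct cds_wfcq_tail tail;

static void queue_init(void)
{
    cds_wfcq_init(&head, &tail);
}

/* 多個 worker 執行緒可以並行 enqueue(wait-free) */
static void produce(struct request *r)
{
    cds_wfcq_node_init(&r->q_node);
    cds_wfcq_enqueue(&head, &tail, &r->q_node);
}

/* __cds_wfcq_dequeue_blocking 要求 dequeue 端互斥;
 * mt-redis 中只有 server 執行緒消費這個 queue,因此可直接使用 */
static struct request *consume(void)
{
    struct cds_wfcq_node *node = __cds_wfcq_dequeue_blocking(&head, &tail);
    return node ? caa_container_of(node, struct request, q_node) : NULL;
}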

在 initServer 中對通用雜湊表進行初始化,並初始化 q_worker 與 q_master:

// server.c
extern struct redisServer server;

void initServer(void) 
 {
 ...
     for (j = 0; j < server.dbnum; j++) {
            server.db[j].dict = zmalloc(sizeof(q_dict));
            server.db[j].dict->table = cds_lfht_new(1, 1, 0, CDS_LFHT_AUTO_RESIZE|CDS_LFHT_ACCOUNTING, NULL);
            server.db[j].expires = zmalloc(sizeof(q_dict));
            server.db[j].expires->table = cds_lfht_new(1, 1, 0, CDS_LFHT_AUTO_RESIZE|CDS_LFHT_ACCOUNTING, NULL);
    }
    
    q_workers_init(server.threads_num);
    q_master_init();
    aeCreateFileEvent(server.el, server.socketpairs[0], AE_READABLE, 
                server_event_process, NULL);
 ...
 }

mt-redis 中有三種執行緒,並定義了管理客戶端連線與請求的資料結構。

struct client 負責管理客戶端連接和請求處理

mt-redis 中的執行緒角色分為 master、worker 與 server 執行緒,三者職責不同:master 負責管理 client,worker 負責並行的讀取操作,server 則負責序列化的寫入操作。

struct client

這結構定義了每個客戶端連接的完整狀態

/* With multiplexing we need to take per-client state.
 * Clients are taken in a linked list. */
typedef struct client {
    q_eventloop *qel;       /* Eventloop of the worker's thread which handles this client.*/
    int curidx;             /* The worker idx that this client current belogn to.*/
    uint64_t id;            /* Client incremental unique ID. */
    int fd;                 /* Client socket. */
    redisDb *db;            /* Pointer to currently SELECTed DB. */
    sds querybuf;           /* Buffer we use to accumulate client queries. */
    int argc;               /* Num of arguments of current command. */
    robj **argv;            /* Arguments of current command. */
    struct redisCommand *cmd, *lastcmd;  /* Last command executed. */
    int flags;              /* Client flags: CLIENT_* macros. */
    ...

結構成員介紹

  • qel 指向處理該客戶端的事件循環
  • curidx 記錄當前所屬的 worker 執行緒 ID
  • fd 為客戶端 socket,querybuf 用於存接收到的命令
  • argc 與 argv 存儲解析後的命令參數,cmd 指向當前執行的命令

q_command_request 將藉由 q_node 掛到 redisServer 中的 wfcq(wait-free concurrent queue):

//command forwarded from worker thread to server thread.
typedef struct q_command_request {
    int type;
    struct client* c;
    struct cds_wfcq_node q_node;
} q_command_request;

Master 執行緒

主要負責設置監聽 socket、接受新的客戶端連線並處理新連線。

static int setup_master(void) {
    int j;
    for (j = 0; j < server.ipfd_count; j++) {
        aeCreateFileEvent(master.qel.el, server.ipfd[j], AE_READABLE, acceptTcpHandler, NULL);
    }
    if (server.sofd > 0)
	    aeCreateFileEvent(master.qel.el, server.sofd, AE_READABLE,
                acceptUnixHandler, NULL);
    return C_OK;
}

acceptTcpHandler() 的 while(max--) 迴圈中使用了 q_worker 的 dispatch_conn_new()。

當 master 執行緒接受新連線時,會以輪詢 (round-robin) 方式分發給不同的 worker;struct client 則是在 worker 執行緒中建立,這部分於後文說明。

// in q_master.c
static void acceptCommonHandler(int fd, int flags, char *ip) 
{
    ...
    dispatch_conn_new(fd);
    ...
}

dispatch_conn_new()

  1. 創建一個 connswapunit 結構來封裝 socket descriptor
  2. 選擇一個 worker 執行緒(輪詢方式)
  3. 將 connswapunit 推入該 worker 的隊列
  4. 通過 socket pair 通知 worker 執行緒有新連接
// q_worker.c
void dispatch_conn_new(int sd) {
    struct connswapunit *su = csui_new();
    char buf[1];
    q_worker *worker;
	
    int tid = (last_worker_thread + 1) % server.threads_num;
    worker = darray_get(&workers, (uint32_t) tid);
    last_worker_thread = tid;

    su->num = sd;
    su->data = NULL;
    csul_push(worker, su);

    buf[0] = 'c';
    // to be handled by worker's worker_thread_event_process loop
    write(worker->socketpairs[0], buf, 1);
}

worker 負責並行的讀操作

connswapunit 結構中的 q_node 用於 worker 執行緒的 wait-free concurrent queue。

q_worker 中兩個 queue:

  • q_head/q_tail:用於接收新連接
  • r_head/r_tail:用於接收從 server 執行緒返回的 client
typedef struct q_worker {
    int id;
    q_eventloop qel;
    int socketpairs[2];   /*0: used by master/server thread, 1: belong to worker thread.*/

    /* queue for replication thread to switch the client back to this worker's thread.*/
    struct cds_wfcq_head r_head;
    struct cds_wfcq_tail r_tail;

    struct cds_wfcq_tail q_tail;
    struct cds_wfcq_head q_head;

    q_worker_stats stats;
} q_worker;

struct connswapunit {
    int num;
    void *data;
    struct cds_wfcq_node q_node;
};
static int setup_worker(q_worker *worker) {
    int status = aeCreateFileEvent(worker->qel.el, worker->socketpairs[1], AE_READABLE,
            worker_thread_event_process, worker);

worker_thread_event_process 負責處理來自其他執行緒的通知與後續的命令處理。
當 worker 執行緒收到 'c' 字元的通知時:

  1. 從隊列中取出 connswapunit
  2. 提取 socket descriptor
  3. 建立 client instance,並且將一個 client 綁一個 thread id
// worker threads to master/server thread communication handler function
static void
worker_thread_event_process(aeEventLoop *el, int fd, void *privdata, int mask) {
    if (read(fd, buf, 1) != 1) {
        buf[0] = 'c';
    }

    switch (buf[0]) {
         case 'c':
             csu = csul_pop(worker);
             sd = csu->num;
             zfree(csu);
            ...
            c = createClient(&worker->qel, sd);
            c->curidx = worker->id;
}

Worker 執行緒通過 worker_readQueryFromClient 讀取客戶端數據,然後使用 worker_processInputBuffer 處理命令。

處理命令時,如果發現該命令需要在 server 執行緒處理,client 將被轉移到 server 執行緒,這部分邏輯見 int worker_processCommand(client *c)。

處理步驟

  • 使用 q_createCommandRequest(c) 創建一個命令請求結構,封裝 client 訊息
  • 使用 cds_wfcq_enqueue() 將命令請求加入到 server 執行緒的全局命令隊列中
  • 設置 c->flags |= CLIENT_JUMP 標誌,表示 client 正在執行緒間跳轉
  • 通過 write(server.socketpairs[1], buf, 1) 向 server 執行緒發送通知訊號。
// server.c

/* worker's version of processCommand. */
int worker_processCommand(client *c) {
	struct q_command_request *r;
	...
	c->cmd = c->lastcmd = lookupCommand(c->argv[0]->ptr);
    // CMD is not readonly 
    unlinkClientFromEventloop(c);
    r = q_createCommandRequest(c);
    cds_wfcq_enqueue(&server.command_requests_head, &server.command_requests_tail, &r->q_node);
    c->flags |= CLIENT_JUMP;

    buf[0] = 'r'; /* r stands for request from worker thread */

server 執行緒處理寫入操作

能在 server_event_process() 看到 worker 執行緒到 server 執行緒的命令轉發機制。

void server_event_process

  1. 藉由 __cds_wfcq_dequeue_blocking 來從隊列中取出請求。
  2. 獲取客戶端訊息並設置事件循環
  3. 釋放命令
  4. 使用 server_processCommand() 處理命令
  5. 判斷客戶端類型,是 slave 客戶端或是管理命令,就保持在 server 執行緒上
  6. dispatch 給 worker 處理
void server_event_process(struct aeEventLoop *eventLoop, int fd, void *clientData, int mask) {
    char buf[1];
    client *c;
    int from;
    q_command_request *r;
    struct cds_wfcq_node *qnode;

    // Number of request in command_requests NOT 1:1 with num of bytes inside socketpairs
    qnode = __cds_wfcq_dequeue_blocking(&server.command_requests_head, &server.command_requests_tail);

    //the sockerpair buf can be overflowed when num of connections increase, so 
    //Num of requests in command_requests list is NOT 1:1 with num of bytes in socketpair buf, the code
    //has to deal with this mismatch.

    while(read(fd, buf, 1) == 1 || qnode != NULL ){

        if (qnode == NULL) break;
        r = caa_container_of(qnode, struct q_command_request, q_node);

        c = r->c;
        from = c->curidx; 
        c->qel = &server.qel;
        c->curidx = -1;
        q_freeCommandRequest(r);

        server_processCommand(c);
        if (c->flags & CLIENT_SLAVE || c->cmd->flags & CMD_ADMIN) {
            keep_slave_to_server_thread(c, from);
        } else {
            dispatch_to_worker(c, from);
        }
        qnode = __cds_wfcq_dequeue_blocking(&server.command_requests_head, &server.command_requests_tail);
    }
}

因此從以上能看得出來,mt-redis 將讀取操作保留在 worker 執行緒處理,而寫入操作統一在 server 執行緒處理。