# Linux Kernel Project: A Highly Concurrent Valkey Implementation
> Contributors: tsaiiuo, eleanorLYJ
> [GitHub](https://github.com/jserv/mt-redis)
> [Project presentation video](https://www.youtube.com/live/AI8shjBfUoE?si=ZFkCHwh-ucIHP8m7&t=8935)
### Q: Knowing how the different URCU flavors behave at p50, p99, and p99.9, what improvements can we make?
We can choose a flavor that matches the application's needs. If the application is sensitive to tail latency, pick the URCU flavor with the lowest values at the high percentiles (p99, p99.9); if typical response time matters most, choose based on the mean and the median (p50).
### Q: How do we explain the performance impact of each URCU flavor?
Start from each flavor's underlying implementation. For example, with the qsbr flavor we can lower the frequency at which writers update, improving reader performance; accordingly, at read/write ratios of 100:1 and 10:1, qsbr delivers the best GET throughput and latency. Different read/write ratios also interact differently with each flavor: at a 1:1 ratio, qsbr's throughput and latency actually fall below the other flavors. A likely cause is that with heavily interleaved reads and writes, qsbr's grace period can barely advance, so write latency explodes and overall throughput drops; the worst SET latency observed in the tests reached 888 ms. This again confirms that the URCU flavor must be matched to the usage scenario.
### Reviewed by `MikazukiHikari`
How do we assess the machine's own effect on throughput, given that throughput differs across devices?
>This is itself a question that needs experiments. Many hardware factors (CPU core count, memory bandwidth, I/O devices) and system-software factors (scheduler, memory management) directly affect latency and throughput. Between the two machines used in this experiment, the most visible differences are the CPU core count and the presence or absence of P-cores. [name=tsaiiuo]
### Reviewed by `Andrewtangtang`
Regarding the effect of thread count: is there a hypothesis for why 4 threads beat 8 threads on every metric?
>From perf observations, with 4 threads the number of context switches is lower than with 8 threads (by almost half), while the number of cpu-migrations is higher (by roughly a quarter). The results suggest that the saving from fewer context switches outweighs the cost of the extra cpu-migrations, which is why 4 threads perform better. [name=tsaiiuo]
## TODO: Reproduce last year's experiments and understand the technical challenges of parallelizing Redis
> [2024 report](https://hackmd.io/@sysprog/B1W2HKTER)
> Reproduce the experiments and interpret the results
> Discuss how Redis helps cloud computing and networked service architectures
> Explain why userspace RCU helps increase Redis's degree of parallelism
> Analyze the factors that limit Redis parallelism
> Survey the integration of userspace RCU with Redis: which issues must be considered?
> Evaluate the use and tuning of neco
### Reading [introduce sys_membarrier(): process-wide memory barrier (v6)](https://lwn.net/Articles/370146/)
To use the memb flavor we first need to understand what the `sys_membarrier()` system call does. From its introduction:
>Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invokation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.
`sys_membarrier()` sends an IPI (Inter-Processor Interrupt) to every running thread of the calling process, making each thread execute a real memory barrier on its own CPU; threads that are not running, or that belong to other processes, are unaffected.
To explain further, the article gives an example:
>Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
The two threads originally look like this:
```cpp
Thread A Thread B
prev mem accesses prev mem accesses
smp_mb() smp_mb()
follow mem accesses follow mem accesses
```
After introducing `sys_membarrier()`, this can be rewritten as:
```cpp
Thread A Thread B
prev mem accesses prev mem accesses
sys_membarrier() barrier()
follow mem accesses follow mem accesses
```
In `Thread B` (the reader), a compiler barrier can directly replace the original `smp_mb()`, saving considerable cost. Meanwhile `Thread A` (the writer) replaces its `smp_mb()` with `sys_membarrier()`, which sends an IPI to every active thread in the process and forces each to execute one full memory barrier.
After this change there are two cases:
1. Thread A and Thread B do not run concurrently
```cpp
Thread A Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
prev mem accesses
barrier()
follow mem accesses
```
In this case `Thread B`'s accesses are only weakly ordered, but that is fine: `Thread A` does not care about ordering against them, because the two executions do not overlap.
2. Thread A and Thread B run concurrently
```cpp
Thread A Thread B
prev mem accesses prev mem accesses
sys_membarrier() barrier()
follow mem accesses follow mem accesses
```
When `Thread B` executes barrier(), it only guarantees ordering within its own program;
but at the same time `Thread A` is calling `sys_membarrier()`, so the kernel sends an IPI to every running thread, making each execute a hardware memory barrier on its CPU core.
This "upgrades" B's compiler barrier into a real full `smp_mb()`, enforcing ordering across cores. Threads that are not running (descheduled by the scheduler) need to do nothing, because a context switch itself implies a barrier.
>Each non-running process threads are intrinsically serialized by the scheduler.
The author also ran a performance experiment comparing the expedited and non-expedited `sys_membarrier()` variants (on an Intel Xeon E5405):
- Expedited membarrier: 10,000,000 calls, roughly 2-3 µs per call
- Non-expedited membarrier: 1,000 calls, about 16 s in total, i.e. roughly 16 ms per call, about 5,000-8,000 times slower than expedited mode
>Add "int expedited" parameter, use synchronize_sched() in the non-expedited
case. Thanks to Lai Jiangshan for making us consider seriously using
synchronize_sched() to provide the low-overhead membarrier scheme.
>Check num_online_cpus() == 1, quickly return without doing nothing.
There are also throughput tests showing the effectiveness of `sys_membarrier()`:
Operations in 10s, 6 readers, 2 writers:
>(what we previously had)
memory barriers in reader: 973494744 reads, 892368 writes
signal-based scheme: 6289946025 reads, 1251 writes
>(what we have now, with dynamic sys_membarrier check, expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 4316818891 reads, 503790 writes
>(dynamic sys_membarrier check, non-expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 8698725501 reads, 313 writes
First, the dynamic sys_membarrier check carries some extra cost, lowering both read-side and write-side throughput by roughly 7%, but the results stay near the baseline, so the default mechanism keeps the dynamic check enabled.
With `sys_membarrier()` enabled in expedited mode, performance improves considerably: the read side executes only a compiler barrier, so it far outperforms the original mb scheme. On the writer side it is about 500x faster than the signal-based scheme, though still somewhat slower than the original mb scheme because of the extra `sys_membarrier()` calls. In non-expedited mode the read-side throughput is the highest of all, because the writer sends no IPIs (it uses a lighter-weight membarrier fallback); without IPI interference, read throughput rises markedly. The trade-off is that the writer, having no IPIs, relies mainly on CPU context switches (or a thread periodically waking each CPU), so it has no immediacy: each `sys_membarrier()` takes longer and is less predictable.
>RCU and concurrent programming reading notes: https://hackmd.io/@tMeEIUCqQSCiXUaa-3tcsg/r11N0uJbll/edit
### What do RCU callbacks do?
#### How they work
A callback is a function that is invoked only once a preset condition is met. The mechanism has two phases:
1. Registration: record the function (callback) you want invoked in some framework or data structure, for example:
```cpp
call_rcu(&p->rcu_head, reclaim_callback);
```
2. Invocation: when the preset condition (event) occurs, the system/framework actually calls the previously registered callback.
In RCU, callbacks are mainly used to reclaim resources, either one at a time or in batches. For single reclamation, the registered callback is invoked once every thread has passed through a quiescent state, as illustrated below:

### Reproducing the experiments
The goal is to determine which URCU flavor best suits Redis.
#### Development environment
```bash
$ uname -a
Linux eleanor 6.11.0-24-generic #24~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 25 20:14:34 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i7-12700
CPU family: 6
Model: 151
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 2
CPU(s) scaling MHz: 19%
CPU max MHz: 4900.0000
CPU min MHz: 800.0000
BogoMIPS: 4224.00
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 512 KiB (12 instances)
L1i: 512 KiB (12 instances)
L2: 12 MiB (9 instances)
L3: 25 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
```
The experiments use the P-cores only, with taskset to avoid the E-cores; Hyper-Threading is also disabled.
The command launching mt-redis therefore changes. The following binds the mt-redis server to the P-cores, logical CPU ids 0,2,4,6,8,10,12,14,16:
```bash
taskset -c 0,2,4,6,8,10,12,14,16 ./redis-server --port 6379 --appendonly no --save ""
```
`--appendonly no` disables AOF (Append Only File), so writes no longer go to disk one by one.
`--save ""` disables RDB snapshots, i.e. the periodic data-snapshot feature.
The memtier_benchmark command used:
```bash
taskset -c 17,18,19 \
memtier_benchmark \
-p 6379 \
--random-data \
--ratio=${RATIO} \
--requests=20000 \
--json-out-file="${OUTPUT_FILE}" \
--hide-histogram
```
#### Understanding memtier_benchmark
With no special command given, it tests the string type by default.
#### Last year's experiment design and results
Last year's student compared five URCU flavors under different read/write ratios: 10:1, 100:1, and 1:1.
The target was mt-redis; the same command was run 100 times, observing SET Ops/sec, SET Average Latency, SET p50 Latency, SET p99 Latency, SET p99.9 Latency, GET Ops/sec, GET Average Latency, GET p50 Latency, GET p99 Latency, and GET p99.9 Latency.
Last year's conclusion:
> In the read-heavy tests, signal-based RCU performed best, while in the 1:1 test its write performance was poor. In theory QSBR should be the top performer, but it underperformed in practice; the timing of entering quiescent states may need tuning.
However, I found that urcu-signal was deprecated in 2023! [Commit aad674a](https://github.com/urcu/userspace-rcu/commit/aad674a9a583e09e854145f18c5d8854269dce8c) is the final commit removing urcu-signal; its message recommends urcu-memb instead, since urcu-memb achieves similar read-side performance without reserving a signal, and also has improved grace-period performance.
#### Read/write ratio 100:1: a cache-like workload, reads far outnumber writes
##### Throughput
SET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 3044.210 | 124.389 | 2779.570 | 3535.730 | 3039.250 | 3019.404 | 3069.015 |
| BP | 3084.212 | 132.025 | 2843.330 | 3441.970 | 3066.740 | 3057.884 | 3110.541 |
| MEMB | 3052.686 | 142.048 | 2811.680 | 3498.170 | 3031.835 | 3024.359 | 3081.014 |
| MB | 3062.779 | 141.894 | 2864.770 | 3609.120 | 3041.210 | 3034.482 | 3091.076 |
| SIGNAL | 3054.204 | 126.075 | 2808.940 | 3447.240 | 3024.835 | 3029.062 | 3079.346 |
GET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 302906.517 | 12377.009 | 276574.670 | 351814.290 | 302413.045 | 300438.278 | 305374.756 |
| BP | 306886.910 | 13136.841 | 282918.290 | 342484.500 | 305148.375 | 304267.144 | 309506.676 |
| MEMB | 303749.942 | 14134.116 | 279768.790 | 348076.500 | 301675.285 | 300931.298 | 306568.586 |
| MB | 304754.178 | 14118.777 | 285051.500 | 359116.510 | 302608.090 | 301938.593 | 307569.762 |
| SIGNAL | 303900.933 | 12544.743 | 279496.530 | 343009.320 | 300978.660 | 301399.244 | 306402.622 |
##### Average latency
SET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.665 | 0.015 | 0.632 | 0.705 | 0.664 | 0.662 | 0.668 |
| BP | 0.665 | 0.015 | 0.620 | 0.699 | 0.665 | 0.662 | 0.668 |
| MEMB | 0.666 | 0.014 | 0.635 | 0.709 | 0.665 | 0.663 | 0.669 |
| MB | 0.667 | 0.015 | 0.629 | 0.712 | 0.664 | 0.664 | 0.670 |
| SIGNAL | 0.666 | 0.013 | 0.630 | 0.691 | 0.667 | 0.663 | 0.669 |
GET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.649 | 0.007 | 0.631 | 0.666 | 0.649 | 0.647 | 0.650 |
| BP | 0.648 | 0.008 | 0.628 | 0.672 | 0.648 | 0.646 | 0.649 |
| MEMB | 0.649 | 0.007 | 0.626 | 0.669 | 0.648 | 0.647 | 0.650 |
| MB | 0.648 | 0.008 | 0.624 | 0.664 | 0.648 | 0.646 | 0.649 |
| SIGNAL | 0.648 | 0.008 | 0.627 | 0.666 | 0.648 | 0.647 | 0.650 |
##### p50 latency
SET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.499 | 0.004 | 0.487 | 0.511 | 0.495 | 0.498 | 0.499 |
| BP | 0.499 | 0.005 | 0.487 | 0.511 | 0.495 | 0.498 | 0.500 |
| MEMB | 0.500 | 0.005 | 0.495 | 0.511 | 0.503 | 0.499 | 0.501 |
| MB | 0.498 | 0.005 | 0.487 | 0.511 | 0.495 | 0.498 | 0.499 |
| SIGNAL | 0.499 | 0.005 | 0.487 | 0.511 | 0.495 | 0.498 | 0.500 |
GET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.497 | 0.004 | 0.487 | 0.503 | 0.495 | 0.496 | 0.498 |
| BP | 0.497 | 0.005 | 0.487 | 0.511 | 0.495 | 0.496 | 0.498 |
| MEMB | 0.497 | 0.004 | 0.487 | 0.503 | 0.495 | 0.497 | 0.498 |
| MB | 0.496 | 0.005 | 0.487 | 0.511 | 0.495 | 0.495 | 0.497 |
| SIGNAL | 0.497 | 0.005 | 0.487 | 0.503 | 0.495 | 0.496 | 0.498 |
##### p99 latency
SET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 4.524 | 0.016 | 4.511 | 4.543 | 4.511 | 4.521 | 4.527 |
| BP | 4.526 | 0.016 | 4.511 | 4.543 | 4.511 | 4.523 | 4.529 |
| MEMB | 4.528 | 0.026 | 4.511 | 4.735 | 4.511 | 4.523 | 4.533 |
| MB | 4.526 | 0.029 | 4.511 | 4.767 | 4.511 | 4.520 | 4.531 |
| SIGNAL | 4.526 | 0.016 | 4.511 | 4.543 | 4.511 | 4.523 | 4.529 |
GET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 4.516 | 0.012 | 4.511 | 4.543 | 4.511 | 4.514 | 4.518 |
| BP | 4.515 | 0.010 | 4.511 | 4.543 | 4.511 | 4.513 | 4.517 |
| MEMB | 4.515 | 0.011 | 4.511 | 4.543 | 4.511 | 4.513 | 4.518 |
| MB | 4.515 | 0.010 | 4.511 | 4.543 | 4.511 | 4.513 | 4.517 |
| SIGNAL | 4.517 | 0.013 | 4.511 | 4.543 | 4.511 | 4.515 | 4.520 |
##### p99.9 latency
SET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 7.518 | 0.107 | 6.527 | 7.775 | 7.519 | 7.497 | 7.540 |
| BP | 7.614 | 0.487 | 7.519 | 10.559 | 7.519 | 7.517 | 7.711 |
| MEMB | 7.555 | 0.209 | 7.519 | 9.151 | 7.519 | 7.513 | 7.597 |
| MB | 7.593 | 0.390 | 7.519 | 10.687 | 7.519 | 7.516 | 7.671 |
| SIGNAL | 7.669 | 0.678 | 7.519 | 12.415 | 7.519 | 7.534 | 7.804 |
GET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 7.519 | 0.000 | 7.519 | 7.519 | 7.519 | 7.519 | 7.519 |
| BP | 7.519 | 0.003 | 7.519 | 7.551 | 7.519 | 7.519 | 7.520 |
| MEMB | 7.519 | 0.000 | 7.519 | 7.519 | 7.519 | 7.519 | 7.519 |
| MB | 7.519 | 0.000 | 7.519 | 7.519 | 7.519 | 7.519 | 7.519 |
| SIGNAL | 7.519 | 0.003 | 7.519 | 7.551 | 7.519 | 7.519 | 7.520 |
##### Summary
All flavors show very similar GET throughput and latency; BP leads GET throughput slightly.
SIGNAL has the worst SET p99.9 latency, with the highest mean and largest standard deviation, indicating unstable write tail latency. QSBR has the best SET p99.9 latency, suggesting its reclamation mechanism may do better when writes are very infrequent.
#### Read/write ratio 10:1: a cache-like workload, reads outnumber writes
##### Throughput
SET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 28660.834 | 320.453 | 28114.800 | 30779.230 | 28630.195 | 28596.929 | 28724.739 |
| BP | 28658.504 | 315.730 | 28105.650 | 29984.920 | 28611.525 | 28595.541 | 28721.468 |
| MEMB | 28630.912 | 395.076 | 27642.400 | 31215.680 | 28617.705 | 28552.125 | 28709.699 |
| MB | 28634.801 | 468.614 | 27521.340 | 31324.040 | 28564.030 | 28541.349 | 28728.253 |
| SIGNAL | 28613.339 | 405.676 | 26497.110 | 30684.580 | 28576.275 | 28532.439 | 28694.240 |
<!-- - All flavors have very similar mean throughput (~28,600 ops/sec)
- QSBR and BP have the smallest standard deviation (~320), the most stable
- SIGNAL's minimum (26,497) is clearly below the other flavors
-->
GET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 286466.529 | 3202.947 | 281008.890 | 307640.060 | 286160.300 | 285827.793 | 287105.265 |
| BP | 286443.242 | 3155.737 | 280917.470 | 299700.800 | 285973.705 | 285813.921 | 287072.563 |
| MEMB | 286167.464 | 3948.806 | 276287.260 | 312002.310 | 286035.460 | 285379.988 | 286954.940 |
| MB | 286206.329 | 4683.818 | 275077.250 | 313085.390 | 285498.950 | 285272.276 | 287140.382 |
| SIGNAL | 285991.817 | 4054.760 | 264839.980 | 306694.000 | 285621.375 | 285183.212 | 286800.423 |
<!-- - Throughput is again highly similar (~286,000 ops/sec)
- QSBR and BP again show the best stability
- SIGNAL's minimum (264,840) is clearly low -->
##### Average latency
SET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.635 | 0.002 | 0.630 | 0.641 | 0.635 | 0.634 | 0.635 |
| BP | 0.635 | 0.002 | 0.630 | 0.639 | 0.635 | 0.634 | 0.635 |
| MEMB | 0.635 | 0.002 | 0.630 | 0.640 | 0.635 | 0.635 | 0.635 |
| MB | 0.635 | 0.002 | 0.630 | 0.641 | 0.635 | 0.634 | 0.635 |
| SIGNAL | 0.635 | 0.002 | 0.630 | 0.643 | 0.635 | 0.635 | 0.636 |
GET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.635 | 0.002 | 0.630 | 0.640 | 0.634 | 0.634 | 0.635 |
| BP | 0.634 | 0.002 | 0.629 | 0.639 | 0.634 | 0.634 | 0.635 |
| MEMB | 0.635 | 0.002 | 0.630 | 0.639 | 0.635 | 0.634 | 0.635 |
| MB | 0.635 | 0.002 | 0.630 | 0.641 | 0.635 | 0.634 | 0.635 |
| SIGNAL | 0.635 | 0.002 | 0.629 | 0.643 | 0.635 | 0.635 | 0.635 |
##### p50 latency
SET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.497 | 0.006 | 0.487 | 0.511 | 0.495 | 0.496 | 0.498 |
| BP | 0.497 | 0.005 | 0.487 | 0.511 | 0.495 | 0.497 | 0.498 |
| MEMB | 0.498 | 0.005 | 0.487 | 0.503 | 0.495 | 0.497 | 0.499 |
| MB | 0.498 | 0.005 | 0.487 | 0.511 | 0.495 | 0.497 | 0.499 |
| SIGNAL | 0.499 | 0.004 | 0.487 | 0.511 | 0.503 | 0.499 | 0.500 |
GET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.497 | 0.006 | 0.487 | 0.511 | 0.495 | 0.496 | 0.498 |
| BP | 0.497 | 0.005 | 0.487 | 0.511 | 0.495 | 0.497 | 0.498 |
| MEMB | 0.498 | 0.005 | 0.487 | 0.503 | 0.495 | 0.497 | 0.499 |
| MB | 0.498 | 0.005 | 0.487 | 0.511 | 0.495 | 0.497 | 0.499 |
| SIGNAL | 0.499 | 0.004 | 0.487 | 0.511 | 0.503 | 0.498 | 0.500 |
##### p99 latency
SET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 1.384 | 0.021 | 1.359 | 1.495 | 1.383 | 1.379 | 1.388 |
| BP | 1.383 | 0.018 | 1.359 | 1.447 | 1.375 | 1.380 | 1.387 |
| MEMB | 1.385 | 0.018 | 1.359 | 1.439 | 1.383 | 1.381 | 1.388 |
| MB | 1.385 | 0.019 | 1.351 | 1.455 | 1.383 | 1.381 | 1.389 |
| SIGNAL | 1.425 | 0.091 | 1.359 | 1.959 | 1.391 | 1.407 | 1.443 |
GET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 1.383 | 0.020 | 1.359 | 1.487 | 1.375 | 1.379 | 1.387 |
| BP | 1.383 | 0.018 | 1.359 | 1.439 | 1.375 | 1.380 | 1.387 |
| MEMB | 1.385 | 0.017 | 1.359 | 1.439 | 1.383 | 1.381 | 1.388 |
| MB | 1.385 | 0.019 | 1.351 | 1.455 | 1.383 | 1.381 | 1.389 |
| SIGNAL | 1.426 | 0.091 | 1.359 | 1.935 | 1.391 | 1.408 | 1.444 |
##### p99.9 latency
SET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 2.851 | 0.290 | 2.271 | 4.319 | 2.807 | 2.793 | 2.909 |
| BP | 2.822 | 0.248 | 2.367 | 3.503 | 2.799 | 2.772 | 2.871 |
| MEMB | 2.810 | 0.278 | 2.351 | 3.839 | 2.767 | 2.754 | 2.865 |
| MB | 2.824 | 0.246 | 2.415 | 3.983 | 2.815 | 2.775 | 2.873 |
| SIGNAL | 2.919 | 0.442 | 2.351 | 4.351 | 2.815 | 2.831 | 3.007 |
GET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 2.853 | 0.286 | 2.351 | 4.319 | 2.815 | 2.796 | 2.910 |
| BP | 2.819 | 0.237 | 2.431 | 3.471 | 2.799 | 2.772 | 2.866 |
| MEMB | 2.810 | 0.274 | 2.303 | 3.743 | 2.783 | 2.755 | 2.865 |
| MB | 2.830 | 0.239 | 2.399 | 3.903 | 2.815 | 2.782 | 2.878 |
| SIGNAL | 2.929 | 0.441 | 2.367 | 4.383 | 2.815 | 2.841 | 3.017 |
##### Summary
In this read-heavy scenario, QSBR has the best throughput, with BP second; however, QSBR is less stable at the high GET and SET latency percentiles. MEMB and MB perform very similarly.
Among all flavors, SIGNAL should not be used: it is clearly worst across several throughput and latency metrics, especially latency in the extreme cases.
#### Read/write ratio 1:1: a session-store-like workload, reads and writes roughly equal
##### Throughput
SET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 133086.415 | 46847.384 | 0.000 | 163293.610 | 148770.615 | 123744.048 | 142428.782 |
| BP | 155610.583 | 2395.181 | 152699.060 | 170369.850 | 155087.675 | 155132.933 | 156088.233 |
| MEMB | 155501.622 | 1163.124 | 153110.860 | 159132.270 | 155344.040 | 155269.671 | 155733.574 |
| MB | 155821.162 | 2447.141 | 143412.070 | 168457.980 | 155811.510 | 155333.149 | 156309.174 |
| SIGNAL | 156319.059 | 2753.169 | 153813.820 | 169675.580 | 155630.615 | 155770.018 | 156868.099 |
GET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 133086.415 | 46847.384 | 0.000 | 163293.610 | 148770.615 | 123744.048 | 142428.782 |
| BP | 155610.583 | 2395.181 | 152699.060 | 170369.850 | 155087.675 | 155132.933 | 156088.233 |
| MEMB | 155501.622 | 1163.124 | 153110.860 | 159132.270 | 155344.040 | 155269.671 | 155733.574 |
| MB | 155821.162 | 2447.141 | 143412.070 | 168457.980 | 155811.510 | 155333.149 | 156309.174 |
| SIGNAL | 156319.059 | 2753.169 | 153813.820 | 169675.580 | 155630.615 | 155770.018 | 156868.099 |
##### Average latency
SET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 9.730 | 88.360 | 0.000 | 888.601 | 0.677 | -7.891 | 27.351 |
| BP | 0.642 | 0.002 | 0.637 | 0.647 | 0.642 | 0.641 | 0.642 |
| MEMB | 0.642 | 0.002 | 0.638 | 0.646 | 0.642 | 0.642 | 0.642 |
| MB | 0.642 | 0.002 | 0.636 | 0.646 | 0.642 | 0.641 | 0.642 |
| SIGNAL | 0.641 | 0.001 | 0.638 | 0.645 | 0.642 | 0.641 | 0.642 |
GET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 2.678 | 19.652 | 0.000 | 197.910 | 0.655 | -1.242 | 6.597 |
| BP | 0.641 | 0.002 | 0.637 | 0.647 | 0.642 | 0.641 | 0.642 |
| MEMB | 0.642 | 0.002 | 0.638 | 0.646 | 0.642 | 0.641 | 0.642 |
| MB | 0.641 | 0.002 | 0.636 | 0.646 | 0.641 | 0.641 | 0.642 |
| SIGNAL | 0.641 | 0.001 | 0.638 | 0.644 | 0.641 | 0.641 | 0.641 |
##### p50 latency
SET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.455 | 0.141 | 0.007 | 0.511 | 0.503 | 0.427 | 0.483 |
| BP | 0.507 | 0.005 | 0.495 | 0.519 | 0.511 | 0.506 | 0.508 |
| MEMB | 0.509 | 0.004 | 0.495 | 0.519 | 0.511 | 0.508 | 0.510 |
| MB | 0.507 | 0.005 | 0.495 | 0.519 | 0.507 | 0.506 | 0.508 |
| SIGNAL | 0.508 | 0.005 | 0.495 | 0.519 | 0.511 | 0.507 | 0.509 |
GET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.454 | 0.141 | 0.007 | 0.503 | 0.495 | 0.426 | 0.482 |
| BP | 0.507 | 0.005 | 0.495 | 0.519 | 0.507 | 0.506 | 0.508 |
| MEMB | 0.508 | 0.005 | 0.495 | 0.519 | 0.511 | 0.507 | 0.509 |
| MB | 0.506 | 0.005 | 0.495 | 0.519 | 0.503 | 0.505 | 0.507 |
| SIGNAL | 0.507 | 0.005 | 0.495 | 0.519 | 0.511 | 0.506 | 0.508 |
##### p99 latency
SET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 14.427 | 101.888 | 0.007 | 1028.095 | 4.543 | -5.892 | 34.746 |
| BP | 1.451 | 0.039 | 1.375 | 1.527 | 1.455 | 1.443 | 1.459 |
| MEMB | 1.463 | 0.040 | 1.383 | 1.543 | 1.467 | 1.455 | 1.471 |
| MB | 1.455 | 0.045 | 1.383 | 1.543 | 1.455 | 1.447 | 1.464 |
| SIGNAL | 1.455 | 0.039 | 1.383 | 1.527 | 1.455 | 1.448 | 1.463 |
GET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 5.214 | 10.926 | 0.007 | 113.151 | 4.511 | 3.035 | 7.393 |
| BP | 1.450 | 0.040 | 1.375 | 1.527 | 1.451 | 1.442 | 1.457 |
| MEMB | 1.462 | 0.040 | 1.383 | 1.543 | 1.471 | 1.454 | 1.470 |
| MB | 1.455 | 0.046 | 1.383 | 1.543 | 1.455 | 1.446 | 1.464 |
| SIGNAL | 1.454 | 0.038 | 1.383 | 1.527 | 1.455 | 1.446 | 1.462 |
##### p99.9 latency
SET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 1824.150 | 17529.922 | 0.007 | 176160.767 | 7.759 | -1671.690 | 5319.990 |
| BP | 2.726 | 0.191 | 2.303 | 3.295 | 2.687 | 2.688 | 2.764 |
| MEMB | 2.690 | 0.212 | 2.351 | 3.343 | 2.655 | 2.648 | 2.732 |
| MB | 2.733 | 0.229 | 2.271 | 3.567 | 2.743 | 2.688 | 2.779 |
| SIGNAL | 2.685 | 0.201 | 2.303 | 3.247 | 2.639 | 2.645 | 2.725 |
GET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 124.821 | 901.113 | 0.007 | 8454.143 | 7.519 | -54.880 | 304.523 |
| BP | 2.726 | 0.187 | 2.319 | 3.279 | 2.703 | 2.688 | 2.763 |
| MEMB | 2.689 | 0.215 | 2.303 | 3.359 | 2.655 | 2.646 | 2.732 |
| MB | 2.734 | 0.227 | 2.319 | 3.519 | 2.735 | 2.689 | 2.780 |
| SIGNAL | 2.685 | 0.204 | 2.303 | 3.247 | 2.655 | 2.644 | 2.725 |
#### Summary
In this balanced scenario, BP, MEMB, MB, and SIGNAL all perform relatively well and stably, while QSBR performs very poorly and is unsuitable at this ratio: its throughput is lower and fluctuates widely (large standard deviation), its average latency is high, and its wide 95% confidence intervals make the results unreliable. Possible causes are memory reclamation blocking, or a measurement error on my part (the negative CI lower bounds hint at this).
#### Analyzing the flavors across different read/write ratios
QSBR: at a 1:1 read/write ratio its performance is highly unstable, yet at the extreme read-heavy 100:1 ratio its SET p99.9 latency is actually the best. It therefore suits read-intensive workloads that can tolerate write-latency fluctuations.
BP, MEMB, MB: these three are relatively stable across all ratios, with BP the best overall.
SIGNAL: its average read performance is close to the other flavors, but at the 10:1 and 100:1 ratios its high-percentile latency (especially p99.9) is clearly worse.
### Effect of thread count
#### Development environment
```bash
$ uname -a
Linux iantsai-P30-F5 6.11.0-26-generic #26~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 17 19:20:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPU family: 6
Model: 94
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
CPU(s) scaling MHz: 30%
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
BogoMIPS: 6799.81
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
ca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 s
s ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
art arch_perfmon pebs bts rep_good nopl xtopology nons
top_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm p
cid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
er xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpu
id_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexp
riority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2
smep bmi2 erms invpcid mpx rdseed adx smap clflushopt
intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida ara
t pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi m
d_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
```
Because this CPU is older, the overall multi-threaded throughput and latency figures will differ from those measured by other students, and the improvement from some implementations also differs in magnitude. So I first use bpftrace to record and aggregate the latency of the server's read/write system calls, to later check whether the results are related to syscall cost:
```
$ sudo bpftrace -e '
tracepoint:syscalls:sys_enter_read /comm == "redis-server"/ {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @read_latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}
tracepoint:syscalls:sys_enter_write /comm == "redis-server"/ {
    @wstart[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_write /@wstart[tid]/ {
    @write_latency_us = hist((nsecs - @wstart[tid]) / 1000);
    delete(@wstart[tid]);
}'
```
```
@read_latency_us:
[1] 2808001 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4) 1153948 |@@@@@@@@@@@@@@@@@@@@@ |
[4, 8) 25778 | |
[8, 16) 8368 | |
[16, 32) 4015 | |
[32, 64) 32 | |
[64, 128) 13 | |
[128, 256) 9 | |
[256, 512) 32 | |
[512, 1K) 2 | |
[1K, 2K) 2 | |
```
Last year's result was that, without CPU affinity set, the thread count should be 4. I first want to reproduce that experiment and examine how thread count affects throughput.
I set the write:read ratio to 1:100 and 1:10; material found online indicates cache workloads mostly fall into these two patterns, so only these two scenarios are tested.
>In [Memcached](https://openbenchmarking.org/test/pts/memcached)'s public benchmarks, about 43.9% of runs use a SET:GET = 1:100 configuration.
>[Google Cloud Memorystore](https://cloud.google.com/blog/products/databases/scaling-google-cloud-memorystore-for-high-performance) notes: "By default, the utility issues 10 get commands for every set command."
The tests below compare thread counts of 4 and 8 for the qsbr and memb flavors, using:
```
taskset -c 4-7 memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:100 --requests=20000
```
#### Read/write ratio 100:1: a cache-like workload, reads far outnumber writes
##### Throughput
SET Ops/sec

| Flavor | Mean| Std Dev | Min | Max | Median | 95% CI Lower| 95% CI Upper|
|--|--|--|--|--|--|--|-|
|qsbr-4t| 1911.112| 36.840| 1757.440| 2006.030| 1905.100| 1903.892| 1918.333|
|qsbr-8t| 1857.539| 47.483| 1687.700| 1965.850| 1853.320| 1848.232| 1866.845|
|memb-4t| 1904.987| 48.352| 1684.740| 2037.310| 1913.505| 1895.510| 1914.464|
|memb-8t| 1843.536| 42.711| 1684.790| 1949.790| 1852.665| 1835.164| 1851.907|
GET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|----------|-------------|----------|-------------|-------------|------------|---------------|---------------|
| qsbr-4t | 190160.503 | 3665.605 | 174869.310 | 199605.400 | 189562.675 | 189442.045 | 190878.962 |
| qsbr-8t | 184829.809 | 4724.670 | 167930.680 | 195606.950 | 184409.640 | 183903.774 | 185755.844 |
| memb-4t | 189551.082 | 4811.135 | 167635.600 | 202717.860 | 190398.130 | 188608.099 | 190494.064 |
| memb-8t | 183436.420 | 4249.845 | 167640.890 | 194008.960 | 184344.715 | 182603.450 | 184269.389 |
#### Observations
- Without CPU affinity set, 4 threads achieve higher mean and median throughput than 8 threads.
- qsbr beats memb on the mean, while memb has the better median; the minimum and standard deviation show memb's throughput is less stable, hence its larger standard deviation.
##### Average latency
SET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|----------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-4t | 1.459 | 0.059 | 1.357 | 1.681 | 1.449 | 1.447 | 1.470 |
| qsbr-8t | 1.407 | 0.061 | 1.316 | 1.699 | 1.395 | 1.395 | 1.419 |
| memb-4t | 1.501 | 0.128 | 1.405 | 2.567 | 1.474 | 1.475 | 1.526 |
| memb-8t | 1.487 | 0.091 | 1.387 | 1.954 | 1.469 | 1.469 | 1.504 |
GET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|----------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-4t | 1.042 | 0.013 | 1.012 | 1.122 | 1.040 | 1.039 | 1.045 |
| qsbr-8t | 1.072 | 0.022 | 1.040 | 1.165 | 1.079 | 1.068 | 1.077 |
| memb-4t | 1.043 | 0.025 | 1.018 | 1.187 | 1.035 | 1.038 | 1.048 |
| memb-8t | 1.078 | 0.022 | 1.052 | 1.162 | 1.069 | 1.073 | 1.082 |
##### p50 latency
SET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|----------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-4t | 1.221 | 0.110 | 1.007 | 1.423 | 1.247 | 1.200 | 1.243 |
| qsbr-8t | 1.054 | 0.046 | 1.015 | 1.295 | 1.047 | 1.045 | 1.063 |
| memb-4t | 1.317 | 0.061 | 1.135 | 1.695 | 1.319 | 1.305 | 1.329 |
| memb-8t | 1.162 | 0.082 | 1.047 | 1.431 | 1.147 | 1.146 | 1.178 |
GET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|----------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-4t | 0.958 | 0.007 | 0.935 | 0.975 | 0.959 | 0.957 | 0.960 |
| qsbr-8t | 0.993 | 0.012 | 0.967 | 1.015 | 0.999 | 0.990 | 0.995 |
| memb-4t | 0.941 | 0.014 | 0.839 | 0.967 | 0.943 | 0.938 | 0.944 |
| memb-8t | 0.981 | 0.012 | 0.951 | 1.015 | 0.975 | 0.979 | 0.984 |
#### Observations
- With 8 threads, SET latency is lower in both mean and median, but GET latency rises slightly.
- At p50, GET likewise does not achieve lower latency with 8 threads; its latency is higher than with 4 threads.
- qsbr's average latency is lower than memb's for both SET and GET.
##### p99 latency
SET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|----------|--------|---------|-------|--------|--------|----------------|----------------|
| qsbr-4t | 6.030 | 0.432 | 5.279 | 7.391 | 5.983 | 5.946 | 6.115 |
| qsbr-8t | 6.230 | 0.840 | 5.183 | 9.407 | 5.919 | 6.065 | 6.394 |
| memb-4t | 6.512 | 0.611 | 5.695 | 10.559 | 6.431 | 6.392 | 6.632 |
| memb-8t | 7.008 | 1.004 | 5.471 | 11.327 | 6.815 | 6.811 | 7.205 |
GET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|----------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-4t | 3.882 | 0.398 | 3.151 | 5.055 | 3.791 | 3.804 | 3.960 |
| qsbr-8t | 4.004 | 0.402 | 3.391 | 5.855 | 3.903 | 3.925 | 4.082 |
| memb-4t | 4.266 | 0.357 | 3.807 | 6.335 | 4.223 | 4.196 | 4.336 |
| memb-8t | 4.291 | 0.425 | 3.727 | 6.495 | 4.191 | 4.208 | 4.374 |
##### p99.9 latency
SET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|----------|--------|---------|-------|--------|--------|----------------|----------------|
| qsbr-4t | 9.682 | 1.119 | 8.319 | 13.631 | 9.375 | 9.462 | 9.901 |
| qsbr-8t | 11.010 | 1.314 | 8.511 | 15.743 | 10.975 | 10.753 | 11.268 |
| memb-4t | 10.423 | 1.753 | 8.895 | 24.831 | 10.111 | 10.080 | 10.767 |
| memb-8t | 12.050 | 1.328 | 9.599 | 17.279 | 11.871 | 11.789 | 12.310 |
GET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|----------|--------|---------|-------|--------|--------|----------------|----------------|
| qsbr-4t | 6.454 | 0.506 | 5.727 | 8.383 | 6.399 | 6.355 | 6.553 |
| qsbr-8t | 7.671 | 0.698 | 6.591 | 11.135 | 7.455 | 7.534 | 7.808 |
| memb-4t | 7.097 | 0.834 | 6.207 | 14.079 | 6.943 | 6.934 | 7.261 |
| memb-8t | 8.064 | 0.704 | 7.007 | 12.159 | 7.999 | 7.926 | 8.202 |
#### 觀察
- 可以看出在 p99 與 p99.9 的情況下,執行緒越多,平均延遲越高
#### 小結論
- 在讀寫佔比 100:1 且未設置 cpu affinity 的情況下,4 個執行緒不論在吞吐量或延遲上的表現皆優於 8 個執行緒。
### 設置 cpu affinity
去年報告中有個 review 問到:為什麼 thread-level cpu affinity 不是一個 thread 綁一個 CPU,而是兩個 thread 綁一個 CPU?去年負責專題的同學的回覆如下:
>因為我測試過每個 CPU 一個執行緒,但是效能不如一個 CPU 兩個執行緒,這樣也能透過 context switch 讓要進行 I/O 的執行緒休息,將 CPU 資源讓給另外一個執行緒。
但這段敘述並沒有實驗佐證,因此接下來要測試一個 CPU 兩個執行緒與一個 CPU 一個執行緒何者較好,並分析去年的結論是否正確。用以下程式碼將目前執行中的 thread 綁定(或限制)到指定的 CPU 核心集合中執行:
```diff
+ cpu_set_t cpuset;
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpu_id, &cpuset);
+ pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
```
以下分別針對 qsbr 與 memb 兩種 flavor 去探討
```
taskset -c 4-7 memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:100 --requests=20000 --json-out-file=file_name
```
#### 讀寫比例 100 : 1 模擬 Cache 情境,讀取遠多於寫入,在 memb flavor 下
##### 吞吐量
SET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|----------|---------|---------|---------|----------|----------------|----------------|
| memb-one-to-one-affinity | 1937.405 | 52.190 | 1823.790| 2097.200| 1930.500 | 1927.176 | 1947.634 |
| memb-two-to-one-affinity | 2018.488 | 72.174 | 1869.110| 2211.830| 2020.165 | 2004.342 | 2032.635 |
| memb-4t-no-affinity | 1904.987 | 48.352 | 1684.740| 2037.310| 1913.505 | 1895.510 | 1914.464 |
| memb-8t-no-affinity | 1843.536 | 42.711 | 1684.790| 1949.790| 1852.665 | 1835.164 | 1851.907 |
GET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|------------|-----------|-------------|-------------|------------|---------------|---------------|
| memb-one-to-one-affinity | 192776.628 | 5193.027 | 181471.580 | 208676.540 | 192089.445 | 191758.795 | 193794.462 |
| memb-two-to-one-affinity | 200844.686 | 7181.510 | 185980.900 | 220082.210 | 201011.445 | 199437.110 | 202252.262 |
| memb-4t-no-affinity | 189551.082 | 4811.135 | 167635.600 | 202717.860 | 190398.130 | 188608.099 | 190494.064 |
| memb-8t-no-affinity | 183436.420 | 4249.845 | 167640.890 | 194008.960 | 184344.715 | 182603.450 | 184269.389 |
##### 平均延遲
SET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|---------------|---------------|
| memb-one-to-one-affinity | 1.405 | 0.039 | 1.330 | 1.536 | 1.400 | 1.398 | 1.413 |
| memb-two-to-one-affinity | 1.332 | 0.037 | 1.278 | 1.524 | 1.326 | 1.325 | 1.339 |
| memb-4t-no-affinity | 1.501 | 0.128 | 1.405 | 2.567 | 1.474 | 1.475 | 1.526 |
| memb-8t-no-affinity | 1.487 | 0.091 | 1.387 | 1.954 | 1.469 | 1.469 | 1.504 |
GET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|---------------|---------------|
| memb-one-to-one-affinity | 1.036 | 0.015 | 1.014 | 1.084 | 1.032 | 1.033 | 1.039 |
| memb-two-to-one-affinity | 1.006 | 0.016 | 0.978 | 1.087 | 1.002 | 1.002 | 1.009 |
| memb-4t-no-affinity | 1.043 | 0.025 | 1.018 | 1.187 | 1.035 | 1.038 | 1.048 |
| memb-8t-no-affinity | 1.078 | 0.022 | 1.052 | 1.162 | 1.069 | 1.073 | 1.082 |
##### p50 延遲
SET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|---------------|---------------|
| memb-one-to-one-affinity | 1.153 | 0.084 | 1.007 | 1.327 | 1.159 | 1.137 | 1.170 |
| memb-two-to-one-affinity | 1.249 | 0.024 | 1.191 | 1.359 | 1.247 | 1.244 | 1.254 |
| memb-4t-no-affinity | 1.317 | 0.061 | 1.135 | 1.695 | 1.319 | 1.305 | 1.329 |
| memb-8t-no-affinity | 1.162 | 0.082 | 1.047 | 1.431 | 1.147 | 1.146 | 1.178 |
GET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|---------------|---------------|
| memb-one-to-one-affinity | 0.943 | 0.010 | 0.919 | 0.967 | 0.943 | 0.941 | 0.945 |
| memb-two-to-one-affinity | 0.932 | 0.013 | 0.903 | 0.967 | 0.927 | 0.929 | 0.934 |
| memb-4t-no-affinity | 0.941 | 0.014 | 0.839 | 0.967 | 0.943 | 0.938 | 0.944 |
| memb-8t-no-affinity | 0.981 | 0.012 | 0.951 | 1.015 | 0.975 | 0.979 | 0.984 |
##### p99 延遲
SET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|--------|--------|--------|---------------|---------------|
| memb-one-to-one-affinity | 5.553 | 0.285 | 4.799 | 6.943 | 5.535 | 5.497 | 5.609 |
| memb-two-to-one-affinity | 4.830 | 0.355 | 4.063 | 6.143 | 4.831 | 4.761 | 4.900 |
| memb-4t-no-affinity | 6.512 | 0.611 | 5.695 | 10.559 | 6.431 | 6.392 | 6.632 |
| memb-8t-no-affinity | 7.008 | 1.004 | 5.471 | 11.327 | 6.815 | 6.811 | 7.205 |
GET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|---------------|---------------|
| memb-one-to-one-affinity | 4.149 | 0.298 | 3.039 | 4.959 | 4.079 | 4.091 | 4.208 |
| memb-two-to-one-affinity | 3.275 | 0.392 | 2.495 | 4.671 | 3.343 | 3.198 | 3.352 |
| memb-4t-no-affinity | 4.266 | 0.357 | 3.807 | 6.335 | 4.223 | 4.196 | 4.336 |
| memb-8t-no-affinity | 4.291 | 0.425 | 3.727 | 6.495 | 4.191 | 4.208 | 4.374 |
##### p99.9 延遲
SET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|---------------|---------------|
| memb-one-to-one-affinity | 9.043 | 0.838 | 7.519 | 14.719| 8.959 | 8.879 | 9.207 |
| memb-two-to-one-affinity | 8.051 | 1.040 | 6.399 | 14.015| 7.887 | 7.848 | 8.255 |
| memb-4t-no-affinity |10.423 | 1.753 | 8.895 | 24.831|10.111 |10.080 |10.767 |
| memb-8t-no-affinity |12.050 | 1.328 | 9.599 | 17.279|11.871 |11.789 |12.310 |
GET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|--------|--------|---------------|---------------|
| memb-one-to-one-affinity | 7.057 | 0.619 | 5.951 | 12.031 | 6.991 | 6.936 | 7.178 |
| memb-two-to-one-affinity | 6.205 | 0.476 | 5.375 | 8.319 | 6.111 | 6.112 | 6.298 |
| memb-4t-no-affinity | 7.097 | 0.834 | 6.207 | 14.079 | 6.943 | 6.934 | 7.261 |
| memb-8t-no-affinity | 8.064 | 0.704 | 7.007 | 12.159 | 7.999 | 7.926 | 8.202 |
#### 使用 perf 觀察程式的行為
以下針對一個 CPU 兩個執行緒和一個 CPU 一個執行緒兩種情況,各取一次中位數的結果來做分析:
一個 CPU 兩個執行緒
```
Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':
71,780.04 msec task-clock # 1.653 CPUs utilized
560,343 context-switches # 7.806 K/sec
14,189 cpu-migrations # 197.673 /sec
2,025 page-faults # 28.211 /sec
257,627,916,414 cycles # 3.589 GHz (39.99%)
141,018,941,758 instructions # 0.55 insn per cycle (50.00%)
22,306,027,705 branches # 310.755 M/sec (49.92%)
474,877,129 branch-misses # 2.13% of all branches (49.98%)
39,701,370,753 L1-dcache-loads # 553.098 M/sec (49.92%)
3,265,404,792 L1-dcache-load-misses # 8.22% of all L1-dcache accesses (49.96%)
1,107,813,317 LLC-loads # 15.433 M/sec (40.05%)
6,557,980 LLC-load-misses # 0.59% of all LL-cache accesses (40.09%)
14,661,818,780 cache-references # 204.260 M/sec (40.08%)
172,241,796 cache-misses # 1.17% of all cache refs (40.10%)
43.418281953 seconds time elapsed
22.202794000 seconds user
49.473298000 seconds sys
```
一個 CPU 一個執行緒
```
Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':
71,423.46 msec task-clock # 1.599 CPUs utilized
333,882 context-switches # 4.675 K/sec
18,199 cpu-migrations # 254.804 /sec
1,626 page-faults # 22.766 /sec
254,850,020,885 cycles # 3.568 GHz (40.03%)
136,686,768,652 instructions # 0.54 insn per cycle (49.98%)
21,480,788,799 branches # 300.753 M/sec (49.94%)
467,898,042 branch-misses # 2.18% of all branches (49.95%)
38,495,992,811 L1-dcache-loads # 538.982 M/sec (50.05%)
3,143,432,121 L1-dcache-load-misses # 8.17% of all L1-dcache accesses (49.97%)
1,053,471,203 LLC-loads # 14.750 M/sec (40.05%)
5,857,429 LLC-load-misses # 0.56% of all LL-cache accesses (40.05%)
14,337,925,549 cache-references # 200.745 M/sec (40.03%)
161,865,927 cache-misses # 1.13% of all cache refs (39.98%)
44.655900137 seconds time elapsed
22.039534000 seconds user
49.579251000 seconds sys
```
#### 結論
- 可以觀察出在使用 memb 機制且以兩個執行緒綁定一個 CPU 的情況下,Get 和 Set 操作的平均與中位數吞吐量和延遲皆表現最好;相較於同為 8 個執行緒但無 cpu affinity 的配置,Get 和 Set 的吞吐量與延遲都改善了約 10%。memb 的實驗結果與去年的結論一致:兩個執行緒綁一個 CPU 會取得較好的效能。
- 二對一的模式在 p99 與 p99.9 的表現也是四種配置中最好的,亦即 tail latency 表現最佳。
- 一個 CPU 一個執行緒時 context switches 數量約比兩個執行緒低 40%,但 cpu migrations 的數量反而較多,而單次 cpu migration 的成本遠大於 context switch:遷移會導致 L1/L2 cache miss(需要重新 warm-up cache),也會清除 TLB;另一方面,兩個執行緒在同一個 CPU 上頻繁交替執行,也會因 cache conflict 而提高 cache miss 的數量。
#### 讀寫比例 100 : 1 模擬 Cache 情境,讀取遠多於寫入,在 qsbr flavor 下
##### 吞吐量
SET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|----------|---------|---------|---------|----------|----------------|----------------|
| qsbr-one-to-one-affinity | 1976.983 | 53.720 | 1858.990| 2088.910| 1976.620 | 1966.454 | 1987.512 |
| qsbr-two-to-one-affinity | 1926.101 | 63.233 | 1761.320| 2085.570| 1919.775 | 1913.707 | 1938.494 |
| qsbr-4t-no-affinity | 1911.112 | 36.840 | 1757.440| 2006.030| 1905.100 | 1903.892 | 1918.333 |
| qsbr-8t-no-affinity | 1857.539 | 47.483 | 1687.700| 1965.850| 1853.320 | 1848.232 | 1866.845 |
GET Ops/sec

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|-------------|----------|-------------|-------------|-------------|----------------|----------------|
| qsbr-one-to-one-affinity | 196714.781 | 5345.271 | 184974.190 | 207851.710 | 196678.310 | 195667.108 | 197762.454 |
| qsbr-two-to-one-affinity | 191651.843 | 6291.792 | 175255.650 | 207519.610 | 191022.075 | 190418.652 | 192885.034 |
| qsbr-4t-no-affinity | 190160.503 | 3665.605 | 174869.310 | 199605.400 | 189562.675 | 189442.045 | 190878.962 |
| qsbr-8t-no-affinity | 184829.809 | 4724.670 | 167930.680 | 195606.950 | 184409.640 | 183903.774 | 185755.844 |
##### 平均延遲
SET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-one-to-one-affinity | 1.296 | 0.049 | 1.231 | 1.514 | 1.286 | 1.287 | 1.306 |
| qsbr-two-to-one-affinity | 1.332 | 0.037 | 1.273 | 1.515 | 1.328 | 1.325 | 1.340 |
| qsbr-4t-no-affinity | 1.459 | 0.059 | 1.357 | 1.681 | 1.449 | 1.447 | 1.470 |
| qsbr-8t-no-affinity | 1.407 | 0.061 | 1.316 | 1.699 | 1.395 | 1.395 | 1.419 |
GET Average Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-one-to-one-affinity | 1.015 | 0.020 | 0.987 | 1.085 | 1.007 | 1.011 | 1.019 |
| qsbr-two-to-one-affinity | 1.036 | 0.018 | 1.015 | 1.120 | 1.030 | 1.032 | 1.039 |
| qsbr-4t-no-affinity | 1.042 | 0.013 | 1.012 | 1.122 | 1.040 | 1.039 | 1.045 |
| qsbr-8t-no-affinity | 1.072 | 0.022 | 1.040 | 1.165 | 1.079 | 1.068 | 1.077 |
##### p50 延遲
SET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-one-to-one-affinity | 1.001 | 0.033 | 0.975 | 1.191 | 0.991 | 0.994 | 1.007 |
| qsbr-two-to-one-affinity | 1.089 | 0.060 | 1.023 | 1.279 | 1.063 | 1.077 | 1.100 |
| qsbr-4t-no-affinity | 1.221 | 0.110 | 1.007 | 1.423 | 1.247 | 1.200 | 1.243 |
| qsbr-8t-no-affinity | 1.054 | 0.046 | 1.015 | 1.295 | 1.047 | 1.045 | 1.063 |
GET p50 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-one-to-one-affinity | 0.949 | 0.010 | 0.927 | 0.975 | 0.943 | 0.946 | 0.951 |
| qsbr-two-to-one-affinity | 0.965 | 0.013 | 0.935 | 0.999 | 0.959 | 0.962 | 0.967 |
| qsbr-4t-no-affinity | 0.958 | 0.007 | 0.935 | 0.975 | 0.959 | 0.957 | 0.960 |
| qsbr-8t-no-affinity | 0.993 | 0.012 | 0.967 | 1.015 | 0.999 | 0.990 | 0.995 |
##### p99 延遲
SET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-one-to-one-affinity | 4.837 | 0.464 | 4.223 | 7.135 | 4.831 | 4.746 | 4.928 |
| qsbr-two-to-one-affinity | 5.069 | 0.327 | 4.223 | 6.623 | 5.055 | 5.005 | 5.133 |
| qsbr-4t-no-affinity | 6.030 | 0.432 | 5.279 | 7.391 | 5.983 | 5.946 | 6.115 |
| qsbr-8t-no-affinity | 6.230 | 0.840 | 5.183 | 9.407 | 5.919 | 6.065 | 6.394 |
GET p99 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|-------|--------|----------------|----------------|
| qsbr-one-to-one-affinity | 3.555 | 0.530 | 2.719 | 5.151 | 3.559 | 3.451 | 3.659 |
| qsbr-two-to-one-affinity | 3.775 | 0.337 | 2.767 | 4.927 | 3.735 | 3.709 | 3.841 |
| qsbr-4t-no-affinity | 3.882 | 0.398 | 3.151 | 5.055 | 3.791 | 3.804 | 3.960 |
| qsbr-8t-no-affinity | 4.004 | 0.402 | 3.391 | 5.855 | 3.903 | 3.925 | 4.082 |
##### p99.9 延遲
SET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|--------|--------|----------------|----------------|
| qsbr-one-to-one-affinity | 7.569 | 1.141 | 6.079 | 13.823 | 7.391 | 7.345 | 7.793 |
| qsbr-two-to-one-affinity | 8.388 | 0.826 | 6.975 | 13.375 | 8.255 | 8.226 | 8.550 |
| qsbr-4t-no-affinity | 9.682 | 1.119 | 8.319 | 13.631 | 9.375 | 9.462 | 9.901 |
| qsbr-8t-no-affinity | 11.010 | 1.314 | 8.511 | 15.743 | 10.975 | 10.753 | 11.268 |
GET p99.9 Latency

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|--------|---------|-------|--------|--------|----------------|----------------|
| qsbr-one-to-one-affinity | 5.984 | 0.847 | 5.151 | 10.367 | 5.759 | 5.818 | 6.150 |
| qsbr-two-to-one-affinity | 6.854 | 0.480 | 5.727 | 9.535 | 6.863 | 6.760 | 6.948 |
| qsbr-4t-no-affinity | 6.454 | 0.506 | 5.727 | 8.383 | 6.399 | 6.355 | 6.553 |
| qsbr-8t-no-affinity | 7.671 | 0.698 | 6.591 | 11.135 | 7.455 | 7.534 | 7.808 |
#### 使用 perf 分析程式行為
一樣針對一個 CPU 兩個執行緒與一個 CPU 一個執行緒進行分析,分別取一次中位數的結果:
一個 CPU 兩個執行緒
```
70,013.67 msec task-clock # 2.401 CPUs utilized
741,501 context-switches # 10.591 K/sec
15,581 cpu-migrations # 222.542 /sec
3,627 page-faults # 51.804 /sec
249,258,869,962 cycles # 3.560 GHz (39.99%)
132,407,403,431 instructions # 0.53 insn per cycle (49.98%)
21,077,879,352 branches # 301.054 M/sec (50.03%)
478,702,322 branch-misses # 2.27% of all branches (50.05%)
35,215,497,531 L1-dcache-loads # 502.980 M/sec (50.12%)
3,281,644,729 L1-dcache-load-misses # 9.32% of all L1-dcache accesses (50.04%)
1,129,746,159 LLC-loads # 16.136 M/sec (40.06%)
7,744,776 LLC-load-misses # 0.69% of all LL-cache accesses (39.95%)
14,435,913,114 cache-references # 206.187 M/sec (39.92%)
185,804,120 cache-misses # 1.29% of all cache refs (39.95%)
29.158730419 seconds time elapsed
18.440255000 seconds user
51.487935000 seconds sys
```
一個 CPU 一個執行緒
```
Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':
67,887.06 msec task-clock # 1.507 CPUs utilized
375,487 context-switches # 5.531 K/sec
19,709 cpu-migrations # 290.320 /sec
3,209 page-faults # 47.270 /sec
240,616,502,442 cycles # 3.544 GHz (40.06%)
128,321,848,547 instructions # 0.53 insn per cycle (50.05%)
20,388,085,704 branches # 300.324 M/sec (50.00%)
468,964,574 branch-misses # 2.30% of all branches (50.05%)
34,179,214,077 L1-dcache-loads # 503.472 M/sec (50.01%)
3,167,434,498 L1-dcache-load-misses # 9.27% of all L1-dcache accesses (50.00%)
1,059,289,040 LLC-loads # 15.604 M/sec (39.97%)
5,442,227 LLC-load-misses # 0.51% of all LL-cache accesses (39.98%)
13,343,467,237 cache-references # 196.554 M/sec (39.96%)
151,742,665 cache-misses # 1.14% of all cache refs (40.02%)
45.045660022 seconds time elapsed
17.981155000 seconds user
50.111441000 seconds sys
```
#### 結論
- 可以觀察出在使用 qsbr 機制且以一個執行緒綁定一個 CPU 的情況下,Get 和 Set 操作的平均與中位數吞吐量和延遲皆表現最好,與去年得到的結論不同,原因可能是一個 CPU 兩個執行緒的 cache miss rate 較高,也可能與 qsbr 的機制有關,需要進一步實驗(待補)
- 但也能觀察出一個執行緒綁定一個 CPU 的情況下,雖然 p99.9 延遲的平均和中位數皆是最佳,但其 tail latency(單次最大值)也是最大的,原因可能與 qsbr 的機制有關,需要進一步實驗(待補)
:::danger
TODO: 將並行程式設計相關的素材放到其他筆記頁面
:::
## 引入 neco
>去年的報告並沒有詳細解釋該如何引入 neco 和各 API 該如何使用,且最後引入的結果也沒有顯著提昇效能,因此這邊我想從頭閱讀 neco 文件,了解該如何引入以及為什麼對效能的提昇有限
#### 開發環境
```bash
$ uname -a
Linux iantsai-P30-F5 6.11.0-26-generic #26~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 17 19:20:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPU family: 6
Model: 94
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
CPU(s) scaling MHz: 30%
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
BogoMIPS: 6799.81
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
ca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 s
s ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
art arch_perfmon pebs bts rep_good nopl xtopology nons
top_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm p
cid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
er xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpu
id_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexp
riority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2
smep bmi2 erms invpcid mpx rdseed adx smap clflushopt
intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida ara
t pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi m
d_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
```
### 什麼是 neco ?
neco 是一個以 coroutines 提供並行處理的跨平台 C 函式庫,可在多種作業系統上使用,目標是提供快速的單執行緒並行處理,並提供網路與 I/O 相關的 API。希望引入 neco 作為低延遲的非同步處理引擎,讓每個 worker thread 能以 coroutine 管理多個 client 的 I/O 流程,減少 context switch 成本與執行緒數量,並改善 tail latency 表現。
neco 的具體表現可用 [bluebox](https://github.com/tidwall/bluebox) 展示,它支援 SET、GET、DEL 和 PING 操作,較單執行緒的 Redis 有更高的資料吞吐量和更低的延遲,可藉此評估 M:N threading model 對於 Redis 並行化的助益。
在我的設備中對 bluebox 與原始 valkey 各做 30 次測試並比較效能,以下先列出 valkey 的結果:
```
ALL STATS
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 1031.52 --- --- 1.93272 1.89500 3.99900 5.59900 79.48
Gets 102638.44 0.00 102638.44 1.92741 1.90300 3.90300 4.89500 3997.68
Waits 0.00 --- --- --- --- --- --- ---
Totals 103669.95 0.00 102638.44 1.92747 1.90300 3.90300 4.89500 4077.15
```
接著是 bluebox 的結果:
```
ALL STATS
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 1663.52 --- --- 1.19864 1.02300 3.00700 6.43100 128.18
Gets 165524.81 3870.41 161654.40 1.19570 1.02300 2.95900 5.69500 6575.55
Waits 0.00 --- --- --- --- --- --- ---
Totals 167188.34 3870.41 161654.40 1.19573 1.02300 2.97500 5.72700 6703.73
```
透過這個專案可以觀察出 neco 配合單執行緒帶來的並行增益相當可觀,吞吐量增幅可達約 60%,因此也期望能使用這個套件提昇 mt-redis 的吞吐量。
先閱讀以下的 neco 文件,了解 neco 的使用方法
> [neco doc](https://github.com/tidwall/neco/blob/main/docs/API.md#generators)
先從多執行緒的範例開始切入:
```c
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include "neco.h"
#include "neco.c"
void coro1(int argc, void *argv[]) {
// This coroutine is running in a different scheduler than coro2.
int rd = *(int*)argv[0];
int wr = *(int*)argv[1];
int val;
neco_read(rd, &val, sizeof(int));
printf("coro1: %d\n", val);
neco_write(wr, &(int){ 2 }, sizeof(int));
}
void coro2(int argc, void *argv[]) {
// This coroutine is running in a different scheduler than coro1.
int rd = *(int*)argv[0];
int wr = *(int*)argv[1];
int val;
neco_write(wr, &(int){ 1 }, sizeof(int));
neco_read(rd, &val, sizeof(int));
printf("coro2: %d\n", val);
}
void *runtime1(void *arg) {
int *pipefds = arg;
neco_start(coro1, 2, &pipefds[0], &pipefds[3]);
return 0;
}
void *runtime2(void *arg) {
int *pipefds = arg;
neco_start(coro2, 2, &pipefds[2], &pipefds[1]);
return 0;
}
int main() {
int pipefds[4];
pipe(&pipefds[0]);
pipe(&pipefds[2]);
pthread_t th1, th2;
pthread_create(&th1, 0, runtime1, pipefds);
pthread_create(&th2, 0, runtime2, pipefds);
pthread_join(th1, 0);
pthread_join(th2, 0);
return 0;
}
```
這個範例主要用到三個 API:`neco_start`、`neco_read` 和 `neco_write`
#### neco_start
是 neco coroutine 函式庫的入口點 API,用來在當前執行緒中建立並執行一個 coroutine 調度器(scheduler),並從指定的 coroutine 函式開始執行。此函式將負責初始化內部 runtime,如果尚未存在 scheduler,則會自動建立一個。
> 這邊的排程器是使用 [sco](https://github.com/tidwall/sco) 排程器,是 neco 作者做出的 fair and deterministic scheduler
```cpp
int neco_start(void (*coroutine)(int argc, void *argv[]), int argc, ...);
```
第一個參數須傳入 coroutine 函式指標,而 coroutine 函式又須滿足以下格式
```cpp
void coroutine(int argc, void *argv[]) {
char *msg = argv[0];
printf("%s\n", msg);
}
```
coroutine 函式的第一個參數 argc 是傳入的參數數量(對應 argv 的長度),第二個參數 argv 則是參數陣列。
#### neco_read
non-blocking I/O 包裝函式,用於在 coroutine 環境中從指定的 file descriptor (fd) 讀取資料。這個 API 以 POSIX 的 `read()` 系統呼叫實作,但額外加上 coroutine 調度支援,能讓當前 coroutine 在資料尚未可讀時主動讓出(yield)執行權。
```cpp
ssize_t neco_read(int fd, void *data, size_t nbytes);
```
第一個參數是 fd,第二個參數是存放讀取資料的指標,第三個參數是要讀取的 byte 數
#### neco_write
對 POSIX `write()` 系統呼叫的包裝函式,設計目的是在 coroutine 執行環境中提供非阻塞的寫入操作。當 fd 暫時無法寫入時,該 coroutine 會自動讓出(yield),等待可寫再繼續執行,參數跟邏輯都跟 `neco_read` 相同。
```cpp
ssize_t neco_write(int fd, const void *data, size_t nbytes);
```
因此我們再回頭看多執行緒的範例:程式會建立兩個執行緒,分別透過兩個 pipe 傳遞資料。根據上面 API 的行為,範例程式的輸出應該是:
>coro1: 1
coro2: 2
因為當 th1 先執行時,coro1 會呼叫 `neco_read` 查看該 fd 有無可讀取的資料,但還沒有,因此會主動讓出;th2 在 coro2 內的 `neco_read` 也一樣,必須等到 coro1 寫入 pipe 後才能讀取,因此結果會如上
以下是針對 API 的參數去做測試,再次確認 API 使用方法
```diff
void *runtime1(void *arg) {
+ int temp = 123;
- neco_start(coro1, 2, &pipefds[0], &pipefds[3]);
+ neco_start(coro1, 3, &pipefds[0], &pipefds[3], &temp);
}
```
>coro1: temp: 123
coro1: 1
coro2: 2
### 該如何將 neco 引入 mt-redis?
## 分析限制 Redis 並行的因素
為什麼 redis/valkey 使用 single thread?
為此我閱讀 [The Engineering Wisdom Behind Redis’s Single-Threaded Design](https://community.aws/content/2tfQijlXleV5iM6hDkSGkMJlpcd/the-engineering-wisdom-behind-redis-s-single-threaded-design),理解 redis 的設計與考量點。
[The Engineering Wisdom Behind Redis’s Single-Threaded Design 文章整理](/TX3sCh38Qc21i7nyYpXgyQ)
## TODO: 使 Valkey 具備並行處理能力
> 將 [yeh-sudo/mt-redis](https://github.com/yeh-sudo/mt-redis) 的成果用於 [Valkey](https://valkey.io/),確保 userspace RCU 發揮其作用
[閱讀 valkey 的資料結構](/wCk1UR9ESrO337Y5a2PbIg)
### mt-redis 中,將 urcu 加到 redis 的關鍵改動
可以搭配 yeh-sudo 寫的 [mt-redis 運作原理](https://hackmd.io/-6FWkmVkRRqrt7QZxNRofA?view#%E5%AD%97%E5%85%B8-q_dict)
**以下討論 mt-redis 中,將 rcu 加到 redis 的關鍵改動**。
#### 通用雜湊表使用 rculfhash (lock-free hash table)
redis/valkey 原本的雜湊表是針對單執行緒所設計:為了避免擴容時長時間的 rehashing,採用 incremental rehashing,在增刪查改操作中逐步將舊雜湊表的資料搬移至新表,以保有高可用性。
然而 mt-redis 的通用雜湊表改用 rculfhash,而 rculfhash 本身就支援動態調整大小,因此 mt-redis 中就不需手動做 incremental rehashing。
rculfhash 提供 `cds_lfht_node` 作為將 hash table 串連起來的媒介,並可使用 `caa_container_of` 巨集從節點位址取得包含該節點的整個結構體(如 dictEntry)的位址。
因此在 redis 中幾個重要的結構中都能看到考量到 rculfhash 的設計。
- `struct redisDB` 中存於通用字典 (`q_dict`)。
- `struct q_dict` 是 Redis 用來管理 key-value 的核心結構,其中成員就包含`struct cds_lfht *table` 。
- `struct dictEntry` 是具體儲存每個 key-value pair 的單位,其中成員有用以串起雜湊表的 `cds_lfht_node` node,另外還有用於推遲回收的 `rcu_head rcu_head`,以及指向存放 value 的 robj(`struct redisObject`)結構的 `void *ptr`
也因此以下展示 mt-redis 插入 key-value 流程,也顯示了 redisDb、q_dict、q_dictEntry 以及 robj 之間的關係
`q_dictAdd()` 的步驟:
1. rcu_read_lock()
2. 建立 dictEntry:將要插入的 key 轉成 sds 格式 (Simple Dynamic String),`dictEntry.key` 存 key,`dictEntry.val` 存含有 value 的 robj 指標
3. 使用 `cds_lfht_add_replace()`,將 dictEntry 加入到 q_dict.table 中
4. rcu_read_unlock()
圖示:
```
+-----------------+
| redisDb |
+------------------+
| q_dict *dict --> +------------------------+
| | struct q_dict | (q_dict 是雜湊表,存放在 redisDb 的 dict 成員中)
| | +-------------------+ |
| | | int size | |
| | | cds_lfht *table |
| | +-------------------+ |
+------------------+ |
|
| (使用 cds_lfht_add_replace() 加到 table 中)
v
+------------------------+
| struct q_dictEntry |
+------------------------+
| +------------------+ |
| | cds_lfht_node | | -----> 雜湊表節點
| | node | |
| +------------------+ |
| +------------------+ |
| | struct rcu_head | | -----> call_rcu 會使用到,延後會回收
| | rcu_head | |
| +------------------+ |
| +------------------+ |
| | void *key | | -----> key 會轉成 sds 類型所保存
| +------------------+ |
| +------------------+ |
| | void *val | | -----> value 將以 robj 形式所保存
| +------------------+ |
+------------------------+
|
v
+---------------------+
| robj | (robj 是 Redis 中的資料對象)
+---------------------+
| type | encoding |
| ptr | |-----> sds 或其它資料類型
+---------------------+
```
#### 使用 concurrent queue 處理多個請求
mt-redis 在 `struct redisServer` 中加入 userspace RCU 函式庫的 wait-free concurrent queue (wfcq) 頭尾結構,以處理多個請求。
```c
struct redisServer {
//multi-thread redis support
pthread_mutex_t command_request_lock;
int threads_num; /* num of worker threads.*/
list *command_requests;
struct cds_wfcq_head command_requests_head;
struct cds_wfcq_tail command_requests_tail;
int socketpairs[2];
q_eventloop qel;
};
```
在 `initServer()` 中對通用雜湊表初始化,並初始化 q_worker 與 q_master:
```c
// server.c
extern struct redisServer server;
void initServer(void)
{
...
for (j = 0; j < server.dbnum; j++) {
server.db[j].dict = zmalloc(sizeof(q_dict));
server.db[j].dict->table = cds_lfht_new(1, 1, 0, CDS_LFHT_AUTO_RESIZE|CDS_LFHT_ACCOUNTING, NULL);
server.db[j].expires = zmalloc(sizeof(q_dict));
server.db[j].expires->table = cds_lfht_new(1, 1, 0, CDS_LFHT_AUTO_RESIZE|CDS_LFHT_ACCOUNTING, NULL);
}
    q_workers_init(server.threads_num);
    q_master_init();
    aeCreateFileEvent(server.el, server.socketpairs[0], AE_READABLE,
                      server_event_process, NULL);
    ...
}
```
#### mt-redis 的三種執行緒與管理客戶端連線、請求的資料結構
mt-redis 中的執行緒角色分為 master、worker 與 server 執行緒:master 負責管理 client 連線,worker 負責並行的讀操作,server 則負責序列化的寫入操作;而 `struct client` 負責管理客戶端連線和請求處理。
#### struct client
這結構定義了每個客戶端連接的完整狀態
```c
/* With multiplexing we need to take per-client state.
* Clients are taken in a linked list. */
typedef struct client {
q_eventloop *qel; /* Eventloop of the worker's thread which handles this client.*/
int curidx; /* The worker idx that this client current belogn to.*/
uint64_t id; /* Client incremental unique ID. */
int fd; /* Client socket. */
redisDb *db; /* Pointer to currently SELECTed DB. */
sds querybuf; /* Buffer we use to accumulate client queries. */
int argc; /* Num of arguments of current command. */
robj **argv; /* Arguments of current command. */
struct redisCommand *cmd, *lastcmd; /* Last command executed. */
int flags; /* Client flags: CLIENT_* macros. */
...
```
結構成員介紹
- `qel` 指向處理該客戶端的事件循環
- `curidx` 記錄當前所屬的 worker 執行緒 ID
- `fd` 為客戶端 socket,`querybuf` 用於累積接收到的命令
- `argc`、`argv` 儲存解析後的命令參數,`cmd` 指向當前執行的命令
---
q_command_request 藉由 q_node 掛到 redisServer 中的 wfcq 上
```c
//command forwarded from worker thread to server thread.
typedef struct q_command_request {
int type;
struct client* c;
struct cds_wfcq_node q_node;
} q_command_request;
```
#### Master 執行緒
主要負責接受新的客戶端連接與設置監聽 socket 並處理新連接。
```c
static int setup_master(void) {
int j;
for (j = 0; j < server.ipfd_count; j++) {
        aeCreateFileEvent(master.qel.el, server.ipfd[j], AE_READABLE, acceptTcpHandler, NULL);
}
    if (server.sofd > 0)
        aeCreateFileEvent(master.qel.el, server.sofd, AE_READABLE,
                          acceptUnixHandler, NULL);
return C_OK;
}
```
`acceptTcpHandler()` 在 `while (max--)` 迴圈中會呼叫 q_worker 的 `dispatch_conn_new()`。
當 master 執行緒接受新連線時,會以輪詢(round-robin)方式分發給不同 worker;struct client 則是在 worker 執行緒中建立,這部分以下說明。
```c
// in q_master.c
static void acceptCommonHandler(int fd, int flags, char *ip)
{
...
dispatch_conn_new(fd);
...
}
```
`dispatch_conn_new()` 將:
1. 創建一個 connswapunit 結構來封裝 socket descriptor
2. 選擇一個 worker 執行緒(輪詢方式)
3. 將 connswapunit 推入該 worker 的隊列
4. 通過 socket pair 通知 worker 執行緒有新連接
```c
// q_worker.c
void dispatch_conn_new(int sd) {
struct connswapunit *su = csui_new();
char buf[1];
q_worker *worker;
int tid = (last_worker_thread + 1) % server.threads_num;
worker = darray_get(&workers, (uint32_t) tid);
last_worker_thread = tid;
su->num = sd;
su->data = NULL;
csul_push(worker, su);
buf[0] = 'c';
// to be handled by worker's worker_thread_event_process loop
write(worker->socketpairs[0], buf, 1);
}
```
#### worker 負責並行的讀操作
connswapunit 結構中的 q_node 用於 worker 執行緒的 wait-free concurrent queue。
q_worker 中有兩個 queue:
- q_head/q_tail:用於接收新連線
- r_head/r_tail:用於接收從 server 執行緒返回的 client
```c
typedef struct q_worker {
int id;
q_eventloop qel;
int socketpairs[2]; /*0: used by master/server thread, 1: belong to worker thread.*/
/* queue for replication thread to switch the client back to this worker's thread.*/
struct cds_wfcq_head r_head;
struct cds_wfcq_tail r_tail;
struct cds_wfcq_tail q_tail;
struct cds_wfcq_head q_head;
q_worker_stats stats;
} q_worker;
struct connswapunit {
int num;
void *data;
struct cds_wfcq_node q_node;
};
```
```c
static int setup_worker(q_worker *worker) {
int status = aeCreateFileEvent(worker->qel.el, worker->socketpairs[1], AE_READABLE,
                                   worker_thread_event_process, worker);
    ...
}
```
`worker_thread_event_process` 負責處理來自 master/server 執行緒的通知。
當 worker 執行緒收到字元 'c' 的通知時:
1. 從隊列中取出 connswapunit
2. 提取 socket descriptor
3. 建立 client instance,並將該 client 綁定到此 worker 的 thread id
```c
// worker threads to master/server thread communication handler function
static void
worker_thread_event_process(aeEventLoop *el, int fd, void *privdata, int mask) {
if (read(fd, buf, 1) != 1) {
buf[0] = 'c';
}
switch (buf[0]) {
case 'c':
csu = csul_pop(worker);
sd = csu->num;
zfree(csu);
...
c = createClient(&worker->qel, sd);
c->curidx = worker->id;
}
```
Worker 執行緒通過 `worker_readQueryFromClient` 讀取客戶端數據,然後使用 `worker_processInputBuffer` 處理命令。
處理命令時,發現如果有需要在 server 執行緒處理的命令,client 將被轉移到 server 執行緒,這部分邏輯見 `int worker_processCommand(client *c)`。
處理步驟
- 使用 `q_createCommandRequest(c)` 創建一個命令請求結構,封裝 client 訊息
- 使用 `cds_wfcq_enqueue()` 將命令請求加入到 server 執行緒的全局命令隊列中
- 設置 `c->flags |= CLIENT_JUMP` 標誌,表示 client 正在執行緒間跳轉
- 通過 `write(server.socketpairs[1], buf, 1)` 向 server 執行緒發送通知訊號。
```c
// server.c
/* worker's version of processCommand. */
int worker_processCommand(client *c) {
struct q_command_request *r;
...
c->cmd = c->lastcmd = lookupCommand(c->argv[0]->ptr);
// CMD is not readonly
unlinkClientFromEventloop(c);
r = q_createCommandRequest(c);
cds_wfcq_enqueue(&server.command_requests_head, &server.command_requests_tail, &r->q_node);
c->flags |= CLIENT_JUMP;
buf[0] = 'r'; /* r stands for request from worker thread */
```
#### server 執行緒處理寫入操作
在 `server_event_process()` 中能看到 worker 執行緒到 server 執行緒的命令轉發機制。
`server_event_process()` 的步驟:
1. 藉由 `__cds_wfcq_dequeue_blocking` 來從隊列中取出請求。
2. 獲取客戶端訊息並設置事件循環
3. 釋放命令請求結構 (`q_freeCommandRequest`)
4. 使用 `server_processCommand()` 處理命令
5. 判斷客戶端類型:若是 slave 客戶端或管理命令,就保持在 server 執行緒上
6. 否則 dispatch 回 worker 處理
```c
void server_event_process(struct aeEventLoop *eventLoop, int fd, void *clientData, int mask) {
char buf[1];
client *c;
int from;
q_command_request *r;
struct cds_wfcq_node *qnode;
// Number of request in command_requests NOT 1:1 with num of bytes inside socketpairs
qnode = __cds_wfcq_dequeue_blocking(&server.command_requests_head, &server.command_requests_tail);
//the sockerpair buf can be overflowed when num of connections increase, so
//Num of requests in command_requests list is NOT 1:1 with num of bytes in socketpair buf, the code
//has to deal with this mismatch.
while(read(fd, buf, 1) == 1 || qnode != NULL ){
if (qnode == NULL) break;
r = caa_container_of(qnode, struct q_command_request, q_node);
c = r->c;
from = c->curidx;
c->qel = &server.qel;
c->curidx = -1;
q_freeCommandRequest(r);
server_processCommand(c);
if (c->flags & CLIENT_SLAVE || c->cmd->flags & CMD_ADMIN) {
keep_slave_to_server_thread(c, from);
} else {
dispatch_to_worker(c, from);
}
qnode = __cds_wfcq_dequeue_blocking(&server.command_requests_head, &server.command_requests_tail);
}
}
```
因此從以上能看出,mt-redis 將讀取操作保留在 worker 執行緒處理,而寫入操作統一由 server 執行緒處理。