# Linux Kernel Project: A Highly Concurrent Valkey Implementation

> Contributors: tsaiiuo, eleanorLYJ
> [Github](https://github.com/jserv/mt-redis)
> [Project presentation video](https://www.youtube.com/live/AI8shjBfUoE?si=ZFkCHwh-ucIHP8m7&t=8935)

### Q: Knowing how the URCU flavors differ at p50, p99, and p99.9, what improvements does that enable?

It lets us pick a flavor that matches the application's needs. If the application is sensitive to tail latency, choose the URCU flavor with the lowest high-percentile (p99, p99.9) latency; if typical response time matters most, choose based on the mean and the median (p50).

### Q: How do we explain the performance impact of each URCU flavor?

Start from each flavor's underlying implementation. With the qsbr flavor, for instance, we can reduce the frequency at which writers update to raise reader performance, which is why qsbr delivers the best GET throughput and latency at read/write ratios of 100:1 and 10:1. Different read/write mixes also interact with each flavor differently: at a 1:1 ratio, qsbr's throughput and latency actually fall below the other flavors. A plausible cause is that with reads and writes heavily interleaved, qsbr's grace period can barely advance, so write latency explodes and overall throughput drops — the worst SET latency observed in our tests reached 888 ms. This again confirms that a URCU flavor must be matched to its usage scenario.

### Reviewed by `MikazukiHikari`

How should we assess the machine's own influence on throughput, given that different machines show different throughput?

> That is itself a question requiring experiments. Many aspects of the hardware (CPU core count, memory bandwidth, I/O devices) and of the system software (scheduler, memory management) directly affect latency and throughput. Between the two machines used in this experiment, the most obvious differences are the CPU core count and whether the CPU has P-cores. [name=tsaiiuo]

### Reviewed by `Andrewtangtang`

On the effect of thread count: do you have a hypothesis for why 4 threads beat 8 threads on every metric?

> perf shows that with 4 threads the number of context switches is lower than with 8 threads (nearly 2x lower), while the number of cpu-migrations is higher (by roughly a quarter). The results suggest that the savings from fewer context switches outweigh the cost of the extra CPU migrations, which is why 4 threads perform better. [name=tsaiiuo]

> ## TODO: Reproduce last year's experiments and understand the technical challenges of parallelizing Redis
> [2024 report](https://hackmd.io/@sysprog/B1W2HKTER)
> Reproduce the experiments and interpret the results
> Discuss how Redis benefits cloud computing and networked service architectures
> Explain why userspace RCU helps raise Redis's degree of concurrency
> Analyze the factors that limit concurrency in Redis
> Summarize the issues an integration of userspace RCU into Redis must address
> Evaluate the use and tuning of neco

### Reading [introduce sys_membarrier(): process-wide memory barrier (v6)](https://lwn.net/Articles/370146/?utm_source=chatgpt.com)

To use the memb flavor we first need to understand what the `sys_membarrier()` system call does. From its introduction:

> Both the signal-based and the sys_membarrier userspace RCU schemes permit us to remove the memory barrier from the userspace RCU rcu_read_lock() and rcu_read_unlock() primitives, thus significantly accelerating them.
> These memory barriers are replaced by compiler barriers on the read-side, and all matching memory barriers on the write-side are turned into an invokation of a memory barrier on all active threads in the process. By letting the kernel perform this synchronization rather than dumbly sending a signal to every process threads (as we currently do), we diminish the number of unnecessary wake ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such barrier anyway, because these are implied by the scheduler context switches.

`sys_membarrier()` sends an IPI (Inter-Processor Interrupt) to every running thread of the calling process, making each thread execute a real memory barrier on its own CPU. Threads that are not currently running, or that belong to other processes, are unaffected.

To explain further, the article gives an example:

> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

The two threads originally look like this:

```
Thread A                   Thread B
prev mem accesses          prev mem accesses
smp_mb()                   smp_mb()
follow mem accesses        follow mem accesses
```

After introducing `sys_membarrier()`, this becomes:

```
Thread A                   Thread B
prev mem accesses          prev mem accesses
sys_membarrier()           barrier()
follow mem accesses        follow mem accesses
```

In `Thread B` (the reader), a compiler barrier replaces the original `smp_mb()`, which saves considerable cost. In `Thread A` (the writer), `sys_membarrier()` replaces `smp_mb()`: it sends an IPI to every active thread in the process, forcing each to execute one full memory barrier.

After this change there are two possible situations:

1. Thread A and Thread B do not run concurrently:

```
Thread A                   Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                           prev mem accesses
                           barrier()
                           follow mem accesses
```

Here `Thread B`'s accesses are only guaranteed to be weakly ordered, but that is fine: since the two threads do not overlap, `Thread A` does not care about ordering against them.

2. Thread A and Thread B run concurrently:

```
Thread A                   Thread B
prev mem accesses          prev mem accesses
sys_membarrier()           barrier()
follow mem accesses        follow mem accesses
```

`Thread B`'s barrier() only guarantees ordering within its own program, but at the same moment `Thread A` is calling `sys_membarrier()`, so the kernel sends an IPI to every running thread and each executes a hardware memory barrier on its core. This "upgrades" B's compiler barrier into a genuine full `smp_mb()`, enforcing ordering across cores. Threads that are not running (descheduled by the scheduler) need do nothing, because a context switch already implies a barrier:

> Each non-running process threads are intrinsically serialized by the scheduler.

The author also measured the cost of `sys_membarrier()` in its expedited and non-expedited variants (on an Intel Xeon E5405):

- Expedited membarrier: 10,000,000 calls, roughly 2–3 µs per call
- Non-expedited membarrier: 1,000 calls taking about 16 s total, i.e. roughly 16 ms per call — about 5,000–8,000x slower than the expedited mode

> Add "int expedited" parameter, use synchronize_sched() in the non-expedited case. Thanks to Lai Jiangshan for making us consider seriously using synchronize_sched() to provide the low-overhead membarrier scheme.

> Check num_online_cpus() == 1, quickly return without doing nothing.
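The write-side call just described can be exercised directly through the raw system call. Below is a minimal sketch (an illustration only, not mt-redis code; `issue_process_barrier` is a made-up helper name, and the private expedited commands assume Linux 4.14 or later): it queries which membarrier commands the kernel supports, registers the process for the private expedited mode that urcu-memb relies on, then issues one process-wide barrier.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper: glibc provides no membarrier() stub. */
static long membarrier(int cmd, unsigned int flags)
{
    return syscall(__NR_membarrier, cmd, flags);
}

/* Issue one process-wide barrier on every running thread, the way the
 * urcu-memb write side does. Returns 0 on success or when the kernel
 * lacks support (nothing to do), -1 on an unexpected error. */
static int issue_process_barrier(void)
{
    long cmds = membarrier(MEMBARRIER_CMD_QUERY, 0); /* bitmask of supported commands */
    if (cmds < 0)
        return errno == ENOSYS ? 0 : -1;     /* pre-4.3 kernel: no membarrier at all */
    if (!(cmds & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
        return 0;                            /* expedited mode needs Linux >= 4.14 */
    /* A process must register once before the expedited command is allowed. */
    if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) != 0)
        return -1;
    /* The kernel IPIs each running thread of this process; each executes a
     * full memory barrier, upgrading readers' compiler-only barrier(). */
    return membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) == 0 ? 0 : -1;
}
```

liburcu performs a similar runtime check of its own (the "dynamic sys_membarrier check" measured below) and falls back to ordinary memory barriers when the kernel lacks support.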
The patch series also measured the effectiveness of `sys_membarrier()`. Operations in 10s, 6 readers, 2 writers:

> (what we previously had)
> memory barriers in reader: 973494744 reads, 892368 writes
> signal-based scheme: 6289946025 reads, 1251 writes

> (what we have now, with dynamic sys_membarrier check, expedited scheme)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme: 4316818891 reads, 503790 writes

> (dynamic sys_membarrier check, non-expedited scheme)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme: 8698725501 reads, 313 writes

First, the dynamic sys_membarrier check adds some overhead, lowering both read and write throughput by about 7%; since the numbers stay roughly on the same baseline, the default mechanism still keeps the dynamic check enabled.

With `sys_membarrier()` enabled in the expedited mode, throughput improves considerably: the reader executes only a compiler barrier, so read performance is far better than the original memory-barrier scheme, and on the writer side it is about 500x better than the signal-based scheme, though still slightly worse than the original memory-barrier scheme because of the extra `sys_membarrier()` call. In the non-expedited mode, read throughput is the highest of all: the writer sends no IPIs (it uses the lighter-weight membarrier fallback), and without IPI interference reads improve markedly. The trade-off is that the writer, lacking IPIs, relies on context switches or on a thread periodically visiting each CPU, so it has no immediacy: each `sys_membarrier()` call takes longer and its duration is less predictable.

> RCU and concurrent-programming reading notes: https://hackmd.io/@tMeEIUCqQSCiXUaa-3tcsg/r11N0uJbll/edit

### What do RCU callbacks do?

#### How they work

A callback is a function that is only invoked once a preset condition is met. The mechanism has two phases:

1. Registration: the callback is recorded in some framework or data structure, e.g.

```cpp
call_rcu(&p->rcu_head, reclaim_callback);
```

2. Invocation: only when the preset condition (event) occurs does the framework actually call the previously registered callback.

In RCU, callbacks are mainly used to reclaim resources, either one at a time or in batches. For a single reclaim, the flow is that the registered callback is invoked once all threads have passed through a quiescent state, as illustrated below:

![image](https://hackmd.io/_uploads/B15gsn_bgl.png)

### Reproducing the experiments

The goal is to determine which URCU flavor suits Redis best.
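Before moving on to the experiments: the registration/invocation split described in the callback section above can be modeled in a few lines of C. This is a deliberately simplified, single-threaded sketch — the `fake_`-prefixed names are invented, and real liburcu queues callbacks per thread and waits for an actual grace period — but it shows the essential property: nothing registered via `fake_call_rcu()` runs until the (simulated) grace period ends.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the kernel/liburcu convention: the rcu_head is embedded
 * inside the object that will eventually be reclaimed. */
struct rcu_head {
    struct rcu_head *next;
    void (*func)(struct rcu_head *);
};

static struct rcu_head *pending; /* callbacks waiting for the grace period */

/* Phase 1 - registration: queue the callback, run nothing yet. */
static void fake_call_rcu(struct rcu_head *head,
                          void (*func)(struct rcu_head *))
{
    head->func = func;
    head->next = pending;
    pending = head;
}

/* Phase 2 - invocation: once every reader has passed through a
 * quiescent state, drain the queue and run each callback. */
static void fake_grace_period_end(void)
{
    while (pending) {
        struct rcu_head *head = pending;
        pending = head->next;
        head->func(head);
    }
}
```

In real liburcu the second phase is driven by `synchronize_rcu()` or the `call_rcu` helper thread rather than an explicit call at a chosen point.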
#### Development environment

```bash
$ uname -a
Linux eleanor 6.11.0-24-generic #24~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 25 20:14:34 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   20
  On-line CPU(s) list:    0-19
Vendor ID:                GenuineIntel
  Model name:             12th Gen Intel(R) Core(TM) i7-12700
    CPU family:           6
    Model:                151
    Thread(s) per core:   2
    Core(s) per socket:   12
    Socket(s):            1
    Stepping:             2
    CPU(s) scaling MHz:   19%
    CPU max MHz:          4900.0000
    CPU min MHz:          800.0000
    BogoMIPS:             4224.00
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    512 KiB (12 instances)
  L1i:                    512 KiB (12 instances)
  L2:                     12 MiB (9 instances)
  L3:                     25 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-19
```

The experiments run on the P-cores only, using taskset to avoid the E-cores; Hyper-Threading is also disabled. The command that starts mt-redis therefore changes. The following binds the mt-redis server to the P-cores, i.e. logical CPU ids 0,2,4,6,8,10,12,14,16:

```bash
taskset -c 0,2,4,6,8,10,12,14,16 ./redis-server --port 6379 --appendonly no --save ""
```

- `--appendonly no` disables AOF (Append Only File), so every write is not persisted to disk
- `--save ""` disables RDB snapshots, i.e. the periodic on-disk data snapshot

This is the memtier_benchmark invocation:

```bash
taskset -c 17,18,19 \
memtier_benchmark \
  -p 6379 \
  --random-data \
  --ratio=${RATIO} \
  --requests=20000 \
  --json-out-file="${OUTPUT_FILE}" \
  --hide-histogram
```

#### Understanding memtier_benchmark

With no particular command specified, it tests the string type by default.

#### Last year's experiment design and results

Last year, Yeh compared five URCU flavors at read/write ratios of 10:1, 100:1, and 1:1. The target was mt-redis; the same command was run 100 times, recording SET Ops/sec, SET Average Latency, SET p50/p99/p99.9 Latency, GET Ops/sec, GET Average Latency, and GET p50/p99/p99.9 Latency.

Last year's conclusion:

> In the tests where reads outnumber writes, signal-based RCU performed best, while in the 1:1 test its write performance was poor. In theory QSBR should be the first choice for performance, but it underperformed in practice, possibly requiring tuning of when threads enter the quiescent state.

However, I found that urcu-signal was deprecated in 2023!
[Commit aad674a](https://github.com/urcu/userspace-rcu/commit/aad674a9a583e09e854145f18c5d8854269dce8c) is the final commit in the removal of urcu-signal. Its commit message recommends switching to urcu-memb, which achieves similar read-side performance without having to reserve a signal, and also has improved grace-period performance.

#### Read/write ratio 100:1: cache scenario, reads far outnumber writes

##### Throughput

SET Ops/sec
![comparison_set_ops_sec](https://hackmd.io/_uploads/SJgqn92tEeg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 3044.210 | 124.389 | 2779.570 | 3535.730 | 3039.250 | 3019.404 | 3069.015 |
| BP | 3084.212 | 132.025 | 2843.330 | 3441.970 | 3066.740 | 3057.884 | 3110.541 |
| MEMB | 3052.686 | 142.048 | 2811.680 | 3498.170 | 3031.835 | 3024.359 | 3081.014 |
| MB | 3062.779 | 141.894 | 2864.770 | 3609.120 | 3041.210 | 3034.482 | 3091.076 |
| SIGNAL | 3054.204 | 126.075 | 2808.940 | 3447.240 | 3024.835 | 3029.062 | 3079.346 |

GET Ops/sec
![comparison_get_ops_sec](https://hackmd.io/_uploads/H1Yn93KElg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 302906.517 | 12377.009 | 276574.670 | 351814.290 | 302413.045 | 300438.278 | 305374.756 |
| BP | 306886.910 | 13136.841 | 282918.290 | 342484.500 | 305148.375 | 304267.144 | 309506.676 |
| MEMB | 303749.942 | 14134.116 | 279768.790 | 348076.500 | 301675.285 | 300931.298 | 306568.586 |
| MB | 304754.178 | 14118.777 | 285051.500 | 359116.510 | 302608.090 | 301938.593 | 307569.762 |
| SIGNAL | 303900.933 | 12544.743 | 279496.530 | 343009.320 | 300978.660 | 301399.244 | 306402.622 |

##### Average latency

SET Average Latency
![comparison_set_average_latency](https://hackmd.io/_uploads/HJ529hYVlx.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.665 | 0.015 | 0.632 | 0.705 | 0.664 | 0.662 | 0.668 |
| BP | 0.665 | 0.015 | 0.620 | 0.699 | 0.665 | 0.662 | 0.668 |
| MEMB | 0.666 | 0.014 | 0.635 | 0.709 | 0.665 | 0.663 | 0.669 |
| MB | 0.667 | 0.015 | 0.629 | 0.712 | 0.664 | 0.664 | 0.670 |
| SIGNAL | 0.666 | 0.013 | 0.630 | 0.691 | 0.667 | 0.663 | 0.669 |

GET Average Latency
![comparison_get_average_latency](https://hackmd.io/_uploads/BJFnq2tVll.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.649 | 0.007 | 0.631 | 0.666 | 0.649 | 0.647 | 0.650 |
| BP | 0.648 | 0.008 | 0.628 | 0.672 | 0.648 | 0.646 | 0.649 |
| MEMB | 0.649 | 0.007 | 0.626 | 0.669 | 0.648 | 0.647 | 0.650 |
| MB | 0.648 | 0.008 | 0.624 | 0.664 | 0.648 | 0.646 | 0.649 |
| SIGNAL | 0.648 | 0.008 | 0.627 | 0.666 | 0.648 | 0.647 | 0.650 |

##### p50 latency

SET p50 Latency
![comparison_set_p50_latency](https://hackmd.io/_uploads/Syq3q2K4ee.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.499 | 0.004 | 0.487 | 0.511 | 0.495 | 0.498 | 0.499 |
| BP | 0.499 | 0.005 | 0.487 | 0.511 | 0.495 | 0.498 | 0.500 |
| MEMB | 0.500 | 0.005 | 0.495 | 0.511 | 0.503 | 0.499 | 0.501 |
| MB | 0.498 | 0.005 | 0.487 | 0.511 | 0.495 | 0.498 | 0.499 |
| SIGNAL | 0.499 | 0.005 | 0.487 | 0.511 | 0.495 | 0.498 | 0.500 |

GET p50 Latency
![comparison_get_p50_latency](https://hackmd.io/_uploads/Hyqn5hFElg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.497 | 0.004 | 0.487 | 0.503 | 0.495 | 0.496 | 0.498 |
| BP | 0.497 | 0.005 | 0.487 | 0.511 | 0.495 | 0.496 | 0.498 |
| MEMB | 0.497 | 0.004 | 0.487 | 0.503 | 0.495 | 0.497 | 0.498 |
| MB | 0.496 | 0.005 | 0.487 | 0.511 | 0.495 | 0.495 | 0.497 |
| SIGNAL | 0.497 | 0.005 | 0.487 | 0.503 | 0.495 | 0.496 | 0.498 |

##### p99 latency

SET p99 Latency
![comparison_set_p99_latency](https://hackmd.io/_uploads/Hkq25hYEgl.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 4.524 | 0.016 | 4.511 | 4.543 | 4.511 | 4.521 | 4.527 |
| BP | 4.526 | 0.016 | 4.511 | 4.543 | 4.511 | 4.523 | 4.529 |
| MEMB | 4.528 | 0.026 | 4.511 | 4.735 | 4.511 | 4.523 | 4.533 |
| MB | 4.526 | 0.029 | 4.511 | 4.767 | 4.511 | 4.520 | 4.531 |
| SIGNAL | 4.526 | 0.016 | 4.511 | 4.543 | 4.511 | 4.523 | 4.529 |

GET p99 Latency
![comparison_get_p99_9_latency](https://hackmd.io/_uploads/r1qh52YEll.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 4.516 | 0.012 | 4.511 | 4.543 | 4.511 | 4.514 | 4.518 |
| BP | 4.515 | 0.010 | 4.511 | 4.543 | 4.511 | 4.513 | 4.517 |
| MEMB | 4.515 | 0.011 | 4.511 | 4.543 | 4.511 | 4.513 | 4.518 |
| MB | 4.515 | 0.010 | 4.511 | 4.543 | 4.511 | 4.513 | 4.517 |
| SIGNAL | 4.517 | 0.013 | 4.511 | 4.543 | 4.511 | 4.515 | 4.520 |

##### p99.9 latency

SET p99.9 Latency
![comparison_set_p99_9_latency](https://hackmd.io/_uploads/Sklq2qnt4xl.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 7.518 | 0.107 | 6.527 | 7.775 | 7.519 | 7.497 | 7.540 |
| BP | 7.614 | 0.487 | 7.519 | 10.559 | 7.519 | 7.517 | 7.711 |
| MEMB | 7.555 | 0.209 | 7.519 | 9.151 | 7.519 | 7.513 | 7.597 |
| MB | 7.593 | 0.390 | 7.519 | 10.687 | 7.519 | 7.516 | 7.671 |
| SIGNAL | 7.669 | 0.678 | 7.519 | 12.415 | 7.519 | 7.534 | 7.804 |

GET p99.9 Latency
![comparison_get_p99_latency](https://hackmd.io/_uploads/S15nchtVxg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 7.519 | 0.000 | 7.519 | 7.519 | 7.519 | 7.519 | 7.519 |
| BP | 7.519 | 0.003 | 7.519 | 7.551 | 7.519 | 7.519 | 7.520 |
| MEMB | 7.519 | 0.000 | 7.519 | 7.519 | 7.519 | 7.519 | 7.519 |
| MB | 7.519 | 0.000 | 7.519 | 7.519 | 7.519 | 7.519 | 7.519 |
| SIGNAL | 7.519 | 0.003 | 7.519 | 7.551 | 7.519 | 7.519 | 7.520 |

##### Summary

All flavors show very similar GET throughput and latency, with BP slightly ahead on GET throughput. SIGNAL has the worst SET p99.9 latency — highest mean and largest standard deviation — indicating unstable write tail latency. QSBR has the best SET p99.9 latency, suggesting its reclamation mechanism may do better when writes are very infrequent.

#### Read/write ratio 10:1: cache scenario, reads outnumber writes

##### Throughput

SET Ops/sec
![comparison_set_ops_sec](https://hackmd.io/_uploads/SkgK_TntExg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 28660.834 | 320.453 | 28114.800 | 30779.230 | 28630.195 | 28596.929 | 28724.739 |
| BP | 28658.504 | 315.730 | 28105.650 | 29984.920 | 28611.525 | 28595.541 | 28721.468 |
| MEMB | 28630.912 | 395.076 | 27642.400 | 31215.680 | 28617.705 | 28552.125 | 28709.699 |
| MB | 28634.801 | 468.614 | 27521.340 | 31324.040 | 28564.030 | 28541.349 | 28728.253 |
| SIGNAL | 28613.339 | 405.676 | 26497.110 | 30684.580 | 28576.275 | 28532.439 | 28694.240 |

<!--
- All flavors have very similar mean throughput (about 28600 ops/sec)
- QSBR and BP have the smallest standard deviation (about 320) and are the most stable
- SIGNAL's minimum (26497) is clearly lower than the other flavors
-->

GET Ops/sec
![comparison_get_ops_sec](https://hackmd.io/_uploads/BkK_ThYEgx.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 286466.529 | 3202.947 | 281008.890 | 307640.060 | 286160.300 | 285827.793 | 287105.265 |
| BP | 286443.242 | 3155.737 | 280917.470 | 299700.800 | 285973.705 | 285813.921 | 287072.563 |
| MEMB | 286167.464 | 3948.806 | 276287.260 | 312002.310 | 286035.460 | 285379.988 | 286954.940 |
| MB | 286206.329 | 4683.818 | 275077.250 | 313085.390 | 285498.950 | 285272.276 | 287140.382 |
| SIGNAL | 285991.817 | 4054.760 | 264839.980 | 306694.000 | 285621.375 | 285183.212 | 286800.423 |

<!--
- Again highly similar throughput (~286,000 ops/sec)
- QSBR and BP again show the best stability
- SIGNAL's minimum (264,840) is clearly low
-->

##### Average latency

SET Average Latency
![comparison_set_average_latency](https://hackmd.io/_uploads/rygFdTnK4gg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.635 | 0.002 | 0.630 | 0.641 | 0.635 | 0.634 | 0.635 |
| BP | 0.635 | 0.002 | 0.630 | 0.639 | 0.635 | 0.634 | 0.635 |
| MEMB | 0.635 | 0.002 | 0.630 | 0.640 | 0.635 | 0.635 | 0.635 |
| MB | 0.635 | 0.002 | 0.630 | 0.641 | 0.635 | 0.634 | 0.635 |
| SIGNAL | 0.635 | 0.002 | 0.630 | 0.643 | 0.635 | 0.635 | 0.636 |

GET Average Latency
![comparison_get_average_latency](https://hackmd.io/_uploads/ByFup2FNlg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.635 | 0.002 | 0.630 | 0.640 | 0.634 | 0.634 | 0.635 |
| BP | 0.634 | 0.002 | 0.629 | 0.639 | 0.634 | 0.634 | 0.635 |
| MEMB | 0.635 | 0.002 | 0.630 | 0.639 | 0.635 | 0.634 | 0.635 |
| MB | 0.635 | 0.002 | 0.630 | 0.641 | 0.635 | 0.634 | 0.635 |
| SIGNAL | 0.635 | 0.002 | 0.629 | 0.643 | 0.635 | 0.635 | 0.635 |

##### p50 latency

SET p50 Latency
![comparison_set_p50_latency](https://hackmd.io/_uploads/rkKO62tVxg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.497 | 0.006 | 0.487 | 0.511 | 0.495 | 0.496 | 0.498 |
| BP | 0.497 | 0.005 | 0.487 | 0.511 | 0.495 | 0.497 | 0.498 |
| MEMB | 0.498 | 0.005 | 0.487 | 0.503 | 0.495 | 0.497 | 0.499 |
| MB | 0.498 | 0.005 | 0.487 | 0.511 | 0.495 | 0.497 | 0.499 |
| SIGNAL | 0.499 | 0.004 | 0.487 | 0.511 | 0.503 | 0.499 | 0.500 |

GET p50 Latency
![comparison_get_p50_latency](https://hackmd.io/_uploads/SJYOT3YVxg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.497 | 0.006 | 0.487 | 0.511 | 0.495 | 0.496 | 0.498 |
| BP | 0.497 | 0.005 | 0.487 | 0.511 | 0.495 | 0.497 | 0.498 |
| MEMB | 0.498 | 0.005 | 0.487 | 0.503 | 0.495 | 0.497 | 0.499 |
| MB | 0.498 | 0.005 | 0.487 | 0.511 | 0.495 | 0.497 | 0.499 |
| SIGNAL | 0.499 | 0.004 | 0.487 | 0.511 | 0.503 | 0.498 | 0.500 |

##### p99 latency

SET p99 Latency
![comparison_set_p99_latency](https://hackmd.io/_uploads/r19OTnY4xg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 1.384 | 0.021 | 1.359 | 1.495 | 1.383 | 1.379 | 1.388 |
| BP | 1.383 | 0.018 | 1.359 | 1.447 | 1.375 | 1.380 | 1.387 |
| MEMB | 1.385 | 0.018 | 1.359 | 1.439 | 1.383 | 1.381 | 1.388 |
| MB | 1.385 | 0.019 | 1.351 | 1.455 | 1.383 | 1.381 | 1.389 |
| SIGNAL | 1.425 | 0.091 | 1.359 | 1.959 | 1.391 | 1.407 | 1.443 |

GET p99 Latency
![comparison_get_p99_latency](https://hackmd.io/_uploads/BkFOa3F4eg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 1.383 | 0.020 | 1.359 | 1.487 | 1.375 | 1.379 | 1.387 |
| BP | 1.383 | 0.018 | 1.359 | 1.439 | 1.375 | 1.380 | 1.387 |
| MEMB | 1.385 | 0.017 | 1.359 | 1.439 | 1.383 | 1.381 | 1.388 |
| MB | 1.385 | 0.019 | 1.351 | 1.455 | 1.383 | 1.381 | 1.389 |
| SIGNAL | 1.426 | 0.091 | 1.359 | 1.935 | 1.391 | 1.408 | 1.444 |

##### p99.9 latency

SET p99.9 Latency
![comparison_set_p99_9_latency](https://hackmd.io/_uploads/ryc_6hKNle.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 2.851 | 0.290 | 2.271 | 4.319 | 2.807 | 2.793 | 2.909 |
| BP | 2.822 | 0.248 | 2.367 | 3.503 | 2.799 | 2.772 | 2.871 |
| MEMB | 2.810 | 0.278 | 2.351 | 3.839 | 2.767 | 2.754 | 2.865 |
| MB | 2.824 | 0.246 | 2.415 | 3.983 | 2.815 | 2.775 | 2.873 |
| SIGNAL | 2.919 | 0.442 | 2.351 | 4.351 | 2.815 | 2.831 | 3.007 |

GET p99.9 Latency
![comparison_get_p99_9_latency](https://hackmd.io/_uploads/ByFda3KNxl.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 2.853 | 0.286 | 2.351 | 4.319 | 2.815 | 2.796 | 2.910 |
| BP | 2.819 | 0.237 | 2.431 | 3.471 | 2.799 | 2.772 | 2.866 |
| MEMB | 2.810 | 0.274 | 2.303 | 3.743 | 2.783 | 2.755 | 2.865 |
| MB | 2.830 | 0.239 | 2.399 | 3.903 | 2.815 | 2.782 | 2.878 |
| SIGNAL | 2.929 | 0.441 | 2.367 | 4.383 | 2.815 | 2.841 | 3.017 |

##### Summary

In read-heavy workloads, QSBR has the best throughput, with BP second. However, QSBR is less stable at the high-percentile GET and SET latencies. MEMB and MB perform very similarly. Among all flavors, SIGNAL should not be used: it is clearly the worst on several throughput and latency metrics, especially the tail latencies.

#### Read/write ratio 1:1: session-store scenario, reads and writes roughly balanced

##### Throughput

SET Ops/sec
![comparison_set_ops_sec](https://hackmd.io/_uploads/BkWh71aYVel.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 133086.415 | 46847.384 | 0.000 | 163293.610 | 148770.615 | 123744.048 | 142428.782 |
| BP | 155610.583 | 2395.181 | 152699.060 | 170369.850 | 155087.675 | 155132.933 | 156088.233 |
| MEMB | 155501.622 | 1163.124 | 153110.860 | 159132.270 | 155344.040 | 155269.671 | 155733.574 |
| MB | 155821.162 | 2447.141 | 143412.070 | 168457.980 | 155811.510 | 155333.149 | 156309.174 |
| SIGNAL | 156319.059 | 2753.169 | 153813.820 | 169675.580 | 155630.615 | 155770.018 | 156868.099 |

GET Ops/sec
![comparison_get_ops_sec](https://hackmd.io/_uploads/By3mJTtElg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 133086.415 | 46847.384 | 0.000 | 163293.610 | 148770.615 | 123744.048 | 142428.782 |
| BP | 155610.583 | 2395.181 | 152699.060 | 170369.850 | 155087.675 | 155132.933 | 156088.233 |
| MEMB | 155501.622 | 1163.124 | 153110.860 | 159132.270 | 155344.040 | 155269.671 | 155733.574 |
| MB | 155821.162 | 2447.141 | 143412.070 | 168457.980 | 155811.510 | 155333.149 | 156309.174 |
| SIGNAL | 156319.059 | 2753.169 | 153813.820 | 169675.580 | 155630.615 | 155770.018 | 156868.099 |

##### Average latency

SET Average Latency
![comparison_set_average_latency](https://hackmd.io/_uploads/Skgn71pY4lx.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 9.730 | 88.360 | 0.000 | 888.601 | 0.677 | -7.891 | 27.351 |
| BP | 0.642 | 0.002 | 0.637 | 0.647 | 0.642 | 0.641 | 0.642 |
| MEMB | 0.642 | 0.002 | 0.638 | 0.646 | 0.642 | 0.642 | 0.642 |
| MB | 0.642 | 0.002 | 0.636 | 0.646 | 0.642 | 0.641 | 0.642 |
| SIGNAL | 0.641 | 0.001 | 0.638 | 0.645 | 0.642 | 0.641 | 0.642 |

GET Average Latency
![comparison_get_average_latency](https://hackmd.io/_uploads/Skn7yatEge.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 2.678 | 19.652 | 0.000 | 197.910 | 0.655 | -1.242 | 6.597 |
| BP | 0.641 | 0.002 | 0.637 | 0.647 | 0.642 | 0.641 | 0.642 |
| MEMB | 0.642 | 0.002 | 0.638 | 0.646 | 0.642 | 0.641 | 0.642 |
| MB | 0.641 | 0.002 | 0.636 | 0.646 | 0.641 | 0.641 | 0.642 |
| SIGNAL | 0.641 | 0.001 | 0.638 | 0.644 | 0.641 | 0.641 | 0.641 |

##### p50 latency

SET p50 Latency
![comparison_set_p50_latency](https://hackmd.io/_uploads/rJenQ16YNlg.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.455 | 0.141 | 0.007 | 0.511 | 0.503 | 0.427 | 0.483 |
| BP | 0.507 | 0.005 | 0.495 | 0.519 | 0.511 | 0.506 | 0.508 |
| MEMB | 0.509 | 0.004 | 0.495 | 0.519 | 0.511 | 0.508 | 0.510 |
| MB | 0.507 | 0.005 | 0.495 | 0.519 | 0.507 | 0.506 | 0.508 |
| SIGNAL | 0.508 | 0.005 | 0.495 | 0.519 | 0.511 | 0.507 | 0.509 |

GET p50 Latency
![comparison_get_p50_latency](https://hackmd.io/_uploads/SJhXkTYVgx.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 0.454 | 0.141 | 0.007 | 0.503 | 0.495 | 0.426 | 0.482 |
| BP | 0.507 | 0.005 | 0.495 | 0.519 | 0.507 | 0.506 | 0.508 |
| MEMB | 0.508 | 0.005 | 0.495 | 0.519 | 0.511 | 0.507 | 0.509 |
| MB | 0.506 | 0.005 | 0.495 | 0.519 | 0.503 | 0.505 | 0.507 |
| SIGNAL | 0.507 | 0.005 | 0.495 | 0.519 | 0.511 | 0.506 | 0.508 |

##### p99 latency

SET p99 Latency
![comparison_set_p99_latency](https://hackmd.io/_uploads/HJ3XJpKNge.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 14.427 | 101.888 | 0.007 | 1028.095 | 4.543 | -5.892 | 34.746 |
| BP | 1.451 | 0.039 | 1.375 | 1.527 | 1.455 | 1.443 | 1.459 |
| MEMB | 1.463 | 0.040 | 1.383 | 1.543 | 1.467 | 1.455 | 1.471 |
| MB | 1.455 | 0.045 | 1.383 | 1.543 | 1.455 | 1.447 | 1.464 |
| SIGNAL | 1.455 | 0.039 | 1.383 | 1.527 | 1.455 | 1.448 | 1.463 |

GET p99 Latency
![comparison_get_p99_latency](https://hackmd.io/_uploads/HJ2XkptNxl.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 5.214 | 10.926 | 0.007 | 113.151 | 4.511 | 3.035 | 7.393 |
| BP | 1.450 | 0.040 | 1.375 | 1.527 | 1.451 | 1.442 | 1.457 |
| MEMB | 1.462 | 0.040 | 1.383 | 1.543 | 1.471 | 1.454 | 1.470 |
| MB | 1.455 | 0.046 | 1.383 | 1.543 | 1.455 | 1.446 | 1.464 |
| SIGNAL | 1.454 | 0.038 | 1.383 | 1.527 | 1.455 | 1.446 | 1.462 |

##### p99.9 latency

SET p99.9 Latency
![comparison_set_p99_9_latency](https://hackmd.io/_uploads/H1hmypF4xx.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 1824.150 | 17529.922 | 0.007 | 176160.767 | 7.759 | -1671.690 | 5319.990 |
| BP | 2.726 | 0.191 | 2.303 | 3.295 | 2.687 | 2.688 | 2.764 |
| MEMB | 2.690 | 0.212 | 2.351 | 3.343 | 2.655 | 2.648 | 2.732 |
| MB | 2.733 | 0.229 | 2.271 | 3.567 | 2.743 | 2.688 | 2.779 |
| SIGNAL | 2.685 | 0.201 | 2.303 | 3.247 | 2.639 | 2.645 | 2.725 |

GET p99.9 Latency
![comparison_get_p99_9_latency](https://hackmd.io/_uploads/B1enQJTtVel.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QSBR | 124.821 | 901.113 | 0.007 | 8454.143 | 7.519 | -54.880 | 304.523 |
| BP | 2.726 | 0.187 | 2.319 | 3.279 | 2.703 | 2.688 | 2.763 |
| MEMB | 2.689 | 0.215 | 2.303 | 3.359 | 2.655 | 2.646 | 2.732 |
| MB | 2.734 | 0.227 | 2.319 | 3.519 | 2.735 | 2.689 | 2.780 |
| SIGNAL | 2.685 | 0.204 | 2.303 | 3.247 | 2.655 | 2.644 | 2.725 |

##### Summary

With a balanced read/write mix, BP, MEMB, MB, and SIGNAL all perform stably and well, while QSBR performs extremely poorly and is unsuitable at this ratio: its throughput is low and fluctuates widely (large standard deviation), its average latency is high, and its 95% confidence intervals are wide (with even negative lower bounds), so the results are unreliable — possibly due to memory reclamation blocking, or to an error in my measurement.

#### Analyzing the flavors across read/write ratios

- QSBR: extremely unstable at the 1:1 ratio, yet at the extreme read-heavy 100:1 ratio its SET p99.9 latency is the best of all flavors. It therefore suits read-intensive workloads with a high tolerance for write-latency fluctuation.
- BP, MEMB, MB: stable at every read/write ratio, with BP showing the best overall results.
- SIGNAL: close to the other flavors in average read performance, but at the 10:1 and 100:1 ratios its high-percentile latency (especially p99.9) is clearly the worst.

### The effect of thread count

#### Development environment

```bash
$ uname -a
Linux iantsai-P30-F5 6.11.0-26-generic #26~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 17 19:20:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
    CPU family:           6
    Model:                94
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             3
    CPU(s) scaling MHz:   30%
    CPU max MHz:          4000.0000
    CPU min MHz:          800.0000
    BogoMIPS:             6799.81
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     8 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
```

Because this CPU is older, the overall multi-threaded throughput and latency figures will differ from other machines, and the gains from some optimizations will differ in magnitude as well. So I first use bpftrace to record and histogram the latency of read/write system calls, in order to later check whether the results relate to system-call cost:

```
$ sudo bpftrace -e '
tracepoint:syscalls:sys_enter_read /comm == "redis-server"/ {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @read_latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'
```

```
@read_latency_us:
[1]                2808001 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)             1153948 |@@@@@@@@@@@@@@@@@@@@@                               |
[4, 8)               25778 |                                                    |
[8, 16)               8368 |                                                    |
[16, 32)              4015 |                                                    |
[32, 64)                32 |                                                    |
[64, 128)               13 |                                                    |
[128, 256)               9 |                                                    |
[256, 512)              32 |                                                    |
[512, 1K)                2 |                                                    |
[1K, 2K)                 2 |                                                    |
```

Last year's result was that, without setting CPU affinity, the thread count should be set to 4. I therefore first want to reproduce that experiment and explore how thread count affects throughput. I set the write:read ratio to 1:100 and 1:10; the material I collected indicates these two mixes cover most cache use cases, so only these two scenarios are tested.

> In [Memcached](https://openbenchmarking.org/test/pts/memcached?utm_source=chatgpt.com) benchmark runs, about 43.9% of tests use a SET:GET = 1:100 configuration.
> [Google Cloud Memorystore](https://cloud.google.com/blog/products/databases/scaling-google-cloud-memorystore-for-high-performance?utm_source=chatgpt.com) notes: "By default, the utility issues 10 get commands for every set command."
以下測試是想測試執行緒數量使用 4 與 8 對於 qsbr 與 memb 的比較,並使用 ``` taskset -c 4-7 memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:100 --requests=20000 ``` #### 讀寫比例 100 : 1 模擬 Cache 情境,讀取遠多於寫入 ##### 吞吐量 SET Ops/sec ![Set_Ops_sec](https://hackmd.io/_uploads/BywTLMgNgx.png) | Flavor | Mean| Std Dev | Min | Max | Median | 95% CI Lower| 95% CI Upper| |--|--|--|--|--|--|--|-| |qsbr-4t| 1911.112| 36.840| 1757.440| 2006.030| 1905.100| 1903.892| 1918.333| |qsbr-8t| 1857.539| 47.483| 1687.700| 1965.850| 1853.320| 1848.232| 1866.845| |memb-4t| 1904.987| 48.352| 1684.740| 2037.310| 1913.505| 1895.510| 1914.464| |memb-8t| 1843.536| 42.711| 1684.790| 1949.790| 1852.665| 1835.164| 1851.907| GET Ops/sec ![Get_Ops_sec](https://hackmd.io/_uploads/ry7AIMx4el.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |----------|-------------|----------|-------------|-------------|------------|---------------|---------------| | qsbr-4t | 190160.503 | 3665.605 | 174869.310 | 199605.400 | 189562.675 | 189442.045 | 190878.962 | | qsbr-8t | 184829.809 | 4724.670 | 167930.680 | 195606.950 | 184409.640 | 183903.774 | 185755.844 | | memb-4t | 189551.082 | 4811.135 | 167635.600 | 202717.860 | 190398.130 | 188608.099 | 190494.064 | | memb-8t | 183436.420 | 4249.845 | 167640.890 | 194008.960 | 184344.715 | 182603.450 | 184269.389 | #### 觀察 - 在沒設置 cpu affinity 時, 4 個執行緒的吞吐量的平均數與中位數皆是比 8 個執行緒還要高的 - qsbr 的表現在平均數比 memb 還要來的好,但 memb 的中位數比 qsbr 還要來的好,可以從最小數與標準差發現 memb 的吞吐量較不穩定,因此標準差也較大 ##### 平均延遲 SET Average Latency ![Set_Latency](https://hackmd.io/_uploads/SJFOcePNex.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |----------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-4t | 1.459 | 0.059 | 1.357 | 1.681 | 1.449 | 1.447 | 1.470 | | qsbr-8t | 1.407 | 0.061 | 1.316 | 1.699 | 1.395 | 1.395 | 1.419 | | memb-4t | 1.501 | 0.128 | 1.405 | 2.567 | 1.474 | 1.475 | 1.526 | | memb-8t | 1.487 | 0.091 | 1.387 
| 1.954 | 1.469 | 1.469 | 1.504 | GET Average Latency ![Get_Latency](https://hackmd.io/_uploads/SJnY9gwNxg.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |----------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-4t | 1.042 | 0.013 | 1.012 | 1.122 | 1.040 | 1.039 | 1.045 | | qsbr-8t | 1.072 | 0.022 | 1.040 | 1.165 | 1.079 | 1.068 | 1.077 | | memb-4t | 1.043 | 0.025 | 1.018 | 1.187 | 1.035 | 1.038 | 1.048 | | memb-8t | 1.078 | 0.022 | 1.052 | 1.162 | 1.069 | 1.073 | 1.082 | ##### p50 延遲 SET p50 Latency ![Set_p50_Latency](https://hackmd.io/_uploads/SJEfDzlNex.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |----------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-4t | 1.221 | 0.110 | 1.007 | 1.423 | 1.247 | 1.200 | 1.243 | | qsbr-8t | 1.054 | 0.046 | 1.015 | 1.295 | 1.047 | 1.045 | 1.063 | | memb-4t | 1.317 | 0.061 | 1.135 | 1.695 | 1.319 | 1.305 | 1.329 | | memb-8t | 1.162 | 0.082 | 1.047 | 1.431 | 1.147 | 1.146 | 1.178 | GET p50 Latency ![Get_p50_Latency](https://hackmd.io/_uploads/S1RGwGlNgl.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |----------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-4t | 0.958 | 0.007 | 0.935 | 0.975 | 0.959 | 0.957 | 0.960 | | qsbr-8t | 0.993 | 0.012 | 0.967 | 1.015 | 0.999 | 0.990 | 0.995 | | memb-4t | 0.941 | 0.014 | 0.839 | 0.967 | 0.943 | 0.938 | 0.944 | | memb-8t | 0.981 | 0.012 | 0.951 | 1.015 | 0.975 | 0.979 | 0.984 | #### 觀察 - 在 8 個執行緒時, Get 操作的延遲在平均數與中位數都是比較低的,但在 Set 操作上卻會略為提高 - 在 p50 上 Get 操作在 8 個執行緒時不再取得更低的延遲,反之延遲較 4 個執行緒來的高 - qsbr 在 Set 和 Get 操作上的平均延遲都比 memb 還要來的低 ##### p99 延遲 SET p99 Latency ![Set_p99_Latency](https://hackmd.io/_uploads/SkKLqgvNgx.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | 
|----------|--------|---------|-------|--------|--------|----------------|----------------| | qsbr-4t | 6.030 | 0.432 | 5.279 | 7.391 | 5.983 | 5.946 | 6.115 | | qsbr-8t | 6.230 | 0.840 | 5.183 | 9.407 | 5.919 | 6.065 | 6.394 | | memb-4t | 6.512 | 0.611 | 5.695 | 10.559 | 6.431 | 6.392 | 6.632 | | memb-8t | 7.008 | 1.004 | 5.471 | 11.327 | 6.815 | 6.811 | 7.205 | GET p99 Latency ![Get_p99_Latency](https://hackmd.io/_uploads/H1Bv5gPExe.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |----------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-4t | 3.882 | 0.398 | 3.151 | 5.055 | 3.791 | 3.804 | 3.960 | | qsbr-8t | 4.004 | 0.402 | 3.391 | 5.855 | 3.903 | 3.925 | 4.082 | | memb-4t | 4.266 | 0.357 | 3.807 | 6.335 | 4.223 | 4.196 | 4.336 | | memb-8t | 4.291 | 0.425 | 3.727 | 6.495 | 4.191 | 4.208 | 4.374 | ##### p99.9 延遲 SET p99.9 Latency ![Set_p99_9_Latency](https://hackmd.io/_uploads/H1wSPfx4xe.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |----------|--------|---------|-------|--------|--------|----------------|----------------| | qsbr-4t | 9.682 | 1.119 | 8.319 | 13.631 | 9.375 | 9.462 | 9.901 | | qsbr-8t | 11.010 | 1.314 | 8.511 | 15.743 | 10.975 | 10.753 | 11.268 | | memb-4t | 10.423 | 1.753 | 8.895 | 24.831 | 10.111 | 10.080 | 10.767 | | memb-8t | 12.050 | 1.328 | 9.599 | 17.279 | 11.871 | 11.789 | 12.310 | GET p99.9 Latency ![Get_p99_9_Latency](https://hackmd.io/_uploads/H1m8wzgNex.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |----------|--------|---------|-------|--------|--------|----------------|----------------| | qsbr-4t | 6.454 | 0.506 | 5.727 | 8.383 | 6.399 | 6.355 | 6.553 | | qsbr-8t | 7.671 | 0.698 | 6.591 | 11.135 | 7.455 | 7.534 | 7.808 | | memb-4t | 7.097 | 0.834 | 6.207 | 14.079 | 6.943 | 6.934 | 7.261 | | memb-8t | 8.064 | 0.704 | 7.007 | 12.159 | 7.999 | 7.926 | 8.202 | #### 觀察 - 可以看出在 p99 與 p99.9 
的情況下越多執行緒會導致越高的平均延遲 #### 小結論 - 在讀寫佔比 100:1 並且沒有設置 cpu affinity的情況下, 4 個執行緒的吞吐量會比 8 個執行緒的吞吐量來的好,不論是在延遲或平均操作上皆是。 ### 設置 cpu affinity 有看到去年有個 review 是為什麼 thread-level cpu affinity 不要一個 thread 一個 cpu 而是要兩個 thread 一個 cpu? 而去年負責專題的同學的回覆如下 >因為我測試過每個 CPU 一個執行緒,但是效能不如一個 CPU 兩個執行緒,這樣也能透過 context switch 讓要進行 I/O 的執行緒休息,將 CPU 資源讓給另外一個執行緒。 但這段敘述並沒有實驗去說明,因此我接下來是要測試:一個 CPU 兩個執行緒還是一個 CPU 一個執行緒比較好,並且分析去年的結論是否是對的。用以下程式碼將目前執行中的 thread 綁定(或限制)到指定的 CPU 核心集合中執行: ```diff! + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(cpu_id, &cpuset); + pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset); ``` 以下分別針對 qsbr 與 memb 兩種 flavor 去探討 ``` taskset -c 4-7 memtier_benchmark --hide-histogram -p 6379 --random-data --ratio=1:100 --requests=20000 --json-out-file=file_name ``` #### 讀寫比例 100 : 1 模擬 Cache 情境,讀取遠多於寫入,在 memb flavor 下 ##### 吞吐量 SET Ops/sec ![Set_Ops_sec](https://hackmd.io/_uploads/BkMIUHx4xg.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|----------|---------|---------|---------|----------|----------------|----------------| | memb-one-to-one-affinity | 1937.405 | 52.190 | 1823.790| 2097.200| 1930.500 | 1927.176 | 1947.634 | | memb-two-to-one-affinity | 2018.488 | 72.174 | 1869.110| 2211.830| 2020.165 | 2004.342 | 2032.635 | | memb-4t-no-affinity | 1904.987 | 48.352 | 1684.740| 2037.310| 1913.505 | 1895.510 | 1914.464 | | memb-8t-no-affinity | 1843.536 | 42.711 | 1684.790| 1949.790| 1852.665 | 1835.164 | 1851.907 | GET Ops/sec ![Get_Ops_sec](https://hackmd.io/_uploads/SJvv8BxNgg.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|------------|-----------|-------------|-------------|------------|---------------|---------------| | memb-one-to-one-affinity | 192776.628 | 5193.027 | 181471.580 | 208676.540 | 192089.445 | 191758.795 | 193794.462 | | memb-two-to-one-affinity | 200844.686 | 7181.510 | 185980.900 | 220082.210 | 201011.445 | 199437.110 | 
202252.262 | | memb-4t-no-affinity | 189551.082 | 4811.135 | 167635.600 | 202717.860 | 190398.130 | 188608.099 | 190494.064 | | memb-8t-no-affinity | 183436.420 | 4249.845 | 167640.890 | 194008.960 | 184344.715 | 182603.450 | 184269.389 | ##### 平均延遲 SET Average Latency ![Set_Latency](https://hackmd.io/_uploads/SygYIBx4gg.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|---------------|---------------| | memb-one-to-one-affinity | 1.405 | 0.039 | 1.330 | 1.536 | 1.400 | 1.398 | 1.413 | | memb-two-to-one-affinity | 1.332 | 0.037 | 1.278 | 1.524 | 1.326 | 1.325 | 1.339 | | memb-4t-no-affinity | 1.501 | 0.128 | 1.405 | 2.567 | 1.474 | 1.475 | 1.526 | | memb-8t-no-affinity | 1.487 | 0.091 | 1.387 | 1.954 | 1.469 | 1.469 | 1.504 | GET Average Latency ![Get_Latency](https://hackmd.io/_uploads/S1e58HeVxx.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|---------------|---------------| | memb-one-to-one-affinity | 1.036 | 0.015 | 1.014 | 1.084 | 1.032 | 1.033 | 1.039 | | memb-two-to-one-affinity | 1.006 | 0.016 | 0.978 | 1.087 | 1.002 | 1.002 | 1.009 | | memb-4t-no-affinity | 1.043 | 0.025 | 1.018 | 1.187 | 1.035 | 1.038 | 1.048 | | memb-8t-no-affinity | 1.078 | 0.022 | 1.052 | 1.162 | 1.069 | 1.073 | 1.082 | ##### p50 延遲 SET p50 Latency ![Set_p50_Latency](https://hackmd.io/_uploads/rJGjLSl4el.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|---------------|---------------| | memb-one-to-one-affinity | 1.153 | 0.084 | 1.007 | 1.327 | 1.159 | 1.137 | 1.170 | | memb-two-to-one-affinity | 1.249 | 0.024 | 1.191 | 1.359 | 1.247 | 1.244 | 1.254 | | memb-4t-no-affinity | 1.317 | 0.061 | 1.135 | 1.695 | 1.319 | 1.305 | 1.329 | | memb-8t-no-affinity | 1.162 | 0.082 | 
1.047 | 1.431 | 1.147 | 1.146 | 1.178 | GET p50 Latency ![Get_p50_Latency](https://hackmd.io/_uploads/rkToLreVlx.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|---------------|---------------| | memb-one-to-one-affinity | 0.943 | 0.010 | 0.919 | 0.967 | 0.943 | 0.941 | 0.945 | | memb-two-to-one-affinity | 0.932 | 0.013 | 0.903 | 0.967 | 0.927 | 0.929 | 0.934 | | memb-4t-no-affinity | 0.941 | 0.014 | 0.839 | 0.967 | 0.943 | 0.938 | 0.944 | | memb-8t-no-affinity | 0.981 | 0.012 | 0.951 | 1.015 | 0.975 | 0.979 | 0.984 | ##### p99 延遲 SET p99 Latency ![Set_p99_Latency](https://hackmd.io/_uploads/HkjnUrgVle.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|--------|--------|--------|---------------|---------------| | memb-one-to-one-affinity | 5.553 | 0.285 | 4.799 | 6.943 | 5.535 | 5.497 | 5.609 | | memb-two-to-one-affinity | 4.830 | 0.355 | 4.063 | 6.143 | 4.831 | 4.761 | 4.900 | | memb-4t-no-affinity | 6.512 | 0.611 | 5.695 | 10.559 | 6.431 | 6.392 | 6.632 | | memb-8t-no-affinity | 7.008 | 1.004 | 5.471 | 11.327 | 6.815 | 6.811 | 7.205 | GET p99 Latency ![Get_p99_Latency](https://hackmd.io/_uploads/rJPrtrx4ll.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|---------------|---------------| | memb-one-to-one-affinity | 4.149 | 0.298 | 3.039 | 4.959 | 4.079 | 4.091 | 4.208 | | memb-two-to-one-affinity | 3.275 | 0.392 | 2.495 | 4.671 | 3.343 | 3.198 | 3.352 | | memb-4t-no-affinity | 4.266 | 0.357 | 3.807 | 6.335 | 4.223 | 4.196 | 4.336 | | memb-8t-no-affinity | 4.291 | 0.425 | 3.727 | 6.495 | 4.191 | 4.208 | 4.374 | ##### p99.9 延遲 SET p99.9 Latency ![Set_p99_9_Latency](https://hackmd.io/_uploads/BkX0UBeNge.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% 
CI Upper | |--------------------------|--------|---------|-------|-------|--------|---------------|---------------| | memb-one-to-one-affinity | 9.043 | 0.838 | 7.519 | 14.719| 8.959 | 8.879 | 9.207 | | memb-two-to-one-affinity | 8.051 | 1.040 | 6.399 | 14.015| 7.887 | 7.848 | 8.255 | | memb-4t-no-affinity |10.423 | 1.753 | 8.895 | 24.831|10.111 |10.080 |10.767 | | memb-8t-no-affinity |12.050 | 1.328 | 9.599 | 17.279|11.871 |11.789 |12.310 | GET p99.9 Latency ![Get_p99_9_Latency](https://hackmd.io/_uploads/rJdyDHxNel.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|--------|--------|---------------|---------------| | memb-one-to-one-affinity | 7.057 | 0.619 | 5.951 | 12.031 | 6.991 | 6.936 | 7.178 | | memb-two-to-one-affinity | 6.205 | 0.476 | 5.375 | 8.319 | 6.111 | 6.112 | 6.298 | | memb-4t-no-affinity | 7.097 | 0.834 | 6.207 | 14.079 | 6.943 | 6.934 | 7.261 | | memb-8t-no-affinity | 8.064 | 0.704 | 7.007 | 12.159 | 7.999 | 7.926 | 8.202 | #### 使用 perf 觀察程式的行為 以下是針對一個 CPU 兩個執行緒和一個 CPU 一個執行緒的情況下,我各拿一個中位數的結果來做分析,以下是結果: 一個 CPU 兩個執行緒 ``` Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ': 71,780.04 msec task-clock # 1.653 CPUs utilized 560,343 context-switches # 7.806 K/sec 14,189 cpu-migrations # 197.673 /sec 2,025 page-faults # 28.211 /sec 257,627,916,414 cycles # 3.589 GHz (39.99%) 141,018,941,758 instructions # 0.55 insn per cycle (50.00%) 22,306,027,705 branches # 310.755 M/sec (49.92%) 474,877,129 branch-misses # 2.13% of all branches (49.98%) 39,701,370,753 L1-dcache-loads # 553.098 M/sec (49.92%) 3,265,404,792 L1-dcache-load-misses # 8.22% of all L1-dcache accesses (49.96%) 1,107,813,317 LLC-loads # 15.433 M/sec (40.05%) 6,557,980 LLC-load-misses # 0.59% of all LL-cache accesses (40.09%) 14,661,818,780 cache-references # 204.260 M/sec (40.08%) 172,241,796 cache-misses # 1.17% of all cache refs (40.10%) 43.418281953 
seconds time elapsed

      22.202794000 seconds user
      49.473298000 seconds sys
```

一個 CPU 一個執行緒

```
 Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':

         71,423.46 msec task-clock                # 1.599 CPUs utilized
           333,882 context-switches          # 4.675 K/sec
            18,199 cpu-migrations            # 254.804 /sec
             1,626 page-faults               # 22.766 /sec
   254,850,020,885 cycles                    # 3.568 GHz (40.03%)
   136,686,768,652 instructions              # 0.54 insn per cycle (49.98%)
    21,480,788,799 branches                  # 300.753 M/sec (49.94%)
       467,898,042 branch-misses             # 2.18% of all branches (49.95%)
    38,495,992,811 L1-dcache-loads           # 538.982 M/sec (50.05%)
     3,143,432,121 L1-dcache-load-misses     # 8.17% of all L1-dcache accesses (49.97%)
     1,053,471,203 LLC-loads                 # 14.750 M/sec (40.05%)
         5,857,429 LLC-load-misses           # 0.56% of all LL-cache accesses (40.05%)
    14,337,925,549 cache-references          # 200.745 M/sec (40.03%)
       161,865,927 cache-misses              # 1.13% of all cache refs (39.98%)

      44.655900137 seconds time elapsed

      22.039534000 seconds user
      49.579251000 seconds sys
```

#### 結論

- 可以觀察出在使用 memb 機制、且 cpu affinity 為兩個執行緒綁定一個 CPU 的情況下,Get 和 Set 操作的吞吐量與延遲,不論平均數或中位數皆表現最好;相較於同樣 8 個執行緒但無 cpu affinity 的情況,Get 和 Set 的吞吐量與延遲都改善了大約 10%。memb 的實驗結果與去年的結論一致:兩個執行緒綁定一個 CPU 會取得較好的效能。
- 二對一的模式在 p99 與 p99.9 的表現也是四種設定中最好的,亦即 tail latency 表現最佳。
- 一個 CPU 一個執行緒時,context switches 數量比兩個執行緒低約 40%,但 cpu migrations 的數量反而較多;而單次 cpu migration 的成本遠大於 context switch,因為會導致 L1/L2 cache miss(需要重新 warm-up cache),也會清空 TLB。另一方面,兩個執行緒在同一個 CPU 上頻繁交替執行,也會造成 cache conflict,提高 cache miss 的數量。

#### 讀寫比例 100 : 1

模擬 Cache 情境,讀取遠多於寫入,在 qsbr flavor 下

##### 吞吐量

SET Ops/sec
![Set_Ops_sec](https://hackmd.io/_uploads/Sy2lfSxNlx.png)

| Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper |
|--------------------------|----------|---------|---------|---------|----------|----------------|----------------|
| qsbr-one-to-one-affinity | 1976.983 | 53.720 | 1858.990| 2088.910| 1976.620 | 1966.454 | 1987.512 |
| qsbr-two-to-one-affinity
| 1926.101 | 63.233 | 1761.320| 2085.570| 1919.775 | 1913.707 | 1938.494 | | qsbr-4t-no-affinity | 1911.112 | 36.840 | 1757.440| 2006.030| 1905.100 | 1903.892 | 1918.333 | | qsbr-8t-no-affinity | 1857.539 | 47.483 | 1687.700| 1965.850| 1853.320 | 1848.232 | 1866.845 | GET Ops/sec ![Get_Ops_sec](https://hackmd.io/_uploads/Sk7s5Sl4xe.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|-------------|----------|-------------|-------------|-------------|----------------|----------------| | qsbr-one-to-one-affinity | 196714.781 | 5345.271 | 184974.190 | 207851.710 | 196678.310 | 195667.108 | 197762.454 | | qsbr-two-to-one-affinity | 191651.843 | 6291.792 | 175255.650 | 207519.610 | 191022.075 | 190418.652 | 192885.034 | | qsbr-4t-no-affinity | 190160.503 | 3665.605 | 174869.310 | 199605.400 | 189562.675 | 189442.045 | 190878.962 | | qsbr-8t-no-affinity | 184829.809 | 4724.670 | 167930.680 | 195606.950 | 184409.640 | 183903.774 | 185755.844 | ##### 平均延遲 SET Average Latency ![Set_Latency](https://hackmd.io/_uploads/r1eAzHlVgg.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-one-to-one-affinity | 1.296 | 0.049 | 1.231 | 1.514 | 1.286 | 1.287 | 1.306 | | qsbr-two-to-one-affinity | 1.332 | 0.037 | 1.273 | 1.515 | 1.328 | 1.325 | 1.340 | | qsbr-4t-no-affinity | 1.459 | 0.059 | 1.357 | 1.681 | 1.449 | 1.447 | 1.470 | | qsbr-8t-no-affinity | 1.407 | 0.061 | 1.316 | 1.699 | 1.395 | 1.395 | 1.419 | GET Average Latency ![Get_Latency](https://hackmd.io/_uploads/SktCzSl4eg.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-one-to-one-affinity | 1.015 | 0.020 | 0.987 | 1.085 | 1.007 | 1.011 | 1.019 | | qsbr-two-to-one-affinity | 1.036 
| 0.018 | 1.015 | 1.120 | 1.030 | 1.032 | 1.039 | | qsbr-4t-no-affinity | 1.042 | 0.013 | 1.012 | 1.122 | 1.040 | 1.039 | 1.045 | | qsbr-8t-no-affinity | 1.072 | 0.022 | 1.040 | 1.165 | 1.079 | 1.068 | 1.077 | ##### p50 延遲 SET p50 Latency ![Set_p50_Latency](https://hackmd.io/_uploads/rJXi7SlNxg.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-one-to-one-affinity | 1.001 | 0.033 | 0.975 | 1.191 | 0.991 | 0.994 | 1.007 | | qsbr-two-to-one-affinity | 1.089 | 0.060 | 1.023 | 1.279 | 1.063 | 1.077 | 1.100 | | qsbr-4t-no-affinity | 1.221 | 0.110 | 1.007 | 1.423 | 1.247 | 1.200 | 1.243 | | qsbr-8t-no-affinity | 1.054 | 0.046 | 1.015 | 1.295 | 1.047 | 1.045 | 1.063 | GET p50 Latency ![Get_p50_Latency](https://hackmd.io/_uploads/r1hj7Sg4ee.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-one-to-one-affinity | 0.949 | 0.010 | 0.927 | 0.975 | 0.943 | 0.946 | 0.951 | | qsbr-two-to-one-affinity | 0.965 | 0.013 | 0.935 | 0.999 | 0.959 | 0.962 | 0.967 | | qsbr-4t-no-affinity | 0.958 | 0.007 | 0.935 | 0.975 | 0.959 | 0.957 | 0.960 | | qsbr-8t-no-affinity | 0.993 | 0.012 | 0.967 | 1.015 | 0.999 | 0.990 | 0.995 | ##### p99 延遲 SET p99 Latency ![Set_p99_Latency](https://hackmd.io/_uploads/S1jh7rgEee.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-one-to-one-affinity | 4.837 | 0.464 | 4.223 | 7.135 | 4.831 | 4.746 | 4.928 | | qsbr-two-to-one-affinity | 5.069 | 0.327 | 4.223 | 6.623 | 5.055 | 5.005 | 5.133 | | qsbr-4t-no-affinity | 6.030 | 0.432 | 5.279 | 7.391 | 5.983 | 5.946 | 6.115 | | qsbr-8t-no-affinity | 6.230 | 0.840 | 5.183 | 9.407 | 5.919 
| 6.065 | 6.394 | GET p99 Latency ![Get_p99_Latency](https://hackmd.io/_uploads/H1NpXSl4ll.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|-------|--------|----------------|----------------| | qsbr-one-to-one-affinity | 3.555 | 0.530 | 2.719 | 5.151 | 3.559 | 3.451 | 3.659 | | qsbr-two-to-one-affinity | 3.775 | 0.337 | 2.767 | 4.927 | 3.735 | 3.709 | 3.841 | | qsbr-4t-no-affinity | 3.882 | 0.398 | 3.151 | 5.055 | 3.791 | 3.804 | 3.960 | | qsbr-8t-no-affinity | 4.004 | 0.402 | 3.391 | 5.855 | 3.903 | 3.925 | 4.082 | ##### p99.9 延遲 SET p99.9 Latency ![Set_p99_9_Latency](https://hackmd.io/_uploads/rk6TXrlVle.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|--------|--------|----------------|----------------| | qsbr-one-to-one-affinity | 7.569 | 1.141 | 6.079 | 13.823 | 7.391 | 7.345 | 7.793 | | qsbr-two-to-one-affinity | 8.388 | 0.826 | 6.975 | 13.375 | 8.255 | 8.226 | 8.550 | | qsbr-4t-no-affinity | 9.682 | 1.119 | 8.319 | 13.631 | 9.375 | 9.462 | 9.901 | | qsbr-8t-no-affinity | 11.010 | 1.314 | 8.511 | 15.743 | 10.975 | 10.753 | 11.268 | GET p99.9 Latency ![Get_p99_9_Latency](https://hackmd.io/_uploads/Sk4RXHxEeg.png) | Flavor | Mean | Std Dev | Min | Max | Median | 95% CI Lower | 95% CI Upper | |--------------------------|--------|---------|-------|--------|--------|----------------|----------------| | qsbr-one-to-one-affinity | 5.984 | 0.847 | 5.151 | 10.367 | 5.759 | 5.818 | 6.150 | | qsbr-two-to-one-affinity | 6.854 | 0.480 | 5.727 | 9.535 | 6.863 | 6.760 | 6.948 | | qsbr-4t-no-affinity | 6.454 | 0.506 | 5.727 | 8.383 | 6.399 | 6.355 | 6.553 | | qsbr-8t-no-affinity | 7.671 | 0.698 | 6.591 | 11.135 | 7.455 | 7.534 | 7.808 | #### 使用 perf 分析程式行為 一樣針對一個 CPU 兩個執行緒與一個 CPU 一個執行緒進行分析,分別取一次中位數的結果: ``` 70,013.67 msec task-clock # 2.401 CPUs utilized 741,501 context-switches # 10.591 K/sec 
            15,581 cpu-migrations            # 222.542 /sec
             3,627 page-faults               # 51.804 /sec
   249,258,869,962 cycles                    # 3.560 GHz (39.99%)
   132,407,403,431 instructions              # 0.53 insn per cycle (49.98%)
    21,077,879,352 branches                  # 301.054 M/sec (50.03%)
       478,702,322 branch-misses             # 2.27% of all branches (50.05%)
    35,215,497,531 L1-dcache-loads           # 502.980 M/sec (50.12%)
     3,281,644,729 L1-dcache-load-misses     # 9.32% of all L1-dcache accesses (50.04%)
     1,129,746,159 LLC-loads                 # 16.136 M/sec (40.06%)
         7,744,776 LLC-load-misses           # 0.69% of all LL-cache accesses (39.95%)
    14,435,913,114 cache-references          # 206.187 M/sec (39.92%)
       185,804,120 cache-misses              # 1.29% of all cache refs (39.95%)

      29.158730419 seconds time elapsed

      18.440255000 seconds user
      51.487935000 seconds sys
```

一個 CPU 一個執行緒

```
 Performance counter stats for 'taskset -c 0-3 ./src/redis-server ./redis.conf --appendonly no --save ':

         67,887.06 msec task-clock                # 1.507 CPUs utilized
           375,487 context-switches          # 5.531 K/sec
            19,709 cpu-migrations            # 290.320 /sec
             3,209 page-faults               # 47.270 /sec
   240,616,502,442 cycles                    # 3.544 GHz (40.06%)
   128,321,848,547 instructions              # 0.53 insn per cycle (50.05%)
    20,388,085,704 branches                  # 300.324 M/sec (50.00%)
       468,964,574 branch-misses             # 2.30% of all branches (50.05%)
    34,179,214,077 L1-dcache-loads           # 503.472 M/sec (50.01%)
     3,167,434,498 L1-dcache-load-misses     # 9.27% of all L1-dcache accesses (50.00%)
     1,059,289,040 LLC-loads                 # 15.604 M/sec (39.97%)
         5,442,227 LLC-load-misses           # 0.51% of all LL-cache accesses (39.98%)
    13,343,467,237 cache-references          # 196.554 M/sec (39.96%)
       151,742,665 cache-misses              # 1.14% of all cache refs (40.02%)

      45.045660022 seconds time elapsed

      17.981155000 seconds user
      50.111441000 seconds sys
```

#### 結論

- 可以觀察出在使用 qsbr 機制、且 cpu affinity 為一個執行緒綁定一個 CPU 的情況下,Get 和 Set 操作的吞吐量和延遲,不論平均數或中位數皆表現最好,與去年得到的結論不同。原因可能是一個 CPU 兩個執行緒的 cache miss rate 較高,也可能與 qsbr 的機制本身有關,需要進一步實驗(待補)
- 但也能觀察到,一個執行緒綁定一個 CPU 時,雖然 p99.9 延遲的平均數和中位數皆是最佳,其最大值(最壞情況的 tail latency)卻也是四種設定中最大的,原因可能與 qsbr 的機制有關,需要進一步實驗(待補)

:::danger
TODO:
將並行程式設計相關的素材放到其他筆記頁面 ::: ## 引入 neco >去年的報告並沒有去詳細解釋該如何引入和各 API 該如何使用,並且最後的引用結果也沒顯著提昇效能,因此這邊我想從頭看一下 neco 文件看如何引用和為什麼對於提昇效能有限 #### 開發環境 ```bash $ uname -a Linux iantsai-P30-F5 6.11.0-26-generic #26~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 17 19:20:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux $ gcc --version gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Vendor ID: GenuineIntel Model name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz CPU family: 6 Model: 94 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 Stepping: 3 CPU(s) scaling MHz: 30% CPU max MHz: 4000.0000 CPU min MHz: 800.0000 BogoMIPS: 6799.81 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m ca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 s s ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nons top_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm p cid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim er xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpu id_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexp riority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida ara t pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi m d_clear flush_l1d arch_capabilities Virtualization features: Virtualization: VT-x Caches (sum of all): L1d: 128 KiB (4 instances) L1i: 128 KiB (4 instances) L2: 1 MiB (4 instances) L3: 8 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-7 ``` ### 什麼是 neco ? 
neco 是一個以 coroutine 提供並行處理的跨平台 C 函式庫,可以在多種作業系統上使用,目標是提供快速的單執行緒並行處理,並提供網路與 I/O 的 API。希望能引入 neco 作為低延遲的非同步處理引擎,讓每個 worker thread 能以 coroutine 管理多個 client 的 I/O 流程,減少 context switch 成本與 thread 數量,並提升 tail latency 表現。

neco 的具體表現可以用 [bluebox](https://github.com/tidwall/bluebox) 展示:它以 neco 實作、支援 SET、GET、DEL 和 PING 操作,較單執行緒的 Redis 有更高的資料吞吐量和更低的延遲,可用來評估 M:N threading model 對於 Redis 並行化的助益。

在我的設備上對原始 valkey 與 bluebox 各做 30 次測試並比較效能,以下依序為原始 valkey 與 bluebox 的結果:

```
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets         1031.52          ---          ---         1.93272         1.89500         3.99900         5.59900        79.48
Gets       102638.44         0.00    102638.44         1.92741         1.90300         3.90300         4.89500      3997.68
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals     103669.95         0.00    102638.44         1.92747         1.90300         3.90300         4.89500      4077.15
```

```
ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg.
Latency p50 Latency p99 Latency p99.9 Latency KB/sec ---------------------------------------------------------------------------------------------------------------------------- Sets 1663.52 --- --- 1.19864 1.02300 3.00700 6.43100 128.18 Gets 165524.81 3870.41 161654.40 1.19570 1.02300 2.95900 5.69500 6575.55 Waits 0.00 --- --- --- --- --- --- --- Totals 167188.34 3870.41 161654.40 1.19573 1.02300 2.97500 5.72700 6703.73 ``` 透過這個專案可以觀察出 neco 配合單執行緒帶來的平行增益是很可觀的,可以達到 60% 的一個吞吐量增幅,因此也期望能使用這個套件去提昇 mt-redis 的吞吐量。 先閱讀以下的 neco 文件,了解 neco 的使用方法 > [neco doc](https://github.com/tidwall/neco/blob/main/docs/API.md#generators) 先從多執行緒的範例開始切入: ```c++ #include <stdio.h> #include <unistd.h> #include <pthread.h> #include "neco.h" #include "neco.c" void coro1(int argc, void *argv[]) { // This coroutine is running in a different scheduler than coro2. int rd = *(int*)argv[0]; int wr = *(int*)argv[1]; int val; neco_read(rd, &val, sizeof(int)); printf("coro1: %d\n", val); neco_write(wr, &(int){ 2 }, sizeof(int)); } void coro2(int argc, void *argv[]) { // This coroutine is running in a different scheduler than coro1. 
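    // (補充說明,非原始範例)pipefds[0..1] 為第一條 pipe 的 read/write 端,
    // pipefds[2..3] 為第二條。coro1 讀第一條、寫第二條;coro2 先寫入第一條
    // pipe 喚醒 coro1,再從第二條 pipe 等待 coro1 的回覆。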
    int rd = *(int*)argv[0];
    int wr = *(int*)argv[1];
    int val;
    neco_write(wr, &(int){ 1 }, sizeof(int));
    neco_read(rd, &val, sizeof(int));
    printf("coro2: %d\n", val);
}

void *runtime1(void *arg) {
    int *pipefds = arg;
    neco_start(coro1, 2, &pipefds[0], &pipefds[3]);
    return 0;
}

void *runtime2(void *arg) {
    int *pipefds = arg;
    neco_start(coro2, 2, &pipefds[2], &pipefds[1]);
    return 0;
}

int main() {
    int pipefds[4];
    pipe(&pipefds[0]);
    pipe(&pipefds[2]);
    pthread_t th1, th2;
    pthread_create(&th1, 0, runtime1, pipefds);
    pthread_create(&th2, 0, runtime2, pipefds);
    pthread_join(th1, 0);
    pthread_join(th2, 0);
    return 0;
}
```

這個範例主要用到三個 API:`neco_start`、`neco_read` 和 `neco_write`。

#### neco_start

neco coroutine 函式庫的入口點 API,用來在當前執行緒中建立並執行一個 coroutine 調度器 (scheduler),並從指定的 coroutine 函式開始執行。此函式會負責初始化內部 runtime;如果尚未存在 scheduler,則會自動建立一個。

> 這邊的排程器是使用 [sco](https://github.com/tidwall/sco) 排程器,是 neco 作者實作的 fair and deterministic scheduler

```cpp
int neco_start(void(*coroutine)(int argc, void *argv[]), int argc, ...);
```

第一個參數須傳入 coroutine 函式指標,而 coroutine 函式又須滿足以下格式:

```cpp
void coroutine(int argc, void *argv[]) {
    char *msg = argv[0];
    printf("%s\n", msg);
}
```

coroutine 的第一個參數是傳入的參數數量,對應 argv 的長度;第二個參數則是參數陣列。

#### neco_read

non-blocking I/O 的包裝函式,用於在 coroutine 環境中從指定的 file descriptor (fd) 讀取資料。這個 API 以 POSIX 的 `read()` 系統呼叫實作,但額外加上 coroutine 調度支援,能讓當前 coroutine 在資料尚未可讀時主動讓出 (yield) 執行權。

```cpp
ssize_t neco_read(int fd, void *data, size_t nbytes);
```

第一個參數是 fd,第二個參數是存放資料的指標,第三個參數是要讀取的 byte 數。

#### neco_write

對 POSIX `write()` 系統呼叫的包裝函式,設計目的是在 coroutine 執行環境中提供非阻塞的寫入操作。當 fd 暫時無法寫入時,該 coroutine 會自動讓出 (yield),等待可寫再繼續執行,參數與邏輯都跟 `neco_read` 相同。

```cpp
ssize_t neco_write(int fd, const void *data, size_t nbytes);
```

因此回頭看多執行緒的範例:程式會建立兩個執行緒,分別透過兩條 pipe 互相傳遞資料。根據上面 API 的行為,範例程式的輸出應該是:

>coro1: 1
coro2: 2

因為當 th1 先執行時,coro1 會呼叫 `neco_read` 查看該 fd 有無可讀取的資料,但還沒有,因此主動讓出給 th2;th2 的 coro2 中的 `neco_read` 也一樣,一定要等到 coro1 寫入 pipe 後才讀得到,因此結果如上。

以下針對 API 的參數做測試,再次確認 API 的使用方法:

```diff
 void
 *runtime1(void *arg) {
+    int temp = 123;
-    neco_start(coro1, 2, &pipefds[0], &pipefds[3]);
+    neco_start(coro1, 3, &pipefds[0], &pipefds[3], &temp);
 }
```

>coro1: temp: 123
coro1: 1
coro2: 2

### 該如何引入 mt-redis?

## 分析限制 Redis 並行的因素

為什麼 redis/valkey 使用 single thread?我閱讀了 [The Engineering Wisdom Behind Redis’s Single-Threaded Design](https://community.aws/content/2tfQijlXleV5iM6hDkSGkMJlpcd/the-engineering-wisdom-behind-redis-s-single-threaded-design),理解 redis 的設計與考量點。

[The Engineering Wisdom Behind Redis’s Single-Threaded Design 文章整理](/TX3sCh38Qc21i7nyYpXgyQ)

## TODO: 使 Valkey 具備並行處理能力

> 將 [yeh-sudo/mt-redis](https://github.com/yeh-sudo/mt-redis) 的成果用於 [Valkey](https://valkey.io/),確保 userspace RCU 發揮其作用

[閱讀 valkey 的資料結構](/wCk1UR9ESrO337Y5a2PbIg)

### mt-redis 中,將 urcu 加到 redis 的關鍵改動

可以搭配 yeh-sudo 寫的 [mt-redis 運作原理](https://hackmd.io/-6FWkmVkRRqrt7QZxNRofA?view#%E5%AD%97%E5%85%B8-q_dict)。

**以下討論 mt-redis 中,將 rcu 加到 redis 的關鍵改動**。

#### 通用雜湊表使用 rculfhash (lock-free hash table)

redis/valkey 的雜湊表是以單執行緒為前提設計的:為了避免擴容時長時間的 rehashing,採用 incremental rehashing,在增刪查改操作中逐步將舊雜湊表的資料搬移到新表,以保有高可用性。

而 mt-redis 的通用雜湊表改用 rculfhash;rculfhash 本身就支援動態調整大小,因此 mt-redis 不需要手動做 incremental rehashing。

rculfhash 提供 `cds_lfht_node` 作為串接 hash table 的媒介,並可使用 `caa_container_of` 巨集從 `cds_lfht_node` 指標取回外層結構的位址。

因此在 redis 幾個重要的結構中都能看到為 rculfhash 所做的設計:

- `struct redisDb` 的資料存於通用字典 (`q_dict`)。
- `struct q_dict` 是 Redis 用來管理 key-value 的核心結構,其中成員就包含 `struct cds_lfht *table`。
- `struct dictEntry` 是具體存儲每個 key-value pair 的單位,其中成員有 `cds_lfht_node node` 以串起雜湊表,另外還有用於延遲回收的 `rcu_head rcu_head`,以及指向存放 value 的 robj (`struct redisObject`) 結構的 `void *ptr`。

以下以 mt-redis 插入 key-value 的流程,展示 redisDb、q_dict、q_dictEntry 以及 robj 之間的關係。

`q_dictAdd()` 的步驟:

1. rcu_read_lock()
2. 建立 dictEntry,將要插入的 key 轉成 sds 格式 (Simple Dynamic String);`dictEntry.key` 存 key,`dictEntry.val` 存含有 value 的 robj 指標
3. 使用 `cds_lfht_add_replace()`,將 dictEntry 加入到 q_dict.table 中
4. 
rcu_read_unlock() 圖示: ``` +-----------------+ | redisDb | +------------------+ | q_dict *dict --> +------------------------+ | | struct q_dict | (q_dict 是雜湊表,存放在 redisDb 的 dict 成員中) | | +-------------------+ | | | | int size | | | | | cds_lfht *table | | | +-------------------+ | +------------------+ | | | (使用 cds_lfht_add_replace() 加到 table 中) v +------------------------+ | struct q_dictEntry | +------------------------+ | +------------------+ | | | cds_lfht_node | | -----> 雜湊表節點 | | node | | | +------------------+ | | +------------------+ | | | struct rcu_head | | -----> call_rcu 會使用到,延後會回收 | | rcu_head | | | +------------------+ | | +------------------+ | | | void *key | | -----> key 會轉成 sds 類型所保存 | +------------------+ | | +------------------+ | | | void *val | | -----> value 將以 robj 形式所保存 | +------------------+ | +------------------------+ | v +---------------------+ | robj | (robj 是 Redis 中的資料對象) +---------------------+ | type | encoding | | ptr | |-----> sds 或其它資料類型 +---------------------+ ``` #### 使用 concurrent queue 處理多個請求 mt-redis 在 `struct redisServer` 中加入 userspace RCU 庫中的 wait-free concurrent queue 的頭尾結構以處理多個要求。 ```c struct redisServer { //multi-thread redis support pthread_mutex_t command_request_lock; int threads_num; /* num of worker threads.*/ list *command_requests; struct cds_wfcq_head command_requests_head; struct cds_wfcq_tail command_requests_tail; int socketpairs[2]; q_eventloop qel; }; ``` 在 initServer 對通用雜湊表初始化,還有初始化 q_worker, q_master ```c // server.c extern struct redisServer server; void initServer(void) { ... 
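    /* 補充說明(示意,非原始註解):cds_lfht_new(init_size,
     * min_nr_alloc_buckets, max_nr_buckets, flags, attr)。
     * 下方 max_nr_buckets 傳 0 表示不限制桶數上限,搭配
     * CDS_LFHT_AUTO_RESIZE 讓雜湊表自動擴縮;CDS_LFHT_ACCOUNTING
     * 則啟用節點計數,輔助 resize 決策。 */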
    for (j = 0; j < server.dbnum; j++) {
        server.db[j].dict = zmalloc(sizeof(q_dict));
        server.db[j].dict->table = cds_lfht_new(
            1, 1, 0, CDS_LFHT_AUTO_RESIZE | CDS_LFHT_ACCOUNTING, NULL);
        server.db[j].expires = zmalloc(sizeof(q_dict));
        server.db[j].expires->table = cds_lfht_new(
            1, 1, 0, CDS_LFHT_AUTO_RESIZE | CDS_LFHT_ACCOUNTING, NULL);
    }

    q_workers_init(server.threads_num);
    q_master_init();

    aeCreateFileEvent(server.el, server.socketpairs[0], AE_READABLE,
                      server_event_process, NULL);
```

#### The three thread roles in mt-redis, and the data structures for managing client connections and requests

`struct client` manages the state of a client connection and its request handling.

Threads in mt-redis take one of three roles: master, worker, and server. Their responsibilities differ: the master manages clients, workers perform read operations concurrently, and the server serializes write operations.

#### struct client

This structure defines the complete state of each client connection:

```c
/* With multiplexing we need to take per-client state.
 * Clients are taken in a linked list. */
typedef struct client {
    q_eventloop *qel;   /* Event loop of the worker thread which handles this client. */
    int curidx;         /* The worker idx this client currently belongs to. */
    uint64_t id;        /* Client incremental unique ID. */
    int fd;             /* Client socket. */
    redisDb *db;        /* Pointer to currently SELECTed DB. */
    sds querybuf;       /* Buffer we use to accumulate client queries. */
    int argc;           /* Num of arguments of current command. */
    robj **argv;        /* Arguments of current command. */
    struct redisCommand *cmd, *lastcmd; /* Last command executed. */
    int flags;          /* Client flags: CLIENT_* macros. */
    ...
```

Member overview:

- `qel` points to the event loop that handles this client
- `curidx` records the ID of the worker thread the client currently belongs to
- `fd` is the client socket; `querybuf` accumulates the commands received from it
- `argc` and `argv` hold the parsed command arguments; `cmd` points to the command currently being executed

---

A `q_command_request` is hooked onto the wfcqueue in `redisServer` through its `q_node` member:

```c
/* Command forwarded from a worker thread to the server thread. */
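/*
 * Annotation (not in the original source; based on the liburcu
 * wfcqueue API): the cds_wfcq_node below is embedded intrusively, so
 * a worker enqueues a request by passing &r->q_node to
 * cds_wfcq_enqueue(), and the server thread later recovers the
 * request from a dequeued node with
 * caa_container_of(node, struct q_command_request, q_node).
 */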
typedef struct q_command_request {
    int type;
    struct client *c;
    struct cds_wfcq_node q_node;
} q_command_request;
```

#### The master thread

Its main job is to accept new client connections: it sets up the listening sockets and handles each new connection.

```c
static int setup_master(void)
{
    int j;

    for (j = 0; j < server.ipfd_count; j++) {
        aeCreateFileEvent(master.qel.el, server.ipfd[j], AE_READABLE,
                          acceptTcpHandler, NULL);
    }
    if (server.sofd > 0)
        aeCreateFileEvent(master.qel.el, server.sofd, AE_READABLE,
                          acceptUnixHandler, NULL);
    return C_OK;
}
```

Inside its `while (max--)` loop, `acceptTcpHandler()` uses q_worker's `dispatch_conn_new()`: when the master thread accepts a new connection, it hands it to a worker in round-robin fashion. The worker thread then creates the `struct client`, as described below.

```c
// in q_master.c
static void acceptCommonHandler(int fd, int flags, char *ip)
{
    ...
    dispatch_conn_new(fd);
    ...
}
```

`dispatch_conn_new()` will:

1. Create a `connswapunit` structure wrapping the socket descriptor
2. Pick a worker thread (round-robin)
3. Push the `connswapunit` onto that worker's queue
4. Notify the worker thread of the new connection through the socket pair

```c
// q_worker.c
void dispatch_conn_new(int sd)
{
    struct connswapunit *su = csui_new();
    char buf[1];
    q_worker *worker;
    int tid = (last_worker_thread + 1) % server.threads_num;

    worker = darray_get(&workers, (uint32_t) tid);
    last_worker_thread = tid;

    su->num = sd;
    su->data = NULL;
    csul_push(worker, su);

    buf[0] = 'c'; /* to be handled by the worker's worker_thread_event_process loop */
    write(worker->socketpairs[0], buf, 1);
}
```

#### Workers perform reads concurrently

The `q_node` inside `connswapunit` links it into the worker thread's wait-free concurrent queue. A `q_worker` holds two queues:

- `q_head`/`q_tail`: receives new connections
- `r_head`/`r_tail`: receives clients handed back from the server thread

```c
typedef struct q_worker {
    int id;
    q_eventloop qel;
    int socketpairs[2]; /* 0: used by master/server thread,
                         * 1: belongs to worker thread. */
    /* Queue for the replication thread to switch the client
     * back to this worker's thread. */
    struct cds_wfcq_head r_head;
    struct cds_wfcq_tail r_tail;
    struct cds_wfcq_tail q_tail;
    struct cds_wfcq_head q_head;
    q_worker_stats stats;
} q_worker;

struct connswapunit {
    int num;
    void *data;
    struct cds_wfcq_node q_node;
};
```

```c
static int setup_worker(q_worker
*worker)
{
    int status = aeCreateFileEvent(worker->qel.el, worker->socketpairs[1],
                                   AE_READABLE, worker_thread_event_process,
                                   worker);
```

`worker_thread_event_process()` takes care of command handling. When a worker thread receives a `'c'` notification, it:

1. Pops a `connswapunit` off the queue
2. Extracts the socket descriptor
3. Creates the client instance, binding the client to one thread id

```c
// master/server-thread-to-worker-thread communication handler
static void worker_thread_event_process(aeEventLoop *el, int fd,
                                        void *privdata, int mask)
{
    if (read(fd, buf, 1) != 1) {
        buf[0] = 'c';
    }
    switch (buf[0]) {
    case 'c':
        csu = csul_pop(worker);
        sd = csu->num;
        zfree(csu);
        ...
        c = createClient(&worker->qel, sd);
        c->curidx = worker->id;
}
```

A worker thread reads client data through `worker_readQueryFromClient()` and then processes commands with `worker_processInputBuffer()`. While processing, if a command must be handled on the server thread, the client is migrated to the server thread; this logic lives in `int worker_processCommand(client *c)`.

Processing steps:

- Call `q_createCommandRequest(c)` to create a command-request structure wrapping the client
- Call `cds_wfcq_enqueue()` to append the request to the server thread's global command queue
- Set `c->flags |= CLIENT_JUMP` to mark the client as jumping between threads
- Notify the server thread via `write(server.socketpairs[1], buf, 1)`

```c
// server.c
/* Worker's version of processCommand(). */
int worker_processCommand(client *c)
{
    struct q_command_request *r;
    ...
    c->cmd = c->lastcmd = lookupCommand(c->argv[0]->ptr);

    /* The command is not read-only: hand it over to the server thread. */
    unlinkClientFromEventloop(c);
    r = q_createCommandRequest(c);
    cds_wfcq_enqueue(&server.command_requests_head,
                     &server.command_requests_tail, &r->q_node);
    c->flags |= CLIENT_JUMP;
    buf[0] = 'r'; /* 'r' stands for request from worker thread */
```

#### The server thread handles write operations

The worker-to-server command forwarding mechanism can be seen in `server_event_process()`, which:

1. Dequeues a request with `__cds_wfcq_dequeue_blocking()`
2. Retrieves the client and switches it to the server thread's event loop
3. Frees the command request
4. Processes the command with `server_processCommand()`
5. Checks the client type: slave clients and admin commands stay on the server thread
6. 
Otherwise, dispatches the client back to a worker thread

```c
void server_event_process(struct aeEventLoop *eventLoop, int fd,
                          void *clientData, int mask)
{
    char buf[1];
    client *c;
    int from;
    q_command_request *r;
    struct cds_wfcq_node *qnode;

    qnode = __cds_wfcq_dequeue_blocking(&server.command_requests_head,
                                        &server.command_requests_tail);
    /* The socketpair buffer can overflow as the number of connections
     * grows, so the number of requests in the command_requests queue
     * is NOT 1:1 with the number of bytes in the socketpair buffer;
     * the code has to deal with this mismatch. */
    while (read(fd, buf, 1) == 1 || qnode != NULL) {
        if (qnode == NULL)
            break;
        r = caa_container_of(qnode, struct q_command_request, q_node);
        c = r->c;
        from = c->curidx;
        c->qel = &server.qel;
        c->curidx = -1;
        q_freeCommandRequest(r);
        server_processCommand(c);
        if (c->flags & CLIENT_SLAVE || c->cmd->flags & CMD_ADMIN) {
            keep_slave_to_server_thread(c, from);
        } else {
            dispatch_to_worker(c, from);
        }
        qnode = __cds_wfcq_dequeue_blocking(&server.command_requests_head,
                                            &server.command_requests_tail);
    }
}
```

From the above it is clear that mt-redis keeps read operations on the worker threads, while write operations are handled uniformly on the server thread.