# Epoll vs. io_uring performance testing and comparison

###### tags: `linux2020`

---

## Test environment

One physical machine (Server) and one remote host (Client) are connected over an external network. They are not joined by a direct cable, nor do they sit behind a single switch.

#### Server side (physical machine)

- Operating system
  - Ubuntu 20.04.1 LTS
  - Linux kernel v5.8.0
- Hardware
  - CPU: Intel® Pentium CPU 4500 @ 3.50GHz × 2
  - Memory: 23.4 GiB
- Network
  - IP: 140.116.aaa.aaa

#### Client side (Remote machine)

- Network
  - IP: 140.116.bbb.bbb

##### Isolating a CPU core

1. `$ sudo vim /etc/default/grub`
   Open `/etc/default/grub`.
2. Edit `GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=0"`
   Find the line `GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"` and append `isolcpus=N` after a space, inside the quotes. `N` is the index of the CPU core to isolate, counted from 0 (`0` in the line shown here).
3. `$ sudo update-grub`
   After saving and exiting, run `update-grub`. The new kernel command line takes effect on the next boot.

##### Pinning a process to a specific CPU core with taskset

1. Check process affinity
   - `$ taskset -p PID`: prints a hexadecimal bitmask. Converted to binary, each set bit marks a core the process is allowed to run on (for example, mask `3` is binary `11`, meaning cores 0 and 1).
   - `$ taskset -cp PID`: prints the same affinity as a list of core numbers in decimal.
2. Assign process to specific CPU
   - `$ taskset -p COREMASK PID` or `$ taskset -cp CORELIST PID`

---

## Test items (io_uring vs epoll)

### 1. System performance

Client side: `$ ab -n 100000 -c $CONNECTION [-k] http://140.116.aaa.aaa:8081/`, where `$CONNECTION` is the number of concurrent connections.

#### Request per second (io_uring / epoll)

Requests handled per second were measured in two configurations: with and without the keep-alive option (`-k`). In the tables below, each cell is given as io_uring / epoll.

[ab command output message info.](https://hackmd.io/@shanvia/SyOvJ1wCw)

* **With keep-alive parameter**

  With keep-alive enabled, the total is fixed at 100,000 requests and the number of concurrent connections is varied:

  - Number of failures counts the connections that the server closed after detecting an error. On the original (epoll) server about 1% of requests fail, and each such error forcibly closes the connection to the client. On the modified (io_uring) server the failure rate is 0% in every run.
  - Requests per second are slightly higher on the modified server than on the original, by roughly 5% to 12% across the rows below.

  | Number of connections | Number of failures | Requests per second | Time per request (ms) |
  | :-------------------- | :----------------- | :------------------ | :-------------------- |
  | 100                   | 0 / 1300           | 2266 / 2025         | 44 / 49               |
  | 300                   | 0 / 1200           | 6799 / 6116         | 44 / 49               |
  | 500                   | 0 / 1000           | 11336 / 10364       | 44 / 48               |
  | 800                   | 0 / 800            | 17517 / 16680       | 46 / 48               |
  | 1000                  | 0 / 1000           | 22522 / 20382       | 44 / 49               |

  ![](https://i.imgur.com/7iWUVRY.png)

* **==Without== keep-alive parameter**

  - When the connection to the client does not have to be kept open, both servers handle far more requests per second. Because there is no second read on a kept-open connection, the number of failures drops to 0 for both.
  - Compared with the figure above, the io_uring throughput decreases as the number of connections grows. The reason is that the io_uring implementation handles accept requests first and only then handles read and write, whereas epoll handles all of them evenly and therefore stays roughly constant.

  | Number of connections | Number of failures | Requests per second | Time per request (ms) |
  | :-------------------- | :----------------- | :------------------ | :-------------------- |
  | 100                   | 0 / 0              | 58642 / 44147       | 2 / 2                 |
  | 300                   | 0 / 0              | 57408 / 40492       | 5 / 7                 |
  | 500                   | 0 / 0              | 55248 / 42380       | 9 / 12                |
  | 800                   | 0 / 0              | 54008 / 42382       | 15 / 19               |
  | 1000                  | 0 / 0              | 52506 / 42158       | 19 / 24               |

  ![](https://i.imgur.com/Rp4dTJZ.png)

#### Processing time of the do_request function

[Plot time consumption of do_request function](https://hackmd.io/@shanvia/BkUZSJvRD)

We measured the time the system takes to handle each request. With the modified code, the times are both lower and more tightly clustered.

![](https://i.imgur.com/E9EtJKY.png)

:::warning
The plot below is an attempt at the same measurement over 100,000 requests, keeping only the values inside the 99% confidence interval.
![](https://i.imgur.com/ZxpZtxj.png)
:::

### 2. CPU usage

Measured with `perf_events`.

Server side: `$ sudo perf stat -r 5 ./sehttpd > /dev/null`
Client side: `$ ab -n 100000 -c 1000 [-k] http://140.116.aaa.aaa:8081/`

Besides replacing epoll with io_uring, this implementation also includes several other changes:

- A memory pool built on `calloc()` replaces the dynamic allocations that were scattered across the code and fragmented memory. This reduces the number of `page-faults` at run time (from about 101,000 to about 1,400 in the keep-alive runs below).
- The switch-case code was rewritten with `computed goto`, which reduces `branch-misses`.
- Taken together, these changes and the io_uring implementation reduce the number of `cycles` and `instructions` the program needs. The data so far comes from a single physical machine acting as the server; we expect to repeat the test on other physical machines.

* With keep-alive parameter
  + io_uring
    ```
     Performance counter stats for './sehttpd' (5 runs):

              1,533.19 msec task-clock          #    0.113 CPUs utilized      ( +-  0.57% )
                 1,593      context-switches    #    0.001 M/sec              ( +-  0.87% )
                     0      cpu-migrations      #    0.000 K/sec
                 1,391      page-faults         #    0.907 K/sec              ( +-  0.03% )
         5,316,634,475      cycles              #    3.468 GHz                ( +-  0.57% )
         5,397,118,059      instructions        #    1.02  insn per cycle     ( +-  0.36% )
         1,031,652,823      branches            #  672.880 M/sec              ( +-  0.35% )
            11,219,837      branch-misses       #    1.09% of all branches    ( +-  0.65% )

                 13.63 +- 4.94 seconds time elapsed  ( +- 36.26% )
    ```
  + epoll
    ```
     Performance counter stats for './sehttpd' (5 runs):

              2,318.17 msec task-clock          #    0.221 CPUs utilized      ( +-  0.54% )
                   876      context-switches    #    0.378 K/sec              ( +-  4.54% )
                     0      cpu-migrations      #    0.000 K/sec
               101,220      page-faults         #    0.044 M/sec              ( +-  0.01% )
         8,046,643,190      cycles              #    3.471 GHz                ( +-  0.55% )
         6,151,488,232      instructions        #    0.76  insn per cycle     ( +-  0.25% )
         1,174,246,810      branches            #  506.541 M/sec              ( +-  0.24% )
            14,519,502      branch-misses       #    1.24% of all branches    ( +-  0.41% )

                10.503 +- 0.928 seconds time elapsed  ( +-  8.83% )
    ```
* ==Without== keep-alive
  + io_uring
    ```
     Performance counter stats for './sehttpd' (5 runs):

              2,931.98 msec task-clock          #    0.371 CPUs utilized      ( +-  0.36% )
                   627      context-switches    #    0.214 K/sec              ( +-  0.86% )
                     0      cpu-migrations      #    0.000 K/sec
                 1,134      page-faults         #    0.387 K/sec              ( +-  4.71% )
        10,222,300,431      cycles              #    3.486 GHz                ( +-  0.36% )
         9,501,551,090      instructions        #    0.93  insn per cycle     ( +-  0.03% )
         1,792,242,173      branches            #  611.274 M/sec              ( +-  0.03% )
            21,889,467      branch-misses       #    1.22% of all branches    ( +-  0.71% )

                  7.90 +- 1.32 seconds time elapsed  ( +- 16.74% )
    ```
  + epoll
    ```
     Performance counter stats for './sehttpd' (5 runs):

              3,658.49 msec task-clock          #    0.406 CPUs utilized      ( +-  0.56% )
                   774      context-switches    #    0.212 K/sec              ( +-  1.97% )
                     0      cpu-migrations      #    0.000 K/sec
               102,289      page-faults         #    0.028 M/sec              ( +-  0.27% )
        12,713,996,125      cycles              #    3.475 GHz                ( +-  0.40% )
        10,128,855,709      instructions        #    0.80  insn per cycle     ( +-  0.02% )
         1,923,276,791      branches            #  525.702 M/sec              ( +-  0.01% )
            25,670,694      branch-misses       #    1.33% of all branches    ( +-  0.39% )

                 9.012 +- 0.493 seconds time elapsed  ( +-  5.47% )
    ```

---

## Reference

1. [Running a program on specific CPU cores in Linux (in Chinese)](https://blog.gtwang.org/linux/run-program-process-specific-cpu-cores-linux/)
2. [Analyzing program performance with perf_events (in Chinese)](https://zh-blog.logan.tw/2019/07/10/analyze-program-performance-with-perf-events/)
3. [io_uring echo server benchmarks](https://github.com/frevib/io_uring-echo-server/blob/415cc5046c2e1469c8d26ad774b568efb5495145/benchmarks/benchmarks.md)
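
##### Cross-checking the ab figures

The two numeric columns in the request-rate tables of the system performance section are not independent: ab derives its mean time per request from the concurrency level and the measured request rate. The following is a minimal sketch of that arithmetic in Python, applied to the keep-alive rows. The function names `time_per_request_ms` and `summarize` are made up for this illustration and are not part of ab or sehttpd; the numbers are copied from the table.

```python
def time_per_request_ms(concurrency: int, requests_per_second: float) -> float:
    """Mean time per request in ms: concurrency divided by throughput."""
    return concurrency * 1000.0 / requests_per_second


# (connections, io_uring req/s, epoll req/s), copied from the keep-alive table.
KEEP_ALIVE = [
    (100, 2266, 2025),
    (300, 6799, 6116),
    (500, 11336, 10364),
    (800, 17517, 16680),
    (1000, 22522, 20382),
]


def summarize(rows):
    """Return (connections, io_uring ms, epoll ms, io_uring gain in %) per row."""
    result = []
    for connections, uring_rps, epoll_rps in rows:
        uring_ms = round(time_per_request_ms(connections, uring_rps))
        epoll_ms = round(time_per_request_ms(connections, epoll_rps))
        gain_percent = round((uring_rps / epoll_rps - 1.0) * 100.0, 1)
        result.append((connections, uring_ms, epoll_ms, gain_percent))
    return result


if __name__ == "__main__":
    for connections, uring_ms, epoll_ms, gain in summarize(KEEP_ALIVE):
        print(f"{connections:>5} conns: {uring_ms} / {epoll_ms} ms, "
              f"io_uring +{gain}% req/s")
```

Rounded to whole milliseconds, the computed values reproduce the "Time per request" column of the keep-alive table, which suggests that column is ab's mean time per request rather than the value averaged across all concurrent requests.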