io_uring

Overview

Previously: Event-Driven Servers: Principles and Examples

What is io_uring?

On Linux, I/O can be performed in the following ways:

  • the read() family
  • pread
  • preadv

But these are all synchronous, so POSIX provides aio_read(); its implementation, however, is unremarkable and performs poorly.
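For contrast with the asynchronous interfaces discussed below, here is a minimal sketch of the synchronous path using preadv(); the function name and buffer sizes are illustrative. The calling thread blocks until the data is available.

```c
/* Synchronous scatter read: the thread blocks inside preadv() until
 * the kernel has filled both buffers. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

ssize_t read_two_chunks(const char *path, char *a, char *b, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = len },
        { .iov_base = b, .iov_len = len },
    };
    /* preadv reads from an explicit offset without moving the file cursor. */
    ssize_t n = preadv(fd, iov, 2, 0);
    close(fd);
    return n;
}
```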

Linux also has its own native async IO interface, but it comes with the following drawbacks:

  1. Async IO only supports O_DIRECT (i.e. un-buffered) accesses, constraining how the file descriptor is opened
  2. Each IO submission copies 104 bytes of data (for IO that's supposedly zero copy), and a single IO requires two system calls (submit + wait-for-completion)
  3. Many conditions can cause the "async" path to block (e.g. submission blocks waiting on metadata; if all request slots are occupied, submission also blocks)

Jens Axboe first tried to rework the existing native aio, but that effort failed, so he decided to propose a new interface with the following goals (listed in increasing order of importance):

  1. Easy to understand and intuitive to use
  2. Extendable: besides block-oriented IO, networking and non-block storage should also be usable
  3. Efficiency
  4. Scalability

When designing io_uring, Jens Axboe chose shared memory for the communication between the application and the kernel, to avoid excessive data copying. That makes synchronization unavoidable; instead of a shared lock, single-producer/single-consumer ring buffers are used to synchronize the shared data. This communication channel is split into the submission queue (SQ) and the completion queue (CQ).

For the CQ, the kernel is the producer and the user application the consumer; for the SQ the roles are reversed.
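The single-producer/single-consumer discipline can be illustrated with a minimal ring of the same shape io_uring uses: each side owns exactly one index (the producer writes the tail, the consumer writes the head), and the capacity is a power of two so indices are masked rather than wrapped. This is an illustrative sketch, not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8              /* power of two, like the io_uring rings */
#define RING_MASK (RING_ENTRIES - 1)

struct ring {
    uint32_t head;                  /* written only by the consumer */
    uint32_t tail;                  /* written only by the producer */
    int entries[RING_ENTRIES];
};

/* Producer side: returns false if the ring is full. */
static bool ring_push(struct ring *r, int v)
{
    if (r->tail - r->head == RING_ENTRIES)
        return false;               /* full */
    r->entries[r->tail & RING_MASK] = v;
    r->tail++;                      /* publish the entry to the consumer */
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_pop(struct ring *r, int *v)
{
    if (r->head == r->tail)
        return false;               /* empty */
    *v = r->entries[r->head & RING_MASK];
    r->head++;                      /* hand the slot back to the producer */
    return true;
}
```

Because each index has exactly one writer, no lock is needed; the real rings only add memory barriers around the index updates (see Memory ordering below).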

Linking request

CQEs can arrive in any order as they become available. (Example: read A.txt on an HDD first, then B.txt on an SSD; forcing completions to follow submission order would hurt performance.) Ordering can still be enforced when needed (see example).

By default, liburing executes the operations in the submission queue out of order. In some cases, however, the operations must execute sequentially, e.g. write + close; setting IOSQE_IO_LINK on an SQE achieves this. See linking request for details.

Submission Queue Polling

Setting the IORING_SETUP_SQPOLL flag switches liburing into polling mode, which saves the application from repeatedly calling io_uring_enter (a system call). In this mode, a kernel thread keeps checking whether there is work on the submission queue. See Submission Queue Polling for details.

Note that the number of kernel threads must be kept in check, or they will eat a large share of CPU cycles. To avoid this, the kthread goes to sleep if it receives no work within a set idle period; the next batch of work on the submission queue then has to take the original route: io_uring_enter()

When using liburing, you never directly call the io_uring_enter() system call. That is usually taken care of by liburing’s io_uring_submit() function. It automatically determines if you are using polling mode or not and deals with when your program needs to call io_uring_enter() without you having to bother about it.

Memory ordering

If you use liburing directly you need not worry about this topic, but it becomes important when programming against the raw interface. Two operations are provided:

  1. read_barrier(): Ensure previous writes are visible before doing subsequent memory reads
  2. write_barrier(): Order this write after previous writes

The kernel will include a read_barrier() before reading the SQ ring tail, to ensure that the tail write from the application is visible. From the CQ ring side, since the consumer/producer roles are reversed, the application merely needs to issue a read_barrier() before reading the CQ ring tail to ensure it sees any writes made by the kernel.
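The publish pattern the raw interface requires can be sketched with C11 atomics standing in for write_barrier()/read_barrier(); a plain array stands in for the SQE array. This is an illustrative model, not the actual io_uring layout.

```c
#include <stdatomic.h>

#define ENTRIES 8
#define MASK (ENTRIES - 1)

/* Stand-ins for the shared SQ ring: the entry array and the
 * producer-owned tail index. */
static int slots[ENTRIES];
static atomic_uint ring_tail;

/* Application side: fill the entry first, then publish the new tail.
 * The release store plays the role of write_barrier(): no earlier
 * write may be reordered after it. */
static void sq_publish(int value)
{
    unsigned t = atomic_load_explicit(&ring_tail, memory_order_relaxed);
    slots[t & MASK] = value;
    atomic_store_explicit(&ring_tail, t + 1, memory_order_release);
}

/* Kernel side: the acquire load plays the role of read_barrier(): once
 * the new tail is observed, the entry written before it is guaranteed
 * visible too. Returns -1 if nothing has been published past head. */
static int sq_consume(unsigned head)
{
    unsigned t = atomic_load_explicit(&ring_tail, memory_order_acquire);
    if (head == t)
        return -1;
    return slots[head & MASK];
}
```

Without the release/acquire pair, the consumer could observe the new tail before the entry write, reading a stale slot.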

liburing library

  • Remove the need for boiler plate code for setup of an io_uring instance
  • Provide a simplified API for basic use cases.

Advanced use cases and features

FIXED FILES AND BUFFERS

POLLED IO

By default, I/O relies on hardware interrupts to signal a completion event. With polled IO, the application instead repeatedly asks the hardware driver for the status of a submitted IO request.

[Note] Real polling example
[Note] Submission queue polling only works in combination with fixed files (not fixed buffers)

KERNEL SIDE POLLING

A kernel thread actively checks whether anything is on the SQ, which avoids calling the syscall io_uring_enter.

Source code

io_uring_setup

The basic setup. Our focus is on what must be configured for IORING_SETUP_SQPOLL at setup time; we expect to find where the kthread is created and what work it performs. io_sq_offload_create shows that the offload path is tied to the kthread.

Digging further, we find create_io_thread, which forks via copy_process and starts the task with wake_up_new_task. The task's job is io_sq_thread.

io_uring_enter

io_uring_enter is called after operations such as write/read have been prepared; here we only look at its behavior in polling mode:

if (ctx->flags & IORING_SETUP_SQPOLL) {
    io_cqring_overflow_flush(ctx, false);

    ret = -EOWNERDEAD;
    if (unlikely(ctx->sq_data->thread == NULL))
        goto out;
    if (flags & IORING_ENTER_SQ_WAKEUP)
        wake_up(&ctx->sq_data->wait);
    if (flags & IORING_ENTER_SQ_WAIT) {
        ret = io_sqpoll_wait_sq(ctx);
        if (ret)
            goto out;
    }
    submitted = to_submit;
} else if ...
  1. If the kthread has been idle for too long, it goes to sleep to avoid hogging the CPU; so when the IORING_ENTER_SQ_WAKEUP flag is set, the kthread must be woken up.
  2. PATCH: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits

Install liburing

  1. Download the source code
  2. ./configure
  3. sudo make install
  4. compile example: gcc -Wall -O2 -D_GNU_SOURCE -o io_uring-test io_uring-test.c -luring

liburing flow

io_uring_queue_init -> alloc iov -> io_uring_get_sqe -> io_uring_prep_readv -> io_uring_sqe_set_data -> io_uring_submit

io_uring_wait_cqe -> io_uring_cqe_get_data -> io_uring_cqe_seen -> io_uring_queue_exit

liburing/io_uring API

io_uring

  1. io_uring_setup(u32 entries, struct io_uring_params *p)
  • Description: Set up a submission queue (SQ) and completion queue (CQ) with at least entries entries, and returns a file descriptor which can be used to perform subsequent operations on the io_uring instance.
  • Relationship: wrapped by the liburing function io_uring_queue_init
  • Flags: member of struct io_uring_params
    • IORING_SETUP_IOPOLL:
      • busy-waiting for an I/O completion
      • provides lower latency, but may consume more CPU resources than interrupt driven I/O
      • usable only on a file descriptor opened using the O_DIRECT flag
      • It is illegal to mix and match polled and non-polled I/O on an io_uring instance
    • IORING_SETUP_SQPOLL:
      • after setting, a kernel thread is created to perform submission queue polling
      • the IORING_SQ_NEED_WAKEUP flag is set if the kernel thread has been idle for more than sq_thread_idle milliseconds
      • before linux 5.11, application must register a set of files to be used for IO through io_uring_register using the IORING_REGISTER_FILES opcode
    • IORING_SETUP_SQ_AFF:
      • poll thread will be bound to the cpu set in the sq_thread_cpu field of the struct io_uring_params
  • If no flags are specified, the io_uring instance is setup for interrupt driven I/O
  2. io_uring_register(unsigned int fd, unsigned int opcode, void *arg, unsigned int nr_args)
  • opcode:
    • IORING_REGISTER_BUFFERS:
      • buffers are mapped into the kernel and eligible for I/O
      • to make use of them, the application must specify the IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED opcodes in the submission queue entry and set the buf_index field to the desired buffer index
    • IORING_REGISTER_FILES:
      • Description: register files for I/O. arg contains a pointer to an array of nr_args file descriptors
      • to make use of them, the IOSQE_FIXED_FILE flag must be set in the flags member of the struct io_uring_sqe, and the fd member set to the index of the file in the file descriptor array
    • IORING_REGISTER_EVENTFD:
      • It's possible to use eventfd() to get notified of completion events on an io_uring instance.
  3. io_uring_enter(unsigned int fd, unsigned int to_submit, unsigned int min_complete, unsigned int flags, sigset_t *sig)
  • Description: single call can both submit new I/O and wait for completions of I/O initiated by this call or previous calls to io_uring_enter()
  • Flags:
    • IORING_ENTER_GETEVENTS:
      • wait for the specified number of events in min_complete before returning
      • can be set along with to_submit to both submit and complete events in a single system call
    • IORING_ENTER_SQ_WAKEUP
    • IORING_ENTER_SQ_WAIT
  • opcode:
    • IORING_OP_READV
    • IORING_OP_WRITEV
  • some details:
    • If the io_uring instance was configured for polling, by specifying IORING_SETUP_IOPOLL in the call to io_uring_setup(), then min_complete has a slightly different meaning. Passing a value of 0 instructs the kernel to return any events which are already complete, without blocking. If min_complete is a non-zero value, the kernel will still return immediately if any completion events are available. If no event completions are available, then the call will poll either until one or more completions become available, or until the process has exceeded its scheduler time slice.

liburing

io_uring vs. epoll

io_uring is slower than epoll?

Reference