---
tags: aio
---

# io_uring

## Overview

> Background: [事件驅動伺服器:原理和實例](https://hackmd.io/@sysprog/event-driven-server)

* [The Evolution of File Descriptor Monitoring in Linux: From select to io_uring](https://vmsplice.net/~stefan/stefanha-fosdem-2021.pdf)
* [IO-uring speed the RocksDB & TiKV](https://openinx.github.io/ppt/io-uring.pdf)

## What is io_uring?

Linux offers several ways to perform I/O:

* the `read` family
* `pread`
* `preadv`

All of them are synchronous, so POSIX specifies `aio_read`; its implementation, however, is unremarkable and performs poorly.

Linux also has its own native async I/O interface, but it suffers from the following drawbacks:

1. Async I/O supports only `O_DIRECT` (unbuffered) access, which is determined by how the file descriptor is opened.
2. Each I/O submission copies 104 bytes of data (for I/O that is supposedly zero-copy), and one I/O takes two system calls (submit + wait-for-completion).
3. A nominally asynchronous submission can still end up blocking in many ways (e.g., the submission may block waiting on metadata, and if all request slots are occupied, the submission blocks as well).

Jens Axboe first tried to rework the existing native AIO, but that ended in failure, so he proposed a new interface with the following goals (in increasing order of importance):

1. Easy to understand and intuitive to use
2. Extendable: besides block-oriented I/O, networking and non-block storage should also be usable
3. Efficiency
4. Scalability

To avoid excessive data copying, Jens Axboe designed io_uring around shared memory for the communication between the application and the kernel. Synchronizing that shared data is then unavoidable; a single-producer/single-consumer ring buffer takes the place of a shared lock. There are two such channels: the submission queue (SQ) and the completion queue (CQ). For the CQ, the kernel is the producer and the user application the consumer; for the SQ, the roles are reversed.

## Linking request

CQEs can arrive in any order, as requests complete. (For example, read A.txt on an HDD first and then B.txt on an SSD; forcing the completions into submission order would hurt performance.) Ordering can still be enforced when needed (see the [example](https://kernel.dk/io_uring.pdf)).

By default, liburing executes the operations in the submission queue with no ordering guarantee, but sometimes a sequence of operations must run in order, e.g. `write` followed by `close`. Adding `IOSQE_IO_LINK` to an SQE achieves this. See [linking request](https://unixism.net/loti/tutorial/link_liburing.html) for details; a minimal sketch appears after the next section.

## Submission Queue Polling

liburing can switch to poll mode via the `IORING_SETUP_SQPOLL` flag; this mode saves the user from repeatedly calling `io_uring_enter` (a system call). In this mode, a kernel thread keeps checking the submission queue for work to do. See [Submission Queue Polling](https://unixism.net/loti/tutorial/sq_poll.html#sq-poll) for usage, and the setup sketch below.

Note that the number of kernel threads has to be kept in check, or a large share of CPU cycles is consumed by the kthreads. To avoid this, a kthread that receives no work within a certain period puts itself to sleep, so the next batch of submission queue work must go through the original path: `io_uring_enter()`.

:::info
When using liburing, you never directly call the io_uring_enter() system call. That is usually taken care of by liburing’s io_uring_submit() function. It automatically determines if you are using polling mode or not and deals with when your program needs to call io_uring_enter() without you having to bother about it.
:::
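First, the linked-request sketch promised above: a minimal sketch of the `write` + `close` case, assuming liburing on a kernel that provides `IORING_OP_WRITE` and `IORING_OP_CLOSE` (5.6+), an already-initialized ring, and a valid `fd`; error handling is elided.

```c
#include <liburing.h>

/* Chain a write to a close with IOSQE_IO_LINK so the close
 * does not start until the write has completed. */
static void write_then_close(struct io_uring *ring, int fd,
                             const void *buf, unsigned len)
{
    /* NULL checks on io_uring_get_sqe() elided for brevity. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK;   /* order the next SQE after this one */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_close(sqe, fd);

    io_uring_submit(ring);         /* both SQEs go in with one call */
}
```

If a request in a link chain fails, the remaining linked requests complete with `-ECANCELED`, so the application should still reap both CQEs.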
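Likewise, a minimal sketch of creating a ring in poll mode with `io_uring_queue_init_params`: the queue depth and the `sq_thread_idle` value here are arbitrary choices, and older kernels may require elevated privileges for `IORING_SETUP_SQPOLL`.

```c
#include <liburing.h>
#include <string.h>

/* Create a ring whose SQ is drained by a kernel thread, so that
 * submissions normally need no io_uring_enter() system call. */
static int setup_sqpoll_ring(struct io_uring *ring)
{
    struct io_uring_params params;

    memset(&params, 0, sizeof(params));
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000;   /* kthread sleeps after 2000 ms idle */

    /* Returns 0 on success, a negative errno on failure. */
    return io_uring_queue_init_params(8, ring, &params);
}
```

Keep in mind that before Linux 5.11 this mode also requires registering the files used for I/O with `IORING_REGISTER_FILES` (see the API notes below).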
## Memory ordering

If you stick to liburing, you can ignore this topic; it matters when you drive the raw interface yourself. Two operations are provided:

1. `read_barrier()`: Ensure previous writes are visible before doing subsequent memory reads
2. `write_barrier()`: Order this write after previous writes

> The kernel will include a read_barrier() before reading the SQ ring tail, to ensure that the tail write from the application is visible. From the CQ ring side, since the consumer/producer roles are reversed, the application merely needs to issue a read_barrier() before reading the CQ ring tail to ensure it sees any writes made by the kernel.

## liburing library

* Removes the need for boilerplate code to set up an io_uring instance
* Provides a simplified API for basic use cases

### Advanced use cases and features

#### FIXED FILES AND BUFFERS

#### POLLED IO

By default, I/O relies on hardware interrupts to signal a completion event. With polled I/O, the application instead repeatedly asks the hardware driver for the status of a submitted I/O request.

[Note] [Real polling example](https://unixism.net/loti/tutorial/sq_poll.html)

[Note] Submission queue polling only works in combination with fixed files (not fixed buffers)

#### KERNEL SIDE POLLING

A kernel thread actively checks whether the SQ holds new entries, which spares the application the `io_uring_enter` system call.

## Source code

### `io_uring_setup`

This performs the basic setup. Our interest is in what `IORING_SETUP_SQPOLL` requires at setup time; we expect to find where the kthread is created and what work it performs. `io_sq_offload_create` shows that the offload path is tied to the kthread. Digging deeper leads to [create_io_thread](https://github.com/torvalds/linux/blob/65090f30ab791810a3dc840317e57df05018559c/kernel/fork.c#L2444), which forks via `copy_process` and starts the task with `wake_up_new_task`. The new task runs [io_sq_thread](https://github.com/torvalds/linux/blob/master/fs/io_uring.c#L6882).

### `io_uring_enter`

[io_uring_enter](https://github.com/torvalds/linux/blob/master/fs/io_uring.c#L9332) is called after operations such as read/write have been prepared. Here we only look at its behavior in poll mode:

```cpp
if (ctx->flags & IORING_SETUP_SQPOLL) {
	io_cqring_overflow_flush(ctx, false);

	ret = -EOWNERDEAD;
	if (unlikely(ctx->sq_data->thread == NULL))
		goto out;
	/* wake the SQ poll kthread if it has put itself to sleep */
	if (flags & IORING_ENTER_SQ_WAKEUP)
		wake_up(&ctx->sq_data->wait);
	/* wait for SQ ring space if the caller asked for it */
	if (flags & IORING_ENTER_SQ_WAIT) {
		ret = io_sqpoll_wait_sq(ctx);
		if (ret)
			goto out;
	}
	submitted = to_submit;
} else if ...
```

1. A kthread that has been idle for too long puts itself to sleep to avoid hogging the CPU, so when the `IORING_ENTER_SQ_WAKEUP` flag is set, the kthread must be woken up.
2. [PATCH: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits](https://www.spinics.net/lists/io-uring/msg04097.html)

## Install liburing

1. Download the [source code](https://github.com/axboe/liburing/releases)
2. `./configure`
3. `sudo make install`
4. Compile the example: `gcc -Wall -O2 -D_GNU_SOURCE -o io_uring-test io_uring-test.c -luring`

## liburing flow

io_uring_queue_init -> alloc iov -> io_uring_get_sqe -> io_uring_prep_readv -> io_uring_sqe_set_data -> io_uring_submit

io_uring_wait_cqe -> io_uring_cqe_get_data -> io_uring_cqe_seen -> io_uring_queue_exit
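Putting the flow together, here is a self-contained sketch that reads the first 4 KiB of a file with one `readv` request; the file path is an arbitrary assumption and error handling is kept minimal.

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct iovec iov = { .iov_base = malloc(4096), .iov_len = 4096 };

    int fd = open("/etc/hostname", O_RDONLY);    /* arbitrary test file */
    if (fd < 0 || io_uring_queue_init(4, &ring, 0) < 0)
        return 1;

    sqe = io_uring_get_sqe(&ring);               /* grab a free SQE */
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);    /* describe the readv */
    io_uring_sqe_set_data(sqe, iov.iov_base);    /* attach user data */
    io_uring_submit(&ring);                      /* hand it to the kernel */

    if (io_uring_wait_cqe(&ring, &cqe) == 0) {   /* block on completion */
        char *buf = io_uring_cqe_get_data(cqe);  /* recover user data */
        if (cqe->res > 0)
            fwrite(buf, 1, cqe->res, stdout);
        io_uring_cqe_seen(&ring, cqe);           /* mark the CQE consumed */
    }

    free(iov.iov_base);
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```

It builds the same way as the install example above: `gcc -Wall -O2 -D_GNU_SOURCE -o flow flow.c -luring`.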
## liburing/io_uring API

### io_uring

1. `io_uring_setup(u32 entries, struct io_uring_params *p)`
   * Description: sets up a submission queue (SQ) and completion queue (CQ) with at least `entries` entries, and returns a file descriptor which can be used to perform subsequent operations on the io_uring instance.
   * Relationship: wrapped by the liburing function `io_uring_queue_init`
   * Flags: members of `struct io_uring_params`
     * `IORING_SETUP_IOPOLL`:
       * busy-waits for an I/O completion
       * provides lower latency, but may consume more CPU resources than interrupt-driven I/O
       * usable only on a file descriptor opened using the `O_DIRECT` flag
       * it is illegal to mix and match polled and non-polled I/O on an io_uring instance
     * `IORING_SETUP_SQPOLL`:
       * after setting, a kernel thread is created to perform submission queue polling
       * the `IORING_SQ_NEED_WAKEUP` flag is set if the kernel thread has been idle for more than `sq_thread_idle` milliseconds
       * before Linux 5.11, the application must register a set of files to be used for I/O through `io_uring_register` using the `IORING_REGISTER_FILES` opcode
     * `IORING_SETUP_SQ_AFF`:
       * the poll thread will be bound to the CPU set in the `sq_thread_cpu` field of the `struct io_uring_params`
   * If no flags are specified, the io_uring instance is set up for interrupt-driven I/O.
2. `io_uring_register(unsigned int fd, unsigned int opcode, void *arg, unsigned int nr_args)`
   * opcode:
     * `IORING_REGISTER_BUFFERS`:
       * buffers are mapped into the kernel and eligible for I/O
       * to make use of them, the application must specify the `IORING_OP_READ_FIXED` or `IORING_OP_WRITE_FIXED` opcode in the submission queue entry and set the `buf_index` field to the desired buffer index
     * `IORING_REGISTER_FILES`:
       * Description: registers files for I/O; `arg` contains a pointer to an array of `nr_args` file descriptors
       * to make use of them, the `IOSQE_FIXED_FILE` flag must be set in the `flags` member of the `struct io_uring_sqe`, and the `fd` member set to the index of the file in the file descriptor array
     * `IORING_REGISTER_EVENTFD`:
       * it is possible to use `eventfd()` to get notified of completion events on an io_uring instance
3. `io_uring_enter(unsigned int fd, unsigned int to_submit, unsigned int min_complete, unsigned int flags, sigset_t *sig)`
   * Description: a single call can both submit new I/O and wait for completions of I/O initiated by this call or previous calls to `io_uring_enter()`
   * Flags:
     * `IORING_ENTER_GETEVENTS`:
       * wait for the specified number of events in `min_complete` before returning
       * can be set along with `to_submit` to both submit and complete events in a single system call
     * `IORING_ENTER_SQ_WAKEUP`
     * `IORING_ENTER_SQ_WAIT`
   * opcode:
     * `IORING_OP_READV`
     * `IORING_OP_WRITEV`
   * Some details: if the io_uring instance was configured for polling, by specifying `IORING_SETUP_IOPOLL` in the call to `io_uring_setup()`, then `min_complete` has a slightly different meaning. Passing a value of 0 instructs the kernel to return any events which are already complete, without blocking. If `min_complete` is a non-zero value, the kernel will still return immediately if any completion events are available. If no event completions are available, then the call will poll either until one or more completions become available, or until the process has exceeded its scheduler time slice.

### liburing

## io_uring vs. epoll

### io_uring is slower than epoll?

- [liburing #189](https://github.com/axboe/liburing/issues/189)
- [liburing #215](https://github.com/axboe/liburing/issues/215)
- [netty #10622](https://github.com/netty/netty/issues/10622)

## Reference

* [Lord of the io_uring](https://unixism.net/loti/what_is_io_uring.html)
* [Efficient IO with io_uring - official PDF](https://kernel.dk/io_uring.pdf)
* [io_uring by example](https://unixism.net/2020/04/io-uring-by-example-part-1-introduction/)
* [liburing examples](https://unixism.net/loti/tutorial/index.html)
* [liburing web server](https://github.com/shuveb/loti-examples/blob/master/webserver_liburing.c)