Background reading: Event-Driven Servers: Principles and Examples
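On Linux, I/O is usually done with the read family of system calls:
read
pread
preadv
All of them are synchronous, so POSIX provides aio_read for asynchronous I/O, but that interface is lackluster and performs poorly.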
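As a point of reference, the synchronous calls listed above can be sketched as follows. The helper names `read_at` and `read_scattered` are made up for illustration; the point is only that each call blocks the caller until the data has been copied into the buffers.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: blocking positional read into one buffer. */
ssize_t read_at(int fd, void *buf, size_t len, off_t off)
{
    return pread(fd, buf, len, off);
}

/* Hypothetical helper: blocking positional read scattered over two buffers.
 * The first buffer is filled completely before the second one is used. */
ssize_t read_scattered(int fd, void *a, size_t alen,
                       void *b, size_t blen, off_t off)
{
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = alen },
        { .iov_base = b, .iov_len = blen },
    };
    return preadv(fd, iov, 2, off);
}
```

Neither helper returns before the kernel has finished the read, which is exactly the limitation an asynchronous interface tries to remove.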
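Linux also has its own native async IO interface, but it has drawbacks, including:
It only supports async IO for O_DIRECT (or un-buffered) accesses, which is a setting on the file descriptor.
Jens Axboe first tried to rework the existing native aio, but that attempt failed, so he decided to propose a new interface. Its design goals are listed in increasing order of importance (the later a goal appears, the more it matters).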
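When designing io_uring, Jens Axboe chose shared memory for the communication between the application and the kernel in order to avoid excessive data copies. Shared memory inevitably raises a synchronization problem, which io_uring solves with single-producer, single-consumer ring buffers instead of a shared lock. The communication channel is split into a submission queue (SQ) and a completion queue (CQ).
For the CQ, the kernel is the producer and the user application is the consumer. For the SQ the roles are reversed.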
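To make the single-producer, single-consumer idea concrete, here is a minimal user-space sketch of such a ring using C11 atomics. It is not the io_uring layout: the struct, the names `ring_push` and `ring_pop`, and the fixed size of 8 entries are all assumptions for illustration. What it shares with the description above is that each index has exactly one writer, so no lock is needed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                  /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

struct spsc_ring {
    _Atomic uint32_t head;               /* written only by the consumer */
    _Atomic uint32_t tail;               /* written only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
static bool ring_push(struct spsc_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)     /* indices run free, so this wraps safely */
        return false;
    r->slots[tail & RING_MASK] = value;
    /* Release: the slot write becomes visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_pop(struct spsc_ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;
    *out = r->slots[head & RING_MASK];
    /* Release: the slot has been read before it is handed back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

In io_uring terms, the application would play the `ring_push` role on the SQ and the `ring_pop` role on the CQ, with the kernel on the opposite side of each ring.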
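CQEs can arrive in any order as they become available. (For example: read A.txt on an HDD first, then B.txt on an SSD. Forcing completions to follow submission order would hurt performance.) Ordering can still be enforced when needed (see example).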
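With liburing, operations on the submission queue are not executed in order by default. In some cases, however, they must run sequentially, for example write + close. Adding IOSQE_IO_LINK to a request achieves this. See "linking request" for detailed usage.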
liburing can be switched to poll mode by setting the flag IORING_SETUP_SQPOLL. This mode spares the user from repeatedly calling io_uring_enter (a system call): a kernel thread keeps checking the submission queue for work. See "Submission Queue Polling" for detailed usage.
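Note that the number of kernel threads has to be kept under control, otherwise a large share of CPU cycles will be taken up by kthreads. To limit this cost, the kthread goes to sleep if it receives no work for a certain period of time. The next piece of work on the submission queue then has to go through the original path: io_uring_enter()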
When using liburing, you never directly call the io_uring_enter() system call. That is usually taken care of by liburing’s io_uring_submit() function. It automatically determines if you are using polling mode or not and deals with when your program needs to call io_uring_enter() without you having to bother about it.
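For readers working below liburing, the decision described above can be sketched as a pure function over the flag words. The helper name `sqpoll_enter_flags` is hypothetical; the constants come from `<linux/io_uring.h>`. In a real program the `sq_ring_flags` value would be read from the flags word of the mmap()ed SQ ring.

```c
#include <linux/io_uring.h>

/* Hypothetical helper: given the current SQ ring flags, compute the flags to
 * pass to io_uring_enter() in SQPOLL mode. A result of 0 means the poll
 * thread is awake and no system call is needed for submission. */
unsigned sqpoll_enter_flags(unsigned sq_ring_flags, int wait_for_completions)
{
    unsigned flags = 0;

    if (sq_ring_flags & IORING_SQ_NEED_WAKEUP)
        flags |= IORING_ENTER_SQ_WAKEUP;   /* poll thread went to sleep */
    if (wait_for_completions)
        flags |= IORING_ENTER_GETEVENTS;   /* caller also wants to wait on the CQ */
    return flags;
}
```

When the function returns a non-zero value, the caller would follow up with an `io_uring_enter()` call carrying those flags.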
If you use liburing directly you do not need to worry about this topic, but it matters a lot when working with the raw interface. Two operations are provided:
read_barrier(): Ensure previous writes are visible before doing subsequent memory reads
write_barrier(): Order this write after previous writes
The kernel will include a read_barrier() before reading the SQ ring tail, to ensure that the tail write from the application is visible. From the CQ ring side, since the consumer/producer roles are reversed, the application merely needs to issue a read_barrier() before reading the CQ ring tail to ensure it sees any writes made by the kernel.
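The sketch below shows one way the two primitives could be used on the consumer side of a CQ ring. Everything here is an assumption for illustration: the mapping of `read_barrier()` and `write_barrier()` onto C11 fences, the `mock_cq` struct (a real ring is mmap()ed from the kernel and has a different layout), and the helper name `reap_all`. Fence placement follows C11 rules, so it may not match kernel-style pseudocode line for line.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed mapping of the two primitives onto C11 fences. */
#define read_barrier()  atomic_thread_fence(memory_order_acquire)
#define write_barrier() atomic_thread_fence(memory_order_release)

/* Mock of a shared completion ring, for illustration only. */
struct mock_cq {
    _Atomic uint32_t head;   /* advanced by the application (consumer) */
    _Atomic uint32_t tail;   /* advanced by the kernel (producer) */
    uint32_t mask;           /* entries - 1, entries being a power of two */
    int32_t res[4];          /* stand-in for the CQE array */
};

/* Consume every available completion and return the sum of the results. */
static int64_t reap_all(struct mock_cq *cq)
{
    int64_t sum = 0;
    uint32_t head = atomic_load_explicit(&cq->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&cq->tail, memory_order_relaxed);

    read_barrier();          /* entries published before `tail` are visible from here on */
    while (head != tail) {
        sum += cq->res[head & cq->mask];
        head++;
    }
    write_barrier();         /* finish reading the entries before releasing their slots */
    atomic_store_explicit(&cq->head, head, memory_order_relaxed);
    return sum;
}
```

The producer side would mirror this: write the entries, issue a `write_barrier()`, then store the new tail.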
Interrupt-driven (non-polled) I/O relies on hardware interrupts to signal a completion event. When IO is polled, the application will repeatedly ask the hardware driver for status on a submitted IO request.
[Note] Real polling example
[Note] Submission queue polling only works in combination with fixed files (not fixed buffer)
A kernel thread actively checks whether there is anything on the SQ, which avoids calling the io_uring_enter syscall.
io_uring_setup
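Basic setup. The focus here is on what setup does for IORING_SETUP_SQPOLL: we expect to find where the kthread is created, what work the kthread does, and so on. From io_sq_offload_create we can tell that offload is related to the kthread.
Looking inside, we find create_io_thread, which forks through copy_process and starts the process with wake_up_new_task. The work that process performs is io_sq_thread.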
io_uring_enter
io_uring_enter is called after operations such as write/read have been prepared. Here we only look at its behavior in poll mode:
    if (ctx->flags & IORING_SETUP_SQPOLL) {
        io_cqring_overflow_flush(ctx, false);

        /* the SQ poll thread is gone: nothing can be submitted */
        ret = -EOWNERDEAD;
        if (unlikely(ctx->sq_data->thread == NULL))
            goto out;
        /* userspace saw the thread asleep and asks for a wakeup */
        if (flags & IORING_ENTER_SQ_WAKEUP)
            wake_up(&ctx->sq_data->wait);
        if (flags & IORING_ENTER_SQ_WAIT) {
            ret = io_sqpoll_wait_sq(ctx);
            if (ret)
                goto out;
        }
        /* the poll thread does the actual submission */
        submitted = to_submit;
    } else if ...
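When the kthread has been idle for too long, it goes to sleep on its own so that it does not hog the CPU. So when the flag IORING_ENTER_SQ_WAKEUP is set, the kthread has to be woken up.
liburing call sequence, submission side first and completion side second:
io_uring_queue_init -> alloc iov -> io_uring_get_sqe -> io_uring_prep_readv -> io_uring_sqe_set_data -> io_uring_submit
io_uring_wait_cqe -> io_uring_cqe_get_data -> io_uring_cqe_seen -> io_uring_queue_exit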
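io_uring_queue_init
struct io_uring_params
IORING_SQ_NEED_WAKEUP is set if the kernel thread is idle for more than sq_thread_idle milliseconds
Files are registered through io_uring_register using the IORING_REGISTER_FILES opcode
When IORING_SETUP_SQ_AFF is set, the poll thread is bound to the CPU given in the sq_thread_cpu field of the struct io_uring_params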
To use registered buffers, use the IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED opcodes in the submission queue entry and set the buf_index field to the desired buffer index
arg contains a pointer to an array of nr_args file descriptors
To use registered files, the IOSQE_FIXED_FILE flag must be set in the flags member of the struct io_uring_sqe, and the fd member is set to the index of the file in the file descriptor array
It is possible to use eventfd() to get notified of completion events on an io_uring instance.
io_uring_enter()
When IORING_ENTER_GETEVENTS is set, the call waits for the number of events specified in min_complete before returning
If the io_uring instance was configured for polling, by specifying IORING_SETUP_IOPOLL in the call to io_uring_setup(), then min_complete has a slightly different meaning. Passing a value of 0 instructs the kernel to return any events which are already complete, without blocking. If min_complete is a non-zero value, the kernel will still return immediately if any completion events are available. If no event completions are available, then the call will poll either until one or more completions become available, or until the process has exceeded its scheduler time slice.