refer: https://thenewstack.io/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/
A UDP receive system call can be divided into two stages: waiting for data to arrive, and copying that data to user space. The kernel offers the following I/O models for dealing with file descriptors, be they storage files or sockets:
1. Blocking I/O model
For the UDP receive function, when a user-mode task calls a blocking system call, the task waits until the kernel returns, that is, until the data has arrived and been copied to user space.
2. Nonblocking I/O
We can set a socket or fd to non-blocking mode. Once it is non-blocking, an I/O system call returns immediately, with an error such as EWOULDBLOCK if no data is ready.
3. I/O multiplexing
I/O multiplexing allows a single task to operate on many file descriptors through a specific system call (select(), poll() and epoll() are introduced later).
4. Signal driven I/O
Call recv() only when the signal handler is triggered.
We need to consider the bottom-half mechanism.
5. Asynchronous I/O
The kernel notifies the user task once the data is ready and has been copied into the application's buffer.
A synchronous I/O operation causes the requesting process to be blocked until that I/O operation completes.
An asynchronous I/O operation does not cause the requesting process to be blocked.
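The difference between the blocking and non-blocking models can be sketched with a UDP socket on the loopback interface. This is a minimal illustration, not code from the referenced article: the helper name `try_recv` and the payload are made up for the example, and `select.select()` is used only to wait until the datagram has arrived.

```python
import select
import socket

def try_recv(sock: socket.socket, size: int = 2048):
    """Return one datagram, or None if the call would have blocked."""
    try:
        return sock.recv(size)
    except BlockingIOError:
        # Stage 1 ("wait for data") is not finished, so the kernel
        # returns immediately instead of putting the task to sleep.
        return None

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
receiver.setblocking(False)       # switch the fd to non-blocking mode

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

first = try_recv(receiver)        # nothing has been sent yet
sender.sendto(b"ping", receiver.getsockname())

# Wait (up to 1 second) until the datagram is readable, then receive it.
select.select([receiver], [], [], 1.0)
second = try_recv(receiver)

sender.close()
receiver.close()
```

With a blocking socket, the first `recv()` would instead put the task to sleep until a datagram arrived.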
refer: https://hechao.li/2022/01/04/select-vs-poll-vs-epoll/
epoll() is not a single API but a group of 3 APIs (epoll_create(), epoll_add() and epoll_wait()). epoll_create() and epoll_add() are called to set up the epoll instance, while epoll_wait() can be called in a loop to constantly wait on the fds added by epoll_add(). Note that epoll_add() is shorthand here: in the real Linux API, fds are added with epoll_ctl() and the EPOLL_CTL_ADD operation.
add_monitor() triggers an external thread to constantly monitor all_fds and add the ready fds in it to ready_fds.
All 3 system calls (select(), poll() and epoll()) are used for I/O multiplexing rather than non-blocking I/O. However, epoll returns the list of ready fds, so the user task can act on them without blocking. In addition, epoll has a significant limitation: it works with network sockets and pipes, but not with regular storage files.
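As a rough illustration of the create / add / wait sequence, the sketch below uses Python's `select.epoll` wrapper (Linux only). The socket pair is just a stand-in for a network peer and is an assumption of this example, not something from the referenced article.

```python
import select
import socket

# Two connected sockets stand in for a remote peer (demo setup only).
left, right = socket.socketpair()
left.setblocking(False)

ep = select.epoll()                          # wraps epoll_create()
ep.register(left.fileno(), select.EPOLLIN)   # wraps epoll_ctl(EPOLL_CTL_ADD)

idle = ep.poll(timeout=0)                    # wraps epoll_wait(): nothing ready
right.send(b"hello")
ready = ep.poll(timeout=1.0)                 # only the ready fds are returned

received = []
for fd, events in ready:
    if fd == left.fileno() and events & select.EPOLLIN:
        received.append(left.recv(1024))     # will not block: fd is ready

ep.unregister(left.fileno())
ep.close()
left.close()
right.close()
```

In a server, the `poll()` call would sit in a loop and the registered set would hold many client sockets instead of one.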
For storage I/O, the blocking problem has classically been solved with thread pools: the main thread of execution dispatches the actual I/O to helper threads that will block and carry out the operation on the main thread’s behalf.
The kernel gained an asynchronous I/O interface (linux-aio) in Linux 2.6, but…
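The thread-pool approach can be sketched as follows. This is a minimal example under assumed names (`blocking_read`, a temporary file with made-up contents); a real application would tune the pool size and reuse file descriptors.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def blocking_read(path: str, offset: int, length: int) -> bytes:
    """Runs on a helper thread; pread() may block without stalling the caller."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)

# Create a small file to read from (demo data only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789abcdef")
    path = tmp.name

with ThreadPoolExecutor(max_workers=4) as pool:
    # The main thread only dispatches work and later collects the results.
    futures = [pool.submit(blocking_read, path, off, 4) for off in (0, 4, 8, 12)]
    chunks = [f.result() for f in futures]

os.unlink(path)
```

The cost of this design is one blocked thread per in-flight operation, plus the context switches needed to hand work back and forth.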
Mr. Torvalds's mail, replying to a patch that added asynchronous openat() support to AIO:
"Another blocking operation used by applications that want aio functionality is that of opening files that are not resident in memory. Using the thread based aio helper, add support for IOCB_CMD_OPENAT."
"So I think this is ridiculously ugly."
Linux AIO is indeed riddled with problems and limitations:
So that is why io_uring came along.
IO in PostgreSQL: Past, Present, Future (Andres Freund)
io_uring is an architecture for performance-oriented I/O systems. Its basic theory of operation is close to linux-aio (an interface to push work into the kernel, and another interface to retrieve completed work). But there are three differences:
Instances of those structures live in a shared memory single-producer-single-consumer ring buffer between the kernel and the application.
In user space, a task that wants to check whether work is ready just looks at the completion queue (CQE) ring buffer and consumes entries if they are ready. There is no need to enter the kernel (for example, through a receive system call) to consume those entries.
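To make the ring-buffer idea concrete, here is a toy single-producer/single-consumer ring in plain Python. It is only a model: the class name `SpscRing` and the dictionary entries are invented for this sketch, and the real io_uring rings live in memory shared with the kernel and rely on memory barriers that Python does not expose.

```python
class SpscRing:
    """Toy single-producer/single-consumer ring; size must be a power of two."""

    def __init__(self, size: int):
        assert size > 0 and size & (size - 1) == 0
        self.entries = [None] * size
        self.mask = size - 1
        self.head = 0   # advanced only by the consumer
        self.tail = 0   # advanced only by the producer

    def push(self, item) -> bool:
        if self.tail - self.head == len(self.entries):
            return False                       # ring is full
        self.entries[self.tail & self.mask] = item
        self.tail += 1                         # publish after the slot is written
        return True

    def pop(self):
        if self.head == self.tail:
            return None                        # ring is empty: nothing completed
        item = self.entries[self.head & self.mask]
        self.head += 1                         # hand the slot back to the producer
        return item

cq = SpscRing(4)
for res in (512, 512, -5):          # the "kernel" side posts completion results
    cq.push({"res": res})

drained = []
while (cqe := cq.pop()) is not None:   # application side: plain memory reads
    drained.append(cqe["res"])
```

Because each index has exactly one writer, the two sides never need a lock; checking for completions is just a comparison of `head` and `tail`.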
Avi Kivity at Core C++ 2019: There are good reasons why network programming is done asynchronously.
io_uring is slightly similar to aio, but io_uring brings the power of asynchronous operations to anyone, instead of confining it to specialized database applications.
Define the descriptor structure for the io_uring interface.
dispatch_reads will submit the read requests to io_uring through liburing. This is the only system call we need to make with io_uring.
Then we can check which read descriptors are ready and process them. Because io_uring uses a shared-memory interface, no system calls are needed to consume those events. The user just has to be careful to tell the io_uring interface that the events were consumed.
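The submit-then-reap flow can be modeled in user space as below. This is a simulation, not liburing: `ReadDescriptor`, `fake_kernel_submit` and `consume_reads` are hypothetical names, and the two deques only stand in for the submission and completion rings. In liburing itself, the batch would be handed over with `io_uring_submit()` and each consumed completion acknowledged with `io_uring_cqe_seen()`.

```python
import os
import tempfile
from collections import deque

class ReadDescriptor:
    """Hypothetical per-request descriptor: what to read and where the result goes."""
    def __init__(self, fd: int, offset: int, length: int):
        self.fd, self.offset, self.length = fd, offset, length
        self.result = b""

sq = deque()   # submission queue: application -> "kernel"
cq = deque()   # completion queue: "kernel" -> application

def fake_kernel_submit() -> int:
    """Stand-in for the single submit system call: drains sq and fills cq."""
    count = 0
    while sq:
        desc = sq.popleft()
        desc.result = os.pread(desc.fd, desc.length, desc.offset)
        cq.append(desc)
        count += 1
    return count

def dispatch_reads(descs) -> int:
    """Queue every read first, then hand the whole batch over in one call."""
    sq.extend(descs)
    return fake_kernel_submit()

def consume_reads():
    """Reap completions; popping an entry marks it as consumed."""
    done = []
    while cq:
        done.append(cq.popleft())
    return done

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"AAAABBBBCCCC")
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
submitted = dispatch_reads([ReadDescriptor(fd, off, 4) for off in (0, 4, 8)])
results = [d.result for d in consume_reads()]
os.close(fd)
os.unlink(path)
```

The point of the model is the shape of the API: many requests, one submission call, and a completion path that touches only shared memory.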
io_uring offers a plethora of advanced features for specialized use cases.
Performance is a big issue in this article. How do we compare the performance of these interfaces?
For example, ScyllaDB, a user of linux-aio, gains little, because linux-aio and io_uring share a similar architecture.
The article evaluates 4 different interfaces to compare performance:
In the first test, all I/O interfaces hit the storage directly and do not use the operating system page cache at all.
We then ran the tests with the Direct I/O flags, which should be the bread and butter for linux-aio. The test is conducted on NVMe storage (flash, SSD…) that should be able to read at 3.5M IOPS. We used 8 CPUs to run 72 fio jobs, each issuing random reads across four files with an iodepth of 8. This makes sure that the CPUs run at saturation for all backends and will be the limiting factor in the benchmark. This allows us to see the behavior of each interface at saturation. Note that with enough CPUs all interfaces would be able to at some point achieve the full disk bandwidth. Such a test wouldn’t tell us much.
The next test shows the strength of io_uring with buffered I/O.
In a second test, we preloaded all the memory with the data in the files and proceeded to issue the same random reads. Everything is equal to the previous test, except we now use buffered I/O and expect the synchronous interface to never block — all results are coming from the operating system page cache, and none from storage.
Linux-aio, which is not designed for buffered I/O at all, actually becomes a synchronous interface when used with buffered I/O files.
Reading 512-byte buffers from an Intel Optane device from a single CPU. Parallelism of 1000 in-flight requests. There is very little difference between linux-aio and io_uring for the basic interface. But when advanced features are used, a 5% difference is seen.
reference:
I/O Model
https://kernel.dk/io_uring.pdf
https://www.graplsecurity.com/post/iou-ring-exploiting-the-linux-kernel
Extended Berkeley Packet Filter (eBPF)
The original BPF allows the user to specify rules that will be applied to network packets as they flow through the network. This has been part of Linux for years.
But when BPF was extended, it allowed users to add code that is executed by the kernel in a safe manner at various points of its execution, not only in the network code.
We can use eBPF to trace user-space code and what happens in the kernel while that code runs. In addition, eBPF shows its strength in performance analysis and monitoring when combined with io_uring.
https://lore.kernel.org/io-uring/s7bbl9pp39g.fsf@dokucode.de/T/
we are operating-system researchers from Germany and noticed the LWN
article about the combination of io_uring and eBPF programs. We think
that high-performance system-call interfaces in combination with
programability are the perfect substrate for system-call clustering.
Because eBPF probes run in kernel space, they can do complex analysis, collect more information, and then only return to the application with summaries and final conclusions.
Here are some examples of what those tools can do:
- Trace how much time an application spends sleeping, and what led to those sleeps. (wakeuptime)
- Find all programs in the system that reached a particular place in the code (trace)
- Analyze network TCP throughput aggregated by subnet (tcpsubnet)
- Measure how much time the kernel spent processing softirqs (softirqs)
- Capture information about all short-lived files, where they come from, and for how long they were opened (filelife)
The article mentions that we can get more detailed information through io_uring, which provides a shared-memory region between user space and kernel space.
io_uring supports linking operations, but there is no way to generically pass the result of one system call to the next. With a simple bpf program, the application can tell the kernel how the result of open is to be passed to read — including the error handling, which then allocates its own buffers and keeps reading until the entire file is consumed and finally closed: we can checksum, compress, or search an entire file with a single system call.
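To show what such a chain has to express, here is the same open, read-until-EOF, close sequence written as ordinary user-space Python. It is neither BPF nor io_uring, and the function name `checksum_file` is made up for this sketch; it only spells out how each step's result feeds the next and where the error handling sits. In this form every step is a separate system call, which is exactly the overhead the combined approach aims to remove.

```python
import os
import tempfile
import zlib

def checksum_file(path: str, bufsize: int = 4096) -> int:
    """Return the CRC-32 of a file, or a negative errno if open fails."""
    try:
        fd = os.open(path, os.O_RDONLY)    # step 1: the fd feeds every read
    except OSError as exc:
        return -exc.errno                  # error handling: stop the chain here
    crc = 0
    try:
        while True:
            chunk = os.read(fd, bufsize)   # step 2: repeat until end of file
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)   # fold each chunk into the checksum
    finally:
        os.close(fd)                       # step 3: always close the file
    return crc

data = b"io_uring and eBPF " * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

file_crc = checksum_file(path)
missing = checksum_file("/nonexistent/file")
os.unlink(path)
```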