---
tags: SDE-good-read
topic: How io_uring and eBPF Will Revolutionize Programming in Linux
---

# How io_uring and eBPF Will Revolutionize Programming in Linux

refer: https://thenewstack.io/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

## io_uring

### Classic IO system call

A UDP receive system call can be divided into two stages: waiting for data to arrive, and copying the data to user space. The kernel offers the following **blocking** system calls to deal with file descriptors, be they storage files or sockets:

```c=
ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);
```

### Blocking system call?

**1. Blocking I/O model**

For the UDP receive function, when a user-mode task issues a blocking system call, the task waits until the kernel returns.

![](https://i.imgur.com/oClsEid.png)

**2. Nonblocking I/O**

We can set a socket/fd to nonblocking. Once it is set nonblocking, the I/O system call returns immediately, whether or not data is available.

![](https://i.imgur.com/Wlk5kIQ.png)

**3. I/O multiplexing**

I/O multiplexing allows a single task to operate on many file descriptors through a dedicated system call (we will introduce poll()/epoll()/select() later).

#### Pros and Cons
1. Pros: deal with many fds in one task
2. Cons: two system calls (select() + recv())

![](https://i.imgur.com/C5JAgRY.png)

**4. Signal driven I/O**

Call recv() only when the signal handler is triggered. We need to consider the [**bottom-half**](https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch09s05.html) mechanism.

![](https://i.imgur.com/KPQLUAs.png)

**5. Asynchronous I/O**

The kernel notifies the user task only after the data is ready and has already been copied into the designated application buffer.
![](https://i.imgur.com/Us35R9R.png)

#### Asynchronous I/O vs Nonblocking I/O

>A synchronous I/O operation causes the requesting process to be blocked until that I/O operation completes. An asynchronous I/O operation does not cause the requesting process to be blocked.

![](https://i.imgur.com/tuyOW0O.png)

### select vs poll vs epoll

refer: https://hechao.li/2022/01/04/select-vs-poll-vs-epoll/

#### select()
1. The caller passes a set of fds.
2. The caller has to reset the fd set on every select call.
3. The complexity of the inner loop is O(max_fd + 1). e.g. if the fds are {1, 10, 1023}, then the loop size is 1024 instead of 3.

```c=
// Returns true if fd is ready for I/O.
bool is_ready(int fd);

struct fd_info {
  int fd;
  bool ready;
};

int select(set<fd_info> fds, int max_fd) {
  int ready_cnt = 0;
  while (ready_cnt == 0) {
    // Scan every fd from 0 to max_fd, even those not in the set.
    for (int i = 0; i < max_fd + 1; i++) {
      auto it = fds.find(i);
      if (it != fds.end() && is_ready(i)) {
        it->ready = true;
        ready_cnt++;
      }
    }
  }
  return ready_cnt;
}
```

#### poll()
1. The caller no longer needs to reset the fds per call, because poll itself resets the ready flag of any unready fds.
2. The complexity of the inner loop is O(n), where n is the number of fds to monitor. If the fds are {1, 10, 1023}, then the complexity is O(3).
3. In the Linux source, both the select and poll implementations live in fs/select.c because they use the same underlying kernel poll functions.

```c=
// Returns true if fd is ready for I/O.
bool is_ready(int fd);

struct fd_info {
  int fd;
  bool ready;
};

int poll(struct fd_info* fds, int nfds) {
  int ready_cnt = 0;
  while (ready_cnt == 0) {
    // Only scan the fds the caller asked about.
    for (int i = 0; i < nfds; i++) {
      if (is_ready(fds[i].fd)) {
        fds[i].ready = true;
        ready_cnt++;
      } else {
        fds[i].ready = false;
      }
    }
  }
  return ready_cnt;
}
```

#### epoll()
1. `epoll()` is not a single API but a group of 3 APIs (`epoll_create()`, `epoll_add()` and `epoll_wait()`; in the real kernel API, `epoll_add()` corresponds to `epoll_ctl()` with `EPOLL_CTL_ADD`).
2. 
`epoll_create()` and `epoll_add()` are called to set up the epoll instance, while `epoll_wait()` can be called in a loop to constantly wait on the fds added by `epoll_add()`.
3. The complexity of the inner loop is O(ready fds). The worst case is still O(n), like poll. However, when the ready fds are much fewer than the fds being monitored, epoll performs better than poll. In other words, even when two algorithms are both O(n), in practice n=3 versus n=10000 can matter a lot.

>`add_monitor()` triggers an external thread to constantly monitor all_fds and add ready fds in it to ready_fds.

```c=
// Start monitoring fds in `all_fds` and constantly add ready ones to
// `ready_fds`.
void add_monitor(const vector<int>& all_fds, vector<int>& ready_fds);

struct fd_info {
  int fd;
  bool ready;
};

struct epoll_info {
  vector<int> all_fds;
  vector<int> ready_fds;
};

map<int, epoll_info> epoll_info_by_epoll_id;

// Create an epoll instance and return its id.
int epoll_create() {
  return epoll_info_by_epoll_id.size();
}

// Add an fd to monitor to the epoll instance.
void epoll_add(int epoll_id, int fd) {
  epoll_info_by_epoll_id[epoll_id].all_fds.push_back(fd);
}

// Wait until at least one fd is ready. Return the number of ready fds.
// After the function returns, the first `ready_cnt` entries of `ready_fds`
// contain ready fds. The rest can be ignored.
int epoll_wait(int epoll_id, struct fd_info* ready_fds) {
  int ready_cnt = 0;
  struct epoll_info& info = epoll_info_by_epoll_id[epoll_id];
  add_monitor(info.all_fds, info.ready_fds);
  while (ready_cnt == 0) {
    // Only the fds that the monitor reported ready are scanned.
    ready_cnt = info.ready_fds.size();
    for (int i = 0; i < ready_cnt; i++) {
      ready_fds[i].fd = info.ready_fds[i];
      ready_fds[i].ready = true;
    }
  }
  return ready_cnt;
}
```

---

All 3 system calls are used for I/O multiplexing rather than nonblocking I/O. However, epoll returns only the list of ready fds, so the user task does not waste time scanning unready ones. In addition, epoll has a significant limitation: it only supports fds backed by the kernel's poll machinery, such as network sockets and pipes, not regular storage files.
![](https://i.imgur.com/8PKpR19.png)

For storage I/O, the blocking problem has classically been solved with thread pools: the main thread of execution dispatches the actual I/O to helper threads that will block and carry the operation on the main thread's behalf.

![](https://i.imgur.com/hh3yyOa.png)

### Asynchronous I/O

The kernel gained an asynchronous I/O interface (linux-aio) in Linux 2.6, but...

[Linus Torvalds' mail](https://lwn.net/Articles/671657/?utm_source=thenewstack&utm_medium=website&utm_campaign=platform)

> Another blocking operation used by applications that want aio
> functionality is that of opening files that are not resident in memory.
> Using the thread based aio helper, add support for IOCB_CMD_OPENAT.
>
>So I think this is ridiculously ugly.

Linux AIO is indeed rigged with problems and limitations:
1. Linux-aio only works for O_DIRECT files, rendering it virtually useless for normal, non-database applications.
2. The interface is not designed to be extensible. Although it is possible (we did extend it), every new addition is complex.
3. Although the interface is technically non-blocking, there are many reasons that can lead it to block, often in ways that are impossible to predict.

That is why io_uring came along.

[IO in PostgreSQL: Past, Present, Future (Andres Freund)](https://youtu.be/3Oj7fBAqVTw?t=1828)

### What is io_uring

io_uring is an architecture for performance-oriented I/O. Its basic theory of operation is close to linux-aio: one interface to push work into the kernel, and another interface to retrieve completed work. But there are three key differences:
1. truly asynchronous
2. supports multiple kinds of I/O, not just O_DIRECT storage reads
3. flexible and extensible

### Two main structures in io_uring

Instances of these structures live in shared-memory single-producer-single-consumer ring buffers between the kernel and the application.
1. submission queue entry (sqe)
2. 
completion queue entry (cqe)

![](https://i.imgur.com/dNipbTE.png)

When a user-space task wants to check whether work is ready, it just looks at the **cqe ring buffer** and consumes entries if they are ready. There is no need to enter the kernel (e.g., via a recv() system call) to consume those entries.

[Avi Kivity at Core C++ 2019](https://www.scylladb.com/2020/03/26/avi-kivity-at-core-c-2019/?utm_source=thenewstack&utm_medium=website&utm_campaign=platform): There are good reasons why network programming is done asynchronously.

### aio vs io_uring

io_uring is broadly similar to aio, but io_uring brings the power of asynchronous operations to everyone, instead of confining it to specialized database applications.

### How to use io_uring

Define a descriptor structure for the io_uring interface.

```c=
/* Describes what we need from a read */
struct read_descriptor {
  int fd;
  char *buf;
  unsigned long long pos;
  unsigned long long size;
  int result;
};
```

`dispatch_reads` submits the read requests to io_uring through [liburing](https://unixism.net/loti/ref-liburing/submission.html). This is the **only system call** we need in io_uring.

```c=
/*
 * Given an array of struct read_descriptors, dispatch them in the
 * io_uring.
 */
int dispatch_reads(struct io_uring *ring,
                   struct read_descriptor *descv,
                   int nr_desc)
{
  int i;
  for (i = 0; i < nr_desc; i++) {
    struct io_uring_sqe *sqe;
    struct read_descriptor *desc = &descv[i];
    sqe = io_uring_get_sqe(ring);
    /* Each operation has its own prep function */
    io_uring_prep_read(sqe, desc->fd, desc->buf, desc->size, desc->pos);
    /*
     * Whatever value we put here will be reflected when the operation
     * completes. This is how we know which read we are talking about.
     */
    io_uring_sqe_set_data(sqe, desc);
  }
  /* For all of the reads above, there will be only one system call! */
  return io_uring_submit(ring);
}
```

Then we can check which read descriptors are ready and process them.
Because this is a shared-memory interface, no system calls are needed to consume those events. The user just has to be careful to tell the io_uring interface that the events were consumed.

```c=
/*
 * Consume reads that are available and return how many were consumed.
 * System calls issued: ZERO!
 */
unsigned consume_reads(struct io_uring *ring)
{
  unsigned completed = 0;
  unsigned head;
  struct io_uring_cqe *cqe;

  io_uring_for_each_cqe(ring, head, cqe) {
    completed++;
    /* Remember what we passed to io_uring_sqe_set_data? It's here. */
    struct read_descriptor *desc = (struct read_descriptor *)cqe->user_data;
    desc->result = cqe->res;
  }
  /* Tell the kernel we consumed `completed` entries from the cq ring. */
  io_uring_cq_advance(ring, completed);
  return completed;
}
```

io_uring offers a plethora of advanced features for specialized use cases:
1. Pre-registered file descriptors and buffer registration
2. Poll ring: for very fast devices, the cost of processing interrupts is substantial. **io_uring allows the user to turn off those interrupts and consume all available events through polling.**
3. Linked operations: allow the user to submit operations that depend on each other.

### Performance

Performance is a central question in this article: how do we compare the interfaces? For example, [ScyllaDB](https://www.scylladb.com/), a heavy user of linux-aio, does not automatically benefit, because aio and io_uring share the same basic design. The article evaluates 4 different interfaces:
1. synchronous reads
2. posix-aio (which is implemented as a thread pool)
3. linux-aio
4. io_uring

#### First test: Storage I/O

In the first test, all I/O interfaces hit the storage directly and do not use the operating system page cache at all.

>We then ran the tests with the Direct I/O flags, which should be the bread and butter for linux-aio.

The test was conducted on NVMe storage (flash/SSD) that should be able to read at 3.5M IOPS. We used 8 CPUs to run 72 fio jobs, each issuing random reads across four files with an iodepth of 8.
This makes sure that the CPUs run at saturation for all backends and are the limiting factor in the benchmark, which lets us see the behavior of each interface at saturation. Note that with enough CPUs, all interfaces would at some point achieve the full disk bandwidth; such a test wouldn't tell us much.

![](https://i.imgur.com/Wpe3sdi.png)

#### Second test: buffered I/O

This test shows the strength of io_uring, but with **buffered I/O**.

>In a second test, we preloaded all the memory with the data in the files and proceeded to issue the same random reads. Everything is equal to the previous test, except we now use buffered I/O and expect the synchronous interface to never block — all results are coming from the operating system page cache, and none from storage.

![](https://i.imgur.com/FobOWMB.png)

Linux-aio, which is not designed for buffered I/O at all, actually becomes a synchronous interface when used with buffered I/O files.

#### ScyllaDB and io_uring

>Reading 512-byte buffers from an Intel Optane device from a single CPU. Parallelism of 1000 in-flight requests. There is very little difference between linux-aio and io_uring for the basic interface. But when advanced features are used, a 5% difference is seen.

>![](https://i.imgur.com/yg6WMi7.png)

### Source Code in kernel

reference:
[I/O Model](https://medium.com/@clu1022/%E6%B7%BA%E8%AB%87i-o-model-32da09c619e6)
https://kernel.dk/io_uring.pdf
https://www.graplsecurity.com/post/iou-ring-exploiting-the-linux-kernel

## eBPF

Extended Berkeley Packet Filter (eBPF)

>The original BPF allows the user to specify rules that will be applied to network packets as they flow through the network. This has been part of Linux for years.
>
>But when BPF got extended, **it allowed users to add code that is executed by the kernel in a safe manner in various points of its execution**, not only in the network code.

We can use eBPF to trace user-space code as well as what happens in the kernel while that code runs.
In addition, eBPF shines at performance analysis and monitoring, and it combines well with io_uring.

https://lore.kernel.org/io-uring/s7bbl9pp39g.fsf@dokucode.de/T/
> we are operating-system researchers from Germany and noticed the LWN
> article about the combination of io_uring and eBPF programs. We think
> that high-performance system-call interfaces in combination with
> programability are the perfect substrate for system-call clustering.

Because eBPF probes run in kernel space, they can do complex analysis, collect more information, and then return to the application with only summaries and final conclusions.

>Here are some examples of what those tools can do:
>1. Trace how much time an application spends sleeping, and what led to those sleeps. (wakeuptime)
>2. Find all programs in the system that reached a particular place in the code (trace)
>3. Analyze network TCP throughput aggregated by subnet (tcpsubnet)
>4. Measure how much time the kernel spent processing softirqs (softirqs)
>5. Capture information about all short-lived files, where they come from, and for how long they were opened (filelife)

The article also notes that eBPF can reach deeper into io_uring, the shared-memory interface between user space and kernel space:

>io_uring supports linking operations, but there is no way to generically pass the result of one system call to the next. With a simple bpf program, the application can tell the kernel how the result of open is to be passed to read — including the error handling, which then allocates its own buffers and keeps reading until the entire file is consumed and finally closed: we can checksum, compress, or search an entire file with a single system call.