Esteban Blanc
Mail: estblcsk@gmail.com
Telephone: +33695597326
Matrix: @skallwar:matrix.org
GitHub: @Skallwar
Over the last couple of years, a lot of development effort has gone into two kernel subsystems: BPF and io_uring.
– BPF meets io_uring
gVisor does not currently support the io_uring interface, which came out on May 5, 2019 with Linux 5.1. Because of its performance improvements (see section 9) over the old AIO interface introduced in Linux 2.5, io_uring has grown in popularity extremely fast, and some big projects such as QEMU are already using it.
It is just a matter of time before applications require gVisor to support it, so we might as well start adding that support right now.
The goal of this proposal is to implement a part of the io_uring interface, notably the io_uring_setup and io_uring_enter syscalls and a subset of opcodes. This is the first step towards full io_uring support in gVisor.
Most of the material presented will be explained without all the different subtleties. I will try to explain the simpler or common case, because this proposal does not aim to implement the entirety of the io_uring interface.
Linux already had an asynchronous I/O interface named AIO, but it had too many limitations (most notably, it only properly supports files opened with O_DIRECT, i.e. unbuffered I/O). Over time, developers have tried to lift these limitations, but the end result was an even more complex API. It was then decided to create a new interface from scratch: io_uring.
io_uring was designed to be easy to use, efficient, and scalable.
The io_uring API is centered around two ring buffers shared by the application and the kernel. One ring is named the submission queue (SQ) and the other is named the completion queue (CQ). Since each ring has a single producer and a single consumer, there is no need for locking on either of them.
To create these two ring buffers, an application needs to call the io_uring_setup syscall, which fills the struct io_uring_params given by the p argument and returns a file descriptor for this ring. The kernel is responsible for reserving the memory space necessary for the two queues, but the application then needs to map them into its own address space using mmap (see section 5.0).
There are 3 zones to map at different offsets in the file descriptor:
- the submission queue ring (offset IORING_OFF_SQ_RING), which holds the SQ head and tail indices and the submission queue array of indices;
- the completion queue ring (offset IORING_OFF_CQ_RING), which holds the CQ head and tail indices and the array of struct io_uring_cqe;
- the submission queue entries themselves, an array of struct io_uring_sqe (offset IORING_OFF_SQES).
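The following is a minimal sketch, in Go, of how a userland application could perform these three mappings. It assumes fd was returned by io_uring_setup and that the region sizes have already been derived from the io_uring_params filled by the kernel; the IORING_OFF_* values come from the Linux UAPI header.

```go
// Minimal sketch of the three io_uring mappings from a Go userland program.
package iouringdemo

import "golang.org/x/sys/unix"

// Mapping offsets from include/uapi/linux/io_uring.h.
const (
	ioringOffSQRing int64 = 0
	ioringOffCQRing int64 = 0x8000000
	ioringOffSQEs   int64 = 0x10000000
)

func mapRings(fd int, sqRingSize, cqRingSize, sqesSize int) (sqRing, cqRing, sqes []byte, err error) {
	prot := unix.PROT_READ | unix.PROT_WRITE
	flags := unix.MAP_SHARED | unix.MAP_POPULATE

	// SQ ring: head/tail indices plus the submission queue array of indices.
	if sqRing, err = unix.Mmap(fd, ioringOffSQRing, sqRingSize, prot, flags); err != nil {
		return
	}
	// CQ ring: head/tail indices plus the array of struct io_uring_cqe.
	if cqRing, err = unix.Mmap(fd, ioringOffCQRing, cqRingSize, prot, flags); err != nil {
		return
	}
	// The array of struct io_uring_sqe itself.
	sqes, err = unix.Mmap(fd, ioringOffSQEs, sqesSize, prot, flags)
	return
}
```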
Now that the two ring buffers and the submission array are mapped into the application's address space, the application is all set to start submitting syscalls and receiving their results.
For the submission queue, the application is the producer, and the kernel is the consumer.
The submission queue is filled with struct io_uring_sqe entries, each representing a syscall the application wants executed. The main fields of the struct are:
- opcode: the operation code of this request
- flags: modifier flags (see the IOSQE_* flags)
- fd: the file descriptor of this request
- user_data: used to identify this submission
The other fields are used depending on the requested operation.
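As a reference for the layout, here is a simplified Go rendering of struct io_uring_sqe (64 bytes in the C ABI). This is only a sketch, not gVisor code; the trailing union and padding bytes are kept opaque.

```go
// Simplified Go view of struct io_uring_sqe (64 bytes in the C ABI).
type IOUringSqe struct {
	Opcode   uint8    // operation code of this request (IORING_OP_*)
	Flags    uint8    // IOSQE_* modifier flags
	Ioprio   uint16   // request priority, rarely used
	Fd       int32    // file descriptor this request operates on
	Off      uint64   // offset; meaning depends on the opcode
	Addr     uint64   // buffer address; meaning depends on the opcode
	Len      uint32   // buffer size or number of iovecs
	OpFlags  uint32   // per-opcode flags (a union in the C struct)
	UserData uint64   // opaque value echoed back in the completion entry
	Pad      [24]byte // remaining union/padding bytes, kept opaque here
}
```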
When the kernel must process a new syscall, it needs to read an index into the sqes ring buffer (where the struct io_uring_sqe has been filled by the userland application) from the submission queue array.
For the completion queue, the kernel is the producer, and the application is the consumer.
When a syscall completes, the kernel needs to add its result to the completion queue. The result takes the form of a struct io_uring_cqe, which contains:
- user_data: used to determine which submission this completion corresponds to; this is the same value as in the struct io_uring_sqe
- res: the result code of the syscall
- flags: metadata associated with the operation
The application pops the completion struct from the completion queue.
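For comparison, the completion entry is much smaller; a Go rendering of struct io_uring_cqe (16 bytes) would look like this:

```go
// Go view of struct io_uring_cqe (16 bytes in the C ABI).
type IOUringCqe struct {
	UserData uint64 // copied from the matching sqe, identifies the submission
	Res      int32  // result code of the syscall (negative errno on failure)
	Flags    uint32 // metadata associated with the operation
}
```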
Now that we know how to set up the two queues and read/write events in them, we need to notify the kernel. This is done through the io_uring_enter syscall. When the application wants the kernel to process new syscalls, the to_submit argument tells the kernel that up to that many sqes are ready to be consumed. When the application is waiting for request completions, it sets IORING_ENTER_GETEVENTS in flags, and the kernel is told to return once at least min_complete requests have been completed.
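Here is a small sketch, assuming a Go userland program, of what the raw io_uring_enter call looks like; the IORING_ENTER_GETEVENTS value comes from the Linux UAPI header.

```go
// Sketch of a raw io_uring_enter call from Go userland.
package iouringdemo

import "golang.org/x/sys/unix"

// IORING_ENTER_GETEVENTS from include/uapi/linux/io_uring.h.
const ioringEnterGetEvents = 1 << 0

// enter submits up to toSubmit sqes and, if wait is true, blocks until at
// least minComplete completions are available. It returns the number of sqes
// consumed by the kernel.
func enter(fd int, toSubmit, minComplete uint32, wait bool) (int, error) {
	var flags uint32
	if wait {
		flags |= ioringEnterGetEvents
	}
	n, _, errno := unix.Syscall6(unix.SYS_IO_URING_ENTER,
		uintptr(fd), uintptr(toSubmit), uintptr(minComplete), uintptr(flags),
		0, 0) // sigset pointer and its size, unused here
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}
```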
The goal of this proposal is to implement the io_uring interface in gVisor. There are 34 different opcodes as of the time of writing (Linux 5.10.26), and because of time constraints, I will not be able to implement all of them, only a subset. As described by the "Implement io_uring" section of the gVisor Project Ideas for Google Summer of Code 2021 webpage, this will imply the creation of two new syscalls in the gVisor sentry:
- io_uring_setup
- io_uring_enter
As the io_uring API has some similarities with the epoll one, I took a lot of information from the epoll implementation and built my design around it.
Adding a syscall to the sentry is quite easy. It is done by adding a new entry to s.Table in pkg/sentry/syscalls/linux/vfs2/vfs2.go, keyed by the syscall number and assigned to syscalls.PartiallySupported(). This way gVisor will be able to call a function to handle the corresponding syscall.
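As a rough illustration (assuming the current shape of Override() and of the syscalls.PartiallySupported constructor; the IOUringSetup and IOUringEnter handler names are hypothetical), the new entries could look like this:

```go
// Hypothetical additions to Override() in pkg/sentry/syscalls/linux/vfs2/vfs2.go.
// 425 and 426 are the amd64 syscall numbers for io_uring_setup and io_uring_enter.
s.Table[425] = syscalls.PartiallySupported("io_uring_setup", IOUringSetup,
	"Only a subset of setup flags is supported.", nil)
s.Table[426] = syscalls.PartiallySupported("io_uring_enter", IOUringEnter,
	"Only a subset of opcodes is supported.", nil)
```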
io_uring_setup
This will be the first building block of the project. As this is the first syscall of the io_uring interface that will be called by userland, it must be done first.
The goal of this syscall, as described in the "How it works" section, is to set up a shared memory space between userland and gVisor and return a file descriptor to access it.
First of all, we need to store the io_uring_params of this new io_uring instance. This will be stored in a new type named IOUringInstance in pkg/sentry/vfs/io_uring.go.
In order to create a new file descriptor out of thin air, I will create a new function analogous to NewEpollInstanceFD, which returns a FileDescription; a call to NewFDFromVFS2 then turns it into a new file descriptor, as is done in EpollCreate.
I will then fill in the given struct io_uring_params and return the new file descriptor.
At this stage, the new file descriptor could be returned to userland.
But before we do so, we need to set up a new goroutine (using the go keyword) that will poll a channel for new operations to execute. This is the heart of io_uring, but we will talk more about how it works later on, in the io_uring_enter section. For now, we just need to create a new submission channel carrying io_uring_sqe structs and hand it to the goroutine we spawn.
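A hypothetical sketch of this per-ring state and its worker goroutine follows; IOUringSqe is the Go rendering of struct io_uring_sqe shown earlier, and all names here are illustrative rather than the final gVisor design.

```go
// IOUringInstance holds the state created by io_uring_setup for one ring.
type IOUringInstance struct {
	// FileDescription, io_uring_params copy and ring mappings omitted here.

	// submissions carries pointers to sentry-side copies of accepted sqes.
	submissions chan *IOUringSqe
}

// newIOUringInstance allocates the instance and spawns the worker goroutine.
func newIOUringInstance(queueDepth uint32) *IOUringInstance {
	iu := &IOUringInstance{submissions: make(chan *IOUringSqe, queueDepth)}
	// The goroutine lives as long as the ring and polls for new operations.
	go iu.run()
	return iu
}

// run polls the channel for new operations to execute.
func (iu *IOUringInstance) run() {
	for sqe := range iu.submissions {
		// Opcode dispatch; sketched in the io_uring_enter section below.
		iu.process(sqe)
	}
}
```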
NewIOUringInstanceFD
NewIOUringInstanceFD will be located in pkg/sentry/vfs/io_uring.go. It will create a new anonymous VFS dentry for the ring, then create a new IOUringInstance containing a FileDescription initialized through the Init method of FileDescription.
This function will return a reference to the FileDescription contained in the IOUringInstance.
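A rough sketch of this constructor, modeled on NewEpollInstanceFD in pkg/sentry/vfs/epoll.go, might look as follows; the chosen options, the "[io_uring]" dentry name and the vfsfd field (the embedded FileDescription omitted from the earlier sketch) are assumptions, not final code.

```go
// Sketch only: modeled on NewEpollInstanceFD; exact options and flags are
// placeholders until the real implementation is written.
func (vfs *VirtualFilesystem) NewIOUringInstanceFD(ctx context.Context) (*FileDescription, error) {
	// Anonymous dentry backing the ring file descriptor.
	vd := vfs.NewAnonVirtualDentry("[io_uring]")
	defer vd.DecRef(ctx)

	iu := &IOUringInstance{}
	if err := iu.vfsfd.Init(iu, linux.O_RDWR, vd.Mount(), vd.Dentry(), &FileDescriptionOptions{
		UseDentryMetadata: true,
	}); err != nil {
		return nil, err
	}
	return &iu.vfsfd, nil
}
```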
io_uring_enter
Once io_uring_setup has been correctly implemented (or, should I say, once the userland process no longer gets MAP_FAILED when calling mmap on the returned file descriptor, as described in the "How it works" section, nor a segfault while writing to the mapped memory), I will implement the io_uring_enter syscall.
The io_uring API is used through a file descriptor, but we will be able to easily find the underlying IOUringInstance from the file descriptor with a call to GetFileVFS2, which returns a FileDescription. From there we can call Impl to get the underlying FileDescriptionImpl, which can be cast to an IOUringInstance.
At this stage, the submission queue will not be empty anymore and I will need to be able to read the io_uring_sqe structs in it. To read them correctly, padding and type sizes must be respected strictly. The biggest problem here is that the io_uring_sqe struct contains unions. Go does not have a union type, so the best solution is to use an array of u8, even if it is not very pleasant.
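As an illustration (a hypothetical helper, assuming the standard 64-byte layout of struct io_uring_sqe and a little-endian host), the fixed fields can be decoded from the raw bytes while the union parts stay opaque until the opcode is known:

```go
package iouring

import "encoding/binary"

const sqeSize = 64 // sizeof(struct io_uring_sqe)

// sqeView keeps the whole entry as raw bytes and decodes only the fields with
// a fixed meaning; union fields are interpreted later, per opcode.
type sqeView struct {
	Opcode   uint8
	Flags    uint8
	Fd       int32
	UserData uint64
	Raw      [sqeSize]byte
}

// decodeSQE extracts the idx-th entry from the mapped sqes area.
func decodeSQE(sqes []byte, idx uint32) sqeView {
	slot := sqes[int(idx)*sqeSize : int(idx+1)*sqeSize]
	var v sqeView
	copy(v.Raw[:], slot)
	v.Opcode = slot[0]
	v.Flags = slot[1]
	v.Fd = int32(binary.LittleEndian.Uint32(slot[4:8]))
	v.UserData = binary.LittleEndian.Uint64(slot[32:40])
	return v
}
```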
When that is done, the first thing to do is to copy to_submit submissions (or fewer if the submission queue does not contain that many). Then we can send a pointer (and not the object directly, to avoid another costly 64-byte copy) into the submission channel polled by the goroutine that was created in io_uring_setup. This copy needs to be done because, when leaving io_uring_enter, the head of the queue will be moved forward, making the slots available to userland again even though some operations might not have been processed yet.
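A sketch of that submission loop follows, with hypothetical helpers and fields (sqArray, sqMask, readSQE, publishSQHead) standing in for the real ring bookkeeping:

```go
// submit copies up to toSubmit entries out of the shared ring into
// sentry-owned memory and hands pointers to the worker goroutine.
func (iu *IOUringInstance) submit(toSubmit uint32) uint32 {
	var submitted uint32
	for submitted < toSubmit && iu.sqHead != iu.sqTail {
		idx := iu.sqArray[iu.sqHead&iu.sqMask] // index into the sqes area
		sqe := iu.readSQE(idx)                 // 64-byte copy into sentry memory
		iu.submissions <- sqe                  // send a pointer, not a second copy
		iu.sqHead++
		submitted++
	}
	// Publishing the new head makes the consumed slots reusable by userland,
	// even though the operations may not have completed yet.
	iu.publishSQHead()
	return submitted
}
```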
Of course, we still need to execute the requested operations, but this should be way "easier".
I plan to implement:
- IORING_OP_NOP, for benchmarking
- IORING_OP_READ and IORING_OP_WRITE, because the underlying read and write syscalls are simple, used everywhere, and fully supported by gVisor
I will try to add more operations as time permits.
When the goroutine has a new operation to process, it will switch on the opcode value of the io_uring_sqe struct and call the corresponding handler function.
When it finishes, each opcode handler will write a completion event (an io_uring_cqe struct) into the completion queue to notify userland.
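A sketch of that dispatch, expanding the process call from the io_uring_setup sketch, could look like this; the IORING_OP_* values are the real Linux opcodes, while the handlers and postCompletion are hypothetical:

```go
// Opcode values from include/uapi/linux/io_uring.h.
const (
	ioringOpNop   = 0  // IORING_OP_NOP
	ioringOpRead  = 22 // IORING_OP_READ
	ioringOpWrite = 23 // IORING_OP_WRITE
)

// process dispatches one submission and posts its completion entry.
func (iu *IOUringInstance) process(sqe *IOUringSqe) {
	var res int32
	switch sqe.Opcode {
	case ioringOpNop:
		res = 0 // nothing to do; useful for benchmarking the ring itself
	case ioringOpRead:
		res = iu.handleRead(sqe)
	case ioringOpWrite:
		res = iu.handleWrite(sqe)
	default:
		res = -22 // -EINVAL for opcodes that are not supported yet
	}
	// Every handler ends by posting an io_uring_cqe carrying the original
	// user_data so userland can match the completion to its submission.
	iu.postCompletion(sqe.UserData, res)
}
```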
To validate the io_uring implementation in gVisor, I will create benchmarks under test/perf/linux/ that will be run both on gVisor and on Linux using gVisor's benchmark framework. The existing benchmarks in that directory can serve as inspiration.
I hope I will be faster than what is planned. This will undoubtedly change over time, but I will be in touch with my mentor to revise this planning if we need to.
Each step of the timeline also includes the creation of its corresponding tests to ensure the stability of development and maintainability.
io_uring_setup
I will have one full week of exams in July.
My name is Esteban Blanc. I am a 4th year CS student at EPITA, near Paris.
I have been interested in systems and systems programming since I was around 14 years old. At first, I did some Arduino projects and drivers, then I started to write bootable real-mode tools in x86 assembly. At EPITA I have done some personal projects in my free time, such as a debugger, a Game Boy emulator, a static linker, and a memory checker, most of them in C.
In my 3rd year at EPITA, I joined the LSE (EPITA System and Security Laboratory). The recruitment subject was a dynamic linker in C. Once recruited, my labmate and I had to write an i386 protected-mode kernel in C (named K), able to communicate with SCSI devices and read an ISO file structure. The end goal was to load small games in the ELF executable format and run them in userland. This was an interesting project of about a month or so.
During the first lockdown in France, just after the end of the K project, the laboratory's activity stopped. During this time, a friend and I wrote Suckit, a web scraper in Rust. The goal was to be a lot faster than HTTrack, and from our benchmarks, it is, by about 3460%. It is far from feature complete (HTML scraping only) but has around 400 stars on GitHub, which I am quite proud of.
I was on an internship last semester and I am now majoring in Embedded Systems. My work at the LSE has restarted and I am currently designing a RISC-V CPU in SystemVerilog.
Concerning my Go experience, I have done many assignments in Go for different courses. I really like the ease of use of goroutines and the language's overall simplicity.