Implement io_uring in gVisor
===

Esteban Blanc

Mail: <estblcsk@gmail.com>
Telephone: +33695597326
Matrix: @skallwar:matrix.org
GitHub: [@Skallwar](https://github.com/Skallwar)

# Abstract

> Over the last couple of years, a lot of development effort has gone into two kernel subsystems: BPF and io_uring.
>
> -- <cite>[BPF meets io_uring](https://lwn.net/Articles/847951/)</cite>

gVisor does not currently support the io_uring interface, which came out on May 5, 2019 with Linux 5.1. Because of its [performance (section 9)](https://kernel.dk/io_uring.pdf) improvements over the old AIO interface introduced in Linux 2.5, io_uring has grown in popularity extremely fast, and some big projects like [QEMU](https://www.qemu.org/) are already [using it](https://archive.fosdem.org/2020/schedule/event/vai_io_uring_in_qemu/attachments/slides/4145/export/events/attachments/vai_io_uring_in_qemu/slides/4145/io_uring_fosdem.pdf). It is only a matter of time before applications request io_uring support in gVisor, so we might as well start adding it right now.

The goal of this proposal is to implement part of the io_uring interface, notably the `io_uring_setup` and `io_uring_enter` syscalls and a subset of opcodes. This is the first step toward full `io_uring` support in gVisor.

# Before reading

Most of the material presented here is explained without **all** of its subtleties. I will focus on the simple or common cases, because this proposal does not aim to implement the entirety of the `io_uring` interface.

# io_uring

## A bit of history

Linux already had an asynchronous I/O interface named [AIO](https://man7.org/linux/man-pages/man7/aio.7.html), but it had too many limitations:

- It only works asynchronously for unbuffered file descriptors (opened with `O_DIRECT`)
- Even though it is meant to be asynchronous, it behaves synchronously in some cases
- The API requires a lot of data copies, which can hurt performance

Over time, developers tried to lift these limitations, but the end result was an ever more complex API. It was then decided to create a new interface from scratch: `io_uring`.

## Design goals

`io_uring`'s design goals were to be:

- Easy to use
- Extendable
- Feature rich
- Efficient
- Scalable

## How it works

The `io_uring` API is centered around two ring buffers shared by the application and the kernel. One ring is named the `submission queue (SQ)` and the other is named the `completion queue (CQ)`. Since each ring has a single producer and a single consumer, there is no need for locking on either of them.

To create these two ring buffers, an application calls the `io_uring_setup` syscall, which fills the `struct io_uring_params` given by the `p` argument and returns a file descriptor for this ring. The kernel is responsible for reserving the memory needed by the two queues, but the application then has to map it into its own address space using `mmap`, [see section 5.0](https://kernel.dk/io_uring.pdf). There are three regions to map, at different offsets into the file descriptor:

- SQ ring (`IORING_OFF_SQ_RING`)
- SQE array (`IORING_OFF_SQES`)
- CQ ring (`IORING_OFF_CQ_RING`)

Now that the two ring buffers and the submission array are mapped into the application's address space, the application is all set to start submitting syscalls and receiving their results.
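To make the mapping step concrete, here is a minimal sketch (in Go, since gVisor itself is written in Go) of what an application does right after `io_uring_setup`, using raw syscalls. The `IORING_OFF_*` values are the real UAPI constants, but the mapping lengths are placeholders: a real application computes them from the `sq_off` and `cq_off` fields that the kernel wrote into `struct io_uring_params`.

```go
package main

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// Offsets of the three mappable regions, from include/uapi/linux/io_uring.h.
const (
	IORING_OFF_SQ_RING = 0x0
	IORING_OFF_CQ_RING = 0x8000000
	IORING_OFF_SQES    = 0x10000000
)

func main() {
	// struct io_uring_params is 120 bytes on Linux 5.10; a raw byte array
	// stands in for a full Go mirror to keep this sketch short.
	var params [120]byte

	// io_uring_setup(2): the kernel fills params and returns a ring fd.
	fd, _, errno := unix.Syscall(unix.SYS_IO_URING_SETUP,
		8, // request 8 submission queue entries
		uintptr(unsafe.Pointer(&params[0])), 0)
	if errno != 0 {
		panic(errno)
	}

	// Map the SQ ring, the CQ ring and the SQE array. 4096 is only a
	// placeholder length; real code derives each length from params.
	for _, off := range []int64{IORING_OFF_SQ_RING, IORING_OFF_CQ_RING, IORING_OFF_SQES} {
		if _, err := unix.Mmap(int(fd), off, 4096,
			unix.PROT_READ|unix.PROT_WRITE,
			unix.MAP_SHARED|unix.MAP_POPULATE); err != nil {
			panic(err)
		}
	}
}
```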
### Submission queue

For the submission queue, the application is the producer and the kernel is the consumer.

The submission queue is filled with [`struct io_uring_sqe`](https://elixir.bootlin.com/linux/v5.10.26/source/include/uapi/linux/io_uring.h#L17) entries, each representing a syscall the application wants executed. The main fields of the struct are:

- `opcode`: the operation code of this request
- `flags`: modifier flags (see the [IOSQE_* flags](https://manpages.debian.org/unstable/liburing-dev/io_uring_enter.2.en.html))
- `fd`: the file descriptor this request operates on
- `user_data`: used to identify this submission

The other fields are used depending on the requested operation.

When the kernel has to process a new syscall, it reads an index into the SQE ring buffer (where the `struct io_uring_sqe` has been filled in by the userland application) from the submission queue array.

### Completion queue

For the completion queue, the kernel is the producer and the application is the consumer.

When a syscall completes, the kernel adds its result to the completion queue. The result takes the form of a `struct io_uring_cqe`, which contains:

- `user_data`: used to match this completion with its submission; it carries the same value as the corresponding `struct io_uring_sqe`
- `res`: the result code of the syscall
- `flags`: metadata associated with the operation

The application pops the completion struct from the completion queue.

### Notify the kernel

Now that we know how to set up the two queues and read/write events in them, we need a way to notify the kernel. This is done through the [`io_uring_enter`](https://manpages.debian.org/unstable/liburing-dev/io_uring_enter.2.en.html) syscall.

When the application wants the kernel to process new syscalls, the `to_submit` argument tells the kernel that up to that many SQEs are ready to be consumed. When the application wants to wait for completions, it sets `IORING_ENTER_GETEVENTS` in `flags`, telling the kernel to return only once at least `min_complete` requests have completed.

# io_uring in gVisor

The goal of this proposal is to implement the `io_uring` interface in gVisor. There are 34 different opcodes at the time of writing (Linux 5.10.26), and because of time constraints I will not be able to implement all of them, only a subset.

As described in the `Implement io_uring` section of gVisor's [Project Ideas for Google Summer of Code 2021](https://gvisor.dev/community/gsoc_2021/) webpage, this implies adding two new syscalls to gVisor's `sentry`:

- `io_uring_setup`
- `io_uring_enter`

As the `io_uring` API has some similarities with the `epoll` one, I took a lot of inspiration from gVisor's `epoll` implementation and built my design around it.

Adding a syscall to the `sentry` is quite easy. It is done by adding a new entry to `s.Table` using the syscall number and assigning it to `syscalls.PartiallySupported()` in `pkg/sentry/syscalls/linux/vfs2/vfs2.go`. This way gVisor will be able to call a function to handle the corresponding syscall.

## `io_uring_setup`

This will be the first building block of the project. As this is the first syscall of the `io_uring` interface called by userland, it must be implemented first. The goal of this syscall, as described in the "How it works" section, is to set up a memory area shared between userland and gVisor and return a file descriptor to access it.

First of all, we need to store the `io_uring_params` of this new `io_uring` instance. They will be kept in a new type named `IOUringInstance` in `pkg/sentry/vfs/io_uring.go`.
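Here is a minimal sketch of what this type could look like, loosely modeled on gVisor's `EpollInstance`. Every field name, as well as the `linux.IOUringParams` and `linux.IOUringSqe` ABI mirrors, is an assumption of mine, not existing gVisor API:

```go
// IOUringInstance represents one io_uring instance created by
// io_uring_setup. Sketch only, modeled on vfs.EpollInstance; the ABI types
// (linux.IOUringParams, linux.IOUringSqe) are hypothetical mirrors of the
// Linux UAPI structs that would live in pkg/abi/linux. The Release, mmap
// and ring-access methods are omitted here.
//
// +stateify savable
type IOUringInstance struct {
	vfsfd FileDescription
	FileDescriptionDefaultImpl
	DentryMetadataFileDescriptionImpl
	NoLockFD

	// params holds the io_uring_params negotiated with userland at setup
	// time.
	params linux.IOUringParams

	// submissions feeds pending SQEs to the processing goroutine spawned
	// by io_uring_setup (see the io_uring_enter section).
	submissions chan *linux.IOUringSqe
}
```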
In order to create a new file descriptor out of thin air, I will write a new function analogous to [`NewEpollInstanceFD`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/vfs/epoll.go;l=101?q=NewEpollInstanceFD&sq=&ss=gvisor%2Fgvisor), which returns a [`FileDescription`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/vfs/file_description.go;drc=203890b13447a41c484bfeb737958d1ae01383c9;l=44). A new file descriptor is then created from it with a call to [`NewFDFromVFS2`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/task.go;l=788;bpv=1;bpt=1?q=NewFDFromVFS2&sq=&ss=gvisor%2Fgvisor), as is done in [`EpollCreate`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/epoll.go;l=55;bpv=1;bpt=1?q=EpollCr&sq=&ss=gvisor%2Fgvisor). I will then fill in the given `struct io_uring_params` and return the new file descriptor.

At this stage, the new file descriptor could be returned to userland. But before we do so, we need to set up a new goroutine (using the `go` keyword) that will poll a channel for new operations to execute. This is the heart of `io_uring`, but we will talk more about how it works later on, in the `io_uring_enter` section. For now, we just need to create a new submission channel carrying `io_uring_sqe` structs and hand it to the goroutine we spawn.

### `NewIOUringInstanceFD`

`NewIOUringInstanceFD` will be located in `pkg/sentry/vfs/io_uring.go`. It will create a new anonymous VFS dentry for the ring. Then it will create a new `IOUringInstance` containing a [`FileDescription`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/vfs/file_description.go;drc=203890b13447a41c484bfeb737958d1ae01383c9;l=44), initialized by the [`Init`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/vfs/file_description.go;drc=203890b13447a41c484bfeb737958d1ae01383c9;l=135) method of [`FileDescription`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/vfs/file_description.go;drc=203890b13447a41c484bfeb737958d1ae01383c9;l=44). The function will return a reference to the [`FileDescription`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/vfs/file_description.go;drc=203890b13447a41c484bfeb737958d1ae01383c9;l=44) contained in the `IOUringInstance`.
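A rough sketch of what `NewIOUringInstanceFD` could look like, under the assumption that the anonymous-dentry and `Init` plumbing mirrors `NewEpollInstanceFD`. The `"[io_uring]"` dentry name, the channel capacity, and the `process` goroutine are placeholders I chose for illustration:

```go
// NewIOUringInstanceFD creates a new io_uring instance and returns its
// FileDescription. Sketch modeled on NewEpollInstanceFD.
func (vfs *VirtualFilesystem) NewIOUringInstanceFD(ctx context.Context, params *linux.IOUringParams) (*FileDescription, error) {
	// Anonymous dentry backing the ring fd, like epoll's "[eventpoll]".
	vd := vfs.NewAnonVirtualDentry("[io_uring]")
	defer vd.DecRef(ctx)

	iu := &IOUringInstance{
		params:      *params,
		submissions: make(chan *linux.IOUringSqe, params.SQEntries),
	}
	if err := iu.vfsfd.Init(iu, linux.O_RDWR, vd.Mount(), vd.Dentry(), &FileDescriptionOptions{
		UseDentryMetadata: true,
	}); err != nil {
		return nil, err
	}

	// Spawn the goroutine that consumes submissions (see io_uring_enter).
	go iu.process(ctx)

	return &iu.vfsfd, nil
}
```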
## `io_uring_enter`

Once `io_uring_setup` has been correctly implemented (or should I say, once the userland process no longer gets `MAP_FAILED` when calling `mmap` on the returned file descriptor, see the "How it works" section, nor a segfault when writing to the mapped memory), I will implement the `io_uring_enter` syscall.

The `io_uring` API is driven through a file descriptor, but we can easily find the underlying `IOUringInstance` from it: a call to [`GetFileVFS2`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/task.go;l=747;bpv=0;bpt=1?q=GetFileVFS2&sq=&ss=gvisor%2Fgvisor) returns a [`FileDescription`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/vfs/file_description.go;drc=203890b13447a41c484bfeb737958d1ae01383c9;l=44). From there we can call [`Impl`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/vfs/file_description.go;drc=203890b13447a41c484bfeb737958d1ae01383c9;l=306) to get the underlying [`FileDescriptionImpl`](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/vfs/file_description.go;drc=203890b13447a41c484bfeb737958d1ae01383c9;l=321), which can be cast to an `IOUringInstance`.

At this stage, the submission queue will no longer be empty, and I will need to read the `io_uring_sqe` structs it contains. To read them correctly, padding and type sizes must be strictly respected. The biggest problem here is that the `io_uring_sqe` struct contains unions. Go does not have a union type, so the best solution is to represent each union as an array of bytes, even if it is not so pleasing (see the sketch at the end of this section).

Once this is done, the first thing to do is to copy up to `to_submit` submissions (or fewer, if the submission queue does not hold that many). Then we can send a pointer (and not the object directly, to avoid another costly 64-byte copy) into the submission channel polled by the goroutine created in `io_uring_setup`. This copy is needed because, when leaving `io_uring_enter`, the head of the queue is moved forward, making the slots available to userland again even though some operations might not have been processed yet.

Of course, we still need to execute the requested operations, but this should be way "easier".

## Operations

I plan to implement:

- `IORING_OP_NOP`, for benchmarking
- `IORING_OP_READ` and `IORING_OP_WRITE`, because the underlying `read` and `write` syscalls are simple, used everywhere and fully supported by gVisor

I will try to add more operations as time permits.

When the goroutine has a new operation to process, it switches on the `opcode` value of the `io_uring_sqe` struct and calls the corresponding handler function. When a handler finishes, it writes a completion event (an `io_uring_cqe` struct) into the completion queue to notify userland, as sketched below.
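The sketch below shows how the Go mirror of `struct io_uring_sqe` and the dispatch loop could fit together. The struct follows the 64-byte Linux 5.10 layout with each union flattened into a byte array; the `linux.IORING_OP_*` constants would have to be added to gVisor's ABI package, and `handleRead`, `handleWrite` and `writeCqe` are hypothetical helpers:

```go
// IOUringSqe mirrors struct io_uring_sqe (64 bytes in Linux 5.10); it would
// live in pkg/abi/linux as linux.IOUringSqe. Go has no union type, so each
// C union is flattened into a byte array of the union's size; accessors
// would decode the right member depending on the opcode.
type IOUringSqe struct {
	Opcode     uint8
	Flags      uint8
	Ioprio     uint16
	Fd         int32
	OffAddr2   [8]byte  // union { __u64 off; __u64 addr2; }
	AddrSplice [8]byte  // union { __u64 addr; __u64 splice_off_in; }
	Len        uint32
	OpFlags    [4]byte  // per-opcode flags union (rw_flags, fsync_flags, ...)
	UserData   uint64
	Tail       [24]byte // trailing union (buf_index, personality, padding)
}

// process is the body of the goroutine spawned by io_uring_setup. It
// consumes submissions from the channel and dispatches them by opcode.
func (iu *IOUringInstance) process(ctx context.Context) {
	for sqe := range iu.submissions {
		var res int32
		switch sqe.Opcode {
		case linux.IORING_OP_NOP:
			res = 0 // NOP always succeeds; useful for benchmarking.
		case linux.IORING_OP_READ:
			res = iu.handleRead(ctx, sqe)
		case linux.IORING_OP_WRITE:
			res = iu.handleWrite(ctx, sqe)
		default:
			res = -22 // -EINVAL: opcode not supported (yet).
		}
		// Post an io_uring_cqe carrying UserData so that userland can
		// match this completion with its submission.
		iu.writeCqe(sqe.UserData, res)
	}
}
```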
## Benchmark

To validate the `io_uring` implementation in gVisor, I will create benchmarks under `test/perf/linux/` that will be run both on gVisor and on Linux using gVisor's benchmark framework. Here are some benchmarks that I could take inspiration from:

- https://github.com/frevib/io_uring-echo-server/blob/master/benchmarks/benchmarks.md
- https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.6-IO-uring-Tests

# Timeline

I hope I will be faster than what is planned here. This timeline will undoubtedly change over time, but I will stay in touch with my mentor to revise it if we need to. Each step also includes the creation of its corresponding tests, to ensure the stability of development and maintainability.

## Required:

- May 17, 2021 - June 7, 2021: Community bonding, get more knowledge of gVisor's internals
- June 7, 2021 - July 5, 2021: Working `io_uring_setup`
- July 5, 2021 - July 18, 2021: Reading submissions
- July 18, 2021 - July 26, 2021: Writing completions
- July 26, 2021 - August 16, 2021: Executing submissions

## Optional:

- Kernel-side polling
- More operations

## Exams:

I will have one full week of exams in July.

# About me

My name is Esteban Blanc. I am a 4th-year CS student at [EPITA](https://www.epita.fr/en/), near Paris. I have been interested in systems and systems programming since the age of around 14. At first I did some Arduino projects and [drivers](https://github.com/Skallwar/GSL1680), then I started writing bootable real mode tools in x86 assembly.

At EPITA I have done some personal projects in my free time, such as a [debugger](https://github.com/Skallwar/mydbg), a [Game Boy emulator](https://github.com/Skallwar/gb-rs), a static linker, and a [memory checker](https://github.com/Skallwar/mymmck), most of them in C.

In my 3rd year at EPITA, I joined the [LSE](https://www.lse.epita.fr/) (EPITA System and Security Laboratory). The recruitment subject was a dynamic linker in C. Once recruited, my labmate and I had to write an i386 protected mode kernel in C (named [K](https://k.lse.epita.fr/)), able to communicate with SCSI devices and read an ISO filesystem. The end goal was to load small games in the ELF executable format and run them in userland. This was an interesting project of about a month or so.

During the first lockdown in France, just after the end of the K project, the laboratory's activity stopped. During this time, a friend and I wrote [Suckit](https://github.com/Skallwar/suckit), a web scraper in Rust. The goal was to be a lot faster than [HTTrack](https://www.httrack.com/), and from [the benchmarks](https://www.reddit.com/r/rust/comments/gdwuat/suckit_a_fast_multithreaded_website_downloader/), it is, by about 3460%. It is far from feature complete (HTML scraping only) but has around 400 stars on GitHub, which I am quite proud of.

I was on an internship last semester and I am now majoring in Embedded Systems. My work at the LSE has restarted and I am currently designing a [RISC-V CPU in SystemVerilog](https://github.com/Skallwar/rv32i-pico).

Concerning my Go experience, I have done many assignments in Go for different courses. I really like the ease of use of goroutines and the language's overall simplicity.