Tokio AsyncRead / AsyncWrite

This is my attempt to write up and summarize the Tokio AsyncRead/AsyncWrite discussion. It includes a certain amount of "editorial commentary" on my part. To be clear, I am not a "stakeholder" with decision making power there, but obviously this is relevant to any future attempt to standardize AsyncRead and AsyncWrite, so I want to make sure that the discussion is well documented. nikomatsakis

XXX I may start adapting this to a more general "summary page" on the topic nikomatsakis

Core Motivation: Perf and Uninitialized Memory

The core motivation is that the AsyncRead interface would like to be able to accept buffers that are not yet initialized, but the &mut [u8] argument type requires that the memory be initialized first. This is a recognized downside of the existing synchronous Read trait as well, and sfackler has a great writeup on the topic.
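To make the constraint concrete, here is a minimal sketch using the synchronous Read trait (the async version has the same shape); the function and buffer size are illustrative, not from any real codebase:

```rust
use std::io::Read;

// Because `read` takes `&mut [u8]`, the caller must pass initialized
// memory, so a fresh buffer is typically zero-filled even though
// `read` is about to overwrite it anyway.
fn read_into_fresh_buffer<R: Read>(mut src: R) -> std::io::Result<Vec<u8>> {
    // `vec![0u8; 8192]` memsets 8 KiB up front; an API that accepted
    // uninitialized buffers could skip this step.
    let mut buf = vec![0u8; 8192];
    let n = src.read(&mut buf)?;
    buf.truncate(n);
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // `&[u8]` implements `Read`, so a byte string works as a source.
    let out = read_into_fresh_buffer(&b"hello"[..])?;
    assert_eq!(&out[..], b"hello");
    Ok(())
}
```

The zeroing is pure overhead whenever the reader fills the buffer itself, which is the cost the measurements below try to quantify.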

Measuring the impact

The performance impact of zeroing uninitialized memory has been measured in various ways.

In end-to-end hyper benchmarks (basically hyper serving as both http client/server and piping data as fast as it can go), seanmonstar reports the following results:

Data size   Uninitialized Memory   Zeroed Memory
100 KB      1600 MB/s              1075 MB/s
10 MB       2350 MB/s              1850 MB/s

Ralith tested a similar setup, but using the QUIC protocol as implemented by Quinn. Here Ralith reported an impact of 2.5% for large streams, but only 0.2%-0.6% for smaller inputs.

sfackler also had a comment exploring some of the performance costs from the synchronous trait. For example, they mentioned that some routines had to be rewritten in complex ways to work around initialization costs (see PR #23820). They also mentioned that PR #26950, which added some specializations to the stdlib to avoid initialization costs, found a 7% impact on microbenchmarks around file reads. Unfortunately, neither of these represents "full system" impact.

It would be interesting to try to get numbers for a setup that does more than pipe data through as fast as it can go; for example, serving more realistic requests that involve more processing. But that data can be quite hard to gather in practice. If somebody has a setup based on tokio that is doing more complex processing, it might be possible to build it against a fork of tokio in which zeroing is artificially added or removed and test that way.

Alternatives: buffer pooling

One alternative to permitting uninitialized buffers is to use buffer pooling, in which case you amortize the initialization costs by reusing buffers. In many scenarios this is a perfectly good solution, but there are some concerns:

  • you are unable to allocate buffers on the stack
  • you are now required to decide when to release those buffers to the operating system, which was previously the job of the allocator

There have been arguments that buffer pooling also increases the risk of heartbleed-like attacks, where a secret is accidentally left in a buffer and then re-used by another connection. However, that seems like a weaker argument, since the memory allocator is also likely to hand you memory that was freed but not re-initialized. If you wish to guard secrets, the right approach is most likely to zero the memory when you are done with it, or to provide some other mechanism.
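A minimal buffer-pool sketch may help make the amortization and the concerns above concrete (the names here are illustrative, not any particular crate's API):

```rust
// Zeroing cost is paid once per buffer, when it is first allocated,
// and then amortized across reuses.
struct BufferPool {
    free: Vec<Vec<u8>>,
    buf_size: usize,
}

impl BufferPool {
    fn new(buf_size: usize) -> Self {
        BufferPool { free: Vec::new(), buf_size }
    }

    /// Reuse a returned buffer if one is available; otherwise allocate
    /// (and zero) a fresh one. Note that pooled buffers necessarily
    /// live on the heap, never the stack.
    fn acquire(&mut self) -> Vec<u8> {
        self.free.pop().unwrap_or_else(|| vec![0u8; self.buf_size])
    }

    /// Hand a buffer back for reuse. This is the second concern from
    /// the list above: the pool, not the allocator, now decides when
    /// (if ever) memory is released to the system.
    fn release(&mut self, buf: Vec<u8>) {
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new(4096);
    let buf = pool.acquire();
    let addr = buf.as_ptr();
    pool.release(buf);
    // The next acquire reuses the same allocation: no fresh zeroing.
    let buf2 = pool.acquire();
    assert_eq!(buf2.as_ptr(), addr);
}
```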

Bridging and the relationship to std

This part is more editorial on my part. I am happy to see Tokio exploring the "design space" around the AsyncRead traits. I don't have a fully formed opinion yet on what I think they should look like: I see benefits here, but they come at a significant cost in complexity.

I guess my main thought is that I think it would be important to pick something that can be "bridged" to a std trait with relative ease (and, similarly, I would expect that "bridge-ability" to the traits in use in different runtimes would be a consideration for a std trait). Of course, not knowing exactly what the std trait will look like makes that a bit harder! But I guess we can assume it will be similar to some of the options raised on this thread, so it'd be good to consider how easily they can be bridged back and forth, and at what cost.

Takeaway: It would be good to consider how readily the proposals can be bridged back and forth. I would pay particular attention to how hard it is to bridge to &mut [u8] or &mut [MaybeUninit<u8>] based formulations.
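As a hedged sketch of what "bridging" between those two formulations can look like, here are toy synchronous functions standing in for the trait methods (all names are illustrative). Wrapping a &mut [MaybeUninit<u8>]-based reader behind a &mut [u8] interface is essentially free, because initialized memory is always a valid place to write; the reverse direction would have to pay for initialization.

```rust
use std::mem::MaybeUninit;

// The uninit-based formulation: write bytes, report how many elements
// of `buf` are now initialized.
fn read_uninit(src: &[u8], buf: &mut [MaybeUninit<u8>]) -> usize {
    let n = src.len().min(buf.len());
    for (dst, &byte) in buf.iter_mut().zip(src) {
        dst.write(byte);
    }
    n
}

// Bridge: expose the uninit-based reader through a `&mut [u8]` API.
fn read_init(src: &[u8], buf: &mut [u8]) -> usize {
    // Safety: `MaybeUninit<u8>` has the same layout as `u8`, and
    // `read_uninit` only ever writes into the slice.
    let uninit: &mut [MaybeUninit<u8>] = unsafe {
        std::slice::from_raw_parts_mut(buf.as_mut_ptr().cast(), buf.len())
    };
    read_uninit(src, uninit)
}

fn main() {
    let mut buf = [0u8; 8];
    let n = read_init(b"ping", &mut buf);
    assert_eq!(n, 4);
    assert_eq!(&buf[..n], b"ping");
}
```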

How to expose vectorized writes

One concern is how to expose vectorized writes. It's clear that the current trait, which has two methods, can be error-prone, because people can easily forget to implement the vectorized variant and hence get poor performance. Some of the alternatives narrow this down to one method, which is good.

But Carl argues here that in fact high performance callers really want to have two code-paths, depending on whether the source will be able to productively use vectorized reads, and hence that the trait should expose a bool method or something that lets the caller choose the path they want.
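Carl's point can be sketched with a hypothetical trait (the name is_write_vectored mirrors the unstable method on std::io::Write; nothing here is Tokio's actual interface). The capability query lets a caller pick its strategy up front:

```rust
use std::io::IoSlice;

trait VectoredWrite {
    /// Does this writer actually benefit from vectored writes?
    fn is_write_vectored(&self) -> bool;
    fn write(&mut self, buf: &[u8]) -> usize;
    fn write_vectored(&mut self, bufs: &[IoSlice<'_>]) -> usize;
}

// The two code paths: hand over an iovec-style list when the sink can
// use it, otherwise flatten into one contiguous buffer to avoid a
// sequence of small writes.
fn send_all<W: VectoredWrite>(w: &mut W, parts: &[&[u8]]) -> usize {
    if w.is_write_vectored() {
        let slices: Vec<IoSlice<'_>> =
            parts.iter().map(|p| IoSlice::new(p)).collect();
        w.write_vectored(&slices)
    } else {
        w.write(&parts.concat())
    }
}

// Toy sink for demonstration purposes.
struct Sink { data: Vec<u8>, vectored: bool }

impl VectoredWrite for Sink {
    fn is_write_vectored(&self) -> bool { self.vectored }
    fn write(&mut self, buf: &[u8]) -> usize {
        self.data.extend_from_slice(buf);
        buf.len()
    }
    fn write_vectored(&mut self, bufs: &[IoSlice<'_>]) -> usize {
        bufs.iter().map(|b| self.write(b)).sum()
    }
}

fn main() {
    for vectored in [false, true] {
        let mut sink = Sink { data: Vec::new(), vectored };
        let n = send_all(&mut sink, &[b"hello, ", b"world"]);
        assert_eq!(n, 12);
        assert_eq!(&sink.data[..], b"hello, world");
    }
}
```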

Alternatives

Original

fn poll_read(buffer: &mut [u8])

Some way to "opt-in" to not zeroing memory

Similar to what std offers.

Pros:

  • Compatible with Read, permits simpler signature

Cons:

  • Somewhat complex
  • Need an "unsafe to implement but not to call" mechanism for methods, which we lack

dyn Trait

Original proposal:

fn poll_read(buffer: &mut dyn Buf)

Pros:

  • One method supports both vectorized writes and ordinary writes
  • Buffer encapsulates uninitialized memory
  • Bridging is (presumably) easier because you can have Buf trait implemented for many types

Cons:

  • Requires a virtual call for leaf writes
  • Because dyn traits (like dyn Buf) cannot be created for unsized types like [u8], the actual type when invoked with a &mut [u8] buffer will be doubly indirect (&mut &mut [u8]), and similarly callers may need some extra &mut. Not clear that this matters.
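The double indirection can be seen in a toy version (this Buf trait is a stand-in, not Tokio's real one). Since [u8] is unsized, &mut [u8] itself must be the type implementing Buf, and the trait object passed around is effectively &mut &mut [u8]:

```rust
trait Buf {
    fn put(&mut self, byte: u8);
}

// Implemented for the mutable-reference type: each `put` writes one
// byte and advances the slice.
impl Buf for &mut [u8] {
    fn put(&mut self, byte: u8) {
        let slice = std::mem::take(self);
        let (first, rest) = slice.split_first_mut().expect("buffer full");
        *first = byte;
        *self = rest;
    }
}

// A leaf writer sees only `&mut dyn Buf`, so every `put` is a
// virtual call.
fn fill(buf: &mut dyn Buf) {
    for b in *b"hi" {
        buf.put(b);
    }
}

fn main() {
    let mut storage = [0u8; 2];
    let mut slice: &mut [u8] = &mut storage;
    // Note the extra `&mut`: what we pass is `&mut &mut [u8]`.
    fill(&mut slice);
    assert_eq!(&storage, b"hi");
}
```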

impl Trait

As described here

fn poll_read(buffer: impl Buf) where Self: Sized;
fn poll_read_dyn(buffer: &mut dyn Buf);

Pros:

  • Like dyn Trait, but more flexible and eliminates fears of perf impact

Cons:

  • More complex trait, and implementing it will be annoying because defaults cannot be provided, so there is some boilerplate

Concrete struct

Proposed here

Another option

fn poll_read(buffer: &mut BufStruct)
fn poll_read_vectorized(buffer: &mut BufStruct)

XXX write-up some of the pros/cons

Take slice of MaybeUninit

Another option

fn poll_read(buffer: &mut [MaybeUninit<u8>])
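A sketch of both sides of this formulation (all names illustrative): the implementor writes bytes without the caller ever zeroing, and returns how many elements are now initialized; the caller then unsafely asserts that prefix is initialized.

```rust
use std::mem::MaybeUninit;

fn poll_read_sketch(src: &[u8], buf: &mut [MaybeUninit<u8>]) -> usize {
    let n = src.len().min(buf.len());
    for (dst, &byte) in buf.iter_mut().zip(src) {
        dst.write(byte); // initializes this element
    }
    n
}

fn main() {
    // No zeroing here: the array starts uninitialized.
    let mut buf = [MaybeUninit::<u8>::uninit(); 16];
    let n = poll_read_sketch(b"data", &mut buf);
    // Safety: exactly the first `n` elements were initialized above.
    // (This caller-side `unsafe` is one cost of the formulation.)
    let init: &[u8] =
        unsafe { std::slice::from_raw_parts(buf.as_ptr().cast(), n) };
    assert_eq!(init, b"data");
}
```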