This is my attempt to write up and summarize the Tokio AsyncRead/AsyncWrite discussion. It includes a certain amount of "editorial commentary" on my part. To be clear, I am not a "stakeholder" with decision-making power there, but obviously this is relevant to any future attempt to standardize `AsyncRead` and `AsyncWrite`, so I want to make sure that the discussion is well documented. –nikomatsakis
XXX I may start adapting this to a more general "summary page" on the topic –nikomatsakis
The core motivation is that the `AsyncRead` interface would like to be able to accept buffers that are not yet initialized, but the `&mut [u8]` type requires that the memory it points at be initialized. This is a recognized downside of the existing synchronous `Read` trait as well, and sfackler has a great writeup on the topic.
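To make the cost concrete, here is a minimal synchronous illustration of the problem (the function name is made up for illustration): because `read` demands an initialized `&mut [u8]`, the caller has to zero a fresh buffer up front even though the read will immediately overwrite it.

```rust
use std::io::Read;

// A caller reading into freshly allocated memory must zero it first
// (`vec![0u8; N]`), because `read` requires an initialized `&mut [u8]`.
fn read_into_fresh_buffer(mut src: impl Read) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; 8 * 1024]; // initialization cost paid here
    let n = src.read(&mut buf)?;
    buf.truncate(n); // keep only the bytes the source actually wrote
    Ok(buf)
}
```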
The performance impact of zeroing uninitialized memory has been measured in various ways.
In end-to-end hyper benchmarks (basically hyper serving as both HTTP client and server, piping data through as fast as it can go), seanmonstar reports the following results:
| Data size | Uninitialized Memory | Zeroed Memory |
|---|---|---|
| 100kb | 1600 mb/s | 1075 mb/s |
| 10mb | 2350 mb/s | 1850 mb/s |
Ralith tested a similar setup, but using the QUIC protocol as implemented by Quinn. Here Ralith reported an impact of 2.5% for large streams, but only 0.2%-0.6% for smaller inputs.
sfackler also had a comment exploring some of the performance costs from the synchronous trait. For example, they mentioned that some routines had to be rewritten in complex ways to work around initialization costs (see PR #23820). They also mentioned that PR #26950, which added some specializations to the stdlib to avoid initialization costs, found a 7% impact on microbenchmarks around file reads. Unfortunately, neither of these represents "full system" impact.
It would be interesting to try to get numbers for a setup that does more than pipe data through as fast as it can go – for example, serving more realistic requests that involve more processing. But that data can be quite hard to gather in practice. If somebody has a setup based on tokio that is doing more complex processing, it might be possible to build with a fork of tokio in which zeroing is artificially added or removed and test that way?
One alternative to permitting uninitialized buffers is to use buffer pooling, in which case you amortize the initialization costs by reusing buffers (see the sketch below). In many scenarios this is a perfectly fitting solution, but there are some concerns:
There have been arguments that you also increase the risk of heartbleed-like attacks, where a secret is accidentally left in a buffer and then re-used for another connection. However, that seems like a weaker case, since the memory allocator is also likely to hand you memory that was freed but not re-initialized. Therefore, if you wish to guard secrets, the right approach is most likely to zero the memory when you are done with it, or to provide some other mechanism.
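For illustration, here is a minimal sketch of the pooling idea, including the "zero when you are done with it" mitigation mentioned above. The `BufferPool` type and its methods are hypothetical, not an actual Tokio API, and real code would likely use volatile writes or a crate like `zeroize` so the compiler cannot optimize the scrubbing away.

```rust
/// Hypothetical buffer pool: buffers are zero-initialized once, then reused,
/// amortizing the initialization cost across many reads.
struct BufferPool {
    free: Vec<Vec<u8>>,
    buf_size: usize,
}

impl BufferPool {
    fn new(buf_size: usize) -> Self {
        BufferPool { free: Vec::new(), buf_size }
    }

    /// Hand out a buffer, paying the zeroing cost only when the pool is empty.
    fn checkout(&mut self) -> Vec<u8> {
        self.free.pop().unwrap_or_else(|| vec![0u8; self.buf_size])
    }

    /// Return a buffer to the pool, scrubbing it if it may hold secrets.
    fn checkin(&mut self, mut buf: Vec<u8>, may_hold_secrets: bool) {
        if may_hold_secrets {
            buf.iter_mut().for_each(|b| *b = 0);
        }
        self.free.push(buf);
    }
}
```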
This part is more editorial on my part. I am happy to see Tokio exploring the "design space" around the `AsyncRead` trait. I don't have a fully formed opinion yet on what I think it should look like – I see benefits here, but they come at a significant cost in complexity.
I guess my main thought is that I think it would be important to pick something that can be "bridged" to a `std` trait with relative ease (and, similarly, I would expect that "bridge-ability" to the traits in use in different runtimes would be a consideration for a `std` trait). Of course, not knowing exactly what the `std` trait will look like makes that a bit harder! But I guess we can assume it will be similar to some of the options raised on this thread, so it'd be good to consider how easily they can be bridged back and forth, and at what cost.
Takeaway: It would be good to consider how readily the proposals can be bridged back and forth. I would pay particular attention to how hard it is to bridge to `&mut [u8]` or `&mut [MaybeUninit<u8>]` based formulations.
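As one concrete example of such a bridge and its cost, here is a sketch of adapting a `&mut [u8]`-based reader so it can be driven by a caller that only has uninitialized storage (the function and parameter names are made up for illustration). The bridge has to initialize the buffer before it can soundly create the `&mut [u8]` view, which is exactly the cost the uninitialized-buffer designs are trying to avoid.

```rust
use std::mem::MaybeUninit;

// Bridge a `&mut [u8]`-style read into uninitialized storage by zeroing first.
fn bridge_read(
    read_initialized: impl FnOnce(&mut [u8]) -> usize,
    storage: &mut [MaybeUninit<u8>],
) -> usize {
    // Initialize every byte so that forming a `&mut [u8]` view is sound.
    for slot in storage.iter_mut() {
        slot.write(0);
    }
    // SAFETY: all bytes were just written, so the slice is fully initialized.
    let buf: &mut [u8] = unsafe {
        std::slice::from_raw_parts_mut(storage.as_mut_ptr() as *mut u8, storage.len())
    };
    read_initialized(buf)
}
```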
One concern is how to expose vectorized writes. It's clear that the current trait, which has two methods, can be error-prone, because people can easily forget to implement the vectorized variant and hence get poor performance. Some of the alternatives narrow down to one method, which is good.
But Carl argues here that in fact high-performance callers really want to have two code paths, depending on whether the source will be able to productively use vectorized reads, and hence that the trait should expose a `bool` method or something that lets the caller choose the path they want.
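A rough sketch of what such an interface could look like, simplified and synchronous (the trait and method names here are illustrative, not a concrete proposal):

```rust
use std::io::{self, IoSliceMut};

/// Illustrative only: expose a cheap capability check so callers can choose
/// their buffering strategy once, instead of guessing on every read.
trait VectoredSource {
    /// True if `read_vectored` maps to a real scatter/gather read (e.g. readv),
    /// rather than a fallback that only fills the first buffer.
    fn supports_vectored_read(&self) -> bool;

    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize>;
    fn read_vectored(&mut self, bufs: &mut [IoSliceMut<'_>]) -> io::Result<usize>;
}

/// The caller picks a code path up front based on the hint.
fn pick_strategy(src: &impl VectoredSource) -> &'static str {
    if src.supports_vectored_read() {
        "queue many small buffers and use read_vectored"
    } else {
        "copy through one large contiguous buffer"
    }
}
```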
`fn poll_read(buffer: &mut [u8])`
Similar to what std offers.
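Spelled out in full, this option would look roughly like the existing `futures::io::AsyncRead` signature (shown here as a sketch):

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

// Roughly the shape this option implies, modeled on the futures-io trait.
trait AsyncRead {
    fn poll_read(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut [u8],
    ) -> Poll<io::Result<usize>>;
}
```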
Pros:
Cons:
Original proposal:
`fn poll_read(buffer: &mut dyn Buf)`
Pros:

- `Buf` trait implemented for many types

Cons:

- Because trait objects (`dyn Buf`) cannot be created for unsized types like `[u8]`, the actual type when invoked with a `&mut [u8]` buffer will be doubly indirect (`&mut &mut [u8]`), and similarly callers may need some extra `&mut`. Not clear that this matters.

`fn poll_read(buffer: impl Buf) where Self: Sized;`
`fn poll_read_dyn(buffer: &mut dyn Buf);`
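To illustrate how the `where Self: Sized` split keeps the trait object-safe, here is a minimal synchronous sketch (the `Buf` and `Source` traits below are simplified stand-ins, not the real proposal):

```rust
/// Simplified stand-in for a `bytes`-style buffer trait.
trait Buf {
    fn remaining_mut(&self) -> usize;
}

trait Source {
    /// Generic fast path; `where Self: Sized` keeps it out of the vtable,
    /// so the trait as a whole remains object-safe.
    fn poll_read<B: Buf>(&mut self, buffer: &mut B) -> usize
    where
        Self: Sized,
    {
        // Default implementation forwards to the object-safe variant.
        self.poll_read_dyn(buffer)
    }

    /// Object-safe variant, callable through `&mut dyn Source`.
    fn poll_read_dyn(&mut self, buffer: &mut dyn Buf) -> usize;
}
```

Concrete types call `poll_read` and monomorphize on the buffer type, while `dyn Source` users still have `poll_read_dyn`, at the cost of dynamic dispatch.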
Pros:
Another option:

`fn poll_read(buffer: &mut BufStruct)`
`fn poll_read_vectorized(buffer: &mut BufStruct)`
XXX write-up some of the pros/cons
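Since `BufStruct` is left undefined above, here is one hypothetical shape it could take: a cursor over possibly-uninitialized memory that separately tracks the filled and known-initialized regions. All names and details below are guesses for illustration, not the actual proposal.

```rust
use std::mem::MaybeUninit;

/// Hypothetical `BufStruct`: a view of possibly-uninitialized memory that
/// tracks how much has been filled with data and how much is known-initialized.
struct BufStruct<'a> {
    data: &'a mut [MaybeUninit<u8>],
    filled: usize,
    initialized: usize,
}

impl<'a> BufStruct<'a> {
    fn new(data: &'a mut [MaybeUninit<u8>]) -> Self {
        BufStruct { data, filled: 0, initialized: 0 }
    }

    /// The bytes written so far, as an ordinary initialized slice.
    fn filled(&self) -> &[u8] {
        // SAFETY: the first `filled` bytes were written via `put_slice`.
        unsafe { std::slice::from_raw_parts(self.data.as_ptr() as *const u8, self.filled) }
    }

    /// Append data, advancing both the filled and initialized watermarks.
    fn put_slice(&mut self, src: &[u8]) {
        let end = self.filled + src.len();
        assert!(end <= self.data.len(), "buffer overflow");
        for (dst, &byte) in self.data[self.filled..end].iter_mut().zip(src) {
            dst.write(byte);
        }
        self.filled = end;
        self.initialized = self.initialized.max(end);
    }
}
```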
`MaybeUninit`

Another option:

`fn poll_read(buffer: &mut [MaybeUninit<u8>])`
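For a sense of what this option asks of callers, here is a small sketch (the `read_into_stub` function is a stand-in for a real reader, not part of any proposal): the caller can hand over uninitialized storage, but afterwards needs `unsafe` to assert how much of it was actually initialized.

```rust
use std::mem::MaybeUninit;

/// Stand-in for a real reader: pretends the source wrote three bytes.
fn read_into_stub(buf: &mut [MaybeUninit<u8>]) -> usize {
    for (i, slot) in buf.iter_mut().take(3).enumerate() {
        slot.write(i as u8);
    }
    3
}

fn main() {
    // The caller can hand over storage without zeroing it first...
    let mut storage = [MaybeUninit::<u8>::uninit(); 16];
    let n = read_into_stub(&mut storage);

    // ...but must use `unsafe` to assert that the first `n` bytes really were
    // initialized by the callee; this is the complexity cost of the design.
    let filled = unsafe { std::slice::from_raw_parts(storage.as_ptr() as *const u8, n) };
    assert_eq!(filled, &[0, 1, 2]);
}
```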