# Async Rust never left the MVP state
TODO:
- How to improve/who to pay?
- Raise concern about lack of people knowledgeable in this area (so pay us to gain expertise and help?)
- Rust project goal?

I love me some async Rust! It's amazing how we can write executor-agnostic code that runs concurrently on huge servers and tiny microcontrollers.
But especially on those tiny microcontrollers we notice that async Rust is far from the [zero cost abstractions](https://blog.rust-lang.org/2019/11/07/Async-await-stable/#zero-cost-futures) we were promised. Every byte counts there, and async introduces a lot of bloat. The bloat exists on desktops and servers too, but it's less noticeable due to the sheer amount of memory and compute available.
So what's going wrong? What's missing? And how could we fix it?
What I won't be talking about is the often-discussed problem of futures becoming bigger than necessary and doing a lot of copying. People are aware of that, and there's an open PR that tackles part of it: https://github.com/rust-lang/rust/pull/135527
## Anatomy of a generated future
We're gonna be looking at this code:
```rust
fn foo() -> impl Future<Output = i32> {
async { 5 }
}
fn bar() -> impl Future<Output = i32> {
async {
foo().await + foo().await
}
}
```
We're using the desugared syntax (a plain `fn` returning an async block as `impl Future`) instead of `async fn`, because it makes it easier to see what's happening.
So what does the `bar` future look like?
There are two await points, so the state machine must have at least two states, right?
Well, yes. But there's more.
Luckily we can ask the compiler to dump the MIR at various passes (via the unstable `-Zdump-mir` flag). An interesting pass is `coroutine_resume`. This is the last async-specific MIR pass. Why is this important? Well, async is a language feature that still exists in MIR, but not in LLVM IR. So the transformation from async code to state machine happens as a MIR pass.
The `bar` function generates 360 lines of MIR. Pretty crazy, right? This does get optimized down a bit further later on, though.
The compiler also outputs the `CoroutineLayout`. It's basically an enum with these states (comments my own):
```
variant_fields: {
Unresumed(0): [], // Starting state
Returned (1): [],
Panicked (2): [],
Suspend0 (3): [_s1], // At await point 1, _s1 = the foo future
    Suspend1 (4): [_s0, _s2], // At await point 2, _s0 = result of _s1, _s2 = the second foo future
},
```
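To make that layout concrete, here's a hand-rolled future that mirrors those variants. This is a sketch with names of my own; the real generated code differs in detail, and I've left out `Panicked` for brevity:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Stand-in for the `foo` future: stateless and immediately ready.
struct FooFut;
impl Future for FooFut {
    type Output = i32;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<i32> {
        Poll::Ready(5)
    }
}

// Hand-rolled mirror of `bar`'s CoroutineLayout (all names are mine).
enum HandRolledBar {
    Unresumed,                                 // Unresumed(0)
    Suspend0 { first: FooFut },                // Suspend0(3): [_s1]
    Suspend1 { partial: i32, second: FooFut }, // Suspend1(4): [_s0, _s2]
    Returned,                                  // Returned(1)
}

impl Future for HandRolledBar {
    type Output = i32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i32> {
        // Everything here is `Unpin`, so no pin projection is needed.
        let this = self.get_mut();
        loop {
            match this {
                HandRolledBar::Unresumed => {
                    *this = HandRolledBar::Suspend0 { first: FooFut };
                }
                HandRolledBar::Suspend0 { first } => match Pin::new(first).poll(cx) {
                    Poll::Ready(v) => {
                        *this = HandRolledBar::Suspend1 { partial: v, second: FooFut };
                    }
                    Poll::Pending => return Poll::Pending,
                },
                HandRolledBar::Suspend1 { partial, second } => match Pin::new(second).poll(cx) {
                    Poll::Ready(v) => {
                        let sum = *partial + v;
                        *this = HandRolledBar::Returned;
                        return Poll::Ready(sum);
                    }
                    Poll::Pending => return Poll::Pending,
                },
                HandRolledBar::Returned => panic!("`async fn` resumed after completion"),
            }
        }
    }
}
```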
So what are `Returned` and `Panicked`?
Well, `Future::poll` is a safe function: calling it must never cause UB, even when the future is already done. So after `Suspend1` the future returns `Ready` and is put into the `Returned` state. If it's polled again in that state, the poll function panics.
I'm not quite sure exactly what the `Panicked` state does. Presumably the future is moved into it when its body panics, so that polling it again after unwinding can't observe a half-completed state.
Cool, this seems reasonable.
## Why panic?
But is it reasonable? Futures in the `Returned` state will panic when polled. But they don't have to: the only thing we must avoid is causing UB.
Panics are relatively expensive. They introduce a path with a side effect that's not easily optimized out. What if, instead, we just returned `Pending` again? Nothing unsafe is going on, so we still fulfill the contract of the `Future` trait.
I've hacked this into the compiler to try it out and saw a 2%-5% reduction in binary size for async embedded firmware.
So maybe this should be a switch, just like `overflow-checks = false` is for integer overflow: in debug builds it would still panic so that wrong behavior is immediately visible, while release builds get more optimized futures.
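As a hand-written sketch of that alternative behavior (this is not what the compiler emits today), a future that stays `Pending` after completion still satisfies the `Future` contract:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Yields its value once, then returns `Pending` forever instead of panicking.
struct OnceReady(Option<i32>);

impl Future for OnceReady {
    type Output = i32;
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<i32> {
        match self.0.take() {
            Some(v) => Poll::Ready(v),
            // Polled after completion: no panic path, no UB, just `Pending`.
            None => Poll::Pending,
        }
    }
}

fn noop_waker() -> Waker {
    const VTABLE: RawWakerVTable = RawWakerVTable::new(|_| RAW, |_| {}, |_| {}, |_| {});
    const RAW: RawWaker = RawWaker::new(std::ptr::null(), &VTABLE);
    unsafe { Waker::from_raw(RAW) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = OnceReady(Some(5));
    assert_eq!(Pin::new(&mut fut).poll(&mut cx), Poll::Ready(5));
    // A well-behaved executor never polls again after `Ready`, but it's allowed:
    assert_eq!(Pin::new(&mut fut).poll(&mut cx), Poll::Pending);
}
```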
## Always a state machine
We've looked at `bar`, but not yet at `foo`.
```rust
fn foo() -> impl Future<Output = i32> {
async { 5 }
}
```
Let's implement it manually, to see what the optimal solution would be.
```rust
struct FooFut;
impl Future for FooFut {
type Output = i32;
fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
Poll::Ready(5)
}
}
```
Easy right? We don't need any state. We just return the number.
Let's see what the generated MIR is:
```rust
// MIR for `foo::{closure#0}` 0 coroutine_resume
/* coroutine_layout = CoroutineLayout {
field_tys: {},
variant_fields: {
Unresumed(0): [],
Returned (1): [],
Panicked (2): [],
},
storage_conflicts: BitMatrix(0x0) {},
} */
fn foo::{closure#0}(_1: Pin<&mut {async block@src\main.rs:5:5: 5:10}>, _2: &mut Context<'_>) -> Poll<i32> {
debug _task_context => _2;
let mut _0: core::task::Poll<i32>;
let mut _3: i32;
let mut _4: u32;
let mut _5: &mut {async block@src\main.rs:5:5: 5:10};
bb0: {
_5 = copy (_1.0: &mut {async block@src\main.rs:5:5: 5:10});
_4 = discriminant((*_5));
switchInt(move _4) -> [0: bb1, 1: bb4, otherwise: bb5];
}
bb1: {
_3 = const 5_i32;
goto -> bb3;
}
bb2: {
_0 = Poll::<i32>::Ready(move _3);
discriminant((*_5)) = 1;
return;
}
bb3: {
goto -> bb2;
}
bb4: {
assert(const false, "`async fn` resumed after completion") -> [success: bb4, unwind unreachable];
}
bb5: {
unreachable;
}
}
```
Yikes! That's a lot of code!
Notice that the `CoroutineLayout` still has the three default states (`Unresumed`, `Returned`, `Panicked`) and that `bb0` still switches on the discriminant. There's a big optimization opportunity here that we're not using.
I've also hacked this into the compiler and it saved 0.2% of binary size. Not a lot, but it's quite a simple optimization, so it's likely worth making.
## LLVM to the rescue?
Ok, so the MIR output isn't great. But LLVM will pick up the pieces, right?
Well, sometimes. But only when the futures are simple enough and you're compiling with `opt-level=3`. If the future grows too complex (which happens fast, because futures nest very deeply) or you're optimizing for size (which we often do for embedded or wasm), LLVM doesn't optimize all of this away.
Here's an example in godbolt: https://godbolt.org/z/58ahb3nne
It's actually funny how half-optimized the `opt-level=z` output is (ARM code):
```asm
example::block_on::hce4286e7e39d7ac7: // Block on `bar` (foo().await + foo().await)
push {r4, r6, r7, lr}
add r7, sp, #8
mov r4, r0
ldrb r0, [r0, #4]
add r0, pc
ldrb r0, [r0, #4]
lsls r0, r0, #1
.LCPI4_1:
add pc, r0 // Jump to .LBB4_5 if the future is in `Returned` state? Just a guess, this is weird.
.LBB4_2:
movs r0, #0
strb r0, [r4, #8]
.LBB4_3:
mov r0, r4
adds r0, #8
bl example::foo::{{closure}}::hba7d2b8a4e4c6fc1 // Call foo.poll
movs r0, #0 // Overwrite the return value
strb r0, [r4, #8]
movs r0, #5 // Magically come up with the right answer anyways
str r0, [r4] // Store it
.LBB4_4:
mov r0, r4
adds r0, #8
bl example::foo::{{closure}}::hba7d2b8a4e4c6fc1 // Call foo.poll again
ldr r0, [r4] // Overwrite the return value by loading the previous answer
adds r0, r0, #5 // Do the addition with the 'new' 5
pop {r4, r6, r7, pc} // Return from function
.LBB4_5:
ldr r0, .LCPI4_0
bl core::panicking::panic_const::panic_const_async_fn_resumed::h4fee113d5d7927ef
.LBB4_6:
.inst.n 0xdefe
.LCPI4_0:
.long .Lanon.ee789de5b3c3b99ca995cf66cfb258b2.1
```
Sadly, LLVM is not our savior here.
With `opt-level=3` it gets further, but eventually it can't keep up either. And that's because we're relying on LLVM to delete code we explicitly asked it to generate.
## Futures aren't inlined
Inlining is great since it enables further optimization passes. Sadly, generated Rust futures are never inlined into one another: each future is first lowered to its own state machine, and only then do LLVM and the linker get an opportunity to inline. But as we've seen above, that's too late.
The prime opportunity for inlining is this:
```rust
async fn foo(blah: SomeType) -> OtherType {
// ...
}
async fn bar(blah: SomeType) -> OtherType {
foo(blah).await
}
```
This is a pattern that happens a lot when building abstractions with traits. Now `bar` creates its own state machine that calls the `foo` state machine. But `bar` could also simply *become* `foo`.
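Hand-written, the ideal result of that inlining is just returning the inner future directly. This is a sketch; the concrete signatures here are made up:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Made-up concrete signatures for the sake of the example.
async fn foo(blah: bool) -> i32 {
    if blah { 1 } else { 0 }
}

// `bar`'s whole body is `foo(blah).await`, so it needs no state machine of
// its own: it can hand out `foo`'s future directly.
fn bar(blah: bool) -> impl Future<Output = i32> {
    foo(blah)
}

fn noop_waker() -> Waker {
    const VTABLE: RawWakerVTable = RawWakerVTable::new(|_| RAW, |_| {}, |_| {}, |_| {});
    const RAW: RawWaker = RawWaker::new(std::ptr::null(), &VTABLE);
    unsafe { Waker::from_raw(RAW) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    // `async fn` futures are !Unpin, so pin to the heap before polling.
    let mut fut = Box::pin(bar(true));
    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Ready(1));
}
```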
Things get more difficult when we add a preamble and a postamble to the example.
```rust
async fn foo(blah: bool) -> i32 {
// ...
}
async fn bar(input: u32) -> i32 {
let blah = input > 10;
let result = foo(blah).await;
result * 2
}
```
This pattern is common when translating an async function from one signature to another, which happens for trait impls.
We notice that `bar` doesn't need any async state of its own here either: no data is kept across the single await point. `bar` can't simply *become* `foo`, but it can mostly rely on just `foo`'s state. A manual implementation would be something like:
```rust
enum BarFut {
    Unresumed { input: u32 },
    Inlined { foo: FooFut }, // FooFut = `foo`'s generated future
}
impl Future for BarFut {
    type Output = i32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Ignoring pin projection here and pretending everything is `Unpin`
        let this = self.get_mut();
        loop {
            match this {
                BarFut::Unresumed { input } => {
                    // Run the preamble on the first poll
                    let blah = *input > 10;
                    *this = BarFut::Inlined { foo: foo(blah) };
                }
                BarFut::Inlined { foo } => {
                    // Apply the postamble to `foo`'s result
                    break Pin::new(foo).poll(cx).map(|result| result * 2);
                }
            }
        }
    }
}
```
That's a lot better than what's currently generated. If only we were allowed to execute the code up to the first await point, then we could get rid of the `Unresumed` state.
There's more you could do with inlining if you were able to query properties of the futures you're polling. I don't think that's possible, at least not with rustc's current architecture.
For example, if you could query whether a future always returns `Ready` on its first poll, you wouldn't have to create a state for that await point in the caller's future. If that were possible and you could apply these optimizations recursively, you could collapse a lot of futures into much simpler state machines.
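Nothing like this exists in rustc today, but as a thought experiment, such a query could look like a marker trait that lets a hand-written caller skip the suspend state entirely. Everything below is hypothetical:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Purely hypothetical marker trait: implementors promise that their first
// poll always returns `Ready`.
trait FirstPollReady: Future {}

struct FooFut;
impl Future for FooFut {
    type Output = i32;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<i32> {
        Poll::Ready(5)
    }
}
impl FirstPollReady for FooFut {}

// If the inner future is known to be ready on the first poll, the caller can
// unwrap it synchronously instead of creating a suspend state for the await.
fn poll_now<F: FirstPollReady + Unpin>(mut fut: F, cx: &mut Context<'_>) -> F::Output {
    match Pin::new(&mut fut).poll(cx) {
        Poll::Ready(v) => v,
        Poll::Pending => unreachable!("FirstPollReady future returned Pending"),
    }
}

// The equivalent of `bar`: two "await points", zero suspend states.
struct BarFut;
impl Future for BarFut {
    type Output = i32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i32> {
        Poll::Ready(poll_now(FooFut, cx) + poll_now(FooFut, cx))
    }
}

fn noop_waker() -> Waker {
    const VTABLE: RawWakerVTable = RawWakerVTable::new(|_| RAW, |_| {}, |_| {}, |_| {});
    const RAW: RawWaker = RawWaker::new(std::ptr::null(), &VTABLE);
    unsafe { Waker::from_raw(RAW) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = BarFut;
    assert_eq!(Pin::new(&mut fut).poll(&mut cx), Poll::Ready(10));
}
```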
I haven't tested out inlining (yet?), but I think this would significantly help binary size and performance.
## Some testing results
- Replacing the `Returned` state's panic with `Poll::Pending`: 2%-5% binary size savings on embedded.
- No await, no state machine: 0.2% binary size savings on embedded.
- Both combined: ~3% perf increase on x86 in a synthetic benchmark with the `smol` executor.

Future inlining should have an even greater effect.
Ultimately it's hard to know the real improvements until they can be benchmarked in real systems.
## Conclusion
I would like to see these changes in the compiler:

- `Returned` state no longer panics in release mode
- Async blocks without awaits don't get state machines, but just return `Ready` every time
- Future inlining for futures with a single await

I hope this shed some light on some of async Rust's issues!
Links to my hacks:
- No panics in poll after ready: https://github.com/rust-lang/rust/compare/main...diondokter:rust:resume-pending
- No await, no state machine: https://github.com/rust-lang/rust/compare/main...diondokter:rust:no-statemachine-when-no-await