# Moves in and out of futures
### Zulip Threads
Creeping Memmove - https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Futures.20.22creeping.20memmove.22.20problem
wg async / future sizes - https://rust-lang.zulipchat.com/#narrow/stream/187312-wg-async/topic/future.20sizes
### Motivational Example
This motivational example has a future that contains a big struct, passes it through a chain of async functions, and calls some methods on it along the way.
It is similar to patterns that occur quite frequently in async code, and it should be efficient.
```rust=
use std::future::Future;

pub struct Big([u8; 128]);

impl Big {
    fn some_fn(&self) -> bool {
        self.0[0] == 0
    }
}

pub async fn f1(b: Big) -> u32 {
    b.0[0] as u32
}

pub async fn f2(b: Big) -> u32 {
    b.some_fn();
    f1(b).await
}

pub async fn f3(b: Big) -> u32 {
    b.some_fn();
    f2(b).await
}

pub async fn f4(b: Big) -> u32 {
    b.some_fn();
    f3(b).await
}

fn main() {
    let f = Box::pin(f4(Big([0; 128]))) as std::pin::Pin<Box<dyn Future<Output = u32>>>;
    println!("size={}", std::mem::size_of_val(&*f));
    // black box so we can see the generated code
    std::hint::black_box(f);
}
```
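One way to watch the duplication compound (a hypothetical addition, reusing `f1`..`f4` and `Big` from the example above) is to print the size of each nesting level separately; the exact numbers depend on the compiler version.
```rust
// Hypothetical helper to add next to `main` above: print each nesting level's
// size before boxing, to see how each level grows over the previous one.
fn print_sizes() {
    println!("f1: {}", std::mem::size_of_val(&f1(Big([0; 128]))));
    println!("f2: {}", std::mem::size_of_val(&f2(Big([0; 128]))));
    println!("f3: {}", std::mem::size_of_val(&f3(Big([0; 128]))));
    println!("f4: {}", std::mem::size_of_val(&f4(Big([0; 128]))));
}
```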
### Situations
#### The really-ideal situation would be to have MIR inlining for generators and turn this into a single generator ("Situation A")
However, MIR inlining for generators is not semantically simple, and I'm not sure we want to rely on it.
#### The almost-ideal solution would be to have all the generator structs laid out the same way ("Situation B")
This is not the main goal of the document, but is an eventual goal I want to get to.
```
+-------+--------------------+
| state | bigbigbigbigbigbig |
+-------+--------------------+
```
Here the innermost future gets states 0-2, the next future out gets 3 and 4, the one after that gets 5 and 6, and so on, via layout magic; LLVM inlining should then be able to sort most of it out.
This is basically a "niche optimization": when we are awaiting the innermost generator, we know the state variable is in 0-2, so the outer match can dispatch directly to the right state.
If the outer future needs some variables afterwards, they can store them after the end of the inner future.
In the internal project I'm working on, 80% of the hot path functions have exactly 1 await point (which might be inside of a loop, and have code before and after, but still 1 await point), which I believe should make this optimization very useful.
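As a rough, hand-written sketch of the idea (not anything the compiler produces today; `SharedLayout` and the concrete state ranges are made up), the shared layout could look something like this:
```rust
struct Big([u8; 128]); // same as in the motivational example

// Hypothetical shared layout: every nesting level agrees on one state field
// and one slot for `Big`.
struct SharedLayout {
    // 0..=2: states of the innermost future (f1-level)
    // 3..=4: states of the next level out (f2-level)
    // 5..=6: the level after that (f3-level), and so on
    state: u8,
    // a single copy, at the same offset in every nesting level
    big: Big,
}

fn f2_level_dispatch(layout: &SharedLayout) {
    match layout.state {
        // Inner future still in flight: forward the poll. Because the inner
        // states form a contiguous range, this check acts like a niche.
        0..=2 => { /* poll the f1-level code in place */ }
        // f2-level states
        3 => { /* ... */ }
        4 => { /* ... */ }
        _ => unreachable!(),
    }
}
```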
#### The next most-ideal solution would be to have one `big`, but separate state fields ("Situation C")
```
+--------+--------+--------+--------------+
| state0 | state1 | state2 | bigbigbigbig |
+--------+--------+--------+--------------+
```
In that situation, we would have state manipulation, but no movement of the big value.
#### The next not-that-ideal-but-still-reasonable solution ("Situation D")
Would be to have `big` at a different location in each nested future, but still not duplicated
```
+--------+--------+--------+--------------+
| 0      | bigbigbigbig                   |
| 1      | 0      | bigbigbigbig          |
| 1      | 1      | state2 | bigbigbigbig |
+--------+--------+--------+--------------+
```
In that situation, we'll have to do many `memmove` calls, but the code will at least be reasonable.
#### The current situation ("Situation E")
In the current situation, `Big` is duplicated many times, and the future ends up taking 900 bytes (the code above prints `size=900`).
### Purpose of this document
The purpose of this document is to get from situation E to situation D.
Getting from Situation D to Situation B should be the subject of some other document, but I think it mostly comes down to fairly straightforward layout optimizations.
### Why we are not in Situation D
The pre-optimization MIR for, say, `f2`, currently looks like
```
// reminder: _0 is the return value, _1 is the generator itself (so _1.0 is the parameter `b`)
b = move _1.0
_2 = &b
some_fn(_2)
_3 = move b
_4 = f3(move _3)
_5 = IntoFuture::into_future(_4)
future_root = move _5
.await_loop:
_8 = &mut future_root
_9 = Pin::new_unchecked(_8)
_10 = poll(_9)
let Poll::Ready(ret) = _10 else {
    yield;
    goto .await_loop;
}
return;
```
Here, the move of `b` into `_3` occurs because function calls take an rvalue. In some sense we would like to optimize that move away at some point, but that can come after figuring out generator layouts.
In this code, according to the current implementation, there are 3 live `Big` variables: `_1.0`, `future_root`, and `b`.
`_1.0` is live when the generator is created, `future_root` is live at the yield, and `b` is live since it is borrowed.
The goal of this document is to get into a state where `b` is not regarded as live at a yield point, and `_1.0` and `future_root` don't have an interference edge between them.
### Basics of Generator Layout
The way I see generator layout is that it should work in the following steps:
1. Figure out which variables are live at yield points.
2. Figure out which interference edges exist between variables that force them to occupy distinct offsets.
3. Solve for offsets that satisfy the interference edges.
This means that each MIR variable will have exactly 1 offset in the generator.
The "anti-creep" (going from Situation D to Situation B) I eventually want to have will take part as heuristics in step (3). This document is about step (2).
### Why do we have interference edges?
If you go by the standard definition of liveness, variables A and B should have an interference edge if there is a non-UB execution in which a read of B observes a write to A, or vice versa. (In register allocation we don't count it as interference if A and B hold the same value; I don't think that matters here.)
In Rust, we also care about address inequality - we don't want "distinct" variables to have the same pointer address.
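As a small illustration of both constraints (a sketch; `interference_example` and the types in it are made up): two locals live across the same await must get distinct offsets, and while both are initialized their addresses must also compare as distinct.
```rust
async fn interference_example(x: u8) -> u8 {
    let a = [x; 16];
    let b = [x.wrapping_add(1); 16];
    // Both `a` and `b` are live across this await, so they must occupy
    // distinct offsets in the generator: a write to one must not clobber
    // the other.
    std::future::ready(()).await;
    // Address inequality: while both are initialized, `a` and `b` must also
    // be at distinct addresses.
    assert_ne!(a.as_ptr(), b.as_ptr());
    a[0].wrapping_add(b[0])
}
```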
### Request 1 - Rust-level moves kill all existing borrows.
I think it would be a good idea if Rust-level MIR of the form
```
some_borrow = &_y
_x = move _y
```
behaved the same as
```
_x = move _y
_y = poison
```
The overwrite should kill all existing borrows of `_y`. This follows naturally from Stacked Borrows, and probably from any other aliasing model.
A write of poison is, of course, "strictly more defined than a no-op", so whenever the optimizer finds it inconvenient, it can remove it.
This document does not concern itself with when the optimizer finds it convenient to remove the poison writes. Clearly, the generated code should not actually overwrite the variable with garbage, but it might be better to remove the poison writes earlier to allow e.g. merging variables.
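In surface-Rust terms, this is the kind of program Request 1 is about (a sketch; the commented-out read is the access that would become UB):
```rust
struct Big([u8; 128]);

fn request_1_example() {
    let y = Big([0; 128]);
    let p = &y.0 as *const [u8; 128];
    let x = y; // Rust-level move out of `y`
    // Under Request 1, the move behaves as if `y` were also overwritten with
    // poison, so reading through `p` here would be UB:
    // let _first = unsafe { (*p)[0] };
    // ...which in turn frees later layout decisions to reuse `y`'s bytes.
    drop(x);
    let _ = p;
}
```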
#### Note 1A - ptr::read and friends
`ptr::read` is called `ptr::read`, not `ptr::read_and_poison`.
Same for `ptr::copy`, `ptr::copy_nonoverlapping`, and any similar functions that are not Rust-level moves.
It "moves the value" in the safety invariant sense, but it does not overwrite it with poison.
Unsafe code writers expect to be able to read a value multiple times, and even to call the destructor multiple times, as long as that is not UB for some other reason (a struct whose destructor just calls `println!` is perfectly well-defined to `ptr::read` multiple times; one whose destructor calls `free` of course isn't).
I don't know a good argument for poisoning on reads. Does anyone have one?
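For instance, the `println`-destructor case mentioned above, written out as a small sketch:
```rust
use std::ptr;

struct Noisy;

impl Drop for Noisy {
    fn drop(&mut self) {
        println!("dropped");
    }
}

fn main() {
    let v = Noisy;
    // SAFETY: Noisy's destructor only prints, so running it more than once
    // breaks no invariant; reading `v` twice is fine.
    let a = unsafe { ptr::read(&v) };
    let b = unsafe { ptr::read(&v) };
    drop(a); // prints "dropped"
    drop(b); // prints "dropped" again
    // Keep the original from printing a third time; dropping it here would
    // also be fine, just noisier.
    std::mem::forget(v);
}
```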
#### Note 1B - function calls
There is some desire (e.g. https://github.com/rust-lang/rust/issues/71117) to play with the semantics of function calls on their arguments.
In Rust source, function call arguments and return values are rvalues. They are never borrowed, and are not accessed by anything after the call.
Therefore, this document is entirely independent of whatever solution that issue ends up with.
Of course, optimizations can turn function call arguments into values that have been borrowed; such optimizations should make sure they preserve MIR semantics.
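A small sketch of that point (with a made-up `consume` function): the argument is moved into the call, nothing can hold a borrow of it across the call, and the caller never touches its storage afterwards.
```rust
fn consume(v: Vec<u8>) -> usize {
    v.len()
}

fn caller() -> usize {
    let v = vec![1, 2, 3];
    // `v` is moved into the argument; any borrow taken before this point
    // cannot be used after the call, so nothing observes the argument's
    // storage once the call returns.
    consume(v)
}
```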
### Request 2 - address equality
Request 2 says that it is OK for Rust local addresses to overlap, as long as at least one of the locals is deinitialized in the Rust sense (i.e., touching it will give a "use of uninitialized/moved value" error).
Again, this is a statement about surface Rust, not about intermediate optimization representations.
This applies to both generators and regular functions:
```rust
struct NotCopy(u32);

fn foo() -> bool {
    let x = NotCopy(0);
    let addr1 = &x as *const _ as usize;
    drop(x);
    let y = NotCopy(0);
    let addr2 = &y as *const _ as usize;
    // Might return either true or false
    addr1 == addr2
}
```