Design meeting 2024-03-21: Async Cancellation and Panic

---
title: "Design meeting 2024-03-21: Async Cancellation"
tags: ["WG-async", "design-meeting", "minutes"]
date: 2024-03-21
discussion: https://rust-lang.zulipchat.com/#narrow/stream/187312-wg-async/topic/Design.20meeting.202024-03-21
url: https://hackmd.io/vsF4W080SJ6H3PZjp9xlGA
---

# Async Cancellation and Panic

The document is here:

https://theincredibleholk.org/blog/2024/03/06/async-cancellation-and-panic/

## Unwind safety

[TC: Eric and I had a discussion about his post that I'll summarize as a footnote here.]

The post covers many of the reasons why this is hard, but there is another one it does not cover: [unwind safety][].

[unwind safety]: https://doc.rust-lang.org/core/panic/trait.UnwindSafe.html

The system described in the post involves catching an unwinding, returning `Pending` from `poll` until cancellation of that future is complete, then resuming the unwind. But this only works if the future whose unwinding is being caught is `UnwindSafe`. We currently have no way to automatically derive this property, and futures created by `async` blocks are not `UnwindSafe`.

People write (or should write) `async` futures with the expectation that every `await` point is both a cancellation point and a point at which state might be observed by other code. But they don't necessarily write futures with the expectation that every call to every function that might panic is *also* a cancellation point and a point at which other code might observe state.

The problem is that catching the panic and returning `Pending` from `poll` adds these additional points of cancellation and of potential observance of state.

This isn't just a problem for the design in this post. It broadly constrains the set of potential solutions. It's possible that we'll need to adopt a design invariant, due to unwind safety, that drop (or cancel or other destruction) must proceed synchronously with unwinding except when special hand-written accommodations are made (e.g. what an executor is going to do to contain unwinding on a per-task basis).

### Potential solutions

If that's true, then here are some solutions that might still work:

1. `poll_drop_ready` or similar: If there is a synchronous destructor that's always called and is responsible for finishing any work that the asynchronous destructor did not do, then we can just call the synchronous destructor when unwinding.
2. `block_on` the async destructor: If we had async destructors, we could `block_on` those destructors when unwinding.
    - Under a `poll_drop_ready`-style scheme, the implementor could choose to do this.
3. Panic during unwinding if async cleanup did not complete before unwinding.
    - This would risk turning many single panics into double panics.
    - Again, under a `poll_drop_ready`-style scheme, the implementor could choose to do this.

The first two options both have some deadlock risk. The first seems better on that front since the presence of the synchronous destructor means the implementor must deal with that explicitly.

### Implications for "may/must not drop"

It seems true also that any "may/must not drop" scheme is going to have to confront this same problem. Any place that the code could unwind is a place where values might be dropped without first being explicitly destructed.

### Do away with general unwinding?

Alternatively, perhaps we should resume our discussions on how we might do away with general unwinding while still somehow allowing multithreaded servers to contain panics to one request.

---

# Discussion

## Attendance

- People: TC, eholk, tmandry, Daria, petrochenkov, Yosh

## Meeting roles

- Minutes, driver: TC

## Awaiting while unwinding outside `async_catch_unwind`

eholk: I got feedback on my blog post that I didn't cover this issue, but it turns out it's a little subtle. Basically, we need to be careful not to await while unwinding unless we are in an `async_catch_unwind` context because otherwise it's easy to swallow panics.

```rust
async fn foo() {
    async {
        panic!()
    }.on_cancel(async || {
        // `delay` is a function that returns pending once,
        // then completes the next time.
        delay().await
    }).await
}

let mut f = pin!(foo());
f.as_mut().poll(cx);
// The future was in the process of panicking, but we didn't observe it
// and dropped it before the cancellation handler could complete.
drop(f);
```

tmandry: What's `async_catch_unwind`? :)

yosh: Unsure whether the bound should be `AsyncUnwindSafe` or `UnwindSafe`, but it would likely look something like this:

```rust
async fn async_catch_unwind<F: async FnOnce() -> R + AsyncUnwindSafe /* or maybe `UnwindSafe`, or both */, R>(f: F) -> Result<R>
```

eric: The case is that the future has panicked, but the future is still unwinding, so we don't know it yet. If we drop that future before unwinding completes, we may never realize that the panic happened. That seems bad.

yosh: Note that keeping track of unwind state is only necessary if we can't statically encode the async drop guarantees.

(*people nodded, made sense*)

tmandry: Yeah, it would be good if we could guarantee that.

TC: What are our thoughts on `AsyncUnwindSafe` versus `UnwindSafe`, as Yosh raised?

eholk: My feeling is that `AsyncUnwindSafe` would have to be equivalent to `UnwindSafe`. The key property is what other code can observe.

Daria: ~~`Future::poll` requires `Pin<&mut T>` thus `&mut T` which is always `!UnwindSafe`, no?~~ Well, if such a reference doesn't cross `catch_unwind`, it should be fine.

Petrochenkov: Are there more chances to observe unwinding in async? You can already do this across threads in sync Rust.

eholk: I think in practice we'll have more opportunities to observe this, although it may be the same number of mechanisms.

## Sharing resources with cancellation handler

tmandry: Let's say I open a database handle in my async fn that I want to close asynchronously on panic/cancel. What does the pattern look like? Since this uses combinators, don't we have to `Arc<Mutex<T>>` the handle?

eholk: Hopefully this just works if we have `do ... final` or something similar?

tmandry: Sure, but if we do that we can probably have compiler-generated futures that can be poll_cancel'd after panicking (see next question).

## Using compiler magic to prevent inconsistent state

tmandry: The compiler represents unwinding as part of the MIR control-flow graph. I think it's entirely possible for it to generate `poll_cancel` from that without unsoundness, and if we use mechanisms like `do..final` it would work fine for adding async code that executes during unwinding...

...but that inherits the problems with `do..final` like what to do with `?`, what to do if there's a panic during a `final` block (which may be entered without a panic), ...

eholk: Agreed. I imagined a mechanism where the compiler would help generate this code.

tmandry: We may need to prototype this.

## Cancellation in no_std (comment)

tmandry: I'm less concerned about handling panics in no_std environments, which are likely to use panic=abort anyway. The important thing is that they have a mechanism for cancellation in the non-panic case.

eholk: Having thought about this after putting up my post, I think the allocation stuff was mostly an artifact of prototyping with `catch_unwind`. With compiler support, I think we could generate something like "async unwind tables" that let us recover the information we need without allocation.

tmandry: Amusingly, the unwind info used by DWARF is called "asynchronous unwind tables" because it can be used to unwind a process's stack from outside the process, e.g. by a debugger.

petrochenkov: It'd be interesting to look at what Clang does in freestanding environments when catching an exception.

## `do ... final` has bad ergonomics in some cases

eholk: Kind of tangentially related to this post, but Yosh pointed out to me that `do ... final` is kind of terrible for some common patterns. For example, in sync Rust:

```rust
fn open_file(path: &Path) -> Result<File> {
    let mut file = File::open(path)?;
    check_header(&mut file)?;
    Ok(file)
}
```

With `do ... final` and `async` we'd have to do something like:

```rust
async fn open_file(path: &Path) -> Result<File> {
    let mut file = Some(File::open(path)?);
    do {
        let Some(inner_file) = &mut file else { unreachable!() };
        check_header(inner_file).await?;
        Ok(file.take().unwrap())
    } final {
        if let Some(file) = file {
            file.close().await?;
        }
    }
}
```

(In prehistoric Rust, we called this pattern the option dance.)

TC: It's probably better to think of this as a primitive. Part of the reason I've probably made my peace with this is that many other languages have equivalents to `do .. final` (e.g. CL's `unwind-protect`), and they all have these same ergonomic problems. Every time I've thought I had a clever idea to improve on this, I've realized a problem with it that explained why all the smart people before us did it this way.

tmandry: Swift has a nice mechanism for cleaning up the `let Some`, though it wouldn't work in Rust since `?` means something else:

```rust!
file?.close().await
```

yosh: But do we really want this? What's the use in sync Rust?

tmandry: Destructors with arguments.

tmandry: Example off the top of my head: a node from a graph. If you drop that node it removes itself from the graph, but it can't hold a mut reference to the graph or you would never be able to access it otherwise.

```rust!
/* example to be provided after the call by tmandry */
```

tmandry: Also, many uses for "contexts and capabilities" overlap with use cases for destructors with arguments.

## Structural cancellation

Daria: `Future::poll_cancel` wouldn't work for cancellation of spawned tasks when they are inside of some other structure like a tuple, since a tuple does not implement `Future`. The only way out of this, I think, is async drop.

Yosh: Agreed.

eholk: This gets more powerful if we have unforgettable or undroppable types. I've been trying to explore designs where we don't have to do this right away. But maybe that's backward. If we want or need these more powerful mechanisms anyway, we should probably start with the basic type system features.

Daria: There's a lot of symmetry here with effect generics and between sync drop and async drop, and it may be worth thinking about unforgettable or undroppable types.

## Panics do not have to be treated as cancellation points

> People write (or should write) `async` futures with the expectation that every `await` point is both a cancellation point and a point at which state might be observed by other code. But they don't necessarily write futures with the expectation that every call to every function that might panic is _also_ a cancellation point and a point at which other code might observe state.

Yosh: Why are panics equated to cancellation points? According to Eric's [A Mechanism for Async Cancellation](https://blog.theincredibleholk.org/blog/2023/11/14/a-mechanism-for-async-cancellation/#cancel-during-cancellation), it is possible to make cancellation handlers idempotent, meaning that even if a cancellation is triggered _during_ cancellation, it doesn't prevent the ongoing cancellation. Couldn't unwind handlers uphold those same properties using the same or even a similar mechanism?

Eric: If a future panics and we cancel it, we do want it to keep panicking. .... Does that mean we might need more states in the future? A panic graph in addition to the cancellation graph?

Eric: We'd have more states we need to track.

Daria: Panics inside of sync function calls are cancellation points too, I would say.

TC: They're only cancellation points if wrapped in `catch_unwind`, and for that, you need to assert `UnwindSafe`.

Daria: That would mean we panic after every cancel returned a `Poll::Ready`?

---

(Discussion related to the problem of panics being "lost".)

TC: We might actually want `poll` to return a signal that it is either pending resolving to a value or pending unwinding (e.g. `Pending` vs `Unwinding` or `Canceling`). Callers should probably know that *now*, rather than know it later. Callers could use that, e.g., to keep polling futures that they would otherwise drop (because they're no longer interested in the result) if they know that the future is in fact unwinding.

Eric: That has some appeal to it, though in some way it feels like a violation of encapsulation. Still, executors might, as a policy, want to use this to prioritize tasks which are unwinding.

Yosh: We might be able to extend `Poll` if we want to:

```rust
// Yosh: I think this could work?
enum Poll<T, P = ()> {
    Ready(T),
    Pending(P),
}
```

Eric: If we wanted to do this, we'd need to figure out the backwards compatibility story.

## Generator resumption invariants

> The thing is, once `resume` panics, coroutines cannot be resumed again and they will panic if you try.

Yosh: Do we know why this is? There is nothing in the signature of the traits that would require this to happen. It sounds like we might be treating this as an invariant that informs the rest of the design; is it actually?

Eric: Yes, we should keep that invariant. It's the right behavior and it made this post tractable.

## Does `#[no_std]` even support unwinding? (duplicate)

*note: likely duplicate of tmandry's earlier comment*

Yosh: A significant amount of time in the conclusion is spent speculating about how to add support for unwinding in `no_std` targets; but I don't believe those support unwinding in the first place?

Yosh: Is the proposal under these rules that we don't run destructors automatically on unwind, even on targets which can support it?

Daria: It does though, there's rust-psp. How do they do `Box`es?
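
The resumption invariant discussed above, and TC's point that containing a panic requires asserting `UnwindSafe`, can both be observed on stable Rust. The following is a minimal sketch rather than anything presented in the meeting; `NoopWake`, `poll_contained`, `panics`, and `poll_twice_after_panic` are hypothetical names introduced for illustration. It polls an `async fn` future by hand inside `catch_unwind`, then polls it a second time.

```rust
use std::any::Any;
use std::future::Future;
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing; enough for polling by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `fut` once and contains any panic raised by that poll.
///
/// `AssertUnwindSafe` is needed because `Pin<&mut F>` is not `UnwindSafe`;
/// the caller takes responsibility for any state left inconsistent.
fn poll_contained<F: Future>(
    fut: Pin<&mut F>,
    cx: &mut Context<'_>,
) -> Result<Poll<F::Output>, Box<dyn Any + Send>> {
    catch_unwind(AssertUnwindSafe(|| fut.poll(cx)))
}

/// Hypothetical future that panics on its first poll.
async fn panics() -> u32 {
    panic!("boom")
}

/// Returns whether the first and the second poll each panicked.
fn poll_twice_after_panic() -> (bool, bool) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(panics());

    let first = poll_contained(fut.as_mut(), &mut cx).is_err();
    // The future is now poisoned: resuming it panics again.
    let second = poll_contained(fut.as_mut(), &mut cx).is_err();
    (first, second)
}
```

With the default `panic = "unwind"` strategy, both polls are expected to report a panic: the first from `panic!`, the second because the compiler-generated future refuses to be resumed after panicking. Nothing in the `Future` trait signature conveys this, which is the point Yosh raises.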
