The document is here:
https://theincredibleholk.org/blog/2024/03/06/async-cancellation-and-panic/
[TC: Eric and I had a discussion about his post that I'll summarize as a footnote here.]
The post does a good job of covering many of the reasons why this is hard, but there is another one it does not cover: unwind safety.
The system described in the post involves catching an unwinding, returning Pending from poll until cancellation of that future is complete, then resuming the unwind.
But this only works if the future whose unwinding is being caught is UnwindSafe. We currently have no way to automatically derive this property, and futures created by async blocks are not UnwindSafe.
People write (or should write) async futures with the expectation that every await point is both a cancellation point and a point at which state might be observed by other code. But they don't necessarily write futures with the expectation that every call to every function that might panic is also a cancellation point and a point at which other code might observe state.
The problem is that catching the panic and returning Pending from poll adds these additional points of cancellation and of potential observation of state.
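To make the mechanism concrete, here is a minimal sketch of a wrapper that catches a panic inside poll, answers Pending once, and resumes the unwind on the next poll. The names DeferUnwind, poll_once, and boom are hypothetical and do not come from the post, and the single Pending stands in for whatever asynchronous cleanup a real design would run. Note that the sketch only compiles because it asserts unwind safety by hand.

```rust
use std::any::Any;
use std::future::Future;
use std::panic::{self, AssertUnwindSafe};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical wrapper (not from the post): catches a panic raised inside the
/// inner `poll`, answers `Pending` once as a stand-in for async cleanup, and
/// resumes the unwind on the following poll.
pub struct DeferUnwind<F> {
    inner: Option<Pin<Box<F>>>,
    payload: Option<Box<dyn Any + Send>>,
}

impl<F: Future> DeferUnwind<F> {
    pub fn new(fut: F) -> Self {
        Self { inner: Some(Box::pin(fut)), payload: None }
    }
}

impl<F: Future> Future for DeferUnwind<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        let this = self.get_mut(); // fine: every field is `Unpin`
        if let Some(payload) = this.payload.take() {
            // "Cleanup" has finished, so let the original panic continue.
            panic::resume_unwind(payload);
        }
        let fut = this.inner.as_mut().expect("polled after completion");
        // The closure captures `&mut` state, which is not `UnwindSafe`, so the
        // property has to be asserted by hand.
        match panic::catch_unwind(AssertUnwindSafe(|| fut.as_mut().poll(cx))) {
            Ok(poll) => poll,
            Err(payload) => {
                this.inner = None; // drop the half-unwound future
                this.payload = Some(payload);
                cx.waker().wake_by_ref(); // ask to be polled again
                Poll::Pending // a new cancellation and observation point
            }
        }
    }
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls once with a no-op waker; `Err` means the poll itself unwound.
pub fn poll_once<F: Future + Unpin>(
    fut: &mut F,
) -> Result<Poll<F::Output>, Box<dyn Any + Send>> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    panic::catch_unwind(AssertUnwindSafe(|| Pin::new(fut).poll(&mut cx)))
}

/// A future that panics on its first poll.
pub async fn boom() -> u32 {
    panic!("boom")
}
```

Polling DeferUnwind::new(boom()) with poll_once yields Pending on the first call and only unwinds on the second. Between those two polls the caller can observe state or drop the future, which is exactly the extra cancellation point described above.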
This isn't just a problem for the design in this post. This broadly constrains the set of potential solutions. It's possible that we'll need to adopt a design invariant, due to unwind safety, that drop (or cancel or other destruction) must proceed synchronously with unwinding except when special hand-written accommodations are made (e.g. what an executor is going to do to contain unwinding on a per-task basis).
If that's true, then here are some solutions that might still work:
poll_drop_ready or similar: If there is a synchronous destructor that's always called and is responsible for finishing any work that the asynchronous destructor did not do, then we can just call the synchronous destructor when unwinding.
block_on the async destructor: If we had async destructors, we could block_on those destructors when unwinding.
[...] poll_drop_ready-style scheme, the implementor could choose to do this.
[...] poll_drop_ready-style scheme, the implementor could choose to do this.
The first two options both have some deadlock risk. The first seems better on that front since the presence of the synchronous destructor means the implementor must deal with that explicitly.
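As a rough illustration of the block_on option, the sketch below emulates an async destructor with an ordinary Drop impl that drives an async close method to completion. Everything here is an assumption made for illustration: Handle, Log, failing_request, and the toy busy-polling block_on are hypothetical, and real async destructors do not exist in Rust today.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: busy-polls on the current thread until the future is ready.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}

pub type Log = Arc<Mutex<Vec<&'static str>>>;

/// Hypothetical resource whose cleanup is naturally asynchronous.
pub struct Handle {
    log: Log,
    closed: bool,
}

impl Handle {
    pub fn new(log: Log) -> Self {
        Self { log, closed: false }
    }

    /// Stand-in for an "async destructor".
    pub async fn close(&mut self) {
        if !self.closed {
            self.closed = true;
            self.log.lock().unwrap().push("closed");
        }
    }
}

impl Drop for Handle {
    fn drop(&mut self) {
        // Runs on normal scope exit and during unwinding alike. If `close`
        // needed this blocked thread to make progress, it would deadlock.
        block_on(self.close());
    }
}

/// A request handler that panics while holding the handle.
pub fn failing_request(log: Log) {
    let _handle = Handle::new(log);
    panic!("request failed");
}
```

Calling failing_request inside std::panic::catch_unwind leaves a single "closed" entry in the log, so the cleanup ran synchronously with the unwind. The comment in drop marks where the deadlock risk mentioned above would bite.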
It seems true also that any "may/must not drop" scheme is going to have to confront this same problem. Any place that the code could unwind is a place where values might be dropped without first being explicitly destructed.
Alternatively, perhaps we should resume our discussions on how we might do away with general unwinding while still somehow allowing multithreaded servers to contain panics to one request.
async_catch_unwind
eholk: I got feedback on my blog post that I didn't cover this issue, but it turns out it's a little subtle. Basically, we need to be careful not to await while unwinding unless we are in an async_catch_unwind context because otherwise it's easy to swallow panics.
async fn foo() {
    async {
        panic!()
    }.on_cancel(async || {
        delay().await // delay is a function that returns pending once, then completes the next time
    }).await
}

let mut f = pin!(foo());
f.poll(cx);
drop(f);
// The future was in the process of panicking, but we didn't observe it and dropped it before the cancellation handler could complete.
tmandry: What's async_catch_unwind? :)
yosh: Unsure whether the bound should be AsyncUnwindSafe or UnwindSafe, but it would likely look something like this:
async fn async_catch_unwind<F: async FnOnce() -> R + AsyncUnwindSafe /* or maybe `UnwindSafe`, or both */, R>(f: F) -> Result<R>
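For comparison, something close to this signature can be approximated on stable Rust today. The sketch below is an assumption-laden stand-in, not the proposed API: it takes a future rather than an async closure, reuses UnwindSafe because AsyncUnwindSafe does not exist, and brings its own toy block_on and a hypothetical boom future so that it is self-contained.

```rust
use std::future::{poll_fn, Future};
use std::panic::{catch_unwind, AssertUnwindSafe, UnwindSafe};
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

/// Sketch only: resolves to `Err(payload)` if any poll of `fut` panics.
pub async fn async_catch_unwind<Fut>(fut: Fut) -> thread::Result<Fut::Output>
where
    Fut: Future + UnwindSafe,
{
    let mut fut = pin!(fut);
    poll_fn(move |cx| {
        // Each poll is wrapped on its own. The `UnwindSafe` bound above is the
        // caller's promise; the assertion here covers only the borrows.
        match catch_unwind(AssertUnwindSafe(|| fut.as_mut().poll(cx))) {
            Ok(poll) => poll.map(Ok),
            Err(payload) => Poll::Ready(Err(payload)),
        }
    })
    .await
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: busy-polls on the current thread until the future is ready.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        thread::yield_now();
    }
}

/// A future that panics on its first poll.
pub async fn boom() -> u32 {
    panic!("boom")
}
```

A caller would write something like block_on(async_catch_unwind(AssertUnwindSafe(boom()))) and get an Err back. Wrapping the argument in AssertUnwindSafe sidesteps the question of whether a given compiler-generated future is UnwindSafe, which is the open question in this part of the discussion.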
eric: The case is that the future has panicked, but the future is still unwinding, so we don't know it yet. If we drop that future before unwinding completes, we may never realize that the panic happened. That seems bad.
yosh: note that keeping track of unwind state is only necessary if we can't statically encode the async drop guarantees. (people nodded, made sense)
Tmandry: yeah it would be good if we could guarantee that.
TC: What are our thoughts on AsyncUnwindSafe versus UnwindSafe, as Yosh raised?
eholk: My feeling is that AsyncUnwindSafe would have to be equivalent to UnwindSafe. The key property is what other code can observe.
Daria: Future::poll requires Pin<&mut T>, thus &mut T, which is always !UnwindSafe, no? Well, if such a reference doesn't cross catch_unwind it should be fine.
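The point about what other code can observe can be illustrated with plain synchronous code. In the sketch below, Ledger and try_add are hypothetical names; the panic lands between two updates, and because the &mut borrow crosses catch_unwind, the caller gets to see the half-updated value afterwards.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Invariant: `sum` always equals the total of `items`.
#[derive(Default)]
pub struct Ledger {
    pub items: Vec<i64>,
    pub sum: i64,
}

impl Ledger {
    pub fn add(&mut self, amount: i64) {
        self.items.push(amount);
        // A panic here leaves `items` updated but `sum` stale.
        assert!(amount >= 0, "negative amount");
        self.sum += amount;
    }

    pub fn is_consistent(&self) -> bool {
        self.items.iter().sum::<i64>() == self.sum
    }
}

/// Returns `true` if the call panicked. Without `AssertUnwindSafe` this does
/// not compile, because the closure captures `&mut Ledger`.
pub fn try_add(ledger: &mut Ledger, amount: i64) -> bool {
    catch_unwind(AssertUnwindSafe(|| ledger.add(amount))).is_err()
}
```

After try_add(&mut ledger, -1) returns true, is_consistent reports false. A future polled through Pin<&mut T> is in the same position as ledger here once something catches the unwind and keeps using it.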
Petrochenkov: Are there more chances to observe unwinding in async? You can already do this across threads in sync Rust.
eholk: I think in practice we'll have more opportunities to observe this, although it may be the same number of mechanisms.
tmandry: Let's say I open a database handle in my async fn that I want to close asynchronously on panic/cancel. What's the pattern look like – since this uses combinators, don't we have to Arc<Mutex<T>> the handle?
eholk: Hopefully this just works if we have do ... final or something similar?
tmandry: Sure, but if we do that we can probably have compiler-generated futures that can be poll_cancel'd after panicking (see next question)
tmandry: The compiler represents unwinding as part of the MIR control-flow graph. I think it's entirely possible for it to generate poll_cancel from that without unsoundness, and if we use mechanisms like do..final it would work fine for adding async code that executes during unwinding…
…but that inherits the problems with do..final like what to do with ?, what to do if there's a panic during a final block (which may be entered without a panic), …
eholk: Agreed. I imagined a mechanism where the compiler would help generate this code.
tmandry: We may need to prototype this.
tmandry: I'm less concerned about handling panics in no_std environments, which are likely to use panic=abort anyway. The important thing is that they have a mechanism for cancellation in the non-panic case.
eholk: Having thought about this after putting up my post, I think the allocation stuff was mostly an artifact of prototyping with catch_unwind. With compiler support, I think we could generate something like "async unwind tables" that let us recover the information we need without allocation.
tmandry: Amusingly, the unwind info used by DWARF is called "asynchronous unwind tables" because it can be used to unwind a process's stack from outside the process, e.g. by a debugger.
petrochenkov: It'd be interesting to look at what Clang does in freestanding environments when catching an exception.
do ... final has bad ergonomics in some cases
eholk: Kind of tangentially related to this post, but Yosh pointed out to me that do ... final is kind of terrible for some common patterns. For example, in sync Rust:
fn open_file(path: &Path) -> Result<File> {
    let mut file = File::open(path)?;
    check_header(&mut file)?;
    Ok(file)
}
With do ... final and async we'd have to do something like:
async fn open_file(path: &Path) -> Result<File> {
    let mut file = Some(File::open(path)?);
    do {
        // Borrow the file without taking it out of the Option.
        let Some(inner_file) = file.as_mut() else { unreachable!() };
        check_header(inner_file).await?;
        // Success: move the file out so the final block skips the close.
        Ok(file.take().unwrap())
    } final {
        if let Some(file) = file {
            file.close().await?;
        }
    }
}
(In prehistoric Rust, we called this pattern the option dance.)
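The same option dance already shows up in today's sync Rust whenever a drop guard has to be defused on the success path. The sketch below is only an analogy under stated assumptions: FakeFile, CloseOnDrop, and the boolean "valid header" flag are hypothetical stand-ins, and a Drop impl cannot await, which is the gap do ... final is meant to fill.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a file with an explicit `close`.
pub struct FakeFile {
    pub valid: bool,
    closed: Rc<Cell<bool>>,
}

impl FakeFile {
    pub fn close(self) {
        self.closed.set(true);
    }
}

/// Drop guard: closes the file unless it has been taken out first.
struct CloseOnDrop(Option<FakeFile>);

impl Drop for CloseOnDrop {
    fn drop(&mut self) {
        if let Some(file) = self.0.take() {
            file.close();
        }
    }
}

fn check_header(file: &mut FakeFile) -> Result<(), String> {
    if file.valid {
        Ok(())
    } else {
        Err("bad header".to_string())
    }
}

pub fn open_file(valid: bool, closed: Rc<Cell<bool>>) -> Result<FakeFile, String> {
    let mut guard = CloseOnDrop(Some(FakeFile { valid, closed }));
    // On error, `?` returns early and the guard closes the file.
    check_header(guard.0.as_mut().expect("file is present until taken"))?;
    // Success: take the file out so the guard's destructor does nothing.
    Ok(guard.0.take().expect("file is present until taken"))
}
```

With an invalid header, open_file returns Err and the shared flag reads true. With a valid header the caller receives the file and the flag stays false.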
TC: It's probably better to think of this as a primitive. Part of the reason I've probably made my peace with this is that many other languages have equivalents to do .. final (e.g. CL's unwind-protect), and they all have these same ergonomic problems. Every time I've thought I had a clever idea to improve on this, I've realized a problem with it that explained why all the smart people before us did it this way.
tmandry: Swift has a nice mechanism for cleaning up the let Some, though it wouldn't work in Rust since ? means something else:
file?.close().await
yosh: But do we really want this? What's the use in sync Rust?
tmandry: Destructors with arguments.
tmandry: Example off the top of my head - a node from a graph, if you drop that node it removes itself from the graph, but it can't hold a mut reference to the graph or you would never be able to access it otherwise.
/* example to be provided after the call by tmandry */
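Until that example is filled in, one way to picture the graph case is sketched below. This is a guess at the shape of the idea, not tmandry's example: Graph, NodeHandle, and remove are hypothetical names, and the "destructor with an argument" is emulated by a method that consumes the handle and borrows the graph.

```rust
use std::collections::HashMap;

/// Adjacency lists keyed by node id.
pub struct Graph {
    edges: HashMap<u32, Vec<u32>>,
    next_id: u32,
}

/// Holds no reference to the graph, so the graph stays freely usable.
pub struct NodeHandle {
    id: u32,
}

impl Graph {
    pub fn new() -> Self {
        Self { edges: HashMap::new(), next_id: 0 }
    }

    pub fn add_node(&mut self) -> NodeHandle {
        let id = self.next_id;
        self.next_id += 1;
        self.edges.insert(id, Vec::new());
        NodeHandle { id }
    }

    pub fn connect(&mut self, a: &NodeHandle, b: &NodeHandle) {
        self.edges.entry(a.id).or_default().push(b.id);
        self.edges.entry(b.id).or_default().push(a.id);
    }

    pub fn len(&self) -> usize {
        self.edges.len()
    }

    pub fn neighbors(&self, node: &NodeHandle) -> Vec<u32> {
        self.edges.get(&node.id).cloned().unwrap_or_default()
    }
}

impl NodeHandle {
    pub fn id(&self) -> u32 {
        self.id
    }

    /// The "destructor with an argument": it needs `&mut Graph`, which a
    /// plain `Drop` impl has no way to receive.
    pub fn remove(self, graph: &mut Graph) {
        graph.edges.remove(&self.id);
        for list in graph.edges.values_mut() {
            list.retain(|&n| n != self.id);
        }
    }
}
```

Nothing in this sketch forces a caller to invoke remove. A handle that is simply dropped leaves a stale node behind, which is the gap that destructors with arguments would close.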
tmandry: Also, many uses for "contexts and capabilities" overlap with use cases for destructors with arguments.
Daria: Future::poll_cancel wouldn't work for cancellation of spawned tasks when they are inside of some other structure like a tuple, since a tuple does not implement Future. The only way out of this I think is async drop.
Yosh: Agreed.
eholk: This gets more powerful if we have unforgettable or undroppable types. I've been trying to explore designs where we don't have to do this right away. But maybe that's backward. If we want or need these more powerful mechanisms anyway, we should probably start with the basic type system features.
Daria: There's a lot of symmetry here with effect generics and between sync drop and async drop, and it may be worth thinking about unforgettable or undroppable types.
People write (or should write) async futures with the expectation that every await point is both a cancellation point and a point at which state might be observed by other code. But they don't necessarily write futures with the expectation that every call to every function that might panic is also a cancellation point and a point at which other code might observe state.
Yosh: Why are panics equated to cancellation points? According to Eric's A Mechanism for Async Cancellation, it is possible to make cancellation handlers idempotent, meaning that even if a cancellation is triggered during cancellation - it doesn't prevent the ongoing cancellation. Couldn't unwind handlers uphold those same properties using the same or even a similar mechanism?
Eric: If a future panics and we cancel it, we do want it to keep panicking. … Does that mean we might need more states in the future? A panic graph in addition to the cancellation graph?
Eric: We'd have more states we need to track.
Daria: Panics inside of sync function calls are cancellation points too I would say.
TC: They're only cancellation points if wrapped in catch_unwind, and for that, you need to assert UnwindSafe.
Daria: That would mean we panic after every cancel returned a Poll::Ready?
(Discussion related to the problem of panics being "lost".)
TC: We might actually want poll to return a signal that it is either pending resolving to a value or pending unwinding (e.g. Pending vs Unwinding or Canceling). Callers should probably know that now, rather than know it later. Callers could use that, e.g., to keep polling futures that they would otherwise drop (because they're no longer interested in the result) if they know that the future is in fact unwinding.
Eric: That has some appeal to it - in some way that feels like a violation of encapsulation. Though executors might, as a policy, want to use this to prioritize tasks which are unwinding.
Yosh: We might be able to extend Poll if we want to:
// Yosh: I think this could work?
enum Poll<T, P = ()> {
    Ready(T),
    Pending(P),
}
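To see how a caller might use such a signal, here is a small sketch that pairs a local copy of that enum with the policy TC describes. PendingKind and should_keep_polling are hypothetical names, and this Poll is a stand-alone type defined for the example, not std::task::Poll.

```rust
/// Why a future is not ready yet. Hypothetical; nothing like this is in `std`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingKind {
    /// Still working towards a value.
    Value,
    /// A panic is in flight and cleanup is still running.
    Unwinding,
}

/// Local copy of the suggested shape, with a payload on `Pending`.
#[derive(Debug, PartialEq, Eq)]
pub enum Poll<T, P = ()> {
    Ready(T),
    Pending(P),
}

/// Caller policy: once the result is no longer wanted, drop the future unless
/// it reports that it is unwinding, in which case keep polling it.
pub fn should_keep_polling<T>(state: &Poll<T, PendingKind>, still_interested: bool) -> bool {
    match state {
        Poll::Ready(_) => false,
        Poll::Pending(PendingKind::Unwinding) => true,
        Poll::Pending(PendingKind::Value) => still_interested,
    }
}
```

With the default P = (), existing code that writes Poll<T> would keep its current shape in this sketch. Whether that is enough for the real Poll is part of the compatibility question.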
Eric: If we wanted to do this, we'd need to figure out the backwards compatibility story.
The thing is, once resume panics, coroutines cannot be resumed again and they will panic if you try.
Yosh: Do we know why this is? There is nothing in the signature of the traits that would require this to happen? It sounds like we might be treating this as an invariant that informs the rest of the design; is it actually?
Eric: Yes, we should keep that invariant. It's the right behavior and it made this post tractable.
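The invariant can be observed on stable Rust with an ordinary async fn. In the sketch below, boom and poll_panics are hypothetical helper names; the second poll of a future that already panicked panics again rather than resuming where it left off.

```rust
use std::future::Future;
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// A future that panics on its first poll.
pub async fn boom() -> u32 {
    panic!("first panic")
}

/// Polls once with a no-op waker and reports whether that poll panicked.
pub fn poll_panics<F: Future>(fut: Pin<&mut F>) -> bool {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    catch_unwind(AssertUnwindSafe(|| fut.poll(&mut cx))).is_err()
}
```

For a future created with Box::pin(boom()), both poll_panics(fut.as_mut()) calls return true. The first reports the original panic and the second reports the compiler-generated "resumed after panicking" check.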
Does #[no_std] even support unwinding? (duplicate)
note: likely duplicate of tmandry's earlier comment
Yosh: A significant amount of time in the conclusion is spent speculating about how to add support for unwinding in no_std targets; but I don't believe those support unwinding in the first place?
Yosh: Is the proposal under these rules that we don't run destructors automatically on unwind, even on targets which can support it?
Daria: It does though, there's rust-psp. How do they do Boxes?