Order of error scope resolution

Proposed by editors (Myles/Brandon/Kai), and written by Brandon, with edits from Kai.

An ongoing concern about use of error scopes is that they may require that operations within the scope finish in-order. At this time, it's unclear if that may prevent certain classes of optimization, such as moving pipeline creation to a separate thread. (See #2119 for prior conversation on that point.)

After some discussion, the editors feel that in-order resolution is not a requirement for error scopes. This document outlines the reasons why, and how we feel error scopes should operate.

Proposal

The proposal is to remove all constraints on the ordering of the resolution of promises from popErrorScope(), relative to one another and to other promises in the API. This allows implementations to use a strict ordering, to resolve as early as possible, or anything in between.

The general behavior that we would expect from an error scope is:

- Every operation performed within the error scope is allocated a "slot" in the scope.
- Once the operation completes, it signals its error or success state to the slot, regardless of when that completion takes place relative to the rest of the operations in the scope.
- At an unspecified point after…
  - … ANY slot is signaled with an error: the error scope resolves with any error in the scope, except that if a later error cascaded from an earlier one, the later one should not be the one reported (note that this is only observable via the non-machine-readable GPUError.message). Resolving earlier is better, because it allows the application to begin fallback sooner.
  - … ALL slots have been signaled with success: the error scope resolves with null.

  This language allows implementations to defer work such as shader compilation to complete after other validation on the same device timeline. Under the as-if rule, any mutable object state (like destroyed-ness) must be read at this point in the device timeline, but the actual validation computation doesn't have to complete at any particular time.
- Different error scopes (sibling or parent-child) are independent and do not need to resolve in-order.
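
The slot behavior described above can be sketched as a small model. This is only an illustration, not spec text or any browser's implementation: `ErrorScopeModel`, `allocateSlot`, `signal`, and `outcome` are hypothetical names, errors are reduced to plain strings, and `outcome()` is assumed to be queried only after the scope has been popped. Cascaded errors are approximated by reporting the earliest-issued failed slot.

```typescript
// Hypothetical model of the proposed slot behavior; all names are illustrative.
type ScopeOutcome =
  | { settled: false }
  | { settled: true; error: string | null };

class ErrorScopeModel {
  private readonly slots: Array<{ done: boolean; error: string | null }> = [];

  // Every operation issued inside the scope is allocated its own slot.
  allocateSlot(): number {
    this.slots.push({ done: false, error: null });
    return this.slots.length - 1;
  }

  // An operation signals success (null) or an error message whenever it
  // completes, in any order relative to the other operations.
  signal(slot: number, error: string | null): void {
    const s = this.slots[slot];
    if (s === undefined || s.done) {
      throw new Error(`slot ${slot} is unknown or already signaled`);
    }
    s.done = true;
    s.error = error;
  }

  // Reports whether the scope is *allowed* to resolve yet, and with what.
  // An implementation may still resolve at any later point.
  outcome(): ScopeOutcome {
    let allDone = true;
    for (const s of this.slots) {
      if (s.done && s.error !== null) {
        // Earliest-issued failed slot wins: a rough stand-in for
        // "do not report an error that cascaded from an earlier one".
        return { settled: true, error: s.error };
      }
      if (!s.done) {
        allDone = false;
      }
    }
    return allDone ? { settled: true, error: null } : { settled: false };
  }
}
```

In this model a scope with one failed slot may settle while other slots are still pending, and two separate `ErrorScopeModel` instances never consult each other, which mirrors the independence of sibling and parent-child scopes.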

Considerations

In discussing this proposal the editors talked through multiple scenarios to determine if the proposed API suited their needs.

Programmatic responses to errors

The order in which errors are processed within a scope should not matter for applications that are responding to errors programmatically. An application that uses error scopes to catch invalid calls and take another code path (such as substituting a fallback resource, using a simpler shader, or freeing memory before retrying an allocation) can straightforwardly wrap many operations in one scope in order to determine whether some operation in the scope failed with that error type.

If determining the error state of a specific operation within the scope is necessary, the developer can always create a new error scope that encompasses only that operation.
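
As a rough sketch of this pattern, the helper below wraps arbitrary work in a single scope and falls back if anything inside it failed. `pushErrorScope()` and `popErrorScope()` are the real GPUDevice methods; `ScopedDevice`, `ErrorLike`, `tryInScope`, and `createWithFallback` are hypothetical names introduced here so that the example compiles without WebGPU type definitions. The helper assumes `work` does not throw synchronously.

```typescript
// Structural stand-ins for the parts of GPUDevice and GPUError used here.
type ErrorFilter = "validation" | "out-of-memory" | "internal";
interface ErrorLike {
  message: string;
}
interface ScopedDevice {
  pushErrorScope(filter: ErrorFilter): void;
  popErrorScope(): Promise<ErrorLike | null>;
}

// Hypothetical helper: run `work` inside one scope and report whether any
// operation in it failed with the given error type. Which operation failed,
// and in what order, is deliberately not part of the answer.
async function tryInScope<T>(
  device: ScopedDevice,
  filter: ErrorFilter,
  work: () => T,
): Promise<{ value: T; error: ErrorLike | null }> {
  device.pushErrorScope(filter);
  const value = work();
  const error = await device.popErrorScope();
  return { value, error };
}

// Hypothetical fallback wrapper: keep the primary result unless the scope
// captured an error, in which case build the fallback instead.
async function createWithFallback<T>(
  device: ScopedDevice,
  filter: ErrorFilter,
  create: () => T,
  fallback: () => T,
): Promise<T> {
  const { value, error } = await tryInScope(device, filter, create);
  return error === null ? value : fallback();
}
```

To isolate a single operation, the same helper can be called with a `work` callback that issues only that one call, which gives it a scope of its own.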

Validation errors

For validation errors specifically, it can seem helpful for developers using error scopes as a debugging tool if the errors are returned in-order. This is primarily due to worries about cascading failures from an earlier issue surfacing before the root error.

Error scopes were not primarily designed as a debugging aid. Ideally the browser will naturally surface enough debugging info via console messages or other debugging mechanisms that developers will not feel compelled to fall back to error scopes for that purpose. We know that they will be used for debugging regardless, however, so it's still a use case that should be examined.

To that end, it's worth considering that implementations are likely to naturally process validation in-order. Even backend operations that will be dispatched to a different thread may have their WebGPU-level validation happen on a main thread.

In cases where the validation is asynchronous, there is still an expectation that validation messages will come in a sensible order, just not a strict order. While errors from two independent commands (such as creation of two unrelated pipelines) may arrive out of order, that's OK because they can still be examined and addressed independently, and repeated runs will eventually reveal all the validation issues. Dependent operations (such as creating a texture and then creating a bind group with it) should never surface errors out of order, because it's nonsensical to begin validating an operation before all of its inputs have been completed.

Use of error scopes as (bad) synchronization primitives

Some concerns have been raised that developers may attempt to use error scopes as implicit synchronization primitives, and treat the resolution of one scope to mean that any operations contained in scopes created before it are also completed.

The editors are comfortable with labeling this as a misuse of the API, in the same way that it was decided that mapAsync would be allowed to have undefined ordering even though it could also be erroneously treated as a synchronization primitive. In particular, it's now already hazardous to assume that anything else (except onSubmittedWorkDone) implies mapAsync is ready. Aside from this, misuse of error scopes can only result in poorer performance.

These APIs should have notes in the spec that they should not be relied on to infer the completion of other operations. Developers that want a reliable signal that work up to a given point has been completed should use queue.onSubmittedWorkDone(), and we should steer them towards it.
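
A minimal sketch of the recommended signal follows. `submit()` and `onSubmittedWorkDone()` are the real GPUQueue methods; `QueueLike` and `submitAndWait` are hypothetical names used so the example compiles without WebGPU type definitions, and command buffers are typed as `unknown` for the same reason.

```typescript
// Structural stand-in for the parts of GPUQueue used here.
interface QueueLike {
  submit(commandBuffers: unknown[]): void;
  onSubmittedWorkDone(): Promise<undefined>;
}

// Hypothetical helper: submit work, then wait on the queue itself rather
// than inferring completion from when some error scope happens to resolve.
async function submitAndWait(
  queue: QueueLike,
  commandBuffers: unknown[],
): Promise<void> {
  queue.submit(commandBuffers);
  // Resolves once the work submitted so far has completed.
  await queue.onSubmittedWorkDone();
}
```

An error scope around the same `submit()` call would still report validation failures, but under this proposal the timing of its resolution would say nothing about whether the submitted work has finished.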

Future API refinement

Another consideration from the editors is that, for error scopes, it will generally be easier to introduce stricter ordering in the future than to loosen it. If we start out with strict ordering and later decide that it is preventing certain optimizations, relaxing it will be much harder.

If a significant number of WebGPU applications are developed with an implicit assumption of error scope ordering, we would be locked into preserving that pattern for fear of breaking content.
