
Error handling rework brainstorming, and method taxonomy

THIS IS A FILE FULL OF GARBAGE WRITTEN BY @kainino0x

Taxonomy of WebGPU method calls

This document describes different types of calls in the WebGPU API, in particular how validation and creation operations (NOT queue operations) are ordered and how errors get reported.

Largely an update of https://github.com/gpuweb/gpuweb/blob/main/design/ErrorHandling.md#categories-of-webgpu-calls which is terribly outdated.

Most operations in WebGPU depend on some previous operations completing. These can be modeled as a carefully partially ordered sequence of operations, but implementations will most likely fully order many of them. Notable exceptions are the createPipeline methods, which could benefit significantly from being automatically multithreaded internally.

Queue operations can be viewed this way as well, even though each queue is supposed to be thought of as fully ordered. Explicit multiqueue could be thought of as a guiding force for how to parallelize work.

Initialization

Creation of the adapter and device.

  • GPU.requestAdapter
  • GPUAdapter.requestDevice

Independent operations

Resource creation

These methods create new objects, so they never create blocking relationships on the queue-timeline. (Device creation is async, so the device is known to be created before these are issued.) They can be modeled as being created/initialized on independent "queues" of work, with implicit queue transfers when they actually get used on a GPUQueue.

  • GPUDevice.createBuffer
  • GPUDevice.createTexture
  • GPUDevice.createSampler
  • GPUDevice.createQuerySet
  • GPUDevice.createBindGroupLayout
  • GPUDevice.createPipelineLayout
  • GPUDevice.createShaderModule
  • GPUDevice.importExternalTexture
    • Import/initialization of the GPUExternalTexture may require some GPU work, but it doesn't necessarily take place on the device's queue.
  • GPUDevice.createCommandEncoder
  • GPUDevice.createRenderBundleEncoder
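
As a minimal illustration of the calls in the list above (TypeScript, assuming a device: GPUDevice has already been obtained via requestAdapter/requestDevice), note that none of them return Promises or wait on anything:

```typescript
// Sketch: independent resource creation. Each call returns a handle
// immediately; any validation or allocation happens off the content timeline,
// and ordering constraints only appear once the objects are used together.
function createIndependentResources(device: GPUDevice) {
  const buffer = device.createBuffer({
    size: 1024,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.VERTEX,
  });
  const texture = device.createTexture({
    size: [256, 256],
    format: 'rgba8unorm',
    usage: GPUTextureUsage.TEXTURE_BINDING | GPUTextureUsage.COPY_DST,
  });
  const sampler = device.createSampler({ magFilter: 'linear' });
  return { buffer, texture, sampler };
}
```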

Destruction

Destroy methods have no validation but impact validation of future queue operations.

  • GPUDevice.destroy
  • GPUTexture.destroy
  • GPUQuerySet.destroy
  • GPUBuffer.destroy
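
A small sketch of the interaction between destroy and later queue validation (assuming a live device; the specific sizes are arbitrary):

```typescript
// Sketch: destroy() itself never fails, but a later queue operation that uses
// the destroyed resource produces a validation error on the device timeline
// (no exception is thrown here on the content timeline).
function destroyThenUse(device: GPUDevice) {
  const buffer = device.createBuffer({
    size: 256,
    usage: GPUBufferUsage.COPY_DST,
  });
  buffer.destroy(); // no validation of its own

  device.queue.writeBuffer(buffer, 0, new Uint8Array(256)); // validation error later
}
```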

Dependent operations

Resource creation

These methods depend on existing objects - in particular, the validation and (fallible) allocation of the objects they use must have completed before these methods can be validated.

Non-Promises

  • GPUDevice.createComputePipeline (dep on shader validation)
  • GPUDevice.createRenderPipeline (dep on shader validation)
  • GPUDevice.createBindGroup (dep on resource creation)
  • GPUTexture.createView (dep on texture creation)
  • GPUCommandEncoder.finish (dep on encoder and all resources used)
  • GPURenderBundleEncoder.finish (dep on encoder and all resources used)
  • GPUPipelineBase.getBindGroupLayout (dep on pipeline creation)
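
To make the dependency concrete, here is a hedged sketch of error contagion through createBindGroup (the layout argument is assumed to come from elsewhere):

```typescript
// Sketch: createBindGroup depends on the buffer's creation having succeeded.
// If the buffer's allocation or validation failed, the bind group becomes
// invalid too; nothing throws at this point on the content timeline.
function dependentCreation(device: GPUDevice, layout: GPUBindGroupLayout) {
  const buffer = device.createBuffer({
    size: 64,
    usage: GPUBufferUsage.UNIFORM,
  });
  const bindGroup = device.createBindGroup({
    layout,
    entries: [{ binding: 0, resource: { buffer } }],
  });
  return bindGroup; // its validity is only determined on the device timeline
}
```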

Promises

  • GPUDevice.createComputePipelineAsync (dep on shader validation)
  • GPUDevice.createRenderPipelineAsync (dep on shader validation)

Command encoding (encoder state changes)

These methods may add dependencies to the finish() operation. They may (only) generate errors due to incorrect encoder state.

  • All encoding commands and endPass (dep on state of encoder)
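
A sketch of one such encoder-state error (assuming a device and a color attachment view; exactly when the error surfaces depends on the design in force, e.g. at finish()):

```typescript
// Sketch: finishing an encoder while a pass is still open is an encoder-state
// error. It shows up as a validation error, not an exception.
function encoderStateError(device: GPUDevice, colorView: GPUTextureView) {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginRenderPass({
    colorAttachments: [{ view: colorView, loadOp: 'clear', storeOp: 'store' }],
  });
  // Forgetting to end the pass (endPass() at the time of writing, end() today)
  // leaves the encoder in an invalid state:
  return encoder.finish(); // yields an invalid GPUCommandBuffer
}
```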

Non-content-timeline state changes

  • GPUBuffer.mapAsync service side state change (dep on buffer creation and previous maps/unmaps)
  • GPUBuffer.unmap service side state change (dep on buffer creation and previous maps/unmaps)

Shared among multiple content timelines

  • GPUBuffer.mapAsync client side state change (NOT dep on buffer creation)
  • GPUBuffer.unmap client side state change (NOT dep on buffer creation)
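
A sketch of the mapAsync/unmap flow, showing which parts are immediate content-timeline state changes and which depend on the device timeline:

```typescript
// Sketch: calling mapAsync() flips client-side state right away (the buffer is
// now "mapping pending" on this content timeline), while the transition to
// "mapped" is a service-side state change that depends on the device timeline.
async function readBack(device: GPUDevice, size: number): Promise<Uint8Array> {
  const buffer = device.createBuffer({
    size,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  // ... copy results into `buffer` via the queue here ...
  await buffer.mapAsync(GPUMapMode.READ); // resolves once the map has completed
  const data = new Uint8Array(buffer.getMappedRange()).slice();
  buffer.unmap(); // immediate client-side state change
  return data;
}
```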

Queue operations

  • GPUQueue.submit (dep on command buffer creation)
  • GPUQueue.writeBuffer (dep on buffer creation)
  • GPUQueue.writeTexture (dep on texture creation)
  • GPUQueue.copyExternalImageToTexture (dep on texture creation)

Reportbacks from device-timeline to content-timeline (Promises)

  • GPUShaderModule.compilationInfo resolve (dep on shader validation)
  • GPUQueue.onSubmittedWorkDone resolve (dep on validation of all previous ops)
  • GPUBuffer.mapAsync resolve (dep on buffer creation)
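
A sketch of using two of these reportbacks (note: compilationInfo() was later renamed getCompilationInfo(), so the exact method name depends on the implementation version):

```typescript
// Sketch: both promises resolve based on device-timeline progress.
async function reportbacks(device: GPUDevice, code: string) {
  const module = device.createShaderModule({ code });

  // Resolves once the shader has been validated/compiled.
  const info = await module.compilationInfo();
  for (const msg of info.messages) {
    console.log(`${msg.type} at ${msg.lineNum}:${msg.linePos}: ${msg.message}`);
  }

  // Resolves once all work submitted so far has completed on the queue.
  await device.queue.onSubmittedWorkDone();
}
```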

Error Scopes

  • GPUDevice.pushErrorScope (no deps)
  • GPUDevice.popErrorScope (dep on all validation or allocation inside the scope)
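
For reference, the current usage pattern these operations enable (a minimal sketch):

```typescript
// Sketch: popErrorScope() resolves only after all validation/allocation work
// enclosed by the scope has settled on the device timeline.
async function checkedCreate(device: GPUDevice, desc: GPUBufferDescriptor) {
  device.pushErrorScope('validation');
  const buffer = device.createBuffer(desc);
  const error = await device.popErrorScope(); // GPUError | null
  if (error) {
    console.warn('createBuffer was invalid:', error.message);
  }
  return buffer;
}
```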

TBD (under development)

  • GPUCanvasContext.configure
  • GPUCanvasContext.unconfigure
  • GPUCanvasContext.getCurrentTexture

Content-timeline

  • GPUCanvasContext.getPreferredFormat
  • GPUBuffer.getMappedRange
  • new GPUOutOfMemoryError
  • new GPUValidationError
  • new GPUUncapturedErrorEvent

Design sketchpad

Error scopes have two major issues:

  • If you model out the partial ordering of operations (above), error scopes create a really complex graph of dependencies: every error scope depends on all of the stuff inside it. There's nothing inherently wrong with this; implementations could actually support this even with internal threading, but implementing it well isn't trivial. This is what created the issue with error handling on shader and pipeline creation (https://github.com/gpuweb/gpuweb/issues/2119). Still, if it were just this, I think implementations could make it work.

  • They're global state, 'nuff said. But seriously, they persist across JS tasks, and as a result an await can cause error scopes to capture errors from random other async tasks. Imagine you're doing some kind of async init while continuing rendering of the page - rendering gets caught in scopes that are supposed to be for init. (In theory you should never need to persist scopes across await for this, but the API makes it very easy to do this wrong.)
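
A sketch of the await-spanning footgun described in the second bullet (the fetched URL is just a placeholder):

```typescript
// Sketch: this scope is intended to watch only the init work, but because it
// stays pushed across the awaits, errors from unrelated tasks running on the
// same device (e.g. a requestAnimationFrame render loop) get captured by it
// instead of reaching uncapturederror.
async function initWithScope(device: GPUDevice) {
  device.pushErrorScope('validation');
  const response = await fetch('/shaders/main.wgsl'); // other tasks run here...
  const code = await response.text();                 // ...and here
  device.createShaderModule({ code });
  return device.popErrorScope();
}
```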

Generally, OOM errors can only be handled on an individual allocation basis. If you don't know which call it came from, you don't know which one to retry. There are situations where you could want to group them, for example if you have two resources with matching sizes and you can downsize if either fails. But in this case it's acceptable to force the application to compose the two errors itself.
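
A sketch of per-allocation OOM handling with today's API, along the lines described above (sizes and usages are arbitrary):

```typescript
// Sketch: scoping a single allocation means the caller knows exactly which
// allocation failed and can retry it at a smaller size.
async function createBufferWithFallback(device: GPUDevice, size: number) {
  device.pushErrorScope('out-of-memory');
  let buffer = device.createBuffer({ size, usage: GPUBufferUsage.STORAGE });
  if (await device.popErrorScope()) {
    buffer.destroy(); // the failed buffer is invalid; destroying it is harmless
    buffer = device.createBuffer({
      size: Math.ceil(size / 2),
      usage: GPUBufferUsage.STORAGE,
    });
  }
  return buffer;
}
```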

Validation errors are never useful to handle inline. They only need to be detected for uses like telemetry and CTS. (Library integration tests could also use them like CTS.) Debugging is supposed to go through other mechanisms, like devtools breakpoints, although it's worth considering the printf-debugging style (https://github.com/gpuweb/gpuweb/issues/2478).

Just have error scopes not span JS tasks?

  1. Error scopes could be a "container" function like device.withErrorScope(() => { do your stuff here }) (a sketch of this shape follows below). This could be really unwieldy for applications though (esp in native).
  2. Error scopes could be "begin only", somehow magically attached to the task and end as soon as the task ends. (Or at least enqueue a microtask to end themselves.)
  3. Error scopes could inherit through async boundaries somehow??? (I was envisioning User Activation but apparently this is actually not how that works.)

1 and 2 don't really work for testing; I don't think we can reflect on the graph of async tasks in a way that lets us wrap each task of an async invocation tree in its own error scope (if only they were Rust-style async). 3 is a massive question mark.
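
A hypothetical sketch of option 1's shape; withErrorScope is not part of the API today, so this is just an approximation built on the existing push/pop so the scope cannot outlive the callback:

```typescript
// Hypothetical sketch only: the callback is deliberately synchronous, so no
// awaits can happen while the scope is pushed.
async function withErrorScope<T>(
  device: GPUDevice,
  filter: GPUErrorFilter,
  fn: () => T,
): Promise<{ result: T; error: GPUError | null }> {
  device.pushErrorScope(filter);
  const result = fn();
  const error = await device.popErrorScope();
  return { result, error };
}

// Usage: const { result, error } = await withErrorScope(device, 'validation',
//   () => device.createBuffer({ size: 16, usage: GPUBufferUsage.UNIFORM }));
```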

Have every operation report errors separately?

This has been considered in the past. To make it compatible with telemetry we could have some kind of flag (the presence of an error callback) that determines whether something is an "uncaptured" error. But it's heavy-handed. There's runtime overhead. You have to remember to pass the callback in every function. Well, not every function; because errors are cascading, you only need to do it in the "leaf" operations. But still, it's no good for testing because if you miss one callback then you get false positives.
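
To make the shape of this idea concrete, here is a purely hypothetical sketch; onError is not part of WebGPU, and this wrapper only emulates the idea with an error scope (in the real proposal, omitting the callback would leave the error "uncaptured", which this emulation cannot reproduce):

```typescript
// Hypothetical sketch of "every operation reports errors separately".
async function createBufferReportingErrors(
  device: GPUDevice,
  desc: GPUBufferDescriptor,
  onError: (e: GPUError) => void,
): Promise<GPUBuffer> {
  device.pushErrorScope('validation');
  // createBuffer is a "leaf" operation here; errors from its inputs cascade into it.
  const buffer = device.createBuffer(desc);
  const error = await device.popErrorScope();
  if (error) onError(error);
  return buffer;
}
```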

Replace "uncapturederror" with "getAnyUncapturedErrorSinceLastCallToThisFunction()"

This would be great for testing, but it doesn't save us from error ordering. It loosens the restrictions because there is no nesting, but error scopes still have the complex behavior (which I assert above is implementable).

Testing could use onSubmittedWorkDone() plus some timeouts to approximate this (and forward any further uncaptured errors to some higher-level harness failure). This could work as a backup to cooperative testing ("Have every operation report errors separately"), but the performance impact on testing would be abysmal unless we can hold a LOT of GPUDevices in parallel.
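
A rough sketch of that testing approximation (the timeout mentioned above is omitted for brevity, and forwarding of late errors to a higher-level harness is left out):

```typescript
// Sketch: treat "no uncaptured errors arrived by the time submitted work is
// done" as an approximation of "this step produced no errors".
async function expectNoErrors(device: GPUDevice, step: () => void): Promise<void> {
  let failure: GPUError | undefined;
  const listener = (ev: GPUUncapturedErrorEvent) => { failure = ev.error; };
  device.addEventListener('uncapturederror', listener as EventListener);

  step();
  await device.queue.onSubmittedWorkDone(); // plus, in practice, a timeout

  device.removeEventListener('uncapturederror', listener as EventListener);
  if (failure) throw new Error(`unexpected WebGPU error: ${failure.message}`);
}
```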

Also, I had envisioned uncapturederror as spanning all threads (errors from any thread go to the originating device object), but being able to specify the cutoff point creates a race between threads. Either we allow the race or we make it per-thread. Per-thread is less good for telemetry (though probably fine), but maybe we somehow keep both uncapturederror and this new thing.

Ideas Take 1

Best idea so far??

  • Get rid of error scopes.
  • Add callbacks for recoverable OOM in createBuffer/Texture/QuerySet, and hopefully nowhere else.
  • Keep uncapturederror as is. Have it exist only on the originating device object (one thread).
  • Add getAnyUncapturedErrorSinceLastCallToThisFunction(), on every thread. It doesn't prevent anything from going to the event. Testing would simply not use the event, and use this instead. Make implementations deal with the ordering problem internally.

What if we try to change as little as possible?

  • Keep uncapturederror as is.
  • Keep error scopes API as is (with both validation and out-of-memory).
    • No guarantee about ordering of sibling scopes.
    • Probably no guarantee about ordering of parent/child scopes. (A parent could return before a child: if the child has the same filter, it's known that nothing can bubble up from the child into the parent.)
    • Probably still guarantee that you get the "first" error in the scope. (This guarantee should be enough to implement, and makes it a lot more useful for debugging.)
  • Describe everything in WebGPU in an async task model where nothing is ordered until an operation creates a dependency. In this model ALL shader/pipeline creation occurs on independent "threads" that just get "await"ed when you need them.
  • MAYBE add a device.isReady(obj) -> Promise<void> taking a pipeline (maybe other WebGPU objects). A rough sketch follows this list.
    • Would not work for operations that don't return objects. Pretty sure it doesn't matter. (We have onSubmittedWorkDone for queues, which ARE ordered.)
    • Lets us remove create*PipelineAsync().
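
A purely hypothetical sketch of the isReady() idea mentioned above; nothing here exists in WebGPU, it only pins down the proposed shape:

```typescript
// Hypothetical: isReady() is NOT part of WebGPU. It would resolve once the
// object's creation work has completed, without blocking anything else.
interface GPUDeviceWithIsReady extends GPUDevice {
  isReady(obj: GPUComputePipeline | GPURenderPipeline): Promise<void>;
}

async function useIsReady(
  device: GPUDeviceWithIsReady,
  desc: GPUComputePipelineDescriptor,
) {
  const pipeline = device.createComputePipeline(desc); // returns immediately
  await device.isReady(pipeline); // would replace createComputePipelineAsync()
  return pipeline;
}
```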

A slight variation on that

  • Keep ONLY the validation error scope. Move out-of-memory to createBuffer/Texture/QuerySet.
    • Simplifies parent/child scope ordering.
    • If error scopes are ONLY used for testing then ordering really no longer matters. Parent/child/sibling ordering prooobably doesn't matter for debugging?

Taking it further

  • Make (validation) error scopes part of an optional feature, and clearly document that it could be removed in the future.
    • Allows us to replace error scopes with getAnyUncapturedErrorSinceLastCallToThisFunction or whatever in the future.
    • Makes it clear you're really not supposed to be using them in normal applications.
    • Which gives us better leeway to make them unordered.

Scratch/observations

  • Error scopes could be "removed" in a future device feature. I don't know what I was thinking this would solve.
  • How are state changes modeled in the task graph model?
    • Resource destroy and device destroy in particular - kinda same as resource writes and queue transfers, so maybe each object has its own timeline instead of just tasks?
    • State changes mean some tasks are dependent on the order the tasks are created (e.g. writeBuffer and buffer.destroy must have an order).
  • https://github.com/gpuweb/gpuweb/pull/2583#issuecomment-1034038886

Ideas Take 2

The problems are maybe not as intertwined as I thought.

  • #2119 (createShaderModule) can be solved in several ways without getting rid of the totally-ordered nature of the Device Timeline.

    1. Change the API to never surface shader errors in error scopes.
    2. Do stuff in the implementation to "hide" the fact that createShaderModule is executed concurrently (as described in #2119). Basically delay error reporting for all future scopes. Model of how this works in a total ordering.
      • (Decided I don't think there's actually any value in removing ordering guarantees from error scopes.)
  • #2085 (create pipeline async) can be solved with:

    1. Work stealing by cache hit (createPipeline after createPipelineAsync). TBD how specced this would be.
    2. Work stealing by returning a pipeline object immediately that blocks when used. Needs another way to detect completion.
      • Early: separate function or overload (createPipelineAsync), descriptor option, or callback.
      • Late: isReady() or pipeline.ready, usable on any pipeline.
  • Await-spanning for out-of-memory is not a huge issue because you're supposed to use it around only one call. We could even enforce that!

    • Moving OOM detection to createBuffer/Texture/QuerySet is still clearer, and easy to implement. Can even be implemented on top of today's error scopes.
    • OOM scopes aren't necessary for CTS. An OOM resource will cause validation errors in other commands if used. And if the test cares about the OOM itself, it can just check for it.
    • Uncaptured OOMs should still go to the uncapturederror event, bypassing the scopes.
  • Validation error scopes are not meant to be used in production, so the await-spanning problem isn't as much of a concern.

    • MAYBE consider putting them in an optional extension for documentation's sake. Could we really remove them later?
    • getAnyUncapturedErrorSinceLastCallToThisFunction could be implemented on top of error scopes simply as pop();push(); (sketched below), except that if uncapturederror also exists, this prevents errors from ever reaching it. Hardly seems worth it given it still has await-spanning problems, and it's just as possible to use pop();push(); directly as a debugging tool.
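
A sketch of that pop();push(); construction, with the caveats just noted (it diverts errors away from uncapturederror while a scope is open, and still spans awaits/tasks):

```typescript
// Sketch: emulate getAnyUncapturedErrorSinceLastCallToThisFunction() on top of
// error scopes. Both calls happen back-to-back on the content timeline, so
// there is no gap in coverage between the old scope and the new one.
async function getAnyUncapturedErrorSinceLastCall(
  device: GPUDevice,
): Promise<GPUError | null> {
  const popped = device.popErrorScope(); // pop();
  device.pushErrorScope('validation');   // push();
  return popped;
}

// Usage: push one initial scope up front, then poll between test cases:
//   device.pushErrorScope('validation');
//   ...
//   const error = await getAnyUncapturedErrorSinceLastCall(device);
```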