THIS IS A FILE FULL OF GARBAGE WRITTEN BY @kainino0x
This document describes different types of calls in the WebGPU API, in particular how validation and creation operations (NOT queue operations) are ordered and how errors get reported.
Largely an update of https://github.com/gpuweb/gpuweb/blob/main/design/ErrorHandling.md#categories-of-webgpu-calls, which is terribly outdated.
Most operations in WebGPU depend on some previous operations completing. These can be modeled as a carefully partially-ordered sequence of operations, though implementations will most likely fully order many of them. Notable exceptions are the createPipeline methods, which could benefit significantly from being automatically multithreaded internally.
Queue operations can be viewed this way too, even though queues are supposed to be thought of as fully ordered. Explicit multiqueue could be thought of as a guiding force for how to parallelize work.
Creation of the adapter and device.
GPU.requestAdapter
GPUAdapter.requestDevice
These methods create new objects, so they never create blocking relationships on the queue-timeline. (Device creation is async, so the device is known to be created before these are issued.) They can be modeled as being created/initialized on independent "queues" of work, with implicit queue transfers when they actually get used on a GPUQueue. (See the sketch after the list below.)
GPUDevice.createBuffer
GPUDevice.createTexture
GPUDevice.createSampler
GPUDevice.createQuerySet
GPUDevice.createBindGroupLayout
GPUDevice.createPipelineLayout
GPUDevice.createShaderModule
GPUDevice.importExternalTexture
GPUDevice.createCommandEncoder
GPUDevice.createRenderBundleEncoder
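A minimal sketch of the point above (descriptor values are arbitrary): neither call depends on the other, so an implementation is free to validate/allocate each object independently.

```js
// Neither call depends on the other; each object can be validated/allocated
// on its own independent "queue" of work.
const vertexBuffer = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.VERTEX | GPUBufferUsage.COPY_DST,
});
const colorTexture = device.createTexture({
  size: [256, 256],
  format: 'rgba8unorm',
  usage: GPUTextureUsage.RENDER_ATTACHMENT,
});
// Only once something uses them in a queue operation do they need to be
// "transferred" onto the GPUQueue's timeline.
```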
Destroy methods have no validation but impact validation of future queue operations (see the sketch after the list below).
GPUDevice.destroy
GPUTexture.destroy
GPUQuerySet.destroy
GPUBuffer.destroy
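A sketch of that, assuming a buffer created earlier and a commandBuffer that uses it: destroy() itself cannot fail, but it changes what later queue operations will reject.

```js
// destroy() never generates an error itself...
buffer.destroy();

// ...but a later queue operation that uses the destroyed buffer fails validation.
device.queue.submit([commandBuffer]); // generates a validation error
```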
These methods depend on existing objects - in particular, past objects' validation and fallible allocation must have completed in order to validate these methods.
GPUDevice.createComputePipeline (dep on shader validation)
GPUDevice.createRenderPipeline (dep on shader validation)
GPUDevice.createBindGroup (dep on resource creation)
GPUTexture.createView (dep on texture creation)
GPUCommandEncoder.finish (dep on encoder and all resources used)
GPURenderBundleEncoder.finish (dep on encoder and all resources used)
GPUPipelineBase.getBindGroupLayout (dep on pipeline creation)
GPUDevice.createComputePipelineAsync (dep on shader validation)
GPUDevice.createRenderPipelineAsync (dep on shader validation)
These methods may add dependencies to the finish() operation. They may (only) generate errors due to incorrect encoder state.
endPass (dep on state of encoder)
GPUBuffer.mapAsync - service side state change (dep on buffer creation and previous maps/unmaps)
GPUBuffer.unmap - service side state change (dep on buffer creation and previous maps/unmaps)
GPUBuffer.mapAsync - client side state change (NOT dep on buffer creation)
GPUBuffer.unmap - client side state change (NOT dep on buffer creation)
GPUQueue.submit (dep on command buffer creation)
GPUQueue.writeBuffer (dep on buffer creation)
GPUQueue.writeTexture (dep on texture creation)
GPUQueue.copyExternalImageToTexture (dep on texture creation)
GPUShaderModule.compilationInfo resolve (dep on shader validation)
GPUQueue.onSubmittedWorkDone resolve (dep on validation of all previous ops)
GPUBuffer.mapAsync resolve (dep on buffer creation)
GPUDevice.pushErrorScope (no deps)
GPUDevice.popErrorScope (dep on all validation or allocation inside the scope)
GPUCanvasContext.configure
GPUCanvasContext.unconfigure
GPUCanvasContext.getCurrentTexture
GPUCanvasContext.getPreferredFormat
GPUBuffer.getMappedRange
new GPUOutOfMemoryError
new GPUValidationError
new GPUUncapturedErrorEvent
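To make the GPUBuffer.mapAsync / unmap split above concrete, here is a typical mapping round trip (a sketch; the descriptor is arbitrary). The bookkeeping on the JS object is the client-side state change; actually making memory available (validated against the buffer's creation and previous maps/unmaps) is the service-side part.

```js
const readback = device.createBuffer({
  size: 16,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});

// Client side: the buffer is immediately marked "map pending", regardless of
// whether creation succeeded. Service side: the actual map depends on the
// buffer's creation and any prior maps/unmaps.
await readback.mapAsync(GPUMapMode.READ);

const data = new Uint8Array(readback.getMappedRange()); // client-side only
console.log(data);

// Client side: immediately "unmapped". Service side: a state change ordered
// with the other operations on the device.
readback.unmap();
```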
Error scopes have two major issues:
If you model out the partial ordering of operations (above), error scopes create a really complex graph of dependencies: every error scope depends on all of the stuff inside it. There's nothing inherently wrong with this; implementations could implement it even with internal threading, but implementing it well isn't trivial. This is what created the issue with error handling on shader and pipeline creation (https://github.com/gpuweb/gpuweb/issues/2119). Still, if it were just this, I think implementations could make it work.
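For example (a sketch of the shape of the problem, not of any particular fix; wgslSource is a placeholder): nested scopes around shader and pipeline creation mean the outer popErrorScope() cannot resolve until everything inside, including pipeline compilation, has been validated.

```js
device.pushErrorScope('validation');                 // outer scope
const module = device.createShaderModule({ code: wgslSource });

device.pushErrorScope('validation');                 // inner scope
const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module, entryPoint: 'main' },
});
const pipelineError = await device.popErrorScope();  // depends on pipeline validation...
const shaderError = await device.popErrorScope();    // ...which depends on shader validation
```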
They're global state, 'nuff said. But seriously, they persist across JS tasks, and as a result an await can cause error scopes to capture errors from random other async tasks. Imagine you're doing some kind of async init while continuing rendering of the page - rendering gets caught in scopes that are supposed to be for init. (In theory you should never need to persist scopes across await for this, but the API makes it very easy to do this wrong.)
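A sketch of the hazard (loadImage and encodeFrame are stand-ins for app code): while the init task is suspended at its await, the render loop keeps running on the same device, and its errors land in the init task's scope.

```js
async function init(device) {
  device.pushErrorScope('validation');
  const image = await loadImage('albedo.png');  // suspends; other tasks run now
  // ...createTexture, copyExternalImageToTexture, etc...
  const error = await device.popErrorScope();   // may have captured the render loop's errors
}

function frame(device) {
  // Any validation error generated here while init() is suspended is captured
  // by init()'s scope instead of reaching uncapturederror.
  device.queue.submit([encodeFrame(device)]);
  requestAnimationFrame(() => frame(device));
}
```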
Generally, OOM errors can only be handled on an individual allocation basis. If you don't know which call it came from, you don't know which one to retry. There are situations where you could want to group them, for example if you have two resources with matching sizes and you can downsize if either fails. But in this case it's acceptable to force the application to compose the two errors itself.
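In API terms that is a per-call 'out-of-memory' scope, something like this sketch (hugeSize and the usage are arbitrary):

```js
device.pushErrorScope('out-of-memory');
let buffer = device.createBuffer({ size: hugeSize, usage: GPUBufferUsage.STORAGE });
if (await device.popErrorScope()) {
  // This allocation (and only this one) failed; fall back to something smaller.
  buffer = device.createBuffer({ size: hugeSize / 2, usage: GPUBufferUsage.STORAGE });
}
```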
Validation errors are never useful to handle inline. They only need to be detected for uses like telemetry and CTS. (Library integration tests could also use them like CTS.) Debugging is supposed to go through other mechanisms, like devtools breakpoints, although it's worth considering the printf-debugging style (https://github.com/gpuweb/gpuweb/issues/2478).
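For telemetry-style detection, listening for uncapturederror on the device is enough (reportToTelemetry is a placeholder):

```js
device.addEventListener('uncapturederror', (event) => {
  // event is a GPUUncapturedErrorEvent; event.error is a GPUValidationError,
  // GPUOutOfMemoryError, etc.
  reportToTelemetry(event.error.message);
});
```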
device.withErrorScope(() => { do your stuff here }). This could be really unwieldy for applications though (esp in native).
1 and 2 don't really work for testing; I don't think we can reflect on the graph of async tasks in a way that lets us wrap each task in an async invocation tree in its own error scope (if only they were Rust-style async). 3 is a massive question mark.
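Roughly, the withErrorScope() idea above could be approximated in userspace like this sketch (not a real API; the callback is synchronous so the scope cannot span an await):

```js
// Hypothetical helper approximating the proposed device.withErrorScope().
async function withErrorScope(device, filter, callback) {
  device.pushErrorScope(filter);
  const result = callback();                  // must be synchronous; no awaits inside
  const error = await device.popErrorScope(); // (a robust version would pop even if callback throws)
  if (error) throw error;
  return result;
}

// usage:
// const buffer = await withErrorScope(device, 'validation', () =>
//   device.createBuffer({ size: 256, usage: GPUBufferUsage.UNIFORM }));
```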
This has been considered in the past. To make it compatible with telemetry we could have some kind of flag (the presence of an error callback) that determines whether something is an "uncaptured" error. But it's heavy-handed. There's runtime overhead. You have to remember to pass the callback in every function. Well, not every function; because errors are cascading, you only need to do it in the "leaf" operations. But still, it's no good for testing because if you miss one callback then you get false positives.
This would be great for testing, but it doesn't save us from error ordering. It loosens the restrictions because there is no nesting, but error scopes still have the complex behavior (which I assert above is implementable).
Testing could use onSubmittedWorkDone() plus some timeouts to approximate this (and forward any further uncaptured errors to some higher-level harness failure). This could work as a backup to cooperative testing ("Have every operation report errors separately"), but the performance impact on testing would be abysmal unless we can hold a LOT of GPUDevices in parallel.
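A rough sketch of what that harness loop could look like (the timeout value is arbitrary):

```js
async function runCase(device, testBody) {
  const errors = [];
  const onError = (event) => errors.push(event.error);
  device.addEventListener('uncapturederror', onError);

  await testBody(device);
  await device.queue.onSubmittedWorkDone();
  await new Promise((resolve) => setTimeout(resolve, 50)); // let stragglers surface

  device.removeEventListener('uncapturederror', onError);
  return errors; // non-empty => harness failure
}
```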
Also, I had envisioned uncapturederror as spanning all threads (errors from any thread go to the originating device object), but being able to specify the cutoff point creates a race between threads. Either we allow the race or we make it per-thread. Per-thread is less good for telemetry (but probably fine); maybe we somehow keep both uncapturederror and this new thing.
device.isReady(obj) -> Promise<void> taking a pipeline (maybe other WebGPU objects)
create*PipelineAsync()
The problems are maybe not as intertwined as I thought.
#2119 (createShaderModule) can be solved in several ways without getting rid of the totally-ordered nature of the Device Timeline.
#2085 (create pipeline async) can be solved with the mechanisms above: device.isReady() or create*PipelineAsync().
Await-spanning for out-of-memory is not a huge issue because you're supposed to use it around only one call. We could even enforce that!
Validation error scopes are not meant to be used in production, so the await-spanning problem isn't as much of a concern.
pop();push();, except that if uncapturederror also exists then it prevents errors from ever going there. Hardly seems worth it given it still has await-spanning problems. And it's just as possible to use pop();push(); as a debugging tool.
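For reference, the pop();push(); pattern as a debugging checkpoint looks like this sketch (inside some long-lived 'validation' scope):

```js
// Flush errors generated so far, then immediately reopen the scope.
const error = await device.popErrorScope();
if (error) console.warn('error before this checkpoint:', error.message);
device.pushErrorScope('validation');
```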