Initially discussed by cwfitzgerald and jasperrlz.
See Kangz's Bindless Investigation and Proposal for the base proposal. This document proposes making a few specific decisions to help wgpu be able to implement bindless efficiently without restricting the API too much.
All of us want the API to be as flexible as possible, but these compromises are what we require to maintain performance – a very important goal for bindless implementations.
Note: The exact API shape and naming isn't important here, just the concepts and semantics they provide. We have written these in a Rust style, but it should be translatable to the JS API.
Bind groups are updated on the CPU timeline, and updates apply "immediately". This behaves equivalently to replacing the bind group handle with a new bind group containing the new bindings.
What this means is that all previous uses of the bind group continue to use the old contents, and any new uses see the updated contents.
Being on the CPU timeline and not changing any previous usages are both very important. This ensures that at set_bind_group time, we know exactly the set of resources that a bind group contains, which is important for wgpu's immediate validation and recording.
This is a bit weird in a multithreaded world, but is perfectly workable.
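For concreteness, here is a rough Rust-style sketch of these semantics. Everything in it is a placeholder: update_bindings, the tuple-based update list, and the wgpu-like types are assumptions for illustration, not a final API shape.

```rust
// Hypothetical sketch of the CPU-timeline, "immediate" update semantics.
fn example(
    group: &BindGroup,
    encoder_a: &mut CommandEncoder,
    encoder_b: &mut CommandEncoder,
    new_view: &TextureView,
) {
    // Recorded before the update: this use keeps the bind group's old contents.
    {
        let mut pass = encoder_a.begin_compute_pass(&Default::default());
        pass.set_bind_group(0, group, &[]);
    }

    // Applies "immediately" on the CPU timeline, as if `group` had been replaced
    // by a new bind group whose binding 7 points at `new_view`.
    group.update_bindings(&[(7, BindingResource::TextureView(new_view))]);

    // Recorded after the update: this use sees the new contents, while the use
    // recorded into `encoder_a` continues to see the old ones.
    {
        let mut pass = encoder_b.begin_compute_pass(&Default::default());
        pass.set_bind_group(0, group, &[]);
    }
}
```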
Critical Decision: If this operation was on the queue or command buffer, this would not work, as we would not know the exact resources contained in a bind group at validation time.
If updating happened on the queue timeline:
If updating happened in a command buffer:
With either of these, we would need to move the validation of every single set_bind_group call to submit time, which requires a lot of expensive tracking that wgpu would much rather not do, and would be surprising to the user.
Implementation Note: The implementation of this can vary from API to API; however, we expect that these "indexable" bind groups will be built up out of two major parts: a backend bind group (a descriptor set or heap range) holding the descriptors, and a GPU-visible "validity bitmask" buffer describing which entries are currently valid.
The steps required for update_bindings should be similar to the following:
Note that this means that every update_bindings call will require us to make a shadow copy of all descriptors in the bind group, along with the associated tracking data. While we don't expect update_bindings to require a lot of memory compared to buffers and textures, it is still not a cheap operation.
For APIs with GPU-writable descriptors, or implementations using an indirection-buffer approach, this can be a GPU-side copy_buffer_to_buffer that prefixes the next submission.
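Putting these pieces together, one possible (purely illustrative) internal layout for an indexable bind group could look like the following. All field and type names here are assumptions, not a committed design:

```rust
// Hypothetical internals of an "indexable" bind group.
struct IndexableBindGroup {
    // Backend descriptor storage, e.g. a VkDescriptorSet or a range of a D3D12 heap.
    native: BackendDescriptorSet,
    // CPU-side shadow copy of every descriptor plus the tracking data wgpu needs
    // for validation; each update_bindings call clones this (copy-on-write) so
    // in-flight uses keep seeing the old contents.
    shadow: Arc<Vec<ShadowEntry>>,
    // GPU-visible "validity bitmask" buffer consulted by generated shader code.
    validity: Buffer,
    // Descriptor writes / copy_buffer_to_buffer operations to replay at the start
    // of the next submission (for GPU-writable descriptors or an indirection buffer).
    pending_updates: Vec<PendingUpdate>,
}
```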
With resources that can be used as both read and write, we need to keep track of resource state whenever those resources are used in bindless arrays. This is a considerable problem in practice, as bindless renderers have a large number of resources, and all of them are bound to every single pass, instead of only the resources the pass needs.
By allowing a resource to be shifted into a read-only state, we let the tracking systems worry only about the resource being alive, not its state. This allows bindless arrays to be bound at very low cost.
The set_valid_usages function does the following on the CPU timeline: it restricts the resource's valid usages to the provided usages subset. usages must be a strict subset of the descriptor usages.

While this seems like merely an optimization, in practice it changes the order of magnitude of resources wgpu's barrier generation system needs to deal with and makes validating bindless actually tractable, as it now only needs to deal with the resources actively in the read-write state. As we expect far more resources to be in read-only usages than read-write usages under realistic workloads, we expect the number of resources to track to go from 10,000+ to 100 or fewer.
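A rough signature sketch for this entry point, shown for textures with buffers analogous; the name, receiver, and parameter type are illustrative only:

```rust
impl Texture {
    // Hypothetical: restricts the set of usages this resource may currently be
    // used with. `usages` must be a strict subset of the descriptor usages.
    // Runs on the CPU timeline; existing bind groups containing the resource are
    // invalidated, and the required layout transition prefixes the next submission.
    fn set_valid_usages(&self, usages: TextureUsages) { /* ... */ }
}

// Typical use: keep a texture in a read-only state while it is referenced from
// bindless arrays, and widen its usages again only when it needs to be written.
// texture.set_valid_usages(TextureUsages::TEXTURE_BINDING);
```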
As it is right now, one of the biggest complaints about wgpu's current implementation of bindless and binding_array is that binding the bindless bind group is far too expensive; experience shows that all the bookkeeping required to emit correct barriers is not cheap.
Critical Decision: This is on the CPU timeline for the same reason as update_bindings is.
This will get translated as a transition into or out of a read-only layout that prefixes the next submission. Invalidating the existing bind groups is to prevent submission of command buffers recorded from before the usage layout was changed. Without this, we would need to re-validate all command buffers on submission.
Note: Since bind groups are invalidated when resources have their usages changed or are destroyed, this means that a resource with only read-only usages is by definition always safe to use without any barriers or tracking, if it came from a valid bind group. Resources with read-write usages will still require tracking, even if they are bound as read-only, as conflicts can come from elsewhere.
This is less important to have, but important for multithreaded scenarios. We would like the ability to "mask remove" a binding from a bind group for the duration of a specific render/compute pass, so that the resource can be used in a conflicting way. For example, imagine a shadow map texture being rendered: during the pass, it is "masked out" so that it cannot be read from inside the pass; after the pass is done, it can be sampled again.
This could be implicit or explicit, as long as the functionality is provided. wgpu is also happy to experiment on this point.
However, our proposal is to have an explicit "scoped" version of set_valid_usages that automatically reverts after the render pass or compute pass is done. This allows multiple command encoders on multiple threads to use the same bind group while masking out different resources for their passes.
Personal Opinion: I (cwfitzgerald) am personally in favor of an explicit API where a list of resources to remove is passed into the pass descriptor. This solves a few problems that can be imagined with an "implicit" version:
Bikeshed API example (names are not important as long as the semantics are clear):
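For instance, the excluded resources could be listed directly in the pass descriptor. Every name below (excluded_resources, ExcludedResource, as_binding_resource, and the rest) is a placeholder used only to illustrate the semantics:

```rust
// Hypothetical: `excluded_resources` and `ExcludedResource` do not exist today.
let mut pass = encoder.begin_render_pass(&RenderPassDescriptor {
    label: Some("shadow map pass"),
    color_attachments: &[],
    depth_stencil_attachment: Some(shadow_map_depth_attachment),
    // For the duration of this pass, any binding in `bindless_group` that refers
    // to `shadow_map` behaves as if it were removed. The exclusion automatically
    // reverts when the pass ends, so other encoders on other threads can use
    // `bindless_group` with their own exclusions at the same time.
    excluded_resources: &[ExcludedResource {
        bind_group: &bindless_group,
        resource: shadow_map.as_binding_resource(),
    }],
    ..Default::default()
});
pass.set_bind_group(0, &bindless_group, &[]);
```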
Implementation Note: If a binding array is represented as a bind group and a GPU "validity bitmask buffer" as described above, then an implementation of "exclusion" could make a pass-local copy of only the validity buffer, keeping the descriptor set the same.
If the shader accesses a resource it is not allowed to access, it instead reads either a "null" binding or a scratch buffer/texture. For read-write resources, this can cause data races on the scratch resources. However, these scratch resources are just like any buffer or texture the user can create, where the same races are already possible, so this is no more dangerous than functionality that already exists.
Personal Opinion:
We (cwfitzgerald and jasperrlz) would prefer some level of heterogeneity in bindings.
From working experience, it is nicer to have a bind group that is heterogeneous, at least over texture resource types like 1D/2D/3D/Cubemap/etc. However, I am fine splitting categories like read-only/read-write, and even buffer/texture, into separate binding_arrays.
As such, I (jasperrlz) would probably bikeshed an API like this:
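Roughly along these lines (an illustration of the compromise only; the binding type and the other names here are hypothetical placeholders):

```rust
// Hypothetical BindingType variant: heterogeneous over texture view dimensions,
// but still restricted to read-only sampled textures. Read-write resources and
// buffers would live in their own, separately declared binding_arrays.
BindGroupLayoutEntry {
    binding: 0,
    visibility: ShaderStages::FRAGMENT,
    ty: BindingType::SampledTextureAnyDimension { multisampled: false }, // hypothetical
    count: NonZeroU32::new(MAX_SAMPLED_TEXTURES),
}
```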
I believe this provides a nice compromise between user friendliness and implementation complexity. If it is not significantly more complicated or expensive to support full heterogeneity (across buffers and textures) or to combine read-only and read-write usages in one array, then we can add that, but I would treat those more like "stretch goals".
How these features work with min_binding_size, sampling types, and other features specified in the BindGroupLayout is a bit unclear.
Implementation Note: This would require a few more states in the bitmask buffer to determine whether the type of resource is valid. Thankfully these types are mutually exclusive, so this can be tracked as an enum rather than a bitfield.
To help debug user applications, it would be nice to have some basic "GPU-enabled" logging that warns the user when they are accessing a scratch resource, along with the reason why (wrong binding type, wrong resource access type). Whether these more verbose prints need to be "enabled" through a feature/pipeline flag, or whether they are always on, is still TBD.
There are a bunch of constraints for the design of bindless in addition to the obvious implementability on target APIs and security/portability constraints of WebGPU:
Applications using bindless with native APIs usually create a single descriptor array that they update each frame (or more often). For example, when objects are streamed in and out of a scene, their textures need to be added to the descriptor array and later removed from it. They do this without racing with the GPU by tracking which indices in the array are in use and writing only to unused indices.
Out of scope is how to mutate non-bindless bindgroups. (Corentin's opinion is that we shouldn't allow that and expose something else, at least in "base, non-bindless WebGPU")
This section suggests having bindgroup mutations happen on the content timeline, but not affecting previous calls to set_bind_group:

The reasoning is that if the mutation happens on the queue or device timeline, it will not be possible to validate or generate barriers as the GPUCommandEncoder is being encoded. Doing as much work as possible during encoding is an important constraint from wgpu-rs.
Supporting this case would require making a copy of the bind group with some elements updated because two different versions of the array are seen in the same synchronization scope.
Choice to make: Decide which timeline the mutation happens on. If it happens on the content timeline, the semantic is akin to copy-on-write. If it happens on the device timeline the bind group is mutated (or copied when needed to avoid races with the GPU).
Corentin's opinion: Mutation on the content timeline makes it very easy to make WebGPU do a copy of the bind group, which we really want to avoid. Instead we should put mutation on the device timeline and find another way to minimize the overhead of validation and memory barriers. Sections below should help with that. (queue timeline updates are not supported natively on base bindless D3D12/Vulkan and would also require copies).
In WebGPU we cannot rely on the application doing that correctly, as a data race could be a security and portability issue. There is a spectrum of options where the responsibility for allocating indices ranges from 100% the application's to 100% WebGPU's.
On one end, if the application can write to any index at any time, WebGPU would have to make a copy of the descriptor array with that one index modified.
On the other end, the application can never choose an index and can only ask WebGPU to put a descriptor in the array and return the index where it put it (or an error if there was none available).
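Sketched as hypothetical methods on a bindless bind group (the names update and put_somewhere follow the identifiers used elsewhere in this document, but the exact shapes are placeholders):

```rust
// Application-chosen index: WebGPU must copy the descriptor array if index `i`
// is still in use by the GPU.
bindless_group.update(i, entry);

// WebGPU-chosen index: the implementation picks a free slot and returns it,
// or returns an error if no slot is available.
let index = bindless_group.put_somewhere(entry); // Result<u32, Error>
```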
In all cases WebGPU needs to track for each index if or when it will be unused, either to do a copy when overwriting an index still in use, or to do the allocation itself.
Implementation notes: to efficiently track when a copy is necessary, we need to know for each individual entry if updating it will cause a race with the GPU. Tracking for each entry when it was last in use would be costly because all entries would need to be updated on each GPUQueue.submit.
Instead we can track for each entry when it was last removed: it can be overwritten when the GPU execution is past that point. This requires updates only to entries that are modified (and on the creation of the bind group).
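A sketch of that tracking; the serial-number scheme and every name here are assumptions about one possible implementation, not a spec requirement:

```rust
// Hypothetical per-entry tracking data.
struct Entry {
    // Serial of the last queue submission that could still reference the
    // contents this slot held before they were last removed/overwritten.
    last_removed_serial: u64,
}

impl BindlessBindGroup {
    // Called when the application removes/clears the binding at `index`.
    fn remove_entry(&mut self, index: usize) {
        self.entries[index].last_removed_serial = self.last_submitted_serial();
    }

    // Called when the application writes a new descriptor into `index`.
    fn write_entry(&mut self, index: usize, new_entry: BindGroupEntry) {
        let still_in_flight =
            self.entries[index].last_removed_serial > self.gpu_completed_serial();
        if still_in_flight {
            // The GPU may still be using the old contents: copy the descriptor
            // array (or defer the write) instead of updating in place.
        }
        // ... write `new_entry` into the (possibly copied) descriptor array ...
        let _ = new_entry;
    }
}
```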
Choice to make: How do we let applications update entries in the bind group? It can be one of the approaches above, a combination of them, or another idea/variant. (For example, when running out of free indices, putSomewhere could be allowed to automatically grow the bind group, or stall, or …)
Corentin's opinion: If we can get away with it, having WebGPU manage the allocation would be the easiest way to guide developers. However some developers could require more control, for example to use a subarray in a special way that requires descriptors to be contiguous. If that's the case we could have a mix of both approaches, with a helper "index allocator" in WebGPU but letting applications be explicit if they need to.
Choice to make: The examples above have an API to update a single binding. An alternative is an API to update multiple ones to encourage bulk updates. However, WebGPU can buffer updates and apply them e.g. on the next GPUQueue.submit. Is that palatable given how expensive that call already is? Is one of the options more of a pit of success somehow?
Corentin's opinion: For Dawn we'd likely buffer updates anyway. Our GPUQueue.submit is a lot heavier than wgpu's, so it would be easy to make the buffered bind group update happen in parallel with command encoding. It's likely some work in wgpu can also be parallelized with the bind group update, but I don't know which.
As described in the "Read Only Resources" section above, we need to minimize the amount of tracking for memory barriers and validation for usage scope rules. The key idea in that section is that instead of tracking the validity of resources and their usages, we can track the validity of the bindless bind groups (and update it over time, instead of computing it for each pass).
Note: Most of the section below also contains implementation details of the state tracking to show that it can be made efficiently.
We want to validate that bindless bindgroups are valid for a pass. We can define this as meaning that each entry containing a resource has that resource's usage pinned to the expected usage (and that the resource is not destroyed).
A resource can have its usage pinned, with explicit pinning and unpinning operations:
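For example (a sketch of the concept only; the names, receiver, and parameter types are not a proposal):

```rust
impl Texture {
    // Hypothetical: pin the resource to `usage`, which can be any single usage
    // or a combination of read-only usages. The barriers needed to reach that
    // usage are queued for the next submission.
    fn pin_usage(&self, usage: TextureUsages) { /* ... */ }

    // Hypothetical: remove the pin, making the full set of descriptor usages
    // available again (and making the resource subject to normal tracking).
    fn unpin_usage(&self) { /* ... */ }
}
```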
Resources can be pinned to any single usage, or to a combination of read-only usages. When a bindless bindgroup containing read-only resources is used, we know that the proper memory barriers have been done and that the resources are alive. When it contains writable resources (so either storage buffers or storage textures), we can record on the bindgroup itself that memory synchronization needs to happen.
Validation of the bindless bindgroup: To efficiently update the validity of the bindless bindgroup, each resource keeps a list of the bindless bindgroups it is part of (and at which index). This list will in practice almost always be small. Bindless bindgroups keep a count of invalid entries that is updated when pinning/unpinning happens. Of course, resources have their usage unpinned when .destroy() is called.
Memory barriers: When pinning a resource, the memory barriers necessary to use the resource with that usage are queued for the next submission. When a bindless bindgroup with storage buffers or textures is used, a "UAV barrier" or equivalent is queued to happen before the next usage scope (this could require command buffer splitting in wgpu to be able to insert them).
Interaction of pinning with existing validation: WebGPU already validates that all resources used in a submit are alive. For usage pinning, that validation is updated to check that the resources are alive, and that their available usage (pinned usage if there is one, or their full usage otherwise) is what's used in the pass. For Dawn this means changing a hashset<Resource> into a hashmap<Resource, Usage>, and a state == Alive check into availableUsage & submitUsages == submitUsages, which is minimal overhead.
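In Rust-flavoured pseudocode (Dawn's actual code is C++; this only shows the shape of the change, and the Usages stand-in type is an assumption):

```rust
type Usages = u32; // stand-in for a usage bitflags type

// Before: per-submit validation only checked liveness (state == Alive).
// After: it also checks that the available usage covers the submitted usages.
fn resource_is_valid_for_submit(available_usage: Usages, submit_usages: Usages) -> bool {
    // `available_usage` is the pinned usage if the resource is pinned,
    // otherwise its full descriptor usage.
    (available_usage & submit_usages) == submit_usages
}
```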
Choice to make: Could we make pinning / unpinning instead toggle the visibility of the index in the shader automatically? This is a footgun because if an application forgets to pin it would get a black texture with no explanation. Maybe this could be palatable if there is some decent debugging utility that says that an access failed and why.
TODO(Corentin): Check that this won't cause pessimizing of the VkImageLayout in Vulkan or D3D12 Enhanced Barriers for sampled textures that are also storage read only.
Note: Many names could be used for the concept of "pinning" above: locking usages, making a usage persistent, making a resource resident with a usage, freezing the resource with a usage, etc. But most of them carry a notion that the operation cannot be undone, or are names of different concepts in GPU programming. Pinning / unpinning represents a more temporary constraint. Similarly named concepts like Rust's Pin<> or pinning virtual memory to a specific physical memory location are remote enough to not cause confusion.
This is the "usage pinning" equivalent to the proposal in the section above.
TODO(Corentin): Make a proposal, discuss what to do outside of passes (for copies).
As noted above, it is extremely important for applications that a binding_array of heterogeneous sampled texture types be available. This seems possible to support in all basic tiers of bindless in the backend APIs. However, base bindless Vulkan requires knowing which VkDescriptorType will be used, which prevents mixing sampled and storage textures, or textures and buffers.
We should still aim to support full heterogeneity in the future, as it is possible with D3D12 / Metal base bindless support, and the Vulkan ecosystem is moving towards it (with a small zoo of extensions). A full heterogeneous binding_array API could look like this:
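The following is an illustrative sketch only: the layout entry type, the constant, and the WGSL shown in the comment are all hypothetical placeholders rather than existing API.

```rust
// Host side: a single hypothetical entry whose array may mix buffers and textures.
BindGroupLayoutEntry {
    binding: 0,
    visibility: ShaderStages::all(),
    ty: BindingType::BindlessArray, // hypothetical fully heterogeneous binding type
    count: NonZeroU32::new(MAX_BINDLESS_RESOURCES),
}

// Shader side, as hypothetical WGSL (kept in a comment here):
//   @group(0) @binding(0) var resources: binding_array; // no template argument
//   // The type of each access would be deduced from how the entry is used:
//   let color = textureSampleLevel(resources[material.base_color], linear_sampler, uv, 0.0);
```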
Choice to make: Do we require some tag for only semi-heterogeneous binding arrays, like binding_array<sampled_texture> or binding_array<storage_buffer>? This would make validation of the T in the new builtins very clear: either it is some kind of "sampled texture" or it isn't. Likewise, matching with the BGLLayout would be clear.
Corentin's opinion: (weak) Given we want to expose full heterogeneous bindless eventually, we can go directly to a template-less binding_array. Validation in the shader would check for uses of incompatible types of T. Likewise, the matching with the layout would use whatever type was deduced in the shader. It seems we can take this shortcut with good enough error messages.
It's not clear how useful it will be to put uniform buffers in the binding array, but storage buffers are important to support as well. At the moment storage buffers are explicitly typed in WGSL, but if an application uses bindless storage buffers, it is reasonable to expect that the type will heavily depend on the code path / index used to query the buffer. For that reason we need a way to support creating differently typed views of a storage buffer (at an offset).
This is useful independently of bindless so we should make a separate proposal/improvement, then integrate it in bindless.
Implementation notes: On D3D12, structured buffers cannot be used to implement WGSL structs + variable-length arrays, so all implementations must have a way to convert typeful WGSL into loads and stores on a RWByteAddressBuffer. The same code transformation should allow implementing the casting of storage buffers above. There are some questions about handling OOB, though.
Looking for validation of the direction from wgpu folks.
GPUBindGroup.update(i, GPUBindGroupEntry), only for bindless bindgroups.
binding_array with no template arguments. Validation that all things taken from a binding array are the same "type" will be done while compiling the shader.