Initially discussed by cwfitzgerald and jasperrlz.
See Kangz's Bindless Investigation and Proposal for the base proposal. This document proposes making a few specific decisions to help wgpu be able to implement bindless efficiently without restricting the API too much.
All of us want the API to be as flexible as possible, but these compromises are what we require to maintain performance – a very important goal for bindless implementations.
Note: The exact API shape and naming isn't important here, just the concepts and semantics they provide. We have written these in a Rust style, but it should be translatable to the JS API.
Bind groups are updated on the CPU timeline, and updates apply "immediately". This behaves equivalently to replacing the bind group handle with a new bind group containing the new bindings.
What this means is that all previous uses of the bind group continue to use the old contents, and any new uses see the updated contents.
Being on the CPU timeline and not changing any previous usages are both very important. This ensures that at set_bind_group time, we know exactly the set of resources that a bind group contains, which is important for wgpu's immediate validation and recording.
This is a bit weird in a multithreaded world, but is perfectly workable.
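For concreteness, here is a rough Rust-style sketch of these semantics. Everything in it is a placeholder: update_bindings, the tuple-based update list, and the wgpu-like types are assumptions for illustration, not a final API shape.

```rust
// Hypothetical sketch of the CPU-timeline, "immediate" update semantics.
fn example(
    group: &BindGroup,
    encoder_a: &mut CommandEncoder,
    encoder_b: &mut CommandEncoder,
    new_view: &TextureView,
) {
    // Recorded before the update: this use keeps the bind group's old contents.
    {
        let mut pass = encoder_a.begin_compute_pass(&Default::default());
        pass.set_bind_group(0, group, &[]);
    }

    // Applies "immediately" on the CPU timeline, as if `group` had been replaced
    // by a new bind group whose binding 7 points at `new_view`.
    group.update_bindings(&[(7, BindingResource::TextureView(new_view))]);

    // Recorded after the update: this use sees the new contents, while the use
    // recorded into `encoder_a` continues to see the old ones.
    {
        let mut pass = encoder_b.begin_compute_pass(&Default::default());
        pass.set_bind_group(0, group, &[]);
    }
}
```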
Critical Decision: If this operation was on the queue or command buffer, this would not work, as we would not know the exact resources contained in a bind group at validation time.
If updating happened on the queue timeline:
If updating happened in a command buffer:
With either of these, we would need to move the validation of every single set_bind_group call to submit time, which requires a lot of expensive tracking that wgpu would much rather not do, and would be surprising to the user.
Implementation Note: The implementation of this can vary from API to API; however, we expect that these "indexable" bind groups will be built up out of two major parts: a backend bind group (a descriptor set or heap range) holding the descriptors, and a GPU-visible "validity bitmask" buffer describing which entries are currently valid.
The steps required for update_bindings should be similar to the following:
Note that this means that every update_bindings call will require us to make a shadow copy of all descriptors in the bind group, along with the associated tracking data. While we don't expect update_bindings to require a lot of memory compared to buffers and textures, it is still not a cheap operation.
For APIs with GPU-writable descriptors, or implementations using an indirection-buffer approach, this can be a GPU-side copy_buffer_to_buffer that prefixes the next submission.
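Putting these pieces together, one possible (purely illustrative) internal layout for an indexable bind group could look like the following. All field and type names here are assumptions, not a committed design:

```rust
// Hypothetical internals of an "indexable" bind group.
struct IndexableBindGroup {
    // Backend descriptor storage, e.g. a VkDescriptorSet or a range of a D3D12 heap.
    native: BackendDescriptorSet,
    // CPU-side shadow copy of every descriptor plus the tracking data wgpu needs
    // for validation; each update_bindings call clones this (copy-on-write) so
    // in-flight uses keep seeing the old contents.
    shadow: Arc<Vec<ShadowEntry>>,
    // GPU-visible "validity bitmask" buffer consulted by generated shader code.
    validity: Buffer,
    // Descriptor writes / copy_buffer_to_buffer operations to replay at the start
    // of the next submission (for GPU-writable descriptors or an indirection buffer).
    pending_updates: Vec<PendingUpdate>,
}
```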
With resources that can be used as both read and write, we need to keep track of resource state whenever those resources are used in bindless arrays. This is a considerable problem in practice, as bindless renderers have a large number of resources, and all of them are bound to every single pass, instead of only the resources the pass needs.
By allowing a resource to be shifted into a read-only state, we let the tracking systems worry only about the resource being alive, not its state. This allows bindless arrays to be bound at very low cost.
The set_valid_usages function does the following on the CPU timeline: it restricts the resource's valid usages to the provided usages subset. usages must be a strict subset of the descriptor usages.

While this seems like merely an optimization, in practice it changes the order of magnitude of resources wgpu's barrier generation system needs to deal with and makes validating bindless actually tractable, as it now only needs to deal with the resources actively in the read-write state. As we expect far more resources to be in read-only usages than read-write usages under realistic workloads, we expect the number of resources to track to go from 10,000+ to 100 or fewer.
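A rough signature sketch for this entry point, shown for textures with buffers analogous; the name, receiver, and parameter type are illustrative only:

```rust
impl Texture {
    // Hypothetical: restricts the set of usages this resource may currently be
    // used with. `usages` must be a strict subset of the descriptor usages.
    // Runs on the CPU timeline; existing bind groups containing the resource are
    // invalidated, and the required layout transition prefixes the next submission.
    fn set_valid_usages(&self, usages: TextureUsages) { /* ... */ }
}

// Typical use: keep a texture in a read-only state while it is referenced from
// bindless arrays, and widen its usages again only when it needs to be written.
// texture.set_valid_usages(TextureUsages::TEXTURE_BINDING);
```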
As it is right now, one of the biggest complaints about wgpu's current implementation of bindless and binding_array is that binding the bindless bind group is far too expensive; experience shows that all the bookkeeping required to emit correct barriers is not cheap.
Critical Decision: This is on the CPU timeline for the same reason as update_bindings is.
This will get translated as a transition into or out of a read-only layout that prefixes the next submission. Invalidating the existing bind groups is to prevent submission of command buffers recorded from before the usage layout was changed. Without this, we would need to re-validate all command buffers on submission.
Note: Since bind groups are invalidated when resources have their usages changed or are destroyed, this means that a resource with only read-only usages is by definition always safe to use without any barriers or tracking, if it came from a valid bind group. Resources with read-write usages will still require tracking, even if they are bound as read-only, as conflicts can come from elsewhere.
This is less important to have, but important for multithreaded scenarios. We would like the ability to "mask remove" a binding from a bind group for the duration of a specific render/compute pass, so that the resource can be used in a conflicting way. For example, imagine a shadow map texture being rendered: during the pass, it is "masked out" so that it cannot be read from inside the pass; after the pass is done, it can be sampled again.
This could be implicit or explicit, as long as the functionality is provided. wgpu is also happy to experiment on this point.
However, our proposal is to have an explicit "scoped" version of set_valid_usages that automatically reverts after the render pass or compute pass is done. This allows multiple command encoders on multiple threads to use the same bind group while masking out different resources for their passes.
Personal Opinion: I (cwfitzgerald) am personally in favor of an explicit API where a list of resources to remove is passed into the pass descriptor. This solves a few problems that can be imagined with an "implicit" version:
Bikeshed API example (names are not important as long as the semantics are clear):
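For instance, the excluded resources could be listed directly in the pass descriptor. Every name below (excluded_resources, ExcludedResource, as_binding_resource, and the rest) is a placeholder used only to illustrate the semantics:

```rust
// Hypothetical: `excluded_resources` and `ExcludedResource` do not exist today.
let mut pass = encoder.begin_render_pass(&RenderPassDescriptor {
    label: Some("shadow map pass"),
    color_attachments: &[],
    depth_stencil_attachment: Some(shadow_map_depth_attachment),
    // For the duration of this pass, any binding in `bindless_group` that refers
    // to `shadow_map` behaves as if it were removed. The exclusion automatically
    // reverts when the pass ends, so other encoders on other threads can use
    // `bindless_group` with their own exclusions at the same time.
    excluded_resources: &[ExcludedResource {
        bind_group: &bindless_group,
        resource: shadow_map.as_binding_resource(),
    }],
    ..Default::default()
});
pass.set_bind_group(0, &bindless_group, &[]);
```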
Implementation Note: If a binding array is represented as a bind group and a GPU "validity bitmask buffer" as described above, then an implementation of "exclusion" could make a pass-local copy of only the validity buffer, keeping the descriptor set the same.
If the shader accesses a resource it is not allowed to access, it instead reads either a "null" binding or a scratch buffer/texture. For read-write resources, this can cause data races on the scratch resources. However, these scratch resources are just like any buffer or texture the user can create, where the same races are already possible, so this is no more dangerous than functionality that already exists.
Personal Opinion:
We (cwfitzgerald and jasperrlz) would prefer some level of heterogeneity in bindings.
From working experience, it is nicer to have a bind group that is heterogeneous, at least over texture resource types like 1D/2D/3D/Cubemap/etc. However, I am fine splitting categories like read-only/read-write, and even buffer/texture, into separate binding_arrays.
As such, I (jasperrlz) would probably bikeshed an API like this:
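Roughly along these lines (an illustration of the compromise only; the binding type and the other names here are hypothetical placeholders):

```rust
// Hypothetical BindingType variant: heterogeneous over texture view dimensions,
// but still restricted to read-only sampled textures. Read-write resources and
// buffers would live in their own, separately declared binding_arrays.
BindGroupLayoutEntry {
    binding: 0,
    visibility: ShaderStages::FRAGMENT,
    ty: BindingType::SampledTextureAnyDimension { multisampled: false }, // hypothetical
    count: NonZeroU32::new(MAX_SAMPLED_TEXTURES),
}
```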
I believe this provides a nice compromise between user friendliness and implementation complexity. If it is not significantly more complicated or expensive to support full heterogeneity (across buffers and textures) or to combine read-only and read-write usages in one array, then we can add that, but I would treat those more like "stretch goals".
How these features work with min_binding_size, sampling types, and other features specified in the BindGroupLayout is a bit unclear.
Implementation Note: This would require a few more states in the bitmask buffer to determine whether the type of resource is valid. Thankfully these types are mutually exclusive, so this can be tracked as an enum rather than a bitfield.
To help debug user applications, it would be nice to have some basic "GPU-enabled" logging that warns the user when they are accessing a scratch resource, along with the reason why (wrong binding type, wrong resource access type). Whether these more verbose prints need to be "enabled" through a feature/pipeline flag, or whether they are always on, is still TBD.
There are a bunch of constraints for the design of bindless in addition to the obvious implementability on target APIs and security/portability constraints of WebGPU:
Applications using bindless with native APIs usually create a single descriptor array that they update each frame (or more often). For example, when objects are streamed in and out of a scene, their textures need to be added to the descriptor array and later removed from it. They do this without racing with the GPU by tracking which indices in the array are in use and writing only to unused indices.
Out of scope is how to mutate non-bindless bindgroups. (Corentin's opinion is that we shouldn't allow that and expose something else, at least in "base, non-bindless WebGPU")
This section suggests having bindgroup mutations happen on the content timeline, but not affecting previous calls to set_bind_group:

The reasoning is that if the mutation happens on the queue or device timeline, it will not be possible to validate or generate barriers as the GPUCommandEncoder is being encoded. Doing as much work as possible during encoding is an important constraint from wgpu-rs.
Supporting this case would require making a copy of the bind group with some elements updated because two different versions of the array are seen in the same synchronization scope.
Choice to make: Decide which timeline the mutation happens on. If it happens on the content timeline, the semantic is akin to copy-on-write. If it happens on the device timeline the bind group is mutated (or copied when needed to avoid races with the GPU).
Corentin's opinion: Mutation on the content timeline makes it very easy to make WebGPU do a copy of the bind group, which we really want to avoid. Instead we should put mutation on the device timeline and find another way to minimize the overhead of validation and memory barriers. Sections below should help with that. (queue timeline updates are not supported natively on base bindless D3D12/Vulkan and would also require copies).
In WebGPU we cannot rely on the application doing that correctly, as a data race could be a security and portability issue. There is a spectrum of options where the responsibility for allocating indices ranges from 100% the application's to 100% WebGPU's.
On one end, if the application can write to any index at any time, WebGPU would have to make a copy of the descriptor array with that one index modified.
On the other end, the application can never choose an index and can only ask WebGPU to put a descriptor in the array and return the index where it put it (or an error if there was none available).
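Sketched as hypothetical methods on a bindless bind group (the names update and put_somewhere follow the identifiers used elsewhere in this document, but the exact shapes are placeholders):

```rust
// Application-chosen index: WebGPU must copy the descriptor array if index `i`
// is still in use by the GPU.
bindless_group.update(i, entry);

// WebGPU-chosen index: the implementation picks a free slot and returns it,
// or returns an error if no slot is available.
let index = bindless_group.put_somewhere(entry); // Result<u32, Error>
```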
In all cases WebGPU needs to track for each index if or when it will be unused, either to do a copy when overwriting an index still in use, or to do the allocation itself.
Implementation notes: to efficiently track when a copy is necessary, we need to know for each individual entry if updating it will cause a race with the GPU. Tracking for each entry when it was last in use would be costly because all entries would need to be updated on each GPUQueue.submit.
Instead we can track for each entry when it was last removed: it can be overwritten when the GPU execution is past that point. This requires updates only to entries that are modified (and on the creation of the bind group).
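A sketch of that tracking; the serial-number scheme and every name here are assumptions about one possible implementation, not a spec requirement:

```rust
// Hypothetical per-entry tracking data.
struct Entry {
    // Serial of the last queue submission that could still reference the
    // contents this slot held before they were last removed/overwritten.
    last_removed_serial: u64,
}

impl BindlessBindGroup {
    // Called when the application removes/clears the binding at `index`.
    fn remove_entry(&mut self, index: usize) {
        self.entries[index].last_removed_serial = self.last_submitted_serial();
    }

    // Called when the application writes a new descriptor into `index`.
    fn write_entry(&mut self, index: usize, new_entry: BindGroupEntry) {
        let still_in_flight =
            self.entries[index].last_removed_serial > self.gpu_completed_serial();
        if still_in_flight {
            // The GPU may still be using the old contents: copy the descriptor
            // array (or defer the write) instead of updating in place.
        }
        // ... write `new_entry` into the (possibly copied) descriptor array ...
        let _ = new_entry;
    }
}
```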
Choice to make: How do we let applications update entries in the bind group? It can be one of the approaches above, a combination of them, or another idea/variant. (For example, when running out of free indices, putSomewhere could be allowed to automatically grow the bind group, or stall, or …)
Corentin's opinion: If we can get away with it, having WebGPU manage the allocation would be the easiest way to guide developers. However some developers could require more control, for example to use a subarray in a special way that requires descriptors to be contiguous. If that's the case we could have a mix of both approaches, with a helper "index allocator" in WebGPU but letting applications be explicit if they need to.
Choice to make: The examples above have an API to update a single binding. An alternative is an API to update multiple ones to encourage bulk updates. However, WebGPU can buffer updates and apply them e.g. on the next GPUQueue.submit. Is that palatable given how expensive that call already is? Is one of the options more of a pit of success somehow?
Corentin's opinion: For Dawn we'd likely buffer updates anyway. Our GPUQueue.submit is a lot heavier than wgpu's, so it would be easy to make the buffered bind group update happen in parallel with command encoding. It's likely some work in wgpu can also be parallelized with the bind group update, but I don't know which.
As described in the "Read Only Resources" section above, we need to minimize the amount of tracking for memory barriers and validation for usage scope rules. The key idea in that section is that instead of tracking the validity of resources and their usages, we can track the validity of the bindless bind groups (and update it over time, instead of computing it for each pass).
Note: Most of the section below also contains implementation details of the state tracking to show that it can be made efficiently.
We want to validate that bindless bindgroups are valid for a pass. We can define this as meaning that each entry containing a resource has that resource's usage pinned to the expected usage (and that the resource is not destroyed).
A resource can have its usage pinned, with explicit pinning and unpinning operations:
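For example (a sketch of the concept only; the names, receiver, and parameter types are not a proposal):

```rust
impl Texture {
    // Hypothetical: pin the resource to `usage`, which can be any single usage
    // or a combination of read-only usages. The barriers needed to reach that
    // usage are queued for the next submission.
    fn pin_usage(&self, usage: TextureUsages) { /* ... */ }

    // Hypothetical: remove the pin, making the full set of descriptor usages
    // available again (and making the resource subject to normal tracking).
    fn unpin_usage(&self) { /* ... */ }
}
```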
Resources can be pinned to any single usage, or to a combination of read-only usages. When a bindless bindgroup containing read-only resources is used, we know that the proper memory barriers have been done and that the resources are alive. When it contains writable resources (so either storage buffers or storage textures), we can record on the bindgroup itself that memory synchronization needs to happen.
Validation of the bindless bindgroup: To efficiently update the validity of the bindless bindgroup, each resource keeps a list of the bindless bindgroups it is part of (and at which index). This list will in practice almost always be small. Bindless bindgroups keep a count of invalid entries that is updated when pinning/unpinning happens. Of course, resources have their usage unpinned when .destroy() is called.
Memory barriers: When pinning a resource, the memory barriers necessary to use the resource with that usage are queued for the next submission. When a bindless bindgroup with storage buffers or textures is used, a "UAV barrier" or equivalent is queued to happen before the next usage scope (this could require command buffer splitting in wgpu to be able to insert them).
Interaction of pinning with existing validation: WebGPU already validates that all resources used in a submit are alive. For usage pinning, that validation is updated to check that the resources are alive, and that their available usage (pinned usage if there is one, or their full usage otherwise) is what's used in the pass. For Dawn this means changing a hashset<Resource> into a hashmap<Resource, Usage>, and a state == Alive check into availableUsage & submitUsages == submitUsages, which is minimal overhead.
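In Rust-flavoured pseudocode (Dawn's actual code is C++; this only shows the shape of the change, and the Usages stand-in type is an assumption):

```rust
type Usages = u32; // stand-in for a usage bitflags type

// Before: per-submit validation only checked liveness (state == Alive).
// After: it also checks that the available usage covers the submitted usages.
fn resource_is_valid_for_submit(available_usage: Usages, submit_usages: Usages) -> bool {
    // `available_usage` is the pinned usage if the resource is pinned,
    // otherwise its full descriptor usage.
    (available_usage & submit_usages) == submit_usages
}
```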
Choice to make: Could we make pinning / unpinning instead toggle the visibility of the index in the shader automatically? This is a footgun because if an application forgets to pin it would get a black texture with no explanation. Maybe this could be palatable if there is some decent debugging utility that says that an access failed and why.
TODO(Corentin): Check that this won't cause pessimizing of the VkImageLayout in Vulkan or D3D12 Enhanced Barriers for sampled textures that are also storage read only.
Note: Many names could be used for the concept of "pinning" above: locking usages, making a usage persistent, making a resource resident with a usage, freezing the resource with a usage, etc. But most of them carry a notion that the operation cannot be undone, or are names of different concepts in GPU programming. Pinning / unpinning represents a more temporary constraint. Similarly named concepts like Rust's Pin<> or pinning virtual memory to a specific physical memory location are remote enough to not cause confusion.
This is the "usage pinning" equivalent to the proposal in the section above.
TODO(Corentin): Make a proposal, discuss what to do outside of passes (for copies).
As noted above, it is extremely important for applications that a binding_array of heterogeneous sampled texture types be available. This seems possible to support in all basic tiers of bindless in the backend APIs. However, base bindless Vulkan requires knowing which VkDescriptorType will be used, which prevents mixing sampled and storage textures, or textures and buffers.
We should still aim to support full heterogeneity in the future, as it is possible with D3D12 / Metal base bindless support, and the Vulkan ecosystem is moving towards it (with a small zoo of extensions). A full heterogeneous binding_array API could look like this:
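The following is an illustrative sketch only: the layout entry type, the constant, and the WGSL shown in the comment are all hypothetical placeholders rather than existing API.

```rust
// Host side: a single hypothetical entry whose array may mix buffers and textures.
BindGroupLayoutEntry {
    binding: 0,
    visibility: ShaderStages::all(),
    ty: BindingType::BindlessArray, // hypothetical fully heterogeneous binding type
    count: NonZeroU32::new(MAX_BINDLESS_RESOURCES),
}

// Shader side, as hypothetical WGSL (kept in a comment here):
//   @group(0) @binding(0) var resources: binding_array; // no template argument
//   // The type of each access would be deduced from how the entry is used:
//   let color = textureSampleLevel(resources[material.base_color], linear_sampler, uv, 0.0);
```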
Choice to make: Do we require some tag for only semi-heterogeneous binding arrays, like binding_array<sampled_texture> or binding_array<storage_buffer>? This would make validation of the T in the new builtins very clear: either it is some kind of "sampled texture" or it isn't. Likewise, matching with the BGLLayout would be clear.
Corentin's opinion: (weak) Given we want to expose full heterogeneous bindless eventually, we can go directly to a template-less binding_array. Validation in the shader would check for uses of incompatible types of T. Likewise, the matching with the layout would use whatever type was deduced in the shader. It seems we can take this shortcut with good enough error messages.
It's not clear how useful it will be to put uniform buffers in the binding array, but storage buffers are important to support as well. At the moment storage buffers are explicitly typed in WGSL, but if an application uses bindless storage buffers, it is reasonable to expect that the type will heavily depend on the code path / index used to query the buffer. For that reason we need a way to support creating differently typed views of a storage buffer (at an offset).
This is useful independently of bindless so we should make a separate proposal/improvement, then integrate it in bindless.
Implementation notes: On D3D12, structured buffers cannot be used to implement WGSL structs + variable-length arrays, so all implementations must have a way to convert typeful WGSL into loads and stores on a RWByteAddressBuffer. The same code transformation should allow implementing the casting of storage buffers above. There are some questions about handling OOB, though.
Looking for validation of the direction from wgpu folks.
GPUBindGroup.update(i, GPUBindGroupEntry), only for bindless bindgroups.
binding_array with no template arguments. Validation that all things taken from a binding array are the same "type" will be done while compiling the shader.