Here is a new investigation for bindless with a proposal for WebGPU at the end. @litherum's original investigation is great, read it as well! There was also @cwfitzgerald's presentation at the previous F2F. Another good resource is the series of posts from Traverse Research on bindless.
I tried really hard to find out if we could support heterogeneous bindless in WebGPU directly, without the intermediate step of per-binding-type indexing, but didn't succeed.
Note that for this extension to be useful for storage textures we should have support for format-less storage textures.
In the current WebGPU binding model, shaders have access to a small set of resources that are in the `GPUBindGroups` currently bound at the time `draw*` or `dispatch*` is called. The CPU-side code of the application has to know all the resources a shader might need and bind them at command-recording time. The number of resources bound while executing a shader also has to stay under limits that can be fairly low.
This model was chosen for the first version of WebGPU because WebGPU without extensions needs to run on a variety of hardware, including hardware that has a fixed number of "registers" configured to access resources. In this "bindful" model, `setBindGroup` sets the internal state of the registers, and shader assembly uses a register number to name a resource. Over the last decade, GPU architectures have shifted to a "bindless" model where resources are instead named in shader assembly by a GPU memory pointer, or an index into a table somewhere in memory. This lets shaders index a vastly increased number of resources and has enabled many new kinds of GPU algorithms.
In more recent rendering engines many algorithms execute on the GPU that need global information about the scene or the objects contained in it. They need access to each texture potentially used by an object in case they need to process that object. This blows way past the resource limits in the bindful model. Examples of algorithms that do this are:
In some cases texture atlasing can be used as a (painful) workaround, but it can become extremely messy, inefficient (with lots of the atlas unused), or inflexible. Exposing the bindless capabilities of the hardware would enhance the programming model of WebGPU and unlock many exciting new applications and capabilities.
Residency: Residency management, where the driver makes sure all used resources are swapped in to GPU memory, can no longer be done fully automatically; it needs information from the application. When the GPU runs out of memory to keep all allocations of all resources in GPU memory at the same time, the driver can swap them in and out of system (CPU) memory to keep the necessary working set available in GPU memory when shaders execute. In the bindful model, the driver has close to perfect information about which resources will be used and does this automatically. In bindless, applications can use any resource at any time, so applications must manage residency by telling the driver which resources must be kept resident when.
Indexing of descriptors - uninitialized entries: In all APIs bindless is done by indexing descriptors (Metal can be a bit more flexible) in arrays of descriptors that have been allocated by the application. Applications allocate conservatively-sized arrays whose elements might not be initialized, or might contain stale data (especially past the currently used index range). Sparse arrays are supported in target APIs, as long as the uninitialized entries are not used.
Indexing of descriptors - index uniformity: Some hardware can only index inside the descriptor arrays uniformly, though a scalarization loop allows emulating non-uniform indexing.
Descriptor homogeneity: Hardware descriptors may have different sizes depending on the resource type. For example a buffer could be a 64-bit virtual address, while a sampled texture could be a 64-byte data payload. This makes allocation of the descriptor arrays and their indexing in the shader depend on the type of descriptor, so descriptor types may not always be mixed in the same descriptor array.
Samplers: Samplers stayed "bindful" a lot longer than other types of resources because they are pure fixed-function objects that don't have contents. In all APIs bindless for samplers has additional constraints and limits.
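The scalarization loop mentioned under index uniformity can be sketched as a "waterfall" loop in HLSL (a sketch; the `textures` array, `samp`, `materialId`, and `uv` are assumed declarations, not part of any API):

```hlsl
// Assumed declarations (hypothetical):
//   Texture2D<float4> textures[] : register(t0);
//   SamplerState samp : register(s0);
float4 color = 0;
for (;;) {
    // Broadcast the index of the first active lane to the whole wave.
    uint uniformIndex = WaveReadLaneFirst(materialId);
    if (uniformIndex == materialId) {
        // Every lane taking this branch uses the same (now uniform) index.
        color = textures[uniformIndex].Sample(samp, uv);
        break;
    }
    // Other lanes loop again; the set of active lanes shrinks each iteration.
}
```

Each iteration handles the group of lanes that share one index value, so the cost grows with the number of distinct indices in the wave.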
Overall D3D12 was designed with bindless at the forefront with descriptor tables, although support depends on the hardware tier. CBV/UAV/SRV descriptor heaps are heterogeneous on the API side, but HLSL only gained support for that with `ResourceDescriptorHeap` in Shader Model 6.6.
D3D12's binding model is centered around the root signature and descriptor heaps.
Resources in D3D12 are bound via the root signature (the equivalent of the `GPUPipelineLayout`), which can contain root constants (immediate data), root descriptors (a single binding), or root descriptor tables (arrays of bindings). A root descriptor table is set with `Set*RootDescriptorTable` as the `D3D12_GPU_DESCRIPTOR_HANDLE` (GPU memory pointer) of the first element of the array. This handle must be inside one of the two currently bound descriptor heaps (the one for samplers, or the one for everything else).
Descriptor heaps are either CPU heaps used for staging, or GPU heaps that are actually used by the hardware and usable in root signatures. D3D12 supports copies between heaps to move descriptors from staging to shader-visible heaps. Shader-visible heaps and root descriptor tables have limits that depend on the resource binding tier: Tier 2 is bindless for textures, Tier 3 is bindless for everything, but both tiers have a limit of 2048 descriptors in sampler heaps.
D3D12 delegates residency management to the application, which tags individual memory heaps (allocations from which resources are sub-allocated) as resident and evicts them.
CBV_UAV_SRV descriptor heaps in D3D12 are heterogeneous with a device-wide increment between descriptors when copying between heaps or indexing them.
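As a sketch of how that increment is used when copying a descriptor from a staging heap into a shader-visible heap (device, heap, and handle creation elided; `index` and `stagingHandle` are illustrative names):

```cpp
// Device-wide size of one CBV/UAV/SRV descriptor.
UINT inc = device->GetDescriptorHandleIncrementSize(
    D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);

// Compute the CPU handle of slot `index` in the shader-visible heap.
D3D12_CPU_DESCRIPTOR_HANDLE dst =
    shaderVisibleHeap->GetCPUDescriptorHandleForHeapStart();
dst.ptr += SIZE_T(index) * inc;

// Copy one descriptor from the CPU (staging) heap into that slot.
device->CopyDescriptorsSimple(1, dst, stagingHandle,
    D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
```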
It is not allowed to change a descriptor in a heap while it might be in use by commands, as that could be a data race.
On the HLSL side, using bindless is done by declaring unsized arrays of textures and specifying them as unbounded in the root signature (if one is provided in the HLSL):
// Declaration
Texture2D<float4> textures[] : register(t0);
// Root signature
DescriptorTable( CBV(b1), UAV(u0, numDescriptors = 4), SRV(t0, numDescriptors=unbounded) )
Non-uniform indexing must be done with the extra `NonUniformResourceIndex` intrinsic:
tex1[NonUniformResourceIndex(myMaterialID)].Sample(samp[NonUniformResourceIndex(samplerID)], texCoords);
The restriction here is that a single type is given for the array, so it is not possible to choose dynamically inside the shader what the type of the binding will be. It has to be statically known to be a UAV, SRV, or CBV, and even then it must be a single type for each of these (a float, uint, or int texture). Overlapping root descriptor tables is allowed, which could be used to get different types.
Shader Model 6.6 lifts this restriction with the dynamic resources feature. The `ResourceDescriptorHeap` HLSL object can be used to cast any index to any resource type:
<resource variable> = ResourceDescriptorHeap[uint index];
<sampler variable> = SamplerDescriptorHeap[uint index];
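For example (a sketch; the function, parameter names, and the use of non-uniform indices are illustrative), a shader compiled for Shader Model 6.6 can pull typed resources straight from the heaps without any resource declarations:

```hlsl
// Requires Shader Model 6.6 dynamic resources.
float4 Shade(uint textureId, uint samplerId, float2 uv)
{
    Texture2D<float4> tex = ResourceDescriptorHeap[NonUniformResourceIndex(textureId)];
    SamplerState samp = SamplerDescriptorHeap[NonUniformResourceIndex(samplerId)];
    return tex.Sample(samp, uv);
}
```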
Metal argument buffers tier 2 support dynamically indexing resources in arbitrarily-sized argument buffers. Since specific OS releases, it seems that the argument buffer layout is transparent and could be used for heterogeneous descriptors, but there is no indication how.
Argument buffers are `MTLBuffers` that contain resources usable by the shader. Their layout is opaque (maybe until macOS 13.0 for Tier 2 devices?) and they have to be filled using an `MTLArgumentEncoder`, which is similar to a `GPUBindGroupLayout`, either reflected from an `MTLLibrary` or created directly with `MTLArgumentDescriptor`. It is called out explicitly that argument buffers cannot contain unions, so that can't be used for heterogeneous descriptors. Argument buffers are bound with the `set*Buffer` methods like any other buffer.
When using argument buffers, the application must handle residency explicitly by calling `useResource`, `useHeap` (when suballocating resources), or `useResidencySet`.
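A sketch of those residency calls in Swift (encoder, buffer, texture, and heap creation elided; names are illustrative):

```swift
// The argument buffer itself is bound like any other buffer...
encoder.setFragmentBuffer(argumentBuffer, offset: 0, index: 0)

// ...but the resources it references must be made resident explicitly.
encoder.useResource(texture, usage: .read) // a single resource
encoder.useHeap(heap)                      // everything suballocated from a MTLHeap
```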
Samplers used with argument buffers must have `supportArgumentBuffers` set to true, and `MTLDevice` has a query for the maximum number of unique such samplers that are supported.
Here is an example from the Metal documentation. The argument to the entry point is a reference to a structure that itself contains resources. The layout of this struct must correspond to the `MTLArgumentEncoder`:
struct ArgumentBufferExample {
    texture2d<float, access::write> a;
    depth2d<float> b;
    sampler c;
    texture2d<float> d;
    device float4* e;
    texture2d<float> f;
    int g;
};

kernel void example(constant ArgumentBufferExample& argumentBuffer [[buffer(0)]])
{
    // ...
}
Metal Shading Language Specification 3.2 section 2.14.1 "The Need for a Uniform Type" shows that Metal will scalarize non-uniform indexing in arrays of resources, but at a cost.
It's not immediately clear how heterogeneous bindless would be expressed in MSL.
Vulkan promoted `VK_EXT_descriptor_indexing` (documentation) to core in Vulkan 1.2; it is how bindless is exposed in that API. Further extensions enable additional niceties, like `VK_EXT_mutable_descriptor_type`. Applications start by querying `VkPhysicalDeviceVulkan12Features.descriptorIndexing`, then can get more information from `VkPhysicalDeviceDescriptorIndexingFeatures`, like whether updating descriptor sets in use is possible, whether sparse descriptor sets are possible, and whether SPIR-V can use the `RuntimeDescriptorArray` capability.
When creating a `VkDescriptorSetLayout`, new flags can be passed on the last binding to specify that the set may be sparse, may be updated after being bound / while in use, and may have a variable-length array as its last element. When allocating a descriptor set for this layout, `VkDescriptorSetVariableDescriptorCountAllocateInfo` is passed to specify how big the variable-size array at the end will be.
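Sketched in C++ (layout and pool creation elided; the 4096 count is an arbitrary example):

```cpp
// Concrete size chosen for the variable-count array at the end of the set.
uint32_t variableCount = 4096;

VkDescriptorSetVariableDescriptorCountAllocateInfo countInfo = {};
countInfo.sType =
    VK_STRUCTURE_TYPE_DESCRIPTOR_SET_VARIABLE_DESCRIPTOR_COUNT_ALLOCATE_INFO;
countInfo.descriptorSetCount = 1;
countInfo.pDescriptorCounts = &variableCount;

VkDescriptorSetAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
allocInfo.pNext = &countInfo; // chain in the variable count
allocInfo.descriptorPool = pool;
allocInfo.descriptorSetCount = 1;
allocInfo.pSetLayouts = &layout;

VkDescriptorSet set;
vkAllocateDescriptorSets(device, &allocInfo, &set);
```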
Descriptor sets can only be modified on the device timeline, and descriptors cannot be modified while they might be in use, as that would be a race. `VK_EXT_descriptor_buffer` is another related extension which could allow pipelining updates to descriptors with other queue operations.
Vulkan doesn't have functionality to manage residency of resources on the GPU that I could see; all resources are always resident.
In SPIR-V, the bindings, instead of being pointers to `OpTypeImage` or `OpTypeArray<OpTypeImage, N>`, can be `OpTypeRuntimeArray<OpTypeImage>`, which allows unbounded indexing.
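In GLSL targeting SPIR-V this typically looks like the following (a sketch; set/binding numbers and names are illustrative):

```glsl
#extension GL_EXT_nonuniform_qualifier : require

// Unsized array: compiles to OpTypeRuntimeArray<OpTypeImage>.
layout(set = 0, binding = 0) uniform texture2D textures[];
layout(set = 0, binding = 1) uniform sampler samp;

vec4 shade(uint id, vec2 uv) {
    // nonuniformEXT marks the index as potentially non-uniform.
    return texture(sampler2D(textures[nonuniformEXT(id)], samp), uv);
}
```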
When using mutable descriptor types for heterogeneous descriptors, multiple bindings can be aliased on the same set/binding location, one for each type of descriptor accessed in the shader.
There are many commonalities between our target APIs. Descriptors/resources are gathered in arrays that are unsized from the point of view of the shader, but allocated with a concrete size on the CPU. Descriptors must not be modified while in flight. Residency must be managed. Heterogeneous descriptors require more capabilities than the base bindless capabilities.
However all of these APIs are inherently unsafe: it is an application error to use a resource past the end of the array, to use an undefined element of the array, to change a descriptor while it might be in use by the GPU, or to use a descriptor while the underlying resource is not in the correct state. WebGPU needs to handle or validate all of these constraints.
In all of the APIs, the unsized arrays of bindings are part of the object equivalent to `GPUBindGroup` (a D3D12 descriptor table, a Metal argument buffer, a Vulkan descriptor set / buffer). Either we extend `GPUBindGroup` (and layout) creation with an unsized array of bindings, or we replace it with an object that's just an unsized array of bindings.
Extending `GPUBindGroup` seems ideal because bind group slots are very limited (4 in the base limits). Also, many applications would have a scene-wide binding array of textures which could be grouped with other scene-wide data. The implementation could also use an additional storage buffer for validation that would be in the same underlying API object.
// arraySize: Number would already be there for fixed-arrays of bindings
partial dictionary GPUBindGroupLayoutEntry {
arraySize: Number | "dynamic",
};
partial dictionary GPUBindGroupDescriptor {
dynamicSize: Number,
};
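Usage could look like this (a sketch against the proposed, hypothetical API; all extension fields shown here do not exist in WebGPU today):

```js
// Hypothetical: an unsized array of sampled textures starting at @binding(1).
const layout = device.createBindGroupLayout({
  entries: [
    { binding: 0, visibility: GPUShaderStage.FRAGMENT,
      buffer: { type: "read-only-storage" } },
    { binding: 1, visibility: GPUShaderStage.FRAGMENT,
      texture: {}, arraySize: "dynamic" }, // proposed extension
  ],
});

const bindGroup = device.createBindGroup({
  layout,
  dynamicSize: 4096, // proposed: concrete size of the dynamic array
  entries: [
    { binding: 0, resource: { buffer: materialBuffer } },
    // Entries for the array go from binding 1 up to 1 + dynamicSize
    // (excluded); holes are allowed.
    { binding: 1, resource: albedoView },
  ],
});
```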
Additional validation rules are added, like:

- `dynamicSize` can only be used when the layout has an `arraySize: "dynamic"` binding.
- Entries in the `GPUBindGroupDescriptor` for the dynamic array must have a `binding` between the `binding` of the dynamic-size entry and `binding + dynamicSize` (excluded). Note that holes are allowed. This might depend on how we decide to update entries in regular (sized) bind groups.
We need to allow updating the bindings of a `GPUBindGroup` after creation for bindless, because applications need to add more resources, for example when streaming in textures for a new area. In all the APIs, in the base features for bindless, the bindings are updated by the CPU directly. There is no way to pipeline the binding updates with other GPU operations, so the CPU side must make sure not to write bindings that may currently be in use by the GPU.
Multiple alternatives to update the bindings are:

- Let the application choose directly which binding index to update.
- Ask the `GPUBindGroup` for the index of an empty binding, if any is empty.

The most ergonomic alternative is the first one, where applications are totally in control of what goes in which binding instead of reacting to slots in the `GPUBindGroup` becoming available. Implementations could optimize things under the hood: buffering updates until queue submits, updating directly when slots are unused, and using more advanced API capabilities like `VK_EXT_descriptor_buffer` or some tier of argument buffers to pipeline updates with other queue operations (we need to check if an equivalent is possible in D3D12).
// Example API to update bind groups; it is also possible to update to nothing to clear a slot.
partial interface GPUBindGroup {
    update(sequence<GPUBindGroupEntry> entries);
    copy(GPUBindGroup other, Number startBinding, Number count);
};
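For example (hypothetical API; the slot arithmetic assumes the dynamic array starts at binding 1):

```js
// Stream in a new texture: write it into slot 17 of the unsized array.
bindGroup.update([
  { binding: 1 + 17, resource: newTextureView },
]);

// Later, clear the slot again once the texture is no longer needed.
bindGroup.update([
  { binding: 1 + 17 }, // no resource: resets the slot to empty
]);
```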
WebGPU usage scope validation must be upheld with unsized bind groups as well, both to avoid data races for correctness but also to prevent crashes or other exploits if resources are used in an incorrect state. Likewise resources must be validated to be alive even if used in an unsized bind group.
Bind group validation and memory barriers / layout transitions are already among the most expensive parts of a WebGPU implementation. Because unsized bind groups are so large, they risk multiplying that cost even further, to the point where it could become intractable.
If we decide not to add any specific APIs to handle barriers / residency, one implementation strategy could be (this is Dawn-centric; wgpu already has something for their bindless extension):
Because in general a single unsized bind group, or a few of them, should be used, this should still be somewhat efficient.
The alternative is to add some explicit memory barriers by requiring applications to "lock" the usages of a resource before it can be used in an unsized bind group, preventing any other use. Resources would then need to be "unlocked" to be allowed for other usages (copying, rendering, etc.). The design space here is quite large, but all options would add additional global state to the API, which would be best to avoid.
This is assuming that we have some form of `binding_array<T, OverridableConstantN>` type for fixed-size arrays of bindings. A new single-argument version, `binding_array<T>`, is an unsized array of this type of binding. New builtin functions are added as well:
fn isBindingAvailable(a: binding_array<T>, i: i32/u32) -> bool
fn arrayLength(a : binding_array<T>) -> u32
// C++ ism
T binding_array<T>::operator [] (@uniform i32/u32 index)
Getting a binding that's past the end of the array or not available returns a "default" binding instead: a 1x1 texture filled with zeroes, a zeroed uniform buffer, etc. (As an implementation detail, the default could be stored in the 0th resource of the binding array, with indexing always adding 1 to the index.)
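With these builtins a shader can guard its accesses explicitly instead of relying on the default binding (a sketch using the proposed, hypothetical builtins; the declarations are illustrative):

```wgsl
@group(0) @binding(1) var textures : binding_array<texture_2d<f32>>;
@group(0) @binding(2) var samp : sampler;

fn sampleAlbedo(id : u32, uv : vec2f) -> vec4f {
    // Out-of-range or unpopulated slots would return the default 1x1
    // zero texture anyway, but the shader can also check explicitly:
    if (!isBindingAvailable(textures, id)) {
        return vec4f(0.0);
    }
    return textureSampleLevel(textures[id], samp, uv, 0.0);
}
```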
Vulkan doesn't always allow indexing binding arrays non-uniformly (there are per-binding-type feature bits), so WGSL could adopt that restriction in all cases (as described above). Alternatively, and preferably, implementations can use uniformity information to emit scalarization loops when needed. On Vulkan this would require subgroup support in addition to bindless support.
An example of usage in WGSL would be:
@group(0) @binding(0) var<storage, read> materials : array<Material>;
@group(0) @binding(1) var textures : binding_array<texture_2d<f32>>;
var<immediate> materialId : u32;

fn fs(...) -> vec4f {
    let material = materials[materialId];
    let albedoTexture = textures[material.albedoId];
    let specularTexture = textures[material.specularId];
    // Do something with the textures.
}
An unsized heterogeneous bind group layout would have a `{binding: startIndex, heterogeneous: {stuff?}}` entry, or alternatively there would be a different constructor that makes a fully heterogeneous bind group layout. The `GPUBindGroup` would be similar to a fixed-type bindless `GPUBindGroup`.
On the WGSL side the type would be `binding_array` with no template arguments, then:
fn getBinding<T>(a: binding_array, index: i32/u32) -> T
fn hasBinding<T>(a: binding_array, index: i32/u32) -> bool
fn arrayLength(a: binding_array) -> u32
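Usage might look like this (a sketch with the hypothetical builtins; the shader decides each slot's type dynamically):

```wgsl
@group(0) @binding(0) var resources : binding_array;

fn shade(texId : u32, coord : vec2i) -> vec4f {
    // Check what this slot contains before casting it to a concrete type.
    if (hasBinding<texture_2d<f32>>(resources, texId)) {
        let tex = getBinding<texture_2d<f32>>(resources, texId);
        return textureLoad(tex, coord, 0);
    }
    return vec4f(0.0);
}
```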