Bindless investigation and proposal

Here is a new investigation for bindless with a proposal for WebGPU at the end. @litherum's original investigation is great, read it as well! There was also @cwfitzgerald's presentation at the previous F2F. Another good resource was the series of posts from Traverse Research on bindless.

I tried really hard to find a way to support heterogeneous bindless in WebGPU directly, without the intermediate step of per-binding-type indexing, but didn't succeed.

Note that for this extension to be useful for storage textures we should have support for format-less storage textures.

Motivation

In the current WebGPU binding model, shaders have access to a small set of resources that are in the GPUBindGroups bound at the time draw* or dispatch* is called. The CPU-side code of the application has to know all the resources a shader might need and bind them at command-recording time. The number of resources bound while executing a shader also has to stay under limits that can be fairly low.

This model was chosen for the first version of WebGPU because WebGPU without extensions needs to run on a variety of hardware, including hardware that has a fixed number of "registers" configured to access resources. In this "bindful" model, setBindGroup sets the internal state of the registers, and shader assembly uses a register number to name a resource. Over the last decade, GPU architectures have shifted to a "bindless" model where resources are instead named in shader assembly by a GPU memory pointer, or an index into a table somewhere in memory. This lets shaders index a vastly larger number of resources and has enabled many new kinds of GPU algorithms.

In more recent rendering engines many algorithms execute on the GPU that need global information about the scene or the objects contained in it. They need access to each texture potentially used by an object in case they need to process that object. This blows way past the resource limits in the bindful model. Examples of algorithms that do this are:

Visibility buffers, which do a first rasterization pass with just object and triangle IDs in a render target, then have a compute shader / large quad that handles the texturing of all objects at once.
Ray-tracing where rays can be launched in any direction and need to get the texture of the intersected object.
Rendering 2D contents with many embedded images.
And a lot, lot more.

In some cases texture atlasing can be used as a (painful) workaround, but it can become extremely messy, inefficient (with lots of the atlas unused), or inflexible. Exposing the bindless capabilities of the hardware would enhance the programming model of WebGPU and unlock many exciting new applications and capabilities.

Bindless concepts

Residency: Residency management, where the driver makes sure all used resources are swapped in to GPU memory, can no longer be done fully automatically; it needs information from the application. When the GPU runs out of memory to keep all allocations of all resources in GPU memory at the same time, the driver can swap them in and out of system (CPU) memory to keep the necessary working set available in GPU memory when shaders execute. In the bindful model, the driver has close to perfect information about which resources will be used and does that automatically. In bindless, applications can use any resource at any time, so they must manage residency by telling the driver which resources must be kept resident and when.

Indexing of descriptors - uninitialized entries: In all APIs bindless is done by indexing descriptors (Metal can be a bit more flexible) in arrays of descriptors that have been allocated by the application. Applications allocate conservatively sized arrays, and elements might not be initialized or might contain stale data (especially past the currently used index range). Sparse arrays are supported in the target APIs, as long as the uninitialized entries are not used.

Indexing of descriptors - index uniformity: Some hardware can only index inside the descriptor arrays uniformly, though a scalarization loop allows emulating non-uniform indexing.
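To make the scalarization loop concrete, here is a minimal CPU-side sketch in TypeScript that simulates one subgroup. The names (`scalarizedAccess`, `laneIndices`, `accessUniform`) are hypothetical and only for illustration; a real implementation would emit the equivalent loop in shader code using subgroup operations.

```typescript
// Simulates a scalarization loop for one subgroup (wave) on the CPU.
// Each lane wants descriptorArray[laneIndices[lane]], but the hypothetical
// hardware only supports a single (uniform) index per access.
function scalarizedAccess<T>(
  laneIndices: number[],
  accessUniform: (index: number) => T,
): { results: T[]; iterations: number } {
  const results: T[] = new Array<T>(laneIndices.length);
  const done: boolean[] = laneIndices.map(() => false);
  let iterations = 0;

  for (;;) {
    // Stand-in for a "broadcast from the first active lane" subgroup operation.
    const first = done.indexOf(false);
    if (first === -1) break;
    const uniformIndex = laneIndices[first];

    // One uniform access serves every remaining lane that wants the same index.
    const value = accessUniform(uniformIndex);
    for (let lane = 0; lane < laneIndices.length; lane++) {
      if (!done[lane] && laneIndices[lane] === uniformIndex) {
        results[lane] = value;
        done[lane] = true;
      }
    }
    iterations++;
  }
  return { results, iterations };
}
```

The loop runs once per distinct index in the subgroup, so the cost grows with how divergent the indices are.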

Descriptor homogeneity: Hardware descriptors may have different sizes depending on the resource type. For example a buffer could be a 64-bit virtual address, while a sampled texture could be a 64-byte data payload. This makes allocation of the descriptor arrays and their indexing in the shader depend on the type of descriptor, so descriptor types may not always be mixed in the same descriptor array.

Samplers: Samplers stayed "bindful" a lot longer than other types of resources because they are pure fixed-function objects that don't have contents. In all APIs, bindless for samplers has additional constraints and limits.

D3D12 / HLSL

Overall D3D12 was designed with bindless at the forefront with its descriptor tables, although support depends on the hardware tier. CBV_UAV_SRV descriptor heaps are heterogeneous on the API side, but HLSL only gained support for that with ResourceDescriptorHeap in Shader Model 6.6.

D3D12

D3D12's binding model is centered around the root signature and descriptor heaps.

Resources in D3D12 are bound through the root signature (equivalent of the GPUPipelineLayout), which can contain root constants (immediate data), root descriptors (a single binding) or root descriptor tables (arrays of bindings). A root descriptor table is set with Set*RootDescriptorTable as the D3D12_GPU_DESCRIPTOR_HANDLE (GPU memory pointer) of the first element of the array. This handle must be inside one of the two currently bound descriptor heaps (the one for samplers, or the one for everything else).

Descriptor heaps are either CPU heaps used for staging, or GPU heaps that are actually used by the hardware and usable in root signatures. D3D12 supports copies between heaps to move from staging to shader-visible heaps. Shader-visible heaps and root descriptor tables have limits that depend on the resource binding tier. Tier 2 is bindless for textures, Tier 3 is bindless for everything, but both tiers have a limit of 2048 descriptors in sampler heaps.

D3D12 delegates residency management to the application, which makes individual memory heaps (allocations from which resources are sub-allocated) resident and evicts them.

CBV_UAV_SRV descriptor heaps in D3D12 are heterogeneous, with a device-wide increment between descriptors when copying between heaps or indexing them.

It is not allowed to change a descriptor in a heap while it might be in use by commands as that could be a data-race.

HLSL

On the HLSL side, using bindless is done by declaring unsized arrays of textures, and specifying it is unbounded in the root signature (if one is provided in the HLSL):

// Declaration
Texture2D<float4> textures[] : register(t0);

// Root signature
DescriptorTable( CBV(b1), UAV(u0, numDescriptors = 4), SRV(t0, numDescriptors=unbounded) )

Non-uniform indexing must be done with an extra NonUniformResourceIndex keyword:

tex1[NonUniformResourceIndex(myMaterialID)].Sample(samp[NonUniformResourceIndex(samplerID)], texCoords);

The restriction here is that a single type is given for the array, so it is not possible to choose dynamically inside the shader what the type of the binding will be. It has to be statically known to be UAV, SRV or CBV, and even then it must be a single type for each of these (a float, uint, or int texture). Overlapping of the root descriptor tables is allowed, which could be used to have different types.

Shader Model 6.6 lifts this restriction with the dynamic resources feature. The ResourceDescriptorHeap HLSL object can be used to cast any index to any resource type:

<resource variable> = ResourceDescriptorHeap[uint index];
<sampler variable> = SamplerDescriptorHeap[uint index];

Metal / MSL

Metal argument buffer tier 2 supports dynamically indexing resources in arbitrarily-sized argument buffers. After specific OS releases it seems that the argument buffer layout is transparent and could be used for heterogeneous descriptors, but there is no indication of how.

Metal

Argument buffers are MTLBuffers that contain resources usable by the shader. Their layout is opaque (maybe until macOS 13.0 for Tier 2 devices?) and they have to be filled by using an MTLArgumentEncoder, which is similar to a GPUBindGroupLayout and is either reflected from an MTLLibrary or created directly with MTLArgumentDescriptor. It is called out explicitly that argument buffers cannot contain unions, so that can't be used for heterogeneous descriptors. The argument buffers are bound with the set*Buffer method like any other buffer.

When using argument buffers, the application must handle residency explicitly by calling useResource, useHeap (when suballocating resources), or useResidencySet.

Samplers used with argument buffers must have supportArgumentBuffers set to true, and MTLDevices have a query for the maximum number of unique such samplers that are supported.

MSL

Here is an example from the Metal documentation. The argument to the entrypoint is a reference to a structure containing resources itself. The layout of this struct must correspond to the MTLArgumentEncoder:

struct ArgumentBufferExample{
    texture2d<float, access::write> a;
    depth2d<float> b;
    sampler c;
    texture2d<float> d;
    device float4* e;
    texture2d<float> f;
    int g;
};


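kernel void example(constant ArgumentBufferExample & argumentBuffer [[buffer(0)]])
{
    // Body elided in this excerpt; resources are accessed as argumentBuffer.a, argumentBuffer.b, etc.
}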

Metal Shading Language Specification 3.2 section 2.14.1 "The Need for a Uniform Type" shows that Metal will scalarize non-uniform indexing in arrays of resources, but at a cost.

It's not immediately clear how heterogeneous bindless would be expressed in MSL.

Vulkan / SPIR-V

Vulkan promoted VK_EXT_descriptor_indexing (documentation) to core in Vulkan 1.2; it is how bindless is exposed in that API. Further extensions enable additional niceties, like VK_EXT_mutable_descriptor_type.

Vulkan

Applications start by querying VkPhysicalDeviceVulkan12Features.descriptorIndexing, then can get more information from VkPhysicalDeviceDescriptorIndexingFeatures, like whether updating descriptor sets in use is possible, whether sparse descriptor sets are possible, and whether SPIR-V can use the RuntimeDescriptorArray capability.

When creating a VkDescriptorSetLayout, new flags can be passed for the last binding to specify that the set may be sparse, may be updated after being bound / while in use, and may have a variable-length array as the last element. When allocating a descriptor set for this layout, VkDescriptorSetVariableDescriptorCountAllocateInfo is passed to specify how big the variable-size array at the end will be.

Descriptor sets can only be modified on the device timeline and descriptors cannot be modified while they might be in use as that would be a race. VK_EXT_descriptor_buffer is another related extension which could allow pipelining the updates to descriptors with other queue operations.

Vulkan doesn't have functionality to manage residency of resources on the GPU that I could see. All resources are always resident.

SPIR-V

Instead of being pointers to OpTypeImage or OpTypeArray<OpTypeImage, N>, the bindings can be pointers to OpTypeRuntimeArray<OpTypeImage>, which allows for unbounded indexing.

When using mutable descriptor types for heterogeneous descriptors, multiple bindings can be aliased on the same set/binding location, one for each type of descriptor accessed in the shader.

Proposal for WebGPU / WGSL

There are many commonalities between our target APIs. Descriptor/resources are gathered in arrays that are unsized from the point of view of the shader, but allocated with a concrete size on the CPU. Descriptors must not be modified while in flight. Residency must be managed. Heterogeneous descriptors require more capabilities than the base bindless capabilities.

However all of these APIs are inherently unsafe: it is an application error to use a resource past the end of the array, to use an undefined element of the array, to change a descriptor while it might be in use by the GPU, or to use a descriptor while the underlying resource is not in the correct state. WebGPU needs to handle or validate all of these constraints.

WebGPU API

Object storing the array of bindings.

In all of the APIs the unsized array of bindings is part of the object equivalent to GPUBindGroup (a D3D12 descriptor table, a Metal argument buffer, a Vulkan descriptor set / buffer). Either we extend GPUBindGroup (and layout) creation with an unsized array of bindings, or we replace it with an object that's just an unsized array of bindings.

Extending GPUBindGroup seems ideal because bind group slots are very limited (4 in the base limits). Also many applications would have a scene-wide binding array of textures which could be grouped with other scene-wide data. The implementation could also use an additional storage buffer for validation that would be in the same underlying API object.

// arraySize: Number would already be there for fixed-arrays of bindings
partial dictionary GPUBindGroupLayoutEntry {
    arraySize: Number | "dynamic",
};

partial dictionary GPUBindGroupDescriptor {
    dynamicSize: Number,
};

Additional validation rules are added like:

At most one unsized array binding is allowed, and no binding is allowed after it.
Only certain types of bindings are allowed as unsized arrays.
dynamicSize can only be used when the layout has an arraySize: "dynamic" binding.
Bindings are now allowed in the GPUBindGroupDescriptor between the binding for the dynamic size entry, and binding + dynamicSize (excluded). Note that holes are allowed.
The feature is checked when new capabilities are used.
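As a rough illustration of the rules above, here is a minimal TypeScript sketch of the validation. The `LayoutEntry` and `BindGroupDesc` shapes and the `validateUnsized` function are hypothetical simplifications of the dictionaries sketched earlier, not proposed API; for brevity, fixed-size arrays are treated as occupying a single binding, and binding types and the feature check are not modeled.

```typescript
// Hypothetical, simplified shapes mirroring the dictionaries sketched above.
interface LayoutEntry {
  binding: number;
  arraySize?: number | "dynamic";
}
interface BindGroupDesc {
  entries: { binding: number }[];
  dynamicSize?: number;
}

// Returns the list of validation errors (empty when the descriptor is valid).
function validateUnsized(layout: LayoutEntry[], desc: BindGroupDesc): string[] {
  const errors: string[] = [];
  const dynamic = layout.filter((e) => e.arraySize === "dynamic");
  if (dynamic.length > 1) {
    errors.push("at most one unsized array binding is allowed");
  }
  const start = dynamic.length > 0 ? dynamic[0].binding : undefined;

  if (start !== undefined && layout.some((e) => e.binding > start)) {
    errors.push("no binding is allowed after the unsized array binding");
  }
  if (start === undefined && desc.dynamicSize !== undefined) {
    errors.push('dynamicSize requires an arraySize: "dynamic" binding');
  }

  const fixed = layout
    .filter((e) => e.arraySize !== "dynamic")
    .map((e) => e.binding);
  for (const { binding } of desc.entries) {
    // Entries may land anywhere in [start, start + dynamicSize); holes are fine.
    const inDynamicRange =
      start !== undefined &&
      binding >= start &&
      binding < start + (desc.dynamicSize ?? 0);
    if (fixed.indexOf(binding) === -1 && !inDynamicRange) {
      errors.push(`binding ${binding} is not in the layout or the dynamic range`);
    }
  }
  return errors;
}
```

For example, with a layout of `{binding: 0}` and `{binding: 1, arraySize: "dynamic"}` and `dynamicSize: 8`, entries at bindings 0, 1 and 5 are accepted while an entry at binding 9 is rejected.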

Updating entries in the bind group.

This might depend on how we decide to update entries in regular (sized) bind groups.

We need to allow updating of bindings after creation of GPUBindGroups for bindless because applications need to add more resources, for example when streaming in textures for a new area. In all the APIs in the base features for bindless, the bindings are updated by the CPU directly. There is no way to pipeline the binding updates with other GPU operations, so the CPU-side must make sure to not write bindings that may currently be in use by the GPU.

Multiple alternatives to update the bindings are:

Let applications update bindings directly, without waiting for the GPU, which would often require making a new copy of the whole contents of the GPUBindGroup.
Have a method on GPUBindGroup to get the index of an empty binding, if any is empty.
Have a method to set a binding, that returns the index in which the binding was added in the GPUBindGroup.
Let applications receive signals when bindings become unused on the GPU, and validate that only these ones are used when setting a binding.

The most ergonomic alternative is the first one, where applications are totally in control of what goes in which binding instead of reacting to slots in the GPUBindGroup becoming available. Implementations could optimize things under the hood: buffering updates until queue submits, updating directly when slots are unused, and using more advanced API capabilities like VK_EXT_descriptor_buffer or some tier of argument buffers to pipeline updates with other queue operations (we need to check if an equivalent is possible in D3D12).

// Example API to update bind groups, it is also possible to update to nothing to clear a slot.
partial interface GPUBindGroup {
    update(sequence<GPUBindGroupEntry>);
    copy(GPUBindGroup other, Number startBinding, Number count);
}
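One way an implementation might buffer updates until queue submit, and copy the contents only when the GPU may still be reading them, is sketched below in TypeScript. Everything here (`UpdatableBindGroup`, `flushForSubmit`, `onGpuIdle`, the `copies` counter) is a hypothetical model for illustration, with strings standing in for resources; it is not how any existing implementation works.

```typescript
// Hypothetical model of a bind group whose entries can be updated after creation.
type Resource = string | null; // null clears a slot

class UpdatableBindGroup {
  private current: Resource[] = []; // backing array read by the next submit
  private pending: { binding: number; resource: Resource }[] = [];
  private inFlight = false; // true while the GPU may still read `current`
  copies = 0; // number of full copies made, for illustration only

  constructor(size: number) {
    for (let i = 0; i < size; i++) this.current.push(null);
  }

  // Application-facing call: only records the update, never waits for the GPU.
  update(entries: { binding: number; resource: Resource }[]): void {
    for (const e of entries) this.pending.push(e);
  }

  // Called by the implementation at queue submit; returns what the GPU will read.
  flushForSubmit(): readonly Resource[] {
    if (this.pending.length > 0) {
      if (this.inFlight) {
        // Earlier submits may still read the old array: write to a fresh copy.
        this.current = [...this.current];
        this.copies++;
      }
      for (const e of this.pending) this.current[e.binding] = e.resource;
      this.pending = [];
    }
    this.inFlight = true;
    return this.current;
  }

  // Called once all submitted work using this bind group has completed.
  onGpuIdle(): void {
    this.inFlight = false;
  }
}
```

In this model a full copy is only paid for when an update arrives while earlier work is still in flight; updates made while the bind group is idle are written in place.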

Memory barriers and residency

WebGPU usage scope validation must be upheld with unsized bind groups as well, both to avoid data races for correctness but also to prevent crashes or other exploits if resources are used in an incorrect state. Likewise resources must be validated to be alive even if used in an unsized bind group.

Bind group validation and memory barriers / layout transitions are already among the most expensive parts of a WebGPU implementation. Unsized bind groups, because they are so large, risk multiplying the cost even more, to the point it could be intractable.

If we decide to not add any specific APIs to handle barriers / residency, one implementation strategy could be (this is Dawn-centric, wgpu already has something for their bindless extensions):

Each resource keeps a list of the unsized bindgroups it's in.
When destroyed it marks itself unavailable in all of them.
When memory barriered, it tells all of its unsized bindgroups it is "dirty" for them.
When processing the usage scopes, record unsized bindgroups that are used, but skip iterating over their bindings.
Have a fake "may be in unsized bindgroup" bit for all other tracked resources.
When validating the usage scope:
- Add usages for all the "may be in unsized bindgroup" for their actual uses in the unsized bindgroups.
  - This requires adding a new GPUTextureUsage value.
- For all the unsized bind groups check interferences (somehow?)
- Add barriers as needed for "dirty" resources in the unsized bindgroups.

Because in general only one or a few unsized bindgroups should be used, this should still be somewhat efficient.
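The bookkeeping part of this strategy (back-references from resources to bind groups, marking entries unavailable on destroy, and "dirty" tracking on barriers) can be sketched in TypeScript as follows. All names are hypothetical and this is not Dawn's or wgpu's actual code; interference checks between bind groups are left out since the strategy above leaves them open.

```typescript
// Hypothetical names throughout; only the bookkeeping is modeled.
class UnsizedBindGroup {
  readonly slots: (TrackedResource | null)[] = [];
  readonly unavailable: boolean[] = [];
  readonly dirty: TrackedResource[] = [];

  set(index: number, resource: TrackedResource): void {
    this.slots[index] = resource;
    this.unavailable[index] = false;
    // Each resource keeps a list of the unsized bind groups it is in.
    resource.bindGroups.push(this);
  }
}

class TrackedResource {
  readonly bindGroups: UnsizedBindGroup[] = [];
  constructor(readonly name: string) {}

  // When destroyed, mark every slot holding this resource as unavailable.
  destroy(): void {
    for (const bg of this.bindGroups) {
      bg.slots.forEach((r, i) => {
        if (r === this) bg.unavailable[i] = true;
      });
    }
  }

  // When a barrier changes the resource's state, flag it as dirty in each bind group.
  onBarrier(): void {
    for (const bg of this.bindGroups) {
      if (bg.dirty.indexOf(this) === -1) bg.dirty.push(this);
    }
  }
}

// At usage-scope validation, only dirty resources need work, not every binding.
function barriersNeeded(used: UnsizedBindGroup[]): string[] {
  const names: string[] = [];
  for (const bg of used) {
    for (const r of bg.dirty) names.push(r.name);
    bg.dirty.length = 0; // barriers are now accounted for
  }
  return names;
}
```

The cost at validation time is proportional to the number of dirty resources rather than to the size of the unsized bind group.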

The alternative is to add some explicit memory barriers by requiring applications to "lock" the usages of a resource for them to be able to be used in an unsized bindgroup, preventing any other use. Then resources would need to be "unlocked" to be allowed for other usages (copying, rendering etc). The design space here is quite large, but all options would add additional global state to the API which would be best to avoid.

WGSL

This is assuming that we have some form of binding_array<T, OverridableConstantN> type for fixed-size arrays of bindings. A new templated version, binding_array<T>, would be an unsized array of this type of binding. New builtin functions are added as well:

fn isBindingAvailable(a: binding_array<T>, i: i32/u32) -> bool
fn arrayLength(a : binding_array<T>) -> u32

// C++ ism
T binding_array<T>::operator [] (@uniform i32/u32 index)

Getting a binding that's past the end of the array or not available returns a "default" binding instead, such as a 1x1 texture filled with zeroes, a zeroed uniform buffer, etc. (implementation detail: it could be stored in the 0th resource in the binding array, with indexing always adding 1 to the index).
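The "default binding in slot 0" idea can be modeled on the CPU with a few lines of TypeScript. This is only a sketch of the lookup a shader translation could emit; `lookupBinding`, `storage` and `userLength` are hypothetical names, and `undefined` stands in for an unavailable entry.

```typescript
// Hypothetical CPU-side model of the lookup a shader translation could emit.
// Slot 0 of `storage` holds the default binding; user index i lives in slot i + 1.
function lookupBinding<T>(
  storage: (T | undefined)[],
  userLength: number,
  index: number,
): T {
  const inBounds = index >= 0 && index < userLength;
  const candidate = inBounds ? storage[index + 1] : undefined;
  // Out-of-bounds or unavailable entries fall back to the default in slot 0.
  return candidate !== undefined ? candidate : (storage[0] as T);
}
```

With `storage = ["default", "albedo", undefined, "normal"]` and `userLength = 3`, indices 0 and 2 return their own bindings, while index 1 (unavailable) and index 3 (past the end) both return the default.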

Vulkan doesn't always allow indexing binding arrays non-uniformly (there are per-binding-type feature bits), so WGSL could adopt that restriction in all cases (as described above). Alternatively, and preferably, implementations can use uniformity information to emit scalarization loops when needed. On Vulkan this would require subgroup support in addition to bindless support.

An example of using this in WGSL would be:

@group(0) @binding(0) var<storage, read> materials : array<Materials>;
@group(0) @binding(1) var textures : binding_array<texture_2d<f32>>;

var<immediate> materialId : u32;
fn fs(...) -> vec4f {
    let material = materials[materialId];
    let albedoTexture = textures[material.albedoId];
    let specularTexture = textures[material.specularId];
    
    // Do something with the textures.
}

How could heterogeneous work?

An unsized bind group layout would have a {binding: startIndex, heterogeneous: {stuff?}} entry or alternatively would be a different constructor that makes a fully heterogeneous bind group layout. The GPUBindGroup would be similar to a fixed-type bindless GPUBindGroup.

On the WGSL side the type would be binding_array with no template arguments then:

fn getBinding<T>(a: binding_array, index: i32/u32) -> T
fn hasBinding<T>(a: binding_array, index: i32/u32) -> bool
fn arrayLength(a: binding_array) -> u32
