## Exploring CPU/GPU-driven rendering of dotted lines in WGPU. Part 3, the last one.
In [the previous part](https://hackmd.io/@agent10/rkKg3w8_Wl), we explored a couple of optimizations that gave us quite a good performance boost.
Those optimizations were CPU-based algorithms: a simple filter to cull off-screen dots and an advanced approach using an R-tree structure.
In this part, as promised, we will utilize the GPU more and show how its parallel architecture can speed up our calculations. Specifically, we are going to optimize the simple filtering and dot-visibility calculation, and cover topics like compute shaders and WGPU's indirect draw feature.
Let's get started!
### Implementation
[Previously](https://hackmd.io/X9hBdtQnS3mdXIyK1bjyJQ?view#Screen-filtering), we filtered off-screen dots by calculating the dot's position on the CPU using a projection matrix and determining if it intersected with the screen bounds.
This approach provided quite good performance, maintaining 90 FPS on my Pixel 5 with up to 250K dots.
The algorithm is simple, and more importantly, the position calculations are independent. This allows for parallelization, which GPUs handle well.
Let's see what the general flow of the GPU approach looks like:
- Prepare compute shader
- Feed the data and execute computation using compute shader
- Output the result to the vertex/fragment pipeline for rendering
#### Initial shader
Let's start shaping the compute shader.
```wgsl
struct GlobalUniform {
    view_proj: mat4x4<f32>,
    scale: f32,
};

@group(0) @binding(0)
var<uniform> globals: GlobalUniform;

struct DotInput {
    position: vec3<f32>,
    color_alpha: f32,
}

@group(1) @binding(0)
var<storage, read_write> dots: array<DotInput>;

@compute @workgroup_size(64)
fn compute_main(@builtin(global_invocation_id) id: vec3<u32>) {
    // TODO
}
```
Here, we prepare a global uniform containing the view-projection matrix and the global scale for culling and visibility calculations, along with a storage array holding each dot's world position and alpha value.
The remaining steps for creating the bind group and pipeline layouts are no different from a standard rendering pipeline.
#### Computation
Generally, the computation is a combination of the [filtering](https://hackmd.io/@agent10/rkKg3w8_Wl#Screen-filtering) and [GPU alpha value calculations](https://hackmd.io/@agent10/BylPo4TDbl#GPU-driven-re-calculation) from the previous articles. Here is the compute shader:
```wgsl
@compute @workgroup_size(64)
fn compute_main(@builtin(global_invocation_id) id: vec3<u32>) {
    // calculate the power-of-two scale
    let p2 = u32(ceil(log2(globals.scale)));
    var p2_scale = 1u;
    if (p2 >= 1u) {
        p2_scale = 2u << (p2 - 1u);
    }
    // global_invocation_id.x represents the index of the dot
    let i = id.x;
    // drop indices that are outside the dots array bounds or not multiples of p2_scale
    if (i >= arrayLength(&dots) || i % p2_scale != 0u) {
        return;
    }
    // calculate the NDC position
    let position = globals.view_proj * vec4f(dots[i].position, 1.0);
    let ndc = position.xy / position.w;
    // drop off-screen dots
    if (ndc.x < -1.0 || ndc.x > 1.0 || ndc.y < -1.0 || ndc.y > 1.0) {
        return;
    }
    // calculate the alpha value of the dot
    var color_alpha = 1.0;
    if (i % (p2_scale * 2u) != 0u) {
        if (p2_scale == 1u) {
            color_alpha = 2.0 * (1.0 - globals.scale);
        } else {
            color_alpha = 2.0 * (f32(p2_scale) - globals.scale) / f32(p2_scale);
        }
    }
    if (color_alpha > 0.0) {
        dots[i].color_alpha = color_alpha;
        // TODO handle result
    }
}
```
It's executed by calling:
```rust
let mut compute_pass = encoder.begin_compute_pass(...);
compute_pass.set_pipeline(...);
compute_pass.set_bind_group(...);
compute_pass.dispatch_workgroups((dots.size() + 64) / 64, 1, 1);
```
Let's go through the details.
1) Since we use a typical `@workgroup_size(64)`, we dispatch `(dots.size() + 64) / 64` workgroups to cover all dot indices plus a small buffer; the extra threads exit early via the bounds check.
2) We perform an *early exit* if the index is not a multiple of the power-of-two scale or if it is outside the bounds of the dots array. This ensures those dots are not included in the rendering or further calculation.
3) We transform the position of the dot into NDC space (range `[-1.0, 1.0]`); any NDC value outside this range is off-screen. **This is the filtering we aimed to achieve!**
4) Finally, we calculate the alpha value and update the original storage array.
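The per-dot math above is easy to verify off-GPU. Below is a plain-Rust mirror of the shader's dispatch-count, scale, and alpha logic, a sketch for unit testing; the function names `workgroup_count`, `p2_scale`, and `dot_alpha` are ours, not part of the article's codebase:

```rust
// Workgroups dispatched for N dots with @workgroup_size(64),
// matching the (dots.size() + 64) / 64 expression above.
fn workgroup_count(dots: u32) -> u32 {
    (dots + 64) / 64
}

// Power-of-two scale, as computed at the top of compute_main.
// max(0.0) guards the cast for scales below 1.0.
fn p2_scale(scale: f32) -> u32 {
    let p2 = scale.log2().ceil().max(0.0) as u32;
    if p2 >= 1 { 2u32 << (p2 - 1) } else { 1 }
}

// Alpha value for the dot at `index`, or None if the dot is skipped
// (early exit on the modulo check, or fully transparent).
fn dot_alpha(index: u32, scale: f32) -> Option<f32> {
    let p2s = p2_scale(scale);
    if index % p2s != 0 {
        return None;
    }
    let alpha = if index % (p2s * 2) != 0 {
        if p2s == 1 {
            2.0 * (1.0 - scale)
        } else {
            2.0 * (p2s as f32 - scale) / p2s as f32
        }
    } else {
        1.0
    };
    (alpha > 0.0).then_some(alpha)
}
```

A test like `dot_alpha(2, 1.5)` returning `Some(0.5)` confirms the fade: at scale `1.5`, every odd multiple of 2 is halfway through its fade-out.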
So far, we have only calculated the alpha values and skipped certain dots. But how does this help optimize the rendering and reduce the number of instances to be drawn?
One approach might be to transfer the results back to the CPU and use them to adjust buffers and the parameters of the draw call, but this would significantly degrade performance since it breaks the asynchronous pipeline, forcing the CPU to stall while waiting for the GPU to finish its work. Fortunately, WGPU has a great feature for this.
#### Indirect drawing
What if everything needed to perform rendering (including the parameters for the `draw()` and `drawIndexed()` methods) was on the GPU side? This is essentially what **indirect drawing** is. It allows us to bypass a readback and adjust draw call parameters directly on the GPU. Let me illustrate it with the following diagram.

This is how the entire flow is implemented.
1) Prepare all buffers that will be shared and used by the `compute pass`, the `render pass`, and WGPU:
- we already saw earlier how `compute pass` works with `global uniform buffer` and `dots buffer`
- `IndirectArgs` buffer. There are two types of indirect arguments: [DrawIndirectArgs](https://github.com/gfx-rs/wgpu/blob/81eca17db8a92288b866f3baf93498d7b2225de7/wgpu-types/src/render.rs#L933) for an active vertex buffer only, and [DrawIndexedIndirectArgs](https://github.com/gfx-rs/wgpu/blob/81eca17db8a92288b866f3baf93498d7b2225de7/wgpu-types/src/render.rs#L957) for indexed primitives
Since we use indexed primitives, we'll go with the latter:
```rust
pub struct DrawIndexedIndirectArgs {
    pub index_count: u32,
    pub instance_count: u32,
    pub first_index: u32,
    pub base_vertex: i32,
    pub first_instance: u32,
}
```
Instead of calling the "direct" method,
```rust
pub fn draw_indexed(&mut self, indices: Range<u32>, base_vertex: i32, instances: Range<u32>)
```
we will be calling the indirect one, where the `indirect_buffer` contains a `DrawIndexedIndirectArgs` structure.
```rust
pub fn draw_indexed_indirect(
    &mut self,
    indirect_buffer: &Buffer,
    indirect_offset: BufferAddress,
)
```
- The `culled` buffer contains the actual indices to be drawn; it will be covered in a moment.
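For reference, whatever writes the args buffer on the host side must match the layout of `DrawIndexedIndirectArgs` exactly: five tightly packed 32-bit values, 20 bytes. Here is a dependency-free sketch of such a mirror struct and its serialization; in a real project you would typically reuse the type from `wgpu` together with a crate like `bytemuck` instead:

```rust
// Host-side mirror of DrawIndexedIndirectArgs; must match the GPU layout:
// five tightly packed 32-bit values, 20 bytes total.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct IndirectArgs {
    index_count: u32,
    instance_count: u32,
    first_index: u32,
    base_vertex: i32,
    first_instance: u32,
}

// Serialize field by field (little-endian, as GPUs expect) so the
// snippet stays free of external dependencies.
fn indirect_args_bytes(args: &IndirectArgs) -> [u8; 20] {
    let mut out = [0u8; 20];
    out[0..4].copy_from_slice(&args.index_count.to_le_bytes());
    out[4..8].copy_from_slice(&args.instance_count.to_le_bytes());
    out[8..12].copy_from_slice(&args.first_index.to_le_bytes());
    out[12..16].copy_from_slice(&args.base_vertex.to_le_bytes());
    out[16..20].copy_from_slice(&args.first_instance.to_le_bytes());
    out
}
```

These bytes are what you would upload into the buffer passed to `draw_indexed_indirect`; each frame, the compute pass then overwrites `instance_count` (and `index_count`) on the GPU.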
Now, let's look at what the compute shader includes:
```wgsl
struct GlobalUniform {
    view_proj: mat4x4<f32>,
    scale: f32,
};

@group(0) @binding(0)
var<uniform> globals: GlobalUniform;

// represents DrawIndexedIndirectArgs
struct IndirectArgs {
    indexCount: u32,
    instanceCount: atomic<u32>,
    reserved0: u32,
    reserved1: u32,
    reserved2: u32,
}

struct DotInput {
    position: vec3<f32>,
    color_alpha: f32,
}

@group(1) @binding(0)
var<storage, read_write> dots: array<DotInput>;

@group(1) @binding(1)
var<storage, read_write> culled: array<u32>;

@group(1) @binding(2)
var<storage, read_write> args: IndirectArgs;

@compute @workgroup_size(64)
fn compute_main(@builtin(global_invocation_id) id: vec3<u32>) {
    // ...
}
```
We've added the `IndirectArgs` structure to map to `DrawIndexedIndirectArgs` along with two new storage buffers: `args: IndirectArgs` to store draw call arguments and `culled: array<u32>` for dot indices that need to be rendered.
2) Now, let's update the `compute_main` function to fill the `args` and `culled` storage buffers:
```wgsl
@compute @workgroup_size(64)
fn compute_main(@builtin(global_invocation_id) id: vec3<u32>) {
    // ...calculation as before...
    if (color_alpha > 0.0) {
        dots[i].color_alpha = color_alpha;
        args.indexCount = 6u;
        let culledIndex = atomicAdd(&args.instanceCount, 1u);
        culled[culledIndex] = i;
    }
}
```
- We set `indexCount` to 6, since our [dot quad mesh](https://hackmd.io/@agent10/rkKg3w8_Wl#Mesh-optimization) has 4 vertices and 6 indices.
- For each dot to be rendered, we atomically increment the `instanceCount` that will be used for the indirect draw call.
- We write the dot's index `i` into the `culled` array at the slot returned by the atomic increment from the previous step.
So, in simple words, each frame we populate the `culled` array from the start with the actual indices that need to be drawn.
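Conceptually, the compute pass performs *stream compaction*: visible dot indices are packed to the front of `culled`, and the pack position doubles as the instance count. A sequential CPU sketch of the same idea (with the `atomicAdd` replaced by a plain counter; the `compact` helper is ours, for illustration only):

```rust
// CPU sketch of the GPU-side compaction. `visible` stands in for the
// per-dot culling test in the compute shader; the returned pair mirrors
// the `culled` storage buffer and `args.instanceCount`.
fn compact(visible: &[bool]) -> (Vec<u32>, u32) {
    let mut culled = vec![0u32; visible.len()];
    let mut instance_count = 0u32;
    for (i, &v) in visible.iter().enumerate() {
        if v {
            // On the GPU this slot comes from atomicAdd(&args.instanceCount, 1u).
            culled[instance_count as usize] = i as u32;
            instance_count += 1;
        }
    }
    (culled, instance_count)
}
```

One difference worth noting: on the GPU the order of entries within `culled` is nondeterministic, because invocations race on the atomic counter. That's fine here, since the render pass only cares that the first `instanceCount` entries hold exactly the visible indices.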

Next, we'll see how this is used within the render pass and the shader.
3) With the calculations handled by the compute shader, the vertex shader simply needs to fetch the dot's position and alpha value from the same storage buffers used in the previous compute pass:
```wgsl
...
struct DotInput {
    position: vec3<f32>,
    color_alpha: f32,
}

@group(2) @binding(0)
var<storage, read> dots: array<DotInput>;

@group(2) @binding(1)
var<storage, read> culled: array<u32>;

@vertex
fn vs_main(
    @builtin(instance_index) instance_index: u32,
    ...
) -> VertexOutput {
    ...
    let dot_index = culled[instance_index];
    let dot_input = dots[dot_index];
    let dot_position = dot_input.position;
    let dot_alpha = dot_input.color_alpha;
    ...
}
```
Here, the `instance_index` range will be `[0, N)`, where `N` is the value the compute shader wrote to `args.instanceCount`. We fetch the actual index of the dot via `culled[instance_index]` and then use it to get the dot's data from `dots[dot_index]`.
As a result, on my Pixel 5 I haven't noticed any issues with up to 1,800,000 dots, which is quite impressive.

You might wonder why this is better than calculating everything in the vertex shader. Since the vertex shader runs for every vertex (or index), moving the per-dot work to a compute shader can offer up to a `4x` performance boost (assuming 4 vertices or 6 indices per dot). However, the downside of using compute shaders with indirect drawing is the increased architectural complexity.
### Wrap up
It’s time to finish the journey into dotted-line rendering. Over these three articles, we’ve moved from a simple, straightforward approach to advanced techniques using R-Trees, compute shaders, and indirect draw calls. We also optimized the mesh structure and utilized the `instancing` feature.
The goal was to demonstrate various approaches rather than declare a single 'best' one. Each technique offers different trade-offs, with its own pros and cons. The right choice depends entirely on your specific project requirements.
Happy coding!