# Poseidon Prove: SIMD → WGPU Optimization for WASM
## Background
* **Original**: `prove()` executed entirely on the CPU using SIMD.
* **Goal**: Offload the heavy `extend_trace` and `evaluate_constraint_quotients_on_domain` steps to the GPU using `wgpu` when running in WASM.
## Original Flow
1. Run `prove()` entirely on the CPU with SIMD.
## Optimised Flow (WASM + wgpu)
0. **Spawn worker thread**
- Spawn a Web Worker dedicated to GPU computations.
- The worker initializes wgpu and remains idle until work arrives.
- Primary role: execute GPU kernels and return results.
1. **Run SIMD `prove()` up to `extend_trace()`**
- Inside the `WebBackend` implementation of `evaluate_constraint_quotients_on_domain()`, a `WebDomainEvaluator` is constructed and passed to `evaluate()`:
```rust
// crates/constraint_framework/src/component.rs 490
let eval = WebDomainEvaluator::new(
...
);
self.eval.evaluate(eval);
```
- The call reaches `eval_poseidon_constraints_web()`, which passes the original trace to the worker thread and blocks until the result comes back:
```rust
// crates/examples/src/poseidon/mod.rs 223
let web_input = build_gpu_input(web, lookup_elements);
// Send data to worker thread
gpu_channels::with(|c| c.tx.send(web_input).unwrap());
// Wait for worker thread using Atomics.wait()
let output = gpu_channels::with(|c| c.rx.recv().unwrap());
```
2. **Worker thread**
- The worker copies the input data from the CPU to the GPU, runs the shaders, waits for completion, and reads the results back to the CPU.
```rust
// crates/examples/src/poseidon/web/runner.rs
// Wait for the main thread to send the original trace and related inputs
let input_data = request_rx.recv_async().await.unwrap();
// Perform wgpu computation here
let output_data = compute_composition_polynomial_wgpu(input_data, &gpu).await;
// Send the result back to the main thread
response_tx.send(output_data).unwrap();
```
3. **Main thread resumes**
- Receives the result and continues with the remainder of `prove()`.
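
The round trip in steps 0 to 3 can be sketched natively with `std::thread` and `std::sync::mpsc`. This is only an analogy under stated assumptions: OS threads stand in for the two Web Workers, the blocking `recv()` stands in for the `Atomics.wait()`-based wait, and `GpuInput`, `GpuOutput`, `compute_placeholder`, and `round_trip` are hypothetical names that do not exist in the codebase. The placeholder doubles each value instead of running wgpu shaders.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the data sent to the worker (the original trace).
struct GpuInput {
    trace: Vec<u32>,
}

/// Hypothetical stand-in for the evaluated constraint quotients.
struct GpuOutput {
    quotients: Vec<u32>,
}

/// Placeholder for the wgpu computation: doubles every value (wrapping on overflow).
fn compute_placeholder(input: GpuInput) -> GpuOutput {
    GpuOutput {
        quotients: input.trace.iter().map(|x| x.wrapping_mul(2)).collect(),
    }
}

/// Spawns the worker, sends one request, and blocks until the response arrives.
fn round_trip(trace: Vec<u32>) -> Vec<u32> {
    let (request_tx, request_rx) = mpsc::channel::<GpuInput>();
    let (response_tx, response_rx) = mpsc::channel::<GpuOutput>();

    // Step 0: the worker starts first and idles until work arrives.
    let worker = thread::spawn(move || {
        // recv() returns Err once every sender is dropped, which ends the loop.
        while let Ok(input) = request_rx.recv() {
            // Step 2: run the computation and send the result back.
            if response_tx.send(compute_placeholder(input)).is_err() {
                break;
            }
        }
    });

    // Step 1: hand the trace to the worker, then block on the response.
    request_tx.send(GpuInput { trace }).expect("worker hung up");
    let output = response_rx.recv().expect("worker dropped the response");

    // Step 3: the caller resumes; dropping the sender lets the worker exit.
    drop(request_tx);
    worker.join().expect("worker panicked");
    output.quotients
}
```

Calling `round_trip(vec![1, 2, 3])` returns `[2, 4, 6]`. The caller stays synchronous throughout, which is the property that lets `prove()` keep its existing signature.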
## Call‑Flow Diagram
```mermaid
flowchart TD
subgraph "Main Thread"
Z[simd prove] --> A
A[prove] --> B[send original trace]
B --> C[sync js Atomics.wait]
C --> D[Resume]
D --> E[simd prove continues]
end
subgraph "Worker thread"
H{{original trace}} --> I["wgpu compute:<br/>extend_trace and<br/>evaluate_constraint_quotients_on_domain"]
I --> J{{"evaluated constraint<br/>quotient"}}
end
B -->|channel| H
J -->|channel| D
```
## Notes
- Building requires the rustc flag `-C target-feature=+atomics,+bulk-memory,+mutable-globals`.
- WASM is fundamentally single-threaded, but modern browsers can provide a multi-threaded environment by running WASM inside Web Workers.
- Using the browser's `SharedArrayBuffer` as shared memory reduces the copying overhead between Web Workers.
- Architecture considerations (why async?)
- Synchronizing data between the CPU and GPU requires asynchronous operations.
- Making `prove()` and its related functions async would simplify the code somewhat. To minimize the impact on existing code, we instead keep `prove()` synchronous and block on the worker's response, which is a slightly hacky approach.
- Don't block the browser's main thread
- The "main thread" mentioned above must NOT be the browser's main UI thread: blocking it would freeze the UI, and browsers do not allow `Atomics.wait()` there.
- Instead of running on the browser's main thread directly, we spawn two separate Web Workers: one acts as the "main thread" and the other as the "worker thread".
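
The target features from the first note are commonly set through Cargo configuration rather than on the command line. A minimal sketch, assuming the `wasm32-unknown-unknown` target and a nightly toolchain; the file location (`.cargo/config.toml`) and the `build-std` entry are assumptions not stated above, added because `std` itself generally has to be rebuilt with atomics enabled:

```toml
# .cargo/config.toml (assumed location)
[target.wasm32-unknown-unknown]
# Same flags as in the note above
rustflags = ["-C", "target-feature=+atomics,+bulk-memory,+mutable-globals"]

[unstable]
# Rebuild the standard library with the flags above (nightly only)
build-std = ["panic_abort", "std"]
```

Note that browsers only expose `SharedArrayBuffer` on cross-origin isolated pages, so the page serving the WASM typically needs the `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp` response headers.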