# Poseidon Prove: SIMD → WGPU Optimization for WASM

## Background

* Original: `prove()` executed entirely on the CPU using SIMD.
* **Goal**: Offload the heavy `extend_trace` and `evaluate_constraint_quotients_on_domain` steps to the GPU with `wgpu` in WASM.

## Original Flow

1. SIMD-only `prove()`.

## Optimised Flow (WASM + wgpu)

0. **Spawn worker thread**
   - Spawn a Web Worker to handle GPU computations in parallel.
   - The worker initializes wgpu and remains idle until work arrives.
   - Primary role: execute GPU kernels and return results.

1. **Execute SIMD `prove()` until `extend_trace()`**
   - Inside the `WebBackend` implementation of `evaluate_constraint_quotients_on_domain()`, `eval()` of `WebDomainEvaluator` is called:

     ```rust
     // crates/constraint_framework/src/component.rs 490
     let eval = WebDomainEvaluator::new(
         ...
     );
     self.eval.evaluate(eval);
     ```

   - Execution then reaches `eval_poseidon_constraints_web()`, which passes the original trace to the worker thread:

     ```rust
     // crates/examples/src/poseidon/mod.rs 223
     let web_input = build_gpu_input(web, lookup_elements);

     // Send data to the worker thread
     gpu_channels::with(|c| c.tx.send(web_input).unwrap());

     // Wait for the worker thread using Atomics.wait()
     let output = gpu_channels::with(|c| c.rx.recv().unwrap());
     ```

2. **Worker thread**
   - The worker copies the CPU data to the GPU, executes the shaders, waits for completion, and then fetches the GPU results back to the CPU.

     ```rust
     // crates/examples/src/poseidon/web/runner.rs
     // Wait for the main thread to send the original trace and related inputs
     let input_data = request_rx.recv_async().await.unwrap();

     // Perform the wgpu computation here
     let output_data = compute_composition_polynomial_wgpu(input_data, &gpu).await;

     // Send the result back to the main thread
     response_tx.send(output_data).unwrap();
     ```

3. **Main thread resumes**
   - The main thread pulls the result and continues the remainder of `prove()`.
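The request/response hand-off in steps 1 to 3 can be sketched natively with two channels. This is only an illustration of the pattern, not the project's code: it uses `std::thread` and `std::sync::mpsc` in place of Web Workers and the `gpu_channels` helper, and `GpuInput`, `GpuOutput`, `fake_gpu_compute`, and `round_trip` are hypothetical names. The "GPU" step simply doubles each value on the CPU.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the data sent to the GPU worker.
pub struct GpuInput {
    pub trace: Vec<u32>,
}

/// Hypothetical stand-in for the data the worker sends back.
pub struct GpuOutput {
    pub quotients: Vec<u32>,
}

/// Placeholder for the GPU kernel: doubles every value on the CPU.
fn fake_gpu_compute(input: GpuInput) -> GpuOutput {
    GpuOutput {
        quotients: input.trace.iter().map(|v| v.wrapping_mul(2)).collect(),
    }
}

/// Sends one request to a worker thread and blocks until the reply arrives.
pub fn round_trip(trace: Vec<u32>) -> Vec<u32> {
    let (request_tx, request_rx) = mpsc::channel::<GpuInput>();
    let (response_tx, response_rx) = mpsc::channel::<GpuOutput>();

    // Worker: idle until a request arrives, then compute and reply.
    let worker = thread::spawn(move || {
        while let Ok(input) = request_rx.recv() {
            let output = fake_gpu_compute(input);
            if response_tx.send(output).is_err() {
                break; // the requesting side hung up
            }
        }
    });

    // "Main thread": send the trace, then block on the response.
    request_tx.send(GpuInput { trace }).expect("worker is alive");
    let output = response_rx.recv().expect("worker replied");

    // Dropping the sender closes the request channel so the worker loop exits.
    drop(request_tx);
    worker.join().expect("worker did not panic");

    output.quotients
}
```

Calling `round_trip(vec![1, 2, 3])` returns the doubled values once the worker has replied. In the WASM build the blocking `recv()` is where `Atomics.wait()` comes in, and the worker side awaits the request asynchronously instead of blocking.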
## Call‑Flow Diagram

```mermaid
flowchart TD
    subgraph "Main Thread"
        Z[simd prove] --> A
        A[prove] --> B[send original trace]
        B --> C[sync js Atomics.wait]
        C --> D[Resume]
        D --> E[simd prove continues]
    end

    subgraph "Worker thread"
        H{{original trace}} --> I[wgpu compute: extend_trace & evaluate_constraint_quotients_on_domain]
        I --> J{{evaluated constraint quotient}}
    end

    B -->|channel| H
    J -->|channel| D
```

## Notes

- Requires `target-feature=+atomics,+bulk-memory,+mutable-globals`
  - While WASM is fundamentally single-threaded, modern browsers can provide a multi-threaded-like environment using Web Workers.
  - Furthermore, by using the browser's `SharedArrayBuffer` as shared memory, we can reduce memory-copying overhead between Web Workers.
- Architecture considerations (why async?)
  - Synchronizing data between the CPU and the GPU requires asynchronous operations.
  - Making `prove()` and its related functions async would simplify the code somewhat, but to minimize the impact on existing code we opted for a slightly hacky approach: `prove()` stays synchronous and blocks while the worker runs the async GPU work.
- Don't block the browser's main thread
  - The "main thread" mentioned above must NOT be the browser's main UI thread: blocking there would freeze the UI, and `Atomics.wait()` is not permitted on it.
  - Instead of proving on the browser's main thread directly, we spawn two separate Web Workers: one acts as the "main thread" and the other as the "worker thread".
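The two-worker arrangement from the last note can be sketched natively with three threads. This is a minimal illustration under stated assumptions, not the project's implementation: `std::thread` stands in for Web Workers, `fake_gpu_step`, `prover`, and `run_from_ui` are hypothetical names, and the "GPU" step just squares small values on the CPU. The point is that only the prover thread blocks, while the UI side polls with `try_recv()`.

```rust
use std::sync::mpsc::{self, TryRecvError};
use std::thread;
use std::time::Duration;

/// Placeholder for the GPU step: squares each (small) value on the CPU.
fn fake_gpu_step(values: Vec<u64>) -> Vec<u64> {
    values.into_iter().map(|v| v * v).collect()
}

/// Placeholder for the prover: blocks on the GPU worker, then sums its output.
fn prover(values: Vec<u64>) -> u64 {
    let (req_tx, req_rx) = mpsc::channel::<Vec<u64>>();
    let (resp_tx, resp_rx) = mpsc::channel::<Vec<u64>>();

    // Second worker: waits for one request and replies with the result.
    let gpu_worker = thread::spawn(move || {
        if let Ok(input) = req_rx.recv() {
            let _ = resp_tx.send(fake_gpu_step(input));
        }
    });

    req_tx.send(values).expect("GPU worker is alive");
    // Blocking here is acceptable: this is not the UI thread.
    let evaluated = resp_rx.recv().expect("GPU worker replied");
    gpu_worker.join().expect("GPU worker did not panic");
    evaluated.iter().sum()
}

/// "UI thread": starts the prover elsewhere and polls without blocking.
/// Returns the result and how many polls found nothing ready.
pub fn run_from_ui(values: Vec<u64>) -> (u64, usize) {
    let (done_tx, done_rx) = mpsc::channel::<u64>();
    let prover_thread = thread::spawn(move || {
        let _ = done_tx.send(prover(values));
    });

    let mut idle_polls = 0;
    let result = loop {
        match done_rx.try_recv() {
            Ok(value) => break value,
            Err(TryRecvError::Empty) => {
                // A real UI would handle events or render a frame here.
                idle_polls += 1;
                thread::sleep(Duration::from_millis(1));
            }
            Err(TryRecvError::Disconnected) => panic!("prover exited without a result"),
        }
    };
    prover_thread.join().expect("prover did not panic");
    (result, idle_polls)
}
```

For example, `run_from_ui(vec![1, 2, 3])` yields a sum of 14 together with a poll count that varies from run to run. In a browser the UI thread would typically receive the result through a message event rather than a polling loop; the loop here only makes the non-blocking behaviour visible.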