owned this note
owned this note
Published
Linked with GitHub
# Stwo WebGPU
## Summary
- Using the Stwo Poseidon AIR as a benchmark, we profiled the proving execution time and implemented the most expensive operations in WebGPU: trace generation, trace interpolation, trace extension, and evaluating the composition polynomial.
- Even after accounting for various overheads such as CPU-to-GPU memory copying, these operations using WebGPU are roughly 5 times faster than Simd implementation on wasm build.
- Since these operations take up approximately 65% of the total proving execution time, even with the current partial implementation, the overall execution is 2 times faster.
- As is the case with other Stwo GPU implementations, offloading the entire proof process to the GPU appears to be the most efficient approach.
- However, due to the memory limitations of WebGPU (constrained both by the browser and the mobile device), other measures such as dynamic proving (switching between CPU and WebGPU based on device limits) and recursive proving may also need to be implemented.
- Among the major tasks, the remaining component that hasn't been ported to WebGPU are the commitment and the FRI code.
- After that, the next step would be to migrate as much of the proving logic as possible to WebGPU.
# Stwo Profiling
```mermaid
flowchart LR
L1["Major Impact<br/>>15%"] --- L2["Significant Impact<br/>5-15%"] --- L3["Moderate Impact<br/>1-5%"] --- L4["Minor Impact<br/><1%"]
classDef red fill:#ff0000,color:white
classDef orange fill:#ffa500,color:black
classDef gray fill:#808080, color:white;
classDef white fill:#ffffff, color:black;
class L1 red
class L2 orange
class L3 gray
class L4 white
```
## Poseidon n=15 (parallel)
```mermaid
flowchart TD
A["Total Poseidon Proof<br/>13.78s | 100%"] --> B["Precompute twiddles<br/>9.01ms | 0.07%"]
A --> C["Constant<br/>1.04ms | 0.01%"]
A --> D["Trace<br/>5.19s | 37.66%"]
A --> E["Interaction<br/>331ms | 2.40%"]
A --> F["Prove<br/>8.24s | 59.80%"]
F --> G["Composition<br/>6.80s | 49.35%"]
F --> H["Evaluate columns out of domain<br/>997ms | 7.24%"]
F --> I["Compute FRI quotients<br/>261ms | 1.89%"]
F --> J["Commit<br/>105ms | 0.76%"]
F --> K["Grind<br/>13.3ms | 0.10%"]
G --> L["Generation<br/>6.73s | 48.84%"]
G --> M["Commitment<br/>68.7ms | 0.50%"]
L --> N["Extension<br/>5.66s | 41.07%"]
L --> O["Constraint point-wise eval<br/>1.04s | 7.55%"]
L --> P["Constraints interpolation<br/>22.8ms | 0.17%"]
D --> Q["Generation<br/>865ms | 6.28%"]
D --> R["Interpolation for commitment<br/>1.51s | 10.96%"]
D --> S["Commitment<br/>2.82s | 20.46%"]
S --> T["Commitment<br/>2.82s | 20.46%"]
T --> U["Extension<br/>2.58s | 18.72%"]
T --> V["Merkle<br/>243ms | 1.76%"]
%% Color definitions based on impact:
%% Major Impact (>15%): red
%% Significant Impact (5-15%): orange
%% Moderate Impact (1-5%): gray
%% Minor Impact (<1%): white
classDef red fill:#ff0000, color:white;
classDef orange fill:#ffa500, color:black;
classDef gray fill:#808080, color:white;
classDef white fill:#ffffff, color:black;
class A,D,F,G,L,N,S,T,U red;
class H,O,Q,R orange;
class E,I,V gray;
class B,C,J,K,M,P white;
```
<details>
<summary>Details</summary>
| Depth | Operation | Time (busy) | % of Total | % of Parent |
|-------|-----------|-------------|------------|-------------|
| 0 | Total Poseidon Proof | 13.78s | 100% | 100% |
| 1 | ├── Precompute twiddles | 9.01ms | 0.07% | 0.07% |
| 1 | ├── Constant | 1.04ms | 0.01% | 0.01% |
| 1 | ├── Trace | 5.19s | 37.66% | 37.66% |
| 2 | │ ├── Generation | 865ms | 6.28% | 16.67% |
| 2 | │ ├── Interpolation for commitment | 1.51s | 10.96% | 29.09% |
| 2 | │ └── Commitment | 2.82s | 20.46% | 54.33% |
| 3 | │ └── Commitment | 2.82s | 20.46% | 100% |
| 4 | │ ├── Extension | 2.58s | 18.72% | 91.49% |
| 4 | │ └── Merkle | 243ms | 1.76% | 8.62% |
| 1 | ├── Interaction | 331ms | 2.40% | 2.40% |
| 1 | └── Prove | 8.24s | 59.80% | 59.80% |
| 2 | ├── Composition | 6.80s | 49.35% | 82.52% |
| 3 | │ ├── Generation | 6.73s | 48.84% | 98.97% |
| 4 | │ │ ├── Extension | 5.66s | 41.07% | 84.10% |
| 4 | │ │ ├── Constraint point-wise eval | 1.04s | 7.55% | 15.45% |
| 4 | │ │ └── Constraints interpolation | 22.8ms | 0.17% | 0.34% |
| 3 | │ └── Commitment | 68.7ms | 0.50% | 1.01% |
| 2 | ├── Evaluate columns out of domain | 997ms | 7.24% | 12.10% |
| 2 | ├── Compute FRI quotients | 261ms | 1.89% | 3.17% |
| 2 | ├── Commit | 105ms | 0.76% | 1.27% |
| 2 | └── Grind | 13.3ms | 0.10% | 0.16% |
</details>
# Benchmark Result (Extending trace + Evaluating constraint polynomials)
**Repo:** https://github.com/zksecurity/stwo/commits/gpu-extend-trace/
- Run on Macbook M3 Pro
## WASM (release build) execution time (ms)
| Poseidon instances | WebGPU | CPU Total | CPU Extend | CPU Evaluate |
| --- | --- | --- | --- | --- |
| 2^12 | 32 | 117 | 62 | 55 |
| 2^13 | 52 | 242 | 129 | 113 |
| 2^14 | 83 | 537 | 312 | 225 |
| 2^15 | 155 | 814 | 365 | 449 |
| 2^16 | 229 | 1647 | 751 | 896 |
| 2^17 | 528 | 3265 | 1470 | 1795 |
| 2^18 | 1100 | 6497 | 2952 | 3545 |
### Native build (parallel feature on + optimized) execution time (ms)
| Poseidon instances | WebGPU | CPU Ex+Comp | CPU Extend | CPU Compute |
| --- | --- | --- | --- | --- |
| 2^12 | 40 | 7 | 4 | 2 |
| 2^13 | 53 | 14 | 9 | 4 |
| 2^14 | 77 | 31 | 22 | 9 |
| 2^15 | 116 | 59 | 45 | 14 |
| 2^16 | 207 | 124 | 92 | 32 |
| 2^17 | 404 | 277 | 206 | 70 |
| 2^18 | 785 | 530 | 412 | 118 |
## Benchmarking Steps
<details>
<summary>Benchmarking Steps</summary>
### Running WASM
1. Change `LOG_N_INSTANCES` in `test_poseidon_prove_wasm_gpu_cpu`
2. Change `N_ROWS` in `compute_composition_polynomial.rs`
1. `LOG_N_INSTANCES=12 => N_ROWS=32`
2. `LOG_N_INSTANCES=13 => N_ROWS=64`
3. …
3. Change `N_ROWS` in `compute_composition_polynomial.wgsl`
1. Same value as `compute_composition_polynomial.rs`
4. Run `wasm-pack test --chrome --release`
### Running native CPU & GPU (parallel features on & release)
1. Change `LOG_N_INSTANCES` in `test_simd_poseidon_prove`
2. Change `N_ROWS` in `compute_composition_polynomial.rs`
1. `LOG_N_INSTANCES=12 => N_ROWS=32`
2. `LOG_N_INSTANCES=13 => N_ROWS=64`
3. …
3. Change `N_ROWS` in `compute_composition_polynomial.wgsl`
1. Same value as `compute_composition_polynomial.rs`
4. Run `cargo test --release --features parallel test_simd_poseidon_prove -- --nocapture`
</details>