Wasm Benchmarks

# Zlib-rs: Wasm optimization report

This report presents the benchmarks of the optimized zlib-rs.

## Summary

A quick summary:

- Inflate: ~ 25% faster than zlib-ng, ~ 50% faster than miniz_oxide
- Deflate: on average ~ 10% faster than zlib-ng (depending on compression level). Miniz_oxide not tested.

## Setup

Our baseline commit: https://github.com/memorysafety/zlib-rs/commit/d693febccb642fc3eb313632ffdf4a51a0e3cfb3

Measurements were made with commit: https://github.com/memorysafety/zlib-rs/commit/94c1727984dfda0bc6c95f0448633c0fa28c9dfb

Improvements:

- [wasm: allow unaligned reads in `longest_match`](https://github.com/memorysafety/zlib-rs/pull/202)
- [wasm: SIMD `adler32`](https://github.com/memorysafety/zlib-rs/pull/198)
- [wasm: SIMD `slide_hash`](https://github.com/memorysafety/zlib-rs/pull/199)
- [wasm: use wider loads/stores in `copy_match`](https://github.com/memorysafety/zlib-rs/pull/197)
- [wasm: SIMD `compare256`](https://github.com/memorysafety/zlib-rs/pull/179)

Other relevant PRs:

- [Run test-libz-rs-sys tests on wasm in CI](https://github.com/trifectatechfoundation/zlib-rs/pull/220)

## Inflate

### Benchmark methodology

Measurements were taken using https://github.com/folkertdev/wasm-zlib-benchmark-/blob/main/bench-inflate.sh. The benchmarked executables all embed the file to decompress to avoid IO overhead and run the decompression 10 times for each time they get invoked to reduce the effect of wasmtime startup on the benchmark results. Each executable is then invoked multiple times such that the total time it is benchmarked is at least 5s.
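The structure of such a benchmark executable can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual benchmark source: `INPUT`, `ITERATIONS`, `decompress`, and `run_benchmark` are hypothetical names, the payload is a tiny inline byte string rather than an embedded file (a real harness would likely embed it with `include_bytes!`), and `decompress` is a placeholder copy so that the example stays self-contained.

```rust
use std::hint::black_box;

/// Hypothetical stand-in for the embedded compressed file.
/// A real harness would likely use `include_bytes!("...")` here.
static INPUT: &[u8] = b"example payload standing in for compressed data";

/// Number of decompression passes per process invocation (10 in the report).
const ITERATIONS: usize = 10;

/// Placeholder "decompressor": it only copies the input.
/// Assumption: the real executables call into zlib-rs, zlib-ng or miniz_oxide.
fn decompress(input: &[u8], output: &mut Vec<u8>) -> usize {
    output.clear();
    output.extend_from_slice(input);
    output.len()
}

/// Runs `f` over the same in-memory input `ITERATIONS` times and returns the
/// total number of bytes produced. Repeating the work inside one process
/// amortizes the runtime's startup cost over several passes.
fn run_benchmark(f: impl Fn(&[u8], &mut Vec<u8>) -> usize, input: &[u8]) -> usize {
    let mut output = Vec::new();
    let mut total = 0;
    for _ in 0..ITERATIONS {
        // `black_box` discourages the optimizer from eliding repeated work.
        total += f(black_box(input), &mut output);
        let _ = black_box(&output);
    }
    total
}

fn main() {
    let total = run_benchmark(decompress, INPUT);
    println!("produced {total} bytes over {ITERATIONS} iterations");
}
```

Keeping the input in the binary means the measured time contains no file IO, and the outer benchmarking tool can then invoke the process repeatedly until the time budget is reached.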
### Benchmark results

**baseline**

```
Benchmark 1 (10 runs): wasmtime run --allow-precompiled baseline-d693fe.cwasm inflate zlib-rs
  measurement          mean ± σ            min … max           outliers
  wall_time           539ms ± 10.6ms     532ms …  568ms          1 (10%)
  peak_rss           43.2MB ± 2.19MB    40.2MB … 46.5MB          0 ( 0%)
  cpu_cycles         2.21G  ± 45.2M     2.19G  … 2.34G           1 (10%)
  instructions       6.97G  ± 2.52K     6.97G  … 6.97G           0 ( 0%)
  cache_references   29.7M  ± 2.14M     28.1M  … 35.5M           1 (10%)
  cache_misses        660K  ± 57.5K      573K  …  775K           0 ( 0%)
  branch_misses      15.1M  ± 20.9K     15.0M  … 15.1M           2 (20%)
```

**performance versus zlib-ng**

```
Benchmark 1 (13 runs): wasmtime run --allow-precompiled wasm-zlib-benchmark.cwasm inflate zlib-ng
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           389ms ± 2.02ms     386ms …  392ms          0 ( 0%)        0%
  peak_rss           43.3MB ± 1.62MB    40.4MB … 46.7MB          0 ( 0%)        0%
  cpu_cycles         1.63G  ± 4.68M     1.62G  … 1.64G           0 ( 0%)        0%
  instructions       4.68G  ± 2.57K     4.68G  … 4.68G           0 ( 0%)        0%
  cache_references   25.1M  ± 51.4K     25.1M  … 25.2M           0 ( 0%)        0%
  cache_misses        536K  ± 6.11K      526K  …  545K           0 ( 0%)        0%
  branch_misses      13.0M  ± 18.9K     13.0M  … 13.1M           1 ( 8%)        0%
Benchmark 2 (18 runs): wasmtime run --allow-precompiled wasm-zlib-benchmark.cwasm inflate zlib-rs
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           291ms ± 1.17ms     288ms …  292ms          0 ( 0%)        ⚡- 25.3% ±  0.3%
  peak_rss           43.5MB ± 1.28MB    40.3MB … 44.7MB          0 ( 0%)          +  0.3% ±  2.5%
  cpu_cycles         1.20G  ± 2.62M     1.20G  … 1.21G           0 ( 0%)        ⚡- 26.2% ±  0.2%
  instructions       3.78G  ± 1.84K     3.78G  … 3.78G           0 ( 0%)        ⚡- 19.2% ±  0.0%
  cache_references   26.1M  ± 84.7K     25.9M  … 26.3M           0 ( 0%)        💩+  3.8% ±  0.2%
  cache_misses        536K  ± 9.32K      523K  …  557K           0 ( 0%)          -  0.1% ±  1.1%
  branch_misses      10.2M  ± 2.82K     10.2M  … 10.2M           1 ( 6%)        ⚡- 21.6% ±  0.1%
```

**performance versus miniz_oxide**

```
Benchmark 1 (8 runs): wasmtime run --allow-precompiled wasm-zlib-benchmark.cwasm inflate miniz_oxide
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           655ms ± 3.36ms     650ms …  660ms          0 ( 0%)        0%
  peak_rss           43.0MB ± 1.72MB    40.7MB … 46.4MB          0 ( 0%)        0%
  cpu_cycles         2.77G  ± 8.54M     2.76G  … 2.79G           0 ( 0%)        0%
  instructions       6.62G  ± 3.19K     6.62G  … 6.62G           0 ( 0%)        0%
  cache_references   25.6M  ± 113K      25.5M  … 25.8M           0 ( 0%)        0%
  cache_misses        453K  ± 14.4K      430K  …  470K           0 ( 0%)        0%
  branch_misses      34.4M  ± 16.5K     34.4M  … 34.4M           0 ( 0%)        0%
Benchmark 2 (17 runs): wasmtime run --allow-precompiled wasm-zlib-benchmark.cwasm inflate zlib-rs
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           295ms ± 7.00ms     290ms …  319ms          1 ( 6%)        ⚡- 54.9% ±  0.8%
  peak_rss           43.1MB ± 1.65MB    40.2MB … 46.7MB          0 ( 0%)          +  0.2% ±  3.5%
  cpu_cycles         1.22G  ± 27.0M     1.20G  … 1.31G           1 ( 6%)        ⚡- 56.1% ±  0.7%
  instructions       3.78G  ± 3.04K     3.78G  … 3.78G           0 ( 0%)        ⚡- 42.9% ±  0.0%
  cache_references   26.8M  ± 1.76M     26.0M  … 33.2M           3 (18%)          +  4.7% ±  5.1%
  cache_misses        540K  ± 10.3K      522K  …  565K           0 ( 0%)        💩+ 19.2% ±  2.3%
  branch_misses      10.2M  ± 7.59K     10.2M  … 10.2M           1 ( 6%)        ⚡- 70.3% ±  0.0%
```

**performance versus baseline**

```
Benchmark 1 (10 runs): wasmtime run --allow-precompiled baseline-d693fe.cwasm inflate zlib-rs
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           498ms ± 2.64ms     494ms …  502ms          0 ( 0%)        0%
  peak_rss           43.7MB ± 1.15MB    42.2MB … 44.8MB          0 ( 0%)        0%
  cpu_cycles         2.09G  ± 8.63M     2.08G  … 2.11G           0 ( 0%)        0%
  instructions       6.97G  ± 2.40K     6.97G  … 6.97G           0 ( 0%)        0%
  cache_references   26.7M  ± 586K      26.2M  … 27.7M           0 ( 0%)        0%
  cache_misses        531K  ± 12.4K      515K  …  553K           0 ( 0%)        0%
  branch_misses      15.0M  ± 5.68K     15.0M  … 15.0M           0 ( 0%)        0%
Benchmark 2 (18 runs): wasmtime run --allow-precompiled wasm-zlib-benchmark.cwasm inflate zlib-rs
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           293ms ± 3.50ms     289ms …  304ms          1 ( 6%)        ⚡- 41.1% ±  0.5%
  peak_rss           43.2MB ± 1.17MB    40.5MB … 44.7MB          0 ( 0%)          -  1.2% ±  2.2%
  cpu_cycles         1.21G  ± 12.9M     1.20G  … 1.25G           1 ( 6%)        ⚡- 42.0% ±  0.4%
  instructions       3.78G  ± 2.11K     3.78G  … 3.78G           0 ( 0%)        ⚡- 45.7% ±  0.0%
  cache_references   26.6M  ± 733K      26.0M  … 29.0M           1 ( 6%)          -  0.3% ±  2.1%
  cache_misses        533K  ± 11.7K      509K  …  563K           0 ( 0%)          +  0.4% ±  1.8%
  branch_misses      10.2M  ± 3.52K     10.2M  … 10.2M           0 ( 0%)        ⚡- 32.0% ±  0.0%
```

**performance versus native**

```
Benchmark 1 (20 runs): ./native-zlib-benchmark-sse42 inflate zlib-rs
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           257ms ± 9.96ms     250ms …  292ms          2 (10%)        0%
  peak_rss           25.6MB ± 56.1KB    25.5MB … 25.7MB          1 ( 5%)        0%
  cpu_cycles         1.07G  ± 38.8M     1.04G  … 1.21G           2 (10%)        0%
  instructions       3.15G  ±  247      3.15G  … 3.15G           3 (15%)        0%
  cache_references   26.6M  ± 2.55M     25.2M  … 35.9M           4 (20%)        0%
  cache_misses        383K  ± 9.01K      363K  …  400K           1 ( 5%)        0%
  branch_misses      9.26M  ± 14.6K     9.24M  … 9.30M           0 ( 0%)        0%
Benchmark 2 (23 runs): ./native-zlib-benchmark-avx2 inflate zlib-rs
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           218ms ± 1.07ms     216ms …  220ms          0 ( 0%)        ⚡- 15.0% ±  1.6%
  peak_rss           25.5MB ± 57.9KB    25.4MB … 25.6MB          0 ( 0%)          -  0.4% ±  0.1%
  cpu_cycles          914M  ± 3.56M      907M  …  920M           0 ( 0%)        ⚡- 14.3% ±  1.5%
  instructions       2.66G  ±  366      2.66G  … 2.66G           0 ( 0%)        ⚡- 15.4% ±  0.0%
  cache_references   25.3M  ± 204K      25.0M  … 25.9M           2 ( 9%)          -  4.6% ±  4.1%
  cache_misses        361K  ± 7.64K      341K  …  371K           0 ( 0%)        ⚡-  5.9% ±  1.3%
  branch_misses      9.46M  ± 61.7K     9.34M  … 9.55M           0 ( 0%)        💩+  2.2% ±  0.3%
Benchmark 3 (18 runs): wasmtime run --allow-precompiled wasm-zlib-benchmark.cwasm inflate zlib-rs
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           294ms ± 3.62ms     289ms …  300ms          0 ( 0%)        💩+ 14.3% ±  2.0%
  peak_rss           43.9MB ± 1.60MB    40.5MB … 46.9MB          0 ( 0%)        💩+ 71.5% ±  2.9%
  cpu_cycles         1.21G  ± 9.17M     1.20G  … 1.23G           1 ( 6%)        💩+ 13.6% ±  1.8%
  instructions       3.78G  ± 3.45K     3.78G  … 3.78G           0 ( 0%)        💩+ 19.9% ±  0.0%
  cache_references   26.4M  ± 533K      26.0M  … 28.1M           1 ( 6%)          -  0.6% ±  4.7%
  cache_misses        537K  ± 14.4K      519K  …  575K           2 (11%)        💩+ 40.1% ±  2.1%
  branch_misses      10.2M  ± 5.27K     10.2M  … 10.2M           0 ( 0%)        💩+ 10.4% ±  0.1%
```

### File size for decompression only

All results are in release mode with fat LTO and at least strip=debuginfo. The ones ending with `-opt-z` use opt-level=z and the ones ending with `-opt-z-no-symbols` use opt-level=z and strip=symbols.
```
-rwxr-xr-x 1 bjorn bjorn 60K Oct  8 11:17 miniz_oxide-inflate-lto-opt-z-no-symbols.wasm
-rwxr-xr-x 1 bjorn bjorn 69K Oct  8 11:12 miniz_oxide-inflate-lto-opt-z.wasm
-rwxr-xr-x 1 bjorn bjorn 70K Oct  8 11:12 miniz_oxide-inflate-lto.wasm
-rwxr-xr-x 1 bjorn bjorn 74K Oct  8 11:16 zlib-ng-inflate-lto-opt-z-no-symbols.wasm
-rwxr-xr-x 1 bjorn bjorn 83K Oct  8 11:14 zlib-ng-inflate-lto-opt-z.wasm
-rwxr-xr-x 1 bjorn bjorn 96K Oct  8 11:14 zlib-ng-inflate-lto.wasm
-rwxr-xr-x 1 bjorn bjorn 81K Oct  8 11:17 zlib-rs-inflate-lto-opt-z-no-symbols.wasm
-rwxr-xr-x 1 bjorn bjorn 93K Oct  8 11:13 zlib-rs-inflate-lto-opt-z.wasm
-rwxr-xr-x 1 bjorn bjorn 96K Oct  8 11:13 zlib-rs-inflate-lto.wasm
```

And the same files gzipped with `gzip -1`, to simulate the actual amount of data transferred when downloading from a website that uses gzip in its fast compression mode:

```
-rw-r--r-- 1 bjorn bjorn 29K Oct  8 11:23 miniz_oxide-inflate-lto-opt-z-no-symbols.wasm.gz
-rw-r--r-- 1 bjorn bjorn 33K Oct  8 11:23 miniz_oxide-inflate-lto-opt-z.wasm.gz
-rw-r--r-- 1 bjorn bjorn 33K Oct  8 11:23 miniz_oxide-inflate-lto.wasm.gz
-rw-r--r-- 1 bjorn bjorn 39K Oct  8 11:23 zlib-ng-inflate-lto-opt-z-no-symbols.wasm.gz
-rw-r--r-- 1 bjorn bjorn 43K Oct  8 11:23 zlib-ng-inflate-lto-opt-z.wasm.gz
-rw-r--r-- 1 bjorn bjorn 46K Oct  8 11:23 zlib-ng-inflate-lto.wasm.gz
-rw-r--r-- 1 bjorn bjorn 44K Oct  8 11:23 zlib-rs-inflate-lto-opt-z-no-symbols.wasm.gz
-rw-r--r-- 1 bjorn bjorn 49K Oct  8 11:23 zlib-rs-inflate-lto-opt-z.wasm.gz
-rw-r--r-- 1 bjorn bjorn 50K Oct  8 11:23 zlib-rs-inflate-lto.wasm.gz
```

## Chunked inflate

Chunked inflate (where the input arrives in chunks, simulating a streaming source) exercises different parts of the code base. Also, at smaller chunk sizes, SIMD is less effective.

In the previous section we've seen that zlib-rs is already much faster than miniz_oxide, so here we compare only against zlib-ng. Zlib-ng is optimized for performance, but does not use wasm SIMD instructions.
Zlib-rs now does use these instructions, and they appear very effective.

![chart (2)](https://hackmd.io/_uploads/H1Z36Wrkke.png)

| chunk size (log2) | runtime zlib-ng (ms) | runtime zlib-rs (ms) | speedup (%) |
|--|--|--|--|
| 4 | 116 | 105 | ` 9.48` |
| 5 | 98.5 | 83.2 | `15.53` |
| 6 | 82.6 | 71.7 | `13.19` |
| 7 | 74.3 | 64.4 | `13.32` |
| 8 | 69.4 | 63.1 | ` 9.07` |
| 9 | 67.2 | 54.3 | `19.19` |
| 10 | 65.5 | 49.5 | `24.42` |
| 11 | 64.7 | 47.2 | `27.04` |
| 12 | 62.1 | 45.7 | `26.40` |
| 13 | 58.2 | 44 | `24.39` |
| 14 | 53.1 | 40.8 | `23.16` |
| 15 | 49.9 | 38.9 | `22.04` |
| 16 | 47.2 | 36.8 | `22.03` |
| 17 | 48.6 | 37.4 | `23.04` |
| 18 | 45.7 | 36.4 | `20.35` |
| 19 | 45.9 | 35.9 | `21.78` |
| 20 | 44.9 | 36 | `19.82` |
| 21 | 44.6 | 35.7 | `19.95` |
| 22 | 44.6 | 36.5 | `18.16` |
| 23 | 46.0 | 36.4 | `20.86` |
| 24 | 44.5 | 36.8 | `17.30` |

## Deflate

Overall we get a nice boost from SIMD, but for the medium algorithm (used for levels 3, 4, 5 and 6) it appears that zlib-rs isn't as efficient as zlib-ng today: at those levels the result ranges from 4.0% slower to 2.25% faster. The speedup column is the reduction in runtime relative to zlib-ng, so negative values mean zlib-rs is slower.

| compression level | runtime zlib-ng (ms) | runtime zlib-rs (ms) | speedup (%) |
|--|--|--|--|
| 0 | 31.6 | 27 | `14.55` |
| 1 | 146.6 | 124.2 | `15.27` |
| 2 | 216 | 208 | ` 3.70` |
| 3 | 250 | 260 | `-4.0` |
| 4 | 280 | 280 | ` 0.0` |
| 5 | 304 | 308 | `-1.31` |
| 6 | 354 | 346 | ` 2.25` |
| 7 | 470 | 414 | `11.91` |
| 8 | 584 | 510 | `12.67` |
| 9 | 678 | 574 | `15.33` |

## Notes

### crc32

It turns out that the wasm SIMD instruction set lacks the instructions needed to speed up crc32. On both x86_64 and aarch64, the fast implementations rely on fairly specific instructions. While it is possible to use wider loads, we believe the potential performance gain is not worth the time investment.
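
For context, the checksum in question can be sketched in its simplest scalar form. The following is a minimal bitwise CRC-32 using the reflected polynomial `0xEDB88320` (the variant used by zlib and gzip). It is an illustration only and not the zlib-rs implementation; the function name `crc32_bitwise` is hypothetical. Practical implementations typically replace the inner loop with lookup tables or with the dedicated CPU instructions mentioned above.

```rust
/// Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
/// Illustrative sketch only; it processes the input one bit at a time
/// and is far slower than table-driven or hardware-assisted versions.
fn crc32_bitwise(data: &[u8]) -> u32 {
    // The CRC register starts with all bits set.
    let mut crc = !0u32;
    for &byte in data {
        // Fold the next input byte into the low 8 bits of the register.
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // `mask` is all ones when the low bit is set, otherwise zero,
            // so the polynomial is XORed in only when a 1 bit is shifted out.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    // The final value is the bitwise complement of the register.
    !crc
}

fn main() {
    // "123456789" is the conventional check input for CRC algorithms.
    println!("{:08x}", crc32_bitwise(b"123456789")); // prints "cbf43926"
}
```

The serial dependency of each step on the previous register value is what makes this computation hard to speed up with generic SIMD operations alone.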