This document contains some quick notes comparing calculating

\sqrt{n}

using the latest gnark ( “normal” Tonelli-Shanks) versus using precomputed tables for solving

b^{2} \equiv n^{Q} \mod p

The optimized version described in this document has the twist of using gnark for doing multiplications, so both approaches can be fairly compared. (gnark has some advanced multiplication optimizations with and without assembly). So, it isn’t that code but a slightly modified version with some plumbing.

I’ll call Before to the latest-gnark approach and After to the optimized approach.

This optimization is relevant for:

Clients running the chain since if they save points in compressed form, they'll have to uncompress them every time you load tree nodes (with potentially many commitments) from disk.
Proof verification since the points are compressed.

TL;DR

Even though we could count how many multiplications are done in each case by looking at the algorithms on paper, that can be complicated or error-prone.

I added some extra code to count every time an fp.Mul(a, b) is called while calculating the square root of 100k random finite field elements (both approaches use the same gnark “backend” for FF arithmetic, so both use “the hook” for counting).

Total finite-field multiplications per

$\sqrt{n}$

Before: 605 multiplications (average).
After: 303 multiplications (exact).

Note that Before mentions average. This is the case since Tonelli-Shanks number of internal iterations depends on the field element. The After algorithm has a deterministic number of steps.

In summary, this should map to a 2x speedup in the number of multiplications.

Detailed comparison

This is digging a bit in detail inside Tonelli-Shanks in both approaches.

If you think of Tonelli-Shanks, you have to do two main steps:

Calculating the first “square root candidate” (i.e: calculating
$n^{\frac{Q + 1}{2}}$ )
1. Before: 264
2. After: 265
Calculate
$b^{2} \equiv n^{Q} \mod p$
1. Before: ~341 (~56% of total)
  - This count is an average since the number of iterations isn't deterministic.
2. After: 34 (~11% of total)
  - Deterministic.
  - 24 from computing powers
    $n^{Q * 2^{24}}$ ,
    $n^{Q * 2^{16}}$ and
    $n^{Q * 2^{8}}$
  - 6 resolving the dlogs for 8-bit window.
  - 4 to reconstruct
    $b^{2}$ from 8-bit window exponents.

Reg Step 1.: Both having (almost) the same multiplications makes sense since this new approach isn’t optimizing this work. Step 1. is mostly an addition chain and nothing else.

Reg Step 2: This is what the new approach is optimizing and where the savings come from. In this step, we have a 10x speedup.

Square root benchmark

I made a synthetic benchmark only focusing on the original sqrt and the new one.

With gnark allowing assembly+ADX for both (default gnark behavior):

goos: linux
goarch: amd64
pkg: github.com/crate-crypto/go-ipa/bandersnatch/fp
cpu: AMD Ryzen 7 3800XT 8-Core Processor            
BenchmarkElementSqrt1000-16         1188      910158 ns/op      9102 sqrt_ns/op
BenchmarkElementSqrtOptimized-16    2215      469742 ns/op      4697 sqrt_ns/op
PASS
ok      github.com/crate-crypto/go-ipa/bandersnatch/fp  2.285s

Cheating on gnark to disallow using assembly but the "fallback" non-assembly arithmetic:

goos: linux
goarch: amd64
pkg: github.com/crate-crypto/go-ipa/bandersnatch/fp
cpu: AMD Ryzen 7 3800XT 8-Core Processor            
BenchmarkElementSqrt1000-16         1951      1129566 ns/op      11296 sqrt_ns/op
BenchmarkElementSqrtOptimized-16    1809      626508 ns/op       6265 sqrt_ns/op
PASS
ok      github.com/crate-crypto/go-ipa/bandersnatch/fp  2.409s

Impact in geth import-chain benchmark

I tested this in our replay benchmark and had the expected impact you can imagine considering the previous results. The amount of CPU seconds that were saved doing Sqrt was cut by ~half which saving around 50s!

We're currently finishing to integrate this with other optimizations to have a new version of the benchmark run.

Acknowledgements

Thanks to Antonio Sanso for raising awareness of this optimization, and Gottfried Herold who implemented it in the Bandersnatch repo. I simply unpacked how this worked and "glued things" together to make it work in go-ipa where we use gnark as the "backend".

TL;DR

Detailed comparison

Square root benchmark

Impact in geth import-chain benchmark

Acknowledgements

Read more

Berlin - Stateless Ethereum

Ethproof Killers Test Suite – Overview & Format Guide

EIP-4762 execution witness worst-case

Execution-Spec-Tests TODOs (Stateless Tests)