This document contains some quick notes comparing calculating $\sqrt{n}$ using the [latest gnark]( ( “normal” Tonelli-Shanks) versus [using precomputed tables]( for solving $b^2 \equiv n^Q \mod{p}$. The optimized version described in this document has the twist of using gnark for doing multiplications, so both approaches can be fairly compared. (gnark has some advanced multiplication optimizations with and without assembly). So, it isn’t that code but a slightly modified version with some plumbing. I’ll call *Before* to the latest-gnark approach and *After* to the optimized approach. This optimization is relevant for: - Clients running the chain since if they save points in compressed form, they'll have to uncompress them every time you load tree nodes (with potentially many commitments) from disk. - Proof verification since the points are compressed. # TL;DR Even though we could count how many multiplications are done in each case by looking at the algorithms o*n paper*, that can be complicated or error-prone. I added some extra code to count every time an `fp.Mul(a, b)` is called while calculating the square root of 100k random finite field elements (both approaches use the same gnark “backend” for FF arithmetic, so both use “the hook” for counting). **Total finite-field multiplications per $\sqrt{n}$** - Before: 605 multiplications (average). - After: 303 multiplications (exact). Note that *Before* mentions *average*. This is the case since Tonelli-Shanks number of internal iterations depends on the field element. The *After* algorithm has a deterministic number of steps. In summary, this should map to a 2x speedup in the number of multiplications. # Detailed comparison This is digging a bit in detail inside Tonelli-Shanks in both approaches. If you think of Tonelli-Shanks, you have to do two _main_ steps: 1. Calculating the first “square root candidate” (i.e: calculating $n^{\frac{Q+1}{2}}$) 1. Before: 264 2. After: 265 2. Calculate $b^2 \equiv n^Q \mod{p}$ 1. Before: ~341 (~56% of total) - This count is an average since the number of iterations isn't deterministic. 3. After: 34 (~11% of total) - Deterministic. - 24 from computing powers $n^{Q*2^{24}}$, $n^{Q*2^{16}}$ and $n^{Q*2^{8}}$ - 6 resolving the dlogs for 8-bit window. - 4 to reconstruct $b^2$ from 8-bit window exponents. Reg Step 1.: Both having (almost) the same multiplications makes sense since this new approach isn’t optimizing this work. Step 1. is mostly an addition chain and nothing else. Reg Step 2: This is what the new approach is optimizing and where the savings come from. In this step, we have a 10x speedup. # Square root benchmark I made a synthetic benchmark only focusing on the original sqrt and the new one. With gnark allowing assembly+ADX for both (default gnark behavior): ``` goos: linux goarch: amd64 pkg: cpu: AMD Ryzen 7 3800XT 8-Core Processor BenchmarkElementSqrt1000-16 1188 910158 ns/op 9102 sqrt_ns/op BenchmarkElementSqrtOptimized-16 2215 469742 ns/op 4697 sqrt_ns/op PASS ok 2.285s ``` Cheating on gnark to disallow using assembly but the "fallback" non-assembly arithmetic: ``` goos: linux goarch: amd64 pkg: cpu: AMD Ryzen 7 3800XT 8-Core Processor BenchmarkElementSqrt1000-16 1951 1129566 ns/op 11296 sqrt_ns/op BenchmarkElementSqrtOptimized-16 1809 626508 ns/op 6265 sqrt_ns/op PASS ok 2.409s ``` # Impact in geth import-chain benchmark I tested this in our _replay benchmark_ and had the expected impact you can imagine considering the previous results. The amount of CPU seconds that were saved doing `Sqrt` was cut by ~half which saving around 50s! We're currently finishing to integrate this with other optimizations to have a new version of the benchmark run. # Acknowledgements Thanks to Antonio Sanso for raising awareness of this optimization, and Gottfried Herold who implemented it in the [Bandersnatch repo]( I simply [unpacked]( how this worked and "glued things" together to make it work in `go-ipa` where we use gnark as the "backend".