This document contains some quick notes comparing computing $\sqrt{n}$ using the [latest gnark](https://github.com/ConsenSys/gnark-crypto/blob/8f7ca09273c24ed9465043566906cbecf5dcee91/ecc/bls12-381/fr/element.go#L1163-L1229) (“normal” Tonelli-Shanks) versus [using precomputed tables](https://github.com/GottfriedHerold/Bandersnatch/blob/main/bandersnatch/fieldElements/field_element_square_root.go) for solving $b^2 \equiv n^Q \mod{p}$.
The optimized version described in this document is tweaked to use gnark for multiplications, so both approaches can be compared fairly (gnark has some advanced multiplication optimizations, with and without assembly). So it isn't exactly that code, but a slightly modified version with some plumbing.
I'll call the latest-gnark approach *Before* and the optimized approach *After*.
This optimization is relevant for:
- Clients running the chain, since if they save points in compressed form, they'll have to uncompress them every time they load tree nodes (with potentially many commitments) from disk.
- Proof verification since the points are compressed.
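As background on why compression implies square roots: a compressed point stores one coordinate plus a sign bit, and the other coordinate is recovered from the curve equation at the cost of one field square root. As a generic illustration with a short Weierstrass curve (Bandersnatch itself is a twisted Edwards curve, but the cost structure is the same):

$$
y^2 = x^3 + ax + b \implies y = \pm\sqrt{x^3 + ax + b},
$$

so every compressed point read from disk or from a proof costs one $\sqrt{\cdot}$ in the field.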
# TL;DR
Even though we could count how many multiplications are done in each case by looking at the algorithms *on paper*, that can be complicated and error-prone.
I added some extra code to count every time an `fp.Mul(a, b)` is called while calculating the square root of 100k random finite field elements (both approaches use the same gnark “backend” for FF arithmetic, so both use “the hook” for counting).
**Total finite-field multiplications per $\sqrt{n}$**
- Before: 605 multiplications (average).
- After: 303 multiplications (exact).
Note that *Before* is an *average*: the number of internal Tonelli-Shanks iterations depends on the field element. The *After* algorithm runs a deterministic number of steps.
In summary, this should translate to a ~2x speedup in the number of multiplications.
# Detailed comparison
This is digging a bit in detail inside Tonelli-Shanks in both approaches.
If you think of Tonelli-Shanks, you have to do two _main_ steps:
1. Calculating the first “square root candidate” (i.e. computing $n^{\frac{Q+1}{2}}$):
    1. Before: 264
    2. After: 265
2. Calculating $b^2 \equiv n^Q \mod{p}$:
    1. Before: ~341 (~56% of total)
        - This count is an average, since the number of iterations isn't deterministic.
    2. After: 34 (~11% of total)
        - Deterministic.
        - 24 from computing the powers $n^{Q*2^{24}}$, $n^{Q*2^{16}}$ and $n^{Q*2^{8}}$.
        - 6 from resolving the dlogs of the 8-bit windows.
        - 4 to reconstruct $b^2$ from the 8-bit window exponents.
Regarding Step 1: both approaches doing (almost) the same number of multiplications makes sense, since the new approach doesn't optimize this work. Step 1 is mostly an addition chain and nothing else.
Regarding Step 2: this is what the new approach optimizes and where the savings come from. In this step we get a 10x speedup.
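To unpack where the *After* counts come from (my summary of the linked write-up): for this field $p - 1 = Q \cdot 2^{32}$, so $b^2 = n^Q$ lies in the cyclic subgroup of order $2^{32}$ generated by a fixed root of unity $g$, and its discrete log $e$ splits into 8-bit windows:

$$
n^Q = g^e, \qquad e = e_0 + e_1 2^{8} + e_2 2^{16} + e_3 2^{24}, \qquad 0 \le e_i < 2^{8}.
$$

Raising to the $2^{24}$ (24 squarings in total across the listed powers) kills all windows but the lowest, since $g$ has order $2^{32}$: $n^{Q \cdot 2^{24}} = g^{e_0 \cdot 2^{24}}$ lives in the order-$2^8$ subgroup, so $e_0$ is a single precomputed dlog-table lookup. Peeling the windows one at a time this way accounts for the squarings, the dlog lookups, and the few reconstruction multiplications listed above.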
# Square root benchmark
I made a synthetic benchmark only focusing on the original sqrt and the new one.
With gnark allowing assembly+ADX for both (default gnark behavior):
```
goos: linux
goarch: amd64
pkg: github.com/crate-crypto/go-ipa/bandersnatch/fp
cpu: AMD Ryzen 7 3800XT 8-Core Processor
BenchmarkElementSqrt1000-16 1188 910158 ns/op 9102 sqrt_ns/op
BenchmarkElementSqrtOptimized-16 2215 469742 ns/op 4697 sqrt_ns/op
PASS
ok github.com/crate-crypto/go-ipa/bandersnatch/fp 2.285s
```
Forcing gnark to skip assembly and use the "fallback" non-assembly arithmetic:
```
goos: linux
goarch: amd64
pkg: github.com/crate-crypto/go-ipa/bandersnatch/fp
cpu: AMD Ryzen 7 3800XT 8-Core Processor
BenchmarkElementSqrt1000-16 1951 1129566 ns/op 11296 sqrt_ns/op
BenchmarkElementSqrtOptimized-16 1809 626508 ns/op 6265 sqrt_ns/op
PASS
ok github.com/crate-crypto/go-ipa/bandersnatch/fp 2.409s
```
# Impact on the geth import-chain benchmark
I tested this in our _replay benchmark_ and it had the impact you would expect from the previous results: the CPU seconds spent in `Sqrt` were cut roughly in half, saving around 50s!
We're currently finishing the integration of this with other optimizations for a new benchmark run.
# Acknowledgements
Thanks to Antonio Sanso for raising awareness of this optimization, and Gottfried Herold who implemented it in the [Bandersnatch repo](https://github.com/GottfriedHerold/Bandersnatch). I simply [unpacked](https://ihagopian.com/posts/tonelli-shanks-with-precomputed-dlog-tables) how this worked and "glued things" together to make it work in `go-ipa` where we use gnark as the "backend".
