This document contains some quick notes comparing calculating using the latest gnark ( “normal” Tonelli-Shanks) versus using precomputed tables for solving .
The optimized version described in this document has the twist of using gnark for doing multiplications, so both approaches can be fairly compared. (gnark has some advanced multiplication optimizations with and without assembly). So, it isn’t that code but a slightly modified version with some plumbing.
I’ll call Before to the latest-gnark approach and After to the optimized approach.
This optimization is relevant for:
Even though we could count how many multiplications are done in each case by looking at the algorithms on paper, that can be complicated or error-prone.
I added some extra code to count every time an fp.Mul(a, b)
is called while calculating the square root of 100k random finite field elements (both approaches use the same gnark “backend” for FF arithmetic, so both use “the hook” for counting).
Total finite-field multiplications per
Note that Before mentions average. This is the case since Tonelli-Shanks number of internal iterations depends on the field element. The After algorithm has a deterministic number of steps.
In summary, this should map to a 2x speedup in the number of multiplications.
This is digging a bit in detail inside Tonelli-Shanks in both approaches.
If you think of Tonelli-Shanks, you have to do two main steps:
Reg Step 1.: Both having (almost) the same multiplications makes sense since this new approach isn’t optimizing this work. Step 1. is mostly an addition chain and nothing else.
Reg Step 2: This is what the new approach is optimizing and where the savings come from. In this step, we have a 10x speedup.
I made a synthetic benchmark only focusing on the original sqrt and the new one.
With gnark allowing assembly+ADX for both (default gnark behavior):
Cheating on gnark to disallow using assembly but the "fallback" non-assembly arithmetic:
I tested this in our replay benchmark and had the expected impact you can imagine considering the previous results. The amount of CPU seconds that were saved doing Sqrt
was cut by ~half which saving around 50s!
We're currently finishing to integrate this with other optimizations to have a new version of the benchmark run.
Thanks to Antonio Sanso for raising awareness of this optimization, and Gottfried Herold who implemented it in the Bandersnatch repo. I simply unpacked how this worked and "glued things" together to make it work in go-ipa
where we use gnark as the "backend".