This document revisits potential Verkle Tree-related Multiscalar Multiplications (MSM) optimizations.
The main gist of this round is to focus on how Gottfried Herold's ideas in Notes on MSMs with Precomputation could improve our current (Geth) performance (but the ideas apply to any other VKT crypto library).
Most non-proof-related cryptography in Verkle Trees involves an MSM with a fixed basis of length 256.
The fundamental work of the following EL tasks is a fixed-basis MSM calculation:
As we can see, optimizing MSM calculation is relevant for many important tasks.
Today’s algorithm exploits the fact that the basis is fixed, compared to other kinds of MSM where the basis isn’t fixed and where a Pippenger-like algorithm is the usual choice.
We calculate precomputed tables from the basis points to speed up the scalar multiplications. Originally, we used 8-bit windows, but last year I had the idea of using a window size of 16 bits (i.e., 16 windows) for the first five points and 8 bits (i.e., 32 windows) for the rest.
Calculating tree keys and extension node updates only requires an MSM using the first five elements of the basis. Considering “five” is a small number, we can push the windows on these points further without exploding the table sizes. This gives a ~2x speedup on these cases, which are quite common and greatly impact overall client performance.
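To make the table-lookup idea concrete, here is a minimal sketch of a single fixed-base scalar multiplication with precomputed windows. The `point` type and its operations are placeholders (not the actual banderwagon API), and the sketch ignores the negation trick described next:

```go
package msm

// point stands in for the curve's group element (e.g. Banderwagon in
// go-verkle); identity and add are placeholders for the real operations.
type point struct{ x, y uint64 } // coordinates elided, illustrative only

func identity() point      { return point{} }
func add(a, _ point) point { return a } // real EC point addition goes here

// windowValue extracts the w-th chunk of `bits` bits from a little-endian scalar.
func windowValue(scalarLE []byte, w, bits int) uint64 {
	var v uint64
	for i := 0; i < bits; i++ {
		pos := w*bits + i
		if pos >= len(scalarLE)*8 {
			break
		}
		if scalarLE[pos/8]>>(pos%8)&1 == 1 {
			v |= 1 << i
		}
	}
	return v
}

// fixedBaseScalarMul computes scalar*P from a table where precomp[w][v] holds
// v * 2^(w*bits) * P: one lookup and one addition per non-zero window, and no
// runtime doublings. With bits=16 a 253-bit scalar needs 16 windows; with
// bits=8 it needs 32.
func fixedBaseScalarMul(precomp [][]point, scalarLE []byte, bits int) point {
	acc := identity()
	numWindows := (len(scalarLE)*8 + bits - 1) / bits
	for w := 0; w < numWindows; w++ {
		if v := windowValue(scalarLE, w, bits); v != 0 {
			acc = add(acc, precomp[w][v])
		}
	}
	return acc
}
```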
Sometime later, I implemented a suggested trick for halving the table sizes: since elliptic curve negation is cheap, we can reduce the precomputed values of each window by 1 bit. This doesn't change performance, but it reduces memory usage and table generation times (i.e., startup time).
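Roughly, the recoding behind that trick looks as follows (this is the standard signed-window idea, possibly not the exact variant implemented). A window value $w \in [0, 2^b)$ becomes

$$
w \;\mapsto\;
\begin{cases}
w, & w \le 2^{b-1},\\
w - 2^{b}, & w > 2^{b-1} \ \ (\text{and carry } 1 \text{ into the next window}),
\end{cases}
$$

so every digit lies in $[-2^{b-1}, 2^{b-1}]$. Since negating a precomputed point is essentially free, the table only needs the $2^{b-1}$ non-negative multiples instead of $2^{b}$, which is exactly the 1-bit (50%) saving.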
Regarding the theoretical performance of this algorithm, counting the number of elliptic curve operations we must do is straightforward. For the first five elements, each scalar multiplication requires 16 point additions (i.e., one per 16-bit window). For the rest, it takes 32 point additions (i.e., one per 8-bit window). That is, a full 256-MSM in this algorithm takes 5×16 + 251×32 = 8112 point additions. Calculating tree keys for accounts and storage slots requires fewer point additions since we only involve 2 or 4 basis points (the first scalar multiplication involves the version, which is fixed now, so we can cache that result).
Regarding how big the precomputed tables are, that's also easy to calculate:
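As a rough count of my own (assuming the tables store every window value explicitly, with the 1-bit halving above, and that each precomputed point takes 64 bytes in affine form, which is an assumption about the encoding):

$$
\underbrace{5 \cdot 16 \cdot 2^{15}}_{\text{16-bit windows}} + \underbrace{251 \cdot 32 \cdot 2^{7}}_{\text{8-bit windows}} \;=\; 2{,}621{,}440 + 1{,}028{,}096 \;=\; 3{,}649{,}536 \text{ points},
$$

which under that 64-byte assumption is roughly 223MiB.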
In summary, the current implementation has:
I won’t explain the algorithm in detail, but I’ll try to give a good intuition here if you want to avoid reading his document.
Let’s look at the following image presented there:
The “x-axis” is the basis of the MSM. The “y-axis” is the 253 bits of the scalar corresponding to each point we want to aggregate.
The algorithm has two main knobs you have to tune:
In summary, we form the windows by grouping b bits that we select every t rows. We “wrap around” when we reach the top, continuing with the next P_i. A window can involve two different points, which is not the case in our current algorithm.
After we define t and b, the windows we have to precompute are already determined, in the same style as our current algorithm. The idea is a “double-and-add”-style aggregation: we sum all the windows for some bits, then double, then sum the windows for the next bits, then double, and so on (i.e., filling the gaps between the blue boxes). The number of doublings is t-1. The "wrap around" situation makes things a bit more tricky.
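Here's a minimal structural sketch of that aggregation, under my reading of the scheme. `windowSum(r)` is assumed to return the sum of the precomputed window entries for the bits sitting on rows congruent to r modulo t (names and signatures are placeholders, not Gottfried's notation or any real library API):

```go
package msm

// aggregate combines the per-row window sums Horner-style: starting from the
// top row offset, each step doubles the accumulator once and adds the next
// row's windows. Total cost: exactly t-1 doublings plus all the window
// additions, independently of the MSM length.
func aggregate[P any](
	t int,
	windowSum func(row int) P, // sum of precomputed entries for this row offset
	double func(P) P,
	add func(a, b P) P,
	identity P,
) P {
	acc := identity
	for row := t - 1; row >= 0; row-- {
		if row != t-1 {
			acc = double(acc)
		}
		acc = add(acc, windowSum(row))
	}
	return acc
}
```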
Let’s say we choose some t and b, and build some intuition about what we gain/lose by increasing each knob:
The cool fact is that increasing t by 1 means only 1 extra doubling, independently of the length of the MSM. For example, if we switch from t=2 to t=4, with only two extra doublings we reduce the table’s memory to half the size (again, independently of the length of the MSM).
Are two (or four, or 10, etc.) extra doublings relevant? It depends on the total number of point additions. If we have to do 1000 point additions and doublings for t=2, then doubling t to t=4 would cut the table size in half and only add a total of 2 extra operations (i.e., a total of 1002), which is a big win in terms of efficiency.
In reality, things are a bit harder since there’s some “efficiency loss” from this “wrap around” situation, where some bits in the tables are “lost” depending on how t divides 253. I'll touch on this a bit more later.
I wrote a Go implementation and benchmark. The benchmark is reproducible, so you can run it on your machine.
The benchmark does a scan of three relevant dimensions:
The full benchmark output is quite long, so I put it in an appendix section at the end of the document. Next, I’ll explain how to interpret each line of the output.
Let’s take as an example the following line:
msmLength=4,t=32,b=10
This line benchmarks an MSM of length 4 with the (t=32, b=10) configuration. The number of additions and doublings depends mainly on (t, b), but is also slightly influenced by the msmLength (this is a consequence of the “wrap around” of windows). The table size for this (t, b) setup is 19.22MiB. The table size only depends on (t, b), so you’ll see that for other MSM lengths with the same (t, b), the table size is the same (the table is always prepared for a 256-MSM).

A curious reader might wonder why I added t=23 and t=11 as configurations if the rest are powers of two. Those numbers are factors of 253, the bit length of the scalar field. The “wrap around” nature of skipping every t rows makes the algorithm reach “top efficiency” when each window fits perfectly in each column. You can find cases where the total Adds+Doublings remains the same even between t=16 and t=23 configurations: despite having to do 7 extra doublings, the higher efficiency of t=23 packs more meaningful bits into the window additions, resulting in the same total number of operations!
The obvious question is: what is the best (t, b) we can use to beat the current algorithm? You might have realized that the answer depends.
Increasing t is nice since, e.g., doubling it reduces the table size by half. Reducing the table size isn’t only a “save memory” benefit; it also leaves more room to push b further, which gives us a speedup. The cost of increasing t by 1 is one more doubling in total, but how much impact that has depends on the length of the MSM! Let’s look at two examples to understand this better.
First example:
The first setup has t=1, which means no doublings. The second setup has t=2, which means 1 doubling. The table size difference between the two is 1/2, which is massive, at the cost of only 1 extra doubling! Is this extra doubling a big or small overhead? We moved from 32 to 33 operations, which makes sense. Adding one extra operation to 32 is ~3% overhead. If we increase from t=1 to t=32, we’d add 31 doublings, which is ~97% overhead. In summary, in this case we need a low t because, in relative terms, doublings add overhead quite quickly.
Second example:
Here we add an extra doubling as before. You might wonder why we went from 5398 to 5418 instead of 5399. The reason is the “wrap around”, which loses some extra efficiency due to “lost bits”. In any case, the important point is that 1 extra doubling in 5398 operations is ~0.02%, we reduced the table size by half (~1GiB!), and the performance is almost the same!
Note that if we change from t=1 to t=32:

That was five successive halvings of the table size (1=2^0 → 32=2^5, i.e., a 32x reduction), and we only added 31 extra doublings, which relative to ~5400 operations is tiny (~0.57%).
The longer the MSM, the more aggressively we can increase t without significant overhead. This leads to smaller tables with roughly the same performance. Smaller tables give us more memory room to increase b, providing more performance.
For a short MSM, we need a smaller t since each doubling, despite being a constant cost, is significant relative to the total number of operations. For longer MSMs, increasing t by a large amount is still small relative to the rest of the operations, so we can push for a bigger b.
Regarding b, as you can see in the benchmark, we should increase it as much as possible since this directly improves performance at the cost of more memory usage. Note that each extra increase of b has less impact while doubling the table size, so there are diminishing returns in pushing this knob too far.
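A back-of-the-envelope approximation that I find useful for summarizing both knobs (my own restatement; it ignores the wrap-around efficiency loss):

$$
\#\text{operations} \;\approx\; \underbrace{\left\lceil \frac{253 \cdot \text{msmLength}}{b} \right\rceil}_{\text{window additions}} \;+\; \underbrace{(t-1)}_{\text{doublings}} .
$$

This makes the tradeoff explicit: b shrinks the dominant term (which scales with the MSM length), while t only adds a constant number of doublings and halves the table every time it doubles.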
The “unfortunate” conclusion is that the best setup might depend on at least some group of MSM length ranges.
As we’ve seen before, we need to analyze numbers more carefully since the new algorithm might be useful only in some cases.
A very important clarification worth making now is that when I say “an MSM of length X”, I mean a 256-vector of scalars where only X scalars are non-zero. Since the number of non-zero scalars determines the total number of adds+doublings we have to do, that’s what matters for knowing when adding more doublings starts to hurt performance too much.
These are the relevant MSMs that happen in VKT:
We’ll have to analyze each of these cases. The complexity of distinguishing them comes from the fact that the current algorithm has different performance for MSMs of, for example, length 2, depending on whether those two scalars fall in the first 5 elements or in the rest (recall that for the first 5 we use 16-bit windows and for the rest 8-bit windows), so it gets complicated!
I wrote a benchmark that shows the current algorithm's performance in each scenario, so let’s analyze them separately. All the benchmarks shown below are reproducible also in this development branch.
Let’s look at the current algorithm performance in this case:
Notes about these numbers:
- Since these MSMs only touch the `basis[0:5]` range, we use 16-bit windows.

It is very hard to beat these numbers since the current algorithm excels in this case. The tables are fully optimized for performance in this basis subset, and it would be very hard (or impossible?) to find another strategy that can beat this.
This is fine since using 16-bit windows was a conscious decision to step on the gas for very important cases in Verkle. If you’re curious, you can look at the Appendix - New algorithm full benchmark output section again and check that the scanned parameters can’t match this performance. For this case, we could try t=1 and b=16, but we’d be pushing the new algorithm to do the same thing we do today, without any point negation tricks, or simply with a more inefficient implementation for this border case.
So, for this case, if we allow zero compromises on performance (which might be questionable, but let’s put that as a constraint), we still need to use the current algorithm. If losing 10% of performance is allowed for this case, then we can probably find a configuration in the new algorithm that is fast enough and probably uses less memory. For now, we stick to “0 performance compromises” to simplify the analysis.
In this case, things start to get more interesting.
Let’s look at the performance of the current algorithm:
This case concerns MSMs with non-zero scalars spread over all 256 basis points. This means the current algorithm will usually use 8-bit windows instead of 16-bit ones, which only exist for the first five elements. Looking at MSM lengths between 1 and 4, we can already see how those are slower than the previous case using 16-bit windows (although it surprised me a bit that it isn’t closer to a 2x slowdown; it could be related to the 8-bit tables causing more cache hits, or maybe my computer had some CPU spike adding noise). In the current algorithm, the expected number of point additions is 32 windows per point × MSM length.
Here’s where we find two sub-cases in this kind of MSM:
Looking again at the output in the Appendix - New algorithm full benchmark output section, we can already identify the cases with similar or better performance.
Let’s start with the MSM length range between 1 and 8:
The raw performance numbers are better than the current algorithm’s, but within this range it’s still hard to get a good memory-usage tradeoff. The underlying reason is none other than what we explained in previous sections: the new algorithm has a fixed cost of doublings that should be amortized over the total number of operations. Since these MSM lengths are still quite short, that cost isn’t amortized enough for the new algorithm to shine. Also note that, ideally, we’d like a single (t, b) configuration to cover this range, since having many configurations means adding up the table sizes! So maybe using the current algorithm with 8-bit windows for these short, spread MSMs is still better.
Now let’s look at MSM lengths between 9 and 256 (quite a wide range):
This is where the new strategy starts to have serious performance improvements using table sizes that are quite small:
For these ranges above, the cost of doublings starts to become “negligible” compared to the total number of operations. This allows us to use less memory and a bigger b, which helps performance.
The final strategy for our new MSM algorithm is the following:
So, in the most aggressive strategy, we would add 128+16MiB of memory usage and get a decent speedup, which we’ll see more clearly in the benchmarks of the next section. We could be more conservative and still use the current algorithm for the 9-32 range and save the extra 128MiB if we think it isn’t worth it.
So, this final strategy is a hybrid that uses each algorithm where it performs best.
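As a sketch of what that hybrid dispatch could look like in its most aggressive variant (the thresholds follow the ranges discussed above, but the concrete (t, b) configurations and table sizes are placeholders, not the exact ones picked in the benchmarks):

```go
package msm

// strategy describes which precomputed tables an MSM would use.
type strategy string

// chooseStrategy implements the hybrid dispatch: keep the current algorithm
// where it already excels, and switch to the new algorithm for longer MSMs
// with scalars spread over the whole basis.
func chooseStrategy(nonZeroScalars int, onlyFirstFive bool) strategy {
	switch {
	case onlyFirstFive:
		// Tree keys / extension-node updates: current algorithm, 16-bit windows.
		return "current-16bit"
	case nonZeroScalars <= 8:
		// Short, spread MSMs: current algorithm, 8-bit windows.
		return "current-8bit"
	case nonZeroScalars <= 32:
		// Mid range: new algorithm, a (t, b) config with the larger (~128MiB) table.
		return "new-large-table"
	default:
		// Long MSMs: new algorithm, a (t, b) config with the small (~16MiB) table.
		return "new-small-table"
	}
}
```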
Here’s a final benchmark that compares our current strategy with the newly proposed one on an AMD Ryzen 7 3800XT:
For MSMs of 16 non-zero scalars up to 256, we got a speedup between 20% and 44%, which is very good, using only 144MiB more memory. This has 0 compromises in any other case (which was one of the constraints of this exploration).
Let's look at the same benchmark comparison on a Rock5B (a very low-end hardware setup, maybe already discarded for an Ethereum node post-4844):
If we think using 144MiB of extra memory is too much, then by only using 16MiB of extra memory we can get the following:
This gives the same speedups as before, but without the gains in the 8-to-32 range.
I think 144MiB of extra RAM sounds reasonable for a decent speedup in a wide range of MSM lengths, but this might be up to each client to decide! In the worst case, with an extra 16MiB, we still get quite a good bang for the buck.
To finish, let’s explore some parallel questions regarding this new strategy.
The current and new strategies can both be parallelized since aggregating windows is fully parallelizable. I kept all benchmarks single-threaded to make a fair comparison without any extra noise (comparing a non-parallelized algorithm with a parallelized one would be unfair).
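For reference, here is a minimal sketch of what that parallelization could look like: the window additions are independent, so they can be split across workers and the partial sums combined at the end (generic placeholders again, not an actual implementation):

```go
package msm

import "sync"

// parallelWindowSum adds up `windows` (the precomputed entries selected by the
// scalars) using `workers` goroutines, each accumulating a strided partial sum,
// and then combines the partials. Valid because group addition is commutative.
func parallelWindowSum[P any](windows []P, add func(a, b P) P, identity P, workers int) P {
	partials := make([]P, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			acc := identity
			for i := w; i < len(windows); i += workers {
				acc = add(acc, windows[i])
			}
			partials[w] = acc
		}(w)
	}
	wg.Wait()
	total := identity
	for _, p := range partials {
		total = add(total, p)
	}
	return total
}
```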
Furthermore, it’s easy to lose sight of the big picture when doing benchmarks deep in the stack. In the Verkle Trees EIPs, the use of these MSMs comes from:
As mentioned in the How do we calculate MSMs today? (in Geth) section, today we use a known “point negation” trick to save 1 bit per window, halving the table sizes.
As Gottfried mentioned in his document, the new algorithm includes a similar (but a bit more convoluted) trick that we can use to save 50% of pre-computation and, thus, memory usage.
The trick is very clever: instead of aggregating window values starting from the identity element, you start from a precomputed point that already adds “half” of the potential contribution of each scalar bit in the windows. This lets you represent the scalar bits with 1 and -1, which either contribute the remaining “half” (if the original bit was 1) or cancel the existing one (if it was 0), so the total contribution is always correct. And thanks to the sign symmetry of 1 and -1, we can apply a similar “negation trick” to save 1 bit from the windows.
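One way to see the identity behind this (my own restatement, not taken verbatim from the notes): write each bit $b_i \in \{0, 1\}$ as $b_i = (s_i + 1)/2$ with $s_i \in \{-1, +1\}$, so that

$$
\sum_i b_i\, 2^i P \;=\; \underbrace{\tfrac{1}{2}\sum_i 2^i P}_{\text{scalar-independent, precomputable}} \;+\; \tfrac{1}{2}\sum_i s_i\, 2^i P .
$$

The first term is the fixed starting point mentioned above, and the second term only involves signs: a bit of 1 contributes the remaining “half” while a bit of 0 cancels it, and the $\pm$ symmetry is what allows halving the precomputation again.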
As mentioned in the original document, this trick creates a challenging situation if many scalars are zero. The starting point that we aggregate from already assumes a contribution from each window, which should be zero for zero scalars, so we still have to add those zero scalars, which hurts performance. Also, as mentioned in the original doc, if we could predict which scalars might be zero, we could precompute some “cancellation points” such that a single point addition deals with the situation, but I don’t think this is quite possible since we’re using the new strategy for cases that aren’t very predictable.
I haven’t chatted with Gottfried about this situation to pick his brain, but my intuition is that all this probably isn’t worth it, considering our tables for the new strategy are already quite small relative to the complexity (and potential performance hit) this trick would imply.
The original strategy required the following table precomputation times at startup:
The new strategy requires:
All the presented numbers are synthetic benchmarks that show the raw performance of MSMs. The speedups are quite significant, so this will be a good strategy to have from now on.
After preparing a more polished branch with this strategy to be used in go-verkle and geth, we could run other existing “higher level benchmarks” and a chain-replay run to check whether we notice an mgasps throughput improvement or a potential bump in the number of key-values we can migrate (MPT→VKT) per block.
Of course, any decision that considers performance should always take into account the slowest client implementation, so there won’t be an immediate decision regarding speeding up the migration. However, this work and strategy could be applied to other libraries so more clients could benefit from it.
Below is the full benchmark output for the new algorithm (scroll horizontally to see all the columns):