# Verkle Tries exploration: Keccak vs. Pedersen Commitments

**Update: You can also see the [verkle-vs-patricia](https://github.com/jsign/verkle-vs-patricia) repository, which contains newer benchmarks including more use cases and comparisons with other libraries.**

This document explores the switch from Merkle Patricia Tries (MPTs) to Verkle Tries (VKs). Both differ in their structure and underlying cryptographic primitives, so it's worth exploring them both theoretically and practically, using benchmarks of real libraries.

Note: This isn't an introductory article about Keccak, Pedersen Commitments, or Verkle Tries. There are great articles out there that I'd highly recommend:

- [Verkle Tree Structure](https://blog.ethereum.org/2021/12/02/verkle-tree-structure)
- [Verkle Trie for eth1](https://dankradfeist.de/ethereum/2021/06/18/verkle-trie-for-eth1.html)
- [Verkle Trees](https://vitalik.ca/general/2021/06/18/verkle.html)

Before jumping into the main analysis, let's dive into why this work is relevant today.

# Context

As part of the *Verge*, Ethereum plans to migrate its state storage from a Merkle Patricia Trie to a Verkle Trie. Despite MPTs and VKs both being *tries*, they have many differences:

- They have different structures.
- They use different cryptographic primitives.
- They have different levels of maturity in existing implementations.

These tries are at the core of all Ethereum clients since they store the chain's state, which means this change will impact any operation that has to read or write the chain state. Moreover, reading and writing into these tries requires hardware resources such as CPU, memory, and storage.

Here's a non-exhaustive list of responsibilities of an Ethereum client that this *trie* change will impact:

- Serving API calls, such as getting the value of a storage slot.
- The total size of the underlying DB (e.g., LevelDB in Geth).
- Transaction processing, since EVM execution interacts with the storage layer.
- Witness generation and validation for future stateless light clients.

*Ideally*, we'd like to get all of Verkle Tries' benefits without any performance or resource consumption overhead.

Dankrad Feist has been leading the research side of Verkle Tries and, more generally, statelessness-related protocol changes. Guillaume Ballet has been leading the construction of the [Go repository for Verkle Tries](https://github.com/gballet/go-verkle) and its integration into geth. He [has shared many findings](https://www.youtube.com/watch?v=4fL7hi8SZMs), which can be summarized as follows:

- VKs have a CPU overhead compared to MPTs.
- VKs' memory consumption is much higher than MPTs'.
- Migrating MPTs with the chain's current state to a VK is a very heavy task that will probably require some strategy to avoid halting the chain for hours or days.

Kevaundray Wedderburn has been leading the implementation of the [go-ipa](https://github.com/crate-crypto/go-ipa) repository, which contains an implementation of cryptographic primitives such as Bandersnatch/Banderwagon elliptic curve operations, polynomial commitments (i.e., Pedersen Commitments), and inner product arguments for proof creation/verification. This repo is a dependency of the `go-verkle` repo mentioned above.

# EFP project proposal draft and goal of this document

As part of my work in the Ethereum Fellowship Program, I'd like to help in the effort of migrating Ethereum from Merkle Patricia Tries to Verkle Tries, in any shape or form, which includes:

- Understanding the VK data structure and cryptographic primitives in detail.
- Diving into the details of the current Go `go-verkle` implementation and helping with tests, new APIs or changes, or anything implementation-related.
- Generating benchmarks to understand where the bottlenecks are.
- Optimizing the implementation (CPU and/or memory) by leveraging what we discover in the benchmarks.

Considering what was mentioned in the *Context* section, I believe that helping to understand the performance overhead of VKs compared to MPTs, and optimizing the implementation or its usage, can have an important impact since:

- The migration of a *loaded* node from an MPT to a VK can run faster and use less CPU/memory/storage, both for full migrations (if possible) and incremental migrations.
- The benchmarks can give concrete numbers to shed light on whether any optimization could or should be done at the cryptographic level.

For now, we'll start exploring the difference between MPTs and VKs regarding their cryptographic primitives:

- Keccak vs. Pedersen Commitments.
- Benchmarks to understand the current overhead and bottlenecks.

As my work continues in the program, I'll focus on the other points mentioned above where the MPT → VK switch has a big impact.

# Keccak vs. Pedersen Commitments

When building these *tries*, we need a cryptographic hash function for two main reasons:

- We need to map *(key, value)* pairs to the trie in a balanced way. Not doing so breaks good properties such as average depth and opens attack vectors.
- To *roll up* children nodes into their parents, we need to map byte slices (*values*) into a representation with some good properties (e.g., to prevent attackers from manipulating the trie or forging proofs, and to land in the *native domains* required by other cryptographic tools).

MPTs use [Keccak](https://en.wikipedia.org/wiki/SHA-3) as the cryptographic hash function. Keccak is a well-known cryptographic hash function that has an efficient implementation in most languages (often even included in their standard libraries). Keccak works by manipulating bits and doing bitwise operations, which are a good fit for CPU instructions.

VKs have a more complex design that brings many benefits, but we have to do a bit more work to unpack how they work. From now on, we'll assume *some* general understanding of Verkle Tries. If that isn't the case, please read the [Verkle tree structure](https://blog.ethereum.org/2021/12/02/verkle-tree-structure) article from the EF blog and the other articles linked at the top of this document.

When building a VK, we have to focus on two main tasks:

- Transform the `[32]byte` *key* of the inserted *(key, value)* into a trie key.
- To *roll up* children nodes to parent nodes, compute a vector commitment using Pedersen Commitments.

Let's analyze each bullet in order. While doing so, the idea is to capture operations worth digging into with benchmarks later.

## Transform a `[32]byte` *key* into a proper trie key

Transforming a *key* into a *stem+suffix* is explained in the [Verkle Trie EIP](https://notes.ethereum.org/@vbuterin/verkle_tree_eip#Header-values). The TL;DR is that we use the *address* and some constants to build a `[32]byte` key where the first 31 bytes (the stem) are calculated by constructing a vector of field elements containing the *address* and a tree index, and the last byte is a constant indicating which data is being stored (e.g., nonce, balance, etc.). A similar strategy is used for storage slots when the *address* belongs to a smart contract, using the *address* and the *storage key*.
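To make the *stem+suffix* idea more concrete, below is a rough Go sketch of the derivation just described. This is **not** the geth/go-verkle code: `pedersenHash` is a placeholder for the real Pedersen-commit-then-hash step (unpacked in the following sections), and the exact constants and chunking are defined in the EIP.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pedersenHash is a placeholder for the real primitive (a Pedersen Commitment of the
// chunked input, with the resulting curve point hashed into a field element); it only
// exists so this sketch compiles and shows the data flow.
func pedersenHash(input []byte) [32]byte {
	var out [32]byte
	return out
}

// getTreeKey sketches the derivation described above: commit to a small vector built
// from the address and a tree index, take 31 bytes of the hashed commitment as the
// stem, and append the sub-index as the last byte.
func getTreeKey(address [32]byte, treeIndex uint64, subIndex byte) [32]byte {
	// The real input also includes a domain-separation constant, and the bytes below
	// are split into field elements before committing (see the EIP for details).
	input := make([]byte, 0, 64)
	input = append(input, address[:]...)
	var idx [32]byte
	binary.LittleEndian.PutUint64(idx[:], treeIndex)
	input = append(input, idx[:]...)

	h := pedersenHash(input)
	var key [32]byte
	copy(key[:31], h[:31]) // stem: groups the (up to) 256 values that share this prefix
	key[31] = subIndex     // suffix: which value inside the leaf (nonce, balance, etc.)
	return key
}

func main() {
	var addr [32]byte // an account address, padded to 32 bytes
	fmt.Printf("%x\n", getTreeKey(addr, 0, 0))
}
```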
The idea of the `[31+1]byte` design is to improve data locality in the trie, which benefits witness sizes. Transforming the original *key* into the *trie key* is done on the client side (i.e., a geth node) before inserting the entry into the VK.

The work mentioned above requires two cryptographic operations:

1. A *Pedersen Commitment* of the constructed vector, which results in an elliptic curve point.
2. A mapping of that elliptic curve point into a `[32]byte`. To do that, we *hash* the elliptic curve point into a field element.

Regarding 1., refer to [this article](https://dankradfeist.de/ethereum/2021/07/27/inner-product-arguments.html) from Dankrad, which explains how *Pedersen Commitments* work at the math level. We'll dive deeper into how this primitive works in the next sub-section when discussing how children nodes roll up to parent nodes.

Regarding 2., [*hashing* a point to a `[32]byte`](https://github.com/ethereum/go-ethereum/pull/24906/files#diff-b4012b1250cf4664a9d898d0b1e2f5b45a93ab90c9626220ee5f633fd7c61264R162) can be summarized as follows:

- Transform the point [from projective form to affine form](https://www.nayuki.io/page/elliptic-curve-point-addition-in-projective-coordinates).
- Take the X coordinate in affine form in little-endian representation.
- Substitute the last byte with the `subIndex` (i.e., the `+1` in the `[31+1]byte` concept explained above).

Note that this work isn't part of a Verkle Trie implementation but is done by clients (i.e., an Ethereum client such as geth). When an Ethereum client wants to insert a *(key, value)* in the trie, it needs to prepare the *key* with the steps described above. This results in a new *trie-key*, which is the one used in the trie for storing *value*. From now on, when we refer to *(key, value)* inside a VK, we'll assume that *key* is this *trie-key* prepared by the client.

## Rolling up from children nodes to parent nodes

When a *(key, value)* is inserted in a VK, the *value* lives in the leaves of the trie. Let's refresh for a moment what this looks like:

![](https://i.imgur.com/dgt7kr0.png)
*(Image from [Verkle Tree structure](https://blog.ethereum.org/2021/12/02/verkle-tree-structure))*

Each node of the trie is a vector of elements. These elements are field elements (i.e., 252-bit values). Each vector has a fixed size of 256 elements. As shown in the picture, *internal nodes* exploit the full capacity of the vector, but *extension nodes* only use the first four elements.

### Vector commitment step

The picture shows that children nodes *roll up* to parent nodes by computing a vector commitment. We take this `[256]FieldElement` vector, apply a transformation, and get a *commitment*. A *commitment* is an elliptic curve point. This `[256]FieldElement -> Elliptic Curve Point` transformation is done using *Pedersen Commitments*:

![](https://i.imgur.com/1TpyKor.png)
([Taken from Dankrad's blog post](https://dankradfeist.de/ethereum/2021/07/27/inner-product-arguments.html))

The `a_n` values refer to the entries of our vector, and the `g_n` values refer to fixed elliptic curve points. Here's the real [Go implementation of this](https://github.com/crate-crypto/go-ipa/blob/master/banderwagon/precomp_multiexp.go#L109-L132), where we can see how it iterates over the `[]fr.Element` vector, computing each `a_n*g_n` term. The implementation uses some extra tricks (see the sketch after this list):

- It skips zero elements, which is very convenient since our vectors will probably be sparse.
- It uses a precomputed Lagrange table that makes the `a_n*g_n` computations cheaper.
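To make the formula above concrete, here's a minimal Go sketch of that computation with hypothetical `Point`/`Scalar` stand-ins (not the go-ipa types): it walks the 256-entry vector, skips zeros, and accumulates the `a_n*g_n` terms, so the cost grows with the number of non-zero entries.

```go
package pedersen

import "math/big"

// Scalar and Point are hypothetical stand-ins for go-ipa's field element and
// Banderwagon point types; the method bodies are placeholders so the sketch compiles.
type Scalar = big.Int

type Point struct{ x, y, z big.Int } // projective coordinates

// Add sets p to a+b (placeholder body; the real code does elliptic curve point addition).
func (p *Point) Add(a, b *Point) *Point { return p }

// ScalarMul sets p to s*q (placeholder body; the real code does elliptic curve scalar
// multiplication, which go-ipa speeds up with precomputed Lagrange tables).
func (p *Point) ScalarMul(q *Point, s *Scalar) *Point { return p }

// Commit sketches the vector commitment C = a_1*g_1 + a_2*g_2 + ... + a_256*g_256,
// where basis holds the fixed curve points (g_i) and vector holds the node entries (a_i).
func Commit(basis *[256]Point, vector *[256]Scalar) Point {
	var c Point // accumulator, starting at the identity element
	for i := range vector {
		// Skip zero entries: they contribute nothing, and node vectors are usually
		// sparse, so the cost is linear in the number of non-zero entries.
		if vector[i].Sign() == 0 {
			continue
		}
		var term Point
		term.ScalarMul(&basis[i], &vector[i]) // term = a_i * g_i
		c.Add(&c, &term)                      // C += term
	}
	return c
}
```

Compared with Keccak, which always hashes a fixed-size input, the amount of elliptic curve work here depends on the vector's contents; the benchmarks later make that difference concrete.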
Note some important observations compared to MPTs:

- The "rollup operation" acts on a vector, not a single value.
- The cost of the operation is linear in the **number of vector entries that aren't zero**.

You can already notice how comparing Keccak with Pedersen Commitments is a complex question, since the former acts on individual values while the latter acts on a vector where the cost depends on the number of non-zero entries. We'll get back to this in more detail in the *Benchmarks* section.

### Transforming an elliptic curve point to a field element

After we apply the *Pedersen Commitment*, we have an elliptic curve point as a result, but we need a `[32]byte` result to store as a *commitment* (i.e., those `C_x` values in the image above). We can see this [done here](https://github.com/gballet/go-verkle/blob/master/ipa.go#L45-L48); in summary:

- We take the elliptic curve point and "map it to the base field".
- ["Mapping it to the base field" means](https://github.com/crate-crypto/go-ipa/blob/c5abbdbdf644b0fc059eb12e8962a122ef44f262/banderwagon/element.go#L118-L129) taking the X and Y coordinates of the point we have in projective coordinates and doing `X/Y` (a division). Dividing implies calculating the inverse of `Y`, which **is a costly operation**.

In summary, we can appreciate how calculating the Pedersen Commitment of a vector is a pretty complex operation. From a high-level perspective, it seems hard for it to compete with a bitwise algorithm such as Keccak. In the *Benchmarks* section, we'll try translating this to numbers.

### Bonus: Efficient commitment updating when updating a vector entry

Finally, it's worth mentioning that once we have the *Pedersen Commitment* of an existing vector, we can obtain the updated commitment when one of its elements changes without recomputing everything. This is possible since *Pedersen Commitments* are *additively homomorphic* (in this case, the commitment is a linear combination of elliptic curve points). This means that if we have the commitment of our current vector values, we can update it by calculating the difference between the new and old terms in the linear combination and adding it to the current commitment value. This results in the updated commitment of the vector with the updated entry. [We can see this implemented in the `go-verkle` repository.](https://github.com/gballet/go-verkle/blob/master/tree.go#L362-L368)
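As a rough sketch of that idea, reusing the hypothetical `Point`/`Scalar` stand-ins from the earlier snippet (again, not the actual go-verkle code): updating entry *i* only costs one scalar subtraction, one scalar multiplication, and one point addition, independently of how many other entries the vector has.

```go
// UpdateCommitment sketches the diff update described above. Given the current
// commitment C = a_1*g_1 + ... + a_256*g_256, replacing entry i (oldVal -> newVal)
// yields:
//
//	C' = C + (newVal - oldVal)*g_i
//
// (Point, Scalar and their placeholder methods are as defined in the previous sketch.)
func UpdateCommitment(c Point, gi Point, oldVal, newVal Scalar) Point {
	var diff Scalar
	diff.Sub(&newVal, &oldVal) // difference between the new and old entry (modular in the real field)

	var delta Point
	delta.ScalarMul(&gi, &diff) // (newVal - oldVal) * g_i

	var updated Point
	updated.Add(&c, &delta) // add the delta to the existing commitment
	return updated
}
```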
# Benchmarks

The theoretical description/analysis above is useful to understand how the different cryptographic primitives used in MPTs and VKs would impact performance. In this section, we take a more pragmatic approach to measuring their differences using benchmarks.

Creating benchmarks allows us to validate the expected performance differences and to detect whether the implementations have room for improvements unrelated to fundamental complexity (e.g., memory allocations, GC pressure, etc.).

All the benchmarks are run on the same machine:

- CPU: AMD Ryzen 7 3800XT
- RAM: 16 GiB
- OS: Ubuntu

This section assumes you've read the *Rolling up from children nodes to parent nodes* section.

A good first exploration is understanding the performance difference between calculating a Keccak hash and a Pedersen Commitment. We'll do two separate benchmarks:

- Keccak vs. full Pedersen Commitment: "full" means that we don't have any previous vector commitment and thus have to do the full calculation.
- Keccak vs. single-entry update of a Pedersen Commitment: here we'll compare the cost of updating a single entry of a vector for which we have a previously computed commitment.

All the benchmarks live in the [keccak-vs-pedersen GitHub repo](https://github.com/jsign/keccak-vs-pedersen). I'll reference this repo when explaining each benchmark next.

A final important note: all the benchmarks use the `go-verkle` and `go-ipa` implementations, which are the ones that will be used in Geth. These repos are still a work in progress, and there's a chance that they can still be optimized (discovering that is part of my goal!).

## Keccak vs. full Pedersen Commitments

With this benchmark, we'll try to unpack with real numbers an important fact we discovered in our analysis: computing a Pedersen Commitment should have a cost linear in the number of non-zero entries in the vector.

### Where is this relevant?

This benchmark runs a full Pedersen Commitment on a 256-element vector. This usually happens when we add a new leaf to the trie. Updating a leaf **does not** have to do this work, since it can use another trick given that Pedersen Commitments are additively homomorphic (this will be a separate benchmark later).

Adding leaves to a trie happens in the following *real-world* cases:

- EVM execution has to store a new value with a *stem* that didn't exist before.
- More heavily, when migrating data from an MPT to a VK. This *might* happen as part of the migration process.

As we'll see in the benchmark results, the cost depends on how many values of the new leaf are non-zero, since the cost is linear in that number. To understand the average case in a real-world scenario, we'd have to know a realistic distribution. For example, looking at the [EIP](https://notes.ethereum.org/@vbuterin/verkle_tree_eip) that explains how *account* data gets mapped into Verkle Tries, we can identify some interesting cases:

- For EOA addresses, we'll store five values in the vector.
- For SC addresses, we'll probably fill up more than 64 values, since at least part of the contract's bytecode will be in the vector. This is closer to a best case (in amortized terms).

Considering the above, we'd expect adding a leaf for an EOA address to be less costly than for an SC address.

### Measurements

First, I created a Go benchmark to get a rough sense of Keccak performance; you can see the [code for this benchmark here](https://github.com/jsign/keccak-vs-pedersen/blob/main/bench_test.go#L17-L40):

```bash
goos: linux
goarch: amd64
pkg: github.com/jsign/keccak-vs-pedersen
cpu: AMD Ryzen 7 3800XT 8-Core Processor
BenchmarkKeccak/hash_a_single_32_byte_blob-16         903310          1281 ns/op         480 B/op        2 allocs/op
```

We can say that the **cost is ~1.3μs**.

For *Pedersen Commitments*, [I created a benchmark that](https://github.com/jsign/keccak-vs-pedersen/blob/main/bench_test.go#L42-L81):

- Creates a leaf node with a stem and N non-zero values.
- Computes the leaf commitment.
- Computes the leaf hash (i.e., commitment EC point → field element).

Let's first see the benchmark results; I'll explain what they mean afterward:

```bash
goos: linux
goarch: amd64
pkg: github.com/jsign/keccak-vs-pedersen
cpu: AMD Ryzen 7 3800XT 8-Core Processor
BenchmarkPedersen/vec_num_entries=1-16             32173         38825 ns/op        38825 ns/value
BenchmarkPedersen/vec_num_entries=4-16             19666         59960 ns/op        14990 ns/value
BenchmarkPedersen/vec_num_entries=16-16             8058        143572 ns/op         8973 ns/value
BenchmarkPedersen/vec_num_entries=64-16             2299        503363 ns/op         7865 ns/value
BenchmarkPedersen/vec_num_entries=256-16             608       1902948 ns/op         7433 ns/value
```

To understand the results:

- 1st column: the name of the test shows which case of N we're benchmarking (e.g., `BenchmarkPedersen/vec_num_entries=64-16` is benchmarking a 256-vector with 64 non-zero entries).
- 3rd column: the number of nanoseconds the computation took.
- 4th column: the number of nanoseconds per entry (i.e., `ns/op / N`). Said differently, an amortized cost per non-zero entry (more about this later).

Let's make some observations about things we originally expected:

- When the number of non-zero entries (N) increases, computing the commitment takes more time. This was expected since the `go-ipa` implementation skips zero entries to avoid doing elliptic curve operations.
- The cost is ~*O(N)* (N = number of non-zero values). It's not perfectly linear since calculating the field element from the elliptic curve point (i.e., the result of the Pedersen Commitment) is a fixed-cost operation.
- If our 256-vector has more than 16 non-zero values, the amortized cost per non-zero value is roughly the same. We'll see in the plot right below why this is relevant.

Let's aggregate all this information in a plot to compare it better:

![](https://i.imgur.com/EG74Sp4.png)

This plot isn't very useful to analyze since Pedersen Commitments are designed to work on a batch of values, not a single value as Keccak does. Calculating the Pedersen Commitment of a 256-element vector where all values are non-zero will be more costly than a Keccak hash of a single 32-byte value.

Let's see a plot of the *amortized cost per value* to make the comparison fairer. That is, if a 256-element vector where all elements are non-zero took T nanoseconds, then the *cost per value* is roughly T/256:

![](https://i.imgur.com/60CRj8S.png)

We can calculate the overhead of Pedersen Commitments in each case compared to Keccak (i.e., `PedersenCommitmentTime/KeccakTime`):

- Pedersen with one entry: ~30x compared to Keccak.
- Pedersen with four entries: ~12x compared to Keccak.
- Pedersen with 16 entries: ~7x compared to Keccak.
- Pedersen with 32 entries: ~6x compared to Keccak.
- Pedersen with 256 entries: ~5.8x compared to Keccak.

### Where is the main cost in this case?

If we analyze the CPU profile of the benchmark for the case of a vector with 256 non-zero elements:

![](https://i.imgur.com/oAvXzxw.png)
(Open the image in a new tab to see it bigger)

Most of the work is done in the `MixedAdd(...)` function used in the Pedersen Commitment calculation with the precomputed Lagrange tables. Ultimately, this means that most of the time the CPU is doing the field multiplications (`fp.mul`) underlying the elliptic curve operations. The underlying `fp.mul` implementation looks optimized since [it's written in assembly](https://github.com/crate-crypto/go-ipa/blob/master/bandersnatch/fp/element_mul_amd64.s#L45-L334).
This code comes from a Consensys project, so it might be worth checking whether there's a newer, more optimized version.

## Keccak vs. single entry update of a Pedersen Commitment

In this benchmark, we'll measure how costly it is to update an existing commitment of a 256-element vector. This cost differs from the full commitment calculation since we can exploit the additively homomorphic property of Pedersen Commitments.

### Where is this relevant?

This benchmark is relevant for any case where we need to update a leaf that already exists in a VK. This usually happens when the EVM executes a transaction and has to save values that live in existing leaves of the trie.

### Measurements

[I created a benchmark](https://github.com/jsign/keccak-vs-pedersen/blob/main/bench_test.go#L83-L129) that:

- Setup:
  - Prepares a leaf with 256 non-zero values (this work is outside the benchmark loop).
  - Prepares a new random `[32]byte` value to be inserted in the leaf.
- Benchmark:
  - Inserts the new value in the leaf (i.e., same stem, different last byte).

Here are the results:

```bash
BenchmarkUpdatePedersen/vec_num_entries=1-16               86296             13121 ns/op
BenchmarkUpdatePedersen/vec_num_entries=4-16               84475             13498 ns/op
BenchmarkUpdatePedersen/vec_num_entries=16-16              82741             14370 ns/op
BenchmarkUpdatePedersen/vec_num_entries=64-16              84075             14096 ns/op
BenchmarkUpdatePedersen/vec_num_entries=256-16             84146             14041 ns/op
```

We can conclude that the cost is **~14μs**, and it doesn't seem to depend much on the number of vector elements that aren't zero (i.e., roughly constant time). Note that, as we benchmarked earlier, Keccak takes around 1.3μs, so **updating a Pedersen Commitment costs around ~11x as much**.

Updating more than one value in the same leaf at the same time probably has a roughly linear cost in the number of updated values, since we have to do EC operations for each updated value and aggregate them into the previous commitment.

### Where is the main cost in this case?

Most of the CPU work comes from calculating field element inverses when converting elliptic curve points to field elements:

![](https://i.imgur.com/xhFqybs.png)
(Open the image in a new tab to see it bigger)

If we allowed caching the field element representation of a commitment, all this cost could be avoided. The tradeoff would be:

- What we gain: avoiding ~80% of the CPU work when updating the value of a leaf.
- What we lose: storing an extra 32 bytes **per node** of the trie, which can be somewhat significant.

The best way to know whether the tradeoff makes sense is to see how large this CPU work is relative to other work in the relevant use case. For example, if EVM execution is the main use case triggering leaf updates, we'd have to see whether saving these 14μs justifies paying the extra 32 bytes, mostly because those 32 bytes are greatly amplified by the number of nodes in the trie. I'd expect it probably isn't worth it, but it might be worth considering.

Apart from CPU, I see that this *diff updating* does some memory allocations that could be avoided. I'll probably try some code changes and run some memory-focused benchmarks.
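Before moving on, here's a small, self-contained sketch of the *map to base field* step whose inversions dominate the profile above, using `math/big` and a toy modulus instead of the real go-ipa field type. The single modular inversion is the expensive part, and it's exactly the work that caching the field-element form of a commitment would avoid.

```go
package main

import (
	"fmt"
	"math/big"
)

// toyModulus is a small prime standing in for the real Bandersnatch base field modulus.
var toyModulus = big.NewInt(10007)

// mapToBaseField sketches the "map it to the base field" step: given the X and Y
// projective coordinates of a commitment point, compute X/Y in the field. The field
// inversion of Y is the dominant cost shown in the CPU profile above.
func mapToBaseField(x, y *big.Int) *big.Int {
	yInv := new(big.Int).ModInverse(y, toyModulus) // the costly part: one field inversion
	res := new(big.Int).Mul(x, yInv)
	// Caching this field-element form alongside the commitment would avoid redoing
	// the inversion on every leaf update, at the cost of ~32 extra bytes per node.
	return res.Mod(res, toyModulus)
}

func main() {
	fmt.Println(mapToBaseField(big.NewInt(1234), big.NewInt(5678)))
}
```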
# Next steps

I'm considering doing some extra benchmarks that are more end-to-end, like:

- [Leveraging the `./benchs` benchmarks in go-verkle](https://github.com/gballet/go-verkle/tree/master/benchs), since they can provide a more general idea of inserting new leaves into a pre-existing trie.
- Running this [draft branch from go-ethereum](https://github.com/ethereum/go-ethereum/pull/24906) that migrates an existing node's state to a Verkle Trie, and seeing what we might discover to optimize.

Even though most of the benchmarks are CPU-bound, we can probably make the implementation more memory efficient, since memory sometimes becomes the dominant factor in end-to-end scenarios.

Let me know your thoughts!