Verkle Trees - An exploration of tree branches attacks

# Verkle Trees - An exploration of tree branches attacks [Note: this document is under review and might be changing in the next days] [Follow-up: Please look at [Gottfried document](https://notes.ethereum.org/hmqCk1tiTq6TdrxO_CKhuw) for a more formal approach to the topic] This document explores an attempt to attack a particular branch in a Verkle Tree. By "attack" I mean making that branch have a depth higher than expected. Attacking tree branches isn't a new idea, and it's a known attack vector for any self-balancing tree based on hashes (e.g: the current Merkle Patricia Tree (MPT)). The Ethereum protocol [plans to change](https://verkle.info/#332fc5e880bc452d96477abac0a2f164) from a Merkle Patricia Tree to a Verkle tree (VKT). Thus, this tree new design motivates re-exploring this attack vector since: - VKT uses another kind of cryptography compared to MPT (i.e., elliptic curve vs hashes). - VKT has another "shape," which leads to another expected depth compared to an MPT for the same number of key-values. (This is oversimplifying other facts, such as the MPT design having multiple trees compared to a single tree in VKT) Some important notes about this document: - It tries to be as concrete as possible for VKTs. It isn't a generic exploration of attacking tree branches. - I'm not an expert in cryptography or security. I know the VKT design, implementation, and cryptography to make a reasonable attempt at this exploration. This attack vector can have severe implications for Ethereum security and stability, so any "final call" about its security should ideally be reviewed/analyzed by experts in the field if they sense this requires a more formal analysis. This document presents a reasonable attempt and invites/nerd-snipe experts if they think a serious risk must be addressed. The rest of the document is organized in three main sections: - `Context: Why is this attack vector a problem?`: gives some intro to the reader about the discussed problem. - `Attacker strategy and PoC`: tries to go through what an attacker strategy might be and show some numbers on a developed proof of concept of that strategy. - `Attack impact benchmarks`: show some benchmarks on how branches of different depths affect some relevant metrics for the Ethereum clients and protocol. ## Context: Why is this attack vector a problem? You should skip this section if you know why making branches longer than average is a problem. The impact of a branch that has a higher depth than average tries to break a design expectation of having a balanced tree, which has some unwanted side effects: - The _execution witness_$^1$ for a block would be larger than average for blocks that contain this branch. This happens because we must introduce more internal node commitments in the block witness for verifications to rebuild the (partial) pre-state tree. - The computational cost of calculating the new state root would be higher if tree writes touch this branch since the path from the LeadNode and root node contains more internal nodes. Those are simply effects of branches with higher depth. The critical question is how costly it is for an attacker to generate an attack that would be meaningful (negatively) for the Ethereum protocol. For example, if it costs 100 billion dollars to affect Ethereum client performance by 1%, we can ignore the attack vector. ## Attacker strategy and PoC To make this document educational, let's review some points an attacker can consider for their attack. Note this might be a naive-ish attacker, and depending on the resources, it might attempt a more distributed and specialized attack. ### Exploit the fact that a VKTs doesn't use _extension nodes_ The Verkle Tree design doesn't have the concept of _extension nodes_. This means the tree doesn't try to avoid long branches by compressing shared bytes in tree keys in the same branch (as with MPTs). For example, let's start with an empty tree and insert the key `0x0000000000000000000000000000000000000000000000000000000000000001`. Now imagine how the tree looks like doing the following operations: - If we insert `0x0100000000000000000000000000000000000000000000000000000000000001`: we'd have two branches, each of depth 1. - If we insert `0x0001000000000000000000000000000000000000000000000000000000000001`: we'd have two branches, each of depth 2. - If we insert `0x0000000000000000000100000000000000000000000000000000000000000001`: we'd have two branches, each of depth 10. In summary, the length in bytes of the shared prefix +1 will be the depth of the branch for both keys. From an attacker's perspective, this is convenient since we only need to insert one appropriately crafted key to increase the depth of the branch, and not many if there were _extension nodes_ to avoid "prefix-compression". This is also convenient since we only need to pay for a single SSTORE to produce the effect in a branch. Note that SSTORE costs might be negligible compared to the computational cost of finding the correct key to insert. ### Optimize the computation to generate keys The attack cost is related to the computational cost of finding a tree key with a matching prefix length for the target branch to be attacked. Let's go through some basic ideas that come to mind. #### Can the attacker bias the tree-key algorithm to gain an advantage? Trees usually use _tree key hashing_ to generate a self-balanced tree automatically. The _hashing_ function can have many shapes and forms, but if the hash is cryptographically secure, it shouldn't be biasable by the user (i.e: potential attacker). The kind of _tree key hashing_ used in Verkle Trees is called _Pedersen Hashing_ or could be described by a _multiscalar multiplication on a fixed basis_. Proving this hashing function would produce a uniform and unbiased result is outside the scope of this document. Out of curiosity and double-checking, I ran a large sampling months ago and verified that the results were uniform. The unbiasibility can't be proven by any statistical analysis, since it requires a more formal mathematical argument. To conclude about this point for this document, we assume an attacker can't manipulate the hashing function inputs in any strategic way to gain an advantage compared to a random/linear scan. #### Customize our tree key calculation strategy As mentioned before, Verkle Trees doesn't use any _mainstream_ cryptographic hash function that has specialized instructions in CPUs. If that was the case, then there might not be better options than "just use [X CPU instruction](https://en.wikipedia.org/wiki/Intel_SHA_extensions)" which is probably the best you can do compared with other strategies. In the case of Verkle Trees, the _tree key_ calculation is a multiscalar multiplication of a fixed basis of length 5. Everything narrows down to how we can perform this operation as fast as possible with minimal work. The short [description](https://notes.ethereum.org/@vbuterin/verkle_tree_eip#Header-values) of the _tree key_ calculation is: ``` def pedersen_hash(inp: bytes) -> bytes32: assert len(inp) <= 255 * 16 # Interpret input as list of 128 bit (16 byte) integers ext_input = inp + b"\0" * (255 * 16 - len(inp)) ints = [2 + 256 * len(inp)] + \ [int.from_bytes(ext_input[16 * i:16 * (i + 1)]) for i in range(255)] return compute_commitment_root(ints).serialize() def get_tree_key(address: Address32, tree_index: int, sub_index: int): # Asssumes VERKLE_NODE_WIDTH = 256 return ( pedersen_hash(address + tree_index.to_bytes(32, 'little'))[:31] + bytes([sub_index]) ) ``` Or more compactly as: ``` tree_key = PointToHash(k*G0+addr_low*G1+addr_high*G2+treeIndexLow*G3+treeIndexLow*G4) ``` where: - `k` is a fixed scalar - `xxx_(low|high)` refers to the top and bottom 16-bytes of an address and `treeIndex`. - `PointToHash(...)` transforms the elliptic curve point result of the multiscalar multiplication into bytes (which we want a prefix length match a target attacked key). The above logic for _tree key_ hashing is used in both: 1. Generate tree key addresses for accounts. In these leaf nodes, we store the _account header_ values (i.e.: nonce, balance, code hash, code size, first 64 storage slot values, and first 128 code bytes). 2. Generate tree key addresses for contract storage slots. These leaf nodes that hold storage slot values greater than 64. Thus, an attacker can try finding an Ethereum address that generates the desired tree key prefix or fixes a contract address and brute-forcing the storage slot. The latter is better since we can fix a contract address that will do the `SSTORE`, and then make a contract call providing the "convenient" storage slot. This will make the tree write and fire the branch depth attack. So, everything comes down to doing "scan" on the `treeIndex`, and _work backwards_ to generate a proper storage slot (I'll mention later how to transfrom a `treeIndex` to (one possible) `storageSlot`) Despite the `treeIndex` maps to two scalar 16-byte values, we only need to do a single scalar multiplication per brute force iteration. The lower 16-bytes give us plenty of search space. The central question then is: how we can make this scalar multiplication and point addition as fast as possible? My take on this PoC attack leverages some code that we did for [go-ipa](https://github.com/crate-crypto/go-ipa), the cryptography library to be used by geth for VKTs: - Use a precomputed table only for the point in question to brute force. We don't need precomputed tables for the rest of the points for all the brute force attacks. Using less memory means fewer cache misses. - The window width used is 16 bits. Considering the scalar is 128 bits, we have up to 8 windows to operate, which is more than enough for any reasonable attempt. (Considering we only need precomputed tables for one point, we could use wider tables if RAM allows). - Distribute the precomputed tables brute forcing in all available cores such that there's no overlap to waste effort. - Always do linear scans in the table to leverage CPU memory prefetching. - The precomputed tables already use extended normalized points such that the CPU cost in finite field multiplications is as tiny as possible by doing "mixed additions". This strategy is at least much less naive than doing any other kind of "double and add", Pippenger style algorithm, or similar. Still, there might be better strategies that maybe allow us to do even less work per iteration. ***Gory details note***: An implementation (such as the PoC I did) should consider that not every `treeIndex` value maps to a storage slot. In the tree key calculation, for most storage slots, we shift by a `MAIN_STORAGE_OFFSET` constant (256**31), which should be considered if we want to be able to map the `treeIndex` finding to a storage slot. I'm just mentioning this if anyone gives a detailed look into the implementation, it will see that I fix some high-level bytes of `treeIndex` such that subtracting `MAIN_STORAGE_OFFSET` never underflows a `uint256`. If you're not reading details at that level, you can ignore this paragraph since it doesn't change anything about the important points. ### Attack proof of concept I'll avoid getting into implementation details, but the PoC code is open, and the source can be found in a single commit [here](https://github.com/crate-crypto/go-ipa/commit/63ce3626b65ceec20e3d1d4ef268d22b24fec851#diff-684c4008d4c062ddcba525dd3217886ed956bf6b97eecf4de754cad194744882). There I added some extra comments if you're interested in more details. Still, it's not necessary to continue reading the document. The attack code receives a contract address and a desired prefix and uses the precomputed tables to find a lower set of 48 bits for a `treeIndex` that matches that prefix. Using the returned `treeIndex` an attacker can generate a valid storage slot by doing `storage_slot = treeIndex * 256 - 256**31` (I'm just reversing the `storageSlot -> treeIndex` formula, [see here](https://notes.ethereum.org/@vbuterin/verkle_tree_eip#Storage)). The program output has lines of this style: ``` Attack for prefix length A took Bms (treeIndex=C) ``` This means that for some defined arbitrary contract address: > Finding a storage slot that would result in a tree key with some (arbitrarily) defined prefix of length A took B milliseconds, and the corresponding tree index to be used in the attack is C. Before showing the results, it's worth noting some things: - I've used my desktop computer for the run, which has a [AMD Ryzen 7 3800XT 8-Core Processor](https://www.amazon.com/AMD-Ryzen-3800XT-16-Threads-Processor/dp/B089WCXZJC) CPU. This is a desktop CPU and is not optimized for computing. - This CPU supports ADX instructions, which are leveraged by gnark to do faster finite field multiplication. These ADX instructions aren't a "high-end" feature but are available in most modern CPUs. However, we shouldn't assume most Ethereum nodes have CPUs supporting this instruction set. Running logs: ``` Attack for prefix length 1 took 549.243719ms (treeIndex=1766847064778384329583297500742918515827483896875618958121606227062489100) Attack for prefix length 2 took 673.211692ms (treeIndex=1766847064778384329583297500742918515827483896875618958121606265717203103) Attack for prefix length 3 took 18.494104223s (treeIndex=1766847064778384329583297500742918515827483896875618958121606252841209213) Attack for prefix length 4 took 4m38.843584303s (treeIndex=1766847064778384329583297500742918515827483896875618958121606252967438367) ``` Looking for each extra depth implies the search space gets bigger by 256 (i.e: one extra byte in the prefix). The _expected_ growth per extra depth should increase by a similar factor. ## Attack impact benchmarks In the previous section, we explored how hard it is to generate these branches. The other side of the coin is, assuming we have branches of depth X, how much this affects the Ethereum protocol operation. ### Updating the root node of a tree for different depths As mentioned earlier, the longer a branch is, the more commitments we need to update to propagate the change from the leaf node to the root node. I've created a [synthetic benchmark](https://github.com/ethereum/go-verkle/commit/cc399a7e1f683e1ccb6919138b70c26c1a55ca75) that creates a tree with all potential depths and measures how much time it takes to update all the required commitments to calculate the new root hash: ``` $ go test ./... -run=none -bench=BenchmarkRootUpdatePerDepth goos: linux goarch: amd64 pkg: github.com/ethereum/go-verkle cpu: AMD Ryzen 7 3800XT 8-Core Processor BenchmarkRootUpdatePerDepth/depth=2-16 63417 16927 ns/op BenchmarkRootUpdatePerDepth/depth=3-16 52216 22509 ns/op BenchmarkRootUpdatePerDepth/depth=4-16 41318 28787 ns/op BenchmarkRootUpdatePerDepth/depth=5-16 33708 34838 ns/op BenchmarkRootUpdatePerDepth/depth=6-16 28634 41061 ns/op BenchmarkRootUpdatePerDepth/depth=7-16 25080 47159 ns/op BenchmarkRootUpdatePerDepth/depth=8-16 21559 53613 ns/op BenchmarkRootUpdatePerDepth/depth=9-16 19946 59733 ns/op BenchmarkRootUpdatePerDepth/depth=10-16 17953 66157 ns/op BenchmarkRootUpdatePerDepth/depth=11-16 16119 74071 ns/op BenchmarkRootUpdatePerDepth/depth=12-16 14284 83598 ns/op BenchmarkRootUpdatePerDepth/depth=13-16 13365 89195 ns/op BenchmarkRootUpdatePerDepth/depth=14-16 12264 96942 ns/op BenchmarkRootUpdatePerDepth/depth=15-16 10000 105721 ns/op BenchmarkRootUpdatePerDepth/depth=16-16 9766 113275 ns/op BenchmarkRootUpdatePerDepth/depth=17-16 9356 122935 ns/op BenchmarkRootUpdatePerDepth/depth=18-16 8570 131973 ns/op BenchmarkRootUpdatePerDepth/depth=19-16 8641 139550 ns/op BenchmarkRootUpdatePerDepth/depth=20-16 7848 145565 ns/op BenchmarkRootUpdatePerDepth/depth=21-16 7461 154769 ns/op BenchmarkRootUpdatePerDepth/depth=22-16 7161 163696 ns/op BenchmarkRootUpdatePerDepth/depth=23-16 6901 169980 ns/op BenchmarkRootUpdatePerDepth/depth=24-16 6568 177983 ns/op BenchmarkRootUpdatePerDepth/depth=25-16 5989 184123 ns/op BenchmarkRootUpdatePerDepth/depth=26-16 6134 192775 ns/op BenchmarkRootUpdatePerDepth/depth=27-16 5916 203081 ns/op BenchmarkRootUpdatePerDepth/depth=28-16 5480 211380 ns/op BenchmarkRootUpdatePerDepth/depth=29-16 5475 217600 ns/op BenchmarkRootUpdatePerDepth/depth=30-16 5299 226398 ns/op BenchmarkRootUpdatePerDepth/depth=31-16 5023 231598 ns/op ``` As expected, the time it takes is linear, depending on the depth of the branch. In the case of this particular CPU, it's around 6μs. These numbers might be considered on the _high-end_ spectrum since: - [go-verkle](https://github.com/ethereum/go-verkle) and [go-ipa](https://github.com/crate-crypto/go-ipa) have gone through a reasonable amount of optimization iterations in their cryptography. - The underlying library [gnark-crypto](https://github.com/Consensys/gnark-crypto) performs very well for finite field operations compared to other libraries. - gnark-crypto supports ADX instructions (if available in the CPU) to optimize finite field multiplications. My CPU supports those instructions, so it leverages this optimization. might be only available in high-end desktop or more modern CPUs). To have an idea on the lower end of hardware specs, I've run the same benchmark on a [Rock5B](https://radxa.com/products/rock5/5b): ``` $ go test ./... -run=none -bench=BenchmarkRootUpdatePerDepth goos: linux goarch: arm64 pkg: github.com/ethereum/go-verkle BenchmarkRootUpdatePerDepth/depth=2-8 14427 98176 ns/op BenchmarkRootUpdatePerDepth/depth=3-8 7084 166354 ns/op BenchmarkRootUpdatePerDepth/depth=4-8 7959 138160 ns/op BenchmarkRootUpdatePerDepth/depth=5-8 7603 165196 ns/op BenchmarkRootUpdatePerDepth/depth=6-8 5079 212680 ns/op BenchmarkRootUpdatePerDepth/depth=7-8 4562 239260 ns/op BenchmarkRootUpdatePerDepth/depth=8-8 4998 277999 ns/op BenchmarkRootUpdatePerDepth/depth=9-8 3835 293869 ns/op BenchmarkRootUpdatePerDepth/depth=10-8 3534 326106 ns/op BenchmarkRootUpdatePerDepth/depth=11-8 3039 338630 ns/op BenchmarkRootUpdatePerDepth/depth=12-8 3247 360632 ns/op BenchmarkRootUpdatePerDepth/depth=13-8 2976 436670 ns/op BenchmarkRootUpdatePerDepth/depth=14-8 2532 438470 ns/op BenchmarkRootUpdatePerDepth/depth=15-8 2283 449593 ns/op BenchmarkRootUpdatePerDepth/depth=16-8 2526 477057 ns/op BenchmarkRootUpdatePerDepth/depth=17-8 2059 508807 ns/op BenchmarkRootUpdatePerDepth/depth=18-8 1940 538670 ns/op BenchmarkRootUpdatePerDepth/depth=19-8 1926 555768 ns/op BenchmarkRootUpdatePerDepth/depth=20-8 1760 585905 ns/op BenchmarkRootUpdatePerDepth/depth=21-8 1809 627361 ns/op BenchmarkRootUpdatePerDepth/depth=22-8 1579 640538 ns/op BenchmarkRootUpdatePerDepth/depth=23-8 1698 684310 ns/op BenchmarkRootUpdatePerDepth/depth=24-8 1410 725966 ns/op BenchmarkRootUpdatePerDepth/depth=25-8 1532 722825 ns/op BenchmarkRootUpdatePerDepth/depth=26-8 1140 907914 ns/op BenchmarkRootUpdatePerDepth/depth=27-8 1501 816204 ns/op BenchmarkRootUpdatePerDepth/depth=28-8 1444 852239 ns/op BenchmarkRootUpdatePerDepth/depth=29-8 1112 913428 ns/op BenchmarkRootUpdatePerDepth/depth=30-8 1257 952172 ns/op BenchmarkRootUpdatePerDepth/depth=31-8 1280 924851 ns/op ``` Note that a Rock5B cost is <200usd (half the price of the CPU used in the previous run). It has an ARM64 architecture, which might not have all the same tricks as for amd64 in gnark-crypto, and of course, it doesn't have ADX support. Still, it isn't clear if a Rock5B can support both 4844 and VKT new load, but we use it as a lower bound. ## Closing thoughts It's more than obvious to say that a serious attacker won't use a 400usd CPU to attack the Ethereum network. In this PoC, I used precomputed tables focused on a single elliptic curve point. This is far from a naive take, but there might be other more clever ideas depending on the way the search space is walked (e.g: maybe some math tricks with EC to save time) This kind of attack is [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel). Partitioning the search space is very easy, and with little effort, the attacker could make more specialized attacks if it's also willing to use more RAM to have wider windows. If I had a CPU with four times more cores, all times above would be cut by four, too. An attacker might have access to a botnet, supercomputers, or other specialized hardware to fine-tune the computation. Another fact is that the framing of the problem I did here was attacking a single branch, but an attacker could try producing a match on many (heavily used/touched) branches in the tree. This can improve the efficiency of finding a match by a significant factor. From another perspective, as explored in the _Attack impact benchmarks_ section, scenarios of depth up to 15 take less than 100μs (on a high-end CPU) and 450μs (on a very low-end CPU) to recompute the root commitment. The mainnet tree will probably have a depth of 5 or 6, so the "additional overhead" of the attack is much less than the presented numbers. Regarding witness impact size, the logic is similar. A longer branch means that the _execution witness_ will need to provide "extra" commitments (32 bytes) per extra depth compared to the average (e.g: from 5 to 14, means ~288 bytes). The next steps to continue this discussion that I imagine are: - Getting more expert opinions about the cost of producing the attack on a more large scale attacker with more resources. - More feedback from client teams to understand which branch depths start to produce an unacceptable impact on their performance (e.g: they can try doing the same benchmark I did earlier) --- $^1$ - The _execution witness_ is what the stateless client will receive as proof of all the pre-state needed to verify the block. This new construct is one of the main goals of Verkle Trees, since those will be much smaller than one coming from a MPT.