
Verkle Trees - An exploration of tree branches attacks

[Note: this document is under review and might be changing in the next days]

[Follow-up: Please look at Gottfried's document for a more formal approach to the topic]

This document explores an attempt to attack a particular branch in a Verkle Tree. By "attack" I mean making that branch have a depth higher than expected.

Attacking tree branches isn't a new idea; it's a known attack vector for any self-balancing tree based on hashes (e.g., the current Merkle Patricia Tree (MPT)).

The Ethereum protocol plans to change from a Merkle Patricia Tree to a Verkle Tree (VKT). This new tree design motivates re-exploring the attack vector since:

  • VKT uses another kind of cryptography compared to MPT (i.e., elliptic curve vs hashes).
  • VKT has another "shape," which leads to another expected depth compared to an MPT for the same number of key-values. (This is oversimplifying other facts, such as the MPT design having multiple trees compared to a single tree in VKT)

Some important notes about this document:

  • It tries to be as concrete as possible for VKTs. It isn't a generic exploration of attacking tree branches.
  • I'm not an expert in cryptography or security. I know the VKT design, implementation, and cryptography to make a reasonable attempt at this exploration.

This attack vector can have severe implications for Ethereum security and stability, so any "final call" about its security should ideally be reviewed/analyzed by experts in the field if they sense it requires a more formal analysis. This document presents a reasonable attempt and tries to invite/nerd-snipe experts if they think there is a serious risk that must be addressed.

The rest of the document is organized in three main sections:

  • Context: Why is this attack vector a problem?: gives some intro to the reader about the discussed problem.
  • Attacker strategy and PoC: tries to go through what an attacker strategy might be and show some numbers on a developed proof of concept of that strategy.
  • Attack impact benchmarks: shows some benchmarks on how branches of different depths affect relevant metrics for Ethereum clients and the protocol.

Context: Why is this attack vector a problem?

You should skip this section if you already know why making branches longer than average is a problem.

A branch with a higher depth than average breaks the design expectation of a balanced tree, which has some unwanted side effects:

  • The execution witness¹ for a block would be larger than average for blocks that contain this branch. This happens because we must introduce more internal node commitments in the block witness so verifiers can rebuild the (partial) pre-state tree.
  • The computational cost of calculating the new state root would be higher if tree writes touch this branch, since the path between the leaf node and the root node contains more internal nodes.

Those are simply effects of branches with higher depth. The critical question is how costly it is for an attacker to generate an attack that would be meaningful (negatively) for the Ethereum protocol. For example, if it costs 100 billion dollars to affect Ethereum client performance by 1%, we can ignore the attack vector.

Attacker strategy and PoC

To make this document educational, let's review some points an attacker can consider for their attack. Note this might be a naive-ish attacker; depending on their resources, they might attempt a more distributed and specialized attack.

Exploit the fact that VKTs don't use extension nodes

The Verkle Tree design doesn't have the concept of extension nodes. This means the tree doesn't try to avoid long branches by compressing shared bytes in tree keys in the same branch (as with MPTs).

For example, let's start with an empty tree and insert the key 0x0000000000000000000000000000000000000000000000000000000000000001. Now imagine how the tree looks after each of the following operations:

  • If we insert 0x0100000000000000000000000000000000000000000000000000000000000001: we'd have two branches, each of depth 1.
  • If we insert 0x0001000000000000000000000000000000000000000000000000000000000001: we'd have two branches, each of depth 2.
  • If we insert 0x0000000000000000000100000000000000000000000000000000000000000001: we'd have two branches, each of depth 10.

In summary, the length in bytes of the shared prefix +1 will be the depth of the branch for both keys. From an attacker's perspective, this is convenient since we only need to insert one appropriately crafted key to increase the depth of the branch, and not many if there were extension nodes to avoid "prefix-compression".
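To make the rule concrete, here's a tiny Python sketch (the helper name is hypothetical, not part of any VKT implementation) that computes the resulting branch depth for the examples above:

```python
def branch_depth(key_a: bytes, key_b: bytes) -> int:
    # Depth both keys end up at: shared prefix length in bytes, plus one.
    shared = 0
    for a, b in zip(key_a, key_b):
        if a != b:
            break
        shared += 1
    return shared + 1

base = bytes.fromhex("00" * 31 + "01")
print(branch_depth(base, bytes.fromhex("01" + "00" * 30 + "01")))            # 1
print(branch_depth(base, bytes.fromhex("0001" + "00" * 29 + "01")))          # 2
print(branch_depth(base, bytes.fromhex("00" * 9 + "01" + "00" * 21 + "01"))) # 10
```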

This is also convenient since we only need to pay for a single SSTORE to produce the effect in a branch. Note that SSTORE costs might be negligible compared to the computational cost of finding the correct key to insert.

Optimize the computation to generate keys

The attack cost is related to the computational cost of finding a tree key with a matching prefix length for the target branch to be attacked.

Let's go through some basic ideas that come to mind.

Can the attacker bias the tree-key algorithm to gain an advantage?

Trees usually use tree key hashing to generate a self-balanced tree automatically. The hashing function can take many shapes and forms, but if the hash is cryptographically secure, it shouldn't be biasable by the user (i.e., a potential attacker).

The kind of tree key hashing used in Verkle Trees is called Pedersen hashing; it can be described as a multiscalar multiplication over a fixed basis.

Proving that this hashing function produces a uniform and unbiased result is outside the scope of this document. Out of curiosity, and as a double check, I ran a large sampling months ago and verified that the results were uniform. The unbiasability can't be proven by any statistical analysis, since it requires a more formal mathematical argument.

To conclude about this point for this document, we assume an attacker can't manipulate the hashing function inputs in any strategic way to gain an advantage compared to a random/linear scan.

Customize our tree key calculation strategy

As mentioned before, Verkle Trees don't use any mainstream cryptographic hash function with specialized CPU instructions. If that were the case, there might not be a better option than "just use X CPU instruction", which is probably the best you can do compared with other strategies.

In the case of Verkle Trees, the tree key calculation is a multiscalar multiplication of a fixed basis of length 5. Everything narrows down to how we can perform this operation as fast as possible with minimal work.

The short description of the tree key calculation is:

def pedersen_hash(inp: bytes) -> bytes32:
    assert len(inp) <= 255 * 16
    # Interpret input as list of 128 bit (16 byte) integers
    ext_input = inp + b"\0" * (255 * 16 - len(inp))
    ints = [2 + 256 * len(inp)] + \
           [int.from_bytes(ext_input[16 * i:16 * (i + 1)]) for i in range(255)]
    return compute_commitment_root(ints).serialize()

def get_tree_key(address: Address32, tree_index: int, sub_index: int):
    # Assumes VERKLE_NODE_WIDTH = 256
    return (
        pedersen_hash(address + tree_index.to_bytes(32, 'little'))[:31] +
        bytes([sub_index])
    )

Or more compactly as:

tree_key = PointToHash(k*G0 + addr_low*G1 + addr_high*G2 + treeIndexLow*G3 + treeIndexHigh*G4)

where:

  • k is a fixed scalar
  • xxx_(low|high) refers to the bottom and top 16 bytes of the address and treeIndex, respectively.
  • PointToHash(...) transforms the elliptic curve point resulting from the multiscalar multiplication into bytes (whose prefix we want to match the targeted key).

The above logic for tree key hashing is used in both:

  1. Generate tree key addresses for accounts. In these leaf nodes, we store the account header values (i.e., nonce, balance, code hash, code size, the first 64 storage slots, and the first 128 code chunks).
  2. Generate tree key addresses for contract storage slots. These leaf nodes hold the storage slots beyond the first 64.

Thus, an attacker can try finding an Ethereum address that generates the desired tree key prefix, or fix a contract address and brute-force the storage slot. The latter is better since we can fix a contract address that will do the SSTORE, and then make a contract call providing the "convenient" storage slot. This makes the tree write and fires the branch depth attack.

So, everything comes down to doing a "scan" on the treeIndex and working backwards to generate a proper storage slot (I'll mention later how to transform a treeIndex into (one possible) storageSlot).

Although the treeIndex maps to two 16-byte scalars, we only need to do a single scalar multiplication per brute-force iteration, since the lower 16 bytes give us plenty of search space.
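As a toy illustration of this scan, here's a sketch using SHA-256 as a stand-in for the Pedersen hash (the helper names are hypothetical; the real attack replaces the hash call with one EC point addition per candidate):

```python
import hashlib

def fake_tree_key(contract: bytes, tree_index: int) -> bytes:
    # Stand-in for the Pedersen tree-key hash: SHA-256, purely illustrative.
    return hashlib.sha256(contract + tree_index.to_bytes(32, "little")).digest()[:31]

def scan_for_prefix(contract: bytes, target_prefix: bytes, budget: int = 1 << 24) -> int:
    # Linear scan over the low bits of treeIndex until the key prefix matches.
    for tree_index in range(budget):
        if fake_tree_key(contract, tree_index).startswith(target_prefix):
            return tree_index
    raise RuntimeError("no match within the search budget")

contract = b"\x11" * 20
# A 2-byte prefix takes ~256**2 = 65536 tries on average.
tree_index = scan_for_prefix(contract, b"\x00\x00")
```

Each extra prefix byte multiplies the expected number of iterations by 256, which is why making each iteration as cheap as possible matters so much.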

The central question then is: how can we make this scalar multiplication and point addition as fast as possible?

My take on this PoC attack leverages some code that we did for go-ipa, the cryptography library to be used by geth for VKTs:

  • Use a precomputed table only for the point being brute-forced. We don't need precomputed tables for the rest of the points during the brute force. Using less memory means fewer cache misses.
  • The window width used is 16 bits. Considering the scalar is 128 bits, we have up to 8 windows to operate, which is more than enough for any reasonable attempt. (Considering we only need precomputed tables for one point, we could use wider tables if RAM allows).
  • Distribute the brute forcing across all available cores such that there's no overlap wasting effort.
  • Always do linear scans in the table to leverage CPU memory prefetching.
  • The precomputed tables already use extended normalized points such that the CPU cost in finite field multiplications is as tiny as possible by doing "mixed additions".

This strategy is at least much less naive than doing any other kind of "double and add", Pippenger style algorithm, or similar. Still, there might be better strategies that maybe allow us to do even less work per iteration.
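To illustrate the precomputed-table idea, here's a sketch over a toy additive group (integers mod a prime instead of curve points; the window width and scalar size match the PoC, everything else is a simplification):

```python
# Toy additive group: integers mod P, with "point addition" being addition mod P.
# The real code works over an elliptic curve, where table entries are normalized
# (affine) points so each addition is a cheap "mixed addition".
P = (1 << 61) - 1            # toy modulus, not the real curve order
G = 1234567                  # toy "generator"
WINDOW = 16                  # window width in bits, as in the PoC
NUM_WINDOWS = 128 // WINDOW  # 128-bit scalars -> 8 windows

# tables[w][d] = d * 2**(WINDOW*w) * G, so a full scalar multiplication is just
# NUM_WINDOWS table lookups and additions.
tables = []
for w in range(NUM_WINDOWS):
    base = (G * pow(2, WINDOW * w, P)) % P
    tables.append([(d * base) % P for d in range(1 << WINDOW)])

def scalar_mul(k: int) -> int:
    acc = 0
    for w in range(NUM_WINDOWS):
        digit = (k >> (WINDOW * w)) & ((1 << WINDOW) - 1)
        acc = (acc + tables[w][digit]) % P
    return acc

k = 0x0123456789ABCDEF0123456789ABCDEF
assert scalar_mul(k) == (k * G) % P
```

The table-building cost is paid once, and every brute-force iteration afterwards is only 8 lookups and additions; since all tables belong to a single point, they stay small enough to be cache-friendly.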

Gory details note: An implementation (such as my PoC) should consider that not every treeIndex value maps to a storage slot. In the tree key calculation, for most storage slots, we shift by a MAIN_STORAGE_OFFSET constant (256**31), which must be taken into account if we want to map a found treeIndex back to a storage slot. I mention this because anyone looking closely at the implementation will see that I fix some high-level bytes of treeIndex such that subtracting MAIN_STORAGE_OFFSET never underflows a uint256. If you're not reading at that level of detail, you can skip this paragraph since it doesn't change any of the important points.

Attack proof of concept

I'll avoid getting into implementation details, but the PoC code is open, and the source can be found in a single commit here. There I added some extra comments if you're interested in more details. Still, it's not necessary to continue reading the document.

The attack code receives a contract address and a desired prefix, and uses the precomputed tables to find a lower set of 48 bits for a treeIndex that matches that prefix. Using the returned treeIndex, an attacker can generate a valid storage slot by doing storage_slot = treeIndex * 256 - 256**31 (I'm just reversing the storageSlot -> treeIndex formula, see here).
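A small sketch of that reversal (the constants follow the verkle spec; the treeIndex value is taken from the PoC logs below; the helper names are mine):

```python
MAIN_STORAGE_OFFSET = 256 ** 31
VERKLE_NODE_WIDTH = 256

def storage_slot_to_tree_index(storage_slot: int) -> int:
    # Forward mapping used for (main) storage slots.
    return (MAIN_STORAGE_OFFSET + storage_slot) // VERKLE_NODE_WIDTH

def tree_index_to_storage_slot(tree_index: int) -> int:
    # Reversal used by the attack; picks the first slot in the 256-slot group.
    slot = tree_index * VERKLE_NODE_WIDTH - MAIN_STORAGE_OFFSET
    assert slot >= 0, "treeIndex too small to map back to a main-storage slot"
    return slot

# treeIndex taken from the PoC logs below.
tree_index = 1766847064778384329583297500742918515827483896875618958121606227062489100
slot = tree_index_to_storage_slot(tree_index)
assert storage_slot_to_tree_index(slot) == tree_index
```

The assert on non-negative slots is exactly the underflow concern mentioned in the gory details note above.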

The program output has lines of this style:

Attack for prefix length A took Bms (treeIndex=C)

This means that for some defined arbitrary contract address:

Finding a storage slot that would result in a tree key with some (arbitrarily) defined prefix of length A took B milliseconds, and the corresponding tree index to be used in the attack is C.

Before showing the results, it's worth noting some things:

  • I've used my desktop computer for the run, which has an AMD Ryzen 7 3800XT 8-Core Processor. This is a desktop CPU and is not optimized for compute workloads.
  • This CPU supports ADX instructions, which are leveraged by gnark to do faster finite field multiplication. These ADX instructions aren't a "high-end" feature but are available in most modern CPUs. However, we shouldn't assume most Ethereum nodes have CPUs supporting this instruction set.

Running logs:

Attack for prefix length 1 took 549.243719ms (treeIndex=1766847064778384329583297500742918515827483896875618958121606227062489100)
Attack for prefix length 2 took 673.211692ms (treeIndex=1766847064778384329583297500742918515827483896875618958121606265717203103)
Attack for prefix length 3 took 18.494104223s (treeIndex=1766847064778384329583297500742918515827483896875618958121606252841209213)
Attack for prefix length 4 took 4m38.843584303s (treeIndex=1766847064778384329583297500742918515827483896875618958121606252967438367)

Each extra depth level makes the search space 256 times bigger (i.e., one extra byte in the prefix), so the expected time should grow by a similar factor per extra depth.
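Taking the prefix-length-4 measurement at face value, a naive extrapolation (single machine, same code, no algorithmic improvements) looks like:

```python
# Naive extrapolation from the PoC run above: each extra prefix byte multiplies
# the expected search space, and hence the expected time, by 256.
depth4_secs = 4 * 60 + 38.84   # "4m38.8s" measured for prefix length 4
for prefix_len in range(5, 8):
    est_secs = depth4_secs * 256 ** (prefix_len - 4)
    print(f"prefix length {prefix_len}: ~{est_secs / 86400:.1f} CPU-days")
```

This puts prefix length 5 at under a CPU-day on a desktop, which is why parallelism and better hardware (discussed in the closing thoughts) matter so much for this attack's feasibility.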

Attack impact benchmarks

In the previous section, we explored how hard it is to generate these branches. The other side of the coin is, assuming we have branches of depth X, how much this affects the Ethereum protocol operation.

Updating the root node of a tree for different depths

As mentioned earlier, the longer a branch is, the more commitments we need to update to propagate the change from the leaf node to the root node.

I've created a synthetic benchmark that creates a tree with all potential depths and measures how much time it takes to update all the required commitments to calculate the new root hash:

$ go test ./... -run=none -bench=BenchmarkRootUpdatePerDepth
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-verkle
cpu: AMD Ryzen 7 3800XT 8-Core Processor            
BenchmarkRootUpdatePerDepth/depth=2-16             63417             16927 ns/op
BenchmarkRootUpdatePerDepth/depth=3-16             52216             22509 ns/op
BenchmarkRootUpdatePerDepth/depth=4-16             41318             28787 ns/op
BenchmarkRootUpdatePerDepth/depth=5-16             33708             34838 ns/op
BenchmarkRootUpdatePerDepth/depth=6-16             28634             41061 ns/op
BenchmarkRootUpdatePerDepth/depth=7-16             25080             47159 ns/op
BenchmarkRootUpdatePerDepth/depth=8-16             21559             53613 ns/op
BenchmarkRootUpdatePerDepth/depth=9-16             19946             59733 ns/op
BenchmarkRootUpdatePerDepth/depth=10-16            17953             66157 ns/op
BenchmarkRootUpdatePerDepth/depth=11-16            16119             74071 ns/op
BenchmarkRootUpdatePerDepth/depth=12-16            14284             83598 ns/op
BenchmarkRootUpdatePerDepth/depth=13-16            13365             89195 ns/op
BenchmarkRootUpdatePerDepth/depth=14-16            12264             96942 ns/op
BenchmarkRootUpdatePerDepth/depth=15-16            10000            105721 ns/op
BenchmarkRootUpdatePerDepth/depth=16-16             9766            113275 ns/op
BenchmarkRootUpdatePerDepth/depth=17-16             9356            122935 ns/op
BenchmarkRootUpdatePerDepth/depth=18-16             8570            131973 ns/op
BenchmarkRootUpdatePerDepth/depth=19-16             8641            139550 ns/op
BenchmarkRootUpdatePerDepth/depth=20-16             7848            145565 ns/op
BenchmarkRootUpdatePerDepth/depth=21-16             7461            154769 ns/op
BenchmarkRootUpdatePerDepth/depth=22-16             7161            163696 ns/op
BenchmarkRootUpdatePerDepth/depth=23-16             6901            169980 ns/op
BenchmarkRootUpdatePerDepth/depth=24-16             6568            177983 ns/op
BenchmarkRootUpdatePerDepth/depth=25-16             5989            184123 ns/op
BenchmarkRootUpdatePerDepth/depth=26-16             6134            192775 ns/op
BenchmarkRootUpdatePerDepth/depth=27-16             5916            203081 ns/op
BenchmarkRootUpdatePerDepth/depth=28-16             5480            211380 ns/op
BenchmarkRootUpdatePerDepth/depth=29-16             5475            217600 ns/op
BenchmarkRootUpdatePerDepth/depth=30-16             5299            226398 ns/op
BenchmarkRootUpdatePerDepth/depth=31-16             5023            231598 ns/op

As expected, the time it takes is linear in the depth of the branch. On this particular CPU, it's roughly 7μs per additional level.
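That per-level figure can be sanity-checked with a quick least-squares fit over a few of the data points above:

```python
# Per-level cost estimated from the amd64 benchmark output above
# (a subset of the ns/op values, copied verbatim).
depths = [2, 5, 10, 15, 20, 25, 31]
ns_op = [16927, 34838, 66157, 105721, 145565, 184123, 231598]

mean_d = sum(depths) / len(depths)
mean_t = sum(ns_op) / len(ns_op)
slope = sum((d - mean_d) * (t - mean_t) for d, t in zip(depths, ns_op)) \
        / sum((d - mean_d) ** 2 for d in depths)
print(f"~{slope / 1000:.1f}µs per additional level")
```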

These numbers might be considered on the high-end spectrum since:

  • go-verkle and go-ipa have gone through a reasonable amount of optimization iterations in their cryptography.
  • The underlying library gnark-crypto performs very well for finite field operations compared to other libraries.
  • gnark-crypto supports ADX instructions (if available in the CPU) to optimize finite field multiplications. My CPU supports those instructions, so it leverages this optimization (ADX might only be available in high-end desktop or more modern CPUs).

To get an idea of the lower end of hardware specs, I've run the same benchmark on a Rock5B:

$ go test ./... -run=none -bench=BenchmarkRootUpdatePerDepth
goos: linux
goarch: arm64
pkg: github.com/ethereum/go-verkle
BenchmarkRootUpdatePerDepth/depth=2-8              14427             98176 ns/op
BenchmarkRootUpdatePerDepth/depth=3-8               7084            166354 ns/op
BenchmarkRootUpdatePerDepth/depth=4-8               7959            138160 ns/op
BenchmarkRootUpdatePerDepth/depth=5-8               7603            165196 ns/op
BenchmarkRootUpdatePerDepth/depth=6-8               5079            212680 ns/op
BenchmarkRootUpdatePerDepth/depth=7-8               4562            239260 ns/op
BenchmarkRootUpdatePerDepth/depth=8-8               4998            277999 ns/op
BenchmarkRootUpdatePerDepth/depth=9-8               3835            293869 ns/op
BenchmarkRootUpdatePerDepth/depth=10-8              3534            326106 ns/op
BenchmarkRootUpdatePerDepth/depth=11-8              3039            338630 ns/op
BenchmarkRootUpdatePerDepth/depth=12-8              3247            360632 ns/op
BenchmarkRootUpdatePerDepth/depth=13-8              2976            436670 ns/op
BenchmarkRootUpdatePerDepth/depth=14-8              2532            438470 ns/op
BenchmarkRootUpdatePerDepth/depth=15-8              2283            449593 ns/op
BenchmarkRootUpdatePerDepth/depth=16-8              2526            477057 ns/op
BenchmarkRootUpdatePerDepth/depth=17-8              2059            508807 ns/op
BenchmarkRootUpdatePerDepth/depth=18-8              1940            538670 ns/op
BenchmarkRootUpdatePerDepth/depth=19-8              1926            555768 ns/op
BenchmarkRootUpdatePerDepth/depth=20-8              1760            585905 ns/op
BenchmarkRootUpdatePerDepth/depth=21-8              1809            627361 ns/op
BenchmarkRootUpdatePerDepth/depth=22-8              1579            640538 ns/op
BenchmarkRootUpdatePerDepth/depth=23-8              1698            684310 ns/op
BenchmarkRootUpdatePerDepth/depth=24-8              1410            725966 ns/op
BenchmarkRootUpdatePerDepth/depth=25-8              1532            722825 ns/op
BenchmarkRootUpdatePerDepth/depth=26-8              1140            907914 ns/op
BenchmarkRootUpdatePerDepth/depth=27-8              1501            816204 ns/op
BenchmarkRootUpdatePerDepth/depth=28-8              1444            852239 ns/op
BenchmarkRootUpdatePerDepth/depth=29-8              1112            913428 ns/op
BenchmarkRootUpdatePerDepth/depth=30-8              1257            952172 ns/op
BenchmarkRootUpdatePerDepth/depth=31-8              1280            924851 ns/op

Note that a Rock5B costs less than 200 USD (half the price of the CPU used in the previous run). It has an ARM64 architecture, which might not benefit from all the same tricks as amd64 in gnark-crypto, and of course, it doesn't have ADX support. It isn't clear whether a Rock5B can support both the 4844 and VKT load, but we can use it as a lower bound.

Closing thoughts

It goes without saying that a serious attacker won't use a 400 USD CPU to attack the Ethereum network.

In this PoC, I used precomputed tables focused on a single elliptic curve point. This is far from a naive take, but there might be cleverer ideas depending on how the search space is walked (e.g., maybe some EC math tricks to save time).

This kind of attack is embarrassingly parallel. Partitioning the search space is very easy, and with little effort, the attacker could mount more specialized attacks if they're also willing to use more RAM for wider windows. If I had a CPU with four times more cores, all the times above would be cut by four, too. An attacker might have access to a botnet, supercomputers, or other specialized hardware to fine-tune the computation.

Another fact is that I framed the problem here as attacking a single branch, but an attacker could try producing a match on any of many (heavily used/touched) branches in the tree. This can improve the efficiency of finding a match by a significant factor.

From another perspective, as explored in the Attack impact benchmarks section, scenarios of depth up to 15 take less than 100μs (on a high-end CPU) and 450μs (on a very low-end CPU) to recompute the root commitment. The mainnet tree will probably have a depth of 5 or 6, so the "additional overhead" of the attack is much less than the presented numbers. Regarding witness size impact, the logic is similar: a longer branch means the execution witness must provide one extra commitment (32 bytes) per extra level of depth compared to the average (e.g., going from depth 5 to 14 means ~288 extra bytes).
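The witness-size arithmetic above is simply:

```python
# Witness overhead: one extra 32-byte internal node commitment per extra level
# of depth versus the expected depth (example figures from the text above).
expected_depth, attacked_depth = 5, 14
extra_witness_bytes = (attacked_depth - expected_depth) * 32
print(extra_witness_bytes)  # 288
```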

The next steps to continue this discussion that I imagine are:

  • Getting more expert opinions about the cost of producing the attack for a larger-scale attacker with more resources.
  • More feedback from client teams to understand which branch depths start to produce an unacceptable impact on their performance (e.g., they can try running the same benchmark I did earlier).

1 - The execution witness is what a stateless client will receive as proof of all the pre-state needed to verify a block. This new construct is one of the main goals of Verkle Trees, since these witnesses will be much smaller than ones coming from an MPT.