A Theory of Ethereum State Size Management

Written 2021 Feb 12

One of the longest-standing unresolved challenges in the Ethereum protocol is how to manage growing state. Many actions in Ethereum (creating an account, filling a storage slot, sending ETH to a new account…) add objects to the state that all full nodes must store in order to be able to verify or produce new blocks. These actions require a one-time gas cost to be paid by the sender of the transaction, but impose permanent ongoing costs on the network: on the nodes that need to store this new data, and on future nodes that need to download it when syncing.

This is a major imbalance in the system's design, and risks the Ethereum system becoming more and more difficult to use over time as the state fills with no-longer-needed "junk data". The purpose of this document is to describe in more detail what the problem is, and what are some of the paths toward solving it. If we successfully implement a solution, it could pave the way to safely making significant increases to the block gas limit.

This document describes an area of research that is still a work in progress; new ideas and better tradeoffs could be discovered at any time.

Introduction: What is the problem?

State refers to information that a node must hold in order to be able to process new incoming blocks and transactions. It is typically contrasted with history, information about past events which can be held for later rebroadcasting and archiving purposes, but is not strictly needed to continue to process the chain.

In Ethereum, state includes:

Account balances and nonces
Contract code
Contract storage
Consensus-related data (some recent block hashes, uncles, proof-of-stake consensus data such as validator pubkeys and activity records on the beacon chain, etc)

History is made up of older blocks and receipts. There is no opcode in EVM execution that allows you to access old block contents or previous transaction contents or receipt outputs, so a node can safely forget those things and still be able to process new blocks; hence those things are history and not state.

The last item in the above list of types of state, consensus-related data, has already been carefully designed to be limited in size, so we do not need to concern ourselves with it. The first three items, however, are more tricky. These forms of state grow over time, as new users enter the network and create new accounts and new contracts, participate in contracts and receive new tokens for the first time.

Unfortunately, these items often stay in the state long past their usefulness; once a user stops using some application, there is typically some "junk state" that is left lying around forever.

In theory, it is possible for a user to "leave no trace behind". A user could publish only contracts with a clause that calls SELFDESTRUCT to remove the contract after they are done, zero out their token balances, and use a smart contract wallet that passes transactions through an existing externally-owned account so they do not need to create their own (undeletable) EOA.

In practice, however, incentives to do this are tiny, and the complexity of proper state hygiene is too high. In many contracts it does not make sense to empower anyone to be able to call SELFDESTRUCT (people want "unstoppable" applications!), and it would add too much user experience and code complexity. In fact, because of the extremely limited utility of the SELFDESTRUCT opcode and its costly side effects, I am potentially in favor of removing it outright. If we want to truly manage state size, we instead need a solution where no-longer-used "junk state" can be forgotten by the rest of the network by default.

Stateless clients

One category of solution to this problem is the idea of stateless clients (see here for the original post describing this idea, and here for a video presentation). The general principle is that nodes verifying blocks no longer need to store state. Instead, blocks come with proofs (or "witnesses") proving the values of the state that get accessed. As is already the case today, each block contains a "state root" (a kind of hash) that these values can be proved against. The existing Merkle Patricia tree could be used for this, as can more efficient designs such as binary tries or Verkle tries. The witness would also prove the correctness of the new state root after processing the block.
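To make the idea of a witness concrete, here is a minimal sketch of a binary Merkle proof, assuming a toy tree with a power-of-two number of leaves and SHA-256 as a stand-in hash (Ethereum itself uses Keccak-256 and a Merkle Patricia tree, so this is an illustration of the principle, not of the actual protocol). The function names `build_tree`, `make_witness` and `verify_witness` are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Stand-in hash (SHA-256); an assumption made for this sketch only."""
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return all levels of a binary Merkle tree, leaf level first.

    Assumes len(leaves) is a power of two.
    """
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def make_witness(levels, index):
    """Collect the sibling hashes on the path from leaf `index` up to the root."""
    witness = []
    for level in levels[:-1]:
        witness.append(level[index ^ 1])  # sibling of the current node
        index //= 2
    return witness


def verify_witness(root, leaf, index, witness):
    """Recompute the root from a leaf value and its sibling hashes."""
    node = h(leaf)
    for sibling in witness:
        # Even index: node is the left child; odd index: the right child.
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root
```

A verifier that holds only the root can call `verify_witness` on each value a block touches. The witness grows with the depth of the tree rather than with the size of the state, which is why the choice of tree structure matters for witness size.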

There are two forms of statelessness:

Weak statelessness: block producers still need the full state to generate witnesses for blocks that they create, but nodes verifying blocks can be stateless
Strong statelessness: no node needs the full state. Instead, transaction senders would provide witnesses that block producers can aggregate, and would be responsible for storing portions of the state tree needed to generate witnesses for accounts that they care about

Strong statelessness is a very "elegant" solution in that it completely moves responsibility to the users, though to maintain a good user experience in practice there would need to be some kind of protocol created to help maintain state for users that are not personally running nodes or that need to interact with some unexpected account. The challenges with making such protocols are significant. Additionally, all statelessness increases the data bandwidth requirements of the network, and strong statelessness requires transactions to declare which accounts and storage keys they are interacting with (this notion is called access lists).

A more moderate solution: state expiry

The more moderate solutions to this problem can be summarized as being different forms of state expiry. State must be continually accessed in order to remain "active"; state that is untouched for a long time becomes "inactive" (or "expired"). There are many choices for the exact mechanic for how state can be renewed (eg. pre-paying a "rent" fee, or simply touching the account), but the general principle is that unless a state object is renewed explicitly, it is inactivated in some way. Hence, any action that creates new state objects (or refreshes existing ones) only burdens other nodes for a limited period of time, and not, as is currently the case, forever.

Inactive state is, using the definition above, not part of the "state"; clients that want to process blocks, or create blocks, do not need to store inactive state. However, inactive state is not deleted! In all existing state expiry proposals, there is a way to "resurrect" inactive state and make it active again.

The general principle is that active state is treated as it is today, and inactive state is treated with the mechanisms described above for stateless clients. A transaction resurrecting an expired state object is required to provide a proof (or "witness") showing that the object actually is part of the inactive state. In order to be able to generate such proofs, users themselves would need to store and maintain at least the part of the inactive state that corresponds to inactive state objects that they care about.

When to expire

There are a few different approaches to determine when state gets expired. The most common ones are:

Direct rent: some per-block "rent" fee gets charged directly from the balance of each account or other state object; an object is expired when its balance reaches zero
Rent via time-to-live: each state object stores a "time-to-live" value, and this can be extended by paying a fee
Refresh by touching: each state object stores a "time-to-live" value, and this is extended automatically by reading or writing to the account
Everything expires at regular intervals (eg. once every 6 months): this is the ReGenesis proposal
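As a rough illustration of the refresh-by-touching option, the sketch below keeps an expiry block number next to each state object and pushes it forward on every read or write. The class name `ExpiringState`, the constant `TTL_BLOCKS` and its value are hypothetical, and the sketch ignores witnesses and resurrection conflicts entirely.

```python
from dataclasses import dataclass, field

TTL_BLOCKS = 1000  # hypothetical time-to-live, measured in blocks


@dataclass
class ExpiringState:
    values: dict = field(default_factory=dict)      # key -> value
    expires_at: dict = field(default_factory=dict)  # key -> block number

    def is_active(self, key, block):
        return key in self.values and block < self.expires_at[key]

    def write(self, key, value, block):
        # Writing creates or overwrites the object and resets its time-to-live.
        # (A real scheme would first have to handle resurrection conflicts.)
        self.values[key] = value
        self.expires_at[key] = block + TTL_BLOCKS

    def read(self, key, block):
        if not self.is_active(key, block):
            raise KeyError(f"{key!r} is absent or expired; a witness would be needed")
        self.expires_at[key] = block + TTL_BLOCKS  # touching refreshes the object
        return self.values[key]

    def prune(self, block):
        """Drop expired entries from local storage and return them.

        In a real scheme the interested user would keep these in order to
        build a resurrection witness later.
        """
        expired = {k: v for k, v in self.values.items() if block >= self.expires_at[k]}
        for k in expired:
            del self.values[k]
            del self.expires_at[k]
        return expired
```

The rent-via-time-to-live variant would differ only in that `read` leaves `expires_at` alone and a separate fee-paying call extends it.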

I am increasingly a fan of refresh-by-touching, because (i) it avoids requiring applications to add complicated economics to pass on the costs of their own state rent fees to their users, and (ii) it ensures a clear hard bound on active state size at all times (block_gas_limit / cost_to_touch_state_object * time_to_live_period, with the time-to-live period measured in blocks). Having large amounts of state expire at regular discrete intervals (ie. ReGenesis) also has these benefits, and it has interesting tradeoffs. A key benefit is easier expiry (no need to walk through the tree and expire things one by one); a key downside is more variability in how much witness data needs to be provided depending on how far into an epoch you are.
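The bound on active state size can be written out as a small calculation. This is only a sketch: the function name is hypothetical, and the example figures (a 12.5m gas limit, 2400 gas per touch, roughly 13-second blocks and a one-year time-to-live) are illustrative assumptions rather than proposed parameters.

```python
def max_active_objects(block_gas_limit: int, cost_to_touch: int, ttl_blocks: int) -> int:
    """Upper bound on the number of state objects that can be active at once.

    At most block_gas_limit // cost_to_touch objects can be touched per block,
    and anything not touched within the last ttl_blocks blocks has expired.
    """
    touches_per_block = block_gas_limit // cost_to_touch
    return touches_per_block * ttl_blocks


# Illustrative assumptions only.
BLOCKS_PER_YEAR = 365 * 24 * 3600 // 13  # assuming ~13-second blocks
bound = max_active_objects(12_500_000, 2400, BLOCKS_PER_YEAR)
print(f"at most {bound:,} active state objects")
```

The bound is loose, since it assumes every block spends all of its gas touching distinct objects, but it is a hard ceiling that holds no matter how users behave.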

Account-level vs storage-slot-level expiry

State expiry logic can be done at the level of the account, or at the level of individual storage slots. At present, it is my strong preference to do it at the level of individual storage slots. This is because there are many contract accounts that have an unbounded number of storage slots, and where arbitrary users can come in and increase the number of storage slots that the contract is responsible for (airdrops are one example of this that already happens). For an expiry scheme with account-level granularity to actually limit state size, the rent fee would need to be proportional (or the time-to-live extension inversely proportional) to the number of storage slots in a contract. As a result, users would be able to pay a one-time cost to impose a permanent ongoing cost on the contract and its users.

To get around this, contracts would have to either add fairly complex internal logic to "pass on" the costs of storage slots to their users, or redesign themselves to use CREATE2 to create new accounts and use those as storage slots. If contracts do keep using storage slots, they would need to have their own logic to "evict" storage slots that their users are not paying for. Either path leads to a result equivalent to expiry at per-storage-slot granularity. Hence, it is my opinion that we should do state expiry with per-storage-slot granularity; the total complexity from not expiring individual storage slots far exceeds the complexity from expiring them.

However, per-storage-slot granularity is not without its weaknesses: it requires each storage slot to have metadata on when it will expire or if it has already expired, and it means that resurrection conflict issues (see the later section) affect not just accounts but also storage slots.

Removing from the tree vs retiring parts of the tree

One technical dichotomy on which state expiry proposals differ is the "one tree" vs "two trees" divide. Essentially, do we maintain a single state tree as before, but simply mark parts of the state as expired, or do we explicitly remove expired state from the main state tree and move it to a separate tree (or other data structure) that contains only expired objects?

One tree

Active nodes in white, expired nodes in grey.

Note that even intermediate nodes in the tree would be marked as active or inactive (or, more realistically, each node would be marked with an expiry date and activeness could simply be checked against that); this marking would be done at the level of each (leaf and intermediate) node in the tree.

Two trees

Tree containing active state in white, tree containing expired state in grey.

The one-tree approach has the benefit of at least appearing similar to how the tree works today, and having a simple expiring and resurrecting process: the latter procedure would simply require refreshing an expiry-date parameter on each node in the tree, and the former would happen automatically. But it has the flaw that it requires a tree structure that is capable of storing intermediate information in nodes in this way, and does not extend well to Verkle trees. Additionally, it requires the additional primitive of Merkle proofs not just going down to leaf nodes, but also stopping at intermediate points to prove that a portion of the state is expired.

The two-tree approach has the benefit that it works with state accumulators as they exist today in their pure form, without needing per-node metadata. It has the flaw of requiring somewhat deeper changes to the wider protocol to implement, and requiring an explicit procedure to expire part of the state (so expiry is not automatic). It also has the property that it does not carry a built-in solution to the resurrection conflict dilemma (see the next section), requiring a choice of one of two approaches.

Note also that in the two-tree case, the second tree need not be a literal tree. In fact, it is possible to have a design where a state object is resurrected by providing a Merkle proof pointing to the receipt from when the object was expired, together with some cryptographic proof that it has not been resurrected before or re-expired more recently.

Resurrection conflicts

This brings us to a key challenge with state expiry schemes: resurrection conflicts. The concept of a resurrection conflict is as follows. Suppose some account gets created at address A. That account is then expired. Then, a new account gets created at address A (eg. with the CREATE2 opcode to ensure the same address on both creations). Finally, a resurrection of the original account is attempted. What happens?

There are a few possible solutions:

1. Explicit "account merge" procedure: this could be "old account state except ETH is added together", "new account state wins except ETH is added together", or even some custom combining procedure specified in the old account contract code
2. Make resurrection conflicts impossible by eliminating same-address re-creation capability: this could be done by adjusting CREATE2 so that the data hashed to generate the address includes the current year, so an address generated in a future year cannot equal an address generated in a previous year
3. Add a "stub" to the state preventing a new account from being created at that position (note that the one-tree approach described in the previous section does this automatically)
4. Require all new account creations to come with witnesses proving non-prior-expiry: this is in some ways equivalent to stubs, except putting stubs into a separate section of the state that anyone generating transactions that create accounts is required to keep track of

(Note that if we use storage-slot-level expiry, as I think we probably have to, then each of these options would need to be extended to individual storage slots and not just accounts)
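The second option, adding the current year to the CREATE2 address preimage, can be sketched as follows. The first function follows the existing CREATE2 address formula; the second is a hypothetical variant, and how the year would actually be encoded is an assumption of this sketch. Python's standard library has no Keccak-256, so SHA3-256 is used as a stand-in and the resulting addresses will not match real Ethereum addresses.

```python
import hashlib


def keccak_standin(data: bytes) -> bytes:
    """Stand-in for Keccak-256 (SHA3-256 is a different function)."""
    return hashlib.sha3_256(data).digest()


def create2_address(sender: bytes, salt: bytes, init_code: bytes) -> bytes:
    """Shape of today's CREATE2 rule: hash(0xff ++ sender ++ salt ++ hash(init_code))[12:]."""
    preimage = b"\xff" + sender + salt + keccak_standin(init_code)
    return keccak_standin(preimage)[12:]


def create2_address_with_year(sender: bytes, salt: bytes, init_code: bytes, year: int) -> bytes:
    """Hypothetical variant: the creation year is appended to the preimage,
    so the same (sender, salt, init_code) yields a different address each year."""
    preimage = (
        b"\xff" + sender + salt + keccak_standin(init_code) + year.to_bytes(4, "big")
    )
    return keccak_standin(preimage)[12:]
```

With the year in the preimage, re-running the same creation in a later year lands at a different address, so it cannot collide with an account that expired earlier. The price is that an address computed offline is only valid for creation within the year it was computed for.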

The main concern with (1) is that it would add great complexity to applications to require them to add merging logic. The main concern with (2) is that it removes the ability to easily have addresses that can be interacted with and even accumulate assets (eg. think ERC20 tokens) before the account is "registered" on-chain. Unregistered addresses are important: any user that receives ETH for the first time is using an unregistered address. (2) would effectively put a time limit on unregistered addresses, opening up a security risk of users losing their funds if they make a new address and receive funds but forget to send a transaction for a year.

Note that EOAs do not solve this problem. They may seem to, because the merge procedure with EOAs is trivial (just add the old ETH balance to the new, and use some scheme like EIP 169 for nonces). However, they suffer from two problems. First, there is a goal to replace EOAs with contracts with account abstraction, and account-abstracted contracts may well not have trivial merge procedures. Second, the state that would be subject to expiry and resurrection is not only the EOAs themselves, but also storage keys related to applications that the EOAs participate in (eg. ERC20 token balances), and so nontrivial merge logic may once again be required.

Hence, the least disruptive solution so far seems to me to be some form of stubs. However, there is an information-theoretic problem with stubs that leads to stubs having some tricky consequences. In order to serve the role of preventing new state objects from being created at N positions of expired state objects, a set covering those N addresses (and/or storage keys) must be part of the state. If that set covers them minimally (ie. it's just the addresses), that set has size O(N), and therefore the state size is O(N); the size of the active state is proportional to the expired state, so we don't actually solve the problem.

Tree rot

The only way to solve this is to cover more than just those N accounts; in fact, we would have to make entire parts of the tree inaccessible (note once again that this is what the single-tree solution described above does: if two accounts get expired, all the space in between them also implicitly gets expired).

And here lies the problem: this creates a form of "tree rot" where over time entire portions of the tree become inaccessible for new account creation, at least to anyone who is not keeping track of old expired state in that region.

This tree rot leads to secondary problems that must be dealt with. For example, if a contract needs to create child contracts, it must (perhaps with the help of a user-provided "hint") be able to create contracts in regions of the state that are either not rotted, or that the user has witnesses for. One solution to tree rot is this scheme which continually opens up new regions of the state for account creation. Another is for each user to choose some region (eg. 1/256 wide) of the state, keep track of even expired state in that region to be able to create witnesses, and only create accounts in that region.

Another problem of the tree rot approach is that it requires an explicit data structure for storing and checking ranges. A tree that has node-level data on whether or not the portion of the tree below the node is expired (as used in the single-tree solution) does this perfectly, but a key-value store can do this only with some difficulty.
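To illustrate what "storing and checking ranges" could look like outside of a tree, here is a minimal sketch of a sorted set of disjoint key ranges, assuming integer keys and half-open ranges. The class name `ExpiredRanges` is hypothetical, and this is not taken from any existing proposal.

```python
import bisect


class ExpiredRanges:
    """Disjoint half-open key ranges [lo, hi) marked as expired ("rotted")."""

    def __init__(self):
        self._los = []  # sorted range starts
        self._his = []  # matching range ends (also sorted, as ranges are disjoint)

    def add(self, lo, hi):
        """Mark [lo, hi) as expired, merging overlapping or adjacent ranges."""
        i = bisect.bisect_left(self._his, lo)   # first range ending at or after lo
        j = bisect.bisect_right(self._los, hi)  # first range starting after hi
        if i < j:
            # Ranges i..j-1 overlap or touch the new one: absorb them.
            lo = min(lo, self._los[i])
            hi = max(hi, self._his[j - 1])
        self._los[i:j] = [lo]
        self._his[i:j] = [hi]

    def covers(self, key):
        """True if `key` falls in an expired range, so creation there needs a witness."""
        i = bisect.bisect_right(self._los, key) - 1
        return i >= 0 and key < self._his[i]
```

Because adjacent ranges merge, the structure stays small when whole regions rot, in contrast to per-address stubs whose size is O(N) in the number of expired objects. It does, however, have to be maintained alongside the key-value store as a separate structure.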

Another look at strong statelessness

Many of the problems with tree structures in a state-expiry regime can be traced to the fact that we need to have consensus over which state is active and which state is inactive. In the two-tree model, this is most obvious. But even in the one-tree model, we need explicit markers in the tree, so that a node that recently fast-synced the state can determine whether a transaction that attempts to access that account without providing a witness should succeed or fail. But what if we don't need this distinction to be explicit?

We can come to this approach by starting from full statelessness, and then trying to solve the problem of how transaction senders or block producers can reliably get the state needed to generate witnesses. One natural approach is for nodes in the network to store only the portions of the state tree that have been accessed in the last eg. 1 year. This could simply be a voluntary default client setting, though if we want more reliability we could force at least miners (and later PoS validators) to store the data by adding a proof of custody scheme.

There is one caveat. If the consensus layer is not aware of which state is active and which state is inactive, then gas costs for accessing recent and older state would be the same. This implies one of two things:

Gas costs for accessing even recent state would need to be increased further
The maximum size of a block including its witness could be very large if that block is full of accesses of very old state (roughly 800 bytes * 12.5m gas / 2400 gas per access ~= 4.1 MB post-EIP-2929 assuming binary trees)
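The worst-case witness estimate in the second item can be reproduced with a short calculation. The function name is hypothetical; the input figures are the ones quoted above (about 800 bytes of witness per access, a 12.5m gas limit and 2400 gas per access).

```python
def worst_case_witness_bytes(block_gas_limit, gas_per_access, witness_bytes_per_access):
    """Witness size if an entire block is spent accessing old state."""
    accesses = block_gas_limit / gas_per_access
    return accesses * witness_bytes_per_access


size = worst_case_witness_bytes(12_500_000, 2400, 800)
print(f"{size / 1e6:.2f} million bytes")
```

With those inputs the result comes out at a little over 4 million bytes, in line with the rough figure above. Raising the gas cost per access shrinks it proportionally, which is the tradeoff between the two items in the list.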

If we want to avoid these disadvantages, then we would need to track in-consensus whether or not state objects, including regions of not-yet-filled address space, are part of the active state, and this would bring us closer to the properties of a state expiry scheme. This serves to further illustrate that "statelessness vs state expiry (or rent)" is a spectrum and a complicated tradeoff space, not a binary.

Rollups will need, and can use, the same solutions

An important medium-term scalability solution for Ethereum is rollups. However, rollups do not remove the need to worry about state size; in fact, rollups have state size problems of exactly the same type as the Ethereum chain itself.

Fortunately, if we come up with a solution, then at least EVM rollups (which attempt to replicate the Ethereum environment as faithfully as possible) will be able to use the same solution exactly as is to solve their own internal state size problems. Hence, state size management is complementary to rollups, sharding and other scaling strategies.

Conclusions

State size is a growing problem, and a solution to state size could pave the way for significantly higher gas limits. We should move toward agreeing on and implementing some form of state expiry solution. However, there are important technical tradeoffs between these solutions, particularly if we want to preserve important properties of the system's current design.

Some properties that we may need to compromise on include:

The property that you can generate an address offline and receive funds at that address and be able to wait an arbitrarily long time before publishing that address to chain
Addresses being 20 bytes (the rolling state expansion scheme requires a larger address space, though the address length arguably needs to soon be changed for collision-resistance reasons anyway)
The property that the state can be viewed as a "pure" key/value store, and we can avoid storing metadata at each node in the tree
The extent to which existing applications may need to be rewritten to be able to work without users needing to store the full inactive state to generate witnesses
Gas costs or otherwise ease of creating new contracts or filling new storage slots

If we are ready to make sacrifices, there are solutions that could start to be implemented very soon. On the other hand, there is the possibility that over time we can tinker and come up with better combinations of these ideas to reduce the problems, and particularly make it technically easier to implement (eg. allowing the use of "pure" key/value stores). We should get a better understanding of what kinds of sacrifices we are more vs less willing to accept, and continue actively working on improved proposals.

optimalbrew

2021/02/13 05:52:19

contracts would

Another alternative to "pass the cost to end users" is to make transaction senders pay the `cost_to_refresh` every state object touched by their transaction (but only for state objects that "need refreshing"). This can be part of the protocol, not something contract devs would need to implement (Edited)

vbuterin

2021/02/13 09:02:14

Actually, that would not be economically safe. An attacker could pay a one-time-cost to add a storage slot and then go away, and the remaining users would be forced to keep paying for that storage slot forever. (Edited)

2021/02/14 03:12:06

Thanks for pointing out the attack. However, if the data inserted by an attacker is just spam, then transactions need not "touch/refresh" that storage slot at all. If we make devs mindful of the attack, perhaps they can manage the risk

kngo

2021/02/13 23:49:18

Refresh by touching: each state object stores a "time-to-live" value, and this is extended automatically by reading or writing to the account

Would this open doors to malicious writes to state objects that extend "time-to-live" value simply to increase the global state size? (Edited)

Samuel Shadrach

2021/03/11 06:48:00

Refreshing will cost gas

Paul D.

2021/02/15 03:42:44

the

Typo. (Edited)

Robert Drost

2021/02/18 18:49:49

store and maintain

users would need to store and maintain, or otherwise reliably be able to access (i.e. from a state provider network), at least ... (Edited)

2021/02/23 14:30:21

complexity

Do you mean complexity or cost here? as the complexity (from a code and mechanism perspective) of not expiring is quite low. Or maybe I am missing what you mean by complexity in this context? Your second parallel use of complexity in "complexity from expiring them" better agrees with the meaning of complexity as the coding and mechanism are higher (with lower storage cost and higher bandwidth cost). (Edited)

Haissam

2021/08/10 17:15:53

To resolve the issue, the action requires the sender to pay a one-time gas cost. We need to create a unique hash function for the sender's transaction, and the node will store this number to demonstrate that the sender will not be charged gas in future network costs. (Edited)

Guest Brown

2021/12/17 09:22:16

Suppose that account A wants to send 1 coin to account B. How do miners know whether account B is an inactive account or does not exist? If account B does not exist, then miners have to create this account. But if account B is an inactive account, then miners have to make this account active and then increase its balance. (Edited)

Guest Lowe

2023/05/13 20:21:47

In a refresh-by-touching state expiry scheme, the block producer would not need to activate the account. Normally, when one calls an empty account (one that is absent from the state), one of two things happens: if the call sends ether, a new account whose balance is the sent value gets initialized; otherwise, the account remains non-existent.

Guest Lowe

2023/05/13 20:22:07

So, one solution is, the transaction would initialize a new account for that address and when one tries to activate the original account the resurrection conflict would be resolved by an explicit account merge procedure.

Guest Waters

2023/02/23 17:03:35

https://www.coinbase.com/institutional/research-insights/research/weekly-market-commentary/weekly-market-commentary-september-02-2022 (Edited)

hydrogenbond007

2023/04/02 16:40:02

hey kinda confused, I thought some nodes chose to sync from the genesis block. if they can retrieve old block content how do they sync? (Edited)