A Theory of Ethereum State Size Management

# A Theory of Ethereum State Size Management _Written 2021 Feb 12_ One of the longest and still unresolved challenges in the Ethereum protocol is that of how to manage the problem of growing **state**. Many actions in Ethereum (creating an account, filling a storage slot, sending ETH to a new account...) add objects to the state that all full nodes must store in order to be able to verify or produce new blocks. These actions require a **one-time gas cost to be paid by the sender** of the transaction, but impose **permanent ongoing costs on the network**, and the nodes that need to store this new data (and future nodes that need to download it when syncing). This is a major imbalance in the system's design, and risks the Ethereum system becoming more and more difficult to use over time as the state fills with no-longer-needed "junk data". The purpose of this document is to describe in more detail what the problem is, and what are some of the paths toward solving it. If we successfully implement a solution, it could pave the way to safely making significant increases to the block gas limit. This document is describing an area of research that is still a work-in-progress; new and better ideas and better tradeoffs could potentially be discovered at any time. ### Introduction: What is the problem? **State** refers to information that a node must hold in order to be able to process new incoming blocks and transactions. It is typically contrasted with **history**, information about past events which can be held for later rebroadcasting and archiving purposes, but is not strictly needed to continue to process the chain. In Ethereum, state includes: * **Account balances** and **nonces** * **Contract code** * **Contract storage** * **Consensus-related data** (some recent block hashes, uncles, proof-of-stake consensus data such as validator pubkeys and activity records on the beacon chain, etc) History is made up of older blocks and receipts. There is no opcode in EVM execution that allows you to access old block contents or previous transaction contents or receipt outputs, so a node can safely forget those things and still be able to process new blocks; hence those things are history and not state. The last item in the above list of types of state, consensus-related data, has already been carefully designed to be limited in size, so we do not need to concern ourselves with it. The first three items, however, are more tricky. These forms of state grow over time, as new users enter the network and create new accounts and new contracts, participate in contracts and receive new tokens for the first time. Unfortunately, these items often stay in the state long past their usefulness; **once a user stops using some application, there is typically some "junk state" that is left lying around forever**. In theory, it is possible for a user to "leave no trace behind". A user could publish only contracts with a clause that calls `SELFDESTRUCT` to remove the contract after they are done, zero out their token balances, and use a smart contract wallet that passes transactions through an existing externally-owned account so they do not need to create their own (undeleteable) EOA. In practice, however, incentives to do this are tiny, and the complexity of proper state hygiene is too high. In many contracts it does not make sense to empower anyone to be able to call `SELFDESTRUCT` (people want "unstoppable" applications!), and it would add too much user experience and code complexity. In fact, because of the extremely limited utility of the `SELFDESTRUCT` opcode and its costly side effects, I am potentially in favor of [removing it outright](https://hackmd.io/@HWeNw8hNRimMm2m2GH56Cw/selfdestruct). If we want to truly manage state size, we instead need a solution where no-longer-used "junk state" can be forgotten by the rest of the network _by default_. ## Stateless clients One category of solution to this problem is the idea of **stateless clients** (see [here](https://ethresear.ch/t/the-stateless-client-concept/172) for the original post describing this idea, and [here](https://www.youtube.com/watch?v=kbcj1TKXaN4) for a video presentation). The general principle is that nodes verifying blocks no longer need to store state. Instead, blocks come with proofs (or "**witnesses**") proving the values of the state that get accessed. As is already the case today, each block contains a "**state root**" (a kind of hash) that these values can be proved against. The existing Merkle Patricia tree could be used for this, as can more efficient designs such as [binary tries](https://eips.ethereum.org/EIPS/eip-3102) or [Verkle tries](https://github.com/ethereum/research/blob/master/verkle_trie/verkle_trie.py). The witness would also prove the correctness of the new state root after processing the block. There are two forms of statelessness: * **Weak statelessness**: block producers still need the full state to generate witnesses for blocks that they create, but nodes verifying blocks can be stateless * **Strong statelessness**: no node needs the full state. Instead, transaction senders would provide witnesses that block producers can aggregate, and would be responsible for storing portions of the state tree needed to generate witnesses for accounts that they care about Strong statelessness is a very "elegant" solution in that it completely moves responsibility to the users, though to maintain a good user experience in practice there would need to be some kind of protocol created to help maintain state for users that are not personally running nodes or that need to interact with some unexpected account. The challenges with making such protocols are significant. Additionally, all statelessness increases the data bandwidth requirements of the network, and strong statelessness requires transactions to declare which accounts and storage keys they are interacting with (this notion is called **access lists**). ## A more moderate solution: state expiry The more moderate solutions to this problem can be summarized as being different forms of **state expiry**. State must be continually accessed in order to remain "active"; state that is untouched for a long time becomes "inactive" (or "expired"). There are many choices for the exact mechanic for how state can be renewed (eg. pre-paying a "rent" fee, or simply touching the account), but the general principle is that unless a state object is renewed explicitly, it is inactivated in some way. Hence, any action that creates new state objects (or refreshes existing ones) only burdens other nodes for a limited period of time, and not as is currently the case forever. Inactive state is, using the definition above, not part of the "state"; clients that want to process blocks, or create blocks, do not need to store inactive state. **However, inactive state is not deleted! In all existing state expiry proposals, there is a way to "resurrect" inactive state and make it active again**. The general principle is that active state is treated as it is today, and inactive state is treated with the mechanisms described above for stateless clients. A transaction resurrecting an expired state object is required to provide a proof (or "witness") showing that the object actually is part of the inactive state. In order to be able to generate such proofs, users themselves would need to store and maintain at least the part of the inactive state that corresponds to inactive state objects that they care about. ## When to expire There are a few different approaches to determine when state gets expired. The most common ones are: * **Direct rent**: some per-block "rent" fee gets charged directly from the balance of each account or other state object; an object is expired when its balance reaches zero * **Rent via time-to-live**: each state object stores a "time-to-live" value, and this can be extended by paying a fee * **Refresh by touching**: each state object stores a "time-to-live" value, and this is extended automatically by reading or writing to the account * **Everything expires at regular intervals** (eg. once every 6 months): this is the [ReGenesis proposal](https://medium.com/@mandrigin/regenesis-explained-97540f457807) I am increasingly a fan of refresh-by-touching, because (i) it avoids requiring applications to add complicated economics to pass on the costs of their own state rent fees to their users, and (ii) it ensures a clear hard bound on active state size (`block_gas_limit / cost_to_touch_state_object * time_to_live_period`) at all times. Having large amounts of state expire at regular discrete intervals (ie. ReGenesis) also has these benefits, and it has interesting tradeoffs. A key benefit is easier expiry (no need to walk through the tree and expire things one by one), a key downside is more variability in how much witness data needs to be provided depending on how far into an epoch you are. ## Account-level vs storage-slot-level expiry State expiry logic can be done at the level of the account, or at the level of individual storage slots. At present is my strong preference to do it at the level of individual storage slots. This is because there are many contract accounts that have an unbounded number of storage slots, and where arbitrary users can come in and increase the number of storage slots that the contract is responsible for (eg. airdrops are one example of this that happens already). For an expiry scheme with account-level granularity to actually limit state size, the rent fee would need to be proportional (or the time-to-live extension inversely proportional) to the number of storage slots in a contract. As a result, users would be able to pay a one-time cost to impose a _permanent ongoing cost_ to the contract and its users. [To get around this](https://ethresear.ch/t/common-classes-of-contracts-and-how-they-would-handle-ongoing-storage-maintenance-fees-rent/4441), contracts would have to either add fairly complex internal logic to "pass on" the costs of storage slots to their users, or redesign themselves to use CREATE2 to create new accounts and use those as storage slots. If contracts do keep using storage slots, they would need to have their own logic to "evict" storage slots that their users are not paying for. Either path leads to a result equivalent to expiry at per-storage-slot granularity. **Hence, it is my opinion that we should do state expiry with per-storage-slot granularity; the total complexity from not expiring individual storage slots far exceeds the complexity from expiring them.** However, per-storage-slot granularity is not without its weaknesses: it requires each storage slot to have metadata on when it will expire or if it has already expired, and it means that resurrection conflict issues (see the later section) affect not just accounts but also storage slots. ## Removing from the tree vs retiring parts of the tree One technical dichotomy on which state expiry proposals differ is the "one tree" vs "two trees" divide. Essentially, do we maintain a single state tree as before, but simply mark parts of the state as expired, or do we explicitly remove expired state from the main state tree and move it to a separate tree (or other data structure) that contains only expired objects? #### One tree <center> <img src="https://i.imgur.com/Bw5NvJo.png" /> _Active nodes in white, expired nodes in grey._ </center> Note that even intermediate nodes in the tree would be marked as active or inactive (or, more realistically, each node would be marked with an expiry date and activeness could simply be checked against that); this marking would be done at the level of each (leaf and intermediate) node in the tree. #### Two trees <center> <img src="https://i.imgur.com/TNE1Tpt.png" /> _Tree containing active state in white, tree containing expired state in grey._ </center> The one-tree approach has the benefit of at least appearing similar to how the tree works today, and having a simple expiring and resurrecting process: the latter procedure would simply require refreshing an expiry-date parameter on each node in the tree, and the former would happen automatically. But it has the flaw that it requires a tree structure that is capable of storing intermediate information in nodes in this way, and does not extend well to Verkle trees. Additionally, it requires the additional primitive of Merkle proofs not just going down to leaf nodes, but also stopping at intermediate points to prove that a portion of the state is expired. The two-tree approach has the benefit that it works with state accumulators as they exist today in their pure form, without needing per-node metadata. It has the flaw of requiring somewhat deeper changes to the wider protocol to implement, and requiring an explicit procedure to expire part of the state (so expiry is not automatic). It also has the property that it does not carry a built-in solution to the resurrection conflict dilemma (see the next section), requiring a choice of one of two approaches. Note also the in the two-tree case, the second tree need not be a literal tree. In fact, it is possible to have a design where a state object is resurrected by providing a Merkle proof pointing to the receipt when an object is expired together with some cryptographic proof proving that it has not been resurrected before or re-expired more recently. ## Resurrection conflicts This brings us to a key challenge with state expiry schemes: **resurrection conflicts**. The concept of a resurrection conflict is as follows. Suppose some account gets created at address A. That account is then expired. Then, a new account gets created at address A (eg. with the CREATE2 opcode to ensure the same address on both creations). Finally, a resurrection of the original account is attempted. What happens? There are a few possible solutions: 1. **Explicit "account merge" procedure**: this could be "old account state except ETH is added together", "new account state wins except ETH is added together", or even some custom combining procedure specified in the old account contract code 2. **Make resurrection conflicts impossible by eliminating same-address re-creation capability**: this could be done by adjusting CREATE2 so that the data hashed to generate the address includes the current year, so an address generated in a future year cannot equal an address generated in a previous year 3. **Add a "stub" to the state preventing a new account from being created at that position** (note that the one-tree approach described in the previous section does this automatically) 4. **Require all new account creations to come with witnesses proving non-prior-expiry**: this is in some ways equivalent to stubs, except putting stubs into a separate section of the state that anyone generating transactions that create accounts is required to keep track of (**Note that if we use storage-slot-level expiry, as I think we probably have to, then each of these options would need to be extended to individual storage slots and not just accounts**) The main concern with (1) is that it would add great complexity to applications to require them to add merging logic. The main concern with (2) is that it removes the ability to easily have addresses that can be interacted with and even accumulate assets (eg. think ERC20 tokens) before the account is "registered" on-chain. Unregistered addresses are important: any user that receives ETH for the first time is using an unregistered address. (2) would effectively put a time limit on unregistered addresses, opening up a security risk of users losing their funds if they make a new address and receive funds but forget to send a transaction for a year. Note that EOAs do not solve this problem. They may seem to, because the merge procedure with EOAs is trivial (just add the old ETH balance to the new, and use some scheme like [EIP 169](https://github.com/ethereum/EIPs/issues/169) for nonces). However, they suffer from two problems. First, there is a goal to replace EOAs with contracts with [account abstraction](https://github.com/ethereum/EIPs/pull/2938), and account-abstracted contracts may well not have trivial merge procedures. Second, the state that would be subject to expiry and resurrection is not only the EOAs themselves, but also storage keys related to applications that the EOAs participate in (eg. ERC20 token balances), and so nontrivial merge logic may once again be required. Hence, the least disruptive solution so far seems to me to be some form of stubs. However, there is an information-theoretic problem with stubs that leads to stubs having some tricky consequences. In order to serve the role of preventing new state objects from being created at N positions of expired state objects, a set covering those N addresses (and/or storage keys) must be part of the state. If that set covers them minimally (ie. it's just the addresses), that set has size O(N), and therefore the state size is O(N); _the size of the active state is proportional to the expired state_, so we don't actually solve the problem. ### Tree rot The only way to solve this is to cover more than just those N accounts; in fact, we would have to make entire parts of the tree inaccessible (note once again that this is what the single-tree solution described above does: if two accounts get expired, all the space in between them also implicitly gets expired). ![](https://i.imgur.com/0xjhLXs.png) And here lies the problem: this creates a form of "tree rot" where over time entire portions of the tree become inaccessible for new account creation, at least to anyone who is not keeping track of old expired state in that region. This tree rot leads to secondary problems that must be dealt with. For example, if a contract needs to create child contracts, it must (perhaps with the help of a user-provided "hint") be able to create contracts in regions of the state that are either not rotted, or that the user has witnesses for. One solution to tree rot is [this scheme](https://ethresear.ch/t/alternative-bounded-state-friendly-address-scheme/8602) which continually opens up new regions of the state for account creation. Another is for each user to choose some region (eg. 1/256 wide) of the state, keep track of even expired state in that region to be able to create witnesses, and only create accounts in that region. Another problem of the tree rot approach is that it requires an explicit data structure for storing and checking ranges. A tree that has node-level data on whether or not the portion of the tree below the node is expired (as used in the single-tree solution) does this perfectly, but a key-value store can do this only with some difficulty. ## Another look at strong statelessness Many of the problems with tree structures in a state-expiry regime can be traced to the fact that _we need to have consensus over which state is active and which state is inactive_. In the two-tree model, this is most obvious. But even in the one-tree model, we need explicit markers in the tree, so that a node that recently fast-synced the state can determine whether a transaction that attempts to access that account without providing a witness should succeed or fail. But what if we don't need this distinction to be explicit? We can come to this approach by starting from full statelessness, and then trying to solve the problem of how transaction senders or block producers can reliably get the state needed to generate witnesses. One natural approach is for nodes in the network to store only the portions of the state tree that have been accessed in the last eg. 1 year. This could simply be a voluntary default client setting, though if we want more reliability we could force at least miners (and later PoS validators) to store the data by adding a [proof of custody scheme](https://ethresear.ch/t/a-0-001-bit-proof-of-custody/7409). There is one caveat. If the consensus layer is not aware of which state is active and which state is inactive, then gas costs for accessing recent and older state would be the same. This imples one of two things: 1. Gas costs for accessing even recent state would need to be increased further 2. The maximum size of a block including its witness could be very large if that block is full of accesses of very old state (roughly `800 bytes * 12.5m gas / 2400 gas per access ~= 4.1 MB` post-EIP-2929 assuming binary trees) If we want to avoid these disadvantages, then we would need to track in-consensus whether or not state objects, including regions of not-yet-filled address space, are part of the active state, and this would bring us closer to the properties of a state expiry scheme. **This serves to further illustrate that "statelessness vs state expiry (or rent)" is a spectrum and a complicated tradeoff space, not a binary**. ## Rollups will need, and can use, the same solutions An important medium-term scalability solution for Ethereum is [rollups](https://vitalik.ca/general/2021/01/05/rollup.html). However, **rollups do not remove the need to worry about state size; in fact, rollups have state size problems of exactly the same type as the Ethereum chain itself**. Fortunately, if we come up with a solution, then at least EVM rollups (which attempt to maximally faithfully replicate the ethereum environment) will be able to use the same solution exactly as is to solve their own internal state size problems. Hence, state size management is complementary to rollups, sharding and other scaling strategies. ## Conclusions State size is a growing problem, and a solution to state size could pave the way for significantly higher gas limits. We should move toward agreeing on and implementing some form of state expiry solution. However, there are important technical tradeoffs between these solutions, particularly if we want to preserve important properties of the system's current design. Some properties that we may need to compromise on include: * The property that you can generate an address offline and receive funds at that address and be able to wait an arbitrarily long time before publishing that address to chain * Addresses being 20 bytes (the [rolling state expansion scheme](https://ethresear.ch/t/alternative-bounded-state-friendly-address-scheme/8602) requires a larger address space, though the address length arguably needs to soon be changed for collision-resistance reasons anyway) * The property that the state can be viewed as a "pure" key/value store, and we can avoid storing metadata at each node in the tree * The extent to which existing applications may need to be rewritten to be able to work without users needing to store the full inactive state to generate witnesses * Gas costs or otherwise ease of creating new contracts or filling new storage slots If we are ready to make sacrifices, there are solutions that could start to be implemented very soon. On the other hand, there is the possibility that over time we can tinker and come up with better combinations of these ideas to reduce the problems, and particularly make it technically easier to implement (eg. allowing the use of "pure" key/value stores). We should get a better understanding of what kinds of sacrifices we are more vs less willing to accept, and continue actively working on improved proposals.