# Ethereum’s data roadmap and client democratisation
1. Ethereum data
2. Data will be rollup-centric
3. Danksharding and Proto-danksharding
4. Historical data persistence
5. From data clients to light clients to full clients
6. Rollups and permissionless storage
7. The blockchain-data opportunity
This document presents the future evolution of Ethereum in terms of blockchain data management, and how data can be used as a starting point to build a more democratic client for interacting with the network.
It also covers the different actors (explorers, rollups, etc.) who offer or will offer Blockchain-Data-as-a-Service, and the opportunities they have to leverage decentralised storage networks.
## Ethereum data
Ethereum is building a scalable unified settlement layer. To focus on the problem, the protocol makes a distinction between two types of data:
- **State data:** data needed for consensus purposes. ‘Data availability’ is the guarantee that the block proposer published all transaction data for a block and that the transaction data is available to other network participants *while* it is being proposed for addition to the chain.
- **Historical data:** data stored in ancient blocks and receipts that store information about past events. While historical blockchain data may be necessary for archiving purposes, nodes can validate the chain and process transactions without it.
Today, data availability, the ability to permissionlessly reconstruct the state, is the primary scaling bottleneck. Losing historical data is not a risk to the protocol, only to individual applications.
Therefore, the purpose of the Ethereum consensus protocol is not to guarantee storage of all historical data forever. Rather, the purpose is to provide a highly secure real-time bulletin board, and leave room for other decentralised protocols to do longer-term storage.
## Data will be rollup-centric
Ethereum’s roadmap is a [rollup-centric roadmap](https://ethereum-magicians.org/t/a-rollup-centric-ethereum-roadmap/4698), and the final goal is to become a single high-security execution shard that everyone processes, plus a scalable data availability layer.
Rollups offer high TPS without materially sacrificing decentralisation and security, and the vast majority of end-user activity will live on them. In a simplified form, a rollup is data availability plus an execution check: it separates the execution layer from the data layer.
It will look like this:
![Image credit to Diederik Loerakker](https://i.imgur.com/XQWN9UC.png)
`Image credit to Diederik Loerakker`
## Danksharding and Proto-danksharding
Danksharding will provide the scalable base layer required for Ethereum’s rollup-centric roadmap: a single builder builds the block, a validator/proposer confirms the data and proposes the block, and one committee votes on it. This is an improvement over the 64 separate committees of the original sharding design.
Innovations such as [proposer-builder separation](https://ethresear.ch/t/two-slot-proposer-builder-separation/10980) and weak statelessness unlock this separation of powers (building and validating) to achieve scalability without sacrificing security or decentralisation. Block builders bid for the right to choose the contents of the slot, and the proposer needs only to choose the valid header with the highest bid.
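As a toy illustration of this separation of powers, here is a minimal Python sketch of the proposer's selection rule. The `Bid` shape and the boolean validity flag are assumptions made purely for illustration; real PBS designs exchange signed headers and payloads over a relay or the p2p network.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    builder: str
    value_wei: int
    header_valid: bool  # result of the proposer's cheap header check

def choose_header(bids: list[Bid]) -> Bid:
    """Proposer-builder separation: the proposer only picks the valid
    header carrying the highest bid; it never inspects the block body."""
    valid = [b for b in bids if b.header_valid]
    if not valid:
        raise ValueError("no valid bids for this slot")
    return max(valid, key=lambda b: b.value_wei)

bids = [
    Bid("builder-a", 3 * 10**16, True),
    Bid("builder-b", 5 * 10**16, False),  # highest bid, but invalid header
    Bid("builder-c", 4 * 10**16, True),
]
print(choose_header(bids).builder)  # builder-c
```

The point of the sketch is that the proposer's job stays cheap: it filters and takes a maximum, while all the expensive content selection happens on the builder side.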
Additional requirements on validators, such as [Proof-of-custody](https://dankradfeist.de/ethereum/2021/09/30/proofs-of-custody.html), will be in place to verify the availability of a particular part of the sharded data in each block before pruning.
As the first step towards Danksharding, EIP-4844 (proto-danksharding) will be implemented in the hard fork after Ethereum’s Shanghai. It introduces blobs without implementing any actual sharding, along with all of the execution and consensus logic required later, and an independent, self-adjusting gas price for blobs.
Blobs are vectors of 4096 field elements of 32 bytes each (around 125 KB of usable payload), included via a new transaction type on the execution chain. These pieces of data are cheaper than `calldata`, and will be downloaded by nodes (builders and validators), but they are not directly accessible to the L1 EVM, effectively distinguishing between the execution layer and the consensus layer. The EVM will only be able to access the commitments attached to them.
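The "around 125 KB" figure follows directly from the blob layout; here is a quick back-of-the-envelope check. Note that each 32-byte field element must encode a value below the BLS field modulus, so the usable payload is slightly below the raw 128 KiB capacity.

```python
FIELD_ELEMENTS_PER_BLOB = 4096
BYTES_PER_FIELD_ELEMENT = 32

# Raw capacity of one blob.
blob_bytes = FIELD_ELEMENTS_PER_BLOB * BYTES_PER_FIELD_ELEMENT
print(blob_bytes)         # 131072 bytes
print(blob_bytes / 1024)  # 128.0 KiB raw; usable payload is a bit less
```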
Rollup sequencers (block submitters) currently include transaction data as `calldata` in regular L1 transactions. With blobs, the data will still be included in the transaction pool, but the beacon chain will split the execution payload from the sidecar data. By decoupling the data from the execution payload, a limited availability window becomes a property of blobs: after the beacon chain processes a blob, the execution payload remains on L1, while L1 clients keep the sidecar data for about a month before pruning it, leaving L2s to provide longer-term persistence.
On Layer 1 the EVM has no direct access to blob contents, only to their data hashes. A new opcode will expose these hashes, referenced to the transaction being processed, so contracts can prove that the data was included.
Proto-danksharding allows for a maximum of 16 blobs per block, and Danksharding will bump that up to 256. A new pricing market will be put in place, initially following the mechanism set out in EIP-1559, with the goal of moving to the new [exponential EIP-1559](https://dankradfeist.de/ethereum/2022/03/16/exponential-eip1559.html) mechanism. The blob fee is charged in gas, but as a variable amount that self-adjusts, so block builders will have to simultaneously avoid hitting *two* different limits (execution gas and blob gas).
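To give a feel for the exponential mechanism, here is a hedged Python sketch of how a blob base fee could respond to excess blob gas (gas consumed above the per-block target). The constants are illustrative, and the real spec uses an integer approximation (a `fake_exponential` helper) rather than floating point.

```python
import math

# Illustrative constants, not the final mainnet values.
MIN_BLOB_BASE_FEE = 1        # wei
UPDATE_FRACTION = 3_338_477  # controls how fast the fee reacts

def blob_base_fee(excess_blob_gas: int) -> float:
    """Exponential EIP-1559: the fee grows exponentially with the gas
    consumed above the per-block target, so sustained overuse becomes
    expensive very quickly."""
    return MIN_BLOB_BASE_FEE * math.exp(excess_blob_gas / UPDATE_FRACTION)

# The fee roughly doubles for every UPDATE_FRACTION * ln(2) of excess gas.
for excess in (0, 2_314_000, 4_628_000):
    print(f"excess={excess:>9} -> fee ~ {blob_base_fee(excess):.2f} wei")
```

The key design property is multiplicative adjustment: each block of sustained overuse multiplies the fee by a constant factor, instead of adding a fixed increment as the original EIP-1559 does.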
## Historical data persistence
There are two main initiatives that prune historical data in Ethereum:
- Layer 1: EIP-4444
- Layer 1/2: Danksharding & EIP-4844
In Layer 1, EIP-4444 gives clients the option to locally prune historical data (headers, bodies, and receipts) older than one year. Historical data is then only retrieved when requested explicitly over JSON-RPC or when a peer attempts to sync the chain. Clients will have to “checkpoint sync” from a weak subjectivity checkpoint that they treat as the genesis block.
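A rough sketch of that one-year pruning horizon, assuming post-Merge timing of one block per 12-second slot. The `prune_cutoff` helper and its block-number interface are hypothetical, purely for illustration:

```python
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# At most one block per slot, so a year is bounded by this many blocks.
slots_per_year = SECONDS_PER_YEAR // SECONDS_PER_SLOT
epochs_per_year = slots_per_year // SLOTS_PER_EPOCH

def prune_cutoff(head_block: int) -> int:
    """Everything at or below the returned block number is older than
    one year and may be pruned locally under EIP-4444."""
    return max(0, head_block - slots_per_year)

print(slots_per_year)   # 2628000
print(epochs_per_year)  # 82125
```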
For Layer 2, Danksharding will make many more blobs available to rollups, which can use them in aggregate: a rollup is not limited to a single blob and can span as many blobs as it needs. Blobs are pruned after about a month. Some form of interoperability will be highly desirable so that data can travel across rollups.
The suitability of the previous proposals can be summarised in the following figure:
The distinction between the two types of data derives from whether the data is needed to validate new blocks. State data is stored by full clients in order to build and validate blocks; historical data is only retrieved when requested explicitly over JSON-RPC or when a peer attempts to sync the chain.
The potential actors that could act as **data clients** are:
1. Block explorers ([etherchain.org](https://etherchain.org/), [etherscan.io](https://etherscan.io/), [amberdata.io](https://amberdata.io/)…), for whom providing data to users is the business model.
2. JSON-RPC endpoints like Infura or protocols like [TheGraph](https://thegraph.com/en/) can create incentivised marketplaces where clients pay servers for historical data with Merkle proofs of its correctness.
3. Rollups nominating and paying participants to store and provide their history, requiring only the portion relevant to the application. Historical data matters to applications, not to consensus nodes.
    1. This separates the job of the sequencer from securing the data.
    2. L1 provides the data; if it is not available, the L2 cannot progress.
4. Clients in the [Portal Network](https://github.com/ethereum/portal-network-specs) could store random portions of chain history, and the Portal Network would automatically direct requests for data to the nodes that have it.
5. History could be uploaded and shared through torrent networks, auto-generating and distributing a roughly 7 GB file containing each day’s blob data.
6. Data analysts and volunteers could voluntarily choose to each store a random 5% of the chain history (using [erasure coding](https://blog.ethereum.org/2014/08/16/secret-sharing-erasure-coding-guide-aspiring-dropbox-decentralizer/)).
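The daily file size mentioned in point 5 can be sanity-checked with simple arithmetic, assuming the originally proposed EIP-4844 target of 8 blobs per block (the final parameters may differ):

```python
# Rough sanity check of the "~7 GB per day" figure.
BLOB_BYTES = 4096 * 32                   # ~131 KB per blob
TARGET_BLOBS_PER_BLOCK = 8               # assumed original EIP-4844 target
SECONDS_PER_DAY = 24 * 60 * 60
BLOCKS_PER_DAY = SECONDS_PER_DAY // 12   # one block per 12-second slot

daily_bytes = BLOB_BYTES * TARGET_BLOBS_PER_BLOCK * BLOCKS_PER_DAY
print(f"{daily_bytes / 10**9:.1f} GB of blob data per day")  # 7.5 GB
```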
The historical data storage problem is a [1 of N trust assumption](https://vitalik.ca/general/2020/08/20/trust.html). However, there are still censorship and availability risks if there is a lack of incentives to keep historical data available, especially for unpopular data.
## From data clients to light clients to full clients
[Weak statelessness](https://ethereum-magicians.org/t/weak-statelessness-and-or-state-expiry-coming-soon/5453) and [weak subjectivity](https://blog.ethereum.org/2014/11/25/proof-stake-learned-love-weak-subjectivity) are assumptions proposed to reduce full clients’ data requirements, so that running a validator becomes more decentralised.
Simply put, users would need a light client to talk to the protocol that doesn’t require them to
1. sacrifice privacy,
2. lean on central points of failure, or
3. run a heavy client storing TBs of data.
According to Piper Merriam, the Ethereum Foundation’s technical lead for the [Portal Network](https://www.ethportal.net), a light client is a client dedicated to serving these properties to, for instance, wallet users (MetaMask installs vastly outnumber full client installs), casual developers, or low-resource devices such as IoT devices.
Today a light client is an Ethereum client that only syncs to the latest block header and requests other information from full clients. As they don't download blocks, light clients cannot validate transactions or help secure Ethereum. The objective is to ensure light clients can prove data availability without needing to download blocks.
In the network that exists today, nodes exist on the ETH DevP2P protocol, which is stable and reliable for local storage and access of the state. Within a node, data is divided into three categories: history, state, and gossip. Today, all nodes store all the state and all the history.
### How do we evolve the client to make it lightweight and decentralised?
Taking historical data as the starting point, the potential evolution could look like this:
The **Present phase** creates a distinction between clients, preparing them for EIP-4444 and leveraging weak subjectivity:
- **Full clients:** build blocks and validate them, storing and supporting both state data and historical data.
- **Data clients:** serving and sharing packaged historical data via incentivised storage networks or over torrent magnet links.
Data for these clients would need to include not only headers, blocks, and receipts, as the ETH protocol exposes, but also **canonical indexes**: mappings from a transaction hash to the canonical block that includes it. These would need to be part of the responses to facilitate retrieval of individual transactions by their hash.
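A minimal sketch of what such a canonical index could look like. The block dictionary shape and the `TxLocation` type are hypothetical, chosen only to illustrate the hash-to-block mapping that `eth_getTransactionByHash`-style lookups rely on:

```python
from typing import NamedTuple

class TxLocation(NamedTuple):
    block_hash: str
    tx_index: int

def build_canonical_index(blocks: list[dict]) -> dict[str, TxLocation]:
    """Map each transaction hash to the canonical block (and position
    within it) that includes the transaction."""
    index: dict[str, TxLocation] = {}
    for block in blocks:
        for i, tx_hash in enumerate(block["transactions"]):
            index[tx_hash] = TxLocation(block["hash"], i)
    return index

blocks = [
    {"hash": "0xblock1", "transactions": ["0xaaa", "0xbbb"]},
    {"hash": "0xblock2", "transactions": ["0xccc"]},
]
index = build_canonical_index(blocks)
print(index["0xbbb"])  # TxLocation(block_hash='0xblock1', tx_index=1)
```

Without this index, a data client could serve a block by hash but could not answer "which block contains transaction X" without scanning the whole history.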
Methods and scripts that fetch, verify, and automatically import data must be created so that full clients can benefit from data clients, whatever form they take.
It is a one-way call request (full clients → data clients).
**Phase 1** brings Ethereum light clients into the equation, integrating not only historical data retrieval but also the gossip network that connects clients. State would still be reserved to full nodes, due to the difficulty of reconstructing it from online sources, but this phase offers a great way to improve privacy and independence for RPC requests and for appending new data to the historical record.
- **Light clients:** clients that aggregate two of the three components, comms (gossip) and historical data retrieval, but still rely on full clients for state. They are able to update Ethereum’s historical blockchain data.
For syncing, light clients will “checkpoint sync” from a weak subjectivity checkpoint that they treat as the genesis block, instead of performing a full sync over [devp2p](https://github.com/ethereum/devp2p) as they do today. This will result in less bandwidth usage on the network as light clients adopt more lightweight sync strategies based on the PoS [weak subjectivity](https://blog.ethereum.org/2014/11/25/proof-stake-learned-love-weak-subjectivity) assumption.
Data clients from decentralised storage networks, such as Swarm, could be integrated into the previous light clients to form a unique client that can access, serve, and validate data after the latest blob is pruned.
It is a two-way call request and update (data clients ↔ light clients).
**Phase 2** represents the scenario where [weak statelessness](https://ethereum-magicians.org/t/weak-statelessness-and-or-state-expiry-coming-soon/5453) becomes usable: state isn’t required to validate a block, only to build it. This is a great trade-off, since builders are already more centralised, high-resource entities; weak statelessness gives them a bit more work, but far less work for validators, enabling true light clients.
These will represent the final Ethereum node, where proofs of correct state access (including blob data) are included in every block by builders, and validator nodes have full access to all Ethereum data.
Clients here will fully support and decentralise Ethereum's network.
- [ ] Find more research threads about the EIP-4444 topic
- [ ] The Portal Network seems a great starting point to integrate the three components. Also, hardware node operators such as Dappnode or Avado.
## Rollups and permissionless storage
Customer-facing applications will populate Layer 2 protocols because of their performance and cost, while delegating security and decentralisation to the Layer 1 consensus mechanism.
Their transaction fees break down into the `calldata` cost of posting to L1, the computation used on L2, and L2 storage. In almost all transactions, the L1 `calldata` cost is the primary driver of fees.
There are four lines of research where opportunities may emerge in the medium term:
1. Helping rollups persist data permissionlessly.
2. Interoperability between rollups data.
3. Compressing data in the data blobs.
4. A marketplace for data fees, cross-rollup.
- [ ] Research the format and properties of the most common NFTs stored on AWS, and create a pointer proposal to be included in different L2 solutions.
- [ ] Define a marketplace for data inclusion into blobs, for every rollup. Is there a way to guarantee that the data will be available in Swarm, despite price changes?
- [ ] Is it possible to compress/add multiple data into L2 with the same cost as a L1 transaction?
## The blockchain-data opportunity
A quick-and-dirty summary of the potential use cases, the associated buying pressure, and the potential partners that would help enable them.
| | Use case | Buying pressure | Partners |
| --- | --- | --- | --- |
| Full clients | Adaptation to EIP-4444 | Data market | EF, Portal network |
| RPC providers | Permissionless JSON-RPC access | Wallets, node operators, IoT devices | Infura, Getblock, Quicknode, Dappnode, Avado |
| Block explorers | Public datasets | Public data access | Etherscan, Arbiscan, Gnosisscan, etc. |
| Indexers | Private datasets | Private data access, development, etc | The Graph, Covalent, Tatum, SubQuery, etc. |
| Rollups | Applications, NFTs, files, etc. | Hosting websites, NFTs, files and documents | SoundXYZ, Gnosis chain, Arbitrum, Optimism, Loopring, dYdX, Polygon, Lens, etc. |
A potential starting point will be tackling the following fronts:
1. Delivering an RPC API for data older than one year (82,125 epochs).
2. Defining open specs for EIP-4444 data pruning.
3. Researching NFT data models in L2 and possible rollup data interoperability.