# Data Availability and Proto-Danksharding

The Data Availability Problem
===

Consider a p2p network of nodes maintaining a replicated state machine (e.g. the Bitcoin network). A full node, by definition, maintains a complete history of the chain from genesis. By replaying included transactions against the current state, such a node can independently verify new transactions. This imposes increasingly heavy hardware requirements on nodes looking to join the network - as of April 2023, the entire Bitcoin chain is [~475 GB](https://ycharts.com/indicators/bitcoin_blockchain_size) (growth rate determined by a block size of ~1 MB and one block mined every ~10 minutes).

Light clients, as opposed to full nodes, only download *block headers*. Taken from the [core Bitcoin implementation](https://github.com/bitcoin/bitcoin/blob/master/src/primitives/block.h#L14) (note the quintessential copyright (c) 2009-2010 Satoshi Nakamoto), these describe metadata about the block:

```c++=14
/** Nodes collect new transactions into a block, hash them into a hash tree,
 * and scan through nonce values to make the block's hash satisfy proof-of-work
 * requirements. When they solve the proof-of-work, they broadcast the block
 * to everyone and the block is added to the block chain. The first transaction
 * in the block is a special one that creates a new coin owned by the creator
 * of the block.
 */
class CBlockHeader
{
public:
    // header
    int32_t nVersion;
    uint256 hashPrevBlock;
    uint256 hashMerkleRoot;
    uint32_t nTime;
    uint32_t nBits;
    uint32_t nNonce;
    ...
```

The crucial piece of information here is the `hashMerkleRoot` - a succinct cryptographic commitment to all of the transactions in the block. (Note that Ethereum block headers encode state slightly differently - the key piece of data is the StateRoot, which is the root of the Merkle Patricia Trie mapping account addresses to account-specific state, including e.g. balances and contract storage. The reason Bitcoin headers don't include a commitment to the entire state, but just the transactions in the block, is essentially that Bitcoin uses a [utxo](https://en.wikipedia.org/wiki/Unspent_transaction_output) model instead of the account model of Ethereum.)
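To make the role of the Merkle root concrete, here is a minimal Python sketch (simplified - it ignores Bitcoin's double-SHA-256 and byte-order conventions, and the helper names are illustrative) of how a light client holding only a header can check that a transaction was included in a block, given a Merkle branch supplied by a full node:

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in hash; Bitcoin actually uses double SHA-256 and little-endian conventions.
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list) -> bytes:
    """Fold a list of leaves up to a single root (this is what a full node computes)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last node on odd-length levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_branch(leaves: list, index: int) -> list:
    """Sibling hashes a full node hands to a light client for the leaf at `index`."""
    level = [h(leaf) for leaf in leaves]
    branch = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        branch.append(level[index ^ 1])  # sibling of the current node
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return branch

def verify_inclusion(root: bytes, leaf: bytes, index: int, branch: list) -> bool:
    """What the light client runs: O(log n) hashes, no access to the other transactions."""
    node = h(leaf)
    for sibling in branch:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

txs = [b"tx0", b"tx1", b"tx2", b"tx3", b"tx4"]
root = merkle_root(txs)                 # this is what sits in the block header
proof = merkle_branch(txs, 2)           # supplied by a full node on request
assert verify_inclusion(root, b"tx2", 2, proof)
```

The light client does only a logarithmic amount of hashing and never sees the other transactions - which is exactly why, from the header alone, it cannot tell whether those transactions were ever published at all.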
The data availability problem is this: how can a light client make sure that the transactions it received a commitment to *could* have been downloaded from the network - i.e. were publicly available at least once - without actually downloading all transactions? Note that this makes no guarantee of permanent availability; the light client just wants to ensure that, at some point in time, the data were publicly available.

L2 Networks and Blob Transactions
===

Motivation for providing a solution to the data availability problem in Ethereum comes from wanting to give L2 networks cheaper block space to post their state and transactions. Recall that in order to inherit security from the L1, a basic function of an L2 network involves maintaining the L2 state root in a smart contract on L1. A rollup coordinator batches transactions from the L2, computes the state transitions, and posts a proof to the L1 verifying that the state transitions were computed correctly. For instance, here is the Ethereum smart contract [function](https://github.com/hermeznetwork/contracts/blob/master/contracts/hermez/Hermez.sol#L273) that a [Polygon-Hermez](https://github.com/hermeznetwork/) rollup coordinator must call to move its L2 state forward:

```solidity=273
/**
 * @dev Forge a new batch providing the L2 Transactions, L1Corrdinator transactions and the proof.
 * If the proof is succesfully verified, update the current state, adding a new state and exit root.
 * In order to optimize the gas consumption the parameters `encodedL1CoordinatorTx`, `l1L2TxsData` and `feeIdxCoordinator`
 * are read directly from the calldata using assembly with the instruction `calldatacopy`
 * @param newLastIdx New total rollup accounts
 * @param newStRoot New state root
 * @param newExitRoot New exit root
 * @param encodedL1CoordinatorTx Encoded L1-coordinator transactions
 * @param l1L2TxsData Encoded l2 data
 * @param feeIdxCoordinator Encoded idx accounts of the coordinator where the fees will be payed
 * @param verifierIdx Verifier index
 * @param l1Batch Indicates if this batch will be L2 or L1-L2
 * @param proofA zk-snark input
 * @param proofB zk-snark input
 * @param proofC zk-snark input
 * Events: `ForgeBatch`
 */
function forgeBatch(
    uint48 newLastIdx,
    uint256 newStRoot,
    uint256 newExitRoot,
    bytes calldata encodedL1CoordinatorTx,
    bytes calldata l1L2TxsData,
    bytes calldata feeIdxCoordinator,
    uint8 verifierIdx,
    bool l1Batch,
    uint256[2] calldata proofA,
    uint256[2][2] calldata proofB,
    uint256[2] calldata proofC
) external virtual {
    ...
```

Note that just posting a commitment to the old L2 state, the new L2 state, and a succinct proof of knowledge of a computation trace transitioning between these two states is not enough. Indeed, if the L2 transactions themselves did not have to be posted, a malicious rollup coordinator could keep progressing the L2 state while withholding the transactions used to progress it. Nodes in the network would then only be able to access cryptographic commitments to the L2 state but not the state itself - i.e. they may be unable to determine the balances of accounts. To avoid this attack, the transactions themselves must be published, hence the argument

```solidity
bytes calldata l1L2TxsData
```

, which implicitly stores the transactions in an area of EVM memory called calldata. Calldata is used by the EVM for passing arguments to smart contract functions - it is not part of global state, hence is cheaper to write to than state storage. Although not directly part of the state, calldata *does* need to be stored by nodes in order to reconstruct the history of the chain. L2 rollups store their transactions in calldata as opposed to storage to save on the fees they must charge users of the L2 network. As part of the *rollup-centric* vision of Ethereum, a cheaper and more scalable solution for storing rollup transactions should be added to the L1:

> "Plans for sharding are rapidly evolving, but given the rise and success of layer 2 technologies to scale transaction execution, sharding plans have shifted to finding the most optimal way to distribute the burden of storing compressed calldata from rollup contracts, allowing for exponential growth in network capacity."
> - [Ethereum Roadmap](https://ethereum.org/en/roadmap/merge/)

This motivated the proposal of [EIP-4844](https://eips.ethereum.org/EIPS/eip-4844) (aka Proto-Danksharding). To provide some context: since the Merge, running a full Ethereum node consists of running both an execution client and a consensus client. In addition to standard Ethereum transactions, EIP-4844 proposes the introduction of *blob transactions*. By definition, a *blob* is a vector of $2^{12} = 4096$ field elements (elements of the scalar field of the elliptic curve [BLS12-381](https://electriccoin.co/blog/new-snark-curve/)).

To Ethereum, blobs are just opaque bytes with no interpretation (the idea being that in practice they will encode L2 transactions). The point is that the blobs themselves will only be stored on the Beacon chain - i.e. they only need to be stored by validators (consensus clients), and the execution client, running an instance of the EVM, will only have access to commitments to these blobs.
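To get a feel for the data format, here is a small Python sketch of packing raw bytes (say, compressed L2 transactions) into a blob of $4096$ field elements. The modulus below is the order of the BLS12-381 scalar field; the packing convention (31 payload bytes per field element, so every chunk is automatically below the modulus) is a common simplification rather than the exact encoding - see EIP-4844 for the precise format:

```python
# Order of the BLS12-381 scalar field, i.e. the field blob elements live in.
BLS_MODULUS = 52435875175126190479447740508185965837690552500527637822603658699938581184513
FIELD_ELEMENTS_PER_BLOB = 4096
BYTES_PER_CHUNK = 31  # 31 data bytes per element guarantees the value is < BLS_MODULUS

def data_to_blob(data: bytes) -> list:
    """Pack raw bytes into one blob of 4096 field elements (illustrative convention)."""
    max_bytes = FIELD_ELEMENTS_PER_BLOB * BYTES_PER_CHUNK  # ~124 KB of payload per blob
    assert len(data) <= max_bytes, "payload does not fit in a single blob"
    blob = []
    for i in range(FIELD_ELEMENTS_PER_BLOB):
        chunk = data[i * BYTES_PER_CHUNK : (i + 1) * BYTES_PER_CHUNK]
        value = int.from_bytes(chunk, "big")  # unused trailing chunks become 0
        assert value < BLS_MODULUS
        blob.append(value)
    return blob

blob = data_to_blob(b"\x01\x02\x03" * 1000)
assert len(blob) == FIELD_ELEMENTS_PER_BLOB
```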
We thus naturally arrive at the same data availability problem - how can a light client verify that a piece of data $\mathcal{D}$ is available to download from the network without actually downloading it? More concretely, how can we ensure that $O(N)$ data is available while doing $O(1)$ work?

Erasure Encoding - Theory
===

The basic primitive needed is an encoding
$$ \text{Enc}: \mathbb{F}^{n}\rightarrow \mathbb{F}^{f(n)} $$
such that any $x\in \mathbb{F}^n$ can be reconstructed from any $k$ entries of $\text{Enc}(x)$. Such an encoding can be paired with a cryptographic commitment scheme to construct a proof of data availability as follows. Let $\mathcal{D} := [p_0, \cdots, p_{n - 1}]$ be a *blob* of data that a prover (validator) $\mathcal{P}$ wants to prove availability of to a (light) execution client $\mathcal{V}$. The prover sends (posts on chain) a cryptographic commitment $\text{Com}(\text{Enc}(\mathcal{D}))$ to $\mathcal{V}$. Now $\mathcal{V}$ just needs to ensure that the prover knows at least $k$ entries of $\text{Enc}(\mathcal{D})$, as this is enough to guarantee that $\mathcal{D}$ may be reconstructed if needed. It challenges the prover by asking for a random entry $\ell$ times. If the prover knows fewer than $k$ entries of the encoding, then the probability that $\mathcal{P}$ will be able to successfully respond to all of $\mathcal{V}$'s queries is
$$\text{Pr}_{\mathcal{V} \text{-challenges}}[\mathcal{P} \text{ convinces } \mathcal{V} \mid \mathcal{P} \text{ knows } < k \text{ entries of the encoding}] < \left(\frac{k}{f(n)}\right)^{\ell}.$$
(In an actual implementation, the *response* to each query is an opening of the cryptographic commitment at the given $\mathcal{V}$-challenge point.)

The power of this technique comes from the fact that originally, $\mathcal{V}$ needed to ensure that $\mathcal{P}$ had every $p_i$ available, whereas after applying $\text{Enc}$, $\mathcal{V}$ just needs to ensure that some fraction of the encoded samples are available, which can be done with high probability through random sampling. We want to construct mappings $\text{Enc}$ with $\frac{k}{f(n)}$ small. Making $k$ small gives better guarantees for the verifier; making $f(n)$ large also improves the verifier's guarantee, but gives the prover more work to construct the encoding.

Probably the simplest non-trivial example is the encoding $\text{Par}$ given by
$$ \text{Par}: \mathbb{F}_2^{n}\rightarrow \mathbb{F}_2^{n + 1}$$
$$x\mapsto \text{Concat}(x, \bigoplus_{1\leq i \leq n}x_i), $$
where $\bigoplus$ denotes bitwise xor, i.e. $\text{Par}$ just appends an extra parity bit to $x$. As long as at most one bit is erased from $\text{Par}(x)$, $x$ can be uniquely recovered using the invariant that the bitwise xor of any valid encoding must be $0$. This encoding has parameters $k = n$ and $f(n) = n + 1$, so $\frac{k}{f(n)}$ is relatively large (it approaches $1$ as $n\rightarrow \infty$).
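As a quick sanity check of the parity encoding, here is a minimal Python sketch of $\text{Par}$ and of recovery from a single erasure (an erased position is modelled as `None`; the function names are illustrative):

```python
from functools import reduce
from operator import xor

def par_encode(x: list) -> list:
    """Par: F_2^n -> F_2^{n+1}; append the xor of all bits as a parity bit."""
    return x + [reduce(xor, x, 0)]

def par_recover(received: list) -> list:
    """Recover the codeword when at most one position was erased (marked None).

    Uses the invariant that the xor of all n+1 bits of a valid encoding is 0.
    """
    erased = [i for i, b in enumerate(received) if b is None]
    assert len(erased) <= 1, "the parity code cannot handle more than one erasure"
    if erased:
        known_xor = reduce(xor, (b for b in received if b is not None), 0)
        received = received.copy()
        received[erased[0]] = known_xor  # the erased bit must restore xor == 0
    return received

x = [1, 0, 1, 1, 0]
codeword = par_encode(x)     # [1, 0, 1, 1, 0, 1]
damaged = codeword.copy()
damaged[2] = None            # erase one position
assert par_recover(damaged) == codeword
```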
A much more powerful encoding, proposed for use in EIP-4844, is given by so-called *Reed-Solomon codes*. Fix a multiplicative subgroup $H\subseteq \mathbb{F}^{\times}$ of order $2^m \gg n$ (it is not strictly necessary that $|H|$ be a power of $2$, although it will allow for certain optimizations (think FFTs)). Let $\omega\in H$ be a generator. The idea is to take $x\in \mathbb{F}^n$ and interpolate a polynomial $f\in \mathbb{F}[X]$ of degree $< n$ through the points $(\omega^i, x_i)$. (Such an $f$ exists by [Lagrange interpolation](https://en.wikipedia.org/wiki/Lagrange_polynomial), and can be computed in $O(n\log n)$ using an [FFT](https://en.wikipedia.org/wiki/Fast_Fourier_transform) and our assumption on $|H|$.) The benefit of introducing $f$ (which contains the same information as $x$) is that we may evaluate $f$ at more points of $H$. For instance, assume $2^m > \lambda n$, so we can consider the evaluations
$$\mathcal{E}_{f,H,\lambda n} := [(1, f(1)), (\omega, f(\omega)), \cdots, (\omega^{\lambda n - 1}, f(\omega^{\lambda n - 1}))]. $$
This is a vector of length $\lambda n$, but because $f$ is a polynomial of degree $< n$, it can be uniquely recovered from any $n$ of these values. This encoding has parameters $k = n$ and $f(n) = \lambda n$, which gives $\frac{k}{f(n)} = \frac{1}{\lambda}$, independent of $n$. If $\lambda = 2$ and the verifier makes $\ell = 20$ queries to the prover, the probability that the prover convinces the verifier of data availability, despite not actually having enough information to reconstruct the data, is already $< 10^{-6}$.

(The actual proposal in Danksharding involves using a 2d Reed-Solomon encoding. The idea is to arrange the original data in a square and Reed-Solomon extend both the rows and the columns. The benefit of doing this is the ability to shard the reconstruction process across nodes - i.e. a node can reconstruct a single row or column with $O(\sqrt{N})$ work.)

Erasure Encoding - Practice
===

Recall that blobs published on the Beacon chain consist of a body - the original data divided into $N = 4096$ chunks (each of which is an element of the scalar field of the curve $\text{BLS12-381}$) - together with a Reed-Solomon encoding.

(An important point glossed over in the previous section is how the verifier can make sure that the data committed to is actually an interpolation of a polynomial. Perhaps the prover just extended the data arbitrarily. The solution is to use a special type of commitment to the Reed-Solomon extended data which "remembers" the polynomial structure of the data, called a polynomial commitment. Polynomial commitments are similar to Merkle commitments (which can be considered vector commitments), but they can only commit to polynomials, and succinct proofs can be given for point evaluations. The polynomial commitment scheme planned for use by Ethereum is KZG - regardless of the degree of the polynomial, the commitment is a single elliptic curve point, which for BLS12-381 is just 48 bytes in compressed form. We won't detail the construction of KZG commitments here, but will note that it relies on elliptic curve pairings.)

The blob header contains a KZG commitment $c_{\text{data}}$ to the polynomial interpolating the chunks, and the blob body contains opening proofs $\pi_i$ for each index $0\leq i < 4096$ of the KZG commitment. The idea is that validators sample random $i$ and verify that $\pi_i$ is a valid opening of the commitment $c_{\text{data}}$ at $\omega^i$. To reduce gas costs, new [pre-compiles](https://github.com/ethereum/go-ethereum/blob/master/core/vm/contracts.go) for verifying openings of KZG commitments are planned to be added to the execution software (e.g. a Geth node).
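To tie the theory and practice sections together, here is a toy end-to-end Python sketch of the sampling game: the prover Reed-Solomon extends the data by a factor $\lambda = 2$ by evaluating the interpolating polynomial over a multiplicative subgroup, and the verifier samples $\ell = 20$ random indices. Everything here is illustrative - a tiny prime field stands in for the BLS12-381 scalar field, plain Lagrange evaluation stands in for FFTs, and a per-index hash stands in for a real KZG opening proof (so, unlike KZG, this stand-in does not enforce that the committed data lies on a low-degree polynomial):

```python
import hashlib
import random

def find_subgroup_generator(p: int, order: int) -> int:
    """Find a generator of the multiplicative subgroup of F_p^* with the given 2-power order."""
    assert (p - 1) % order == 0 and order & (order - 1) == 0
    for g in range(2, p):
        h = pow(g, (p - 1) // order, p)   # h^order = g^(p-1) = 1
        if pow(h, order // 2, p) != 1:    # for a 2-power order, this forces ord(h) == order
            return h
    raise ValueError("no generator found")

def lagrange_eval(xs: list, ys: list, z: int, p: int) -> int:
    """Evaluate the unique degree < len(xs) polynomial through (xs[i], ys[i]) at z, mod p."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (z - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

P = 257          # toy prime field standing in for the BLS12-381 scalar field
N = 32           # number of original chunks (4096 in the real protocol)
LAMBDA = 2       # extension factor
M = LAMBDA * N   # length of the Reed-Solomon extended data

omega = find_subgroup_generator(P, M)          # multiplicative subgroup of order 64 in F_257^*
domain = [pow(omega, i, P) for i in range(M)]  # evaluation points 1, w, w^2, ...

# Prover: interpolate f through (w^i, data_i) for i < N, then evaluate f on the whole domain.
data = [random.randrange(P) for _ in range(N)]
extended = [lagrange_eval(domain[:N], data, x, P) for x in domain]

# Hash of each evaluation stands in for a KZG opening; a real scheme posts one succinct commitment.
commitment = [hashlib.sha256(f"{i}:{v}".encode()).digest() for i, v in enumerate(extended)]

def opening(i: int) -> bytes:
    return hashlib.sha256(f"{i}:{extended[i]}".encode()).digest()

# Verifier: sample l = 20 random indices and check each claimed opening against the commitment.
# If the prover actually held fewer than N of the 2N evaluations, the chance of answering all
# 20 queries would be < (1/2)**20, roughly 9.5e-7 -- the bound from the Theory section.
queries = random.sample(range(M), 20)
assert all(opening(i) == commitment[i] for i in queries)

# Reconstruction: any N of the extended evaluations determine f, hence the original chunks.
subset = random.sample(range(M), N)
recovered = [lagrange_eval([domain[i] for i in subset], [extended[i] for i in subset], x, P)
             for x in domain[:N]]
assert recovered == data
```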
The road to solving the data availability problem for transactions published by L2 networks on Ethereum L1 is thus - provide blob storage on the Beacon chain (EIP-4844), encode the data using (multi-dimensional) Reed-Solomon codes, and have validators (and light execution clients) randomly sample chunks before voting on a block, ensuring with high probability that the data is available without actually downloading it.