State of the art of Data Availability

In this section, we explain the data availability problem, present well-known solutions to it, and give an overview of the various data availability solutions proposed by different blockchains. In some contexts, the solutions proposed for data availability are referred to as sharding solutions.

The Data Availability Problem

One way to increase the scalability of a blockchain is to let its users post the content of their operations off-chain. This allows throughput beyond the bandwidth limit of the L1, which cannot be set too high.

But if the data are posted off-chain, how can we reach a consensus about those data being available without requiring all the L1 nodes to download those data? This is the data availability problem.

More precisely, the data availability problem is to ensure that some data was published at a given point in time. This point in time is generally the moment a block proposal containing commitments to that data is made.

The name "data availability" can be a bit misleading: ensuring data availability does not guarantee that the data can be retrieved after the point of publication.

Data Availability vs Data Retrievability

Data retrievability is the property that data can be retrieved for a fixed time period (on the order of a few days).

While reaching a consensus on the availability of data posted off-chain is generally considered the most important property, retrievability is important for some use cases. Chiefly, for smart-rollups, this data may be required during the refutation period of a commitment, making retrievability a very useful feature.

Some discussion around the difference can be found here.

Solutions for the Data Availability Problem

Solution 0: All full nodes download all the data

Let's first look at the most naive approach to data availability: Have all full nodes download all data. This is what existing L1 blockchains such as Bitcoin, Ethereum, and Tezos do for their normal transactions.

Requiring all nodes to download all data naturally ensures that the data is available. If a node proposes a block that is unavailable, honest nodes in the network won't accept it since they can't download and execute it, making it impossible to build on top of the block.

The problem with this approach is that it doesn't scale. This becomes especially problematic as we move to an L2-centric world which requires ensuring the availability of large amounts of data.

All full nodes download all the data

  • Trust assumption for data availability:
    • Honest minority of the L1 network. When an honest node encounters an unavailable block, it will fork from that chain even if the majority of nodes are following the unavailable block.
  • Pros:
    • Can protect against a data withholding attack from a malicious majority.
  • Cons:
    • Does not scale.

Non-solution: IPFS, BitTorrent, etc.

A common question when people first hear about the data availability problem is: why not just post all the data to P2P data storage protocols such as IPFS or BitTorrent? That is, have the user post the data to IPFS/BitTorrent and only post some metadata (a commitment to the data) to the L1 network.

The problem with this approach is that such protocols do not have a built-in consensus mechanism.

Consider the scenario where both Alice and Mallory claim to have published a set of transactions to IPFS/BitTorrent at a specific time, and suppose Alice actually published the data and Mallory did not. The problem is that the network cannot agree on which set of transactions to accept and which to reject.

While it is possible for all nodes to download the full data to reach consensus, doing so would be no different from the "all full nodes download all the data" solution.

So the question is, how do we provide a solution that is scalable while also providing consensus on data availability?

Solution 1: Data Availability Committees

A data availability committee (DAC) is a set of trusted parties that attests that given data is available.

When a user submits a data blob, DAC members will download the full data and post signatures to L1 to attest to the availability of the data. The data is only considered available if some threshold of members signs this attestation.

Note that when setting a threshold for a DAC, there is a trade-off between trustlessness and liveness. A higher threshold makes it more difficult for malicious members to withhold data while making the network believe it is available. However, it also makes it easier for malicious members to stop the DAC's functionality by not attesting to any data.

For instance, if there are 10 members in the DAC and the threshold is set to 10 (meaning that all 10 members must attest to availability), then a single honest member can prevent unavailable data from being accepted, but a single malicious member can prevent any data from being accepted.
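
To make the acceptance rule concrete, here is a minimal sketch of the threshold check a node could run; the member names, the way signatures are represented, and the threshold value are illustrative assumptions, not any specific DAC's parameters.

```python
# Minimal sketch of DAC-style threshold attestation (illustrative, not a real DAC API).

DAC_MEMBERS = {"m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"}
THRESHOLD = 7  # number of member attestations required to accept a blob


def is_available(commitment: str, attestations: dict[str, str]) -> bool:
    """A blob (identified by its commitment) counts as available only if at
    least THRESHOLD distinct DAC members attested to that exact commitment."""
    valid_signers = {
        member
        for member, signed_commitment in attestations.items()
        if member in DAC_MEMBERS and signed_commitment == commitment
    }
    return len(valid_signers) >= THRESHOLD


# A higher THRESHOLD (e.g. 10 of 10) lets a single honest member block unavailable
# data, but also lets a single malicious/offline member block all data (liveness loss).
```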

Data Availability Committees

  • Trust assumption for data availability:
    • Honest majority of the DAC.
  • Pros:
    • Scalable. We can have very large data blobs as the size won't directly affect the L1 network bandwidth.
  • Cons:
    • Requires trusting the DAC and is not backed by L1 security.

Solution 2: Randomly Sampled Committees

A randomly sampled committee is a subset of validators that is randomly selected by the L1 to attest to the availability of a blob of data.

When a user submits a data blob, the L1 assigns a randomly selected subset of validators, a committee, to attest to the blob. The committee members download the data blob and post attestations to L1 that the data was available. The network as a whole only accepts the data if it sees signatures from a majority of the committee.
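
To illustrate the mechanism, below is a minimal sketch of deterministic committee sampling from a shared randomness seed; the committee size, the seed derivation, and the function names are illustrative assumptions, not any chain's actual parameters.

```python
# Sketch of deterministic committee sampling from an on-chain randomness seed.
import hashlib
import random

COMMITTEE_SIZE = 5  # illustrative; real systems size this according to their security model


def sample_committee(validators: list[str], randomness_seed: bytes, blob_id: str) -> list[str]:
    """Derive a pseudo-random committee for `blob_id` from a shared seed, so that
    every node computes the same committee without extra communication."""
    digest = hashlib.sha256(randomness_seed + blob_id.encode()).digest()
    rng = random.Random(digest)
    return rng.sample(sorted(validators), COMMITTEE_SIZE)


validators = [f"val{i}" for i in range(100)]
print(sample_committee(validators, b"epoch-randomness", "blob-42"))
```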

Any committee-based system is vulnerable to adaptive corruption, where an attacker can compromise the whole system by corrupting only a small number of nodes. To mitigate this attack, the committee members are periodically shuffled, making them harder to target. Some analysis of adaptive corruption can be found here.

Randomly Sampled Committees

  • Trust assumption for data availability:
    • Honest majority of the randomly sampled committee.
  • Pros:
    • Can be naturally combined with part-wise attestations or DAS (both explained later) to check the availability of unassigned data blobs.
  • Cons:
    • Without combining with part-wise attestations or DAS, it requires trusting the sampled committee, which in general is not large, e.g., 5 nodes.
    • Without a proof of custody scheme (explained in the following section), it cannot prevent the "lazy validator" problem.

The Lazy Validator Problem

Any system that relies on validators attesting that data is available is susceptible to the lazy validator problem. Naive attestations are essentially self-declarations, so nothing prevents validators from attesting to data that they have not downloaded.

There are two potential solutions to this problem:

  • Proof of custody:
    • Associate a custody bit with each of the data parts, require the validator to report the custody bit in its attestation, and make it slashable if the validator reports a wrong custody bit (a minimal sketch follows this list).
  • Interactive dispute:
    • If a validator fails to provide data that it attested to, the nodes in the network can require the validator to reveal the data on the L1 chain. This scheme ensures not only that validators download the data but also that they provide it.
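
Below is a minimal sketch of the custody-bit idea, assuming a toy custody function that simply hashes a validator-specific secret together with the data part; real proof-of-custody constructions (such as those proposed for Ethereum) are more elaborate so that the secret can later be revealed and checked.

```python
# Toy proof-of-custody sketch: the custody bit depends on a validator-specific
# secret and the actual data part, so it cannot be guessed without downloading the part.
import hashlib


def custody_bit(validator_secret: bytes, part: bytes) -> int:
    """One bit derived from the validator's secret and the data part."""
    return hashlib.sha256(validator_secret + part).digest()[0] & 1


def attest(validator_secret: bytes, part: bytes) -> dict:
    # An honest validator downloads `part` and reports the derived bit.
    return {"custody_bit": custody_bit(validator_secret, part)}


def is_slashable(attestation: dict, validator_secret: bytes, part: bytes) -> bool:
    # Once the secret can be checked, anyone can recompute the bit and slash a
    # validator that reported the wrong one (i.e. one that never downloaded the part).
    return attestation["custody_bit"] != custody_bit(validator_secret, part)
```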

The "proof of custody" scheme is considered to be used in Ethereum and NEAR.

The concept of an "interactive dispute" can be challenging because data unavailability is not a uniquely attributable fault. There is ongoing research (such as this paper and talk) to find the optimal reward, slashing, and fee structure for such interactive games.

Solution 3: Part-wise Attestation

When a user submits data, the data blob is erasure coded and split into parts. Each validator in the network is assigned to download and attest to the availability of one part (or multiple parts). Since only a subset of the parts is required to reconstruct the whole data, the network can tolerate data-withholding attacks from some threshold of malicious actors.

Suppose the erasure coding parameter is set to 2, which means that any selection of half of the parts can be used to reconstruct the full data. In this case, as long as at least half of the validators behave honestly, it will be possible to reconstruct the entire data.
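
As a sketch, with this parameter the availability condition boils down to a simple counting check over the per-validator attestations; the numbers below are illustrative.

```python
# Sketch: with redundancy factor 2, any half of the parts reconstructs the data,
# so the blob is reconstructable as long as at least half of the parts are held
# by honest validators.
REDUNDANCY = 2


def is_reconstructable(total_parts: int, held_parts: set[int]) -> bool:
    needed = total_parts // REDUNDANCY  # e.g. n parts out of 2n
    return len(held_parts) >= needed


# Example: 100 erasure-coded parts, one per validator; 60 honest validators hold
# (and attested to) their part, so the original data can be rebuilt.
print(is_reconstructable(100, set(range(60))))  # True
```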

Note that the proof of custody scheme discussed in the previous section can also be applied in this context to prevent validators from attesting for an unavailable part.

Part-wise Attestation

  • Trust assumption for data availability:
    • Honest majority in the L1 network. The full data blob can be reconstructed if a majority of the network is honest and stores its parts.
  • Pros:
    • Each node only needs to download a small amount of data.
  • Cons:
    • Cannot protect against data withholding attacks from a malicious majority. A malicious majority can trick the whole network into believing that unavailable data is available.

Solution 4: Data Availability Sampling

The user submits an erasure-coded data blob to the L1 network. Full nodes (a subset of them if we combine them with randomly sampled committees) will download the data and attest to their availability. Light nodes, i.e., nodes that did not download the data, can check the availability of the data by running data availability sampling (DAS) against the data. DAS will be explained in detail in the next section.

Note that DAS does not rely on trusting the majority of the network. Even if a majority of validators act maliciously and accept an unavailable data blob, the light clients will be able to detect the unavailability in a trustless manner.

The word "light clients" here refers to nodes that have not completely downloaded a specific set of data. For instance, when paired with randomly sampled committees, full nodes that were not chosen to download the data can be considered as "light clients" for that data.

It is worth emphasizing that passing DAS is often part of the requirement for a block to be accepted. In other words, it is considered part of the validity condition of the chain. This means that unavailable data is treated similarly to invalid transactions. If a block contains a transaction that creates a large amount of currency out of nowhere, it will be considered invalid and will be rejected by honest nodes that execute the transactions. Similarly, a block containing unavailable data will be rejected by honest nodes that run DAS, regardless of what the rest of the network thinks.

Data Availability Sampling

  • Trust assumption for data availability:
    • Honest minority in the L1 network. When an honest node fails DAS for a block, it can reject that chain even if the majority of nodes are following the unavailable block.
  • Pros:
    • Can protect against a data withholding attack from a malicious majority.
  • Cons:
    • Inefficient when the data is split into multiple independent "shards". See this later section for more detail.
    • Designing an efficient P2P network for DAS is a very difficult problem. See this later section for more information.
    • Without a private P2P network, the system is susceptible to selective disclosure attacks. In this type of attack, the attacker responds to DAS requests from a specific subset of nodes and convinces them to accept a block without revealing enough parts to reconstruct the data.

A brief explanation of Data Availability Sampling

Data availability sampling[1] is a technique to ensure, with a high probability, that some data is available while only downloading a small portion of the entire data. It works as follows:

The data is erasure coded and split into k*n parts, where n is the number of parts in the original data and k is an expansion parameter. Due to the nature of erasure coding, any subset of n parts out of the k*n parts is sufficient to reconstruct the entire data.

When a client in the network wants to ensure data availability of the data, it will randomly sample parts of the erasure-coded data and only consider the data available if all sampling requests succeed.

Now how does the above scheme actually ensure data availability?

Suppose that a malicious actor wants to withhold the data while convincing the network that the data is available. Since n parts out of the 2*n parts (for this explanation, we assume k=2) are enough to reconstruct the whole data, they must withhold at least n+1 parts from the network (i.e., they can make at most n-1 parts available).

Now suppose a client attempts to check the availability of the data. Since the probability of a single sampling request succeeding is less than 1/2 (because (n-1)/2n < 1/2), the probability of x consecutive sampling requests succeeding is less than (1/2)^x; hence the chance of successfully withholding data becomes exponentially smaller with the number of sampling requests. For example, with x=20, the probability of passing the DAS check while withholding data is less than 1/10^6 (one in a million).
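
The bound can be checked numerically; the short sketch below computes the probability that all x samples succeed when the adversary makes only n-1 of the 2n parts available (treating samples as independent for simplicity).

```python
# Probability that x independent samples all hit one of the (at most) n-1
# available parts out of 2n, i.e. the chance of being fooled by withheld data.
def fooling_probability(n: int, x: int) -> float:
    p_single = (n - 1) / (2 * n)  # strictly less than 1/2 for any n >= 1
    return p_single ** x


print(fooling_probability(n=128, x=20))  # ~8.2e-07, i.e. less than one in a million
```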

This solution is discussed further here, where we discuss future improvements to the DAL design for Tezos.

Approaches of various blockchains

Summary

The table below is a summary of the data availability solutions of blockchains covered in this document.

The columns "Randomly Sampled Committee," "Part-wise Attestations," and "Data Availability Sampling" indicate whether each solution utilizes the corresponding data availability technique discussed in this document.

The "Decentralized Block Production" column indicates whether or not the network relies on powerful block producers to erasure code and distribute all the data in the block. Note that, as mentioned in the Vitalik's endgame post, centralizing block production is not neccesarily bad as long as block validation is trustless.

| Blockchain | Randomly Sampled Committee | Part-wise Attestations | Data Availability Sampling | Decentralized Block Production |
| --- | --- | --- | --- | --- |
| Polkadot | Yes | Yes | No | Yes |
| NEAR | Yes | Yes | No | Yes |
| Eth2 | Yes | No | Yes | Yes |
| Danksharding Phase 1 | No | Yes | No | No |
| Danksharding Phase 2 | No | Yes | Yes | No |
| Celestia | No | No | Yes | No |
| Tezos DAL V1 | No | Yes | No | Yes |

Polkadot

The structure of the Polkadot network is represented in the following figure.

An overview of Polkadot's blockchain with parachains

Polkadot has two types of chains. At the center of the figure above, there is a chain called the relay chain which acts as the hub for the whole system. Transactions are processed by several different blockchains, known as parachains, which operate independently. Each parachain is managed by its own group of full nodes, called collators. The collators are responsible for creating blocks for their specific parachain.

A group of relay chain validators is selected to become the para-validators for each parachain. The para-validators are responsible for downloading and verifying new parachain blocks generated by the collators. Once a majority of the para-validators agree on the validity of a block, they send the header to all relay chain validators.

Furthermore, in order to ensure data availability, para-validators construct erasure-coded chunks of the parachain block (or more accurately, of the PoV of the block, which includes the transactions and the state that the transactions touch) and send a piece to each validator in the relay chain. The parachain block is accepted in the relay chain only if some threshold of validators report that they downloaded their chunk. The erasure coding is done so that only 1/3+1 of the chunks are needed to reconstruct the full block; hence the block is guaranteed to be reconstructable as long as more than 1/3 of the validators are honest.
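
The counting argument can be written out explicitly; the validator count below is illustrative. With n = 3f + 1 relay chain validators, any f + 1 chunks suffice, so the chunks held by a more-than-one-third honest minority are already enough to reconstruct the block.

```python
# Sketch of the Polkadot-style availability arithmetic: with n = 3f + 1 relay chain
# validators, the erasure coding is tuned so that any f + 1 chunks reconstruct the block.
def chunks_needed(n_validators: int) -> int:
    f = (n_validators - 1) // 3  # maximum number of faulty validators tolerated
    return f + 1


n = 1000                           # illustrative validator count (n = 3f + 1 with f = 333)
honest = n // 3 + 1                # "more than 1/3 honest" in the worst case
print(honest >= chunks_needed(n))  # True: the honest validators alone hold enough chunks
```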

In summary, in terms of data availability, the validators only accept a relay chain block if they were able to:

  • Download all the data of the parachain block, in the case of para-validators (Solution 2).
  • Download a part of the data, in the case of all the other relay chain validators (Solution 3).

NEAR Nightshade

Schema of nightshade

NEAR Nightshade's data availability solution is similar to Polkadot's solution.

Each block is split into chunks. Each validator is assigned to a specific chunk and is tasked to download, execute, and verify the chunk.

To ensure data availability, validators not only download their assigned chunk but also download small erasure-coded parts of all the other chunks.

In terms of data availability, the validators only accept a block if they are able to:

  • Download the whole data of their assigned chunk (Solution 2).
  • Download parts of all the other chunks (Solution 3).

Furthermore, in order to prevent validators from being lazy, NEAR introduces a concept of colored parts.

Colored Parts

Preventing laziness via colored parts works as follows:

  • When the chunk is produced and split into erasure-coded parts, the chunk producer chooses a color (red or blue) for each part.
  • The color is combined with the parts when computing the Merkle root of the parts.
  • The color is combined with the part when it is being distributed.
  • When a validator attests for a chunk part, it must include the color of the part.
  • If the validator includes the wrong color, this can be proven via the Merkle tree; hence the validator can be slashed.

This can be thought of as a simplified version of the proof of custody scheme described here. In proof of custody, the colors for each part are personalized for each validator, which avoids depending on a centralized color provider.
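
Here is a toy sketch of the color commitment, assuming the Merkle leaf is simply the hash of the part concatenated with its color; the actual NEAR construction differs in its details.

```python
# Toy sketch of "colored parts": the chunk producer commits to hash(part || color)
# in the Merkle tree, so an attestation reporting the wrong color can later be
# disproved against the committed leaf and slashed.
import hashlib


def leaf_hash(part: bytes, color: str) -> bytes:
    assert color in ("red", "blue")
    return hashlib.sha256(part + color.encode()).digest()


def attestation_is_wrong(part: bytes, attested_color: str, committed_leaf: bytes) -> bool:
    # A validator that never downloaded the part cannot know its color, so a
    # guessed color is wrong half the time, and provably so.
    return leaf_hash(part, attested_color) != committed_leaf


part = b"erasure-coded part bytes"
committed_leaf = leaf_hash(part, "red")  # color chosen by the chunk producer
print(attestation_is_wrong(part, "blue", committed_leaf))  # True -> slashable
```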

Eth2

In this section, we look at Eth2's proposal for data availability, described here. Note that this proposal has been dropped in favor of the Danksharding solution described later. As the term "Eth2" had been phased out by the time Danksharding was proposed, we refer to the pre-Danksharding Eth2 design as just "Eth2".

In this proposal, block data is split into shards. For each shard, a randomly sampled set of validators, called a committee, is assigned. The committee is responsible for proposing and verifying the given shard and is shuffled at each epoch (a fixed number of blocks).

For each shard in each block, a single member of the shard's committee is selected as the shard's proposer. The proposer compiles a blob of arbitrary data for the shard and broadcasts a shard header, which contains the metadata of the blob. The other members of the shard's committee attempt to download the full blob and sign the shard header if they succeed.

Additionally, in order to ensure data availability of unassigned shards, validators run data availability sampling against shards that they did not fully download.

In summary, in terms of data availability, the validators only accept a block if they were able to:

  • Download all data for their assigned shard (Solution 2).
  • Successfully conduct DAS against all the other shards (Solution 4).

All the nodes of the L1 accept and propagate a block only if they were able to successfully conduct DAS against all the shards proposed in the block.

Celestia

Unlike other solutions we have seen so far, Celestia does not "shard" its blocks. All data is directly stored in large blocks in the main chain. Since full nodes are required to download all the data, Celestia has higher hardware requirements compared to sharded blockchains, which leads to centralization (i.e. they require supernodes). In such a setting, it is important to have a mechanism to monitor the full nodes in a decentralized way.

Celestia uses data availability sampling (Solution 4) for this monitoring.

With DAS, light clients (i.e., clients that do not download the full block) can validate the block's availability without relying on the honesty of a full node. Even if all the full nodes in the network conspire and publish an unavailable block, the light clients of the network will simply stop following the unavailable block, which effectively halts the progression of the chain. At this point, light client users can fall back to social consensus and spin up a new honest node and/or slash the malicious full nodes via a hard fork.

Furthermore, in Celestia, light clients cache the fetched samples and serve them on request. This means that with enough light clients, full blocks can be fully reconstructed from the samples collectively stored in the light clients.
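
As a sketch of that idea (a hypothetical structure, not Celestia's actual node code), a light client only needs to keep the samples it fetched and answer requests for them.

```python
# Sketch: light clients retain the few samples they fetched and serve them back,
# so collectively the network can reassemble full blocks from light clients alone.
from typing import Optional


class LightClientSampleCache:
    def __init__(self) -> None:
        self._samples: dict[tuple[str, int], bytes] = {}  # (block id, sample index) -> data

    def store(self, block_id: str, index: int, data: bytes) -> None:
        # Called for each sample fetched during DAS.
        self._samples[(block_id, index)] = data

    def serve(self, block_id: str, index: int) -> Optional[bytes]:
        # Other nodes can request cached samples during reconstruction.
        return self._samples.get((block_id, index))
```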

(Note: Polygon Avail, which is a data availability solution for the Polygon ecosystem, takes a similar approach to Celestia. One difference is that Polygon Avail relies on Kate commitments instead of fraud proofs to prevent invalid erasure codings.)

Danksharding

Danksharding (original proposal, article from Delphi Digital, workshop at Devcon Bogotá) is the latest sharding proposal for Ethereum. Before explaining what it is, let's first look at the problem with the previous Eth2 sharding proposal described in the section above.

Problem with the Eth2 sharding approach

In the Eth2 approach, data is split into shards where each shard is built and proposed independently:

As a consequence, we must check availability (via DAS) for each shard independently. This is because, as illustrated below, it is enough for a malicious actor to make a single shard unavailable to harm the system.

Hence with N shards, where each shard requires K samples for DAS, N*K samples are needed in total.

Utilizing Powerful Block Builders

One way to improve this situation is to introduce a new role, a block builder, that would gather data for all the shards in a block and create a single erasure-coded data blob.

This way, you would only need to run DAS once to check availability.
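
A quick back-of-the-envelope comparison (the shard and sample counts are illustrative, not Ethereum's actual parameters):

```python
# Samples a light client needs when each shard must be checked independently,
# versus when a single builder publishes one unified erasure-coded blob.
N_SHARDS = 64    # illustrative number of shards
K_SAMPLES = 30   # illustrative number of samples per DAS check

independent_shards = N_SHARDS * K_SAMPLES  # 1920 samples per client
unified_blob = K_SAMPLES                   # 30 samples per client
print(independent_shards, unified_blob)
```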

Since such a block builder would require a lot of network bandwidth and computing power, it would naturally be a centralization vector. However, due to MEV, introducing centralization in the block-building process is already being proposed in the form of proposer-builder separation (PBS). Such centralization of block production is accepted as long as validation stays decentralized, as expressed by the "centralized block production, decentralized validation" slogan in Vitalik's Endgame blog post.

Danksharding is a proposal to enable efficient sharding by utilizing the powerful block builder role introduced for PBS. The basic idea is to have the block builder be responsible for gathering the shard data and erasure coding/distributing them.

Here is a quote that nicely summarizes the core idea of Danksharding.

genius of danksharding is to realize that MEV is incentivizing this powerful hardware anyway, so we might as well leverage their specialized role and have it be unified with data sharding.

Note that such a design makes Ethereum's sharding closer to Celestia's design. Below is a relevant tweet exchange between the Danksharding proposer and the Celestia founder:

2D Erasure Coding Scheme

In the above explanation, we represented the erasure-coded block as a single 1D erasure coding. The problem with this approach is that it not only requires a powerful computer to create the block, but it also takes a powerful computer to reconstruct the data from the erasure coding.

In order to support reconstruction on lower-resource computers, Danksharding employs a 2D erasure coding scheme. This approach divides the data into rows and columns and then applies erasure coding to both the row and column directions.

This erasure-coding scheme allows for the data to be reconstructed row by row or column by column, which enables incremental reconstruction and makes it feasible for computers with limited resources to perform the reconstruction.

Furthermore, having both rows and columns, and not just rows, makes the system more robust against data withholding attacks. This is because a malicious actor must withhold at least 25% of the erasure-coded data to conduct a data unavailability attack.

If more than 75% of data is made available, we would be able to reconstruct the original data by combining the reconstruction of columns (e.g. (A) below) and rows (e.g. (B) below).
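
The sketch below illustrates this counting argument on a toy availability grid, where a row or column is considered recoverable as soon as half of its cells are available; a real scheme would use Reed-Solomon coding in both directions, and the 25% figure is the limit for large grids.

```python
# Toy model of 2D erasure-coded availability: the extended block is a 2k x 2k grid,
# and a row (or column) can be recovered once any k of its 2k cells are available.
k = 4
size = 2 * k

# available[i][j] == True means cell (i, j) was published or already recovered.
available = [[True] * size for _ in range(size)]
for i in range(k + 1):        # the adversary withholds a (k+1) x (k+1) corner,
    for j in range(k + 1):    # the smallest pattern that blocks reconstruction
        available[i][j] = False


def try_reconstruct(grid: list[list[bool]]) -> bool:
    """Repeatedly recover any row or column that already has >= k available cells."""
    progress = True
    while progress:
        progress = False
        for i in range(size):
            if k <= sum(grid[i]) < size:           # row i is recoverable
                grid[i] = [True] * size
                progress = True
            col = [grid[r][i] for r in range(size)]
            if k <= sum(col) < size:               # column i is recoverable
                for r in range(size):
                    grid[r][i] = True
                progress = True
    return all(all(row) for row in grid)


# (k+1)^2 withheld cells approach 25% of the grid as k grows, and they are enough
# to make the data unrecoverable in this model.
print(try_reconstruct(available))  # False
```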

Difficulty in P2P for Danksharding DAS

Danksharding is planned to be split into two (or more) phases, where the initial version will NOT include DAS. This is because the P2P networking design for efficient DAS in Ethereum remains an open question.

Here is an open RFP for the DAS P2P, here is a talk explaining the difficulty of P2P for DAS, and here is a blog post that touches on the topic. Furthermore, below is a quote from Vitalik in this podcast.

Moderator: If people want to contribute from the research/engineering point of view, what are the big open questions in sharding?

Vitalik: One problem is definitely sorting out the networking of DAS. There are designs that we have that work in theory, like doing it based on subnets, making DHT faster, a couple of other techniques in the middle. But taking these ideas from the engineering perspective and optimizing is very hard. Basically, it is like a scalable specialized DHT where publishing/downloading can happen extremely quickly. I think in general, in the Ethereum ecosystem, the networking side is talked about the least. But it is still a very important problem, and it would be amazing if we had more networking expertise in Ethereum.

Note that in Celestia, the P2P networking is simplified by relying on multiple supernodes to download and distribute the complete data. However, in Ethereum, although high requirements are accepted for the block builder, the aim is still to keep the requirements for validators low. Therefore, they need to design a P2P network that does not necessitate high bandwidth for validators.

In conclusion, designing a P2P network for DAS that does not rely on validators being supernodes is an open problem.

Danksharding without DAS

The initial phase will rely on part-wise attestation for availability. Each validator will be assigned to download a few rows and columns in the erasure-coded matrix (e.g. two rows and two columns), and will only attest for a block if the assigned rows/columns are available.

This ensures that the data is reconstructable as long as the majority of the validators are honest. This is described as "Danksharding honest majority validation" in the below slide:

Danksharding with DAS

As a future upgrade, Danksharding will incorporate DAS in order to obtain "malicious majority safety":


  1. References ↩︎