Future improvements for the DAL design

We suggest improvements to the DAL design for a future version.

Sampling for the DAL

Unlike Ethereum or Celestia, the Tezos DAL design allows multiple slot producers for one L1 level.

This design choice makes sampling less effective:

  • Sampling ensures with very high probability that the chain will stop when data is unavailable, provided there is only one slot producer per block. This probability decreases with more slot producers (such as 256).
  • Increasing the number of slot producers also has an impact on data retrievability.

We discuss below the technical difficulties in implementing sampling.

P2P for sampling and technical questions

With sampling, the P2P protocol must ensure the following properties simultaneously:

  • A cap on the number of connections for all the actors of the DAL
  • A cap on the bandwidth for all the actors of the DAL
  • The node must be able to fetch the shards quickly in order to propagate the block

While the first two points are properties already provided by the current design, the third one seems quite challenging and remains an open problem in the blockchain community (as mentioned here).

Moreover, implementing the sampling requires an answer to the following questions:

  • What happens if a node fails to sample the data?
  • If the sampling fails for a given block, for how long must the node refuse any branch containing this block?
  • How to implement a P2P protocol that supports sampling? At the time of writing, whether the gossip protocol approach supports sampling is still being determined.
  • How many samples should a node fetch? How to determine this number?

Does sampling really stop the chain with multiple slot producers?

Assume \(256\) slot producers, and suppose one slot is not available. With \(2048\) shards per slot, the total number of shards is \(256 \times 2048 = 524288\), and the number of available shards is \(524288 - 2048 = 522240\). The probability of drawing an available shard is \(\frac{522240}{524288} = \frac{255}{256} \approx 0.9961\). If the sampling selects shards uniformly, it means that for a node sampling \(20\) shards, the odds of detecting the problem are

\[ 1 - \left(\frac{255}{256}\right)^{20} \approx 7.5\% \]
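As a sanity check, here is a minimal sketch of this computation, assuming the parameters above (256 slots of 2048 shards each, a single fully unavailable slot, and uniform sampling with replacement):

```python
# Detection probability of one unavailable slot under uniform sampling.
# Parameters follow the example above; they are assumptions, not protocol constants.
TOTAL_SHARDS = 256 * 2048            # 524288 shards in total
MISSING = 2048                       # shards of the single unavailable slot
AVAILABLE = TOTAL_SHARDS - MISSING   # 522240 available shards

def detection_probability(samples: int) -> float:
    """Probability that at least one of `samples` uniform draws
    hits a missing shard (drawing with replacement)."""
    return 1.0 - (AVAILABLE / TOTAL_SHARDS) ** samples

print(f"{detection_probability(20):.3f}")  # ~0.075, i.e. about 7.5%
```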

Is this enough to stop the chain? How could such a failure be detected?

Two possible ways to increase this percentage:

  • Draw more shards. However, this requires more bandwidth, and the P2P protocol must be able to sustain it. Twenty shards require about \(80\) KB/s of bandwidth per node; with \(5000\) nodes, this amounts to \(400\) MB/s on top of the DAL's own bandwidth, to be spread among the attesters (at least, since it does not count the control messages of the P2P protocol). See the sketch after this list.
  • Require each node to draw at least some number of shards per slot, rather than only sampling uniformly over all shards.
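To give an idea of the cost of the first option, the sketch below (reusing the assumed parameters from the previous example) computes how many uniform samples a node would need to reach a given detection probability:

```python
import math

# Number of uniform samples needed to detect one unavailable slot with a
# given target probability, under the same assumed parameters as above.
P_AVAILABLE = 522240 / 524288  # = 255/256

def samples_needed(target: float) -> int:
    """Smallest n such that 1 - P_AVAILABLE ** n >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(P_AVAILABLE))

print(samples_needed(0.50))  # 178 samples for a coin-flip detection
print(samples_needed(0.99))  # 1177 samples, roughly 59x the bandwidth of 20
```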

Rational sampling

This section is a work in progress; the ideas presented remain to be tested and benchmarked.

However, we believe this approach gives a new perspective that could be helpful in the future. In particular, it could help weaken the honest-majority-of-attesters assumption.

Given a finite family of commitments \(c_i\), it may be possible to aggregate them into a linear combination

\[ c = \sum_i c_i \]
giving a new commitment \(c\).

With this scheme, \(c\) is called an aggregated commitment, and it could be computed from previously confirmed commitments.
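A sketch of why such an aggregation could work, assuming a homomorphic polynomial commitment scheme (such as KZG, which this document does not pin down): summing the commitments yields a commitment to the sum of the underlying polynomials,

\[ c_i = \mathrm{Com}(P_i) \implies c = \sum_i c_i = \mathrm{Com}\Big(\sum_i P_i\Big) \]
so the aggregated commitment \(c\) commits to \(P = \sum_i P_i\).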

For each aggregated commitment, the L1 could draw a random point \(x\) outside of the shards' domain.

Given an evaluation \(y\) and a proof \(p\), the L1 is able to check whether \(y = P(x)\), where \(P\) is the polynomial associated with the commitment \(c\).

Recall that a commitment is actually a polynomial commitment.

Computing \(P\) requires reconstructing the polynomials \(P_i\) associated with the commitments \(c_i\).
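A toy, non-cryptographic illustration of the step above: once the \(P_i\) are reconstructed, evaluating the aggregated polynomial at \(x\) is just summing the individual evaluations.

```python
# Toy check with plain integer polynomials (no cryptography): evaluating
# the aggregated polynomial P = P1 + P2 at x equals P1(x) + P2(x).
def eval_poly(coeffs: list[int], x: int) -> int:
    """Horner evaluation; coeffs[k] is the coefficient of x**k."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

p1 = [1, 2, 3]                        # 1 + 2x + 3x^2
p2 = [4, 0, 5]                        # 4 + 5x^2
p = [a + b for a, b in zip(p1, p2)]   # aggregated polynomial

x = 7
assert eval_poly(p, x) == eval_poly(p1, x) + eval_poly(p2, x)
```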

Attesters' attestations could be extended with an evaluation at a point, together with a proof that this evaluation is consistent with the aggregated commitment computed by the L1.

Committing points to an aggregated commitment

At level \(n\), an attestation contains the bitset attesting the availability of the data, plus an evaluation at a point and its proof for the aggregated commitment the attester must commit on.
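A minimal sketch of what such an extended attestation could carry; the field names and types are purely illustrative assumptions, not a specified encoding:

```python
from dataclasses import dataclass

# Illustrative shape of an extended attestation; names are assumptions.
@dataclass(frozen=True)
class ExtendedAttestation:
    level: int                 # L1 level n the attestation refers to
    availability_bitset: int   # one bit per slot, as in the current design
    evaluation: bytes          # y = P(x) at the L1-drawn point x
    proof: bytes               # opening proof tying y to the aggregated commitment c
```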

To ensure that the attester has the time to fetch all the polynomials, the commitments involved in the aggregation could be published a few levels earlier (the constant remains to be determined). Notice, though, that this is only a matter of latency.

Any time an attester is chosen to commit on a point associated with an aggregated commitment, the bandwidth required to compute the evaluation is around \(\frac{\text{slot size}}{\text{time between blocks}}\), which is assumed to be constant (about \(33\) KB/s, consistent with, for example, a \(1\) MB slot and \(30\) s between blocks). However, computing the corresponding evaluation requires more computational resources.

Regarding the P2P protocol, such a scheme seems easier to support but may require more memory and more bandwidth. It would be compatible with the notion of topic mentioned previously.

Even though this scheme demands more resources in terms of bandwidth and computation, the amount of resources remains proportional to the stake.

Moreover, as we detail below, we can mitigate bandwidth issues if attesters run multiple DAL nodes.

Splitting the attester's bandwidth

In the current design, only one DAL node can be used to fetch the shards assigned to an attester. For large attesters, this could have a cost in terms of bandwidth.

A way to overcome this limitation is to allow an attester to run several DAL nodes. There are two ways to split the bandwidth: either by splitting the slots or by splitting the shards.

Splitting the slots sounds easier and fits better with the current design.

Splitting the slots

Each DAL node would fetch the shards for a subset of the slots. Doing so fits naturally with the current design since the slots are already part of the notion of a topic.

An attester wishing to split its bandwidth in two could run \(3\) DAL nodes:

  • One for the first half of the slots
  • One for the second half of the slots
  • One to gather results from the first and the second DAL nodes

This does not change the interactions between the attester and the DAL node. Communication between the DAL nodes can be done via RPCs.
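A minimal sketch of this topology, assuming \(256\) slot indices and illustrative helper names (neither is part of the current design):

```python
# Illustrative split of slot indices between two shard-fetching DAL nodes,
# with a third node aggregating their answers; names and the slot count
# are assumptions, not protocol constants.
NUMBER_OF_SLOTS = 256
slots = range(NUMBER_OF_SLOTS)

node_a_slots = [s for s in slots if s < NUMBER_OF_SLOTS // 2]   # first half
node_b_slots = [s for s in slots if s >= NUMBER_OF_SLOTS // 2]  # second half

def aggregate(avail_a: dict[int, bool], avail_b: dict[int, bool]) -> dict[int, bool]:
    """Merge the per-slot availability reported by the two fetching nodes,
    as the third (aggregating) DAL node would do over RPC."""
    return {**avail_a, **avail_b}
```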

Splitting the shards

Each DAL node would be responsible for handling one fraction of the attester's stake.

One possible implementation is to introduce a new L1 operation allowing an attester to split its stake into \(n\) parts.

Hence, for the DAL committee, the shards of this attester would be split into \(n\) parts. Consequently, the definition of a topic should be changed.
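For illustration only, assuming a topic already pairs a slot index with an attester identity, one way the definition could change is to add a part index (field names are assumptions):

```python
from dataclasses import dataclass

# Illustrative revised topic; field names are assumptions. The current
# design would correspond to a single part per attester.
@dataclass(frozen=True)
class Topic:
    slot_index: int   # as in the current notion of a topic
    attester: str     # attester identity (e.g. its public key hash)
    part_index: int   # new: which of the n stake parts this topic covers
```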

Optimising resilience

The current design assumes a limited number of attesters, since the number of outbound connections for slot producers depends on the number of attesters. To make the DAL/P2P network scale, we should consider increasing the replication factor parameter (i.e., multiple attesters may route a shard) so that there is a benefit for attesters to be interconnected. The simulation via the relayed shards parameter shows a benefit but also highlights a bandwidth increase.

This can be mitigated by splitting the infrastructure into multiple DAL nodes, as described here. Another way to reduce this bandwidth overhead is lazy message propagation.
