# Future improvements for the DAL design
[toc]
We suggest improvements to the DAL design for a future version.
## Sampling for the DAL
Unlike Ethereum or Celestia, the Tezos DAL design allows multiple slot producers for one L1 level.
This design choice limits the effectiveness of sampling:
- *Sampling* ensures with a very high probability that the chain will stop when confirmed data is unavailable, as long as there is only one slot producer per block. However, this probability decreases with more slot producers (such as $256$).
- Increasing the number of slot producers also has an impact on data retrievability.
We discuss below the technical difficulties in implementing the sampling.
### P2P for sampling and technical questions
With sampling, the P2P protocol must ensure the following properties simultaneously:
- A cap on the number of connections for all the actors of the DAL
- A cap on the bandwidth for all the actors of the DAL
- The node must be able to fetch the shards quickly in order to propagate the block
While the first two points are properties already provided by the current design, the third one seems quite challenging and is still an open problem in the blockchain community (as mentioned [here](https://hackmd.io/GfXfa1CqQtGY6zoessBzIw?view#Data-Availability-Sampling)).
Moreover, implementing the sampling requires an answer to the following questions:
- What happens if a node fails to sample the data?
- If the sampling fails for a given block, for how long must the node refuse any branch containing this block?
- How to implement a P2P protocol that supports sampling? At the time of writing, whether the gossip protocol approach supports sampling is still being determined.
- How many samples should a node fetch? How to determine this number?
### Does sampling really stop the chain with multiple slot producers?
Assume $256$ slot producers, and assume that one slot is not available. The total number of shards is $524288$, and the number of available shards is $522368$. The probability of drawing an available shard is $\frac{522368}{524288}=0.996337$. If the sampling selects shards uniformly at random, then for a node sampling $20$ shards, the probability of detecting the problem is
$1 - \left(\frac{522368}{524288}\right)^{20} \approx 7\%$
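As a sanity check, here is a minimal sketch (in OCaml; the shard counts are the ones from the example above) that recomputes this figure and also derives how many samples would be needed to reach, say, a $99\%$ detection probability under the same assumptions:
```ocaml
(* Probability that a node sampling [k] shards uniformly at random (with
   replacement, as an approximation) detects the missing data, given
   [avail] available shards out of [total]. *)
let detection_probability ~total ~avail ~k =
  let q = float_of_int avail /. float_of_int total in
  1. -. (q ** float_of_int k)

(* Smallest number of samples reaching a target detection probability. *)
let samples_for ~total ~avail ~target =
  let q = float_of_int avail /. float_of_int total in
  int_of_float (ceil (log (1. -. target) /. log q))

let () =
  let total = 524288 and avail = 522368 in
  (* ~7%, matching the figure above *)
  Printf.printf "20 samples detect with probability %.2f%%\n"
    (100. *. detection_probability ~total ~avail ~k:20);
  (* ~1256 samples *)
  Printf.printf "99%% detection requires %d samples\n"
    (samples_for ~total ~avail ~target:0.99)
```
The second figure shows how weak uniform sampling is in this scenario: a node would need on the order of a thousand samples to detect, with high probability, that a single slot out of $256$ is missing.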
Is this enough to stop the chain? How could such a failure be detected?
There are two possible ways to increase this percentage:
- Draw more shards. However, this requires more bandwidth, and the P2P protocol must be able to support it. Twenty shards require about $80$ KB/s of bandwidth. With $5000$ nodes, this requires a bandwidth of $40$ MB/s in addition to the DAL bandwidth to be spread among the attestors (at least, since this does not count the control messages of the P2P protocol)
- Require that nodes draw at least some number of shards per slot, for example (see the sketch after this list)
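To illustrate the second option, here is a sketch under the same assumptions as above ($256$ slots of $2048$ shards each), additionally assuming that the $1920$ missing shards all belong to the single unavailable slot, which therefore keeps only $128$ available shards:
```ocaml
(* Uniform sampling: each draw misses the problem with probability
   522368/524288 (the numbers from the example above). *)
let uniform ~k = 1. -. ((522368. /. 524288.) ** float_of_int k)

(* Per-slot sampling: only the draws made inside the bad slot matter, and
   each of them hits one of its 128 remaining shards with probability
   128/2048. *)
let per_slot ~k = 1. -. ((128. /. 2048.) ** float_of_int k)

let () =
  Printf.printf "uniform, 20 draws in total: %6.2f%%\n" (100. *. uniform ~k:20);
  Printf.printf "per slot, 1 draw per slot : %6.2f%%\n" (100. *. per_slot ~k:1);
  Printf.printf "per slot, 2 draws per slot: %6.2f%%\n" (100. *. per_slot ~k:2)
```
The per-slot requirement detects the problem far more reliably ($93.75\%$ with a single draw per slot, $99.6\%$ with two), but at the cost of $256$ or $512$ draws instead of $20$, which brings back the bandwidth concern of the first option.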
## Rational sampling
:::warning
:warning: This section is a work in progress, and the ideas presented remain to be tested and benchmarked.
However, we believe this approach gives a new perspective that could be helpful in the future. In particular, it could help weaken the assumption of an honest majority of attesters.
:::
Given a finite family of commitments $c_i$, it may be possible to aggregate them into a linear combination
$$
c = \sum c_i
$$
giving a new commitment $c$.
In this scheme, $c$ is called an aggregated commitment; it could be computed from previously confirmed commitments.
For each aggregated commitment, the L1 could draw a random point $x$ outside of the shards' domain.
Given a valuation $y$ and a proof $p$, the L1 is able to check whether $y=P(x)$ where $P$ is the polynomial associated with the commitment $c$.
:::info
Recall that a commitment is actually a polynomial commitment.
:::
Computing $P$ requires reconstructing the polynomials $P_i$ associated with the commitments $c_i$.
Attesters' attestations could be extended with the valuation of a point and a proof that this point lies on the polynomial behind the aggregated commitment computed by the L1.
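As a toy illustration of the additive structure this relies on, summing polynomials and evaluating the sum at $x$ gives the sum of the individual evaluations. Real commitment schemes with this property (such as KZG) work over a finite field and come with succinct proofs; none of the cryptography is modelled in this sketch:
```ocaml
(* A polynomial is represented by its list of coefficients, lowest degree
   first; [eval] uses Horner's rule. *)
let eval coeffs x = List.fold_right (fun c acc -> c +. (x *. acc)) coeffs 0.

(* Coefficient-wise sum of two polynomials. *)
let add_poly p q =
  let rec go p q =
    match (p, q) with
    | [], r | r, [] -> r
    | a :: p', b :: q' -> (a +. b) :: go p' q'
  in
  go p q

let () =
  let p1 = [ 1.; 2.; 3. ] (* 1 + 2x + 3x^2 *)
  and p2 = [ 5.; 0.; 1. ] (* 5 + x^2 *) in
  let aggregated = add_poly p1 p2 in
  let x = 7. in
  (* Evaluating the aggregated polynomial agrees with summing the individual
     evaluations, which is what the attester's committed point relies on. *)
  assert (eval aggregated x = eval p1 x +. eval p2 x)
```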
<!--
### How to compute aggregated commitment
Assume we want to ensure data retrievability for $2048$ blocks. With a time between blocks of $30$ seconds, this corresponds to about a day.
The L1 aggregates commitments according to their slot index $i$ by summing the confirmed commitments for the last $2048$ blocks. For each level, the L1 computes $256$ aggregated commitments by taking the aggregated commitments from the previous level, adding the newly confirmed slots, and removing the oldest ones. For the new confirmed slots, we use a random permutation. It is important that this source of randomness cannot be predicted easily; for instance, one can take the payload hash of the previous block.
At any level, each attester selected by the DAL committee is assigned a random number between $0$ and $255$ which designates the aggregated commitment that it has to commit on.
-->
### Committing points to an aggregated commitment
At level $n$, an attestation contains the bitset attesting the availability of the data, plus the valuation of a point of the aggregated commitment the attester must commit on, together with its proof.
To ensure that the attester has time to fetch all the polynomials, the commitments involved in the aggregation could be restricted to those published a few levels earlier (the exact constant remains to be determined). Note that this is only a matter of latency.
Any time an attester is chosen to commit on a point associated with an aggregated commitment, the bandwidth required to compute the valuation is around $\frac{\text{slot size}}{\text{time between blocks}}$, which is assumed to be constant (about $33$ KB/s). However, computing the corresponding valuation requires more computational resources.
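For instance, with a $1$ MB slot and $30$ seconds between blocks (figures assumed for this example only), this gives:
$$
\frac{\text{slot size}}{\text{time between blocks}} \approx \frac{1\,\text{MB}}{30\,\text{s}} \approx 33\,\text{KB/s}
$$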
Regarding the P2P protocol, such a scheme seems easier to support but may require more memory and more bandwidth. Such a scheme would be compatible with the notion of topic mentioned previously.
Even though this scheme demands more resources in terms of bandwidth or computation, the amount of resources required remains proportional to the stake.
Moreover, as we detail [below](https://hackmd.io/E6_izRwdRbaas-2rMZ_f0A#Splitting-attestor%E2%80%99s-bandwith), we can mitigate bandwidth issues if attesters run multiple DAL nodes.
<!--
### Analysis
With this simple scheme, each polynomial is attested $2048$ times. Assuming that an attestation is not valid if a point cannot be provided, this provides incentives for attesters to fetch the data and also provides data on demand.
Assuming rational attesters, a slot that was confirmed but was actually not available will eventually stop the chain, but this may take quite some time depending on the latency.
-->
<!--
### Open questions
- Could we stop the L2 cementation or commitment process for confirmed slots on which some attesters were not able to commit (i.e. when the aggregated commitment contains a commitment for such a slot)?
- Is this scheme tolerant to malicious attesters? If so, for which fraction of the attesters?
-->
## Splitting attestor's bandwidth
In the current design, only one DAL node can be used to fetch the shards assigned to an attestor. For large attestors, this could be costly in terms of bandwidth.
A way to overcome this limitation is to allow an attestor to run several DAL nodes. There are two ways to split the bandwidth: either by splitting the slots or by splitting the shards.
Splitting the slots sounds easier and fits better with the current design.
### Splitting the slots
Each DAL node would fetch the shards for a subset of the slots. Doing so fits naturally with the current design since the slots are already part of the notion of a topic.
An attestor wishing to split its bandwidth in two could run $3$ DAL nodes:
- One for the first half of the slots
- One for the second half of the slots
- One to gather results from the first and the second DAL nodes
This does not change the interactions between the attestor and the DAL node. Communication between the DAL nodes can be done via RPCs.
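A hypothetical sketch of the gathering node's role, assuming each worker node exposes an RPC returning the slots it can attest (names and types are illustrative, not the actual DAL node API):
```ocaml
let number_of_slots = 256

(* Each worker DAL node covers a subset of the slot indexes and exposes, over
   RPC, the subset of those slots for which it fetched all assigned shards. *)
type worker = { covered_slots : int list; attestable_slots : unit -> int list }

(* The gathering node merges the workers' answers into the single
   availability bitset included in the attestation. *)
let merge (workers : worker list) : bool array =
  let bitset = Array.make number_of_slots false in
  List.iter
    (fun w ->
      List.iter
        (fun slot ->
          assert (List.mem slot w.covered_slots);
          bitset.(slot) <- true)
        (w.attestable_slots ()))
    workers;
  bitset
```
With two workers covering slots $0$ to $127$ and $128$ to $255$, the merged bitset is the same attestation the attestor would have produced with a single DAL node.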
### Splitting the shards
Each DAL node would be responsible for handling one fraction of the stake of the attestor.
One possible implementation is to introduce a new L1 operation allowing an attestor to split its stake into $n$ parts.
Hence, for the DAL committee, the shards of this attestor would be split into $n$ parts. Consequently, the definition of a *topic* should be changed.
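For illustration only, one possible deterministic partition of an attestor's assigned shard indexes into $n$ parts (the actual split would be defined by the new L1 operation mentioned above, and the topic definitions would have to follow it):
```ocaml
(* Round-robin split of an attestor's assigned shard indexes into [n] parts,
   one part per DAL node. Purely illustrative. *)
let split_assignment ~n (assigned_shards : int list) : int list array =
  let parts = Array.make n [] in
  List.iteri
    (fun i shard -> parts.(i mod n) <- shard :: parts.(i mod n))
    assigned_shards;
  Array.map List.rev parts
```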
## Optimising resilience
The current design assumes a limited number of attestors since the number of outbound connections for slot producers depends on the number of attestors. To make the DAL/P2P network scale, we should consider increasing the replication factor parameter (i.e. multiple attestors may route a shard) so that attestors benefit from being interconnected. The simulation via the relayed shards parameter shows a benefit but also highlights a bandwidth increase.
This can be mitigated by splitting the infrastructure into multiple DAL nodes, as described [here](https://hackmd.io/E6_izRwdRbaas-2rMZ_f0A#Splitting-DAL-attestations). Another way to reduce this bandwidth overhead is lazy message propagation.
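Lazy propagation could follow the spirit of gossipsub's IHAVE/IWANT control messages: peers advertise shard identifiers cheaply and send the full payload only on request. A minimal sketch, with message types that are purely illustrative and not the actual DAL P2P protocol:
```ocaml
(* Shards are advertised by identifier first; the (much larger) payload is
   only sent when a peer explicitly asks for it. *)
type shard_id = { commitment : string; shard_index : int }

type message =
  | IHave of shard_id list    (* cheap advertisement of held shards *)
  | IWant of shard_id list    (* request for the missing ones *)
  | Shard of shard_id * bytes (* full payload, sent only on demand *)

(* On receiving an advertisement, request only the shards we do not
   already have. *)
let handle_ihave ~have advertised =
  match List.filter (fun id -> not (List.mem id have)) advertised with
  | [] -> None
  | missing -> Some (IWant missing)
```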