This document presents a study of Data Availability Sampling (DAS) for Ethereum, a technique that enables efficient verification of data availability in blockchain networks. We study two primary approaches for data dissemination: unstructured (gossip-based) and structured (DHT-based). We focus on the structured approach, tailoring our random sampling design accordingly. Multiple algorithms and their variations are proposed and evaluated with a prototype written in Rust. Additionally, we provide design considerations encompassing networking, security, and privacy aspects. Our results showcase the performance of and trade-offs between different approaches, offering insights for selecting appropriate techniques. This study contributes to ongoing research and development efforts to secure the rollup-centric future of Ethereum L1.
The future of Ethereum is rollup-centric: instead of providing more space for transactions, it will reduce the cost of storing blobs of data, which the core protocol does not attempt to interpret. Verifying a blob simply requires checking that the data is available - that it can be downloaded from the network. However, supporting a constant inflow of extrinsic data will require Ethereum's consensus to upgrade the way such data availability (DA) is guaranteed.
Today, ensuring availability requires nodes to download a full block and verify the Merkle proofs associated with it. With an increased storage capacity per block, such a check will only be feasible for nodes running specialized hardware with high-throughput network connectivity, which goes against the core design principles of Ethereum. DA must remain sufficiently decentralized and be able to withstand targeted attacks.
For this purpose, a technique called random sampling was popularized in a paper by Mustafa Al-Bassam, Alberto Sonnino, and Vitalik Buterin in 2018. Since then it has gone through some major changes, introducing KZG instead of fraud proofs and replacing the trusted committee with Danksharding (DS), but the core idea remains the same - extend useful data with multi-dimensional erasure codes such as Reed-Solomon, disseminate it to the network, and require validators to query a number of individual samples at random until statistical confidence of full availability is reached.
In this work, we lay down the theoretical background for Data Availability Sampling (DAS) and propose multiple options for deploying its core components onto the Ethereum network. The primary focus of this research is on sample dissemination and random sampling. However, we include some considerations regarding fault identification and data reconstruction as well.
Reed-Solomon (RS) is a type of error-correcting code that works by encoding original data consisting of k symbols into n > k coded symbols, such that the original data can be recovered from any k of them. In DAS, RS codes (with a rate of 1/2) are used to extend the block data, so that any 50% of the coded symbols are sufficient to reconstruct the original data.
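To make the encode-and-recover property concrete, here is a minimal sketch of RS extension over a small prime field using plain Lagrange interpolation; the field, parameters, and symbol encoding are purely illustrative and are not those used by Ethereum:

    # Illustrative Reed-Solomon sketch over a small prime field.
    P = 65537  # prime modulus

    def poly_eval(coeffs, x):
        # Evaluate a polynomial given by ascending coefficients at x (mod P).
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % P
        return acc

    def rs_extend(data, n):
        # Treat the k data symbols as polynomial coefficients and evaluate at n > k points.
        return [poly_eval(data, x) for x in range(n)]

    def rs_recover(points, k):
        # Lagrange-interpolate the degree-(k-1) polynomial from any k (x, y) pairs,
        # returning its coefficients, i.e. the original data symbols.
        xs, ys = zip(*points[:k])
        coeffs = [0] * k
        for i in range(k):
            basis = [1]  # basis polynomial l_i(x), built up as ascending coefficients
            denom = 1
            for j in range(k):
                if j == i:
                    continue
                # Multiply basis by (x - xs[j]).
                basis = [(a - xs[j] * b) % P for a, b in zip([0] + basis, basis + [0])]
                denom = (denom * (xs[i] - xs[j])) % P
            scale = ys[i] * pow(denom, -1, P)
            coeffs = [(c + scale * b) % P for c, b in zip(coeffs, basis)]
        return coeffs

    data = [3, 1, 4, 1]                      # k = 4 original symbols
    extended = rs_extend(data, 8)            # n = 8 coded symbols (rate 1/2)
    subset = list(enumerate(extended))[4:8]  # any 4 of the 8 points suffice
    assert rs_recover(subset, 4) == data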
Kate commitments (KZG) are a type of polynomial commitment scheme that employs a structured reference string (SRS) and a trusted setup, thus producing toxic waste. It requires pairing-friendly curves, and its security assumes the hardness of problems in pairing groups. In KZG, the prover can convince a verifier that a committed polynomial evaluates to a given value at a given point, using a constant-size proof.
In DAS, KZG is used to prove that RS codes are well-formed (that all extension points lie on the same polynomial as the original data points, more details here) and can be used to reconstruct data. Being an alternative to fraud proofs, it has much lower communication overhead, though it does require more resources from the node generating the proofs. Committing to all of the data in a single KZG commitment would also require nodes to download all of the data for reconstruction. We are okay with a super-node requirement for initial building, but we need to avoid such an assumption for reconstruction.
The 2D-KZG scheme is a way to make reconstruction easier: instead of committing to the whole block at once, the block is split into a square matrix of chunks, and each row and column of the extended matrix is erasure-coded and committed to separately. This way, a single row or column can be reconstructed from 50% of its chunks without downloading the rest of the block.
In the rest of this document, when we mention rows and columns we will refer to the commitments encoded along the horizontal and vertical axes according to the 2D-KZG scheme we have just outlined.
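As an illustration of this layout, the following sketch maps a flat sample index to its row and column in the extended matrix and lists the samples a validator assigned to a given row or column cares about; the 512x512 dimensions match the numbers used later in this document, while the row-major indexing is an assumption made for the example:

    EXTENDED_DIM = 512  # extended matrix is 512 x 512 samples (256 x 256 before extension)

    def sample_position(sample_index):
        # Row-major mapping of a flat sample index to its (row, column) in the extended matrix.
        return divmod(sample_index, EXTENDED_DIM)

    def samples_in_row(row):
        return [row * EXTENDED_DIM + col for col in range(EXTENDED_DIM)]

    def samples_in_column(col):
        return [row * EXTENDED_DIM + col for row in range(EXTENDED_DIM)]

    assert sample_position(0) == (0, 0)
    assert sample_position(EXTENDED_DIM * EXTENDED_DIM - 1) == (511, 511)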
Builders are resource-optimized and highly-incentivized actors that are responsible for compiling the "exec block body" (an ordered list of transactions) and submitting it along with a bid to the network. According to Vitalik's "Endgame", as long as block validation is decentralized, block building can be more centralized, i.e. run by companies/institutions rather than ordinary users. Nonetheless, we must not rely on builders to continuously (in the span of multiple epochs) act consistently or even honestly. More details in the spec here.
In DAS, builders are expected to erasure-code blobs of data and generate KZG proofs attesting to the validity of extended chunks.
Validators are actors whose responsibility is to verify the blocks produced by the current builder. It is crucial for validators to remain decentralized, so we should not assume much about their hardware.
In DAS, validators attest to the availability of 2 rows and 2 columns of the erasure-coded data by querying the network for chunks associated with those locations. While they need to attest only in their own slot, validators are expected to keep querying the rows and columns assigned to them until they are re-assigned in the next epoch.
The proposer is an actor selected from the validator set using the RANDAO mechanism. According to the proposer-builder separation (PBS) proposal, these nodes are purposely separated from the builders to prevent censorship and MEV attacks. The proposer's main responsibility is to accept the exec block body with the highest bid. DAS makes no distinction between the proposer and other validator nodes.
Regular nodes and light clients are expected to partake in random sampling and to rebroadcast any samples they receive, helping disseminate samples to the entirety of the network.
Kademlia is a distributed hash table (DHT) protocol that organizes the network into a binary tree. Each node maintains a partial view of the network tree stored in its routing table, which is further separated into k-buckets, each covering a distinct range of XOR-distances from the local node. In order to find the nodes that are currently responsible for a specific key, a node iteratively queries the closest contacts it knows of, learning about ever closer nodes from their responses, until no closer nodes can be found.
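A minimal sketch of the distance metric and of a single lookup step, assuming 256-bit integer ids (the helper names are illustrative):

    def xor_distance(a: int, b: int) -> int:
        # Kademlia distance between two 256-bit ids is simply their XOR.
        return a ^ b

    def bucket_index(local_id: int, other_id: int) -> int:
        # Position of the highest differing bit, i.e. log2 of the distance (ids assumed distinct).
        return xor_distance(local_id, other_id).bit_length() - 1

    def closest(known_ids, key, count=16):
        # One iteration of a lookup: pick the peers closest to the key and query them
        # for even closer peers, repeating until no progress is made.
        return sorted(known_ids, key=lambda nid: xor_distance(nid, key))[:count]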
Discv5 is a discovery protocol built on top of Kademlia, but without FIND_VALUE (and STORE) queries, as for peer discovery only FIND_NODES is needed. For security reasons, discv5 nodes must regularly perform liveness checks and routing table maintenance to favor long-lived nodes. The communication between discv5 nodes is done over the UDP transport.
Discv5 is already tightly integrated into Ethereum client design and implementation. Hence, whenever structured overlays are considered, we should prefer it over other Kademlia-based systems.
GossipSub is a p2p protocol that is based on a publish-subscribe system, in which members of the network subscribe to topics and messages will be pushed to them. A popular implementation of pub-sub systems uses a mechanism called flood-sub, which upon receiving a message, will forward it to any peers in the network that are subscribed to the topic. GossipSub takes this concept and makes it a bit more intelligent by asking peers if they have seen the message before forwarding it, “have you heard the gossip”, which reduces unnecessary communications.
Values above are taken from multiple sources [I, II, III] and are subject to change.
Validators need to download KZG commitments for their assigned rows and columns so that they can take part in the reconstruction. The exact numbers may vary for blocks produced at different slots. The target is 256, while the max is 512. Each commitment is 48 bytes, so it would be practical to send all of them in one message.
This data needs to be disseminated at the same time for each slot, similar to how full blocks are broadcast today. Hence, it would be wise to consider the same tool for the job. A flooding-based pub-sub gossip protocol such as GossipSub organizes communication into different "broadcast groups" ("topics") that participants can join ("subscribe") and send messages ("publish") to.
There exist multiple configurations for how messages are sent over gossip networks: push, pull, and push-pull. According to [1: p. 10-11], pull is generally faster and takes less bandwidth than other configurations; however, this advantage quickly vanishes when a connection limit is present. Hence, assuming the worst-case hardware run by validators, we give preference to push-based gossiping, which can also benefit from the higher bandwidth available to publishers, i.e. builders.
Another relevant observation is that builders may choose to include different amounts of data in their exec bodies for the same block. Since it is up to the proposer to decide on the winning body, to save time builders may need to broadcast rows & columns to the interested validators before the winning body is revealed. This way, builders will generate rows & columns and send them as a commitment to their exec block body, along with their bids.
Alternatively, the dimensions can be set to a predetermined upper bound. To fill empty rows and columns, builders would have to pad blobs of data with zeros. This spares the need to broadcast the dimensions, which could potentially save some time. However, it results in unnecessary communication and storage overhead, arguably a worse trade-off.
A properly erasure-coded block results in more than a quarter of a million (512^2) samples. Any 50% of these is sufficient to reconstruct the whole block. Furthermore, any 50% of the samples associated with a specific row/column is sufficient to reconstruct that row/column. However, before being requested, all these samples must first be somehow put into the network. Hence, the aim of the current component is to prepare the network for the upcoming sampling queries.
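For a sense of scale, a quick back-of-the-envelope calculation using the 512-byte sample size mentioned later in this document:

    EXTENDED_DIM = 512
    SAMPLE_SIZE = 512  # bytes per sample

    total_samples = EXTENDED_DIM ** 2                   # 262,144 samples
    reconstruct_whole = total_samples // 2              # any 50% suffice for the whole block
    reconstruct_row = EXTENDED_DIM // 2                 # any 50% of a row/column suffice for that row/column
    extended_block_bytes = total_samples * SAMPLE_SIZE  # 128 MiB of extended data to disseminate

    print(total_samples, reconstruct_whole, reconstruct_row, extended_block_bytes // 2**20)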
There are two types of sampling the network must be able to support: row/column sampling performed by validators, and random sampling performed by any client wishing to verify availability.
For the former, it would be sufficient to have a topology-aware broadcast that could reach all interested peers within the tightly-constrained time without creating a lot of unnecessary bandwidth. According to the PBS proposal, validators will have roughly ~2.5 seconds (between T+5.33s and T+8s) to receive samples for their rows and columns. However, KZG proving for 32 MB of raw data takes ~1s on a 32-64 core CPU, so unless all builders do this heavy computation before the winner announcement, the dissemination has even less time (~1.5s).
For the latter, it is crucial that samples are distributed uniformly and deterministically, such that any client can figure out where to look for the randomly chosen ones. Such a predictable configuration needs to remain in place even after the block has been declared available. This is needed for nodes that have gone offline for an extended period of time and now wish to regain trust in the current leading chain, i.e. to know whether it was available that whole time.
In the following sections, we discuss some approaches for sample dissemination. We will then evaluate their performance and applicability according to the known parameters and based on our own simulation tests.
Individual validators remain interested in samples from the same rows and columns during the whole epoch (6.4 min today). Such a rate of rotation is persistent enough to at least consider the use of gossip protocols for disseminating samples to validators that follow the head.
At the start of each epoch, validators will subscribe to the topics indexed by the rows and columns they have just been assigned. From then on, they will receive samples from their subscribed topics until being re-assigned in the next epoch.
There are 512 individual rows and columns (1024 total). The problem is that gossip communication tends to become unreliable if there is a large number of topics, each with few subscribers. This is because the sub-graph of nodes subscribed to a particular topic may no longer be connected, causing flooding to fail.
Churn caused by frequently changing topics is another issue. There are some strategies to perform such rotation slowly, which may help avoid churn. However, these are tricky to implement in practice, especially with our strict time constraints.
One such approach was proposed by Vitalik. It involves sharding pub-sub topics horizontally on subnets and propagating samples to subscribers vertically. This grid structure could provide a limited improvement when it comes to churn, as horizontal subnets rotate slowly (once per epoch) while vertical subnets are not tied to the epoch schedule and thus can remain even longer. It also comes with some interesting shortcuts, e.g. indirect sample distribution, where peers could share a small number (~4) of public vertical subnets to act as a force multiplier and increase coverage.
At this point, we already have one implementation of multi-sharded gossip at such a scale. The eth2-das prototype developed by Protolambda, based on a slightly updated design, showed that, while feasible, this approach produces a lot of undesirable bandwidth. Most importantly, any gossip-based method only works for live sampling, as there is no way to reliably request samples past the most recent block. With historical retrievability being a crucial requirement for DAS, this renders the gossip-based approach suitable only as a parallel mechanism working alongside a structured network, in case the latter cannot satisfy block time constraints.
Communication conducted over structured overlays is known to have a much lower overhead compared to gossip-based protocols. Distributed hash tables (DHT) offer scalable routing latency and an efficient lookup interface. However, structured protocols are generally more fragile in the presence of failures, lacking the natural resilience of gossip protocols. DHTs were also not designed for large-scale broadcast or data dissemination, requiring additional coordination to be used for sample dissemination.
A promising approach is to construct new broadcast primitives that combine structured and gossip-based approaches. In that way, one can benefit from the scalability and resilience of pure gossip while approximating the efficiency of structured solutions. Finally, data once disseminated into a P2P data structure can then be retrieved at any point in the future, as long as it remains stored by at least one live node.
In the rest of this section, we develop a specialized data dissemination protocol that considers network topology and the nature of the data to be disseminated. Compared to existing broadcast solutions that only define how a message is routed (routing strategy), our solution must also specify how the individual pieces that collectively form the original data are bundled together into messages (bundling strategy) that are deterministically addressed to a certain selection of remote peers (peer selection strategy).
We define bundling as a means of grouping multiple pieces (samples, from here on) of the same message into a bundle that can be sent over to certain remote peers. How the data was originally split (RS encoding here) is outside the bundling strategy definition, but it may be useful to consider it when allocating pieces to a group.
In practice, a bundling algorithm must take a list of samples and produce one or more groups of samples, each addressed to one or more remote peers. Cases where a single group is addressed to more than one peer will be covered by the peer selection strategy definition. For now, assume that each bundle is intended for a certain remote peer based on some commonly shared characteristic(s) that must be unambiguously recognized by any node in the network. To ensure determinism and reproducibility, only public values must be used when mapping samples to their destination peers.
Below we present 3 bundling strategies and the trade-offs between them.
Kademlia uses XOR to determine the distance between two points in the id space. XOR is symmetric, allowing participants to evaluate queries from each other's point of view by only knowing their node identities. This property makes Kademlia's distance metric perfectly suited for grouping samples.
The initiator starts by calculating the distances between each sample-id and the node-ids of all remote peers it knows. She then sorts the samples by distance and, for each node-id, selects the k closest sample-ids, resulting in one or more bundles of samples, mapped by the node-id they are intended for.
pseudocode:
def group_by_closest_node(self, sample_ids, k):
    node_ids = self.discv5.get_all_known_node_ids()
    result = {}  # Map<NodeId, [SampleId]>
    for node_id in node_ids:
        # For each known peer, pick the k samples closest to its node-id by XOR-distance.
        sorted_sample_ids = sorted(sample_ids, key=lambda sample_id: sample_id.xor_distance(node_id))
        result[node_id] = sorted_sample_ids[:k]
    return result
This method is simple and understandable, but it is far from efficient: the complexity of this step is O(n · m log m) for n known nodes and m samples, since the full list of samples is sorted once per node.
In Kademlia, peers are organized into a binary tree. Its routing table, usually referred to as k-buckets table, is organized around these subtrees. Hence, the most natural way to map Kademlia's broadcasting space is bucket-wise.
The initiator starts by mapping samples to k-bucket indexes based on the XOR-distance between each sample-id and the local node-id. For each bucket index, the initiator then selects node-ids from the corresponding bucket of its routing table to receive the resulting bundle.
pseudocode:
def group_by_buckets(self, sample_ids, k):
    local_view = self.discv5.get_nodes_per_bucket()
    # Map each sample to the bucket index corresponding to its XOR-distance from the local node-id.
    bucket_indexes = [sample_id.xor_distance(self.local_node_id).bucket_index() for sample_id in sample_ids]
    grouped_sample_ids = group_sample_ids_by_bucket(sample_ids, bucket_indexes)
    result = {}  # Map<[NodeId], [SampleId]>
    for (bucket_idx, samples) in grouped_sample_ids:
        # Address each bundle to up to k contacts from the corresponding bucket.
        nodes = local_view[bucket_idx][:k]
        result[nodes] = samples
    return result
A remote peer receiving a bundle of samples from the initiator will perform the same operations from its own perspective, using its local node-id. The one tricky part comes from the fact that buckets are different for each node (as they represent distances to that node). Thus, one must be sure that when a node sends a message to a contact from one of its buckets, this contact will in turn be able to map this bucket to its own routing table. Luckily, thanks to the proof in [2: p. 758], we know that for any forwarder such a mapping always exists, so the receiving contact can consistently re-bundle the samples from its own point of view.
It is true that ids generated using collision-resistant hash functions form a uniform distribution over the Kademlia key space. This, however, is no longer the case when the relative distances are mapped to a finite number of buckets. Closer buckets are narrower, while most of the samples will correspond to a few furthest buckets. This will result in distant nodes receiving larger bundles than closer ones. Although not particularly problematic, since the dissemination origin changes each slot, one can minimize this disproportion by tweaking the parameter k (the number of contacts selected per bucket).
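The skew is easy to observe with a quick simulation over uniformly random ids, where the bucket index is taken as the position of the highest bit differing from the local node-id (illustrative sketch):

    import random
    from collections import Counter

    ID_BITS = 256
    local_id = random.getrandbits(ID_BITS)

    def bucket_index(local, other):
        # Highest differing bit; a larger index means a greater XOR-distance.
        return (local ^ other).bit_length() - 1

    counts = Counter(bucket_index(local_id, random.getrandbits(ID_BITS)) for _ in range(10_000))
    # Roughly half the ids land in the furthest bucket, a quarter in the next one, and so on;
    # the closest buckets receive almost nothing.
    for idx in sorted(counts, reverse=True)[:5]:
        print(idx, counts[idx])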
The Kademlia authors indicate that, although described in different terms [4: p. 1], Pastry's prefix-based routing is very similar to that of Kademlia. Indeed, Kademlia's buckets can be mapped to entries in a prefix-based routing table, while an increasing common prefix directly results in a shorter XOR-distance. We use this insight to define prefix-based bundling. A similar method was also proposed by Protolambda in his more recent work.
The initiator must maintain a secondary routing table along with Kademlia's primary one; it is essentially a binary prefix tree (trie) which organizes remote peers into branches based on their common prefix (see example). An initiator will start by grouping samples by the common prefix they share with the node-ids of known peers, addressing each group to the peers in the matching trie branch.
A peer receiving a bundle is required to determine the current branching position on the trie to decide on further sample bundling. As proposed in [5: p. 2], such context awareness can be achieved by sending packets carrying the prefix currently addressed, which the authors call the destination prefix. This destination prefix grows in length with every forwarding hop while descending the trie.
Since common prefix length and XOR-distance are directly correlated, it is likely that samples disseminated with this method will generally be retrievable by standard Kademlia lookups. Nonetheless, one should mind edge cases, account for the possibility of degraded lookup efficiency, and consider using specialized prefix-based routing for lookups too.
We define peer selection as a set of rules that control the quantity and order in which peers are picked as destinations for bundles. Under stable conditions it would be most efficient for each sample to be stored in one place. However, we cannot assume such conditions hold continuously, so some redundancy settings must be applied.
The main problem with the structured dissemination is that a failure in a node will cause a whole tree branch to be mishandled. This is especially likely in networks where nodes join and leave the system often, known as churn. Common mitigation methods trade better coverage for an increased latency or a higher number of sent messages. In this section we cover the latter trade-off, introducing 3 settings to exploit redundancy for churn and network fault tolerance. In later sections we cover other techniques for dealing with these threats.
The advantage of this technique is that it adds no latency, since parallel messages are sent almost simultaneously. When redundancy is applied, the number of messages is multiplied and a given node may receive the same message several times. This requires each message (bundle) to have a unique identifier so that nodes can process it according to special pre-configured policies.
The first setting is the replication multiplier, which controls how many peers each bundle is replicated to. A related setting is the replication radius - a XOR-distance between a sample-id and any node encountered during the dissemination, within which that node is considered interested in storing the sample.
Another pair of settings controls when messages are replicated: either once, by the initiator (replicate-one/R1), or on each hop, so that any node receiving a message would forward it to several more locations (replicate-all/RA).
A node receiving the same bundle of samples for a second time can simply skip handling it again (the forward-one/F1 configuration). Alternatively, the node can keep forwarding it to the next nodes in the routing tree (forward-all/FA). With the latter option, if a node receives a sample it has already handled, it will attempt to forward it to the same set of peers as the first time, unless its routing table was updated between invocations. This is expected behavior, meant to account for lost messages and mishandled routing branches in an attempt to maximize coverage.
Considering both previously introduced policies, we now have four possible configurations: R1/F1, R1/FA, RA/F1 and RA/FA. We can immediately discard the RA/FA option because it would lead to an explosion of messages on the network. As mentioned, the R1/F1 option can only be applied under stable network conditions, but since there is no reliable way for a single node to estimate network health from its own point of view, use of this setting is not recommended. In [6: p. 7] the authors concluded, based on their experimental results, that the best coverage under simulated churn is achieved with RA/FS, where forward-some/FS is an additional setting that forwards duplicated messages only up to a certain threshold, which needs to be experimentally determined based on the network size and topology.
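A minimal sketch of how these redundancy settings could be represented and consulted when a duplicate bundle arrives; the names ReplicationPolicy and ForwardPolicy are illustrative, chosen to match the FP.ForwardAll check used in the pseudocode later in this section:

    from enum import Enum

    class ReplicationPolicy(Enum):
        ReplicateOne = "R1"   # only the initiator replicates bundles
        ReplicateAll = "RA"   # every hop replicates bundles

    class ForwardPolicy(Enum):
        ForwardOne = "F1"     # skip bundles that were already handled
        ForwardAll = "FA"     # keep forwarding duplicates
        ForwardSome = "FS"    # forward duplicates only up to a threshold

    def should_forward(policy, times_seen, threshold=2):
        if times_seen == 0:
            return True
        if policy == ForwardPolicy.ForwardAll:
            return True
        if policy == ForwardPolicy.ForwardSome:
            return times_seen < threshold
        return False  # ForwardOne: handle duplicates only once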
Sometimes it is more useful to consider quality rather than quantity. Peer scoring is a powerful technique that can greatly improve both coverage and latency during sample dissemination. Peer scoring works best when applied together with redundancy, so that messages are replicated to both more and better peers.
In dv5das, Proto proposes a set of behaviors that either positively or negatively impact a peer's score. In this work we analyze three more metrics, which emerge from the following considerations:
There are, however, some difficulties with translating these metrics into scores, which we discuss in the rest of this section:
Routing tree resolution requires variable time. Assuming that a remote peer won't respond until all subsequent nodes in the tree finish their lookups, there is no point in measuring request handling time as a latency metric. However, we can require nodes to respond with a handle-id immediately and acknowledge the finished broadcast as a separate request referencing the previously generated handle-id. Here, the handle-id is a one-time identifier that associates a response sent separately with a given request. This way, one can measure the time it took a remote peer to respond initially with a handle-id and apply it to their score.
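A sketch of the resulting latency metric, assuming a hypothetical send_and_await_handle_id helper that returns as soon as the immediate handle-id response arrives:

    import time

    def record_response_latency(self, peer, request):
        # Measure only the time until the immediate handle-id response,
        # not the time the peer spends resolving its routing subtree.
        started = time.monotonic()
        handle_id = self.send_and_await_handle_id(peer, request)
        rtt = time.monotonic() - started
        # An exponential moving average keeps the score stable across single outliers.
        previous = self.latency_scores.get(peer, rtt)
        self.latency_scores[peer] = 0.8 * previous + 0.2 * rtt
        return handle_id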
Ethereum nodes do not expose their routing table sizes, as there is no reliable and efficient way to prove them. However, one can make an approximate estimate of another node's network view by just looking at its NODES responses. If a remote peer consistently replies with close to 16 nodes, then it is likely to have a larger view than peers that are more restrained in their responses. Still, relying on this logic may prove misleading, as nodes are less likely to have many closer nodes when the lookup destination itself is close.
pseudocode:
def estimate_table_size(self, remote_peer, lookup_key, closer_nodes):
    previous_score = self.routing_table_sizes[remote_peer]
    distance = remote_peer.xor_distance(lookup_key)
    # COEFF is an empirically determined constant balancing the lookup distance
    # against the number of received closer nodes.
    new_score = (previous_score + (distance * len(closer_nodes) * COEFF)) / 2
    self.routing_table_sizes[remote_peer] = new_score
    return new_score
Accounting for peer role could be the easiest and perhaps most effective metric. However, the node ids of builders aren't publicly known: exposing them would violate validator privacy and open a possible DoS vector. Today, builders are only identified by their special builder pub-key, which relays collect when receiving a new block.
Routing defines how a message is propagated in the overlay network, which nodes are responsible for its destination, and how it can be found again later. It is crucial for routing to remain deterministic; otherwise, disseminated data won't be retrievable from the network by anyone other than those who put it there.
In Kademlia, routing is done by comparing distances between the lookup key and the known node ids. It is an iterative process: each request either yields more discovered nodes to send new requests to, or finds the node/value for the lookup key, finishing the routing session. For sample dissemination, the iteration handler is extended with bundling, peer selection, and storage writes (in cases when no closer nodes were found for the processed samples), but the underlying scheme generally remains the same.
Literature distinguishes two alternative routing modes: iterative and recursive.
Concrete implications of using one mode or another will be more relevant in the context of random sampling. For now, assume that iterative routing provides the initiator node with more information and influence over the routing path, while recursive mode is generally faster and results in less overhead, most notably in stable networks.
A natural way to route data dissemination is the recursive one. Existing data dissemination schemes [I, II, III] are all based on recursive routing, as it appears to be easier to reason about and results in a much cleaner implementation. One only needs to figure out a single algorithm intended for all nodes in the DAS overlay network. Recursive routing makes no distinction between initiator and forwarder nodes, as nodes on each hop act as the root of the broadcast tree and perform the same set of operations. It is also more hardware-friendly, as it assumes no unique bandwidth requirements for the initiator or intermediary nodes.
The initiator starts by bundling samples and picking destination node-ids according to the chosen bundling strategy. Each bundle is assigned a bundle-id, needed to avoid double-handling when redundancy settings are applied. The initiator then sends parallel and asynchronous RPCs to push bundles along with their ids to their intended nodes. If the destination is the initiator node itself, the samples are written to local storage.
Specific wire protocols will be discussed in later sections. For now, we only require the protocol to be asynchronous and to support acknowledgements for successfully handled requests, so that the initiator can send the next bundles at a pace limited by the concurrency parameter. In general, for recursive routing we recommend an unbounded concurrency parameter, but additional considerations about the wire protocol or implementation specifics may apply.
A peer receiving an RPC request will parse it into a bundle-id and a list of samples. If a request with the same bundle-id was already processed and the redundancy settings require no additional forwarding, then it can be skipped. Otherwise, the receiver node works as follows:
it re-bundles the received samples by their sample-ids using the same bundling strategy and forwards each new bundle to the closer nodes it knows; if no closer nodes can be found in the receiver's routing table, the samples are written to local storage.
pseudocode:
def disseminate_samples_recursively(self, samples, args):
    allocation = self.bundle_samples(samples, args.bundling_strategy)
    for (node_id, samples) in allocation:
        if node_id == self.local_node_id:
            # No closer node is known for these samples, keep them locally.
            self.store_samples(samples)
            continue
        bundle_id = generate_id()
        msg = (bundle_id, samples)
        self.send_to(node_id, msg)

def handle_dissemination_request(self, from_node, msg, args):
    (bundle_id, samples) = msg
    if self.handled_messages[bundle_id] and args.forward_policy != FP.ForwardAll:
        return
    self.handled_messages[bundle_id] = True
    self.disseminate_samples_recursively(samples, args)
With iterative routing, the initiator has more control over the route, allowing them to react to problems faster, e.g. switch to alternative branches in case of remote node failure. As an obvious trade-off, it requires greater resources from the initiator (e.g. bandwidth), but this isn't necessarily a problem. On the contrary, since the initiator will always be a builder, this may lead to a more optimal utilization of available resources.
The initiator starts the same way as with recursive routing, but instead of sending bundles with full samples, she only sends their ids. She then waits for remote peers to respond. When all responses associated with a given bundle-id are received, the initiator re-bundles samples for any closer nodes reported back and repeats the process; once a peer reports no closer nodes, the full samples are sent to that peer for storage.
A peer receiving an RPC request starts the same way as with recursive routing, but instead of sending re-bundled samples to closer nodes it knows, it sends their ENRs back to the initiator. Note that, in this design, it is up to the initiator to decide which samples to send to a remote peer, based on its own assessment of that peer's local view. In an alternative design, a remote peer responds with NODES(Vec<Enr>) if it knows closer nodes to a given sample-id, and with STORE(Vec<SampleId>) if not. This solution is simpler, but incurs more communication overhead than the former one.
pseudocode:
def disseminate_samples_iteratively(self, samples, args):
    allocation = bundle_samples(samples, self.nodes, args.bundling_strategy)
    while True:
        new_allocation = {}
        for (node_id, samples) in allocation:
            if node_id == self.local_node_id:
                self.store_samples(samples)
                continue
            bundle_id = generate_id()
            # First send only the sample ids and ask for closer nodes.
            msg = (bundle_id, "Keys", samples.to_ids())
            closer_nodes = self.send_to(node_id, msg)
            if len(closer_nodes) > 0:
                # Re-bundle the samples against the newly discovered, closer nodes.
                new_allocation.update(bundle_samples(samples, closer_nodes, args.bundling_strategy))
            else:
                # No closer nodes are known; push the full samples for storage.
                msg = (bundle_id, "Samples", samples)
                self.send_to(node_id, msg)
        if len(new_allocation) == 0:
            return
        allocation = new_allocation

def handle_dissemination_request(self, from_node, msg, args):
    (bundle_id, msg_type, samples) = msg
    if msg_type == "Keys":
        closer_nodes = set()
        for sample_id in samples:
            closer_nodes.update(self.close_nodes_to(sample_id))
        self.respond_to(closer_nodes, from_node)
    else:
        self.store_samples(samples)
The destructive implications of high churn, along with one mitigation strategy (redundancy), were covered in Section 4.4. Here, we take a look at some other known techniques for dealing with churn in structured networks.
An anti-entropy protocol based on gossip/flooding operating on top of a DHT-based structured overlay can improve coverage and consistency. This family of protocols lets nodes gossip among each other to find out whether they are missing any data, which can then be transmitted. Combined with a structured broadcast algorithm that attempts to disseminate data to all nodes while guaranteeing to never send redundant messages, an anti-entropy protocol can help ensure that all nodes, with high probability and at low cost, get all the messages.
The major concern is that gossip protocols have probabilistic guarantees, i.e. the more times the gossip is repeated, the more reliable the broadcast gets. Considering our strict time constraints, the anti-entropy might never reach enough repetitions to be useful. According to [6: p. 11], parallel flooding produces a rather significant increase in the number of sent messages while resulting in only minor coverage improvements. In the same work, the authors recommend using more specialized and deterministic methods, which we discuss next.
In order to fight unexpected failures, one can enhance the protocols with a simple acknowledgment mechanism: sending of an ACK as soon as a node has correctly processed and forwarded a received message. If the ACK is not received, the sender will resubmit to a different contact in the bucket. Once a node has acknowledged the received bundle, it's her responsibility to wait for ACKs from all subsequent nodes in the broadcast tree and to handle failures accordingly.
If only ACKs and resubmissions are applied, nodes should only receive a message once, since no parallel paths are used. However, this is most effective when both redundancy and ACKs are applied together. In [6: p. 11], the authors experimentally showed that R1/FA + ACKs achieves excellent coverage even under extreme churn and failure rates. As a trade-off, each resubmission adds a slight increase in latency, though this is amortized when redundant paths are used prior to resubmissions. Overall, a 5-10% latency spike seems well worth the gains in coverage, especially considering the long-term need for historical sampling.
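A sketch of the acknowledgment-and-resubmission logic, assuming a hypothetical send_bundle call that raises TimeoutError when no ACK arrives in time:

    def send_with_resubmission(self, bucket_contacts, bundle, max_attempts=3):
        # Try contacts from the same bucket until one of them acknowledges the bundle.
        for contact in bucket_contacts[:max_attempts]:
            try:
                self.send_bundle(contact, bundle, await_ack=True)
                return contact  # the acknowledging node now owns failure handling downstream
            except TimeoutError:
                continue  # resubmit to a different contact in the bucket
        return None  # no contact acknowledged; the branch is reported as failed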
By default, regions (buckets) without contacts are just ignored in broadcasts. In stable conditions, this is safe, but in the presence of churn, routing data can be momentarily inaccurate and new nodes may be unknown. The ERR lets empty regions be handled by peers in close regions, in the hope that they may know about nodes the current handler still ignores. The use of such an algorithm does not increase latency or the number of messages, but under severe churn or failure conditions, it offers limited coverage gain.
Establishing the ability to selectively query samples from the network is essential, as it allows clients to probabilistically check data availability by downloading only a small random portion rather than the whole block. For a network with 10k validators actively involved in DAS, the availability of a 32 MB block can be verified when each validator samples ~75 randomly selected chunks.
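The confidence behind such a sample count is easy to check: if less than half of the extended data was actually published, each uniformly random sample is available with probability below 1/2, so a single client passing all of its queries is astronomically unlikely (illustrative calculation):

    samples_per_client = 75
    # Probability that a client is fooled, i.e. all its random samples happen to be
    # among the published ones when less than 50% of the data is available.
    false_positive = 0.5 ** samples_per_client
    print(false_positive)  # ~2.6e-23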
Following are properties that sampling functionality must satisfy:
For identifiability (1), we consider two approaches. The obvious one is to randomly pick a number of sample-ids and use them to query chunks from the network; we refer to this as structured sampling. The second option assumes a uniform distribution of chunks in the network, allowing the network to be sampled in an unstructured way without deciding on specific ids beforehand.
For addressability (2), we consider data retrieval in a pull (DHT lookups, req/resp) or push-like manner (topic-sub). Due to the need for samples to be disseminated to their corresponding hosts beforehand, the pull model implies higher latency. Nonetheless, we believe this is the only right way to meet all five properties, including historical retrievability (4), since the unstructured/push option is characterized by ephemeral access to data, compared to the persistence associated with structured/pull sampling.
In the following sections, we discuss both of these approaches in greater detail. However, only the structured/pull one will be considered for evaluation and simulation tests.
Validators can subscribe to the topics which correspond to the samples they wish to receive. Latency reduction is possible because validators don't wait until samples are disseminated, but can receive their assigned data while the dissemination is taking place.
Topic subscription could fit row/column sampling, as validators remain interested in the same chunks during the whole epoch and the number of topics is bounded by the number of rows and columns (1024, though still high). The same isn't true for private random sampling. Even if topics are allocated per 1000 samples, thus requiring around 200 topics in total, individual clients still only need ~75 out of 1000. This would produce a lot of wasted bandwidth and is likely to still suffer from churn issues.
Perhaps less intuitively, this approach has another fault that both validators and private clients could suffer from. A rogue builder trying to make a block unavailable could potentially push (to the majority of topics) just enough samples to convince a majority of validators of availability, but not enough to reconstruct the full block. Even if the builder is unaware of the subscribers' identities, it still has more control over sample disclosure. And since there is a finite number of groups (subnets), the builder can perform a withholding attack on a minority of peers.
The primary observation behind this approach comes from the fact that it is possible to randomly sample entries of a set that is uniformly distributed across the P2P network by simply asking some number of peers about the entries they store. However, there are a number of challenges with this method. First of all, the client may not (and probably won't) have enough contacts to ask for the needed number of samples. This is especially true for light clients, which usually maintain one or two connections to full nodes in the network. The solution to this problem is a random walk. A simple random walk is a sequence of nodes visited where, at each step, the next destination node is selected with a fixed probability.
Collecting uniform random data samples is also challenging due to the irregular nature of the P2P network: varying degrees of connectivity between peers and widely varying amounts of shared data. In [7: p. 3-6], the authors use an algorithm based on Metropolis-Hastings: actively tweaking the length of the random walk (known as the mixing time of the Markov chain) allows converging to a stationary uniform distribution. For this, the authors also instantiate a virtual network, where internal (close) links are used for communication, while external (long-range) links are used to perform the random walk. Further research on this method is needed, but it is clear that implementing it in the existing network would be quite challenging.
In DHT networks, there is an opportunity to develop specialized random walk algorithms that take advantage of the distinctive properties of the network graphs. The simplest method is to address queries to random node ids until, by chance, an occupied one is found. Using this method, a node is found with a probability proportional to the portion of the id space it covers, which is not the same for every node. It is possible to account for this unevenness when accepting samples from a "randomly" found node: accept with a probability inversely proportional to the size of the id-space region that node covers, and retry otherwise.
Overall, random walks can be an interesting option for private random sampling, but they lack the precision needed for row/column sampling and reconstruction. It would also be more challenging to formally prove the security and effectiveness of this technique.
In Kademlia, key lookups involve contacting a logarithmic number of nodes until a node responsible for the key is found. It is a simple, reliable, and time-proven method for searching data in P2P networks. To assess the suitability of KAD lookups for random sampling, we identify the following three requirements:
Our findings show that Kademlia is in fact a promising candidate for carrying out a large load of random sample queries:
Now we define how a sampling client works. A sampling client starts by randomly picking sample ids = [H(fork_digest, randao_mix, data_id, i) for i in sample_indexes]. It then takes the first sample-ids and works as follows: for each sample-id, in parallel and asynchronously, the client searches its routing table for the nodes closest to that sample-id and performs a lookup; if a lookup fails, the client can take different sample-ids and retry; once enough samples are found, the client can consider the block available. In this case, the client also re-broadcasts all found samples to the rest of the network.
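A sketch of the id derivation above, assuming SHA-256 as H and illustrative byte encodings of the inputs:

    import hashlib

    def sample_id(fork_digest: bytes, randao_mix: bytes, data_id: bytes, index: int) -> int:
        # H(fork_digest, randao_mix, data_id, i) interpreted as a 256-bit key in the DHT id space.
        digest = hashlib.sha256(fork_digest + randao_mix + data_id + index.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "big")

    sample_indexes = range(75)  # ~75 randomly chosen indexes in practice
    ids = [sample_id(b"\x00" * 4, b"\x11" * 32, b"\x22" * 32, i) for i in sample_indexes]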
In the rest of this section, we take a look at optimizations to further improve latency, coverage, and lookup consistency.
Random sampling is a latency-sensitive application. Kademlia lookup latency, on the other hand, is only barely good enough. Following are some optimization strategies that can be incorporated into the overlay DHT to consistently improve lookup time.
Lookup latency is proportional to the number of hops it takes to reach the destination node. It is possible to reduce this number by increasing the total number of buckets, which is one of the modifications in the Discv5 protocol. However, an extensive study in [11: p. 7] reveals that it is just as effective to increase the size of individual buckets, which is also simpler to implement, results in lower maintenance cost, and improves resistance to churn as a side effect.
Another particularly interesting insight was mentioned in [9: p. 5]: given the structure of a Kademlia routing table, on average, the first bucket is used in half of the lookups, the second bucket in a quarter of the lookups, and so forth. Hence it is only logical to enlarge buckets proportionally to the probability of them being used in a given lookup, e.g. the first buckets hold 128, 64, 32, and 16 nodes respectively, while the rest of the bucket sizes remain at 8 nodes. In the same work, the authors measure the effectiveness of this optimization and conclude that only 0.1% of modified lookups take over a second, compared to over 27% in the Mainline DHT.
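In code, such a bucket-size schedule is trivial to express (sizes taken from the example above; the indexing convention is illustrative):

    def bucket_capacity(bucket_number):
        # The first (most frequently used) buckets are enlarged; the rest keep the default size.
        sizes = {0: 128, 1: 64, 2: 32, 3: 16}
        return sizes.get(bucket_number, 8)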
To achieve lower latencies, the overlay nodes must be aware of the underlying network topology. Without any such knowledge, a single overlay hop has an average latency equal to the average round-trip time (RTT) of the underlying network. In [12: p. 2-3] the authors describe three approaches to reduce routing latency by exploiting knowledge of the physical network topology, including proximity-aware neighbor selection and topology-aware node-id distribution.
Obviously, the above methods assume some degree of awareness of the nodes' physical distribution, which may not be known in advance. Also note that such knowledge may open certain DoS attack vectors that wouldn't otherwise exist. Nonetheless, it might be quite advantageous to consider the latency of past responses in the routing algorithm. In Section 4.4.3 we proposed scoring peers based on RTT. The impact of low-RTT bias was studied in [9: p. 4]. The authors reveal that Kademlia nodes with a proximity routing modification are 90% more likely to perform routing under 500 ms.
Kademlia's original paper only specifies iterative lookups, as described in Section 2.3. The definition of recursive routing is given in Section 4.5. The primary benefit of doing lookups recursively is to reduce latency, since on each hop the node forwards the message directly, without communicating with the lookup initiator first. Additionally, this greatly reduces time spent traversing NAT, which we discuss in Section 7.2.2.
In iterative routing, data collected during routing helps maintain and fill up Kademlia's k-buckets. Hence, in recursive mode there must be a mechanism to notify the initiator of the nodes found on the path. More importantly, the initiator also needs to receive the value corresponding to the key it started the lookup for.
For this, the authors of R/Kademlia [13: p. 104-105] present two signaling modes to pass data back to the lookup initiator: direct signaling, where each hop reports back to the initiator directly, and source-routing signaling, where the response travels back along the accumulated path.
Acknowledgments are essential for recursive routing too. With ACKs the occurrence of failed nodes on the path does not result in a rollback of the whole routing procedure, but only in forwarding the message to an alternative node.
The experimental results in [13: p. 106] reveal an almost two-times latency reduction compared to scenarios where iterative routing is used.
We have already described a number of techniques and optimizations to achieve good coverage during the sample dissemination phase. We believe that clients doing random sampling can assist with this goal, even after the main dissemination work has taken place. This is particularly useful for historical sampling, but also helps prevent the destructive effects of high churn. Methods in this section assume that the network is configured to accept a certain level of redundancy when it comes to storing samples, which it certainly should (again, because of churn).
During the course of the sample lookup process, a client may encounter nodes along the search path which do not have the sample but for which the sample-id does fall within the replication radius. These are nodes that should be interested in storing this sample. If the client successfully retrieves the sample from another node, it should send an Offer message for that sample to those interested nodes. This mechanism is designed and implemented by the Trin team to help continually improve coverage of the content stored in the Portal Network. We believe that it would be useful for the purposes of DAS as well.
The aim of DAS is to prove that data was made available at some point in time, and for long enough that all interested actors could download it. Generally, any attempt to prevent this from happening should be considered adversarial behavior. We assume that DAS components are deployed properly, i.e. there are enough sampling nodes in the network, clients query the network for a sufficient number of randomly chosen sample ids, KZG proofs are sound, and all actors are sufficiently incentivized. We thereby make the consequent assumption that any attack that results in explicitly unavailable data (e.g. failed dissemination, a large-scale eclipse attack, etc.) will only lead to the block being discarded due to a failed availability check.
Hence, the real concern is attacks that allow one to trick the majority of sampling validators about data availability while selectively withholding that data. We must also seek ways to prevent attacks that result in continuous failure of availability verification, thereby halting consensus and block production.
In the original DAS paper [14: p. 15], the "selective share disclosure" attack is defined by the ability of adversaries to link sample requests to individual sampling clients, which allows them to selectively disclose just enough shares to convince a majority of clients of availability when in reality not enough data was released to reconstruct the whole block. To prevent this attack, each sample request must be sent over an anonymous channel and received in random order, such that it would be impossible to correlate it back to its sender.
Achieving this in practice, however, is far from trivial. Firstly, such a channel would require adding a random delay to each request, thus increasing the overall latency. Secondly, while anonymizing the inbound requests is generally figured out, it is usually much harder to route responses back to the requestor without compromising on either privacy or latency even more. Finally, the sheer load of sampling requests is significant enough to cripple any privacy layer that doesn't scale well.
The use of Chaumian mixnets was mentioned in the DAS paper. Unfortunately, the problem with their underlying fixed-cascade topology is that it scales pretty badly. Additionally, their mixing latency is unbounded, which is particularly bad considering our strict time constraints. An alternative design, known as stratified topology, is shown [15: p. 13] to have superior scalability and is currently led by the Nym mixnet. In fact, Nym latency is claimed to reduce with the number of users, as less per-node mixing delay is required: for 10k clients the expected latency is around 150ms. Receiving responses, however, is still challenging: Nym provides the SURBs (Single Use Reply Blocks) mechanism, but it imposes a payload limit (2 KB) and causes additional latency. Even more problematic is conducting Kademlia lookups behind the mixnet: if all requests are sent anonymously, then for a network of 10k validators the average lookup latency grows with every hop, as each one pays the full mixing delay.
It is useful to break the anonymity problem with Kademlia lookups in two: 1) private routing and 2) private data retrieval. The goal of (1) is to find the nodes in the network responsible for storing a sample without revealing interest in the specific sample-id to all neighbors. For this, we propose a modified lookup procedure that searches by a prefix of the sample-id rather than the full sample-id, which only the requesting client knows. The client can then select only those nodes whose ids fall within the shortest XOR-proximity to the actual sample-id and proceed to the next step.
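A sketch of this candidate-selection step, assuming a hypothetical lookup_by_prefix primitive that returns the ENRs of nodes whose ids fall into the queried prefix region:

    def find_candidate_holders(self, sample_id: int, prefix_bits: int = 16, keep: int = 3):
        # Query the network only by a short prefix of the sample-id, so neighbors cannot
        # tell which exact sample the client is interested in.
        prefix = sample_id >> (256 - prefix_bits)
        candidates = self.lookup_by_prefix(prefix, prefix_bits)
        # Locally narrow the results down to the nodes closest to the full sample-id.
        candidates.sort(key=lambda node: node.node_id ^ sample_id)
        return candidates[:keep]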
The aim of (2) is to download the sample data from the node storing it (or to gain confidence that it really does) without revealing the identity of the interested client. As the previous step resulted in multiple found nodes, they all need to be asked. However, instead of asking them openly or through a slow mixnet, we propose the following clarification step: a client asks each node what samples they HAVE, to which a node could respond with all the sample-ids it knows. Of course, due to the large bandwidth overhead, that would be impractical. Instead, two cryptographic techniques are at our disposal: private set intersection (PSI) and Bloom filters.
PSI allows checking whether an element or key is known by both parties, without revealing any additional information. In practice, this is usually done by running a multi-round protocol between the parties that involves ECDH (Elliptic Curve Diffie-Hellman), binding, and checking steps. A Bloom filter (or Cuckoo filter, or Golomb-compressed set) is a space-efficient probabilistic data structure that a node can send to a client to allow checking, with a small false-positive probability, whether a sample is known by that node. Often both of these primitives are used together to achieve the accuracy and efficiency they lack in isolation. Finally, once a node that is likely to store a sample is identified, a client can issue a download request secured by the mixnet or a PSI-compatible data transfer (needs further research).
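A minimal Bloom filter sketch for such a HAVE exchange; the sizes and hash construction are illustrative, not a vetted parameterization:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=8192, hashes=3):
            self.size = size_bits
            self.hashes = hashes
            self.bits = 0

        def _positions(self, item: bytes):
            for i in range(self.hashes):
                digest = hashlib.sha256(bytes([i]) + item).digest()
                yield int.from_bytes(digest, "big") % self.size

        def add(self, item: bytes):
            for pos in self._positions(item):
                self.bits |= 1 << pos

        def may_contain(self, item: bytes) -> bool:
            # False positives are possible, false negatives are not.
            return all(self.bits >> pos & 1 for pos in self._positions(item))

    # A node advertises its stored sample-ids; a client checks locally which of its
    # randomly chosen ids the node is likely to hold, without naming them to the node.
    node_filter = BloomFilter()
    node_filter.add(b"sample-id-1")
    assert node_filter.may_contain(b"sample-id-1")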
Currently, two independent teams are working on prototyping such modifications for IPFS in hopes of improving privacy for content consumers. ChainSafe's privacy engineering team has successfully prototyped and tested prefix lookups for the Libp2p-based KAD system. In parallel, another line of prototyping work is being done to extend the Bitswap protocol with PSI and bloom filters. Overall, this is quite an interesting design space and we look forward to expanding on it in future research.
A failed availability check doesn't pose an existential threat as long as it is possible to retry in the next slot. What's worrying, however, is an attacker who is able to continuously make data unavailable, thus halting the chain and all L2s with it. Attacks such as Sybil, Eclipse, and DoS can potentially be performed to cripple the DAS network. Unfortunately, DHTs, including Kademlia, are known to be especially vulnerable to these attacks. It is possible to limit the attacker's ability to carry out these attacks by preferring long-lived contacts, implementing a more sophisticated identity system, and other well-researched techniques. In this work, we focus on one technique that seems most effective in the context of DAS.
A structured overlay can employ not one but multiple routing tables, each containing different sets of peers. In S/Kademlia [16: p. 3-5], key lookups are performed first using the primary table (which involves all nodes), and if this fails, using the secondary table, which contains only the nodes who have solved PoW puzzles to mine hardened node-ids. In a recent adaptation by Dankrad, PoW puzzles were replaced with "proof of validator" and the idea of multiple routing tables was applied to secure DAS. The result is a hardened DHT that can theoretically withstand large-scale Sybil and DoS attacks. This promising design space is currently being explored by EchoAlice.
In this section, we go over some practical considerations for implementing DAS components and the specific algorithms we have introduced so far.
There are a number of considerations when it comes to transport layer protocols: reliability, communication overhead, packet size limits, and open connection limits. As Discv5 is the primary tool for achieving the goals of our work, we will focus on the UDP (User Datagram Protocol) transport. Reasons why UDP is said to be a mandatory wire protocol for Discv5 can be found here.
A defining design choice of UDP is the absence of acknowledgements for packets. Thanks to this, UDP's speed is superior to TCP's, but it is also a big reliability trade-off that should be considered. To minimize the likelihood of packet loss, application-layer protocols usually enforce a packet size limit; the Discv5 spec sets this to 1280 bytes. Now, since each sample is 512 bytes and nodes would potentially need to send them by the thousands, this immediately becomes a major issue. Following are some approaches to work around this limitation.
The first approach can be found inside Discv5 itself. Namely, the NODES(request-id, total, [ENR, ...]) response supports sending up to 16 ENRs, but since each ENR can be up to 300 bytes, the remote node sends multiple such responses, each carrying at most 3 ENRs. The same trick can be applied when sending samples, i.e. up to 2 samples can be sent in one UDP packet.
This method is very efficient, as each message consists of useful bytes only, nothing more, nothing less. Unless, of course, a packet is lost, in which case those samples may never be disseminated and data availability will be at risk. Even a 0.01% chance of packet loss is considerable enough to warrant some sort of automatic retransmission mechanism. Implementing it at the application layer is tricky and error-prone, so adopting some existing solution is generally safer and easier.
Discv5 is designed to be minimal and concise, but it can be easily extended using the TALKREQ and TALKRESP methods, which can power an unlimited number of sub-protocols. To avoid conflicts, one only needs to specify the protocol argument in the request body. It is therefore possible to support any higher-level UDP-based protocol on top of Discv5's generic RPCs. As it happens, one such protocol is already implemented by the Portal Network team in their Rust client called Trin.
μTP (Micro Transport Protocol) is a simple streaming protocol originally designed to provide reliable and ordered delivery for BitTorrent over UDP, while avoiding the poor latency and congestion control problems found in TCP-based clients. The current version of the Portal Network follows the original BEP29 spec and the uTP reference implementation, but instead of a raw UDP packet, the TALKREQ message is used as transport.
In Trin, uTP streaming is always used in a response (e.g. ACCEPT(conn_id) for OFFER(content_id)). In our case, streaming is instead applied as a request call (e.g. SAMPLES(conn_id)). Note that a Discv5 implementation may queue sending the next TALKREQ until a TALKRESP for the previous request to the same node is received. This would block the uTP handshake when used as a request. To avoid this, we propose that nodes receiving a uTP connection-id respond with PROMISE(promise_id) immediately and use it to send the request results later as a separate TALKREQ. See the diagram below for an example.
As seen, Discv5 implementations come with their own design restrictions, which may cause inconvenience or even degrade performance. Instead, it could be better to use raw uTP or perhaps some other protocol that already has a performant library. The only concern is that time spent negotiating another transport over a new connection may end up offsetting the benefits of that transport.
The Discv5 specification describes TALKREQ as a means of pre-negotiating connections for new application-specific protocols. In the words of Felix Lange, the intended use of this method is "agreeing to talk", which is also the reason for its name. Overall, there are two things to exchange before streaming over a new protocol can proceed: a new port and an encryption key.
Key exchange is the more expensive of the two, so it is worth reusing existing keys. In theory, we can skip negotiating a new port as well: since UDP is stateless, nothing prevents multiple processes from receiving packets on the same port. In practice, this requires extending each packet with a special header, which would be used to dispatch messages to their intended handlers: some to Discv5, others to whatever secondary wire protocol is chosen. This isn't a significant overhead, but it certainly is more than just sending a 16-bit number (the port) once.
The most notable advantage of having a secondary wire protocol decoupled from Discv5 is the ease and flexibility of integrating such a protocol. For instance, one can plug in any existing and battle-tested implementation of uTP: libutp, meqif/rust-utp, anacrolix/go-libutp. There are other UDP-based protocols worth considering.
KCP is a reliable ARQ (Automatic Repeat reQuest) protocol that trades 10-20% bandwidth overhead for a 30-40% latency reduction (compared to TCP). Among other optimizations, KCP retransmits packets more quickly by anticipating packet loss instead of waiting for retransmission timeouts. It natively supports stream encryption and forward error correction. KCP is often used on low-resource devices like routers and in applications related to online gaming and audio/video transmission. Available implementations are skywind3000/kcp, xtaci/kcp-go, en/kcp-rs + libp2p stack integration paralin/go-libp2p-kcp. Learn more about KCP here.
QUIC is a recently standardized transport protocol that stacks multiplexing and retransmission control on top of UDP. All data streams in QUIC are independent and processed concurrently; each manages its own flow control and retransmission of lost data. QUIC connections can be terminated on demand or automatically after an idle timeout. It natively handles stream encryption, but forward error correction is still in draft. Available implementations are cloudflare/quiche, lucas-clemente/quic-go, quinn-rs/quinn + official integration with the libp2p stack.
According to this simulation, KCP tends to have better latency than QUIC, particularly with high packet loss rates.
Most p2p routing and search algorithms, such as flooding, random walk, and DHT lookups, assume that peers are equally accessible to each other. Literature [17: p. 378], however, shows that more than 80% of computers are behind routers or firewalls, thus requiring NAT (Network Address Translation) traversal to be reached. As a result, messages follow a longer path: the sender cannot initiate a direct connection to a remote peer behind NAT right away, so a technique known as "hole punching" must be applied first. This becomes particularly problematic with Kademlia's iterative lookups, where the initiator needs to contact many previously unknown peers, leading to high NAT traversal overhead.
In UDP, no explicit hole punching is required if the NAT setup is capable of full-cone translation, i.e. the first packet sent to a remote node establishes a port mapping that allows subsequent packets to reach the node behind NAT. Nonetheless, it's worth considering application-layer techniques capable of amortizing NAT overhead regardless of the address/port translation type used.
With iterative routing, a direct connection can be established with the help of neighbor nodes. When a node queries one of its neighbors for certain data and retrieves a list of candidate nodes, this neighbor can help the initiator and the candidates establish a direct connection, since the neighbor is already connected to both. This assumes that peers maintain long-lived connections with the neighbors from their routing table. In practice, this requires nodes behind NAT to periodically send UDP keep-alive messages to all their neighbor nodes; nodes in Discv5 already do this as part of the routing table maintenance routine. The authors of [17: p. 6] show that using this technique in what they call the "strongly-connected model" leads to a 60% reduction in the number of hops and generally outperforms the existing model in delivery rate and scalability.
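As a rough illustration, the keep-alive routine can be as simple as the following sketch; the 15-second period and the tiny payload are illustrative assumptions, and in Discv5 this role is effectively played by the regular table-maintenance PINGs:

```rust
use std::net::SocketAddr;
use std::time::Duration;
use tokio::net::UdpSocket;

// A minimal sketch of a keep-alive routine a NAT-ed node could run to keep
// port mappings towards its routing-table neighbors fresh. Period and payload
// are illustrative assumptions.
async fn keep_alive(socket: &UdpSocket, neighbors: &[SocketAddr]) -> std::io::Result<()> {
    let mut interval = tokio::time::interval(Duration::from_secs(15));
    loop {
        interval.tick().await;
        for addr in neighbors {
            // Any small datagram is enough to refresh the NAT translation entry.
            socket.send_to(b"ping", *addr).await?;
        }
    }
}
```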
Recursive routing can greatly reduce the need for establishing new connections, as a lookup message travels between neighboring peers, which by assumption already maintain existing connections. However, since the lookup originator still needs to learn about the nodes found along the path, those nodes, including the destination, must report the contacted candidates back to the originator. The problem is that the originator and most nodes on the routing path are not peers, so they are not directly connected, which leads to NAT overhead yet again. Luckily, the source-routing signaling described in Section 5.4.3 reduces NAT-ed communication to once per lookup (between the destination node and the initiator).
The Discovery V5 network consists not only of Ethereum nodes; it also serves other networks, forks, testnets, etc. Hence, we need a way to separate DAS nodes from everything else, i.e. an overlay. This should require no changes to the Discv5 protocol, but may extend it with new features, such as FIND_VALUE lookups, an alternative/bigger routing table, a modified maintenance routine, peer scoring, etc.
Topic advertisements are used by nodes to make themselves discoverable under a specific topic. Internally, this is implemented using a data structure known as a "topic table", which holds multiple queues (FIFO lists), each containing nodes that have advertised themselves under a specific topic. Ethereum nodes involved in DAS can advertise themselves using a special topic. This way, dissemination requests and sample lookups can be restricted based on the current state of local topic tables. Assuming that new ads are throttled for a certain amount of time (15 min) before being put on air, it would be challenging to keep this data up to date, especially in the presence of high churn. This approach also gives no advantage in extending Discv5, as it only provides a way to separate nodes based on the services they provide.
Peers can maintain a secondary routing table forming an independent DHT exclusively for the purposes of DAS. The application-specific routing table can be initialized with records from Discv5's primary table. In case only a subset of Ethereum nodes is intended to assist with DAS, topic advertisement can be used to selectively pick relevant peers within the advertisement radius. Once initiated, the secondary routing table can grow independently from Discv5's primary one, hosting records that are not present in the latter.
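A minimal sketch of this bootstrapping step follows, with a simplified stand-in for ENR records and an assumed DAS-support flag; a real implementation would rely on a dedicated ENR key or topic advertisement instead:

```rust
// A minimal sketch of seeding the DAS overlay table from records already known
// to the primary Discv5 table. The `Enr` stand-in and the `supports_das` flag
// are simplified placeholders for real ENR records and a dedicated ENR key.
#[derive(Clone)]
struct Enr {
    node_id: [u8; 32],
    supports_das: bool,
}

struct OverlayTable {
    entries: Vec<Enr>,
}

impl OverlayTable {
    /// Bootstrap the overlay from the primary table, keeping only peers that
    /// advertise DAS support; the table can then grow independently.
    fn bootstrap_from_primary(primary: &[Enr]) -> Self {
        let entries = primary
            .iter()
            .filter(|enr| enr.supports_das)
            .cloned()
            .collect();
        OverlayTable { entries }
    }
}
```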
Modification-wise, the application-specific table can be configured to be larger, and the k-buckets table can be replaced with an alternative data structure, e.g. the binary trie used for prefix-based routing described in Section 4.3.3. Discv5's TALKREQ/TALKRESP messages are well suited to carry application-specific communication, though it's also possible to open a separate UDP connection or use other higher-level wire protocols for this purpose. The overhead of maintaining two routing tables is actually quite low, because Kademlia can do most updates based on the data extracted from DHT requests. Additional tasks specified by Discv5 are infrequent and are performed on a per-peer basis anyway. Any extra parameters (e.g. peer scoring) can be maintained as a separate routine.
Finally, one can extend the overlay functionality by introducing application-specific routing and lookup algorithms. An example of this is the FIND_CONTENT lookup specified by the Portal Network and implemented in Trin.
In order to evaluate and compare the different algorithms presented in this work, we developed a simulator[1]. At its center we placed a modular DAS node that behaves according to a simulation plan supplied as a set of CLI arguments, which include: 1) the core algorithm (dissemination/sampling) to run, 2) various strategies (routing, bundling, peer selection), 3) the wire protocol (Discv5-uTP/Libp2p-TCP), and other algorithm-specific parameters (e.g. parallelism).
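For illustration, here is a hypothetical sketch of how such a plan could be expressed with clap; the flag names and enum variants are illustrative and do not claim to match the prototype's actual CLI:

```rust
use clap::{Parser, ValueEnum};

// A hypothetical sketch of a simulation plan parsed from CLI arguments.
// Flag names and variants are illustrative, not the prototype's actual CLI.
#[derive(Parser)]
struct SimulationPlan {
    /// Core algorithm to run.
    #[arg(long, value_enum)]
    algorithm: Algorithm,
    /// Routing mode used to traverse the DHT.
    #[arg(long, value_enum)]
    routing: Routing,
    /// Bundling strategy used during dissemination.
    #[arg(long, value_enum)]
    bundling: Bundling,
    /// Wire protocol used between nodes.
    #[arg(long, value_enum)]
    wire: Wire,
    /// Number of in-flight requests per node.
    #[arg(long, default_value_t = 16)]
    parallelism: usize,
}

#[derive(Clone, ValueEnum)]
enum Algorithm { Dissemination, Sampling }

#[derive(Clone, ValueEnum)]
enum Routing { Iterative, Recursive }

#[derive(Clone, ValueEnum)]
enum Bundling { Distance, Buckets, Prefix }

#[derive(Clone, ValueEnum)]
enum Wire { Discv5Utp, Libp2pTcp }

fn main() {
    let plan = SimulationPlan::parse();
    // hand `plan` to the simulator...
    let _ = plan.parallelism;
}
```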
We developed a simple topology bootstrapping mechanism that generates a configured number of nodes (ENRs) and fills their routing tables according to the passed arguments. Supported topologies are: linear (for simple routing latency tests) and uniform (closer to a live network). We allow saving the generated topology along with the randomness seeds used elsewhere during the simulation (e.g. for picking a set of sample-ids during random sampling). Although we cannot guarantee full reproducibility the way something like testground does, it is quite convenient for comparing alternative algorithms on the same topology.
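A minimal sketch of the seeded generation idea: re-running with the same seed reproduces the same node ids, and hence the same topology, which is what makes side-by-side comparison of algorithms possible. The 32-byte node id and the uniform generation are illustrative simplifications.

```rust
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

// A minimal sketch of seeded topology generation: the same seed always yields
// the same node ids, so routing tables built from them are reproducible too.
fn generate_node_ids(num_nodes: usize, seed: u64) -> Vec<[u8; 32]> {
    let mut rng = StdRng::seed_from_u64(seed);
    (0..num_nodes)
        .map(|_| {
            let mut id = [0u8; 32];
            rng.fill(&mut id[..]);
            id
        })
        .collect()
}
```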
There are a few reasons why we decided to roll our own simulation software as opposed to using existing tools like testground or PeerSim. Firstly, instead of plain Kademlia we employ Discv5, which, while technically similar, still has some major design differences. Secondly, our solutions often require particular modifications to the original components of Kademlia, such as changes to the underlying routing protocol. Finally, we wanted to re-use some of the existing solutions built by other teams, such as ethereum/trin, from which we forked the discv5-overlay functionality. It is also because of the latter that we chose Rust as the primary programming language for this project.
The simulation results are prepared based on a modified Performance vs. Cost (PVC) framework [10: p. 2] to identify the best algorithms and parameter combinations with regard to routing latencies and bandwidth consumption. We measure the total number of messages sent over the wire as our cost metric (orange lines). We ignore storage cost (e.g. samples written to nodes' local stores) because communication is typically far more expensive than storage. For the performance metric we measure the total time it took to disseminate all samples into the network (blue lines). We then plot both of these metrics on vertical axes against the number of samples. We tried to keep the network size (number of nodes) consistent across all tests, but for certain configurations that wasn't possible due to technical difficulties that we haven't yet solved.
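In code, the two metrics boil down to a message counter and a wall-clock timer; the following is a minimal sketch, with `record_message_sent` assumed to be called from the wire layer:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

// A minimal sketch of the two PVC metrics: cost as the total number of
// messages sent over the wire, performance as the wall-clock time of a run.
static MESSAGES_SENT: AtomicU64 = AtomicU64::new(0);

/// Called by the wire layer whenever a message goes out.
fn record_message_sent() {
    MESSAGES_SENT.fetch_add(1, Ordering::Relaxed);
}

/// Runs one dissemination pass and returns (performance, cost).
fn run_and_measure<F: FnOnce()>(disseminate: F) -> (Duration, u64) {
    MESSAGES_SENT.store(0, Ordering::Relaxed);
    let start = Instant::now();
    disseminate();
    (start.elapsed(), MESSAGES_SENT.load(Ordering::Relaxed))
}
```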
The following figures demonstrate performance vs. cost simulation results for the different routing and bundling strategies (left graph). To demonstrate coverage efficiency and uniformity, we also plot the distribution of samples stored by each node in the network (right graph).
So far we have only seen the performance of a network communicating through uTP over Discv5 TALKREQ RPCs. This approach is efficient, but it doesn't scale very well as bandwidth (the number of samples) increases, which is most noticeable with iterative routing, where the originator performs most of the dissemination work. For comparison, we include the same measurements for a network running on a strikingly different transport: TCP as implemented in the Libp2p stack.
We evaluate random sampling using the same cost vs. performance framework. For performance, we measure the time it took 100 concurrent clients to query 75 random samples each (blue bars). The cost metric is measured as the number of messages sent over the wire (orange bars). We show the results of these metrics for each of the four dissemination configurations. Additionally, we demonstrate the sampling failure rate by comparing the number of successfully retrieved samples to the total number of queries (7500).
This document presents our research on the Data Availability problem space. This includes findings sourced from a large body of academic work and our discussions with industry experts. The primary focus of this work was on sample dissemination and random sampling, for which we lay down different approaches related to unstructured peer-to-peer networks and structured overlays, as well as some practical considerations regarding security, network faults, and transport protocols. We then propose and briefly specify multiple algorithms and optimization strategies. In doing so we carefully consider the underlying cryptography and information theory of DAS as well as Ethereum's networking model.
Our sample dissemination design includes 2 routing modes (iterative and recursive) used to traverse the DHT, 3 bundling strategies (by distance/buckets/prefix) for routing data chunked into a large number of pieces, and multiple peer selection settings that control redundancy and peer scoring. We conclude that bucket-wise bundling has a bigger optimization surface, but its design is more nuanced than the distance- or prefix-based strategies. The design space of peer selection strategies is even broader: the studied peer scoring models (based on latency, routing table size, and role) all come with inherent flaws that make it hard to justify their use. Redundancy also comes with a variety of configurations, of which we lean towards replicate-some/RS + forward-all/FA, but leave determining a specific redundancy multiplier for future research and experimentation.
For random sampling, we analyzed both unstructured approaches (topic subscription and random walk) and structured ones (DHT lookups) and conclude that only the latter satisfies all underlying requirements. We also include different techniques to improve lookup performance and coverage. Among them, the most promising in our view are bucket resizing, the POKE mechanism, and recursive routing.
Finally, we developed specialized simulation software and used it to test and benchmark our solutions. The results show that iterative dissemination is consistently slower than the recursive routing mode. This trend persists during random sampling as well, resulting in longer lookups and a significant failure rate. When it comes to bundling strategies, distance-wise bundling appears to be more efficient during dissemination but leads to slower sampling than bucket-wise bundling.
For the future, we identify the following directions worth further exploration: 1) proofs of custody and 2) sampling validator privacy. Proofs of custody were initially considered as a means to penalize "lazy validators" for signing data without actually downloading and verifying its correctness, but they can also be used to require validators to store their assigned chunks for a certain period of time, e.g. 1 month. State-of-the-art techniques are interactive (MPC) and non-interactive (SNARKs). Current efforts [I, II, III] target bandwidth and interactivity reduction for the existing schemes.
The sampling validator privacy problem was already discussed in this work. We have proposed and briefly analyzed two approaches, based on mixnets and on prefix-based lookups + bloom filters respectively. In the future, we plan to build on the second approach and test its practicality in the context of DAS.
We thank Protolambda (Optimism), Jacob Kaufmann, Ognyan Ogenev (EF/Portal Network), Cayman Nava (ChainSafe) and EchoAlice (EPF) for feedback and all the helpful discussions.