# Attestation aggregation and dissemination strategy (WIP)
### Preliminary scalability considerations
There will be many participant nodes, even if we consider only the part of the overall network active at a given slot, e.g. attesters, intermediate nodes (aggregators, transient nodes) and proposers (of next blocks). It's expected there will be around 300K attesters initially (10M ether). This means about 300 attesters per shard/committee. Given 16 committees per slot, that's around 5K nodes. In the future, the number of validators may grow, so if there are 1M validators, there will be 1K attesters per committee, i.e. around 15-16K nodes with at least one attester (assuming 100K nodes overall, it will be a rare situation to have more than one attester on a node).
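As a rough sanity check, these estimates follow from simple arithmetic. A minimal sketch, assuming 64 slots per epoch and 16 committees per slot (figures used in this note, not fixed protocol constants):

```python
# Assumptions: 64 slots per epoch, 16 committees per slot;
# every validator attests exactly once per epoch.
SLOTS_PER_EPOCH = 64
COMMITTEES_PER_SLOT = 16

def committee_size(total_validators: int) -> int:
    """Validators are spread evenly over all committees of an epoch."""
    return total_validators // (SLOTS_PER_EPOCH * COMMITTEES_PER_SLOT)

def active_attesters_per_slot(total_validators: int) -> int:
    """All committees of one slot taken together."""
    return committee_size(total_validators) * COMMITTEES_PER_SLOT

print(committee_size(300_000))             # 292, i.e. ~300 per committee
print(active_attesters_per_slot(300_000))  # 4672, i.e. ~5K per slot
print(committee_size(1_000_000))           # 976, i.e. ~1K per committee
```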
The same issue arises with shard subnets: it's expected that around 300 validators plus 200 standard nodes listen to a shard. There are several shards in a subnet, so a subnet is several times larger than a committee. There are up to 16 active subnets in a slot, so that is a lot of nodes too.
Given all of the above, one should be very careful when designing an aggregation/dissemination protocol. We'll look at this in more detail in the following sections.
**NB** The estimates of 300 validators per shard and 200 standard nodes per shard are based on [p2p/issues/1](https://github.com/ethresearch/p2p/issues/1) and [p2p/issues/6](https://github.com/ethresearch/p2p/issues/6#issue-383052296).
### Aggregation and result delivery are separate problems
The [network specification](https://github.com/ethereum/eth2.0-specs/blob/dev/specs/networking/p2p-interface.md#mainnet-3) states:
> Unaggregated attestations are sent to the subnet topic. Aggregated attestations are sent to the beacon_attestation topic.
This implies only a subset of aggregators should send their results to proposers (beacon_attestation). Otherwise it's too much traffic, and it would probably be simpler to send individual attestations directly (unless the number of aggregators is much smaller than the number of attesters).
This means the subset should be chosen somehow (e.g. by rolling a die).
The subset size is a scalability vs BFT tradeoff: too small a subset means less Byzantine fault tolerance (it's easier to block results propagation), while too big a subset means too many messages.
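One simple way to choose such a subset without coordination is probabilistic self-selection: each aggregator locally flips a coin biased so that about k of the m aggregators send on average. A minimal sketch (the parameter names are illustrative; in practice the coin would likely be derived from a verifiable random value such as RANDAO output rather than local randomness):

```python
import random

def should_send(k_target: int, committee_size: int, rng: random.Random) -> bool:
    """Each aggregator independently decides to forward its result with
    probability k/m, so ~k_target aggregators send on average."""
    return rng.random() < k_target / committee_size

def p_none_send(k_target: int, committee_size: int) -> float:
    """Probability that no aggregator at all sends (propagation blocked
    by bad luck alone): (1 - k/m)^m, roughly e^(-k)."""
    return (1 - k_target / committee_size) ** committee_size

print(p_none_send(16, 300))  # ~e^-16, negligible
```

This shows the tradeoff quantitatively: the chance that nobody sends falls exponentially in the target subset size, so a modest k already makes accidental silence negligible, while keeping traffic far below "everyone sends".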
This is regardless of which aggregation protocol is employed. E.g. if Handel is employed to aggregate attestations, the aggregating nodes still have to decide who sends the results.
### Partial aggregates may be okay
Since several aggregators have to send their results to proposers, it may be okay not to wait until aggregates become complete or near complete (i.e. include all or almost all individual attestations).
If several aggregators are to send aggregates, then we can weaken the requirement so that the union of the aggregates should cover the committee. In other words, a proposer can participate in the final round of an aggregation protocol, performing the final merge.
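Since a beacon block can include several attestation objects per committee (see below), the proposer's final merge can be viewed as picking a set of received partial aggregates whose union covers the committee. A hypothetical greedy sketch, representing each aggregate by its set of attester indices:

```python
from typing import List, Set

def greedy_cover(committee_size: int, aggregates: List[Set[int]]) -> List[Set[int]]:
    """Greedily pick partial aggregates (sets of attester indices) until
    their union covers the whole committee or no aggregate adds coverage."""
    covered: Set[int] = set()
    chosen: List[Set[int]] = []
    remaining = list(aggregates)
    while len(covered) < committee_size and remaining:
        best = max(remaining, key=lambda a: len(a - covered))
        if not best - covered:
            break  # nothing new to add: coverage stays incomplete
        chosen.append(best)
        covered |= best
        remaining.remove(best)
    return chosen

# Committee of 6; two of the three partial aggregates suffice:
picked = greedy_cover(6, [{0, 1, 2}, {2, 3}, {3, 4, 5}])
print(len(picked))  # 2
```

Note this sketch deliberately ignores how overlapping aggregates are combined; that issue is discussed below.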
Given network or Byzantine failures, this is a highly desirable property, since some attestations may be lost for various reasons. It allows the aggregation protocol to stop before a final result is obtained (which may be too late). However, it raises additional problems (see below).
### Coordinated vs random aggregation
Let's look at the aggregation part in more details. There are three general kinds of protocols to aggregate data in p2p-networks:
* Tree-like protocols. Very efficient communication-wise, but not tolerant to failures, especially Byzantine ones.
* Gossip-like protocols. Participants send partial aggregates to several (randomly) selected peers, to avoid traffic amplification. After some rounds, most participants should have received most individual items. It may take too long to wait until all nodes receive all items, especially in a Byzantine context.
* Hybrid protocols. Try to aggregate data in a coordinated manner (via tree-like structures), while handling network/Byzantine failures (falling back to some kind of gossip protocol). Handel is a good example of the approach.
When node/link failures cannot be ignored, we have only two options: either a gossip or a hybrid approach. A gossip approach has a significant drawback: since partial aggregates are sent in a random way, at some point it becomes difficult or impossible to merge two partial aggregates, because the sets of their attesters overlap, i.e. there are one or more attesters whose attestations are included in both partial aggregates. The problem is caused by the [Attestation](https://github.com/ethereum/eth2.0-specs/blob/dev/specs/core/0_beacon-chain.md#attestation) class, which uses bitfields to account for attesters: a single bit cannot record that an attester's signature entered the combined aggregate twice.
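The constraint can be illustrated directly on bitfields. A minimal sketch (real aggregates also carry a BLS signature that is combined alongside the bitfield, which is omitted here):

```python
def can_merge(bits_a: int, bits_b: int) -> bool:
    """Two partial aggregates can be merged only if their attester
    bitfields are disjoint: a bit records membership, not multiplicity."""
    return bits_a & bits_b == 0

def merge(bits_a: int, bits_b: int) -> int:
    """Merge two disjoint partial aggregates into one bitfield."""
    if not can_merge(bits_a, bits_b):
        raise ValueError("overlapping aggregates cannot be merged")
    return bits_a | bits_b

print(can_merge(0b0011, 0b1100))  # True  (disjoint attester sets)
print(can_merge(0b0011, 0b0110))  # False (attester 1 is in both)
```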
A coordinated approach is required to avoid this, so that nodes communicate in a way that allows for non-overlapping partial aggregates. Organizing nodes in a tree is an ideal choice in a fault-free setup, but in a Byzantine context a forest of trees should rather be constructed, to mask failures and message omissions.
Handel follows this approach. However, it imposes some overhead. E.g. Handel requires pairwise connections between nodes, which is not compatible with the p2p-graph approach without modifications (e.g. messages will pass through transient nodes, which may happen to be Byzantine).
### Medium-sized partial aggregation
Gossip-like protocols are attractive because they require less coordination and are well matched to the p2p communication graph. It's also beneficial (and may even be required if the slot duration is about to elapse) to stop the aggregation stage before a final result is reached.
So, if the partial aggregate merge problem is resolved, gossip protocols become a very attractive solution.
I.e. aggregators send partial aggregates to their peers for several rounds, and when an aggregate becomes big enough (around sqrt(m), where m is the committee size) or before the end of the slot, an aggregator rolls a die and sends its partial aggregate to proposers. It may also roll the die several times.
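Such an aggregator's behavior can be sketched as follows. This is hypothetical: `inbox_rounds`, `k_target` and the sqrt(m) threshold are illustrative parameters of this note's idea, not a specified protocol:

```python
import math
import random

def run_aggregator(my_bits: int, committee_size: int,
                   inbox_rounds, k_target: int,
                   rng: random.Random) -> tuple:
    """Merge incoming disjoint partial aggregates round by round; once the
    aggregate covers ~sqrt(m) attesters, roll a die (probability k/m) to
    decide whether to send it to proposers. Returns (bitfield, send?)."""
    threshold = math.isqrt(committee_size)
    for received in inbox_rounds:        # one iterable of bitfields per round
        for bits in received:
            if my_bits & bits == 0:      # only disjoint aggregates merge
                my_bits |= bits
        if bin(my_bits).count("1") >= threshold:
            send = rng.random() < k_target / committee_size
            return my_bits, send
    return my_bits, True  # slot about to end: send what we have
```

Overlapping aggregates are simply skipped here; a real implementation would more likely keep the larger of the two, or keep several candidates to maximize eventual coverage.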
Actually, the beacon block structure allows storing multiple partial attestations of a committee. The main obstacle is the limit of 128 on the total number of Attestations. More importantly, storing too many attestations will bloat a beacon block, which can be a problem for scalability. However, we think the problem can be resolved with a proper block structure and/or smart compression. See [here](https://hackmd.io/ZCOiGwjLRy6il6yuAnF05w) for details.
### Handel is a partial solution
Handel is an interesting protocol; however, as follows from the notes above, it's not a complete solution.
First, Handel requires pairwise connections between nodes, which doesn't fit the p2p-graph well, i.e. instead of direct connections, messages will pass through transient nodes, which means: a) additional delays, b) opportunities for Byzantine attacks. The latter is not Handel-specific, though.
Second, after Handel is complete or partially complete, the results should somehow be sent to proposers in a reliable fashion - a problem common to all attestation aggregation-dissemination strategies (discussed before).
Third, the Handel paper says that Handel is able to aggregate 4K attestations in under 1 second in a UDP setup. However, when using QUIC, the Handel developers [report](https://github.com/ConsenSys/handel/issues/126) it's three times slower. In the case of a p2p-graph, where a pairwise connection between nodes has to be implemented by sending messages via transient nodes, there is additional latency. So, when implemented in the context of Ethereum 2.0 requirements, it's not clear whether it's performant enough.
Overall, if we follow a coordinated route, Handel seems to be a very good starting point, which should be augmented to resolve the above issues.
### Topic-based Publish-Subscribe pattern seems to be a poor match
As quoted before, the network specification states:
> Unaggregated attestations are sent to the subnet topic. Aggregated attestations are sent to the beacon_attestation topic.
However, fully delivering individual attestations to a subnet topic is very resource-consuming. Earlier, we estimated that there are 16 committees of 300-1000 senders, and each should send to a subnet topic several times that size, i.e. around thousands of subscribers.
Actually, it's excessive, since individual attestations have to be delivered to only some of the aggregators. The final aggregate is obtained via several rounds of an aggregation protocol. If all individual attestations are propagated to all members of a shard subnet, then there is no need for an aggregation protocol at all, since they can be sent to proposers directly, with less effort (assuming the number of proposers in beacon_attestation is much smaller than the number of subscribers to a shard subnet topic).
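A back-of-the-envelope comparison makes this concrete (the numbers are illustrative, taken from the estimates above; gossip duplication would only increase the topic figure):

```python
def deliveries(senders: int, receivers: int) -> int:
    """Lower bound on message deliveries if every sender's attestation
    reaches every receiver (gossip duplicates would only add to this)."""
    return senders * receivers

# Flooding one committee's attestations into a ~2000-subscriber subnet topic
# vs sending them directly to a handful of proposer subscribers:
print(deliveries(300, 2000))  # 600000 deliveries into the subnet topic
print(deliveries(300, 20))    # 6000 direct deliveries to ~tens of proposers
```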
An aggregation protocol also doesn't match the topic-based publish-subscribe pattern, since aggregators send partial aggregates which grow with each round, so the messages are all different.
The final stage, where aggregators send their results to proposers, looks like a good match at a high level. However, considering implementation details, the subscribers to the beacon_attestation topic are constantly changing. So, there is a serious problem with topic membership management, which is discussed in the following section.
### Overlay management and Topic discovery
New proposers should subscribe to the topic beforehand to be able to receive results, and later unsubscribe (to keep the set of topic subscribers small). The appropriate information about topic membership changes should be propagated to aggregator nodes, so that they know whom to send their results to.
As the specification assumes the subscribers to beacon_attestation are mostly proposers, we assume there won't be many of them - around tens. From a scalability point of view, ideally there should be one subscriber each slot - the proposer of the next slot. However, it's safer to assume there will be proposers of some slots before and after the current one.
If the topic membership is small and changes rapidly, it will be a problem for gossipsub to maintain the mesh for the topic. Basically, we should assume a gossipsub router at a node should query the Topic Discovery service beforehand for information about the latest topic changes. Moreover, for a validator assigned to be an attester in a particular slot, it's most important that the topic membership information includes the entry of the next slot proposer.
Which means the next slot proposer should advertise itself with Topic Discovery as a topic member beforehand. Since this is a lengthy process and proposers are assigned an epoch in advance, it becomes a serious problem. Basically, the next slot proposer and the current slot attesters have from 64 to 128 slots (of 6 seconds each), i.e. about 6-13 minutes, to exchange the necessary information.
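The available window follows directly from the slot duration assumed in this note:

```python
SECONDS_PER_SLOT = 6  # slot duration assumed in this note

def advertisement_window_minutes(min_slots: int = 64, max_slots: int = 128):
    """Time the next proposer has to advertise itself (and attesters have
    to discover it), given it is assigned 64-128 slots in advance."""
    return (min_slots * SECONDS_PER_SLOT / 60,
            max_slots * SECONDS_PER_SLOT / 60)

print(advertisement_window_minutes())  # (6.4, 12.8) minutes
```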
### Topic Discovery and BFT
Another critical problem is the Byzantine fault tolerance properties of the Topic Discovery service. An adversary can advertise wrong records in the Topic Discovery service, or run a Topic Discovery service instance which provides honest nodes with wrong records about who the members of the beacon_attestation topic are. The honest nodes will then send their attestations in the wrong direction.
It's not clear how Byzantine fault tolerant Topic Discovery is, but an excerpt from [here](https://github.com/ethereum/devp2p/blob/master/discv5/discv5-theory.md#amplifying-network-traffic-by-returning-fake-registrations) suggests it's not:
>An attacker might wish to direct discovery traffic to a chosen address by returning records pointing to that address.
>
> TBD: this is not solved.
Basically, Topic Discovery is built around a Kademlia DHT, and p2p DHTs are known to have problems with BFT. BFT in the context of p2p and DHTs is also discussed [here](https://www.researchgate.net/publication/228842563_Byzantine_Fault_Tolerance_of_Inverse_de_Bruijn_Overlay_Networks_for_Secure_P2P_Routing).