
Gossip pubsub simulation in the context of Beacon Chain

The network of Libp2p Gossip nodes is simulated to find out how different gossip (and non-gossip) parameters affect message dissemination and network traffic

Simulator overview

  • Based on JVM Libp2p Gossip implementation
  • Written in Kotlin
  • Currently implemented in the form of tests in a separate jvm-libp2p branch
  • Capable of running any Libp2p based protocol
  • Large scale capability (tested up to 100K nodes) thanks to short-circuiting the network stack and protocol data serialization
  • Time control: time can be moved forward, and all scheduled tasks are executed in the same order as in a regular system environment
  • Determinism: any simulation with the same random seed can be replayed 1:1
  • Multithreading. Two different (not mutually exclusive) options are available:
    • Parallel simulation processing: runs several simulations with different parameters concurrently. N threads require xN memory but give an xN speedup. Determinism is preserved
    • Parallel node processing: a single simulation starts M threads and shares them across all nodes. It has the same memory footprint; the speedup depends on protocol performance (gossip gains only up to x2.5). Not deterministic
  • Network parameters: configurable topology, individual connection latency (constant or random), node system clock shifts, total node network throughput (not yet tested)
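
The time-control and determinism points above can be sketched as a virtual clock driving a task queue. This is a minimal illustration, not the actual simulator API: the `SimScheduler` and `runSim` names are hypothetical.

```kotlin
import java.util.PriorityQueue
import kotlin.random.Random

// Minimal sketch of deterministic time control: a virtual clock plus a task
// queue ordered by (fire time, insertion order). With the same random seed,
// every run executes tasks in exactly the same order.
class SimScheduler {
    private class Task(val time: Long, val seq: Long, val action: () -> Unit)

    private val queue = PriorityQueue<Task>(compareBy<Task>({ it.time }, { it.seq }))
    private var seq = 0L
    var now = 0L
        private set

    fun schedule(delayMs: Long, action: () -> Unit) {
        queue.add(Task(now + delayMs, seq++, action))
    }

    // Move virtual time forward, executing all due tasks in deterministic order
    fun rewind(byMs: Long) {
        val target = now + byMs
        while (queue.isNotEmpty() && queue.peek().time <= target) {
            val task = queue.poll()
            now = task.time
            task.action()
        }
        now = target
    }
}

// Schedules 5 tasks at seed-derived times and returns their fire times
fun runSim(seed: Long): List<Long> {
    val rnd = Random(seed)
    val sched = SimScheduler()
    val fired = mutableListOf<Long>()
    repeat(5) { sched.schedule(rnd.nextLong(0, 1000)) { fired.add(sched.now) } }
    sched.rewind(1000)
    return fired
}
```

Two calls of `runSim` with the same seed yield identical traces, which is what makes a 1:1 replay possible.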

Simulation setup

  • Network: 5000-10000 nodes, each has N (normally 10-30) peers connected randomly.
  • Message payload size: 32Kb as an average block size (with block propagation in mind)
  • Network latency: either constant 1ms or a uniformly distributed random value (1-50ms)
  • Default Gossip values from the Eth2.0 spec
  • Statistics are gathered from N randomly generated networks (normally 10) and M messages sent from random nodes (normally 3-5) for each parameter set
  • Some BFT test simulations contain a percentage of bad nodes (80-95%, i.e. a 5:1-20:1 bad/fair node ratio). Those nodes simply don't propagate Gossip messages to their peers
  • In some simulations the Gossip heartbeat start moment is randomly scattered to avoid side effects due to simultaneous node heartbeats
  • After the network is connected, a warm-up time (5 sec) is rewound, allowing Gossip to exchange initial messages and build more or less stable meshes
  • Some simulation results contain two sets of statistics
    • 0-*: the 'immediate' effect after publishing a message, before any heartbeats have executed. At this point 'gossiping' (IHAVE/IWANT messages with message IDs) is not yet involved.
    • N-*: after a significant number of heartbeats have executed since publishing a message and 'gossiping' is expected to have completed across the whole network
  • The main result measurements are
    • traffic: number of bytes transmitted (TCP framing is taken into account) per node per published message
    • percentage of nodes that received the message
    • message delivery time (ms): 50th percentile/90th percentile/max
    • traffic overhead: traffic / [ideal traffic], where the ideal traffic is the simulated message payload size (32Kb for the block propagation case)
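
The traffic overhead measurement above boils down to a simple ratio. A hypothetical helper (the function name is ours, not the simulator's):

```kotlin
// Traffic overhead = bytes transmitted per node per message, divided by the
// ideal traffic (exactly one copy of the payload; 32Kb for the block case).
const val BLOCK_PAYLOAD_BYTES = 32 * 1024L

fun trafficOverhead(bytesPerNodePerMessage: Long, payloadBytes: Long = BLOCK_PAYLOAD_BYTES): Double =
    bytesPerNodePerMessage.toDouble() / payloadBytes
```

For example, 180Kb transmitted per node for a 32Kb block gives an overhead of x5.625, in the x5-6 range reported below.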

Simulation cases

BFT

Simulates a network with different fractions of 'bad' nodes. A 'bad' node is a node that doesn't propagate incoming messages.
With default gossip settings and 20 peer connections, the upper bound of bad nodes is 80% (or ~x5 of normal nodes).
It shows that the 'gossiping' part of the protocol starts working only in an extreme environment. In that case message delivery times become pretty high.
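
The node behaviours assumed in these BFT simulations can be sketched as follows (the `Peer` interface is illustrative, not the jvm-libp2p API): a fair node forwards each new message to its mesh peers once, while a 'bad' node receives messages but never forwards them.

```kotlin
// A peer is anything that can receive a message
fun interface Peer { fun send(msg: ByteArray) }

// Fair node: deduplicates by message content, then propagates to all mesh peers
class FairNode(private val peers: List<Peer>) : Peer {
    private val seen = HashSet<Int>()
    override fun send(msg: ByteArray) {
        if (!seen.add(msg.contentHashCode())) return  // drop duplicates
        peers.forEach { it.send(msg) }                // propagate to all mesh peers
    }
}

// Bad node: swallows every message, never propagates
class BadNode : Peer {
    override fun send(msg: ByteArray) { /* intentionally does nothing */ }
}
```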

[Results]

Results consistency

This is just to check that simulation results don't bounce dramatically when the network size changes.

[Results]

More peers - better BFT

Checking how the number of connected peers affects BFT. Simulating a network where the bad/honest node ratio is 10:1. The results are not surprising: more peer connections mean better BFT. Increased traffic is the cost of the BFT improvement.
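
A back-of-envelope argument (our assumption, not a simulation result) for why more peers help: a node is fully cut off only when all of its peers are bad, and with a fraction p of bad nodes and D randomly chosen peers that chance is roughly p^D, so each extra peer shrinks it multiplicatively. This ignores correlations in the mesh graph.

```kotlin
import kotlin.math.pow

// Rough estimate: probability that ALL D randomly chosen peers of a node are
// bad, given a fraction p of bad nodes in the network. Illustrative only.
fun allPeersBadProb(p: Double, d: Int): Double = p.pow(d)
```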

[Results]

Less Gossip D - less traffic overhead

The 'ideal' traffic to disseminate a message is x1 of the message payload (32Kb in our simulations). The current Eth2.0 gossip settings result in ~x5-6 traffic overhead. Reducing the D param may decrease the overall traffic while preserving BFT and delivery rate thanks to 'gossiping'.

For more consistent results, the DLow and DHigh params are set closer to D. This would increase the gossip mesh rebuilding frequency (GRAFT message traffic) but would keep the real mesh size closer to the simulated param.
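
The parameter set under discussion can be sketched as a config object. Field names mirror the gossipsub spec; this is not necessarily the exact jvm-libp2p API, and the default values are the spec's suggested ones.

```kotlin
// Gossipsub mesh parameters (defaults roughly as suggested by the gossipsub spec)
data class GossipParams(
    val D: Int = 6,          // target mesh degree
    val DLow: Int = 4,       // GRAFT new peers when the mesh falls below this
    val DHigh: Int = 12,     // PRUNE peers when the mesh grows above this
    val DLazy: Int = 6,      // peers to send IHAVE gossip to per heartbeat
    val heartbeatMillis: Long = 1000
)

// A reduced-D variant with a tight [DLow, DHigh] band, as in this simulation:
// more GRAFT/PRUNE churn, but the real mesh size stays close to the simulated D
val tightMesh = GossipParams(D = 4, DLow = 3, DHigh = 5)
```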

[Results]

Gossip DLazy adds little traffic overhead

'Gossiping' improves BFT at little traffic cost. Increasing DLazy doesn't result in a significant traffic overhead in a healthy network.

[Results]

Playing with heartbeat period

While 'gossiping' saves traffic and improves BFT, its delivery delay could be an issue for time-critical applications. Message IDs are 'gossiped' on heartbeats (whose period defaults to 1 sec in the Eth2.0 spec), so the cumulative delivery time becomes on the order of seconds.

This simulation increases the heartbeat frequency in an aggressive environment (90% of bad nodes). As a result, delivery time improves significantly at no large traffic cost. However, a period < 100ms causes a significant traffic increase (probably due to interference with network latency).
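
The intuition for why a shorter heartbeat helps can be put as a rough upper bound (our assumption, not a measured result): when delivery relies on IHAVE/IWANT, each hop waits for the next heartbeat plus an IWANT round trip, so the worst-case delivery time grows roughly linearly with the hop count and the heartbeat period.

```kotlin
// Rough worst-case delivery estimate for gossip-only (IHAVE/IWANT) paths:
// each hop waits up to one heartbeat period plus one request round trip.
fun gossipDeliveryUpperBoundMs(hops: Int, heartbeatMillis: Long, rttMs: Long): Long =
    hops * (heartbeatMillis + rttMs)
```

With the default 1 sec heartbeat a 3-hop gossip path can take seconds; cutting the period to 100ms shrinks the same bound by almost an order of magnitude.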

[Results]

Trying immediate gossip

As a follow-up to the previous simulation, this one modifies the Gossip implementation to immediately broadcast the message gossip to peers outside of the mesh (similar to the way Eth1.0 propagates a block: a small subset of peers receives the full block, the others receive only its hash).

The results are similar to heartbeats with a small period: 'gossiping' delivery time improves significantly, while traffic overhead increases in a healthy network.

[Results]

Conclusions

  • Default gossip options barely utilize the 'gossiping' feature in a 'normal' environment
  • Even when the 'gossiping' feature is utilized (e.g. in an 'aggressive' environment), the delivery time may be unacceptable due to infrequent heartbeats (1 sec)
  • In the absence of the 'gossiping' effect there is x5-x6 traffic overhead (for block propagation) due to full message duplication
  • The Gossip protocol is BFT
  • (Unrelated to gossip) BFT is greatly improved by the number of peers
  • BFT can be improved by increasing DLazy at relatively small cost
  • A possible attack vector is spamming the network with old messages, so we need to make sure the gossip 'seen message set' is large enough and the application validator rejects 'old' data

Gossip protocol proposals/issues

  • Separate the 'gossiping' (IHAVE) and 'mesh' (GRAFT) heartbeat periods, to enable fast 'gossiping' without too frequent mesh rebuilding
  • Consider a 0 'gossiping' heartbeat delay (like Eth1.x sending block hashes immediately): while the simulation showed x10 faster 'gossiping' message propagation in the normal scenario, traffic increases up to x2
  • Don't send duplicate IHAVE messages to peers (on each gossip_advertise cycle gossip picks DLazy random peers regardless of already sent IHAVE messages). While this may save some traffic, it may over-complicate the gossip protocol