# Gossip pubsub simulation in the context of Beacon Chain

A network of Libp2p Gossip nodes is simulated to find out how different gossip (and non-gossip) parameters affect message dissemination and network traffic.

## Simulator overview

- Based on the JVM Libp2p Gossip implementation
- Written in Kotlin
- Currently implemented in the form of [tests](https://github.com/libp2p/jvm-libp2p/blob/feature/simulator/src/test/kotlin/io/libp2p/simulate/gossip/Simulation1.kt) in a separate [jvm-libp2p branch](https://github.com/libp2p/jvm-libp2p/tree/feature/simulator)
- Capable of running any Libp2p based protocol
- Large scale capability (tested up to 100K nodes) thanks to short-circuiting the network stack and protocol data serialization
- Time control: time can be moved forward, and all scheduled tasks are executed in the same order as in a usual system environment
- Determinism: any simulation with the same random seed may be replayed 1:1
- Multithreading. Two different (not mutually exclusive) options are available:
  - Parallel simulation processing: runs several simulations with different parameters concurrently. `N` threads require `xN` memory but gain `xN` speedup. Determinism is preserved.
  - Parallel node processing: a single simulation starts `M` threads and shares them across all nodes. Has the same memory footprint; the speedup depends on protocol performance (gossip gains only up to `x2.5`). Not deterministic.
- Network parameters: configurable topology, individual connection latency (constant or random), node system clock shifts, total node network throughput (not yet tested)

## Simulation setup

- Network: 5000-10000 nodes, each with N (normally 10-30) randomly connected peers.
- Message payload size: 32 KB, taken as an average block size (with block propagation in mind)
- Network latency: either a constant 1ms or a uniformly distributed random value (1-50ms)
- Default Gossip values from the [Eth2.0 spec](https://github.com/ethereum/eth2.0-specs/blob/dev/specs/networking/p2p-interface.md#the-gossip-domain-gossipsub)
- Statistics are gathered over `N` randomly generated networks (normally 10) and `M` messages sent from random nodes (normally 3-5) for each parameter set
- Some BFT test simulations contain a percentage of _bad_ nodes (80-95%, i.e. approximately a 5:1-20:1 bad/honest node ratio). Those nodes just don't propagate Gossip messages to their peers
- In some simulations the Gossip heartbeat start moment is randomly scattered to avoid side effects caused by simultaneous node heartbeats
- After the network is connected, time is advanced by a `warmUp` period (5 sec), allowing Gossip to exchange initial messages and build more or less stable meshes
- Some simulation results contain two sets of statistics:
  - `0-*`: the 'immediate' effect after publishing a message, when no heartbeats have been executed yet. At this point 'gossiping' (`IHAVE/IWANT` messages with message IDs) is not involved.
  - `N-*`: the effect after a significant number of heartbeats have been executed following the publish, when 'gossiping' is expected to have completed across the whole network
- The main result measurements are:
  - traffic: number of bytes transmitted (TCP framing is taken into account) per node per published message
  - percentage of nodes that received the message
  - message delivery time (ms): 50th percentile/90th percentile/max
  - traffic overhead: `traffic / [ideal traffic]`, where `ideal traffic` is the simulated message payload size (32 KB for the block propagation case)

## Simulation cases

### BFT

Simulates a network with different fractions of 'bad' nodes.
A 'bad' node is a node that doesn't propagate incoming messages.

With default gossip settings and 20 peer connections, the upper bound on tolerated bad nodes is 80% (or ~x5 of normal nodes).

The results show that the 'gossiping' part of the protocol starts working only in an extreme environment. In that case message delivery time becomes quite high.

[[Results]](https://docs.google.com/spreadsheets/d/1QvYio6ALnjrEYmsgZMTuK-ktIZzeRbqIOvPrU4b-Cl0/edit#gid=0)

### Results consistency

This is just a check that simulation results don't swing dramatically when the network size changes.

[[Results]](https://docs.google.com/spreadsheets/d/1QvYio6ALnjrEYmsgZMTuK-ktIZzeRbqIOvPrU4b-Cl0/edit#gid=1024299423)

### More peers - better BFT

Checks how the number of connected peers affects BFT, simulating a network where the bad/honest node ratio is 10/1.

The results are not surprising: more peer connections give better BFT. The traffic increase is the cost of the BFT gain.

[[Results]](https://docs.google.com/spreadsheets/d/1QvYio6ALnjrEYmsgZMTuK-ktIZzeRbqIOvPrU4b-Cl0/edit#gid=1880641373)

### Less Gossip `D` - less traffic overhead

The 'ideal' traffic to disseminate a message is x1 of the message payload (32 KB in our simulations). Current Eth2.0 gossip settings result in ~x5-6 traffic overhead.

Reducing the `D` param may decrease the overall traffic while preserving BFT and delivery rate thanks to 'gossiping'.

For more consistent results the `DLow` and `DHigh` params are set closer to `D`. This increases the gossip mesh rebuilding frequency (`GRAFT` message traffic) but keeps the real mesh size closer to the simulated param.

[[Results]](https://docs.google.com/spreadsheets/d/1QvYio6ALnjrEYmsgZMTuK-ktIZzeRbqIOvPrU4b-Cl0/edit#gid=130055630)

### Gossip `DLazy` adds little traffic overhead

'Gossiping' improves BFT at little traffic cost. Increasing `DLazy` doesn't result in significant traffic overhead in a healthy network.
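
As a rough sanity check on why `DLazy` is cheap relative to `D`, the per-node traffic can be approximated with a first-order model. This is only a sketch under stated assumptions, not the simulator's calculation: the function `estimate_overhead` and its parameters `msg_id_bytes` and `gossip_rounds` are hypothetical, each node is assumed to forward the full payload to every mesh peer except the one it received it from, and framing, control messages and `IWANT` replies are ignored. Under these assumptions, doubling `DLazy` from 6 to 12 changes the estimate by less than 0.02, while each extra mesh peer adds a full x1.

```python
def estimate_overhead(mesh_degree, d_lazy, payload_bytes=32 * 1024,
                      msg_id_bytes=32, gossip_rounds=3):
    """First-order traffic overhead per node per published message.

    Assumptions (illustrative, not taken from the simulator):
    - a node forwards the full message to all mesh peers except the sender;
    - a node advertises the message ID to d_lazy peers on each of
      gossip_rounds heartbeats;
    - framing, control messages and IWANT replies are ignored.
    """
    eager_bytes = (mesh_degree - 1) * payload_bytes      # full message copies
    lazy_bytes = d_lazy * gossip_rounds * msg_id_bytes   # IHAVE message IDs only
    return (eager_bytes + lazy_bytes) / payload_bytes


if __name__ == "__main__":
    # Compare the effect of the mesh degree with the effect of DLazy.
    for mesh_degree in (3, 4, 6, 8):
        print("D =", mesh_degree, "overhead ~", estimate_overhead(mesh_degree, d_lazy=6))
    for d_lazy in (6, 12, 24):
        print("DLazy =", d_lazy, "overhead ~", estimate_overhead(6, d_lazy=d_lazy))
```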
[[Results]](https://docs.google.com/spreadsheets/d/1QvYio6ALnjrEYmsgZMTuK-ktIZzeRbqIOvPrU4b-Cl0/edit#gid=305605107)

### Playing with heartbeat period

While 'gossiping' saves traffic and improves BFT, its delivery delay could be an issue for time-critical applications. Message IDs are 'gossiped' on heartbeats (whose period defaults to 1 sec in the Eth2.0 spec), so the cumulative delivery time is on the order of seconds.

This simulation increases the heartbeat frequency in an aggressive environment (90% bad nodes). As a result, delivery time improves significantly at no large traffic cost. However, `period < 100ms` causes a significant traffic increase (probably due to interference with network latency).

[[Results]](https://docs.google.com/spreadsheets/d/1QvYio6ALnjrEYmsgZMTuK-ktIZzeRbqIOvPrU4b-Cl0/edit#gid=114511837)

### Trying immediate gossip

As a follow-up to the previous simulation, this one modifies the Gossip implementation to immediately broadcast the message gossip to the peers outside of the mesh (similar to the way Eth1.0 propagates a block: a small subset of peers receives the full block, the others receive only the hash).

The results are similar to heartbeats with a small period: 'gossiping' delivery time improves significantly, while traffic overhead increases in a healthy network.

[[Results]](https://docs.google.com/spreadsheets/d/1QvYio6ALnjrEYmsgZMTuK-ktIZzeRbqIOvPrU4b-Cl0/edit#gid=341160593)

## Conclusions

- Default gossip options barely utilize the 'gossiping' feature in a 'normal' environment.
- Even when the 'gossiping' feature is utilized (e.g.
  in an 'aggressive' environment) the delivery time may be unacceptable due to infrequent heartbeats (1 sec)
- In the absence of the 'gossiping' effect there is x5-x6 traffic overhead (for block propagation) due to full message duplication
- Gossip is BFT
- (unrelated to gossip) BFT is greatly improved by a larger number of peers
- BFT can be improved by increasing `DLazy` at relatively small cost
- A possible attack vector is spamming the network with old messages, so we need to make sure that the gossip 'seen message set' size is large enough and that the application validator rejects 'old' data

## Gossip protocol proposals/issues

- Separate the 'gossiping' (`IHAVE`) and 'mesh' (`GRAFT`) heartbeat periods, to enable fast 'gossiping' without too frequent mesh rebuilding.
  - Consider a 0 'gossiping' heartbeat delay (like Eth1.x, which sends block hashes immediately). While the [simulation](https://docs.google.com/spreadsheets/d/1QvYio6ALnjrEYmsgZMTuK-ktIZzeRbqIOvPrU4b-Cl0/edit#gid=341160593) showed x10 faster 'gossiping' message propagation, in the normal scenario traffic increases up to x2.
- Don't duplicate `IHAVE` messages to peers (on each `gossip_advertise` cycle gossip picks `DLazy` random peers regardless of `IHAVE` messages already sent). While this may save some bytes of traffic, it may over-complicate the gossip protocol.
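
The last proposal could be sketched roughly as follows. This is an illustration only, written in Python for brevity: `IHaveAdvertiser`, `pick_peers` and `forget` are hypothetical names and are not part of jvm-libp2p or the gossipsub spec. The sketch remembers, per message ID, which peers were already sent an `IHAVE`, and picks up to `DLazy` random peers only from the remaining ones.

```python
import random


class IHaveAdvertiser:
    """Hypothetical helper: avoids sending IHAVE for the same message ID twice to a peer."""

    def __init__(self, d_lazy, rng=None):
        self.d_lazy = d_lazy
        self.rng = rng if rng is not None else random.Random()
        # message ID -> set of peers that were already sent IHAVE for it
        self.advertised = {}

    def pick_peers(self, msg_id, non_mesh_peers):
        """Return up to d_lazy random peers that have not yet been advertised msg_id."""
        already = self.advertised.setdefault(msg_id, set())
        candidates = [p for p in non_mesh_peers if p not in already]
        chosen = self.rng.sample(candidates, min(self.d_lazy, len(candidates)))
        already.update(chosen)
        return chosen

    def forget(self, msg_id):
        """Drop the bookkeeping for msg_id, e.g. when it leaves the message cache."""
        self.advertised.pop(msg_id, None)


if __name__ == "__main__":
    advertiser = IHaveAdvertiser(d_lazy=2, rng=random.Random(1))
    peers = ["a", "b", "c", "d", "e"]
    # Three advertise cycles cover all five peers without repeats.
    for cycle in range(3):
        print("cycle", cycle, "->", advertiser.pick_peers("msg-1", peers))
```

The extra per-message state, and the need to expire it together with the message itself, is the kind of added complexity the proposal warns about.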