# GossipSub for big messages
## Background
This document sums up my findings on using GossipSub for bigger messages, specifically for [EIP-4844](https://www.eip4844.com/).
It's based on various simulations aimed at understanding real network behavior.
You'll find the less polished history here: https://github.com/status-im/nim-libp2p/issues/850
The main goal is to keep the message propagation time low, and possibly save bandwidth.
All of the simulations assume that one 0.5 MB message is broadcast across the network every 12 seconds.
Thanks to [Anton Nashatyrev](https://github.com/Nashatyrev) for his help and simulations, and [arnetheduck](https://github.com/arnetheduck) for his ideas.
## Learnings
### 1 - Duplicates don't matter that much for propagation time
![](https://user-images.githubusercontent.com/13471753/227965249-d0871320-ee62-44e7-bfa7-04eb63ebbd56.png)
The conventional wisdom was that once messages got as big as 0.5 MB, duplicates would become our worst enemy, since connections would be busy sending them instead of useful data.
In practice, even eliminating 100% of duplicates has a very limited impact on propagation time in this scenario (roughly 12%).
This is because most duplicates occur once a majority of peers already have the message: that bandwidth is wasted, but not at the expense of the sending speed of useful messages.
Obviously, reducing duplicates still reduces the overall traffic.
### 2 - Bursty traffic is hard
![](https://www.researchgate.net/publication/222417532/figure/fig13/AS:668615292510214@1536421716407/TCP-congestion-window-cwnd-evolution.png)
TCP (CUBIC) estimates how much bandwidth is available on a socket by watching for packet losses: if packets are lost, it lowers its estimate; if none are lost, it raises it.
That's great, except that in Ethereum's case we only use the connection to its full potential every few seconds, leaving it idle or at a very low packet rate the rest of the time.
Since TCP won't see enough packets to maintain its estimate, it simply resets it and restarts the guessing process. Sending a 0.5 MB message on a "cold" connection therefore takes much longer than on a "warm" one.
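As a rough back-of-the-envelope illustration (a sketch, assuming a typical 1448-byte MSS, Linux's default initial window of 10 segments, and pure slow start with no losses), here is how many round trips the cold and warm cases cost:

```python
MSS = 1448          # typical TCP payload per segment, in bytes
INIT_CWND = 10      # Linux default initial congestion window, in segments
MSG_SIZE = 500_000  # the 0.5 MB message

def rtts_to_send(size: int, cwnd: int) -> int:
    """Round trips needed to push `size` bytes while the congestion
    window doubles every RTT (slow start), ignoring losses."""
    sent, rtts = 0, 0
    while sent < size:
        sent += cwnd * MSS
        cwnd *= 2
        rtts += 1
    return rtts

print(rtts_to_send(MSG_SIZE, INIT_CWND))  # cold connection: 6 RTTs
print(rtts_to_send(MSG_SIZE, 400))        # warm connection (cwnd ~400): 1 RTT
```

At a 100 ms RTT, that's roughly 600 ms versus 100 ms spent purely on congestion-control ramp-up.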
## Proposed gossipsub modifications
### 1 - ChokeMessage
Upon receiving a big enough message, a node can send a message informing its mesh peers that it no longer needs that message. This reduces duplicates for large messages.
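A minimal sketch of the idea (the names `ChokeMessage`, `Node`, the `send` method, and the 100 kB threshold are all illustrative assumptions, not the actual gossipsub wire format):

```python
from collections import defaultdict
from dataclasses import dataclass, field

LARGE_MSG_THRESHOLD = 100_000  # bytes; illustrative cutoff for "big" messages

@dataclass
class ChokeMessage:
    msg_id: bytes  # id of the message the sender no longer needs

@dataclass
class Node:
    mesh: dict = field(default_factory=lambda: defaultdict(set))    # topic -> peers
    choked: dict = field(default_factory=lambda: defaultdict(set))  # msg_id -> peers

    def on_message(self, msg) -> None:
        if len(msg.data) >= LARGE_MSG_THRESHOLD:
            # Tell mesh peers we have this message so they stop relaying it to us.
            for peer in self.mesh[msg.topic]:
                peer.send(ChokeMessage(msg_id=msg.id))

    def on_choke(self, peer, choke: ChokeMessage) -> None:
        # Remember not to forward this message to that peer.
        self.choked[choke.msg_id].add(peer)

    def relay_targets(self, msg) -> set:
        return self.mesh[msg.topic] - self.choked[msg.id]
```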
### 2 - Ping
By adding ping support to gossipsub, we can continuously ping each connection to keep it warm (at least once every 2 RTTs). This is a bit hacky, as the TCP estimate might be off, but it's still better than falling back to a cold connection.
Another option is the `tcp_slow_start_after_idle` sysctl parameter, but since it is system-wide, it can have [side effects](https://groups.google.com/g/fa.linux.kernel/c/qBPVxpjVMvU?pli=1).
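A sketch of the keep-warm loop (assuming an async `stream.ping()` round-trip primitive, which doesn't exist in gossipsub today):

```python
import asyncio

async def keep_warm(stream, rtt: float) -> None:
    # Ping at least once every 2 RTTs so TCP never considers the
    # connection idle and resets its congestion window.
    loop = asyncio.get_running_loop()
    while True:
        start = loop.time()
        await stream.ping()  # hypothetical round-trip primitive
        rtt = 0.9 * rtt + 0.1 * (loop.time() - start)  # smoothed RTT estimate
        await asyncio.sleep(2 * rtt)
```

The measured RTTs can double as input for the bandwidth estimation described below.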
### 3 - Staggered (/ sequential) relaying
![](https://i.imgur.com/Dhv9mYR.png)
![](https://i.imgur.com/TTJWZTE.gif)
Instead of relaying a message to all peers in your mesh at once, it's better to relay to one (or two) at a time. This lets those peers start relaying the message sooner, and also gives more time to receive ChokeMessages from peers, reducing duplicates further.
Each message should be sent to the mesh in a random order, which helps distribute the load better across the mesh.
![](https://i.imgur.com/ysoAwDd.png)
Two ways to implement this are proposed:
1. ACK-based
After sending each message, ping the peer on the same stream and wait for the response before moving on (see the sketch after this list). This costs an extra half RTT, but is bulletproof and easy to implement; because of that overhead, it should be limited to big messages.
2. Bandwidth estimation
Use the data from the continuous pings to estimate the available bandwidth to each peer, and from that how long each send will take. See an example [here](https://github.com/status-im/nim-libp2p/commit/fd6874e6474518a475595260996530dda8a358ff).
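A sketch of the ACK-based variant (reusing the hypothetical `relay_targets`/`choked` bookkeeping from the ChokeMessage sketch; `peer.send` and `peer.ping` are assumed async primitives):

```python
import asyncio
import random

async def staggered_relay(node, msg, parallelism: int = 2) -> None:
    peers = list(node.relay_targets(msg))
    random.shuffle(peers)  # spread the load across the mesh
    sem = asyncio.Semaphore(parallelism)  # at most `parallelism` sends in flight

    async def send_one(peer) -> None:
        async with sem:
            if peer in node.choked[msg.id]:
                return  # a ChokeMessage arrived while we were queued
            await peer.send(msg)
            await peer.ping()  # the extra half-RTT "ACK" before moving on

    await asyncio.gather(*(send_one(p) for p in peers))
```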
### 4 - BBR
When possible, the BBR congestion control algorithm should be used. It's unfortunately hard to enforce, since it requires users to run a `modprobe` on Linux, but it gives a significant boost in message propagation time (>15%).
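On Linux, an application can opt into BBR per socket without changing the system-wide default, provided the `tcp_bbr` kernel module is loaded (`modprobe tcp_bbr`):

```python
import socket

def try_enable_bbr(sock: socket.socket) -> bool:
    """Switch this socket to BBR, falling back to the system default
    (usually CUBIC) if the module isn't loaded or we're not on Linux."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
        return True
    except (AttributeError, OSError):
        return False
```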
## All together
![](https://i.imgur.com/r6mZ3tC.png)
Gains compared to Default:
| | Default | 4 chunks | 4 chunks + 2 parallel | 4 chunks + 2 parallel + choke |
| -- | -- | -- | -- | -- |
| Max latency | 0.00% | 26.00% | 50.03% | 61.47% |
| Mean latency | 0.00% | 18.92% | 54.00% | 58.69% |
| Bandwidth | 0.00% | 12.40% | 20.80% | 52.79% |
- 4 chunks: splitting the message into 4 chunks
- 2 parallel: limiting to 2 parallel relays
- choke: using ChokeMessage
## Potential improvements that are still WIP
### "Notify" message
Before sending a big message, a node could send a small "About to send `<msgid>`" notification, which the receiving node could use to optimistically send Choke to its mesh.
Because of mixed results in the simulations and potential security issues, it's still too early to push for wider adoption.
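A sketch of the optimistic reaction (again with the illustrative `ChokeMessage` and mesh bookkeeping from the earlier sketches):

```python
from dataclasses import dataclass

@dataclass
class NotifyMessage:
    topic: str
    msg_id: bytes  # the "about to send <msgid>" announcement

def on_notify(node, sender, notify: NotifyMessage) -> None:
    # We expect to receive this message from `sender` shortly, so
    # optimistically choke everyone else for this message id.
    for peer in node.mesh[notify.topic]:
        if peer is not sender:
            peer.send(ChokeMessage(msg_id=notify.msg_id))
```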
### Lazy publish
A node could eagerly push the message to 2 peers and just send IHAVE to the rest of its mesh. Assuming nodes send a single IWANT per message, this would further reduce the number of duplicates in the network.
This needs further investigation in terms of both latency and security.
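A sketch of the idea (the `send_ihave` helper is hypothetical; in gossipsub today, IHAVE is only gossiped at heartbeats, so this would be a behavioral change):

```python
import random

def lazy_publish(node, msg, eager: int = 2) -> None:
    peers = list(node.mesh[msg.topic])
    random.shuffle(peers)
    # Push the full message to a couple of peers...
    for peer in peers[:eager]:
        peer.send(msg)
    # ...and only advertise it to the rest; they pull it with one IWANT.
    for peer in peers[eager:]:
        peer.send_ihave(msg.topic, [msg.id])  # hypothetical helper
```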