**GossipSub Performance Analysis**
# **Machine Spec**
**Bharath's Machine:**
CPU: 8 cores / 8 threads, Intel(R) Xeon(R) Gold 6230R @ 2.10GHz
RAM: 32GB
Disk: 4TB NVMe SSD, 64.2k read IOPS / 15.2k write IOPS
This is a virtual machine running in a data center in Berlin.
bharath-123@devbox-bahrath
--------------------------
OS Debian GNU/Linux 12 (bookworm) x86_64
Host KVM/QEMU Standard PC (i440FX + PIIX, 1996) (pc-i440fx-9.2)
Kernel Linux 6.1.0-32-cloud-amd64
Uptime 5 days, 1 hour, 18 mins
Packages 494 (dpkg)
Shell bash 5.2.15
Terminal exe
CPU Intel(R) Xeon(R) Gold 6230R (8) @ 2.10 GHz
GPU Unknown Device 1111 (VGA compatible)
Memory 17.85 GiB / 31.36 GiB (57%)
Swap Disabled
Disk (/) 5.39 GiB / 251.78 GiB (2%) - ext4
Disk (/data) 1.44 TiB / 3.94 TiB (37%) - ext4
Local IP (eth0) 10.128.2.157/24
Locale C.UTF-8
CPU Rating:
MT Rating: 32881, ST Rating: 2155
**Marco's Machine:**
CPU: 12 cores / 24 threads, AMD Ryzen 9 7900X
RAM: 64GB
Disk: 4TB
This is a dedicated machine hosted by a cloud provider somewhere in the US.
dev@p2p
-------
OS Ubuntu 24.04.3 LTS x86_64
Kernel Linux 6.8.0-79-generic
Uptime 13 days, 20 hours, 14 mins
Packages 806 (dpkg), 241 (nix-user), 60 (nix-default)
Shell zsh 5.9
Terminal /dev/pts/0
CPU AMD Ryzen 9 7900X (24) @ 5.73 GHz
GPU 1 AMD Raphael [Integrated]
GPU 2 ASPEED Technology, Inc. ASPEED Graphics Family
Memory 15.78 GiB / 61.92 GiB (25%)
Swap 453.50 MiB / 8.00 GiB (6%)
Disk (/) 1.38 TiB / 3.58 TiB (39%) - ext4
Local IP (enp4s0) 157.250.202.182/30
Locale en_US.UTF-8
CPU Rating:
MT Rating: 51376, ST Rating: 4235
# Grafana Dashboards:
**Bharath's Machine:** https://grafana.ethquokkaops.io/d/cbb50869-3779-45ff-8039-5fb4ee6c7072/libp2p-gossipsub-dashboards?orgId=6&from=now-15m&to=now&timezone=browser
**Marco's Machine:** http://p2p.dev.marcopolo.io:3000/d/cbb50869-3779-45ff-8039-5fb4ee6c7072/incoming-and-sendmsg-queue-depth?orgId=1&from=now-30m&to=now&timezone=browser
# Metrics being tracked
# Observations (On Bharath's Machine)
1. The gossipsub heartbeat seems to take ~4.5ms at the p90. According to Marco and Raul, that should generally be fine?
2. The `publish_message` event seems to be stuck in the event loop for an unusually long time, consistently ~4.5ms at the p90, and it's not clear why: https://grafana.ethquokkaops.io/d/cbb50869-3779-45ff-8039-5fb4ee6c7072/libp2p-gossipsub-dashboards?orgId=6&from=now-1h&to=now&timezone=browser&viewPanel=panel-2
3. The sendMsg channel seems to have a significant amount of contention, around ~3ms at the p90, which doesn't seem right: https://grafana.ethquokkaops.io/d/cbb50869-3779-45ff-8039-5fb4ee6c7072/libp2p-gossipsub-dashboards?orgId=6&from=now-1h&to=now&timezone=browser&viewPanel=panel-16. This is in contrast with the incoming channel, whose contention time is negligible: https://grafana.ethquokkaops.io/d/cbb50869-3779-45ff-8039-5fb4ee6c7072/libp2p-gossipsub-dashboards?orgId=6&from=now-1h&to=now&timezone=browser&viewPanel=panel-11. The initial thought was that the machine's CPU might be too slow, but its ST and MT ratings are sufficient according to https://eips.ethereum.org/EIPS/eip-7870#cpu. Another explanation could be that the sendMsg buffer is full, but that doesn't seem to be the case either: https://grafana.ethquokkaops.io/d/cbb50869-3779-45ff-8039-5fb4ee6c7072/libp2p-gossipsub-dashboards?orgId=6&from=now-1h&to=now&timezone=browser&viewPanel=panel-6 shows a p95 buffer size of roughly 27-28, meaning the size exceeds 28 only ~5% of the time, which shouldn't cause a lot of contention. An interesting experiment would be to increase the buffer size to 64 and see whether that changes the contention time (see the option sketch after this list).
4. The `peer_dead` event also seems to take a fair amount of time at the p90-p95: https://grafana.ethquokkaops.io/d/cbb50869-3779-45ff-8039-5fb4ee6c7072/libp2p-gossipsub-dashboards?orgId=6&from=now-24h&to=now&timezone=browser&viewPanel=panel-3
5. We are recording the time taken for `validateMsg`, and some of the numbers look implausibly high: some validators take >4s at the p90-p95 (see the validator timing sketch after this list). https://grafana.ethquokkaops.io/d/cbb50869-3779-45ff-8039-5fb4ee6c7072/libp2p-gossipsub-dashboards?orgId=6&from=now-24h&to=now&timezone=browser&viewPanel=panel-44
6. The time to forward a message to the RPC generally almost matches the async validation time.
7. There is a spike in messages sent to the RPC and to peers during the time frame {"from":"2025-10-05 20:49:09","to":"2025-10-05 21:49:09"}, but no corresponding spike in async validation. I did notice a lot of dropped RPCs; is there any relation?
8. Message publish time can be >= message RPC push time, since the publish time includes the network latency of prior messages in the RPC queue.
9. Message RPC push time >= async validation time, since there can be some delay between validation and message publishing: after validation, the message is sent over the sendMsg channel buffer to be processed by the gossipsub router. There could be a slow subscriber?
10. There are some cases where async validation takes more time than message publishing. How?
11. Single-threaded performance is quite important for gossipsub.
12. Subscribing to more topics increases network bandwidth requirements.
13. The number of IWANTs sent increases when subscribed to all subnets.
14. Peer outgoing RPC queues become full.
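For the buffer-size experiment in observation 3 and the full outgoing queues in observation 14, here is a minimal sketch of how the relevant queue sizes could be bumped when constructing the router directly with go-libp2p-pubsub (the consensus client's own wrapper will differ, and 64 is just the experiment's guess, not a recommended value):

```go
package main

import (
	"context"
	"log"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func main() {
	ctx := context.Background()

	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()

	ps, err := pubsub.NewGossipSub(ctx, h,
		// Per-peer outgoing RPC queue (default 32). Doubling it is the
		// experiment from observation 3: does sendMsg contention drop?
		pubsub.WithPeerOutboundQueueSize(64),
		// Queue feeding the async validators from observation 5; bumping it
		// is a separate, independent experiment.
		pubsub.WithValidateQueueSize(64),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = ps // subscribe / publish as usual
}
```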
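For observation 5, a sketch of one way the per-validator time could be measured and bounded, assuming the topic validators are registered through go-libp2p-pubsub's `RegisterTopicValidator`; `record` and the 3s timeout are placeholders, not the client's actual instrumentation:

```go
package validation

import (
	"context"
	"time"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

// timedValidator wraps an async topic validator and reports how long each call
// took; record stands in for whatever histogram backs the validateMsg panel.
func timedValidator(inner pubsub.ValidatorEx, record func(time.Duration)) pubsub.ValidatorEx {
	return func(ctx context.Context, from peer.ID, msg *pubsub.Message) pubsub.ValidationResult {
		start := time.Now()
		res := inner(ctx, from, msg)
		record(time.Since(start))
		return res
	}
}

// Register attaches the wrapped validator with a timeout, which gives the
// validator a context deadline (the validator still has to honor ctx for the
// >4s tail to actually get cut off).
func Register(ps *pubsub.PubSub, topic string, v pubsub.ValidatorEx, record func(time.Duration)) error {
	return ps.RegisterTopicValidator(topic,
		timedValidator(v, record),
		pubsub.WithValidatorTimeout(3*time.Second), // hypothetical bound
	)
}
```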
# Questions:
1. Why is the heartbeat interval 700ms in consensus-specs? What is the tradeoff of running a higher/lower heartbeat interval? (See the config sketch after this list.)
2. Why is only `publish_message` waiting longer in the event loop? Is it because the sendMsg queue holds more items?
3. OTel metrics do introduce some extra memory allocations and latency. Let's optimize this.
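On question 1: go-libp2p-pubsub defaults to a 1s heartbeat, and the heartbeat is where mesh maintenance and IHAVE gossip emission happen, so a shorter interval repairs the mesh and emits gossip more often at the cost of more control traffic and more frequent heartbeat work in the event loop; a longer interval does the opposite. A minimal sketch of overriding it to the consensus-specs value, assuming the router is built directly with go-libp2p-pubsub:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func main() {
	ctx := context.Background()

	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()

	// Start from the library defaults and only change the heartbeat interval,
	// matching the 700ms value from consensus-specs.
	params := pubsub.DefaultGossipSubParams()
	params.HeartbeatInterval = 700 * time.Millisecond

	ps, err := pubsub.NewGossipSub(ctx, h, pubsub.WithGossipSubParams(params))
	if err != nil {
		log.Fatal(err)
	}
	_ = ps
}
```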
# Current WIP stuff:
1. Write Go benchmarks to see whether the metrics introduce any overhead (see the benchmark sketch after this list).
2. Remove metrics with attributes (and the ones we don't use) to see how that impacts the latency metrics. Remove topicMsgSent, topicBytesSent, etc.
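For WIP item 1, a sketch of what such a benchmark could look like, assuming the metrics go through the OpenTelemetry metric API; the instrument and attribute names here are placeholders, not the actual metric names:

```go
package metricsbench

import (
	"context"
	"testing"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// BenchmarkCounterWithTopicAttr measures the per-call cost of incrementing an
// OTel counter with a per-topic attribute, roughly what metrics like
// topicMsgSent do on the hot path. A manual reader is attached so the SDK
// actually aggregates measurements instead of dropping them.
func BenchmarkCounterWithTopicAttr(b *testing.B) {
	ctx := context.Background()

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewManualReader()),
	)
	meter := provider.Meter("gossipsub-bench")

	counter, err := meter.Int64Counter("topic_msg_sent") // placeholder instrument name
	if err != nil {
		b.Fatal(err)
	}

	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		counter.Add(ctx, 1, metric.WithAttributes(attribute.String("topic", "beacon_block")))
	}
}
```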