Peering Refactoring

# Peering Refactor #
## -or- ##
# Modularization: No really. We should do this. #

## Situation ##

The 24.5.0 release, which dropped on 5/14, was initially scheduled as a 24.3.1 release at the end of March. It took 7 weeks and 4 release candidates before we were confident that the issue had been addressed well enough to warrant a release and wasn't going to make things worse. A full analysis of this situation is beyond the scope of this recommendation; however, reading the [corresponding post-mortem](https://hackmd.io/@RoboCopsGoneMad/rJWoqo74C) is strongly suggested as background for this strategy.

During this release cycle, we experienced regressions due to multiple peering and syncing factors, some of which interacted with each other. This sub-optimal behavior was present in 24.3.0 and was eventually noticed by the community, after which a speedy response was impossible.

### Complications ###

- lack of observability - it remains difficult to trace each devp2p handshake and/or disconnection. The sync pipelines are even more opaque, making collaboration between devs very difficult.
- lack of tooling - we don't have any Ethereum-specific diagnostic tools that work with Besu for the peering, RLPx, and syncing domains. Developers are limited to using general-purpose tools like debuggers and logging.
- lack of isolatability - the lines between discv5, RLPx, and the various eth/n and snap/n subprotocols are insufficiently clear; errors can cross the layers in ways that make an RLPx issue look like an eth or snap issue, making analysis time-consuming and difficult.

### Short Term Resolutions ###

- improve observability in two areas: peering and syncing pipelines. Extensive logging is not enough; that logging must be aggregatable, isolatable, and queryable so that the order of operations for the protocol is clearly illustrated for devs.
- suggest investing in additional metrics and visualizations for peering and sync related tasks.
- focus on monitoring peering and syncing, increasing the priority and granularity of alerting and responses to canaries.

### Long Term Refactoring ###

For each of the refactorings below, we need to meet some acceptance criteria:

- is the observability improved?
- is the testability improved?
- is it easier to create tooling?

An incremental refactoring path, such as the one outlined below, could lead to a viable EIP-4444 history client, which would be useful on its own even if we choose not to fold these modules back into Besu.

- Build a discovery module based on the latest version of the protocol, [discv5](https://github.com/ethereum/devp2p/blob/master/discv5/discv5.md). Run it on its own, show that it discovers peers and is discovered by them. Compare to current Besu performance.
- Build an [RLPx](https://github.com/ethereum/devp2p/blob/master/rlpx.md) client which depends on a mock version of the discv5 module. Then inject the implementation above.
- Build multiple "capabilities" on top of the RLPx module, to support the `eth` and `snap` capabilities.
- Build a storage "module" for Bonsai. The existing Bonsai storage plugin for the db should already allow for this.
- Build a state sync module. Sync from snapsync servers.

### Justification ###

- Modular approaches have market validation. Reth and Erigon are both highly modular.
- Rewriting a subcomponent has worked before, with the transaction pool and Bonsai.
- Opportunity to continue DI deployment. Besu has been battle tested with limited use of Dagger; further adoption is only a matter of will and time. Rust also uses a compile-time dependency injection mechanism specific to Rust, much like Besu's use of Dagger.
- EIP-4444 and Portal network may require a whole 'nother application layer on top of discv5. History serving could end up looking like its own, detachable client. EIP-4444 is not scheduled for Prague, but is being worked on early. Opportunity for Besu to be a first-mover here.
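
The refactoring path above proposes building the RLPx client against a mock discovery module and injecting the real discv5 implementation later. A minimal Java sketch of that seam might look like the following. Every name here (`PeerDiscovery`, `MockPeerDiscovery`, `RlpxClientSketch`) is hypothetical and does not correspond to existing Besu classes. Plain constructor injection stands in for Dagger wiring, and "dialing" is reduced to recording the peer address.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical seam between discovery and RLPx; none of these names exist in Besu.
interface PeerDiscovery {
  /** Returns the currently known peer endpoints as "host:port" strings. */
  List<String> discoveredPeers();
}

/** Test double: returns a fixed peer list instead of running discv5. */
final class MockPeerDiscovery implements PeerDiscovery {
  private final List<String> peers;

  MockPeerDiscovery(final List<String> peers) {
    this.peers = List.copyOf(peers);
  }

  @Override
  public List<String> discoveredPeers() {
    return peers;
  }
}

/** Stand-in for an RLPx client; it only records which peers it would dial. */
final class RlpxClientSketch {
  private final PeerDiscovery discovery;
  private final List<String> dialed = new ArrayList<>();

  // Constructor injection: the client does not know which discovery implementation it was given.
  RlpxClientSketch(final PeerDiscovery discovery) {
    this.discovery = discovery;
  }

  /** Dials each discovered peer at most once and returns how many new peers were dialed. */
  int dialDiscoveredPeers() {
    int newlyDialed = 0;
    for (final String peer : discovery.discoveredPeers()) {
      if (!dialed.contains(peer)) {
        dialed.add(peer); // a real client would start the RLPx handshake here
        newlyDialed++;
      }
    }
    return newlyDialed;
  }

  /** Returns an immutable snapshot of the peers dialed so far, in dial order. */
  List<String> dialedPeers() {
    return List.copyOf(dialed);
  }
}

final class PeeringSketchDemo {
  public static void main(final String[] args) {
    final PeerDiscovery discovery =
        new MockPeerDiscovery(List.of("10.0.0.1:30303", "10.0.0.2:30303"));
    final RlpxClientSketch client = new RlpxClientSketch(discovery);
    System.out.println("newly dialed peers: " + client.dialDiscoveredPeers());
  }
}
```

Because the client depends only on the interface, a real discv5-backed implementation could later be bound in its place (for example through a Dagger module) without touching the RLPx code, and the mock keeps the RLPx layer testable in isolation, which speaks to the testability criterion.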