---
tags: mev-boost-community-call
---
# mev-boost community call #2 recap
Recording: [https://www.youtube.com/watch?v=KmmLQRYWGmc](https://www.youtube.com/watch?v=KmmLQRYWGmc)
# Summary
- Capella fork is coming up on mainnet. Builders and relays need to make sure they are running up-to-date software
- Ultrasound Optimistic Relay seeing good adoption as well as some interest from more relays. In-progress work happening on a V2 and, in the meantime, exploring changes that can further reduce latency
- Ultrasound Relay interested in more analysis on missed slots and overall network health
# Capella
### Goerli
Capella is coming up on mainnet (two weeks or so from the call). Goerli went well, albeit with some confusion around which software to run. Eventually everyone shifted over to the right code. Blocknative had an issue on their Dreamboat relay related to an infrastructure misconfiguration of Prysm vs. Lighthouse nodes. Once resolved, Dreamboat handled the change on Goerli. Generally went well for everyone.
### Mainnet
Builders have to be prepared since things change after the fork. For relays it’s important to have the right code at the right time. The proposer / MEV-Boost side needs the right release for smooth operation.
(Chris from Flashbots gives an update on the software in preparation for Capella on mainnet). Pretty much everything is merged across the builder, relay, and MEV-Boost fronts. MEV-Boost is ready with version 1.5.0 and has been tested on both Goerli and Sepolia. So far no issues have been seen. MEV-Boost relay version v1.0.alpha2 is available and has been tested through all testnets. This does not need the custom Prysm fork. Relays can run either Prysm or Lighthouse at the fork, and later any other CL client once the others support the specific subscription.
On the relay block validation front, deprecated the block validation repo and now using the builder project. Consolidated on the builder project to avoid lots of rebases on Geth and optimize maintenance effort. Block builders can use the `main` branch of the repo, though it still requires the custom fork. The switch is being wrapped up, after which withdrawals can be received through the subscription instead of polling Prysm for them. Either Prysm or Lighthouse is recommended for going through the fork since support in other clients is still WIP.
Communicating with Erigon and Reth on block validation / block building. Goal is to 1) get more client diversity and 2) get significant performance improvements.
On relay diversity front, Dreamboat is ready for Capella on mainnet; currently needs the Flashbots Prysm fork.
# Optimistic Relay
### Update on V1
(Updates from Mike from the EF) First, a quick summary of the V1 release (which happened about two weeks before the call). Seeing good adoption of Optimistic Relaying (OR). 12 builders have posted collateral of 1 ETH each with Ultrasound. Have an event log of all bid submissions by all relays. Numbers look pretty good; seeing more payloads delivered through the Ultrasound relay. Have seen tens of millions of optimistic bids.
Additionally, have been working with Flashbots to get a PR in shape to upstream the OR code into the main repo to have in canonical codebase as an option. Would be opt-in for anyone to run if interested. Flashbots not planning on running optimistic relaying. So far a few other relays expressed interest in running OR including Aestus, Agnostic, Blocknative, and potentially Frontier Research. This will likely come after Capella.
There were a few issues on the relay end with OR. One was block validation failing with “unknown ancestor” errors, caused by nodes getting out of sync. If Geth falls out of sync, the submitted block references a parent hash the node doesn’t recognize, so validation fails. This is so far probably the biggest hurdle with running optimistic relaying on the relay end.
### Incident with invalid block
Haven’t had any missed slots over 2 weeks. Had one incident where an invalid block was submitted. The issue was a nonce being reused by two transactions in the block, which wasn’t caught by the builder. Once Ultrasound shared the relevant info with the builder, the bug was fixed and hasn’t happened since. Got lucky that the invalid bid didn’t win the auction, so the slot wasn’t missed.
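For illustration, a builder-side sanity check along these lines could catch a reused nonce before submission. This is only a minimal sketch with a simplified, hypothetical transaction type, not the actual builder code:

```go
package main

import "fmt"

// tx is a simplified, hypothetical stand-in for a transaction; a real
// builder would use its own transaction type and recover senders properly.
type tx struct {
	sender string
	nonce  uint64
}

// duplicateNonces returns any (sender, nonce) pairs that appear more than
// once in a candidate block, which would make the block invalid.
func duplicateNonces(txs []tx) []string {
	seen := make(map[string]bool)
	var dups []string
	for _, t := range txs {
		key := fmt.Sprintf("%s/%d", t.sender, t.nonce)
		if seen[key] {
			dups = append(dups, key)
		}
		seen[key] = true
	}
	return dups
}

func main() {
	block := []tx{
		{sender: "0xabc", nonce: 7},
		{sender: "0xdef", nonce: 1},
		{sender: "0xabc", nonce: 7}, // reused nonce: would invalidate the block
	}
	fmt.Println("duplicate (sender, nonce) pairs:", duplicateNonces(block))
}
```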
In general, seeing that builders are good at building valid blocks.
### Protection against invalid block
Implemented some more measures to be conservative in case of an invalid submitted block. Before, if a builder submitted an invalid block, the relay didn’t find out until the end of the slot, so the block could still be delivered. Now, as soon as a builder submits an invalid block, the builder gets immediately demoted. This lowers the probability of issues in the future even more. No plans to use the collateral, only keeping it as a last resort in case of some very rogue builders. The idea for now is that if any issue results in a proposer losing out on the reward, the builder will refund them directly, with the relay stepping in to guide the process.
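As a rough sketch of that demotion flow (the types and names below are assumptions, not the actual relay implementation):

```go
package relay

// builderStatus is a hypothetical sketch of the per-builder state a relay
// might keep; the real relay stores this in its database.
type builderStatus struct {
	optimistic bool   // whether submissions are accepted optimistically
	collateral uint64 // wei; kept only as a last resort
}

// onSimulationResult demotes a builder the moment one of its blocks fails
// simulation, so later submissions go back through full validation until
// the builder is reinstated.
func onSimulationResult(statuses map[string]*builderStatus, builderPubkey string, simErr error) {
	if simErr == nil {
		return
	}
	if st, ok := statuses[builderPubkey]; ok {
		st.optimistic = false // demoted on the first bad bid
	}
}
```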
### V2
Lastly, working towards a V2. The idea is to enable header-only bids. Currently, a big bottleneck is getting the payload bytes from builder to relay. The execution payload can be several KB to MB, and builders need to get their bids activated on the relay. The proposal is a new submission type that splits builder bids into packets, where the first packet alone is enough for the relay to validate that the bid is eligible to win the auction. Anticipate builder submission times to drop by a further 100ms.
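A rough sketch of what such a split submission might look like; the type and field names below are assumptions for illustration, not the actual V2 wire format:

```go
package relay

import "math/big"

// bidHeaderPacket is a hypothetical first packet: small enough to transfer
// quickly, but carrying everything the relay needs to rank the bid.
type bidHeaderPacket struct {
	slot          uint64
	parentHash    string
	blockHash     string
	builderPubkey string
	value         *big.Int // bid value in wei
}

// payloadPacket carries the bulky execution payload separately and is tied
// back to its header packet by block hash.
type payloadPacket struct {
	blockHash string
	payload   []byte // several KB to MB of encoded execution payload
}

// worthActivating reports whether the header packet alone shows the bid
// could win the auction, i.e. it beats the current best bid.
func worthActivating(h bidHeaderPacket, best *big.Int) bool {
	return h.value.Cmp(best) > 0
}
```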
Generally happy with direction of incentivizing builders to build valid blocks. Also good and healthy preparation for enshrined PBS where builders stand to lose money for invalid bids.
### Bid submission queues
On the builder submission / error front, considering a new queuing mechanism. In the Flashbots MEV-Boost relay right now there are two priority queues: a high priority and a low priority one. The idea is to improve efficiency by reducing the amount of resources spent on simulation. The proposal is to put everyone (about 20 builders connected to the relay, with some reputation) in the high priority queue and only simulate blocks that stand to win, i.e. only if the bid value increases vs. the best known bid. Everything else goes into the low priority queue. One bad thing that can happen is spam on the queue with invalid blocks; for this there can be automatic demotion of builders after even a single bad bid. Why do this? Most of the time there is no queue. Instead, there is a builder leading the pack with some alpha or proprietary order flow, constantly updating the top bid. The relay only needs to simulate this bid, and everything else can be skipped rather than competing in the simulation race.
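A minimal sketch of that scheduling policy, with names and structure assumed purely for illustration rather than taken from the relay's actual queue implementation:

```go
package main

import (
	"fmt"
	"math/big"
)

// submission is a hypothetical stand-in for a builder block submission.
type submission struct {
	builder string
	value   *big.Int
}

// scheduler sketches the proposed policy: a block is only put on the
// high-priority (simulate promptly) queue if it would actually take the
// top spot; everything else is deferred to the low-priority queue.
type scheduler struct {
	bestValue    *big.Int
	highPriority []submission
	lowPriority  []submission
}

func (s *scheduler) enqueue(sub submission) {
	if sub.value.Cmp(s.bestValue) > 0 {
		s.highPriority = append(s.highPriority, sub) // stands to win: simulate
		s.bestValue = sub.value
		return
	}
	s.lowPriority = append(s.lowPriority, sub) // cannot win: skip the simulation race
}

func main() {
	s := &scheduler{bestValue: big.NewInt(0)}
	s.enqueue(submission{builder: "builder-a", value: big.NewInt(100)})
	s.enqueue(submission{builder: "builder-b", value: big.NewInt(50)}) // below best, deferred
	fmt.Println("high:", len(s.highPriority), "low:", len(s.lowPriority))
}
```

The point is that a bid which cannot beat the current best never costs a simulation, which is where most of the savings come from, since most of the time a single leading builder is just updating the top bid.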
(Chris from Flashbots chimes in). On board with the idea of transparency + innovation, especially on the block validation front as it is a major operational pain. Interested in smarter scheduling that can make life easier for good submissions. Some thoughts regarding cancellations: the current relay allows cancellations for block builders, and without changing this mechanic lower value bids can’t be ignored; the queuing system is incompatible with block cancellations. If cancellations could be removed, it would be easy to deprioritize lower value bids. On the high / low priority queues, from prior experience Flashbots found that people will create identities and spam the high priority queue. Another idea is that maybe with Reth / Erigon the relay could deal with more load; currently estimating around 3x block validation performance. Also, maybe some form of pre-validation could be devised, though of course validity is ultimately only known post-validation.
### Handling missing slots
How does Ultrasound handle a missed slot? What about a case of a missed slot because of latency between relay and proposer? Justin points out that this happens about once a day when you’re a relay of sufficient size. You can see a bunch of orphan blocks on a block explorer. Ultrasound does an investigation when a proposer misses a slot. So far it has found that it’s usually the proposer’s faulty setup or a connectivity issue. For example, there was a case with some misconfiguration on one of the Lido validators. Another example was a RocketPool validator on a home internet connection with random disconnections that just got unlucky during its slot. Ultimately Ultrasound takes operational responsibility. The infra is running on Google Cloud with good connectivity, but of course in the event of some outage it can be affected, and Ultrasound is happy to take on a liability of up to 10 ETH. Infra-wise there is a Telegram bot to alert about a missed slot; again, this happens about once a day. If Ultrasound knows they messed something up, for example something in devops, even if it’s only for 1 second, they manually check that it didn’t lead to a missed slot. The relay takes responsibility and investigates either when the proposer asks or if there is an operational hiccup.
Have also looked historically at all missed slots to find patterns. Found that about half of missed slots have a block that maxes out the gas used. This suggests a very large block and/or a block that takes a very long time to execute. The idea is that perhaps attesters are not able to download the block fast enough to attest. From an analysis perspective it looks like the block never made it, but actually it was just a bit too bulky. Also considering trying to see per incident if it’s home validation vs. professional, the idea being that professionals would be expected to have good connectivity in the cloud.
In the future, want to have a project to come up with a heuristic for every missed slot, but so far looking at logs manually when there is a request from a proposer. Through operating the optimistic relay, Ultrasound was able to find a dozen or more builder bugs and report them to builders to fix simulation errors. Perhaps something similar can be done with missed slots? Can collect more data per incident, for example:
- What was the consensus client?
- What was the execution client?
- What was the internet connection?
Use this to track down the source of the missed slot, with the extra data analysis helping paint a picture of what’s going on.
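For illustration, the data points above could be collected into a per-incident record along these lines (hypothetical field names):

```go
package triage

// missedSlotIncident sketches the extra data that could be recorded per
// missed slot to help track down its source.
type missedSlotIncident struct {
	slot            uint64
	consensusClient string // e.g. Lighthouse or Prysm
	executionClient string // e.g. Geth
	connection      string // e.g. home connection vs. cloud / professional setup
	gasMaxedOut     bool   // gas-maxed blocks show up in roughly half of missed slots
}
```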
Onto the topic of finding and triaging builder errors. Before, if a builder submitted an invalid block (for example what the Ultrasound OR saw with the nonce issue the builder didn’t catch), the relay would only respond with an error. The error code would not contain much specific information other than some sort of “simulation failure” notification, so unless the builders or relays went into the logs, there was not much info to extract. With OR, if there is a submission error, the builder public key gets instantly demoted and is no longer in optimistic mode. Ultrasound then goes and tells them to fix the error to get reactivated / reinstated. While before the errors got washed out with the rest of the requests, now there is more visibility. When there are lots of submissions, some should be expected to fail for multiple different reasons.
Maybe attribution of networking issues can be automated to a certain extent. For example, look at the IP of the proposer who requested the payload and correlate it with when they made the call to `getHeader()`. It may be worthwhile to record the timestamps of `getHeader()` and `getPayload()` in the DB, using the IP address as a heuristic. Then, if these are within bounds, maybe the issue has to do with the relay itself. Though from experience so far, the `getHeader()` and `getPayload()` calls are just made too late, explaining the missed slots.
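A small sketch of that heuristic, assuming hypothetical names and an arbitrary 3-second budget purely for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// proposerCalls records, per proposer IP, when getHeader and getPayload
// were received for a given slot (hypothetical structure; a real relay
// would persist this in its database).
type proposerCalls struct {
	slotStart  time.Time
	getHeader  time.Time
	getPayload time.Time
}

// lateCall applies the heuristic from the call notes: if either request
// arrived well past the point where the block could still propagate in
// time, the missed slot is most likely on the proposer side, not the relay.
func lateCall(c proposerCalls, budget time.Duration) bool {
	return c.getHeader.Sub(c.slotStart) > budget ||
		c.getPayload.Sub(c.slotStart) > budget
}

func main() {
	slotStart := time.Now()
	c := proposerCalls{
		slotStart:  slotStart,
		getHeader:  slotStart.Add(1 * time.Second),
		getPayload: slotStart.Add(3500 * time.Millisecond), // past a 3s budget: likely a late proposer
	}
	fmt.Println("late proposer call:", lateCall(c, 3*time.Second))
}
```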
Also thinking of collecting data on length-1 reorgs; perhaps there is some relationship to large blocks there. Overall network health would be better with fewer length-1 reorgs.
Long-term, Ultrasound hopes to have good visibility into the health of Ethereum’s participation rate. Apart from offline validators missing attestations, all other cases should see 100% participation. The Ultrasound team is easy to reach and encourages collaboration if anyone wants to reach out.