Rust Pre-mortem

Why?

Base is running op-reth in production and will transition their sequencer to op-reth. If op-reth has a chain split with op-geth due to some unknown or introduced bug, the consequences are dire: the chain would need to be rolled back.

This is a pre-mortem document to assess why this happened and how OP Labs can evolve as an organization to prevent it going forward.

Background

op-reth was initially built between Eth Denver 2023 (February) and Paradigm Frontiers 2023 (August) by @refcell and @clabby, two OP Labs engineers. Afterwards, OP Labs did not provide support, so the maintenance burden fell on reth core contributors and Base engineers (Brian). As Base has scaled, they've adopted op-reth as a solution to op-geth's poor scaling.

OP Labs never committed to running op-reth internally. We are only now beginning to run op-reth in our infrastructure, as a reactive measure to Base productionizing the client.

Current Situation

Once op-reth is in production, how to handle a chain split between op-reth and op-geth is currently undefined, and OP Labs has no guidance on it.
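To make "chain split" concrete: the two clients report different block hashes at the same height. Below is a minimal sketch of that check, assuming some monitor has already collected block hashes from each client. The `BlockRef` type and `first_divergence` function are hypothetical names for illustration, not part of op-reth or op-geth.

```rust
use std::collections::HashMap;

/// Block number and hash as reported by one execution client.
/// Hypothetical type for illustration; not an op-reth or op-geth API.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BlockRef {
    number: u64,
    hash: String,
}

/// Returns the lowest block number at which the two clients report
/// different hashes, or `None` if they agree on every height both have.
/// Heights present in only one of the two slices are ignored.
fn first_divergence(a: &[BlockRef], b: &[BlockRef]) -> Option<u64> {
    // Index the second client's blocks by height for direct lookup.
    let by_number: HashMap<u64, &str> = b
        .iter()
        .map(|r| (r.number, r.hash.as_str()))
        .collect();

    a.iter()
        .filter(|r| match by_number.get(&r.number) {
            Some(other_hash) => *other_hash != r.hash.as_str(),
            None => false,
        })
        .map(|r| r.number)
        .min()
}
```

A monitor could call `first_divergence` on recent blocks from both clients and alert on any `Some(height)`. What happens after the alert is exactly the guidance that is missing today.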

Bringing other chains like Base to Stage 1 will certainly help. If there's a chain split caused by op-reth breaking, the valid op-program state transition function (STF) will cause the fault proof to resolve the associated output root as invalid. With Stage 1, there is a window to fall back to permissioned outputs.

The long-term multi-client solution may be stateless execution, or N-of-M clients (where M > 2: op-reth, op-geth, Nethermind?), or a combination of both.
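As a rough illustration of the N-of-M idea, the sketch below accepts an output root only if at least N of the M reporting clients agree on it, and names the clients that disagreed. The `Quorum` type, the `n_of_m` function, and the use of plain strings for roots are all assumptions made for this example. It does not describe an existing OP Stack mechanism.

```rust
use std::collections::HashMap;

/// Result of comparing the roots reported by M clients for one block.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq, Eq)]
enum Quorum {
    /// At least N clients agreed on `root`; `dissenters` reported something else.
    Agreed { root: String, dissenters: Vec<String> },
    /// No single root reached the threshold (or several did).
    NoQuorum,
}

/// `reports` holds (client name, reported root) pairs; `n` is the threshold.
fn n_of_m(reports: &[(&str, &str)], n: usize) -> Quorum {
    // Count how many clients reported each root.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &(_, root) in reports {
        *counts.entry(root).or_insert(0) += 1;
    }

    // Roots that reached the threshold. With n > M/2 there is at most one.
    let winners: Vec<&str> = counts
        .iter()
        .filter(|&(_, &count)| count >= n)
        .map(|(&root, _)| root)
        .collect();

    if n == 0 || winners.len() != 1 {
        return Quorum::NoQuorum;
    }
    let root = winners[0];

    // Every client that reported a different root is a dissenter.
    let mut dissenters: Vec<String> = reports
        .iter()
        .filter(|&&(_, r)| r != root)
        .map(|&(client, _)| client.to_string())
        .collect();
    dissenters.sort();

    Quorum::Agreed { root: root.to_string(), dissenters }
}
```

With three clients and `n = 2`, a single faulty client shows up as a dissenter rather than splitting the chain. With only two clients, any disagreement yields `NoQuorum`, which is why M > 2 matters.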

We need to figure this out, and this document is meant as a call to action.

Future Mitigations

In the future, to gain visibility into these issues sooner, we need to stop being reactive and start being proactive about new Rust components and, more generally, alternative components/clients.

A significant part of why this can't happen today is that OP Labs has only 2 engineers proficient in Rust, both of whom are also heavily focused on Proofs. Building a Rust squad that can lead Rust implementation and collaboration efforts with external contributors will allow OP Labs to be proactive.

P.S.

It's not a matter of if a piece of software breaks, but when.