**Date:** 2026 / 04 / 30

**Hosts:** STEEL

**Time:** 13:30-15:00

**Context:** Steel team-led feedback session, mainly focused on consensus testing improvements and pain points encountered during Glamsterdam devnet work. The format was freestyle, soliciting feedback on how the ecosystem can improve testing speed and safety.

---

## 1. Steel Team Status Update

- Steel continues to work primarily on specs and tests, with benchmarking taking significant time
- [Post-weld](https://steel.ethereum.foundation/blog/2025-11-04_weld_final/) of execution-specs and execution-spec-tests (October 2025), the team is still finding processes for the new merged repo
- **Glamsterdam is the first fork** being run under the new repo structure
- Current branching strategy: each EIP is implemented in its own branch and only merged to `forks/amsterdam` when SFI'd
  - Conceptually clean (each EIP is diffable), but painful when composing devnet releases due to interaction issues
  - Helper scripts/functions are being developed to manage cross-branch interactions
  - Strategy may change soon
- Documentation: the `execution-specs` HTML docs with Specs and Test Case Reference are now published (per dev branch) on the STEEL Team website:
  - https://steel.ethereum.foundation/docs/

---

## 2. Targeted/Incremental Releases - Decision: Not Worth It

**Question raised:** Would composable per-EIP releases (e.g., 8037 on top of glamsterdam-devnet-3, 7708 separately) be helpful before moving to glamsterdam-devnet-4?

**Conclusion:** Mostly "nice-to-have, not essential." Steel will not prioritize this.

**Key tension: gas repricing**

- Erigon: Prefers sequential layering, starting with gas EIP changes first; "massive pain" to bunch all feature branches
- Nethermind (Speaker 12): Cautioned that interaction bugs will appear eventually regardless; suggested better-defined interaction tests instead
- Mark Holt (Erigon): Gas repricing creates iterative "fix one, break twenty" loops; suggests insufficient unit-level testing of EVM/gas interactions
- Maria: Some features (e.g., block-level access lists) need to be implemented first because they impact performance, with gas prices integrated at the end
- Nethermind (Speaker 10): Benchmarking and functional tests should be separated; benchmarks must run on the final spec

---

## 3. Testing Paradigm Shift - Move Away from Hive as Primary Verification

**Current pain:** Hive is Dockerized, slow, and painful to debug. Iterating on a Hive failure requires PR'ing to execution-specs to fix exception mappings → ~24h loop.

**Current client practice:** Most teams already run blockchain tests and state tests locally rather than relying on Hive for debugging.

### Proposed Solution: Engine Test Module Interface

Spencer prototyped a direct binary `enginetest` interface across clients (similar to block test).
Bypasses JSON-RPC and Docker startup; calls engine API methods directly:

- https://github.com/besu-eth/besu/pull/10184
- https://github.com/bluealloy/revm/pull/3544
- https://github.com/paradigmxyz/reth/pull/23361
- https://github.com/status-im/nimbus-eth1/pull/4101
- https://github.com/lambdaclass/ethrex/pull/6445
- https://github.com/erigontech/erigon/pull/20315
- https://github.com/NethermindEth/nethermind/pull/11035

**Performance gains:**

- :rocket: Nethermind/Besu: ~20 hours → minutes (large gain due to JVM/CLR startup cost)
- :rocket: Nethermind: 256 runners / 6 hours → 1 runner / 2 minutes

**Concerns raised:**

- Speaker 9 (reth): A separate entry point risks divergence from the main node config (CLI flags, defaults, config files): "this is not an end-to-end test anymore"
- Mario Vega: Subtle bugs may exist between the engine-test path and the full consume-engine path that only Hive can catch

**Agreed flow:** state test → blockchain test → engine test (module) → consume-engine Hive (final source of truth)

---

## 4. Exception Mapping - Major Discussion

**Current pain:** Exception mapping lives in execution-spec-tests; clients must PR upstream to fix mismatches.

### Short-term decision: Move exception mapper to client repos

- Same filename/location convention across clients
- Steel + prototyping team to provide a GitHub Action wrapper around client direct binary interfaces
  - Handles exception mapping checks
  - Supports filtering, skip lists, xfail lists with reasons
  - Configurable per format (state/blockchain/engine)

### Long-term debate: Standardized error codes vs. mappers

**For error codes (Nethermind, Besu, Erigon (Mark Holt)):**

- Easier debugging on devnets when comparing client behavior
- One-time per error vs.
ongoing string maintenance
- Better fuzzing context
- Security discipline (don't leak internal error strings)
- EEST already has codes, which can become the standard

**Against / cautious (Felix, Geth):**

- Fundamentally impossible to enforce identical error orderings across clients without making everyone an execution-specs-shaped client
- Risks creating redundant validation checks just to pass tests, when gas accounting already prevents the conditions
- Suggested alternative: have clients output execution outcomes (logs, post-state, gas used) as JSON for the test harness to inspect (not for production, just debugging context)

**Sub-discussion: false positives via state root mismatch**

- Mario Vega: Documented a real consensus risk: buggy clients reject for the wrong reason (state root mismatch instead of the intended check), allowing crafted blocks to split clients
- Speaker 9: Saw this during devnet development, with tests passing for the wrong reason (BAL root mismatch hiding the real issue)
- Suggestion: ban state root and receipts root as accepted invalidation reasons in consume-engine

### Action items:

- Steel + prototyping to build a proof-of-concept for error codes with a subset of clients
- Continue the mapper approach in parallel
- Revisit after the PoC

---

## 5. `consume/enginex` Simulator (Hive Speedup)

The [`enginex` simulator](https://steel.ethereum.foundation/docs/execution-specs/forks/amsterdam/running_tests/running/#enginex) uses the benchmarking methodology: it groups tests with the same fork/block header and orthogonal addresses into one genesis, then runs the payloads from multiple tests against a single client, which means fewer startups.

- **Best benefit for slow-startup clients** (Besu, Nethermind)
- Reth: 14 min → 1.5 min on ~3,000 Shanghai tests (roughly 10x even on already-fast clients)
- Besu: ~30h → 2.5h (still too slow for per-PR, but viable as nightly)

**Concern (Speaker 7):** Side chain accumulation when sending FCUs to switch heads; clients keeping N side chains may OOM.
Benchmarking handles this via container restarts from filesystem snapshots.

**Action:** Steel can integrate `enginex` into the Hive dashboard for clients that want it; needs investigation for Nethermind.

---

## 6. Public Dashboard Considerations

- Hive dashboard remains valuable as a public convergence point
- Direct binary engine tests should also have a public dashboard, not just CI
- Nethermind goal: make the engine test a mandatory PR check (current checks take 15-20 min; this would take ~2 min)
- Hive can move to a nightly/once-a-day cadence for end-to-end validation

---

## 7. Erigon-Specific Note

Engine test is slower for Erigon due to:

- TOML snapshot loading (~40MB on every startup, embedded snapshots)
- ~300-block memory load on startup
- Architecturally different from other clients

The module-level test (single instance, multiple tests) gave Erigon a meaningful speedup where it didn't help others as much.

---

## 8. Other Topics Raised

### discv5 / discv4 (Speaker 6)

- Some clients are switching off discv4 and moving fully to discv5
- Hive discv4 test failures are becoming noise for clients that have moved
- Three more discv5 expansion tests proposed
- **Action:** Clients need to apply; not essential if discv5 is solid

### Metrics standardization (Speaker 10)

- Carlos Perez initiated work to unify a small subset of client metrics (cache hits, state access timing, etc.)
- Schema-level agreement (field names, presence) is more important than value matching
- Metrics are currently divergent across clients
- **Action:** Carlos to share link; new test category to be considered after error codes

### Legacy tests (Speaker 11 / Speaker 16 / Pavel)

- `ethereum/tests` legacy repo: state tests live in EEST and have been ported to Python
- Blockchain tests still need porting (call for help)
- Some test cases (e.g., revert_to_touch) only exist in the legacy tests; coverage analysis suggests these gaps are mostly false positives due to fork range differences (e.g., Constantinople fix)
- Steel will publish static tests in releases more frequently
- Goal: archive the legacy repo soon

---

## Summary of Action Items

| Owner | Action |
|---|---|
| Steel + prototyping | Build error code PoC with subset of clients |
| Steel | Provide GitHub Action wrapper for direct binary test runners (mapper, filters, xfail) |
| All clients | Move exception mapper file to client repos (standardized location/name) |
| Steel | Integrate `enginex` simulator into Hive dashboard for interested clients |
| Steel | Investigate `enginex` compatibility with Nethermind |
| All clients | Implement engine test module interface (most already done) |
| Steel | Public dashboard for direct binary engine tests |
| Pavel / community | Port legacy blockchain tests to Python |
| Steel | Publish static tests in releases more frequently |
| Kamil, Nethermind | Share metrics standardization PR/link from Carlos Perez |
| Clients on discv4 only | Migrate to discv5 |

## Overall Direction

Faster testing → faster releases → faster forks. The goal is per-PR (sub-minute) feedback on direct binary engine tests, with Hive demoted to nightly end-to-end source-of-truth verification. Exception handling and metrics schemas are to be progressively standardized.
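
As a rough illustration of the exception-handling direction from section 4, the sketch below shows how a client-side mapper check might combine an xfail list with reasons and the suggested ban on state root / receipts root as accepted invalidation reasons. Everything in it is hypothetical: the exception labels, the client error strings, the `strict` flag, and the names `EXCEPTION_MAP`, `BANNED_REASONS`, `XFAIL` and `check_rejection` are invented for illustration and are not taken from execution-specs or from any client.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapper: expected-exception label -> substrings of client errors.
EXCEPTION_MAP = {
    "INSUFFICIENT_ACCOUNT_FUNDS": ["insufficient funds"],
    "GAS_LIMIT_EXCEEDED": ["exceeds block gas limit"],
    "INVALID_STATE_ROOT": ["state root mismatch"],
    "INVALID_RECEIPTS_ROOT": ["receipts root mismatch"],
}

# Reasons that only show "something differed", not which check fired.
BANNED_REASONS = {"INVALID_STATE_ROOT", "INVALID_RECEIPTS_ROOT"}

# Hypothetical xfail list: test id -> reason.
XFAIL = {"tests/example.py::test_known_gap": "tracked in client issue tracker"}


@dataclass
class Result:
    status: str  # "pass", "fail" or "xfail"
    detail: str


def map_error(message: str) -> Optional[str]:
    """Return the label whose substring appears in the client error, if any."""
    lowered = message.lower()
    for label, needles in EXCEPTION_MAP.items():
        if any(needle in lowered for needle in needles):
            return label
    return None


def check_rejection(test_id: str, expected: str, client_error: str,
                    strict: bool = True) -> Result:
    """Check that a block was rejected for an acceptable reason."""
    if test_id in XFAIL:
        return Result("xfail", XFAIL[test_id])
    actual = map_error(client_error)
    if actual is None:
        return Result("fail", f"unmapped client error: {client_error!r}")
    if actual == expected:
        return Result("pass", actual)
    # A root mismatch must never stand in for a different expected check.
    if actual in BANNED_REASONS:
        return Result("fail", f"rejected via {actual}, expected {expected}")
    if strict:
        return Result("fail", f"got {actual}, expected {expected}")
    return Result("pass", f"tolerated {actual} instead of {expected}")
```

With `strict=False` the check tolerates a different (mapped) rejection reason, but a state root or receipts root mismatch still fails unless it is the reason the test expects, which is the false-positive case raised in the sub-discussion.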