# Overview

This proposal focuses on testing the execution witness (stateless input) checker program, which is to be run as a zkVM guest. Specifically, it aims to:

- document assumptions
- define coverage criteria (work in progress)
- propose a way to achieve the required level of coverage.

**Note**: The Stateless Input tests can be used to test downstream components (within end-to-end and integration testing scenarios). However, it is assumed that comprehensive testing of those components is addressed separately.

# Related documents

- https://hackmd.io/@jsign/zkvm-eth-minimal-execution-witness
- https://hackmd.io/@jsign/zkvm-eth-guest-program-api-standarization
- https://github.com/eth-act/planning/blob/main/projects.md#high-level-end-to-end-flow
- https://zkevm.ethereum.foundation/blog/zkevm-security-overview

# Assumptions

Technically, the Execution Witness and the Stateless Input are not the same thing, but for the purposes of this document the terms are used interchangeably.

The Execution Witness checker is assumed to be the only zkVM guest program. It performs two duties:

- it checks that the Stateless Input is well-formed
- it checks that the Stateless Input defines a valid EELS state transition.

# Test formats

## Standard EELS test format

On valid inputs, the behavior of the Execution Witness checker should exactly match the EELS specification. Therefore, it is important that the standard EELS tests are consumable.

## Stateless Input test format

For testing invalid inputs, a dedicated Stateless Input (SI) test format might be needed; it can be used for testing valid scenarios too. Alternatively, invalid SI test vectors can be produced by corrupting valid SI inputs.

# Testing scenarios

## End-to-end testing

- Tests in EEST formats are executed by an EL client in stateless mode.
- The resulting Stateless Input is validated by the checker program, run as a zkVM guest.
- The resulting proof is verified by a proof checker.
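The corruption route mentioned above (producing invalid SI vectors by mutating valid ones) can be sketched as a byte-level mutator. This is a minimal illustration, assuming a serialized SI blob; the function name and placeholder input are hypothetical, not a prescribed format:

```python
import random

def corrupt(si_bytes: bytes, n_flips: int = 1, seed: int = 0) -> bytes:
    """Produce a (likely invalid) Stateless Input by flipping random bits
    of a valid, serialized one. Hypothetical mutator for illustration."""
    rng = random.Random(seed)  # seeded for reproducible test vectors
    out = bytearray(si_bytes)
    for _ in range(n_flips):
        i = rng.randrange(len(out))
        out[i] ^= 1 << rng.randrange(8)  # flip one random bit of byte i
    return bytes(out)

valid = b"\x00" * 32  # stand-in for a serialized valid Stateless Input
mutant = corrupt(valid, n_flips=1, seed=42)
assert mutant != valid and len(mutant) == len(valid)
```

Seeding the mutator keeps corrupted vectors reproducible, which matters if they are promoted into a regression suite.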
## Integration testing / partial tests

The Execution Witness checker is executed as a zkVM guest; however, the zkVM runs its tracer only, which should be much faster. This scenario should consume standard EEST tests. Since it does not involve an EL client explicitly, it is also possible to run invalid Stateless Input tests.

## Component tests

The Execution Witness checker program is executed in isolation from other components.

**Note**: an EL client may be implicitly involved in producing SI inputs (e.g. during filling).

There are two main ways to execute the Execution Witness checker program:

- emulated execution of the checker, compiled to a standard target
- native execution of the checker.

Both standard EEST tests and (invalid) Stateless Input tests should be consumable.

## Performance considerations

If a testing scenario involves proof generation (e.g. end-to-end testing), it will be slow, and achieving comprehensive coverage this way will be very costly. Therefore, there should be two groups of coverage criteria (tentatively called "profiles"):

- a smaller set, oriented towards system-level and heavyweight integration testing
  - it might be beneficial to have fewer, longer test vectors here
- a larger set, oriented towards comprehensive testing of the witness checker or lightweight integration testing scenarios (i.e. avoiding proof generation)
  - it might be okay to have many shorter test vectors here.

# Test generation

There can be multiple sources of test vectors:

- existing EELS test suites
- test vectors obtained from fuzzing campaigns
- recorded EVM transactions
- dedicated test generators (e.g. for invalid Stateless Input testing).

We assume a combination of these approaches will be used, i.e. existing tests, recorded transactions, and generated tests can serve as seed corpora for mutation-based fuzzers, possibly coverage-guided.

## Model-based fuzzing

Fuzz testing can be efficient; however, it does not guarantee coverage.
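A mutation-based fuzzing loop over a seed corpus can be sketched as follows. The target here is a toy stand-in for the checker, and all names are illustrative, not an actual API; a real campaign would run the checker itself and use an execution-coverage signal rather than outcome labels:

```python
import random

def fuzz(target, seeds, iterations=1000, seed=0):
    """Minimal mutation-based fuzzing loop: pick a seed input, apply a
    single-bit mutation, run the target, and record observed outcomes."""
    rng = random.Random(seed)
    observed = set()
    for _ in range(iterations):
        data = bytearray(rng.choice(seeds))
        i = rng.randrange(len(data))
        data[i] ^= 1 << rng.randrange(8)   # single-bit mutation
        observed.add(target(bytes(data)))  # outcome as a coarse coverage signal
    return observed

# Toy stand-in for the Execution Witness checker (illustrative only):
# a 2-byte magic prefix, then a validity byte, then payload.
def toy_checker(data: bytes) -> str:
    if len(data) < 4 or data[:2] != b"SI":
        return "malformed"
    return "valid" if data[2] == 0 else "invalid-state"

outcomes = fuzz(toy_checker, seeds=[b"SI\x00\x00"], iterations=500)
```

Even this trivial loop exercises all three outcome classes of the toy target from a single valid seed, which is the point of seeding mutation-based fuzzers with valid corpora.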
We propose to complement fuzzing with enumerative (deterministic) [Model-based Testing](https://en.wikipedia.org/wiki/Model-based_testing) as a way to achieve guaranteed coverage. Model-based testing can be combined with fuzzing via two routes (which themselves can be combined):

- model solutions can be instantiated as test vectors in a randomized fashion
- generated test vectors can be used as seed corpora for mutation-based fuzzing.

### Model-based testing

It can be convenient to define a coverage criterion with a finite model, possibly parameterized. In particular, such models can be composed of smaller components. For example, many EVM instructions share the same stack access pattern, so the stack access test cases can be reused across such instructions.

Finite models can be used to generate tests, i.e. each solution of a model can be instantiated as a test vector. If this is possible for every solution, we obtain a test suite with 100% coverage. Models can also be used to synthesize test vectors, which in turn can be used to implement mutation operators (as part of fuzz testing).

# Coverage criteria

In the case of zkEVM testing, standard code coverage metrics are likely not enough.

First, there are multiple code bases to cover (the EELS spec, Geth as the de facto standard, and the Execution Witness checker program). Ideally, we need simultaneous coverage of all of them, but that is difficult to achieve.

Second, code coverage can be too coarse. For example, consider the ADD instruction. Based on code coverage alone, one can identify several test cases:

- stack underflow: `stack_size == 0 | stack_size == 1`
- out of gas: `stack_size >= 2 & gas < GAS_VERY_LOW`
- normal behavior: `stack_size >= 2 & gas >= GAS_VERY_LOW`.
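These three cases partition the `(stack_size, gas)` input space; a sketch of the partition as a classifier (`GAS_VERY_LOW = 3` is ADD's cost in the EVM gas schedule; the function name is illustrative):

```python
GAS_VERY_LOW = 3  # gas cost of ADD in the EVM gas schedule (G_verylow)

def classify_add(stack_size: int, gas: int) -> str:
    """Map an ADD input to one of the three code-coverage cases above.
    Stack underflow takes priority, mirroring the order of checks."""
    if stack_size < 2:
        return "stack-underflow"
    if gas < GAS_VERY_LOW:
        return "out-of-gas"
    return "normal"

assert classify_add(1, 100) == "stack-underflow"
assert classify_add(2, 2) == "out-of-gas"
assert classify_add(2, 3) == "normal"
```

Note that such a classifier silently folds the combined underflow-plus-out-of-gas inputs into a single case, which is exactly the coarseness being discussed.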
However, there can be other important cases:

- stack underflow **and** out of gas: `stack_size < 2 & gas < GAS_VERY_LOW`
- integer addition overflow/no-overflow:
  - `stack[0] + stack[1] < U256_MAX`
  - `stack[0] + stack[1] >= U256_MAX`

Additionally, there may be important values stemming from [Boundary-value analysis](https://en.wikipedia.org/wiki/Boundary-value_analysis) (e.g. `stack_size == 1024` or `stack[0] == U256_MAX-1`).

Third, higher-level patterns like inter-contract or intra-contract interactions (call, create, etc.) are arguably harder to characterize in terms of code coverage.

## What is a coverage criterion?

In this document, a **coverage criterion** is defined as a set of **coverage items** (behaviors), which is:

- *enumerable*, i.e. there is an efficient procedure to list all items
- *measurable*, i.e. for each item, one can determine whether it has been covered during testing.

If a coverage criterion is enumerable and measurable, then the corresponding metric is straightforward to define and evaluate. In practice, both EELS and Geth support instruction tracing, so implementing some of the relevant criteria should be straightforward.

# Test levels

This part is very preliminary and outlines high-level aspects to be covered with tests.

## Witness checker program

- Invalid Stateless Input tests
- Valid tests (EELS State Transition)
  - Block level
  - Transaction level
  - Interpreter level

### Invalid Stateless Input tests

- Malformed serialization
- Invalid block
- Missing or incorrect input state entries
- Missing, incorrect or extra output state entries
- Inconsistent Stateless Input

## Valid (State Transition) tests

### Block level

**Note**: Do we need multi-block test scenarios? Probably not, as they should be covered by EELS tests.

- Header tests
- Transaction tests
  - Empty list, single transaction, multiple transactions
- Ommer tests??
- Withdrawal tests
- Request tests
- Combinations of the above

### Transaction level

Cover transaction behavior in more detail.

- Individual transaction type tests
  - LegacyTransaction
  - AccessListTransaction
  - FeeMarketTransaction
  - BlobTransaction
  - SetCodeTransaction
- Intra-transaction interaction tests
  - Storage
  - Transient storage

### Interpreter level

#### Control flow tests

- Loop-free Control Flow Graphs (trees, DAGs)
  - Enumerate trees with a bounded model
  - Enumerate DAGs with a bounded model
- CFGs with loops
  - Enumerate trees and/or DAGs and insert back edges
  - Enumerate Loop Nesting Forests with a bounded model
- `STOP`, `RETURN`, `REVERT`, `SELFDESTRUCT`
- (Limited) combinations of the above

#### Inter-contract tests

Very similar to the Control Flow Graph tests, but over Contract Call/Create Graphs instead. In particular, both can have the shape of a tree, a DAG, or a graph with cycles.

- Enumerate trees with a bounded model
- Enumerate DAGs with a bounded model
- Construct loopy graphs (e.g. trees/DAGs plus back edges)
- `CREATE`, `CREATE2`, `CALL`, `DELEGATECALL`, `STATICCALL`, `CALLCODE`, `REVERT`, `RETURN`
- (Limited) combinations of the above

#### Instruction interaction tests

Instructions can interact via their effects, e.g. on the environment, storage, memory, or stack. We are unlikely to cover all interesting combinations; however, certain pairs or triples of important interactions should be covered. For example, an arithmetic instruction obtains its parameters via the stack, which can be populated in different ways:

- other arithmetic instruction results
- call results
- environment queries
- memory reads
- storage reads
- etc.

#### Individual instruction tests

For each instruction, a number of important behaviors should be identified.
The behaviors are to be identified based on:

- [code coverage](https://en.wikipedia.org/wiki/Code_coverage)
- [Equivalence Class Partitioning](https://en.wikipedia.org/wiki/Equivalence_partitioning)
- [Boundary-value Analysis](https://en.wikipedia.org/wiki/Boundary-value_analysis).

Since many instructions share stack access patterns, it makes sense to reuse them. For example, consider the *pop, pop, push* pattern; the important scenarios are:

- stack underflow: `stack_size == 0 | stack_size == 1`
- out of gas
- out of gas & stack underflow
- normal behavior.

Many arithmetic, bit-wise or logical instructions use this pattern, so its coverage items can be combined with instruction-specific ones. For example, for `ADD`:

- arguments: `0`, `1`, `U256_MAX-1`
- no integer overflow: `a + b < U256_MAX`
- integer overflow: `a + b >= U256_MAX`
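Combining the boundary arguments with the overflow partition gives a small finite model whose solutions can be enumerated exhaustively and instantiated as test vectors. A sketch, assuming `U256_MAX = 2**256 - 1` so that the true sum wraps when it exceeds `U256_MAX`; the vector shape is illustrative, not a proposed format:

```python
from itertools import product

U256_MAX = 2**256 - 1
BOUNDARY_ARGS = [0, 1, U256_MAX - 1]  # boundary values from the model above

def add_vectors():
    """Instantiate one test vector per solution of the ADD model:
    every pair of boundary arguments, tagged with its overflow class."""
    for a, b in product(BOUNDARY_ARGS, repeat=2):
        yield {
            "args": (a, b),
            "expected": (a + b) & U256_MAX,   # ADD wraps modulo 2**256
            "overflow": a + b > U256_MAX,     # true sum exceeds the max word
        }

vectors = list(add_vectors())
assert len(vectors) == 9  # 3 boundary values, squared
```

Because the model is finite, the resulting suite covers both overflow classes by construction, and each vector can also serve as a seed for mutation-based fuzzing.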