# zkVM - Proof separation: Execution vs. Pre/Post state

This document presents experimental results from separating a monolithic Ethereum STF proof into two proofs: i) an execution-only workload, and ii) a pre/post state check workload (i.e., keccak heavy). The results help quantify the overheads that separating a monolithic proof into two proofs introduces through cross-proof checks. I also explore the same guest-program separation *without* those checks; although that variant would be unsafe for mainnet use, removing the overhead makes it possible to estimate the ratio of EVM-execution proving cost to state-tree/unchunkified-code proving cost, which is useful for evaluating potential Ethereum protocol changes.

## Intro

The usual style of block proving involves proving the whole STF, which includes three main parts:

1. Pre-state validation against the parent block state root, plus input bytecode hashing.
2. The STF starts from the state proven in 1. and applies the received new block, generating a state diff, plus doing many consensus validation checks (e.g., block gas used, receipts root, etc.).
3. Applying the state diff from 2. to the pre-state trie and calculating the new state root to check the claimed `state_root` field in the received block header.

We were interested in separating this monolithic proof into two parts:

- *Execution-only proof*: only proves 2.
- *Pre/Post state check*: only proves 1. and 3.

This can be interesting for many reasons, for example:

- It helps us better understand how much of the regular STF proof time comes from EVM execution vs. keccak-heavy workloads (i.e., state tree and bytecode operations).
    - Relying only on cycle counts can be a red herring, since keccak cycles in zkVM precompiles don't have the same proving time as regular cycles (i.e., non-zkVM-precompile cycles).
    - Doing full logic separation in different guest programs allows measuring wall-clock proving time.
- It can provide much more concrete data on how switching to another state tree and implementing bytecode chunkification could improve performance.

## High-level picture of proof separation

Let’s clarify with more precision how these guest programs are used for the experiment, so we have the right mental model of what is being proven in each.

### Monolithic proof

The *Monolithic proof* is just the usual guest program proving the whole STF (a code sketch follows the list below):

- Public inputs:
    - Parent block hash
    - Target block hash
    - Success (usually `true`, but the guest program could prove an invalid block, in which case it would be `false`)
- Private inputs:
    - Parent and target block
    - Pre-state sparse Merkle trie nodes (i.e., pre-state)
    - Bytecodes (required since code isn't part of the tree)
    - Ancestor headers (required for BLOCKHASH resolution)
- Assumptions: none.
- Claim:
    1. Verifies that all private inputs are correct, meaning:
        1. The pre-state trie nodes are valid, resulting in a trie root matching the parent block header's `state_root` field (i.e., the pre-state is valid).
        2. The ancestor headers are valid (i.e., they cover the required BLOCKHASH lookups and are chain-coherent).
        3. Bytecodes are hashed to check that they match the `code_hash` in the tree leaves used for execution.
    2. The block is executed using the validated input data.
    3. The EVM execution generates a state diff, which is applied to the (sparse) MPT, calculating the new `state_root`. This is checked against the block header, along with other block header fields (e.g., block gas used, etc.).
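In Rust-flavored pseudocode, the monolithic guest program's claim looks roughly like the sketch below. All helper names (`verify_pre_state`, `verify_ancestor_headers`, `verify_bytecode_hashes`, `execute_block`, `apply_state_diff`) and types are hypothetical stand-ins, not the actual API of the Reth-based guest:

```rust
// Illustrative pseudocode only; helper functions and types are hypothetical.
fn monolithic_guest(input: GuestInput) -> PublicOutputs {
    // 1.i: pre-state trie nodes must hash up to the parent header's state_root.
    let trie = verify_pre_state(&input.trie_nodes, input.parent_block.header.state_root);
    // 1.ii: ancestor headers must be chain-coherent (for BLOCKHASH resolution).
    verify_ancestor_headers(&input.ancestor_headers, &input.parent_block);
    // 1.iii: every bytecode must hash to the code_hash stored in the trie leaves.
    verify_bytecode_hashes(&input.bytecodes, &trie);

    // 2: execute the block on the validated pre-state, producing a state diff.
    let (state_diff, success) = execute_block(&input.target_block, &trie, &input.bytecodes);

    // 3: apply the diff, recompute the post-state root, and check the header claim.
    let post_root = apply_state_diff(trie, &state_diff);
    assert_eq!(post_root, input.target_block.header.state_root);

    PublicOutputs {
        parent_block_hash: input.parent_block.hash(),
        target_block_hash: input.target_block.hash(),
        success,
    }
}
```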
### Execution + Pre/Post state proofs

This setup has two different proofs, and thus two guest programs. Let’s briefly explain both first, and then explain how things tie together:

- *Execution-only* guest program:
    - Public inputs:
        - Parent block hash
        - Target block hash
        - Flat-db (pre-state) hash commitment
        - Post-execution state-diff hash commitment
        - Success
    - Private inputs:
        - Parent and target block
        - Flat-db (pre-state): flat key-value Ethereum state (i.e., no trie nodes)
        - Ancestor block hashes
        - Bytecodes
    - Assumptions:
        - The flat-db contents are valid.
        - The post-state root check is done externally.
        - The bytecodes are received as `(code_hash, bytecode)` pairs; it is assumed `code_hash` was correctly calculated.
    - Claim:
        1. Executes the block using the flat-db containing the pre-state.
        2. Commits to the generated post-execution state-diff and sets it as a public input.
- *Pre/Post-state check* guest program:
    - Public inputs:
        - Parent block hash
        - Target block hash
        - Flat-db (pre-state) hash commitment
        - Post-execution state-diff hash commitment
        - Success
    - Private inputs:
        - Parent and target blocks
        - Pre-state sparse Merkle trie nodes
        - Flat-db (pre-state)
        - Bytecodes
        - Post-execution state-diff
    - Assumptions:
        - The post-execution state-diff was generated correctly.
    - Claim:
        1. The pre-state sparse MPT nodes are valid against the parent `state_root` (the same pre-state verification as in the monolithic proof).
        2. Commits the flat-db and the post-state-diff received as private inputs to set them as public inputs.
        3. Every entry in the flat-db is verified against the (already verified) sparse MPT, i.e., each entry claimed in the flat-db is correct.
        4. Applies the post-execution state-diff to the sparse MPT to verify that the calculated post-state root matches the expected one.

Regarding each guest program's assumptions:

- The *Execution-only* guest program has three assumptions: i) the flat-db pre-state contains correct data, ii) the bytecode hashes are correct, and iii) somebody else outside this proof will verify the post-state root.
- The *Pre/Post-state check* guest program assumes that the post-execution state diff was generated correctly.

Notice that each guest program's *Claims* prove the assumptions of the other, e.g., the *Pre/Post-state check* guest verifies the post-state root, the bytecode hashing, and that the flat-db is correct against the parent root.

The downside of separating a monolithic proof into separate proofs is that each guest program has to commit to and check cross-guest-program inputs/outputs (e.g., the flat-db used in the *Execution-only* guest program has to be committed as a public input, so it can be connected to the one verified in the *Pre/Post-state check* guest program). Most of the overhead is the hashing needed to compute the public inputs for these cross-checks. In a monolithic proof this isn't required, since both workloads are part of the same proof. Although the expected total work in separate proofs is higher, those proofs could be generated in parallel.
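To make the cross-proof wiring concrete, here is a minimal sketch of the verifier-side glue. The struct layout and function name are hypothetical (the real commitments live in the modified Reth/zkevm-benchmark-workload branches); the point is that, after each zkVM proof verifies on its own, the shared public inputs of both proofs must also be checked for equality, otherwise neither proof's assumptions are discharged by the other:

```rust
/// Illustrative layout of the public inputs shared by both guest programs.
#[derive(PartialEq)]
struct SharedPublicInputs {
    parent_block_hash: [u8; 32],
    target_block_hash: [u8; 32],
    flat_db_commitment: [u8; 32],    // hash of the pre-state flat-db
    state_diff_commitment: [u8; 32], // hash of the post-execution state-diff
    success: bool,
}

/// Cross-check performed after each zkVM has already verified its own proof:
/// both proofs must refer to the same blocks, the same flat-db, and the same
/// post-execution state-diff for the pair to be equivalent to the monolithic proof.
fn cross_check(exec: &SharedPublicInputs, pre_post: &SharedPublicInputs) -> bool {
    exec == pre_post && exec.success
}
```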
# Proving results

A run was performed over 30 mainnet blocks (`[23743854, 23743883]`) to analyze the implications of the monolithic proof vs. the dual-proof style. The mainnet blocks weren’t specifically hand-picked.

For the sake of understanding things more deeply, we actually benchmark two variants:

- The guest programs as described in the last section.
- The same guest programs, but with all the flat-db and post-state-diff checks and commitments disabled.

The former basically benchmarks the proof separation with full correctness, since cross-proof validations must be done. The latter aims to remove these extra overheads to understand how much of the monolithic proof corresponds to pure EVM execution vs. keccak-heavy load (i.e., the pre-state check, the post-state root calculation, and bytecode hashing).

In this section we only provide wall-clock proving times, since they are the most reliable proxy for what we're interested in. Counting cycles can be a red herring since zkVM precompiles, such as keccak, might count as a single cycle even though proving those cycles is slower than proving *regular* cycles. Still, in the _Appendix_ section we show cycle results in case they are interesting to the reader.

The runs were done for two zkVMs, SP1 Turbo v5.2.1 and ZisK v0.13.0, so as not to bias the conclusions toward a single zkVM. Both were configured with full *zkVM precompiles* support. The host machine has an AMD EPYC-Milan processor, 128 GiB of RAM, and a single Nvidia L40S.

As a reference for the code used:

- Reth:
    - A [modified Reth](https://github.com/kevaundray/reth/pull/28) which defines the new reorganized STF split into the desired parts.
    - A [separate branch](https://github.com/jsign/reth/compare/jisng-flat-db...jsign:reth:jsign-redacted-cycles?expand=1) on top of the above for the no-checks variant.
- zkevm-benchmark-workload:
    - A [modified branch](https://github.com/eth-act/zkevm-benchmark-workload/compare/master...jsign:zkevm-benchmark-workload:jsign-exec-storage-separation?expand=1) that has the wiring of all new guest programs and witness generation.
    - A [separate branch](https://github.com/jsign/zkevm-benchmark-workload/compare/jsign-exec-storage-separation...jsign:zkevm-benchmark-workload:jsign-exec-storage-separation-no-checks?expand=1) on top of the bullet above for the no-checks variant.
- [Raw guest program inputs and run outputs](https://drive.google.com/file/d/1dLyoG51bpeDHrkCMWeG8jdLVxxyz0hIo/view?usp=sharing).

Reading the code is not required to follow the article -- it's only there if you want to see implementation details.

## With-checks results

Here we present the results for the guest programs with checks, i.e., the flat-db commitment, the state-diff commitment, the flat-db verification against the sparse MPT, etc. Recall that this is the full version, with full correctness, of the proof separation.

### SP1

#### Proving times

![image](https://hackmd.io/_uploads/r1Cu1ebeZg.png)

Notes:

- For SP1, the _Execution-only_ and _Pre/Post-state check_ guest programs take roughly the same time per block (12% difference).
- Separating the proofs introduces significant overhead from the cross-input checks.

#### Cycles

As said multiple times before, take cycles with a grain of salt.

![image](https://hackmd.io/_uploads/BJpY1g-lWg.png)
![image](https://hackmd.io/_uploads/SyU9JeWg-x.png)
![image](https://hackmd.io/_uploads/HyqcJlWx-g.png)
![image](https://hackmd.io/_uploads/SyRc1g-xWe.png)

Notes:

- For the _Pre/Post-state check_ guest program, reading the input and doing the public-input commitment of the flat-db and post-state-diff take a large number of cycles. This makes sense, since this is a lot of data to receive, parse, and hash for public inputs.
- Flat-db serialization and hashing (for the public input) takes a significant amount of time compared to post-state-diff serialization and hashing. This makes sense, since post-state diffs are much smaller than the pre-state needed for execution (see the sketch below).
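As a rough illustration of where that flat-db hashing cost comes from, below is a minimal sketch of a flat-db commitment, assuming a keccak over canonically ordered, length-prefixed `(key, value)` entries. The encoding and function name are hypothetical -- the actual commitment lives in the modified Reth branch -- but it shows why the cost scales with the total size of the pre-state rather than with the much smaller state-diff:

```rust
use sha3::{Digest, Keccak256};

/// Hypothetical flat-db commitment: keccak-hash every (key, value) entry in a
/// canonical (sorted) order, so both guest programs derive the same public input.
/// The cost is linear in total pre-state bytes, which is why this commitment
/// dominates the much smaller post-state-diff commitment.
fn commit_flat_db(mut entries: Vec<([u8; 32], Vec<u8>)>) -> [u8; 32] {
    entries.sort_by(|a, b| a.0.cmp(&b.0)); // canonical order
    let mut hasher = Keccak256::new();
    for (key, value) in &entries {
        hasher.update(key);
        hasher.update((value.len() as u64).to_le_bytes()); // length-prefix each value
        hasher.update(value);
    }
    hasher.finalize().into()
}
```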
### ZisK

#### Proving times

![image](https://hackmd.io/_uploads/rJsaJe-lWl.png)

Notes:

- ZisK's overall proving times (avg. 164s) are faster than SP1's (avg. 289s). Keeping this in mind is relevant when interpreting relative proving-time differences.
- The *Pre/Post-state check* proving effort is around 65% higher than _Execution-only_. This means the EVM-execution part is much more efficient in ZisK relative to the keccak-heavy workload of the _Pre/Post-state check_. This sounds reasonable since, at the limit, zkVM precompiles are hard to keep optimizing past some point.
- The previous bullet suggests that, for ZisK, the keccak-heavy workload dominates compared to EVM execution.
- As mentioned for SP1, since these guest programs have cross-proof checks enabled, there is still significant overhead in these numbers -- see the _Without-checks results_ below for that overhead removed.

#### Cycles

![image](https://hackmd.io/_uploads/SkP0keZgZx.png)

There is no support for regional cycle capturing.

## Without-checks results

Here we present the results for the guest programs **without checks**. This doesn't prove the correctness of the cross-proof claims, but focuses on removing that overhead to better understand the STF proving times attributable to EVM-only execution vs. keccak-heavy load (i.e., pre-state verification, post-state root calculation, and unchunkified bytecode hashing).

Note that these _Without-checks_ results don't include a _Full validation_ variant, since that variant is unaffected by the checks being on or off and was presented in the previous section.

### SP1

#### Proving times

![image](https://hackmd.io/_uploads/SJLyggbxZg.png)
![image](https://hackmd.io/_uploads/ryf4gx-lZg.png)

Notes:

- Removing the cross-proof check overhead results in SP1's _Pre/Post-state check_ workload taking less time than _Execution-only_ (0.79x).

![image](https://hackmd.io/_uploads/H1qdlxWgWe.png)

Notes:

- This confirms that cycles heavily related to keccak are more costly to prove than regular cycles (1.14x).

#### Cycles

![image](https://hackmd.io/_uploads/HJdFlxZxZl.png)
![image](https://hackmd.io/_uploads/H1pYgeWg-e.png)
![image](https://hackmd.io/_uploads/SJW5lxWlZe.png)

### ZisK

#### Proving times

![image](https://hackmd.io/_uploads/Sy2cxlbgbe.png)
![image](https://hackmd.io/_uploads/B14oelblbe.png)

Notes:

- Removing the cross-proof check overhead results in ZisK's _Execution-only_ being significantly faster than _Pre/Post-state check_, i.e., the keccak-heavy workload takes 1.69x more time than the execution-only workload.

![image](https://hackmd.io/_uploads/H1XnegWlWg.png)

Notes:

- These numbers confirm again that cycles related to keccak are more costly to prove than regular cycles (1.34x).

#### Cycles

There is no support for regional cycle capturing.

## With vs Without checks

These results are only useful to get a sense of the overhead that separating a monolithic proof into two proofs incurs from the required cross-proof checks. They are informational only -- disabling the checks isn't correct, and it is only useful for disaggregating monolithic proving times or for understanding the overhead of the separated proofs compared to the monolithic one. These overheads might still be optimizable, but they give a rough idea.
![image](https://hackmd.io/_uploads/SkUpegWlbg.png)
![image](https://hackmd.io/_uploads/ryTpggWxZl.png)

## Conclusions

The following are the highlights for each experiment:

- Guest program separation **with checks**:
    - Separating the guest programs induces significant overhead from the cross-proof checks, namely flat-db and post-state-diff serialization and hashing for public inputs.
    - Regarding the relationship between the proving times of the _Execution-only_ and _Pre/Post-state check_ guest programs:
        - For SP1, _Execution-only_ takes roughly the same time as _Pre/Post-state check_ (12% difference).
        - For ZisK, the _Pre/Post-state check_ guest program is 1.65x slower than _Execution-only_.
        - **Keep in mind that, since checks are enabled, these numbers include significant overhead.** See _Guest program separation **without checks**_ below for the same comparison with this overhead removed.
- Guest program separation **without checks**:
    - Removing the checks removes significant overhead compared with checks enabled (between 15% and 19%).
    - Regarding the relationship between the proving times of the _Execution-only_ and _Pre/Post-state check_ guest programs:
        - For SP1, the _Execution-only_ workload has a 1.27x higher proving time than _Pre/Post-state check_.
        - For ZisK, the _Pre/Post-state check_ guest program has a 1.34x higher proving time than _Execution-only_.
    - Both the SP1 and ZisK numbers on _Proving time per million cycles_ confirm that being guided by cycles is a red herring, since cycles containing keccak cost around 1.13x more for SP1 and 1.34x more for ZisK. (Note that these numbers are still intrinsically affected by input-reading overheads and other MPT business logic -- but they should be good approximations.)
    - Considering that ZisK has a lower average proving time per block than SP1, and that for ZisK the proving time of _Pre/Post-state check_ is 1.34x higher than _Execution-only_, it seems to follow that the keccak-heavy workload of the STF dominates EVM execution. For SP1 the opposite is true, but with a smaller difference.
    - This also shows how differently zkVMs are optimized, making it harder to reach decisive conclusions without broad coverage of them. Also note that this experiment used only Reth as the EL; this should have limited impact, since the amount of keccak work for the _Pre/Post-state check_ should be the same in any guest program (though EVM-execution efficiency could differ between ELs).