![](https://i.imgur.com/KcxSSBm.png =150x150) ### Taiko ZK-EVM: Overview and Optimizations #### KECCAK-256 and SHA-256 Brecht Devos \ <brecht@taiko.xyz> --- ### Taiko - Officially started in Q1 2022 - I started contributing to the EF PSE ZK-EVM effort in Q3 2021 - Several teammates came from Loopring - Split because Loopring is an app-specific ZKR, while Taiko is general purpose - Around 20 people now --- ### Taiko - Type 1 ZK-EVM - Prioritize Ethereum equivalence, give up ZK-friendly optimizations - Main benefit is smooth migration and reusability of all L1 dapps, tools, and infrastructure - Main downside is slower proof generation - Decentralized - Only relies on Ethereum public data, anyone can run a node - Permissionless - Any proposer or prover can participate - Architecture - ZK-EVM (community effort from PSE ZK-EVM circuits) - L2 node - Protocol --- ### Taiko Protocol ![](https://i.imgur.com/gokcQ7R.png) - Block proposers: submit a list of transactions to the rollup contract, appended to the chain - Block provers: generate proofs for blocks - Key properties: - Block execution is deterministic after the block is proposed - Instant finality - Parallel proof generation - Block on-chain finalized if block and all its predecessors proven (L2 -> Lx) - Extra transaction injected at block proposing time - Used to bring the L1 block hash into an L2 smart contract (for bridging) - Store all L2 block hashes - Also used to verify some L2 data (no specialized circuit code is required) --- ### KECCAK-256/SHA-256 - Repetition of the same operations in N rounds on some state ![](https://i.imgur.com/4JZO1KJ.png) - After N rounds: - More data to process: carry state over - All data processed: load initial state --- ### KECCAK-256/SHA-256 - Not really efficient: - Work on binary fields (vs native field) - Large state - Lots of binary operations - Two different approaches: - Bit approach ![](https://i.imgur.com/xXOrGSD.png) - Packed approach (example: 2 bits/data bit) ![](https://i.imgur.com/wJi4NSv.png) --- ### KECCAK-256: s = a ^ ((~b) & c) | | | | -------- |---------------------------------------------| | AND | b0\*b1 (degree(b0) + degree(b1)) | | OR | 1-(1-b0)\*(1-b1) (degree(b0) + degree(b1)) | | XOR | b0+b1 - 2\*b0\*b1 (degree(b0) + degree(b1)) | | NOT | 1 - b0 (degree(b0)) | - Bit approach: - s = a + (1-b)\*c - 2\*a\*(1-b)\*c (degree 3) - But, s = a ^ ((~b) & c) ⇔ s ^ a = ((~b) & c) => s\*a-2\*s\*a = (1-b)\*c (degree 2) - Packed approach: - Could lookup (a, b, c, s), but needs 2^3=8 entries per bit - Instead lookup (3 - 2\*a + b - c, s) with lookup table [0, 1, 1, 0, 0], only 5 entries/bit - Quite easy to find best solution with brute force - Have to make sure there’s no overflow/underflow per bit - Difference adds up, minimum circuit height for 8 bits - log2(8^8)=24 vs log2(5^8)=18.6 --- ### XOR - Bit approach: - Custom gates - Chaining many of these results in a high degree - Packed approach - Just add them together, do & 1 using a lookup - Mixed approach - Do custom gates up to some target degree - Then do lookup on more efficient table --- ### Bit rotation/shift - Bit approach: - Free: just substitute directly in the constraint expression - Packed approach - Split up packed representation in parts at the rotation point, merge back together - Can delay merging/resplitting if the resulting parts are again needed in lookups, so split smart - Example: `12345678abcd` as the 12-bit word - `rot 1: 123|4567|8abc|d -> d|123 | 4567 | 8abc` - `rot 2: 12|3456|78ab|cd -> cd|12 | 3456 | 78ab` - `rot 5: 123|4567|8abc|d -> 8abc | d|123 | 4567` --- ### Bit approach - Binary representation - Use custom gates - Great because: - Simple - Pretty good performance - Many custom gates, but custom gates are cheap - MSM with binary coefficients cheap (reduces to a conditional sum) - But: - Requires a lot of cells (1 cell/bit) - Have to be very careful about the expression degree (HyperPlonk?) - Efficient high degree circuits useful to minimize verification costs (fewer columns) - Generally less flexible circuit layout wise --- ### Packed approach - Packs multiple bits together (for keccak: 3 bits/bit) - More limited use of custom gates - ADD/SUB/MUL(with constant) - PACK/UNPACK/SPLIT/MERGE - Have to use lookups for most calculations - Sometimes “free” when range check already needed - Great because: - Efficient if lookups are efficient - Generally uses fewer cells - Very flexible circuit layout wise - But: - Lookups currently not very efficient (Caulk?) - Circuits generally more complex for these types of calculations --- ### Circuit layout - Extremely important for performance - Bit approach: - Most efficient generally to just do a round/row - This requires a lot of custom gates per row and many columns - Can split a round over multiple rows by disabling custom gates, but loses some efficiency - Packed approach: - Easy to spread out lookups over multiple rows, using relative offsets to read results on every nth row - Not custom gate heavy, only used as the glew, so no problem disabling on many rows without any significant loss of efficiency - Example: a + b + c ![](https://i.imgur.com/NpBG7IU.png) --- ### SHA-256 ``` for i from 0 to 63 S1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25) ch := (e and f) xor ((not e) and g) temp1 := h + S1 + ch + k[i] + w[i] S0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22) maj := (a and b) xor (a and c) xor (b and c) temp2 := S0 + maj h := g g := f f := e e := d + temp1 d := c c := b b := a a := temp1 + temp2 ``` - Additions are modulo 2^32 - Use optimized versions of `ch` and `maj` to lower the degree: - select(s, t, f) := s\*t + (1-s)\*f - ch := select(e, f, g) - maj := select(b^c, a, b); --- ### SHA-256 ![](https://i.imgur.com/okOFdqH.png) --- ### SHA-256 circuit layout ![](https://i.imgur.com/8O3kid3.png) - Some extra bits to handle additions modulo 2^32 - w pretty similar to the other state - Only w[i] required on each row ``` # copy chunk into first 16 words w[0..15] of the message schedule array # Extend the first 16 words into the remaining 48 words w[16..63] of the array: for i from 16 to 63 s0 := (w[i-15] rightrotate 7) xor (w[i-15] rightrotate 18) xor (w[i-15] rightshift 3) s1 := (w[i-2] rightrotate 17) xor (w[i-2] rightrotate 19) xor (w[i-2] rightshift 10) w[i] := w[i-16] + s0 + w[i-7] + s1 ``` --- ### Results ![](https://i.imgur.com/PJm1FdY.png) - Hardware: c5.18xlarge - Software: PSE halo2 + different FFT - But - CPU pretty old - Circuit size has an impact on performance - KECCAK-256 (bit) (2^16) (16 cores): 14.8 hashes/s - Number of lookups required decreases the higher the circuit - Not all optimizations enabled/implemented (not using asm field operations, not the best MSM, …) - Circuit expression degree higher than needed - Scaling pretty badly over multiple cores currently - Take into account recursion? May want to prefer higher expression degree to reduce columns --- ### Thanks! ![](https://i.imgur.com/KcxSSBm.png =150x150) Taiko - Website: <http://taiko.xyz> - GitHub: <http://github.com/taikoxyz> - Twitter: <http://twitter.com/taikoxyz> - More in depth hash impl docs: https://hackmd.io/NaTuIvmaQCybaOYgd-DG1Q - PSE zkevm-circuits: https://github.com/privacy-scaling-explorations/zkevm-circuits <style> .reveal { font-size: 24px; } </style>
{"metaMigratedAt":"2023-06-17T14:49:44.623Z","metaMigratedFrom":"YAML","title":"Taiko ZK-EVM: Overview and Optimizations","breaks":true,"description":"Taiko talk on overview and optimizations -- KECCAK-256 and SHA-256","contributors":"[{\"id\":\"2435332b-f6a2-41d6-a459-c558af9bc6e3\",\"add\":14548,\"del\":6410}]"}
    2243 views