Taiko ZK-EVM: Overview and Optimizations

Taiko

Officially started in Q1 2022
I started contributing to the EF PSE ZK-EVM effort in Q3 2021
Several teammates came from Loopring
- Split because Loopring is an app-specific ZKR, while Taiko is general purpose
Around 20 people now

Taiko

Type 1 ZK-EVM
- Prioritize Ethereum equivalence, give up ZK-friendly optimizations
- Main benefit is smooth migration and reusability of all L1 dapps, tools, and infrastructure
- Main downside is slower proof generation
Decentralized
- Only relies on Ethereum public data, anyone can run a node
Permissionless
- Any proposer or prover can participate
Architecture
- ZK-EVM (community effort from PSE ZK-EVM circuits)
- L2 node
- Protocol

Taiko Protocol

Block proposers: submit a list of transactions to the rollup contract, appended to the chain
Block provers: generate proofs for blocks
Key properties:
- Block execution is deterministic after the block is proposed
  - Instant finality
  - Parallel proof generation
- Block on-chain finalized if block and all its predecessors proven (L2 -> Lx)
- Extra transaction injected at block proposing time
  - Used to bring the L1 block hash into an L2 smart contract (for bridging)
  - Store all L2 block hashes
  - Also used to verify some L2 data (no specialized circuit code is required)

KECCAK-256/SHA-256

Repetition of the same operations in N rounds on some state
After N rounds:
- More data to process: carry state over
- All data processed: load initial state

KECCAK-256/SHA-256

Not really efficient:
- Work on binary fields (vs native field)
- Large state
- Lots of binary operations
Two different approaches:
- Bit approach
- Packed approach (example: 2 bits/data bit)

KECCAK-256: s = a ^ ((~b) & c)


AND	b0*b1 (degree(b0) + degree(b1))
OR	1-(1-b0)*(1-b1) (degree(b0) + degree(b1))
XOR	b0+b1 - 2b0b1 (degree(b0) + degree(b1))
NOT	1 - b0 (degree(b0))

Bit approach:
- s = a + (1-b)*c - 2*a*(1-b)*c (degree 3)
- But, s = a ^ ((~b) & c) ⇔ s ^ a = ((~b) & c) => s*a-2*s*a = (1-b)*c (degree 2)
Packed approach:
- Could lookup (a, b, c, s), but needs 2^3=8 entries per bit
- Instead lookup (3 - 2*a + b - c, s) with lookup table [0, 1, 1, 0, 0], only 5 entries/bit
  - Quite easy to find best solution with brute force
  - Have to make sure there’s no overflow/underflow per bit
- Difference adds up, minimum circuit height for 8 bits
  - log2(8^8)=24 vs log2(5^8)=18.6

XOR

Bit approach:
- Custom gates
- Chaining many of these results in a high degree
Packed approach
- Just add them together, do & 1 using a lookup
Mixed approach
- Do custom gates up to some target degree
- Then do lookup on more efficient table

Bit rotation/shift

Bit approach:
- Free: just substitute directly in the constraint expression
Packed approach
- Split up packed representation in parts at the rotation point, merge back together
- Can delay merging/resplitting if the resulting parts are again needed in lookups, so split smart
- Example: 12345678abcd as the 12-bit word
  - rot 1: 123|4567|8abc|d -> d|123 | 4567 | 8abc
  - rot 2: 12|3456|78ab|cd -> cd|12 | 3456 | 78ab
  - rot 5: 123|4567|8abc|d -> 8abc | d|123 | 4567

Bit approach

Binary representation
Use custom gates
Great because:
- Simple
- Pretty good performance
- Many custom gates, but custom gates are cheap
- MSM with binary coefficients cheap (reduces to a conditional sum)
But:
- Requires a lot of cells (1 cell/bit)
- Have to be very careful about the expression degree (HyperPlonk?)
  - Efficient high degree circuits useful to minimize verification costs (fewer columns)
- Generally less flexible circuit layout wise

Packed approach

Packs multiple bits together (for keccak: 3 bits/bit)
More limited use of custom gates
- ADD/SUB/MUL(with constant)
- PACK/UNPACK/SPLIT/MERGE
Have to use lookups for most calculations
- Sometimes “free” when range check already needed
Great because:
- Efficient if lookups are efficient
- Generally uses fewer cells
- Very flexible circuit layout wise
But:
- Lookups currently not very efficient (Caulk?)
- Circuits generally more complex for these types of calculations

Circuit layout

Extremely important for performance
Bit approach:
- Most efficient generally to just do a round/row
- This requires a lot of custom gates per row and many columns
- Can split a round over multiple rows by disabling custom gates, but loses some efficiency
Packed approach:
- Easy to spread out lookups over multiple rows, using relative offsets to read results on every nth row
- Not custom gate heavy, only used as the glew, so no problem disabling on many rows without any significant loss of efficiency
- Example: a + b + c

SHA-256

for i from 0 to 63
    S1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    temp1 := h + S1 + ch + k[i] + w[i]
    S0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    temp2 := S0 + maj

    h := g
    g := f
    f := e
    e := d + temp1
    d := c
    c := b
    b := a
    a := temp1 + temp2

Additions are modulo 2^32
Use optimized versions of ch and maj to lower the degree:
- select(s, t, f) := s*t + (1-s)*f
- ch := select(e, f, g)
- maj := select(b^c, a, b);

SHA-256

SHA-256 circuit layout

Some extra bits to handle additions modulo 2^32
w pretty similar to the other state
Only w[i] required on each row

# copy chunk into first 16 words w[0..15] of the message schedule array
# Extend the first 16 words into the remaining 48 words w[16..63] of the array:
for i from 16 to 63
    s0 := (w[i-15] rightrotate  7) xor (w[i-15] rightrotate 18) xor (w[i-15] rightshift  3)
    s1 := (w[i-2] rightrotate 17) xor (w[i-2] rightrotate 19) xor (w[i-2] rightshift 10)
    w[i] := w[i-16] + s0 + w[i-7] + s1

Results

Hardware: c5.18xlarge
Software: PSE halo2 + different FFT
But
- CPU pretty old
- Circuit size has an impact on performance
  - KECCAK-256 (bit) (2^16) (16 cores): 14.8 hashes/s
  - Number of lookups required decreases the higher the circuit
- Not all optimizations enabled/implemented (not using asm field operations, not the best MSM, …)
- Circuit expression degree higher than needed
- Scaling pretty badly over multiple cores currently
- Take into account recursion? May want to prefer higher expression degree to reduce columns

Thanks!

Taiko

Website: http://taiko.xyz
GitHub: http://github.com/taikoxyz
Twitter: http://twitter.com/taikoxyz
More in depth hash impl docs: https://hackmd.io/NaTuIvmaQCybaOYgd-DG1Q
PSE zkevm-circuits: https://github.com/privacy-scaling-explorations/zkevm-circuits

Taiko ZK-EVM: Overview and Optimizations

KECCAK-256 and SHA-256