changed 3 years ago
Published Linked with GitHub

Taiko ZK-EVM: Overview and Optimizations

KECCAK-256 and SHA-256

Brecht Devos
brecht@taiko.xyz


Taiko

  • Officially started in Q1 2022
  • I started contributing to the EF PSE ZK-EVM effort in Q3 2021
  • Several teammates came from Loopring
    • Split because Loopring is an app-specific ZKR, while Taiko is general purpose
  • Around 20 people now

Taiko

  • Type 1 ZK-EVM
    • Prioritize Ethereum equivalence, give up ZK-friendly optimizations
    • Main benefit is smooth migration and reusability of all L1 dapps, tools, and infrastructure
    • Main downside is slower proof generation
  • Decentralized
    • Only relies on Ethereum public data, anyone can run a node
  • Permissionless
    • Any proposer or prover can participate
  • Architecture
    • ZK-EVM (community effort from PSE ZK-EVM circuits)
    • L2 node
    • Protocol

Taiko Protocol

  • Block proposers: submit a list of transactions to the rollup contract, appended to the chain
  • Block provers: generate proofs for blocks
  • Key properties:
    • Block execution is deterministic after the block is proposed
      • Instant finality
      • Parallel proof generation
    • Block on-chain finalized if block and all its predecessors proven (L2 -> Lx)
    • Extra transaction injected at block proposing time
      • Used to bring the L1 block hash into an L2 smart contract (for bridging)
      • Store all L2 block hashes
      • Also used to verify some L2 data (no specialized circuit code is required)

KECCAK-256/SHA-256

  • Repetition of the same operations in N rounds on some state

  • After N rounds:

    • More data to process: carry state over
    • All data processed: load initial state

KECCAK-256/SHA-256

  • Not really efficient:
    • Work on binary fields (vs native field)
    • Large state
    • Lots of binary operations
  • Two different approaches:
    • Bit approach
    • Packed approach (example: 2 bits/data bit)

KECCAK-256: s = a ^ ((~b) & c)

AND b0*b1 (degree(b0) + degree(b1))
OR 1-(1-b0)*(1-b1) (degree(b0) + degree(b1))
XOR b0+b1 - 2*b0*b1 (degree(b0) + degree(b1))
NOT 1 - b0 (degree(b0))
  • Bit approach:
    • s = a + (1-b)*c - 2*a*(1-b)*c (degree 3)
    • But, s = a ^ ((~b) & c) ⇔ s ^ a = ((~b) & c) => s*a-2*s*a = (1-b)*c (degree 2)
  • Packed approach:
    • Could lookup (a, b, c, s), but needs 2^3=8 entries per bit
    • Instead lookup (3 - 2*a + b - c, s) with lookup table [0, 1, 1, 0, 0], only 5 entries/bit
      • Quite easy to find best solution with brute force
      • Have to make sure there’s no overflow/underflow per bit
    • Difference adds up, minimum circuit height for 8 bits
      • log2(8^8)=24 vs log2(5^8)=18.6

XOR

  • Bit approach:

    • Custom gates
    • Chaining many of these results in a high degree
  • Packed approach

    • Just add them together, do & 1 using a lookup
  • Mixed approach

    • Do custom gates up to some target degree
    • Then do lookup on more efficient table

Bit rotation/shift

  • Bit approach:
    • Free: just substitute directly in the constraint expression
  • Packed approach
    • Split up packed representation in parts at the rotation point, merge back together
    • Can delay merging/resplitting if the resulting parts are again needed in lookups, so split smart
    • Example: 12345678abcd as the 12-bit word
      • rot 1: 123|4567|8abc|d -> d|123 | 4567 | 8abc
      • rot 2: 12|3456|78ab|cd -> cd|12 | 3456 | 78ab
      • rot 5: 123|4567|8abc|d -> 8abc | d|123 | 4567

Bit approach

  • Binary representation
  • Use custom gates
  • Great because:
    • Simple
    • Pretty good performance
    • Many custom gates, but custom gates are cheap
    • MSM with binary coefficients cheap (reduces to a conditional sum)
  • But:
    • Requires a lot of cells (1 cell/bit)
    • Have to be very careful about the expression degree (HyperPlonk?)
      • Efficient high degree circuits useful to minimize verification costs (fewer columns)
    • Generally less flexible circuit layout wise

Packed approach

  • Packs multiple bits together (for keccak: 3 bits/bit)
  • More limited use of custom gates
    • ADD/SUB/MUL(with constant)
    • PACK/UNPACK/SPLIT/MERGE
  • Have to use lookups for most calculations
    • Sometimes “free” when range check already needed
  • Great because:
    • Efficient if lookups are efficient
    • Generally uses fewer cells
    • Very flexible circuit layout wise
  • But:
    • Lookups currently not very efficient (Caulk?)
    • Circuits generally more complex for these types of calculations

Circuit layout

  • Extremely important for performance
  • Bit approach:
    • Most efficient generally to just do a round/row
    • This requires a lot of custom gates per row and many columns
    • Can split a round over multiple rows by disabling custom gates, but loses some efficiency
  • Packed approach:
    • Easy to spread out lookups over multiple rows, using relative offsets to read results on every nth row
    • Not custom gate heavy, only used as the glew, so no problem disabling on many rows without any significant loss of efficiency
    • Example: a + b + c

SHA-256

for i from 0 to 63
    S1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    temp1 := h + S1 + ch + k[i] + w[i]
    S0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    temp2 := S0 + maj

    h := g
    g := f
    f := e
    e := d + temp1
    d := c
    c := b
    b := a
    a := temp1 + temp2
  • Additions are modulo 2^32
  • Use optimized versions of ch and maj to lower the degree:
    • select(s, t, f) := s*t + (1-s)*f
    • ch := select(e, f, g)
    • maj := select(b^c, a, b);

SHA-256


SHA-256 circuit layout

  • Some extra bits to handle additions modulo 2^32
  • w pretty similar to the other state
  • Only w[i] required on each row
# copy chunk into first 16 words w[0..15] of the message schedule array
# Extend the first 16 words into the remaining 48 words w[16..63] of the array:
for i from 16 to 63
    s0 := (w[i-15] rightrotate  7) xor (w[i-15] rightrotate 18) xor (w[i-15] rightshift  3)
    s1 := (w[i-2] rightrotate 17) xor (w[i-2] rightrotate 19) xor (w[i-2] rightshift 10)
    w[i] := w[i-16] + s0 + w[i-7] + s1

Results

  • Hardware: c5.18xlarge
  • Software: PSE halo2 + different FFT
  • But
    • CPU pretty old
    • Circuit size has an impact on performance
      • KECCAK-256 (bit) (2^16) (16 cores): 14.8 hashes/s
      • Number of lookups required decreases the higher the circuit
    • Not all optimizations enabled/implemented (not using asm field operations, not the best MSM, …)
    • Circuit expression degree higher than needed
    • Scaling pretty badly over multiple cores currently
    • Take into account recursion? May want to prefer higher expression degree to reduce columns

Thanks!


Taiko

Select a repo