![](https://i.imgur.com/KcxSSBm.png =150x150)
### Taiko ZK-EVM: Overview and Optimizations
#### KECCAK-256 and SHA-256
Brecht Devos \
<brecht@taiko.xyz>
---
### Taiko
- Officially started in Q1 2022
- I started contributing to the EF PSE ZK-EVM effort in Q3 2021
- Several teammates came from Loopring
- Split because Loopring is an app-specific ZKR, while Taiko is general purpose
- Around 20 people now
---
### Taiko
- Type 1 ZK-EVM
- Prioritize Ethereum equivalence, give up ZK-friendly optimizations
- Main benefit is smooth migration and reusability of all L1 dapps, tools, and infrastructure
- Main downside is slower proof generation
- Decentralized
- Only relies on Ethereum public data, anyone can run a node
- Permissionless
- Any proposer or prover can participate
- Architecture
- ZK-EVM (community effort from PSE ZK-EVM circuits)
- L2 node
- Protocol
---
### Taiko Protocol
![](https://i.imgur.com/gokcQ7R.png)
- Block proposers: submit a list of transactions to the rollup contract, appended to the chain
- Block provers: generate proofs for blocks
- Key properties:
- Block execution is deterministic after the block is proposed
- Instant finality
- Parallel proof generation
- Block on-chain finalized if block and all its predecessors proven (L2 -> Lx)
- Extra transaction injected at block proposing time
- Used to bring the L1 block hash into an L2 smart contract (for bridging)
- Store all L2 block hashes
- Also used to verify some L2 data (no specialized circuit code is required)
---
### KECCAK-256/SHA-256
- Repetition of the same operations in N rounds on some state
![](https://i.imgur.com/4JZO1KJ.png)
- After N rounds:
- More data to process: carry state over
- All data processed: load initial state
---
### KECCAK-256/SHA-256
- Not really efficient:
- Work on binary fields (vs native field)
- Large state
- Lots of binary operations
- Two different approaches:
- Bit approach
![](https://i.imgur.com/xXOrGSD.png)
- Packed approach (example: 2 bits/data bit)
![](https://i.imgur.com/wJi4NSv.png)
---
### KECCAK-256: s = a ^ ((~b) & c)
| | |
| -------- |---------------------------------------------|
| AND | b0\*b1 (degree(b0) + degree(b1)) |
| OR | 1-(1-b0)\*(1-b1) (degree(b0) + degree(b1)) |
| XOR | b0+b1 - 2\*b0\*b1 (degree(b0) + degree(b1)) |
| NOT | 1 - b0 (degree(b0)) |
- Bit approach:
- s = a + (1-b)\*c - 2\*a\*(1-b)\*c (degree 3)
- But, s = a ^ ((~b) & c) ⇔ s ^ a = ((~b) & c) => s\*a-2\*s\*a = (1-b)\*c (degree 2)
- Packed approach:
- Could lookup (a, b, c, s), but needs 2^3=8 entries per bit
- Instead lookup (3 - 2\*a + b - c, s) with lookup table [0, 1, 1, 0, 0], only 5 entries/bit
- Quite easy to find best solution with brute force
- Have to make sure there’s no overflow/underflow per bit
- Difference adds up, minimum circuit height for 8 bits
- log2(8^8)=24 vs log2(5^8)=18.6
---
### XOR
- Bit approach:
- Custom gates
- Chaining many of these results in a high degree
- Packed approach
- Just add them together, do & 1 using a lookup
- Mixed approach
- Do custom gates up to some target degree
- Then do lookup on more efficient table
---
### Bit rotation/shift
- Bit approach:
- Free: just substitute directly in the constraint expression
- Packed approach
- Split up packed representation in parts at the rotation point, merge back together
- Can delay merging/resplitting if the resulting parts are again needed in lookups, so split smart
- Example: `12345678abcd` as the 12-bit word
- `rot 1: 123|4567|8abc|d -> d|123 | 4567 | 8abc`
- `rot 2: 12|3456|78ab|cd -> cd|12 | 3456 | 78ab`
- `rot 5: 123|4567|8abc|d -> 8abc | d|123 | 4567`
---
### Bit approach
- Binary representation
- Use custom gates
- Great because:
- Simple
- Pretty good performance
- Many custom gates, but custom gates are cheap
- MSM with binary coefficients cheap (reduces to a conditional sum)
- But:
- Requires a lot of cells (1 cell/bit)
- Have to be very careful about the expression degree (HyperPlonk?)
- Efficient high degree circuits useful to minimize verification costs (fewer columns)
- Generally less flexible circuit layout wise
---
### Packed approach
- Packs multiple bits together (for keccak: 3 bits/bit)
- More limited use of custom gates
- ADD/SUB/MUL(with constant)
- PACK/UNPACK/SPLIT/MERGE
- Have to use lookups for most calculations
- Sometimes “free” when range check already needed
- Great because:
- Efficient if lookups are efficient
- Generally uses fewer cells
- Very flexible circuit layout wise
- But:
- Lookups currently not very efficient (Caulk?)
- Circuits generally more complex for these types of calculations
---
### Circuit layout
- Extremely important for performance
- Bit approach:
- Most efficient generally to just do a round/row
- This requires a lot of custom gates per row and many columns
- Can split a round over multiple rows by disabling custom gates, but loses some efficiency
- Packed approach:
- Easy to spread out lookups over multiple rows, using relative offsets to read results on every nth row
- Not custom gate heavy, only used as the glew, so no problem disabling on many rows without any significant loss of efficiency
- Example: a + b + c
![](https://i.imgur.com/NpBG7IU.png)
---
### SHA-256
```
for i from 0 to 63
S1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
ch := (e and f) xor ((not e) and g)
temp1 := h + S1 + ch + k[i] + w[i]
S0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
maj := (a and b) xor (a and c) xor (b and c)
temp2 := S0 + maj
h := g
g := f
f := e
e := d + temp1
d := c
c := b
b := a
a := temp1 + temp2
```
- Additions are modulo 2^32
- Use optimized versions of `ch` and `maj` to lower the degree:
- select(s, t, f) := s\*t + (1-s)\*f
- ch := select(e, f, g)
- maj := select(b^c, a, b);
---
### SHA-256
![](https://i.imgur.com/okOFdqH.png)
---
### SHA-256 circuit layout
![](https://i.imgur.com/8O3kid3.png)
- Some extra bits to handle additions modulo 2^32
- w pretty similar to the other state
- Only w[i] required on each row
```
# copy chunk into first 16 words w[0..15] of the message schedule array
# Extend the first 16 words into the remaining 48 words w[16..63] of the array:
for i from 16 to 63
s0 := (w[i-15] rightrotate 7) xor (w[i-15] rightrotate 18) xor (w[i-15] rightshift 3)
s1 := (w[i-2] rightrotate 17) xor (w[i-2] rightrotate 19) xor (w[i-2] rightshift 10)
w[i] := w[i-16] + s0 + w[i-7] + s1
```
---
### Results
![](https://i.imgur.com/PJm1FdY.png)
- Hardware: c5.18xlarge
- Software: PSE halo2 + different FFT
- But
- CPU pretty old
- Circuit size has an impact on performance
- KECCAK-256 (bit) (2^16) (16 cores): 14.8 hashes/s
- Number of lookups required decreases the higher the circuit
- Not all optimizations enabled/implemented (not using asm field operations, not the best MSM, …)
- Circuit expression degree higher than needed
- Scaling pretty badly over multiple cores currently
- Take into account recursion? May want to prefer higher expression degree to reduce columns
---
### Thanks!
![](https://i.imgur.com/KcxSSBm.png =150x150)
Taiko
- Website: <http://taiko.xyz>
- GitHub: <http://github.com/taikoxyz>
- Twitter: <http://twitter.com/taikoxyz>
- More in depth hash impl docs: https://hackmd.io/NaTuIvmaQCybaOYgd-DG1Q
- PSE zkevm-circuits: https://github.com/privacy-scaling-explorations/zkevm-circuits
<style>
.reveal {
font-size: 24px;
}
</style>

{"title":"Taiko ZK-EVM: Overview and Optimizations","description":"Taiko talk on overview and optimizations -- KECCAK-256 and SHA-256"}