Keccak-256 Halo2 circuit as zkWASM host circuit

> a technical overview of HyperOracle’s implementation of Keccak256 circuits as a host circuit on DelphinusLab zkWASM, in which host VM can leverage the host API circuit to prove that the witness of host API calls are valid. # Keccak-256 Halo2 circuit as zkWASM host circuit ## Specs | **Item** | **Spec** | **Note** | |---------- |---------- |---------- | | b | 1600 | | | r | 1088 | | | c | 512 | | | w | 64 | | | l | 6 | | | n_r | 24 | number of rounds | ### padding rule pad10∗1 rule *** ## Structs - use limb to represent a lane ### Limb ``` pub struct Limb<F: FieldExt> { cell: Cell<F>, value: F, } ``` ### KeccakState ``` pub struct KeccakState<F: FieldExt> { state: [[Limb<F>; 5]; 5] //limb is 64 bits default: [[Limb<F>; 5]; 5] } ``` ### Keccakchip ``` pub struct KeccakChip<F:FieldExt> { pub config: CommonGateConfig, pub spec: Spec<F, r>, keccak_state: KeccakState<F>, round: u64, _marker: PhantomData<F> } ``` ## Circuit Design A `KeccakChip` would be designed with the following steps: 1. **State Initialization:** Map the input bits into the `KeccakState` structure. 2. **Permutation:** Perform the five steps (θ, ρ, π, χ, ι) 24 times. 3. **Squeeze:** Take the first 256 bits from the state and treat this as the output of the function. ### Round function ``` // θ step let mut c = vec![]; // for x in 0..5 { let xor01 = let xor012 = let xor0123 = let xor01234 = c.push(xor01234); } let mut d = vec![]; for x in 0..5 { C[x-1] xor rot(C[x+1],1) // rotate one position to the left in bit representation } for x in 0..5 { A[x,y] = A[x,y] xor D[x], } } // ρ and π steps A[x,y] = B[x,y] xor ((not B[x+1,y]) and B[x+2,y]), for (x,y) in (0…4,0…4) // χ step A[0,0] = A[0,0] xor RC ``` ## Design Overview Keccak256 hash algorithm proceeds in iterations of keccak_f, squeezing out a new value per keccak_f. In Keccak-256, the capacity each round is 1088 bits, which equivalent to 17 Limbs of 64 bits. In Host VM, we use 'reset' flag to indicate the start of the hashing operation, so that we can take variable length input that is longer than the capacity of one round. The whole Keccak256 Host circuit contains three components: 1. A **WASM keccakHasher** that can update the inputs and proceeds in rounds of keccak_f to provide the correct hashing result for validating the witness of Keccak host API calls. 2. A **Halo2** implementation of **Keccak256 Host circuit**: 1. a customized gate for common host circuits 2. a specialized lookup table designed for the bit arithmetic and rotation operation for Keccak256 host circuit 3. A **keccak adaptor** that connect Keccak host circuit to Host VM by preparing host API calls to the Host API call table with corresponding opcodes. ## Keccak-256 host API ``` pub struct KeccakState<F: FieldExt> { state: [[Limb<F>; 5]; 5], default: [[Limb<F>; 5]; 5], rc: [Limb<F>; N_R], } ``` The state of Keccak-256 is a 5 x 5 matrix with each element being a 64 bit lane that's represented by Limb<F> in our circuit. ``` pub struct KeccakGateConfig { pub common: CommonGateConfig, pub arith: BitsArithConfig, } ``` KeccakGate contains a custom gate config and a lookup table config, details will be explained in later section ``` pub fn get_permute_result( &mut self, region: &mut Region<F>, offset: &mut usize, values: &[Limb<F>; RATE_LANES], reset: &Limb<F>, ) -> Result<[Limb<F>; 4], Error> ``` The function takes a Limb array of known fixed length, a reset indicator and return the first 4 Limb after the keccak permutation as the result of the round. The reset flag: - If the reset is true, the function will absorb the input array into a default empty state and perform keccak_f permutation, then produce the result - If the reset is false, the function will absorb the input array into the current state so the hashing will continue ``` pub fn decompose_bytes<F: FieldExt>( &self, region: &mut Region<F>, offset: &mut usize, limb: &Limb<F>, rotate: usize, //how limbs rotated towards high bits hint: u64, // lookup hints ) -> Result<(Limb<F>, [Limb<F>; 8]), Error> ``` This function serves as the foundation of this implementation. The function takes a limb of 64 bits, how many bits to rotate, and a lookup hint and decompose the limb into bytes cells, in little endian and return (a, [a0,a1,a2,a3,a4,a5,a6,a7]). For every 64 bits limb, the function will decompose into 8 byte cells which will be assign to 2 lines into circuit / region. The rotation bits will determine which set of 4 byte cells will be assigned into the same line. ### Keccak Gate Config #### Common Gate ``` customized_circuits!(CommonGateConfig, 2, 5, 13, 0); ``` The common gate has 2 rows, 5 advice columns, 13 fixed columns and 0 selector. ``` // a0 a1 a2 a3 // a4 a5 a6 a7 // b0 b1 b2 b3 // b4 b5 b6 b7 // c0 c1 c2 c3 // c4 c5 c6 c7 // (a0,b0,c0) in lookup_set ``` For example, the XOR operation will produce the above region in the circuit. a0...a7 being the limb of the first operand, b0...b7 being the limb of second operand, and c0...c7 being the rest limb. when configuring the custom gate, we register ``` a0 a1 a2 a3 b0 b1 b2 b3 c0 c1 c2 c3 lookup_hint ``` with lookup indicator. #### BitsArithConfig ``` customized_circuits!(BitsArithConfig, 1, 0, 4, 0, | lhs | rhs | res | op ); ``` The gate for lookup table has 1 row, 0 advice column, 4 fixed columns and 0 selector. ``` fn register<F: FieldExt>( &self, cs: &mut ConstraintSystem<F>, cols: impl Fn(&mut VirtualCells<F>) -> Vec<Expression<F>>, ) { for i in 0..4 { cs.lookup_any("check bits arith", |meta| { let lhs = self.get_expr(meta, BitsArithConfig::lhs()); let rhs = self.get_expr(meta, BitsArithConfig::rhs()); let op = self.get_expr(meta, BitsArithConfig::op()); let res = self.get_expr(meta, BitsArithConfig::res()); let icols = cols(meta); vec![ (icols[i].clone(), lhs), (icols[i + 4].clone(), rhs), (icols[i + 8].clone(), res), (icols[12].clone(), op), ] }); } ``` BitsArithConfig will register (a0, b0, c0), (a1, b1, c1)... as one set and assign opcode accordingly. ### Lookup Table Specs * **BIT_XOR**: u8 = 1; * **BIT_AND**: u8 = 2; * **BIT_NOT_AND**: u8 = 3; * **BIT_ROTATE_LEFT**: u8 = 4; // 4 + 7, max 11 ---- total 2^4 * **BIT_ROTATE_RIGHT**: u8 = 12; // 12 + 7, max 21 -- total 2^5 BIT_ROTATE_LEFT is set to 4 initially. 5 means rotate left by 1 bit, 6 means rotate left by 2 bits and so on, and the same logic for BIT_ROTATE_RIGHT. ### Initiation of the lookup table In bitArith table, (lhs, rhs, res, opcode) is registered as a set. - For XOR, AND, NOT_AND Basically, the bitArith table is initialized in a nested loop fashion. For every possible value of a byte, perform arithmetic operation correspond to the OPCODE. - For ROTATE For every rotation bit number from 0 to 7, The bitArith table is initialized with every possible value. e.g opcode = 6, the bitArith table is filled with every 8 bits value that's rotated left by 2 bits. ### Example [testing logic] ``` #[test] fn generate_keccak_input_single() { let table = hash_to_host_call_table(vec![[ Fr::one(), Fr::one(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::zero(), Fr::from(1u64 << 63), ]]); let file = File::create("keccak256_test.json").expect("can not create file"); serde_json::to_writer_pretty(file, &table).expect("can not write to file"); } ``` 1. we passed the input after padding to the hash_to_host_call_table function to generate the json file. 2. create output and params folder under root 3. run test ``` cargo test generate_keccak RUST_BACKTRACE=1 cargo run --release -- features cuda -- -k 22 --input keccak256_test.json --opname keccakhash -- output output --param params RUST_BACKTRACE=1 cargo run --release -- features cuda -- -k 22 --input keccak256_test_multi.json --opname keccakhash -- output output --param params ``` 4. the prover will check each proofs and assert the host API call traces are the same both in the guest circuit and in the host circuit. 5. Proof generated for keccak host circuit ## Integrating zkWASM [keccakadaptor] - the host API call traces should be the same both in the guest circuit and in the host circuit ``` fn hash_to_host_call_table(inputs: &[Fr; 17], result: &[Fr; 4]) -> ExternalHostCallEntryTable { let mut r = vec![]; r.push(hash_cont(true)); for f in inputs.iter() { r.push(crate::adaptor::fr_to_args(*f, 1, 64, Keccak256Push)); } for f in result.iter() { r.push(crate::adaptor::fr_to_args(*f, 1, 64, Keccak256Finalize)); } ExternalHostCallEntryTable(r.into_iter().flatten().collect()) } ``` **Fn hash_to_host_call_table**: log the call trace to the shared call table The function takes a vector of ***field element array*** with fixed length of 17 as input, within which each element equal to 1 round of 1088 bits. For batch round, it start with push a start flag to the table, then set the flag to false for the following round. Inside the round, use fr_to_args to transform each Fr element to a vector of table entry and log into the table. At the end of each round, push the 4 resulting Fr with Foreign Instruction Keccak Finalize to the table. At the end of the batch, flatten the vectors produced by fr_to_args. ## Optimization Optimization mainly has two direction: - Optimize and balance the circuit design with columns and rows to reduce the cost of circuit - Rust (general efficiency) ### Decompose Bytes > v1 Implementation During the first iteration, the lookup table is implemented similar to the logic of *range check*. In the Keccak circuit, 1. lhs, rhs, and the res limb was encoded using linear algebra 2. lhs, rhs, and the encoded limb was passed into the lookup table 3. decompose the limbs into 8 bits with the help of accumulator and size to constrain that the sum of decomposed component is equal to the orginal value 4. Perform lookup to check if (lhs, rhs, encoded limb) is in the lookup table The approach is not a good design choice in the following ways: 1. the bitArith table has too many columns 2. range check logic doesn't fit here. The logic of range check is to slash the desired size from the original limb and check the slashed element is within the desired range. It also use the accumulator and size factor to control how many slash operation is performed, so that if the limb checked is not within the desired range, the remaining element can't be find in the lookup table 3. decompose the limb in the arith table and putting constraints, and we don't have needs to check the range of the limb because the limb length is known >v2 Implementation ``` /// return (a, [a0,a1,a2,a3,a4,a5,a6,a7]) pub fn decompose_bytes<F: FieldExt>( &self, region: &mut Region<F>, offset: &mut usize, limb: &Limb<F>, ) -> Result<(Limb<F>, [Limb<F>;8]), Error> ``` 1. decompose the 64 bits limb to 8 byte cells in the keccak circuit, instead of inside lookup table 2. **Row reduction** decompose bytes just use 2 rows, while decompose limb (decompose 64 bit limb into 64 limbs of 1 bit) use 16 rows 3. rotation operation is dealt sperately 4. ?? why return the original Limb, c[4]?? for sinka > v3 implementation ``` pub fn decompose_bytes<F: FieldExt>( &self, region: &mut Region<F>, offset: &mut usize, limb: &Limb<F>, rotate: usize, //how limbs rotated towards high bits hint: u64, // lookup hints ) -> Result<(Limb<F>, [Limb<F>; 8]), Error> ``` this iteration is further optimized: 1. the function takes additional ***rotate*** and ***hint*** argument 2. squeeze the rotation logic to **save rows (assign_witness)** ### ROTATE_LEFT / RIGHT > v1 implementation 1. first iteration use decompose_limb function 2. convert to array and call rotate_left function 3. operate and constrain bitwisely 4. reconstruct the first rotated limb and second rotated limb and assert equal > v2 implementaion ``` pub fn rotate_left( &self, config: &CommonGateConfig, region: &mut Region<F>, offset: &mut usize, input: &Limb<F>, n: usize, ) -> Result<Limb<F>, Error> { let v = field_to_u64(&input.value).rotate_left(n as u32); let chunk = n / 8; // how many chunks we have to move let rem = n % 8; // how many bits we have to move let (_, bytes) = config.decompose_bytes(region, offset, input)?; let (v, vbytes) = config.decompose_bytes(region, offset, &Limb::new(None, F::from(v)))?; for i in 0..8 { let current = bytes[(i+chunk) % 8].clone(); let next = bytes[(i+chunk+1) % 8].clone(); config.assign_witness( region, &mut (), offset, [ Some(current), Some(next), Some(vbytes[(i + chunk)%8].clone()), None, None, ], (BIT_ROTATE_LEFT as usize + rem) as u64, )?; } Ok(v) } ``` 1. convert passed field element to u64 with field_to_u64 2. call rotate_left on u64, instead of the limb array in the v1 implementation 3. use decompose_byte to decompose rotated limb and the input limb 4. For every byte cell, assign itself, the affected byte cell and the rotated byte to the circuit 5. 8 rows per rotate operation because the byte cells array is length of 8 >v3 implementaion 1. rotation logic blends into decompose_byte function 2. in the rotation_left function, only assign 8 bytes cell into 2 rows (**reduce 6 rows per rotate operation**) ``` lookup_assist_config.register(cs, |c| { vec![ config.get_expr(c, CommonGateConfig::l0()) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l1()) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l2()) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l3()) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l0().next(2)) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l1().next(2)) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l2().next(2)) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l3().next(2)) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l0().next(4)) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l1().next(4)) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l2().next(4)) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::l3().next(4)) * config.get_expr(c, CommonGateConfig::lookup_ind()), config.get_expr(c, CommonGateConfig::lookup_hint()) * config.get_expr(c, CommonGateConfig::lookup_ind()), ] ``` **NOTE:** because of the above register logic, ROTATE_LEFT is 1. decompose the input limb (to be rotated) 2. assign witness -- **total 2 rows** - 7, 0, 1, 2 - 3, 4, 5, 6 3. decompose the rotated limb (res limb) **NOTE:** - chunk: chunk is passed into *decompose_byte* - rem: rem is used in *rotate_left* ### Xi and XOR_NOT_AND **NOT operation** > v1 implementation - decompose the target Limb into [Limb<F>;64], with each Limb<F> being just 1 bit - use field_to_u64 to transform every 1 bit Limb back to u64 (large overhead) - perform not operation on the u64 array - reconstruct the Limb back as not_limb - constrain reconstructed not_limb to be equal to the !limb.value **XOR** > v1 implementation - create a Limb object with the computation result from built-in function ^. - decompose lhs, rhs, and res into [Limb<F>;64], with each Limb<F> being just 1 bit - use field_to_u64 to transform every 1 bit Limb back to u64 (large overhead) - reconstruct the Limb back from the array of u64 - encode the lhs, rhs and res as one field element then use lookup table to constrain > v1 implementation treat NOT and XOR individually. First constrain the validity of NOT operation then XOR it with the third limb. For NOT operation, decompose_limb takes 16 rows and constrain the validity cost 1 more row. In v1, there's no constrain for NOT_AND **NOT_XOR_AND** Combine NOT, AND for optimization opportunity >v1 implementation ``` pub fn xor_not_and( &self, config: &CommonGateConfig, region: &mut Region<F>, offset: &mut usize, a: &Limb<F>, b: &Limb<F>, c: &Limb<F>, ) -> Result<Limb<F>, Error> ``` - calculate one binary arithmetic at a time - d = (!b)& c - e = a ^ d - apply decompose_byte on **a, b, c, d, e** - zip **b_limbs,c_limbs,d_limbs** array - For each set of (**b**_limb,**c**_limb,**d**_limb), assign (b_limb,c_limb,d_limb,op=BIT_NOT_AND), **8 rows** - For each set of (**a**_limb,**d**_limb,**e**_limb), assign (a_limb,d_limb,e_limb,op=BIT_XOR), **8 rows** > v2 implementation - **remove the 8 + 8 rows per XOR_NOT_AND operation** because lookup_hint is blended into decompose_byte function - **Further reduce the row count by combine operations** every simple operation like XOR and NOT_AND cost 6 rows, but XOR_NOT_AND can save 2 rows (1 decompose_byte) because the result of NOT_AND can be used as the input of XOR ## ***Misc***: - use for most precise type to represent the data, i.e use u64 for lane than Fr ### Rotation Constant - All the constant must be assigned into circuit - use register function in lookup assist config to form gate shape you want ### Fix result into 4 Fr - The hash result of Keccak256 is the first 256 bits of the state after permutation. Instead of providing a single hash digest that is composed of the first 4 limb ( 4 * 64 = 256 ) as the hash result, the composed value exceeds the range of BN254. Thus, the final implementation returns 4 Limbs<Fr> instead. ## Potential optimization opportunity ### Row Usage Analysis - per arithmetic operation : 3 * 2 = 6 rows - XOR_NOT_AND : 5 * 2 = 10 rows - KeccakState.initialize - assign zero : 1 row - assign rc : 24 rows - KeccakChip.get_permute_result : - select : 1 row * 5 = 5 rows - XOR : 24 * 6 = 144 rows - KeccakState.permute = 736 * 24 rounds = 17664 - theta : 330 rows - 5 * 4 XOR * 6 = 120 rows - 5 ROTATE_LEFT * 6 = 30 rows - 5 XOR * 6 = 30 rows - 5 * 5 XOR * 6 = 150 rows - rho : 150 rows - 5 * 5 ROTATE_LEFT * 6 = 150 rows - pi : - 0 row - xi : 250 rows - 5 * 5 XOR_NOT_AND * 10 = 250 rows - iota : 6 rows - 1 XOR * 6 = 6 rows - per round: 736 rows > total 17664 + 5 + 144 + 1 + 24 = 17837 rows K15 almost K14 **Goal**: reduce the row count to 12000ish (K14), that's is 200ish row reduction per round - [x] 30 reduction in theta - [ ] - ##### explore direction: - reduce row usage for operation - reuse duplicate limb / avoid redundent decomposation - In theta, to compute di, the rhs of the XOR operation is always the rotated next c. Maybe we can reuse the result - Also, when computing di, we XOR c[prev(x)] which is decomposed already - if it's possible to have different registration logic in the same circuit - combine consecutive XOR in theta, for 0..5, 0 xor 1 xor 2 xor 3 xor 4, 24 -> 18 save 6 rows each iteration, 30 rows in total - streamline the calculation of D[x], by further combine the above mentioned consecutive xor operations and rotate_left - reduce number of in-circuit operation - only constraint the start and end, streamline the operation